Summary
OntoEKG is an LLM-driven pipeline by Oyewale & Soru (Liber AI Research) that generates domain-specific RDF/OWL ontologies for enterprise knowledge graphs (EKGs) directly from unstructured enterprise text. It decomposes ontology modelling into two phases: an extraction module that identifies core classes and properties, and an entailment module that logically structures those classes into a hierarchy before RDF serialisation. Evaluated on a new three-sector dataset (Data, Finance, Logistics), it reaches a fuzzy-match F1 of 0.724 in the Data domain but reveals clear limitations in scope definition and hierarchical reasoning. The paper doubles as a call to action for a comprehensive end-to-end ontology-construction benchmark.
Key Contributions
- OntoEKG pipeline — a two-step LLM process (extraction then entailment) that turns unstructured enterprise text into a formal RDF ontology serialised to Turtle.
- A call for a benchmark — argues existing benchmarks (OntoURL, Text2KGBench, OSKGC, LLMs4OL) do not support end-to-end ontology construction from unstructured text, and urges the community to build one.
- A new evaluation dataset — three enterprise policy-text use cases in the Data, Finance, and Logistics sectors (released in the OntoEKG GitHub repo).
Methodology and Architecture
Formalisation: from text T, infer classes C^T and properties P^T. Each class has a label and description; each property has a label, a domain class, and a range class. Classes form a hierarchy (c1 ⊆ c2). In RDF, classes are owl:Class, properties are owl:ObjectProperty, and datatypes are reified into their own classes (Schema.org style).
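The formalisation above can be pictured as a small data model. The following is a minimal sketch, not the authors' code: it uses standard-library dataclasses as a stand-in for the Pydantic models mentioned in the pipeline, and the names `OntClass`, `OntProperty`, `Ontology`, and `undeclared_references`, along with the example labels, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntClass:
    """An element of C^T: a class with a label and a description."""
    label: str
    description: str


@dataclass(frozen=True)
class OntProperty:
    """An element of P^T: a property with a domain class and a range class."""
    label: str
    domain_class: str  # label of the domain class
    range_class: str   # label of the range class


@dataclass
class Ontology:
    classes: list[OntClass] = field(default_factory=list)
    properties: list[OntProperty] = field(default_factory=list)
    # (child, parent) pairs, read as "child is a subclass of parent"
    subclass_of: set[tuple[str, str]] = field(default_factory=set)

    def undeclared_references(self) -> set[str]:
        """Return class labels that are used somewhere but never declared."""
        declared = {c.label for c in self.classes}
        used = {p.domain_class for p in self.properties}
        used |= {p.range_class for p in self.properties}
        for child, parent in self.subclass_of:
            used |= {child, parent}
        return used - declared


# Hypothetical example values, chosen only for illustration.
example = Ontology(
    classes=[
        OntClass("Policy", "A rule the organisation commits to."),
        OntClass("Dataset", "A managed collection of records."),
    ],
    properties=[OntProperty("appliesTo", "Policy", "Dataset")],
    subclass_of={("RetentionPolicy", "Policy")},
)
```

Here `example.undeclared_references()` flags `RetentionPolicy`, which appears in the hierarchy without a class declaration. A check of this kind is one possible way to keep domains, ranges, and hierarchy edges tied to the extracted class set.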
Four-stage pipeline:
- Data Ingestion — unstructured text in; Pydantic data models force valid JSON output (classes, properties, descriptions, domains, ranges).
- Ontological Element Extraction — an extraction LLM with a specialised system prompt identifies Classes (entity types) and Properties (relationships), constrained to the provided schema.
- Hierarchy Construction with Entailment — an entailment LLM iteratively reasons over class descriptions to infer subclass/inheritance relationships and build the taxonomy.
- RDF Serialisation — properties + hierarchy merged and written to Turtle via rdflib.
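The four stages can be sketched end to end as follows. This is an illustrative skeleton under stated assumptions: the two LLM calls are replaced by hypothetical stub functions (`fake_extract`, `fake_entail`), the dictionary layout is assumed rather than taken from the paper, and Turtle is written with plain string formatting instead of rdflib so the example runs with the standard library alone. The `ex:` namespace is a placeholder, and labels are assumed to be valid local names (no spaces).

```python
import json

PREFIXES = (
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix ex: <http://example.org/onto#> .\n"
)


def to_turtle(extracted, hierarchy):
    """Stage 4: merge extracted elements and hierarchy into Turtle text."""
    lines = [PREFIXES]
    for cls in extracted["classes"]:
        # json.dumps is used as an approximation of Turtle string escaping.
        lines.append(
            f"ex:{cls['label']} a owl:Class ;\n"
            f"    rdfs:comment {json.dumps(cls['description'])} ."
        )
    for prop in extracted["properties"]:
        lines.append(
            f"ex:{prop['label']} a owl:ObjectProperty ;\n"
            f"    rdfs:domain ex:{prop['domain']} ;\n"
            f"    rdfs:range ex:{prop['range']} ."
        )
    for child, parent in hierarchy:
        lines.append(f"ex:{child} rdfs:subClassOf ex:{parent} .")
    return "\n".join(lines) + "\n"


def run_pipeline(text, extract, entail):
    """Stages 1-4: ingest text, extract elements, build hierarchy, serialise."""
    extracted = extract(text)                 # stage 2 (extraction LLM in the paper)
    hierarchy = entail(extracted["classes"])  # stage 3 (entailment LLM in the paper)
    return to_turtle(extracted, hierarchy)


def fake_extract(text):
    """Hypothetical stand-in for the extraction LLM."""
    return {
        "classes": [
            {"label": "Policy", "description": "A rule the organisation commits to."},
            {"label": "RetentionPolicy", "description": "A policy on how long data is kept."},
            {"label": "Dataset", "description": "A managed collection of records."},
        ],
        "properties": [{"label": "appliesTo", "domain": "Policy", "range": "Dataset"}],
    }


def fake_entail(classes):
    """Hypothetical stand-in for the entailment LLM."""
    return [("RetentionPolicy", "Policy")]


turtle = run_pipeline("Retention policies apply to datasets.", fake_extract, fake_entail)
```

Passing the two LLM steps in as callables keeps the sketch independent of any particular model API; in the actual pipeline the structured output is constrained by Pydantic models and the graph is written with rdflib.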
Models: Extraction = Google Gemini 3 Flash (preview); Entailment = Anthropic Claude 4.5 Opus. Run on Google Colab. Other entailment candidates (Gemini 2.5 Flash/Pro, 3 Flash preview, Claude 4.5 Sonnet) were tried but underperformed; Gemini 2.5 Pro was dropped for efficiency.
Results
Two matching schemes, exact and fuzzy, were applied to the three use cases (fuzzy match = embedding-based triple alignment, thresholds 0.94/0.94/0.95):
| Use case | Exact F1 | Fuzzy F1 |
|---|---|---|
| Data | 0.102 | 0.724 |
| Finance | 0.000 | 0.121 |
| Logistics | 0.048 | 0.431 |
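To make the two scores concrete, here is a minimal sketch of triple-level precision, recall, and F1. The greedy one-to-one alignment and the pluggable `similarity` function are assumptions for illustration; the summary only says that the fuzzy scheme aligns triples by embedding similarity above a threshold. The function names and the example triples are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def match_scores(predicted, gold, similarity, threshold):
    """Greedily align each predicted triple to its best unmatched gold triple."""
    unmatched = list(gold)
    hits = 0
    for p in predicted:
        best = max(unmatched, key=lambda g: similarity(p, g), default=None)
        if best is not None and similarity(p, best) >= threshold:
            unmatched.remove(best)  # each gold triple may be matched only once
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


def exact(a, b):
    """Exact matching is the special case of a 0/1 similarity."""
    return 1.0 if a == b else 0.0


# Hypothetical triples, for illustration only.
predicted = [("Policy", "appliesTo", "Dataset"), ("Dataset", "ownedBy", "Team")]
gold = [
    ("Policy", "appliesTo", "Dataset"),
    ("Dataset", "hasOwner", "Team"),
    ("Team", "partOf", "Department"),
]

p, r, f = match_scores(predicted, gold, exact, threshold=1.0)
```

With exact matching only one of the two predicted triples aligns, so precision is 1/2, recall is 1/3, and F1 is 0.4. A fuzzy variant would swap `exact` for a similarity in [0, 1], such as cosine similarity between triple embeddings, and lower the threshold. As a consistency check, `f1_score(0.656, 0.807)` rounds to the 0.724 reported for the Data use case.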
- Data was best (fuzzy P=0.656, R=0.807, F1=0.724); Finance worst (F1=0.121), due to disagreement over which terms belong in the ontology.
- Qualitative failures: “Policy” and “GovernanceStandard” were each declared a subclass of the other (a spurious equivalence), and an ambiguous “isTypeOf” property left it unclear whether rdfs:subClassOf or rdf:type was intended.
- Limitations: LLMs struggle to set ontology scope autonomously, sometimes propose individuals instead of classes (no declared abstraction level), and confuse the directionality of hierarchy relations with loose subsumption, which hurts logical consistency.
- Future work: end-to-end text→RDF translation, handling named individuals and provenance metadata, progressive ontology construction by feeding the existing model back into OntoEKG, and a community benchmark.
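The mutual-subclass failure noted above is the kind of inconsistency a simple post-check on the generated hierarchy could catch. The sketch below is not part of OntoEKG as described; it is one possible check, with a hypothetical function name and an invented `RetentionPolicy` class added for contrast. It reports every class that can reach itself by following subclass edges.

```python
def find_subclass_cycles(edges):
    """Return the classes that are (transitively) subclasses of themselves.

    `edges` is an iterable of (child, parent) pairs.
    """
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    cyclic = set()
    for start in parents:
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                cyclic.add(start)  # walked back to where we began
                break
            if node in seen:
                continue
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return cyclic


edges = [
    ("Policy", "GovernanceStandard"),
    ("GovernanceStandard", "Policy"),   # the reported failure: each a subclass of the other
    ("RetentionPolicy", "Policy"),      # hypothetical well-formed edge
]

cycles = find_subclass_cycles(edges)
```

For these edges the check flags `Policy` and `GovernanceStandard` but not `RetentionPolicy`, which sits below the cycle without being part of it.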
Related Papers
- No closely related papers in the wiki yet. The current siblings (pedestrian-robot interaction, sidewalk delivery-robot evaluation, pedestrian capacity) are HRI/logistics topics that do not overlap with this paper’s LLM ontology-construction method.