Summary

OntoEKG is an LLM-driven pipeline by Oyewale & Soru (Liber AI Research) that generates domain-specific RDF/OWL ontologies for enterprise knowledge graphs (EKGs) directly from unstructured enterprise text. It decomposes ontology modelling into two phases: an extraction module that identifies core classes and properties, and an entailment module that logically structures those classes into a hierarchy before RDF serialisation. Evaluated on a new three-sector dataset (Data, Finance, Logistics), it reaches a fuzzy-match F1 of 0.724 in the Data domain but reveals clear limitations in scope definition and hierarchical reasoning. The paper doubles as a call to action for a comprehensive end-to-end ontology-construction benchmark.

Key Contributions

  1. OntoEKG pipeline — a two-step LLM process (extraction then entailment) that turns unstructured enterprise text into a formal RDF ontology serialised to Turtle.
  2. A call for a benchmark — argues existing benchmarks (OntoURL, Text2KGBench, OSKGC, LLMs4OL) do not support end-to-end ontology construction from unstructured text, and urges the community to build one.
  3. A new evaluation dataset — three enterprise policy-text use cases in the Data, Finance, and Logistics sectors (released in the OntoEKG GitHub repo).

Methodology and Architecture

Formalisation: from a text T, infer a set of classes C^T and a set of properties P^T. Each class has a label and a description; each property has a label, a domain class, and a range class. Classes form a hierarchy by subsumption (c1 ⊆ c2 means every instance of c1 is also an instance of c2). In RDF, classes are typed owl:Class and properties owl:ObjectProperty; datatypes are reified into their own classes (Schema.org style).
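
The formalisation above can be pictured as a small data model. The following is a minimal sketch, assuming standard-library dataclasses in place of the Pydantic models the paper uses; the names OntClass, OntProperty, and Ontology, and the example classes and property, are illustrative and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntClass:
    label: str
    description: str


@dataclass(frozen=True)
class OntProperty:
    label: str
    domain: str   # label of the domain class
    range_: str   # label of the range class


@dataclass
class Ontology:
    classes: dict = field(default_factory=dict)      # label -> OntClass
    properties: list = field(default_factory=list)   # OntProperty items
    subclass_of: set = field(default_factory=set)    # (child, parent) pairs

    def _require(self, *labels: str) -> None:
        # Properties and hierarchy edges may only mention known classes.
        for name in labels:
            if name not in self.classes:
                raise ValueError(f"unknown class: {name}")

    def add_class(self, label: str, description: str) -> None:
        self.classes[label] = OntClass(label, description)

    def add_property(self, label: str, domain: str, range_: str) -> None:
        self._require(domain, range_)
        self.properties.append(OntProperty(label, domain, range_))

    def add_subclass(self, child: str, parent: str) -> None:
        self._require(child, parent)
        self.subclass_of.add((child, parent))


# Illustrative content only (not taken from the paper's dataset).
onto = Ontology()
onto.add_class("Policy", "A rule set adopted by the enterprise.")
onto.add_class("PrivacyPolicy", "A policy about personal data.")
onto.add_class("DataAsset", "A dataset owned by the enterprise.")
onto.add_property("governs", "Policy", "DataAsset")
onto.add_subclass("PrivacyPolicy", "Policy")
```

Rejecting properties whose domain or range is not a declared class mirrors the constraint that both must be drawn from C^T.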

Four-stage pipeline:

  1. Data Ingestion — unstructured text in; Pydantic data models force valid JSON output (classes, properties, descriptions, domains, ranges).
  2. Ontological Element Extraction — an extraction LLM with a specialised system prompt identifies Classes (entity types) and Properties (relationships), constrained to the provided schema.
  3. Hierarchy Construction with Entailment — an entailment LLM iteratively reasons over class descriptions to infer subclass/inheritance relationships and build the taxonomy.
  4. RDF Serialisation — properties + hierarchy merged and written to Turtle via rdflib.
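
The data flow through the four stages can be sketched as follows. This is a rough illustration under several assumptions: the two LLM calls are replaced by fixed, hypothetical answers so the code runs offline, the JSON field names are guesses at a plausible schema, and the Turtle is formatted by hand rather than with rdflib (which the paper uses) purely to avoid a third-party dependency. The base IRI http://example.org/onto# is a placeholder.

```python
import json

# Hypothetical stand-in for the extraction LLM's schema-constrained JSON reply.
EXTRACTION_REPLY = """
{
  "classes": [
    {"label": "Policy", "description": "A rule set adopted by the enterprise."},
    {"label": "PrivacyPolicy", "description": "A policy about personal data."},
    {"label": "DataAsset", "description": "A dataset owned by the enterprise."}
  ],
  "properties": [
    {"label": "governs", "domain": "Policy", "range": "DataAsset"}
  ]
}
"""


def extract_elements(text: str) -> dict:
    """Stages 1-2: a real run would send `text` to the extraction LLM."""
    return json.loads(EXTRACTION_REPLY)


def infer_hierarchy(classes: list) -> list:
    """Stage 3: a real run would query the entailment LLM; fixed answer here."""
    return [("PrivacyPolicy", "Policy")]  # (child, parent) pairs


def to_turtle(elements: dict, hierarchy: list,
              base: str = "http://example.org/onto#") -> str:
    """Stage 4: merge classes, properties, and hierarchy into Turtle text."""
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for c in elements["classes"]:
        # json.dumps yields a quoted, escaped literal; adequate for simple text.
        comment = json.dumps(c["description"], ensure_ascii=False)
        lines.append(f":{c['label']} a owl:Class ; rdfs:comment {comment} .")
    for p in elements["properties"]:
        lines.append(
            f":{p['label']} a owl:ObjectProperty ; "
            f"rdfs:domain :{p['domain']} ; rdfs:range :{p['range']} ."
        )
    for child, parent in hierarchy:
        lines.append(f":{child} rdfs:subClassOf :{parent} .")
    return "\n".join(lines) + "\n"


elements = extract_elements("Example policy text.")
turtle = to_turtle(elements, infer_hierarchy(elements["classes"]))
print(turtle)
```

In a real pipeline the hand-written formatter would be replaced by an RDF library, which also handles IRI escaping and labels that are not valid Turtle local names.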

Models: Extraction = Google Gemini 3 Flash (preview); Entailment = Anthropic Claude 4.5 Opus. Run on Google Colab. Other entailment candidates (Gemini 2.5 Flash/Pro, 3 Flash preview, Claude 4.5 Sonnet) were tried but underperformed; Gemini 2.5 Pro was dropped for efficiency.

Results

Two matching schemes, exact and fuzzy, were applied to the three use cases. Fuzzy matching is an embedding-based triple alignment, with thresholds 0.94/0.94/0.95:

Use case     Exact F1    Fuzzy F1
Data         0.102       0.724
Finance      0.000       0.121
Logistics    0.048       0.431

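
To make the fuzzy scores concrete, here is a minimal sketch of threshold-based alignment and the resulting precision, recall, and F1. The greedy one-to-one matching and the toy 2-D vectors are assumptions made for illustration; the summary does not specify the paper's exact alignment procedure or embedding model. The last lines check that the Data-domain precision and recall reported for this evaluation (P=0.656, R=0.807) give the tabulated F1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def fuzzy_prf(pred, gold, threshold):
    """Greedily align each predicted triple embedding to its most similar
    unmatched gold embedding; a pair counts only at or above the threshold."""
    unmatched = list(range(len(gold)))
    true_pos = 0
    for p in pred:
        best = max(unmatched, key=lambda j: cosine(p, gold[j]), default=None)
        if best is not None and cosine(p, gold[best]) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy embeddings (made up): three predicted triples, two gold triples.
pred = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
gold = [(1.0, 0.05), (0.6, 0.8)]
p, r, f = fuzzy_prf(pred, gold, threshold=0.94)

# Consistency check on the reported Data-domain figures.
data_f1 = round(f1_score(0.656, 0.807), 3)
```

With these toy vectors two of the three predictions clear the 0.94 threshold, so precision is 2/3 and recall is 1. The Data-domain check evaluates to 0.724, matching the table.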
  • Data was best (fuzzy P=0.656, R=0.807, F1=0.724); Finance worst (F1=0.121), due to disagreement over which terms belong in the ontology.
  • Qualitative failures: “Policy” and “GovernanceStandard” were each declared a subclass of the other (a spurious equivalence), and an ambiguous “isTypeOf” property left it unclear whether rdfs:subClassOf or rdf:type was intended.
  • Limitations: LLMs struggle to set ontology scope autonomously, sometimes propose individuals instead of classes (no declared abstraction level), and confuse the directionality of hierarchy relations with loose subsumption — hurting logical consistency.
  • Future work: end-to-end text→RDF translation, handling named individuals and provenance metadata, progressive ontology construction by feeding the existing model back into OntoEKG, and a community benchmark.
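
The mutual-subclass failure noted above (Policy and GovernanceStandard each declared a subclass of the other) is the kind of inconsistency a simple post-processing check can flag. The sketch below is not part of OntoEKG as described; it is one possible check, assuming the hierarchy is available as (child, parent) pairs. The PrivacyPolicy edge is a hypothetical addition to show that acyclic classes are left alone.

```python
def classes_in_cycles(subclass_edges):
    """Return the sorted labels of classes that are their own ancestor,
    i.e. classes lying on a cycle of subclass edges."""
    parents = {}
    for child, parent in subclass_edges:
        parents.setdefault(child, set()).add(parent)

    def ancestors(start):
        # Iterative depth-first walk up the hierarchy.
        seen, stack = set(), list(parents.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return seen

    return sorted(c for c in parents if c in ancestors(c))


edges = [
    ("Policy", "GovernanceStandard"),
    ("GovernanceStandard", "Policy"),   # the reverse edge creates the cycle
    ("PrivacyPolicy", "Policy"),        # hypothetical, acyclic edge
]
cyclic = classes_in_cycles(edges)
print(cyclic)
```

Flagged classes could then be sent back to the entailment step for a second judgement, or merged if an equivalence really is intended.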