OASC — MIM2 Representing Data

Ontology alignment pipeline

From detection to execution — the three phases of ontology alignment, and the spectrum of techniques from rule-based to generative AI.

1. Detection — finding correspondences
Input: two ontologies (source + target). Output: a set of candidate correspondences with confidence scores. This is where most of the intelligence lives.
Lexical / string matching (rule-based)
Compare concept labels using string similarity (Levenshtein, Jaccard, n-grams), synonym lookup (WordNet), and multilingual dictionaries. Fast but brittle — misses semantically equivalent concepts with different names.
Ex: "Temperature" ↔ "Temperatura" → match via translation table
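A minimal sketch of a lexical matcher, combining an edit-distance ratio with character-trigram Jaccard overlap (both from the Python standard library; real systems add synonym and translation lookups on top):

```python
from difflib import SequenceMatcher

def ngrams(s: str, n: int = 3) -> set:
    """Character n-grams of a lowercased label."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def lexical_score(a: str, b: str) -> float:
    """Take the best of edit-distance ratio and n-gram overlap."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return max(ratio, jaccard(a, b))

# "Temperature" vs "Temperatura" differ in one character -> high score
print(lexical_score("Temperature", "Temperatura"))
```

Exactly the brittleness noted above: the same function scores "AirQualityIndex" vs "PollutionLevel" near zero.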
Structural / graph matching (rule-based)
Compare the topology of the ontology graphs: hierarchy position, shared sub-classes, domain/range of properties, cardinality constraints. Two concepts with similar "neighbourhoods" are likely equivalent.
Ex: both have sub-classes {Indoor, Outdoor} and property "hasUnit" → structural match
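One way to score that "neighbourhood" similarity is to Jaccard-compare each feature kind (sub-classes, properties, and so on) and average; a sketch under that assumption:

```python
def structural_score(nb_a: dict, nb_b: dict) -> float:
    """Compare two concepts by the overlap of their graph neighbourhoods.
    Each neighbourhood maps a feature kind (subclasses, properties, ...)
    to a set of local names; the score averages per-kind Jaccard overlap."""
    kinds = set(nb_a) | set(nb_b)
    scores = []
    for kind in kinds:
        a, b = nb_a.get(kind, set()), nb_b.get(kind, set())
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

src = {"subclasses": {"Indoor", "Outdoor"}, "properties": {"hasUnit", "hasValue"}}
tgt = {"subclasses": {"Indoor", "Outdoor"}, "properties": {"hasUnit"}}
print(structural_score(src, tgt))  # → 0.75
```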
Logic-based reasoning (symbolic AI)
Use Description Logic reasoners (HermiT, Pellet) to infer correspondences from OWL axioms. Can detect subsumption (A ⊑ B), equivalence (A ≡ B), and disjointness. Precise but requires well-formalised ontologies.
Ex: OWL axioms imply Sensor ⊑ Device → subsumption detected
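Full DL reasoners like HermiT and Pellet handle complete OWL semantics; the core idea of subsumption inference can be illustrated with just the transitive closure of asserted subclass axioms:

```python
def subsumes(axioms, sub, sup):
    """Is `sub ⊑ sup` entailed by the transitive closure of subclass axioms?
    axioms: set of (child, parent) pairs. Only asserted hierarchies are
    covered here; real reasoners also use equivalence, disjointness,
    and property restrictions."""
    frontier, seen = {sub}, set()
    while frontier:
        c = frontier.pop()
        if c == sup:
            return True
        seen.add(c)
        frontier |= {p for (ch, p) in axioms if ch == c} - seen
    return False

axioms = {("TemperatureSensor", "Sensor"), ("Sensor", "Device")}
print(subsumes(axioms, "TemperatureSensor", "Device"))  # inferred via Sensor
```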
Embedding-based (BERT, Sentence-BERT) (machine learning)
Encode concept labels + descriptions into vector embeddings, then compute cosine similarity. Captures semantic proximity even when labels are very different. Not generative — it's a retrieval/similarity approach.
Ex: "AirQualityIndex" ↔ "PollutionLevel" → cosine sim 0.87
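The similarity computation itself is just cosine over the encoded vectors. A sketch with toy 4-dimensional vectors standing in for Sentence-BERT embeddings (a real pipeline would obtain them from `model.encode([...])`):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for embeddings of the two labels.
air_quality = [0.8, 0.1, 0.5, 0.3]
pollution   = [0.7, 0.2, 0.6, 0.2]
print(round(cosine(air_quality, pollution), 2))
```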
LLM-based alignment (RAG + prompting) (generative AI)
Feed concept pairs (with their definitions, axioms, module context) to an LLM and ask it to judge the relationship. RAG retrieves the most relevant candidates first, then the LLM reasons about complex correspondences. State of the art for complex (1-to-N) alignments that previously required human experts.
Ex: "Does saref:Measurement subsume or equal schema:QuantitativeValue?" → LLM reasons with context
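A sketch of the prompt-assembly step; the retrieval ("RAG") stage is assumed to have already selected the context snippets, and no particular model API is implied:

```python
def build_alignment_prompt(src, tgt, context):
    """Assemble the prompt an LLM judge receives for one candidate pair."""
    lines = [
        "You are an ontology alignment expert.",
        f"Decide the relationship between {src} and {tgt}:",
        "one of equivalent / subsumes / subsumed-by / related / unrelated.",
        "Context:",
        *[f"- {c}" for c in context],
        "Answer with the relation and a one-sentence justification.",
    ]
    return "\n".join(lines)

prompt = build_alignment_prompt(
    "saref:Measurement", "schema:QuantitativeValue",
    ["saref:Measurement relates a value to a saref:Property and a unit",
     "schema:QuantitativeValue is a point value or interval with unitCode"])
print(prompt)
```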
Hybrid / ensemble systems (combined)
Most competitive systems (LogMap, AML, OntoAligner) combine multiple matchers: lexical first (fast filtering), then structural, then ML-based refinement. Outputs are aggregated with weighted voting or learned fusion.
Ex: LogMap uses lexical + structural + logic repair; OntoAligner chains retrieval + LLM
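The weighted-voting aggregation can be sketched as follows; the matcher names, weights, and threshold are illustrative, not taken from any particular system:

```python
def fuse(scores: dict, weights: dict, threshold: float = 0.6) -> bool:
    """Weighted aggregation of per-matcher confidences for one candidate pair.
    A matcher that produced no score contributes 0."""
    total = sum(weights.values())
    fused = sum(weights[m] * scores.get(m, 0.0) for m in weights) / total
    return fused >= threshold

scores = {"lexical": 0.9, "structural": 0.7, "embedding": 0.85}
weights = {"lexical": 1.0, "structural": 1.5, "embedding": 2.0}
print(fuse(scores, weights))
```

Learned fusion replaces the fixed weights with a classifier trained on reference alignments.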
Correspondences expressed as mappings
2. Representation — expressing the mappings
The detected correspondences need a standard format to be stored, shared, and consumed by transformation engines.
SSSOM (Simple Standard for Sharing Ontological Mappings) (standard)
Emerging OBO Foundry standard. TSV/JSON format. Each row: subject_id, predicate_id (skos:exactMatch, broadMatch...), object_id, confidence, mapping_tool. Simple, shareable, versionable. Growing adoption.
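A sketch of emitting one SSSOM row as TSV with the standard library; the columns shown are a subset of the spec, and a real file also carries a YAML metadata header:

```python
import csv, io

columns = ["subject_id", "predicate_id", "object_id", "confidence", "mapping_tool"]
rows = [("ex:Temperature", "skos:exactMatch", "other:Temperatura", "0.95",
         "demo-matcher")]  # CURIEs and tool name are illustrative

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(columns)
writer.writerows(rows)
print(buf.getvalue())
```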
SKOS mapping properties (W3C standard)
skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch, skos:relatedMatch. Lightweight, widely understood, directly usable in RDF/JSON-LD.
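In Turtle, such mappings are plain triples; the `ex:` and `tgt:` namespaces below are placeholders:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/src#> .
@prefix tgt:  <http://example.org/tgt#> .

ex:Temperature     skos:exactMatch tgt:Temperatura .
ex:AirQualityIndex skos:closeMatch tgt:PollutionLevel .
```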
OWL axioms (W3C standard)
owl:equivalentClass, owl:sameAs, rdfs:subClassOf. Formal, machine-reasonable, but heavier. Best for ontologies that will be merged rather than just mapped.
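The same correspondences as formal axioms (again with placeholder namespaces); note that owl:equivalentClass and rdfs:subClassOf relate classes, while owl:sameAs relates individuals:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/src#> .
@prefix tgt:  <http://example.org/tgt#> .

ex:Sensor      rdfs:subClassOf     tgt:Device .
ex:Temperature owl:equivalentClass tgt:Temperatura .
```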
Mappings drive data transformation
3. Execution — transforming the data
The mappings are consumed by transformation engines that convert actual data instances from one model to another.
RML / R2RML rule generation (rule-based)
Translate correspondences into declarative mapping rules. Input: SSSOM/SKOS mappings. Output: RML rules that an engine (RMLMapper, Morph-KGC) executes on data streams. This step is rule-based by design.
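A minimal RML mapping that an engine such as RMLMapper or Morph-KGC could execute; the source file name, iterator, and field names are hypothetical:

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix tgt: <http://example.org/tgt#> .

<#SensorMap>
  rml:logicalSource [ rml:source "sensors.json" ;
                      rml:referenceFormulation ql:JSONPath ;
                      rml:iterator "$.readings[*]" ] ;
  rr:subjectMap [ rr:template "http://example.org/reading/{id}" ;
                  rr:class tgt:Temperatura ] ;
  rr:predicateObjectMap [ rr:predicate tgt:hasValue ;
                          rr:objectMap [ rml:reference "value" ] ] .
```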
SPARQL CONSTRUCT queries (rule-based)
Each correspondence becomes a CONSTRUCT pattern that rewrites triples from the source graph into the target vocabulary. Executed by any SPARQL endpoint.
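For an exactMatch correspondence, the CONSTRUCT pattern is a direct rewrite (placeholder namespaces and property names):

```sparql
PREFIX src: <http://example.org/src#>
PREFIX tgt: <http://example.org/tgt#>

# Rewrite every src:Temperature instance into the target vocabulary.
CONSTRUCT {
  ?s a tgt:Temperatura ;
     tgt:hasValue ?v .
}
WHERE {
  ?s a src:Temperature ;
     src:value ?v .
}
```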
LLM-assisted rule generation (generative AI)
Emerging approach: use an LLM to generate RML/SPARQL rules from natural-language descriptions of the correspondence. The human validates, the engine executes. Hybrid human-AI workflow.
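A sketch of the generate-then-validate loop; the LLM is stubbed as any prompt-to-text callable, and the validation step is a crude syntactic stand-in for the human review the workflow requires:

```python
def generate_rule(llm, description: str) -> str:
    """Ask an LLM for a SPARQL CONSTRUCT rule implementing a correspondence.
    `llm` is any callable mapping a prompt to text; no real model is wired in."""
    return llm(f"Write a SPARQL CONSTRUCT query implementing: {description}")

def validate(rule: str) -> bool:
    """Sanity check before an expert signs off and the engine runs the rule."""
    return "CONSTRUCT" in rule.upper() and "WHERE" in rule.upper()

# Stub standing in for a real LLM call.
stub_llm = lambda p: "CONSTRUCT { ?s a tgt:Temperatura } WHERE { ?s a src:Temperature }"
rule = generate_rule(stub_llm, "map src:Temperature to tgt:Temperatura")
print(validate(rule))
```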
Spectrum of approaches — from rules to generative AI
Rules → Symbolic → ML / embeddings → Generative AI
Deterministic, explainable → Probabilistic, handles complexity
Legend: rule-based = deterministic · symbolic AI = logic reasoning · ML = embeddings / retrieval · gen AI = LLM prompting / RAG · standard = format / spec