OASC — MIM2 Representing Data

Ontology alignment pipeline

From detection to execution — the three phases of ontology alignment, and the spectrum of techniques from rule-based to generative AI.

1. Detection — finding correspondences
Input: two ontologies (source + target). Output: a set of candidate correspondences with confidence scores. This is where most of the intelligence lives.
Lexical / string matching (rule-based)
Compare concept labels using string similarity (Levenshtein, Jaccard, n-grams), synonym lookup (WordNet), and multilingual dictionaries. Fast but brittle — misses semantically equivalent concepts with different names.
Ex: "Temperature" ↔ "Temperatura" → match via translation table
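A minimal sketch of a lexical matcher, combining an edit-distance ratio with character-trigram Jaccard overlap (both from the Python standard library; real systems add synonym and translation lookups on top):

```python
from difflib import SequenceMatcher

def ngrams(s: str, n: int = 3) -> set:
    """Character n-grams of a lowercased label."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def lexical_score(a: str, b: str) -> float:
    """Take the best of edit-distance ratio and n-gram overlap."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return max(ratio, jaccard(a, b))

# "Temperature" vs "Temperatura" differ in one character -> high score
print(lexical_score("Temperature", "Temperatura"))
```

Exactly the brittleness noted above: the same function scores "AirQualityIndex" vs "PollutionLevel" near zero.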
Structural / graph matching (rule-based)
Compare the topology of the ontology graphs: hierarchy position, shared sub-classes, domain/range of properties, cardinality constraints. Two concepts with similar "neighbourhoods" are likely equivalent.
Ex: both have sub-classes {Indoor, Outdoor} and property "hasUnit" → structural match
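One way to score that "neighbourhood" similarity is to Jaccard-compare each feature kind (sub-classes, properties, and so on) and average; a sketch under that assumption:

```python
def structural_score(nb_a: dict, nb_b: dict) -> float:
    """Compare two concepts by the overlap of their graph neighbourhoods.
    Each neighbourhood maps a feature kind (subclasses, properties, ...)
    to a set of local names; the score averages per-kind Jaccard overlap."""
    kinds = set(nb_a) | set(nb_b)
    scores = []
    for kind in kinds:
        a, b = nb_a.get(kind, set()), nb_b.get(kind, set())
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

src = {"subclasses": {"Indoor", "Outdoor"}, "properties": {"hasUnit", "hasValue"}}
tgt = {"subclasses": {"Indoor", "Outdoor"}, "properties": {"hasUnit"}}
print(structural_score(src, tgt))  # → 0.75
```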
Logic-based reasoning (symbolic AI)
Use Description Logic reasoners (HermiT, Pellet) to infer correspondences from OWL axioms. Can detect subsumption (A ⊑ B), equivalence (A ≡ B), and disjointness. Precise but requires well-formalised ontologies.
Ex: OWL axioms imply Sensor ⊑ Device → subsumption detected
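Full DL reasoners like HermiT and Pellet handle complete OWL semantics; the core idea of subsumption inference can be illustrated with just the transitive closure of asserted subclass axioms:

```python
def subsumes(axioms, sub, sup):
    """Is `sub ⊑ sup` entailed by the transitive closure of subclass axioms?
    axioms: set of (child, parent) pairs. Only asserted hierarchies are
    covered here; real reasoners also use equivalence, disjointness,
    and property restrictions."""
    frontier, seen = {sub}, set()
    while frontier:
        c = frontier.pop()
        if c == sup:
            return True
        seen.add(c)
        frontier |= {p for (ch, p) in axioms if ch == c} - seen
    return False

axioms = {("TemperatureSensor", "Sensor"), ("Sensor", "Device")}
print(subsumes(axioms, "TemperatureSensor", "Device"))  # inferred via Sensor
```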
Embedding-based (BERT, Sentence-BERT) (machine learning)
Encode concept labels + descriptions into vector embeddings, then compute cosine similarity. Captures semantic proximity even when labels are very different. Not generative — it's a retrieval/similarity approach.
Ex: "AirQualityIndex" ↔ "PollutionLevel" → cosine sim 0.87
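The similarity computation itself is just cosine over the encoded vectors. A sketch with toy 4-dimensional vectors standing in for Sentence-BERT embeddings (a real pipeline would obtain them from `model.encode([...])`):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for embeddings of the two labels.
air_quality = [0.8, 0.1, 0.5, 0.3]
pollution   = [0.7, 0.2, 0.6, 0.2]
print(round(cosine(air_quality, pollution), 2))
```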
LLM-based alignment (RAG + prompting) (generative AI)
Feed concept pairs (with their definitions, axioms, module context) to an LLM and ask it to judge the relationship. RAG retrieves the most relevant candidates first, then the LLM reasons about complex correspondences. State of the art for complex (1-to-N) alignments that previously required human experts.
Ex: "Does saref:Measurement subsume or equal schema:QuantitativeValue?" → LLM reasons with context
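A sketch of the prompt-assembly step; the retrieval ("RAG") stage is assumed to have already selected the context snippets, and no particular model API is implied:

```python
def build_alignment_prompt(src, tgt, context):
    """Assemble the prompt an LLM judge receives for one candidate pair."""
    lines = [
        "You are an ontology alignment expert.",
        f"Decide the relationship between {src} and {tgt}:",
        "one of equivalent / subsumes / subsumed-by / related / unrelated.",
        "Context:",
        *[f"- {c}" for c in context],
        "Answer with the relation and a one-sentence justification.",
    ]
    return "\n".join(lines)

prompt = build_alignment_prompt(
    "saref:Measurement", "schema:QuantitativeValue",
    ["saref:Measurement relates a value to a saref:Property and a unit",
     "schema:QuantitativeValue is a point value or interval with unitCode"])
print(prompt)
```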
Hybrid / ensemble systems (combined)
Most competitive systems (LogMap, AML, OntoAligner) combine multiple matchers: lexical first (fast filtering), then structural, then ML-based refinement. Outputs are aggregated with weighted voting or learned fusion.
Ex: LogMap uses lexical + structural + logic repair; OntoAligner chains retrieval + LLM
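The weighted-voting aggregation can be sketched as follows; the matcher names, weights, and threshold are illustrative, not taken from any particular system:

```python
def fuse(scores: dict, weights: dict, threshold: float = 0.6) -> bool:
    """Weighted aggregation of per-matcher confidences for one candidate pair.
    A matcher that produced no score contributes 0."""
    total = sum(weights.values())
    fused = sum(weights[m] * scores.get(m, 0.0) for m in weights) / total
    return fused >= threshold

scores = {"lexical": 0.9, "structural": 0.7, "embedding": 0.85}
weights = {"lexical": 1.0, "structural": 1.5, "embedding": 2.0}
print(fuse(scores, weights))
```

Learned fusion replaces the fixed weights with a classifier trained on reference alignments.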
Correspondences expressed as mappings
2. Representation — expressing the mappings
The detected correspondences need a standard format to be stored, shared, and consumed by transformation engines.
SSSOM (Simple Standard for Sharing Ontological Mappings) (standard)
Emerging OBO Foundry standard. TSV/JSON format. Each row: subject_id, predicate_id (skos:exactMatch, broadMatch...), object_id, confidence, mapping_tool. Simple, shareable, versionable. Growing adoption.
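A sketch of emitting one SSSOM row as TSV with the standard library; the columns shown are a subset of the spec, and a real file also carries a YAML metadata header:

```python
import csv, io

columns = ["subject_id", "predicate_id", "object_id", "confidence", "mapping_tool"]
rows = [("ex:Temperature", "skos:exactMatch", "other:Temperatura", "0.95",
         "demo-matcher")]  # CURIEs and tool name are illustrative

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(columns)
writer.writerows(rows)
print(buf.getvalue())
```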
SKOS mapping properties (W3C standard)
skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch, skos:relatedMatch. Lightweight, widely understood, directly usable in RDF/JSON-LD.
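In Turtle, such mappings are plain triples; the `ex:` and `tgt:` namespaces below are placeholders:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/src#> .
@prefix tgt:  <http://example.org/tgt#> .

ex:Temperature     skos:exactMatch tgt:Temperatura .
ex:AirQualityIndex skos:closeMatch tgt:PollutionLevel .
```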
OWL axioms (W3C standard)
owl:equivalentClass, owl:sameAs, rdfs:subClassOf. Formal, machine-reasonable, but heavier. Best for ontologies that will be merged rather than just mapped.
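The same correspondences as formal axioms (again with placeholder namespaces); note that owl:equivalentClass and rdfs:subClassOf relate classes, while owl:sameAs relates individuals:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/src#> .
@prefix tgt:  <http://example.org/tgt#> .

ex:Sensor      rdfs:subClassOf     tgt:Device .
ex:Temperature owl:equivalentClass tgt:Temperatura .
```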
Mappings drive data transformation
3. Execution — transforming the data
The mappings are consumed by transformation engines that convert actual data instances from one model to another.
RML / R2RML rule generation (rule-based)
Translate correspondences into declarative mapping rules. Input: SSSOM/SKOS mappings. Output: RML rules that an engine (RMLMapper, Morph-KGC) executes on data streams. This step is rule-based by design.
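A minimal RML mapping that an engine such as RMLMapper or Morph-KGC could execute; the source file name, iterator, and field names are hypothetical:

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix tgt: <http://example.org/tgt#> .

<#SensorMap>
  rml:logicalSource [ rml:source "sensors.json" ;
                      rml:referenceFormulation ql:JSONPath ;
                      rml:iterator "$.readings[*]" ] ;
  rr:subjectMap [ rr:template "http://example.org/reading/{id}" ;
                  rr:class tgt:Temperatura ] ;
  rr:predicateObjectMap [ rr:predicate tgt:hasValue ;
                          rr:objectMap [ rml:reference "value" ] ] .
```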
SPARQL CONSTRUCT queries (rule-based)
Each correspondence becomes a CONSTRUCT pattern that rewrites triples from the source graph into the target vocabulary. Executed by any SPARQL endpoint.
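For an exactMatch correspondence, the CONSTRUCT pattern is a direct rewrite (placeholder namespaces and property names):

```sparql
PREFIX src: <http://example.org/src#>
PREFIX tgt: <http://example.org/tgt#>

# Rewrite every src:Temperature instance into the target vocabulary.
CONSTRUCT {
  ?s a tgt:Temperatura ;
     tgt:hasValue ?v .
}
WHERE {
  ?s a src:Temperature ;
     src:value ?v .
}
```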
LLM-assisted rule generation (generative AI)
Emerging approach: use an LLM to generate RML/SPARQL rules from natural-language descriptions of the correspondence. The human validates, the engine executes. Hybrid human-AI workflow.
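A sketch of the generate-then-validate loop; the LLM is stubbed as any prompt-to-text callable, and the validation step is a crude syntactic stand-in for the human review the workflow requires:

```python
def generate_rule(llm, description: str) -> str:
    """Ask an LLM for a SPARQL CONSTRUCT rule implementing a correspondence.
    `llm` is any callable mapping a prompt to text; no real model is wired in."""
    return llm(f"Write a SPARQL CONSTRUCT query implementing: {description}")

def validate(rule: str) -> bool:
    """Sanity check before an expert signs off and the engine runs the rule."""
    return "CONSTRUCT" in rule.upper() and "WHERE" in rule.upper()

# Stub standing in for a real LLM call.
stub_llm = lambda p: "CONSTRUCT { ?s a tgt:Temperatura } WHERE { ?s a src:Temperature }"
rule = generate_rule(stub_llm, "map src:Temperature to tgt:Temperatura")
print(validate(rule))
```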
Spectrum of approaches — from rules to generative AI
Rules → Symbolic → ML / embeddings → Generative AI
Deterministic, explainable → Probabilistic, handles complexity
Legend: rule-based = deterministic · symbolic AI = logic reasoning · ML = embeddings / retrieval · gen AI = LLM prompting / RAG · standard = format / spec