OASC
UC3 · team GEX
MIMathon Porto 2026 · Use case 03

CityData Harmonizer:
any source, any domain, one pipeline.

Team GEX shipped an AI-assisted city-data harmonization pipeline, built around nine specialised "rangers" trained as MIM Champions in the spirit of the OASC Academy. The pipeline was tested end-to-end on the Porto energy track over five heterogeneous sources, then run unchanged on water-quality data to show that it is domain-agnostic.

Team: GEX · Marwen, Mohamed, Olaf
Datasets tested: 5 energy + 2 water
Pipeline: 9 rangers, 4 AI + 5 deterministic
SDM proposed: EnergyConsumptionObserved

Cities are drowning in data.

Cities are drowning in data, but the data does not fail because it is missing. It fails because meaning and format fragment across systems. In the Porto energy use case, the same quantity appears as energy_kWh, consumption_kWh, and energyConsumed across providers: one concept, a dozen names, zero compatibility.

Multiple formats

CSV, XLSX, JSON.

Different delimiters, encodings, and shapes. Every provider does it differently.

Conflicting schemas

One concept, twelve names.

energy_kwh vs consumption_kWh vs energyConsumed. Same concept. Zero compatibility.

No shared model

No canonical target.

Without a shared canonical target, comparing or aggregating across providers is impossible at scale.

Beyond the simple match

Can we make the harmonization process replicable? Can we develop a reusable, domain-agnostic system? The task is not a 1:1 field match. It is, first, a semantic problem: take several non-interoperable datasets within one domain, identify the right canonical target, fill the gaps, and produce one combined output that carries all content from every source without losing provenance.

AI proposes, deterministic verifies.

The CityData Harmonizer is a nine-act pipeline of agents, the Rangers, trained in the spirit of the OASC Academy as MIM champions. Its founding principle: a strict separation between probabilistic proposal and deterministic execution. The AI rangers propose, the deterministic rangers verify and execute, and nothing reaches production data without passing a rule-based quality gate.

Rangers / MIM champions: 9
AI / deterministic split: 4 / 5
Architectural layers (L1–L7): 7
Auto-approval confidence: 0.85

Confidence routing, explicit and deterministic

Control / data separation · Deterministic reproducibility · Self-learning convergence · Full provenance · MCP interoperability · Quality gate · Format-agnostic · Schema-agnostic
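
The routing rule can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the 0.85 threshold comes from the figures above, while the `ProposedMapping` type, the `route` function, the route names, and the choice of `>=` over `>` are assumptions made for the example. It also assumes that a confident mapping skips the Resolver but still goes through the quality gate.

```python
from dataclasses import dataclass

AUTO_APPROVE_THRESHOLD = 0.85  # auto-approval confidence quoted in the text


@dataclass(frozen=True)
class ProposedMapping:
    """Hypothetical shape of one AI-proposed field mapping."""
    source_field: str
    target_field: str
    confidence: float  # score attached by the AI ranger, expected in [0, 1]


def route(mapping: ProposedMapping) -> str:
    """Deterministic routing: the same input always yields the same route."""
    if not 0.0 <= mapping.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {mapping.confidence}")
    if mapping.confidence >= AUTO_APPROVE_THRESHOLD:
        # Confident proposals skip the Resolver (assumption) but are still
        # checked by the rule-based quality gate.
        return "quality_gate"
    # Uncertain proposals go to the Resolver for deeper analysis.
    return "resolver"
```

Because the rule is a plain comparison with no model call, the routing decision can be replayed and audited for any past run.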

Meet the 9 rangers.

Four AI-powered rangers handle investigation and judgment. Five deterministic rangers handle parsing, verification, execution, validation, and lifecycle.

SchemaSelector
AI · pre-pipeline
"I pick the right model before the mission starts."
Discovers the best-fitting OASC Smart Data Model for the source via the Smart Data Models MCP server.
Adapter
DET · ACT 1, ingestion
"I don't ask questions. I just read files."
Reads CSV, XLSX, JSON. Turns any format into a clean list of records. Pluggable: new formats need no pipeline changes.
Profiler
AI · ACT 2, profiling
"I observe. I never judge. Well, almost never."
Reads source + manifest metadata, produces a Source Intelligence File: field names, types, distributions, quality hints.
Planner
AI · ACT 3, planning
"Give me a schema and I will give you a plan."
Maps source fields to canonical targets. Routes by confidence score. Reasons about required fields when absent from source.
Resolver
AI · ACT 4, resolution
"I handle the hard cases the Planner didn't want."
Deep analysis of uncertain mappings. Returns one decision per field: INFERRED, UNMAPPED, or REVIEW.
Knowledge Engine
DET · ACT 5, quality gate
"Nothing passes without my approval. Nothing."
Four rule-based checks: schema_fit, vocab_match, unit_check, provenance_check. APPROVE, REJECT or ESCALATE. No LLM involved.
Executor
DET · ACT 6, execution
"No LLM. No drama. Just execution."
Applies the verified plan row by row. 9 pure transform functions: rename, convert_unit, aggregate, map_value, parse_date, normalize_text, restructure, parse_geojson, combine_lat_lon.
Validator
DET · ACT 7, validation
"I am the last line of defense. Don't disappoint me."
Checks every output record against the canonical JSON Schema. Returns pass / fail with error messages. Failures block publication.
PromotionManager
DET · ACT 5+, lifecycle
"I decide who gets promoted. Yes, really."
Manages Knowledge Base entry lifecycle: candidate → reviewed → approved → promoted. Each confirmed mapping becomes reusable for the next run.
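
To make the Executor's role concrete, here is a minimal sketch of pure transform functions applied row by row from a verified plan. Only the transform names `rename` and `convert_unit` come from the list above; their signatures, the plan format, the `energy_MWh` source field, the `TO_KWH` table, and the `KWH` unit code are assumptions for illustration.

```python
from typing import Any, Callable

Record = dict[str, Any]

# Conversion factors to kWh; an illustrative subset, not the project's table.
TO_KWH = {"kWh": 1.0, "Wh": 0.001, "MWh": 1000.0}


def rename(record: Record, source: str, target: str) -> Record:
    """Return a copy with `source` renamed to `target`; the input is untouched."""
    out = {k: v for k, v in record.items() if k != source}
    out[target] = record[source]
    return out


def convert_unit(record: Record, field: str, from_unit: str) -> Record:
    """Return a copy with `field` converted from `from_unit` to kWh."""
    out = dict(record)
    out[field] = record[field] * TO_KWH[from_unit]
    out["unitCode"] = "KWH"  # assumed unit code for kilowatt hours
    return out


TRANSFORMS: dict[str, Callable[..., Record]] = {
    "rename": rename,
    "convert_unit": convert_unit,
}


def execute(rows: list[Record], plan: list[dict[str, Any]]) -> list[Record]:
    """Apply every step of the verified plan to every row, in plan order."""
    result = []
    for row in rows:
        for step in plan:
            row = TRANSFORMS[step["op"]](row, **step["args"])
        result.append(row)
    return result


plan = [
    {"op": "rename", "args": {"source": "energy_MWh", "target": "consumption"}},
    {"op": "convert_unit", "args": {"field": "consumption", "from_unit": "MWh"}},
]
harmonized = execute([{"energy_MWh": 0.5, "ts": "2024-01-01"}], plan)
```

Each function returns a new record instead of mutating its input, so the same plan applied to the same rows always produces the same output.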

Knowledge Base short-circuit

Once a mapping is promoted to the Knowledge Base, the Planner skips the LLM call entirely on later runs for that field. Zero cost, same quality, faster every run. For stable, recurring sources, the LLM cost converges to zero.
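
The short-circuit can be pictured as a lookup that runs before any model call. The sketch below is an assumption-laden illustration: the four lifecycle states come from the PromotionManager description, while the `KnowledgeBase` class, its methods, and the `plan_field` helper are hypothetical names, and the LLM is represented by a plain callable.

```python
from typing import Callable, Optional

# Lifecycle named in the text: candidate -> reviewed -> approved -> promoted
LIFECYCLE = ["candidate", "reviewed", "approved", "promoted"]


class KnowledgeBase:
    """Hypothetical in-memory store of confirmed field mappings."""

    def __init__(self) -> None:
        self._entries: dict[str, dict[str, str]] = {}

    def add_candidate(self, source_field: str, target_field: str) -> None:
        self._entries[source_field] = {"target": target_field, "status": "candidate"}

    def advance(self, source_field: str) -> str:
        """Move an entry one step along the lifecycle and return its new status."""
        entry = self._entries[source_field]
        idx = LIFECYCLE.index(entry["status"])
        entry["status"] = LIFECYCLE[min(idx + 1, len(LIFECYCLE) - 1)]
        return entry["status"]

    def promoted_target(self, source_field: str) -> Optional[str]:
        """Return the target only if the mapping has reached 'promoted'."""
        entry = self._entries.get(source_field)
        if entry is not None and entry["status"] == "promoted":
            return entry["target"]
        return None


def plan_field(source_field: str, kb: KnowledgeBase,
               llm_propose: Callable[[str], str]) -> str:
    """Use a promoted mapping when one exists; otherwise ask the LLM."""
    target = kb.promoted_target(source_field)
    if target is not None:
        return target  # short-circuit: no LLM call for this field
    return llm_propose(source_field)
```

In this sketch only fully promoted entries bypass the model, so a mapping that is still a candidate or under review keeps going through the normal proposal path.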

EnergyConsumptionObserved, proposed back to SDM.

Today, no Smart Data Model captures observed energy consumption across providers and fuel types. ACMeasurement is electricity-only and built for phase-level electrical engineering. ConsumptionCost supports only monthly granularity and uses free-text energy types. EnergyConsumer is a grid-topology object for power-flow simulation, not a metered observation.

Team GEX reviewed seven city open-data portals, five international standards (ESPI / Green Button, IEC 61968-9, CityGML Energy ADE, ISO 52000, EPBD) and every existing SDM energy entity, and proposes EnergyConsumptionObserved: one entity for any fuel at any granularity, 8 mandatory fields, 41 in all, designed from day one for AI-driven harmonization.

🔒 Core, measurement · 8 fields, all required
id, type, dateObservedFrom, dateObservedTo, temporalResolution, energyType, consumption, unitCode
📋 Context, who, what, how much · 17 fields
flowDirection, supplySource, sector, buildingType, floorArea, consumptionIntensity, energyService, consumptionPoint, cost, costCurrency, tariffPeriod, dataQuality, isNormalised, primaryEnergyFactor, co2EmissionFactor, co2Emissions, phase
☁️ Weather, energy drivers · 10 fields
temperature, heatingDegreeDays, coolingDegreeDays, degreeDayBaseTemp, relativeHumidity, windSpeed, globalSolarRadiation, precipitation, cloudCover, refWeatherObserved
🔗 Relationships, linked entities · 6 fields
refConsumptionPoint, refBuilding, refDevice, refOrganization, refOperatingArea, refWeatherObserved
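
As an illustration of the core group, the sketch below builds one record carrying the eight required fields and checks that none is missing. The field names are taken from the list above; every value is a hypothetical example (the enum value for `energyType`, the ISO 8601 duration for `temporalResolution`, the `KWH` unit code, and the identifier suffix are assumptions, since the proposal's exact value formats are not given here).

```python
CORE_FIELDS = (
    "id", "type", "dateObservedFrom", "dateObservedTo",
    "temporalResolution", "energyType", "consumption", "unitCode",
)

# Hypothetical record: field names from the proposal, values invented.
record = {
    "id": "urn:ngsi-ld:EnergyConsumptionObserved:example-001",
    "type": "EnergyConsumptionObserved",
    "dateObservedFrom": "2024-01-01T00:00:00Z",
    "dateObservedTo": "2024-01-01T00:10:00Z",
    "temporalResolution": "PT10M",  # assumed ISO 8601 duration format
    "energyType": "electricity",    # assumed enum value
    "consumption": 1.25,
    "unitCode": "KWH",              # assumed unit code
}


def missing_core_fields(rec: dict) -> list[str]:
    """List the required core fields that are absent or null in a record."""
    return [f for f in CORE_FIELDS if f not in rec or rec[f] is None]
```

A full validator would check types and enumerations against the JSON Schema as well; this check covers only the presence of the required fields.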

vs existing SDM energy entities

Capability | Existing SDM entities (ACMeasurement, ConsumptionCost, EnergyConsumer, SmartMeteringObs) | EnergyConsumptionObserved
Multi-fuel (gas, heat, H₂, ...) | elec only; free text; grid only; elec focus | ✓ 12-value enum
Interval timestamps (from / to) | single string; year + month; partial | ✓ from / to split
Building / meter / org relationships | refDevice; refPoint; partial | ✓ 6 relationships
Flow direction (import / export) | separate fields | ✓ enum
Supply source (grid / onSite / storage) | | ✓ enum
Sector classification | | ✓ Eurostat-aligned
CO₂ emission factor + emissions | | ✓ native
Weather context (inline) | | ✓ 9 fields

Five energy sources, then two water sources, same pipeline.

To test the pipeline, GEX deliberately chose datasets that did not agree with each other. Within the energy domain, it used five independent sources with different owners, formats, field names, and granularities. It then repeated the same exercise in the water-quality domain with no changes to the pipeline.

Energy domain · five heterogeneous sources, one canonical target

Dataset | Format | Source | What it tested
Fronius Energy Consumption, Maia | XLSX (private) | Pedro C.C. Pimenta, GitHub | Proprietary monitoring export, closest to a live city feed
Tetouan Smart Grid, 10-min Power | CSV | Tetouan Smart Grid Research | Three zones with 10-min timestamps, temporal harmonization + multi-zone
Electricity Consumption with Environment Variables | CSV | Aamir Ansari, Kaggle | Household-level active/reactive power + environmental variables
Climate and Energy Consumption, 2020–2024 | CSV | Emirhan Akku, Kaggle | Country-level aggregation, emissions, renewables, demographics
Smart Energy Consumption and Peak Load | CSV | Jay Joshi, Kaggle | Building characteristics + occupancy + peak-load indicators

Water-quality domain · zero pipeline change

Dataset | Format | Source | What it tested
Water Quality Monitoring | CSV | Pratiksha827, GitHub | Mapping to WaterQualityObserved with no pipeline change
Water Potability | CSV | A. Kadiwal, Kaggle | Merging two non-interoperable water datasets into one canonical model

Result

Across both domains, heterogeneous non-interoperable inputs converged to a single Smart Data Model output. The energy datasets prove multi-source convergence within one domain. The water datasets prove the same pipeline carries to a new domain and a new canonical model without code changes.

One MIM per ranger.

MIM compliance is not a checklist applied after the fact. Each mechanism is satisfied by a specific ranger at a specific stage. Below: the MIMs satisfied today, and the roadmap for the rest, following the same four-step method (classify, decompose, assign, promote).

MIMs satisfied today

MIM | Requirement | How it is satisfied | Responsible ranger
MIM1 | Unique persistent identifier per entity | Auto-generated urn:ngsi-ld:EnergyConsumptionObserved:<uuid> | Executor
MIM1 | Cross-system entity linking | refMeter / refBuilding NGSI-LD URNs | Executor
MIM1 | Semantic typing of entities | Every record carries an explicit type aligned to NGSI-LD | Executor
MIM1 | Ontology-level mapping across sources | Heterogeneous field names mapped to canonical equivalents with confidence scores and persisted evidence | Profiler, Planner, Resolver
MIM2 | Models explicit, documented, unambiguous | Schema defines all attributes with types, units, enumerations, schema.org and OASC SDM references | SchemaSelector, Planner
MIM2 | Build on standardized community models | Composes OASC SDMs via allOf/$ref: GSMA-Commons, Location-Commons, EnergyConsumption | SchemaSelector
MIM2 | Cross-model transformation to common model | The 9-act pipeline transforms any provider schema into one canonical model automatically | Planner, Executor
MIM3 | Adequate machine-readable metadata | Each source manifest is a structured descriptor: owner, domain, format, columns, units, quality notes | Profiler
MIM3 | Traceable provenance and trust lineage | Provenance record per output: source row → mapping IDs → every transform → KE verdict → validation result | Executor, Validator
MIM7 | Geospatial data in open standards | combine_lat_lon and parse_geojson transforms produce OGC-compliant GeoJSON Point geometry | Executor
MIM7 | Contextual datasets comply with MIM1 + MIM2 | Location embedded in the same NGSI-LD record that carries the MIM1 URN and MIM2 canonical schema | Executor
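
Two of the rows above, the MIM1 identifier and the MIM7 geometry, lend themselves to a short sketch. The URN pattern and the transform name `combine_lat_lon` are taken from the table; the function signatures, the `lat` / `lon` input field names, the `location` output key, and the range check are assumptions for illustration. The coordinate order (longitude first) follows the GeoJSON specification.

```python
import uuid
from typing import Any


def mint_urn(entity_type: str = "EnergyConsumptionObserved") -> str:
    """Build an NGSI-LD style URN with a random UUID suffix."""
    return f"urn:ngsi-ld:{entity_type}:{uuid.uuid4()}"


def combine_lat_lon(record: dict[str, Any],
                    lat_field: str = "lat",
                    lon_field: str = "lon") -> dict[str, Any]:
    """Return a copy where two coordinate columns become one GeoJSON Point."""
    lat = float(record[lat_field])
    lon = float(record[lon_field])
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: lat={lat}, lon={lon}")
    out = {k: v for k, v in record.items() if k not in (lat_field, lon_field)}
    # GeoJSON positions are ordered [longitude, latitude].
    out["location"] = {"type": "Point", "coordinates": [lon, lat]}
    return out


entity = combine_lat_lon({"lat": "41.1579", "lon": "-8.6291", "consumption": 1.25})
entity["id"] = mint_urn()
entity["type"] = "EnergyConsumptionObserved"
```

Swapping latitude and longitude is a common source of silently misplaced points, which is why the sketch validates ranges before writing the geometry.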

Roadmap, four-step method applied once more

MIM | Requirement | How the same method extends the system
MIM0 | Accessible, standardized APIs, no silos | Push harmonized NGSI-LD records to an Orion-LD or Scorpio context broker after execution
MIM2 | Transformation rules reusable and self-learning | Promote the Knowledge Base from per-deployment store to a shared community registry
MIM3 | Governance defined and comprehensible | Add a governance block to the manifest, expose manifest metadata as a DCAT-AP catalogue endpoint, integrate an IDS / GAIA-X data-sharing policy at the API gateway
MIM4 | Individuals can control their personal data | Add a consent-management step (MyData or Solid pattern) before harmonizing any personal record
MIM6 | Identity lifecycle and cryptographic data integrity | OAuth2 / JWT + TLS at the API layer with role-based access, sign harmonized output and provenance files
MIM7 | CRS compliance and INSPIRE metadata for the EU context | Declare crs EPSG:4326 on all GeoJSON output, add INSPIRE-compliant metadata fields to the manifest schema
MIM8 | Local Digital Twin layers | Declare the harmonizer as the LDT layer-3 pre-processing component, feed its UTC-normalized time series into demand forecasting and Climate Contract tracking

Recommendation back to OASC

Give every MIM a machine-readable companion. The MIM text stays as it is, written for people; alongside it sits a structured version, derived from the three Y.4505 parts each MIM already has, that an agent can consume directly.

From prototype to production service for member cities.

Harden the pipeline for production

Standards-compliant NGSI-LD output, live model selection through the Smart Data Models MCP server, expert-in-the-loop interface for the stewards who confirm mappings.

Build the shared skill, "MIM-AI"

A versioned registry where a mapping confirmed in one city becomes reusable by every other. MIM compliance compounds across the network instead of restarting in each city.

Run a pilot with founding cities

A small group of member cities runs the harmonizer on real data, while the OASC Academy trains the first stewards, the AI Rangers, who operate the system and grow the shared knowledge library. These first stewards start out as MIM Champions.

Propose EnergyConsumptionObserved to SDM

Submit the canonical target back to the Smart Data Models community. 8 mandatory fields, 41 in all, designed from day one for AI-driven harmonization across heterogeneous city data sources.

Read the full work.

Three documents shipped by team GEX. Together they form the groundwork: the proof, the architecture, and the pitch.