OASC
UC2 · team LESL
MIMathon Porto 2026 · Use case 02

Points of Interest,
one taxonomy for them all.

Team LESL: working notes on harmonizing Points of Interest across municipal departments and external sources, using Dolfin as the canonical pivot, schema.org and Wikidata as the category backbone, and the existing Smart Data Models PointOfInterest as the production target.

Team: LESL
Datasets: 4 Casas de Fado + 54 Postos de Abastecimento
Pivot language: Dolfin
Reference standards: SDM, schema.org, Wikidata

What the data actually contains.

Two CitySDK CSV exports from Porto Digital: 4 Casas de Fado in the historic centre, and 54 Postos de Abastecimento (petrol stations) across the city. Same schema, two very different categories: a good test of whether the canonical model holds up on a second dataset.

58 POIs total
2 categories
3 languages per record
17+ x-citysdk keys

Format quirk

Python dict literals as CSV cell values.

Multilingual fields (category, label, description, others) are exported as single-quoted Python list-of-dict strings, not JSON. Parsed via ast.literal_eval in the adapter.
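
A minimal sketch of why ast.literal_eval is needed here. The cell value is shortened from the sample row shown further down; the helper name parse_cell is hypothetical and only mirrors the idea of the adapter's own parsing.

```python
import ast
import json

# Cell value as exported: a Python literal (single quotes), not JSON.
cell = "[{'lang': 'pt-PT', 'value': 'Casas de Fado'}, {'lang': 'en-GB', 'value': 'Fado houses'}]"


def parse_cell(raw):
    """Parse a list-of-dict cell; return [] when it is empty or malformed."""
    if not raw:
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []


# json.loads rejects the cell because JSON strings must use double quotes.
try:
    json.loads(cell)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

entries = parse_cell(cell)
labels = {entry["lang"]: entry["value"] for entry in entries}
```

ast.literal_eval only evaluates literals (strings, numbers, lists, dicts and so on), so it is a reasonable fit for untrusted CSV cells where eval would not be.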

Format quirk

vCard 2.1 inside the address column.

The address column is a flattened vCard with newlines collapsed to spaces. Street, locality, postal code, country are positional in ADR;WORK. Phone, website, email come from TEL, URL, EMAIL. The adapter rebuilds field boundaries by recognising the vCard keyword set.
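
The boundary recovery can be sketched as follows. The sample string is reassembled from the example values in the mapping table, so the order of the TEL, URL and EMAIL fields is an assumption rather than a verbatim source cell. split_vcard is a hypothetical helper, written as a simpler variant of the adapter's regex.

```python
import re

KEYS = ("BEGIN", "VERSION", "REV", "N", "FN", "ORG", "ADR", "TEL", "URL", "EMAIL", "END")
# A field starts at a known keyword, optionally followed by ;PARAMS, then ':'.
BOUNDARY = re.compile(r"(?:^|\s)(" + "|".join(KEYS) + r")(?:;[^:\s]+)?:")


def split_vcard(flat):
    """Return {KEY: value} for a vCard whose newlines were collapsed to spaces."""
    fields = {}
    matches = list(BOUNDARY.finditer(flat))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(flat)
        # Keep the first occurrence of each key.
        fields.setdefault(m.group(1), flat[m.end():end].strip())
    return fields


# Illustrative input (field order assumed).
flat = (
    "BEGIN:VCARD VERSION:2.1 N:O Fado;O Fado;;; FN:O Fado O Fado ORG:O Fado "
    "ADR;WORK:;16-16A;Largo de S. João Novo;Porto;Porto;4050-554;Portugal "
    "TEL:+351 222026937 URL:www.ofado.com EMAIL:info@ofado.com END:VCARD"
)

fields = split_vcard(flat)

# ADR is positional: PO box; extended; street; locality; region; postal code; country.
adr = fields["ADR"].split(";")
adr += [""] * (7 - len(adr))
po_box, extended, street, locality, region, postal_code, country = adr[:7]
```

In this dataset the extended-address slot carries the street number (16-16A), which is why the canonical model reads streetNumber from it.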

Observation

Multilingual literals look natural in JSON-LD.

Each name and description maps cleanly to a JSON-LD language-tagged value ({"@language": "pt", "@value": "..."}). The canonical model holds them as a list of LocalizedText, which serializes to RDF correctly out of the box.
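
To make the RDF side concrete, here is a small sketch that renders LocalizedText values as N-Triples language-tagged literals. It assumes names map to https://schema.org/name, as in the JSON-LD context of this project; ntriples_literal and name_triples are hypothetical helpers, not part of the harmonizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalizedText:
    lang: str
    value: str


def ntriples_literal(text):
    """Render a LocalizedText as an N-Triples language-tagged literal."""
    escaped = (
        text.value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"@{text.lang}'


def name_triples(subject_iri, names):
    """One schema:name triple per language."""
    return [
        f"<{subject_iri}> <https://schema.org/name> {ntriples_literal(t)} ."
        for t in names
    ]


triples = name_triples(
    "http://mimathon.askem.eu/uc2/pois/5cd04b43f979e000013bee37",
    [LocalizedText("pt", "O Fado"), LocalizedText("en", "O Fado")],
)
```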

Observation

Category coupling is straightforward.

The Casas de Fado set has 1 category, mapped to schema:Restaurant + schema:MusicVenue + Wikidata Q3338148. The petrol stations set adds one more, schema:GasStation + Wikidata Q205495. Each new category is one entry in category_map.json.

Second dataset onboarded, zero adapter change

The petrol stations CSV uses the same CitySDK schema as the Casas de Fado set. We added three entries to category_map.json (one per language label: "Postos de Abastecimento", "Petrol station", "Gasolineras") and re-ran the harmonizer. All 54 records harmonized, with no new code, no re-test, and no schema migration. This is the canonical-model-plus-adapter approach paying off.
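
A sketch of the lookup this relies on. The "Postos de Abastecimento" entry below is reconstructed from the mapping described above (schema:GasStation, Wikidata Q205495) and is an assumption about the exact file content; the resolve helper is hypothetical. Labels without an entry fall back to the generic schema.org Place.

```python
CATEGORY_MAP = {
    "Casas de Fado": {
        "schemaOrgRefs": ["https://schema.org/Restaurant", "https://schema.org/MusicVenue"],
        "wikidataRef": "https://www.wikidata.org/entity/Q3338148",
    },
    # Assumed shape of the entry added for the second dataset.
    "Postos de Abastecimento": {
        "schemaOrgRefs": ["https://schema.org/GasStation"],
        "wikidataRef": "https://www.wikidata.org/entity/Q205495",
    },
}

FALLBACK = {"schemaOrgRefs": ["https://schema.org/Place"], "wikidataRef": None}


def resolve(label):
    """Map a source label to its category anchors, or to the generic fallback."""
    label = label.strip()
    return {"sourceLabel": label, **CATEGORY_MAP.get(label, FALLBACK)}


petrol = resolve("Postos de Abastecimento")
unknown = resolve("Farmácias")  # hypothetical label with no entry yet
```

A fallback to Place keeps unmapped records valid, at the cost of hiding missing entries, so logging unmapped labels during a run would be a sensible companion.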

This time, Smart Data Models has it.

Unlike use case 01, where Tree was a gap, the SDM dataModel.PointOfInterest repository already defines a generic PointOfInterest plus three specialisations: Museum, Beach, Store.

Reuse, not invent

We align our canonical model on SDM's PointOfInterest attributes (name, category, contactPoint, location, capacity, priceRange) and pre-bind to schema.org so the same canonical record can serialize either as SDM NGSI-LD or as schema.org JSON-LD without re-mapping. Casa de Fado is not in SDM as a sub-type, but doesn't need to be: category alignment via schema.org and Wikidata is enough.

Two-level taxonomy

SDM PointOfInterest · schema.org · Wikidata · NGSI-LD ready

The Dolfin pivot.

Dolfin defines the model independently of any serialization, and we derive JSON-LD (with schema.org context) and GeoJSON from it.

package <http://mimathon.askem.eu/uc2/pois>:
  dolfin_version "1"
  version "0.1.0"
  author "LESL (Lea, Eliott, Sattisvar & Louis), Dolfin & Askem"
  description "Canonical model for points of interest, MIMathon Porto 2026 use case 02. Multilingual names and descriptions, schema.org category alignment, structured address and contact, ready to feed Smart Data Models PointOfInterest."

concept Language:
  one of:
    pt
    en
    es
    fr
    de
    it
    other

concept LocalizedText:
  has lang: one Language
  has value: one string

concept Category:
  has sourceLabel: one string
  has schemaOrgRefs: at least 1 string
  has wikidataRef: optional string

concept PostalAddress:
  has streetName: optional string
  has streetNumber: optional string
  has locality: optional string
  has postalCode: optional string
  has country: optional string

concept ContactPoint:
  has telephone: optional string
  has email: optional string
  has website: optional string

concept Location:
  has latitude: one float
  has longitude: one float

concept PointOfInterest:
  has localId: one string
  has names: at least 1 LocalizedText
  has descriptions: LocalizedText
  has category: one Category
  has location: one Location
  has address: optional PostalAddress
  has contact: optional ContactPoint
  has capacity: optional int
  has costRating: optional int

Design choices

From CitySDK CSV to the pivot.

| Source attribute | Source example | Canonical attribute | Transformation |
| --- | --- | --- | --- |
| id | 5cd04b43f979e000013bee37 | PointOfInterest.localId | direct |
| label[term=primary] | {lang: pt-PT, value: "O Fado"} | PointOfInterest.names | ast.literal_eval, filter term=primary, dedupe by lang |
| description | {lang: pt-PT, value: "Situado num edifício..."} | PointOfInterest.descriptions | ast.literal_eval, dedupe by lang, first per lang |
| category | {lang: pt-PT, value: "Casas de Fado"} | PointOfInterest.category | ast.literal_eval, lookup pt-PT label in category_map.json |
| latitude, longitude | 41.142313, -8.617495 | Location.latitude/longitude | parse float, WGS84 |
| address (vCard ADR;WORK) | ;16-16A;Largo de S. João Novo;Porto;Porto;4050-554;Portugal | PostalAddress.* | regex split vCard fields, positional ADR parts |
| address (vCard TEL/URL/EMAIL) | +351 222026937 / www.ofado.com / info@ofado.com | ContactPoint.* | regex extract, take first email before / |
| others[type=x-citysdk/capacity] | 70 | PointOfInterest.capacity | parse int |
| others[type=x-citysdk/cost-rating] | 3 | PointOfInterest.costRating | parse int |
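
The label transformation (filter term=primary, normalize the language tag, keep the first value per language) can be sketched like this. The first three entries mirror the sample row; the last two are hypothetical and only exercise the filter and the dedupe. primary_names is an illustrative helper, not the adapter's function.

```python
import ast

LANGUAGES = {"pt", "en", "es", "fr", "de", "it"}

label_cell = (
    "[{'lang': 'pt-PT', 'term': 'primary', 'value': 'O Fado'}, "
    "{'lang': 'en-GB', 'term': 'primary', 'value': 'O Fado'}, "
    "{'lang': 'es-ES', 'term': 'primary', 'value': 'O Fado'}, "
    # Hypothetical entries below: a non-primary term and a second 'pt' variant.
    "{'lang': 'pt-PT', 'term': 'alternate', 'value': 'Restaurante O Fado'}, "
    "{'lang': 'pt-BR', 'term': 'primary', 'value': 'O  Fado (duplicate)'}]"
)


def primary_names(cell):
    """Return {lang: name}, keeping the first primary label per language."""
    names = {}
    for entry in ast.literal_eval(cell):
        if entry.get("term") != "primary":
            continue
        # 'pt-PT' and 'pt-BR' both normalize to 'pt'.
        lang = (entry.get("lang") or "").lower().split("-")[0]
        if lang not in LANGUAGES:
            lang = "other"
        value = " ".join(str(entry.get("value") or "").split())
        if value:
            names.setdefault(lang, value)
    return names


names = primary_names(label_cell)
```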

One core, many datasets.

Same architecture as the trees harmonizer: a dataset-agnostic core, plus one adapter per source dataset. Onboarding a new POI dataset means writing one adapter file, no core change required.

CLI

python -m harmonize_pois \
    --adapter porto_pois \
    --input ../uc2-pois-casas-de-fado.csv \
    --output ../out/pois.jsonld \
    --geojson ../out/pois.geojson \
    --base-id "http://mimathon.askem.eu/uc2/pois/"

Add a new dataset

Same workflow as UC1: copy adapters/_template.py, fill in the read generator, and extend category_map.json with any new source labels you encounter.

Before and after.

Same record (O Fado, 5cd04b43...ee37), shown first as the raw row from the Porto CitySDK CSV, then harmonized to the canonical model.

Casas de Fado

Postos de Abastecimento

Code & model

Same JSON-LD vs GeoJSON split as UC1: JSON-LD for the data layer (semantic, dereferenceable category IRIs); GeoJSON for map tools with dotted-key flattened properties.
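
As an illustration of the dotted-key flattening, here is a reduced sketch using values from the O Fado record. The flatten helper is a simplified stand-in for the GeoJSON writer, which also handles lists and localized values.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            flat.update(flatten(item, f"{prefix}.{key}" if prefix else key))
    elif value is not None:
        flat[prefix] = value
    return flat


record = {
    "localId": "5cd04b43f979e000013bee37",
    "address": {"streetName": "Largo de S. João Novo", "postalCode": "4050-554"},
    "contact": {"telephone": "+351 222026937", "email": None},
    "capacity": 70,
}

feature = {
    "type": "Feature",
    "id": record["localId"],
    # GeoJSON coordinate order is [longitude, latitude].
    "geometry": {"type": "Point", "coordinates": [-8.61749513128281, 41.14231309679815]},
    "properties": flatten(record),
}
```

Flat properties are what attribute tables in tools such as QGIS or geojson.io display most comfortably, which is the reason for not nesting address and contact in the GeoJSON output.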

Source · CitySDK CSV row · Porto Digital
{
  "active": "True",
  "base": "https://city-api.wearebitmaker.com/CitySDK/pois",
  "category": "[{'lang': 'pt-PT', 'value': 'Casas de Fado'}, {'lang': 'en-GB', 'value': 'Fado houses'}, {'lang': 'es-ES', 'value': 'Casas de Fado'}]",
  "created": "2019-05-06T14:57:07.255000Z",
  "id": "5cd04b43f979e000013bee37",
  "label": "[{'lang': 'pt-PT', 'term': 'primary', 'value': 'O Fado'}, {'lang': 'en-GB', 'term': 'primary', 'value': 'O Fado'}, {'lang': 'es-ES', 'term': 'primary', 'value': 'O Fado'}]",
  "lang": "pt-PT",
  "address": "BEGIN:VCARD VERSION:2.1 REV:20190226T17:27:45Z N:O Fado;O Fado;;; FN:O Fado O Fado ORG:O Fado ADR;WORK:;16-16A;Largo de …",
  "latitude": "41.14231309679815",
  "longitude": "-8.61749513128281",
  "time": "[{'term': 'open', 'type': 'text/icalendar', 'value': 'BEGIN:VCALENDAR VERSION:2.0 BEGIN:VEVENT SUMMARY: DTSTART:20190101T203000Z DTEND:50190115T010000Z DESCRIPTION: LOCATION: RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=MO,TU,WE,TH,FR,SA END:VEVENT END:VCALENDAR'}]",
  "updated": "2021-07-14T15:43:36.376000Z",
  "description": "[ … 3 multilingual entries truncated … ]",
  "others": "[ … 17 typed key-value pairs truncated … ]"
}
Canonical · JSON-LD node · http://mimathon.askem.eu/uc2/pois/
{
  "@id": "http://mimathon.askem.eu/uc2/pois/5cd04b43f979e000013bee37",
  "@type": "PointOfInterest",
  "localId": "5cd04b43f979e000013bee37",
  "names": [
    {
      "@language": "pt",
      "@value": "O Fado"
    },
    {
      "@language": "en",
      "@value": "O Fado"
    },
    {
      "@language": "es",
      "@value": "O Fado"
    }
  ],
  "descriptions": [
    {
      "@language": "pt",
      "@value": "Situado num edifício centenário o Restaurante Típico o Fado preserva todo o tipicismo inerente a este tipo de casas, ond…"
    },
    {
      "@language": "en",
      "@value": "Typical restaurant serving regional cuisine, and where you can appreciate the traditional \"fado\".…"
    },
    {
      "@language": "es",
      "@value": "Restaurante típico de cocina regional, donde se puede desfrutar del tradicional fado.…"
    }
  ],
  "category": {
    "@type": "Category",
    "sourceLabel": "Casas de Fado",
    "schemaOrgRefs": [
      "https://schema.org/Restaurant",
      "https://schema.org/MusicVenue"
    ],
    "wikidataRef": "https://www.wikidata.org/entity/Q3338148"
  },
  "location": {
    "@type": "Location",
    "latitude": 41.14231309679815,
    "longitude": -8.61749513128281
  },
  "capacity": 70,
  "costRating": 3,
  "address": {
    "@type": "PostalAddress",
    "streetName": "Largo de S. João Novo",
    "streetNumber": "16-16A",
    "locality": "Porto",
    "postalCode": "4050-554",
    "country": "Portugal"
  },
  "contact": {
    "@type": "ContactPoint",
    "telephone": "+351 222026937",
    "email": "info@ofado.com",
    "website": "www.ofado.com"
  }
}

JSON-LD as a graph

One POI as a property graph. Multilingual literals are RDF language-tagged values; the Category node carries both the source label and the canonical schema.org / Wikidata anchors, dereferenceable IRIs.

[Figure: JSON-LD instance graph for POI O Fado, with nodes PointOfInterest, names+descriptions, Category, Location, PostalAddress, ContactPoint and external links to schema.org and Wikidata]

What changed

Read it, run it, fork it.

Full source of the POI harmonizer, hosted alongside this page. Each file is a one-click download as raw .py or .json. The whole package is bundled as a tarball at the top of the Data section.

model.py · Canonical PointOfInterest model, mirrors pois.dolfin
"""Canonical POI model, mirrors pois.dolfin v0.1.0."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional


LANGUAGES = {"pt", "en", "es", "fr", "de", "it", "other"}


def normalize_lang(raw: str | None) -> str:
    if not raw:
        return "other"
    raw = raw.lower().split("-")[0]
    return raw if raw in LANGUAGES else "other"


@dataclass(frozen=True)
class LocalizedText:
    lang: str
    value: str


@dataclass(frozen=True)
class Category:
    sourceLabel: str
    schemaOrgRefs: tuple[str, ...]
    wikidataRef: Optional[str] = None


@dataclass(frozen=True)
class PostalAddress:
    streetName: Optional[str] = None
    streetNumber: Optional[str] = None
    locality: Optional[str] = None
    postalCode: Optional[str] = None
    country: Optional[str] = None


@dataclass(frozen=True)
class ContactPoint:
    telephone: Optional[str] = None
    email: Optional[str] = None
    website: Optional[str] = None


@dataclass(frozen=True)
class Location:
    latitude: float
    longitude: float


@dataclass
class PointOfInterest:
    localId: str
    names: list[LocalizedText]
    category: Category
    location: Location
    descriptions: list[LocalizedText] = field(default_factory=list)
    address: Optional[PostalAddress] = None
    contact: Optional[ContactPoint] = None
    capacity: Optional[int] = None
    costRating: Optional[int] = None
transforms.py · Reusable text helpers shared across adapters
"""Reusable text transforms shared across adapters.

Adapters compose these helpers rather than reimplementing them. Helpers
are intentionally minimal: they only do generic text work (cleanup,
regex extraction, keyword routing). Anything dataset-specific belongs
in the adapter itself.
"""
from __future__ import annotations
import re
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Trim, collapse internal whitespace, return None for empty input."""
    if value is None:
        return None
    txt = re.sub(r"\s+", " ", str(value)).strip()
    return txt or None


def extract_count(value: Optional[str], pattern: str = r"\((\d+)") -> Optional[int]:
    """Pull an integer out of free text, e.g. '... (12 exemplares)' -> 12."""
    if value is None:
        return None
    m = re.search(pattern, value)
    return int(m.group(1)) if m else None


def match_keywords(value: Optional[str], keyword_map: dict[str, str]) -> Optional[str]:
    """Return the first enum value whose regex key matches the input.

    keyword_map: {regex_pattern: enum_value}, e.g.
        {r"conjunto\\s+arb[óo]re[op]": "TreeCluster",
         r"isolad": "IsolatedSpecimen"}
    Patterns are evaluated in insertion order, case-insensitive.
    """
    if not value:
        return None
    for pattern, enum_value in keyword_map.items():
        if re.search(pattern, value, re.IGNORECASE):
            return enum_value
    return None


class Registry:
    """Tiny dedupe registry for value-typed entities like Authority.

    Use when source data has many spelling variants of the same entity:
        reg = Registry({"ICNF": Authority(name="...", acronym="ICNF")})
        a = reg.resolve("ICNF (Instituto da Conservação ...)", needle="ICNF")
    The canonical instance is returned, ensuring downstream graphs share
    one node per real-world entity.
    """

    def __init__(self, known: dict | None = None):
        self._known = dict(known or {})

    def resolve(self, raw, needle: str | None = None, default=None):
        if raw is None:
            return default
        text = str(raw)
        if needle is not None and needle in text and needle in self._known:
            return self._known[needle]
        for key, val in self._known.items():
            if key in text:
                return val
        return default

    def get(self, key: str):
        return self._known.get(key)
jsonld.py · JSON-LD writer, schema.org context for shared terms
"""JSON-LD writer for the canonical POI model.

Aligns with schema.org for shared terms (name, description, address,
category) so the output is directly consumable by schema.org-aware
tools, while keeping a local namespace for the wrapping shape.
"""
from __future__ import annotations
from dataclasses import asdict
from typing import Iterable

from .model import PointOfInterest


NS = "http://mimathon.askem.eu/uc2/pois#"

CONTEXT = {
    "@vocab": NS,
    "schema": "https://schema.org/",
    "wd": "https://www.wikidata.org/entity/",
    "PointOfInterest": NS + "PointOfInterest",
    "Category": NS + "Category",
    "Location": NS + "Location",
    "PostalAddress": "schema:PostalAddress",
    "ContactPoint": "schema:ContactPoint",
    "LocalizedText": NS + "LocalizedText",
    "names": "schema:name",
    "descriptions": "schema:description",
    "category": "schema:category",
    "location": "schema:location",
    "address": "schema:address",
    "contact": "schema:contactPoint",
    "telephone": "schema:telephone",
    "email": "schema:email",
    "website": "schema:url",
    "streetName": "schema:streetAddress",
    "streetNumber": "schema:streetAddress",
    "locality": "schema:addressLocality",
    "postalCode": "schema:postalCode",
    "country": "schema:addressCountry",
    "schemaOrgRefs": {"@id": NS + "schemaOrgRefs", "@type": "@id"},
    "wikidataRef": {"@id": NS + "wikidataRef", "@type": "@id"},
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "latitude": "geo:lat",
    "longitude": "geo:long",
    "lang": "@language",
    "value": "@value",
    "capacity": "schema:maximumAttendeeCapacity",
}


def _strip_none(d):
    if isinstance(d, dict):
        return {k: _strip_none(v) for k, v in d.items() if v is not None and v != []}
    if isinstance(d, list):
        return [_strip_none(x) for x in d]
    return d


def _localized(items) -> list[dict]:
    return [{"@language": t.lang, "@value": t.value} for t in items]


def poi_to_node(poi: PointOfInterest, base_id: str) -> dict:
    node = {
        "@id": f"{base_id}{poi.localId}",
        "@type": "PointOfInterest",
        "localId": poi.localId,
        "names": _localized(poi.names),
        "descriptions": _localized(poi.descriptions),
        "category": {
            "@type": "Category",
            "sourceLabel": poi.category.sourceLabel,
            "schemaOrgRefs": list(poi.category.schemaOrgRefs),
            "wikidataRef": poi.category.wikidataRef,
        },
        "location": {
            "@type": "Location",
            "latitude": poi.location.latitude,
            "longitude": poi.location.longitude,
        },
        "capacity": poi.capacity,
        "costRating": poi.costRating,
    }
    if poi.address is not None:
        node["address"] = {"@type": "PostalAddress", **asdict(poi.address)}
    if poi.contact is not None:
        node["contact"] = {"@type": "ContactPoint", **asdict(poi.contact)}
    return _strip_none(node)


def build_document(pois: Iterable[PointOfInterest], base_id: str) -> dict:
    return {
        "@context": CONTEXT,
        "@graph": [poi_to_node(p, base_id) for p in pois],
    }
geojson_out.py · GeoJSON FeatureCollection writer for GIS tools
"""GeoJSON writer for the canonical POI model.

FeatureCollection in WGS84. Properties are flattened with dotted keys
so non-LD tools (geojson.io, QGIS, Leaflet) can use the data directly.
"""
from __future__ import annotations
from dataclasses import asdict
from typing import Iterable

from .model import PointOfInterest


def _flatten(prefix: str, value, target: dict) -> None:
    if value is None or value == []:
        return
    if isinstance(value, dict):
        for k, v in value.items():
            _flatten(f"{prefix}.{k}" if prefix else k, v, target)
    elif isinstance(value, (list, tuple)):
        if value and isinstance(value[0], dict) and {"lang", "value"} <= set(value[0]):
            for item in value:
                target[f"{prefix}_{item['lang']}"] = item["value"]
        else:
            target[prefix] = list(value)
    else:
        target[prefix] = value


def poi_to_feature(poi: PointOfInterest, base_id: str) -> dict:
    props: dict = {"@id": f"{base_id}{poi.localId}", "@type": "PointOfInterest"}
    props["localId"] = poi.localId
    _flatten("name", [{"lang": t.lang, "value": t.value} for t in poi.names], props)
    _flatten("description", [{"lang": t.lang, "value": t.value} for t in poi.descriptions], props)
    _flatten("category", {
        "sourceLabel": poi.category.sourceLabel,
        "schemaOrgRefs": list(poi.category.schemaOrgRefs),
        "wikidataRef": poi.category.wikidataRef,
    }, props)
    if poi.address is not None:
        _flatten("address", asdict(poi.address), props)
    if poi.contact is not None:
        _flatten("contact", asdict(poi.contact), props)
    if poi.capacity is not None:
        props["capacity"] = poi.capacity
    if poi.costRating is not None:
        props["costRating"] = poi.costRating

    return {
        "type": "Feature",
        "id": poi.localId,
        "geometry": {
            "type": "Point",
            "coordinates": [poi.location.longitude, poi.location.latitude],
        },
        "properties": props,
    }


def build_collection(pois: Iterable[PointOfInterest], base_id: str) -> dict:
    return {
        "type": "FeatureCollection",
        "features": [poi_to_feature(p, base_id) for p in pois],
    }
__main__.py · CLI orchestrating adapter, JSON-LD and GeoJSON output
r"""CLI entry point for the POI harmonizer.

Example:
    python -m harmonize_pois \
        --adapter porto_pois \
        --input ../uc2-pois-casas-de-fado.csv \
        --output ../out/pois.jsonld \
        --geojson ../out/pois.geojson \
        --base-id http://mimathon.askem.eu/uc2/pois/
"""
from __future__ import annotations
import argparse
import importlib
import json
import sys
from pathlib import Path

from .geojson_out import build_collection
from .jsonld import build_document


def _load_adapter(name: str):
    mod = importlib.import_module(f"harmonize_pois.adapters.{name}")
    if not hasattr(mod, "read"):
        raise SystemExit(f"adapter {name!r} has no read(path) function")
    return mod


def main(argv=None) -> int:
    p = argparse.ArgumentParser(prog="harmonize_pois", description="Harmonize a POI dataset to the canonical PointOfInterest model and emit JSON-LD plus optional GeoJSON.")
    p.add_argument("--adapter", required=True, help="Adapter module name under harmonize_pois.adapters, e.g. porto_pois")
    p.add_argument("--input", required=True, type=Path, help="Source dataset path")
    p.add_argument("--output", required=True, type=Path, help="Destination JSON-LD file")
    p.add_argument("--base-id", default="http://example.org/pois/", help="IRI prefix for POI @id values")
    p.add_argument("--geojson", type=Path, help="Also emit a GeoJSON FeatureCollection")
    args = p.parse_args(argv)

    adapter = _load_adapter(args.adapter)
    print(f"Reading via adapter '{args.adapter}' from {args.input}...")
    pois = list(adapter.read(args.input))
    print(f"  {len(pois)} POIs read")

    print(f"Writing JSON-LD to {args.output}...")
    doc = build_document(pois, base_id=args.base_id)
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"  done, {len(doc['@graph'])} entities in @graph")

    if args.geojson:
        print(f"Writing GeoJSON to {args.geojson}...")
        fc = build_collection(pois, base_id=args.base_id)
        args.geojson.parent.mkdir(parents=True, exist_ok=True)
        args.geojson.write_text(json.dumps(fc, ensure_ascii=False, indent=2), encoding="utf-8")
        print(f"  done, {len(fc['features'])} features")

    return 0


if __name__ == "__main__":
    sys.exit(main())
category_map.json · Source-label to schema.org and Wikidata IRI map
{
  "Casas de Fado": {
    "schemaOrgRefs": [
      "https://schema.org/Restaurant",
      "https://schema.org/MusicVenue"
    ],
    "wikidataRef": "https://www.wikidata.org/entity/Q3338148"
  },
  "Fado houses": {
    "schemaOrgRefs": [
      "https://schema.org/Restaurant",
      "https://schema.org/MusicVenue"
    ],
    "wikidataRef": "https://www.wikidata.org/entity/Q3338148"
  },

  "Museum": {
    "schemaOrgRefs": ["https://schema.org/Museum"],
    "wikidataRef": "https://www.wikidata.org/entity/Q33506"
  },
  "Restaurant": {
    "schemaOrgRefs": ["https://schema.org/Restaurant"],
    "wikidataRef": "https://www.wikidata.org/entity/Q11707"
  },
  "Park": {
    "schemaOrgRefs": ["https://schema.org/Park"],
    "wikidataRef": "https://www.wikidata.org/entity/Q22698"
  },
  "Hotel": {
    "schemaOrgRefs": ["https://schema.org/Hotel"],
    "wikidataRef": "https://www.wikidata.org/entity/Q27686"
  },
  "Beach": {
    "schemaOrgRefs": ["https://schema.org/Beach"],
    "wikidataRef": "https://www.wikidata.org/entity/Q40080"
  }
}
adapters/_template.py · Skeleton, copy and rename to add a new dataset
"""Skeleton POI adapter, copy and rename to add a new dataset.

Quick start:
    1. Copy to harmonize_pois/adapters/<your_dataset>.py
    2. Replace the read() body with your own parsing
    3. Run:  python -m harmonize_pois --adapter <your_dataset> --input ...

Contract:
    Expose a single function `read(path) -> Iterator[PointOfInterest]`.

Look at adapters/porto_pois.py for a worked example covering CitySDK
multilingual fields, vCard address parsing, and category lookup.
"""
from __future__ import annotations
from pathlib import Path
from typing import Iterator

from ..model import (
    Category, ContactPoint, Location, LocalizedText, PointOfInterest,
    PostalAddress, normalize_lang,
)
from ..transforms import clean_text


def read(path: str | Path) -> Iterator[PointOfInterest]:
    """Yield canonical PointOfInterest records from `path`.

    Replace the body below with your dataset's parsing logic.
    """
    # Example: a CSV with one POI per row
    # import csv
    # with Path(path).open(encoding="utf-8") as f:
    #     for row in csv.DictReader(f):
    #         yield PointOfInterest(
    #             localId=row["id"],
    #             names=[LocalizedText(lang=normalize_lang("en"), value=row["name"])],
    #             category=Category(
    #                 sourceLabel=row["category"],
    #                 schemaOrgRefs=("https://schema.org/Place",),
    #             ),
    #             location=Location(
    #                 latitude=float(row["lat"]),
    #                 longitude=float(row["lon"]),
    #             ),
    #         )
    raise NotImplementedError("Implement read() for your dataset, see porto_pois.py")
adapters/porto_pois.py · Adapter for Porto CitySDK CSV (multilingual, vCard, dict-literals)
"""Adapter for the Porto Open Data Casas de Fado CSV (CitySDK schema).

Source quirks handled here:
    - The CSV uses Python dict-literal strings (single-quoted) inside
      multilingual fields, parsed via ast.literal_eval.
    - Address is a vCard 2.1 blob, ADR;WORK fields are extracted
      positionally per the spec (P.O. box; ext addr; street; locality;
      region; postal code; country).
    - The 'others' field is a list of {type, value} where type is a
      namespaced key like x-citysdk/capacity, x-citysdk/cost-rating, etc.

Maps to PointOfInterest via category_map.json lookup for schema.org IRIs.
"""
from __future__ import annotations
import ast
import csv
import json
import re
from pathlib import Path
from typing import Iterator

from ..model import (
    Category, ContactPoint, Location, LocalizedText, PointOfInterest,
    PostalAddress, normalize_lang,
)
from ..transforms import clean_text


def _load_category_map() -> dict:
    p = Path(__file__).resolve().parent.parent / "category_map.json"
    return json.loads(p.read_text(encoding="utf-8"))


_CATEGORY_MAP = _load_category_map()


def _safe_literal(raw: str | None):
    if not raw:
        return []
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []


def _localized_list(raw: str | None, key_lang: str = "lang", key_val: str = "value", filter_term: str | None = None) -> list[LocalizedText]:
    out: list[LocalizedText] = []
    seen_langs: set[str] = set()
    for entry in _safe_literal(raw):
        if not isinstance(entry, dict):
            continue
        if filter_term and entry.get("term") != filter_term:
            continue
        lang = normalize_lang(entry.get(key_lang))
        val = clean_text(entry.get(key_val))
        if not val or lang in seen_langs:
            continue
        seen_langs.add(lang)
        out.append(LocalizedText(lang=lang, value=val))
    return out


def _resolve_category(raw: str | None) -> Category:
    items = _safe_literal(raw)
    pt_label = None
    for item in items:
        if isinstance(item, dict) and item.get("lang") == "pt-PT":
            pt_label = item.get("value")
            break
    label = pt_label or (items[0].get("value") if items and isinstance(items[0], dict) else "Unknown")
    label = clean_text(label) or "Unknown"
    mapped = _CATEGORY_MAP.get(label, {})
    return Category(
        sourceLabel=label,
        schemaOrgRefs=tuple(mapped.get("schemaOrgRefs", ["https://schema.org/Place"])),
        wikidataRef=mapped.get("wikidataRef"),
    )


_VCARD_KEYS = ("BEGIN", "VERSION", "REV", "N", "FN", "ORG", "ADR", "TEL", "URL", "EMAIL", "END")
_VCARD_FIELD = re.compile(r"(?:^|\s)(" + "|".join(_VCARD_KEYS) + r")(?:;[^:]+)?:(.*?)(?=\s(?:" + "|".join(_VCARD_KEYS) + r")(?:;[^:]+)?:|$)", re.DOTALL)


def _parse_vcard(vcard: str) -> dict[str, str]:
    out: dict[str, str] = {}
    for m in _VCARD_FIELD.finditer(vcard or ""):
        key = m.group(1).upper()
        val = m.group(2).strip()
        if key not in out:
            out[key] = val
    return out


def _parse_vcard_address(vcard: str) -> PostalAddress | None:
    fields = _parse_vcard(vcard)
    raw = fields.get("ADR")
    if not raw:
        return None
    parts = raw.split(";")
    while len(parts) < 7:
        parts.append("")
    # In this dataset the extended-address slot carries the street number.
    _po, ext, street, locality, _region, postal, country = parts[:7]
    return PostalAddress(
        streetName=clean_text(street),
        streetNumber=clean_text(ext),
        locality=clean_text(locality),
        postalCode=clean_text(postal),
        country=clean_text(country),
    )


def _parse_vcard_contact(vcard: str) -> ContactPoint | None:
    fields = _parse_vcard(vcard)
    if not fields:
        return None
    email_raw = fields.get("EMAIL", "")
    email = email_raw.split("/")[0].split(",")[0].strip() if email_raw else None
    cp = ContactPoint(
        telephone=clean_text(fields.get("TEL")),
        website=clean_text(fields.get("URL")),
        email=clean_text(email),
    )
    return cp if any([cp.telephone, cp.website, cp.email]) else None


def _extract_others(raw: str | None) -> dict[str, list[str]]:
    out: dict[str, list[str]] = {}
    for item in _safe_literal(raw):
        if isinstance(item, dict):
            t = item.get("type")
            v = item.get("value")
            if t and v is not None:
                out.setdefault(t, []).append(str(v))
    return out


def _safe_int(s) -> int | None:
    if s is None:
        return None
    try:
        return int(s)
    except (ValueError, TypeError):
        return None


def read(csv_path: str | Path) -> Iterator[PointOfInterest]:
    """Yield canonical PointOfInterest records from a Porto CitySDK CSV."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with Path(csv_path).open(encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, quotechar="'")
        for row in reader:
            others = _extract_others(row.get("others"))
            try:
                lat = float(row["latitude"])
                lon = float(row["longitude"])
            except (KeyError, ValueError, TypeError):
                continue

            names = _localized_list(row.get("label"), key_lang="lang", key_val="value", filter_term="primary")
            if not names:
                continue

            poi = PointOfInterest(
                localId=row.get("id") or "",
                names=names,
                descriptions=_localized_list(row.get("description")),
                category=_resolve_category(row.get("category")),
                location=Location(latitude=lat, longitude=lon),
                address=_parse_vcard_address(row.get("address") or ""),
                contact=_parse_vcard_contact(row.get("address") or ""),
                capacity=_safe_int(others.get("x-citysdk/capacity", [None])[0]),
                costRating=_safe_int(others.get("x-citysdk/cost-rating", [None])[0]),
            )
            yield poi

Remaining work.

Lower the bar for new adapters

Same goal as UC1: a declarative YAML mapping for simple cases (just renames + enum lookups + per-language picks) and a harmonize_pois new-adapter --name <city> scaffolder. Most POI imports are nearly identical apart from field names; the long tail is the messy parts (vCard, iCalendar, Python dict literals).
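
One possible shape for such a declarative mapping, sketched with a plain Python dict standing in for the YAML file so the example has no dependencies. The rule keys (column, cast, lookup, default) and the source column names are all hypothetical; nothing here exists in the harmonizer yet.

```python
# Hypothetical mapping spec: target attribute -> rule.
MAPPING = {
    "localId": {"column": "id"},
    "name": {"column": "title"},
    "latitude": {"column": "lat", "cast": float},
    "longitude": {"column": "lon", "cast": float},
    "category": {
        "column": "kind",
        "lookup": {"museu": "https://schema.org/Museum"},
        "default": "https://schema.org/Place",
    },
}


def apply_mapping(row, mapping):
    """Apply renames, enum lookups and casts described by `mapping` to one row."""
    record = {}
    for target, rule in mapping.items():
        raw = row.get(rule["column"])
        if raw is None or raw == "":
            record[target] = None
            continue
        if "lookup" in rule:
            raw = rule["lookup"].get(raw.strip().lower(), rule.get("default"))
        if "cast" in rule:
            raw = rule["cast"](raw)
        record[target] = raw
    return record


# Hypothetical source row from a dataset with different column names.
row = {"id": "42", "title": "Example Museum", "lat": "41.1", "lon": "-8.6", "kind": "Museu"}
record = apply_mapping(row, MAPPING)
```

Anything this spec cannot express (vCard blobs, iCalendar values, dict literals) would still go through a hand-written adapter.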

Onboard a third dataset, with a different schema

The Postos de Abastecimento set was onboarded without writing any code (same CitySDK adapter, three new entries in category_map.json). The next test is a dataset that does not match the CitySDK shape, e.g. an OpenStreetMap export, a Google Places dump, or another city's portal. That will exercise the adapter pluggability for real and surface any canonical-model gap.

Ship as NGSI-LD against SDM PointOfInterest

Translate the canonical record into NGSI-LD entities conformant to dataModel.PointOfInterest: pick a primary language for name, fold the rest into description, set category to the source label, attach contactPoint and location as references. Validate with the SDM tooling.
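
A rough sketch of that translation, under several assumptions: the entity id follows the common urn:ngsi-ld:<Type>:<id> pattern, category is emitted as a list of strings, location is a GeoProperty, and only the ETSI core context is referenced (the SDM PointOfInterest context would be added in practice). How the remaining languages are folded into description is left open; this sketch keeps only the primary-language text. to_ngsi_ld is a hypothetical function.

```python
NGSI_LD_CORE_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def to_ngsi_ld(node, primary_lang="pt"):
    """Turn a canonical JSON-LD node into an NGSI-LD entity (normalized form)."""

    def pick(values):
        by_lang = {v["@language"]: v["@value"] for v in values}
        if not by_lang:
            return None
        # Prefer the primary language, else the first language available.
        return by_lang.get(primary_lang, next(iter(by_lang.values())))

    entity = {
        "id": f"urn:ngsi-ld:PointOfInterest:{node['localId']}",
        "type": "PointOfInterest",
        "name": {"type": "Property", "value": pick(node["names"])},
        "category": {"type": "Property", "value": [node["category"]["sourceLabel"]]},
        "location": {
            "type": "GeoProperty",
            "value": {
                "type": "Point",
                "coordinates": [node["location"]["longitude"], node["location"]["latitude"]],
            },
        },
        "@context": [NGSI_LD_CORE_CONTEXT],
    }
    description = pick(node.get("descriptions", []))
    if description is not None:
        entity["description"] = {"type": "Property", "value": description}
    return entity


node = {
    "localId": "5cd04b43f979e000013bee37",
    "names": [{"@language": "pt", "@value": "O Fado"}, {"@language": "en", "@value": "O Fado"}],
    "category": {"sourceLabel": "Casas de Fado"},
    "location": {"latitude": 41.14231309679815, "longitude": -8.61749513128281},
}
entity = to_ngsi_ld(node)
```

The result would still need checking against the SDM schema, since the attribute names and value types above are not verified against dataModel.PointOfInterest.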

Resolve categories beyond the lookup table

For datasets with hundreds of categories, hand-curating category_map.json does not scale. A SPARQL probe against Wikidata, a fuzzy match against schema.org class labels, or an LLM-assisted draft with human review would all be reasonable. Each entry should remain explicit and auditable.
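
The fuzzy-match option could start as small as the sketch below, which uses difflib from the standard library. The class list is a tiny hand-picked sample of schema.org types, and the cutoff is an arbitrary starting point; suggest_classes is a hypothetical helper whose output is meant as a draft for human review, not as a final mapping.

```python
import difflib

# Small illustrative subset of schema.org class names.
SCHEMA_ORG_CLASSES = [
    "GasStation", "Restaurant", "MusicVenue", "Museum",
    "Park", "Hotel", "Beach", "Pharmacy",
]


def suggest_classes(label, n=3, cutoff=0.6):
    """Return candidate schema.org IRIs whose class name resembles `label`."""
    by_key = {name.lower(): name for name in SCHEMA_ORG_CLASSES}
    key = label.lower().replace(" ", "")
    hits = difflib.get_close_matches(key, list(by_key), n=n, cutoff=cutoff)
    return [f"https://schema.org/{by_key[hit]}" for hit in hits]


gas = suggest_classes("Gas station")
museums = suggest_classes("Museums")
nothing = suggest_classes("xyzzy")
```

String similarity only works when the source label is in English or close to it, so Portuguese labels such as "Postos de Abastecimento" would need a translation step or the Wikidata route first.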

Parse opening hours

The CitySDK time column is iCalendar (BEGIN:VCALENDAR with RRULE). Map to schema.org openingHoursSpecification for a richer canonical record. Optional, but useful for tourism-facing apps.
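
A first pass at this mapping might look like the sketch below, which reads only BYDAY and the clock times of DTSTART and DTEND with regular expressions. It is not a full iCalendar parser, does no timezone conversion (the source times carry a Z suffix), and ignores the date parts. opening_hours is a hypothetical helper; the input is the time value from the sample row.

```python
import re

DAY_IRI = {
    "MO": "https://schema.org/Monday",
    "TU": "https://schema.org/Tuesday",
    "WE": "https://schema.org/Wednesday",
    "TH": "https://schema.org/Thursday",
    "FR": "https://schema.org/Friday",
    "SA": "https://schema.org/Saturday",
    "SU": "https://schema.org/Sunday",
}


def opening_hours(ical):
    """Build a schema.org OpeningHoursSpecification dict, or None if fields are missing."""
    days = re.search(r"BYDAY=([A-Z,]+)", ical)
    start = re.search(r"DTSTART:\d{8}T(\d{2})(\d{2})", ical)
    end = re.search(r"DTEND:\d{8}T(\d{2})(\d{2})", ical)
    if not (days and start and end):
        return None
    return {
        "@type": "OpeningHoursSpecification",
        "dayOfWeek": [DAY_IRI[d] for d in days.group(1).split(",") if d in DAY_IRI],
        # Times are copied as-is from the source (UTC), without conversion.
        "opens": f"{start.group(1)}:{start.group(2)}",
        "closes": f"{end.group(1)}:{end.group(2)}",
    }


ical = (
    "BEGIN:VCALENDAR VERSION:2.0 BEGIN:VEVENT SUMMARY: DTSTART:20190101T203000Z "
    "DTEND:50190115T010000Z DESCRIPTION: LOCATION: "
    "RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=MO,TU,WE,TH,FR,SA END:VEVENT END:VCALENDAR"
)
spec = opening_hours(ical)
```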