Team LESL: working notes on harmonizing Points of Interest
across municipal departments and external sources, using Dolfin as
the canonical pivot, schema.org and Wikidata as the category
backbone, and the existing Smart Data Models PointOfInterest
as the production target.
Two CitySDK CSV exports from Porto Digital: 4 Casas de Fado in the historic centre, and 54 Postos de Abastecimento (petrol stations) across the city. Same schema, two very different categories, which makes the second set a good check of the canonical model.
Multilingual fields (category, label,
description, others) are exported as
single-quoted Python list-of-dict strings, not JSON. Parsed
via ast.literal_eval in the adapter.
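As a minimal sketch of that parsing step (the helper name `parse_multilingual` is ours for illustration, not the adapter's), the cell below is the shape found in the `category` column. `json.loads` would reject it because of the single quotes, while `ast.literal_eval` accepts it as a Python literal without executing any code.

```python
import ast

# Cell content shaped like the `category` column of the CitySDK export.
raw = "[{'lang': 'pt-PT', 'value': 'Casas de Fado'}, {'lang': 'en-GB', 'value': 'Fado houses'}]"


def parse_multilingual(cell):
    """Parse a single-quoted list-of-dict cell; return [] on empty or malformed input."""
    if not cell:
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only dict entries, anything else is ignored.
    return [entry for entry in parsed if isinstance(entry, dict)]


entries = parse_multilingual(raw)
labels = {entry["lang"]: entry["value"] for entry in entries}
print(labels)
```

Returning an empty list on malformed input keeps one bad cell from aborting a whole import; whether to log or count such rows is a choice left to the adapter.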
The address column is a flattened vCard with
newlines collapsed to spaces. Street, locality, postal code,
country are positional in ADR;WORK. Phone,
website, email come from TEL, URL,
EMAIL. The adapter rebuilds field boundaries by
recognising the vCard keyword set.
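The idea can be sketched as follows, under stated assumptions: the sample blob is reassembled by us from the field values shown in the mapping table (the real export is longer), and `split_vcard` is a hypothetical, simplified helper rather than the adapter's exact regex. Each known keyword followed by a colon marks the start of a field, and the value runs up to the next keyword.

```python
import re

VCARD_KEYS = ("BEGIN", "VERSION", "REV", "N", "FN", "ORG", "ADR", "TEL", "URL", "EMAIL", "END")
# A field starts with a known keyword, optional ;PARAMS, then a colon.
KEY_RE = re.compile(r"(?:^|\s)(" + "|".join(VCARD_KEYS) + r")(?:;[^:\s]+)?:")


def split_vcard(blob):
    """Split a space-collapsed vCard into {KEY: value}, keeping the first value per key."""
    fields = {}
    matches = list(KEY_RE.finditer(blob))
    for i, m in enumerate(matches):
        # The value ends where the next keyword starts (or at the end of the blob).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(blob)
        fields.setdefault(m.group(1), blob[m.end():end].strip())
    return fields


# Illustrative sample assembled from the values in the mapping table.
sample = (
    "BEGIN:VCARD VERSION:2.1 ORG:O Fado "
    "ADR;WORK:;16-16A;Largo de S. João Novo;Porto;Porto;4050-554;Portugal "
    "TEL:+351 222026937 URL:www.ofado.com EMAIL:info@ofado.com END:VCARD"
)
fields = split_vcard(sample)
# ADR parts are positional: P.O. box; extended address; street; locality; region; postal code; country.
po_box, ext, street, locality, region, postal, country = fields["ADR"].split(";")
print(street, ext, postal, fields["TEL"])
```

A value that itself contains a keyword followed by a colon (for example a URL field with an unusual scheme) would be split incorrectly, so this approach relies on the keyword set staying small and specific.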
Each name and description maps cleanly to a JSON-LD
language-tagged value ({"@language": "pt", "@value": "..."}).
The canonical model holds them as a list of LocalizedText,
which serializes to RDF correctly out of the box.
The Casas de Fado set has 1 category, mapped to
schema:Restaurant + schema:MusicVenue
+ Wikidata Q3338148. The petrol stations set adds
one more, schema:GasStation + Wikidata
Q205495. Each new category is one entry in
category_map.json.
The petrol stations CSV uses the same CitySDK schema as the
Casas de Fado set. We added three lines to
category_map.json (one per language label
for "Postos de Abastecimento" / "Petrol station" / "Gasolineras")
and re-ran the harmonizer. 54 of 54 records harmonized,
no new code, no re-test, no schema migration. This is the
canonical-model-plus-adapter approach paying off.
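For illustration, the three added entries could look like the sketch below, written as a Python dict rather than the JSON file itself. The IRIs follow the schema:GasStation and Q205495 mapping stated above; the `resolve` helper is a hypothetical stand-in for the adapter's lookup, with schema:Place as the fallback for unmapped labels.

```python
import json

PETROL_STATION = {
    "schemaOrgRefs": ["https://schema.org/GasStation"],
    "wikidataRef": "https://www.wikidata.org/entity/Q205495",
}
# One entry per language label, all pointing at the same anchors.
category_map = {
    label: PETROL_STATION
    for label in ("Postos de Abastecimento", "Petrol station", "Gasolineras")
}


def resolve(label, mapping):
    """Look up a source label; unmapped labels fall back to the generic schema:Place."""
    entry = mapping.get(label, {})
    return {
        "sourceLabel": label,
        "schemaOrgRefs": entry.get("schemaOrgRefs", ["https://schema.org/Place"]),
        "wikidataRef": entry.get("wikidataRef"),
    }


print(json.dumps(category_map, ensure_ascii=False, indent=2))
print(resolve("Postos de Abastecimento", category_map))
```

Because the source label is always preserved, a record that fell back to schema:Place can be found and re-mapped later without re-reading the source file.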
Unlike use case 01 where Tree was a gap, the SDM
dataModel.PointOfInterest
repository already defines a generic PointOfInterest
plus three specialisations: Museum,
Beach, Store.
We align our canonical model on SDM's PointOfInterest
attributes (name, category,
contactPoint, location,
capacity, priceRange) and pre-bind to
schema.org so the same canonical record can serialize either as
SDM NGSI-LD or as schema.org JSON-LD without re-mapping. Casa
de Fado is not in SDM as a sub-type, but doesn't need to be:
category alignment via schema.org and Wikidata is enough.
PointOfInterest, the entity shape consumed by NGSI-LD platforms
schema.org types (Restaurant, MusicVenue, Museum...) and Wikidata Q-IDs as semantic anchors, dereferenceable IRIs
sourceLabel ("Casas de Fado"), so consumers can audit the alignment trail
Dolfin defines the model independently of any serialization, and we derive JSON-LD (with schema.org context) and GeoJSON from it.
package <http://mimathon.askem.eu/uc2/pois>:
  dolfin_version "1"
  version "0.1.0"
  author "LESL (Lea, Eliott, Sattisvar & Louis), Dolfin & Askem"
  description "Canonical model for points of interest, MIMathon Porto 2026 use case 02. Multilingual names and descriptions, schema.org category alignment, structured address and contact, ready to feed Smart Data Models PointOfInterest."

concept Language:
  one of:
    pt
    en
    es
    fr
    de
    it
    other

concept LocalizedText:
  has lang: one Language
  has value: one string

concept Category:
  has sourceLabel: one string
  has schemaOrgRefs: at least 1 string
  has wikidataRef: optional string

concept PostalAddress:
  has streetName: optional string
  has streetNumber: optional string
  has locality: optional string
  has postalCode: optional string
  has country: optional string

concept ContactPoint:
  has telephone: optional string
  has email: optional string
  has website: optional string

concept Location:
  has latitude: one float
  has longitude: one float

concept PointOfInterest:
  has localId: one string
  has names: at least 1 LocalizedText
  has descriptions: LocalizedText
  has category: one Category
  has location: one Location
  has address: optional PostalAddress
  has contact: optional ContactPoint
  has capacity: optional int
  has costRating: optional int
LocalizedText as a first-class concept, so every multilingual literal carries its language tag explicitly
Category separate from the POI, with both the source label (audit) and a list of canonical IRIs (alignment)
schemaOrgRefs is at least 1, so every POI must commit to a category interpretation, even if minimal (schema:Place)
PostalAddress and ContactPoint use schema.org-aligned attribute names, so the JSON-LD context can map each to schema:streetAddress etc. without renaming

| Source attribute | Source example | Canonical attribute | Transformation |
|---|---|---|---|
| id | 5cd04b43f979e000013bee37 | PointOfInterest.localId | direct |
| label[term=primary] | {lang: pt-PT, value: "O Fado"} | PointOfInterest.names | ast.literal_eval, filter term=primary, dedupe by lang |
| description | {lang: pt-PT, value: "Situado num edifício..."} | PointOfInterest.descriptions | ast.literal_eval, dedupe by lang, first per lang |
| category | {lang: pt-PT, value: "Casas de Fado"} | PointOfInterest.category | ast.literal_eval, lookup pt-PT label in category_map.json |
| latitude, longitude | 41.142313, -8.617495 | Location.latitude/longitude | parse float, WGS84 |
| address (vCard ADR;WORK) | ;16-16A;Largo de S. João Novo;Porto;Porto;4050-554;Portugal | PostalAddress.* | regex split vCard fields, positional ADR parts |
| address (vCard TEL/URL/EMAIL) | +351 222026937 / www.ofado.com / info@ofado.com | ContactPoint.* | regex extract, take first email before / |
| others[type=x-citysdk/capacity] | 70 | PointOfInterest.capacity | parse int |
| others[type=x-citysdk/cost-rating] | 3 | PointOfInterest.costRating | parse int |
Same architecture as the trees harmonizer: a dataset-agnostic core, plus one adapter per source dataset. Onboarding a new POI dataset means writing one adapter file, no core change required.
python -m harmonize_pois \
--adapter porto_pois \
--input ../uc2-pois-casas-de-fado.csv \
--output ../out/pois.jsonld \
--geojson ../out/pois.geojson \
--base-id "http://mimathon.askem.eu/uc2/pois/"
Same workflow as UC1: copy adapters/_template.py,
fill in the read generator, and extend
category_map.json with any new source labels you
encounter.
Same record (O Fado, 5cd04b43...ee37), shown
raw from the Porto CitySDK CSV on the left, and harmonized to the
canonical model on the right.
Same JSON-LD vs GeoJSON split as UC1: JSON-LD for the data layer (semantic, dereferenceable category IRIs); GeoJSON for map tools with dotted-key flattened properties.
{
"active": "True",
"base": "https://city-api.wearebitmaker.com/CitySDK/pois",
"category": "[{'lang': 'pt-PT', 'value': 'Casas de Fado'}, {'lang': 'en-GB', 'value': 'Fado houses'}, {'lang': 'es-ES', 'value': 'Casas de Fado'}]",
"created": "2019-05-06T14:57:07.255000Z",
"id": "5cd04b43f979e000013bee37",
"label": "[{'lang': 'pt-PT', 'term': 'primary', 'value': 'O Fado'}, {'lang': 'en-GB', 'term': 'primary', 'value': 'O Fado'}, {'lang': 'es-ES', 'term': 'primary', 'value': 'O Fado'}]",
"lang": "pt-PT",
"address": "BEGIN:VCARD VERSION:2.1 REV:20190226T17:27:45Z N:O Fado;O Fado;;; FN:O Fado O Fado ORG:O Fado ADR;WORK:;16-16A;Largo de …",
"latitude": "41.14231309679815",
"longitude": "-8.61749513128281",
"time": "[{'term': 'open', 'type': 'text/icalendar', 'value': 'BEGIN:VCALENDAR VERSION:2.0 BEGIN:VEVENT SUMMARY: DTSTART:20190101T203000Z DTEND:50190115T010000Z DESCRIPTION: LOCATION: RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=MO,TU,WE,TH,FR,SA END:VEVENT END:VCALENDAR'}]",
"updated": "2021-07-14T15:43:36.376000Z",
"description": "[ … 3 multilingual entries truncated … ]",
"others": "[ … 17 typed key-value pairs truncated … ]"
}
{
"@id": "http://mimathon.askem.eu/uc2/pois/5cd04b43f979e000013bee37",
"@type": "PointOfInterest",
"localId": "5cd04b43f979e000013bee37",
"names": [
{
"@language": "pt",
"@value": "O Fado"
},
{
"@language": "en",
"@value": "O Fado"
},
{
"@language": "es",
"@value": "O Fado"
}
],
"descriptions": [
{
"@language": "pt",
"@value": "Situado num edifício centenário o Restaurante Típico o Fado preserva todo o tipicismo inerente a este tipo de casas, ond…"
},
{
"@language": "en",
"@value": "Typical restaurant serving regional cuisine, and where you can appreciate the traditional \"fado\".…"
},
{
"@language": "es",
"@value": "Restaurante típico de cocina regional, donde se puede desfrutar del tradicional fado.…"
}
],
"category": {
"@type": "Category",
"sourceLabel": "Casas de Fado",
"schemaOrgRefs": [
"https://schema.org/Restaurant",
"https://schema.org/MusicVenue"
],
"wikidataRef": "https://www.wikidata.org/entity/Q3338148"
},
"location": {
"@type": "Location",
"latitude": 41.14231309679815,
"longitude": -8.61749513128281
},
"capacity": 70,
"costRating": 3,
"address": {
"@type": "PostalAddress",
"streetName": "Largo de S. João Novo",
"streetNumber": "16-16A",
"locality": "Porto",
"postalCode": "4050-554",
"country": "Portugal"
},
"contact": {
"@type": "ContactPoint",
"telephone": "+351 222026937",
"email": "info@ofado.com",
"website": "www.ofado.com"
}
}
One POI as a property graph. Multilingual literals are RDF
language-tagged values; the Category node carries
both the source label and the canonical schema.org / Wikidata
anchors, dereferenceable IRIs.
Language-tagged names and descriptions ({"@language": "pt", "@value": "..."})
schema:Restaurant + schema:MusicVenue + Wikidata Q3338148, source label preserved
PostalAddress with streetName, locality, postalCode, country
ContactPoint with telephone, website, email
x-citysdk/capacity and x-citysdk/cost-rating promoted to first-class POI attributes
@id assigned, every fragment carries an @type
Full source of the POI harmonizer, hosted alongside this page.
Each file is a one-click download as raw .py or
.json. The whole package is bundled as a tarball at
the top of the Data section.
"""Canonical POI model, mirrors pois.dolfin v0.1.0."""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

LANGUAGES = {"pt", "en", "es", "fr", "de", "it", "other"}


def normalize_lang(raw: str | None) -> str:
    if not raw:
        return "other"
    raw = raw.lower().split("-")[0]
    return raw if raw in LANGUAGES else "other"


@dataclass(frozen=True)
class LocalizedText:
    lang: str
    value: str


@dataclass(frozen=True)
class Category:
    sourceLabel: str
    schemaOrgRefs: tuple[str, ...]
    wikidataRef: Optional[str] = None


@dataclass(frozen=True)
class PostalAddress:
    streetName: Optional[str] = None
    streetNumber: Optional[str] = None
    locality: Optional[str] = None
    postalCode: Optional[str] = None
    country: Optional[str] = None


@dataclass(frozen=True)
class ContactPoint:
    telephone: Optional[str] = None
    email: Optional[str] = None
    website: Optional[str] = None


@dataclass(frozen=True)
class Location:
    latitude: float
    longitude: float


@dataclass
class PointOfInterest:
    localId: str
    names: list[LocalizedText]
    category: Category
    location: Location
    descriptions: list[LocalizedText] = field(default_factory=list)
    address: Optional[PostalAddress] = None
    contact: Optional[ContactPoint] = None
    capacity: Optional[int] = None
    costRating: Optional[int] = None
"""Reusable text transforms shared across adapters.
Adapters compose these helpers rather than reimplementing them. Helpers
are intentionally minimal: they only do generic text work (cleanup,
regex extraction, keyword routing). Anything dataset-specific belongs
in the adapter itself.
"""
from __future__ import annotations

import re
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Trim, collapse internal whitespace, return None for empty input."""
    if value is None:
        return None
    txt = re.sub(r"\s+", " ", str(value)).strip()
    return txt or None


def extract_count(value: Optional[str], pattern: str = r"\((\d+)") -> Optional[int]:
    """Pull an integer out of free text, e.g. '... (12 exemplares)' -> 12."""
    if value is None:
        return None
    m = re.search(pattern, value)
    return int(m.group(1)) if m else None


def match_keywords(value: Optional[str], keyword_map: dict[str, str]) -> Optional[str]:
    """Return the first enum value whose regex key matches the input.

    keyword_map: {regex_pattern: enum_value}, e.g.
        {r"conjunto\\s+arb[óo]re[op]": "TreeCluster",
         r"isolad": "IsolatedSpecimen"}

    Patterns are evaluated in insertion order, case-insensitive.
    """
    if not value:
        return None
    for pattern, enum_value in keyword_map.items():
        if re.search(pattern, value, re.IGNORECASE):
            return enum_value
    return None


class Registry:
    """Tiny dedupe registry for value-typed entities like Authority.

    Use when source data has many spelling variants of the same entity:
        reg = Registry({"ICNF": Authority(name="...", acronym="ICNF")})
        a = reg.resolve("ICNF (Instituto da Conservação ...)", needle="ICNF")

    The canonical instance is returned, ensuring downstream graphs share
    one node per real-world entity.
    """

    def __init__(self, known: dict | None = None):
        self._known = dict(known or {})

    def resolve(self, raw, needle: str | None = None, default=None):
        if raw is None:
            return default
        text = str(raw)
        if needle is not None and needle in text and needle in self._known:
            return self._known[needle]
        for key, val in self._known.items():
            if key in text:
                return val
        return default

    def get(self, key: str):
        return self._known.get(key)
"""JSON-LD writer for the canonical POI model.
Aligns with schema.org for shared terms (name, description, address,
category) so the output is directly consumable by schema.org-aware
tools, while keeping a local namespace for the wrapping shape.
"""
from __future__ import annotations

from dataclasses import asdict
from typing import Iterable

from .model import PointOfInterest

NS = "http://mimathon.askem.eu/uc2/pois#"

CONTEXT = {
    "@vocab": NS,
    "schema": "https://schema.org/",
    "wd": "https://www.wikidata.org/entity/",
    "PointOfInterest": NS + "PointOfInterest",
    "Category": NS + "Category",
    "Location": NS + "Location",
    "PostalAddress": "schema:PostalAddress",
    "ContactPoint": "schema:ContactPoint",
    "LocalizedText": NS + "LocalizedText",
    "names": "schema:name",
    "descriptions": "schema:description",
    "category": "schema:category",
    "location": "schema:location",
    "address": "schema:address",
    "contact": "schema:contactPoint",
    "telephone": "schema:telephone",
    "email": "schema:email",
    "website": "schema:url",
    "streetName": "schema:streetAddress",
    "streetNumber": "schema:streetAddress",
    "locality": "schema:addressLocality",
    "postalCode": "schema:postalCode",
    "country": "schema:addressCountry",
    "schemaOrgRefs": {"@id": NS + "schemaOrgRefs", "@type": "@id"},
    "wikidataRef": {"@id": NS + "wikidataRef", "@type": "@id"},
    # The W3C Basic Geo (WGS84) namespace IRI uses the http scheme.
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "latitude": "geo:lat",
    "longitude": "geo:long",
    "lang": "@language",
    "value": "@value",
    "capacity": "schema:maximumAttendeeCapacity",
}


def _strip_none(d):
    # Recursively drop None values and empty lists from nested dicts.
    if isinstance(d, dict):
        return {k: _strip_none(v) for k, v in d.items() if v is not None and v != []}
    if isinstance(d, list):
        return [_strip_none(x) for x in d]
    return d


def _localized(items) -> list[dict]:
    return [{"@language": t.lang, "@value": t.value} for t in items]


def poi_to_node(poi: PointOfInterest, base_id: str) -> dict:
    node = {
        "@id": f"{base_id}{poi.localId}",
        "@type": "PointOfInterest",
        "localId": poi.localId,
        "names": _localized(poi.names),
        "descriptions": _localized(poi.descriptions),
        "category": {
            "@type": "Category",
            "sourceLabel": poi.category.sourceLabel,
            "schemaOrgRefs": list(poi.category.schemaOrgRefs),
            "wikidataRef": poi.category.wikidataRef,
        },
        "location": {
            "@type": "Location",
            "latitude": poi.location.latitude,
            "longitude": poi.location.longitude,
        },
        "capacity": poi.capacity,
        "costRating": poi.costRating,
    }
    if poi.address is not None:
        node["address"] = {"@type": "PostalAddress", **asdict(poi.address)}
    if poi.contact is not None:
        node["contact"] = {"@type": "ContactPoint", **asdict(poi.contact)}
    return _strip_none(node)


def build_document(pois: Iterable[PointOfInterest], base_id: str) -> dict:
    return {
        "@context": CONTEXT,
        "@graph": [poi_to_node(p, base_id) for p in pois],
    }
"""GeoJSON writer for the canonical POI model.
FeatureCollection in WGS84. Properties are flattened with dotted keys
so non-LD tools (geojson.io, QGIS, Leaflet) can use the data directly.
"""
from __future__ import annotations

from dataclasses import asdict
from typing import Iterable

from .model import PointOfInterest


def _flatten(prefix: str, value, target: dict) -> None:
    if value is None or value == []:
        return
    if isinstance(value, dict):
        for k, v in value.items():
            _flatten(f"{prefix}.{k}" if prefix else k, v, target)
    elif isinstance(value, (list, tuple)):
        if value and isinstance(value[0], dict) and {"lang", "value"} <= set(value[0]):
            # Localized lists become one key per language, e.g. name_pt, name_en.
            for item in value:
                target[f"{prefix}_{item['lang']}"] = item["value"]
        else:
            target[prefix] = list(value)
    else:
        target[prefix] = value


def poi_to_feature(poi: PointOfInterest, base_id: str) -> dict:
    props: dict = {"@id": f"{base_id}{poi.localId}", "@type": "PointOfInterest"}
    props["localId"] = poi.localId
    _flatten("name", [{"lang": t.lang, "value": t.value} for t in poi.names], props)
    _flatten("description", [{"lang": t.lang, "value": t.value} for t in poi.descriptions], props)
    _flatten("category", {
        "sourceLabel": poi.category.sourceLabel,
        "schemaOrgRefs": list(poi.category.schemaOrgRefs),
        "wikidataRef": poi.category.wikidataRef,
    }, props)
    if poi.address is not None:
        _flatten("address", asdict(poi.address), props)
    if poi.contact is not None:
        _flatten("contact", asdict(poi.contact), props)
    if poi.capacity is not None:
        props["capacity"] = poi.capacity
    if poi.costRating is not None:
        props["costRating"] = poi.costRating
    return {
        "type": "Feature",
        "id": poi.localId,
        "geometry": {
            "type": "Point",
            # GeoJSON coordinate order is longitude, latitude.
            "coordinates": [poi.location.longitude, poi.location.latitude],
        },
        "properties": props,
    }


def build_collection(pois: Iterable[PointOfInterest], base_id: str) -> dict:
    return {
        "type": "FeatureCollection",
        "features": [poi_to_feature(p, base_id) for p in pois],
    }
"""CLI entry point for the POI harmonizer.
Example:
python -m harmonize_pois \
--adapter porto_pois \
--input ../uc2-pois-casas-de-fado.csv \
--output ../out/pois.jsonld \
--geojson ../out/pois.geojson \
--base-id http://mimathon.askem.eu/uc2/pois/
"""
from __future__ import annotations

import argparse
import importlib
import json
import sys
from pathlib import Path

from .geojson_out import build_collection
from .jsonld import build_document


def _load_adapter(name: str):
    mod = importlib.import_module(f"harmonize_pois.adapters.{name}")
    if not hasattr(mod, "read"):
        raise SystemExit(f"adapter {name!r} has no read(path) function")
    return mod


def main(argv=None) -> int:
    p = argparse.ArgumentParser(
        prog="harmonize_pois",
        description="Harmonize a POI dataset to the canonical PointOfInterest model and emit JSON-LD plus optional GeoJSON.",
    )
    p.add_argument("--adapter", required=True, help="Adapter module name under harmonize_pois.adapters, e.g. porto_pois")
    p.add_argument("--input", required=True, type=Path, help="Source dataset path")
    p.add_argument("--output", required=True, type=Path, help="Destination JSON-LD file")
    p.add_argument("--base-id", default="http://example.org/pois/", help="IRI prefix for POI @id values")
    p.add_argument("--geojson", type=Path, help="Also emit a GeoJSON FeatureCollection")
    args = p.parse_args(argv)

    adapter = _load_adapter(args.adapter)
    print(f"Reading via adapter '{args.adapter}' from {args.input}...")
    pois = list(adapter.read(args.input))
    print(f" {len(pois)} POIs read")

    print(f"Writing JSON-LD to {args.output}...")
    doc = build_document(pois, base_id=args.base_id)
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f" done, {len(doc['@graph'])} entities in @graph")

    if args.geojson:
        print(f"Writing GeoJSON to {args.geojson}...")
        fc = build_collection(pois, base_id=args.base_id)
        args.geojson.parent.mkdir(parents=True, exist_ok=True)
        args.geojson.write_text(json.dumps(fc, ensure_ascii=False, indent=2), encoding="utf-8")
        print(f" done, {len(fc['features'])} features")
    return 0


if __name__ == "__main__":
    sys.exit(main())
{
"Casas de Fado": {
"schemaOrgRefs": [
"https://schema.org/Restaurant",
"https://schema.org/MusicVenue"
],
"wikidataRef": "https://www.wikidata.org/entity/Q3338148"
},
"Fado houses": {
"schemaOrgRefs": [
"https://schema.org/Restaurant",
"https://schema.org/MusicVenue"
],
"wikidataRef": "https://www.wikidata.org/entity/Q3338148"
},
"Museum": {
"schemaOrgRefs": ["https://schema.org/Museum"],
"wikidataRef": "https://www.wikidata.org/entity/Q33506"
},
"Restaurant": {
"schemaOrgRefs": ["https://schema.org/Restaurant"],
"wikidataRef": "https://www.wikidata.org/entity/Q11707"
},
"Park": {
"schemaOrgRefs": ["https://schema.org/Park"],
"wikidataRef": "https://www.wikidata.org/entity/Q22698"
},
"Hotel": {
"schemaOrgRefs": ["https://schema.org/Hotel"],
"wikidataRef": "https://www.wikidata.org/entity/Q27686"
},
"Beach": {
"schemaOrgRefs": ["https://schema.org/Beach"],
"wikidataRef": "https://www.wikidata.org/entity/Q40080"
}
}
"""Skeleton POI adapter, copy and rename to add a new dataset.
Quick start:
1. Copy to harmonize_pois/adapters/<your_dataset>.py
2. Replace the read() body with your own parsing
3. Run: python -m harmonize_pois --adapter <your_dataset> --input ...
Contract:
Expose a single function `read(path) -> Iterator[PointOfInterest]`.
Look at adapters/porto_pois.py for a worked example covering CitySDK
multilingual fields, vCard address parsing, and category lookup.
"""
from __future__ import annotations

from pathlib import Path
from typing import Iterator

from ..model import (
    Category, ContactPoint, Location, LocalizedText, PointOfInterest,
    PostalAddress, normalize_lang,
)
from ..transforms import clean_text


def read(path: str | Path) -> Iterator[PointOfInterest]:
    """Yield canonical PointOfInterest records from `path`.

    Replace the body below with your dataset's parsing logic.
    """
    # Example: a CSV with one POI per row
    # import csv
    # with Path(path).open(encoding="utf-8") as f:
    #     for row in csv.DictReader(f):
    #         yield PointOfInterest(
    #             localId=row["id"],
    #             names=[LocalizedText(lang=normalize_lang("en"), value=row["name"])],
    #             category=Category(
    #                 sourceLabel=row["category"],
    #                 schemaOrgRefs=("https://schema.org/Place",),
    #             ),
    #             location=Location(
    #                 latitude=float(row["lat"]),
    #                 longitude=float(row["lon"]),
    #             ),
    #         )
    raise NotImplementedError("Implement read() for your dataset, see porto_pois.py")
"""Adapter for the Porto Open Data Casas de Fado CSV (CitySDK schema).
Source quirks handled here:
- The CSV uses Python dict-literal strings (single-quoted) inside
multilingual fields, parsed via ast.literal_eval.
- Address is a vCard 2.1 blob, ADR;WORK fields are extracted
positionally per the spec (P.O. box; ext addr; street; locality;
region; postal code; country).
- The 'others' field is a list of {type, value} where type is a
namespaced key like x-citysdk/capacity, x-citysdk/cost-rating, etc.
Maps to PointOfInterest via category_map.json lookup for schema.org IRIs.
"""
from __future__ import annotations

import ast
import csv
import json
import re
from pathlib import Path
from typing import Iterator

from ..model import (
    Category, ContactPoint, Location, LocalizedText, PointOfInterest,
    PostalAddress, normalize_lang,
)
from ..transforms import clean_text


def _load_category_map() -> dict:
    p = Path(__file__).resolve().parent.parent / "category_map.json"
    return json.loads(p.read_text(encoding="utf-8"))


_CATEGORY_MAP = _load_category_map()


def _safe_literal(raw: str | None):
    if not raw:
        return []
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []


def _localized_list(
    raw: str | None,
    key_lang: str = "lang",
    key_val: str = "value",
    filter_term: str | None = None,
) -> list[LocalizedText]:
    out: list[LocalizedText] = []
    seen_langs: set[str] = set()
    for entry in _safe_literal(raw):
        if not isinstance(entry, dict):
            continue
        if filter_term and entry.get("term") != filter_term:
            continue
        lang = normalize_lang(entry.get(key_lang))
        val = clean_text(entry.get(key_val))
        # Keep the first non-empty value per normalized language.
        if not val or lang in seen_langs:
            continue
        seen_langs.add(lang)
        out.append(LocalizedText(lang=lang, value=val))
    return out


def _resolve_category(raw: str | None) -> Category:
    items = _safe_literal(raw)
    pt_label = None
    for item in items:
        if isinstance(item, dict) and item.get("lang") == "pt-PT":
            pt_label = item.get("value")
            break
    label = pt_label or (items[0].get("value") if items and isinstance(items[0], dict) else "Unknown")
    label = clean_text(label) or "Unknown"
    mapped = _CATEGORY_MAP.get(label, {})
    return Category(
        sourceLabel=label,
        schemaOrgRefs=tuple(mapped.get("schemaOrgRefs", ["https://schema.org/Place"])),
        wikidataRef=mapped.get("wikidataRef"),
    )


_VCARD_KEYS = ("BEGIN", "VERSION", "REV", "N", "FN", "ORG", "ADR", "TEL", "URL", "EMAIL", "END")
_VCARD_FIELD = re.compile(
    r"(?:^|\s)(" + "|".join(_VCARD_KEYS) + r")(?:;[^:]+)?:(.*?)(?=\s(?:"
    + "|".join(_VCARD_KEYS) + r")(?:;[^:]+)?:|$)",
    re.DOTALL,
)


def _parse_vcard(vcard: str) -> dict[str, str]:
    out: dict[str, str] = {}
    for m in _VCARD_FIELD.finditer(vcard or ""):
        key = m.group(1).upper()
        val = m.group(2).strip()
        if key not in out:
            out[key] = val
    return out


def _parse_vcard_address(vcard: str) -> PostalAddress | None:
    fields = _parse_vcard(vcard)
    raw = fields.get("ADR")
    if not raw:
        return None
    parts = raw.split(";")
    while len(parts) < 7:
        parts.append("")
    _po, _ext, street, locality, _region, postal, country = parts[:7]
    return PostalAddress(
        streetName=clean_text(street),
        # In this dataset the extended-address slot carries the street number.
        streetNumber=clean_text(_ext),
        locality=clean_text(locality),
        postalCode=clean_text(postal),
        country=clean_text(country),
    )


def _parse_vcard_contact(vcard: str) -> ContactPoint | None:
    fields = _parse_vcard(vcard)
    if not fields:
        return None
    email_raw = fields.get("EMAIL", "")
    email = email_raw.split("/")[0].split(",")[0].strip() if email_raw else None
    cp = ContactPoint(
        telephone=clean_text(fields.get("TEL")),
        website=clean_text(fields.get("URL")),
        email=clean_text(email),
    )
    return cp if any([cp.telephone, cp.website, cp.email]) else None


def _extract_others(raw: str | None) -> dict[str, list[str]]:
    out: dict[str, list[str]] = {}
    for item in _safe_literal(raw):
        if isinstance(item, dict):
            t = item.get("type")
            v = item.get("value")
            if t and v is not None:
                out.setdefault(t, []).append(str(v))
    return out


def _safe_int(s) -> int | None:
    if s is None:
        return None
    try:
        return int(s)
    except (ValueError, TypeError):
        return None


def read(csv_path: str | Path) -> Iterator[PointOfInterest]:
    """Yield canonical PointOfInterest records from a Porto CitySDK CSV."""
    with Path(csv_path).open(encoding="utf-8") as f:
        reader = csv.DictReader(f, quotechar="'")
        for row in reader:
            others = _extract_others(row.get("others"))
            try:
                lat = float(row["latitude"])
                lon = float(row["longitude"])
            except (KeyError, ValueError, TypeError):
                # Skip rows without usable coordinates.
                continue
            names = _localized_list(row.get("label"), key_lang="lang", key_val="value", filter_term="primary")
            if not names:
                continue
            poi = PointOfInterest(
                localId=row.get("id") or "",
                names=names,
                descriptions=_localized_list(row.get("description")),
                category=_resolve_category(row.get("category")),
                location=Location(latitude=lat, longitude=lon),
                address=_parse_vcard_address(row.get("address") or ""),
                contact=_parse_vcard_contact(row.get("address") or ""),
                capacity=_safe_int(others.get("x-citysdk/capacity", [None])[0]),
                costRating=_safe_int(others.get("x-citysdk/cost-rating", [None])[0]),
            )
            yield poi
Same goal as UC1: a declarative YAML mapping for simple cases (just renames + enum lookups + per-language picks) and a harmonize_pois new-adapter --name <city> scaffolder. Most POI imports are nearly identical apart from field names; the long tail is the messy parts (vCard, iCalendar, Python dict literals).
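A rough sketch of what such a declarative mapping might look like once loaded, under stated assumptions: the rule keys (`column`, `cast`, `enum`, `default`, `lang`), the `apply_mapping` function, and the sample column names are all hypothetical, and the spec is written as a Python dict standing in for the parsed YAML file.

```python
# Hypothetical mapping spec; in practice this dict would be loaded from a YAML file.
MAPPING = {
    "localId": {"column": "id"},
    "name": {"column": "title", "lang": "en"},
    "category": {
        "column": "kind",
        "enum": {"museum": "https://schema.org/Museum", "park": "https://schema.org/Park"},
        "default": "https://schema.org/Place",
    },
    "latitude": {"column": "lat", "cast": "float"},
    "longitude": {"column": "lon", "cast": "float"},
}
CASTS = {"float": float, "int": int}


def apply_mapping(row, mapping):
    """Apply rename, cast, enum-lookup and language-tag rules to one source row."""
    out = {}
    for target, rule in mapping.items():
        value = row.get(rule["column"])
        if value is None or value == "":
            continue
        if "cast" in rule:
            value = CASTS[rule["cast"]](value)
        if "enum" in rule:
            value = rule["enum"].get(value.strip().lower(), rule.get("default"))
        if "lang" in rule:
            value = {"lang": rule["lang"], "value": value}
        out[target] = value
    return out


row = {"id": "42", "title": "City Museum", "kind": "Museum", "lat": "41.15", "lon": "-8.61"}
record = apply_mapping(row, MAPPING)
print(record)
```

Anything the rule vocabulary cannot express (vCard, iCalendar, Python dict literals) would still go through a hand-written adapter.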
The Postos de Abastecimento set was onboarded without writing any code (same CitySDK adapter, three new entries in category_map.json). The next test is a dataset that does not match the CitySDK shape, e.g. an OpenStreetMap export, a Google Places dump, or another city's portal. That will exercise the adapter pluggability for real and surface any canonical-model gap.
Translate the canonical record into NGSI-LD entities conformant to dataModel.PointOfInterest: pick a primary language for name, fold the rest into description, set category to the source label, attach contactPoint and location as references. Validate with the SDM tooling.
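One way this translation might look is sketched below, as a starting point rather than a validated SDM serialization. The `urn:ngsi-ld:PointOfInterest:` identifier pattern, the function name `to_ngsi_ld`, the way descriptions are folded together, and the shortened description strings in the sample are our assumptions. The attribute shapes should be checked against the dataModel.PointOfInterest schema, and the `@context` is deliberately left out.

```python
def to_ngsi_ld(poi, primary_lang="pt"):
    """Build a normalized NGSI-LD entity dict from one harmonized JSON-LD node."""
    names = {n["@language"]: n["@value"] for n in poi["names"]}
    name = names.get(primary_lang) or next(iter(names.values()))
    # Primary language first, then the rest (sorted() is stable).
    ordered = sorted(poi.get("descriptions", []), key=lambda d: d["@language"] != primary_lang)
    entity = {
        "id": f"urn:ngsi-ld:PointOfInterest:{poi['localId']}",  # assumed URN pattern
        "type": "PointOfInterest",
        "name": {"type": "Property", "value": name},
        "category": {"type": "Property", "value": [poi["category"]["sourceLabel"]]},
        "location": {
            "type": "GeoProperty",
            "value": {
                "type": "Point",
                # GeoJSON order: longitude first.
                "coordinates": [poi["location"]["longitude"], poi["location"]["latitude"]],
            },
        },
    }
    if ordered:
        text = " ".join(f"[{d['@language']}] {d['@value']}" for d in ordered)
        entity["description"] = {"type": "Property", "value": text}
    if "contact" in poi:
        contact = {k: v for k, v in poi["contact"].items() if not k.startswith("@")}
        entity["contactPoint"] = {"type": "Property", "value": contact}
    return entity


sample = {
    "localId": "5cd04b43f979e000013bee37",
    "names": [{"@language": "pt", "@value": "O Fado"}, {"@language": "en", "@value": "O Fado"}],
    # Shortened placeholder descriptions, not the full source text.
    "descriptions": [
        {"@language": "en", "@value": "Typical restaurant."},
        {"@language": "pt", "@value": "Restaurante típico."},
    ],
    "category": {"sourceLabel": "Casas de Fado"},
    "location": {"latitude": 41.14231309679815, "longitude": -8.61749513128281},
    "contact": {"@type": "ContactPoint", "telephone": "+351 222026937", "email": "info@ofado.com"},
}
entity = to_ngsi_ld(sample)
print(entity["id"], entity["name"]["value"])
```

The schema.org and Wikidata anchors are dropped here because the text above sets category to the source label; carrying them as an extra property would be a separate decision.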
For datasets with hundreds of categories, hand-curating category_map.json does not scale. A SPARQL probe against Wikidata, a fuzzy match against schema.org class labels, or an LLM-assisted draft with human review would all be reasonable. Each entry should remain explicit and auditable.
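The fuzzy-match option could be prototyped with the standard library alone. The sketch below uses `difflib.get_close_matches` against a tiny hand-picked subset of schema.org class names (a real run would load the full vocabulary); the `suggest` helper and the 0.6 cutoff are illustrative choices, and the output is meant as a draft for human review, not a final mapping.

```python
import difflib

# Tiny illustrative subset of schema.org class names.
SCHEMA_CLASSES = ["Restaurant", "MusicVenue", "Museum", "GasStation", "Park", "Hotel", "Beach", "Store"]


def suggest(label, classes=SCHEMA_CLASSES, cutoff=0.6, n=3):
    """Return candidate schema.org IRIs for a source label, best match first."""
    by_lower = {c.lower(): c for c in classes}
    # Normalize the label the same way as the class names: lowercase, no spaces.
    key = label.lower().replace(" ", "")
    hits = difflib.get_close_matches(key, list(by_lower), n=n, cutoff=cutoff)
    return [f"https://schema.org/{by_lower[h]}" for h in hits]


for source_label in ("Museums", "Gas station", "Casas de Fado"):
    print(source_label, "->", suggest(source_label))
```

String similarity only helps when the source label is already close to an English class name; labels such as "Casas de Fado" would still need the Wikidata probe or a manual entry.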
The CitySDK time column is iCalendar (BEGIN:VCALENDAR with RRULE). Map to schema.org openingHoursSpecification for a richer canonical record. Optional, but useful for tourism-facing apps.
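A minimal sketch of that mapping, assuming a single VEVENT with a weekly RRULE as in the raw record shown earlier; the `opening_hours` helper is hypothetical. It reads only the time-of-day part of DTSTART and DTEND and the BYDAY list, and it copies the times as written (they are marked `Z` in the source), with no timezone conversion.

```python
import re

DAYS = {
    "MO": "Monday", "TU": "Tuesday", "WE": "Wednesday", "TH": "Thursday",
    "FR": "Friday", "SA": "Saturday", "SU": "Sunday",
}


def opening_hours(ical):
    """Extract a schema.org-style OpeningHoursSpecification dict from a flattened VEVENT."""
    start = re.search(r"DTSTART:\d{8}T(\d{2})(\d{2})(\d{2})", ical)
    end = re.search(r"DTEND:\d{8}T(\d{2})(\d{2})(\d{2})", ical)
    byday = re.search(r"BYDAY=([A-Z,]+)", ical)
    if not (start and end and byday):
        return None
    return {
        "@type": "OpeningHoursSpecification",
        "dayOfWeek": [f"https://schema.org/{DAYS[d]}" for d in byday.group(1).split(",") if d in DAYS],
        "opens": ":".join(start.groups()),
        "closes": ":".join(end.groups()),
    }


# Value of the `time` column from the O Fado record.
ical = (
    "BEGIN:VCALENDAR VERSION:2.0 BEGIN:VEVENT SUMMARY: DTSTART:20190101T203000Z "
    "DTEND:50190115T010000Z DESCRIPTION: LOCATION: "
    "RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=MO,TU,WE,TH,FR,SA END:VEVENT END:VCALENDAR"
)
spec = opening_hours(ical)
print(spec)
```

Only the time of day is used, so the odd year in the source DTEND does not matter here; records with several VEVENT blocks or other RRULE frequencies would need a real iCalendar parser.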