---
theme: slate
transition: fade
showIndex: true
align: left
---

# Points of Interest

### MIMathon Porto 2026 · Use case 02

team **LESL** · Dolfin & Askem

---

## The problem in one sentence

The same kind of place is named, categorised, and structured differently by every source.

- *"museum"* vs *"cultural site"* vs *"heritage space"*
- Different category hierarchies
- Mixed granularities: *"restaurant"* vs *"vegan restaurant"*
- Multiple file formats, embedded vCard, embedded iCalendar

→ No unified search, navigation, or tourism apps.

---

## The datasets

Two CSV exports from the Porto Digital **CitySDK** API:

- **4 Casas de Fado** in the historic centre
- **54 Postos de Abastecimento** (petrol stations)

Same schema, very different categories.

Every field is **trilingual** (pt-PT, en-GB, es-ES).

---

## Format quirks we found

- **Python dict literals as CSV cells**: single-quoted, so not valid JSON; parsed with `ast.literal_eval`
- **vCard 2.1** embedded in the `address` column (street + phone + email + URL)
- **iCalendar** in the `time` column
- **17+ namespaced keys** in the `others` column (`x-citysdk/capacity`, `x-citysdk/cost-rating`, ...)

A real-world dataset, with all the messy parts you would expect.
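
A minimal sketch of the first quirk, assuming a cell shaped like the `category` cells shown in this deck. The sample row, the `id` column, and the `parse_cell` helper are illustrative, not taken from the project code:

```python
import ast
import csv
import io
import json

# Hypothetical one-row export: column names and values are illustrative,
# shaped like the cells shown in this deck.
SAMPLE_CSV = '''id,category
poi-1,"[{'lang': 'pt-PT', 'value': 'Casas de Fado'}, {'lang': 'en-GB', 'value': 'Fado houses'}]"
'''


def parse_cell(cell):
    """Parse a cell holding a Python literal; return None for an empty cell."""
    cell = cell.strip()
    if not cell:
        return None
    # literal_eval accepts only literals (no names, no calls), so it does not
    # execute code the way eval() would.
    return ast.literal_eval(cell)


row = next(csv.DictReader(io.StringIO(SAMPLE_CSV)))
category = parse_cell(row["category"])
print(category[1]["value"])

# The same cell is rejected by a JSON parser because of the single quotes.
try:
    json.loads(row["category"])
except json.JSONDecodeError:
    print("not valid JSON")
```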

---

## Standards check: Smart Data Models

This time, **SDM has it**.

[`dataModel.PointOfInterest`](https://github.com/smart-data-models/dataModel.PointOfInterest) defines:

- `PointOfInterest` (generic shape)
- `Museum`, `Beach`, `Store` (specialisations)

→ We **reuse** rather than invent: our canonical model follows SDM's attribute names, so the future NGSI-LD serialisation is trivial.

---

## Standards check: taxonomy

For category content, we need a **dereferenceable, stable vocabulary**:

- **schema.org** — well-known, modern, web-ready IRIs (`schema.org/Restaurant`, `schema.org/GasStation`)
- **Wikidata** — culturally aware Q-IDs (`Q3338148` for *Casa de fado*, `Q205495` for *gas station*)

We bind every category to **both**.

---

## The canonical model

```
concept PointOfInterest:
  has localId: one string
  has names: at least 1 LocalizedText
  has descriptions: LocalizedText
  has category: one Category
  has location: one Location
  has address: optional PostalAddress
  has contact: optional ContactPoint
  has capacity: optional int
  has costRating: optional int
```

`LocalizedText`, `Category`, `PostalAddress`, `ContactPoint` are all typed entities.
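
One way to mirror this concept in Python is a set of dataclasses. This is a sketch, not the project's code: the inner fields of `LocalizedText`, `Location`, and `ContactPoint` are assumptions, and `descriptions` is assumed to be a possibly empty list because the model does not state its cardinality.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LocalizedText:
    language: str  # assumed field names
    value: str


@dataclass
class Category:
    source_label: str
    schema_org_refs: list[str]
    wikidata_ref: str


@dataclass
class Location:
    latitude: float  # assumed: a simple lat/lon point
    longitude: float


@dataclass
class PostalAddress:
    street: Optional[str] = None
    postcode: Optional[str] = None
    country: Optional[str] = None


@dataclass
class ContactPoint:
    telephone: Optional[str] = None  # assumed field names
    url: Optional[str] = None
    email: Optional[str] = None


@dataclass
class PointOfInterest:
    local_id: str
    names: list[LocalizedText]
    category: Category
    location: Location
    descriptions: list[LocalizedText] = field(default_factory=list)
    address: Optional[PostalAddress] = None
    contact: Optional[ContactPoint] = None
    capacity: Optional[int] = None
    cost_rating: Optional[int] = None

    def __post_init__(self):
        # Enforce "has names: at least 1 LocalizedText".
        if not self.names:
            raise ValueError("names must contain at least one LocalizedText")
```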

---

## The model as a graph

![graph](sl-uc2-graph.svg)

---

## Multilingual, the right way

Each name is an RDF **language-tagged value**:

```json
"names": [
  { "@language": "pt", "@value": "O Fado" },
  { "@language": "en", "@value": "O Fado" },
  { "@language": "es", "@value": "O Fado" }
]
```

→ Out-of-the-box compatible with schema.org, JSON-LD, RDF tools.
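
A sketch of the conversion from the source `lang`/`value` pairs to JSON-LD value objects. Reducing `pt-PT` to the primary subtag `pt` is an assumption inferred from the example above; `to_language_tagged` is a hypothetical helper name.

```python
def to_language_tagged(entries, primary_subtag_only=True):
    """Turn [{'lang': 'pt-PT', 'value': ...}, ...] into JSON-LD value objects."""
    tagged = []
    for entry in entries:
        tag = entry["lang"]
        if primary_subtag_only:
            # Assumption: keep only the primary language subtag ("pt-PT" -> "pt").
            tag = tag.split("-")[0].lower()
        tagged.append({"@language": tag, "@value": entry["value"]})
    return tagged


source_names = [
    {"lang": "pt-PT", "value": "O Fado"},
    {"lang": "en-GB", "value": "O Fado"},
    {"lang": "es-ES", "value": "O Fado"},
]
names = to_language_tagged(source_names)
print(names[0])
```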

---

## Category alignment

The `Category` node carries three things:

```json
{
  "sourceLabel": "Casas de Fado",
  "schemaOrgRefs": [
    "https://schema.org/Restaurant",
    "https://schema.org/MusicVenue"
  ],
  "wikidataRef": "https://www.wikidata.org/entity/Q3338148"
}
```

- **Source label** preserved for audit
- **schema.org IRIs** for canonical types
- **Wikidata Q-ID** for cultural precision

---

## A small explicit lookup table

`category_map.json` maps source labels to canonical IRIs:

```json
{
  "Casas de Fado": {
    "schemaOrgRefs": ["https://schema.org/Restaurant",
                      "https://schema.org/MusicVenue"],
    "wikidataRef": "https://www.wikidata.org/entity/Q3338148"
  },
  "Postos de Abastecimento": {
    "schemaOrgRefs": ["https://schema.org/GasStation"],
    "wikidataRef": "https://www.wikidata.org/entity/Q205495"
  }
}
```

Explicit, auditable, easy to extend.
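
A sketch of how an adapter might apply the table. Matching on the `pt-PT` label is an assumption (the keys above are the Portuguese labels), and `resolve_category` is a hypothetical helper, not the project's function.

```python
import json

# Inline copy of one entry from the table above, so the sketch runs standalone.
CATEGORY_MAP = json.loads("""
{
  "Casas de Fado": {
    "schemaOrgRefs": ["https://schema.org/Restaurant",
                      "https://schema.org/MusicVenue"],
    "wikidataRef": "https://www.wikidata.org/entity/Q3338148"
  }
}
""")


def resolve_category(labels, category_map, key_lang="pt-PT"):
    """Build a Category node from the multilingual source labels."""
    # Assumption: the label in key_lang is the lookup key.
    source_label = next(
        (item["value"] for item in labels if item["lang"] == key_lang), None
    )
    if source_label not in category_map:
        # Fail loudly so unmapped categories are noticed, not silently dropped.
        raise KeyError(f"no mapping for category {source_label!r}")
    entry = category_map[source_label]
    return {
        "@type": "Category",
        "sourceLabel": source_label,
        "schemaOrgRefs": list(entry["schemaOrgRefs"]),
        "wikidataRef": entry["wikidataRef"],
    }


labels = [
    {"lang": "pt-PT", "value": "Casas de Fado"},
    {"lang": "en-GB", "value": "Fado houses"},
]
node = resolve_category(labels, CATEGORY_MAP)
print(node["sourceLabel"], node["wikidataRef"])
```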

---

## The architecture pays off

We started with one dataset (Casas de Fado, 4 records).

A second arrived (Postos de Abastecimento, 54 records).

**Same CitySDK schema, different category.**

We added **three lines to `category_map.json`** and re-ran. **54 of 54 records harmonized**, with **zero code changes**.

This is what a canonical-model-plus-adapter architecture is for.
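
A toy illustration of why no code change was needed, assuming the harmoniser is driven entirely by the map. The `harmonize` function and the two-record input are hypothetical stand-ins for the real pipeline.

```python
def harmonize(records, category_map):
    """Split records into harmonized output and records with no mapping."""
    harmonized, unmapped = [], []
    for record in records:
        entry = category_map.get(record["category"])
        if entry is None:
            unmapped.append(record)
            continue
        harmonized.append({
            "localId": record["id"],
            "category": {"sourceLabel": record["category"], **entry},
        })
    return harmonized, unmapped


# Hypothetical, heavily simplified records.
records = [
    {"id": "a", "category": "Casas de Fado"},
    {"id": "b", "category": "Postos de Abastecimento"},
]
category_map = {
    "Casas de Fado": {
        "schemaOrgRefs": ["https://schema.org/Restaurant",
                          "https://schema.org/MusicVenue"],
        "wikidataRef": "https://www.wikidata.org/entity/Q3338148",
    },
}

before = harmonize(records, category_map)

# The only change is data: one new entry in the map.
category_map["Postos de Abastecimento"] = {
    "schemaOrgRefs": ["https://schema.org/GasStation"],
    "wikidataRef": "https://www.wikidata.org/entity/Q205495",
}
after = harmonize(records, category_map)
print(len(before[0]), len(after[0]))
```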

---

## Before and after

**Source CSV row** (one cell shown):

```
'category' = "[{'lang': 'pt-PT', 'value': 'Casas de Fado'},
                {'lang': 'en-GB', 'value': 'Fado houses'},
                {'lang': 'es-ES', 'value': 'Casas de Fado'}]"
```

**Canonical JSON-LD**:

```json
"category": {
  "@type": "Category",
  "sourceLabel": "Casas de Fado",
  "schemaOrgRefs": ["https://schema.org/Restaurant",
                    "https://schema.org/MusicVenue"],
  "wikidataRef": "https://www.wikidata.org/entity/Q3338148"
}
```

---

## What changed

- Multilingual text → RDF language-tagged literals
- vCard `ADR` → typed `PostalAddress` (street, postcode, country)
- vCard `TEL/URL/EMAIL` → typed `ContactPoint`
- `x-citysdk/capacity` and `x-citysdk/cost-rating` → first-class attributes
- Category dual-bound to **schema.org** + **Wikidata**
- Stable `@id`s; every fragment carries an `@type`
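
The vCard step can be sketched as follows. The sample card is invented, the output key names are assumptions, and the parser ignores line folding and quoted-printable encoding, both of which real vCard 2.1 data may use. The `ADR` component order (PO box, extended address, street, locality, region, postal code, country) comes from the vCard format itself.

```python
# Invented sample; not a record from the datasets.
SAMPLE_VCARD = """BEGIN:VCARD
VERSION:2.1
ADR;WORK:;;Rua Exemplo 1;Porto;;4000-000;Portugal
TEL;WORK:+351 000 000 000
EMAIL;INTERNET:info@example.org
URL:https://example.org
END:VCARD"""


def parse_vcard(text):
    """Return {PROPERTY: value}, keeping the first occurrence of each property."""
    props = {}
    for line in text.splitlines():
        head, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue
        name = head.split(";")[0].upper()  # drop parameters such as ;WORK
        props.setdefault(name, value.strip())
    return props


def to_postal_address(adr):
    """Map the seven ADR components onto a typed PostalAddress."""
    parts = (adr.split(";") + [""] * 7)[:7]
    _pobox, _extended, street, _locality, _region, postcode, country = parts
    return {
        "@type": "PostalAddress",
        "street": street or None,
        "postcode": postcode or None,
        "country": country or None,
    }


def to_contact_point(props):
    """Collect TEL, URL and EMAIL into a typed ContactPoint."""
    return {
        "@type": "ContactPoint",
        "telephone": props.get("TEL"),
        "url": props.get("URL"),
        "email": props.get("EMAIL"),
    }


props = parse_vcard(SAMPLE_VCARD)
address = to_postal_address(props["ADR"])
contact = to_contact_point(props)
print(address["street"], contact["url"])
```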

---

## Find it all

Page: **askem.eu/mimathon/sl-uc2.html**

- canonical model (`pois.dolfin`)
- both datasets (source + canonical, JSON-LD + GeoJSON)
- the source code, including the new `category_map.json`
- a single-tarball download (`sl-uc2-harmonize.tar.gz`)

---

## What's next

1. **Lower the bar for new adapters**: YAML config for simple cases, scaffolder for new datasets
2. **Onboard a third dataset with a different schema** (OSM export, Google Places, another city's portal) to stress-test the adapter layer
3. **Ship as NGSI-LD** against `dataModel.PointOfInterest`
4. **Resolve categories beyond the lookup table** (SPARQL probes against Wikidata, fuzzy schema.org matching, LLM-assisted draft + human review)
5. **Parse opening hours** from the iCalendar column into `schema:openingHoursSpecification`

---

# Thank you

**team LESL** · Dolfin & Askem

askem.eu/mimathon
