# Backend App (`backend/app`)

This folder contains the FastAPI backend for `visualizador_instanciados`.

The backend can execute SPARQL queries in two interchangeable ways:

1. **`GRAPH_BACKEND=rdflib`**: parse a Turtle file into an in-memory RDFLib `Graph` and run SPARQL queries locally.
2. **`GRAPH_BACKEND=anzograph`**: run SPARQL queries against an AnzoGraph SPARQL endpoint over HTTP (optionally `LOAD` a TTL on startup).

Callers (frontend or other clients) interact with a single API surface (`/api/*`) and do not need to know which backend is configured.

## Files

- `main.py`
  - FastAPI app setup, startup/shutdown (`lifespan`), and HTTP endpoints.
- `settings.py`
  - Env-driven configuration (`pydantic-settings`).
- `sparql_engine.py`
  - Backend-agnostic SPARQL execution layer:
    - `RdflibEngine`: `Graph.query(...)` + SPARQL JSON serialization.
    - `AnzoGraphEngine`: HTTP POST to `/sparql` with Basic auth + readiness gate.
    - `create_sparql_engine(settings)` chooses the engine based on `GRAPH_BACKEND`.
- `graph_export.py`
  - Shared helpers to:
    - build the snapshot SPARQL query used for edge retrieval
    - map SPARQL JSON bindings to `{nodes, edges}`.
- `models.py`
  - Pydantic response/request models:
    - `Node`, `Edge`, `GraphResponse`, `StatsResponse`, etc.
- `rdf_store.py`
  - A local parsed representation (dense IDs + neighbor-ish data) built only in `GRAPH_BACKEND=rdflib`.
  - Used by `/api/nodes`, `/api/edges`, and `rdflib`-mode `/api/stats`.
- `pipelines/graph_snapshot.py`
  - Pipeline used by `/api/graph` to return a `{nodes, edges}` snapshot via SPARQL (works for both RDFLib and AnzoGraph).
- `pipelines/snapshot_service.py`
  - Snapshot cache layer used by `/api/graph` and `/api/stats` so the backend doesn't run expensive SPARQL twice.
- `pipelines/subclass_labels.py`
  - Pipeline to extract `rdfs:subClassOf` entities and aligned `rdfs:label` list.

## Runtime Flow

On startup (FastAPI lifespan):

1. `create_sparql_engine(settings)` selects and starts a SPARQL engine.
2. The engine is stored at `app.state.sparql`.
3. If `GRAPH_BACKEND=rdflib`, `RDFStore` is also built from the already-loaded RDFLib graph and stored at `app.state.store`.

On shutdown:

- `app.state.sparql.shutdown()` is called. It closes the HTTP client in AnzoGraph mode and is a no-op in RDFLib mode.
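
The startup and shutdown steps above can be sketched with a plain async context manager. This is a minimal illustration that runs without FastAPI, not the code in `main.py`: `FakeEngine`, the extra `backend` parameter, the async `startup`/`shutdown` methods, and the dictionary standing in for `RDFStore` are all assumptions made for the example.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class FakeEngine:
    """Hypothetical stand-in for RdflibEngine / AnzoGraphEngine."""

    def __init__(self, backend: str) -> None:
        self.backend = backend
        self.closed = False

    async def startup(self) -> None:
        # A real engine would parse the TTL file or wait for AnzoGraph here.
        pass

    async def shutdown(self) -> None:
        # A real AnzoGraph engine would close its HTTP client here.
        self.closed = True


@asynccontextmanager
async def lifespan(app, backend: str):
    # Steps 1 and 2: select and start the engine, then expose it on app.state.
    engine = FakeEngine(backend)
    await engine.startup()
    app.state.sparql = engine
    # Step 3: only rdflib mode builds the local store (placeholder object here).
    if backend == "rdflib":
        app.state.store = {"built_from": "rdflib graph"}
    try:
        yield
    finally:
        await engine.shutdown()


async def demo(backend: str):
    # SimpleNamespace mimics the `app.state` attribute bag of a FastAPI app.
    app = SimpleNamespace(state=SimpleNamespace())
    async with lifespan(app, backend):
        has_store = hasattr(app.state, "store")
    return has_store, app.state.sparql.closed
```

In a real FastAPI app the lifespan function takes only `app`; the backend choice would come from the settings object instead of a parameter.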

## Environment Variables

Most configuration is intended to be provided via container environment variables (see repo root `.env` and `docker-compose.yml`).

Core:

- `GRAPH_BACKEND`: `rdflib` or `anzograph`
- `INCLUDE_BNODES`: `true`/`false`
- `CORS_ORIGINS`: comma-separated list or `*`

RDFLib mode:

- `TTL_PATH`: path inside the backend container to a `.ttl` file (example: `/data/o3po.ttl`)
- `MAX_TRIPLES`: optional int; if set, stops parsing after this many triples

AnzoGraph mode:

- `SPARQL_HOST`: base host (example: `http://anzograph:8080`)
- `SPARQL_ENDPOINT`: optional full endpoint; if set, overrides `${SPARQL_HOST}/sparql`
- `SPARQL_USER`, `SPARQL_PASS`: Basic auth credentials
- `SPARQL_DATA_FILE`: file URI as seen by the **AnzoGraph container** (example: `file:///opt/shared-files/o3po.ttl`)
- `SPARQL_GRAPH_IRI`: optional graph IRI for `LOAD ... INTO GRAPH <...>`
- `SPARQL_LOAD_ON_START`: `true` to execute `LOAD <SPARQL_DATA_FILE>` during startup
- `SPARQL_CLEAR_ON_START`: `true` to execute `CLEAR ALL` during startup (dangerous)
- `SPARQL_TIMEOUT_S`: request timeout for normal SPARQL requests
- `SPARQL_READY_RETRIES`, `SPARQL_READY_DELAY_S`, `SPARQL_READY_TIMEOUT_S`: readiness gate parameters
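
As a rough illustration of how the AnzoGraph variables interact, the sketch below resolves the effective endpoint and derives the startup update statements from a plain mapping. The function names, the accepted boolean spellings, and the clear-before-load ordering are assumptions for this example; the real logic lives in `settings.py` and `sparql_engine.py` and may differ.

```python
from typing import List, Mapping, Optional


def parse_bool(value: Optional[str], default: bool = False) -> bool:
    # Assumed spellings; the real settings parser may accept others.
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def resolve_sparql_endpoint(env: Mapping[str, str]) -> str:
    # SPARQL_ENDPOINT, when set, overrides ${SPARQL_HOST}/sparql.
    explicit = env.get("SPARQL_ENDPOINT", "").strip()
    if explicit:
        return explicit
    host = env.get("SPARQL_HOST", "").rstrip("/")
    return f"{host}/sparql"


def startup_updates(env: Mapping[str, str]) -> List[str]:
    updates: List[str] = []
    # Assumption: a requested CLEAR runs before the LOAD.
    if parse_bool(env.get("SPARQL_CLEAR_ON_START")):
        updates.append("CLEAR ALL")
    if parse_bool(env.get("SPARQL_LOAD_ON_START")):
        statement = f"LOAD <{env['SPARQL_DATA_FILE']}>"
        graph_iri = env.get("SPARQL_GRAPH_IRI", "").strip()
        if graph_iri:
            statement += f" INTO GRAPH <{graph_iri}>"
        updates.append(statement)
    return updates
```

Passing the environment as a mapping (for example `os.environ`) keeps the helpers easy to exercise without touching the process environment.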

## AnzoGraph Readiness Gate

`AnzoGraphEngine` does not assume "container started" means "SPARQL works".
Before serving queries, it waits until a smoke-test request succeeds:

- Method: `POST ${SPARQL_ENDPOINT}`
- Headers:
  - `Content-Type: application/x-www-form-urlencoded`
  - `Accept: application/sparql-results+json`
  - `Authorization: Basic ...` (if configured)
- Body: `query=ASK WHERE { ?s ?p ?o }`
- Success condition: HTTP 2xx and response parses as JSON

This matches the behavior described in `docs/anzograph-readiness-julia.md`.
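
The gate can be sketched with the standard library alone. This is an illustrative version under stated assumptions, not the engine's actual code: the function names, the default retry values, and the injectable `send`/`sleep` parameters are hypothetical, and the real `AnzoGraphEngine` may use a different HTTP client.

```python
import base64
import json
import time
import urllib.parse
import urllib.request


def build_smoke_request(endpoint, user=None, password=None):
    # Form-encoded ASK query, as listed in the bullet points above.
    body = urllib.parse.urlencode({"query": "ASK WHERE { ?s ?p ?o }"}).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",
    }
    if user is not None and password is not None:
        token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def http_send(request, timeout_s):
    # urlopen raises an OSError subclass on connection errors and non-2xx replies.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status, response.read()


def wait_until_ready(endpoint, send=http_send, retries=30, delay_s=2.0,
                     timeout_s=10.0, sleep=time.sleep, user=None, password=None):
    """Return True once the endpoint answers 2xx with a JSON body, else False."""
    request = build_smoke_request(endpoint, user, password)
    for attempt in range(retries):
        try:
            status, payload = send(request, timeout_s)
            if 200 <= status < 300:
                json.loads(payload)  # raises ValueError if the body is not JSON
                return True
        except (OSError, ValueError):
            pass  # not ready yet; fall through to the retry delay
        if attempt < retries - 1:
            sleep(delay_s)
    return False
```

The `retries`, `delay_s`, and `timeout_s` arguments correspond in spirit to `SPARQL_READY_RETRIES`, `SPARQL_READY_DELAY_S`, and `SPARQL_READY_TIMEOUT_S`; the default values shown are placeholders.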

## API Endpoints

- `GET /api/health`
  - Returns `{ "status": "ok" }`.
- `GET /api/stats`
  - Returns counts for the same snapshot used by `/api/graph` (via the snapshot cache).
- `POST /api/sparql`
  - Body: `{ "query": "<SPARQL SELECT/ASK>" }`
  - Returns SPARQL JSON results as-is.
  - Notes:
    - This endpoint is intended for **SELECT/ASK returning SPARQL-JSON**.
    - SPARQL UPDATE is not exposed here (AnzoGraph `LOAD`/`CLEAR` are handled internally during startup).
- `GET /api/graph?node_limit=...&edge_limit=...`
  - Returns a graph snapshot as `{ nodes: [...], edges: [...] }`.
  - Implemented as a SPARQL edge query + mapping in `pipelines/graph_snapshot.py`.
- `GET /api/nodes`, `GET /api/edges`
  - Only available in `GRAPH_BACKEND=rdflib` (these use `RDFStore`'s dense ID tables).
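
A client call to `POST /api/sparql` might look like the following sketch. The helper names and the `http://localhost:8000` base URL used in the usage note are assumptions; the request body shape and the SPARQL JSON result layout follow the description above and the W3C results format.

```python
import json
import urllib.request


def build_sparql_request(base_url, query):
    # JSON body of the form {"query": "..."} as described for POST /api/sparql.
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/sparql",
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def rows_from_select(result):
    # Flatten SPARQL JSON SELECT results into a list of {variable: value} dicts.
    bindings = result.get("results", {}).get("bindings", [])
    return [{var: cell["value"] for var, cell in row.items()} for row in bindings]
```

To send the request against a locally running backend (hypothetical address), pass it to `urllib.request.urlopen(build_sparql_request("http://localhost:8000", "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"))` and feed the decoded JSON to `rows_from_select`.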

## Data Contract

### Node

Returned in `nodes[]` (dense IDs; suitable for indexing in typed arrays):

```json
{
  "id": 0,
  "termType": "uri",
  "iri": "http://example.org/Thing",
  "label": null,
  "x": 0.0,
  "y": 0.0
}
```

- `id`: integer dense node ID used in edges
- `termType`: `"uri"` or `"bnode"`
- `iri`: URI string; blank nodes are normalized to `_:<id>`
- `label`: currently `null` in `/api/graph` snapshots (pipelines can be used to populate later)
- `x`/`y`: world-space coordinates for rendering (currently a deterministic spiral layout)

### Edge

Returned in `edges[]`:

```json
{
  "source": 0,
  "target": 12,
  "predicate": "http://www.w3.org/2000/01/rdf-schema#subClassOf"
}
```

- `source`/`target`: dense node IDs (indexes into `nodes[]`)
- `predicate`: predicate IRI string
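
Because node IDs are dense, a consumer can pack a snapshot into flat typed arrays and index them directly by ID. The sketch below shows one way to do that in Python with the standard `array` module; the function name and the interleaved `[x0, y0, x1, y1, ...]` layout are choices made for this example, not something the backend prescribes.

```python
from array import array


def to_typed_arrays(snapshot):
    """Pack a {nodes, edges} snapshot into flat position and edge-index arrays."""
    nodes = snapshot["nodes"]
    edges = snapshot["edges"]

    # Interleaved coordinates: node i lives at positions[2*i] and positions[2*i + 1].
    positions = array("d", [0.0]) * (2 * len(nodes))
    for node in nodes:
        i = node["id"]
        positions[2 * i] = node["x"]
        positions[2 * i + 1] = node["y"]

    # Flat pairs of (source, target) node IDs.
    edge_index = array("I")
    for edge in edges:
        source, target = edge["source"], edge["target"]
        if not (0 <= source < len(nodes) and 0 <= target < len(nodes)):
            raise ValueError(f"edge references unknown node: {edge!r}")
        edge_index.extend((source, target))

    return positions, edge_index
```

A frontend would typically do the equivalent with `Float32Array` and `Uint32Array`; the bounds check guards against a snapshot whose edges point outside `nodes[]`.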

## Snapshot Query (`/api/graph`)

`/api/graph` uses a SPARQL query that:

- selects triples `?s ?p ?o`
- excludes literal objects (`FILTER(!isLiteral(?o))`)
- excludes `rdfs:label`, `skos:prefLabel`, and `skos:altLabel` predicates
- excludes blank nodes unless `INCLUDE_BNODES=true`
- applies `LIMIT edge_limit`

The result bindings are mapped to dense node IDs (first-seen order) and returned to the caller.

`/api/graph` also returns `meta` with snapshot counts and engine info so the frontend doesn't need to call `/api/stats`.
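
The query shape and the first-seen ID mapping can be illustrated as follows. This is a sketch under assumptions, not the contents of `graph_export.py`: the function names and the exact filter syntax are invented for the example, `node_limit` handling is left out because the text does not describe it, and the `x`/`y` layout step is omitted.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"


def build_snapshot_query(edge_limit, include_bnodes=False):
    # One FILTER combining the exclusions listed above.
    filters = [
        "!isLiteral(?o)",
        f"?p NOT IN (<{RDFS_LABEL}>, <{SKOS_PREF_LABEL}>, <{SKOS_ALT_LABEL}>)",
    ]
    if not include_bnodes:
        filters += ["!isBlank(?s)", "!isBlank(?o)"]
    condition = " && ".join(filters)
    return (
        f"SELECT ?s ?p ?o WHERE {{ ?s ?p ?o . FILTER({condition}) }} "
        f"LIMIT {int(edge_limit)}"
    )


def bindings_to_graph(bindings):
    """Map SPARQL JSON bindings to {nodes, edges} with first-seen dense IDs."""
    ids = {}
    nodes = []
    edges = []

    def node_id(term):
        is_bnode = term["type"] == "bnode"
        key = ("bnode" if is_bnode else "uri", term["value"])
        if key not in ids:
            ids[key] = len(nodes)
            # Blank nodes are normalized to the `_:<id>` form.
            iri = f"_:{term['value']}" if is_bnode else term["value"]
            nodes.append({"id": ids[key], "termType": key[0], "iri": iri, "label": None})
        return ids[key]

    for row in bindings:
        source = node_id(row["s"])  # subject is registered before object
        target = node_id(row["o"])
        edges.append({"source": source, "target": target, "predicate": row["p"]["value"]})
    return {"nodes": nodes, "edges": edges}
```

Because IDs are assigned in encounter order, the same result set in the same order always yields the same IDs; a different row order from the engine would produce a different numbering.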

## Pipelines

### `pipelines/graph_snapshot.py`

`fetch_graph_snapshot(...)` is the main "export graph" pipeline used by `/api/graph`.

### `pipelines/subclass_labels.py`

`extract_subclass_entities_and_labels(...)`:

1. Queries all `rdfs:subClassOf` triples.
2. Builds a unique set of subjects+objects, then converts it to a deterministic list.
3. Queries `rdfs:label` for those entities and returns aligned lists:
   - `entities[i]` corresponds to `labels[i]`.
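
Steps 2 and 3 can be sketched on top of SPARQL JSON bindings as shown below. The helper names, the `?s`/`?o`/`?label` variable names, the use of sorting to obtain a deterministic order, and the choice of the first label seen per entity are all assumptions for this illustration; the actual pipeline may resolve these differently.

```python
def collect_subclass_entities(subclass_bindings):
    # Unique subjects and objects of rdfs:subClassOf triples.
    seen = set()
    for row in subclass_bindings:
        seen.add(row["s"]["value"])
        seen.add(row["o"]["value"])
    # Sorting is one simple way to make the list order deterministic.
    return sorted(seen)


def align_labels(entities, label_bindings):
    # Keep the first label encountered for each entity.
    first_label = {}
    for row in label_bindings:
        first_label.setdefault(row["s"]["value"], row["label"]["value"])
    # labels[i] belongs to entities[i]; entities without a label get None.
    return [first_label.get(entity) for entity in entities]
```

Returning `None` for unlabeled entities keeps the two lists the same length, which is what makes index-based alignment safe.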

## Notes / Tradeoffs

- `/api/graph` returns only nodes that appear in the returned edge result set. Nodes not referenced by those edges will not be present.
- RDFLib and AnzoGraph may differ in supported SPARQL features (vendor extensions, inference, performance), but the API surface is the same.
- `rdf_store.py` is currently only needed for `/api/nodes`, `/api/edges`, and rdflib-mode `/api/stats`. If you don't use those endpoints, it can be removed later.