# Backend App (`backend/app`)

This folder contains the FastAPI backend for `visualizador_instanciados`. The backend can execute SPARQL queries in two interchangeable ways:

1. **`GRAPH_BACKEND=rdflib`**: parse a Turtle file into an in-memory RDFLib `Graph` and run SPARQL queries locally.
2. **`GRAPH_BACKEND=anzograph`**: run SPARQL queries against an AnzoGraph SPARQL endpoint over HTTP (optionally `LOAD` a TTL on startup).

Callers (frontend or other clients) interact with a single API surface (`/api/*`) and do not need to know which backend is configured.

## Files

- `main.py` - FastAPI app setup, startup/shutdown (`lifespan`), and HTTP endpoints.
- `settings.py` - Env-driven configuration (`pydantic-settings`).
- `sparql_engine.py` - Backend-agnostic SPARQL execution layer:
  - `RdflibEngine`: `Graph.query(...)` + SPARQL JSON serialization.
  - `AnzoGraphEngine`: HTTP POST to `/sparql` with Basic auth + readiness gate.
  - `create_sparql_engine(settings)` chooses the engine based on `GRAPH_BACKEND`.
- `graph_export.py` - Shared helpers to:
  - build the snapshot SPARQL query used for edge retrieval
  - map SPARQL JSON bindings to `{nodes, edges}`.
- `models.py` - Pydantic response/request models: `Node`, `Edge`, `GraphResponse`, `StatsResponse`, etc.
- `rdf_store.py` - A local parsed representation (dense IDs + neighbor-ish data) built only in `GRAPH_BACKEND=rdflib`. Used by `/api/nodes`, `/api/edges`, and `rdflib`-mode `/api/stats`.
- `pipelines/graph_snapshot.py` - Pipeline used by `/api/graph` to return a `{nodes, edges}` snapshot via SPARQL (works for both RDFLib and AnzoGraph).
- `pipelines/snapshot_service.py` - Snapshot cache layer used by `/api/graph` and `/api/stats` so the backend doesn't run expensive SPARQL twice.
- `pipelines/subclass_labels.py` - Pipeline to extract `rdfs:subClassOf` entities and an aligned `rdfs:label` list.

## Runtime Flow

On startup (FastAPI lifespan):

1. `create_sparql_engine(settings)` selects and starts a SPARQL engine.
2.
   The engine is stored at `app.state.sparql`.
3. If `GRAPH_BACKEND=rdflib`, an `RDFStore` is also built from the already-loaded RDFLib graph and stored at `app.state.store`.

On shutdown:

- `app.state.sparql.shutdown()` is called to close the HTTP client (AnzoGraph mode) or no-op (RDFLib mode).

## Environment Variables

Most configuration is intended to be provided via container environment variables (see the repo root `.env` and `docker-compose.yml`).

Core:

- `GRAPH_BACKEND`: `rdflib` or `anzograph`
- `INCLUDE_BNODES`: `true`/`false`
- `CORS_ORIGINS`: comma-separated list or `*`

RDFLib mode:

- `TTL_PATH`: path inside the backend container to a `.ttl` file (example: `/data/o3po.ttl`)
- `MAX_TRIPLES`: optional int; if set, parsing stops after this many triples

AnzoGraph mode:

- `SPARQL_HOST`: base host (example: `http://anzograph:8080`)
- `SPARQL_ENDPOINT`: optional full endpoint; if set, overrides `${SPARQL_HOST}/sparql`
- `SPARQL_USER`, `SPARQL_PASS`: Basic auth credentials
- `SPARQL_DATA_FILE`: file URI as seen by the **AnzoGraph container** (example: `file:///opt/shared-files/o3po.ttl`)
- `SPARQL_GRAPH_IRI`: optional graph IRI for `LOAD ... INTO GRAPH <...>`
- `SPARQL_LOAD_ON_START`: `true` to execute `LOAD` during startup
- `SPARQL_CLEAR_ON_START`: `true` to execute `CLEAR ALL` during startup (dangerous)
- `SPARQL_TIMEOUT_S`: request timeout for normal SPARQL requests
- `SPARQL_READY_RETRIES`, `SPARQL_READY_DELAY_S`, `SPARQL_READY_TIMEOUT_S`: readiness gate parameters

## AnzoGraph Readiness Gate

`AnzoGraphEngine` does not assume that "container started" means "SPARQL works". It waits for a smoke-test POST to succeed:

- Method: `POST ${SPARQL_ENDPOINT}`
- Headers:
  - `Content-Type: application/x-www-form-urlencoded`
  - `Accept: application/sparql-results+json`
  - `Authorization: Basic ...` (if configured)
- Body: `query=ASK WHERE { ?s ?p ?o }`
- Success condition: HTTP 2xx and the response parses as JSON

This matches the behavior described in `docs/anzograph-readiness-julia.md`.
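The readiness loop above can be sketched roughly as follows. This is a minimal illustration, not the actual `AnzoGraphEngine` code: the `post` callable (e.g. a thin wrapper around an HTTP client), the function names, and the default retry values are all assumptions made for the example; the real engine is driven by the `SPARQL_READY_*` settings.

```python
import json
import time
from typing import Callable

def build_smoke_test_request(endpoint: str) -> dict:
    """Describe the smoke-test POST: a cheap ASK that succeeds on any store."""
    return {
        "url": endpoint,
        "headers": {
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        "data": {"query": "ASK WHERE { ?s ?p ?o }"},
    }

def wait_until_ready(
    post: Callable,            # hypothetical injected HTTP-POST callable
    endpoint: str,
    retries: int = 30,
    delay_s: float = 2.0,
) -> bool:
    """Retry the smoke test until HTTP 2xx + valid JSON, or give up."""
    request = build_smoke_test_request(endpoint)
    for attempt in range(retries):
        try:
            response = post(request)
            if 200 <= response.status_code < 300:
                json.loads(response.text)  # success requires parseable JSON
                return True
        except Exception:
            pass  # connection refused, bad JSON, etc.: not ready yet
        if attempt < retries - 1:
            time.sleep(delay_s)
    return False
```

Injecting `post` keeps the loop testable without a live AnzoGraph container; the same shape works with `httpx` or `requests` in production.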
## API Endpoints

- `GET /api/health`
  - Returns `{ "status": "ok" }`.
- `GET /api/stats`
  - Returns counts for the same snapshot used by `/api/graph` (via the snapshot cache).
- `POST /api/sparql`
  - Body: `{ "query": "" }`
  - Returns SPARQL JSON results as-is.
  - Notes:
    - This endpoint is intended for **SELECT/ASK queries returning SPARQL-JSON**.
    - SPARQL UPDATE is not exposed here (AnzoGraph `LOAD`/`CLEAR` are handled internally during startup).
- `GET /api/graph?node_limit=...&edge_limit=...`
  - Returns a graph snapshot as `{ nodes: [...], edges: [...] }`.
  - Implemented as a SPARQL edge query + mapping in `pipelines/graph_snapshot.py`.
- `GET /api/nodes`, `GET /api/edges`
  - Only available in `GRAPH_BACKEND=rdflib` (these use `RDFStore`'s dense ID tables).

## Data Contract

### Node

Returned in `nodes[]` (dense IDs; suitable for indexing into typed arrays):

```json
{
  "id": 0,
  "termType": "uri",
  "iri": "http://example.org/Thing",
  "label": null,
  "x": 0.0,
  "y": 0.0
}
```

- `id`: integer dense node ID used in edges
- `termType`: `"uri"` or `"bnode"`
- `iri`: URI string; blank nodes are normalized to `_:`
- `label`: currently `null` in `/api/graph` snapshots (pipelines can be used to populate it later)
- `x`/`y`: world-space coordinates for rendering (currently a deterministic spiral layout)

### Edge

Returned in `edges[]`:

```json
{
  "source": 0,
  "target": 12,
  "predicate": "http://www.w3.org/2000/01/rdf-schema#subClassOf"
}
```

- `source`/`target`: dense node IDs (indexes into `nodes[]`)
- `predicate`: predicate IRI string

## Snapshot Query (`/api/graph`)

`/api/graph` uses a SPARQL query that:

- selects triples `?s ?p ?o`
- excludes literal objects (`FILTER(!isLiteral(?o))`)
- excludes the `rdfs:label`, `skos:prefLabel`, and `skos:altLabel` predicates
- optionally excludes blank nodes (unless `INCLUDE_BNODES=true`)
- applies `LIMIT edge_limit`

The result bindings are mapped to dense node IDs (in first-seen order) and returned to the caller.
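The binding-to-snapshot mapping can be sketched as a pure function over SPARQL-JSON bindings. This is an illustrative reconstruction of the data contract above, not the actual code in `graph_export.py`/`pipelines/graph_snapshot.py`; the function name `bindings_to_snapshot` is hypothetical, and the placeholder `x`/`y` of `0.0` stands in for the real spiral layout.

```python
def bindings_to_snapshot(bindings):
    """Map SPARQL JSON edge bindings (?s ?p ?o) to the {nodes, edges} contract.

    Dense IDs are assigned in first-seen order, matching the snapshot query
    description above.
    """
    ids = {}                # term string -> dense id
    nodes, edges = [], []

    def intern(term):
        # Blank nodes get a "_:" prefix; URIs keep their IRI as-is.
        iri = term["value"] if term["type"] == "uri" else "_:" + term["value"]
        if iri not in ids:
            ids[iri] = len(nodes)
            nodes.append({
                "id": len(nodes),
                "termType": term["type"],
                "iri": iri,
                "label": None,   # populated later by label pipelines
                "x": 0.0,        # real backend: deterministic spiral layout
                "y": 0.0,
            })
        return ids[iri]

    for b in bindings:
        edges.append({
            "source": intern(b["s"]),
            "target": intern(b["o"]),
            "predicate": b["p"]["value"],
        })
    return {"nodes": nodes, "edges": edges}
```

Because IDs are dense and first-seen, `edges[*].source`/`target` can index directly into `nodes[]` (or into typed arrays built from it) on the frontend.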
`/api/graph` also returns `meta` with snapshot counts and engine info so the frontend doesn't need to call `/api/stats`.

## Pipelines

### `pipelines/graph_snapshot.py`

`fetch_graph_snapshot(...)` is the main "export graph" pipeline used by `/api/graph`.

### `pipelines/subclass_labels.py`

`extract_subclass_entities_and_labels(...)`:

1. Queries all `rdfs:subClassOf` triples.
2. Builds a unique set of subjects + objects, then converts it to a deterministic list.
3. Queries `rdfs:label` for those entities and returns aligned lists: `entities[i]` corresponds to `labels[i]`.

## Notes / Tradeoffs

- `/api/graph` returns only the nodes that appear in the returned edge result set; nodes not referenced by those edges will not be present.
- RDFLib and AnzoGraph may differ in supported SPARQL features (vendor extensions, inference, performance), but the API surface is the same.
- `rdf_store.py` is currently only needed for `/api/nodes`, `/api/edges`, and rdflib-mode `/api/stats`. If you don't use those endpoints, it can be removed later.
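The alignment step of the subclass-labels pipeline (steps 2-3 above) can be sketched as a pure function. This is an assumption-laden illustration, not the real `extract_subclass_entities_and_labels` (which also runs the SPARQL queries): the helper name is invented, and sorting is used here as one plausible way to get a deterministic list from a set.

```python
def align_entities_and_labels(subclass_pairs, label_map):
    """Align entities with their labels by index.

    subclass_pairs: list of (subclass_iri, superclass_iri) tuples, i.e. the
                    results of the rdfs:subClassOf query.
    label_map:      dict mapping entity IRI -> rdfs:label (missing = no label).
    Returns (entities, labels) with entities[i] corresponding to labels[i].
    """
    # Unique subjects + objects, made deterministic by sorting (assumed order).
    entities = sorted({iri for s, o in subclass_pairs for iri in (s, o)})
    # Entities without a label get None, keeping the lists index-aligned.
    labels = [label_map.get(iri) for iri in entities]
    return entities, labels
```

Keeping the two lists index-aligned (rather than returning a dict) preserves a stable ordering that downstream consumers can rely on across calls.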