Reorganiza backend

This commit is contained in:
Oxy8
2026-03-02 17:33:45 -03:00
parent bba0ae887d
commit d4bfa5f064
8 changed files with 419 additions and 124 deletions

183
backend/app/README.md Normal file
View File

@@ -0,0 +1,183 @@
# Backend App (`backend/app`)
This folder contains the FastAPI backend for `visualizador_instanciados`.
The backend can execute SPARQL queries in two interchangeable ways:
1. **`GRAPH_BACKEND=rdflib`**: parse a Turtle file into an in-memory RDFLib `Graph` and run SPARQL queries locally.
2. **`GRAPH_BACKEND=anzograph`**: run SPARQL queries against an AnzoGraph SPARQL endpoint over HTTP (optionally `LOAD` a TTL on startup).
Callers (frontend or other clients) interact with a single API surface (`/api/*`) and do not need to know which backend is configured.
## Files
- `main.py`
- FastAPI app setup, startup/shutdown (`lifespan`), and HTTP endpoints.
- `settings.py`
- Env-driven configuration (`pydantic-settings`).
- `sparql_engine.py`
- Backend-agnostic SPARQL execution layer:
- `RdflibEngine`: `Graph.query(...)` + SPARQL JSON serialization.
- `AnzoGraphEngine`: HTTP POST to `/sparql` with Basic auth + readiness gate.
- `create_sparql_engine(settings)` chooses the engine based on `GRAPH_BACKEND`.
- `graph_export.py`
- Shared helpers to:
- build the snapshot SPARQL query used for edge retrieval
- map SPARQL JSON bindings to `{nodes, edges}`.
- `models.py`
- Pydantic response/request models:
- `Node`, `Edge`, `GraphResponse`, `StatsResponse`, etc.
- `rdf_store.py`
- A local parsed representation (dense IDs + neighbor-ish data) built only in `GRAPH_BACKEND=rdflib`.
- Used by `/api/nodes`, `/api/edges`, and `rdflib`-mode `/api/stats`.
- `pipelines/graph_snapshot.py`
- Pipeline used by `/api/graph` to return a `{nodes, edges}` snapshot via SPARQL (works for both RDFLib and AnzoGraph).
- `pipelines/snapshot_service.py`
- Snapshot cache layer used by `/api/graph` and `/api/stats` so the backend doesn't run expensive SPARQL twice.
- `pipelines/subclass_labels.py`
- Pipeline to extract `rdfs:subClassOf` entities and aligned `rdfs:label` list.
## Runtime Flow
On startup (FastAPI lifespan):
1. `create_sparql_engine(settings)` selects and starts a SPARQL engine.
2. The engine is stored at `app.state.sparql`.
3. If `GRAPH_BACKEND=rdflib`, `RDFStore` is also built from the already-loaded RDFLib graph and stored at `app.state.store`.
On shutdown:
- `app.state.sparql.shutdown()` is called to close the HTTP client (AnzoGraph mode) or no-op (RDFLib mode).
## Environment Variables
Most configuration is intended to be provided via container environment variables (see repo root `.env` and `docker-compose.yml`).
Core:
- `GRAPH_BACKEND`: `rdflib` or `anzograph`
- `INCLUDE_BNODES`: `true`/`false`
- `CORS_ORIGINS`: comma-separated list or `*`
RDFLib mode:
- `TTL_PATH`: path inside the backend container to a `.ttl` file (example: `/data/o3po.ttl`)
- `MAX_TRIPLES`: optional int; if set, stops parsing after this many triples
AnzoGraph mode:
- `SPARQL_HOST`: base host (example: `http://anzograph:8080`)
- `SPARQL_ENDPOINT`: optional full endpoint; if set, overrides `${SPARQL_HOST}/sparql`
- `SPARQL_USER`, `SPARQL_PASS`: Basic auth credentials
- `SPARQL_DATA_FILE`: file URI as seen by the **AnzoGraph container** (example: `file:///opt/shared-files/o3po.ttl`)
- `SPARQL_GRAPH_IRI`: optional graph IRI for `LOAD ... INTO GRAPH <...>`
- `SPARQL_LOAD_ON_START`: `true` to execute `LOAD <SPARQL_DATA_FILE>` during startup
- `SPARQL_CLEAR_ON_START`: `true` to execute `CLEAR ALL` during startup (dangerous)
- `SPARQL_TIMEOUT_S`: request timeout for normal SPARQL requests
- `SPARQL_READY_RETRIES`, `SPARQL_READY_DELAY_S`, `SPARQL_READY_TIMEOUT_S`: readiness gate parameters
## AnzoGraph Readiness Gate
`AnzoGraphEngine` does not assume "container started" means "SPARQL works".
It waits for a smoke-test POST:
- Method: `POST ${SPARQL_ENDPOINT}`
- Headers:
- `Content-Type: application/x-www-form-urlencoded`
- `Accept: application/sparql-results+json`
- `Authorization: Basic ...` (if configured)
- Body: `query=ASK WHERE { ?s ?p ?o }`
- Success condition: HTTP 2xx and response parses as JSON
This matches the behavior described in `docs/anzograph-readiness-julia.md`.
## API Endpoints
- `GET /api/health`
- Returns `{ "status": "ok" }`.
- `GET /api/stats`
- Returns counts for the same snapshot used by `/api/graph` (via the snapshot cache).
- `POST /api/sparql`
- Body: `{ "query": "<SPARQL SELECT/ASK>" }`
- Returns SPARQL JSON results as-is.
- Notes:
- This endpoint is intended for **SELECT/ASK returning SPARQL-JSON**.
- SPARQL UPDATE is not exposed here (AnzoGraph `LOAD`/`CLEAR` are handled internally during startup).
- `GET /api/graph?node_limit=...&edge_limit=...`
- Returns a graph snapshot as `{ nodes: [...], edges: [...] }`.
- Implemented as a SPARQL edge query + mapping in `pipelines/graph_snapshot.py`.
- `GET /api/nodes`, `GET /api/edges`
- Only available in `GRAPH_BACKEND=rdflib` (these use `RDFStore`'s dense ID tables).
## Data Contract
### Node
Returned in `nodes[]` (dense IDs; suitable for indexing in typed arrays):
```json
{
"id": 0,
"termType": "uri",
"iri": "http://example.org/Thing",
"label": null,
"x": 0.0,
"y": 0.0
}
```
- `id`: integer dense node ID used in edges
- `termType`: `"uri"` or `"bnode"`
- `iri`: URI string; blank nodes are normalized to `_:<id>`
- `label`: currently `null` in `/api/graph` snapshots (pipelines can be used to populate later)
- `x`/`y`: world-space coordinates for rendering (currently a deterministic spiral layout)
### Edge
Returned in `edges[]`:
```json
{
"source": 0,
"target": 12,
"predicate": "http://www.w3.org/2000/01/rdf-schema#subClassOf"
}
```
- `source`/`target`: dense node IDs (indexes into `nodes[]`)
- `predicate`: predicate IRI string
## Snapshot Query (`/api/graph`)
`/api/graph` uses a SPARQL query that:
- selects triples `?s ?p ?o`
- excludes literal objects (`FILTER(!isLiteral(?o))`)
- excludes `rdfs:label`, `skos:prefLabel`, and `skos:altLabel` predicates
- optionally excludes blank nodes (unless `INCLUDE_BNODES=true`)
- applies `LIMIT edge_limit`
The result bindings are mapped to dense node IDs (first-seen order) and returned to the caller.
`/api/graph` also returns `meta` with snapshot counts and engine info so the frontend doesn't need to call `/api/stats`.
## Pipelines
### `pipelines/graph_snapshot.py`
`fetch_graph_snapshot(...)` is the main "export graph" pipeline used by `/api/graph`.
### `pipelines/subclass_labels.py`
`extract_subclass_entities_and_labels(...)`:
1. Queries all `rdfs:subClassOf` triples.
2. Builds a unique set of subjects+objects, then converts it to a deterministic list.
3. Queries `rdfs:label` for those entities and returns aligned lists:
- `entities[i]` corresponds to `labels[i]`.
## Notes / Tradeoffs
- `/api/graph` returns only nodes that appear in the returned edge result set. Nodes not referenced by those edges will not be present.
- RDFLib and AnzoGraph may differ in supported SPARQL features (vendor extensions, inference, performance), but the API surface is the same.
- `rdf_store.py` is currently only needed for `/api/nodes`, `/api/edges`, and rdflib-mode `/api/stats`. If you don't use those endpoints, it can be removed later.

View File

@@ -5,10 +5,10 @@ from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware
from .graph_export import edge_retrieval_query, graph_from_sparql_bindings
from .models import EdgesResponse, GraphResponse, NodesResponse, SparqlQueryRequest, StatsResponse
from .pipelines.snapshot_service import GraphSnapshotService
from .rdf_store import RDFStore
from .sparql_engine import AnzoGraphEngine, RdflibEngine, SparqlEngine, create_sparql_engine
from .sparql_engine import RdflibEngine, SparqlEngine, create_sparql_engine
from .settings import Settings
@@ -20,6 +20,7 @@ async def lifespan(app: FastAPI):
sparql: SparqlEngine = create_sparql_engine(settings)
await sparql.startup()
app.state.sparql = sparql
app.state.snapshot_service = GraphSnapshotService(sparql=sparql, settings=settings)
# Only build node/edge tables when running in rdflib mode.
if settings.graph_backend == "rdflib":
@@ -59,70 +60,17 @@ def health() -> dict[str, str]:
@app.get("/api/stats", response_model=StatsResponse)
async def stats() -> StatsResponse:
sparql: SparqlEngine = app.state.sparql
if settings.graph_backend == "rdflib":
store: RDFStore = app.state.store
return StatsResponse(
backend=sparql.name,
ttl_path=settings.ttl_path,
sparql_endpoint=None,
parsed_triples=store.parsed_triples,
nodes=store.node_count,
edges=store.edge_count,
)
# AnzoGraph: compute basic counts via SPARQL.
assert isinstance(sparql, AnzoGraphEngine)
def _count_from(result: dict, *, var: str = "count") -> int:
bindings = (((result.get("results") or {}).get("bindings")) or [])
if not bindings:
return 0
raw = bindings[0].get(var, {}).get("value")
try:
return int(raw)
except Exception:
return 0
bnode_filter = "" if settings.include_bnodes else "FILTER(!isBlank(?n))"
nodes_q = f"""
SELECT (COUNT(DISTINCT ?n) AS ?count)
WHERE {{
{{ ?n ?p ?o }} UNION {{ ?s ?p ?n }}
FILTER(!isLiteral(?n))
{bnode_filter}
}}
"""
triples_q = "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"
# Approximate "edges" similarly to our rdflib export: non-literal object, and skip label predicates.
edges_bnode_filter = "" if settings.include_bnodes else "FILTER(!isBlank(?s) && !isBlank(?o))"
edges_q = f"""
SELECT (COUNT(*) AS ?count)
WHERE {{
?s ?p ?o .
FILTER(!isLiteral(?o))
FILTER(?p NOT IN (
<http://www.w3.org/2000/01/rdf-schema#label>,
<http://www.w3.org/2004/02/skos/core#prefLabel>,
<http://www.w3.org/2004/02/skos/core#altLabel>
))
{edges_bnode_filter}
}}
"""
triples_res = await sparql.query_json(triples_q)
nodes_res = await sparql.query_json(nodes_q)
edges_res = await sparql.query_json(edges_q)
# Stats reflect exactly what we send to the frontend (/api/graph), not global graph size.
svc: GraphSnapshotService = app.state.snapshot_service
snap = await svc.get(node_limit=50_000, edge_limit=100_000)
meta = snap.meta
return StatsResponse(
backend=sparql.name,
ttl_path=settings.ttl_path,
sparql_endpoint=settings.effective_sparql_endpoint(),
parsed_triples=_count_from(triples_res),
nodes=_count_from(nodes_res),
edges=_count_from(edges_res),
backend=meta.backend if meta else app.state.sparql.name,
ttl_path=meta.ttl_path if meta and meta.ttl_path else settings.ttl_path,
sparql_endpoint=meta.sparql_endpoint if meta else None,
parsed_triples=len(snap.edges),
nodes=len(snap.nodes),
edges=len(snap.edges),
)
@@ -160,15 +108,5 @@ async def graph(
node_limit: int = Query(default=50_000, ge=1, le=200_000),
edge_limit: int = Query(default=100_000, ge=1, le=500_000),
) -> GraphResponse:
sparql: SparqlEngine = app.state.sparql
# Use SPARQL for graph export in BOTH modes so callers don't care which backend is in use.
edges_q = edge_retrieval_query(edge_limit=edge_limit, include_bnodes=settings.include_bnodes)
res = await sparql.query_json(edges_q)
bindings = (((res.get("results") or {}).get("bindings")) or [])
nodes, edges = graph_from_sparql_bindings(
bindings,
node_limit=node_limit,
include_bnodes=settings.include_bnodes,
)
return GraphResponse(nodes=nodes, edges=edges)
svc: GraphSnapshotService = app.state.snapshot_service
return await svc.get(node_limit=node_limit, edge_limit=edge_limit)

View File

@@ -8,6 +8,9 @@ class Node(BaseModel):
termType: str # "uri" | "bnode"
iri: str
label: str | None = None
# Optional because /api/nodes (RDFStore) doesn't currently provide positions.
x: float | None = None
y: float | None = None
class Edge(BaseModel):
@@ -36,8 +39,19 @@ class EdgesResponse(BaseModel):
class GraphResponse(BaseModel):
class Meta(BaseModel):
backend: str
ttl_path: str | None = None
sparql_endpoint: str | None = None
include_bnodes: bool
node_limit: int
edge_limit: int
nodes: int
edges: int
nodes: list[Node]
edges: list[Edge]
meta: Meta | None = None
class SparqlQueryRequest(BaseModel):

View File

@@ -0,0 +1,46 @@
from __future__ import annotations
from ..graph_export import edge_retrieval_query, graph_from_sparql_bindings
from ..models import GraphResponse
from ..sparql_engine import SparqlEngine
from ..settings import Settings
from .layout_spiral import spiral_positions
async def fetch_graph_snapshot(
sparql: SparqlEngine,
*,
settings: Settings,
node_limit: int,
edge_limit: int,
) -> GraphResponse:
"""
Fetch a graph snapshot (nodes + edges) via SPARQL, independent of whether the
underlying engine is RDFLib or AnzoGraph.
"""
edges_q = edge_retrieval_query(edge_limit=edge_limit, include_bnodes=settings.include_bnodes)
res = await sparql.query_json(edges_q)
bindings = (((res.get("results") or {}).get("bindings")) or [])
nodes, edges = graph_from_sparql_bindings(
bindings,
node_limit=node_limit,
include_bnodes=settings.include_bnodes,
)
# Add positions so the frontend doesn't need to run a layout.
xs, ys = spiral_positions(len(nodes))
for i, node in enumerate(nodes):
node["x"] = float(xs[i])
node["y"] = float(ys[i])
meta = GraphResponse.Meta(
backend=sparql.name,
ttl_path=settings.ttl_path if settings.graph_backend == "rdflib" else None,
sparql_endpoint=settings.effective_sparql_endpoint() if settings.graph_backend == "anzograph" else None,
include_bnodes=settings.include_bnodes,
node_limit=node_limit,
edge_limit=edge_limit,
nodes=len(nodes),
edges=len(edges),
)
return GraphResponse(nodes=nodes, edges=edges, meta=meta)

View File

@@ -0,0 +1,30 @@
from __future__ import annotations
import math
def spiral_positions(n: int, *, max_r: float = 5000.0) -> tuple[list[float], list[float]]:
"""
Deterministic "sunflower" (golden-angle) spiral layout.
This is intentionally simple and stable across runs:
- angle increments by the golden angle to avoid radial spokes
- radius grows with sqrt(i) to keep density roughly uniform over area
"""
if n <= 0:
return ([], [])
xs = [0.0] * n
ys = [0.0] * n
golden = math.pi * (3.0 - math.sqrt(5.0))
denom = float(max(1, n - 1))
for i in range(n):
t = i * golden
r = math.sqrt(i / denom) * max_r
xs[i] = r * math.cos(t)
ys[i] = r * math.sin(t)
return xs, ys

View File

@@ -0,0 +1,63 @@
from __future__ import annotations
import asyncio
from dataclasses import dataclass
from ..models import GraphResponse
from ..sparql_engine import SparqlEngine
from ..settings import Settings
from .graph_snapshot import fetch_graph_snapshot
@dataclass(frozen=True)
class SnapshotKey:
node_limit: int
edge_limit: int
include_bnodes: bool
class GraphSnapshotService:
"""
Caches graph snapshots so the backend doesn't re-run expensive SPARQL for stats/graph.
"""
def __init__(self, *, sparql: SparqlEngine, settings: Settings):
self._sparql = sparql
self._settings = settings
self._cache: dict[SnapshotKey, GraphResponse] = {}
self._locks: dict[SnapshotKey, asyncio.Lock] = {}
self._global_lock = asyncio.Lock()
async def get(self, *, node_limit: int, edge_limit: int) -> GraphResponse:
key = SnapshotKey(
node_limit=node_limit,
edge_limit=edge_limit,
include_bnodes=self._settings.include_bnodes,
)
cached = self._cache.get(key)
if cached is not None:
return cached
# Create/get a per-key lock under a global lock to avoid races.
async with self._global_lock:
lock = self._locks.get(key)
if lock is None:
lock = asyncio.Lock()
self._locks[key] = lock
async with lock:
cached2 = self._cache.get(key)
if cached2 is not None:
return cached2
snapshot = await fetch_graph_snapshot(
self._sparql,
settings=self._settings,
node_limit=node_limit,
edge_limit=edge_limit,
)
self._cache[key] = snapshot
return snapshot