docs: refresh pipeline notes and transport analysis

This commit is contained in:
Oxy8
2026-04-06 13:36:53 -03:00
parent 48ce99aac5
commit 97a30ab769
4 changed files with 967 additions and 225 deletions

View File

@@ -0,0 +1,243 @@
# Current `subClassOf` / `BFO:entity` Pipeline
This document summarizes how the repository currently builds the hierarchy that ends up in the radial Sugiyama layout, with special attention to the fact that "start from `bfo:entity`" is **not** implemented in the initial `subClassOf` query.
`bfo:entity` here means:
- `http://purl.obolibrary.org/obo/BFO_0000001`
## TL;DR
- The current code does **not** query "all `rdfs:subClassOf` relationships rooted at `bfo:entity`" directly.
- It first queries the **entire** `rdfs:subClassOf` graph.
- It builds an in-memory graph from those triples.
- Only later, in the Rust hierarchy layout bridge, is that graph filtered down to the descendant closure of the configured root IRI.
- Because of that, the "rooted at `bfo:entity`" behavior is currently coupled to the layout pipeline instead of existing as a reusable graph-extraction stage.
## Where The Request Starts
The frontend loads the hierarchy through the normal graph endpoint:
1. `frontend/src/App.tsx`
2. `GET /api/graph?graph_query_id=hierarchy`
3. `backend_go/server.go` -> `handleGraph`
4. `backend_go/snapshot_service.go` -> `Get`
5. `backend_go/graph_snapshot.go` -> `fetchGraphSnapshot`
Important consequence:
- The hierarchy is treated as a graph snapshot mode, not as a dedicated "query descendants of this root" pipeline.
## The Actual SPARQL Query Used For `hierarchy`
The `hierarchy` graph query is defined in:
- `backend_go/graph_queries/hierarchy.go`
It effectively does:
```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?p ?o
WHERE {
VALUES ?p { rdfs:subClassOf }
?s ?p ?o .
FILTER(!isLiteral(?o))
# optionally also FILTER(!isBlank(?s) && !isBlank(?o))
}
ORDER BY ?s ?p ?o
LIMIT ...
OFFSET ...
```
Important facts:
- It queries **all** `rdfs:subClassOf` triples.
- There is **no root restriction** here.
- There is **no `bfo:entity` filter** here.
- Blank nodes are excluded unless `INCLUDE_BNODES=true`.
- Objects that are literals are excluded.
## How The In-Memory Graph Is Built
Graph construction is handled by:
- `backend_go/graph_export.go`
The accumulator logic works like this:
- Every returned `?s` and `?o` becomes a node if it has not been seen before.
- There is no separate node query.
- A class only enters the graph if it appears in at least one fetched edge.
- Isolated classes with no fetched `subClassOf` edge never appear.
- If `node_limit` is reached, new nodes stop being added, and edges that depend on them are skipped.
Edge direction at this stage is:
- `Source = subclass (?s)`
- `Target = superclass (?o)`
So the raw in-memory graph is stored as:
- `subclass -> superclass`
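The accumulator behavior described above can be sketched as follows. This is a minimal illustration, not the exact code in `backend_go/graph_export.go`; struct and function names here are hypothetical:

```go
package main

import "fmt"

// Edge mirrors the stored orientation: subclass -> superclass.
type Edge struct{ Source, Target string }

// accumulate mirrors the described behavior: every returned ?s / ?o becomes a
// node on first sight, and once nodeLimit is reached, new nodes stop being
// added and edges that depend on them are skipped.
func accumulate(triples [][2]string, nodeLimit int) (nodes []string, edges []Edge) {
	seen := map[string]bool{}
	addNode := func(id string) bool {
		if seen[id] {
			return true
		}
		if len(nodes) >= nodeLimit {
			return false // node_limit reached: stop adding new nodes
		}
		seen[id] = true
		nodes = append(nodes, id)
		return true
	}
	for _, t := range triples {
		sub, sup := t[0], t[1]
		// Keep the edge only if both endpoints made it into the node set.
		if addNode(sub) && addNode(sup) {
			edges = append(edges, Edge{Source: sub, Target: sup})
		}
	}
	return nodes, edges
}

func main() {
	nodes, edges := accumulate([][2]string{{"B", "A"}, {"C", "A"}, {"D", "C"}}, 3)
	fmt.Println(len(nodes), len(edges))
}
```

Note that a class with no fetched `subClassOf` edge never enters `triples` at all, which is why isolated classes never appear in the snapshot.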
## Where `BFO:entity` Is Actually Applied
The root restriction happens only when the backend chooses the Rust hierarchy layout path.
Relevant files:
- `backend_go/config.go`
- `.env`
- `backend_go/graph_snapshot.go`
- `backend_go/hierarchy_layout_bridge.go`
- `radial_sugiyama/src/bridge.rs`
Current behavior:
- `.env` sets `HIERARCHY_LAYOUT_ENGINE=rust`.
- If `graph_query_id == "hierarchy"` and the engine is `rust`, the backend calls the Rust bridge.
- The root IRI comes from `HIERARCHY_LAYOUT_ROOT_IRI`.
- If that env var is not set, the checked-in default is `http://purl.obolibrary.org/obo/BFO_0000001`.
This means the current repository behavior is effectively:
- query all `subClassOf`
- then filter to descendants of `BFO:entity`
- then lay out the filtered graph
## What Go Sends To Rust
Before calling Rust, Go rewrites the edge orientation in:
- `backend_go/hierarchy_layout_bridge.go`
It converts each stored edge from:
- `subclass -> superclass`
into:
- `parentID = superclass`
- `childID = subclass`
So the Rust side receives:
- `superclass -> subclass`
Go also:
- de-duplicates repeated parent/child edges
- sends the configured `root_iri`
- sends all nodes that were present in the fetched hierarchy graph
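The reorientation plus de-duplication step can be sketched like this (struct and function names are illustrative, not the exact ones in `backend_go/hierarchy_layout_bridge.go`):

```go
package main

import "fmt"

// Edge is stored as subclass -> superclass in the snapshot graph.
type Edge struct{ Source, Target string }

// BridgeEdge is what the Rust side receives: superclass -> subclass.
type BridgeEdge struct{ ParentID, ChildID string }

// toBridgeEdges flips the orientation and de-duplicates repeated
// parent/child pairs before the request is sent to Rust.
func toBridgeEdges(edges []Edge) []BridgeEdge {
	seen := map[BridgeEdge]bool{}
	var out []BridgeEdge
	for _, e := range edges {
		be := BridgeEdge{ParentID: e.Target, ChildID: e.Source}
		if !seen[be] {
			seen[be] = true
			out = append(out, be)
		}
	}
	return out
}

func main() {
	in := []Edge{{"B", "A"}, {"B", "A"}, {"C", "A"}}
	fmt.Println(toBridgeEdges(in))
}
```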
## How Rust Filters To Descendants Of The Root
Filtering happens in:
- `radial_sugiyama/src/bridge.rs`
The bridge logic does this:
1. Build an internal graph from the request.
2. Find the node whose label/IRI matches `root_iri`.
3. Build adjacency lists in the `parent -> child` direction.
4. Run a BFS/queue traversal starting at the root.
5. Keep only the visited nodes.
6. Keep only edges whose endpoints are both visited.
7. Run radial Sugiyama layout on that filtered subgraph.
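The filtering steps above amount to a plain BFS over the `parent -> child` adjacency. A minimal sketch, written in Go for brevity (the actual implementation is the Rust code in `radial_sugiyama/src/bridge.rs`):

```go
package main

import "fmt"

// filterToDescendants keeps only nodes reachable from root by following
// parent -> child edges, mirroring the bridge's BFS/queue traversal.
func filterToDescendants(children map[string][]string, root string) map[string]bool {
	visited := map[string]bool{root: true}
	queue := []string{root}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		for _, c := range children[n] {
			if !visited[c] {
				visited[c] = true
				queue = append(queue, c)
			}
		}
	}
	return visited
}

func main() {
	children := map[string][]string{
		"entity":   {"continuant", "occurrent"},
		"orphan":   {"island"},  // disconnected component: dropped
		"ancestor": {"entity"},  // ancestor of the root: also dropped
	}
	kept := filterToDescendants(children, "entity")
	fmt.Println(len(kept))
}
```

Edges are then retained only when both endpoints are in the `visited` set, which is what drops everything outside the root's descendant closure.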
Important consequences:
- Nodes outside the descendant closure of the root are dropped.
- Disconnected components are dropped.
- Ancestors of the root are not kept unless they are also reachable as descendants, which they normally are not.
- If the root is missing, the pipeline errors.
- If the root has no descendants, the pipeline errors.
So the actual "select only those starting from `bfo:entity`" logic is:
- **graph traversal after fetching the full hierarchy**
not:
- **root-constrained SPARQL**
## What Comes Back From Rust
After Rust finishes:
- only the filtered nodes are returned
- only edges between retained nodes are returned
- routed edge segments are returned for drawing
That filtering is applied back onto the original Go snapshot response, so the final `/api/graph?graph_query_id=hierarchy` response only contains the root-descendant subgraph when the Rust path is active.
## Why This Feels Like A Separate Pipeline
The main reason it feels split is that the current behavior crosses multiple stages:
1. SPARQL query stage fetches the whole `subClassOf` graph.
2. Graph materialization stage builds a generic snapshot graph.
3. Layout bridge stage applies the root restriction.
4. Layout stage computes coordinates.
This means the "hierarchy rooted at `BFO:entity`" concept is currently embedded in layout preparation instead of existing as a first-class reusable data pipeline.
In practice, the root filtering is:
- not reusable by itself through a dedicated backend API
- not expressed in the initial SPARQL query
- not controlled per request
- tied to the hierarchy layout engine choice
## Selection Queries Are A Different Mechanism
The repository also has separate selection-query endpoints:
- `backend_go/selection_queries/subclasses.go`
- `backend_go/selection_queries/superclasses.go`
- `backend_go/selection_queries/neighbors.go`
Those are used after nodes are already present in a graph snapshot and the user selects node IDs.
They are **not** the mechanism that initially builds the `BFO:entity` hierarchy used by the radial layout.
Their role is more like:
- "given selected node IDs in the current snapshot, query related triples"
not:
- "materialize the hierarchy rooted at `BFO:entity`"
## Current End-To-End Behavior In One Sentence
The current system gets all `rdfs:subClassOf` triples first, constructs a general hierarchy graph, and only then filters it to the descendants of `http://purl.obolibrary.org/obo/BFO_0000001` inside the Rust radial Sugiyama bridge.
## Files To Read When Rewriting
If you want to rewrite this from zero, these are the main files that define the current behavior:
- `backend_go/server.go`
- `backend_go/snapshot_service.go`
- `backend_go/graph_snapshot.go`
- `backend_go/graph_queries/hierarchy.go`
- `backend_go/graph_export.go`
- `backend_go/hierarchy_layout_bridge.go`
- `backend_go/config.go`
- `.env`
- `radial_sugiyama/src/bridge.rs`
- `backend_go/selection_queries/subclasses.go`
- `backend_go/selection_queries/superclasses.go`
## Rewrite-Oriented Takeaway
If your goal is a cleaner standalone pipeline for:
- query `rdfs:subClassOf`
- start from `bfo:entity`
- materialize only the rooted descendant hierarchy
then the current codebase is doing the root restriction too late. Right now, that concern lives in the layout bridge rather than in the query/materialization layer.

View File

@@ -0,0 +1,655 @@
# Graph Transport Alternatives
## Purpose
This document compares alternatives to the current `/api/graph` transport format with two goals:
1. reduce the cost of building, transferring, and decoding very large graph payloads
2. move the frontend transport shape closer to the renderer/GPU input shape while preserving all data the current frontend and backend pipeline still need
This analysis is based on the current repo state plus official documentation for browser fetch/streaming and candidate transport formats.
## Executive Summary
The current bottleneck is not the renderer's typed-array path. It is the browser's need to fully materialize a huge JSON object graph before the renderer ever runs.
The best candidates for this repo are:
1. **Custom binary columnar payload**
- Best fit for the current renderer.
- Lowest decode overhead.
- Most direct path from backend memory to frontend typed arrays.
- Requires custom protocol/versioning work.
2. **Apache Arrow IPC**
- Best off-the-shelf columnar binary format.
- Very good fit for typed-array-heavy rendering.
- Strong option if you want a standard format instead of inventing one.
- Heavier conceptual/tooling footprint than a custom binary envelope.
3. **Columnar JSON**
- Easiest migration.
- Better than today's row-oriented JSON.
- Still fundamentally JSON, so it does not remove the browser's JSON parse/object-materialization cost.
4. **NDJSON / streamed chunked JSON**
- Good if progressiveness matters.
- Better than one giant monolithic JSON document.
- Still weaker than a binary/columnar format for this renderer.
The strongest overall recommendation is:
- **Long-term**: custom binary columnar payload or Arrow IPC
- **Low-risk interim**: columnar JSON, possibly with chunking/streaming
Not recommended as the primary solution for this repo:
- row-oriented MessagePack
- Protocol Buffers as one giant message
## Verified Current Pipeline
### Backend side
The backend builds a `GraphResponse` and caches it in memory:
- `backend_go/models.go`
- `backend_go/snapshot_service.go`
- `backend_go/graph_snapshot.go`
The response shape is:
```go
type GraphResponse struct {
Nodes []Node
Edges []Edge
RouteSegments []RouteSegment
Meta *GraphMeta
}
```
and it is currently written as one JSON document with:
```go
json.NewEncoder(w).Encode(v)
```
in `backend_go/http_helpers.go`.
### Frontend side
The frontend currently does:
1. `fetch("/api/graph?...")`
2. `await graphRes.json()`
3. read `graph.nodes`, `graph.edges`, `graph.route_segments`, `graph.meta`
4. build:
- `Float32Array xs`
- `Float32Array ys`
- `Uint32Array vertexIds`
- `Uint32Array edgeData`
- `Float32Array routeLineVertices`
5. call `renderer.init(xs, ys, vertexIds, edgeData, routeLineVertices)`
Relevant files:
- `frontend/src/App.tsx`
- `frontend/src/renderer.ts`
This means the current browser path is:
- wire bytes
- JSON text/body handling
- JS arrays of node/edge objects
- typed arrays
- renderer-side typed arrays/maps/GPU buffers
The expensive part happens before step 4.
## Verified Data Access Audit
This section verifies every field currently produced by the backend and whether it is actually needed by the frontend transport.
### Main graph response fields
| Field | Produced in backend | Used by frontend? | Where used | Required on wire for current UX? | Notes |
| --- | --- | --- | --- | --- | --- |
| `nodes[].id` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `vertexIds`, and to map selected renderer indices back to backend IDs for selection queries. |
| `nodes[].x` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `xs`. |
| `nodes[].y` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `ys`. |
| `nodes[].iri` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes, if keeping current hover UX | Used for hover tooltip text. |
| `nodes[].label` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes, if keeping current hover UX | Used for hover tooltip text. |
| `nodes[].termType` | `backend_go/models.go` | No frontend use | none in `frontend/src` | No | Still needed internally by backend snapshot/selection index. |
| `edges[].source` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `edgeData`. |
| `edges[].target` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `edgeData`. |
| `edges[].predicate_id` | `backend_go/models.go` | No main-graph frontend use | none in `frontend/src/App.tsx` | No | Still needed internally by backend snapshot and hierarchy layout preparation. |
| `route_segments[].points` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes when route segments are present | Used to build `routeLineVertices`. |
| `route_segments[].edge_index` | `backend_go/models.go` | Not used after parsing | `graphRouteSegmentArray` validation only | No | Could be dropped from frontend transport if route lines are pre-flattened. |
| `route_segments[].kind` | `backend_go/models.go` | Not used after parsing | `graphRouteSegmentArray` validation only | No | Could be dropped from frontend transport if route lines are pre-flattened. |
| `meta.backend` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.nodes` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.edges` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.graph_query_id` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.node_limit` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.edge_limit` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.ttl_path` | `backend_go/models.go` | No | none in `frontend/src` | No | Frontend type declares it, but current UI does not use it. |
| `meta.sparql_endpoint` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.include_bnodes` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.layout_engine` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.layout_root_iri` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.predicates` | `backend_go/models.go` | No frontend use | none in `frontend/src` | No | Still used internally by backend selection/hierarchy logic. |
### Backend-internal fields that do not need to stay in the frontend transport
This is the most important audit result.
The backend currently reuses one struct for:
- internal cached snapshot
- HTTP response payload
That is convenient, but it means the frontend receives fields that only the backend needs.
Verified internal-only dependencies:
- `snapshot.Nodes[].TermType` is used in `backend_go/selection_query.go` to build the selection index.
- `snapshot.Meta.Predicates` is used in `backend_go/selection_query.go`.
- `Edge.PredicateID` is used internally for hierarchy layout preparation in `backend_go/hierarchy_layout_bridge.go`.
The frontend does **not** need those fields for current behavior.
### What the frontend actually needs
For the current graph view, the hot path can be reduced to:
- `vertexIds[]`
- `xs[]`
- `ys[]`
- `edgeSources[]`
- `edgeTargets[]`
- `routeLineVertices[]` or route geometry equivalent
- `label[]` and `iri[]` by node index
- `meta.backend`
- `meta.nodes`
- `meta.edges`
- `meta.graph_query_id`
- `meta.node_limit`
- `meta.edge_limit`
That is much closer to a columnar or binary payload than to the current array-of-objects JSON.
## Why the Current JSON Path Hurts
`Response.json()` is not just a lightweight decode helper. MDN states that `Response.json()` reads the stream to completion and resolves with the result of parsing the body text as JSON into a JavaScript object.
That matters here because the current payload is row-oriented:
- millions of node objects
- millions of edge objects
Even though the renderer later wants typed arrays, the browser must first create those JS objects.
This is exactly the part that can stall or run out of memory before `renderer.init(...)` starts.
## Alternatives
### 1. Columnar JSON
#### Idea
Keep JSON, but change the schema from row-oriented objects:
```json
{
"nodes": [{ "id": 1, "x": 0.1, "y": 0.2, ... }],
"edges": [{ "source": 1, "target": 2, ... }]
}
```
to column-oriented arrays:
```json
{
"vertex_ids": [...],
"xs": [...],
"ys": [...],
"edge_sources": [...],
"edge_targets": [...],
"node_labels": [...],
"node_iris": [...],
"route_line_vertices": [...],
"meta": { ... }
}
```
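A minimal Go sketch of producing such a columnar DTO from row-oriented data (struct names here are illustrative, not the exact ones in `backend_go/models.go`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical row shapes standing in for the current backend structs.
type Node struct {
	ID    uint32
	X, Y  float32
	Label string
	IRI   string
}
type Edge struct{ Source, Target uint32 }

// ColumnarGraph is one possible columnar DTO for /api/graph.
type ColumnarGraph struct {
	VertexIDs   []uint32  `json:"vertex_ids"`
	Xs          []float32 `json:"xs"`
	Ys          []float32 `json:"ys"`
	EdgeSources []uint32  `json:"edge_sources"`
	EdgeTargets []uint32  `json:"edge_targets"`
	NodeLabels  []string  `json:"node_labels"`
	NodeIRIs    []string  `json:"node_iris"`
}

// toColumnar transposes rows into parallel columns indexed by position.
func toColumnar(nodes []Node, edges []Edge) ColumnarGraph {
	g := ColumnarGraph{}
	for _, n := range nodes {
		g.VertexIDs = append(g.VertexIDs, n.ID)
		g.Xs = append(g.Xs, n.X)
		g.Ys = append(g.Ys, n.Y)
		g.NodeLabels = append(g.NodeLabels, n.Label)
		g.NodeIRIs = append(g.NodeIRIs, n.IRI)
	}
	for _, e := range edges {
		g.EdgeSources = append(g.EdgeSources, e.Source)
		g.EdgeTargets = append(g.EdgeTargets, e.Target)
	}
	return g
}

func main() {
	b, _ := json.Marshal(toColumnar(
		[]Node{{ID: 1, X: 0.1, Y: 0.2, Label: "entity", IRI: "obo:BFO_0000001"}},
		[]Edge{{Source: 1, Target: 1}},
	))
	fmt.Println(string(b))
}
```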
#### Pros
- easiest migration from the current API contract
- no schema compiler
- easy to debug with ordinary tooling
- much closer to what the renderer already consumes
- avoids creating per-edge objects in frontend application code
#### Cons
- still goes through JSON parsing
- still materializes JS arrays before typed arrays are built
- huge numeric arrays in JSON are still text, not binary
- string columns are still ordinary JS strings
#### Fit for current pipeline
Good.
No current frontend feature would be lost if the payload includes:
- ids/xs/ys/edge sources/targets
- labels/iris
- route line vertices or equivalent
- the small subset of meta fields currently used
#### Overall assessment
Best low-risk intermediate step.
It is clearly better than today's row-oriented JSON, but it is not the endgame if the goal is to remove the parse bottleneck for 1 GB+ payloads.
### 2. NDJSON / Chunked JSON
#### Idea
Change the backend to stream multiple JSON records instead of one giant JSON object.
Examples:
- one line per chunk of nodes/edges
- one line for metadata
- one line per route segment chunk
NDJSON is explicitly designed for transporting multiple JSON texts in a stream protocol.
#### Pros
- can start processing before the whole payload arrives
- better observability and progress reporting
- easier cancellation/retry semantics
- avoids one monolithic `Response.json()` boundary
#### Cons
- record-per-edge NDJSON would still create far too many JS objects
- to be worth it here, it should be **chunked columnar NDJSON**, not row NDJSON
- frontend load path must become stream-based
- renderer still currently expects all arrays at once
#### Fit for current pipeline
Moderate.
It can preserve all current information, but it does not by itself solve the "final representation should look like GPU inputs" goal unless each chunk is already columnar.
#### Best shape if chosen
Not:
- one JSON object per edge
- one JSON object per node
Better:
- one NDJSON record for metadata
- then NDJSON records where each record contains columnar chunks:
- `vertex_ids_chunk`
- `xs_chunk`
- `ys_chunk`
- `edge_sources_chunk`
- `edge_targets_chunk`
#### Overall assessment
Viable, but only attractive if progressiveness is a major goal. On its own, it is weaker than columnar binary formats for this renderer.
### 3. MessagePack
#### Idea
Use a compact binary encoding instead of JSON.
The official JavaScript implementation supports:
- `encode`
- `decode`
- `decodeAsync(stream)`
- `decodeArrayStream(stream)`
- `decodeMultiStream(stream)`
and even custom extension types for faster handling of large `Float32Array` payloads.
#### Pros
- smaller payload than JSON
- binary transport
- async and stream-capable decoding APIs exist
- mature JS library
#### Cons
- if you keep the current row-oriented schema, you still get one huge object graph after decode
- therefore MessagePack alone does not remove the fundamental object-allocation problem
- custom extension types improve typed-array cases, but then you are already halfway to designing a custom binary protocol
#### Fit for current pipeline
Moderate.
It can preserve all current information easily.
But if the schema remains object-heavy, the browser still ends up with millions of JS objects.
#### Overall assessment
Useful if paired with a **columnar** schema. Not compelling as a first move if the schema stays row-oriented.
### 4. Apache Arrow IPC
#### Idea
Use Arrow's columnar binary format and Arrow JS support.
Arrow JS provides:
- `tableFromIPC(...)`
- support for `fetch(...)`
- typed-array-backed vectors
- dictionary-encoded strings
- a columnar memory model explicitly meant for efficient processing and movement of large in-memory data
#### Pros
- strongest off-the-shelf fit for typed-array-oriented rendering
- columnar by design
- binary rather than textual
- supports large numeric columns very naturally
- supports dictionary encoding for repeated strings like labels or IRIs
- much closer to the renderer/GPU input shape than JSON objects
#### Cons
- larger conceptual/tooling jump than columnar JSON
- route segments are nested/variable-length; representing them cleanly needs design
- frontend code becomes Arrow-aware unless the decode is hidden behind an adapter
- backend must serialize Arrow on the Go side or produce Arrow-compatible IPC
#### Fit for current pipeline
Very good.
Current frontend needs can be represented as columns:
- `vertex_ids: uint32`
- `xs: float32`
- `ys: float32`
- `edge_sources: uint32`
- `edge_targets: uint32`
- `labels: utf8` or dictionary-encoded utf8
- `iris: utf8` or dictionary-encoded utf8
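The dictionary-encoding idea for labels and IRIs can be illustrated without any Arrow dependency; this is a plain Go sketch of the concept, not Arrow's actual implementation:

```go
package main

import "fmt"

// dictEncode replaces repeated strings with indices into a unique-value
// dictionary, the same idea Arrow applies to dictionary-encoded utf8 columns.
func dictEncode(values []string) (indices []uint32, dict []string) {
	pos := map[string]uint32{}
	for _, v := range values {
		i, ok := pos[v]
		if !ok {
			i = uint32(len(dict))
			pos[v] = i
			dict = append(dict, v)
		}
		indices = append(indices, i)
	}
	return indices, dict
}

func main() {
	idx, dict := dictEncode([]string{"Class", "Class", "NamedIndividual", "Class"})
	fmt.Println(idx, dict)
}
```

For IRI columns with heavy prefix repetition, this kind of encoding can shrink the payload substantially while keeping lookups as cheap as an index into the dictionary.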
Route geometry should probably not stay as nested route-segment objects. It would fit better as:
- pre-flattened `route_line_vertices` float column/buffer
- or a second Arrow table dedicated to line segments
#### Overall assessment
One of the two best solutions for this repo.
If you want a standard format instead of inventing one, Arrow is the most attractive candidate.
### 5. FlatBuffers
#### Idea
Use a schema-defined binary format designed for direct access without unpacking/parsing.
FlatBuffers explicitly advertises:
- access to serialized data without parsing/unpacking
- memory efficiency and speed
- forwards/backwards compatibility
#### Pros
- very strong memory-efficiency story
- schema evolution support
- no full parse/unpack step in the same way as JSON
- can model both scalars and more complex structures
#### Cons
- requires schema/compiler/generated bindings
- JavaScript integration is more manual than JSON or Arrow
- ergonomics in app code are not as simple as arrays/objects
- strings and nested route structures are supported, but the developer experience is more specialized
#### Fit for current pipeline
Good, technically.
It can preserve all current information and remove the giant object-graph parse step.
However, compared with Arrow or a custom binary envelope, it is a less natural conceptual fit for a renderer whose hot path is already columnar/typed-array-based.
#### Overall assessment
A strong technical option, but probably not the most ergonomic option for this specific frontend.
### 6. Protocol Buffers
#### Idea
Use a schema-defined binary format with generated bindings.
#### Pros
- compact binary encoding
- schema/versioning
- mature ecosystem
#### Cons
- official docs describe protobuf as a good fit for typed structured messages up to a few megabytes
- the same docs warn that large data can require loading entire messages into memory and can cause multiple copies
- large repeated numeric arrays are not protobuf's sweet spot
- still not especially close to the renderer's typed-array model
#### Fit for current pipeline
Poor for this specific payload size and shape.
#### Overall assessment
Not recommended for this main graph transport.
### 7. Custom Binary Typed-Array Envelope
#### Idea
Define a transport specifically around what the renderer and hover/selection pipeline need.
Example structure:
- small fixed header or small JSON header:
- version
- counts
- offsets/lengths
- meta subset
- then raw binary buffers:
- `vertex_ids`
- `xs`
- `ys`
- `edge_sources`
- `edge_targets`
- `route_line_vertices`
- string dictionary / offsets for `label` and `iri`
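A minimal sketch of the envelope's write side using Go's `encoding/binary`; the section layout, header fields, and function name are all hypothetical:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// writeEnvelope emits an illustrative layout: a tiny fixed header
// (version + node count), then raw little-endian column buffers.
func writeEnvelope(vertexIDs []uint32, xs, ys []float32) ([]byte, error) {
	var buf bytes.Buffer
	header := []uint32{1 /* version */, uint32(len(vertexIDs))}
	for _, section := range []any{header, vertexIDs, xs, ys} {
		if err := binary.Write(&buf, binary.LittleEndian, section); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := writeEnvelope([]uint32{1, 2}, []float32{0.1, 0.3}, []float32{0.2, 0.4})
	if err != nil {
		panic(err)
	}
	// header (8 bytes) + ids (8) + xs (8) + ys (8)
	fmt.Println(len(b))
}
```

On the frontend, the fetched `ArrayBuffer` could then be viewed directly as `Uint32Array`/`Float32Array` slices at known offsets, with no JSON parsing and no per-row object allocation.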
#### Pros
- closest possible fit to current renderer
- no schema compiler required
- no row-object materialization
- easiest path to zero-copy or near-zero-copy arrays on the frontend
- easiest path to worker transfer via `ArrayBuffer`
- can separate hot render data from cold metadata cleanly
#### Cons
- custom protocol to design, version, validate, and document
- less tooling/interoperability than Arrow
- backend and frontend both need careful binary codecs
#### Fit for current pipeline
Excellent.
You can preserve all current behavior while only sending the data the frontend actually uses.
#### Overall assessment
The best performance-oriented fit if you are comfortable owning a custom format.
## Comparison Table
| Option | Closeness to GPU shape | Avoids giant object graph | Supports all current frontend data | Streaming-friendly | Implementation cost | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| Current row JSON | Poor | No | Yes | Poor | Already done | Replace |
| Columnar JSON | Medium | No | Yes | Medium | Low | Good interim |
| NDJSON chunked columnar JSON | Medium | Partially | Yes | Good | Medium | Situational |
| MessagePack row-oriented | Poor | No | Yes | Good | Medium | Not enough alone |
| MessagePack columnar | Medium | Partially | Yes | Good | Medium | Viable but secondary |
| Arrow IPC | Very high | Yes or mostly yes | Yes | Good | Medium-high | Strong candidate |
| FlatBuffers | High | Yes | Yes | Medium | High | Good but specialized |
| Protobuf | Low-medium | No practical win here | Yes | Medium | Medium-high | Not recommended |
| Custom binary typed-array envelope | Very high | Yes | Yes | Good | High | Strongest fit |
## Recommended Data Contract Shapes
### Recommended shape for any non-row-oriented solution
The frontend does not need node/edge objects as its primary graph transport.
The main graph payload should be modeled as:
- `vertex_ids`
- `xs`
- `ys`
- `edge_sources`
- `edge_targets`
- `route_line_vertices`
- `node_labels`
- `node_iris`
- `meta`
This can be represented as:
- columnar JSON
- Arrow columns
- FlatBuffers vectors
- custom binary sections
### Fields that can be removed from the frontend transport immediately
Without changing current visible behavior, the main graph transport does not need to include:
- `nodes[].termType`
- `edges[].predicate_id`
- `meta.predicates`
- `meta.ttl_path`
- `meta.sparql_endpoint`
- `meta.include_bnodes`
- `meta.layout_engine`
- `meta.layout_root_iri`
- `route_segments[].edge_index`
- `route_segments[].kind`
Important:
Some of those fields are still needed by the backend's **internal snapshot**, especially for selection queries and hierarchy layout. That argues for splitting:
- internal snapshot model
- frontend transport DTO
instead of continuing to reuse one struct for both.
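The split could look like this; field sets are illustrative, trimmed per the audit above, and the real struct lives in `backend_go/models.go`:

```go
package main

import "fmt"

// SnapshotNode keeps everything the backend needs internally,
// including fields the frontend never uses.
type SnapshotNode struct {
	ID       uint32
	X, Y     float32
	Label    string
	IRI      string
	TermType string // backend-only: feeds the selection index
}

// TransportNode keeps only what the frontend actually consumes.
type TransportNode struct {
	ID    uint32  `json:"id"`
	X     float32 `json:"x"`
	Y     float32 `json:"y"`
	Label string  `json:"label"`
	IRI   string  `json:"iri"`
}

// toTransport projects the internal snapshot shape down to the DTO.
func toTransport(n SnapshotNode) TransportNode {
	return TransportNode{ID: n.ID, X: n.X, Y: n.Y, Label: n.Label, IRI: n.IRI}
}

func main() {
	s := SnapshotNode{ID: 7, Label: "entity", IRI: "obo:BFO_0000001", TermType: "IRI"}
	fmt.Printf("%+v\n", toTransport(s))
}
```

The same projection idea applies regardless of whether the DTO stays row-oriented, goes columnar, or becomes binary: the backend cache keeps the rich struct, and only the projection crosses the wire.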
## Additional Architectural Notes
### A worker is complementary, not a transport format
Web Workers can move parsing/build work off the main thread, and `ArrayBuffer` is transferable. That is useful, but it does not by itself solve the current over-allocation problem if the payload is still a giant row-oriented JSON document.
Workers are most valuable when paired with:
- binary columnar payloads
- streamed columnar chunks
- transfer of `ArrayBuffer`s rather than giant JS object graphs
### The backend can keep a richer internal snapshot than it sends
This repo already caches snapshots server-side. Selection and triple queries are built from the backend snapshot and the small `graphMeta` values sent back by the client.
That means the frontend transport can be much slimmer than the backend snapshot representation, as long as the backend retains its richer internal data.
This is the cleanest way to avoid losing information while optimizing the frontend transport.
## Final Recommendation
### Best long-term option
Pick one of:
1. **Custom binary typed-array envelope**
2. **Apache Arrow IPC**
Reason:
- both map naturally to the renderer's actual input model
- both avoid the giant row-object parse path
- both can preserve all current frontend-visible information
### Best low-risk migration path
If you want an incremental step before going binary:
1. split backend internal snapshot from frontend transport DTO
2. move `/api/graph` to **columnar JSON**
3. keep only the metadata fields the frontend actually uses
4. later replace the same columnar DTO with Arrow or custom binary
That path reduces waste immediately and keeps the eventual binary migration straightforward.
## Sources
Official documentation and primary sources used for the comparison:
- MDN `Response.json()`
- https://developer.mozilla.org/en-US/docs/Web/API/Response/json
- MDN `TextDecoderStream`
- https://developer.mozilla.org/en-US/docs/Web/API/TextDecoderStream
- MDN Web Workers
- https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
- MDN Transferable Objects
- https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Transferable_objects
- Apache Arrow JavaScript
- https://arrow.apache.org/js/current/
- https://arrow.apache.org/js/main/functions/Arrow.dom.tableFromIPC.html
- NDJSON specification
- https://github.com/ndjson/ndjson-spec
- MessagePack for JavaScript
- https://github.com/msgpack/msgpack-javascript
- FlatBuffers overview and JavaScript docs
- https://flatbuffers.dev/
- https://flatbuffers.dev/languages/javascript/
- Protocol Buffers overview
- https://protobuf.dev/overview/
- Streaming JSON parser references
- https://github.com/juanjoDiaz/streamparser-json
- https://rictic.github.io/jsonriver/

README.md
View File

@@ -1,108 +1,96 @@
# Visualizador Instanciados
This repo is a Docker Compose stack for visualizing large RDF/OWL graphs stored in **AnzoGraph**. It includes:
Docker Compose stack for exploring large RDF/OWL graphs stored in AnzoGraph.
- A **Go backend** that queries AnzoGraph via SPARQL and serves a cached graph snapshot + selection queries.
- A **React/Vite frontend** that renders nodes/edges with WebGL2 and supports “selection query” + “graph query” modes.
- A **Python one-shot service** to combine `owl:imports` into a single Turtle file.
- An **AnzoGraph** container (SPARQL endpoint).
## What Runs Here
## Quick start (Docker Compose)
- `anzograph`: SPARQL store
- `backend`: Go API that queries AnzoGraph and serves cached graph snapshots
- `frontend`: React/Vite app with a WebGL left graph and a right-side `cosmos.gl` selection graph
- `owl_imports_combiner`: one-shot Python service that can merge `owl:imports`
- `radial_sugiyama`: Rust hierarchy layout pipeline used in two ways:
- standalone SVG generator through the `radial` Compose profile
- optional hierarchy layout engine for the Go backend
1) Put your TTL file(s) in `./data/` (this folder is volume-mounted into AnzoGraph as `/opt/shared-files`).
2) Optionally configure `.env` (see `.env.example`).
3) Start the stack:
## Current Flow
- The backend always builds graph snapshots from SPARQL queries against AnzoGraph.
- `graph_query_id=default` and `graph_query_id=types` use the Go layout path.
- `graph_query_id=hierarchy` can use either:
- the Go layout path
- the Rust radial Sugiyama path when `HIERARCHY_LAYOUT_ENGINE=rust`
- When the Rust hierarchy path is enabled, the backend sends the hierarchy graph to Rust over JSON, Rust lays it out, returns node positions + routed edge segments, and also rewrites:
```text
radial_sugiyama/out/layout.svg
```
That SVG is a debug artifact for the exact Rust layout run used by the backend.
## Quick Start
1. Put your TTL files under `./data/`.
2. Copy or edit `.env` as needed.
3. Start the stack:
```bash
docker compose up --build
```
Then open the frontend:
Open:
- `http://localhost:5173`
- Frontend: `http://localhost:5173`
- Backend health: `http://localhost:8000/api/health`
Stop everything:
Stop:
```bash
docker compose down
```
## Services

Defined in `docker-compose.yml`:

- `anzograph` (image `cambridgesemantics/anzograph:latest`)
  - Ports: `8080`, `8443`
  - Shared files: `./data → /opt/shared-files`
- `backend` (`./backend_go`)
  - Port: `8000` (API under `/api/*`)
  - Talks to AnzoGraph at `SPARQL_HOST` / `SPARQL_ENDPOINT`
- `frontend` (`./frontend`)
  - Port: `5173`
  - Proxies `/api/*` to `VITE_BACKEND_URL`
- `owl_imports_combiner` (`./python_services/owl_imports_combiner`)
  - One-shot: optionally produces a combined TTL by following `owl:imports`

Service READMEs:

- `backend_go/README.md`
- `frontend/README.md`
- `python_services/owl_imports_combiner/README.md`
- `anzograph/README.md`

## Repo layout

- `backend_go/` Go API service (SPARQL → snapshot + selection queries)
- `frontend/` React/Vite WebGL renderer
- `radial_sugiyama/` Rust hierarchy layout and SVG export
- `python_services/owl_imports_combiner/` Python one-shot OWL imports combiner
- `data/` local shared volume for TTL inputs/outputs (gitignored)
- `docker-compose.yml` service wiring
- `flake.nix` optional Nix dev shell

## Rust Hierarchy Layout

To use Rust for the `hierarchy` graph mode, set this in the repo root `.env`:

```env
HIERARCHY_LAYOUT_ENGINE=rust
```

The backend also reads `radial_sugiyama/.env` for the Rust layout settings such as:

- `RADIAL_ROOT_CLASS_IRI`
- `RADIAL_OUTPUT_DIR`
- `RADIAL_OUTPUT_FILE`
- `RADIAL_RING_DISTRIBUTION`

The debug SVG for backend-driven hierarchy requests is written to:

```text
radial_sugiyama/out/layout.svg
```
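As an illustration, the `RADIAL_*` settings above might be populated like this — every value is an assumption, not a repo default (the root IRI follows the `bfo:entity` class used elsewhere in these docs, and `adaptive` is one of the ring projection modes the Rust pipeline supports):

```env
RADIAL_ROOT_CLASS_IRI=http://purl.obolibrary.org/obo/BFO_0000001
RADIAL_OUTPUT_DIR=out
RADIAL_OUTPUT_FILE=layout.svg
RADIAL_RING_DISTRIBUTION=adaptive
```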
## Configuration
This repo expects a local `.env` file (not committed). Start from `.env.example`.
Common knobs:
- Backend snapshot size: `DEFAULT_NODE_LIMIT`, `DEFAULT_EDGE_LIMIT`, `MAX_NODE_LIMIT`, `MAX_EDGE_LIMIT`
- SPARQL connectivity: `SPARQL_HOST` or `SPARQL_ENDPOINT`, plus `SPARQL_USER` / `SPARQL_PASS`
- Load data on backend startup: `SPARQL_LOAD_ON_START=true` with `SPARQL_DATA_FILE=file:///opt/shared-files/<file>.ttl`
- Frontend → backend proxy: `VITE_BACKEND_URL`
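A starting-point `.env` built from the knobs above — all values here are illustrative placeholders (including the data file name), not defaults copied from `.env.example`:

```env
# SPARQL connectivity (host or full endpoint, plus credentials)
SPARQL_HOST=anzograph
SPARQL_USER=admin
SPARQL_PASS=changeme
# Load a TTL from the shared volume when the backend starts
SPARQL_LOAD_ON_START=true
SPARQL_DATA_FILE=file:///opt/shared-files/my_ontology.ttl
# Snapshot size caps
DEFAULT_NODE_LIMIT=20000
DEFAULT_EDGE_LIMIT=60000
# Frontend -> backend proxy target
VITE_BACKEND_URL=http://backend:8000
```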
## API (backend)

Base URL: `http://localhost:8000`

- `GET /api/health` liveness
- `GET /api/stats` snapshot stats (uses default limits)
- `GET /api/graph` graph snapshot
  - Query params: `node_limit`, `edge_limit`, `graph_query_id`
- `GET /api/graph_queries` available graph snapshot modes (`graph_query_id` values)
- `GET /api/selection_queries` available selection-highlight modes (`query_id` values)
- `POST /api/selection_query` run a selection query for highlighted neighbors
  - Body: `{"query_id":"neighbors","selected_ids":[...],"node_limit":...,"edge_limit":...,"graph_query_id":"default"}`
- `POST /api/selection_triples`
- `POST /api/sparql` raw SPARQL passthrough (debug/advanced)
- `POST /api/neighbors` legacy alias (same behavior as `query_id="neighbors"`)
## Frontend UI
- Mouse:
  - Drag: pan
  - Scroll: zoom
  - Click: select nodes
- **Top-right buttons:** “selection query” mode (how neighbors/highlights are computed for the current selection)
- **Bottom-right buttons:** “graph query” mode (which SPARQL edge set is used to build the graph snapshot; switching reloads the graph)
## Notes on performance/limits
- The backend caches snapshots in memory; tune `DEFAULT_*_LIMIT` if memory is too high.
- The frontend renders a sampled subset when zoomed out, and only draws edges when fewer than ~20k nodes are visible.
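That thinning behavior can be pictured as a visibility gate plus a stride sampler. A minimal sketch for illustration, written in Go even though the real logic lives in the TypeScript frontend — everything beyond the ~20k-node figure quoted above (exact threshold semantics, the sampling strategy) is an assumption:

```go
package main

import "fmt"

const edgeDrawThreshold = 20000 // the "~20k nodes visible" gate from the note above

// drawEdges reports whether edges should be rendered at all.
func drawEdges(visibleNodes int) bool {
	return visibleNodes < edgeDrawThreshold
}

// sampleStride picks every k-th node so at most budget nodes are drawn
// when zoomed out. Stride sampling is an assumed strategy, not the
// frontend's actual one.
func sampleStride(total, budget int) int {
	if total <= budget {
		return 1
	}
	return (total + budget - 1) / budget // ceil(total / budget)
}

func main() {
	fmt.Println(drawEdges(5000), drawEdges(50000)) // true false
	fmt.Println(sampleStride(100000, 20000))       // 5
}
```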
## Nix dev shell (optional)

If you use Nix, `flake.nix` provides a minimal `devShell`:

```bash
nix develop
```

You can still run the standalone Rust SVG pipeline directly with:

```bash
docker compose --profile radial up --build radial_sugiyama
```

# Radial Sugiyama vs Go Snapshot Pipeline
This note delimits the algorithmic intersection between the Rust pipeline in `radial_sugiyama/` and the Go snapshot/export path in:
- `backend_go/graph_export.go`
- `backend_go/graph_snapshot.go`
The goal is not to describe integration mechanics yet, but to mark where the two implementations solve the same problem, where they only touch indirectly, and where they are solving different problems.
## Scope
The Rust pipeline is a hierarchy-specific layout pipeline:
1. import ontology hierarchy from Turtle
2. optionally filter to a rooted descendant subtree
3. validate DAG structure
4. assign hierarchy levels
5. insert dummy nodes for long edges
6. reduce crossings
7. assign coordinates
8. project to radial space
9. generate routed edge artifacts
10. export SVG
The Go path is a snapshot/materialization pipeline:
1. query predicates and edges from SPARQL
2. accumulate nodes and edges
3. build a graph response
4. run a lightweight hierarchy layering + radial placement
5. attach labels
6. return JSON to the frontend
Because of that, the true intersection is narrow in `graph_export.go` and broader in the layout section of `graph_snapshot.go`.
## Legend
- `Direct overlap`: both sides implement essentially the same algorithmic concern
- `Adjacent overlap`: one side prepares or consumes the same kind of structure, but the algorithm differs materially
- `No overlap`: the stage exists only on one side
## Intersection with `graph_export.go`
`graph_export.go` overlaps with the Rust pipeline only at graph materialization time.
| Algorithmic stage | Rust pipeline | `graph_export.go` | Intersection | Notes |
| --- | --- | --- | --- | --- |
| Node identity and deduplication | `ttl.rs` maps class IRIs to stable node indices | `graphAccumulator.getOrAddNode` maps SPARQL terms to stable node IDs | Direct overlap | Both build a unique node set from repeated source records. |
| Edge materialization | `ttl.rs` emits `superclass -> subclass` edges and deduplicates repeats | `graphAccumulator.addBindings` emits `source -> target` edges from SPARQL bindings | Adjacent overlap | Both convert raw triples/bindings into an in-memory graph, but Rust is specialized to `rdfs:subClassOf` while Go is predicate-agnostic. |
| Literal / blank-node filtering | `ttl.rs` ignores blank/literal hierarchy endpoints | `getOrAddNode` skips literals and optionally keeps blank nodes | Adjacent overlap | Similar sanitation step, but not identical semantics. |
| Predicate preservation | Rust discards all predicates except `rdfs:subClassOf` | Go preserves predicate IDs through `PredicateDict` | No overlap | This is Go-only in the compared files. |
| Graph limits / capacity management | Rust does not enforce snapshot-style node and edge caps here | Go enforces `nodeLimit` and preallocates with edge hints | No overlap | This is an operational concern of the Go snapshot path. |
### Boundary for `graph_export.go`
The clean algorithmic seam is:
- Go owns generic SPARQL binding ingestion and generic graph materialization.
- Rust owns hierarchy-specialized interpretation once a hierarchy graph has already been isolated.
That means `graph_export.go` is not competing with the Rust layout pipeline. It is only producing the kind of node/edge structure that Rust would eventually need as input.
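That shared materialization concern can be sketched as a small accumulator. A minimal Go sketch in the spirit of `graphAccumulator.getOrAddNode` and the `ttl.rs` node indexing — the literal-skipping and deduplication rules come from the table above, everything else (names, types) is illustrative:

```go
package main

import "fmt"

// term is a simplified RDF term: an IRI/blank node or a literal.
type term struct {
	value     string
	isLiteral bool
}

// accumulator assigns a stable integer ID to each distinct node term and
// collects deduplicated edges, mirroring the shared concern of ttl.rs
// node indexing and graphAccumulator.getOrAddNode.
type accumulator struct {
	ids   map[string]int
	nodes []string
	edges map[[2]int]struct{}
}

func newAccumulator() *accumulator {
	return &accumulator{ids: map[string]int{}, edges: map[[2]int]struct{}{}}
}

// getOrAddNode returns (-1, false) for literals, which never become nodes.
func (a *accumulator) getOrAddNode(t term) (int, bool) {
	if t.isLiteral {
		return -1, false
	}
	if id, ok := a.ids[t.value]; ok {
		return id, true
	}
	id := len(a.nodes)
	a.ids[t.value] = id
	a.nodes = append(a.nodes, t.value)
	return id, true
}

// addEdge records source -> target once, no matter how often it repeats.
func (a *accumulator) addEdge(s, o term) {
	src, ok1 := a.getOrAddNode(s)
	dst, ok2 := a.getOrAddNode(o)
	if ok1 && ok2 {
		a.edges[[2]int{src, dst}] = struct{}{}
	}
}

func main() {
	acc := newAccumulator()
	acc.addEdge(term{value: "A"}, term{value: "B"})
	acc.addEdge(term{value: "A"}, term{value: "B"}) // duplicate, ignored
	acc.addEdge(term{value: "A"}, term{value: "x", isLiteral: true})
	fmt.Println(len(acc.nodes), len(acc.edges)) // 2 1
}
```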
## Intersection with `graph_snapshot.go`
`graph_snapshot.go` intersects with the Rust pipeline in two different regions:
1. graph acquisition and hierarchy preparation
2. lightweight layout assignment
### Stage-by-stage comparison
| Algorithmic stage | Rust pipeline | `graph_snapshot.go` | Intersection | Notes |
| --- | --- | --- | --- | --- |
| Source acquisition | `graph_from_ttl_path` parses Turtle directly | `fetchGraphSnapshot` queries SPARQL in batches | Adjacent overlap | Both acquire a graph, but from different upstream sources. |
| Hierarchy graph extraction | Rust keeps only `rdfs:subClassOf` during import | Go accepts a `graph_query_id` and accumulates whatever that query returns | Adjacent overlap | The overlap is meaningful only when the Go query is hierarchy-like. |
| Rooted subtree filtering | `filter_graph_to_descendants` keeps one configured root and its descendants | No equivalent in these two Go files | No overlap | This is currently Rust-only. |
| Cycle detection / DAG validation | `compute_hierarchy_levels` rejects cyclic graphs | `levelSynchronousKahnLayers` returns a `CycleError` if not all nodes are processed | Direct overlap | Both need a DAG to continue with hierarchy layout. |
| Level assignment | Rust computes longest-path hierarchy levels | Go computes level-synchronous Kahn layers | Direct overlap | Same problem, different algorithm. Both assign ring depth from DAG structure. |
| Per-level ordering | Rust later optimizes order for crossings | Go sorts each layer lexicographically by IRI | Adjacent overlap | Both define an order inside a level, but Go is a simple deterministic ordering while Rust is layout-driven. |
| Radial node placement | Rust projects coordinates to rings after Sugiyama coordinate assignment | Go uses `radialPositionsFromLayers` to place each layer on a ring | Direct overlap | Same output shape, very different sophistication. |
| Coordinate shifting / scaling controls | Rust has configurable radius, spacing, borders, and positive-coordinate shifting | Go uses a fixed `maxR = 5000.0` radial envelope | Adjacent overlap | Both map levels to 2D coordinates, but only Rust exposes tuned geometry controls. |
| Label enrichment | Rust keeps node labels as imported IRIs | Go fetches `rdfs:label` after layout | Adjacent overlap | Both carry node naming, but the enrichment algorithm is currently Go-only. |
| Response packaging | Rust writes SVG and layout artifacts | Go returns `GraphResponse` JSON plus metadata | No overlap | Same graph, different downstream consumers. |
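The two Direct-overlap layout stages — cycle-checked layering and ring placement — can be sketched compactly. This is an illustrative Go sketch, not the repo's `levelSynchronousKahnLayers` or `radialPositionsFromLayers`; only the layer-then-ring shape, the lexicographic per-layer ordering, and the fixed radial envelope are taken from the comparison above:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// kahnLayers computes level-synchronous Kahn layers over a DAG given as
// adjacency lists (parent -> children). It fails when a cycle keeps some
// nodes from ever reaching in-degree zero.
func kahnLayers(adj map[string][]string) ([][]string, error) {
	indeg := map[string]int{}
	for n, children := range adj {
		if _, ok := indeg[n]; !ok {
			indeg[n] = 0
		}
		for _, c := range children {
			indeg[c]++
		}
	}
	frontier := []string{}
	for n, d := range indeg {
		if d == 0 {
			frontier = append(frontier, n)
		}
	}
	var layers [][]string
	seen := 0
	for len(frontier) > 0 {
		sort.Strings(frontier) // deterministic order, like sorting by IRI
		layers = append(layers, frontier)
		next := []string{}
		for _, n := range frontier {
			seen++
			for _, c := range adj[n] {
				indeg[c]--
				if indeg[c] == 0 {
					next = append(next, c)
				}
			}
		}
		frontier = next
	}
	if seen != len(indeg) {
		return nil, fmt.Errorf("cycle detected: %d of %d nodes layered", seen, len(indeg))
	}
	return layers, nil
}

// radialPositions spreads layer i on ring i inside a fixed envelope,
// echoing the maxR = 5000 radial envelope mentioned above.
func radialPositions(layers [][]string, maxR float64) map[string][2]float64 {
	pos := map[string][2]float64{}
	for i, layer := range layers {
		r := maxR * float64(i) / math.Max(1, float64(len(layers)-1))
		for j, n := range layer {
			theta := 2 * math.Pi * float64(j) / float64(len(layer))
			pos[n] = [2]float64{r * math.Cos(theta), r * math.Sin(theta)}
		}
	}
	return pos
}

func main() {
	adj := map[string][]string{
		"entity":     {"continuant", "occurrent"},
		"continuant": {"object"},
		"occurrent":  {"process"},
		"object":     {}, "process": {},
	}
	layers, err := kahnLayers(adj)
	if err != nil {
		panic(err)
	}
	fmt.Println(layers) // [[entity] [continuant occurrent] [object process]]
	_ = radialPositions(layers, 5000)
}
```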
## Rust-only algorithms with no counterpart in the compared Go files
These parts of the Rust pipeline do not currently intersect with `graph_export.go` or `graph_snapshot.go`:
- rooted descendant filtering
- dummy-node insertion for long edges
- crossing reduction / sifting
- coordinate assignment before radial projection
- adaptive / packed / distributed ring projection modes
- routed edge generation
- layout artifact generation
- SVG rendering and export
These are the parts that make the Rust pipeline a true Sugiyama-style layout engine rather than a simple radial snapshot placer.
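Rooted descendant filtering, the first Rust-only stage listed above, is conceptually just a reachability closure from the configured root. A hedged Go sketch of that idea (the actual `filter_graph_to_descendants` is Rust, and its details may differ):

```go
package main

import "fmt"

// descendantClosure keeps the root and every node reachable from it along
// superclass -> subclass edges: a sketch of rooted descendant filtering.
func descendantClosure(adj map[string][]string, root string) map[string]bool {
	keep := map[string]bool{root: true}
	stack := []string{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, c := range adj[n] {
			if !keep[c] {
				keep[c] = true
				stack = append(stack, c)
			}
		}
	}
	return keep
}

func main() {
	adj := map[string][]string{
		"BFO_0000001": {"BFO_0000002"},
		"BFO_0000002": {"BFO_0000004"},
		"orphan":      {"other"}, // not reachable from the root
	}
	keep := descendantClosure(adj, "BFO_0000001")
	fmt.Println(len(keep), keep["orphan"]) // 3 false
}
```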
## Go-only algorithms with no counterpart in the Rust pipeline
These parts of the compared Go files do not currently exist in Rust:
- predicate dictionary construction from SPARQL results
- batched SPARQL edge fetching with memory management
- snapshot limits and backend metadata packaging
- `rdfs:label` lookup through SPARQL
- generic graph export over arbitrary predicate sets
These are acquisition and serving concerns rather than layout concerns.
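Batched SPARQL edge fetching usually amounts to re-issuing one stably ordered query with a moving `OFFSET`. A hedged Go sketch of that loop shape — the query text mirrors the hierarchy query shown earlier in these docs, while `pagedQuery` and `fetchAll` are illustrative names, not the backend's API:

```go
package main

import "fmt"

// pagedQuery renders one page of the ordered subClassOf query. A stable
// ORDER BY is what makes LIMIT/OFFSET paging safe across requests.
func pagedQuery(limit, offset int) string {
	return fmt.Sprintf(`PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?p ?o
WHERE {
  VALUES ?p { rdfs:subClassOf }
  ?s ?p ?o .
  FILTER(!isLiteral(?o))
}
ORDER BY ?s ?p ?o
LIMIT %d
OFFSET %d`, limit, offset)
}

// fetchAll sketches the batching loop: keep paging until a short page
// comes back. fetch is injected so the loop shape is testable without a
// live SPARQL endpoint.
func fetchAll(pageSize int, fetch func(query string) (rows int)) (total int) {
	for offset := 0; ; offset += pageSize {
		n := fetch(pagedQuery(pageSize, offset))
		total += n
		if n < pageSize {
			return total
		}
	}
}

func main() {
	remaining := 2500
	got := fetchAll(1000, func(q string) int {
		n := remaining
		if n > 1000 {
			n = 1000
		}
		remaining -= n
		return n
	})
	fmt.Println(got) // 2500
}
```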
## Algorithmic ownership boundary
If the future integration wants a clean division of responsibility, the strongest ownership boundary is:
### Go-owned stages
- query execution against AnzoGraph / SPARQL
- predicate-aware graph accumulation
- generic graph snapshot materialization
- label fetching and API response orchestration
### Rust-owned stages
- hierarchy-specific filtering
- hierarchy-level assignment
- Sugiyama expansion with dummy nodes
- crossing minimization
- coordinate assignment
- radial projection and route generation
- layout artifact production
## Most important practical conclusion
At algorithm granularity, the Rust pipeline intersects only lightly with `graph_export.go`, but it intersects substantially with the hierarchy-layout portion of `graph_snapshot.go`.
The main replacement candidates in a future integration are therefore not the generic export/materialization routines in `graph_export.go`, but these hierarchy-layout steps currently performed by `graph_snapshot.go`:
1. DAG validation / cycle detection
2. layer assignment
3. per-layer ordering
4. radial coordinate generation
Everything after that depends on how much of the Rust layout artifact model the future integration wants to expose to the frontend.