# Graph Transport Alternatives
## Purpose
This document compares alternatives to the current `/api/graph` transport format with two goals:
1. reduce the cost of building, transferring, and decoding very large graph payloads
2. move the frontend transport shape closer to the renderer/GPU input shape while preserving all data the current frontend and backend pipeline still need
This analysis is based on the current repo state plus official documentation for browser fetch/streaming and candidate transport formats.
## Executive Summary
The current bottleneck is not the renderer's typed-array path. It is the browser's need to fully materialize a huge JSON object graph before the renderer ever runs.
The best candidates for this repo are:
1. **Custom binary columnar payload**
- Best fit for the current renderer.
- Lowest decode overhead.
- Most direct path from backend memory to frontend typed arrays.
- Requires custom protocol/versioning work.
2. **Apache Arrow IPC**
- Best off-the-shelf columnar binary format.
- Very good fit for typed-array-heavy rendering.
- Strong option if you want a standard format instead of inventing one.
- Heavier conceptual/tooling footprint than a custom binary envelope.
3. **Columnar JSON**
- Easiest migration.
- Better than today's row-oriented JSON.
- Still fundamentally JSON, so it does not remove the browser's JSON parse/object-materialization cost.
4. **NDJSON / streamed chunked JSON**
- Good if progressiveness matters.
- Better than one giant monolithic JSON document.
- Still weaker than a binary/columnar format for this renderer.
The strongest overall recommendation is:
- **Long-term**: custom binary columnar payload or Arrow IPC
- **Low-risk interim**: columnar JSON, possibly with chunking/streaming
Not recommended as the primary solution for this repo:
- row-oriented MessagePack
- Protocol Buffers as one giant message
## Verified Current Pipeline
### Backend side
The backend builds a `GraphResponse` and caches it in memory:
- `backend_go/models.go`
- `backend_go/snapshot_service.go`
- `backend_go/graph_snapshot.go`
The response shape is:
```go
type GraphResponse struct {
	Nodes         []Node
	Edges         []Edge
	RouteSegments []RouteSegment
	Meta          *GraphMeta
}
```
and it is currently written as one JSON document with:
```go
json.NewEncoder(w).Encode(v)
```
in `backend_go/http_helpers.go`.
### Frontend side
The frontend currently does:
1. `fetch("/api/graph?...")`
2. `await graphRes.json()`
3. read `graph.nodes`, `graph.edges`, `graph.route_segments`, `graph.meta`
4. build:
- `Float32Array xs`
- `Float32Array ys`
- `Uint32Array vertexIds`
- `Uint32Array edgeData`
- `Float32Array routeLineVertices`
5. call `renderer.init(xs, ys, vertexIds, edgeData, routeLineVertices)`
Relevant files:
- `frontend/src/App.tsx`
- `frontend/src/renderer.ts`
This means the current browser path is:
- wire bytes
- JSON text/body handling
- JS arrays of node/edge objects
- typed arrays
- renderer-side typed arrays/maps/GPU buffers
The expensive part happens before step 4.
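Step 4 is where today's row-to-column conversion happens. A minimal sketch of that conversion (the `GraphNode`/`GraphEdge` shapes and the function name are illustrative, not the actual `App.tsx` code):

```typescript
// Illustrative minimal shapes standing in for the real App.tsx types.
interface GraphNode { id: number; x: number; y: number; }
interface GraphEdge { source: number; target: number; }

function buildRenderArrays(nodes: GraphNode[], edges: GraphEdge[]) {
  // Every value read below already exists as a JS object property
  // before this runs -- that object materialization is the expensive part.
  const xs = new Float32Array(nodes.length);
  const ys = new Float32Array(nodes.length);
  const vertexIds = new Uint32Array(nodes.length);
  for (let i = 0; i < nodes.length; i++) {
    xs[i] = nodes[i].x;
    ys[i] = nodes[i].y;
    vertexIds[i] = nodes[i].id;
  }
  // edgeData packs source/target index pairs for the renderer.
  const edgeData = new Uint32Array(edges.length * 2);
  for (let i = 0; i < edges.length; i++) {
    edgeData[i * 2] = edges[i].source;
    edgeData[i * 2 + 1] = edges[i].target;
  }
  return { xs, ys, vertexIds, edgeData };
}
```

The loop itself is cheap; the cost lives in building the millions of `GraphNode`/`GraphEdge` objects that feed it.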
## Verified Data Access Audit
This section verifies every field currently produced by the backend and whether it is actually needed by the frontend transport.
### Main graph response fields
| Field | Produced in backend | Used by frontend? | Where used | Required on wire for current UX? | Notes |
| --- | --- | --- | --- | --- | --- |
| `nodes[].id` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `vertexIds`, and to map selected renderer indices back to backend IDs for selection queries. |
| `nodes[].x` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `xs`. |
| `nodes[].y` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `ys`. |
| `nodes[].iri` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes, if keeping current hover UX | Used for hover tooltip text. |
| `nodes[].label` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes, if keeping current hover UX | Used for hover tooltip text. |
| `nodes[].termType` | `backend_go/models.go` | No frontend use | none in `frontend/src` | No | Still needed internally by backend snapshot/selection index. |
| `edges[].source` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `edgeData`. |
| `edges[].target` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `edgeData`. |
| `edges[].predicate_id` | `backend_go/models.go` | No main-graph frontend use | none in `frontend/src/App.tsx` | No | Still needed internally by backend snapshot and hierarchy layout preparation. |
| `route_segments[].points` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes when route segments are present | Used to build `routeLineVertices`. |
| `route_segments[].edge_index` | `backend_go/models.go` | Not used after parsing | `graphRouteSegmentArray` validation only | No | Could be dropped from frontend transport if route lines are pre-flattened. |
| `route_segments[].kind` | `backend_go/models.go` | Not used after parsing | `graphRouteSegmentArray` validation only | No | Could be dropped from frontend transport if route lines are pre-flattened. |
| `meta.backend` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.nodes` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.edges` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.graph_query_id` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.node_limit` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.edge_limit` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.ttl_path` | `backend_go/models.go` | No | none in `frontend/src` | No | Frontend type declares it, but current UI does not use it. |
| `meta.sparql_endpoint` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.include_bnodes` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.layout_engine` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.layout_root_iri` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.predicates` | `backend_go/models.go` | No frontend use | none in `frontend/src` | No | Still used internally by backend selection/hierarchy logic. |
### Backend-internal fields that do not need to stay in the frontend transport
This is the most important audit result.
The backend currently reuses one struct for:
- internal cached snapshot
- HTTP response payload
That is convenient, but it means the frontend receives fields that only the backend needs.
Verified internal-only dependencies:
- `snapshot.Nodes[].TermType` is used in `backend_go/selection_query.go` to build the selection index.
- `snapshot.Meta.Predicates` is used in `backend_go/selection_query.go`.
- `Edge.PredicateID` is used internally for hierarchy layout preparation in `backend_go/hierarchy_layout_bridge.go`.
The frontend does **not** need those fields for current behavior.
### What the frontend actually needs
For the current graph view, the hot path can be reduced to:
- `vertexIds[]`
- `xs[]`
- `ys[]`
- `edgeSources[]`
- `edgeTargets[]`
- `routeLineVertices[]` or route geometry equivalent
- `label[]` and `iri[]` by node index
- `meta.backend`
- `meta.nodes`
- `meta.edges`
- `meta.graph_query_id`
- `meta.node_limit`
- `meta.edge_limit`
That is much closer to a columnar or binary payload than to the current array-of-objects JSON.
## Why the Current JSON Path Hurts
`Response.json()` is not just a lightweight decode helper. MDN states that `Response.json()` reads the stream to completion and resolves with the result of parsing the body text as JSON into a JavaScript object.
That matters here because the current payload is row-oriented:
- millions of node objects
- millions of edge objects
Even though the renderer later wants typed arrays, the browser must first create those JS objects.
This is exactly the part that can stall or run out of memory before `renderer.init(...)` starts.
## Alternatives
### 1. Columnar JSON
#### Idea
Keep JSON, but change the schema from row-oriented objects:
```json
{
  "nodes": [{ "id": 1, "x": 0.1, "y": 0.2, ... }],
  "edges": [{ "source": 1, "target": 2, ... }]
}
```
to column-oriented arrays:
```json
{
  "vertex_ids": [...],
  "xs": [...],
  "ys": [...],
  "edge_sources": [...],
  "edge_targets": [...],
  "node_labels": [...],
  "node_iris": [...],
  "route_line_vertices": [...],
  "meta": { ... }
}
```
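With this shape, the frontend's typed-array build collapses to direct constructor calls. A hedged sketch, assuming the column names above (the payload type is illustrative):

```typescript
// Illustrative columnar payload type matching the proposed schema.
interface ColumnarGraph {
  vertex_ids: number[];
  xs: number[];
  ys: number[];
  edge_sources: number[];
  edge_targets: number[];
}

function decodeColumnar(payload: ColumnarGraph) {
  // JSON.parse still materializes the plain number arrays, but no
  // per-node or per-edge objects are ever created.
  return {
    vertexIds: Uint32Array.from(payload.vertex_ids),
    xs: Float32Array.from(payload.xs),
    ys: Float32Array.from(payload.ys),
    edgeSources: Uint32Array.from(payload.edge_sources),
    edgeTargets: Uint32Array.from(payload.edge_targets),
  };
}
```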
#### Pros
- easiest migration from the current API contract
- no schema compiler
- easy to debug with ordinary tooling
- much closer to what the renderer already consumes
- avoids creating per-edge objects in frontend application code
#### Cons
- still goes through JSON parsing
- still materializes JS arrays before typed arrays are built
- huge numeric arrays in JSON are still text, not binary
- string columns are still ordinary JS strings
#### Fit for current pipeline
Good.
No current frontend feature would be lost if the payload includes:
- ids/xs/ys/edge sources/targets
- labels/iris
- route line vertices or equivalent
- the small subset of meta fields currently used
#### Overall assessment
Best low-risk intermediate step.
It is clearly better than today's row-oriented JSON, but it is not the endgame if the goal is to remove the parse bottleneck for 1 GB+ payloads.
### 2. NDJSON / Chunked JSON
#### Idea
Change the backend to stream multiple JSON records instead of one giant JSON object.
Examples:
- one line per chunk of nodes/edges
- one line for metadata
- one line per route segment chunk
NDJSON is explicitly designed for transporting multiple JSON texts in a stream protocol.
#### Pros
- can start processing before the whole payload arrives
- better observability and progress reporting
- easier cancellation/retry semantics
- avoids one monolithic `Response.json()` boundary
#### Cons
- record-per-edge NDJSON would still create far too many JS objects
- to be worth it here, it should be **chunked columnar NDJSON**, not row NDJSON
- frontend load path must become stream-based
- renderer still currently expects all arrays at once
#### Fit for current pipeline
Moderate.
It can preserve all current information, but it does not by itself solve the "final representation should look like GPU inputs" goal unless each chunk is already columnar.
#### Best shape if chosen
Not:
- one JSON object per edge
- one JSON object per node
Better:
- one NDJSON record for metadata
- then NDJSON records where each record contains columnar chunks:
  - `vertex_ids_chunk`
  - `xs_chunk`
  - `ys_chunk`
  - `edge_sources_chunk`
  - `edge_targets_chunk`
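Assembling such chunked columnar records can be sketched as follows; the record shapes (`kind`, `xs_chunk`, and so on) are illustrative, not an existing schema, and a real implementation would read from a `ReadableStream` rather than a string:

```typescript
// Illustrative NDJSON record shapes for a chunked columnar stream.
type NdjsonRecord =
  | { kind: "meta"; nodes: number; edges: number }
  | { kind: "node_chunk"; xs_chunk: number[]; ys_chunk: number[] };

function assembleChunks(body: string) {
  const xs: number[] = [];
  const ys: number[] = [];
  let meta: { nodes: number; edges: number } | null = null;
  for (const line of body.split("\n")) {
    if (!line.trim()) continue; // tolerate a trailing newline
    const rec = JSON.parse(line) as NdjsonRecord;
    if (rec.kind === "meta") {
      meta = { nodes: rec.nodes, edges: rec.edges };
    } else if (rec.kind === "node_chunk") {
      // Each chunk is already columnar: arrays of numbers, not objects.
      xs.push(...rec.xs_chunk);
      ys.push(...rec.ys_chunk);
    }
  }
  // Concatenate once at the end into the renderer's typed arrays.
  return { meta, xs: Float32Array.from(xs), ys: Float32Array.from(ys) };
}
```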
#### Overall assessment
Viable, but only attractive if progressiveness is a major goal. On its own, it is weaker than columnar binary formats for this renderer.
### 3. MessagePack
#### Idea
Use a compact binary encoding instead of JSON.
The official JavaScript implementation supports:
- `encode`
- `decode`
- `decodeAsync(stream)`
- `decodeArrayStream(stream)`
- `decodeMultiStream(stream)`
and even custom extension types for faster handling of large `Float32Array` payloads.
#### Pros
- smaller payload than JSON
- binary transport
- async and stream-capable decoding APIs exist
- mature JS library
#### Cons
- if you keep the current row-oriented schema, you still get one huge object graph after decode
- therefore MessagePack alone does not remove the fundamental object-allocation problem
- custom extension types improve typed-array cases, but then you are already halfway to designing a custom binary protocol
#### Fit for current pipeline
Moderate.
It can preserve all current information easily.
But if the schema remains object-heavy, the browser still ends up with millions of JS objects.
#### Overall assessment
Useful if paired with a **columnar** schema. Not compelling as a first move if the schema stays row-oriented.
### 4. Apache Arrow IPC
#### Idea
Use Arrow's columnar binary format and Arrow JS support.
Arrow JS provides:
- `tableFromIPC(...)`
- support for `fetch(...)`
- typed-array-backed vectors
- dictionary-encoded strings
- a columnar memory model explicitly meant for efficient processing and movement of large in-memory data
#### Pros
- strongest off-the-shelf fit for typed-array-oriented rendering
- columnar by design
- binary rather than textual
- supports large numeric columns very naturally
- supports dictionary encoding for repeated strings like labels or IRIs
- much closer to the renderer/GPU input shape than JSON objects
#### Cons
- larger conceptual/tooling jump than columnar JSON
- route segments are nested/variable-length; representing them cleanly needs design
- frontend code becomes Arrow-aware unless the decode is hidden behind an adapter
- backend must serialize Arrow on the Go side or produce Arrow-compatible IPC
#### Fit for current pipeline
Very good.
Current frontend needs can be represented as columns:
- `vertex_ids: uint32`
- `xs: float32`
- `ys: float32`
- `edge_sources: uint32`
- `edge_targets: uint32`
- `labels: utf8` or dictionary-encoded utf8
- `iris: utf8` or dictionary-encoded utf8
Route geometry should probably not stay as nested route-segment objects. It would fit better as:
- pre-flattened `route_line_vertices` float column/buffer
- or a second Arrow table dedicated to line segments
#### Overall assessment
One of the two best solutions for this repo.
If you want a standard format instead of inventing one, Arrow is the most attractive candidate.
### 5. FlatBuffers
#### Idea
Use a schema-defined binary format designed for direct access without unpacking/parsing.
FlatBuffers explicitly advertises:
- access to serialized data without parsing/unpacking
- memory efficiency and speed
- forwards/backwards compatibility
#### Pros
- very strong memory-efficiency story
- schema evolution support
- no full parse/unpack step in the same way as JSON
- can model both scalars and more complex structures
#### Cons
- requires schema/compiler/generated bindings
- JavaScript integration is more manual than JSON or Arrow
- ergonomics in app code are not as simple as arrays/objects
- strings and nested route structures are supported, but the developer experience is more specialized
#### Fit for current pipeline
Good, technically.
It can preserve all current information and remove the giant object-graph parse step.
However, compared with Arrow or a custom binary envelope, it is a less natural conceptual fit for a renderer whose hot path is already columnar/typed-array-based.
#### Overall assessment
A strong technical option, but probably not the most ergonomic option for this specific frontend.
### 6. Protocol Buffers
#### Idea
Use a schema-defined binary format with generated bindings.
#### Pros
- compact binary encoding
- schema/versioning
- mature ecosystem
#### Cons
- official docs describe protobuf as a good fit for typed structured messages up to a few megabytes
- the same docs warn that large data can require loading entire messages into memory and can cause multiple copies
- large repeated numeric arrays are not protobuf's sweet spot
- still not especially close to the renderer's typed-array model
#### Fit for current pipeline
Poor for this specific payload size and shape.
#### Overall assessment
Not recommended for this main graph transport.
### 7. Custom Binary Typed-Array Envelope
#### Idea
Define a transport specifically around what the renderer and hover/selection pipeline need.
Example structure:
- small fixed header or small JSON header:
  - version
  - counts
  - offsets/lengths
  - meta subset
- then raw binary buffers:
  - `vertex_ids`
  - `xs`
  - `ys`
  - `edge_sources`
  - `edge_targets`
  - `route_line_vertices`
  - string dictionary / offsets for `label` and `iri`
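A minimal sketch of such an envelope, assuming a 4-byte length prefix, a JSON header padded to 4-byte alignment, and little-endian sections in a fixed order (the layout and field names are illustrative, not an existing protocol; string columns are omitted for brevity):

```typescript
// Encode: [u32 header length][JSON header, space-padded to 4 bytes]
//         [vertex_ids u32][xs f32][ys f32]
function encodeEnvelope(vertexIds: Uint32Array, xs: Float32Array, ys: Float32Array): ArrayBuffer {
  const header = { version: 1, nodes: vertexIds.length };
  let headerJson = JSON.stringify(header);
  // Pad so every section starts at a 4-byte-aligned offset (ASCII JSON,
  // so string length equals byte length).
  while (headerJson.length % 4 !== 0) headerJson += " ";
  const headerBytes = new TextEncoder().encode(headerJson);
  const total = 4 + headerBytes.length + (vertexIds.length + xs.length + ys.length) * 4;
  const buf = new ArrayBuffer(total);
  new DataView(buf).setUint32(0, headerBytes.length, true);
  new Uint8Array(buf, 4).set(headerBytes);
  let off = 4 + headerBytes.length;
  new Uint32Array(buf, off, vertexIds.length).set(vertexIds);
  off += vertexIds.byteLength;
  new Float32Array(buf, off, xs.length).set(xs);
  off += xs.byteLength;
  new Float32Array(buf, off, ys.length).set(ys);
  return buf;
}

function decodeEnvelope(buf: ArrayBuffer) {
  const headerLen = new DataView(buf).getUint32(0, true);
  const header = JSON.parse(new TextDecoder().decode(new Uint8Array(buf, 4, headerLen)));
  let off = 4 + headerLen;
  // These views alias the wire buffer: no copy, no row objects.
  const vertexIds = new Uint32Array(buf, off, header.nodes);
  off += vertexIds.byteLength;
  const xs = new Float32Array(buf, off, header.nodes);
  off += xs.byteLength;
  const ys = new Float32Array(buf, off, header.nodes);
  return { header, vertexIds, xs, ys };
}
```

Because the decode step only constructs views over the response `ArrayBuffer`, the whole buffer can also be handed to a worker as a transferable.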
#### Pros
- closest possible fit to current renderer
- no schema compiler required
- no row-object materialization
- easiest path to zero-copy or near-zero-copy arrays on the frontend
- easiest path to worker transfer via `ArrayBuffer`
- can separate hot render data from cold metadata cleanly
#### Cons
- custom protocol to design, version, validate, and document
- less tooling/interoperability than Arrow
- backend and frontend both need careful binary codecs
#### Fit for current pipeline
Excellent.
You can preserve all current behavior while only sending the data the frontend actually uses.
#### Overall assessment
The best performance-oriented fit if you are comfortable owning a custom format.
## Comparison Table
| Option | Closeness to GPU shape | Avoids giant object graph | Supports all current frontend data | Streaming-friendly | Implementation cost | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| Current row JSON | Poor | No | Yes | Poor | Already done | Replace |
| Columnar JSON | Medium | No | Yes | Medium | Low | Good interim |
| NDJSON chunked columnar JSON | Medium | Partially | Yes | Good | Medium | Situational |
| MessagePack row-oriented | Poor | No | Yes | Good | Medium | Not enough alone |
| MessagePack columnar | Medium | Partially | Yes | Good | Medium | Viable but secondary |
| Arrow IPC | Very high | Yes or mostly yes | Yes | Good | Medium-high | Strong candidate |
| FlatBuffers | High | Yes | Yes | Medium | High | Good but specialized |
| Protobuf | Low-medium | No practical win here | Yes | Medium | Medium-high | Not recommended |
| Custom binary typed-array envelope | Very high | Yes | Yes | Good | High | Strongest fit |
## Recommended Data Contract Shapes
### Recommended shape for any non-row-oriented solution
The frontend does not need node/edge objects as its primary graph transport.
The main graph payload should be modeled as:
- `vertex_ids`
- `xs`
- `ys`
- `edge_sources`
- `edge_targets`
- `route_line_vertices`
- `node_labels`
- `node_iris`
- `meta`
This can be represented as:
- columnar JSON
- Arrow columns
- FlatBuffers vectors
- custom binary sections
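Expressed as a TypeScript contract, the payload could look like this (all names and types are illustrative; the real DTO would follow the backend's actual field types, e.g. for `graph_query_id`):

```typescript
// Hypothetical transport DTO: only the fields the frontend actually reads.
interface GraphTransportMeta {
  backend: string;
  nodes: number;
  edges: number;
  graph_query_id: string;
  node_limit: number;
  edge_limit: number;
}

interface GraphTransport {
  // Typed arrays for binary transports, plain arrays for columnar JSON.
  vertex_ids: Uint32Array | number[];
  xs: Float32Array | number[];
  ys: Float32Array | number[];
  edge_sources: Uint32Array | number[];
  edge_targets: Uint32Array | number[];
  route_line_vertices: Float32Array | number[];
  node_labels: string[];
  node_iris: string[];
  meta: GraphTransportMeta;
}
```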
### Fields that can be removed from the frontend transport immediately
Without changing current visible behavior, the main graph transport does not need to include:
- `nodes[].termType`
- `edges[].predicate_id`
- `meta.predicates`
- `meta.ttl_path`
- `meta.sparql_endpoint`
- `meta.include_bnodes`
- `meta.layout_engine`
- `meta.layout_root_iri`
- `route_segments[].edge_index`
- `route_segments[].kind`
Important:
Some of those fields are still needed by the backend's **internal snapshot**, especially for selection queries and hierarchy layout. That argues for splitting the internal snapshot model from the frontend transport DTO, instead of continuing to reuse one struct for both.
## Additional Architectural Notes
### A worker is complementary, not a transport format
Web Workers can move parsing/build work off the main thread, and `ArrayBuffer` is transferable. That is useful, but it does not by itself solve the current over-allocation problem if the payload is still a giant row-oriented JSON document.
Workers are most valuable when paired with:
- binary columnar payloads
- streamed columnar chunks
- transfer of `ArrayBuffer`s rather than giant JS object graphs
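The transfer semantics can be demonstrated without a full worker setup: `structuredClone` with a transfer list (available in modern browsers and Node 17+, an environment assumption for this sketch) detaches the source `ArrayBuffer` the same way `postMessage(buf, [buf])` does.

```typescript
// Transferring an ArrayBuffer moves it between threads without copying:
// the source is detached (byteLength becomes 0) rather than cloned.
const xs = new Float32Array([0.1, 0.2, 0.3]);
const buf = xs.buffer as ArrayBuffer;

const moved = structuredClone(buf, { transfer: [buf] });

console.log(buf.byteLength);   // 0: the source is detached
console.log(moved.byteLength); // 12: the moved buffer keeps the data
```

This is why binary columnar payloads pair well with workers: one detach-and-move instead of a structured clone of a giant object graph.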
### The backend can keep a richer internal snapshot than it sends
This repo already caches snapshots server-side. Selection and triple queries are built from the backend snapshot and the small `graphMeta` values sent back by the client.
That means the frontend transport can be much slimmer than the backend snapshot representation, as long as the backend retains its richer internal data.
This is the cleanest way to avoid losing information while optimizing the frontend transport.
## Final Recommendation
### Best long-term option
Pick one of:
1. **Custom binary typed-array envelope**
2. **Apache Arrow IPC**
Reason:
- both map naturally to the renderer's actual input model
- both avoid the giant row-object parse path
- both can preserve all current frontend-visible information
### Best low-risk migration path
If you want an incremental step before going binary:
1. split backend internal snapshot from frontend transport DTO
2. move `/api/graph` to **columnar JSON**
3. keep only the metadata fields the frontend actually uses
4. later replace the same columnar DTO with Arrow or custom binary
That path reduces waste immediately and keeps the eventual binary migration straightforward.
## Sources
Official documentation and primary sources used for the comparison:
- MDN `Response.json()`
- https://developer.mozilla.org/en-US/docs/Web/API/Response/json
- MDN `TextDecoderStream`
- https://developer.mozilla.org/en-US/docs/Web/API/TextDecoderStream
- MDN Web Workers
- https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
- MDN Transferable Objects
- https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Transferable_objects
- Apache Arrow JavaScript
- https://arrow.apache.org/js/current/
- https://arrow.apache.org/js/main/functions/Arrow.dom.tableFromIPC.html
- NDJSON specification
- https://github.com/ndjson/ndjson-spec
- MessagePack for JavaScript
- https://github.com/msgpack/msgpack-javascript
- FlatBuffers overview and JavaScript docs
- https://flatbuffers.dev/
- https://flatbuffers.dev/languages/javascript/
- Protocol Buffers overview
- https://protobuf.dev/overview/
- Streaming JSON parser references
- https://github.com/juanjoDiaz/streamparser-json
- https://rictic.github.io/jsonriver/