docs: refresh pipeline notes and transport analysis
# Graph Transport Alternatives

## Purpose

This document compares alternatives to the current `/api/graph` transport format with two goals:

1. reduce the cost of building, transferring, and decoding very large graph payloads
2. move the frontend transport shape closer to the renderer/GPU input shape while preserving all data the current frontend and backend pipeline still need

This analysis is based on the current repo state plus official documentation for browser fetch/streaming and candidate transport formats.

## Executive Summary

The current bottleneck is not the renderer's typed-array path. It is the browser's need to fully materialize a huge JSON object graph before the renderer ever runs.

The best candidates for this repo are:

1. **Custom binary columnar payload**
   - Best fit for the current renderer.
   - Lowest decode overhead.
   - Most direct path from backend memory to frontend typed arrays.
   - Requires custom protocol/versioning work.

2. **Apache Arrow IPC**
   - Best off-the-shelf columnar binary format.
   - Very good fit for typed-array-heavy rendering.
   - Strong option if you want a standard format instead of inventing one.
   - Heavier conceptual/tooling footprint than a custom binary envelope.

3. **Columnar JSON**
   - Easiest migration.
   - Better than today's row-oriented JSON.
   - Still fundamentally JSON, so it does not remove the browser's JSON parse/object-materialization cost.

4. **NDJSON / streamed chunked JSON**
   - Good if progressiveness matters.
   - Better than one giant monolithic JSON document.
   - Still weaker than a binary/columnar format for this renderer.

The strongest overall recommendation is:

- **Long-term**: custom binary columnar payload or Arrow IPC
- **Low-risk interim**: columnar JSON, possibly with chunking/streaming

Not recommended as the primary solution for this repo:

- row-oriented MessagePack
- Protocol Buffers as one giant message

## Verified Current Pipeline

### Backend side

The backend builds a `GraphResponse` and caches it in memory:

- `backend_go/models.go`
- `backend_go/snapshot_service.go`
- `backend_go/graph_snapshot.go`

The response shape is:

```go
type GraphResponse struct {
	Nodes         []Node
	Edges         []Edge
	RouteSegments []RouteSegment
	Meta          *GraphMeta
}
```

and it is currently written as one JSON document with:

```go
json.NewEncoder(w).Encode(v)
```

in `backend_go/http_helpers.go`.

### Frontend side

The frontend currently does:

1. `fetch("/api/graph?...")`
2. `await graphRes.json()`
3. read `graph.nodes`, `graph.edges`, `graph.route_segments`, `graph.meta`
4. build:
   - `Float32Array xs`
   - `Float32Array ys`
   - `Uint32Array vertexIds`
   - `Uint32Array edgeData`
   - `Float32Array routeLineVertices`
5. call `renderer.init(xs, ys, vertexIds, edgeData, routeLineVertices)`

Relevant files:

- `frontend/src/App.tsx`
- `frontend/src/renderer.ts`

This means the current browser path is:

- wire bytes
- JSON text/body handling
- JS arrays of node/edge objects
- typed arrays
- renderer-side typed arrays/maps/GPU buffers

The expensive part happens before step 4.

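Step 4 itself is a cheap pass compared with materializing the row objects. A simplified sketch of that build step (not the actual code in `frontend/src/App.tsx`; the assumption that edge endpoints are backend node IDs remapped to dense renderer indices is illustrative):

```typescript
// Simplified sketch of the current typed-array build step.
type RowNode = { id: number; x: number; y: number };
type RowEdge = { source: number; target: number };

function buildRenderArrays(nodes: RowNode[], edges: RowEdge[]) {
  const xs = new Float32Array(nodes.length);
  const ys = new Float32Array(nodes.length);
  const vertexIds = new Uint32Array(nodes.length);
  // Map backend node IDs to dense renderer indices so edges can refer
  // to slots in xs/ys rather than raw IDs (an assumed detail).
  const indexById = new Map<number, number>();
  nodes.forEach((n, i) => {
    xs[i] = n.x;
    ys[i] = n.y;
    vertexIds[i] = n.id;
    indexById.set(n.id, i);
  });
  const edgeData = new Uint32Array(edges.length * 2);
  edges.forEach((e, i) => {
    edgeData[i * 2] = indexById.get(e.source)!;
    edgeData[i * 2 + 1] = indexById.get(e.target)!;
  });
  return { xs, ys, vertexIds, edgeData };
}
```

Every `RowNode`/`RowEdge` object in the input arrays had to exist before this function could run; that allocation, not this loop, is the bottleneck.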
## Verified Data Access Audit

This section verifies every field currently produced by the backend and whether it is actually needed by the frontend transport.

### Main graph response fields

| Field | Produced in backend | Used by frontend? | Where used | Required on wire for current UX? | Notes |
| --- | --- | --- | --- | --- | --- |
| `nodes[].id` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `vertexIds`, and to map selected renderer indices back to backend IDs for selection queries. |
| `nodes[].x` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `xs`. |
| `nodes[].y` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `ys`. |
| `nodes[].iri` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes, if keeping current hover UX | Used for hover tooltip text. |
| `nodes[].label` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes, if keeping current hover UX | Used for hover tooltip text. |
| `nodes[].termType` | `backend_go/models.go` | No frontend use | none in `frontend/src` | No | Still needed internally by backend snapshot/selection index. |
| `edges[].source` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `edgeData`. |
| `edges[].target` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Used to build `edgeData`. |
| `edges[].predicate_id` | `backend_go/models.go` | No main-graph frontend use | none in `frontend/src/App.tsx` | No | Still needed internally by backend snapshot and hierarchy layout preparation. |
| `route_segments[].points` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes when route segments are present | Used to build `routeLineVertices`. |
| `route_segments[].edge_index` | `backend_go/models.go` | Not used after parsing | `graphRouteSegmentArray` validation only | No | Could be dropped from frontend transport if route lines are pre-flattened. |
| `route_segments[].kind` | `backend_go/models.go` | Not used after parsing | `graphRouteSegmentArray` validation only | No | Could be dropped from frontend transport if route lines are pre-flattened. |
| `meta.backend` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.nodes` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.edges` | `backend_go/models.go` | Yes | `frontend/src/App.tsx` | Yes | Displayed in overlay. |
| `meta.graph_query_id` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.node_limit` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.edge_limit` | `backend_go/models.go` | Yes | `frontend/src/selection_queries/api.ts` | Yes | Sent back on selection endpoints. |
| `meta.ttl_path` | `backend_go/models.go` | No | none in `frontend/src` | No | Frontend type declares it, but current UI does not use it. |
| `meta.sparql_endpoint` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.include_bnodes` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.layout_engine` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.layout_root_iri` | `backend_go/models.go` | No | none in `frontend/src` | No | Not used by current UI. |
| `meta.predicates` | `backend_go/models.go` | No frontend use | none in `frontend/src` | No | Still used internally by backend selection/hierarchy logic. |

### Backend-internal fields that do not need to stay in the frontend transport

This is the most important audit result.

The backend currently reuses one struct for:

- the internal cached snapshot
- the HTTP response payload

That is convenient, but it means the frontend receives fields that only the backend needs.

Verified internal-only dependencies:

- `snapshot.Nodes[].TermType` is used in `backend_go/selection_query.go` to build the selection index.
- `snapshot.Meta.Predicates` is used in `backend_go/selection_query.go`.
- `Edge.PredicateID` is used internally for hierarchy layout preparation in `backend_go/hierarchy_layout_bridge.go`.

The frontend does **not** need those fields for current behavior.

### What the frontend actually needs

For the current graph view, the hot path can be reduced to:

- `vertexIds[]`
- `xs[]`
- `ys[]`
- `edgeSources[]`
- `edgeTargets[]`
- `routeLineVertices[]` or an equivalent route geometry
- `label[]` and `iri[]` by node index
- `meta.backend`
- `meta.nodes`
- `meta.edges`
- `meta.graph_query_id`
- `meta.node_limit`
- `meta.edge_limit`

That is much closer to a columnar or binary payload than to the current array-of-objects JSON.

## Why the Current JSON Path Hurts

`Response.json()` is not just a lightweight decode helper. MDN states that `Response.json()` reads the stream to completion and resolves with the result of parsing the body text as JSON into a JavaScript object.

That matters here because the current payload is row-oriented:

- millions of node objects
- millions of edge objects

Even though the renderer later wants typed arrays, the browser must first create those JS objects.

This is exactly the part that can stall or run out of memory before `renderer.init(...)` starts.

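A toy comparison makes the row/column difference concrete (the schema and field names here are illustrative, not the repo's): row-oriented JSON repeats every key per record, and parsing it allocates one JS object per node, while the columnar form parses into a handful of arrays.

```typescript
// Same toy data encoded row-oriented vs column-oriented.
const n = 1000;
const rows = Array.from({ length: n }, (_, i) => ({ id: i, x: i * 0.5, y: i * 0.25 }));
const columns = {
  ids: rows.map(r => r.id),
  xs: rows.map(r => r.x),
  ys: rows.map(r => r.y),
};
const rowJson = JSON.stringify(rows);
const colJson = JSON.stringify(columns);
// The columnar text is smaller (no repeated keys) and JSON.parse of it
// produces 3 arrays instead of 1000 objects.
console.log(rowJson.length, colJson.length);
```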
## Alternatives

### 1. Columnar JSON

#### Idea

Keep JSON, but change the schema from row-oriented objects:

```json
{
  "nodes": [{ "id": 1, "x": 0.1, "y": 0.2, ... }],
  "edges": [{ "source": 1, "target": 2, ... }]
}
```

to column-oriented arrays:

```json
{
  "vertex_ids": [...],
  "xs": [...],
  "ys": [...],
  "edge_sources": [...],
  "edge_targets": [...],
  "node_labels": [...],
  "node_iris": [...],
  "route_line_vertices": [...],
  "meta": { ... }
}
```

#### Pros

- easiest migration from the current API contract
- no schema compiler
- easy to debug with ordinary tooling
- much closer to what the renderer already consumes
- avoids creating per-edge objects in frontend application code

#### Cons

- still goes through JSON parsing
- still materializes JS arrays before typed arrays are built
- huge numeric arrays in JSON are still text, not binary
- string columns are still ordinary JS strings

#### Fit for current pipeline

Good.

No current frontend feature would be lost if the payload includes:

- ids/xs/ys/edge sources/targets
- labels/iris
- route line vertices or equivalent
- the small subset of meta fields currently used

#### Overall assessment

Best low-risk intermediate step.

It is clearly better than today's row-oriented JSON, but it is not the endgame if the goal is to remove the parse bottleneck for 1 GB+ payloads.

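Decoding such a payload into renderer inputs becomes a single pass per column, with no intermediate node/edge objects. A sketch assuming the field names from the example schema above:

```typescript
// Hypothetical decode of the columnar JSON shape into renderer inputs.
interface ColumnarGraph {
  vertex_ids: number[];
  xs: number[];
  ys: number[];
  edge_sources: number[];
  edge_targets: number[];
}

function toTypedArrays(g: ColumnarGraph) {
  // Typed-array constructors accept plain number arrays, so each column
  // converts in one pass.
  const edgeData = new Uint32Array(g.edge_sources.length * 2);
  g.edge_sources.forEach((s, i) => {
    edgeData[i * 2] = s;
    edgeData[i * 2 + 1] = g.edge_targets[i];
  });
  return {
    vertexIds: new Uint32Array(g.vertex_ids),
    xs: new Float32Array(g.xs),
    ys: new Float32Array(g.ys),
    edgeData,
  };
}
```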
### 2. NDJSON / Chunked JSON

#### Idea

Change the backend to stream multiple JSON records instead of one giant JSON object.

Examples:

- one line per chunk of nodes/edges
- one line for metadata
- one line per route segment chunk

NDJSON is explicitly designed for transporting multiple JSON texts in a stream protocol.

#### Pros

- can start processing before the whole payload arrives
- better observability and progress reporting
- easier cancellation/retry semantics
- avoids one monolithic `Response.json()` boundary

#### Cons

- record-per-edge NDJSON would still create far too many JS objects
- to be worth it here, it should be **chunked columnar NDJSON**, not row NDJSON
- the frontend load path must become stream-based
- the renderer still currently expects all arrays at once

#### Fit for current pipeline

Moderate.

It can preserve all current information, but it does not by itself solve the "final representation should look like GPU inputs" goal unless each chunk is already columnar.

#### Best shape if chosen

Not:

- one JSON object per edge
- one JSON object per node

Better:

- one NDJSON record for metadata
- then NDJSON records where each record contains columnar chunks:
  - `vertex_ids_chunk`
  - `xs_chunk`
  - `ys_chunk`
  - `edge_sources_chunk`
  - `edge_targets_chunk`

#### Overall assessment

Viable, but only attractive if progressiveness is a major goal. On its own, it is weaker than columnar binary formats for this renderer.

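A sketch of consuming that chunked-columnar shape (the record fields `kind`, `xs_chunk`, and `ys_chunk` are assumptions, not an agreed contract; a real consumer would read lines from the fetch body stream rather than a string):

```typescript
// Accumulate columns across chunked columnar NDJSON records.
function parseColumnarNdjson(body: string) {
  const xs: number[] = [];
  const ys: number[] = [];
  let meta: unknown = null;
  for (const line of body.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const rec = JSON.parse(line); // each line is one complete JSON text
    if (rec.kind === "meta") {
      meta = rec.meta;
    } else if (rec.kind === "node_chunk") {
      xs.push(...rec.xs_chunk);
      ys.push(...rec.ys_chunk);
    }
  }
  return { meta, xs: new Float32Array(xs), ys: new Float32Array(ys) };
}
```

Because each chunk arrives as a small record, the renderer could in principle be fed progressively, but as noted above the current renderer expects all arrays at once.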
### 3. MessagePack

#### Idea

Use a compact binary encoding instead of JSON.

The official JavaScript implementation supports:

- `encode`
- `decode`
- `decodeAsync(stream)`
- `decodeArrayStream(stream)`
- `decodeMultiStream(stream)`

and even custom extension types for faster handling of large `Float32Array` payloads.

#### Pros

- smaller payload than JSON
- binary transport
- async and stream-capable decoding APIs exist
- mature JS library

#### Cons

- if you keep the current row-oriented schema, you still get one huge object graph after decode
- therefore MessagePack alone does not remove the fundamental object-allocation problem
- custom extension types improve typed-array cases, but then you are already halfway to designing a custom binary protocol

#### Fit for current pipeline

Moderate.

It can preserve all current information easily.

But if the schema remains object-heavy, the browser still ends up with millions of JS objects.

#### Overall assessment

Useful if paired with a **columnar** schema. Not compelling as a first move if the schema stays row-oriented.

### 4. Apache Arrow IPC

#### Idea

Use Arrow's columnar binary format and Arrow JS support.

Arrow JS provides:

- `tableFromIPC(...)`
- support for `fetch(...)`
- typed-array-backed vectors
- dictionary-encoded strings
- a columnar memory model explicitly meant for efficient processing and movement of large in-memory data

#### Pros

- strongest off-the-shelf fit for typed-array-oriented rendering
- columnar by design
- binary rather than textual
- supports large numeric columns very naturally
- supports dictionary encoding for repeated strings like labels or IRIs
- much closer to the renderer/GPU input shape than JSON objects

#### Cons

- larger conceptual/tooling jump than columnar JSON
- route segments are nested/variable-length; representing them cleanly needs design
- frontend code becomes Arrow-aware unless the decode is hidden behind an adapter
- backend must serialize Arrow on the Go side or produce Arrow-compatible IPC

#### Fit for current pipeline

Very good.

Current frontend needs can be represented as columns:

- `vertex_ids: uint32`
- `xs: float32`
- `ys: float32`
- `edge_sources: uint32`
- `edge_targets: uint32`
- `labels: utf8` or dictionary-encoded utf8
- `iris: utf8` or dictionary-encoded utf8

Route geometry should probably not stay as nested route-segment objects. It would fit better as:

- a pre-flattened `route_line_vertices` float column/buffer
- or a second Arrow table dedicated to line segments

#### Overall assessment

One of the two best solutions for this repo.

If you want a standard format instead of inventing one, Arrow is the most attractive candidate.

### 5. FlatBuffers

#### Idea

Use a schema-defined binary format designed for direct access without unpacking/parsing.

FlatBuffers explicitly advertises:

- access to serialized data without parsing/unpacking
- memory efficiency and speed
- forwards/backwards compatibility

#### Pros

- very strong memory-efficiency story
- schema evolution support
- no full parse/unpack step in the same way as JSON
- can model both scalars and more complex structures

#### Cons

- requires schema/compiler/generated bindings
- JavaScript integration is more manual than JSON or Arrow
- ergonomics in app code are not as simple as arrays/objects
- strings and nested route structures are supported, but the developer experience is more specialized

#### Fit for current pipeline

Good, technically.

It can preserve all current information and remove the giant object-graph parse step.

However, compared with Arrow or a custom binary envelope, it is a less natural conceptual fit for a renderer whose hot path is already columnar/typed-array-based.

#### Overall assessment

A strong technical option, but probably not the most ergonomic option for this specific frontend.

### 6. Protocol Buffers

#### Idea

Use a schema-defined binary format with generated bindings.

#### Pros

- compact binary encoding
- schema/versioning
- mature ecosystem

#### Cons

- official docs describe protobuf as a good fit for typed structured messages up to a few megabytes
- the same docs warn that large data can require loading entire messages into memory and can cause multiple copies
- large repeated numeric arrays are not protobuf's sweet spot
- still not especially close to the renderer's typed-array model

#### Fit for current pipeline

Poor for this specific payload size and shape.

#### Overall assessment

Not recommended for this main graph transport.

### 7. Custom Binary Typed-Array Envelope

#### Idea

Define a transport specifically around what the renderer and hover/selection pipeline need.

Example structure:

- small fixed header or small JSON header:
  - version
  - counts
  - offsets/lengths
  - meta subset
- then raw binary buffers:
  - `vertex_ids`
  - `xs`
  - `ys`
  - `edge_sources`
  - `edge_targets`
  - `route_line_vertices`
  - string dictionary / offsets for `label` and `iri`

#### Pros

- closest possible fit to the current renderer
- no schema compiler required
- no row-object materialization
- easiest path to zero-copy or near-zero-copy arrays on the frontend
- easiest path to worker transfer via `ArrayBuffer`
- can separate hot render data from cold metadata cleanly

#### Cons

- custom protocol to design, version, validate, and document
- less tooling/interoperability than Arrow
- backend and frontend both need careful binary codecs

#### Fit for current pipeline

Excellent.

You can preserve all current behavior while only sending the data the frontend actually uses.

#### Overall assessment

The best performance-oriented fit if you are comfortable owning a custom format.

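A minimal sketch of the envelope idea, assuming a 12-byte little-endian header (version, node count, id count) followed by two raw sections; a real protocol would add offsets, edge columns, route geometry, and a string dictionary:

```typescript
// Encode: fixed header, then raw column bytes, one ArrayBuffer total.
function encodeEnvelope(xs: Float32Array, vertexIds: Uint32Array): ArrayBuffer {
  const headerBytes = 12;
  const buf = new ArrayBuffer(headerBytes + xs.byteLength + vertexIds.byteLength);
  const view = new DataView(buf);
  view.setUint32(0, 1, true);                 // version
  view.setUint32(4, xs.length, true);         // node count
  view.setUint32(8, vertexIds.length, true);  // id count
  new Uint8Array(buf, headerBytes, xs.byteLength)
    .set(new Uint8Array(xs.buffer, xs.byteOffset, xs.byteLength));
  new Uint8Array(buf, headerBytes + xs.byteLength)
    .set(new Uint8Array(vertexIds.buffer, vertexIds.byteOffset, vertexIds.byteLength));
  return buf;
}

// Decode: typed-array views over the fetched buffer -- no parse, no copy.
function decodeEnvelope(buf: ArrayBuffer) {
  const view = new DataView(buf);
  if (view.getUint32(0, true) !== 1) throw new Error("unsupported envelope version");
  const nodeCount = view.getUint32(4, true);
  const idCount = view.getUint32(8, true);
  const xs = new Float32Array(buf, 12, nodeCount);
  const vertexIds = new Uint32Array(buf, 12 + nodeCount * 4, idCount);
  return { xs, vertexIds };
}
```

Note the 12-byte header keeps both sections 4-byte aligned, which the typed-array view constructors require.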
## Comparison Table

| Option | Closeness to GPU shape | Avoids giant object graph | Supports all current frontend data | Streaming-friendly | Implementation cost | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| Current row JSON | Poor | No | Yes | Poor | Already done | Replace |
| Columnar JSON | Medium | No | Yes | Medium | Low | Good interim |
| NDJSON chunked columnar JSON | Medium | Partially | Yes | Good | Medium | Situational |
| MessagePack row-oriented | Poor | No | Yes | Good | Medium | Not enough alone |
| MessagePack columnar | Medium | Partially | Yes | Good | Medium | Viable but secondary |
| Arrow IPC | Very high | Yes or mostly yes | Yes | Good | Medium-high | Strong candidate |
| FlatBuffers | High | Yes | Yes | Medium | High | Good but specialized |
| Protobuf | Low-medium | No practical win here | Yes | Medium | Medium-high | Not recommended |
| Custom binary typed-array envelope | Very high | Yes | Yes | Good | High | Strongest fit |

## Recommended Data Contract Shapes

### Recommended shape for any non-row-oriented solution

The frontend does not need node/edge objects as its primary graph transport.

The main graph payload should be modeled as:

- `vertex_ids`
- `xs`
- `ys`
- `edge_sources`
- `edge_targets`
- `route_line_vertices`
- `node_labels`
- `node_iris`
- `meta`

This can be represented as:

- columnar JSON
- Arrow columns
- FlatBuffers vectors
- custom binary sections

### Fields that can be removed from the frontend transport immediately

Without changing current visible behavior, the main graph transport does not need to include:

- `nodes[].termType`
- `edges[].predicate_id`
- `meta.predicates`
- `meta.ttl_path`
- `meta.sparql_endpoint`
- `meta.include_bnodes`
- `meta.layout_engine`
- `meta.layout_root_iri`
- `route_segments[].edge_index`
- `route_segments[].kind`

Important:

Some of those fields are still needed by the backend's **internal snapshot**, especially for selection queries and hierarchy layout. That argues for splitting:

- the internal snapshot model
- the frontend transport DTO

instead of continuing to reuse one struct for both.

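On the Go side this split would be a separate DTO struct; the same projection can be sketched in TypeScript (the loose `Meta` shape is a stand-in for the real struct, and the field list comes from the audit above):

```typescript
// Project the rich internal meta onto the subset the frontend reads.
type Meta = Record<string, unknown>;

const TRANSPORT_META_FIELDS = [
  "backend",
  "nodes",
  "edges",
  "graph_query_id",
  "node_limit",
  "edge_limit",
] as const;

function toTransportMeta(meta: Meta): Meta {
  const out: Meta = {};
  for (const k of TRANSPORT_META_FIELDS) {
    if (k in meta) out[k] = meta[k]; // copy only whitelisted fields
  }
  return out;
}
```

A whitelist (rather than deleting known-internal fields) means newly added internal fields stay internal by default.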
## Additional Architectural Notes

### A worker is complementary, not a transport format

Web Workers can move parsing/build work off the main thread, and `ArrayBuffer` is transferable. That is useful, but it does not by itself solve the current over-allocation problem if the payload is still a giant row-oriented JSON document.

Workers are most valuable when paired with:

- binary columnar payloads
- streamed columnar chunks
- transfer of `ArrayBuffer`s rather than giant JS object graphs

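Transfer semantics can be sketched with `structuredClone`'s transfer list, which has the same behavior as `worker.postMessage(buf, [buf])`: the buffer's ownership moves and the source becomes detached, so no copy of the bytes is made.

```typescript
// Demonstrate ArrayBuffer transfer: ownership moves, bytes are not copied.
const xs = new Float32Array([1, 2, 3]);
const buf = xs.buffer;
const moved = structuredClone(buf, { transfer: [buf] }) as ArrayBuffer;
const movedXs = new Float32Array(moved);
// After the transfer the source buffer is detached (byteLength 0),
// while the moved buffer still holds the data.
console.log(buf.byteLength, movedXs[0]);
```

This is why a binary columnar payload pairs so well with a worker: the decoded `ArrayBuffer` can be handed to the main thread for `renderer.init(...)` without a second copy.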
### The backend can keep a richer internal snapshot than it sends

This repo already caches snapshots server-side. Selection and triple queries are built from the backend snapshot and the small `graphMeta` values sent back by the client.

That means the frontend transport can be much slimmer than the backend snapshot representation, as long as the backend retains its richer internal data.

This is the cleanest way to avoid losing information while optimizing the frontend transport.

## Final Recommendation

### Best long-term option

Pick one of:

1. **Custom binary typed-array envelope**
2. **Apache Arrow IPC**

Reason:

- both map naturally to the renderer's actual input model
- both avoid the giant row-object parse path
- both can preserve all current frontend-visible information

### Best low-risk migration path

If you want an incremental step before going binary:

1. split the backend internal snapshot from the frontend transport DTO
2. move `/api/graph` to **columnar JSON**
3. keep only the metadata fields the frontend actually uses
4. later replace the same columnar DTO with Arrow or custom binary

That path reduces waste immediately and keeps the eventual binary migration straightforward.

## Sources

Official documentation and primary sources used for the comparison:

- MDN `Response.json()`
  - https://developer.mozilla.org/en-US/docs/Web/API/Response/json
- MDN `TextDecoderStream`
  - https://developer.mozilla.org/en-US/docs/Web/API/TextDecoderStream
- MDN Web Workers
  - https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
- MDN Transferable Objects
  - https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Transferable_objects
- Apache Arrow JavaScript
  - https://arrow.apache.org/js/current/
  - https://arrow.apache.org/js/main/functions/Arrow.dom.tableFromIPC.html
- NDJSON specification
  - https://github.com/ndjson/ndjson-spec
- MessagePack for JavaScript
  - https://github.com/msgpack/msgpack-javascript
- FlatBuffers overview and JavaScript docs
  - https://flatbuffers.dev/
  - https://flatbuffers.dev/languages/javascript/
- Protocol Buffers overview
  - https://protobuf.dev/overview/
- Streaming JSON parser references
  - https://github.com/juanjoDiaz/streamparser-json
  - https://rictic.github.io/jsonriver/