visualizador_instanciados/GRAPH_TRANSPORT_ALTERNATIVES.md


Graph Transport Alternatives

Purpose

This document compares alternatives to the current /api/graph transport format with two goals:

  1. reduce the cost of building, transferring, and decoding very large graph payloads
  2. move the frontend transport shape closer to the renderer/GPU input shape while preserving all data the current frontend and backend pipeline still need

This analysis is based on the current repo state plus official documentation for browser fetch/streaming and candidate transport formats.

Executive Summary

The current bottleneck is not the renderer's typed-array path. It is the browser's need to fully materialize a huge JSON object graph before the renderer ever runs.

The best candidates for this repo are:

  1. Custom binary columnar payload

    • Best fit for the current renderer.
    • Lowest decode overhead.
    • Most direct path from backend memory to frontend typed arrays.
    • Requires custom protocol/versioning work.
  2. Apache Arrow IPC

    • Best off-the-shelf columnar binary format.
    • Very good fit for typed-array-heavy rendering.
    • Strong option if you want a standard format instead of inventing one.
    • Heavier conceptual/tooling footprint than a custom binary envelope.
  3. Columnar JSON

    • Easiest migration.
    • Better than today's row-oriented JSON.
    • Still fundamentally JSON, so it does not remove the browser's JSON parse/object-materialization cost.
  4. NDJSON / streamed chunked JSON

    • Good if progressiveness matters.
    • Better than one giant monolithic JSON document.
    • Still weaker than a binary/columnar format for this renderer.

The strongest overall recommendation is:

  • Long-term: custom binary columnar payload or Arrow IPC
  • Low-risk interim: columnar JSON, possibly with chunking/streaming

Not recommended as the primary solution for this repo:

  • row-oriented MessagePack
  • Protocol Buffers as one giant message

Verified Current Pipeline

Backend side

The backend builds a GraphResponse and caches it in memory:

  • backend_go/models.go
  • backend_go/snapshot_service.go
  • backend_go/graph_snapshot.go

The response shape is:

type GraphResponse struct {
    Nodes         []Node
    Edges         []Edge
    RouteSegments []RouteSegment
    Meta          *GraphMeta
}

and it is currently written as one JSON document with:

json.NewEncoder(w).Encode(v)

in backend_go/http_helpers.go.

Frontend side

The frontend currently does:

  1. fetch("/api/graph?...")
  2. await graphRes.json()
  3. read graph.nodes, graph.edges, graph.route_segments, graph.meta
  4. build:
    • Float32Array xs
    • Float32Array ys
    • Uint32Array vertexIds
    • Uint32Array edgeData
    • Float32Array routeLineVertices
  5. call renderer.init(xs, ys, vertexIds, edgeData, routeLineVertices)
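
Sketched in TypeScript, steps 3 and 4 look roughly like this (an illustrative reconstruction, not the exact App.tsx code). The key point: every node and edge must already exist as a JS object before a single typed-array slot is filled.

```typescript
// Illustrative shapes; the real App.tsx types may differ.
interface GraphNode { id: number; x: number; y: number; iri: string; label: string }
interface GraphEdge { source: number; target: number }

function buildRenderArrays(nodes: GraphNode[], edges: GraphEdge[]) {
  const n = nodes.length;
  const xs = new Float32Array(n);
  const ys = new Float32Array(n);
  const vertexIds = new Uint32Array(n);
  // One pass over already-materialized JS objects: the allocation cost
  // was paid during Response.json(), before this loop ever runs.
  for (let i = 0; i < n; i++) {
    xs[i] = nodes[i].x;
    ys[i] = nodes[i].y;
    vertexIds[i] = nodes[i].id;
  }
  // Interleaved (source, target) pairs, as the renderer consumes them.
  const edgeData = new Uint32Array(edges.length * 2);
  for (let i = 0; i < edges.length; i++) {
    edgeData[i * 2] = edges[i].source;
    edgeData[i * 2 + 1] = edges[i].target;
  }
  return { xs, ys, vertexIds, edgeData };
}
```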

Relevant files:

  • frontend/src/App.tsx
  • frontend/src/renderer.ts

This means the current browser path is:

  • wire bytes
  • JSON text/body handling
  • JS arrays of node/edge objects
  • typed arrays
  • renderer-side typed arrays/maps/GPU buffers

The expensive work happens in steps 2 and 3: JSON parsing and row-object materialization dominate before the typed arrays in step 4 are ever built.

Verified Data Access Audit

This section verifies every field currently produced by the backend and whether it is actually needed by the frontend transport.

Main graph response fields

| Field | Produced in | Used by frontend? | Where used | Required on wire? | Notes |
| --- | --- | --- | --- | --- | --- |
| nodes[].id | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build vertexIds and to map selected renderer indices back to backend IDs for selection queries. |
| nodes[].x | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build xs. |
| nodes[].y | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build ys. |
| nodes[].iri | backend_go/models.go | Yes | frontend/src/App.tsx | Yes, if keeping current hover UX | Used for hover tooltip text. |
| nodes[].label | backend_go/models.go | Yes | frontend/src/App.tsx | Yes, if keeping current hover UX | Used for hover tooltip text. |
| nodes[].termType | backend_go/models.go | No | none in frontend/src | No | Still needed internally by the backend snapshot/selection index. |
| edges[].source | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build edgeData. |
| edges[].target | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build edgeData. |
| edges[].predicate_id | backend_go/models.go | No | none in frontend/src/App.tsx | No | Still needed internally by the backend snapshot and hierarchy layout preparation. |
| route_segments[].points | backend_go/models.go | Yes | frontend/src/App.tsx | Yes, when route segments are present | Used to build routeLineVertices. |
| route_segments[].edge_index | backend_go/models.go | Not used after parsing | graphRouteSegmentArray validation only | No | Could be dropped from the frontend transport if route lines are pre-flattened. |
| route_segments[].kind | backend_go/models.go | Not used after parsing | graphRouteSegmentArray validation only | No | Could be dropped from the frontend transport if route lines are pre-flattened. |
| meta.backend | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Displayed in the overlay. |
| meta.nodes | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Displayed in the overlay. |
| meta.edges | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Displayed in the overlay. |
| meta.graph_query_id | backend_go/models.go | Yes | frontend/src/selection_queries/api.ts | Yes | Sent back on selection endpoints. |
| meta.node_limit | backend_go/models.go | Yes | frontend/src/selection_queries/api.ts | Yes | Sent back on selection endpoints. |
| meta.edge_limit | backend_go/models.go | Yes | frontend/src/selection_queries/api.ts | Yes | Sent back on selection endpoints. |
| meta.ttl_path | backend_go/models.go | No | none in frontend/src | No | The frontend type declares it, but the current UI does not use it. |
| meta.sparql_endpoint | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.include_bnodes | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.layout_engine | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.layout_root_iri | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.predicates | backend_go/models.go | No | none in frontend/src | No | Still used internally by backend selection/hierarchy logic. |

Backend-internal fields that do not need to stay in the frontend transport

This is the most important audit result.

The backend currently reuses one struct for:

  • internal cached snapshot
  • HTTP response payload

That is convenient, but it means the frontend receives fields that only the backend needs.

Verified internal-only dependencies:

  • snapshot.Nodes[].TermType is used in backend_go/selection_query.go to build the selection index.
  • snapshot.Meta.Predicates is used in backend_go/selection_query.go.
  • Edge.PredicateID is used internally for hierarchy layout preparation in backend_go/hierarchy_layout_bridge.go.

The frontend does not need those fields for current behavior.

What the frontend actually needs

For the current graph view, the hot path can be reduced to:

  • vertexIds[]
  • xs[]
  • ys[]
  • edgeSources[]
  • edgeTargets[]
  • routeLineVertices[] or route geometry equivalent
  • label[] and iri[] by node index
  • meta.backend
  • meta.nodes
  • meta.edges
  • meta.graph_query_id
  • meta.node_limit
  • meta.edge_limit

That is much closer to a columnar or binary payload than to the current array-of-objects JSON.
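
As a type-level sketch, the slim payload could look like the following. Field names mirror the list above; the typed-array vs plain-array choice and the meta field types are assumptions that depend on which transport is ultimately picked.

```typescript
// Hypothetical slim transport DTO; not an existing type in the repo.
interface SlimGraphMeta {
  backend: string;
  nodes: number;
  edges: number;
  graph_query_id: string;
  node_limit: number;
  edge_limit: number;
}

interface SlimGraphPayload {
  vertex_ids: Uint32Array;
  xs: Float32Array;
  ys: Float32Array;
  edge_sources: Uint32Array;
  edge_targets: Uint32Array;
  route_line_vertices: Float32Array; // pre-flattened line geometry
  node_labels: string[];             // indexed by node position
  node_iris: string[];               // indexed by node position
  meta: SlimGraphMeta;
}
```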

Why the Current JSON Path Hurts

Response.json() is not just a lightweight decode helper. MDN states that Response.json() reads the stream to completion and resolves with the result of parsing the body text as JSON into a JavaScript object.

That matters here because the current payload is row-oriented:

  • millions of node objects
  • millions of edge objects

Even though the renderer later wants typed arrays, the browser must first create those JS objects.

This is exactly the part that can stall or run out of memory before renderer.init(...) starts.

Alternatives

1. Columnar JSON

Idea

Keep JSON, but change the schema from row-oriented objects:

{
  "nodes": [{ "id": 1, "x": 0.1, "y": 0.2, ... }],
  "edges": [{ "source": 1, "target": 2, ... }]
}

to column-oriented arrays:

{
  "vertex_ids": [...],
  "xs": [...],
  "ys": [...],
  "edge_sources": [...],
  "edge_targets": [...],
  "node_labels": [...],
  "node_iris": [...],
  "route_line_vertices": [...],
  "meta": { ... }
}
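
A hedged sketch of the corresponding frontend decode (field names as in the schema above): the JSON is still parsed, but the columns copy straight into typed arrays without creating any per-node or per-edge objects.

```typescript
// Assumed columnar shape matching the example schema above.
interface ColumnarGraph {
  vertex_ids: number[];
  xs: number[];
  ys: number[];
  edge_sources: number[];
  edge_targets: number[];
}

function decodeColumnar(g: ColumnarGraph) {
  // Direct column-to-typed-array copies; no row objects are allocated here.
  const xs = Float32Array.from(g.xs);
  const ys = Float32Array.from(g.ys);
  const vertexIds = Uint32Array.from(g.vertex_ids);
  // Interleave sources/targets into the (source, target) pairs the renderer expects.
  const edgeData = new Uint32Array(g.edge_sources.length * 2);
  for (let i = 0; i < g.edge_sources.length; i++) {
    edgeData[i * 2] = g.edge_sources[i];
    edgeData[i * 2 + 1] = g.edge_targets[i];
  }
  return { xs, ys, vertexIds, edgeData };
}
```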

Pros

  • easiest migration from the current API contract
  • no schema compiler
  • easy to debug with ordinary tooling
  • much closer to what the renderer already consumes
  • avoids creating per-edge objects in frontend application code

Cons

  • still goes through JSON parsing
  • still materializes JS arrays before typed arrays are built
  • huge numeric arrays in JSON are still text, not binary
  • string columns are still ordinary JS strings

Fit for current pipeline

Good.

No current frontend feature would be lost if the payload includes:

  • ids/xs/ys/edge sources/targets
  • labels/iris
  • route line vertices or equivalent
  • the small subset of meta fields currently used

Overall assessment

Best low-risk intermediate step.

It is clearly better than today's row-oriented JSON, but it is not the endgame if the goal is to remove the parse bottleneck for 1 GB+ payloads.

2. NDJSON / Chunked JSON

Idea

Change the backend to stream multiple JSON records instead of one giant JSON object.

Examples:

  • one line per chunk of nodes/edges
  • one line for metadata
  • one line per route segment chunk

NDJSON is explicitly designed for transporting multiple JSON texts in a stream protocol.

Pros

  • can start processing before the whole payload arrives
  • better observability and progress reporting
  • easier cancellation/retry semantics
  • avoids one monolithic Response.json() boundary

Cons

  • record-per-edge NDJSON would still create far too many JS objects
  • to be worth it here, it should be chunked columnar NDJSON, not row NDJSON
  • frontend load path must become stream-based
  • renderer still currently expects all arrays at once

Fit for current pipeline

Moderate.

It can preserve all current information, but it does not by itself solve the "final representation should look like GPU inputs" goal unless each chunk is already columnar.

Best shape if chosen

Not:

  • one JSON object per edge
  • one JSON object per node

Better:

  • one NDJSON record for metadata
  • then NDJSON records where each record contains columnar chunks:
    • vertex_ids_chunk
    • xs_chunk
    • ys_chunk
    • edge_sources_chunk
    • edge_targets_chunk
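
A minimal consumer for that shape might look like this. The record field names (xs_chunk, ys_chunk) are assumptions, and a real implementation would read the response body incrementally via ReadableStream rather than from a whole string; this sketch only shows the chunk-append structure.

```typescript
// Parse a columnar NDJSON body: first record is metadata, later records
// carry columnar chunks that are appended and then converted once.
function parseColumnarNdjson(body: string) {
  const lines = body.split("\n").filter((l) => l.trim().length > 0);
  const meta = JSON.parse(lines[0]);
  const xs: number[] = [];
  const ys: number[] = [];
  for (const line of lines.slice(1)) {
    const chunk = JSON.parse(line);
    xs.push(...chunk.xs_chunk);
    ys.push(...chunk.ys_chunk);
  }
  // Single typed-array conversion at the end, once all chunks arrived.
  return { meta, xs: Float32Array.from(xs), ys: Float32Array.from(ys) };
}
```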

Overall assessment

Viable, but only attractive if progressiveness is a major goal. On its own, it is weaker than columnar binary formats for this renderer.

3. MessagePack

Idea

Use a compact binary encoding instead of JSON.

The official JavaScript implementation supports:

  • encode
  • decode
  • decodeAsync(stream)
  • decodeArrayStream(stream)
  • decodeMultiStream(stream)

and even custom extension types for faster handling of large Float32Array payloads.

Pros

  • smaller payload than JSON
  • binary transport
  • async and stream-capable decoding APIs exist
  • mature JS library

Cons

  • if you keep the current row-oriented schema, you still get one huge object graph after decode
  • therefore MessagePack alone does not remove the fundamental object-allocation problem
  • custom extension types improve typed-array cases, but then you are already halfway to designing a custom binary protocol

Fit for current pipeline

Moderate.

It can preserve all current information easily.

But if the schema remains object-heavy, the browser still ends up with millions of JS objects.

Overall assessment

Useful if paired with a columnar schema. Not compelling as a first move if the schema stays row-oriented.

4. Apache Arrow IPC

Idea

Use Arrow's columnar binary format and Arrow JS support.

Arrow JS provides:

  • tableFromIPC(...)
  • support for fetch(...)
  • typed-array-backed vectors
  • dictionary-encoded strings
  • a columnar memory model explicitly meant for efficient processing and movement of large in-memory data

Pros

  • strongest off-the-shelf fit for typed-array-oriented rendering
  • columnar by design
  • binary rather than textual
  • supports large numeric columns very naturally
  • supports dictionary encoding for repeated strings like labels or IRIs
  • much closer to the renderer/GPU input shape than JSON objects

Cons

  • larger conceptual/tooling jump than columnar JSON
  • route segments are nested/variable-length; representing them cleanly needs design
  • frontend code becomes Arrow-aware unless the decode is hidden behind an adapter
  • backend must serialize Arrow on the Go side or produce Arrow-compatible IPC

Fit for current pipeline

Very good.

Current frontend needs can be represented as columns:

  • vertex_ids: uint32
  • xs: float32
  • ys: float32
  • edge_sources: uint32
  • edge_targets: uint32
  • labels: utf8 or dictionary-encoded utf8
  • iris: utf8 or dictionary-encoded utf8
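
A sketch of what the frontend decode could look like with the apache-arrow npm package. The column names are assumptions taken from the list above, not an existing schema; for primitive columns, Arrow's toArray() yields typed-array views over the IPC buffers.

```typescript
import { tableFromIPC, Table } from "apache-arrow";

// Hedged sketch: fetch an Arrow IPC payload and read its columns as
// typed arrays. Endpoint and column names are illustrative assumptions.
async function loadGraphArrow(url: string) {
  const res = await fetch(url);
  const table: Table = tableFromIPC(new Uint8Array(await res.arrayBuffer()));
  return {
    xs: table.getChild("xs")!.toArray() as Float32Array,
    ys: table.getChild("ys")!.toArray() as Float32Array,
    vertexIds: table.getChild("vertex_ids")!.toArray() as Uint32Array,
    labels: table.getChild("labels"), // dictionary-encoded utf8 vector
  };
}
```

Hiding this behind an adapter with the same return shape as the current build step would keep the rest of the frontend Arrow-unaware.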

Route geometry should probably not stay as nested route-segment objects. It would fit better as:

  • pre-flattened route_line_vertices float column/buffer
  • or a second Arrow table dedicated to line segments

Overall assessment

One of the two best solutions for this repo.

If you want a standard format instead of inventing one, Arrow is the most attractive candidate.

5. FlatBuffers

Idea

Use a schema-defined binary format designed for direct access without unpacking/parsing.

FlatBuffers explicitly advertises:

  • access to serialized data without parsing/unpacking
  • memory efficiency and speed
  • forwards/backwards compatibility

Pros

  • very strong memory-efficiency story
  • schema evolution support
  • no full parse/unpack step in the same way as JSON
  • can model both scalars and more complex structures

Cons

  • requires schema/compiler/generated bindings
  • JavaScript integration is more manual than JSON or Arrow
  • ergonomics in app code are not as simple as arrays/objects
  • strings and nested route structures are supported, but the developer experience is more specialized

Fit for current pipeline

Good, technically.

It can preserve all current information and remove the giant object-graph parse step.

However, compared with Arrow or a custom binary envelope, it is a less natural conceptual fit for a renderer whose hot path is already columnar/typed-array-based.

Overall assessment

A strong technical option, but probably not the most ergonomic option for this specific frontend.

6. Protocol Buffers

Idea

Use a schema-defined binary format with generated bindings.

Pros

  • compact binary encoding
  • schema/versioning
  • mature ecosystem

Cons

  • official docs describe protobuf as a good fit for typed structured messages up to a few megabytes
  • the same docs warn that large data can require loading entire messages into memory and can cause multiple copies
  • large repeated numeric arrays are not protobuf's sweet spot
  • still not especially close to the renderer's typed-array model

Fit for current pipeline

Poor for this specific payload size and shape.

Overall assessment

Not recommended for this main graph transport.

7. Custom Binary Typed-Array Envelope

Idea

Define a transport specifically around what the renderer and hover/selection pipeline need.

Example structure:

  • small fixed header or small JSON header:
    • version
    • counts
    • offsets/lengths
    • meta subset
  • then raw binary buffers:
    • vertex_ids
    • xs
    • ys
    • edge_sources
    • edge_targets
    • route_line_vertices
    • string dictionary / offsets for label and iri
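
A minimal encoder/decoder pair for one possible layout follows. The layout, field names, and alignment rules here are assumptions for illustration, not an existing format in this repo: a u32 header length, a small JSON header carrying counts, then raw float32/uint32 sections.

```typescript
// Assumed layout: [u32 headerLen][JSON header + padding][xs][ys][edges]
function encodeEnvelope(xs: Float32Array, ys: Float32Array, edges: Uint32Array): ArrayBuffer {
  const header = new TextEncoder().encode(
    JSON.stringify({ version: 1, nodes: xs.length, edges: edges.length / 2 })
  );
  const pad = (4 - (header.length % 4)) % 4; // keep binary sections 4-byte aligned
  const total = 4 + header.length + pad + xs.byteLength + ys.byteLength + edges.byteLength;
  const buf = new ArrayBuffer(total);
  new DataView(buf).setUint32(0, header.length + pad, true); // padded header length
  new Uint8Array(buf, 4, header.length).set(header);
  let off = 4 + header.length + pad;
  new Float32Array(buf, off, xs.length).set(xs); off += xs.byteLength;
  new Float32Array(buf, off, ys.length).set(ys); off += ys.byteLength;
  new Uint32Array(buf, off, edges.length).set(edges);
  return buf;
}

function decodeEnvelope(buf: ArrayBuffer) {
  const headerLen = new DataView(buf).getUint32(0, true);
  const headerText = new TextDecoder().decode(new Uint8Array(buf, 4, headerLen));
  const header = JSON.parse(headerText.replace(/\0+$/, "")); // strip padding bytes
  let off = 4 + headerLen;
  // Zero-copy views over the fetched buffer; nothing is re-parsed or re-allocated.
  const xs = new Float32Array(buf, off, header.nodes); off += header.nodes * 4;
  const ys = new Float32Array(buf, off, header.nodes); off += header.nodes * 4;
  const edgeData = new Uint32Array(buf, off, header.edges * 2);
  return { header, xs, ys, edgeData };
}
```

Because the decoder returns views over the fetched ArrayBuffer, the arrays are zero-copy and can be handed to the renderer directly or transferred to a worker.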

Pros

  • closest possible fit to current renderer
  • no schema compiler required
  • no row-object materialization
  • easiest path to zero-copy or near-zero-copy arrays on the frontend
  • easiest path to worker transfer via ArrayBuffer
  • can separate hot render data from cold metadata cleanly

Cons

  • custom protocol to design, version, validate, and document
  • less tooling/interoperability than Arrow
  • backend and frontend both need careful binary codecs

Fit for current pipeline

Excellent.

You can preserve all current behavior while only sending the data the frontend actually uses.

Overall assessment

The best performance-oriented fit if you are comfortable owning a custom format.

Comparison Table

| Option | Closeness to GPU shape | Avoids giant object graph | Supports all current frontend data | Streaming-friendly | Implementation cost | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| Current row JSON | Poor | No | Yes | Poor | Already done | Replace |
| Columnar JSON | Medium | No | Yes | Medium | Low | Good interim |
| NDJSON (chunked columnar) | Medium | Partially | Yes | Good | Medium | Situational |
| MessagePack (row-oriented) | Poor | No | Yes | Good | Medium | Not enough alone |
| MessagePack (columnar) | Medium | Partially | Yes | Good | Medium | Viable but secondary |
| Arrow IPC | Very high | Mostly | Yes | Good | Medium-high | Strong candidate |
| FlatBuffers | High | Yes | Yes | Medium | High | Good but specialized |
| Protobuf | Low-medium | No practical win here | Yes | Medium | Medium-high | Not recommended |
| Custom binary typed-array envelope | Very high | Yes | Yes | Good | High | Strongest fit |

Recommended Transport Shape

The frontend does not need node/edge objects as its primary graph transport.

The main graph payload should be modeled as:

  • vertex_ids
  • xs
  • ys
  • edge_sources
  • edge_targets
  • route_line_vertices
  • node_labels
  • node_iris
  • meta

This can be represented as:

  • columnar JSON
  • Arrow columns
  • FlatBuffers vectors
  • custom binary sections

Fields that can be removed from the frontend transport immediately

Without changing current visible behavior, the main graph transport does not need to include:

  • nodes[].termType
  • edges[].predicate_id
  • meta.predicates
  • meta.ttl_path
  • meta.sparql_endpoint
  • meta.include_bnodes
  • meta.layout_engine
  • meta.layout_root_iri
  • route_segments[].edge_index
  • route_segments[].kind

Important:

Some of those fields are still needed by the backend's internal snapshot, especially for selection queries and hierarchy layout. That argues for splitting:

  • internal snapshot model
  • frontend transport DTO

instead of continuing to reuse one struct for both.

Additional Architectural Notes

A worker is complementary, not a transport format

Web Workers can move parsing/build work off the main thread, and ArrayBuffer is transferable. That is useful, but it does not by itself solve the current over-allocation problem if the payload is still a giant row-oriented JSON document.

Workers are most valuable when paired with:

  • binary columnar payloads
  • streamed columnar chunks
  • transfer of ArrayBuffers rather than giant JS object graphs
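
The transfer semantics can be demonstrated without a worker: structuredClone accepts the same transfer list as worker postMessage. After the transfer, the source buffer is detached (zero length) and the receiving side owns the same memory without a copy.

```typescript
// Transfer an ArrayBuffer instead of copying it. The same transfer-list
// mechanism applies to worker.postMessage(value, transferList).
const xs = new Float32Array([1, 2, 3]);
const moved = structuredClone(xs, { transfer: [xs.buffer] });
// `xs` is now detached (byteLength 0); `moved` is backed by the moved buffer.
```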

The backend can keep a richer internal snapshot than it sends

This repo already caches snapshots server-side. Selection and triple queries are built from the backend snapshot and the small graphMeta values sent back by the client.

That means the frontend transport can be much slimmer than the backend snapshot representation, as long as the backend retains its richer internal data.

This is the cleanest way to avoid losing information while optimizing the frontend transport.

Final Recommendation

Best long-term option

Pick one of:

  1. Custom binary typed-array envelope
  2. Apache Arrow IPC

Reason:

  • both map naturally to the renderer's actual input model
  • both avoid the giant row-object parse path
  • both can preserve all current frontend-visible information

Best low-risk migration path

If you want an incremental step before going binary:

  1. split backend internal snapshot from frontend transport DTO
  2. move /api/graph to columnar JSON
  3. keep only the metadata fields the frontend actually uses
  4. later replace the same columnar DTO with Arrow or custom binary

That path reduces waste immediately and keeps the eventual binary migration straightforward.

Sources

Official documentation and primary sources used for the comparison: