visualizador_instanciados/GRAPH_TRANSPORT_ALTERNATIVES.md


Graph Transport Alternatives

Purpose

This document compares alternatives to the current /api/graph transport format with two goals:

  1. reduce the cost of building, transferring, and decoding very large graph payloads
  2. move the frontend transport shape closer to the renderer/GPU input shape while preserving all data the current frontend and backend pipeline still need

This analysis is based on the current repo state plus official documentation for browser fetch/streaming and candidate transport formats.

Executive Summary

The current bottleneck is not the renderer's typed-array path. It is the browser's need to fully materialize a huge JSON object graph before the renderer ever runs.

The best candidates for this repo are:

  1. Custom binary columnar payload

    • Best fit for the current renderer.
    • Lowest decode overhead.
    • Most direct path from backend memory to frontend typed arrays.
    • Requires custom protocol/versioning work.
  2. Apache Arrow IPC

    • Best off-the-shelf columnar binary format.
    • Very good fit for typed-array-heavy rendering.
    • Strong option if you want a standard format instead of inventing one.
    • Heavier conceptual/tooling footprint than a custom binary envelope.
  3. Columnar JSON

    • Easiest migration.
    • Better than today's row-oriented JSON.
    • Still fundamentally JSON, so it does not remove the browser's JSON parse/object-materialization cost.
  4. NDJSON / streamed chunked JSON

    • Good if progressiveness matters.
    • Better than one giant monolithic JSON document.
    • Still weaker than a binary/columnar format for this renderer.

The strongest overall recommendation is:

  • Long-term: custom binary columnar payload or Arrow IPC
  • Low-risk interim: columnar JSON, possibly with chunking/streaming

Not recommended as the primary solution for this repo:

  • row-oriented MessagePack
  • Protocol Buffers as one giant message

Verified Current Pipeline

Backend side

The backend builds a GraphResponse and caches it in memory:

  • backend_go/models.go
  • backend_go/snapshot_service.go
  • backend_go/graph_snapshot.go

The response shape is:

type GraphResponse struct {
    Nodes         []Node
    Edges         []Edge
    RouteSegments []RouteSegment
    Meta          *GraphMeta
}

and it is currently written as one JSON document with:

json.NewEncoder(w).Encode(v)

in backend_go/http_helpers.go.

Frontend side

The frontend currently does:

  1. fetch("/api/graph?...")
  2. await graphRes.json()
  3. read graph.nodes, graph.edges, graph.route_segments, graph.meta
  4. build:
    • Float32Array xs
    • Float32Array ys
    • Uint32Array vertexIds
    • Uint32Array edgeData
    • Float32Array routeLineVertices
  5. call renderer.init(xs, ys, vertexIds, edgeData, routeLineVertices)
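
Sketched in TypeScript, steps 3 and 4 look roughly like this (an illustrative reconstruction, not the exact App.tsx code). The key point: every node and edge must already exist as a JS object before a single typed-array slot is filled.

```typescript
// Illustrative shapes; the real App.tsx types may differ.
interface GraphNode { id: number; x: number; y: number; iri: string; label: string }
interface GraphEdge { source: number; target: number }

function buildRenderArrays(nodes: GraphNode[], edges: GraphEdge[]) {
  const n = nodes.length;
  const xs = new Float32Array(n);
  const ys = new Float32Array(n);
  const vertexIds = new Uint32Array(n);
  // One pass over already-materialized JS objects: the allocation cost
  // was paid during Response.json(), before this loop ever runs.
  for (let i = 0; i < n; i++) {
    xs[i] = nodes[i].x;
    ys[i] = nodes[i].y;
    vertexIds[i] = nodes[i].id;
  }
  // Interleaved (source, target) pairs, as the renderer consumes them.
  const edgeData = new Uint32Array(edges.length * 2);
  for (let i = 0; i < edges.length; i++) {
    edgeData[i * 2] = edges[i].source;
    edgeData[i * 2 + 1] = edges[i].target;
  }
  return { xs, ys, vertexIds, edgeData };
}
```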

Relevant files:

  • frontend/src/App.tsx
  • frontend/src/renderer.ts

This means the current browser path is:

  • wire bytes
  • JSON text/body handling
  • JS arrays of node/edge objects
  • typed arrays
  • renderer-side typed arrays/maps/GPU buffers

The expensive work happens in steps 2 and 3: JSON parsing and row-object materialization dominate before the typed arrays in step 4 are ever built.

Verified Data Access Audit

This section verifies every field currently produced by the backend and whether it is actually needed by the frontend transport.

Main graph response fields

| Field | Produced in | Used by frontend? | Where used | Required on wire? | Notes |
| --- | --- | --- | --- | --- | --- |
| nodes[].id | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build vertexIds and to map selected renderer indices back to backend IDs for selection queries. |
| nodes[].x | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build xs. |
| nodes[].y | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build ys. |
| nodes[].iri | backend_go/models.go | Yes | frontend/src/App.tsx | Yes, if keeping current hover UX | Used for hover tooltip text. |
| nodes[].label | backend_go/models.go | Yes | frontend/src/App.tsx | Yes, if keeping current hover UX | Used for hover tooltip text. |
| nodes[].termType | backend_go/models.go | No | none in frontend/src | No | Still needed internally by the backend snapshot/selection index. |
| edges[].source | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build edgeData. |
| edges[].target | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Used to build edgeData. |
| edges[].predicate_id | backend_go/models.go | No | none in frontend/src/App.tsx | No | Still needed internally by the backend snapshot and hierarchy layout preparation. |
| route_segments[].points | backend_go/models.go | Yes | frontend/src/App.tsx | Yes, when route segments are present | Used to build routeLineVertices. |
| route_segments[].edge_index | backend_go/models.go | Not used after parsing | graphRouteSegmentArray validation only | No | Could be dropped from the frontend transport if route lines are pre-flattened. |
| route_segments[].kind | backend_go/models.go | Not used after parsing | graphRouteSegmentArray validation only | No | Could be dropped from the frontend transport if route lines are pre-flattened. |
| meta.backend | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Displayed in the overlay. |
| meta.nodes | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Displayed in the overlay. |
| meta.edges | backend_go/models.go | Yes | frontend/src/App.tsx | Yes | Displayed in the overlay. |
| meta.graph_query_id | backend_go/models.go | Yes | frontend/src/selection_queries/api.ts | Yes | Sent back on selection endpoints. |
| meta.node_limit | backend_go/models.go | Yes | frontend/src/selection_queries/api.ts | Yes | Sent back on selection endpoints. |
| meta.edge_limit | backend_go/models.go | Yes | frontend/src/selection_queries/api.ts | Yes | Sent back on selection endpoints. |
| meta.ttl_path | backend_go/models.go | No | none in frontend/src | No | The frontend type declares it, but the current UI does not use it. |
| meta.sparql_endpoint | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.include_bnodes | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.layout_engine | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.layout_root_iri | backend_go/models.go | No | none in frontend/src | No | Not used by the current UI. |
| meta.predicates | backend_go/models.go | No | none in frontend/src | No | Still used internally by backend selection/hierarchy logic. |

Backend-internal fields that do not need to stay in the frontend transport

This is the most important audit result.

The backend currently reuses one struct for:

  • internal cached snapshot
  • HTTP response payload

That is convenient, but it means the frontend receives fields that only the backend needs.

Verified internal-only dependencies:

  • snapshot.Nodes[].TermType is used in backend_go/selection_query.go to build the selection index.
  • snapshot.Meta.Predicates is used in backend_go/selection_query.go.
  • Edge.PredicateID is used internally for hierarchy layout preparation in backend_go/hierarchy_layout_bridge.go.

The frontend does not need those fields for current behavior.

What the frontend actually needs

For the current graph view, the hot path can be reduced to:

  • vertexIds[]
  • xs[]
  • ys[]
  • edgeSources[]
  • edgeTargets[]
  • routeLineVertices[] or route geometry equivalent
  • label[] and iri[] by node index
  • meta.backend
  • meta.nodes
  • meta.edges
  • meta.graph_query_id
  • meta.node_limit
  • meta.edge_limit

That is much closer to a columnar or binary payload than to the current array-of-objects JSON.
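
As a type-level sketch, the slim payload could look like the following. Field names mirror the list above; the typed-array vs plain-array choice and the meta field types are assumptions that depend on which transport is ultimately picked.

```typescript
// Hypothetical slim transport DTO; not an existing type in the repo.
interface SlimGraphMeta {
  backend: string;
  nodes: number;
  edges: number;
  graph_query_id: string;
  node_limit: number;
  edge_limit: number;
}

interface SlimGraphPayload {
  vertex_ids: Uint32Array;
  xs: Float32Array;
  ys: Float32Array;
  edge_sources: Uint32Array;
  edge_targets: Uint32Array;
  route_line_vertices: Float32Array; // pre-flattened line geometry
  node_labels: string[];             // indexed by node position
  node_iris: string[];               // indexed by node position
  meta: SlimGraphMeta;
}
```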

Why the Current JSON Path Hurts

Response.json() is not just a lightweight decode helper. MDN states that Response.json() reads the stream to completion and resolves with the result of parsing the body text as JSON into a JavaScript object.

That matters here because the current payload is row-oriented:

  • millions of node objects
  • millions of edge objects

Even though the renderer later wants typed arrays, the browser must first create those JS objects.

This is exactly the part that can stall or run out of memory before renderer.init(...) starts.

Alternatives

1. Columnar JSON

Idea

Keep JSON, but change the schema from row-oriented objects:

{
  "nodes": [{ "id": 1, "x": 0.1, "y": 0.2, ... }],
  "edges": [{ "source": 1, "target": 2, ... }]
}

to column-oriented arrays:

{
  "vertex_ids": [...],
  "xs": [...],
  "ys": [...],
  "edge_sources": [...],
  "edge_targets": [...],
  "node_labels": [...],
  "node_iris": [...],
  "route_line_vertices": [...],
  "meta": { ... }
}
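
A hedged sketch of the corresponding frontend decode (field names as in the schema above): the JSON is still parsed, but the columns copy straight into typed arrays without creating any per-node or per-edge objects.

```typescript
// Assumed columnar shape matching the example schema above.
interface ColumnarGraph {
  vertex_ids: number[];
  xs: number[];
  ys: number[];
  edge_sources: number[];
  edge_targets: number[];
}

function decodeColumnar(g: ColumnarGraph) {
  // Direct column-to-typed-array copies; no row objects are allocated here.
  const xs = Float32Array.from(g.xs);
  const ys = Float32Array.from(g.ys);
  const vertexIds = Uint32Array.from(g.vertex_ids);
  // Interleave sources/targets into the (source, target) pairs the renderer expects.
  const edgeData = new Uint32Array(g.edge_sources.length * 2);
  for (let i = 0; i < g.edge_sources.length; i++) {
    edgeData[i * 2] = g.edge_sources[i];
    edgeData[i * 2 + 1] = g.edge_targets[i];
  }
  return { xs, ys, vertexIds, edgeData };
}
```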

Pros

  • easiest migration from the current API contract
  • no schema compiler
  • easy to debug with ordinary tooling
  • much closer to what the renderer already consumes
  • avoids creating per-edge objects in frontend application code

Cons

  • still goes through JSON parsing
  • still materializes JS arrays before typed arrays are built
  • huge numeric arrays in JSON are still text, not binary
  • string columns are still ordinary JS strings

Fit for current pipeline

Good.

No current frontend feature would be lost if the payload includes:

  • ids/xs/ys/edge sources/targets
  • labels/iris
  • route line vertices or equivalent
  • the small subset of meta fields currently used

Overall assessment

Best low-risk intermediate step.

It is clearly better than today's row-oriented JSON, but it is not the endgame if the goal is to remove the parse bottleneck for 1 GB+ payloads.

2. NDJSON / Chunked JSON

Idea

Change the backend to stream multiple JSON records instead of one giant JSON object.

Examples:

  • one line per chunk of nodes/edges
  • one line for metadata
  • one line per route segment chunk

NDJSON is explicitly designed for transporting multiple JSON texts in a stream protocol.

Pros

  • can start processing before the whole payload arrives
  • better observability and progress reporting
  • easier cancellation/retry semantics
  • avoids one monolithic Response.json() boundary

Cons

  • record-per-edge NDJSON would still create far too many JS objects
  • to be worth it here, it should be chunked columnar NDJSON, not row NDJSON
  • frontend load path must become stream-based
  • renderer still currently expects all arrays at once

Fit for current pipeline

Moderate.

It can preserve all current information, but it does not by itself solve the "final representation should look like GPU inputs" goal unless each chunk is already columnar.

Best shape if chosen

Not:

  • one JSON object per edge
  • one JSON object per node

Better:

  • one NDJSON record for metadata
  • then NDJSON records where each record contains columnar chunks:
    • vertex_ids_chunk
    • xs_chunk
    • ys_chunk
    • edge_sources_chunk
    • edge_targets_chunk
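
A minimal consumer for that shape might look like this. The record field names (xs_chunk, ys_chunk) are assumptions, and a real implementation would read the response body incrementally via ReadableStream rather than from a whole string; this sketch only shows the chunk-append structure.

```typescript
// Parse a columnar NDJSON body: first record is metadata, later records
// carry columnar chunks that are appended and then converted once.
function parseColumnarNdjson(body: string) {
  const lines = body.split("\n").filter((l) => l.trim().length > 0);
  const meta = JSON.parse(lines[0]);
  const xs: number[] = [];
  const ys: number[] = [];
  for (const line of lines.slice(1)) {
    const chunk = JSON.parse(line);
    xs.push(...chunk.xs_chunk);
    ys.push(...chunk.ys_chunk);
  }
  // Single typed-array conversion at the end, once all chunks arrived.
  return { meta, xs: Float32Array.from(xs), ys: Float32Array.from(ys) };
}
```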

Overall assessment

Viable, but only attractive if progressiveness is a major goal. On its own, it is weaker than columnar binary formats for this renderer.

3. MessagePack

Idea

Use a compact binary encoding instead of JSON.

The official JavaScript implementation supports:

  • encode
  • decode
  • decodeAsync(stream)
  • decodeArrayStream(stream)
  • decodeMultiStream(stream)

and even custom extension types for faster handling of large Float32Array payloads.

Pros

  • smaller payload than JSON
  • binary transport
  • async and stream-capable decoding APIs exist
  • mature JS library

Cons

  • if you keep the current row-oriented schema, you still get one huge object graph after decode
  • therefore MessagePack alone does not remove the fundamental object-allocation problem
  • custom extension types improve typed-array cases, but then you are already halfway to designing a custom binary protocol

Fit for current pipeline

Moderate.

It can preserve all current information easily.

But if the schema remains object-heavy, the browser still ends up with millions of JS objects.

Overall assessment

Useful if paired with a columnar schema. Not compelling as a first move if the schema stays row-oriented.

4. Apache Arrow IPC

Idea

Use Arrow's columnar binary format and Arrow JS support.

Arrow JS provides:

  • tableFromIPC(...)
  • support for fetch(...)
  • typed-array-backed vectors
  • dictionary-encoded strings
  • a columnar memory model explicitly meant for efficient processing and movement of large in-memory data

Pros

  • strongest off-the-shelf fit for typed-array-oriented rendering
  • columnar by design
  • binary rather than textual
  • supports large numeric columns very naturally
  • supports dictionary encoding for repeated strings like labels or IRIs
  • much closer to the renderer/GPU input shape than JSON objects

Cons

  • larger conceptual/tooling jump than columnar JSON
  • route segments are nested/variable-length; representing them cleanly needs design
  • frontend code becomes Arrow-aware unless the decode is hidden behind an adapter
  • backend must serialize Arrow on the Go side or produce Arrow-compatible IPC

Fit for current pipeline

Very good.

Current frontend needs can be represented as columns:

  • vertex_ids: uint32
  • xs: float32
  • ys: float32
  • edge_sources: uint32
  • edge_targets: uint32
  • labels: utf8 or dictionary-encoded utf8
  • iris: utf8 or dictionary-encoded utf8
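
A sketch of what the frontend decode could look like with the apache-arrow npm package. The column names are assumptions taken from the list above, not an existing schema; for primitive columns, Arrow's toArray() yields typed-array views over the IPC buffers.

```typescript
import { tableFromIPC, Table } from "apache-arrow";

// Hedged sketch: fetch an Arrow IPC payload and read its columns as
// typed arrays. Endpoint and column names are illustrative assumptions.
async function loadGraphArrow(url: string) {
  const res = await fetch(url);
  const table: Table = tableFromIPC(new Uint8Array(await res.arrayBuffer()));
  return {
    xs: table.getChild("xs")!.toArray() as Float32Array,
    ys: table.getChild("ys")!.toArray() as Float32Array,
    vertexIds: table.getChild("vertex_ids")!.toArray() as Uint32Array,
    labels: table.getChild("labels"), // dictionary-encoded utf8 vector
  };
}
```

Hiding this behind an adapter with the same return shape as the current build step would keep the rest of the frontend Arrow-unaware.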

Route geometry should probably not stay as nested route-segment objects. It would fit better as:

  • pre-flattened route_line_vertices float column/buffer
  • or a second Arrow table dedicated to line segments

Overall assessment

One of the two best solutions for this repo.

If you want a standard format instead of inventing one, Arrow is the most attractive candidate.

5. FlatBuffers

Idea

Use a schema-defined binary format designed for direct access without unpacking/parsing.

FlatBuffers explicitly advertises:

  • access to serialized data without parsing/unpacking
  • memory efficiency and speed
  • forwards/backwards compatibility

Pros

  • very strong memory-efficiency story
  • schema evolution support
  • no full parse/unpack step in the same way as JSON
  • can model both scalars and more complex structures

Cons

  • requires schema/compiler/generated bindings
  • JavaScript integration is more manual than JSON or Arrow
  • ergonomics in app code are not as simple as arrays/objects
  • strings and nested route structures are supported, but the developer experience is more specialized

Fit for current pipeline

Good, technically.

It can preserve all current information and remove the giant object-graph parse step.

However, compared with Arrow or a custom binary envelope, it is a less natural conceptual fit for a renderer whose hot path is already columnar/typed-array-based.

Overall assessment

A strong technical option, but probably not the most ergonomic option for this specific frontend.

6. Protocol Buffers

Idea

Use a schema-defined binary format with generated bindings.

Pros

  • compact binary encoding
  • schema/versioning
  • mature ecosystem

Cons

  • official docs describe protobuf as a good fit for typed structured messages up to a few megabytes
  • the same docs warn that large data can require loading entire messages into memory and can cause multiple copies
  • large repeated numeric arrays are not protobuf's sweet spot
  • still not especially close to the renderer's typed-array model

Fit for current pipeline

Poor for this specific payload size and shape.

Overall assessment

Not recommended for this main graph transport.

7. Custom Binary Typed-Array Envelope

Idea

Define a transport specifically around what the renderer and hover/selection pipeline need.

Example structure:

  • small fixed header or small JSON header:
    • version
    • counts
    • offsets/lengths
    • meta subset
  • then raw binary buffers:
    • vertex_ids
    • xs
    • ys
    • edge_sources
    • edge_targets
    • route_line_vertices
    • string dictionary / offsets for label and iri
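
A minimal encoder/decoder pair for one possible layout follows. The layout, field names, and alignment rules here are assumptions for illustration, not an existing format in this repo: a u32 header length, a small JSON header carrying counts, then raw float32/uint32 sections.

```typescript
// Assumed layout: [u32 headerLen][JSON header + padding][xs][ys][edges]
function encodeEnvelope(xs: Float32Array, ys: Float32Array, edges: Uint32Array): ArrayBuffer {
  const header = new TextEncoder().encode(
    JSON.stringify({ version: 1, nodes: xs.length, edges: edges.length / 2 })
  );
  const pad = (4 - (header.length % 4)) % 4; // keep binary sections 4-byte aligned
  const total = 4 + header.length + pad + xs.byteLength + ys.byteLength + edges.byteLength;
  const buf = new ArrayBuffer(total);
  new DataView(buf).setUint32(0, header.length + pad, true); // padded header length
  new Uint8Array(buf, 4, header.length).set(header);
  let off = 4 + header.length + pad;
  new Float32Array(buf, off, xs.length).set(xs); off += xs.byteLength;
  new Float32Array(buf, off, ys.length).set(ys); off += ys.byteLength;
  new Uint32Array(buf, off, edges.length).set(edges);
  return buf;
}

function decodeEnvelope(buf: ArrayBuffer) {
  const headerLen = new DataView(buf).getUint32(0, true);
  const headerText = new TextDecoder().decode(new Uint8Array(buf, 4, headerLen));
  const header = JSON.parse(headerText.replace(/\0+$/, "")); // strip padding bytes
  let off = 4 + headerLen;
  // Zero-copy views over the fetched buffer; nothing is re-parsed or re-allocated.
  const xs = new Float32Array(buf, off, header.nodes); off += header.nodes * 4;
  const ys = new Float32Array(buf, off, header.nodes); off += header.nodes * 4;
  const edgeData = new Uint32Array(buf, off, header.edges * 2);
  return { header, xs, ys, edgeData };
}
```

Because the decoder returns views over the fetched ArrayBuffer, the arrays are zero-copy and can be handed to the renderer directly or transferred to a worker.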

Pros

  • closest possible fit to current renderer
  • no schema compiler required
  • no row-object materialization
  • easiest path to zero-copy or near-zero-copy arrays on the frontend
  • easiest path to worker transfer via ArrayBuffer
  • can separate hot render data from cold metadata cleanly

Cons

  • custom protocol to design, version, validate, and document
  • less tooling/interoperability than Arrow
  • backend and frontend both need careful binary codecs

Fit for current pipeline

Excellent.

You can preserve all current behavior while only sending the data the frontend actually uses.

Overall assessment

The best performance-oriented fit if you are comfortable owning a custom format.

Comparison Table

| Option | Closeness to GPU shape | Avoids giant object graph | Supports all current frontend data | Streaming-friendly | Implementation cost | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| Current row JSON | Poor | No | Yes | Poor | Already done | Replace |
| Columnar JSON | Medium | No | Yes | Medium | Low | Good interim |
| NDJSON (chunked columnar) | Medium | Partially | Yes | Good | Medium | Situational |
| MessagePack (row-oriented) | Poor | No | Yes | Good | Medium | Not enough alone |
| MessagePack (columnar) | Medium | Partially | Yes | Good | Medium | Viable but secondary |
| Arrow IPC | Very high | Mostly | Yes | Good | Medium-high | Strong candidate |
| FlatBuffers | High | Yes | Yes | Medium | High | Good but specialized |
| Protobuf | Low-medium | No practical win here | Yes | Medium | Medium-high | Not recommended |
| Custom binary typed-array envelope | Very high | Yes | Yes | Good | High | Strongest fit |

Recommended Transport Shape

The frontend does not need node/edge objects as its primary graph transport.

The main graph payload should be modeled as:

  • vertex_ids
  • xs
  • ys
  • edge_sources
  • edge_targets
  • route_line_vertices
  • node_labels
  • node_iris
  • meta

This can be represented as:

  • columnar JSON
  • Arrow columns
  • FlatBuffers vectors
  • custom binary sections

Fields that can be removed from the frontend transport immediately

Without changing current visible behavior, the main graph transport does not need to include:

  • nodes[].termType
  • edges[].predicate_id
  • meta.predicates
  • meta.ttl_path
  • meta.sparql_endpoint
  • meta.include_bnodes
  • meta.layout_engine
  • meta.layout_root_iri
  • route_segments[].edge_index
  • route_segments[].kind

Important:

Some of those fields are still needed by the backend's internal snapshot, especially for selection queries and hierarchy layout. That argues for splitting:

  • internal snapshot model
  • frontend transport DTO

instead of continuing to reuse one struct for both.

Additional Architectural Notes

A worker is complementary, not a transport format

Web Workers can move parsing/build work off the main thread, and ArrayBuffer is transferable. That is useful, but it does not by itself solve the current over-allocation problem if the payload is still a giant row-oriented JSON document.

Workers are most valuable when paired with:

  • binary columnar payloads
  • streamed columnar chunks
  • transfer of ArrayBuffers rather than giant JS object graphs
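
The transfer semantics can be demonstrated without a worker: structuredClone accepts the same transfer list as worker postMessage. After the transfer, the source buffer is detached (zero length) and the receiving side owns the same memory without a copy.

```typescript
// Transfer an ArrayBuffer instead of copying it. The same transfer-list
// mechanism applies to worker.postMessage(value, transferList).
const xs = new Float32Array([1, 2, 3]);
const moved = structuredClone(xs, { transfer: [xs.buffer] });
// `xs` is now detached (byteLength 0); `moved` is backed by the moved buffer.
```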

The backend can keep a richer internal snapshot than it sends

This repo already caches snapshots server-side. Selection and triple queries are built from the backend snapshot and the small graphMeta values sent back by the client.

That means the frontend transport can be much slimmer than the backend snapshot representation, as long as the backend retains its richer internal data.

This is the cleanest way to avoid losing information while optimizing the frontend transport.

Final Recommendation

Best long-term option

Pick one of:

  1. Custom binary typed-array envelope
  2. Apache Arrow IPC

Reason:

  • both map naturally to the renderer's actual input model
  • both avoid the giant row-object parse path
  • both can preserve all current frontend-visible information

Best low-risk migration path

If you want an incremental step before going binary:

  1. split backend internal snapshot from frontend transport DTO
  2. move /api/graph to columnar JSON
  3. keep only the metadata fields the frontend actually uses
  4. later replace the same columnar DTO with Arrow or custom binary

That path reduces waste immediately and keeps the eventual binary migration straightforward.

Sources

Official documentation and primary sources used for the comparison: