Pipeviz

Easy, elegant lineage from a single .json

An open source JSON spec for lineage. Declare your pipelines, get beautiful graphs.

{
  "pipelines": [{
    "name": "etl-job",
    "input_sources": ["raw_events"],
    "output_sources": ["cleaned_events"]
  }]
}
Stack Agnostic
SQL, Spark, Kafka, APIs, shell scripts. Just JSON.
Zero Dependencies
One HTML file. No backend, no build step. Host anywhere.
Federated
Each team owns their JSON. Merge with jq for the org-wide view.
Column-Level Lineage
Track field-level provenance. See where each attribute comes from.
How it works
1
Define
Write a pipeviz.json describing your pipelines and data sources
2
Load
Drop your file here, or host both files together on any static server
3
Explore
Click through the graph, trace dependencies, export DOT for other tools
Superpowers

Each pipeline declares what it reads and writes. Pipeviz stitches these into a complete graph. From there, you get some superpowers for free:

Parallelize backfills by topologically sorting nodes into stages. Independent pipelines run together.
Detect circular dependencies before they break your backfill. Cycles shown in red.
Handle diamond dependencies correctly. Each node runs only after all its parents complete.
Trace blast radius downstream from any node. Know exactly what breaks.
Export to Mermaid and embed diagrams in CI/CD, PRs, or docs.
Serve via API with the built-in Clojure webserver. Query your graph programmatically.
Build MCPs on top of the API. Let LLMs query and reason about your lineage.
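The staging and cycle-detection ideas above can be sketched in a few lines of Python (a hypothetical helper, not Pipeviz's actual code): link each pipeline to the producers of its inputs, then Kahn-sort the graph into stages whose members can run in parallel; an unresolvable remainder means a circular dependency.

```python
from collections import defaultdict

def backfill_stages(pipelines):
    """Group pipelines into stages; everything in a stage can run in parallel."""
    producers = defaultdict(list)  # datasource name -> pipelines that write it
    for p in pipelines:
        for out in p.get("output_sources", []):
            producers[out].append(p["name"])

    # A pipeline depends on whichever pipelines produce its inputs.
    deps = {p["name"]: {up for src in p.get("input_sources", [])
                        for up in producers[src]}
            for p in pipelines}

    stages, done = [], set()
    while deps:
        ready = sorted(n for n, d in deps.items() if d <= done)
        if not ready:
            raise ValueError(f"cycle among: {sorted(deps)}")  # circular dependency
        stages.append(ready)
        done.update(ready)
        for n in ready:
            del deps[n]
    return stages

config = {"pipelines": [
    {"name": "ingest", "output_sources": ["raw"]},
    {"name": "clean", "input_sources": ["raw"], "output_sources": ["cleaned"]},
    {"name": "metrics", "input_sources": ["cleaned"]},
    {"name": "audit", "input_sources": ["raw"]},
]}
print(backfill_stages(config["pipelines"]))
# [['ingest'], ['audit', 'clean'], ['metrics']]
```

Note the diamond-safe behavior falls out for free: a node enters a stage only once every producer of its inputs is already done.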
Why

Other solutions ask for a lot. OpenLineage/Marquez need agents, a metadata store, and scheduler integration. Atlas wants a full governance platform. dbt couples you to its framework. Pipeviz asks for one JSON file.

Why Clojure

Pipeviz is written in Clojure/ClojureScript, a Lisp that runs on the JVM and compiles to JavaScript, for two main reasons:

• Code-as-data. Like Clojure, Pipeviz espouses the code-as-data philosophy: 100 functions operating on one data structure vs 10 functions on 10 structures. This style is well suited to spec validation, transformation, and extensibility.
• No required persistence layer. Your lineage graph is just an EDN map in memory. Even when running Pipeviz as a server, you can REPL into a live instance, inspect state, redefine functions on the fly, and add your own hooks and validations that update your graph, all without a database.
Tip: Auto-load a config with ?url=https://yoursite.com/pipeviz.json

Merging Pipeviz Configs

When your data platform spans multiple teams or repositories, you can maintain separate config files and merge them into a single Pipeviz config using jq.

Basic Merge (Two Files)

Combine pipelines and datasources from two config files:

jq -s '{
  pipelines: (.[0].pipelines + .[1].pipelines),
  datasources: (.[0].datasources + .[1].datasources)
}' team_a.json team_b.json > combined.json
Merge Multiple Files

Merge all *.pipeviz.json files in a directory:

jq -s '{
  pipelines: [.[].pipelines[]] | unique_by(.name),
  datasources: [.[].datasources[]] | unique_by(.name)
}' configs/*.pipeviz.json > combined.json
Merge with Deduplication

When configs might have overlapping definitions, dedupe by name (last write wins):

jq -s '{
  pipelines: (
    [.[].pipelines[]] | group_by(.name) | map(last)
  ),
  datasources: (
    [.[].datasources[]] | group_by(.name) | map(last)
  )
}' *.json > merged.json
Smart Prefix (Only on Collision)

Only add prefixes when names actually collide:

# Find colliding names, prefix only those
jq -s '
  [
    (.[0].pipelines | map(. + {_src: "a"})),
    (.[1].pipelines | map(. + {_src: "b"}))
  ] | add
  | group_by(.name)
  | map(
      if length > 1 then
        map(.name = ._src + "_" + .name)
      else . end
    )
  | add
  | map(del(._src))
' team_a.json team_b.json
Self-Hosting Pipeviz

Run a server that serves your config as a live dashboard.

Quick Start

Run the server with your config file:

# Start server on port 3000 with your config
clj -M:server 3000 path/to/your-config.json

Then open http://localhost:3000 in your browser.

Server Options
# Default port (8080) with config
clj -M:server path/to/config.json

# Custom port
clj -M:server 3000 path/to/config.json

# Multiple configs (merged automatically)
clj -M:server 3000 team_a.json team_b.json
REPL-Driven Development

Connect to the nREPL for live exploration and extension:

# Start shadow-cljs in watch mode
npx shadow-cljs watch app

# Connect to the REPL from another terminal
npx shadow-cljs cljs-repl app
The REPL gives you live access to the running app state. You can modify the graph, test analysis functions, or prototype new features without restarting.
Extending the Server

The server exposes a simple HTTP API. Add custom endpoints in src/pipeviz/server/main.clj:

;; Example: Add a custom analysis endpoint
;; (assumes clojure.data.json is required as `json`; `analyze` is your own function)
(defn custom-handler [config]
  (fn [req]
    (case (:uri req)
      "/api/custom" {:status 200
                     :headers {"Content-Type" "application/json"}
                     :body (json/write-str (analyze config))}
      nil)))
OpenAPI Spec

The server includes an OpenAPI specification at resources/openapi.json:

# Endpoints documented in the spec:
GET /api/config      # Returns the loaded Pipeviz config
GET /api/health      # Health check endpoint

Use this spec with Swagger UI or to generate API clients.

Production Deployment

Build a standalone bundle for deployment:

# Build optimized JS bundle
npx shadow-cljs release app

# Bundle into single HTML file
./scripts/bundle.sh

# Output: dist/pipeviz.html (single-file, no server needed)
The bundled pipeviz.html is a single file with all JS/CSS inlined. Perfect for hosting on GitHub Pages, S3, or embedding in internal tools.
Pipeviz JSON Spec

Only pipelines is required. Clusters and datasources are auto-created when referenced.

Root
Field        Type   Required  Description
pipelines    array  Yes       Jobs that transform or move data
datasources  array  No        Tables, files, streams, APIs
clusters     array  No        Groups for visual organization
{
  "clusters": [...],
  "pipelines": [...],
  "datasources": [...]
}
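The auto-creation rule can be sketched roughly like this (a hypothetical helper, not Pipeviz's actual implementation): any name appearing in a pipeline's input_sources or output_sources that has no explicit entry in datasources becomes a bare node.

```python
def resolve_datasources(config):
    """Return all datasources, auto-creating any that pipelines reference."""
    declared = {d["name"]: d for d in config.get("datasources", [])}
    for p in config.get("pipelines", []):
        for name in p.get("input_sources", []) + p.get("output_sources", []):
            declared.setdefault(name, {"name": name})  # bare auto-created node
    return list(declared.values())

cfg = {"pipelines": [{"name": "etl-job",
                      "input_sources": ["raw_events"],
                      "output_sources": ["cleaned_events"]}]}
print([d["name"] for d in resolve_datasources(cfg)])
# ['raw_events', 'cleaned_events']
```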
Pipeline
Field               Type    Required  Description
name                string  Yes       Unique identifier
description         string  No
input_sources       array   No        Datasources read from
output_sources      array   No        Datasources written to
upstream_pipelines  array   No        Pipeline or group names (orange edges)
cluster             string  No        Cluster name
group               string  No        Collapse into group node
schedule            string  No        Free text
tags                array   No
links               object  No        name to URL
duration            number  No        Runtime in minutes (for critical path)
cost                number  No        Cost per run (for spend analysis)
{
  "name": "user-enrichment",
  "description": "Enriches user data with events",
  "input_sources": ["raw_users", "events"],
  "output_sources": ["enriched_users"],
  "upstream_pipelines": ["data-ingestion"],
  "cluster": "user-processing",
  "group": "etl-jobs",
  "schedule": "Every 2 hours",
  "duration": 45,
  "cost": 12.50,
  "tags": ["user-data", "ml"],
  "links": {
    "airflow": "https://...",
    "docs": "https://..."
  }
}
Datasource

Auto-created when referenced. Define explicitly to add metadata.

Field        Type    Required  Description
name         string  Yes       Unique identifier
description  string  No
type         string  No        snowflake, postgres, kafka, s3...
owner        string  No
cluster      string  No
tags         array   No
metadata     object  No        Arbitrary key-value
links        object  No        name to URL
attributes   array   No        Column-level lineage
{
  "name": "raw_users",
  "description": "Raw user data from prod",
  "type": "snowflake",
  "owner": "data-team@company.com",
  "cluster": "user-processing",
  "tags": ["pii", "users"],
  "metadata": {
    "size": "2.1TB",
    "record_count": "45M"
  },
  "links": {
    "snowflake": "https://...",
    "docs": "https://..."
  }
}
Cluster

Auto-created when referenced. Define explicitly for nesting.

Field        Type    Required  Description
name         string  Yes       Unique identifier
description  string  No
parent       string  No        Parent cluster for nesting
{
  "name": "realtime",
  "description": "Real-time processing cluster",
  "parent": "order-processing"
}
Attribute Lineage

Add attributes to a datasource. Supports nesting for structs/objects. Reference upstream with source::attr or source::parent::child.

Field       Type             Required  Description
name        string           Yes       Column/field name
from        string or array  No        Upstream refs
attributes  array            No        Nested child attributes
{
  "name": "enriched_users",
  "attributes": [
    { "name": "user_id", "from": "raw_users::id" },
    {
      "name": "location",
      "from": "raw_users::address",
      "attributes": [
        { "name": "city", "from": "raw_users::address::city" },
        { "name": "zip", "from": "raw_users::address::zip" }
      ]
    }
  ]
}
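The `::` reference format and the nesting rule can be sketched in Python (hypothetical helpers for illustration, not Pipeviz's code): split a ref into its datasource and attribute path, and walk the attribute tree to flatten it into (attribute, upstream ref) edges.

```python
def parse_ref(ref):
    """Split 'source::parent::child' into (datasource, attribute path)."""
    source, *path = ref.split("::")
    return source, path

def attribute_edges(datasource):
    """Yield (qualified attribute, upstream ref) pairs, recursing into nesting."""
    def walk(attrs, prefix):
        for a in attrs:
            qualified = "::".join(prefix + [a["name"]])
            froms = a.get("from", [])
            for ref in ([froms] if isinstance(froms, str) else froms):
                yield qualified, ref
            yield from walk(a.get("attributes", []), prefix + [a["name"]])
    yield from walk(datasource.get("attributes", []), [datasource["name"]])

ds = {
    "name": "enriched_users",
    "attributes": [
        {"name": "user_id", "from": "raw_users::id"},
        {"name": "location", "from": "raw_users::address",
         "attributes": [{"name": "city", "from": "raw_users::address::city"}]},
    ],
}
for child, ref in attribute_edges(ds):
    print(child, "<-", ref)
# enriched_users::user_id <- raw_users::id
# enriched_users::location <- raw_users::address
# enriched_users::location::city <- raw_users::address::city
```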
Full Example
{
  "clusters": [
    { "name": "etl", "description": "ETL pipelines" }
  ],
  "pipelines": [
    {
      "name": "user-enrichment",
      "description": "Enriches user data with events",
      "input_sources": ["raw_users", "events"],
      "output_sources": ["enriched_users"],
      "cluster": "etl",
      "schedule": "Every 2 hours",
      "tags": ["user-data"],
      "links": { "airflow": "https://..." }
    }
  ],
  "datasources": [
    {
      "name": "raw_users",
      "type": "snowflake",
      "owner": "data-team@company.com",
      "attributes": [
        { "name": "id" },
        { "name": "first" },
        { "name": "last" }
      ]
    },
    {
      "name": "enriched_users",
      "type": "snowflake",
      "attributes": [
        { "name": "user_id", "from": "raw_users::id" },
        { "name": "full_name", "from": ["raw_users::first", "raw_users::last"] }
      ]
    }
  ]
}
Edge Types
Color     Source                          Description
Gray →    input_sources / output_sources  Data flow through datasources
Orange →  upstream_pipelines              Direct pipeline dependency
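Both edge types above derive mechanically from a config; a minimal sketch (hypothetical, not Pipeviz's implementation):

```python
def edges(config):
    """Derive graph edges from a config: gray data-flow edges via
    datasources, orange direct-dependency edges via upstream_pipelines."""
    out = []
    for p in config.get("pipelines", []):
        for src in p.get("input_sources", []):
            out.append(("gray", src, p["name"]))    # datasource -> pipeline
        for dst in p.get("output_sources", []):
            out.append(("gray", p["name"], dst))    # pipeline -> datasource
        for up in p.get("upstream_pipelines", []):
            out.append(("orange", up, p["name"]))   # upstream pipeline -> pipeline
    return out

cfg = {"pipelines": [
    {"name": "enrich", "input_sources": ["raw_users"],
     "output_sources": ["enriched_users"], "upstream_pipelines": ["ingest"]},
]}
print(edges(cfg))
# [('gray', 'raw_users', 'enrich'), ('gray', 'enrich', 'enriched_users'), ('orange', 'ingest', 'enrich')]
```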