databuild

Stuart Axelbrooke 31db6a00cb add tests		2025-11-25 14:03:38 +08:00
.claude/skills/databuild-build-state-semantics	update skill name and description	2025-11-25 11:30:13 +08:00
.forgejo/workflows	Add CI	2025-05-03 20:53:44 -07:00
databuild	add tests	2025-11-25 14:03:38 +08:00
docs	update partitions refactor plan	2025-11-25 10:28:29 +08:00
examples	implemented api phase 7 - running application	2025-11-22 23:05:46 +08:00
scripts	big bump	2025-11-16 22:21:56 -08:00
tests/end_to_end	disable e2e tests	2025-11-23 11:18:17 +08:00
tools/build_rules	Generate stuff to make intellij happy	2025-07-19 16:02:57 -07:00
.bazelignore	Add .bazelignore	2025-04-21 21:40:30 -07:00
.bazelrc	Builds passing	2025-07-10 21:39:43 -07:00
.bazelversion	WIP I guess	2025-10-11 11:13:27 -07:00
.envrc	commit	2025-06-29 19:28:46 -07:00
.gitignore	Implement partitions typestate state machine	2025-11-22 09:53:56 +08:00
AGENTS.md	update AGENTS.md	2025-11-24 19:55:15 +08:00
BUILD.bazel	disable e2e tests	2025-11-23 11:18:17 +08:00
CLAUDE.md	lets go	2025-09-03 21:32:17 -07:00
DESIGN.md	A lot of refactoring	2025-09-27 15:29:22 -07:00
GEMINI.md	Update agent instructions, add symlinked gemini.md	2025-07-26 22:49:19 -07:00
MODULE.bazel	implement refactor	2025-11-25 13:31:31 +08:00
MODULE.bazel.lock	implement refactor	2025-11-25 13:31:31 +08:00
README.md	A lot of refactoring	2025-09-27 15:29:22 -07:00
requirements.in	Add python protobuf dataclass generation	2025-07-30 20:54:36 -07:00
requirements_lock.txt	python 3.13 -> 3.12	2025-08-06 16:47:28 -07:00
run_e2e_tests.sh	Remove basic graph from e2e test	2025-07-27 00:10:59 -07:00

README.md

          ██████╗       ████╗   ███████████╗  ████╗
         ██╔═══██╗     ██╔██║   ╚═══██╔════╝ ██╔██║
        ██╔╝   ██║    ██╔╝██║      ██╔╝     ██╔╝██║
       ██╔╝    ██║   ██╔╝ ██║     ██╔╝     ██╔╝ ██║
      ██╔╝    ██╔╝  ██╔╝  ██║    ██╔╝     ██╔╝  ██║
     ██╔╝   ██╔═╝  █████████║   ██╔╝     █████████║
    ████████╔═╝   ██╔═════██║  ██╔╝     ██╔═════██║
    ╚═══════╝     ╚═╝     ╚═╝  ╚═╝      ╚═╝     ╚═╝

      ██████╗     ██╗   ██╗   ██╗   ██╗       █████╗
     ██╔═══██╗   ██╔╝  ██╔╝  ██╔╝  ██╔╝      ██╔══██╗
    ██╔╝   ██║  ██╔╝  ██╔╝  ██╔╝  ██╔╝      ██╔╝  ██║
   █████████╔╝ ██╔╝  ██╔╝  ██╔╝  ██╔╝      ██╔╝   ██║
  ██╔═══██╔═╝ ██╔╝  ██╔╝  ██╔╝  ██╔╝      ██╔╝   ██╔╝
 ██╔╝   ██║  ██╔╝  ██╔╝  ██╔╝  ██╔╝      ██╔╝  ██╔═╝
█████████╔╝  ██████╔═╝  ██╔╝  ████████╗ ███████╔═╝
╚════════╝   ╚═════╝    ╚═╝   ╚═══════╝ ╚══════╝

     -  - --  D E C L A R A T I V E  -- -  -
     -  - --  P A R T I T I O N E D  -- -  -
     -  - --  D A T A   B U I L D S  -- -  -

DataBuild is a trivially-deployable, partition-oriented, declarative data build system.

DataBuild is for teams at data-driven orgs who need reliable, flexible, and correct data pipelines and are tired of manually orchestrating complex dependency graphs. You define Jobs (which take input data partitions and produce output partitions) and compose them into Graphs (partition dependency networks); DataBuild handles the rest. Ask it to build a partition, and DataBuild resolves the jobs that need to run, plans execution order, runs builds concurrently, and tracks and exposes build progress. Instead of writing orchestration code that breaks when dependencies change, you focus on the data transformations while DataBuild ensures your pipelines are correct, observable, and reliable.

For important context, check out DESIGN.md, along with designs in design/. Also, check out databuild.proto for key system interfaces. Key features:

Declarative dependencies - Ask for data, get data. Define partition dependencies and DataBuild automatically plans what jobs to run and when.
Partition-first design - Build only what's needed. Late data arrivals and partial rebuilds work seamlessly with atomic data partitions.
Deploy anywhere - One binary, any platform. Bazel-based builds create hermetic applications that run locally, in containers, or in the cloud.
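
To make the "ask for data, get data" idea concrete, here is a minimal sketch of how a requested partition could be resolved into an ordered list of jobs. It is an illustration only, not DataBuild's actual planner: the `Job` struct, the `job` helper, the `plan` and `visit` functions, and the partition names are all hypothetical, and each job is assumed to produce exactly one partition.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical job description: the partitions it reads and the one it writes.
#[derive(Debug, Clone)]
pub struct Job {
    pub name: String,
    pub inputs: Vec<String>,
    pub output: String,
}

/// Convenience constructor for the sketch.
pub fn job(name: &str, inputs: &[&str], output: &str) -> Job {
    Job {
        name: name.to_string(),
        inputs: inputs.iter().map(|s| s.to_string()).collect(),
        output: output.to_string(),
    }
}

/// Returns the names of the jobs needed to build `wanted`, ordered so that
/// every job comes after the jobs that produce its inputs. Partitions in
/// `existing` are treated as already built and are not rebuilt.
pub fn plan(
    jobs: &[Job],
    existing: &HashSet<String>,
    wanted: &str,
) -> Result<Vec<String>, String> {
    // Index jobs by the partition they produce.
    let producers: HashMap<&str, &Job> =
        jobs.iter().map(|j| (j.output.as_str(), j)).collect();
    let mut planned: HashSet<String> = HashSet::new();
    let mut visiting: HashSet<String> = HashSet::new();
    let mut order: Vec<String> = Vec::new();
    visit(wanted, &producers, existing, &mut planned, &mut visiting, &mut order)?;
    Ok(order)
}

/// Depth-first walk over partition dependencies (post-order = build order).
fn visit(
    partition: &str,
    producers: &HashMap<&str, &Job>,
    existing: &HashSet<String>,
    planned: &mut HashSet<String>,
    visiting: &mut HashSet<String>,
    order: &mut Vec<String>,
) -> Result<(), String> {
    if existing.contains(partition) || planned.contains(partition) {
        return Ok(());
    }
    // Seeing a partition again while it is still being resolved means a cycle.
    if !visiting.insert(partition.to_string()) {
        return Err(format!("dependency cycle at partition {}", partition));
    }
    let producer = producers
        .get(partition)
        .ok_or_else(|| format!("no job produces partition {}", partition))?;
    for input in &producer.inputs {
        visit(input, producers, existing, planned, visiting, order)?;
    }
    visiting.remove(partition);
    planned.insert(partition.to_string());
    order.push(producer.name.clone());
    Ok(())
}
```

Given three chained jobs `extract`, `clean`, and `report`, calling `plan` for the report partition with nothing built returns the jobs in that order. If the raw partition is already listed in `existing`, `extract` is skipped, which is the "build only what's needed" behavior described above.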

Usage

Graph Description Methods

Bazel targets: The foundational method.
Python DSL: A more succinct method with partition patterns and decorator-based auto graph wiring. Example usage.

Examples

Test app: color votes
- Bazel graph description example
- Python DSL description example
See the podcast example BUILD file.

Ways to Use DataBuild in Production

As a CLI build tool: You can run DataBuild builds from the command line or in a remote environment - no build event log required!
As a standalone service: Similar to Dagster or Airflow, you can run a persistent service that you send build requests to, and which serves an API and web dashboard.
As a cloud-native containerized build tool: Build containers from your graphs and launch scheduled builds using a container service like ECS, or even your own Kubernetes cluster.

Development

IntelliJ

Run these so that IntelliJ can understand the Rust source:

# Generate a Cargo.toml file so IntelliJ can resolve the Rust sources
python3 scripts/generate_cargo_toml.py
# Generate a gitignored Rust file representing the protobuf interfaces
scripts/generate_proto_for_ide.sh

Compiling

bazel build //...

Bullet-proof compile-time correctness is essential for production reliability. Backend protobuf changes must cause predictable frontend compilation failures, preventing runtime errors. Our three-pronged approach ensures this:

Complete Type Chain: Proto → Rust → OpenAPI → TypeScript → Components
- Each step uses generated types, maintaining accuracy across the entire pipeline
- Breaking changes at any layer cause compilation failures in dependent layers
Consistent Data Transformation: Service boundary layer transforms API responses to dashboard types
- Canonical frontend interfaces isolated from backend implementation details
- Transformations handle protobuf nullability and normalize data shapes
- Components never directly access generated API types
Strict TypeScript Configuration: Enforces explicit null handling and prevents implicit any types
- strictNullChecks catches undefined property access patterns
- noImplicitAny surfaces type safety gaps
- Runtime type errors become compile-time failures

This system guarantees that backend interface changes are caught during TypeScript compilation, not in production.

Testing

DataBuild core testing:

bazel test //...

End to end testing:

./run_e2e_tests.sh

Test Strategy

Where possible, we make invalid states unrepresentable via Rust's type system. Where that is not possible, we prefer property testing, with a handful of bespoke tests to capture critical edge cases or important behaviors.
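
As an illustration of "invalid states unrepresentable", the following is a minimal typestate sketch. It is not taken from the DataBuild source: the state names (`Missing`, `Building`, `Live`), their fields, and the transition methods are assumptions chosen for the example. The point is that each transition consumes the old value, so calling a method in the wrong state is a compile error instead of a runtime check.

```rust
/// Hypothetical lifecycle states for a partition (names are assumptions).
pub struct Missing;
pub struct Building {
    pub job_run_id: u64,
}
pub struct Live {
    pub built_by: u64,
}

/// A partition whose current state is part of its type.
pub struct Partition<S> {
    pub name: String,
    pub state: S,
}

impl Partition<Missing> {
    pub fn new(name: &str) -> Self {
        Partition { name: name.to_string(), state: Missing }
    }

    /// Only a missing partition can start building.
    pub fn start_build(self, job_run_id: u64) -> Partition<Building> {
        Partition { name: self.name, state: Building { job_run_id } }
    }
}

impl Partition<Building> {
    /// A successful build yields a live partition that records its builder.
    pub fn succeed(self) -> Partition<Live> {
        let built_by = self.state.job_run_id;
        Partition { name: self.name, state: Live { built_by } }
    }

    /// A failed build returns the partition to the missing state.
    pub fn fail(self) -> Partition<Missing> {
        Partition { name: self.name, state: Missing }
    }
}

impl Partition<Live> {
    /// Only live partitions expose which job run built them.
    pub fn built_by(&self) -> u64 {
        self.state.built_by
    }
}
```

With these definitions, `Partition::<Missing>::new("votes/2025-01-01").start_build(7).succeed().built_by()` compiles, while calling `succeed()` on a `Partition<Missing>` or `built_by()` on a `Partition<Building>` does not, because those methods do not exist for those types. Behaviors that cannot be encoded this way are the ones left to property tests.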