# Claude Instructions

## Project Overview
DataBuild is a Bazel-based data build system. Key files:

- `DESIGN.md` - Overall design of databuild
- `databuild.proto` - System interfaces
- Component designs - design docs for specific aspects or components of databuild:
  - Core build - How the core semantics of databuild work and are implemented
  - Build event log - How the build event log works and is accessed
  - Service - How the databuild HTTP service and web app are designed
  - Glossary - Centralized description of key terms
  - Graph specification - Describes the libraries that enable more succinct declaration of databuild applications than the core Bazel-based interface
  - Observability - How observability is systematically achieved throughout databuild applications
  - Deploy strategies - Different strategies for deploying databuild applications
  - Triggers - How triggering works in databuild applications
  - Why databuild? - Why to choose databuild over more established orchestration solutions
Please reference these for any related work, as they indicate key technical bias/direction of the project.
## Tenets
- Declarative over imperative wherever possible/reasonable.
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
- Do not add "unknown" results when parses or matches fail - these should always throw.
- Compile-time correctness is a superpower, and investing in it speeds up the flywheel of development and user value.
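The "always throw" tenet can be sketched with a hypothetical partition-ref parser (the function name and ref format are illustrative, not from the codebase):

```python
def parse_partition_ref(ref: str) -> tuple[str, str]:
    """Split a ref like "dataset/partition" into its parts, or raise."""
    parts = ref.split("/")
    if len(parts) != 2 or not all(parts):
        # No "unknown" sentinel on failure; fail loudly instead.
        raise ValueError(f"Malformed partition ref: {ref!r}")
    return parts[0], parts[1]
```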
## Build & Test

```shell
# Build all databuild components
bazel build //...

# Run databuild unit tests
bazel test //...

# Run end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```
## Project Structure

- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities
## Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases
## DataBuild Job Architecture

### Job Target Structure

Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls the binary with the "config" subcommand)
- `job_name.exec` - Execution target (calls the binary with the "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)
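The "pipes config output to exec input" wiring can be pictured as a shell pipe between the two subtargets. A runnable stand-in using `subprocess` (the real main target invokes Bazel-built binaries; the inline programs here are illustrative stand-ins):

```python
import subprocess
import sys

# Stand-in for `job_name.cfg`: emits a config JSON on stdout.
cfg = subprocess.run(
    [sys.executable, "-c", "print('{\"args\": []}')"],
    capture_output=True, text=True, check=True,
)

# Stand-in for `job_name.exec`: reads the config from stdin.
exec_proc = subprocess.run(
    [sys.executable, "-c", "import sys; print('got ' + sys.stdin.read().strip())"],
    input=cfg.stdout, capture_output=True, text=True, check=True,
)
print(exec_proc.stdout.strip())  # got {"args": []}
```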
### Unified Job Binary Pattern

Jobs use a single binary with subcommands:

```python
import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
    else:
        raise ValueError(f"Unknown command: {command}")  # fail loudly, per the tenets
```
### DataBuild Execution Flow

- Planning Phase: DataBuild calls `.cfg` targets to get job configurations
- Execution Phase: DataBuild calls the main job targets, which pipe config to exec
- Job Resolution: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants
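The two phases above can be sketched in-process, with stand-in functions for the `.cfg` and main targets (the config shape and values here are assumptions for illustration, not the actual protocol):

```python
def run_cfg(job: str) -> dict:
    # Planning phase: ask the job's .cfg target for its configuration.
    return {"job": job, "args": ["--date", "2024-01-01"]}  # hypothetical shape

def run_exec(config: dict) -> str:
    # Execution phase: the main target pipes the config into exec.
    return f"ran {config['job']} with args {config['args']}"

result = run_exec(run_cfg("//:job_name"))
```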
### Graph Configuration

```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)
```
### Job Lookup Pattern

```python
import re

# Example pattern only: match whatever partition-ref format your jobs produce.
pattern = re.compile(r"^my_dataset/")

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return the base job target
    raise ValueError(f"No job found for: {partition_ref}")
```
## Common Pitfalls

- Empty args: Jobs with `"args": []` won't execute properly
- Wrong target refs: Job lookup must return base targets, not `.cfg` variants
- Missing partition refs: All outputs must be addressable via partition references
- Missing OpenAPI outs: Bazel hermeticity demands that we declare every output file, so when the OpenAPI code gen creates new files, we must explicitly add them to the target's `outs` field
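A cheap guard against the "wrong target refs" pitfall, e.g. in tests around a job lookup (the helper name is hypothetical, not from the codebase):

```python
def assert_base_target(target: str) -> str:
    """Reject .cfg/.exec variants; lookups must return base job targets."""
    if target.endswith((".cfg", ".exec")):
        raise ValueError(f"Lookup must return base targets, got: {target}")
    return target
```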
## Documentation

We use plans/designs in the `plans` directory to anchor most large-scale efforts. We create plans that are good bets, though not necessarily exhaustive, and then (and this is critical) we update them after the work is completed, or after significant progress toward completion.