Agent Instructions

Project Overview

DataBuild is a Bazel-based data build system. Key files:

  • DESIGN.md - Overall design of databuild
  • databuild.proto - System interfaces
  • Component designs - design docs for specific aspects or components of databuild:
    • Core build - How the core semantics of databuild work and are implemented
    • Build event log - How the build event log works and is accessed
    • Service - How the databuild HTTP service and web app are designed.
    • Glossary - Centralized description of key terms.
    • Graph specification - Describes the different libraries that enable more succinct declaration of databuild applications than the core Bazel-based interface.
    • Observability - How observability is systematically achieved throughout databuild applications.
    • Deploy strategies - Different strategies for deploying databuild applications.
    • Wants - How triggering works in databuild applications.
    • Why databuild? - Why to choose databuild over other, better-established orchestration solutions.

Please reference these for any related work, as they indicate key technical bias/direction of the project.

Tenets

  • Declarative over imperative wherever possible/reasonable.
  • We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
  • Do not add "unknown" results when parses or matches fail - these should always throw.
  • Compile-time correctness is a super-power, and investment in it speeds up the flywheel for development and user value.
  • CLI/Service Interchangeability: Both the CLI and service must produce identical artifacts (BEL events, logs, metrics, outputs) in the same locations. Users should be able to build with one interface and query/inspect results from the other seamlessly. This principle applies to all DataBuild operations, not just builds.
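
The "no unknown results" tenet can be illustrated with a minimal Python sketch. The `JobStatus` enum and its values are hypothetical and are not taken from databuild.proto; the point is only that a failed parse raises instead of returning a sentinel value.

```python
from enum import Enum


class JobStatus(Enum):
    # Hypothetical statuses, for illustration only
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def parse_status(raw: str) -> JobStatus:
    """Parse a status string, raising on anything unrecognized."""
    try:
        return JobStatus(raw)
    except ValueError:
        # Deliberately no JobStatus.UNKNOWN fallback: a failed parse is an error
        raise ValueError(f"Unrecognized job status: {raw!r}") from None
```

Callers then either get a valid `JobStatus` or an exception, so an unrecognized value cannot propagate silently.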

Build & Test

# Build all databuild components
bazel build //...

# Run databuild unit tests
bazel test //...

# Run end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.

Project Structure

  • databuild/ - Core system (Rust/Proto)
  • examples/ - Example implementations
  • scripts/ - Build utilities

Key Components

  • Graph analysis/execution in Rust
  • Bazel rules for job orchestration
  • Java/Python examples for different use cases

DataBuild Job Architecture

Job Target Structure

Each DataBuild job creates three Bazel targets:

  • job_name.cfg - Configuration target (calls binary with "config" subcommand)
  • job_name.exec - Execution target (calls binary with "exec" subcommand)
  • job_name - Main job target (pipes config output to exec input)

Unified Job Binary Pattern

Jobs use a single binary with subcommands:

import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])    # Perform actual work
    else:
        raise ValueError(f"Unknown command: {command}")  # Fail loudly on unknown subcommands
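
As a fuller illustration of the pattern, the sketch below fills in the two handlers. It assumes the configuration is a JSON object passed from "config" to "exec" over stdout/stdin and that it carries an "args" list. The actual schema is defined in databuild.proto, so treat the field layout, the `out`/`inp` parameters, and the `run_piped` helper as hypothetical.

```python
import io
import json
import sys


def handle_config(args, out=None):
    """Write a job configuration as JSON (hypothetical shape)."""
    if out is None:
        out = sys.stdout
    json.dump({"args": list(args)}, out)


def handle_exec(args, inp=None):
    """Read the configuration produced by handle_config and act on it."""
    if inp is None:
        inp = sys.stdin
    config = json.load(inp)
    if not config["args"]:
        # Fail loudly rather than silently doing nothing
        raise ValueError("Job config has empty args")
    return config["args"]  # Stand-in for the actual work


def run_piped(args):
    """Simulate the main job target: pipe config output into exec input."""
    pipe = io.StringIO()
    handle_config(args, out=pipe)
    pipe.seek(0)
    return handle_exec([], inp=pipe)
```

Passing the streams as parameters keeps both handlers testable without spawning a process, while the defaults preserve the stdout/stdin behavior a real binary would use.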

DataBuild Execution Flow

  1. Planning Phase: DataBuild calls .cfg targets to get job configurations
  2. Execution Phase: DataBuild calls main job targets which pipe config to exec
  3. Job Resolution: Job lookup returns base job names (e.g., //:job_name), not .cfg variants
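
The steps above can be sketched roughly as follows. This is a simplified model, not DataBuild's actual planner: `plan`, `cfg_target`, and the `lookup` callable are hypothetical names. The only facts taken from this document are that lookup returns base job labels and that the configuration target is the base label with a `.cfg` suffix.

```python
from collections import defaultdict


def cfg_target(job_label: str) -> str:
    """Derive the configuration target from a base job label."""
    return f"{job_label}.cfg"


def plan(partition_refs, lookup):
    """Group requested partition refs by the base job that builds them."""
    jobs = defaultdict(list)
    for ref in partition_refs:
        job_label = lookup(ref)  # Expected to raise if no job matches
        jobs[job_label].append(ref)
    return dict(jobs)
```

For example, `plan(["a/1", "a/2"], lambda ref: "//:job_name")` groups both refs under `//:job_name`, and `cfg_target("//:job_name")` gives the label that would be invoked during planning.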

Graph Configuration

databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)

Job Lookup Pattern

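import re

pattern = re.compile(r"^example/.+$")  # Hypothetical pattern; substitute your partition ref format

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")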
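
When a graph has several jobs, the same pattern extends to a table of routes. The sketch below assumes regex-based routing; the partition ref formats and job labels are made up for illustration.

```python
import re

# Hypothetical routes: (partition ref pattern, base job label)
ROUTES = [
    (re.compile(r"^raw/\d{4}-\d{2}-\d{2}$"), "//:ingest_job"),
    (re.compile(r"^daily_summary/\d{4}-\d{2}-\d{2}$"), "//:summary_job"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    for pattern, job_label in ROUTES:
        if pattern.match(partition_ref):
            return job_label  # Base job target, never the .cfg variant
    # No "unknown" fallback: an unroutable ref is an error
    raise ValueError(f"No job found for: {partition_ref}")
```

Keeping the routes in a single list makes it easy to see at a glance which refs each job owns, and the final `raise` keeps the lookup consistent with the tenets.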

Common Pitfalls

  • Not using protobuf-defined interfaces: Where structs and interfaces are defined centrally in databuild.proto, those interfaces should always be used, e.g., in Rust via the prost-generated structs, and in the web app via the OpenAPI-generated TypeScript interfaces.
  • Empty args: Jobs with "args": [] won't execute properly
  • Wrong target refs: Job lookup must return base targets, not .cfg variants
  • Missing partition refs: All outputs must be addressable via partition references
  • Not adding new generated files to OpenAPI outs: Bazel hermeticity demands that we specify each output file, so when the OpenAPI code gen would create new files, we need to explicitly add them to the target's outs field.
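
Two of these pitfalls are easy to check mechanically. The helpers below are a hypothetical sketch, not part of DataBuild; they assume a job config is a dict with an "args" key and that job labels are plain strings.

```python
def check_job_config(config: dict) -> dict:
    """Reject configs whose args are missing or empty."""
    if not config.get("args"):
        raise ValueError("Job config must have non-empty 'args'")
    return config


def check_job_label(label: str) -> str:
    """Reject lookup results that point at .cfg or .exec variants."""
    if label.endswith((".cfg", ".exec")):
        raise ValueError(f"Expected a base job target, got: {label}")
    return label
```

Both helpers return their input on success, so they can wrap existing calls, for example `check_job_label(lookup_job_for_partition(ref))`.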

Notes / Tips

  • Rust dependencies are managed via rules_rust, so new dependencies should be added in the MODULE.bazel file.

Documentation

We use plans / designs in the plans directory to anchor most large-scale efforts. We create plans that are good bets, though not necessarily exhaustive, then (and this is critical) we update them after the work is completed, or after significant progress towards completion.