
Claude Instructions

Project Overview

DataBuild is a Bazel-based data build system.

Build & Test

# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Run all core unit tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.

End-to-End Testing

The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:

Test Suite Structure

  • tests/end_to_end/simple_test.sh - Basic CLI vs Service validation
  • tests/end_to_end/podcast_simple_test.sh - Podcast reviews CLI vs Service validation
  • tests/end_to_end/basic_graph_test.sh - Comprehensive basic graph testing
  • tests/end_to_end/podcast_reviews_test.sh - Comprehensive podcast testing

Event Validation

Tests ensure CLI and Service emit identical build events:

  • Build request events: Orchestration lifecycle (received, planning, executing, completed)
  • Job events: Job execution tracking
  • Partition events: Partition build status

CLI vs Service Event Alignment

Recent improvements ensure both paths emit identical events:

  • CLI: Enhanced with orchestration events to match Service behavior
  • Service: HTTP API orchestration events + core build events
  • Validation: Tests fail if event counts or types differ between CLI and Service
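
For reference, a minimal sketch in Python of what this validation amounts to, assuming events are recorded as JSON lines with a "type" field (the file names and event format here are assumptions, not taken from the test scripts):

import json
from collections import Counter
from pathlib import Path

def event_type_counts(path: str) -> Counter:
    # Count event types in a hypothetical JSON-lines event log.
    counts = Counter()
    for line in Path(path).read_text().splitlines():
        if line.strip():
            counts[json.loads(line)["type"]] += 1
    return counts

# The e2e tests fail when CLI and Service logs differ in event counts or types.
assert event_type_counts("cli_events.jsonl") == event_type_counts("service_events.jsonl")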

Running Individual Tests

# Test basic graph
tests/end_to_end/simple_test.sh \
  examples/basic_graph/bazel-bin/basic_graph.build \
  examples/basic_graph/bazel-bin/basic_graph.service

# Test podcast reviews (run from correct directory)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
  bazel-bin/podcast_reviews_graph.build \
  bazel-bin/podcast_reviews_graph.service

Project Structure

  • databuild/ - Core system (Rust/Proto)
  • examples/ - Example implementations
  • scripts/ - Build utilities

Key Components

  • Graph analysis/execution in Rust
  • Bazel rules for job orchestration
  • Java/Python examples for different use cases

DataBuild Job Architecture

Job Target Structure

Each DataBuild job creates three Bazel targets:

  • job_name.cfg - Configuration target (calls binary with "config" subcommand)
  • job_name.exec - Execution target (calls binary with "exec" subcommand)
  • job_name - Main job target (pipes config output to exec input)
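
Conceptually, the main job target pipes the config subcommand's output into the exec subcommand. A rough Python illustration of that wiring (not the actual Bazel rule implementation; the binary path is a placeholder):

import subprocess

# Run the "config" subcommand and capture its JSON output.
cfg = subprocess.run(["./job_binary", "config"], capture_output=True, check=True)
# Feed that output to the "exec" subcommand, which performs the real work.
subprocess.run(["./job_binary", "exec"], input=cfg.stdout, check=True)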

Unified Job Binary Pattern

Jobs use a single binary with subcommands:

import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])    # Perform the actual work

Job Configuration Requirements

CRITICAL: Job configs must include non-empty args for execution to work:

config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}

Jobs with "args": [] will only have their config step invoked during execution; the exec step never runs.
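
A minimal handle_config sketch that satisfies this requirement by printing the configuration JSON to stdout (the partition ref literal is illustrative; real jobs usually derive it from args or the environment):

import json

def handle_config(args):
    partition_ref = "my_dataset/2024-01-01"  # illustrative value
    config = {
        "configs": [{
            "outputs": [{"str": partition_ref}],
            "inputs": [],
            "args": ["some_arg"],  # non-empty, so the exec step runs
            "env": {"PARTITION_REF": partition_ref},
        }]
    }
    print(json.dumps(config))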

DataBuild Execution Flow

  1. Planning Phase: DataBuild calls .cfg targets to get job configurations
  2. Execution Phase: DataBuild calls main job targets which pipe config to exec
  3. Job Resolution: Job lookup returns base job names (e.g., //:job_name), not .cfg variants
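
A rough pseudocode sketch of this flow in Python (the real orchestration lives in the Rust core; the helper functions here are illustrative, not actual APIs):

# Planning phase: resolve each requested partition to a job and collect its config.
def plan(partition_refs):
    plans = []
    for ref in partition_refs:
        job = lookup_job_for_partition(ref)       # base target, e.g. "//:job_name"
        plans.append((job, run_cfg_target(job)))  # invokes the job's .cfg target
    return plans

# Execution phase: run each main job target, which pipes config into exec.
def execute(plans):
    for job, config in plans:
        run_job_target(job, config)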

Graph Configuration

databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)

Job Lookup Pattern

import re

pattern = re.compile(r"^my_dataset/")  # example pattern; match your partition ref scheme

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")

Common Pitfalls

  • Empty args: Jobs with "args": [] won't execute properly
  • Wrong target refs: Job lookup must return base targets, not .cfg variants
  • Missing partition refs: All outputs must be addressable via partition references

Documentation

We use plans and designs in the plans directory to anchor most large-scale efforts. Plans should be good bets rather than exhaustive specifications, and (this is critical) we update them after the work is completed or after significant progress toward completion.