databuild/CLAUDE.md
Stuart Axelbrooke b27a249b09
Add testing details
2025-07-07 22:42:59 -07:00


Claude Instructions

Project Overview

DataBuild is a Bazel-based data build system. Key files:

Build & Test

# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Run all core unit tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not run `bazel test //examples/basic_graph/...`; it will not work.

End-to-End Testing

The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:

Test Suite Structure

  • tests/end_to_end/simple_test.sh - Basic CLI vs Service validation
  • tests/end_to_end/podcast_simple_test.sh - Podcast reviews CLI vs Service validation
  • tests/end_to_end/basic_graph_test.sh - Comprehensive basic graph testing
  • tests/end_to_end/podcast_reviews_test.sh - Comprehensive podcast testing

Event Validation

Tests ensure CLI and Service emit identical build events:

  • Build request events: Orchestration lifecycle (received, planning, executing, completed)
  • Job events: Job execution tracking
  • Partition events: Partition build status

CLI vs Service Event Alignment

Recent improvements ensure both paths emit identical events:

  • CLI: Enhanced with orchestration events to match Service behavior
  • Service: HTTP API orchestration events + core build events
  • Validation: Tests fail if event counts or types differ between CLI and Service
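
A minimal sketch of the count-and-type check described above, assuming each build event can be reduced to a dict with a `type` key. The field name, the sample type names, and `diff_event_counts` are illustrative only; they are not DataBuild's actual event schema or test code.

```python
from collections import Counter


def diff_event_counts(cli_events, service_events):
    """Return {event_type: (cli_count, service_count)} for every mismatch."""
    cli = Counter(e["type"] for e in cli_events)
    service = Counter(e["type"] for e in service_events)
    return {
        t: (cli[t], service[t])  # a Counter yields 0 for a missing type
        for t in sorted(set(cli) | set(service))
        if cli[t] != service[t]
    }


# Hypothetical event streams; the type names are placeholders.
cli_events = [{"type": "build_request"}, {"type": "job"}, {"type": "partition"}]
service_events = [{"type": "build_request"}, {"type": "job"}]

mismatches = diff_event_counts(cli_events, service_events)
print(mismatches)  # an empty dict would mean the two paths agree
```

A test in this style would fail whenever the returned dict is non-empty.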

Running Individual Tests

# Test basic graph
tests/end_to_end/simple_test.sh \
  examples/basic_graph/bazel-bin/basic_graph.build \
  examples/basic_graph/bazel-bin/basic_graph.service

# Test podcast reviews (must be run from examples/podcast_reviews)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
  bazel-bin/podcast_reviews_graph.build \
  bazel-bin/podcast_reviews_graph.service

Project Structure

  • databuild/ - Core system (Rust/Proto)
  • examples/ - Example implementations
  • scripts/ - Build utilities

Key Components

  • Graph analysis/execution in Rust
  • Bazel rules for job orchestration
  • Java/Python examples for different use cases

DataBuild Job Architecture

Job Target Structure

Each DataBuild job creates three Bazel targets:

  • job_name.cfg - Configuration target (calls binary with "config" subcommand)
  • job_name.exec - Execution target (calls binary with "exec" subcommand)
  • job_name - Main job target (pipes config output to exec input)

Unified Job Binary Pattern

Jobs use a single binary with subcommands:

import sys

def main():
    # The first argument selects the subcommand: "config" or "exec".
    command = sys.argv[1]
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])    # Perform actual work
    else:
        sys.exit(f"Unknown command: {command}")

Job Configuration Requirements

CRITICAL: Job configs must include non-empty args for execution to work:

config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}

For jobs with "args": [], only the config handler is called during execution; the exec handler never runs.
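
The binary pattern and the config requirement fit together roughly as below. This is a sketch under assumptions, not a reference job: it assumes the partition ref arrives as the first argument after the subcommand, that the job has no inputs, and that exec reads `PARTITION_REF` from its environment. The partition ref value is made up.

```python
import json
import os


def handle_config(argv):
    """Build the job configuration for the requested partition ref."""
    partition_ref = argv[0]
    return {
        "configs": [{
            "outputs": [{"str": partition_ref}],
            "inputs": [],             # assumption: no upstream inputs
            "args": [partition_ref],  # non-empty, so exec is invoked
            "env": {"PARTITION_REF": partition_ref},
        }]
    }


def handle_exec(argv):
    """Stand-in for the real work: report what would be built."""
    ref = os.environ.get("PARTITION_REF", argv[0] if argv else "<unknown>")
    return f"building {ref} with args {argv}"


def main(argv):
    command, rest = argv[1], argv[2:]
    if command == "config":
        print(json.dumps(handle_config(rest)))
    elif command == "exec":
        print(handle_exec(rest))
    else:
        raise SystemExit(f"unknown command: {command!r}")


# A real job binary would call main(sys.argv); here the call is simulated.
main(["job", "config", "reviews/2024-01-01"])
```

Passing the partition ref through `args` is one easy way to keep the list non-empty when a job needs no other arguments.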

DataBuild Execution Flow

  1. Planning Phase: DataBuild calls .cfg targets to get job configurations
  2. Execution Phase: DataBuild calls main job targets, which pipe config output to exec input
  3. Job Resolution: Job lookup returns base job names (e.g., //:job_name), not .cfg variants
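
The two phases above can be imitated outside Bazel to see the data flow. This is a sketch under assumptions, not how the generated targets are implemented: it assumes the job is invoked as `<cmd> config <partition_ref>` and `<cmd> exec`, with the config JSON arriving on stdin. `JOB_SOURCE`, `plan`, `execute`, and `run_job` are hypothetical names.

```python
import json
import subprocess
import sys
import textwrap

# Stand-in job binary, inlined so the sketch is self-contained.
JOB_SOURCE = textwrap.dedent("""
    import json, sys
    if sys.argv[1] == "config":
        ref = sys.argv[2]
        print(json.dumps({"configs": [{"outputs": [{"str": ref}],
                                       "inputs": [], "args": [ref], "env": {}}]}))
    elif sys.argv[1] == "exec":
        config = json.load(sys.stdin)
        print("built " + config["configs"][0]["outputs"][0]["str"])
""")
JOB_CMD = [sys.executable, "-c", JOB_SOURCE]


def plan(job_cmd, partition_ref):
    """Planning phase: ask the job for its configuration (the .cfg role)."""
    result = subprocess.run(job_cmd + ["config", partition_ref],
                            capture_output=True, text=True, check=True)
    return result.stdout


def execute(job_cmd, config_json):
    """Feed the config to the exec subcommand on stdin (the .exec role)."""
    result = subprocess.run(job_cmd + ["exec"], input=config_json,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_job(job_cmd, partition_ref):
    """Main job target role: pipe config output into exec input."""
    return execute(job_cmd, plan(job_cmd, partition_ref))


config_json = plan(JOB_CMD, "reviews/2024-01-01")
outcome = run_job(JOB_CMD, "reviews/2024-01-01")
print(outcome)
```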

Graph Configuration

databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)

Job Lookup Pattern

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")
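
A fuller version of the lookup above, with `pattern` defined. This is a sketch: the partition-ref formats and job target names are hypothetical, and how the lookup binary receives refs and reports results is not shown here.

```python
import re

# Hypothetical routing table: partition-ref pattern -> base job target.
JOB_PATTERNS = [
    (re.compile(r"^reviews/\d{4}-\d{2}-\d{2}$"), "//:extract_reviews_job"),
    (re.compile(r"^summary/\d{4}-\d{2}-\d{2}$"), "//:summarize_job"),
]


def lookup_job_for_partition(partition_ref: str) -> str:
    for pattern, job_target in JOB_PATTERNS:
        if pattern.match(partition_ref):
            return job_target  # base target, never the .cfg variant
    raise ValueError(f"No job found for: {partition_ref}")


print(lookup_job_for_partition("reviews/2024-01-01"))
```

Keeping the patterns in one table makes it easy to check that every output a job declares is matched by some entry.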

Common Pitfalls

  • Empty args: With "args": [], only the config step runs and exec is never called
  • Wrong target refs: Job lookup must return base targets, not .cfg variants
  • Missing partition refs: All outputs must be addressable via partition references
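
These pitfalls are cheap to check before a build. The helper below is a hypothetical sketch, not part of DataBuild; it assumes the config shape shown earlier and that a job label ending in `.cfg` is the only wrong form worth flagging.

```python
def find_pitfalls(config: dict, job_label: str) -> list:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    if job_label.endswith(".cfg"):
        problems.append(f"{job_label}: lookup must return the base target")
    for i, job_config in enumerate(config.get("configs", [])):
        if not job_config.get("args"):
            problems.append(f"configs[{i}]: args is empty, exec will not run")
        outputs = job_config.get("outputs", [])
        if not outputs:
            problems.append(f"configs[{i}]: no outputs declared")
        for output in outputs:
            if not output.get("str"):
                problems.append(f"configs[{i}]: output lacks a partition ref")
    return problems


# Hypothetical config that hits two of the pitfalls.
bad_config = {"configs": [{"outputs": [{"str": "reviews/2024-01-01"}],
                           "inputs": [], "args": [], "env": {}}]}
for problem in find_pitfalls(bad_config, "//:extract_reviews_job.cfg"):
    print(problem)
```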