
Claude Instructions

Project Overview

DataBuild is a Bazel-based data build system.

Build & Test

# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Run all core unit tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.

End-to-End Testing

The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:

Test Suite Structure

  • tests/end_to_end/simple_test.sh - Basic CLI vs Service validation
  • tests/end_to_end/podcast_simple_test.sh - Podcast reviews CLI vs Service validation
  • tests/end_to_end/basic_graph_test.sh - Comprehensive basic graph testing
  • tests/end_to_end/podcast_reviews_test.sh - Comprehensive podcast testing

Event Validation

Tests ensure CLI and Service emit identical build events:

  • Build request events: Orchestration lifecycle (received, planning, executing, completed)
  • Job events: Job execution tracking
  • Partition events: Partition build status

CLI vs Service Event Alignment

Recent improvements ensure both paths emit identical events:

  • CLI: Enhanced with orchestration events to match Service behavior
  • Service: HTTP API orchestration events + core build events
  • Validation: Tests fail if event counts or types differ between CLI and Service
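
For reference, a minimal sketch in Python of what this validation amounts to, assuming events are recorded as JSON lines with a "type" field (the file names and event format here are assumptions, not taken from the test scripts):

import json
from collections import Counter
from pathlib import Path

def event_type_counts(path: str) -> Counter:
    # Count event types in a hypothetical JSON-lines event log.
    counts = Counter()
    for line in Path(path).read_text().splitlines():
        if line.strip():
            counts[json.loads(line)["type"]] += 1
    return counts

# The e2e tests fail when CLI and Service logs differ in event counts or types.
assert event_type_counts("cli_events.jsonl") == event_type_counts("service_events.jsonl")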

Running Individual Tests

# Test basic graph
tests/end_to_end/simple_test.sh \
  examples/basic_graph/bazel-bin/basic_graph.build \
  examples/basic_graph/bazel-bin/basic_graph.service

# Test podcast reviews (run from correct directory)
cd examples/podcast_reviews
../../tests/end_to_end/podcast_simple_test.sh \
  bazel-bin/podcast_reviews_graph.build \
  bazel-bin/podcast_reviews_graph.service

Project Structure

  • databuild/ - Core system (Rust/Proto)
  • examples/ - Example implementations
  • scripts/ - Build utilities

Key Components

  • Graph analysis/execution in Rust
  • Bazel rules for job orchestration
  • Java/Python examples for different use cases

DataBuild Job Architecture

Job Target Structure

Each DataBuild job creates three Bazel targets:

  • job_name.cfg - Configuration target (calls binary with "config" subcommand)
  • job_name.exec - Execution target (calls binary with "exec" subcommand)
  • job_name - Main job target (pipes config output to exec input)
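
Conceptually, the main job target pipes the config subcommand's output into the exec subcommand. A rough Python illustration of that wiring (not the actual Bazel rule implementation; the binary path is a placeholder):

import subprocess

# Run the "config" subcommand and capture its JSON output.
cfg = subprocess.run(["./job_binary", "config"], capture_output=True, check=True)
# Feed that output to the "exec" subcommand, which performs the real work.
subprocess.run(["./job_binary", "exec"], input=cfg.stdout, check=True)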

Unified Job Binary Pattern

Jobs use a single binary with subcommands:

import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])    # Perform the actual work

Job Configuration Requirements

CRITICAL: Job configs must include non-empty args for execution to work:

config = {
    "configs": [{
        "outputs": [{"str": partition_ref}],
        "inputs": [...],
        "args": ["some_arg"],  # REQUIRED: Cannot be empty []
        "env": {"PARTITION_REF": partition_ref}
    }]
}

Jobs with "args": [] will only have their config step invoked during execution; the exec step never runs.
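
A minimal handle_config sketch that satisfies this requirement by printing the configuration JSON to stdout (the partition ref literal is illustrative; real jobs usually derive it from args or the environment):

import json

def handle_config(args):
    partition_ref = "my_dataset/2024-01-01"  # illustrative value
    config = {
        "configs": [{
            "outputs": [{"str": partition_ref}],
            "inputs": [],
            "args": ["some_arg"],  # non-empty, so the exec step runs
            "env": {"PARTITION_REF": partition_ref},
        }]
    }
    print(json.dumps(config))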

DataBuild Execution Flow

  1. Planning Phase: DataBuild calls .cfg targets to get job configurations
  2. Execution Phase: DataBuild calls main job targets which pipe config to exec
  3. Job Resolution: Job lookup returns base job names (e.g., //:job_name), not .cfg variants
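
A rough pseudocode sketch of this flow in Python (the real orchestration lives in the Rust core; the helper functions here are illustrative, not actual APIs):

# Planning phase: resolve each requested partition to a job and collect its config.
def plan(partition_refs):
    plans = []
    for ref in partition_refs:
        job = lookup_job_for_partition(ref)       # base target, e.g. "//:job_name"
        plans.append((job, run_cfg_target(job)))  # invokes the job's .cfg target
    return plans

# Execution phase: run each main job target, which pipes config into exec.
def execute(plans):
    for job, config in plans:
        run_job_target(job, config)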

Graph Configuration

databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)

Job Lookup Pattern

import re

pattern = re.compile(r"^my_dataset/")  # example pattern; match your partition ref scheme

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return base job target
    raise ValueError(f"No job found for: {partition_ref}")

Common Pitfalls

  • Empty args: Jobs with "args": [] won't execute properly
  • Wrong target refs: Job lookup must return base targets, not .cfg variants
  • Missing partition refs: All outputs must be addressable via partition references

Documentation

We use plans and designs in the plans directory to anchor most large-scale efforts. Plans should be good bets rather than exhaustive specifications, and (this is critical) we update them after the work is completed or after significant progress toward completion.