Multi-Hop Dependency Example

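This example demonstrates how DataBuild resolves multi-hop dependencies between jobs: when a job finds that an upstream partition is missing, the orchestrator builds that partition first and then reruns the job.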

Overview

The example consists of two jobs:

  • job_alpha: Produces the data/alpha partition
  • job_beta: Depends on data/alpha and produces data/beta

When you request data/beta:

  1. job_beta runs and detects that its data/alpha dependency is missing
  2. The orchestrator creates a want for data/alpha
  3. job_alpha runs and produces data/alpha
  4. job_beta runs again and succeeds, producing data/beta
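
The four steps above can be pictured with a small stand-alone simulation. This is only a sketch and not DataBuild's implementation: the functions, the temporary state directory, and the exit code 2 used to signal a dependency miss are all hypothetical stand-ins chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical simulation of the multi-hop flow. Nothing here calls DataBuild;
# marker files in a temp directory stand in for partitions.

STATE_DIR="$(mktemp -d)"   # stand-in for partition storage
RUN_LOG=()                 # records the order in which the jobs ran

job_alpha() {
  RUN_LOG+=("alpha")
  touch "$STATE_DIR/data_alpha"            # "produce" data/alpha
}

job_beta() {
  RUN_LOG+=("beta")
  # Dependency miss: data/alpha has not been produced yet.
  # Exit code 2 is an arbitrary choice for this sketch.
  [ -e "$STATE_DIR/data_alpha" ] || return 2
  touch "$STATE_DIR/data_beta"             # "produce" data/beta
}

# Minimal stand-in for the orchestrator: run beta; on a miss,
# build alpha (the derived want) and retry beta once.
build_beta() {
  local attempt
  for attempt in 1 2; do
    if job_beta; then
      return 0
    fi
    job_alpha
  done
  return 1
}

build_beta
echo "run order: ${RUN_LOG[*]}"
```

Running the script shows beta being attempted twice, with alpha in between, which is the same ordering the real example is expected to produce.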

Running the Example

From the repository root:

# Build the CLI
bazel build //databuild:databuild_cli

# Clean up any previous state
rm -f /tmp/databuild_multihop*.db /tmp/databuild_multihop_alpha_complete

# Start the server with the multihop configuration
./bazel-bin/databuild/databuild_cli serve \
  --port 3050 \
  --database /tmp/databuild_multihop.db \
  --config examples/multihop/config.json

In another terminal, create a want for data/beta:

# Create a want for data/beta (which will trigger the dependency chain)
./bazel-bin/databuild/databuild_cli --server http://localhost:3050 \
  want data/beta

# List the wants (re-run to see state changes)
./bazel-bin/databuild/databuild_cli --server http://localhost:3050 \
  wants list

# List the job runs
./bazel-bin/databuild/databuild_cli --server http://localhost:3050 \
  job-runs list

# List the partitions
./bazel-bin/databuild/databuild_cli --server http://localhost:3050 \
  partitions list
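
To follow the chain as it progresses, it can help to re-run the three list commands on an interval. The helper below is a convenience sketch, not part of DataBuild: the function name poll_status, its two arguments, and the CLI and SERVER variables are assumptions introduced here. Only the `wants list`, `job-runs list`, and `partitions list` subcommands and the `--server` flag come from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: repeatedly print wants, job runs, and partitions.
# CLI and SERVER default to the values used elsewhere in this README.
CLI="${CLI:-./bazel-bin/databuild/databuild_cli}"
SERVER="${SERVER:-http://localhost:3050}"

# poll_status [rounds] [interval_seconds]
poll_status() {
  local rounds="${1:-5}" interval="${2:-2}" i view
  for ((i = 1; i <= rounds; i++)); do
    echo "=== poll $i/$rounds ==="
    for view in wants job-runs partitions; do
      echo "--- $view ---"
      "$CLI" --server "$SERVER" "$view" list
    done
    # Sleep between rounds, but not after the last one.
    if (( i < rounds )); then
      sleep "$interval"
    fi
  done
}
```

After sourcing the snippet, `poll_status 10 1` would print all three views ten times, one second apart.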

Expected Behavior

  1. An initial want for data/beta is created
  2. job_beta runs, detects that data/alpha is missing, and reports a dependency miss
  3. The orchestrator creates a derivative want for data/alpha
  4. job_alpha runs and succeeds
  5. job_beta runs again and succeeds
  6. Both partitions are now in the Live state

Configuration Format

The example uses JSON format (config.json), but TOML is also supported. Here's the equivalent TOML:

[[jobs]]
label = "//examples/multihop:job_alpha"
entrypoint = "./examples/multihop/job_alpha.sh"
partition_patterns = ["data/alpha"]

[jobs.environment]
JOB_NAME = "alpha"

[[jobs]]
label = "//examples/multihop:job_beta"
entrypoint = "./examples/multihop/job_beta.sh"
partition_patterns = ["data/beta"]

[jobs.environment]
JOB_NAME = "beta"