# Claude Instructions

## Project Overview
DataBuild is a Bazel-based data build system. Key files:

- `DESIGN.md` - Overall design of databuild
- `databuild.proto` - System interfaces
- Component designs - design docs for specific aspects or components of databuild:
  - Core build - How the core semantics of databuild work and are implemented
  - Build event log - How the build event log works and is accessed
  - Service - How the databuild HTTP service and web app are designed
  - Glossary - Centralized description of key terms
  - Graph specification - Describes the libraries that enable more succinct declaration of databuild applications than the core Bazel-based interface
  - Observability - How observability is systematically achieved throughout databuild applications
  - Deploy strategies - Different strategies for deploying databuild applications
  - Triggers - How triggering works in databuild applications
  - Why databuild? - Why to choose databuild over more established orchestration solutions
Please reference these for any related work, as they indicate key technical bias/direction of the project.
## Tenets
- Declarative over imperative wherever possible/reasonable.
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
- Do not add "unknown" results when parses or matches fail - these should always throw.
- Compile-time correctness is a superpower, and investing in it speeds up the flywheel of development and user value.
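The "always throw" tenet can be sketched with a hypothetical partition-ref parser (the function name and ref format are illustrative, not from the codebase):

```python
def parse_partition_ref(ref: str) -> tuple[str, str]:
    """Split a ref like "dataset/partition" into its parts, or raise."""
    parts = ref.split("/")
    if len(parts) != 2 or not all(parts):
        # No "unknown" sentinel on failure; fail loudly instead.
        raise ValueError(f"Malformed partition ref: {ref!r}")
    return parts[0], parts[1]
```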
## Build & Test

```shell
# Build all databuild components
bazel build //...

# Run databuild unit tests
bazel test //...

# Run end-to-end tests (validates CLI vs Service consistency)
./run_e2e_tests.sh

# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
```
## Project Structure

- `databuild/` - Core system (Rust/Proto)
- `examples/` - Example implementations
- `scripts/` - Build utilities
## Key Components
- Graph analysis/execution in Rust
- Bazel rules for job orchestration
- Java/Python examples for different use cases
## DataBuild Job Architecture

### Job Target Structure

Each DataBuild job creates three Bazel targets:
- `job_name.cfg` - Configuration target (calls the binary with the "config" subcommand)
- `job_name.exec` - Execution target (calls the binary with the "exec" subcommand)
- `job_name` - Main job target (pipes config output to exec input)
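The "pipes config output to exec input" wiring can be pictured as a shell pipe between the two subtargets. A runnable stand-in using `subprocess` (the real main target invokes Bazel-built binaries; the inline programs here are illustrative stand-ins):

```python
import subprocess
import sys

# Stand-in for `job_name.cfg`: emits a config JSON on stdout.
cfg = subprocess.run(
    [sys.executable, "-c", "print('{\"args\": []}')"],
    capture_output=True, text=True, check=True,
)

# Stand-in for `job_name.exec`: reads the config from stdin.
exec_proc = subprocess.run(
    [sys.executable, "-c", "import sys; print('got ' + sys.stdin.read().strip())"],
    input=cfg.stdout, capture_output=True, text=True, check=True,
)
print(exec_proc.stdout.strip())  # got {"args": []}
```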
### Unified Job Binary Pattern

Jobs use a single binary with subcommands:

```python
import sys

def main():
    command = sys.argv[1]  # "config" or "exec"
    if command == "config":
        handle_config(sys.argv[2:])  # Output job configuration JSON
    elif command == "exec":
        handle_exec(sys.argv[2:])  # Perform actual work
    else:
        raise ValueError(f"Unknown command: {command}")  # fail loudly, per the tenets
```
### DataBuild Execution Flow

- Planning Phase: DataBuild calls `.cfg` targets to get job configurations
- Execution Phase: DataBuild calls the main job targets, which pipe config to exec
- Job Resolution: Job lookup returns base job names (e.g., `//:job_name`), not `.cfg` variants
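The two phases above can be sketched in-process, with stand-in functions for the `.cfg` and main targets (the config shape and values here are assumptions for illustration, not the actual protocol):

```python
def run_cfg(job: str) -> dict:
    # Planning phase: ask the job's .cfg target for its configuration.
    return {"job": job, "args": ["--date", "2024-01-01"]}  # hypothetical shape

def run_exec(config: dict) -> str:
    # Execution phase: the main target pipes the config into exec.
    return f"ran {config['job']} with args {config['args']}"

result = run_exec(run_cfg("//:job_name"))
```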
### Graph Configuration

```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],  # Reference base job targets
    lookup = ":job_lookup",     # Binary that routes partition refs to jobs
)
```
### Job Lookup Pattern

```python
import re

# Example pattern only: match whatever partition-ref format your jobs produce.
pattern = re.compile(r"^my_dataset/")

def lookup_job_for_partition(partition_ref: str) -> str:
    if pattern.match(partition_ref):
        return "//:job_name"  # Return the base job target
    raise ValueError(f"No job found for: {partition_ref}")
```
## Common Pitfalls

- Empty args: Jobs with `"args": []` won't execute properly
- Wrong target refs: Job lookup must return base targets, not `.cfg` variants
- Missing partition refs: All outputs must be addressable via partition references
- Missing OpenAPI outs: Bazel hermeticity demands that we declare every output file, so when the OpenAPI code gen creates new files, we must explicitly add them to the target's `outs` field
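A cheap guard against the "wrong target refs" pitfall, e.g. in tests around a job lookup (the helper name is hypothetical, not from the codebase):

```python
def assert_base_target(target: str) -> str:
    """Reject .cfg/.exec variants; lookups must return base job targets."""
    if target.endswith((".cfg", ".exec")):
        raise ValueError(f"Lookup must return base targets, got: {target}")
    return target
```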
## Documentation

We use plans/designs in the `plans` directory to anchor most large-scale efforts. We create plans that are good bets, though not necessarily exhaustive, and then (and this is critical) we update them after the work is completed, or after significant progress toward completion.