Update design docs and claude.md
This commit is contained in:
parent
e32fea0d58
commit
3c67d5cb82
3 changed files with 35 additions and 60 deletions
83
CLAUDE.md
|
|
@ -2,64 +2,42 @@
|
|||
|
||||
## Project Overview
|
||||
DataBuild is a bazel-based data build system. Key files:
|
||||
- [`DESIGN.md`](./DESIGN.md) - Overall design of databuild
|
||||
- [`databuild.proto`](databuild/databuild.proto) - System interfaces
|
||||
- [`manifesto.md`](manifesto.md) - Project philosophy
|
||||
- [`core-concepts.md`](core-concepts.md) - Core concepts
|
||||
- Component designs - design docs for specific aspects or components of databuild:
|
||||
- [Core build](./design/core-build.md) - How the core semantics of databuild work and are implemented
|
||||
- [Build event log](./design/build-event-log.md) - How the build event log works and is accessed
|
||||
- [Service](./design/service.md) - How the databuild HTTP service and web app are designed.
|
||||
- [Glossary](./design/glossary.md) - Centralized description of key terms.
|
||||
- [Graph specification](./design/graph-specification.md) - Describes the different libraries that enable more succinct declaration of databuild applications than the core bazel-based interface.
|
||||
- [Observability](./design/observability.md) - How observability is systematically achieved throughout databuild applications.
|
||||
- [Deploy strategies](./design/deploy-strategies.md) - Different strategies for deploying databuild applications.
|
||||
- [Triggers](./design/triggers.md) - How triggering works in databuild applications.
|
||||
- [Why databuild?](./design/why-databuild.md) - Why choose databuild over other, better-established orchestration solutions.
|
||||
|
||||
Please reference these for any related work, as they indicate key technical bias/direction of the project.
|
||||
|
||||
## Tenets
|
||||
|
||||
- Declarative over imperative wherever possible/reasonable.
|
||||
- We are building for the future, and choose to do "the right thing" rather than taking shortcuts to get unstuck. If you get stuck, pause and ask for help/input.
|
||||
- In addition, do not add "unknown" results when parses or matches fail - these should always throw.
|
||||
- Do not add "unknown" results when parses or matches fail - these should always throw.
|
||||
- Compile-time correctness is a super-power, and investment in it speeds up the flywheel for development and user value.
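
As an illustration of the "always throw" tenet above, here is a minimal Python sketch. The `JobStatus` enum and `parse_job_status` function are hypothetical names invented for this example and are not part of databuild; the point is only that an unrecognized value raises instead of mapping to an "unknown" member.

```python
from enum import Enum


class JobStatus(Enum):
    """Hypothetical status values, used only for illustration."""
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


def parse_job_status(raw: str) -> JobStatus:
    """Parse a status string, raising on anything unrecognized."""
    try:
        return JobStatus(raw.strip().lower())
    except ValueError:
        # Deliberately no JobStatus.UNKNOWN fallback: a failed parse
        # surfaces immediately instead of flowing downstream.
        raise ValueError(f"unrecognized job status: {raw!r}") from None
```

Callers then handle only the known members, and a bad input fails at the parse site rather than somewhere later in the build.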
|
||||
|
||||
## Build & Test
|
||||
```bash
|
||||
# Run comprehensive end-to-end tests (validates CLI vs Service consistency)
|
||||
# Build all databuild components
|
||||
bazel build //...
|
||||
|
||||
# Run databuild unit tests
|
||||
bazel test //...
|
||||
|
||||
# Run end-to-end tests (validates CLI vs Service consistency)
|
||||
./run_e2e_tests.sh
|
||||
|
||||
# Run all core unit tests
|
||||
./scripts/bb_test_all
|
||||
|
||||
# Remote testing
|
||||
./scripts/bb_remote_test_all
|
||||
|
||||
# Do not try to `bazel test //examples/basic_graph/...`, as this will not work.
|
||||
```
|
||||
|
||||
## End-to-End Testing
|
||||
The project includes comprehensive end-to-end tests that validate CLI and Service build consistency:
|
||||
|
||||
### Test Suite Structure
|
||||
- `tests/end_to_end/simple_test.sh` - Basic CLI vs Service validation
|
||||
- `tests/end_to_end/podcast_simple_test.sh` - Podcast reviews CLI vs Service validation
|
||||
- `tests/end_to_end/basic_graph_test.sh` - Comprehensive basic graph testing
|
||||
- `tests/end_to_end/podcast_reviews_test.sh` - Comprehensive podcast testing
|
||||
|
||||
### Event Validation
|
||||
Tests ensure CLI and Service emit identical build events:
|
||||
- **Build request events**: Orchestration lifecycle (received, planning, executing, completed)
|
||||
- **Job events**: Job execution tracking
|
||||
- **Partition events**: Partition build status
|
||||
|
||||
### CLI vs Service Event Alignment
|
||||
Recent improvements ensure both paths emit identical events:
|
||||
- CLI: Enhanced with orchestration events to match Service behavior
|
||||
- Service: HTTP API orchestration events + core build events
|
||||
- Validation: Tests fail if event counts or types differ between CLI and Service
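
The validation rule above (fail when event counts or types differ) could be sketched roughly as follows. This is not the logic of the actual shell tests; it assumes, purely for illustration, that each event is a dict with an `event_type` key, and `assert_events_aligned` is a hypothetical helper name.

```python
from collections import Counter


def assert_events_aligned(cli_events, service_events):
    """Raise AssertionError if the two paths emitted different event
    types, or different counts of any event type."""
    cli_counts = Counter(e["event_type"] for e in cli_events)
    service_counts = Counter(e["event_type"] for e in service_events)
    if cli_counts != service_counts:
        all_types = sorted(set(cli_counts) | set(service_counts))
        # Report only the types whose counts disagree, as (cli, service).
        diff = {
            t: (cli_counts.get(t, 0), service_counts.get(t, 0))
            for t in all_types
            if cli_counts.get(t, 0) != service_counts.get(t, 0)
        }
        raise AssertionError(f"CLI/Service event mismatch (cli, service): {diff}")
```

Comparing counts per type, rather than full event payloads, tolerates differences in IDs and timestamps between the two runs.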
|
||||
|
||||
### Running Individual Tests
|
||||
```bash
|
||||
# Test basic graph
|
||||
tests/end_to_end/simple_test.sh \
|
||||
examples/basic_graph/bazel-bin/basic_graph.build \
|
||||
examples/basic_graph/bazel-bin/basic_graph.service
|
||||
|
||||
# Test podcast reviews (run from correct directory)
|
||||
cd examples/podcast_reviews
|
||||
../../tests/end_to_end/podcast_simple_test.sh \
|
||||
bazel-bin/podcast_reviews_graph.build \
|
||||
bazel-bin/podcast_reviews_graph.service
|
||||
```
|
||||
|
||||
## Project Structure
|
||||
- `databuild/` - Core system (Rust/Proto)
|
||||
- `examples/` - Example implementations
|
||||
|
|
@ -89,21 +67,6 @@ def main():
|
|||
handle_exec(sys.argv[2:]) # Perform actual work
|
||||
```
|
||||
|
||||
### Job Configuration Requirements
|
||||
**CRITICAL**: Job configs must include non-empty `args` for execution to work:
|
||||
```python
|
||||
config = {
|
||||
"configs": [{
|
||||
"outputs": [{"str": partition_ref}],
|
||||
"inputs": [...],
|
||||
"args": ["some_arg"], # REQUIRED: Cannot be empty []
|
||||
"env": {"PARTITION_REF": partition_ref}
|
||||
}]
|
||||
}
|
||||
```
|
||||
|
||||
For jobs with `"args": []`, only the config function is called during execution; exec is never invoked.
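
One way to catch this early is to validate the config before emitting it. The sketch below is an assumption-laden example: `validate_job_config` is a hypothetical helper, not a databuild API, and it checks only the `args` requirement described above.

```python
def validate_job_config(config: dict) -> dict:
    """Return the config unchanged, or raise if any entry has empty args."""
    configs = config.get("configs")
    if not configs:
        raise ValueError("job config must contain at least one entry in 'configs'")
    for i, entry in enumerate(configs):
        # An empty or missing args list means exec would not be called.
        if not entry.get("args"):
            raise ValueError(f"configs[{i}] has empty or missing 'args'")
    return config
```

Raising here, rather than silently emitting the config, turns a skipped exec into an immediate planning-time failure.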
|
||||
|
||||
### DataBuild Execution Flow
|
||||
1. **Planning Phase**: DataBuild calls `.cfg` targets to get job configurations
|
||||
2. **Execution Phase**: DataBuild calls main job targets which pipe config to exec
|
||||
|
|
|
|||
|
|
@ -6,6 +6,7 @@ status summary, job run statistics, etc.
|
|||
## Architecture
|
||||
- Uses [event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) /
|
||||
[CQRS](https://www.wikipedia.org/wiki/cqrs) philosophy.
|
||||
- BELs are only ever written to by graph processes (e.g. CLI or service), not the jobs themselves.
|
||||
- BEL uses only two types of tables:
|
||||
- The root event table, with event ID, timestamp, message, event type, and ID fields for related event types.
|
||||
- Type-specific event tables (e.g. task event, partition event, build request event, etc.).
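
To make the two-table layout concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`build_events`, `partition_events`, `partition_ref`, `status`) and the helper functions are assumptions chosen for illustration; the actual BEL schema and storage backend may differ.

```python
import sqlite3

# Hypothetical schema: one root table plus one type-specific table.
SCHEMA = """
CREATE TABLE build_events (
    event_id         TEXT PRIMARY KEY,
    timestamp        INTEGER NOT NULL,
    event_type       TEXT NOT NULL,
    message          TEXT NOT NULL,
    build_request_id TEXT,   -- ID fields for related event types
    partition_ref    TEXT
);
CREATE TABLE partition_events (
    event_id      TEXT PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    status        TEXT NOT NULL
);
"""


def open_bel() -> sqlite3.Connection:
    """Create an in-memory event log with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_partition_event(conn, event_id, timestamp, partition_ref, status, message=""):
    """Append one partition event: a root row plus its type-specific row."""
    with conn:  # both inserts commit together, or neither does
        conn.execute(
            "INSERT INTO build_events (event_id, timestamp, event_type, message, partition_ref) "
            "VALUES (?, ?, 'partition', ?, ?)",
            (event_id, timestamp, message, partition_ref),
        )
        conn.execute(
            "INSERT INTO partition_events (event_id, partition_ref, status) VALUES (?, ?, ?)",
            (event_id, partition_ref, status),
        )
```

In keeping with the event-sourcing approach, the sketch only appends rows; current state would be derived by reading events back, for example by joining the two tables on `event_id`.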
|
||||
|
|
|
|||
11
design/why-databuild.md
Normal file
|
|
@ -0,0 +1,11 @@
|
|||
|
||||
# Why DataBuild?
|
||||
|
||||
(work in progress)
|
||||
|
||||
Why?
|
||||
- Orchestration logic changes all the time, so it is better not to write it directly
|
||||
- Declarative -> Compile time correctness (e.g. can detect when no job produces a partition pattern)
|
||||
- Compartmentalized jobs + data deps -> Simplicity and compartmentalization of complexity
|
||||
- Bazel based -> Easy to deploy, maintain, and update
|
||||
|
||||