Add shared core plan

Stuart Axelbrooke 2025-07-19 19:28:54 -07:00
parent 77d74c09fb
commit d618a124ed
3 changed files with 100 additions and 84 deletions

databuild/README.md

@@ -1,88 +1,26 @@
# DataBuild Protobuf Interfaces
This directory contains the protobuf interfaces for DataBuild, implemented as a hermetic Bazel-native solution.
# DataBuild
## Architecture
## API
### Hermetic Build Approach
Instead of relying on external Cargo dependencies or complex protoc toolchains, we use a **hermetic Bazel genrule** that generates Rust code directly from the protobuf specification. This ensures:
A sort of requirements doc for the semantics of DataBuild, enumerating the nouns and the verbs each supports.
- **Full Hermeticity**: No external dependencies beyond what's in the Bazel workspace
- **Consistency**: Same generated code across all environments
- **Performance**: Fast builds without complex dependency resolution
- **Simplicity**: Pure Bazel solution that integrates seamlessly
### Graph
### Generated Code Structure
The build generates Rust structs that mirror the protobuf specification in `databuild.proto`:
```rust
// Core types
pub struct PartitionRef { pub str: String }
pub struct JobConfig { /* ... */ }
pub struct JobGraph { /* ... */ }
// ... and all other protobuf messages
```
### Custom Serialization
Since we're hermetic, we implement our own JSON serialization instead of relying on serde:
```rust
let partition = PartitionRef::new("my-partition");
let json = partition.to_json(); // {"str":"my-partition"}
let parsed = PartitionRef::from_json(&json).unwrap();
```
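To make the approach concrete, here is a minimal sketch of what these hand-rolled methods might look like for `PartitionRef`. The escaping and error handling are illustrative only, not the actual generated code:

```rust
// Illustrative only; `PartitionRef` is the struct shown above,
// with a single `str: String` field.
impl PartitionRef {
    // Hand-rolled JSON encoding. A real implementation would escape
    // quotes and control characters in `self.str`.
    pub fn to_json(&self) -> String {
        format!("{{\"str\":\"{}\"}}", self.str)
    }

    // Hand-rolled JSON decoding for the single-field case.
    pub fn from_json(json: &str) -> Result<Self, String> {
        json.strip_prefix("{\"str\":\"")
            .and_then(|rest| rest.strip_suffix("\"}"))
            .map(|value| PartitionRef { str: value.to_string() })
            .ok_or_else(|| format!("malformed PartitionRef JSON: {json}"))
    }
}
```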
## Usage
### In BUILD.bazel files:
```starlark
rust_library(
    name = "my_service",
    deps = ["//databuild:databuild"],
    # ...
)
```
### In Rust code:
```rust
use databuild::*;
use std::collections::HashMap;

let partition = PartitionRef::new("my-partition");
let job_config = JobConfig {
    outputs: vec![partition],
    inputs: vec![],
    args: vec!["process".to_string()],
    env: HashMap::new(),
};
```
## Build Targets
- `//databuild:databuild` - Main library with generated protobuf types
- `//databuild:databuild_test` - Tests for the generated code
- `//databuild:databuild_proto` - The protobuf library definition
- `//databuild:structs` - Legacy manually-written structs (deprecated)
## Testing
```bash
bazel test //databuild:...
```
## Benefits of This Approach
1. **No External Dependencies**: Eliminates prost, tonic-build, and complex protoc setups
2. **Bazel Native**: Fully integrated with Bazel's dependency graph
3. **Fast Builds**: No compilation of external crates or complex build scripts
4. **Hermetic**: Same results every time, everywhere
5. **Maintainable**: Simple genrule that's easy to understand and modify
6. **Extensible**: Easy to add custom methods and serialization logic
## Future Enhancements
- Add wire-format serialization if needed
- Generate service stubs for gRPC-like communication
- Add validation methods for message types
- Extend custom serialization to support more formats
- `analyze` - Produce the job graph required to build the requested set of partitions.
- `build` - Analyze and then execute the produced job graph to build the requested partitions.
- `builds`
  - `list` - List past builds.
  - `show` - Shows the current status of the specified build and lists its events. Can tail build events for a build with `--follow/-f`.
  - `cancel` - Cancel the specified build.
- `partitions`
  - `list` - Lists partitions.
  - `show` - Shows the current status of the specified partition.
  - `invalidate` - Marks a partition as invalid (it will be rebuilt and won't be read).
- `jobs`
  - `list` - List jobs in the graph.
  - `show` - Shows task statistics (success %, runtime, etc.) and recent task results.
- `tasks` (job runs)
  - `list` - Lists past tasks.
  - `show` - Describes current task status and lists events.
  - `cancel` - Cancels a specific task.
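One way to visualize this command surface is as nested subcommand enums. A hypothetical sketch using the clap crate (whether DataBuild actually uses clap is an assumption, and the type names are invented), showing only the `builds` noun:

```rust
use clap::{Parser, Subcommand};

// Hypothetical CLI surface mirroring the nouns/verbs above.
#[derive(Parser)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Produce the job graph required to build the requested partitions.
    Analyze { partitions: Vec<String> },
    /// Analyze, then execute the produced job graph.
    Build { partitions: Vec<String> },
    /// Operate on builds; `partitions`, `jobs`, and `tasks` would follow
    /// the same pattern.
    #[command(subcommand)]
    Builds(BuildsCmd),
}

#[derive(Subcommand)]
enum BuildsCmd {
    /// List past builds.
    List,
    /// Show current status and events for the specified build.
    Show {
        build_id: String,
        /// Tail build events for the build.
        #[arg(short, long)]
        follow: bool,
    },
    /// Cancel the specified build.
    Cancel { build_id: String },
}

fn main() {
    let _cli = Cli::parse();
    // Dispatch into shared core components here.
}
```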

plans/shared-core.md Normal file

@@ -0,0 +1,73 @@
# Shared Core Refactor
We want to refactor the codebase to move shared functionality into core components that are shared between interfaces (e.g. CLI and service) and that can be tested independently. The capabilities are listed in [`databuild/README.md`](../databuild/README.md#graph); each first-level bullet represents a subcommand, with sub-bullets representing its subcommands, e.g. you can run `bazel-bin/mygraph.cli builds cancel c38a442d-fad3-4f74-ae3f-062e5377fe52`. This should match the service capabilities, e.g. `curl -XPOST localhost:8080/builds/c38a442d-fad3-4f74-ae3f-062e5377fe52/cancel`.
These core capabilities should be factored into explicit read vs write capabilities. On the write side, the component should verify the action is relevant (e.g. you can't cancel a nonexistent build, but you can request a build of an existing partition; it will just delegate), and then write the appropriate event to the BEL. Simple. On the read side, the different capabilities should be implemented by different "repositories", à la the repository pattern. We can then handle any variation in backing database internally to the repositories (since most SQL will be valid across SQLite, Postgres, and Delta).
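A rough sketch of the shape this split might take (all trait and type names here are hypothetical, not existing code):

```rust
// Hypothetical names throughout; the real types would come from the
// protobuf definitions and the BEL implementation.
type BuildId = String;

// Write side: verify the action is relevant, then append an event to the BEL.
trait BuildEventWriter {
    /// Always meaningful for an existing partition; just delegates.
    fn request_build(&self, partitions: &[String]) -> Result<BuildId, String>;
    /// Fails if the build does not exist; otherwise writes a cancel event.
    fn cancel_build(&self, build_id: &BuildId, reason: &str) -> Result<(), String>;
}

// Read side: one repository per noun, each free to hide any
// database-specific SQL internally (SQLite, Postgres, Delta).
trait BuildsRepository {
    fn list(&self) -> Result<Vec<BuildId>, String>;
    fn show(&self, build_id: &BuildId) -> Result<Option<String>, String>;
}
```

Both the CLI and the service would then be thin layers over these two interfaces.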
# Plan
We should take a phased approach to executing this plan. After implementing the core functionality and unit tests for each phase, we should pause and write down any potential refactoring that would benefit the system before moving on to the next phase.
## Phase 1 - Implement `MockBuildEventLog`
Goal: create a common testing tool that allows easy specification of testing conditions (e.g. BEL contents/events) to test system/graph behavior.
- Should use an in-memory SQLite database to ensure tests can be run in parallel
- Should make it very easy to specify test data (e.g. event constructors with random defaults that can be overwritten)
- Should include a trivial unit test that writes a valid event and verifies it's there via real code paths.
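A sketch of what this could look like, assuming the `rusqlite` crate and an illustrative single-table schema:

```rust
use rusqlite::Connection;

/// Test-only BEL backed by an in-memory SQLite database, so every test
/// gets an isolated store and tests can run in parallel.
pub struct MockBuildEventLog {
    conn: Connection,
}

impl MockBuildEventLog {
    pub fn new() -> rusqlite::Result<Self> {
        let conn = Connection::open_in_memory()?;
        // Illustrative schema; the real BEL schema would be reused here.
        conn.execute_batch(
            "CREATE TABLE events (
                id      TEXT PRIMARY KEY,
                kind    TEXT NOT NULL,
                payload TEXT NOT NULL
            );",
        )?;
        Ok(Self { conn })
    }

    /// Event writer; test helpers could wrap this with random defaults
    /// so tests pass only the fields they care about.
    pub fn write_event(&self, id: &str, kind: &str, payload: &str) -> rusqlite::Result<()> {
        self.conn.execute(
            "INSERT INTO events (id, kind, payload) VALUES (?1, ?2, ?3)",
            (id, kind, payload),
        )?;
        Ok(())
    }
}

#[test]
fn writes_a_valid_event_and_reads_it_back() {
    let log = MockBuildEventLog::new().unwrap();
    log.write_event("e1", "build_requested", "{}").unwrap();
    let count: i64 = log
        .conn
        .query_row("SELECT COUNT(*) FROM events", [], |row| row.get(0))
        .unwrap();
    assert_eq!(count, 1);
}
```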
## Phase 2 - Implement Common Event Write Component
Goal: create a single interface for writing events to the build event log.
- Should include all existing "write" functionality, like requesting a new build, etc.
- Migrate CLI to use new write component
- TODO - what's the exec model? Does it write the event, then start execution based on the ID? Does it start a service? Actually, what does tailing builds look like?
- Migrate service to use new write component
## Phase 3 - Implement `partitions` Repository
- Create a new build event log event for partition invalidation (with reason field)
- Implement a repository in `databuild/repositories/partitions/` that queries the build event log for the following capabilities
  - list
  - show
  - invalidate
- Add `partitions` subcommand to CLI
- Migrate or add partition capabilities to service.
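A sketch of the read side for this phase (schema, table, and column names are assumptions; the jobs, tasks, and builds repositories in later phases would follow the same shape):

```rust
use rusqlite::{Connection, OptionalExtension};

/// Hypothetical read-side repository over the BEL.
pub struct PartitionsRepository {
    conn: Connection,
}

impl PartitionsRepository {
    /// `list`: every partition that has ever appeared in the BEL.
    pub fn list(&self) -> rusqlite::Result<Vec<String>> {
        let mut stmt = self
            .conn
            .prepare("SELECT DISTINCT partition_ref FROM partition_events")?;
        let rows = stmt.query_map([], |row| row.get(0))?;
        rows.collect()
    }

    /// `show`: current status is the kind of the most recent event, so an
    /// invalidation event (with its reason) marks the partition for rebuild.
    pub fn show(&self, partition: &str) -> rusqlite::Result<Option<String>> {
        self.conn
            .query_row(
                "SELECT kind FROM partition_events
                 WHERE partition_ref = ?1
                 ORDER BY rowid DESC LIMIT 1",
                [partition],
                |row| row.get(0),
            )
            .optional()
    }
}
```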
## Phase 4 - Implement `jobs` Repository
- Implement a repository in `databuild/repositories/jobs/` that queries the BEL for the following capabilities
  - list
  - show
- Add `jobs` subcommand to CLI
- Migrate or add jobs capabilities to service.
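For `show`, the task statistics could come straight from an aggregate over the BEL; an illustrative query (table and column names are assumptions):

```rust
use rusqlite::Connection;

/// Success rate (0.0..=1.0) and mean runtime in ms for one job,
/// aggregated over its past task events. Schema is assumed.
fn job_stats(conn: &Connection, job: &str) -> rusqlite::Result<(f64, f64)> {
    conn.query_row(
        "SELECT COALESCE(AVG(CASE WHEN status = 'success' THEN 1.0 ELSE 0.0 END), 0.0),
                COALESCE(AVG(runtime_ms), 0.0)
         FROM task_events
         WHERE job = ?1",
        [job],
        |row| Ok((row.get(0)?, row.get(1)?)),
    )
}
```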
## Phase 5 - Implement `tasks` Repository
- Implement a "task cancel" job BEL event (with reason field)
- Implement a repository in `databuild/repositories/tasks/` that queries the BEL for the following capabilities
- list
- show
- And add to the common write component a `cancel_task` method to implement this
- Add `tasks` subcommand to CLI
- Add service endpoint for canceling tasks
- (TODO - later we will need to implement a way to operate on the dashboard - lets do that in a later project)
## Phase 6 - Implement `builds` Repository
- Implement a "build cancel" BEL event (with reason field)
- Implement a repository in `databuild/repositories/builds/` that queries the BEL for the following capabilities
  - list
  - show
  - cancel
- Add to the common write component a `cancel_build` method
- Add `builds` subcommand to the CLI
- Migrate service endpoints to use the new shared impl
- Add service endpoint implementing build cancel
- Add a cancel button to the build status page (for in-progress builds)
## Phase 7 - Testing
- Review prior work, ensure that tests have been written to cover the most important 90% of functionality for each component.
- Ensure all tests pass, and fix those that don't.
- Run e2e tests.
## Phase 8 - Reflection & Next Steps
- Reflect on the work done and look for opportunities for improvement and refactoring
- Call out any "buried bodies" very explicitly (things which need to be revisited for the implementation to be complete)
# Note
Do not take shortcuts. We are building this for the long term. If you have any questions, please pause and ask.

View file

@@ -2,6 +2,11 @@
- Status indicator for page selection
- On build request detail page, show aggregated job results
- Use path-based navigation instead of hashbang?
- Build event job links are not encoding job labels properly
- How do we encode job labels in the path? (Build event job links are not encoding job labels properly)
- Resolve double type system with protobuf and openapi
- Prometheus metrics export
- Plan for external worker dispatch (e.g. k8s pod per build, or launch in container service)
- k8s can use [jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/)
- Should we have meaningful exit codes? E.g. "retryable error", etc.?
- Triggers?
- How do we handle task logging?