Compare commits

...

3 commits

Author SHA1 Message Date
e32fea0d58 Add potential sketch for scala 2025-07-26 20:19:40 -07:00
111e6d9987 Update PartitionManifest interface 2025-07-26 20:10:53 -07:00
033ba12f43 Update designs 2025-07-26 00:48:18 -07:00
11 changed files with 652 additions and 2 deletions

@@ -1,7 +1,7 @@
# DataBuild Design
DataBuild is a trivially-deployable, partition-oriented, declarative build system. Where data orchestration flows are normally imperative and implicit (do this, then do that, etc), DataBuild uses stated data dependencies to make this process declarative and explicit. DataBuild scales the declarative nature of tools like DBT to meet the needs of modern, broadly integrated data and ML organizations, that consume data from many sources and which arrive on a highly varying basis. DataBuild enables confident, bounded completeness in a world where input data is effectively never complete at any given time.
DataBuild is a trivially-deployable, partition-oriented, declarative build system. Where data orchestration flows are normally imperative and implicit (do this, then do that, etc), DataBuild uses stated data dependencies to make this process declarative and explicit. DataBuild scales the declarative nature of tools like DBT to meet the needs of modern, broadly integrated data and ML organizations, who consume data from many sources that arrive on a highly varying basis. DataBuild enables confident, bounded completeness in a world where input data is effectively never complete at any given time.
## Philosophy
@@ -17,7 +1,8 @@ Graphs and jobs are defined in [bazel](https://bazel.build), allowing graphs (an
- **Jobs** - Their `exec` entrypoint builds partitions from partitions, and their `config` entrypoint specifies what partitions are required to produce the requested partition(s), along with the specific config to run `exec` with to build said partitions.
- **Graphs** - Composes jobs together to achieve multi-job orchestration, using a `lookup` mechanism to resolve a requested partition to the job that can build it. Together with its constituent jobs, Graphs can fully plan the build of any set of partitions. Most interactions with a DataBuild app happen with a graph.
- **Build Event Log** - Encodes the state of the system, recording build requests, job activity, partition production, etc to enable running databuild as a deployed application.
- **Bazel targets** - Bazel is a fast, extensible, and hermetic build system. DataBuild uses bazel targets to describe graphs and jobs, making graphs themselves deployable application. Implementing a DataBuild app is the process of integrating your data build jobs in `databuild_job` bazel targets, and connecting them with a `databuild_graph` target.
- **Bazel Targets** - Bazel is a fast, extensible, and hermetic build system. DataBuild uses bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of integrating your data build jobs in `databuild_job` bazel targets, and connecting them with a `databuild_graph` target.
- [**Graph Specification Strategies**](design/graph-specification.md) (coming soon) - Application libraries in Python/Rust/Scala that use language features to enable ergonomic and succinct specification of jobs and graphs.
### Partition / Job Assumptions and Best Practices
@@ -54,4 +55,17 @@ The BEL encodes all relevant build actions that occur, enabling concurrent build
The BEL is similar to [event-sourced](https://martinfowler.com/eaaDev/EventSourcing.html) systems, as all application state is rendered from aggregations over the BEL. This enables the BEL to stay simple while also powering concurrent builds, the data catalog, and the DataBuild service.
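As an illustration of the aggregation idea, here is a minimal sketch (event and state names are assumptions, not the actual BEL schema) of deriving partition liveness by folding over the log:
```rust
use std::collections::HashMap;

// Illustrative event and state shapes; the real schema lives in databuild.proto.
enum BelEvent {
    PartitionProduced { partition_ref: String },
    PartitionTainted { partition_ref: String },
}

#[derive(Debug, PartialEq)]
enum PartitionState {
    Live,
    Tainted,
}

fn partition_catalog(events: &[BelEvent]) -> HashMap<String, PartitionState> {
    // The catalog is just an aggregation over the log; no separate mutable store.
    events.iter().fold(HashMap::new(), |mut acc, event| {
        match event {
            BelEvent::PartitionProduced { partition_ref } => {
                acc.insert(partition_ref.clone(), PartitionState::Live);
            }
            BelEvent::PartitionTainted { partition_ref } => {
                acc.insert(partition_ref.clone(), PartitionState::Tainted);
            }
        }
        acc
    })
}
```
The catalog, build status summaries, and service views all follow this same pattern, differing only in the fold function applied to the log.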
### Triggers and Wants (Coming Soon)
["Wants"](./design/triggers.md) are the main mechanism for continually building partitions over time. In real world scenarios, it is standard for data to arrive late, or not at all. Wants cause the databuild graph to continually attempt to build the wanted partitions until a) the partitions are live or b) the want expires, at which another script can be run. Wants are the mechanism that implements SLA checking.
You can also use cron-based triggers, which return partition refs that they want built.
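As a rough sketch of the semantics (names and shapes are assumptions, not the final design), a want with a TTL could be evaluated on each trigger tick like so:
```rust
use std::time::{Duration, SystemTime};

// Assumed shapes for illustration only.
struct Want {
    partitions: Vec<String>, // PartitionRefs this want asks for
    created_at: SystemTime,
    ttl: Duration, // how long to keep retrying before giving up
}

enum WantOutcome {
    Satisfied,  // all wanted partitions are live
    RetryBuild, // issue another build request for the missing partitions
    Expired,    // TTL elapsed: run the failure script / raise the SLA breach
}

fn evaluate_want(want: &Want, is_live: impl Fn(&str) -> bool, now: SystemTime) -> WantOutcome {
    if want.partitions.iter().all(|p| is_live(p)) {
        WantOutcome::Satisfied
    } else if now.duration_since(want.created_at).unwrap_or_default() > want.ttl {
        WantOutcome::Expired
    } else {
        WantOutcome::RetryBuild
    }
}
```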
# Key Insights
- Orchestration logic changes all the time - better to not write it at all.
- Orchestration decisions and application logic are innately coupled
## Assumptions
- Job -> partition relationships are canonical, job runs are idempotent

@@ -74,6 +74,9 @@ message PartitionManifest {
// The configuration used to run the job
Task task = 5;
// Arbitrary metadata about the produced partitions, keyed by partition ref
map<string, string> metadata = 6;
}
message JobExecuteRequest { repeated PartitionRef outputs = 1; }

design/build-event-log.md (new file, 33 lines)

@@ -0,0 +1,33 @@
# Build Event Log (BEL)
Purpose: Store build events and define views summarizing databuild application state, like partition catalog, build
status summary, job run statistics, etc.
## Architecture
- Uses [event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) /
[CQRS](https://www.wikipedia.org/wiki/cqrs) philosophy.
- BEL uses only two types of tables (see the sketch after this list):
- The root event table, with event ID, timestamp, message, event type, and ID fields for related event types.
- Type-specific event tables (e.g. task event, partition event, build request event, etc).
- This makes it easy to support multiple backends (SQLite, Postgres, and Delta tables are supported initially).
- Exposes an access layer that mediates writes, and which exposes entity-specific repositories for reads.
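A minimal sketch of the two table shapes, written as row structs (field names are assumptions; the authoritative schemas live in [`databuild.proto`](../databuild/databuild.proto) and the backend-specific migrations):
```rust
/// Root event table: one row per event, with optional IDs pointing at the
/// type-specific tables. Field names are assumptions.
struct EventRow {
    event_id: u64,
    timestamp_ms: u64,
    event_type: String, // e.g. "task", "partition", "build_request"
    message: String,
    task_event_id: Option<u64>,
    partition_event_id: Option<u64>,
    build_request_event_id: Option<u64>,
}

/// One of the type-specific tables, e.g. partition events.
struct PartitionEventRow {
    partition_event_id: u64,
    partition_ref: String,
    status: String, // e.g. "building", "live", "tainted"
}
```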
## Correctness Strategy
- Access layer will evaluate events requested to be written, returning an error if the event is not a correct next
state based on the involved component's governing state diagram (see the sketch below).
- Events are versioned, with each version's schema stored in [`databuild.proto`](../databuild/databuild.proto).
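A sketch of the "correct next state" check the access layer could perform before appending an event (the states and transitions below are placeholders; the real ones come from the state diagrams in [core build](./core-build.md)):
```rust
// Placeholder states and transitions for illustration only.
#[derive(Clone, Copy, PartialEq)]
enum TaskState {
    Launched,
    Running,
    Succeeded,
    Failed,
}

fn is_valid_transition(current: TaskState, next: TaskState) -> bool {
    use TaskState::*;
    matches!(
        (current, next),
        (Launched, Running) | (Running, Succeeded) | (Running, Failed)
    )
}

fn append_task_event(current: TaskState, next: TaskState) -> Result<(), String> {
    if is_valid_transition(current, next) {
        Ok(()) // accepted: write the root event row plus the task event row
    } else {
        Err("invalid transition: event rejected, nothing is written".to_string())
    }
}
```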
## Write Interface
See [trait definition](../databuild/event_log/mod.rs).
## Read Repositories
There are repositories for the following entities:
- Builds
- Jobs
- Partitions
- Tasks
Generally the following verbs are available for each:
- Show
- List
- Cancel
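A sketch of what one such repository could look like (trait and type names are assumptions, not the actual read interfaces):
```rust
/// Assumed summary shape; the real views are aggregations over the BEL.
struct BuildSummary {
    build_request_id: String,
    status: String,
    requested_partitions: Vec<String>,
}

trait BuildRepository {
    /// Show a single build request by ID.
    fn show(&self, build_request_id: &str) -> Option<BuildSummary>;
    /// List recent build requests, newest first.
    fn list(&self, limit: usize) -> Vec<BuildSummary>;
    /// Cancel by appending a cancel event to the BEL (writes still go
    /// through the access layer).
    fn cancel(&self, build_request_id: &str) -> Result<(), String>;
}
```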

design/core-build.md (new file, 137 lines)

@@ -0,0 +1,137 @@
# Core Build
Purpose: Centralize the build logic and semantics in a performant, correct core.
## Architecture
- Jobs depend on input partitions and produce output partitions.
- Graphs compose jobs to fully plan and execute builds of requested partitions.
- Both jobs and graphs emit events via the [build event log](./build-event-log.md) to update build state.
- A common interface is implemented to execute job and graph build actions, which different clients rely on (e.g. CLI,
service, etc)
- Jobs and graphs use wrappers to implement configuration and [observability](./observability.md)
- Graph-based composition is the basis for databuild application [deployment](./deploy-strategies.md)
## Jobs
Jobs are the atomic unit of work in databuild.
- Job wrapper fulfills configuration, observability, and record keeping
### `job.config`
Purpose: Enable planning of execution graph. Executed in-process when possible for speed. For interface details, see
[`PartitionRef`](./glossary.md#partitionref) and [`JobConfig`](./glossary.md#jobconfig) in
[`databuild.proto`](../databuild/databuild.proto).
```rust
trait DataBuildJob {
fn config(outputs: Vec<PartitionRef>) -> JobConfig;
}
```
#### `job.config` State Diagrams
```mermaid
flowchart TD
begin((begin)) --> validate_args
emit_job_config_fail --> fail((fail))
validate_args -- fail --> emit_arg_validate_fail --> emit_job_config_fail
validate_args -- success --> emit_arg_validate_success --> run_config
run_config -- fail --> emit_config_fail --> emit_job_config_fail
run_config -- success --> emit_config_success ---> success((success))
```
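A sketch of how the job wrapper could walk this diagram, emitting an event at each edge (the callbacks and event strings below are simplified stand-ins):
```rust
// Sketch only: mirrors the state diagram above, one emitted event per edge.
fn run_config_wrapper(
    outputs: Vec<String>,       // requested output PartitionRefs
    emit: &mut dyn FnMut(&str), // writes a named event to the build event log
    validate_args: impl Fn(&[String]) -> Result<(), String>,
    run_config: impl Fn(&[String]) -> Result<String, String>, // returns a serialized JobConfig
) -> Result<String, String> {
    if let Err(e) = validate_args(&outputs) {
        emit("arg_validate_fail");
        emit("job_config_fail");
        return Err(e);
    }
    emit("arg_validate_success");
    match run_config(&outputs) {
        Ok(job_config) => {
            emit("config_success");
            Ok(job_config)
        }
        Err(e) => {
            emit("config_fail");
            emit("job_config_fail");
            Err(e)
        }
    }
}
```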
### `job.exec`
Purpose: Execute job in exec wrapper.
```rust
trait DataBuildJob {
fn exec(config: JobConfig) -> PartitionManifest;
}
```
#### `job.exec` State Diagram
```mermaid
flowchart TD
begin((begin)) --> validate_config
emit_job_exec_fail --> fail((fail))
validate_config -- fail --> emit_config_validate_fail --> emit_job_exec_fail
validate_config -- success --> emit_config_validate_success --> launch_task
launch_task -- fail --> emit_task_launch_fail --> emit_job_exec_fail
launch_task -- success --> emit_task_launch_success --> await_task
await_task -- waited N seconds --> emit_heartbeat --> await_task
await_task -- non-zero exit code --> emit_task_failed --> emit_job_exec_fail
await_task -- zero exit code --> emit_task_success --> calculate_metadata
calculate_metadata -- fail --> emit_metadata_calculation_fail --> emit_job_exec_fail
calculate_metadata -- success --> emit_metadata ---> success((success))
```
## Graphs
Graphs are the unit of composition. To `analyze` (plan) task graphs (see [`JobGraph`](./glossary.md#jobgraph)), they
iteratively walk back from the requested output partitions, invoking `job.config` until no unresolved partitions
remain. To `build` partitions, the graph runs `analyze` then iteratively executes the resulting task graph.
### `graph.analyze`
Purpose: produce a complete task graph to materialize a requested set of partitions.
```rust
trait DataBuildGraph {
fn analyze(outputs: Vec<PartitionRef>) -> JobGraph;
}
```
#### `graph.analyze` State Diagram
```mermaid
flowchart TD
begin((begin)) --> initialize_missing_partitions --> dispatch_missing_partitions
emit_graph_analyze_fail --> fail((fail))
dispatch_missing_partitions -- fail --> emit_partition_dispatch_fail --> emit_graph_analyze_fail
dispatch_missing_partitions -- success --> cycle_detected?
cycle_detected? -- yes --> emit_cycle_detected --> emit_graph_analyze_fail
cycle_detected? -- no --> remaining_missing_partitions?
remaining_missing_partitions? -- yes --> dispatch_missing_partitions
remaining_missing_partitions? -- no --> emit_job_graph --> success((success))
```
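A rough sketch of the iterative backward walk (types are simplified stand-ins; `plan_missing` stands in for the graph's `lookup` mechanism plus `job.config` dispatch):
```rust
use std::collections::HashSet;

// Simplified stand-in for a planned job run in the task graph.
struct TaskNode {
    outputs: Vec<String>, // PartitionRefs this task will produce
    inputs: Vec<String>,  // PartitionRefs it depends on
}

fn analyze(
    requested: Vec<String>,
    is_live: impl Fn(&str) -> bool,
    plan_missing: impl Fn(&[String]) -> Vec<TaskNode>,
) -> Vec<TaskNode> {
    let mut graph = Vec::new();
    let mut planned: HashSet<String> = HashSet::new();
    let mut missing: Vec<String> = requested.into_iter().filter(|p| !is_live(p)).collect();

    while !missing.is_empty() {
        let tasks = plan_missing(&missing);
        missing.clear();
        for task in tasks {
            for out in &task.outputs {
                planned.insert(out.clone());
            }
            for input in &task.inputs {
                // Walk back: any input that is neither live nor already planned
                // becomes a missing partition for the next iteration.
                if !is_live(input) && !planned.contains(input) {
                    missing.push(input.clone());
                }
            }
            graph.push(task);
        }
    }
    graph
    // The real implementation also detects cycles and non-unique
    // job -> partition mappings, per the state diagram above.
}
```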
### `graph.build`
Purpose: analyze, then execute the resulting task graph.
```rust
trait DataBuildGraph {
fn build(outputs: Vec<PartitionRef>);
}
```
#### `graph.build` State Diagram
```mermaid
flowchart TD
begin((begin)) --> graph_analyze
emit_graph_build_fail --> fail((fail))
graph_analyze -- fail --> emit_graph_build_fail
graph_analyze -- success --> initialize_ready_jobs --> remaining_ready_jobs?
remaining_ready_jobs? -- yes --> emit_remaining_jobs --> schedule_jobs
remaining_ready_jobs? -- none schedulable --> emit_jobs_unschedulable --> emit_graph_build_fail
schedule_jobs -- fail --> emit_job_schedule_fail --> emit_graph_build_fail
schedule_jobs -- success --> emit_job_schedule_success --> await_jobs
await_jobs -- job_failure --> emit_job_failure --> emit_job_cancels --> cancel_running_jobs
cancel_running_jobs --> emit_graph_build_fail
await_jobs -- N seconds since heartbeat --> emit_heartbeat --> await_jobs
await_jobs -- job_success --> remaining_ready_jobs?
remaining_ready_jobs? -- no ---------> emit_graph_build_success --> success((success))
```
## Correctness Strategy
- Core component interfaces are described in [`databuild.proto`](../databuild/databuild.proto), a protobuf interface
shared by all core components and all [GSLs](./graph-specification.md).
- [GSLs](./graph-specification.md) implement ergonomic graph, job, and partition helpers that make coupling explicit
- Graphs automatically detect and raise on non-unique job -> partition mappings
- Graph and job processes are fully described by state diagrams, whose state transitions are logged to the
[build event log](./build-event-log.md).
## Partition Delegation
- Sometimes a partition already exists, or another build request is already planning on producing a partition
- A later build request will delegate to the already existing build request for said partition
- The later build request will write an event to the [build event log](./build-event-log.md) referencing the ID
of the delegate, allowing traceability and visualization (see the sketch below)
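A sketch of the delegation decision per requested partition (names are assumptions):
```rust
// Sketch: a later build request delegates to an existing one instead of rebuilding.
enum PartitionPlan {
    Build,                                  // nothing exists: schedule a task
    Delegate { delegate_build_id: String }, // reference the earlier build in the BEL
    AlreadyLive,                            // partition is already live
}

fn plan_partition(
    partition_ref: &str,
    is_live: impl Fn(&str) -> bool,
    in_flight_build: impl Fn(&str) -> Option<String>, // build ID already producing it
) -> PartitionPlan {
    if is_live(partition_ref) {
        PartitionPlan::AlreadyLive
    } else if let Some(build_id) = in_flight_build(partition_ref) {
        PartitionPlan::Delegate { delegate_build_id: build_id }
    } else {
        PartitionPlan::Build
    }
}
```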
## Heartbeats / Health Checks
- Which strategy do we use?
- If we are launching tasks to a place we can't health check, how could they heartbeat?

@@ -0,0 +1,11 @@
# Deploy Strategies
- Purpose
- Trivial deployment and updates for databuild applications are key, allowing for shipping quickly via continuous delivery
- Build continuity across deploys
- Strategies
- Binary deployment
- Docker deployment
- K8s deployment
- Workloads on cloud run, k8s job submission, etc?

design/glossary.md (new file, 34 lines)

@@ -0,0 +1,34 @@
# `Job`
Atomic unit of work, producing and consuming specific partitions. See [jobs](./core-build.md#jobs).
# `Graph`
Composes [jobs](#job) to build partitions. See [graphs](./core-build.md#graphs).
# `Partition`
Partitions are atomic units of data, produced and depended on by jobs. A job can produce multiple partitions, but
multiple jobs cannot produce the same partition - i.e. job -> partition relationships must be unique/canonical.
# `PartitionRef`
PartitionRefs are strings that uniquely identify partitions. They can contain anything, but generally they are S3
URIs, like `s3://companybkt/datasets/foo/date=2025-01-01`, or custom formats like
`dal://prod/clicks/region=4/date=2025-01-01/`. PartitionRefs are used as dependency signals during
[task graph analysis](./core-build.md#graphanalyze). To enable explicit coupling and ergonomics, there are generally
helper classes for creating, parsing, and accessing fields for PartitionRefs in [GSLs](#graph-specification-language-gsl).
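For example, a helper for the date-partitioned review refs used in the [graph specification examples](./graph-specification.md) might look roughly like this in Rust (assumes the `regex` crate; the real helpers will vary by GSL):
```rust
// Sketch of a PartitionRef helper; the real GSL helpers are per-language.
use regex::Regex;

struct ExtractedReviews {
    date: String,
}

impl ExtractedReviews {
    /// Parse a PartitionRef string into its typed fields, if it matches.
    fn parse(partition_ref: &str) -> Option<Self> {
        let pattern = Regex::new(r"reviews/date=(?P<date>\d{4}-\d{2}-\d{2})").ok()?;
        let caps = pattern.captures(partition_ref)?;
        Some(Self { date: caps.name("date")?.as_str().to_string() })
    }

    /// Render the typed fields back into a PartitionRef string.
    fn to_ref(&self) -> String {
        format!("reviews/date={}", self.date)
    }
}
```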
# `PartitionPattern`
Patterns that group partitions (e.g. a dataset) and allow for validation (e.g. does this job actually produce the
expected output partition?)
# `JobConfig`
The complete configuration of a job needed to produce the desired partitions, as calculated by
[`job.config`](./core-build.md#jobconfig)
# `JobGraph`
A complete graph of job configs, with [`PartitionRef`](#partitionref) dependency edges, which when executed will
produce the requested partitions.
# Graph Specification Language (GSL)
Language-specific libraries that make implementing databuild graphs and jobs more succinct and ergonomic.
See [graph specification](./graph-specification.md).

@@ -0,0 +1,219 @@
# App Specification
AKA the different ways databuild applications can be described.
## Correctness Strategy
- Examples implemented that use each graph specification strategy, and are tested in CI/CD.
- Graph specification strategies provide
## Bazel
- Purpose: compilation/build target that fulfills promise of project (like bytecode for JVM langs)
- Job binaries (config and exec)
- Graph lookup binary (lookup)
- Job target (config and exec)
- Graph target (build and analyze)
- See [core build](./core-build.md) for details
## Python
- Wrapper functions enable graph registry
- Partition object increases ergonomics and enables explicit data coupling
```python
from dataclasses import dataclass
from databuild import (
DataBuildGraph, DataBuildJob, Partition, JobConfig, PyJobConfig, BazelJobConfig, PartitionManifest, Want
)
from helpers import ingest_reviews, categorize_reviews, sla_failure_notify
from datetime import datetime, timedelta
graph = DataBuildGraph("//:podcast_reviews_graph")
ALL_CATEGORIES = {"comedy", ...}
# Partition definitions, used by the graph to resolve jobs by introspecting their config signatures
ExtractedReviews = Partition[r"reviews/date=(?P<date>\d{4}-\d{2}-\d{2})"]
CategorizedReviews = Partition[r"categorized_reviews/category=(?P<category>[^/]+)/date=(?P<date>\d{4}-\d{2}-\d{2})"]
PhraseModel = Partition[r"phrase_models/category=(?P<category>[^/]+)/date=(?P<date>\d{4}-\d{2}-\d{2})"]
PhraseStats = Partition[r"phrase_stats/category=(?P<category>[^/]+)/date=(?P<date>\d{4}-\d{2}-\d{2})"]
@graph.job
class ExtractReviews(DataBuildJob):
def config(self, outputs: list[ExtractedReviews]) -> list[JobConfig]:
# One job run can output multiple partitions
args = [p.date for p in outputs]
return [JobConfig(outputs=outputs, inputs=[], args=args,)]
def exec(self, config: JobConfig) -> PartitionManifest:
for (date, output) in zip(config.args, config.outputs):
ingest_reviews(date).write(output)
# Start and end time inferred by wrapper (but could be overridden)
return config.partitionManifest(job=self)
@dataclass
class CategorizeReviewsArgs:
date: str
category: str
@graph.job
class CategorizeReviews(DataBuildJob):
def config(self, outputs: list[CategorizedReviews]) -> list[JobConfig]:
# This job only outputs one partition per run
return [
# The PyJobConfig allows you to pass objects in config, rather than just `args` and `env`
PyJobConfig[CategorizeReviewsArgs](
outputs=[p],
inputs=ExtractedReviews.dep.materialize(date=p.date),
params=CategorizeReviewsArgs(date=p.date, category=p.category),
)
for p in outputs
]
def exec(self, config: PyJobConfig[CategorizeReviewsArgs]) -> None:
categorize_reviews(config.params.date, config.params.category)
# Partition manifest automatically constructed from config
@graph.job
class PhraseModeling(DataBuildJob):
def config(self, outputs: list[PhraseModel]) -> list[JobConfig]:
# This job relies on a bazel executable target to run the actual job
return [
BazelJobConfig(
outputs=[p],
inputs=[CategorizedReviews.dep.materialize(date=p.date, category=p.category)],
exec_target="//jobs:phrase_modeling",
env={"CATEGORY": p.category, "DATA_DATE": p.date},
)
for p in outputs
]
# This job is fully defined in bazel
graph.bazel_job(target="//jobs:phrase_stats_job", outputs=list[PhraseStats])
@graph.want(cron='0 0 * * *')
def phrase_stats_want() -> list[Want[PhraseStats]]:
# Creates a new want every midnight that times out in 3 days
wanted = [PhraseStats(date=datetime.now().date().isoformat(), category=cat) for cat in ALL_CATEGORIES]
on_fail = lambda p: f"Failed to calculate partition `{p}`"
return [graph.want(partitions=wanted, ttl=timedelta(days=3), on_fail=on_fail)]
```
- TODO - do we need an escape hatch for "after 2025 use this job, before use that job" functionality?
## Rust?
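Nothing is designed here yet. Purely as a thought experiment, a self-contained sketch of what one job might look like, with all types stubbed inline (there is no real `databuild` Rust crate to import from):
```rust
// Thought-experiment only: stand-in types defined inline, not a real databuild crate.
struct PartitionRef(String);

struct JobConfig {
    outputs: Vec<PartitionRef>,
    inputs: Vec<PartitionRef>,
    args: Vec<String>,
}

trait DataBuildJob {
    fn config(&self, outputs: Vec<PartitionRef>) -> Vec<JobConfig>;
}

struct ExtractReviews;

impl DataBuildJob for ExtractReviews {
    fn config(&self, outputs: Vec<PartitionRef>) -> Vec<JobConfig> {
        // Mirror of the Python example: one run can emit several partitions,
        // with the dates passed through as args.
        let args = outputs
            .iter()
            .map(|p| p.0.rsplit("date=").next().unwrap_or_default().to_string())
            .collect();
        vec![JobConfig { outputs, inputs: Vec::new(), args }]
    }
}
```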
## Scala?
```scala
import databuild._
import scala.concurrent.duration._
import java.time.LocalDate
object PodcastReviewsGraph extends DataBuildGraph("//:podcast_reviews_graph") {
val AllCategories = Set("comedy", ???)
case class DatePartition(date: String)
case class CategoryDatePartition(category: String, date: String)
// Partition definitions using extractors
object ExtractedReviews extends Partition[DatePartition](
"""reviews/date=(?P<date>\d{4}-\d{2}-\d{2})""".r
)
object CategorizedReviews extends Partition[CategoryDatePartition](
"""categorized_reviews/category=(?P<category>[^/]+)/date=(?P<date>\d{4}-\d{2}-\d{2})""".r
)
object PhraseModel extends Partition[CategoryDatePartition](
"""phrase_models/category=(?P<category>[^/]+)/date=(?P<date>\d{4}-\d{2}-\d{2})""".r
)
object PhraseStats extends Partition[CategoryDatePartition](
"""phrase_stats/category=(?P<category>[^/]+)/date=(?P<date>\d{4}-\d{2}-\d{2})""".r
)
// Job definitions
@job
object ExtractReviewsJob extends DataBuildJob[ExtractedReviews] {
def config(outputs: List[ExtractedReviews]): List[JobConfig] = {
val args = outputs.map(_.date)
List(JobConfig(
outputs = outputs,
inputs = Nil,
args = args
))
}
def exec(config: JobConfig): PartitionManifest = {
config.args.zip(config.outputs).foreach { case (date, output) =>
ingestReviews(date).writeTo(output)
}
config.toPartitionManifest(this)
}
}
@job
object CategorizeReviewsJob extends DataBuildJob[CategorizedReviews] {
case class Args(date: String, category: String)
def config(outputs: List[CategorizedReviews]): List[JobConfig] = {
outputs.map { p =>
ScalaJobConfig[Args](
outputs = List(p),
inputs = ExtractedReviews.dep.materialize(date = p.date),
params = Args(p.date, p.category)
)
}
}
def exec(config: ScalaJobConfig[Args]): Unit = {
categorizeReviews(config.params.date, config.params.category)
// Partition manifest auto-constructed
}
}
@job
object PhraseModelingJob extends DataBuildJob[PhraseModel] {
def config(outputs: List[PhraseModel]): List[JobConfig] = {
outputs.map { p =>
BazelJobConfig(
outputs = List(p),
inputs = List(CategorizedReviews.dep.materialize(
category = p.category,
date = p.date
)),
execTarget = "//jobs:phrase_modeling",
env = Map("CATEGORY" -> p.category, "DATA_DATE" -> p.date)
)
}
}
}
// External bazel job
bazelJob("//jobs:phrase_stats_job", outputType = classOf[PhraseStats])
// Want definition
@want(cron = "0 0 * * *")
def phraseStatsWant(): List[Want[PhraseStats]] = {
val today = LocalDate.now().toString
val wanted = AllCategories.map(cat => PhraseStats(cat, today)).toList
List(want(
partitions = wanted,
ttl = 3.days,
onFail = p => s"Failed to calculate partition `$p`"
))
}
}
```

design/observability.md (new file, 19 lines)

@@ -0,0 +1,19 @@
# Observability
- Purpose
- To enable simple, comprehensive metrics and logging observability for databuild applications
- Wrappers as observability implementation
- Liveness guarantees are:
- Task process is still running
- Logs are being shipped
- Metrics are being gathered (graph scrapes worker metrics, re-exposes)
- Heartbeating
- Log shipping
- Metrics exposed
- Metrics
- Service
- Jobs
- Logging
- Service
- Jobs

design/service.md (new file, 123 lines)

@@ -0,0 +1,123 @@
# Service
Purpose: Enable centrally hostable and human-consumable interface for databuild applications.
## Correctness Strategy
- Rely on databuild.proto, call same shared code in core
- Fully asserted type safety from core to service to web app
- Core -- databuild.proto --> service -- openapi --> web app
- No magic strings (how? protobuf doesn't have consts. enums values? code gen over yaml?)
## API
The purpose of the API is to enable remote, programmatic interaction with databuild applications, and to host endpoints
needed by the [web app](#web-app).
See [OpenAPI spec](../bazel-bin/databuild/client/openapi.json) (may need to
`bazel build //databuild/client:extract_openapi_spec` if it's not found).
## Web App
The web app visualizes databuild application state via features like listing past builds, job statistics,
partition liveness, build request status, etc. This section specifies the hierarchy of functions of the web app. Pages
are described in visual order (generally top to bottom).
General requirements:
- Nav at top of page
- DataBuild logo in top left
- Navigation links at the top allowing navigation to each list page:
- Wants list page
- Jobs list page
- Build requests list page
- Triggers list page
- Build event log page
- Graph label at top right
- Search box for finding builds, jobs, and partitions (needs a new service API?)
### Home Page
Jumping off point to navigate and build.
- A text box, an "Analyze" button, and a "Build" button for doing exactly that (would be great to have autocomplete,
also PartitionRef patterns would help with ergonomics for less typing / more safety)
- List recent builds with their requested partitions and current status, with link to build request page
- List of recently attempted partitions, with status, link to partition page, and link to build request page
- List of jobs, with (colored) last week success ratio, and link to job page
### Build Request Page
- Show build request ID and overall status of build (colored) and "Cancel" button at top
- Progress bar indicating number of: needs-build partitions, building partitions, non-live delegated partitions, and
live partitions
- Summary information table
- Requested at
- analyze time (with datetime range)
- build time (with datetime range)
- number of tasks in each state (don't include states with 0 count)
- number of partitions in each state (don't include states with 0 count)
- Show graph diagram of job graph (collapsible)
- With each job and partition status color coded & linked to related run / partition
- [paginated](#build-event-log-pagination) list of related build events at bottom
### Job Status Page
- Job label
- "Recent Runs" select, controlling page size
- "Recent Runs Page" select - the `< 1 2 3 ... N >` style paginator
- Job success rate (for all selected; colored)
- Bar graph showing job execution run times for last N (selectable between 31, 100, 365)
- Recent task runs
- With links to build request, task run, partition
- With task result
- With run time
- With expandable partition metadata
- [paginated](#build-event-log-pagination) list of related build events at bottom
### Task Run Page
- With job label, task status, and "Cancel" button at top
- Summary information table
- task run ID
- output/input partitions
- task start and end time
- task duration
- Graph similar to [build request page](#build-request-page), all partitions and jobs not involved in this task made
translucent (expandable)
- With [paginated](#build-event-log-pagination) table of build events at bottom
### Partition Status Page
- With PartitionRef, link to matching [PartitionPattern](#partitionpattern-page), color-coded status, and "build" button at top
- List of tasks that produced this partition
- [paginated](#build-event-log-pagination) list of related build events at bottom
### PartitionPattern Page
- Paginated table of partitions that match this partition pattern, sortable by cols, including:
- Partition ref (with link)
- Partition pattern values
- Partition status
- Build request link
- Task link (with run time next to it)
### Triggers List Page
- Paginated list of registered triggers
- With link to trigger detail page
- With expandable list of produced build requests or wants
### Trigger Detail Page
- Trigger name, last run at, and "Trigger" button at top
- Trigger history table, including:
- Trigger time
- Trigger result (successful/failed)
- Partitions or wants requested
### Wants List Page
### Want Detail Page
### Build Event Log Page
I dunno, some people want to look at the raw thing.
- A [paginated](#build-event-log-pagination) list of build event log entries
### Build Event Log Pagination
This element is present on most pages, and should be reusable/pluggable for a given set of events/filters.
- Table with headers of significant fields, sorted by timestamp by default
- With timestamp, event ID, and message field
- With color coded event type
- With links to build requests, jobs, and partitions where IDs are present
- With expandable details that show the preformatted JSON event contents
- With the `< 1 2 3 ... N >` style paginator
- Page size of 100

design/triggers.md (new file, 56 lines)

@@ -0,0 +1,56 @@
# Triggers
Purpose: to enable simple but powerful declarative specification of what data should be built.
## Correctness Strategy
- Wants + TTLs
- ...?
## Wants
Wants cause graphs to try to build the wanted partitions until a) the partitions are live or b) the TTL runs out. Wants
can trigger a callback on TTL expiry, enabling SLA-like behavior. Wants are recorded in the [BEL](./build-event-log.md),
so they can be queried and viewed in the web app, linking to build requests triggered by a given want, enabling
answering of the "why doesn't this partition exist yet?" question.
### Unwants
You can also unwant partitions, which overrides all wants of those partitions prior to the unwant timestamp. This is
primarily to enable the "data source is now disabled" style feature practically necessary in many data platforms.
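A sketch of how unwants could be applied when aggregating wants from the BEL (struct shapes are assumptions):
```rust
// Sketch: a want stays active only if no later unwant covers its partitions.
struct Want {
    partitions: Vec<String>,
    created_at_ms: u64,
}

struct Unwant {
    partitions: Vec<String>,
    created_at_ms: u64,
}

fn active_wants(wants: Vec<Want>, unwants: &[Unwant]) -> Vec<Want> {
    wants
        .into_iter()
        .filter(|want| {
            // An unwant overrides wants created before it that target the
            // same partitions; later wants are unaffected.
            !unwants.iter().any(|u| {
                u.created_at_ms >= want.created_at_ms
                    && want.partitions.iter().all(|p| u.partitions.contains(p))
            })
        })
        .collect()
}
```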
### Virtual Partitions & External Data
Essentially all data teams consume some external data source, and late arriving data is the rule more than the
exception. Virtual partitions are a way to model external data that is not produced by a graph. For all intents and
purposes, these are standard partitions; the only difference is that the job that "produces" them doesn't actually
do any ETL - it just assesses external data sufficiency and emits a "partition live" event when it's ready to be consumed.
## Triggers
## Taints
- Mechanism for invalidating existing partitions (e.g. we know bad data went into this, need to stop consumers from
using it)
---
- Purpose
- Every useful data application has triggering to ensure data is built on schedule
- Philosophy
- Opinionated strategy plus escape hatches
- Taints
- Two strategies
- Basic: cron triggered scripts that return partitions
- Bazel: target with `cron`, `executable` fields, optional `partition_patterns` field to constrain
- Declarative: want-based, wants cause build requests to be continually retried until the wanted
partitions are live, or running a `want_failed` script if it times out (e.g. SLA breach)
- +want and -want
- +want declares want for 1+ partitions with a timeout, recorded to the [build event log](./build-event-log.md)
- -want invalidates all past wants of specified partitions (but not future; doesn't impact non-specified
partitions)
- Their primary purpose is to prevent an SLA breach alarm when a datasource is disabled, etc.
- Need graph preconditions? And concept of external/virtual partitions or readiness probes?
- Virtual partitions: allow graphs to say "precondition failed"; can be created in BEL, created via want or
cron trigger? (e.g. want strategy continually tries to resolve the external data, creating a virtual
partition once it can find it; cron just runs the script when it's triggered)
- Readiness probes don't fit the paradigm, feel too imperative.

@@ -4,6 +4,7 @@
- Status indicator for page selection
- On build request detail page, show aggregated job results
- Use path-based navigation instead of hashbang?
- Add build request notes
- How do we encode job labels in the path? (Build event job links are not encoding job labels properly)
- Resolve double type system with protobuf and openapi
- Prometheus metrics export