databuild/plans/roadmap.md
2025-07-07 19:20:45 -07:00

# Roadmap

Please read the core concepts and manifesto to understand the project's motivation. This roadmap describes the phases of execution and the high-level composition of the system and its concepts.

```mermaid
flowchart
  core -- emits --> partition_event_log
  partition_event_log -- read by --> build_graph_service
  build_graph_service -- invokes --> core
  build_graph_service -- informs --> build_graph_dashboard
```

## Stages

### Foundation: data build graph definition

Status: Done

This phase establishes the core capability: describing a flexible, declarative, partition-aware build system. The graph is the foundation to which other concepts attach, e.g. making the graph deployable for remote builds.

### Build Event Log

Design Doc

Status: Done

This phase establishes the build event log, which enables tracking of partition status, coordination of build requests (e.g. avoiding duplicate work and contention), and eventual visualization of build requests and partition liveness/staleness. It consists of a schema plus an access layer that allows different system components to read and write the log.
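To make the schema-plus-access-layer split concrete, here is a minimal sketch of what a partition build event log might look like, using SQLite. The table, columns, and event types here are hypothetical illustrations, not the actual databuild schema:

```python
import sqlite3

# Hypothetical event log schema: an append-only table of build events.
# Names and event types are illustrative, not the real databuild schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS build_events (
    event_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    build_request INTEGER NOT NULL,   -- groups events into a build request
    partition_ref TEXT    NOT NULL,   -- e.g. 'reviews/date=2025-07-07'
    event_type    TEXT    NOT NULL,   -- e.g. 'requested', 'started', 'built', 'failed'
    created_at    TEXT    NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_events_partition
    ON build_events (partition_ref, event_id);
"""

def append_event(conn, build_request, partition_ref, event_type):
    """Access-layer write path: append-only, never rewrites past events."""
    conn.execute(
        "INSERT INTO build_events (build_request, partition_ref, event_type) "
        "VALUES (?, ?, ?)",
        (build_request, partition_ref, event_type),
    )

def latest_status(conn, partition_ref):
    """Access-layer read path: a partition's status is its most recent event."""
    row = conn.execute(
        "SELECT event_type FROM build_events "
        "WHERE partition_ref = ? ORDER BY event_id DESC LIMIT 1",
        (partition_ref,),
    ).fetchone()
    return row[0] if row else None
```

The append-only shape is what makes both coordination (who is building what right now) and later visualization (replaying history) possible from the same store.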

### Build Graph Service

Design Doc

Status: Done

Together with the Build Event Log, this enables deployment of a persistent build service that builds data on request without rebuilding existing non-stale partitions. It also serves build request status and progress, and surfaces partition liveness/staleness endpoints. Key questions and operations it serves:

  - What build requests have there been?
  - What is the status of a specific build request?
  - What would the build graph look like to materialize a specific partition?
  - What build events have happened in this time frame?
  - Is this partition live and not stale?
  - What build events are relevant/related to this partition? (e.g. why doesn't this partition exist yet?)
  - Build this partition, returning a build request ID.
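The duplicate-work avoidance this service provides could be sketched roughly as follows. The event shapes and status names here are assumptions for illustration: if a partition already has an in-flight build request, the service delegates to it rather than starting a new one.

```python
# Hypothetical build-request deduplication over an event log.
# Event and status names are illustrative, not the real databuild API.
ACTIVE = {"requested", "started"}      # request is still in flight
TERMINAL = {"built", "failed"}         # request has finished

def request_build(events, partition, next_request_id):
    """events: list of (request_id, partition_ref, event_type), oldest first.

    Returns (request_id, delegated). delegated=True means an existing
    in-flight build request already covers this partition.
    """
    # Latest event per request that touched this partition.
    status_by_request = {}
    for req, part, etype in events:
        if part == partition:
            status_by_request[req] = etype
    # Delegate to any request that is still active on this partition.
    for req, etype in status_by_request.items():
        if etype in ACTIVE:
            return req, True
    # No active builder: start a fresh request.
    return next_request_id, False
```

The same scan also answers the "is this partition live?" question: a partition whose latest event is in a terminal success state is live, which is why the service and the log are designed together.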

### End-to-End Tests (Phase 1)

Design Doc

Status: Planning

Uses the basic graph and podcast reviews examples to implement end-to-end testing of databuild's capabilities.

  - Build the same partitions via the CLI and the service, and verify that both emit the same events, including the expected events in each
  - The two runs should have separate log databases
  - Should be implemented as an `sh_test` or similar, so that `bazel test //...` at each workspace root triggers it
  - Is there any risk of Bazel inefficiency here / slow tests? How would we mitigate it?
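The event-comparison step of such a test might look like the sketch below. The event field names (`partition`, `event_type`, `ts`) are assumptions: the point is to normalize away run-specific fields (IDs, timestamps) before asserting the two paths emitted equivalent streams.

```python
# Sketch of comparing CLI-emitted and service-emitted build events.
# Field names are hypothetical, not the real databuild event schema.
def normalize(events):
    """Keep only the run-independent shape: (partition, event_type) in order."""
    return [(e["partition"], e["event_type"]) for e in events]

def assert_equivalent(cli_events, service_events):
    """Fail if the CLI and service builds diverged in emitted events."""
    assert normalize(cli_events) == normalize(service_events), (
        "CLI and service builds produced different event sequences"
    )
```

Wrapping this in an `sh_test` would mean each harness dumps its event log to a file and a small driver runs the comparison.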

### Build Graph Dashboard

Design Doc

Status: Not Started

A UI that relies on the Build Graph Service, showing things like build activity and partition liveness information. There are a few key pages:

  - Partition build request status page: shows the status of all work involved in building a partition, including upstream partition build actions and delegated build requests (active and handled by another build request). Engineers watch this page to see what's happening; it tails the build event log.
  - Partition status page: is the partition live? Stale? What past builds produced it (with links)? Also includes a button for building the partition, with a force option if it already exists and is non-stale.
  - Job list page: lists all jobs included in the graph, along with aggregate success metrics and timing information.
  - Job history page: for a given job, lists recent job runs with their success and timing information, along with any interesting metadata.
  - Job run page: all the execution information for a specific job run, including env vars, parameters, result, logs, etc.
  - Analyze page: runs the graph analyze verb, returning the plan that would produce the requested partitions.
  - Raw SQL page: enables debugging by allowing submission of SQL queries to be executed against the build event log and its views.

## Risks

### Over-Engineering / Scoping

The goal of this project is to produce a powerful, inspiring view of how declarative data builds can work, not to produce a production system. We take extra steps to achieve very high leverage and differentiated capabilities, but not to build table-stakes or obvious features that don't serve that goal.

### Complexity

This project already has a lot of irreducible complexity, and adding optional complexity is a likely failure mode.

## Questions

### Should graphs be services?

A tempting way to organize different graphs is to make them literal services and represent cross-graph dependency builds as requests to upstream graph services. Graphs-as-services is attractive because service boundaries generally match org boundaries, and this pattern fits that reality. It also means we would be creating a distributed system, though perhaps that's the implicit choice in using more than one graph anyway?

### Do we need first-class trigger concepts?

In theory, every trigger is just a simple script, run by cron or on request, that maps input data to a set of desired partitions, likely with an intermediate step that consults a catalog to source candidate partition column values. This is not very inspiring: it doesn't sound differentiating in value, and in theory we should punt on, or simply not implement, low-marginal-value features. Is this truly valuable? To be justified, it would need to bring a new level of convenience, simplicity, or ease of deployment, or a new capability based on "expected partitions". For example, we might be able to predict when partitions will land next, or what they might do in the future, which could be useful operationally. But those hypotheticals may also be best left to extensions or some "plugin" concept.
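A minimal sketch of such a trigger, assuming a date-partitioned dataset and a catalog lookup reduced to a set of existing partition refs (all names here are hypothetical):

```python
from datetime import date, timedelta

# Hypothetical cron-style trigger: map "today" to the set of desired
# partitions, backfilling recent dates missing from the catalog.
def desired_partitions(today, existing, lookback_days=3):
    """existing: set of partition refs already known to the catalog."""
    wanted = []
    for i in range(lookback_days):
        d = today - timedelta(days=i)
        ref = f"reviews/date={d.isoformat()}"  # illustrative partition naming
        if ref not in existing:
            wanted.append(ref)
    return wanted
```

Everything here is ordinary glue code, which is the point of the question: a first-class trigger concept would have to earn its place over a script like this.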

### Will we need a dataset concept?

Theoretically, DataBuild doesn't need the dataset concept to fully resolve build graphs and produce desired partitions. Practically, partitions will be instances of different classes, and humans will use those classes as organizing concepts, e.g. when asking about recent partition builds of a given kind. To what extent do we need to implement a dataset concept? We could implement datasets as views, e.g. allowing the specification of partition patterns, tagging, etc.
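If datasets were implemented as views, membership could reduce to pattern matching over partition references, roughly as in this hypothetical sketch:

```python
import re

# Hypothetical "dataset as view": a dataset is just a named pattern over
# partition refs, so membership is pattern matching rather than a new
# first-class concept in graph resolution.
def dataset_members(pattern, partitions):
    """Return the partitions whose refs fully match the dataset pattern."""
    rx = re.compile(pattern)
    return [p for p in partitions if rx.fullmatch(p)]
```

Tagging could layer on the same way: a tag is another predicate over partition refs, with no change to how the build graph itself resolves.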