databuild/docs/narrative/why-not-push.md


Or "why pull"?

  • Initially, when writing a single DAG or pipeline, the "pull" vs. "push" distinction doesn't matter. You just run the DAG that produces the data periodically via some automated trigger.
  • As separate jobs get added to the workload, in many cases you can just add them to the same DAG. Without complex dependency relationships, you can keep composing different workloads this way.
  • This simplicity is great: it makes it easy to understand system state even in the case of failures, and to develop incrementally and retry granularly with tools like Airflow, Dagster, and Prefect.
  • "push" semantics arise from building upon the simple solution here: data build decisions push across data dep relationships, going from "input" data to "output" data. Said another way, data gets built because it can be built and its inputs are available.
  • This breaks down for partitioned datasets, where dependencies are more complicated than 1:1 - e.g. if you need to aggregate revenue from events coming from different platforms and customers
    • This happens when other people deliver batched data to you, particularly when you do business on platforms that handle multiple customers' business for you
  • Under push, even with extensive logging, propagating fixes and new features requires backfilling everything downstream of the version-bumped or fixed dataset.
  • Under push, that orchestration code is distributed across teams and DAGs. It's tribal knowledge. Under pull, the dependency relationships are explicit and queryable.
  • The systemd analogy:
    • SysV init and init.d scripts (1980s-2000s): Services were started by numbered bash scripts (/etc/init.d/S10network, /etc/rc3.d/S20postgres) that explicitly called other scripts in sequence. Each script contained imperative startup logic and had to manually handle dependencies - if your app needed postgres, your init script had to know to start postgres first, wait for it, then start your app. Changing the dependency graph meant editing scripts across multiple services. Debugging startup failures required tracing through bash scripts to figure out which service failed to start its dependencies. The numbering system (S10, S20, S30) was a crude way to enforce ordering, but became unmaintainable as systems grew complex.
    • The key problem: Orchestration logic was distributed across individual service scripts (push-based), requiring global knowledge to modify, rather than being centralized with declarative per-service dependencies (pull-based).
  • In pull, data is built because something wants it to exist. No complex logic is needed to reason about what needs to run before a given workload can run. The system does it all for you based on the graph structure specified by your application/codebase.
  • Cool thing you can do: estimate the cost of fixing an issue or backfilling data, because the downstream work that needs to be done is derivable from the graph
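To make the pull semantics above concrete, here is a minimal resolver sketch in Python. The partition names, graph shape, and `plan` function are invented for illustration under the assumption that dependencies are declared as an output-to-inputs mapping; databuild's actual API may differ.

```python
# Hypothetical sketch of pull-based resolution: a partition is built
# because something downstream wants it, not because its inputs arrived.

# Illustrative dep graph: a daily revenue aggregate over per-platform event
# partitions (the non-1:1 case from the text). All names are made up.
deps = {
    "revenue/2024-01-01": ["events/ios/2024-01-01", "events/android/2024-01-01"],
    "events/ios/2024-01-01": [],
    "events/android/2024-01-01": [],
}

def plan(target, deps, built):
    """Walk back from the wanted output to its inputs, returning the
    missing partitions in a valid build order."""
    order = []
    def visit(node):
        if node in built or node in order:
            return
        for dep in deps.get(node, []):
            visit(dep)
        order.append(node)
    visit(target)
    return order

# Asking for the output pulls in only what is missing upstream of it.
print(plan("revenue/2024-01-01", deps, built={"events/ios/2024-01-01"}))
# → ['events/android/2024-01-01', 'revenue/2024-01-01']
```

The orchestration logic lives entirely in the graph walk; no individual job needs to know what to start before it, which is the point of the systemd comparison.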