This commit is contained in:
Stuart Axelbrooke 2025-05-25 15:21:34 -07:00
parent 0180794560
commit 0c38569e71

## Entity Relationships
Note: Data services are missing below. They could act as "libraries" for data, filling an organizational role for data dependencies at a larger scale.
# Appendix
## Why not data build services?
Currently, DataBuild materializes all data by spawning a new process that runs a described program to completion. Most programs take at least a couple of seconds to start up, and some are much worse (e.g., Python loading particularly heavy runtime dependencies). It's easy to imagine DataBuild instead interacting with data build app servers that are already live, via RPC calls.
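The current process-per-job model could be sketched roughly as below. All names here are hypothetical illustrations, not DataBuild's actual API; the point is that every job pays full runtime startup cost.

```python
import subprocess
import sys
import time

def run_build_job(job_args: list[str]) -> int:
    """Hypothetical sketch of the process-per-job model: spawn a fresh
    process for each job, wait for it to complete, return its exit code."""
    start = time.monotonic()
    # A fresh interpreter is started for every job; `-c pass` stands in
    # for a real build program but still pays the startup overhead.
    result = subprocess.run([sys.executable, "-c", "pass"] + job_args)
    elapsed = time.monotonic() - start
    print(f"job finished in {elapsed:.3f}s (includes startup overhead)")
    return result.returncode
```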
Adopting a model where DataBuild interacts with long-lived data build app servers via RPC calls, instead of spinning up new processes for each job, presents several trade-offs:
**Pros:**
* **Reduced Startup Latency:** Eliminates process startup overhead (e.g., Python interpreter initialization, dependency loading), leading to faster execution, especially for numerous small, quick jobs.
* **Resource Pooling & Efficiency:** Servers can maintain warm caches (e.g., for frequently accessed metadata or small lookup tables) and potentially manage resource pools (like database connections) more efficiently than individual transient processes.
* **Stateful Capabilities:** Long-running services can maintain state across multiple RPC calls, which could be beneficial for more complex, interactive, or iterative data build scenarios, though this also introduces complexity.
* **Optimized Data Transfer:** For certain types of interactions, RPCs might allow for more optimized data transfer protocols compared to passing data via files or standard streams, especially if data locality can be leveraged.
**Cons:**
* **Increased Operational Complexity:** Managing, scaling, monitoring, and ensuring the high availability of these app servers adds significant operational burden compared to stateless, ephemeral jobs.
* **State Management Challenges:** While potentially a pro, statefulness in servers can lead to "sticky" bugs, memory leaks, and difficulties in ensuring idempotency and reproducibility if not carefully designed.
* **Deployment Complexity:** Rolling out updates or new versions of these data build services becomes more complex and requires careful strategies to avoid disrupting ongoing builds.
* **Resource Contention:** Multiple concurrent requests to a shared server can lead to resource contention (CPU, memory, network I/O) if the server isn't properly scaled or requests aren't managed (e.g., via request queuing and prioritization).
* **Reduced Isolation:** Jobs running within the same server process have less isolation than jobs running in separate processes. A misbehaving job or a bug in the server could potentially affect other jobs.
* **Hermeticity Concerns:** Achieving the same level of hermeticity and reproducibility as isolated processes can be more challenging. Dependencies and environment configurations are managed at the server level, potentially leading to less predictable job execution environments.
* **RPC Overhead:** While startup is faster, the overhead of network communication and data serialization/deserialization for RPC calls can become significant, especially for large data payloads, potentially negating some performance gains.
* **Scalability Model:** Scaling a fleet of stateful or semi-stateful services can be more complex than scaling out stateless, containerized jobs.
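The resource-contention point above is usually mitigated with request queuing; a minimal sketch (names hypothetical) caps in-flight jobs with a semaphore so a burst of requests queues rather than starving the server:

```python
import threading

class BoundedExecutor:
    """Hypothetical sketch of request management in a shared build server:
    a semaphore caps concurrent jobs; excess callers block until a slot
    frees, which acts as a simple queue."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.Semaphore(max_concurrent)

    def run(self, job):
        with self._slots:  # blocks (queues) while all slots are busy
            return job()

executor = BoundedExecutor(max_concurrent=2)
print(executor.run(lambda: "job done"))
```

A real server would add prioritization and timeouts on top of this; the semaphore only addresses raw concurrency.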
This model shifts complexity from individual job execution to the management of the underlying service infrastructure. It might be beneficial for scenarios with very high volumes of very small, fast jobs where startup overhead is the dominant factor, but it introduces new challenges in terms of operational management, reliability, and maintaining the desirable properties of isolated, reproducible builds.
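A back-of-envelope calculation makes the "startup overhead is the dominant factor" condition concrete. All figures below are illustrative assumptions, not measurements:

```python
def breakeven_jobs(startup_s: float, rpc_overhead_s: float,
                   server_fixed_cost_s: float) -> float:
    """Number of jobs at which a long-lived server's fixed cost is
    amortized, given a per-job saving of (process startup - RPC overhead).
    Purely illustrative arithmetic."""
    per_job_saving = startup_s - rpc_overhead_s
    return server_fixed_cost_s / per_job_saving

# Assuming 2s process startup, 5ms RPC overhead, and 60s of fixed cost
# to stand the server up, the server wins after roughly 30 jobs.
print(breakeven_jobs(2.0, 0.005, 60.0))
```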