# DataBuild Interface Evolution: Strategic Options and Technical Decisions
This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.

## Executive Summary

DataBuild must choose between three fundamental interface strategies:
1. **Pure Bazel** (current): Maximum guarantees, maximum verbosity
2. **High-Level DSL**: Expressive interfaces that compile to Bazel
3. **Pure Declarative**: Eliminate orchestration entirely through relational modeling

## The Core Technical Decisions

### Decision 1: Where Should Dependency Logic Live?
**Option A: In-Job Config (Current Design)**
```python
from datetime import date as Date, timedelta

# Job knows its own dependencies; the config method resolves them at runtime
def config(self, date: Date):
    prev = date - timedelta(days=1)
    return {"inputs": [f"raw/{date}", f"raw/{prev}"]}
```
- ✅ **Locality of knowledge** - dependency logic lives next to its usage
- ✅ **Natural evolution** - changes happen in one place
- ❌ **Performance overhead** - one subprocess launch per config call
- **Thrives in**: Complex enterprise environments where jobs have intricate, evolving dependencies
**Option B: Graph-Level Declaration**
```starlark
databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"],
)
```
- ✅ **Static analysis** - the entire graph is visible without executing any job code
- ✅ **Performance** - planning takes microseconds instead of seconds
- ❌ **Flexibility** - dynamic dependencies are harder to express
- ❌ **Implicit coupling** - jobs must duplicate data dependency resolution
- **Thrives in**: High-frequency trading systems and real-time analytics, where planning speed matters
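To make the static-analysis claim concrete, here is a minimal sketch of graph-level planning: given only the declared patterns, a planner can link producers to consumers and order jobs without executing any job code. The `JOBS` table and `plan` function are illustrative assumptions, not DataBuild APIs.

```python
# Hypothetical sketch: with graph-level declarations, the planner links
# producers to consumers by pattern alone, then orders jobs by dependency.
JOBS = {
    "ingest":        {"depends_on": [],                   "produces": ["raw/{date}"]},
    "process_daily": {"depends_on": ["raw/{date}"],       "produces": ["processed/{date}"]},
    "report":        {"depends_on": ["processed/{date}"], "produces": ["report/{date}"]},
}

def plan(date: str) -> list[str]:
    """Return jobs in dependency order for one date partition."""
    # Map each concrete partition to the job that produces it
    producer = {}
    for name, spec in JOBS.items():
        for pat in spec["produces"]:
            producer[pat.format(date=date)] = name

    # Depth-first topological ordering over the derived graph
    ordered, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for pat in JOBS[name]["depends_on"]:
            visit(producer[pat.format(date=date)])
        ordered.append(name)

    for name in JOBS:
        visit(name)
    return ordered

print(plan("2024-01-01"))  # ['ingest', 'process_daily', 'report']
```

Because no job binary runs, this entire plan can be computed in microseconds, which is the performance argument above in miniature.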
**Option C: Hybrid Pattern-Based**
```python
# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # Runtime resolves the exact partitions matching the pattern
    ...
```
- ✅ **Best of both** - fast planning with runtime flexibility
- ✅ **Progressive disclosure** - simple cases stay simple
- ❌ **Complexity** - dependency logic lives in two places
- **Thrives in**: Modern data platforms serving diverse teams with varying sophistication
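The runtime-resolution half of this hybrid can be sketched in a few lines. This assumes the `[date-window:date]` range is half-open (the end date itself excluded); `resolve_window` is a hypothetical helper, not part of DataBuild.

```python
from datetime import date as Date, timedelta

# Hypothetical sketch of how a runtime might expand the
# "raw/{source}/[date-window:date]" pattern into concrete partitions.
# Assumes a half-open window: the end date is excluded.
def resolve_window(source: str, date: Date, window: int) -> list[str]:
    """Expand a date-window pattern into one partition path per day."""
    return [
        f"raw/{source}/{date - timedelta(days=offset)}"
        for offset in range(window, 0, -1)
    ]

parts = resolve_window("events", Date(2024, 1, 8), window=3)
print(parts)
# ['raw/events/2024-01-05', 'raw/events/2024-01-06', 'raw/events/2024-01-07']
```

The graph planner only ever sees the pattern; the exact partition list is computed at execution time, which is what preserves both fast planning and flexibility.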
### Decision 2: Interface Language Choice

**Option A: Pure Bazel (Status Quo)**
```starlark
databuild_job(
    name = "etl",
    binary = ":etl_binary",
)
```
**Narrative**: "The Infrastructure-as-Code Platform"
- For organizations that value reproducibility above all else
- Where data pipelines are mission-critical infrastructure
- For teams that already use Bazel for other systems

**Strengths**:
- Hermetic builds guarantee reproducibility
- Multi-language support out of the box
- Battle-tested deployment story

**Weaknesses**:
- High barrier to entry
- Verbose for simple cases
- Limited expressiveness
**Option B: Python DSL → Bazel Compilation**
```python
@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()
```
**Narrative**: "The Developer-First Data Platform"
- For data teams that move fast and iterate quickly
- Where Python is already the lingua franca
- For organizations prioritizing developer productivity

**Strengths**:
- Roughly 10x more concise than Bazel
- Natural for data scientists and engineers
- Rich ecosystem integration

**Weaknesses**:
- Additional compilation step
- Python-centric (less multi-language)
- Debugging across abstraction layers
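The compilation step itself can be small. The sketch below shows one plausible shape, assuming a registry-based decorator and a generator that emits `databuild_job` targets; both are illustrative, not the actual DataBuild toolchain.

```python
# Hypothetical sketch of the DSL-to-Bazel compilation step: a decorator
# records each job, and a generator emits the equivalent databuild_job()
# targets. The registry and emitter are assumptions for illustration.
_REGISTRY = []

def job(fn):
    """Register a job function for later BUILD-file generation."""
    _REGISTRY.append(fn)
    return fn

def emit_build_file() -> str:
    """Compile registered jobs into Bazel target declarations."""
    targets = []
    for fn in _REGISTRY:
        targets.append(
            f'databuild_job(\n'
            f'    name = "{fn.__name__}",\n'
            f'    binary = ":{fn.__name__}_binary",\n'
            f')'
        )
    return "\n\n".join(targets)

@job
def process(date):
    ...

print(emit_build_file())
```

Because the output is an ordinary BUILD file, the hermeticity and deployment guarantees of the Bazel layer are preserved; the cost is exactly the "debugging across abstraction layers" weakness noted above.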
**Option C: Rust DSL with Procedural Macros**
```rust
#[job]
fn process(
    #[partition("raw/{date}")] input: Partition<Data>,
) -> Partition<Output, "output/{date}"> {
    input.load()?.transform().save()
}
```
**Narrative**: "The High-Performance Data Platform"
- For organizations processing massive scale
- Where performance and correctness are equally critical
- For teams willing to invest in Rust expertise

**Strengths**:
- Compile-time guarantees with elegance
- Zero-cost abstractions
- A single language shared with the execution engine

**Weaknesses**:
- Steep learning curve
- Smaller talent pool
- Less flexible than Python
### Decision 3: Orchestration Philosophy

**Option A: Explicit Orchestration (Traditional)**
- Users define execution order and dependencies
- Similar to Airflow, Prefect, and Dagster
- **Thrives in**: Organizations with complex business logic requiring explicit control

**Option B: Implicit Orchestration (Current DataBuild)**
- Users define jobs and dependencies
- The system figures out execution order
- **Thrives in**: Data engineering teams that want to focus on transformations, not plumbing
**Option C: No Orchestration (Pure Declarative)**
```python
@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"

    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))
```
**Narrative**: "The SQL-for-Data-Pipelines Platform"
- Orchestration is an implementation detail
- Users declare relationships; the system handles everything else
- **Thrives in**: Next-generation data platforms and organizations ready to rethink data processing
**Strengths**:
- Eliminates entire categories of bugs
- Enables powerful optimizations

**Weaknesses**:
- Paradigm shift for users
- Less control over execution
- Harder to debug when things go wrong
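To illustrate what "no orchestration" means mechanically, here is a toy sketch: partitions are declared as pure functions of their sources, and a recursive `build` materializes whatever is requested, caching along the way. The `declare` decorator and `build` driver are assumptions for illustration only, not DataBuild APIs.

```python
# Hypothetical sketch of the pure-declarative model: users declare how each
# partition derives from its source, and the system materializes on demand.
PARTITIONS = {}   # pattern -> (source pattern or None, transform fn)
CACHE = {}        # concrete partition -> materialized value

def declare(pattern, source=None):
    def wrap(fn):
        PARTITIONS[pattern] = (source, fn)
        return fn
    return wrap

def build(pattern, date):
    """Materialize one partition, building its source first if declared."""
    key = pattern.format(date=date)
    if key in CACHE:
        return CACHE[key]
    source, fn = PARTITIONS[pattern]
    upstream = build(source, date) if source else None
    CACHE[key] = fn(upstream, date)
    return CACHE[key]

@declare("raw/{date}")
def raw(_, date):
    return f"rows-for-{date}"

@declare("clean/{date}", source="raw/{date}")
def clean(rows, date):
    return rows.upper()

print(build("clean/{date}", "2024-01-01"))  # ROWS-FOR-2024-01-01
```

No execution order is ever stated: asking for `clean/{date}` pulls `raw/{date}` into existence automatically, which is the sense in which orchestration becomes an implementation detail.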
## Strategic Recommendations by Use Case

### For Startups/Fast-Moving Teams
**Recommendation**: Python DSL → Bazel
- Start with Python for rapid development
- Compile to Bazel for production
- Migrate critical jobs to native Bazel/Rust over time

### For Enterprise/Regulated Industries
**Recommendation**: Pure Bazel with Graph-Level Dependencies
- Maintain full auditability and reproducibility
- Use graph-level dependencies for performance
- Consider the Rust DSL for greenfield projects

### For Next-Gen Data Platforms
**Recommendation**: Pure Declarative with Rust Implementation
- Leap directly to the declarative model
- Build on Rust for performance and correctness
- Pioneer the "SQL for pipelines" approach
## Implementation Patterns

### Pattern 1: Gradual Migration
```
Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
```
- Low risk, high compatibility
- Teams can adopt at their own pace
- Preserves existing investments

### Pattern 2: Parallel Tracks
```
Bazel Interface (production)
        ↕️
Python Interface (development)
```
- Different interfaces for different use cases
- Development velocity without sacrificing production guarantees
- Higher maintenance burden

### Pattern 3: Clean Break
```
New declarative system alongside legacy
```
- Fastest path to innovation
- No legacy constraints
- Requires significant investment
## Key Technical Insights

### Single Source of Truth Principle
Whichever path is chosen, dependency declaration and resolution must be co-located:
```python
# Good: single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: split sources
# In config: depends = ["raw/{date}"]
# In code:   data = load("raw/{date}")  # Duplication!
```
### The Pattern Language Insight
No new DSL is needed for patterns - leverage existing language features:
- Python: f-strings, glob, regex
- Rust: const generics, pattern matching
- Both: bidirectional pattern template libraries
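As a sketch of the f-string-plus-regex approach in Python, a single template can both format partition paths and parse them back, giving a bidirectional pattern without any new DSL. The `Pattern` class below is a hypothetical illustration, not a DataBuild library.

```python
import re

# Hypothetical bidirectional pattern template built only from existing
# language features: the same "raw/{source}/{date}" template both
# formats partition paths and parses them back into fields.
class Pattern:
    def __init__(self, template: str):
        self.template = template
        # Turn "{field}" placeholders into named regex groups
        self.regex = re.compile(
            re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template) + "$"
        )

    def format(self, **fields) -> str:
        return self.template.format(**fields)

    def parse(self, path: str) -> dict:
        m = self.regex.match(path)
        return m.groupdict() if m else {}

p = Pattern("raw/{source}/{date}")
print(p.format(source="events", date="2024-01-01"))  # raw/events/2024-01-01
print(p.parse("raw/events/2024-01-01"))  # {'source': 'events', 'date': '2024-01-01'}
```

The same idea maps to Rust via pattern matching or a small template library; the point is that both directions come from one declaration.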
### The Orchestration Elimination Insight
The highest abstraction isn't better orchestration - it's no orchestration. Just as SQL removed query planning from the user's concern, DataBuild could remove execution planning.
## Conclusion

The optimal path depends on organizational maturity and ambition:

1. **Conservative Evolution**: Enhance Bazel with better patterns and graph-level dependencies
2. **Developer-Focused**: A Python DSL compiling to Bazel, maintaining its guarantees
3. **Revolutionary Leap**: Pure declarative relationships with a Rust implementation

Each path has merit. The key is choosing the one that aligns with your organization's data infrastructure philosophy and long-term vision.