databuild/plans/12-dsl.md

# DataBuild Interface Evolution: Strategic Options and Technical Decisions
This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.
## Executive Summary
DataBuild must choose between three fundamental interface strategies:
1. **Pure Bazel** (current): Maximum guarantees, maximum verbosity
2. **High-Level DSL**: Expressive interfaces that compile to Bazel
3. **Pure Declarative**: Eliminate orchestration entirely through relational modeling
## The Core Technical Decisions
### Decision 1: Where Should Dependency Logic Live?
**Option A: In-Job Config (Current Design)**
```python
# Job knows its own dependencies
def config(self, date):
    return {"inputs": [f"raw/{date}", f"raw/{date - 1}"]}
```
- **Locality of knowledge** - dependency logic next to its usage
- **Natural evolution** - changes happen in one place
- **Performance overhead** - a subprocess per config call
- **Thrives in**: Complex enterprise environments where jobs have intricate, evolving dependencies
**Option B: Graph-Level Declaration**
```python
databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"],
)
```
- **Static analysis** - entire graph visible without executing any job
- **Performance** - planning takes microseconds instead of seconds
- **Flexibility** - dynamic dependencies are harder to express
- **Implicit coupling** - jobs must duplicate the dependency resolution already declared at the graph level
- **Thrives in**: High-frequency trading systems, real-time analytics where planning speed matters
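The planning-speed claim can be made concrete with a small sketch: because dependencies are declared as data, a planner can expand them with pure string formatting and never launch job code. The `jobs` table and `plan` function below are illustrative assumptions, not DataBuild's actual API.

```python
# Hypothetical static planner: expands declared patterns without
# executing any job binary, so planning is pure string work.
jobs = [
    {
        "name": "process_daily",
        "depends_on": ["raw/{date}"],
        "produces": ["processed/{date}"],
    },
]

def plan(jobs, date):
    # Resolve every pattern for the requested partition key.
    return {
        j["name"]: {
            "inputs": [p.format(date=date) for p in j["depends_on"]],
            "outputs": [p.format(date=date) for p in j["produces"]],
        }
        for j in jobs
    }

resolved = plan(jobs, "2024-01-01")
```

Because nothing here shells out to job code, the full graph is visible to tooling (visualization, impact analysis) before anything runs.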
**Option C: Hybrid Pattern-Based**
```python
# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # Runtime resolves the pattern to the exact partitions in the window
    ...
```
- **Best of both** - fast planning with runtime flexibility
- **Progressive disclosure** - simple cases stay simple
- **Complexity** - two places to look for dependency logic
- **Thrives in**: Modern data platforms serving diverse teams with varying sophistication
### Decision 2: Interface Language Choice
**Option A: Pure Bazel (Status Quo)**
```starlark
databuild_job(
    name = "etl",
    binary = ":etl_binary",
)
```
**Narrative**: "The Infrastructure-as-Code Platform"
- For organizations that value reproducibility above all else
- Where data pipelines are mission-critical infrastructure
- Teams that already use Bazel for other systems
**Strengths**:
- Hermetic builds guarantee reproducibility
- Multi-language support out of the box
- Battle-tested deployment story
**Weaknesses**:
- High barrier to entry
- Verbose for simple cases
- Limited expressiveness
**Option B: Python DSL → Bazel Compilation**
```python
@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()
```
**Narrative**: "The Developer-First Data Platform"
- For data teams that move fast and iterate quickly
- Where Python is already the lingua franca
- Organizations prioritizing developer productivity
**Strengths**:
- 10x more concise than Bazel
- Natural for data scientists/engineers
- Rich ecosystem integration
**Weaknesses**:
- Additional compilation step
- Python-centric (less multi-language)
- Debugging across abstraction layers
**Option C: Rust DSL with Procedural Macros**
```rust
#[job]
fn process(
    #[partition("raw/{date}")] input: Partition<Data>,
) -> Partition<Output, "output/{date}"> {
    input.load()?.transform().save()
}
```
**Narrative**: "The High-Performance Data Platform"
- For organizations processing massive scale
- Where performance and correctness are equally critical
- Teams willing to invest in Rust expertise
**Strengths**:
- Compile-time guarantees with elegance
- Zero-cost abstractions
- Single language with execution engine
**Weaknesses**:
- Steep learning curve
- Smaller talent pool
- Less flexible than Python
### Decision 3: Orchestration Philosophy
**Option A: Explicit Orchestration (Traditional)**
- Users define execution order and dependencies
- Similar to Airflow, Prefect, Dagster
- **Thrives in**: Organizations with complex business logic requiring explicit control
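To make the contrast concrete, here is a minimal sketch of explicit orchestration in the Airflow style, where the user wires ordering by hand. The `Task` class is a hypothetical stand-in, not any real orchestrator's API.

```python
class Task:
    """Hypothetical task object; real orchestrators (Airflow, Dagster) differ."""

    def __init__(self, name):
        self.name = name
        self.upstream = []

    def after(self, other):
        # The user explicitly encodes execution order, not just data needs.
        self.upstream.append(other)
        return self

extract = Task("extract")
transform = Task("transform").after(extract)
load = Task("load").after(transform)

# The schedule is exactly what the user wrote down - nothing is inferred.
pipeline = [extract, transform, load]
```

The control is total, but so is the burden: every ordering decision lives in user code.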
**Option B: Implicit Orchestration (Current DataBuild)**
- Users define jobs and dependencies
- System figures out execution order
- **Thrives in**: Data engineering teams wanting to focus on transformations, not plumbing
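A sketch of what "the system figures out execution order" means in practice: given only job-to-dependency edges (an assumed shape, not DataBuild's real schema), a standard topological sort yields a valid schedule.

```python
from graphlib import TopologicalSorter

# Users declare only which jobs depend on which; order is derived, not written.
deps = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

order = list(TopologicalSorter(deps).static_order())
# Every job is scheduled after all of its dependencies.
```

The same structure also gives cycle detection for free: `TopologicalSorter` raises `CycleError` on a circular dependency, a class of bug explicit orchestration leaves to the user.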
**Option C: No Orchestration (Pure Declarative)**
```python
@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"

    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))
```
**Narrative**: "The SQL-for-Data-Pipelines Platform"
- Orchestration is an implementation detail
- Users declare relationships, system handles everything
- **Thrives in**: Next-generation data platforms, organizations ready to rethink data processing
**Strengths**:
- Eliminates entire categories of bugs
- Enables powerful optimizations
**Weaknesses**:
- Paradigm shift for users
- Less control over execution
- Harder to debug when things go wrong
## Strategic Recommendations by Use Case
### For Startups/Fast-Moving Teams
**Recommendation**: Python DSL → Bazel
- Start with Python for rapid development
- Compile to Bazel for production
- Migrate critical jobs to native Bazel/Rust over time
### For Enterprise/Regulated Industries
**Recommendation**: Pure Bazel with Graph-Level Dependencies
- Maintain full auditability and reproducibility
- Use graph-level deps for performance
- Consider Rust DSL for new greenfield projects
### For Next-Gen Data Platforms
**Recommendation**: Pure Declarative with Rust Implementation
- Leap directly to declarative model
- Build on Rust for performance and correctness
- Pioneer the "SQL for pipelines" approach
## Implementation Patterns
### Pattern 1: Gradual Migration
```
Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
```
- Low risk, high compatibility
- Teams can adopt at their own pace
- Preserves existing investments
### Pattern 2: Parallel Tracks
```
Bazel Interface (production)
↕️
Python Interface (development)
```
- Different interfaces for different use cases
- Development velocity without sacrificing production guarantees
- Higher maintenance burden
### Pattern 3: Clean Break
```
New declarative system alongside legacy
```
- Fastest path to innovation
- No legacy constraints
- Requires significant investment
## Key Technical Insights
### Single Source of Truth Principle
Whichever path is chosen, dependency declaration and resolution must be co-located:
```python
# Good: single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: split sources
# In config: depends = ["raw/{date}"]
# In code:   data = load("raw/{date}")  # Duplication!
```
### The Pattern Language Insight
No new DSL needed for patterns - leverage existing language features:
- Python: f-strings, glob, regex
- Rust: const generics, pattern matching
- Both: bidirectional pattern template libraries
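As a sketch of the bidirectional idea in plain Python: the same format string renders a partition path and, compiled into a regex, parses one back. `pattern_to_regex` is an illustrative helper, not an existing library function.

```python
import re
import string

def pattern_to_regex(pattern):
    """Compile a str.format-style pattern into a parsing regex."""
    parts = []
    # string.Formatter().parse yields (literal, field_name, spec, conversion).
    for literal, field, _, _ in string.Formatter().parse(pattern):
        parts.append(re.escape(literal))
        if field is not None:
            parts.append(f"(?P<{field}>[^/]+)")
    return re.compile("".join(parts) + "$")

pattern = "raw/{source}/{date}"
rendered = pattern.format(source="events", date="2024-01-01")   # forward
parsed = pattern_to_regex(pattern).match(rendered).groupdict()  # backward
```

One pattern string, two directions - no new DSL, just the language's own formatting machinery.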
### The Orchestration Elimination Insight
The highest abstraction isn't better orchestration - it's no orchestration. Just as SQL removed query planning from the user's concern, DataBuild could remove execution planning.
## Conclusion
The optimal path depends on organizational maturity and ambition:
1. **Conservative Evolution**: Enhance Bazel with better patterns and graph-level deps
2. **Developer-Focused**: Python DSL compiling to Bazel, maintaining guarantees
3. **Revolutionary Leap**: Pure declarative relationships with Rust implementation
Each path has merit. The key is choosing one that aligns with your organization's data infrastructure philosophy and long-term vision.