databuild/plans/12-dsl.md

# DataBuild Interface Evolution: Strategic Options and Technical Decisions
This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.
## Executive Summary
DataBuild must choose between three fundamental interface strategies:
1. **Pure Bazel** (current): Maximum guarantees, maximum verbosity
2. **High-Level DSL**: Expressive interfaces that compile to Bazel
3. **Pure Declarative**: Eliminate orchestration entirely through relational modeling
## The Core Technical Decisions
### Decision 1: Where Should Dependency Logic Live?
**Option A: In-Job Config (Current Design)**
```python
# Job knows its own dependencies
def config(self, date):
    return {"inputs": [f"raw/{date}", f"raw/{date - 1}"]}
```
- **Locality of knowledge** - dependency logic next to its usage
- **Natural evolution** - changes happen in one place
- **Performance overhead** - a subprocess per config call
- **Thrives in**: Complex enterprise environments where jobs have intricate, evolving dependencies
**Option B: Graph-Level Declaration**
```python
databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"],
)
```
- **Static analysis** - entire graph visible without executing any job
- **Performance** - planning takes microseconds instead of seconds
- **Flexibility** - dynamic dependencies are harder to express
- **Implicit coupling** - jobs must duplicate the dependency resolution already declared at the graph level
- **Thrives in**: High-frequency trading systems, real-time analytics where planning speed matters
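The planning-speed claim can be made concrete with a small sketch: because dependencies are declared as data, a planner can expand them with pure string formatting and never launch job code. The `jobs` table and `plan` function below are illustrative assumptions, not DataBuild's actual API.

```python
# Hypothetical static planner: expands declared patterns without
# executing any job binary, so planning is pure string work.
jobs = [
    {
        "name": "process_daily",
        "depends_on": ["raw/{date}"],
        "produces": ["processed/{date}"],
    },
]

def plan(jobs, date):
    # Resolve every pattern for the requested partition key.
    return {
        j["name"]: {
            "inputs": [p.format(date=date) for p in j["depends_on"]],
            "outputs": [p.format(date=date) for p in j["produces"]],
        }
        for j in jobs
    }

resolved = plan(jobs, "2024-01-01")
```

Because nothing here shells out to job code, the full graph is visible to tooling (visualization, impact analysis) before anything runs.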
**Option C: Hybrid Pattern-Based**
```python
# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # Runtime resolves the pattern to the exact partitions in the window
    ...
```
- **Best of both** - fast planning with runtime flexibility
- **Progressive disclosure** - simple cases stay simple
- **Complexity** - two places to look for dependency logic
- **Thrives in**: Modern data platforms serving diverse teams with varying sophistication
### Decision 2: Interface Language Choice
**Option A: Pure Bazel (Status Quo)**
```starlark
databuild_job(
    name = "etl",
    binary = ":etl_binary",
)
```
**Narrative**: "The Infrastructure-as-Code Platform"
- For organizations that value reproducibility above all else
- Where data pipelines are mission-critical infrastructure
- Teams that already use Bazel for other systems
**Strengths**:
- Hermetic builds guarantee reproducibility
- Multi-language support out of the box
- Battle-tested deployment story
**Weaknesses**:
- High barrier to entry
- Verbose for simple cases
- Limited expressiveness
**Option B: Python DSL → Bazel Compilation**
```python
@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()
```
**Narrative**: "The Developer-First Data Platform"
- For data teams that move fast and iterate quickly
- Where Python is already the lingua franca
- Organizations prioritizing developer productivity
**Strengths**:
- 10x more concise than Bazel
- Natural for data scientists/engineers
- Rich ecosystem integration
**Weaknesses**:
- Additional compilation step
- Python-centric (less multi-language)
- Debugging across abstraction layers
**Option C: Rust DSL with Procedural Macros**
```rust
#[job]
fn process(
    #[partition("raw/{date}")] input: Partition<Data>,
) -> Partition<Output, "output/{date}"> {
    input.load()?.transform().save()
}
```
**Narrative**: "The High-Performance Data Platform"
- For organizations processing massive scale
- Where performance and correctness are equally critical
- Teams willing to invest in Rust expertise
**Strengths**:
- Compile-time guarantees with elegance
- Zero-cost abstractions
- Single language with execution engine
**Weaknesses**:
- Steep learning curve
- Smaller talent pool
- Less flexible than Python
### Decision 3: Orchestration Philosophy
**Option A: Explicit Orchestration (Traditional)**
- Users define execution order and dependencies
- Similar to Airflow, Prefect, Dagster
- **Thrives in**: Organizations with complex business logic requiring explicit control
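To make the contrast concrete, here is a minimal sketch of explicit orchestration in the Airflow style, where the user wires ordering by hand. The `Task` class is a hypothetical stand-in, not any real orchestrator's API.

```python
class Task:
    """Hypothetical task object; real orchestrators (Airflow, Dagster) differ."""

    def __init__(self, name):
        self.name = name
        self.upstream = []

    def after(self, other):
        # The user explicitly encodes execution order, not just data needs.
        self.upstream.append(other)
        return self

extract = Task("extract")
transform = Task("transform").after(extract)
load = Task("load").after(transform)

# The schedule is exactly what the user wrote down - nothing is inferred.
pipeline = [extract, transform, load]
```

The control is total, but so is the burden: every ordering decision lives in user code.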
**Option B: Implicit Orchestration (Current DataBuild)**
- Users define jobs and dependencies
- System figures out execution order
- **Thrives in**: Data engineering teams wanting to focus on transformations, not plumbing
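A sketch of what "the system figures out execution order" means in practice: given only job-to-dependency edges (an assumed shape, not DataBuild's real schema), a standard topological sort yields a valid schedule.

```python
from graphlib import TopologicalSorter

# Users declare only which jobs depend on which; order is derived, not written.
deps = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

order = list(TopologicalSorter(deps).static_order())
# Every job is scheduled after all of its dependencies.
```

The same structure also gives cycle detection for free: `TopologicalSorter` raises `CycleError` on a circular dependency, a class of bug explicit orchestration leaves to the user.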
**Option C: No Orchestration (Pure Declarative)**
```python
@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"

    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))
```
**Narrative**: "The SQL-for-Data-Pipelines Platform"
- Orchestration is an implementation detail
- Users declare relationships, system handles everything
- **Thrives in**: Next-generation data platforms, organizations ready to rethink data processing
**Strengths**:
- Eliminates entire categories of bugs
- Enables powerful optimizations
**Weaknesses**:
- Paradigm shift for users
- Less control over execution
- Harder to debug when things go wrong
## Strategic Recommendations by Use Case
### For Startups/Fast-Moving Teams
**Recommendation**: Python DSL → Bazel
- Start with Python for rapid development
- Compile to Bazel for production
- Migrate critical jobs to native Bazel/Rust over time
### For Enterprise/Regulated Industries
**Recommendation**: Pure Bazel with Graph-Level Dependencies
- Maintain full auditability and reproducibility
- Use graph-level deps for performance
- Consider Rust DSL for new greenfield projects
### For Next-Gen Data Platforms
**Recommendation**: Pure Declarative with Rust Implementation
- Leap directly to declarative model
- Build on Rust for performance and correctness
- Pioneer the "SQL for pipelines" approach
## Implementation Patterns
### Pattern 1: Gradual Migration
```
Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
```
- Low risk, high compatibility
- Teams can adopt at their own pace
- Preserves existing investments
### Pattern 2: Parallel Tracks
```
Bazel Interface (production)
↕️
Python Interface (development)
```
- Different interfaces for different use cases
- Development velocity without sacrificing production guarantees
- Higher maintenance burden
### Pattern 3: Clean Break
```
New declarative system alongside legacy
```
- Fastest path to innovation
- No legacy constraints
- Requires significant investment
## Key Technical Insights
### Single Source of Truth Principle
Whichever path is chosen, dependency declaration and resolution must be co-located:
```python
# Good: single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: split sources
# In config: depends = ["raw/{date}"]
# In code:   data = load("raw/{date}")  # Duplication!
```
### The Pattern Language Insight
No new DSL needed for patterns - leverage existing language features:
- Python: f-strings, glob, regex
- Rust: const generics, pattern matching
- Both: bidirectional pattern template libraries
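As a sketch of the bidirectional idea in plain Python: the same format string renders a partition path and, compiled into a regex, parses one back. `pattern_to_regex` is an illustrative helper, not an existing library function.

```python
import re
import string

def pattern_to_regex(pattern):
    """Compile a str.format-style pattern into a parsing regex."""
    parts = []
    # string.Formatter().parse yields (literal, field_name, spec, conversion).
    for literal, field, _, _ in string.Formatter().parse(pattern):
        parts.append(re.escape(literal))
        if field is not None:
            parts.append(f"(?P<{field}>[^/]+)")
    return re.compile("".join(parts) + "$")

pattern = "raw/{source}/{date}"
rendered = pattern.format(source="events", date="2024-01-01")   # forward
parsed = pattern_to_regex(pattern).match(rendered).groupdict()  # backward
```

One pattern string, two directions - no new DSL, just the language's own formatting machinery.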
### The Orchestration Elimination Insight
The highest abstraction isn't better orchestration - it's no orchestration. Just as SQL removed query planning from the user's concern, DataBuild could remove execution planning.
## Conclusion
The optimal path depends on organizational maturity and ambition:
1. **Conservative Evolution**: Enhance Bazel with better patterns and graph-level deps
2. **Developer-Focused**: Python DSL compiling to Bazel, maintaining guarantees
3. **Revolutionary Leap**: Pure declarative relationships with Rust implementation
Each path has merit. The key is choosing one that aligns with your organization's data infrastructure philosophy and long-term vision.