Compare commits: 3 commits, 4bb8af2c74 ... 04c5924746 (04c5924746, eeef8b6444, 58c57332e1)
3 changed files with 364 additions and 104 deletions

databuild/dashboard/TYPE_SAFETY.md (new file, +127 lines)
@ -0,0 +1,127 @@
# Dashboard Type Safety Architecture

## Overview

This document describes the type safety architecture implemented in the DataBuild dashboard to prevent runtime errors from backend API changes.

## Problem Statement
The dashboard previously experienced runtime crashes when backend API changes were deployed:

- `status.toLowerCase()` failed when `status` changed from a string to an object
- `partition.str` access failed when the partition structure changed
- TypeScript compilation passed, but errors still occurred at runtime
## Solution Architecture

### 1. Dashboard Data Contracts

We define stable TypeScript interfaces in `types.ts` that represent the data shapes the UI components expect:
```typescript
export interface DashboardBuild {
  build_request_id: string;
  status: string;                  // Always a human-readable string
  requested_partitions: string[];  // Always a flat string array
  // ... other fields
}
```
### 2. Transformation Layer

The `services.ts` file contains transformation functions that convert OpenAPI-generated types to dashboard types:
```typescript
function transformBuildSummary(apiResponse: BuildSummary): DashboardBuild {
  return {
    build_request_id: apiResponse.build_request_id,
    status: apiResponse.status_name,                                        // Extract string from API
    requested_partitions: apiResponse.requested_partitions.map(p => p.str), // Flatten objects
    // ... transform other fields
  };
}
```
### 3. Component Isolation

All UI components use only dashboard types, never raw API types:
```typescript
// GOOD: Using dashboard types
const build: DashboardBuild = await DashboardService.getBuildDetail(id);
m('div', build.status.toLowerCase()); // Safe - status is always a string

// BAD: Using API types directly
const build: BuildSummary = await apiClient.getBuild(id);
m('div', build.status.toLowerCase()); // Unsafe - status might be an object
```
## Benefits

1. **Compile-time Safety**: TypeScript catches type mismatches during development
2. **Runtime Protection**: Transformation functions handle API changes gracefully
3. **Clear Boundaries**: UI code is isolated from API implementation details
4. **Easier Updates**: API changes require updates only in transformation functions
## Testing Strategy

### Unit Tests

- `transformation-tests.ts`: Verify transformation functions produce correct dashboard types
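A transformation test of this kind might look like the following sketch; the `ApiBuildSummary` shape and the sample payload are hypothetical stand-ins for the generated client types, not the real API contract:

```typescript
// Sketch of a transformation unit test (hypothetical payload and types).
interface DashboardBuild {
  build_request_id: string;
  status: string;
  requested_partitions: string[];
}

interface ApiBuildSummary {
  build_request_id: string;
  status_name: string;
  requested_partitions: { str: string }[];
}

function transformBuildSummary(apiResponse: ApiBuildSummary): DashboardBuild {
  return {
    build_request_id: apiResponse.build_request_id,
    status: apiResponse.status_name,
    requested_partitions: apiResponse.requested_partitions.map((p) => p.str),
  };
}

const result = transformBuildSummary({
  build_request_id: "br-123",
  status_name: "COMPLETED",
  requested_partitions: [{ str: "raw/2024-01-01" }],
});

// The test asserts the dashboard contract, not the API shape.
if (result.status !== "COMPLETED") throw new Error("status not flattened");
if (result.requested_partitions[0] !== "raw/2024-01-01") {
  throw new Error("partitions not flattened");
}
console.log("transformation test passed");
```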
### Strict TypeScript Configuration

- `exactOptionalPropertyTypes`: Ensures optional properties are handled explicitly
- `strictNullChecks`: Prevents null/undefined errors
- `noImplicitAny`: Requires explicit typing
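The relevant compiler options in `tsconfig_app.json` would look roughly like this (a sketch showing only the flags named above; the actual file contains other settings):

```json
{
  "compilerOptions": {
    "strictNullChecks": true,
    "noImplicitAny": true,
    "exactOptionalPropertyTypes": true
  }
}
```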
## Maintenance Guidelines

### When Backend API Changes

1. Update the OpenAPI spec and regenerate client
2. TypeScript compilation will fail in transformation functions if types changed
3. Update only the transformation functions to handle new API shape
4. Run tests to verify UI components still work correctly
### Adding New Features

1. Define dashboard types in `types.ts`
2. Create transformation functions in `services.ts`
3. Use only dashboard types in components
4. Add tests for the transformation logic
## Example: Handling API Evolution

If the backend changes `status` from string to object:
```typescript
// Old API
{ status_name: "COMPLETED" }

// New API
{ status: { code: 4, name: "COMPLETED" } }

// Transformation handles both
function transformBuildSummary(apiResponse: any): DashboardBuild {
  return {
    status: apiResponse.status_name || apiResponse.status?.name || 'UNKNOWN',
    // ... other fields
  };
}
```
The UI components continue working without changes because they always receive the expected `string` type.
## Monitoring

To maintain type safety over time:

1. **Build-time Checks**: TypeScript compilation catches type errors
2. **Test Suite**: Transformation tests run on every build
3. **Code Reviews**: Ensure new code follows the pattern
4. **Documentation**: Keep this document updated with patterns
## Related Files

- `types.ts` - Dashboard type definitions
- `services.ts` - API transformation functions
- `transformation-tests.ts` - Unit tests for transformations
- `tsconfig_app.json` - Strict TypeScript configuration
plans/dsl.md (new file, +237 lines)
@ -0,0 +1,237 @@
# DataBuild Interface Evolution: Strategic Options and Technical Decisions

This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.

## Executive Summary

DataBuild must choose between three fundamental interface strategies:
1. **Pure Bazel** (current): Maximum guarantees, maximum verbosity
2. **High-Level DSL**: Expressive interfaces that compile to Bazel
3. **Pure Declarative**: Eliminate orchestration entirely through relational modeling

## The Core Technical Decisions
### Decision 1: Where Should Dependency Logic Live?

**Option A: In-Job Config (Current Design)**
```python
# Job knows its own dependencies
def config(self, date):
    return {"inputs": [f"raw/{date}", f"raw/{date-1}"]}
```
- ✅ **Locality of knowledge** - dependency logic next to usage
- ✅ **Natural evolution** - changes happen in one place
- ❌ **Performance overhead** - subprocess per config call
- **Thrives in**: Complex enterprise environments where jobs have intricate, evolving dependencies
**Option B: Graph-Level Declaration**
```python
databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"]
)
```
- ✅ **Static analysis** - entire graph visible without execution
- ✅ **Performance** - microseconds vs seconds for planning
- ❌ **Flexibility** - harder to express dynamic dependencies
- ❌ **Implicit coupling** - jobs have to duplicate data dependency resolution
- **Thrives in**: High-frequency trading systems, real-time analytics where planning speed matters
**Option C: Hybrid Pattern-Based**
```python
# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # Runtime resolves the exact partitions before invoking the job
    ...
```
- ✅ **Best of both** - fast planning with flexibility
- ✅ **Progressive disclosure** - simple cases stay simple
- ❌ **Complexity** - two places to look
- **Thrives in**: Modern data platforms serving diverse teams with varying sophistication
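The window pattern above could be resolved at runtime along these lines. This is a sketch with an illustrative resolver function; the pattern syntax and partition naming are assumptions, not DataBuild's actual API:

```python
from datetime import date, timedelta

def resolve_window(end: date, window: int, source: str) -> list[str]:
    """Expand a pattern like raw/{source}/[date-window:date] into
    concrete partition refs (hypothetical resolver sketch)."""
    return [
        f"raw/{source}/{(end - timedelta(days=d)).isoformat()}"
        for d in range(window, -1, -1)  # oldest first, both ends inclusive
    ]

partitions = resolve_window(date(2024, 1, 8), window=7, source="events")
print(partitions[0])   # raw/events/2024-01-01
print(partitions[-1])  # raw/events/2024-01-08
```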
### Decision 2: Interface Language Choice

**Option A: Pure Bazel (Status Quo)**
```starlark
databuild_job(
    name = "etl",
    binary = ":etl_binary",
)
```
**Narrative**: "The Infrastructure-as-Code Platform"
- For organizations that value reproducibility above all else
- Where data pipelines are mission-critical infrastructure
- Teams that already use Bazel for other systems

**Strengths**:
- Hermetic builds guarantee reproducibility
- Multi-language support out of the box
- Battle-tested deployment story

**Weaknesses**:
- High barrier to entry
- Verbose for simple cases
- Limited expressiveness
**Option B: Python DSL → Bazel Compilation**
```python
@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()
```
**Narrative**: "The Developer-First Data Platform"
- For data teams that move fast and iterate quickly
- Where Python is already the lingua franca
- Organizations prioritizing developer productivity

**Strengths**:
- 10x more concise than Bazel
- Natural for data scientists/engineers
- Rich ecosystem integration

**Weaknesses**:
- Additional compilation step
- Python-centric (less multi-language)
- Debugging across abstraction layers
**Option C: Rust DSL with Procedural Macros**
```rust
#[job]
fn process(
    #[partition("raw/{date}")] input: Partition<Data>
) -> Partition<Output, "output/{date}"> {
    input.load()?.transform().save()
}
```
**Narrative**: "The High-Performance Data Platform"
- For organizations processing massive scale
- Where performance and correctness are equally critical
- Teams willing to invest in Rust expertise

**Strengths**:
- Compile-time guarantees with elegance
- Zero-cost abstractions
- Single language with the execution engine

**Weaknesses**:
- Steep learning curve
- Smaller talent pool
- Less flexible than Python
### Decision 3: Orchestration Philosophy

**Option A: Explicit Orchestration (Traditional)**
- Users define execution order and dependencies
- Similar to Airflow, Prefect, Dagster
- **Thrives in**: Organizations with complex business logic requiring explicit control

**Option B: Implicit Orchestration (Current DataBuild)**
- Users define jobs and dependencies
- System figures out execution order
- **Thrives in**: Data engineering teams wanting to focus on transformations, not plumbing
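"System figures out execution order" is, at its core, a topological sort over the declared dependency graph. A minimal sketch (the partition names are hypothetical):

```python
from graphlib import TopologicalSorter

# Jobs declare only what they depend on; the system derives the order.
deps = {
    "clean/2024-01-01": {"raw/2024-01-01"},
    "features/2024-01-01": {"clean/2024-01-01"},
    "report/2024-01-01": {"features/2024-01-01", "clean/2024-01-01"},
}

# static_order() yields every node with dependencies before dependents;
# a real system would also batch independent nodes for parallel execution.
order = list(TopologicalSorter(deps).static_order())
print(order)
```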
**Option C: No Orchestration (Pure Declarative)**
```python
@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"

    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))
```
**Narrative**: "The SQL-for-Data-Pipelines Platform"
- Orchestration is an implementation detail
- Users declare relationships, system handles everything
- **Thrives in**: Next-generation data platforms, organizations ready to rethink data processing

**Strengths**:
- Eliminates entire categories of bugs
- Enables powerful optimizations

**Weaknesses**:
- Paradigm shift for users
- Less control over execution
- Harder to debug when things go wrong
## Strategic Recommendations by Use Case

### For Startups/Fast-Moving Teams
**Recommendation**: Python DSL → Bazel
- Start with Python for rapid development
- Compile to Bazel for production
- Migrate critical jobs to native Bazel/Rust over time

### For Enterprise/Regulated Industries
**Recommendation**: Pure Bazel with Graph-Level Dependencies
- Maintain full auditability and reproducibility
- Use graph-level deps for performance
- Consider the Rust DSL for greenfield projects

### For Next-Gen Data Platforms
**Recommendation**: Pure Declarative with Rust Implementation
- Leap directly to the declarative model
- Build on Rust for performance and correctness
- Pioneer the "SQL for pipelines" approach
## Implementation Patterns

### Pattern 1: Gradual Migration
```
Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
```
- Low risk, high compatibility
- Teams can adopt at their own pace
- Preserves existing investments

### Pattern 2: Parallel Tracks
```
Bazel Interface (production)
        ↕️
Python Interface (development)
```
- Different interfaces for different use cases
- Development velocity without sacrificing production guarantees
- Higher maintenance burden

### Pattern 3: Clean Break
```
New declarative system alongside legacy
```
- Fastest path to innovation
- No legacy constraints
- Requires significant investment
## Key Technical Insights

### Single Source of Truth Principle
Whatever path is chosen, dependency declaration and resolution must be co-located:
```python
# Good: Single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: Split sources
# In config: depends = ["raw/{date}"]
# In code: data = load("raw/{date}")  # Duplication!
```
### The Pattern Language Insight
No new DSL is needed for patterns - leverage existing language features:
- Python: f-strings, glob, regex
- Rust: const generics, pattern matching
- Both: bidirectional pattern template libraries
### The Orchestration Elimination Insight
The highest abstraction isn't better orchestration - it's no orchestration. Just as SQL removed query planning from the user's concern, DataBuild could eliminate execution planning.
## Conclusion

The optimal path depends on organizational maturity and ambition:

1. **Conservative Evolution**: Enhance Bazel with better patterns and graph-level deps
2. **Developer-Focused**: Python DSL compiling to Bazel, maintaining guarantees
3. **Revolutionary Leap**: Pure declarative relationships with Rust implementation

Each path has merit. The key is choosing one that aligns with your organization's data infrastructure philosophy and long-term vision.
@ -1,104 +0,0 @@
# Python DSL Exploration: Foundational Ideas for DataBuild's Evolution

This document explores how Python's expressiveness could reshape DataBuild's interface - not as an implementation plan, but as a collection of themes and insights that could inspire future evolution.

## Core Narratives

### 1. The Prefect Inspiration: Achieving 10x Conciseness
DataBuild currently requires ~1000 lines of Bazel + Rust to express what could potentially be ~100 lines of Python. This order-of-magnitude difference suggests fundamental opportunities for abstraction.

### 2. From Orchestration to Relations: The SQL Insight
The key realization: if orchestration logic changes frequently, we shouldn't make it easier to write - we should eliminate writing it entirely. Just as SQL focuses on relational algebra rather than execution plans, DataBuild could focus on data relationships rather than orchestration steps.

### 3. The Spectrum of Approaches
#### Pure Python (Maximum Dynamism)
```python
@db.job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return transform(raw)
```
- Runtime introspection discovers dependencies
- Decorators provide the interface
- Trade-off: Sacrifices compile-time guarantees for expressiveness
#### Hybrid Approaches (Best of Both Worlds)
Multiple strategies were explored:
- **Python DSL → Bazel Generation**: Python defines, Bazel executes
- **Python Orchestrator + Bazel Workers**: Python handles coordination, Bazel handles computation
- **Dual-Mode System**: Development in Python, production in Bazel
- **Gradual Migration**: Start pure Python, migrate heavy jobs to Bazel over time
#### Pure Declarative (The Ultimate Vision)
```python
@rdb.partition("clean/{date}")
class CleanData:
    @rdb.derives_from("raw/*/{{date}}")
    def transform(self, raw_partitions: List[Partition]) -> Partition:
        # Pure functional relationship, no orchestration
        pass
```
## Foundational Themes

### 1. Declarative Over Imperative
The evolution from "do this, then that" to "this depends on that" represents a fundamental shift in how we think about data pipelines. The interface should express relationships, not recipes.

### 2. Pattern-Based Dependencies
Instead of explicitly listing dependencies, patterns like `raw/*/{{date}}` or `features/[date-30:date]` can express complex relationships concisely. This mirrors SQL's ability to express joins and windows declaratively.

### 3. Interface/Implementation Separation
The most promising approaches separate:
- **Interface**: How users express data relationships (Python's domain)
- **Implementation**: How computations execute (Bazel/Rust's domain)

### 4. Correctness Through Constraints
Rather than compile-time checking of imperative code, correctness could come from:
- Functional transformations (no side effects)
- Pattern-based completeness (all dependencies captured)
- Relational integrity (cycles impossible by construction)

### 5. Runtime Intelligence
With declarative relationships, the system can:
- Build optimal execution plans at runtime
- Adapt to resource availability
- Skip unnecessary recomputation
- Parallelize automatically
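Skipping unnecessary recomputation can be sketched by keying each partition build on a fingerprint of its inputs, a common incremental-build technique. The cache and build functions below are hypothetical stand-ins, not DataBuild internals:

```python
import hashlib
import json

# fingerprint -> previously built partition ref (hypothetical cache)
cache: dict[str, str] = {}

def fingerprint(job_name: str, inputs: list[str]) -> str:
    """Stable hash of a job and its input partition refs."""
    payload = json.dumps({"job": job_name, "inputs": sorted(inputs)})
    return hashlib.sha256(payload.encode()).hexdigest()

def build(job_name: str, inputs: list[str]) -> tuple[str, bool]:
    """Return (partition_ref, was_rebuilt); skip work if inputs are unchanged."""
    key = fingerprint(job_name, inputs)
    if key in cache:
        return cache[key], False  # inputs unchanged: skip recomputation
    ref = f"{job_name}/built"     # stand-in for actually running the job
    cache[key] = ref
    return ref, True

_, first = build("clean/2024-01-01", ["raw/a/2024-01-01"])
_, second = build("clean/2024-01-01", ["raw/a/2024-01-01"])
print(first, second)  # True False
```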
## Key Insights

### The Orchestration Paradox
"Orchestration logic changes frequently, so we shouldn't implement it directly at all." This paradox suggests that the solution to complex orchestration isn't better orchestration tools, but eliminating orchestration entirely through declarative relationships.

### The SQL Analogy
SQL's success comes from focusing on relational algebra rather than execution. DataBuild could similarly focus on data relationships rather than build steps. Users declare "what depends on what," not "how to build things."

### The Gradient of Guarantees
Different parts of the system need different guarantees:
- **Relationship declarations**: Need flexibility, benefit from Python
- **Computational execution**: Needs hermeticity, benefits from Bazel
- **Runtime planning**: Needs intelligence, benefits from Rust
## Future Explorations

### Interface Evolution Paths
1. **Gradual Enhancement**: Keep the current Bazel interface, add a Python layer on top
2. **Parallel Tracks**: Maintain both Bazel-first and Python-first interfaces
3. **Fundamental Reimagining**: Redesign around pure declarative relationships

### Technical Investigations
- How to preserve Bazel's hermeticity with Python's dynamism
- Pattern matching languages for partition dependencies
- Query planning algorithms for data pipelines
- Time-travel and what-if analysis capabilities

### Philosophical Questions
- Is orchestration a fundamental need or an implementation detail?
- Can we achieve both expressiveness and correctness?
- What would "SQL for data pipelines" actually look like?
## Conclusion

These explorations suggest that DataBuild's future might not be in making orchestration easier, but in making it unnecessary. By focusing on declarative data relationships rather than imperative build steps, we could achieve both the expressiveness of Python and the guarantees of Bazel, while eliminating entire categories of complexity.

The ultimate vision: users declare what data depends on what other data, and the system figures out everything else - just like SQL.