Compare commits

...

8 commits

SHA1        Message                                              Date                         Checks
ad2cc7498b  add sick ass logo                                    2025-08-20 19:39:01 -07:00   setup (push) cancelled
5c9c2a05cc  Add systemd comment                                  2025-08-18 22:04:53 -07:00
8fd1c9b046  Describe wants and taints in readme                  2025-08-18 20:54:31 -07:00   setup (push) cancelled
dc622dd0ac  Minor timestamp fix                                  2025-08-16 16:21:43 -07:00   setup (push) cancelled
b3298e7213  Add test app e2e test coverage for generated graph   2025-08-16 15:53:26 -07:00
f92cfeb9b5  Add test app generated package                       2025-08-16 15:37:47 -07:00
07d2a9faec  Detect out of date generated source                  2025-08-16 15:37:07 -07:00
952366ab66  Add e2e test for test app bazel impl                 2025-08-16 09:39:56 -07:00
23 changed files with 772 additions and 8 deletions

View file

@ -17,7 +17,9 @@ Graphs and jobs are defined in [bazel](https://bazel.build), allowing graphs (an
- **Jobs** - Their `exec` entrypoint builds partitions from partitions, and their `config` entrypoint specifies which partitions are required to produce the requested partition(s), along with the specific config to run `exec` with to build those partitions (see the sketch after this list).
- **Graphs** - Compose jobs together to achieve multi-job orchestration, using a `lookup` mechanism to resolve a requested partition to the job that can build it. Together with its constituent jobs, a graph can fully plan the build of any set of partitions. Most interactions with a DataBuild app happen through a graph.
- **Build Event Log** - Encodes the state of the system, recording build requests, job activity, partition production, etc., to enable running DataBuild as a deployed application.
- **Bazel Targets** - Bazel is a fast, extensible, and hermetic build system. DataBuild uses bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of integrating your data build jobs into `databuild_job` bazel targets and connecting them with a `databuild_graph` target.
- **Wants** - Partition wants can be registered with DataBuild, causing it to build the wanted partitions as soon as their graph-external dependencies are met.
- **Taints** - Taints mark a partition as invalid, indicating that readers should not use it and that it should be rebuilt when next requested or depended upon. If there is a still-active want for the tainted partition, it is rebuilt immediately.
- **Bazel Targets** - Bazel is a fast, extensible, and hermetic build system. DataBuild uses bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of integrating your data build jobs into `databuild_job` bazel targets and connecting them with a `databuild_graph` target.
- [**Graph Specification Strategies**](design/graph-specification.md) (coming soon): Application libraries in Python/Rust/Scala that use language features to enable ergonomic and succinct specification of jobs and graphs.
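To make the `config`/`exec` contract above concrete, here is a minimal, hypothetical sketch of a job entrypoint in Python. The two-command CLI shape mirrors the generated job scripts later in this diff; the partition-ref names and the config payload fields are illustrative assumptions, not the actual DataBuild API.
```
#!/usr/bin/env python3
# Hypothetical job entrypoint sketch; partition-ref names and config fields are illustrative.
import json
import sys

def config(output_refs: list[str]) -> dict:
    # Declare which upstream partitions each requested output needs,
    # plus the args that `exec` should later be invoked with.
    return {"configs": [{
        "outputs": output_refs,
        "inputs": [ref.replace("daily_report", "raw_events") for ref in output_refs],
        "args": output_refs,
    }]}

def run_exec(args: list[str]) -> None:
    # Build the requested partitions from their inputs.
    for ref in args:
        print(f"built {ref}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        raise SystemExit("usage: job.py {config|exec} <partition_ref>...")
    command, refs = sys.argv[1], sys.argv[2:]
    if command == "config":
        print(json.dumps(config(refs)))
    elif command == "exec":
        run_exec(refs)
    else:
        raise SystemExit(f"Invalid command `{command}`")
```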
### Partition / Job Assumptions and Best Practices
@ -65,7 +67,8 @@ You can also use cron-based triggers, which return partition refs that they want
# Key Insights
- Orchestration logic changes all the time - better to not write it at all.
- Orchestration decisions and application logic are innately coupled
- Orchestration decisions and application logic are innately coupled.
- "systemd for data platforms"
## Assumptions

View file

@ -1,5 +1,25 @@
# DataBuild
```
██████╗ ████╗ ███████████╗ ████╗
██╔═══██╗ ██╔██║ ╚═══██╔════╝ ██╔██║
██╔╝ ██║ ██╔╝██║ ██╔╝ ██╔╝██║
██╔╝ ██║ ██╔╝ ██║ ██╔╝ ██╔╝ ██║
██╔╝ ██╔╝ ██╔╝ ██║ ██╔╝ ██╔╝ ██║
██╔╝ ██╔═╝ █████████║ ██╔╝ █████████║
████████╔═╝ ██╔═════██║ ██╔╝ ██╔═════██║
╚═══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝
██████╗ ██╗ ██╗ ██╗ ██╗ █████╗
██╔═══██╗ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔══██╗
██╔╝ ██║ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██║
█████████╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██║
██╔═══██╔═╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝
██╔╝ ██║ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔═╝
█████████╔╝ ██████╔═╝ ██╔╝ ████████╗ ███████╔═╝
╚════════╝ ╚═════╝ ╚═╝ ╚═══════╝ ╚══════╝
- -- S Y S T E M O N L I N E -- -
```
DataBuild is a trivially-deployable, partition-oriented, declarative data build system.

databuild/ascii_logo.txt (new file, 19 lines)
View file

@ -0,0 +1,19 @@
██████╗ ████╗ ███████████╗ ████╗
██╔═══██╗ ██╔██║ ╚═══██╔════╝ ██╔██║
██╔╝ ██║ ██╔╝██║ ██╔╝ ██╔╝██║
██╔╝ ██║ ██╔╝ ██║ ██╔╝ ██╔╝ ██║
██╔╝ ██╔╝ ██╔╝ ██║ ██╔╝ ██╔╝ ██║
██╔╝ ██╔═╝ █████████║ ██╔╝ █████████║
████████╔═╝ ██╔═════██║ ██╔╝ ██╔═════██║
╚═══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝
██████╗ ██╗ ██╗ ██╗ ██╗ █████╗
██╔═══██╗ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔══██╗
██╔╝ ██║ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██║
█████████╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██║
██╔═══██╔═╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝
██╔╝ ██║ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔╝ ██╔═╝
█████████╔╝ ██████╔═╝ ██╔╝ ████████╗ ███████╔═╝
╚════════╝ ╚═════╝ ╚═╝ ╚═══════╝ ╚══════╝
- -- S Y S T E M O N L I N E -- -

View file

@ -462,6 +462,8 @@ export const BuildStatus: TypedComponent<BuildStatusAttrs> = {
...(build.completed_at ? [{stage: 'Build Completed', time: build.completed_at, icon: '✅'}] : []),
];
let startedAt = build.started_at || build.requested_at;
return m('div.container.mx-auto.p-4', [
// Build Header
m('.build-header.mb-6', [
@ -485,8 +487,8 @@ export const BuildStatus: TypedComponent<BuildStatusAttrs> = {
]),
m('.stat.bg-base-100.shadow.rounded-lg.p-4', [
m('.stat-title', 'Duration'),
m('.stat-value.text-2xl', (build.completed_at - build.started_at) ? formatDuration((build.completed_at - build.started_at)) : '—'),
m('.stat-desc', build.started_at ? formatDateTime(build.started_at) : 'Not started')
m('.stat-value.text-2xl', (build.completed_at - startedAt) ? formatDuration((build.completed_at - startedAt)) : '—'),
m('.stat-desc', startedAt ? formatDateTime(startedAt) : 'Not started')
])
])
]),

View file

@ -462,6 +462,7 @@ export function formatDateTime(epochNanos: number): string {
export function formatDuration(durationNanos?: number | null): string {
let durationMs = durationNanos ? durationNanos / 1000000 : null;
console.warn('Formatting duration:', durationMs);
if (!durationMs || durationMs <= 0) {
return '—';
}

View file

@ -120,7 +120,7 @@ class DataBuildGraph:
import os
# Get job classes from the lookup table
job_classes = list(set(self.lookup.values()))
job_classes = sorted(set(self.lookup.values()), key=lambda cls: cls.__name__)
# Format deps for BUILD.bazel
if deps:
@ -172,6 +172,15 @@ databuild_graph(
lookup = ":{name}_job_lookup",
visibility = ["//visibility:public"],
)
# Create tar archive of generated files for testing
genrule(
name = "existing_generated",
srcs = glob(["*.py", "BUILD.bazel"]),
outs = ["existing_generated.tar"],
cmd = "mkdir -p temp && cp $(SRCS) temp/ && find temp -exec touch -t 197001010000 {{}} + && tar -cf $@ -C temp .",
visibility = ["//visibility:public"],
)
'''
with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:

View file

@ -1,9 +1,15 @@
py_library(
name = "job_src",
srcs = glob(["**/*.py"]),
srcs = glob(["**/*.py"], exclude=["e2e_test_common.py"]),
visibility = ["//visibility:public"],
deps = [
"//databuild:py_proto",
"//databuild/dsl/python:dsl",
],
)
py_library(
name = "e2e_test_common",
srcs = ["e2e_test_common.py"],
visibility = ["//visibility:public"],
)

View file

@ -65,6 +65,14 @@ py_test(
],
)
py_test(
name = "test_e2e",
srcs = ["test_e2e.py"],
data = [":bazel_graph.build"],
main = "test_e2e.py",
deps = ["//databuild/test/app:e2e_test_common"],
)
# Bazel-defined
## Graph
databuild_graph(

View file

@ -4,7 +4,7 @@ from collections import defaultdict
import sys
import json
LABEL_BASE = "//databuild/test/app"
LABEL_BASE = "//databuild/test/app/bazel"
def lookup(raw_ref: str):

View file

@ -0,0 +1,37 @@
#!/usr/bin/env python3
"""
End-to-end test for the bazel-defined test app.
Tests the full pipeline: build execution -> output verification -> JSON validation.
"""
import os
from databuild.test.app.e2e_test_common import DataBuildE2ETestBase
class BazelE2ETest(DataBuildE2ETestBase):
"""End-to-end test for the bazel-defined test app."""
def test_end_to_end_execution(self):
"""Test full end-to-end execution of the bazel graph."""
# Build possible paths for the bazel graph build binary
possible_paths = self.get_standard_runfiles_paths(
'databuild/test/app/bazel/bazel_graph.build'
)
# Add fallback paths for local testing
possible_paths.extend([
'bazel-bin/databuild/test/app/bazel/bazel_graph.build',
'./bazel_graph.build'
])
# Find the graph build binary
graph_build_path = self.find_graph_build_binary(possible_paths)
# Execute and verify the graph build
self.execute_and_verify_graph_build(graph_build_path)
if __name__ == '__main__':
import unittest
unittest.main()

View file

@ -22,3 +22,33 @@ databuild_dsl_generator(
deps = [":dsl_src"],
visibility = ["//visibility:public"],
)
# Generate fresh DSL output for comparison testing
genrule(
name = "generate_fresh_dsl",
outs = ["generated_fresh.tar"],
cmd_bash = """
# Create temporary directory for generation
mkdir -p temp_workspace/databuild/test/app/dsl
# Set environment to generate to temp directory
export BUILD_WORKSPACE_DIRECTORY="temp_workspace"
# Run the generator
$(location :graph.generate)
# Create tar archive of generated files
if [ -d "temp_workspace/databuild/test/app/dsl/generated" ]; then
find temp_workspace/databuild/test/app/dsl/generated -exec touch -t 197001010000 {} +
tar -cf $@ -C temp_workspace/databuild/test/app/dsl/generated .
else
# Create empty tar if no files generated
tar -cf $@ -T /dev/null
fi
# Clean up
rm -rf temp_workspace
""",
tools = [":graph.generate"],
visibility = ["//visibility:public"],
)

View file

@ -0,0 +1,9 @@
We can't write a direct `bazel test` for the DSL-generated graph, because:
1. Bazel doesn't allow you to `bazel run graph.generate` to generate a BUILD.bazel that will be used in the same build.
2. We don't want to leak test generation into the graph generation code (since tests here are app-specific).
Instead, we use a two-phase process: we rely on the graph having already been generated here, and the generated package contains a test, so that `bazel test //...` covers the generated source as well. This implies that the generated source is checked in to git (gasp, I know), and we need a mechanism to ensure it stays up to date. To achieve this, we create a test that asserts that the contents of the `generated` dir are exactly the same as the output of a fresh run of `graph.generate`.
Our task is to implement this test that asserts equality between the two, e.g. the target could depend on `graph.generate`, run it during the test, and md5 the results, comparing them to the md5 of the existing generated dir.
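As a rough illustration of that equality check (the full test appears later in this diff), a sketch assuming both directories have been packed into deterministic tar archives by the genrules:
```
# Sketch of the consistency check; assumes both trees were packed into
# byte-stable tars (fixed mtimes, same entry order) so hashing the archives is meaningful.
import hashlib
from pathlib import Path

def tar_md5(tar_path: Path) -> str:
    return hashlib.md5(tar_path.read_bytes()).hexdigest()

def assert_generated_up_to_date(existing_tar: Path, fresh_tar: Path) -> None:
    if tar_md5(existing_tar) != tar_md5(fresh_tar):
        raise AssertionError(
            "Generated DSL code is out of date; "
            "re-run `bazel run //databuild/test/app/dsl:graph.generate`"
        )
```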

View file

@ -0,0 +1,71 @@
load("@databuild//databuild:rules.bzl", "databuild_job", "databuild_graph")
# Generated by DataBuild DSL - do not edit manually
# This file is generated in a subdirectory to avoid overwriting the original BUILD.bazel
py_binary(
name = "aggregate_color_votes_binary",
srcs = ["aggregate_color_votes.py"],
main = "aggregate_color_votes.py",
deps = ["@@//databuild/test/app/dsl:dsl_src"],
)
databuild_job(
name = "aggregate_color_votes",
binary = ":aggregate_color_votes_binary",
)
py_binary(
name = "color_vote_report_calc_binary",
srcs = ["color_vote_report_calc.py"],
main = "color_vote_report_calc.py",
deps = ["@@//databuild/test/app/dsl:dsl_src"],
)
databuild_job(
name = "color_vote_report_calc",
binary = ":color_vote_report_calc_binary",
)
py_binary(
name = "ingest_color_votes_binary",
srcs = ["ingest_color_votes.py"],
main = "ingest_color_votes.py",
deps = ["@@//databuild/test/app/dsl:dsl_src"],
)
databuild_job(
name = "ingest_color_votes",
binary = ":ingest_color_votes_binary",
)
py_binary(
name = "trailing_color_votes_binary",
srcs = ["trailing_color_votes.py"],
main = "trailing_color_votes.py",
deps = ["@@//databuild/test/app/dsl:dsl_src"],
)
databuild_job(
name = "trailing_color_votes",
binary = ":trailing_color_votes_binary",
)
py_binary(
name = "dsl_job_lookup",
srcs = ["dsl_job_lookup.py"],
deps = ["@@//databuild/test/app/dsl:dsl_src"],
)
databuild_graph(
name = "dsl_graph",
jobs = ["aggregate_color_votes", "color_vote_report_calc", "ingest_color_votes", "trailing_color_votes"],
lookup = ":dsl_job_lookup",
visibility = ["//visibility:public"],
)
# Create tar archive of generated files for testing
genrule(
name = "existing_generated",
srcs = glob(["*.py", "BUILD.bazel"]),
outs = ["existing_generated.tar"],
cmd = "mkdir -p temp && cp $(SRCS) temp/ && find temp -exec touch -t 197001010000 {} + && tar -cf $@ -C temp .",
visibility = ["//visibility:public"],
)

View file

@ -0,0 +1,58 @@
#!/usr/bin/env python3
"""
Generated job script for AggregateColorVotes.
"""
import sys
import json
from databuild.test.app.dsl.graph import AggregateColorVotes
from databuild.proto import PartitionRef, JobConfigureResponse, to_dict
def parse_outputs_from_args(args: list[str]) -> list:
"""Parse partition output references from command line arguments."""
outputs = []
for arg in args:
# Find which output type can deserialize this partition reference
for output_type in AggregateColorVotes.output_types:
try:
partition = output_type.deserialize(arg)
outputs.append(partition)
break
except ValueError:
continue
else:
raise ValueError(f"No output type in AggregateColorVotes can deserialize partition ref: {arg}")
return outputs
if __name__ == "__main__":
if len(sys.argv) < 2:
raise Exception(f"Invalid command usage")
command = sys.argv[1]
job_instance = AggregateColorVotes()
if command == "config":
# Parse output partition references as PartitionRef objects (for Rust wrapper)
output_refs = [PartitionRef(str=raw_ref) for raw_ref in sys.argv[2:]]
# Also parse them into DSL partition objects (for DSL job.config())
outputs = parse_outputs_from_args(sys.argv[2:])
# Call job's config method - returns list[JobConfig]
configs = job_instance.config(outputs)
# Wrap in JobConfigureResponse and serialize using to_dict()
response = JobConfigureResponse(configs=configs)
print(json.dumps(to_dict(response)))
elif command == "exec":
# The exec method expects a JobConfig but the Rust wrapper passes args
# For now, let the DSL job handle the args directly
# TODO: This needs to be refined based on actual Rust wrapper interface
job_instance.exec(*sys.argv[2:])
else:
raise Exception(f"Invalid command `{sys.argv[1]}`")

View file

@ -0,0 +1,58 @@
#!/usr/bin/env python3
"""
Generated job script for ColorVoteReportCalc.
"""
import sys
import json
from databuild.test.app.dsl.graph import ColorVoteReportCalc
from databuild.proto import PartitionRef, JobConfigureResponse, to_dict
def parse_outputs_from_args(args: list[str]) -> list:
"""Parse partition output references from command line arguments."""
outputs = []
for arg in args:
# Find which output type can deserialize this partition reference
for output_type in ColorVoteReportCalc.output_types:
try:
partition = output_type.deserialize(arg)
outputs.append(partition)
break
except ValueError:
continue
else:
raise ValueError(f"No output type in ColorVoteReportCalc can deserialize partition ref: {arg}")
return outputs
if __name__ == "__main__":
if len(sys.argv) < 2:
raise Exception(f"Invalid command usage")
command = sys.argv[1]
job_instance = ColorVoteReportCalc()
if command == "config":
# Parse output partition references as PartitionRef objects (for Rust wrapper)
output_refs = [PartitionRef(str=raw_ref) for raw_ref in sys.argv[2:]]
# Also parse them into DSL partition objects (for DSL job.config())
outputs = parse_outputs_from_args(sys.argv[2:])
# Call job's config method - returns list[JobConfig]
configs = job_instance.config(outputs)
# Wrap in JobConfigureResponse and serialize using to_dict()
response = JobConfigureResponse(configs=configs)
print(json.dumps(to_dict(response)))
elif command == "exec":
# The exec method expects a JobConfig but the Rust wrapper passes args
# For now, let the DSL job handle the args directly
# TODO: This needs to be refined based on actual Rust wrapper interface
job_instance.exec(*sys.argv[2:])
else:
raise Exception(f"Invalid command `{sys.argv[1]}`")

View file

@ -0,0 +1,53 @@
#!/usr/bin/env python3
"""
Generated job lookup for DataBuild DSL graph.
Maps partition patterns to job targets.
"""
import sys
import re
import json
from collections import defaultdict
# Mapping from partition patterns to job targets
JOB_MAPPINGS = {
r"daily_color_votes/(?P<data_date>\d{4}-\d{2}-\d{2})/(?P<color>[^/]+)": "//databuild/test/app/dsl/generated:ingest_color_votes",
r"color_votes_1m/(?P<data_date>\d{4}-\d{2}-\d{2})/(?P<color>[^/]+)": "//databuild/test/app/dsl/generated:trailing_color_votes",
r"color_votes_1w/(?P<data_date>\d{4}-\d{2}-\d{2})/(?P<color>[^/]+)": "//databuild/test/app/dsl/generated:trailing_color_votes",
r"daily_votes/(?P<data_date>\d{4}-\d{2}-\d{2})": "//databuild/test/app/dsl/generated:aggregate_color_votes",
r"votes_1w/(?P<data_date>\d{4}-\d{2}-\d{2})": "//databuild/test/app/dsl/generated:aggregate_color_votes",
r"votes_1m/(?P<data_date>\d{4}-\d{2}-\d{2})": "//databuild/test/app/dsl/generated:aggregate_color_votes",
r"color_vote_report/(?P<data_date>\d{4}-\d{2}-\d{2})/(?P<color>[^/]+)": "//databuild/test/app/dsl/generated:color_vote_report_calc",
}
def lookup_job_for_partition(partition_ref: str) -> str:
"""Look up which job can build the given partition reference."""
for pattern, job_target in JOB_MAPPINGS.items():
if re.match(pattern, partition_ref):
return job_target
raise ValueError(f"No job found for partition: {partition_ref}")
def main():
if len(sys.argv) < 2:
print("Usage: job_lookup.py <partition_ref> [partition_ref...]", file=sys.stderr)
sys.exit(1)
results = defaultdict(list)
try:
for partition_ref in sys.argv[1:]:
job_target = lookup_job_for_partition(partition_ref)
results[job_target].append(partition_ref)
# Output the results as JSON (matching existing lookup format)
print(json.dumps(dict(results)))
except ValueError as e:
print(f"ERROR: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,58 @@
#!/usr/bin/env python3
"""
Generated job script for IngestColorVotes.
"""
import sys
import json
from databuild.test.app.dsl.graph import IngestColorVotes
from databuild.proto import PartitionRef, JobConfigureResponse, to_dict
def parse_outputs_from_args(args: list[str]) -> list:
"""Parse partition output references from command line arguments."""
outputs = []
for arg in args:
# Find which output type can deserialize this partition reference
for output_type in IngestColorVotes.output_types:
try:
partition = output_type.deserialize(arg)
outputs.append(partition)
break
except ValueError:
continue
else:
raise ValueError(f"No output type in IngestColorVotes can deserialize partition ref: {arg}")
return outputs
if __name__ == "__main__":
if len(sys.argv) < 2:
raise Exception(f"Invalid command usage")
command = sys.argv[1]
job_instance = IngestColorVotes()
if command == "config":
# Parse output partition references as PartitionRef objects (for Rust wrapper)
output_refs = [PartitionRef(str=raw_ref) for raw_ref in sys.argv[2:]]
# Also parse them into DSL partition objects (for DSL job.config())
outputs = parse_outputs_from_args(sys.argv[2:])
# Call job's config method - returns list[JobConfig]
configs = job_instance.config(outputs)
# Wrap in JobConfigureResponse and serialize using to_dict()
response = JobConfigureResponse(configs=configs)
print(json.dumps(to_dict(response)))
elif command == "exec":
# The exec method expects a JobConfig but the Rust wrapper passes args
# For now, let the DSL job handle the args directly
# TODO: This needs to be refined based on actual Rust wrapper interface
job_instance.exec(*sys.argv[2:])
else:
raise Exception(f"Invalid command `{sys.argv[1]}`")

View file

@ -0,0 +1,58 @@
#!/usr/bin/env python3
"""
Generated job script for TrailingColorVotes.
"""
import sys
import json
from databuild.test.app.dsl.graph import TrailingColorVotes
from databuild.proto import PartitionRef, JobConfigureResponse, to_dict
def parse_outputs_from_args(args: list[str]) -> list:
"""Parse partition output references from command line arguments."""
outputs = []
for arg in args:
# Find which output type can deserialize this partition reference
for output_type in TrailingColorVotes.output_types:
try:
partition = output_type.deserialize(arg)
outputs.append(partition)
break
except ValueError:
continue
else:
raise ValueError(f"No output type in TrailingColorVotes can deserialize partition ref: {arg}")
return outputs
if __name__ == "__main__":
if len(sys.argv) < 2:
raise Exception(f"Invalid command usage")
command = sys.argv[1]
job_instance = TrailingColorVotes()
if command == "config":
# Parse output partition references as PartitionRef objects (for Rust wrapper)
output_refs = [PartitionRef(str=raw_ref) for raw_ref in sys.argv[2:]]
# Also parse them into DSL partition objects (for DSL job.config())
outputs = parse_outputs_from_args(sys.argv[2:])
# Call job's config method - returns list[JobConfig]
configs = job_instance.config(outputs)
# Wrap in JobConfigureResponse and serialize using to_dict()
response = JobConfigureResponse(configs=configs)
print(json.dumps(to_dict(response)))
elif command == "exec":
# The exec method expects a JobConfig but the Rust wrapper passes args
# For now, let the DSL job handle the args directly
# TODO: This needs to be refined based on actual Rust wrapper interface
job_instance.exec(*sys.argv[2:])
else:
raise Exception(f"Invalid command `{sys.argv[1]}`")

View file

@ -0,0 +1,7 @@
py_test(
name = "test_e2e",
srcs = ["test_e2e.py"],
data = ["//databuild/test/app/dsl/generated:dsl_graph.build"],
main = "test_e2e.py",
deps = ["//databuild/test/app:e2e_test_common"],
)

View file

@ -0,0 +1,37 @@
#!/usr/bin/env python3
"""
End-to-end test for the DSL-generated test app.
Tests the full pipeline: build execution -> output verification -> JSON validation.
"""
import os
from databuild.test.app.e2e_test_common import DataBuildE2ETestBase
class DSLGeneratedE2ETest(DataBuildE2ETestBase):
"""End-to-end test for the DSL-generated test app."""
def test_end_to_end_execution(self):
"""Test full end-to-end execution of the DSL-generated graph."""
# Build possible paths for the DSL-generated graph build binary
possible_paths = self.get_standard_runfiles_paths(
'databuild/test/app/dsl/generated/dsl_graph.build'
)
# Add fallback paths for local testing
possible_paths.extend([
'bazel-bin/databuild/test/app/dsl/generated/dsl_graph.build',
'./dsl_graph.build'
])
# Find the graph build binary
graph_build_path = self.find_graph_build_binary(possible_paths)
# Execute and verify the graph build
self.execute_and_verify_graph_build(graph_build_path)
if __name__ == '__main__':
import unittest
unittest.main()

View file

@ -73,3 +73,15 @@ py_test(
"//databuild/test/app/dsl:dsl_src",
],
)
# DSL generation consistency test
py_test(
name = "test_dsl_generation_consistency",
srcs = ["test_dsl_generation_consistency.py"],
main = "test_dsl_generation_consistency.py",
data = [
"//databuild/test/app/dsl:generate_fresh_dsl",
"//databuild/test/app/dsl/generated:existing_generated",
],
deps = [],
)

View file

@ -0,0 +1,105 @@
#!/usr/bin/env python3
"""
Test that verifies the generated DSL code is up-to-date.
This test ensures that the checked-in generated directory contents match
exactly what would be produced by a fresh run of graph.generate.
"""
import hashlib
import os
import subprocess
import tempfile
import unittest
from pathlib import Path
class TestDSLGenerationConsistency(unittest.TestCase):
def setUp(self):
# Find the test runfiles directory to locate tar files
runfiles_dir = os.environ.get("RUNFILES_DIR")
if runfiles_dir:
self.runfiles_root = Path(runfiles_dir) / "_main"
else:
# Fallback for development - not expected to work in this case
self.fail("RUNFILES_DIR not set - test must be run via bazel test")
def _compute_tar_hash(self, tar_path: Path) -> str:
"""Compute MD5 hash of a tar file's contents."""
if not tar_path.exists():
self.fail(f"Tar file not found: {tar_path}")
with open(tar_path, "rb") as f:
content = f.read()
return hashlib.md5(content).hexdigest()
def _extract_and_list_tar(self, tar_path: Path) -> set:
"""Extract tar file and return set of file paths and their content hashes."""
if not tar_path.exists():
return set()
result = subprocess.run([
"tar", "-tf", str(tar_path)
], capture_output=True, text=True)
if result.returncode != 0:
self.fail(f"Failed to list tar contents: {result.stderr}")
return set(result.stdout.strip().split('\n')) if result.stdout.strip() else set()
def test_generated_code_is_up_to_date(self):
"""Test that the existing generated tar matches the fresh generated tar."""
# Find the tar files from data dependencies
existing_tar = self.runfiles_root / "databuild/test/app/dsl/generated/existing_generated.tar"
fresh_tar = self.runfiles_root / "databuild/test/app/dsl/generated_fresh.tar"
# Compute hashes of both tar files
existing_hash = self._compute_tar_hash(existing_tar)
fresh_hash = self._compute_tar_hash(fresh_tar)
# Compare hashes
if existing_hash != fresh_hash:
# Provide detailed diff information
existing_files = self._extract_and_list_tar(existing_tar)
fresh_files = self._extract_and_list_tar(fresh_tar)
only_in_existing = existing_files - fresh_files
only_in_fresh = fresh_files - existing_files
error_msg = [
"Generated DSL code is out of date!",
f"Existing tar hash: {existing_hash}",
f"Fresh tar hash: {fresh_hash}",
"",
"To fix this, run:",
" bazel run //databuild/test/app/dsl:graph.generate",
""
]
if only_in_existing:
error_msg.extend([
"Files only in existing generated code:",
*[f" - {f}" for f in sorted(only_in_existing)],
""
])
if only_in_fresh:
error_msg.extend([
"Files only in fresh generated code:",
*[f" + {f}" for f in sorted(only_in_fresh)],
""
])
common_files = existing_files & fresh_files
if common_files:
error_msg.extend([
f"Common files: {len(common_files)}",
"This suggests files have different contents.",
])
self.fail("\n".join(error_msg))
if __name__ == "__main__":
unittest.main()

View file

@ -0,0 +1,103 @@
#!/usr/bin/env python3
"""
Common end-to-end test logic for DataBuild test apps.
Provides shared functionality for testing both bazel-defined and DSL-generated graphs.
"""
import json
import os
import shutil
import subprocess
import time
import unittest
from pathlib import Path
from typing import List, Optional
class DataBuildE2ETestBase(unittest.TestCase):
"""Base class for DataBuild end-to-end tests."""
def setUp(self):
"""Set up test environment."""
self.output_dir = Path("/tmp/data/color_votes_1w/2025-09-01/red")
self.output_file = self.output_dir / "data.json"
self.partition_ref = "color_votes_1w/2025-09-01/red"
# Clean up any existing test data
if self.output_dir.exists():
shutil.rmtree(self.output_dir)
def tearDown(self):
"""Clean up test environment."""
if self.output_dir.exists():
shutil.rmtree(self.output_dir)
def find_graph_build_binary(self, possible_paths: List[str]) -> str:
"""Find the graph.build binary from a list of possible paths."""
graph_build_path = None
for path in possible_paths:
if os.path.exists(path):
graph_build_path = path
break
self.assertIsNotNone(graph_build_path,
f"Graph build binary not found in any of: {possible_paths}")
return graph_build_path
def execute_and_verify_graph_build(self, graph_build_path: str) -> None:
"""Execute the graph build and verify the results."""
# Record start time for file modification check
start_time = time.time()
# Execute the graph build (shell script)
result = subprocess.run(
["bash", graph_build_path, self.partition_ref],
capture_output=True,
text=True
)
# Verify execution succeeded
self.assertEqual(result.returncode, 0,
f"Graph build failed with stderr: {result.stderr}")
# Verify output file was created
self.assertTrue(self.output_file.exists(),
f"Output file {self.output_file} was not created")
# Verify file was created recently (within 60 seconds)
file_mtime = os.path.getmtime(self.output_file)
time_diff = file_mtime - start_time
self.assertGreaterEqual(time_diff, -1, # Allow 1 second clock skew
f"File appears to be too old: {time_diff} seconds")
self.assertLessEqual(time_diff, 60,
f"File creation took too long: {time_diff} seconds")
# Verify file contains valid JSON
with open(self.output_file, 'r') as f:
content = f.read()
try:
data = json.loads(content)
except json.JSONDecodeError as e:
self.fail(f"Output file does not contain valid JSON: {e}")
# Basic sanity check on JSON structure
self.assertIsInstance(data, (dict, list),
"JSON should be an object or array")
def get_standard_runfiles_paths(self, relative_path: str) -> List[str]:
"""Get standard list of possible runfiles paths for a binary."""
runfiles_dir = os.environ.get("RUNFILES_DIR")
test_srcdir = os.environ.get("TEST_SRCDIR")
possible_paths = []
if runfiles_dir:
possible_paths.append(os.path.join(runfiles_dir, '_main', relative_path))
possible_paths.append(os.path.join(runfiles_dir, relative_path))
if test_srcdir:
possible_paths.append(os.path.join(test_srcdir, '_main', relative_path))
possible_paths.append(os.path.join(test_srcdir, relative_path))
return possible_paths