# Podcast Reviews Example

This is an example data application that produces text insights from podcast review data. It is made up of six datasets:

- Raw reviews `(date, podcast, text, rating)`
- Podcasts `(podcast, title, category)`
- Categorized review text `(date, category, podcast, text)`
- Phrase models `(date, category, hash, ngram, score)`
- Podcast phrase stats `(date, category, podcast, ngram, count, rating)`
- Podcast daily summary `(date, category, podcast, phrase_stats, recent_reviews)`

```mermaid
flowchart LR
    raw_reviews[(Raw Reviews)] & podcasts[(Podcasts)] --> categorize_text --> categorized_texts[(Categorized Texts)]
    categorized_texts --> phrase[Phrase Modeling] --> phrase_models[(Phrase Models)]
    phrase_models & raw_reviews --> phrase_stats --> podcast_phrase_stats[(Podcast Phrase Stats)]
    podcast_phrase_stats & raw_reviews --> calc_summary --> podcast_daily_summary[(Podcast Daily Summary)]
```

## Input Data

Download the dataset from [Kaggle](https://www.kaggle.com/datasets/thoughtvector/podcastreviews/versions/28?select=database.sqlite) and place it at `examples/podcast_reviews/data/ingest/database.sqlite`.

## `phrase` Dependency

This example relies on [`soaxelbrooke/phrase`](https://github.com/soaxelbrooke/phrase) for phrase extraction. Check its [releases](https://github.com/soaxelbrooke/phrase/releases) for a suitable binary.
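Before building, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the example itself: `check_prereqs` is a hypothetical helper, and it assumes the `phrase` binary is reachable on your `PATH` under the name `phrase` (this README does not say where the binary should live, so adjust as needed).

```sh
# Hypothetical helper: verifies the input database and the phrase binary.
# $1 = path to the SQLite input file
# $2 = name or path of the phrase binary (assumed default: "phrase" on PATH)
check_prereqs() {
  db_path="$1"
  phrase_bin="${2:-phrase}"
  rc=0

  # The input database must exist as a regular file.
  if [ ! -f "$db_path" ]; then
    echo "missing input database: $db_path" >&2
    rc=1
  fi

  # The phrase binary must be resolvable as a command.
  if ! command -v "$phrase_bin" >/dev/null 2>&1; then
    echo "phrase binary not found: $phrase_bin" >&2
    rc=1
  fi

  return "$rc"
}
```

For example, `check_prereqs examples/podcast_reviews/data/ingest/database.sqlite` returns a non-zero status and prints what is missing if either prerequisite is absent.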
## Building Output Partitions

### CLI Build

Use the DataBuild CLI to build specific partitions:

```bash
bazel build //:podcast_reviews_graph.build
# Builds bazel-bin/podcast_reviews_graph.build

# Build raw reviews for a specific date
bazel-bin/podcast_reviews_graph.build "reviews/date=2020-01-01"

# Build categorized reviews
bazel-bin/podcast_reviews_graph.build "categorized_reviews/category=Technology/date=2020-01-01"

# Build phrase models
bazel-bin/podcast_reviews_graph.build "phrase_models/category=Technology/date=2020-01-01"

# Build daily summaries
bazel-bin/podcast_reviews_graph.build "daily_summaries/category=Technology/date=2020-01-01"

# Build all podcasts data
bazel-bin/podcast_reviews_graph.build "podcasts/all"
```

### Service Build

Use the Build Graph Service for HTTP API access:

```bash
# Start the service
bazel build //:podcast_reviews_graph.service
bazel-bin/podcast_reviews_graph.service

# Submit build request for reviews
curl -X POST http://localhost:8080/api/v1/builds \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["reviews/date=2020-01-01"]}'

# Submit build request for daily summary (builds entire pipeline)
curl -X POST http://localhost:8080/api/v1/builds \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["daily_summaries/category=Technology/date=2020-01-01"]}'

# Check build status
curl http://localhost:8080/api/v1/builds/BUILD_REQUEST_ID

# Get partition status
curl http://localhost:8080/api/v1/partitions/reviews%2Fdate%3D2020-01-01/status

# Get partition events
curl http://localhost:8080/api/v1/partitions/reviews%2Fdate%3D2020-01-01/events

# Analyze build graph (planning only)
curl -X POST http://localhost:8080/api/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"partitions": ["daily_summaries/category=Technology/date=2020-01-01"]}'
```

### Partition Reference Patterns

The following partition reference patterns are supported:

- `reviews/date=YYYY-MM-DD` - Raw reviews for a specific date
- `podcasts/all` - All podcasts metadata
- `categorized_reviews/category=CATEGORY/date=YYYY-MM-DD` - Categorized reviews
- `phrase_models/category=CATEGORY/date=YYYY-MM-DD` - Phrase models
- `phrase_stats/category=CATEGORY/date=YYYY-MM-DD` - Phrase statistics
- `daily_summaries/category=CATEGORY/date=YYYY-MM-DD` - Daily summaries (the "output")
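
The partition status and events endpoints shown earlier take the partition reference as a single percent-encoded path segment (`/` becomes `%2F`, `=` becomes `%3D`). The sketch below builds those URLs from a plain reference. `urlencode_ref` and `partition_url` are hypothetical helper names, and the encoder deliberately handles only `%`, `/`, `=`, and spaces, which covers the patterns listed above but is not a general URL encoder.

```sh
# Base address of the service, as used in the curl examples in this README.
BASE_URL="http://localhost:8080"

# Hypothetical helper: percent-encode the characters that appear in
# partition references. '%' is encoded first so later output is not re-encoded.
urlencode_ref() {
  printf '%s' "$1" | sed -e 's/%/%25/g' -e 's|/|%2F|g' -e 's/=/%3D/g' -e 's/ /%20/g'
}

# Hypothetical helper: build a partition endpoint URL.
# $1 = partition reference, $2 = "status" or "events"
partition_url() {
  printf '%s/api/v1/partitions/%s/%s\n' "$BASE_URL" "$(urlencode_ref "$1")" "$2"
}
```

With these defined, `curl "$(partition_url 'reviews/date=2020-01-01' status)"` requests the same URL as the hand-encoded status example in the Service Build section.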