Example Projects
End-to-end projects demonstrating aaiclick orchestration with real-world data.
Each project lives in aaiclick/example_projects/<name>/ and includes a shell
script as the single entry point.
Basic Lineage
AI-powered lineage explanation for a revenue pipeline. Builds a
prices * quantities + bonus computation, traces its full backward and forward
lineage graphs, then uses an LLM to explain how the result was produced. The AI
step runs only when AAICLICK_AI_API_KEY is set; without it, the example prints
the lineage graphs and skips the LLM explanation.
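The arithmetic and the API-key gate described above can be sketched in plain Python. This is not the aaiclick API: `revenue` and `ai_step_enabled` are hypothetical helpers, the bonus is assumed to be a single scalar, and "set" is assumed to mean "set to a non-empty value".

```python
import os


def revenue(prices, quantities, bonus):
    """Element-wise prices * quantities + bonus over plain Python lists.

    Assumption: bonus is one scalar added to every row.
    """
    return [p * q + bonus for p, q in zip(prices, quantities)]


def ai_step_enabled(env=None):
    """Return True when AAICLICK_AI_API_KEY holds a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("AAICLICK_AI_API_KEY"))


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    print(revenue([10.0, 20.0], [2, 3], 5.0))
    if ai_step_enabled():
        print("LLM explanation would run here")
    else:
        print("AAICLICK_AI_API_KEY not set; skipping the LLM explanation")
```

Passing the environment as an argument keeps the gate testable without touching the real process environment.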
Basic Worker
A minimal orchestration example that registers a job printing 6 ticks at 0.5-second intervals, then executes it via a worker.
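The body of such a job might look like the sketch below. It shows only the tick loop, not how aaiclick registers or dispatches jobs; `tick_job` and its parameters are hypothetical names, and pausing between ticks rather than after every tick is an assumption.

```python
import time


def tick_job(count=6, interval=0.5, sleep=time.sleep, emit=print):
    """Emit `count` ticks, pausing `interval` seconds between them."""
    for i in range(1, count + 1):
        emit(f"tick {i}/{count}")
        if i < count:
            # Assumption: no pause is needed after the final tick.
            sleep(interval)
    return count


if __name__ == "__main__":
    tick_job()
```

The `sleep` and `emit` parameters exist so the loop can be exercised without real delays or console output.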
chdb Benchmark
Compares aaiclick's Object API against native chdb SQL on identical data (1M rows, 10 runs averaged). Measures ingest, column sum, multiply, filter, sort, count distinct, and group-by operations.
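The "10 runs averaged" measurement can be approximated with a small timing harness. This is a generic sketch under stated assumptions, not the benchmark's actual code: `average_runtime` is a hypothetical helper, and the summed Python list merely stands in for a real aaiclick or chdb operation.

```python
import time
from statistics import mean


def average_runtime(fn, runs=10, clock=time.perf_counter):
    """Call fn() `runs` times and return the mean wall-clock duration."""
    timings = []
    for _ in range(runs):
        start = clock()
        fn()
        timings.append(clock() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in workload: sum one million integers held in memory.
    data = list(range(1_000_000))
    print(f"column sum: {average_runtime(lambda: sum(data)):.6f} s")
```

Timing both implementations with the same harness and the same input keeps the comparison like-for-like.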
Cyber Threat Feeds — Multi-Source Normalization
Multi-source cybersecurity pipeline that loads CISA KEV, Shodan CVEDB, and FIRST EPSS data (JSON, gzip CSV) directly into ClickHouse via URL ingestion, normalizes and consolidates them into an AggregatingMergeTree table keyed by CVE ID, and produces a threat intelligence report.
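Conceptually, consolidation keyed by CVE ID folds every record that shares an ID into one row. The sketch below is an in-memory analogy in plain Python, not the project's ClickHouse schema: the `cve_id`, `in_kev`, and `epss` field names are assumed normalized names, the sample IDs are invented, and "last non-null value wins" is a simplification of what an AggregatingMergeTree aggregate would do.

```python
def consolidate(*sources):
    """Merge records from several feeds into one row per CVE ID."""
    merged = {}
    for records in sources:
        for record in records:
            cve_id = record["cve_id"]
            row = merged.setdefault(cve_id, {"cve_id": cve_id})
            for key, value in record.items():
                # Assumption: a later non-null value overrides an earlier one.
                if value is not None:
                    row[key] = value
    return merged


if __name__ == "__main__":
    # Invented records, for illustration only.
    kev = [{"cve_id": "CVE-0000-0001", "in_kev": True}]
    epss = [
        {"cve_id": "CVE-0000-0001", "epss": 0.9},
        {"cve_id": "CVE-0000-0002", "epss": 0.1},
    ]
    for row in consolidate(kev, epss).values():
        print(row)
```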
IMDb Dataset Builder
Large-scale data curation pipeline that loads IMDb title.basics (~10M rows) from the official dataset URL and profiles the raw data. It filters to quality movies (1980 or later, 40–300 min runtime, non-adult) and normalizes genres via explode. Each title is then enriched with Wikipedia plot text: titles are resolved through the Wikidata P345 (IMDb ID) property, and the matches are merged through an AggregatingMergeTree table against the Hugging Face wikimedia/wikipedia Parquet dump. Optionally, the pipeline publishes the curated dataset to Hugging Face as Parquet.
# Demo mode (500k rows)
./imdb_dataset_builder.sh
# Full dataset (~10M rows)
./imdb_dataset_builder.sh --full
Set HF_TOKEN to publish the curated dataset to Hugging Face Hub. Set AIRTABLE_API_KEY and AIRTABLE_BASE_ID (the table name defaults to IMDB) to publish a ~200-row, genre-stratified showcase sample to Airtable.
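The quality filter and the genre explode step can be illustrated on plain dictionaries. This is a sketch, not the pipeline's code: it assumes the column names of the public title.basics file (`tconst`, `startYear`, `runtimeMinutes`, `isAdult`, `genres`), string-typed values with `\N` marking missing data, and inclusive bounds. `is_quality_title` and `explode_genres` are hypothetical helpers.

```python
MISSING = "\\N"  # title.basics marks missing values with a literal \N


def is_quality_title(row):
    """Apply the stated thresholds: 1980 or later, 40-300 minutes, non-adult."""
    try:
        year = int(row["startYear"])
        runtime = int(row["runtimeMinutes"])
    except (KeyError, TypeError, ValueError):
        return False  # missing or non-numeric year or runtime
    return year >= 1980 and 40 <= runtime <= 300 and row.get("isAdult") == "0"


def explode_genres(row):
    """Turn a comma-separated genres string into one row per genre."""
    return [
        {"tconst": row["tconst"], "genre": genre}
        for genre in row["genres"].split(",")
        if genre and genre != MISSING
    ]


if __name__ == "__main__":
    # Invented sample row, for illustration only.
    sample = {"tconst": "tt9999991", "startYear": "1994",
              "runtimeMinutes": "142", "isAdult": "0", "genres": "Crime,Drama"}
    if is_quality_title(sample):
        print(explode_genres(sample))
```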
NYC Taxi Pipeline
Distributed computing example that loads NYC TLC Yellow Taxi trip data from Parquet URLs, runs parallel analysis tasks (basic stats, statistical metrics, group-by, tip and distance analysis), and produces a summary report.
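The fan-out of independent analysis tasks can be mimicked in a single process with a thread pool. This is only a stand-in for the distributed execution the project performs: `run_parallel`, `basic_stats`, and `tip_analysis` are hypothetical names, the trip records are invented, and the sketch assumes the `trip_distance`, `fare_amount`, and `tip_amount` columns of the TLC Yellow Taxi schema.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def basic_stats(trips):
    """Trip count and mean distance."""
    return {"trips": len(trips),
            "mean_distance": mean(t["trip_distance"] for t in trips)}


def tip_analysis(trips):
    """Mean tip as a fraction of the fare, skipping zero-fare trips."""
    shares = [t["tip_amount"] / t["fare_amount"] for t in trips if t["fare_amount"]]
    return {"mean_tip_share": mean(shares)}


def run_parallel(trips, tasks):
    """Run each task on the same data concurrently and collect a report."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, trips) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Invented trips, for illustration only.
    trips = [
        {"trip_distance": 1.0, "fare_amount": 10.0, "tip_amount": 2.0},
        {"trip_distance": 3.0, "fare_amount": 20.0, "tip_amount": 4.0},
    ]
    print(run_parallel(trips, {"basic": basic_stats, "tips": tip_analysis}))
```

Because the tasks only read the shared input, they need no coordination beyond collecting their results into the final report.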