Object¶
Overview¶
The Object class (aaiclick/data/object.py) wraps a ClickHouse table. Each instance corresponds to one table; operator overloading creates new tables with results.
Key Features:
- Operator overloading for arithmetic, comparison, and bitwise operations
- Immutable structure — no in-place schema changes; insert() appends data but never mutates structure
- Support for scalars, arrays, and dictionaries
- Element-wise operations on arrays
See DataContext for lifecycle, schemas, and deployment modes.
API Quick Reference¶
| API | Category | Description | Section |
|---|---|---|---|
+, -, *, /, //, %, ** |
Arithmetic | Element-wise arithmetic | Arithmetic Operators |
==, !=, <, <=, >, >= |
Comparison | Element-wise comparison | Comparison Operators |
&, \|, ^ |
Bitwise | Bitwise AND / OR / XOR | Bitwise Operators |
.min(), .max(), .sum(), .mean() |
Aggregation | Reduce array to scalar | Aggregation Operators |
.std(), .var(), .count(), .quantile(q) |
Aggregation | Statistical reduction | Aggregation Operators |
.count_if(condition) |
Aggregation | Count rows matching condition(s) | Aggregation Operators |
.unique() |
Unique | Deduplicate values via GROUP BY | Unique |
.nunique() |
Unique | Count distinct values (fused) | Unique |
.isin(other) |
Membership | IN subquery test (returns UInt8 mask) | Membership Operator |
.match(p), .like(p), .ilike(p) |
String / Regex | Pattern matching (returns UInt8 mask) | String/Regex Operators |
.extract(p), .replace(p, r) |
String / Regex | Capture group extraction, regex replace | String/Regex Operators |
.year(), .month(), .day_of_week() |
Unary Transforms | Date/time extraction → new Object | Unary Transform Operators |
.lower(), .upper(), .length(), .trim() |
Unary Transforms | String transforms → new Object | Unary Transform Operators |
.abs(), .log2(), .sqrt() |
Unary Transforms | Math functions → new Object | Unary Transform Operators |
.group_by(keys).sum(col) etc. |
Group By | Aggregation with GROUP BY + optional HAVING | Group By Operations |
.group_by(keys).agg({col: spec}) |
Group By | Multiple aggregations (multi-agg per column) | Group By Operations |
.group_by(keys).any(col) |
Group By | Arbitrary non-NULL value per group | Group By Operations |
.group_by(keys).group_array_distinct(col) |
Group By | Distinct values → Array per group | Group By Operations |
.view(where, limit, offset, order_by) |
Views | Read-only View with optional filters | Views |
.where(cond) / .or_where(cond) |
Views | Fluent WHERE chaining (AND / OR) | Chained WHERE Clauses |
obj[key] / obj[[keys]] |
Views | Select column(s) from dict Object → View | Column Selection |
.with_columns({name: Computed(type, expr)}) |
Views | Add SQL expression columns → View | Computed Column Expansion |
literal(value, ch_type) |
Helpers | Constant column (Computed with SQL literal) |
Computed Column Expansion |
.with_year(col), .with_month(col) … |
Views | Named shortcuts for common computed columns | Domain Helpers |
.with_split_by_char(col, sep) |
Views | Split String column by separator → Array | Domain Helpers |
.with_isin(col, other) |
Views | IN subquery computed column → UInt8 | Domain Helpers |
.rename({old: new}) |
Views | Alias column names in a View | Column Renaming |
.explode(*columns, left=False) |
Views | Flatten Array column(s) into rows → View | Explode |
.copy() |
Copy | Materialize Object / View to new Object | copy() |
.insert(*sources) |
Ingest | Insert data from Objects / scalars / lists | insert() |
.insert_from_url(url) |
Ingest | Insert rows from a remote URL | insert_from_url() |
.concat(*sources) / concat(a, b, …) |
Ingest | Concatenate sources into a new Object | concat() |
.join(other, on, how) |
Join | Join two Objects on key columns → dict Object | join() |
.data(orient=…) |
Data Retrieval | Fetch results to Python (scalar / list / dict) | data() |
.markdown(truncate=…) |
Data Retrieval | Render data as markdown table | markdown() |
.execute(order_by, limit, offset) |
Data Retrieval | Run full query, discard rows → QueryStats |
execute() |
await .stats() |
Data Retrieval | QueryStats of the query that made this Object |
stats() |
.export(path) |
Export | Stream data to a file (extension → format) | export() |
Operator Support¶
All operators work element-wise on scalar and array data, creating new Object tables. See examples/basic_operators.py.
Scalar broadcast
Python scalars work on either side: obj * 2 and 2 * obj both work.
Lazy Operator Results (a + b and a.sum() are Plans, not Tables)¶
Every binary operator (+, -, *, /, //, %, **, ==, !=, <, <=, >, >=, &, |, ^), aggregation (.min() / .max() / .sum() / .mean() / .std() / .var() / .count() / .count_if() / .quantile() / .unique() / .nunique()), and unary transform (.year() / .month() / .day_of_week() / .lower() / .upper() / .length() / .trim() / .abs() / .log2() / .sqrt()) returns a LazyOperator — a subclass of Object that captures the operation plan (lhs, rhs, operator, precomputed result schema) without touching ClickHouse. The CREATE TABLE + INSERT INTO ... SELECT happens when the lazy is awaited.
# 1. Build a plan — pure Python, no DB call
plan = a + b # LazyOperator (no table yet)
total = a.sum() # also a LazyOperator
# 2. Read rows / scalars — async methods auto-materialize
rows = await (a + b).data() # creates t_<id>, reads rows
value = await a.sum().data() # one await — materialize + read
# 3. Get the materialized Object (for .table, legacy APIs, etc.)
obj = await (a + b) # table = t_<id>
scalar_obj = await a.sum() # table = t_<id> with one row
# 4. Control the table name — use .as_()
revenue = await (prices * quantities).as_("revenue") # t_revenue_<id>
daily_total = await orders.sum().as_("daily_total") # t_daily_total_<id>
# 5. Persist beyond the current context
daily = await (sales + bonuses).as_("daily", scope="job") # j_<job_id>_daily
yearly = await orders.sum().as_("yearly", scope="global") # p_yearly
await is only needed when the caller wants the materialized Object back — to read .table, hand it to an eager API, etc. Reading rows (.data(), .markdown(), .export(), .result()) or chaining further operators ((a + b) + c, obj.abs().sum()) doesn't require an explicit await of the intermediate — the lazy passes through.
.as_(name, scope=...)¶
Names the result table and chooses its lifetime. Returns a new LazyOperator (the receiver is unchanged):
scope |
Table name | Lifetime |
|---|---|---|
"temp_named" (default) |
t_<name>_<snowflake> |
Drops with the context |
"job" |
j_<job_id>_<name> |
Lives until the active orch job expires |
"global" |
p_<name> |
Persists; remove with delete_persistent_object(name, scope="global") |
name / scope on always-materializing ops¶
.copy(), .concat(), .join(), and every GroupByQuery aggregation
(.sum(), .mean(), .min(), .max(), .count(), .std(), .var(),
.any(), .group_array_distinct(), .agg()) accept the same name /
scope kwargs with identical semantics to .as_() above. These ops
always create a new table — no lazy chain to elide — so the kwargs land
directly on the existing async methods instead of going through
LazyOperator:
enriched = await a.join(b, on="user_id", name="enriched", scope="job")
# → j_<job_id>_enriched
totals = await orders.group_by("category").sum("amount", name="totals", scope="global")
# → p_totals
LazyOperator's overrides forward name / scope to the materialized
delegate, so (a + b).copy(name=..., scope=...) and
(a + b).join(b, name=..., scope=...) work as expected.
Invariants¶
- No DB writes until awaited. Creating a
LazyOperatoris pure-Python. A lazy that's never awaited never creates a table. - Sync
.tableraises pre-materialize. Reading.table(or any property derived from it —.scope,.persistent,.order_by) on an unawaited LazyOperator raisesRuntimeError..schemaworks (it's precomputed). - Async methods auto-materialize.
.data(),.sum(), etc. firstawait self(materializes once, cached) then delegate. .as_()is immutable. Calling it returns a new LazyOperator; the receiver keeps its (None, None) name/scope.- Re-await is idempotent (single-task).
_materializedcache ensures sequential awaits of the same lazy produce one table.
Chained operators¶
(a + b) + c builds a tree:
When awaited, each LazyOperator node materializes into its own table — no fusion. await ((a + b) + c).as_("outer") writes two tables: an unnamed temp for the inner a + b and a t_outer_<id> for the outer. A LazyOperator that's been materialized once is reused if it appears in multiple expressions.
Aggregations and unary transforms stack the same way. obj.abs().sum() builds:
await (a + b).sum().as_("total", scope="job") writes one unnamed temp for the inner + and j_<job_id>_total for the outer sum. Aggregations always materialize their input first — there is no SQL-level fusion across the chain.
Avoid: concurrent await of the same instance¶
asyncio.gather(lazy.data(), lazy.data()) races: both tasks see _materialized is None and both materialize. The result is two tables (the second wins the cache slot; the first is orphaned and dropped via refcount cleanup). To share a result across tasks, await once and share the returned Object.
Implementation: LazyOperator, _plan_operator, _plan_aggregation, _plan_unary_transform in aaiclick/data/object/object.py; shared schema-computation helpers in aaiclick/data/object/schema_compute.py (_preview_operator_schema, _preview_agg_schema, _preview_unary_schema, …); materialization in aaiclick/data/object/operators.py (_apply_operator_db, _apply_aggregation, unary_transform, …).
Arithmetic Operators
| Python Operator | Description | ClickHouse Equivalent | Forward Method | Reverse Method |
|---|---|---|---|---|
+ |
Addition | + |
__add__ |
__radd__ |
- |
Subtraction | - |
__sub__ |
__rsub__ |
* |
Multiplication | * |
__mul__ |
__rmul__ |
/ |
Division | / |
__truediv__ |
__rtruediv__ |
// |
Floor Division | intDiv() |
__floordiv__ |
__rfloordiv__ |
% |
Modulo | % |
__mod__ |
__rmod__ |
** |
Power | power() |
__pow__ |
__rpow__ |
Arithmetic Type Promotion¶
Arithmetic result types match ClickHouse's native promotion rules. See _promote_arithmetic_type() in aaiclick/data/object/schema_compute.py, validated by aaiclick/data/object/test_type_promotion.py against SELECT toTypeName().
Comparison Operators
| Python Operator | Description | ClickHouse Equivalent | Python Method |
|---|---|---|---|
== |
Equal | = |
__eq__ |
!= |
Not Equal | != |
__ne__ |
< |
Less Than | < |
__lt__ |
<= |
Less Than or Equal | <= |
__le__ |
> |
Greater Than | > |
__gt__ |
>= |
Greater Than or Equal | >= |
__ge__ |
Comparison operators don't need explicit reverse methods — Python swaps </> and <=/>= automatically (e.g., 5 < obj becomes obj > 5).
Bitwise Operators
| Python Operator | Description | ClickHouse Equivalent | Forward Method | Reverse Method |
|---|---|---|---|---|
& |
Bitwise AND | bitAnd() |
__and__ |
__rand__ |
\| |
Bitwise OR | bitOr() |
__or__ |
__ror__ |
^ |
Bitwise XOR | bitXor() |
__xor__ |
__rxor__ |
Aggregation Operators¶
Reduce an array to a scalar Object.
| Method | ClickHouse Function | Notes |
|---|---|---|
.min() |
min() |
|
.max() |
max() |
|
.sum() |
sum() |
|
.mean() |
avg() |
|
.std() |
stddevPop() |
|
.var() |
varPop() |
|
.count() |
count() |
|
.quantile(q) |
quantile(q)() |
Approximate |
.count_if(cond) |
countIf() |
Scalar (str) or dict Object (dict of conditions) |
Unique¶
| Method | ClickHouse Implementation | Notes |
|---|---|---|
.unique() |
GROUP BY |
Order not guaranteed |
.nunique() |
SELECT count() FROM (... GROUP BY value) |
Fused count distinct |
String/Regex Operators¶
Pattern matching on String columns. Each method takes a str pattern, returns a new Object, and is chainable (e.g., match() → sum() to count matches).
| Method | ClickHouse Function | Result Type | Description |
|---|---|---|---|
.match(p) |
match(val, p) |
UInt8 | RE2 regex match (0 or 1) |
.like(p) |
val LIKE p |
UInt8 | SQL LIKE (%, _ wildcards) |
.ilike(p) |
val ILIKE p |
UInt8 | Case-insensitive LIKE |
.extract(p) |
extract(val, p) |
String | Extract first capture group |
.replace(p, r) |
replaceRegexpAll(val, p, r) |
String | Replace all regex matches |
Note: ClickHouse uses RE2 regex syntax (no lookaheads/lookbehinds).
Membership Operator: isin()¶
UInt8 membership mask via ClickHouse IN subquery — all data stays in the database.
| Method | ClickHouse Equivalent | Result Type |
|---|---|---|
.isin(other) |
value IN (SELECT value FROM other_table) |
UInt8 |
Accepts an Object or a Python list (auto-converted to Object).
obj = await create_object_from_value(["a", "b", "c", "d"])
allowed = await create_object_from_value(["a", "c"])
mask = await obj.isin(allowed)
await mask.data() # [1, 0, 1, 0]
# Also works with a plain Python list
mask = await obj.isin(["a", "c"])
# Chain with sum() to count matches
total = await mask.sum()
await total.data() # 2
# Works on dict column selection
obj = await create_object_from_value({"category": ["a", "b", "c"], "val": [1, 2, 3]})
mask = await obj["category"].isin(allowed)
Tests: aaiclick/data/object/test_isin.py. For runnable examples, see examples/isin.py.
Unary Transform Operators¶
Implementation: aaiclick/data/object.py (methods) delegates to aaiclick/data/operators.py — see unary_transform()
Apply a ClickHouse function element-wise to the value column, returning a new Object. Object-level equivalents of Domain Helpers which operate on Views.
| Method | ClickHouse Function | Result Type | Category |
|---|---|---|---|
.year() |
toYear() |
UInt16 | Date/time |
.month() |
toMonth() |
UInt8 | Date/time |
.day_of_week() |
toDayOfWeek() |
UInt8 | Date/time |
.lower() |
lower() |
String | String |
.upper() |
upper() |
String | String |
.length() |
length() |
UInt64 | String |
.trim() |
trimBoth() |
String | String |
.abs() |
abs() |
Float64 | Math |
.log2() |
log2() |
Float64 | Math |
.sqrt() |
sqrt() |
Float64 | Math |
Results are full Objects — chainable with any operator (e.g., await (await obj.year()).unique()).
Tests: aaiclick/data/test_unary_transforms.py. For runnable examples, see examples/transforms.py.
Group By Operations¶
Pandas-style two-step: obj.group_by('key').sum('col'). See GroupByQuery class in aaiclick/data/object.py.
| Method | Description | Result Column Type |
|---|---|---|
.sum(col) |
Sum per group | Preserves int types, Float64 for float |
.mean(col) |
Average per group | Float64 |
.min(col) |
Minimum per group | Preserves source type |
.max(col) |
Maximum per group | Preserves source type |
.count() |
Count rows per group | UInt64 (column named _count) |
.std(col) |
Standard deviation per group | Float64 |
.var(col) |
Variance per group | Float64 |
.any(col) |
Arbitrary non-NULL per group | Preserves source type |
.group_array_distinct(col) |
Distinct values → Array per group | Array(T) where T is source type |
.agg({col: op}) |
Multiple aggregations | Per-function type rules |
agg() spec formats — the dict value (per source column) accepts three forms:
| Form | Example | Behavior |
|---|---|---|
str |
{'amount': GB_SUM} |
Alias = column name (backward compat) |
Agg(op, alias) |
{'amount': Agg(GB_SUM, 'total')} |
Single op with explicit alias |
[Agg(op, alias), ...] |
{'amount': [Agg(GB_SUM, 'amt_sum'), Agg(GB_MEAN, 'amt_avg')]} |
Multiple ops on the same column |
All three forms can be mixed in a single agg() call.
Multiple group keys, View support, and chained .having()/.or_having() for post-aggregation filtering (same AND/OR pattern as WHERE). .or_having() requires a prior .having(). Result is a normal dict Object. See examples/group_by.py.
Known gap: No Array(T) column support — groupArray(), groupUniqArray(), per-group concat not available.
Memory/Disk Settings¶
For large datasets, ClickHouse can spill to disk via max_bytes_before_external_sort, max_bytes_in_join, join_algorithm.
Ingest¶
concat()¶
Concatenates multiple sources into a new Object. Also available as standalone concat(a, b, ...). Self must be array; args can be Objects, scalars, or lists.
- Variadic:
obj.concat(a, b, c)— any number of sources in one call - Nullable promotion: if any source has nullable columns, the result column is promoted to
Nullable - Compatible types: all sources must have matching column names and compatible ClickHouse types
insert()¶
Inserts data from one or more sources into an existing Object. Target must be array; sources can be Objects, scalars, or lists. Missing columns get ClickHouse defaults.
- Variadic:
obj.insert(a, b, c)— any number of sources in one call - Subset columns: sources don't need all target columns
- View support: sources can be Views with
where(),with_columns(), field selection, etc. - Type casting: source column types are
CASTto target column types
insert_from_url()¶
Insert data from a URL into an existing Object. For create_object_from_url(), see DataContext.
Supported URL input formats¶
create_object_from_url() and insert_from_url() accept any format ClickHouse
can read via the url() table function. Pass the format name explicitly with
format=...; the file extension is not consulted. Headerless formats expose
columns as c1, c2, … because there are no names to bind.
| Format | Notes |
|---|---|
Parquet |
columnar, type-rich (default) |
ORC |
Apache ORC columnar |
Arrow |
reading not supported — use Parquet/ORC |
Avro |
container file with embedded schema |
CSV |
no header — columns must be c1, c2, … |
CSVWithNames |
header row provides column names |
CSVWithNamesAndTypes |
header + types row — DESCRIBE returns full types |
TSV |
no header — c1, c2, … |
TSVWithNames |
header row |
TSVWithNamesAndTypes |
header + types row |
JSON |
full ClickHouse JSON envelope ({meta, data, rows}) |
JSONEachRow |
newline-delimited JSON objects |
JSONCompactEachRow |
newline-delimited JSON arrays — c1, c2, … |
RawBLOB / JSONAsString |
JSON-blob mode for nested API responses (see below) |
# Tabular formats
await create_object_from_url(
"https://example.com/data.parquet",
columns=["id", "price", "name"],
format="Parquet",
)
# Headerless CSV — bind to positional column names
await create_object_from_url(
"https://example.com/raw.csv",
columns=["c1", "c2", "c3"],
format="CSV",
)
# Avro container with embedded schema
await create_object_from_url(
"https://example.com/events.avro",
columns=["id", "ts", "payload"],
format="Avro",
)
JSON-blob mode¶
For nested JSON APIs that wrap rows inside an envelope (e.g.
{"vulnerabilities": [...]}), pass RawBLOB or JSONAsString plus
json_path and json_columns. ClickHouse loads the whole document as a
single string and applies JSONExtract per field:
await create_object_from_url(
"https://services.nvd.nist.gov/rest/json/cves/2.0",
format="JSONAsString",
json_path="vulnerabilities",
json_columns={
"cve.id": ColumnInfo("String"),
"cve.published": ColumnInfo("DateTime64(3)"),
"cve.descriptions": ColumnInfo("String", array=True),
},
)
Type inference¶
By default the loader runs DESCRIBE (SELECT … FROM url(…) LIMIT 0) to
discover column types. Pass column_types= to skip the DESCRIBE round-trip
— useful for headerless CSV/TSV where ClickHouse may fail to infer numeric
types from a sample of zero rows.
Shared insert mechanics
Both insert() and concat() delegate to _insert_source() (aaiclick/data/ingest.py) — one INSERT INTO ... SELECT CAST(...) FROM source per source. Order follows argument order.
Row Order¶
Array Objects have no implicit row order. Reads return rows in ClickHouse's natural source order — stable within a process but not portable across runs. For deterministic order, name the column you want to sort on:
obj = await create_object_from_value([3, 1, 2])
await obj.data() # → [3, 1, 2] or [1, 2, 3] etc. (undefined)
await obj.data(order_by="value") # → [1, 2, 3]
await obj.view(order_by="value").data() # equivalent
Binary elementwise ops between two array Objects from different tables require order_by on both sides — the join uses it as the row-pairing key:
a = await create_object_from_value([1, 2, 3])
b = await create_object_from_value([10, 20, 30])
# a + b ← TypeError: cross-table op without explicit row order
result = await (a.view(order_by="value") + b.view(order_by="value"))
print(await result.data(order_by="value")) # → [11, 22, 33]
Same-table ops (a + a), scalar broadcast (a + 10), and aggregations don't need the contract — they have a natural alignment.
Cross-table array ops need order_by
a + b between two array Objects from different tables raises TypeError unless both sides are wrapped with .view(order_by=...). Row order is opt-in, not implicit.
Join¶
join()¶
Joins two Objects on one or more key columns into a new dict Object.
Implementation: aaiclick/data/object/join.py — see resolve_join_keys, build_join_schema, and join_objects_db.
async def join(
self,
other: Object,
*,
on: str | list[str] | None = None,
left_on: str | list[str] | None = None,
right_on: str | list[str] | None = None,
how: Literal["inner", "left", "right", "full", "cross"] = "inner",
suffixes: tuple[str, str] | None = None,
) -> Object
| Argument | Purpose |
|---|---|
other |
Right-hand Object. |
on |
Key column name(s) present under the same name in both Objects. |
left_on |
Left-side key(s) when names differ. Requires matching-length right_on. |
right_on |
Right-side key(s) when names differ. |
how |
Join type. Default "inner". |
suffixes |
Suffix handling for non-key column collisions. True → default ("_l", "_r"); a tuple picks custom suffixes; None / False → collision raises ValueError. |
Exactly one of {on} or {left_on, right_on} must be set, except how="cross" which forbids all three.
Join types¶
how |
ClickHouse keyword | Left cols nullable? | Right cols nullable? |
|---|---|---|---|
inner |
INNER JOIN |
no | no |
left |
LEFT JOIN |
no | yes (promoted) |
right |
RIGHT JOIN |
yes (promoted) | no |
full |
FULL OUTER JOIN |
yes (promoted) | yes (promoted) |
cross |
CROSS JOIN |
no | no |
Outer-join SQL is emitted with SETTINGS join_use_nulls = 1 so misses materialize as NULL instead of the outer-side type's default.
Column semantics¶
- Key dedup: under
on=the key appears once under its original name; underleft_on/right_onboth key columns survive.FULLjoins withon=coalesce the merged key across sides. - Collisions: non-key columns that appear on both sides raise
ValueErrorunlesssuffixes=True(shorthand for("_l", "_r")) or a customsuffixes=("_x", "_y")tuple is passed. Both sides get a suffix — empty suffixes are rejected. - Key types: compared via the same type-compatibility rule as
concat(String↔FixedStringOK,String↔Int64not). - LowCardinality / Array: preserved;
LowCardinality(String)on the outer side becomesLowCardinality(Nullable(String)). Join keys themselves must be scalar. - Order: join output has no intrinsic order. Pass
data(order_by=...)(or.view(order_by=...)) if deterministic order is required.
Examples¶
# Inner join on a shared key
users = await create_object_from_value({"id": [1, 2, 3], "name": ["A", "B", "C"]})
orders = await create_object_from_value({"id": [1, 1, 4], "total": [9.5, 14.0, 2.0]})
joined = await users.join(orders, on="id")
await joined.data(orient="records")
# → [{"id": 1, "name": "A", "total": 9.5}, {"id": 1, "name": "A", "total": 14.0}]
# Left join — unmatched rows keep left, right columns become NULL
enriched = await users.join(orders, on="id", how="left")
# → rows for id=2, 3 have total=None; schema marks 'total' Nullable(Float64)
# left_on / right_on when key names differ — both keys survive
orders2 = await create_object_from_value({"user_id": [1, 2], "total": [9.5, 3.0]})
await users.join(orders2, left_on="id", right_on="user_id")
# → rows include both 'id' and 'user_id'
# Suffixes on non-key collision — True picks the default ("_l", "_r")
a = await create_object_from_value({"id": [1, 2], "score": [10, 20]})
b = await create_object_from_value({"id": [1, 2], "score": [99, 88]})
merged = await a.join(b, on="id", suffixes=True)
# → columns: id, score_l, score_r
# Or supply a custom tuple:
merged = await a.join(b, on="id", suffixes=("_old", "_new"))
# Cross join — Cartesian product
colors = await create_object_from_value({"c": ["red", "blue"]})
sizes = await create_object_from_value({"s": ["S", "M"]})
skus = await colors.join(sizes, how="cross") # 4 rows
Distributed considerations¶
The default chdb backend is single-shard, so the naive JOIN suffices. For a future sharded backend, GLOBAL JOIN and join_algorithm hints (parallel_hash, partial_merge, grace_hash) are the next levers — deferred to docs/future.md. Nothing in the v1 API precludes a later strategy= kwarg.
Semi / anti / asof joins are deferred: each has semantic nuance (left-only output schema, required order_by, tolerance) worth its own mini-spec.
Data Retrieval¶
data()¶
Returns: scalar → value, array → list, dict → dict or list of dicts.
Orient (dict Objects):
| Constant | Value | Description |
|---|---|---|
ORIENT_DICT |
'dict' |
Dict with arrays as values (default) |
ORIENT_RECORDS |
'records' |
List of dicts (one per row) |
markdown()¶
Returns data as a plain-text markdown table with auto-sized columns. Optional truncate: dict[str, int] caps column widths. Floats → 2dp, None → N/A.
execute()¶
Runs the query this Object/View describes, discards every row server-side (FORMAT Null), and returns a QueryStats. Forces full compute without paying for transport — benchmark a View, warm caches, or assert a query runs without checking its output.
Rebuilds the same SELECT .data() would issue (honoring where, order_by, limit, offset, computed columns, renames, field selection), so a View's execute() measures its projection. Per-call order_by / limit / offset mirror .data(). No safety row cap applies — the whole pipeline runs unless you pass limit.
view = obj.view(where="score > 10").rename({"score": "s"})
stats = await view.execute()
print(stats.read_rows) # rows scanned by the full pipeline
print(stats.elapsed_s) # server-side wall time
# view's table is unchanged; no result rows came back
QueryStats is a frozen value type (exported from aaiclick). Every field is best-effort — a backend fills what it can surface and leaves the rest None:
| Field | Meaning | clickhouse-connect (HTTP) | chdb |
|---|---|---|---|
read_rows |
rows ClickHouse scanned | summary | storage_rows_read |
read_bytes |
uncompressed bytes scanned | summary | storage_bytes_read |
elapsed_s |
server-side wall time, seconds | elapsed_ns / 1e9 |
elapsed |
result_rows |
rows the SELECT produced | summary | None |
written_rows |
rows written by an INSERT | summary | None |
written_bytes |
bytes written by an INSERT | summary | None |
Tests: aaiclick/data/object/test_execute_stats.py
stats()¶
async, returns QueryStats | None. Populated on objects born from a server-side query; None everywhere else. Async (not a property) because awaiting it on an un-awaited LazyOperator first materializes it — same terminal semantics as .data().
| Object origin | await .stats() |
|---|---|
.copy() / .concat() / .join() |
stats of the INSERT … SELECT |
operator / aggregation / group_by result (incl. await a + b) |
stats of the materialization |
plain table-backed Object |
None |
unexecuted View |
None (it never ran a query) |
result = await big_view.copy()
stats = await result.stats()
print(stats.written_rows) # how many rows the copy wrote (HTTP)
print(stats.read_rows) # how many it scanned to produce them
export()¶
Export data to a local file. The format is picked from the file extension
and the data streams directly from ClickHouse — no Python round-trip, so
multi-million-row exports stay memory-bounded. View constraints (where,
limit, order_by) are honored. Returns the absolute path written.
await obj.export("/tmp/data.csv")
await obj.export("/tmp/data.parquet")
await obj.view(where="score > 10", limit=1000).export("/tmp/top.jsonl")
Supported extensions¶
| Extension | ClickHouse format | Notes |
|---|---|---|
.csv |
CSVWithNames |
header row included |
.tsv |
TSVWithNames |
header row included |
.json / .jsonl / .ndjson |
JSONEachRow |
newline-delimited JSON |
.parquet |
Parquet |
columnar, best for large datasets |
.arrow |
Arrow |
Apache Arrow IPC stream |
.orc |
ORC |
Apache ORC columnar |
.avro |
Avro |
Avro container with embedded schema |
.md |
Markdown |
pipe-delimited markdown table |
.xml |
XML |
one <row> element per record |
.sql |
SQLInsert |
INSERT INTO table VALUES (...) |
Compression¶
Append .gz or .xz and the writer compresses automatically:
await obj.export("/tmp/data.csv.gz")
await obj.export("/tmp/data.parquet.gz")
await obj.export("/tmp/data.json.xz")
Backend behavior¶
- chdb (embedded):
INSERT INTO FUNCTION file('path', fmt)— the embedded engine streams directly to disk and picks the compression codec from the path suffix. - clickhouse-connect (remote):
raw_stream(query, fmt=fmt)returns uncompressed bytes, which are copied to the local file in 64 KB chunks (on a worker thread) and wrapped throughgzip/lzmaif the path asks for it.INSERT INTO FUNCTION file()cannot be used here because it would write to the server'suser_files_path, not the client.
Views¶
Read-only filtered projection of an Object — same table, no data copy. Created via obj.view(where=..., limit=..., offset=..., order_by=...). Supports all read operations; cannot insert(). See examples/view_examples.py.
Chained WHERE Clauses¶
obj.where(cond)— creates View with initial WHEREview.where(cond)— AND-chains:.where('x > 10').where('y < 20')→WHERE (x > 10) AND (y < 20)view.or_where(cond)— OR-chains:.where('x > 100').or_where('y < 5')→WHERE (x > 100) OR (y < 5)
or_where() requires a prior where()
Calling or_where() without a preceding where() raises ValueError.
Same applies to or_having() on GroupByQuery.
Column Selection¶
Select columns from a dict Object → View (same table, restricted SELECT).
obj["col"]— single column → array-like View (singlevaluefield)obj[["col_a", "col_b"]]— multiple columns → dict-like View
obj = await create_object_from_value({"x": [1, 2, 3], "y": [10, 20, 30]})
# Single column — returns array-like View
await obj["x"].data() # [1, 2, 3]
arr = await obj["x"].copy() # new array Object
# Multiple columns — returns dict-like View
await obj[["x", "y"]].data() # {'x': [1, 2, 3], 'y': [10, 20, 30]}
Preserves WHERE and computed column constraints when chained on a filtered View.
Tests: aaiclick/data/object/test_column_selection.py
Computed Column Expansion: with_columns()¶
with_columns() adds derived columns (SQL expressions) as a lightweight View — no new table, no data copy.
API¶
Synchronous (no await). Works on both Object and View; preserves existing constraints. Uses Computed(type, expression) from aaiclick.data.models.
literal() Helper¶
Convenience wrapper for constant columns — handles SQL quoting:
from aaiclick import literal
# Before: manual quoting
obj.with_columns({"source": Computed("String", "'dataset_a'")})
# After: literal() handles it
obj.with_columns({"source": literal("dataset_a", "String")})
obj.with_columns({"flag": literal(True, "UInt8")})
obj.with_columns({"weight": literal(1.0, "Float64")})
Supported types: str (quoted), int/float (bare), bool (true/false).
Explode¶
Flattens Array column(s) into individual rows (scalar columns duplicated). Returns a View — downstream operators fuse into a single query. Exploded columns change from Array(T) to T. See aaiclick/data/examples/explode.py.
Tests: aaiclick/data/object/test_explode.py
Chaining¶
with_columns() returns a View, so all View operations work: group_by(), where(), column selection, further with_columns() calls (additive), and operators on selected columns.
with_columns() internals
Result is always a View with dict-like schema. Array sources promote to dict; scalar sources raise ValueError. Column name collisions raise ValueError — with_columns() adds, never replaces.
SQL expressions are passed verbatim to ClickHouse. Basic validation rejects semicolons and SELECT. Type mismatches are caught by ClickHouse at query time.
Domain Helpers¶
Named shortcuts that delegate to with_columns(). Each auto-names the result column; all accept alias= override and return a View.
| Helper | Default Alias | Type | Expression |
|---|---|---|---|
with_year(col) |
{col}_year |
UInt16 |
toYear(col) |
with_month(col) |
{col}_month |
UInt8 |
toMonth(col) |
with_day_of_week(col) |
{col}_dow |
UInt8 |
toDayOfWeek(col) |
with_date_diff(unit, col_a, col_b) |
{col_a}_{col_b}_diff |
Int64 |
dateDiff('unit', col_a, col_b) |
with_lower(col) |
{col}_lower |
String |
lower(col) |
with_upper(col) |
{col}_upper |
String |
upper(col) |
with_length(col) |
{col}_length |
UInt64 |
length(col) |
with_trim(col) |
{col}_trimmed |
String |
trim(col) |
with_abs(col) |
{col}_abs |
Float64 |
abs(col) |
with_log2(col) |
{col}_log2 |
Float64 |
log2(col) |
with_sqrt(col) |
{col}_sqrt |
Float64 |
sqrt(col) |
with_bucket(col, size) |
{col}_bucket |
Int64 |
intDiv(col, size) |
with_hash_bucket(col, n) |
{col}_hash |
UInt64 |
cityHash64(col) % n |
with_if(cond, then, else, *, alias) |
required alias |
String |
if(cond, then, else) |
with_cast(col, ch_type) |
{col}_{type_lower} |
ch_type |
to{Type}(col) |
with_split_by_char(col, sep) |
{col}_parts |
Array(String) |
splitByChar(sep, col) |
with_isin(col, other) |
{col}_isin |
UInt8 |
col IN (SELECT value FROM …) |
with_columns() remains the public power-user interface for arbitrary expressions via Computed(type, expression).
Tests: aaiclick/data/object/test_with_columns.py
Column Renaming: rename()¶
Returns a View aliasing column names (old AS new). Synchronous, no await.
from aaiclick import literal
# Rename camelCase columns to snake_case for a consolidated table
kev_view = kev.rename({
"cveID": "cve_id",
"vendorProject": "vendor",
"vulnerabilityName": "vulnerability_name",
}).with_columns({
"source": literal("kev", "String"),
})
await consolidated.insert(kev_view)
Chainable with with_columns(), where(), and other View operations. New names must not collide with non-renamed columns.
Tests: aaiclick/data/object/test_rename.py
The copy() Method¶
Full data copy → new Object. Works on both Objects and Views — filters, computed columns, and ORDER BY are preserved.
# Copy an array Object
obj_copy = await obj.copy()
# Materialize a View into a new Object
arr = await obj["x"].copy() # array Object from dict column
subset = await obj.where("x > 5").copy() # filtered copy
# Sorted copy — ORDER BY is preserved
sorted_copy = await obj.view(order_by="amount DESC").copy()
await sorted_copy.data() # returns rows sorted by amount DESC
The result carries the INSERT … SELECT stats — result = await obj.copy() then (await result.stats()).read_rows tells you how much it scanned. See stats().
copy() is not serialized across workers
Unlike insert() and concat(), copy() does not take a per-table
advisory lock. Two concurrent copy() calls into the same named
destination will interleave their rows non-deterministically. Structure
your pipeline so a named destination has a single writer.
Tests: aaiclick/data/test_copy_parametrized.py
Operation Provenance (Oplog)¶
All Object operations are instrumented to record provenance via OplogCollector. See docs/oplog.md.