Skip to content

Object

Overview

The Object class (aaiclick/data/object.py) wraps a ClickHouse table. Each instance corresponds to one table; operator overloading creates new tables with results.

Key Features: - Operator overloading for arithmetic, comparison, and bitwise operations - Immutable structure — no in-place schema changes; insert() appends data but never mutates structure - Support for scalars, arrays, and dictionaries - Element-wise operations on arrays

See DataContext for lifecycle, schemas, and deployment modes.

API Quick Reference

API Category Description Section
+, -, *, /, //, %, ** Arithmetic Element-wise arithmetic Arithmetic Operators
==, !=, <, <=, >, >= Comparison Element-wise comparison Comparison Operators
&, \|, ^ Bitwise Bitwise AND / OR / XOR Bitwise Operators
.min(), .max(), .sum(), .mean() Aggregation Reduce array to scalar Aggregation Operators
.std(), .var(), .count(), .quantile(q) Aggregation Statistical reduction Aggregation Operators
.count_if(condition) Aggregation Count rows matching condition(s) Aggregation Operators
.unique() Unique Deduplicate values via GROUP BY Unique
.nunique() Unique Count distinct values (fused) Unique
.isin(other) Membership IN subquery test (returns UInt8 mask) Membership Operator
.match(p), .like(p), .ilike(p) String / Regex Pattern matching (returns UInt8 mask) String/Regex Operators
.extract(p), .replace(p, r) String / Regex Capture group extraction, regex replace String/Regex Operators
.year(), .month(), .day_of_week() Unary Transforms Date/time extraction → new Object Unary Transform Operators
.lower(), .upper(), .length(), .trim() Unary Transforms String transforms → new Object Unary Transform Operators
.abs(), .log2(), .sqrt() Unary Transforms Math functions → new Object Unary Transform Operators
.group_by(keys).sum(col) etc. Group By Aggregation with GROUP BY + optional HAVING Group By Operations
.group_by(keys).agg({col: spec}) Group By Multiple aggregations (multi-agg per column) Group By Operations
.group_by(keys).any(col) Group By Arbitrary non-NULL value per group Group By Operations
.group_by(keys).group_array_distinct(col) Group By Distinct values → Array per group Group By Operations
.view(where, limit, offset, order_by) Views Read-only View with optional filters Views
.where(cond) / .or_where(cond) Views Fluent WHERE chaining (AND / OR) Chained WHERE Clauses
obj[key] / obj[[keys]] Views Select column(s) from dict Object → View Column Selection
.with_columns({name: Computed(type, expr)}) Views Add SQL expression columns → View Computed Column Expansion
literal(value, ch_type) Helpers Constant column (Computed with SQL literal) Computed Column Expansion
.with_year(col), .with_month(col) Views Named shortcuts for common computed columns Domain Helpers
.with_split_by_char(col, sep) Views Split String column by separator → Array Domain Helpers
.with_isin(col, other) Views IN subquery computed column → UInt8 Domain Helpers
.rename({old: new}) Views Alias column names in a View Column Renaming
.explode(*columns, left=False) Views Flatten Array column(s) into rows → View Explode
.copy() Copy Materialize Object / View to new Object copy()
.insert(*sources) Ingest Insert data from Objects / scalars / lists insert()
.insert_from_url(url) Ingest Insert rows from a remote URL insert_from_url()
.concat(*sources) / concat(a, b, …) Ingest Concatenate sources into a new Object concat()
.join(other, on, how) Join Join two Objects on key columns → dict Object join()
.data(orient=…) Data Retrieval Fetch results to Python (scalar / list / dict) data()
.markdown(truncate=…) Data Retrieval Render data as markdown table markdown()
.execute(order_by, limit, offset) Data Retrieval Run full query, discard rows → QueryStats execute()
await .stats() Data Retrieval QueryStats of the query that made this Object stats()
.export(path) Export Stream data to a file (extension → format) export()

Operator Support

All operators work element-wise on scalar and array data, creating new Object tables. See examples/basic_operators.py.

Scalar broadcast

Python scalars work on either side: obj * 2 and 2 * obj both work.

Lazy Operator Results (a + b and a.sum() are Plans, not Tables)

Every binary operator (+, -, *, /, //, %, **, ==, !=, <, <=, >, >=, &, |, ^), aggregation (.min() / .max() / .sum() / .mean() / .std() / .var() / .count() / .count_if() / .quantile() / .unique() / .nunique()), and unary transform (.year() / .month() / .day_of_week() / .lower() / .upper() / .length() / .trim() / .abs() / .log2() / .sqrt()) returns a LazyOperator — a subclass of Object that captures the operation plan (lhs, rhs, operator, precomputed result schema) without touching ClickHouse. The CREATE TABLE + INSERT INTO ... SELECT happens when the lazy is awaited.

# 1. Build a plan — pure Python, no DB call
plan = a + b                                       # LazyOperator (no table yet)
total = a.sum()                                    # also a LazyOperator

# 2. Read rows / scalars — async methods auto-materialize
rows = await (a + b).data()                        # creates t_<id>, reads rows
value = await a.sum().data()                       # one await — materialize + read

# 3. Get the materialized Object (for .table, legacy APIs, etc.)
obj = await (a + b)                                # table = t_<id>
scalar_obj = await a.sum()                         # table = t_<id> with one row

# 4. Control the table name — use .as_()
revenue = await (prices * quantities).as_("revenue")            # t_revenue_<id>
daily_total = await orders.sum().as_("daily_total")             # t_daily_total_<id>

# 5. Persist beyond the current context
daily = await (sales + bonuses).as_("daily", scope="job")       # j_<job_id>_daily
yearly = await orders.sum().as_("yearly", scope="global")       # p_yearly

await is only needed when the caller wants the materialized Object back — to read .table, hand it to an eager API, etc. Reading rows (.data(), .markdown(), .export(), .result()) or chaining further operators ((a + b) + c, obj.abs().sum()) doesn't require an explicit await of the intermediate — the lazy passes through.

.as_(name, scope=...)

Names the result table and chooses its lifetime. Returns a new LazyOperator (the receiver is unchanged):

scope Table name Lifetime
"temp_named" (default) t_<name>_<snowflake> Drops with the context
"job" j_<job_id>_<name> Lives until the active orch job expires
"global" p_<name> Persists; remove with delete_persistent_object(name, scope="global")

name / scope on always-materializing ops

.copy(), .concat(), .join(), and every GroupByQuery aggregation (.sum(), .mean(), .min(), .max(), .count(), .std(), .var(), .any(), .group_array_distinct(), .agg()) accept the same name / scope kwargs with identical semantics to .as_() above. These ops always create a new table — no lazy chain to elide — so the kwargs land directly on the existing async methods instead of going through LazyOperator:

enriched = await a.join(b, on="user_id", name="enriched", scope="job")
# → j_<job_id>_enriched

totals = await orders.group_by("category").sum("amount", name="totals", scope="global")
# → p_totals

LazyOperator's overrides forward name / scope to the materialized delegate, so (a + b).copy(name=..., scope=...) and (a + b).join(b, name=..., scope=...) work as expected.

Invariants

  • No DB writes until awaited. Creating a LazyOperator is pure-Python. A lazy that's never awaited never creates a table.
  • Sync .table raises pre-materialize. Reading .table (or any property derived from it — .scope, .persistent, .order_by) on an unawaited LazyOperator raises RuntimeError. .schema works (it's precomputed).
  • Async methods auto-materialize. .data(), .sum(), etc. first await self (materializes once, cached) then delegate.
  • .as_() is immutable. Calling it returns a new LazyOperator; the receiver keeps its (None, None) name/scope.
  • Re-await is idempotent (single-task). _materialized cache ensures sequential awaits of the same lazy produce one table.

Chained operators

(a + b) + c builds a tree:

LazyOperator(op="+", lhs=LazyOperator(op="+", lhs=a, rhs=b), rhs=c)

When awaited, each LazyOperator node materializes into its own table — no fusion. await ((a + b) + c).as_("outer") writes two tables: an unnamed temp for the inner a + b and a t_outer_<id> for the outer. A LazyOperator that's been materialized once is reused if it appears in multiple expressions.

Aggregations and unary transforms stack the same way. obj.abs().sum() builds:

LazyOperator(op="sum", lhs=LazyOperator(op="abs", lhs=obj, rhs=None), rhs=None)

await (a + b).sum().as_("total", scope="job") writes one unnamed temp for the inner + and j_<job_id>_total for the outer sum. Aggregations always materialize their input first — there is no SQL-level fusion across the chain.

Avoid: concurrent await of the same instance

asyncio.gather(lazy.data(), lazy.data()) races: both tasks see _materialized is None and both materialize. The result is two tables (the second wins the cache slot; the first is orphaned and dropped via refcount cleanup). To share a result across tasks, await once and share the returned Object.

Implementation: LazyOperator, _plan_operator, _plan_aggregation, _plan_unary_transform in aaiclick/data/object/object.py; shared schema-computation helpers in aaiclick/data/object/schema_compute.py (_preview_operator_schema, _preview_agg_schema, _preview_unary_schema, …); materialization in aaiclick/data/object/operators.py (_apply_operator_db, _apply_aggregation, unary_transform, …).

Arithmetic Operators
Python Operator Description ClickHouse Equivalent Forward Method Reverse Method
+ Addition + __add__ __radd__
- Subtraction - __sub__ __rsub__
* Multiplication * __mul__ __rmul__
/ Division / __truediv__ __rtruediv__
// Floor Division intDiv() __floordiv__ __rfloordiv__
% Modulo % __mod__ __rmod__
** Power power() __pow__ __rpow__

Arithmetic Type Promotion

Arithmetic result types match ClickHouse's native promotion rules. See _promote_arithmetic_type() in aaiclick/data/object/schema_compute.py, validated by aaiclick/data/object/test_type_promotion.py against SELECT toTypeName().

Comparison Operators
Python Operator Description ClickHouse Equivalent Python Method
== Equal = __eq__
!= Not Equal != __ne__
< Less Than < __lt__
<= Less Than or Equal <= __le__
> Greater Than > __gt__
>= Greater Than or Equal >= __ge__

Comparison operators don't need explicit reverse methods — Python swaps </> and <=/>= automatically (e.g., 5 < obj becomes obj > 5).

Bitwise Operators
Python Operator Description ClickHouse Equivalent Forward Method Reverse Method
& Bitwise AND bitAnd() __and__ __rand__
\| Bitwise OR bitOr() __or__ __ror__
^ Bitwise XOR bitXor() __xor__ __rxor__

Aggregation Operators

Reduce an array to a scalar Object.

Method ClickHouse Function Notes
.min() min()
.max() max()
.sum() sum()
.mean() avg()
.std() stddevPop()
.var() varPop()
.count() count()
.quantile(q) quantile(q)() Approximate
.count_if(cond) countIf() Scalar (str) or dict Object (dict of conditions)

Unique

Method ClickHouse Implementation Notes
.unique() GROUP BY Order not guaranteed
.nunique() SELECT count() FROM (... GROUP BY value) Fused count distinct

String/Regex Operators

Pattern matching on String columns. Each method takes a str pattern, returns a new Object, and is chainable (e.g., match()sum() to count matches).

Method ClickHouse Function Result Type Description
.match(p) match(val, p) UInt8 RE2 regex match (0 or 1)
.like(p) val LIKE p UInt8 SQL LIKE (%, _ wildcards)
.ilike(p) val ILIKE p UInt8 Case-insensitive LIKE
.extract(p) extract(val, p) String Extract first capture group
.replace(p, r) replaceRegexpAll(val, p, r) String Replace all regex matches

Note: ClickHouse uses RE2 regex syntax (no lookaheads/lookbehinds).

Membership Operator: isin()

UInt8 membership mask via ClickHouse IN subquery — all data stays in the database.

Method ClickHouse Equivalent Result Type
.isin(other) value IN (SELECT value FROM other_table) UInt8

Accepts an Object or a Python list (auto-converted to Object).

obj = await create_object_from_value(["a", "b", "c", "d"])
allowed = await create_object_from_value(["a", "c"])
mask = await obj.isin(allowed)
await mask.data()           # [1, 0, 1, 0]

# Also works with a plain Python list
mask = await obj.isin(["a", "c"])

# Chain with sum() to count matches
total = await mask.sum()
await total.data()          # 2

# Works on dict column selection
obj = await create_object_from_value({"category": ["a", "b", "c"], "val": [1, 2, 3]})
mask = await obj["category"].isin(allowed)

Tests: aaiclick/data/object/test_isin.py. For runnable examples, see examples/isin.py.

Unary Transform Operators

Implementation: aaiclick/data/object.py (methods) delegates to aaiclick/data/operators.py — see unary_transform()

Apply a ClickHouse function element-wise to the value column, returning a new Object. Object-level equivalents of Domain Helpers which operate on Views.

Method ClickHouse Function Result Type Category
.year() toYear() UInt16 Date/time
.month() toMonth() UInt8 Date/time
.day_of_week() toDayOfWeek() UInt8 Date/time
.lower() lower() String String
.upper() upper() String String
.length() length() UInt64 String
.trim() trimBoth() String String
.abs() abs() Float64 Math
.log2() log2() Float64 Math
.sqrt() sqrt() Float64 Math

Results are full Objects — chainable with any operator (e.g., await (await obj.year()).unique()).

Tests: aaiclick/data/test_unary_transforms.py. For runnable examples, see examples/transforms.py.

Group By Operations

Pandas-style two-step: obj.group_by('key').sum('col'). See GroupByQuery class in aaiclick/data/object.py.

Method Description Result Column Type
.sum(col) Sum per group Preserves int types, Float64 for float
.mean(col) Average per group Float64
.min(col) Minimum per group Preserves source type
.max(col) Maximum per group Preserves source type
.count() Count rows per group UInt64 (column named _count)
.std(col) Standard deviation per group Float64
.var(col) Variance per group Float64
.any(col) Arbitrary non-NULL per group Preserves source type
.group_array_distinct(col) Distinct values → Array per group Array(T) where T is source type
.agg({col: op}) Multiple aggregations Per-function type rules

agg() spec formats — the dict value (per source column) accepts three forms:

Form Example Behavior
str {'amount': GB_SUM} Alias = column name (backward compat)
Agg(op, alias) {'amount': Agg(GB_SUM, 'total')} Single op with explicit alias
[Agg(op, alias), ...] {'amount': [Agg(GB_SUM, 'amt_sum'), Agg(GB_MEAN, 'amt_avg')]} Multiple ops on the same column

All three forms can be mixed in a single agg() call.

Multiple group keys, View support, and chained .having()/.or_having() for post-aggregation filtering (same AND/OR pattern as WHERE). .or_having() requires a prior .having(). Result is a normal dict Object. See examples/group_by.py.

Known gap: No Array(T) column support — groupArray(), groupUniqArray(), per-group concat not available.

Memory/Disk Settings

For large datasets, ClickHouse can spill to disk via max_bytes_before_external_sort, max_bytes_in_join, join_algorithm.

Ingest

concat()

Concatenates multiple sources into a new Object. Also available as standalone concat(a, b, ...). Self must be array; args can be Objects, scalars, or lists.

  • Variadic: obj.concat(a, b, c) — any number of sources in one call
  • Nullable promotion: if any source has nullable columns, the result column is promoted to Nullable
  • Compatible types: all sources must have matching column names and compatible ClickHouse types

insert()

Inserts data from one or more sources into an existing Object. Target must be array; sources can be Objects, scalars, or lists. Missing columns get ClickHouse defaults.

  • Variadic: obj.insert(a, b, c) — any number of sources in one call
  • Subset columns: sources don't need all target columns
  • View support: sources can be Views with where(), with_columns(), field selection, etc.
  • Type casting: source column types are CAST to target column types

insert_from_url()

Insert data from a URL into an existing Object. For create_object_from_url(), see DataContext.

Supported URL input formats

create_object_from_url() and insert_from_url() accept any format ClickHouse can read via the url() table function. Pass the format name explicitly with format=...; the file extension is not consulted. Headerless formats expose columns as c1, c2, … because there are no names to bind.

Format Notes
Parquet columnar, type-rich (default)
ORC Apache ORC columnar
Arrow reading not supported — use Parquet/ORC
Avro container file with embedded schema
CSV no header — columns must be c1, c2, …
CSVWithNames header row provides column names
CSVWithNamesAndTypes header + types row — DESCRIBE returns full types
TSV no header — c1, c2, …
TSVWithNames header row
TSVWithNamesAndTypes header + types row
JSON full ClickHouse JSON envelope ({meta, data, rows})
JSONEachRow newline-delimited JSON objects
JSONCompactEachRow newline-delimited JSON arrays — c1, c2, …
RawBLOB / JSONAsString JSON-blob mode for nested API responses (see below)
# Tabular formats
await create_object_from_url(
    "https://example.com/data.parquet",
    columns=["id", "price", "name"],
    format="Parquet",
)

# Headerless CSV — bind to positional column names
await create_object_from_url(
    "https://example.com/raw.csv",
    columns=["c1", "c2", "c3"],
    format="CSV",
)

# Avro container with embedded schema
await create_object_from_url(
    "https://example.com/events.avro",
    columns=["id", "ts", "payload"],
    format="Avro",
)

JSON-blob mode

For nested JSON APIs that wrap rows inside an envelope (e.g. {"vulnerabilities": [...]}), pass RawBLOB or JSONAsString plus json_path and json_columns. ClickHouse loads the whole document as a single string and applies JSONExtract per field:

await create_object_from_url(
    "https://services.nvd.nist.gov/rest/json/cves/2.0",
    format="JSONAsString",
    json_path="vulnerabilities",
    json_columns={
        "cve.id":               ColumnInfo("String"),
        "cve.published":        ColumnInfo("DateTime64(3)"),
        "cve.descriptions":     ColumnInfo("String", array=True),
    },
)

Type inference

By default the loader runs DESCRIBE (SELECT … FROM url(…) LIMIT 0) to discover column types. Pass column_types= to skip the DESCRIBE round-trip — useful for headerless CSV/TSV where ClickHouse may fail to infer numeric types from a sample of zero rows.

Shared insert mechanics

Both insert() and concat() delegate to _insert_source() (aaiclick/data/ingest.py) — one INSERT INTO ... SELECT CAST(...) FROM source per source. Order follows argument order.

Row Order

Array Objects have no implicit row order. Reads return rows in ClickHouse's natural source order — stable within a process but not portable across runs. For deterministic order, name the column you want to sort on:

obj = await create_object_from_value([3, 1, 2])

await obj.data()                      # → [3, 1, 2] or [1, 2, 3] etc. (undefined)
await obj.data(order_by="value")      # → [1, 2, 3]
await obj.view(order_by="value").data()  # equivalent

Binary elementwise ops between two array Objects from different tables require order_by on both sides — the join uses it as the row-pairing key:

a = await create_object_from_value([1, 2, 3])
b = await create_object_from_value([10, 20, 30])

# a + b   ← TypeError: cross-table op without explicit row order
result = await (a.view(order_by="value") + b.view(order_by="value"))
print(await result.data(order_by="value"))  # → [11, 22, 33]

Same-table ops (a + a), scalar broadcast (a + 10), and aggregations don't need the contract — they have a natural alignment.

Cross-table array ops need order_by

a + b between two array Objects from different tables raises TypeError unless both sides are wrapped with .view(order_by=...). Row order is opt-in, not implicit.

Join

join()

Joins two Objects on one or more key columns into a new dict Object.

Implementation: aaiclick/data/object/join.py — see resolve_join_keys, build_join_schema, and join_objects_db.

async def join(
    self,
    other: Object,
    *,
    on: str | list[str] | None = None,
    left_on: str | list[str] | None = None,
    right_on: str | list[str] | None = None,
    how: Literal["inner", "left", "right", "full", "cross"] = "inner",
    suffixes: tuple[str, str] | None = None,
) -> Object
Argument Purpose
other Right-hand Object.
on Key column name(s) present under the same name in both Objects.
left_on Left-side key(s) when names differ. Requires matching-length right_on.
right_on Right-side key(s) when names differ.
how Join type. Default "inner".
suffixes Suffix handling for non-key column collisions. True → default ("_l", "_r"); a tuple picks custom suffixes; None / False → collision raises ValueError.

Exactly one of {on} or {left_on, right_on} must be set, except how="cross" which forbids all three.

Join types

how ClickHouse keyword Left cols nullable? Right cols nullable?
inner INNER JOIN no no
left LEFT JOIN no yes (promoted)
right RIGHT JOIN yes (promoted) no
full FULL OUTER JOIN yes (promoted) yes (promoted)
cross CROSS JOIN no no

Outer-join SQL is emitted with SETTINGS join_use_nulls = 1 so misses materialize as NULL instead of the outer-side type's default.

Column semantics

  • Key dedup: under on= the key appears once under its original name; under left_on/right_on both key columns survive. FULL joins with on= coalesce the merged key across sides.
  • Collisions: non-key columns that appear on both sides raise ValueError unless suffixes=True (shorthand for ("_l", "_r")) or a custom suffixes=("_x", "_y") tuple is passed. Both sides get a suffix — empty suffixes are rejected.
  • Key types: compared via the same type-compatibility rule as concat (StringFixedString OK, StringInt64 not).
  • LowCardinality / Array: preserved; LowCardinality(String) on the outer side becomes LowCardinality(Nullable(String)). Join keys themselves must be scalar.
  • Order: join output has no intrinsic order. Pass data(order_by=...) (or .view(order_by=...)) if deterministic order is required.

Examples

# Inner join on a shared key
users  = await create_object_from_value({"id":  [1, 2, 3], "name": ["A", "B", "C"]})
orders = await create_object_from_value({"id":  [1, 1, 4], "total": [9.5, 14.0, 2.0]})

joined = await users.join(orders, on="id")
await joined.data(orient="records")
# → [{"id": 1, "name": "A", "total": 9.5}, {"id": 1, "name": "A", "total": 14.0}]

# Left join — unmatched rows keep left, right columns become NULL
enriched = await users.join(orders, on="id", how="left")
# → rows for id=2, 3 have total=None; schema marks 'total' Nullable(Float64)

# left_on / right_on when key names differ — both keys survive
orders2 = await create_object_from_value({"user_id": [1, 2], "total": [9.5, 3.0]})
await users.join(orders2, left_on="id", right_on="user_id")
# → rows include both 'id' and 'user_id'

# Suffixes on non-key collision — True picks the default ("_l", "_r")
a = await create_object_from_value({"id": [1, 2], "score": [10, 20]})
b = await create_object_from_value({"id": [1, 2], "score": [99, 88]})
merged = await a.join(b, on="id", suffixes=True)
# → columns: id, score_l, score_r

# Or supply a custom tuple:
merged = await a.join(b, on="id", suffixes=("_old", "_new"))

# Cross join — Cartesian product
colors = await create_object_from_value({"c": ["red", "blue"]})
sizes  = await create_object_from_value({"s": ["S", "M"]})
skus   = await colors.join(sizes, how="cross")  # 4 rows

Distributed considerations

The default chdb backend is single-shard, so the naive JOIN suffices. For a future sharded backend, GLOBAL JOIN and join_algorithm hints (parallel_hash, partial_merge, grace_hash) are the next levers — deferred to docs/future.md. Nothing in the v1 API precludes a later strategy= kwarg.

Semi / anti / asof joins are deferred: each has semantic nuance (left-only output schema, required order_by, tolerance) worth its own mini-spec.

Data Retrieval

data()

Returns: scalar → value, array → list, dict → dict or list of dicts.

Orient (dict Objects):

Constant Value Description
ORIENT_DICT 'dict' Dict with arrays as values (default)
ORIENT_RECORDS 'records' List of dicts (one per row)

markdown()

Returns data as a plain-text markdown table with auto-sized columns. Optional truncate: dict[str, int] caps column widths. Floats → 2dp, None → N/A.

execute()

Runs the query this Object/View describes, discards every row server-side (FORMAT Null), and returns a QueryStats. Forces full compute without paying for transport — benchmark a View, warm caches, or assert a query runs without checking its output.

Rebuilds the same SELECT .data() would issue (honoring where, order_by, limit, offset, computed columns, renames, field selection), so a View's execute() measures its projection. Per-call order_by / limit / offset mirror .data(). No safety row cap applies — the whole pipeline runs unless you pass limit.

view = obj.view(where="score > 10").rename({"score": "s"})
stats = await view.execute()
print(stats.read_rows)   # rows scanned by the full pipeline
print(stats.elapsed_s)   # server-side wall time
# view's table is unchanged; no result rows came back

QueryStats is a frozen value type (exported from aaiclick). Every field is best-effort — a backend fills what it can surface and leaves the rest None:

Field Meaning clickhouse-connect (HTTP) chdb
read_rows rows ClickHouse scanned summary storage_rows_read
read_bytes uncompressed bytes scanned summary storage_bytes_read
elapsed_s server-side wall time, seconds elapsed_ns / 1e9 elapsed
result_rows rows the SELECT produced summary None
written_rows rows written by an INSERT summary None
written_bytes bytes written by an INSERT summary None

Tests: aaiclick/data/object/test_execute_stats.py

stats()

async, returns QueryStats | None. Populated on objects born from a server-side query; None everywhere else. Async (not a property) because awaiting it on an un-awaited LazyOperator first materializes it — same terminal semantics as .data().

Object origin await .stats()
.copy() / .concat() / .join() stats of the INSERT … SELECT
operator / aggregation / group_by result (incl. await a + b) stats of the materialization
plain table-backed Object None
unexecuted View None (it never ran a query)
result = await big_view.copy()
stats = await result.stats()
print(stats.written_rows)   # how many rows the copy wrote (HTTP)
print(stats.read_rows)      # how many it scanned to produce them

export()

Export data to a local file. The format is picked from the file extension and the data streams directly from ClickHouse — no Python round-trip, so multi-million-row exports stay memory-bounded. View constraints (where, limit, order_by) are honored. Returns the absolute path written.

await obj.export("/tmp/data.csv")
await obj.export("/tmp/data.parquet")
await obj.view(where="score > 10", limit=1000).export("/tmp/top.jsonl")

Supported extensions

Extension ClickHouse format Notes
.csv CSVWithNames header row included
.tsv TSVWithNames header row included
.json / .jsonl / .ndjson JSONEachRow newline-delimited JSON
.parquet Parquet columnar, best for large datasets
.arrow Arrow Apache Arrow IPC stream
.orc ORC Apache ORC columnar
.avro Avro Avro container with embedded schema
.md Markdown pipe-delimited markdown table
.xml XML one <row> element per record
.sql SQLInsert INSERT INTO table VALUES (...)

Compression

Append .gz or .xz and the writer compresses automatically:

await obj.export("/tmp/data.csv.gz")
await obj.export("/tmp/data.parquet.gz")
await obj.export("/tmp/data.json.xz")

Backend behavior

  • chdb (embedded): INSERT INTO FUNCTION file('path', fmt) — the embedded engine streams directly to disk and picks the compression codec from the path suffix.
  • clickhouse-connect (remote): raw_stream(query, fmt=fmt) returns uncompressed bytes, which are copied to the local file in 64 KB chunks (on a worker thread) and wrapped through gzip / lzma if the path asks for it. INSERT INTO FUNCTION file() cannot be used here because it would write to the server's user_files_path, not the client.

Views

Read-only filtered projection of an Object — same table, no data copy. Created via obj.view(where=..., limit=..., offset=..., order_by=...). Supports all read operations; cannot insert(). See examples/view_examples.py.

Chained WHERE Clauses

  • obj.where(cond) — creates View with initial WHERE
  • view.where(cond) — AND-chains: .where('x > 10').where('y < 20')WHERE (x > 10) AND (y < 20)
  • view.or_where(cond) — OR-chains: .where('x > 100').or_where('y < 5')WHERE (x > 100) OR (y < 5)

or_where() requires a prior where()

Calling or_where() without a preceding where() raises ValueError. Same applies to or_having() on GroupByQuery.

Column Selection

Select columns from a dict Object → View (same table, restricted SELECT).

  • obj["col"] — single column → array-like View (single value field)
  • obj[["col_a", "col_b"]] — multiple columns → dict-like View
obj = await create_object_from_value({"x": [1, 2, 3], "y": [10, 20, 30]})

# Single column — returns array-like View
await obj["x"].data()              # [1, 2, 3]
arr = await obj["x"].copy()        # new array Object

# Multiple columns — returns dict-like View
await obj[["x", "y"]].data()       # {'x': [1, 2, 3], 'y': [10, 20, 30]}

Preserves WHERE and computed column constraints when chained on a filtered View.

Tests: aaiclick/data/object/test_column_selection.py

Computed Column Expansion: with_columns()

with_columns() adds derived columns (SQL expressions) as a lightweight View — no new table, no data copy.

API

Synchronous (no await). Works on both Object and View; preserves existing constraints. Uses Computed(type, expression) from aaiclick.data.models.

literal() Helper

Convenience wrapper for constant columns — handles SQL quoting:

from aaiclick import literal

# Before: manual quoting
obj.with_columns({"source": Computed("String", "'dataset_a'")})

# After: literal() handles it
obj.with_columns({"source": literal("dataset_a", "String")})
obj.with_columns({"flag": literal(True, "UInt8")})
obj.with_columns({"weight": literal(1.0, "Float64")})

Supported types: str (quoted), int/float (bare), bool (true/false).

Explode

Flattens Array column(s) into individual rows (scalar columns duplicated). Returns a View — downstream operators fuse into a single query. Exploded columns change from Array(T) to T. See aaiclick/data/examples/explode.py.

Tests: aaiclick/data/object/test_explode.py

Chaining

with_columns() returns a View, so all View operations work: group_by(), where(), column selection, further with_columns() calls (additive), and operators on selected columns.

with_columns() internals

Result is always a View with dict-like schema. Array sources promote to dict; scalar sources raise ValueError. Column name collisions raise ValueErrorwith_columns() adds, never replaces.

SQL expressions are passed verbatim to ClickHouse. Basic validation rejects semicolons and SELECT. Type mismatches are caught by ClickHouse at query time.

Domain Helpers

Named shortcuts that delegate to with_columns(). Each auto-names the result column; all accept alias= override and return a View.

Helper Default Alias Type Expression
with_year(col) {col}_year UInt16 toYear(col)
with_month(col) {col}_month UInt8 toMonth(col)
with_day_of_week(col) {col}_dow UInt8 toDayOfWeek(col)
with_date_diff(unit, col_a, col_b) {col_a}_{col_b}_diff Int64 dateDiff('unit', col_a, col_b)
with_lower(col) {col}_lower String lower(col)
with_upper(col) {col}_upper String upper(col)
with_length(col) {col}_length UInt64 length(col)
with_trim(col) {col}_trimmed String trim(col)
with_abs(col) {col}_abs Float64 abs(col)
with_log2(col) {col}_log2 Float64 log2(col)
with_sqrt(col) {col}_sqrt Float64 sqrt(col)
with_bucket(col, size) {col}_bucket Int64 intDiv(col, size)
with_hash_bucket(col, n) {col}_hash UInt64 cityHash64(col) % n
with_if(cond, then, else, *, alias) required alias String if(cond, then, else)
with_cast(col, ch_type) {col}_{type_lower} ch_type to{Type}(col)
with_split_by_char(col, sep) {col}_parts Array(String) splitByChar(sep, col)
with_isin(col, other) {col}_isin UInt8 col IN (SELECT value FROM …)

with_columns() remains the public power-user interface for arbitrary expressions via Computed(type, expression).

Tests: aaiclick/data/object/test_with_columns.py

Column Renaming: rename()

Returns a View aliasing column names (old AS new). Synchronous, no await.

from aaiclick import literal

# Rename camelCase columns to snake_case for a consolidated table
kev_view = kev.rename({
    "cveID": "cve_id",
    "vendorProject": "vendor",
    "vulnerabilityName": "vulnerability_name",
}).with_columns({
    "source": literal("kev", "String"),
})
await consolidated.insert(kev_view)

Chainable with with_columns(), where(), and other View operations. New names must not collide with non-renamed columns.

Tests: aaiclick/data/object/test_rename.py

The copy() Method

Full data copy → new Object. Works on both Objects and Views — filters, computed columns, and ORDER BY are preserved.

# Copy an array Object
obj_copy = await obj.copy()

# Materialize a View into a new Object
arr = await obj["x"].copy()         # array Object from dict column
subset = await obj.where("x > 5").copy()  # filtered copy

# Sorted copy — ORDER BY is preserved
sorted_copy = await obj.view(order_by="amount DESC").copy()
await sorted_copy.data()  # returns rows sorted by amount DESC

The result carries the INSERT … SELECT stats — result = await obj.copy() then (await result.stats()).read_rows tells you how much it scanned. See stats().

copy() is not serialized across workers

Unlike insert() and concat(), copy() does not take a per-table advisory lock. Two concurrent copy() calls into the same named destination will interleave their rows non-deterministically. Structure your pipeline so a named destination has a single writer.

Tests: aaiclick/data/test_copy_parametrized.py

Operation Provenance (Oplog)

All Object operations are instrumented to record provenance via OplogCollector. See docs/oplog.md.