Object¶

Overview¶

The Object class (aaiclick/data/object.py) wraps a ClickHouse table. Each instance corresponds to one table; operator overloading creates new tables with results.

Key Features: - Operator overloading for arithmetic, comparison, and bitwise operations - Immutable structure — no in-place schema changes; insert() appends data but never mutates structure - Support for scalars, arrays, and dictionaries - Element-wise operations on arrays

See DataContext for lifecycle, schemas, and deployment modes.

API Quick Reference¶

API	Category	Description	Section
`+`, `-`, ``, `/`, `//`, `%`, `*`	Arithmetic	Element-wise arithmetic	Arithmetic Operators
`==`, `!=`, `<`, `<=`, `>`, `>=`	Comparison	Element-wise comparison	Comparison Operators
`&`, `\\|`, `^`	Bitwise	Bitwise AND / OR / XOR	Bitwise Operators
`.min()`, `.max()`, `.sum()`, `.mean()`	Aggregation	Reduce array to scalar	Aggregation Operators
`.std()`, `.var()`, `.count()`, `.quantile(q)`	Aggregation	Statistical reduction	Aggregation Operators
`.count_if(condition)`	Aggregation	Count rows matching condition(s)	Aggregation Operators
`.unique()`	Unique	Deduplicate values via GROUP BY	Unique
`.nunique()`	Unique	Count distinct values (fused)	Unique
`.isin(other)`	Membership	IN subquery test (returns UInt8 mask)	Membership Operator
`.match(p)`, `.like(p)`, `.ilike(p)`	String / Regex	Pattern matching (returns UInt8 mask)	String/Regex Operators
`.extract(p)`, `.replace(p, r)`	String / Regex	Capture group extraction, regex replace	String/Regex Operators
`.year()`, `.month()`, `.day_of_week()`	Unary Transforms	Date/time extraction → new Object	Unary Transform Operators
`.lower()`, `.upper()`, `.length()`, `.trim()`	Unary Transforms	String transforms → new Object	Unary Transform Operators
`.abs()`, `.log2()`, `.sqrt()`	Unary Transforms	Math functions → new Object	Unary Transform Operators
`.group_by(keys).sum(col)` etc.	Group By	Aggregation with GROUP BY + optional HAVING	Group By Operations
`.group_by(keys).agg({col: spec})`	Group By	Multiple aggregations (multi-agg per column)	Group By Operations
`.group_by(keys).any(col)`	Group By	Arbitrary non-NULL value per group	Group By Operations
`.group_by(keys).group_array_distinct(col)`	Group By	Distinct values → Array per group	Group By Operations
`.view(where, limit, offset, order_by)`	Views	Read-only View with optional filters	Views
`.where(cond)` / `.or_where(cond)`	Views	Fluent WHERE chaining (AND / OR)	Chained WHERE Clauses
`obj[key]` / `obj[[keys]]`	Views	Select column(s) from dict Object → View	Column Selection
`.with_columns({name: Computed(type, expr)})`	Views	Add SQL expression columns → View	Computed Column Expansion
`literal(value, ch_type)`	Helpers	Constant column (`Computed` with SQL literal)	Computed Column Expansion
`.with_year(col)`, `.with_month(col)` …	Views	Named shortcuts for common computed columns	Domain Helpers
`.with_split_by_char(col, sep)`	Views	Split String column by separator → Array	Domain Helpers
`.with_isin(col, other)`	Views	IN subquery computed column → UInt8	Domain Helpers
`.rename({old: new})`	Views	Alias column names in a View	Column Renaming
`.explode(*columns, left=False)`	Views	Flatten Array column(s) into rows → View	Explode
`.copy()`	Copy	Materialize Object / View to new Object	copy()
`.insert(*sources)`	Ingest	Insert data from Objects / scalars / lists	insert()
`.insert_from_url(url)`	Ingest	Insert rows from a remote URL	insert_from_url()
`.concat(*sources)` / `concat(a, b, …)`	Ingest	Concatenate sources into a new Object	concat()
`.join(other, on, how)`	Join	Join two Objects on key columns → dict Object	join()
`.data(orient=…)`	Data Retrieval	Fetch results to Python (scalar / list / dict)	data()
`.markdown(truncate=…)`	Data Retrieval	Render data as markdown table	markdown()
`.execute(order_by, limit, offset)`	Data Retrieval	Run full query, discard rows → `QueryStats`	execute()
`await .stats()`	Data Retrieval	`QueryStats` of the query that made this Object	stats()
`.export(path)`	Export	Stream data to a file (extension → format)	export()

Operator Support¶

All operators work element-wise on scalar and array data, creating new Object tables. See examples/basic_operators.py.

Scalar broadcast

Python scalars work on either side: obj * 2 and 2 * obj both work.

Lazy Operator Results (`a + b` and `a.sum()` are Plans, not Tables)¶

Every binary operator (+, -, *, /, //, %, **, ==, !=, <, <=, >, >=, &, |, ^), aggregation (.min() / .max() / .sum() / .mean() / .std() / .var() / .count() / .count_if() / .quantile() / .unique() / .nunique()), and unary transform (.year() / .month() / .day_of_week() / .lower() / .upper() / .length() / .trim() / .abs() / .log2() / .sqrt()) returns a LazyOperator — a subclass of Object that captures the operation plan (lhs, rhs, operator, precomputed result schema) without touching ClickHouse. The CREATE TABLE + INSERT INTO ... SELECT happens when the lazy is awaited.

# 1. Build a plan — pure Python, no DB call
plan = a + b                                       # LazyOperator (no table yet)
total = a.sum()                                    # also a LazyOperator

# 2. Read rows / scalars — async methods auto-materialize
rows = await (a + b).data()                        # creates t_<id>, reads rows
value = await a.sum().data()                       # one await — materialize + read

# 3. Get the materialized Object (for .table, legacy APIs, etc.)
obj = await (a + b)                                # table = t_<id>
scalar_obj = await a.sum()                         # table = t_<id> with one row

# 4. Control the table name — use .as_()
revenue = await (prices * quantities).as_("revenue")            # t_revenue_<id>
daily_total = await orders.sum().as_("daily_total")             # t_daily_total_<id>

# 5. Persist beyond the current context
daily = await (sales + bonuses).as_("daily", scope="job")       # j_<job_id>_daily
yearly = await orders.sum().as_("yearly", scope="global")       # p_yearly

await is only needed when the caller wants the materialized Object back — to read .table, hand it to an eager API, etc. Reading rows (.data(), .markdown(), .export(), .result()) or chaining further operators ((a + b) + c, obj.abs().sum()) doesn't require an explicit await of the intermediate — the lazy passes through.

`.as_(name, scope=...)`¶

Names the result table and chooses its lifetime. Returns a new LazyOperator (the receiver is unchanged):

`scope`	Table name	Lifetime
`"temp_named"` (default)	`t_<name>_<snowflake>`	Drops with the context
`"job"`	`j_<job_id>_<name>`	Lives until the active orch job expires
`"global"`	`p_<name>`	Persists; remove with `delete_persistent_object(name, scope="global")`

`name` / `scope` on always-materializing ops¶

.copy(), .concat(), .join(), and every GroupByQuery aggregation (.sum(), .mean(), .min(), .max(), .count(), .std(), .var(), .any(), .group_array_distinct(), .agg()) accept the same name / scope kwargs with identical semantics to .as_() above. These ops always create a new table — no lazy chain to elide — so the kwargs land directly on the existing async methods instead of going through LazyOperator:

enriched = await a.join(b, on="user_id", name="enriched", scope="job")
# → j_<job_id>_enriched

totals = await orders.group_by("category").sum("amount", name="totals", scope="global")
# → p_totals

LazyOperator's overrides forward name / scope to the materialized delegate, so (a + b).copy(name=..., scope=...) and (a + b).join(b, name=..., scope=...) work as expected.

Invariants¶

No DB writes until awaited. Creating a LazyOperator is pure-Python. A lazy that's never awaited never creates a table.
Sync .table raises pre-materialize. Reading .table (or any property derived from it — .scope, .persistent, .order_by) on an unawaited LazyOperator raises RuntimeError. .schema works (it's precomputed).
Async methods auto-materialize. .data(), .sum(), etc. first await self (materializes once, cached) then delegate.
.as_() is immutable. Calling it returns a new LazyOperator; the receiver keeps its (None, None) name/scope.
Re-await is idempotent (single-task). _materialized cache ensures sequential awaits of the same lazy produce one table.

Chained operators¶

(a + b) + c builds a tree:

LazyOperator(op="+", lhs=LazyOperator(op="+", lhs=a, rhs=b), rhs=c)

When awaited, each LazyOperator node materializes into its own table — no fusion. await ((a + b) + c).as_("outer") writes two tables: an unnamed temp for the inner a + b and a t_outer_<id> for the outer. A LazyOperator that's been materialized once is reused if it appears in multiple expressions.

Aggregations and unary transforms stack the same way. obj.abs().sum() builds:

LazyOperator(op="sum", lhs=LazyOperator(op="abs", lhs=obj, rhs=None), rhs=None)

await (a + b).sum().as_("total", scope="job") writes one unnamed temp for the inner + and j_<job_id>_total for the outer sum. Aggregations always materialize their input first — there is no SQL-level fusion across the chain.

Avoid: concurrent await of the same instance¶

asyncio.gather(lazy.data(), lazy.data()) races: both tasks see _materialized is None and both materialize. The result is two tables (the second wins the cache slot; the first is orphaned and dropped via refcount cleanup). To share a result across tasks, await once and share the returned Object.

Implementation: LazyOperator, _plan_operator, _plan_aggregation, _plan_unary_transform in aaiclick/data/object/object.py; shared schema-computation helpers in aaiclick/data/object/schema_compute.py (_preview_operator_schema, _preview_agg_schema, _preview_unary_schema, …); materialization in aaiclick/data/object/operators.py (_apply_operator_db, _apply_aggregation, unary_transform, …).

Arithmetic Operators

Python Operator	Description	ClickHouse Equivalent	Forward Method	Reverse Method
`+`	Addition	`+`	`__add__`	`__radd__`
`-`	Subtraction	`-`	`__sub__`	`__rsub__`
`*`	Multiplication	`*`	`__mul__`	`__rmul__`
`/`	Division	`/`	`__truediv__`	`__rtruediv__`
`//`	Floor Division	`intDiv()`	`__floordiv__`	`__rfloordiv__`
`%`	Modulo	`%`	`__mod__`	`__rmod__`
`**`	Power	`power()`	`__pow__`	`__rpow__`

Arithmetic Type Promotion¶

Arithmetic result types match ClickHouse's native promotion rules. See _promote_arithmetic_type() in aaiclick/data/object/schema_compute.py, validated by aaiclick/data/object/test_type_promotion.py against SELECT toTypeName().

Comparison Operators

Python Operator	Description	ClickHouse Equivalent	Python Method
`==`	Equal	`=`	`__eq__`
`!=`	Not Equal	`!=`	`__ne__`
`<`	Less Than	`<`	`__lt__`
`<=`	Less Than or Equal	`<=`	`__le__`
`>`	Greater Than	`>`	`__gt__`
`>=`	Greater Than or Equal	`>=`	`__ge__`

Comparison operators don't need explicit reverse methods — Python swaps </> and <=/>= automatically (e.g., 5 < obj becomes obj > 5).

Bitwise Operators

Python Operator	Description	ClickHouse Equivalent	Forward Method	Reverse Method
`&`	Bitwise AND	`bitAnd()`	`__and__`	`__rand__`
`\\|`	Bitwise OR	`bitOr()`	`__or__`	`__ror__`
`^`	Bitwise XOR	`bitXor()`	`__xor__`	`__rxor__`

Aggregation Operators¶

Reduce an array to a scalar Object.

Method	ClickHouse Function	Notes
`.min()`	`min()`
`.max()`	`max()`
`.sum()`	`sum()`
`.mean()`	`avg()`
`.std()`	`stddevPop()`
`.var()`	`varPop()`
`.count()`	`count()`
`.quantile(q)`	`quantile(q)()`	Approximate
`.count_if(cond)`	`countIf()`	Scalar (str) or dict Object (dict of conditions)

Unique¶

Method	ClickHouse Implementation	Notes
`.unique()`	`GROUP BY`	Order not guaranteed
`.nunique()`	`SELECT count() FROM (... GROUP BY value)`	Fused count distinct

String/Regex Operators¶

Pattern matching on String columns. Each method takes a str pattern, returns a new Object, and is chainable (e.g., match() → sum() to count matches).

Method	ClickHouse Function	Result Type	Description
`.match(p)`	`match(val, p)`	UInt8	RE2 regex match (0 or 1)
`.like(p)`	`val LIKE p`	UInt8	SQL LIKE (`%`, `_` wildcards)
`.ilike(p)`	`val ILIKE p`	UInt8	Case-insensitive LIKE
`.extract(p)`	`extract(val, p)`	String	Extract first capture group
`.replace(p, r)`	`replaceRegexpAll(val, p, r)`	String	Replace all regex matches

Note: ClickHouse uses RE2 regex syntax (no lookaheads/lookbehinds).

Membership Operator: `isin()`¶

UInt8 membership mask via ClickHouse IN subquery — all data stays in the database.

Method	ClickHouse Equivalent	Result Type
`.isin(other)`	`value IN (SELECT value FROM other_table)`	UInt8

Accepts an Object or a Python list (auto-converted to Object).

obj = await create_object_from_value(["a", "b", "c", "d"])
allowed = await create_object_from_value(["a", "c"])
mask = await obj.isin(allowed)
await mask.data()           # [1, 0, 1, 0]

# Also works with a plain Python list
mask = await obj.isin(["a", "c"])

# Chain with sum() to count matches
total = await mask.sum()
await total.data()          # 2

# Works on dict column selection
obj = await create_object_from_value({"category": ["a", "b", "c"], "val": [1, 2, 3]})
mask = await obj["category"].isin(allowed)

Tests: aaiclick/data/object/test_isin.py. For runnable examples, see examples/isin.py.

Unary Transform Operators¶

Implementation: aaiclick/data/object.py (methods) delegates to aaiclick/data/operators.py — see unary_transform()

Apply a ClickHouse function element-wise to the value column, returning a new Object. Object-level equivalents of Domain Helpers which operate on Views.

Method	ClickHouse Function	Result Type	Category
`.year()`	`toYear()`	UInt16	Date/time
`.month()`	`toMonth()`	UInt8	Date/time
`.day_of_week()`	`toDayOfWeek()`	UInt8	Date/time
`.lower()`	`lower()`	String	String
`.upper()`	`upper()`	String	String
`.length()`	`length()`	UInt64	String
`.trim()`	`trimBoth()`	String	String
`.abs()`	`abs()`	Float64	Math
`.log2()`	`log2()`	Float64	Math
`.sqrt()`	`sqrt()`	Float64	Math

Results are full Objects — chainable with any operator (e.g., await (await obj.year()).unique()).

Tests: aaiclick/data/test_unary_transforms.py. For runnable examples, see examples/transforms.py.

Group By Operations¶

Pandas-style two-step: obj.group_by('key').sum('col'). See GroupByQuery class in aaiclick/data/object.py.

Method	Description	Result Column Type
`.sum(col)`	Sum per group	Preserves int types, Float64 for float
`.mean(col)`	Average per group	Float64
`.min(col)`	Minimum per group	Preserves source type
`.max(col)`	Maximum per group	Preserves source type
`.count()`	Count rows per group	UInt64 (column named `_count`)
`.std(col)`	Standard deviation per group	Float64
`.var(col)`	Variance per group	Float64
`.any(col)`	Arbitrary non-NULL per group	Preserves source type
`.group_array_distinct(col)`	Distinct values → Array per group	`Array(T)` where T is source type
`.agg({col: op})`	Multiple aggregations	Per-function type rules

agg() spec formats — the dict value (per source column) accepts three forms:

Form	Example	Behavior
`str`	`{'amount': GB_SUM}`	Alias = column name (backward compat)
`Agg(op, alias)`	`{'amount': Agg(GB_SUM, 'total')}`	Single op with explicit alias
`[Agg(op, alias), ...]`	`{'amount': [Agg(GB_SUM, 'amt_sum'), Agg(GB_MEAN, 'amt_avg')]}`	Multiple ops on the same column

All three forms can be mixed in a single agg() call.

Multiple group keys, View support, and chained .having()/.or_having() for post-aggregation filtering (same AND/OR pattern as WHERE). .or_having() requires a prior .having(). Result is a normal dict Object. See examples/group_by.py.

Known gap: No Array(T) column support — groupArray(), groupUniqArray(), per-group concat not available.

Memory/Disk Settings¶

For large datasets, ClickHouse can spill to disk via max_bytes_before_external_sort, max_bytes_in_join, join_algorithm.

Ingest¶

concat()¶

Concatenates multiple sources into a new Object. Also available as standalone concat(a, b, ...). Self must be array; args can be Objects, scalars, or lists.

Variadic: obj.concat(a, b, c) — any number of sources in one call
Nullable promotion: if any source has nullable columns, the result column is promoted to Nullable
Compatible types: all sources must have matching column names and compatible ClickHouse types

insert()¶

Inserts data from one or more sources into an existing Object. Target must be array; sources can be Objects, scalars, or lists. Missing columns get ClickHouse defaults.

Variadic: obj.insert(a, b, c) — any number of sources in one call
Subset columns: sources don't need all target columns
View support: sources can be Views with where(), with_columns(), field selection, etc.
Type casting: source column types are CAST to target column types

insert_from_url()¶

Insert data from a URL into an existing Object. For create_object_from_url(), see DataContext.

Supported URL input formats¶

create_object_from_url() and insert_from_url() accept any format ClickHouse can read via the url() table function. Pass the format name explicitly with format=...; the file extension is not consulted. Headerless formats expose columns as c1, c2, … because there are no names to bind.

Format	Notes
`Parquet`	columnar, type-rich (default)
`ORC`	Apache ORC columnar
`Arrow`	reading not supported — use Parquet/ORC
`Avro`	container file with embedded schema
`CSV`	no header — columns must be `c1`, `c2`, …
`CSVWithNames`	header row provides column names
`CSVWithNamesAndTypes`	header + types row — DESCRIBE returns full types
`TSV`	no header — `c1`, `c2`, …
`TSVWithNames`	header row
`TSVWithNamesAndTypes`	header + types row
`JSON`	full ClickHouse JSON envelope (`{meta, data, rows}`)
`JSONEachRow`	newline-delimited JSON objects
`JSONCompactEachRow`	newline-delimited JSON arrays — `c1`, `c2`, …
`RawBLOB` / `JSONAsString`	JSON-blob mode for nested API responses (see below)

# Tabular formats
await create_object_from_url(
    "https://example.com/data.parquet",
    columns=["id", "price", "name"],
    format="Parquet",
)

# Headerless CSV — bind to positional column names
await create_object_from_url(
    "https://example.com/raw.csv",
    columns=["c1", "c2", "c3"],
    format="CSV",
)

# Avro container with embedded schema
await create_object_from_url(
    "https://example.com/events.avro",
    columns=["id", "ts", "payload"],
    format="Avro",
)

JSON-blob mode¶

For nested JSON APIs that wrap rows inside an envelope (e.g. {"vulnerabilities": [...]}), pass RawBLOB or JSONAsString plus json_path and json_columns. ClickHouse loads the whole document as a single string and applies JSONExtract per field:

await create_object_from_url(
    "https://services.nvd.nist.gov/rest/json/cves/2.0",
    format="JSONAsString",
    json_path="vulnerabilities",
    json_columns={
        "cve.id":               ColumnInfo("String"),
        "cve.published":        ColumnInfo("DateTime64(3)"),
        "cve.descriptions":     ColumnInfo("String", array=True),
    },
)

Type inference¶

By default the loader runs DESCRIBE (SELECT … FROM url(…) LIMIT 0) to discover column types. Pass column_types= to skip the DESCRIBE round-trip — useful for headerless CSV/TSV where ClickHouse may fail to infer numeric types from a sample of zero rows.

Shared insert mechanics

Both insert() and concat() delegate to _insert_source() (aaiclick/data/ingest.py) — one INSERT INTO ... SELECT CAST(...) FROM source per source. Order follows argument order.

Row Order¶

Array Objects have no implicit row order. Reads return rows in ClickHouse's natural source order — stable within a process but not portable across runs. For deterministic order, name the column you want to sort on:

obj = await create_object_from_value([3, 1, 2])

await obj.data()                      # → [3, 1, 2] or [1, 2, 3] etc. (undefined)
await obj.data(order_by="value")      # → [1, 2, 3]
await obj.view(order_by="value").data()  # equivalent

Binary elementwise ops between two array Objects from different tables require order_by on both sides — the join uses it as the row-pairing key:

a = await create_object_from_value([1, 2, 3])
b = await create_object_from_value([10, 20, 30])

# a + b   ← TypeError: cross-table op without explicit row order
result = await (a.view(order_by="value") + b.view(order_by="value"))
print(await result.data(order_by="value"))  # → [11, 22, 33]

Same-table ops (a + a), scalar broadcast (a + 10), and aggregations don't need the contract — they have a natural alignment.

Cross-table array ops need order_by

a + b between two array Objects from different tables raises TypeError unless both sides are wrapped with .view(order_by=...). Row order is opt-in, not implicit.

Join¶

join()¶

Joins two Objects on one or more key columns into a new dict Object.

Implementation: aaiclick/data/object/join.py — see resolve_join_keys, build_join_schema, and join_objects_db.

async def join(
    self,
    other: Object,
    *,
    on: str | list[str] | None = None,
    left_on: str | list[str] | None = None,
    right_on: str | list[str] | None = None,
    how: Literal["inner", "left", "right", "full", "cross"] = "inner",
    suffixes: tuple[str, str] | None = None,
) -> Object

Argument	Purpose
`other`	Right-hand Object.
`on`	Key column name(s) present under the same name in both Objects.
`left_on`	Left-side key(s) when names differ. Requires matching-length `right_on`.
`right_on`	Right-side key(s) when names differ.
`how`	Join type. Default `"inner"`.
`suffixes`	Suffix handling for non-key column collisions. `True` → default `("_l", "_r")`; a tuple picks custom suffixes; `None` / `False` → collision raises `ValueError`.

Exactly one of {on} or {left_on, right_on} must be set, except how="cross" which forbids all three.

Join types¶

`how`	ClickHouse keyword	Left cols nullable?	Right cols nullable?
`inner`	`INNER JOIN`	no	no
`left`	`LEFT JOIN`	no	yes (promoted)
`right`	`RIGHT JOIN`	yes (promoted)	no
`full`	`FULL OUTER JOIN`	yes (promoted)	yes (promoted)
`cross`	`CROSS JOIN`	no	no

Outer-join SQL is emitted with SETTINGS join_use_nulls = 1 so misses materialize as NULL instead of the outer-side type's default.

Column semantics¶

Key dedup: under on= the key appears once under its original name; under left_on/right_on both key columns survive. FULL joins with on= coalesce the merged key across sides.
Collisions: non-key columns that appear on both sides raise ValueError unless suffixes=True (shorthand for ("_l", "_r")) or a custom suffixes=("_x", "_y") tuple is passed. Both sides get a suffix — empty suffixes are rejected.
Key types: compared via the same type-compatibility rule as concat (String ↔ FixedString OK, String ↔ Int64 not).
LowCardinality / Array: preserved; LowCardinality(String) on the outer side becomes LowCardinality(Nullable(String)). Join keys themselves must be scalar.
Order: join output has no intrinsic order. Pass data(order_by=...) (or .view(order_by=...)) if deterministic order is required.

Examples¶

# Inner join on a shared key
users  = await create_object_from_value({"id":  [1, 2, 3], "name": ["A", "B", "C"]})
orders = await create_object_from_value({"id":  [1, 1, 4], "total": [9.5, 14.0, 2.0]})

joined = await users.join(orders, on="id")
await joined.data(orient="records")
# → [{"id": 1, "name": "A", "total": 9.5}, {"id": 1, "name": "A", "total": 14.0}]

# Left join — unmatched rows keep left, right columns become NULL
enriched = await users.join(orders, on="id", how="left")
# → rows for id=2, 3 have total=None; schema marks 'total' Nullable(Float64)

# left_on / right_on when key names differ — both keys survive
orders2 = await create_object_from_value({"user_id": [1, 2], "total": [9.5, 3.0]})
await users.join(orders2, left_on="id", right_on="user_id")
# → rows include both 'id' and 'user_id'

# Suffixes on non-key collision — True picks the default ("_l", "_r")
a = await create_object_from_value({"id": [1, 2], "score": [10, 20]})
b = await create_object_from_value({"id": [1, 2], "score": [99, 88]})
merged = await a.join(b, on="id", suffixes=True)
# → columns: id, score_l, score_r

# Or supply a custom tuple:
merged = await a.join(b, on="id", suffixes=("_old", "_new"))

# Cross join — Cartesian product
colors = await create_object_from_value({"c": ["red", "blue"]})
sizes  = await create_object_from_value({"s": ["S", "M"]})
skus   = await colors.join(sizes, how="cross")  # 4 rows

Distributed considerations¶

The default chdb backend is single-shard, so the naive JOIN suffices. For a future sharded backend, GLOBAL JOIN and join_algorithm hints (parallel_hash, partial_merge, grace_hash) are the next levers — deferred to docs/future.md. Nothing in the v1 API precludes a later strategy= kwarg.

Semi / anti / asof joins are deferred: each has semantic nuance (left-only output schema, required order_by, tolerance) worth its own mini-spec.

Data Retrieval¶

data()¶

Returns: scalar → value, array → list, dict → dict or list of dicts.

Orient (dict Objects):

Constant	Value	Description
`ORIENT_DICT`	`'dict'`	Dict with arrays as values (default)
`ORIENT_RECORDS`	`'records'`	List of dicts (one per row)

markdown()¶

Returns data as a plain-text markdown table with auto-sized columns. Optional truncate: dict[str, int] caps column widths. Floats → 2dp, None → N/A.

execute()¶

Runs the query this Object/View describes, discards every row server-side (FORMAT Null), and returns a QueryStats. Forces full compute without paying for transport — benchmark a View, warm caches, or assert a query runs without checking its output.

Rebuilds the same SELECT .data() would issue (honoring where, order_by, limit, offset, computed columns, renames, field selection), so a View's execute() measures its projection. Per-call order_by / limit / offset mirror .data(). No safety row cap applies — the whole pipeline runs unless you pass limit.

view = obj.view(where="score > 10").rename({"score": "s"})
stats = await view.execute()
print(stats.read_rows)   # rows scanned by the full pipeline
print(stats.elapsed_s)   # server-side wall time
# view's table is unchanged; no result rows came back

QueryStats is a frozen value type (exported from aaiclick). Every field is best-effort — a backend fills what it can surface and leaves the rest None:

Field	Meaning	clickhouse-connect (HTTP)	chdb
`read_rows`	rows ClickHouse scanned	summary	`storage_rows_read`
`read_bytes`	uncompressed bytes scanned	summary	`storage_bytes_read`
`elapsed_s`	server-side wall time, seconds	`elapsed_ns / 1e9`	`elapsed`
`result_rows`	rows the SELECT produced	summary	`None`
`written_rows`	rows written by an INSERT	summary	`None`
`written_bytes`	bytes written by an INSERT	summary	`None`

Tests: aaiclick/data/object/test_execute_stats.py

stats()¶

async, returns QueryStats | None. Populated on objects born from a server-side query; None everywhere else. Async (not a property) because awaiting it on an un-awaited LazyOperator first materializes it — same terminal semantics as .data().

Object origin	`await .stats()`
`.copy()` / `.concat()` / `.join()`	stats of the `INSERT … SELECT`
operator / aggregation / `group_by` result (incl. `await a + b`)	stats of the materialization
plain table-backed `Object`	`None`
unexecuted `View`	`None` (it never ran a query)

result = await big_view.copy()
stats = await result.stats()
print(stats.written_rows)   # how many rows the copy wrote (HTTP)
print(stats.read_rows)      # how many it scanned to produce them

export()¶

Export data to a local file. The format is picked from the file extension and the data streams directly from ClickHouse — no Python round-trip, so multi-million-row exports stay memory-bounded. View constraints (where, limit, order_by) are honored. Returns the absolute path written.

await obj.export("/tmp/data.csv")
await obj.export("/tmp/data.parquet")
await obj.view(where="score > 10", limit=1000).export("/tmp/top.jsonl")

Supported extensions¶

Extension	ClickHouse format	Notes
`.csv`	`CSVWithNames`	header row included
`.tsv`	`TSVWithNames`	header row included
`.json` / `.jsonl` / `.ndjson`	`JSONEachRow`	newline-delimited JSON
`.parquet`	`Parquet`	columnar, best for large datasets
`.arrow`	`Arrow`	Apache Arrow IPC stream
`.orc`	`ORC`	Apache ORC columnar
`.avro`	`Avro`	Avro container with embedded schema
`.md`	`Markdown`	pipe-delimited markdown table
`.xml`	`XML`	one `<row>` element per record
`.sql`	`SQLInsert`	`INSERT INTO table VALUES (...)`

Compression¶

Append .gz or .xz and the writer compresses automatically:

await obj.export("/tmp/data.csv.gz")
await obj.export("/tmp/data.parquet.gz")
await obj.export("/tmp/data.json.xz")

Backend behavior¶

chdb (embedded): INSERT INTO FUNCTION file('path', fmt) — the embedded engine streams directly to disk and picks the compression codec from the path suffix.
clickhouse-connect (remote): raw_stream(query, fmt=fmt) returns uncompressed bytes, which are copied to the local file in 64 KB chunks (on a worker thread) and wrapped through gzip / lzma if the path asks for it. INSERT INTO FUNCTION file() cannot be used here because it would write to the server's user_files_path, not the client.

Views¶

Read-only filtered projection of an Object — same table, no data copy. Created via obj.view(where=..., limit=..., offset=..., order_by=...). Supports all read operations; cannot insert(). See examples/view_examples.py.

Chained WHERE Clauses¶

obj.where(cond) — creates View with initial WHERE
view.where(cond) — AND-chains: .where('x > 10').where('y < 20') → WHERE (x > 10) AND (y < 20)
view.or_where(cond) — OR-chains: .where('x > 100').or_where('y < 5') → WHERE (x > 100) OR (y < 5)

or_where() requires a prior where()

Calling or_where() without a preceding where() raises ValueError. Same applies to or_having() on GroupByQuery.

Column Selection¶

Select columns from a dict Object → View (same table, restricted SELECT).

obj["col"] — single column → array-like View (single value field)
obj[["col_a", "col_b"]] — multiple columns → dict-like View

obj = await create_object_from_value({"x": [1, 2, 3], "y": [10, 20, 30]})

# Single column — returns array-like View
await obj["x"].data()              # [1, 2, 3]
arr = await obj["x"].copy()        # new array Object

# Multiple columns — returns dict-like View
await obj[["x", "y"]].data()       # {'x': [1, 2, 3], 'y': [10, 20, 30]}

Preserves WHERE and computed column constraints when chained on a filtered View.

Tests: aaiclick/data/object/test_column_selection.py

Computed Column Expansion: `with_columns()`¶

with_columns() adds derived columns (SQL expressions) as a lightweight View — no new table, no data copy.

API¶

Synchronous (no await). Works on both Object and View; preserves existing constraints. Uses Computed(type, expression) from aaiclick.data.models.

`literal()` Helper¶

Convenience wrapper for constant columns — handles SQL quoting:

from aaiclick import literal

# Before: manual quoting
obj.with_columns({"source": Computed("String", "'dataset_a'")})

# After: literal() handles it
obj.with_columns({"source": literal("dataset_a", "String")})
obj.with_columns({"flag": literal(True, "UInt8")})
obj.with_columns({"weight": literal(1.0, "Float64")})

Supported types: str (quoted), int/float (bare), bool (true/false).

Explode¶

Flattens Array column(s) into individual rows (scalar columns duplicated). Returns a View — downstream operators fuse into a single query. Exploded columns change from Array(T) to T. See aaiclick/data/examples/explode.py.

Tests: aaiclick/data/object/test_explode.py

Chaining¶

with_columns() returns a View, so all View operations work: group_by(), where(), column selection, further with_columns() calls (additive), and operators on selected columns.

with_columns() internals

Result is always a View with dict-like schema. Array sources promote to dict; scalar sources raise ValueError. Column name collisions raise ValueError — with_columns() adds, never replaces.

SQL expressions are passed verbatim to ClickHouse. Basic validation rejects semicolons and SELECT. Type mismatches are caught by ClickHouse at query time.

Domain Helpers¶

Named shortcuts that delegate to with_columns(). Each auto-names the result column; all accept alias= override and return a View.

Helper	Default Alias	Type	Expression
`with_year(col)`	`{col}_year`	`UInt16`	`toYear(col)`
`with_month(col)`	`{col}_month`	`UInt8`	`toMonth(col)`
`with_day_of_week(col)`	`{col}_dow`	`UInt8`	`toDayOfWeek(col)`
`with_date_diff(unit, col_a, col_b)`	`{col_a}_{col_b}_diff`	`Int64`	`dateDiff('unit', col_a, col_b)`
`with_lower(col)`	`{col}_lower`	`String`	`lower(col)`
`with_upper(col)`	`{col}_upper`	`String`	`upper(col)`
`with_length(col)`	`{col}_length`	`UInt64`	`length(col)`
`with_trim(col)`	`{col}_trimmed`	`String`	`trim(col)`
`with_abs(col)`	`{col}_abs`	`Float64`	`abs(col)`
`with_log2(col)`	`{col}_log2`	`Float64`	`log2(col)`
`with_sqrt(col)`	`{col}_sqrt`	`Float64`	`sqrt(col)`
`with_bucket(col, size)`	`{col}_bucket`	`Int64`	`intDiv(col, size)`
`with_hash_bucket(col, n)`	`{col}_hash`	`UInt64`	`cityHash64(col) % n`
`with_if(cond, then, else, *, alias)`	required `alias`	`String`	`if(cond, then, else)`
`with_cast(col, ch_type)`	`{col}_{type_lower}`	`ch_type`	`to{Type}(col)`
`with_split_by_char(col, sep)`	`{col}_parts`	`Array(String)`	`splitByChar(sep, col)`
`with_isin(col, other)`	`{col}_isin`	`UInt8`	`col IN (SELECT value FROM …)`

with_columns() remains the public power-user interface for arbitrary expressions via Computed(type, expression).

Tests: aaiclick/data/object/test_with_columns.py

Column Renaming: `rename()`¶

Returns a View aliasing column names (old AS new). Synchronous, no await.

from aaiclick import literal

# Rename camelCase columns to snake_case for a consolidated table
kev_view = kev.rename({
    "cveID": "cve_id",
    "vendorProject": "vendor",
    "vulnerabilityName": "vulnerability_name",
}).with_columns({
    "source": literal("kev", "String"),
})
await consolidated.insert(kev_view)

Chainable with with_columns(), where(), and other View operations. New names must not collide with non-renamed columns.

Tests: aaiclick/data/object/test_rename.py

The copy() Method¶

Full data copy → new Object. Works on both Objects and Views — filters, computed columns, and ORDER BY are preserved.

# Copy an array Object
obj_copy = await obj.copy()

# Materialize a View into a new Object
arr = await obj["x"].copy()         # array Object from dict column
subset = await obj.where("x > 5").copy()  # filtered copy

# Sorted copy — ORDER BY is preserved
sorted_copy = await obj.view(order_by="amount DESC").copy()
await sorted_copy.data()  # returns rows sorted by amount DESC

The result carries the INSERT … SELECT stats — result = await obj.copy() then (await result.stats()).read_rows tells you how much it scanned. See stats().

copy() is not serialized across workers

Unlike insert() and concat(), copy() does not take a per-table advisory lock. Two concurrent copy() calls into the same named destination will interleave their rows non-deterministically. Structure your pipeline so a named destination has a single writer.

Tests: aaiclick/data/test_copy_parametrized.py

Operation Provenance (Oplog)¶

All Object operations are instrumented to record provenance via OplogCollector. See docs/oplog.md.