Prebuilt Images & Shell Tasks — Design¶
Problem¶
Today every docker/kubernetes job is built from the user's git repo: a
docker_build task is auto-injected, clones the repo at a SHA, and runs
docker build. The container is then always invoked as
python -m …docker_worker --task-id N,
so the image must contain aaiclick plus the task's Python code. There is no way to:
- Run a job against a prebuilt image (e.g. python:3.12) without a build stage, or
- Run an arbitrary command in that image instead of an aaiclick module entrypoint.
This spec adds both, and in the process folds the sprawling per-job Docker/K8s columns into a single typed config.
Goals¶
- Run a task as a literal command (an argv) in any runner's execution environment (a host subprocess, a Docker container, or a Kubernetes Pod) with no aaiclick required in that environment.
- Supply a prebuilt image_tag so the auto-injected build task is skipped (docker/kubernetes only).
- Replace the flat git_*/dockerfile/image_tag/kubernetes_config columns with one typed, discriminated runner config.
Non-goals (tracked in docs/future.md)¶
- Capturing shell stdout as a data result (result.data()). Shell tasks are exit-code-only.
- String-form (sh -c "…") commands. Argv-list only.
- Moving entrypoint/kwargs into a unified entry config (full symmetry). The module entry keeps its existing columns; see "Scope decision".
Orthogonal axes¶
The feature consists of three independent dials that compose freely:
| Axis | Variants | Applies to |
|---|---|---|
| entry_type (per task) | module (today) / shell (new) | every runner: subprocess, docker, kubernetes |
| runner_mode (per job) | subprocess / docker / kubernetes | n/a |
| image_source (per job, nested in docker/kubernetes runner config) | build (git → build task → computed tag, today) / prebuilt (explicit image_tag, no build) | docker, kubernetes only (subprocess has no image) |
entry_type is orthogonal to the runner: a shell task is "run this argv in
the runner's environment, success = exit 0," whether that environment is a host
subprocess, a container, or a Pod. So the full matrix is module/shell ×
subprocess/docker/kubernetes, with image_source an extra dial on the two
container runners.
Notable combinations:
- shell + subprocess: run a local command on the worker host (no image).
- shell + prebuilt: run a command on python:3.12 (headline case).
- module + prebuilt: a pre-published aaiclick image, no rebuild.
- shell + build: a command against the user's freshly built image.
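The full matrix described above can be enumerated mechanically. The following is a minimal sketch for illustration only; the combinations helper and the list names are hypothetical and not part of aaiclick.

```python
from __future__ import annotations

from itertools import product

ENTRY_TYPES = ["module", "shell"]
RUNNER_MODES = ["subprocess", "docker", "kubernetes"]
IMAGE_SOURCES = ["build", "prebuilt"]


def combinations() -> list[tuple[str, str, str | None]]:
    """Enumerate (entry_type, runner_mode, image_source) triples.

    image_source is None for subprocess, which has no image.
    """
    combos: list[tuple[str, str, str | None]] = []
    for entry_type, runner_mode in product(ENTRY_TYPES, RUNNER_MODES):
        if runner_mode == "subprocess":
            combos.append((entry_type, runner_mode, None))
        else:
            # image_source is an extra dial on the two container runners only
            for image_source in IMAGE_SOURCES:
                combos.append((entry_type, runner_mode, image_source))
    return combos
```

Under these assumptions there are ten combinations: two entry types times one subprocess variant plus two image sources for each of the two container runners.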
Data model¶
Task entry¶
entry_type is a flat Literal discriminator column; the module entry keeps
its existing entrypoint/kwargs columns; shell adds its own payload columns.

    from typing import Literal

    # Discriminator values for Task.entry_type
    ENTRY_MODULE = "module"
    ENTRY_SHELL = "shell"
    EntryType = Literal["module", "shell"]
    ENTRY_TYPES: list[EntryType] = [ENTRY_MODULE, ENTRY_SHELL]

Task columns:
| Column | Type | Notes |
|---|---|---|
| entry_type | String, not null, no DB server default | Discriminator. No DB server default; the Python model defaults to module only so the framework's module-task constructors stay terse. Every submission-boundary path (run_job, API request models, CLI) sets it explicitly; that is where "explicit, not silently defaulted" is enforced. |
| entrypoint | str | Module dotted path. Required for module; unused for shell. |
| kwargs | JSON | Module args. Empty for shell. |
| command | JSON (list[str]), nullable | Argv for shell. Null for module. |
| command_env | JSON (dict[str, str]), nullable | Env map injected as -e for shell. Null for module. |
A module task is unchanged on disk. A shell task sets
entry_type="shell", command=[…], optional command_env, and leaves
entrypoint empty.
Job runner config¶
runner_mode stays a flat Literal discriminator column (indexed, read on the
dispatch hot path). The variant-specific fields move into one typed runner
config, serialized to a JSON column. The flat git_remote, git_sha,
git_branch, dockerfile, image_tag, kubernetes_config columns on Job
are removed and folded in. The dockerfile/git_remote/kubernetes_config
default columns on RegisteredJob are kept; see "Migration" for the reasoning.

    from typing import Annotated, Literal

    from pydantic import BaseModel, Field


    class ImageBuild(BaseModel):
        type: Literal["build"] = "build"
        git_remote: str
        git_sha: str
        git_branch: str | None = None
        dockerfile: str | None = None
        # image_tag is computed (aaiclick-job:<sha>), not stored here.


    class ImagePrebuilt(BaseModel):
        type: Literal["prebuilt"] = "prebuilt"
        image_tag: str  # e.g. "python:3.12"


    ImageSource = Annotated[ImageBuild | ImagePrebuilt, Field(discriminator="type")]


    class SubprocessRunner(BaseModel):
        type: Literal["subprocess"] = "subprocess"


    class DockerRunner(BaseModel):
        type: Literal["docker"] = "docker"
        image: ImageSource


    class KubernetesRunner(BaseModel):
        type: Literal["kubernetes"] = "kubernetes"
        image: ImageSource
        namespace: str | None = None
        service_account: str | None = None
        image_pull_secret: str | None = None
        resources: dict | None = None  # pod resource requests/limits (passthrough)


    RunnerConfig = Annotated[
        SubprocessRunner | DockerRunner | KubernetesRunner,
        Field(discriminator="type"),
    ]

Job.runner / RegisteredJob.runner hold the serialized RunnerConfig JSON.
runner_mode remains the flat discriminator and must agree with
runner.type.
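The agreement rule between runner_mode and runner.type can be checked when the JSON column is read back. Below is a minimal stdlib-only sketch that stands in for the Pydantic models; load_runner and the two lookup tables are hypothetical names, and the real code would presumably validate through RunnerConfig instead.

```python
from __future__ import annotations

import json

# Hypothetical lookup tables mirroring the typed models described above.
RUNNER_TYPES = {"subprocess", "docker", "kubernetes"}
IMAGE_REQUIRED = {"build": ("git_remote", "git_sha"), "prebuilt": ("image_tag",)}


def load_runner(runner_mode: str, runner_json: str) -> dict:
    """Parse a serialized runner config and check it against runner_mode."""
    runner = json.loads(runner_json)
    runner_type = runner.get("type")
    if runner_type not in RUNNER_TYPES:
        raise ValueError(f"unknown runner type: {runner_type!r}")
    if runner_type != runner_mode:
        raise ValueError(f"runner_mode {runner_mode!r} != runner.type {runner_type!r}")
    if runner_type != "subprocess":
        # Container runners carry a nested, discriminated image source.
        image = runner.get("image") or {}
        required = IMAGE_REQUIRED.get(image.get("type"))
        if required is None:
            raise ValueError(f"unknown image source: {image.get('type')!r}")
        missing = [name for name in required if not image.get(name)]
        if missing:
            raise ValueError(f"image source missing fields: {missing}")
    return runner
```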
Effective image tag¶
- build → image_tag = compute_image_tag(git_sha) = [<registry>/]aaiclick-job:<sha> (unchanged).
- prebuilt → image_tag = image.image_tag verbatim.
A single helper resolves the effective tag from a RunnerConfig so dispatch and
the workers don't branch on the union shape.
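A sketch of what such a helper might look like, operating on the serialized runner config as a plain dict. The name effective_image_tag is hypothetical, compute_image_tag here is only a stand-in for the existing function, and passing the registry as a parameter is an assumption (the source does not say where the registry prefix comes from).

```python
from __future__ import annotations


def compute_image_tag(git_sha: str, registry: str | None = None) -> str:
    """Stand-in for the existing helper: [<registry>/]aaiclick-job:<sha>."""
    tag = f"aaiclick-job:{git_sha}"
    return f"{registry}/{tag}" if registry else tag


def effective_image_tag(runner: dict, registry: str | None = None) -> str | None:
    """Resolve the tag to run from a serialized runner config.

    Returns None for the subprocess runner, which has no image.
    """
    image = runner.get("image")
    if image is None:
        return None
    if image["type"] == "prebuilt":
        return image["image_tag"]  # used verbatim
    return compute_image_tag(image["git_sha"], registry)
```

Callers such as dispatch would then ask only for the tag and never inspect which image source variant produced it.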
Execution layers¶
The runner's invocation lives at a different layer than the task it runs, and
the existing module path blurs the two. Naming this explicitly keeps the
entry_type fork honest. The same three layers exist for every runner
(subprocess / docker / kubernetes), host → execution environment:
1. Host worker: the runner's ExecuteFn driving the task (heartbeat, cancellation poll, wait, result read), i.e. _run_task_in_child (subprocess), _run_task_in_container (docker), _run_task_in_pod (k8s). This is the genuine worker level, identical for module and shell.
2. Runner invocation: what the runner launches in its execution environment, namely the mp child target, the docker run <image> … argv, or the Pod command. This is the runner/vehicle level, not the task's own definition, and it is where entry_type branches.
3. Task execution: execute_task(task) running the module entrypoint.
For the container runners, python -m …docker_worker --task-id N (and the Pod
equivalent) is layer 2, not layer 3: a per-task bootstrap shim,
framework plumbing that happens to live in the docker_worker module (named for
the host worker — the source of the confusion). Despite "worker" in the path it
is not a queue-claiming loop: it loads one task by id, boots orch_context(),
calls execute_task, writes the result, and exits. The subprocess runner's
module path has the analogous shim (the mp child target that calls
execute_task).
The module/shell fork differs in what occupies layer 2 — uniformly across
runners:
| entry_type | layer-2 runner invocation | layer-3 execution |
|---|---|---|
| module | a fixed bootstrap shim (mp child target / …docker_worker --task-id N / Pod shim); plumbing, not the user's entry | shim calls execute_task(entrypoint) |
| shell | the user's argv directly; the task definition is the invocation | none: the argv is the execution; no execute_task |
So for module the user's entry (a dotted path) executes inside a worker-level
shim; for shell the user's entry (an argv) replaces the shim, bypassing both
the bootstrap and execute_task, in whichever environment the runner provides.
Module/code names stay as-is; a later split of the in-container shim out of
docker_worker.py is out of scope here.
Behavior¶
Conditional build-task injection¶
_create_built_job injects the docker_build prerequisite only when
runner.image is an ImageBuild. For ImagePrebuilt, the entry task is created
with no build dependency and runs straight away against the given image_tag.
resolve_docker_config (renamed/retyped to return a RunnerConfig) keeps the
existing precedence for build (explicit kwarg → registered default → git
auto-detect) and short-circuits to ImagePrebuilt when an image is supplied
explicitly or on the registered job.
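The injection rule can be pictured as a small planning step. This is a hedged sketch, not the actual _create_built_job: PlannedTask and plan_tasks are hypothetical names, and the runner config is taken as a plain dict.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PlannedTask:
    name: str
    depends_on: list[str] = field(default_factory=list)


def plan_tasks(runner: dict, entry_name: str = "entry") -> list[PlannedTask]:
    """Return the tasks to create, injecting docker_build only for build sources."""
    image = runner.get("image") or {}
    if image.get("type") == "build":
        build = PlannedTask("docker_build")
        return [build, PlannedTask(entry_name, depends_on=[build.name])]
    # prebuilt (or subprocess): no build dependency, so the entry task can run at once
    return [PlannedTask(entry_name)]
```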
Shell invocation — per runner¶
Every runner branches on entry_type at layer 2; module is unchanged.
- subprocess (mp_worker): module spawns the mp child that calls execute_task (today). shell runs the argv as a host subprocess (asyncio subprocess), captures exit code + output, and never imports the task or boots an in-process execute path.
- docker (_build_docker_run_cmd): module mounts the IPC tmpdir + log base, passes build_runner_env(), runs python -m …docker_worker --task-id N. shell runs docker run <image> <command…> with no IPC mount and no result.json. The log base is still bind-mounted so output lands in the per-task log file.
- kubernetes (_build_pod_manifest): module keeps the Pod bootstrap shim. shell sets the Pod container command to the argv, with no RemoteTaskResult round-trip.
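For the docker runner, the shell branch amounts to assembling a shorter argv. The sketch below is illustrative only: build_shell_docker_cmd is a hypothetical name rather than the real _build_docker_run_cmd, and the --rm flag and the identical host/container mount path for the log directory are assumptions.

```python
from __future__ import annotations


def build_shell_docker_cmd(
    image_tag: str,
    command: list[str],
    command_env: dict[str, str] | None = None,
    log_dir: str | None = None,
) -> list[str]:
    """Assemble a `docker run` argv for a shell task (illustrative only)."""
    if not command:
        raise ValueError("shell task requires a non-empty command")
    cmd = ["docker", "run", "--rm"]  # --rm is an assumption, not stated in the design
    for key, value in (command_env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    if log_dir:
        # The log base is still bind-mounted; same path on both sides is assumed.
        cmd += ["-v", f"{log_dir}:{log_dir}"]
    # Image first, then the user's argv verbatim: no shim and no IPC mount.
    return cmd + [image_tag, *command]
```

Because the command stays an argv list end to end, no shell quoting is involved at any point.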
Shell result handling¶
Uniform across runners: success = the process/container/Pod exit code is 0,
non-zero → failure. result_ref is always None (shell tasks produce no
result.data()). stdout/stderr is captured into the normal per-task log file
(host subprocess capture / docker logs / kubectl logs), so logs surface
uniformly in the UI/CLI. The existing cancellation/timeout paths are unchanged.
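On the subprocess runner, the exit-code rule could be implemented roughly as follows. This is a minimal sketch using the standard asyncio subprocess API; run_shell and the log_path parameter are hypothetical, and heartbeat, cancellation and timeout handling are left out.

```python
from __future__ import annotations

import asyncio


async def run_shell(command: list[str], log_path: str) -> bool:
    """Run an argv as a host subprocess; success means exit code 0."""
    with open(log_path, "wb") as log:
        proc = await asyncio.create_subprocess_exec(
            *command,
            stdout=log,  # stdout goes to the per-task log file
            stderr=asyncio.subprocess.STDOUT,  # stderr is merged into the same file
        )
        exit_code = await proc.wait()
    # No result payload is produced; the caller would record result_ref as None.
    return exit_code == 0
```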
Env summary¶
command_env is the only env a shell task receives in an isolated environment
(container/Pod). On the subprocess runner there is no isolation boundary — the
command runs as a child of the worker — so it inherits the worker's process env
with command_env overlaid on top.
| Task | Env injected |
|---|---|
| module (any runner) | build_runner_env() (DB URLs + framework knobs + passthrough) |
| shell + docker / kubernetes | command_env only (on top of the image's own env) |
| shell + subprocess | worker process env + command_env overlay |
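The shell rows of the table reduce to one small function. The sketch below is an illustration under stated assumptions: shell_env is a hypothetical name, the image's own env is not modeled, and the worker_env parameter exists only so the behavior can be exercised without touching the real process environment.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def shell_env(
    runner_mode: str,
    command_env: Mapping[str, str] | None,
    worker_env: Mapping[str, str] | None = None,
) -> dict[str, str]:
    """Env injected for a shell task, per runner."""
    overlay = dict(command_env or {})
    if runner_mode == "subprocess":
        # No isolation boundary: start from the worker env, then overlay.
        base = dict(os.environ if worker_env is None else worker_env)
        base.update(overlay)  # command_env wins on conflicts
        return base
    # docker / kubernetes: only command_env is injected, so no DB URLs leak in.
    return overlay
```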
Validation¶
Enforced at write boundaries (Pydantic models, run_job, API request models,
CLI choices=):
- prebuilt requires a non-empty image_tag.
- build requires git_remote + git_sha (auto-detected as today when not given).
- shell requires a non-empty command list. Valid on every runner (subprocess / docker / kubernetes).
- module does not take a command.
- runner_mode must equal runner.type.
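The task-entry rules in this list could be enforced by a check along these lines. It is a sketch only: validate_entry is a hypothetical name, and the actual enforcement is expected to live in the Pydantic models and submission paths named above.

```python
from __future__ import annotations


def validate_entry(
    entry_type: str,
    entrypoint: str | None = None,
    command: list[str] | None = None,
) -> None:
    """Raise ValueError if a task entry violates the entry rules."""
    if entry_type == "shell":
        # A bare string is rejected: commands are argv lists only.
        if not isinstance(command, list) or not command:
            raise ValueError("shell requires a non-empty command list (argv)")
        if not all(isinstance(arg, str) for arg in command):
            raise ValueError("every command element must be a string")
    elif entry_type == "module":
        if command is not None:
            raise ValueError("module does not take a command")
        if not entrypoint:
            raise ValueError("module requires an entrypoint")
    else:
        raise ValueError(f"unknown entry_type: {entry_type!r}")
```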
API / submission surface¶
- run_job(...): add entry_type, command, command_env, and image (prebuilt tag) parameters alongside the existing git/k8s ones. The git parameters and an image are mutually exclusive per job.
- RunJobRequest / RegisteredJobRequest (view_models.py): add the same fields.
- CLI: extend the job-submit/register commands with --entry-type, --command, --command-env, --image.
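One possible shape for the CLI flags, sketched with argparse. Only the four flag names come from this design; the program name, the --git-remote stand-in for the existing git options, the JSON encoding for --command and the KEY=VALUE form for --command-env are all assumptions made for illustration.

```python
from __future__ import annotations

import argparse
import json


def parse_command(raw: str) -> list[str]:
    """Parse --command as a JSON list of strings (argv form only)."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError(f"not valid JSON: {exc}") from exc
    if not isinstance(value, list) or not value or not all(isinstance(a, str) for a in value):
        raise argparse.ArgumentTypeError("expected a non-empty JSON list of strings")
    return value


def parse_env(pair: str) -> tuple[str, str]:
    """Parse one KEY=VALUE item for --command-env."""
    key, sep, value = pair.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {pair!r}")
    return key, value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="submit-job")  # hypothetical program name
    parser.add_argument("--entry-type", choices=["module", "shell"], required=True)
    parser.add_argument("--command", type=parse_command)
    parser.add_argument("--command-env", action="append", type=parse_env, default=[])
    # Git parameters and a prebuilt image are mutually exclusive per job.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--image", help="prebuilt image tag, e.g. python:3.12")
    source.add_argument("--git-remote", help="stand-in for the existing git options")
    return parser
```

Taking the command as a JSON list keeps it in argv form and mirrors how the command column is stored.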
Migration¶
One Alembic migration (via the generate-migration skill — never hand-written):
- Add Task.command, Task.command_env, then add Task.entry_type as nullable, backfill every existing row to "module" in the same migration, and finalize the column as not-null. The column carries no server default; new rows must supply entry_type explicitly from code.
- Add Job.runner + RegisteredJob.runner JSON columns. Drop the Job flat columns (git_remote / git_sha / git_branch / dockerfile / image_tag / kubernetes_config); the Job's fully-resolved config now lives in Job.runner, so these are dead.
Keep the RegisteredJob flat default columns (git_remote,
dockerfile, kubernetes_config). A registration holds only partial
build defaults — there is no git_sha at registration time (it is resolved
per run) — and the typed ImageBuild source requires a complete git_sha,
so partial defaults cannot be represented as a runner config without
making the image-source fields optional (a worse model) or inventing a
separate partial-defaults type (scope creep). RegisteredJob.runner is used
only to carry a prebuilt image default; build defaults stay in the flat
columns, read by resolve_runner_config/resolve_kubernetes_config at
submission time. This keeps the unification clean where it pays off (the
Job's resolved snapshot) without distorting the type model for partial
registration defaults.
Scope decision (recorded)¶
entrypoint/kwargs are the backbone of the task framework (@task,
task_registry, create_task, execute_task, and most tests read them
directly). Folding them into a unified entry config (full module/shell symmetry)
was rejected as too large and unrelated to this feature. The runner-side
unification stays because those columns are localized to the Docker/K8s path.
Blast radius¶
- orchestration/models.py: columns, discriminator constants, migration.
- orchestration/docker_config.py: typed RunnerConfig unions, prebuilt resolution, effective-tag helper.
- orchestration/factories.py: conditional build-task injection.
- orchestration/execution/dispatch.py: read runner config / effective tag; carry entry_type/command/command_env into JobDispatch.
- orchestration/execution/mp_worker.py: subprocess shell branch (run the argv as a host subprocess instead of the mp execute_task child).
- orchestration/execution/docker_worker.py: shell branch + exit-code result.
- orchestration/execution/kubernetes_worker.py: mirror prebuilt + shell.
- orchestration/execution/docker_build.py: read git fields from the build source.
- orchestration/registered_jobs.py, view_models.py, CLI: submission params.
- docs/orchestration.md: document prebuilt + shell, and the "Execution layers" distinction (host worker / container command / task execution; that the in-container docker_worker --task-id N shim is layer 2 plumbing, not task execution). Updating this section is part of the plan, not optional.
- Tests across the docker/k8s/factory/dispatch suites.
Testing¶
- Unit: RunnerConfig / ImageSource discriminated-union validation round-trips; effective-tag helper; build-task injection present for build, absent for prebuilt; validation rejections (shell-without-command, module-with-command, prebuilt-without-tag, runner_mode mismatch).
- Subprocess shell (local, no infra): a shell + subprocess task runs the argv on the host, reports success on exit 0 / failure on non-zero, and inherits the worker env with command_env overlaid. This path is testable on the default backend.
- Integration (distributed backend, GitHub Actions): shell + prebuilt job on a small public image runs the argv and reports success on exit 0 / failure on non-zero; command_env reaches the container; no DB creds leak into a shell container; module + build path unchanged.