Files
Verso/docs/python-dependencies-design.md
T
claude 51620caf8b docs: design for per-project Python dependencies (cached venv)
Captures the proposed requirements.txt -> cached virtualenv approach (keyed by
hash, --system-site-packages, QUARTO_PYTHON), its guard rails (auth gating,
egress restriction, resource caps) given anonymous write is enabled, lifecycle
(eviction, failure UX), a phased rollout, and the open decisions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 11:35:38 +00:00

77 lines
3.7 KiB
Markdown

# Design: per-project Python dependencies (cached virtualenv)
Status: **proposal** (not yet implemented). Captures the agreed plan for letting
Quarto `{python}` cells use libraries beyond the curated base set.
## Background
Quarto executes `` ```{python} `` cells through a Jupyter kernel. The base image
([`server-ce/Dockerfile-base`](../server-ce/Dockerfile-base)) bundles a curated
scientific stack (numpy, pandas, scipy, matplotlib, seaborn, scikit-learn,
sympy, plotly, tabulate). Anything outside that set currently fails the render
with `ModuleNotFoundError`.
As a first step that already shipped, the Quarto log parser
([`quarto-log-parser.ts`](../services/web/frontend/js/ide/log-parser/quarto-log-parser.ts))
turns a missing-package traceback into an actionable message. This document is
the *next* step: letting a project declare and install its own dependencies.
**Key constraint:** the instance runs with anonymous read+write enabled
(`OVERLEAF_ALLOW_ANONYMOUS_READ_AND_WRITE_SHARING=true`), so compiles can be
triggered by untrusted users. Installing arbitrary packages is therefore a
security decision, not just a convenience.
## Mechanism
1. **Declaration.** A standard `requirements.txt` at the project root opts the
project in (familiar, Quarto-agnostic, supports version pinning).
2. **Keying.** CLSI hashes `sha256(requirements.txt + python version)`. The hash
names a venv directory on a **persistent volume**, e.g.
`…/data/python-venvs/<hash>/`. Identical dependency sets share one venv across
projects and compiles.
3. **Build-if-missing.** `python3 -m venv --system-site-packages <dir>` (so the
bundled stack stays visible and only the *extra* deps are installed — smaller
and faster), then `<dir>/bin/pip install -r requirements.txt`. Guard with a
per-hash `flock` so concurrent compiles don't build the same venv twice.
4. **Point Quarto at it.** Set `QUARTO_PYTHON=<dir>/bin/python3` in the render
environment (threaded web → CLSI exactly like `exportMode`). With
`--system-site-packages`, `ipykernel` from the base is importable, so the
kernel runs in that interpreter with base + project packages.
## Guard rails
- **Auth gating.** Only run the install path for **logged-in owner/collaborator**
compiles. Anonymous-link compiles use the plain base interpreter and never
trigger installs. Web decides and passes a boolean to CLSI; default-deny.
- **Network egress.** The compile environment must reach PyPI to install.
Restrict egress to PyPI / an internal mirror only (k8s NetworkPolicy + pip
`--index-url`), not arbitrary hosts.
- **Resource caps.** Install timeout, venv size cap, max package count; surface
overruns as a clear log error.
- **Trust boundary.** Even gated, a trusted user installing packages is
arbitrary code execution in the sandbox. Containment stays the CLSI container
+ resource limits + egress policy. This is owner-trust-level by design.
## Lifecycle
- **Eviction.** `touch` the venv on use; an LRU cleanup job prunes the oldest
venvs when the volume exceeds a size budget.
- **Failure UX.** pip errors flow into the log panel (reusing the friendly-error
pattern) showing pip's output.
## Rollout
- **Phase 1.** Detection + `flock` venv build + `QUARTO_PYTHON`, behind a
settings flag (default **off**), gated to logged-in owner, dev volume.
- **Phase 2.** Egress NetworkPolicy + index pinning + eviction job.
- **Phase 3.** Nicer pip-error surfacing + a small project-settings UI
affordance.
## Open decisions
- `requirements.txt` vs a frontmatter field vs both?
- Shared global venv volume vs per-user namespacing (sharing is cheaper;
per-user is stricter isolation)?
- Allow native/compiled wheels (broader support) vs wheels-only/no-build
(tighter security)?