diff --git a/docs/python-dependencies-design.md b/docs/python-dependencies-design.md
new file mode 100644
index 0000000000..14ee40e8c1
--- /dev/null
+++ b/docs/python-dependencies-design.md
@@ -0,0 +1,76 @@
+# Design: per-project Python dependencies (cached virtualenv)
+
+Status: **proposal** (not yet implemented). Captures the agreed plan for letting
+Quarto `{python}` cells use libraries beyond the curated base set.
+
+## Background
+
+Quarto executes `` ```{python} `` cells through a Jupyter kernel. The base image
+([`server-ce/Dockerfile-base`](../server-ce/Dockerfile-base)) bundles a curated
+scientific stack (numpy, pandas, scipy, matplotlib, seaborn, scikit-learn,
+sympy, plotly, tabulate). Anything outside that set currently fails the render
+with `ModuleNotFoundError`.
+
+As a first step that already shipped, the Quarto log parser
+([`quarto-log-parser.ts`](../services/web/frontend/js/ide/log-parser/quarto-log-parser.ts))
+turns a missing-package traceback into an actionable message. This document is
+the *next* step: letting a project declare and install its own dependencies.
+
+**Key constraint:** the instance runs with anonymous read+write enabled
+(`OVERLEAF_ALLOW_ANONYMOUS_READ_AND_WRITE_SHARING=true`), so compiles can be
+triggered by untrusted users. Installing arbitrary packages is therefore a
+security decision, not just a convenience.
+
+## Mechanism
+
+1. **Declaration.** A standard `requirements.txt` at the project root opts the
+   project in (familiar, Quarto-agnostic, supports version pinning).
+2. **Keying.** CLSI hashes `sha256(requirements.txt + python version)`. The hash
+   names a venv directory on a **persistent volume**, e.g.
+   `…/data/python-venvs//`. Identical dependency sets share one venv across
+   projects and compiles.
+3. **Build-if-missing.** `python3 -m venv --system-site-packages ` (so the
+   bundled stack stays visible and only the *extra* deps are installed: smaller
+   and faster), then `/bin/pip install -r requirements.txt`. Guard with a
+   per-hash `flock` so concurrent compiles don't build the same venv twice.
+4. **Point Quarto at it.** Set `QUARTO_PYTHON=/bin/python3` in the render
+   environment (threaded web → CLSI exactly like `exportMode`). With
+   `--system-site-packages`, `ipykernel` from the base is importable, so the
+   kernel runs in that interpreter with base + project packages.
+
+## Guard rails
+
+- **Auth gating.** Only run the install path for **logged-in owner/collaborator**
+  compiles. Anonymous-link compiles use the plain base interpreter and never
+  trigger installs. Web decides and passes a boolean to CLSI; default-deny.
+- **Network egress.** The compile environment must reach PyPI to install.
+  Restrict egress to PyPI / an internal mirror only (k8s NetworkPolicy + pip
+  `--index-url`), not arbitrary hosts.
+- **Resource caps.** Install timeout, venv size cap, max package count; surface
+  overruns as a clear log error.
+- **Trust boundary.** Even gated, a trusted user installing packages is
+  arbitrary code execution in the sandbox. Containment stays the CLSI container
+  + resource limits + egress policy. This is owner-trust-level by design.
+
+## Lifecycle
+
+- **Eviction.** `touch` the venv on use; an LRU cleanup job prunes the oldest
+  venvs when the volume exceeds a size budget.
+- **Failure UX.** pip errors flow into the log panel (reusing the friendly-error
+  pattern) showing pip's output.
+
+## Rollout
+
+- **Phase 1.** Detection + `flock` venv build + `QUARTO_PYTHON`, behind a
+  settings flag (default **off**), gated to logged-in owner, dev volume.
+- **Phase 2.** Egress NetworkPolicy + index pinning + eviction job.
+- **Phase 3.** Nicer pip-error surfacing + a small project-settings UI
+  affordance.
+
+## Open decisions
+
+- `requirements.txt` vs a frontmatter field vs both?
+- Shared global venv volume vs per-user namespacing (sharing is cheaper;
+  per-user is stricter isolation)?
+- Allow native/compiled wheels (broader support) vs wheels-only/no-build
+  (tighter security)?
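
Illustrative note, not part of the diff: the Mechanism steps (keying, build-if-missing under a per-hash lock, and the `QUARTO_PYTHON` handoff) could look roughly like the Python sketch below. This is only a sketch under stated assumptions. The proposal does not say which language CLSI would implement this in, and every name here (`venv_key`, `ensure_venv`, `build_venv`, `render_env`, the `.ready` marker, the `<hash>.lock` file, the NUL separator in the hash input) is hypothetical. It uses `fcntl.flock`, so it is Unix-only.

```python
import fcntl
import hashlib
import os
import subprocess
import sys
from pathlib import Path


def venv_key(requirements: bytes, python_version: str) -> str:
    """Hash the requirements bytes together with the interpreter version."""
    h = hashlib.sha256()
    h.update(requirements)
    h.update(b"\0")  # assumption: separator so the two inputs cannot run together
    h.update(python_version.encode())
    return h.hexdigest()


def build_venv(venv_dir: Path, requirements_path: Path) -> None:
    """Create the venv on top of the base stack, then install only the extras."""
    subprocess.run(
        [sys.executable, "-m", "venv", "--system-site-packages", str(venv_dir)],
        check=True,
    )
    subprocess.run(
        [str(venv_dir / "bin" / "pip"), "install", "-r", str(requirements_path)],
        check=True,
    )


def ensure_venv(requirements_path: Path, venv_root: Path, build=build_venv) -> Path:
    """Return the venv for this dependency set, building it at most once."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    key = venv_key(requirements_path.read_bytes(), version)
    venv_dir = venv_root / key
    marker = venv_dir / ".ready"  # hypothetical marker: written only after a full build
    venv_root.mkdir(parents=True, exist_ok=True)
    with open(venv_root / f"{key}.lock", "w") as lock:
        # Per-hash exclusive lock: a concurrent compile blocks here, then finds
        # the marker and skips the build. Closing the file releases the lock.
        fcntl.flock(lock, fcntl.LOCK_EX)
        if not marker.exists():
            build(venv_dir, requirements_path)
            marker.touch()
    return venv_dir


def render_env(venv_dir: Path) -> dict:
    """Environment for the render, pointing Quarto at the venv interpreter."""
    env = dict(os.environ)
    env["QUARTO_PYTHON"] = str(venv_dir / "bin" / "python3")
    return env
```

A caller would run something like `env = render_env(ensure_venv(project_dir / "requirements.txt", venv_root))` before invoking the render, where `project_dir` and `venv_root` are placeholders. Because the marker is written only after `pip` succeeds, a failed install leaves no `.ready` file and the next compile retries the build. Whether that retry behaviour is wanted is not covered by the proposal.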