# Design: per-project Python dependencies (cached virtualenv)

Status: **Phase 1 implemented** (gated behind `OVERLEAF_ENABLE_PROJECT_PYTHON_VENV`,
on in the deployment). Network egress policy and venv eviction (Phases 2–3)
remain. This document captures the plan for letting Quarto `{python}` cells use
libraries beyond the curated base set.

## What ships in Phase 1

- A project-root `requirements.txt` is installed into a venv cached by its
  sha256, created with `python3 -m venv --system-site-packages`; `QuartoRunner`
  points Quarto at it via `QUARTO_PYTHON`. A per-hash `flock` serialises
  concurrent builds; pip output is merged into `output.log`; on failure the
  render falls back to the base interpreter (and the missing-package message
  surfaces). Venvs live under `PYTHON_VENVS_DIR`
  (default `/var/lib/overleaf/data/python-venvs`).
- Gated by `userCanInstallPython` (`PythonVenvGate.mjs`) to the project owner +
  invited collaborators (any role), never anonymous / link-sharing users. The
  decision is threaded to CLSI as `allowPythonInstall` on the editor compile,
  presentation export, and publish paths.

### Known Phase-1 limitations

- The first build of a heavy `requirements.txt` runs within the compile
  timeout; a very large install can be killed and is then retried on the next
  compile (the venv is only marked complete on success).
- No egress restriction yet (Phase 2): installs reach PyPI directly.
- No eviction yet (Phase 3): venvs accumulate under `PYTHON_VENVS_DIR`.

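The "marked complete on success" rule and the fallback to the base interpreter could be sketched roughly as follows. This is a minimal Python illustration, not the shipped code: the function name `render_env` and the marker file name `.complete` are assumptions made for the example.

```python
from pathlib import Path
from typing import Dict, Mapping


def render_env(
    base_env: Mapping[str, str],
    venv_dir: Path,
    marker: str = ".complete",  # hypothetical marker file name
) -> Dict[str, str]:
    """Return the render environment, pointing Quarto at the venv only if usable.

    A venv whose install was killed mid-way has no marker, so the render
    falls back to the base interpreter by leaving QUARTO_PYTHON unset.
    """
    env = dict(base_env)
    python = venv_dir / "bin" / "python3"
    if (venv_dir / marker).exists() and python.exists():
        env["QUARTO_PYTHON"] = str(python)
    return env
```

Under this scheme a half-built venv is never selected, and the next compile is free to retry the build.
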
## Background

Quarto executes `` ```{python} `` cells through a Jupyter kernel. The base image
([`server-ce/Dockerfile-base`](../server-ce/Dockerfile-base)) bundles a curated
scientific stack (numpy, pandas, scipy, matplotlib, seaborn, scikit-learn,
sympy, plotly, tabulate). Anything outside that set currently fails the render
with `ModuleNotFoundError`.

As a first step that already shipped, the Quarto log parser
([`quarto-log-parser.ts`](../services/web/frontend/js/ide/log-parser/quarto-log-parser.ts))
turns a missing-package traceback into an actionable message. This document is
the *next* step: letting a project declare and install its own dependencies.

**Key constraint:** the instance runs with anonymous read+write enabled
(`OVERLEAF_ALLOW_ANONYMOUS_READ_AND_WRITE_SHARING=true`), so compiles can be
triggered by untrusted users. Installing arbitrary packages is therefore a
security decision, not just a convenience.

## Mechanism

1. **Declaration.** A standard `requirements.txt` at the project root opts the
   project in (familiar, Quarto-agnostic, supports version pinning).
2. **Keying.** CLSI computes `sha256(requirements.txt + python version)`. The hash
   names a venv directory on a **persistent volume**, e.g.
   `…/data/python-venvs/<hash>/`. Identical dependency sets share one venv across
   projects and compiles.
3. **Build-if-missing.** Run `python3 -m venv --system-site-packages <dir>` (so the
   bundled stack stays visible and only the *extra* deps are installed, which is
   smaller and faster), then `<dir>/bin/pip install -r requirements.txt`. Guard
   the build with a per-hash `flock` so concurrent compiles don't build the same
   venv twice.
4. **Point Quarto at it.** Set `QUARTO_PYTHON=<dir>/bin/python3` in the render
   environment (threaded web → CLSI exactly like `exportMode`). With
   `--system-site-packages`, `ipykernel` from the base is importable, so the
   kernel runs in that interpreter with base + project packages.

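Steps 2 and 3 can be sketched in Python as below. This is an illustration under stated assumptions rather than the CLSI implementation: the names `venv_key`, `build_venv`, `ensure_venv` and the `.complete` marker are hypothetical, and the NUL separator between the two hash inputs is a choice made for this example. The `build` callable is injected so the locking logic can be exercised without network access.

```python
import fcntl
import hashlib
import subprocess
import sys
from pathlib import Path
from typing import Callable

COMPLETE_MARKER = ".complete"  # hypothetical marker file name


def venv_key(requirements: bytes, python_version: str) -> str:
    """Hash the requirements bytes together with the interpreter version."""
    h = hashlib.sha256()
    h.update(requirements)
    h.update(b"\0")  # separator (an assumption of this sketch)
    h.update(python_version.encode())
    return h.hexdigest()


def build_venv(venv_dir: Path, requirements_file: Path) -> None:
    """Create the venv and install only the extra packages (needs network)."""
    subprocess.run(
        [sys.executable, "-m", "venv", "--system-site-packages", str(venv_dir)],
        check=True,
    )
    subprocess.run(
        [str(venv_dir / "bin" / "pip"), "install", "-r", str(requirements_file)],
        check=True,
    )


def ensure_venv(venvs_root: Path, key: str, build: Callable[[Path], None]) -> Path:
    """Build the venv for `key` at most once, even under concurrent compiles."""
    venvs_root.mkdir(parents=True, exist_ok=True)
    venv_dir = venvs_root / key
    with open(venvs_root / f"{key}.lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the file is closed
        if not (venv_dir / COMPLETE_MARKER).exists():
            build(venv_dir)  # raises on failure, so the marker is skipped
            (venv_dir / COMPLETE_MARKER).touch()
    return venv_dir / "bin" / "python3"
```

A caller would pass something like `lambda d: build_venv(d, project_dir / "requirements.txt")` as `build` (with `project_dir` being the compile directory) and export the returned path as `QUARTO_PYTHON` for step 4.
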
## Guard rails

- **Auth gating.** Only run the install path for **logged-in owner/collaborator**
  compiles. Anonymous-link compiles use the plain base interpreter and never
  trigger installs. Web decides and passes a boolean to CLSI; default-deny.
- **Network egress.** The compile environment must reach PyPI to install.
  Restrict egress to PyPI / an internal mirror only (k8s NetworkPolicy + pip
  `--index-url`), not arbitrary hosts.
- **Resource caps.** Install timeout, venv size cap, max package count; surface
  overruns as a clear log error.
- **Trust boundary.** Even gated, a trusted user installing packages is
  arbitrary code execution in the sandbox. Containment stays the CLSI container,
  resource limits, and egress policy. This is owner-trust-level by design.

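The default-deny gate and the index pinning above might look like this in outline. The real gate is `userCanInstallPython` in `PythonVenvGate.mjs`; the Python names here (`Project`, `allow_python_install`, `pip_install_command`) are hypothetical stand-ins for illustration only. `--index-url` is pip's own option; any mirror URL passed to it would be deployment-specific.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Project:
    """Hypothetical, minimal view of a project's membership."""
    owner_id: str
    collaborator_ids: FrozenSet[str] = frozenset()


def allow_python_install(
    feature_enabled: bool, user_id: Optional[str], project: Project
) -> bool:
    """Default-deny: True only for a logged-in owner or invited collaborator."""
    if not feature_enabled or user_id is None:
        return False  # flag off, or anonymous / link-sharing access
    return user_id == project.owner_id or user_id in project.collaborator_ids


def pip_install_command(
    pip: str, requirements: str, index_url: Optional[str] = None
) -> List[str]:
    """Build the pip argv, pinning the package index when one is configured."""
    cmd = [pip, "install", "-r", requirements]
    if index_url is not None:
        cmd += ["--index-url", index_url]
    return cmd
```

Keeping the decision in one pure function makes the default-deny property easy to test: every path that does not positively identify an owner or collaborator returns `False`.
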
## Lifecycle

- **Eviction.** `touch` the venv on use; an LRU cleanup job prunes the least
  recently used venvs when the volume exceeds a size budget.
- **Failure UX.** pip errors flow into the log panel with pip's output, reusing
  the friendly-error pattern.

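One possible shape for the eviction job, sketched in Python. The helper names (`dir_size`, `mark_used`, `evict_lru`) and the use of the venv directory's mtime as the "last used" stamp are assumptions of this example; a real job would presumably also take the per-hash lock before deleting a venv.

```python
import os
import shutil
from pathlib import Path
from typing import List


def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under `path` (symlinks skipped)."""
    return sum(
        f.lstat().st_size
        for f in path.rglob("*")
        if f.is_file() and not f.is_symlink()
    )


def mark_used(venv_dir: Path) -> None:
    """Equivalent of `touch`: set the directory's atime/mtime to now."""
    os.utime(venv_dir, None)


def evict_lru(venvs_root: Path, budget_bytes: int) -> List[str]:
    """Delete least recently used venvs until the total fits the budget."""
    venvs = [d for d in venvs_root.iterdir() if d.is_dir()]
    sizes = {d: dir_size(d) for d in venvs}
    total = sum(sizes.values())
    evicted: List[str] = []
    for d in sorted(venvs, key=lambda v: v.stat().st_mtime):  # oldest first
        if total <= budget_bytes:
            break
        shutil.rmtree(d)
        total -= sizes[d]
        evicted.append(d.name)
    return evicted
```

Because venvs are keyed by hash, evicting one is safe in the sense that the next compile needing it simply rebuilds it.
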
## Rollout

- **Phase 1.** Detection + `flock` venv build + `QUARTO_PYTHON`, behind a
  settings flag (default **off**), gated to logged-in owner, dev volume.
- **Phase 2.** Egress NetworkPolicy + index pinning.
- **Phase 3.** Eviction job + nicer pip-error surfacing + a small
  project-settings UI affordance.

## Open decisions

- `requirements.txt` vs a frontmatter field vs both?
- Shared global venv volume vs per-user namespacing (sharing is cheaper;
  per-user is stricter isolation)?
- Allow native/compiled wheels (broader support) vs wheels-only/no-build
  (tighter security)?