Verso/docs/python-dependencies-design.md
claude 83b6b323c3
Add cv2/tqdm to base; implement per-project Python venvs (Design B, Phase 1)
Base image: add opencv-python-headless (cv2) and tqdm to the bundled
scientific stack, and python3-venv (needed to build per-project venvs).

Per-project dependencies: a project's requirements.txt is now installed into a
venv cached by its sha256 (python3 -m venv --system-site-packages, so the
bundled stack stays visible and only extra packages are installed); QuartoRunner
points Quarto at it via QUARTO_PYTHON. A per-hash flock serialises concurrent
builds; pip output is merged into output.log; on failure the render falls back
to the base interpreter. Venvs live under PYTHON_VENVS_DIR
(default /var/lib/overleaf/data/python-venvs).

Gating: PythonVenvGate.userCanInstallPython restricts installs to the project
owner + invited collaborators (ignorePublicAccess excludes anonymous/link
users), threaded to CLSI as allowPythonInstall on the editor compile,
presentation export, and publish paths. Behind OVERLEAF_ENABLE_PROJECT_PYTHON_VENV
(enabled in the deployment). Design doc updated; Phase 2 (egress policy) and
Phase 3 (venv eviction) remain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 13:14:47 +00:00


Design: per-project Python dependencies (cached virtualenv)

Status: Phase 1 implemented (gated behind OVERLEAF_ENABLE_PROJECT_PYTHON_VENV, which is on in the deployment). Network egress policy (Phase 2) and venv eviction (Phase 3) remain. This document captures the plan for letting Quarto {python} cells use libraries beyond the curated base set.

What ships in Phase 1

  • A project root requirements.txt is installed into a venv cached by its sha256, created with python3 -m venv --system-site-packages; QuartoRunner points Quarto at it via QUARTO_PYTHON. A per-hash flock serialises concurrent builds; pip output is merged into output.log; on failure the render falls back to the base interpreter (and the missing-package message surfaces). Venvs live under PYTHON_VENVS_DIR (default /var/lib/overleaf/data/python-venvs).
  • Gated by userCanInstallPython (PythonVenvGate.mjs) to the project owner + invited collaborators (any role) — never anonymous / link-sharing users — threaded to CLSI as allowPythonInstall on the editor compile, presentation export, and publish paths.

Known Phase-1 limitations

  • The first build of a heavy requirements.txt runs within the compile timeout, so a very large install can be killed before it finishes. The next compile retries it, because a venv is only marked complete on success.
  • No egress restriction yet (Phase 2) — installs reach PyPI directly.
  • No eviction yet (Phase 3) — venvs accumulate under PYTHON_VENVS_DIR.
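The "only marked complete on success" behaviour and the fallback to the base interpreter can be sketched as follows. This is a minimal illustration, not the shipped code: the marker file name `.complete` and the function names are assumptions, and the install command is passed in so the sketch does not depend on pip.

```python
import subprocess
import sys
from pathlib import Path

MARKER = ".complete"  # hypothetical marker file name


def install_into(venv_dir: Path, command: list, timeout_s: float) -> bool:
    """Run the install command; write the marker only if it succeeds."""
    venv_dir.mkdir(parents=True, exist_ok=True)
    try:
        # subprocess.run kills the child when the timeout expires.
        subprocess.run(command, check=True, timeout=timeout_s)
    except (subprocess.SubprocessError, OSError):
        return False  # no marker written: the next compile retries the install
    (venv_dir / MARKER).touch()
    return True


def pick_interpreter(venv_dir: Path, base_python: str) -> str:
    """Use the venv only when it is complete; otherwise fall back to the base."""
    if (venv_dir / MARKER).exists():
        return str(venv_dir / "bin" / "python3")
    return base_python
```

Because the marker is written last, a build killed by the timeout leaves a directory that `pick_interpreter` ignores, which matches the retry-on-next-compile behaviour described above.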

Background

Quarto executes ```{python} cells through a Jupyter kernel. The base image (server-ce/Dockerfile-base) bundles a curated scientific stack (numpy, pandas, scipy, matplotlib, seaborn, scikit-learn, sympy, plotly, tabulate). Anything outside that set currently fails the render with ModuleNotFoundError.

As a first step that already shipped, the Quarto log parser (quarto-log-parser.ts) turns a missing-package traceback into an actionable message. This document is the next step: letting a project declare and install its own dependencies.
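For illustration, the missing-package detection could look roughly like the sketch below. The real parser lives in quarto-log-parser.ts and is not shown in this document; the function name, the message wording, and the exact set of import names are assumptions, and the sketch is written in Python rather than TypeScript.

```python
import re

# Top-level import names of the bundled stack listed above
# (scikit-learn is imported as "sklearn").
BUNDLED = {"numpy", "pandas", "scipy", "matplotlib", "seaborn",
           "sklearn", "sympy", "plotly", "tabulate"}

# CPython's standard message for a failed import.
MISSING_RE = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def explain_missing_package(log_text):
    """Return a friendly hint for the first missing module, or None."""
    match = MISSING_RE.search(log_text)
    if match is None:
        return None
    # "foo.bar" -> "foo": only the top-level package can be installed.
    top_level = match.group(1).split(".")[0]
    if top_level in BUNDLED:
        return f"'{top_level}' is part of the base image but failed to import."
    return (f"Python package '{top_level}' is not installed. "
            f"Add it to requirements.txt at the project root.")
```

Note that an import name is not always the PyPI distribution name (cv2 is provided by opencv-python-headless), so a hint like this can only suggest a starting point.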

Key constraint: the instance runs with anonymous read+write enabled (OVERLEAF_ALLOW_ANONYMOUS_READ_AND_WRITE_SHARING=true), so compiles can be triggered by untrusted users. Installing arbitrary packages is therefore a security decision, not just a convenience.

Mechanism

  1. Declaration. A standard requirements.txt at the project root opts the project in (familiar, Quarto-agnostic, supports version pinning).
  2. Keying. CLSI computes sha256(requirements.txt + python version). The hash names a venv directory on a persistent volume, e.g. …/data/python-venvs/<hash>/. Identical dependency sets share one venv across projects and compiles.
  3. Build-if-missing. python3 -m venv --system-site-packages <dir> (so the bundled stack stays visible and only the extra deps are installed — smaller and faster), then <dir>/bin/pip install -r requirements.txt. Guard with a per-hash flock so concurrent compiles don't build the same venv twice.
  4. Point Quarto at it. Set QUARTO_PYTHON=<dir>/bin/python3 in the render environment (threaded web → CLSI exactly like exportMode). With --system-site-packages, ipykernel from the base is importable, so the kernel runs in that interpreter with base + project packages.
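The four steps above can be sketched as below. This is a hedged illustration in Python, not the QuartoRunner implementation: the function names, the `<hash>.lock` file name, and the NUL separator between the two hash inputs are assumptions. The `builder` parameter exists only so the sketch can be exercised without running pip.

```python
import fcntl
import hashlib
import os
import subprocess
import sys
from pathlib import Path

# Default location from this document; overridable via the environment.
VENVS_DIR = Path(os.environ.get(
    "PYTHON_VENVS_DIR", "/var/lib/overleaf/data/python-venvs"))


def venv_key(requirements: bytes, python_version: str) -> str:
    """sha256 over the requirements bytes plus the interpreter version."""
    digest = hashlib.sha256()
    digest.update(requirements)
    digest.update(b"\0")  # assumed separator between the two inputs
    digest.update(python_version.encode())
    return digest.hexdigest()


def build_venv(venv_dir: Path, requirements_path: Path) -> None:
    """Create the venv on top of the base stack and install the extras."""
    subprocess.run(
        [sys.executable, "-m", "venv", "--system-site-packages", str(venv_dir)],
        check=True)
    subprocess.run(
        [str(venv_dir / "bin" / "pip"), "install", "-r", str(requirements_path)],
        check=True)


def ensure_venv(requirements_path: Path, venvs_dir: Path = VENVS_DIR,
                builder=build_venv) -> Path:
    """Return the cached venv for this requirements.txt, building it once."""
    version = "%d.%d.%d" % sys.version_info[:3]
    key = venv_key(requirements_path.read_bytes(), version)
    venv_dir = venvs_dir / key
    venvs_dir.mkdir(parents=True, exist_ok=True)
    # Per-hash lock file (hypothetical name) next to the venv directory.
    with open(venvs_dir / f"{key}.lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks while another compile builds
        if not venv_dir.exists():
            builder(venv_dir, requirements_path)
    # Closing the lock file releases the flock.
    return venv_dir


def render_env(venv_dir: Path) -> dict:
    """Environment for the Quarto process, pointing it at the venv."""
    env = dict(os.environ)
    env["QUARTO_PYTHON"] = str(venv_dir / "bin" / "python3")
    return env
```

For brevity the sketch treats an existing directory as a finished venv. A build that fails midway would leave a partial directory behind, so a real implementation needs a success marker or cleanup on failure.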

Guard rails

  • Auth gating. Only run the install path for logged-in owner/collaborator compiles. Anonymous-link compiles use the plain base interpreter and never trigger installs. Web decides and passes a boolean to CLSI; default-deny.
  • Network egress. The compile environment must reach PyPI to install. Restrict egress to PyPI / an internal mirror only (k8s NetworkPolicy + pip --index-url), not arbitrary hosts.
  • Resource caps. Install timeout, venv size cap, max package count; surface overruns as a clear log error.
  • Trust boundary. Even gated, a trusted user installing packages is arbitrary code execution in the sandbox. Containment stays the CLSI container + resource limits + egress policy. This is owner-trust-level by design.
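The default-deny gate and the index pinning can be sketched as below. The real gate is userCanInstallPython in PythonVenvGate.mjs, which is not shown here; the `Requester` type and both function names are hypothetical. `--index-url` and `--only-binary=:all:` are standard pip options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requester:
    logged_in: bool
    is_owner: bool = False
    is_invited_collaborator: bool = False


def allow_python_install(feature_enabled: bool, requester: Requester) -> bool:
    """Default-deny: True only for a logged-in owner or invited collaborator."""
    if not feature_enabled or not requester.logged_in:
        return False
    return requester.is_owner or requester.is_invited_collaborator


def pip_install_command(pip_path: str, requirements_path: str,
                        index_url: str, wheels_only: bool = False) -> list:
    """Build a pip invocation pinned to a single package index."""
    command = [pip_path, "install", "--index-url", index_url,
               "-r", requirements_path]
    if wheels_only:
        # Refuse source distributions so no build scripts run during install.
        command.append("--only-binary=:all:")
    return command
```

A requirements file can itself carry index options such as `--extra-index-url`, so pinning the index on the command line does not replace the network policy; it only sets the default.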

Lifecycle

  • Eviction. touch the venv on use; an LRU cleanup job prunes the oldest venvs when the volume exceeds a size budget.
  • Failure UX. pip errors flow into the log panel, reusing the friendly-error pattern and showing pip's output.
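The eviction idea (touch on use, prune least recently used venvs over a size budget) might look like the sketch below. Eviction is not implemented yet, so everything here is an assumption: the function names, the use of directory mtime as the "last used" stamp, and returning a plan instead of deleting.

```python
import os
from pathlib import Path


def touch_on_use(venv_dir: Path) -> None:
    """Record a use by bumping the directory's mtime to now."""
    os.utime(venv_dir, None)


def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under path (symlinks skipped)."""
    return sum(p.stat().st_size for p in path.rglob("*")
               if p.is_file() and not p.is_symlink())


def plan_eviction(venvs_dir: Path, budget_bytes: int) -> list:
    """Return the least recently used venvs to delete to fit the budget."""
    venvs = [d for d in venvs_dir.iterdir() if d.is_dir()]
    sizes = {d: dir_size(d) for d in venvs}
    total = sum(sizes.values())
    victims = []
    for d in sorted(venvs, key=lambda d: d.stat().st_mtime):  # oldest first
        if total <= budget_bytes:
            break
        victims.append(d)
        total -= sizes[d]
    return victims
```

A cleanup job would then remove each returned directory, ideally while holding that venv's per-hash lock so a venv is not deleted in the middle of a render.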

Rollout

  • Phase 1. Detection + flock venv build + QUARTO_PYTHON, behind a settings flag (default off), gated to logged-in owner, dev volume.
  • Phase 2. Egress NetworkPolicy + index pinning + eviction job.
  • Phase 3. Nicer pip-error surfacing + a small project-settings UI affordance.

Open decisions

  • requirements.txt vs a frontmatter field vs both?
  • Shared global venv volume vs per-user namespacing (sharing is cheaper; per-user is stricter isolation)?
  • Allow native/compiled wheels (broader support) vs wheels-only/no-build (tighter security)?