# gpu-cpu-mutex

Two tiny shell tools that let **multiple independent processes share one GPU and a bounded CPU/RAM budget** without colliding - using nothing but `flock`. No daemon, no scheduler, no root, no extra software to install. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the *same* GPU or kick off heavy CPU batches, this stops them from stepping on each other.

I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and OOM/corrupt. With `gpu_run.sh`, the second job simply *waits* for the first to finish. The CPU guard came after an unthrottled image batch drove the load average to 46 on 12 cores, pushed swap to 98.7%, and froze the box - now heavy jobs queue through N slots with pinned thread counts.

## Why not just use $EXISTING_THING?

I looked. None of them fit "two unrelated processes, one consumer GPU, share politely":

| Option | Why it didn't fit |
|---|---|
| **NVIDIA MPS / MIG** | Designed to *co-locate* work on a GPU (spatial sharing / partitioning). I want the opposite - strict **serialization** so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. |
| **Slurm `--gres=gpu`** | A real scheduler, and it works - but standing up `slurmctld`+`slurmd` to serialize two tmux loops on a homelab box is wildly heavy. |
| **`CUDA_VISIBLE_DEVICES` gating** | Picks *which* GPU, doesn't stop two jobs from grabbing the *same* one. |
| **`task-spooler` (`ts`), `parallel --sem`** | Closer - but they model a single queue you submit *into*. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). |

So this is ~120 lines of bash. `flock` already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it.

## `gpu_run.sh` - the GPU mutex

Serializes every GPU job through one exclusive `flock` on a shared lock file. The second caller blocks until the first releases.

```bash
gpu_run.sh python gen_image.py "a knight" out.png
gpu_run.sh bash gen_music.sh "lo-fi" track.wav
```

Two details that matter in practice:

- **Reentrancy.** A wrapped command often calls *other* wrapped commands (`gpu_run.sh bash gen.sh`, where `gen.sh` internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So when `gpu_run.sh` acquires the lock, it exports `GPU_LOCK_HELD=1`; any nested call sees that and skips re-locking. The lock is held for the *whole* tree and released once at the top.
- **An escape hatch.** Not everything that runs *on* the GPU box touches the GPU (a build step, a publish/upload). Those shouldn't queue behind a running model. Set `PSRUN_SKIP_GPU_LOCK=1` (or wire your own check) so non-GPU work runs immediately instead of waiting.
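
A minimal sketch of how the reentrancy flag and the escape hatch could fit together, assuming a wrapper shaped roughly like this. It is not the shipped `gpu_run.sh`: the function name `gpu_run_sketch` and the `GPU_LOCK_FILE` variable are hypothetical, while `GPU_LOCK_HELD`, `PSRUN_SKIP_GPU_LOCK`, and the `/tmp/gpu.lock` path come from this README.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a reentrant flock mutex; not the actual gpu_run.sh.
GPU_LOCK_FILE="${GPU_LOCK_FILE:-/tmp/gpu.lock}"   # assumed variable name

gpu_run_sketch() {
    # Nested call (an ancestor already holds the lock) or explicit opt-out:
    # run the command right away without touching the lock.
    if [ "${GPU_LOCK_HELD:-0}" = "1" ] || [ "${PSRUN_SKIP_GPU_LOCK:-0}" = "1" ]; then
        "$@"
        return
    fi
    (
        # Open the lock file on fd 9 and block until we hold it exclusively.
        exec 9>"$GPU_LOCK_FILE"
        flock 9
        # Everything started from here inherits the flag and skips re-locking.
        export GPU_LOCK_HELD=1
        "$@"
    )   # the subshell exits here, fd 9 closes, and the lock is released
}
```

With this shape, `gpu_run_sketch gpu_run_sketch echo ok` prints `ok` instead of hanging: the inner call sees `GPU_LOCK_HELD=1` and never tries to take the lock a second time. The flag is exported inside the subshell only, so it does not leak into the calling shell.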

## `cpu_run.sh` - the CPU/RAM guard

A **counting semaphore** (N flock slots) plus **thread caps**, so heavy local jobs queue `N`-at-a-time instead of all-at-once, and no single job can grab every core.

```bash
CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/
```

- **N slots:** at most `CPU_RUN_SLOTS` (default 3) run concurrently across every caller. The rest spin in a non-blocking retry loop until a slot frees (or `CPU_RUN_MAXWAIT` elapses, after which the job runs anyway - still thread-capped and niced, so it is never stalled indefinitely).
- **Thread caps:** pins `OMP_NUM_THREADS` / `OPENBLAS_NUM_THREADS` / `MKL_NUM_THREADS` / `NUMEXPR_NUM_THREADS` / `VECLIB_MAXIMUM_THREADS` to `CPU_RUN_THREADS`, so one numpy/torch job can't silently fan out to all cores (this only binds libraries that honor these variables). `3 slots × 3 threads ≈ 9 cores`, leaving headroom for the rest of the system.
- **Politeness:** runs under `nice -n 15` + `ionice -c3` so interactive work stays responsive even when saturated.
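
The three bullets above can be sketched as one function. This is an illustration under stated assumptions, not the shipped `cpu_run.sh`: `cpu_run_sketch`, the `CPU_RUN_LOCK_DIR` variable, the slot file names, the one-second poll, and the `CPU_RUN_THREADS` / `CPU_RUN_MAXWAIT` defaults are all hypothetical. Only the variable names `CPU_RUN_SLOTS`, `CPU_RUN_THREADS`, `CPU_RUN_MAXWAIT` and the default of 3 slots come from this README.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an N-slot flock semaphore with thread caps.
# Not the actual cpu_run.sh; values marked "assumed" are illustrative.

cpu_run_sketch() {
    local slots="${CPU_RUN_SLOTS:-3}"          # default stated in the README
    local threads="${CPU_RUN_THREADS:-3}"      # assumed default
    local maxwait="${CPU_RUN_MAXWAIT:-600}"    # assumed default, in seconds
    local lockdir="${CPU_RUN_LOCK_DIR:-/tmp}"  # assumed variable name
    (
        waited=0
        got=0
        while [ "$got" -eq 0 ]; do
            # Try each slot file once, without blocking.
            for i in $(seq 1 "$slots"); do
                exec 9>"$lockdir/cpu_slot_$i.lock"
                if flock -n 9; then
                    got=1
                    break
                fi
            done
            [ "$got" -eq 1 ] && break
            # All slots busy: after maxwait seconds, stop waiting and run anyway.
            [ "$waited" -ge "$maxwait" ] && break
            sleep 1
            waited=$((waited + 1))
        done

        # Cap the thread pools of common numeric libraries.
        export OMP_NUM_THREADS="$threads" OPENBLAS_NUM_THREADS="$threads" \
               MKL_NUM_THREADS="$threads" NUMEXPR_NUM_THREADS="$threads" \
               VECLIB_MAXIMUM_THREADS="$threads"

        # Always lower CPU priority; lower I/O priority only if ionice works here.
        if command -v ionice >/dev/null 2>&1 && ionice -c3 true 2>/dev/null; then
            ionice -c3 nice -n 15 "$@"
        else
            nice -n 15 "$@"
        fi
    )   # fd 9 closes with the subshell, freeing the slot
}
```

Because each slot is just a lock on its own file, a job that crashes frees its slot automatically; there is no counter to repair. The command must be an external program (not a shell function), since it is launched through `nice`.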

## How it works (the whole trick)

`flock(2)` gives you a kernel advisory lock tied to an open file (strictly, to the open file description behind the fd, which child processes inherit). The kernel releases it automatically once every fd referring to it is closed - including when the holding process exits, crashes, or is `kill -9`'d. That last part is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock *is* the live process holding an fd.

- **Mutex** = one exclusive lock on one path (`flock 9`, with fd 9 opened on `/tmp/gpu.lock`).
- **Counting semaphore** = N lock files; grab any one with `flock -n` (non-blocking); if all N are held, wait and retry.
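
To see the auto-release behavior for yourself, here is a small self-contained demo. It is a sketch that uses a throwaway lock file from `mktemp` rather than `/tmp/gpu.lock`, and a `sleep` standing in for a long job; the variable names are illustrative only.

```bash
#!/usr/bin/env bash
# Demo: a flock lock vanishes when its holder is killed with SIGKILL.
lock="$(mktemp)"   # throwaway lock path for the demo

# Holder: take the exclusive lock on fd 9, then become a long-running process.
( exec 9>"$lock"; flock 9; exec sleep 60 ) &
holder=$!
sleep 0.5          # give the holder time to acquire the lock

# While the holder is alive, a non-blocking attempt fails.
if flock -n "$lock" true; then before=acquired; else before=busy; fi

# Kill the holder the hard way; the kernel closes its fds and drops the lock.
kill -9 "$holder"
wait "$holder" 2>/dev/null || true

# No cleanup happened, yet the lock is free again.
if flock -n "$lock" true; then after=acquired; else after=busy; fi
echo "before kill: $before, after kill: $after"
rm -f "$lock"
```

Run as-is, this should print `before kill: busy, after kill: acquired`. Nothing removed or reset the lock file between the two attempts; the lock simply ceased to exist with the process that held it.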

## Install

```bash
git clone https://github.com/zionboggan/gpu-cpu-mutex
chmod +x gpu-cpu-mutex/*.sh
# put them on PATH, or call by path
```

No dependencies beyond `bash`, `flock` (util-linux), and optionally `ionice`. Lock paths default to `/tmp` and are configurable via env (see the top of each script).

## License

MIT - see [LICENSE](LICENSE).