# gpu-cpu-mutex
 
Two tiny shell tools that let **multiple independent processes share one GPU and a bounded CPU/RAM budget** without colliding - using nothing but `flock`. No daemon, no scheduler, no root, no software to install. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the *same* GPU or kick off heavy CPU batches, this stops them from stepping on each other.
 
I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and OOM/corrupt. With `gpu_run.sh`, the second job simply *waits* for the first to finish. The CPU guard came after an unthrottled image batch drove the load average to 46 on 12 cores, pushed swap usage to 98.7%, and froze the box. Now heavy jobs queue through N slots with pinned thread counts.
 
## Why not just use $EXISTING_THING?
 
I looked. None of them fit "two unrelated processes, one consumer GPU, share politely":
 
| Option | Why it didn't fit |
|---|---|
| **NVIDIA MPS / MIG** | Designed to *co-locate* work on a GPU (spatial sharing / partitioning). I want the opposite - strict **serialization** so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. |
| **Slurm `--gres=gpu`** | A real scheduler, and it works - but standing up `slurmctld`+`slurmd` to serialize two tmux loops on a homelab box is wildly heavy. |
| **`CUDA_VISIBLE_DEVICES` gating** | Picks *which* GPU, doesn't stop two jobs from grabbing the *same* one. |
| **`task-spooler` (`ts`), `parallel --sem`** | Closer - but they model a single queue you submit *into*. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). |
 
So this is ~120 lines of bash. `flock` already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it.
 
## `gpu_run.sh` - the GPU mutex
 
Serializes every GPU job through one exclusive `flock` on a shared lock file. The second caller blocks until the first releases.
 
```bash
gpu_run.sh python gen_image.py "a knight" out.png
gpu_run.sh bash gen_music.sh "lo-fi" track.wav
```
 
Two details that matter in practice:
 
- **Reentrancy.** A wrapped command often calls *other* wrapped commands (`gpu_run.sh bash gen.sh`, where `gen.sh` internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So when the lock is acquired it exports `GPU_LOCK_HELD=1`; any nested call sees that and skips re-locking. The lock is held for the *whole* tree and released once at the top.
- **An escape hatch.** Not everything that runs *on* the GPU box touches the GPU (a build step, a publish/upload). Those shouldn't queue behind a running model. Set `PSRUN_SKIP_GPU_LOCK=1` (or wire your own check) so non-GPU work runs immediately instead of waiting.
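
The two behaviors above fit in a few lines. What follows is a minimal sketch, not the actual `gpu_run.sh`: the function name `gpu_run` and the `GPU_LOCK_FILE` variable are assumptions made for illustration, while `GPU_LOCK_HELD` and `PSRUN_SKIP_GPU_LOCK` are the names described above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a reentrant GPU mutex; NOT the actual gpu_run.sh.
# GPU_LOCK_FILE is an assumed name for the configurable lock path.
GPU_LOCK_FILE="${GPU_LOCK_FILE:-/tmp/gpu.lock}"

gpu_run() {
    # Escape hatch, or a nested call whose ancestor already holds the lock:
    # run the command right away without touching the lock file.
    if [ "${PSRUN_SKIP_GPU_LOCK:-0}" = "1" ] || [ "${GPU_LOCK_HELD:-0}" = "1" ]; then
        "$@"
        return
    fi
    (
        # Open the lock file on fd 9 and block until we hold it exclusively.
        exec 9>"$GPU_LOCK_FILE"
        flock 9
        # Tell nested gpu_run calls not to lock again (avoids self-deadlock).
        export GPU_LOCK_HELD=1
        "$@"
    )
    # When the subshell (and anything that inherited fd 9) has exited,
    # the kernel drops the lock. The function returns the command's status.
}
```

Because the wrapped command inherits fd 9, the lock stays held until every process that inherited it has exited. That is what lets a single top-level acquisition cover the whole tree, and it also means a child that daemonizes without closing fd 9 would keep the GPU locked.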
 
## `cpu_run.sh` - the CPU/RAM guard
 
A **counting semaphore** (N flock slots) plus **thread caps**, so heavy local jobs queue `N`-at-a-time instead of all-at-once, and no single job can grab every core.
 
```bash
CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/
```
 
- **N slots:** at most `CPU_RUN_SLOTS` (default 3) run concurrently across every caller. The rest retry with non-blocking lock attempts until a slot frees or `CPU_RUN_MAXWAIT` elapses; after that the job runs anyway, still thread-capped and niced, so nothing stalls indefinitely.
- **Thread caps:** sets `OMP_NUM_THREADS` / `OPENBLAS_NUM_THREADS` / `MKL_NUM_THREADS` / `NUMEXPR_NUM_THREADS` / `VECLIB_MAXIMUM_THREADS` to `CPU_RUN_THREADS`, so one numpy/torch job can't silently fan out to all cores. `3 slots × 3 threads ≈ 9 cores` in the worst case, leaving headroom for the rest of the system on a 12-core box.
- **Politeness:** runs under `nice -n 15` + `ionice -c3` so interactive work stays responsive even when saturated.
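
The three pieces above (slots, thread caps, politeness) can be sketched as follows. This is a rough illustration rather than the real `cpu_run.sh`: the `cpu_run` function, the `CPU_RUN_LOCK_DIR` variable, the slot file names, the 1-second poll interval, and the defaults for `CPU_RUN_THREADS` (3) and `CPU_RUN_MAXWAIT` (600 seconds) are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the slot + thread-cap idea; NOT the actual cpu_run.sh.
# Assumed for illustration: CPU_RUN_LOCK_DIR, the slot file names, the
# 1-second poll interval, and the CPU_RUN_THREADS / CPU_RUN_MAXWAIT defaults.
CPU_RUN_LOCK_DIR="${CPU_RUN_LOCK_DIR:-/tmp}"

# Try each slot file once; on success fd 9 stays open and holds the lock.
_cpu_run_grab_slot() {
    i=1
    while [ "$i" -le "${CPU_RUN_SLOTS:-3}" ]; do
        exec 9>"$CPU_RUN_LOCK_DIR/cpu_run.slot$i.lock"
        if flock -n 9; then return 0; fi
        i=$((i + 1))
    done
    return 1
}

cpu_run() {
    (
        waited=0
        until _cpu_run_grab_slot; do
            # Stop waiting after CPU_RUN_MAXWAIT seconds and run anyway.
            if [ "$waited" -ge "${CPU_RUN_MAXWAIT:-600}" ]; then break; fi
            sleep 1
            waited=$((waited + 1))
        done

        # Cap the thread pools of common numeric libraries.
        threads="${CPU_RUN_THREADS:-3}"
        export OMP_NUM_THREADS="$threads" OPENBLAS_NUM_THREADS="$threads" \
               MKL_NUM_THREADS="$threads" NUMEXPR_NUM_THREADS="$threads" \
               VECLIB_MAXIMUM_THREADS="$threads"

        # Add idle I/O priority only where ionice exists and is permitted.
        if command -v ionice >/dev/null 2>&1 && ionice -c3 true 2>/dev/null; then
            set -- ionice -c3 "$@"
        fi
        nice -n 15 "$@"
    )
}
```

The thread caps only bind libraries that read these environment variables; a program that sets its own thread count explicitly is unaffected.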
 
## How it works (the whole trick)
 
`flock(2)` gives you a kernel advisory lock tied to an open file description (the kernel object behind an open fd). The kernel releases it automatically once every fd referring to that description is closed - including when the holding process exits, crashes, or is `kill -9`'d. That last part is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock *is* the live process (or process tree) holding an fd.
 
- **Mutex** = one exclusive lock on one path (`flock 9`, with fd 9 opened on `/tmp/gpu.lock`).
- **Counting semaphore** = N lock files; grab any one with `flock -n` (non-blocking); if all N are held, wait and retry.
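
The auto-release behavior is easy to check by hand. The demo below is a sketch under a few assumptions: it uses a throwaway lock file from `mktemp`, a hypothetical `probe` helper, and a 1-second pause to give the holder time to take the lock. It is not part of the repo's scripts.

```bash
#!/usr/bin/env bash
# Demo: a flock disappears when its holder is SIGKILLed.
# Assumptions: throwaway lock file, 1-second pause for the holder to start.
lock="$(mktemp)"

# Report whether a non-blocking attempt to take the lock succeeds.
probe() {
    if flock -n "$lock" true; then echo free; else echo busy; fi
}

# Holder: open the lock file on fd 9, lock it, then become `sleep`.
# `exec` keeps the same PID, so $! is the process that owns fd 9.
( exec 9>"$lock"; flock 9; exec sleep 300 ) &
holder=$!
sleep 1

before="$(probe)"                    # holder is alive and owns the lock

kill -9 "$holder"                    # no trap or cleanup code can run
wait "$holder" 2>/dev/null || true   # reap it so we know it is gone

after="$(probe)"                     # the kernel dropped the lock with the fd
echo "before=$before after=$after"
rm -f "$lock"
```

On a Linux box with util-linux `flock` this should print `before=busy after=free`. Note that the holder uses `exec` so the killed PID is the one owning fd 9; killing only a parent while a child still has the fd open would leave the lock held.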
 
## Install
 
```bash
git clone https://github.com/zionboggan/gpu-cpu-mutex
chmod +x gpu-cpu-mutex/*.sh
# put them on PATH, or call by path
```
 
No dependencies beyond `bash`, `flock` (util-linux), and optionally `ionice`. Lock paths default to `/tmp` and are configurable via env (see the top of each script).
 
## License
 
MIT - see [LICENSE](LICENSE).