# gpu-cpu-mutex
 
Two tiny shell tools that let **multiple independent processes share one GPU and a bounded CPU/RAM budget** without colliding - using nothing but `flock`. No daemon, no scheduler, no root, no software to install. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the *same* GPU or kick off heavy CPU batches, this stops them from stepping on each other.
 
I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and OOM/corrupt. With `gpu_run.sh`, the second job simply *waits* for the first to finish. The CPU guard came after an unthrottled image batch drove the load average to 46 on 12 cores, filled swap to 98.7%, and froze the box - now heavy jobs queue through N slots with pinned thread counts.
 
## Why not just use $EXISTING_THING?
 
I looked. None of them fit "two unrelated processes, one consumer GPU, share politely":
 
| Option | Why it didn't fit |
|---|---|
| **NVIDIA MPS / MIG** | Designed to *co-locate* work on a GPU (spatial sharing / partitioning). I want the opposite - strict **serialization** so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. |
| **Slurm `--gres=gpu`** | A real scheduler, and it works - but standing up `slurmctld`+`slurmd` to serialize two tmux loops on a homelab box is wildly heavy. |
| **`CUDA_VISIBLE_DEVICES` gating** | Picks *which* GPU a job sees; it doesn't stop two jobs from grabbing the *same* one. |
| **`task-spooler` (`ts`), `parallel --sem`** | Closer - but they model a single queue you submit *into*. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). |
 
So this is ~120 lines of bash. `flock` already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it.
 
## `gpu_run.sh` - the GPU mutex
 
Serializes every GPU job through one exclusive `flock` on a shared lock file. The second caller blocks until the first releases.
 
```bash
gpu_run.sh python gen_image.py "a knight" out.png
gpu_run.sh bash gen_music.sh "lo-fi" track.wav
```
 
Two details that matter in practice:
 
- **Reentrancy.** A wrapped command often calls *other* wrapped commands (`gpu_run.sh bash gen.sh`, where `gen.sh` internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So once the wrapper acquires the lock it exports `GPU_LOCK_HELD=1`; any nested call sees that in its environment and skips re-locking. The lock is held for the *whole* tree and released once at the top.
- **An escape hatch.** Not everything that runs *on* the GPU box touches the GPU (a build step, a publish/upload). Those shouldn't queue behind a running model. Set `PSRUN_SKIP_GPU_LOCK=1` (or wire your own check) so non-GPU work runs immediately instead of waiting.
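
The two behaviours above can be sketched in a few lines. This is a minimal illustration, not the shipped `gpu_run.sh`: the function name `gpu_run_sketch` and the `GPU_LOCK_FILE` variable are assumptions made for the example, while `GPU_LOCK_HELD` and `PSRUN_SKIP_GPU_LOCK` are the variables described above.

```bash
#!/usr/bin/env bash
# Sketch of a reentrant flock mutex wrapper.
# Hypothetical names: gpu_run_sketch, GPU_LOCK_FILE. Not the real gpu_run.sh.

gpu_run_sketch() {
  local lock_file="${GPU_LOCK_FILE:-/tmp/gpu.lock}"

  # Escape hatch, or a nested call under an ancestor that already holds
  # the lock: run the command immediately without touching the lock.
  if [ "${PSRUN_SKIP_GPU_LOCK:-0}" = "1" ] || [ "${GPU_LOCK_HELD:-0}" = "1" ]; then
    "$@"
    return
  fi

  # Run in a subshell so fd 9 (and with it the lock) goes away when the
  # wrapped command and its children have finished.
  (
    flock 9 || exit 1        # block until the exclusive lock is ours
    export GPU_LOCK_HELD=1   # nested wrappers see this and skip re-locking
    "$@"
  ) 9>>"$lock_file"
}
```

With this shape, `gpu_run_sketch gpu_run_sketch echo nested` returns right away instead of deadlocking, because the inner call inherits `GPU_LOCK_HELD=1` from the outer one.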
 
## `cpu_run.sh` - the CPU/RAM guard
 
A **counting semaphore** (N flock slots) plus **thread caps**, so heavy local jobs queue `N`-at-a-time instead of all-at-once, and no single job can grab every core.
 
```bash
CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/
```
 
- **N slots:** at most `CPU_RUN_SLOTS` (default 3) jobs run concurrently across every caller. The rest poll in a non-blocking retry loop until a slot frees. If `CPU_RUN_MAXWAIT` elapses first, the waiting job runs anyway - still thread-capped and niced, so it is never stalled indefinitely.
- **Thread caps:** pins `OMP_NUM_THREADS` / `OPENBLAS_NUM_THREADS` / `MKL_NUM_THREADS` / `NUMEXPR_NUM_THREADS` / `VECLIB_MAXIMUM_THREADS` to `CPU_RUN_THREADS`, so one numpy/torch job can't silently fan out to all cores. `3 slots × 3 threads ≈ 9 cores`, leaving headroom for the rest of the system.
- **Politeness:** runs under `nice -n 15` + `ionice -c3` so interactive work stays responsive even when saturated.
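
Roughly, the three pieces above combine like this. It is a hedged sketch rather than the actual `cpu_run.sh`: `cpu_run_sketch`, `CPU_RUN_LOCK_DIR`, the slot file names, the one-second poll interval, and the defaults for `CPU_RUN_THREADS` and `CPU_RUN_MAXWAIT` are all assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch of an N-slot flock semaphore with thread caps and nice/ionice.
# Hypothetical names: cpu_run_sketch, CPU_RUN_LOCK_DIR, cpu_slot_<i>.lock.
# Only the CPU_RUN_SLOTS default (3) comes from the README; the rest are guesses.

cpu_run_sketch() {
  local slots="${CPU_RUN_SLOTS:-3}"
  local threads="${CPU_RUN_THREADS:-3}"       # assumed default
  local maxwait="${CPU_RUN_MAXWAIT:-600}"     # seconds; assumed default
  local lock_dir="${CPU_RUN_LOCK_DIR:-/tmp}"  # assumed variable name
  local waited=0 i

  (
    # Cap the thread pools of common numeric libraries.
    export OMP_NUM_THREADS="$threads" OPENBLAS_NUM_THREADS="$threads" \
           MKL_NUM_THREADS="$threads" NUMEXPR_NUM_THREADS="$threads" \
           VECLIB_MAXIMUM_THREADS="$threads"

    got_slot=0
    while [ "$got_slot" -eq 0 ] && [ "$waited" -lt "$maxwait" ]; do
      i=1
      while [ "$i" -le "$slots" ]; do
        # Open slot i on fd 9 and try to lock it without blocking.
        exec 9>>"$lock_dir/cpu_slot_$i.lock"
        if flock -n 9; then got_slot=1; break; fi
        i=$((i + 1))
      done
      [ "$got_slot" -eq 1 ] || { sleep 1; waited=$((waited + 1)); }
    done
    # Here either a slot is locked on fd 9, or maxwait elapsed and the
    # job runs anyway (still thread-capped and niced).

    # Use ionice only if it is present and works in this environment.
    if ionice -c3 true >/dev/null 2>&1; then
      nice -n 15 ionice -c3 "$@"
    else
      nice -n 15 "$@"
    fi
  )
}
```

Because `nice` and `ionice` exec their argument, the wrapped command in this sketch has to be an external program (`python`, `sh -c '...'`), not a shell function.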
 
## How it works (the whole trick)
 
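`flock(2)` gives you a kernel advisory lock tied to an open file descriptor. The kernel releases it automatically once every descriptor referring to that open file has been closed - which includes the holder exiting, crashing, or being `kill -9`'d. (Child processes inherit the fd, so the lock stays held until they have exited too; that is what lets one lock cover a whole process tree.) That automatic release is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock *is* the live process holding an fd.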
 
- **Mutex** = one exclusive lock on one path (`flock 9`, with fd 9 opened on `/tmp/gpu.lock`).
- **Counting semaphore** = N lock files; grab any one with `flock -n` (non-blocking); if all N are held, wait and retry.
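
The auto-release property is easy to check by hand. The snippet below is an illustrative experiment, not part of the tools; the temp lock file and the variable names (`demo_lock`, `holder`, `while_held`, `after_kill`) are made up for the demo.

```bash
#!/usr/bin/env bash
# Demo: a flock lock is released by the kernel when its holder is SIGKILLed.
# The lock path and variable names are illustrative only.

demo_lock="$(mktemp)"

# Holder: lock fd 9, then exec into sleep so the sleeping process itself
# owns the only copy of the fd (no stray child keeps the lock alive).
( flock 9; exec sleep 60 ) 9>"$demo_lock" &
holder=$!

# Wait (up to ~5 s) until the holder really has the lock.
tries=0
while flock -n "$demo_lock" true && [ "$tries" -lt 50 ]; do
  sleep 0.1
  tries=$((tries + 1))
done

# While the holder is alive, a non-blocking attempt should fail.
if flock -n "$demo_lock" true; then while_held="free"; else while_held="busy"; fi

kill -9 "$holder"
wait "$holder" 2>/dev/null || true   # reap it; its fds are closed by now

# No cleanup was run by the holder, yet the lock should be available again.
if flock -n "$demo_lock" true; then after_kill="free"; else after_kill="busy"; fi

echo "while held: $while_held, after kill -9: $after_kill"
rm -f "$demo_lock"
```

On a Linux box with util-linux `flock`, the first probe should report `busy` and the second `free`. The `exec` matters: without it, the killed subshell's `sleep` child would still hold an inherited copy of fd 9 and the lock would stay taken until that child exited.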
 
## Install
 
```bash
git clone https://github.com/zionboggan/gpu-cpu-mutex
chmod +x gpu-cpu-mutex/*.sh
# put them on PATH, or call by path
```
 
No dependencies beyond `bash`, `flock` (util-linux), and optionally `ionice`. Lock paths default to `/tmp` and are configurable via env (see the top of each script).
 
## License
 
MIT - see [LICENSE](LICENSE).
