GPU CPU Mutex

Project page on zionboggan.com ↗

1 commits First commit Jun 5, 2026 Last commit Jun 5, 2026 (2 weeks ago)

gpu-cpu-mutex

Two tiny shell tools that let multiple independent processes share one GPU and a bounded CPU/RAM budget without colliding - using nothing but flock. No daemon, no scheduler, no root, no software to install. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the same GPU or kick off heavy CPU batches, this stops them from stepping on each other.

I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and hit out-of-memory errors or corruption. With gpu_run.sh, the second job simply waits for the first to finish. The CPU guard came after an unthrottled image batch drove the load average to 46 on 12 cores, pushed swap usage to 98.7%, and froze the box. Now heavy jobs queue through N slots with pinned thread counts.

Why not just use $EXISTING_THING?

I looked. None of them fit "two unrelated processes, one consumer GPU, share politely":

| Option | Why it didn't fit |
| --- | --- |
| NVIDIA MPS / MIG | Designed to co-locate work on a GPU (spatial sharing / partitioning). I want the opposite - strict serialization so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. |
| Slurm `--gres=gpu` | A real scheduler, and it works - but standing up `slurmctld`+`slurmd` to serialize two tmux loops on a homelab box is wildly heavy. |
| `CUDA_VISIBLE_DEVICES` gating | Picks which GPU, doesn't stop two jobs from grabbing the same one. |
| `task-spooler` (`ts`), `parallel --sem` | Closer - but they model a single queue you submit into. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). |

So this is ~120 lines of bash. flock already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it.

`gpu_run.sh` - the GPU mutex

Serializes every GPU job through one exclusive flock on a shared lock file. The second caller blocks until the first releases.

gpu_run.sh python gen_image.py "a knight" out.png
gpu_run.sh bash gen_music.sh "lo-fi" track.wav

Two details that matter in practice:

Reentrancy. A wrapped command often calls other wrapped commands (gpu_run.sh bash gen.sh, where gen.sh internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So when the lock is acquired it exports GPU_LOCK_HELD=1; any nested call sees that and skips re-locking. The lock is held for the whole tree and released once at the top.
An escape hatch. Not everything that runs on the GPU box touches the GPU (a build step, a publish/upload). Those shouldn't queue behind a running model. Set PSRUN_SKIP_GPU_LOCK=1 (or wire your own check) so non-GPU work runs immediately instead of waiting.
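
To make the two details above concrete, here is a minimal sketch of a reentrant flock wrapper. It is not the actual gpu_run.sh: the function name `gpu_run_sketch` and the `GPU_LOCK_FILE` variable are made up for illustration, and only `GPU_LOCK_HELD` and `PSRUN_SKIP_GPU_LOCK` come from the description above.

```bash
#!/usr/bin/env bash
# Sketch of a reentrant flock mutex wrapper. NOT the real gpu_run.sh.
# Assumptions: gpu_run_sketch and GPU_LOCK_FILE are hypothetical names;
# GPU_LOCK_HELD and PSRUN_SKIP_GPU_LOCK are the names the README mentions.

gpu_run_sketch() {
    local lock_file="${GPU_LOCK_FILE:-/tmp/gpu.lock}"

    # Nested call (an ancestor already holds the lock) or escape hatch:
    # run the command right away without touching the lock.
    if [ "${GPU_LOCK_HELD:-0}" = "1" ] || [ "${PSRUN_SKIP_GPU_LOCK:-0}" = "1" ]; then
        "$@"
        return
    fi

    # Subshell: fd 9, and with it the lock, goes away when the subshell exits.
    (
        exec 9>"$lock_file" || exit 1
        flock 9 || exit 1        # blocks until the current holder releases
        export GPU_LOCK_HELD=1   # descendants see this and skip re-locking
        "$@"
    )
}
```

Called as `gpu_run_sketch python gen_image.py "a knight" out.png`, the outermost call takes the lock and any wrapped call further down the tree falls through the first branch. One consequence of this design: a child that keeps running in the background inherits fd 9 and keeps the lock held until it exits too.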

`cpu_run.sh` - the CPU/RAM guard

A counting semaphore (N flock slots) plus thread caps, so heavy local jobs queue N-at-a-time instead of all-at-once, and no single job can grab every core.

CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/

N slots: at most CPU_RUN_SLOTS (default 3) run concurrently across every caller. The rest spin in a non-blocking retry loop until a slot frees (or CPU_RUN_MAXWAIT elapses, after which it runs anyway - capped + niced, never indefinitely stalled).
Thread caps: pins OMP_NUM_THREADS / OPENBLAS_NUM_THREADS / MKL_NUM_THREADS / NUMEXPR_NUM_THREADS / VECLIB_MAXIMUM_THREADS to CPU_RUN_THREADS, so one numpy/torch job can't silently fan out to all cores. 3 slots × 3 threads ≈ 9 cores, leaving headroom for the rest of the system.
Politeness: runs under nice -n 15 + ionice -c3 so interactive work stays responsive even when saturated.
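
The three pieces above fit together roughly like the sketch below. Treat it as an illustration under assumptions, not the shipped cpu_run.sh: `cpu_run_sketch`, `CPU_RUN_LOCK_DIR`, the `cpu_slot_N.lock` file names, the 300-second default for `CPU_RUN_MAXWAIT`, and the 1-second retry interval are all invented here.

```bash
#!/usr/bin/env bash
# Sketch of an N-slot flock semaphore with thread caps. NOT the real cpu_run.sh.
# Assumptions: CPU_RUN_LOCK_DIR, the cpu_slot_N.lock names, the 300 s default
# for CPU_RUN_MAXWAIT, and the 1 s retry interval are made up for this example.

cpu_run_sketch() {
    local slots="${CPU_RUN_SLOTS:-3}"
    local threads="${CPU_RUN_THREADS:-3}"
    local maxwait="${CPU_RUN_MAXWAIT:-300}"
    local lock_dir="${CPU_RUN_LOCK_DIR:-/tmp}"

    (
        # Cap the thread pools of the usual numeric libraries.
        export OMP_NUM_THREADS="$threads" OPENBLAS_NUM_THREADS="$threads" \
               MKL_NUM_THREADS="$threads" NUMEXPR_NUM_THREADS="$threads" \
               VECLIB_MAXIMUM_THREADS="$threads"

        waited=0
        got_slot=0
        while [ "$got_slot" -eq 0 ] && [ "$waited" -lt "$maxwait" ]; do
            for i in $(seq 1 "$slots"); do
                exec 9>"$lock_dir/cpu_slot_$i.lock" || exit 1
                if flock -n 9; then
                    got_slot=1      # fd 9 stays open and holds slot i
                    break
                fi
                exec 9>&-           # slot busy: close it and try the next one
            done
            if [ "$got_slot" -eq 0 ]; then
                sleep 1
                waited=$((waited + 1))
            fi
        done

        # Slot held, or the wait budget ran out: run either way, at low priority.
        if command -v ionice >/dev/null 2>&1 && ionice -c3 true 2>/dev/null; then
            nice -n 15 ionice -c3 "$@"
        else
            nice -n 15 "$@"
        fi
    )
}
```

The slot is tied to fd 9 of the subshell, so it frees itself when the wrapped command finishes or dies. The `ionice -c3 true` probe is my own addition so the sketch still runs on systems where ionice is missing or not permitted.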

How it works (the whole trick)

flock(2) gives you a kernel advisory lock tied to an open file descriptor. The kernel releases it automatically when the fd closes - including when the process exits, crashes, or is kill -9'd. That last part is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock is the live process holding an fd.

Mutex = one exclusive lock on one path (flock 9, with fd 9 opened on /tmp/gpu.lock).
Counting semaphore = N lock files; grab any one with flock -n (non-blocking); if all N are held, wait and retry.
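
The "no stale locks" claim is easy to check by hand. The sketch below is a throwaway demo, not part of the tools: `lock_released_after_kill` is a hypothetical helper, and the lock file is a temp file created just for the test. It takes a lock in a background process, kills that process with SIGKILL, and then checks that a non-blocking flock succeeds again.

```bash
#!/usr/bin/env bash
# Demo: an flock lock vanishes when its holder is SIGKILLed.
# lock_released_after_kill is a hypothetical helper; the lock path is a temp file.

lock_released_after_kill() {
    local lock_file holder tries
    lock_file="$(mktemp)"

    # Holder: open fd 9, lock it, then become `sleep` (exec keeps fd 9 open).
    ( exec 9>"$lock_file"; flock 9; exec sleep 300 ) >/dev/null 2>&1 &
    holder=$!

    # Poll until a non-blocking attempt fails, i.e. the holder owns the lock.
    tries=0
    while flock -n "$lock_file" true; do
        tries=$((tries + 1))
        if [ "$tries" -ge 50 ]; then
            kill -9 "$holder" 2>/dev/null || true
            return 2
        fi
        sleep 0.1
    done
    echo "locked by pid $holder"

    # Simulate a hard crash: no cleanup code gets to run in the holder.
    kill -9 "$holder"
    wait "$holder" 2>/dev/null || true

    # The kernel closed fd 9 along with the process, so the lock is free again.
    if flock -n "$lock_file" true; then
        echo "released"
    else
        echo "still locked"
        return 1
    fi
    rm -f "$lock_file"
}
```

Note that the holder uses `exec sleep` on purpose. Running `flock FILE sleep 300` and killing the flock process would not prove much, because flock(1) forks the command and the child inherits the locked fd, so the lock would outlive the killed parent.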

Install

git clone https://github.com/zionboggan/gpu-cpu-mutex
chmod +x gpu-cpu-mutex/*.sh
# put them on PATH, or call by path

No dependencies beyond bash, flock (util-linux), and optionally ionice. Lock paths default to /tmp and are configurable via env (see the top of each script).

License

MIT - see LICENSE.