Two tiny shell tools that let multiple independent processes share one GPU and a bounded CPU/RAM budget without colliding - using nothing but flock. No daemon, no scheduler, no root, no software to install. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the same GPU or kick off heavy CPU batches, this stops them from stepping on each other.
I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and OOM/corrupt. With gpu_run.sh, the second job simply waits for the first to finish. The CPU guard came after an unthrottled image batch drove load to 46 on 12 cores and 98.7% swap and froze the box - now heavy jobs queue through N slots with pinned thread counts.
I looked at the existing options first. None of them fit "two unrelated processes, one consumer GPU, share politely":
| Option | Why it didn't fit |
|---|---|
| NVIDIA MPS / MIG | Designed to co-locate work on a GPU (spatial sharing / partitioning). I want the opposite - strict serialization so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. |
| Slurm --gres=gpu | A real scheduler, and it works - but standing up slurmctld+slurmd to serialize two tmux loops on a homelab box is wildly heavy. |
| CUDA_VISIBLE_DEVICES gating | Picks which GPU, doesn't stop two jobs from grabbing the same one. |
| task-spooler (ts), parallel --sem | Closer - but they model a single queue you submit into. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). |
So this is ~120 lines of bash. flock already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it.
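The auto-release claim is easy to check by hand. The following is a small demonstration, not part of the tools: it takes a lock in a background process, kills that process with `kill -9`, and then tries the lock again. The temp-file lock path and the variable names are made up for the demo.

```bash
#!/usr/bin/env bash
# Demo: a flock-held lock goes away when its holder is killed.
# The lock file is a throwaway temp file, not a path the tools use.
lock="$(mktemp)"

# Holder: open the file on fd 9, take the exclusive lock, then become `sleep`.
# `exec sleep` keeps the same PID and keeps fd 9 (and so the lock) open.
(
  exec 9>"$lock"
  flock -x 9
  exec sleep 300
) &
holder=$!
sleep 1   # give the holder time to acquire the lock

# While the holder is alive, a non-blocking attempt should fail.
if flock -n "$lock" true; then before="acquired"; else before="busy"; fi

# Kill the holder hard: no trap, no cleanup code gets a chance to run.
kill -9 "$holder"
wait "$holder" 2>/dev/null || true

# The kernel closed fd 9 with the process, so the lock should be free again.
if flock -n "$lock" true; then after="acquired"; else after="busy"; fi

echo "while holder alive: $before"
echo "after kill -9:      $after"
rm -f "$lock"
```

On a Linux box with util-linux flock, the first attempt should report busy and the second acquired, with no cleanup step in between.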
gpu_run.sh - the GPU mutex

Serializes every GPU job through one exclusive flock on a shared lock file. The second caller blocks until the first releases.
gpu_run.sh python gen_image.py "a knight" out.png
gpu_run.sh bash gen_music.sh "lo-fi" track.wav
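For readers who want to see the shape of such a wrapper, here is a minimal sketch. It is not the actual gpu_run.sh source. `GPU_LOCK_HELD`, `PSRUN_SKIP_GPU_LOCK` and `/tmp/gpu.lock` are names this README mentions; the `gpu_run` function and the `GPU_LOCK_FILE` override are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Sketch of a reentrant GPU mutex built on flock(1); not the real gpu_run.sh.
# GPU_LOCK_FILE is a hypothetical override; /tmp/gpu.lock is the README's path.
GPU_LOCK_FILE="${GPU_LOCK_FILE:-/tmp/gpu.lock}"

gpu_run() {
  # If an ancestor already holds the lock, or the caller asked to skip it,
  # run the command directly instead of locking again.
  if [ "${GPU_LOCK_HELD:-0}" = "1" ] || [ "${PSRUN_SKIP_GPU_LOCK:-0}" = "1" ]; then
    "$@"
    return
  fi
  (
    exec 9>>"$GPU_LOCK_FILE"   # open (or create) the lock file on fd 9
    flock -x 9                 # block until this process owns the exclusive lock
    export GPU_LOCK_HELD=1     # nested calls in this process tree skip re-locking
    "$@"                       # fd 9 closes when the subshell exits, freeing the lock
  )
}
```

Called as `gpu_run python gen_image.py "a knight" out.png`, the function returns the command's exit status. Because the lock lives on fd 9 of the subshell, there is no explicit unlock step.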
Two details that matter in practice:
- Reentrant. A GPU job may itself call gpu_run.sh (e.g. gpu_run.sh bash gen.sh, where gen.sh internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So when the lock is acquired it exports GPU_LOCK_HELD=1; any nested call sees that and skips re-locking. The lock is held for the whole tree and released once at the top.
- Escape hatch. Set PSRUN_SKIP_GPU_LOCK=1 (or wire your own check) so non-GPU work runs immediately instead of waiting.

cpu_run.sh - the CPU/RAM guard

A counting semaphore (N flock slots) plus thread caps, so heavy local jobs queue N-at-a-time instead of all-at-once, and no single job can grab every core.
CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/
- At most CPU_RUN_SLOTS (default 3) jobs run concurrently across every caller. The rest spin in a non-blocking retry loop until a slot frees (or CPU_RUN_MAXWAIT elapses, after which the job runs anyway - capped + niced, never indefinitely stalled).
- Sets OMP_NUM_THREADS / OPENBLAS_NUM_THREADS / MKL_NUM_THREADS / NUMEXPR_NUM_THREADS / VECLIB_MAXIMUM_THREADS to CPU_RUN_THREADS, so one numpy/torch job can't silently fan out to all cores. 3 slots × 3 threads ≈ 9 cores, leaving headroom for the rest of the system.
- Runs the job under nice -n 15 + ionice -c3 so interactive work stays responsive even when saturated.

flock(2) gives you a kernel advisory lock tied to an open file descriptor. The kernel releases it automatically when the fd closes - including when the process exits, crashes, or is kill -9'd. That last part is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock is the live process holding an fd.
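The slot semaphore and thread caps of cpu_run.sh can be sketched along the same lines. This is an illustration under assumptions, not the script itself: `CPU_RUN_SLOTS`, `CPU_RUN_THREADS` and `CPU_RUN_MAXWAIT` are the README's names, while the `cpu_run` function, the `CPU_LOCK_DIR` variable, the slot file names, the one-second retry interval and the 600-second default wait are invented for the example. The `ionice -c3` step is left out to keep it short.

```bash
#!/usr/bin/env bash
# Sketch of an N-slot semaphore made of N flock'd files, plus thread caps.
# CPU_LOCK_DIR, the slot file names and the 600 s default are assumptions.
CPU_LOCK_DIR="${CPU_LOCK_DIR:-/tmp}"

cpu_run() {
  local slots="${CPU_RUN_SLOTS:-3}" threads="${CPU_RUN_THREADS:-3}"
  local maxwait="${CPU_RUN_MAXWAIT:-600}" waited=0 i
  (
    # Cap the thread pools of common numeric libraries for this job only.
    export OMP_NUM_THREADS="$threads" OPENBLAS_NUM_THREADS="$threads" \
           MKL_NUM_THREADS="$threads" NUMEXPR_NUM_THREADS="$threads" \
           VECLIB_MAXIMUM_THREADS="$threads"
    while [ "$waited" -lt "$maxwait" ]; do
      for ((i = 0; i < slots; i++)); do
        exec 9>>"$CPU_LOCK_DIR/cpu_slot_$i.lock"   # re-point fd 9 at slot i
        if flock -n 9; then                        # non-blocking try
          nice -n 15 "$@"                          # slot held until subshell exits
          exit                                     # exit with the job's status
        fi
      done
      exec 9>&-                                    # no slot free: drop the fd
      sleep 1
      waited=$((waited + 1))
    done
    # Waited long enough: run anyway, still thread-capped and niced.
    nice -n 15 "$@"
  )
}
```

A call such as `CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run python postproc.py raw.png out/` then either takes a free slot or polls until one opens up. Since `nice` executes its argument as a program, the job has to be an external command rather than a shell function.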
- GPU mutex: one exclusive lock (flock 9 on /tmp/gpu.lock).
- CPU semaphore: N slot locks, each tried with flock -n (non-blocking); if all N are held, wait and retry.

git clone https://github.com/zionboggan/gpu-cpu-mutex
chmod +x gpu-cpu-mutex/*.sh
# put them on PATH, or call by path
No dependencies beyond bash, flock (util-linux), and optionally ionice. Lock paths default to /tmp and are configurable via env (see the top of each script).
MIT - see LICENSE.