Deploying MuJoCo on Azure ML: Surprising Pain Points

Every problem we hit deploying MuJoCo on Azure ML, and the fix for each.

Chris Mendez · 21 MAR 2026
Blueprint schematic of a GPU PCB connected to a cracked network storage block, with engineering callouts for CIFS, stat(), mfsymlinks, /mnt, EGL, and POSIX

TL;DR

We needed to deploy MuJoCo on Azure ML for a physical AI research team working on VLA models. It was surprisingly painful: what should have taken minutes took hours, even with AI help. The culprit wasn't the GPU. It was the filesystem: Azure Files uses CIFS/SMB, which breaks symlinks, has high metadata latency, and silently violates POSIX assumptions that research code depends on.

For context, this would be trivial on GCP. The footguns we hit (and how to get around them) are the point of this post.

Just want it to work? We built simup — a one-command CLI that deploys MuJoCo on Azure with all these issues pre-solved. The rest of this post is the hard-won context behind it.

Background

MuJoCo humanoid physics simulation walking on a checkered floor

MuJoCo humanoid simulation. Image: Google DeepMind (Apache 2.0)

We at Haptic are working with a robotics AI research team at Georgia Tech to train robot control policies using a pipeline of VLA (Vision-Language-Action) models in simulation. The physics engine behind most of this work is MuJoCo. The team hit GPU capacity limits on their university cluster, so we needed to move the workload to a cloud provider with available compute. Azure ML was the only option for non-technical reasons.

A Note on Where I'm Coming From

My background is macOS, Linux, GCP, and AWS. I had never touched Azure before this project. We weren't here because Azure was the obvious choice; we were here because a research partner maxed out their university cluster and needed someone to bridge the gap fast. Some of what bit me was genuinely Azure-specific. Some of it was me bringing the wrong mental model to a system I didn't understand yet.

The GPU Was Never the Problem

Every time something broke, my instinct was to look at CUDA, the driver version, the GPU itself. Every time, I was wrong. The actual enemy was the filesystem.

Observations at a Glance

| Problem | Symptom | Root Cause | Fix |
| --- | --- | --- | --- |
| Symlinks silently broken | Install failures, dead dataset paths | CIFS without mfsymlinks | Add mfsymlinks to /etc/fstab |
| Storage pincer | GPU idle, training never started | Node disk too small; network storage too slow | Redirect caches to /mnt, Premium tier for data |
| Training script hung | Never reached epoch 1 | 4,338 × 3.3s stat() calls on CIFS | Premium tier + restructure data loading |
| Stale mount wouldn't release | Lost GPU time | SMB mount stuck after account switch | Reboot (no clean alternative) |
| MuJoCo rendering failed | Error looked like CUDA problem | EGL not pre-configured on compute node | Install EGL, add user to render group |
| Driver conflict | Stack wouldn't load | nvidia-535 preinstalled, wrong version | Purge, reinstall nvidia-headless-580-server |

What I Was Trying to Do

Simulation-to-real training for Haptic's robotics research partner: MuJoCo for physics sim, openpi for the policy learning stack, Git LFS-tracked datasets, C++ submodules that needed to compile. The kind of workload where everything touches everything.

I stood up an Azure ML Compute Instance, mounted training data from Azure Files, and assumed the hard part would be the ML itself.

It was not.

Problem 1: Symlinks Silently Broken

Azure Files is SMB/CIFS. Without mfsymlinks as a mount option, creating a symlink silently succeeds but produces a regular file. No error. The link just doesn't work.

I didn't know this. Coming from Linux-on-local-disk, I had no particular reason to go looking. This is an Azure-on-Linux-specific quirk, not something you'd hit on a standard HPC cluster.

It hit me in two places:

uv hardlink failure. uv defaults to hardlinking packages into venvs for speed. Hardlinks across CIFS don't work. The install “succeeded” but packages weren't usable. Fix: export UV_LINK_MODE=copy.

Dead dataset symlink. Training data was on one mount path; the repo expected it at another. ln -s appeared to work. The symlink was a dead file. Training failed at data loading with an error that looked like a path configuration problem.

Fix: One word in /etc/fstab: mfsymlinks. Getting there cost two hours.
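A quick way to catch this before an install or a training run does is to create a link on the mount and check what actually landed on disk. The sketch below is a minimal probe and not part of simup; `probe_links` is a name made up for this example. It only tests links inside a single directory, so it will not reproduce a cache-to-venv hardlink that crosses two filesystems.

```python
import os
import tempfile


def probe_links(directory):
    """Report whether symlinks and hardlinks really work inside `directory`."""
    results = {}
    # Work in a scratch subdirectory that is removed when the probe finishes.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target.txt")
        with open(target, "w") as fh:
            fh.write("probe")

        sym = os.path.join(scratch, "sym")
        try:
            os.symlink(target, sym)
            # A call that "succeeds" is not enough: confirm the result is a
            # real link that resolves back to the target.
            results["symlink"] = os.path.islink(sym) and os.path.samefile(sym, target)
        except OSError:
            results["symlink"] = False

        hard = os.path.join(scratch, "hard")
        try:
            os.link(target, hard)
            results["hardlink"] = os.path.samefile(hard, target)
        except OSError:
            results["hardlink"] = False
    return results
```

Running `probe_links("/mnt/mujoco-data")` against the share should report `symlink` as `True` once mfsymlinks is active on the mount.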

Problem 2: The Storage Pincer

Before getting into fixes, it's worth explaining why this wasn't a simple “you ran out of disk” situation, because the naive fix makes things worse.

H100 nodes on Azure ML are provisioned as compute, not storage. The local disk is almost an afterthought. That sounds fine until you do the math on what a serious robotics ML stack actually needs:

| Component | Size |
| --- | --- |
| OS + CUDA stack | ~30GB |
| Build artifacts, pip/uv/torch cache | ~20GB |
| HuggingFace dataset cache | 77GB |
| Model weights | 20–70GB depending on config |
| Total | ~150–200GB before training starts |

Local SSD on this node: 126G total, 89G already used, 31G free.

The obvious response (put the dataset on network storage) is the correct instinct. That's what Azure Files is for. The problem is that the research code was written for university HPC clusters with local NVMe. It assumes fast stat() calls. It enumerates files at startup. It does thousands of small metadata operations that feel free on local disk and cost 3.3 seconds each on CIFS.

So the disk problem and the I/O problem aren't separate issues. They're the same problem from two directions: the node has too little local storage to hold the workload, and the network storage it's designed to offload to isn't fast enough to run the workload. You're in a pincer.

local SSD:  126G total | 89G used | 31G free
dataset:    77G
→ dataset does not fit on local SSD
→ must use network storage
→ network storage has 3.3s metadata latency
→ training script that enumerates 4,338 files at startup: effectively hangs
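The arithmetic behind the pincer is simple enough to script. This sketch sums the component table and compares a requirement with the free space on a path. The dictionary keys and function names are illustrative, and the sizes are the rough figures from this deployment, not numbers to reuse blindly.

```python
import shutil

# (low, high) sizes in GB, copied from the component table in this post.
COMPONENTS_GB = {
    "os_cuda": (30, 30),
    "build_and_caches": (20, 20),
    "hf_dataset_cache": (77, 77),
    "model_weights": (20, 70),
}


def budget_gb(components=COMPONENTS_GB):
    """Return the (low, high) total footprint in GB."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def fits(path, required_gb):
    """True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb
```

`budget_gb()` gives 147 to 197 GB, which is where the "~150–200GB" total comes from. With 31G free, `fits("/", 77)` would be false for the dataset alone.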

The exit is partial: redirect everything that isn't raw data to local SSD (caches, venvs, build artifacts), leave the dataset on Premium-tier Azure Files, and restructure how the training script accesses files. You don't fully escape the pincer. You make it survivable.

Redirect all caches before touching anything else:

export HF_HOME=/mnt/hf_cache
export UV_CACHE_DIR=/mnt/uv_cache
export PIP_CACHE_DIR=/mnt/pip_cache
export TORCH_HOME=/mnt/torch_cache
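Because a forgotten export only shows up later as a full OS disk, it can be worth asserting the redirects at the top of a job. The helper below is a sketch; `misplaced_caches` is a hypothetical name, and the `/mnt` root simply mirrors the exports above.

```python
import os

CACHE_VARS = ("HF_HOME", "UV_CACHE_DIR", "PIP_CACHE_DIR", "TORCH_HOME")


def misplaced_caches(root="/mnt", env=None):
    """Return cache variables that are unset or point outside `root`."""
    env = os.environ if env is None else env
    root = os.path.abspath(root)
    problems = {}
    for name in CACHE_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "unset"
        elif os.path.commonpath([root, os.path.abspath(value)]) != root:
            # The variable is set, but the directory is not under `root`.
            problems[name] = value
    return problems
```

An empty dictionary means all four caches are redirected. Anything else is a cache that will still grow on the OS disk.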

Problem 3: CIFS Metadata Latency Killed the Training Script

The failure chain:

training script starts
→ enumerates dataset directory
→ 4,338 parquet files × stat() call each
→ each stat() call: 3.3 seconds on Azure Files Standard
→ total: ~4 hours just to scan the directory
→ training script appears hung
→ GPU never utilized

The script wasn't broken. It was waiting on the filesystem.

Running git lfs install && git submodule update --recursive on the same mount had the same character: thousands of small metadata operations, each paying the SMB round-trip tax.

Migrating to Azure Files Premium tier improved sequential throughput and unblocked the LFS clone. It did not fully solve metadata latency; individual stat() calls were still slow. But it was enough to get the training job started. That migration required a new storage account, azcopy with SAS tokens, an /etc/fstab update, and a remount.
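One way to restructure the loading side is to list the directory once and cache the result on local disk, so later runs never enumerate the share. The sketch below is an assumption, not the exact change made in this deployment. `list_parquet`, `load_manifest`, and the manifest file are hypothetical names, and whether a plain directory listing is cheap on a given mount is worth measuring.

```python
import json
import os


def list_parquet(dataset_dir):
    """List parquet file names without an explicit stat() call per file."""
    # os.scandir yields names straight from the directory listing; filtering
    # on the name alone avoids asking the filesystem about each entry.
    with os.scandir(dataset_dir) as entries:
        return sorted(e.name for e in entries if e.name.endswith(".parquet"))


def load_manifest(dataset_dir, manifest_path):
    """Return the cached file list, building it on the first call."""
    if os.path.exists(manifest_path):
        with open(manifest_path) as fh:
            return json.load(fh)
    names = list_parquet(dataset_dir)
    tmp = manifest_path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(names, fh)
    os.replace(tmp, manifest_path)  # rename into place so readers never see a partial file
    return names
```

The trade-off is staleness: the manifest has to be deleted or rebuilt whenever the dataset changes, since the cached list is returned as-is.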

Run these before committing to a storage tier. They take five minutes and can save a day:

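# Sequential write throughput (oflag=direct bypasses the page cache)
dd if=/dev/zero of=/mnt/mujoco-data/test_write bs=1M count=512 oflag=direct
rm /mnt/mujoco-data/test_write   # remove the 512 MB test file afterwards

# Metadata latency: the number that actually matters
time stat /mnt/mujoco-data/any_single_file.parquet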

If stat is over a few hundred milliseconds and your training script enumerates files at startup, you have a problem before you've run a single epoch.
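A single `time stat` is noisy, so a slightly more careful version samples several files and projects the cost of a full scan. This is a sketch with made-up function names. One caveat worth stating: the CIFS client caches attributes for a short time, so files that were touched moments ago can look much faster than a cold scan will be.

```python
import os
import statistics
import time


def median_stat_seconds(paths):
    """Median wall-clock time of os.stat() over the given paths."""
    samples = []
    for path in paths:
        start = time.perf_counter()
        os.stat(path)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def projected_scan_hours(n_files, seconds_per_stat):
    """Hours needed to stat every file once, sequentially, at this latency."""
    return n_files * seconds_per_stat / 3600
```

Plugging in the numbers from this deployment, `projected_scan_hours(4338, 3.3)` comes to roughly 3.98, which is the "~4 hours" in the failure chain.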

Problem 4: The Mount Got Stuck and Required a Reboot

After migrating storage accounts, the old CIFS mount went stale. Standard unmount approaches all failed:

sudo umount //old-account.file.core.windows.net/mujoco-data   # failed
sudo umount -l /mnt/mujoco-data                               # still mounted
sudo fuser -m /mnt/mujoco-data                                # found blocking processes
sudo umount -f /mnt/mujoco-data                               # still stuck
sudo reboot                                                    # this worked

A forced reboot on a cloud GPU node to clear a stale SMB mount. There are billing implications to that sentence.
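Before reaching for the reboot, it helps to confirm what the kernel still thinks is mounted. The sketch below parses text in the `/proc/mounts` format and picks out CIFS entries that still point at a given host. The function names are illustrative, and the host in the usage note is the old account from the commands above.

```python
def cifs_mounts(mounts_text):
    """Parse /proc/mounts-style text and return the CIFS entries."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: source, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "cifs":
            continue
        found.append({
            "source": fields[0],
            "target": fields[1],
            "options": fields[3].split(","),
        })
    return found


def mounts_from_host(mounts, host):
    """Entries whose source still points at `host`."""
    prefix = f"//{host}/"
    return [m for m in mounts if m["source"].startswith(prefix)]
```

For example, `mounts_from_host(cifs_mounts(open("/proc/mounts").read()), "old-account.file.core.windows.net")` returns an empty list once the stale mount is really gone.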

Problem 5: Headless GPU Rendering Isn't Pre-Configured

MuJoCo needs EGL for offscreen rendering. The Azure ML compute node I was on didn't have it set up. The failure mode was the real problem: the error looked like a CUDA or GPU driver issue. I spent real time on the wrong diagnosis.

sudo apt-get install libegl1-mesa libegl1-mesa-dev
sudo usermod -a -G render $USER

GPU was fine the whole time. It just didn't have access to an EGL context.
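Since the error points at the wrong layer, a small pre-flight check that looks at the EGL prerequisites directly can save the misdiagnosis. This sketch uses only the standard library and assumes a Linux node; `egl_preflight` and the keys it returns are names chosen for illustration.

```python
import ctypes.util
import glob
import grp
import os


def egl_preflight():
    """Check the EGL prerequisites that don't involve CUDA at all."""
    render_gids = {g.gr_gid for g in grp.getgrall() if g.gr_name == "render"}
    process_gids = set(os.getgroups()) | {os.getegid()}
    return {
        # Can the dynamic linker locate libEGL?
        "libEGL found": ctypes.util.find_library("EGL") is not None,
        # Are DRM render nodes exposed on this machine?
        "render nodes present": bool(glob.glob("/dev/dri/renderD*")),
        # Does this process actually carry the render group?
        "process in render group": bool(render_gids & process_gids),
    }
```

Group membership added with usermod only applies to new login sessions, so the last check can stay false until you log out and back in. MuJoCo's Python bindings also read the `MUJOCO_GL` environment variable to choose a backend, so `MUJOCO_GL=egl` is worth confirming alongside these checks.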

Problem 6: NVIDIA Driver Conflicts

The preinstalled nvidia-535 drivers conflicted with what the stack required. Fix:

sudo apt-get purge 'nvidia*'   # quoted so the shell passes the pattern to apt unexpanded
sudo apt autoremove
sudo apt-get install nvidia-headless-580-server
sudo reboot

The key distinction: nvidia-headless, not nvidia-driver. There is no desktop on a compute node. This isn't Azure-specific; it would bite you on any headless Linux server. But nothing in the Azure ML setup flagged it.
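After the reboot it is worth confirming which driver the kernel actually loaded. The sketch below reads `/proc/driver/nvidia/version`, which the NVIDIA kernel module exposes, and pulls out the version number. The function names are illustrative, and the sample string in the usage note shows the general shape of that file, not output captured from this node.

```python
import re


def parse_driver_version(text):
    """Extract the first dotted version number as a tuple of ints, or None."""
    match = re.search(r"\b(\d+)\.(\d+)(?:\.(\d+))?\b", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def loaded_driver_version(path="/proc/driver/nvidia/version"):
    """Version of the loaded NVIDIA kernel module, or None if it isn't loaded."""
    try:
        with open(path) as fh:
            return parse_driver_version(fh.read())
    except OSError:
        return None
```

A line such as `NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.129.03 ...` parses to `(535, 129, 3)`. After the fix above, the first element should be 580.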

What I Tried That Made It Worse

| Approach | What Happened | Verdict |
| --- | --- | --- |
| Making HF cache dir read-only to prevent disk fill | Caused subtle import failures in libraries that write to hub cache on load | Reverted |
| mmap for dataset access over CIFS | Slower than direct reads; page fault overhead on network-backed memory | Reverted |
| Staying on Azure Files Standard and tuning mount options | Metadata latency was structural, not tunable | Migrated to Premium |
| umount -f to clear stale mount | Didn't work; kernel held the mount | Rebooted |

What Was Actually Fine

Once the environment stopped fighting me, the GPU did exactly what I asked. Azure gave us access to compute we genuinely didn't have. The university cluster was at capacity, and without Azure we weren't running anything. GPU utilization, once training started, was clean and stable. Azure ML's job management, once you understand the model, is reasonable. The problem was entirely in setup: the gap between what the research code assumed and what Azure ML provides by default.

The Pattern

Worth separating the two types of problems:

Azure-specific friction in this setup:

  • CIFS symlink behavior without mfsymlinks
  • OS disk sized for compute, not storage; too small for robotics ML workloads
  • Azure Files metadata latency on SMB
  • EGL not pre-configured on the compute node

General HPC-to-cloud friction (probably not Azure-specific):

  • Research codebases that hardcode Ubuntu + SLURM assumptions
  • stat()-heavy data loading pipelines that assume fast local NVMe
  • Python environment tooling that expects symlinks and hardlinks to work
  • Git LFS + network storage being a genuinely bad combination

I can only speak to Azure from this deployment. Whether other clouds handle POSIX-style metadata operations better for this class of workload is an open question worth testing.

Pre-Flight Checklist

Before you touch a single line of research code.

Storage

  • Add mfsymlinks to the mount options of every CIFS/Azure Files entry in /etc/fstab
  • Run time stat <any_file> on your mount. If it's over 100ms, plan around it
  • Run a dd write benchmark before committing to a storage tier
  • Do the math: OS + CUDA + caches + weights + dataset. If it exceeds local SSD, you're on network storage for data; account for metadata latency accordingly
  • Use Azure Files Premium, not Standard, for any workload with Git LFS or heavy file enumeration

Disk / Cache

  • Redirect pip, uv, HuggingFace, and torch caches to /mnt before anything else
  • Set UV_LINK_MODE=copy
  • Run df -h before any large install. The OS disk fills faster than you expect

GPU / Rendering

  • Install nvidia-headless-{version}-server, not nvidia-driver
  • Install EGL: sudo apt-get install libegl1-mesa libegl1-mesa-dev
  • Add your user to the render group: sudo usermod -a -G render $USER

Automation

  • Write an idempotent node_setup.sh with all of the above and run it at boot via Azure ML startup scripts
  • Never let critical setup live only in bash history
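Idempotent here means the setup can run on every boot without stacking up duplicate changes. As one small illustration, the helper below adds a mount option to CIFS lines in fstab-format text only when it is missing. It is a sketch with a hypothetical name, it operates on text and leaves writing `/etc/fstab` (and backing it up) to the caller, and it normalizes whitespace on the lines it edits.

```python
def ensure_cifs_option(fstab_text, option="mfsymlinks"):
    """Add `option` to every CIFS entry that lacks it; leave other lines alone."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_cifs = (
            len(fields) >= 4
            and not line.lstrip().startswith("#")
            and fields[2] == "cifs"
        )
        if is_cifs and option not in fields[3].split(","):
            fields[3] = fields[3] + "," + option
            line = " ".join(fields)  # edited lines are rejoined with single spaces
        out.append(line)
    trailing = "\n" if fstab_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

Running it twice gives the same result as running it once, which is the property the boot script needs.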

The GPU ran fine, by the way. When I finally got everything else out of the way, it did exactly what I asked. It had been waiting patiently the entire time.

If you've ported similar workloads to cloud infrastructure and hit different failure modes, especially around metadata latency or POSIX semantics, I'd be curious what you found.

That's Why We Built simup

simup — a blueprint-style illustration of an ape sitting on a cloud, using a laptop

Every issue in this post is something we solved by hand, then automated so nobody else has to. simup is a single CLI that deploys MuJoCo on Azure with all of these fixes baked in. One command and your sim is up.

It's free, open source, and available now on GitHub: github.com/Haptic-AI/one-click-mujoco-azure

If you're running into these kinds of problems and want help getting your workloads off the ground, we can help with that.