Drop-in Slurm
Standard Slurm CLI, srun, sbatch, squeue — all of it. If your team already runs Slurm, you can point your scripts at Tscale and ship today. No new toolchain to learn.
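As a rough illustration of what "point your scripts at Tscale" could look like, the sketch below builds an ordinary `sbatch` command line from Python. It uses only standard `sbatch` options (`--nodes`, `--gpus-per-node`, `--partition`, `--job-name`). The helper name `build_sbatch_command`, the script name `train.sh`, and the partition name `h100` are assumptions made for the example, not names taken from Tscale.

```python
import shlex


def build_sbatch_command(script, nodes, gpus_per_node, partition=None, job_name=None):
    """Return the argv list for a standard sbatch submission.

    Hypothetical helper: it only assembles flags that stock Slurm accepts.
    """
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")
    cmd = ["sbatch", f"--nodes={nodes}", f"--gpus-per-node={gpus_per_node}"]
    if partition:
        # Partition names are site-specific; "h100" below is an assumed example.
        cmd.append(f"--partition={partition}")
    if job_name:
        cmd.append(f"--job-name={job_name}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    cmd = build_sbatch_command("train.sh", nodes=4, gpus_per_node=8, partition="h100")
    # Print the command instead of running it, so the sketch works without a cluster.
    print(shlex.join(cmd))
```

On a machine with the Slurm client tools installed, the same list could be passed to `subprocess.run(cmd, check=True)` to submit the job.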
Tscale Managed Slurm gives you a production-grade Slurm cluster on dedicated GPUs — with intelligent queueing, auto-scaling, and observability built in. No sysadmin work, no infra headaches, just submit your job.
Nodes spin up when queues grow and shut down when idle. You pay for what you use — without throttling your research team during a deadline.
Tscale runs the control plane, upgrades, security patches, and node health. Your team focuses on training, not on debugging a stubborn slurmd.
Tscale’s scheduler goes beyond FIFO. Fair-share accounting, gang scheduling, preemption, and topology-aware placement ensure your multi-node jobs land on the right hardware — every time.
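To give a feel for how fair-share accounting shifts priority, here is a minimal sketch of the classic fair-share factor described in Slurm's multifactor priority documentation, F = 2^(-(U/S)/d), where U is an account's normalized usage, S its normalized share, and d a damping factor. This is an illustration only: recent Slurm releases default to the Fair Tree algorithm instead, and the page does not say which variant Tscale configures. The function name is hypothetical.

```python
def fair_share_factor(usage, shares, damping=1.0):
    """Classic Slurm-style fair-share factor in the range (0, 1].

    usage   -- normalized usage of the account (0.0 = nothing consumed)
    shares  -- normalized share of the cluster assigned to the account
    damping -- larger values soften the penalty for over-use (assumed default 1.0)
    """
    if shares <= 0 or damping <= 0:
        # An account with no share (or a nonsensical damping value) gets no boost.
        return 0.0
    return 2.0 ** (-(usage / shares) / damping)


if __name__ == "__main__":
    # An idle account, one that used exactly its share, and one that used double.
    for usage in (0.0, 0.25, 0.5):
        print(usage, fair_share_factor(usage, shares=0.25))
```

An account that has consumed exactly its share scores 0.5; lighter users score closer to 1.0 and move up the queue, heavier users score lower.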
When a research team queues a 1000-GPU job, Tscale provisions the nodes, joins them to the cluster, and starts the job — without a single ticket. When the job ends, capacity drains back to the pool automatically.
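The scale-up and drain behavior described above can be pictured as a simple sizing rule: keep the nodes that are busy, add enough nodes to cover the GPUs requested by pending jobs, and let everything else drain. The sketch below is a hypothetical model, not Tscale's actual autoscaler. The function name, the 8-GPUs-per-node default, and the use of 800 nodes as a ceiling are all assumptions chosen to match the figures on this page.

```python
import math


def target_node_count(pending_gpus, busy_nodes, gpus_per_node=8, max_nodes=800):
    """Return how many nodes the cluster should have right now.

    pending_gpus  -- GPUs requested by jobs still waiting in the queue
    busy_nodes    -- nodes currently running jobs (never drained)
    gpus_per_node -- assumed node size
    max_nodes     -- assumed upper bound on cluster size
    """
    if pending_gpus < 0 or busy_nodes < 0:
        raise ValueError("counts must be non-negative")
    extra = math.ceil(pending_gpus / gpus_per_node)
    return min(busy_nodes + extra, max_nodes)


if __name__ == "__main__":
    # A 1000-GPU job arriving on an empty cluster needs 125 eight-GPU nodes.
    print(target_node_count(pending_gpus=1000, busy_nodes=0))
    # With nothing pending and nothing running, the target falls back to zero.
    print(target_node_count(pending_gpus=0, busy_nodes=0))
```

Any node above the returned target is idle by construction, so it is a candidate for draining back to the pool.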
From 8 to 800 nodes in under 10 minutes. No reserved capacity, no quotas to negotiate.
H100, H200, A100, L40S, MI300X: all in one Slurm cluster, with a separate partition for each GPU type.
Per-second billing that stops as soon as idle nodes drain. Reserved discounts for steady-state baselines.
Managed Slurm comes with everything you need to schedule, monitor, and debug distributed training runs — including the tools your researchers already use.
NVLink + InfiniBand fabric + tuned NCCL delivers 7.2x throughput vs. vanilla cloud GPU.
Same H100s, same NCCL, 80% lower bill. No egress fees, no per-second markup.
Auto-scaling workers are warm, joined, and ready in under 10 minutes from queue.
Redundant slurmctld, MUNGE failover, and 24/7 SRE on call for cluster-wide incidents.
Project-level accounts, fair-share quotas, and burst allowances — perfect for orgs with multiple research groups sharing a single cluster.
Single-tenant control plane, encrypted MUNGE, RBAC, and audit logging for teams with strict data residency or compliance requirements.
Managed Slurm is the orchestration layer for distributed training. Combine it with the rest of Tscale’s services to ship models from research to production without leaving the platform.