Managed Slurm — Tscale | Distributed Training Orchestration on GPU
/ MANAGED SLURM

Distributed training orchestrated at scale

Tscale Managed Slurm gives you a production-grade Slurm cluster on dedicated GPUs, with intelligent queueing, auto-scaling, and observability built in. No sysadmin work and no infra headaches: just submit your job.

Drop-in Slurm

The standard Slurm CLI (srun, sbatch, squeue, and the rest) works unchanged. If your team already runs Slurm, you can point your scripts at Tscale and ship today, with no new toolchain to learn.
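
As a rough illustration of what pointing existing scripts at a Slurm cluster can look like, the sketch below builds an ordinary batch script in Python and pipes it to the stock sbatch command. The partition name h100, the script train.py, and the helper names build_sbatch_script and submit are assumptions made for this example, not Tscale-specific values; the #SBATCH options themselves are standard Slurm.

```python
import subprocess


def build_sbatch_script(job_name, nodes, gpus_per_node, time_limit,
                        partition, command):
    """Return the text of a batch script using standard #SBATCH options."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",            # hypothetical partition name
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one task per GPU
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Pipe the script to sbatch on stdin; only works where Slurm is installed."""
    result = subprocess.run(["sbatch"], input=script_text, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()


# Build (but do not submit) a 2-node, 16-GPU job description.
script = build_sbatch_script("demo", nodes=2, gpus_per_node=8,
                             time_limit="01:00:00", partition="h100",
                             command="python train.py")
print(script)
```

Calling submit(script) from a login node would hand the same text to sbatch; the build step is kept separate so the script can be inspected or version-controlled first.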

Auto-scaling Capacity

Nodes spin up when queues grow and shut down when idle. You pay for what you use — without throttling your research team during a deadline.

Fully Managed

Tscale runs the control plane, upgrades, security patches, and node health. Your team focuses on training, not on debugging a stubborn slurmd.

/ JOB SCHEDULER

Intelligent queueing & scheduling

Tscale’s scheduler goes beyond FIFO. Fair-share accounting, gang scheduling, preemption, and topology-aware placement ensure your multi-node jobs land on the right hardware — every time.

  • Fair-share scheduling: priorities based on historical usage, project quotas, and burst allowances.
  • Gang scheduling: all-or-nothing allocation for distributed jobs, so no run launches half-allocated.
  • Topology-aware placement: NVLink domains and InfiniBand fabrics stay intact for tightly coupled training.
  • Preemption & backfill: small jobs run in idle gaps, and urgent jobs jump the queue.
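
To make the all-or-nothing and backfill ideas above concrete, here is a toy planner, offered as a sketch under simplifying assumptions rather than a description of Tscale's scheduler. It starts a job only if every node it asks for is free, reserves a start time for the highest-priority job that is blocked, and lets smaller jobs run in the gap only if they finish before that reservation. The Job fields, the plan function, and the single-cycle model with time measured in minutes from now are all invented for illustration; preemption and fair-share weighting are left out.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # nodes required, all at once (all-or-nothing)
    runtime: int    # requested time limit in minutes
    priority: int   # higher values are considered first


def plan(queue, free_nodes, running):
    """Return names of jobs to start now.

    running: list of (end_time_minutes, nodes) for jobs already on the cluster.
    """
    started = []
    running = list(running)
    reserved_start = None  # start time promised to the first blocked job
    for job in sorted(queue, key=lambda j: j.priority, reverse=True):
        fits_now = job.nodes <= free_nodes
        if reserved_start is None and not fits_now:
            # Highest-priority blocked job: find when enough nodes free up.
            available = free_nodes
            for end_time, nodes in sorted(running):
                available += nodes
                if available >= job.nodes:
                    reserved_start = end_time
                    break
            continue
        # Start a job only if it fits entirely and cannot delay the reservation.
        if fits_now and (reserved_start is None or job.runtime <= reserved_start):
            free_nodes -= job.nodes
            running.append((job.runtime, job.nodes))
            started.append(job.name)
    return started


queue = [
    Job("big", nodes=8, runtime=120, priority=10),
    Job("small", nodes=2, runtime=30, priority=1),
    Job("long", nodes=2, runtime=90, priority=1),
]
# 4 nodes are free now; 4 more are busy for another 60 minutes.
print(plan(queue, free_nodes=4, running=[(60, 4)]))  # ['small']
```

In this run the 8-node job cannot start partially, so it is given a reservation at minute 60; the 30-minute job backfills into the idle nodes, while the 90-minute job waits because it would overrun the reservation.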

/ ELASTIC CAPACITY

Scale to thousands of GPUs in minutes

When a research team queues a 1000-GPU job, Tscale provisions the nodes, joins them to the cluster, and starts the job — without a single ticket. When the job ends, capacity drains back to the pool automatically.
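
The provisioning decision described above can be sketched as simple arithmetic: convert pending GPU demand into nodes, compare with idle capacity, and either add nodes or drain the surplus. The function name scale_delta, the figure of 8 GPUs per node, and the cap of 800 new nodes are assumptions chosen for the example, not documented Tscale parameters.

```python
import math


def scale_delta(pending_gpus, idle_nodes, gpus_per_node=8, max_new_nodes=800):
    """Return nodes to add (positive) or release (negative or zero)."""
    needed = math.ceil(pending_gpus / gpus_per_node)
    if needed > idle_nodes:
        # Queue demand exceeds idle capacity: provision the shortfall, capped.
        return min(needed - idle_nodes, max_new_nodes)
    # Demand is covered: any surplus idle nodes can drain back to the pool.
    return needed - idle_nodes


print(scale_delta(pending_gpus=1000, idle_nodes=0))  # 125
print(scale_delta(pending_gpus=0, idle_nodes=3))     # -3
```

Under these assumptions a 1000-GPU job maps to 125 eight-GPU nodes, and an empty queue with three idle nodes yields a drain of three.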

  • Burst on demand

    From 8 to 800 nodes in under 10 minutes. No reserved capacity required and no quotas to negotiate.

  • Mixed hardware pools

    H100, H200, A100, L40S, and MI300X in one Slurm cluster, with each hardware type in its own partition.

  • Pay for what you use

    Per-second billing, with idle nodes drained automatically. Reserved discounts for steady-state baselines.

/ PLATFORM

A complete HPC stack

Managed Slurm comes with everything you need to schedule, monitor, and debug distributed training runs — including the tools your researchers already use.

Scheduler Core

  • Slurm 23.x (latest stable)
  • slurmdbd accounting
  • MUNGE authentication
  • High-availability controllers

Workload Runtimes

  • Pyxis / Enroot containers
  • Singularity / Apptainer
  • Sarus (HPC-native)
  • Bare-metal runs

Frameworks

  • PyTorch DDP
  • DeepSpeed + ZeRO
  • Megatron-LM
  • JAX / pjit
  • Ray Train
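
For frameworks such as PyTorch DDP, a common pattern when launching with srun is to translate Slurm's per-task environment variables into the names that torch.distributed's env:// initialization reads. The sketch below shows that mapping only; the helper name ddp_env_from_slurm and the sample values are assumptions for illustration, and MASTER_ADDR and MASTER_PORT would still have to be set separately before calling torch.distributed.init_process_group.

```python
import os


def ddp_env_from_slurm(env=None):
    """Map Slurm's per-task variables to the names torch.distributed reads."""
    env = os.environ if env is None else env
    return {
        "RANK": env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": env["SLURM_NTASKS"],   # total number of tasks
        "LOCAL_RANK": env["SLURM_LOCALID"],  # task index on this node
    }


# Example with a hand-written environment (values are made up):
fake_env = {"SLURM_PROCID": "9", "SLURM_NTASKS": "16", "SLURM_LOCALID": "1"}
settings = ddp_env_from_slurm(fake_env)
print(settings)
```

Inside a real job step the function would be called with no argument so that it reads os.environ, and the result would be merged into the environment before the process group is initialized.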

Observability

  • Grafana dashboards
  • Prometheus metrics
  • Loki log aggregation
  • DCGM GPU telemetry
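
As one way these pieces might fit together, GPU telemetry exported by DCGM can be read back through Prometheus's standard HTTP query API. The sketch below builds an instant-query URL and averages the values in a response. The server address is a placeholder, the helper names are invented for this example, and the sample response is hand-written; DCGM_FI_DEV_GPU_UTIL is the utilization metric published by NVIDIA's dcgm-exporter, though whether and how Tscale exposes it is an assumption.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical address


def query_url(promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": promql})


def mean_value(response_text):
    """Average the sample values in an instant-query JSON response."""
    payload = json.loads(response_text)
    samples = payload["data"]["result"]
    values = [float(sample["value"][1]) for sample in samples]
    return sum(values) / len(values) if values else None


# Hand-written response in the API's vector format (values are made up).
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"gpu": "0"}, "value": [1700000000, "90"]},
            {"metric": {"gpu": "1"}, "value": [1700000000, "70"]},
        ],
    },
})

print(query_url("DCGM_FI_DEV_GPU_UTIL"))
print(mean_value(sample_response))  # 80.0
```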

Storage

  • High-performance NFS
  • Lustre parallel filesystem
  • S3-compatible object
  • Dataset caching

Access

  • SSH + srun
  • JupyterHub integration
  • REST API (Radar)
  • Terraform provider

Performance

7.2X THROUGHPUT
Faster distributed training

NVLink, an InfiniBand fabric, and tuned NCCL deliver 7.2x the throughput of vanilla cloud GPU instances.

80% LOWER COST
vs. hyperscalers

Same H100s, same NCCL, 80% lower bill. No egress fees, no per-second markup.

<10 MIN PROVISIONING
From job submission to run

Auto-scaled workers are warm, joined to the cluster, and ready within 10 minutes of a job entering the queue.

99.9% UPTIME SLA
HA control plane

Redundant slurmctld, MUNGE failover, and 24/7 SRE on call for cluster-wide incidents.

Built for research teams

Multi-tenant Cluster

Project-level accounts, fair-share quotas, and burst allowances — perfect for orgs with multiple research groups sharing a single cluster.

Secure & Isolated

Single-tenant control plane, encrypted MUNGE, RBAC, and audit logging — for teams with strict data residency or compliance requirements.

Pair with the rest of the stack

Managed Slurm is the orchestration layer for distributed training. Combine it with the rest of Tscale’s services to ship models from research to production without leaving the platform.

/ MANAGED SLURM

Run your biggest jobs without the ops

Submit Your First Job