/ MANAGED SLURM

Distributed training orchestrated at scale

Tscale Managed Slurm gives you a production-grade Slurm cluster on dedicated GPUs — with intelligent queueing, auto-scaling, and observability built in. No sysadmin work, no infra headaches, just submit your job.

[Cluster status widget: slurm-prod-01 · 16 / 19 nodes · legend: Active, Queued]

Drop-in Slurm

Standard Slurm CLI, srun, sbatch, squeue — all of it. If your team already runs Slurm, you can point your scripts at Tscale and ship today. No new toolchain to learn.
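
As an illustration of what "point your scripts at Tscale" could look like, here is a minimal Python sketch that assembles a standard sbatch script. It assumes only stock Slurm directives (`--job-name`, `--nodes`, `--gpus-per-node`, `--time`, `--partition`); the job name, the `h100` partition name, and `train.py` are hypothetical placeholders, not values documented for Tscale.

```python
def build_sbatch_script(job_name, nodes, gpus_per_node, time_limit, command, partition=None):
    """Return the text of a minimal sbatch script using stock Slurm directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
    ]
    if partition is not None:
        # Partition names are cluster-specific; "h100" below is a made-up example.
        lines.append(f"#SBATCH --partition={partition}")
    # srun launches the command on the allocated resources.
    lines += ["", f"srun {command}", ""]
    return "\n".join(lines)


script = build_sbatch_script(
    job_name="demo-train",
    nodes=2,
    gpus_per_node=8,
    time_limit="02:00:00",
    command="python train.py",
    partition="h100",
)
print(script)
```

The printed text can be saved to a file such as `job.sh`, submitted with `sbatch job.sh`, and tracked with `squeue`, exactly as on any other Slurm cluster.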

Auto-scaling Capacity

Nodes spin up when queues grow and shut down when idle. You pay for what you use — without throttling your research team during a deadline.

Fully Managed

Tscale runs the control plane, upgrades, security patches, and node health. Your team focuses on training, not on debugging a stubborn slurmd.

/ JOB SCHEDULER

Intelligent queueing & scheduling

Tscale’s scheduler goes beyond FIFO. Fair-share accounting, gang scheduling, preemption, and topology-aware placement ensure your multi-node jobs land on the right hardware — every time.

Fair-share scheduling — priorities based on historical usage, project quotas, and burst allowances.
Gang scheduling — all-or-nothing allocation for distributed jobs, no half-launched runs.
Topology-aware placement — NVLink domains and InfiniBand fabrics stay intact for tight-coupled training.
Preemption & backfill — small jobs run in idle gaps, urgent jobs jump the queue.
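
To make the fair-share idea concrete, here is a small sketch loosely modeled on the classic fair-share formula documented for open-source Slurm's multifactor priority plugin, F = 2^(-(U/S)/d), where U is a project's normalized usage, S its normalized share, and d a damping factor. This page does not describe how Tscale configures its scheduler, so the function name, project names, and numbers below are illustrative assumptions only.

```python
def fair_share_factor(normalized_usage, normalized_shares, damping=1.0):
    """Classic-style fair-share factor in (0, 1]; higher means higher priority.

    normalized_usage: fraction of recent cluster usage consumed by the project.
    normalized_shares: fraction of the cluster the project is entitled to.
    """
    if normalized_shares <= 0:
        return 0.0  # a project with no shares gets no fair-share boost
    return 2.0 ** (-(normalized_usage / normalized_shares) / damping)


# Hypothetical projects: (normalized usage, normalized shares).
projects = {
    "research": (0.50, 0.25),  # used twice its share
    "ml-ops": (0.25, 0.25),    # used exactly its share
    "applied": (0.00, 0.50),   # has not run anything recently
}

# Projects that have used less than their share sort to the front.
ranked = sorted(projects, key=lambda p: fair_share_factor(*projects[p]), reverse=True)
print(ranked)
```

In this toy model a project that has consumed exactly its share gets a factor of 0.5, and the factor halves again each time usage grows by another full share. A real scheduler would combine this factor with job age, size, and quality-of-service weights.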

RUNNING · llama3-70b-sft-job-4218 · 64x H100 · user: research · 2h 14m · priority: High
RUNNING · mixtral-eval-batch-0089 · 8x A100 · user: ml-ops · 22m · priority: Med
QUEUED · qwen2-72b-pretrain-v3 · 128x H200 · user: research · est 4h wait · priority: High
QUEUED · hyperparam-sweep-204 · 4x L40S · user: applied · est 8m wait · priority: Low
QUEUED · data-preproc-pipeline · 2x CPU · user: data-eng · est 1m wait · priority: Low

Active GPU nodes · 24h: 142 / 200

Avg Util 71% · P95 Wait 3.2m · Cost / hr $48

/ ELASTIC CAPACITY

Scale to thousands of GPUs in minutes

When a research team queues a 1000-GPU job, Tscale provisions the nodes, joins them to the cluster, and starts the job — without a single ticket. When the job ends, capacity drains back to the pool automatically.
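
The sizing step behind that flow can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Tscale's provisioning logic: the function name, the 8-GPUs-per-node figure, and the idle-node and node-cap values are all hypothetical.

```python
import math


def nodes_to_provision(queued_gpus, gpus_per_node, idle_nodes, current_nodes, max_nodes):
    """Return how many new nodes to add so a queued job can start.

    The result is never negative and never pushes the cluster past max_nodes.
    """
    needed = math.ceil(queued_gpus / gpus_per_node)  # whole nodes the job requires
    shortfall = max(0, needed - idle_nodes)          # idle nodes are reused first
    headroom = max(0, max_nodes - current_nodes)     # room left under the cap
    return min(shortfall, headroom)


# A 1000-GPU job on hypothetical 8-GPU nodes, with 5 idle nodes in a 16-node cluster.
to_add = nodes_to_provision(
    queued_gpus=1000, gpus_per_node=8, idle_nodes=5, current_nodes=16, max_nodes=800
)
print(to_add)
```

Here the job needs 125 nodes, 5 idle nodes are reused, and 120 new nodes are requested. The same calculation returns 0 when idle capacity already covers the job.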

Burst on demand

From 8 to 800 nodes in under 10 minutes. No reserved capacity, no quotas to negotiate.

Mixed hardware pools

H100, H200, A100, L40S, MI300X: all in one Slurm cluster, with each hardware type in its own partition.

Pay for what you use

Per-second billing that stops as soon as idle nodes drain. Reserved discounts for steady-state baselines.

/ PLATFORM

A complete HPC stack

Managed Slurm comes with everything you need to schedule, monitor, and debug distributed training runs — including the tools your researchers already use.

Scheduler Core

Slurm 23.x (latest stable)
slurmdbd accounting
MUNGE authentication
High-availability controllers

Workload Managers

Pyxis / Enroot containers
Singularity / Apptainer
Sarus (HPC-native)
Bare-metal runs

Frameworks

PyTorch DDP
DeepSpeed + ZeRO
Megatron-LM
JAX / pjit
Ray Train
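
For frameworks such as PyTorch DDP, a common pattern on any Slurm cluster is to translate the per-task variables that `srun` sets into the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` variables that PyTorch's environment-based initialization and torchrun-style scripts read. The sketch below assumes only standard Slurm variables (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`); the helper name and the sample values are hypothetical, and `MASTER_ADDR` / `MASTER_PORT` still have to be set separately.

```python
import os


def torch_env_from_slurm(environ=None):
    """Map Slurm per-task variables to the names PyTorch-style launchers expect."""
    env = os.environ if environ is None else environ
    return {
        "RANK": env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": env["SLURM_NTASKS"],   # total number of tasks in the job step
        "LOCAL_RANK": env["SLURM_LOCALID"],  # rank of this task on its own node
    }


# Hypothetical values for the second task on the second node of a 16-task job.
sample = {"SLURM_PROCID": "9", "SLURM_NTASKS": "16", "SLURM_LOCALID": "1"}
mapped = torch_env_from_slurm(sample)
print(mapped)
```

Inside a real job the function would be called with no argument and the result merged into `os.environ` before the training framework initializes its process group.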

Observability

Grafana dashboards
Prometheus metrics
Loki log aggregation
DCGM GPU telemetry

Storage

High-performance NFS
Lustre parallel filesystem
S3-compatible object
Dataset caching

Access

SSH + srun
JupyterHub integration
REST API (Radar)
Terraform provider

Performance

7.2X THROUGHPUT

Faster distributed training

NVLink, an InfiniBand fabric, and tuned NCCL deliver 7.2x throughput vs. vanilla cloud GPUs.

80% LOWER COST

vs. hyperscalers

Same H100s, same NCCL, 80% lower bill. No egress fees, no per-second markup.

<10 MIN PROVISIONING

From job submission to run

Auto-scaled workers are warm, joined to the cluster, and ready less than 10 minutes after a job is queued.

99.9% UPTIME SLA

HA control plane

Redundant slurmctld, MUNGE failover, and 24/7 SRE on call for cluster-wide incidents.

Built for research teams

Multi-tenant Cluster

Project-level accounts, fair-share quotas, and burst allowances — perfect for orgs with multiple research groups sharing a single cluster.

Secure & Isolated

Single-tenant control plane, encrypted MUNGE, RBAC, and audit logging — for teams with strict data residency or compliance requirements.

Pair with the rest of the stack

Managed Slurm is the orchestration layer for distributed training. Combine it with the rest of Tscale’s services to ship models from research to production without leaving the platform.

/ MANAGED SLURM

Run your biggest jobs without the ops

Submit Your First Job

Pair with the rest of the stack

INFERENCE

FINE TUNING

KUBERNETES
