EPFL Scales AI Infrastructure with Run:ai
Client
EPFL
Industry
Higher Education & Research
AI Tech Solution
GPU Virtualization and AI Resource Orchestration
Solution Provider
Run:ai
Challenge
EPFL, a leading research institution, manages a large-scale GPU infrastructure supporting hundreds of AI researchers and experiments. However, the IT team faced several challenges in scaling and optimizing GPU resource allocation. With over 350 NVIDIA GPUs, EPFL needed to:
- Improve GPU utilization by reducing idle GPUs during peak research periods.
- Enable self-service access for AI researchers without IT intervention.
- Eliminate manual coordination of GPU resources, which relied on emails and spreadsheets.
The university initially attempted to build a private GPU cloud on Kubernetes, but could not dynamically allocate GPUs across different AI workloads, leading to inefficiencies.
Solution
EPFL partnered with Run:ai to implement an AI workload orchestration and GPU virtualization platform, transforming its GPU infrastructure into a scalable, self-service AI computing environment. Using Run:ai's Kubernetes-based GPU scheduling, EPFL was able to:
- Pool over 350 GPUs into a single Kubernetes cluster, allowing for dynamic GPU allocation.
- Implement fractional GPU usage, enabling multiple AI jobs to share GPU resources efficiently.
- Create a GPU-as-a-Service model, where researchers could request GPU resources through a self-service interface.
Run:ai also introduced real-time GPU performance monitoring, allowing the IT team to track usage and dynamically adjust quotas, ensuring fair and efficient access for all AI researchers.
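To make the pooling, fractional-sharing, and quota ideas concrete, here is a minimal Python sketch of the scheduling logic described above. This is illustrative only: Run:ai's actual scheduler operates inside Kubernetes, and the names here (`GpuPool`, `allocate`, `release`) are hypothetical, not Run:ai's API.

```python
from dataclasses import dataclass, field

@dataclass
class GpuPool:
    """Toy model of a pooled GPU cluster with fractional sharing and per-team quotas."""
    total_gpus: float                         # pooled capacity, e.g. 350 physical GPUs
    quotas: dict                              # team -> max GPUs that team may hold
    used: dict = field(default_factory=dict)  # team -> GPUs currently held

    def allocate(self, team: str, fraction: float) -> bool:
        """Grant `fraction` of a GPU (e.g. 0.5) if both pool and quota allow it."""
        held = self.used.get(team, 0.0)
        if sum(self.used.values()) + fraction > self.total_gpus:
            return False  # pool exhausted
        if held + fraction > self.quotas.get(team, 0.0):
            return False  # team over its quota
        self.used[team] = held + fraction
        return True

    def release(self, team: str, fraction: float) -> None:
        """Return capacity to the pool when a job finishes."""
        self.used[team] = max(0.0, self.used.get(team, 0.0) - fraction)

# Usage: two teams sharing a small pool with fractional requests.
pool = GpuPool(total_gpus=4.0, quotas={"vision": 2.0, "nlp": 2.0})
assert pool.allocate("vision", 0.5)      # fractional job fits
assert pool.allocate("vision", 1.5)      # fills the team's quota
assert not pool.allocate("vision", 0.5)  # rejected: quota exceeded
assert pool.allocate("nlp", 2.0)         # other team still has room
```

The key design point the sketch captures is that idle capacity is visible to every team at once, so a fractional request succeeds whenever the pool and the requester's quota both have headroom, without any manual coordination.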
Results
By deploying Run:ai's AI workload orchestration platform, EPFL achieved major improvements in AI research efficiency and GPU utilization:
- Reduced idle GPUs by more than 50%, maximizing GPU efficiency.
- Enabled hundreds of researchers to gain self-service access to GPU resources.
- Streamlined AI workload scheduling, ensuring optimal GPU allocation across research teams.
With Run:ai's GPU virtualization technology, EPFL eliminated manual coordination efforts, improved compute efficiency, and enabled scalable AI infrastructure for cutting-edge research.
Case Study Highlights
- AI infrastructure scaling: Run:ai helped EPFL pool 350+ GPUs into a single Kubernetes-based cluster, optimizing resource sharing.
- Self-service AI compute: AI researchers gained on-demand access to GPUs without manual IT intervention.
- 50% reduction in idle GPUs: Run:ai's GPU scheduling maximized AI compute utilization, supporting more AI experiments.
Source: Run:ai Medical Research Case Study 2024, https://pages.run.ai/hubfs/PDFs/Case%20Studies/Medical%20Research%20Case%20Study%202024.pdf