EPFL Scales AI Infrastructure with Run:ai

Client

EPFL

Industry

Higher Education & Research

AI Tech Solution

GPU Virtualization and AI Resource Orchestration

Solution Provider

Run:ai

Challenge

EPFL, a leading research institution, manages a large-scale GPU infrastructure supporting hundreds of AI researchers and experiments. However, the IT team faced several challenges in scaling and optimizing GPU resource allocation. With over 350 NVIDIA GPUs, EPFL needed to:

  • Improve GPU utilization by reducing idle GPUs during peak research periods.
  • Enable self-service access for AI researchers without IT intervention.
  • Eliminate manual coordination of GPU resources, which relied on emails and spreadsheets.

The university initially attempted to build a private GPU cloud on Kubernetes, but could not dynamically allocate GPUs across different AI workloads, leading to inefficiencies.

Solution

EPFL partnered with Run:ai to implement an AI workload orchestration and GPU virtualization platform, transforming its GPU infrastructure into a scalable, self-service AI computing environment. Using Run:ai's Kubernetes-based GPU scheduling, EPFL was able to:

  • Pool over 350 GPUs into a single Kubernetes cluster, allowing for dynamic GPU allocation.
  • Implement fractional GPU usage, enabling multiple AI jobs to share GPU resources efficiently.
  • Create a GPU-as-a-Service model, where researchers could request GPU resources through a self-service interface.

Run:ai also introduced real-time GPU performance monitoring, allowing the IT team to track usage and dynamically adjust quotas, ensuring fair and efficient access for all AI researchers.
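To make the fractional-GPU model concrete, here is a minimal sketch of how a researcher's job might request half a GPU on a Run:ai-scheduled Kubernetes cluster. It assumes Run:ai's `gpu-fraction` pod annotation and the `runai-scheduler` scheduler name; the exact keys and the `project` label are illustrative and may differ by Run:ai version, so treat this as a sketch rather than the platform's definitive API.

```python
def fractional_gpu_pod(name: str, image: str, fraction: float, project: str) -> dict:
    """Build a Kubernetes Pod manifest that asks the Run:ai scheduler
    for a fraction of a single GPU instead of a whole device."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Hypothetical project label used for quota/fair-share grouping.
            "labels": {"project": project},
            # Run:ai reads the requested GPU fraction from this annotation.
            "annotations": {"gpu-fraction": str(fraction)},
        },
        "spec": {
            # Route the pod to Run:ai's scheduler instead of the default one.
            "schedulerName": "runai-scheduler",
            "containers": [{"name": name, "image": image}],
        },
    }

# Example: a training job that shares a GPU 50/50 with another workload.
pod = fractional_gpu_pod("train-job", "pytorch/pytorch:latest", 0.5, "vision-lab")
```

With manifests like this, two half-GPU jobs can be packed onto one physical device, which is how idle capacity gets reclaimed without researchers coordinating by email.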

Results

By deploying Run:ai's AI workload orchestration platform, EPFL achieved major improvements in AI research efficiency and GPU utilization:

  • Reduced idle GPUs by more than 50%, maximizing GPU efficiency.
  • Enabled hundreds of researchers to gain self-service access to GPU resources.
  • Streamlined AI workload scheduling, ensuring optimal GPU allocation across research teams.

With Run:ai's GPU virtualization technology, EPFL eliminated manual coordination efforts, improved compute efficiency, and enabled scalable AI infrastructure for cutting-edge research.

Case Study Highlights

  • AI infrastructure scaling: Run:ai helped EPFL pool 350+ GPUs into a single Kubernetes-based cluster, optimizing resource sharing.
  • Self-service AI compute: AI researchers gained on-demand access to GPUs without manual IT intervention.
  • 50% reduction in idle GPUs: Run:ai's GPU scheduling maximized AI compute utilization, supporting more AI experiments.

Source: Medical Research Case Study 2024 (Run:ai): https://pages.run.ai/hubfs/PDFs/Case%20Studies/Medical%20Research%20Case%20Study%202024.pdf

ITOpsAI Hub

A living library of AI insights, frameworks, and case studies curated to spotlight what’s working, what’s evolving, and how to lead through it.

What You'll Find in AI Blogs & Insights:

  • Practical guides on AIOps, orchestration, and AI implementation
  • Use case breakdowns, frameworks, and tool comparisons
  • Deep dives on how AI impacts IT strategy and operations


What You'll Find in Resources:

  • Curated reports, research, and strategic frameworks from top AI sources
  • Execution guides on governance, infrastructure, and data strategy
  • Trusted insights to help you scale AI with clarity and confidence


What You'll Find in Case Studies:

  • Vetted examples of how companies are using AI to automate and scale
  • Measurable outcomes from infrastructure, IT, and business transformation
  • Strategic insights on execution, orchestration, and enterprise adoption