Introduction
HadiPC is a high‑performance computing (HPC) platform developed at the Indian Institute of Technology Hyderabad. The system is designed to support research in artificial intelligence (AI), machine learning (ML), and scientific computing that requires large computational resources. HadiPC integrates state‑of‑the‑art processing units, high‑speed interconnects, and advanced software stacks to provide a flexible and scalable environment for researchers and engineers. The platform has gained recognition for its contributions to large‑scale training of language models, computational fluid dynamics simulations, and bioinformatics analyses.
History and Development
Origins
The conception of HadiPC traces back to 2016 when a group of faculty members and graduate students at IIT Hyderabad identified the need for a dedicated computing resource that could handle both data‑centric workloads and traditional HPC simulations. Dr. R. Hadi, a professor in the Department of Computer Science, spearheaded the initiative, securing funding through the Indian government’s “National Supercomputing Mission” and private industry partners.
Construction Timeline
The initial prototype was assembled in 2017, comprising a modest cluster of 8 GPU nodes. Following successful demonstrations of distributed training on the prototype, additional funding was obtained, and the full HadiPC system was deployed in late 2019. An expansion phase in 2021 added 16 more GPU nodes and upgraded the networking fabric to 100 Gb/s, substantially increasing the system's aggregate memory capacity and processing throughput. The latest update, announced in 2024, introduced a hybrid CPU–GPU architecture with 32 GPU nodes featuring NVIDIA A100 Tensor Core GPUs and 8 high‑core‑count CPU nodes based on AMD EPYC processors.
Strategic Goals
From its inception, HadiPC aimed to (1) provide an open, collaborative platform for AI and scientific research, (2) serve as a testbed for emerging HPC technologies, and (3) foster interdisciplinary collaborations within IIT Hyderabad and with external partners. These objectives guided the design choices, influencing the selection of modular hardware, container‑based software deployment, and an extensible job‑scheduling system.
Architecture
Hardware Configuration
The current HadiPC configuration comprises 40 compute nodes distributed across two racks: 32 GPU nodes and 8 CPU‑only nodes. Each GPU node hosts two NVIDIA A100 GPUs, 256 GB of DDR4 RAM, and a 2 TB NVMe SSD for local storage. All nodes are built on AMD EPYC 7003‑series processors, offering 64 cores per node at up to 3.5 GHz. The dual‑GPU setup allows for both data‑parallel and model‑parallel training schemes, while the high memory bandwidth supports memory‑intensive scientific workloads.
Interconnect and Networking
Nodes are connected via a Mellanox ConnectX‑6 HDR100 InfiniBand fabric (100 Gb/s), providing a latency of 3.1 µs and a sustained throughput of 95 Gbps between any pair of nodes. The network employs InfiniBand RDMA to minimize overhead for MPI communication. A dedicated 10 GbE management network carries control traffic and system monitoring.
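A simple latency–bandwidth (alpha–beta) model helps put the fabric figures quoted above in perspective: small messages are dominated by the 3.1 µs latency, large transfers by the 95 Gbps sustained bandwidth. This is a rough sketch, not a measured model of HadiPC's network.

```python
# Alpha-beta model of point-to-point transfer time, using the
# fabric figures quoted above (3.1 us latency, 95 Gbps sustained).

LATENCY_S = 3.1e-6       # per-message latency in seconds
BANDWIDTH_BPS = 95e9     # sustained throughput in bits per second

def transfer_time(message_bytes: int) -> float:
    """Estimated seconds to move one message between two nodes."""
    return LATENCY_S + (message_bytes * 8) / BANDWIDTH_BPS

small = transfer_time(1024)           # ~3.2 us: latency-dominated
large = transfer_time(256 * 1024**2)  # ~22.6 ms: bandwidth-dominated
```

The crossover between the two regimes is why MPI collectives batch small gradients into larger messages before synchronizing.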
Storage Systems
HadiPC’s global storage is managed by a Lustre file system hosted on 12 dedicated storage servers. The system offers 10 PB of raw capacity, with 5 PB of usable space after redundancy overhead. The storage tiering strategy employs SSDs for scratch space and HDDs for archival datasets. The Lustre setup is optimized for parallel I/O patterns typical of large‑scale simulations.
Power and Cooling
Power delivery is handled through an 800 kW redundant UPS system, with a peak power consumption of 650 kW during full utilization. Cooling is achieved via a chilled‑water rack‑level air‑conditioning unit, maintaining a temperature of 18–22 °C within the rack enclosure. Energy efficiency metrics average 0.65 W per GFLOP, positioning HadiPC among the most power‑efficient systems in its class.
Software Stack
Operating System
The underlying operating system is a custom build of CentOS 8, tailored for HPC workloads. Kernel parameters have been tuned for low‑latency I/O and high concurrency, and SELinux policies are configured to allow secure container execution.
Job Scheduler
HadiPC employs SLURM as its job‑scheduling system. The SLURM configuration includes GPU‑aware scheduling policies, advanced accounting, and fair‑share allocation. A web‑based interface, integrated with SLURM, provides users with real‑time job status, resource usage statistics, and queue management tools.
Libraries and Frameworks
Core HPC libraries include OpenMPI 4.1.5 for distributed communication, CUDA 11.7 for GPU programming, cuDNN 8.5 for deep learning primitives, and BLAS/MKL for linear algebra. High‑level frameworks such as PyTorch 1.12, TensorFlow 2.8, and JAX are pre‑installed, along with distributed training libraries like Horovod and DeepSpeed.
Containerization
To streamline reproducibility, HadiPC supports Docker and Singularity containers. The system provides a curated registry of base images, including minimal Ubuntu 20.04, Python 3.9, and CUDA toolkits. Users can build custom images that encapsulate environment dependencies, ensuring consistent runtime behavior across nodes.
Monitoring and Profiling
Performance monitoring is handled by a combination of tools: Ganglia for cluster‑wide metrics, NVIDIA Nsight Systems for timeline profiling, and Nsight Compute for kernel‑level analysis (the older nvprof does not support the A100's Ampere architecture). SLURM logs are parsed to generate usage reports, aiding both system administrators and users in identifying bottlenecks.
Key Concepts
High‑Performance Computing
HPC refers to the use of supercomputers and parallel processing techniques to solve complex computational problems. HadiPC exemplifies HPC by combining multiple GPU accelerators, high‑speed interconnects, and optimized software to achieve orders of magnitude acceleration over single‑core CPU implementations.
GPU Acceleration
GPU acceleration leverages the parallel architecture of graphics processors to perform large numbers of floating‑point operations simultaneously. The A100 GPUs in HadiPC support mixed‑precision (FP16/FP32) operations, tensor cores, and large memory capacities, making them suitable for both training and inference tasks.
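The motivation for mixed precision can be seen in miniature using Python's standard library, which can round floats through IEEE 754 half precision. Small gradients underflow to zero in FP16, which is why training frameworks apply loss scaling before casting; the scale factor of 1024 below is a hypothetical choice for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8          # a gradient below fp16's smallest subnormal (~6e-8)
SCALE = 1024.0       # hypothetical loss-scaling factor

underflowed = to_fp16(grad)                # 0.0: the value is lost
recovered = to_fp16(grad * SCALE) / SCALE  # survives, up to rounding error
```

In practice the scaled gradients are kept in FP16 through the backward pass and unscaled in FP32 before the optimizer step, which is the scheme tensor cores accelerate.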
Distributed Training
Distributed training distributes the computational workload across multiple nodes. Two primary strategies are data parallelism, where each node processes a subset of data and synchronizes gradients, and model parallelism, where different parts of the model reside on different GPUs. HadiPC’s software stack provides libraries for both paradigms.
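The data-parallel strategy described above can be sketched in a few lines of plain Python: each "node" computes a gradient on its own shard, and an all-reduce averages the gradients so every replica applies the identical update. The toy model (fitting y = w·x by squared error) stands in for a real network.

```python
# Data-parallel gradient averaging sketch. Gradient of (w*x - y)^2
# with respect to w is 2*x*(w*x - y).

def local_gradient(w, shard):
    """Mean gradient over one node's data shard."""
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an MPI/NCCL all-reduce followed by division."""
    return sum(values) / len(values)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[0:2], data[2:4]]        # one shard per node
w = 0.0
grads = [local_gradient(w, s) for s in shards]
w -= 0.05 * all_reduce_mean(grads)     # synchronized update on every node
```

Model parallelism, by contrast, would split `w` itself across devices; the communication then happens on activations rather than gradients.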
Job Scheduling and Resource Allocation
SLURM orchestrates resource allocation, ensuring efficient utilization of CPUs, GPUs, and memory. Policies can enforce priority queues, fair‑share allocation, and resource limits. Users can specify constraints such as GPU type, memory requirement, or wall‑time limits when submitting jobs.
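A submission with the constraints described above typically takes the form of an sbatch script. The helper below generates one; the partition name and the script layout are illustrative assumptions, not actual HadiPC queue names.

```python
# Hedged sketch: build an sbatch script carrying the kinds of
# constraints described above (GPU count, memory, wall time).

def make_sbatch(job_name, gpus, mem_gb, hours, command,
                partition="gpu"):            # hypothetical partition name
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",        # GPU-aware allocation
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={hours:02d}:00:00", # wall-time limit HH:MM:SS
        command,
    ]
    return "\n".join(lines)

script = make_sbatch("train-llm", gpus=2, mem_gb=128, hours=24,
                     command="srun python train.py")
```

The generated text would be written to a file and submitted with `sbatch`; fair-share and priority policies then decide when it runs.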
Energy Efficiency
Energy efficiency is measured in watts per floating‑point operation per second (W/GFLOP). HadiPC’s design emphasizes low‑power GPUs, efficient cooling, and consolidated power delivery to reduce overall energy consumption, an essential consideration for large‑scale research institutions.
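The metric reduces to two small formulas, shown below with illustrative numbers that are assumptions for the example, not HadiPC measurements.

```python
# W/GFLOP and energy-to-solution, with illustrative (not measured) inputs.

def watts_per_gflop(power_watts: float, gflops: float) -> float:
    """Power draw divided by sustained throughput in GFLOP/s."""
    return power_watts / gflops

def energy_to_solution_kwh(power_kw: float, runtime_hours: float) -> float:
    """Total electrical energy consumed by a run."""
    return power_kw * runtime_hours

eff = watts_per_gflop(500_000, 1_000_000)     # hypothetical: 0.5 W/GFLOP
energy = energy_to_solution_kwh(650, 2.0)     # a 2-hour run at 650 kW
```

Lower is better on both counts, which is why dynamic power scaling and efficient cooling feed directly into these figures.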
Applications
Artificial Intelligence and Machine Learning
Large Language Models: HadiPC has been used to train transformer models such as GPT‑2 variants on 1 TB of text corpora, achieving state‑of‑the‑art perplexity scores in less than 72 hours.
Computer Vision: Convolutional neural networks for medical imaging segmentation were accelerated by 15× compared to CPU baselines, enabling real‑time inference on MRI datasets.
Reinforcement Learning: Multi‑agent training frameworks ran on HadiPC to simulate complex game environments, improving policy convergence times by 3×.
Scientific Simulation
Computational Fluid Dynamics (CFD): The Navier–Stokes equations for turbulent flow were solved using OpenFOAM, with wall‑clock times reduced from weeks to days.
Astrophysical Modelling: N‑body simulations of galaxy formation employed the GADGET‑4 code, achieving a 20× speedup in particle interaction calculations.
Climate Modelling: The Community Earth System Model (CESM) was run on HadiPC to produce high‑resolution projections for regional climate impacts.
Bioinformatics and Computational Biology
Protein Folding: AlphaFold‑2 inference on 10,000 protein sequences was completed in under 4 hours, facilitating large‑scale structural genomics projects.
Genomic Sequencing: Variant calling pipelines utilizing GATK and DeepVariant were accelerated by 10×, enabling near‑real‑time analysis of whole‑genome data.
Metagenomics: Metabolic pathway reconstruction from metagenomic samples used HUMAnN2, achieving a 25× reduction in processing time.
Other Domains
Financial Modelling: Monte‑Carlo simulations for risk assessment were executed on HadiPC, cutting computation time from days to hours.
Materials Science: Density Functional Theory (DFT) calculations using VASP leveraged GPU acceleration for faster convergence.
Digital Humanities: Large‑scale text mining across historical archives was accelerated, enabling new insights into linguistic evolution.
Performance and Benchmarks
HPC Benchmarks
HadiPC’s performance has been evaluated using the LINPACK benchmark, yielding a sustained 450 TFLOP/s of double‑precision performance, which places it among the more capable academic clusters in India, though below the entry threshold of the 2024 TOP500 list. The system also achieved 3,200 TFLOP/s for single‑precision workloads, largely attributable to the tensor‑core capabilities of the A100 GPUs.
GPU Benchmarks
In mixed‑precision training of ResNet‑50 on ImageNet, HadiPC achieved 1,200 images per second per node, outperforming a reference NVIDIA DGX A100 system by 20%. For transformer models, the system achieved a throughput of 12,000 tokens per second per node in FP16 mode, demonstrating the efficiency of the GPU interconnect for gradient synchronization.
Energy Efficiency
HadiPC recorded a power usage effectiveness (PUE) of 1.45, a figure comparable to commercial data centers. The system’s energy‑to‑solution metric for the LINPACK benchmark was 0.75 kWh, indicating an efficient conversion of electrical energy to computational work.
Scalability Studies
Strong scaling experiments for the GADGET‑4 simulation show a near‑linear speedup up to 32 nodes, with a 10% efficiency loss at 40 nodes due to increased network traffic. Weak scaling tests for the AlphaFold‑2 inference pipeline maintained a 95% efficiency across 40 nodes, reflecting the system’s balanced compute and I/O capabilities.
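The two quantities reported in these experiments are standard: strong-scaling speedup S(n) = T(1)/T(n) and parallel efficiency E(n) = S(n)/n. The wall-clock times below are hypothetical values chosen only to illustrate the calculation.

```python
# Strong-scaling speedup and parallel efficiency.

def speedup(t1: float, tn: float) -> float:
    """Single-node time divided by n-node time."""
    return t1 / tn

def efficiency(t1: float, tn: float, nodes: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t1, tn) / nodes

# Hypothetical wall-clock times (hours) for one simulation:
t_single = 320.0
t_40 = 8.9                               # illustrative 40-node time
e = efficiency(t_single, t_40, 40)       # ~0.90, i.e. ~10% loss
```

An efficiency drop at higher node counts, as observed at 40 nodes, usually signals that communication time is growing faster than per-node compute shrinks.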
Community and Outreach
User Base
HadiPC serves a diverse user community, including faculty, postdoctoral researchers, graduate students, and industry partners. Over 150 active projects have been executed on the system since its launch. The user base is distributed across disciplines such as computer science, physics, biology, and engineering.
Training and Support
To lower the barrier to entry, the HadiPC administration offers a weekly training series covering topics like MPI programming, GPU optimization, and container best practices. A dedicated helpdesk provides ticket‑based support, with average resolution times under 12 hours.
Open Source Contributions
Several components of HadiPC’s software stack have been contributed back to the open‑source community. Notably, the SLURM GPU scheduling module, custom Lustre tuning scripts, and a collection of containerized deep‑learning workloads are publicly available on the institution’s repository. These contributions have fostered collaboration with other national HPC centers.
Collaborations
HadiPC partners with the National Supercomputing Mission, the Centre for Artificial Intelligence, and international institutions such as the National University of Singapore. Joint projects include a multi‑institutional climate modeling effort and a collaborative AI research initiative focused on low‑resource language models.
Future Directions
Hardware Expansion
Planned upgrades include adding 8 additional GPU nodes equipped with NVIDIA H100 GPUs, which promise a 3× increase in tensor‑core performance. A dedicated high‑core CPU cluster featuring Intel Xeon Platinum 9200 series processors is also slated for 2026, targeting workloads that are less GPU‑centric.
Cloud Integration
HadiPC is moving toward hybrid cloud capabilities by integrating with a commercial cloud provider’s GPU instances. This approach allows burst workloads to spill over to the cloud during peak demand, ensuring continuous availability.
Energy Sustainability
Research into renewable energy sourcing for HPC centers is underway. Proposals include a partnership with a solar farm to supply up to 40% of HadiPC’s power needs by 2028. Additionally, dynamic power scaling techniques are being tested to match power consumption with real‑time workload demands.
Software Innovation
Development of a unified workflow orchestration platform is underway, drawing inspiration from Kubernetes for HPC environments. The platform aims to automate data movement, job scheduling, and post‑processing steps, reducing manual intervention and improving reproducibility.
Scientific Initiatives
Long‑term goals involve hosting a national AI super‑model training program and establishing a high‑resolution global climate simulation grid that will serve policy makers and stakeholders worldwide.
Conclusion
HadiPC exemplifies a modern, multi‑disciplinary HPC environment that combines cutting‑edge hardware with an optimized software ecosystem to deliver significant acceleration across AI, scientific, and engineering domains. Through robust community support, open‑source engagement, and strategic planning for future expansion, HadiPC is poised to remain a critical research infrastructure in India and beyond.