Altair Grid Engine™ – Building a Modern HPC Scheduler

Anyone working in high-performance computing (HPC) has likely come across Altair Grid Engine at some point in their career. Altair Grid Engine has been around in various forms since 1993. The scheduler caught fire with Sun Microsystems’ acquisition of Gridware in the summer of 2000 and its subsequent decision to open-source the software.

Then called Sun Grid Engine (SGE), the platform grew to an estimated 8,000 installations by 2008, overtaking more established competitors and becoming one of the industry’s pre-eminent workload managers.

Enterprise-Grade HPC

Starting in the early 2000s, HPC market requirements began to shift. High-performance workloads expanded beyond their scientific roots into a variety of enterprise applications. In industries such as life sciences, manufacturing, and semiconductor design, customers had tens of millions of dollars invested in hardware, software, and personnel. While open-source software models had historically served customers well, HPC was becoming increasingly critical to operations. Volunteer developers and best-effort community support were no longer enough for enterprise clients. Customers realized that open-source efforts frequently shifted the burden of development, integration, testing, and support onto them for little or no financial return. They demanded dependable code stewardship, commercial-grade support, and new features to keep pace with rapidly evolving enterprise requirements.

Univa Grid Engine is born

When Oracle acquired Sun Microsystems in 2010, it inherited a widely popular but unprofitable software asset in a market where Oracle had little experience. Oracle decided to take Grid Engine back to a closed-source model, and software forks inevitably followed. Out of the resulting confusion, Univa Grid Engine was born.

Coming from a heritage in HPC, Univa’s founders recognized the value in Grid Engine. Univa moved quickly to provide commercial support to the market vacated by Oracle and invest in its own version of Grid Engine, hiring the core Germany-based Grid Engine development team in 2011. Two years later, after several milestone releases, Univa acquired the remaining Grid Engine assets from Oracle. These moves cemented Univa’s position as the sole commercial provider of Grid Engine software and support.

Rebuilt from the ground up

Perhaps the most significant work done on Univa Grid Engine was under the hood, including implementing a new multi-threaded central controller (qmaster), adding new scheduling and sharing policies, and addressing hundreds of critical bugs and requests for enhancement (RFEs) that had languished in the open-source project. While not glamorous, and often invisible to users, the hard work of shoring up and re-architecting Grid Engine’s foundation was critical to building a modern scheduler.

As the quality, stability, and scalability of Univa Grid Engine rapidly improved, commercial customers voted with their wallets and Univa grew its client base. In addition to architectural improvements, Univa also moved to implement significant new functionality. New features included job classes, Windows support, enhancements to improve graphics processing unit (GPU) utilization, and advanced container support. With these features, new sharing policies, and updated Message Passing Interface (MPI) integrations, Univa Grid Engine quickly achieved parity with, and in many cases exceeded, the feature sets of commercial competitors.

Not your Father’s Grid Engine

When looking at Univa Grid Engine from afar, it is tempting to compare it to the last open-source Grid Engine, now almost ten years old. Univa Grid Engine maintains full backward compatibility after all, but Univa’s engineering team bristles at such comparisons.

I believe comparing open-source Grid Engine to modern Univa Grid Engine is like comparing a 20-year-old sedan to a modern, turbo-charged, high-performance luxury car. Both will get you from A to B, but only one does so with style, comfort, and modern safety, and there is no question which solution will perform better.

There have been hundreds of major improvements to Univa Grid Engine over the years, with significant subsystems being entirely re-written or developed from scratch. Key areas of improvement include:

  • Scalability and throughput
  • Advanced container support (Docker and Singularity)
  • New resource scheduling and sharing policies
  • Reliability and diagnosability enhancements
  • Core-binding, affinity scheduling, and NUMA support
  • Advanced GPU scheduling
  • Modern RESTful APIs

As the use of cloud has increased in HPC, Univa has invested heavily in optimizing Univa Grid Engine for the cloud. New features include cloud-friendly management features, seamless scalability, and excellent reliability and performance for cloud-scale workloads across major public clouds.

Univa Navops Launch, a companion product to Univa Grid Engine, helps enterprises migrate compute-intensive HPC workloads to the cloud. Navops Launch is application, resource, and budget-aware, providing real-time insights into workloads and spending with complete visibility to HPC cloud resources.

The proof is in the benchmarks

In 2015, after four years of improving the Univa Grid Engine scheduler, Univa undertook a benchmark comparing the latest open-source Grid Engine release (6.2u5) to Univa Grid Engine 8.5.0. In a series of published tests, Univa Grid Engine demonstrated performance gains ranging between 2x and 9.5x for complex workload scheduling requirements. HPC is all about performance, and these throughput advantages, coupled with scalability improvements, translate directly into improved productivity, better utilization of assets, and lower operating costs both on-premises and in the cloud.

In 2019, Univa partnered with Western Digital and AWS to run production-scale multiphysics simulations in one of the largest commercial cloud deployments to date. A Univa Grid Engine cluster was deployed across six AWS availability zones, comprising more than 40,000 cloud fleet instances and over 1,000,000 vCPUs. A simulation that previously ran for 20 days was compressed to just 8 hours, including the time required to deploy and scale the cluster: a staggering 60x improvement. This level of scalability and resource use efficiency is far beyond what is possible with an open-source scheduler.
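
For readers who want to check the 60x figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration of how the number follows from the 20-day and 8-hour run times quoted above; the function and variable names are our own and are not part of any Grid Engine tooling.

```python
HOURS_PER_DAY = 24


def speedup(before_hours: float, after_hours: float) -> float:
    """Return how many times faster the new run is than the old one."""
    return before_hours / after_hours


# Run times quoted in the text: 20 days before, 8 hours after.
before = 20 * HOURS_PER_DAY  # 480 hours
after = 8

print(f"{speedup(before, after):.0f}x")  # prints "60x"
```

Note that the 8-hour figure includes cluster deployment and scale-up time, so the speedup of the simulation itself would be somewhat higher.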

Looking toward the future

In September 2020, Altair acquired Univa, launching Altair Grid Engine onto the next stage of its evolution. With advanced support for GPU-aware scheduling, containers, and cloud computing, Altair Grid Engine is a key pillar in Altair’s HPC portfolio for modern HPC and cloud workloads. Altair will continue to invest in Univa’s technology to support existing customers while integrating Altair Grid Engine with Altair’s existing HPC and data analytics solutions. HPC is a critical element of digital transformation, playing a vital role in all areas of computational science and data analytics. These moves will expand the market opportunity for Altair Grid Engine and solidify Altair’s leadership in workload management and cloud enablement for HPC.