Grid Engine in the Age of Cloud

February 11, 2020

Grid Engine users are showing strong interest in cloud computing. In a recent InsideHPC survey sponsored by Altair, 92% of users said that they were “open to or already using cloud,” and 64% described cloud as “having proven value or high potential.”¹ These figures are bolstered by December 2019 research from Hyperion that shows a dramatic 60% increase in cloud spending, from just under $2.5 billion in 2018 to $4 billion in 2019.²

On-premises Grid Engine clusters are here to stay, but HPC cloud spending is projected to nearly double again, to $7.4 billion, by 2023. Against this backdrop, I thought it would be useful to discuss some of the enhancements we’ve made to Grid Engine to ensure that it remains the best choice for enterprises shifting workloads to the cloud.

Agile scheduling helps reduce cloud spending – Performance has been a major focus area as we’ve continued to enhance Grid Engine. Published benchmarks demonstrate that Altair Grid Engine schedules workloads on average twice as fast as open-source Grid Engine and delivers almost 10x the throughput for specific scheduling problems. Faster scheduling won’t benefit every workload, but for some it can be critical, particularly on large clusters running high volumes of short-running jobs such as those common in life sciences, financial services, and engineering simulation. Even small improvements in throughput can yield large savings. For example, a 10% throughput improvement may reduce an organization’s cloud bill from $100K/month to roughly $90K/month, which adds up to more than $100K in savings on an annualized basis.
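The arithmetic behind that example can be sketched in a few lines of Python. This is an illustration only: it assumes a fixed monthly workload and a bill that scales linearly with the time instances are running, and the function name `monthly_cost_after_speedup` is hypothetical.

```python
def monthly_cost_after_speedup(baseline_cost: float, throughput_gain: float) -> float:
    """Cost of the same fixed workload when throughput rises by throughput_gain.

    Assumes cost is proportional to instance run time, so a workload that
    finishes (1 + gain) times faster costs 1 / (1 + gain) as much.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    return baseline_cost / (1.0 + throughput_gain)


baseline = 100_000.0   # $/month, from the example above
gain = 0.10            # 10% throughput improvement

new_cost = monthly_cost_after_speedup(baseline, gain)
monthly_saving = baseline - new_cost
annual_saving = 12 * monthly_saving

print(f"New monthly cost: ${new_cost:,.0f}")
print(f"Monthly saving:   ${monthly_saving:,.0f}")
print(f"Annual saving:    ${annual_saving:,.0f}")
```

Under these assumptions the bill falls to about $90.9K a month, close to the rounded $90K figure, for savings of roughly $109K a year.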

Cloud scaling helps complete workloads faster – Organizations increasingly turn to the cloud for very large workloads, so it’s critical not only to support large clusters but also to scale them quickly, so that cloud instances are usable as soon as they come online. Altair Grid Engine provides a variety of scalability improvements, including fast bulk addition and deletion of execution hosts. In partnership with Western Digital and Amazon Web Services, Altair recently demonstrated this extreme scalability, deploying a million+ vCPU Altair Grid Engine cluster composed of over 40,000 instances, a scale comparable to the world’s largest supercomputers. The cluster grew to over one million vCPUs in one hour and 32 minutes and ran for six hours. In under eight hours in the cloud, it completed a simulation of 2.5 million tasks that had previously required 20 days on local infrastructure. Cloud scaling and throughput go hand-in-hand: it’s only practical to deploy a cluster at this scale if you can keep all the cluster nodes busy. Altair Grid Engine can dispatch up to ~3,000 tasks per second,³ and was able to keep all cores busy 99% of the time. You can read more about this project in the article Mission is Possible: Tips on Building a Million Core Cluster.
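A quick back-of-the-envelope calculation helps put these figures in context. The sketch below uses only the numbers quoted above. It treats the ~3,000 tasks per second peak as if it were sustained, which is an assumption, so the dispatch time is a lower bound and not a measured result.

```python
# Figures quoted in the text above.
TASKS = 2_500_000               # tasks in the simulation
PEAK_DISPATCH_PER_SEC = 3_000   # approximate peak dispatch rate
ON_PREM_DAYS = 20               # previous run time on local infrastructure
CLOUD_HOURS = 8                 # upper bound on the cloud run time

# Minimum time needed just to hand every task to a core,
# assuming the peak dispatch rate could be sustained throughout.
dispatch_floor_min = TASKS / PEAK_DISPATCH_PER_SEC / 60

# Speed-up of the cloud run relative to the on-premises run.
# The cloud run took under eight hours, so this is a lower bound.
speedup = ON_PREM_DAYS * 24 / CLOUD_HOURS

print(f"Dispatch floor:   {dispatch_floor_min:.1f} minutes")
print(f"Speed-up factor:  at least {speedup:.0f}x")
```

At the peak rate, dispatching all 2.5 million tasks would take about 14 minutes, a small fraction of the run, which fits with the scheduler being able to keep the cores busy.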

Nonstop cluster reconfiguration – For cloud-deployed clusters, the meter is always running. It’s not practical to suspend all work every time cluster parameters need to be changed. Altair Grid Engine supports dynamic reconfiguration, avoiding the need to restart the scheduler and leave instances idle when configuration changes are made. Also, as of version 8.6.3, Altair Grid Engine supports bulk operations against execution hosts and projects, allowing users to change settings on multiple hosts dynamically in a single operation.³ Nonstop cluster reconfiguration and bulk changes make it easier to administer cloud-resident clusters and reduce downtime, which translates into lower operating costs in the cloud.

Advanced container support – When deploying complex software environments to the cloud, it’s essential to provision them quickly and reliably. HPC users have many options, including deploying generic cloud instances and using post-provisioning scripts (slow and tedious), loading application functionality into custom machine images such as AMIs (more efficient, but hard to maintain), or starting cluster instances pre-loaded with Docker or Singularity runtimes and pulling application images from a container registry. Containers are currently one of the better ways of packaging complex software environments. While it’s possible to run containerized workloads on open-source Grid Engine, this can get complicated quickly.⁴ Recent Altair Grid Engine releases provide transparent support for containerized workloads, accurately reporting metrics and circumventing well-known security concerns related to the Docker daemon. You can learn more in the article Using Altair Grid Engine with Docker.

Efficient GPU cloud scheduling – Access to state-of-the-art GPU resources is a common reason for Altair Grid Engine users to tap cloud resources. GPU cloud instances can be costly, however. For example, the current on-demand price for a single p3.16xlarge instance (8 NVIDIA Tesla V100 GPUs connected via NVLink) is US$15.91/hour, or over $11K/month.⁵ At this level of spending, using instances efficiently becomes critical, and GPU-aware scheduling features are key to maximizing efficiency and reducing costs for both on-premises and cloud GPU workloads. GPU scheduling enhancements in Altair Grid Engine include CPU-GPU core affinity, topology-aware scheduling, NVIDIA-docker support, and a direct integration with NVIDIA DCGM. Read the article Managing GPU workloads with Altair Grid Engine to learn more about these enhancements.
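To see why utilization matters at this price point, the monthly figure and the effective cost of a busy GPU-hour can be worked out as follows. This is a rough sketch: the hourly rate is the one quoted above, a 30-day month is assumed, and the utilization levels and the function name `cost_per_useful_gpu_hour` are illustrative choices, not measurements.

```python
HOURLY_RATE = 15.91        # US$/hour for one instance, as quoted above
GPUS_PER_INSTANCE = 8      # GPUs in a p3.16xlarge
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month

monthly_cost = HOURLY_RATE * HOURS_PER_MONTH


def cost_per_useful_gpu_hour(hourly_rate: float, gpus: int, utilization: float) -> float:
    """Effective price of one GPU-hour of real work.

    utilization is the fraction of paid GPU time spent running jobs (0 < u <= 1).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_rate / (gpus * utilization)


print(f"Monthly cost per instance: ${monthly_cost:,.2f}")
for utilization in (1.0, 0.75, 0.5):  # illustrative utilization levels
    cost = cost_per_useful_gpu_hour(HOURLY_RATE, GPUS_PER_INSTANCE, utilization)
    print(f"Utilization {utilization:.0%}: ${cost:.2f} per useful GPU-hour")
```

Halving utilization doubles the effective price of every useful GPU-hour, which is the cost that GPU-aware scheduling aims to avoid.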

REST API for simplified cloud automation – When cloud bursting or deploying cloud-resident clusters, HPC users rely on automation to perform repetitive tasks efficiently. In addition to automating cloud resource deployments (via Navops Launch, custom scripts, or cloud-specific tools such as AWS CloudFormation), users also need programmatic ways to configure workload management settings such as definitions for hostgroups, queues, projects, parallel environments, and more. Altair Grid Engine includes a comprehensive REST API that enables users to manage both cluster configuration and workloads. Users can script REST operations using cURL or take advantage of language bindings for Java, Node.js, Meteor, or Python. Python users can alternatively automate the configuration of cloud clusters using Altair Grid Engine PyCL (Python Configuration Library), a developer-friendly front-end that wraps the Altair Grid Engine qconf command to simplify cluster configuration.⁶
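The general idea of wrapping qconf from Python can be illustrated with a short sketch. This is not the PyCL API and not the REST API. It is a minimal example, assuming `qconf` is on the PATH of a configured Grid Engine host, and the helper names `qconf_list_command` and `list_objects` are hypothetical. The flags used are the standard qconf "show list" options.

```python
import shutil
import subprocess
from typing import List

# Standard qconf options that print a list of configured objects.
LIST_FLAGS = {
    "queues": "-sql",
    "hostgroups": "-shgrpl",
    "exec_hosts": "-sel",
    "projects": "-sprjl",
    "parallel_environments": "-spl",
}


def qconf_list_command(kind: str) -> List[str]:
    """Build the qconf argument list for a given object kind (hypothetical helper)."""
    if kind not in LIST_FLAGS:
        raise ValueError(f"unknown object kind: {kind!r}")
    return ["qconf", LIST_FLAGS[kind]]


def list_objects(kind: str) -> List[str]:
    """Run qconf and return one entry per non-empty output line."""
    if shutil.which("qconf") is None:
        raise RuntimeError("qconf not found on PATH; run this on a Grid Engine host")
    result = subprocess.run(
        qconf_list_command(kind), capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Show the commands that would be run, without requiring a cluster.
    for kind in LIST_FLAGS:
        print(f"{kind}: {' '.join(qconf_list_command(kind))}")
```

Keeping command construction separate from execution makes the wrapper easy to test without a cluster, and on a cluster host a call such as `list_objects("queues")` returns the configured queue names.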

Multi-cloud deployments and cloud bursting – Cloud bursting is an increasingly common use case in HPC. Despite efforts to simplify management by standardizing on a single cloud provider, multi-cloud deployments are all but inevitable. Mergers, acquisitions, and the need to collaborate with entities that store datasets in other clouds are pushing organizations toward multi-cloud environments. According to Gartner, 80% of organizations already deal with more than one cloud provider.⁷ Navops Launch provides an easy way for Altair Grid Engine users to automate cluster deployments and burst to multiple clouds. Hybrid, multi-cloud bursting is useful for a wide variety of workloads.

Cloud spend management – As cloud spending continues to grow, managing that spending is an increasing concern. The same sponsored InsideHPC survey referenced above found that while 84% of HPC organizations see value in being able to associate spending with departments, projects, and cost-centers automatically, 76% of respondents have no automated solution. Gartner estimates that 80% of cloud users will overshoot IaaS budgets through 2020 due to a lack of spending controls. This presents a clear challenge for Altair Grid Engine users, especially as multi-cloud deployments become the norm. Navops Launch augments Altair Grid Engine clusters, providing real-time visibility into cloud spending by cost-center, department, and project across multiple clouds. Additionally, Altair Grid Engine users can use built-in Navops Launch automations to take proactive steps so that cloud spending is managed according to configurable policies.

References

1. Cloud Adoption for HPC: Trends and Opportunities - https://insidehpc.com/2019/11/cloud-adoption-for-hpc-trends-and-opportunities/
2. HPC in the cloud rolls through an inflection point - https://www.nextplatform.com/2019/12/13/hpc-in-the-cloud-rolls-through-an-inflection-point/
3. Altair Grid Engine peak task dispatch rate - https://aws.amazon.com/blogs/aws/western-digital-hdd-simulation-at-cloud-scale-2-5-million-hpc-tasks-40k-ec2-spot-instances/
4. See running containers under SGE - https://arc.liv.ac.uk/SGE/howto/sge-container.html
5. AWS EC2 GPU P3 instance pricing - https://aws.amazon.com/ec2/instance-types/p3/
6. UGE PyCL project on GitHub - https://github.com/gridengine/config-api
7. Gartner, May 24th, 2019, Are you Ready for Multicloud and Intercloud Data Management? - https://www.gartner.com/en/documents/3923929/are-you-ready-for-multicloud-and-intercloud-data-managem
