TGen established its HPBC to give its scientists the powerful computational resources they need.
“One of the ringing endorsements I can give PBS Professional... It just works.”
Putting Saguaro to Work
Today, HPBC typically serves about 65 accounts on Saguaro, most of them within TGen.
Scientific collaborators at ASU and other research institutions are also active users of the
resource. They use BLAST, AMBER, Gaussian, and other commercial and in-house-developed
applications to run thousands of jobs on Saguaro. PBS Professional provides the flexibility to
run large jobs across, say, 128 nodes, while running thousands of small serial jobs on
a single node.
Saguaro, a 16-node development cluster, and three IBM SMP compute servers — two
running SUSE Linux and one running AIX — all run on a high-performance SAN that connects
to Saguaro over three Cisco 4006 switches. Users can watch their jobs interactively using a
1TB IBM GPFS Parallel File System that is accessible to every node on the cluster.
One characteristic of PBS Professional that has helped HPBC cope with the demand for
Saguaro’s resources is its hands-off dependability and ease of maintenance. “One of the
ringing endorsements I can give PBS Professional is that once we got it set up and working, I
have not had to do anything to it at all,” says Lowey. “I went through last year and upgraded
my entire cluster to Red Hat EL3.0. Part of that process was reinstalling PBS Professional. I
followed the instructions in the manual and it took about 20 minutes. It was quite simple.”
Looking Ahead: Upgrades and a Web-Based Interface
One of HPBC’s goals is to move TGen to a web-based job submission model, and an internal
web-based data analysis website is already operating. Another goal is flexible queues tied
together with a switch architecture, which will enable HPBC to run, say, 32-node jobs on a
single switch blade, or 128-node jobs on a single switch, removing the latency of
switch-to-switch communication. These and other advances will involve PBS Professional.
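The switch-aware routing idea described above can be illustrated with a minimal Python sketch. It is not HPBC's configuration or a PBS Professional API; the function name `placement_scope` and the capacities (32 nodes per switch blade, 128 nodes per switch) are assumptions taken from the "say" examples in the text.

```python
# Hypothetical sketch: pick the smallest network scope that holds a whole job,
# so that its traffic never has to cross from one switch to another.

# Assumed capacities, borrowed from the examples in the text.
BLADE_NODES = 32    # nodes reachable through one switch blade (assumption)
SWITCH_NODES = 128  # nodes reachable through one switch (assumption)


def placement_scope(requested_nodes: int) -> str:
    """Return the smallest scope that can contain all requested nodes."""
    if requested_nodes < 1:
        raise ValueError("a job needs at least one node")
    if requested_nodes <= BLADE_NODES:
        return "blade"
    if requested_nodes <= SWITCH_NODES:
        return "switch"
    # Larger jobs cannot avoid switch-to-switch communication.
    return "multi-switch"


if __name__ == "__main__":
    for nodes in (1, 32, 33, 128, 129):
        print(nodes, placement_scope(nodes))
```

A queue for each scope could then accept only the jobs that map to it, which is one way to read "flexible queues tied together with a switch architecture."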
TGen’s successful experience with PBS Professional will soon lead to an upgrade to a newer version,
which HPBC is currently evaluating. Of particular interest are the job array, redundancy, and
failover features of the current release. Saguaro has received heavy utilization since it came
into full production in late 2003, and an increase in failures is inevitable. PBS Professional’s
Automatic Job Recovery will automatically rerun any interrupted job upon detecting that the
nodes have gone down.
“I’m excited that upgrading my production cluster and looking at other uses of
PBS Professional in our HPC environment are among my goals this year,” said Lowey.