March 28, 2011

Zenoss Blog: No Node Left Behind: Super Computer Monitoring – Zenoss interview with LANL – Open Source Network Monitoring and Systems Management


Monitoring the largest supercomputing systems in the world is no easy business. Los Alamos National Laboratory (LANL) is one of the organizations that does it, and it continues to use Zenoss to monitor its Top 10 supercomputer, 'Roadrunner'. LANL has developed and released an open source High-Performance Computing (HPC) version of Zenoss, which it uses in its own environment. See the HPC Zenoss code and community here: HPC

During interactions with the Zenoss community, I had the opportunity to interview Cindy Martin, who has been the liaison between LANL's HPC Zenoss project and the Zenoss community.

Thanks go out to Cindy for taking the time to allow this interview!

Community Support Engineer

Nick Yeates: Cindy, can you introduce yourself, and share with us your title and role?

Cindy Martin: I am Cindy Martin and I work with LANL in the High Performance Computing Division. I was tasked to implement a monitoring tool for all of our clusters here at LANL.

Nick Yeates: Please give us a brief overview of what LANL does, computing-wise, research-wise, etc.

Cindy Martin: In the realm of high performance computing, our clusters are used in almost every scientific area you can imagine. A few examples of the things we are working on include HIV research, probing the dark matter of the genome, gene regulation, ocean and atmospheric flows, Earth's climate system through multi-resolution modeling, and of course weapons research and decommissioning. LANL has projects in all the core sciences.

Nick Yeates: Wow, that runs the gamut! I can barely imagine the computing power and scale behind some of those scientific areas. What kind of computing power and clustering technologies does LANL employ?

Cindy Martin: We have several cluster configurations. From a processor perspective, we have systems with tens of thousands of nodes, as well as smaller systems. The clusters all have very fast internal networks. We use InfiniBand, as well as management networks that use Ethernet. Currently we have twelve clusters in production and are in the process of building another five.

Nick Yeates: How have you used Zenoss to assist in managing this vast IT outlay?

Cindy Martin: When I started this project, there was really no cohesive monitoring tool. We had scripts that would do health checks, but not much beyond that. Troubleshooting problems was difficult and arduous. So the goal was to find a tool that was scalable, or that we could make scalable. We needed a central repository for network and system data to lessen the time required for troubleshooting. We did a lot of research into our requirements and, in talking with other laboratories, found that no one had solved this problem well. So we looked to open source tools, and Zenoss had the most complete and easily extendable infrastructure. To date, we use Zenoss on all of our new systems. It tracks our issues and assets and, based on certain events, will offline nodes within the cluster and then notify the appropriate staff.

Nick Yeates: I want to take a moment to point anyone interested to your freely available High-Performance Computing (HPC) version of Zenoss, a customized version that LANL has developed from our core 2.4 code. What major features did your team add or tweak?

Cindy Martin: Our first challenge was scalability. These systems can be very large and chatty, so we addressed that by running multiple instances of Zenoss that can talk to one another.

We had to add a new interface for our operations staff for monitoring our specific components, issue tracking, asset tracking, association of events, a reporting interface for correlation of job and system data, auto calculation of downtime, and the ability to keep a historical model of the system and its components.

Nick Yeates: Let me ask more about those features. What do you mean by historical model, and how was that implemented? Is it an auditing platform?

Cindy Martin: Well, it didn't start out that way. We needed to be able to track components as they moved through the clusters. Oftentimes we would remove a node, think it was fixed, put it back into the cluster in a different location, and have a similar problem. Storing this data in the MySQL database allows us to correlate issues on a particular node with where it has been over time in the cluster. As it turns out, our security team was very interested in this feature. They were tracking some of this hardware by hand, and this feature allowed them to track it in a more automated fashion. The implementation involved using our clustering software to feed our Zenoss MySQL database with any changes in a component's location. We then used our reporting interface to correlate the issue and location data.
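
The location-history idea Cindy describes can be sketched roughly as follows. This is not LANL's schema or code: the table names, column names, and helper functions are hypothetical, and an in-memory SQLite database stands in for the MySQL database so that the example is self-contained. It records each move of a component and then reports where the component sat when each issue was opened.

```python
import sqlite3


def open_history_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE location_history (
            serial   TEXT NOT NULL,  -- component serial number
            location TEXT NOT NULL,  -- e.g. rack and slot
            moved_at TEXT NOT NULL   -- ISO-8601 timestamp
        );
        CREATE TABLE issues (
            serial    TEXT NOT NULL,
            summary   TEXT NOT NULL,
            opened_at TEXT NOT NULL  -- ISO-8601 timestamp
        );
    """)
    return conn


def record_move(conn, serial, location, moved_at):
    """Store a location change, as a cluster-management feed might."""
    conn.execute("INSERT INTO location_history VALUES (?, ?, ?)",
                 (serial, location, moved_at))


def record_issue(conn, serial, summary, opened_at):
    """Store an issue opened against a component."""
    conn.execute("INSERT INTO issues VALUES (?, ?, ?)",
                 (serial, summary, opened_at))


def issues_with_location(conn, serial):
    """Return (opened_at, summary, location) rows for one component.

    The location is the most recent move at or before the issue time.
    ISO-8601 strings in one format compare correctly as plain text.
    """
    rows = conn.execute("""
        SELECT i.opened_at, i.summary,
               (SELECT h.location
                  FROM location_history h
                 WHERE h.serial = i.serial
                   AND h.moved_at <= i.opened_at
                 ORDER BY h.moved_at DESC
                 LIMIT 1) AS location
          FROM issues i
         WHERE i.serial = ?
         ORDER BY i.opened_at
    """, (serial,))
    return rows.fetchall()
```

With a history like this, a node that shows the same fault in two different slots stands out, which points at the node itself and not at its position in the cluster.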

Nick Yeates: You mentioned that lots of events were a problem. A question from the community is: How do you handle event floods?

Cindy Martin: Oh yes, that is a problem. When we were doing our load testing, we used a full reboot of the Roadrunner system as our guide to the number of events we should be able to handle per second. We then used those numbers to determine how many instances of Zenoss we needed to manage the cluster. In the case of Roadrunner, it was eleven Zenoss instances. We have had event floods happen a couple of times, and usually they are tied to a network issue.
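
The sizing reasoning above amounts to dividing a measured peak event rate by what one instance can absorb and rounding up. A minimal sketch, assuming both rates have been measured in load testing: the function name, the optional headroom parameter, and all numbers below are illustrative and are not LANL's figures.

```python
def instances_needed(peak_events_per_sec, capacity_per_instance,
                     headroom_percent=0):
    """Estimate how many monitoring instances a peak event rate requires.

    peak_events_per_sec   -- worst-case rate, e.g. during a full reboot
    capacity_per_instance -- events per second one instance can handle
    headroom_percent      -- optional safety margin added to the peak

    All arguments are integers so the rounding is exact.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if peak_events_per_sec < 0 or headroom_percent < 0:
        raise ValueError("rates and headroom must not be negative")
    # Scale both sides by 100 to apply the margin without floating point.
    required = peak_events_per_sec * (100 + headroom_percent)
    available = capacity_per_instance * 100
    # Ceiling division; always keep at least one instance.
    return max(1, -(-required // available))


# Hypothetical numbers for illustration only.
print(instances_needed(12000, 1500))                       # 8
print(instances_needed(12000, 1500, headroom_percent=25))  # 10
```

Sizing against the worst observed burst, and not the steady-state rate, is what keeps the instances from falling behind during a flood.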

Nick Yeates: Do you employ any dependency mappings, transforms, or other methods to filter out the large number of events you see?

Cindy Martin: Oh yes, we use the event mapping capability extensively, as well as transforms. We currently have over a thousand mappings in our Zenoss implementation. Not only are we doing transforms, we are doing updates to the database directly through the mapping. This was something we discovered we could do later in the process, and something that helped us tremendously.
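
The general pattern of mapping an event to a class and then running a transform with a side effect can be sketched as below. This does not use the Zenoss API: the `Event` and `Mapping` classes, the event class names, and the `suspect_nodes` list (a stand-in for a direct database update) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    device: str
    summary: str
    severity: int = 3
    event_class: str = "/Unknown"
    dropped: bool = False


@dataclass
class Mapping:
    pattern: str       # regular expression tested against the summary
    event_class: str   # class assigned when the pattern matches
    transform: Optional[Callable[[Event], None]] = None


def apply_mappings(evt: Event, mappings: List[Mapping]) -> Event:
    """Apply the first matching mapping and its transform, if any."""
    for mapping in mappings:
        if re.search(mapping.pattern, evt.summary):
            evt.event_class = mapping.event_class
            if mapping.transform is not None:
                mapping.transform(evt)
            break
    return evt


# Stand-in for a table that a transform might update directly.
suspect_nodes: List[str] = []


def flag_memory_fault(evt: Event) -> None:
    """Raise the severity and record the node as suspect."""
    evt.severity = 5
    suspect_nodes.append(evt.device)


def drop_event(evt: Event) -> None:
    """Mark noisy events so they can be discarded downstream."""
    evt.dropped = True


MAPPINGS = [
    Mapping(r"ECC error", "/HW/Memory", flag_memory_fault),
    Mapping(r"link (up|down)", "/Net/Link", drop_event),
]
```

Passing `Event("node042", "ECC error on DIMM 3")` through `apply_mappings` with `MAPPINGS` would classify it under the hypothetical `/HW/Memory` class, raise its severity, and add `node042` to `suspect_nodes`.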

Nick Yeates: It sounds like issue tracking, asset tracking, and association of events are interrelated. What did you do in these areas? How were they tied into Zenoss?

Cindy Martin: They are all related, that is correct. We created our ZenPack to add the necessary tables in the database as well as the interface into Zenoss. We used the event console as our template for the issue and asset tracking components. We actually had Zenoss developers help us implement the association of events, and I think something like that ended up in the new versions. We use a parent-child relationship to "roll up" an event into a super-event. We did the same thing for the issues.
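
A parent-child roll-up of this kind can be illustrated with a small in-memory model. This is only a sketch of the idea, not the ZenPack's implementation: the `Event` and `EventStore` names and their methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    event_id: str
    summary: str
    parent_id: Optional[str] = None  # None means a top-level event


class EventStore:
    """Holds events and lets related ones be rolled up under a parent."""

    def __init__(self) -> None:
        self._events: Dict[str, Event] = {}

    def add(self, event: Event) -> None:
        self._events[event.event_id] = event

    def roll_up(self, parent_id: str, child_ids: List[str]) -> None:
        """Attach each child event to the parent (the super-event)."""
        if parent_id not in self._events:
            raise KeyError(parent_id)
        # Validate everything first so a bad id leaves the store unchanged.
        for child_id in child_ids:
            if child_id == parent_id:
                raise ValueError("an event cannot be its own parent")
            if child_id not in self._events:
                raise KeyError(child_id)
        for child_id in child_ids:
            self._events[child_id].parent_id = parent_id

    def top_level(self) -> List[Event]:
        """Events an operator would see first: those without a parent."""
        return [e for e in self._events.values() if e.parent_id is None]

    def children_of(self, parent_id: str) -> List[Event]:
        """Events rolled up under the given super-event."""
        return [e for e in self._events.values() if e.parent_id == parent_id]


store = EventStore()
store.add(Event("e1", "Switch sw-12 unreachable"))
store.add(Event("e2", "node101 unreachable"))
store.add(Event("e3", "node102 unreachable"))
store.roll_up("e1", ["e2", "e3"])
print([e.event_id for e in store.top_level()])        # ['e1']
print([e.event_id for e in store.children_of("e1")])  # ['e2', 'e3']
```

The console then shows one super-event for the switch, with the node events one level down, and the same structure can be reused for issues.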

Nick Yeates: On a different note, I have a final, future-looking question: With the current excitement around "cloud computing", how might HPC operations be affected by cloud notions of elasticity, self-service, and self-management? Do you foresee supercomputers being replaced by the cloud, do you see the two merging into the same thing, or do you see no change at all?

Cindy Martin: I don't see that happening anytime soon. Maybe way out in the future.
