Hyak, Lolo and UW's CyberInfrastructure
Located in the UW Towers and operated by UW-IT, Hyak is a scalable high-performance computer composed of multi-CPU and multi-GPU compute nodes embedded in a fast internode communication fabric. It uses a "Condo of Condos" organizational approach: the UW provides the power, cooling, cabinets and workforce infrastructure, and researchers purchase their own compute nodes to slot into this infrastructure. Computations on Hyak are managed by a scheduler designed with preemption at its core. As of January 2016, Hyak has more than 10,000 compute cores (expandable to 24,000) and 48 GPUs, connected through a fast (10 GBs), low-latency communication fabric, with fast (>5 GBs) scratch storage and fast (>4 GBs) aggregate uplinks. Because Hyak is seamlessly integrated with a data center, fast storage and networks, and is managed by HPC experts (who also consult on optimizing applications), it should be considered a system rather than a stand-alone computer.
The nature of Hyak means that it should be considered an elastic supercomputer, or a super-cloud. When researchers are not using their own nodes, which happens routinely during code and workflow development, those nodes are available for everyone else to use free of charge through the backfill queue. Researchers are therefore encouraged to purchase the number of nodes that accommodates their baseline compute needs rather than their peak needs. On average, ~60% of the compute cycles on Hyak are consumed by owners using their own nodes, while ~40% are consumed via backfill. The cost of a delivered compute cycle from Hyak to UW researchers is one-tenth of the cost of purchasing it from the Amazon or Microsoft cloud services.
While each phase of the infrastructure paid for by the UW has a 6-year lifetime, the nodes purchased by researchers are nominally supported for 3 years. The reason for this 3-year limit is that, beyond this period, the rate of node failures and errors becomes unmanageable with the workforce that is presently supported.
Preemption, which is only meaningful in a multi-user environment, optimizes the productivity of a computational resource. While a researcher "owns" a certain number of compute nodes, if that research group has no calculations waiting to be executed, those nodes are made available to other researchers, enabling science that would otherwise not have been accomplished. However, when the "owners" of the nodes submit calculations, the nodes are immediately cleared to execute the owners' calculations.
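The owner-first policy described above can be illustrated with a toy model. This is a minimal sketch and not Hyak's actual scheduler: the names Node, Cluster, submit and requeued are hypothetical, every job is assumed to occupy exactly one node, and returning a preempted backfill job to a requeue list is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    owner: str                      # research group that purchased the node
    running: Optional[str] = None   # group whose job currently occupies the node


@dataclass
class Cluster:
    nodes: List[Node]
    # Groups whose backfill jobs were preempted (assumption: they get requeued).
    requeued: List[str] = field(default_factory=list)

    def submit(self, group: str) -> Optional[Node]:
        """Place one single-node job for `group`; return the node used, or None."""
        # 1. Prefer an idle node that the group owns.
        for node in self.nodes:
            if node.owner == group and node.running is None:
                node.running = group
                return node
        # 2. Otherwise preempt a backfill job running on a node the group owns.
        for node in self.nodes:
            if node.owner == group and node.running not in (None, group):
                self.requeued.append(node.running)
                node.running = group
                return node
        # 3. Otherwise backfill onto any idle node, whoever owns it.
        for node in self.nodes:
            if node.running is None:
                node.running = group
                return node
        return None  # nothing available; a real scheduler would queue the job


if __name__ == "__main__":
    cluster = Cluster([Node("physics"), Node("physics"), Node("chem")])
    cluster.submit("bio")       # bio owns no nodes, so it backfills onto an idle one
    cluster.submit("physics")   # physics takes its remaining idle node
    cluster.submit("physics")   # physics reclaims its other node, preempting bio
    print([n.running for n in cluster.nodes], cluster.requeued)
```

In this sketch a group that owns no nodes can only ever run through step 3, which plays the role of the backfill queue, while an owner never waits behind a backfill job on its own hardware.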
Hyak is used by researchers across the UW, as can be seen from the usage breakdown by domain shown at the left. The biggest users are the departments of Physics, Biochemistry and Chemical Engineering, all of which have programs that require large simulations and that also run on leadership-class supercomputers. A few users have large capability-computing needs, while most have capacity-computing needs: smaller partitions for a modest amount of time. This distribution of usage mirrors that seen at major supercomputing centers.
With Hyak entering Phase-2, the compute resources available to UW researchers have more than doubled over the last few years. These resources are being put to optimal use, and in terms of scientific productivity, the gains from the backfill queue are obvious.
Lolo is the UW's archival data storage facility, available to all UW researchers. It provides safe long-term storage of data and codes, and is distinct from a backup "service": its primary function is to archive data whose generation is complete, or data that will be modified infrequently, i.e., "cold data". Further, it can be used for working with large data sets from collaborations or teams of researchers, providing a "collaboration filesystem". There are two good reasons for offering archival data storage. The first is that it costs less than backups, which require the system to comb through data and save files that have recently changed. The second is that it reduces Hyak's need for first-tier storage, saving researchers funds.
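As an illustration of what "cold data" means in practice, the following minimal sketch lists files that have not been modified for some time and so might be candidates for archiving. It is not a Lolo tool: the function name cold_files and the age threshold are hypothetical, and using the file modification time as the measure of coldness is an assumption.

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional


def cold_files(root: str, min_age_days: float,
               now: Optional[float] = None) -> Iterator[Path]:
    """Yield files under `root` last modified more than `min_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Assumption: modification time is the criterion for "cold".
            if path.stat().st_mtime < cutoff:
                yield path


if __name__ == "__main__":
    # Hypothetical threshold: files untouched for 90 days or more.
    for path in cold_files(".", min_age_days=90):
        print(path)
```

A single pass like this, run when a project winds down, contrasts with a backup service, which has to scan repeatedly for recently changed files.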
Hyak and Lolo reside within a Science DMZ embedded in a High Speed Research Network (HSRN). They communicate with the outside world via an Internet2 100 GBs Pacific Northwest Gigapop connection, and communicate internally over 100 GBs, 40 GBs and 10 GBs connections. They communicate with the UW campus through a 40 GBs link.