## **Emerging Technologies on the Pathway to Extreme Scale Computing**

## Jeffrey Vetter

Presented to the INT 11-2C Exascale Workshop, University of Washington







**Computational Science and Engineering** 

http://ft.ornl.gov vetter@computer.org

# In a nutshell ...

- Toward Exascale
  - Highlights from recent projections for exascale
- Challenges
  - Micro, macro power
  - Memory capacity and bandwidth
  - Parallelism
  - Programmability
  - Algorithms
- Research solutions
  - Heterogeneity with GPUs: Keeneland
  - Programming models: SHOC, Maestro
  - NVRAM
  - In situ analysis

## **Toward Exascale**



Publicly released, Feb 2010

Process for identifying exascale applications and technology for DOE missions ensures broad community input

- Town Hall Meetings, April-June 2007
- Scientific Grand Challenges Workshops, Nov 2008 – Oct 2009
  - Climate Science (11/08)
  - High Energy Physics (12/08)
  - Nuclear Physics (1/09)
  - Fusion Energy (3/09)
  - Nuclear Energy (5/09)
  - Biology (8/09)
  - Material Science and Chemistry (8/09)
  - National Security (10/09)
  - Cross-cutting technologies (2/10)
- Exascale Steering Committee
  - "Denver" vendor NDA visits 8/2009
  - SC09 vendor feedback meetings
  - Extreme Architecture and Technology Workshop 12/2009
- International Exascale Software Project
  - Santa Fe, NM 4/2009; Paris, France 6/2009; Tsukuba, Japan 10/2009, etc.



[Figure: covers of the Scientific Grand Challenges workshop reports (including climate change science), with panel labels "Mission Imperatives" and "Fundamental Science"]

http://extremecomputing.labworks.org/ http://www.exascale.org/iesp/Main\_Page

# **Holistic View of HPC**

### Performance, Resilience, Power, Programmability

#### Applications

- Materials
- Climate
- Fusion
- National Security
- Combustion
- Nuclear Energy
- Cybersecurity
- Biology
- High Energy Physics
- Energy Storage
- Photovoltaics
- National Competitiveness
- <u>Usage Scenarios</u>
  - Ensembles
  - UQ
  - Visualization
  - Analytics

#### Programming Environment

- <u>Domain specific</u>
  - Libraries
  - Frameworks
  - Templates
  - Domain specific languages
  - Patterns
  - Autotuners
- Platform specific
  - Languages
  - Compilers
  - Interpreters/Scripting
  - Performance and Correctness Tools
  - Source code control

#### System Software

- Resource Allocation
- Scheduling
- Security
- Communication
- Synchronization
- Filesystems
- Instrumentation
- Virtualization

### Architectures

- <u>Processors</u>
  - Multicore
  - Graphics Processors
  - FPGA
  - DSP
- <u>Memory and Storage</u>
  - Shared (cc, scratchpad)
  - Distributed
  - RAM
  - Storage Class Memory
  - Disk
  - Archival
- Interconnects
  - Infiniband
  - IBM Torrent
  - Cray Gemini, Aries
  - BGL/P/Q
  - 1/10/100 GigE

# **Contemporary Systems**

| Date  | System                    | Location        | Comp           | Comm        | Peak<br>(PF) | Power<br>(MW) |
|-------|---------------------------|-----------------|----------------|-------------|--------------|---------------|
| 2010  | Tianhe-1A                 | NSC in Tianjin  | Intel + NVIDIA | Proprietary | 4.7          | 4.0           |
| 2010  | Nebulae                   | NSC in Shenzhen | Intel + NVIDIA | IB          | 2.9          | 2.6           |
| 2010  | Tsubame 2                 | TiTech          | Intel + NVIDIA | IB          | 2.4          | 1.4           |
| 2011  | K Computer (612 cabinets) | Kobe            | SPARC64 VIIIfx | Tofu        | 8.7          | 9.8           |
| ~2012 | Cray 'Titan'              | ORNL            | AMD + NVIDIA   | Gemini      | 20?          | 7?            |
| ~2012 | BlueWaters                | NCSA/UIUC       | POWER7         | IBM Hub     | 10?          | 10?           |
| ~2012 | BlueGeneQ                 | ANL             | SoC            | IBM         | 10?          |               |
| ~2012 | BlueGeneQ                 | LLNL            | SoC            | IBM         | 20?          |               |
|       | Others                    |                 |                |             |              |               |
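
The peak and power columns above can be combined into a rough energy-efficiency figure. The following is a minimal sketch that uses only the rows with both values filled in without a question mark; the function name is illustrative, and the result is peak (not sustained) GFlop/s per watt.

```python
# Peak performance (PF) and power (MW) copied from the table above.
# Rows marked with "?" or missing a power value are left out.
systems = {
    "Tianhe-1A": (4.7, 4.0),
    "Nebulae": (2.9, 2.6),
    "Tsubame 2": (2.4, 1.4),
    "K Computer": (8.7, 9.8),
}


def peak_gflops_per_watt(peak_pf, power_mw):
    """Convert peak PF and MW into peak GFlop/s per watt."""
    gflops = peak_pf * 1e6   # 1 PF = 1e6 GF
    watts = power_mw * 1e6   # 1 MW = 1e6 W
    return gflops / watts


efficiency = {name: peak_gflops_per_watt(pf, mw) for name, (pf, mw) in systems.items()}

# Print the systems from most to least efficient at peak.
for name, gfw in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {gfw:5.2f} GF/W (peak)")
```

On these numbers the GPU-based systems sit at roughly 1.1 to 1.7 peak GF/W, and the K Computer at a little under 0.9.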

# Tianhe-1A uses 7000+ NVIDIA GPUs

- Tianhe-1A uses
  - 7,168 NVIDIA Tesla M2050 GPUs
  - 14,336 Intel Westmeres
- Performance
  - 4.7 PF peak
  - 2.5 PF sustained on HPL
- 4.04 MW
  - If Tesla GPUs were not used in the system, the whole machine could have needed 12 MW of power to run with the same performance, which is equivalent to 5,000 homes
- Custom fat-tree interconnect
  - 2x bandwidth of InfiniBand QDR

[Screenshot: The New York Times, Business Day, Technology]

### China Wrests Supercomputer Title From U.S.

By ASHLEE VANCE Published: October 28, 2010

A Chinese scientific research center has built the fastest supercomputer ever made, replacing the United States as maker of the swiftest machine, and giving China bragging rights as a technology superpower.





Photo caption: The Tianhe-1A computer in Tianjin, China, links thousands upon thousands of chips.

... systems handle mathematical calculations, said Jack Dongarra, a University of Tennessee computer scientist who maintains the official supercomputer rankings.

Although the official list of the top 500 fastest machines, which comes out every six months, is not due to be completed by Mr. Dongarra until next week, he said the Chinese computer "blows away the existing No. 1 machine." He added, "We don't close the books until Nov. 1, but I would say it is unlikely we will see a system that is faster."

The computer, known as Tianhe-1A, has 1.4 times the horsepower of the ... national laboratory in Tennessee, as ...

# **Recent news - K**



- #1 on TOP500
- 8.162 PF (93% of peak)
  - 3.1x TOP500 #2
  - 9.8 MW
- 672 racks
- 68,544 processors, 1PB memory



## **SPARC64™ VIIIfx Chip Overview**



- Architecture Features
  - 8 cores
  - Shared 5 MB L2\$
  - Embedded Memory Controller
  - 2 GHz

### Fujitsu 45nm CMOS

- 22.7mm x 22.6mm
- 760M transistors
- 1271 signal pins

### Performance (peak)

- 128GFlops
- 64GB/s memory throughput
- Power
  - 58W (TYP, 30°C)
  - Water cooling: low leakage power and high reliability

All Rights Reserved, Copyright© FUJITSU LIMITED 2009

### Source: Fujitsu



### IBM PERCS Project Heart of Blue Waters: Two New Chips







### **Building Blue Waters**

Blue Waters will be the most powerful computer in the world for scientific research when it comes online in 2011-2012.



**Power7 Chip**

- 8 cores, 32 threads
- L1, L2, L3 cache (32 MB)
- Up to 256 GF (peak)
- 128 GB/s memory bw
- 45 nm technology

**Quad-chip Module**

- 4 Power7 chips
- Up to 1 TF (peak)
- 128 GB memory
- 512 GB/s memory bw
- Hub Chip: 1.128 TB/s

**IH Server Node**

- 8 QCMs (256 cores)
- Up to 8 TF (peak)
- 1 TB memory
- 4 TB/s memory bw
- 8 Hub chips: 9 TB/s
- Power supplies
- PCIe slots
- Fully water cooled

**IH Supernode**

- 4 IH Server Nodes
- 1024 cores
- Up to 32 TF (peak)
- 4 TB memory
- 16 TB/s memory bw
- 32 Hub chips: 36 TB/s

**Blue Waters**

- ≥10 PF peak
- ~1 PF sustained
- ≥300,000 cores

**Blue Waters** is built from components that can also be used to build systems with a wide range of capabilities, from deskside to beyond Blue Waters.
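
The packaging hierarchy on this slide multiplies out in a simple way. The sketch below rebuilds the per-level peak numbers from the Power7 chip figure and estimates how many IH server nodes a 10 PF machine would need; the variable names are illustrative, and the node count is an estimate derived from the slide's figures, not a number stated by NCSA.

```python
import math

# Figures quoted on the slide.
CHIP_PEAK_GF = 256          # Power7 chip, peak GF
CORES_PER_CHIP = 8
CHIPS_PER_QCM = 4           # quad-chip module
QCMS_PER_NODE = 8           # IH server node
NODES_PER_SUPERNODE = 4     # IH supernode

qcm_peak_tf = CHIP_PEAK_GF * CHIPS_PER_QCM / 1000      # about 1 TF
node_peak_tf = qcm_peak_tf * QCMS_PER_NODE             # about 8 TF
supernode_peak_tf = node_peak_tf * NODES_PER_SUPERNODE  # about 32 TF

cores_per_node = CORES_PER_CHIP * CHIPS_PER_QCM * QCMS_PER_NODE
cores_per_supernode = cores_per_node * NODES_PER_SUPERNODE

# Estimate: nodes needed to reach the ">= 10 PF peak" target.
target_peak_tf = 10_000
nodes_for_target = math.ceil(target_peak_tf / node_peak_tf)
cores_for_target = nodes_for_target * cores_per_node

print(f"QCM {qcm_peak_tf:.3f} TF, node {node_peak_tf:.3f} TF, "
      f"supernode {supernode_peak_tf:.3f} TF")
print(f"{nodes_for_target} nodes, {cores_for_target} cores for 10 PF peak")
```

The resulting core count lands just above 300,000, which is consistent with the "≥300,000 cores" line on the slide.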

Source: NCSA, "Extreme-scale Computing", USC-DOE Materials Science Conference, 31 March 2011

# Exascale Trends and Challenges

## **Notional Exascale Architecture Targets**

| System attributes             | 2001     | 2010     | "2015"           |          | "2018"        |           |
|-------------------------------|----------|----------|------------------|----------|---------------|-----------|
| System peak                   | 10 Tera  | 2 Peta   | 200 Petaflop/sec |          | 1 Exaflop/sec |           |
| Power                         |          | 6 MW     | 15 MW            |          | 20 MW         |           |
| System memory                 | 0.006 PB | 0.3 PB   | 5 PB             |          | 32-64 PB      |           |
| Node performance              | 0.024 TF | 0.125 TF | 0.5 TF           | 7 TF     | 1 TF          | 10 TF     |
| Node memory BW                |          | 25 GB/s  | 0.1 TB/sec       | 1 TB/sec | 0.4 TB/sec    | 4 TB/sec  |
| Node concurrency              | 16       | 12       | O(100)           | O(1,000) | O(1,000)      | O(10,000) |
| System size<br>(nodes)        | 416      | 18,700   | 50,000           | 5,000    | 1,000,000     | 100,000   |
| Total Node<br>Interconnect BW |          | 1.5 GB/s | 150 GB/sec       | 1 TB/sec | 250 GB/sec    | 2 TB/sec  |
| MTTI                          |          | day      | O(1 day)         |          | O(1 day)      |           |

## Note the Uneven Impact on System Balance!

|                        | 2010       | 2018       | Factor Change |
|------------------------|------------|------------|---------------|
| System peak            | 2 Pf/s     | 1 Ef/s     | 500           |
| Power                  | 6 MW       | 20 MW      | 3             |
| System Memory          | 0.3 PB     | 10 PB      | 33            |
| Node Performance       | 0.125 Tf/s | 10 Tf/s    | 80            |
| Node Memory BW         | 25 GB/s    | 400 GB/s   | 16            |
| Node Concurrency       | 12 cpus    | 1,000 cpus | 83            |
| Interconnect BW        | 1.5 GB/s   | 50 GB/s    | 33            |
| System Size (nodes)    | 20 K nodes | 1 M nodes  | 50            |
| Total Concurrency      | 225 K      | 1 B        | 4,444         |
| Storage                | 15 PB      | 300 PB     | 20            |
| Input/Output bandwidth | 0.2 TB/s   | 20 TB/s    | 100           |

DOE Exascale Initiative Roadmap, Architecture and Technology Workshop, San Diego, December, 2009.
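
The "uneven impact" is easier to see when the factor changes are recomputed and the memory figures are expressed per flop. This is a small sketch using only the 2010 and 2018 values from the table above; the dictionary keys and variable names are illustrative.

```python
# (2010 value, 2018 value) in matching units, copied from the table above.
rows = {
    "System peak (Pf/s)": (2, 1000),
    "Power (MW)": (6, 20),
    "System memory (PB)": (0.3, 10),
    "Node performance (Tf/s)": (0.125, 10),
    "Node memory BW (GB/s)": (25, 400),
    "Node concurrency": (12, 1000),
    "Interconnect BW (GB/s)": (1.5, 50),
    "System size (nodes)": (20_000, 1_000_000),
    "Total concurrency": (225_000, 1_000_000_000),
    "Storage (PB)": (15, 300),
    "I/O bandwidth (TB/s)": (0.2, 20),
}

factors = {name: new / old for name, (old, new) in rows.items()}
for name, factor in factors.items():
    print(f"{name:26s} x{factor:,.0f}")

# System balance: memory capacity per unit of peak performance (bytes per flop/s).
mem_per_flop_2010 = 0.3e15 / 2e15
mem_per_flop_2018 = 10e15 / 1e18

# Node balance: memory bandwidth per unit of node performance (bytes per flop).
bw_per_flop_2010 = 25e9 / 0.125e12
bw_per_flop_2018 = 400e9 / 10e12

print(f"memory capacity per flop/s: {mem_per_flop_2010:.2f} -> {mem_per_flop_2018:.2f}")
print(f"node bandwidth per flop:    {bw_per_flop_2010:.2f} -> {bw_per_flop_2018:.2f}")
```

Peak grows by 500x while power grows by about 3x, and both memory capacity and node memory bandwidth per flop shrink, which is the imbalance the slide title points at.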

# **NVIDIA Echelon System Sketch**



DARPA Echelon team: NVIDIA, ORNL, Micron, Cray, Georgia Tech, Stanford, UC-Berkeley, U Penn, Utah, Tennessee, Lockheed Martin

## Challenges to Exascale

## **Performance Growth**

- 1) System power is the primary constraint
- 2) Memory bandwidth and capacity are not keeping pace
- 3) Concurrency (1000x today)
- 4) **Processor** architecture is an open question
- 5) Programming model: heroic compilers will not hide this
- 6) Algorithms need to minimize data movement, not flops
- 7) I/O bandwidth unlikely to keep pace with machine speed
- 8) Reliability and resiliency will be critical at this scale
- 9) Bisection bandwidth limited by cost and energy

# Unlike the last 20 years, most of these (1-7) are equally important across scales, e.g., 100 10-PF machines



Source: Hitchcock, Exascale Research Kickoff Meeting



### Both macro and micro energy trends drive all other factors



[Figure: chart with a yearly x-axis from 2006 to 2017]



If energy costs ~\$1M/MW/yr, then how much is the energy cost for an exascale system?!?!
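
The question can be answered with one multiplication. The sketch below assumes the cost figure is meant as roughly one million US dollars per megawatt-year, a common rule of thumb; that rate and the function name are assumptions. The power figures (6 MW in 2010, 20 MW target, 2 PF in 2010) come from the exascale target tables earlier in the talk.

```python
# Assumed rule of thumb: about 1 million USD per MW per year.
COST_PER_MW_YEAR_USD = 1_000_000


def annual_energy_cost_usd(power_mw, cost_per_mw_year=COST_PER_MW_YEAR_USD):
    """Yearly energy bill for a machine drawing power_mw megawatts."""
    return power_mw * cost_per_mw_year


# Targets from the earlier tables: 6 MW in 2010, 20 MW for a 2018 exascale system.
cost_2010 = annual_energy_cost_usd(6)
cost_exascale_target = annual_energy_cost_usd(20)

# Hypothetical: scale 2010 technology (2 PF at 6 MW) linearly to 1 EF.
naive_exascale_mw = 6 * (1000 / 2)
cost_naive_exascale = annual_energy_cost_usd(naive_exascale_mw)

print(f"2010 system:             ${cost_2010:,.0f} per year")
print(f"Exascale at 20 MW:       ${cost_exascale_target:,.0f} per year")
print(f"Exascale with 2010 tech: ${cost_naive_exascale:,.0f} per year "
      f"({naive_exascale_mw:,.0f} MW)")
```

Under that assumption, the 20 MW target keeps the bill in the tens of millions of dollars per year, while a linear scale-up of 2010 technology would run into the billions.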

## Facilities and Power ... Not just ORNL









... but the more important trend is power management at the chip level ...

# #3/4: (Uncertainty in) Concurrency and Processor Architecture

# **Dark Silicon**



Source: ARM

## Mobile and Embedded Systems Community Has Experience



Source: Delagi, ISSCC 2010

## CPU GPU SoC: Features & Block Diagram

### CPU

- Three 3.2 GHz PowerPC<sup>®</sup> cores
- Shared 1MB L2 cache
- Per Core:
  - Dual Thread Execution
  - 32K L1 I-cache, 32K L1 D-cache
  - 2-issue per cycle
  - Branch, Integer, Load/Store Units
  - VMX128 Units enhanced for games

### GPU

- 48 parallel unified shaders
- 24 billion shader instructions per second
- 4 billion pixels/sec pixel fill rate
- 500 million triangles/sec geometry rate
- High Speed IO interface to 10 MB EDRAM

### Compatibility

- Functional and Performance equivalent to prior Xbox 360 GPU/CPU
- FSB Latency and BW match prior FSB





# **AMD's Llano: A-Series APU**

- Combines
  - 4 x86 cores
  - Array of Radeon cores
  - Multimedia accelerators
  - Dual channel DDR3
- 32nm
- Up to 29 GB/s memory bandwidth
- Up to 500 Gflops SP
- 45W TDP



## Intel<sup>®</sup> MIC Architecture: An Intel Co-Processor Architecture



- Many cores and many, many more threads
- Standard IA programming and memory model



# **Graphics Processors**

## **GPU Rationale – What's different now?**



# **NVIDIA Fermi/GF100**

- 3B transistors in 40nm
- Up to 512 CUDA Cores
  - New IEEE 754-2008 floating-point standard
    - FMA
    - 8× the peak double precision arithmetic performance over NVIDIA's last generation GPU
  - 32 cores per SM, 21k threads per chip
- 384-bit GDDR5, 6 GB capacity
  - ~120-144 GB/s memory BW
- C/M2070
  - 515 GigaFLOPS DP, 6GB
  - ECC: register files, L1/L2 caches, shared memory, and DRAM
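
Peak flops and memory bandwidth together determine how many bytes a chip can move per floating-point operation. The sketch below compares the three chips whose peak and bandwidth are quoted on the slides; note that the precisions differ (the Llano figure is single precision), so this is only a rough comparison. The C/M2070 core count and clock used at the end are assumptions taken from outside the slides, included to show where a figure like 515 GF can come from.

```python
# (peak GFlop/s, memory bandwidth in GB/s), as quoted on the slides above.
# For the C/M2070 the upper end of the ~120-144 GB/s range is used.
chips = {
    "SPARC64 VIIIfx": (128.0, 64.0),
    "AMD Llano A-Series (SP)": (500.0, 29.0),
    "NVIDIA C/M2070 (DP)": (515.0, 144.0),
}

# Machine balance: bytes of memory traffic available per peak flop.
bytes_per_flop = {name: bw / gflops for name, (gflops, bw) in chips.items()}
for name, bpf in bytes_per_flop.items():
    print(f"{name:26s} {bpf:.3f} bytes/flop")

# Assumed C/M2070 figures, NOT from the slides: 448 enabled CUDA cores at
# 1.15 GHz, 2 flops per fused multiply-add, and double precision running
# at half the single-precision rate.
fermi_dp_gflops = 448 * 1.15 * 2 / 2
print(f"Estimated C/M2070 DP peak: {fermi_dp_gflops:.1f} GF")
```

All three chips offer well under one byte per flop, which is why the later slides stress algorithms that minimize data movement.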









[Figure: Fermi streaming multiprocessor (SM) block diagram: instruction cache, two warp schedulers each with a dispatch unit, register file, an array of 32 CUDA cores with LD/ST units and SFUs, interconnect network, 64 KB shared memory / L1 cache, and uniform cache. Inset: a CUDA core with FP unit, INT unit, and result queue.]

## HP ProLiant SL390s G7 2U half width tray





# **Keeneland – Initial Delivery System**



## Keeneland ID installation – 10/29/10

















# **Organizations using Keeneland**

Allinea; Arizona; Brown University; Dartmouth; Florida State; George Washington; Georgia Tech; HP; Illinois; NCAR; NCSA; North Carolina; Northwestern; NVIDIA; Ohio State; ORNL; OSU; Purdue; Reservoir Labs; Rice Univ.; San Diego; Stanford; Temple; Trideum Corp; U. Chicago; U. Cincinnati; U. Delaware; U. Florida; U. Georgia; U. Illinois; U. Maryland; U. New Hampshire; U. of Oregon; U. South Calif.; Univ of Calif, Davis; UCSD; Univ South Calif.; Univ of Delaware; Univ of Calif; Univ of California, Los Angeles; Univ of Utah; Univ of Rochester; Univ of Wisconsin; UTK; Vanderbilt; Yale

New users being added weekly

# Early (Co-design) Success Stories

### **Computational Materials**

- Quantum Monte Carlo
  - High-temperature superconductivity and other materials science
  - 2008 Gordon Bell Prize
- GPU acceleration speedup of 19x in main QMC Update routine
  - Single precision for CPU and GPU: target single-precision only cards
- Full parallel app is 5x faster, start to finish, on a GPU-enabled

GPU study: J.S. Meredith, G. Alvarez, T.A. Maier, T.C. Schulthess, J.S. Vetter, "Accuracy and Performance of Graphics Processors: A Quantum Monte Carlo Application Case Study", *Parallel Comput.*, 35(3):151-63, 2009.

Accuracy study: G. Alvarez, M.S. Summers, D.E. Maxwell, M. Eisenbach, J.S. Meredith, J. M. Larkin, J. Levesque, T. A. Maier, P.R.C. Kent, E.F. D'Azevedo, T.C. Schulthess, "New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors", SuperComputing, 2008. [Gordon Bell Prize winner]
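
The gap between the 19x routine speedup and the 5x whole-application speedup is what Amdahl's law predicts. As a sketch, assume (this is not stated on the slide) that the QMC update routine is the only part that was accelerated; the law can then be inverted to estimate what fraction of the original run time that routine occupied. Function names are illustrative.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when a fraction of the run time is sped up."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)


def implied_fraction(overall_speedup, kernel_speedup):
    """Invert Amdahl's law: fraction of run time spent in the accelerated part."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


# Slide figures: 19x in the main QMC update routine, 5x start to finish.
fraction = implied_fraction(overall_speedup=5.0, kernel_speedup=19.0)
print(f"Implied share of run time in the accelerated routine: {fraction:.1%}")

# Even an infinitely fast kernel would cap the overall speedup at 1 / (1 - fraction).
upper_bound = 1.0 / (1.0 - fraction)
print(f"Upper bound on overall speedup: {upper_bound:.1f}x")
```

Under that assumption the routine accounts for roughly 84% of the original run time, and the remaining serial portion limits further gains no matter how fast the GPU kernel becomes.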

## Combustion

### S3D

- Massively parallel direct numerical simulation (DNS) solver for the full compressible Navier-Stokes, total energy, species, and mass continuity equations
- Coupled with detailed chemistry
- Scales to 150k cores on Jaguar
- Accelerated version of S3D's Getrates kernel in CUDA on Tesla T10
  - 31.4x SP speedup
  - 16.2x DP speedup

K. Spafford, J. Meredith, J. S. Vetter, J. Chen, R. Grout, and R. Sankaran. Accelerating S3D: A GPGPU Case Study. Proceedings of the Seventh International Workshop on Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar 2009) Delft, The Netherlands.

## Peptide folding on surfaces

- Peptide folding on a hydrophobic surface
  - www.chem.ucsb.edu/~sheagroup
- Surfaces can modulate the folding and aggregation pathways of proteins. Here, we investigate the folding of a small helical peptide in the presence of a hydrophobic surface of graphite. Simulations are performed using explicit solvent and a fully atomic representation of the peptide and the surface.
- Benefits of running on a GPU cluster:
  - Fewer computing nodes are needed: one GPU is faster than at least 8 CPUs in GPU-accelerated AMBER molecular dynamics.
  - The large simulations that we are currently running would be prohibitive using CPUs. The efficiency of the CPU parallelization becomes poorer with an increasing number of CPUs.
  - Running on GPUs can also decrease consumption of memory and network bandwidth in simulations with a large number of atoms.



### Hadron Polarizability in Lattice QCD

Understanding the structure of subnuclear particles represents the main challenge for today's nuclear physics. Photons are used to probe this structure in experiments carried out at laboratories around the world. To interpret the results of these experiments we need to understand how the electromagnetic field interacts with subnuclear particles. Theoretically, the structure of subnuclear particles is described by Quantum Chromodynamics (QCD). Lattice QCD is a 4-dimensional discretized version of this theory that can be solved numerically. The focus of our project is to understand how the electric field deforms neutrons and protons by computing the polarizability using lattice QCD techniques.



#### Andrei Alexandru The George Washington University





#### Why GPUs?

- Lattice QCD simulations require very large bandwidth to run efficiently. GPUs have 10–15 times the memory bandwidth of CPUs.
- Lattice QCD simulations can be efficiently parallelized.
  - Bulk of calculation spent on one kernel.
  - The kernel requires only nearest neighbor information.
  - Cut the lattice into equal sub-lattices. Effectively use the single-instruction, multiple-data (SIMD) paradigm.
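
The nearest-neighbor structure and the cut into equal sub-lattices can be illustrated on a small periodic 4D lattice. The following is a sketch only: the lattice size, the process grid, and the helper names are hypothetical, and no physics is computed. It counts how many sites have at least one neighbor owned by a different sub-lattice, which is the data that would have to be exchanged between GPUs.

```python
from itertools import product


def neighbors(site, dims):
    """The 2 * len(dims) nearest neighbors of a site on a periodic lattice."""
    result = []
    for axis, extent in enumerate(dims):
        for step in (-1, 1):
            shifted = list(site)
            shifted[axis] = (shifted[axis] + step) % extent  # periodic wrap
            result.append(tuple(shifted))
    return result


def owner(site, dims, grid):
    """Coordinates of the equal-sized sub-lattice that owns a site."""
    return tuple(x // (d // g) for x, d, g in zip(site, dims, grid))


# Hypothetical sizes: an 8 x 8 x 8 x 16 lattice cut into 2 x 2 x 2 x 2 blocks.
dims = (8, 8, 8, 16)
grid = (2, 2, 2, 2)

total_sites = 0
boundary_sites = 0
for site in product(*(range(d) for d in dims)):
    total_sites += 1
    mine = owner(site, dims, grid)
    # A boundary site needs data from at least one other sub-lattice.
    if any(owner(n, dims, grid) != mine for n in neighbors(site, dims)):
        boundary_sites += 1

print(f"{boundary_sites} of {total_sites} sites touch another sub-lattice")
```

With blocks this small most sites sit on a block surface, which shows why larger sub-lattices per GPU reduce the relative communication cost.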

#### http://samurai.phys.gwu.edu/wiki/index.php/Hadron\_polarizability





Experimental and current values for neutron electric polarizability in lattice QCD.

A. Alexandru and F. X. Lee, [arXiv:0810.2833]











Performance comparison between Keeneland's GPU cluster and Kraken's Cray XT-5 machine. The CPU core count is translated to a GPU-equivalent count by dividing the total number of CPU cores by 22, which is the number of CPU cores equivalent to a single GPU in performance.

#### A. Alexandru et al., [arXiv:1103.5103]







### LAMMPS with GPUs

- Parallel Molecular Dynamics
  - http://lammps.sandia.gov
  - Classical Molecular Dynamics
  - Atomic models, Polymers, Metals, Bio-simulations, Coarse-grain (picture), Ellipsoids, etc.
  - Already good strong and weak scaling on CPUs via MPI



- Better performance on fewer nodes => larger problems faster
- Neighbor, non-bonded force, and long-range GPU acceleration
- Allows for CPU/GPU concurrency
- Implementation and benchmarks by W. Michael Brown, NCCS, ORNL



# **#5: Programming Systems**


### **Keeneland Software Environment**

- Integrated with NSF TeraGrid/XD
  - Including TG and NICS software stack
- Programming environments
  - CUDA
  - OpenCL
  - Compilers
    - GPU-enabled
  - Scalable debuggers
  - Performance tools
  - Libraries

- Tools and programming options are changing rapidly
  - HMPP, PGI, LLVM, Polaris, R-stream, ...
- Additional software activities
  - Performance and correctness tools
  - Scientific libraries
  - Virtualization















# **OpenCL Programming Model**

Local Work Item

Local Work Group



N-Dimensional Grid of Work Items
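
The relationship between the labels in this figure can be spelled out in a few lines. The sketch below is plain Python, not OpenCL host or kernel code: it enumerates a 2D index space and derives each work item's global ID from its work-group ID and local ID, assuming a zero global offset and a global size that is a multiple of the work-group size. The function name and the sizes are illustrative.

```python
def work_items(global_size, local_size):
    """Enumerate a 2D index space as (group_id, local_id, global_id) triples."""
    if any(g % l != 0 for g, l in zip(global_size, local_size)):
        raise ValueError("global size must be a multiple of the work-group size")
    groups = tuple(g // l for g, l in zip(global_size, local_size))
    for group_y in range(groups[1]):
        for group_x in range(groups[0]):
            for local_y in range(local_size[1]):
                for local_x in range(local_size[0]):
                    # global ID = group ID * work-group size + local ID, per dimension
                    global_id = (group_x * local_size[0] + local_x,
                                 group_y * local_size[1] + local_y)
                    yield (group_x, group_y), (local_x, local_y), global_id


# Illustrative sizes: an 8 x 4 grid of work items in work groups of 4 x 2.
items = list(work_items(global_size=(8, 4), local_size=(4, 2)))
for group_id, local_id, global_id in items[:4]:
    print(f"group {group_id} local {local_id} -> global {global_id}")
```

Inside a real kernel the same three values are available through `get_group_id`, `get_local_id`, and `get_global_id`.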





### Scalable HeterOgeneous Computing (SHOC) Benchmark Suite

- Benchmark suite with a focus on scientific computing workloads, including common kernels like SGEMM, FFT, Stencils
- Parallelized with MPI, with support for multi-GPU and cluster scale comparisons
- Implemented in CUDA and OpenCL for a 1:1 performance comparison
- Includes stability tests
- Performance portability

- Level 0
  - BusSpeedDownload: measures bandwidth of transferring data across the PCIe bus to a device.
  - BusSpeedReadback: measures bandwidth of reading data back from a device.
  - DeviceMemory: measures bandwidth of memory accesses to various types of device memory including global, local, and image memories.
  - KernelCompile: measures compile time for several OpenCL kernels, which range in complexity.
  - PeakFlops: measures maximum achievable floating point performance using a combination of auto-generated and hand-coded kernels.
  - QueueDelay: measures the overhead of using the OpenCL command queue.
- Level 1
  - FFT: forward and reverse 1D FFT.
  - MD: computation of the Lennard-Jones potential from molecular dynamics, a specific case of the n-body problem.
  - Reduction: reduction operation on an array of single precision floating point values.
  - SGEMM: single-precision matrix-matrix multiply.
  - Scan: scan (also known as parallel prefix sum) on an array of single precision floating point values.
  - Sort: sorts an array of key-value pairs using a radix sort algorithm
  - Stencil2D: a 9-point stencil operation applied to a 2D data set. In the MPI version, data is distributed across MPI processes organized in a 2D Cartesian topology, with periodic halo exchanges.
  - Triad: STREAM Triad operations, implemented in OpenCL.
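
To make one of the Level 1 kernels concrete, here is a minimal sketch of a scan (parallel prefix sum). It uses the step-doubling (Hillis-Steele) formulation written sequentially in Python; this is one textbook way to express the operation and is not claimed to be the algorithm SHOC itself uses. Each pass of the loop corresponds to one data-parallel step in which every element could be updated by its own work item.

```python
def inclusive_scan(values):
    """Inclusive prefix sum using step doubling (Hillis-Steele formulation)."""
    data = list(values)
    stride = 1
    while stride < len(data):
        # One data-parallel step: every element reads from the previous
        # array only, so all updates in this step are independent.
        data = [
            data[i] + data[i - stride] if i >= stride else data[i]
            for i in range(len(data))
        ]
        stride *= 2
    return data


print(inclusive_scan([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The loop runs about log2(n) steps, which is what makes the operation attractive on wide parallel hardware even though it performs more additions than a sequential running sum.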

#### Software available at http://bit.ly/shocmarx

A. Danalis, G. Marin, C. McCurdy, J. Meredith, P.C. Roth, K. Spafford, V. Tipparaju, and J.S. Vetter, "The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite," in Third Workshop on General-Purpose Computation on Graphics Processors (GPGPU 2010). Pittsburgh, 2010.

#### **Single Precision MD**



**Single Precision FFT** 



### Example: Sparse Matrix Vector Multiplication (SpMV)

- Motivation
  - Extremely common scientific kernel
  - Bandwidth bound, and much harder to get performance than GEMM
- Basic design
  - 3 Algorithms, padded & unpadded data
  - CSR and ELLPACKR data formats
  - Supports random matrices or matrix market format
  - Example: Gould, Hu, & Scott: expanded system-3D PDE.
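
The two data formats named above can be sketched in a few lines. The code below is an illustrative, sequential Python version and not the SHOC implementation: CSR stores only the nonzeros with a row-pointer array, while ELLPACK-R pads every row to the same width and keeps a separate row-length array so that padded entries are never read. The function names and the small test matrix are hypothetical.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A * x with A in compressed sparse row (CSR) format."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def csr_to_ellpack_r(row_ptr, col_idx, vals):
    """Pad each row to the longest row length and record the true lengths."""
    n_rows = len(row_ptr) - 1
    row_len = [row_ptr[r + 1] - row_ptr[r] for r in range(n_rows)]
    width = max(row_len, default=0)
    ell_cols = [[0] * width for _ in range(n_rows)]
    ell_vals = [[0.0] * width for _ in range(n_rows)]
    for r in range(n_rows):
        for j, k in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_cols[r][j] = col_idx[k]
            ell_vals[r][j] = vals[k]
    return ell_cols, ell_vals, row_len


def spmv_ellpack_r(ell_cols, ell_vals, row_len, x):
    """y = A * x with A in ELLPACK-R format; padding is skipped via row_len."""
    return [
        sum(ell_vals[r][j] * x[ell_cols[r][j]] for j in range(row_len[r]))
        for r in range(len(row_len))
    ]


# Hypothetical 3 x 3 matrix: [[4, 0, 1], [0, 3, 0], [2, 0, 5]].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_ell = spmv_ellpack_r(*csr_to_ellpack_r(row_ptr, col_idx, vals), x)
print(y_csr, y_ell)
```

On a GPU the padded arrays are typically stored column by column so that neighboring threads read neighboring memory locations; the row-major lists here are only for readability.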



### **SpMV Performance**



# **Comparing CUDA and OpenCL**



#### Single precision, Tesla C1060 GPU

Comparing NVIDIA OpenCL implementations from the 2.3 and 3.0 GPU Computing SDKs



# Performance and Correctness

### Vancouver: Integrated Performance Analysis of MPI/GPU Applications



CUDA memory transfer (white)

MPI communication (yellow)

Partner: U of Oregon Tau Group

### Vancouver: Integrated Performance Analysis of Compiler-Generated CUDA Applications

[Screenshot: TAU ParaProf, n,c,t 0,0,0; metric: TIME; value: exclusive percent. A second panel, Thread Statistics, lists the same routines with exclusive time, inclusive time, calls, and child calls.]

| Exclusive time | Routine                                                             |
|----------------|---------------------------------------------------------------------|
| 55.367%        | pgi_cu_downloadx multiply_matrices (var=a)                          |
| 39.329%        | pgi_cu_init multiply_matrices                                       |
| 1.822%         | mymatrixmultiply [{mmdriv.f90} {1,0}]                               |
| 1.648%         | pgi_cu_uploadx multiply_matrices (var=c)                            |
| 1.618%         | pgi_cu_uploadx multiply_matrices (var=b)                            |
| 0.083%         | pgi_cu_free multiply_matrices                                       |
| 0.07%          | pgi_cu_alloc multiply_matrices                                      |
| 0.037%         | multiply_matrices [{mm2.f90} {5,0}]                                 |
| 0.007%         | pgi_cu_module multiply_matrices                                     |
| 0.006%         | pgi_cu_launch multiply_matrices (multiply_matrices_11_gpu)          |
| 0.005%         | pgi_cu_paramset multiply_matrices                                   |
| 0.004%         | pgi_cu_launch multiply_matrices (multiply_matrices_15_gpu)          |
| 0.002%         | pgi_cu_module_function2 multiply_matrices (two entries)             |
|               | 0.0%           | pgi_cu_launch multiply_matrices (multiply_matrices_11_gpu,gx=188,gy=188,gz=1,bx=16,by=16                                                         |                    | -                   | 5                 | 0                  |  |  |  |
|               | 0.0%           | pgi_cu_paramset multiply_matrices [{/mnt/netapp/home1/sameer/mm/mm2.f90}]                                                                        | 0                  | -                   | 10                | 0                  |  |  |  |
|               | 0.0%           | pgi_cu_launch multiply_matrices (multiply_matrices_15_gpu,gx=188,gy=188,gz=1,bx=16,by=16                                                         |                    | 0                   | 5                 | 0 🔺                |  |  |  |
|               | 0.0%           | pgi_cu_module_function2 multiply_matrices name=multiply_matrices_11_gpu, argname=(null), arg                                                     | . 0                | 0                   | 5                 | 0 🏹                |  |  |  |

# **Performance Portability**

### Maestro



- Portability
- Load balancing
- Autotuning

K. Spafford, J. Meredith, and J. Vetter, "Maestro: Data Orchestration and Tuning for OpenCL Devices," in *Euro-Par 2010 - Parallel Processing*, vol. 6272, Lecture Notes in Computer Science, P. D'Ambra, M. Guarracino et al., Eds.: Springer Berlin / Heidelberg, 2010, pp. 275-286.

## **Maestro: Multibuffering**



Fig. 2. Double buffering – This figure contrasts (a) the function offload model with (b) a very simple case of double buffering. Devices that can execute kernels and transfer data concurrently are able to hide some communication time behind computation.
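
The saving that double buffering offers can be estimated with a simple timing model. The sketch below is illustrative only and is not Maestro's implementation: it assumes the device can overlap exactly one transfer with one kernel, that every chunk has the same transfer and compute time, and that result download is ignored. The function names and timings are hypothetical.

```python
def offload_time(n_chunks, t_transfer, t_compute):
    """Function offload model: each transfer finishes before its kernel starts."""
    return n_chunks * (t_transfer + t_compute)


def double_buffered_time(n_chunks, t_transfer, t_compute):
    """Double buffering: the transfer of chunk i+1 overlaps the kernel of chunk i.

    The first transfer and the last kernel cannot be hidden; every step in
    between costs whichever of the two operations is slower.
    """
    if n_chunks == 0:
        return 0.0
    return t_transfer + (n_chunks - 1) * max(t_transfer, t_compute) + t_compute


if __name__ == "__main__":
    # Hypothetical per-chunk timings in milliseconds.
    n, t_x, t_c = 8, 2.0, 3.0
    print("offload:        ", offload_time(n, t_x, t_c), "ms")
    print("double buffered:", double_buffered_time(n, t_x, t_c), "ms")
```

Under these assumptions the hidden time grows with the number of chunks, but smaller chunks also add per-transfer overhead that this model leaves out, which is one reason the chunk size is worth tuning.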

# **Maestro: Autotuning Workgroups**



Fig. 3. Autotuning the local work group size – This figure shows the performance of the MD kernel on various platforms at different local work group sizes, normalized to the performance at a group size of 16. Lower runtimes are better.
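
A minimal sketch of this kind of search, assuming an exhaustive sweep over candidate sizes (the search strategy Maestro actually uses is not stated here). `measure` is a hypothetical callback that returns the runtime of one kernel launch, and `fake_measure` is a made-up cost model standing in for a real device, not measured data.

```python
def autotune_work_group_size(measure, candidates, repeats=3):
    """Time every candidate size and keep the fastest.

    measure(size) must return the runtime of one kernel launch in seconds.
    Taking the minimum over several repeats reduces timing noise.
    """
    runtimes = {}
    for size in candidates:
        runtimes[size] = min(measure(size) for _ in range(repeats))
    best = min(runtimes, key=runtimes.get)
    return best, runtimes


def normalize(runtimes, baseline=16):
    """Express runtimes relative to the baseline size, as the figure does."""
    return {size: t / runtimes[baseline] for size, t in runtimes.items()}


def fake_measure(size):
    """Hypothetical stand-in for a timed kernel launch (not real data)."""
    return 1.0 / min(size, 128) + 0.0001 * size


if __name__ == "__main__":
    best, runtimes = autotune_work_group_size(fake_measure, [16, 32, 64, 128, 256])
    print("best local work group size:", best)
    for size, rel in normalize(runtimes).items():
        print(f"  size {size:3d}: {rel:.2f}x the runtime at size 16")
```

Because the best size differs from platform to platform, the sweep has to be repeated on each target device rather than fixed once in the source.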

# **Combined Autotuning Results**



Fig. 6. Combined autotuning results – (a) shows the combined benefit of autotuning both the local work group size and the double buffering chunk size for a single GPU of the test platforms. (b) shows the combined benefit of autotuning both the local work group size and the multi-GPU load imbalance using both devices (GPU+GPU or GPU+CPU) of the test platforms. Longer bars are better.
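
One simple way to correct a load imbalance between two unequal devices is to split the work in proportion to each device's measured throughput. The sketch below illustrates that idea only; it is an assumption, not a description of Maestro's load balancer, and the throughput figures are hypothetical.

```python
def split_work(total_items, throughputs):
    """Divide total_items across devices in proportion to measured throughput."""
    total_rate = sum(throughputs)
    shares = [int(total_items * rate / total_rate) for rate in throughputs]
    # Rounding down can leave a few items over; give them to the fastest device.
    shares[throughputs.index(max(throughputs))] += total_items - sum(shares)
    return shares


def finish_time(shares, throughputs):
    """The run ends when the slowest device finishes its share."""
    return max(share / rate for share, rate in zip(shares, throughputs))


if __name__ == "__main__":
    rates = [300.0, 100.0]  # hypothetical items/s for a GPU and a CPU
    balanced = split_work(1000, rates)
    print("balanced split:", balanced, "->", finish_time(balanced, rates), "s")
    print("even split:    ", [500, 500], "->", finish_time([500, 500], rates), "s")
```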

# #2: Memory Bandwidth and Capacity

### **Critical Concern: Memory Capacity**

|                        | 2010       | 2018       | Factor Change |
|------------------------|------------|------------|---------------|
| System peak            | 2 Pf/s     | 1 Ef/s     | 500           |
| Power                  | 6 MW       | 20 MW      | 3             |
| System Memory          | 0.3 PB     | 10 PB      | 33            |
| Node Performance       | 0.125 Tf/s | 10 Tf/s    | 80            |
| Node Memory BW         | 25 GB/s    | 400 GB/s   | 16            |
| Node Concurrency       | 12 CPUs    | 1,000 CPUs | 83            |
| Interconnect BW        | 1.5 GB/s   | 50 GB/s    | 33            |
| System Size (nodes)    | 20 K nodes | 1 M nodes  | 50            |
| Total Concurrency      | 225 K      | 1 B        | 4,444         |
| Storage                | 15 PB      | 300 PB     | 20            |
| Input/Output bandwidth | 0.2 TB/s   | 20 TB/s    | 100           |

 Table 1: Potential Exascale Computer Design for 2018 and its relationship to current HPC designs.

- Small memory capacity has a profound impact on other features
- Feeding the core
- Poor efficiencies
- Small messages, I/O
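
The pressure on memory becomes visible when the Table 1 figures are expressed per unit of compute. The short calculation below uses only values from Table 1; the decimal prefixes and the helper names are assumptions made for the example.

```python
# Values copied from Table 1; decimal prefixes are assumed.
GIGA, TERA, PETA, EXA = 1e9, 1e12, 1e15, 1e18

SYSTEMS = {
    2010: {"peak_flops": 2 * PETA, "memory_bytes": 0.3 * PETA,
           "node_flops": 0.125 * TERA, "node_bw": 25 * GIGA},
    2018: {"peak_flops": 1 * EXA, "memory_bytes": 10 * PETA,
           "node_flops": 10 * TERA, "node_bw": 400 * GIGA},
}


def capacity_per_flop(system):
    """Bytes of system memory per flop/s of system peak."""
    return system["memory_bytes"] / system["peak_flops"]


def bandwidth_per_flop(system):
    """Bytes of node memory bandwidth per flop of node performance."""
    return system["node_bw"] / system["node_flops"]


if __name__ == "__main__":
    for year, system in SYSTEMS.items():
        print(f"{year}: {capacity_per_flop(system):.3f} B per flop/s capacity, "
              f"{bandwidth_per_flop(system):.3f} B/flop bandwidth")
```

With these inputs, memory capacity per flop/s falls from about 0.15 to 0.01 bytes and node bandwidth per flop from about 0.2 to 0.04 bytes, which is the arithmetic behind "feeding the core".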

### **Memory Performance**



Source: DARPA Exascale Computing Study

### **Memory Capacity**



Source: DARPA Exascale Computing Study

# **New Technologies Offer Promise**

| Device Type         | HDD                      | DRAM                   | NAND Flash            | FRAM                   | MRAM                   | STTRAM                 | PCRAM                 | NRAM                   |
|---------------------|--------------------------|------------------------|-----------------------|------------------------|------------------------|------------------------|-----------------------|------------------------|
| Maturity            | Product                  | Product                | Product               | Product                | Product                | Prototype              | Product               | Prototype              |
| Present Density     | 400Gb/in<sup>2</sup> [7] | 8Gb/chip [9]           | 64Gb/chip [10]        | 128Mb/chip             | 32Mb/chip              | 2Mb/chip               | 512Mb/chip            | NA                     |
| Cell Size (SLC)     | (2/3)F<sup>2</sup>       | 6F<sup>2</sup>         | 4F<sup>2</sup>        | 6F<sup>2</sup>         | 20F<sup>2</sup>        | 4F<sup>2</sup>         | 5F<sup>2</sup>        | 5F<sup>2</sup>         |
| MLC Capability      | No                       | No                     | 4bits/cell            | No                     | 2bits/cell             | 4bits/cell             | 4bits/cell            | No                     |
| Program Energy/bit  | NA                       | 2pJ                    | 10nJ                  | 2pJ                    | 120pJ                  | 0.02pJ                 | 100pJ                 | 10pJ [11]              |
| Access Time (W/R)   | 9.5/8.5ms [8]            | 10/10ns                | 200/25us              | 50/75ns                | 12/12ns                | 10/10ns                | 100/20ns              | 10/10ns [11]           |
| Endurance/Retention | NA                       | 10<sup>16</sup>/64ms   | 10<sup>5</sup>/10yr   | 10<sup>15</sup>/10yr   | 10<sup>16</sup>/10yr   | 10<sup>16</sup>/10yr   | 10<sup>5</sup>/10yr   | 10<sup>16</sup>/10yr   |

| Device Type         | RRAM                  | CBRAM                   | SEM                   | Polymer                 | Molecular               | Racetrack              | Holographic           | Probe                 |
|---------------------|-----------------------|-------------------------|-----------------------|-------------------------|-------------------------|------------------------|-----------------------|-----------------------|
| Maturity            | Research              | Prototype               | Prototype             | Research                | Research                | Research               | Product               | Prototype             |
| Present Density     | 64Kb/chip             | 2Mb/chip                | 128Mb/chip            | 128b/chip               | 160Kb/chip              | NA                     | 515Gb/in<sup>2</sup>  | 1Tb/in<sup>2</sup>    |
| Cell Size           | 6F<sup>2</sup>        | 6F<sup>2</sup>          | 4F<sup>2</sup>        | 6F<sup>2</sup>          | 6F<sup>2</sup>          | N/A                    | N/A                   | N/A                   |
| MLC Capability      | 2bits/cell            | 2bits/cell              | No                    | 2bits/cell              | No                      | 12bits/cell            | N/A                   | N/A                   |
| Program Energy/bit  | 2pJ                   | 2pJ                     | 13pJ                  | NA                      | NA                      | 2pJ                    | N/A                   | 100pJ [12]            |
| Access Time (W/R)   | 10/20ns               | 50/50ns                 | 100/20ns              | 30/30ns                 | 20/20ns                 | 10/10ns                | 3.1/5.4ms             | 10/10us               |
| Endurance/Retention | 10<sup>6</sup>/10yr   | 10<sup>6</sup>/Months   | 10<sup>9</sup>/days   | 10<sup>4</sup>/Months   | 10<sup>5</sup>/Months   | 10<sup>16</sup>/10yr   | 10<sup>5</sup>/50yr   | 10<sup>5</sup>/NA     |

M.H. Kryder et al., "After Hard Drives," IEEE Trans. Mag.,

### Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems

Jeffrey Vetter, ORNL; Robert Schreiber, HP Labs; Trevor Mudge, U Michigan; Yuan Xie, PSU

|                                    | SRAM            | DRAM          | NAND Flash                    | PC-RAM          | STT-RAM         | R-RAM           |
|------------------------------------|-----------------|---------------|-------------------------------|-----------------|-----------------|-----------------|
| Data Retention                     | N               | N             | Y                             | Y               | Y               | Y               |
| Memory Cell Factor (F<sup>2</sup>) | 50-120          | 6-10          | 2-5                           | 6-12            | 4-20            | <1              |
| Read Time (ns)                     | 1               | 30            | 50                            | 20-50           | 2-20            | <50             |
| Write / Erase Time (ns)            | 1               | 50            | 10<sup>6</sup>-10<sup>5</sup> | 50-120          | 2-20            | <100            |
| Number of Rewrites                 | 10<sup>16</sup> | 10<sup>16</sup> | 10<sup>5</sup>              | 10<sup>10</sup> | 10<sup>15</sup> | 10<sup>15</sup> |
| Power Read/Write                   | Low             | Low           | High                          | Low             | Low             | Low             |
| Power (Other than R/W)             | Leakage Current | Refresh Power | None                          | None            | None            | None            |

#### Impact and Champions

#### Reliance on NVM addresses device scalability, energy efficiency and reliability concerns associated with DRAM

- More memory: NVM scalability and density permit significantly more memory per core than projected by current exascale estimates.
- Less power: NVMs require zero standby power.
- More reliable: NVM alleviates the increasing DRAM soft-error rate problem.
- Node architecture with persistent storage near processing elements enables new computation paradigms
  - Low-cost checkpoints, easing checkpoint frequency concerns.
  - Inter-process data sharing, easing in-situ analysis (UQ, Visualization)
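
To see why cheaper checkpoints ease the frequency concern, one can use Young's first-order approximation for the checkpoint interval, tau = sqrt(2 x C x MTBF), where C is the time to write one checkpoint. This formula is brought in for illustration and does not come from the Blackcomb project; the MTBF and checkpoint costs below are hypothetical, and the approximation holds only when C is much smaller than the MTBF.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s, mtbf_s):
    """Approximate fraction of run time lost to checkpoints plus rework.

    With interval tau, writing checkpoints costs about C / tau and the
    expected rework after a failure costs about tau / (2 * MTBF).
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)


if __name__ == "__main__":
    mtbf = 24 * 3600.0  # hypothetical system MTBF: one day
    # Hypothetical checkpoint costs for a disk-based store and node-local NVM.
    for label, cost in [("disk", 600.0), ("node-local NVM", 6.0)]:
        print(f"{label}: checkpoint every {young_interval(cost, mtbf):.0f} s, "
              f"overhead {overhead_fraction(cost, mtbf):.1%}")
```

In this model the overhead scales with the square root of the checkpoint cost, so a checkpoint that is 100 times cheaper cuts the overhead by a factor of 10 while allowing checkpoints 10 times as often.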

#### Novel Ideas

- New resilience-aware designs for non-volatile memory applications
  - Mechanical-disk-based data-stores are completely replaced with energy-efficient non-volatile memories (NVM).
  - Most levels of the hierarchy, including DRAM and last levels of SRAM cache, are completely eliminated.
- New energy-aware systems/applications for non-volatile memories (nanostores)
  - Compute capacity, consisting of balanced low-power simple cores, is co-located with the data store.

#### Milestones

- Identify and evaluate the most promising non-volatile memory (NVM) device technologies.
- Explore assembly of NVM technologies into a storage and memory stack
- Build the abstractions and interfaces that allow software to exploit NVM to its best advantage
- Propose an exascale HPC system architecture that builds on our new memory architecture
- Characterize key DOE applications and investigate how they can benefit from these new technologies







# Opportunities go far beyond a plug-in replacement for disk drives...

- New distributed computer architectures that address exascale resilience, energy, and performance requirements
  - replace mechanical-disk-based data-stores with energy-efficient non-volatile memories
  - explore opportunities for NVM, from a plug-compatible replacement (like the NV DIMM, below) to a radical new data-centric compute hierarchy (nanostores)
  - place low power compute cores close to the data store
  - reduce number of levels in the memory hierarchy
- Adapt existing software systems to exploit these new capabilities





# How might these new technologies help applications in new ways?

- Major differences
  - Persistence
  - Read/Write asymmetries
  - Zero standby power
- Assuming new (disruptive) technologies, how should applications use them in an exascale architecture?
  - What is the proper level of integration into the memory hierarchy?
  - What data should be stored in this memory?
  - How should we change the usage scenarios?
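
The read/write asymmetry can be made concrete with the access times from the device comparison table in "New Technologies Offer Promise" (DRAM 10/10 ns and PCRAM 100/20 ns, write/read). The sketch below is a deliberately crude model: it assumes each access costs its raw device latency and ignores caches, queuing, and bandwidth; the function name is hypothetical.

```python
# (write_ns, read_ns) taken from the Access Time (W/R) row of the device table.
ACCESS_NS = {
    "DRAM": (10.0, 10.0),
    "PCRAM": (100.0, 20.0),
}


def mean_access_ns(device, write_fraction):
    """Average latency per access for a given share of writes (0.0 to 1.0)."""
    t_write, t_read = ACCESS_NS[device]
    return write_fraction * t_write + (1.0 - write_fraction) * t_read


if __name__ == "__main__":
    for write_fraction in (0.0, 0.1, 0.5):
        dram = mean_access_ns("DRAM", write_fraction)
        pcram = mean_access_ns("PCRAM", write_fraction)
        print(f"writes {write_fraction:.0%}: DRAM {dram:.0f} ns, "
              f"PCRAM {pcram:.0f} ns ({pcram / dram:.1f}x)")
```

Under this model the PCRAM penalty grows from 2x for read-only data to 6x when half of the accesses are writes, which suggests placing read-mostly data in such a memory.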

# App1

- A computational fluid dynamics solver
  - Covers a broad range of applications
  - We instrument the eddy problem, a 2D problem
- RWT



Cumulative distribution of RWT value for NEK5000

# Summary

- Toward Exascale
  - Highlights from recent projections for exascale
- Challenges
  - Micro, macro power
  - Memory capacity and bandwidth
  - Parallelism
  - Programmability
  - Algorithms
- Research solutions
  - Heterogeneity with GPUs: Keeneland
  - Programming models: SHOC, Maestro
  - NVRAM

- Applications teams should
  - Prepare for hardware diversity!
  - Think of new ways to employ technologies
- <u>http://ft.ornl.gov</u>