Introduction to the Roofline Model

Samuel Williams
Computational Research Division, Lawrence Berkeley National Lab
[email protected]

Acknowledgements

This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231, and by the DOE RAPIDS SciDAC Institute. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231, and resources of the Oak Ridge Leadership Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Why Use Performance Models or Tools?

- Identify performance bottlenecks
- Motivate software optimizations
- Determine when we're done optimizing
- Assess performance relative to machine capabilities
- Motivate the need for algorithmic changes
- Predict performance on future machines / architectures
- Set realistic expectations on performance for future procurements

- Used for HW/SW co-design to ensure future architectures are well-suited for the computational needs of today's applications.

Performance Models / Simulators

- Historically, many performance models and simulators tracked latencies to predict performance (i.e., counting cycles).
- The last two decades saw a number of latency-hiding techniques:
  - Out-of-order execution (hardware discovers parallelism to hide latency)
  - HW stream prefetching (hardware speculatively loads data)
  - Massive thread parallelism (independent threads satisfy the latency-bandwidth product)
- Effective latency hiding has resulted in a shift from a latency-limited computing regime to a throughput-limited computing regime.

Roofline Model

- The Roofline Model is a throughput-oriented performance model:
  - Tracks rates, not times
  - Augmented with Little's Law (concurrency = latency * bandwidth)
  - Independent of ISA and architecture (applies to CPUs, GPUs, Google TPUs [1], etc.)
- Three components:
  1. Machine characterization

     (realistic performance potential of the system)
  2. Application execution monitoring
  3. Theoretical application bounds
     (how well could my app perform with perfect compilers, caches, overlap, ...)

https://crd.lbl.gov/departments/computer-science/PAR/research/roofline

[1] Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA, 2017.

(DRAM) Roofline

- One could hope to always attain peak performance (Flop/s).
- However, finite locality (reuse) and bandwidth limit performance.
- Consider an idealized processor and caches, and plot the performance bound using Arithmetic Intensity (AI) as the x-axis:
  - AI = Flops / Bytes presented to DRAM
  - Attainable Flop/s = min( peak Flop/s, AI * peak GB/s )
- A log-log scale makes it easy to doodle, extrapolate performance along Moore's Law, etc.
- Kernels with AI less than the machine balance are ultimately DRAM bound (we'll refine this later).

[Figure: Roofline plot. x-axis: Arithmetic Intensity (Flop:Byte); y-axis: Attainable Flop/s. The sloped DRAM GB/s line (memory-bound region) meets the flat Peak Flop/s ceiling (compute-bound region).]

Roofline Example #1

- Typical machine balance is 5-10 flops per byte, and unlikely to improve.
- Consider STREAM Triad:

  #pragma omp parallel for
  for(i=0;i<N;i++){
    Z[i] = X[i] + alpha*Y[i];
  }

- Machine balance is an artifact of technology and money; at 5-10 flops per byte, 40-80 flops per double are needed to exploit the compute capability.

[Figure: Roofline plot with TRIAD at AI = 0.083, far out on the memory-bound slope where Gflop/s = AI * DRAM GB/s.]

Roofline Example #2

- Conversely, consider a 7-point constant-coefficient stencil:
  - The cache can filter all but 1 read and 1 write per point.
  - AI = 0.44 flops per byte: still memory bound, but 5x the TRIAD flop rate.

  #pragma omp parallel for
  for(k=1;k<dim+1;k++){
  for(j=1;j<dim+1;j++){
  for(i=1;i<dim+1;i++){
    int ijk = i + j*jStride + k*kStride;
    new[ijk] = -6.0*old[ijk]
             + old[ijk-1] + old[ijk+1]
             + old[ijk-jStride]

             + old[ijk+jStride]
             + old[ijk-kStride] + old[ijk+kStride];
  }}}

- 7 flops and 8 memory references (7 reads, 1 store) per point.

[Figure: Roofline plot showing both kernels on the DRAM slope (Gflop/s = AI * DRAM GB/s): TRIAD at AI = 0.083 and the 7-point stencil at AI = 0.44.]

Hierarchical Roofline

- Real processors have multiple levels of memory:
  - Registers

  - L1, L2, L3 cache
  - MCDRAM/HBM (KNL / GPU device memory)
  - DDR (main memory)
  - NVRAM (non-volatile memory)
- Applications can have locality in each level; unique data movements imply unique AIs.
- Moreover, each level will have a unique bandwidth.

Hierarchical Roofline

- Measure a bandwidth and an AI for each level of memory.
- Although a loop nest may have multiple AIs and multiple bounds (flops, L1, L2, ... DRAM), performance is bound by the minimum.
- Construct a superposition of Rooflines: the Peak Flop/s ceiling plus one bandwidth slope per level (L2 cache GB/s, MCDRAM GB/s, DDR GB/s, ...).

[Figure, repeated over four slides with different cases highlighted:
- DDR bound: DDR AI*BW < MCDRAM AI*BW.
- The DDR bottleneck pulls performance below the MCDRAM Roofline.
- MCDRAM bound: MCDRAM AI*BW < DDR AI*BW.
- The MCDRAM bottleneck pulls performance below the DDR Roofline.]

Data, Instruction, Thread-Level Parallelism

- We have assumed one can attain peak flops with high locality. In reality, this is premised on:
  - sufficient unrolling and out-of-order execution (hide FPU latency)
  - vectorization (e.g., 16 flops per instruction)
  - special instructions (e.g., fused multiply-add)
  - OpenMP across multiple cores
- Without these, peak performance is not attainable, and some kernels can transition from memory-bound to compute-bound.
- N.b., in reality DRAM bandwidth is often tied to DLP and TLP as well (a single core can't saturate bandwidth with scalar code).

[Figure: Roofline with lowered in-core ceilings ("No FMA", "No vectorization"); lack of DLP pulls performance below the DDR Roofline.]

Initial LBL/NERSC Roofline Efforts

Initial LBL Roofline Efforts / Goals

1. Node characterization
2. Application instrumentation/characterization
3. Using Roofline to drive application performance analysis and optimization for KNL

Node Characterization?

- Marketing numbers for Cori / KNL can be deceptive:
  - TurboMode / underclocking for AVX
  - pin bandwidth vs. real bandwidth
  - compiler failings on high-AI loops
- LBL developed the Empirical Roofline Toolkit (ERT):

  - Characterizes CPU/GPU systems: peak flop rates and bandwidths for each level of memory.
  - MPI+OpenMP/CUDA (including multiple GPUs).
  - https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/

[Figure: ERT results on SummitDev / 4 GPUs.]

Instrumentation with Performance Counters?

- Characterizing applications with performance counters can be problematic:
  - Flop counters can be broken/missing in production processors.
  - Vectorization/masking can complicate counting flops.
  - Counting loads and stores doesn't capture cache reuse, while counting cache misses doesn't account for prefetchers.
  - DRAM counters (uncore PMU) might be accurate, but are privileged and thus nominally inaccessible in user mode; access may need vendor- (e.g., Cray) and center- (e.g., NERSC) approved OS/kernel changes.

Forced to Cobble Together Tools

- Use tools known/observed to work on NERSC's Cori (KNL, HSW):
  - Intel SDE (Pin binary instrumentation + emulation) to create software flop counters.
  - Intel VTune (NERSC/Cray approved) to access uncore counters.
- Together these give accurate measurement of flops (HSW) and DRAM data movement (HSW and KNL).
- Used by NESAP (NERSC's KNL application readiness project) to characterize apps on Cori.
- http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/

(NERSC is LBL's production computing division; CRD is LBL's Computational Research Division; NESAP is NERSC's KNL application readiness project; LBL is part of SUPER, the DOE SciDAC3 Computer Science Institute.)

Initial Roofline Analysis of NESAP Codes

[Figure: GFLOP/s vs. Arithmetic Intensity (FLOP/byte) plots against Roofline and no-FMA ceilings, on KNL and 2P HSW, for three NESAP codes: MFDn (Original; 1, 4, 8 RHS; nRHS+SELL+SB), EMGeo (Original, SELL, SB, SELL+SB), and PICSAR (Original, w/Tiling, w/Tiling+Vect).]

Evaluation of LIKWID

- LIKWID provides easy-to-use wrappers for measuring performance counters: https://github.com/RRZE-HPC/likwid
  - Works on NERSC production systems
  - Minimal overhead (<1%)
  - Scalable in distributed memory (MPI-friendly)
  - Fast, high-level characterization
- A useful tool that complements other tools, but:
  - Limited by the quality of the hardware performance counter implementation (garbage in / garbage out)
  - No detailed timing breakdown or optimization advice

[Figure: AMReX application characterization. L2, L3, and DRAM bandwidths (GB/s, log scale 8-1024) for Combustor, MFIX, Nyx, PeleLM, WarpX, and HPGMG in 32Px1T and 4Px8T configurations on 2Px16c HSW (Cori Phase 1), against L2/L3/DRAM Roofline ceilings.]

Need an Integrated Solution

- Having to compose VTune, SDE, and plotting tools:
  - worked correctly and benefited NESAP's application readiness,
  - but forced users to learn/run multiple tools and manually parse/graph the output,
  - forced users to instrument routines of interest in their application,
  - and lacked integration with the compiler/debugger/disassembly.

- LIKWID was fast and easy to use, but suffered from the same limitations as VTune/SDE.
- ERT characterized flops and bandwidths (cache, DRAM) and is interoperable with MPI, OpenMP, and CUDA, but required users to manually parse/incorporate the output.

Intel Advisor

- Includes Roofline automation:
  - Automatically instruments applications (one dot per loop nest/function)
  - Computes FLOPS and AI for each function (CARM)
  - Full AVX-512 integration that incorporates mask values
  - Integrated cache simulator [1] (hierarchical Roofline / multiple AIs)
  - Automatically benchmarks the target system (calculates ceilings)
  - Full integration with existing Advisor capabilities
- Memory-bound? Invest in cache blocking, etc. Compute-bound? Invest in SIMD, etc.

[1] Technology preview, not in the official product roadmap so far. This version will be made available during the hands-on component of this tutorial.

Hierarchical Roofline vs. Cache-Aware Roofline
(understanding the different Roofline formulations in Advisor)

There are two major Roofline formulations:

- Hierarchical Roofline (the original Roofline, w/ DRAM, L3, L2, ...):
  - Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009; Chapter 4 of "Auto-tuning Performance on Multicore Computers", 2008.
  - Defines multiple bandwidth ceilings and multiple AIs per kernel.
  - The performance bound is the minimum of flops and the memory intercepts (a superposition of original, single-metric Rooflines).
- Cache-Aware Roofline:
  - Ilic et al., "Cache-aware Roofline model: Upgrading the loft", IEEE Computer Architecture Letters, 2014.
  - Defines multiple bandwidth ceilings, but uses a single AI (flop:L1 bytes).
  - As one loses cache locality (capacity, conflict, ...), performance falls from one bandwidth ceiling to a lower one at constant AI.

Why does this matter?

- Some tools use the Hierarchical Roofline, some use the Cache-Aware Roofline, so users need to understand the differences.
- The Cache-Aware Roofline model was integrated into the production Intel Advisor.
- An evaluation version of the Hierarchical Roofline [1] (cache simulator) has also been integrated into Intel Advisor.
- You will be allowed to explore both in the hands-on component of this tutorial.

[1] Technology preview, not in the official product roadmap so far.

This version will be made available during the hands-on component of this tutorial.

Hierarchical Roofline:
- Captures cache effects.
- AI is Flop:Bytes after being filtered by lower cache levels.
- Multiple arithmetic intensities (one per level of memory).
- AI dependent on problem size (capacity misses reduce AI).
- Memory/cache/locality effects are observed as decreased AI.
- Requires performance counters or a cache simulator to correctly measure AI.

Cache-Aware Roofline:
- Captures cache effects.
- AI is Flop:Bytes as presented to the L1 cache (plus non-temporal stores).
- Single arithmetic intensity.
- AI independent of problem size.
- Memory/cache/locality effects are observed as decreased performance.
- Requires static analysis or binary instrumentation to measure AI.

Example: STREAM

- L1 AI:
  - 2 flops
  - 2 x 8B load (old)
  - 1 x 8B store (new)
  - = 0.08 flops per byte

  #pragma omp parallel for
  for(i=0;i<N;i++){
    Z[i] = X[i] + alpha*Y[i];
  }

- Iteration i doesn't touch any data associated with iteration i+delta, for any delta; the lack of reuse leads to a DRAM AI equal to the L1 AI.

Example: STREAM

- Hierarchical Roofline:
  - Multiple AIs: 1) Flop:DRAM bytes, 2) Flop:L1 bytes (the same here).
  - Performance is bound to the minimum of the two intercepts: AI_L1 * L1 GB/s and AI_DRAM * DRAM GB/s.
  - Observed performance is correlated with DRAM bandwidth.
- Cache-Aware Roofline:
  - Single AI based on flop:L1 bytes.

[Figure: side-by-side Roofline plots for STREAM at AI = 0.083, with Peak Flop/s, L1 GB/s, and DRAM GB/s ceilings.]

Example: 7-point Stencil (Small Problem)

- L1 AI:
  - 7 flops
  - 7 x 8B load (old)
  - 1 x 8B store (new)
  - = 0.11 flops per byte
  - (some compilers may do register shuffles to reduce the number of loads)

  #pragma omp parallel for
  for(k=1;k<dim+1;k++){
  for(j=1;j<dim+1;j++){
  for(i=1;i<dim+1;i++){
    int ijk = i + j*jStride + k*kStride;
    new[ijk] = -6.0*old[ijk]
             + old[ijk-1] + old[ijk+1]
             + old[ijk-jStride] + old[ijk+jStride]
             + old[ijk-kStride] + old[ijk+kStride];
  }}}

- Hierarchical Roofline:
  - Multiple AIs: 1) flop:DRAM ~ 0.44, 2) flop:L1 ~ 0.11.
  - The performance bound is the minimum of the two.
  - Observed performance is between the L1 and DRAM lines (i.e., some cache locality).
- Cache-Aware Roofline:
  - Single AI based on flop:L1 bytes.

[Figure: side-by-side Roofline plots for the small problem; the Hierarchical plot marks AIs 0.11 and 0.44, the Cache-Aware plot marks AI 0.11.]

Example: 7-point Stencil (Large Problem)

- Hierarchical Roofline:
  - Multiple AIs: 1) flop:DRAM ~ 0.20, 2) flop:L1 ~ 0.11.
  - Observed performance is closer to the DRAM line (i.e., less cache locality).
- Cache-Aware Roofline:
  - Single AI based on flop:L1 bytes.
  - Capacity misses reduce the DRAM AI and performance.

[Figure: side-by-side Roofline plots for the large problem; the Hierarchical plot marks AIs 0.11 and 0.20, the Cache-Aware plot marks AI 0.11.]

Example: 7-point Stencil (Observed Perf.)

- Actual observed performance is tied to the bottlenecked resource and can be well below a cache Roofline (e.g., L1).

[Figure: side-by-side Roofline plots; the observed dot sits near the DRAM line, well below the L1 ceiling, at AIs 0.11/0.20 (Hierarchical) and 0.11 (Cache-Aware).]

Questions?
