This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query studies, where one needs to run a large number of simulations. This work introduces a reformulation, called rank-2 Galerkin, of the Galerkin ROM for LTI dynamical systems which converts the nature of the ROM problem from memory bandwidth to compute bound. We present the details of the formulation and its implementation, and demonstrate its utility through numerical experiments using, as a test case, the simulation of elastic seismic shear waves in an axisymmetric domain. We quantify and analyze performance and scaling results for varying numbers of threads and problem sizes. Finally, we present an end-to-end demonstration of using the rank-2 Galerkin ROM for a Monte Carlo sampling study. We show that the rank-2 Galerkin ROM is one order of magnitude more efficient than the rank-1 Galerkin ROM (the current practice) and about 970X more efficient than the full order model, while maintaining excellent accuracy in both the mean and statistics of the field.

#1Renee Swischuk(MIT: Massachusetts Institute of Technology)H-Index: 2

#2Boris Kramer(MIT: Massachusetts Institute of Technology)H-Index: 12

Last. Karen Willcox(University of Texas at Austin)H-Index: 48

view all 4 authors...

This paper presents a physics-based data-driven method to learn predictive reduced-order models (ROMs) from high-fidelity simulations and illustrates it in the challenging context of a single-injec...

Last. Wen-Jing Ma(CAS: Chinese Academy of Sciences)H-Index: 3

view all 3 authors...

We present a systematic methodology for optimizing batched matrix multiplications on SW26010 many-core processor of the Sunway TaihuLight supercomputer. Five surrogate algorithms and a machine lear...

#2Nan Jiang(Missouri University of Science and Technology)H-Index: 10

Last. Zhu Wang(USC: University of South Carolina)H-Index: 17

view all 3 authors...

Many applications of computational fluid dynamics require multiple simulations of a flow under different input conditions. In this paper, a numerical algorithm is developed to efficiently determine a set of such simulations in which the individually independent members of the set are subject to different viscosity coefficients, initial conditions, and/or body forces. The proposed scheme applied to the flow ensemble leads to need to solve a single linear system with multiple right-hand sides, and...

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash its computing power only when the matrix size is large and there are sufficient number of tiles and workload for ...

Last. P. Sadayappan(OSU: Ohio State University)H-Index: 71

view all 10 authors...

Sparse Matrix-Vector (SpMV) and Sparse Matrix-Multivector (SpMM) products are key kernels for computational science and data science. While GPUs offer significantly higher peak performance and memory bandwidth than multicore CPUs, achieving high performance on sparse computations on GPUs is very challenging. A tremendous amount of recent research has focused on various GPU implementations of the SpMV kernel. But the multi-vector SpMM kernel has received much less attention. In this paper, we pre...

Last. Jeffrey S. Vetter(ORNL: Oak Ridge National Laboratory)H-Index: 55

view all 5 authors...

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and the precision loss due to computation in mixed precis...

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic exec...

This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide rang...