Single Processor Machines

Program, Compiler, and Uni-processor model

We know that the processor executes machine code (e.g., assembly) and that a compiler compiles source code (say C/C++/Fortran) into machine code.

|                      | Variables                                             | Operations                        | Control flow                  |
| -------------------- | ----------------------------------------------------- | --------------------------------- | ----------------------------- |
| Program              | typed variables (int, float, pointer, array, struct)  | arithmetic ops, logical ops, etc. | if, for, function calls       |
| Assembly (processor) | move bytes between memory and registers               | arithmetic instructions (ALU)     | jump/branch instructions (CU) |

In the idealized model, control flow, save, and load instructions are free, and all arithmetic operations have the same cost.
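
As a rough illustration of the mapping in the table above, the comments below spell out the conceptual assembly-level steps behind a single C statement (they are descriptive, not real assembly):

int add(int a, int b) {
    int c = a + b; // load a and b into registers, ALU add, store c
    return c;      // jump/branch back to the caller (control flow)
}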

Compiler Optimizations

Compiler optimizations aim to reduce the number of instructions by

  • Improving register reuse (fewer save/load instructions)
    • interchange nested loops
    • reorder instructions, e.g. e = a + b; f = e - 1 becomes R1 = a + b; R1 = R1 - 1
  • Eliminating unnecessary control flow (fewer branch/jump instructions)
    • unroll small loops (see the sketch below)
    • fuse loops (merge nested loops into one)
    • eliminate dead code branches
  • Strength reduction (turning expensive instructions into cheaper ones)
    • multiply by 2 becomes a left shift

However, most optimizations are performed locally; on complex code, the compiler often gives up an optimization in order to preserve correctness.
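
A minimal hand-written sketch of two of these transformations, 4-way loop unrolling plus the multiply-by-2 strength reduction (the function name is illustrative):

#include <stddef.h>

void scale_by_two(int *y, const int *x, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // unrolled: one branch per four elements
        y[i]     = x[i]     << 1; // x[i] * 2 as a left shift
        y[i + 1] = x[i + 1] << 1;
        y[i + 2] = x[i + 2] << 1;
        y[i + 3] = x[i + 3] << 1;
    }
    for (; i < n; i++)            // remainder loop for n not divisible by 4
        y[i] = x[i] << 1;
}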

Memory Hierarchy

The memory system can contain multiple levels. For example, a personal computer system (Ryzen5-2700) might have

| memory             | device                   | latency | size   | bandwidth  |
| ------------------ | ------------------------ | ------- | ------ | ---------- |
| register           | per-core                 | ~0      | bytes  | —          |
| L1 cache           | per-core                 | >1ns    | 768KB  | 1TB/sec    |
| L2 cache           | per-processor            | ~3ns    | 4MB    | 1TB/sec    |
| L3 cache (or SRAM) | per-CPU (or motherboard) | ~6ns    | 16MB   | ~500GB/sec |
| main memory        | DDR4 RAM                 | ~20ns   | 8-64GB | ~25GB/sec  |
| hard drive         | SSD                      | ~10ms   | 1TB    | ~5GB/sec   |

Caches and Locality

A cache line is the contiguous block of memory transferred between main memory and the cache in a single read/write. It typically spans a few memory addresses (64 bytes on most modern CPUs).

Since the cache has significantly lower latency, we want to keep previously accessed cache lines in it so that references to main memory are reduced. A cache hit occurs when the data required by the processor is already in cache; otherwise it is a cache miss, which results in a load from main memory.

  • Temporal locality: reuse data already in cache.
  • Spatial locality: operate on data stored close together in memory, so that the data can be brought into cache all together (see the sketch below).
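
A minimal sketch of spatial locality (N and the function names are illustrative): traversing a row-major array in row order uses every element of each loaded cache line, while column order wastes most of each line.

#include <stddef.h>
#define N 1024

double sum_row_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j]; // consecutive addresses: cache friendly
    return s;
}

double sum_col_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j]; // stride of N doubles: roughly one miss per access
    return s;
}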

Single Processor Parallelism

Single Instruction Multiple Data (SIMD)

SIMD (single instruction, multiple data) executes one arithmetic instruction (in a single clock cycle) on multiple scalars at once.

For example, AVX-512 performs arithmetic operations on 512 bits (i.e., 8 double-precision or 16 single-precision values) at a time.

Some restrictions on SIMD include: the parallelism must be exposed to the compiler (via flags or pragmas), and the data needs to be contiguous in memory and cache-aligned.
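
A minimal sketch of exposing this parallelism to the compiler, assuming GCC/Clang with OpenMP SIMD support (compile with -fopenmp-simd); restrict promises the arrays do not overlap:

#include <stddef.h>

void axpy(size_t n, double a, const double *restrict x, double *restrict y) {
    #pragma omp simd // ask the compiler to vectorize this loop
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i]; // one SIMD instruction covers several i's
}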

Memory Alignment and Strides

Note that a cache line is loaded from memory to cache as one unit, which means contiguous memory accesses within the same line cannot be done in parallel (doing so would very easily cause data conflicts). Accesses to different, non-contiguous locations, however, can be done in parallel.

Memory alignment is a common technique: instead of writing data in contiguous blocks, align them on cache-line boundaries. For example, if the cache line is 32 bytes, we can store data at

double x[];           // each double is 8 bytes
x[0], x[4], x[8], ... // stride of 32 / 8 = 4 doubles: one element per cache line
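
To get cache-line-aligned storage in the first place, one option is C11's aligned_alloc (a minimal sketch assuming a 64-byte cache line; the size must be a multiple of the alignment):

#include <stdlib.h>

int main(void) {
    // 1024 doubles starting on a 64-byte cache-line boundary
    double *x = aligned_alloc(64, 1024 * sizeof(double));
    if (x == NULL)
        return 1;
    x[0] = 1.0; // ... use x ...
    free(x);
    return 0;
}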

Fused Multiply Add (FMA)

FMA refers to operations like \(x = y + c \cdot z\), which are very common in matrix multiplication. With FMA support, such an operation can be done as one instruction instead of two. Moreover, only one rounding is performed, so the rounding error is smaller and does not accumulate across the two steps.
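
C99 exposes this as fma() in math.h, which computes \(c \cdot z + y\) with a single rounding; compilers typically lower it to one FMA instruction on supporting hardware (a minimal sketch, e.g. compiled with -mfma):

#include <math.h>
#include <stdio.h>

int main(void) {
    double y = 1.0, c = 3.0, z = 2.0;
    double two_ops = y + c * z;    // multiply, round, then add, round
    double one_op  = fma(c, z, y); // fused: c*z + y with one rounding
    printf("%.17g %.17g\n", two_ops, one_op);
    return 0;
}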

Roofline Model

Roofline: an insightful visual performance model for multicore architectures

The idea is that applications are measured by

| factor                  | unit       | level       |
| ----------------------- | ---------- | ----------- |
| arithmetic performance  | FLOPs/s    | machine     |
| memory bandwidth        | bytes/s    | machine     |
| computational intensity | FLOPs/byte | application |

Then, consider a task of \(N\) FLOPs (arithmetic operations) and \(M\) bytes of data movement. Assume a cold start (all data initially resides in DRAM) and that each byte of data is moved exactly once. Then we have

\( \text{time} = \max\left( \frac{N}{\text{peak FLOPs/s}}, \frac{M}{\text{peak bandwidth}} \right) \)

\( \frac{N}{\text{time}} = \min\left( \text{peak FLOPs/s}, \frac{N}{M} \cdot \text{peak bandwidth} \right) \)

Thus we obtain the roofline performance model, where the x-axis is the computational intensity \(N/M\) (FLOPs/byte) and the y-axis is the attainable speed (FLOPs/s): performance rises linearly along the bandwidth-limited slope until it hits the flat, compute-limited roof.
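
A minimal sketch of evaluating this formula (the peak numbers are illustrative assumptions, not measurements of any specific machine):

#include <stdio.h>

int main(void) {
    double peak_flops = 100e9; // assumed peak compute: 100 GFLOPs/s
    double peak_bw    = 25e9;  // assumed peak bandwidth: 25 GB/s
    for (double nm = 0.25; nm <= 16.0; nm *= 2.0) { // intensity N/M
        double bound = nm * peak_bw; // bandwidth-limited performance
        double perf  = bound < peak_flops ? bound : peak_flops;
        printf("N/M = %5.2f FLOPs/byte -> %6.1f GFLOPs/s\n", nm, perf / 1e9);
    }
    return 0;
}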