
Parallel Algorithm Design

General Guidelines

Given a program, we are interested in:

- which tasks can be performed concurrently
- how to map concurrent tasks (ops) onto multiple threads/processes/devices
- how to partition data and assign it to processes
- how to handle concurrent accesses to shared data
- synchronization to ensure data integrity (correctness)

Task Decomposition

A task is a unit of computation that can be extracted from the main program, assigned to a process, and run concurrently with other tasks.

For example, matrix-vector multiplication \(y = Ax\) can be decomposed into \(n\) tasks, one per output element:

y = [A[i, :] @ x for i in range(n)]

Task Dependencies

If a task A requires results/data from task B, then A depends on B, since A has to wait until B is finished.

Thus, we can represent the dependencies as a directed acyclic graph (DAG) where tasks are nodes and dependencies are edges. A node is a start node if it has no incoming edges, and a finish node if it has no outgoing edges.

Suppose each task takes approximately the same time and there are as many processes as needed; then the longest path in the graph dominates the running time. More generally, if we weight each node with the execution time of its task, the most costly path dominates.

Degree of Concurrency

The maximum degree of concurrency is the maximum number of tasks that can be executed simultaneously at any given time.

The average degree of concurrency is the average number of tasks that can be executed simultaneously over the program's execution.

Consider the task dependency graph. Since it is a DAG, each node can be assigned a level, defined as the number of nodes on the longest path from a start node to it.
The degree of concurrency at a level is the number of nodes at that level.
The critical path is the path with the largest total node weight (for unit-weight tasks, the largest number of nodes).
The maximum degree of concurrency is the maximum, over all levels, of the degree of concurrency at that level.
The average degree of concurrency is the total weight of all tasks divided by the total weight along the critical path.
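For a concrete illustration, here is a minimal Python sketch (the toy graph `edges` and the helper names are made up for this example) that assigns levels to the nodes of a small unit-weight task DAG and derives the maximum and average degrees of concurrency from them.

```python
from collections import defaultdict

# Toy task graph: an edge u -> v means task v depends on task u.
edges = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}

def levels(edges):
    # Level of a node = number of nodes on the longest path from a start node to it.
    preds = defaultdict(list)
    for u, vs in edges.items():
        for v in vs:
            preds[v].append(u)
    memo = {}
    def level(v):
        if v not in memo:
            memo[v] = 1 + max((level(u) for u in preds[v]), default=0)
        return memo[v]
    return {v: level(v) for v in edges}

lv = levels(edges)                              # {'a': 1, 'b': 1, 'c': 2, 'd': 2, 'e': 3}
tasks_per_level = defaultdict(int)
for node, l in lv.items():
    tasks_per_level[l] += 1

max_degree = max(tasks_per_level.values())      # max #tasks runnable simultaneously -> 2
critical_path = max(lv.values())                # longest path length in tasks -> 3
avg_degree = len(edges) / critical_path         # total tasks / critical path length -> 5/3
print(max_degree, critical_path, avg_degree)
```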

Granularity

Granularity is determined by the number of tasks and the workload (size) of each task. It is often described as coarse-grained or fine-grained.

For the matrix-vector multiplication example, a fine-grained decomposition uses \(n\) tasks:

y = [A[i, :] @ x for i in range(n)]

A coarse-grained decomposition uses 2 tasks:

y_top, y_bottom = A[:n//2, :] @ x, A[n//2:, :] @ x

With finer granularity, more concurrency appears to be possible. However, it is harder to divide the work evenly, and computing the partitioning itself takes more time.
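As a rough sketch of the contrast (the names `row_task` and `block_task` are invented here), the fine-grained version below submits one task per output element to a thread pool, while the coarse-grained version submits only two block tasks.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

n = 8
A = np.random.rand(n, n)
x = np.random.rand(n)

def row_task(i):
    # Fine-grained task: compute a single output element y[i].
    return A[i, :] @ x

def block_task(rows):
    # Coarse-grained task: compute a whole block of output elements.
    return A[rows, :] @ x

with ThreadPoolExecutor() as ex:
    y_fine = np.array(list(ex.map(row_task, range(n))))            # n tasks
    halves = [slice(0, n // 2), slice(n // 2, n)]
    y_coarse = np.concatenate(list(ex.map(block_task, halves)))    # 2 tasks

assert np.allclose(y_fine, A @ x) and np.allclose(y_coarse, A @ x)
```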

Task Interactions

Task dependencies only capture producer-consumer interactions, i.e. the output of one task is the input of another. However, there are other kinds of interactions, for example exchanging data or synchronizing.

Read-only interactions only read data shared among tasks. For example, all tasks in matrix-vector multiplication \(y = Ax\) need to read the vector \(x\).

In read-write interactions, tasks both read and write shared data. For example, several threads read the same array and accumulate its sum into one shared variable.

Typically, read-only data is safer to distribute across processes, while read-write data should be kept on the same process as much as possible.
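A minimal threading sketch of the summation example above (the variable names are illustrative): the array is read-only shared data and needs no coordination, while the shared accumulator is read-write and is protected with a lock.

```python
import threading

data = list(range(1000))   # read-only shared data
total = 0                  # read-write shared data
lock = threading.Lock()

def worker(indices):
    global total
    partial = sum(data[i] for i in indices)   # reading shared data needs no coordination
    with lock:                                # writing shared data must be synchronized
        total += partial

threads = [threading.Thread(target=worker, args=(range(i, len(data), 4),)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert total == sum(data)
```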

Decomposition Techniques

Recursive Decomposition

Typically useful for problems solved with a divide-and-conquer strategy, where the subproblems can be solved concurrently.

The program recursively spawns new processes/threads for the subproblems and then recollects their results.
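For example, a parallel merge sort can recursively spawn a child process for one half of the array, sort the other half in the current process, and merge the results when the child finishes. This is only an illustrative sketch using `multiprocessing`; the `depth` cutoff and function names are choices made here, not part of the original notes.

```python
import multiprocessing as mp

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

def psort(arr, depth, conn=None):
    # Recursively spawn a child process for one half, sort the other half locally, then recollect.
    if depth <= 0 or len(arr) <= 1:
        result = sorted(arr)
    else:
        mid = len(arr) // 2
        parent_end, child_end = mp.Pipe()
        child = mp.Process(target=psort, args=(arr[:mid], depth - 1, child_end))
        child.start()
        right = psort(arr[mid:], depth - 1)   # this process handles the other half
        left = parent_end.recv()              # recollect the child's result
        child.join()
        result = merge(left, right)
    if conn is not None:
        conn.send(result)
        conn.close()
    return result

if __name__ == "__main__":
    print(psort([5, 3, 8, 1, 9, 2, 7, 4], depth=2))
```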

Data Decomposition

Partition the data on which computations are performed.

For example, for matrix-vector multiplication \(y = Ax\), we can partition on the output vector \(y\): each process computes one dot product between a row of \(A\) and \(x\).
Or we can partition on the input data: partition \(A\) into columns and \(x\) into scalars, so each process computes a partial result vector, and the partial results are summed to obtain \(y\).
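A minimal NumPy sketch of both partitionings (variable names are made up for this example): output partitioning computes one dot product per task, while input partitioning computes one partial vector per task and sums the partials.

```python
import numpy as np

n = 4
A = np.arange(float(n * n)).reshape(n, n)
x = np.ones(n)

# Output partitioning: task i owns y[i] and computes a row-times-vector dot product.
y_out = np.array([A[i, :] @ x for i in range(n)])

# Input partitioning: task j owns column j of A and the scalar x[j];
# each task produces a partial vector, and the partial results are summed.
partials = [A[:, j] * x[j] for j in range(n)]
y_in = np.sum(partials, axis=0)

assert np.allclose(y_out, A @ x) and np.allclose(y_in, A @ x)
```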

Mapping to Processes

Given a task graph where each task takes a different amount of time, we want to map the tasks onto \(p\) processes. The goal is to minimize the completion time by:

- balancing the load and maximizing the use of concurrency (so processes don't sit idle)
- minimizing interactions among processes

Static vs. Dynamic Mapping

| | static | dynamic |
| --- | --- | --- |
| how | assign tasks before execution starts | assign tasks during execution |
| example | parallel_for | Python Pool |
| pros | less partitioning overhead; if task sizes are well known, the load can be balanced | dynamic load balancing, usually more effective than static when task sizes are unknown |
| cons | if task sizes are unknown, a naive assignment may result in severe load imbalance | overhead for dynamically assigning work; overhead for data movement |
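A rough sketch of the two mapping styles with `multiprocessing` (task sizes are simulated with random sleeps, and the function names are invented): static mapping fixes each process's block of tasks before execution starts, while dynamic mapping lets workers pull one task at a time from a pool.

```python
import multiprocessing as mp
import random
import time

def work(task_id):
    # Simulate a task whose size is not known in advance.
    time.sleep(random.uniform(0.0, 0.01))
    return task_id

def work_block(block):
    # Static mapping: this whole block was assigned to one process up front.
    return [work(t) for t in block]

if __name__ == "__main__":
    tasks = list(range(64))
    p = 4

    # Static mapping: partition the tasks into p fixed blocks before execution starts.
    blocks = [tasks[i::p] for i in range(p)]
    with mp.Pool(p) as pool:
        static_results = pool.map(work_block, blocks)

    # Dynamic mapping: workers pull one task at a time as they become free.
    with mp.Pool(p) as pool:
        dynamic_results = list(pool.imap_unordered(work, tasks, chunksize=1))
```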

Data Interaction Overheads

Processes share data and/or may require data generated by other processes. We need to:

- minimize the volume of interactions (use local data as much as possible)
- minimize the frequency of interactions (transfer shared data in large chunks to reduce the number of interactions)

Contention and hotspots

Contention happens when shared data is accessed concurrently. For example, several processes try to access the same memory block, or send messages to one specific process.

One common category of problems is reduction, in which an array is reduced to a single value, such as a sum or product. A naive parallel sum has \(p\) processes each compute the sum of its partition and then add it to one shared variable, causing high contention. A better approach decentralizes the shared data: the partial sums are combined pairwise across processes in \(\lg p\) steps.
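A minimal sketch of the pairwise (tree) combining step, assuming each of the \(p\) processes has already summed its own partition; in a real implementation each pair in a round would be combined by a different process, so the reduction finishes in roughly \(\lg p\) rounds rather than \(p\) contended additions.

```python
def tree_reduce(partials):
    # Combine p partial results pairwise instead of adding all of them
    # into a single contended variable; takes about lg(p) rounds.
    vals = list(partials)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # an odd leftover carries over to the next round
            nxt.append(vals[-1])
        vals = nxt                 # each pair above could be combined by a different process
    return vals[0]

partials = [sum(range(i * 100, (i + 1) * 100)) for i in range(8)]  # p = 8 local sums
assert tree_reduce(partials) == sum(range(800))
```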

Overlap computations with interactions

A process may sit idle waiting for shared data until a synchronization completes. We can assign it more tasks before the current task is completed (e.g. via a context switch) or initiate an interaction earlier than strictly necessary.

This needs extra hardware/OS support. It is commonly seen in OS kernels: the OS switches among program processes while one waits for kernel-mode execution or disk I/O.

Replicate Data or Computations

Instead of paying interaction overheads, simply replicate the necessary data on each process.

This is beneficial when the interactions are read-only and the shared data is not too large.
If the interactions include writes, maintaining coherent copies adds more overhead, since changes must be broadcast to all processes.

Common Parallel Algorithm Models

| | data parallel | work pool | master-slave |
| --- | --- | --- | --- |
| description | each process works on one data partition | processes take work from a pool, taking another item when the current one is finished | a master process generates work and allocates it to worker processes |
| decomposition | static and uniform data partitioning | depends on the problem | depends on the problem |
| mapping | (mostly) static | dynamic | often dynamic |
| strategies for interactions | locality-preserving decomposition; overlap computation with interaction | adjust granularity; trade off load imbalance against the overhead of managing the pool | choose granularity for the master; overlap computation with interaction |
| typical use case | problems with known size | non-fixed and/or imbalanced task sizes | distributed parallel architectures (e.g. web server) |
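A minimal master-worker sketch with `multiprocessing` queues (the squaring workload and the `None` stop signal are arbitrary choices for illustration): the master generates work items, workers pull them from a shared queue, and results flow back through a second queue.

```python
import multiprocessing as mp

def worker(task_queue, result_queue):
    # Workers repeatedly take work items until the master sends a stop signal (None).
    while True:
        item = task_queue.get()
        if item is None:
            break
        result_queue.put(item * item)

if __name__ == "__main__":
    task_queue, result_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(task_queue, result_queue)) for _ in range(4)]
    for w in workers:
        w.start()

    # Master generates work and allocates it to workers through the queue.
    for i in range(20):
        task_queue.put(i)
    for _ in workers:
        task_queue.put(None)          # one stop signal per worker

    results = [result_queue.get() for _ in range(20)]
    for w in workers:
        w.join()
    print(sorted(results))
```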

Performance Model

Note that \(N\) processes often do not yield \(N\) times the performance, due to overheads.

Overheads include inter-process communication, idling, and excess computation (sometimes necessary for reducing communications).

Given the resources (number of processes), the execution time is simply end_time - start_time, which is determined by the slowest process.

Speedup

One measure of performance gain is the speedup \(S = T_s/T_p\), i.e. the ratio of the serial execution time to the parallel execution time with \(p\) processes.

Note that \(T_s\) must come from a fully optimized serial implementation; otherwise this measurement is not useful.

Considerations

In general, \(S \leq p\): the maximum achievable speedup is linear speedup.

Superlinear speedup (\(S > p\)) can happen if the sequential algorithm is at a disadvantage compared to the parallel version. For example, the data may be too large to fit into the L1 cache of a single core (caches are per-core), so data accesses are slower. However, in most cases, especially for shared-memory models, this rarely happens.

Amdahl's Law

Let the total time for a sequential execution be \(1\),

\[ T_s = T_{1} + T_{2} = 1 \]

where \(T_1\) is the fraction done sequentially and \(T_2\) is the parallelizable fraction (measured with 1 process). Now consider \(p\) processes: \(T_1\) is not parallelizable and stays unchanged, while \(T_2\) can enjoy at most a linear speedup,

\[ T_p = T_{1} + T_{2}' \geq T_{1} + T_{2}/p \]

Thus, the speedup is

\[ S(p) = \frac{T_s}{T_p} \leq \frac{1}{T_1 + T_2/p} \leq \frac{1}{T_1} \]

Thus, the speedup is upper bounded by the reciprocal of the sequential fraction.
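A small sketch of the bound (the 10% serial fraction is just an example): even with many processes, the speedup saturates near \(1/T_1\).

```python
def amdahl_speedup(serial_fraction, p):
    # S(p) <= 1 / (T1 + T2 / p), with T1 + T2 = 1 and T1 the serial fraction.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.1, p), 2))   # approaches 1 / 0.1 = 10
```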

Efficiency

Efficiency is defined as

\[E = S / p\]

Assuming no superlinear speedup, \(0 < S \leq p\), so \(0 < E \leq 1\), where \(1\) is the ideal efficiency.
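As a tiny numeric illustration (the numbers are made up): a measured speedup of 6 on 8 processes gives an efficiency of 0.75.

```python
def efficiency(speedup, p):
    # E = S / p, with E = 1 the ideal (linear-speedup) case.
    return speedup / p

print(efficiency(6.0, 8))   # 0.75
```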