Data Parallel Algorithms
Data parallel algorithms typically operate on an array: the same operation is performed on each element (or each group of elements) at the same time.
Common operators
- Unary operators: \(B = f(A)\), such as `square`.
- Binary operators: \(C = f(A, B)\), such as `add`.
- Broadcast: \(C = f(a, B)\), such as `fill` (map a scalar to the array) or `saxpy` (\(aX + Y\), scalar multiplication and add).
- Memory operations: array indexing, including (strided) scatter/gather, commonly seen in `numpy`. For example, `B = A[idx]` is a gather and `A[idx] = B` is a scatter.
- Masked operations: a mask is an array of `true / false` or `0 / 1`, and the operation is only performed where the mask element is 1.
- Reduction: accumulate an array to one value with any associative op. Note that associativity is very important: it allows us to change the operation order (and hence optimize and parallelize). Examples: `sum(A)`, `prod(A)`, `dot(A, B)`.
- Scan: fill an array with partial reductions of any associative op. Example: `cumsum = scan(add)`.
  - An inclusive scan includes input \(x_i\) in output \(y_i\); an exclusive scan does not.
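For concreteness, here is how these operators look in `numpy` (a minimal sketch):

```python
import numpy as np

A = np.arange(8)            # [0 1 2 3 4 5 6 7]
B = np.ones(8, dtype=int)

sq = A ** 2                 # unary: square
C = A + B                   # binary: add
y = 2 * A + B               # broadcast: saxpy with a = 2
g = A[[0, 2, 4]]            # gather with an index array
mask = A % 2 == 0           # boolean mask
A[mask] += 1                # masked operation
total = A.sum()             # reduction with add
cs = np.cumsum(A)           # inclusive scan with add
```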
Ideal cost
Suppose that we have an infinite number of processors, free control, and free communication. Then the parallel cost is the depth of the spanned data-dependency tree.
- For unary and binary operations, each output only depends on the corresponding elements, thus the cost is \(O(1)\).
- For broadcast, the data doubles at each step, thus the cost is \(O(\log n)\).
- For reduction, associativity allows us to do the operation group-wise, thus the cost is \(\Theta(\log_b n)\), where \(b\) is the number of elements in a group; in general the operations are binary, so \(b = 2\). Note that this is also a lower bound.
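For intuition, a sketch of a group-wise reduction with \(b = 2\): written serially here, but every pairwise op within a round is independent, so each round has \(O(1)\) depth.

```python
import numpy as np

def tree_reduce(arr, op=np.add, identity=0):
    # Each round pairs up adjacent elements; all pairs are independent,
    # so with enough processors each round costs O(1) and the total
    # depth is O(log n).
    arr = np.asarray(arr)
    while len(arr) > 1:
        if len(arr) % 2 == 1:
            arr = np.append(arr, identity)  # pad to even length
        arr = op(arr[::2], arr[1::2])       # one parallel round
    return arr[0]

assert tree_reduce(np.arange(10)) == 45
```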
Matrix multiplications
For a matrix multiplication \(C = A \cdot B\), wlog assume that the matrices are all \(n\times n\). We note that each entry is
\[
C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j},
\]
which is
- an element-wise multiplication between \(A_{i,:}\) and \(B_{:, j}\), producing \(D_{i,j,k} = A_{i,k} B_{k,j}\); costs \(O(1)\),
- a reduction sum over \(D_{i,j,:}\); costs \(O(\log n)\).

Thus, the total cost is \(O(\log n)\).
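A minimal `numpy` sketch of this decomposition (`numpy` executes it serially; a parallel runtime would realize the sum as a \(\log n\)-depth tree):

```python
import numpy as np

n = 4
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# D[i, j, k] = A[i, k] * B[k, j]: all n^3 products in one O(1)-depth step
D = A[:, None, :] * B.T[None, :, :]
# Reduce over k: O(log n) depth as a tree reduction
C = D.sum(axis=2)

assert np.allclose(C, A @ B)
```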
Scan via parallel prefix
A naïve parallel scan can broadcast the first \(i\) values to an array of length \(i\) and perform a reduction on it to get the \(i\)th output. Then the cost is \(O(\log n)\), but the total work is \(1+2+\cdots + n \in O(n^2)\).
An idea is to divide one scan into two halves. Wlog assume that the operation is add. Observe that
\[
y_{2i+1} = \sum_{j=0}^{2i+1} x_j = \sum_{j=0}^{i} \left(x_{2j} + x_{2j+1}\right),
\]
so we have a recursive way to reduce the problem:
- Pairwise add adjacent entries so that the odd entries (0-indexed, as in the code below) become \(x_{i-1} + x_i\).
- Compute the scan on these pair sums, so that every odd entry holds its final scanned output.
- For each even entry except the first, add the previous (odd) entry, which already holds the correct scanned output.
```python
import numpy as np

def recursive_prefix_sum(arr):
    print(arr)
    n = len(arr)
    if n <= 2:
        if n == 2:
            arr[1] += arr[0]
        return
    # Step 1: pairwise add, so odd entries hold x_{i-1} + x_i.
    arr[1::2] = arr[::2] + arr[1::2] if n % 2 == 0 else arr[:n-1:2] + arr[1::2]
    print(arr)
    # Step 2: scan the pair sums in place (numpy slices are views).
    recursive_prefix_sum(arr[1::2])
    # Step 3: each even entry (except arr[0]) adds the previous odd entry.
    arr[2::2] = arr[1:n-1:2] + arr[2::2] if n % 2 == 0 else arr[1::2] + arr[2::2]
    print(arr)

recursive_prefix_sum(np.arange(16))
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
# [ 0 1 2 5 4 9 6 13 8 17 10 21 12 25 14 29]
# [ 1 5 9 13 17 21 25 29]
# [ 1 6 9 22 17 38 25 54]
# [ 6 22 38 54]
# [ 6 28 38 92]
# [28 92]
# [ 6 28 66 120]
# [ 1 6 15 28 45 66 91 120]
# [ 0 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120]
```
Non-trivial Scan applications
Some applications that can benefit from scan:
- \(n\)-bit integer addition (scan for the carryover bits) in \(O(\log n)\) time
- Inverting \(n\times n\) dense / triangular matrices in \(O(\log^2 n)\) time
- Evaluating expressions with \(n\) tokens in \(O(\log n)\) time
- Evaluating linear recurrences (e.g. the Fibonacci series) of size \(n\) in \(O(\log n)\) time
- Sparse matrix-vector multiplication (SpMV) using segmented scan
Stream Compression
Given a mask \(m\) and an array \(A\), output \(A[m]\) as a dense array. This is different from a masked operation, where the operation happens in place.

Solution: compute the cumulative (inclusive) sum of \(m\), then scatter the values into the result.
```
# n: 8
# arr: [3, 2, 4, 1, 5, 3, 3, 1]
# mask: [1, 0, 1, 1, 0, 0, 1, 1]
idx = scan(mask, add)              # inclusive scan
# idx: [1, 1, 2, 3, 3, 3, 4, 5]
result = empty(idx[n - 1])         # empty array of size 5
result[idx[mask] - 1] = arr[mask]  # -1: the inclusive scan is 1-based
# result: [3, 4, 1, 3, 1]
```
Stream compression is the same as filtering an array by a condition: to keep the elements satisfying a condition, we let `mask = condition(arr)`.
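The same recipe, runnable in `numpy` (`np.cumsum` plays the role of the inclusive scan):

```python
import numpy as np

arr = np.array([3, 2, 4, 1, 5, 3, 3, 1])
mask = np.array([1, 0, 1, 1, 0, 0, 1, 1], dtype=bool)

idx = np.cumsum(mask)               # inclusive scan: [1 1 2 3 3 3 4 5]
result = np.empty(idx[-1], dtype=arr.dtype)
result[idx[mask] - 1] = arr[mask]   # scatter into the dense output
print(result)                       # [3 4 1 3 1]
```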
Radix sort
Radix sort works for sorting arrays over a fixed range whose binary representation preserves the ordering, e.g. a `uint32` array.
```c
#include <stddef.h>
#include <stdint.h>

// Radix sort for 32-bit unsigned integers; arr has size n.
void radix_sort(uint32_t *arr, size_t n) {
    uint32_t bit = 1;
    uint32_t buffer[n];
    uint32_t temp[n];
    for (int b = 0; b < 32; b++) {
        size_t start0 = 0, start1 = 0;
        for (size_t i = 0; i < n; i++) {
            temp[i] = arr[i];
            buffer[i] = arr[i] & bit;
            if (!buffer[i]) start1++;  // 1-bit entries start after all 0-bit entries
        }
        // Stable partition by the current bit: 0s first, then 1s.
        for (size_t i = 0; i < n; i++) {
            if (buffer[i])
                arr[start1++] = temp[i];
            else
                arr[start0++] = temp[i];
        }
        bit <<= 1;
    }
}
```
To parallelize, observe that `buffer[i] = arr[i] & bit` masks all entries whose current bit is 1, and its complement masks the entries whose current bit is 0. Each pass then becomes a stream compression for the 0-bit entries plus a stream compression for the 1-bit entries, whose starting position is the count of 0-bit entries (a scalar add via broadcast).
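A sketch of one such pass in `numpy` (the helper names are illustrative; `np.cumsum` stands in for the parallel scan):

```python
import numpy as np

def radix_pass(arr, bit):
    # Stable partition by one bit: two stream compressions.
    ones = (arr & bit).astype(bool)
    zeros = ~ones
    n_zeros = zeros.sum()                    # where the 1-bit entries start
    out = np.empty_like(arr)
    out[np.cumsum(zeros)[zeros] - 1] = arr[zeros]         # compress 0-bit entries
    out[n_zeros + np.cumsum(ones)[ones] - 1] = arr[ones]  # compress 1-bit entries
    return out

def radix_sort(arr):
    for b in range(32):
        arr = radix_pass(arr, np.uint32(1 << b))
    return arr

a = np.array([170, 45, 75, 90, 802, 24, 2, 66], dtype=np.uint32)
assert (radix_sort(a) == np.sort(a)).all()
```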
Linked list length
For a linked list, finding its length takes \(O(n)\) time with a serial algorithm. In parallel, we can see the problem as a reduction (for the total length) or as a scan (if we want to assign the distance to the tail to every node).

In general, the algorithm works exactly like the recursive prefix scan, but we shortcut the pointers at the same time (pointer jumping: every node replaces its successor with its successor's successor while accumulating the distance).
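A sketch of the pointer-jumping idea with index arrays (the names and the tail-points-to-itself convention are assumptions for illustration):

```python
import numpy as np

def distance_to_tail(nxt):
    # nxt[i] is the index of node i's successor; the tail points to itself.
    n = len(nxt)
    nxt = nxt.copy()
    dist = (nxt != np.arange(n)).astype(int)   # 1 hop, except the tail
    while (nxt != nxt[nxt]).any():             # O(log n) rounds
        dist = dist + dist[nxt]                # add the successor's distance
        nxt = nxt[nxt]                         # jump: skip one node ahead
    return dist

# List 3 -> 0 -> 2 -> 1 (tail), stored in scrambled order:
nxt = np.array([2, 1, 1, 0])
print(distance_to_tail(nxt))                   # [2 0 1 3]
```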
Fibonacci Series
Note that the Fibonacci series is a linear recurrence \(F_{n+1} = F_{n} + F_{n-1}\) and depends on the previous two states. However, we can expand this as
\[
\begin{bmatrix} F_{n+1} \\ F_n \end{bmatrix}
= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} F_n \\ F_{n-1} \end{bmatrix},
\]
so that it only depends on the previous one state. Also, matmul is associative, so this goes back to a scan problem over products of \(\begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}\).
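A sketch with a serial reference scan (`matmul_scan` is illustrative; a parallel version would use the recursive prefix scheme above):

```python
import numpy as np

def matmul_scan(mats):
    # Inclusive scan under matrix multiplication (serial reference;
    # matmul's associativity makes it a valid scan operator).
    out = [mats[0]]
    for M in mats[1:]:
        out.append(out[-1] @ M)
    return out

F = np.array([[1, 1], [1, 0]])
prefix = matmul_scan([F] * 10)        # F^1, F^2, ..., F^10
fib = [int(P[0, 1]) for P in prefix]  # F^k[0, 1] = F_k
print(fib)                            # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```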
Adding n-bit integers
A serial \(n\)-bit bitwise add is implemented as

```
c[-1] = 0                                               % carry-in c_in
for i = 0:n-1
    s[i] = (a[i] xor b[i]) xor c[i-1]                   % 1 or 3 ones
    c[i] = ((a[i] xor b[i]) and c[i-1]) or (a[i] and b[i])  % has carryover
```
Now, we see that `s[i]` and `c[i]` both depend on `c[i-1]`, `a[i]`, and `b[i]`. We can first precompute the parts that involve only `a[i]` and `b[i]` in \(O(1)\) time:

```
s[i] = p[i] xor c[i-1]             % p = a xor b, bit propagation
c[i] = (p[i] and c[i-1]) or g[i]   % g = a and b, bit generation
```
Mathematically, we could represent \(c\) as a boolean matmul (with and as multiplication and or as addition):
\[
\begin{bmatrix} c_i \\ 1 \end{bmatrix}
= \begin{bmatrix} p_i & g_i \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} c_{i-1} \\ 1 \end{bmatrix},
\]
and since this product is associative, all carries can be computed with a scan.
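A minimal sketch of the carry computation (the serial loop below composes \((p_i, g_i)\) step by step; because the composition is associative, it could be replaced by a parallel scan):

```python
import numpy as np

def add_bits(a, b):
    # a, b: 0/1 arrays, least significant bit first.
    p = a ^ b                            # propagate
    g = a & b                            # generate
    c_in = np.zeros_like(a)
    carry = 0                            # carry into bit 0 is 0
    for i in range(len(a)):              # this loop is a scan of an
        c_in[i] = carry                  # associative op and can be
        carry = (p[i] & carry) | g[i]    # parallelized
    return p ^ c_in, carry               # sum bits, final carry-out

s, c = add_bits(np.array([1, 0, 1, 1]),  # 13
                np.array([1, 1, 0, 1]))  # 11
print(s, c)                              # [0 0 0 1] 1  (13 + 11 = 24)
```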
Inverting matrices
Wlog suppose that \(T\) is an \(n\times n\) lower triangular matrix. Then, we can break \(T\) into \(\frac{n}{2}\times \frac{n}{2}\) blocks \(T = \begin{bmatrix}A &\mathbf{0}\\C&B\end{bmatrix}\), where \(A\) and \(B\) are both lower triangular and \(C\) is dense. In addition, we have the lemma
\[
T^{-1} = \begin{bmatrix} A^{-1} & \mathbf{0} \\ -B^{-1} C A^{-1} & B^{-1} \end{bmatrix},
\]
which can be verified by checking \(T T^{-1} = I\).

Therefore, we can construct the recursive triangular inverting algorithm.
```python
def tri_inv(T, n):
    # Invert a lower triangular matrix T in place by block recursion.
    if n == 1:
        return 1. / T
    A = T[:n//2, :n//2]
    C = T[n//2:, :n//2]
    B = T[n//2:, n//2:]
    Ainv = tri_inv(A, n//2)             # Ainv, Binv can be computed in parallel
    Binv = tri_inv(B, n - n//2)
    T[:n//2, :n//2] = Ainv
    T[n//2:, n//2:] = Binv
    T[n//2:, :n//2] = -Binv @ C @ Ainv  # O(log n)-depth matmul
    return T
```
The total cost is \(O(\log^2 n)\) by the master theorem: the depth recurrence is \(D(n) = D(n/2) + O(\log n)\), since each level is dominated by an \(O(\log n)\)-depth matmul.
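For example, a quick check of the sketch above (the test matrix is arbitrary):

```python
import numpy as np

n = 8
T = np.tril(np.random.rand(n, n)) + n * np.eye(n)  # lower triangular, well-conditioned
Tinv = tri_inv(T.copy(), n)
assert np.allclose(Tinv @ T, np.eye(n))
```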