Optimized parallel implementations of the k-means clustering algorithm:
1. on multi-core CPU with vector units: thread parallelization using OpenMP, auto-vectorization using AVX units
2. on NVIDIA GPU: using shared memory, dynamic parallelism, and multiple streams\\
2. on NVIDIA GPU: using shared memory, dynamic parallelism, and multiple streams
In particular, for both implementations we use a two-step summation method with package processing to handle the effect of rounding errors that may occur during the phase of updating cluster centroids.