Optimized parallel implementations of the k-means clustering algorithm:
1. On multi-core CPUs with vector units: thread parallelization using OpenMP, auto-vectorization using AVX units
2. On NVIDIA GPUs: using shared memory, dynamic parallelism, and multiple streams
Both implementations use a two-step summation method with package processing to limit the rounding errors that can accumulate while updating the cluster centroids.
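To illustrate why a two-step summation helps, here is a minimal C++ sketch. It is not the code of either implementation. It assumes that "package processing" means splitting the values into fixed-size packages, summing each package first, and then summing the package totals. The names `naive_sum`, `two_step_sum`, and `package_size` are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Single-pass float accumulation, shown only for comparison.
float naive_sum(const std::vector<float>& values) {
    float total = 0.0f;
    for (float v : values) {
        total += v;
    }
    return total;
}

// Two-step summation (assumed scheme, see the text above):
//   step 1: sum the values inside one package of `package_size` elements;
//   step 2: add each package total to the overall total.
// Each addition then combines operands of more similar magnitude than in
// the naive loop, which usually loses fewer low-order bits.
float two_step_sum(const std::vector<float>& values, std::size_t package_size) {
    assert(package_size > 0);
    float total = 0.0f;
    for (std::size_t begin = 0; begin < values.size(); begin += package_size) {
        const std::size_t end = std::min(begin + package_size, values.size());
        float package_total = 0.0f;
        for (std::size_t i = begin; i < end; ++i) {
            package_total += values[i];  // step 1: within the package
        }
        total += package_total;          // step 2: across packages
    }
    return total;
}
```

For example, summing 20,000,000 copies of `1.0f` with `naive_sum` stalls at 16777216 (2^24), because adding 1 to a larger `float` no longer changes it, while `two_step_sum` with a package size of 1024 returns exactly 20000000. In a centroid update, the same idea would apply to each coordinate of the points assigned to a cluster before dividing by the cluster size.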