Optimized parallel implementations of the k-means clustering algorithm:
1. **on multi-core CPU with vector units**: thread parallelization using OpenMP, auto-vectorization using AVX units
2. **on NVIDIA GPU**: using shared memory, dynamic parallelism, and multiple streams
In particular, both implementations use a two-step summation method with package processing to limit the effect of rounding errors that can accumulate while the cluster centroids are updated.
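The idea behind two-step summation can be sketched in plain C as follows. This is a minimal illustration, not the repository's code: the function name `two_step_sum` and the `package_size` parameter are hypothetical, and the sketch sums a single array of values rather than the per-cluster, per-dimension coordinates handled by the real implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the repository).
 * Step 1: sum the values of each package into a small partial sum.
 * Step 2: add the partial sums together.
 * Each addition then combines operands of more similar magnitude than
 * a single running sum over all n values would. */
float two_step_sum(const float *x, size_t n, size_t package_size)
{
    assert(package_size > 0);

    float total = 0.0f;
    for (size_t start = 0; start < n; start += package_size) {
        /* The last package may be shorter than package_size. */
        size_t end = (n - start < package_size) ? n : start + package_size;

        float partial = 0.0f;               /* step 1: sum inside one package */
        for (size_t i = start; i < end; ++i)
            partial += x[i];

        total += partial;                   /* step 2: sum of the package sums */
    }
    return total;
}
```

The motivation is that a single-precision running sum stops absorbing small terms once it grows large: in 32-bit floating point, adding `1.0f` to `16777216.0f` leaves the value unchanged. Keeping the partial sums small delays that loss of precision. The package size that works best depends on the data and is left open here.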
...
- If necessary, update the CUDA path according to your own situation.
## "main.h" Configuration
The benchmark dataset, block size, and other settings are configurable in the _main.h_ file.
Our CUDA C code does not generate synthetic data, so users must specify the path and filename of their benchmark dataset in the `INPUT_DATA` constant and set `NbPoints`, `NbDims`, and `NbClusters` accordingly. To impose the initial centroids, provide a text file containing their coordinates and specify its path and filename in the `INPUT_INITIAL_CENTROIDS` constant.
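As an illustration, the relevant part of _main.h_ might look like the excerpt below. This is a sketch under assumptions: the constant names come from the description above, but their declaration as preprocessor macros and every value shown (paths, sizes) are placeholders, so check the actual file before editing.

```c
/* Hypothetical excerpt of main.h: all values are placeholders. */
#define INPUT_DATA              "path/to/dataset.txt"            /* benchmark dataset */
#define INPUT_INITIAL_CENTROIDS "path/to/initial_centroids.txt"  /* only needed to impose initial centroids */

#define NbPoints   1000000   /* number of points in INPUT_DATA */
#define NbDims     4         /* number of coordinates per point */
#define NbClusters 8         /* number of clusters (k) */
```

The three sizes must agree with the contents of the dataset file, since the code reads the data rather than generating it.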