Commit 2de1711d authored by He Guanlin

Update README.md

parent 3b811226
......@@ -15,9 +15,14 @@ The configuration for benchmark dataset, block size, etc., are adjustable in the
Our k-means code does NOT generate any synthetic data, so you need to give the path and filename of your benchmark dataset in the `INPUT_DATA` constant, and also specify `NbPoints`, `NbDims`, and `NbClusters`. If you want to impose initial centroids, you need to provide a text file and specify the corresponding path and filename in the `INPUT_INITIAL_CENTROIDS` constant.
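
Since `NbPoints` and `NbDims` must match the dataset file, it can help to check them before editing the constants. The sketch below is only an illustration, not part of our code: the helper `dataset_shape` is hypothetical, and it assumes a text file with one point per line and whitespace-separated coordinates (pass `delimiter=","` for comma-separated files).

```python
def dataset_shape(lines, delimiter=None):
    """Return (nb_points, nb_dims) for an iterable of text lines.

    Assumption: one point per line; fields are separated by whitespace
    (delimiter=None) or by the given delimiter string.
    """
    nb_points = 0
    nb_dims = None
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split(delimiter)
        if nb_dims is None:
            nb_dims = len(fields)  # the first point fixes the dimensionality
        elif len(fields) != nb_dims:
            raise ValueError(
                f"line {line_no}: expected {nb_dims} fields, got {len(fields)}"
            )
        nb_points += 1
    return nb_points, (nb_dims or 0)
```

For a file on disk, `with open(path) as f: print(dataset_shape(f))` gives the two values to copy into `NbPoints` and `NbDims`.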
## Benchmark datasets
We tested our code on one synthetic dataset and two real-world datasets. Each of them contains millions of instances, and therefore is too large to be loaded here.
The synthetic dataset used in our paper below is too large (about 1.8GB) to be loaded here. So we provide the _Synthetic_Data_Generator.py_ instead. Since the generator uses the random function, the dataset generated each time will have different values but will always keep the same distribution.
## Benchmark Datasets
We tested our code on one synthetic dataset that we created ourselves and two real-world datasets downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). Each of them contains millions of instances and is therefore too large to be uploaded here. Instead, we provide the _Synthetic_Data_Generator.py_ and describe the filtering operations applied to the real-world datasets.
- **Synthetic dataset.** It contains 50 million instances uniformly distributed in 4 convex clusters. Each instance has 4 dimensions. Since the _Synthetic_Data_Generator.py_ uses the `random` function, the dataset generated each time will have different values but will always keep the same distribution.
- **Household power consumption dataset.** It contains 2,075,259 measurements of electric power consumption in a household over a period
of nearly 4 years. Each measurement has 9 attributes. We remove the measurements containing missing values and also remove the first 2
attributes that record the date and time of measurements. The remaining set that we use for evaluation contains 2,049,280 measurements
with 7 numerical attributes.
- **US census 1990 dataset.** It contains 2,458,285 instances with 68 categorical attributes. It is a simplified and discretized version of the USCensus1990raw dataset, which contains a one percent sample drawn from the full 1990 US census data.
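
The filtering described for the household power consumption dataset can be sketched as follows. This is a minimal illustration rather than the exact script we used: the function name `filter_measurements` is hypothetical, and the sketch assumes the file layout of the UCI download (a header line, `;` as separator, and missing values marked by `?` or left empty).

```python
import csv


def filter_measurements(lines, delimiter=";", missing_marker="?", n_leading_to_drop=2):
    """Drop the leading date/time columns and every row with a missing value.

    Assumptions: the first line is a header, fields are separated by
    `delimiter`, and a missing value is either `missing_marker` or empty.
    Returns (remaining column names, list of rows as floats).
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    kept = []
    for row in reader:
        if not row:
            continue  # skip completely empty lines
        values = row[n_leading_to_drop:]  # remove the date and time attributes
        if any(v.strip() in ("", missing_marker) for v in values):
            continue  # remove measurements containing missing values
        kept.append([float(v) for v in values])
    return header[n_leading_to_drop:], kept
```

Calling it as `names, rows = filter_measurements(open(path, newline=""))` on the downloaded file should leave only the 7 numerical attributes; the rows can then be written to the text file referenced by `INPUT_DATA`.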
## Execution
Before execution, recompile the code by entering the `make` command if any change has been made to the code.
......