Commit 4532afba authored by He Guanlin

Update README.md

parent 8158f90f
......@@ -20,11 +20,11 @@ Optionally, if you want to impose initial centroids, you need to provide a text
## Benchmark Datasets
We tested our code on one synthetic dataset that we created ourselves and two real-world datasets downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). Each contains millions of instances and is therefore too large to upload here. Instead, we provide _Synthetic_Data_Generator.py_ and describe the filtering operations applied to the real-world datasets.
- **Synthetic dataset** (our dataset). It contains 50 million instances uniformly distributed in 4 convex clusters. Each instance has 4 dimensions. Because _Synthetic_Data_Generator.py_ uses the `random` function, each generated dataset has different values but always follows the same distribution.
- **Household power consumption dataset** (UCI Machine Learning Repository). It contains 2,075,259 measurements of electric power consumption in a household over a period
- [**Household power consumption dataset**](https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption) (UCI Machine Learning Repository). It contains 2,075,259 measurements of electric power consumption in a household over a period
of nearly 4 years. Each measurement has 9 attributes. We remove the measurements containing missing values and also remove the first 2
attributes that record the date and time of measurements. The remaining set that we use for evaluation contains 2,049,280 measurements
with 7 numerical attributes.
- **US census 1990 dataset** (UCI Machine Learning Repository). It contains 2,458,285 instances with 68 categorical attributes. It is a simplified and discretized version of the USCensus1990raw dataset, which contains a one percent sample drawn from the full 1990 US census data.
- [**US census 1990 dataset**](https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)) (UCI Machine Learning Repository). It contains 2,458,285 instances with 68 categorical attributes. It is a simplified and discretized version of the USCensus1990raw dataset, which contains a one percent sample drawn from the full 1990 US census data.
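
The filtering described for the household power consumption dataset (drop measurements with missing values, then drop the first 2 attributes) could be sketched roughly as follows. This is not the script we used. It assumes a semicolon-separated file with a header row and `?` as the missing-value marker, and the function name, parameters and sample column names are hypothetical.

```python
import csv
import io


def filter_measurements(lines, delimiter=";", missing_marker="?", n_leading_to_drop=2):
    """Yield numeric rows with the leading date/time columns removed.

    Rows that contain an empty field or the missing-value marker are skipped.
    All parameter defaults are assumptions about the input file layout.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        values = row[n_leading_to_drop:]
        if not values or any(v.strip() in ("", missing_marker) for v in values):
            continue  # measurement has a missing value: drop it
        yield [float(v) for v in values]


# Tiny in-memory sample with made-up column names, standing in for the real file.
sample = io.StringIO(
    "Date;Time;A;B;C\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840\n"
    "16/12/2006;17:25:00;?;?;?\n"
    "16/12/2006;17:26:00;5.360;0.436;233.630\n"
)
rows = list(filter_measurements(sample))
```

For the real file, pass an open file object instead of the `io.StringIO` sample, for example `open(path, newline="")`.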
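
To illustrate what "uniformly distributed in 4 convex clusters" can look like, here is a minimal sketch that draws 4-dimensional points uniformly inside axis-aligned hypercubes around 4 centers. It is not the contents of _Synthetic_Data_Generator.py_. The centers, the `half_width` parameter and the function name are assumptions made for this example only.

```python
import random


def generate_points(n_points, centers, half_width=1.0, seed=None):
    """Yield points drawn uniformly inside a hypercube around a random center.

    With seed=None the values differ on every run while the distribution
    stays the same; pass a seed to get a reproducible dataset.
    """
    rng = random.Random(seed)
    for _ in range(n_points):
        center = rng.choice(centers)  # pick one of the clusters
        yield [c + rng.uniform(-half_width, half_width) for c in center]


# Hypothetical, well-separated cluster centers in 4 dimensions.
centers = [
    (0.0, 0.0, 0.0, 0.0),
    (10.0, 10.0, 10.0, 10.0),
    (0.0, 10.0, 0.0, 10.0),
    (10.0, 0.0, 10.0, 0.0),
]
points = list(generate_points(1000, centers, seed=42))
```

Because the function is a generator, a dataset of 50 million instances can be written to disk row by row without being held in memory.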
## Execution
Before execution, recompile the code by entering the `make` command if any change has been made to the code.
......