Clusutils: Clusgen Manual

Besides traditional one-letter POSIX-style options, clusgen also supports GNU long options. POSIX-style options begin with a single dash and consist of a single character. GNU-style long options consist of two dashes and a keyword. If the option takes an argument, then the keyword is immediately followed by whitespace and then the argument's value. The type of the argument value is encoded as follows: <n> indicates an integer value greater than 0; <x> indicates a real-valued percentage expressed as a fraction from 0.0 to 1.0; and <f> indicates a real value greater than 0.0.

In general, clusgen is invoked with the following usage:

> clusgen [options]
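
The argument-type conventions described above can be sketched in Python. This is a minimal illustration, not part of clusgen: the helper names (check_n, check_x, check_f, build_argv) are hypothetical, and the sketch uses only long options that appear in this manual.

```python
def check_n(value):
    """<n>: an integer value greater than 0."""
    n = int(str(value))  # int("2.5") raises ValueError, so non-integers are rejected
    if n <= 0:
        raise ValueError(f"<n> must be an integer greater than 0, got {value!r}")
    return n


def check_x(value):
    """<x>: a real value from 0.0 to 1.0."""
    x = float(value)
    if not 0.0 <= x <= 1.0:
        raise ValueError(f"<x> must lie between 0.0 and 1.0, got {value!r}")
    return x


def check_f(value):
    """<f>: a real value greater than 0.0."""
    f = float(value)
    if f <= 0.0:
        raise ValueError(f"<f> must be greater than 0.0, got {value!r}")
    return f


def build_argv(clusters, points, dimensions, outliers=None):
    """Assemble a clusgen command line as an argument list (hypothetical helper)."""
    argv = [
        "clusgen",
        "--number-of-clusters", str(check_n(clusters)),
        "--number-of-points", str(check_n(points)),
        "--number-of-data-dimensions", str(check_n(dimensions)),
    ]
    if outliers is not None:
        # Long option keyword, then whitespace, then the value
        argv += ["--percentage-outliers", str(check_x(outliers))]
    return argv
```

Assuming clusgen is on the PATH, a list built this way could be passed to Python's subprocess.run.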

-h --help
Prints a short summary of the command line options.

-C <n> --number-of-clusters <n>
Indicates the number of clusters to be generated. n should be an integer with a value between 1 and ... The default value for this option is 2.

-P <n> --number-of-points <n>
Indicates the number of points to be generated. n should be an integer with a value between 1 and ... The default value for this option is 10.

-D <n> --number-of-data-dimensions <n>
Indicates the number of dimensions that the points in each cluster should have. n should be an integer with a value between 1 and ... The default value for this option is 2.

-S <n> --number-of-space-dimensions <n>
Indicates the number of dimensions of the data space in which the clusters live. Use this option only if you want the number of space dimensions to be greater than the number of data dimensions, e.g., clusters of 10-dimensional points in a 20-dimensional space. n should be an integer with a value between 1 and ... The default value for this option is whatever the value for -D is.

-N <n> --number-of-noise-dimensions <n>
Indicates the number of extra dimensions of noise that the data space should have. n should be an integer with a value between 1 and ... The default value for this option is 0.

-c --calvin-mode
If this option is selected, then the random number library will be seeded the same way every time.
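
The effect of fixed seeding can be illustrated with Python's own random module. This is only an analogy: the seed value and the generator clusgen actually uses are not documented here, so FIXED_SEED and sample below are hypothetical.

```python
import random

FIXED_SEED = 12345  # hypothetical value; clusgen's actual seed is not documented


def sample(n, seed=None):
    """Draw n pseudo-random values; a fixed seed makes the sequence repeatable."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


run_a = sample(5, seed=FIXED_SEED)
run_b = sample(5, seed=FIXED_SEED)
print(run_a == run_b)  # True: same seed, same sequence
```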

-z --standardize-dimensions
Standardizes the coordinates of every point within each cluster. In effect, this centers all of the data around the origin.
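
A minimal sketch of the centering step, assuming it amounts to subtracting the per-dimension mean within a cluster. Whether clusgen also rescales each dimension (for example to unit variance) is not stated here, so this sketch only centers; center_cluster is a hypothetical name.

```python
def center_cluster(points):
    """Subtract the per-dimension mean so the cluster's centroid moves to the origin."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return [[p[d] - means[d] for d in range(dims)] for p in points]


cluster = [[1.0, 4.0], [3.0, 8.0]]
print(center_cluster(cluster))  # [[-1.0, -2.0], [1.0, 2.0]]
```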

-l --randomize-vector-lengths
Multiplies the length of each vector (data point) by a random value: the vector's length changes but its direction (slope) is preserved. This is for experiments in which similarity is measured using angles: vectors close together in the original space will still have small angles separating them, but the L2 distances between them will be large.
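
The following sketch shows why angular similarity survives the rescaling while L2 distance generally does not. The range of the random factor (low, high) and the function names are assumptions made for illustration; the manual does not say how clusgen draws the factor.

```python
import math
import random


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def randomize_length(v, rng, low=0.5, high=10.0):
    """Scale v by a random positive factor; its direction is unchanged."""
    s = rng.uniform(low, high)
    return [s * a for a in v]


rng = random.Random(0)
u, v = [1.0, 2.0], [1.1, 2.0]
u2, v2 = randomize_length(u, rng), randomize_length(v, rng)

print(cosine(u, v), cosine(u2, v2))          # equal up to rounding error
print(math.dist(u, v), math.dist(u2, v2))    # generally different
```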

-u --no-cluster-truncation
By default, all clusters are truncated to [...]. This option turns off the truncation.

-f --fast
This option changes the truncation algorithm to make it much faster. However, the new algorithm has not been extensively tested and may not generate data with the expected statistical properties.

-n <f> --random-noise <f>
Adds a random number from 0.0 to f to the value of each dimension of each point. The default value for this option is 0.0.
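
A minimal sketch of this kind of noise, assuming the random number is drawn uniformly from the range 0.0 to f. The distribution is not stated in this manual, and add_noise is a hypothetical name.

```python
import random


def add_noise(points, f, rng):
    """Add an independent random value from 0.0 to f to every coordinate."""
    return [[value + rng.uniform(0.0, f) for value in point] for point in points]


rng = random.Random(1)
points = [[1.0, 2.0], [3.0, 4.0]]
noisy = add_noise(points, 0.5, rng)  # each coordinate grows by at most 0.5
```

With f set to 0.0 (the default), the points are left unchanged.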

-o <x> --percentage-outliers <x>
x represents the percentage of extra points to add to each cluster as outliers. For example, if you have two clusters of 100 points each, and x has a value of 0.1, then you will end up with 220 total points. The average distance of the outliers from the cluster centroids is determined by the -m option.
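
The arithmetic in the example above can be written out as follows. Rounding fractional outlier counts to the nearest integer is an assumption; the manual does not say how clusgen handles them, and total_points is a hypothetical name.

```python
def total_points(cluster_sizes, x):
    """Total number of points after adding a fraction x of outliers to each cluster."""
    return sum(size + round(size * x) for size in cluster_sizes)


print(total_points([100, 100], 0.1))  # 220
```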

-m <f> --multiply-variances-for-outliers-by <f>
f is the factor by which to multiply the variances of the distances of the outlier points from the centroids of the clusters. The default value for this option is 9.0. Any value <= 1.0 will generate a warning, since the points generated will no longer be outliers, but inliers.

-e <f> --error-perturbation <f>
For each point, add to each dimension a random deviate multiplied by f.

-t <f> --translate-clusters-by <f>
Translate all points on all dimensions by a value of f. This is so outliers can be sufficiently distant from the cluster centroids without having negative values on any dimension.

-d < 1, 2, or 3 > --density-level < 1, 2, or 3 >
If a value of 1 is selected (the default), then the total points are distributed evenly among all the clusters. If a value of 2 is selected, then one cluster gets 10% of the total points. If a value of 3 is selected, then one cluster gets 60% of the total points.
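
A sketch of how the three density levels might divide the points. The manual only states the share of the one special cluster; splitting the remaining points evenly among the other clusters, and the rounding used, are assumptions. cluster_sizes is a hypothetical name.

```python
def cluster_sizes(total, clusters, level=1):
    """Return a list of per-cluster point counts for a given density level."""
    if level not in (1, 2, 3):
        raise ValueError("density level must be 1, 2, or 3")
    if level == 1 or clusters == 1:
        # Even split; any remainder goes to the first few clusters
        base, extra = divmod(total, clusters)
        return [base + (1 if i < extra else 0) for i in range(clusters)]
    share = {2: 0.10, 3: 0.60}[level]
    special = round(total * share)  # the one cluster with a fixed share
    # Assumption: the remaining points are spread evenly over the other clusters
    return [special] + cluster_sizes(total - special, clusters - 1, level=1)


print(cluster_sizes(100, 4, level=1))  # [25, 25, 25, 25]
print(cluster_sizes(100, 4, level=2))  # [10, 30, 30, 30]
print(cluster_sizes(100, 4, level=3))  # [60, 14, 13, 13]
```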

-O < raw, xml, or all > --output-format < raw, xml, or all >
Currently, two output formats are supported: a raw format, which simply lists each point's dimension values, one point per line, and an xml format, which structures the data by cluster and also gives additional information about the data set.

-v --vertical-output-format
When outputting a raw format file, this option lists each point's dimension values one dimension per line. Each line then has the following format (space separated):
datum-id dimension-index value
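
A file in this vertical format can be read back into one list of values per point along these lines. The sample text is made up for illustration and is not actual clusgen output; whether ids and indices start at 0 or 1 is not stated here, so the sketch treats the datum id as an opaque string and only sorts by dimension index. parse_vertical is a hypothetical name.

```python
from collections import defaultdict


def parse_vertical(text):
    """Group 'datum-id dimension-index value' lines into one value list per point."""
    points = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        datum_id, dim_index, value = line.split()
        points[datum_id][int(dim_index)] = float(value)
    # Order each point's values by dimension index
    return {d: [dims[i] for i in sorted(dims)] for d, dims in points.items()}


sample = """\
0 0 1.5
0 1 2.5
1 0 -0.5
1 1 4.0
"""
print(parse_vertical(sample))  # {'0': [1.5, 2.5], '1': [-0.5, 4.0]}
```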