Besides traditional one-letter POSIX-style options, clusgen also supports
GNU-style long options. A POSIX-style option begins with a single dash and
consists of a single character. A GNU-style long option consists of two
dashes and a keyword. If an option takes an argument, then the option is
followed by whitespace and then the argument's value. The type of the
argument value is encoded as follows: <n> indicates an integer value
greater than 0; <x> indicates a real-valued percentage, expressed as a
fraction from 0.0 to 1.0; and <f> indicates a real value greater than 0.0.
In general, clusgen is invoked with the following usage:
> clusgen [options]
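As an illustration of the two option styles, the following Python sketch builds the same invocation once with short options and once with long options, using three of the options documented below. The helper name clusgen_argv and the chosen values are hypothetical; only the option names come from this document.

```python
import shlex

def clusgen_argv(clusters, points, data_dims, use_long_options=True):
    """Build an argument list for clusgen (hypothetical helper)."""
    if use_long_options:
        names = ("--number-of-clusters",
                 "--number-of-points",
                 "--number-of-data-dimensions")
    else:
        names = ("-C", "-P", "-D")
    argv = ["clusgen"]
    for name, value in zip(names, (clusters, points, data_dims)):
        # The option and its argument are separate, whitespace-separated words.
        argv += [name, str(value)]
    return argv

short_form = clusgen_argv(3, 300, 5, use_long_options=False)
long_form = clusgen_argv(3, 300, 5)

print(shlex.join(short_form))  # clusgen -C 3 -P 300 -D 5
print(shlex.join(long_form))
```

If clusgen is installed and on the PATH, either list could be passed to subprocess.run(argv, check=True); that step is left out here because it depends on the local installation.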
-h
--help
Prints a short summary of the command line options.
-C <n>
--number-of-clusters <n>
Indicates the number of
clusters to be generated. n should be an integer with a value
between 1 and ... The default value for this option is 2.
-P <n>
--number-of-points <n>
Indicates the number of
points to be generated. n should be an integer with a value
between 1 and ... The default value for this option is 10.
-D <n>
--number-of-data-dimensions <n>
Indicates the number
of dimensions that the points in each cluster should have. n
should be an integer with a value between 1 and ... The default value
for this option is 2.
-S <n>
--number-of-space-dimensions <n>
Indicates the number of dimensions of the data space in which the
clusters live. Use this option only if you want the number
of space dimensions to be greater than the number of data dimensions,
e.g., clusters of 10-dimensional points in a 20-dimensional space.
n should be an integer with a value between 1 and ... The
default value for this option is the value of -D.
-N <n>
--number-of-noise-dimensions <n>
Indicates the
number of extra dimensions of noise that the data space should have.
n should be an integer with a value between 1 and ... The
default value for this option is 0.
-c
--calvin-mode
If this option is selected, then the random number library is seeded
the same way on every run, which makes the random sequence repeatable.
-z
--standardize-dimensions
Standardizes the coordinates of every point within each cluster.
In effect, this centers all of the data around the origin.
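A minimal Python sketch of the centering step, assuming that "centering" means subtracting the per-dimension mean of a cluster from each of its points. Whether clusgen also rescales by the standard deviation is not stated here, so the sketch does not do so. The function name center_cluster and the sample data are illustrative only.

```python
def center_cluster(points):
    """Subtract the per-dimension mean from every point of one cluster."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return [[p[d] - means[d] for d in range(dims)] for p in points]

cluster = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
centered = center_cluster(cluster)
print(centered)  # [[-2.0, -4.0], [0.0, 0.0], [2.0, 4.0]]
```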
-l
--randomize-vector-lengths
Multiplies the length of each vector (data point) by a random value;
the vector's length changes but its direction is preserved. This is
for experiments in which similarity is measured using angles: vectors
close together in the original space will still have small angles
separating them, but the L2 distances between them can be large.
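The effect can be checked with a short Python sketch. It assumes each vector is multiplied by its own positive scalar drawn uniformly from a range; the range (0.5 to 20.0), the function names, and the sample vectors are assumptions made for illustration, not clusgen's actual parameters.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rescale(v, rng, low=0.5, high=20.0):
    """Multiply a vector by one random positive scalar (assumed range)."""
    s = rng.uniform(low, high)
    return [s * x for x in v]

rng = random.Random(0)
u, v = [1.0, 2.0, 3.0], [1.1, 2.0, 2.9]
ru, rv = rescale(u, rng), rescale(v, rng)

# The angle between the vectors is unchanged (up to rounding),
# while the Euclidean (L2) distance generally is not.
print(cosine(u, v), cosine(ru, rv))
print(math.dist(u, v), math.dist(ru, rv))
```

Because a positive scalar cancels out of the cosine, any angle-based similarity measure sees the same data before and after rescaling.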
-u
--no-cluster-truncation
By default, all clusters are
truncated to [...]. This option turns off the truncation.
-f
--fast
This option changes the truncation algorithm to make it much faster.
The new algorithm has not been extensively tested, however, and may
not generate data with the expected statistical properties.
-n <f>
--random-noise <f>
Adds a random number from 0.0 to f to the value of each dimension of
each point.
The default value for this option is 0.0.
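A minimal Python sketch of this kind of noise, assuming the random number is drawn uniformly from 0.0 to f and independently for every coordinate. The distribution is not specified above, so uniform is an assumption, as are the function name add_noise and the sample points.

```python
import random

def add_noise(points, f, rng):
    """Add an independent draw from [0.0, f] to every coordinate (assumed uniform)."""
    return [[x + rng.uniform(0.0, f) for x in p] for p in points]

rng = random.Random(42)
points = [[0.0, 0.0], [1.0, 2.0]]
noisy = add_noise(points, 0.5, rng)
print(noisy)
```

Note that noise of this form is never negative, so it shifts each coordinate upward by f / 2 on average under the uniform assumption.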
-o <x>
--percentage-outliers <x>
x represents the fraction of extra points to add to each cluster as
outliers. For example, if you have two clusters of 100 points each,
and x has a value of 0.1, then 10 outliers are added to each cluster
and you will end up with 220 total points. The average distance of the
outliers from the cluster centroids is determined by the -m option.
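The arithmetic of the example can be written as a small Python sketch. How clusgen rounds a fractional outlier count is not stated here, so rounding to the nearest integer is an assumption; the function name total_points is illustrative.

```python
def total_points(cluster_sizes, outlier_fraction):
    """Total point count after adding outliers to each cluster.

    Rounding to the nearest integer is an assumption.
    """
    return sum(size + round(size * outlier_fraction) for size in cluster_sizes)

print(total_points([100, 100], 0.1))  # 220
```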
-m <f>
--multiply-variances-for-outliers-by <f>
f is the factor by which the variances of the distances of the outlier
points from the cluster centroids are multiplied. The default value
for this option is 9.0, which corresponds to a factor of 3.0 in
standard deviation. Any value <= 1.0 will generate a warning, since
the points generated will no longer be outliers, but inliers.
-e <f>
--error-perturbation <f>
For each point, adds to each dimension a random deviate multiplied by f.
-t <f>
--translate-clusters-by <f>
Translates all points on all dimensions by a value of f. This
is so outliers can be sufficiently distant from the cluster centroids
without having negative values on any dimension.
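A minimal Python sketch of the translation, assuming that f is simply added to every coordinate of every point. The function name translate and the sample points are illustrative.

```python
def translate(points, f):
    """Add the same offset f to every coordinate of every point."""
    return [[x + f for x in p] for p in points]

points = [[-3.0, 1.0], [0.5, -0.5]]
shifted = translate(points, 5.0)
print(shifted)  # [[2.0, 6.0], [5.5, 4.5]]
```

Since every point moves by the same offset, distances between points and centroids are unchanged; only the position of the data set relative to the origin changes.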
-d < 1, 2, or 3 >
--density-level < 1, 2, or 3 >
If a value of 1 is selected (the default), then the total
points are distributed evenly among all the clusters. If a value of 2
is selected, then one cluster gets 10% of the total points. If a value
of 3 is selected, then one cluster gets 60% of the total
points.
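The three density levels can be sketched in Python as follows. The description above only says what share one cluster receives; this sketch assumes that the remaining points are split as evenly as possible among the other clusters and that the share is rounded to the nearest integer. Both are assumptions, and the function names are illustrative.

```python
def split_evenly(total, parts):
    """Split total into parts integer sizes that differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]

def cluster_sizes(total_points, n_clusters, density_level=1):
    """Cluster sizes for density levels 1, 2, and 3 (remainder handling assumed)."""
    share = {1: None, 2: 0.10, 3: 0.60}[density_level]
    if share is None or n_clusters == 1:
        return split_evenly(total_points, n_clusters)
    first = round(total_points * share)
    # Assumption: the other clusters share the remaining points evenly.
    return [first] + split_evenly(total_points - first, n_clusters - 1)

print(cluster_sizes(100, 4, 1))  # [25, 25, 25, 25]
print(cluster_sizes(100, 4, 2))  # [10, 30, 30, 30]
print(cluster_sizes(100, 5, 3))  # [60, 10, 10, 10, 10]
```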
-O < raw, xml, or all >
--output-format < raw, xml, or all >
Currently, two output formats are supported. The raw format
simply lists each point's dimension values, one point per line.
The xml format structures the data by cluster and also gives
additional information about the data set.
-v
--vertical-output-format
When writing a raw format file, this option lists each point's
dimension values one dimension per line. Each line then consists of
three space-separated fields: datum-id dimension-index value
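A file in this vertical layout can be read back into per-point vectors with a short Python sketch. The sample lines are made up for illustration and are not real clusgen output; the sketch also makes no assumption about whether dimension indices start at 0 or 1, since it only sorts them. The function name parse_vertical is hypothetical.

```python
from collections import defaultdict

def parse_vertical(lines):
    """Group 'datum-id dimension-index value' lines into one vector per datum."""
    by_point = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        datum_id, dim_index, value = fields
        by_point[datum_id][int(dim_index)] = float(value)
    # Order each point's values by dimension index.
    return {pid: [dims[i] for i in sorted(dims)] for pid, dims in by_point.items()}

# Hypothetical sample in the three-field layout described above.
sample = ["0 0 1.5", "0 1 -2.0", "1 0 0.25", "1 1 4.0"]
points = parse_vertical(sample)
print(points)  # {'0': [1.5, -2.0], '1': [0.25, 4.0]}
```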