cssutil(1)
NAME
cssutil - utility to measure and manipulate CRM114 statistics files.
SYNOPSIS
cssutil [.css file] [OPTIONS]
WARNING
This man page is taken from an older CRM114 version. It is provided as a convenience to Debian users and may not be up-to-date. If you would like to update it, please send appropriate patches to the Debian bug tracking system.
OPTIONS
-h
print basic help
-b
brief - print only a summary of the statistics of the .css
file (otherwise, prints a full list of how many bins are in
each counter state)
-q
quiet mode; no warning messages
-r
report then exit (no menu). The default if -r is not
specified is to drop into a command-menu based system.
-s
if no css file found, create new one with this many buckets.
Default is 1 million + 1 buckets
-S
same as -s, but round up to next 2^n + 1 boundary.
-v
print version and exit
-D
dump css file to stdout in the architecture-independent CSV
format, suitable for reloading with -R on any architecture
(note that .css files themselves are a hardware-architecture
dependent format)
-R
create and restore css from the hardware-architecture
independent CSV format file (reads from stdin if csv-file is
not supplied).
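As a rough illustration of the rounding that -S describes, the shell function below computes the smallest value of the form 2^n + 1 that is not below a requested bucket count. This is a sketch of the arithmetic only. It assumes that "next 2^n + 1 boundary" means the smallest such value at or above the request, and the function name next_pow2_plus1 is hypothetical and not part of cssutil.

```shell
# Hypothetical helper, not part of cssutil: smallest 2^n + 1 that is >= $1.
# Assumption: this is what "round up to next 2^n + 1 boundary" means.
next_pow2_plus1() {
    want=$1
    p=1
    # Double p until p + 1 reaches the requested bucket count.
    while [ $((p + 1)) -lt "$want" ]; do
        p=$((p * 2))
    done
    echo $((p + 1))
}

next_pow2_plus1 1000000   # prints 1048577 (2^20 + 1)
```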
THE COMMAND MENU
If -r is not supplied, a menu appears with the following options. Note that all of these operations are "in place" and surgical; there is NO undo functionality. Wise users will make a backup copy of all .css files before using cssutil to alter values.
-Z
zero all bins at or below a value. This is useful for
deleting all small-count features from the .css statistics
files leaving higher-count features untouched.
-S
subtract a constant from all bins - this rolls all features
back a constant amount.
-D
divide all bins by a constant - this rolls features back
linearly, rather than in scalar fashion.
-R
rescan - regenerate the statistics output that was initially
printed.
-P
pack - re-slot features to optimize access time.
-Q
gracefully exit, saving changes (note that since these
operations are in-place and surgical, there is no option to
exit without saving changes).
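Because the menu operations cannot be undone, one way to follow the backup advice above is a small shell helper that copies each .css file to a timestamped backup before cssutil is started. This is a minimal sketch. The function name backup_css and the ".bak" naming scheme are assumptions made for illustration and are not part of cssutil.

```shell
# Hypothetical helper, not part of cssutil: copy each named .css file to a
# timestamped backup next to the original before it is edited in place.
backup_css() {
    stamp=$(date +%Y%m%d-%H%M%S)
    for f in "$@"; do
        # -p preserves mode and timestamps; stop at the first failed copy.
        cp -p -- "$f" "$f.$stamp.bak" || return 1
        echo "$f.$stamp.bak"
    done
}

# Usage (file names taken from the examples in this page):
#   backup_css spam.css nonspam.css && cssutil spam.css
```

The helper prints the name of each backup it creates, so the original can be restored with a plain cp if a menu operation turns out to be a mistake.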
DESCRIPTION
cssutil is a general utility to manipulate and measure the .css
format statistics files used by CRM114’s Markovian and
OSB classifiers. The biggest uses are to check the available
space remaining in a .css file, to selectively groom a .css
file, and to port architecture-dependent .css files to and
from an ASCII CSV format, which is architecture independent.
The cssutil program can be used to create
information-less .css files:
cssutil -b -r spam.css
cssutil -b -r nonspam.css
This creates the full-size files ./spam.css and ./nonspam.css,
holding no information. The cssutil program can also be used to
check that the .css files are reasonable. Invoke cssutil as:
cssutil -b -r spam.css
cssutil -b -r nonspam.css
You should get back a report something like this:
Sparse spectra file spam.css statistics:
Total available buckets : 1048576
Total buckets in use : 506987
Total hashed datums in file : 1605968
Average datums per bucket : 3.17
Maximum length of overflow chain : 39
Average length of overflow chain : 1.84
Average packing density : 0.48
Note that the packing density is 0.48; this means that this .css file is about half full of features. Once the packing density gets above about 0.9, you will notice that CRM114 takes longer to process text. The penalty is small at packing densities below about 0.95 and only about a factor of 2 at 0.97. It is best to keep the packing density below 0.7 to 0.8.
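In the sample report, the packing density matches the number of buckets in use divided by the total available buckets (506987 / 1048576 is about 0.48). The sketch below recomputes that ratio from a report read on stdin. It assumes the report lines are labelled exactly as in the sample above and that the density is this simple ratio. The function name packing_density is hypothetical and not part of cssutil.

```shell
# Hypothetical helper, not part of cssutil: read a cssutil report on stdin
# and print buckets-in-use / available buckets to two decimal places.
# Assumption: the report uses the labels shown in the sample on this page.
packing_density() {
    awk -F: '
        /Total available buckets/ { avail = $2 + 0 }
        /Total buckets in use/    { used  = $2 + 0 }
        END {
            if (avail > 0) printf "%.2f\n", used / avail
            else exit 1
        }'
}

# The figures below are the ones from the sample report.
packing_density <<'EOF'
Total available buckets : 1048576
Total buckets in use : 506987
EOF
```

If the report is written to standard output, something like "cssutil -b -r spam.css | packing_density" could be used to watch for a file approaching the 0.7 to 0.8 range.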
SHORTCOMINGS
Note that cssutil as of version 20040816 is NOT capable of dealing with the CRM114 Winnow classifier’s floating-point .cow files. Worse, cssutil is unaware of its shortcomings, and will try anyway. The only recourse is to be aware of this issue and not use cssutil on a Winnow classifier floating-point .cow format file.
HOMEPAGE AND REPORTING BUGS
http://crm114.sourceforge.net/
VERSION
This manpage: $Id: cssutil.azm,v 1.4 2004/08/19 09:23:24 vanbaal Exp $ This manpage describes cssutil as shipped with crm114 version 20040816.BlameClockworkOrange.
AUTHOR
William S. Yerazunis. Manpage typesetting by Joost van Baal and Shalendra Chhabra.
COPYRIGHT
Copyright (C) 2001, 2002, 2003, 2004 William S. Yerazunis. This is free software, copyrighted under the FSF’s GPL. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the file COPYING for more details.
SEE ALSO
cssmerge(1), cssdiff(1), crm(1)