Discretestats is a Python script for generating summary statistics from lines of textual data containing a field with discrete values, such as a system log file. It is intended to be used to extract summary data from input, which is then piped to clichart for graphical display.
Discretestats is based on the idea of extracting a ‘key’ field and a ‘value’ field from each line of data. The most common key field is a timestamp, while the value field can be anything that has discrete values. The output can be thought of as a table: one row for each key field value, and one column for each value field value, with each cell containing the count for that value and key. A typical example is generating statistics on the number of error and warning messages in a system log per minute.
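The table described above can be sketched in a few lines of Python. This is only an illustration of the counting idea, not the script's actual code: the sample log lines, the `tabulate` helper and the choice of fields 0 and 1 as key and value are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical log lines: field 0 is an HH:MM timestamp (the key),
# field 1 is a message level (the discrete value).
lines = [
    "10:01 ERROR disk full",
    "10:01 WARN low memory",
    "10:01 ERROR disk full",
    "10:02 WARN low memory",
]

def tabulate(lines, key_index=0, value_index=1):
    """Count each (key, value) pair, using whitespace-separated fields."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        counts[(fields[key_index], fields[value_index])] += 1
    return counts

counts = tabulate(lines)
keys = sorted({k for k, _ in counts})      # one row per key
values = sorted({v for _, v in counts})    # one column per distinct value

# Print a header row, then one row of counts per key.
print("key", *values)
for k in keys:
    print(k, *(counts[(k, v)] for v in values))
```

In this sketch a key/value pair that never occurs is shown as 0, because a `Counter` returns 0 for missing entries.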
The differences between discretestats and linestats are:
- Any value fields in linestats must be numbers (since you’re interested in averages, minima etc). The field in discretestats can be anything (but it’s usually a string).
- Discretestats tells you how often each distinct value occurs for each key, whereas using a count in linestats just shows how many times the key occurs
- In linestats, you specify exactly the columns to be output. In discretestats you don’t always know how many columns there will be, since there’s a column for each discrete value.
The main features of discretestats are:
- Accept data from stdin or from a file
- Identify the key field based on whitespace-separated fields, a substring, or a regular expression
- Accumulate counts for each discrete value in a single value field
- Identify the value field based on whitespace-separated fields, a substring, or a regular expression
- Output counts for each discrete value in the value field, for each value in the key field, e.g. for every unique timestamp
- Ignore input lines that do not contain a supplied regular expression
- Sort the output by the key field
- Output as comma-separated or whitespace-separated
- Quote column headings and keys containing spaces and/or commas.
You use the tool like this:
discretestats.py [inputOptions] [outputOptions] [inputFile]
If no input file is specified, reads from stdin. Output is always to stdout.
-h              Show help (this screen) and exit
-k <keyspec>    Specifies how to extract a key from each line (default is to use the whole line). keyspec can be:
                  s:<substring>   Extract key as a substring. See Substrings
                  f:<index>       Extract key as a field. Fields are separated by white space, 0-based
                  r:<regex>       Extract key as a regular expression
-m <regex>      Only include lines matching this regular expression
-v <valuespec>  Required. Specifies a value for which to accumulate statistics from each line. valuespec may be substring, field or regex as for -k option.
-c              Output as CSV (default is whitespace-separated)
-s              Sort output by key column
--nojit         Disable Psyco Just-In-Time compiler
Substrings are specified by one or two indexes into the line.
- Substring indexes must be separated by a colon (:)
- The substring is taken from the first index (inclusive) to the second index (exclusive)
- Each index is 0-based (i.e. numbers from 0)
- Negative indexes count from the end of the line, i.e. -1 refers to the last character in the line
- If no second index is given, the substring is taken from the first index to the end of the line.
Examples (with ‘s:’ prefix):
s:0:5     Extract the first 5 characters from each line
s:32:     Extract from the 33rd character to the end of the line
s:10:-8   Extract from the 11th character to the 9th-last
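The substring rules match Python's own slice semantics, so one way to picture them is a small helper that turns a spec into a `slice`. This is a sketch under that assumption. `parse_substring_spec` is a hypothetical name, not a function from discretestats itself.

```python
def parse_substring_spec(spec):
    """Turn a spec such as 's:10:-8' into a Python slice (hypothetical helper)."""
    parts = spec.split(":")
    if parts[0] != "s" or len(parts) not in (2, 3):
        raise ValueError("not a substring spec: %r" % spec)
    start = int(parts[1])
    # An empty or missing second index means "to the end of the line".
    end = int(parts[2]) if len(parts) == 3 and parts[2] else None
    return slice(start, end)

line = "abcdefghijklmnopqrstuvwxyz"
print(line[parse_substring_spec("s:0:5")])    # first 5 characters
print(line[parse_substring_spec("s:20:")])    # 21st character to the end
print(line[parse_substring_spec("s:10:-8")])  # 11th character to the 9th-last
```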
Regular expressions are mainly used to extract key or value fields from lines, although they are also used for the -m (match) option.
Regular expressions follow the Perl 5 syntax as implemented by Python (NOT grep/egrep!). Note that the * and + operators are greedy by default, i.e. they match as much as possible; if you want them to match as little as possible, append ? to them, e.g. change ab.*yz to ab.*?yz. See the Python regular expression documentation for full information.
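The effect of appending ? can be seen with Python's `re` module directly. The input string here is made up for the illustration.

```python
import re

line = "ab--yz--yz"

# Greedy: .* consumes as much as possible, so the match runs to the last "yz".
greedy = re.search(r"ab.*yz", line).group(0)

# Non-greedy: .*? consumes as little as possible, stopping at the first "yz".
lazy = re.search(r"ab.*?yz", line).group(0)

print(greedy)  # ab--yz--yz
print(lazy)    # ab--yz
```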
When the regex is used to extract a value, if it contains a bracketed group the value returned for the group is used - otherwise, the entire match is used. E.g. thread count: ([0-9]+) will return the number matched by the bracketed group, while thread count: [0-9]+ will return the entire string that matches.
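The group rule can be sketched as follows. The `extract` helper is hypothetical, and the use of `re.search` (matching anywhere in the line) is an assumption about how the script applies the pattern.

```python
import re

def extract(pattern, line):
    """Return group 1 if the pattern has a bracketed group, else the whole match.

    Returns None when the pattern does not match the line at all.
    """
    match = re.search(pattern, line)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)

line = "INFO thread count: 42 active"
print(extract(r"thread count: ([0-9]+)", line))  # 42
print(extract(r"thread count: [0-9]+", line))    # thread count: 42
```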
Note that you must quote or escape special characters to prevent the shell from interpreting them, typically with single quotes.
Examples (with ‘r:’ prefix):
'r:^\d\d:\d\d'   Extract the first 5 characters, which must be in the form 99:99
'r:A:(\d+)'      Find the string 'A:' followed by 1 or more digits, and return the digits