Each row of the file will be converted to one record in the database or one row in the matrix. Values on one row are separated by delimiters. Fixed-width input is also OK; see below.
By default, the delimiters are set to "|,\t", meaning that a pipe, comma, or tab will delimit separate entries. To change the default, pass an argument to apop_text_to_db or apop_text_to_data such as .delimiters=" \t" or .delimiters="|". The apop_opts.input_delimiters setting is deprecated.
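To make the effect of the delimiter set concrete, here is a minimal sketch of splitting one line on any character in a delimiter string. The function get_field is a hypothetical helper written for this illustration, not part of the library, and it deliberately ignores quoting, backslash escapes, comments, and the repeated-delimiter rule described later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the library's parser: copy field number n
   (0-based) of line into out, where any character in delims ends a field.
   Returns 0 on success, -1 if the field is absent or does not fit in out. */
int get_field(const char *line, const char *delims, size_t n,
              char *out, size_t out_size) {
    assert(line != NULL && delims != NULL && out != NULL && out_size > 0);
    const char *start = line;
    for (;;) {
        /* Length of the run up to the next delimiter or the end of the line. */
        size_t len = strcspn(start, delims);
        if (n == 0) {
            if (len >= out_size) return -1;   /* no room for the terminator */
            memcpy(out, start, len);
            out[len] = '\0';
            return 0;
        }
        if (start[len] == '\0') return -1;    /* ran out of fields */
        start += len + 1;                     /* step past the delimiter */
        n--;
    }
}
```

With the default set "|,\t", field 2 of the line a|b,c<tab>d is c. With the set reduced to "|", the same line has only two fields, and field 1 is everything after the pipe.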
The input text file must be in UTF-8 or traditional ASCII encoding. Delimiters must be ASCII characters. If your data is in another encoding, try the POSIX-standard iconv program to filter the data to UTF-8.
- The character after a backslash is read as a normal character, even if it is a delimiter, #, ', or ".
- If a field contains several such special characters, surround it with single quotes (') or double quotes ("). The surrounding marks are stripped and the text read verbatim.
- Text does not need to be delimited by quotes (unless there are special characters). If a text field is quote-delimited, I'll strip the quotes. E.g., "Males, 30-40", is an OK column name, as is "Males named \"Joe\"".
- Everything after a # is taken to be comments and ignored.
- Blank lines (empty or consisting only of white space) are also ignored.
- If you are reading into an array, gsl_matrix, or apop_data set, all text fields are taken as zeros. You will be warned of such substitutions unless you set the relevant option beforehand.
- There are often two delimiters in a row, e.g., "23, 32,, 12". When it's two commas like this, the user typically means that there is a missing value and the system should insert a NaN; when it is two tabs in a row, this is typically just a formatting glitch. Thus, if there are multiple delimiters in a row, I check whether the second (and each subsequent one) is a space or a tab; if it is, then it is ignored, and if it is any other delimiter (including the end of the line) then a NaN is inserted.
If this rule doesn't work for your situation, you can explicitly insert a note that there is a missing data point. E.g., try:
perl -pi.bak -e 's/,,/,NaN,/g' data_file
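The repeated-delimiter rule can be sketched as follows. This is one reading of the rule under stated assumptions, not the library's actual parser: read_fields is a hypothetical name, white space that is not itself a delimiter is skipped before each token, a delimiter at the very start of a line is not treated as a missing first field (the text does not say what happens there), and quotes, escapes, and comments are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read the numeric fields on one line into out[],
   applying the repeated-delimiter rule. Returns the number of values
   stored (at most max). */
size_t read_fields(const char *line, const char *delims,
                   double *out, size_t max) {
    assert(line != NULL && delims != NULL && out != NULL);
    size_t n = 0;
    int after_delim = 0;   /* was the previous token a delimiter? */
    const char *p = line;
    while (n < max) {
        /* Assumption: skip white space that is not itself a delimiter. */
        while (*p != '\0' && isspace((unsigned char)*p)
               && strchr(delims, *p) == NULL)
            p++;
        if (*p == '\0') {
            /* A delimiter followed by the end of the line marks a missing value. */
            if (after_delim) out[n++] = NAN;
            break;
        }
        if (strchr(delims, *p) != NULL) {
            /* Second or later delimiter in a row: a space or tab is
               ignored, anything else stands for a missing value. */
            if (after_delim && *p != ' ' && *p != '\t')
                out[n++] = NAN;
            after_delim = 1;
            p++;
            continue;
        }
        /* An ordinary field; strtod yields 0 for non-numeric text. */
        out[n++] = strtod(p, NULL);
        p += strcspn(p, delims);   /* advance to the next delimiter or the end */
        after_delim = 0;
    }
    return n;
}
```

Under these assumptions and the default delimiters, the line 23, 32,, 12 yields four values with a NaN in the third position, while two tabs in a row between 1 and 2 yield just two values.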
If your file uses its own marker for missing data, you will need to set apop_opts.nan_string to text that matches the given format.
SQLite stores these NaN-type values internally as NULL; that means that functions like apop_query_to_data will convert both your nan_string string and NULL to a NaN value.
- The system uses the standards for C's atof() function for floating-point numbers: INFINITY, -INFINITY, and NaN work as expected. I use some tricks to get SQLite to accept these values, but they work.
- If there are row names and column names, then the input will not be perfectly square: the row of column names should not begin with an entry such as 'row names'. That is, for a 100x100 data set with row and column names, there are 100 names in the top row, and 101 entries in each subsequent row (name plus 100 data points).
- White space before or after a field is ignored. So 1, 2,3, 4 , 5, " six ",7 is equivalent to 1,2,3,4,5," six ",7.
- NUL characters are treated as white space, so if your fields have NULs as padding, you should have no problem. NULs inside of a string will probably break.
- Fixed-width formats are supported (for plain ASCII encoding only), but you have to provide a list of field ending positions. For example, suppose the file has three columns, named NUM, LE, and OL. The names can be read from the first row if you so specify. You will have to provide a list of integers giving the end of each field: 3, 5, 7.
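
As an illustration of how a list of field ends such as 3, 5, 7 carves up a fixed-width line, here is a minimal sketch. The function fixed_field is a hypothetical helper, not the library's reader, and it assumes each entry gives the 1-based position of a field's last character, so that field k starts where field k-1 ended.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy fixed-width field k (0-based) of line into out.
   ends[k] is taken to be the 1-based position of field k's last character.
   The caller must make sure k is a valid index into ends.
   Returns 0 on success, -1 if the line is too short or out is too small. */
int fixed_field(const char *line, const int *ends, size_t k,
                char *out, size_t out_size) {
    assert(line != NULL && ends != NULL && out != NULL);
    size_t start = (k == 0) ? 0 : (size_t)ends[k - 1];
    size_t end = (size_t)ends[k];
    if (end < start || end > strlen(line) || end - start >= out_size)
        return -1;
    memcpy(out, line + start, end - start);
    out[end - start] = '\0';
    return 0;
}
```

With ends 3, 5, 7, a header row reading NUMLEOL splits into NUM, LE, and OL, and a data row of the same width would be cut at the same positions.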