Patterns in static

Apophenia

An outline of the library

Getting started

If you are entirely new to Apophenia, have a look at the Gentle Introduction here.

As well as the information in this outline, there is a separate page covering the details of setting up a computing environment and another page with some sample code for your perusal.

For another concrete example of folding Apophenia into a project, have a look at this sample program.

Some notes on C and Apophenia's use of C utilities

Learning C

Modeling with Data has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are in abundance. Some people find pointers to be especially difficult; fortunately, there's a claymation cartoon which clarifies everything.

Coding often relies on gathering together many libraries; there is a section at the bottom of this outline linking to references for some libraries upon which Apophenia builds.

Usage notes

Here are some notes about the technical details of using the Apophenia library in your development environment.

Header aggregation

There is only one header. Put

#include <apop.h>

at the top of your file, and you're done. Everything declared in that file starts with apop_ (or Apop_).

Linking

You will need to link to the Apophenia library, which involves adding the -lapophenia flag to your compiler. Apophenia depends on SQLite3 and the GNU Scientific Library (which depends on a BLAS), so you will probably need something like:

gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g -Wall -O3

Your best bet is to encapsulate this mess in a Makefile. Even if you are using an IDE and its command-line management tools, see the Makefile page for notes on useful flags.
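A minimal Makefile along those lines might look like the following (the file and target names are illustrative):

```make
CFLAGS = -g -Wall -O3
LDLIBS = -lapophenia -lgsl -lgslcblas -lsqlite3

run_me: sample.c
	$(CC) $(CFLAGS) sample.c $(LDLIBS) -o run_me
```

With this in place, `make` rebuilds run_me whenever sample.c changes, and the flag soup lives in one spot.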

Debugging

The end of Appendix O of Modeling with Data offers some GDB macros which can make dealing with Apophenia from the GDB command line much more pleasant. As per the next section, it also helps to set apop_opts.stop_on_warning='v' or 'w' when running under the debugger.

Standards compliance

To the best of our abilities, Apophenia complies with the C standard (ISO/IEC 9899:2011).

Easier calling syntax

Many functions accept optional named arguments. For example:

apop_vector_distance(v1, v2); //assumes Euclidean distance
apop_vector_distance(v1, .metric='M'); //assumes v2=0, uses Manhattan metric.

See the Designated initializers page for details of this syntax.

Errors, logging, debugging and stopping

The error element

The apop_data set and the apop_model both include an element named error. It is normally 0, indicating no (known) error.

For example, apop_data_copy detects allocation errors and some circular links (when Data->more == Data) and fails in those cases. You could thus use the function with a form like

if (cp->error) {printf("Couldn't copy the input data; failing.\n"); return 1;}

There is sometimes (but not always) benefit to handling specific error codes, which are listed in the documentation of those functions that set the error element. E.g.,

if (cp->error == 'a') {printf("Couldn't allocate space for the copy; failing.\n"); return 1;}
if (cp->error == 'c') {printf("Circular link in the data set; failing.\n"); return 2;}

Verbosity level and logging

The global variable apop_opts.verbose determines how many notifications and warnings get printed by Apophenia's warning mechanism:

-1: turn off logging, print nothing (ill-advised)
0: notify only of failures and clear danger
1: warn of technically correct but odd situations that might indicate, e.g., numeric instability
2: debugging-type information; print queries
3: give me everything, such as the state of the data at each iteration of a loop.

These levels are of course subjective, but should give you some idea of where to place the verbosity level. The default is 1.

The messages are printed to the FILE handle at apop_opts.log_file. If this is blank (as it is at startup), it is set to stderr, which is the typical behavior for a console program. Use

apop_opts.log_file = fopen("mylog", "w");

to write to the mylog file instead of stderr.

As well as the error and warning messages, some functions can also print diagnostics, using the Apop_notify macro. For example, apop_query and friends will print the query sent to the database engine iff apop_opts.verbose >=2 (which is useful when building complex queries). The diagnostics attempt to follow the same verbosity scale as the warning messages.

Stopping

Warnings and errors never halt processing. It is up to the calling function to decide whether to stop.

When running the program under a debugger, this is an annoyance: we want to stop as soon as a problem turns up.

The global variable apop_opts.stop_on_warning determines when the system halts:

'n': never halt. If you were using Apophenia to support a user-friendly GUI, for example, you would use this mode.
'\0' (the default): halt on severe errors, continue on all warnings.
'v': If the verbosity level of the warning is such that the warning would print to screen, then halt; if the warning message would be filtered out by your verbosity level, continue.
'w': Halt on all errors or warnings, including those below your verbosity threshold.
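Combining the verbosity and stopping settings when running under a debugger might look like this sketch (pick the levels that suit the bug at hand):

```c
#include <apop.h>

int main(){
    apop_opts.verbose = 2;            //also print queries and debugging info
    apop_opts.stop_on_warning = 'v';  //halt on anything noisy enough to print
    //... the code being debugged goes here ...
}
```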

See the documentation for individual functions for details on how each reports errors to the caller and the level at which warnings are posted.

SQL, the syntax for querying databases

For a reference, your best bet is the Structured Query Language reference for SQLite. Tutorials abound online. Here is a nice blog entry about complementarities between SQL and matrix manipulation packages.

Apophenia currently supports two database engines: SQLite and mySQL/mariaDB. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.

The global apop_opts.db_engine is initially NUL, indicating no preference for a database engine. You can explicitly set it:

apop_opts.db_engine='s'; //use SQLite
apop_opts.db_engine='m'; //use mySQL/mariaDB

If apop_opts.db_engine is still NUL on your first database operation, then I will check for an environment variable APOP_DB_ENGINE, and set apop_opts.db_engine='m' if it is found and matches (case insensitive) mariadb or mysql.

export APOP_DB_ENGINE=mariadb
apop_text_to_db indata mtab db_for_maria
unset APOP_DB_ENGINE
apop_text_to_db indata stab db_for_sqlite.db

Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see Database moments (plus pow()!).

Threading

Apophenia uses OpenMP for threading. You generally do not need to know how OpenMP works to use Apophenia, and many points of work will thread without your doing anything.

  • All functions strive to be thread-safe. Part of how this is achieved is that static variables are marked as thread-local or atomic, as per the C standard. There still exist compilers that can't implement thread-local or atomic variables, in which case your safest bet is to set OMP's thread count to one as below (or get a new compiler).
  • Some functions modify their inputs. It is up to you to use those functions in a thread-safe manner. The apop_matrix_realloc handles states and global variables correctly in a threaded environment, but if you have two threads resizing the same gsl_matrix at the same time, you're going to have problems.
  • There are a few compilers that don't support OpenMP. Clang on MacOS may be the only current mainstream example as of this writing, and its developers are hard at work on implementing it. In the meantime, when compiling on such a system all work will be single-threaded.
  • Set the maximum number of threads to N with the environment variable
export OMP_NUM_THREADS=N

or the C function

#include <omp.h>
omp_set_num_threads(N);

Use one of these methods with N=1 if you want a single-threaded program.

  • apop_map and friends distribute their for loop over the input apop_data set across multiple threads. Therefore, be careful to send thread-unsafe functions to it only after calling omp_set_num_threads(1).
  • The function apop_rng_get_thread retrieves a statically-stored RNG specific to a given thread. Therefore, if you use that function in the place of a gsl_rng, you can parallelize functions that make random draws.
  • apop_rng_get_thread allocates its store of RNGs using apop_opts.rng_seed, incrementing that seed by one for each. You thus probably have threads with seeds 479901, 479902, 479903, .... [If you have a better way to do it, please feel free to modify the code to implement your improvement and submit a pull request on Github.]

See this tutorial on C threading if you would like to know more, or are unsure about whether your functions are thread-safe or not.

The book version

Apophenia co-evolved with Modeling with Data: Tools and Techniques for Statistical Computing. You can read about the book, or download a free PDF copy of the full text, at modelingwithdata.org.

If you are at this site, there is probably something there for you, including a tutorial on C and general computing form, SQL for data handling, several chapters of statistics from various perspectives, and more details on working with Apophenia.

As with many computer programs, the preferred manner of citing Apophenia is to cite its related book. Here is a BibTeX-formatted entry giving the relevant information:

@book{klemens:modeling,
    title = "Modeling with Data: Tools and Techniques for Statistical Computing",
    author = "Ben Klemens",
    year = 2008,
    publisher = "Princeton University Press"
}

The rationale for the apop_model struct, based on an algebraic system of models, is detailed in a U.S. Census Bureau research report.

What is the status of the code?

[This section last updated 3 August 2014.]

Apophenia was first posted to SourceForge in February 2005, which means that we've had several years to develop and test the code in real-world applications.

The test suite, including the sample code and solution set for Modeling with Data, is about 5,500 lines over 135 files. gprof reports that it covers over 90% of the 7,700 lines in Apophenia's code base. A broad rule of thumb for any code base is that the well-worn parts, in this case functions like apop_data_get and apop_normal's log_likelihood, are likely to be entirely reliable, while the out-of-the-way functions (maybe the score for the Beta distribution) are worth a bit of caution. Close to all of the code has been used in production, so all of it was at least initially tested against real-world data.

It is currently at version 0.999, which is intended to indicate that it is substantially complete. Of course, a library for scientific computing, or even for that small subset that is statistics, will never cover all needs and all methods. But as it stands, Apophenia's framework, based on the apop_data and apop_model structures, is basically internally consistent, has enough tools that you can get common work done quickly, and is reasonably fleshed out with a good number of models out of the box.

The apop_data structure is set, and there are enough functions there that you could use it as a subpackage by itself (especially in tandem with the database functions) for nontrivial dealings with data.

The apop_model structure is much more ambitious—Apophenia is really intended to be a novel system for developing models—and its internals can still be improved. The promise underlying the structure is that you can provide just one item, such as an RNG or a likelihood function, and the structure will do all of the work to fill in computationally-intensive methods for everything else; see Writing new settings groups for the details. Some directions aren't quite there yet (such as RNG -> most other things), the PMF model needs an internal index for faster lookups, and so on. Readers are invited to contribute better methods (such as an alternate means of estimating mixture models), or filling in more of existing models (write a dlog likelihood function for a model that does not currently have one), or submit new standard models not yet included.

How do I write extensions?

It's not a package, so you don't need an API—write your code and include it like any other C code. The system is written to not require a registration or initialization step to add a new model or other such parts. A new apop_model has to conform to some rules if it is to play well with apop_estimate, apop_draw, and so forth. See the notes at Writing new models. Once your new model or function is working, please post the code or a link to the code on the Apophenia wiki.

Data sets

The apop_data structure represents a data set. It joins together a gsl_vector, a gsl_matrix, an apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a gsl_matrix or a gsl_vector.

Here is a diagram showing a sample data set with all of the elements in place. Together, they represent a data set where each row is an observation, which includes both numeric and text values, and where each row/column may be named.

Rowname    Vector     Matrix                             Text             Weights
           Outcome    Age   Weight (kg)   Height (cm)    Sex     State
"Steven"   1          32    65            175            Male    Alaska    1
"Sandra"   0          41    61            165            Female  Alabama   3.2
"Joe"      1          40    73            181            Male    Alabama   2.4

In a regression, the vector would be the dependent variable, and the other columns (after factor-izing the text) the independent variables. Or think of the apop_data set as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some code to print the entire matrix, starting at column -1.

for (int j = 0; j < data->matrix->size1; j++){
    printf("%s\t", apop_name_get(data->names, j, 'r'));
    for (int i = -1; i < data->matrix->size2; i++)
        printf("%g\t", apop_data_get(data, j, i));
    printf("\n");
}

Most functions assume that the data vector, data matrix, and text have the same row count: data->vector->size==data->matrix->size1 and data->vector->size==*data->textsize. This means that the apop_name structure doesn't have separate vector_names, row_names, or text_row_names elements: the rownames are assumed to apply for all.

See below for notes on easily managing the text element and the row/column names.

The apop_data set includes a more pointer, which will typically be NULL, but may point to another apop_data set. This is intended for a main data set and a second or third page with auxiliary information: estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two. apop_data_copy and apop_data_free will handle all the pages of information. The more pointer is not intended as a linked list for millions of data points—you can probably find a way to restructure your data to use a single table (perhaps via apop_data_pack and apop_data_unpack).

There are a great many functions to collate, copy, merge, sort, prune, and otherwise manipulate the apop_data structure and its components.

Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate the GSL's documentation here. You will find a link to the full GSL documentation at the end of this outline. Meanwhile, here are prototypes for a few common functions. The GSL's naming scheme is very consistent, so a simple reminder of the function name may be sufficient to indicate how they are used.

  • gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)
  • gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)
  • gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)
  • gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src)
  • gsl_matrix_transpose (gsl_matrix * m) : square matrices only
  • gsl_matrix_set_all (gsl_matrix * m, double x)
  • gsl_matrix_set_zero (gsl_matrix * m)
  • gsl_matrix_set_identity (gsl_matrix * m)
  • void gsl_vector_set_all (gsl_vector * v, double x)
  • void gsl_vector_set_zero (gsl_vector * v)
  • int gsl_vector_set_basis (gsl_vector * v, size_t i): set all elements to zero, but set item i to one.
  • gsl_vector_reverse (gsl_vector * v): reverse the order of your vector's elements
  • gsl_vector_ptr and gsl_matrix_ptr. To increment an element in a vector use, e.g., *gsl_vector_ptr(v, 7) += 3; or (*gsl_vector_ptr(v, 7))++.

Reading from text files

The apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See Notes on input text file formatting for the full range and details of what can be read in.

If you have any columns of text, then you will need to read in via the database: use apop_text_to_db() to convert your text file to a database table, do any database-appropriate cleaning of the input data, then use apop_query_to_data() or apop_query_to_mixed_data() to pull the data to an apop_data set.
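A sketch of that pipeline (the file name, table name, and column names here are hypothetical):

```c
#include <apop.h>

int main(){
    //Text file -> database table:
    apop_text_to_db("indata.csv", "intable");
    //Database-side cleaning, using ordinary SQL:
    apop_query("delete from intable where age is null;");
    //Database table -> apop_data set:
    apop_data *d = apop_query_to_data("select age, income from intable");
    apop_data_show(d);
}
```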

Alloc/free

  • gsl_matrix * gsl_matrix_alloc (size_t n1, size_t n2)
  • gsl_matrix * gsl_matrix_calloc (size_t n1, size_t n2)
  • void gsl_matrix_free (gsl_matrix * m)
  • gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)
  • gsl_vector * gsl_vector_alloc (size_t n)
  • gsl_vector * gsl_vector_calloc (size_t n)
  • void gsl_vector_free (gsl_vector * v)
  • gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src)

Using views

There are several macros for the common task of viewing a single row or column of a apop_data set.

apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table");
//Get a column using its name
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);
//Get row zero of the data set's matrix as a vector; get its sum
double first_row_sum = apop_vector_sum(Apop_rv(d, 0));
//Get a row or rows as a standalone one-row apop_data set
//ten rows starting at row 3:
Apop_data_rows(d, 3, 10, d10);
//First column's sum
double first_col_sum = apop_vector_sum(Apop_cv(d, 0));
//Pull a 10x5 submatrix, whose first element is the (2,3)rd
//element of the parent data set's matrix
double sub_sum = apop_matrix_sum(Apop_subm(d, 2,3, 10,5));

To make it easier to use the result of these macros as an argument to a function, these macros have abbreviated names.

A second set of macros have a slightly different syntax, taking the name of the object to be declared as the last argument. These cannot be used as expressions such as function arguments.

The view is an automatic variable, not a pointer, and therefore disappears at the end of the scope in which it is declared. If you want to retain the data after the function exits, copy it to another vector:

Apop_matrix_row(d->matrix, 2, rowtwo);
gsl_vector *outvector = apop_vector_copy(rowtwo);

Curly braces always delimit scope, not just at the end of a function. These macros work by generating a number of local variables, which you may be able to see in your debugger. When program evaluation exits a given block, all variables in that block are erased. Here is some sample code that won't work:

apop_data *outdata;
if (get_odd){
outdata = Apop_r(data, 1);
} else {
outdata = Apop_r(data, 0);
}
apop_data_show(outdata); //breaks: outdata points to out-of-scope variables.

For this if/then statement, there are two sets of local variables generated: one for the if block, and one for the else block. By the last line, neither exists. You can get around the problem here by making sure to not put the macro declaring new variables in a block. E.g.:

apop_data *outdata = Apop_r(data, get_odd ? 1 : 0);
apop_data_show(outdata);

This is a general rule about how variables declared in blocks will behave, but because the macros obscure the variable declarations, it is especially worth watching out for here.

Set/get

The set/get functions can act on both names or indices. Sample usages:

double twothree = apop_data_get(data, 2, 3); //just indices
apop_data_set(data, .rowname="A row", .colname="this column", .val=13);
double AIC = apop_data_get(your_model->info, .rowname="AIC");

  • double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)
  • double gsl_vector_get (const gsl_vector * v, size_t i)
  • void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)
  • void gsl_vector_set (gsl_vector * v, size_t i, double x)
  • double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)
  • double * gsl_vector_ptr (gsl_vector * v, size_t i)
  • const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)
  • const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)
  • gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)
  • gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)
  • gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)
  • gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)

Map/apply

These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The ..._sum functions return the sum of the mapped output.

There is an older and a newer set of functions. The older versions, which act on gsl_matrixes or gsl_vectors, have more verbose names; the newer versions, which take in an apop_data set, use the Designated initializers syntax to add a few options and a briefer syntax.

You can do many things quickly with these functions.

Get the sum of squares of a vector's elements:

//given apop_data *dataset and gsl_vector *v:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2);
double sum_of_squares = apop_vector_map_sum(v, gsl_pow_2);

Here, we create an index vector [0, 1, 2, ...].

double index(double in, int index){return index;}
apop_map(d, .fn_di=index, .inplace='y');

Given your log likelihood function and a data set where each row of the matrix is an observation, find the total log likelihood:

static double your_log_likelihood_fn(const gsl_vector * in)
{[your math goes here]}
double total_ll = apop_matrix_map_sum(dataset, your_log_likelihood_fn);

How many missing elements are there in your data matrix?

static double nan_check(const double in){ return isnan(in);}
int missing_ct = apop_map_sum(in, nan_check, .part='m');

Get the mean of the not-NaN elements of a data set:

static double no_nan_val(const double in){ return isnan(in)? 0 : in;}
static double not_nan_check(const double in){ return !isnan(in);}
static double apop_mean_no_nans(apop_data *in){
return apop_map_sum(in, no_nan_val)/apop_map_sum(in, not_nan_check);
}

The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the t statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.

Notice how the older apop_vector_apply uses file-global variables to pass information into the functions, while the apop_map uses a pointer to the constant parameters to input to the functions.

#include <apop.h>
double row_offset;
gsl_rng *r;

void offset_rng(double *v){*v = gsl_rng_uniform(r) + row_offset;}
double find_tstat(gsl_vector *in){return apop_mean(in)/sqrt(apop_var(in));}
double conf(double in, void *df){return gsl_cdf_tdist_P(in, *(int *)df);}

//apop_vector_mean is a macro, so we can't point a pointer to it.
double mu(gsl_vector *in){return apop_vector_mean(in);}

int main(){
    apop_data *d = apop_data_alloc(10, 100);
    r = apop_rng_alloc(3242);
    for (int i=0; i< 10; i++){
        row_offset = gsl_rng_uniform(r)*2 -1;
        Apop_row_v(d, i, onerow);
        apop_vector_apply(onerow, offset_rng);
    }
    size_t df = d->matrix->size2-1;
    apop_data *means = apop_map(d, .fn_v = mu, .part ='r');
    apop_data *tstats = apop_map(d, .fn_v = find_tstat, .part ='r');
    apop_data *confidences = apop_map(tstats, .fn_dp = conf, .param = &df);
#ifndef Testing
    printf("means:\n"); apop_data_show(means);
    printf("\nt stats:\n"); apop_data_show(tstats);
    printf("\nconfidences:\n"); apop_data_show(confidences);
#endif
    //Some sanity checks, for Apophenia's test suite.
    for (int i=0; i< 10; i++){
        //sign of mean == sign of t stat.
        assert(apop_data_get(means, i, -1) * apop_data_get(tstats, i, -1) >= 0);
        //inverse of P-value should be the t statistic.
        assert(fabs(gsl_cdf_tdist_Pinv(apop_data_get(confidences, i, -1), 100)
                    - apop_data_get(tstats, i, -1)) < 1e-3);
    }
}

One more toy example, demonstrating the use of apop_map and apop_map_sum :

#include <apop.h>
/* This sample code sets the elements of a data set's vector to one
   if the index is even. Then, via the weights vector, it adds up
   the even indices.

   There is really no need to use the weights vector; this code
   snippet is an element of Apophenia's test suite, and goes the long
   way to test that the weights are correctly handled. */

double set_vector_to_even(apop_data *r, int index){
    apop_data_set(r, 0, -1, .val=1-(index %2));
    return 0;
}

double set_weight_to_index(apop_data *r, int index){
    gsl_vector_set(r->weights, 0, index);
    return 0;
}

double weight_given_even(apop_data *r){
    return gsl_vector_get(r->vector, 0) ? gsl_vector_get(r->weights, 0) : 0;
}

int main(){
    apop_data *d = apop_data_alloc(100); //a 100-element vector
    d->weights = gsl_vector_alloc(100);
    apop_map(d, .fn_ri=set_vector_to_even, .inplace='v');
    apop_map(d, .fn_ri=set_weight_to_index, .inplace='v');
    double sum = apop_map_sum(d, .fn_r = weight_given_even);
    assert(sum == 49*25*2);
}

Basic math

  • int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_scale (gsl_matrix * a, const double x)
  • int gsl_matrix_add_constant (gsl_matrix * a, const double x)
  • gsl_vector_add (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_sub (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_mul (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_div (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_scale (gsl_vector * a, const double x)
  • gsl_vector_add_constant (gsl_vector * a, const double x)

Matrix math

See the GSL documentation for voluminous further options.

Summary stats

  • double gsl_matrix_max (const gsl_matrix * m)
  • double gsl_matrix_min (const gsl_matrix * m)
  • void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out)
  • void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax)
  • void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin)
  • void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax)
  • gsl_vector_max (const gsl_vector * v)
  • gsl_vector_min (const gsl_vector * v)
  • gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out)
  • gsl_vector_max_index (const gsl_vector * v)
  • gsl_vector_min_index (const gsl_vector * v)
  • gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)

Moments

For most of these, you can add a weights vector for weighted mean/var/cov/..., such as apop_vector_mean(d->vector, .weights=d->weights).

Conversion among types

There are no functions provided to convert from apop_data to the constituent elements, because you don't need a function.

If you need an individual element, you can just use its pointer to retrieve it:

apop_data *d = apop_text_to_mixed_data("vmw", "select result, age, "
"income, sampleweight from data");
double avg_result = apop_vector_mean(d->vector, .weights=d->weights);

In the other direction, you can use compound literals to wrap an apop_data struct around a loose vector or matrix:

//Given:
gsl_vector *v;
gsl_matrix *m;
// Then this form wraps the elements into apop_data structs. Note that
// these are not pointers: they're automatically allocated, and the
// extra memory used by the wrapper is cleaned up on exit from scope.
apop_data *dv = &(apop_data){.vector=v};
apop_data *dm = &(apop_data){.matrix=m};
apop_data *v_dot_m = apop_dot(dv, dm);
//Here is a macro to hide C's ugliness:
#define As_data(...) (&(apop_data){__VA_ARGS__})
apop_data *v_dot_m2 = apop_dot(As_data(.vector=v), As_data(.matrix=m));
//The wrapped object is an automatically-allocated structure pointing to the
//original data. If it needs to persist or be separate from the original,
//make a copy:
apop_data *dm_copy = apop_data_copy(As_data(.vector=v, .matrix=m));

Name handling

If you generate your data set from the database via apop_query_to_data (or apop_query_to_text or apop_query_to_mixed_data) then column names appear as expected. Set apop_opts.db_name_column to the name of a column in your query result to use that column name for row names.

Sample uses, given apop_data set d:

int row_name_count = d->names->rowct;
int col_name_count = d->names->colct;
int text_name_count = d->names->textct;
//Manually add a name:
apop_name_add(d->names, "the vector", 'v');
apop_name_add(d->names, "row 0", 'r');
apop_name_add(d->names, "row 1", 'r');
apop_name_add(d->names, "row 2", 'r');
apop_name_add(d->names, "numeric column 0", 'c');
apop_name_add(d->names, "text column 0", 't');
apop_name_add(d->names, "The name of the data set.", 'h');
//or append several names at once
apop_data_add_names(d, 'c', "numeric column 1", "numeric column 2", "numeric column 3");
//point to element i from:
char *rowname_i = d->names->row[i];
char *colname_i = d->names->col[i];
char *textname_i = d->names->text[i];
//The vector also has a name:
char *vname = d->names->vector;

Text data

The apop_data set includes a grid of strings, text, for holding text data.

Text should be encoded in UTF-8. US ASCII is a subset of UTF-8, so that's OK too.

There are a few simple forms for handling the text element of an apop_data set, which handle the tedium of memory-handling for you.

  • Use apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks.
  • Use apop_text_add to add text elements. It replaces any existing text in the given slot without memory leaks.
  • The number of rows of text data in tdata is tdata->textsize[0]; the number of columns is tdata->textsize[1].
  • Refer to individual elements using the usual 2-D array notation, tdata->text[row][col].
  • x[0] can always be written as *x, which may save some typing. The number of rows is *tdata->textsize. If you have a single column of text data (i.e., all data is in column zero), then item i is *tdata->text[i]. If you know you have exactly one cell of text, then its value is **tdata->text.
  • After apop_text_alloc, all elements are the empty string "", which you can check via if (!strlen(dataset->text[i][j])) printf("<blank>") or if (!*dataset->text[i][j]) printf("<blank>"). For the sake of efficiency when dealing with large, sparse data sets, all blank cells point to the same static empty string, meaning that freeing cells must be done with care. Your best bet is to rely on apop_text_add, apop_text_alloc, and apop_text_free to do the memory management for you.

Here is a sample program that uses these forms, plus a few text-handling functions.

#include <apop.h>
int main(){
    apop_query("create table data (name, city, state);"
               "insert into data values ('Mike Mills', 'Rockville', 'MD');"
               "insert into data values ('Bill Berry', 'Athens', 'GA');"
               "insert into data values ('Michael Stipe', 'Decatur', 'GA');");
    apop_data *tdata = apop_query_to_text("select name, city, state from data");
    printf("Customer #1: %s\n\n", *tdata->text[0]);
    printf("The data, via apop_data_print:\n");
    apop_data_print(tdata);

    //the text alloc can be used as a text realloc:
    apop_text_alloc(tdata, 1+tdata->textsize[0], tdata->textsize[1]);
    apop_text_add(tdata, *tdata->textsize-1, 0, "Peter Buck");
    apop_text_add(tdata, *tdata->textsize-1, 1, "Berkeley");
    apop_text_add(tdata, *tdata->textsize-1, 2, "CA");

    printf("\n\nAugmented data, printed via for loop:\n");
    for (int i=0; i< tdata->textsize[0]; i++){
        for (int j=0; j< tdata->textsize[1]; j++)
            printf("%s\t", tdata->text[i][j]);
        printf("\n");
    }

    apop_data *states = apop_text_unique_elements(tdata, 2);
    char *states_as_list = apop_text_paste(states, .between=", ");
    printf("\n States covered: %s\n", states_as_list);
}

  • apop_data_transpose() : also transposes the text data. Say that you use dataset = apop_query_to_text("select onecolumn from data"); then you have a sequence of strings, dataset->text[0][0], dataset->text[1][0], .... After apop_data *dt = apop_data_transpose(dataset), you will have a single list of strings, dt->text[0], which is often useful as input to list-of-strings handling functions.
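To make that transpose trick concrete, here is a short sketch (the table name and contents are illustrative, in the style of the sample program above):

```c
#include <apop.h>

int main(){
    apop_query("create table data (onecolumn);"
               "insert into data values ('GA');"
               "insert into data values ('MD');");
    apop_data *dataset = apop_query_to_text("select onecolumn from data");
    //dataset->text[0][0], dataset->text[1][0], ...: one string per row.

    apop_data *dt = apop_data_transpose(dataset);
    //dt->text[0] is a single list of strings, ready for functions like apop_text_paste:
    printf("%s\n", apop_text_paste(dt, .between=", "));
}
```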

pip Generating factors

Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors.

A dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable in a regression) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors. The functions that generate factors and dummies will add an informational page to your apop_data set with a name like <categories for your_column> listing the conversion from the artificial numeric factor to the original data. Use apop_data_get_factor_names to get a pointer to that page.

You can use the factor table to translate from numeric categories back to text (though you probably have the original text column in your data anyway).

This is especially useful if you want text in two data sets to have the same categories. Generate factors in the first set, then copy the factor list to the second, then run apop_data_to_factors on the second.
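A sketch of generating factors and reading the conversion table back. This assumes apop_data_to_factors's text-input defaults and apop_data_get_factor_names as described above; the column indices and table contents are illustrative:

```c
#include <apop.h>

int main(){
    apop_query("create table data (name, state);"
               "insert into data values ('Mike Mills', 'MD');"
               "insert into data values ('Bill Berry', 'GA');"
               "insert into data values ('Michael Stipe', 'GA');");
    apop_data *d = apop_query_to_text("select name, state from data");

    //Convert text column 1 (the state) into numeric factors:
    apop_data_to_factors(d, .intype='t', .incol=1, .outcol=0);

    //The factor-to-text conversion table added by the above call:
    apop_data *factors = apop_data_get_factor_names(d, .col=1, .type='t');
    apop_data_print(factors);
}
```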

pip Databases

These are convenience functions to handle interaction with SQLite or MySQL/MariaDB. They open one and only one database, and handle most of the interaction therewith for you.

You will probably first use apop_text_to_db to pull data into the database, then apop_query to clean the data in the database, and finally apop_query_to_data to pull some subset of the data out for analysis.

See the Database moments (plus pow()!) page for not-SQL-standard math functions that you can use when sending queries from Apophenia, such as pow, stddev, or sqrt.

  • apop_text_to_db : Read a text file on disk into the database. Most data analysis projects start with a call to this.
  • apop_data_print : If you include the argument .output_type='d', this prints your apop_data set to the database.
  • apop_query : Manipulate the database, return nothing (e.g., insert rows or create table).
  • apop_db_open : Optional, for when you want to use a database on disk.
  • apop_db_close : If you used apop_db_open, you will need to use this too.
  • apop_table_exists : Check to make sure you aren't reinventing or destroying data. Also, a clean way to drop a table.
  • Apophenia reserves the right to insert temp tables into the opened database. They will all have names beginning with "apop_", so the reader is advised not to use tables with such names, and is free to ignore or delete any such tables that turn up.
  • If you need to deal with two databases, use SQL's attach database. By default with SQLite, Apophenia opens an in-memory database handle. It is a sensible workflow to use the faster in-memory database as the primary, and then attach an on-disk database to read in data and write final output tables.
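For example, the in-memory-plus-on-disk workflow from the last bullet might look like this (the file and table names are hypothetical):

```c
#include <apop.h>

int main(){
    //Apophenia's default SQLite handle is in-memory; attach a disk db for I/O:
    apop_query("attach database 'ondisk.db' as disk;");

    //Pull raw data into the faster in-memory database:
    apop_query("create table working as select * from disk.rawdata;");

    //... clean and analyze the in-memory 'working' table here ...

    //Write final output back to the on-disk database:
    apop_query("create table disk.results as select * from working;");
}
```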

Extracting data from the database

Writing data to the database

See the print functions below. E.g.

apop_data_print(yourdata, .output_type='d', .output_name="dbtab");

pip Command-line utilities

A few functions have proven to be useful enough to be worth breaking out into their own programs, for use in scripts or other data analysis from the command line:

  • The apop_text_to_db command-line utility is a wrapper for the apop_text_to_db function.
  • The apop_db_to_crosstab utility is a wrapper for the apop_db_to_crosstab function.
  • For fans of Gnuplot, the apop_plot_query utility produces a plot from the database. It is especially useful for histograms, which are binned via apop_data_to_bins before plotting.

pip Models

This segment discusses the use of existing apop_model objects. If you need to write a new model, see Writing new models.

pip Introduction

Begin with the most common use: the estimate function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate:

apop_data *data = apop_query_to_data("select outcome, in1, in2, in3 from dataset");
apop_model *the_estimate = apop_estimate(data, apop_probit);
apop_model_print(the_estimate, NULL);

Along the way to estimating the parameters, most models also find covariance estimates for the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show.

The apop_probit model that ships with Apophenia is unparameterized: apop_probit.parameters==NULL. The output from the estimation, the_estimate, has the same form as apop_probit, but the_estimate->parameters has a meaningful value.

pip More estimation output

A call to apop_estimate produces more than just the estimated parameters. Most models also produce a covariance matrix for the parameters, some hypothesis tests, a list of expected values, the log likelihood, AIC, AIC_c, BIC, et cetera.

First, note that if you don't want all that, adding to your model an apop_parts_wanted_settings group with its default values (see below on settings groups) signals to the model that you want only the parameters and to not waste CPU time on covariances, expected values, et cetera. See the apop_parts_wanted_settings documentation for examples and further refinements.

  • The actual parameter estimates are in an apop_data set at your_model->parameters.
  • A pointer to the apop_data set used for estimation, named data.
  • Scalar statistics of the model are listed in the output model's info group, and can be retrieved via a form like
apop_data_get(your_model->info, .rowname="log likelihood");
//or
apop_data_get(your_model->info, .rowname="AIC");
  • Covariances of the parameters are a page appended to the parameters; retrieve via
apop_data *cov = apop_data_get_page(your_model->parameters, "<Covariance>");
  • If the model calculates it, the table of expected values (typically including expected value, actual value, and residual) is a page stapled to the main info page. This is mostly for regression-type models. Retrieve via:
apop_data *predict = apop_data_get_page(your_model->info, "<Predicted>");

But we expect much more from a model than just estimating parameters from data.

Continuing the above example where we got an estimated Probit model named the_estimate, we can interrogate the estimate in various familiar ways. In each of the following examples, the model object holds enough information that the generic function being called can do its work:

apop_data *expected_value = apop_predict(NULL, the_estimate);
double density_under = apop_cdf(expected_value, the_estimate);
apop_data *draws = apop_model_draws(the_estimate, .count=1000);

Apophenia ships with many well-known models for your immediate use, including probability distributions, such as the apop_normal, apop_poisson, or apop_beta models. The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit; e.g., assume the data is Normally distributed and find the mean and variance: apop_estimate(your_data, apop_normal).

There are also linear models like apop_ols, apop_probit, and apop_logit. As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model.

  • If you send a data set with the weights vector filled, apop_ols estimates Weighted OLS.
  • If the dependent variable has more than two categories, the apop_probit and apop_logit models estimate a multinomial probit or logit.
  • There are separate apop_normal and apop_multivariate_normal functions because the parameter formats are slightly different: the univariate Normal keeps both μ and σ in the vector element of the parameters; the multivariate version uses the vector for the vector of means and the matrix for the Σ matrix. The univariate is so heavily used that it merits a special-case model.

Simulation models seem not to fit this form, but you will see below that if you can write an objective function for the p method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters.

But some models have to break uniformity, like how a histogram has a list of bins that makes no sense for a Normal distribution. These are held in settings groups, which you will occasionally need to tweak to modify how a model is handled or estimated. The most common example would be for maximum likelihood, e.g.:

//Probit uses MLE. Redo the estimation using Newton's Method
Apop_settings_add_group(the_estimate, apop_mle, .verbose='y',
.tolerance=1e-4, .method=APOP_RF_NEWTON);
apop_model *re_est = apop_estimate(data, the_estimate);

See below for more details on using settings groups.

pip Parameterizing or initializing a model

The models that ship with Apophenia have the requisite procedures for estimation, making draws, and so on, but have parameters==NULL and settings==NULL. The model is thus, for many purposes, incomplete, and you will need to take some action to complete the model. As per the examples to follow, there are several possibilities:

  • Estimate it! Almost all models can be sent with a data set as an argument to the apop_estimate function. The input model is unchanged, but the output model has parameters and settings in place.
  • If your model has a variable number of parameters, you can directly set the parameters element via apop_data_falloc. For most purposes, you will also need to set the msize1, msize2, vsize, and dsize elements to the size you want. See the example below.
  • Some models have disparate, non-numeric settings rather than a simple matrix of parameters. For example, a kernel density estimate needs a model as a kernel and a base data set, which can be set via apop_model_copy_set.

Here is an example that shows the options for parameterizing a model. After each parameterization, 20 draws are made and written to a file named draws-[modelname].

#include <apop.h>

#define print_draws(mm) apop_data_print(apop_model_draws(mm, 20),\
                            .output_name= "draws-" #mm);

int main(){
    apop_model *uniform_20 = apop_model_set_parameters(apop_uniform, 0, 20);
    apop_data *d = apop_model_draws(uniform_20, 10);

    //Estimate a Normal distribution from the data:
    apop_model *N = apop_estimate(d, apop_normal);
    print_draws(N);

    //estimate a one-dimensional multivariate Normal from the data:
    apop_model *mvN = apop_estimate(d, apop_multivariate_normal);
    print_draws(mvN);

    //fixed parameter list:
    apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1);
    print_draws(std_normal);

    //variable-size parameter list:
    apop_model *std_multinormal = apop_model_copy(apop_multivariate_normal);
    std_multinormal->msize1 =
    std_multinormal->msize2 =
    std_multinormal->vsize  =
    std_multinormal->dsize  = 3;
    std_multinormal->parameters = apop_data_falloc((3, 3, 3),
                                1, 1, 0, 0,
                                1, 0, 1, 0,
                                1, 0, 0, 1);
    print_draws(std_multinormal);

    //estimate a KDE using the defaults:
    apop_model *k = apop_estimate(d, apop_kernel_density);
    print_draws(k);

    /*the documentation tells us that a KDE estimation consists of filling
      an apop_kernel_density_settings group, so we can set it to use a
      Normal(μ, 2) kernel via: */
    apop_model *k2 = apop_model_copy_set(apop_kernel_density, apop_kernel_density,
                         .base_data=d,
                         .kernel=apop_model_set_parameters(apop_normal, 0, 2));
    print_draws(k2);
}

Where to from here? See the Models page for a list of basic functions that make use of them, along with a list of the canned models, including popular favorites like apop_beta, apop_binomial, apop_iv (instrumental variables), apop_kernel_density, apop_loess, apop_lognormal, apop_pmf (see the histogram section below), and apop_poisson.

If you need to write a new model, see Writing new models.

pip Model methods

pip Filtering & updating

The model structure makes it easy to generate new models that are variants of prior models. Bayesian updating, for example, takes in one apop_model that we call the prior, one apop_model that we call a likelihood, and outputs an apop_model that we call the posterior. One can produce complex models using simpler transformations as well. For example, to generate a one-parameter Normal(μ, 1) from the code for a Normal(μ, σ), fix σ at one.

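One way to sketch that σ-fixing step, assuming the apop_model_set_parameters and apop_model_fix_params functions (where an NaN marks the parameter left free):

```c
#include <apop.h>
#include <math.h>

int main(){
    //Set σ to one, leave μ free (NAN), then fix the non-NaN parameters:
    apop_model *N_sigma1 = apop_model_fix_params(
                               apop_model_set_parameters(apop_normal, NAN, 1));

    //The result is a one-parameter model; estimate μ as usual:
    //apop_model *est = apop_estimate(your_data, N_sigma1);
}
```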
This can be used anywhere the original Normal distribution can be. If we need to truncate the distribution in the data space:

//The constraint function:
double over_zero(apop_data *in, apop_model *m){
    return apop_data_get(in) > 0;
}

apop_model *trunc = apop_model_dconstrain(.base_model=N_sigma1,
                                          .constraint=over_zero);

Chaining together simpler transformations is an easy method to produce models of arbitrary detail.

pip Settings groups

[For info on specific settings groups and their contents and use, see the Settings page.]

Describing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. E.g., an MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of bins to be filled with data.

So, the apop_model includes a single list which can hold an arbitrary number of groups of settings, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type.

Settings groups are automatically initialized with default values when needed. If the defaults do no harm, then you don't need to think about these settings groups at all.

Here is an example where a settings group is worth tweaking: the apop_parts_wanted_settings group indicates which parts of the auxiliary data you want.

1 apop_model *m = apop_model_copy(apop_ols);
2 Apop_settings_add_group(m, apop_parts_wanted, .covariance='y');
3 apop_model *est = apop_estimate(data, m);

Line one establishes the baseline form of the model. Line two adds a settings group of type apop_parts_wanted_settings to the model. By default other auxiliary items, like the expected values, are set to 'n' when using this group, so this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want.

Notice that the _settings ending to the settings group's name isn't written—macros make it happen. The remaining arguments to Apop_settings_add_group (if any) follow the Designated initializers syntax of the form .setting=value.

There is an apop_model_copy_set macro that adds a settings group when it is first copied, joining up lines one and two above:

apop_model *m = apop_model_copy_set(apop_ols, apop_parts_wanted, .covariance='y');

Settings groups are copied with the model, which facilitates chaining estimations. Continuing the above example, you could re-estimate to get the predicted values and covariance via:

Apop_settings_set(est, apop_parts_wanted, predicted, 'y');
apop_model *est2 = apop_estimate(data, est);
  • Apop_settings_set, for modifying a single setting, doesn't use the designated initializers format.
  • The Settings page lists the settings structures included in Apophenia and their use.
  • Because the settings groups are buried within the model, debugging them can be a pain. Here is a documented macro for gdb that will help you pull a settings group out of a model for your inspection, to cut and paste into your .gdbinit. It shouldn't be too difficult to modify this macro for other debuggers.
define get_group
set $group = ($arg1_settings *) apop_settings_get_grp( $arg0, "$arg1", 0 )
p *$group
end
document get_group
Gets a settings group from a model.
Give the model name and the name of the group, like
get_group my_model apop_mle
and I will set a gdb variable named $group that points to that settings group,
which you can use like any other pointer. For example, print the contents with
p *$group
The contents of $group are printed to the screen as visible output to this macro.
end

For just using a model, that's all you need to know. For details on writing a new settings group, see Writing new settings groups.

pip Tests & diagnostics

Just about any hypothesis test consists of a few common steps: calculate a statistic, describe the distribution of that statistic under the null hypothesis, and find where the observed statistic lies on that distribution.

If the statistic is from a common form, like the parameters from an OLS regression, then the commonly-associated $t$ test is probably included as part of the estimation output, typically as a row in the info element of the apop_data set.

Some tests, like ANOVA, produce a statistic using a specialized procedure, so Apophenia includes some functions, like apop_test_anova_independence and apop_test_kolmogorov, to produce the statistic and look up its significance level.

If you are producing a statistic that you know has a common form, like a central limit theorem tells you that your statistic is Normally distributed, then the convenience function apop_test will do the final lookup step of checking where your statistic lies on your chosen distribution.
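For instance, mirroring the chi-squared lookup that appears in the Monte Carlo section below, a Normally-distributed statistic could be looked up like this (the statistic's value and the parameter ordering are illustrative; check the apop_test documentation for the exact argument list):

```c
#include <apop.h>

int main(){
    double statistic = 1.96;
    //Final lookup step: where does the statistic lie on a Normal(0, 1)?
    double p = apop_test(statistic, "normal", 0, 1);
    printf("p-value: %g\n", p);
}
```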

See also the example at the end of A quick overview.

pip Monte Carlo methods

To give another example of testing, here is a function that used to be a part of Apophenia, but seemed a bit out of place. Here it is as a sample:

// Input: any old vector. Output: 1 - the p-value for a chi-squared
// test to answer the question, "with what confidence can I reject the
// hypothesis that the variance of my data is zero?"
double apop_test_chi_squared_var_not_zero(const gsl_vector *in){
    Apop_stopif(!in, return NAN, 0, "input vector is NULL. Doing nothing.");
    gsl_vector *normed = apop_vector_normalize((gsl_vector *)in, .normalization_type='s');
    gsl_vector_mul(normed, normed);
    double sum = 0;
    for (size_t i=0; i< normed->size; i++)
        sum += gsl_vector_get(normed, i);
    gsl_vector_free(normed);
    return gsl_cdf_chisq_P(sum, in->size);
}
Or, consider the Rao score statistic, ${\partial\over \partial\beta}\log L(\beta)'I^{-1}(\beta){\partial\over \partial\beta}\log L(\beta)$, where $L$ is your model's likelihood function and $I$ its information matrix. In code:

apop_data *infoinv = apop_model_numerical_covariance(data, your_model);
apop_data *score = apop_data_alloc(your_model->parameters->vector->size);
apop_score(data, score->vector, your_model);
apop_data *stat = apop_dot(apop_dot(score, infoinv), score);

Given the correct assumptions, this is $\sim \chi^2_m$, where $m$ is the dimension of $\beta$, so the odds of a Type I error given the model is:

double p_value = apop_test(stat, "chi squared", beta->size);

pip Empirical distributions and PMFs (probability mass functions)

The apop_pmf model wraps an apop_data set so it can be read as an empirical model, with a likelihood function (equal to the associated weight for observed values and zero for unobserved values), a random number generator (which simply makes weighted random draws from the data), and so on. Setting it up is a model estimation from data like any other, done via apop_estimate(your_data, apop_pmf).

You have the option of cleaning up the data before turning it into a PMF. For example...

apop_data_pmf_compress(your_data); //remove duplicates
apop_data_sort(your_data);
apop_vector_normalize(your_data->weights); //weights sum to one
apop_model *a_pmf = apop_estimate(your_data, apop_pmf);

These are largely optional.

It is the weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no weights segment, then I assume that all rows have equal weight.

Most models have a parameters apop_data set that is filled when you call apop_estimate. For a PMF model, the parameters are NULL, and the data itself is used for calculation. Therefore, modifying the data post-estimation can break some internal settings set during estimation. If you modify the data, throw away any existing PMFs (apop_model_free) and re-estimate a new one.

Using apop_data_pmf_compress puts the data into one bin for each unique value in the data set. You may instead want bins of fixed width, in the style of a histogram, which you can get via apop_data_to_bins. It requires a bin specification. If you send a NULL binspec, then the offset is zero and the bin size is big enough to ensure that there are $\sqrt{N}$ bins from minimum to maximum. There are other preferred formulæ for bin widths that minimize MSE, which might be added as a future extension. The binspec will be added as a page to the data set, named "<binspec>". See the apop_data_to_bins documentation on how to write a custom bin spec.

There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In both cases, you will need two distributions based on the same binspec.

For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync:

apop_data_to_bins(first_set, NULL);
apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>"));

You can use apop_test_kolmogorov or apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.

Kernel density estimation will produce a smoothed PDF. See apop_kernel_density for details. Or, use apop_vector_moving_average for a simpler smoothing method.

pip Maximum likelihood methods

This section includes some notes on the maximum likelihood routine. As in the section on writing models above, if a model has a p or log_likelihood method but no estimate method, then calling apop_estimate(your_data, your_model) executes the default estimation routine of maximum likelihood.

If you are not a statistician, there are a few things you will need to keep in mind. As a running example, consider minimizing this variant of the classic banana (Rosenbrock) function:

typedef struct {
    double scaling;
} coeff_struct;

double banana (double *params, coeff_struct *in){
    return (gsl_pow_2(1-params[0]) +
            in->scaling*gsl_pow_2(params[1]-gsl_pow_2(params[0])));
}

The function returns a single number to be minimized. You will need to write an apop_model to send to the optimizer, which is a two-step process: write a log likelihood function wrapping the real objective function, and a model that uses that log likelihood. Here they are:

long double ll (apop_data *d, apop_model *in){
    return - banana(in->parameters->vector->data, in->more);
}

int main(){
    coeff_struct co = {.scaling=100};
    apop_model b = {"Bananas!", .log_likelihood= ll, .vsize=2,
                    .more = &co, .more_size=sizeof(coeff_struct)};

The vsize=2 specifies that your parameters are a vector of size two, which means that in->parameters->vector->data is the list of doubles that you should send to banana. The more element of the structure is designed to hold any arbitrary structure; if you use it, you will also need to set the more_size element, as above. Here is the complete program:

#include <apop.h>

typedef struct {
    double scaling;
} coeff_struct;

long double banana (double *params, coeff_struct *in){
    return (gsl_pow_2(1-params[0])
            + in->scaling*gsl_pow_2(params[1]-gsl_pow_2(params[0])));
}

long double ll (apop_data *d, apop_model *in){
    return - banana(in->parameters->vector->data, in->more);
}

int main(){
    coeff_struct co = {.scaling=100};
    apop_model *b = &(apop_model) {"¡Bananas!", .log_likelihood= ll,
                        .vsize=2, .more = &co, .more_size=sizeof(coeff_struct)};
    Apop_model_add_group(b, apop_mle, .verbose='y', .method="NM simplex");
    Apop_model_add_group(b, apop_parts_wanted);
    apop_model *e1 = apop_estimate(NULL, b);
    apop_model_print(e1, NULL);

    Apop_settings_set(b, apop_mle, method, "BFGS cg");
    apop_model *e2 = apop_estimate(NULL, b);
    apop_model_print(e2, NULL);

    gsl_vector *one = apop_vector_fill(gsl_vector_alloc(2), 1, 1);
    assert(apop_vector_distance(e1->parameters->vector, one) < 1e-2);
    assert(apop_vector_distance(e2->parameters->vector, one) < 1e-2);
}
If you would like to trace the optimizer's path through the parameter space, add an apop_mle settings group with .want_path='y', then retrieve the recorded path from the output model:

Apop_settings_add_group(your_model, apop_mle, .want_path='y');
apop_model *out = apop_estimate(your_data, your_model);
apop_data_show(Apop_settings_get(out, apop_mle, path));

pip Setting Constraints

The problem is that the parameters of a function must not take on certain values, either because the function is undefined for those values or because parameters with certain values would not fit the real-world problem.

The solution is to rewrite the function being maximized such that the function is continuous at the constraint boundary but takes a steep downward slope. The unconstrained maximization routines will be able to search a continuous function but will never return a solution that falls beyond the parameter limits.

If you give it a likelihood function with no regard to constraints plus an array of constraints, apop_maximum_likelihood will combine them to a function that fits the above description and search accordingly.

A constraint function must do three things:

  • It must check the constraint and, if the constraint does not bind (i.e., the parameter values are OK), return zero.
  • If the constraint does bind, it must return a penalty that indicates how far the parameters are from meeting the constraint.
  • If the constraint does bind, it must set a return vector that the likelihood function can take as a valid input. The penalty at this returned value must be zero.

The idea is that if the constraint returns zero, the log likelihood function will return the log likelihood as usual, and if not, it will return the log likelihood at the constraint's return vector minus the penalty. To give a concrete example, here is a constraint function that ensures that the first parameter of the input is greater than zero. As with the constraints for many of the models that ship with Apophenia, it is a wrapper for apop_linear_constraint.

static long double beta_zero_greater_than_x_constraint(apop_data *data, apop_model *v){
    //constraint is 0 < beta_1
    static apop_data *constraint = NULL;
    if (!constraint) constraint = apop_data_falloc((1,1,1), 0, 1);
    return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
}
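As a sketch of extending this to hold both parameters of a two-dimensional input above zero, assuming the same constraint-row format (each row lists the bound followed by the coefficients on the parameter vector):

```c
static long double both_betas_positive(apop_data *data, apop_model *v){
    //Two constraint rows: 0 < 1*beta_1 + 0*beta_2, and 0 < 0*beta_1 + 1*beta_2.
    static apop_data *constraint = NULL;
    if (!constraint) constraint = apop_data_falloc((2,2,2),
                                      0,  1, 0,
                                      0,  0, 1);
    return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
}
```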

pip Missing data

pip Legible output

The output routines handle four sinks for your output. There is a global variable that you can use for small projects where all data will go to the same place.

apop_opts.output_type = 's'; //Stdout
apop_opts.output_type = 'f'; //named file
apop_opts.output_type = 'p'; //a pipe or already-opened file
apop_opts.output_type = 'd'; //the database

You can also set the output type, the name of the output file or table, and other options via arguments to individual calls to output functions. See apop_prep_output for the list of options.

C makes minimal distinction between pipes and files, so you can set a pipe or file as output and send all output there until further notice:

apop_opts.output_type = 'p';
apop_opts.output_pipe = popen("gnuplot", "w");
apop_plot_lattice(...); //see https://github.com/b-k/Apophenia/wiki/gnuplot_snippets
fclose(apop_opts.output_pipe);
apop_opts.output_pipe = fopen("newfile", "w");
fprintf(apop_opts.output_pipe, "\nNow set 2:\n");

Continuing the example, you can always override the global data with a specific request:

apop_vector_print(v, "vectorfile"); //put vectors in a separate file
apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db
apop_matrix_print(m, .output_pipe = stdout); //now show the same matrix on screen

I will first look at the input file name, then the input pipe, then the global output_pipe, in that order, to determine where I should write. Some combinations (like output type = 'd' and only a pipe) don't make sense, and I'll try to warn you about those.

What if you have too much output and would like to use a pager, like less or more? In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is the form:

FILE *lesspipe = popen("less", "w");
assert(lesspipe);
apop_data_print(your_data_set, .output_pipe=lesspipe);
pclose(lesspipe);

popen will search your usual program path for less, so you don't have to give a full path.

The plot functions produce output for Gnuplot (so output type = 'd' again does not make sense). As above, you can pipe directly to Gnuplot or write to a file. Please consider these to be deprecated, as there is better graphics support in the works.

pip Assorted

A few more descriptive methods:

General utilities:

Math utilities:

pip Further references

For your convenience, here are links to some other libraries you are probably using.

Autogenerated by doxygen on Wed Oct 15 2014 (Debian 0.999b+ds3-2).