Patterns in static

Apophenia

Writing new models

The apop_model is intended to provide a consistent expression of any model that (implicitly or explicitly) expresses a likelihood of data given parameters, including traditional linear models, textbook distributions, Bayesian hierarchies, microsimulations, and any combination of the above. The unifying feature is that all of the models act over some data space and some parameter space (in some cases one or both is the empty set), and can assign a likelihood for a fixed pair of parameters and data given the model. This is a very broad requirement, often used in the statistical literature. For discussion of the theoretical structures, see A Useful Algebraic System of Statistical Models (PDF).

A walkthrough

Users are encouraged to always use models via the helper functions, like apop_estimate or apop_cdf. The helper functions do some boilerplate error checking, and are where the defaults are called: if your model has a log_likelihood method but no p method, then apop_p will use exp(log_likelihood). If you don't give an estimate method, then apop_estimate will call apop_maximum_likelihood.
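The fallback behavior described above can be pictured with a small, self-contained sketch. Nothing here is Apophenia code: toy_model, toy_estimate, and toy_default_mle are hypothetical stand-ins, and the default estimator is a crude grid search rather than a real optimizer. The point is only the dispatch pattern, where the helper uses the model's own method if one is present and otherwise falls back to a generic routine built on the log likelihood.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for apop_model: only what the sketch needs. */
typedef struct toy_model {
    double param;  /* a single parameter */
    double (*log_likelihood)(const double *x, size_t n, const struct toy_model *m);
    void (*estimate)(const double *x, size_t n, struct toy_model *m);  /* optional */
} toy_model;

/* Default estimator: grid search over [-10, 10] for the parameter value
   that maximizes the model's log likelihood. */
static void toy_default_mle(const double *x, size_t n, toy_model *m) {
    assert(m->log_likelihood != NULL);
    double best_param = -10.0, best_ll = 0.0;
    int have_best = 0;
    for (int i = 0; i <= 2000; i++) {
        m->param = -10.0 + 0.01 * i;
        double ll = m->log_likelihood(x, n, m);
        if (!have_best || ll > best_ll) {
            best_ll = ll;
            best_param = m->param;
            have_best = 1;
        }
    }
    m->param = best_param;
}

/* Helper: call the model's own estimate method if it has one,
   otherwise fall back to the default above. */
void toy_estimate(const double *x, size_t n, toy_model *m) {
    if (m->estimate) m->estimate(x, n, m);
    else             toy_default_mle(x, n, m);
}

/* Normal(mu, 1) log likelihood with the constant term dropped. */
double unit_normal_ll(const double *x, size_t n, const toy_model *m) {
    double ll = 0.0;
    for (size_t i = 0; i < n; i++)
        ll -= 0.5 * (x[i] - m->param) * (x[i] - m->param);
    return ll;
}
```

A model declared as toy_model m = {.log_likelihood = unit_normal_ll}; has no estimate method, so toy_estimate(x, n, &m) takes the fallback path and leaves m.param near the sample mean.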

So the game in writing a new model is to write just enough internal methods to give the helper functions what they need. In the not-uncommon best case, all you need to do is write a log likelihood function.

Here is how one would set up a model that could be estimated using maximum likelihood:

long double new_log_likelihood(apop_data *data, apop_model *m);

where data is the input data, and m is the parametrized model (i.e. your model with a parameters element set by the caller). This function will return the value of the log likelihood function at the given parameters.

apop_model *your_new_model = &(apop_model){"The Me distribution",
        .vsize=n0, .msize1=n1, .msize2=n2, .dsize=nd,
        .log_likelihood = new_log_likelihood };

That is already more than enough for something like this to work (the dsize element is used for random draws):

apop_model *estimated = apop_estimate(your_data, your_new_model);

Once that baseline works, you can fill in other elements of the apop_model as needed.

For example, if you are using a maximum likelihood method to estimate parameters, you can get much faster estimates and better covariance estimates by specifying the dlog likelihood function (aka the score):

void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
    //some algebra here to find df/dp0, df/dp1, df/dp2....
    //d_0 and d_1 stand for the partial derivatives computed above.
    gsl_vector_set(gradient, 0, d_0);
    gsl_vector_set(gradient, 1, d_1);
}

The score has to be registered (see below) using

apop_score_insert(apop_new_dlog_likelihood, your_new_model);
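Before registering a hand-derived score, it can be worth checking it against a numeric derivative of the log likelihood. The following is a minimal sketch of that check in plain C, independent of Apophenia and GSL; the toy_* names are hypothetical, and the one-parameter Normal(mu, 1) likelihood is chosen only because its score is easy to write down.

```c
#include <assert.h>
#include <stddef.h>

/* Toy log likelihood: Normal(mu, 1), constant term dropped. */
double toy_ll(const double *x, size_t n, double mu) {
    double ll = 0.0;
    for (size_t i = 0; i < n; i++)
        ll -= 0.5 * (x[i] - mu) * (x[i] - mu);
    return ll;
}

/* Analytic score: d ll / d mu = sum_i (x_i - mu). */
double toy_score(const double *x, size_t n, double mu) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] - mu;
    return s;
}

/* Central finite difference of the log likelihood, for comparison. */
double toy_numeric_score(const double *x, size_t n, double mu, double h) {
    assert(h > 0.0);
    return (toy_ll(x, n, mu + h) - toy_ll(x, n, mu - h)) / (2.0 * h);
}

/* Returns 1 if the analytic and numeric scores agree to within tol, else 0. */
int toy_score_matches(const double *x, size_t n, double mu, double tol) {
    double diff = toy_score(x, n, mu) - toy_numeric_score(x, n, mu, 1e-4);
    if (diff < 0) diff = -diff;
    return diff < tol;
}
```

If toy_score_matches returns 0 at a few parameter values, the algebra behind the analytic score is the first place to look.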

Writing new settings groups

Your model may need additional settings or auxiliary information to function, which would require associating a model-specific struct with the model.

Before getting into the detail of how to make model-specific groups of settings work, note that there's a lightweight method of storing sundry settings, so in many cases you can bypass all of the following.

The apop_model structure has a void pointer named more which you can use to point to a model-specific struct. If more_size is larger than zero (i.e. you set it to your_model.more_size=sizeof(your_struct)), then it will be copied via memcpy by apop_model_copy, and freed by apop_model_free. Apophenia's estimation routines will never impinge on this item, so do what you wish with it.
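The copy-and-free rule for more can be illustrated with a standalone sketch. The toy_model struct below is a hypothetical stand-in holding only the two fields discussed, and toy_model_copy and toy_model_free are illustrative functions, not Apophenia's implementation. They show what a memcpy-based copy implies: the payload is duplicated byte for byte, so any pointers inside it would still be shared between the copies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the two apop_model fields discussed above. */
typedef struct {
    void  *more;       /* model-specific payload */
    size_t more_size;  /* zero means "do not copy or free the payload" */
} toy_model;

/* Hypothetical payload struct. */
typedef struct { int iterations; double tolerance; } my_extras;

/* Copy a model; duplicate the payload byte for byte only when more_size > 0. */
toy_model *toy_model_copy(const toy_model *in) {
    toy_model *out = malloc(sizeof *out);
    assert(out != NULL);
    *out = *in;  /* shallow copy: out->more aliases in->more for now */
    if (in->more_size > 0) {
        out->more = malloc(in->more_size);
        assert(out->more != NULL);
        memcpy(out->more, in->more, in->more_size);
    }
    return out;
}

/* Free a model; release the payload only when more_size > 0. */
void toy_model_free(toy_model *m) {
    if (!m) return;
    if (m->more_size > 0) free(m->more);
    free(m);
}
```

With more_size left at zero, the copy simply shares the original pointer and the free function leaves it alone, which suits payloads whose lifetime is managed elsewhere.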

The remainder of this subsection describes the information you'll have to provide to make use of the conveniences described to this point: initialization of defaults, smarter copying and freeing, and adding to an arbitrarily long list of settings groups attached to a model. You will need four items: a typedef for the structure itself, plus init, copy, and free functions. This is the sort of boilerplate that will be familiar to users of object-oriented languages in the style of C++ or Java, but it's really a list of arbitrarily-typed elements, which makes this feel more like LISP. [And being a reimplementation of an existing feature of LISP, this section will be macro-heavy.]

typedef struct {
    int size1, size2;
    char *refs;
    apop_data *dataset;
} ysg_settings;
Apop_settings_declarations(ysg)

The first item is a familiar structure definition. The last line is a macro that declares the three functions below. This is everything you would need in a header file, should you need one. These are just declarations; we'll write the actual init/copy/free functions below.

The structure itself gets the full name, ysg_settings. Everything else is a macro, and so you need only specify ysg, and the _settings part is filled in. Because of these macros, your struct name must end in _settings.

If you have an especially simple structure, then you can generate the three functions with these three macros in your .c file:

Apop_settings_init(ysg, )
Apop_settings_copy(ysg, )
Apop_settings_free(ysg, )

These macros generate appropriate functions to do what you'd expect: allocating the main structure, copying one struct to another, freeing the main structure. The spaces after the commas indicate that no special code gets added to the functions that these macros generate.

You'll never call these functions directly; they are called by Apop_settings_add_group, apop_model_free, and other model or settings-group handling functions.

Now that initializing/copying/freeing of the structure itself is handled, the remainder of this section will be about how to add instructions for the structure internals, like data that is pointed to by the structure elements.

Apop_settings_init(ysg,
    //in holds the values supplied by the caller; out points to the new struct.
    Apop_assert(in.size1, "I need you to give me a value for size1. Stopping.");
    Apop_varad_set(size2, 10);
    Apop_varad_set(dataset, apop_data_alloc(out->size1, out->size2));
    Apop_varad_set(refs, malloc(sizeof(int)));
    *out->refs=1;
)

Now, Apop_settings_add(a_model, ysg, .size1=100) would set up a group with a 100-by-10 data set, and set the reference count to one.

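Apop_settings_copy(ysg,
    //All elements of the original are copied first; this makes one additional modification:
    (*out->refs)++;
)

Apop_settings_free(ysg,
    //Release the shared data only when the last reference goes away.
    if (!(--*in->refs)) {
        apop_data_free(in->dataset);
        free(in->refs);
    }
)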

With those three macros in place and the header as above, Apophenia will treat your settings group like any other, and users can use Apop_settings_add_group to populate it and attach it to any model.
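For readers who find the macro bodies hard to follow, the reference-counting logic they express can be written out as three ordinary functions. This is a sketch in plain C under stated assumptions, not the expansion of Apophenia's macros: toy_settings and its functions are hypothetical, the data set is replaced by a plain buffer of doubles, and the free function returns the remaining count purely so the behavior is easy to observe.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int     size1, size2;
    int    *refs;     /* reference count shared by all copies */
    double *dataset;  /* shared size1*size2 buffer (stand-in for apop_data) */
} toy_settings;

/* Init: size1 is required; size2 defaults to 10 when not positive. */
toy_settings *toy_settings_init(int size1, int size2) {
    assert(size1 > 0);
    if (size2 <= 0) size2 = 10;
    toy_settings *out = malloc(sizeof *out);
    assert(out != NULL);
    out->size1 = size1;
    out->size2 = size2;
    out->dataset = calloc((size_t)size1 * size2, sizeof *out->dataset);
    out->refs = malloc(sizeof *out->refs);
    assert(out->dataset != NULL && out->refs != NULL);
    *out->refs = 1;
    return out;
}

/* Copy: duplicate the struct, share the buffer, and bump the shared count. */
toy_settings *toy_settings_copy(const toy_settings *in) {
    toy_settings *out = malloc(sizeof *out);
    assert(out != NULL);
    *out = *in;
    (*out->refs)++;
    return out;
}

/* Free: drop one reference; release shared items when none remain.
   Returns the number of references still outstanding. */
int toy_settings_free(toy_settings *in) {
    int remaining = --(*in->refs);
    if (remaining == 0) {
        free(in->dataset);
        free(in->refs);
    }
    free(in);
    return remaining;
}
```

The shared buffer survives until the last copy is freed, which is the behavior the reference count is meant to guarantee.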

Registering new methods in vtables

For any given function (e.g., entropy, the dlog likelihood, Bayesian updating), there is probably a special case for well-known models like the Normal distribution. Rather than adding every procedure that could have a special-case calculation to the apop_model struct, functions may maintain a registry of models and associated special-case procedures.

This subsection will discuss how to add a function to an existing vtable.
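The registry idea can be sketched in plain C as a small lookup table. Everything below is hypothetical and says nothing about how Apophenia stores or hashes its vtables: the sketch assumes a model is identified by the address of its log likelihood function, uses a fixed-size array with linear search, and registers an entropy routine as the special case.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*ll_fn)(double x, double param);  /* key: a model's log likelihood */
typedef double (*entropy_fn)(double param);       /* value: special-case routine */

#define TOY_VTABLE_MAX 16
static struct { ll_fn key; entropy_fn fn; } toy_entropy_vtable[TOY_VTABLE_MAX];
static size_t toy_entropy_count = 0;

/* Register (or replace) the special case for a model. Returns 0 on success,
   -1 if the table is full. */
int toy_entropy_vtable_add(entropy_fn fn, ll_fn key) {
    assert(fn != NULL && key != NULL);
    for (size_t i = 0; i < toy_entropy_count; i++)
        if (toy_entropy_vtable[i].key == key) {
            toy_entropy_vtable[i].fn = fn;
            return 0;
        }
    if (toy_entropy_count == TOY_VTABLE_MAX) return -1;
    toy_entropy_vtable[toy_entropy_count].key = key;
    toy_entropy_vtable[toy_entropy_count].fn = fn;
    toy_entropy_count++;
    return 0;
}

/* Look up the special case; NULL means none is registered, so the caller
   should fall back to its general-purpose method. */
entropy_fn toy_entropy_vtable_get(ll_fn key) {
    for (size_t i = 0; i < toy_entropy_count; i++)
        if (toy_entropy_vtable[i].key == key) return toy_entropy_vtable[i].fn;
    return NULL;
}

/* Uniform(0,1): log density 0 on the support, differential entropy 0. */
double unit_uniform_ll(double x, double unused) { (void)x; (void)unused; return 0.0; }
double unit_uniform_entropy(double unused) { (void)unused; return 0.0; }

/* A model with no registered special case. */
double other_ll(double x, double param) { return -x * param; }
```

A general entropy function would call toy_entropy_vtable_get first and use the returned routine when it is not NULL, keeping the model struct itself free of rarely used method slots.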

This overview will not go into detail about setting up a new vtable.

The data elements

The remainder of this page covers the detailed expectations regarding the elements of the apop_model structure. I begin with the data (non-function) elements, and then cover the method (function) elements. Some of the following will be requirements for all models and some will be advice to authors; I use the accepted definitions of "must", "shall", "may" and related words.

data

Parameters, vsize, msize1, msize2

Info

For those elements that require a count of input data, the calculations assume each row in the input apop_data set is a single datum.

Get these via, e.g., apop_data_get(your_model->info, .rowname="log likelihood"). When writing for any arbitrary function, be prepared to handle NaN, indicating that the element is not calculated or saved in the info page by the given model.
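The NaN convention can be handled with a small wrapper on the caller's side. The sketch below is self-contained and does not use apop_data: toy_info_row, toy_info_get, and toy_info_get_or are hypothetical names, and the info page is modeled as a plain array of named values. It only illustrates the pattern of treating NaN as "not recorded" and substituting a fallback.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

/* Hypothetical stand-in for one named row of an info page. */
typedef struct { const char *name; double value; } toy_info_row;

/* Return the named statistic, or NAN if the model did not record it. */
double toy_info_get(const toy_info_row *info, size_t n, const char *rowname) {
    assert(rowname != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(info[i].name, rowname) == 0) return info[i].value;
    return NAN;
}

/* Caller-side pattern: substitute a fallback when the element is missing. */
double toy_info_get_or(const toy_info_row *info, size_t n,
                       const char *rowname, double fallback) {
    double v = toy_info_get(info, n, rowname);
    return isnan(v) ? fallback : v;
}
```

Testing with isnan rather than comparing against NAN matters here, because NaN never compares equal to anything, including itself.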

For OLS-type estimations, each row corresponds to the row in the original data. For filling in of missing data, the elements may appear anywhere, so the row/col indices are essential.

settings, more

In object-oriented jargon, settings groups are the private elements of the data set, to be pulled out in certain contexts, and ignored in all others. Therefore, there are no rules about internal use. The more element of the apop_model provides a lightweight means of attaching an arbitrary struct to a model. See Writing new settings groups above for details.

Methods

p, log_likelihood

prep

estimate

draw

cdf

constraint

Autogenerated by doxygen on Wed Oct 15 2014 (Debian 0.999b+ds3-2).