If you are entirely new to Apophenia, have a look at the Gentle Introduction here.
As well as the information in this outline, there is a separate page covering the details of setting up a computing environment and another page with some sample code for your perusal.
For another concrete example of folding Apophenia into a project, have a look at this sample program.
Modeling with Data has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are in abundance. Some people find pointers to be especially difficult; fortunately, there's a claymation cartoon which clarifies everything.
Coding often relies on gathering together many libraries; there is a section at the bottom of this outline linking to references for some libraries upon which Apophenia builds.
Here are some notes about the technical details of using the Apophenia library in your development environment.
Header aggregation
There is only one header. Put
    #include <apop.h>
at the top of your file, and you're done. Everything declared in that file starts with apop_ (or Apop_).
Linking
You will need to link to the Apophenia library, which involves adding the -lapophenia
flag to your compiler. Apophenia depends on SQLite3 and the GNU Scientific Library (which depends on a BLAS), so you will probably need something like:
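For example, a typical compilation line (the compiler and flag order are illustrative; adjust to your system):

    gcc sample.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me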
Your best bet is to encapsulate this mess in a Makefile. Even if you are using an IDE and its command-line management tools, see the Makefile page for notes on useful flags.
Debugging
The end of Appendix O of Modeling with Data offers some GDB macros which can make dealing with Apophenia from the GDB command line much more pleasant. As per the next section, it also helps to set apop_opts.stop_on_warning='v' or 'w' when running under the debugger.
Standards compliance
To the best of our abilities, Apophenia complies with the C standard (ISO/IEC 9899:2011).
Easier calling syntax
Many functions allow optional named arguments. For example:
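A sketch using apop_text_to_db (the .text_file and .tabname argument names follow that function's documented parameters; treat the specifics as assumptions):

    apop_text_to_db(.text_file="infile.csv", .tabname="intable");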
See the Designated initializers page for details of this syntax.
The error element
The apop_data set and the apop_model both include an element named error. It is normally 0, indicating no (known) error.
For example, apop_data_copy detects allocation errors and some circular links (when Data->more == Data
) and fails in those cases. You could thus use the function with a form like
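A sketch of that form (Apop_stopif per the library's error-handling macros; the message and action are illustrative):

    apop_data *the_copy = apop_data_copy(your_data);
    Apop_stopif(the_copy->error, return, 0, "Couldn't copy the input data set; stopping.");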
There is sometimes (but not always) benefit to handling specific error codes, which are listed in the documentation of those functions that set the error
element. E.g.,
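A hypothetical check of the codes a function documents (the specific 'a' and 't' codes here are placeholders; consult the function's documentation for the real ones):

    apop_data *d = apop_text_to_data("infile.txt");
    if (d->error == 'a')      { /* handle an allocation error */ }
    else if (d->error == 't') { /* handle a text-reading error */ }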
The global variable apop_opts.verbose
determines how many notifications and warnings get printed by Apophenia's warning mechanism:
-1: turn off logging, print nothing (ill-advised)
0: notify only of failures and clear danger
1: warn of technically correct but odd situations that might indicate, e.g., numeric instability
2: debugging-type information; print queries
3: give me everything, such as the state of the data at each iteration of a loop.
These levels are of course subjective, but should give you some idea of where to place the verbosity level. The default is 1.
The messages are printed to the FILE handle at apop_opts.log_file. If this is blank (which happens at startup), then this is set to stderr, the typical behavior for a console program. Use a form like the following to write to the mylog file instead of stderr:
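    apop_opts.log_file = fopen("mylog", "w");   // "w" overwrites on each run; use "a" to append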
As well as the error and warning messages, some functions can also print diagnostics, using the Apop_notify macro. For example, apop_query and friends will print the query sent to the database engine iff apop_opts.verbose >=2
(which is useful when building complex queries). The diagnostics attempt to follow the same verbosity scale as the warning messages.
Warnings and errors never halt processing. It is up to the calling function to decide whether to stop.
When running the program under a debugger, this is an annoyance: we want to stop as soon as a problem turns up.
The global variable apop_opts.stop_on_warning
changes when the system halts:
'n': never halt. If you were using Apophenia to support a user-friendly GUI, for example, you would use this mode.
'\0' (the default): halt on severe errors, continue on all warnings.
'v': if the verbosity level of the warning is such that the warning would print to screen, then halt; if the warning message would be filtered out by your verbosity level, continue.
'w': halt on all errors or warnings, including those below your verbosity threshold.
See the documentation for individual functions for details on how each reports errors to the caller and the level at which warnings are posted.
For a reference, your best bet is the Structured Query Language reference for SQLite. For a tutorial, there is an abundance of tutorials online. Here is a nice blog entry about complementarities between SQL and matrix manipulation packages.
Apophenia currently supports two database engines: SQLite and mySQL/mariaDB. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.
The global apop_opts.db_engine
is initially NUL
, indicating no preference for a database engine. You can explicitly set it:
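For example (the 'm' code matches the environment-variable behavior described below; treat the exact codes as assumptions):

    apop_opts.db_engine = 'm';   // use mySQL/mariaDB; leave as NUL for the SQLite default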
If apop_opts.db_engine is still NUL on your first database operation, then I will check for an environment variable APOP_DB_ENGINE, and set apop_opts.db_engine='m' if it is found and matches (case insensitive) mariadb or mysql.
Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see Database moments (plus pow()!).
Apophenia uses OpenMP for threading. You generally do not need to know how OpenMP works to use Apophenia, and many points of work will thread without your doing anything.
Apophenia's functions are thread-safe to the extent possible, but if two threads write to the same gsl_matrix at the same time, you're going to have problems.
Set the maximum number of threads to N with the environment variable OMP_NUM_THREADS or the C function omp_set_num_threads(N). Use one of these methods with N=1 if you want a single-threaded program.
apop_map and friends distribute their for loop over the input apop_data set across multiple threads. Therefore, be careful to send thread-unsafe functions to it only after calling omp_set_num_threads(1).
Because each thread has its own gsl_rng, you can parallelize functions that make random draws. The per-thread generators are seeded using apop_opts.rng_seed, then incrementing that seed by one. You thus probably have threads with seeds 479901, 479902, 479903, .... [If you have a better way to do it, please feel free to modify the code to implement your improvement and submit a pull request on Github.]
See this tutorial on C threading if you would like to know more, or are unsure about whether your functions are thread-safe or not.
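For instance, a minimal sketch (my_thread_unsafe_fn is a hypothetical stand-in; the .fn_d option is per apop_map's documentation):

    #include <omp.h>   // OpenMP's C interface

    omp_set_num_threads(1);                     // drop to one thread
    apop_map(d, .fn_d=my_thread_unsafe_fn);     // safe: the map loop no longer forks
    omp_set_num_threads(omp_get_num_procs());   // restore full threading afterward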
Apophenia co-evolved with Modeling with Data: Tools and Techniques for Scientific Computing. You can read about the book, or download a free PDF copy of the full text, at modelingwithdata.org.
If you are at this site, there is probably something there for you, including a tutorial on C and general computing form, SQL for data-handling, several chapters of statistics from various perspectives, and more details on working with Apophenia.
As with many computer programs, the preferred manner of citing Apophenia is to cite its related book. Here is a BibTeX-formatted entry giving the relevant information:
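A plausible reconstruction of that entry (publication details per the publisher's listing; verify the year before citing):

    @book{klemens:modeling,
        author    = "Ben Klemens",
        title     = "Modeling with Data: Tools and Techniques for Scientific Computing",
        publisher = "Princeton University Press",
        year      = 2008
    }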
The rationale for the apop_model struct, based on an algebraic system of models, is detailed in a U.S. Census Bureau research report.
[This section last updated 3 August 2014.]
Apophenia was first posted to SourceForge in February 2005, which means that we've had several years to develop and test the code in real-world applications.
The test suite, including the sample code and solution set for Modeling with Data, is about 5,500 lines over 135 files. gprof reports that it covers over 90% of the 7,700 lines in Apophenia's code base. A broad rule of thumb for any code base is that the well-worn parts, in this case functions like apop_data_get and apop_normal's log_likelihood
, are likely to be entirely reliable, while the out-of-the-way functions (maybe the score for the Beta distribution) are worth a bit of caution. Close to all of the code has been used in production, so all of it was at least initially tested against real-world data.
It is currently at version 0.999, which is intended to indicate that it is substantially complete. Of course, a library for scientific computing, or even for that small subset that is statistics, will never cover all needs and all methods. But as it stands Apophenia's framework, based on the apop_data and apop_model, is basically internally consistent, has enough tools that you can get common work done quickly, and is reasonably fleshed out with a good number of models out of the box.
The apop_data structure is set, and there are enough functions there that you could use it as a subpackage by itself (especially in tandem with the database functions) for nontrivial dealings with data.
The apop_model structure is much more ambitious—Apophenia is really intended to be a novel system for developing models—and its internals can still be improved. The promise underlying the structure is that you can provide just one item, such as an RNG or a likelihood function, and the structure will do all of the work to fill in computationally-intensive methods for everything else; see Writing new settings groups for the details. Some directions aren't quite there yet (such as RNG -> most other things), the PMF model needs an internal index for faster lookups, and so on. Readers are invited to contribute better methods (such as an alternate means of estimating mixture models), to fill in more of the existing models (e.g., write a dlog likelihood function for a model that does not currently have one), or to submit new standard models not yet included.
It's not a package, so you don't need an API—write your code and include
it like any other C code. The system is written to not require a registration or initialization step to add a new model or other such parts. A new apop_model has to conform to some rules if it is to play well with apop_estimate, apop_draw, and so forth. See the notes at Writing new models. Once your new model or function is working, please post the code or a link to the code on the Apophenia wiki.
The apop_data structure represents a data set. It joins together a gsl_vector
, a gsl_matrix
, an apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a gsl_matrix
or a gsl_vector
.
If you are viewing the HTML documentation, here is a diagram showing a sample data set with all of the elements in place. Together, they represent a data set where each row is an observation, which includes both numeric and text values, and where each row/column may be named.
[Diagram: a sample apop_data set with columns Rowname | Vector | Matrix | Text | Weights, where each named row holds both numeric and text values.]
In a regression, the vector would be the dependent variable, and the other columns (after factor-izing the text) the independent variables. Or think of the apop_data set as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some code to print the entire matrix, starting at column -1.
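A sketch of such a loop, using apop_data_get's convention that column -1 addresses the vector:

    for (size_t i=0; i < data->matrix->size1; i++){
        for (int j=-1; j < (int)data->matrix->size2; j++)
            printf("%g\t", apop_data_get(data, i, j));
        printf("\n");
    }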
Most functions assume that the data vector, data matrix, and text have the same row count: data->vector->size==data->matrix->size1
and data->vector->size==*data->textsize
. This means that the apop_name structure doesn't have separate vector_names
, row_names
, or text_row_names
elements: the rownames
are assumed to apply for all.
See below for notes on easily managing the text
element and the row/column names.
The apop_data set includes a more
pointer, which will typically be NULL
, but may point to another apop_data set. This is intended for a main data set and a second or third page with auxiliary information: estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two. apop_data_copy
and apop_data_free
will handle all the pages of information. The more
pointer is not intended as a linked list for millions of data points—you can probably find a way to restructure your data to use a single table (perhaps via apop_data_pack and apop_data_unpack).
There are a great many functions to collate, copy, merge, sort, prune, and otherwise manipulate the apop_data structure and its components.
Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate the GSL's documentation here. You will find a link to the full GSL documentation at the end of this outline. Meanwhile, here are prototypes for a few common functions. The GSL's naming scheme is very consistent, so a simple reminder of the function name may be sufficient to indicate how they are used.
gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)
gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src)
gsl_matrix_transpose (gsl_matrix * m) : square matrices only
gsl_matrix_set_all (gsl_matrix * m, double x)
gsl_matrix_set_zero (gsl_matrix * m)
gsl_matrix_set_identity (gsl_matrix * m)
void gsl_vector_set_all (gsl_vector * v, double x)
void gsl_vector_set_zero (gsl_vector * v)
int gsl_vector_set_basis (gsl_vector * v, size_t i)
: set all elements to zero, but set item i to one
gsl_vector_reverse (gsl_vector * v) : reverse the order of your vector's elements
To modify an element in place, use gsl_vector_ptr and gsl_matrix_ptr. To increment an element in a vector use, e.g., *gsl_vector_ptr(v, 7) += 3; or (*gsl_vector_ptr(v, 7))++.
The apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See Notes on input text file formatting for the full range and details of what can be read in.
If you have any columns of text, then you will need to read in via the database: use apop_text_to_db() to convert your text file to a database table, do any database-appropriate cleaning of the input data, then use apop_query_to_data() or apop_query_to_mixed_data() to pull the data to an apop_data set.
See also:
gsl_matrix * gsl_matrix_alloc (size_t n1, size_t n2)
gsl_matrix * gsl_matrix_calloc (size_t n1, size_t n2)
void gsl_matrix_free (gsl_matrix * m)
gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)
gsl_vector * gsl_vector_alloc (size_t n)
gsl_vector * gsl_vector_calloc (size_t n)
void gsl_vector_free (gsl_vector * v)
gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src)
There are several macros for the common task of viewing a single row or column of an apop_data set.
To make it easier to use the result of these macros as an argument to a function, these macros have abbreviated names.
A second set of macros have a slightly different syntax, taking the name of the object to be declared as the last argument. These can not be used as expressions such as function arguments.
The view is an automatic variable, not a pointer, and therefore disappears at the end of the scope in which it is declared. If you want to retain the data after the function exits, copy it to another vector:
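A sketch, assuming Apop_rv as the abbreviated row-as-vector view macro (treat the macro name as an assumption):

    gsl_vector *permanent = apop_vector_copy(Apop_rv(d, 2));   // deep copy; survives after the view is gone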
Curly braces always delimit scope, not just at the end of a function. These macros work by generating a number of local variables, which you may be able to see in your debugger. When program evaluation exits a given block, all variables in that block are erased. Here is some sample code that won't work:
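A sketch of the pitfall (same macro-name assumption as above):

    gsl_vector *v;
    if (get_odd){
        v = Apop_rv(d, 1);    // the view's backing variables live only in this block...
    } else {
        v = Apop_rv(d, 0);
    }
    apop_vector_print(v);     // ...so this line reads variables that no longer exist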
For this if/then statement, there are two sets of local variables generated: one for the if
block, and one for the else
block. By the last line, neither exists. You can get around the problem here by making sure to not put the macro declaring new variables in a block. E.g.:
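A sketch of the fix (again, the macro name is an assumption):

    gsl_vector *v = Apop_rv(d, get_odd ? 1 : 0);   // the macro's variables are declared at this scope
    apop_vector_print(v);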
This is a general rule about how variables declared in blocks will behave, but because the macros obscure the variable declarations, it is especially worth watching out for here.
The set/get functions can act on either names or indices. Sample usages:
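For instance (the named-argument forms follow the designated-initializer documentation; treat the exact option names as assumptions):

    apop_data_set(d, 2, 3, 5.1);                              // set matrix element (2,3)
    double x = apop_data_get(d, 2, 3);                        // read it back
    double y = apop_data_get(d, .rowname="height", .col=-1);  // vector element, row found by name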
See also:
double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)
double gsl_vector_get (const gsl_vector * v, size_t i)
void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)
void gsl_vector_set (gsl_vector * v, size_t i, double x)
double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)
double * gsl_vector_ptr (gsl_vector * v, size_t i)
const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)
const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)
gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)
gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)
gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)
gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)
These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The ..._sum functions return the sum of the mapped output.
There is an older and a newer set of functions. The older versions, which act on gsl_matrixes or gsl_vectors, have more verbose names; the newer versions, which take in an apop_data set, use the Designated initializers syntax to add a few options and a more brief syntax.
You can do many things quickly with these functions.
Get the sum of squares of a vector's elements:
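A one-liner (gsl_pow_2 is declared in GSL's gsl_math.h):

    double sum_of_squares = apop_vector_map_sum(v, gsl_pow_2);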
Here, we create an index vector [0, 1, 2, ...].
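A sketch, assuming apop_map's .fn_di option, which sends each element along with its index:

    static double give_index(double in, int index){ return index; }
    apop_data *idx = apop_map(d, .fn_di=give_index);   // each element replaced by its index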
Given your log likelihood function and a data set where each row of the matrix is an observation, find the total log likelihood:
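A sketch, assuming the .fn_rp option (each row arrives as an apop_data set, plus a parameter pointer):

    static double one_row_ll(apop_data *row, void *model){ return apop_log_likelihood(row, model); }
    double total_ll = apop_map_sum(data, .fn_rp=one_row_ll, .param=est_model);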
How many missing elements are there in your data matrix?
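For instance (isnan is from math.h; the .part option is an assumption):

    static double nan_count(double in){ return isnan(in); }
    double missing_ct = apop_map_sum(d, .fn_d=nan_count, .part='m');   // 'm': matrix part only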
Get the mean of the not-NaN elements of a data set:
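A sketch along the same lines:

    static double val_or_zero(double in){ return isnan(in) ? 0 : in; }
    static double is_ok(double in){ return !isnan(in); }
    double mean_not_nan = apop_map_sum(d, .fn_d=val_or_zero, .part='m')
                        / apop_map_sum(d, .fn_d=is_ok, .part='m');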
The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the t statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.
Notice how the older apop_vector_apply uses file-global variables to pass information into the functions, while apop_map uses a pointer to the constant parameters as the input to the functions.
One more toy example, demonstrating the use of apop_map and apop_map_sum :
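A sketch (fabs is from math.h; option names as above):

    static double triple(double x){ return 3*x; }
    apop_data *tripled = apop_map(d, .fn_d=triple);    // map: a new set with each element tripled
    double abs_total = apop_map_sum(d, .fn_d=fabs);    // map_sum: total of |x| over all elements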
See also:
int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)
int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)
int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)
int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)
int gsl_matrix_scale (gsl_matrix * a, const double x)
int gsl_matrix_add_constant (gsl_matrix * a, const double x)
gsl_vector_add (gsl_vector * a, const gsl_vector * b)
gsl_vector_sub (gsl_vector * a, const gsl_vector * b)
gsl_vector_mul (gsl_vector * a, const gsl_vector * b)
gsl_vector_div (gsl_vector * a, const gsl_vector * b)
gsl_vector_scale (gsl_vector * a, const double x)
gsl_vector_add_constant (gsl_vector * a, const double x)
See the GSL documentation for voluminous further options.
double gsl_matrix_max (const gsl_matrix * m)
double gsl_matrix_min (const gsl_matrix * m)
void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out)
void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax)
void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin)
void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax)
gsl_vector_max (const gsl_vector * v)
gsl_vector_min (const gsl_vector * v)
gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out)
gsl_vector_max_index (const gsl_vector * v)
gsl_vector_min_index (const gsl_vector * v)
gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)
For most of these, you can add a weights vector for weighted mean/var/cov/..., such as apop_vector_mean(d->vector, .weights=d->weights)
There are no functions provided to convert from apop_data to the constituent elements, because you don't need a function.
If you need an individual element, you can just use its pointer to retrieve it:
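For instance (my_data is an illustrative apop_data pointer):

    gsl_vector *v = my_data->vector;                      // the vector element
    gsl_matrix *m = my_data->matrix;                      // the matrix element
    double datum  = gsl_matrix_get(my_data->matrix, 3, 2);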
In the other direction, you can use compound literals to wrap an apop_data struct around a loose vector or matrix:
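A sketch using a C99 compound literal around an existing gsl_vector *v:

    apop_data *dv = &(apop_data){ .vector = v };        // no copying; dv borrows v
    apop_model *est = apop_estimate(dv, apop_normal);   // usable wherever an apop_data set is expected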
There are also utility functions for converting between a raw double * array and a gsl_vector; see apop_array_to_vector.
If you generate your data set from the database via apop_query_to_data (or apop_query_to_text or apop_query_to_mixed_data) then column names appear as expected. Set apop_opts.db_name_column
to the name of a column in your query result to use that column name for row names.
Sample uses, given an apop_data set d:
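(field and function names per the apop_name documentation; treat them as assumptions):

    printf("row 0 is named %s\n", d->names->row[0]);
    printf("%i columns are named\n", d->names->colct);
    apop_name_add(d->names, "age", 'c');    // append a column name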
The apop_data set includes a grid of strings, text
, for holding text data.
Text should be encoded in UTF-8. US ASCII is a subset of UTF-8, so that's OK too.
There are a few simple forms for handling the text
element of an apop_data
set, which handle the tedium of memory-handling for you.
The number of rows of text in tdata is tdata->textsize[0]; the number of columns is tdata->textsize[1].
Refer to individual elements via tdata->text[row][col].
x[0] can always be written as *x, which may save some typing. The number of rows is *tdata->textsize. If you have a single column of text data (i.e., all data is in column zero), then item i is *tdata->text[i]. If you know you have exactly one cell of text, then its value is **tdata->text.
Blank cells are recorded as the empty string "", which you can check via if (!strlen(dataset->text[i][j])) printf("<blank>") or if (!*dataset->text[i][j]) printf("<blank>"). For the sake of efficiency when dealing with large, sparse data sets, all blank cells point to the same static empty string, meaning that freeing cells must be done with care. Your best bet is to rely on apop_text_add, apop_text_alloc, and apop_text_free to do the memory management for you.
Here is a sample program that uses these forms, plus a few text-handling functions.
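A minimal sketch of these forms (apop_text_alloc sizes the grid; apop_text_add takes printf-style arguments):

    #include <apop.h>

    int main(){
        apop_data *tdata = apop_text_alloc(NULL, 2, 1);   // two rows, one column of text
        apop_text_add(tdata, 0, 0, "first");
        apop_text_add(tdata, 1, 0, "row %i", 1);
        printf("%i rows of text; cell (1,0) reads '%s'.\n", tdata->textsize[0], tdata->text[1][0]);
        apop_data_free(tdata);                            // frees the text grid as well
    }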
After a query like dataset = apop_query_to_text("select onecolumn from data"); you have a sequence of strings, dataset->text[0][0], dataset->text[1][0], .... After apop_data *dt = apop_data_transpose(dataset), you will have a single list of strings, dt->text[0], which is often useful as input to list-of-strings handling functions.
, which is often useful as input to list-of-strings handling functions.Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors.
A dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable in a regression) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors. The functions that generate factors and dummies will add an informational page to your apop_data set with a name like <categories for your_column>
listing the conversion from the artificial numeric factor to the original data. Use apop_data_get_factor_names to get a pointer to that page.
You can use the factor table to translate from numeric categories back to text (though you probably have the original text column in your data anyway).
This is especially useful if you want text in two data sets to have the same categories. Generate factors in the first set, then copy the factor list to the second, then run apop_data_to_factors on the second.
These are convenience functions to handle interaction with SQLite or mySQL/mariaDB. They open one and only one database, and handle most of the interaction therewith for you.
You will probably first use apop_text_to_db to pull data into the database, then apop_query to clean the data in the database, and finally apop_query_to_data to pull some subset of the data out for analysis.
See the Database moments (plus pow()!) page for not-SQL-standard math functions that you can use when sending queries from Apophenia, such as pow
, stddev
, or sqrt
.
Use apop_data_print with .output_type='d' to print your apop_data set to the database.
You can add other databases via the SQL command attach database. By default with SQLite, Apophenia opens an in-memory database handle. It is a sensible workflow to use the faster in-memory database as the primary, and then attach an on-disk database to read in data and write final output tables.
Extracting data from the database
Writing data to the database
See the print functions below. E.g.
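For example (option names per apop_prep_output; treat them as assumptions):

    apop_data_print(d, .output_name="dtable", .output_type='d');   // write d to database table "dtable"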
A few functions have proven to be useful enough to be worth breaking out into their own programs, for use in scripts or other data analysis from the command line:
The apop_text_to_db command-line utility is a wrapper for the apop_text_to_db C function.
The apop_db_to_crosstab utility is a wrapper for the apop_db_to_crosstab function.
The apop_plot_query utility produces a plot from the database. It is especially useful for histograms, which are binned via apop_data_to_bins before plotting.
Begin with the most common use: the estimate
function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate:
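In outline (the query, and apop_model_print's (model, file) form, are illustrative):

    apop_data *data = apop_query_to_data("select age, income, bought from purchases");
    apop_model *the_estimate = apop_estimate(data, apop_probit);
    apop_model_print(the_estimate, stdout);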
Along the way to estimating the parameters, most models also find covariance estimates for the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show.
The apop_probit
model that ships with Apophenia is unparameterized: apop_probit.parameters==NULL
. The output from the estimation, the_estimate
, has the same form as apop_probit
, but the_estimate->parameters
has a meaningful value.
A call to apop_estimate produces more than just the estimated parameters. Most will produce any of a covariance matrix, some hypothesis tests, a list of expected values, log likelihood, AIC, AIC_c, BIC, et cetera.
First, note that if you don't want all that, adding to your model an apop_parts_wanted_settings group with its default values (see below on settings groups) signals to the model that you want only the parameters and to not waste CPU time on covariances, expected values, et cetera. See the apop_parts_wanted_settings documentation for examples and further refinements.
The actual parameter estimates are at your_model->parameters.
The input data set remains attached at your_model->data.
Scalar statistics of the model are listed in the output model's info group, and can be retrieved via a form like:
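    apop_data_get(your_model->info, .rowname="log likelihood");   // row names like "AIC" and "BIC" follow the same pattern (names assumed)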
But we expect much more from a model than just estimating parameters from data.
Continuing the above example where we got an estimated Probit model named the_estimate
, we can interrogate the estimate in various familiar ways. In each of the following examples, the model object holds enough information that the generic function being called can do its work:
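For instance (apop_model_draws and the "<Covariance>" page name are taken from the documented interface; treat them as assumptions):

    apop_data *draws = apop_model_draws(the_estimate, 100);                          // 100 draws from the estimated model
    double ll = apop_log_likelihood(the_estimate->data, the_estimate);               // log likelihood of the data under the estimate
    apop_data *cov = apop_data_get_page(the_estimate->parameters, "<Covariance>");   // the parameter covariance matrix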
Apophenia ships with many well-known models for your immediate use, including probability distributions, such as the apop_normal, apop_poisson, or apop_beta models. The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit; e.g., assume the data is Normally distributed and find the mean and variance: apop_estimate(your_data
, apop_normal).
There are also linear models like apop_ols, apop_probit, and apop_logit. As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model.
If the data set has its weights vector filled, apop_ols estimates weighted OLS.
Simulation models seem to not fit this form, but you will see below that if you can write an objective function for the p
method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters.
But some models have to break uniformity, like how a histogram has a list of bins that makes no sense for a Normal distribution. These are held in settings groups, which you will occasionally need to tweak to modify how a model is handled or estimated. The most common example would be for maximum likelihood, e.g.:
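A sketch (the settings-element names follow apop_mle_settings, and by-value versus by-pointer calling conventions varied across versions; adjust to your headers):

    apop_model *m = apop_model_copy(apop_probit);
    Apop_settings_add_group(m, apop_mle, .tolerance=1e-6, .max_iterations=1000);
    apop_model *est = apop_estimate(data, *m);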
See below for more details on using settings groups.
The models that ship with Apophenia have the requisite procedures for estimation, making draws, and so on, but have parameters==NULL
and settings==NULL
. The model is thus, for many purposes, incomplete, and you will need to take some action to complete the model. As per the examples to follow, there are several possibilities:
Estimate the parameters from data via the apop_estimate function. The input model is unchanged, but the output model has parameters and settings in place.
Set the parameters element by hand, via apop_data_falloc. For most purposes, you will also need to set the msize1, msize2, vsize, and dsize elements to the size you want. See the example below.
elements to the size you want. See the example below.Here is an example that shows the options for parameterizing a model. After each parameterization, 20 draws are made and written to a file named draws-[modelname].
Where to from here? See the Models page for a list of basic functions that make use of them, along with a list of the canned models, including popular favorites like apop_beta, apop_binomial, apop_iv (instrumental variables), apop_kernel_density, apop_loess, apop_lognormal, apop_pmf (see the histogram section below), and apop_poisson.
If you need to write a new model, see Writing new models.
The model structure makes it easy to generate new models that are variants of prior models. Bayesian updating, for example, takes in one apop_model that we call the prior, one apop_model that we call a likelihood, and outputs an apop_model that we call the posterior. One can produce complex models using simpler transformations as well. For example, to generate a one-parameter Normal(μ, 1) given the code for a Normal(μ, σ):
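A sketch via parameter fixing (apop_model_set_parameters and apop_model_fix_params per the library's transformation tools; NaN marks the still-free parameter):

    apop_model *N_mu_1 = apop_model_fix_params(apop_model_set_parameters(apop_normal, GSL_NAN, 1));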
This can be used anywhere the original Normal distribution can be. If we need to truncate the distribution in the data space:
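If your version provides apop_model_dconstrain, a sketch (the constraint function returns nonzero for in-bounds data; all names here are assumptions):

    double above_zero(apop_data *d, apop_model *m){ return apop_data_get(d) > 0; }
    apop_model *trunc = apop_model_dconstrain(.base_model=N_mu_1, .constraint=above_zero);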
Chaining together simpler transformations is an easy method to produce models of arbitrary detail.
For example, apop_model_stack(apop_normal, apop_poisson) stacks the two models into a single distribution.
[For info on specific settings groups and their contents and use, see the Settings page.]
Describing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. E.g., an MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of bins to be filled with data.
So, the apop_model includes a single list which can hold an arbitrary number of groups of settings, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type.
Settings groups are automatically initialized with default values when needed. If the defaults do no harm, then you don't need to think about these settings groups at all.
Here is an example where a settings group is worth tweaking: the apop_parts_wanted_settings group indicates which parts of the auxiliary data you want.
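In sketch form (by-value versus by-pointer conventions vary across versions):

    apop_model *m = apop_model_copy(apop_probit);                     // line one
    Apop_settings_add_group(m, apop_parts_wanted, .covariance='y');   // line two
    apop_model *est = apop_estimate(data, *m);                        // line three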
Line one establishes the baseline form of the model. Line two adds a settings group of type apop_parts_wanted_settings to the model. By default other auxiliary items, like the expected values, are set to 'n'
when using this group, so this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want.
Notice that the _settings
ending to the settings group's name isn't written—macros make it happen. The remaining arguments to Apop_settings_add_group
(if any) follow the Designated initializers syntax of the form .setting=value
.
There is an apop_model_copy_set macro that adds a settings group when it is first copied, joining up lines one and two above:
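For example (same caveats as above):

    apop_model *m = apop_model_copy_set(apop_probit, apop_parts_wanted, .covariance='y');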
Settings groups are copied with the model, which facilitates chaining estimations. Continuing the above example, you could re-estimate to get the predicted values and covariance via:
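A sketch (the predicted element name is an assumption):

    Apop_settings_set(est, apop_parts_wanted, predicted, 'y');   // now also want predicted values
    apop_model *est2 = apop_estimate(data, *est);                // settings carry over to the re-estimation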
There is a macro for gdb that will help you pull a settings group out of a model for your inspection, to cut and paste into your .gdbinit. It shouldn't be too difficult to modify this macro for other debuggers.
For just using a model, that's all of what you need to know. For details on writing a new settings group, see Writing new settings groups.
Just about any hypothesis test consists of a few common steps: produce a statistic, find the statistic's distribution, and find where the observed statistic falls on that distribution.
If the statistic is from a common form, like the parameters from an OLS regression, then the commonly-associated test is probably included as part of the estimation output, typically as a row in the
info
element of the apop_data set.
Some tests, like ANOVA, produce a statistic using a specialized procedure, so Apophenia includes some functions, like apop_test_anova_independence and apop_test_kolmogorov, to produce the statistic and look up its significance level.
If you are producing a statistic that you know has a common form, like a central limit theorem tells you that your statistic is Normally distributed, then the convenience function apop_test will do the final lookup step of checking where your statistic lies on your chosen distribution.
See also the example at the end of A quick overview.
To give another example of testing, here is a function that used to be a part of Apophenia, but seemed a bit out of place. Here it is as a sample:
Or, consider the Rao statistic, $R = \nabla LL(\beta)' \, I^{-1}(\beta) \, \nabla LL(\beta)$, where $LL$ is your model's log likelihood function and $I$ its information matrix. Given the correct assumptions, $R \sim \chi^2_k$, where $k$ is the dimension of $\beta$, so the odds of a Type I error given the model is $1 - F_{\chi^2_k}(R)$.
The apop_pmf model wraps a apop_data set so it can be read as an empirical model, with a likelihoood function (equal to the associated weight for observed values and zero for unobserved values), a random number generator (which simply makes weighted random draws from the data), and so on. Setting it up is a model estimation from data like any other, done via apop_estimate(your_data
, apop_pmf).
You have the option of cleaning up the data before turning it into a PMF. For example...
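For example (both steps optional; apop_data_sort and apop_data_pmf_compress per the data-manipulation functions):

    apop_data_sort(your_data);                // sort rows
    apop_data_pmf_compress(your_data);        // merge duplicate rows, summing their weights
    apop_model *pmf = apop_estimate(your_data, apop_pmf);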
These are largely optional.
It is the weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no weights
segment, then I assume that all rows have equal weight.
Most models have a parameters
apop_data set that is filled when you call apop_estimate. For a PMF model, the parameters
are NULL
, and the data
itself is used for calculation. Therefore, modifying the data post-estimation can break some internal settings set during estimation. If you modify the data, throw away any existing PMFs (apop_model_free
) and re-estimate a new one.
Using apop_data_pmf_compress puts the data into one bin for each unique value in the data set. You may instead want bins of fixed width, in the style of a histogram, which you can get via apop_data_to_bins. It requires a bin specification. If you send a NULL binspec, then the offset is zero and the bin size is big enough to ensure that there are bins from minimum to maximum. There are other preferred formulæ for bin widths that minimize MSE, which might be added as a future extension. The binspec will be added as a page to the data set, named "<binspec>". See the apop_data_to_bins documentation on how to write a custom bin spec.
There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In both cases, you will need two distributions based on the same binspec.
For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync:
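A sketch (the binspec argument and page name per apop_data_to_bins's documentation):

    apop_data *binned1 = apop_data_to_bins(data1, NULL);   // attaches a "<binspec>" page
    apop_data *binned2 = apop_data_to_bins(data2, apop_data_get_page(binned1, "<binspec>"));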
You can use apop_test_kolmogorov or apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.
Kernel density estimation will produce a smoothed PDF. See apop_kernel_density for details. Or, use apop_vector_moving_average for a simpler smoothing method.
As above, set up the PMF via apop_estimate(your_data, apop_pmf); the compression and binning steps are optional.
This section includes some notes on the maximum likelihood routine. As in the section on writing models above, if a model has a p or log_likelihood method but no estimate method, then calling apop_estimate(your_data, your_model) executes the default estimation routine of maximum likelihood.
If you are not a statistician, then there are a few things you will need to keep in mind:
The model you are optimizing may take in NULL data, with the routine optimizing only the parameters internal to the model.
The function returns a single number to be minimized. You will need to write an apop_model to send to the optimizer, which is a two-step process: write a log likelihood function wrapping the real objective function, and a model that uses that log likelihood. Here they are:
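A sketch around the classic banana (Rosenbrock) objective, matching the elements described next (struct-field order and calling conventions vary by version; gsl_pow_2 is from GSL):

    #include <apop.h>

    typedef struct { double a, b; } coeff_struct;

    double banana(double *params, coeff_struct *in){
        return in->a * gsl_pow_2(1 - params[0])
             + in->b * gsl_pow_2(params[1] - gsl_pow_2(params[0]));
    }

    double ll(apop_data *d, apop_model *in){
        return -banana(in->parameters->vector->data, in->more);  // negate: the MLE routine maximizes
    }

    int main(){
        coeff_struct co = {.a=1, .b=100};
        apop_model banana_model = {"Banana", .vsize=2, .log_likelihood=ll,
                                   .more=&co, .more_size=sizeof(coeff_struct)};
        apop_model *est = apop_estimate(NULL, banana_model);
        apop_data_print(est->parameters);
    }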
The vsize=2
specified that your parameters are a vector of size two, which means that in->parameters->vector->data
is the list of double
s that you should send to banana
. The more
element of the structure is designed to hold any arbitrary structure; if you use it, you will also need to use the more_size
element, as above.
To record the optimizer's path across the parameter space, add .want_path='y' to the settings group, then get the path element from the settings group:
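A sketch (Apop_settings_get per the settings macros; .want_path as above):

    Apop_settings_add_group(&banana_model, apop_mle, .want_path='y');
    apop_model *est = apop_estimate(NULL, banana_model);
    apop_data *path = Apop_settings_get(est, apop_mle, path);
    apop_data_print(path);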
The solution is to rewrite the function being maximized such that the function is continuous at the constraint boundary but takes a steep downward slope. The unconstrained maximization routines will be able to search a continuous function but will never return a solution that falls beyond the parameter limits.
If you give it a likelihood function with no regard to constraints plus an array of constraints, apop_maximum_likelihood will combine them to a function that fits the above description and search accordingly.
A constraint function must do three things:
The idea is that if the constraint returns zero, the log likelihood function will return the log likelihood as usual, and if not, it will return the log likelihood at the constraint's return vector minus the penalty. To give a concrete example, here is a constraint function that will ensure that both parameters of a two-dimensional input are greater than zero. As with the constraints for many of the models that ship with Apophenia, it is a wrapper for apop_linear_constraint.
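A sketch patterned on apop_linear_constraint's documented use (the 1e-3 margin is an assumption; the constraint data encodes 0 < beta_0 and 0 < beta_1):

    static double both_positive(apop_data *data, apop_model *v){
        static apop_data *constraint = NULL;
        if (!constraint) constraint = apop_data_falloc((2,2,2),
                                          0, 1, 0,    // 0 < 1*beta_0 + 0*beta_1
                                          0, 0, 1);   // 0 < 0*beta_0 + 1*beta_1
        return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
    }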
If your model has no estimate element, just call apop_estimate to run an MLE.
The output routines handle four sinks for your output. There is a global variable that you can use for small projects where all data will go to the same place.
You can also set the output type, the name of the output file or table, and other options via arguments to individual calls to output functions. See apop_prep_output for the list of options.
C makes minimal distinction between pipes and files, so you can set a pipe or file as output and send all output there until further notice:
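For example (the file name is illustrative; a pipe from popen works the same way):

    apop_opts.output_pipe = fopen("all_results", "w");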
Continuing the example, you can always override the global data with a specific request:
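A sketch (option names per apop_prep_output; treat them as assumptions):

    apop_vector_print(v, "vout");               // this vector goes to the file vout
    apop_data_print(d, .output_pipe=stdout);    // this data set goes to stdout regardless of the global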
I will first look to the input file name, then the input pipe, then the global output_pipe, in that order, to determine where I should write. Some combinations (like output type = 'd' and only a pipe) don't make sense, and I'll try to warn you about those.
What if you have too much output and would like to use a pager, like less
or more
? In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is the form:
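The form, sketched with POSIX popen/pclose:

    FILE *lesspipe = popen("less", "w");
    assert(lesspipe);
    apop_opts.output_pipe = lesspipe;   // everything now goes through the pager
    //...print everything...
    pclose(lesspipe);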
popen
will search your usual program path for less
, so you don't have to give a full path.
If none of these is set, the default is stdout.
The plot functions produce output for Gnuplot (so output type = 'd' again does not make sense). As above, you can pipe directly to Gnuplot or write to a file. Please consider these to be deprecated, as there is better graphics support in the works.
A few more descriptive methods:
General utilities:
Math utilities:
For your convenience, here are links to some other libraries you are probably using.