Load the packages to be used
using DataFrames
using EmbraceUncertainty: dataset
using MixedModels
using Tables
using TypedTables
A call to fit a mixed-effects model using the MixedModels package follows the formula/data specification that is common to many statistical model-fitting packages, especially those in R. In Julia, model formulas and contrasts, which are used to produce model matrices from a formula/data specification, are implemented in the StatsModels.jl package.
The data argument must be expressible as a column-oriented data table, which, for these purposes, is a named, ordered collection of columns, each of which is a homogeneous vector (i.e., all the elements have the same type) and all of which have the same length.
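As a minimal sketch of this definition, using only base Julia, a NamedTuple of vectors satisfies each of the requirements. The column names and values below are made up for illustration.

```julia
# Hypothetical two-column table: a named, ordered collection of vectors
cols = (dist = ["D01", "D01", "D02"], age = [18.44, -5.56, 1.44])

colnames = keys(cols)                     # ordered column names
# every column is a vector ...
allvectors = all(c -> c isa AbstractVector, values(cols))
# ... each with a single element type (homogeneous) ...
eltypes = map(eltype, values(cols))
# ... and all columns have the same length
nrows = length(first(cols))
samelength = all(c -> length(c) == nrows, values(cols))
```

Nothing here depends on a table package; the same checks apply to any object that can be presented in this form.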
Implementations of column-oriented data tables are ubiquitous in data science, where they are often called data frames. These include the data.frame type in R, the data.table package for R, the DataFrame class in the pandas package for Python and, as a separate implementation, the DataFrame class in the polars package for Python and Rust.
An increasingly popular representation of data frames is as Arrow tables. Polars uses the Arrow format internally, as does the DuckDB database. Recently it was announced that the 2.0 release of pandas will allow for Arrow tables.
Arrow specifies a language-independent tabular memory format that provides many properties desirable for data science, such as provision for missing data and compact representation of categorical data vectors. The memory format also defines a file format for storing and exchanging data tables. Because the memory format is essentially the same as the file format, Arrow files can be memory-mapped, providing very fast read speeds.
Furthermore, the Arrow Project provides a reference implementation of the Arrow format and tools for manipulating data in that format as a C++ library, which is used by implementations in several other languages, including C, C#, Go, Java, JavaScript, MATLAB, Python, R, and Ruby. The Julia implementation in Arrow.jl does not call functions in the C++ library. Instead it implements the format in Julia code in such a way that Arrow vectors behave like native Julia vectors. That is, Arrow vectors are a subtype of AbstractVector. A similar approach is taken in the implementations of the Arrow format for the Rust language.
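The following sketch illustrates the point about Arrow vectors on a small, made-up table. It assumes that the Arrow.jl package is installed (it is not among the packages loaded at the top of this section) and uses a temporary file rather than one of the book's data sets.

```julia
using Arrow

# Hypothetical small table, written as a NamedTuple of vectors
original = (x = [1, 2, 3], g = ["a", "b", "a"])

path = tempname() * ".arrow"      # temporary file name
Arrow.write(path, original)       # store the table in the Arrow file format
tbl = Arrow.Table(path)           # read it back as an Arrow.Table

xcol = tbl.x                      # an Arrow vector, not a Vector{Int}
isvec = xcol isa AbstractVector   # but still a subtype of AbstractVector
total = sum(xcol)                 # so generic Julia functions apply to it
```

Because the column is an AbstractVector, code written for ordinary Julia vectors generally works on it unchanged.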
The data sets in the MixedModels package and the auxiliary data sets used in this book are stored in the Arrow file format and retrieved as Arrow.Table objects. There are many examples throughout this book of loading such data sets.
contra = dataset(:contra)
Arrow.Table with 1934 rows, 5 columns, and schema:
:dist String
:urban String
:livch String
:age Float64
:use String
Often, for ease of access and for display, we convert the Arrow table to a Table, which, contrary to convention, is a type defined in TypedTables.jl and not in Tables.jl.
Before going into detail about the properties and use of the Table type, let us first discuss the role of Tables.jl.
An important characteristic of any system for working with data tables is whether the table is stored in memory column-wise or row-wise.
As described above, most implementations of data frames for data science store the data column-wise.
In relational database management systems (RDBMS), such as PostgreSQL, SQLite, and a multitude of commercial systems, a data table, called a relation, is typically stored row-wise. Such systems typically use SQL, the structured query language, to define and access the data in tables, which is why that acronym appears in many of the names. There are exceptions to the row-wise rule, such as DuckDB, an SQL-based RDBMS, that, as mentioned above, represents relations as Arrow tables.
Many external representations of data tables, such as in comma-separated-value (CSV) files, are row-oriented. Furthermore, it is often convenient to generate a data table a row at a time.
Thus it becomes convenient to have a “clearing house” that can accept either row tables or column tables and provide the desired form to a downstream package, such as StatsModels.jl. Tables.jl does exactly this. It is not an implementation of data tables itself, but rather it defines “an interface for tables in Julia”, allowing a row-oriented table to be accessed column-wise and vice versa.
It defines a prototype column-oriented table, Tables.ColumnTable, as a NamedTuple of vectors.
Tables.ColumnTable
NamedTuple{names, T} where {N, names, T<:Tuple{Vararg{AbstractVector, N}}}
It also defines a prototype row-oriented table, Tables.RowTable, as a vector of NamedTuples.
Tables.RowTable
AbstractVector{T} where T<:NamedTuple (alias for AbstractArray{T, 1} where T<:NamedTuple)
The actual implementation of a row-table or column-table type may be different from these prototypes but it must provide access methods as if it were one of these types. Tables.jl provides the “glue” to treat a particular data table type as if it were row-oriented, by calling Tables.rows or Tables.rowtable on it, or column-oriented, by calling Tables.columntable on it.
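A small sketch of this “glue”, using made-up values, converts a row-oriented table to the column-oriented prototype and back. Only Tables.columntable and Tables.rowtable, both named above, are used.

```julia
using Tables

# A row-oriented table: a vector of NamedTuples (hypothetical values)
rowtbl = [(a = 1, b = "x"), (a = 2, b = "y")]
isrowtable = rowtbl isa Tables.RowTable

# View the same data column-wise, as a NamedTuple of vectors
coltbl = Tables.columntable(rowtbl)
iscoltable = coltbl isa Tables.ColumnTable

# And convert back to a vector of NamedTuples
roundtrip = Tables.rowtable(coltbl)
```

A downstream package can therefore ask for whichever orientation it needs without knowing the concrete type of the table it was given.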
TypedTables.jl is a lightweight package (about 1500 lines of source code) that provides a concrete implementation of column-tables, called simply Table, as a NamedTuple of vectors.
A Table that is constructed from another type of column-table, such as an Arrow.Table or a DataFrame or an explicit NamedTuple of vectors, is simply a wrapper around the original table’s contents. On the other hand, constructing a Table from a row table first creates a ColumnTable, then wraps it.
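The two construction paths can be sketched on made-up data. The check with === is intended to show that wrapping a NamedTuple of vectors keeps the original vector rather than copying it; the variable names are hypothetical.

```julia
using TypedTables

age = [18.44, -5.56, 1.44]                      # hypothetical column values
nt = (dist = ["D01", "D01", "D02"], age = age)

t = Table(nt)              # wrap a NamedTuple of vectors
shared = t.age === age     # the very same vector object, not a copy

rows = [(a = 1, b = "x"), (a = 2, b = "y")]     # a row table
t2 = Table(rows)           # columns are constructed first, then wrapped
```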
contratbl = Table(contra)
Table with 5 columns and 1934 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D01 Y 3+ 18.44 N
2 │ D01 Y 0 -5.56 N
3 │ D01 Y 2 1.44 N
4 │ D01 Y 3+ 8.44 N
5 │ D01 Y 0 -13.56 N
6 │ D01 Y 0 -11.56 N
7 │ D01 Y 3+ 18.44 N
8 │ D01 Y 3+ -3.56 N
9 │ D01 Y 1 -5.56 N
10 │ D01 Y 3+ 1.44 N
11 │ D01 Y 0 -11.56 Y
12 │ D01 Y 0 -2.56 N
13 │ D01 Y 1 -4.56 N
14 │ D01 Y 3+ 5.44 N
15 │ D01 Y 3+ -0.56 N
16 │ D01 Y 3+ 4.44 Y
17 │ D01 Y 0 -5.56 N
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
typeof(contratbl)
Table{@NamedTuple{dist::String, urban::String, livch::String, age::Float64, use::String}, 1, @NamedTuple{dist::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, urban::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, livch::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, age::Arrow.DictEncoded{Float64, Int8, Arrow.Primitive{Float64, Vector{Float64}}}, use::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}}}
(The output from that expression is a very long string. You need to scroll to the right to see all of it.)
This type of table is said to be “strongly typed”, meaning that the data type itself contains a wealth of detail about the exact form of the table, allowing the Julia compiler to generate efficient code for operations on the table. That is the positive aspect of being so specific about the names of the columns and the details of the type of data in each column. However, it also means that this mechanism is not suitable for tables with a large number, say hundreds or thousands, of columns, which can overburden the compiler.
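As a rough illustration on a made-up two-column table, the column names and element types can be recovered from the type alone, without looking at any data. The functions fieldnames and fieldtypes are from base Julia.

```julia
using TypedTables

t = Table(a = [1, 2], b = ["x", "y"])   # hypothetical two-column table

rowtype = eltype(t)                 # the NamedTuple type of a single row
colnames = fieldnames(rowtype)      # column names recorded in the type
coltypes = fieldtypes(rowtype)      # element type of each column
```

With hundreds of columns these type parameters become correspondingly long, which is the compiler burden mentioned above.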
The methods for accessing columns or rows in a Table are simple.
A column is accessed by its name as a “property”, either using the getproperty extractor function or, more commonly, with the dot (.) operator, returning a vector.
contratbl.urban
1934-element Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}:
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
⋮
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
The column names are Symbols, not strings, usually typed as a : followed by the name, as shown in
columnnames(contratbl)
(:dist, :urban, :livch, :age, :use)
The : form for creating the Symbol requires that the column name be a valid variable name in Julia. If, for example, a column name contains a blank, the : form must be replaced by an expression like var"<name>", which invokes what is called a “string macro”.
contratbl.var"urban"
1934-element Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}:
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
⋮
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
A row is accessed by its index, either using the getindex function or, more commonly, with the index in square brackets, returning a NamedTuple for a singleton index or another Table for a vector-valued index.
contratbl[1]
(dist = "D01", urban = "Y", livch = "3+", age = 18.44, use = "N")
contratbl[2:5]
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D01 Y 0 -5.56 N
2 │ D01 Y 2 1.44 N
3 │ D01 Y 3+ 8.44 N
4 │ D01 Y 0 -13.56 N
(Notice that the row numbers are not part of the table. Extracting a subset of the rows produces a table with row numbers starting at one, regardless of what the original row numbers were.)
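The access patterns described above can be collected in one sketch on a made-up miniature table, pairing each operator form with its function form.

```julia
using TypedTables

# Hypothetical miniature table
t = Table(dist = ["D01", "D01", "D49"], age = [18.44, -5.56, 2.44])

col = t.age                               # column by name, dot operator
samecol = getproperty(t, :age) === col    # the equivalent extractor call
viavar = t.var"age" === col               # var"..." names the same column

row = t[1]                                # a single index gives a NamedTuple
samerow = getindex(t, 1) == row           # the equivalent function call
sub = t[2:3]                              # a vector of indices gives a Table
firstofsub = sub[1]                       # rows of `sub` are numbered from 1
```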
But there is much more to the indexing than simply extracting a subset of rows: it provides an iterator interface to Table.
Suppose we wish to select the rows from district D49 as a table. We could create a Boolean vector and use it to index into contratbl.
contratbl[contratbl.dist .== "D49"]
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
When comparing a vector, like contratbl.dist, to a single string or number, like "D49", we must “vectorize” the operation, which is done here using dot vectorization.
Or we could filter the rows of the table by applying a function to each row to determine if the dist field has the value "D49".
isD49dist(row) = row.dist == "D49" # a 'one-liner' function definition
filter(isD49dist, contratbl)
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
Or we could write the filter function as an anonymous function
filter(r -> r.dist == "D49", contratbl)
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
Or we could write the filter function as the composition of a function that extracts the first value from the row, which is the dist value, and a function that compares that value to "D49".
filter(==("D49") ∘ first, contratbl)
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
The function composition operator, ∘, typed as \circ<tab>, is described in the Julia manual section on function composition.
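To make the composition concrete, the following base-Julia sketch applies the composed predicate to a single made-up row and compares it with the same computation written out step by step.

```julia
row = (dist = "D49", urban = "N", age = -12.56)   # hypothetical row

isD49 = ==("D49")             # partial application: x -> x == "D49"
pred = isD49 ∘ first          # extract the first field, then compare

composed = pred(row)
explicit = isD49(first(row))  # the same computation written out
other = pred((dist = "D01", urban = "Y", age = 1.44))
```

This form relies on dist being the first field of the row; the anonymous-function version refers to the field by name and does not.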
Or we could write a generator expression
Table(r for r in contratbl if r.dist == "D49")
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
The point is that all of these variations are from the base Julia language and simply rely on the fact that contratbl can be treated as an iterator over the rows of the table.
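The same idea extends to other base functions that accept an iterator or array. As a sketch on a made-up miniature table:

```julia
using TypedTables

# Hypothetical miniature of the contra table
t = Table(dist = ["D01", "D49", "D49"], age = [18.44, -12.56, 2.44])

nD49 = count(r -> r.dist == "D49", t)           # number of matching rows
ages = [r.age for r in t if r.dist == "D49"]    # comprehension over rows
firstD49 = findfirst(r -> r.dist == "D49", t)   # index of the first match
```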
As shown in the last code block, the process of iterating over the rows of a Table can be applied in reverse, constructing a Table from an iterator or a generator expression that returns NamedTuples. In Chapter 6 a newdata table is constructed from the Cartesian product of vectors of covariates.
newdata = Table(
  (; age=a, ch=c, urban=u)  # NamedTuple from iterator product
  for a in -10:3:20, c in [false, true], u in ["N", "Y"]
)
Table with 3 columns and 44 rows:
age ch urban
┌──────────────────
1 │ -10 false N
2 │ -7 false N
3 │ -4 false N
4 │ -1 false N
5 │ 2 false N
6 │ 5 false N
7 │ 8 false N
8 │ 11 false N
9 │ 14 false N
10 │ 17 false N
11 │ 20 false N
12 │ -10 true N
13 │ -7 true N
14 │ -4 true N
15 │ -1 true N
16 │ 2 true N
17 │ 5 true N
⋮ │ ⋮ ⋮ ⋮
In general a Tuple is written as a comma-separated set of values within parentheses.
typeof((1, true, 'R'))
Tuple{Int64, Bool, Char}
This looks like the arguments to a function call without the function name, which is not accidental - internally the structure is exactly that of the arguments to a function call. Just as we can optionally separate the positional arguments from the named arguments with ; in a function call, we can indicate that we are generating a NamedTuple by prefacing the named values with ; as shown in this example.
In this expression the ; is not necessary, but there is another form in which we give just the name of a variable. For example, in a function call like fit(MixedModel, form, data; contrasts), the ; indicates that the following arguments are named arguments, so that contrasts by itself is equivalent to contrasts=contrasts, specifying both the name and the value.
Thus the newdata table could be constructed as
newdata = Table(
  (; age, ch, urban) for age in -10:3:20, ch in [false, true],
  urban in ["N", "Y"]
)
Table with 3 columns and 44 rows:
age ch urban
┌──────────────────
1 │ -10 false N
2 │ -7 false N
3 │ -4 false N
4 │ -1 false N
5 │ 2 false N
6 │ 5 false N
7 │ 8 false N
8 │ 11 false N
9 │ 14 false N
10 │ 17 false N
11 │ 20 false N
12 │ -10 true N
13 │ -7 true N
14 │ -4 true N
15 │ -1 true N
16 │ 2 true N
17 │ 5 true N
⋮ │ ⋮ ⋮ ⋮
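The name-only shorthand described above can be checked in base Julia, for NamedTuples and for keyword arguments alike. The variables and the describe_row helper are hypothetical, introduced only for this sketch.

```julia
age, ch = 5, true                 # hypothetical local variables

explicit = (age = age, ch = ch)   # name and value both given
shorthand = (; age, ch)           # names taken from the variables

# The same shorthand for keyword arguments in a function call
describe_row(; age, ch) = "age=$age, ch=$ch"   # hypothetical helper
msg = describe_row(; age, ch)     # same as describe_row(; age=age, ch=ch)
```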
TypedTables.jl leverages the power of the base Julia language and its implementation of concepts such as iterators to provide data manipulation without needing to re-implement each concept from scratch.
Because TypedTables.Table wraps a NamedTuple of vectors, which is an immutable type, a Table’s column names and types cannot be changed. However, it is easy and fast to create a new Table from an existing Table. (This operation is fast because it does not copy the contents of the vectors in the table; it just creates a new NamedTuple and wrapper referencing the existing contents.)
During the creation of a new Table, columns can be added or removed.
For example, in Chapter 6 we added a Boolean column, ch, indicating if livch is not "0", to the contra table using an expression like
contratbl = Table(contratbl; ch=contratbl.livch .!= "0")
Table with 6 columns and 1934 rows:
dist urban livch age use ch
┌───────────────────────────────────────
1 │ D01 Y 3+ 18.44 N true
2 │ D01 Y 0 -5.56 N false
3 │ D01 Y 2 1.44 N true
4 │ D01 Y 3+ 8.44 N true
5 │ D01 Y 0 -13.56 N false
6 │ D01 Y 0 -11.56 N false
7 │ D01 Y 3+ 18.44 N true
8 │ D01 Y 3+ -3.56 N true
9 │ D01 Y 1 -5.56 N true
10 │ D01 Y 3+ 1.44 N true
11 │ D01 Y 0 -11.56 Y false
12 │ D01 Y 0 -2.56 N false
13 │ D01 Y 1 -4.56 N true
14 │ D01 Y 3+ 5.44 N true
15 │ D01 Y 3+ -0.56 N true
16 │ D01 Y 3+ 4.44 Y true
17 │ D01 Y 0 -5.56 N false
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
If later we decide that we can do without this column, we can drop it using getproperties, whose second argument should be a Tuple of Symbols naming the columns to retain.
contratbl = Table(
  getproperties(contratbl, (:dist, :urban, :livch, :age, :use)),
)
Table with 5 columns and 1934 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D01 Y 3+ 18.44 N
2 │ D01 Y 0 -5.56 N
3 │ D01 Y 2 1.44 N
4 │ D01 Y 3+ 8.44 N
5 │ D01 Y 0 -13.56 N
6 │ D01 Y 0 -11.56 N
7 │ D01 Y 3+ 18.44 N
8 │ D01 Y 3+ -3.56 N
9 │ D01 Y 1 -5.56 N
10 │ D01 Y 3+ 1.44 N
11 │ D01 Y 0 -11.56 Y
12 │ D01 Y 0 -2.56 N
13 │ D01 Y 1 -4.56 N
14 │ D01 Y 3+ 5.44 N
15 │ D01 Y 3+ -0.56 N
16 │ D01 Y 3+ 4.44 Y
17 │ D01 Y 0 -5.56 N
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
The JuliaData organization manages the development of several packages related to data science and data management, including DataFrames.jl, a comprehensive system for working with column-oriented data tables in Julia. Kamiński (2023), written by the primary author of that package, provides an in-depth introduction to data science facilities, in particular the DataFrames package, in Julia.
This package is particularly well-suited to more advanced data manipulation such as the split-apply-combine strategy (Wickham, 2011) and “joins” of data tables.
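As a brief sketch of both operations in DataFrames.jl, on made-up data, groupby and combine implement split-apply-combine and innerjoin performs one kind of join. The column names g, x, total and label are hypothetical.

```julia
using DataFrames

# Hypothetical data: one row per observation, `g` is the grouping key
df = DataFrame(g = ["a", "b", "a"], x = [1, 2, 3])

# Split on `g`, apply `sum` to `x`, combine into one row per group
totals = combine(groupby(df, :g), :x => sum => :total)

# Join the per-group totals to a table of group labels on the key `g`
labels = DataFrame(g = ["a", "b"], label = ["first", "second"])
joined = innerjoin(totals, labels; on = :g)
```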
Bouchet-Valat & Kamiński (2023) compares the performance of DataFrames.jl to other data frame implementations in R and Python.
This page was rendered from git revision 57a0584.