## Code

```
using DataFrames
using EmbraceUncertainty: dataset
using MixedModels
using Tables
using TypedTables
```

Load the packages to be used

```
using DataFrames
using EmbraceUncertainty: dataset
using MixedModels
using Tables
using TypedTables
```

A call to fit a mixed-effects model using the `MixedModels`

package follows the *formula/data* specification that is common to many statistical model-fitting packages, especially those in R. In Julia, model formulas and contrasts, which are used to produce model matrices from a formula/data specification, are implemented in the StatsModels.jl package.

The `data`

argument must be able to be expressed as a column-oriented *data table*, which, for these purposes, is a named, ordered collection of columns, each of which is a homogeneous (i.e. all the elements have the same type) vector and all of which have the same length.

Implementations of column-oriented data tables are ubiquitous in data science, where they are often called *data frames*. These include the *data.frame* type in R and the data.table package for R, and the `DataFrame`

class in the pandas package for Python and, as a separate implementation, in the polars package for Python and Rust.

An increasing popular representation of data frames is as Arrow tables. Polars uses the Arrow format internally as does the DuckDB database. Recently it was announced that the 2.0 release of pandas will allow for Arrow tables.

Arrow specifies a language-independent tabular memory format that provides many desirable properties, such as provision for missing data and compact representation of categorical data vectors, for data science. The memory format also defines a file format for storing and exchanging data tables. Because the memory format is essentially the same as the file format, Arrow files can be memory-mapped providing very fast read speeds.

Furthermore, the Arrow Project provides a reference implementation of the Arrow format and tools for manipulating data in that format as a C++ library, which is used by implementations in several other languages, including C, C#, Go, Java, JavaScript, MATLAB, Python, R, and Ruby. The Julia implementation in Arrow.jl does not call functions in the C++ library. Instead it implements the format in Julia code in such a way that Arrow vectors behave like native Julia vectors. That is, Arrow vectors are a subtype of `AbstractVector`

. A similar approach is taken in the implementations of the Arrow format for the Rust language.

The data sets in the `MixedModels`

package and the auxiliary data sets used in this book are stored in the Arrow file format and retrieved as `Arrow.Table`

s. There are many examples throughout this book of loading such data sets.

`= dataset(:contra) contra `

```
Arrow.Table with 1934 rows, 5 columns, and schema:
:dist String
:urban String
:livch String
:age Float64
:use String
```

Often, for ease of access and for display, we convert the Arrow table to a `Table`

, which, contrary to convention, is a type defined in TypedTables.jl and not in Tables.jl.

Before going into detail about the properties and use of the `Table`

type, let us first discuss the role of Tables.jl.

An important characteristic of any system for working with data tables is whether the table is stored in memory column-wise or row-wise.

As described above, most implementations of data frames for data science store the data column-wise.

In *relational database management systems* (RDBMS), such as PostgreSQL, SQLite, and a multitude of commercial systems, a data table, called a *relation*, is typically stored row-wise. Such systems typically use SQL, the *structured query language*, to define and access the data in tables, which is why that acronym appears in many of the names. There are exceptions to the row-wise rule, such as DuckDB, an SQL-based RDBMS, that, as mentioned above, represents relations as Arrow tables.

Many external representations of data tables, such as in comma-separated-value (CSV) files, are row-oriented. Furthermore, it is often convenient to generate a data table a row at a time.

Thus it becomes convenient to have a “clearing house” that can accept either row tables or column tables and provided the desired form to a downstream package, such as StatsModels.jl. Tables.jl does exactly this. It is not an implementation of data tables itself, but rather it defines “an interface for tables in Julia”, allowing a row-oriented table to be accessed column-wise and vice-versa.

It defines a prototype column-oriented table, `Tables.ColumnTable`

, as a NamedTuple of vectors.

` Tables.ColumnTable`

`NamedTuple{names, T} where {N, names, T<:Tuple{Vararg{AbstractVector, N}}}`

and a prototype row-oriented table as a vector of `NamedTuple`

s.

` Tables.RowTable`

`AbstractVector{T} where T<:NamedTuple (alias for AbstractArray{T, 1} where T<:NamedTuple)`

The actual implementation of a row-table or column-table type may be different from these prototypes but it must provide access methods as if it were one of these types. `Tables.jl`

provides the “glue” to treat a particular data table type as if it were row-oriented, by calling `Tables.rows`

or `Tables.rowtable`

on it, or column-oriented, by calling `Tables.columntable`

on it.

TypedTables.jl is a lightweight package (about 1500 lines of source code) that provides a concrete implementation of column-tables, called simply `Table`

, as a `NamedTuple`

of vectors.

A `Table`

that is constructed from another type of column-table, such as an `Arrow.Table`

or a `DataFrame`

or an explicit `NamedTuple`

of vectors, is simply a wrapper around the original table’s contents. On the other hand, constructing a `Table`

from a row table first creates a `ColumnTable`

, then wraps it.

`= Table(contra) contratbl `

```
Table with 5 columns and 1934 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D01 Y 3+ 18.44 N
2 │ D01 Y 0 -5.56 N
3 │ D01 Y 2 1.44 N
4 │ D01 Y 3+ 8.44 N
5 │ D01 Y 0 -13.56 N
6 │ D01 Y 0 -11.56 N
7 │ D01 Y 3+ 18.44 N
8 │ D01 Y 3+ -3.56 N
9 │ D01 Y 1 -5.56 N
10 │ D01 Y 3+ 1.44 N
11 │ D01 Y 0 -11.56 Y
12 │ D01 Y 0 -2.56 N
13 │ D01 Y 1 -4.56 N
14 │ D01 Y 3+ 5.44 N
15 │ D01 Y 3+ -0.56 N
16 │ D01 Y 3+ 4.44 Y
17 │ D01 Y 0 -5.56 N
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
```

`typeof(contratbl)`

`Table{@NamedTuple{dist::String, urban::String, livch::String, age::Float64, use::String}, 1, @NamedTuple{dist::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, urban::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, livch::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, age::Arrow.DictEncoded{Float64, Int8, Arrow.Primitive{Float64, Vector{Float64}}}, use::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}}}`

(The output from that expression is a very long string. You need to scroll to the right over the output to see all the output.)

This type of table is said to be “strongly typed”, meaning that the data type itself contains a wealth of detail about the exact form of the table, allowing the Julia compiler to generate efficient code for operations on the table. That is the positive aspect of being so specific about the names of the columns and the details of the type of data in each column. However, it also means that this mechanism is not suitable for tables with a large number, say hundreds or thousands, of columns, which can overburden the compiler.

The methods for accessing columns or rows in a `Table`

are simple.

A column is accessed by its name as a “property”, either using the `getproperty`

extractor function or, more commonly, with the dot (`.`

) operator, returning a vector.

` contratbl.urban`

```
1934-element Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}:
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
⋮
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
```

The column names are Symbols, not strings, usually typed as a `:`

followed by the name, as shown in

`columnnames(contratbl)`

`(:dist, :urban, :livch, :age, :use)`

The `:`

form for creating the Symbol requires that the column name be a valid variable name in Julia. If, for example, a column name contains a blank, the `:`

form must be replaced by an expression like `var"<name>"`

, which invokes what is called a “string macro”.

`"urban" contratbl.var`

```
1934-element Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}:
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
"Y"
⋮
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
"N"
```

A row is accessed by its index, either using the `getindex`

function or, more commonly, with the index in square brackets, returning a `NamedTuple`

for a singleton index or another `Table`

for a vector-valued index.

`1] contratbl[`

`(dist = "D01", urban = "Y", livch = "3+", age = 18.44, use = "N")`

`2:5] contratbl[`

```
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D01 Y 0 -5.56 N
2 │ D01 Y 2 1.44 N
3 │ D01 Y 3+ 8.44 N
4 │ D01 Y 0 -13.56 N
```

(Notice that the row numbers are not part of the table. Extracting a subset of the rows produces a table with row numbers starting at one, regardless of what the original row numbers were.)

But there is much more to the indexing than simply extracting a subset of rows - it provides an iterator interface to `Table`

.

Suppose we wish to select the rows from district `D49`

as a table. We could create a Boolean vector and use it to index into `contratbl`

`.== "D49"] contratbl[contratbl.dist `

```
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
```

.== is the vectorized form of the equality comparison, ==

When comparing a vector, like `contratbl.dist`

to a single string or number, like `"D47"`

we must “vectorize” the operation, which is done here using dot vectorization

Or we could `filter`

the rows of the table by applying a function to each row to determine if the `dist`

field has the value `"D49"`

.

```
isD49dist(row) = row.dist == "D49" # a 'one-liner' function definition
filter(isD49dist, contratbl)
```

```
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
```

Or we could write the filter function as an anonymous function

`filter(r -> r.dist == "D49", contratbl)`

```
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
```

Or we could write the filter function as the composition of a function that extracts the first value from the row, which is the `dist`

value, and a function that compares that value to `"D49"`

.

`filter(==("D49") ∘ first, contratbl)`

```
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
```

The function composition operator

The function composition operator, `∘`

, typed as `\circ<tab>`

, is described in this manual section,

Or we could write a generator expression

`Table(r for r in contratbl if r.dist == "D49")`

```
Table with 5 columns and 4 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D49 N 0 -12.56 N
2 │ D49 N 0 -9.56 N
3 │ D49 N 0 -10.56 N
4 │ D49 N 3+ 2.44 N
```

The point is that all of these variations are from the base Julia language and simply rely on the fact that `contratbl`

can be treated as an *iterator* over the rows of the table.

As shown in the last code block, the process of iterating over the rows of a `Table`

can be applied in reverse, constructing a `Table`

from an iterator or a generator expression that returns `NamedTuple`

s. In Chapter 6 a `newdata`

table is constructed from the Cartesian product of vectors of covariates as the `newdata`

table.

```
= Table(
newdata =a, ch=c, urban=u) # NamedTuple from iterator product
(; agefor a in -10:3:20, c in [false, true], u in ["N", "Y"]
)
```

```
Table with 3 columns and 44 rows:
age ch urban
┌──────────────────
1 │ -10 false N
2 │ -7 false N
3 │ -4 false N
4 │ -1 false N
5 │ 2 false N
6 │ 5 false N
7 │ 8 false N
8 │ 11 false N
9 │ 14 false N
10 │ 17 false N
11 │ 20 false N
12 │ -10 true N
13 │ -7 true N
14 │ -4 true N
15 │ -1 true N
16 │ 2 true N
17 │ 5 true N
⋮ │ ⋮ ⋮ ⋮
```

Expression for creating a NamedTuple

In general a `Tuple`

is written as a comma-separated set of values within parentheses.

`typeof((1, true, 'R'))`

`Tuple{Int64, Bool, Char}`

This looks like the arguments to a function call without the function name, which is not accidental - internally the structure is exactly that of the arguments to a function call. Just as we can optionally separate the positional arguments from the named arguments with `;`

in a function call, we can indicate that we are generating a `NamedTuple`

by prefacing the named values with `;`

as shown in this example.

In this expression the `;`

is not necessary but there is another form where we just give the name of the variable, like simply specifying `contrasts`

in the function call like `fit(MixedModel, form, data; contrasts)`

, where the `;`

indicates that the following arguments are named arguments so that `contrasts`

by itself is equivalent to `contrasts=contrasts`

specifying both the name and the value.

Thus the `newdata`

table could be constructed as

```
= Table(
newdata for age in -10:3:20, ch in [false, true],
(; age, ch, urban) in ["N", "Y"]
urban )
```

```
Table with 3 columns and 44 rows:
age ch urban
┌──────────────────
1 │ -10 false N
2 │ -7 false N
3 │ -4 false N
4 │ -1 false N
5 │ 2 false N
6 │ 5 false N
7 │ 8 false N
8 │ 11 false N
9 │ 14 false N
10 │ 17 false N
11 │ 20 false N
12 │ -10 true N
13 │ -7 true N
14 │ -4 true N
15 │ -1 true N
16 │ 2 true N
17 │ 5 true N
⋮ │ ⋮ ⋮ ⋮
```

TypedTables.jl leverages the power of the base Julia language and its implementation of concepts such as iterators to provide data manipulation without needing to re-implement each concept from scratch.

Because `TypeTables.Table`

wraps a `NamedTuple`

of vectors, which is an immutable type, a `Table`

’s column names and types cannot be changed. However, it is easy and fast to create a new `Table`

from an existing `Table`

. (The reason this operation is fast is because it does not copy the contents of the vectors in the table, it just creates a new `NamedTuple`

and wrapper referencing the existing contents.)

During the creation of a new `Table`

columns can be added or removed.

For example, in Chapter 6 we added a Boolean column, `ch`

, indicating if `livch`

is not `"0"`

, to the `contra`

table using an expression like

`= Table(contratbl; ch=contratbl.livch .== "0") contratbl `

```
Table with 6 columns and 1934 rows:
dist urban livch age use ch
┌───────────────────────────────────────
1 │ D01 Y 3+ 18.44 N false
2 │ D01 Y 0 -5.56 N true
3 │ D01 Y 2 1.44 N false
4 │ D01 Y 3+ 8.44 N false
5 │ D01 Y 0 -13.56 N true
6 │ D01 Y 0 -11.56 N true
7 │ D01 Y 3+ 18.44 N false
8 │ D01 Y 3+ -3.56 N false
9 │ D01 Y 1 -5.56 N false
10 │ D01 Y 3+ 1.44 N false
11 │ D01 Y 0 -11.56 Y true
12 │ D01 Y 0 -2.56 N true
13 │ D01 Y 1 -4.56 N false
14 │ D01 Y 3+ 5.44 N false
15 │ D01 Y 3+ -0.56 N false
16 │ D01 Y 3+ 4.44 Y false
17 │ D01 Y 0 -5.56 N true
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
```

If later we decide that we can do without this column we can drop it using `getproperties`

whose second argument should be a `Tuple`

of `Symbol`

s of the names of the columns to retain.

```
= Table(
contratbl getproperties(contratbl, (:dist, :urban, :livch, :age, :use)),
)
```

```
Table with 5 columns and 1934 rows:
dist urban livch age use
┌────────────────────────────────
1 │ D01 Y 3+ 18.44 N
2 │ D01 Y 0 -5.56 N
3 │ D01 Y 2 1.44 N
4 │ D01 Y 3+ 8.44 N
5 │ D01 Y 0 -13.56 N
6 │ D01 Y 0 -11.56 N
7 │ D01 Y 3+ 18.44 N
8 │ D01 Y 3+ -3.56 N
9 │ D01 Y 1 -5.56 N
10 │ D01 Y 3+ 1.44 N
11 │ D01 Y 0 -11.56 Y
12 │ D01 Y 0 -2.56 N
13 │ D01 Y 1 -4.56 N
14 │ D01 Y 3+ 5.44 N
15 │ D01 Y 3+ -0.56 N
16 │ D01 Y 3+ 4.44 Y
17 │ D01 Y 0 -5.56 N
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
```

The JuliaData organization manages the development of several packages related to data science and data management, including DataFrames.jl, a comprehensive system for working with column-oriented data tables in Julia. Kamiński (2023), written by the primary author of that package, provides an in-depth introduction to data science facilities, in particular the `DataFrames`

package, in Julia.

This package is particularly well-suited to more advanced data manipulation such as the `split-apply-combine`

strategy (Wickham, 2011) and “joins” of data tables.

Bouchet-Valat & Kamiński (2023) compares the performance of DataFrames.jl to other data frame implementations in R and Python.

*This page was rendered from git revision 57a0584
.*