Appendix A — Working with data tables

Load the packages to be used

Code
using DataFrames
using EmbraceUncertainty: dataset
using MixedModels
using Tables
using TypedTables

A call to fit a mixed-effects model using the MixedModels package follows the formula/data specification that is common to many statistical model-fitting packages, especially those in R. In Julia, model formulas and contrasts, which are used to produce model matrices from a formula/data specification, are implemented in the StatsModels.jl package.

The data argument must be able to be expressed as a column-oriented data table, which, for these purposes, is a named, ordered collection of columns, each of which is a homogeneous (i.e. all the elements have the same type) vector and all of which have the same length.

Implementations of column-oriented data tables are ubiquitous in data science, where they are often called data frames. These include the data.frame type in R and the data.table package for R, and the DataFrame class in the pandas package for Python and, as a separate implementation, in the polars package for Python and Rust.

A.1 The Arrow format for data tables

An increasing popular representation of data frames is as Arrow tables. Polars uses the Arrow format internally as does the DuckDB database. Recently it was announced that the 2.0 release of pandas will allow for Arrow tables.

Arrow specifies a language-independent tabular memory format that provides many desirable properties, such as provision for missing data and compact representation of categorical data vectors, for data science. The memory format also defines a file format for storing and exchanging data tables. Because the memory format is essentially the same as the file format, Arrow files can be memory-mapped providing very fast read speeds.

Furthermore, the Arrow Project provides a reference implementation of the Arrow format and tools for manipulating data in that format as a C++ library, which is used by implementations in several other languages, including C, C#, Go, Java, JavaScript, MATLAB, Python, R, and Ruby. The Julia implementation in Arrow.jl does not call functions in the C++ library. Instead it implements the format in Julia code in such a way that Arrow vectors behave like native Julia vectors. That is, Arrow vectors are a subtype of AbstractVector. A similar approach is taken in the implementations of the Arrow format for the Rust language.

The data sets in the MixedModels package and the auxiliary data sets used in this book are stored in the Arrow file format and retrieved as Arrow.Tables. There are many examples throughout this book of loading such data sets.

contra = dataset(:contra)
Arrow.Table with 1934 rows, 5 columns, and schema:
 :dist   String
 :urban  String
 :livch  String
 :age    Float64
 :use    String

Often, for ease of access and for display, we convert the Arrow table to a Table, which, contrary to convention, is a type defined in TypedTables.jl and not in Tables.jl.

Before going into detail about the properties and use of the Table type, let us first discuss the role of Tables.jl.

A.2 Tables.jl provides an “interface for tables”

An important characteristic of any system for working with data tables is whether the table is stored in memory column-wise or row-wise.

As described above, most implementations of data frames for data science store the data column-wise.

In relational database management systems (RDBMS), such as PostgreSQL, SQLite, and a multitude of commercial systems, a data table, called a relation, is typically stored row-wise. Such systems typically use SQL, the structured query language, to define and access the data in tables, which is why that acronym appears in many of the names. There are exceptions to the row-wise rule, such as DuckDB, an SQL-based RDBMS, that, as mentioned above, represents relations as Arrow tables.

Many external representations of data tables, such as in comma-separated-value (CSV) files, are row-oriented. Furthermore, it is often convenient to generate a data table a row at a time.

Thus it becomes convenient to have a “clearing house” that can accept either row tables or column tables and provided the desired form to a downstream package, such as StatsModels.jl. Tables.jl does exactly this. It is not an implementation of data tables itself, but rather it defines “an interface for tables in Julia”, allowing a row-oriented table to be accessed column-wise and vice-versa.

It defines a prototype column-oriented table, Tables.ColumnTable, as a NamedTuple of vectors.

Tables.ColumnTable
NamedTuple{names, T} where {N, names, T<:Tuple{Vararg{AbstractVector, N}}}

and a prototype row-oriented table as a vector of NamedTuples.

Tables.RowTable
AbstractVector{T} where T<:NamedTuple (alias for AbstractArray{T, 1} where T<:NamedTuple)

The actual implementation of a row-table or column-table type may be different from these prototypes but it must provide access methods as if it were one of these types. Tables.jl provides the “glue” to treat a particular data table type as if it were row-oriented, by calling Tables.rows or Tables.rowtable on it, or column-oriented, by calling Tables.columntable on it.

A.3 The Table type from TypedTables

TypedTables.jl is a lightweight package (about 1500 lines of source code) that provides a concrete implementation of column-tables, called simply Table, as a NamedTuple of vectors.

A Table that is constructed from another type of column-table, such as an Arrow.Table or a DataFrame or an explicit NamedTuple of vectors, is simply a wrapper around the original table’s contents. On the other hand, constructing a Table from a row table first creates a ColumnTable, then wraps it.

contratbl = Table(contra)
Table with 5 columns and 1934 rows:
      dist  urban  livch  age     use
    ┌────────────────────────────────
 1  │ D01   Y      3+     18.44   N
 2  │ D01   Y      0      -5.56   N
 3  │ D01   Y      2      1.44    N
 4  │ D01   Y      3+     8.44    N
 5  │ D01   Y      0      -13.56  N
 6  │ D01   Y      0      -11.56  N
 7  │ D01   Y      3+     18.44   N
 8  │ D01   Y      3+     -3.56   N
 9  │ D01   Y      1      -5.56   N
 10 │ D01   Y      3+     1.44    N
 11 │ D01   Y      0      -11.56  Y
 12 │ D01   Y      0      -2.56   N
 13 │ D01   Y      1      -4.56   N
 14 │ D01   Y      3+     5.44    N
 15 │ D01   Y      3+     -0.56   N
 16 │ D01   Y      3+     4.44    Y
 17 │ D01   Y      0      -5.56   N
 ⋮  │  ⋮      ⋮      ⋮      ⋮      ⋮
typeof(contratbl)
Table{@NamedTuple{dist::String, urban::String, livch::String, age::Float64, use::String}, 1, @NamedTuple{dist::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, urban::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, livch::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}, age::Arrow.DictEncoded{Float64, Int8, Arrow.Primitive{Float64, Vector{Float64}}}, use::Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}}}

(The output from that expression is a very long string. You need to scroll to the right over the output to see all the output.)

This type of table is said to be “strongly typed”, meaning that the data type itself contains a wealth of detail about the exact form of the table, allowing the Julia compiler to generate efficient code for operations on the table. That is the positive aspect of being so specific about the names of the columns and the details of the type of data in each column. However, it also means that this mechanism is not suitable for tables with a large number, say hundreds or thousands, of columns, which can overburden the compiler.

A.3.1 Accessing columns or rows in a Table

The methods for accessing columns or rows in a Table are simple.

A column is accessed by its name as a “property”, either using the getproperty extractor function or, more commonly, with the dot (.) operator, returning a vector.

contratbl.urban
1934-element Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}:
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 ⋮
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"

The column names are Symbols, not strings, usually typed as a : followed by the name, as shown in

columnnames(contratbl)
(:dist, :urban, :livch, :age, :use)

The : form for creating the Symbol requires that the column name be a valid variable name in Julia. If, for example, a column name contains a blank, the : form must be replaced by an expression like var"<name>", which invokes what is called a “string macro”.

contratbl.var"urban"
1934-element Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}:
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 "Y"
 ⋮
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"
 "N"

A row is accessed by its index, either using the getindex function or, more commonly, with the index in square brackets, returning a NamedTuple for a singleton index or another Table for a vector-valued index.

contratbl[1]
(dist = "D01", urban = "Y", livch = "3+", age = 18.44, use = "N")
contratbl[2:5]
Table with 5 columns and 4 rows:
     dist  urban  livch  age     use
   ┌────────────────────────────────
 1 │ D01   Y      0      -5.56   N
 2 │ D01   Y      2      1.44    N
 3 │ D01   Y      3+     8.44    N
 4 │ D01   Y      0      -13.56  N

(Notice that the row numbers are not part of the table. Extracting a subset of the rows produces a table with row numbers starting at one, regardless of what the original row numbers were.)

But there is much more to the indexing than simply extracting a subset of rows - it provides an iterator interface to Table.

A.3.2 A trivial example done several ways

Suppose we wish to select the rows from district D49 as a table. We could create a Boolean vector and use it to index into contratbl

contratbl[contratbl.dist .== "D49"]
Table with 5 columns and 4 rows:
     dist  urban  livch  age     use
   ┌────────────────────────────────
 1 │ D49   N      0      -12.56  N
 2 │ D49   N      0      -9.56   N
 3 │ D49   N      0      -10.56  N
 4 │ D49   N      3+     2.44    N

When comparing a vector, like contratbl.dist to a single string or number, like "D47" we must “vectorize” the operation, which is done here using dot vectorization

Or we could filter the rows of the table by applying a function to each row to determine if the dist field has the value "D49".

isD49dist(row) = row.dist == "D49" # a 'one-liner' function definition
filter(isD49dist, contratbl)
Table with 5 columns and 4 rows:
     dist  urban  livch  age     use
   ┌────────────────────────────────
 1 │ D49   N      0      -12.56  N
 2 │ D49   N      0      -9.56   N
 3 │ D49   N      0      -10.56  N
 4 │ D49   N      3+     2.44    N

Or we could write the filter function as an anonymous function

filter(r -> r.dist == "D49", contratbl)
Table with 5 columns and 4 rows:
     dist  urban  livch  age     use
   ┌────────────────────────────────
 1 │ D49   N      0      -12.56  N
 2 │ D49   N      0      -9.56   N
 3 │ D49   N      0      -10.56  N
 4 │ D49   N      3+     2.44    N

Or we could write the filter function as the composition of a function that extracts the first value from the row, which is the dist value, and a function that compares that value to "D49".

filter(==("D49")  first, contratbl)
Table with 5 columns and 4 rows:
     dist  urban  livch  age     use
   ┌────────────────────────────────
 1 │ D49   N      0      -12.56  N
 2 │ D49   N      0      -9.56   N
 3 │ D49   N      0      -10.56  N
 4 │ D49   N      3+     2.44    N

The function composition operator, , typed as \circ<tab>, is described in this manual section,

Or we could write a generator expression

Table(r for r in contratbl if r.dist == "D49")
Table with 5 columns and 4 rows:
     dist  urban  livch  age     use
   ┌────────────────────────────────
 1 │ D49   N      0      -12.56  N
 2 │ D49   N      0      -9.56   N
 3 │ D49   N      0      -10.56  N
 4 │ D49   N      3+     2.44    N

The point is that all of these variations are from the base Julia language and simply rely on the fact that contratbl can be treated as an iterator over the rows of the table.

As shown in the last code block, the process of iterating over the rows of a Table can be applied in reverse, constructing a Table from an iterator or a generator expression that returns NamedTuples. In Chapter 6 a newdata table is constructed from the Cartesian product of vectors of covariates as the newdata table.

newdata = Table(
  (; age=a, ch=c, urban=u)  # NamedTuple from iterator product
  for a in -10:3:20, c in [false, true], u in ["N", "Y"]
)
Table with 3 columns and 44 rows:
      age  ch     urban
    ┌──────────────────
 1  │ -10  false  N
 2  │ -7   false  N
 3  │ -4   false  N
 4  │ -1   false  N
 5  │ 2    false  N
 6  │ 5    false  N
 7  │ 8    false  N
 8  │ 11   false  N
 9  │ 14   false  N
 10 │ 17   false  N
 11 │ 20   false  N
 12 │ -10  true   N
 13 │ -7   true   N
 14 │ -4   true   N
 15 │ -1   true   N
 16 │ 2    true   N
 17 │ 5    true   N
 ⋮  │  ⋮     ⋮      ⋮

In general a Tuple is written as a comma-separated set of values within parentheses.

typeof((1, true, 'R'))
Tuple{Int64, Bool, Char}

This looks like the arguments to a function call without the function name, which is not accidental - internally the structure is exactly that of the arguments to a function call. Just as we can optionally separate the positional arguments from the named arguments with ; in a function call, we can indicate that we are generating a NamedTuple by prefacing the named values with ; as shown in this example.

In this expression the ; is not necessary but there is another form where we just give the name of the variable, like simply specifying contrasts in the function call like fit(MixedModel, form, data; contrasts), where the ; indicates that the following arguments are named arguments so that contrasts by itself is equivalent to contrasts=contrasts specifying both the name and the value.

Thus the newdata table could be constructed as

newdata = Table(
  (; age, ch, urban) for age in -10:3:20, ch in [false, true],
  urban in ["N", "Y"]
)
Table with 3 columns and 44 rows:
      age  ch     urban
    ┌──────────────────
 1  │ -10  false  N
 2  │ -7   false  N
 3  │ -4   false  N
 4  │ -1   false  N
 5  │ 2    false  N
 6  │ 5    false  N
 7  │ 8    false  N
 8  │ 11   false  N
 9  │ 14   false  N
 10 │ 17   false  N
 11 │ 20   false  N
 12 │ -10  true   N
 13 │ -7   true   N
 14 │ -4   true   N
 15 │ -1   true   N
 16 │ 2    true   N
 17 │ 5    true   N
 ⋮  │  ⋮     ⋮      ⋮

TypedTables.jl leverages the power of the base Julia language and its implementation of concepts such as iterators to provide data manipulation without needing to re-implement each concept from scratch.

A.3.3 Adding or removing columns from a Table

Because TypeTables.Table wraps a NamedTuple of vectors, which is an immutable type, a Table’s column names and types cannot be changed. However, it is easy and fast to create a new Table from an existing Table. (The reason this operation is fast is because it does not copy the contents of the vectors in the table, it just creates a new NamedTuple and wrapper referencing the existing contents.)

During the creation of a new Table columns can be added or removed.

For example, in Chapter 6 we added a Boolean column, ch, indicating if livch is not "0", to the contra table using an expression like

contratbl = Table(contratbl; ch=contratbl.livch .== "0")
Table with 6 columns and 1934 rows:
      dist  urban  livch  age     use  ch
    ┌───────────────────────────────────────
 1  │ D01   Y      3+     18.44   N    false
 2  │ D01   Y      0      -5.56   N    true
 3  │ D01   Y      2      1.44    N    false
 4  │ D01   Y      3+     8.44    N    false
 5  │ D01   Y      0      -13.56  N    true
 6  │ D01   Y      0      -11.56  N    true
 7  │ D01   Y      3+     18.44   N    false
 8  │ D01   Y      3+     -3.56   N    false
 9  │ D01   Y      1      -5.56   N    false
 10 │ D01   Y      3+     1.44    N    false
 11 │ D01   Y      0      -11.56  Y    true
 12 │ D01   Y      0      -2.56   N    true
 13 │ D01   Y      1      -4.56   N    false
 14 │ D01   Y      3+     5.44    N    false
 15 │ D01   Y      3+     -0.56   N    false
 16 │ D01   Y      3+     4.44    Y    false
 17 │ D01   Y      0      -5.56   N    true
 ⋮  │  ⋮      ⋮      ⋮      ⋮      ⋮     ⋮

If later we decide that we can do without this column we can drop it using getproperties whose second argument should be a Tuple of Symbols of the names of the columns to retain.

contratbl = Table(
  getproperties(contratbl, (:dist, :urban, :livch, :age, :use)),
)
Table with 5 columns and 1934 rows:
      dist  urban  livch  age     use
    ┌────────────────────────────────
 1  │ D01   Y      3+     18.44   N
 2  │ D01   Y      0      -5.56   N
 3  │ D01   Y      2      1.44    N
 4  │ D01   Y      3+     8.44    N
 5  │ D01   Y      0      -13.56  N
 6  │ D01   Y      0      -11.56  N
 7  │ D01   Y      3+     18.44   N
 8  │ D01   Y      3+     -3.56   N
 9  │ D01   Y      1      -5.56   N
 10 │ D01   Y      3+     1.44    N
 11 │ D01   Y      0      -11.56  Y
 12 │ D01   Y      0      -2.56   N
 13 │ D01   Y      1      -4.56   N
 14 │ D01   Y      3+     5.44    N
 15 │ D01   Y      3+     -0.56   N
 16 │ D01   Y      3+     4.44    Y
 17 │ D01   Y      0      -5.56   N
 ⋮  │  ⋮      ⋮      ⋮      ⋮      ⋮

A.4 The DataFrames package

The JuliaData organization manages the development of several packages related to data science and data management, including DataFrames.jl, a comprehensive system for working with column-oriented data tables in Julia. Kamiński (2023), written by the primary author of that package, provides an in-depth introduction to data science facilities, in particular the DataFrames package, in Julia.

This package is particularly well-suited to more advanced data manipulation such as the split-apply-combine strategy (Wickham, 2011) and “joins” of data tables.

Bouchet-Valat & Kamiński (2023) compares the performance of DataFrames.jl to other data frame implementations in R and Python.

This page was rendered from git revision 05a171b .