Utility package for subsetting, partitioning, iterating, and resampling of Machine Learning datasets. Aside from providing common functionality, this library also offers first-class support for custom user-defined data structures.
Typical Machine Learning experiments require a lot of rather mundane but error-prone data handling glue code. One particularly interesting category of data handling functionality, and the sole focus of this package, is what we call data access patterns. These patterns include subsetting, resampling, iteration, and partitioning of various types of data sets.
MLDataPattern was designed around the core requirement of
providing first-class support for user-defined data sources. This
idea is based on the assumption that the data source a user is
working with is likely of some very user-specific custom type.
That said, we also put a lot of attention into first-class
support for the most commonly used data container, such as
Note that this package serves as a back-end for the end-user facing MLDataUtils.jl. If you are an end-user who is mainly interested in working with data interactively and conveniently you may want to use MLDataUtils instead.
Let us take a look at a hello world example (with little explanation) to get a feeling for how to use this package in a typical ML scenario. Note how the code snippet below does not reason about any training algorithm in any way. It is not the responsibility of an access pattern to decide in what form some algorithm needs the data. The output of a pattern depends solely on its input. In fact, the package is designed to be data agnostic. Instead, the focus is on efficiently chaining subsetting operations, thus avoiding temporary allocations where possible.
    using MLDataUtils # reexports MLDataPattern

    # X is a matrix of floats
    # Y is a vector of strings
    X, Y = MLDataUtils.load_iris()

    # The iris dataset is ordered according to their labels,
    # which means that we should shuffle the dataset before
    # partitioning it into training- and test-set.
    Xs, Ys = shuffleobs((X, Y))
    # Notice how we use tuples to group data.

    # We leave out 15 % of the data for testing
    (cv_X, cv_Y), (test_X, test_Y) = splitobs((Xs, Ys); at = 0.85)

    # Next we partition the data using a 10-fold scheme.
    # Notice how we do not need to splat train into X and Y
    for (train, (val_X, val_Y)) in kfolds((cv_X, cv_Y); k = 10)

        for epoch = 1:100
            # Iterate over the data using mini-batches of 5 observations each
            for (batch_X, batch_Y) in eachbatch(train, size = 5)
                # ... train supervised model on minibatches here
            end
        end
    end
In the above code snippet, the inner loop for eachbatch is the
only place where data other than indices is actually being
copied. That is because val_X, etc. are all array views of type
SubArray (the same applies to all the Y's of course). In contrast
to this, batch_X and batch_Y will be of type Array. Naturally,
array views only work for arrays, but this package provides a
generalization of such a "subset" for any type of data container.
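The difference between an array view and a copy can be illustrated with plain Base Julia. The sketch below does not use this package's API at all; the variable names and the sizes are assumptions chosen only for illustration.

```julia
# Plain Base Julia; no package required.
X = rand(4, 150)                 # 4 features, 150 observations stored as columns
idx = 1:10                       # the observations we want to subset

subset_view = view(X, :, idx)    # lazy subset: references the parent and the indices
subset_copy = X[:, idx]          # eager subset: allocates a new 4x10 matrix

println(subset_view isa SubArray)    # true
println(subset_copy isa Array)       # true

# A view shares memory with its parent, a copy does not.
X[1, 1] = -1.0
println(subset_view[1, 1] == -1.0)   # true
println(subset_copy[1, 1] == -1.0)   # false, since rand only yields values in [0, 1)
```

Because the view only stores a reference to the parent array plus the indices, creating it does not duplicate any observations.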
Furthermore, batch_X and batch_Y will be the same
instances each iteration with only their values changed. In other
words, they both are preallocated buffers that will be reused
each iteration and filled with the data for the current batch.
Naturally, it is not a requirement to work with buffers like
this, as stateful iterators can have undesired side effects when
used without care. This package provides different alternatives
for different use cases.
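To make the side effect of a reused buffer concrete, here is a minimal sketch in plain Base Julia. It is not how this package implements its iterators; the helper fill_batch! and all variable names are hypothetical and exist only for this illustration.

```julia
# Plain Base Julia; no package required.
X = reshape(collect(1.0:12.0), 2, 6)   # 2 features, 6 observations stored as columns
buffer = similar(X, 2, 3)              # preallocated storage for batches of 3 observations

# Hypothetical helper: overwrite `buffer` in place with the requested observations.
function fill_batch!(buffer, X, idx)
    buffer .= view(X, :, idx)
    return buffer
end

batch1 = fill_batch!(buffer, X, 1:3)
saved = copy(batch1)                   # an explicit copy preserves the first batch
batch2 = fill_batch!(buffer, X, 4:6)

println(batch1 === batch2)   # true: both names refer to the same buffer
println(batch1 == saved)     # false: the first batch was overwritten in place
println(saved)               # [1.0 3.0 5.0; 2.0 4.0 6.0]
```

Anyone who stores a batch for later use (for example by pushing it into a list) would need such an explicit copy, which is the kind of care the paragraph above refers to.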
For a much more detailed treatment, check out the latest documentation.
Additionally, you can make use of Julia's native docsystem. The
following example shows how to get additional information on
kfolds within Julia's REPL:

    ?kfolds
This package is registered in
METADATA.jl and can be installed
as usual. Just start up Julia and type the following code snippet
into the REPL. It makes use of the native Julia package manager.
    import Pkg
    Pkg.add("MLDataPattern")
This code is free to use under the terms of the MIT license.