ORM

Object-Relational Model

The Low-Level API is an object-relational model for machine learning. Each class in the ORM maps to a table in a SQLite database that serves as a machine learning metastore.

The real power lies in the relationships between these objects (e.g. Label-Splitset-Feature and Queue-Job-Predictor-Prediction), which enable us to construct rule-based protocols for various types of data and analysis.

Goodbye, X_train, y_test. Hello, object-oriented machine learning.

from aiqc.orm import *

Automatic ‘id’ method argument

If an ORM-based class is instantiated, then any method called on the resulting object will automatically pass the object’s self.id in as its first positional argument:

queue = Queue.get_by_id(id)
queue.run_jobs()

However, if the class has not been instantiated, then the id is required:

Queue.run_jobs(id)

Although I did not design this pattern, it makes sense: ORMs allow you to fluidly traverse relational objects, and if you had to check the object.id of everything you returned before interacting with it, that would ruin the user-friendly experience.
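For example, here is a hypothetical traversal sketch. The attribute names follow the ForeignKeyFields documented in the sections below, so treat it as illustrative rather than exact:

queue = Queue.get_by_id(1)
queue.algorithm                 # the Algorithm this Queue trains
queue.splitset.label.dataset    # walk Splitset -> Label -> Dataset
queue.run_jobs()                # id is passed implicitly because queue is an instance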

0. BaseModel

The BaseModel class applies to all tables in the ORM. It’s metadata in the truest sense of the word.

Localized timestamps are handled by utils.config.timezone_now(). They are made human-readable via strftime('%Y%b%d_%H:%M:%S') → “2022Jun23_07:13:14”


0a. Methods


└── created_at()

Returns the creation timestamp in human-readable format.


└── updated_at()

Returns timestamp of the most recent update in human-readable format.


└── flip_star()

A way to toggle (favorite/ unfavorite) the is_starred attribute in order to make entries easy to find.


└── set_info()

Add descriptive information about an entry so that you remember why you created it.

set_info(name, description)

Argument

Type

Default

Description

name

str

None

Short name to remember this entry by

description

str

None

What is unique about this entry?
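A minimal usage sketch, reusing the queue object from the example above (the name and description values are made up):

queue.set_info(
    name          = 'first_tuning_run'
    , description = 'Baseline hyperparameter sweep on the iris data'
)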


0b. Attributes

Attribute

Type

Description

id

AutoField

Auto-incrementing integer (1-based, not zero-based) PrimaryKey

time_created

DateTimeField

Records a timestamp when the record is created

time_updated

DateTimeField

Initially records a timestamp when the record is created. Overwritten every time the record is updated.

is_starred

BooleanField

Used to indicate that the entry is a favorite

name

CharField

Short name to remember this entry by

description

CharField

What is unique about this entry?


1. Dataset

Datasets

The Dataset class provides the following subclasses for working with different types of data:

Type

Dimensionality

Supported Formats

Format (if ingested)

Tabular

2D

Files (Parquet, CSV, TSV) / Pandas DataFrame (in-memory)

Parquet

Sequence

3D

NumPy (in-memory ndarray, npy file)

npy

Image

4D

NumPy (in-memory ndarray, npy file) / Pillow-supported formats

npy

The names are merely suggestive, as the primary purpose of these subclasses is to provide a way to register data of known dimensionality. For example, a practitioner could ingest many uni-channel/ grayscale images as a 3D Sequence Dataset instead of a multi-channel 4D Image Dataset.

Why not 2D NumPy? The Dataset.Tabular class is intended for strict, column-specific dtypes and Parquet persistence upon ingestion. In practice, this conflicted too often with NumPy’s array-wide dtyping. We use the best tools for the job (df/pq for 2D) and (array/npy for ND).


1a. Methods

1ai. Registration

Most of the Dataset registration methods share these arguments/ concepts:

Argument

Description

ingest

Determines if raw data is either stored directly inside the metastore or remains on disk to be accessed via path/url. In-memory data like DataFrames and ndarrays must be ingested. Whereas file-based data like Parquet, NPY, Image folders/urls may remain remote. Regardless of whether or not the raw data is ingested, metadata is always derived from it by parsing: 2D via DataFrame and N-D via ndarray.

rename_columns

Useful for assigning column names to arrays or delimited files that would otherwise be unnamed. len(rename_columns) must match the number of columns in the raw data. Normally, an int-based range is assigned to unnamed columns. In this case, AIQC converts each column name to a string e.g. ‘1’ during the registration process.

retype

Change the dtype of data using np.types. All Dataset subclasses support mass retyping via np.type/ str(np.type). Only the Tabular subclass supports individual column retyping via dict(column=str(np.type)). If rename_columns is used in conjunction with retype=dict(), then each dict['column'] key must match its counterpart in rename_columns.

description

What information does this dataset contain? What is unique about this dataset/ version – did you edit the raw data, add rows, or change column names/ dtypes?

name

Triggers dataset versioning. Datasets that share a name will be assigned an auto-incrementing version:int number provided that they are not duplicates of each other based on a sha256_hexdigest:str hash. If you try to create an exact duplicate, it will warn you and return the matching duplicate instead of creating a new entity. This behavior makes it easy to rerun pipelines where Datasets are created inline.

Ingestion provides the following benefits, especially for entry-level users:

  • Persist in-memory datasets (Pandas DataFrames, NumPy ndarrays).

  • Keeps data coupled with the experiment in the portable SQLite file.

  • Provides a more immutable and out-of-the-way storage location in comparison to a laptop file system.

  • Encourages preserving tabular dtypes with the ecosystem-friendly Parquet format.

Why would I avoid ingestion?

  • Happy with where the original data lives: e.g. S3 bucket.

  • Don’t want to duplicate the data.

sha256? – It’s the one-way hash algorithm that GitHub aspires to upgrade to. AIQC runs it on the compressed data because that’s easier and probably less error-prone than intercepting the bytes of the fastparquet intermediary tables before appending the Parquet magic bytes.

Is SQLite a legitimate datastore? – In many cases, SQLite queries are faster than accessing data via a filesystem. It’s a stable, 22 year-old technology that serves as the default database for iOS e.g. Apple Photos. AIQC uses it to store raw data in byte format as a BlobField. I’ve stored tens-of-thousands of files in it over several years and never experienced corruption. Keep in mind that AWS S3 is a blob store, and the Microsoft equivalent service is literally called Azure Blob Storage. The max size of a BlobField is 2GB, so roughly 20GB of raw data after compression. Either way, the goal of machine learning isn’t to record the entire population within the weights of a neural network; it’s to find subsets that are representative of the broader population.


1ai1. Dataset.Tabular

Here are some of the ways practitioners can use this 2D structure:

Multiple subjects (1 row per sample) * Multi-variate 1D (1 col per attribute)

Single subject (1 row per timestamp) * Multi-variate 1D (1 col per attribute)

Multiple subjects (1 row per timestamp) * Uni-variate 0D (1 col per sample)

Tabular datasets may contain both features and labels


└── Dataset.Tabular.from_df()

dataset = Dataset.Tabular.from_df(
    dataframe
    , rename_columns
    , retype
    , description
    , name
)

Argument

Type

Default

Description

dataframe

DataFrame

Required

pd.DataFrame with int-based single index. DataFrames are always ingested.

rename_columns

list[str]

None

See Registration

retype

np.type / dict(column:np.type)

None

See Registration

description

str

None

See Registration

name

str

None

See Registration
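A minimal usage sketch, assuming an in-memory DataFrame named df with a hypothetical petal_width column:

dataset = Dataset.Tabular.from_df(
    dataframe     = df
    , retype      = dict(petal_width='float32')
    , name        = 'iris'
    , description = 'raw flower measurements'
)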


└── Dataset.Tabular.from_path()

Dataset.Tabular.from_path(
    file_path
    , ingest
    , rename_columns
    , retype
    , header
    , description
    , name
)

Argument

Type

Default

Description

file_path

str

Required

Parsed based on how the file name ends (.parquet, .tsv, .csv)

ingest

bool

True

See Registration. Defaults to True because I don’t want to rely on CSV files as a source of truth for dtypes, and compression works great in Parquet.

rename_columns

list[str]

None

See Registration

retype

np.type / dict(column:np.type)

None

See Registration

header

object

None

Passed to pd.read_csv(header=...) when parsing CSV/TSV files

description

str

None

See Registration

name

str

None

See Registration
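A minimal usage sketch; the file path and column names are hypothetical:

dataset = Dataset.Tabular.from_path(
    file_path        = '/data/iris.csv'
    , ingest         = True
    , rename_columns = ['sepal_length','sepal_width','petal_length','petal_width','species']
    , name           = 'iris'
)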


1ai2. Dataset.Sequence

Here are some of the ways practitioners can use this 3D structure:

Single subject (1 patient) * Multiple 2D sequences

Multiple subjects * Single 2D sequence

Sequence datasets are somewhat multi-modal in that, in order to perform supervised learning on them, they must eventually be paired with a Dataset.Tabular that acts as its Label.


└── Dataset.Sequence.from_numpy()

Dataset.Sequence.from_numpy(
    arr3D_or_npyPath
    , ingest
    , rename_columns
    , retype
    , description
    , name
)

Argument

Type

Default

Description

arr3D_or_npyPath

object / str

Required

3D array in the form of either an ndarray or npy file path

ingest

bool

None

See Registration. If left blank, ndarrays will be ingested and npy files will not. Errors if an ndarray is provided with ingest=False.

rename_columns

list[str]

None

See Registration

retype

np.type / dict(column:np.type)

None

See Registration

description

str

None

See Registration

name

str

None

See Registration
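A minimal usage sketch, assuming 100 samples of 30 timesteps across 4 hypothetical sensor columns:

import numpy as np

arr3D = np.random.rand(100, 30, 4)
dataset = Dataset.Sequence.from_numpy(
    arr3D_or_npyPath = arr3D
    , rename_columns = ['temperature','pressure','humidity','wind']
    , name           = 'sensor_readings'
)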


1ai3. Dataset.Image

Here are some of the ways practitioners can use this 4D structure:

Single subject (1 patient) * Multiple 3D images

Multiple subjects * Single 3D image

Users can ingest 4D data using either:

  • The Pillow library, which supports various formats

  • NumPy arrays as a simple alternative

Image datasets are somewhat multi-modal in that, in order to perform supervised learning on them, they must eventually be paired with a Dataset.Tabular that acts as its Label.


└── Dataset.Image.from_numpy()

Dataset.Image.from_numpy(
    arr4D_or_npyPath
    , ingest
    , rename_columns
    , retype
    , description
    , name
)

Argument

Type

Default

Description

arr4D_or_npyPath

object / str

Required

4D array in the form of either an ndarray or npy file path

ingest

bool

None

See Registration. If left blank, ndarrays will be ingested and npy will not. Errors if input is ndarray and ingest==False.

rename_columns

list[str]

None

See Registration

retype

np.type / dict(column:np.type)

None

See Registration

description

str

None

See Registration

name

str

None

See Registration


└── Dataset.Image.from_folder()

Dataset.Image.from_folder(
    folder_path
    , ingest
    , rename_columns
    , retype
    , description
    , name
)

Argument

Type

Default

Description

folder_path

str

Required

Folder of images to be ingested via Pillow. All images must be cropped to the same dimensions ahead of time.

ingest

bool

False

See Registration

rename_columns

list[str]

None

See Registration

retype

np.type / dict(column:np.type)

None

See Registration

description

str

None

See Registration

name

str

None

See Registration
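A minimal usage sketch; the folder path is hypothetical and the images are assumed to already share the same dimensions:

dataset = Dataset.Image.from_folder(
    folder_path = '/data/brain_scans/'
    , ingest    = False   # leave the image files on disk
    , name      = 'brain_scans'
)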


└── Dataset.Image.from_urls()

Dataset.Image.from_urls(
    urls
    , source_path
    , ingest
    , rename_columns
    , retype
    , description
    , name
)

Argument

Type

Default

Description

urls

list(str)

Required

URLs that point to an image to be ingested via Pillow. All images must be cropped to the same dimensions ahead of time.

source_path

str

None

Optionally record a shared directory, bucket, or FTP site where images are stored. The backend won’t use this information for anything.

ingest

bool

False

See Registration

rename_columns

list[str]

None

See Registration

retype

np.type / dict(column:np.type)

None

See Registration

description

str

None

See Registration

name

str

None

See Registration


1aii. Fetch

The following methods are exposed to end-users in case they want to inspect the data that they have ingested.


└── Dataset.to_arr()

Argument

Type

Default

Description

id

int

None

The identifier of the Dataset of interest

columns

list(str)

None

If left blank, includes all columns

samples

list(int)

None

If left blank, includes all samples

Subclass

Returns

Tabular

ndarray.ndim==2

Sequence

ndarray.ndim==3

Image

ndarray.ndim==4


└── Dataset.to_df()

Argument

Type

Default

Description

id

int

None

The identifier of the Dataset of interest

columns

list(str)

None

If left blank, includes all columns

samples

list(int)

None

If left blank, includes all samples

Subclass

Returns

Tabular

DataFrame

Sequence

list(DataFrame)

Image

list(list(DataFrame))


└── Dataset.to_pillow()

Argument

Type

Default

Description

id

int

None

The identifier of the Dataset of interest

samples

list(int)

None

If left blank, includes all samples

Subclass

Returns

Image

list(PIL.Image)


└── Dataset.get_dtypes()

Argument

Type

Default

Description

id

int

None

The identifier of the Dataset of interest

columns

list(str)

None

If left blank, includes all columns

Regardless of how the initial Dataset.dtype was formatted [e.g. single np.type / str(np.type) / dict(column=np.type)], this function intentionally returns the dtype of each column in dict(column=str(np.type)) format.
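For example, a hypothetical two-column Tabular Dataset might return:

dataset.get_dtypes()
# {'sepal_length': 'float64', 'species': 'object'}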


1b. Attributes

These are the fields in the Dataset table

Attribute

Type

Description

typ

CharField

The Dataset type: Tabular, Sequence, Image

source_format

CharField

The file format (Parquet, CSV, TSV) or in-memory class (DataFrame, ndarray)

source_path

CharField

The path of the original file/ folder

urls

JSONField

A list of URLs as an alternative to file paths/ folders

columns

JSONField

List of str-based names for each column

dtypes

JSONField

The dtype of each column. Tabular dtypes are saved in dict(column=str(np.type)) format, whereas Sequence and Image dtypes are saved as a single str(np.type)

shape

JSONField

Human-readable dictionary about the dimensions of the data e.g. samples:10, columns:5

sha256_hexdigest

CharField

A hash of the data to determine its uniqueness for versioning.

memory_MB

IntegerField

Size of the dataset in megabytes when loaded into memory

contains_nan

BooleanField

Whether or not the dataset contains any blank cells

header

PickleField

pd.read_csv(header) for TSV/CSV files.

is_ingested

BooleanField

Quick flag to see if the data was ingested. Exists to prevent querying the blob field unnecessarily.

blob

BlobField

The raw bytes of the data obtained via BytesIO().getvalue()

version

IntegerField

The auto-incrementing version number assigned to unique datasets that share a name


2. Feature

Determines the columns that will be used as predictive features during training. Columns are always the last dimension, shape[-1], of a dataset.


2a. Methods


└── Feature.from_dataset()

Feature.from_dataset(
    dataset_id
    , include_columns
    , exclude_columns
)

Argument

Type

Default

Description

dataset_id

int

Required

Dataset.id from which you want to derive Dataset.columns.

include_columns

list(str)

None

Specify columns that will be included in the Feature. All columns that are not specified will not be included.

exclude_columns

list(str)

None

Specify columns that will not be included in the Feature. All columns that are not specified will be included.

If neither include_columns nor exclude_columns is defined, then all columns will be used.

Both include_columns and exclude_columns cannot be used at the same time
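A minimal usage sketch, assuming a previously registered dataset whose hypothetical 'species' column is reserved for the Label:

feature = Feature.from_dataset(
    dataset_id        = dataset.id
    , exclude_columns = ['species']
)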


Fetch

These methods wrap Dataset’s fetch methods:

Method

Arguments

Returns

to_arr()

id:int, columns:list(str)=Feature.columns, samples:list(int)=None

ndarray 2D / 3D / 4D

to_df()

id:int, columns:list(str)=Feature.columns, samples:list(int)=None

df / list(df) / list(list(df))

get_dtypes()

id:int, columns:list(str)=Feature.columns

dict(column=str(np.type))


2b. Attributes

These are the fields in the Feature table

Attribute

Type

Description

columns

JSONField

The columns included in this featureset

columns_excluded

JSONField

The columns, if any, in the dataset that were not included

fitted_featurecoders

PickleField

When FeatureCoders fit an sklearn preprocessor to these columns, the fitted objects are saved here for downstream inverse_transform’ing

dataset

ForeignKeyField

Where these columns came from


3. Label

Determines the column(s) that will be used as a target during supervised analysis. Do not create a Label if you intend to conduct unsupervised/ self-supervised analysis.


3a. Methods


└── Label.from_dataset()

Label.from_dataset(
    dataset_id
    , columns
)

Argument

Type

Default

Description

dataset_id

int

Required

Dataset.id from which you want to derive Dataset.columns. Only Tabular Datasets may be used as a Label.

columns

list(str)

None

Specify columns that will be included in the Label. If left blank, defaults to all columns. If more than 1 column is provided, then the data in those columns must be in One-Hot Encoded (OHE) format.
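A minimal usage sketch, assuming the same hypothetical 'species' column that was excluded from the Feature above:

label = Label.from_dataset(
    dataset_id = dataset.id
    , columns  = ['species']
)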


Fetch

These methods wrap Dataset’s fetch methods:

Method

Arguments

Returns

to_arr()

id:int, columns:list(str)=Label.columns, samples:list(int)=None

ndarray 2D / 3D / 4D

to_df()

id:int, columns:list(str)=Label.columns, samples:list(int)=None

df / list(df) / list(list(df))

get_dtypes()

id:int, columns:list(str)=Label.columns

dict(column=str(np.type))


3b. Attributes

These are the fields in the Label table

Attribute

Type

Description

columns

JSONField

The column(s) included in this Label.

column_count

IntegerField

The number of columns in the Label. Used to determine if it is in validated OHE format or not

unique_classes

JSONField

Records all of the different values found in categorical columns. Not used for continuous columns.

fitted_labelcoder

PickleField

When a LabelCoder fits an sklearn preprocessor to these columns, the fitted object is saved here for downstream inverse_transform’ing

dataset

ForeignKeyField

Where these columns came from


4. Interpolate

If you don’t have time series data then you do not need interpolation.

If you have continuous columns with missing data in a time series, then interpolation allows you to fill in those blanks mathematically. It does so by fitting a curve to each column. Therefore each column passed to an interpolater must satisfy: np.issubdtype(dtype, np.floating).

Interpolation is the first preprocessor because you need to fill in blanks prior to encoding.

pandas.DataFrame.interpolate

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html

Is utilized due to its ease of use, variety of methods, and support of sparse indices. However, it does not follow the fit/transform pattern like many of the class-based sklearn preprocessors, so the interpolated training data is concatenated with the evaluation split during the interpolation of evaluation splits.

Below are the default settings if interpolate_kwargs=None that get passed to df.interpolate(). In my experience, method=spline produces the best results. However, if either (a) spline fails to fit to your data, or (b) you know that your pattern is linear - then try method=linear.

interpolate_kwargs = dict(
    method            = 'spline'
    , limit_direction = 'both'
    , limit_area      = None
    , axis            = 0
    , order           = 1
)

Because the sample dimension is different for each Dataset Type, they approach interpolation differently.

Dataset Type

Approach

Tabular

Unlike encoders, there is no fit object. So first the training data rows are interpolated independently. Then, when it comes time to interpolate other splits like validation, the training data is included in the sequence to be interpolated.

Sequence

Interpolation is run on each 2D sequence separately

Image

Interpolation is run on each 2D channel separately


4a. LabelInterpolater

Label is intended for a single column, so only 1 Interpolater will be used during Label.preprocess()


4ai. Methods

└── LabelInterpolater.from_label()

LabelInterpolater.from_label(
    label_id
    , process_separately
    , interpolate_kwargs
)

Argument

Type

Default

Description

label_id

int

Required

Points to the Label.columns to use

process_separately

bool

True

By default (True), the fit is restricted to the training data. This may be flipped to False, but doing so causes data leakage.

interpolate_kwargs

dict

None

Gets passed to df.interpolate(). See Interpolate section for defaults.
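A minimal usage sketch, assuming the Label’s column is continuous; overriding method='linear' is shown only as an example of passing custom interpolate_kwargs:

LabelInterpolater.from_label(
    label_id             = label.id
    , interpolate_kwargs = dict(method='linear', limit_direction='both', axis=0)
)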


4aii. Attributes

These are the fields in the LabelInterpolater table

Attribute

Type

Description

process_separately

BooleanField

Whether the interpolater was fit on the training data alone or on the entire dataset. Fitting on the entire dataset is an indicator of data leakage.

interpolate_kwargs

JSONField

Gets passed to df.interpolate(). See Interpolate section for defaults.

matching_columns

JSONField

The columns that were successfully interpolated

label

ForeignKeyField

The Label that this LabelInterpolater is applied to


4b. FeatureInterpolater

For multivariate datasets, columns/dtypes may need to be handled differently. So we use column/dtype filters to apply separate transformations. If the first transformation’s filter includes a certain column/dtype, then subsequent filters may not include that column/dtype.


4bi. Methods

└── FeatureInterpolater.from_feature()

FeatureInterpolater.from_feature(
    feature_id
    , process_separately
    , interpolate_kwargs
    , dtypes
    , columns
    , verbose
)

Argument

Type

Default

Description

feature_id

int

Required

Points to the Feature.columns to use

process_separately

bool

True

By default (True), the fit is restricted to the training data. This may be flipped to False, but doing so causes data leakage.

interpolate_kwargs

dict

None

The interpolate_kwargs:dict=None object is what gets passed to Pandas interpolation. In my experience, method=spline produces the best results. However, if either (a) spline fails to fit to your data, or (b) you know that your pattern is linear - then try method=linear.

dtypes

list(str)

None

The dtypes to include

columns

list(str)

None

The columns to include. Errors if any of the columns were already included by dtypes.

verbose

bool

True

If True, messages will be printed about the status of the interpolaters as they attempt to fit on the filtered columns


4bii. Attributes

These are the fields in the FeatureInterpolater table

Attribute

Type

Description

idx

IntegerField

Zero-based auto-incrementer that counts the number of FeatureInterpolaters attached to a Feature.

process_separately

BooleanField

Whether the interpolater was fit on the training data alone or on the entire dataset. Fitting on the entire dataset is an indicator of data leakage.

interpolate_kwargs

JSONField

Gets passed to df.interpolate(). See Interpolate section for defaults.

matching_columns

JSONField

The columns that matched the filter

leftover_columns

JSONField

The columns that were not included in the filter

leftover_dtypes

JSONField

The dtypes that were not included in the filter

original_filter

JSONField

dict().keys()==['include' ,'dtypes','columns']

feature

ForeignKeyField

The Feature that this FeatureInterpolater is applied to


5. Encode

Transform data into numerical format that is close to zero. Reference Encoding for more information.

There are two phases of encoding:

  1. Fit on train - the encoder learns about the values of the samples made available to it. Ideally, you only want to fit aka learn from your training split so that you are not leaking information from your validation and test splits into your model! However, categorical encoders are always fit on the entire dataset because they are not prone to leakage and any weights tied to empty OHE inputs will zero-out.

  2. Transform each split/fold.

Only sklearn.preprocessing methods are officially supported, but we have experimented with sklearn.feature_extraction.text.CountVectorizer


5a. LabelCoder

Label is intended for a single column, so only 1 LabelCoder will be used during Label.preprocess()

Unfortunately, the name “LabelEncoder” is occupied by sklearn.preprocessing.LabelEncoder


5ai. Methods

└── LabelCoder.from_label()

LabelCoder.from_label(
    label_id
    , sklearn_preprocess
)

Argument

Type

Default

Description

label_id

int

Required

Points to the Label.columns to use

sklearn_preprocess

object

Required

An instantiated sklearn.preprocessing class-based encoder - e.g. StandardScaler(), not StandardScaler or scale(). AIQC will automatically correct the attributes of your encoder to smooth out common errors they would cause - for example, preventing sparse SciPy matrix output (which errors during tensor conversion) and unnecessary data copy().
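A minimal usage sketch using a class-based sklearn encoder (OneHotEncoder is chosen only as an example for a categorical Label):

from sklearn.preprocessing import OneHotEncoder

LabelCoder.from_label(
    label_id             = label.id
    , sklearn_preprocess = OneHotEncoder()
)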


5aii. Attributes

These are the fields in the LabelCoder table

Attribute

Type

Description

only_fit_train

BooleanField

Whether or not the encoder was fit on the training data or the entire dataset

is_categorical

BooleanField

If the encoder is meant for categorical data, and therefore automatically fit on the entire dataset

sklearn_preprocess

PickleField

The instantiated sklearn.preprocessing class that was fit

matching_columns

JSONField

The columns that matched the dtype/ column name filters

encoding_dimension

CharField

Did the encoder succeed on 1D/ 2D uni-column/ 2D multi-column?

label

ForeignKeyField

The Label that this LabelCoder is applied to


5b. FeatureCoder

For multivariate datasets, columns/dtypes may need to be handled differently. So we use column/dtype filters to apply separate transformations. If the first transformation’s filter includes a certain column/dtype, then subsequent filters may not include that column/dtype.


5bi. Methods

└── FeatureCoder.from_feature()

FeatureCoder.from_feature(
    feature_id
    , sklearn_preprocess
    , include
    , dtypes
    , columns
    , verbose
)

Argument

Type

Default

Description

feature_id

int

Required

Points to the Feature.columns to use

sklearn_preprocess

object

Required

An instantiated sklearn.preprocessing class-based encoder - e.g. StandardScaler(), not StandardScaler or scale(). AIQC will automatically correct the attributes of your encoder to smooth out common errors they would cause - for example, preventing sparse SciPy matrix output (which errors during tensor conversion) and unnecessary data copy().

include

bool

True

Whether to include or exclude the dtypes/columns that match the filter. You can create a filter for all columns by setting include=False and then setting both dtypes and columns to None.

dtypes

list(str)

None

The dtypes to filter

columns

list(str)

None

The columns to filter. Errors if any of the columns were already used by dtypes.

verbose

bool

True

If True, messages will be printed about the status of the encoders as they attempt to fit on the filtered columns
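A minimal, hypothetical sketch that scales the float columns first and then, per the include=False note above, sweeps up any leftover columns with an ordinal encoder:

from sklearn.preprocessing import StandardScaler, OrdinalEncoder

FeatureCoder.from_feature(
    feature_id           = feature.id
    , sklearn_preprocess = StandardScaler()
    , dtypes             = ['float64']
)
FeatureCoder.from_feature(
    feature_id           = feature.id
    , sklearn_preprocess = OrdinalEncoder()
    , include            = False
    , dtypes             = None
    , columns            = None
)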


5bii. Attributes

These are the fields in the FeatureCoder table

Attribute

Type

Description

idx

IntegerField

Zero-based auto-incrementer that counts the number of FeatureCoders attached to a Feature.

sklearn_preprocess

PickleField

The instantiated sklearn.preprocessing class that was fit

encoded_column_names

JSONField

After the columns are encoded, what are their names? OHE appends _<category> to the original column names as it expands

matching_columns

JSONField

The columns that matched the filter

leftover_columns

JSONField

The columns that were not included in the filter

leftover_dtypes

JSONField

The dtypes that were not included in the filter

original_filter

JSONField

dict().keys()==['include' ,'dtypes','columns']

encoding_dimension

CharField

Did the encoder succeed on 1D/ 2D uni-column/ 2D multi-column?

only_fit_train

BooleanField

Whether or not the encoder was fit on the training data or the entire dataset

is_categorical

BooleanField

If the encoder is meant for categorical data, and therefore automatically fit on the entire dataset

feature

ForeignKeyField

The Feature that this FeatureCoder is applied to


6. Shape

Changes the shape of data. Only supports Features, not Labels.

Reshaping is applied at the end of Feature.preprocess(), so if the feature data has been altered via time series windowing or One Hot Encoding, then those changes will be reflected in the shape that is fed to the FeatureShaper.

When working with architectures that are highly dimensional, such as convolutional and recurrent networks (Conv1D, Conv2D, Conv3D / ConvLSTM1D, ConvLSTM2D, ConvLSTM3D), you’ll often find yourself needing to reshape data to fit a layer’s required input shape.

  • Reducing unused dimensions - When working with grayscale images (1 channel, 25 rows, 25 columns) it’s better to use Conv1D instead of Conv2D.

  • Adding wrapper dimensions - Perhaps your data is a fit for ConvLSTM1D, but that layer is only supported in the nightly TensorFlow build so you want to add a wrapper dimension in order to use the production-ready ConvLSTM2D.

AIQC favors a “channels_first” (samples, channels, rows, columns) approach as opposed to “channels_last” (samples, rows, columns, channels).

Can’t I just reshape the tensors during the training loop? You could. However, AIQC systematically provides the shape of features and labels to Algorithm.fn_build to make designing the topology easier, so it’s best to get the shape right beforehand. Additionally, if you reshape your data within the training loop, then you may also need to reshape the output of Algorithm.fn_predict so that it is correctly formatted for automatic post-processing. It’s also more computationally efficient to do the reshaping once up front.


The reshape_indices argument is ultimately fed to np.reshape(newshape). We use index n to point to the value at ndarray.shape[n].

Reshaping by Index

Let’s say we have a 4D feature consisting of 3D images (samples * channels * rows * columns). Our problem is that the images are B&W, so we don’t want a color channel because it would add unnecessary dimensionality to our model. So we want to drop the dimension at shape index 1.

reshape_indices = (0,2,3)

Thus we have wrangled ourselves a 3D feature consisting of 2D images (samples * rows * columns).

Reshaping Explicitly

But what if the dimensions we want cannot be expressed by rearranging the existing indices? If you define a number as a str, then that number will be used directly as the value at that position.

So if I wanted to add an extra wrapper dimension to my data to serve as a single color channel, I would simply do:

reshape_indices = (0,'1',1,2)

Then couldn’t I just hardcode my shapes with strings? Yes, but FeatureShaper is applied to all of the splits, which are assumed to have different shapes, which is why we use the indices.

Multiplicative Reshaping

Sometimes you need to stack/nest dimensions. This requires multiplying one shape index by another.

For example, if I have 3 separate hours’ worth of data and I want to treat it as 180 minutes, then I need to go from a shape of (3 hours * 60 minutes) to (180 minutes). Just provide the shape indices that you want to multiply together as a nested tuple, as sketched below.
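(A plausible sketch, assuming a feature of shape (samples, 3, 60): the inner tuple multiplies shape indices 1 and 2 together into a single dimension of 180.)

reshape_indices = (0, (1,2))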

<!> if your model is unsupervised (aka generative or self-supervised), then it must output data in “column (aka width) last” shape. Otherwise, automated column decoding will be applied along the wrong dimension.


6a. Methods

└── FeatureShaper.from_feature()

FeatureShaper.from_feature(
    feature_id
    , reshape_indices
)

Argument

Type

Default

Description

feature_id

int

Required

The Feature.id to use

reshape_indices

tuple(int/str/tuple)

Required

See Strategies.


6b. Attributes

These are the fields of the FeatureShaper table

Attribute

Type

Description

reshape_indices

PickleField

See Reshaping by Index. Pickled because a tuple has no JSON equivalent.

column_position

IntegerField

The shape index used for columns aka width.

feature

ForeignKeyField

The Feature that reshaping is applied to.


7. Window

Window facilitates sliding windows for a time series Feature. It does not apply to Labels. This is used for unsupervised (aka self-supervised) walk-forward forecasting for time series data.

size_window determines how many timepoints are included in a window, and size_shift determines how many timepoints to slide over before defining a new window.

For example, if we want to be able to predict the next 7 days worth of weather using the past 21 days of weather, then our size_window=21 and our size_shift=7.
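That example translates into the following call (assuming feature refers to an existing time series Feature):

Window.from_feature(
    feature_id    = feature.id
    , size_window = 21
    , size_shift  = 7
)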


Challenges

Dealing with stratified windowed data demands a systematic approach.

[Figure: windowDimensions diagram]

Windowing always increases dimensionality

After data is windowed, its dimensionality increases by 1. Why? Well, originally we had a single time series. However, if we window that data, then we have many time series subsets.

[Figure: interpolateWindows diagram]

As the highest dimension, it becomes the “sample”

No matter what dimensionality the original data has, it will be windowed along the first dimension.

This means that the windows now serve as the samples, which is important for stratification. If we have a year’s worth of windows, we don’t want all of our training windows to come from the same season. Therefore, Window must be created prior to Splitset.

Windowing may cause overlap in splits

In addition to increasing the dimensionality of our data, windowing makes it harder to nail down the boundaries of our splits in order to prevent data leakage.

As seen in the diagram above, the timesteps of the train and test splits may overlap. So if we are fitting an interpolater to our training split, the first 3 NaNs would be included, but the last 2 would not.

[Figure: Windows diagram]

Shifted and unshifted windows

In a walk-forward analysis, we learn about the future by looking at the past. So we need 2 sets of windows:

  • Unshifted windows (orange in diagram above): represent the past and serve as the features we learn from

  • Shifted windows (green in diagram above): represent the future and serve as the targets we predict

However, when conducting inference, we are trying to predict the shifted windows not learn from them. So we don’t need to record any shifted windows.


7a. Methods

└── Window.from_feature()

Window.from_feature(
    feature_id
    , size_window
    , size_shift
    , record_shifted
)

Argument

Type

Default

Description

feature_id

int

Required

Feature.id from which you want to derive windows.

size_window

int

Required

The number of timesteps to include in a window.

size_shift

int

Required

The number of timesteps to shift forward.

record_shifted

bool

True

Whether or not we want to keep a shifted set of windows around. During pure inference, this is False.


7b. Attributes

These are the fields of the Window table

Attribute

Type

Description

size_window

IntegerField

Number of timesteps in each window

size_shift

IntegerField

The number of timesteps in the shift forward.

window_count

IntegerField

Not a relationship count! Number of windows in the dataset. This becomes the new samples dimension for stratification.

samples_unshifted

JSONField

Underlying sample indices of each window in the past-shifted windows.

samples_shifted

JSONField

Underlying sample indices of each window in the future-shifted windows.

feature

ForeignKeyField

The Feature that this windowing is applied to


8. Splitset

Used for sample stratification. Reference Stratification section of the Explainer.

Split

Description

train

The samples that the model will be trained upon. Later, we’ll see how we can make cross-folds from our training split. Unsupervised learning will only have a training split.

validation (optional)

The samples used for training evaluation. Ensures that the test set is not revealed to the model during training.

test (optional)

The samples the model has never seen during training. Used to assess how well the model will perform on unobserved, natural data when it is applied in the real world aka how generalizable it is.

Because Splitset groups together all of the data wrangling entities (Features, Label, Folds) it essentially represents a Pipeline, which is why it bears the name Pipeline in the High-Level API.


Cross-Validation

Cross-validation is triggered by fold_count:int during Splitset creation. Reference the scikit-learn documentation to learn more about cross-validation.

[Figure: CrossFoldBins diagram]

Each row in the diagram above is a Fold object.

Each green/blue box represents a bin of stratified samples. During preprocessing and training, we rotate which blue bin serves as the validation samples (fold_validation). The remaining green bins in the row serve as the training samples (folds_train_combined).

Let’s say we defined fold_count=5. What are the implications?

  • Creates 5 Folds related to a Splitset.

  • 5x more preprocessing and caching; each fold_validation is excluded from the fit on folds_train_combined. Fits are saved to the orm.Fold object as opposed to the orm.Feature/Label objects.

  • 5x more models will be trained for each experiment.

  • 5x more evaluation.

Disclaimer about inherent limitations & challenges

Do not use cross-validation unless the distribution of each resulting fold (total sample count divided by fold_count) is representative of your broader sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that the total sample count is evenly divisible by fold_count. Both of these tips help avoid poorly stratified/ undersized folds that seem to perform either unjustifiably well (100% accuracy when only the most common label class is present) or poorly (1 incorrect prediction in a small fold negatively skews an otherwise good model).

If you’ve ever performed cross-validation manually with too few samples, then you’ll know that it’s easy enough to construct the folds, but then it’s a pain to calculate performance metrics (e.g. zero_division, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class-citizens alongside splits. That being said, if you try to do something undersized like multi-label classification using 150 samples then you may run into errors during evaluation.


Samples Cache

Each Splitset has a cache_path attribute, which represents a local directory where preprocessed data is stored during training & evaluation.

The outputs of feature.preprocess() and label.preprocess() are written to this folder prior to training so that:

  1. Each Job does not have to preprocess data from scratch.

  2. The original data does not need to be held in memory between Jobs.

└── aiqc/cache/samples/splitset_uid
    └── <fold_index> | "no_fold"
        └── <split>
            └── label.npy
            └── feature_<i>.npy
  • <fold_index> is a folder for each Fold, since they have different samples, whereas “no_fold” is a single folder for a regular splitset with no Folds; it just keeps the folder depth uniform.

  • <split>: The samples[<split>] of interest: ‘train’, ‘validation’, ‘folds_train_combined’, ‘fold_validation’, ‘test’.

  • feature_<i> accounts for Splitsets with more than 1 Feature.

The Splitset.cache_hot:bool attribute indicates whether or not the cache for that splitset is populated.

Samples are automatically cached during Queue.run_jobs().

See also: orm.splitset.clear_cache() and utils.config.clear_cache_all()


8a. Methods


└── Splitset.make()

Splitset.make(
    feature_ids
    , label_id
    , size_test
    , size_validation
    , bin_count
    , fold_count
    , unsupervised_stratify_col
    , name
    , description
    , predictor_id
)

Argument

Type

Default

Description

feature_ids

list(int)

Required

Multiple Feature.id’s may be included to enable multi-modal (aka mixed data-type) analysis. All of these Features must have the same number of samples.

label_id

int

None

The Label to be used as a target for supervised analysis. Must have the same number of samples as the Features.

size_test

float

None

Percent of samples to be placed into the test split. Must be > 0.0 and < 1.0.

size_validation

float

None

Percent of samples to be placed into the validation split. Must be > 0.0 and < 1.0. If this is not None and used in combination with fold_count, then there will be 4 splits.

bin_count

int

None

For continuous stratification columns, how many bins (aka quantiles) should be used?

fold_count

int

None

The number of cross-validation folds to generate. See Cross-Validation.

unsupervised_stratify_col

str

None

Used during unsupervised analysis. Specify a column from the first Feature in feature_ids to use for stratification. For example, when forecasting, it may make sense to stratify by the day of the year.

name

str

None

Used for versioning a pipeline (collection of inputs, label, and stratification). Two versions cannot have identical attributes.

description

str

None

What is unique about this pipeline?

size_train = 1.00 - (size_test + size_validation); the backend ensures that the sizes sum to 1.00.

How does continuous binning work? Reference the handy Pandas.qcut() and the source code pd.qcut(x=array_to_bin, q=bin_count, labels=False, duplicates='drop') for more detail.
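A minimal usage sketch; the split sizes are arbitrary examples, and the remaining ~66% of samples falls into the training split:

splitset = Splitset.make(
    feature_ids       = [feature.id]
    , label_id        = label.id
    , size_test       = 0.22
    , size_validation = 0.12
)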


└── Splitset.cache_samples()

See Samples Cache section for a description

Splitset.cache_samples(id)

Argument

Type

Default

Description

id

int

Required

The identifier of the Splitset of interest


└── Splitset.clear_cache()

See Samples Cache section for a description. Deletes the entire directory located at Splitset.cache_path.

Splitset.clear_cache(id)

Argument

Type

Default

Description

id

int

Required

The identifier of the Splitset of interest


└── Splitset.fetch_cache()

See Samples Cache section for a description. This fetches a specific file from the cache.

Splitset.fetch_cache(
    id
    , split
    , label_features
    , fold_id
    , library
)

Argument

Type

Default

Description

id

int

Required

The identifier of the Splitset of interest

split

str

Required

The samples[<split>] of interest: ‘train’, ‘validation’, ‘folds_train_combined’, ‘fold_validation’, ‘test’.

label_features

str

Required

Either 'label' or 'features'

fold_id

int

None

The identifier of the Fold of interest, if any

library

str

None

If 'pytorch', it will convert each returned array to FloatTensor()

label_features Value

Returns

'label'

ndarray

'features'

ndarray or list(ndarray)


8b. Attributes

These are the fields of the Splitset table

Attribute

Type

Description

cache_path

CharField

Where the splitset stores its cached samples

cache_hot

BooleanField

If the samples are currently stored in the cache

samples

JSONField

The bins that splits have been stratified into dict(split=[sample_indices])

sizes

JSONField

Human-readable sizes of the splits dict(split=dict(percent=float,count=int))

supervision

CharField

“supervised” if the Splitset has a Label, otherwise “unsupervised”.

has_validation

BooleanField

Logical flag indicating if this Splitset has a validation split.

fold_count

IntegerField

The number of cross-validation Folds that belong to this Splitset

bin_count

IntegerField

The number of bins used to stratify a continuous Label column or unsupervised_stratify_col

unsupervised_stratify_col

CharField

Used during unsupervised analysis. Specify a column from the first Feature in feature_ids to use for stratification. For example, when forecasting, it may make sense to stratify by the day of the year.

key_train

CharField

'train' by default, but 'folds_train_combined' if Splitset has Folds. None for an inference splitset.

key_evaluation

CharField

'test' if neither validation split nor Folds are used. 'validation' if a validation split is used. 'fold_validation' if Splitset has Folds. None for an inference splitset.

key_test

CharField

'test' by default. None for an inference splitset.

version

IntegerField

[TBD]

label

ForeignKeyField

The Label, if any, that supervises this splitset

predictor

DeferredForeignKey

During inference, a new Splitset of samples to be predicted may attach to a Predictor. The samples dict will bear the key of Predictor.Splitset.count() e.g. 'infer_0'.

These are the fields of the Fold table

Attribute

Type

Description

idx

IntegerField

Zero-based auto-incrementer that counts the Folds

samples

JSONField

Contains the sample indices of the training folds and left-out validation fold, as well as any validation and test splits defined in the regular Splitset.samples. dict().keys()==['folds_train_combined', 'fold_validation', 'validation', 'test']

fitted_labelcoder

PickleField

When a LabelCoder fits an sklearn preprocessor, the fitted object is saved here for downstream inverse_transform’ing

fitted_featurecoders

PickleField

When FeatureCoders fit an sklearn preprocessor, the fitted objects are saved here for downstream inverse_transform’ing

splitset

ForeignKeyField

The Splitset that this Fold belongs to


9. Algorithm

Now that our data has been prepared, we transition to the 2nd half of the ORM where the focus is the logic that will be applied to that data.

The Algorithm contains all of the components needed to construct, train, and use our model.

Reference the tutorials for examples of how Algorithms are defined.


PyTorch Fit

Provides an abstraction that eliminates the boilerplate code normally required to train and evaluate a PyTorch model.

  • Before training - it shuffles samples, batches samples, and then shuffles batches.

  • During training - it calculates batch loss, epoch loss, and epoch history metrics.

  • After training - it calculates metrics for each split.

model, history = utils.pytorch.fit(
    # These arguments come directly from `fn_train`
    model
    , loser
    , optimizer

    , train_features
    , train_label
    , eval_features
    , eval_label

    # These arguments are user-defined
    , epochs
    , batch_size
    , enforce_sameSize
    , allow_singleSample
    , metrics
)

User-Defined Arguments

Type

Default

Description

epochs

int

30

The number of times to loop over the features

batch_size

int

5

Divides features and labels into chunks to be trained upon

enforce_sameSize

bool

True

If True, drops any batch where len(batch) != batch_size

allow_singleSample

bool

False

If False, drops any batch where len(batch) == 1

metrics

list(torchmetrics.metric())

None

List of instantiated torchmetrics classes e.g. Accuracy


History Metrics

The goal of the Predictor.history object is to record the training and evaluation metrics at the end of each epoch so that they can be interpreted in the learning curve plots. Reference the evaluation section.

  • Keras: any metrics=[] specified are automatically added to the History callback object.

  • PyTorch: if you use the fit function seen above, then you don’t need to worry about this. Otherwise, users are responsible for calculating their own metrics (we recommend the torchmetrics package) and placing them into a history dictionary that mirrors the schema of the Keras history object. Reference the torch examples.

The schema of the history dictionary is as follows: dict(<metric>:ndarray, val_<metric>=ndarray). For example, if you wanted to record the history of the ‘loss’ and ‘accuracy’ metrics manually for PyTorch, you would construct it like so:

history = dict(
    loss           = ndarray
    , val_loss     = ndarray

    , accuracy     = ndarray
    , val_accuracy = ndarray
)

TensorFlow Early Stopping

Early stopping isn’t just about efficiency in reducing the number of epochs. If you’ve specified 300 epochs, there’s a chance your model catches on to the underlying patterns early, say around 75-125 epochs. At this point, there’s also a good chance that what it learns in the remaining epochs will cause it to overfit on patterns that are specific to the training data, and thereby lose its simplicity/ generalizability.

The metric=val_* prefix refers to the evaluation samples.

Remember, regression does not have accuracy metrics.

TrainingCallback.MetricCutoff is a custom class we wrote to make early stopping easier, so you won’t find information about it in the official Keras documentation.

Placed within fn_train:

from aiqc.utils.tensorflow import TrainingCallback

# Define one or more metrics to monitor.
metrics_cutoffs = [
    dict(metric='accuracy', cutoff=0.96, above_or_below='above'),
    dict(metric='loss', cutoff=0.1, above_or_below='below'),
    dict(metric='val_accuracy', cutoff=0.96, above_or_below='above'),
    dict(metric='val_loss', cutoff=0.1, above_or_below='below')
]
cutoffs = TrainingCallback.MetricCutoff(metrics_cutoffs)

# Pass it into keras callbacks
model.fit(
    # other fit args
    callbacks = [cutoffs]
)

Tip: try using a val_accuracy threshold by itself for best results


9a. Methods

Assemble an architecture consisting of components defined in functions.

The **hp kwargs are common to every Algorithm function except fn_predict. They are used to systematically pass a dictionary of hyperparameters into these functions. See Hyperparameters.


└── Algorithm.make()

Algorithm.make(
    library
    , analysis_type
    , fn_build
    , fn_train
    , fn_predict
    , fn_lose
    , fn_optimize
)

Argument

Type

Default

Description

library

str

Required

‘keras’ or ‘pytorch’ depending on the type of model defined in fn_build

analysis_type

str

Required

‘classification_binary’, ‘classification_multi’, or ‘regression’. Unsupervised/ self-supervised falls under regression. Used to determine which performance metrics are run. Errors if it is incompatible with the Label provided: e.g. classification_binary is incompatible with an np.floating Label.column.

fn_build

func

Required

See below. Build the model architecture.

fn_train

func

Required

See below. Train the model.

fn_predict

func

None

See below. Run the model.

fn_lose

func

None

See below. Calculate loss.

fn_optimize

func

None

See below. Optimization strategy.

Required Functions

def fn_build(
    features_shape:tuple
    , label_shape:tuple
    , **hp:dict
):
    # Define tf/torch model
    return model

The *_shape arguments contain the shape of a single sample, as opposed to a batch or entire dataset. features_shape is plural because it may contain the shape of multiple features. However, if only 1 feature was used then it will not be inside a list.
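Here is a minimal, hypothetical Keras sketch of such a function (not the canonical tutorial example). It assumes a classification model with a one-hot encoded label, and hp['neuron_count'] assumes a hyperparameter like the one shown in the Hyperparameters section:

def fn_build(features_shape:tuple, label_shape:tuple, **hp:dict):
    from tensorflow.keras import layers, models
    # the single-sample shape goes straight into the Input layer
    model = models.Sequential([
        layers.Input(shape=features_shape)
        , layers.Dense(hp['neuron_count'], activation='relu')
        , layers.Dense(label_shape[-1], activation='softmax')
    ])
    return model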

def fn_train(
    model:object
    , loser:object
    , optimizer:object
    , train_features:ndarray
    , train_label:ndarray
    , eval_features:ndarray
    , eval_label:ndarray
    , **hp:dict
):
    # Define training/ eval loop.
    # See `utils.pytorch.fit`

    # if tensorflow
    return model
    # if torch
    # See `utils.pytorch.fit` and history metrics below
    return history:dict, model

Optional Functions

Where are the defaults for optional functions defined? See utils.tensorflow and utils.pytorch for examples of loss, optimization, and prediction.

def fn_predict(model:object, features:ndarray):
    # if classify. predictions always ordinal, never OHE.
    return prediction, probabilities  # both as ndarray
    # if regression
    return prediction  # ndarray

def fn_lose(**hp:dict):
    # Define tf/torch loss function
    return loser

def fn_optimize(**hp:dict):
    # Define tf/torch optimizer
    return optimizer

└── Algorithm.get_code()

Returns the strings of the Algorithm functions:

dict(
    fn_build      = aiqc.utils.dill.reveal_code(Algorithm.fn_build)
    , fn_lose     = aiqc.utils.dill.reveal_code(Algorithm.fn_lose)
    , fn_optimize = aiqc.utils.dill.reveal_code(Algorithm.fn_optimize)
    , fn_train    = aiqc.utils.dill.reveal_code(Algorithm.fn_train)
    , fn_predict  = aiqc.utils.dill.reveal_code(Algorithm.fn_predict)
)

9b. Attributes

These are the fields of the Algorithm table

Attribute

Type

library

CharField

analysis_type

CharField

fn_build

BlobField

fn_lose

BlobField

fn_optimize

BlobField

fn_train

BlobField

fn_predict

BlobField

See 9. Algorithm for descriptions


10. Hyperparameters

As mentioned in Algorithm, the **hp argument is used to systematically pass hyperparameters into the Algorithm functions.

For example, given the following set of hyperparameters:

hyperparameters = dict(
    epoch_count     = [30]
    , learning_rate = [0.01]
    , neuron_count  = [24, 48]
)

A grid search would produce these 2 unique Hyperparamcombos:

[
    dict(
        epoch_count     = 30
        , learning_rate = 0.01
        , neuron_count  = 24 #<-- varies
    )

    , dict(
        epoch_count     = 30
        , learning_rate = 0.01
        , neuron_count  = 48 #<-- varies
    )
]

We access the current value in our model functions like so: hp['neuron_count'].


10a. Methods

└── Hyperparamset.from_algorithm()

Hyperparamset.from_algorithm(
    algorithm_id
    , hyperparameters
    , search_count
    , search_percent
)

Argument

Type

Default

Description

algorithm_id

int

Required

The Algorithm.id whose functions these hyperparameters will be used with

hyperparameters

dict(str:list)

Required

See example in Hyperparameters. Must be JSON compatible.

search_count

int

None

Randomly select n hyperparameter combinations to test. Must be greater than 1. There is no upper limit; if n exceeds the number of available combinations, all combinations are tested.

search_percent

float

None

Given all of the available hyperparameter combinations, search x%. Between 0.0:1.0. Cannot be used if search_count is used.

“Bayesian TPE (Tree-structured Parzen Estimator)” via hyperopt has been suggested as a future area to explore, but it does not exist right now.
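A minimal usage sketch, reusing the hyperparameters dict shown above and assuming an algorithm created via Algorithm.make():

Hyperparamset.from_algorithm(
    algorithm_id      = algorithm.id
    , hyperparameters = hyperparameters
)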


10b. Attributes

These are the fields of the Hyperparamset table

Attribute

Type

Description

hyperparameters

JSONField

The original dict(param=list) of all possible values

search_count

IntegerField

The number of randomly selected combinations of hyperparameters

search_percent

FloatField

The percent of randomly selected combinations of hyperparameters

algorithm

ForeignKeyField

The Algorithm.id whose functions these hyperparameters will be used with

These are the fields of the Hyperparamcombo table

Attribute

Type

Description

idx

IntegerField

Zero-based counter of the Hyperparamcombos

hyperparameters

JSONField

The specific combination of hyperparameters that will be fed to the Algorithm functions

hyperparamset

ForeignKeyField

The Hyperparamset that this combination of hyperparameters was derived from


11. Queue

The Queue is the central object of the “logic side” of the ORM. It ties together everything we need to run training Job’s for hyperparameter tuning. That’s why it is referred to as an Experiment in the High-Level API.


11a. Methods


└── Queue.from_algorithm()

Queue.from_algorithm(
    algorithm_id
    , splitset_id
    , repeat_count
    , permute_count
    , hyperparamset_id
    , description
)

Argument

Type

Default

Description

algorithm_id

int

Required

The Algorithm.id whose functions will be used during training and evaluation

splitset_id

int

Required

The Splitset.id whose samples will be used during training and evaluation

repeat_count

int

1

Each Job will be repeated n times. Designed for use with random (aka non-deterministic) weight initialization. This is why 1 Job has many Predictors.

permute_count

int

3

Triggers a shuffled permutation of each training data column to determine which columns have the most impact on loss in comparison to the baseline training loss: [training loss - (median loss of <n> permutations)]. The count determines how many times the shuffled permutation is run before taking the median loss. Permutation does not get run on Feature.dataset.typ=='image'. Set this to 0 if you do not care about feature importance. Retroactive feature importance is possible via Prediction.calcFeatureImportance().

hyperparamset_id

int

None

The Hyperparamset.id whose samples will be used during training and evaluation. This needs to be specified because an Algorithm can have many Hyperparamsets.

description

str

None

What is unique about this experiment?
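A minimal usage sketch, assuming the algorithm, splitset, and hyperparamset objects created earlier:

queue = Queue.from_algorithm(
    algorithm_id       = algorithm.id
    , splitset_id      = splitset.id
    , hyperparamset_id = hyperparamset.id
    , repeat_count     = 1
)
queue.run_jobs()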


└── Queue.run_jobs()

Jobs are simply run in a loop on the main process.

Stop the queue with a keyboard interrupt e.g. ctrl+Z/D/C in the Python shell or i,i in Jupyter. It is listening for interrupts, so it will usually stop gracefully. Even if it errors during the interrupt, it’s not a problem. You can rerun the queue and it will resume on the same job it was running previously.

Queue.run_jobs(id)

Argument

Type

Default

Description

id

int

Required

The identifier of the Queue of interest


└── Queue.plot_performance()

Plots every model trained by the queue for comparison.

  • X axis = loss

  • Y axis = score

Queue.plot_performance(
    id
    , call_display
    , max_loss
    , min_score
    , score_type
    , height
)

Argument

Type

Default

Description

id

int

Required

The identifier of the Queue of interest

call_display

bool

True

If True, calls display() on plot. If False, returns Plotly figure object.

max_loss

float

None

Models with any split with higher loss than this threshold will not be plotted.

min_score

float

None

Models with any split with a lower score than this threshold will not be plotted.

score_type

str

None

Defaults to "accuracy" for classification analysis, and "r2" for regression analysis. See aiqc.utils.meter for available metrics.

height

int

None

Default height is 560 but you can force it to be taller
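
For example, to hide poorly performing models and capture the figure rather than displaying it (threshold values are illustrative):

fig = queue.plot_performance(
    call_display = False        # return the Plotly figure object
    , max_loss   = 1.0          # hide models whose loss exceeds this on any split
    , min_score  = 0.65         # hide models scoring below this on any split
    , score_type = "accuracy"   # the classification default
    , height     = 600
)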


└── Queue.metrics_df()

Displays metrics for every split/fold of every model.

Queue.metrics_df(
    id
    , selected_metrics
    , sort_by
    , ascending
)

Argument

Type

Default

Description

id

int

Required

The identifier of the Queue of interest

ascending

bool

False

Descending if False.

selected_metrics

list(str)

None

If you get overwhelmed by the variety of metrics returned, then you can include the ones you want selectively by name.

sort_by

str

None

You can sort the dataframe by any column name.
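
For instance, to narrow and sort the results (the metric names below are illustrative; see aiqc.utils.meter for what is available):

df = queue.metrics_df(
    selected_metrics = ['accuracy', 'loss']
    , sort_by        = 'accuracy'
    , ascending      = False
)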


└── Queue.metricsAggregate_df()

Aggregate statistics about every metric of every model trained in the Queue – displays the average, median, standard deviation, minimum, and maximum across all splits/folds.

Queue.metricsAggregate_df(
    id
    , ascending        = False
    , selected_metrics = None
    , selected_stats   = None
    , sort_by          = None
)

Argument

Type

Default

Description

id

int

Required

The identifier of the Queue of interest

ascending

bool

False

Descending if False.

selected_metrics

list(str)

None

If you get overwhelmed by the variety of metrics returned, then you can include the ones you want selectively by name.

sort_by

str

None

You can sort the dataframe by any column name.
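
Similarly, the aggregate view can be narrowed to specific metrics and statistics (the metric and statistic names below are illustrative):

agg_df = queue.metricsAggregate_df(
    selected_metrics = ['accuracy']
    , selected_stats = ['median', 'std']
    , sort_by        = 'accuracy'
)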


11b. Attributes

These are the fields of the Queue table

Attribute

Type

Description

repeat_count

IntegerField

The number of times to repeat each Job.

total_runs

IntegerField

The total number of models to be trained as a result of this queue being created.

permute_count

IntegerField

Number of permutations to run on each column before taking the median impact on loss. 0 means permutation was skipped.

runs_completed

IntegerField

Counts the runs that have actually finished

algorithm

ForeignKeyField

The model functions to use during training and evaluation

splitset

ForeignKeyField

The pipeline of samples to feed to the models during training and evaluation

hyperparamset

ForeignKeyField

Contains all of the hyperparameters to be used for the Jobs


12. Job

The Queue spawns Jobs. A Job is like a spec/ manifest for training a model. It may be repeated.

# jobs = Hyperparamset.hyperparamcombos.count() * Queue.repeat_count * Splitset.folds.count()
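
As a quick worked example of that formula, with illustrative counts:

# 4 hyperparamcombos * 2 repeats * 5 folds = 40 training runs
hyperparamcombo_count = 4
repeat_count          = 2
fold_count            = 5
total_runs = hyperparamcombo_count * repeat_count * fold_count   # 40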


12a. Methods

There are no noteworthy, user-facing methods for the Job class


12b. Attributes

These are the fields of the Job table

Attribute

Type

Description

repeat_count

IntegerField

The number of times this Job is to be repeated

queue

ForeignKeyField

The Queue this Job was created by

hyperparamcombo

ForeignKeyField

The parameters this Job uses

fold

ForeignKeyField

The cross-validation samples that this Job uses


13. Predictor

As the Jobs finish, they save the model and history metrics within a Predictor object.


13a. Methods

└── Predictor.get_model()

Predictor.get_model(id)

Handles fetching and initializing the model (and PyTorch optimizer) from Predictor.model_file and Predictor.input_shapes

Argument

Type

Default

Description

id

int

None

The identifier of the Predictor of interest
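
For example, to rehydrate a trained model from a finished run (the id value is illustrative):

predictor = Predictor.get_by_id(1)
model     = predictor.get_model()   # id is passed automatically because the object is instantiated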


└── Predictor.get_hyperparameters()

This is a shortcut to fetch the hyperparameters used to train this specific model. as_pandas toggles between dict() and DataFrame.

Predictor.get_hyperparameters(id, as_pandas)

Argument

Type

Default

Description

id

int

None

The identifier of the Predictor of interest

as_pandas

bool

True

If True, returns a DataFrame. If False, returns a list(dict(param_name=[values]))
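
For example:

hyperparams_df  = predictor.get_hyperparameters(as_pandas=True)    # DataFrame
hyperparams_raw = predictor.get_hyperparameters(as_pandas=False)   # plain Python structure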


└── Predictor.plot_learning_curve()

A learning curve will be generated for each train-evaluation pair of metrics in the Predictor.history dictionary

Predictor.plot_learning_curve(
    id
    , skip_head
    , call_display
)

Argument

Type

Default

Description

id

int

None

The identifier of the Predictor of interest

skip_head

bool

True

Skips displaying the first 15% of epochs. Loss values in the first few epochs can often be extremely high before they plummet and become more gradual. This really stretches out the graph and makes it hard to see if the evaluation set is diverging or not.

call_display

bool

True

If True, calls display() on plot. If False, returns Plotly figure object(s).
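
For example, to include the earliest epochs and capture the figure object(s) instead of displaying them:

figs = predictor.plot_learning_curve(
    skip_head      = False
    , call_display = False
)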


13b. Attributes

These are the fields of the Predictor table

Attribute

Type

Description

repeat_index

IntegerField

Counts how many predictors have been trained using a Job spec

time_started

DateTimeField

When the Job started

time_succeeded

DateTimeField

When the Job finished

time_duration

IntegerField

Total time in seconds it took to complete the Job

model_file

BlobField

Contains the trained model serialized with dill (an advanced pickle). See Predictor.get_model() for exporting.

features_shapes

PickleField

tuple or list of tuples containing the np.shape(s) of feature(s)

label_shape

PickleField

tuple containing np.shape of a single sample’s label

history

JSONField

Contains the training history loss/metrics

is_starred

BooleanField

Flag denoting if this model is of interest

job

ForeignKeyField

The Job that trained this Predictor


14. Prediction

When data is fed through a Predictor, you get a Prediction. During training, Predictions are automatically generated for every split/fold in the Queue.splitset.


14a. Methods

└── Prediction.calc_featureImportance()

This method is provided for conducting feature importance after training. It was decoupled from training for the following reasons:

  • Permutation is computationally expensive, especially for many-columned datasets.

  • We typically only care about the feature importance of our best models, not of every model trained.

What data is used when calculating feature importance? All splits/folds are concatenated back into a single dataset. This assumes that all splits/folds are relatively equally balanced with respect to their label values. For example, if you have unbalanced multi-labels (55:35:10 distribution of classes) then a given feature’s importance may be biased based on how well it predicts the larger class. For binary classification scenarios, this should not matter as much since predicting one class also helps in predicting the opposite class.

Upon completion it will update the Prediction.feature_importance and Prediction.permute_count attributes.

Prediction.calc_featureImportance(id, permute_count)

Argument

Type

Default

Description

id

int

None

The identifier of the Prediction of interest

permute_count

int

Required

The count determines how many times the shuffled permutation is run before taking the median loss: [training loss - (median loss of <n> permutations)]. Permutation *skips* Feature.dataset.typ=='image'.
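
For example, to compute feature importance retroactively for a prediction whose Queue skipped permutation (the id and permute_count values are illustrative):

prediction = Prediction.get_by_id(1)
prediction.calc_featureImportance(permute_count=5)   # populates feature_importance & permute_count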


└── Prediction.importance_df()

Returns a dataframe of feature columns ranked by their median importance

Prediction.importance_df(id, top_n, feature_id)

Argument

Type

Default

Description

id

int

None

The identifier of the Prediction of interest

top_n

int

None

The number of columns to return

feature_id

int

None

Limit returned columns to a specific feature identifier
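
For example:

importance = prediction.importance_df(top_n=10)   # top 10 columns ranked by median importance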


└── Prediction.plot_feature_importance()

Plots Prediction.feature_importance if Queue.permute_count>0 or if Prediction.calc_featureImportance() was run after the fact.

Prediction.plot_feature_importance(
    id
    , call_display
    , top_n
    , boxpoints
    , height
    , margin_left
)

Argument

Type

Default

Description

id

int

None

The identifier of the Prediction of interest

call_display

bool

True

If True, calls display() on plot. If False, returns Plotly figure object.

top_n

int

10

The number of features to display. If greater than the actual number of features, it just returns all features.

boxpoints

object

False

Determines how whiskers, outliers, and points are shown. Options are: False, 'all', 'suspectedoutliers', and 'outliers'. Reference Plotly Box Plots.

height

int

None

If None, dynamically makes the plot taller to fit all of the columns

margin_left

int

None

If None, dynamically widens the y axis margin to fit the longest column name
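
For example, to capture the figure and show outlier points on each box (values are illustrative):

fig = prediction.plot_feature_importance(
    call_display = False
    , top_n      = 15
    , boxpoints  = 'outliers'
)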


└── Prediction.plot_roc_curve()

Receiver operating characteristic (ROC) curve for classification metrics.

Prediction.plot_roc_curve(id, call_display)

Argument

Type

Default

Description

id

int

None

The identifier of the Prediction of interest

call_display

bool

True

If True, calls display() on plot. If False, returns Plotly figure object.


└── Prediction.plot_precision_recall()

Precision/recall curve for classification metrics.

Prediction.plot_precision_recall(id, call_display)

Argument

Type

Default

Description

id

int

None

The identifier of the Prediction of interest

call_display

bool

True

If True, calls display() on plot. If False, returns Plotly figure object.


└── Prediction.plot_confusion_matrix()

Confusion matrices for classification metrics.

Prediction.plot_confusion_matrix(id, call_display)

Argument

Type

Default

Description

id

int

None

The identifier of the Prediction of interest

call_display

bool

True

If True, calls display() on plot. If False, returns Plotly figure object(s).
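
For a classification analysis, the three plots above can all be captured as figure objects in the same way:

roc_fig = prediction.plot_roc_curve(call_display=False)
pr_fig  = prediction.plot_precision_recall(call_display=False)
cm_figs = prediction.plot_confusion_matrix(call_display=False)   # may return one figure per split/fold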


└── Prediction.plot_confidence()

Plot the binary/multi-label classification probabilities for a single sample.

Prediction.plot_confidence(
    id
    , prediction_index
    , height
    , call_display
    , split_name
)

Argument

Type

Default

Description

id

int

None

The identifier of the Prediction of interest

prediction_index

int

0

The index of the sample of interest

height

int

175

Force the height of the chart.

call_display

bool

True

If True, calls display() on plot. If False, returns Plotly figure object.

split_name

str

None

The name of the split/ fold containing the sample of interest
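
For example, to inspect the class probabilities of the first sample of a given split (the split name is illustrative):

prediction.plot_confidence(
    prediction_index = 0
    , split_name     = 'test'
    , height         = 200
)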


14b. Attributes

Attribute

Type

Description

predictions

PickleField

Decoded predictions ndarray per split/ fold/ inference

permute_count

IntegerField

The number of times each column was permuted when calculating feature importance

feature_importance

JSONField

Importance of each column. Only calculated for the training split/fold. Schema: dict(str(feature.id)=dict(median=float, loss_impacts=list(float)))

probabilities

PickleField

Prediction probabilities per split/ fold. None for regression. Schema: dict(split=ndarray)

metrics

PickleField

Statistics for each split/fold that vary based on the analysis_type.

metrics_aggregate

PickleField

Contains the average, median, standard deviation, minimum, and maximum for each statistic across all splits/folds.

plot_data

PickleField

Metrics reformatted for plot functions.
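
Per the schema above, the probabilities attribute is keyed by split; a minimal access sketch, assuming 'test' is one of the split names:

test_probs = prediction.probabilities['test']   # ndarray of class probabilities for the test split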


Evaluation

To see the performance metric visualizations of Queue, Predictor, and Prediction in action, reference the Evaluation documentation.