Datasets

Overview

This notebook contains information about the prepackaged datasets that are referenced throughout the documentation. These datasets are either:

  • Stored locally within the AIQC package itself.

  • Stored remotely in the AIQC repository and fetched on demand.
Prerequisites

If you’ve already completed the instructions on the Installation page, then let’s get started.

[2]:
from aiqc import datum

The module for interacting with the datasets is called datum so that it does not overlap with commonly used names like ‘data’ or ‘datasets’.


Prepackaged Local Data

The list_datums() method provides metadata about each file that is included in the package, so that you can find one that suits your purposes.

By default it returns a Pandas DataFrame, but you can call list_datums(format='list') to change that.

[3]:
datum.list_datums()
[3]:
name dataset_type analysis_type label label_classes features samples description location
0 exoplanets.parquet tabular regression SurfaceTempK N/A 8 433 Predict temperature of exoplanet. local
1 heart_failure.parquet tabular regression died 2 12 299 Biometrics to predict loss of life. local
2 iris.tsv tabular classification_multi species 3 4 150 3 species of flowers. Only 150 rows, so cross-folds do not represent the population. local
3 sonar.csv tabular classification_binary object 2 60 208 Detecting either a rock "R" or mine "M". Each feature is a sensor reading. local
4 houses.csv tabular regression price N/A 12 506 Predict the price of the house. local
5 iris_noHeaders.csv tabular classification_multi species 3 4 150 For testing; no column names. local
6 iris_10x.tsv tabular classification_multi species 3 4 1500 For testing; duplicated 10x so cross-folds represent population. local
7 brain_tumor.csv image classification_binary status 2 1 color x 160 tall x 120 wide 80 csv acts as label and manifest of image urls. remote
8 galaxy_morphology.tsv image classification_binary morphology 2 3 colors x 240 tall x 300 wide 40 tsv acts as label and manifest of image urls. remote
9 spam.csv text classification_binary v1 2 text data 5572 Collection of spam/ham (not spam) messages. local
10 epilepsy.parquet sequence classification_binary seizure 2 1 x 178 readings 1000 <https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition> Storing the data tall so that it compresses better.`label_df = df[['seizure']]; sensor_arr3D = df.drop(columns=['seizure']).to_numpy().reshape(1000,178,1)` local
11 delhi_climate.parquet sequence forecasting N/A N/A 3 1575 <https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data>. Both train and test (pruned last day from train). 'pressure' and 'wind' columns seem to have outliers. Converted 'date' column to 'day_of_year.' local
12 liberty_moon.csv image forecasting N/A N/A 1 color x 50 tall x 60 wide 15 moon glides from top left to bottom right remote
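The reshape snippet embedded in the epilepsy.parquet description above can be illustrated with synthetic data. This sketch mirrors the column names and shapes from the description, but the values are random and purely for demonstration:

```python
import numpy as np
import pandas as pd

# Stand-in for the tall epilepsy dataframe:
# 1000 samples, 178 sensor readings per sample, plus a binary 'seizure' label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 178)))
df['seizure'] = rng.integers(0, 2, size=1000)

# Split the label from the features, then reshape the features
# into a 3D array of (samples, timesteps, channels).
label_df = df[['seizure']]
sensor_arr3D = df.drop(columns=['seizure']).to_numpy().reshape(1000, 178, 1)

print(sensor_arr3D.shape)  # (1000, 178, 1)
```

Storing the data tall (one reading per column) keeps the file compact; the reshape restores the per-sample time dimension expected by sequence models.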

The location where Python packages are installed varies from system to system, so pkg_resources is used to find the location of these files dynamically.

Using the value of the name column, you can fetch that file via to_pandas().

[4]:
df = datum.to_pandas(name='houses.csv')
df.head()
[4]:
crim zn indus chas nox rm age dis rad tax ptratio lstat price
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
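Since houses.csv is listed as a regression dataset with price as its label, a typical next step is to separate the label column from the feature columns. A minimal sketch, using a toy dataframe with a subset of the columns shown above (values copied from the preview) so it runs standalone:

```python
import pandas as pd

# Toy rows mirroring a few of the houses.csv columns previewed above.
df = pd.DataFrame({
    'crim':  [0.00632, 0.02731],
    'rm':    [6.575, 6.421],
    'price': [24.0, 21.6],
})

# Separate the label ('price') from the feature columns.
label = df['price']
features = df.drop(columns=['price'])

print(features.columns.tolist())  # ['crim', 'rm']
```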

Alternatively, if you prefer to work directly with the file itself, then you can obtain its location via the get_path() method.

[5]:
datum.get_path('houses.csv')
[5]:
'/Users/layne/Desktop/AIQC/aiqc/data/houses.csv'

Remote Data

To avoid bloating the package, the larger dummy datasets are not included in it; these kinds of datasets consist of many large files.

They are located within the repository at https://github.com/aiqc/aiqc/tree/main/remote_datum.

If you want to fetch them on your own, you can do so by appending ?raw=true to the end of an individual file’s URL.
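Building the raw download link is simple string concatenation. A sketch, using an illustrative file path within the repository:

```python
# Repository URL of an individual file (the path here is illustrative).
file_url = (
    "https://github.com/aiqc/aiqc/blob/main/"
    "remote_datum/image/brain_tumor/brain_tumor.csv"
)

# Appending '?raw=true' tells GitHub to serve the raw file contents
# instead of the HTML page that renders it.
raw_url = file_url + "?raw=true"
print(raw_url)
```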

Otherwise, their locations are hardcoded into the datum methods.

We’ll use the brain_tumor.csv file as an example.

[6]:
df = datum.to_pandas(name='brain_tumor.csv')
df.head()
[6]:
status size count symmetry url
0 0 0 0 NaN https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg
1 0 0 0 NaN https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_1.jpg
2 0 0 0 NaN https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_2.jpg
3 0 0 0 NaN https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_3.jpg
4 0 0 0 NaN https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_4.jpg

This file acts as a manifest for multi-modal datasets in that each row of this CSV represents a sample:

  • The 'status' column of this file serves as the Label of that sample. We’ll construct a Dataset.Tabular from this.

  • Meanwhile, the 'url' column acts as a manifest in that it contains the URL of the image file for that sample. We’ll construct a Dataset.Image from this.
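That label/manifest split can be sketched with a toy manifest (two rows standing in for the real CSV, with URLs taken from the preview above):

```python
import pandas as pd

# Toy manifest mirroring the brain_tumor.csv preview.
df = pd.DataFrame({
    'status': [0, 0],
    'url': [
        'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg',
        'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_1.jpg',
    ],
})

# The 'status' column becomes the label dataframe;
# the 'url' column lists the image file for each sample.
label_df = df[['status']]
img_urls = df['url'].tolist()

print(len(img_urls))  # 2
```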

[7]:
img_urls = datum.get_remote_urls('brain_tumor.csv')
img_urls[:3]
[7]:
['https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg',
 'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_1.jpg',
 'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_2.jpg']

At this point, you can use either the high- or low-level AIQC APIs [e.g. aiqc.Dataset.Image.from_urls()] to ingest that data and work with it as normal.


Alternative Sources

If the datasets described above do not satisfy your use case, then have a look at the following repositories: