Datasets

Overview

This notebook describes the prepackaged datasets that are referenced throughout the documentation. These datasets are either bundled locally with the package or hosted remotely; both kinds are covered in the sections below.

Prerequisites

If you’ve already completed the instructions on the Installation page, then let’s get started.

[2]:

from aiqc import datum

The module for interacting with the datasets is called datum so that it does not overlap with commonly used names like ‘data’ or ‘datasets’.

Prepackaged Local Data

The list_datums() method provides metadata about each file that is included in the package, so that you can find one that suits your purposes.

By default it returns a Pandas DataFrame, but you can call list_datums(format='list') to get a list instead.

[3]:

datum.list_datums()

[3]:

	name	dataset_type	analysis_type	label	label_classes	features	samples	description	location
0	exoplanets.parquet	tabular	regression	SurfaceTempK	N/A	8	433	Predict temperature of exoplanet.	local
1	heart_failure.parquet	tabular	regression	died	2	12	299	Biometrics to predict loss of life.	local
2	iris.tsv	tabular	classification_multi	species	3	4	150	3 species of flowers. Only 150 rows, so cross-folds not represent population.	local
3	sonar.csv	tabular	classification_binary	object	2	60	208	Detecting either a rock "R" or mine "M". Each feature is a sensor reading.	local
4	houses.csv	tabular	regression	price	N/A	12	506	Predict the price of the house.	local
5	iris_noHeaders.csv	tabular	classification multi	species	3	4	150	For testing; no column names.	local
6	iris_10x.tsv	tabular	classification multi	species	3	4	1500	For testing; duplicated 10x so cross-folds represent population.	local
7	brain_tumor.csv	image	classification_binary	status	2	1 color x 160 tall x 120 wide	80	csv acts as label and manifest of image urls.	remote
8	galaxy_morphology.tsv	image	classification_binary	morphology	2	3 colors x 240 tall x 300 wide	40	tsv acts as label and manifest of image urls.	remote
9	spam.csv	text	classification_binary	v1	2	text data	5572	collection of spam/ ham (not spam) messages	local
10	epilepsy.parquet	sequence	classification_binary	seizure	2	1 x 178 readings	1000	<https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition> Storing the data tall so that it compresses better.`label_df = df[['seizure']]; sensor_arr3D = df.drop(columns=['seizure']).to_numpy().reshape(1000,178,1)`	local
11	delhi_climate.parquet	sequence	forecasting	N/A	N/A	3	1575	<https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data>. Both train and test (pruned last day from train). 'pressure' and 'wind' columns seem to have outliers. Converted 'date' column to 'day_of_year.'	local
12	liberty_moon.csv	image	forecasting	N/A	N/A	1 color x 50 tall x 60 wide	15	moon glides from top left to bottom right	remote
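Once you have the metadata, picking a dataset is a matter of filtering on columns such as analysis_type and location. The following is a minimal sketch that uses plain dictionaries copied from a few rows of the table above. The find_datums() helper is hypothetical, and the assumption that the metadata can be treated as a list of dictionaries is ours, not something the library guarantees.

```python
# A few rows of the metadata table, written out by hand as plain dicts.
# Treating the metadata as a list of dicts is an assumption made for this sketch.
datums = [
    {"name": "exoplanets.parquet", "dataset_type": "tabular",
     "analysis_type": "regression", "samples": 433, "location": "local"},
    {"name": "iris.tsv", "dataset_type": "tabular",
     "analysis_type": "classification_multi", "samples": 150, "location": "local"},
    {"name": "sonar.csv", "dataset_type": "tabular",
     "analysis_type": "classification_binary", "samples": 208, "location": "local"},
    {"name": "brain_tumor.csv", "dataset_type": "image",
     "analysis_type": "classification_binary", "samples": 80, "location": "remote"},
]


def find_datums(rows, **criteria):
    """Hypothetical helper: names of rows whose fields match every criterion."""
    return [
        row["name"]
        for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


# Binary classification datasets that ship inside the package.
local_binary = find_datums(
    datums, analysis_type="classification_binary", location="local"
)
print(local_binary)  # ['sonar.csv']
```

With the default DataFrame output, the same selection would normally be written as a boolean mask on the corresponding columns.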

Because the location where Python packages are installed varies from system to system, pkg_resources is used to find the location of these files dynamically.
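To illustrate the idea of resolving a bundled file relative to wherever a package happens to be installed, here is a minimal sketch. It does not show AIQC's own code: it swaps in importlib.resources from the standard library (Python 3.9+) instead of pkg_resources, and the package name demo_datum_pkg, the data folder layout, and the get_data_path() helper are all hypothetical.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Build a throwaway package (hypothetical name) that bundles one data file,
# standing in for a library installed somewhere on the system.
tmp_root = Path(tempfile.mkdtemp())
pkg_dir = tmp_root / "demo_datum_pkg"
(pkg_dir / "data").mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "data" / "houses.csv").write_text("rm,price\n6.575,24.0\n")

# Make the throwaway package importable.
sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()


def get_data_path(package: str, filename: str) -> Path:
    """Hypothetical helper: locate a bundled file without hardcoding a path."""
    return Path(str(importlib.resources.files(package) / "data" / filename))


path = get_data_path("demo_datum_pkg", "houses.csv")
print(path.exists())  # True
```

The caller only supplies a package name and a file name; the absolute path is worked out at run time, so the same code works regardless of where the package was installed.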

Using the value of the name column, you can fetch that file via to_pandas().

[4]:

df = datum.to_pandas(name='houses.csv')
df.head()

[4]:

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	lstat	price
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	5.33	36.2

Alternatively, if you prefer to work directly with the file itself, then you can obtain the location of the file via the get_path() method.

[5]:

datum.get_path('houses.csv')

[5]:

'/Users/layne/Desktop/AIQC/aiqc/data/houses.csv'

Remote Data

To avoid bloating the package, larger dummy datasets are not included in it. These kinds of datasets consist of many large files.

They are located within the repository at https://github.com/aiqc/aiqc/remote_datum.

If you want to fetch them on your own, you can do so by appending ?raw=true to the end of an individual file’s URL.
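As a small sketch of that step, the helper below adds raw=true to a file URL while preserving any query parameters that are already present. The to_raw_url() function is hypothetical, and the example page URL is an assumed browser-style address for one of the image files, not one taken from the library.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def to_raw_url(file_url: str) -> str:
    """Hypothetical helper: add raw=true to the URL's query string."""
    parts = urlparse(file_url)
    query = dict(parse_qsl(parts.query))
    query["raw"] = "true"
    return urlunparse(parts._replace(query=urlencode(query)))


# Assumed browser URL for a single file in the repository.
page_url = (
    "https://github.com/aiqc/aiqc/blob/main/"
    "remote_datum/image/brain_tumor/images/healthy_0.jpg"
)
raw_url = to_raw_url(page_url)
print(raw_url)
```

Parsing the URL rather than concatenating strings keeps the helper safe to call twice on the same address.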

Otherwise, their locations are hardcoded into the datum methods.

We’ll use the brain_tumor.csv file as an example.

[6]:

df = datum.to_pandas(name='brain_tumor.csv')
df.head()

[6]:

	symmetry	url
0	NaN	https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg
1	NaN	https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_1.jpg
2	NaN	https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_2.jpg
3	NaN	https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_3.jpg
4	NaN	https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_4.jpg

This file acts as a manifest for multi-modal datasets in that each row of this CSV represents a sample:

The 'status' column of this file serves as the Label of that sample. We’ll construct a Dataset.Tabular from this.
Meanwhile, the 'url' column acts as a manifest in that it contains the URL of the image file for that sample. We’ll construct a Dataset.Image from this.
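The split described above can be sketched with the standard library alone. The manifest rows below are made up for illustration: the label values are placeholders and the example.com URLs are stand-ins for the real image locations.

```python
import csv
import io

# Made-up manifest: one row per sample, with a label column and a url column.
manifest_csv = """status,url
0,https://example.com/images/healthy_0.jpg
0,https://example.com/images/healthy_1.jpg
1,https://example.com/images/tumor_0.jpg
"""

rows = list(csv.DictReader(io.StringIO(manifest_csv)))

# The label column feeds the tabular side of the dataset.
labels = [int(row["status"]) for row in rows]

# The url column lists where each sample's image lives.
image_urls = [row["url"] for row in rows]

# Row order is what ties each label to its image.
print(len(labels) == len(image_urls))  # True
```

Because the two lists come from the same rows, the label at a given index always belongs to the image URL at that index.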

[7]:

img_urls = datum.get_remote_urls('brain_tumor.csv')
img_urls[:3]

[7]:

['https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg',
 'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_1.jpg',
 'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_2.jpg']

At this point, you can use either the high- or low-level AIQC APIs [e.g. aiqc.Dataset.Image.from_urls()] to ingest that data and work with it as normal.

Alternative Sources

If the datasets described above do not satisfy your use case, then have a look at the following repositories:

UCI: https://archive.ics.uci.edu/ml/datasets.php
Kaggle: https://www.kaggle.com/datasets
Quilt: https://open.quiltdata.com/
sklearn: https://scikit-learn.org/stable/datasets/toy_dataset.html
seaborn: https://github.com/mwaskom/seaborn-data