Datasets
Overview
This notebook describes the prepackaged datasets that are referenced throughout the documentation. These datasets are either bundled locally within the package or fetched remotely from the AIQC repository.
Prerequisites
If you’ve already completed the instructions on the Installation page, then let’s get started.
[2]:
from aiqc import datum
The module for interacting with the datasets is called datum so that it does not overlap with commonly used names like 'data' or 'datasets'.
Prepackaged Local Data
The list_datums() method provides metadata about each file that is included in the package, so that you can find one that suits your purposes. By default it returns a Pandas DataFrame, but you can call list_datums(format='list') to change that.
[3]:
datum.list_datums()
[3]:
| | name | dataset_type | analysis_type | label | label_classes | features | samples | description | location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | exoplanets.parquet | tabular | regression | SurfaceTempK | N/A | 8 | 433 | Predict temperature of exoplanet. | local |
| 1 | heart_failure.parquet | tabular | regression | died | 2 | 12 | 299 | Biometrics to predict loss of life. | local |
| 2 | iris.tsv | tabular | classification_multi | species | 3 | 4 | 150 | 3 species of flowers. Only 150 rows, so cross-folds do not represent the population. | local |
| 3 | sonar.csv | tabular | classification_binary | object | 2 | 60 | 208 | Detecting either a rock "R" or mine "M". Each feature is a sensor reading. | local |
| 4 | houses.csv | tabular | regression | price | N/A | 12 | 506 | Predict the price of the house. | local |
| 5 | iris_noHeaders.csv | tabular | classification_multi | species | 3 | 4 | 150 | For testing; no column names. | local |
| 6 | iris_10x.tsv | tabular | classification_multi | species | 3 | 4 | 1500 | For testing; duplicated 10x so cross-folds represent the population. | local |
| 7 | brain_tumor.csv | image | classification_binary | status | 2 | 1 color x 160 tall x 120 wide | 80 | csv acts as label and manifest of image urls. | remote |
| 8 | galaxy_morphology.tsv | image | classification_binary | morphology | 2 | 3 colors x 240 tall x 300 wide | 40 | tsv acts as label and manifest of image urls. | remote |
| 9 | spam.csv | text | classification_binary | v1 | 2 | text data | 5572 | Collection of spam/ham (not spam) messages. | local |
| 10 | epilepsy.parquet | sequence | classification_binary | seizure | 2 | 1 x 178 readings | 1000 | <https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition> Stored tall so that it compresses better. `label_df = df[['seizure']]; sensor_arr3D = df.drop(columns=['seizure']).to_numpy().reshape(1000,178,1)` | local |
| 11 | delhi_climate.parquet | sequence | forecasting | N/A | N/A | 3 | 1575 | <https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data> Both train and test (pruned last day from train). 'pressure' and 'wind' columns seem to have outliers. Converted 'date' column to 'day_of_year'. | local |
| 12 | liberty_moon.csv | image | forecasting | N/A | N/A | 1 color x 50 tall x 60 wide | 15 | Moon glides from top left to bottom right. | remote |
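The epilepsy.parquet row above stores its sensor readings "tall" and suggests reshaping them after loading. A minimal sketch of that reshape, using a synthetic DataFrame as a stand-in for the real file (the column names, shapes, and random values here are illustrative; the split itself follows the snippet in the table):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for datum.to_pandas(name='epilepsy.parquet'):
# 1000 samples x 178 sensor readings, plus a binary 'seizure' label.
n_samples, n_readings = 1000, 178
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(n_samples, n_readings)),
    columns=[f"sensor_{i}" for i in range(n_readings)],
)
df['seizure'] = rng.integers(0, 2, size=n_samples)

# Split the label column from the features, then reshape the features
# to (samples, timesteps, channels) as the dataset description suggests.
label_df = df[['seizure']]
sensor_arr3D = df.drop(columns=['seizure']).to_numpy().reshape(n_samples, n_readings, 1)

print(label_df.shape)      # (1000, 1)
print(sensor_arr3D.shape)  # (1000, 178, 1)
```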
The location where Python packages are installed varies from system to system, so pkg_resources is used to find the location of these files dynamically.
Using the value of the name column, you can fetch that file via to_pandas().
[4]:
df = datum.to_pandas(name='houses.csv')
df.head()
[4]:
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |
Alternatively, if you prefer to work directly with the file itself, then you can obtain its location via the get_path() method.
[5]:
datum.get_path('houses.csv')
[5]:
'/Users/layne/Desktop/AIQC/aiqc/data/houses.csv'
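Once you have the path, any file-reading tool works. A sketch with plain pandas (writing a throwaway CSV to a temporary directory as a stand-in for the returned path, since the installed location varies per machine, and the 'price' values here are illustrative):

```python
import os
import tempfile
import pandas as pd

# Stand-in for datum.get_path('houses.csv'): create a tiny CSV to read.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'houses.csv')
    pd.DataFrame({'price': [24.0, 21.6]}).to_csv(path, index=False)

    # Any CSV reader works once you have the absolute path.
    df = pd.read_csv(path)
    print(df['price'].tolist())  # [24.0, 21.6]
```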
Remote Data
To avoid bloating the package, larger dummy datasets are not included. These kinds of datasets consist of many large files.
They are located within the repository at https://github.com/aiqc/aiqc/remote_datum.
If you want to fetch them on your own, you can do so by appending ?raw=true to the end of an individual file's URL. Otherwise, their locations are hardcoded into the datum methods.
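For example, turning a file's GitHub page URL into a directly downloadable one is just string handling (the file path below is illustrative, and `as_raw_url` is a hypothetical helper, not part of datum):

```python
def as_raw_url(github_file_url: str) -> str:
    """Append ?raw=true so GitHub serves the file contents directly."""
    return github_file_url + "?raw=true"

url = "https://github.com/aiqc/aiqc/blob/main/remote_datum/image/brain_tumor/brain_tumor.csv"
print(as_raw_url(url))
# https://github.com/aiqc/aiqc/blob/main/remote_datum/image/brain_tumor/brain_tumor.csv?raw=true
```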
We'll use the brain_tumor.csv file as an example.
[6]:
df = datum.to_pandas(name='brain_tumor.csv')
df.head()
[6]:
| | status | size | count | symmetry | url |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | NaN | https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg |
| 1 | 0 | 0 | 0 | NaN | https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_1.jpg |
| 2 | 0 | 0 | 0 | NaN | https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_2.jpg |
| 3 | 0 | 0 | 0 | NaN | https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_3.jpg |
| 4 | 0 | 0 | 0 | NaN | https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_4.jpg |
This file acts as a manifest for multi-modal datasets in that each row of this CSV represents a sample:
- The 'status' column of this file serves as the Label of that sample. We'll construct a Dataset.Tabular from this.
- Meanwhile, the 'url' column acts as a manifest in that it contains the URL of the image file for that sample. We'll construct a Dataset.Image from this.
[7]:
img_urls = datum.get_remote_urls('brain_tumor.csv')
img_urls[:3]
[7]:
['https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg',
'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_1.jpg',
'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_2.jpg']
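The label/manifest split described above can be sketched with plain pandas (using a two-row stand-in for the real manifest, with an illustrative file name in the second row; in practice you would load it via datum.to_pandas(name='brain_tumor.csv')):

```python
import pandas as pd

# Stand-in for the brain_tumor.csv manifest: one row per sample.
manifest = pd.DataFrame({
    'status': [0, 1],
    'url': [
        'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/healthy_0.jpg',
        # Hypothetical file name for a tumor-positive sample.
        'https://raw.githubusercontent.com/aiqc/aiqc/main/remote_datum/image/brain_tumor/images/tumor_0.jpg',
    ],
})

# 'status' becomes the Label (for a Dataset.Tabular);
# 'url' becomes the list of image files to ingest (for a Dataset.Image).
label_df = manifest[['status']]
img_urls = manifest['url'].tolist()

print(label_df.shape)  # (2, 1)
print(img_urls[0])
```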
At this point, you can use either the high- or low-level AIQC APIs [e.g. aiqc.Dataset.Image.from_urls()] to ingest that data and work with it as normal.
Alternative Sources
If the datasets described above do not satisfy your use case, then have a look at the following repositories: