Data Preparation
In this chapter, we will load and prepare the USA Housing dataset. All steps that we execute will be stored in ML Aide.
Download required Dataset
First, download the dataset
and save it in a subdirectory called data
.
mkdir data
curl https://raw.githubusercontent.com/MLAide/docs/master/docs/tutorial/housing.csv --output ./data/housing.csv
Create an API key
Later, we want to send all parameters, metrics, and models of our experiments to ML Aide. Therefore, we need to set an API key in our python client. Otherwise, we won't be able to authenticate against the ML Aide server.
- In the upper right click on
adam
>Settings
- Go to
API Keys
in the left navigation - Click on
Add API Key
- Enter any description and click on
Create
- Copy the show API key and store it somewhere safe. The API key won't be shown again. If you lose your API key you have to create a new one.
Create connection to ML Aide webserver
Our data preparation will be implemented in data_preparation.py
. Therefore, create a new file with this name.
To create a connection to the ML Aide webserver with Python clients you have to use mlaide.MLAideClient
. An object
of this class is the main entry point for all kinds of operations. Replace api_key
with your personal API key that
you created using the ML Aide web UI.
from mlaide import MLAideClient, ConnectionOptions
import pandas as pd
options = ConnectionOptions(
server_url='http://localhost:8881/api/v1', # the ML Aide demo server runs on port 8881 per default
api_key='<your api key>'
)
mlaide_client = MLAideClient(project_key='usa-housing', options=options)
Create Run
Before we read or process anything we should start tracking all relevant information in ML Aide. In ML Aide a run is the key concept to track parameters, metrics, artifacts and models. All runs belong to one or more experiments.
run_data_preparation = mlaide_client.start_new_run(experiment_key='linear-regression', run_name='data preparation')
Now we can read and process this dataset. Also, we can register the dataset as an artifact in ML Aide. This gives us the ability to reproduce the following steps - even if the dataset is lost, deleted, or modified. The artifact can be used in other runs as an input. This helps to track down the lineage of a machine learning model to its root. In the end, don't forget to mark the run as completed.
housing_data = pd.read_csv('data/housing.csv')
# add dataset as artifact
artifact = run_data_preparation.create_artifact(name="USA housing dataset", artifact_type="dataset", metadata={})
run_data_preparation.add_artifact_file(artifact, 'data/housing.csv')
run_data_preparation.set_completed_status()
Start your python script using your shell with python data_preparation.py
. After the script completed check the
web UI to see the created run and the artifact.
Summary
In this chapter we
- created an API key for authorization
- created our first run in ML Aide
- attached the dataset as an artifact to the run
- connected the python client to the ML Aide webserver
Your code should look like the following snippet shows.
Code
from mlaide import MLAideClient, ConnectionOptions
import pandas as pd
options = ConnectionOptions(
server_url='http://localhost:8881/api/v1', # the ML Aide demo server runs on port 8881 per default
api_key='<your api key>'
)
mlaide_client = MLAideClient(project_key='usa-housing', options=options)
run_data_preparation = mlaide_client.start_new_run(experiment_key='linear-regression', run_name='data preparation')
housing_data = pd.read_csv('data/housing.csv')
# add dataset as artifact
artifact = run_data_preparation.create_artifact(name="USA housing dataset", artifact_type="dataset", metadata={})
run_data_preparation.add_artifact_file(artifact, 'data/housing.csv')
run_data_preparation.set_completed_status()
a lot of housing data
The next step is to create a model based on this dataset.