Predicting House Prices
January 8, 2021
By Greg Krause
Gartner places AI Engineering in the Top Strategic Technology Trends for 2021.
Microsoft's cloud solution for this, Azure Machine Learning (AML), is a suite of tools that "empower developers and data scientists with a wide range of productive experiences for building, training, and deploying machine learning models faster".
In this post, we will utilize a subset of AML's features to tackle the House Prices - Advanced Regression Techniques prediction competition on Kaggle.
What is AutoML?
According to the Azure concept page, AutoML is the "process of automating the time consuming, iterative tasks of machine learning model development".
The models created by Azure AutoML can then be registered and deployed as a service, or used as a baseline for performance expectations.
Prerequisites
The code snippets in this post are intended to be run from an AML Jupyter Notebook. For more information, please see Tutorial: Get started with Azure Machine Learning in Jupyter Notebooks.
- Create a new AML notebook named kaggle-house-prices-advanced-regression-techniques.ipynb
- Download the competition dataset
AML Workspace File Structure
house-prices-advanced-regression-techniques/
    kaggle-house-prices-advanced-regression-techniques.ipynb
    data/
        test.csv        # From competition dataset
        train.csv       # From competition dataset
        submission.csv  # We will create this file
Using AML Jupyter Python Notebook
Import Dependencies
from azureml.core import Dataset, Experiment, Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration
from azureml.train.automl import AutoMLConfig
import logging
import pandas as pd
Load Data and Fill NA/NaN Values with 0
train_df = pd.read_csv('./data/train.csv').fillna(0)
test_df = pd.read_csv('./data/test.csv').fillna(0)
Upload Data to Azure Blob
ws = Workspace.from_config()
default_store = ws.get_default_datastore()
default_store.upload_files(
    ['./data/train.csv'],
    target_path='kaggle-house-prices-training',
    overwrite=True,
    show_progress=True
)
Create and Register Training Dataset
Datasets are used to "access data for your local or remote experiments with the AML Python SDK".
train_dataset = Dataset.Tabular.from_delimited_files(
    default_store.path('kaggle-house-prices-training')
)
train_dataset = train_dataset.register(ws, 'kaggle-house-prices-training')
Create Compute Cluster If Not Exists
AML Compute Cluster is "managed-compute infrastructure that allows you to easily create a single or multi-node compute".
We will create a modest, one-node sandbox "cluster". Because the minimum node count defaults to zero, this compute cluster automatically scales down to zero nodes when not in use.
amlcompute_cluster_name = "sandbox"
try:
    # Reuse the cluster if it already exists in the workspace
    aml_compute = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
except ComputeTargetException:
    print('Compute cluster %s not found. Attempting to create it now.' % amlcompute_cluster_name)
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_DS2_v2',
        max_nodes=1
    )
    aml_compute = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)
aml_compute.wait_for_completion(show_output=True)
Define Compute RunConfiguration
The RunConfiguration can be used to define any conda or pip Python packages the run requires.
aml_run_config = RunConfiguration()
aml_run_config.target = aml_compute
aml_run_config.environment.docker.enabled = True
aml_run_config.environment.python.user_managed_dependencies = False
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['packaging']
)
Define AutoMLConfig
At the time of writing, Azure AutoML supports classification, regression, and forecasting ML task types. To predict a continuous home value, we will use a regression configuration.
We will use normalized_root_mean_squared_error as our primary metric (the score AutoML uses to compare candidate models), paired with enable_early_stopping for cost savings.
To view all available AutoML config options, see the AutoMLConfig Class Documentation.
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": 'normalized_root_mean_squared_error',
    "enable_early_stopping": True,
    "experiment_timeout_hours": 1,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}
automl_config = AutoMLConfig(
    task='regression',
    compute_target=aml_compute,
    training_data=train_dataset,
    label_column_name='SalePrice',
    **automl_settings
)
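To get a feel for the primary metric, here is a minimal sketch that computes a normalized RMSE by hand. It assumes the definition given in Azure's metric documentation, RMSE divided by the range (max minus min) of the actual values; the helper names and the prices are invented for illustration and are not part of the AML SDK.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def normalized_rmse(actual, predicted):
    """RMSE divided by the range of the actual values (assumed definition)."""
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values have zero range")
    return rmse(actual, predicted) / value_range

# Invented sale prices, for illustration only.
actual_prices = [100_000, 200_000, 300_000]
predicted_prices = [110_000, 190_000, 300_000]

nrmse = normalized_rmse(actual_prices, predicted_prices)
print(f"normalized RMSE: {nrmse:.4f}")
```

Dividing by the range makes the number unitless, so a value can be read as "typical error as a fraction of the price spread" rather than in dollars.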
Create Experiment
experiment_name = 'kaggle-house-prices-training'
experiment = Experiment(workspace=ws, name=experiment_name)
Run AutoML Training Job Experiment and Wait for Completion
With the settings used in this post, the AutoML job should take approximately 40 minutes to complete.
remote_run = experiment.submit(automl_config, show_output=False)
remote_run.wait_for_completion()
Retrieve the Best Model
Azure AutoML tests a variety of ML algorithms and hyperparameters to find the best performing values. Here, we will retrieve the model with the lowest normalized_root_mean_squared_error (as defined in our AutoMLConfig).
best_run, fitted_model = remote_run.get_output()
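Conceptually, picking the best model boils down to choosing the candidate with the lowest error. The toy sketch below is not how the SDK is implemented: the candidate names and per-fold scores are made up, and it assumes each candidate's score is the mean over the three cross-validation folds.

```python
from statistics import mean

# Hypothetical per-fold normalized RMSE values for three candidates
# (three folds each, mirroring n_cross_validations = 3).
candidate_fold_scores = {
    "candidate_a": [0.050, 0.054, 0.052],
    "candidate_b": [0.030, 0.033, 0.030],
    "candidate_c": [0.047, 0.046, 0.048],
}

def pick_best(fold_scores):
    """Return (name, mean score) of the candidate with the lowest mean score."""
    if not fold_scores:
        raise ValueError("no candidates to choose from")
    mean_scores = {name: mean(scores) for name, scores in fold_scores.items()}
    # Lower is better for an error metric such as normalized RMSE.
    return min(mean_scores.items(), key=lambda item: item[1])

best_name, best_score = pick_best(candidate_fold_scores)
print(f"best candidate: {best_name} (mean score {best_score:.3f})")
```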
Generate Predictions
Generate SalePrice predictions, select the fields required by the Kaggle competition (Id and SalePrice), and write them to a local CSV. This file can then be submitted via the competition site.
test_df['SalePrice'] = fitted_model.predict(test_df)
kaggle_submission = test_df[['Id', 'SalePrice']]
kaggle_submission.to_csv('./data/submission.csv', index=False)
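Before uploading, it can be worth sanity-checking the file. The helper below is a hypothetical sketch (check_submission is not part of any library): it verifies the header, rejects duplicate Ids, and rejects non-positive prices. The sample rows are invented and stand in for the real file.

```python
import csv
import io

def check_submission(csv_file, expected_columns=("Id", "SalePrice")):
    """Validate header and values; return the number of data rows."""
    reader = csv.DictReader(csv_file)
    if tuple(reader.fieldnames or ()) != tuple(expected_columns):
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    seen_ids = set()
    for row in reader:
        if row["Id"] in seen_ids:
            raise ValueError(f"duplicate Id: {row['Id']}")
        seen_ids.add(row["Id"])
        if float(row["SalePrice"]) <= 0:
            raise ValueError(f"non-positive SalePrice for Id {row['Id']}")
    return len(seen_ids)

# In-memory stand-in for ./data/submission.csv, with invented values.
sample = io.StringIO("Id,SalePrice\n1461,120000.5\n1462,155000.0\n")
row_count = check_submission(sample)
print(f"{row_count} rows look well-formed")
```

To check the real file, pass an open file handle instead, for example open('./data/submission.csv', newline='').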
Results
The resulting model obtained a Root-Mean-Squared-Error (RMSE) score of 0.14511. Kaggle computes this score between the logarithm of the predicted and observed sale prices, so lower is better.
Looking at the leaderboard score distribution for this competition, the best scores cluster around 0.11. While this model didn't place in the top 10 (or top 1,000), its RMSE of roughly 0.145 is not far behind.
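For intuition about what a score in this range means, here is a small sketch of the competition metric as Kaggle describes it: RMSE between the logarithms of predicted and observed prices. The function name and the prices are invented for illustration.

```python
import math

def rmse_of_logs(actual, predicted):
    """RMSE between log(predicted) and log(actual); prices must be positive."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Invented prices; every prediction is exactly 10% too high.
actual = [100_000, 200_000, 300_000]
predicted = [a * 1.1 for a in actual]

score = rmse_of_logs(actual, predicted)
print(f"log-RMSE: {score:.4f}")
```

Because the error is taken on logarithms, a uniform 10% overestimate yields the same score (log 1.1, about 0.095) whether the house is cheap or expensive, so the metric rewards relative rather than absolute accuracy.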
Parting Thoughts
AutoML appears to live up to its promise of "automating the time consuming, iterative tasks of machine learning model development". It may not win Kaggle competitions, but it sure is a solid start.