Forecasting Quarterly House Sales with MindsDB
Introduction
In this tutorial, we’ll create and train a machine learning model, or as we
call it, an AI Table
or a predictor
. By querying the model, we’ll predict the
real estate sales using a multivariate time series strategy.
Make sure you have access to a working MindsDB installation, either locally or at MindsDB Cloud.
If you want to learn how to set up your account at MindsDB Cloud, follow this guide. Another way is to set up MindsDB locally using Docker or Python.
Check out the House Sales tutorial using Nixtla’s StastForecast engine here.
Let’s get started.
Data Setup
Connecting the Data
There are a couple of ways you can get the data to follow through with this tutorial.
Connecting as a database
You can connect to a demo database that we’ve prepared for you. It contains the data used throughout this tutorial (the example_db.demo_data.house_sales
table).
Now you can run queries directly on the demo database. Let’s preview the data that we’ll use to train our predictor.
Connecting as a file
The dataset we use in this tutorial is the pre-processed version of the House Property Sales data. You can download the CSV
data file here (we use the ma_lga_12345.csv file) and upload it via MindsDB SQL Editor.
Follow this guide to find out how to upload a file to MindsDB.
Now you can run queries directly on the file as if it were a table. Let’s preview the data that we’ll use to train our predictor.
Pay Attention to the Queries
From now on, we’ll use the files.house_sales
file as a table. Make sure
you replace it with example_db.demo_data.house_sales
if you connect the data
as a database.
Understanding the Data
We use the house sales dataset, where each row is one house or one unit, to
predict the MA
column values. It tracks quarterly moving averages (MA
) of
house sales aggregated by real estate type and the number of bedrooms in each
listing.
Below is the sample data stored in the files.house_sales
table.
Where:
Column | Description | Data Type | Usage |
---|---|---|---|
saledate | The date of sale. | date | Feature |
MA | Moving average of the historical median price of the house or unit. | integer | Label |
type | Type of property (house or unit ). | character varying | Feature |
bedrooms | Number of bedrooms. | integer | Feature |
Info “Labels and Features” A label is a column whose values will be
predicted (the y variable in simple linear regression).
A feature is a
column used to train the model (the x variable in simple linear regression).
Training a Predictor
Let’s create and train the machine learning model. For that, we use the
CREATE MODEL
statement and specify the
input columns used to train FROM
(features) and what we want to
PREDICT
(labels).
We use all of the columns as features, except for the MA
column, whose values
will be predicted.
MindsDB makes it simple so that we don’t need to repeat the predictor creation
process for every group, that is, for every distinct number of bedrooms or for
every distinct type of real estate. Instead, we just group by both the
bedrooms
and type
columns, and the predictor learns from all series and
enables forecasts for all of them!
Status of a Predictor
A predictor may take a couple of minutes for the training to complete. You can monitor the status of the predictor by using this SQL command:
If we run it right after creating a predictor, we get this output:
A bit later, this is the output:
And at last, this should be the output:
Now, if the status of our predictor says complete
, we can start making
predictions!
Making Predictions
You can make predictions by querying the predictor joined with the data table.
The SELECT
statement lets you make predictions for the
label based on the chosen features for a given time period. Usually, you want to
know what happens right after the latest training data point that was fed. We
have a special keyword for that, the LATEST
keyword.
On execution, we get:
Please note that in the SELECT
statement, we select
m.saledate
instead of t.saledate
because we make predictions for future
dates that are not in the data table.
Now, try changing the type
column value to unit, or the bedrooms
column
value to any number between 1 to 5, and check how the forecasts vary. This is
because MindsDB recognizes each grouping as being its own different time series.
What’s Next?
Have fun while trying it out yourself!
- Bookmark MindsDB repository on GitHub.
- Sign up for a free MindsDB account.
- Engage with the MindsDB community on Slack or GitHub to ask questions and share your ideas and thoughts.
If this tutorial was helpful, please give us a GitHub star here.