Building a Price Prediction API using ML.NET and ASP.NET Core Web API — Part 1
Will Velida
Posted on December 7, 2019
This will be a two-part tutorial on how you can use ML.NET to build machine learning models and then implement the model that you build in an ASP.NET Core Web API. The first part will cover how you can build a model that predicts prices (I'll be building on the Regression tutorial that's on the ML.NET docs) and then upload that model to Azure Storage. The second part will then cover how we can consume the model that's stored in Azure Storage in an ASP.NET Core Web API.
About ML.NET
ML.NET allows developers to add machine learning capabilities to their applications. .NET developers can train custom models that transform an application's data and make predictions based on that data. We can perform a variety of different predictions with ML.NET, ranging from classification to recommendations.
We can develop our machine learning models using the following process:
- Collect our training data and load it into an IDataView object.
- Create a pipeline and specify the steps needed to extract features from our data and apply an ML algorithm to it.
- Train the model by calling the Fit() method on the pipeline.
- Evaluate the model and iterate to improve its performance if needed.
- Save the model.
- Load the model back into an ITransformer object.
- Finally, make predictions by creating a PredictionEngine with CreatePredictionEngine() and calling its Predict() method. In some circumstances we can use a PredictionEnginePool instead (I'll explain more later on); see the sketch after this list.
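As a minimal sketch of those last two steps (assuming a TaxiTrip input class and a TaxiTripFarePrediction output class like the ones we'll define later, and a saved Model.zip file):

// Load a previously saved model back into an ITransformer
ITransformer trainedModel = mlContext.Model.Load("Model.zip", out DataViewSchema modelSchema);

// Create a prediction engine and score a single trip
// (sampleTrip is a hypothetical TaxiTrip instance describing the trip we want a fare for)
var predictionEngine = mlContext.Model.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction>(trainedModel);
TaxiTripFarePrediction prediction = predictionEngine.Predict(sampleTrip);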
Let’s explain a bit of the architecture behind ML.NET with this workflow in mind.
Each ML.NET application has an MLContext object. This object is shared across the application (like a singleton) and contains a set of catalogs that we can use for data loading, transformation, training, feature engineering, model evaluation and so on.
I’ll dive a little deeper into how we use these catalogs to build our Model trainer as I go through our example.
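For example, creating the context and reaching into its catalogs looks like this (the seed argument just makes runs repeatable):

// The MLContext is the entry point for all ML.NET operations
var mlContext = new MLContext(seed: 0);

// Each catalog groups related operations, for example:
// mlContext.Data       -> loading and saving data
// mlContext.Transforms -> feature engineering transforms
// mlContext.Regression -> regression trainers and evaluators
// mlContext.Model      -> saving, loading and consuming models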
Our Model Builder
For our demo, I'm going to be building a regression model that predicts the prices of New York City taxi fares. This model is the same as the one the ML.NET team have built for their tutorial on regression, but in this case I'm going to explain the components behind each catalog object and show you how you can upload an ML.NET model to Azure Blob Storage. The purpose of this is that when we build our API, we can then consume our model by calling the Blob URI.
As with all their tutorials, the ML.NET team have kindly provided us with a dataset that we can use for our tutorial. This does take away the joy of performing the data cleaning and preparation tasks that we would normally do when doing any type of analysis, but it does allow us to dive straight into the cool machine learning goodness!
A word of warning: it's not all algorithms and insights. The data that you usually get in the wild won't be nice and clean csv files that are just waiting for you; you'll have to put in the hard yards and clean the data yourself to ensure that your ML algorithms are making the right predictions. I know that in this context we're just messing about with some taxi fare data, but if you're working with real data that has an effect on real people, I can't overstate the importance of data preparation in any ML pipeline.
Anyway, onto our code. These blog posts will just highlight the pieces of code relevant to the ML.NET library. The entire codebase will be available on my GitHub, so for now I will walk through the meaty bits of code and describe what they are doing.
I’ve created my ModelTrainer as a .NET Core Console Application that takes our csv file, trains a regression model and then uploads the model to Azure Blob Storage.
In our demo, we are reading data from our csv file and building our pipeline on top of that. In order to use ML.NET to build a model on our data, we need to build both an input data class and an output class for our predictions.
In our input class (TaxiTrip) we have decorated our class properties with a LoadColumn attribute. Using these attributes, we can specify the indexes of the source columns in our dataset. So here, VendorId would be at position 0, RateCode at position 1 and so on.
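Based on those column positions, the input class looks something like this (the field names follow the ML.NET regression tutorial that this post builds on):

using Microsoft.ML.Data;

public class TaxiTrip
{
    [LoadColumn(0)]
    public string VendorId;

    [LoadColumn(1)]
    public string RateCode;

    [LoadColumn(2)]
    public float PassengerCount;

    [LoadColumn(3)]
    public float TripTime;

    [LoadColumn(4)]
    public float TripDistance;

    [LoadColumn(5)]
    public string PaymentType;

    [LoadColumn(6)]
    public float FareAmount;
}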
In our output prediction class, we are producing a float field decorated with a ColumnName attribute of Score. In our regression model, the Score column contains the predicted values for our FareAmount.
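The output class is tiny by comparison; something like:

public class TaxiTripFarePrediction
{
    // The Score column holds the model's predicted fare
    [ColumnName("Score")]
    public float FareAmount;
}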
Now that we have our classes in place, we can start to build our pipeline. We want to take our csv file with our taxi fare data, create our pipeline and then train our model. We will then upload this model to Azure Blob Storage. The following method TrainAndUploadModelAsync() does this:
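(The method body below is a sketch assembled from the snippets we'll walk through next; the CloudBlobContainer parameter and the _azureStorageHelpers field belong to the storage helper covered later.)

public async Task TrainAndUploadModelAsync(string trainDataPath, CloudBlobContainer container)
{
    var mlContext = new MLContext(seed: 0);

    // Load the training data from our csv file into an IDataView
    IDataView dataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(trainDataPath, hasHeader: true, separatorChar: ',');

    // Build the training pipeline: copy the label, encode the categorical
    // columns, concatenate the features and pick a regression trainer
    var pipeline = mlContext.Transforms.CopyColumns("Label", "FareAmount")
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("VendorIdEncoded", "VendorId"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("RateCodeEncoded", "RateCode"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("PaymentTypeEncoded", "PaymentType"))
        .Append(mlContext.Transforms.Concatenate("Features", "VendorIdEncoded", "RateCodeEncoded",
            "PassengerCount", "TripTime", "TripDistance", "PaymentTypeEncoded"))
        .Append(mlContext.Regression.Trainers.FastTree());

    // Train the model
    var model = pipeline.Fit(dataView);

    // Save the model to a stream and upload it to Blob Storage
    using (var stream = new MemoryStream())
    {
        mlContext.Model.Save(model, dataView.Schema, stream);
        stream.Position = 0;
        await _azureStorageHelpers.UploadBlobToStorage(container, "Model.zip", stream);
    }
}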
Let’s break this down line by line.
We first load our csv file into an IDataView object. The IDataView class provides us with a way of describing numeric or text data in a tabular format. As of writing, we can load text files or real-time data (such as data from SQL databases). In our case, we are retrieving our csv file and loading it into the IDataView object. You can read more about the IDataView class and what you can do with it here.
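Using the trainDataPath parameter from the sketch above, the loading step is a one-liner:

// Read the csv file (with a header row) into an IDataView
IDataView dataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(trainDataPath, hasHeader: true, separatorChar: ',');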
Inside our MLContext object, we have a range of catalog extensions at our disposal to build a pipeline. Line by line, our pipeline performs the tasks that we need in order to carry out our regression task.
Since we want to predict the taxi fare amount, we use the CopyColumns transformer to copy FareAmount into the Label column that we want to predict, like so:
var pipeline = mlContext.Transforms.CopyColumns("Label", "FareAmount")
Now in our dataset, we have categorical data that we need to encode into numbers. We can use the OneHotEncodingTransformer to assign different numeric key values to the different categorical values in each of the columns. The columns that we need to encode are VendorId, RateCode and PaymentType. We can do this like so:
.Append(mlContext.Transforms.Categorical.OneHotEncoding("VendorIdEncoded", "VendorId"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding("RateCodeEncoded", "RateCode"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding("PaymentTypeEncoded", "PaymentType"))
We use one-hot encoding here because label encoding implies that the higher the number, the more significance that property has. Since this is not the case, we use one-hot encoding just to represent each property value in a numeric form that carries no weighting for our regression algorithm.
We now want to combine all the features that we want to use to predict our label into a Features column. ML.NET algorithms process only the features in the Features column. We can do so like this:
.Append(mlContext.Transforms.Concatenate("Features", "VendorIdEncoded", "RateCodeEncoded", "PassengerCount", "TripTime", "TripDistance", "PaymentTypeEncoded"))
Finally, we want to choose a learning algorithm and apply it to our pipeline. We can do so like this:
.Append(mlContext.Regression.Trainers.FastTree());
Now that we have our pipeline, we will want to train our model by fitting our pipeline object over our data view. We can do this using the Fit() method.
var model = pipeline.Fit(dataView);
With our trained model, we want to save it and upload it to Azure Blob Storage. In order to do this, we can save our transformer model to a stream by passing in our model, the schema of the data view that was used to train it, and the stream that we are writing to. In the code below, I also have a helper method that uploads our model as a zip file to a specified container. The contents of our model have been saved to our MemoryStream.
using (var stream = new MemoryStream())
{
    // Serialize the trained model, along with its input schema, into the stream
    mlContext.Model.Save(model, dataView.Schema, stream);
    // Rewind the stream so the upload starts from the beginning
    stream.Position = 0;
    await _azureStorageHelpers.UploadBlobToStorage(container, "Model.zip", stream);
}
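The UploadBlobToStorage helper isn't ML.NET-specific; a minimal sketch using the Microsoft.Azure.Storage.Blob SDK (an assumption on my part; any blob client will do) might look like this:

using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob;

public class AzureStorageHelpers
{
    public async Task UploadBlobToStorage(CloudBlobContainer container, string blobName, Stream stream)
    {
        // Get a reference to the block blob and upload the stream's contents
        CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
        await blob.UploadFromStreamAsync(stream);
    }
}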
We can verify that our model was uploaded successfully by looking inside the container of our storage account.
The Save() method has overloads for situations where we would want to specify a file path instead of a Stream. For more details, check out the docs here.
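For example, saving straight to a local file is just:

// Overload that writes the model directly to a file path
mlContext.Model.Save(model, dataView.Schema, "Model.zip");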
Conclusion
In this blog post, we've trained a simple regression model on some taxi fare data and uploaded the trained model to Azure Storage. In a future tutorial, we will consume this model in our API using the Blob URI.
To extend this sample further, we could have evaluated the effectiveness of our model and tested it against a single prediction to see if our model is any good. Potentially, we could cache the effectiveness of our model and, as new data comes in, test whether it is still fit for purpose. ML.NET has an Evaluate() method that allows us to assess our models and provides us with metrics on how they perform (such as RSquared and RMS values). We could compare the current performance of our model against previous performances and see if we need to fine-tune our pipeline based on these scores.
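As a rough sketch, evaluating against a held-out test set would look something like this (testDataView is a hypothetical second IDataView loaded from test data):

// Score the test data with the trained model, then compute regression metrics
var predictions = model.Transform(testDataView);
var metrics = mlContext.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");

Console.WriteLine($"RSquared: {metrics.RSquared:0.##}");
Console.WriteLine($"RMS error: {metrics.RootMeanSquaredError:0.##}");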
I hope that you learnt something from this blog post. If you want to have a look at the code, check out the GitHub repo for it! As of writing, I'm still working on the API part, so stay tuned for that in the near future!