TensorFlow.js: Building a quick and dirty stock market predictor

When you think of machine learning, the first language that comes to mind is Python. Its great community support and wealth of available packages make it an excellent choice. But while going through Andrew Ng's ML course, I realized that machine learning is about how you define your model, not about the programming language being used. So I thought, why not give TensorFlow.js a try?

Building a quick and dirty stock market predictor using TensorFlow.js

I'll be using ICICI Bank's stock dataset to predict the closing price from the opening price.

The data is the price history and trading volumes of the ICICI bank stock. The data spans from 1st January 2000 to 30th April 2021.

Check out Kaggle for various datasets.

Choosing a model

Let's have a look at the first 1000 values of the dataset using a scatter plot.

Plotting the open price against the closing price

Looking at the data, we can see that a line of best fit would capture the relation between the opening and the closing price.

Does this ring any bells? Remember the equation of a straight line we studied in high school?



y = mx + c

m -> slope of the line
c -> y intercept

And this is exactly what simple linear regression ML models use. It is a statistical model which is used to define a relationship between two variables. The independent variable x is used to predict the value of the dependent variable y.

In ML terminology this equation is called the hypothesis.

The two columns we care about in the ICICI Bank stock dataset are Open and Close, and there are more than 1000 rows. So instead of operating on these values one by one, they are generally represented in the form of a matrix.
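
To make the hypothesis concrete, here is a minimal sketch in plain JavaScript (no TensorFlow.js yet). The function names, the slope, the intercept and the prices are all made up for illustration; they are not values fitted to the ICICI data.

```javascript
// Hypothesis for simple linear regression: y = m * x + c.
function hypothesis(x, m, c) {
  return m * x + c;
}

// Apply the hypothesis to a whole column of opening prices at once.
// This is the same idea as the matrix form, just written with a plain array.
function predictAll(openPrices, m, c) {
  return openPrices.map((x) => hypothesis(x, m, c));
}

// Illustrative values only (assumed, not taken from the dataset)
const openPrices = [100, 200, 300];
const predictedClose = predictAll(openPrices, 1, 2);
console.log(predictedClose); // [ 102, 202, 302 ]
```

With tensors, the same computation happens for every row in a single operation instead of an explicit loop.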

Understanding the cost function

Cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function.

Source Wikipedia

In other words, it measures the difference between the value that the hypothesis function spits out and the actual value. Since we are looking for the line of best fit, the aim is to minimize the cost. We want our predicted values to end up very close to the actual values as the model is trained.

Squared error cost function used for linear regression

Source Medium
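
As a rough sketch, the squared error cost can be written in a few lines of plain JavaScript. The helper names and sample numbers are my own and purely illustrative. Note that this version uses the plain mean of the squared errors; some texts, including Andrew Ng's course, divide by 2N instead of N, which does not change where the minimum lies.

```javascript
// Mean squared error between predicted and actual values.
function meanSquaredError(predicted, actual) {
  if (predicted.length !== actual.length || predicted.length === 0) {
    throw new Error('Inputs must be non-empty arrays of equal length');
  }
  let sum = 0;
  for (let i = 0; i < predicted.length; i++) {
    const diff = predicted[i] - actual[i];
    sum += diff * diff;
  }
  return sum / predicted.length;
}

// Cost of a candidate line y = m * x + c over a dataset.
function cost(xs, ys, m, c) {
  const predicted = xs.map((x) => m * x + c);
  return meanSquaredError(predicted, ys);
}

// Illustrative data lying exactly on y = 2x
const sampleOpen = [1, 2, 3];
const sampleClose = [2, 4, 6];
console.log(cost(sampleOpen, sampleClose, 2, 0)); // 0, the line fits perfectly
console.log(cost(sampleOpen, sampleClose, 1, 0)); // 14 / 3, a worse fit costs more
```

The better the line fits, the lower the cost, and that is the number the training step tries to push down.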

Let's have a glance at the hypothesis function

x -> This will be the opening price (Nx1 matrix)
m, c -> Their values are chosen to minimize the cost function. Let's park the explanation for now.

In the world of TensorFlow.js these matrices are called tensors. You can read more about them here.

Getting things ready

Add the script tags below to your HTML file to make TensorFlow.js and tfjs-vis (used for visualization) available on your page.



 <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script>
 <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-vis"></script>

Loading the CSV file and plotting the values on a scatter plot

We are using tfvis here to plot our dataset.



function plot(points, predictedPoints) {
    const data = { values: [points, ...(predictedPoints ? [predictedPoints] : [])],
        series: ['original', ...(predictedPoints ? ['prediction'] : [])] };

    const surface = { name: 'ICICI Bank stock price prediction' };
    tfvis.render.scatterplot(surface, data, {xLabel: 'Open', yLabel: 'Close'});            
}

// All the TensorFlow.js utility functions can be
// accessed through the variable 'tf'
// File path can be changed
let dataset = tf.data.csv('http://localhost:4000/ICICIBANK.csv');
let points = dataset.map(item => ({
       x: item.Open,
       y: item.Close
}));

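// 'await' only works in an async context (an async function or a module script)
let pointsArr = await points.toArray();
// Drop one point if the count is odd, so the data can later be split into two equal halves
if (pointsArr.length & 1) pointsArr.pop();
/**
 * Shuffle the dataset in place so that our model does not
 * encounter similar values in each step
 */
tf.util.shuffle(pointsArr);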

plot(pointsArr);

The price values can span different ranges, so it is important to bring them onto a common scale. This process is called normalization. Typically you would bring the values into the range 0 to 1.



/**
 * Normalize the tensor to the range 0-1 (min-max scaling)
 */
function normalize(tensor, prevMin, prevMax) {
    const min = prevMin || tensor.min();
    const max = prevMax || tensor.max();
    return tensor.sub(min).div(max.sub(min));
}

/**
* Denormalize the tensor
* */
function denormalize(tensor, min, max) {
      return tensor.mul(max.sub(min)).add(min);
}
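
To see what the normalize and denormalize functions above do to actual numbers, here is a small plain-array sketch of the same min-max idea. The helper names and the prices are hypothetical. It also guards against the case where every value is identical, which the tensor version above would turn into a division by zero.

```javascript
// Min-max normalization on a plain array: maps the values into [0, 1].
function normalizeArray(values, min = Math.min(...values), max = Math.max(...values)) {
  if (max === min) {
    throw new Error('Cannot normalize: all values are identical');
  }
  const normalized = values.map((v) => (v - min) / (max - min));
  // Return min and max too, since they are needed to undo the scaling later
  return { normalized, min, max };
}

// Reverse the scaling to get back to the original price range.
function denormalizeArray(values, min, max) {
  return values.map((v) => v * (max - min) + min);
}

// Illustrative prices (assumed, not from the dataset)
const prices = [100, 150, 200];
const scaled = normalizeArray(prices);
console.log(scaled.normalized); // [ 0, 0.5, 1 ]
console.log(denormalizeArray(scaled.normalized, scaled.min, scaled.max)); // [ 100, 150, 200 ]
```

The important detail is to keep the original min and max around, because predictions come out in the 0 to 1 range and have to be mapped back to prices.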

Defining the feature and output tensor



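// Assumption: features and outputs are the x (Open) and y (Close) values of pointsArr
const features = pointsArr.map(p => p.x);
const outputs = pointsArr.map(p => p.y);
let featureTensor = tf.tensor2d(features, [features.length, 1]);
let outputTensor = tf.tensor2d(outputs, [outputs.length, 1]);
let normalisedFeatures = normalize(featureTensor);
let normalisedOutput = normalize(outputTensor);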

Splitting the datasets into training and testing

Why is splitting required?
Splitting ensures that our model is built using a specific set of data so that when we evaluate the model against the test data it is actually evaluated against something it has never encountered during the creation phase. It also gives you a sense of how it might perform in production.

Generally, around 70% of the data is reserved for training. To keep things simple, the code below splits the data into two equal halves instead.

If you don't find the reasoning very intuitive, I would highly recommend reading this blog.



// tf.split(tensor, 2) cuts the tensor into two equal halves along the first axis
let [trainFeatures, testFeatures] = tf.split(normalisedFeatures, 2);

let [trainOutput, testOuput] = tf.split(normalisedOutput,2);
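
If you would rather follow the usual 70/30 convention, the idea can be sketched on a plain array like this. The helper name and the sample data are hypothetical. For tensors, tf.split also accepts an array of sizes instead of a count, so the two sizes computed here could be passed to it in the same way.

```javascript
// Split an array into a training part and a testing part.
// trainFraction is the share of items that goes into the training set.
function trainTestSplit(items, trainFraction = 0.7) {
  if (trainFraction <= 0 || trainFraction >= 1) {
    throw new Error('trainFraction must be between 0 and 1 (exclusive)');
  }
  const trainSize = Math.round(items.length * trainFraction);
  return {
    train: items.slice(0, trainSize),
    test: items.slice(trainSize),
  };
}

// Illustrative data: ten already-shuffled points represented by their index
const samplePoints = Array.from({ length: 10 }, (_, i) => i);
const parts = trainTestSplit(samplePoints);
console.log(parts.train.length, parts.test.length); // 7 3
```

Since the split takes the first items for training, it assumes the data was shuffled beforehand, as we did after loading the CSV.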

Creating a model

We'll use the TensorFlow.js Layers API to create the model.



function createModel() {
    let model = tf.sequential();

    model.add(tf.layers.dense({
        units: 1,
        inputDim: 1,
        activation: 'linear',
        useBias: true
    }));

    // sgd -> stochastic gradient descent, here with a learning rate of 0.1
    let optimizer = tf.train.sgd(0.1);
    model.compile({
        loss: 'meanSquaredError',
        optimizer
    });
    return model;
}

let model = createModel();

tf.sequential() - The model is sequential, i.e. the output of one layer acts as the input to the next.
units - The layer has a single unit, so it outputs one value: the predicted closing price.
inputDim - The input dimension is 1, as we have only one feature: the opening price.
activation - We are doing linear regression, so we use the linear activation function.
useBias - 'c' in our hypothesis function is called the bias term.

Now, the point that is a little unclear here is tf.train.sgd. Remember that we parked the explanation for m and c earlier? Gradient descent is the algorithm that searches for the values of these terms that minimize the loss, adjusting them a little at every iteration. Read more about it here. It takes a learning rate that sets the size of each descent step. A traditional default value for the learning rate is 0.1 or 0.01, and this may be a good starting point for your problem.

As mentioned earlier, our cost (or loss) function will be a squared error function.
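
To demystify what the optimizer is doing, here is a bare-bones sketch of gradient descent for y = mx + c in plain JavaScript. It is not how TensorFlow.js implements it internally: this version uses the whole dataset on every step (batch gradient descent), whereas the optimizer typically works on smaller batches. The function name, the number of epochs and the sample data are assumptions for illustration.

```javascript
// Batch gradient descent for y = m * x + c with mean squared error as the loss.
function fitLine(xs, ys, learningRate = 0.1, epochs = 1000) {
  let m = 0;
  let c = 0;
  const n = xs.length;
  for (let epoch = 0; epoch < epochs; epoch++) {
    let gradM = 0;
    let gradC = 0;
    for (let i = 0; i < n; i++) {
      const error = m * xs[i] + c - ys[i]; // prediction minus actual
      gradM += (2 / n) * error * xs[i];    // d(loss)/dm
      gradC += (2 / n) * error;            // d(loss)/dc
    }
    // Step against the gradient; the learning rate sets the step size
    m -= learningRate * gradM;
    c -= learningRate * gradC;
  }
  return { m, c };
}

// Illustrative, already-normalized inputs lying on the line y = 2x + 1
const trainXs = [0, 0.25, 0.5, 0.75, 1];
const trainYs = trainXs.map((x) => 2 * x + 1);
const fitted = fitLine(trainXs, trainYs);
console.log(fitted.m.toFixed(2), fitted.c.toFixed(2)); // values close to 2 and 1
```

A larger learning rate takes bigger steps and may overshoot the minimum, while a smaller one converges more slowly, which is why 0.1 or 0.01 are common first guesses.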

Evaluating the model against the test set



// evaluate() returns the loss (mean squared error) on the test set as a scalar tensor
let testing = await model.evaluate(testFeatures, testOuput);

Predicting the values and plotting them

Using tfvis to create a scatter plot



async function plotPrediction(model) {
    // Generate 1000 random inputs in the normalized range [0, 1)
    const randomXs = Array.from({ length: 1000 }, () => Math.random());
    const normalisedXs = tf.tensor2d(randomXs, [1000, 1]);
    const normalisedYs = model.predict(normalisedXs);
    const xs = denormalize(normalisedXs, featureTensor.min(), featureTensor.max()).dataSync();
    const ys = denormalize(normalisedYs, outputTensor.min(), outputTensor.max()).dataSync();

    const predictedPoints = Array.from(xs).map((val, ind) => ({
        x: val, y: ys[ind]
    }));
    plot(pointsArr, predictedPoints);
}

Let's see what the scatter plot looks like for our predicted values.

Well, there are a couple of things that I didn't mention, like saving the model and loading it from storage. But you can find the complete code in this GitHub repo.

A question for the readers

So, if you run this code locally and plot the original and predicted values on the scatter plot, you will notice that every predicted closing price is less than its corresponding opening price. I am not quite sure what is causing this. Maybe I'll try tinkering with the learning rate.

Let me know if you catch the issue 🙏.
