Self-Service Machine Learning with Intelligent Databases
MindsDB Team
Posted on March 9, 2022
The currently available Automated Machine Learning (AutoML) tools promise to make ML easy and affordable. But despite the appealing promises and bold claims about replacing data scientists with software, that is not happening anytime soon.
This article examines a new data-centric construct called AI Tables and its potential to make self-service ML easy for data engineers, developers, and business analysts.
Let’s get started.
How to Become an Insight-Driven Organization?
To become an insight-driven organization (IDO), first and foremost, you need data and the tools to manipulate and analyze it. Another essential component is the people, i.e., data analysts or data scientists with appropriate experience. And last but not least, you need to find a way to implement insight-driven decision-making processes across your company.
The technology that lets you make the most of your data is Machine Learning. The ML flow starts by using your data to train a predictive model, which then answers your data-related questions. Among the most effective techniques for Machine Learning are Artificial Neural Networks, whose design is influenced by our current understanding of how the human brain works. And given the computing resources available today, they can produce remarkable models trained on large amounts of data.
Nowadays, companies use various automation software and scripts to get different tasks done without human errors. Similarly, you can avoid human mistakes in your decision-making processes by basing decisions exclusively on your data.
Why are Companies Slow to Adopt AI?
The majority of businesses do not use AI or ML to handle their data. For example, the US Census Bureau shared that, as of 2020, less than 10% of US businesses had adopted Machine Learning, and those that had were primarily large companies.
Let’s look into the biggest obstacles that stand in the way of adopting ML.
- Before AI can take over any human work, a great deal of work has to be done by people first, and the first problem is the lack of skilled professionals. Data scientists are among the most in-demand professionals and also among the most expensive to hire, so many businesses cannot afford them.
- Another set of challenges is data-related: lack of available data, data security, and the time-consuming implementation of ML algorithms.
- Also, companies have trouble creating an environment to embrace data and all the benefits that it brings. This environment requires relevant tools, processes, and strategies.
Democratizing Machine Learning: AutoML tools alone aren’t sufficient
Automated ML platforms make big promises, but those promises remain largely unfulfilled. There are ongoing debates on whether AutoML can replace data scientists anytime soon.
Do you want to succeed in deploying automated machine learning at your company? AutoML tools are crucial, but remember to focus on processes, methods, and strategies as well. AutoML platforms are just tools, and most ML experts agree that tools alone are not enough.
Breaking Down the Machine Learning Process
Any ML process starts with data. It’s commonly agreed that the data preparation step is the most significant roadblock in an ML process. The modeling part is just one piece of the whole data science pipeline, and AutoML tools simplify it. But the complete workflow still requires much effort to transform data and supply it to the models. And this is not helped by the fact that data preparation and data transformation are the most time-consuming and the least enjoyable parts of the job.
And, of course, the business data used to train ML models is updated regularly. Hence, companies must build complex ETL pipelines that utilize sophisticated tools and processes. So making the ML process continuous and real-time is a challenging task.
Integrating ML with Apps and Change Management
Assume that we now have our ML model built and need to deploy it. The classical deployment approach treats it as an application-layer component, as per the diagram below.
Its input is data, and we get predictions as an output. Our business applications consume these predictions from the ML tools via APIs that developers use to integrate the two. Sounds straightforward from the developers’ point of view, right?
As easy as it can be for developers, it is not as easy when considering processes. Any integration with a business-critical app in a reasonably large organization is quite troublesome to maintain. Even if the company is tech-savvy, any code change request must go through specific review and testing workflows that involve multiple stakeholders. And that negatively impacts flexibility and adds complexity to the whole workflow.
It is much easier to experiment with ML-based decision-making when having enough flexibility in testing various concepts and ideas. So you would prefer something that would give you a self-service capability.
Self-Service Machine Learning or Intelligent Databases?
As we see above, data is the core of ML processes, with existing ML tools taking data and returning predictions, which are also data.
So now the questions arise:
- Why should we treat ML as a standalone app and implement all these complex integrations among ML Models, apps, and databases?
- Why not make ML a core capability of databases?
- Why not make ML Models available through standard database syntax, such as SQL?
Let’s analyze the abovementioned ML workflow challenges and find the solution by addressing these questions.
Current ML Challenges and How to Overcome Them
Challenge #1: Complex Data Integrations and ETL Pipelines
Maintaining complex data integrations and ETL pipelines between the ML Model and a database is one of the biggest challenges faced by ML processes.
SQL is the best tool for data manipulation, so we can solve this problem by bringing ML models to the data rather than bringing data to the models. In other words, ML models would learn and return predictions inside the database.
Challenge #2: ML Model Integrations with Apps
Another challenge that generates an avalanche of issues is integrating models with the business applications via the APIs.
Business applications and BI tools are tightly coupled with databases. So, if AutoML tools become part of the database, we can make predictions using standard SQL queries. It follows that no API integrations between ML models and business apps are necessary anymore, because the models reside within the database.
Solution: Embedding AutoML within the Database
Embedding AutoML tools within the database brings many benefits, such as the following:
- Any person who works with data and knows SQL (for example, a data analyst or a data scientist) can leverage the power of Machine Learning.
- Software developers can embed ML into business tools and apps more efficiently.
- No complex integrations are required between data and the model and between model and business apps.
The relatively complex diagram presented in the section Integrating ML with Apps and Change Management changes into the following:
It looks simpler and makes the ML processes smooth and efficient.
How to implement Self-Service ML: Models as virtual database tables
Now that we know the solution to the main challenges, let’s implement it.
For that, we use a construct called AI Tables. It brings machine learning in the form of virtual tables into data platforms. Such an AI Table is created like any other database table and then exposed to applications, BI tools, and DB clients. We make predictions by simply querying the data.
AI Tables are part of the open-source MindsDB project on GitHub. They integrate with traditional SQL and NoSQL databases and with data streams like Kafka and Redis.
Using AI Tables
The concept of AI Tables enables us to perform ML processes within the database, so that all the steps of an ML process (that is, data preparation, training the model, and making predictions) take place through the database.
Training AI Tables
The user specifies a source table or view from which an AI Table learns automatically. To create an AI Table, use a single SQL command, shown in the following section.
An AI Table is a machine learning model whose features correspond to the columns of a source table. The AutoML engine automates the remaining modeling tasks. Nevertheless, experienced ML engineers can specify model parameters through a declarative syntax called JSON-AI.
Making Predictions
Once you create an AI Table, it is ready to use; it doesn’t require any further deployment. Making a prediction is simply running a standard SQL query against the AI Table as if the data we ask for were already there, although it is actually generated on the fly as we ask for it.
You can make predictions either one by one or in batches. AI Tables can handle many complex machine learning tasks, like multivariate time series, anomaly detection, and more.
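As a minimal sketch, using the placeholder names from the training command shown later in this article, a single prediction is just a SELECT against the AI Table, with the known feature values in the WHERE clause:
-- the values in the WHERE clause are the feature values we already know
SELECT column_to_be_predicted
FROM mindsdb.predictor
WHERE column_1 = 'known value 1'
AND column_2 = 'known value 2';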
How AI Tables work – an example
Let’s look at a real-world example. We’ll forecast stock levels for a retailer so it can generate better income by having the right products available at the right time.
One of the trickier tasks of running a retail business is having all the products in stock at the right time. When demand grows, supply must increase. Your data can do the heavy lifting here. All you need is to keep track of the following information:
- when the products were sold (the date_of_sale column),
- who sold the products (the shop column),
- which products were sold (the product_code column),
- how many products were sold (the amount column).
Let’s visualize the above data in a table:
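As a minimal sketch, assuming a PostgreSQL source database, a sales_data table with these columns could be defined as follows (the column types are assumptions):
CREATE TABLE sales_data (
    date_of_sale DATE,
    shop VARCHAR(100),
    product_code VARCHAR(50),
    amount INTEGER
);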
Based on this data, and using Machine Learning, we can predict how many items of a given product should be in stock on a given date.
Training AI Tables
To create AI Tables that utilize your data, you must first allow MindsDB to access your data. You can do that by connecting your database to MindsDB, which is quite straightforward. The detailed instructions are available in the MindsDB documentation.
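For example, connecting a PostgreSQL database could look like the sketch below; the connection parameters are placeholders, and the exact syntax for your database engine is described in the MindsDB documentation:
-- placeholder credentials; replace with your own connection details
CREATE DATABASE postgres_db
WITH ENGINE = 'postgres',
PARAMETERS = {
    "user": "your_user",
    "password": "your_password",
    "host": "your_host",
    "port": "5432",
    "database": "your_database"
};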
AI Tables are ML models, so you must first train them on historical data.
Below is a simple SQL command that trains an AI Table:
CREATE PREDICTOR predictor
FROM source_database
(SELECT column_1, column_2 FROM historical_table)
PREDICT column_to_be_predicted as column_alias;
Let’s analyze this query:
- We use the CREATE PREDICTOR statement available in MindsDB.
- We define the source database from where the historical data comes.
- We train the AI Table based on the table that contains historical data (historical_table). And the selected columns (column_1 and column_2) are the features used to make predictions.
- AutoML automates the remaining modeling tasks.
What happens when you execute the query?
- MindsDB determines the data type of every column, normalizes and encodes the data, selects an appropriate modeling approach, and builds and trains the ML model.
- It holds out a percentage of the data from training and uses it to test the model’s accuracy.
As a result, you can see the overall accuracy score, get a confidence value for every prediction, and estimate which columns matter most for good results.
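While the model trains (and after it finishes), you can check on it from the same SQL interface. A minimal sketch, assuming the mindsdb.predictors system table; the exact set of columns may vary between MindsDB versions:
-- 'predictor' is the placeholder name from the training command above
SELECT name, status, accuracy
FROM mindsdb.predictors
WHERE name = 'predictor';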
In databases, we often need to deal with tasks that involve multivariate time-series data with high cardinality. With traditional approaches, creating such ML models requires quite an effort: we need to group the data and order it by a given time, date, or timestamp field.
For example, we can predict the number of hammers sold by a hardware store. Here, the data is grouped by the shop and product values, and a forecast is made for each distinct combination of them. But it is also crucial to know when a specific number of a given product will be sold. That brings us to the problem of creating a time series model for each group.
It may sound like a lot of work, but MindsDB lets you create a single ML model trained on the multivariate time series data at once, using the GROUP BY statement. Let’s see how it is done with just a single SQL command.
Now, we’ll create a predictor for our sales data.
CREATE PREDICTOR stock_forecaster
FROM postgres_db
(SELECT shop, amount, date_of_sale
FROM sales_data)
PREDICT amount
ORDER BY date_of_sale
GROUP BY shop
WINDOW 10
HORIZON 7;
The stock_forecaster predictor uses the sales data to predict how many items a specific shop will sell in the future. The data is ordered by the date of sale and grouped by the shop, so we can predict the amount value for each shop value. The WINDOW clause specifies how many of the most recent rows in each group the model looks at when forecasting (here, 10), and the HORIZON clause specifies how many future rows to forecast (here, 7).
Let’s make some predictions using our stock_forecaster predictor.
Making Bulk Predictions and Building Forecasts in Analytical Tools
We can get bulk predictions for many records at once by joining our sales data table with the predictor, using the query below.
SELECT sf.shop, sf.amount AS predicted_amount
FROM postgres_db.sales_data AS sd
JOIN mindsdb.stock_forecaster AS sf
WHERE sd.shop = 'Easy Store'
AND sd.date_of_sale > LATEST
LIMIT 7;
The JOIN operation adds the predicted amount to the records, so you get predictions for many rows at once.
If you want to learn more about analyzing and visualizing predictions in BI tools, check out this article.
Let’s Be Practical
The traditional approach treats ML models as standalone apps that require maintaining ETL pipelines to a database and API integrations to business applications. Even though AutoML tools make the modeling part effortless and straightforward, skilled specialists are still needed to manage the complete ML workflow.
But databases are already the best tool for data preparation. Thus, it makes more sense to bring ML to data and not the other way round.
The construct of AI Tables from MindsDB enables self-service AutoML for data practitioners and streamlines machine learning workflows, because the AutoML tool resides within the database.
Let’s summarize the benefits:
- No need for ETL pipelines between AutoML and databases.
- No need for API integrations between AutoML and business apps.
- No need for model deployment.
- Predictions made using standard SQL queries.
- Automated modeling with possible customization for advanced users.
- Performing ML using existing analytical tools.
- AI Tables can help companies make data-driven decisions using existing tools and personnel.
Want to try it out yourself?
- Bookmark the MindsDB repository on GitHub.
- Engage with the MindsDB community on Slack or GitHub to ask questions and share ideas.
If this article was helpful, please give us a GitHub star here.