Accessing JSON data at warp speed in SingleStoreDB
Akmal Chaudhri
Posted on January 28, 2023
Abstract
The Winter 2022 Release of SingleStoreDB added performance improvements for Transactions and Analytics. In particular, for working with JSON and string data. This article will present several examples to demonstrate the speedup, and we'll see the performance improvements first-hand.
Introduction
JSON is a very popular data format today and, in a previous article, we saw how SingleStoreDB supported this data format. The latest innovation for JSON support within SingleStoreDB is Seekable JSON. This enables JSON data to be stored in a compressed format and, when specific rows need to be retrieved, only the retrieved data needs to be decompressed. Figure 1 shows the previous approach for accessing JSON data (left-hand side) and the new approach (right-hand side). The new approach makes JSON suitable for OLTP applications in addition to OLAP.
The benefits for developers working extensively with JSON are that there are no changes in the data format or the JSON Functions, as the enhancements are under the covers. What developers will notice, however, are the performance gains. For management, the improved performance means that SingleStoreDB becomes a more compelling business solution, removing the need to integrate multiple products and providing better Total Cost of Ownership (TCO).
Seekable JSON YouTube Video
On SingleStore's YouTube Channel, there is a great 5-minute video that demos Faster Seeks for JSON Data on Universal Storage. The demo results are impressive. How can we try the demo ourselves? Fortunately, all the details are provided in the documentation, and we can find the demo code in the JSON Performance section.
One way that we can test the code locally is to install SingleStoreDB in a Virtual Machine environment. Details of how to do this were provided in a previous article. Another possibility is to use the SingleStoreDB Dev Image, which contains all the required software.
Install SingleStoreDB
In this article, we'll install SingleStoreDB in a Virtual Machine environment. It takes just a few minutes. A previous article described the steps. For this article, we'll only need two tarball files:
Assuming a two-node cluster was correctly deployed and using the same variable names from the previous article, we can connect to our cluster from a MySQL CLI Client as follows:
mysql --local-infile -u root -h ubuntu -P 3306 --default-auth=mysql_native_password -p
Once connected to our cluster, we need to execute the JSON Performance demo code. The demo code produces over 8 million rows of data.
Results
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = OFF;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.46785009 |
+------------+
+------------+
| avg_time |
+------------+
| 0.49248448 |
+------------+
+------------+
| avg_time |
+------------+
| 0.45003322 |
+------------+
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = ON;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.00581854 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00237616 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00180840 |
+------------+
The results are summarised in Figure 2 using a Log Scale.
Taking the average, these results show we achieved over 140x performance improvement with Seekable JSON with this small cluster set-up and environment.
Bonus: Using Los Angeles Police Department Crime Data
Let's use real-world JSON data to test the Seekable JSON capability. We can obtain crime data provided by the Los Angeles Police Department. We'll focus on Crime Data from 2020 to the Present. The data can be downloaded in a CSV format and are available under a Creative Commons CC0 1.0 Universal license. From the website:
This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database.
The downloaded CSV file is approximately 163 MB in size at the time of writing this article and was updated on 25 January 2023.
Convert CSV to JSON
We need to convert the CSV file to JSON. The following Python code achieves this, creating correct JSON records using double quotes, nulls, and one JSON record per line:
import pandas as pd
import json
import ndjson
input_file = "rows.csv"
output_file = "rows.json"
df = pd.read_csv(input_file)
df.rename(columns = lambda x: x.replace(" ", "_"), inplace = True)
json_data = json.loads(df.to_json(orient = "records"))
with open(output_file, "w") as f:
ndjson.dump(json_data, f)
The following shows an example JSON record with the keys displayed and the values removed.
{
"DR_NO":
"Date_Rptd":
"DATE_OCC":
"TIME_OCC":
"AREA":
"AREA_NAME":
"Rpt_Dist_No":
"Part_1-2":
"Crm_Cd":
"Crm_Cd_Desc":
"Mocodes":
"Vict_Age":
"Vict_Sex":
"Vict_Descent":
"Premis_Cd":
"Premis_Desc":
"Weapon_Used_Cd":
"Weapon_Desc":
"Status":
"Status_Desc":
"Crm_Cd_1":
"Crm_Cd_2":
"Crm_Cd_3":
"Crm_Cd_4":
"LOCATION":
"Cross_Street":
"LAT":
"LON":
}
The JSON file was approximately 432 MB in size.
Results
The complete code is listed below in the Appendix. It follows the structure of the earlier example. In the code in the Appendix, when loading the data into the database, we would replace /path/to/
with the actual path to where the JSON file was located. The JSON data load creates nearly 640,000 rows.
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = OFF;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.17954817 |
+------------+
+------------+
| avg_time |
+------------+
| 0.19996782 |
+------------+
+------------+
| avg_time |
+------------+
| 0.17837315 |
+------------+
Executing the code:
CALL get_avg_time(100);
three times with:
SET GLOBAL use_seekable_json = ON;
produced the following results:
+------------+
| avg_time |
+------------+
| 0.00427090 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00249579 |
+------------+
+------------+
| avg_time |
+------------+
| 0.00249940 |
+------------+
The results are summarised in Figure 3 using a Log Scale.
Taking the average, these results show we achieved over 60x performance improvement with Seekable JSON with this small cluster set-up, small dataset, and environment.
Summary
Our simple performance tests have shown that Seekable JSON can provide significant speedup. With mixed workloads consisting of both OLTP and OLAP, SingleStoreDB can now support fast access to JSON data in multiple environments, making it the product of choice for many applications.
Appendix
SET GLOBAL use_seekable_json = OFF;
CREATE DATABASE crimes;
USE crimes;
CREATE TABLE crimes (
id INT UNSIGNED PRIMARY KEY NOT NULL AUTO_INCREMENT,
crime_doc JSON NOT NULL,
SORT KEY(id)
);
LOAD DATA LOCAL INFILE '/path/to/rows.json'
INTO TABLE crimes
FORMAT JSON (
crime_doc <- %
);
DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
_id INT;
_od json;
_oid INT;
m INT;
st DATETIME(6);
et DATETIME(6);
BEGIN
SELECT MAX(id) INTO m FROM crimes;
st = now(6);
FOR i in 1..n LOOP
_oid = CEILING(m*RAND());
SELECT id, crime_doc
INTO _id, _od
FROM crimes
WHERE id = _oid;
END LOOP;
et = now(6);
echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n AS avg_time;
END
//
DELIMITER ;
OPTIMIZE TABLE crimes FULL;
CALL get_avg_time(100);
SET GLOBAL use_seekable_json = ON;
CREATE TABLE crimes2 LIKE crimes;
INSERT INTO crimes2 SELECT * FROM crimes;
OPTIMIZE TABLE crimes2 FULL;
DELIMITER //
CREATE OR REPLACE PROCEDURE get_avg_time(n INT) AS
DECLARE
_id INT;
_od json;
_oid INT;
m INT;
st DATETIME(6);
et DATETIME(6);
BEGIN
SELECT MAX(id) INTO m FROM crimes2;
st = now(6);
FOR i in 1..n LOOP
_oid = CEILING(m*RAND());
SELECT id, crime_doc
INTO _id, _od
FROM crimes2
WHERE id = _oid;
END LOOP;
et = now(6);
echo SELECT (timestampdiff(MICROSECOND, st, et)/1000000.0)/n AS avg_time;
END
//
DELIMITER ;
CALL get_avg_time(100);
Posted on January 28, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.