Quick tip: Using Pentaho Data Integration (PDI) with SingleStoreDB
Akmal Chaudhri
Posted on November 2, 2022
Abstract
Continuing our use and evaluation of open-source and commercial tools with SingleStoreDB, we'll focus on Pentaho Data Integration (PDI) in this short article. We'll use the Developer Edition of PDI, and work through the PDI Tutorial to demonstrate ETL with SingleStoreDB as our database system.
The project files used in this article are available on GitHub.
Introduction
This article used a virtual machine running Ubuntu 22.04.2 with OpenJDK 11.
Create a SingleStoreDB Cloud account
A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use PDI Demo Group as our Workspace Group Name and pdi-demo as our Workspace Name. We'll make a note of our password and host name. Finally, we'll create a new database using the SQL Editor:
CREATE DATABASE IF NOT EXISTS pdi_demo;
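We can quickly confirm that the database was created by running the following in the SQL Editor:
SHOW DATABASES LIKE 'pdi_demo';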
Download and unpack PDI
We'll download the Developer Edition of PDI. The version used when writing this article was pdi-ce-10.2.0.0-222. The zip file was unpacked to a data-integration folder.
From the data-integration folder, we'll run:
./spoon.sh
Several warning and error messages may appear; we'll need to fix these before the software runs correctly.
1. no libwebkitgtk-1.0 detected
This can be fixed by editing the following file:
sudo vim /etc/apt/sources.list
and adding the following line to the end of the file, then saving it:
deb http://cz.archive.ubuntu.com/ubuntu bionic main universe
Next, we'll run the following:
sudo apt update
If there is an error similar to this:
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 3B4FE6ACC0B21F32
we'll need to run the following:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 3B4FE6ACC0B21F32
Then the following commands should be successful:
sudo apt update
sudo apt install libwebkitgtk-1.0-0
2. Failed to load module "canberra-gtk-module"
This can be fixed by installing the following:
sudo apt install libcanberra-gtk-module libcanberra-gtk3-module
3. Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
This can be fixed by editing the following file:
sudo vim /etc/java-11-openjdk/accessibility.properties
and commenting out the following line, then saving the file:
assistive_technologies=org.GNOME.Accessibility.AtkWrapper
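After the change, the line should look like this:
#assistive_technologies=org.GNOME.Accessibility.AtkWrapper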
Download MySQL connector
We need to obtain mysql-connector-java-5.1.49 from the MySQL archives. This is a PDI requirement. Unpacking the downloaded archive file, we'll find two jar files:
mysql-connector-java-5.1.49-bin.jar
mysql-connector-java-5.1.49.jar
We'll copy these two files to data-integration/lib.
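For example, assuming the connector was downloaded as a tar.gz archive into the directory that also contains data-integration, the following commands would unpack it and copy the two jar files (adjust the paths if your layout differs):
tar xzf mysql-connector-java-5.1.49.tar.gz
cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar data-integration/lib
cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar data-integration/lib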
Launch PDI
From the data-integration folder, we'll again run:
./spoon.sh
This time, the software should start correctly.
PDI Tutorial
The PDI tutorial works through an example that reads data from CSV files, creates data pipelines, performs data cleanup, and eventually writes the data to a database system. The project files can be downloaded from GitHub and opened in PDI to save time.
We'll need to adjust our project files with the full paths to the CSV files:
data-integration/samples/transformations/files/sales_data.csv
data-integration/samples/transformations/files/Zipssortedbycitystate.csv
and configure the database settings, as follows:
- Connection name: SingleStoreDB
- Connection type: MySQL
- Access: Native (JDBC)
- Host Name: <host>
- Database Name: pdi_demo
- Port Number: 3306
- Username: admin
- Password: <password>
We'll replace <host> and <password> with the values from our SingleStoreDB Cloud account.
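Since SingleStoreDB is wire-compatible with MySQL, these settings correspond to a standard MySQL JDBC URL of the following form (shown here for reference only; PDI constructs it from the settings above):
jdbc:mysql://<host>:3306/pdi_demo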
We'll make any other file path adjustments, as required.
When ready, we'll run the Transformation pipeline using Action > Run. The result should be as shown in Figure 1.
Similarly, we'll run the Job pipeline using Action > Run. The result should be as shown in Figure 2.
We can check our SALES_DATA table in SingleStoreDB. It is quite a wide table with 2823 rows.
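We can confirm the row count from the SQL Editor:
USE pdi_demo;
SELECT COUNT(*) FROM SALES_DATA;
The query should return 2823.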
Summary
In this short article, we have successfully configured and run PDI using SingleStoreDB Cloud as our backend database system. Our data pipelines have cleaned the CSV data and written the final output to SingleStoreDB, demonstrating a complete ETL example.