How to stream PostgreSQL CDC to Kafka and use Propel to get an instant API
Tyler Wells
Posted on April 8, 2024
Learn how to stream PostgreSQL Change Data Capture (CDC) to a Kafka topic and serve the data via an API with Propel. This guide provides step-by-step instructions, from setting up your PostgreSQL database and Kafka broker to deploying the Debezium PostgreSQL connector and creating a Propel Kafka Datapool.
Change Data Capture (CDC) is the process of tracking and capturing changes to your data. In PostgreSQL, we can implement CDC efficiently using the transaction log so that, rather than running batch jobs to gather data, we can capture data changes as they occur continuously. This has numerous benefits, including allowing your applications to see and react to changes in near real-time with lower impact on your source systems. In this article, we'll walk you step-by-step through how to stream CDC from PostgreSQL to a Propel Kafka Data Pool to get an instant data-serving API.
What are we trying to do?
We want to power large-scale analytics applications with data coming from PostgreSQL. As we all know, PostgreSQL, as an OLTP database, is notoriously slow in handling analytical queries that process and aggregate large amounts of data.
- First, we want to capture all changes to our PostgreSQL database table(s) and send them as JSON messages (see below) to a Kafka topic.
- Second, we’ll need to configure Kafka Connect and the Debezium PostgreSQL connector to stream the CDC to a Kafka topic.
- Lastly, create a Propel Kafka Data Pool that ingests data from the Kafka topic and exposes it via a low latency API.
We will cover how to deploy how to create and deploy all services in a Kubernetes cluster.
How PostgreSQL CDC with Debezium works
In this section, we will provide a quick overview of how PostgreSQL change data capture works with Debezium.
Imagine you have a simple table:
CREATE TABLE "simple" (
id uuid NOT NULL PRIMARY KEY,
foo character varying,
);
You update a row with the following SQL statement:
UPDATE public.simple SET foo = 'baz' WHERE ID = 'b18df779-0cae-43bc-9f4b-8efe571abd8a';
The change to the column foo
will result in an “update” to the transaction log, which Debezium then transforms into the JSON message below and drops into a Kafka topic.
CDC JSON with before
and after
:
{
"before": {
"id": "b18df779-0cae-43bc-9f4b-8efe571abd8a",
"foo": "bar",
},
"after": {
"id": "b18df779-0cae-43bc-9f4b-8efe571abd8a",
"foo": "baz"
},
"source": {
"version": "2.5.0.Final",
"connector": "postgresql",
"name": "pg",
"ts_ms": 1707250399964,
"snapshot": "false",
"db": "production",
"sequence": "[\"573123170952\",\"573123068760\"]",
"schema": "public",
"table": "simple",
"txId": 55986537,
"lsn": 573123068760,
"xmin": null
},
"op": "u",
"ts_ms": 1707250400533,
"transaction": null
}
The Debezium PostgreSQL connector generates a data change event for each row-level INSERT
, UPDATE
, and DELETE
operation. The op
key describes the operation that caused the connector to generate the event. In this example, u indicates that the operation updated a row. Valid values are:
-
c
= create -
u
= update -
d
= delete -
r
= read (applies to only snapshots) -
t
= truncate -
m
= message
What is the Propel Kafka Datapool?
The Kafka Data Pool lets you ingest real-time streaming data into Propel. It provides an instant low-latency API on top of Kafka to power real-time dashboards, streaming analytics, and workflows.
Consider using Propel on top of Kafka when:
- You need an API on top of a Kafka topic.
- You need to power real-time analytics applications with streaming data from Kafka.
- You need to ingest Kafka messages into ClickHouse.
- You need to ingest from self-hosted Kafka, Confluent Cloud, AWS MSK, or Redpanda into ClickHouse.
- You need to transform or enrich your streaming data.
- You need to power real-time personalization and recommendations for use cases.
Once we have the PostgreSQL, Debezium, and Kafka set up, we’ll ingest the data into a Propel Kafka Data Pool to expose it via an API.
Setting up Minikube, PostgreSQL, Kafka, Kafka Connect, and Debezium
We will walk through setting this up in a Kubernetes cluster on a Mac. This can be done in other environments, but Kubernetes is a common platform for hosting and running data streaming services in the cloud. For this example, we will use Minikube to deploy our Kubernetes cluster, Redpanda to deploy our Kafka cluster, and Helm to manage Kubernetes applications like PostgreSQL. I’ll provide the Mac-specific commands but links to the installed services if you are running a different OS.
Setting up Minikube
Step 1: Install Docker Desktop
brew install --cask docker
Step 2: Install Kubectl
brew install kubectl
Step 3: Install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-darwin-amd64
sudo install minikube-darwin-amd64 /usr/local/bin/minikube
Step 4: Start minikube
minikube start --nodes 8 #docker-desktop will need to be running
Step 5: Interact with your cluster
minikube kubectl -- get pods -A
Step 6: Install helm
brew install helm
Setting up PostgreSQL
Step 1: Install the PostgreSQL Helm Chart with extended configuration
Extend the configuration to set the wal_level
cat > pg-values.yaml <<'EOF'
primary:
extendedConfiguration: |-
wal_level = logical
shared_buffers = 256MB
EOF
helm install blog oci://registry-1.docker.io/bitnamicharts/postgresql -f pg-values.yaml
Step 2: Connect to PostgreSQL using the PostgreSQL CLI Create the database
export POSTGRES_PASSWORD=$(kubectl get secret --namespace default blog-postgresql -o jsonpath="{.data.postgres-password}" | base64 -d)
kubectl run blog-postgresql-client --rm --tty -i --restart='Never' --namespace default --image docker.io/bitnami/postgresql:16.2.0-debian-12-r6 --env="PGPASSWORD=$POSTGRES_PASSWORD" \
--command -- psql --host blog-postgresql -U postgres -d postgres -p 5432
Step 3: Create the table
# after connecting to the PostgreSQL instance, create the table
CREATE TABLE simple (
id uuid NOT NULL PRIMARY KEY,
foo character varying);
ALTER TABLE public.simple REPLICA IDENTITY FULL;
Step 4: Create a PostgreSQL user
To create a PostgreSQL user with the necessary privileges for Debezium to stream changes from the PostgreSQL source tables, run this statement with superuser privileges:
CREATE USER blog WITH REPLICATION PASSWORD 'your_password' LOGIN;
Setting up Kafka (Redpanda)
Step 1: Add the helm repo and install dependencies
helm repo add jetstack https://charts.jetstack.io
helm repo add redpanda https://charts.redpanda.com
helm repo update
helm install cert-manager jetstack/cert-manager --set installCRDs=true --namespace cert-manager --create-namespace
Step 2: Configure and install Redpanda
helm install redpanda redpanda/redpanda \
--namespace redpanda \
--create-namespace \
--set statefulset.initContainers.setDataDirOwnership.enabled=true \
--set image.tag=v23.3.4-arm64 \
--set tls.enabled=false \
--set statefulset.replicas=1 \
--set auth.sasl.enabled=true \
--set "auth.sasl.users[0].name=superuser" \
--set "auth.sasl.users[0].password=changethispassword" \
--set external.domain=customredpandadomain.local
Step 3: Install the Redpanda client rpk
brew install redpanda-data/tap/redpanda
Step 4: Create an alias to simplify the rpk
commands
alias local-rpk="kubectl --namespace redpanda exec -i -t redpanda-0 -c redpanda -- rpk -X user=superuser -X pass=changethispassword -X sasl.mechanism=SCRAM-SHA-512"
Step 5: Describe the Redpanda Cluster
local-rpk cluster info
CLUSTER
=======
redpanda.69087c28-e59b-4b25-bee2-bd199e6a9ab2
BROKERS
=======
ID HOST PORT
0* redpanda-0.redpanda.redpanda.svc.cluster.local. 9093
TOPICS
======
NAME PARTITIONS REPLICAS
_schemas 1 1
Step 6: Create a user
local-rpk acl user create blog -p changethispassword --mechanism scram-sha-256
Step 7: Create an ACL for the user blog
local-rpk acl create --allow-principal User:blog \
--operation all --topic '*' --group '*'
Setting up Kafka Connect
Step 1: Create a namespace
kubectl create namespace kafka-connect
Step 2: Create the Service
Below is the manifest you’ll need to apply to set up the Kafka Connect service with your Redpanda broker (see above). Copy the manifest below and provide the needed values to configure the service for your environment. In the example below, we are connecting with SASL_PLAINTEXT
and SCRAM-SHA-256
, ensure that CONNECT_SASL_JAAS_CONFIG
is using org.apache.kafka.common.security.scram.ScramLoginModule
and that you have set the username and password for both the CONNECT_
and PRODUCER_
keys.
apiVersion: apps/v1
kind: Deployment
metadata:
name: kafka-connect
labels:
app: kafka-connect
spec:
replicas: 1
selector:
matchLabels:
app: kafka-connect
template:
metadata:
labels:
app: kafka-connect
spec:
containers:
- name: kafka-connect
image: debezium/connect:2.5.2.Final # Use the appropriate Debezium version
env:
- name: LOG_LEVEL
value: INFO
- name: BOOTSTRAP_SERVERS
value: "redpanda-0.redpanda.redpanda.svc.cluster.local.:9093" # Replace with your Kafka broker address
- name: GROUP_ID
value: "blog-connect-cluster" # Replace with your group ID
- name: CONFIG_STORAGE_TOPIC
value: "connect-configs"
- name: OFFSET_STORAGE_TOPIC
value: "connect-offsets"
- name: STATUS_STORAGE_TOPIC
value: "connect-status"
- name: CONNECT_SECURITY_PROTOCOL # additional values https://kafka.apache.org/24/javadoc/index.html?org/apache/kafka/common/security/auth/SecurityProtocol.html
value: "SASL_PLAINTEXT"
- name: CONNECT_SASL_MECHANISM # additional values https://docs.confluent.io/platform/current/kafka/authentication_sasl/auth-sasl-overview.html
value: "SCRAM-SHA-256"
- name: CONNECT_SASL_JAAS_CONFIG # additional values https://docs.confluent.io/platform/current/kafka/authentication_sasl/index.html
value: "org.apache.kafka.common.security.scram.ScramLoginModule required username='blog' password='changethispassword';"
- name: CONNECT_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM
value: "" # Set to an empty string to disable hostname verification
- name: PRODUCER_SECURITY_PROTOCOL
value: "SASL_PLAINTEXT"
- name: PRODUCER_SASL_MECHANISM
value: "SCRAM-SHA-256"
- name: PRODUCER_SASL_JAAS_CONFIG
value: "org.apache.kafka.common.security.scram.ScramLoginModule required username='blog' password='changethispassword';"
- name: PRODUCER_SSL_ENDPOINT_IDENTIFICATION_ALGORITHM
value: "" # Set to an empty string to disable hostname verification
ports:
- containerPort: 8083
Use kubectl
to install the manifest and start the Kafka Connect pod in your cluster.
kubectl apply -n kafka-connect -f kafka-connect-pod.yaml
After you apply the manifest, you should get the following output.
deployment.apps/kafka-connect configured
Check the pod log files to ensure the service can log in to the broker. You should see the following in the logs if everything has been configured correctly.
2024-03-03 23:40:40,484 INFO || Successfully logged in. [org.apache.kafka.common.security.authenticator.AbstractLogin]
2024-03-03 23:40:40,504 INFO || Kafka version: 3.6.1 [org.apache.kafka.common.utils.AppInfoParser]
2024-03-03 23:40:40,504 INFO || Kafka commitId: 5e3c2b738d253ff5 [org.apache.kafka.common.utils.AppInfoParser]
2024-03-03 23:40:40,504 INFO || Kafka startTimeMs: 1709509240504 [org.apache.kafka.common.utils.AppInfoParser]
2024-03-03 23:40:40,508 INFO || Kafka Connect worker initialization took 5851ms [org.apache.kafka.connect.cli.AbstractConnectCli]
2024-03-03 23:40:40,508 INFO || Kafka Connect starting [org.apache.kafka.connect.runtime.Connect]
2024-03-03 23:40:40,510 INFO || Initializing REST resources [org.apache.kafka.connect.runtime.rest.RestServer]
2024-03-03 23:40:40,510 INFO || [Worker clientId=connect-1, groupId=propel-connect-cluster] Herder starting [org.apache.kafka.connect.runtime.distributed.DistributedHerder]
2024-03-03 23:40:40,510 INFO || Worker starting [org.apache.kafka.connect.runtime.Worker]
Check that the connect-offsets
topic has been created on the Redpanda broker.
local-rpk topic list
NAME PARTITIONS REPLICAS
_schemas 1 1
connect-configs 1 1
connect-offsets 25 1
connect-status 5 1
Setting up the Debezium PostgreSQL Connector
We must create a topic to store the CDC messages before deploying the Debezium PostgreSQL connector to the Kafka Connect service. If you take a look at the config below we’ve set the property topic.prefix
to “blog” andpublication.autocreate.mode
to “filtered” and set the table.include.list
to “public.simple”. This configures the connector to write the CDC messages to a topic named blog.public.simple
, use the command below to create the topic, then deploy the connector using the cURL command. You’ll need to provide your environment-specific details for connecting to your PostgreSQL server and the username/password for the Redpanda service you used above.
Step 1: Create a topic in Redpanda
local-rpk topic create blog.public.simple
Install the Debezium PostgreSQL connector using cURL. You’ll need to fetch the password for the PostgreSQL user and fill it in below. We will also set up port forwarding from the Kafka Connect service to our local host.
Step 2: Get the PostgreSQL password
export POSTGRES_PASSWORD=$(kubectl get secret --namespace default blog-postgresql -o jsonpath="{.data.postgres-password}" | base64 -d)
echo $POSTGRES_PASSWORD
Step 3: Forward the Kafka-Connect service port (8083)
# Run this in a separate terminal to get the Pod name
kubectl get pod -n kafka-connect -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kafka-connect-5498764f4f-qw5qg 1/1 Running 1 (34m ago) 35m 10.244.0.253 minikube <none> <none>
# Run this command to forward the traffic to local host
kubectl port-forward -n kafka-connect pod/kafka-connect-5498764f4f-qw5qg 8083:8083
Forwarding from 127.0.0.1:8083 -> 8083
Step 4: Deploy the Debezium Postgres Connector with cURL
curl -X POST http://127.0.0.1:8083/connectors -H "Content-Type: application/json" --data '{
"name": "debezium-postgres-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"after.state.only": "false",
"tasks.max": "1",
"plugin.name": "pgoutput",
"database.hostname": "blog-postgresql.default.svc.cluster.local",
"database.port": "5432",
"database.user": "postgres",
"database.password": "<replace>",
"database.dbname": "postgres",
"topic.prefix": "blog",
"publication.autocreate.mode": "filtered",
"table.include.list": "public.simple",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"producer.override.sasl.mechanism": "SCRAM-SHA-256",
"producer.override.security.protocol": "SASL_PLAINTEXT",
"producer.override.sasl.jaas.config": "org.apache.kafka.common.security.scram.ScramLoginModule required username='blog' password='changethispassword';",
"producer.override.ssl.endpoint.identification.algorithm": "",
"producer.override.tls.insecure.skip.verify": "true"
}
}'
Insert data and validate the setup
Let’s validate the CDC messages are landing in the topic we created; we’ll insert some data into our PostgreSQL table and then consume a message from the topic.
Step 1: Connect to the database
kubectl run blog-postgresql-client --rm --tty -i --restart='Never' --namespace default --image docker.io/bitnami/postgresql:16.2.0-debian-12-r6 --env="PGPASSWORD=$POSTGRES_PASSWORD" \
--command -- psql --host blog-postgresql -U postgres -d postgres -p 5432
Step 2: Insert data into the table
INSERT INTO public.simple VALUES (
'9cb52b2a-8ef2-4987-8856-c79a1b2c2f71',
'foo'
);
INSERT INTO public.simple values(
'9cb52b2a-8ef2-4987-8856-c79a1b2c2f73',
'bar'
);
Step 3: Consume the CDC message from the topic
local-rpk topic consume blog.public.simple --num 1
{
"topic": "blog.public.simple",
"key": "{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f71\"}",
"value": "{\"before\":null,\"after\":{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f71\",\"foo\":\"foo\"},\"source\":{\"version\":\"2.5.2.Final\",\"connector\":\"postgresql\",\"name\":\"blog\",\"ts_ms\":1710755794795,\"snapshot\":\"false\",\"db\":\"postgres\",\"sequence\":\"[null,\\\"22407328\\\"]\",\"schema\":\"public\",\"table\":\"simple\",\"txId\":747,\"lsn\":22407328,\"xmin\":null},\"op\":\"c\",\"ts_ms\":1710755795117,\"transaction\":null}",
"timestamp": 1710755795494,
"partition": 0,
"offset": 0
}
We’ve validated that the Debezium CDC messages with before
-and-after
values are landing in the topic. Don’t be concerned if before
is not filled out in this example; if you take a look, the “op” is c
, which means a new row was created in the table.
Expose Redpanda to the internet
Deploying all of the services locally using minikube
is a great way to do local development and get hands-on experience without paying a cloud provider. However, for us to use Propel and create a Kafka Data Pool, Propel has to connect to our Redpanda topic. There are a number of ways that this can be accomplished. For the sake of simplicity, we are going to install a reverse proxy and create a TCP Tunnel to expose the Redpanda port to the internet. The reverse proxy we are going to use is called LocalXpose.
💡 To use the LocalXpose TCP Tunnel with a dedicated port, I had to pay $6. Other free services also do this, but you can’t control which port the tunnel is on. Redpanda has a limitation when creating the external NodePort that restricts it to ports 30000-32767
.
Step 1: Install LocalXpose and kcat
brew install --cask localxpose
brew install kcat
Step 2: Start the TCP Tunnel
loclx tunnel tcp --port 31092 -t 127.0.0.1:9093
You can now see the running tunnels. Copy the domain name in From
, e.g., us.localx.io
. This will be used in step 3 and later when creating the Kafka Data Pool.
Step 3: Upgrade the Redpanda install and expose an external NodePort
Save the yaml below to external-dns.yaml
, and replace the <domain>
with the value from step 2.
external:
enabled: true
type: NodePort
addresses:
- <domain>
Upgrade Redpanda.
helm upgrade --install redpanda redpanda/redpanda --namespace redpanda --create-namespace --values external-dns.yaml --reuse-values --wait
Step 4: Forward the traffic to minikube
kubectl port-forward -n redpanda svc/redpanda-external 9093:9094
Step 5: Validate you can connect externally using kcat
kcat -L -b us.loclx.io:31092 -X sasl.username=blog -X sasl.password=changethispassword -X sasl.mechanism=SCRAM-SHA-256 -X security.protocol=SASL_PLAINTEXT
You can now tunnel traffic to Redpanda running in minikube
, allowing Propel’s backend to connect and consume the CDC messages from the topic.
Create a Propel Kafka Data Pool
Login to the Propel console or create a new account. Then choose your environment and create a new Kafka Datapool, for this example I’m using the development
environment.
Next, choose Add new credentials to configure the connection to our Redpanda cluster. (use the domain and port from the TCP Tunnel that you created above for the Bootstrap Servers)
Click Create and test Credentials
Ensure the topics are visible
Next, choose the topic blog.public.simple
to start pulling in the CDC data.
Click Next to choose the primary timestamp.
Give the Data Pool a unique name.
Click Next to start syncing the data from the topic into Propel.
Once you see two records in the Datapool, you can click Preview Data.
Query the data
To query the data we have in Propel, you’ll need to create an Application. Follow the guide in the docs to create an application. Once you’ve created an Application, you can get a token, as seen in the image below. We’ll use that token in the subsequent API calls below.
Use the Data Grid API to query the data in the newly created Kafka Data Pool. Make sure to copy the token you created above and replace <TOKEN>
with the value. You will also need to replace <DATAPOOL_ID>
with the ID of the newly created Kafka Data Pool. Copy the curl
command below and execute it from the command line.
curl 'https://api.us-east-2.propeldata.com/graphql' \
-H 'Authorization: Bearer <TOKEN>' \
-H 'Content-Type: application/json; charset=utf-8' \
-d '{
"query": "query DataGridQuery($input: DataGridInput!) {
dataGrid(input: $input) {
headers
rows
pageInfo {
hasNextPage
hasPreviousPage
endCursor
startCursor
}
}
}",
"variables": {
"input": {
"dataPool": {
"id": "<DATAPOOL_ID>"
},
"columns": [
"_key",
"_timestamp",
"_propel_payload"
],
"orderByColumn": 1,
"sort": "DESC",
"filters": [],
"first": 5
}
}
}'
This will return a Data Grid with headers and rows.
{
"dataGrid": {
"headers": [
"_key",
"_timestamp",
"_propel_payload"
],
"rows": [
[
"{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73\"}",
"2024-03-18T09:56:37Z",
"{\"before\":null,\"after\":{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73\",\"foo\":\"bar\"},\"source\":{\"version\":\"2.5.2.Final\",\"connector\":\"postgresql\",\"name\":\"blog\",\"ts_ms\":1710755796308,\"snapshot\":\"false\",\"db\":\"postgres\",\"sequence\":\"[\\\"22407624\\\",\\\"22407680\\\"]\",\"schema\":\"public\",\"table\":\"simple\",\"txId\":748,\"lsn\":22407680,\"xmin\":null},\"op\":\"c\",\"ts_ms\":1710755796782,\"transaction\":null}"
],
[
"{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f71\"}",
"2024-03-18T09:56:35Z",
"{\"before\":null,\"after\":{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f71\",\"foo\":\"foo\"},\"source\":{\"version\":\"2.5.2.Final\",\"connector\":\"postgresql\",\"name\":\"blog\",\"ts_ms\":1710755794795,\"snapshot\":\"false\",\"db\":\"postgres\",\"sequence\":\"[null,\\\"22407328\\\"]\",\"schema\":\"public\",\"table\":\"simple\",\"txId\":747,\"lsn\":22407328,\"xmin\":null},\"op\":\"c\",\"ts_ms\":1710755795117,\"transaction\":null}"
]
],
"pageInfo": {
"hasNextPage": false,
"hasPreviousPage": false,
"endCursor": "eyJvZmZzZXQiOjJ9",
"startCursor": "eyJvZmZzZXQiOjB9"
}
}
}
Update the source PostgreSQL database
If we update the PostgreSQL database with a new value for one of the rows, we’ll see that change reflected in the Propel Data Pool. Let’s execute the SQL statement below and re-run the GraphQL query we ran above.
# Connec to the database
kubectl run blog-postgresql-client --rm --tty -i --restart='Never' --namespace default --image docker.io/bitnami/postgresql:16.2.0-debian-12-r6 --env="PGPASSWORD=$POSTGRES_PASSWORD" \
--command -- psql --host blog-postgresql -U postgres -d postgres -p 5432
UPDATE public.simple SET foo = 'baz' WHERE id = '9cb52b2a-8ef2-4987-8856-c79a1b2c2f73';
When we re-run the GraphQL query, we get the following response.
{
"dataGrid": {
"headers": [
"_key",
"_timestamp",
"_propel_payload"
],
"rows": [
[
"{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73\"}",
"2024-03-18T09:56:37Z",
"{\"before\":null,\"after\":{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73\",\"foo\":\"bar\"},\"source\":{\"version\":\"2.5.2.Final\",\"connector\":\"postgresql\",\"name\":\"blog\",\"ts_ms\":1710755796308,\"snapshot\":\"false\",\"db\":\"postgres\",\"sequence\":\"[\\\"22407624\\\",\\\"22407680\\\"]\",\"schema\":\"public\",\"table\":\"simple\",\"txId\":748,\"lsn\":22407680,\"xmin\":null},\"op\":\"c\",\"ts_ms\":1710755796782,\"transaction\":null}"
],
[
"{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73\"}",
"2024-03-18T17:38:53Z",
"{\"before\":{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73\",\"foo\":\"bar\"},\"after\":{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73\",\"foo\":\"baz\"},\"source\":{\"version\":\"2.5.2.Final\",\"connector\":\"postgresql\",\"name\":\"blog\",\"ts_ms\":1710783532893,\"snapshot\":\"false\",\"db\":\"postgres\",\"sequence\":\"[\\\"22407880\\\",\\\"22408168\\\"]\",\"schema\":\"public\",\"table\":\"simple\",\"txId\":749,\"lsn\":22408168,\"xmin\":null},\"op\":\"u\",\"ts_ms\":1710783533138,\"transaction\":null}"
],
[
"{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f71\"}",
"2024-03-18T09:56:35Z",
"{\"before\":null,\"after\":{\"id\":\"9cb52b2a-8ef2-4987-8856-c79a1b2c2f71\",\"foo\":\"foo\"},\"source\":{\"version\":\"2.5.2.Final\",\"connector\":\"postgresql\",\"name\":\"blog\",\"ts_ms\":1710755794795,\"snapshot\":\"false\",\"db\":\"postgres\",\"sequence\":\"[null,\\\"22407328\\\"]\",\"schema\":\"public\",\"table\":\"simple\",\"txId\":747,\"lsn\":22407328,\"xmin\":null},\"op\":\"c\",\"ts_ms\":1710755795117,\"transaction\":null}"
]
],
"pageInfo": {
"hasNextPage": false,
"hasPreviousPage": false,
"endCursor": "eyJvZmZzZXQiOjJ9",
"startCursor": "eyJvZmZzZXQiOjB9"
}
}
}
This shows that we have three rows in our Data Pool; it reflects the raw CDC messages that have been sent to the Kafka Data Pool. In order to use the data for analytics, we need to transform it to correctly represent the current state in the source database.
Create a real-time transformation to unpack the CDC JSON message
Now that we’re getting the stream of CDC messages to Propel and landing them in a Kafka Data Pool, we need to write some SQL to transform the messages based on the “op” type. The columns we are interested in using are id
, _timestamp
, and foo
. Our SELECT
statement below will query the source table blog-post
and extract id
and foo
from _propel_payload.after
. This will handle both the create and update operations. In the case of an update operation a duplicate row will be inserted and then de-duped by Propel. This will ensure that when we set id
as the uniqueId
column in the MaterializedView only one version of the id
exists with the latest value from the source table.
SELECT
"blog-post"."_propel_payload.after.id" AS id,
"_timestamp" AS timestamp,
"blog-post"."_propel_payload.after.foo" AS foo
FROM
"blog-post" # the name of the Kafka Data Pool from above
WHERE
"blog-post"."_propel_payload.op" = 'c' OR
"blog-post"."_propel_payload.op" = 'u';
We will use the Propel API to create a Materialized View with the SQL above and set the uniqueId
to the column id
in the resulting destination table. Be sure to replace the <TOKEN>
with the Application token we created earlier before running the curl
command. Be sure to save the response from the curl
command, as we’ll use some of the returned ids in later API calls.
curl --location "https://api.us-east-2.propeldata.com/graphql" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <TOKEN>" \
-d '{
"query": "mutation MaterializedView($input: CreateMaterializedViewInput!) { createMaterializedView(input: $input) { materializedView { id sql destination { id uniqueName tableSettings { partitionBy primaryKey orderBy } } source { id uniqueName tableSettings { partitionBy primaryKey orderBy } } } } }",
"variables": {
"input": {
"uniqueName": "Blog Post MV",
"description": "Blog Post unpacked",
"sql": "SELECT \"blog-post\".\"_propel_payload.after.id\" as id, \"_timestamp\" as timestamp, \"blog-post\".\"_propel_payload.after.foo\" as foo FROM \"blog-post\" WHERE \"blog-post\".\"_propel_payload.op\"='\''c'\'' OR \"blog-post\".\"_propel_payload.op\" ='\''u'\''",
"destination": {
"newDataPool": {
"uniqueId": {
"columnName": "id"
}
}
},
"backfillOptions": {
"backfill": true
}
}
}
}'
You should get a response like the one below.
{
"data": {
"createMaterializedView": {
"materializedView": {
"id": "MAT01HSY2DHN9HWKD1QAG0B9C115J",
"sql": "SELECT \"blog-post\".\"_propel_payload.after.id\" as id, \"_timestamp\" as timestamp, \"blog-post\".\"_propel_payload.after.foo\" as foo FROM \"blog-post\" WHERE \"blog-post\".\"_propel_payload.op\"='c' OR \"blog-post\".\"_propel_payload.op\" ='u'",
"destination": {
"id": "DPO01HSY2DHTZZDRZP2BDZAMX6PKP",
"uniqueName": "Data Pool for Materialized View MAT01HSY2DHN9HWKD1QAG0B9C115J",
"tableSettings": {
"partitionBy": null,
"primaryKey": null,
"orderBy": [
"\"id\""
]
}
},
"source": {
"id": "DPO01HS98EGS0W40ARYHBR2SDVR5X",
"uniqueName": "blog-post",
"tableSettings": {
"partitionBy": null,
"primaryKey": null,
"orderBy": null
}
}
}
}
}
}
Query the Materialized View Data Pool
Now that we’ve created the MaterialzedView let’s query the Data Pool that was created along with it. Replace <token>
with the token you created earlier, and use the response from the above curl
command to get the destination data pool id. This is available at data.createMaterializedView.materializedView.destination.id
in the JSON. Replace <datapool_id>
with that value and run the curl
command from the command line.
curl 'https://api.us-east-2.propeldata.com/graphql' \
-H 'Authorization: Bearer <TOKEN>' \
-H 'Content-Type: application/json; charset=utf-8' \
-d '{
"query": "query DataGridQuery($input: DataGridInput!) { dataGrid(input: $input) { headers rows pageInfo { hasNextPage hasPreviousPage endCursor startCursor } } }",
"variables": {
"input": {
"dataPool": {
"id": "<DATAPOOL_ID>"
},
"columns": [
"id",
"foo"
],
"orderByColumn": 1,
"sort": "DESC",
"filters": [],
"first": 5
}
}
}'
The response you get should only have two rows and the updated value of baz
for 9cb52b2a-8ef2-4987-8856-c79a1b2c2f73
.
{
"dataGrid": {
"headers": [
"id",
"foo"
],
"rows": [
[
"9cb52b2a-8ef2-4987-8856-c79a1b2c2f73",
"baz"
],
[
"9cb52b2a-8ef2-4987-8856-c79a1b2c2f71",
"foo"
]
],
"pageInfo": {
"hasNextPage": false,
"hasPreviousPage": false,
"endCursor": "eyJvZmZzZXQiOjF9",
"startCursor": "eyJvZmZzZXQiOjB9"
}
}
}
Visualize the results
Let’s write a small node.js
application that combines all of the Propel concepts and renders a table with the results from the MaterializedView. Instead of providing a token, we’ll use the Propel API to mint a token using the OAuth2 client credentials from the Propel Application we created. These are found in the Propel console within the Applications section, as seen below.
Step 1: OAuth2 client credentials
Step 2: Set environment variables
Replace the tokens with the values from the console, then paste and execute the commands.
export CLIENT_ID=<CLIENT_ID> \
export CLIENT_SECRET=<CLIENT_SECRET> \
export TOKEN_HOST=https://auth.us-east-2.propeldata.com \
export TOKEN_PATH=/oauth2/token
Step 3: Install the dependencies
npm install simple-oauth2 cli-table
Step 4: Copy the code
Copy the code below and replace <DATAPOOL_ID>
with the data pool ID for the MaterializedView from above.
const { ClientCredentials } = require("simple-oauth2");
const Table = require("cli-table");
async function getToken() {
const config = {
client: {
id: process.env.CLIENT_ID,
secret: process.env.CLIENT_SECRET,
},
auth: {
tokenHost: process.env.TOKEN_HOST,
tokenPath: process.env.TOKEN_PATH,
},
};
const client = new ClientCredentials(config);
try {
const accessToken = await client.getToken();
return accessToken.token.access_token;
} catch (error) {
console.error("Access Token Error", error.message);
return null;
}
}
async function fetchDataGrid(accessToken) {
const url = "https://api.us-east-2.propeldata.com/graphql";
const query = `
query DataGridQuery($input: DataGridInput!) {
dataGrid(input: $input) {
headers
rows
}
}
`;
const variables = {
input: {
dataPool: { id: "<DATAPOOL_ID>" },
columns: ["id", "foo"],
sort: "ASC",
first: 5,
last: 5,
},
};
try {
const response = await fetch(url, {
method: "POST",
headers: {
Authorization: `Bearer ${accessToken}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
query,
variables,
}),
});
const data = await response.json();
return data.data.dataGrid;
} catch (error) {
console.error("Error fetching data grid:", error.message);
return null;
}
}
async function main() {
const accessToken = await getToken();
if (!accessToken) {
console.error("Failed to retrieve access token");
return;
}
const dataGrid = await fetchDataGrid(accessToken);
if (!dataGrid) {
console.error("Failed to fetch data grid");
return;
}
const { headers, rows } = dataGrid;
const table = new Table({
head: headers,
});
rows.forEach((row) => {
table.push(row);
});
console.log(table.toString());
}
main();
Step 5: Run the script
node script.js <aws:prod-propel>
┌──────────────────────────────────────┬─────┐
│ id │ foo │
├──────────────────────────────────────┼─────┤
│ 9cb52b2a-8ef2-4987-8856-c79a1b2c2f71 │ foo │
├──────────────────────────────────────┼─────┤
│ 9cb52b2a-8ef2-4987-8856-c79a1b2c2f73 │ baz │
└──────────────────────────────────────┴─────┘
Wrap up
In wrapping up this post, it's important to recap the key skills and knowledge you've acquired. You have learned how to effectively stream PostgreSQL Change Data Capture (CDC) to a Propel Kafka DataPool, a significant step in managing real-time data.
Further, you've gained the ability to query this data, which opens up a wealth of opportunities for analysis and insights. This guide also walked you through the process of updating the source PostgreSQL database, ensuring that you can maintain the accuracy and relevance of your data.
One of the highlights of this guide was teaching you how to create a real-time transformation to unpack the CDC JSON message. This is a fundamental skill that elevates your data management capabilities, enabling you to manipulate and interpret data for various applications.
Finally, you've learned how to visualize the results. Visualization plays a crucial role in data analysis, making complex data more understandable, revealing trends and outliers, and contributing to more effective communication.
We hope you found the content both informative and practical and that you enjoyed the learning process. Propel is dedicated to empowering you with the tools and knowledge to transform your data into valuable insights. We look forward to supporting you in your future endeavors with Propel. Keep exploring and transforming your data!
Posted on April 8, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.