How to Use Kafka Connect to Connect to Data Sources on Heroku

Introduction

With ever-increasing demands from other business units, IT departments have to be constantly looking for service improvements and cost-saving opportunities. This article showcases several concrete use-cases for companies that are investigating or already using Kafka, in particular, Kafka Connect.

Kafka Connect is an enterprise-grade solution for integrating a plethora of applications, ranging from traditional databases to business applications like Salesforce and SAP. Possible integration scenarios range from continuously streaming events and data between applications to large-scale, configurable batch jobs that can be used to replace manual data transfers.

Why Kafka and Heroku

Apache Kafka started as an in-house project at LinkedIn intended to replace the various queuing and message-passing systems that had evolved over the years. The goal was to create a horizontally scalable system, able to reliably store and handle many millions of messages per second. Today, the Kafka community provides a whole ecosystem of tools and features ranging from simple log-message passing to performing basic on-the-fly ETL jobs to complex data-pipeline processing. Its versatility and robustness in addition to its scalability makes Kafka a great choice for the enterprise.

The power of Kafka comes at a price: While it's easy to use Kafka from a client perspective, the setup and operation of Kafka is a difficult task. Setting up a reliable Kafka cluster is a challenge that requires experience. If your company does not have the experience, it can take weeks, even months to set up and fine-tune a production-ready cluster. That's where Kafka on Heroku shines: Setting up a private Kafka cluster literally takes minutes, which makes it feasible for IT departments to focus on the services they want to provide instead of operating a Kafka cluster.

Adding Kafka Connect is the logical next step, once you are using Kafka. But Kafka Connect on Heroku is a huge time-saver, even if you are not planning to use Kafka directly from your application. With little operational overhead, Kafka Connect provides an integration point for any number of applications. Instead of running and maintaining different solutions for recurring data transfer, on-demand data transfer, and continuous updates between systems, Kafka Connect provides a unified solution to those problems.

In the next section you will learn about some typical use cases, and also find out which applications Kafka Connect is able to integrate.

Typical Business Use Cases for Kafka Connect

IT departments constantly struggle with requirements from business units that must connect and integrate all kinds of custom and off-the-shelf applications. These integrations often use less-reliable mechanisms, like recurring batch jobs, instead of continuously "streaming" updates across systems. Kafka Connect provides a scalable, reliable solution that can often replace slow batch jobs by instead "streaming" updates. Common scenarios include:

Your sales department regularly needs to update an internal SQL database with data from Salesforce. Use Kafka Connect to continuously stream the changes from Salesforce to your on-premise database.
Your compliance department needs an audit trail for updates to certain databases or applications. Use Kafka Connect to stream the relevant changes to an on-premise file server, to S3 for offsite storage, or even to Elasticsearch for quick retrieval.
Replace your unreliable, slow-running ETL batch jobs with a robust solution. Use Kafka Connect to
- update data across multiple applications while making them faster and more reliable
- continuously stream changes from structured files (CSV, JSON, logs, etc.) into databases and applications
- reliably and continuously stream updates between heterogeneous applications and databases

IT departments could even go as far as creating new self-service opportunities, where manual or labor-intensive requests for data transfer or extraction are replaced by self-service processes that reduce the IT department's workload while improving response time for business units. Consider a system in which your company's marketing department creates a ticket in your service-desk software, which automatically creates and triggers a data-transfer that otherwise the IT service desk would have had to handle manually.

An (Incomplete) List of Systems That Kafka Connect Can Integrate

Kafka Connect can read from and write to an ever-increasing number of applications and services, including enterprise applications, like SAP and Salesforce:

any SQL database: Oracle, Microsoft SQL Server, IBM Db2, MySQL, PostgreSQL, MariaDB, and others
NoSQL databases like MongoDB, Redis, Cassandra, InfluxDB, and others
message queues like ActiveMQ, RabbitMQ, IBM MQ, Amazon SQS, Azure Event Hub, Google Pub/Sub, and others
IT service-desk applications like Jira, ServiceNow, and Zendesk
logging and monitoring services, applications, and protocols like SNMP, Syslog, Elasticsearch, Amazon CloudWatch, Appdynamics, Metrics, Splunk, and others
generic "low-level" data sources and protocols, including HTTP, SFTP, MQTT, JMS, the file system, and others
robust custom connectors can be easily written using Java, taking full advantage of the reliable Kafka Connect framework and the underlying infrastructure
since Kafka Connect uses Kafka as the underlying infrastructure, any application that uses Kafka as a message bus is automatically available for integration

Setting up a Kafka Connect Integration Is Fast and Easy

Let's take a concrete example. A common integration scenario is this: You have two SQL databases and you need to update one database with information from the other database. For example, when the status of an order changes in your fulfillment system, you might need to update the internal CRM database to reflect this change. Here is the concrete scenario:

whenever the STATUS table changes in the fulfillment database
write the changes to the ORDER_STATUS table in the CRM database

For simplicity, we'll assume both are MySQL databases. The steps for other connectors would be the same, only with different configurations.

It takes two steps to set up this integration, assuming you have a working Kafka Cluster with Kafka Connect:

create the "source-connection" to read data from the STATUS table of fulfillment database
create the "sink-connection" to write data to the ORDER_STATUS table of the CRM database

Here are the corresponding cURL calls, and an explanation of what each of them does. They use the Kafka Connect REST API to create the source and sink. Note that these calls are not specific to Heroku. They will work with any Kafka Connect installation:

1 - Creating the source-connection.

curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{    
    "name": "jdbc_source_mysql_01",     
    "config": {       
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",       
        "connection.url": "jdbc:mysql://fulfillmentdbhost:3306/fulfillmentdb",       
        "connection.user": "fullfilment_user",       
        "connection.password": "&lt;password>",       
        "topic.prefix": "order-status-update-",       
        "mode":"timestamp",       
        "table.whitelist" : "fulfullmentdb.status",       
        "timestamp.column.name": "LAST_UPDATED",       
        "validate.non.null": false     
    }   
}'

2 - Creating the sink-connection.

curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{    
    "name": "jdbc_sink_mysql_01",     
    "config": {       
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",       
        "connection.url": "jdbc:mysql://crmdbhost:3306/crmdb",       
        "connection.user": "crm_user",       
        "connection.password": "&lt;password>",       
        "topics": "order-status-update-status",       
        "table.name.format" : "crmdb.order_status"    
    }   
}'

And you're done! From this point on, any new entries in the STATUS table of the fulfillment database will automatically be written to the ORDER_STATUS table of the CRM database. Let's dive a little into what each call is doing.

cURL call no. 1: Connecting the source

The first cURL command tells Kafka Connect to use a specific type of source connector, namely JdbcSourceConnector, to connect to the MySQL database at fulfillmentdbhost:3306/fulfillmentdb using the supplied credentials. It also configures this connector to monitor the table called "status" in that database for new entries based on the timestamp in the column "LAST_UPDATED." As a result, whenever a new row is written to the table, Kafka Connect will automatically write a new message into the Kafka topic "order-status-update-status" for each new row.

Next, we need a sink connector that monitors this Kafka topic for changes. That's what the second cURL call does.

cURL call no. 2: Connecting the sink

The second cURL command creates a sink connector (JdbcSinkConnector) that connects to a different database, namely crmdbhost:3306/crmdb. This connector will monitor the topic "order-status-update-status" for new messages and will write any new messages to the table "order_status" in the crmdb-database.

This example uses the JdbcSourceConnector and JdbcSinkConnector to connect two relational databases. Other types of connectors can be installed and then used in the same way. Each connector comes with its own documentation and set of options. An unofficial list of available connectors with links to their documentation can be found on the Confluent website.

Taking It to the Next Level

You have seen how to set up a continuously streaming integration between two SQL databases with just two commands. In a similar fashion you could integrate any of the applications that Kafka Connect provides connectors for -- or even write your own. Just set up a source and a sink connector and you are ready to stream your updates.

But this is not all. Once you have set up a source connector, you can use the same source to update multiple destinations ("sinks"). In our example you could run a third cURL command to write all order-status updates to another database as well.

Conclusion

If your team faces any of the problems described in this article, you should give Kafka Connect a try. Kafka Connect adds a whole new set of capabilities to an existing Kafka cluster, that will make your teams life easier in the long run. And if you are currently not running a Kafka cluster, this article has shown you how to set it up within minutes. Once Kafka Connect is up and running, your team can focus on solving actual business problems.

Once your team has mastered the basic workflows described in this article, you might want also want to check out Kafka Connect's ability to transform data on the fly, which can be used to solve a whole new set of problems.

Blog