Using MySQL as a Cache Layer for BigQuery

keydunov

Artyom Keydunov

Posted on September 12, 2019

Using MySQL as a Cache Layer for BigQuery

BigQuery is great at handling large datasets, but will never give you a sub-second response, even on small datasets. It leads to a wait time on dashboards and charts, especially dynamic, where users can select different date ranges or change filters. It is almost always okay for internal BIs, but not for customer-facing analytics. We tolerate a lot of things such as poor UI and performance in internal tools, but not in those we ship to customers.

But we still can leverage BigQuery’s cheap data storage and the power to process large datasets, while not giving up on the performance. As BigQuery acts as a single source of truth and stores all the raw data, MySQL can act as cache layer on top of it and store only small, aggregated tables and provides us with a desired sub-second response.

You can check out the demo here and the source code on Github. Make sure to play with date range and switchers—dynamic dashboards benefit the most from the pre-aggregations.

cubejs-external-rollups.gif

Cube.js leverages the pre-aggregations layer as a part of its two-level caching system. We recently released support for external pre-aggregations to target use cases, where users can combine multiple databases and get the best out of the two worlds. The schema below shows the typical setup for Cube.js with BigQuery and MySQL.

Alt Text

To use the external rollup feature, we need to configure Cube.js to connect to both BigQuery and MySQL, as well as specify which pre-aggregation we want to build externally. If you are new to Cube.js, I recommend checking this 101-style tutorial first and then come back here. We are going to use the public Hacker News dataset from BigQuery for our sample application.

Let’s first install Cube.js CLI and create a new application.

$ npm install -g cubejs-cli
$ cubejs create external-rollups -d bigquery
Enter fullscreen mode Exit fullscreen mode

We set -d bigquery to make our main database be a BigQuery. Next, cd into the bigquery-mysql folder and configure .env with correct credentials.

CUBEJS_DB_TYPE=bigquery
CUBEJS_DB_BQ_PROJECT_ID=<BIGQUERY PROJECT ID>
CUBEJS_DB_BQ_CREDENTIALS=<BIGQUERY BASE64-ENCODED KEY>
CUBEJS_EXT_DB_NAME=preags
CUBEJS_EXT_DB_HOST=localhost
CUBEJS_EXT_DB_USER=root
CUBEJS_EXT_DB_PASS=12345

Enter fullscreen mode Exit fullscreen mode

Here we set credentials for both the main DB (BigQuery) and external DB for pre-aggregations (MySQL). You can learn more about obtaining BigQuery credentials at the Cube.js docs here. Also, in order to build pre-aggregations inside MySQL, Cube.js should have write access to the stb_pre_aggregations schema where pre-aggregation tables will be stored.

Now, let’s install the Cube.js MySQL driver.

$ npm install @cubejs-backend/mysql-driver --save
Enter fullscreen mode Exit fullscreen mode

Once we have it, the last configuration step is to provide MySQL related options via the externalDbType and externalDriverFactory properties. Replace the content of the index.js file with the following.

const CubejsServer = require("@cubejs-backend/server");
const MySQLDriver = require('@cubejs-backend/mysql-driver');

const server = new CubejsServer({
  externalDbType: 'mysql',
  externalDriverFactory: () => new MySQLDriver({
    host: process.env.CUBEJS_EXT_DB_HOST,
    database: process.env.CUBEJS_EXT_DB_NAME,
    user: process.env.CUBEJS_EXT_DB_USER,
    password: process.env.CUBEJS_EXT_DB_PASS.toString()
  })
});

server.listen().then(({ port }) => {
  console.log(`🚀 Cube.js server is listening on ${port}`);
});
Enter fullscreen mode Exit fullscreen mode

That is all we need to let Cube.js connect to both BigQuery and MySQL. Now, we can create our first Cube.js data schema file. Cube.js uses the data schema to generate an SQL code, which will be executed in your database.

Create the schema/Stories.js file with the following content.

cube(`Stories`, {
  sql: `select * from \`fh-bigquery.hackernews.full_partitioned\` WHERE type = 'story'`,

  measures: {
    count: {
      type: `count`,
    }
  },

  dimensions: {
    category: {
      type: `string`,
      case: {
        when: [
          { sql: `STARTS_WITH(title, "Show HN")`, label: `Show HN` },
          { sql: `STARTS_WITH(title, "Ask HN")`, label: `Ask HN` }
        ],
        else: { label: `Other` }
      }
    },

    time: {
      sql: `timestamp`,
      type: `time`
    }
  }
});
Enter fullscreen mode Exit fullscreen mode

Now start the Cube.js server by running node index.js and navigate to the development playground at http://localhost:4000.

You can select the Stories count measure and category dimension, alongside a time dimension to build a chart as shown below.

Alt Text

If we inspect a generated SQL by clicking a SQL button, we’ll see the following.

SELECT
  CASE
    WHEN STARTS_WITH(title, "Show HN") THEN 'Show HN'
    WHEN STARTS_WITH(title, "Ask HN") THEN 'Ask HN'
    ELSE 'Other'
  END `stories__category`,
  DATETIME_TRUNC(DATETIME(`stories`.timestamp, 'UTC'), MONTH) `stories__time_month`,
  count(*) `stories__count`
FROM
  (
    select
      *
    from
      `fh-bigquery.hackernews.full_partitioned`
    WHERE
      type = 'story'
  ) AS `stories`
GROUP BY
  1,
  2
ORDER BY
  2 ASC
LIMIT
  10000
Enter fullscreen mode Exit fullscreen mode

This SQL shows us that this query runs against the raw data inside BigQuery. Now, let’s make it run against the pre-aggregated table inside MySQL. To do that, we are going to define a pre-aggregation. Usually, it is done inside the same cube, but for the sake of the tutorial, so we can compare the performance with and without pre-aggregation, let’s create a new cube. We can do it inside the same file. Add the following code to the schema/Stories.js file.

cube(`StoriesPreAgg`, {
  extends: Stories,
  preAggregations: {
    main: {
      type: `rollup`,
      measureReferences: [count],
      dimensionReferences: [category],
      granularity: `month`,
      timeDimensionReference: time,
      external: true
    }
  }
});
Enter fullscreen mode Exit fullscreen mode

In the above code, we declare a pre-aggregation with a rollup type and specify which measures and dimensions to include in the aggregate table. Also note external: true; this line tells Cube.js to upload this pre-aggregation into MySQL.

Now, go to the development playground and select the same measures and dimensions as before: count, category, and time grouped by month, but this time select them from the Stories PreAgg cube. When requested the first time, Cube.js will generate an aggregate table and upload it to MySQL. All subsequent requests will go directly to the aggregate table inside MySQL. You can inspect the generated SQL and it should look like the following.

SELECT
  `stories_pre_agg__category` `stories_pre_agg__category`,
  `stories_pre_agg__time_month` `stories_pre_agg__time_month`,
  sum(`stories_pre_agg__count`) `stories_pre_agg__count`
FROM
  stb_pre_aggregations.stories_pre_agg_main
GROUP BY
  1,
  2
ORDER BY
  2 ASC
LIMIT
  10000
Enter fullscreen mode Exit fullscreen mode

As you can see, it now queries data from the stb_pre_aggregations.stories_pre_agg_main table inside MySQL. You can play around with filters to see the performance boost of the aggregated query compared to the raw one.

You can also check this demo dashboard with multiple charts and compare its performance with and without pre-aggregations. The source code of the example dashboard is available on Github.

💖 💪 🙅 🚩
keydunov
Artyom Keydunov

Posted on September 12, 2019

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Using MySQL as a Cache Layer for BigQuery
tutorial Using MySQL as a Cache Layer for BigQuery

September 12, 2019