Inserting and Selecting New Records - One Query

Just found something very interesting in PostgreSQL, thanks to Claude.

I have a really nasty query on some huge tables. Here's the basic layout that we're concerned with:

class RawRoyaltyRecord < ApplicationRecord
  belongs_to :royalty_input_batch_partial
  belongs_to :track
  has_many :raw_royalty_records_sales
end

class RoyaltyInputBatchPartial < ApplicationRecord
  belongs_to :pro
  has_many :raw_royalty_records
end

class RawRoyaltyRecordsSale < ApplicationRecord
  belongs_to :raw_royalty_record
  belongs_to :sale
end

Don't worry about "Pro" and "Sale" - we just need their IDs.

The concept is this. We get spreadsheets in every quarter or so with new raw_royalty_records. I have to match them to sales, which can be a bit tricky. Because of this trickiness, the simplest way to match them is to look at previously matched raw_royalty_records from the same Pro (through the royalty_input_batch_partials table) that have matching track_id and customer values.

Note that the customer value is stripped of extraneous leading/trailing spaces before being saved in the database, and we have an index on lower(customer) to allow us to search quickly for them.
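As a minimal sketch of why the Ruby-side key and the database-side expression need to agree (the helper name `customer_key` is hypothetical, not something from the application): stored values are already stripped, so lowercasing is the only step left, and a key built in Ruby should line up with what `lower(customer)` produces for plain ASCII names. Non-ASCII names may fold differently depending on the database locale, so treat that part as an assumption to verify.

```ruby
# Hypothetical helper: normalizes a customer name the way the application
# code below does before comparing (strip whitespace, then lowercase).
def customer_key(customer)
  customer.to_s.strip.downcase
end

# Stored values are stripped before saving, so they only need lowercasing
# to match an index on lower(customer). Incoming values may still be messy.
stored   = "Acme Music"
incoming = "  ACME music "

puts customer_key(stored) == customer_key(incoming)
```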

This thing was dog slow. It started out fine, but as we've added millions of records it got slower.

The way this worked was that we'd go get a list of prior matched raw_royalty_records and create a hash to look up the sale_ids based on the customer and track_id:

    @pro = Pro.find(params[:pro_id])

    @raw_royalty_records = @pro.raw_royalty_records
      .sans_sales
      .where("track_id is not null")
      .includes(:raw_royalty_records_sales, track: :library)

    @previously_matched_raw_royalty_records =
      @pro.raw_royalty_records.with_sales.where("track_id is not null")
        .select(:customer, :track_id, :id).distinct.includes(:sales).each_with_object({}) do |rrr, hash|
      hash[rrr.track_id] ||= {}
      hash[rrr.track_id][rrr.customer.strip.downcase] = rrr.sales
    end

    # Later on, looping through @raw_royalty_records
        @matches[track_id] ||= {}
        previously_matched = @previously_matched_raw_royalty_records.dig(track_id, customer_key)
        if previously_matched
          @matches[track_id][customer_key] = { match_type: 'previously_matched', sales: previously_matched }
          #break
        end

Not the most beautiful code I've written, but you get the idea.
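The lookup structure is easier to see with plain data. This is a minimal sketch that uses a hypothetical `MatchedRecord` struct in place of the ActiveRecord models, with `sale_ids` standing in for the loaded `sales` association:

```ruby
# Hypothetical stand-in for a previously matched raw_royalty_record.
MatchedRecord = Struct.new(:track_id, :customer, :sale_ids)

# Builds { track_id => { normalized_customer => sale_ids } }.
def build_lookup(matched_records)
  matched_records.each_with_object({}) do |record, lookup|
    lookup[record.track_id] ||= {}
    lookup[record.track_id][record.customer.strip.downcase] = record.sale_ids
  end
end

lookup = build_lookup([
  MatchedRecord.new(10, "Acme Music ", [501]),
  MatchedRecord.new(10, "Beta Films", [502, 503]),
  MatchedRecord.new(11, "Acme Music", [504])
])

# dig returns nil as soon as either level of the hash is missing.
p lookup.dig(10, "acme music") # => [501]
p lookup.dig(12, "acme music") # => nil
```

The lookup itself is cheap. The cost is that every matched record has to be loaded into Ruby before the hash can be built.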

That worked great when there were a couple hundred thousand records, but now we have a hundred and something thousand unmatched records, and many matched records. Looking in the database, one particular PRO that I'm testing with has around 287,000 records total, with 259,000 not connected to a sale. So our query there was pulling in 259,000 records, along with some associated records. That requires CPU and memory. Way too much.

But I realized quickly that I can match up the records in an SQL query. It's ugly, but using a common table expression takes a lot of the complexity out and works well because it needs to compute certain expressions only a single time.

WITH eligible_records AS (
    SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer
    FROM raw_royalty_records rrr
    INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
    INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
    WHERE ribp.pro_id = 960
      AND rrr.track_id IS NOT NULL
)
SELECT rr.*
FROM raw_royalty_records rr
INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
WHERE ribp.pro_id = 960
  AND rrs.id IS NULL
  AND rr.track_id IS NOT NULL
  AND EXISTS (
      SELECT 1
      FROM eligible_records er
      WHERE er.track_id = rr.track_id
        AND er.lower_customer = LOWER(rr.customer)
  );
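To make the semantics of that query concrete, here is a minimal in-memory model of it in Ruby. It is only an illustration under stated assumptions: `Row` is a hypothetical struct, all rows are assumed to belong to the same Pro (so the `pro_id` filter is left out), and an empty `sale_ids` array stands in for "no row in raw_royalty_records_sales".

```ruby
require "set"

# Hypothetical in-memory row; sale_ids is empty for unmatched records.
Row = Struct.new(:id, :track_id, :customer, :sale_ids)

# (track_id, lower(customer)) key, or nil when either part is NULL.
# In SQL a NULL never compares equal, so such rows can never match.
def key_for(row)
  return nil if row.track_id.nil? || row.customer.nil?
  [row.track_id, row.customer.downcase]
end

# Mirrors the eligible_records CTE: distinct keys of rows that have a sale.
def eligible_keys(rows)
  rows.reject { |r| r.sale_ids.empty? }.map { |r| key_for(r) }.compact.to_set
end

# Mirrors the outer SELECT: unmatched rows whose key EXISTS in the set.
def matchable_rows(rows)
  keys = eligible_keys(rows)
  rows.select { |r| r.sale_ids.empty? && keys.include?(key_for(r)) }
end
```

The set is computed once and then probed per row, which is roughly what the CTE plus `EXISTS` asks the database to do, except that the database does it without shipping the rows to the application.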

That's awesome, but what I really want to do is just create the new records in raw_royalty_records_sales. But I also want to get a list of what was just created. Technically, I could do that by saving the highest id in raw_royalty_records_sales, then looking at the row count. This thing shouldn't run overlapping (famous last words), but I'm not going to rely on such silliness.

This is where common table expressions shine.

WITH eligible_records AS ( -- This gets a list of existing track_id, customer, and sale_ids
    SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer, rrrs.sale_id
    FROM raw_royalty_records rrr
    INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
    INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
    WHERE ribp.pro_id = 960
      AND rrr.track_id IS NOT NULL
),
inserted_records AS (
    INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)
    SELECT DISTINCT rr.id, er.sale_id, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
    FROM raw_royalty_records rr
    INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
    LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
    INNER JOIN eligible_records er ON er.track_id = rr.track_id AND er.lower_customer = LOWER(rr.customer)
    WHERE ribp.pro_id = 960
      AND rrs.id IS NULL
      AND rr.track_id IS NOT NULL
    RETURNING *
)
SELECT ir.raw_royalty_record_id, ir.sale_id, rrr.track_id, rrr.customer
FROM inserted_records ir
INNER JOIN raw_royalty_records rrr ON rrr.id = ir.raw_royalty_record_id;

That's a very similar query, but now it's using INSERT INTO ... RETURNING * inside a CTE. So the outer query selects from the rows that were just inserted, joins them with raw_royalty_records to get the track_id and customer, and returns the records in a normalized fashion.
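One way this could be wired into Ruby is sketched below. It is not the application's actual code: the method names `match_and_insert_sql` and `insert_previously_matched` are hypothetical, and the hard-coded 960 is replaced with an interpolated integer. The only thing the sketch assumes about the connection is that it responds to `exec_query(sql)`, as `ActiveRecord::Base.connection` does.

```ruby
# Builds the INSERT ... RETURNING query for one Pro. Integer() raises for
# strings that are not integer literals, so only a number is interpolated.
def match_and_insert_sql(pro_id)
  id = Integer(pro_id)
  <<~SQL
    WITH eligible_records AS (
      SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer, rrrs.sale_id
      FROM raw_royalty_records rrr
      INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
      INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
      WHERE ribp.pro_id = #{id}
        AND rrr.track_id IS NOT NULL
    ),
    inserted_records AS (
      INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)
      SELECT DISTINCT rr.id, er.sale_id, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
      FROM raw_royalty_records rr
      INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
      LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
      INNER JOIN eligible_records er ON er.track_id = rr.track_id AND er.lower_customer = LOWER(rr.customer)
      WHERE ribp.pro_id = #{id}
        AND rrs.id IS NULL
        AND rr.track_id IS NOT NULL
      RETURNING *
    )
    SELECT ir.raw_royalty_record_id, ir.sale_id, rrr.track_id, rrr.customer
    FROM inserted_records ir
    INNER JOIN raw_royalty_records rrr ON rrr.id = ir.raw_royalty_record_id
  SQL
end

# `connection` is anything that responds to exec_query(sql), for example
# ActiveRecord::Base.connection in a Rails app (an assumption of this sketch).
def insert_previously_matched(connection, pro_id)
  connection.exec_query(match_and_insert_sql(pro_id)).to_a
end
```

In a Rails app, `insert_previously_matched(ActiveRecord::Base.connection, @pro.id)` should give back one hash per inserted row, keyed by the column names in the final SELECT.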

I have 3448 raw_royalty_records that can match up like this. Queries that took multiple seconds now take around a second, including the insert. This is a huge win.

Unfortunately, I'm not really using a lot of the cool Railsy ActiveRecord goodies to make this more readable, but that's what good documentation covers. I've sped this up by a couple of orders of magnitude and made it use 1% of the memory of the original. I can handle an ugly query with those numbers.

Michael Chaney
