Parsing huge XML in Ruby with Sax Parser. Still relevant

nikolaiprivalov

Nikolai Privalov

Posted on March 3, 2020

Parsing huge XML in Ruby with Sax Parser. Still relevant

This article is written by Nikolai Privalov, graduate of Rubizza Survival Camp during his work on the first commercial project.

The task in hand

Every web-developer once in a while faces the problem of parsing price lists and catalogs for another e-commerce site. And there are common strategies to deal with price updates, depending on the way you integrate with your supplier.

In my case I was working on a b2b solution, which by default has a low number of visitors. The catalog is an XML that contains all kinds of possible products, categories and vendors and it is updated once a month. When you read this file with the Nokogiri default (DOM) parser, it creates a tree structure with all branches and leaves. It allows you to easily navigate through it via css/xpath selectors.

The only problem is that if you read the whole file into memory, it takes a significant amount of RAM. It is really ineffective to pay for a server if you need this RAM once a month. Since I don't need to navigate through the tree structure, but just replicate all the needed data into the database, the best option is to use SAX parser.

SAX parser works a little bit differently. It uses an event-based approach. It doesn't parse or create a tree structure. Instead, it reads file line by line and notifies you about any occurrence of opening or closing tag.

This communication between the caller app and the parser is usually carried out by the use of the callback functions. In other words, SAX parser passes through the XML document sequentially, and notifies you on the encounters of the tracked events(document start, document end, element start(tag opening), element end(tag closing), etc...), that you can react on by executing some event-specific code. That approach allows interpreter not to store everything in RAM except the one and only line, where tag has been opened or closed.

Now, presumably, you are working on a ruby application, and use ActiveRecord to access the database, that has pre-created tables with associations, closely matching the nesting structure of your XML-document.

<CATALOG>
  <VENDOR MAN_ID="" MAN_NAME="">
    <UNITS UNT_ID="" UNT_NAME="" UNT_PACK_QTTY=""/>
    <UNITS UNT_ID="" UNT_NAME="" UNT_PACK_QTTY=""/>
    ...
    <PRODUCTS SER_ID="" SER_NAME="" STY_ID="">
      <PRODUCT_FAMILY PRF_CONVERT="" PRF_ID="">
        <ITEM PRD_ADDITION="" PRD_AXIS="" PRD_CHECKDIGIT="" PRD_COLOR="" PRD_CONVERT="" PRD_CYLINDER="" PRD_DESCRIPTION="" PRD_ID="" PRD_POWER="" PRD_UPC_CODE="" PRD_COLOR_ID=""/>
        <ITEM PRD_ADDITION="" PRD_AXIS="" PRD_CHECKDIGIT="" PRD_COLOR="" PRD_CONVERT="" PRD_CYLINDER="" PRD_DESCRIPTION="" PRD_ID="" PRD_POWER="" PRD_UPC_CODE="" PRD_COLOR_ID=""/>
        ...
        <UNITS_AVAILABILITY UNT_ID=""/>
      </PRODUCT_FAMILY>
      ...
    </PRODUCTS>
    ...
  </VENDOR>
  ...
</CATALOG>

Let's consider an XML document, that represents a catalog of items(<ITEM/>) structured by the product families(<PRODUCT_FAMILY/>), by products(<PRODUCTS/>), and by vendors(<VENDOR/>). The data for parsing can be found within the attributes of the corresponding elements, while the elements themselves are hierarchically nested in each other. In the catalog, there could be multiple vendors, in the vendor - multiple products, etc. There is also a <UNITS_AVAILABILITY/> element, that uniquely matches one of the <UNITS/> elements by the UNT_ID attribute, and characterizes available package type for the product family (In fact, not for the family, but for the products instead, due to UNT_ID attributes being identical within a particular products).

table structure

The tables of the following structure have been prepared in the database. The column names are mostly similar to the attribute names of the XML elements, and the nesting hierarchical structure is somewhat simplified. Here the vendors table matches the <VENDOR/> element, products - <PRODUCTS/>,<UNITS/>, items - <PRODUCT_FAMILY/>,<ITEM/>.

Parsing of an XML element

In this article, we are going to use Nokogiri::XML::SAX for parsing.
(https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX/)

According to the manual let's create a subclass of the Nokogiri::XML::SAX::Document, and implement the methods that correspond with the events we want to track(start_document, start_element, end_element, end_document). After that let's create an instance of the parser and feed it our XML file to parse.

# Create a subclass of Nokogiri::XML::SAX::Document and implement
# the events we care about:
class CatalogFilter < Nokogiri::XML::SAX::Document
  def start_document
  end

  def start_element(name, attrs = [])
  end

  def end_element(name)
  end

  def end_document
  end
end

# Create our parser
parser = Nokogiri::XML::SAX::Parser.new(CatalogFilter.new)

# Send some XML to the parser
parser.parse(File.open('./Catalog.xml)')

When the parser encounters an element, the start_element method is being called with two arguments: name - element's name, attrs - an array of the element's attributes. The addition of the case-when statement into the start_element method allows us to react to the start of a <VENDOR/> element.

def start_element(name, attrs = [])
  case name
  when 'VENDOR'
    puts 'we are in <VENDOR/> element'
  end
end

As a result, we get we are in <VENDOR/> element output on every <VENDOR/> element encounter.

In theory, when a <VENDOR/> element is encountered, we want to create a corresponding record in the vendors table via the ActiveRecord. To do that, we modify our case-when accordingly and also convert the attrs array into a hash, to access the attribute values by keys.

def start_element(name, attrs = [])
  attrs = attrs.to_h

  case name
  when 'VENDOR'
    Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
  end
end

Nested elements parsing

To create records in the products table on a <PRODUCTS/> element encounter, let's change our case-when statement in a similar way.

def start_element(name, attrs = [])
  attrs = attrs.to_h

  case name
  when 'VENDOR'
    Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
  when 'PRODUCTS'
    Product.create({abbr: attrs['SER_ID'], name: attrs['SER_NAME']})
  end
end

In the current state of our parser, the records of the vendors table on the encounter of the <VENDOR/> elements and the records of the product table on the encounter of the <PRODUCTS/> elements are already being created. However, these two tables are linked by the association of (1 - *), and the foreign key of vendor_id in the records of the products table is needed to be set. To do that, we assign a Vendor class object, that we create via ActiveRecord, to a CatalogFilter class instance variable, so we can add a vendor_id foreign key to the products table records in the following way:

def start_element(name, attrs = [])
  attrs = attrs.to_h

  case name
  when 'VENDOR'
    @vendor = Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
  when 'PRODUCTS'
    @vendor.products.create({
      abbr: attrs['SER_ID'],
      name: attrs['SER_NAME']
    })
  end
end

The @vendor variable gets reassigned every time the parser encounters a <VENDOR/> element, and it's value is relevant while we are inside that element. Since we also have a callback, when the parser finds a closing tag, we can explicitly reset the value of the @vendor variable in the end_element method.

def end_element(name)
  case name
  when 'VENDOR'
    @vendor = nil
  end
end

As a result, on the exit out of the <VENDOR/> element, the end_element function is called, the @vendor variable gets assigned a nil, and therefore the Vendor class object becomes non-accessible outside of the <VENDOR/> element. After that, we proceed to create records of the items table in the same manner, keeping in mind that we are going to need the <PRODUCT_FAMILY/> and <ITEM/> elements attributes.

def start_element(name, attrs = [])
  attrs = attrs.to_h

  case name
  when 'VENDOR'
    @vendor = Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
  when 'PRODUCTS'
    @product = @vendor.products.create({
      abbr: attrs['SER_ID'],
      name: attrs['SER_NAME']})
  end
  when 'PRODUCT_FAMILY'
    @prf = {
      basecurve: attrs['PRF_BASECURVE'],
      diameter: attrs['PRF_DIAMETER'],
      id: attrs['PRF_ID']
    }
  when 'ITEM'
    @product.items.create({
      prf_basecurve: @prf[:basecurve],
      prf_diameter: @prf[:diameter],
      prf_id: @prf[:id],
      prd_addition: attrs['PRD_ADDITION'],
      prd_axis: attrs['PRD_AXIS'],
      prd_color: attrs['PRD_COLOR'],
      prd_cylinder: attrs['PRD_CYLINDER'],
      prd_description: attrs['PRD_DESCRIPTION'],
      prd_id: attrs['PRD_ID'],
      prd_power: attrs['PRD_POWER']
    })
  end
end

Here, we use @prf variable to store the attributes of the <PRODUCT_FAMILY/> element as a hash, which is available during the parsing of all the inner <ITEM/> elements.

Let's edit our end_element method in regards to the newly assigned instance variables.

def end_element(name)
  case name
  when 'VENDOR'
    @vendor = nil
  when 'PRODUCTS'
    @product = nil
  when 'PRODUCT_FAMILY'
    @prf = nil
  end
end

Non-nested concurrent elements parsing

At this stage we already create the necessary database records during the sequential parsing of the XML document.

But looking at the document nesting structure we may also acknowledge the requirement of the <UNITS/> element attributes addition to each of the products records. However, it becomes known much later, the attributes of which exactly <UNITS/> element are needed to be used. We can pick the correct <UNITS/> element only during the parsing of the <UNITS_AVAILABILITY/> element as we acquire the UNT_ID attribute value.

To solve this problem let's make a use of the @units array into which, as we parse the <UNITS/> elements, we push the hashes containing these elements attribute values. Later on, we select the correct hash to use from the @units array by the UNT_ID attribute value.

Let's add additional case-when condition to the start_element method in order to parse <UNITS/> and <UNITS_AVAILABILITY/> elements.

def start_element(name, attrs = [])
...
  when 'UNITS'
    @units << {
      id: attrs['UNT_ID'],
      name: attrs['UNT_NAME'],
      lens_quantity: attrs['UNT_LENS_QTTY']
    }
  when 'UNITS_AVAILABILITY'
    unit = @units.find { |u| u[:id] == attrs['UNT_ID'] }
    @product.update({
      unit_id: unit[:id],
      unit_name: unit[:name],
      unit_lens_quantity: unit[:lens_quantity]
    })
...
end

Of course, the @units array has to be previously defined before we can even use it, and it also needs to be cleared on the end of the <VENDOR/> element, outside of which its content is irrelevant.

It would be valid to define the @units array on the encounter of the <VENDOR/> element in the start_element method, but, as an example, we are going to implement the start_document method and define the @units array in it at the very start of the document parsing. Also on the exit out of the <VENDOR/> element the @units array is being flushed.

def start_document
  @units = []
end

def end_element(name)
  ...
  when 'VENDOR'
    @vendor = nil
    @units.clear
  ...
end

After making the final changes our parser is officially ready and it is already capable of filling the database with the relevant records.

Insert optimization

This article is based on the production case, and the real catalog contained tens of vendors, hundreds of products, and millions of items table records. Currently our parser creates records one at a time, and while it does not lead to any kind of problems for the vendors and products, the creation of the couple of millions items records one by one will consume a significant amount of time, like hours or even days.

The simplest solution to that is the insertion of records in batches of some reasonable size. To accomplish that the insert_all method of the ActiveRecord can be used. It accepts an array of the attribute hashes and creates multiple records in the database via single INSERT SQL request, bypassing the ActiveRecord objects creation callbacks. Therefore instead of the Item.create on the encounter of the <ITEM/> element, we form attribute hashes and push them into the @items array, as it was previously done with the @units array.

def start_document
  @units = []
  @items = []
end

def start_element(name, attrs = [])
...
  when 'ITEM'
    @items << {
      prf_basecurve: @prf[:basecurve],
      prf_diameter: @prf[:diameter],
      prf_id: @prf[:id],
      prd_addition: attrs['PRD_ADDITION'],
      prd_axis: attrs['PRD_AXIS'],
      prd_color: attrs['PRD_COLOR'],
      prd_cylinder: attrs['PRD_CYLINDER'],
      prd_description: attrs['PRD_DESCRIPTION'],
      prd_id: attrs['PRD_ID'],
      prd_power: attrs['PRD_POWER'],
      product_id: @product.id # ← here we pass product_id
    }
...
end

Now, at the end of the <ITEM/> element we may check if the @items array size has reached some set BATCH_SIZE value, and if so, the batch insert of the records based on the accumulated attribute hashes takes place with the clearing of the @items array right after.

def end_element(name)
...
  when 'ITEM'
    if @items.size == BATCH_SIZE
      item_batch_insert
      @items.clear
    end
...
end

def item_batch_insert
  Item.insert_all(@items)
end

Bear in mind, that at the end of the document parsing there could be still some amount of elements left in the @items array, and if the array size is less than a BATCH_SIZE the item_batch_insert is not being called, and that last set of records is not being created in the database. To counteract that, we call item_batch_insert one more time at the end of the document parsing in the end_document method.

def end_document
  item_batch_insert
end

Results

Here is the final result:

class CatalogFilter < Nokogiri::XML::SAX::Document
  BATCH_SIZE = 5000

  def start_document
    @units = []
    @items = []
  end

  def start_element(name, attrs = [])
    attrs = attrs.to_h

    case name
    when 'VENDOR'
      @vendor = Vendor.create({
        abbr: attrs['MAN_ID'],
        name: attrs['MAN_NAME']
      })
    when 'UNITS'
      @units << {
        id: attrs['UNT_ID'],
        name: attrs['UNT_NAME'],
        lens_quantity: attrs['UNT_LENS_QTTY']
      }
    when 'PRODUCTS'
      @product = @vendor.products.create({
        abbr: attrs['SER_ID'],
        name: attrs['SER_NAME'],
        ser_day_per_lens: attrs['SER_DAY_PER_LENS']
      })
    when 'PRODUCT_FAMILY'
      @prf = {
        basecurve: attrs['PRF_BASECURVE'],
        diameter: attrs['PRF_DIAMETER'],
        id: attrs['PRF_ID']
      }
    when 'ITEM'
      @items << {
        prf_basecurve: @prf[:basecurve],
        prf_diameter: @prf[:diameter],
        prf_id: @prf[:id],
        prd_addition: attrs['PRD_ADDITION'],
        prd_axis: attrs['PRD_AXIS'],
        prd_color: attrs['PRD_COLOR'],
        prd_cylinder: attrs['PRD_CYLINDER'],
        prd_description: attrs['PRD_DESCRIPTION'],
        prd_id: attrs['PRD_ID'],
        prd_power: attrs['PRD_POWER'],
        product_id: @product.id
      }
    when 'UNITS_AVAILABILITY'
      unit = @units.find { |u| u[:id] == attrs['UNT_ID'] }
      @product.update({
        unit_id: unit[:id],
        unit_name: unit[:name],
        unit_lens_quantity: unit[:lens_quantity]
      })
    end
  end

  def end_element(name)
    case name
    when 'ITEM'
      if @items.size == BATCH_SIZE
        item_batch_insert
        @items.clear
      end
    when 'PRODUCT_FAMILY'
      @prf = nil
    when 'PRODUCTS'
      @product = nil
    when 'VENDOR'
      @vendor = nil
      @units.clear
    en
  end

  def end_document
    item_batch_insert
  end

  def item_batch_insert
    Item.insert_all(@items)
  end
end

parser = Nokogiri::XML::SAX::Parser.new(CatalogFilter.new)
parser.parse(File.open('./Catalog.xml)'))

P.S. Are we done? Almost. Instead of creating new objects from scratch all the time, you might want to check if they already exist in your database. In that case you can use first_or_create method of ActiveRecord. But you can not use that strategy for SQL bulk insertion. Instead, use upsert_all method for that. And since insert_all and upsert_all bypass all the callbacks it also skips setting timestamps to a record. Please check https://github.com/rails/rails/issues/35493 this thread to deal with possible problems.

💖 💪 🙅 🚩
nikolaiprivalov
Nikolai Privalov

Posted on March 3, 2020

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related