Nikolai Privalov
Posted on March 3, 2020
This article is written by Nikolai Privalov, graduate of Rubizza Survival Camp during his work on the first commercial project.
- The task in hand
- Parsing of an XML element
- Nested elements parsing
- Non-nested concurrent elements parsing
- Insert optimization
- Results
The task in hand
Every web-developer once in a while faces the problem of parsing price lists and catalogs for another e-commerce site. And there are common strategies to deal with price updates, depending on the way you integrate with your supplier.
In my case I was working on a b2b solution, which by default has a low number of visitors. The catalog is an XML that contains all kinds of possible products, categories and vendors and it is updated once a month. When you read this file with the Nokogiri default (DOM) parser, it creates a tree structure with all branches and leaves. It allows you to easily navigate through it via css/xpath selectors.
The only problem is that if you read the whole file into memory, it takes a significant amount of RAM. It is really ineffective to pay for a server if you need this RAM once a month. Since I don't need to navigate through the tree structure, but just replicate all the needed data into the database, the best option is to use SAX parser.
SAX parser works a little bit differently. It uses an event-based
approach. It doesn't parse or create a tree structure. Instead, it reads file line by line and notifies you about any occurrence of opening or closing tag.
This communication between the caller app and the parser is usually carried out by the use of the callback functions. In other words, SAX parser passes through the XML document sequentially, and notifies you on the encounters of the tracked events(document start, document end, element start(tag opening), element end(tag closing), etc...), that you can react on by executing some event-specific code. That approach allows interpreter not to store everything in RAM except the one and only line, where tag has been opened or closed.
Now, presumably, you are working on a ruby
application, and use ActiveRecord to access the database, that has pre-created tables with associations, closely matching the nesting structure of your XML-document.
<CATALOG>
<VENDOR MAN_ID="" MAN_NAME="">
<UNITS UNT_ID="" UNT_NAME="" UNT_PACK_QTTY=""/>
<UNITS UNT_ID="" UNT_NAME="" UNT_PACK_QTTY=""/>
...
<PRODUCTS SER_ID="" SER_NAME="" STY_ID="">
<PRODUCT_FAMILY PRF_CONVERT="" PRF_ID="">
<ITEM PRD_ADDITION="" PRD_AXIS="" PRD_CHECKDIGIT="" PRD_COLOR="" PRD_CONVERT="" PRD_CYLINDER="" PRD_DESCRIPTION="" PRD_ID="" PRD_POWER="" PRD_UPC_CODE="" PRD_COLOR_ID=""/>
<ITEM PRD_ADDITION="" PRD_AXIS="" PRD_CHECKDIGIT="" PRD_COLOR="" PRD_CONVERT="" PRD_CYLINDER="" PRD_DESCRIPTION="" PRD_ID="" PRD_POWER="" PRD_UPC_CODE="" PRD_COLOR_ID=""/>
...
<UNITS_AVAILABILITY UNT_ID=""/>
</PRODUCT_FAMILY>
...
</PRODUCTS>
...
</VENDOR>
...
</CATALOG>
Let's consider an XML document, that represents a catalog of items(<ITEM/>
) structured by the product families(<PRODUCT_FAMILY/>
), by products(<PRODUCTS/>
), and by vendors(<VENDOR/>
). The data for parsing can be found within the attributes of the corresponding elements, while the elements themselves are hierarchically nested in each other. In the catalog, there could be multiple vendors, in the vendor - multiple products, etc. There is also a <UNITS_AVAILABILITY/>
element, that uniquely matches one of the <UNITS/>
elements by the UNT_ID
attribute, and characterizes available package type for the product family (In fact, not for the family, but for the products instead, due to UNT_ID attributes being identical within a particular products).
The tables of the following structure have been prepared in the database. The column names are mostly similar to the attribute names of the XML elements, and the nesting hierarchical structure is somewhat simplified. Here the vendors
table matches the <VENDOR/>
element, products
- <PRODUCTS/>
,<UNITS/>
, items
- <PRODUCT_FAMILY/>
,<ITEM/>
.
Parsing of an XML element
In this article, we are going to use Nokogiri::XML::SAX for parsing.
(https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX/)
According to the manual let's create a subclass of the Nokogiri::XML::SAX::Document
, and implement the methods that correspond with the events we want to track(start_document, start_element, end_element, end_document). After that let's create an instance of the parser and feed it our XML file to parse.
# Create a subclass of Nokogiri::XML::SAX::Document and implement
# the events we care about:
class CatalogFilter < Nokogiri::XML::SAX::Document
def start_document
end
def start_element(name, attrs = [])
end
def end_element(name)
end
def end_document
end
end
# Create our parser
parser = Nokogiri::XML::SAX::Parser.new(CatalogFilter.new)
# Send some XML to the parser
parser.parse(File.open('./Catalog.xml)')
When the parser encounters an element, the start_element
method is being called with two arguments: name - element's name, attrs - an array of the element's attributes. The addition of the case-when
statement into the start_element
method allows us to react to the start of a <VENDOR/>
element.
def start_element(name, attrs = [])
case name
when 'VENDOR'
puts 'we are in <VENDOR/> element'
end
end
As a result, we get we are in <VENDOR/> element
output on every <VENDOR/>
element encounter.
In theory, when a <VENDOR/>
element is encountered, we want to create a corresponding record in the vendors
table via the ActiveRecord. To do that, we modify our case-when
accordingly and also convert the attrs
array into a hash, to access the attribute values by keys.
def start_element(name, attrs = [])
attrs = attrs.to_h
case name
when 'VENDOR'
Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
end
end
Nested elements parsing
To create records in the products
table on a <PRODUCTS/>
element encounter, let's change our case-when
statement in a similar way.
def start_element(name, attrs = [])
attrs = attrs.to_h
case name
when 'VENDOR'
Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
when 'PRODUCTS'
Product.create({abbr: attrs['SER_ID'], name: attrs['SER_NAME']})
end
end
In the current state of our parser, the records of the vendors
table on the encounter of the <VENDOR/>
elements and the records of the product
table on the encounter of the <PRODUCTS/>
elements are already being created. However, these two tables are linked by the association of (1 - *), and the foreign key of vendor_id
in the records of the products
table is needed to be set. To do that, we assign a Vendor class object, that we create via ActiveRecord, to a CatalogFilter class instance variable, so we can add a vendor_id
foreign key to the products
table records in the following way:
def start_element(name, attrs = [])
attrs = attrs.to_h
case name
when 'VENDOR'
@vendor = Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
when 'PRODUCTS'
@vendor.products.create({
abbr: attrs['SER_ID'],
name: attrs['SER_NAME']
})
end
end
The @vendor
variable gets reassigned every time the parser encounters a <VENDOR/>
element, and it's value is relevant while we are inside that element. Since we also have a callback, when the parser finds a closing tag, we can explicitly reset
the value of the @vendor
variable in the end_element
method.
def end_element(name)
case name
when 'VENDOR'
@vendor = nil
end
end
As a result, on the exit out of the <VENDOR/>
element, the end_element
function is called, the @vendor
variable gets assigned a nil
, and therefore the Vendor class object becomes non-accessible outside of the <VENDOR/>
element. After that, we proceed to create records of the items
table in the same manner, keeping in mind that we are going to need the <PRODUCT_FAMILY/>
and <ITEM/>
elements attributes.
def start_element(name, attrs = [])
attrs = attrs.to_h
case name
when 'VENDOR'
@vendor = Vendor.create({abbr: attrs['MAN_ID'], name: attrs['MAN_NAME']})
when 'PRODUCTS'
@product = @vendor.products.create({
abbr: attrs['SER_ID'],
name: attrs['SER_NAME']})
end
when 'PRODUCT_FAMILY'
@prf = {
basecurve: attrs['PRF_BASECURVE'],
diameter: attrs['PRF_DIAMETER'],
id: attrs['PRF_ID']
}
when 'ITEM'
@product.items.create({
prf_basecurve: @prf[:basecurve],
prf_diameter: @prf[:diameter],
prf_id: @prf[:id],
prd_addition: attrs['PRD_ADDITION'],
prd_axis: attrs['PRD_AXIS'],
prd_color: attrs['PRD_COLOR'],
prd_cylinder: attrs['PRD_CYLINDER'],
prd_description: attrs['PRD_DESCRIPTION'],
prd_id: attrs['PRD_ID'],
prd_power: attrs['PRD_POWER']
})
end
end
Here, we use @prf
variable to store the attributes of the <PRODUCT_FAMILY/>
element as a hash, which is available during the parsing of all the inner <ITEM/>
elements.
Let's edit our end_element
method in regards to the newly assigned instance variables.
def end_element(name)
case name
when 'VENDOR'
@vendor = nil
when 'PRODUCTS'
@product = nil
when 'PRODUCT_FAMILY'
@prf = nil
end
end
Non-nested concurrent elements parsing
At this stage we already create the necessary database records during the sequential parsing of the XML document.
But looking at the document nesting structure we may also acknowledge the requirement of the <UNITS/>
element attributes addition to each of the products
records. However, it becomes known much later, the attributes of which exactly <UNITS/>
element are needed to be used. We can pick the correct <UNITS/>
element only during the parsing of the <UNITS_AVAILABILITY/>
element as we acquire the UNT_ID
attribute value.
To solve this problem let's make a use of the @units array into which, as we parse the <UNITS/>
elements, we push the hashes containing these elements attribute values. Later on, we select the correct hash to use from the @units array by the UNT_ID attribute value.
Let's add additional case-when
condition to the start_element
method in order to parse <UNITS/>
and <UNITS_AVAILABILITY/>
elements.
def start_element(name, attrs = [])
...
when 'UNITS'
@units << {
id: attrs['UNT_ID'],
name: attrs['UNT_NAME'],
lens_quantity: attrs['UNT_LENS_QTTY']
}
when 'UNITS_AVAILABILITY'
unit = @units.find { |u| u[:id] == attrs['UNT_ID'] }
@product.update({
unit_id: unit[:id],
unit_name: unit[:name],
unit_lens_quantity: unit[:lens_quantity]
})
...
end
Of course, the @units
array has to be previously defined before we can even use it, and it also needs to be cleared on the end of the <VENDOR/>
element, outside of which its content is irrelevant.
It would be valid to define the @units
array on the encounter of the <VENDOR/>
element in the start_element
method, but, as an example, we are going to implement the start_document
method and define the @units
array in it at the very start of the document parsing. Also on the exit out of the <VENDOR/>
element the @units array is being flushed.
def start_document
@units = []
end
def end_element(name)
...
when 'VENDOR'
@vendor = nil
@units.clear
...
end
After making the final changes our parser is officially ready and it is already capable of filling the database with the relevant records.
Insert optimization
This article is based on the production case, and the real catalog contained tens of vendors
, hundreds of products
, and millions of items
table records. Currently our parser creates records one at a time, and while it does not lead to any kind of problems for the vendors
and products
, the creation of the couple of millions items
records one by one will consume a significant amount of time, like hours or even days.
The simplest solution to that is the insertion of records in batches of some reasonable size. To accomplish that the insert_all
method of the ActiveRecord can be used. It accepts an array of the attribute hashes and creates multiple records in the database via single INSERT SQL request, bypassing the ActiveRecord objects creation callbacks. Therefore instead of the Item.create on the encounter of the <ITEM/>
element, we form attribute hashes and push them into the @items
array, as it was previously done with the @units
array.
def start_document
@units = []
@items = []
end
def start_element(name, attrs = [])
...
when 'ITEM'
@items << {
prf_basecurve: @prf[:basecurve],
prf_diameter: @prf[:diameter],
prf_id: @prf[:id],
prd_addition: attrs['PRD_ADDITION'],
prd_axis: attrs['PRD_AXIS'],
prd_color: attrs['PRD_COLOR'],
prd_cylinder: attrs['PRD_CYLINDER'],
prd_description: attrs['PRD_DESCRIPTION'],
prd_id: attrs['PRD_ID'],
prd_power: attrs['PRD_POWER'],
product_id: @product.id # ← here we pass product_id
}
...
end
Now, at the end of the <ITEM/>
element we may check if the @items array size has reached some set BATCH_SIZE
value, and if so, the batch insert of the records based on the accumulated attribute hashes takes place with the clearing of the @items
array right after.
def end_element(name)
...
when 'ITEM'
if @items.size == BATCH_SIZE
item_batch_insert
@items.clear
end
...
end
def item_batch_insert
Item.insert_all(@items)
end
Bear in mind, that at the end of the document parsing there could be still some amount of elements left in the @items array, and if the array size is less than a BATCH_SIZE the item_batch_insert
is not being called, and that last set of records is not being created in the database. To counteract that, we call item_batch_insert
one more time at the end of the document parsing in the end_document
method.
def end_document
item_batch_insert
end
Results
Here is the final result:
class CatalogFilter < Nokogiri::XML::SAX::Document
BATCH_SIZE = 5000
def start_document
@units = []
@items = []
end
def start_element(name, attrs = [])
attrs = attrs.to_h
case name
when 'VENDOR'
@vendor = Vendor.create({
abbr: attrs['MAN_ID'],
name: attrs['MAN_NAME']
})
when 'UNITS'
@units << {
id: attrs['UNT_ID'],
name: attrs['UNT_NAME'],
lens_quantity: attrs['UNT_LENS_QTTY']
}
when 'PRODUCTS'
@product = @vendor.products.create({
abbr: attrs['SER_ID'],
name: attrs['SER_NAME'],
ser_day_per_lens: attrs['SER_DAY_PER_LENS']
})
when 'PRODUCT_FAMILY'
@prf = {
basecurve: attrs['PRF_BASECURVE'],
diameter: attrs['PRF_DIAMETER'],
id: attrs['PRF_ID']
}
when 'ITEM'
@items << {
prf_basecurve: @prf[:basecurve],
prf_diameter: @prf[:diameter],
prf_id: @prf[:id],
prd_addition: attrs['PRD_ADDITION'],
prd_axis: attrs['PRD_AXIS'],
prd_color: attrs['PRD_COLOR'],
prd_cylinder: attrs['PRD_CYLINDER'],
prd_description: attrs['PRD_DESCRIPTION'],
prd_id: attrs['PRD_ID'],
prd_power: attrs['PRD_POWER'],
product_id: @product.id
}
when 'UNITS_AVAILABILITY'
unit = @units.find { |u| u[:id] == attrs['UNT_ID'] }
@product.update({
unit_id: unit[:id],
unit_name: unit[:name],
unit_lens_quantity: unit[:lens_quantity]
})
end
end
def end_element(name)
case name
when 'ITEM'
if @items.size == BATCH_SIZE
item_batch_insert
@items.clear
end
when 'PRODUCT_FAMILY'
@prf = nil
when 'PRODUCTS'
@product = nil
when 'VENDOR'
@vendor = nil
@units.clear
en
end
def end_document
item_batch_insert
end
def item_batch_insert
Item.insert_all(@items)
end
end
parser = Nokogiri::XML::SAX::Parser.new(CatalogFilter.new)
parser.parse(File.open('./Catalog.xml)'))
P.S. Are we done? Almost. Instead of creating new objects from scratch all the time, you might want to check if they already exist in your database. In that case you can use first_or_create
method of ActiveRecord. But you can not use that strategy for SQL bulk insertion. Instead, use upsert_all
method for that. And since insert_all
and upsert_all
bypass all the callbacks it also skips setting timestamps to a record. Please check https://github.com/rails/rails/issues/35493 this thread to deal with possible problems.
Posted on March 3, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.