HowTo: Working with large files in Ruby efficiently.

tjay_dev

Thomas Jaskiewicz

Posted on September 3, 2019

How can we read files in Ruby?

* The test file was generated by running the following command:

❯ openssl req -newkey rsa:2048 -new -nodes -x509 -days 3650 -keyout key.pem -out cert.pem

The file has a clearly defined beginning and end, which will be useful while reading it.

1. File.read() which is actually IO.read():

> file = File.read("cert.pem")
=> "-----BEGIN CERTIFICATE-----\nMIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\nTDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\nzqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\nTgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\naigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\nraNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\nGNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\naeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\nlybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n-----END CERTIFICATE-----\n"
> file.bytesize
=> 956
> file.class
=> String

The read method reads the entire file's content and returns it as a single String.

2. File.new() and its synonym File.open():

> file = File.new("cert.pem")
=> #<File:cert.pem>
> lines = file.readlines
=> ["-----BEGIN CERTIFICATE-----\n",
 "MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n",
 "TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n",
 "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n",
 "zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n",
 "1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n",
 "TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n",
 "7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n",
 "aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n",
 "4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n",
 "raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n",
 "9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n",
 "GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n",
 "aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n",
 "lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n",
 "-----END CERTIFICATE-----\n"]
> lines.class
=> Array

The new and open methods return an instance of the File class, on which we can call readlines. It reads the entire file's content, splits it line by line and returns an Array of Strings, where each element is one line from the file.
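
One detail worth keeping in mind: a handle created with File.new stays open until it is closed explicitly or garbage collected. Below is a minimal sketch of the block form of File.open, which closes the handle when the block returns. The helper name read_lines_and_close and the temporary sample file are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: reads all lines with the block form of File.open and
# returns them together with the File object, so we can inspect its state.
def read_lines_and_close(path)
  handle = nil
  lines = File.open(path) do |file|
    handle = file
    file.readlines
  end
  [lines, handle]
end

# A small throwaway file standing in for cert.pem.
sample = Tempfile.new('sample')
sample.write("first\nsecond\n")
sample.flush

lines, handle = read_lines_and_close(sample.path)
puts lines.inspect   # ["first\n", "second\n"]
puts handle.closed?  # true
```

With File.new the equivalent code would need an explicit file.close, ideally inside an ensure clause.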

3. File.readlines() which is actually IO.readlines():

> lines = File.readlines("cert.pem")
=> ["-----BEGIN CERTIFICATE-----\n",
 "MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n",
 "TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n",
 "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n",
 "zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n",
 "1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n",
 "TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n",
 "7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n",
 "aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n",
 "4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n",
 "raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n",
 "9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n",
 "GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n",
 "aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n",
 "lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n",
 "-----END CERTIFICATE-----\n"]
> lines.class
=> Array

Here we get the same output as in the previous example by calling just the class method readlines on the File class.

4. File.foreach() which is actually IO.foreach():

> file = File.foreach("./cert.pem")
=> #<Enumerator: ...>
> file.entries
=> ["-----BEGIN CERTIFICATE-----\n",
 "MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n",
 "TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n",
 "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n",
 "zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n",
 "1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n",
 "TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n",
 "7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n",
 "aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n",
 "4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n",
 "raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n",
 "9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n",
 "GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n",
 "aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n",
 "lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n",
 "-----END CERTIFICATE-----\n"]
> file.entries.class
=> Array

The foreach method, called without a block, returns an Enumerator instance. Calling entries on it returns an Array of Strings where, again, each element is a line from the file.

As we can see above, there are many methods that allow us to read a file. But which one should we use, and why? Let's create a large file and check those methods again!

Which methods should we use to read large files?

Generating our test file

First, let's generate a large file with randomized data inside:

require 'securerandom'
one_megabyte = 1024 * 1024

name = "large_1G"
size = 1000

File.open("./#{name}.txt", 'wb') do |file|
  size.times do
    file.write(SecureRandom.random_bytes(one_megabyte))
  end
end
  • w - Write-only, truncates existing file to zero length or creates a new file for writing.
  • b - Binary file mode. Suppresses EOL <-> CRLF conversion on Windows and sets the external encoding to ASCII-8BIT unless explicitly specified.
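
To make the effect of the b flag a bit more concrete, here is a small sketch that writes random bytes in binary mode and reads them back. The helper name binary_roundtrip and the temporary file are assumptions made for this example only.

```ruby
require 'securerandom'
require 'tempfile'

# Hypothetical helper: writes bytes in binary mode and reads them back,
# also in binary mode.
def binary_roundtrip(path, bytes)
  File.open(path, 'wb') { |file| file.write(bytes) }
  File.open(path, 'rb') { |file| file.read }
end

tmp = Tempfile.new('binary')
data = SecureRandom.random_bytes(1024)
read_back = binary_roundtrip(tmp.path, data)

puts read_back.encoding == Encoding::ASCII_8BIT  # true
puts read_back == data                           # true
```

Binary mode matters here because the generated content is arbitrary bytes, not text in any particular encoding.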

As a result, we generated a 1GB file:

ls -lah
...
-rw-r--r--   1 user  user   1.0G Aug 31 22:10 large_1G.txt

Defining our metrics and profilers

There are two metrics that matter most in our experiment:

  • Time - How long does it take to open and read the file?
  • Memory - How much memory does it take to open and read the file?

There will also be one additional metric describing how many objects were freed by the garbage collector.

We can prepare simple profiling methods:

# ./helpers.rb
require 'benchmark'

def profile_memory
  memory_usage_before = `ps -o rss= -p #{Process.pid}`.to_i
  yield
  memory_usage_after = `ps -o rss= -p #{Process.pid}`.to_i

  used_memory = ((memory_usage_after - memory_usage_before) / 1024.0).round(2)
  puts "Memory usage: #{used_memory} MB"
end

def profile_time
  time_elapsed = Benchmark.realtime do
    yield
  end

  puts "Time: #{time_elapsed.round(2)} seconds"
end

def profile_gc
  GC.start
  before = GC.stat(:total_freed_objects)
  yield
  GC.start
  after = GC.stat(:total_freed_objects)

  puts "Objects Freed: #{after - before}"
end

def profile
  profile_memory do 
    profile_time do 
      profile_gc do
        yield
      end
    end 
  end 
end

Testing our methods for reading files

  • .read
file = nil
profile do
  file = File.read("large_1G.txt")
end

Objects Freed: 39
Time: 0.52 seconds
Memory usage: 1000.05 MB
  • .new + #readlines
file = nil
profile do
  file = File.new("large_1G.txt").readlines
end

Objects Freed: 39
Time: 4.19 seconds
Memory usage: 1298.4 MB
  • .readlines
file = nil
profile do
  file = File.readlines("large_1G.txt")
end

Objects Freed: 39
Time: 4.24 seconds
Memory usage: 1284.61 MB
  • .foreach
file = nil
profile do
  file = File.foreach("large_1G.txt").to_a
end

Objects Freed: 40
Time: 4.42 seconds
Memory usage: 1284.31 MB

The examples above read the whole file and store it in memory as one String or as an Array of Strings (each line from the file as one element in the Array).

As we can see, it requires at least as much memory as the size of the file:

  • one String - 1GB file requires 1GB of memory.
  • an Array of Strings - 1GB of memory for the file's content plus additional memory for the Array and its String objects (roughly 300MB here). This approach has one advantage: we can access any line of the file, as long as we know its index.

At this point we can see that the methods we tested are not really memory efficient. The bigger the file, the more memory we need. In the long run this approach can have serious consequences, up to the application being killed.

Now we need to ask ourselves a question: can we process our files line by line? If so, then we can read our files in a different way:

  • .new + #each
file = nil

profile do
  file = File.new("large_1G.txt")
  file.each { |line| line }
end

Objects Freed: 4100808
Time: 2.08 seconds
Memory usage: 57.68 MB
  • .new + #advise + #each
file = nil

profile do
  file = File.new("large_1G.txt")
  file.advise(:sequential)
  file.each { |line| line }
end

Objects Freed: 4100808
Time: 2.22 seconds
Memory usage: 55.71 MB

Calling the #advise method announces an intention to access data from the current file in a specific pattern. Using #advise brought no major improvement here.

  • .new + #read - reading chunk by chunk
file = nil
chunk_size = 4096
buf = ""

profile do
  file = File.new("large_1G.txt")
  while buf = file.read(chunk_size)
    buf.tap { |buf| buf }
  end
end

Objects Freed: 256037
Time: 1.27 seconds
Memory usage: 131.64 MB

We defined the chunk as 4096 bytes and we read our file chunk by chunk. Depending on the structure of your file this approach might be useful.

  • .foreach + #each_entry
file = nil

profile do
  file = File.foreach("large_1G.txt")
  file.each_entry { |line| line }
end

Objects Freed: 4100809
Time: 2.22 seconds
Memory usage: 53.02 MB

Here we create an Enumerator instance as file and read the file line by line using the each_entry method.

The first thing we notice is that memory usage is much lower. The main reason is that we read the file line by line, and once a line has been processed it can be garbage collected. We can see that in the Objects Freed count, which is quite high.
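
As a rough sanity check of that Objects Freed number, we can estimate how many lines the test file should contain. The sketch assumes that the random bytes are uniformly distributed, so about one byte in 256 is a newline. The helper name expected_line_count is made up for this example.

```ruby
# Hypothetical helper: estimates the number of lines in a file of uniformly
# random bytes, where on average one byte in 256 is "\n".
def expected_line_count(file_size_bytes)
  file_size_bytes / 256
end

one_megabyte = 1024 * 1024
file_size = 1000 * one_megabyte  # same size as the generated test file

puts expected_line_count(file_size)  # 4096000
```

That estimate is close to the roughly 4.1 million objects freed in the runs above, which is consistent with almost all of them being the per-line Strings.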

We also tried the #advise method, which lets us tell the operating system how we intend to access the file. More about IO#advise can be found in the documentation. Unfortunately, it didn't help us out here.

Besides IO#each, there are similar methods such as IO#each_byte (reading byte by byte), IO#each_char (reading char by char) and IO#each_codepoint.

In the example with reading by chunks (IO#read) the memory usage will vary depending on the chunk size. If you find this way useful you can experiment with the chunk size.
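
One variation that may be worth trying when experimenting with chunks: IO#read accepts an optional second argument, a String buffer that is filled with each chunk instead of allocating a new String on every call. The sketch below is only an illustration, with a StringIO standing in for the large file and a made-up helper name count_bytes_in_chunks. It was not part of the measurements above.

```ruby
require 'stringio'

# Hypothetical helper: walks through an IO in chunks while reusing a single
# buffer, and returns the total number of bytes seen.
def count_bytes_in_chunks(io, chunk_size)
  buffer = String.new  # reused for every chunk
  total = 0
  # read returns nil at end of file when a positive length is given
  while io.read(chunk_size, buffer)
    total += buffer.bytesize
  end
  total
end

io = StringIO.new('x' * 10_000)
puts count_bytes_in_chunks(io, 4096)  # 10000
```

Because the buffer is overwritten on each iteration, any data that needs to outlive the loop body has to be copied out first.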

When using IO.foreach without a block we operate on an Enumerator, which gives us a few more methods from Enumerable, like each_entry, each_slice and each_cons. There is also the lazy method, which returns an Enumerator::Lazy. A lazy Enumerator evaluates chained methods such as map and select only on an as-needed basis. If you don't need to read the entire file but are, for example, looking for a particular line containing a given expression, then it might be worth checking out.
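
Here is a minimal sketch of that lazy approach: it chains map and select over the lines of a file and stops reading as soon as enough matches are found. The helper name first_matching_lines and the sample content are assumptions for the example.

```ruby
require 'tempfile'

# Hypothetical helper: returns up to `limit` lines (without the trailing
# newline) that match `pattern`, reading no further than necessary.
def first_matching_lines(path, pattern, limit)
  File.foreach(path)
      .lazy
      .map(&:chomp)
      .select { |line| line.match?(pattern) }
      .first(limit)
end

sample = Tempfile.new('sample')
sample.write("alpha\nbeta\nalphabet\ngamma\nalpha2\n")
sample.flush

puts first_matching_lines(sample.path, /alpha/, 2).inspect  # ["alpha", "alphabet"]
```

Without lazy, map and select would each build a full Array of all lines before first is applied.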

I could finish the article at this point, but what if before we even start reading the file we need to decrypt it? Let's move further to the example.

Decrypting large file and processing it line by line

Prerequisites

Before we decrypt the file we need to encrypt our generated large file. We are going to use AES with a 256-bit key in Cipher Block Chaining (CBC) mode.

require 'openssl'

cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
KEY = cipher.random_key
IV = cipher.random_iv

Now, let's encrypt our file:

cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
cipher.key = KEY
cipher.iv = IV

file = nil
enc_file = nil

profile do
  file = File.read("large_1G.txt")
  enc_file = File.open("large_1G.txt.enc", "wb")
  enc_file << cipher.update(file)
  enc_file << cipher.final
end

# file is a String here (returned by File.read), so only enc_file needs closing
enc_file.close

Objects Freed: 12
Time: 3.6 seconds
Memory usage: 1000.02 MB

It seems that encrypting is also quite a memory-consuming task. Let's adjust the algorithm a little bit:

cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
cipher.key = KEY
cipher.iv = IV

file = nil
enc_file = nil

profile do 
  buf = ""
  file = File.open("large_1G.txt", "rb")
  enc_file = File.open("large_1G.txt.enc", "wb")
  while buf = file.read(4096)
    enc_file << cipher.update(buf)
  end
  enc_file << cipher.final
end

file.close
enc_file.close

Objects Freed: 768048
Time: 5.05 seconds
Memory usage: 145.93 MB

Changing the algorithm to read and encrypt the file in chunks made the task much less memory consuming.
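
The same chunked idea can be wrapped in a small helper that also closes both files through the block form of File.open. This is only a sketch under a few assumptions: the helper names build_cipher and transform_file are made up, the cipher is created with OpenSSL::Cipher.new('aes-256-cbc'), and small temporary files stand in for the 1GB file. The round trip at the end is there only to check the sketch.

```ruby
require 'openssl'
require 'securerandom'
require 'tempfile'

# Hypothetical helper: builds an AES-256-CBC cipher in the requested mode.
def build_cipher(mode, key, iv)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  mode == :encrypt ? cipher.encrypt : cipher.decrypt
  cipher.key = key
  cipher.iv = iv
  cipher
end

# Hypothetical helper: streams src_path through the cipher into dst_path.
# The block form of File.open closes both files even if an error is raised.
def transform_file(cipher, src_path, dst_path, chunk_size = 4096)
  File.open(src_path, 'rb') do |src|
    File.open(dst_path, 'wb') do |dst|
      while (chunk = src.read(chunk_size))
        dst << cipher.update(chunk)
      end
      dst << cipher.final
    end
  end
end

key = SecureRandom.random_bytes(32)  # 256-bit key
iv = SecureRandom.random_bytes(16)   # one AES block

plain = Tempfile.new('plain')
plain.write("some line\n" * 1000)
plain.flush
encrypted = Tempfile.new('encrypted')
decrypted = Tempfile.new('decrypted')

transform_file(build_cipher(:encrypt, key, iv), plain.path, encrypted.path)
transform_file(build_cipher(:decrypt, key, iv), encrypted.path, decrypted.path)

puts File.binread(decrypted.path) == File.binread(plain.path)  # true
```

Only one chunk and its transformed counterpart are alive at a time, so memory use should stay roughly independent of the file size.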

Decrypt

All right, let's try to decrypt it now:

decipher = OpenSSL::Cipher::AES256.new(:CBC)
decipher.decrypt
decipher.key = KEY
decipher.iv = IV

dec_file = nil
enc_file = nil

profile do 
  buf = ""
  enc_file = File.open("large_1G.txt.enc", "rb")
  dec_file = File.open("large_1G.txt.dec", "wb")
  while buf = enc_file.read(4096)
    dec_file << decipher.update(buf)
  end
  dec_file << decipher.final
end

dec_file.close
enc_file.close

Objects Freed: 768050
Time: 3.5 seconds
Memory usage: 152.12 MB

Now, let's compare our files to check whether we encrypted and decrypted them properly:

❯ diff large_1G.txt large_1G.txt.dec

No differences were found. We are good here!
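
If you prefer to stay in Ruby for this check, the standard library's FileUtils.compare_file does a similar job to diff here. A minimal sketch, with a made-up wrapper name files_identical? and temporary files in place of the large ones:

```ruby
require 'fileutils'
require 'tempfile'

# Hypothetical wrapper: true when both files have exactly the same content.
def files_identical?(path_a, path_b)
  FileUtils.compare_file(path_a, path_b)
end

original = Tempfile.new('original')
original.write("same content\n")
original.flush

copy = Tempfile.new('copy')
copy.write("same content\n")
copy.flush

changed = Tempfile.new('changed')
changed.write("other content\n")
changed.flush

puts files_identical?(original.path, copy.path)     # true
puts files_identical?(original.path, changed.path)  # false
```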

We managed to lower the memory usage quite significantly. That's great!

Treat this article as a toolset that you can use in your specific case.

This article was originally posted on my personal dev blog: https://tjay.dev/

Photo by Erwan Hesry on Unsplash
