Under the Hood: “Slurping” and Streaming Files in Ruby

Jeff Kreeftmeijer

Posted on July 10, 2018

In this edition of Ruby Magic, we'll learn about streaming files in Ruby, how the IO class handles reading files without completely loading them into memory, and how it reads files per line by buffering read bytes. Let's dive right in!

“Slurping” and Streaming Files

Ruby's File.read method reads a file and returns its full content.

irb(main):001:0> content = File.read("log/production.log")
=> "I, [2018-06-27T16:45:02.843719 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" for 127.0.0.1 at 2018-06-27 16:45:02 +0200\nI, [2018-06-27T16:45:02.846719 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Processing by ArticlesController#index as HTML\nI, [2018-06-27T16:45:02.848212 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Rendering articles/index.html.erb within layouts/application\nD, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Article Load (0.3ms)  SELECT \"articles\".* FROM \"articles\"\nI, [2018-06-27T16:45:02.850901 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Rendered articles/index.html.erb within layouts/application (1.7ms)\nI, [2018-06-27T16:45:02.851633 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Completed 200 OK in 5ms (Views: 3.4ms | ActiveRecord: 0.3ms)\n"

Internally, this opens the file, reads its content, closes the file, and returns the content as a single string. Because the file is "slurped" all at once, its entire content stays in memory until Ruby’s garbage collector cleans it up.
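As a rough illustration of those four steps, here is a minimal sketch of a hand-written equivalent. The slurp helper and the temporary sample file are made up for this example, and Ruby's actual implementation lives in C, so treat this as a conceptual model only.

```ruby
require "tempfile"

# Hypothetical helper that mimics what File.read does conceptually:
# open the file, read everything, close it, and return one string.
def slurp(path)
  file = File.open(path)
  file.read
ensure
  # Runs whether or not reading succeeded, so the file is always closed.
  file&.close
end

# Create a small sample file so the sketch runs on its own.
sample = Tempfile.new("slurp-example")
sample.write("first line\nsecond line\n")
sample.close

puts slurp(sample.path) == File.read(sample.path) # prints "true"
```

Either way, the whole file ends up in a single String object.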

As an example, let's say we'd like to uppercase all characters in a file and write it to another file. Using File.read, we can get the content, call String#upcase on the resulting string, and pass the uppercased string to File.write.

irb> upcased = File.read("log/production.log").upcase
=> "I, [2018-06-27T16:45:02.843719 #9098]  INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] STARTED GET \"/ARTICLES\" FOR 127.0.0.1 AT 2018-06-27 16:45:02 +0200\nI, [2018-06-27T16:45:02.846719 #9098]  INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] PROCESSING BY ARTICLESCONTROLLER#INDEX AS HTML\nI, [2018-06-27T16:45:02.848212 #9098]  INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22]   RENDERING ARTICLES/INDEX.HTML.ERB WITHIN LAYOUTS/APPLICATION\nD, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22]   ARTICLE LOAD (0.3MS)  SELECT \"ARTICLES\".* FROM \"ARTICLES\"\nI, [2018-06-27T16:45:02.850901 #9098]  INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22]   RENDERED ARTICLES/INDEX.HTML.ERB WITHIN LAYOUTS/APPLICATION (1.7MS)\nI, [2018-06-27T16:45:02.851633 #9098]  INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] COMPLETED 200 OK IN 5MS (VIEWS: 3.4MS | ACTIVERECORD: 0.3MS)\n"
irb> File.write("log/upcased.log", upcased)
=> 896

While that works for small files, reading the whole file into memory might be problematic when dealing with larger files. For instance, when parsing a 14-gigabyte log file, reading the whole file at once would be an expensive operation. The content of the file is kept in memory, so the app's memory footprint grows considerably. This can eventually lead to memory swapping and the OS killing the app's process.

Luckily, Ruby allows reading files line by line using File.foreach. Instead of reading the file's full content at once, it will execute a passed block for each line.

Called with a block, it yields each line to that block; called without one, it returns an Enumerator object. Either way, bigger files can be read without having to load all their content into memory at once.

irb> File.foreach("log/production.log") { |line| p line }
"I, [2018-06-27T16:45:02.843719 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" for 127.0.0.1 at 2018-06-27 16:45:02 +0200\n"
"I, [2018-06-27T16:45:02.846719 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Processing by ArticlesController#index as HTML\n"
"I, [2018-06-27T16:45:02.848212 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Rendering articles/index.html.erb within layouts/application\n"
"D, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Article Load (0.3ms)  SELECT \"articles\".* FROM \"articles\"\n"
"I, [2018-06-27T16:45:02.850901 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Rendered articles/index.html.erb within layouts/application (1.7ms)\n"
"I, [2018-06-27T16:45:02.851633 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Completed 200 OK in 5ms (Views: 3.4ms | ActiveRecord: 0.3ms)\n"
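The Enumerator form is handy when only part of a file is needed. The sketch below assumes a made-up sample log and a hypothetical first_matching_lines helper; it chains the Enumerator with lazy so that reading stops as soon as enough lines have been found.

```ruby
require "tempfile"

# Sample log file so the sketch is self-contained (the contents are made up).
log = Tempfile.new("foreach-example")
log.write("INFO one\nDEBUG two\nINFO three\nINFO four\n")
log.close

# Without a block, File.foreach returns an Enumerator.
lines = File.foreach(log.path)
puts lines.class # prints "Enumerator"

# Hypothetical helper: return the first `count` lines matching `pattern`,
# reading only as far into the file as needed.
def first_matching_lines(path, pattern, count)
  File.foreach(path).lazy.grep(pattern).first(count)
end

p first_matching_lines(log.path, /INFO/, 2)
# prints ["INFO one\n", "INFO three\n"]
```

Because the enumeration is lazy, the last line of the sample file is never yielded.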

To uppercase a whole file, we read the input file line by line, uppercase each line, and append it to the output file.

irb> File.open("upcased.log", "a") do |output|
irb*   File.foreach("production.log") { |line| output.write(line.upcase) }
irb> end
=> nil

So, how does reading a file line by line work without having to first read the whole file? To understand that, we’ll have to peel back some of the layers around reading files. Let's take a closer look at Ruby's IO class.

I/O and Ruby's IO Class

Even though File.read and File.foreach exist, the documentation for the File class doesn’t list them. In fact, you won’t find any of the file reading or writing methods in the File class documentation, because they are inherited from the parent IO class.

I/O

An I/O device is a device that transfers data to or from a computer, for example keyboards, displays and hard drives. It performs Input/Output, or I/O, by reading or producing streams of data.

Reading and writing files from the hard drive is the most common I/O you’ll encounter. Other types of I/O include socket communication, logging output to your terminal and input from your keyboard.

The IO class in Ruby handles all input and output, including reading from and writing to files. Because reading a file is no different from reading any other I/O stream, the File class directly inherits methods like IO.read and IO.foreach.

irb> IO.foreach("log/production.log") { |line| p line }
"I, [2018-06-27T16:45:02.843719 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" for 127.0.0.1 at 2018-06-27 16:45:02 +0200\n"
"I, [2018-06-27T16:45:02.846719 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Processing by ArticlesController#index as HTML\n"
"I, [2018-06-27T16:45:02.848212 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Rendering articles/index.html.erb within layouts/application\n"
"D, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Article Load (0.3ms)  SELECT \"articles\".* FROM \"articles\"\n"
"I, [2018-06-27T16:45:02.850901 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22]   Rendered articles/index.html.erb within layouts/application (1.7ms)\n"
"I, [2018-06-27T16:45:02.851633 #9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Completed 200 OK in 5ms (Views: 3.4ms | ActiveRecord: 0.3ms)\n"

File.foreach is equivalent to IO.foreach, so the IO class version can be used to get the same result we did previously.
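The inheritance relationship can be checked from Ruby itself. This short sketch only uses reflection methods (superclass, ancestors, and Method#owner) to show where the class methods are defined.

```ruby
# File is a direct subclass of IO.
puts File.superclass             # prints "IO"
puts File.ancestors.include?(IO) # prints "true"

# The class methods used above are defined on IO's singleton class,
# so File.read and File.foreach resolve to the IO implementations.
p File.method(:read).owner    # prints #<Class:IO>
p File.method(:foreach).owner # prints #<Class:IO>
```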

Reading I/O Streams Via the Kernel

Internally, the reading and writing abilities of Ruby's IO class are abstractions around kernel system calls. The operating system's kernel takes care of reading from and writing to I/O devices.

Opening Files

IO.sysopen opens a file by asking the kernel to put a reference to the file in the file table and creating a file descriptor in the process' file descriptor table.

File Descriptors and the File Table

Opening a file returns a file descriptor — an integer used to access the I/O resource.

Each process has its own file descriptor table to keep the file descriptors in memory, and each descriptor points to an entry in the system-wide file table.

To read from or write to an I/O resource, the process passes the file descriptor to the kernel through a system call. The kernel then accesses the file on behalf of the process, as processes don’t have access to the file table.

Opening a file doesn't load its content into memory, but the file descriptor table can fill up, so it’s good practice to always close files after opening them. Methods that wrap File.open, like File.read, do this automatically, as do the methods that take a block.
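To make the automatic closing visible, the following sketch opens a file with the block form of File.open and checks the file object afterwards. The inspect_open_file helper and the sample file are invented for illustration, and the descriptor number will differ from run to run.

```ruby
require "tempfile"

sample = Tempfile.new("fd-example")
sample.write("hello\n")
sample.close

# Hypothetical helper: open a file with a block and report what we saw.
def inspect_open_file(path)
  opened = nil
  descriptor = File.open(path) do |file|
    opened = file # keep a reference so we can check it after the block
    file.fileno   # the integer file descriptor for this open file
  end
  { descriptor: descriptor, closed_after_block: opened.closed? }
end

p inspect_open_file(sample.path)
# :descriptor is an Integer; :closed_after_block is true
```

The block form releases the descriptor even when the block raises, which is why it is usually preferred over calling close by hand.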

In the next example, we'll go one step further by calling the IO.sysopen method directly. By passing a filename, the method creates a file descriptor we can use to reference the open file later.

irb> IO.sysopen("log/production.log")
=> 9

To create an IO instance for Ruby to read from and write to, we pass the file descriptor to IO.new:

irb> file_descriptor = IO.sysopen("log/production.log")
=> 9
irb> io = IO.new(file_descriptor)
=> #<IO:fd 9>

To close an I/O stream and remove the reference to the file from the file table, we call IO#close on the IO instance.

irb> io.close
=> nil
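Put together, the open, wrap, and close steps might be sketched like this. The with_raw_io helper is hypothetical (it is not part of Ruby), and the sample file is created only so the code can run on its own. It also shows that a closed stream can no longer be read from.

```ruby
require "tempfile"

sample = Tempfile.new("sysopen-example")
sample.write("hello world\n")
sample.close

# Hypothetical helper: open a descriptor with IO.sysopen, wrap it in an
# IO instance, yield it, and always close it afterwards.
def with_raw_io(path)
  fd = IO.sysopen(path)
  io = IO.new(fd)
  yield io
ensure
  io&.close
end

closed_io = nil
first_bytes = with_raw_io(sample.path) do |io|
  closed_io = io
  io.sysread(5)
end

puts first_bytes       # prints "hello"
puts closed_io.closed? # prints "true"

begin
  closed_io.sysread(5)
rescue IOError => error
  puts error.class     # prints "IOError"
end
```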

Reading Bytes and Moving Cursors

IO#sysread reads a number of bytes from an IO object.

irb> io.sysread(64)
=> " [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" "

This example uses an IO instance created as before, by passing the file descriptor integer to IO.new (a stream that has been closed with IO#close can no longer be read, so it needs to be opened again first). It reads and returns the first 64 bytes from the file by calling IO#sysread with 64 as its argument.

irb> io.sysread(64)
=> "for 127.0.0.1 at 2018-06-27 16:45:02 +0200\nI, [2018-06-27T16:45:"

The first time we requested bytes from the file, the cursor was moved automatically, so calling IO#sysread on the same instance again will produce the next 64 bytes of the file.
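The moving cursor is easy to see with a tiny file. In this sketch, read_chunks is a hypothetical helper that keeps calling IO#sysread with the same size; each call continues where the previous one stopped, and IO#sysread raises EOFError once nothing is left.

```ruby
require "tempfile"

sample = Tempfile.new("sysread-example")
sample.write("abcdefghij")
sample.close

# Hypothetical helper: read a file in fixed-size chunks with IO#sysread,
# collecting chunks until the end of the file is reached.
def read_chunks(path, chunk_size)
  io = IO.new(IO.sysopen(path))
  chunks = []
  # Each call picks up at the cursor position left by the previous call.
  loop { chunks << io.sysread(chunk_size) }
rescue EOFError
  # IO#sysread signals the end of the file by raising EOFError.
  chunks
ensure
  io&.close
end

p read_chunks(sample.path, 4) # prints ["abcd", "efgh", "ij"]
```

Note that the last chunk is shorter than requested: IO#sysread returns whatever is left rather than failing.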

Moving the Cursor

IO#sysseek manually moves the cursor to a location in the file.

irb> io.sysseek(32)
=> 32
irb> io.sysread(64)
=> "9098]  INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started "
irb> io.sysseek(0)
=> 0
irb> io.sysread(64)
=> " [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" "

In this example, we move to position 32, then read 64 bytes using IO#sysread. By calling IO#sysseek again with 0, we jump back to the beginning of the file, allowing us to read the first 64 bytes again.
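IO#sysseek also accepts a second argument that changes what the offset is relative to. The sketch below uses a ten-byte sample file and a hypothetical seek_demo helper to show absolute positioning (IO::SEEK_SET, the default), positioning from the end (IO::SEEK_END), and positioning relative to the current cursor (IO::SEEK_CUR).

```ruby
require "tempfile"

sample = Tempfile.new("sysseek-example")
sample.write("0123456789")
sample.close

# Hypothetical helper: move the cursor in three different ways and
# return what IO#sysread produces after each move.
def seek_demo(path)
  io = IO.new(IO.sysopen(path))

  io.sysseek(4)                # absolute position 4 (IO::SEEK_SET is the default)
  absolute = io.sysread(2)     # reads "45", cursor is now at 6

  io.sysseek(-3, IO::SEEK_END) # three bytes before the end of the file
  from_end = io.sysread(3)     # reads "789", cursor is now at 10

  io.sysseek(-5, IO::SEEK_CUR) # five bytes back from the current position
  relative = io.sysread(1)     # reads "5"

  [absolute, from_end, relative]
ensure
  io&.close
end

p seek_demo(sample.path) # prints ["45", "789", "5"]
```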

Reading Files Line by Line

Now we know how the IO class's convenience methods open I/O streams, read bytes from them, and move the cursor.

Methods like IO.foreach and IO#gets return data line by line instead of by a number of bytes. The kernel only deals in bytes, and there's no performant way of looking ahead to find the next newline and take all bytes up to that position, so Ruby has to split the file's content into lines itself. A naive implementation might look like this:

class MyIO
  def initialize(filename)
    fd = IO.sysopen(filename)
    @io = IO.new(fd)
  end

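  def each(&block)
    # Loop instead of recursing, so long files don't exhaust the call stack.
    loop do
      line = +""

      # Read one byte at a time until we reach the record separator
      # ($/, which is "\n" by default).
      while (c = @io.sysread(1)) != $/
        line << c
      end

      block.call(line)
    end
  rescue EOFError
    # IO#sysread raises EOFError at the end of the file.
    @io.close
  end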
end

In this example implementation, the #each method takes bytes from the file one at a time using IO#sysread, until it reads $/, the input record separator, which is a newline by default. When it finds one, it stops taking bytes and calls the passed block with that line.

This solution works but is inefficient, as it calls IO#sysread for every byte in the file.

Buffering File Content

Ruby is smarter about this: it keeps an internal buffer of the file's content. Instead of reading the file one byte at a time, it reads a larger chunk at once (512 bytes in the example below; the real buffer size is an implementation detail of the interpreter) and checks if there are any newlines in the returned bytes. If there are, it returns the portion before the newline and keeps the rest in memory as a buffer. If the buffer doesn't include a newline, it fetches another chunk, and repeats until it finds one.

class MyIO
  def initialize(filename)
    fd = IO.sysopen(filename)
    @io = IO.new(fd)
    @buffer = ""
  end

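  def each(&block)
    # Loop instead of recursing, so long files don't exhaust the call stack.
    loop do
      # Fetch 512-byte chunks until the buffer holds at least one full line.
      @buffer << @io.sysread(512) until @buffer.include?($/)

      # Everything before the first separator is the line; the rest stays buffered.
      line, @buffer = @buffer.split($/, 2)

      block.call(line)
    end
  rescue EOFError
    @io.close
  end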
end

In this example, the #each method adds bytes to an internal @buffer variable in chunks of 512 bytes until @buffer includes a newline. When that happens, it splits the buffer at the first newline. The first part is the line, and the second part is the new buffer.

The passed block is then called with the line and the remaining @buffer is kept for use in the next loop.

By buffering the file's content, the number of I/O calls is reduced while still dividing the file into logical chunks.
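The difference in the number of calls can be measured directly. This sketch uses a hypothetical count_sysreads helper and a made-up 2,000-byte sample file to count how many IO#sysread calls it takes to consume the file with a given chunk size, including the final call that reports the end of the file.

```ruby
require "tempfile"

sample = Tempfile.new("buffer-example")
sample.write("line\n" * 400) # 400 lines of 5 bytes each: 2,000 bytes
sample.close

# Hypothetical helper: count the IO#sysread calls needed to read a whole
# file with the given chunk size.
def count_sysreads(path, chunk_size)
  io = IO.new(IO.sysopen(path))
  calls = 0
  loop do
    calls += 1 # counted before reading, so the call that hits EOF is included
    io.sysread(chunk_size)
  end
rescue EOFError
  calls
ensure
  io&.close
end

puts count_sysreads(sample.path, 1)   # prints 2001
puts count_sysreads(sample.path, 512) # prints 5
```

Reading in 512-byte chunks needs five calls where the byte-by-byte version needs 2,001, and the gap grows with the size of the file.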

Streaming Files

To summarize, streaming files works by asking the operating system's kernel to open a file, then reading bytes from it a chunk at a time. When reading a file per line in Ruby, data is taken from the file in chunks (512 bytes at a time in our example) and split into "lines" after that.

This concludes our overview of I/O and streaming files in Ruby. We'd love to know what you thought of this article, or if you have any questions. We're always on the lookout for topics to investigate and explain, so if there's anything magical in Ruby you'd like to read about, don't hesitate to leave a comment.
