Under the Hood: “Slurping” and Streaming Files in Ruby
Jeff Kreeftmeijer
Posted on July 10, 2018
In this edition of Ruby Magic, we'll learn about streaming files in Ruby, how the IO class handles reading files without completely loading them into memory, and how it reads files per line by buffering read bytes. Let's dive right in!
“Slurping” and Streaming Files
Ruby's File.read method reads a file and returns its full content.
irb(main):001:0> content = File.read("log/production.log")
=> "I, [2018-06-27T16:45:02.843719 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" for 127.0.0.1 at 2018-06-27 16:45:02 +0200\nI, [2018-06-27T16:45:02.846719 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Processing by ArticlesController#index as HTML\nI, [2018-06-27T16:45:02.848212 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Rendering articles/index.html.erb within layouts/application\nD, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Article Load (0.3ms) SELECT \"articles\".* FROM \"articles\"\nI, [2018-06-27T16:45:02.850901 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Rendered articles/index.html.erb within layouts/application (1.7ms)\nI, [2018-06-27T16:45:02.851633 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Completed 200 OK in 5ms (Views: 3.4ms | ActiveRecord: 0.3ms)\n"
Internally, this opens the file, reads its content, closes the file, and returns the content as a single string. Because the file is "slurped" in one go, all of its content is kept in memory until it's cleaned up by Ruby's garbage collector.
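As a rough sketch of what that means, the snippet below spells out the same steps by hand. The temporary file and its contents are made up for illustration, and the block form of File.open is only one way to express the open, read and close sequence; it is not meant as a description of how Ruby implements File.read internally.

```ruby
require "tempfile"

# Hypothetical sample file, created only so the example is self-contained.
sample = Tempfile.new("slurp")
sample.write("first line\nsecond line\n")
sample.close

# File.read returns the whole file as one String.
slurped = File.read(sample.path)

# Roughly the same steps written out: open the file, read everything,
# and let the block form of File.open close it again.
by_hand = File.open(sample.path) { |file| file.read }

puts slurped == by_hand   # true
puts slurped.bytesize     # 23
```

Either way, the complete content ends up in a single String, which is what makes this approach costly for large files.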
As an example, let's say we'd like to uppercase all characters in a file and write it to another file. Using File.read, we can get the content, call String#upcase on the resulting string, and pass the uppercased string to File.write.
irb> upcased = File.read("log/production.log").upcase
=> "I, [2018-06-27T16:45:02.843719 #9098] INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] STARTED GET \"/ARTICLES\" FOR 127.0.0.1 AT 2018-06-27 16:45:02 +0200\nI, [2018-06-27T16:45:02.846719 #9098] INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] PROCESSING BY ARTICLESCONTROLLER#INDEX AS HTML\nI, [2018-06-27T16:45:02.848212 #9098] INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] RENDERING ARTICLES/INDEX.HTML.ERB WITHIN LAYOUTS/APPLICATION\nD, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] ARTICLE LOAD (0.3MS) SELECT \"ARTICLES\".* FROM \"ARTICLES\"\nI, [2018-06-27T16:45:02.850901 #9098] INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] RENDERED ARTICLES/INDEX.HTML.ERB WITHIN LAYOUTS/APPLICATION (1.7MS)\nI, [2018-06-27T16:45:02.851633 #9098] INFO -- : [86A5D18C-19DD-4CBF-9D7A-461C79E98C22] COMPLETED 200 OK IN 5MS (VIEWS: 3.4MS | ACTIVERECORD: 0.3MS)\n"
irb> File.write("log/upcased.log", upcased)
=> 896
While that works for small files, reading the whole file into memory might be problematic when dealing with larger files. For instance, when parsing a 14-gigabyte log file, reading the whole file at once would be an expensive operation. The content of the file is kept in memory, so the app's memory footprint grows considerably. This can eventually lead to memory swapping and the OS killing the app's process.
Luckily, Ruby allows reading files line by line using File.foreach. Instead of reading the file's full content at once, it executes the passed block for each line.
If no block is passed, it returns an Enumerator object instead, so the lines can be processed with Enumerable methods. Both forms enable the reading of bigger files without having to load all their content into memory at once.
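When no block is given, the returned Enumerator can be chained with other Enumerable methods. The following is a small sketch of that; the temporary file and the helper name first_matching_lines are hypothetical. Calling lazy means lines are only read until enough matches have been found.

```ruby
require "tempfile"

# Hypothetical helper: returns the first `limit` lines matching `pattern`,
# reading the file lazily instead of loading it completely.
def first_matching_lines(path, pattern, limit)
  # Without a block, File.foreach returns an Enumerator we can chain on.
  File.foreach(path).lazy.grep(pattern).first(limit)
end

# Made-up sample data so the example can run on its own.
log = Tempfile.new("sample_log")
log.write("INFO boot\nDEBUG query\nINFO request\nINFO done\n")
log.close

p File.foreach(log.path).class                 # Enumerator
p first_matching_lines(log.path, /INFO/, 2)    # ["INFO boot\n", "INFO request\n"]
```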
irb> File.foreach("log/production.log") { |line| p line }
"I, [2018-06-27T16:45:02.843719 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" for 127.0.0.1 at 2018-06-27 16:45:02 +0200\n"
"I, [2018-06-27T16:45:02.846719 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Processing by ArticlesController#index as HTML\n"
"I, [2018-06-27T16:45:02.848212 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Rendering articles/index.html.erb within layouts/application\n"
"D, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Article Load (0.3ms) SELECT \"articles\".* FROM \"articles\"\n"
"I, [2018-06-27T16:45:02.850901 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Rendered articles/index.html.erb within layouts/application (1.7ms)\n"
"I, [2018-06-27T16:45:02.851633 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Completed 200 OK in 5ms (Views: 3.4ms | ActiveRecord: 0.3ms)\n"
To uppercase a whole file, we read the input file line by line, uppercase each line, and append it to the output file.
irb> File.open("upcased.log", "a") do |output|
irb*   File.foreach("production.log") { |line| output.write(line.upcase) }
irb> end
=> nil
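The same idea can be wrapped in a small method. The sketch below is an illustration rather than part of the example above: the method name upcase_file and the temporary files are hypothetical, and it opens the output in "w" mode so that running it twice doesn't append the content a second time.

```ruby
require "tempfile"

# Hypothetical helper: streams `source` into `target` line by line,
# so the input never has to be held in memory as a whole.
def upcase_file(source, target)
  File.open(target, "w") do |output|
    File.foreach(source) { |line| output.write(line.upcase) }
  end
end

# Made-up input and output files so the example can run on its own.
input = Tempfile.new("input")
input.write("Started GET\nCompleted 200 OK\n")
input.close

output = Tempfile.new("output")
output.close

upcase_file(input.path, output.path)
puts File.read(output.path)
# STARTED GET
# COMPLETED 200 OK
```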
So, how does reading a file line by line work without having to first read the whole file? To understand that, we'll have to peel back some of the layers around reading files. Let's take a closer look at Ruby's IO class.
I/O and Ruby's IO Class
Even though File.read and File.foreach exist, the documentation for the File class doesn't list them. In fact, you won't find any of the file reading or writing methods in the File class documentation, because they are inherited from the parent IO class.
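This relationship can be checked from irb or a script. The snippet below only uses Ruby's reflection methods (Class#superclass, Method#owner and UnboundMethod#owner); the comments show what to expect on CRuby.

```ruby
# File is a direct subclass of IO.
p File.superclass                      # IO

# The class methods File.read and File.foreach are defined on IO's
# singleton class, and File simply inherits them.
p File.method(:read).owner             # #<Class:IO>
p File.method(:foreach).owner          # #<Class:IO>

# Instance methods such as #gets come from IO as well.
p File.instance_method(:gets).owner    # IO
```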
I/O
An I/O device is a device that transfers data to or from a computer, such as a keyboard, a display or a hard drive. It performs Input/Output, or I/O, by reading or producing streams of data.
Reading and writing files from the hard drive is the most common I/O you’ll encounter. Other types of I/O include socket communication, logging output to your terminal and input from your keyboard.
The IO class in Ruby handles all input and output, like reading from and writing to files. Because reading files is no different from reading from any other I/O stream, the File class directly inherits methods like IO.read and IO.foreach.
irb> IO.foreach("log/production.log") { |line| p line }
"I, [2018-06-27T16:45:02.843719 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" for 127.0.0.1 at 2018-06-27 16:45:02 +0200\n"
"I, [2018-06-27T16:45:02.846719 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Processing by ArticlesController#index as HTML\n"
"I, [2018-06-27T16:45:02.848212 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Rendering articles/index.html.erb within layouts/application\n"
"D, [2018-06-27T16:45:02.850020 #9098] DEBUG -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Article Load (0.3ms) SELECT \"articles\".* FROM \"articles\"\n"
"I, [2018-06-27T16:45:02.850901 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Rendered articles/index.html.erb within layouts/application (1.7ms)\n"
"I, [2018-06-27T16:45:02.851633 #9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Completed 200 OK in 5ms (Views: 3.4ms | ActiveRecord: 0.3ms)\n"
File.foreach is equivalent to IO.foreach, so the IO class version can be used to get the same result we did previously.
Reading I/O Streams Via the Kernel
Internally, the reading and writing abilities of Ruby's IO class are abstractions around kernel system calls. The operating system's kernel takes care of reading from and writing to I/O devices.
Opening Files
IO.sysopen opens a file by asking the kernel to put a reference to the file in the file table and to create a file descriptor in the process's file descriptor table.
File Descriptors and the File Table
Opening a file returns a file descriptor — an integer used to access the I/O resource.
Each process has its own file descriptor table to keep the file descriptors in memory, and each descriptor points to an entry in the system-wide file table.
To read from or write to an I/O resource, the process passes the file descriptor to the kernel through a system call. The kernel then accesses the file on behalf of the process, as processes don’t have access to the file table.
Opening files will not keep their content in memory, but the file descriptor table can fill up, so it's good practice to always close files after opening them. Methods that wrap File.open, like File.read, do this automatically, as do the ones taking a block.
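Here is a short sketch of the difference between the two styles; the temporary file is made up for the example. The block form closes the file when the block returns, while a file opened without a block stays open until close is called.

```ruby
require "tempfile"

# Made-up sample file so the example can run on its own.
sample = Tempfile.new("descriptors")
sample.write("hello\n")
sample.close

# Block form: the file is closed automatically when the block returns.
block_file = nil
File.open(sample.path) do |file|
  block_file = file
  puts file.fileno.class   # Integer, the file descriptor
end
puts block_file.closed?    # true

# Without a block we are responsible for closing the file ourselves.
manual_file = File.open(sample.path)
begin
  puts manual_file.read    # hello
ensure
  manual_file.close        # releases the file descriptor
end
puts manual_file.closed?   # true
```

Using ensure in the second variant makes sure the descriptor is released even if reading raises an exception.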
In this example, we'll go one step further by calling the IO.sysopen method directly. By passing a filename, the method creates a file descriptor we can use to reference the open file later.
irb> IO.sysopen("log/production.log")
=> 9
To create an IO instance for Ruby to read from and write to, we pass the file descriptor to IO.new.
irb> file_descriptor = IO.sysopen("log/production.log")
=> 9
irb> io = IO.new(file_descriptor)
=> #<IO:fd 9>
To close an I/O stream and remove the reference to the file from the file table, we call IO#close on the IO instance.
irb> io.close
=> nil
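Put together, the open, wrap and close sequence could look like the sketch below. The temporary file is hypothetical, and the descriptor number differs between runs and systems, which is why the code only checks that it's an Integer.

```ruby
require "tempfile"

# Made-up sample file so the example can run on its own.
sample = Tempfile.new("sysopen")
sample.write("some content\n")
sample.close

# Ask the kernel to open the file; we get back a file descriptor.
file_descriptor = IO.sysopen(sample.path)
puts file_descriptor.is_a?(Integer)   # true

# Wrap the descriptor in an IO object so Ruby can read from it.
io = IO.new(file_descriptor)
puts io.fileno == file_descriptor     # true
puts io.sysread(4)                    # some

# Closing the IO object releases the descriptor again.
io.close
puts io.closed?                       # true

# Reading from a closed stream raises an IOError.
begin
  io.sysread(4)
rescue IOError => error
  puts error.class                    # IOError
end
```

Note that a closed stream can't be reused: to read from the file again, it has to be opened again with IO.sysopen and IO.new.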
Reading Bytes and Moving Cursors
IO#sysread reads a number of bytes from an IO object.
irb> io.sysread(64)
=> " [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" "
This example uses an IO instance created like before, by passing the file descriptor integer to IO.new (the stream has to be open, so the file needs to be opened again after calling IO#close). It reads and returns the first 64 bytes from the file by calling IO#sysread with 64 as its argument.
irb> io.sysread(64)
=> "for 127.0.0.1 at 2018-06-27 16:45:02 +0200\nI, [2018-06-27T16:45:"
The first time we requested bytes from the file, the cursor was moved automatically, so calling IO#sysread on the same instance again will produce the next 64 bytes of the file.
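The moving cursor is easier to follow with a file whose content is known. The sketch below uses a made-up 20-byte temporary file and reads it in chunks of 8 bytes; it also shows what happens once the end of the file is reached.

```ruby
require "tempfile"

# Made-up 20-byte sample file so the positions are easy to follow.
sample = Tempfile.new("cursor")
sample.write("0123456789abcdefghij")
sample.close

io = IO.new(IO.sysopen(sample.path))

first  = io.sysread(8)   # bytes 0..7
second = io.sysread(8)   # bytes 8..15, because the cursor moved forward
rest   = io.sysread(8)   # only 4 bytes are left, so we get fewer than 8

p first    # "01234567"
p second   # "89abcdef"
p rest     # "ghij"

# At the end of the file, sysread raises EOFError instead of returning "".
begin
  io.sysread(8)
rescue EOFError
  puts "end of file reached"
ensure
  io.close
end
```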
Moving the Cursor
IO#sysseek manually moves the cursor to a location in the file.
irb> io.sysseek(32)
=> 32
irb> io.sysread(64)
=> "9098] INFO -- : [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started "
irb> io.sysseek(0)
=> 0
irb> io.sysread(64)
=> " [86a5d18c-19dd-4cbf-9d7a-461c79e98c22] Started GET \"/articles\" "
In this example, we move to position 32, then read 64 bytes using IO#sysread. By calling IO#sysseek again with 0, we jump back to the beginning of the file, allowing us to read the first 64 bytes again.
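IO#sysseek also accepts a second argument that says what the offset is relative to. The sketch below tries the IO::SEEK_SET (the default), IO::SEEK_CUR and IO::SEEK_END constants on a made-up 20-byte temporary file.

```ruby
require "tempfile"

# Made-up 20-byte sample file so the positions are easy to follow.
sample = Tempfile.new("seek")
sample.write("0123456789abcdefghij")
sample.close

io = IO.new(IO.sysopen(sample.path))

p io.sysseek(10)                 # 10, an absolute position (IO::SEEK_SET is the default)
p io.sysread(5)                  # "abcde"

# Seeking 0 bytes relative to the current position reports where the cursor is.
p io.sysseek(0, IO::SEEK_CUR)    # 15

# Negative offsets relative to the end of the file are allowed too.
p io.sysseek(-4, IO::SEEK_END)   # 16
p io.sysread(4)                  # "ghij"

p io.sysseek(0)                  # 0, back at the start
p io.sysread(3)                  # "012"

io.close
```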
Reading Files Line by Line
Now we know how the IO class's convenience methods open I/O streams, read bytes from them, and move the cursor's position.
Methods like IO.foreach and IO#gets return a file's content line by line instead of per number of bytes. There's no performant way of looking ahead to find the next newline and take all bytes until that position, so Ruby needs to take care of splitting the file's content into lines itself.
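class MyIO
  def initialize(filename)
    fd = IO.sysopen(filename)
    @io = IO.new(fd)
  end

  def each(&block)
    # Keep reading lines until IO#sysread raises EOFError.
    loop do
      line = ""
      # Take one byte at a time until we reach the line separator
      # ($/, which is "\n" by default).
      while (c = @io.sysread(1)) != $/
        line << c
      end
      block.call(line)
    end
  rescue EOFError
    # The end of the file was reached, so the stream can be closed.
    @io.close
  end
end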
In this example implementation, the #each method takes bytes from the file using IO#sysread one at a time, until the byte is $/, Ruby's input record separator, which is a newline by default. When it finds a newline, it stops taking bytes and calls the passed block with that line.
This solution works but is inefficient, as it calls IO#sysread for every byte in the file.
Buffering File Content
Ruby is smarter about how it does this by keeping an internal buffer of the file's content. Instead of reading the file one byte at a time, it reads a larger chunk at once (512 bytes in the example below; the exact buffer size is an implementation detail of the interpreter) and checks if there are any newlines in the returned bytes. If there are, it returns the portion before the newline and keeps the rest in memory as a buffer. If the buffer doesn't include a newline, it fetches another chunk until it finds one.
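class MyIO
  def initialize(filename)
    fd = IO.sysopen(filename)
    @io = IO.new(fd)
    @buffer = ""
  end

  def each(&block)
    # Keep reading lines until IO#sysread raises EOFError.
    loop do
      # Fill the buffer in chunks of 512 bytes until it holds a full line.
      @buffer << @io.sysread(512) until @buffer.include?($/)
      # Split off the first line; the remainder becomes the new buffer.
      line, @buffer = @buffer.split($/, 2)
      block.call(line)
    end
  rescue EOFError
    # The end of the file was reached, so the stream can be closed.
    @io.close
  end
end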
In this example, the #each method adds bytes to an internal @buffer variable in chunks of 512 bytes until the @buffer variable includes a newline. When that happens, it splits the buffer at the first newline. The first part is the line, and the second part is the new buffer.
The passed block is then called with the line and the remaining @buffer is kept for use in the next loop.
By buffering the file's content, the number of I/O calls is reduced while still dividing the file into logical chunks.
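To get a feel for the difference, the sketch below counts how many IO#sysread calls are needed to walk through the same file with different chunk sizes. The helper count_reads, the sample data and the chunk size of 512 are illustrative assumptions and say nothing about the buffer size Ruby uses internally.

```ruby
require "tempfile"

# Hypothetical helper: reads the whole file in chunks of `chunk_size` bytes
# and returns how many successful sysread calls that took.
def count_reads(path, chunk_size)
  io = IO.new(IO.sysopen(path))
  reads = 0
  loop do
    io.sysread(chunk_size)
    reads += 1
  end
rescue EOFError
  # sysread raises EOFError once there is nothing left to read.
  reads
ensure
  io&.close
end

# Made-up sample data: 100 lines of 20 bytes each, 2000 bytes in total.
sample = Tempfile.new("reads")
sample.write(("x" * 19 + "\n") * 100)
sample.close

puts count_reads(sample.path, 1)     # 2000, one call per byte
puts count_reads(sample.path, 512)   # 4, three full chunks plus the remainder
```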
Streaming Files
To summarize, streaming files works by asking the operating system's kernel to open a file, then reading bytes from it piece by piece. When reading a file per line in Ruby, data is taken from the file in chunks (512 bytes at a time in our example) and split up into "lines" after that.
This concludes our overview of I/O and streaming files in Ruby. We'd love to know what you thought of this article, or if you have any questions. We're always on the lookout for topics to investigate and explain, so if there's anything magical in Ruby you'd like to read about, don't hesitate to leave a comment.