Optimal way of processing large files in Ruby

I was asked what the optimal way to process files in Ruby is. I had some assumptions, but they turned out to be wrong, so I'm writing this post for future reference (and for anyone out there interested in it).

Note: the "benchmarks" here are non-scientific; they are just a way to show how the examples differ from each other by orders of magnitude.

File.readlines/IO.readlines

This is by far the slowest. That's because this method reads the whole file, returning an array with one entry per line. That is very convenient, and with small files you might not even notice a problem.

For my test case, I created a 24 MB file, and loading it with readlines takes about 1.4 seconds:

→ time ruby -e "File.readlines('large.txt')"

real  0m1.352s
user  0m1.044s
sys   0m0.210s

The memory consumption was also quite high: on my machine it reached around 100 MB!
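If you want to reproduce the measurement, here is a rough sketch (assuming a Unix-like system where ps is available; large.txt is the test file):

def rss_mb
  # Resident set size of the current process, in MB (ps reports KB)
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

before = rss_mb
lines = File.readlines('large.txt') # one String per line, all in memory
puts "RSS grew by ~#{rss_mb - before} MB for #{lines.size} lines"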

File.read/IO.read

Faster than #readlines; however, it returns the whole file as one large string. This means the entire file is still loaded into memory, which is still not ideal.

→ time ruby -e "File.read('large.txt')"

real  0m0.392s
user  0m0.212s
sys   0m0.098s

Memory consumption was around 31 MB on my machine. The Ruby runtime itself takes about 7 MB; add the 24 MB string holding the file's contents and it matches the measured memory.
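If you do reach for File.read, note that you can still iterate over the resulting string line by line with String#each_line, without building an intermediate array:

contents = File.read('large.txt') # the whole file as one String
contents.each_line do |line|
  # process line; no Array of lines is allocated
end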

File#each/IO.foreach

This is where things get more interesting. File#each receives a block and yields each line to it. So far this is the best way to process a file sequentially, because the lines are not all loaded into memory at the same time.

→ time ruby -e "File.open('large.txt','r').each { |line| line }"

real  0m1.410s
user  0m1.231s
sys   0m0.089s

The total time is pretty similar to #readlines. Looking at the memory consumption at the end of the script, though, it was nearly the same as before loading the file: around 8 MB.
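In real code you would normally use the block form of File.open, which closes the file for you, or IO.foreach, which handles opening and closing itself:

File.open('large.txt', 'r') do |file|
  file.each do |line|
    # process line; only one line is held in memory at a time
  end
end

# Equivalent, and shorter:
IO.foreach('large.txt') do |line|
  # process line
end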

Note that I'm passing a dummy block { |line| line } to #each. That's because calling #each without a block returns an enumerator, which is a good thing!

Imagine that you want to find the first 10 lines that contain the string abcd. You could do that with:

IO.foreach('large.txt').grep(/abcd/).take(10)

That takes around 3 seconds on my machine. It could be faster, though, if we take advantage of Enumerable#lazy:

IO.foreach('large.txt').lazy.grep(/abcd/).take(10).to_a

That takes around 1 second, because lazy makes methods like grep, find, and reject evaluate elements only when they're needed. Pretty powerful!
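As a variation, you can stay lazy while also tracking line numbers. A sketch, assuming Ruby 2.7+ where Enumerator::Lazy#with_index is available:

IO.foreach('large.txt')
  .lazy
  .with_index(1)                         # yields [line, line_number] pairs
  .select { |line, _n| line =~ /abcd/ }
  .first(10)                             # stops reading after 10 matches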

Could it be faster?

I found out that you can "advise" the system on the type of read you're going to perform with IO#advise. For example, if you're doing a sequential read, you can call file.advise(:sequential). However, I didn't see much improvement in my tests.
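For reference, the advice has to be given on an open file object before reading; a minimal sketch:

File.open('large.txt') do |file|
  file.advise(:sequential) # hint to the OS: this file will be read sequentially
  file.each { |line| line }
end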

Do you know a better way? Let me know in the comments!