Why does Ruby's IO readlines method behave differently when followed by a filter?

I'm building a little Wordle inspired project for fun and am gathering the words from my local dictionary. Originally I was doing this:

word_list = File.readlines("/usr/share/dict/words", chomp: true)
word_list.filter { |word| word.length == 5 }.map(&:upcase)

The first line takes absolutely ages. However, when doing this:

word_list = File.readlines("/usr/share/dict/words", chomp: true).filter { |word| word.length == 5 }.map(&:upcase)

it completes in a matter of seconds. I can't work out how the filter block could be applied to the lines as they are read, before they are held in memory (which I assume is what makes the read slow). Clearly each method isn't being fully applied before the next is called, but that is how I thought method chaining works.



Solution 1:[1]

Let's create a file.

File.write('t', "dog horse\npig porcupine\nowl zebra\n") #=> 34

then

a = File.readlines("t", chomp:true)
  #=> ["dog horse", "pig porcupine", "owl zebra"]

so your block variable word holds a string of two words. That's obviously not what you want.
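As a small sketch of that point, the length filter from the question can be applied to whole lines of a file like the one above. The temporary file and the helper name five_letter_lines are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: applies the question's length filter to whole lines.
def five_letter_lines(path)
  File.readlines(path, chomp: true).select { |line| line.length == 5 }
end

Tempfile.create('words') do |f|
  f.write("dog horse\npig porcupine\nowl zebra\n")
  f.flush
  # No line is exactly 5 characters long, so nothing survives the filter.
  p five_letter_lines(f.path) #=> []
end
```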

You could use IO::read to "gulp" the file into a string.

s = File.read("t")
  #=> "dog horse\npig porcupine\nowl zebra\n"

then

a = s.scan(/\w+/)
  #=> ["dog", "horse", "pig", "porcupine", "owl", "zebra"]
b = a.select { |word| word.size == 5 }
  #=> ["horse", "zebra"]
c = b.map(&:upcase)
  #=> ["HORSE", "ZEBRA"]

We could of course chain these operations:

File.read("t").scan(/\w+/).select { |word| word.size == 5 }.map(&:upcase)
  #=> ["HORSE", "ZEBRA"]

scan(/\w+/) matches each string of word characters (letters, digits and underscores). To match only letters change that to scan(/[a-zA-Z]+/).
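The difference between the two patterns shows up only when the text contains digits or underscores. A minimal sketch, using a made-up sample string and hypothetical helper names:

```ruby
# Runs of word characters: letters, digits and underscores.
def word_runs(str)
  str.scan(/\w+/)
end

# Runs of ASCII letters only.
def letter_runs(str)
  str.scan(/[a-zA-Z]+/)
end

sample = "zebra snake_1 h0rse" # hypothetical input
p word_runs(sample)   #=> ["zebra", "snake_1", "h0rse"]
p letter_runs(sample) #=> ["zebra", "snake", "h", "rse"]
```

Note that the letters-only pattern splits a token at every digit or underscore rather than discarding the token.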


You could use IO::readlines, which reads lines into an array: extract the words from each line, keep the ones having 5 characters, and then add those words, after upcasing, to a previously-defined empty array.

File.readlines('t').each_with_object([]) do |line,arr|
  line.scan(/\w+/)
      .select { |word| word.size == 5 }
      .map(&:upcase)
      .each { |word| arr << word }
end
  #=> ["HORSE", "ZEBRA"]

You could add the optional parameter chomp: true to readlines' arguments, but there is no reason to do so, because scan(/\w+/) never matches the trailing newline.
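A quick way to check that chomp makes no difference here is to extract the words both ways and compare. This is only a sketch; the temporary file and the helper name words_from are assumptions for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: extracts the words from every line of the file.
def words_from(path, chomp:)
  File.readlines(path, chomp: chomp).flat_map { |line| line.scan(/\w+/) }
end

Tempfile.create('words') do |f|
  f.write("dog horse\npig porcupine\nowl zebra\n")
  f.flush
  # The newline is not a word character, so both calls give the same words.
  p words_from(f.path, chomp: true) == words_from(f.path, chomp: false) #=> true
end
```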


Better would be to use IO::foreach which, without a block, returns an enumerator that can be chained, avoiding the temporary array of lines created by readlines.

File.foreach('t').with_object([]) do |line,arr|
  line.scan(/\w+/)
      .select { |word| word.size == 5 }
      .map(&:upcase)
      .each { |word| arr << word }
end
  #=> ["HORSE", "ZEBRA"]
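Because the enumerator includes Enumerable, the same result can probably be written more compactly with flat_map. The sketch below assumes a temporary file with the same contents as 't'; the helper name five_letter_words is hypothetical. It still builds an intermediate array of all words, but not an array of all lines.

```ruby
require 'tempfile'

# Hypothetical helper: reads the file line by line via the enumerator
# returned by File.foreach when no block is given.
def five_letter_words(path)
  File.foreach(path)
      .flat_map { |line| line.scan(/\w+/) } # words of each line, flattened
      .select { |word| word.size == 5 }
      .map(&:upcase)
end

Tempfile.create('words') do |f|
  f.write("dog horse\npig porcupine\nowl zebra\n")
  f.flush
  p File.foreach(f.path).class  #=> Enumerator
  p five_letter_words(f.path)   #=> ["HORSE", "ZEBRA"]
end
```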

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1