Strings are cool little things. They let us do text in our programs, and what’s a program without text? What’re strings? Well, maybe they’re arrays of characters, or maybe they’re relatively lightweight objects. Or then again, maybe they contain Unicode characters and allow a ton of cool behavior. In Ruby, they tend towards being an array of characters and allowing a ton of cool behavior.

The trouble is that, most of the usual languages implement what we call ``lexicographic comparisons’’ (see the wikipedia article on lexicographic ordering) which, while usually good enough (and often faster), can be a problem when you have strings with numbers in them.

So let’s see how we can modify the Ruby String class to use natural ordering (rather than lexicographic ordering) in strings with numbers.

The sample we’ll be using is the string array:


 [ 'a2', 'a3', 'a10', 'a13', 'a13bb1' ]

That array’s in order for a human being, but not for a typical programming language, which will disagree. For example, in Ruby’s irb:


 >>  [ 'a2', 'a3', 'a10', 'a13', 'a13bb1' ].sort
 => ["a10", "a13", "a13bb1", "a2", "a3"]

Well that’s just nasty. What’s going on? Basically, lexicographic ordering compares each character, and, at the first unequal character, stops. So basically, as soon as we see that 1 < 2 (or, more specifically in this instance, that the ASCII representation of the character ‘1’, which is 49, is less than the representation of ‘2’, which is 50), we decide that ‘a10’ is less than ‘a2’. But even in a string, 2 < 10, right? Not when we calculate things this way. Oops.

While lexicographic ordering is faster, because it decides based on the first unequal character and therefore doesn’t have to run through the entire string, it’s still really annoying in this context. In Ruby, though, thanks to the open classes, we can modify the behavior of the <=> operator (the comparison operator) to behave appropriately. And then <, >, <=, and >= will all work correctly, since they’re based off of <=> thanks to the Comparable mixin.

In a nutshell, here’s the code:


 class String
   alias :old_cmp :'<=>'
   def <=>(other)
     return old_cmp(other) unless self =~ /\d/ 

     # Take non-numbers, get their byte values, then compare the numbers.
     val = self.to_comp_val
     other_val = other.to_comp_val

     val <=> other_val
   end

   protected
     # Sums up the byte values of the characters in the String.
     def to_comp_val
       sum = 0
       self.each_byte { |byte| sum += byte }
       sum
     end
 end

Here, we first turn the original <=> into old_cmp. We do this because we still call the old comparison if there are no numbers in the string (the first line of the new <=> method does that) [1]. Then, if that’s not the case, we sum the byte values for each String and stick them into two variables. Our return value is then a comparison of the full numeric representation of the string. This ensures that the numbers do what they should. My tests for this functionality are relatively simple and incomplete, but somewhat satisfactory:


 context 'string comparisons' do
   specify 'should compare with numbers appropriately' do
     ('a2' <=> 'a10').should.equal -1
     ('a2' < 'a10').should.be true
   end
   specify 'should order strings with numbers appropriately' do
     ['a2', 'a3', 'a10', 'a13', 'a13bb1'].sort { |first,second| first <=> second } .should.equal ['a2','a3','a10','a13','a13bb1']
   end
 end

Array#sort?

What in the world is this:


 ['a2', 'a3', 'a10', 'a13', 'a13bb1'].sort { |first,second| first <=> second } 

What’s with the block we passed it? Isn’t that what Array#sort is supposed to do anyway? The answer, it turns out, is yes… Except when we have Strings or Fixnums. Here’re a few lines of code from the Ruby interpreter (1595-1602 in array.c, in case you’re interested):


    if (FIXNUM_P(a) && FIXNUM_P(b)) {
    if ((long)a > (long)b) return 1;
    if ((long)a < (long)b) return -1;
    return 0;
    }
    if (TYPE(a) == T_STRING) {
    if (TYPE(b) == T_STRING) return rb_str_cmp(a, b);
    }

Aha! If we have fixnums, it does a comparison in C. If we have Strings, it uses a helper function used in the usual <=> operator, too. So it actually bypasses the <=> operator for the sake of speed. That kind of gets in the way when we do what we want to do. Unfortunately, there’s no good way of getting around this without passing a block, since the function that does that check is only in C, and is always invoked when sort is called without a block. The only way to shorten this is to define a separate Proc object and use it when you want to do the regular kind of comparison:


 comp = Proc.new { |o1, o2| o1 <=> o2 }

 # ...

 ary.sort(&comp)

Too bad, really, but at least we can make it work.

1 This may well slow things down unnecessarily, since it invokes the regular expressions engine. I’m not sure, to be honest, but at the end of the day it may be faster to just do the full comparison every time rather than invoke that overhead.

Sorry, comments are closed for this article.