Tuesday, November 6, 2012

Java Strings, toLowerCase(), and Locale

I fixed an interesting bug last night, and figured I'd mention it here for anybody who happens to be searching.

Basically, we got a customer service e-mail from a fellow who was having one of our products crash on his Galaxy S2.  I asked him to send over some logs, we went back and forth a few times, and eventually he discovered that if he set his phone to English, the product worked with no problem.  That's strange; we don't do anything special for different languages, do we?  As it turns out, we do.

Replicating the bug, I tracked it to Mesh.readFromText().  This is a function that reads the text version of our model format -- we have both a working binary and text-based path, since for some problems having a human-readable model format is extremely useful.  Almost everything we use is binary regardless, but in this case we're reading a text-formatted model.  The format is simple, and the top of the file looks like so:

TC 4

0:    1.0    1.0
1:    0.0    1.0
2:    0.0    0.0
3:    1.0    0.0


0:    0    1    3
1:    1    2    3



0:    1.0    1.0    0.0
1:    -1.0    1.0    0.0
2:    -1.0    -1.0    0.0
3:    1.0    -1.0    0.0
Now, what's happening is when it gets to the first set of windings, it tells me I've overrun an array and throws an exception.  After a bunch of experimentation, it turns out that when I have the device set to Turkish, it never recognizes the "winding" key as being hit, so it's still trying to read in texture coordinates.  Except the array is already full of those.  That's weird, why is it missing that key?

The reason is that I made the file parsing case-insensitive, and I did it by sending the current line through toLowerCase() before parsing it.  It's in no way obvious, but toLowerCase() with no arguments applies the casing rules of your current default locale, so the lower-case version of a string can come out with characters you didn't expect.  I'm still not sure which character does it, but when it was forcing "WINDING" to lower-case, one of those characters ended up different from the ASCII "winding".  Thus the key was never recognized.
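For anyone who wants to reproduce this without a Turkish-configured device, here's a minimal sketch.  It assumes plain desktop Java rather than our actual parser, and the class and method names are made up for illustration.  Under Turkish casing rules a capital 'I' lower-cases to the dotless 'ı' (U+0131), which is what the sketch checks for.

```java
import java.util.Locale;

public class TurkishLowerCaseDemo {
    // Hypothetical helper: lower-cases text with Turkish casing rules,
    // which is what toLowerCase() does by default on a Turkish-locale device.
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        String lowered = lowerTurkish("WINDING");

        // The two capital I's became dotless i (U+0131), so this no longer
        // matches the ASCII key.
        System.out.println(lowered.equals("winding"));              // false
        System.out.println(lowered.equals("w\u0131nd\u0131ng"));    // true

        // With an explicit English locale the result is the ASCII string.
        System.out.println("WINDING".toLowerCase(Locale.ENGLISH).equals("winding")); // true
    }
}
```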

This was a pretty harsh lesson to learn.  Luckily, it's easy to fix, as you can call toLowerCase with an argument specifying a locale, like so: myString.toLowerCase(Locale.ENGLISH).  That means you can at least count on it being consistent if you're doing something like parsing a file.  This last week or so has had more than one lesson about this.
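As a rough sketch of what the fix looks like in a parser, assuming a made-up helper rather than the real Mesh.readFromText() code: normalize each line with an explicit locale before comparing it against a key.  The names KeyNormalizer, normalize and isKey are hypothetical, and the main method forces the default locale to Turkish only to simulate the customer's phone.

```java
import java.util.Locale;

public class KeyNormalizer {
    // Hypothetical helper: trims a line and lower-cases it with a fixed
    // locale, so the result does not depend on the device's language setting.
    static String normalize(String line) {
        return line.trim().toLowerCase(Locale.ENGLISH);
    }

    // Hypothetical helper: true if the normalized line begins with the
    // given (already lower-case) key.
    static boolean isKey(String line, String key) {
        return normalize(line).startsWith(key);
    }

    public static void main(String[] args) {
        // Simulate a device whose language is set to Turkish.
        Locale.setDefault(Locale.forLanguageTag("tr"));

        // Locale-pinned comparison still recognizes the key.
        System.out.println(isKey("WINDING", "winding"));                    // true

        // The no-argument version picks up the Turkish default and misses it.
        System.out.println("WINDING".toLowerCase().startsWith("winding")); // false
    }
}
```

Locale.ROOT should serve the same purpose as Locale.ENGLISH here; the point is just that the locale is pinned rather than inherited from the device.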

So, here's the TL;DR takeaway: if you're using toLowerCase() or a similar operation on a user-visible string, let it default to the current locale.  If you're using it when parsing, tell it which locale to use to ensure consistency.


  1. In Turkish we have two letters, 'ı' and 'i', and their uppercase forms are 'I' and 'İ'. When you convert the "I"s in WINDING to lowercase, the result is "wındıng"; what you need is "winding".

    PS: Sending a comment was really more difficult than expected!

  2. ...and when you want to compare strings case-independent, use String.equalsIgnoreCase(String). Just thought I should mention it for the record ;-).

    1. True. There's a couple places where I was just using a contains(), and until now I didn't realize there was a functional difference as far as the casing goes.