Return back to the blog

Hidden evils of Java’s String.split() and replace()

Posted by Prashant Deva on January 25, 2012

If you code in Java, you have inevitably used the String.split() and String.replace() (including replaceFirst() and replaceAll()) functions.

And why wouldn’t you? They are much more convenient than using the Java Regular Expressions API where you need to create a ‘Pattern‘ object, and possibly a ‘Matcher‘, and then call methods on those.

However, all convenience comes at a price!

The Evil Inside

In this case, the String.split() and String.replace*() methods (with the sole exception of String.replace(char, char) ) internally use the regular expression apis themselves, which can result in performance issues for your application.

Here is the String.split() method:


Notice that each call to String.split() creates and compiles a new Pattern object. The same is true for the String.replace() methods. This compiling of a pattern each time can cause performance issues in your program if you call the split() or replace() functions in a tight loop.

Benchmark

I tried a very simple test case to see how much the performance is affected.

 

The first case used String.split() a million times:


In the second case, I just changed the loop to use a precompiled Pattern object:


Benchmark Results

Here are the average results of 6 test runs:

Time taken with String.split() : 1600ms
Time taken with precompiled Pattern object: 1195 ms

Split-benchmark

Conclusion

Note that I used an extremely simple regular expression here which consists of just a single ‘space’ character and it resulted in > 25% decrease in performance.

A longer more complex expression would take longer to compile and thus make the loop containing the split() method even slower compared to its counterpart.

Lesson learned: It is good to know the internals of the APIs you use. Sometimes the convenience comes at the price of a hidden evil which may come to bite you when you are not looking.

 

Tagged 16 Comments |

16 thoughts on “Hidden evils of Java’s String.split() and replace()

  1. There is no Pattern in String.replace:public String replace(CharSequence sequence1, CharSequence sequence2) {
    if (sequence2 == null)

    throw new NullPointerException();
    int sequence1Length = sequence1.length();
    if (sequence1Length == 0) {

    StringBuilder result = new StringBuilder((count+1) * sequence2.length());

    result.append(sequence2);

    for (int i=0; i<count; i++) {

    result.append(charAt(i));

    result.append(sequence2);

    }

    return result.toString();
    }
    StringBuilder result = new StringBuilder();
    char first = sequence1.charAt(0);
    int start = 0, copyStart = 0, firstIndex;
    while (start < count) {

    if ((firstIndex = indexOf(first, start)) == -1)

    break;

    boolean found = true;

    if (sequence1.length() > 1) {

    if (firstIndex + sequence1Length > count)

    break;

    for (int i=1; i<sequence1Length; i++) {

    if (charAt(firstIndex + i) != sequence1.charAt(i)) {

    found = false;

    break;

    }

    }

    }

    if (found) {

    result.append(substring(copyStart, firstIndex));

    result.append(sequence2);

    copyStart = start = firstIndex + sequence1Length;

    } else {

    start = firstIndex + 1;

    }
    }
    if (result.length() == 0 && copyStart == 0)

    return this;
    result.append(substring(copyStart));
    return result.toString();}

  2. This is already mentioned in the first paragraph of the ‘The Evil Inside’ section:"with the sole exception of String.replace(char, char) "

  3. I think Jarek is correctpublic String replace(CharSequence, CharSequence)doesn’t use regexp patterns.neither does public String replace(Char, Char) which you DID mention in ‘The Evil Inside’ section, but which is not the same method.

  4. The split method in JDK7 has a fast path for the common case where the limit is a single character. This avoids need to compile or use regex. It would be nice to repeat your test with JDK7.

  5. "Hidden evil" because processing 1 million entries a simple way took an extra 1/2 second? If this were part of a nightly batch job I’d say simpler code is better.Be careful about optimizing code (which usually means making it more complex) if the gain is negligible. If that 1/2 second does make a difference in your application (e.g., real-time applications), then optimizing it makes sense.

  6. I have already run the above code under JavaSE 1.7 u2, and the results are exactly the opposite ones: split from String runs in halft the time pattern takes.

  7. Jarek, sorry for the confusion. I thought you were referring to the replace(char,char) method.This is what the replace(CharSequence, CharSequence) shows for me in the source of jdk1.6.0-29 and jrockit4.0.1-1.6.0:public String replace(CharSequence target, CharSequence replacement) { return Pattern.compile(target.toString(), Pattern.LITERAL).matcher( this).replaceAll(Matcher.quoteReplacement(replacement.toString())); }

  8. Carlos, I only tested on jdk 6, since thats the jvm most people currently use. According to our anonymous commentor it seems they changed the implementation in jdk 7.

  9. This is the current implementation in JavaSE 1.7u2: public String[] split(String regex, int limit) { /* fastpath if the regex is a (1)one-char String and this character is not one of the RegEx’s meta characters ".$|()[{^?*+", or (2)two-char String and the first char is the backslash and the second is not the ascii digit or ascii letter. */ char ch = 0; if (((regex.count == 1 && ".$|()[{^?*+".indexOf(ch = regex.charAt(0)) == -1) || (regex.length() == 2 && regex.charAt(0) == '' && (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 && ((ch-'a')|('z'-ch)) < 0 && ((ch-'A')|('Z'-ch)) < 0)) && (ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE)) { int off = 0; int next = 0; boolean limited = limit > 0; ArrayList<String> list = new ArrayList<>(); while ((next = indexOf(ch, off)) != -1) { if (!limited || list.size() < limit - 1) { list.add(substring(off, next)); off = next + 1; } else { // last one //assert (list.size() == limit - 1); list.add(substring(off, count)); off = count; break; } } // If no match was found, return this if (off == 0) return new String[] { this }; // Add remaining segment if (!limited || list.size() < limit) list.add(substring(off, count)); // Construct result int resultSize = list.size(); if (limit == 0) while (resultSize > 0 && list.get(resultSize-1).length() == 0) resultSize–; String[] result = new String[resultSize]; return list.subList(0, resultSize).toArray(result); } return Pattern.compile(regex).split(this, limit); }It is significantly different from previous JavaSE versions.

  10. Carlos, Yes looking at this it does seem like for simple patterns java 7 will indeed be faster with String.split() since it wont have any overhead of the regular expression engine.However, if you do pass a more complex expression, it reverts to compiling a new pattern and here the behavior should be the same as that on Java 6.That said, to java 7′s credit, I believe most of the code out there that uses String.split() will indeed go through the fast path.

  11. There are some benefits with JavaSe 1.7 implementation, as you have noted, and also some drawbacks.It leaves the developer the decision on when it would be useful to use String or Pattern split, based on a known pattern.The problem arises when the pattern is not known in advance. In this latter case, there is no way to know which method would perform better, at development time, other than doing some checks before, which may override any time gains on their own.Either case, unless there is a very tight constrain, under very specific conditions, I see no major difference between both approaches. Most use cases won’t notice the difference at all.

  12. Carlos, That’s why its a ‘hidden’ evil ;) It wont have an impact most of the time, but when it does it would be tough to know about it ;)

  13. OK, so using String.split() makes me lose 0.4s when executed a MILLION times ?This is definitely not going to make me use regexp instead of split() I can tell you !

  14. I very often use String.split(), particularly when parsing Strings during loading/saving data. I am very happy about this article, and will pay more attention to what the methods I use do internally.

  15. Its easy to dismiss this as too much trouble to worry about, but I have had significant gains refactoring my Android code to avoid using String.split inside of loops and instead creating a Pattern before hand. All of the overhead from object creation and garbage collection adds up, especially if there are multiple splits per loop, nested loops, etc. The same type of situation occurs with String concatenation, with each + compiling to its own StringBuilder instance. After the loop has been running for awhile the discarded objects will start to accumulate and the garbage collector will be running continuously. It seems that with Android development at least it pays to pay attention to stuff like this.

Comments are closed.