Java Standard Edition Platform - Regular Expressions
Motivation
The Java programming language provides great support for regular expressions. In general, the performance of regular expressions in Java is very good. The trade-off for that excellent performance is a (somewhat) protracted API. To see what I mean, have a a look at a canonical regular expression example from Sun's javadoc:
Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();
Kind of verbose, considering all we've accomplished is checking a single CharacterSequence for a match against one regular expression! Such complaints about decreased simplicity are likely what motivated Sun to provide some regular expression "convenience" methods inside the API for the String class. Specifically, as of version 6, the String API provides five such convenience methods:
public boolean matches(String regex); public String replaceFirst(String regex, String replacement); public String replaceAll(String regex, String replacement); public String[] split(String regex); public String[] split(String regex, int limit);
The documentation clearly explains that that each of the above convenience methods constructs and delegates the method call to Pattern and Matcher objects. What's not immediately obvious from the documentation for those convenience methods is that they become extremely inefficient as you execute them many thousands of times. Consider the following bit of code that illuminates the inefficiency of the String.matches(regex) method:
public class RegExpExerciser
{
public static void main(String[] args)
{
int numTimes = 10000000;
String regexp = "a*b";
String toMatch = "aaaaab";
long begin1 = System.currentTimeMillis();
for (int i = 0; i < numTimes; i++)
{
toMatch.matches(regexp);
}
System.out.println("Milliseconds using convenience method from String: " + (System.currentTimeMillis() - begin1));
long begin2 = System.currentTimeMillis();
Pattern p = Pattern.compile(regexp);
for (int i = 0; i < numTimes; i++)
{
p.matcher(toMatch).matches();
}
System.out.println("Milliseconds when re-using compiled reg. exp. pattern: " + (System.currentTimeMillis() - begin2));
}
}
Executing the above code, my box reveals these times:
Milliseconds using convenience method from String:
5074
Milliseconds when re-using compiled reg. exp.
pattern: 1720
The Code
The RegExpExerciser code sample above clearly demonstrates that the time to compile a regular expression Pattern dominates the time required to actually check whether a given string matches the pattern. Utilizing that key insight, I offer a helper class that provides simple APIs for working with regular expressions without sacrificing performance.
public class RegExpHelper
{
// The compiledPatterns Map is synchronized because it will be accessed
// concurrently by multiple threads
private static final Map<String, Pattern> compiledPatterns = Collections.synchronizedMap(new LRUMap(1000));
private static Pattern getCompiledPatern(String regExpPattern)
{
Pattern pattern = compiledPatterns.get(regExpPattern);
// No need to protect this pseudo-critical section from concurrent
// execution since regular expression compilation is idempotent
if (pattern == null)
{
pattern = Pattern.compile(regExpPattern);
compiledPatterns.put(regExpPattern, pattern);
}
return pattern;
}
public static String replaceFirst(String sourceString, String regExpPattern, String replacementString)
{
return getCompiledPatern(regExpPattern).matcher(sourceString).replaceFirst(replacementString);
}
public static String replaceAll(String sourceString, String regExpPattern, String replacementString)
{
return getCompiledPatern(regExpPattern).matcher(sourceString).replaceAll(replacementString);
}
public static boolean matches(String sourceString, String regExpPattern)
{
return getCompiledPatern(regExpPattern).matcher(sourceString).matches();
}
public static String[] split(String sourceString, String regExpPattern)
{
return getCompiledPatern(regExpPattern).split(sourceString);
}
public static String[] split(String sourceString, String regExpPattern, int limit)
{
return getCompiledPatern(regExpPattern).split(sourceString, limit);
}
}
Performance Comparison
The key feature to note about the RegExpHelper class presented above is that it caches compiled regular expressions so that publicly exposed methods can be both very simple and efficient. In order to appreciate the performance improvement, consider the following revised RegExpExerciser code:
public class RegExpExerciser
{
public static void main(String[] args)
{
int numTimes = 10000000;
String regexp = "a*b";
String toMatch = "aaaaab";
long begin1 = System.currentTimeMillis();
for (int i = 0; i < numTimes; i++)
{
toMatch.matches(regexp);
}
System.out.println("Milliseconds using convenience method from String: " + (System.currentTimeMillis() - begin1));
long begin2 = System.currentTimeMillis();
for (int i = 0; i < numTimes; i++)
{
RegExpHelper.matches(toMatch, regexp);
}
System.out.println("Milliseconds using convenience method from RegExpHelper: " + (System.currentTimeMillis() - begin2));
}
}
Executing the above code, my box reveals these times:
Milliseconds using convenience method from String:
5102
Milliseconds using convenience method from
RegExpHelper: 2002
Note that the convenience method provided by the RegExpHelper class
is much more efficient than the similar convenience method provided
by the String class.