Thursday, October 22, 2009

Parsing files using Groovy regex

In my previous post I mentioned several ways of defining regular expressions in Groovy. Here I want to show how we can use Groovy regexes to find and replace data in files.

Parsing properties file (simplified) [1]

Data: each line in the file has the same structure, and the entire line can be matched by a single regex. Problem: transform each line into an object. Solution: construct a regex with capturing parentheses, apply it to each line, and extract the captured data. Demonstrates: the File.eachLine method and the matrix syntax of the Matcher object.

def properties = [:]
new File('path/to/some.properties').eachLine { line ->
    if ((matcher = line =~ /^([^#=].*?)=(.+)$/)) {
        properties[matcher[0][1]] = matcher[0][2]
    }
}
println properties
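
To see what the matrix syntax returns without touching the file system, the same regex can be applied to a few in-memory lines. This is only a sketch: the sample lines and the lines variable are made up for illustration.

```groovy
// Hypothetical sample lines standing in for the contents of a properties file
def lines = ['# a comment', 'name=value', 'url=http://host/path?a=b', '']

def properties = [:]
lines.each { line ->
    def matcher = line =~ /^([^#=].*?)=(.+)$/
    if (matcher) {
        // matcher[0] is the first match as a list: [whole match, group 1, group 2]
        properties[matcher[0][1]] = matcher[0][2]
    }
}

assert properties == [name: 'value', url: 'http://host/path?a=b']
```

Because the key group is reluctant, the line is split at the first equals sign, so a value may itself contain one. Comment lines and empty lines simply do not match.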

Parsing csv files (simplified) [2]

Data: each line in the file has the same structure; the line consists of blocks separated by some character sequence. Problem: transform each line into a list of objects. Solution: construct a regex with capturing parentheses, then parse each line with the regex in a loop, extracting the captured data. Demonstrates: the ~// Pattern definition, the Matcher.group method, and the \G regex meta-sequence.

def regex = ~/\G(?:^|,)(?:"([^"]*+)"|([^",]*+))/
new File('path/to/file.csv').eachLine { line ->
    def fields = []
    def matcher = regex.matcher(line)
    while (matcher.find()) {
        fields << (matcher.group(1) ?: matcher.group(2))
    }
    println fields
}
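
The loop can be tried on in-memory lines to see how \G chains the matches one field after another. In this sketch the parseLine closure and the sample lines are introduced for illustration only; the regex and the loop body are the ones used above.

```groovy
def regex = ~/\G(?:^|,)(?:"([^"]*+)"|([^",]*+))/

// Helper introduced for this sketch only: splits one CSV line into fields
def parseLine = { String line ->
    def fields = []
    def matcher = regex.matcher(line)
    while (matcher.find()) {
        // group(1) is set for quoted fields, group(2) for unquoted ones
        fields << (matcher.group(1) ?: matcher.group(2))
    }
    fields
}

assert parseLine('a,"b,c",,d') == ['a', 'b,c', '', 'd']

// Caveat: an empty quoted field captures '' in group 1, which the Elvis
// operator treats as false, so the field ends up as null
assert parseLine('x,""') == ['x', null]
```

The second assertion shows one of the simplifications: if empty quoted fields matter, an explicit null check on group(1) would be needed instead of the Elvis operator.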

Finding snapshot dependencies in the pom (simplified) [3]

Data: the file contains blocks with known boundaries (possibly spanning multiple lines). Problem: extract the blocks satisfying some criteria. Solution: read the entire file into a string, construct a regex with capturing parentheses, and apply the regex to the string in a loop. Demonstrates: the File.text property, the list syntax of the Matcher object, named captures (closure parameters), the global (?x) regex modifier, and the local (?s:...) regex modifier.

def pom = new File('path/to/pom.xml').text
def matcher = pom =~ '''(?x)
    <dependency> \\s*
    <groupId>([^<]+)</groupId> \\s*
    <artifactId>([^<]+)</artifactId> \\s*
    <version>(.+?-SNAPSHOT)</version> (?s:.*?)
    </dependency>
'''
matcher.each { matched, groupId, artifactId, version ->
    println "$groupId:$artifactId:$version"
}
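
The regex can be checked against an inline string instead of a real pom.xml. The POM fragment below is invented for this sketch; the pattern is the same one, and the results are collected into a list rather than printed.

```groovy
// Made-up POM fragment: one snapshot dependency and one release dependency
def pom = '''
<dependencies>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>core</artifactId>
    <version>1.0-SNAPSHOT</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.7</version>
  </dependency>
</dependencies>
'''

def matcher = pom =~ '''(?x)
    <dependency> \\s*
    <groupId>([^<]+)</groupId> \\s*
    <artifactId>([^<]+)</artifactId> \\s*
    <version>(.+?-SNAPSHOT)</version> (?s:.*?)
    </dependency>
'''

def snapshots = []
// Each match arrives as a list that Groovy spreads over the closure parameters
matcher.each { matched, groupId, artifactId, version ->
    snapshots << "$groupId:$artifactId:$version".toString()
}

assert snapshots == ['com.example:core:1.0-SNAPSHOT']
```

The local (?s:.*?) is what lets the match skip over the extra scope element and its line breaks, while the dot inside the version group still stops at the end of the line.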

Finding stacktraces in the log

Data: the file contains entries, each of which starts with the same pattern and can span multiple lines. A typical example is a log4j log file:

2009-10-16 15:32:12,157 DEBUG [com.ndpar.web.RequestProcessor] Loading user
2009-10-16 15:32:13,258 ERROR [com.ndpar.web.UserController] id to load is required for loading
java.lang.IllegalArgumentException: id to load is required for loading
        at org.hibernate.event.LoadEvent.<init>(LoadEvent.java:74)
        at org.hibernate.event.LoadEvent.<init>(LoadEvent.java:56)
        at org.hibernate.impl.SessionImpl.get(SessionImpl.java:839)
        at org.hibernate.impl.SessionImpl.get(SessionImpl.java:835)
        at org.springframework.orm.hibernate3.HibernateTemplate$1.doInHibernate(HibernateTemplate.java:531)
        at org.springframework.orm.hibernate3.HibernateTemplate.doExecute(HibernateTemplate.java:419)
        at org.springframework.orm.hibernate3.HibernateTemplate.executeWithNativeSession(HibernateTemplate.java:374)
        at org.springframework.orm.hibernate3.HibernateTemplate.get(HibernateTemplate.java:525)
        at org.springframework.orm.hibernate3.HibernateTemplate.get(HibernateTemplate.java:519)
        at com.ndpar.dao.UserManager.getUser(UserManager.java:90)
        ... 62 more
2009-10-16 15:32:14,659 DEBUG [com.ndpar.jms.MessageListener] Received message:
... multi-line message ...
2009-10-16 15:32:15,169 INFO [com.ndpar.dao.UserManager] User: ...

Problem: find the entries satisfying some criteria. Solution: read the entire file into a string [4], construct a regex with capturing parentheses and a lookahead, split the string into entries, then loop through the result and apply the criteria to each entry. Demonstrates: regex interpolation and the combined global regex modifiers (?s) and (?m).

def log = new File('path/to/your.log').text
def logLineStart = /^\d{4}-\d{2}-\d{2}/
def splitter = log =~ """(?xms)
    ( ${logLineStart} .*?)
    (?= ${logLineStart} | \\Z)
"""
splitter.each { matched, entry ->
    if (entry =~ /(?m)^(?:\t| {8})at/) println entry
}
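
The splitting can be verified on a small in-memory log. The log lines and class names below are made up for this sketch; the regexes are the ones from the script, and matching entries are collected into a list instead of being printed.

```groovy
// Made-up log with three entries; only the second one carries a stack trace
def log = [
    '2009-10-16 15:32:12,157 DEBUG [com.example.Processor] Loading user',
    '2009-10-16 15:32:13,258 ERROR [com.example.Controller] id is required',
    'java.lang.IllegalArgumentException: id is required',
    '\tat com.example.Dao.load(Dao.java:90)',
    '2009-10-16 15:32:15,169 INFO [com.example.Dao] Done'
].join('\n')

def logLineStart = /^\d{4}-\d{2}-\d{2}/
// The lookahead ends each entry right before the next dated line or at end of input
def splitter = log =~ """(?xms)
    ( ${logLineStart} .*?)
    (?= ${logLineStart} | \\Z)
"""

def withTraces = []
splitter.each { matched, entry ->
    // A stack frame is a line starting with a tab or eight spaces followed by "at"
    if (entry =~ /(?m)^(?:\t| {8})at/) withTraces << entry
}

assert withTraces.size() == 1
assert withTraces[0].startsWith('2009-10-16 15:32:13,258 ERROR')
assert withTraces[0].contains('IllegalArgumentException')
```

Because the lookahead does not consume the next date, each entry starts exactly where the previous one ended, and multi-line entries stay in one piece.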

Replacing text in the file

Use a Groovy one-liner to perform the replacement. Here is Tim's example in Groovy:

$ groovy -p -i -e '(line =~ /1\.6/).replaceAll("2.0-alpha-1-SNAPSHOT")' `find . -name pom.xml`
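
The expression that the one-liner evaluates for each line can also be tried on its own in a script. This is a small sketch with an invented sample line; it also shows the equivalent String.replaceAll call.

```groovy
// Invented sample line standing in for one line of a pom.xml
def line = '<version>1.6</version>'

// Same expression as in the one-liner: Matcher.replaceAll returns the modified line
def replaced = (line =~ /1\.6/).replaceAll('2.0-alpha-1-SNAPSHOT')
assert replaced == '<version>2.0-alpha-1-SNAPSHOT</version>'

// String.replaceAll with the same regex gives the same result
assert line.replaceAll(/1\.6/, '2.0-alpha-1-SNAPSHOT') == replaced

// The escaped dot matters: an unescaped /1.6/ would also match '126'
assert '126'.replaceAll(/1.6/, 'X') == 'X'
assert '126'.replaceAll(/1\.6/, 'X') == '126'
```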


Resources

• Groovy regexes
• Groovy one-liners
• Using String.replaceAll method

Footnotes

1. This example is for demonstration purposes only. In a real program you would just use the Properties.load method.
2. The regex is simplified. If you want the real one, take a look at Jeffrey Friedl's example.
3. Again, in reality you would find snapshots using the mvn dependency:resolve | grep SNAPSHOT command.
4. This approach won't work for big files. Take a look at this script for a practical solution.

Wednesday, October 14, 2009

GParallelizer Performance

GParallelizer is a Groovy wrapper for the new Java concurrency library. It lets you perform list and map operations on parallel threads, which in theory leverages the full power of multi-processor machines. Here I want to check whether that is true in practice. I ran the following tests on my dual-core MacBook:

import static org.gparallelizer.Parallelizer.*
import org.gparallelizer.ParallelEnhancer
import org.junit.Before
import org.junit.Test

class GParsTest {

    def list = []

    @Before void setUp() {
        1000000.times {
            list << (float) Math.random()
        }
    }

    @Test void sequential() {
        def start = System.currentTimeMillis()
        list.findAll { it < 0.4 }
        def duration = System.currentTimeMillis() - start

        println "Sequential: ${duration}ms"
    }

    @Test void parallel_with_enhancer() {
        ParallelEnhancer.enhanceInstance list

        def start = System.currentTimeMillis()
        list.findAllAsync { it < 0.4 }
        def duration = System.currentTimeMillis() - start

        println "Parallel with enhancer: ${duration}ms"
    }

    @Test void parallel_with_parallelizer_2() {
        parallelWithParallelizer 2
    }

    @Test void parallel_with_parallelizer_3() {
        parallelWithParallelizer 3
    }

    @Test void parallel_with_parallelizer_5() {
        parallelWithParallelizer 5
    }

    @Test void parallel_with_parallelizer_10() {
        parallelWithParallelizer 10
    }

    def parallelWithParallelizer(threads) {
        def start = System.currentTimeMillis()
        withParallelizer(threads) {
            list.findAllAsync { it < 0.4 }
        }
        def duration = System.currentTimeMillis() - start

        println "Parallel with parallelizer (${threads}): ${duration}ms"
    }
}

And here is the output:

Sequential: 774ms
Parallel with enhancer: 9311ms
Parallel with parallelizer (2): 1785ms
Parallel with parallelizer (3): 769ms
Parallel with parallelizer (5): 500ms
Parallel with parallelizer (10): 722ms

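Something strange happened with the mixed-in ParallelEnhancer, but with Parallelizer performance did improve once the pool had at least three threads (with two threads the run took more than twice as long as the sequential one). With the optimal thread pool size of 5, parallel processing is 35% faster than sequential (500ms versus 774ms).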
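
The 35% figure follows directly from the timings in the output. A small sketch of the arithmetic, with the numbers copied from the run above (the variable names are mine):

```groovy
// Timings in milliseconds copied from the test output
def sequential = 774
def parallel = [2: 1785, 3: 769, 5: 500, 10: 722]   // pool size -> duration

// Pick the pool size with the shortest duration
def best = parallel.min { it.value }

// Relative improvement over the sequential run, rounded to whole percent
def improvement = Math.round(((sequential - best.value) * 100 / sequential) as double)

assert best.key == 5
assert improvement == 35

// With two threads the parallel run was more than twice as slow as sequential
assert parallel[2] > 2 * sequential
```

These are single-run numbers from one machine, so the percentage is indicative rather than exact.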

Conclusion: Use GPars methods if you need to process large amounts of data. Try different configuration parameters to find the best setup for your particular problem.

Resources

• Brian Goetz on new concurrency library
• Vaclav Pech on GPars

Tuesday, October 06, 2009

Converting XML to POGO

Suppose we want to convert XML to a Groovy bean:

class MyBean {
    String strField
    float floatField
    int intField
    boolean boolField
}

def message = "<xml stringAttr='String Value' boolAttr='true' />"

def xml = new XmlSlurper().parseText(message)

def bean = new MyBean(
    strField: xml.@stringAttr,
    boolField: xml.@boolAttr
)

Everything looks good; even the assertions succeed:

assert 'String Value' == bean.strField
assert bean.boolField

Now let's try a false value:

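message = "<xml stringAttr='String Value' boolAttr='false' />"
xml = new XmlSlurper().parseText(message)
bean = new MyBean(strField: xml.@stringAttr, boolField: xml.@boolAttr)
assert !bean.boolField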

Oops, the assertion failed. Why? Because xml.@boolAttr is not a String, and cast to boolean it always returns true, whatever the attribute's text is. The correct implementation converts each attribute explicitly:

message = "<xml stringAttr='String Value' floatAttr='3.14' intAttr='9' boolAttr='false' />"

xml = new XmlSlurper().parseText(message)

bean = new MyBean(
    strField: xml.@stringAttr.toString(),
    floatField: xml.@floatAttr.toFloat(),
    intField: xml.@intAttr.toInteger(),
    boolField: xml.@boolAttr.toBoolean()
)
assert 'String Value' == bean.strField
assert 3.14F == bean.floatField
assert 9 == bean.intField
assert !bean.boolField

Now everything works properly. The moral of this blog post: create more unit tests (assertions), especially when you work with a dynamic language.
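
A similar pitfall can be reproduced with plain Strings, without XmlSlurper. The sketch below is an analogy rather than the XmlSlurper code path itself: it contrasts Groovy truth, which only looks at whether a String is empty, with the explicit conversion methods.

```groovy
// Groovy truth only checks that the String is non-empty, so 'false' counts as true
def truthy = 'false' ? true : false
assert truthy

// Explicit conversions look at the text itself
assert !'false'.toBoolean()
assert 'true'.toBoolean()
assert '3.14'.toFloat() == 3.14F
assert '9'.toInteger() == 9
```

An assertion on the false case is what exposes the difference; a test that only checks the true value passes either way.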

Resources

• Converting String to Boolean