Parsing files using Groovy regex

In my previous post I mentioned several ways of defining regular expressions in Groovy. Here I want to show how we can use Groovy regex to find and replace data in files.

Parsing properties file (simplified) [1]

Data: each line in the file has the same structure, and the entire line can be matched by a single regex.
Problem: transform each line into an object.
Solution: construct a regex with capturing parentheses, apply it to each line, and extract the captured data.
Demonstrates: the File.eachLine method and the matrix syntax of the Matcher object.

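def properties = [:]
new File('path/to/').eachLine { line ->
    // Skip comment lines (starting with #); capture the key and value around the first '='
    def matcher = line =~ /^([^#=].*?)=(.+)$/
    if (matcher) {
        // matcher[0] is the first match; [1] and [2] are its captured groups
        properties[matcher[0][1]] = matcher[0][2]
    }
}
println properties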
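
To try the matrix syntax without a file on disk, here is a minimal sketch that applies the same regex to a few in-memory lines. The `parseProperties` helper and the sample lines are hypothetical, made up for illustration only.

```groovy
// Hypothetical helper (not from the original post): parses property-style lines into a map.
def parseProperties(List<String> lines) {
    def properties = [:]
    lines.each { line ->
        def matcher = line =~ /^([^#=].*?)=(.+)$/
        if (matcher) {
            // matcher[0] is the first match as a list: [whole match, group 1, group 2]
            properties[matcher[0][1]] = matcher[0][2]
        }
    }
    properties
}

// Made-up sample data: a comment, two properties, and an empty line
def sample = ['# a comment', 'db.user=scott', 'db.url=jdbc:h2:mem:test', '']
def parsed = parseProperties(sample)
println parsed
```

Comment lines and empty lines do not match the regex, so they are skipped.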

Parsing CSV files (simplified) [2]

Data: each line in the file has the same structure, and the line consists of blocks separated by some character sequence.
Problem: transform each line into a list of objects.
Solution: construct a regex with capturing parentheses, then parse each line with the regex in a loop, extracting the captured data.
Demonstrates: the ~// Pattern definition, the Matcher.group method, and the \G regex meta-sequence.

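def regex = ~/\G(?:^|,)(?:"([^"]*+)"|([^",]*+))/
new File('path/to/file.csv').eachLine { line ->
    def fields = []
    def matcher = regex.matcher(line)
    // \G makes every match start exactly where the previous one ended
    while (matcher.find()) {
        // group 1 is a quoted field, group 2 an unquoted one
        fields << (matcher.group(1) ?: matcher.group(2))
    }
    println fields
}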
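
As a quick check of how \G walks through a line field by field, here is a small sketch on in-memory strings. The `parseCsvLine` helper and the sample lines are hypothetical. It uses an explicit null check so that an empty quoted field (`""`) is kept as an empty string, since an empty string counts as false in Groovy.

```groovy
// Hypothetical helper (not from the original post): splits one CSV line into fields.
// Escaped quotes inside quoted fields are not supported by this simplified regex.
def parseCsvLine(String line) {
    def regex = ~/\G(?:^|,)(?:"([^"]*+)"|([^",]*+))/
    def fields = []
    def matcher = regex.matcher(line)
    // \G anchors each match to the end of the previous one,
    // so the loop stops at the first malformed field instead of skipping over it.
    while (matcher.find()) {
        // group(1) is set for a quoted field, group(2) for an unquoted one
        fields << (matcher.group(1) != null ? matcher.group(1) : matcher.group(2))
    }
    fields
}

// Made-up sample line with a comma inside a quoted field
println parseCsvLine('1,"Smith, John",Toronto')
```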

Finding snapshot dependencies in the pom (simplified) [3]

Data: the file contains blocks with known boundaries (possibly spanning multiple lines).
Problem: extract the blocks satisfying some criteria.
Solution: read the entire file into a string, construct a regex with capturing parentheses, and apply the regex to the string in a loop.
Demonstrates: the File.text property, the list syntax of the Matcher object, named capture through closure parameters, the global (?x) regex modifier, and the local (?s:...) regex modifier.

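def pom = new File('path/to/pom.xml').text
// (?x) lets the pattern span several lines; whitespace in it is ignored
def matcher = pom =~ '''(?x)
    <dependency> \\s*
      <groupId>([^<]+)</groupId> \\s*
      <artifactId>([^<]+)</artifactId> \\s*
      <version>(.+?-SNAPSHOT)</version> (?s:.*?)
    </dependency>
'''
// Each match arrives as a list: the whole match followed by the captured groups
matcher.each { matched, groupId, artifactId, version ->
    println "$groupId:$artifactId:$version"
}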
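
The same pattern can be tried on an inline string. The sketch below assumes the block ends with `</dependency>` and uses a made-up POM fragment with hypothetical coordinates; only the first dependency is a snapshot.

```groovy
// Made-up POM fragment for illustration
def pom = '''<project>
  <dependencies>
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>core</artifactId>
      <version>1.0-SNAPSHOT</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.7</version>
    </dependency>
  </dependencies>
</project>'''

// (?x) ignores whitespace in the pattern; (?s:.*?) lets '.' cross line breaks locally
def matcher = pom =~ '''(?x)
    <dependency> \\s*
      <groupId>([^<]+)</groupId> \\s*
      <artifactId>([^<]+)</artifactId> \\s*
      <version>(.+?-SNAPSHOT)</version> (?s:.*?)
    </dependency>
'''

// Each match is a list [whole match, group 1, group 2, group 3],
// which Groovy spreads over the closure parameters.
def snapshots = matcher.collect { matched, groupId, artifactId, version ->
    "$groupId:$artifactId:$version".toString()
}
println snapshots
```

Because the dot does not cross line breaks outside the local (?s:...) group, the version capture cannot run from a release version into a later snapshot one.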

Finding stacktraces in the log

Data: the file contains entries, each of which starts with the same pattern and can span multiple lines. A typical example is a log4j log file:

2009-10-16 15:32:12,157 DEBUG [com.ndpar.web.RequestProcessor] Loading user
2009-10-16 15:32:13,258 ERROR [com.ndpar.web.UserController] id to load is required for loading
java.lang.IllegalArgumentException: id to load is required for loading
at org.hibernate.event.LoadEvent.(
at org.hibernate.event.LoadEvent.(
at org.hibernate.impl.SessionImpl.get(
at org.hibernate.impl.SessionImpl.get(
at org.springframework.orm.hibernate3.HibernateTemplate$1.doInHibernate(
at org.springframework.orm.hibernate3.HibernateTemplate.doExecute(
at org.springframework.orm.hibernate3.HibernateTemplate.executeWithNativeSession(
at org.springframework.orm.hibernate3.HibernateTemplate.get(
at org.springframework.orm.hibernate3.HibernateTemplate.get(
at com.ndpar.dao.UserManager.getUser(
... 62 more
2009-10-16 15:32:14,659 DEBUG [com.ndpar.jms.MessageListener] Received message:
... multi-line message ...
2009-10-16 15:32:15,169 INFO [com.ndpar.dao.UserManager] User: ...

Problem: find entries satisfying some criteria.
Solution: read the entire file into a string [4], construct a regex with capturing parentheses and a lookahead, split the string into entries, then loop through the result and apply the criteria to each entry.
Demonstrates: regex interpolation and the combined global regex modifiers (?s) and (?m), written here together with (?x) as (?xms).

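def log = new File('path/to/your.log').text
// Every log entry starts with a date at the beginning of a line
def logLineStart = /^\d{4}-\d{2}-\d{2}/
// An entry runs from one date up to (but not including) the next date or the end of input
def splitter = log =~ """(?xms)
    ( ${logLineStart} .*?)
    (?= ${logLineStart} | \\Z)
"""
splitter.each { matched, entry ->
    // Stack trace lines start with a tab or eight spaces followed by "at"
    if (entry =~ /(?m)^(?:\t| {8})at/) println entry
}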
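
To see the splitting in isolation, here is a sketch that runs the same regex over a short in-memory log. The log text and class names are made up, and the stack trace lines are assumed to be indented with a tab.

```groovy
// Made-up log text; \t in a triple-quoted string is a real tab character
def log = '''2009-10-16 15:32:12,157 DEBUG [com.example.A] Loading user
2009-10-16 15:32:13,258 ERROR [com.example.B] id is required
java.lang.IllegalArgumentException: id is required
\tat com.example.B.load(B.java:10)
\tat com.example.A.run(A.java:20)
2009-10-16 15:32:15,169 INFO [com.example.A] Done
'''

def logLineStart = /^\d{4}-\d{2}-\d{2}/
// x: ignore whitespace in the pattern, m: ^ matches at line starts, s: '.' matches newlines
def splitter = log =~ """(?xms)
    ( ${logLineStart} .*?)
    (?= ${logLineStart} | \\Z)
"""

// Collect only the entries that contain at least one stack trace line
def withStackTrace = []
splitter.each { matched, entry ->
    if (entry =~ /(?m)^(?:\t| {8})at/) withStackTrace << entry
}
println withStackTrace.size()
withStackTrace.each { println it }
```

The lookahead does not consume the next date, so the following entry can start right where the previous one stopped.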

Replacing text in the file

Use a Groovy one-liner to perform the replacement. Here is Tim's example in Groovy:

$ groovy -p -i -e '(line =~ /1\.6/).replaceAll("2.0-alpha-1-SNAPSHOT")' `find . -name pom.xml`
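
If the replacement has to live in a script rather than on the command line, a rough equivalent might look like the sketch below. The `replaceInFile` helper and the temporary file are hypothetical. The whole file is read into memory, so this suits small files such as a pom.xml.

```groovy
// Hypothetical helper (not from the original post): replaces every regex match in a file.
def replaceInFile(File file, String regex, String replacement) {
    // File.text reads the whole file; assigning to it writes the file back
    file.text = file.text.replaceAll(regex, replacement)
}

// Demonstration on a temporary file with made-up content
def pom = File.createTempFile('pom', '.xml')
pom.deleteOnExit()
pom.text = '<version>1.6</version>\n'

replaceInFile(pom, /1\.6/, '2.0-alpha-1-SNAPSHOT')
println pom.text
```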


1. This example is for demonstration purposes only. In a real program you would just use the Properties.load method.
2. The regex is simplified. If you want the real one, take a look at Jeffrey Friedl's example.
3. Again, in reality you would find snapshots using the mvn dependency:resolve | grep SNAPSHOT command.
4. This approach won't work for big files. Take a look at this script for a practical solution.

