def subDomain = '(?i:[a-z0-9]|[a-z0-9][-a-z0-9]*[a-z0-9])' // simple regex in single quotes
def topDomains = """
(?x-i : com \\b # you can put whitespace and comments
| edu \\b # inside regex in eXtended mode
| biz \\b
| in(?:t|fo) \\b # but you have to escape
| mil \\b # backslashes in multiline strings
| net \\b
| org \\b
| [a-z][a-z] \\b
)"""
def hostname = /(?:${subDomain}\.)+${topDomains}/ // variable substitution in slashy strings; "+" allows several labels, e.g. www.google.com
def NOT_IN = /;\"'<>()\[\]{}\s\x7F-\xFF/ // backslash is not escaped in slashy strings
def NOT_END = /!.,?/
def ANYWHERE = /[^${NOT_IN}${NOT_END}]/
def EMBEDDED = /[$NOT_END]/ // you can omit {} around var name
def urlPath = "/$ANYWHERE*($EMBEDDED+$ANYWHERE+)*"
def url =
"""(?x:
\\b
# match the hostname part
(
(?: ftp | http s? ): // [-\\w]+(\\.\\w[-\\w]*)+
|
$hostname
)
# allow optional port
(?: :\\d+ )?
# rest of url is optional, and begins with /
(?: $urlPath )?
)"""
assert 'http://www.google.com/search?rls=en&q=regex&ie=UTF-8&oe=UTF-8' ==~ url
As you can see, Groovy gives you several string syntaxes to choose from (single-quoted, triple-quoted and slashy), and for every subexpression you can pick the one that reads most clearly.
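To make the trade-off concrete, here is a minimal sketch that writes one small pattern (a dot followed by two digits) in each style. The variable names are made up for this illustration and are not part of the URL regex above.

```groovy
// Hypothetical names, for illustration only.
def single   = '\\.\\d{2}'      // single quotes: no interpolation, backslashes must be doubled
def slashy   = /\.\d{2}/        // slashy string: backslashes are written as-is
def digits   = /\d{2}/          // a reusable fragment
def composed = "\\.${digits}"   // double quotes: interpolation works, backslashes are doubled

assert single == slashy          // both literals produce the same pattern text
assert composed == slashy        // Groovy's == compares a GString and a String by content
assert '.42' ==~ composed        // ==~ requires the whole string to match
assert !('x42' ==~ composed)
```

Slashy strings tend to read best for backslash-heavy fragments, while triple-quoted strings are handy once a pattern spans several lines in extended mode.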
Resources
• Martin Fowler on composed regexes
• Pragmatic Dave on regexes in Ruby
• Feature request to make regexes even groovier
• Mastering Regular Expressions — best regex book
• Groovy Pattern and Matcher classes