Writing the same regular expression logic in multiple JVM languages

Posted on May 14, 2023 by Jeanne Boyarsky

I tried writing the same three regular expression operations in the most common JVM languages:

Find first match
Find all matches
Replace first match

My experience in these languages ranges from “use it many times a week” (Groovy) to “this is the first thing I’ve written in it” (Clojure).

I’m going to be using these in a presentation. So if you see anything in here that is a bad idiom in the language, do let me know!
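For comparison, here is roughly what the same three operations look like in plain Java. This is only a sketch for reference, not part of the original comparison; the class and method names (RegexBaseline, findFirst, findAll, replaceFirst) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBaseline {
    // Word boundary, three or four word characters, then a space.
    static final Pattern WORD = Pattern.compile("\\b\\w{3,4} ");

    // Find first match; empty when nothing matches.
    static Optional<String> findFirst(String text) {
        Matcher m = WORD.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Find all matches, in the order they appear.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Replace only the first match.
    static String replaceFirst(String text, String replacement) {
        return WORD.matcher(text).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String text = "Mary had a little lamb.";
        System.out.println(findFirst(text).orElse("No Match")); // prints "Mary " (with trailing space)
        System.out.println(findAll(text));                      // prints [Mary , had ]
        System.out.println(replaceFirst(text, "_"));            // prints _had a little lamb.
    }
}
```

Note that “little” and “lamb.” never match: “little” has too many word characters before the space, and “lamb” is followed by a period rather than a space.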

Kotlin

The WordPress syntax highlighter doesn’t have Kotlin as a choice

val text = "Mary had a little lamb"
val regex = Regex("\\b\\w{3,4} ")
print(regex.find(text)?.value)
-----------------------------------------
val text = "Mary had a little lamb"
val regex = "\\b\\w{3,4} ".toRegex()
regex.findAll(text)
  .map { it.groupValues[0] }
  .forEach { print(it) }
-----------------------------------------
val text = "Mary had a little lamb."
val wordBoundary = "\\b"
val threeOrFourChars = "\\w{3,4}"
val space = " "
val regex = Regex(wordBoundary +
  threeOrFourChars + space)
     
println(regex.replaceFirst(text, "_"))

Scala

Thanks to dhinojosa for the code review and the feedback that triple-quoted strings don’t require escaping backslashes inside!

val text = "Mary had a little lamb"
val regex = """\b\w{3,4} """.r
val optional = regex findFirstIn text
      
println(optional.getOrElse("No Match"))
-----------------------------------------
val text = "Mary had a little lamb."
val regex = """\b\w{3,4} """.r
val it = regex findAllIn text
      
it foreach print
-----------------------------------------
val text = "Mary had a little lamb."
val wordBoundary = """\b"""
val threeOrFourChars = """\w{3,4}"""
val space = " "
val regex = new Regex(wordBoundary + threeOrFourChars + space)
     
println(regex replaceFirstIn(text, "_"))

Clojure

(println
  (re-find #"\b\w{3,4} "
           "Mary had a little lamb"))
-----------------------------------------
(println
  (re-seq #"\b\w{3,4} "
          "Mary had a little lamb"))
-----------------------------------------
(ns clojure.examples.example
  (:require [clojure.string :as str])
  (:gen-class))

;; let keeps the bindings local; def inside defn would create global vars
(defn replacer []
  (let [text "Mary had a little lamb."
        word-boundary "\\b"
        three-or-four-chars "\\w{3,4}"
        space " "
        pat (re-pattern (str word-boundary three-or-four-chars space))]
    (println (str/replace-first text pat "_"))))

(replacer)

Groovy

def text = 'Mary had a little lamb'
def regex = /\b\w{3,4} /

def matcher = text =~ regex
print matcher[0]
-----------------------------------------
def text = 'Mary had a little lamb'
def regex = /\b\w{3,4} /

def matcher = text =~ regex
print matcher.findAll().join(' ')
-----------------------------------------
def text = 'Mary had a little lamb'
def regex = /\b\w{3,4} /

print text.replaceFirst(regex, '_')

[2019 oracle code one] mastering regular expressions

Posted on September 19, 2019 by Jeanne Boyarsky

Mastering Regular Expressions (^.*$)(?#everything) You Should Know

Speakers: Fernando Babadopulos @babadopulos

For more blog posts, see The Oracle Code One table of contents

History

Invented in 1951
Popularized in 70’s with vi/lex/sed/awk/expr
In 80’s Perl included advanced regex
Java 1.4 – added regex

General

Used for a spam filter before AI started detecting spam
Not just for developers. Can do lots of things in text editor

Engine

Eager – starts with leftmost match and first match is good enough.
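A small Java sketch of what “first match is good enough” means in practice, using java.util.regex. The class and helper names (EagerDemo, firstMatch) are hypothetical and only here for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EagerDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Both alternatives can match at position 0; the engine stops at the first one that works.
        System.out.println(firstMatch("Jan|January", "January")); // prints Jan
        System.out.println(firstMatch("January|Jan", "January")); // prints January
        // The leftmost position wins, even though "set" is listed first.
        System.out.println(firstMatch("set|get", "get and set")); // prints get
    }
}
```

So with alternation, the order of the alternatives matters: putting the longer one first is what gets you “January”.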

Regex Examples

/Java/ – match first string “Java”
/database/g – match all instances of string “database” (different syntax than for Java. Typing without the / from now on)
.* – match everything (or nothing) – as much as it can because * is greedy
[dl]ate – character class matching d or l (followed by ate)
a[^p] – negated character class. Matches “a” followed by any character other than “p”
database[0-9]\. – range of database0. to database9.
\d\d.\d\d – Two digits, any character and two more digits. Probably not what you want. Works if you have valid data, but will also match 15624. Use what you want, ex: [:h] instead of the dot. (Or escape the dot if you want a period.)
^(.*),(.*) – Want the first two fields of a CSV [unless there are commas inside a quoted field]. Backtracks a lot though, because .* consumes the entire string and then backtracks one character at a time until it gets to a comma. Then it backtracks more. Better to write [^,] than dot so you express what you actually want.
^([^,]*),([^,]*)$ – match exactly two fields [unless there are commas within quotes for an element]
^[a-z]*$ – empty file or only lower case letters
get|set – match either string
\bget|set\b – matches words that start with get or end with set – not what was intended
\b(get|set)\b – only match words get or set since checking for boundary (space, tab, etc)
Jan(uary)? – matches Jan or January
a{0,10} – 0-10 a’s
a{10} – exactly 10 a’s
a{0,} – zero or more a’s
a{1,} – one or more a’s
Java(?=Script) – positive lookahead – Java followed by Script
Java(?!Script) – negative lookahead – Java not followed by Script
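Two of the examples above are easy to get wrong, so here is a quick Java sketch to check them: the word boundary with alternation, and the lookaheads. The class and helper names (ExampleChecks, allMatches, firstStart) and the test strings are my own, not from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExampleChecks {
    // Collects every match of regex in text, in order.
    static List<String> allMatches(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Start index of the first match, or -1 when there is none.
    static int firstStart(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String words = "get getter reset set";
        // Without the group, \b applies to only one side of each alternative,
        // so this also hits the "get" in getter and the "set" in reset.
        System.out.println(allMatches("\\bget|set\\b", words));   // prints [get, get, set, set]
        // With the group, both boundaries apply to both alternatives.
        System.out.println(allMatches("\\b(get|set)\\b", words)); // prints [get, set]

        String langs = "Java JavaScript";
        // Lookaheads check what follows without consuming it.
        System.out.println(firstStart("Java(?=Script)", langs));  // prints 5
        System.out.println(firstStart("Java(?!Script)", langs));  // prints 0
    }
}
```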

Escaping

\( – escape paren
\d – shorthand for digit
\s – shorthand for whitespace
\w – shorthand for word character
\\ – backslash
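In Java source there is a second layer of escaping, since the backslash is also the escape character inside string literals. A minimal sketch of how the escapes above end up looking in Java; the class and helper names (EscapingDemo, found) are hypothetical.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // True when regex matches somewhere in text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // The regex \d\d is written "\\d\\d" in a Java string literal.
        System.out.println(found("\\d\\d", "room 42"));                  // prints true
        // The regex \( matches a literal opening paren.
        System.out.println(found("\\(", "f(x)"));                        // prints true
        // The regex \\ (one literal backslash) needs four backslashes in Java.
        System.out.println(found("\\\\", "C:\\temp"));                   // prints true
        // An escaped dot only matches a period.
        System.out.println(found("database[0-9]\\.", "database7."));     // prints true
        System.out.println(found("database[0-9]\\.", "database7x"));     // prints false
        // Pattern.quote escapes a whole string so it is matched literally.
        System.out.println(found(Pattern.quote("a.b"), "axb"));          // prints false
        System.out.println(found(Pattern.quote("a.b"), "a.b"));          // prints true
    }
}
```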

Tips

Avoid . – use the characters you want or a negated character class.
Use anchors (^ and $) wherever possible.
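To see why both tips matter, here is a Java sketch comparing the dot with a negated character class on a CSV line, plus the digits example. The class and helper names (CsvFields, firstTwo, found) and the sample data are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvFields {
    // Returns capture groups 1 and 2 of the first match, or null when there is no match.
    static String[] firstTwo(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    // True when regex matches somewhere in text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String line = "alpha,beta,gamma";
        // The greedy dot runs to the end and backtracks to the LAST comma.
        System.out.println(Arrays.toString(firstTwo("^(.*),(.*)", line)));       // prints [alpha,beta, gamma]
        // The negated class cannot cross a comma, so it stops at the first one.
        System.out.println(Arrays.toString(firstTwo("^([^,]*),([^,]*)", line))); // prints [alpha, beta]
        // Anchoring both ends rejects a line that has more than two fields.
        System.out.println(firstTwo("^([^,]*),([^,]*)$", line) == null);         // prints true

        // The dot accepts a digit as the separator; naming the separator does not.
        System.out.println(found("\\d\\d.\\d\\d", "15624")); // prints true
        System.out.println(found("\\d\\d:\\d\\d", "15624")); // prints false
    }
}
```

Besides the backtracking cost, the dot version quietly returns the wrong fields as soon as the line has a third column.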

Regex101.com

Type regex
Gives explanation of regex typed
Can set flags ex: global
Area for test string to see what matches
Has a code generator so can get regex with proper Java escaping
Has debugger – can step through the parts of the regex and see what matches at each step. It also shows backtracking. This seems like a good way to see the efficiency of a regex as well. [Cool!]

Other URLs:

https://www.regular-expressions.info – tutorial

https://regexper.com – create graphs

My take

I enjoyed the debate (and then vote) on how to pronounce regex before the talk started! Half of the audience raised their hands for liking regular expressions. Biased crowd of course. The room was awkward and the lectern hid part of the screen. I like that he showed a lot of examples and the execution graph. I really like the debugger on regex101. Learning that was worth attending the talk on its own! As was the regexper graph site.

JavaOne – Simplified and Fast Fraud Detection

Posted on October 2, 2017 by Jeanne Boyarsky

“Simplified and Fast Fraud Detection”

Speaker: Keith Laker

For more blog posts from JavaOne, see the table of contents

Live SQL

free online Oracle 12C database
Can save scripts
Google searchable
Each OTN (Oracle Technology Network) user sees their own copy of the data. Sandboxed
Can download data as CSV

https://livesql.oracle.com/apex/livesql/file/index.html

And for this session the live sql URL

Pattern Matching

types – regex, sed/awk
in SQL – row level regex
new: pattern recognition in a stream of rows – aka can match across rows and columns
new SQL construct MATCH_RECOGNIZE – ANSI standard; not Oracle specific

Steps

Bucket and order the data
- This makes the patterns “visible”.
- Used order by or partition by/order by so queries are deterministic (this does not require the paid Oracle partitioning feature)
Define the pattern
- Regular expression like pattern
- Ex: PATTERN (X+ Y+ Z+) where X/Y/Z is a boolean expression. Ex: bal < PREV(bal)
- Common quantifiers: * + ? {n} {n,} {n,m}
- Also have extra ? for reluctant quantifiers – helps deal with what to do with overlapping matches
Define measures
- Define columns in output table
- pattern navigation options; PREV, NEXT, FIRST, LAST
- column
- optional aggregates (COUNT, SUM, AVG, MAX, MIN)
- special measures: CLASSIFIER() – which component of the pattern applied to this row and MATCH_NUMBER() – sequential number of the match within each partition – both are good for debugging
- Ex: MEASURES FIRST(x.tstamp) as first_x
Controlling output
- by default get a column per measure along with the partitioning column (when using one row per match). Get more columns with all rows per match
- how many rows back: ONE ROW PER MATCH (default), ALL ROWS PER MATCH, or ALL ROWS PER MATCH WITH UNMATCHED ROWS (good for debugging)
- where to start next search: AFTER MATCH SKIP PAST LAST ROW (default), also options for next row and relating to variables
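The quantifiers in the “define the pattern” step behave like their regex counterparts. As an analogy only, here is the greedy versus reluctant difference in java.util.regex, matching characters rather than rows. The class and helper names (QuantifierDemo, firstMatch) are made up for this sketch, and it says nothing about Oracle’s implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<b>bold</b>";
        // Greedy: takes as much as possible, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", text));     // prints <b>bold</b>
        // Reluctant (extra ?): takes as little as possible, stops at the first '>'.
        System.out.println(firstMatch("<.+?>", text));    // prints <b>
        // Same idea with a counted quantifier.
        System.out.println(firstMatch("a{2,}", "aaaa"));  // prints aaaa
        System.out.println(firstMatch("a{2,}?", "aaaa")); // prints aa
    }
}
```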

Demo

Find 3 or more small (<2K) money transfers within 30 days. Then find large transfer (>=1M) within 10 days of last small transfer
Can do in SQL without pattern matching, but a lot of code.
Can do in Java, but. [copying the database…]
Showed how to create a table for JSON data – reads into a CLOB and Oracle checks it is valid JSON. Loaded with insert statements because live sql is web based and can’t access underlying file system.
Can use dot notation to access SQL fields

Sample pattern matching statement:


SELECT *
FROM transfers_view
MATCH_RECOGNIZE(
  ORDER BY time_id
  MEASURES
    user_id AS user_id,
    amount AS amount
  PATTERN (X{3,} Y)
  DEFINE
    X AS (amount < 2000) AND
         LAST(time_id) - FIRST(time_id) < 30,
    Y AS (amount >= 1000000) AND
         time_id - LAST(X.time_id) < 10);
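To get a feel for what the statement above is doing, here is a rough Java approximation of the same scan over an in-memory list. It is a sketch under several assumptions: the Transfer class, its day and amount fields, and the method name findSuspicious are hypothetical; time is simplified to a day number; the rows belong to one user and are already sorted by day. It is not how Oracle evaluates MATCH_RECOGNIZE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FraudScan {
    // Hypothetical row type: day number and amount of one transfer.
    static final class Transfer {
        final int day;
        final long amount;
        Transfer(int day, long amount) { this.day = day; this.amount = amount; }
    }

    // Returns the index of each large transfer that completes a match.
    // Rows must already be sorted by day (the ORDER BY step).
    static List<Integer> findSuspicious(List<Transfer> rows) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        while (i < rows.size()) {
            // X{3,}: extend the run of small transfers within 30 days of the first one.
            int j = i;
            while (j < rows.size()
                    && rows.get(j).amount < 2000
                    && rows.get(j).day - rows.get(i).day < 30) {
                j++;
            }
            boolean enoughSmall = j - i >= 3;
            // Y: the next row is large and within 10 days of the last small transfer.
            if (enoughSmall && j < rows.size()
                    && rows.get(j).amount >= 1_000_000
                    && rows.get(j).day - rows.get(j - 1).day < 10) {
                hits.add(j);
                i = j + 1; // similar in spirit to AFTER MATCH SKIP PAST LAST ROW
            } else {
                i++;       // no match starting here; try the next row
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Transfer> rows = Arrays.asList(
                new Transfer(1, 500), new Transfer(3, 700), new Transfer(5, 900),
                new Transfer(8, 1_500_000),   // large, 3 days after the third small one
                new Transfer(40, 100), new Transfer(41, 100),
                new Transfer(42, 2_000_000)); // large, but only two small ones before it
        System.out.println(findSuspicious(rows)); // prints [3]
    }
}
```

Even this simplified version needs the data pulled out of the database first, which is the trade-off the session was pointing at.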

My take: This was a two hour “tutorial” which differs from a hands-on lab. We were still able to follow along with a laptop or “large tablet.” I followed along with the demos on my Mac. Which also let me play a bit. It was fun. I’ve always liked SQL :). I like that he uses QR codes for the links/blogs he wants people to go to. They are also linked in the PowerPoint when it becomes available.

It was also interesting blogging on my laptop. On my tablet, I blog in HTML because it is a pain to use the visual editor on the tablet. A laptop has no such problem. But a laptop battery doesn’t last all day so…


Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky

Java/J2EE Software Development and Technology Discussion Blog

Tag Archives: regex
