[2019 oracle code one] mastering regular expressions

Mastering Regular Expressions (^.*$)(?#everything) You Should Know

Speakers: Fernando Babadopulos @babadopulos

For more blog posts, see The Oracle Code One table of contents


History

  • Invented in 1951
  • Popularized in the 70’s with vi/lex/sed/awk/expr
  • In the 80’s, Perl included advanced regex
  • Java 1.4 – added regex

General

  • Used for a spam filter before AI started detecting spam
  • Not just for developers. Can do lots of things in text editor

Engine

  • Eager – starts with leftmost match and first match is good enough.
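A minimal sketch of the "eager" behavior in Java, assuming the standard java.util.regex engine. The class and method names (EagerDemo, firstMatch) are made up for illustration and are not from the talk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EagerDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The leftmost match in the input wins, even though "set" is listed first.
        System.out.println(firstMatch("set|get", "getter and setter")); // get
        // At a given position, the first alternative that succeeds is good enough,
        // so the longer "getter" alternative is never tried.
        System.out.println(firstMatch("get|getter", "getter")); // get
    }
}
```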

Regex Examples

  • /Java/ – match first string “Java”
  • /database/g – match all instances of the string “database” (the syntax is different in Java; typing without the / from now on)
  • .* – match everything (or nothing) – as much as it can, because the * quantifier is greedy
  • [dl]ate – character class matching d or l (followed by ate)
  • a[^p] – negated character class. Matches “a” followed by any character other than “p”
  • database[0-9]\. – range of database0. to database9.
  • \d\d.\d\d – two digits, any character, and two more digits. Probably not what you want. It works if the data is valid, but it will also match 15624. Say what you want instead, ex: [:h] in place of the dot. (Or escape the dot if you want a literal period.)
  • ^(.*),(.*) – want the first two fields of a CSV [unless there are commas inside a quoted field]. Backtracks a lot, though, because .* consumes the entire string and then backtracks one character at a time until it gets to a comma. Then it backtracks more. Better to write [^,] than dot so the regex expresses what you actually want.
  • ^([^,]*),([^,]*)$ – match exactly two fields [unless there are commas within quotes for an element]
  • ^[a-z]*$ – empty file or only lower case letters
  • get|set – match either string
  • \bget|set\b – matches words that start with get or end with set – not what was intended
  • \b(get|set)\b – only matches the words get or set, since it checks for a word boundary on both sides (space, tab, etc)
  • Jan(uary)? – matches Jan or January
  • a{0,10} – 0-10 a’s
  • a{10} – exactly 10 a’s
  • a{0,} – zero or more a’s
  • a{1,} – at least one a
  • Java(?=Script) – positive lookahead – Java followed by Script
  • Java(?!Script) – negative lookahead – Java not followed by Script
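A few of the patterns above, tried out in Java as a rough sketch. The class name, constants and helper methods (RegexExamples, findAll, twoFields, greedyFirstField) are hypothetical and only serve the illustration. Note the doubled backslashes that Java string literals need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExamples {
    static final Pattern GREEDY_FIELDS = Pattern.compile("^(.*),(.*)");
    static final Pattern TWO_FIELDS = Pattern.compile("^([^,]*),([^,]*)$");
    static final Pattern GET_OR_SET = Pattern.compile("\\b(get|set)\\b");
    static final Pattern UNGROUPED = Pattern.compile("\\bget|set\\b");
    static final Pattern JAVA_NOT_SCRIPT = Pattern.compile("Java(?!Script)");

    // Collects every match of p in input, like the /g flag.
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Returns the two fields, or null if the line does not have exactly two.
    static String[] twoFields(String line) {
        Matcher m = TWO_FIELDS.matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    // First group of the greedy version; .* runs up to the LAST comma.
    static String greedyFirstField(String line) {
        Matcher m = GREEDY_FIELDS.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(greedyFirstField("a,b,c"));                     // a,b
        System.out.println(Arrays.toString(twoFields("a,b")));             // [a, b]
        System.out.println(twoFields("a,b,c") == null);                    // true
        System.out.println(findAll(GET_OR_SET, "get getter reset set"));   // [get, set]
        System.out.println(findAll(UNGROUPED, "get getter reset set"));    // [get, get, set, set]
        System.out.println(findAll(JAVA_NOT_SCRIPT, "Java and JavaScript").size()); // 1
    }
}
```

The greedy version shows why the negated character class is the safer choice: on a three-field line it quietly captures "a,b" as the first field instead of rejecting the line.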

Escaping

  • \( – escape paren
  • \d – shorthand for digit
  • \s – shorthand for whitespace
  • \w – shorthand for word character
  • \\ – backslash
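In Java source, every regex backslash has to be escaped again inside the string literal. A small sketch of what that looks like, assuming java.util.regex. EscapingDemo and its constants are illustrative names, and Pattern.quote is a standard helper that was not mentioned in the talk.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Regex \(\d+\.\d\d\) : literal parens around digits, a literal dot, two digits.
    static final Pattern PRICE = Pattern.compile("\\(\\d+\\.\\d\\d\\)");
    // Regex \w+\s\w+ : two words separated by one whitespace character.
    static final Pattern TWO_WORDS = Pattern.compile("\\w+\\s\\w+");
    // Regex \\ : one literal backslash (four backslashes in Java source).
    static final Pattern BACKSLASH = Pattern.compile("\\\\");
    // Pattern.quote escapes a whole string so it is matched literally.
    static final Pattern LITERAL_A_DOT_B = Pattern.compile(Pattern.quote("a.b"));

    public static void main(String[] args) {
        System.out.println(PRICE.matcher("(12.50)").matches());         // true
        System.out.println(PRICE.matcher("(12x50)").matches());         // false
        System.out.println(TWO_WORDS.matcher("hello world").matches()); // true
        System.out.println(BACKSLASH.matcher("C:\\temp").find());       // true
        System.out.println(LITERAL_A_DOT_B.matcher("axb").matches());   // false
    }
}
```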

Tips

  • Avoid . – use the characters you want or a negated character class.
  • Use anchors (^ and $) wherever possible.
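Both tips can be seen on the time example from earlier. This is only a sketch: TipsDemo, LOOSE_TIME and STRICT_TIME are hypothetical names, and the strict pattern checks the shape of the input only (it would still accept 99:99).

```java
import java.util.regex.Pattern;

class TipsDemo {
    // Dot and no anchors: finds a "time" anywhere, with any separator.
    static final Pattern LOOSE_TIME = Pattern.compile("\\d\\d.\\d\\d");
    // Explicit separators and anchors: the whole input must be dd:dd or ddhdd.
    static final Pattern STRICT_TIME = Pattern.compile("^\\d\\d[:h]\\d\\d$");

    static boolean looseTime(String s) {
        return LOOSE_TIME.matcher(s).find();
    }

    static boolean strictTime(String s) {
        return STRICT_TIME.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(looseTime("15624"));           // true, the dot matched "6"
        System.out.println(strictTime("15624"));          // false
        System.out.println(strictTime("15:24"));          // true
        System.out.println(strictTime("15h24"));          // true
        System.out.println(strictTime("at 15:24 today")); // false, anchors reject extra text
    }
}
```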

Regex101.com

  • Type regex
  • Gives explanation of regex typed
  • Can set flags ex: global
  • Area for test string to see what matches
  • Has a code generator so can get regex with proper Java escaping
  • Has a debugger – can step through the parts of the regex and see what matches at each step. It also shows backtracking. This seems like a good way to see the efficiency of a regex as well. [Cool!]

Other URLs:

https://www.regular-expressions.info – tutorial

https://regexper.com – create graphs

My take

I enjoyed the debate (and then vote) on how to pronounce regex before the talk started! Half of the audience raised their hands for liking regular expressions. Biased crowd of course. The room was awkward and the lectern hid part of the screen. I like that he showed a lot of examples and the execution graph. I really like the debugger on regex101. Learning that was worth attending the talk on its own! As was the regexper graph site.
