How to master regular expressions to speed up your tasks

ExerciseFiles
emerson_self-reliance.txt
Tags
Regular Expression
regex
Published
May 20, 2022

Getting started with Regular Expressions

What are regular expressions?

  • A text pattern
  • Interpreted by a regex processor
  • Used for matching, searching, and replacing text

Usage examples

  • Test the format of a number, such as a phone or credit card number (the snippet below checks a US-style phone number)
const regex = /\d\d\d-\d\d\d-\d\d\d\d/gm;
// Alternative syntax using the RegExp constructor:
// const regex = new RegExp('\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d', 'gm');
const str = `555-961-2425`;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

  // The result can be accessed through the `m` variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}
  • Test if an email is valid or not
  • Search a document for either color or colour
  • Replace all occurrences of “gmail.com” with “gmail.com.vn”
  • Count how many times “training” is preceded by “blockchain”, “java”, “python”
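
A few of the tasks above can be sketched in JavaScript. The sample strings are made up for illustration and are not taken from the exercise file.

```javascript
// Hypothetical sample text, invented for this sketch.
const text = 'Contact: an@gmail.com, binh@gmail.com. Favourite color or colour?';

// Search for either spelling: the "u" is optional.
const spellings = text.match(/colou?r/g);
console.log(spellings); // [ 'color', 'colour' ]

// Replace every "gmail.com" with "gmail.com.vn" (the dot is escaped so it is literal).
const replaced = text.replace(/gmail\.com/g, 'gmail.com.vn');
console.log(replaced);

// Count how many times "training" is preceded by one of three words.
const courses = 'java training, python training, excel training, blockchain training';
const trainingCount = (courses.match(/(blockchain|java|python) training/g) || []).length;
console.log(trainingCount); // 3
```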

Regular expression engines

Programming languages and tools
  • JavaScript
  • Java
  • Python
  • Perl
  • PHP
  • .NET
  • MySQL
  • Unix
  • Apache
  • C/C++
  • Ruby
  • PostgreSQL
 
Text Editor
  • Atom
  • Sublime
  • Notepad++
Online Text Editor

Notation convention

  • Regular expression: /[a-zA-Z]+/
  • Text string: “hello world”

Regular expression modes

  • Standard: /[a-zA-Z]+/
  • Global: /[a-zA-Z]+/g
  • Case insensitive: /[a-zA-Z]+/i
  • Multiline: /[a-zA-Z]+/m
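
As a rough illustration, the modes map to JavaScript flags like this. The sample string is an assumption made for the sketch.

```javascript
const sample = 'Hello\nworld';

// Standard: only the first match is returned.
const first = sample.match(/[a-zA-Z]+/)[0];       // 'Hello'

// Global: every match is returned.
const all = sample.match(/[a-zA-Z]+/g);           // ['Hello', 'world']

// Case insensitive: [a-z] now also matches uppercase letters.
const insensitive = sample.match(/[a-z]+/gi);     // ['Hello', 'world']

// Multiline: ^ matches at the start of every line, not only the start of the string.
const stringStart = sample.match(/^[a-zA-Z]+/g);  // ['Hello']
const lineStarts = sample.match(/^[a-zA-Z]+/gm);  // ['Hello', 'world']

console.log(first, all, insensitive, stringStart, lineStarts);
```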

Characters

Literal character

  • /car/ matches “car”
  • /car/ matches the first 3 letters of “carnival”
  • This is the simplest match

Meta-characters

  • Characters with special meaning
  • Transform literal characters into powerful expressions
  • Only a few to learn
    • \ . * + - { } [ ] ^ $ | ? ( ) : ! =
  • Can have more than one meaning
The wildcard meta-character
  1. Dot meta-character “.”
      • Matches any character except a newline
      • /h.t/ matches “hat”, “hot”, and “hit”, but not “heat”
      • The most common meta-character
      • Also the most common source of mistakes
      • Ex: /9.00/ matches “9.00”, “9500”, and “9-00”
      A good regular expression should match the text you want to target and only that text, nothing more.
  2. Escape meta-character “\”
      • Allows a meta-character to be used as a literal character
      • Match a period with /\./
      • /9\.00/ matches “9.00”, but not “9500” or “9-00”
      • Match a backslash by escaping it with another backslash: /\\/
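
The difference between the wildcard and an escaped period can be checked with a small sketch (the test strings come from the examples above):

```javascript
const prices = ['9.00', '9500', '9-00'];

// Unescaped dot: any character is accepted in the second position.
const loose = prices.filter((p) => /9.00/.test(p));   // ['9.00', '9500', '9-00']

// Escaped dot: only a literal period is accepted.
const strict = prices.filter((p) => /9\.00/.test(p)); // ['9.00']

// The dot stands for exactly one character, so "heat" is rejected.
const hats = ['hat', 'hot', 'hit', 'heat'].filter((w) => /h.t/.test(w)); // ['hat', 'hot', 'hit']

// A literal backslash is matched by an escaped backslash.
const hasBackslash = /\\/.test('C:\\temp'); // true

console.log(loose, strict, hats, hasBackslash);
```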
Escaping meta-characters
  • Only meta-characters need to be escaped
  • Literal characters should never be escaped
  • Quotation marks are not meta-characters
    • You do not need to escape them
Other special characters
  • Space: a literal space character (\s matches any whitespace character, not only a space)
  • Tab \t
  • Line returns \r, \n, \r\n

Examples

  1. How many times does the word "self" appear?
      • /self/g found 61 matches
  2. Count himself, herself, itself, myself, yourself, thyself
      • /himself/g found 20
      • /herself/g found 0
      • /itself/g found 12
      • /myself/g found 6
      • /yourself/g found 7
      • /thyself/g found 1
      • /(him|her|it|my|your|thy)self/g found 46
  3. Using three literal characters and three wildcard characters, match: please, palace, parade
      • /p[la][elr]a[scd]e/g found 5
      • /p..a.e/g found 11
      • /p[la].a.e/g
  4. What matches /t..ch/ besides "teach"?
      • Found "ttach" and "touch"

Character Sets

  • [ - begin a character set
  • ] - end a character set

Define a character set

  • A list of characters
  • Matches only one character from that list
  • The order of characters in the set does not matter
  • /[aeiou]/ matches any one vowel
  • /gr[ea]y/ matches “grey” and “gray”

Character ranges

  • - range of characters
  • [0-9]: all digit characters from 0, 1,…,9
  • [A-Za-z]: all characters from a,b,…,z and A,B,…,Z
  • [50-99] is the same as [0-9]: it contains the character 5, the range 0-9, and the character 9, not the numbers 50 to 99
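
A short sketch of sets and ranges in JavaScript, including the [50-99] pitfall (the word “groy” is a made-up counterexample):

```javascript
// One character from the set, in either order of listing.
const greys = ['grey', 'gray', 'groy'].filter((w) => /gr[ea]y/.test(w)); // ['grey', 'gray']

// [50-99] is just "5", "0-9", "9", so it matches every single digit.
const digits = '0123456789';
const matched = digits.match(/[50-99]/g).join(''); // '0123456789'

// It matches one character at a time, never the two-digit number 42 as a unit.
const pieces = '42'.match(/[50-99]/g); // ['4', '2']

console.log(greys, matched, pieces);
```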

Negative character sets

  • ^ - negates the set: matches any one character that is not listed
  • Add ^ as the first character inside a character set
  • /[^aeiou]/ matches any one character that is not a lowercase vowel (consonants, but also digits, spaces, and punctuation)
  • /see[^mn]/ matches “seek” and “sees” but not “seem” or “seen”
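
A minimal check of negative sets. Note that a negative set still has to match one character, so it cannot match at the end of the string.

```javascript
const words = ['seek', 'sees', 'seem', 'seen'];
const kept = words.filter((w) => /see[^mn]/.test(w)); // ['seek', 'sees']

// [^mn] must consume a character, so a bare "see" does not match.
const bareSee = /see[^mn]/.test('see'); // false

// A negated vowel set accepts any non-vowel, including digits.
const digitIsNonVowel = /[^aeiou]/.test('7'); // true
const vowelRejected = /[^aeiou]/.test('a');   // false

console.log(kept, bareSee, digitIsNonVowel, vowelRejected);
```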

Metacharacter inside character sets

  • Most meta-characters inside character sets are already escaped
  • Do not need to escape them again
  • /h[a.]t/ matches “hat” and “h.t” but not “hot”
  • Exception: ] - ^ \
  • /var[[(][0-9][\])]/ matches “var[1]” or “var(1)”
  • /file[0\-\\_]1/ matches “file01”, “file-1”, “file\1”, and “file_1”
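
These rules can be tried out in JavaScript. This is a sketch only: the patterns are written without the newer `v` flag, under which set syntax is stricter.

```javascript
// Inside a set the dot is literal.
const dotSet = ['hat', 'h.t', 'hot'].filter((w) => /h[a.]t/.test(w)); // ['hat', 'h.t']

// "[" and "(" are literal inside a set; "]" must be escaped.
const vars = ['var[1]', 'var(1)', 'var{1}'].filter((v) => /var[[(][0-9][\])]/.test(v));
// ['var[1]', 'var(1)']

// The set holds four single characters: 0, -, \ and _.
const files = ['file01', 'file-1', 'file\\1', 'file_1', 'file_01']
  .filter((f) => /file[0\-\\_]1/.test(f));
// "file_01" is rejected: the set matches exactly one character before the 1.

console.log(dotSet, vars, files);
```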

Shorthand character sets

Shorthand   Meaning              Equivalent
\d          Digit                [0-9]
\w          Word character       [a-zA-Z0-9_]
\s          Whitespace           [ \t\r\n]
\D          Not digit            [^0-9]
\W          Not word character   [^a-zA-Z0-9_]
\S          Not whitespace       [^ \t\r\n]
  • /\d\d\d\d/ matches “1984”, but not “text”
  • /\w\w\w/ matches “ABC”, “123”, and “1_A”
  • /\w\s\w\w/ matches “I am”, but not “Am I”
  • /[\w\-]/ matches any word character or hyphen
  • /[^\d]/ is the same as both /\D/ and /[^0-9]/
  • /[^\d\s]/ is not the same as /[\D\S]/
  • /[^\d\s]/ = neither a digit nor a whitespace character
  • /[\D\S]/ = either not a digit or not a whitespace character, which is true of every character
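
The shorthand examples, and the difference between the last two sets, can be verified with a quick sketch:

```javascript
const fourDigits = /\d\d\d\d/.test('1984');                                  // true
const notDigits = /\d\d\d\d/.test('text');                                   // false
const threeWord = ['ABC', '123', '1_A'].every((s) => /\w\w\w/.test(s));      // true
const iAm = /\w\s\w\w/.test('I am');                                         // true
const amI = /\w\s\w\w/.test('Am I');                                         // false

// "a1 " holds a letter, a digit, and a space.
const neither = 'a1 '.match(/[^\d\s]/g); // ['a']: only the letter is neither digit nor whitespace
const either = 'a1 '.match(/[\D\S]/g);   // ['a', '1', ' ']: every character qualifies

console.log(fourDigits, notDigits, threeWord, iAm, amI, neither, either);
```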

Examples

  1. Match both "lives" and "lived"
      • /live[sd]/ found 7 matches
  2. Match "virtue" but not "virtues"
      • /virtue\b/ found 14 matches
  3. Match the numbers and periods on all numbered paragraphs
      • /\d\./ found 4 matches
  4. Find the 16-character word that starts with "c"
      • /c\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w/ found 1 match
      • /c\w{15}/ found 1 match

Repetition

Meta-character   Meaning
*                Preceding item, zero or more times
+                Preceding item, one or more times
?                Preceding item, zero or one time

Repetition meta-characters

  • /.+/ matches any string of characters except a line return
  • /Good .+\./ matches “Good morning.”, “Good day.”, “Good evening.”, and “Good night.”
  • /\d+/ matches “90210”
  • /\s[a-z]+ed\s/ matches lowercase words ending in “ed”
  • /apples*/ matches “apple”, “apples”, and “applessssssssss”
  • /apples+/ matches “apples” and “applessssss” but not “apple”
  • /apples?/ matches “apple” and “apples” but not all of “applesssssssssssss”
  • /\d\d\d\d*/ matches numbers with three digits or more
  • /\d\d\d+/ matches numbers with three digits or more
  • /colou?r/ matches “color” and “colour”
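
A small sketch of the three repetition meta-characters. Each one applies only to the item immediately before it, which the last line makes visible.

```javascript
// * : zero or more of the preceding "s"
const star = 'applessss'.match(/apples*/)[0];   // 'applessss'

// ? : zero or one "s", so only the first "s" is taken
const optional = 'applessss'.match(/apples?/)[0]; // 'apples'

// + : at least one "s" is required
const plusOnApple = /apples+/.test('apple');    // false

// Optional "u" covers both spellings.
const spellings = ['color', 'colour'].every((w) => /colou?r/.test(w)); // true

// \d+ takes the whole run of digits.
const zip = 'Zip 90210'.match(/\d+/)[0];        // '90210'

// The quantifier binds to the "e" alone, so /apple*/ can match "appl".
const bindsToE = 'appl'.match(/apple*/)[0];     // 'appl'

console.log(star, optional, plusOnApple, spellings, zip, bindsToE);
```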

Quantified repetition

Meta-character   Meaning
{                Start quantified repetition of the preceding item
}                End quantified repetition of the preceding item
  • {min,max}
    • min and max are non-negative integers
    • min must always be included and can be zero
    • max is optional
  • \d{4,8} matches numbers with four to eight digits
  • \d{4} matches numbers with exactly four digits
  • \d{4,} matches numbers with four or more digits
  • \d{0,} is the same as \d*
  • \d{1,} is the same as \d+
  • /\d{3}-\d{3}-\d{4}/ matches most US phone numbers
  • /A{1,2} bonds/ matches “A bonds” and “AA bonds”, but only the “AA bonds” part of “AAA bonds”
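
Quantified repetition can be sketched as follows. The inputs are illustrative; without anchors, a pattern may match part of a longer string.

```javascript
const phone = /\d{3}-\d{3}-\d{4}/.test('555-961-2425');  // true

// {4,8}: at most eight digits are taken from a longer run.
const capped = '123456789012'.match(/\d{4,8}/)[0];        // '12345678'

// {4,}: four or more, with no upper limit.
const openEnded = '12345'.match(/\d{4,}/)[0];             // '12345'

// {0,} behaves like *: both match the empty string when there are no digits.
const zeroOrMore = 'abc'.match(/\d{0,}/)[0] === 'abc'.match(/\d*/)[0]; // true

// Unanchored, the pattern finds "AA bonds" inside "AAA bonds".
const bonds = 'AAA bonds'.match(/A{1,2} bonds/)[0];       // 'AA bonds'

console.log(phone, capped, openEnded, zeroOrMore, bonds);
```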

Greedy expression

  • Standard repetition quantifiers are greedy
  • Expression tries to match the longest possible string
  • Defers to achieving an overall match
  • /.+\.jpg/ matches “filename.jpg”
  • Gives back as little as possible
  • /.*[0-9]+/ matches “Page 266”
  • .* portion matches “Page 26”
  • [0-9]+ portion matches only “6”
  • Match as much as possible before giving control to the next expression part
Regular expression engines are eager
Regular expression engines are greedy
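
The “Page 266” example can be observed directly. The parentheses (covered under Grouping) are added here only to expose what each part matched.

```javascript
const parts = 'Page 266'.match(/(.*)([0-9]+)/);

// .* takes as much as it can and gives back only the one digit that [0-9]+ needs.
const greedyPart = parts[1]; // 'Page 26'
const digitPart = parts[2];  // '6'

// .+ first runs to the end of the string, then backs off so that \.jpg can match.
const file = 'filename.jpg'.match(/.+\.jpg/)[0]; // 'filename.jpg'

console.log(greedyPart, digitPart, file);
```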

Lazy expression

Meta-character   Meaning
?                Makes the preceding quantifier lazy
  • *?, +?, {min,max}?, ??
  • Instructs quantifier to use a “lazy strategy” for making choices
  • Match as little as possible before giving control to the next expression part
  • Still defers to the overall match
  • Not necessarily faster or slower
  • /\d+\w+\d+/ matches “01_FY_07_report_99” in “01_FY_07_report_99.xls”
  • /\d+\w+?\d+/ matches only “01_FY_07” in “01_FY_07_report_99.xls”
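
A sketch comparing the greedy and lazy versions on the same file name. The quoted sentence is an invented sample.

```javascript
const name = '01_FY_07_report_99.xls';

// Greedy \w+ runs as far as it can, leaving only the final digit for \d+.
const greedy = name.match(/\d+\w+\d+/)[0]; // '01_FY_07_report_99'

// Lazy \w+? stops as soon as digits can follow.
const lazy = name.match(/\d+\w+?\d+/)[0];  // '01_FY_07'

// Lazy matching keeps each quoted phrase separate.
const quoted = 'say "yes" or "no"';
const greedyQuote = quoted.match(/".+"/)[0]; // '"yes" or "no"'
const lazyQuotes = quoted.match(/".+?"/g);   // ['"yes"', '"no"']

console.log(greedy, lazy, greedyQuote, lazyQuotes);
```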

Examples

  1. Match self, himself, herself, itself, myself, yourself, thyself
      • /\w*self\b/g found 60 matches
  2. Match both "virtue" and "virtues"
      • /virtues?/ found 16 matches
  3. Use quantified repetition to find the word that starts with T and has 12 letters
      • /T\w{11}/g found 1 match
  4. Match all text inside quotation marks, but nothing that is not inside them
      • /".+?"/g found 19 matches

Grouping and Alternation

Meta-character   Meaning
(                Start grouped expression
)                End grouped expression

Grouping meta-characters

  • Group portions of the expression
  • Apply repetition operations to a group
  • Create a group of alternation expressions
  • Capture group for use in matching and replacing
  • /(abc)+/ matches “abc” and “abcabcabc”
  • /(in)?dependent/ matches “independent” and “dependent”
  • /run(s)?/ is the same as /runs?/
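
A brief sketch of grouping, including a captured group reused in a replacement. The phone number and the output format are illustrative choices.

```javascript
// Repetition applied to the whole group.
const repeated = 'abcabcabc'.match(/(abc)+/)[0]; // 'abcabcabc'

// An optional group.
const both = ['independent', 'dependent'].every((w) => /(in)?dependent/.test(w)); // true

// Captured groups are available as $1, $2, $3 in the replacement string.
const formatted = '555-961-2425'.replace(/(\d{3})-(\d{3})-(\d{4})/, '($1) $2-$3');
// '(555) 961-2425'

console.log(repeated, both, formatted);
```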

Alternation meta-characters

Meta-character   Meaning
|                Match previous or next expression

Efficiency when using alternation

  • | is an OR operator
  • Either match the expression on the left or match the expression on the right
  • Ordered, the leftmost expression gets precedence
  • Group alternation expressions to keep them distinct
  • /apple|orange/ matches “apple” and “orange”
  • /abc|def|ghi|jkl/ matches “abc”, “def”, “ghi”, and “jkl”
  • /apple(juice|sauce)/ is not the same as /applejuice|sauce/
  • /w(ei|ie)rd/ matches “weird” and “wierd”
  • /(AA|BB|CC){4}/ matches “AABBAACC” and “CCCCBBBB”
  • Put simplest (most efficient) expression first
  • /\w+_\d{2,4}|\d{4}_export|export_\d{2}/
  • It should be
    • /export_\d{2}|\d{4}_export|\w+_\d{2,4}/
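
Two of the points above, grouping and left-to-right precedence, can be seen in a short sketch. The “peanutbutter” string is a hypothetical example chosen for illustration.

```javascript
// Without a group, the alternation splits the whole expression in two.
const grouped = /apple(juice|sauce)/.test('sauce');   // false: "apple" is required
const ungrouped = /applejuice|sauce/.test('sauce');   // true: "sauce" alone is one alternative

// The leftmost alternative that matches wins, even if a longer one exists.
const shortFirst = 'peanutbutter'.match(/peanut|peanutbutter/)[0]; // 'peanut'
const longFirst = 'peanutbutter'.match(/peanutbutter|peanut/)[0];  // 'peanutbutter'

// A repeated group may pick a different alternative on each repetition.
const mixed = /(AA|BB|CC){4}/.test('AABBAACC'); // true

console.log(grouped, ungrouped, shortFirst, longFirst, mixed);
```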

Examples

  1. Match "myself", "yourself", "thyself", but not "himself", "herself", "itself"
      • /(my|your|thy)self/g found 14 matches
  2. Match "good", "goodness", and "goods" without typing "good" more than once
      • /good(s|ness)?/g found 23 matches
  3. Match "do" or "does" followed by "no", "not" or "nothing", even when it occurs at the start of a sentence
      • /[dD]o(es)? (nothing|not|no){1}/g found 35 matches

Anchors

Meta-character   Meaning
^                Start of string/line
$                End of string/line
\A               Start of string, never start of line
\Z               End of string, never end of line

Start and End anchors

  • Reference a position, not an actual character
  • Zero-width
  • /^apple/ or /\Aapple/
  • /apple$/ or /apple\Z/
  • /^apple$/ or /\Aapple\Z/

Line breaks and multiline mode

  • ^ and $ do not match at line breaks in single-line mode
  • ^ and $ match at the start and end of each line in multiline mode
  • \A and \Z never match at line breaks, in either single-line or multiline mode
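
A sketch of anchors with and without multiline mode. JavaScript's RegExp does not support \A or \Z, so only ^ and $ are shown. The two-line string is a made-up sample.

```javascript
const lines = 'apple pie\napple tart';

// Single-line mode: ^ matches only at the very start of the string.
const singleMode = lines.match(/^apple/g).length; // 1

// Multiline mode: ^ also matches after each line break.
const multiMode = lines.match(/^apple/gm).length; // 2

// $ behaves the same way at the end.
const endSingle = /pie$/.test(lines);  // false: the string ends with "tart"
const endMulti = /pie$/m.test(lines);  // true: "pie" ends the first line

// Both anchors together require the whole string to match.
const whole = /^apple$/.test('pineapple'); // false

console.log(singleMode, multiMode, endSingle, endMulti, whole);
```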

Word boundaries

Meta-character   Meaning
\b               Word boundary (start/end of word)
\B               Not a word boundary
  • Reference a position, not an actual character
  • Before the first word character in the string
  • After the last word character in the string
  • Between a word character and a non-word character
  • Word character: [A-Za-z0-9_]
  • /\b\w+\b/ finds four matches in “This is a test.”
  • /\b\w+\b/ matches all of “abc_123” but only part of “top-notch”
  • /\bNew\bYork\b/ does not match “New York”: \b is zero-width and does not consume the space
  • /\bNew\b \bYork\b/ matches “New York”
  • /\B\w+\B/ finds two matches in “This is a test.” (“hi” and “es”)
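
The word-boundary examples can be reproduced in JavaScript with a minimal sketch:

```javascript
const sentence = 'This is a test.';

const words = sentence.match(/\b\w+\b/g);       // ['This', 'is', 'a', 'test']
const underscore = 'abc_123'.match(/\b\w+\b/g); // ['abc_123']: "_" is a word character
const hyphen = 'top-notch'.match(/\b\w+\b/g);   // ['top', 'notch']: "-" is not

// \b is a position, so it cannot stand in for the space between the words.
const noSpace = /\bNew\bYork\b/.test('New York');     // false
const withSpace = /\bNew\b \bYork\b/.test('New York'); // true

// \B requires the match to start and end inside a word.
const inner = sentence.match(/\B\w+\B/g); // ['hi', 'es']

console.log(words, underscore, hyphen, noSpace, withSpace, inner);
```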

Examples

  1. How many paragraphs start with "I" as in "I read"?
      • /^I.+/g found 9 matches
  2. How many paragraphs end with a question mark?
      • /.*\?$/g found 1 match
  3. Match all words with exactly 15 letters, including hyphenated words
      • /\b[a-zA-Z0-9_-]{15}\b/ found 3 matches
      • /\b(\w|-){15}\b/g found 3 matches

References