Getting started with Regular Expressions
What are regular expressions?
- A text pattern
- Interpreted by a regex processor
- Used for matching, searching, and replacing text
Usage examples
- Test a phone number
const regex = /\d\d\d-\d\d\d-\d\d\d\d/gm;
// Alternative syntax using the RegExp constructor
// const regex = new RegExp('\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d', 'gm')
const str = `555-961-2425`;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

  // The result can be accessed through the `m`-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}
- Test if an email is valid or not
- Search a document for either color or colour
- Replace all occurrences of “gmail.com” with “gmail.com.vn”
- Count how many times “training” is preceded by “blockchain”, “java”, “python”
Regular expression engines
Programming languages
- JavaScript
- Java
- Python
- Perl
- PHP
- .NET
- MySQL
- Unix
- Apache
- C/C++
- Ruby
- PostgreSQL
Text Editor
- Atom
- Sublime
- Notepad++
Online Text Editor
Notation convention
- Regular expression: /[a-zA-Z]+/
- Text string: “hello world”
Regular expression modes
- Standard: /[a-zA-Z]+/
- Global: /[a-zA-Z]+/g
- Case insensitive: /[a-zA-Z]+/i
- Multiline: /[a-zA-Z]+/m
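The four modes can be compared side by side in JavaScript. This is a minimal sketch: the input string and the variable names are assumptions made for illustration, and the last two patterns are small variations on the one above so that the effect of each flag is visible.

```javascript
// Illustrative input (an assumption for this sketch): two lines of text.
const modeText = "Hello\nworld";

// Standard: only the first match is returned.
const standardMatch = modeText.match(/[a-zA-Z]+/);

// Global: every match is returned.
const globalMatches = modeText.match(/[a-zA-Z]+/g);

// Case insensitive: [a-z] now also matches uppercase letters.
const insensitiveMatches = modeText.match(/[a-z]+/gi);

// Multiline: ^ matches at the start of every line, not only the string.
const multilineMatches = modeText.match(/^[a-zA-Z]+/gm);

console.log(standardMatch[0]);   // "Hello"
console.log(globalMatches);      // ["Hello", "world"]
console.log(insensitiveMatches); // ["Hello", "world"]
console.log(multilineMatches);   // ["Hello", "world"]
```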
Characters
Literal character
- /car/ matches “car”
- /car/ matches the first 3 letters of “carnival”
- This is the simplest match
Meta-characters
- Characters with special meaning
- Transform literal character into powerful expressions
- Only a few to learn
- \ . * + - { } [ ] ^ $ | ? ( ) : ! =
- Can have more than one meaning
The wildcard meta-character
- Dot meta-character “.”
- Matches any character except a newline character
- /h.t/ matches “hat”, “hot”, and “hit”, but not “heat”
- Most common metacharacter
- Most common mistake
- Ex: /9.00/ matches “9.00”, “9500”, and “9-00”
A good regular expression should match the text you want to target and only that text, nothing more
- Escape meta-character “\”
- Allow using meta-character as literal character
- Match a period with /\./
- /9\.00/ matches “9.00”, but not “9500” or “9-00”
- Match a backslash by escaping a backslash /\\/
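The difference between the unescaped and the escaped dot can be checked with a short sketch. The variable names and the Windows-style path are assumptions chosen for illustration.

```javascript
// The three candidate strings from the notes above.
const priceCandidates = ["9.00", "9500", "9-00"];

// Unescaped dot: "." stands for any character, so all three match.
const looseHits = priceCandidates.filter((s) => /9.00/.test(s));

// Escaped dot: "\." matches only a literal period.
const strictHits = priceCandidates.filter((s) => /9\.00/.test(s));

// A literal backslash is matched by escaping the backslash itself.
// Note that the JavaScript string literal needs its own escaping too.
const hasBackslash = /\\/.test("C:\\temp");

console.log(looseHits);    // ["9.00", "9500", "9-00"]
console.log(strictHits);   // ["9.00"]
console.log(hasBackslash); // true
```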
Escaping meta-character
- Only for meta-character
- Literal characters should never be escaped
- Quotation marks are not meta-characters
- You do not need to escape them
Other special characters
- Space: a literal space character (\s matches any whitespace character)
- Tab \t
- Line returns \r, \n, \r\n
Examples
1. How many times does the word "self" appear?
- /self/g found 61 matches
2. Count himself, herself, itself, myself, yourself, thyself
- /himself/g found 20
- /herself/g found 0
- /itself/g found 12
- /myself/g found 6
- /yourself/g found 7
- /thyself/g found 1
- /(him|her|it|my|your|thy)self/g found 46
3. Using three literal characters and three wildcard characters, match: please, palace, parade
- /p[la][elr]a[scd]e/g found 5
- /p..a.e/g found 11
- /p[la].a.e/g
4. What matches /t..ch/ besides "teach"?
- Found "ttach" and "touch"
Character Sets
- [ - begin a character set
- ] - end a character set
Define a character set
- A list of characters
- But only one character in that set
- The order of characters in the set does not matter
- /[aeiou]/ matches any one vowel
- /gr[ea]y/ matches “grey” and “gray”
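A quick way to see that a set matches exactly one character, and that the order inside the set is irrelevant, is the sketch below. The test strings (including the non-word “groy”) are assumptions made for illustration.

```javascript
// [ea] accepts exactly one character: either "e" or "a".
const greyMatches = "grey gray groy".match(/gr[ea]y/g);

// Any one vowel; each vowel in the input is a separate match.
const vowelMatches = "regex".match(/[aeiou]/g);

// The same set written in a different order behaves identically.
const reorderedMatches = "regex".match(/[uoiea]/g);

console.log(greyMatches);      // ["grey", "gray"]
console.log(vowelMatches);     // ["e", "e"]
console.log(reorderedMatches); // ["e", "e"]
```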
Character ranges
- "-" defines a range of characters inside a character set
- [0-9]: all digit characters 0, 1, …, 9
- [A-Za-z]: all letters a, b, …, z and A, B, …, Z
- [50-99] is the same as [0-9]: it contains 5, the range 0-9, and 9
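The [50-99] pitfall is easy to verify. The sketch below anchors the pattern with ^ and $ (an addition for this example) so that the whole string must be a single character from the set; the variable names are illustrative.

```javascript
// Ranges inside a set still match one character at a time.
const digitMatches = "a1b22c".match(/[0-9]/g);
const letterMatches = "Ab3_z".match(/[A-Za-z]/g);

// [50-99] is read as: "5", the range "0-9", and "9", so it is
// just another way of writing "any single digit".
const oddLookingSet = /^[50-99]$/;
const acceptsEveryDigit = "0123456789"
  .split("")
  .every((digit) => oddLookingSet.test(digit));

// It does not mean "a number from 50 to 99": two characters never fit.
const acceptsSeventyFive = oddLookingSet.test("75");

console.log(digitMatches);       // ["1", "2", "2"]
console.log(letterMatches);      // ["A", "b", "z"]
console.log(acceptsEveryDigit);  // true
console.log(acceptsSeventyFive); // false
```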
Negative character sets
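- ^ negates a character set: match any one character that is not in the set
- Add ^ as the first character inside a character set
- /[^aeiou]/ matches any one character that is not a lowercase vowel (consonants, but also digits, spaces, and punctuation)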
- /see[^mn]/ matches “seek” and “sees” but not “seem” or “seen”
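Two properties of negative sets are worth checking in code: a negative set still has to consume a character, and it accepts anything outside the set, not only letters. The inputs “see” and “a 1” are assumptions added for this sketch.

```javascript
const seeWords = ["seek", "sees", "seem", "seen"];

// Keeps only the words whose fourth letter is neither "m" nor "n".
const allowedSeeWords = seeWords.filter((word) => /see[^mn]/.test(word));

// A negative set must still match one character, so "see" alone fails.
const bareSeeMatches = /see[^mn]/.test("see");

// [^aeiou] also matches the space and the digit here.
const nonVowels = "a 1".match(/[^aeiou]/g);

console.log(allowedSeeWords); // ["seek", "sees"]
console.log(bareSeeMatches);  // false
console.log(nonVowels);       // [" ", "1"]
```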
Metacharacter inside character sets
- Most meta-characters inside character sets are already escaped
- Do not need to escape them again
- /h[a.]t/ matches “hat” and “h.t” but not “hot”
- Exception: ] - ^ \
- /var[[(][0-9][\])]/ matches “var[1]” or “var(1)”
- /file[0\-\\_]1/ matches “file01”, “file-1”, “file\1”, and “file_1”, but not “file_01”
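The escaping rules inside a set can be checked as follows. This is a sketch with illustrative variable names; note that a backslash inside a JavaScript string literal must itself be written as "\\".

```javascript
// Inside a set the dot is a literal period, not a wildcard.
const dotInSet = ["hat", "h.t", "hot"].filter((s) => /h[a.]t/.test(s));

// "[" and "(" need no escaping inside a set, but "]" does.
const indexPattern = /var[[(][0-9][\])]/;
const indexHits = ["var[1]", "var(1)", "var1"].filter((s) => indexPattern.test(s));

// The set holds four characters: "0", "-", "\" and "_".
const filePattern = /file[0\-\\_]1/;
const fileHits = ["file01", "file-1", "file\\1", "file_1", "file_01"].filter(
  (s) => filePattern.test(s)
);

console.log(dotInSet);  // ["hat", "h.t"]
console.log(indexHits); // ["var[1]", "var(1)"]
console.log(fileHits);  // ["file01", "file-1", "file\\1", "file_1"]
```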
Shorthand character sets
Shorthand | Meaning | Equivalent |
\d | Digit | [0-9] |
\w | Word character | [a-zA-Z0-9_] |
\s | Whitespace | [ \t\r\n] |
\D | Not digit | [^0-9] |
\W | Not word character | [^a-zA-Z0-9_] |
\S | Not whitespace | [^ \t\r\n] |
- /\d\d\d\d/ matches “1984”, but not “text”
- /\w\w\w/ matches “ABC”, “123”, and “1_A”
- /\w\s\w\w/ matches “I am”, but not “Am I”
- /[\w\-]/ matches any word character or hyphen
- /[^\d]/ is the same as both /\D/ and /[^0-9]/
- /[^\d\s]/ is not the same as /[\D\S]/
- /[^\d\s]/ = NOT (digit OR whitespace character)
- /[\D\S]/ = EITHER NOT digit OR NOT whitespace character, which matches any character
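The difference between /[^\d\s]/ and /[\D\S]/ is easiest to see on a string that contains a letter, a digit, and a space. The sample string "a1 " and the variable names are assumptions made for this sketch.

```javascript
// \w\w\w takes three word characters at a time.
const wordTriples = "ABC 123 1_A".match(/\w\w\w/g);

// "I am" fits word, whitespace, word, word; "Am I" does not.
const fitsIAm = /\w\s\w\w/.test("I am");
const fitsAmI = /\w\s\w\w/.test("Am I");

// Sample input: one letter, one digit, one space.
const mixedSample = "a1 ";

// Neither a digit nor whitespace: only the letter qualifies.
const neitherDigitNorSpace = mixedSample.match(/[^\d\s]/g);

// Non-digit OR non-whitespace: every character satisfies at least one.
const eitherNonDigitOrNonSpace = mixedSample.match(/[\D\S]/g);

console.log(wordTriples);              // ["ABC", "123", "1_A"]
console.log(fitsIAm, fitsAmI);         // true false
console.log(neitherDigitNorSpace);     // ["a"]
console.log(eitherNonDigitOrNonSpace); // ["a", "1", " "]
```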
Examples
1. Match both "lives" and "lived"
- /live[sd]/ found 7 matches
2. Match "virtue" but not "virtues"
- /virtue\b/ found 14 matches
3. Match the numbers and periods on all numbered paragraphs
- /\d\./ found 4 matches
4. Find the 16-character word that starts with "c"
- /c\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w/ found 1 match
- /c\w{15}/ found 1 match
Repetition
Meta-character | Meaning |
* | Preceding item, zero or more times |
+ | Preceding item, one or more times |
? | Preceding item, zero or one time |
Repetition meta-characters
- /.+/ matches any string of characters except a line return
- /Good .+\./ matches “Good morning.”, “Good day.”, “Good evening.”, and “Good night.”
- /\d+/ matches “90210”
- /\s[a-z]+ed\s/ matches lowercase words ending in “ed”
- /apples*/ matches “apple”, “apples”, and “applessssssssss”
- /apples+/ matches “apples” and “applessssss” but not “apple”
- /apples?/ matches “apple” and “apples” but not “applesssssssssssss”
- /\d\d\d\d*/ matches numbers with three digits or more
- /\d\d\d+/ matches numbers with three digits or more
- /colou?r/ matches “color” and “colour”
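A repetition meta-character applies only to the item directly before it, which is the final "s" in the apple patterns. The sketch below adds ^ and $ anchors (an assumption for this example) so that each pattern has to cover the whole word; without them, a pattern such as /apples?/ would still find “apples” inside a longer run of s characters.

```javascript
const appleWords = ["apple", "apples", "applesss"];

// Zero or more "s": all three words qualify.
const zeroOrMore = appleWords.filter((w) => /^apples*$/.test(w));

// One or more "s": the bare "apple" is rejected.
const oneOrMore = appleWords.filter((w) => /^apples+$/.test(w));

// Zero or one "s": the long run of "s" is rejected.
const zeroOrOne = appleWords.filter((w) => /^apples?$/.test(w));

// The quantifier binds to the single preceding character:
// /apple*/ means "appl" followed by any number of "e".
const bindsToLastChar = /^apple*$/.test("appl");

const bothSpellings = "color colour".match(/colou?r/g);

console.log(zeroOrMore);      // ["apple", "apples", "applesss"]
console.log(oneOrMore);       // ["apples", "applesss"]
console.log(zeroOrOne);       // ["apple", "apples"]
console.log(bindsToLastChar); // true
console.log(bothSpellings);   // ["color", "colour"]
```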
Quantified repetition
Meta-character | Meaning |
{ | start quantified repetition of the preceding item |
} | end quantified repetition of the preceding item |
- {min,max}
- min and max are non-negative integers
- min must always be included and can be zero
- max is optional
- \d{4,8} matches numbers with four to eight digits
- \d{4} matches numbers with exactly four digits
- \d{4,} matches numbers with four or more digits
- \d{0,} is the same as \d*
- \d{1,} is the same as \d+
- /\d{3}-\d{3}-\d{4}/ matches most US phone numbers
- /A{1,2} bonds/ matches “A bonds” and “AA bonds” not “AAA bonds”
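Quantified repetition can be tried out with the sketch below. The bond pattern is anchored with ^ and $ here (an addition for this example), because the unanchored pattern would still find “AA bonds” inside “AAA bonds”. The digit strings are illustrative.

```javascript
// Three digits, three digits, four digits, separated by hyphens.
const phoneShape = /\d{3}-\d{3}-\d{4}/;
const looksLikePhone = phoneShape.test("555-961-2425");

// One or two "A" characters, then " bonds", and nothing else.
const bondShape = /^A{1,2} bonds$/;
const bondHits = ["A bonds", "AA bonds", "AAA bonds"].filter((s) =>
  bondShape.test(s)
);

// {4,8} is greedy: it takes up to eight digits when they are available.
const firstDigitRun = "12345678901".match(/\d{4,8}/)[0];

// {4,} has no upper bound but still requires at least four digits.
const threeDigitsOk = /^\d{4,}$/.test("123");
const fiveDigitsOk = /^\d{4,}$/.test("12345");

console.log(looksLikePhone); // true
console.log(bondHits);       // ["A bonds", "AA bonds"]
console.log(firstDigitRun);  // "12345678"
console.log(threeDigitsOk);  // false
console.log(fiveDigitsOk);   // true
```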
Greedy expression
- Standard repetition quantifiers are greedy
- Expression tries to match the longest possible string
- Defers to achieving an overall match
- /.+\.jpg/ matches “filename.jpg”
- Gives back as little as possible
- /.*[0-9]+/ matches “Page 266”
- .* portion matches “Page 26”
- [0-9]+ portion matches only “6”
- Match as much as possible before giving control to the next expression part
Regular expression engines are eager
Regular expression engines are greedy
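The “Page 266” example can be reproduced by wrapping each portion in parentheses so that JavaScript reports what each one matched. The parentheses (capture groups) and the second input string are additions made for this sketch.

```javascript
// Group 1 is the .* portion, group 2 is the [0-9]+ portion.
const greedyParts = "Page 266".match(/(.*)([0-9]+)/);

// .* first takes the whole string, then gives back one character
// so that [0-9]+ can match at least one digit.
const dotStarPart = greedyParts[1];
const digitPart = greedyParts[2];

// Greedy .+ runs to the last ".jpg" it can find, not the first.
const greedyJpg = "a.jpg b.jpg".match(/.+\.jpg/)[0];

console.log(dotStarPart); // "Page 26"
console.log(digitPart);   // "6"
console.log(greedyJpg);   // "a.jpg b.jpg"
```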
Lazy expression
Meta-character | Meaning |
? | Make preceding quantifier lazy |
- *?, +?, {min,max}?, ??
- Instructs quantifier to use a “lazy strategy” for making choices
- Match as little as possible before giving control to the next expression part
- Still defers to the overall match
- Not necessarily faster or slower
- /\d+\w+\d+/ matches “01_FY_07_report_99” in “01_FY_07_report_99.xls”
- /\d+\w+?\d+/ matches “01_FY_07” in “01_FY_07_report_99.xls”
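Running both patterns against the same file name shows how far each strategy reaches. The quoted-text comparison at the end uses an illustrative string that is not part of the original notes.

```javascript
const reportName = "01_FY_07_report_99.xls";

// Greedy \w+ runs as far as it can, then gives back a single digit.
const greedyResult = reportName.match(/\d+\w+\d+/)[0];

// Lazy \w+? stops at the first point where \d+ can take over.
const lazyResult = reportName.match(/\d+\w+?\d+/)[0];

// The same contrast on quoted text (illustrative input).
const quotedText = '"a" and "b"';
const greedyQuotes = quotedText.match(/".+"/g);
const lazyQuotes = quotedText.match(/".+?"/g);

console.log(greedyResult); // "01_FY_07_report_99"
console.log(lazyResult);   // "01_FY_07"
console.log(greedyQuotes); // ['"a" and "b"']
console.log(lazyQuotes);   // ['"a"', '"b"']
```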
Examples
1. Match self, himself, herself, itself, myself, yourself, thyself
- /\w*self\b/g found 60 matches
2. Match both "virtue" and "virtues"
- /virtues?/ found 16 matches
3. Use quantified repetition to find the word that starts with T and has 12 letters
- /T\w{11}/g found 1 match
4. Match all text inside quotation marks, but nothing that is not inside them
- /".+?"/g found 19 matches
Grouping and Alternation
Meta-character | Meaning |
( | Start grouped expression |
) | End grouped expression |
Grouping meta-characters
- Group portions of the expression
- Apply repetition operations to a group
- Create a group of alternation expressions
- Capture group for use in matching and replacing
- /(abc)+/ matches “abc” and “abcabcabc”
- /(in)?dependent/ matches “independent” and “dependent”
- /run(s)?/ is the same as /runs?/
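Groups serve two purposes: they let a quantifier apply to several characters at once, and they capture text for reuse. The sketch below shows both. The anchors on the first pattern and the replacement format "($1) $2-$3" are assumptions chosen for illustration.

```javascript
// The + applies to the whole group "abc", not only to "c".
const repeatsGroup = /^(abc)+$/.test("abcabcabc");
const brokenRepeat = /^(abc)+$/.test("abcab");

// An optional group: the prefix "in" may or may not be present.
const prefixHits = ["independent", "dependent"].filter((w) =>
  /^(in)?dependent$/.test(w)
);

// Captured groups can be referenced as $1, $2, $3 when replacing.
const reformattedPhone = "555-961-2425".replace(
  /(\d{3})-(\d{3})-(\d{4})/,
  "($1) $2-$3"
);

console.log(repeatsGroup);     // true
console.log(brokenRepeat);     // false
console.log(prefixHits);       // ["independent", "dependent"]
console.log(reformattedPhone); // "(555) 961-2425"
```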
Alternation meta-characters
Meta-character | Meaning |
| | Match previous or next expression |
- | is an OR operator
- Either match the expression on the left or match the expression on the right
- Ordered, the leftmost expression gets precedence
- Group alternation expressions to keep them distinct
- /apple|orange/ matches “apple” and “orange”
- /abc|def|ghi|jkl/ matches “abc”, “def”, “ghi”, and “jkl”
- /apple(juice|sauce)/ is not the same as /applejuice|sauce/
- /w(ei|ie)rd/ matches “weird” and “wierd”
- /(AA|BB|CC){4}/ matches “AABBAACC” and “CCCCBBBB”
Efficiency when using alternation
- Put the simplest (most efficient) expression first
- Instead of /\w+_\d{2,4}|\d{4}_export|export_\d{2}/
- Write /export_\d{2}|\d{4}_export|\w+_\d{2,4}/
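The grouping and ordering rules for alternation can be verified with a few checks. The input strings, the anchors on the repeated group, and the /abc|abcdef/ ordering example are assumptions added for this sketch.

```javascript
const fruitHits = "apple orange pear".match(/apple|orange/g);

// Without a group, | splits the whole pattern in two:
// /applejuice|sauce/ means "applejuice" OR "sauce".
const groupedAcceptsSauce = /apple(juice|sauce)/.test("sauce");
const ungroupedAcceptsSauce = /applejuice|sauce/.test("sauce");

// A quantifier on the group repeats the choice, not a fixed pair.
const fourPairs = /^(AA|BB|CC){4}$/.test("AABBAACC");

// Alternatives are tried left to right; the first one that
// succeeds wins, even if a later one would match more text.
const leftmostWins = "abcdef".match(/abc|abcdef/)[0];

console.log(fruitHits);             // ["apple", "orange"]
console.log(groupedAcceptsSauce);   // false
console.log(ungroupedAcceptsSauce); // true
console.log(fourPairs);             // true
console.log(leftmostWins);          // "abc"
```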
Examples
1. Match "myself", "yourself", "thyself", but not "himself", "herself", "itself"
- /(my|your|thy)self/g found 14 matches
2. Match "good", "goodness", and "goods" without typing "good" more than once
- /good(s|ness)?/g found 23 matches
3. Match "do" or "does" followed by "no", "not" or "nothing", even when it occurs at the start of a sentence
- /[dD]o(es)? (nothing|not|no){1}/g found 35 matches
Anchors
Meta-character | Meaning |
^ | Start of string/line |
$ | End of string/line |
\A | Start of string, never start of line |
\Z | End of string, never end of line |
Start and End anchors
- Reference a position, not an actual character
- Zero-width
- /^apple/ or /\Aapple/
- /apple$/ or /apple\Z/
- /^apple$/ or /\Aapple\Z/
Line breaks and multiline mode
- ^ and $ do not match at line breaks in single-line (default) mode
- ^ and $ will match at start and end of lines in multiline mode
- \A and \Z do not match at line breaks in either single-line or multiline mode
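The effect of multiline mode on ^ and $ can be seen in the sketch below. JavaScript's regex engine does not provide \A or \Z, so only ^ and $ are shown; the two-line input is an assumption made for illustration.

```javascript
// Two lines, each starting with "apple".
const anchorText = "apple pie\napple tart";

// Default mode: ^ matches only at the very start of the string.
const startsDefault = anchorText.match(/^apple/g);

// Multiline mode: ^ also matches right after each line break.
const startsMultiline = anchorText.match(/^apple/gm);

// Default mode: $ matches only at the very end of the string.
const pieAtEndDefault = /pie$/.test(anchorText);
const tartAtEndDefault = /tart$/.test(anchorText);

// Multiline mode: $ also matches just before each line break.
const pieAtEndMultiline = /pie$/m.test(anchorText);

console.log(startsDefault.length);   // 1
console.log(startsMultiline.length); // 2
console.log(pieAtEndDefault);        // false
console.log(tartAtEndDefault);       // true
console.log(pieAtEndMultiline);      // true
```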
Word boundaries
Meta-character | Meaning |
\b | Word boundary (start/end of word) |
\B | not a word boundary |
- Reference a position, not an actual character
- Before the first word character in the string
- After the last word character in the string
- Between a word character and a non-word character
- Word character: [A-Za-z0-9_]
- /\b\w+\b/ finds four matches in “This is a test.”
- /\b\w+\b/ matches all of “abc_123” but only part of “top-notch”
- /\bNew\bYork\b/ does not match “New York”: \b is zero-width and does not consume the space, so use /\bNew\b \bYork\b/
- /\B\w+\B/ finds two matches in “This is a test.” (“hi” and “es”)
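Because \b and \B match positions rather than characters, their behaviour is easiest to confirm by running them. In this sketch the pattern /\bNew\b \bYork\b/, with an explicit space, is an added variant for comparison; the variable names are illustrative.

```javascript
const boundarySentence = "This is a test.";

// Whole words: the final period is not a word character.
const wholeWords = boundarySentence.match(/\b\w+\b/g);

// The underscore is a word character, the hyphen is not.
const splitOnHyphen = "abc_123 top-notch".match(/\b\w+\b/g);

// \b consumes nothing, so it cannot stand in for the space.
const withoutSpace = /\bNew\bYork\b/.test("New York");
const withSpace = /\bNew\b \bYork\b/.test("New York");

// \B demands a non-boundary on both sides, so only the inner
// letters of words with three or more letters can match.
const innerLetters = boundarySentence.match(/\B\w+\B/g);

console.log(wholeWords);    // ["This", "is", "a", "test"]
console.log(splitOnHyphen); // ["abc_123", "top", "notch"]
console.log(withoutSpace);  // false
console.log(withSpace);     // true
console.log(innerLetters);  // ["hi", "es"]
```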
Examples
1. How many paragraphs start with "I" as in "I read"?
- /^I.+/g found 9 matches
2. How many paragraphs end with a question mark?
- /.*\?$/g found 1 match
3. Match all words with exactly 15 letters, including hyphenated words
- /\b[a-zA-Z0-9_-]{15}\b/ found 3 matches
- /\b(\w|-){15}\b/g found 3 matches