Regular Expressions 101

Learn the basics of regular expressions, when to use them and how to use them.

Emil Kirilov - Jan 31, 2022
  • regex
  • regular expressions

Introduction

Regular expressions seem complicated, confusing, even ciphered. They easily become hard to read and can lead to performance issues. Their reputation precedes them as a quirky mix between chaos and maths.

I ask you to put your prejudices aside and let me introduce you to the friendly and powerful regex I know. Let's start with the basics.

What is a regex?

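A regex is a string that describes a pattern of text. In several programming languages, such as JavaScript, Ruby and Perl, you surround the string with forward slashes so it is recognized and treated as a regex. /foo/, /[0-9]+\.[0-9]+/ and /(0[1-9]|1[0-2])\/(0[1-9]|[12][0-9]|3[01])\/\d{4}/ are valid regular expressions. The first matches the sequence of characters 'foo', the second matches decimal numbers, and the third matches dates in the mm/dd/yyyy format. The forward slashes inside the third pattern are escaped with a backslash so they are not mistaken for the closing delimiter.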
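As a minimal sketch, assuming JavaScript (one of the languages that uses slash-delimited regex literals), the three patterns can be tried out like this. The variable names and the sample strings are made up for illustration.

```javascript
// The three patterns from the paragraph above as JavaScript regex literals.
const fooPattern = /foo/;
const decimalPattern = /[0-9]+\.[0-9]+/;
// Forward slashes inside a literal are escaped as \/ so they do not end the pattern.
const slashDatePattern = /(0[1-9]|1[0-2])\/(0[1-9]|[12][0-9]|3[01])\/\d{4}/;

// test() returns true when the pattern is found anywhere in the string.
console.log(fooPattern.test("some food"));                // true
console.log(decimalPattern.test("pi is 3.14"));           // true
console.log(slashDatePattern.test("due on 12/31/2021"));  // true
console.log(slashDatePattern.test("due on 13/31/2021"));  // false, there is no month 13
```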

When to use a regex?

Regular expressions are used when you need to find, replace, split or validate strings. These tasks are so essential that most major programming languages ship with a regex engine, either built into the language or as part of the standard library.
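The four tasks can be sketched in JavaScript roughly as follows. The sample text and the variable names are assumptions chosen for illustration; other languages offer equivalent functions under different names.

```javascript
const orderText = "Order 66 shipped on 2022-01-31";

// Find: the first run of digits.
const firstNumber = orderText.match(/\d+/)[0];
console.log(firstNumber); // "66"

// Replace: mask every digit (the g flag means "all matches", not just the first).
const maskedText = orderText.replace(/\d/g, "#");
console.log(maskedText); // "Order ## shipped on ####-##-##"

// Split: break the text on runs of whitespace.
const orderWords = orderText.split(/\s+/);
console.log(orderWords); // ["Order", "66", "shipped", "on", "2022-01-31"]

// Validate: the whole string must consist of digits only.
const isNumeric = /^\d+$/.test("12345");
console.log(isNumeric); // true
```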

How to use a regex?

The capabilities of regular expressions are vast and we only go through the basics in this article. You can find more sophisticated use-cases and examples {{here}}.

Most characters, like letters and numbers, match themselves. Creating a regex composed of only alphanumerical characters will be no different than looking for a substring - a function present in the standard library of modern languages. In order to unlock the full potential of regular expressions, we add a little bit of special syntax in the form of quantifiers, tokens, anchors, and groups.

Quantifiers

Quantifiers are symbols that indicate how many times the preceding character (or group) should be matched.

Examples: ?, *, +, ...

Cheatsheet

a? - matches one or zero of a
a* - matches zero or more of a
a+ - matches one or more of a
a{1} - matches exactly one of a
a{1,3} - matches one, two, or three of a
a{1,} - matches at least one of a

Examples

Let's take a test sentence, "Son, it's too soon for you to drink alcohol", and set the goal of matching both son and soon. Because Son is capitalized in the sentence, each pattern carries the case-insensitive flag i after the closing slash.

We can write the following regular expressions that will do the trick:

  • /so{1,2}n/i - there are either 1 or 2 "o"s
  • /soo?n/i - the second "o" is optional
  • /soo{0,1}n/i - the second "o" appears zero or one time, an alternative spelling of ?
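A quick way to check these patterns is to run them against the test sentence. The sketch below assumes JavaScript and two of its flags: g (return all matches) and i (ignore case). The i flag matters here because the sentence starts with a capitalized "Son".

```javascript
const sentence = "Son, it's too soon for you to drink alcohol";

// Case-sensitive: the capitalized "Son" is skipped.
const caseSensitive = sentence.match(/so{1,2}n/g);
console.log(caseSensitive); // ["soon"]

// Case-insensitive: all three spellings of the pattern find both words.
const withRange = sentence.match(/so{1,2}n/gi);
const withOptional = sentence.match(/soo?n/gi);
const withZeroOrOne = sentence.match(/soo{0,1}n/gi);
console.log(withRange);     // ["Son", "soon"]
console.log(withOptional);  // ["Son", "soon"]
console.log(withZeroOrOne); // ["Son", "soon"]
```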

Tokens

Tokens generally start with a backslash \ and represent a single character or a class of characters. The backslash also provides a way to match the special characters that are part of the regex syntax.

Examples: \?, \+, \s, \d, \D, \n, \t, ...

Cheatsheet

\? - matches the character ?
\+ - matches the character +
\s - matches any whitespace character (space, tab, newline)
\d - matches any digit
\D - matches any non-digit; the opposite of \d
\n - matches the newline character
\t - matches the tab character
. - matches any character (by default, any character except a newline)
[abc] - matches a single character: a, b or c
[a-z] - matches a single character in the range from a to z
[0-9] - matches any digit

Examples

  • /\?{3}/ - matches 3 consecutive question marks
  • /\s+/ - matches one or more whitespace characters
      ◦ useful for splitting user input, like names
  • /\d{4}/ - matches 4 consecutive digits
      ◦ think of years or four-digit postal codes
  • /fi.e/ - matches fire, file, five, because the 3rd character can be anything
  • /[A-Z][a-z]+/ - matches words that begin with an uppercase letter and have at least 2 letters
  • /[0-5][0-9]/ - matches all two-digit numbers that represent seconds
      ◦ imagine a digital clock that goes from 00 to 59
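The token examples above can be exercised with a few lines of JavaScript. This is only a sketch; every sample string and variable name is invented for illustration.

```javascript
// Three literal question marks in a row.
const hasTripleQuestion = /\?{3}/.test("You did what???");
console.log(hasTripleQuestion); // true

// Split a name on any run of whitespace.
const nameParts = "Ada   Lovelace".split(/\s+/);
console.log(nameParts); // ["Ada", "Lovelace"]

// The first run of four digits.
const releaseYear = "Released in 2022".match(/\d{4}/)[0];
console.log(releaseYear); // "2022"

// The dot stands for any single character.
const fiWords = "fire file five".match(/fi.e/g);
console.log(fiWords); // ["fire", "file", "five"]

// An uppercase letter followed by one or more lowercase letters.
const capitalizedWords = "Alice met Bob in Paris".match(/[A-Z][a-z]+/g);
console.log(capitalizedWords); // ["Alice", "Bob", "Paris"]

// Two digits where the first one is between 0 and 5.
console.log(/[0-5][0-9]/.test("59")); // true
console.log(/[0-5][0-9]/.test("60")); // false
```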

Anchors

Anchors match positions rather than characters: the start and the end of a string, and word boundaries.

Examples: ^, $, \b, \B

Cheatsheet

^ - start of string
$ - end of string
\b - word boundary; a position between a word character (letter, digit or underscore) and a non-word character, or between a word character and the start or end of the string
\B - non-word boundary; a position between two word characters or two non-word characters

Examples

  • /^Regex/ - matches strings that begin with Regex
  • /regex\.$/ - matches strings that end with "regex." (including the dot)
  • /^Some\b/ - matches strings that begin with the word Some
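Anchors are easiest to understand by comparing a string that matches with one that does not. The JavaScript sketch below uses ^ for the start of the string, $ for the end, and \b for a word boundary; in the second pattern the dot is escaped so that it matches a literal dot. The sample strings are assumptions.

```javascript
// ^ ties the match to the start of the string.
console.log(/^Regex/.test("Regex is fun"));  // true
console.log(/^Regex/.test("I like Regex"));  // false

// $ ties the match to the end of the string; \. is a literal dot.
console.log(/regex\.$/.test("I like regex."));  // true
console.log(/regex\.$/.test("regex. is fun"));  // false

// \b requires the word to end right after "Some".
console.log(/^Some\b/.test("Some people"));  // true
console.log(/^Some\b/.test("Something"));    // false
```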

Groups

Groups, or capturing groups, are a way to treat multiple characters as a single unit. This way you can isolate regex logic inside a group, or name a group so you can inspect what text it has matched.

Cheatsheet

({regex}) - a group is indicated with parentheses around your custom {regex}
(?<{name}>{regex}) - you name a group by writing a question mark right after the opening parenthesis and surrounding the {name} with angle brackets (the less-than and greater-than signs)

Examples

  • /(\d+)/ - captures numbers, that is, runs of one or more digits
  • /(?<long_spaces>\s{2,})/ - captures occurrences of two, or more, consecutive whitespace characters
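To see what a group has captured, you inspect the match result. The sketch below assumes JavaScript, where numbered groups are available by index and named groups on the groups property of a match; the sample strings are invented.

```javascript
const roomText = "Room 42, floor 7";

// Without the g flag, match() returns the first match together with its groups.
const firstNumberMatch = roomText.match(/(\d+)/);
console.log(firstNumberMatch[1]); // "42" (index 0 is the whole match, index 1 is group 1)

// With the g flag, matchAll() yields every match; m[1] is group 1 of each one.
const allNumbers = [...roomText.matchAll(/(\d+)/g)].map((m) => m[1]);
console.log(allNumbers); // ["42", "7"]

// A named group is read by its name instead of its index.
const spacesMatch = "too  many   spaces".match(/(?<long_spaces>\s{2,})/);
console.log(spacesMatch.groups.long_spaces.length); // 2, the first run of two spaces
```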

Extras

There are logical operators in regex too! You can negate and use or, just like in your favorite programming language.

Cheatsheet

^ - negation, when it is the first character inside square brackets
| - or

Examples

  • /[^a-z]/ - matches any character which is not a lowercase letter from the English alphabet
  • /al|eal/ - matches either the 2 characters al, or the 3 characters eal
  • /(al|ea)l/ - matches either the word all, or the characters eal
      ◦ note that using a group isolates the or logic inside of it
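The difference between a bare or and a grouped or shows up as soon as you test both on the same input. A small JavaScript sketch, using the patterns exactly as written above and made-up input strings:

```javascript
// [^a-z] keeps everything that is not a lowercase English letter.
const nonLowercase = "abcXYZ123".match(/[^a-z]/g);
console.log(nonLowercase); // ["X", "Y", "Z", "1", "2", "3"]

// Without a group, | splits the whole pattern: "al" OR "eal".
console.log(/al|eal/.test("al"));     // true

// With a group, | only applies inside the parentheses:
// ("al" OR "ea") followed by "l".
console.log(/(al|ea)l/.test("al"));   // false, the trailing l is missing
console.log(/(al|ea)l/.test("all"));  // true
console.log(/(al|ea)l/.test("eal"));  // true
```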

Bringing the pieces together

Regular expressions need some time to sink in. Piling up more syntax will only confuse you so let us practice instead.

Email validation

Basic

/[^@]+@[^.]+\.[a-z]{2,}/

This should be one of the simplest regular expressions that validate emails. It can be broken down into 5 parts:

  • [^@]+ - matches one or more characters that are not the character @
  • @ - matches the character @
  • [^.]+ - matches one or more characters that are not the character .
      ◦ inside square brackets the dot is literal, so it needs no escaping; [^\.]+ is equivalent
  • \. - matches the character .
      ◦ we escape the dot here because an unescaped dot means any character!
  • [a-z]{2,} - matches two or more lowercase letters
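Here is how the basic pattern behaves on a few inputs, sketched in JavaScript. The anchored variant at the end is an assumption added for illustration, not part of the original pattern: it uses ^ and $ so that the whole string has to be the email.

```javascript
const basicEmail = /[^@]+@[^.]+\.[a-z]{2,}/;

console.log(basicEmail.test("john.doe@test.com"));             // true
console.log(basicEmail.test("emil.kirilov@lexis.solutions"));  // true
console.log(basicEmail.test("john.doe.test.com"));             // false, no @
console.log(basicEmail.test("john@test"));                     // false, no dot after the @

// Unanchored, the pattern also accepts an email buried in other text.
console.log(basicEmail.test("write to john@test.com today"));  // true

// Assumption: for stricter validation, anchor the pattern to the whole string.
const anchoredEmail = /^[^@]+@[^.]+\.[a-z]{2,}$/;
console.log(anchoredEmail.test("john.doe@test.com"));             // true
console.log(anchoredEmail.test("write to john@test.com today"));  // false
```

Whether to anchor depends on the task: leave the pattern unanchored to find emails inside a text, anchor it to validate a single input field.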

With groups

/(?<username>[^@]+)@(?<mail_server>[^\.]+)\.(?<domain>[a-z]{2,})/

Adding groups allows us to match a test string and see what we capture in each of them.

john.doe@test.com

Groups:
  • username: john.doe
  • mail_server: test
  • domain: com

emil.kirilov@lexis.solutions

Groups:
  • username: emil.kirilov
  • mail_server: lexis
  • domain: solutions
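The captures listed above can be reproduced with a short JavaScript sketch. The helper name parseEmail is hypothetical; the pattern is the one from this section.

```javascript
const emailWithGroups = /(?<username>[^@]+)@(?<mail_server>[^\.]+)\.(?<domain>[a-z]{2,})/;

// Hypothetical helper: returns the named groups, or null when nothing matches.
function parseEmail(address) {
  const match = address.match(emailWithGroups);
  return match ? { ...match.groups } : null;
}

console.log(parseEmail("john.doe@test.com"));
// { username: "john.doe", mail_server: "test", domain: "com" }
console.log(parseEmail("emil.kirilov@lexis.solutions"));
// { username: "emil.kirilov", mail_server: "lexis", domain: "solutions" }
console.log(parseEmail("not an email"));
// null
```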

Date validation

We will use the dd.mm.yyyy format in this example.

Naive with groups

/(?<day>\d{2})\.(?<month>\d{2})\.(?<year>\d{4})/

27.12.2021

Groups:
  • Day: 27
  • Month: 12
  • Year: 2021

33.13.2021

Groups:
  • Day: 33
  • Month: 13
  • Year: 2021

This regex is naive because it matches too optimistically. We could provide laughable input and it would find it acceptable. Let's fix it!
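The over-optimistic behaviour is easy to reproduce. A minimal JavaScript sketch with the naive pattern; the variable names are assumptions:

```javascript
const naiveDate = /(?<day>\d{2})\.(?<month>\d{2})\.(?<year>\d{4})/;

const goodDate = "27.12.2021".match(naiveDate).groups;
console.log(goodDate.day, goodDate.month, goodDate.year); // 27 12 2021

// The impossible date matches just as happily.
const badDate = "33.13.2021".match(naiveDate).groups;
console.log(badDate.day, badDate.month, badDate.year);    // 33 13 2021
```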

Perfect enough, with groups

/(?<day>0[1-9]|[12][0-9]|30|31)\.(?<month>0[1-9]|1[0-2])\.(?<year>[12][0-9]{3})/

Breakdown

Group day:
  • the minimum day is 01, so days starting with 0 can't end with 0
  • all 10s and 20s are OK
  • 30 and 31 are both possible

Group month:
  • same logic as with day: 01 to 09, followed by 10, 11 and 12

Group year:
  • only accepts 1xxx and 2xxx years

Examples

27.12.2021

Groups:
  • Day: 27
  • Month: 12
  • Year: 2021

33.13.2021 - no match
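The stricter pattern can be checked the same way. In this JavaScript sketch the month group is written as 0[1-9]|1[0-2] so that October (10) is accepted along with 11 and 12; the helper name parseDate is hypothetical.

```javascript
const strictDate = /(?<day>0[1-9]|[12][0-9]|30|31)\.(?<month>0[1-9]|1[0-2])\.(?<year>[12][0-9]{3})/;

// Hypothetical helper: returns the named groups, or null when nothing matches.
function parseDate(input) {
  const match = input.match(strictDate);
  return match ? { ...match.groups } : null;
}

console.log(parseDate("27.12.2021")); // { day: "27", month: "12", year: "2021" }
console.log(parseDate("15.10.2021")); // { day: "15", month: "10", year: "2021" }
console.log(parseDate("33.13.2021")); // null

// Still matches, although February never has 31 days.
console.log(parseDate("31.02.2021")); // { day: "31", month: "02", year: "2021" }
```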

Considerations

I labeled this regex as 'perfect enough' because not all months have 31 days, or 30 days, or 29 days for that matter. We could, of course, write a monster regex and account for each month's possible day count, but we won't. It would become unreadable and still fall short on leap years.

Be careful not to misuse regular expressions. Our date regex works well if you need to extract dates from a long text. Perhaps your task is to find the min/max date? Do that with a library or a built-in date parser. Don't compare the year, month and day named groups of each match by hand.
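One way to split the work, sketched in JavaScript: the regex only extracts the dates, and the built-in Date type does the comparing. The sample text and the variable names are assumptions, and the month group is written as 0[1-9]|1[0-2] so that October is accepted.

```javascript
// The g flag is required by matchAll(), which returns every match in the text.
const dateFinder = /(?<day>0[1-9]|[12][0-9]|30|31)\.(?<month>0[1-9]|1[0-2])\.(?<year>[12][0-9]{3})/g;

const invoiceText = "Invoices were issued on 27.12.2021, 03.01.2022 and 15.10.2021.";

// Extract with the regex, then turn each match into a timestamp (UTC midnight).
const timestamps = [...invoiceText.matchAll(dateFinder)].map((m) => {
  const { day, month, year } = m.groups;
  return Date.UTC(Number(year), Number(month) - 1, Number(day)); // months are 0-based
});

// Compare plain numbers instead of juggling the captured groups.
const earliest = new Date(Math.min(...timestamps));
const latest = new Date(Math.max(...timestamps));

console.log(earliest.toISOString().slice(0, 10)); // "2021-10-15"
console.log(latest.toISOString().slice(0, 10));   // "2022-01-03"
```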

End thoughts

Regular expressions are indispensable. They are a specific tool that you won't need every day but can save you days when you correctly recognize a use case. 

I hope that by the end of this article you have a basic understanding of, and appreciation for, the power of regex, and that you will perhaps revisit it whenever you need a cheat sheet for building regular expressions.

Lexis Solutions is a software agency in Sofia, Bulgaria. We are a team of young professionals improving the digital world, one project at a time.

© 2022 Lexis Solutions. All rights reserved.