Regular Expression: Case Studies For Full-Stack Developers


Most developers do not write regular expressions every day. Still, they are a powerful tool for fixing file syntax, web scraping, and validating dates, e-mail addresses or other personal data.

In this article, I present real-life case studies of regular expression usage from my career. If you run into problems like those listed above, these examples should help you solve them.

A regular expression (aka regexp or regex) is a text pattern that developers mostly use in string-search algorithms to find and replace substrings in text fragments, text files or even whole directories of files.

One of the strongest advantages of regex is its portability. A regex syntax is standardized by POSIX and commonly used in terminal commands of UNIX-based operating systems, such as “grep”. However, most programming languages follow the Perl-style syntax, best known from PCRE (Perl-Compatible Regular Expressions), a regex library written in C. This flavor became the most popular one thanks to features such as minimal (“lazy”) matching, multiline matching and look-ahead/look-behind assertions.

Regex 101

To make the case studies below easier to follow, I will start with the basics of regular expressions. A regex contains special characters used to match specific patterns in text.

  • \d – a single digit
  • \D – the negation of \d (any character that is not a digit)
  • \w – a word character: a letter, a digit or an underscore (whether non-ASCII letters count depends on the engine)
  • \W – the negation of \w
  • \s – any whitespace character (including line breaks)
  • \S – the negation of \s
  • . – any character (including whitespace, but by default not a line break)
  • a – specifically the letter “a”
  • [A-Z] – any character between A and Z in the ASCII table, inclusive (a range works only inside square brackets)
  • a|b – the letter “a” or the letter “b”

We can also use “\” to escape any special character used in regex, and combine these characters with parentheses and square brackets:

  • […] – matches a single character out of those listed inside the square brackets,
  • (…) – groups a part of the pattern and creates a capturing group, which can be used in the replacement phase.

To match multiple characters or groups, we can append these quantifiers:

  • + – one or more repetitions of the expression before +
  • * – zero or more repetitions of the expression before *
  • {1,3} – 1, 2 or 3 repetitions of the expression before the braces
  • ? – zero or one occurrence of the expression before ?; placed after another quantifier (as in +? or *?), it makes the match lazy, i.e. as short as possible

For example, [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+ matches any Polish word in text.
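The building blocks above can be tried out in a few lines. This is a minimal sketch using Python's built-in `re` module, whose syntax is Perl-like; the sample strings are made up for illustration, and the Polish character class lists all nine diacritic letters in both cases.

```python
import re

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", "Bus 116 leaves at 7:45")
print(digits)  # ['116', '7', '45']

# [A-Z] : a single uppercase ASCII letter (the range lives inside brackets)
capitals = re.findall(r"[A-Z]", "Nice Car")
print(capitals)  # ['N', 'C']

# {1,3} : between one and three repetitions, as many as possible
runs = re.findall(r"a{1,3}", "caaaaar")
print(runs)  # ['aaa', 'aa']

# Greedy vs. lazy: a ? after + makes the match as short as possible
greedy = re.search(r"<.+>", "<b>bold</b>").group()
lazy = re.search(r"<.+?>", "<b>bold</b>").group()
print(greedy)  # <b>bold</b>
print(lazy)    # <b>

# A character class with the Polish diacritic letters, repeated with +
polish_word = re.compile(r"[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+")
words = polish_word.findall("Żółw idzie do Łodzi")
print(words)  # ['Żółw', 'idzie', 'do', 'Łodzi']
```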

Capturing groups are the core of the case studies below. They are identified by indexes counting from 1, for example $1 and $2, while $& represents the whole match. We can define a non-capturing group using the (?: … ) syntax. I will use this in the second case study.

Given the string “Nice car!”, we can match the whole string and split it into two capturing groups with the regex:

([A-Z][a-z]+) ([a-z]+?)!

Then we can replace or extend the groups, creating a different string. For the replacement pattern “$1 bike, who needs a $2?” the result is “Nice bike, who needs a car?”.
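The substitution can be reproduced with a short sketch in Python. Two things differ from the notation used in this article and are worth stating: Python writes group references as `\1` and `\2` rather than `$1` and `$2`, and the second group is written here as `[a-z]+?` so that it matches the lowercase word “car”.

```python
import re

pattern = re.compile(r"([A-Z][a-z]+) ([a-z]+?)!")

# Replace using the two capturing groups (\1 and \2 in Python's syntax)
result = pattern.sub(r"\1 bike, who needs a \2?", "Nice car!")
print(result)  # Nice bike, who needs a car?

# group(0) is the whole match, the counterpart of $&
match = pattern.search("Nice car!")
whole = match.group(0)
groups = match.groups()
print(whole)   # Nice car!
print(groups)  # ('Nice', 'car')

# A non-capturing group (?: ... ) groups without taking an index,
# so the only numbered group left is the second word
only_second = re.search(r"(?:[A-Z][a-z]+) ([a-z]+)!", "Nice car!").groups()
print(only_second)  # ('car',)
```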


With the basics covered, we can move on to the actual, more sophisticated case studies. You can try all the examples at regexr.com.

The most common use of regular expressions in frontend development is matching date, e-mail and telephone number patterns. It is also useful for scraping data from webpages with a script or bot.

Background of problem

In this case study, we have a custom browser that presents data in tabular form and lets the user filter the table’s columns by date. We want to check on the frontend side whether a date entered as a string complies with the DMY date and time format (DD-MM-YYYY hh:mi:ss) before putting it into the HTTP request. There is no point in pushing the validation of an obviously incorrect date to the server side.

Solution

A date regex can be written in many ways, some of which even account for leap years in the Gregorian calendar. In this article, I will simplify it to matching numbers delimited by dashes and a time delimited by colons.

In that case, the regex will be:

[0-9]{1,2}\-[0-9]{1,2}\-[0-9]{1,4} [0-9]{1,2}\:[0-9]{1,2}\:[0-9]{1,2}
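As a rough illustration, the simplified pattern can be wrapped in a small check. This is a sketch in Python; the function name `is_complete_date` is hypothetical, the year quantifier is written as `{1,4}`, and the redundant backslashes before the dashes and colons are left out. Note that only the shape of the input is validated, not whether the date exists.

```python
import re

# Simplified DMY date and time pattern: digits, dashes, a space, colons
DATE_TIME = re.compile(
    r"[0-9]{1,2}-[0-9]{1,2}-[0-9]{1,4} [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}"
)

def is_complete_date(text: str) -> bool:
    """Return True when the whole input has the DD-MM-YYYY hh:mi:ss shape."""
    # fullmatch requires the pattern to cover the entire string
    return DATE_TIME.fullmatch(text) is not None

print(is_complete_date("11-11-2023 12:00:00"))  # True
print(is_complete_date("11-11-2023"))           # False, the time is missing
print(is_complete_date("99-99-2023 12:00:00"))  # True, only the shape is checked
```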

The problem might seem trivial. However, it gets more complicated when we want to validate the input in real time, that is, every time the user types the next character. We have to accept an incompletely typed date, yet raise an error on the frontend side as soon as an illegal character or a three-digit number appears.

The challenge is to match not only the full DMY date and time pattern, but also every intermediate step of typing it. In other words, for an example date such as November 11th 2023 at noon, we want to accept “11”, “11-”, “11-11-”, “11-11-2023”, “11-11-2023 12:” and so on.

I encountered this problem in one of my projects and solved it by using the end-of-string anchor ($). The whole regex is presented below.

(((\d{1,2}\-){2}\d{1,4}) (\d{1,2}\:){2}\d{1,2}$)|((\d{1,2}\-){2}\d{1,4} (\d{1,2}\:){1,2}\d{0,2}$)|((\d{1,2}\-){2}\d{1,4} (\d{1,2}){1,2}$)|((\d{1,2}\-){2}\d{1,4} {0,1}$)|((\d{1,2}\-){1,2}\d{0,2}$)|(\d{1,2}$)
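To show how such a pattern behaves while the user is typing, here is a minimal sketch in Python. The function name `is_valid_so_far` is hypothetical, the alternatives are joined with a single `|`, and the use of `fullmatch` (which also anchors the pattern at the start of the input) is my assumption rather than part of the original frontend code.

```python
import re

# One alternative per typing stage, from the complete date down to the first digits
PARTIAL_DATE = re.compile(
    r"(((\d{1,2}\-){2}\d{1,4}) (\d{1,2}\:){2}\d{1,2}$)"
    r"|((\d{1,2}\-){2}\d{1,4} (\d{1,2}\:){1,2}\d{0,2}$)"
    r"|((\d{1,2}\-){2}\d{1,4} (\d{1,2}){1,2}$)"
    r"|((\d{1,2}\-){2}\d{1,4} {0,1}$)"
    r"|((\d{1,2}\-){1,2}\d{0,2}$)"
    r"|(\d{1,2}$)"
)

def is_valid_so_far(typed: str) -> bool:
    """Return True when the input could still become a DMY date and time."""
    return PARTIAL_DATE.fullmatch(typed) is not None

# Simulate typing the example date one character at a time
full = "11-11-2023 12:00:00"
prefixes = [full[:i] for i in range(1, len(full) + 1)]
print(all(is_valid_so_far(p) for p in prefixes))  # True

print(is_valid_so_far("123"))    # False, three digits in the day position
print(is_valid_so_far("11-1a"))  # False, illegal character
```

A key-press handler would call such a check on the current field value and show the error as soon as it returns False.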

Regexr.com also allows creating unit tests for regular expressions. In my tests, all the combinations are covered, so no false-positive validation error will be raised.

Let’s look at the backend side: a problem I encountered a couple of years ago at one of my previous companies, related to a public transit system. Long story short, I had a batch program that downloaded a ZIP archive containing XSD and XML files from the client’s server and translated it to CSV. Another batch program used that file to produce the resulting SQLite database, which was then distributed to multiple cash machines across the country. However, the client, who produced the XML files with all the timetables, kept sending erroneous files, causing XML unmarshalling to fail. Because the importer batch ran every single day, fixing the files manually each time was not sustainable.

Solution

The source XML timetable file is below. I anonymized and simplified the file for a better understanding of the problem.

<timetable>
<bus>
<destination>Wil
anów
</destination>
<stations>5</stations>
</bus>
<bus>
<destination>Urs
ynów
</destination>
<stations>6</stations>
</bus>
<bus>
<destination>Och
ota
</destination>
<stations>7</stations>
</bus>
</timetable>

The archive also included an XSD schema, which expected a slightly different structure than the given XML. Instead of “stations” the tag name was “stationsNum”, and the content of the “destination” tag had to be on a single line.

Knowing the XML structure, we can fix the faulty fragments using regex substitution. The first substitution renames the “stations” tag. Search regex:

(\<stations)(\>)(\d+)(\<\/stations)(\>)

Replacement regex: 

$1Num$2$3$4Num$5
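The tag rename can be checked with a short sketch in Python. The pattern is the one from the article; the replacement uses Python's `\1` notation instead of `$1`, and the one-line sample input is made up for the demonstration.

```python
import re

# Five groups: opening tag name, ">", the number, closing tag name, ">"
STATIONS = re.compile(r"(\<stations)(\>)(\d+)(\<\/stations)(\>)")

xml = "<bus><stations>5</stations></bus>"

# Insert "Num" after the tag name in both the opening and the closing tag
fixed = STATIONS.sub(r"\1Num\2\3\4Num\5", xml)
print(fixed)  # <bus><stationsNum>5</stationsNum></bus>
```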

The second substitution puts the content of the “destination” tag on a single line. Search regex:

(\<destination\>\s*.+?)(?:\s*)(.*?)(?:\s*)(.*?\<\/destination\>)

Replacement regex:

$1$2$3
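Here is a hedged sketch of the same substitution in Python, again with `\1\2\3` standing in for `$1$2$3`. It relies on the default behaviour in which `.` does not match a line break, so the line breaks can only be consumed by the non-capturing `(?:\s*)` groups and are dropped from the replacement. The sample strings are taken from the timetable above.

```python
import re

DESTINATION = re.compile(
    r"(\<destination\>\s*.+?)(?:\s*)(.*?)(?:\s*)(.*?\<\/destination\>)"
)

broken = "<destination>Wil\nanów\n</destination>"
joined = DESTINATION.sub(r"\1\2\3", broken)
print(joined)  # <destination>Wilanów</destination>

# A destination that is already on one line is left unchanged
already_fine = "<destination>Ochota</destination>"
unchanged = DESTINATION.sub(r"\1\2\3", already_fine)
print(unchanged)  # <destination>Ochota</destination>
```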


The result after these two regex substitutions is below.

<timetable>
<bus>
<destination>Wilanów</destination>
<stationsNum>5</stationsNum>
</bus>
<bus>
<destination>Ursynów</destination>
<stationsNum>6</stationsNum>
</bus>
<bus>
<destination>Ochota</destination>
<stationsNum>7</stationsNum>
</bus>
</timetable>

The solution was simple and effective. I performed the regular expression substitution directly on the ZIP file input stream, so there was no need to extract the archive. That small change did not prolong the timetable extraction process and was virtually invisible to any third party. To my knowledge, this solution still works today without any changes.
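The idea of repairing the XML without unpacking the archive to disk can be sketched as follows. This is an illustration in Python, not the original batch program: the function name `repaired_xml_members`, the file names and the UTF-8 encoding are assumptions, and each archive member is read fully into memory, which may differ from how the original stream processing worked.

```python
import io
import re
import zipfile

# The two substitutions from this case study, in Python's \1-style syntax
FIXES = [
    (re.compile(r"(\<stations)(\>)(\d+)(\<\/stations)(\>)"),
     r"\1Num\2\3\4Num\5"),
    (re.compile(r"(\<destination\>\s*.+?)(?:\s*)(.*?)(?:\s*)(.*?\<\/destination\>)"),
     r"\1\2\3"),
]

def repaired_xml_members(zip_bytes: bytes, encoding: str = "utf-8") -> dict:
    """Return {member name: repaired XML text} without extracting to disk."""
    repaired = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip the XSD schema and any other files
            text = archive.read(name).decode(encoding)
            for pattern, replacement in FIXES:
                text = pattern.sub(replacement, text)
            repaired[name] = text
    return repaired

# Build a tiny archive in memory to stand in for the downloaded ZIP
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "timetable.xml",
        "<bus>\n<destination>Och\nota\n</destination>\n<stations>7</stations>\n</bus>",
    )
    archive.writestr("timetable.xsd", "<schema/>")

result = repaired_xml_members(buffer.getvalue())
print(result["timetable.xml"])
```

The repaired text can then be handed to the XML unmarshaller in place of the raw archive entry.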


