A Programmer’s Guide to Regex or Regular Expressions | Hacker Noon

June 8th 2020


Everybody talks about regular expressions, and everyone hates them, yet everyone ends up using them!

So what is a regular expression? Do we need to go deeper? Yes, so let's dive into the building blocks of regex, starting with a short intro.

Regular expression:

A regular expression (or rational expression) is an object that describes a pattern of characters. It lets us search for specific patterns in text, and it also helps us match, locate, and manage text. Regular expressions look pretty complicated, but they are very powerful: you can create a regex for almost any pattern of text you can think of.

Building blocks of regular expressions:

Metacharacters are the building blocks of regular expressions. Characters in regex are understood to be either a metacharacter with a special meaning or a regular character with a literal meaning.

Reserved metacharacters:

Metacharacters that are reserved and need to be escaped when you want to match them literally:

.[{()\^$|?*+

We will see an example of escaping later.

Other common meta characters are:

Caret (^):

Matches the start of the string, and in multiline mode also matches immediately after each newline.

example:

^\d{3} will match "456" in "456-112-112".

Dollar ($):

Matches the end of the string or just before the newline at the end of the string, and in multiline mode also matches before a newline.

example:

\d{3}$ will match the last "112" in "456-112-112".
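To see both anchors in action, here is a minimal sketch assuming Python's built-in re module. The two-line sample string is made up for illustration.

```python
import re

text = "456-112-112"

# ^ anchors the pattern to the start of the string.
start = re.search(r"^\d{3}", text)
# $ anchors the pattern to the end of the string.
end = re.search(r"\d{3}$", text)
print(start.group())  # 456
print(end.group())    # 112

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
lines = "456-112\n789-000"  # hypothetical two-line sample
line_starts = re.findall(r"^\d{3}", lines, flags=re.MULTILINE)
line_ends = re.findall(r"\d{3}$", lines, flags=re.MULTILINE)
print(line_starts)  # ['456', '789']
print(line_ends)    # ['112', '000']
```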

\d:

\d matches a single digit (0-9). The number of \d tokens determines how many digits the regex matches, i.e. \d means a single digit, \d\d means two digits, and so on.

example:

\d\d\d will match 327, 123, 787 but not 1223, as there are 4 digits in "1223" and our regex matches exactly 3 digits.

\d\d\d ≠ 473847, as the regex covers 3 digits but 473847 contains 6 digits. (A search would still find the first three digits, 473, inside the longer number; the whole string is not a match.)

\d\d\d ≠ cat, because it will match only digits but cat contains letters.

\D:

The reverse of \d. Matches any character except a digit.

example:

\D\D ≠ 12, as it won't match numeric characters.
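The digit examples above can be checked with a short sketch, assuming Python's re module. It uses fullmatch, which requires the whole string to fit the pattern, and search, which looks for the pattern anywhere inside the string.

```python
import re

three_digits = re.compile(r"\d\d\d")

# fullmatch succeeds only if the entire string is exactly three digits.
results = {s: bool(three_digits.fullmatch(s)) for s in ["327", "123", "1223", "cat"]}
print(results)  # {'327': True, '123': True, '1223': False, 'cat': False}

# search finds the first three-digit run inside a longer number.
first_three = three_digits.search("473847").group()
print(first_three)  # 473

# \D is the opposite of \d: it refuses digits and accepts everything else.
two_non_digits = re.compile(r"\D\D")
print(bool(two_non_digits.fullmatch("12")))  # False
print(bool(two_non_digits.fullmatch("ab")))  # True
```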

\w:

Matches any alphanumeric (word) character: letters, digits, and the underscore.

example:

\w\w\w = 467
\w\w\w\w = Crow
\w\w\w ≠ python

\w\w\w doesn't match python as a whole because python contains 6 characters.

\W:

Just as \D is the reverse of \d, \W is the reverse of \w, i.e. it matches anything but alphanumeric characters.

example:

\W\W = ,, or !! or @#
\W\W\W\W ≠ Titanic2

as every character in Titanic2 is alphanumeric.
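Here is a small sketch of the word-character examples, again assuming Python's re module. The helper is_full_match is a name made up for this illustration.

```python
import re

def is_full_match(pattern, text):
    """Return True when the whole text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

print(is_full_match(r"\w\w\w", "467"))     # True
print(is_full_match(r"\w\w\w\w", "Crow"))  # True
print(is_full_match(r"\w\w\w", "python"))  # False: six characters, not three
print(is_full_match(r"\w", "_"))           # True: underscore is a word character

print(is_full_match(r"\W\W", "@#"))        # True
# No character in "Titanic2" is a non-word character.
non_word = re.findall(r"\W", "Titanic2")
print(non_word)  # []
```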

\s:

Matches any whitespace character, such as a space or a tab.

For example, in a text made of two words separated by a space, the regex \s will match only the space between the two words and ignore everything else.

\S:

The opposite of \s: matches any non-whitespace character.
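A quick sketch of \s and \S, assuming Python's re module. The sample text is made up for illustration.

```python
import re

sample = "hello world\tagain"  # hypothetical text with a space and a tab

# \s picks out only the whitespace characters.
whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t']

# \S+ picks out each run of non-whitespace characters, i.e. the words.
words = re.findall(r"\S+", sample)
print(words)  # ['hello', 'world', 'again']
```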
Repeaters (*, + and { }):

*, + and { } are called repeaters, as they specify how many times the preceding character may be repeated.

Asterisk symbol (*):

The asterisk matches when the character preceding * occurs 0 or more times, i.e. it tells the regex engine to match the preceding character (or set of characters) 0 or more times, with no upper limit.

example:

Gre*n = Green (e is found 2 times), Grn (e is found 0 times), Greeeeen (e is found 5 times), and so on.

tre* ≠ trees, as there is an "s" following the "ee".

Plus symbol (+):

The + sign matches when the character preceding + occurs 1 or more times, with no upper limit.

example:

Gre+n = Green, Greeeen, Gren, and so on.

Gre+n ≠ Grn, as "e" is absent here.
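The difference between * and + shows up only when the repeated character is missing. Below is a minimal sketch, assuming Python's re module and whole-string matching.

```python
import re

words = ["Grn", "Gren", "Green", "Greeeeen"]

# * allows zero or more "e", so "Grn" is accepted.
star_matches = [w for w in words if re.fullmatch(r"Gre*n", w)]
print(star_matches)  # ['Grn', 'Gren', 'Green', 'Greeeeen']

# + needs at least one "e", so "Grn" is rejected.
plus_matches = [w for w in words if re.fullmatch(r"Gre+n", w)]
print(plus_matches)  # ['Gren', 'Green', 'Greeeeen']

# tre* does not cover the trailing "s" in "trees"...
print(bool(re.fullmatch(r"tre*", "trees")))  # False
# ...but it does match the leading part of the word.
prefix = re.match(r"tre*", "trees").group()
print(prefix)  # tree
```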

Dot (.):

The period matches any single character (letter, digit, or symbol) except, by default, a newline. Because it can take the place of any other character, it is called the wildcard.

example:

Gre. = Gree, Gren, Gre1, and so on.

Gre. ≠ Green, as . by itself matches only a single character, here in the 4th position of the term. n is the 5th character and is not accounted for in the regex.

But Gre.* will match Green, as it tells the engine to match any character any number of times.
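The wildcard behaviour can be sketched as follows, assuming Python's re module. The last two lines show the newline exception and the re.DOTALL flag that lifts it.

```python
import re

candidates = ["Gree", "Gren", "Gre1", "Green"]

# A single dot stands for exactly one character.
one_char = [w for w in candidates if re.fullmatch(r"Gre.", w)]
print(one_char)  # ['Gree', 'Gren', 'Gre1']

# .* stands for any number of characters, so "Green" now fits.
any_tail = bool(re.fullmatch(r"Gre.*", "Green"))
print(any_tail)  # True

# By default the dot does not match a newline; re.DOTALL changes that.
default_newline = bool(re.fullmatch(r"Gre.", "Gre\n"))
dotall_newline = bool(re.fullmatch(r"Gre.", "Gre\n", flags=re.DOTALL))
print(default_newline, dotall_newline)  # False True
```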

Alternation (|):

Allows for alternate matches. | works like the Boolean OR.

example:

A|B creates a regular expression that will match either A or B.

H(i!|ey!) will match either Hi! or Hey!

M(s|r|rs)\.?\s[A-Z]\w+ will match any name that starts with Ms, Mr or Mrs.
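Both alternation examples can be tried with a short sketch, assuming Python's re module. The names in the list are made up for illustration.

```python
import re

greeting = re.compile(r"H(i!|ey!)")
greetings = [g for g in ["Hi!", "Hey!", "Ho!"] if greeting.fullmatch(g)]
print(greetings)  # ['Hi!', 'Hey!']

# Title (Ms, Mr or Mrs), optional dot, whitespace, then a capitalised name.
title = re.compile(r"M(s|r|rs)\.?\s[A-Z]\w+")
names = ["Mr. Smith", "Mrs Jones", "Ms. Lee", "Dr. Who"]  # hypothetical names
titled = [n for n in names if title.fullmatch(n)]
print(titled)  # ['Mr. Smith', 'Mrs Jones', 'Ms. Lee']
```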

Question mark (?):

Matches when the character preceding ? occurs 0 or 1 time only, making the character optional.

example:

Favou?rite = Favourite (u is found 1 time)

Favou?rite = Favorite (u is found 0 times)
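A minimal sketch of the optional character, assuming Python's re module:

```python
import re

favourite = re.compile(r"Favou?rite")

# The "u" may appear once or not at all, but never twice.
spellings = {s: bool(favourite.fullmatch(s)) for s in ["Favourite", "Favorite", "Favouurite"]}
print(spellings)  # {'Favourite': True, 'Favorite': True, 'Favouurite': False}
```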

Character set ([]):

[] is used to indicate a set of characters. In a set:

  • Characters can be listed individually, e.g. [cat] will match 'c', 'a', or 't'.
  • Ranges of characters can be indicated by giving two characters and separating them with a '-'. For example:
      [A-Z] will match any uppercase ASCII letter,
      [0-9] will match any digit from 0 to 9,
      [0-3][0-3] will match any two-digit number whose digits are both between 0 and 3, such as 00, 13, or 33 (but not 09 or 19),
      [0-9A-Fa-f] will match any hexadecimal digit.
  • If - is escaped (e.g. [A\-Z]) or if it's placed as the first or last character (e.g. [A-]), it will match a literal '-'.
  • The order of the characters does not matter.
  • Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
  • To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a right bracket, as well as a left bracket, braces, and parentheses.
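The rules for sets can be checked one by one with a short sketch, assuming Python's re module. The sample strings are made up for illustration.

```python
import re

# Individually listed characters: any of c, a, t.
listed = re.findall(r"[cat]", "tack")
print(listed)  # ['t', 'a', 'c']

# Ranges: hexadecimal digits, and two digits that are each between 0 and 3.
is_hex = bool(re.fullmatch(r"[0-9A-Fa-f]+", "1aF9"))
in_range = bool(re.fullmatch(r"[0-3][0-3]", "23"))
out_of_range = bool(re.fullmatch(r"[0-3][0-3]", "19"))
print(is_hex, in_range, out_of_range)  # True True False

# An escaped hyphen is a literal '-', not a range.
literal_hyphen = re.findall(r"[A\-Z]", "A-Z B")
print(literal_hyphen)  # ['A', '-', 'Z']

# Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")
print(specials)  # ['(', '+', ')', '*']

# Two ways to include a literal ']' in a set.
escaped = re.findall(r"[()[\]{}]", "f(x)[0]{}")
leading = re.findall(r"[]()[{}]", "f(x)[0]{}")
print(escaped)  # ['(', ')', '[', ']', '{', '}']
print(escaped == leading)  # True
```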

Character group ():

A character group is indicated by () and matches the characters in exact order.

example:

(abc) = abc, not acb
(123) = 123, not 321

https?://(www\.)?(\w+)(\.\w+) will match simple URLs such as https://www.google.com. There are 3 groups here.

1st group: the optional www.
2nd group: the domain name (google, facebook, etc.)
3rd group: the top-level domain (.com, .net, .org)

There is another, implicit group: group 0. Group 0 is everything that we captured, in our case the entire URL.
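Here is how the groups of the URL pattern can be read back, as a sketch assuming Python's re module. The two URLs are illustrative inputs built from the names mentioned above.

```python
import re

url = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

with_www = url.match("https://www.google.com")
print(with_www.group(0))  # https://www.google.com  (the whole match)
print(with_www.group(1))  # www.
print(with_www.group(2))  # google
print(with_www.group(3))  # .com

# When the optional first group is absent, it is reported as None.
without_www = url.match("http://facebook.com")
print(without_www.groups())  # (None, 'facebook', '.com')

# A group matches its characters in exact order.
print(re.search(r"(abc)", "acb"))  # None
```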

Quantifiers:

Regex uses quantifiers to indicate how many times a part of the pattern may repeat. We can use multiple quantifiers in one pattern. The quantifiers are:

{n}:

Matches when the preceding character, or character group, occurs exactly n times.

example:

pand[ora]{2} = pandar, pandoo
pand[ora]{2} ≠ pandora

as the quantifier {2} only allows for 2 letters from the character set [ora].

{n,m}:

Matches when the preceding character, or character group, occurs at least n times, and at most m times.

example:

\d{2,6} = 430973, 4303, 38238
\d{2,6} ≠ 3

3 does not match because it is only 1 digit, which is outside the allowed range of 2 to 6 repetitions.
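The quantifier examples can be verified with a short sketch, assuming Python's re module and whole-string matching.

```python
import re

# {2}: exactly two characters taken from the set [ora].
exactly_two = re.compile(r"pand[ora]{2}")
pand_results = {w: bool(exactly_two.fullmatch(w)) for w in ["pandar", "pandoo", "pandora"]}
print(pand_results)  # {'pandar': True, 'pandoo': True, 'pandora': False}

# {2,6}: between two and six digits, inclusive.
two_to_six = re.compile(r"\d{2,6}")
digit_results = {n: bool(two_to_six.fullmatch(n)) for n in ["430973", "4303", "38238", "3"]}
print(digit_results)  # {'430973': True, '4303': True, '38238': True, '3': False}
```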

Escaping metacharacters:

To search for a character that is a reserved metacharacter (any of .[{()\^$|?*+), we can put a backslash (\) in front of it so that it is treated as a literal character.

Example:

The regex below will match most common email addresses. Here we've used the backslash (\) to escape reserved characters.

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
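To see what escaping changes, here is a minimal sketch assuming Python's re module. The email address is a made-up example, and the pattern is the simple one shown above, not a complete validator.

```python
import re

# An unescaped dot is a wildcard; an escaped dot is a literal full stop.
wildcard_dot = bool(re.fullmatch(r"a.c", "abc"))
literal_dot_wrong = bool(re.fullmatch(r"a\.c", "abc"))
literal_dot_right = bool(re.fullmatch(r"a\.c", "a.c"))
print(wildcard_dot, literal_dot_wrong, literal_dot_right)  # True False True

email = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

# Hypothetical address: user name, domain, and top-level domain are captured.
parts = email.match("jane.doe@example.com").groups()
print(parts)  # ('jane.doe', 'example', 'com')

print(email.match("not-an-email"))  # None
```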

Congratulations! Now you know the very basics of regex, and that's already plenty for one day!

In my upcoming article we will practice regex with Python.
