Perl Regular Expressions - Extracting substrings between two characters

July 28, 2011

Suppose we need to extract a program name from a URL.
E.g. '//www.domain.com/program_name'

Simple enough with,

my $url = '//www.domain.com/program_name';
my ($domain, $program) = $url =~ m/(.*\/)(.*)/;

(.*\/) tells Perl to grab everything up to and including the rightmost '/'. The .* is greedy, so it matches as much as it can before that final '/'. The '/' is escaped as \/ because '/' is also the delimiter.
(.*) grabs everything after that.

The '/' on either side of the regular expression are its delimiters.

Perl puts whatever each set of () matches into a variable for you: $1 for the first set, $2 for the second, and so on. In the example above the match is evaluated in list context, so it returns the captured values and we assign them straight to our own variables instead.
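The two ways of reading captures can be compared in a small sketch. The helper name split_url is hypothetical and not part of the original example. It also checks that the match succeeded, since $1 and $2 are only meaningful after a successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (domain, program), or an empty list
# when the URL does not match.
sub split_url {
    my ($url) = @_;

    # In list context a successful match returns the captured groups,
    # so they can be assigned straight to our own variables.
    my ($domain, $program) = $url =~ m/(.*\/)(.*)/;
    return unless defined $domain;

    return ($domain, $program);
}

my ($domain, $program) = split_url('//www.domain.com/program_name');
print "$domain\n";     # //www.domain.com/
print "$program\n";    # program_name

# The same captures are also available as $1 and $2, but only inside
# a check that the match succeeded.
if ('//www.domain.com/program_name' =~ m/(.*\/)(.*)/) {
    print "$1 $2\n";   # //www.domain.com/ program_name
}
```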

Taking It Further

Now what if we want to cater for parameters or other variables in our URL? We want to extract everything from the rightmost '/' up to the first '?' or, if there is no '?', the end of the string.
E.g. '//www.domain.com/program_name' or '//www.domain.com/program_name?extra=junk'

This does the trick,

my $url = '//www.domain.com/program_name?extra=junk';
my ($domain, $program) = $url =~ m{(.*/)([^?]*)};

([^?]*) tells Perl to grab as many characters as it can that are not a '?', so it stops at the first '?' or at the end of the string.
The [] is a character class: it matches a single character out of the set listed inside.
A ^ at the start of the class negates it (any character that isn't a '?').
As the '?' is inside a character class, it does not need to be escaped.
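As a quick check, the same pattern can be run against both forms of the URL. This is only a sketch, and the helper name program_name is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the last '/' and the
# first '?' that follows it (or the end of the string).
sub program_name {
    my ($url) = @_;
    my ($domain, $program) = $url =~ m{(.*/)([^?]*)};
    return $program;
}

for my $url ('//www.domain.com/program_name',
             '//www.domain.com/program_name?extra=junk') {
    print program_name($url), "\n";   # program_name (both times)
}
```

One caveat: because (.*/) is greedy, a '/' inside the query string (say '?next=/home') would be treated as the rightmost '/', so this pattern assumes the parameters contain no slashes.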

Pleasing The Critics

Perl Critic complains about the above regular expression, for a few reasons.

Prohibited Escaped Characters - It doesn't like the '\/' in the first example, where the '/' has to be escaped because it is also our delimiter. The fix is to use a different delimiter like '{}'.

 m{(.*/)([^?]*)}; 
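To show that changing the delimiter does not change what is matched, here is a small comparison of the two spellings (a sketch using the example URL from earlier):

```perl
use strict;
use warnings;

my $url = '//www.domain.com/program_name?extra=junk';

# With '/' as the delimiter, the literal slash has to be escaped.
my @slashes = $url =~ m/(.*\/)([^?]*)/;

# With braces as the delimiter, it does not.
my @braces = $url =~ m{(.*/)([^?]*)};

print "same\n" if "@slashes" eq "@braces";   # same
```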

Missing /x (Extended Format) Flag - Adding this flag allows for comments and extra whitespace in the regular expression, to make it easier to read.

 m{(.*/)([^?]*)}x; 
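With /x in place, the pattern can be spread over several lines and commented. The layout below is just one possible way to do it. Note that variables are still interpolated inside regex comments, so the comments refer to "group 1" rather than naming the capture variables.

```perl
use strict;
use warnings;

my $url = '//www.domain.com/program_name?extra=junk';

my ($domain, $program) = $url =~ m{
    ( .* / )     # group 1: everything up to and including the last '/'
    ( [^?]* )    # group 2: everything after it, up to the first '?'
}x;

print "$program\n";   # program_name
```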

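Missing /m (Line Boundary Matching) Flag - Adding this flag makes '^' and '$' match at the start and end of every line in the string, rather than only at the start and end of the whole string, which is what most people expect.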

 m{(.*/)([^?]*)}m; 
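The pattern in this post has no '^' or '$', so /m does not change its behaviour. The difference only shows up with anchors, as in this sketch. The multi-line input and the anchored pattern are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical multi-line input: one URL per line.
my $text = "//www.domain.com/first\n//www.domain.com/second\n";

# Without /m, ^ and $ only anchor to the whole string, so no single
# line can satisfy both of them here.
my @without = $text =~ m{^.*/(\w+)$}g;

# With /m, ^ and $ anchor to each line, so every line matches.
my @with = $text =~ m{^.*/(\w+)$}mg;

print scalar(@without), "\n";   # 0
print scalar(@with), "\n";      # 2
```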

Missing /s (Dot Anything Matching) Flag - Adding this flag makes '.' match any character, instead of any character except a newline ('\n').

 m{(.*/)([^?]*)}s; 
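The effect of /s is only visible when the string contains a newline. The input below is contrived for illustration, with a newline placed before the last '/':

```perl
use strict;
use warnings;

# Hypothetical input: a newline has crept in before the last '/'.
my $text = "//www.domain.com\n/program_name";

# Without /s, '.' cannot match the newline, so '.*' is confined to the
# first line and the last '/' it can reach is the second character.
my ($short) = $text =~ m{(.*/)([^?]*)};

# With /s, '.' matches the newline too, so '.*' reaches the final '/'.
my ($long, $program) = $text =~ m{(.*/)([^?]*)}s;

print length($short), "\n";   # 2
print "$program\n";           # program_name
```

The three flags can be combined on one pattern, e.g. m{(.*/)([^?]*)}xms.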

Perl Regular Expression Resources

perlre
perlrequick
perlretut