Posts tagged: substrings

  • Using Perl Regular Expressions to Replace substr Calls

    Suppose we want to format digits with commas after the 2nd and 5th digits.
    i.e. convert 12345678 to 12,345,678.

    Using substr, Perl's substring function, this is achieved with:

    my $old = '12345678';
    my $new = substr( $old, 0, 2 ) . ',' . substr( $old, 2, 3 ) . ',' . substr( $old, 5, 3 );
    # result: 12,345,678

    It works and it’s fine, but Perl Critic will complain about the literal numbers ("magic numbers") passed to substr. You could go ahead and define them as constants, but that’s cumbersome for trivial substr parameters.
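
    For comparison, the constants route might look something like the sketch below. The constant names are hypothetical and chosen only for illustration; the point is how much ceremony six trivial numbers end up needing.

```perl
use strict;
use warnings;

# Hypothetical constant names, not from the original post.
use constant {
    GROUP_ONE_OFFSET   => 0,
    GROUP_ONE_LENGTH   => 2,
    GROUP_TWO_OFFSET   => 2,
    GROUP_TWO_LENGTH   => 3,
    GROUP_THREE_OFFSET => 5,
    GROUP_THREE_LENGTH => 3,
};

my $old = '12345678';

# Same three substr calls as before, with the literals replaced by names.
my $new = join ',',
    substr( $old, GROUP_ONE_OFFSET,   GROUP_ONE_LENGTH ),
    substr( $old, GROUP_TWO_OFFSET,   GROUP_TWO_LENGTH ),
    substr( $old, GROUP_THREE_OFFSET, GROUP_THREE_LENGTH );

print "$new\n";    # prints 12,345,678
```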

    Using a regular expression provides another option.

    my $old = '12345678';
    $old =~ m/(\d{2})(\d{3})(\d{3})/;
    my $new = $1 . ',' . $2 . ',' . $3;
    # result: 12,345,678

    It’s easily readable to anyone looking at your code and, best of all, Perl Critic won’t complain about magic numbers.
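
    One thing worth guarding against: if the match fails, $1, $2 and $3 keep whatever an earlier successful match left in them. A minimal sketch of a safer variant follows, assuming we only want to accept input that is exactly eight digits. The helper name format_digits and the ^ and $ anchors are additions for illustration, not part of the original snippet.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the formatted string, or undef when the
# input is not exactly eight digits.
sub format_digits {
    my ($old) = @_;

    # Assigning the captures in list context gives a false value when the
    # match fails, so the capture variables are never read after a miss.
    if ( my ( $first, $second, $third ) = $old =~ m/^(\d{2})(\d{3})(\d{3})$/ ) {
        return join ',', $first, $second, $third;
    }

    return;
}

my $new = format_digits('12345678');
print "$new\n";    # prints 12,345,678
```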

    August 10, 2011 - 1 minute read - perl regex regular expressions characters substrings strings
  • Perl Regular Expressions - Extracting substrings between two characters

    Suppose we need to extract a program name from a URL.
    i.e. '//www.domain.com/program_name'

    Simple enough with:

    my $url = '//www.domain.com/program_name';
    my ($domain, $program) = $url =~ m/(.*\/)(.*)/;

    (.*\/) tells perl to grab everything up to and including the rightmost '/'.
    (.*) grabs everything else after.

    The '/' on either side of the regular expression is the delimiter, which is why the '/' inside the pattern has to be escaped as '\/'.

    Perl puts whatever each set of () matches into variables for you: $1 for the first set of (), $2 for the second, and so on. In the example above we instead assign the captures directly to our own variables.
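
    The two ways of getting at the captures can be put side by side. This is only a sketch; the names $first and $second are made up here to hold copies of $1 and $2.

```perl
use strict;
use warnings;

my $url = '//www.domain.com/program_name';

# List context: the match returns its captures, which we assign directly.
my ( $domain, $program ) = $url =~ m/(.*\/)(.*)/;

# The same captures through the numbered variables. They are only
# meaningful after a successful match, hence the if.
my ( $first, $second );
if ( $url =~ m/(.*\/)(.*)/ ) {
    ( $first, $second ) = ( $1, $2 );
}

print "$domain\n";     # prints //www.domain.com/
print "$program\n";    # prints program_name
```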

    Taking It Further

    Now what if we want to cater for parameters or other variables in our URL? We want to extract everything from the rightmost '/' to the end of the string or the first '?' encountered.
    i.e. '//www.domain.com/program_name' or '//www.domain.com/program_name?extra=junk'

    This does the trick,

    my $url = '//www.domain.com/program_name?extra=junk';
    my ($domain, $program) = $url =~ m{(.*/)([^?]*)};

    ([^?]*) tells perl to grab everything that is not a '?'.
    The [] represents a character class, where everything inside represents one character.
    The ^ is a negation (everything that isn't a '?').
    As the '?' is inside a character class, it does not need to be escaped.
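
    As a quick check, the same pattern can be run over both forms of the URL. This is a small sketch; the @urls and @programs arrays are just scaffolding for the demonstration.

```perl
use strict;
use warnings;

my @urls = (
    '//www.domain.com/program_name',
    '//www.domain.com/program_name?extra=junk',
);

my @programs;
for my $url (@urls) {
    # Capture 1 runs up to the last '/', capture 2 stops before any '?'.
    my ( $domain, $program ) = $url =~ m{(.*/)([^?]*)};
    push @programs, $program;
    print "$program\n";    # prints program_name for both URLs
}
```

    One caveat: because (.*/) is greedy, a '/' inside the query string (for example '?next=/home') would move the split point into the query, so this pattern assumes the parameters contain no slashes.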

    Pleasing The Critics

    Perl Critic complains about the regular expressions above for a few reasons.

    Prohibited Escaped Characters - It doesn't like escaping '/' as '\/', which is needed whenever the delimiter is itself a '/'. The fix is to use a different delimiter like '{}'.

     m{(.*/)([^?]*)}; 

    Missing /x (Extended Format) Flag - Adding this flag allows for comments and extra whitespace in the regular expression, to make it easier to read.

     m{(.*/)([^?]*)}x; 

    Missing /m (Line Boundary Matching) Flag - Adding this flag makes '^' and '$' match at the start and end of each line, which is how most would expect boundary matching to work.

     m{(.*/)([^?]*)}m; 

    Missing /s (Dot Anything Matching) Flag - Adding this flag makes '.' match anything, instead of anything but a newline ('\n').

     m{(.*/)([^?]*)}s; 
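
    The three flags can be combined on one pattern. The sketch below shows what that might look like, with /x used to spread the expression over several lines and comment each part; the layout is a matter of taste.

```perl
use strict;
use warnings;

my $url = '//www.domain.com/program_name?extra=junk';

my ( $domain, $program ) = $url =~ m{
    ( .* / )     # capture 1: everything up to and including the last slash
    ( [^?]* )    # capture 2: what follows, stopping at the first question mark
}xms;

print "$domain\n";     # prints //www.domain.com/
print "$program\n";    # prints program_name
```

    Under /x the whitespace and comments inside the pattern are ignored, so this matches the same text as the one-line version.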

    Perl Regular Expression Resources

    perlre
    perlrequick
    perlretut

    July 28, 2011 - 2 minute read - perl regex regular expressions characters substrings strings