Java 9 Regular Expressions Example

1. Introduction

In this example we will explore the java.util.regex package and the abstractions it contains that facilitate the use of regular expressions in Java 9. Even though the String class exports an assortment of “regex” convenience methods (eg: matches(...), replaceAll(...) & split(...)), they are implemented in terms of the java.util.regex package, so the focus will be on that package and the abstractions it provides.
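
As a rough illustration of that relationship, the sketch below compares a String convenience method with its java.util.regex counterpart. The class name StringRegexShortcuts and its helper methods are made up for this example and are not part of the downloadable sample code.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class StringRegexShortcuts {

    // Hypothetical helper: whole-input match via the String convenience method.
    static boolean viaString(String input, String regex) {
        return input.matches(regex);
    }

    // Hypothetical helper: the same check expressed with java.util.regex directly.
    static boolean viaPattern(String input, String regex) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String input = "2017-09-21";
        String regex = "\\d{4}-\\d{2}-\\d{2}";

        System.out.println(viaString(input, regex));   // true
        System.out.println(viaPattern(input, regex));  // true

        // split(...) and replaceAll(...) also interpret their first argument as a regex.
        System.out.println(Arrays.toString(input.split("-")));  // [2017, 09, 21]
        System.out.println(input.replaceAll("\\d", "#"));       // ####-##-##
    }
}
```

Both helpers require the whole input to match. When the same expression is applied many times, compiling a Pattern once and reusing it is usually the better choice.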

Using Java 9 requires some initial setup before we can compile and run the code; this is covered, or at least linked to, in this article.

You are welcome to use the Java 9 REPL shell (JShell) to execute code snippets (copy and paste); however, the sample code is a standalone console application that can be run from the command line in jar form.

If you feel a bit rusty on regular expression syntax, at least the kind of dialect used in Java, you can head over to the API (Summary of regular-expression constructs) and take a quick refresher. I found it very useful, even if I could only remember a handful of them.

Don’t feel too intimidated. This will not be an exhaustive look at using all the regular expression syntax features and java.util.regex API features, but rather a simple set of examples demonstrating most of the java.util.regex API features in Java 9.

2. Technologies used

The example code in this article was built and run using:

  • Java 9 (jdk-9+180)
  • Maven 3.3.9 (3.3.x will do fine)
  • Eclipse Oxygen (4.7.0)
  • Ubuntu 16.04 (Windows, Mac or Linux will do fine)

3. Setup

In order to compile and run the sample code, Java 9, Eclipse Oxygen (4.7.0 with Java 9 support) and Maven 3.3.9 need to be installed. The process has been outlined in a previous article (3. Setup) and can be followed from there.

If you just want to view the code in a text editor and compile and run the program from the command line, then Java 9 and Maven 3.3.9 are all that is required.

4. API

The java.util.regex package is exported by the module java.base, which every module depends on implicitly in JDK 9, meaning you do not need to explicitly declare it as a required module.

The Java regular expression syntax is similar to that of Perl and the package java.util.regex defines classes and interfaces integral to this task.

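These include:

  • Pattern: a compiled representation of a regular expression
  • Matcher: an engine that performs match operations on a character sequence by interpreting a Pattern
  • MatchResult: an interface describing the result of a match operation
  • PatternSyntaxException: an unchecked exception thrown to indicate a syntax error in a regular expression pattern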
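
To give a feel for how the types in this package fit together, here is a minimal sketch. The class RegexApiTour and its two helper methods are hypothetical names chosen for illustration only.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexApiTour {

    // Hypothetical helper: returns a snapshot of the first match, or null if there is none.
    static MatchResult firstMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);   // compiled form, safe to reuse
        Matcher matcher = pattern.matcher(input);   // stateful, bound to one input
        return matcher.find() ? matcher.toMatchResult() : null;
    }

    // Hypothetical helper: reports whether the expression compiles at all.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        MatchResult result = firstMatch("\\d+", "order 66 shipped");
        System.out.println(result.group() + " at " + result.start() + "-" + result.end()); // 66 at 6-8

        System.out.println(isValid("[a-z]"));  // true
        System.out.println(isValid("[a-z"));   // false, the character class is never closed
    }
}
```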

5. Meta Characters

Before diving into the sample code, here is a small primer covering some of the regular expression syntax and the meta or special characters that will be used.

Meta or special characters have special meaning in regular expressions and thus have an impact on how matches are made. eg: ^ ( ) [ ] . + ? These can be grouped into the following categories:

5.1 Character Classes

A composite of characters and symbols that form a logical unit and have special meaning in a regular expression. eg: [abc]

  • OR: A collection of characters in square brackets which are logically joined by way of an “or” conjunction. eg: [abc] reads a or b or c
  • NEGATION: A leading ^ inside the brackets indicates that the desired match is any character other than those listed. eg: [^abc] reads any character except a, b or c
  • RANGES: An inclusive range of characters / numbers starting at the left item and spanning to the right item inclusive. eg: [a-z] reads from a to z inclusive
  • UNIONS: A combination of character classes or number sets, the superset of all combined. eg: [a-d[u-z]] reads a through to d inclusive and u through to z inclusive
  • INTERSECTION: Represents the intersection / overlap of two or more character classes or number sets. eg: [a-z&&[bc]] reads only b and c because they are the only two common to both
  • SUBTRACTION: An intersection with a negated class, ie what is left of the first class after removing the second. eg: [a-z&&[^bc]] reads a, and d through to z inclusive
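
The character class forms listed above can be tried out with a small sketch like the following. The class CharacterClassDemo and its keep helper are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {

    // Hypothetical helper: concatenates every match of regex found in input.
    static String keep(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuilder kept = new StringBuilder();
        while (matcher.find()) {
            kept.append(matcher.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";

        System.out.println(keep("[abc]", alphabet));        // OR:           abc
        System.out.println(keep("[^abc]", "abcde"));        // NEGATION:     de
        System.out.println(keep("[a-e]", alphabet));        // RANGE:        abcde
        System.out.println(keep("[a-d[u-z]]", alphabet));   // UNION:        abcduvwxyz
        System.out.println(keep("[a-z&&[bc]]", alphabet));  // INTERSECTION: bc
        System.out.println(keep("[a-z&&[^bc]]", "abcde"));  // SUBTRACTION:  ade
    }
}
```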

5.2 Predefined Character Classes

This list is quite long but can be found in the java.util.regex.Pattern API documentation. It includes “Predefined character classes”, “POSIX character classes (US-ASCII only)”, “java.lang.Character classes (simple java character type)” and “Classes for Unicode scripts, blocks, categories and binary properties”.
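
A short sketch with one or two representatives from each of these families may help. The class PredefinedClassDemo, its count helper and the sample sentence are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {

    // Hypothetical helper: counts how many times regex matches inside input.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "Java 9, released 2017!";

        System.out.println(count("\\d", input));                // predefined, digits: 5
        System.out.println(count("\\s", input));                // predefined, whitespace: 3
        System.out.println(count("\\w", input));                // predefined, word characters: 17
        System.out.println(count("\\p{Punct}", input));         // POSIX, punctuation: 2
        System.out.println(count("\\p{javaLowerCase}", input)); // java.lang.Character: 11
        System.out.println(count("\\p{IsAlphabetic}", input));  // Unicode binary property: 12
    }
}
```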

5.3 Quantifiers

Quantifiers specify how many occurrences of a character, group or character class must be matched in the given text input.

By default quantifiers are greedy in behavior and attempt to match as much of the input as possible. Quantifiers can be divided into 3 types based on their behavior when matching text input. These are greedy, reluctant and possessive.

  • Greedy: Greedy quantifiers start by consuming as much of the text input as they can and then attempt to match the rest of the pattern. If it matches: great, we stop right there. If not, the quantifier gives back one character at a time from the right hand side and the rest of the pattern is retried against the released character(s). Once we get a match, we stop. eg: a? a* a+ a{n} a{n,} a{n,m}
  • Reluctant: Reluctant quantifiers start by consuming as little as possible and only take one more character at a time (reluctantly from the left) when the rest of the pattern fails to match. Once we match, the consumed characters form the match and the Matcher continues from the next index. eg: a?? a*? a+? a{n}? a{n,}? a{n,m}?
  • Possessive: Possessive quantifiers behave like greedy quantifiers except that they do not back off (give characters back from the right to attempt to make a match), so the overall match fails if the rest of the pattern needs any of the consumed characters. eg: a?+ a*+ a++ a{n}+ a{n,}+ a{n,m}+
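
The difference between the three behaviors is easiest to see when the same input is matched with each variant. The following is a minimal sketch; the class QuantifierDemo, its first helper and the tag-like input are hypothetical and separate from the article's sample code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String first(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";

        // Greedy: .+ grabs everything, then backs off just far enough for the final '>'.
        System.out.println(first("<.+>", input));   // <a><b>

        // Reluctant: .+? grows one character at a time and stops at the first '>'.
        System.out.println(first("<.+?>", input));  // <a>

        // Possessive: .++ grabs everything including the last '>' and never gives it back.
        System.out.println(first("<.++>", input));  // null
    }
}
```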

5.4 Groups

Groups represent multiple characters in a regular expression as a single unit, similar to character classes but with additional benefits by being able to reference groups by name and index. We can also back reference a captured group later on in the regular expression itself.

eg: (\\d{2})\\s(\\d{2}) reads first group is the first 2 digit sequence followed by space then next group is the next 2 digits

eg: ^.*drink\\s(?<beverage>\\w+).*eat\\s(?<food>\\w+).*$ reads the first group occurs after ‘drink ‘ and we name it ‘beverage’ and the second group occurs after ‘eat ‘ and we name it ‘food’

eg: ^.*of\\s(\\w+).*(\\1).*$ reads the first group occurs after ‘of ‘ and is a whole word and then sometime later we reference the value captured by the first group in the next group ‘(\\1)’ where ‘1’ is the number of the group whose value we are referencing
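
As a compact illustration of named groups and back references, consider the sketch below. The class GroupDemo, its helper methods and the patterns used are assumptions made for this example; they are not taken from the sample code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {

    // Hypothetical helper: finds a word that is immediately repeated, eg "the the".
    // \\1 refers back to whatever group 1 captured.
    static String firstRepeatedWord(String input) {
        Matcher matcher = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Hypothetical helper: extracts hour and minute using named groups.
    static String[] splitTime(String input) {
        Matcher matcher = Pattern.compile("(?<hour>\\d{1,2}):(?<minute>\\d{2})").matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group("hour"), matcher.group("minute") };
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("it was the the best of times"));  // the

        String[] time = splitTime("class starts at 10:00am");
        System.out.println(time[0] + " h " + time[1] + " min");  // 10 h 00 min
    }
}
```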

5.5 Boundary matches

Represents a method to restrict matches to be more specific. For example, instead of capturing a match anywhere in a line, sometimes mid word, we can say we want to have word boundary matches.
eg: \\bdog\\b reads we want the word dog but not *dog* (ie ‘hotdog’ is ignored)
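
The effect of the word boundary can be seen by comparing the match positions with and without it. This is only a sketch; the class BoundaryDemo, its starts helper and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {

    // Hypothetical helper: collects the start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<Integer> indexes = new ArrayList<>();
        while (matcher.find()) {
            indexes.add(matcher.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        String input = "hotdog dog dogma bulldog dog.";

        // Without boundaries 'dog' is found inside other words as well.
        System.out.println(starts("dog", input));         // [3, 7, 11, 21, 25]

        // With \\b on both sides only the standalone word is matched.
        System.out.println(starts("\\bdog\\b", input));   // [7, 25]
    }
}
```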

6. Example Code

The sample code is built with Maven by issuing the following command: mvn clean install package. This will build a file called regex-0.0.1-SNAPSHOT.jar in the target folder under the project root.

Running the program is as simple as navigating to the target folder and issuing the following command: java -jar regex-0.0.1-SNAPSHOT.jar.

Snippet of program output

Misc - no class

        Input is At 10:00am I have Computer science class and at 11:00am I have a hall pass and at 12:00pm I have no class and at 4:00pm we leave in mass
        Regex is [^c][^l]ass

        Start index of match 69
        End index of match 74
        Value is  pass

        Start index of match 131
        End index of match 136
        Value is  mass
-----

Misc - in range

        Input is bow flow row now snow crow mow vow
        Regex is (\b[bn]ow)\b

        Start index of match 0
        End index of match 3
        Value is bow

        Start index of match 13
        End index of match 16
        Value is now

-----

Below are some code snippets from some of the classes in the example code available for download with this article.

Snippets of basic Regular Expression usage

        // Simple
        final String input = "oxoxox";
        // Simple pattern match on literal String value
        final Pattern pattern = Pattern.compile("x");
        // Should match 3 'x' values at progressively increasing indexes.
...
        // Character classes
        final String input = "At 10:00am I have Computer science class and at 11:00am I have a hall pass and at 12:00pm I have no class";
        // Match 'lass', ie 'ass' preceded by the single-character class [l], as found in 'class'
        final Pattern pattern = Pattern.compile("[l]ass");
...
        // Negation in character classes
        final String input = "At 10:00am I have Computer science class and at 11:00am I have a hall pass and at 12:00pm I have no class and at 4:00pm we leave in mass";
        // Here we negate the two characters in front of 'ass': the first must not be 'c' and the
        // second must not be 'l', so 'class' is rejected while ' pass' and ' mass' match
        final Pattern pattern = Pattern.compile("[^c][^l]ass");
...
        // Union
        final String input = "abcdefghijklmnopqrstuvwxyz";
        // We are interested in the letters 'a' through to 'd' and 'u' through to 'z' all inclusive.
        final Pattern pattern = Pattern.compile("[a-d[u-z]]");
...
        // Intersection
        final String input = "abcdefghijklmnopqrstuvwxyz";
        // We are interested in the overlap / intersection of the character class 'a' through 'd' and the letters 'b','c','x','y','z'
        // meaning we will only get 'b' and 'c'
        final Pattern pattern = Pattern.compile("[a-d&&[bcxyz]]");
...
        // In range
        final String input = "bow flow row now snow crow mow vow";
        // Here we are interested in whole words '\\b' that end in 'ow' but that start with either 'b' or 'n'.
        // ie: 'bow' or 'now'
        final Pattern pattern = Pattern.compile("(\\b[bn]ow)\\b");

The above code snippets display basic usage of the Pattern object to compile a regular expression in Java. Each snippet comes with the input that will be matched against via a Matcher object. The snippets demonstrate literal, character class, negation, union, intersection and range pattern compilation examples.
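
The snippets above only show the compile step. A minimal sketch of the matching loop that would produce start index, end index and value for each match could look as follows. The class MatchLoopSketch and its describeMatches helper are hypothetical and are not the printResults method used by the sample code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchLoopSketch {

    // Hypothetical helper: one descriptive line per match found in input.
    static List<String> describeMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> lines = new ArrayList<>();
        // find() advances to the next match each time it returns true.
        while (matcher.find()) {
            lines.add(String.format("start=%d end=%d value=%s",
                    matcher.start(), matcher.end(), matcher.group()));
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "bow flow row now snow crow mow vow";
        for (String line : describeMatches("(\\b[bn]ow)\\b", input)) {
            System.out.println(line);
        }
        // start=0 end=3 value=bow
        // start=13 end=16 value=now
    }
}
```

Note that the end index is exclusive, which is why 'bow' at index 0 reports an end index of 3.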

Grouping Regular Expression usage

    private static void groupByIndex() {
        System.out.println("Grouping - simple\n");

        // Interested in 3 groups, groups 1 & 2 must be 2 digits long and separated by a
        // space.
        // Group 3 occurs after a space after group 2 and can be 1 or more digits long
        final Pattern PATTERN = Pattern.compile("^.*(\\d{2}) (\\d{2}) (\\d+)$");
        final Matcher matcher = PATTERN.matcher("+27 99 12345");

        System.out.printf("\tThe number of groups are %d\n\n", matcher.groupCount());

        // Define indexes for the groups in the pattern; group 0 is always the
        // entire match, so the capturing groups start at index 1.
        final int countryCodeIdx = 1;
        final int mnoIdx = 2;
        final int numberIdx = 3;

        if (matcher.find()) {

            // Retrieve the group values by the index
            System.out.printf("\tCountry code is %s\n", matcher.group(countryCodeIdx));
            System.out.printf("\tMobile network operator code is %s\n", matcher.group(mnoIdx));
            System.out.printf("\tNumber is %s\n", matcher.group(numberIdx));
        }

        System.out.println("-----\n");
    }

    private static void namedGroups() {
        System.out.println("Grouping - named groups\n");

        // Label the group with a name. Here we are interested in the beverage name that
        // occurs after 'drink ' and the food being eaten after 'eat '.
        final Pattern pattern = Pattern.compile("^.*drink\\s(?<beverage>\\w+).*eat\\s(?<food>\\w+).*$");
        final Matcher matcher = pattern.matcher("i drink soda, play football, run marathon, eat chips and watch TV");

        // There should be two groups
        System.out.printf("\tThe number of groups are %d\n\n", matcher.groupCount());

        if (matcher.find()) {
            // Reference the group by the label we used.
            System.out.printf("\tThe beverage start index is %d\n", matcher.start("beverage"));
            System.out.printf("\tI drink %s\n", matcher.group("beverage"));
            System.out.printf("\tThe beverage end index is %d\n\n", matcher.end("beverage"));

            // Reference the group by the label we used.
            System.out.printf("\tThe food start index is %d\n", matcher.start("food"));
            System.out.printf("\tI eat %s\n", matcher.group("food"));
            System.out.printf("\tThe food end index is %d\n", matcher.end("food"));
        }

        System.out.println("-----\n");
    }

    private static void backReference() {
        System.out.println("Grouping - back reference\n");

        // We use a back reference by referring to the previous group captured inline in
        // the expression.
        // Group one captures the word after 'of ' and then refers to it in group 2
        // '(\\1)'
        final Pattern pattern = Pattern.compile("^.*of\\s(\\w+).*(\\1).*$");
        final Matcher matcher = pattern.matcher("99 bottles of beer on the wall, if one bottle should fall, 98 bottles of beer on the wall");

        // There should be 2 groups
        System.out.printf("\tThe number of groups are %d\n\n", matcher.groupCount());

        if (matcher.find()) {

            // Select the captured values by group index
            System.out.printf("\tThere are 99 bottles of %s\n", matcher.group(1));
            System.out.printf("\tAfter one fell there are 98 bottles of %s\n", matcher.group(2));
        }

        System.out.println("-----\n");
    }

The above code snippet demonstrates usage of grouping in Pattern compilation. The input is also provided in all cases. Example usage of back references, named grouping and index grouping are shown.

Quantifiers Regular Expression usage

...
    static void run() {
        // Greedy: consume as much input as possible, then try to match the rest
        // of the pattern. If that fails, give back one character at a time from
        // the right hand side and retry until a match is found or nothing is
        // left to give back.
        runInternal("ssxx", "Quantifiers greedy - %s\n", "x*");
        runInternal("ssxx", "Quantifiers greedy - %s\n", "x?");
        runInternal("ssxx", "Quantifiers greedy - %s\n", "x+");

        // Consume text one character at a time from left hand side reluctantly
        // and attempt match, if found,
        // return the portion of text from the start of the left hand side up
        // until index of where match finally
        // occurred and then continue until end of input is reached.
        runInternal("xxbx", "Quantifiers reluctant - %s\n", "x*?b");
        runInternal("xxbx", "Quantifiers reluctant - %s\n", "x??b");
        runInternal("xxbx", "Quantifiers reluctant - %s\n", "x+?b");

        // Behaves the same as greedy quantifiers without back off behavior.
        runInternal("xxbx", "Quantifiers possessive - %s\n", "x*+b");
        runInternal("xxbx", "Quantifiers possessive - %s\n", "x?+b");
        runInternal("xxbx", "Quantifiers possessive - %s\n", "x++b");
    }

    private static void runInternal(final String input, final String message, final String regex) {
        System.out.printf(message, input);

        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(input);

        printResults(matcher, input, pattern.pattern());
        System.out.println("-----\n");
    }
...

The above code snippet displays usage of greedy, reluctant and possessive quantifiers. In all cases the input is provided.

7. Summary

In this tutorial we briefly covered the main components that make up the java.util.regex package, which encompasses the core of the regular expression functionality in Java 9. We demonstrated the usage of this API with a set of examples and also explained some of the less obvious regular expression syntax used in the sample code.

8. Download the Source Code

This was a Java 9 Regular Expressions Example.

Download
You can download the full source code of this example here: Java 9 Regular Expressions Example

JJ

Jean-Jay Vester graduated from the Cape Peninsula University of Technology, Cape Town, in 2001 and has spent most of his career developing Java backend systems for small to large sized companies both sides of the equator. He has an abundance of experience and knowledge in many varied Java frameworks and has also acquired some systems knowledge along the way. Recently he has started developing his JavaScript skill set specifically targeting Angularjs and also bridged that skill to the backend with Nodejs.