Home » Core Java » util » regex » Matcher » Extract HTML Links with Java Regular Expression example

About Nikos Maravitsas

Nikos Maravitsas
Nikos has graduated from the Department of Informatics and Telecommunications of The National and Kapodistrian University of Athens. During his studies he discovered his interests about software development and he has successfully completed numerous assignments in a variety of fields. Currently, his main interests are system’s security, parallel systems, artificial intelligence, operating systems, system programming, telecommunications, web applications, human – machine interaction and mobile development.

Extract HTML Links with Java Regular Expression example

With this example we shall show you how to extract and process HTML links with Java Regular expression. You can follow the basic techniques in this article and learn how to process many other HTML elements and thus create a very basic HTML parser that you can easily embed in your application. So the things we want to do is:

  • Extract the a from the HTML document
  • Extract the value of the href attribute
  • Extract the text of the a HTML link element.

We are going to work with groups. In our regular expression we are going to have a group that describes the values between ' ' in the href=' ' attribute. Then we are going to see which part of the link element matches that group, and thus get the value of the href attribute.  We will apply tha same  strategy in order to get the thext of the link element.

So here are the two regular expressions we are going to use :

  • To get the anchor element:
  • To get the href attribute:

You should take a look at the Pattern class documentation to learn how to construct your own regular expressions according to your policy.

1. HTML Link Extractor classes


package com.javacodegeeks.java.core;

public class HTMLLinkElement {

	String linkElement;
	String linkAddress;

	public String getLinkAddress() {
		return linkAddress;

	public void setLinkAddress(String linkElement) {
		this.linkAddress = replaceInvalidChar(linkElement);

	public String getLinkElement() {
		return linkElement;

	public void setLinkElement(String linkAddress) {
		this.linkElement = linkAddress;

	private String replaceInvalidChar(String linkElement) {
		linkElement = linkElement.replaceAll("'", "");
		linkElement = linkElement.replaceAll("\"", "");
		return linkElement;

	public String toString() {

		return "Link Address : " + this.linkAddress + ". Link Element : "
				+ this.linkElement;



package com.javacodegeeks.java.core;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlLinkExtraction {

	private Matcher mTag, mLink;
	private Pattern pTag, pLink;

	private static final String HTML_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
	private static final String HTML_HREF_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

	public HtmlLinkExtraction() {
		pTag = Pattern.compile(HTML_TAG_PATTERN);
		pLink = Pattern.compile(HTML_HREF_TAG_PATTERN);

	public ArrayList<HTMLLinkElement> extractHTMLLinks(final String sourceHtml) {

		ArrayList<HTMLLinkElement> elements = new ArrayList<HTMLLinkElement>();

		mTag = pTag.matcher(sourceHtml);

		while (mTag.find()) {

			String href = mTag.group(1);     // get the values of href
			String linkElem = mTag.group(2); // get the text of link Html Element

			mLink = pLink.matcher(href);

			while (mLink.find()) {

				String link = mLink.group(1);
				HTMLLinkElement htmlLinkElement = new HTMLLinkElement();





		return elements;


2. Unit Testing our HtmlLinkExtraction class

For unit testing we are going to use JUnit. Unit testing is very important in these situations because they provide good feedback about the correctness of our regular expressions. You can test your program and reassure that your regular expression meets the rules on your HTML Link elements.

This is a basic test class:


package com.javacodegeeks.java.core;

import static org.junit.Assert.*;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

public class HtmlLinkExtractionTest {

	private String HTML_DOCUMENT;
	private static HtmlLinkExtraction htmlTagExtraction;
	private String expectedValidation;

	private static final String HTML = "http://www.javacodegeeks.com/";

	public HtmlLinkExtractionTest(String str, String expectedValidation) {
		this.HTML_DOCUMENT = str;
		this.expectedValidation = expectedValidation;

	public static void initialize() {
		htmlTagExtraction = new HtmlLinkExtraction();

	public static Collection<Object[]> data() {
		Object[][] data = new Object[][] {

				{ "Blah blah blah <a href='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML },                         
				{ "Blah blah blah <a HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML },
				{ "Blah blah blah <a target='_blank' HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML } };

		return Arrays.asList(data);

	public void test() {

		ArrayList<HTMLLinkElement> linkElements = htmlTagExtraction.extractHTMLLinks(this.HTML_DOCUMENT);
		for (int i = 0; i < linkElements.size(); i++) {
			HTMLLinkElement linkElem = linkElements.get(i);
			assertEquals("Result", this.expectedValidation, linkElem.getLinkAddress());	



Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

This was an example on how to extract HTML Links with Java Regular Expression.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

and many more ....


Want to take your Java Skills to the next level?
Grab our programming books for FREE!
  • Save time by leveraging our field-tested solutions to common problems.
  • The books cover a wide range of topics, from JPA and JUnit, to JMeter and Android.
  • Each book comes as a standalone guide (with source code provided), so that you use it as reference.
Last Step ...

Where should we send the free eBooks?

Good Work!
To download the books, please verify your email address by following the instructions found on the email we just sent you.