Nikos Maravitsas

About Nikos Maravitsas

Nikos has graduated from the Department of Informatics and Telecommunications of The National and Kapodistrian University of Athens. Currently, his main interests are system’s security, parallel systems, artificial intelligence, operating systems, system programming, telecommunications, web applications, human – machine interaction and mobile development.

Extract HTML Links with Java Regular Expression example

With this example we shall show you how to extract and process HTML links with Java Regular expression. You can follow the basic techniques in this article and learn how to process many other HTML elements and thus create a very basic HTML parser that you can easily embed in your application. So the things we want to do is:

  • Extract the a from the HTML document
  • Extract the value of the href attribute
  • Extract the text of the a HTML link element.

We are going to work with groups. In our regular expression we are going to have a group that describes the values between ' ' in the href=' ' attribute. Then we are going to see which part of the link element matches that group, and thus get the value of the href attribute.  We will apply tha same  strategy in order to get the thext of the link element.

So here are the two regular expressions we are going to use :

  • To get the anchor element:
  • To get the href attribute:

You should take a look at the Pattern class documentation to learn how to construct your own regular expressions according to your policy.

1. HTML Link Extractor classes



public class HTMLLinkElement {

	String linkElement;
	String linkAddress;

	public String getLinkAddress() {
		return linkAddress;

	public void setLinkAddress(String linkElement) {
		this.linkAddress = replaceInvalidChar(linkElement);

	public String getLinkElement() {
		return linkElement;

	public void setLinkElement(String linkAddress) {
		this.linkElement = linkAddress;

	private String replaceInvalidChar(String linkElement) {
		linkElement = linkElement.replaceAll("'", "");
		linkElement = linkElement.replaceAll("\"", "");
		return linkElement;

	public String toString() {

		return "Link Address : " + this.linkAddress + ". Link Element : "
				+ this.linkElement;



import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlLinkExtraction {

	private Matcher mTag, mLink;
	private Pattern pTag, pLink;

	private static final String HTML_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
	private static final String HTML_HREF_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

	public HtmlLinkExtraction() {
		pTag = Pattern.compile(HTML_TAG_PATTERN);
		pLink = Pattern.compile(HTML_HREF_TAG_PATTERN);

	public ArrayList<HTMLLinkElement> extractHTMLLinks(final String sourceHtml) {

		ArrayList<HTMLLinkElement> elements = new ArrayList<HTMLLinkElement>();

		mTag = pTag.matcher(sourceHtml);

		while (mTag.find()) {

			String href =;     // get the values of href
			String linkElem =; // get the text of link Html Element

			mLink = pLink.matcher(href);

			while (mLink.find()) {

				String link =;
				HTMLLinkElement htmlLinkElement = new HTMLLinkElement();





		return elements;


2. Unit Testing our HtmlLinkExtraction class

For unit testing we are going to use JUnit. Unit testing is very important in these situations because they provide good feedback about the correctness of our regular expressions. You can test your program and reassure that your regular expression meets the rules on your HTML Link elements.

This is a basic test class:


import static org.junit.Assert.*;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

public class HtmlLinkExtractionTest {

	private String HTML_DOCUMENT;
	private static HtmlLinkExtraction htmlTagExtraction;
	private String expectedValidation;

	private static final String HTML = "";

	public HtmlLinkExtractionTest(String str, String expectedValidation) {
		this.HTML_DOCUMENT = str;
		this.expectedValidation = expectedValidation;

	public static void initialize() {
		htmlTagExtraction = new HtmlLinkExtraction();

	public static Collection<Object[]> data() {
		Object[][] data = new Object[][] {

				{ "Blah blah blah <a href=''>JavaCodeGeeks</a> blah blah blah blah", HTML },                         
				{ "Blah blah blah <a HREF=''>JavaCodeGeeks</a> blah blah blah blah", HTML },
				{ "Blah blah blah <a target='_blank' HREF=''>JavaCodeGeeks</a> blah blah blah blah", HTML } };

		return Arrays.asList(data);

	public void test() {

		ArrayList<HTMLLinkElement> linkElements = htmlTagExtraction.extractHTMLLinks(this.HTML_DOCUMENT);
		for (int i = 0; i < linkElements.size(); i++) {
			HTMLLinkElement linkElem = linkElements.get(i);
			assertEquals("Result", this.expectedValidation, linkElem.getLinkAddress());	



Link Address : Link Element : JavaCodeGeeks

Link Address : Link Element : JavaCodeGeeks

Link Address : Link Element : JavaCodeGeeks

This was an example on how to extract HTML Links with Java Regular Expression.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

and many more ....

Examples Java Code Geeks and all content copyright © 2010-2015, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Examples Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Examples Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

Get ready to Rock!
To download the books, please verify your email address by following the instructions found on the email we just sent you.