Nikos Maravitsas

About Nikos Maravitsas

Nikos has graduated from the Department of Informatics and Telecommunications of The National and Kapodistrian University of Athens. Currently, his main interests are system’s security, parallel systems, artificial intelligence, operating systems, system programming, telecommunications, web applications, human – machine interaction and mobile development.

Extract HTML Links with Java Regular Expression example

With this example we shall show you how to extract and process HTML links with Java Regular expression. You can follow the basic techniques in this article and learn how to process many other HTML elements and thus create a very basic HTML parser that you can easily embed in your application. So the things we want to do is:

  • Extract the a from the HTML document
  • Extract the value of the href attribute
  • Extract the text of the a HTML link element.

We are going to work with groups. In our regular expression we are going to have a group that describes the values between ' ' in the href=' ' attribute. Then we are going to see which part of the link element matches that group, and thus get the value of the href attribute.  We will apply tha same  strategy in order to get the thext of the link element.

So here are the two regular expressions we are going to use :

  • To get the anchor element:
    (?i)<a([^>]+)>(.+?)</a>
  • To get the href attribute:
    \\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))

You should take a look at the Pattern class documentation to learn how to construct your own regular expressions according to your policy.

1. HTML Link Extractor classes

HTMLLinkElement:

package com.javacodegeeks.java.core;

public class HTMLLinkElement {

	String linkElement;
	String linkAddress;

	public String getLinkAddress() {
		return linkAddress;
	}

	public void setLinkAddress(String linkElement) {
		this.linkAddress = replaceInvalidChar(linkElement);
	}

	public String getLinkElement() {
		return linkElement;
	}

	public void setLinkElement(String linkAddress) {
		this.linkElement = linkAddress;
	}

	private String replaceInvalidChar(String linkElement) {
		linkElement = linkElement.replaceAll("'", "");
		linkElement = linkElement.replaceAll("\"", "");
		return linkElement;
	}

	@Override
	public String toString() {

		return "Link Address : " + this.linkAddress + ". Link Element : "
				+ this.linkElement;

	}
}

HtmlLinkExtraction.java:

package com.javacodegeeks.java.core;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlLinkExtraction {

	private Matcher mTag, mLink;
	private Pattern pTag, pLink;

	private static final String HTML_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
	private static final String HTML_HREF_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

	public HtmlLinkExtraction() {
		pTag = Pattern.compile(HTML_TAG_PATTERN);
		pLink = Pattern.compile(HTML_HREF_TAG_PATTERN);
	}

	public ArrayList<HTMLLinkElement> extractHTMLLinks(final String sourceHtml) {

		ArrayList<HTMLLinkElement> elements = new ArrayList<HTMLLinkElement>();

		mTag = pTag.matcher(sourceHtml);

		while (mTag.find()) {

			String href = mTag.group(1);     // get the values of href
			String linkElem = mTag.group(2); // get the text of link Html Element

			mLink = pLink.matcher(href);

			while (mLink.find()) {

				String link = mLink.group(1);
				HTMLLinkElement htmlLinkElement = new HTMLLinkElement();
				htmlLinkElement.setLinkAddress(link);
				htmlLinkElement.setLinkElement(linkElem);

				elements.add(htmlLinkElement);

				System.out.println(htmlLinkElement);

			}

		}

		return elements;

	}
}

2. Unit Testing our HtmlLinkExtraction class

For unit testing we are going to use JUnit. Unit testing is very important in these situations because they provide good feedback about the correctness of our regular expressions. You can test your program and reassure that your regular expression meets the rules on your HTML Link elements.

This is a basic test class:

HtmlLinkExtractionTest.java:

package com.javacodegeeks.java.core;

import static org.junit.Assert.*;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class HtmlLinkExtractionTest {

	private String HTML_DOCUMENT;
	private static HtmlLinkExtraction htmlTagExtraction;
	private String expectedValidation;

	private static final String HTML = "http://www.javacodegeeks.com/";

	public HtmlLinkExtractionTest(String str, String expectedValidation) {
		this.HTML_DOCUMENT = str;
		this.expectedValidation = expectedValidation;
	}

	@BeforeClass
	public static void initialize() {
		htmlTagExtraction = new HtmlLinkExtraction();
	}

	@Parameters
	public static Collection<Object[]> data() {
		Object[][] data = new Object[][] {

				{ "Blah blah blah <a href='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML },                         
				{ "Blah blah blah <a HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML },
				{ "Blah blah blah <a target='_blank' HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML } };

		return Arrays.asList(data);
	}

	@Test
	public void test() {

		ArrayList<HTMLLinkElement> linkElements = htmlTagExtraction.extractHTMLLinks(this.HTML_DOCUMENT);
		for (int i = 0; i < linkElements.size(); i++) {
			HTMLLinkElement linkElem = linkElements.get(i);
			System.out.println();
			assertEquals("Result", this.expectedValidation, linkElem.getLinkAddress());	
		}

	}
}

Output:

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

 
This was an example on how to extract HTML Links with Java Regular Expression.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.
Examples Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Examples Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Examples Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close