Nikos Maravitsas

Extract HTML Links with Java Regular Expression example

This example shows how to extract and process HTML links with Java regular expressions. You can apply the same basic techniques to many other HTML elements and build a very basic HTML parser that is easy to embed in your application. We want to do three things:

  • Extract the a elements from the HTML document
  • Extract the value of the href attribute
  • Extract the text of the a HTML link element.

We are going to work with groups. Our regular expression has a group that describes the value between the quotes in the href=' ' attribute. We then check which part of the link element matches that group, and thus get the value of the href attribute. We apply the same strategy to get the text of the link element.
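To make the idea of a group concrete, here is a minimal sketch. It uses a deliberately simplified pattern that only handles single-quoted values (not the full pattern used later in this article), and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {

	// Group 1 is the parenthesised part: everything between the single quotes.
	// Simplified on purpose; it does not handle double-quoted or unquoted values.
	private static final Pattern HREF = Pattern.compile("href='([^']*)'");

	/** Returns the text captured by group 1, or null when nothing matches. */
	static String firstHref(String html) {
		Matcher m = HREF.matcher(html);
		return m.find() ? m.group(1) : null;
	}

	public static void main(String[] args) {
		String html = "<a href='http://www.javacodegeeks.com/'>JavaCodeGeeks</a>";
		System.out.println(firstHref(html)); // http://www.javacodegeeks.com/
	}
}
```

Matcher.group(0) would return the whole match including href= and the quotes, while group(1) returns only the part inside the parentheses.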

Here are the two regular expressions we are going to use:

  • To get the anchor element:
    (?i)<a([^>]+)>(.+?)</a>
  • To get the href attribute (written as a Java string literal, so backslashes and double quotes are escaped):
    \\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))
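The href pattern has three alternatives: a double-quoted value, a single-quoted value, and an unquoted value. The sketch below (class and method names are illustrative only) runs the pattern against one attribute string of each kind and prints group 1. Note that group 1 still includes the surrounding quotes, which is why they have to be stripped afterwards.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefValueDemo {

	// The href pattern from the article, as a Java string literal.
	private static final Pattern HREF = Pattern.compile(
			"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))");

	/** Returns group 1 (the raw value, quotes included), or null if there is no href. */
	static String rawHref(String attributes) {
		Matcher m = HREF.matcher(attributes);
		return m.find() ? m.group(1) : null;
	}

	public static void main(String[] args) {
		// Double-quoted: prints "http://www.javacodegeeks.com/" (with the quotes)
		System.out.println(rawHref(" href=\"http://www.javacodegeeks.com/\""));
		// Single-quoted, upper-case attribute name: prints 'http://www.javacodegeeks.com/'
		System.out.println(rawHref(" HREF='http://www.javacodegeeks.com/'"));
		// Unquoted: prints index.html (the value stops at the first whitespace)
		System.out.println(rawHref(" href=index.html target=_blank"));
	}
}
```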

Take a look at the Pattern class documentation to learn how to construct regular expressions that fit your own requirements.

1. HTML Link Extractor classes

HTMLLinkElement.java:

package com.javacodegeeks.java.core;

public class HTMLLinkElement {

	private String linkElement; // text between <a ...> and </a>
	private String linkAddress; // value of the href attribute, without quotes

	public String getLinkAddress() {
		return linkAddress;
	}

	public void setLinkAddress(String linkAddress) {
		this.linkAddress = stripQuotes(linkAddress);
	}

	public String getLinkElement() {
		return linkElement;
	}

	public void setLinkElement(String linkElement) {
		this.linkElement = linkElement;
	}

	// The href pattern captures the value together with its quotes,
	// so both kinds of quote are removed here.
	private String stripQuotes(String value) {
		return value.replace("'", "").replace("\"", "");
	}

	@Override
	public String toString() {
		return "Link Address : " + this.linkAddress + ". Link Element : "
				+ this.linkElement;
	}
}

HtmlLinkExtraction.java:

package com.javacodegeeks.java.core;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlLinkExtraction {

	private static final String HTML_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
	private static final String HTML_HREF_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

	// Compiled patterns are immutable and can be reused across calls.
	private final Pattern pTag = Pattern.compile(HTML_TAG_PATTERN);
	private final Pattern pLink = Pattern.compile(HTML_HREF_TAG_PATTERN);

	public ArrayList<HTMLLinkElement> extractHTMLLinks(final String sourceHtml) {

		ArrayList<HTMLLinkElement> elements = new ArrayList<HTMLLinkElement>();

		// Matchers hold per-search state, so they are kept local to the call.
		Matcher mTag = pTag.matcher(sourceHtml);

		while (mTag.find()) {

			String attributes = mTag.group(1); // everything between "<a" and ">"
			String linkElem = mTag.group(2);   // the text of the link element

			Matcher mLink = pLink.matcher(attributes);

			while (mLink.find()) {

				String link = mLink.group(1); // href value, possibly still quoted
				HTMLLinkElement htmlLinkElement = new HTMLLinkElement();
				htmlLinkElement.setLinkAddress(link);
				htmlLinkElement.setLinkElement(linkElem);

				elements.add(htmlLinkElement);

				System.out.println(htmlLinkElement);

			}

		}

		return elements;

	}
}
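One limitation of the anchor pattern is worth knowing: in Java, . does not match line terminators by default, so a link whose text spans more than one line is not found. The following sketch (class and method names are illustrative) compares the pattern as used above with the same pattern compiled with Pattern.DOTALL. Whether you want DOTALL depends on the HTML you expect to process.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiLineAnchorDemo {

	// Same anchor pattern as HTML_TAG_PATTERN in the article.
	static final String ANCHOR = "(?i)<a([^>]+)>(.+?)</a>";

	/** Counts the anchor elements that the given pattern finds in the HTML. */
	static int countAnchors(Pattern pattern, String html) {
		Matcher m = pattern.matcher(html);
		int count = 0;
		while (m.find()) {
			count++;
		}
		return count;
	}

	public static void main(String[] args) {
		// The link text contains a line break.
		String html = "<a href='http://www.javacodegeeks.com/'>Java\nCode Geeks</a>";

		Pattern defaultFlags = Pattern.compile(ANCHOR);
		Pattern dotAll = Pattern.compile(ANCHOR, Pattern.DOTALL);

		System.out.println(countAnchors(defaultFlags, html)); // 0
		System.out.println(countAnchors(dotAll, html));       // 1
	}
}
```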

2. Unit Testing our HtmlLinkExtraction class

For unit testing we are going to use JUnit. Unit tests are very important in these situations because they provide good feedback about the correctness of our regular expressions. You can test your program and make sure that your regular expressions meet the rules for your HTML link elements.

This is a basic test class:

HtmlLinkExtractionTest.java:

package com.javacodegeeks.java.core;

import static org.junit.Assert.*;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class HtmlLinkExtractionTest {

	private String HTML_DOCUMENT;
	private static HtmlLinkExtraction htmlTagExtraction;
	private String expectedValidation;

	private static final String HTML = "http://www.javacodegeeks.com/";

	public HtmlLinkExtractionTest(String str, String expectedValidation) {
		this.HTML_DOCUMENT = str;
		this.expectedValidation = expectedValidation;
	}

	@BeforeClass
	public static void initialize() {
		htmlTagExtraction = new HtmlLinkExtraction();
	}

	@Parameters
	public static Collection<Object[]> data() {
		Object[][] data = new Object[][] {

				{ "Blah blah blah <a href='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML },                         
				{ "Blah blah blah <a HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML },
				{ "Blah blah blah <a target='_blank' HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", HTML } };

		return Arrays.asList(data);
	}

	@Test
	public void test() {

		ArrayList<HTMLLinkElement> linkElements = htmlTagExtraction.extractHTMLLinks(this.HTML_DOCUMENT);

		// Each sample document contains exactly one link. Without this check
		// the loop below would pass silently when nothing is extracted.
		assertEquals("Number of links", 1, linkElements.size());

		for (HTMLLinkElement linkElem : linkElements) {
			System.out.println();
			assertEquals("Result", this.expectedValidation, linkElem.getLinkAddress());
		}

	}
}

Output:

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks

Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks
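
The three cases above are all positive ones. It can also be worth checking inputs that should produce no links, or more than one. The sketch below is a plain-Java check rather than a JUnit test, and it copies the two patterns into a hypothetical helper class so that it stands alone. The last case shows a behaviour you may not expect: because the href pattern has no word boundary, an attribute such as data-href is also picked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefEdgeCases {

	// Same two patterns as HtmlLinkExtraction, copied so the sketch stands alone.
	private static final Pattern TAG = Pattern.compile("(?i)<a([^>]+)>(.+?)</a>");
	private static final Pattern HREF = Pattern.compile(
			"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))");

	/** Returns every href value found inside anchor elements, with quotes stripped. */
	static List<String> hrefs(String html) {
		List<String> result = new ArrayList<>();
		Matcher tag = TAG.matcher(html);
		while (tag.find()) {
			Matcher href = HREF.matcher(tag.group(1));
			while (href.find()) {
				result.add(href.group(1).replace("'", "").replace("\"", ""));
			}
		}
		return result;
	}

	public static void main(String[] args) {
		System.out.println(hrefs("plain text, no markup"));                       // []
		System.out.println(hrefs("<a name='top'>Top</a>"));                       // []
		System.out.println(hrefs("<a href='/a'>A</a> and <a href=\"/b\">B</a>")); // [/a, /b]
		System.out.println(hrefs("<a data-href='/x' href='/y'>Y</a>"));           // [/x, /y]
	}
}
```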

 
This was an example of how to extract HTML links with Java regular expressions.
