In this example we show how to extract and process HTML links with Java regular expressions. You can apply the same basic techniques to many other HTML elements and build a very basic HTML parser that is easy to embed in your application. We want to do three things:
- Extract the a element from the HTML document
- Extract the value of the href attribute
- Extract the text of the a HTML link element
We are going to work with capturing groups. Our regular expression contains a group that describes the value between the quotes in the href=' ' attribute. We then check which part of the link element matches that group, and thus get the value of the href attribute. We apply the same strategy to get the text of the link element.
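To make the group idea concrete, here is a minimal sketch. It uses a deliberately simplified pattern that only handles single-quoted values; the class name GroupDemo and the method hrefValue are made up for this illustration and are not part of the classes presented later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code.
class GroupDemo {

    // Simplified pattern (an assumption for this sketch): group 1 captures
    // whatever sits between the single quotes of href='...'.
    private static final Pattern HREF = Pattern.compile("href='([^']*)'");

    /** Returns the text captured by group 1, or null when nothing matches. */
    static String hrefValue(String tag) {
        Matcher m = HREF.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints: http://www.javacodegeeks.com/
        System.out.println(hrefValue("<a href='http://www.javacodegeeks.com/'>JavaCodeGeeks</a>"));
    }
}
```

Group 0 is always the whole match; the numbered groups follow the order of their opening parentheses.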
Here are the two regular expressions we are going to use:
- To get the anchor element:
(?i)<a([^>]+)>(.+?)</a>
- To get the href attribute (shown as a Java string literal, hence the doubled backslashes and escaped quotes):
\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))
Take a look at the Pattern class documentation to learn how to construct your own regular expressions according to your requirements.
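The href expression has three alternatives inside group 1: a double-quoted value, a single-quoted value and an unquoted value. The sketch below runs the same pattern against one input of each kind. The class name HrefPatternDemo and the method rawHref are hypothetical helpers for this illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code.
class HrefPatternDemo {

    // The href pattern from this article, copied unchanged.
    private static final Pattern HREF = Pattern.compile(
            "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))");

    /** Returns group 1 of the first match (quotes included), or null. */
    static String rawHref(String attributes) {
        Matcher m = HREF.matcher(attributes);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Double-quoted value: group 1 keeps the double quotes.
        System.out.println(rawHref(" href=\"http://www.javacodegeeks.com/\""));
        // Single-quoted value with an upper-case attribute name: group 1 keeps the single quotes.
        System.out.println(rawHref(" HREF='http://www.javacodegeeks.com/'"));
        // Unquoted value following another attribute: group 1 is the bare URL.
        System.out.println(rawHref(" target=_blank href=http://www.javacodegeeks.com/"));
    }
}
```

Because group 1 still contains the surrounding quotes for quoted values, the caller has to strip them afterwards.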
1. HTML Link Extractor classes
HTMLLinkElement.java:
package com.javacodegeeks.java.core;

public class HTMLLinkElement {

    private String linkElement;
    private String linkAddress;

    public String getLinkAddress() {
        return linkAddress;
    }

    // Stores the href value after stripping any quote characters from it.
    public void setLinkAddress(String linkAddress) {
        this.linkAddress = replaceInvalidChar(linkAddress);
    }

    public String getLinkElement() {
        return linkElement;
    }

    public void setLinkElement(String linkElement) {
        this.linkElement = linkElement;
    }

    // Removes every single and double quote from the given value.
    private String replaceInvalidChar(String value) {
        value = value.replaceAll("'", "");
        value = value.replaceAll("\"", "");
        return value;
    }

    @Override
    public String toString() {
        return "Link Address : " + this.linkAddress + ". Link Element : " + this.linkElement;
    }
}
HtmlLinkExtraction.java:
package com.javacodegeeks.java.core;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlLinkExtraction {

    // Group 1: the attributes inside the opening <a ...> tag. Group 2: the link text.
    private static final String HTML_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";

    // Group 1: the href value, still wrapped in its quotes when it was quoted.
    private static final String HTML_HREF_TAG_PATTERN = "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";

    private final Pattern pTag;
    private final Pattern pLink;

    public HtmlLinkExtraction() {
        pTag = Pattern.compile(HTML_TAG_PATTERN);
        pLink = Pattern.compile(HTML_HREF_TAG_PATTERN);
    }

    public ArrayList<HTMLLinkElement> extractHTMLLinks(final String sourceHtml) {
        ArrayList<HTMLLinkElement> elements = new ArrayList<HTMLLinkElement>();
        Matcher mTag = pTag.matcher(sourceHtml);
        while (mTag.find()) {
            String attributes = mTag.group(1); // everything between "<a" and ">"
            String linkElem = mTag.group(2);   // the text of the link element
            Matcher mLink = pLink.matcher(attributes);
            while (mLink.find()) {
                String link = mLink.group(1);  // the href value, quotes included
                HTMLLinkElement htmlLinkElement = new HTMLLinkElement();
                htmlLinkElement.setLinkAddress(link); // the setter strips the quotes
                htmlLinkElement.setLinkElement(linkElem);
                elements.add(htmlLinkElement);
                System.out.println(htmlLinkElement);
            }
        }
        return elements;
    }
}
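One limitation worth knowing about: by default the dot in (.+?) does not match line terminators, so a link whose text spans more than one line is not found by the anchor pattern. The sketch below compares the article's pattern with a variant that adds the s (DOTALL) flag. The variant, the class name MultiLineAnchorDemo and the method countMatches are assumptions made for this illustration, not part of the original classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code.
class MultiLineAnchorDemo {

    // The anchor pattern from this article, copied unchanged.
    static final String SINGLE_LINE = "(?i)<a([^>]+)>(.+?)</a>";

    // Assumed variant: the extra "s" flag (DOTALL) lets "." match line terminators too.
    static final String DOTALL = "(?is)<a([^>]+)>(.+?)</a>";

    /** Counts how many times the given regex matches in the given HTML. */
    static int countMatches(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<a href='http://www.javacodegeeks.com/'>Java\nCodeGeeks</a>";
        System.out.println(countMatches(SINGLE_LINE, html)); // prints 0
        System.out.println(countMatches(DOTALL, html));      // prints 1
    }
}
```

Whether you need the DOTALL variant depends on the HTML you expect to process; for single-line links both patterns behave the same.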
2. Unit Testing our HtmlLinkExtraction class
For unit testing we are going to use JUnit. Unit tests are very important in these situations because they give good feedback about the correctness of our regular expressions. You can test your program and make sure that your regular expressions handle the forms of HTML link elements you expect.
This is a basic test class:
HtmlLinkExtractionTest.java:
package com.javacodegeeks.java.core;

import static org.junit.Assert.*;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class HtmlLinkExtractionTest {

    private static final String EXPECTED_LINK = "http://www.javacodegeeks.com/";

    private static HtmlLinkExtraction htmlTagExtraction;

    private final String htmlDocument;
    private final String expectedValidation;

    public HtmlLinkExtractionTest(String str, String expectedValidation) {
        this.htmlDocument = str;
        this.expectedValidation = expectedValidation;
    }

    @BeforeClass
    public static void initialize() {
        htmlTagExtraction = new HtmlLinkExtraction();
    }

    // Each row is one test case: the HTML input and the link address we expect.
    @Parameters
    public static Collection<Object[]> data() {
        Object[][] data = new Object[][] {
            { "Blah blah blah <a href='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", EXPECTED_LINK },
            { "Blah blah blah <a HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", EXPECTED_LINK },
            { "Blah blah blah <a target='_blank' HREF='http://www.javacodegeeks.com/'>JavaCodeGeeks</a> blah blah blah blah", EXPECTED_LINK }
        };
        return Arrays.asList(data);
    }

    @Test
    public void test() {
        ArrayList<HTMLLinkElement> linkElements = htmlTagExtraction.extractHTMLLinks(this.htmlDocument);
        // Without this check the test would pass even if no link was found at all.
        assertFalse("No link found", linkElements.isEmpty());
        for (HTMLLinkElement linkElem : linkElements) {
            assertEquals("Result", this.expectedValidation, linkElem.getLinkAddress());
        }
    }
}
Output:
Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks
Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks
Link Address : http://www.javacodegeeks.com/. Link Element : JavaCodeGeeks
This was an example of how to extract HTML links with Java regular expressions.