
Convert Unicode Encoding String to Letters

Dealing with Unicode text is a common task in Java programming, especially in multilingual applications where text arrives in various scripts and languages. Sometimes that text arrives not as readable characters but as Unicode escape sequences such as \u0048, and we need to convert it into a human-readable string of letters. This article explores how to achieve this conversion in Java.

1. Understanding Unicode Encoding

Unicode is a standard for encoding characters used in text processing across different platforms and languages. Each character in the Unicode standard is assigned a unique code point, usually written in hexadecimal, such as U+0048 for the letter H. Java strings are sequences of UTF-16 code units, so most characters occupy a single char, while code points above U+FFFF are stored as a pair of surrogate chars.
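
As a rough illustration of code points and their hexadecimal notation, the sketch below prints each code point of a short string and compares the number of UTF-16 chars with the number of code points. The class name CodePointDemo and the helper describe() are hypothetical names chosen for this example only.

```java
class CodePointDemo {

    // Formats a code point in the conventional U+XXXX notation.
    static String describe(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // 'A', 'e' with acute accent, and an emoji stored as a surrogate pair (U+1F600).
        String text = "A\u00E9\uD83D\uDE00";

        // length() counts UTF-16 code units, not characters.
        System.out.println(text.length());                         // 4
        System.out.println(text.codePointCount(0, text.length())); // 3

        // codePoints() walks the string one full code point at a time.
        text.codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```

The loop prints U+0041, U+00E9, and U+1F600; the last one needs two chars in the string, which is why the two counts differ.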

2. Converting Unicode Encoded Strings to Letters

To convert a Unicode-encoded string to a string of letters in Java, we can use classes from the Java standard library. There are two main approaches: decoding escape sequences with regular expressions, and filtering characters with the Character class.

2.1 Using Regular Expressions

One approach is to use a regular expression to find the Unicode escape sequences in the string and replace each with the character it represents. Java’s Pattern and Matcher classes let us define a pattern that matches a backslash, the letter u, and four hexadecimal digits, and then substitute the decoded character for each match. Here’s an example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringToLetters {

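    public static String unicodeToString(String unicodeString) {
        // Match a literal backslash, the letter u, and exactly four hex digits (captured as group 1).
        Pattern pattern = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
        Matcher matcher = pattern.matcher(unicodeString);
        StringBuilder builder = new StringBuilder();
        while (matcher.find()) {
            // Parse the hex digits into the UTF-16 code unit they denote.
            char unicode = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement stops a decoded backslash or dollar sign from being read as a group reference.
            // Note: appendReplacement(StringBuilder, String) requires Java 9 or later.
            matcher.appendReplacement(builder, Matcher.quoteReplacement(String.valueOf(unicode)));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(builder);
        return builder.toString();
    }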

    public static void main(String[] args) {

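        // Doubled backslashes keep the escapes as literal text; with single backslashes the
        // compiler would decode them before the program runs and the regex would find nothing.
        String unicodeString = "\\u0048\\u0065\\u006C\\u006C\\u006F \\u0057\\u006F\\u0072\\u006C\\u0064";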
        String letters = unicodeToString(unicodeString);
        System.out.println(letters); // Output: "Hello World"
    }

}
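
One detail worth illustrating is the difference between an escape written with a single backslash in Java source, which the compiler translates before the program runs, and an escape that reaches the program as literal text (for example from a file or a network response), which has to be decoded at run time. The sketch below is a minimal illustration under that assumption; the class name EscapeDemo and the method decode() are hypothetical, and it uses Matcher.replaceAll with a function argument, which requires Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {

    // A literal backslash, the letter u, then four hex digits captured as group 1.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each literal escape sequence with the character it denotes.
    static String decode(String input) {
        return ESCAPE.matcher(input).replaceAll(match -> {
            char decoded = (char) Integer.parseInt(match.group(1), 16);
            // Quote the result so a decoded backslash or dollar sign is inserted as-is.
            return Matcher.quoteReplacement(String.valueOf(decoded));
        });
    }

    public static void main(String[] args) {
        String compiled = "\u0048\u0069";   // already decoded by the compiler
        String literal = "\\u0048\\u0069";  // escapes survive as plain text

        System.out.println(compiled.length()); // 2
        System.out.println(literal.length());  // 12
        System.out.println(decode(literal));   // Hi
    }
}
```

Calling decode() on the already-decoded string returns it unchanged, because it contains no backslash for the pattern to match.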

2.2 Using Java’s Character Class

Java’s Character class offers methods to work with individual characters. Another way to obtain a string of letters is to iterate through each character and keep only those for which Character.isLetter() returns true, replacing whitespace with a single space. Note that this approach filters a string whose characters are already decoded; it does not translate literal \uXXXX escape sequences by itself.

public class StringToLetters {

    public static String unicodeToString(String unicodeString) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < unicodeString.length(); i++) {
            char c = unicodeString.charAt(i);
            if (Character.isLetter(c)) {
                builder.append(c);
            } else if (Character.isWhitespace(c)) {
                builder.append(' ');
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        
        // The compiler decodes these escapes, so the string already holds "Hello World".
        String unicodeString = "\u0048\u0065\u006C\u006C\u006F \u0057\u006F\u0072\u006C\u0064";
        String letters = unicodeToString(unicodeString);
        System.out.println(letters); // Output: "Hello World"
    }

}
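
A char-by-char loop sees supplementary characters (code points above U+FFFF) as two separate surrogate chars, neither of which counts as a letter on its own. The following sketch shows one possible variation that iterates over code points instead; the class name LetterFilter and the method lettersOnly() are hypothetical names used only for this illustration.

```java
class LetterFilter {

    // Keeps letters, turns whitespace into a single space, and drops everything else.
    static String lettersOnly(String input) {
        StringBuilder builder = new StringBuilder();
        // codePoints() yields full code points, so surrogate pairs stay together.
        input.codePoints().forEach(cp -> {
            if (Character.isLetter(cp)) {
                builder.appendCodePoint(cp);
            } else if (Character.isWhitespace(cp)) {
                builder.append(' ');
            }
        });
        return builder.toString();
    }

    public static void main(String[] args) {
        // The surrogate pair encodes U+1D400, a mathematical bold capital A.
        String input = "Hi, \uD835\uDC00 42!";
        System.out.println(lettersOnly(input));
    }
}
```

With this input the digits and punctuation are dropped while the bold capital A is kept; a loop based on charAt() would discard it, because Character.isLetter(char) returns false for each surrogate half.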

3. Using Apache Commons Text

The Apache Commons Text library provides a utility class, StringEscapeUtils, whose unescapeJava() method converts escaped Unicode sequences to their corresponding characters. The resulting string can then be processed further, for example to extract letters. Here’s an example:

pom.xml

    <dependencies>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.10.0</version> 
        </dependency>
    </dependencies>

Java code:

import org.apache.commons.text.StringEscapeUtils;

public class UnicodeToStringConverter {

    public static String unicodeToString(String unicodeString) {
        return StringEscapeUtils.unescapeJava(unicodeString);
    }

    public static void main(String[] args) {
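        // Doubled backslashes keep the escapes as literal text for unescapeJava() to decode;
        // single backslashes would be translated by the compiler before the program runs.
        String unicodeString = "\\u0048\\u0065\\u006C\\u006C\\u006F \\u0057\\u006F\\u0072\\u006C\\u0064";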
        String letters = unicodeToString(unicodeString);
        System.out.println(letters); // Output: "Hello World"
    }

}

The output is:

Fig 1: Output from converting a Unicode-encoded string to letters in Java

4. Conclusion

Converting Unicode-encoded strings to strings of letters in Java can be accomplished with several techniques: decoding escape sequences with regular expressions, filtering characters with the Character class, or calling StringEscapeUtils.unescapeJava() from Apache Commons Text. By understanding these methods, we can effectively manipulate and process Unicode-encoded text in our Java applications.

5. Download the Source Code

This was an article on using Java to convert Unicode-encoded strings to letters.

You can download the full source code of this example here: Java convert string unicode encoding

Omozegie Aziegbe

Omos holds a Master’s degree in Information Engineering with Network Management from the Robert Gordon University, Aberdeen. Omos is a freelance web/application developer currently focused on developing Java enterprise applications with the Jakarta EE framework.