
About Yatin

The author graduated in Electronics & Telecommunication. During his studies, he was involved with a significant number of projects ranging from programming and software engineering to telecommunications analysis. He works as a technical lead in the information technology sector, where he is primarily involved with projects based on the Java/J2EE technology platform and novel UI technologies.

Java 9 Compact Strings Example

Hello, in this tutorial we will learn about the evolution of Strings in Java to Compact Strings. Strings are used extensively in almost every Java application; I can’t remember a single application where I have not used them. So any optimization of the String class affects almost every application.

1. Introduction

Since Java 9 comes with 2 major changes, it is important to know what it brings in terms of String optimizations. Java 9 includes JEP 254 (Compact Strings), a feature meant to lower memory usage and improve performance.

2. Java 9 Compact Strings Example

2.1 History

Java was originally developed to support UCS-2, also referred to as Unicode at the time, i.e. using 16 bits per character and allowing for 65,536 characters. It was only in 2004, with Java 5, that UTF-16 support was introduced by adding methods to extract 32-bit code points from chars.
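To make the difference between chars and code points concrete, here is a small sketch. The class name CodePointDemo and the choice of sample character (U+1D11E, the musical G clef) are only for illustration; the String methods used are part of the standard API since Java 5.

```java
// Sketch: one character outside the 16-bit range occupies two chars in a String.
final class CodePointDemo {
    // U+1D11E is stored as a surrogate pair in UTF-16.
    static final String CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        // Number of 16-bit chars in the String.
        System.out.println(CLEF.length());
        // Number of Unicode code points in the String.
        System.out.println(CLEF.codePointCount(0, CLEF.length()));
        // The full 32-bit code point starting at index 0, printed in hex.
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));
    }
}
```

The String reports a length of 2 even though it holds a single code point, which is why length() and charAt() count 16-bit units rather than characters.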

2.2 Compact Strings

Each String in Java is internally represented by two objects: the String object itself and the char array that holds the data contained in the String. The char type occupies 16 bits, or two bytes. If the data is text in the English language, for instance, the leading 8 bits of each char will often be all zeroes, as the character can be represented using just one byte.

Strings occupy a major portion of the JVM heap in most applications. Since strings are immutable, and string literals additionally live in the string pool, they can hold on to a lot of memory until garbage collection occurs. It thus makes sense to make strings more compact by discarding data that adds no value.

A JDK Enhancement Proposal (JEP 254) was created to address the issue explained above. Note that this is just a change at the internal implementation level and no changes are proposed for existing public interfaces. A study of heap dumps of various Java applications revealed that most of the Strings in those applications consisted of LATIN-1 characters, which can be represented using just 8 bits. Other characters needed all 16 bits, but they occurred far less frequently than LATIN-1 characters.

To understand the proposed changes better, let us consider a String in Java containing the letters Hello. The diagram below shows how the data are saved internally.

Fig.1: Java 9 Compact Strings


Under each byte, we have written the hexadecimal representation according to UTF-16. This is how a String object was internally represented, using a char array, up to Java 8. Note that the bytes in light gray are not really needed to represent the characters. For English letters, the data that matters in each 16-bit unit is the trailing 8 bits. Thus, by omitting the leading bytes, it is possible to save space.
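The byte layout described above can be reproduced with the public getBytes API. This is only an illustration of the two encodings, not a view into the internal field of String; the class name HelloBytes and the helper hex are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

// Sketch: compare the UTF-16 and LATIN-1 encodings of "Hello".
final class HelloBytes {
    // Formats each byte as two hex digits, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Hello";
        // Two bytes per character; every leading byte is 00.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));
        // One byte per character; the zero bytes are gone.
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The first line prints ten bytes and the second prints five, so the same text takes half the space once the leading zero bytes are dropped.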

2.3 String Class Enhancements for Compact Strings

In the enhanced String class of Java 9, the string is compressed during construction: there is an optimistic attempt to store it using 1 byte per character (the ISO-8859-1 encoding, also known as LATIN-1, of which ASCII is a subset). If any character in the given string cannot be represented using 8 bits, all characters are copied using two bytes each (UTF-16 representation).
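The optimistic compression described above could look roughly like the following. This is a simplified sketch and not the JDK source: the class CompressSketch and its methods are hypothetical names, and the big-endian byte order in the UTF-16 fallback is an assumption made for readability.

```java
// Sketch of "try one byte per char, otherwise fall back to two bytes per char".
final class CompressSketch {
    // Returns one byte per char if every char fits in 8 bits, otherwise null.
    static byte[] tryCompress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] > 0xFF) {
                return null; // at least one char needs 16 bits, give up
            }
            out[i] = (byte) chars[i];
        }
        return out;
    }

    // Fallback: two bytes per char (byte order is an assumption of this sketch).
    static byte[] toUtf16(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >> 8);
            out[2 * i + 1] = (byte) chars[i];
        }
        return out;
    }

    // Picks the compact form when possible.
    static byte[] encode(String s) {
        char[] chars = s.toCharArray();
        byte[] compact = tryCompress(chars);
        return compact != null ? compact : toUtf16(chars);
    }

    public static void main(String[] args) {
        System.out.println(encode("Hello").length);        // all chars fit in 8 bits
        System.out.println(encode("Hello \u20ac").length); // the euro sign does not
    }
}
```

Note that a single wide character is enough to push the whole string into the two-byte representation; there is no mixed mode.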

Certain changes are made to the internal implementation of the String class in order to distinguish between UTF-16 and LATIN-1 Strings. A final field named coder has been introduced, which required one crucial internal change: how should the length of the string be calculated for each encoding? This is very important because one of the most widely used methods in the String class is charAt(int index), which goes to the given position and returns the character there. Unless the length is determined properly, bounds checks in methods like this become error prone.

In Java 9, the length of the String is calculated internally as follows:

String.java

public int length() {
  return value.length >> coder;
}

If the String contains only LATIN-1 characters, the coder is zero, so the length of the String is the length of the byte array. If the String contains characters that need UTF-16, the coder is one, and the right shift halves the array length: the actual string length is half the size of the byte array that holds the UTF-16 encoded data.
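As a quick worked example of the shift, consider the five-letter string from earlier stored both ways. The class LengthSketch is a hypothetical stand-in that only mirrors the arithmetic of the method shown above.

```java
// Sketch: the same length arithmetic, applied to plain byte arrays.
final class LengthSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Shifting right by 0 keeps the byte count; shifting right by 1 halves it.
    static int length(byte[] value, byte coder) {
        return value.length >> coder;
    }

    public static void main(String[] args) {
        System.out.println(length(new byte[5], LATIN1)); // 5 bytes, 1 byte per char
        System.out.println(length(new byte[10], UTF16)); // 10 bytes, 2 bytes per char
    }
}
```

Both calls report a length of 5, so callers of length() cannot tell which representation is in use.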

2.3.1 Java 9 String Implementation

In Java 8 and earlier (except when UseCompressedStrings is enabled), a String is basically:

private final char value[];

Each method will access that char array. But, in Java 9 we now have:

private final byte[] value;
private final byte coder;

where coder can be:

static final byte LATIN1 = 0;
static final byte UTF16 = 1;

Most of the methods now will check the coder and dispatch to the specific implementation:

String.java

public int indexOf(int ch, int fromIndex) {
  return isLatin1() ? StringLatin1.indexOf(value, ch, fromIndex) : StringUTF16.indexOf(value, ch, fromIndex);
}
    
private boolean isLatin1() {
  return COMPACT_STRINGS && coder == LATIN1;
}
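To see the dispatch pattern in isolation, here is a minimal sketch with a single method, charAt. The class DispatchSketch and its helpers are hypothetical and much simpler than the JDK classes StringLatin1 and StringUTF16; the big-endian byte order of the UTF-16 branch is an assumption of this sketch.

```java
// Sketch: check the coder once, then delegate to an encoding-specific helper.
final class DispatchSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;
    private final byte coder;

    DispatchSketch(byte[] value, byte coder) {
        this.value = value;
        this.coder = coder;
    }

    char charAt(int index) {
        return coder == LATIN1 ? latin1CharAt(value, index) : utf16CharAt(value, index);
    }

    // One byte per char: widen the byte to an unsigned char.
    static char latin1CharAt(byte[] value, int index) {
        return (char) (value[index] & 0xFF);
    }

    // Two bytes per char: combine the high and low byte.
    static char utf16CharAt(byte[] value, int index) {
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        DispatchSketch latin = new DispatchSketch(new byte[] {0x48, 0x69}, LATIN1);
        DispatchSketch wide = new DispatchSketch(new byte[] {0x20, (byte) 0xAC}, UTF16);
        System.out.println(latin.charAt(1));      // second char of "Hi"
        System.out.println((int) wide.charAt(0)); // numeric value of U+20AC
    }
}
```

Every accessor pays for the coder check, which is the cost the next paragraph refers to.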

To mitigate the cost of checking the coder and unpacking bytes to chars, some methods have been intrinsified and the assembly generated by the JIT compiler has been improved. This led to some counter-intuitive results, such as indexOf(char) on a LATIN-1 String being more expensive than indexOf(String). The reason is that for LATIN-1, indexOf(String) calls an intrinsic method and indexOf(char) does not. For UTF-16, both are intrinsic.

Because this only affects LATIN-1 Strings, it is probably not wise to optimize your code around it. It is also a known issue that is targeted to be fixed in Java 10.

2.4 Kill-Switch for Compact String Feature

The Compact Strings feature is enabled by default in Java 9. If we are sure that, at runtime, our application will mostly generate Strings that can only be represented in UTF-16, we may want to disable the feature. Doing so avoids the overhead of attempting the optimistic conversion to the 1-byte (LATIN-1) representation, and of that attempt failing, during String construction.

To disable the feature, we can use the following switch:

-XX:-CompactStrings
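If you want to confirm from inside a running application whether the switch was passed, one option is to inspect the JVM input arguments. The sketch below assumes the flag is spelled -XX:-CompactStrings and only looks at the arguments reported by the runtime MX bean; the class name CompactStringsFlag and the method disabledBy are made up for this example. It does not handle the case where both the + and - forms are passed.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Sketch: detect whether compact strings were switched off on the command line.
final class CompactStringsFlag {
    // True if the given JVM arguments contain the kill-switch.
    static boolean disabledBy(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:-CompactStrings");
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Compact strings explicitly disabled: " + disabledBy(jvmArgs));
    }
}
```

Running it as java -XX:-CompactStrings CompactStringsFlag should report true, and running it without the switch should report false.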

2.5 Impact of Compact String During Runtime

The developers of this feature from Oracle found out during performance testing that Compact Strings showed a significant reduction in memory footprint and a performance gain when Strings of LATIN-1 only characters were processed. There was a notable improvement in the performance of Garbage Collector as well.

A feature named Compressed String was introduced in Java 6 which had the same motive but was not effective. Compressed Strings were not enabled by default in JDK 6 and had to be explicitly set using:

-XX:+UseCompressedStrings

Compressed Strings maintained a completely distinct String implementation, shipped in alt-rt.jar, that focused on converting ASCII-only strings to byte arrays. A major problem at the time was that the String constructor took a char array. Also, many operations depended on the char array representation rather than the byte array, so a lot of unpacking was needed, which resulted in performance problems. This feature was eventually removed in JDK 7 and JDK 8.

Unlike Compressed Strings, Compact Strings don’t require unpacking or repacking and hence give better performance at runtime. To gauge the runtime performance, I ran the code below:

Test.java

import java.util.List;
import java.util.stream.IntStream;
import static java.util.stream.Collectors.toList;

public class Test {
  public static void main(String[] args) {
    // Generate ten million short strings ("1" to "10000000").
    long launchTime = System.currentTimeMillis();
    List<String> strings = IntStream.rangeClosed(1, 10_000_000).mapToObj(Integer::toString).collect(toList());
    long runTime = System.currentTimeMillis() - launchTime;
    System.out.println("Generated " + strings.size() + " strings in " + runTime + " ms.");

    launchTime = System.currentTimeMillis();
    String appended = strings.stream().limit(100_000).reduce("", (left, right) -> left + right);
    runTime = System.currentTimeMillis() - launchTime;
    System.out.println("Created string of length " + appended.length() + " in " + runTime + " ms.");
  }
}

This code first creates a list of ten million strings, then concatenates the first 100,000 of them in a spectacularly naive way. Running it with compact strings (the default on Java 9) and without (using -XX:-CompactStrings), I observed a considerable difference:

Console Output

# with compact strings
Generated 10000000 strings in 1048 ms.
Created string of length 488899 in 3244 ms.

# without compact strings
Generated 10000000 strings in 1077 ms.
Created string of length 488899 in 7005 ms.

But you don’t have to trust me. In his talk on compact strings, Aleksey Shipilev shows his measurements, starting at 36:30, citing 1.36x better throughput and 45% less garbage.

3. Java 9 Compact Strings Highlights

If you want a 5-minute overview of this knowledge article, here is the summary:

  1. String density (JEP 254 Compact Strings) is a feature of JDK 9.
  2. The aims were to reduce memory footprint without hurting performance (latency or throughput) while maintaining full backward compatibility.
  3. JDK 6 introduced compressed strings but this was never brought forward into later JVMs. This is a complete rewrite.
  4. To work out how much memory could be saved 960 disparate Java application heap dumps were analyzed.
  5. Live data size of the heap dumps was between 300MB and 2.5GB.
  6. char[] consumed between 10% and 45% of the live data and the vast majority of chars were only one byte in size (i.e. ASCII).
  7. 75% of the char arrays were 35 chars or smaller.
  8. On average, reduction in application size would be 5-15% (reduction in char[] size about 35-45% because of header size).
  9. The way it is implemented is that if all chars in the String use only 1 byte (the higher byte is 0), then a byte[] with one byte per char is used rather than a char[] (ISO-8859-1/Latin1 encoding). A separate byte field indicates which encoding was used.
  10. UTF-8 is not used because its characters have variable length, which makes random access slow.
  11. private final byte coder on the String indicates the encoding.
  12. For all 64 bit JVMs, no extra memory was needed for the extra field because of the ‘dead’ space needed for 8-byte object alignment.
  13. Latency is also improved.
  14. The feature can be enabled with -XX:+CompactStrings and disabled with -XX:-CompactStrings; it is enabled by default.
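To get a feeling for the numbers in points 7 and 8, here is a rough back-of-the-envelope calculation for a 35-character array. The 16-byte array header and the 8-byte alignment are assumptions (typical for a 64-bit HotSpot JVM with compressed references), and the class ArraySizeEstimate is a hypothetical helper, not a measurement tool.

```java
// Sketch: estimated array footprint before and after compaction.
final class ArraySizeEstimate {
    static final int ARRAY_HEADER_BYTES = 16; // assumption, varies by JVM and settings
    static final int ALIGNMENT = 8;           // assumption: objects are 8-byte aligned

    // Rounds a size up to the next multiple of the alignment.
    static long alignUp(long bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // char[] uses two bytes per character.
    static long charArrayBytes(int chars) {
        return alignUp(ARRAY_HEADER_BYTES + 2L * chars);
    }

    // byte[] with LATIN-1 data uses one byte per character.
    static long byteArrayBytes(int chars) {
        return alignUp(ARRAY_HEADER_BYTES + (long) chars);
    }

    public static void main(String[] args) {
        int chars = 35;
        long before = charArrayBytes(chars);
        long after = byteArrayBytes(chars);
        System.out.println(before + " -> " + after + " bytes, saving "
            + (100 * (before - after) / before) + "%");
    }
}
```

Under these assumptions the array shrinks from 88 to 56 bytes, about 36%, rather than the 50% one might expect, because the header does not shrink.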

4. Conclusion

The main goal of this article is to discuss how String storage is optimized in the JVM. Compact Strings is a very helpful feature for applications that use Strings extensively, and it can lead to a much smaller memory footprint. We are looking forward to this feature.

5. Download the Eclipse Project

This was an example of Java 9 Compact Strings

Download
You can download the full source code of this example here: Java9 Compact Strings