Learn Unicode Character Types and Literals in Modern C++

C++11 brought many improvements, and among the most important are the Unicode character types and literals, which improve support for strings in languages from around the world. C++11 introduced new character types for manipulating Unicode strings, and they remain available in C++14, C++17, and later standards. This helps applications such as chat and social media clients handle a more diverse set of language characters and symbols, including emoji. In this post, we explain what Unicode character types and literals are in modern C++.

What are Unicode character types and literals in C++11?

Unicode character types and literals improve support for different languages, characters, and symbols in strings. This benefits editor and design applications (e.g. RAD Studio uses Unicode strings) as well as chat and social media applications. This is why we can display smiley faces 😀, Vulcan hand signals 🖖, and love hearts ❤.

C++ Builder implements new character types and character literals for Unicode. These types are among the C++11 features added to bcc32, bcc32c, and bcc64 compilers.

1. New character types

C++11 introduces new character types to manipulate Unicode character strings. For more information on this feature, see Unicode Character Types and Literals (C++11).

2. Unicode string literals

C++11 introduces new string literal forms for Unicode strings. For more information on this feature, see Unicode Character Types and Literals (C++11).

3. Raw string literals

4. Universal character names in literals

In order to make the C++ code less platform-dependent, C++11 lifts the prohibitions regarding control and basic source universal character names within character and string literals. Prohibitions against surrogate values in all universal character names are added. For more information on this feature, see Universal character names in literals Proposal document.

5. User-defined literals

C++11 introduces new forms of literals using modified syntax and semantics in order to provide user-defined literals. Using user-defined literals, user-defined classes can provide new literal syntax. For more information on this feature, see User-defined literals Proposal document.

What are the Unicode character types char16_t and char32_t in Modern C++?

The C++11 standard introduced two new types to represent Unicode characters:

  • char16_t is a character type at least 16 bits wide (exactly 16 bits on mainstream platforms). char16_t is a C++ keyword. This type is used for UTF-16 code units.
  • char32_t is a character type at least 32 bits wide (exactly 32 bits on mainstream platforms). char32_t is a C++ keyword. This type is used for UTF-32 code units.

The existing wchar_t type is a type for a wide character in the execution wide-character set. A wchar_t wide-character literal begins with an uppercase L (such as L'c').

We have a very good post that explains how you can use character literals in modern C++.

What are the character literals u'character' and U'character' in Modern C++?

There are two new ways to create character literals of the new types:

  • u'character' is a literal for a single char16_t character, such as u'g'. A multicharacter literal such as u'kh' is ill-formed. The value of a char16_t literal is equal to its ISO 10646 code point value, provided that the code point is representable as a 16-bit value. Only characters in the Basic Multilingual Plane (BMP) can be represented.
  • U'character' is a literal for a single char32_t character, such as U't'. A multicharacter literal such as U'de' is ill-formed. The value of a char32_t literal is equal to its ISO 10646 code point value.

Before C++11, the only prefixed character literals were of the form L'characters', of type wchar_t. The value of a wide-character literal containing a single character is that character’s encoding in the execution wide-character set.

For more information on this feature, see Unicode Character Types and Literals (C++11).

String Literals u"UTF-16_string" and U"UTF-32_string" in Modern C++

There are two new forms to create string literals of the new types:

  • u"UTF-16_string" is a string literal containing characters of the char16_t type, for example u"string_containing_UTF-16_encoding_characters".
  • U"UTF-32_string" is a string literal containing characters of the char32_t type, for example U"string_containing_UTF-32_encoding_characters".

For more examples of string literals, u16string, u32string, and basic_string, see our earlier posts.


C++ Builder is the easiest and fastest C and C++ IDE for building simple or professional applications on the Windows, macOS, iOS & Android operating systems. It is also easy for beginners to learn with its wide range of samples, tutorials, help files, and LSP support for code. RAD Studio’s C++ Builder version comes with the award-winning VCL framework for high-performance native Windows apps and the powerful FireMonkey (FMX) framework for cross-platform UIs.

There is a free C++ Builder Community Edition for students, beginners, and startups; it can be downloaded from here. For professional developers, there are Professional, Architect, or Enterprise versions of C++ Builder and there is a trial version you can download from here.