Sai A
Updated date Mar 06, 2024
In this blog, we will learn how to convert UTF-8 encoded bytes into human-readable strings in C++. Explore two methods using standard libraries and the ICU library.

Introduction:

In programming, handling character encodings can be tricky. One of the most widely used encodings is UTF-8, a variable-length encoding that represents every Unicode code point using one to four bytes. In this blog, we will look at UTF-8 and learn how to convert UTF-8 encoded bytes into human-readable strings using C++.
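Because UTF-8 is variable-length, the number of bytes in a std::string is not the number of characters it holds. The sketch below illustrates this by counting code points; countCodePoints is a hypothetical helper name, and the function assumes its input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8; it performs no validation.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // not a continuation byte
            ++count;
        }
    }
    return count;
}
```

For example, 你好 occupies six bytes in UTF-8 (three per character), so size() reports 6 while countCodePoints reports 2.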

Method 1: Using Standard Libraries

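The first method uses the C++ standard library to convert UTF-8 encoded bytes into a wide string. The std::wstring_convert class template from <locale>, combined with std::codecvt_utf8 from the <codecvt> header, can be used for this purpose. Note that both have been deprecated since C++17, so compilers may emit warnings when building this code.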

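#include <clocale>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // Adopt the environment's locale so std::wcout can encode non-ASCII
    // output (this is what glibc-based systems need; other platforms may differ).
    std::setlocale(LC_ALL, "");

    // Assumes this source file is saved as UTF-8 and compiled with a UTF-8
    // execution character set (the GCC/Clang default; /utf-8 on MSVC).
    // The u8 prefix is omitted because u8"" literals are char8_t in C++20
    // and can no longer initialize a std::string.
    std::string utf8String = "Hello, 你好, नमस्ते";

    // Deprecated since C++17; from_bytes() throws std::range_error on invalid UTF-8.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    std::wstring wideString = converter.from_bytes(utf8String);

    std::wcout << wideString << std::endl;

    return 0;
}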

Output:

Hello, 你好, नमस्ते

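In this method, we first define a UTF-8 encoded string utf8String. We then use std::wstring_convert with std::codecvt_utf8<wchar_t> to convert this UTF-8 string into a wide string (std::wstring). Finally, we print the wide string with std::wcout. The non-ASCII characters display correctly only if the program's locale and the terminal use an encoding that can represent them.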
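When from_bytes() is given bytes that are not valid UTF-8 and the converter was constructed without an error string, it throws std::range_error. A minimal sketch of guarding against that follows; tryDecodeUtf8 is a hypothetical helper name, not part of the standard library.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into `out` and returns true on success.
// Returns false (leaving `out` untouched) when from_bytes() rejects the
// input by throwing std::range_error.
// Note: std::wstring_convert and std::codecvt_utf8 are deprecated since C++17.
bool tryDecodeUtf8(const std::string& utf8, std::wstring& out) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    try {
        out = converter.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

A caller can then decide whether to skip the input, log it, or fall back to a placeholder instead of letting the exception end the program.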

Method 2: Using ICU Library

Another approach to UTF-8 string conversion is by utilizing the International Components for Unicode (ICU) library, which provides comprehensive support for Unicode-related operations.

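#include <clocale>
#include <iostream>
#include <string>
#include <vector>
#include <unicode/ucnv.h>
#include <unicode/ustring.h>

int main() {
    // Adopt the environment's locale so std::wcout can encode non-ASCII output.
    std::setlocale(LC_ALL, "");

    // Assumes the source file and execution character set are UTF-8.
    std::string utf8String = "Hello, 你好, नमस्ते";

    UErrorCode status = U_ZERO_ERROR;
    UConverter *conv = ucnv_open("UTF-8", &status);
    if (U_FAILURE(status)) {
        std::cerr << "Error opening converter: " << u_errorName(status) << std::endl;
        return 1;
    }

    // UTF-8 never produces more UTF-16 code units than it has bytes;
    // the extra element leaves room for the terminating NUL.
    std::vector<UChar> uBuffer(utf8String.size() + 1);
    int32_t uLength = ucnv_toUChars(conv, uBuffer.data(), static_cast<int32_t>(uBuffer.size()),
                                    utf8String.data(), static_cast<int32_t>(utf8String.size()), &status);
    ucnv_close(conv);
    if (U_FAILURE(status)) {
        std::cerr << "Error converting string: " << u_errorName(status) << std::endl;
        return 1;
    }

    // UChar is a UTF-16 code unit, which std::wcout cannot print directly,
    // so convert the buffer to wchar_t first.
    std::vector<wchar_t> wBuffer(uLength + 1);
    int32_t wLength = 0;
    u_strToWCS(wBuffer.data(), static_cast<int32_t>(wBuffer.size()), &wLength,
               uBuffer.data(), uLength, &status);
    if (U_FAILURE(status)) {
        std::cerr << "Error converting to wchar_t: " << u_errorName(status) << std::endl;
        return 1;
    }

    std::wcout << std::wstring(wBuffer.data(), wLength) << std::endl;

    return 0;
}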

Output:

Hello, 你好, नमस्ते

Here, we include the ICU converter header and call ucnv_open() to open a converter for UTF-8. We then use ucnv_toUChars() to convert the UTF-8 bytes into a sequence of UChar values, which are UTF-16 code units. Because UChar is not wchar_t, the result has to be converted once more, for example with u_strToWCS(), before std::wcout can print it as a wide string.
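To make concrete what a UChar sequence contains, the following dependency-free sketch decodes UTF-8 into UTF-16 code units by hand. It is an illustration only, not ICU's implementation: utf8ToUtf16 is a hypothetical name, and the function checks lead and continuation bytes but does not reject overlong forms or encoded surrogates, as a production decoder would.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative UTF-8 to UTF-16 decoder (hypothetical helper, not ICU code).
std::u16string utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary plane: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Characters such as 你 (U+4F60) fit in a single code unit, while code points above U+FFFF, such as most emoji, take two. This is why the length returned by ucnv_toUChars() counts code units rather than characters.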

Conclusion:

In this blog, we explored two methods for converting UTF-8 encoded bytes into human-readable strings using C++. The first method uses std::wstring_convert from the standard library, which needs no extra dependency but has been deprecated since C++17. The second method uses the ICU library, which adds a dependency but is actively maintained and offers much broader Unicode support. With either method, whether the text displays correctly also depends on the program's locale and the terminal's encoding.
