Sai A
Updated date Jan 23, 2024
In this blog, we will explore various methods to convert UTF-8 encoded data to a human-readable string in C, with code examples, explanations, and outputs.

Introduction:

In character encoding, UTF-8 stands out as a versatile and widely used system that can represent every Unicode character using one to four bytes each. As a programmer, it's crucial to understand how to handle UTF-8 encoded data, especially when converting it to a human-readable string in a language like C, where a string is simply a null-terminated array of bytes. In this blog, we will explore various methods to convert UTF-8 to a string in C.

Method 1: Using Standard Library Functions

The first method to convert UTF-8 to a string in C is by utilizing standard library functions. The following program demonstrates this method:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    // UTF-8 encoded data
    unsigned char utf8Data[] = {0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0xE3, 0x80, 0x82, 0x00};

    // Determine the length of the UTF-8 data
    size_t utf8Length = strlen((char *)utf8Data);

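    // Allocate memory for the string, plus one byte for the terminator
    char *str = malloc(utf8Length + 1);
    if (str == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }

    // Copy UTF-8 data to the string
    strcpy(str, (char *)utf8Data);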

    // Output the result
    printf("Method 1: Using Standard Library Functions\n");
    printf("UTF-8 Data: %s\n", utf8Data);
    printf("Converted String: %s\n", str);

    // Free allocated memory
    free(str);

    return 0;
}

Output:

Method 1: Using Standard Library Functions
UTF-8 Data: 你好。
Converted String: 你好。

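In this method, we use the strlen function to determine the length of the UTF-8 data in bytes (9 here, since each of the three characters takes three bytes) and allocate that many bytes plus one for the terminator. Then, we copy the UTF-8 data to the allocated string using strcpy. Because a C string is just a sequence of bytes, no transcoding is needed: the copy is already a valid string. Finally, we print the original UTF-8 data and the converted string. The characters display correctly only if the terminal itself uses UTF-8.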

Method 2: Iterating Through UTF-8 Bytes

A more manual approach involves iterating through the individual bytes of the UTF-8 data and constructing the string accordingly. Here's a program demonstrating this method:

#include <stdio.h>

int main() {
    // UTF-8 encoded data
    unsigned char utf8Data[] = {0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0xE3, 0x80, 0x82, 0x00};

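    // Initialize an array to store the resulting string
    char str[30];
    size_t strIndex = 0;
    size_t charCount = 0;

    // Iterate through each byte of UTF-8 data, leaving room for the terminator
    for (size_t i = 0; utf8Data[i] != '\0' && strIndex < sizeof(str) - 1; i++) {
        if ((utf8Data[i] & 0xC0) != 0x80) {
            // Not a continuation byte (10xxxxxx), so a new character starts here
            charCount++;
        }
        // Copy every byte; dropping continuation bytes would corrupt the text
        str[strIndex++] = (char)utf8Data[i];
    }

    // Null-terminate the resulting string
    str[strIndex] = '\0';

    // Output the result
    printf("\nMethod 2: Iterating Through UTF-8 Bytes\n");
    printf("UTF-8 Data: %s\n", utf8Data);
    printf("Converted String: %s\n", str);
    printf("Characters: %zu\n", charCount);

    return 0;
}

Output:

Method 2: Iterating Through UTF-8 Bytes
UTF-8 Data: 你好。
Converted String: 你好。
Characters: 3

This method manually inspects each byte of the UTF-8 data. A byte whose top two bits are 10 is a continuation byte; any other byte marks the start of a new character. The loop copies every byte into the resulting string, because keeping only the lead bytes would leave invalid UTF-8, and it counts the lead bytes to get the number of characters. This approach gives more control over the process, such as bounding the copy to the destination buffer or counting characters rather than bytes.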

Method 3: Using UTF-8 Library

For a more comprehensive solution, consider using a UTF-8 library like ICU (International Components for Unicode). ICU provides robust support for Unicode and simplifies tasks related to character encoding. Here's an example:

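#include <stdio.h>
#include <unicode/ustring.h>
#include <unicode/utypes.h>

int main() {
    // UTF-8 encoded data
    unsigned char utf8Data[] = {0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0xE3, 0x80, 0x82, 0x00};

    // Create a UChar (UTF-16) array to store the converted string
    UChar uStr[30];
    int32_t uLength = 0;

    // Convert UTF-8 to UChar using ICU
    UErrorCode status = U_ZERO_ERROR;
    u_strFromUTF8(uStr, sizeof(uStr) / sizeof(uStr[0]), &uLength, (const char *)utf8Data, -1, &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "u_strFromUTF8 failed: %s\n", u_errorName(status));
        return 1;
    }

    // UChar is a 16-bit type and is not wchar_t on every platform, so it
    // cannot be passed to printf with %S. Convert back to UTF-8 to print it.
    char out[64];
    u_strToUTF8(out, sizeof(out), NULL, uStr, uLength, &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "u_strToUTF8 failed: %s\n", u_errorName(status));
        return 1;
    }

    // Output the result
    printf("\nMethod 3: Using UTF-8 Library (ICU)\n");
    printf("UTF-8 Data: %s\n", utf8Data);
    printf("Converted String: %s\n", out);
    printf("UTF-16 code units: %d\n", (int)uLength);

    return 0;
}

Output:

Method 3: Using UTF-8 Library (ICU)
UTF-8 Data: 你好。
Converted String: 你好。
UTF-16 code units: 3

Using a library like ICU simplifies the process of converting UTF-8 data to a string. The u_strFromUTF8 function handles the conversion and reports malformed input through its status argument, so always check it with U_FAILURE. The resulting UChar array holds UTF-16 code units, which is the form the rest of ICU's API works with. Because UChar is not the same as wchar_t on every platform, the program converts back to UTF-8 with u_strToUTF8 for printing instead of passing the array to printf. To build the program, link against ICU's common library (for example, with -licuuc).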

Conclusion:

Understanding how to convert UTF-8 to a string in C is essential for handling internationalization and multilingual data. In this blog, we have explored three different methods: using standard library functions, manually iterating through UTF-8 bytes, and using a dedicated UTF-8 library like ICU.
