Sai A
Updated date Feb 05, 2024
In this blog, we will learn how to convert strings to UTF-8 in C. We explore two methods: one using standard library functions and one manual approach.

Introduction:

Character encodings are a fundamental part of working with text in programs, and knowing how to produce UTF-8 is essential when handling international text. UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width character encoding that represents every Unicode code point using one to four bytes; ASCII characters keep their single-byte values. In this blog, we will explore two methods to convert a string to UTF-8 in the C programming language.
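To make "variable-width" concrete, here is a minimal sketch that reports how many bytes UTF-8 needs for a given code point. The helper name utf8_length is hypothetical and chosen only for illustration; the ranges follow the standard UTF-8 layout.

```c
#include <stdint.h>

// Hypothetical helper: returns how many bytes UTF-8 needs for a code point,
// or 0 if the value is not a valid Unicode scalar value.
int utf8_length(uint32_t cp) {
    if (cp <= 0x7F) return 1;                   // ASCII
    if (cp <= 0x7FF) return 2;                  // e.g. Latin letters with accents
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; // surrogates are not encodable
    if (cp <= 0xFFFF) return 3;                 // e.g. most CJK and Devanagari
    if (cp <= 0x10FFFF) return 4;               // supplementary planes, e.g. emoji
    return 0;                                   // beyond the Unicode range
}
```

For example, 'A' (U+0041) needs 1 byte, 'é' (U+00E9) needs 2, '你' (U+4F60) needs 3, and U+1F600 needs 4.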

Method 1: Using Standard Library Functions

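The simplest way to convert a string to UTF-8 in C is to use the standard library's wide-character conversion functions. A narrow string literal in a source file saved as UTF-8 is already UTF-8, so this example starts from a wide-character string (wchar_t) and converts it with wcstombs. The conversion follows the current locale, so a UTF-8 locale must be selected first; the locale name "en_US.UTF-8" used here is an assumption and varies by platform. Here's a basic program that demonstrates this method:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
    // Step 1: Select a UTF-8 locale (the name is platform-dependent)
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) {
        fprintf(stderr, "UTF-8 locale not available\n");
        return 1;
    }

    // Step 2: Define a wide-character string
    wchar_t inputString[] = L"Hello, 你好, नमस्ते";

    // Step 3: Determine the length of the string in wide characters
    size_t inputLength = wcslen(inputString);

    // Step 4: Allocate memory for the UTF-8 string
    size_t utf8Size = 4 * inputLength + 1; // Each character takes at most 4 bytes in UTF-8
    char *utf8String = malloc(utf8Size);
    if (utf8String == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }

    // Step 5: Convert the string to UTF-8
    if (wcstombs(utf8String, inputString, utf8Size) == (size_t)-1) {
        fprintf(stderr, "UTF-8 conversion failed\n");
        free(utf8String);
        return 1;
    }

    // Step 6: Print the UTF-8 string
    printf("Method 1 Output: %s\n", utf8String);

    // Step 7: Free the allocated memory
    free(utf8String);

    return 0;
}

Output:

Method 1 Output: Hello, 你好, नमस्ते

  • We select a UTF-8 locale with setlocale so that wcstombs produces UTF-8 output.
  • We define a wide-character string (inputString) that includes characters from different languages.
  • The length of the input string is determined using wcslen.
  • Memory is allocated for the UTF-8 string, considering that each character may require up to 4 bytes in UTF-8.
  • The actual conversion is performed using wcstombs from the standard library.
  • The resulting UTF-8 string is printed, and allocated memory is freed.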
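
Printed text can look correct even when the terminal is hiding an encoding problem, so it can help to inspect the raw bytes of the result. The following is a minimal sketch; the helper name utf8_to_hex is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: writes the bytes of s into out as space-separated
// uppercase hex pairs. Returns 0 on success, -1 if out is too small.
int utf8_to_hex(const char *s, char *out, size_t outSize) {
    assert(s != NULL && out != NULL);
    size_t len = strlen(s);

    // Each byte needs two hex digits plus either a separator or the terminator.
    size_t needed = (len == 0) ? 1 : 3 * len;
    if (outSize < needed) {
        return -1;
    }

    out[0] = '\0';
    for (size_t i = 0; i < len; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned int byte = (unsigned char)s[i];
        snprintf(out + 3 * i, outSize - 3 * i,
                 (i + 1 < len) ? "%02X " : "%02X", byte);
    }
    return 0;
}
```

Passing the UTF-8 bytes of "你" to this helper produces the text "E4 BD A0", which is the three-byte sequence for U+4F60.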

Method 2: Manual UTF-8 Conversion

For those who want a deeper understanding of the UTF-8 encoding process, manual conversion is an enlightening exercise. This method involves iterating through each character in the input string and encoding it into UTF-8 bytes. Let's explore this approach:

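#include <stdio.h>
#include <stdlib.h>
#include <uchar.h>

// Encodes one Unicode code point into UTF-8.
// Returns the number of bytes written, or 0 if the code point is invalid.
size_t encodeUtf8(char32_t codepoint, unsigned char utf8Char[4]);

int main(void) {
    // Step 1: Define a string as Unicode code points (C11 UTF-32 literal)
    char32_t inputString[] = U"Hello, 你好, नमस्ते";
    size_t inputLength = sizeof inputString / sizeof inputString[0] - 1;

    // Step 2: Allocate memory for the UTF-8 string
    size_t utf8Size = 4 * inputLength + 1; // Each code point takes at most 4 bytes in UTF-8
    char *utf8String = malloc(utf8Size);
    if (utf8String == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }

    // Step 3: Initialize variables for iteration
    size_t i = 0, j = 0;

    // Step 4: Iterate through each code point in the input string
    while (inputString[i] != 0) {
        // Step 5: Encode the code point into UTF-8
        unsigned char utf8Char[4];
        size_t utf8CharSize = encodeUtf8(inputString[i], utf8Char);

        // Step 6: Copy the UTF-8 bytes to the result string
        for (size_t k = 0; k < utf8CharSize; ++k) {
            utf8String[j++] = (char)utf8Char[k];
        }

        // Move to the next code point in the input string
        ++i;
    }

    // Null-terminate the UTF-8 string
    utf8String[j] = '\0';

    // Step 7: Print the UTF-8 string
    printf("Method 2 Output: %s\n", utf8String);

    // Step 8: Free the allocated memory
    free(utf8String);

    return 0;
}

// Function to encode a Unicode code point into UTF-8,
// following the standard 1- to 4-byte bit layout
size_t encodeUtf8(char32_t codepoint, unsigned char utf8Char[4]) {
    if (codepoint <= 0x7F) {                 // 0xxxxxxx
        utf8Char[0] = (unsigned char)codepoint;
        return 1;
    }
    if (codepoint <= 0x7FF) {                // 110xxxxx 10xxxxxx
        utf8Char[0] = (unsigned char)(0xC0 | (codepoint >> 6));
        utf8Char[1] = (unsigned char)(0x80 | (codepoint & 0x3F));
        return 2;
    }
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
        return 0;                            // Surrogates are not valid code points
    }
    if (codepoint <= 0xFFFF) {               // 1110xxxx 10xxxxxx 10xxxxxx
        utf8Char[0] = (unsigned char)(0xE0 | (codepoint >> 12));
        utf8Char[1] = (unsigned char)(0x80 | ((codepoint >> 6) & 0x3F));
        utf8Char[2] = (unsigned char)(0x80 | (codepoint & 0x3F));
        return 3;
    }
    if (codepoint <= 0x10FFFF) {             // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        utf8Char[0] = (unsigned char)(0xF0 | (codepoint >> 18));
        utf8Char[1] = (unsigned char)(0x80 | ((codepoint >> 12) & 0x3F));
        utf8Char[2] = (unsigned char)(0x80 | ((codepoint >> 6) & 0x3F));
        utf8Char[3] = (unsigned char)(0x80 | (codepoint & 0x3F));
        return 4;
    }
    return 0;                                // Beyond the Unicode range
}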

Output:

Method 2 Output: Hello, 你好, नमस्ते

  • We start by defining the input string (inputString) and allocating memory for the resulting UTF-8 string.
  • The main loop iterates through each character in the input string.
  • The encodeUtf8 function is called to convert each Unicode character into its corresponding UTF-8 representation.
  • The resulting UTF-8 bytes are copied to the final UTF-8 string.
  • The process continues until the end of the input string is reached.
  • The UTF-8 string is null-terminated and printed.
  • Finally, the allocated memory is freed.
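
To see what the encoding step does for one concrete character, the three-byte case can be worked by hand. The sketch below is illustrative only: the helper name encode_three_byte is hypothetical, and it covers just the range U+0800 to U+FFFF without rejecting surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: encodes a code point in the range U+0800..U+FFFF
// (the three-byte case) into out[0..2]. Surrogate checks are skipped for brevity.
void encode_three_byte(uint32_t cp, unsigned char out[3]) {
    assert(out != NULL && cp >= 0x0800 && cp <= 0xFFFF);
    out[0] = (unsigned char)(0xE0 | (cp >> 12));         // 1110xxxx: top 4 bits
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); // 10xxxxxx: middle 6 bits
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));        // 10xxxxxx: low 6 bits
}
```

For '你' (U+4F60, binary 0100 1111 0110 0000), the top four bits 0100 give 0xE4, the middle six bits 111101 give 0xBD, and the low six bits 100000 give 0xA0, so the encoded bytes are E4 BD A0.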

Conclusion:

In this blog, we have explored two methods for converting a string to UTF-8 in the C programming language. The first method used standard library functions and keeps the code short. The second method performed the conversion manually, which gives a deeper understanding of the UTF-8 encoding scheme.
