Sai A
Updated date Aug 02, 2023
In this blog, we will learn how to convert HTML content to plain text in C#. Explore three methods, including Regex, HtmlAgilityPack, and WebClient with WebUtility.

Introduction:

As a developer, you may often encounter the need to convert HTML content to plain text in your C# projects. HTML, a markup language, enables rich formatting on web pages, while plain text offers simplicity without any styling or markup. In this blog, we will explore various methods to achieve this conversion using C#.

Method 1: Using Regular Expressions (Regex)

Regular Expressions provide a powerful way to manipulate strings. In this method, we will use Regex to remove HTML tags from the content and convert it to plain text.

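using System;
using System.Text.RegularExpressions;

public class HtmlToPlainTextConverter
{
    public static string ConvertHtmlToPlainTextWithRegex(string html)
    {
        // Lazily match each tag; Singleline lets '.' cross line breaks,
        // so tags that wrap onto a second line are removed as well.
        string plainText = Regex.Replace(html, "<.*?>", "", RegexOptions.Singleline);
        return plainText;
    }
}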

public class Program
{
    public static void Main()
    {
        string htmlContent = "<h1>Hello, <b>World!</b></h1><p>This is a sample <i>HTML</i> content.</p>";
        string plainText = HtmlToPlainTextConverter.ConvertHtmlToPlainTextWithRegex(htmlContent);
        Console.WriteLine(plainText);
    }
}

Output:

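Hello, World!This is a sample HTML content.

Note that there is no space between "World!" and "This": the tags between them are removed without inserting any whitespace.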
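The single Regex.Replace call above treats every tag the same way. As a rough sketch of how the idea could be extended, the helper below also drops script and style blocks, turns the end of common block elements into line breaks, and decodes entities. The class name RegexHtmlCleaner, the method name ToPlainText, and the choice of which tags count as block elements are assumptions made for illustration; none of them come from a library.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original example.
public static class RegexHtmlCleaner
{
    public static string ToPlainText(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Turn <br> and the closing tags of common block elements into line breaks.
        text = Regex.Replace(text, @"<br\s*/?>|</(p|div|h[1-6]|li|tr)\s*>", "\n",
            RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        text = Regex.Replace(text, "<.*?>", "", RegexOptions.Singleline);

        // Decode entities such as &amp; and &lt;.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of spaces and tabs, then trim whitespace at both ends.
        text = Regex.Replace(text, @"[ \t]+", " ");
        return text.Trim();
    }
}
```

Calling RegexHtmlCleaner.ToPlainText(htmlContent) with the sample string from the example above should put "Hello, World!" and "This is a sample HTML content." on separate lines. This is still a pattern-based approximation, so unusual markup (for example a ">" inside an attribute value) can defeat it.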

Method 2: Using HtmlAgilityPack

HtmlAgilityPack is a popular C# library that provides a more structured approach to parsing and traversing HTML documents, making it easier to extract text content.

using System;
using HtmlAgilityPack;

public class HtmlToPlainTextConverter
{
    public static string ConvertHtmlToPlainTextWithHtmlAgilityPack(string html)
    {
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        string plainText = htmlDocument.DocumentNode.InnerText;
        return plainText;
    }
}

public class Program
{
    public static void Main()
    {
        string htmlContent = "<h1>Hello, <b>World!</b></h1><p>This is a sample <i>HTML</i> content.</p>";
        string plainText = HtmlToPlainTextConverter.ConvertHtmlToPlainTextWithHtmlAgilityPack(htmlContent);
        Console.WriteLine(plainText);
    }
}

Output:

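Hello, World!This is a sample HTML content.

InnerText concatenates the text nodes exactly as they appear, so no space separates "World!" from "This".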
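InnerText also returns the contents of script and style elements and leaves entities such as &amp; encoded. The sketch below shows one way to handle both with HtmlAgilityPack, assuming the HtmlAgilityPack NuGet package is installed. The class name AgilityPackCleaner and the method name ToPlainText are made up for this illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; requires the HtmlAgilityPack NuGet package.
public static class AgilityPackCleaner
{
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect <script> and <style> nodes first (ToList), then remove them,
        // so the tree is not modified while it is being enumerated.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // InnerText leaves entities encoded; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

With this helper, an input such as "<style>p{color:red}</style><p>Tom &amp; Jerry</p>" should come back without the CSS rule and with the ampersand decoded. Because the document is parsed into a tree first, this approach does not depend on tags matching a particular text pattern.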

Method 3: Using WebClient and WebUtility

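In this method, we'll use the WebClient class to download the HTML content from a URL, strip the tags with Regex, and then use WebUtility.HtmlDecode to decode any HTML entities present. WebUtility.HtmlDecode on its own does not remove tags; it only converts entities such as &amp; back into characters, so the tag-stripping step is required. Note also that WebClient is marked obsolete in .NET 6 and later, where HttpClient is the recommended replacement.

using System;
using System.Net;
using System.Text.RegularExpressions;

public class HtmlToPlainTextConverter
{
    public static string ConvertHtmlToPlainTextWithWebClient(string url)
    {
        using (WebClient client = new WebClient())
        {
            // Download the raw HTML from the URL.
            string html = client.DownloadString(url);
            // HtmlDecode only decodes entities, so strip the tags first.
            string withoutTags = Regex.Replace(html, "<.*?>", "", RegexOptions.Singleline);
            string plainText = WebUtility.HtmlDecode(withoutTags);
            return plainText;
        }
    }
}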

public class Program
{
    public static void Main()
    {
        string url = "https://example.com/sample.html";
        string plainText = HtmlToPlainTextConverter.ConvertHtmlToPlainTextWithWebClient(url);
        Console.WriteLine(plainText);
    }
}

Output:

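The output depends on the page served at the URL. The address https://example.com/sample.html is only a placeholder, so substitute a real page before running the program.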
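On .NET 6 and later, WebClient is marked obsolete, so the same fetch-then-convert idea is usually written with HttpClient. The following is a minimal sketch under that assumption. The class name HttpClientHtmlFetcher and both method names are hypothetical, and the conversion step is kept separate from the download so that it can be tried without network access.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the original example.
public static class HttpClientHtmlFetcher
{
    // A single shared HttpClient instance is the usual recommendation.
    private static readonly HttpClient Client = new HttpClient();

    // Strips tags first, then decodes entities, so that decoded '<' and '>'
    // characters in the text are not mistaken for markup.
    public static string StripTagsAndDecode(string html)
    {
        string withoutTags = Regex.Replace(html, "<.*?>", "", RegexOptions.Singleline);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static async Task<string> FetchPlainTextAsync(string url)
    {
        // GetStringAsync throws HttpRequestException on a non-success status code.
        string html = await Client.GetStringAsync(url);
        return StripTagsAndDecode(html);
    }
}
```

From an async method, the call would look like string text = await HttpClientHtmlFetcher.FetchPlainTextAsync(url); where url points at a real page. The tag stripping here is the same simple Regex used in Method 1, so its limitations apply to this version as well.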

Conclusion:

In this blog, we explored three different methods to convert HTML to plain text using C#. The Regex approach offers simplicity but may not handle complex HTML effectively, for example script and style blocks, comments, or a ">" inside an attribute value. HtmlAgilityPack, on the other hand, parses the document into a tree, which provides a more robust way to navigate and extract content accurately. Lastly, the combination of WebClient and WebUtility retrieves HTML directly from a URL and decodes its entities, which suits cases where the content must be fetched from an external source. Because WebUtility.HtmlDecode does not remove tags, it needs to be paired with one of the first two methods.
