Just Dial is an online search engine that hosts thousands of services, sellers, dealers, shops, and owners who publish their information to connect with customers and other businesses. The website uses several features that make it difficult to scrape.
In particular, phone numbers are rendered as images, so the digits are not stored as text inside a webpage element. We need OCR technology to read the numbers from the image. This tutorial focuses on overcoming these issues and scraping the available information into the desired format.
This is what a basic Just Dial page looks like. The dealer/manufacturer name, address, and even the images can be scraped very easily. However, the phone numbers are displayed as images, so Get Text cannot read them.
In the image above, “07947161351” is made up of 11 images stitched together to form the final number, one image per digit. To solve this problem, we will use Get OCR Text, which uses Tesseract OCR to read the number as it is rendered on the page.
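For readers who want to see the same idea outside the tool used in this tutorial, here is a minimal Python sketch. It assumes the phone number has already been saved as a cropped image, and that the `pytesseract` and Pillow packages plus a Tesseract install are available. The function names and the look-alike character table are illustrative choices, not part of the original workflow.

```python
import re

# Letters that OCR engines often confuse with digits.
# This mapping is an assumption for illustration; adjust it to the errors you observe.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def clean_phone(ocr_text: str) -> str:
    """Map look-alike letters to digits, then drop every non-digit character."""
    return re.sub(r"\D", "", ocr_text.translate(LOOKALIKES))


def read_phone_image(path: str) -> str:
    """Run Tesseract on a cropped phone-number image and return only its digits."""
    # Imported lazily so clean_phone stays usable without the OCR stack installed.
    from PIL import Image
    import pytesseract

    # --psm 7 treats the image as a single line of text. The whitelist limits
    # output to digits on Tesseract versions that honor it.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    with Image.open(path) as img:
        return clean_phone(pytesseract.image_to_string(img, config=config))
```

The cleanup step is kept separate from the OCR call so that it can be checked on its own: `clean_phone("O79 4716-l35l")` yields the 11-digit number from the example above.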
Note that Get OCR Text only works when the image is visible on the user’s screen. If the element is hidden off-screen, it must first be scrolled into view, because OCR cannot recognize content that is not currently visible.
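The same constraint applies when capturing an element for OCR from a browser automation script. The sketch below assumes a Selenium-style `driver` and `element` (any objects exposing `execute_script`, `is_displayed`, and `screenshot_as_png` will do). The helper name `capture_visible` is hypothetical.

```python
def capture_visible(driver, element) -> bytes:
    """Scroll `element` into the viewport, then return a PNG screenshot of it."""
    # scrollIntoView is a standard DOM method; centering the element keeps it
    # clear of sticky headers that might otherwise cover it.
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)

    # is_displayed reports whether the element is rendered at all (for example,
    # not hidden by CSS); scrolling cannot fix that case, so fail loudly.
    if not element.is_displayed():
        raise RuntimeError("element is still not visible after scrolling")

    return element.screenshot_as_png
```

The returned bytes can then be written to a file and passed to whichever OCR step you use.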