Scraping List of Websites With RPA Using UiPath

In this tutorial, we will use UiPath to read company names from an Excel file, search Google for each one, scrape the company's website address, and save it back to the Excel file in a new column.

The image highlights the sample data we are using for this tutorial. It consists of two columns, Company and Domain.

Step 1.

  • First, we will use the Read Range activity under System to read the table from the Excel file.
  • Inside the activity, we have entered the location of the file containing the names of the US banks. "Sheet1" is selected by default, and because that is exactly where the table is located, no change is necessary.
  • Finally, we have left the Range box blank so that UiPath reads everything present in Sheet1 and stores it in a Data Table variable. We could also state the range of the table, but it is better to leave it blank because the number and type of data entries change frequently.
  • On the right-hand side, in the Properties Panel, we have declared a new variable, DTinput, to store the Data Table.

Step 2.

  • Now we need to use an Open Browser activity, which opens a new browser window on the user's screen.

  • We have entered the URL "google.com" into the activity and selected Chrome from the "Browser Type" option in the Properties Panel.
  • We also choose to start a new session, so the NewSession property is set to True.

Step 3.

  • Now, let's use the For Each Row activity to create a loop for each row of data that is present inside the DTinput variable.
  • In other words, this activity lets us create a looping sequence that executes once for each row present in the Excel file.
  • Therefore, in this case, the For Each Row activity will run 10 times, as there are 10 bank names in the file.
  • The user needs to enter the Data Table variable in the DataTable box in the Properties Panel.
  • Now, to search for each company’s website, we need to open the Chrome browser (or the browser that was selected in Open Browser activity) and navigate to google.com.
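
For readers who think in code, the loop in this step can be sketched in Python. This is only an analogy and not what UiPath runs internally: the list of dictionaries is a stand-in for DTinput, the second company name is a placeholder rather than real tutorial data, and build_search_url is a hypothetical helper that builds a Google search URL from a company name.

```python
from urllib.parse import quote_plus

# Stand-in for the DTinput data table: one dictionary per Excel row.
# "Wells Fargo" comes from the tutorial; the second name is a placeholder.
dt_input = [
    {"Company": "Wells Fargo", "Domain": ""},
    {"Company": "Example Bank & Trust", "Domain": ""},
]


def build_search_url(company: str) -> str:
    """Hypothetical helper: return a Google search URL for one company name."""
    return "https://www.google.com/search?q=" + quote_plus(company)


# Equivalent of For Each Row: the body runs once per row in the table.
search_urls = [build_search_url(str(row["Company"])) for row in dt_input]

for url in search_urls:
    print(url)
```

The loop body runs exactly as many times as there are rows, which is why the workflow in this tutorial executes 10 times for 10 bank names.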

Step 4.

  • We will use the TypeInto activity to type inside Google’s search box.

  • Clicking "Indicate element inside the browser" opens a pointer, with which the user needs to select the Google search box in Chrome.

  • Once the search box is successfully selected with the pointer, the TypeInto activity captures the selector it needs to perform its action.

  • Inside the text box, we will pass the expression row.Item("Company").ToString, which takes the value of the Company column and converts it to a string.
  • Inside the Properties Panel, checking both the ClickBeforeTyping and EmptyField options is crucial for the TypeInto activity to work reliably.
  • Our first search item is Wells Fargo, and we will navigate to that search result to help UiPath collect information about the element storing the website link.

  • For simplicity, we will treat only the first result as the valid website address.

Step 5.

  • Now we need to extract the URL from the first search result and for that, we will use the GetAttribute activity.
  • It is important to point out that the GetText activity can also scrape information, but it only returns an element's text. Because we need the link behind the element, we use the GetAttribute activity instead.

  • By clicking on the Indicate element inside the browser, UiPath redirects to the pointer to select the element containing the required information.

  • Here we have selected the title of the search result using the pointer. UiPath will automatically identify the element and create a selector for the association.

  • As we want to extract the website link from the search result title, we will pass the attribute name "href", the HTML attribute (short for Hypertext REFerence) that holds a link's target URL.
  • The extracted value is saved in the string variable Website, which is set in the Output Result box of the Properties Panel.
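
The difference between reading an element's text and reading its href attribute can be illustrated with a small Python sketch. It uses the standard library's HTMLParser on a simplified, made-up snippet of markup; a real search results page is far more complex, and the class name FirstLinkParser is a hypothetical name chosen for this example.

```python
from html.parser import HTMLParser


class FirstLinkParser(HTMLParser):
    """Collect the href attribute and the visible text of the first <a> tag."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text = ""
        self._inside_first_link = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        # Only the first link is considered, mirroring the tutorial's choice.
        if tag == "a" and not self._done and not self._inside_first_link:
            self._inside_first_link = True
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._inside_first_link:
            self.text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_first_link:
            self._inside_first_link = False
            self._done = True


# Simplified sample markup, assumed for illustration only.
sample_html = '<a href="https://www.wellsfargo.com/"><h3>Wells Fargo - Banking</h3></a>'

parser = FirstLinkParser()
parser.feed(sample_html)

website = parser.href  # comparable to what an attribute lookup returns
title = parser.text    # comparable to what a text lookup returns
print(website)
print(title)
```

The text lookup yields only the visible title, while the attribute lookup yields the URL, which is the value this step stores in Website.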

Step 6.

  • Now we need to write this information into an Excel file. To do that, we will use the Build Data Table activity to create a data table and store the information in it.

  • Placing this activity inside the loop would recreate the same Data Table on every iteration, so we place it outside the loop, inside the Global Sequence.

  • Clicking on DataTable brings up a Wizard that allows a user to customize the type of Data Table they are trying to create.
  • Two columns are created by default; we will rename them to fit our requirements. We also need to change the Data Type of Column2 from Int32 to String.

  • Clicking on the edit option brings up the Edit Column dialog box which allows us to change the column name as well as the Data Type.

  • This is the final customized Data table as per our requirements.

Step 7.

  • Now we need to save the created Data Table inside a variable and for that, we have set DToutput inside the Output Datatable box under the Properties Panel.
  • Let’s scroll down inside the looping activity and add a new activity called Add Data Row.

  • This activity adds a new Data Row to our newly built Data Table, DToutput, which we have passed into the DataTable box in the Properties Panel.
  • To select the data to be inserted, click on ArrayRow, which brings up the Expression Editor dialog box as shown in the image.
  • We are passing an array because we have more than one value to insert.
  • To write the array, we use curly braces "{}" and place the values inside, separated by commas ",".
  • The two required values are row.Item("Company").ToString and Website. The value written first goes into the first column, and the second goes into the second column.
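
How the order of values in the array maps onto the columns can be sketched in Python. This is an analogy under stated assumptions: the column names "Company" and "Website" are assumed (the tutorial does not state the names chosen in the wizard), the list dt_output stands in for DToutput, and add_data_row is a hypothetical helper, not a UiPath API.

```python
# Assumed column names; the tutorial does not state the exact names used.
columns = ["Company", "Website"]
dt_output = []  # stand-in for the DToutput data table


def add_data_row(table, column_names, array_row):
    """Hypothetical helper mimicking Add Data Row with an ArrayRow input."""
    if len(array_row) != len(column_names):
        raise ValueError("array row must have one value per column")
    # Values are matched to columns by position: first value, first column.
    table.append(dict(zip(column_names, array_row)))


row = {"Company": "Wells Fargo"}         # current row of the input table
website = "https://www.wellsfargo.com/"  # value captured in Step 5

add_data_row(dt_output, columns, [str(row["Company"]), website])
print(dt_output)
```

Swapping the two values in the array would swap the columns they land in, so the order must match the column order defined when the table was built.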

Step 8.

  • Using the Append Range activity, we will write the extracted websites to Sheet2 of the same file that contains the names of the banks.

Step 9.

  • The last activity needed to finalize the script is Clear Data Table, with the DToutput variable passed into it. It is needed because the website URL extracted for each row is added to the Data Table one at a time, 10 times in total.
  • For each new row, the information from the previous rows is still in the Data Table, and the Add Data Row activity adds a new row instead of replacing the existing one. Without clearing, this would cause the Append Range activity to write the same company and website multiple times.
  • Clear Data Table resets the Data Table and solves that problem.
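
The effect of clearing the table can be checked with a small Python simulation. It assumes, as the description above implies, that Append Range runs inside the loop once per row; the company names and URLs are placeholders, and run_loop is a hypothetical function written only for this illustration.

```python
def run_loop(companies, clear_after_write):
    """Simulate the loop: add a row, append the table to the sheet, optionally clear."""
    dt_output = []  # stand-in for DToutput
    sheet2 = []     # stand-in for the rows written to Sheet2
    for company in companies:
        dt_output.append([company, f"https://{company}.example"])  # Add Data Row
        sheet2.extend(dt_output)                                   # Append Range
        if clear_after_write:
            dt_output.clear()                                      # Clear Data Table
    return sheet2


companies = ["bank-a", "bank-b", "bank-c"]  # placeholder names

without_clear = run_loop(companies, clear_after_write=False)
with_clear = run_loop(companies, clear_after_write=True)

print(len(without_clear))  # 6 rows: 1 + 2 + 3, with earlier companies repeated
print(len(with_clear))     # 3 rows: one per company
```

Without the clear step, each iteration rewrites every row collected so far, so the number of rows written grows as 1 + 2 + ... + n instead of n.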