Data plays a critical role in investigations. It can suggest new ways of looking at a problem and lead to fresh insights. Unfortunately, the data you need is rarely available in a ready-made form. You may find it on the Internet, but not in a downloadable format. In such cases, you can use web scraping to programmatically gather the data you need.
Several scraping approaches and programming languages can help with this process. This article will show you how to use the Python language to scrape a site. Along the way, you will gain insight into how web pages work and how developers structure data on a website.
The best starting point is to download and install the Anaconda Python Distribution on your machine. It also helps to take a tutorial on the basics of the language; Codecademy is a good place to start, especially if you are new to the field.
This guide uses the Polk County current inmate listing site. We will walk through a Python script that extracts the list of inmates and pulls data such as the city of residence and race for each one. The full script is available on GitHub, a popular platform for sharing code, and it includes extensive comments that should help you follow along.
When scraping any site, the first tool to reach for is a web browser. Most browsers provide HTML inspection tools that let you lift the hood and understand how a page is structured. How you access these tools varies from browser to browser, but the mainstay is 'View Page Source', usually available by right-clicking on the page.
Viewing the HTML source of the page, you will notice that the links to each inmate's details page are neatly listed in table rows. The next step is to write a script to extract this information. The two Python packages that will do the heavy lifting are Beautiful Soup and Requests; make sure you install them before running the code.
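As a minimal sketch of what "table rows with links" means to Beautiful Soup, the snippet below parses a made-up listing fragment (the HTML here is illustrative, not the actual Polk County page; in the real script you would fetch the page with Requests first):

```python
# Minimal sketch: locate links inside table rows with Beautiful Soup.
# The HTML below is a made-up stand-in for an inmate listing page;
# a real script would obtain it with requests.get(url).text instead.
from bs4 import BeautifulSoup

listing_html = """
<table>
  <tr><td><a href="/details?id=1">Inmate One</a></td></tr>
  <tr><td><a href="/details?id=2">Inmate Two</a></td></tr>
</table>
"""

soup = BeautifulSoup(listing_html, "html.parser")
# Collect the href of every link that sits inside a table row.
links = [a["href"] for a in soup.select("table tr a")]
print(links)  # ['/details?id=1', '/details?id=2']
```

The `select` call takes a CSS selector, which is often the quickest way to mirror the structure you saw in the browser's page source.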
The web scraping script does three things: it loads the listing pages and extracts the links to the details pages, loads each details page and extracts its data, and prints the extracted data after filtering it, for example by city of residence or race. Once you understand this, the next step is to start coding with Beautiful Soup and Requests.
First, load the inmate listing page using requests.get with the page URL, then use Beautiful Soup to parse it. Next, extract the link to each details page by looping through the table rows. After parsing an inmate's details page, extract the name, booking time, race, age, and sex values into a dictionary. Each inmate gets their own dictionary, and all the dictionaries are appended to an inmates list. Finally, loop over the list, filter on the race and city values, and print out the results.
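The steps above can be sketched as one small pipeline. Note that all the HTML, URLs, and field names below are assumptions made up for illustration (the real Polk County page has its own structure); the network calls are indicated in comments and replaced with canned strings so the sketch is self-contained:

```python
# Sketch of the full pipeline: listing page -> detail pages -> filtered print.
# All HTML here is fabricated for illustration; a real script would fetch
# each page with requests.get(url).text instead of these canned strings.
from bs4 import BeautifulSoup

LISTING_HTML = """
<table>
  <tr><td><a href="/detail/1">Doe, John</a></td></tr>
  <tr><td><a href="/detail/2">Roe, Jane</a></td></tr>
</table>
"""

DETAIL_HTML = {
    "/detail/1": "<dl><dt>Name</dt><dd>Doe, John</dd><dt>Race</dt><dd>White</dd>"
                 "<dt>City</dt><dd>Des Moines</dd></dl>",
    "/detail/2": "<dl><dt>Name</dt><dd>Roe, Jane</dd><dt>Race</dt><dd>Black</dd>"
                 "<dt>City</dt><dd>Ankeny</dd></dl>",
}

def extract_detail_links(listing_html):
    """Step 1: pull the link to each inmate's details page from the listing rows."""
    soup = BeautifulSoup(listing_html, "html.parser")
    return [a["href"] for a in soup.select("table tr a")]

def parse_detail(detail_html):
    """Step 2: turn one details page into a field -> value dictionary."""
    soup = BeautifulSoup(detail_html, "html.parser")
    fields = [dt.get_text() for dt in soup.find_all("dt")]
    values = [dd.get_text() for dd in soup.find_all("dd")]
    return dict(zip(fields, values))

inmates = []
for link in extract_detail_links(LISTING_HTML):
    # Real script: detail_html = requests.get(BASE_URL + link).text
    inmates.append(parse_detail(DETAIL_HTML[link]))

# Step 3: print only the inmates matching a filter, e.g. by city of residence.
for inmate in inmates:
    if inmate["City"] == "Ankeny":
        print(inmate["Name"], inmate["Race"], inmate["City"])
```

Keeping each step in its own function makes it easy to test the parsing logic on saved HTML files before pointing the script at the live site.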