
How to web-scrape and make the most of data?

Olya Pyrozhenko How-To Articles November 1, 2018
In the age of the Internet, almost no business runs without the World Wide Web. To make headway, you need up-to-date data about market trends. That's why developers regularly research sites to extract relevant information. This is where web scraping with PHP comes into play.

Parsing, harvesting, and screen scraping all refer to much the same thing: exploring the content of a page and converting it into another form. They are techniques for getting data from a website and saving it to a local file or a database.

PHP web scraping is used for several reasons. One of them is analyzing a competitor's site to see which strategies you can adopt in your own products. In general, screen scraping is done by an automated script, which:
  • makes GET requests to a target site;
  • receives an answer and parses an HTML or XML document;
  • searches for data and converts it to a designated format (a video, product catalog, image, text, or others).
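The parse, search, and convert steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the sample markup is hypothetical and stands in for a page that would normally be fetched with a GET request, and DOMDocument with DOMXPath is just one possible way to parse it.

```php
<?php
// Hypothetical sample markup standing in for a fetched page (an assumption, not a real site).
$html = '<html><body><h2>Latest Posts</h2><ul>'
      . '<li><a href="/post-1">First post</a></li>'
      . '<li><a href="/post-2">Second post</a></li>'
      . '</ul></body></html>';

// Parse: load the markup into a DOM tree and silence warnings about imperfect HTML.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Search: collect every link that sits inside a list item.
$xpath = new DOMXPath($dom);
$posts = [];
foreach ($xpath->query('//li/a') as $link) {
    $posts[] = [
        'title' => trim($link->textContent),
        'url'   => $link->getAttribute('href'),
    ];
}

// Convert: emit the result in a designated format, here JSON.
$json = json_encode($posts);
echo $json, PHP_EOL;
```

The same array could just as well be written to a CSV file or a database table, depending on the format you need.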
Note that you may run into strict copyright policies while harvesting content, as you aren't allowed to freely process everything you see on the Internet. Within those limits, a PHP website scraper lets you retrieve information for analysis with ease. Check out the data parsing guide below to make this no sweat.

3-Step web scraping tutorial: Crawl a page like an expert


First and foremost, get a handle on the structure of the website you want to parse. Browse through it as an ordinary user would. Your web page scraping starts right here.
  • Step 1. Open an HTML form to type your URL
A URL carries a lot of information, and it changes as you browse a site. Create a new index.php file with an HTML form and enter the URL you need there. It's up to you which part of it to work with: the base or the query parameters. The former identifies the site and the page, whereas the latter carries additional values passed to the page (strings, numbers, and others).
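To make the difference between the base and the query parameters concrete, here is a small sketch that uses PHP's built-in parse_url() and parse_str(). The URL is a hypothetical example chosen only for illustration.

```php
<?php
// Hypothetical URL used only for illustration.
$url = 'https://example.com/blog/search?category=php&page=2';

// Split the URL into its components (scheme, host, path, query, ...).
$parts = parse_url($url);

// The base: scheme, host, and path identify the site and the page.
$base = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];

// The query parameters: additional values passed to the page.
parse_str($parts['query'] ?? '', $params);

echo $base, PHP_EOL;
print_r($params);
```

Note that parse_str() returns every value as a string, so a page number such as 2 arrives as '2' and may need casting before use.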
  • Step 2. PHP-scrape a web page
Now that you know what you are dealing with, create a scrape.php file. In it, write a PHP function that extracts data by calling cURL, the library used here for scraping. cURL also lets you exchange data with a variety of servers over a variety of protocols.

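function scrapeSiteData($website_url) {
    // Stop early if the cURL extension is not available
    if (!function_exists('curl_init')) {
        die('cURL is not installed. Please install and try again.');
    }
    // Start a cURL session
    $curl = curl_init();
    // Set the URL of the page to download
    curl_setopt($curl, CURLOPT_URL, $website_url);
    // Return the response as a string instead of printing it
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    // Run the request and keep the page content
    $output = curl_exec($curl);
    // End the session
    curl_close($curl);
    return $output;
}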


The code above first checks that the cURL extension is installed and then relies on three of its functions:
  • curl_init() starts a session
  • curl_exec() carries it out
  • curl_close() closes it
The CURLOPT_URL option sets the URL you need to extract data from. CURLOPT_RETURNTRANSFER makes curl_exec() return the downloaded page as a string that you can store in a variable, instead of printing it directly, which is the default behavior.
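Beyond these two options, a scraper usually benefits from timeouts, redirect handling, and an error check. The sketch below is one possible arrangement, not part of the original tutorial: the helper names buildCurlOptions() and fetchPage() are hypothetical, and the timeout and redirect values are illustrative assumptions.

```php
<?php
/**
 * Build a cURL option set for a scraping request.
 * Hypothetical helper; the numeric limits are illustrative assumptions.
 */
function buildCurlOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,  // page to download
        CURLOPT_RETURNTRANSFER => true,  // return the body as a string instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,     // but not indefinitely
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for a connection
        CURLOPT_TIMEOUT        => 30,    // seconds allowed for the whole request
    ];
}

/**
 * Hypothetical helper: fetch a page and return its body, or null on failure.
 */
function fetchPage(string $url): ?string
{
    $curl = curl_init();
    curl_setopt_array($curl, buildCurlOptions($url));
    $body = curl_exec($curl);
    if ($body === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, and so on).
        error_log('cURL error: ' . curl_error($curl));
        return null;
    }
    // The handle is released when $curl goes out of scope.
    return $body;
}
```

Calling fetchPage() with a URL then yields either the page content or null, so the caller can tell a failed request apart from an empty page.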
  • Step 3. Scrape particular sets of data
Now let's use the function from your PHP file and harvest only a certain part of the page. Because CURLOPT_RETURNTRANSFER hands you the page as a string, you can cut out the piece you actually need with string functions such as strpos() and substr().

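if (isset($_POST['submit'])) {
    // Fetch the page with the function defined in scrape.php
    $html = scrapeSiteData($_POST['website_url']);
    // Find the text that opens the section of interest
    $start_point = strpos($html, 'Latest Posts');
    // The end marker is blank in this snippet as published (probably a lost HTML tag);
    // replace '' with the text or tag that closes the section on your target page
    $end_point = strpos($html, '', $start_point);
    // Cut out everything between the two points
    $length = $end_point - $start_point;
    $html = substr($html, $start_point, $length);
    echo $html;
}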
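Since strpos() returns false when a marker is not found, it can help to wrap the extraction in a function that checks for that case. The following is a minimal sketch under stated assumptions: the function name extractBetween(), the sample markup, and the '<footer>' end marker are all hypothetical and only illustrate the technique.

```php
<?php
/**
 * Return the part of $html that starts at $startMarker and ends just before
 * $endMarker, or null when either marker is empty or missing.
 */
function extractBetween(string $html, string $startMarker, string $endMarker): ?string
{
    if ($startMarker === '' || $endMarker === '') {
        return null;
    }
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;
    }
    // Look for the end marker only after the start marker.
    $end = strpos($html, $endMarker, $start + strlen($startMarker));
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// Hypothetical markup and markers for illustration.
$page = '<h2>Latest Posts</h2><ul><li>Post A</li><li>Post B</li></ul><footer>About</footer>';
$section = extractBetween($page, 'Latest Posts', '<footer>');

// Escape the fragment before printing so it shows up as text in a browser.
echo htmlspecialchars((string) $section), PHP_EOL;
```

Returning null instead of a partial string keeps a missing marker from silently producing the wrong fragment.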


Hopefully, you've figured out how to web-scrape and will fine-tune your PHP parsing experience with the code snippets above.
© 2013 - 2020, Semalt.com. All rights reserved