Scraping with Goutte (crawler). Parsing sites using Goutte

Ilya Lyashchuk

Web Developer, Prog-Time

In this post, I will show you a PHP library for parsing (scraping) websites. With it, you can extract any information from a third-party site, follow links, and submit forms automatically.

Connecting the Goutte library and creating a request to the site

I will use my own website as an example. First, we request the main page; after that, we extract elements from it. The code below is therefore needed for every example that follows, and I will not repeat it each time.

/* load the files installed through Composer */
require __DIR__ . "/vendor/autoload.php";

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

/* create the client and request the Prog-Time site */
$client = new Client();
$crawler = $client->request('GET', '');

Getting text information using Goutte

The filter method takes a CSS selector and returns the matching elements. Because the page contains several elements with the home_heading_post class, we iterate over them with the each method.

$crawler->filter('.bottom_list_last_posts .home_link_post .home_heading_post')->each(function ($node) {
    var_dump($node->text());
});

Getting the href attribute of a link

$crawler->filter('.bottom_list_last_posts .home_link_post')->each(function ($node) {
    var_dump($node->attr("href"));
});

Getting the src attribute of an image

$crawler->filter('.bottom_list_last_posts .home_link_post img')->each(function ($node) {
    var_dump($node->attr("src"));
});
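The three extractions above (text, href, src) can be tried without touching a live site. The following is only a sketch: it builds a Symfony DomCrawler Crawler, the class that Goutte returns from request(), from an inline HTML string. The markup is hypothetical and merely imitates the class names used in this post, and symfony/dom-crawler and symfony/css-selector are assumed to be installed through Composer.

```php
<?php
// Sketch under assumptions: the HTML below is invented for illustration,
// it is not the real markup of the Prog-Time site.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<div class="bottom_list_last_posts">
  <a class="home_link_post" href="/post/1">
    <img src="/img/1.png" alt="First">
    <span class="home_heading_post">First post</span>
  </a>
  <a class="home_link_post" href="/post/2">
    <img src="/img/2.png" alt="Second">
    <span class="home_heading_post">Second post</span>
  </a>
</div>
HTML;

$crawler = new Crawler($html);

// each() returns an array built from the closure's return values,
// so text and attributes can be collected in one pass.
$posts = $crawler->filter('.bottom_list_last_posts .home_link_post')->each(
    function (Crawler $node) {
        return [
            'title' => $node->filter('.home_heading_post')->text(),
            'href'  => $node->attr('href'),
            'src'   => $node->filter('img')->attr('src'),
        ];
    }
);

print_r($posts);
```

Returning values from the closure instead of calling var_dump() inside it keeps the scraped data in a plain array that can be saved or processed later.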

Filtering the selection (selecting every other element)

The reduce method takes a callback that decides which elements stay in the selection: an element is kept unless the callback returns false. In my example, the callback keeps every second element; the commented-out line would keep every tenth element instead.

$newListLinks = $crawler->filter('.home_link_post .home_heading_post')
    ->reduce(function ($node, $i) {
        return ($i % 2) == 0;
        // return ($i % 10) == 0;
    })
    ->each(function ($node) {
        return $node->text();
    });
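To see which indexes reduce() keeps, here is a small offline sketch. It assumes symfony/dom-crawler and symfony/css-selector are installed, and it generates six hypothetical list items instead of requesting a real page.

```php
<?php
// Sketch under assumptions: the six <li> elements are generated test data.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<ul>';
for ($n = 1; $n <= 6; $n++) {
    $html .= "<li class=\"home_heading_post\">Post {$n}</li>";
}
$html .= '</ul>';

$crawler = new Crawler($html);

// The index $i starts at 0, so indexes 0, 2 and 4 pass the check.
$titles = $crawler->filter('.home_heading_post')
    ->reduce(function (Crawler $node, $i) {
        return ($i % 2) === 0;
    })
    ->each(function (Crawler $node) {
        return $node->text();
    });

print_r($titles);
```

Because the index is zero-based, the kept elements are the first, third and fifth items of the list.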

Getting an element by its position

The eq method returns the element at the given position. Numbering starts from 0, so in my example we get the fourth element with the class “home_heading_post”.

$itemPost = $crawler->filter('.home_link_post .home_heading_post')->eq(3);

Getting the first and last element

first() returns the first element.
last() returns the last element.

$firstItem = $crawler->filter('.home_link_post .home_heading_post')->first();
$lastItem = $crawler->filter('.home_link_post .home_heading_post')->last();
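The positional methods eq(), first() and last() can be compared side by side in a short sketch. It runs against five hypothetical headings in an inline HTML string and assumes symfony/dom-crawler and symfony/css-selector are installed. It also shows one way to guard against a position that does not exist, since calling text() on an empty selection throws an exception.

```php
<?php
// Sketch under assumptions: the headings are generated test data.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '';
for ($n = 1; $n <= 5; $n++) {
    $html .= "<h2 class=\"home_heading_post\">Post {$n}</h2>";
}

$crawler = new Crawler($html);
$items = $crawler->filter('.home_heading_post');

$fourth = $items->eq(3)->text();   // position 3 is the fourth element
$first  = $items->first()->text();
$last   = $items->last()->text();

// eq() with a position past the end returns an empty selection,
// so check count() before reading text from it.
$missing = $items->eq(10);
$missingText = $missing->count() > 0 ? $missing->text() : null;

var_dump($fourth, $first, $last, $missingText);
```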

Getting sibling elements at the same level of the DOM tree

siblings() returns the elements that share a parent with the selected element.
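A minimal sketch of siblings(), assuming symfony/dom-crawler and symfony/css-selector are installed. The three-item list and the "current" class are hypothetical and exist only for this illustration.

```php
<?php
// Sketch under assumptions: the markup is invented for illustration.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<ul><li>One</li><li class="current">Two</li><li>Three</li></ul>';
$crawler = new Crawler($html);

// siblings() returns the other elements under the same parent;
// the selected element itself is not included.
$neighbours = $crawler->filter('li.current')->siblings()->each(
    function (Crawler $node) {
        return $node->text();
    }
);

print_r($neighbours);
```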


Getting a link by text and clicking on the link

The selectLink() method finds a link by its text, which is passed as the argument.

The link() method returns a Link object for that element. Passing this object to $client->click() follows the link and loads the new page.

The getUri() method of the Link object returns the absolute URI of the link.

$linkPost = $crawler->selectLink('Парсинг на PHP с формированием данных в Excel');
$link = $linkPost->link();
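The sketch below shows selectLink(), link() and getUri() working together offline. It assumes symfony/dom-crawler is installed; the link text, the paths and the base address https://example.com/ are hypothetical. The base address is passed as the second constructor argument because a Link object needs an absolute page URI to resolve relative hrefs (Goutte supplies it automatically after request()).

```php
<?php
// Sketch under assumptions: markup and base URI are invented for illustration.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<a href="/blog/excel-parsing">Parsing in PHP with Excel export</a>'
      . '<a href="/about">About</a>';

// The second argument is the URI of the page the HTML came from.
$crawler = new Crawler($html, 'https://example.com/');

$link = $crawler->selectLink('Parsing in PHP with Excel export')->link();
$uri  = $link->getUri(); // relative href resolved against the page URI

echo $uri, PHP_EOL;

// With a Goutte client, the same Link object can be followed:
// $nextPage = $client->click($link);
```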

Getting an image object

$imagesPost = $crawler->selectImage('Парсинг на PHP с формированием данных в Excel');
$image = $imagesPost->image();
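selectImage() looks an image up by its alt text, and image() returns an Image object whose getUri() gives the absolute address of the file. A small offline sketch, assuming symfony/dom-crawler is installed; the alt text, paths and base address are hypothetical.

```php
<?php
// Sketch under assumptions: markup and base URI are invented for illustration.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<a href="/post/1">'
      . '<img src="/img/excel.png" alt="Parsing in PHP with Excel export">'
      . '</a>';

$crawler = new Crawler($html, 'https://example.com/blog/');

$image = $crawler->selectImage('Parsing in PHP with Excel export')->image();
$src   = $image->getUri();                      // absolute address of the image
$alt   = $image->getNode()->getAttribute('alt'); // underlying DOMElement

var_dump($src, $alt);
```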

Getting child elements

$childrenItems = $crawler->filter('.header_post_list')->children();
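children() returns only the direct child elements of the selection, not deeper descendants. The sketch below makes that visible by listing tag names; it assumes symfony/dom-crawler and symfony/css-selector are installed, and the markup inside .header_post_list is hypothetical.

```php
<?php
// Sketch under assumptions: the markup is invented for illustration.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<div class="header_post_list">
  <h1>Title</h1>
  <p>Intro with a <b>nested</b> element</p>
  <span>Tag</span>
</div>
HTML;

$crawler = new Crawler($html);

// The nested <b> is a grandchild, so it does not appear in the result.
$tagNames = $crawler->filter('.header_post_list')->children()->each(
    function (Crawler $node) {
        return $node->nodeName();
    }
);

print_r($tagNames);
```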

Submitting a form using Goutte

/* load the page that contains the form */
$crawler = $client->request('GET', '');

/* find the form by its submit button ("Отправить" means "Send") */
$form = $crawler->selectButton('Отправить')->form();

/* pass the form values and submit the request */
$crawler = $client->submit($form, ['name' => 'Илья', 'phone' => '+7(999)999-99-99']);
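Before sending a form to a real site, it can help to inspect what would be submitted. The following sketch fills a form offline and reads back its values, method and target address without making a request. It assumes symfony/dom-crawler is installed; the form markup, the "Send" button label, the /send action and the base address are hypothetical.

```php
<?php
// Sketch under assumptions: form markup and URIs are invented for illustration.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<form action="/send" method="post">
  <input type="text" name="name">
  <input type="text" name="phone">
  <button type="submit">Send</button>
</form>
HTML;

$crawler = new Crawler($html, 'https://example.com/contact');

// form() accepts the field values directly, keyed by field name.
$form = $crawler->selectButton('Send')->form([
    'name'  => 'Ilya',
    'phone' => '+7(999)999-99-99',
]);

$values = $form->getValues(); // what would be sent
$method = $form->getMethod(); // HTTP method of the form
$target = $form->getUri();    // absolute address of the action

print_r($values);
echo $method, ' ', $target, PHP_EOL;

// With a Goutte client, the filled form can then be sent:
// $crawler = $client->submit($form);
```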
