Scraping the web with R


A frame from the cartoon “One pea, two peas”, 1981, Soyuzmultfilm

Collecting raw data is part of many analytics tasks, and the Web is often the source. The probability of landing on a fully prepared and well-groomed source is close to zero. You always have to do something to extract the data and put it in order. The encouraging part is that if the necessary information is visible in the browser, then one way or another it can be scraped out of there. In the worst case, by taking a photo of the screen.

Below are three true stories, united by one goal: to get information from an open source. All the code is written "on a napkin" and is purely illustrative and entertaining.

This is a continuation of a series of previous publications.

 

We need to scrape information on paid subsidies. Here is a simple, lightweight website. A cursory study shows that the developers tried hard but forgot one important thing: the "export to Excel" button. Let's charitably assume they forgot or ran out of time. We look further. It is JS with server-side logic; HTML arrives at the client with a table fragment. 1366 pages.

What did the textbook say? Load the page by URL, locate the element by tag, and parse the table? That won't work here… We need event emulation; we need a robot.

Let's fast-forward and jump straight to the answer.

 

Preparing the environment

 

  • Download Selenium 3.x (version 4 does not start yet). We take selenium-server-standalone-x.x.x from the selenium-release.storage archive of builds.
  • Install RSelenium from CRAN.
  • Download the WebDriver for the installed browser versions and put it in the PATH (easier and better to keep it next to the server, since the drivers depend on the browser versions).
  • Start the Selenium Server from cmd with the command java -jar selenium-server-standalone-3.141.59.jar (a connection sketch from R follows below).
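
Once the server is up, connecting from R might look roughly like this (a minimal sketch; the address, port and browser name are assumptions for a default local setup):

library(RSelenium)

# connect to the locally running Selenium Server (default port 4444 assumed)
remDrv <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4444L,
  browserName = "chrome"
)
remDrv$open()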

 

Looking for a point to strike

Open the developer tools in Chrome.

We make trial requests, look at the responses, and mark the interesting spots with a highlighter.

library(rvest)
library(magrittr)

page_url <- "https://subsidies.qoldau.kz/ru/subsidies/recipients?Year=2020"

# probe: can the results table be pulled out statically?
rvest::read_html(page_url) %>%
  html_nodes(xpath = '//*[@class="sw-result-table-container"]') %>%
  html_table()

# probe: locate the pagination links
rvest::read_html(page_url) %>%
  html_nodes(xpath = '//*[@class="page-link" and @aria-label]')

 

Unleashing the "mechanical hound"

code

library(tidyverse)
library(RSelenium)
library(rvest)
library(iterators)
library(foreach)

# remDrv is the remoteDriver connection opened during the environment setup
# open the first page
remDrv$navigate("https://subsidies.qoldau.kz/ru/subsidies/recipients?Year=2020")

lst <- foreach(it = iter(1:1366)) %do% {
  # locate the element containing the table
  tab_elem <- remDrv$findElement(using = "xpath", value = '//*[@class="sw-table-content-wrapper"]')
  df <- read_html(tab_elem$getElementAttribute('innerHTML')[[1]]) %>% 
    html_table() %>%
    # take the table body
    .[[1]]

  # locate the "next page" element
  # matching on the Russian aria-label value works here, so we use it
  next_elem <- remDrv$findElements(using = "xpath", value = '//*[@class="page-link" and @aria-label="Следующая страница"]')[[1]]

  remDrv$mouseMoveToLocation(webElement = next_elem)
  next_elem$click()

  df
}

We get a list of data.frames; the rest is a matter of technique.
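
For example, the per-page tables can be stacked into a single data frame (a sketch; the column names are whatever the site's table provides):

# combine the list of per-page tables into a single data.frame
subsidies_df <- dplyr::bind_rows(lst)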

 

Uncorking the champagne

We need to look into some sociological questions. The German federal elections, 2017. After much agony a great website is found: a wonderful JS interactive, all the information is there, fireworks and firecrackers. Super! Now we'll get everything done quickly.

And then a fly lands in the ointment. A Leaflet interactive with details on more than 5000 objects. The hand holding the robot's leash quietly slips back into the pocket. You want to step outside and watch the schoolchildren coming home from lessons. Have a cup of coffee. And never see this site again, the one that a minute ago seemed like a wonderful find.

Careful, the doors are closing. Did we get on the train or not? Which reality do we live in next: the one where the site is thrown out and we go looking further, or the one where we got everything we wanted?

We'll leave the first branch to the screenwriters. Maybe everything ended very well there, and this failure led, much later, to a huge success. Let's take the second branch.

 

Looking from the bottom

Open the developer tools. Watch the network exchange. Click on a city — yes, a JSON arrives. Containing what? Yes, the election results. That's it! The catch is how these results are addressed: how do you tell which is which?

 

Looking from the top

We make a second pass, this time from the top. Aha, a tile map… and some JSONs arrive. What's in them? Yes, it is the list of all points for which detailed information exists… So, let's cross-check the numbers. That's it, the link is found: these are exactly the IDs for which the detailed JSONs are requested.
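
A quick probe of one such tile confirms it (a sketch; the URL pattern is the one used in the code below, and tile 6-33-21 is just one index from the observed grid):

library(httr)

# fetch a single tile and inspect the first records it carries
resp <- GET("https://interaktiv.morgenpost.de/gewinner_btw2017/grid/6-33-21.json?v=3.0.0")
str(content(resp)[["data"]][1:2])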

 

We carry out the operation

We take out a hammer and a soldering iron: 15 minutes of coding, 1 minute of execution. The result is on the table.

Collecting election results

library(tidyverse)
library(glue)

# by inspecting https://interaktiv.morgenpost.de/gemeindekarte-bundestagswahl-2017/
# we observe the list of tiles being loaded and build it by hand
tiles_df <- tidyr::expand_grid(i = 32:34, j = 20:22) %>%
  mutate(url = glue("https://interaktiv.morgenpost.de/gewinner_btw2017/grid/6-{i}-{j}.json?v=3.0.0"))

# Step 1. collect the list of all towns
loadTile <- function(url){
  resp <- httr::GET(url)
  bind_rows(httr::content(resp)[["data"]])
}

job_df <- tiles_df$url %>%
  purrr::map_dfr(loadTile)

# Step 2. collect the data for each town
loadTown <- function(id){
  resp <-  glue("https://interaktiv.morgenpost.de/",
                "gemeindekarte-bundestagswahl-2017/data/",
                "jsons/{id}.json") %>%
    httr::GET()
  bind_rows(httr::content(resp)) %>%
    mutate(AGS = id)
}

# for illustration, take only the first 10 towns
town_df <- job_df %>%
  slice(1:10) %>%
  pull(AGS) %>%
  purrr::map_dfr(loadTown)

An unexpected third scenario. Read the site's content, in German, all the way to the end. At the bottom there are links to the data sources used to build the site. Follow the links and pick up the tables. Find out a little later that the tables do not reflect all the updates to the administrative divisions.

 

The National Electronic Library. Book monuments. The book "Interpretation on the Apocalypse", 1625. Naturally, you can only look at a digital copy; you won't find it in bookstores, not even second-hand. A unique opportunity!

The book is needed for work. The only problem is that, after a while, reading it exclusively from the screen becomes very painful. You cannot put in bookmarks. It is impossible to print anything properly: the entire page is squeezed into a vertical strip. Take screenshots, print them and glue them together with tape? For every page you need? After a series of experiments it becomes clear that copying the text by hand from the screen would be about as productive. The trouble is that even a large 34″ screen rotated vertically does not help much: it is almost impossible to take in the whole page from a close distance.

Saving the image is also impossible: the preview is stored in low resolution, which is critical for reading the text. A cursory look makes it clear that this is a tile set and that the page itself does not seem to exist as a single image. There are only fragments that the browser assembles together.

Dead end?

We open the developer tools in Chrome and begin to study the network exchange. Looking at several pages gives an approximate understanding of the internal mechanics and of how the tiles are assembled into a page: different zoom levels, different grids.
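
A single tile can be checked directly (a sketch; the URL pattern is the one used in the script below, the tile index is picked arbitrarily):

library(magick)

# pull one 256x256 tile at zoom level 11 and look at its dimensions
tile <- image_read("https://kp.rusneb.ru/tiles/5fd08afffc8ed229eaf02309_files/11/0_0.jpeg")
image_info(tile)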

Now we can pick up the tools. One of the tricks is stitching the tiles into a single picture with ImageMagick. A minute of work, and we have the pages we need at maximum resolution, each as a single graphic file.

scrap_one_page.R

library(tidyverse)
library(magrittr)
library(httr)
library(rvest)
library(stringi)
library(glue)
library(jsonlite)
library(furrr)
library(magick)

n_cores <- parallel::detectCores() - 1

# directories of different types contain 256x256 tiles at different zoom levels
# 10 -> 11 -> 12 (max)
# base url of the page
base_url <- "https://kp.rusneb.ru/tiles/5fd08afffc8ed229eaf02309_files"

# 1. generate a dummy grid (larger than needed; missing tiles are dropped later)
grid_df <- expand_grid(y = 0:12, x = 0:12) %>%
  mutate(img_name = glue("{x}_{y}.jpeg"),
         url = glue("{base_url}/11/{img_name}"),
         fname = here::here("page", img_name)
  )

# 2. download the tiles (switch the plan to multisession for parallel downloads)
# plan(multisession, workers = n_cores)
plan(sequential)
processTile <- function(url){
  purrr::possibly(image_read, otherwise = NULL)(url)
}

img_lst <- grid_df %$%
  future_map(url, processTile)
plan(sequential)

tiles_df <- grid_df %>%
  mutate(img = !!img_lst) %>%
  drop_na(img)

# 3. stitch the tiles together (montage)
image_obj <- purrr::lift_dl(c)(tiles_df$img)

tile_str <- tiles_df %$%
  glue("{n_distinct(x)}x{n_distinct(y)}")

res <- image_montage(image_obj, geometry = "256x256+0+0", tile = tile_str, 
                     bg = 'black', gravity = "North")

# 4. save the assembled page
image_write(res, path = here::here("page", "page.jpg"), format = "jpg")
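
Since the whole point was printing, the assembled page could also be written out as a PDF instead of a JPEG (a sketch, assuming the local ImageMagick build has the PDF delegate enabled):

# write the assembled page as a PDF, ready for printing
image_write(res, path = here::here("page", "page.pdf"), format = "pdf")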

The previous publication is “Refactoring Shiny Applications”.
