Sreality.cz web scraping in R

I have tried scraping data from a real estate site and arranging it in a way that can then easily be filtered and checked using a spreadsheet. I'm actually a little embarrassed that I can't move this R code forward.
Now that I have all the links to the posts, I want to loop through the previously compiled dataframe and get the details from all the URLs.
Could you please help me with it? Thanks a lot.
#Loading the required packages
library(rvest)
library(magrittr) # for the %>% pipe operator
library(RSelenium) # to get the fully loaded html of the dynamic pages
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
#Specifying the url of the page to be scraped
main_link<- paste0(URL.base, i)
# go to website
remDr$navigate(main_link)
# get page source and save it as an html object with rvest
main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the data
name <- html_nodes(main_page, css=".name.ng-binding") %>% html_text()
locality <- html_nodes(main_page, css=".locality.ng-binding") %>% html_text()
norm_price <- html_nodes(main_page, css=".norm-price.ng-binding") %>% html_text()
sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
sreality_url2 <- sreality_url[c(4:24)] # keep only the 21 listing links; apparently the first .title matches on the page are not listings
name2 <- name[c(4:24)]
record <- data.frame(name2, locality, norm_price, sreality_url2)
complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")

I would do this differently:
I would create a function, say 'scraper', that groups together all the scraping steps you have already defined. Then I would build a list with str_c of all the possible links (say 30), and after that run a simple lapply. That said, I would not use RSelenium at all. (Libraries: rvest, stringr, tibble, dplyr.)
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
Here is the URL base; starting from here you should be able to replicate the URL strings for all the pages (1 to whichever) you are interested in, and for all the possible bases (praha, olomouc, ostrava, etc.).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
Here you create all the links according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
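The same trick covers the other bases mentioned above (praha, ostrava, ...); a small sketch, assuming those listing URLs follow the same ?strana= paging pattern:
bases = c('https://www.sreality.cz/hledani/prodej/byty/praha?strana=',
          'https://www.sreality.cz/hledani/prodej/byty/ostrava?strana=')
list.of.pages = unlist(lapply(bases, str_c, 1:30))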
Then define a single function for each piece of data you are interested in; this way you are more precise, error debugging is easier, and so is checking data quality. (I assume your CSS selections are right; otherwise you will obtain empty objects.)
For the names:
name = function(html) {
data = html_nodes(html, css=".name.ng-binding") %>%
html_text()
return(data)
}
For the locality:
locality = function(html) {
data = html_nodes(html, css=".locality.ng-binding") %>%
html_text()
return(data)
}
For the normalized price:
normprice = function(html) {
data = html_nodes(html, css=".norm-price.ng-binding") %>%
html_text()
return(data)
}
For the hrefs:
sreality_url = function(html) {
data = html_nodes(html, css=".title") %>%
html_attr("href")
return(data)
}
Those are the individual functions (the CSS selections, even if I didn't test them, don't look quite right to me, but this gives you the right framework to work with). After that, combine them into a tibble:
get.data.table = function(html){
name = name(html)
locality = locality(html)
normprice = normprice(html)
hrefs = sreality_url(html)
combine = tibble(adtext = name,
loc = locality,
price = normprice,
URL = hrefs)
combine = combine %>%
select(adtext, loc, price, URL)
return(combine)
}
Then the final scraper:
scrape.all = function(urls){
urls %>%
lapply(read_html) %>% # parse each page before extracting from it
lapply(get.data.table) %>%
bind_rows() %>%
write.csv(file = 'MyData.csv')
}
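Calling it on the pages built above then writes everything to one CSV:
scrape.all(list.of.pages)
And for the follow-up in the question (pulling the details from every collected URL), the same framework extends; a sketch only, assuming the hrefs returned by html_attr("href") are site-relative, and with '.params' as a stand-in for whatever selector the detail pages actually use:
get.details = function(href) {
read_html(paste0('https://www.sreality.cz', href)) %>%
html_nodes('.params') %>% # placeholder selector; inspect a detail page to find the real one
html_text()
}
details = lapply(read.csv('MyData.csv')$URL, get.details)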

Related

rvest scraper working, but not returning newest data from website + not returning links

I'm using rvest to scrape the title, date and nested link for Danish parliamentary committee agendas. In general it works fine and I get the data I want, but I have two issues that I hope you can help with. As an example I'm scraping this committee website for the information in the table and the nested links. https://www.ft.dk/da/udvalg/udvalgene/liu/dokumenter/udvalgsdagsordner?committeeAbbreviation=LIU&session=20211
First problem - Missing newest data:
The scraper does not get the newest data even though it is available on the website. For example, on the page linked above there are two entries from June that are not "detected". This is consistent across the other committee pages, where the newest entries are not picked up either.
Q: Does anybody know why the data is not showing up in R even though it is present on the website and have a solution for getting the data?
Second problem - Missing links:
For the particular committee (LIU) linked above, I'm not able to get the full nested links to the agendas, even though this works for all the other committees. Instead it just returns www.ft.dk as the nested link. Until now I have solved it by manually adding every nested link to the dataset, but that is rather time-consuming. Does anybody know why this is not working, and can you help solve it?
Q: How do I get the nested link for the individual committee agenda?
I'm using loops to go through all the different committee pages, but here's the basic code:
library(tidyverse)
library(rvest)
library(httr)
library(dplyr)
library(purrr)
library(stringr)
# base url of Folketinget for committee agendas
base.url <- "https://www.ft.dk/da/udvalg/udvalgene/"
#List of all committees
committee <- c("§71","BEU", "BUU", "UPN", "EPI", "ERU", "EUU", "FIU", "FOU", "FÆU", "GRA", "GRU", "BOU", "IFU", "KIU", "KEF", "KUU", "LIU", "MOF", "REU", "SAU", "SOU", "SUU", "TRU", "UFU", "URU", "UUI", "UFO", "ULØ", "UFS", "UPV", "UER", "UET", "UUF")
## Set up search archives
if (!dir.exists("./DO2011-2022/")) {
dir.create("./DO2011-2022/")
}
search.archive <- "./DO2011-2022/dagsorden_search/"
if (!dir.exists(search.archive)) {
dir.create(search.archive)
}
# empty data set: a zero-row tibble with the desired column names
cols <- c("date", "title", "cmte", "link")
df <- cols %>% t %>% as_tibble(.name_repair = "unique") %>% `[`(0, ) %>% rename_all(~cols)
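# For readability, the same empty tibble could be built directly:
# df <- tibble(date = character(), title = character(), cmte = character(), link = character())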
## Set up main date parameters
first.yr <- 2011
last.yr <- 2022
session <- 1:2
# main loop over committees
start <- Sys.time() # needed for the timing at the end
for (i in committee) {
for(current.yr in first.yr:last.yr) {
for(j in session) {
print(paste("Working on committee:", i, "Year", current.yr, "session", j))
result.page <- 1
## INTERIOR LOOP OVER SEARCH PAGES
repeat {
# build archive file name
file.name <- paste0(search.archive, i,
current.yr, "session", j,
"-page-",
result.page,
".html")
# construct url to pull
final.url <- paste0(base.url,i, "/dokumenter/udvalgsdagsordner?committeeAbbreviation=", i,
"&session=", current.yr, j, "&pageSize=200&pageNumber=", result.page)
# check archive / pull in page
#Fix problem with missing data from the 2021 pages - it's because newly added entries are not in the previously downloaded archive files.
if(!current.yr == 2021){
if (file.exists(file.name)) {
page <- read_html(x = file.name)
} else {
page <- read_html(final.url)
tmp <- page %>% as.character
#Sys.sleep(3 + rpois(lambda = 2, n = 1))
write(x = tmp, file = file.name)
}
}
else{
page <- read_html(final.url)
tmp <- page %>% as.character
Sys.sleep(5)
write(x = tmp, file = file.name)
}
# only grab length of results once
if (result.page == 1) {
# get total # search results
total.results <- page %>%
html_nodes('.pagination-text-container-top .results') %>%
html_text(trim = T) %>%
str_extract("[[:digit:]]*") %>%
as.numeric
# break out of loop if no results on page (typical for session=2)
if (length(total.results) == 0) break
# count search pages to visit (NB: 200 = number of results per page)
count.pages <- ceiling(total.results / 200)
# print total results to console
print(paste("Total of", total.results, "for committee", i))
}
if(i == "FOU"|i == "GRU"){
titles <- page %>% html_nodes('.column-documents:nth-child(1) .column-documents__icon-text') %>% html_text(trim = T)
}
else{
titles <- page %>% html_nodes('.highlighted+ .column-documents .column-documents__icon-text') %>% html_text(trim = T) }
dates <- page %>% html_nodes('.highlighted .column-documents__icon-text') %>% html_text(trim = T)
# Solution to problem with links for LIU
if(i == "LIU"){
links <- page %>% html_nodes(".column-documents__link") %>% html_attr('href') %>% unique()
}
else{
links <- page %>% html_nodes(xpath = "//td[@data-title = 'Titel']/a[@class = 'column-documents__link']") %>% html_attr('href')
}
links <- paste0("https://www.ft.dk", links)
# build data frame from data
df <- df %>% add_row(
date = dates,
title = titles,
cmte = i,
link = links)
## BREAK LOOP when result.page == length of search result pages by year
if (result.page == count.pages) break
## iterate search page by ONE
result.page <- result.page + 1
} #END PAGE LOOP
} #END SESSION LOOP
} #END YEAR LOOP
} #END COMMITTEE LOOP
end <- Sys.time()
#Scraping time
end - start
If I alternatively use selectorgadget instead of xpath to get the links, I get the following error:
Error in tokenize(css) : Unclosed string at 42
links <- page %>% html_nodes(".highlighted .column-documents__icon-text']") %>% html_attr('href')
Thanks in advance.

Rvest: Building a queue for several html-links

I'm currently web-scraping a news magazine but, unfortunately, I don't have a clue how to establish a working queue. I can scrape the content of all articles on one page, but I want a queue that automatically does the same thing for the remaining pages of articles.
library(rvest)
library(tidyverse)
library(data.table)
library(plyr)
library(writexl)
map_dfc(.x = c("em.entrylist__title", "time.entrylist__time"),
.f = function(x) {read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>%
html_nodes(x) %>%
html_text()}) %>%
bind_cols(url = read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>%
html_nodes("a.entrylist__link") %>%
html_attr("href")) %>%
setNames(nm = c("title", "time", "url")) -> temp
map_df(.x = temp$url[1:50],
.f = function(x){tibble(url = x,
text = read_html(x) %>%
html_nodes("#article-app-container > article > div.css-isuemq.e1lg1pmy0 > p:nth-child(n)") %>%
html_text() %>%
list
)}) %>%
unnest(text) -> foo
foo
X2 <- ddply(foo, .(url), summarize,
Xc=paste(text,collapse=","))
final <- merge(temp, X2, by="url")
In this case there are 30 pages filled with articles, but my script only supports scraping a single page.
The only thing that changes between the pages is the page number (https://www.sueddeutsche.de/news/page/1?search=...)
If you could give me a hint on how to include all pages into the queue at once, I would be more than grateful. Thanks a lot :)
How would a queue in dataframe form work for you?
The following suggestion is kept a little more generic, so it will work beyond this specific use-case. You can add more URLs to scrape as you go, but only new ones will be kept, thanks to dplyr::distinct.
(I've initiated the queue to hold the first 5 pages you want to scrape; you can add more right away, or dynamically if you find links in the DOM...)
library(dplyr)
library(lubridate)
queue <- tibble(
url = paste0("https://www.sueddeutsche.de/news/page/", 1:5, "?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020"),
scraped_time = lubridate::NA_POSIXct_
)
results <- list()
while(length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
i <- open_rows[1]
url <- queue$url[i]
[...]
results[[url]] <- <YOUR SCRAPING RESULT>
queue$scraped_time[i] <- lubridate::now()
if (<MORE PAGES TO QUEUE>) {
queue <- queue %>%
tibble::add_row(url = c('www.spiegel.de', 'www.faz.de')) %>%
arrange(desc(scraped_time)) %>%
distinct(url, .keep_all = T)
}
}
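Filled in for the Süddeutsche case, the loop could look roughly like this (a sketch reusing the selectors from the question; it assumes the three node sets have equal lengths on every page, and it skips the dynamic re-queueing part):
library(rvest)
library(tibble)
while (length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
  i <- open_rows[1]
  page <- read_html(queue$url[i])
  # one row per article on this results page
  results[[queue$url[i]]] <- tibble(
    title = page %>% html_nodes("em.entrylist__title") %>% html_text(),
    time = page %>% html_nodes("time.entrylist__time") %>% html_text(),
    url = page %>% html_nodes("a.entrylist__link") %>% html_attr("href")
  )
  queue$scraped_time[i] <- lubridate::now()
}
all_articles <- dplyr::bind_rows(results)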

How to download multiple files with the same name from html page?

I want to download all the files named "listings.csv.gz" which refer to US cities from http://insideairbnb.com/get-the-data.html. I can do it by writing out each link individually, but is it possible to do it in a loop?
In the end I'll keep only a few columns from each file and merge them into one file.
Since the problem was solved thanks to @CodeNoob, I'd like to share how it all worked out:
library(rvest)
library(dplyr)
library(purrr)
page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
html_nodes("a") %>%
html_attr("href")
# Filter for listings.csv.gz, USA cities, data for March 2019
wanted <- grep('listings.csv.gz', links)
USA <- grep('united-states', links)
wanted.USA = wanted[wanted %in% USA]
wanted.links <- links[wanted.USA]
wanted.links = grep('2019-03', wanted.links, value = TRUE)
wanted.cols = c("host_is_superhost", "summary", "host_identity_verified", "street",
"city", "property_type", "room_type", "bathrooms",
"bedrooms", "beds", "price", "security_deposit", "cleaning_fee",
"guests_included", "number_of_reviews", "instant_bookable",
"host_response_rate", "host_neighbourhood",
"review_scores_rating", "review_scores_accuracy","review_scores_cleanliness",
"review_scores_checkin" ,"review_scores_communication",
"review_scores_location", "review_scores_value", "space",
"description", "host_id", "state", "latitude", "longitude")
read.gz.url <- function(link) {
con <- gzcon(url(link))
df <- read.csv(textConnection(readLines(con)))
close(con)
df <- df %>% select(wanted.cols) %>%
mutate(source.url = link)
df
}
all.df = list()
for (i in seq_along(wanted.links)) {
all.df[[i]] = read.gz.url(wanted.links[i])
}
all.df = map(all.df, as_tibble)
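To finish the merge mentioned at the start, the list of tibbles can be bound into one frame and written out (the file name here is just an example):
combined <- bind_rows(all.df)
write.csv(combined, file = "listings_usa_2019-03.csv", row.names = FALSE)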
You can actually extract all links, filter for the ones containing listings.csv.gz and then download these in a loop:
library(rvest)
library(dplyr)
# Get all download links
page <- read_html("http://insideairbnb.com/get-the-data.html")
# Get all hrefs (i.e. all links present on the website)
links <- page %>%
html_nodes("a") %>%
html_attr("href")
# Filter for listings.csv.gz
wanted <- grep('listings.csv.gz', links)
wanted.links <- links[wanted]
for (link in wanted.links) {
con <- gzcon(url(link))
txt <- readLines(con)
df <- read.csv(textConnection(txt))
# Do what you want
}
Example: Download and combine the files
To get the result you want, I would suggest writing a download function that filters for the columns you want and then combining the results into a single dataframe, for example something like this:
read.gz.url <- function(link) {
con <- gzcon(url(link))
df <- read.csv(textConnection(readLines(con)))
close(con)
df <- df %>% select(c('calculated_host_listings_count_shared_rooms', 'cancellation_policy')) %>% # random columns I chose
mutate(source.url = link) # you may need to remember the origin of each row
df
}
all.df <- do.call('rbind', lapply(head(wanted.links,2), read.gz.url))
Note: I only tested this on the first two files, since they are pretty large.

need help in extracting the first google search result using html_node in R

I have a list of hospital names for which I need to extract the first google search URL. Here is the code I'm using
library(rvest)
library(urltools)
library(RCurl)
library(httr)
getWebsite <- function(name) {
url <- URLencode(paste0("https://www.google.com/search?q=", name))
page <- read_html(url)
results <- page %>%
html_nodes("cite") %>%
html_text()
result <- results[1]
return(as.character(result))
}
websites <- data.frame(Website = sapply(c, getWebsite)) # 'c' here holds the hospital names; a name that doesn't shadow base::c would be safer
View(websites)
For short URLs this code works fine, but when the link is long it appears in R with "..." (e.g. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html), and it lands in the dataframe the same way, with the "...". How can I extract the actual URLs without the "..."? I appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
html_nodes(".r a") %>% # get the a nodes with an r class
html_attr("href") # get the href attributes
#clean the text: keep the target address between '/url?q=' and the first '&'
links = gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
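Folding that cleaning step back into the getWebsite function from the question gives something like this (a sketch; it assumes Google's result markup still exposes .r a nodes, which Google changes frequently):
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  links <- page %>%
    html_nodes(".r a") %>%
    html_attr("href")
  # keep the target address between '/url?q=' and the first '&'
  links <- gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
  links[1] # the full first-result URL, with no "..." truncation
}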

R Selenium (or rvest): How to scrape tables in sub(sub)pages listed in a main page

RSelenium
I quite often need to scrape and analyze public data on health-care contracts, and I have partially automated this in VBA.
I probably deserve a couple of minuses: although I spent the last night trying to set up RSelenium, I only succeeded in firing up the server and running some examples that copy single tables to dataframes. I am a beginner in web scraping.
I am working with a dynamically generated site.
https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false
I deal with three levels of pages:
Level 1
My top pages have the following structure (column A contains links, at the bottom there are pages):
========
A, B, C
link_A,15,10
link_B,23,12
link_c,21,12
link_D,32,12
========
1,2,3,4,5,6,7,8,9,...
======================
I have just learned SelectorGadget, which indicates:
Table: .table-striped
Pagination (1,2,3,4,5,6,7): .pagination-container
Level 2 Under each link (link_A, link_B) in the table there is a subpage which contains a table. Example: https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009
============
F, G, H
link_agreements,34,23
link_agreements,23,23
link_agreements,24,24
============
SelectorGadget indicates
.table-striped
Level 3 Again, under each link (link_agreements) there is another sub-subpage with the data that I want to collect:
https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009&AgreementTechnicalCode=761176
============
X,Y,Z
orthopedics, 231,323
traumatology, 323,248
hematology, 323,122
Again, SelectorGadget indicates
.table-striped
I would like to iteratively collect all the subpages into a data frame that would look like:
Info from top page; info from sub-subpages
link_A (from top page); 15 (value from column B); orthopedics, 231, 323
link_A (from top page); 15 (value from column B); traumatology, 323, 248
link_A (from top page); 15 (value from column B); hematology, 323, 122
Is there a cookbook, or some good examples for RSelenium or rvest, showing how to iterate through the links in the tables and get the data from the sub(sub)-pages into a dataframe?
I would appreciate any info, an example, any hints, or a book indicating how to do it with RSelenium or any other scraping package.
P.S. Warning: I am also encountering invalid SSL certificate issues with this page. I am working with the Firefox Selenium driver, so each time I need to manually skip the warning - but that is for another topic.
P.S. Here is the code I tried so far, which I found to be a dead end.
install.packages("RSelenium")
install.packages("wdman")
library(RSelenium)
library(wdman)
library(XML)
Next I started Selenium. I immediately had "Java 8 present, Java 7 needed" issues, solved by removing all java?.exe files from Windows/System32 or SysWOW64.
library(wdman)
library(XML)
selServ <- selenium(verbose = TRUE) #installs selenium
selServ$process
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4567
, browserName = "firefox")
remDr$open(silent = F)
remDr$navigate("https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=17480&OW=13&OrthopedicSupply=False&Code=130000111&AgreementTechnicalCode=773979")
webElem <- remDr$findElement(using = "class name", value = "table-striped")
webElemtxt <- webElem$getElementAttribute("outerHTML")[[1]]
table <- readHTMLTable(webElemtxt, header=FALSE, as.data.frame=TRUE)[[1]]
webElem$clickElement()
webElem$sendKeysToElement(list(key="tab",key="enter"))
Here my struggle with RSelenium ended. I could not send keys to Chrome, and I could not work with Firefox because it demanded correct SSL certificates that I could not effectively bypass.
table <- list()
library(rvest)
# PRIMARY TABLE EXTRACTION
for (i in 1:10){
url <- paste0("https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=", i)
page <- html_session(url)
table[[i]] <- html_table(page)[[1]]
}
library(data.table)
primary_table<-rbindlist(table,fill=TRUE)
# DATA CLEANING REQUIRED IN PRIMARY TABLE to clean the variable
# `Kod Sortuj według kodu świadczeniodawcy`
# Clean it and store it back in the primary_table column; only then will the secondary table extraction work.
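# A sketch of that cleaning step (an assumption: the extra text sits after a
# carriage return inside each cell, which is also what the rm_extra helper in
# the answer below assumes):
primary_table[[2]] <- gsub("\r.*$", "", primary_table[[2]])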
#SECONDARY TABLE EXTRACTION
for (i in 1:10){
url <- paste0("https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=", primary_table[i,2])
page <- html_session(url)
table[[i]] <- html_table(page)[[1]]
# this is the key by which you can identify whose secondary table this is
table[[i]][1,1] <- primary_table[i,2]
}
secondary_table<-rbindlist(table,fill=TRUE)
Here is the answer I developed, based on hrbrmstr's aid: rvest: extract tables with url's instead of text
Practically all the credit goes to him. I modified his code to deal with subpages. I am also grateful to Bharath. My code works, but it may be very untidy. I hope it will be adaptable for others. Feel free to simplify the code or propose changes.
library(rvest)
library(tidyverse)
library(stringr)
# error: Peer certificate cannot be authenticated with given CA certificates
# https://stackoverflow.com/questions/40397932/r-peer-certificate-cannot-be-authenticated-with-given-ca-certificates-windows
library(httr)
set_config(config(ssl_verifypeer = 0L))
# Helpers
# First based on https://stackoverflow.com/questions/35947123/r-stringr-extract-number-after-specific-string
# str_extract(myStr, "(?i)(?<=ProviderID\\D)\\d+")
get_id <-
function (x, myString) {
require(stringr)
str_extract(x, paste0("(?i)(?<=", myString, "\\D)\\d+"))
}
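# e.g. get_id("...GetAgreements?ROK=2017&ProviderId=20799&OW=15", "ProviderId")
# should return "20799" (an illustrative URL fragment, not one from the data)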
rm_extra <- function(x) { gsub("\r.*$", "", x) }
mk_gd_col_names <- function(x) {
tolower(x) %>%
gsub(" +", "_", .) # turn runs of spaces into underscores
}
URL <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=%d"
get_table <- function(page_num = 1) {
pg <- read_html(httr::GET(sprintf(URL, page_num)))
tab <- html_nodes(pg, "table")
html_table(tab)[[1]][,-c(1,11)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
mutate(provider_id=get_id(link,"ProviderID")) %>%
as_tibble()
}
pb <- progress_estimated(10)
map_df(1:10, function(i) {
pb$tick()$print()
get_table(page_num = i)
}) -> full_df
#===========level 2===============
# %26 escapes "&"
URL2a <- "https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId="
URL2b <- "&OW=15&OrthopedicSupply=False&Code="
paste0(URL2a,full_df[1,11],URL2b,full_df[1,1])
get_table2 <- function(page_num = 1) {
pg <- read_html(httr::GET(paste0(URL2a,full_df[page_num,11],URL2b,full_df[page_num,1])))
tab <- html_nodes(pg, "table")
html_table(tab)[[1]][,-c(1,8)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
mutate(provider_id=get_id(link,"ProviderID")) %>%
mutate(technical_code=get_id(link,"AgreementTechnicalCode")) %>%
as_tibble()
}
pb <- progress_estimated(nrow(full_df))
map_df(1:nrow(full_df), function(i) {
pb$tick()$print()
get_table2(page_num = i)
}) -> full_df2
#===========level 3===============
URL3a <- "https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId="
URL3b <- "&OW=15&OrthopedicSupply=False&Code=150000001&AgreementTechnicalCode="
paste0(URL3a,full_df2[1,8],URL3b,full_df2[1,9])
get_table3 <- function(page_num = 1) {
pg <- read_html(httr::GET(paste0(paste0(URL3a,full_df2[page_num,8],URL3b,full_df2[page_num,9]))))
tab <- html_nodes(pg, "table")
provider <- as.numeric(full_df2[page_num,8])
html_table(tab)[[1]][,-c(1,8)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(provider_id=provider) %>%
as_tibble()
}
pb <- progress_estimated(nrow(full_df2)+1)
map_df(1:nrow(full_df2), function(i) {
pb$tick()$print()
get_table3(page_num = i)
} ) -> full_df3
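To assemble the combined table sketched in the question (top-page info next to the sub-subpage rows), the three frames can be joined on the extracted ids. A sketch, assuming provider_id is enough to link the levels; note that get_table3 stores provider_id as numeric, so it is converted back to character for the join, and that joining full_df2 to full_df3 on the technical code as well would be tighter if get_table3 carried AgreementTechnicalCode through:
final_df <- full_df %>%
  inner_join(full_df2, by = "provider_id") %>%
  inner_join(full_df3 %>% mutate(provider_id = as.character(provider_id)),
             by = "provider_id")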
