I am trying to crawl this page http://www.funda.nl/en/koop/leiden/ to get the maximum page number it shows, which is 29. I followed an online tutorial, located where 29 appears in the HTML, and wrote this R code:
url <- read_html("http://www.funda.nl/en/koop/leiden/")
url %>% html_nodes("#pagination-number.pagination-last") %>% html_attr("data-pagination-page") %>% as.numeric()
However, what I get is numeric(0), and if I remove as.numeric(), I get character(0).
How is this done?
I believe that both your identification of the HTML element and your parsing of it are wrong. To easily find the right CSS selector, you can use a Chrome extension called SelectorGadget. In your case, some parsing of the extracted text is also needed, which is handled by str_extract_all().
This will work:
url <- read_html("http://www.funda.nl/en/koop/leiden/")
pagination.last <- url %>%
  html_node(".pagination-last") %>%
  html_text() %>%
  stringr::str_extract_all("[:number:]{1,2}", simplify = TRUE) %>%
  as.numeric()
> pagination.last
[1] 29
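To see why the original call returned character(0): the selector in the question simply matches no nodes, so there is nothing for html_attr() to read. A quick diagnostic (a minimal sketch, reusing the url object from the question):
# count how many nodes the original selector matches; an empty node set
# explains the character(0) / numeric(0) results downstream
url %>% html_nodes("#pagination-number.pagination-last") %>% length()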
You might find this other question helpful as well: R: Rvest - got hidden text i don't want
I've been dealing with the same issue, and this worked for me (it uses last() from dplyr and str_extract() from stringr):
> url = "http://www.funda.nl/en/koop/leiden/"
> last_page <-
+ last(read_html(url) %>%
+ html_nodes(css = ".pagination-pages") %>%
+ html_children()) %>%
+ html_text(trim = T) %>%
+ str_extract("[0-9]+") %>%
+ as.numeric()
> last_page
[1] 23
I am trying to scrape Table 1 from the following website using rvest:
https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/
Following is the code I have written:
link <- "https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/"
page <- read_html(link)
page %>% html_nodes("iframe") %>% html_attr("src") %>% .[11] %>% read_html() %>%
  html_nodes("table.medium datawrapper-g2oKP-6idse1 svelte-1vspmnh resortable")
But I get {xml_nodeset (0)} as the result. I am struggling to figure out the correct tag to pass to html_nodes() on the Datawrapper page to extract Table 1.
I would be really grateful if someone could point out the mistake I am making or suggest a solution for scraping this table.
Many thanks.
The data is present in the iframe but needs a little manipulation. It is easier, for me at least, to construct the CSV download URL from the iframe page and then request that CSV:
library(rvest)
library(magrittr)
library(vroom)
library(stringr)
page <- read_html('https://www.kff.org/coronavirus-covid-19/issue-brief/u-s-international-covid-19-vaccine-donations-tracker/')
# grab the src of the Datawrapper iframe whose title starts with "Table 1"
iframe <- page %>% html_element('iframe[title^="Table 1"]') %>% html_attr('src')
# the chart id is embedded in a <meta> tag on the iframe page
id <- read_html(iframe) %>% html_element('meta') %>% html_attr('content') %>% str_match('/(\\d+)/') %>% .[, 2]
csv_url <- paste(iframe, id, 'dataset.csv', sep = '/')
data <- vroom(csv_url, show_col_types = FALSE)
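For what it's worth, vroom is not essential here; once csv_url is built, the same file could be read with base R instead (a small sketch under the same assumptions as the code above):
# read the constructed CSV URL with base R instead of vroom
data <- read.csv(csv_url, check.names = FALSE)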
I am completely new to scraping and am using a Windows 10 PC. I am trying to run this code from class to scrape the content of the party platforms from the URLs below:
library(RCurl)

years <- c(1968, 1972, 1976)
urlsR <- paste("https://maineanencyclopedia.com/republican-party-platform-",
               years, "/", sep = '')
urlsD <- paste("https://maineanencyclopedia.com/democratic-party-platform-",
               years, "/", sep = '')
urls <- c(urlsR, urlsD)
scraped_platforms <- getURL(urls)
When I run scraped_platforms, the result is what is shown below rather than the content of the party platforms from the website.
https://maineanencyclopedia.com/republican-party-platform-1968/
""
https://maineanencyclopedia.com/republican-party-platform-1972/
""
https://maineanencyclopedia.com/republican-party-platform-1976/
""
https://maineanencyclopedia.com/democratic-party-platform-1968/
""
https://maineanencyclopedia.com/democratic-party-platform-1972/
""
https://maineanencyclopedia.com/democratic-party-platform-1976/
""
I've seen that Windows 10 might be incompatible with getURL() (see: How to get getURL to work on R on Windows 10? [tlsv1 alert protocol version]). Even after looking online, though, I'm still unclear on how to fix my specific code.
List of links used here:
https://maineanencyclopedia.com/republican-party-platform-1968/
https://maineanencyclopedia.com/republican-party-platform-1972/
https://maineanencyclopedia.com/republican-party-platform-1976/
https://maineanencyclopedia.com/democratic-party-platform-1968/
https://maineanencyclopedia.com/democratic-party-platform-1972/
https://maineanencyclopedia.com/democratic-party-platform-1976/
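Not an answer to the getURL() problem itself, but as a point of comparison: the same pages can be fetched without RCurl by using httr, which may sidestep the Windows TLS issue. A minimal sketch, reusing the urls vector built above (the result is raw HTML text that still needs parsing):
library(httr)

# fetch each platform page as a raw HTML string, named by URL
scraped_platforms <- vapply(
  urls,
  function(u) content(GET(u), as = "text", encoding = "UTF-8"),
  character(1)
)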
I don't know the getURL() function, but in R there is a very handy package for scraping: rvest.
You can just use your urls object, which holds all the URLs, and loop over it:
library(rvest)
library(dplyr)

df <- tibble(Title = NULL,
             Text = NULL)

for (url in urls){
  t <- read_html(url) %>% html_nodes(".entry-title") %>% html_text2()
  p <- read_html(url) %>% html_nodes("p") %>% html_text2()
  tp <- tibble(Title = t,
               Text = p)
  df <- rbind(df, tp)
}
df
This output is a bit unorganized, but you can adjust the for loop to get something a bit nicer.
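As a side note, growing a data frame with rbind() inside a loop becomes slow once there are many URLs; an equivalent, loop-free version could use purrr (a sketch assuming the same urls vector; purrr is an extra dependency the original does not use):
library(purrr)

# one tibble per URL, row-bound into a single data frame
df <- map_dfr(urls, function(url) {
  page <- read_html(url)
  tibble(
    Title = page %>% html_nodes(".entry-title") %>% html_text2(),
    Text  = page %>% html_nodes("p") %>% html_text2()
  )
})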
Here is also a slightly nicer presentation of the data:
df2 <- df %>%
  group_by(Title) %>%
  slice(-1) %>%
  mutate(Text_all = paste0(Text, collapse = "\n")) %>%
  dplyr::select(-Text) %>%
  distinct()
df2
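The same result can probably be reached a little more directly with summarise(), since only one collapsed text per title is needed (a sketch that keeps the slice(-1) step from above):
df2 <- df %>%
  group_by(Title) %>%
  slice(-1) %>%
  summarise(Text_all = paste0(Text, collapse = "\n"))
df2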
I tried to scrape the data for each country from interactive pie charts here: https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types
But SelectorGadget does not let me select the data points on the pie charts. How do I resolve this?
library(rvest)
library(dplyr)
link = "https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types"
page = read_html(link)
percentage = page %>% html_nodes("#content_types div") %>% html_text()
"#content_types div" returns void.
If you inspect the page and look at the "Network" tab, you can see the API call being made to get the data.
The end parameter is the last millisecond of today.
There is some junk at the beginning of the response, but the rest is JSON.
You'll have to figure out what the category numbers mean.
Maybe there is documentation for the API somewhere.
library(magrittr)

link <- "https://transparencyreport.google.com/transparencyreport/api/v3/"
# end = the last millisecond of today, in epoch milliseconds
parms <- paste0("europeanprivacy/siteinfo/urlsbycontenttype?start=1453420800000&end=",
                1000 * ((Sys.Date() + 1) %>% as.POSIXct() %>% as.numeric()) - 1)
page <- httr::GET(paste0(link, parms))
# drop the junk before the JSON payload, then parse it
data <- page %>% httr::content(as = "text") %>%
  substr(., regexpr("\\[\\[.*\\]\\]", .), .Machine$integer.max) %>%
  jsonlite::fromJSON() %>% .[[1]] %>% .[[2]] %>% as.data.frame()
I want to download the file that is in the tab "Dossier" with the text "Modul 4" here:
https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier
First I want to get the link.
My code for that is the following:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(".gba-download__text") %>%
.[[4]] %>%
html_attr("href")
(I know the .[[4]] piece is not really good; this is not my full code.)
This yields NA, and I don't understand why.
Similar questions didn't help here.
Allan has already left a concise answer, but let me offer another way. If you check the page source, you can see that the target is in .gba-download-list. (There are actually two of them.) So grab that part and walk down to the href attributes. Once you have the URLs, you can use grep() to identify the links containing Modul4. I used unique() at the end to remove a duplicate.
read_html("https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier") %>%
html_nodes(".gba-download-list") %>%
html_nodes("a") %>%
html_attr("href") %>%
grep(pattern = "Modul4", value = TRUE) %>%
unique()
[1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
It's easier to get to a specific node if you use XPath:
library(rvest)
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes(xpath = "//span[contains(text(),'Modul 4')]/..") %>%
.[[1]] %>%
html_attr("href")
#> [1] "/downloads/92-975-67/2011-12-05_Modul4A_Apixaban.pdf"
I have another solution now and want to share it:
"https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier" %>%
read_html %>%
html_nodes("a.download-helper") %>%
html_attr("href") %>%
.[str_detect(., "Modul4")] %>%
unique
It is faster to use a CSS attribute selector with the contains operator (*=) to target the href by substring. In addition, only a single node match needs to be returned:
library(rvest)
url <- "https://www.g-ba.de/bewertungsverfahren/nutzenbewertung/5/#dossier"
link <- read_html(url) %>%
  html_node("[href*='Modul4']") %>%
  html_attr("href") %>%
  url_absolute(url)
I have been trying to use this question and this tutorial to get the table and links for the list of available R packages on CRAN.
Getting the HTML table
I got that working with this:
library(rvest)

page <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("table") %>%
  html_table(fill = TRUE, header = FALSE)
Trying to get the links
Getting the links is where I run into trouble. I tried using SelectorGadget on the first column of the table (the package links) and got the node td a, so I tried this:
test2 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("td a") %>%
  html_attr("href")
But I only get the first link. I then thought I could get all the hrefs from the table and tried the following:
test3 <- read_html("http://cran.r-project.org/web/packages/available_packages_by_name.html") %>%
  html_node("table") %>%
  html_attr("href")
but got nothing. What am I doing wrong?
Essentially, an "s" is missing: use html_nodes() instead of html_node():
x <- read_html(paste0(
  "http://cran.r-project.org/web/",
  "packages/available_packages_by_name.html"))

html_nodes(x, "td a") %>%
  sapply(html_attr, "href")
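As a small aside, html_attr() is vectorised over a node set, so the sapply() call is not strictly necessary; this sketch should return the same character vector of hrefs:
# html_attr() applied directly to the node set
html_nodes(x, "td a") %>%
  html_attr("href")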