R software - rvest package, error in "download number"

I want to download Amazon book review counts, but I have one problem.
I tried the following:
library(rvest)
url <- paste0("http://www.amazon.com/s/ref=lp_4_nr_p_72_3?",
              "fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2C",
              "n%3A4%2Cp_72%3A1250224011&bbn=4&ie=UTF8&qid",
              "=1440446201&rnid=1250219011")
html <- html(url)
Reviews <- try({html_nodes(html, "#s-results-list-atf .a-text-normal:nth-child(2)") %>%
  html_text()}, silent = TRUE)
But I only get 4 review counts in my R console instead of 12 (using SelectorGadget). What did I do wrong?
When I downloaded the books' names I didn't have the same problem... only with the review counts.
Book <- try({ html_nodes(html, ".s-access-title") %>%
html_text()}, silent = TRUE)

This is probably not the canonical approach, but here's what I did that works:
#via Inspect element in Chrome, the relevant info is
# in an <a> tag with class 'a-size-small a-link-normal a-text-normal'
# but this does not uniquely identify the review counts
# (e.g., the $12.00 Buy used & new... bit is also there)
# so we take a step up and find that both the rating
# and the review count are stored in a <div> tag
# with class 'a-row a-spacing-mini'
x<-html(url) %>% html_nodes("div.a-row.a-spacing-mini") %>%
html_nodes("a.a-size-small.a-link-normal.a-text-normal") %>%
html_text
#upon inspection of x, we can see that the relevant numbers
# always appear by themselves, thus:
> x[!is.na(as.integer(gsub(",","",x)))]
[1] "168" "232" "1,607" "2,226" "1,060" "25" "731" "2,374" "345" "7,205"
[11] "1,134" "1,137"

Related

Web Scraping with R: problem with "data.frame" function and number of rows

Briefly, I want to scrape information about movies from this site. I was using SelectorGadget and wrote this code:
library(dplyr)
library(tidyverse)
library(rvest)
library(readr)
library(purrr)
link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=adventure&sort=user_rating,desc"
page = read_html(link)
film_name = page %>% html_nodes(".lister-item-header a") %>% html_text()
year = page %>% html_nodes(".text-muted.unbold") %>% html_text()
rating = page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
gross_income = page %>% html_nodes(".ghost~ .text-muted+ span") %>% html_text()
duration = page %>% html_nodes(".runtime") %>% html_text()
IMDB_Adventure_Movies_Rank = data.frame(film_name, year, rating, duration, gross_income, stringsAsFactors = FALSE)
The R console gives the following error:
Error in data.frame(film_name, year, rating, duration, gross_income, stringsAsFactors = FALSE) :
  arguments imply differing number of rows: 50, 44
The error is due to the fact that, on the website, 6 films out of 50 do not have the income reported.
I have tried this solution, but the values do not end up in the correct order, since R assigns the wrong income to each film:
length(gross_income) = length(film_name)
My question is: how can I create a table where, if a film does not have the income reported, R returns something like NA or NULL instead of giving an error?
I saw that someone had the same problem and the solution was to use the purrr package and the possibly() function. However, I am new to R and I can't understand that answer or how to use possibly().
We can get the income of the movies with:
link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=adventure&sort=user_rating,desc"
df = read_html(link) %>% html_nodes('#main div div.lister.list.detail.sub-list div div.lister-item-content p.sort-num_votes-visible') %>% html_text()
[1] "\n Votes:\n 1,766,474\n | Gross:\n $377.85M\n \n "
[2] "\n Votes:\n 1,788,217\n | Gross:\n $315.54M\n \n "
[3] "\n Votes:\n 2,253,349\n | Gross:\n $292.58M\n \n "
[4] "\n Votes:\n 1,595,898\n | Gross:\n $342.55M\n \n "
This gives the votes and income for each movie; we can then extract the income with a regex.
library(stringi)
stri_extract_first_regex(df, "(?<=\\$).*")
[1] "377.85M" "315.54M" "292.58M" "342.55M" "6.10M" "188.02M" "290.48M" "10.06M" "210.61M" "322.74M" "678.82M" NA "187.71M" "422.78M" "190.24M"
[16] "858.37M" "209.73M" "223.81M" "2.38M" "85.16M" "248.16M" "47.70M" "293.00M" "415.00M" "120.54M" "191.80M" "197.17M" "309.13M" NA "56.95M"
[31] "44.82M" "13.28M" NA NA "1.43M" "356.46M" "381.01M" "4.71M" "380.84M" "402.45M" "1.23M" "12.10M" "44.91M" NA "5.01M"
[46] "1.03M" "5.45M" "8.18M" NA "59.10M"
I would suggest that you consider imdbapi, a package that facilitates access to the IMDb API. You will need to acquire an API key, but the cost of that is fairly insignificant.
library("imdbapi")
res_film <- find_by_title("Top Gun: Maverick", api_key = "<Your API KEY>")
When working against established data sources such as Eurostat, the World Bank, or IMDb for that matter, it is advisable to rely on maintained packages and available APIs. By scraping the site with rvest you will have to do a lot of unnecessary work and solve problems that were already solved by the API and package creators.
There is also the Open Movie Database, which gives you some free queries with a fairly high limit and offers a dedicated R package. You should likely be able to acquire the information you need that way at no cost.
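The question also asks how purrr::possibly() could be used; here is a minimal sketch of that idea (not part of the answers above). The per-film container .lister-item-content is taken from the answer's CSS path, and the gross selector is the one from the question:
library(rvest)
library(purrr)

page  <- read_html(link)
# one node per film, so each film is processed on its own
films <- page %>% html_nodes(".lister-item-content")

# possibly() wraps a function so that an error returns `otherwise` instead;
# a film with no gross therefore yields NA rather than breaking the pipeline
safe_gross <- possibly(
  function(film) film %>% html_node(".ghost~ .text-muted+ span") %>% html_text(),
  otherwise = NA_character_
)

gross_income <- map_chr(films, safe_gross)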

Extracting a specific link from an element based on a previous text element

I want to extract all available links and dates of the available documents ("Referentenentwurf", "Kabinett", "Bundesrat" and "Inkrafttreten") for each legislative process (each of the gray boxes) from the page. My data set should have the following structure:
Each legislative process is represented by one row and the information about the related documents is in the columns.
Here is the HTML structure of the seventh legislative process (a screenshot showing one example of the HTML structure of the elements containing the legislative processes):
Extracting the dates of each document per legislative process is not a problem (simply done by checking whether the "text()" element includes e.g. "Kabinett").
But extracting the right URL is much more difficult, because the "text()" elements (indicating the document type) are not directly linked with the "<a>" elements (containing the URL).
I'm trying to find a solution for the seventh legislative process ("Zwanzigste Verordnung zur Änderung von Anlagen des Betäubungsmittelgesetzes") in order to apply this solution to every legislative process.
This is my current work status:
if(!require("rvest")) install.packages("rvest")
library(rvest) #for html_attr & read_html
if(!require("dplyr")) install.packages("dplyr")
library(dplyr) # for %>%
if(!require("stringr")) install.packages("stringr")
library(stringr) # for str_detect()
if(!require("magrittr")) install.packages("magrittr")
library(magrittr) # for extract() [within pipes]
page <- read_html("https://www.bundesgesundheitsministerium.de/service/gesetze-und-verordnungen.html")
#Gesetz.Link -> here "Inkrafttreten"
#Gesetz.Link <- lapply(1:72, function(x){
x <- 7 # for demonstration reasons
node.with.data <- html_nodes(page, css = paste0("#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div:nth-child(",x*2,") > div > div > div.panel-body > p")) %>%
extract(
str_detect(html_text(html_nodes(page, css = paste0("#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div:nth-child(",x*2,") > div > div > div.panel-body > p"))),
"Inkrafttreten")
)
link <- node.with.data %>%
html_children() %>%
extract(
str_detect(html_text(html_nodes(node.with.data, xpath = paste0("text()"))),
"Inkrafttreten")
) %>%
html_attr("href")
ifelse(length(node.with.data)==0, NA, link) # set link to NA if there is no link to "Inkrafttreten"
#}) %>%
# unlist()
(I have commented out the application to the entire website so that the solution can be demonstrated on the seventh element.)
The problem is that there can be several URLs linked to each document (here "Download" & "Stellungnahmen" are both linked to "Referentenentwurf"). This leads to an error in my syntax.
Is there any way to extract the nth element after another element? Then one could check whether the "text()" element is "Referentenentwurf" and extract the first element behind it
-> "<a href="/fileadmin/Dateien/3_Downloads/Gesetze_und_Verordnungen/GuV/B/2020-03-04_RefE_20-BtMAEndV.pdf" ...>".
I would be very grateful for tips on how to solve this problem!
Beyond that, I took the liberty of changing a few things in your code to try to get you where you want:
My stab at this is to go into the list of Verordnungen/Gesetze/etc., find the div.panel-body > p as you do, and within that the first link that refers to a downloadable document, by searching for an href containing "/fileadmin/Dateien" via XPath.
It looks like this:
library(purrr)
library(xml2)
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
  map(~{
    .x %>%
      xml_find_first('./div/div/div[contains(@class,"panel-body")]/p//a[contains(@href,"/fileadmin/Dateien")]') %>%
      xml_attr('href')
  })
//update:
If the above assumption doesn't work for you and you really just want to check for "first a-tag after 'Referentenentwurf' in the p-element", the following does get you that. However, I couldn't make it as "elegant" and just used a regex :)
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
  map(~{
    .x %>%
      xml_find_first('./div/div/div[contains(@class,"panel-body")]/p') %>%
      as.character() %>%
      str_extract_all('(?<=Referentenentwurf.{0,10000})(?<=<a href=")[^"]*(?=")') %>%
      unlist() %>%
      first()
  })
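Building on the regex variant above, a hypothetical helper (the name get_doc_links is mine) that takes the document label as a parameter, so the same approach can be reused for "Referentenentwurf", "Kabinett", "Bundesrat" and "Inkrafttreten":
library(rvest)
library(xml2)
library(purrr)
library(stringr)
library(dplyr)   # for first()

# one entry per gray box: the first link that follows the given label inside
# the panel-body <p>, or NA if the label (or a link after it) is not present
get_doc_links <- function(page, label) {
  html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
    map(~{
      .x %>%
        xml_find_first('./div/div/div[contains(@class,"panel-body")]/p') %>%
        as.character() %>%
        str_extract_all(paste0('(?<=', label, '.{0,10000})(?<=<a href=")[^"]*(?=")')) %>%
        unlist() %>%
        first()
    })
}

# usage, e.g.:
# get_doc_links(page, "Referentenentwurf")
# get_doc_links(page, "Inkrafttreten")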

webscraping loop over url and html node in R rvest

I have a data frame pubs with two columns: url and html.node. I want to write a loop that reads each URL, retrieves the HTML contents, extracts the information indicated by the html.node column, and accumulates it in a data frame or list.
All URLs are different and all HTML nodes are different. My code so far is:
score <- vector()
k <- 1
for (r in 1:nrow(pubs)){
  art.url <- pubs[r, 1]  # column 1 contains the URL
  art.node <- pubs[r, 2] # column 2 contains html nodes as characters
  art.contents <- read_html(art.url)
  score <- art.contents %>% html_nodes(art.node) %>% html_text()
  k <- k + 1
  print(score)
}
I appreciate your help.
First of all, be sure that each site you're going to scrape allows you to scrape its data; you can incur legal issues if you break the rules.
(Note: I've only used http://toscrape.com/, a sandbox site for scraping, because you did not provide your data.)
After that, you can proceed with this; hope it helps:
# first, your data; I think it is similar to this
pubs <- data.frame(site = c("http://quotes.toscrape.com/",
                            "http://quotes.toscrape.com/"),
                   html.node = c(".text", ".author"), stringsAsFactors = F)
Then the loop you required:
library(rvest)
# an empty list, to fill with the scraped data
empty_list <- list()
# here you are going to fill the list with the scraped data
for (i in 1:nrow(pubs)){
  art.url <- pubs[i, 1]  # choose the site as you did
  art.node <- pubs[i, 2] # choose the node as you did
  # scrape it!
  empty_list[[i]] <- read_html(art.url) %>% html_nodes(art.node) %>% html_text()
}
Now the result is a list; with:
names(empty_list) <- pubs$site
you add the name of its site to each element of the list, with the result:
$`http://quotes.toscrape.com/`
[1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
[2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
[3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
[4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
[5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
[6] "“Try not to become a man of success. Rather become a man of value.”"
[7] "“It is better to be hated for what you are than to be loved for what you are not.”"
[8] "“I have not failed. I've just found 10,000 ways that won't work.”"
[9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
[10] "“A day without sunshine is like, you know, night.”"
$`http://quotes.toscrape.com/`
[1] "Albert Einstein" "J.K. Rowling" "Albert Einstein" "Jane Austen" "Marilyn Monroe" "Albert Einstein" "André Gide"
[8] "Thomas A. Edison" "Eleanor Roosevelt" "Steve Martin"
Clearly it should work with different sites, and different nodes.
You could also use map from the purrr package instead of a loop:
expand.grid(c("http://quotes.toscrape.com/", "http://quotes.toscrape.com/tag/inspirational/"), # vector of urls
c(".text",".author"), # vector of nodes
stringsAsFactors = FALSE) %>% # assuming that the same nodes are relevant for all urls, otherwise you would have to do something like join
as_tibble() %>%
set_names(c("url", "node")) %>%
mutate(out = map2(url, node, ~ read_html(.x) %>% html_nodes(.y) %>% html_text())) %>%
unnest()
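If, as in the question, each row of pubs already pairs one url with its own html.node, a minimal sketch of the same map2() idea applied row-wise (no expand.grid needed):
library(dplyr)
library(purrr)
library(rvest)

# one scrape per row of pubs; `out` becomes a list-column holding the
# extracted text for each url/node pair
pubs %>%
  mutate(out = map2(url, html.node, ~ read_html(.x) %>% html_nodes(.y) %>% html_text()))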

Scraping from one URL to another URL in R

My question is about R being able to read a URL link. The example that I use is solely for illustration purposes. Say that I have the following webpage that I want to read (chosen at random):
https://www.mcdb.ucla.edu/faculty
It has a list of professor names, each with a URL link. I am trying to build a script which can read a webpage like this, access each URL link, and search for certain keywords regarding their publications.
I currently have a script that scans an individual website for certain keywords, which I post below.
library(rvest)
library(dplyr)
library(tidyverse)
library(stringr)
prof <- readLines("https://www.mcdb.ucla.edu/faculty/jsadams")
library(dplyr)
text_df <- data_frame(text = prof)
text_df <- as.data.frame.table(text_df)
keywords <- c("nonskeletal", "antimicrobial response")
text_df %>%
filter(str_detect(text, keywords[1]) | str_detect(text, keywords[2]))
This should return publications 1, 2 and 4 under the section "Selected Publications" on the professor's webpage.
Now I am trying to get R to read each professor's page from the faculty link (https://www.mcdb.ucla.edu/faculty) and see if each professor has publications with the keywords listed above:
1. Read https://www.mcdb.ucla.edu/faculty
2. Access each link and read each faculty member's page
3. Check whether any of the "keywords" are present
4. If so, list the professor's publications or the text containing the "keywords"
I have already been able to do this for each individual page, but I would prefer a loop or function so I do not have to copy and paste each professor's page URL each time.
Just a slight disclaimer: I have no connection with UCLA or the professor on that website; the professor URL I chose just happened to be the first professor listed on the faculty webpage.
I'd approach this as follows. This is "quick and dirty" code, but hopefully provides a basis for something better.
First, you need the correct selectors to get the faculty names and the links to their pages. Create a data frame with that information:
library(dplyr)
library(rvest)
library(tidytext)
page <- read_html("https://www.mcdb.ucla.edu/faculty")
table1 <- page %>%
html_nodes(xpath = "///table[1]/tr/td/a")
names <- table1 %>%
html_text() %>%
unlist(use.names = FALSE)
links <- table1 %>%
html_attrs() %>%
unlist(use.names = FALSE)
data1 <- data.frame(name = names, href = links)
head(data1)
name href
1 John Adams /faculty/jsadams
2 Utpal Banerjee /faculty/banerjee
3 Siobhan Braybrook /faculty/siobhanb
4 Jau-Nian Chen /faculty/chenjn
5 Amander Clark /faculty/clarka
6 Daniel Cohn /faculty/dcohn
Next, you need a function that takes the values in the href column, fetches the staff page and looks for keywords. I took a different approach to yours, using tidytext to break all of the publications down into individual words and then counting rows where any of the keywords occur. This means that "antimicrobial response" becomes two separate words, so you may want to do that differently.
The function returns a count which is > 0 if any of the keywords were present.
get_pubs <- function(href) {
  page <- read_html(paste0("https://www.mcdb.ucla.edu", href))
  pubs <- data.frame(text = page %>%
                       html_nodes("div.mcdb-faculty-pubs p") %>%
                       html_text(),
                     stringsAsFactors = FALSE)
  pubs <- pubs %>%
    unnest_tokens(word, text)
  pubs %>%
    filter(word %in% c("nonskeletal", "antimicrobial", "response")) %>%
    nrow()
}
Now you can apply the function to each href:
data1 <- data1 %>%
mutate(count = sapply(href, function(x) get_pubs(x)))
Which faculty had at least one keyword in their publications?
data1 %>%
filter(count > 0)
name href count
1 John Adams /faculty/jsadams 9
2 Arjun Deb /faculty/adeb 1
3 Tracy Johnson /faculty/tljohnson 1
4 Chentao Lin /faculty/clin 1
5 Jeffrey Long /faculty/jeffalong 1
6 Matteo Pellegrini /faculty/matteop 1
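As noted above, the tokenized approach splits "antimicrobial response" into separate words. A minimal alternative sketch (the helper name get_pubs_phrase is mine) that matches the raw publication text instead, so multi-word phrases stay intact:
library(rvest)
library(stringr)

get_pubs_phrase <- function(href, keywords = c("nonskeletal", "antimicrobial response")) {
  page <- read_html(paste0("https://www.mcdb.ucla.edu", href))
  pubs <- page %>%
    html_nodes("div.mcdb-faculty-pubs p") %>%
    html_text()
  # count publication paragraphs whose text contains any of the keyword phrases
  sum(str_detect(tolower(pubs), str_c(tolower(keywords), collapse = "|")))
}

# usage, e.g.: data1 %>% mutate(count = sapply(href, get_pubs_phrase))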

Cleaning HTML code in R: how to clean this list?

I know that this question has been asked here tons of times, but after reading a bunch of topics I'm still stuck on this :( . I have a list of scraped HTML nodes like this
http://bit.d o/bnRinN9
and I just want to clean out all the code parts. Unfortunately I'm a newbie, and the only thing that comes to mind is the Cthulhu way (regex, argh!). How can I do this?
*I put a space between "d" and "o" in the domain name because SO doesn't allow posting that link
This uses the data linked in "Why R can't scrape these links?", which was downloaded.
library(rvest)
library(stringr)
# read the saved htm page and make one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")
# the links are within a table, within spans. There isn't much structure
# and no identifiers, so it needs a little hacking to get the right elements
# There are probably smarter css selectors that could avoid the hacks
spans <- read_html(text) %>% xml_nodes(css = "table tbody tr td span")
# extract all the short links -- but remove the links to edit
# note these links have a trailing dash - links to the statistics
# not the content
short_links <- spans %>% xml_nodes("a") %>% xml_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]
# the real urls are in the html text, prefixed with http
span_text <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]
# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
The rvest library includes many simple functions for scraping and processing HTML. It depends on the xml2 package. Generally you can scrape and filter in one step.
It's not clear whether you want to extract the href value or the link text, which are the same in your example. This code extracts the href value by finding the a nodes and then the href attribute; alternatively, you can use html_text to get the link display text.
library(rvest)
links <- list('
  <a href="http://anydomain.com/bnRinN9">http://anydomain.com/bnRinN9</a>
  <a href="domain.com/page">
')
# make one string
text <- paste0(links, collapse = "\n")
hrefs <- read_html(text) %>% xml_nodes("a") %>% xml_attr("href")
hrefs
## [1] "http://anydomain.com/bnRinN9" "domain.com/page"
