Here's the code I'm running:
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
html(l)
})
Up until this point it seems to work fine, but when I try to extract the text:
html_text(messages)
I get:
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
Trying to extract a specific element:
html_text(messages[1])
Can't do that either...
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
So I try a different way:
html_text(messages[[1]])
This seems to at least get at the data, but is still not successful:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"
How can I extract the text material from each of the elements of my list?
There are two problems with your code. Have a look at the package's examples to see how each function is meant to be used.
1. You cannot use every function on every kind of object.
html() downloads the content of a page
html_node() selects node(s) from the downloaded content
html_text() extracts the text from previously selected node(s)
Therefore, to download one of your pages and extract the text of an HTML node, use this:
library(rvest)
# old-school style:
url <- "https://github.com/rails/rails/pull/100"
url_content <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text
... or this ...
# hard-to-read old-school style:
url_mainnode_text <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text
... or this ...
# magrittr piping style:
url_mainnode_text <-
html("https://github.com/rails/rails/pull/100") %>%
html_node("*") %>%
html_text()
url_mainnode_text
2. When working with lists, you have to apply functions to the list, e.g. with lapply().
If you want to batch-process several URLs, you can try something like this:
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
get_html_text <- function(url, css_or_xpath = "*"){
  html_text(
    html_node(
      html(url), css_or_xpath
    )
  )
}
lapply(url_list, get_html_text, css_or_xpath="a[class=message]")
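As a side note, newer versions of rvest deprecate html() in favour of read_html(); a sketch of the same batch job in that style (same URLs and selector as above, purely as an illustration) would be:

library(rvest)

url_list <- c("https://github.com/rails/rails/pull/100",
              "https://github.com/rails/rails/pull/200",
              "https://github.com/rails/rails/pull/300")

get_html_text <- function(url, css_or_xpath = "*") {
  read_html(url) %>%         # download the page
    html_node(css_or_xpath) %>%  # pick the node(s) of interest
    html_text()                  # extract the text
}

lapply(url_list, get_html_text, css_or_xpath = "a[class=message]")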
You need to use html_nodes() and identify which CSS selectors relate to the data you're interested in. For example, if we want to extract the usernames of the people discussing pull request 200:
rootUri <- "https://github.com/rails/rails/pull/200"
page <- html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()
[1] "jaw6" "jaw6" "josevalim"
Related
I'm fairly new to web scraping and having issues troubleshooting my code. At the moment I'm getting different errors every time and don't really know where to continue. I'm currently looking into utilizing RSelenium, but would greatly appreciate some advice and feedback on the code below.
I based my initial code on the following: R: How to web scrape a table across multiple pages with the same URL
library(xml2)
library(RCurl)
library(dplyr)
library(rvest)
i=1
table = list()
for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
page <- read_html(data)
table1 <- page %>%
html_nodes(xpath = "(//table)[2]") %>%
html_table(header=T)
i=i+1
table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
table=bind_rows(table, table1)
print(i)}
table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")
Below are the errors I am receiving at the moment. I know it's a lot, but I assume some of them are a result of previous errors. Any help would be greatly appreciated!
i=1
table = list()
for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
Error: unexpected ',' in:
"for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","
page <- read_html(data)
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "function"
table1 <- page %>%
html_nodes(xpath = "(//table)[2]") %>%
html_table(header=T)
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "function"
i=i+1
table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
Error in is.factor(x) : object 'table1' not found
table=bind_rows(table, table1)
Error in list2(...) : object 'table1' not found
print(i)}
Error: unexpected '}' in " print(i)}"
table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")
The following code produces a dataframe containing all the data you are seeking. Rather than using RSelenium, the code below fetches the data directly from the same API the site uses to populate the table, so you do not need to combine multiple pages:
library(tidyverse)
library(rvest)
library(jsonlite)
####GET NUMBER OF ITEMS#####
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
data <- jsonlite::fromJSON(url, flatten = TRUE)
totalItems <- data$TotalNumberOfItems
####GET ALL OF THE ITEMS#####
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems,'/') %>%
jsonlite::fromJSON(., flatten = TRUE) %>%
.[1] %>%
as.data.frame() %>%
rename_with(~str_replace(., "ListItems.", ""), everything())
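From there you can inspect the result or keep a copy on disk; a quick follow-up (the file name is just an example) might be:

glimpse(allData)                                      # check columns and types
readr::write_csv(allData, "solgte_eiendommer.csv")    # save a local copy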
I have a vector with two character elements that I would like to loop over to fill in the missing part of a URL, web-scrape a specific node from each resulting page, and obtain two lists, each with three sub-lists.
For a single URL I do the following, which obtains the desired output structure:
library(rvest)
H1<- "https://www.tripadvisor.com/Hotel_Review-g1063979-d1756170-Reviews-La_Trobada_Hotel_Boutique-Ripoll_Province_of_Girona_Catalonia.html"
page0_url <- read_html(H1)
groupNodes <- html_nodes(page0_url,"._1nAmDotd") # get all nodes under the same heading
outputlist <- lapply(groupNodes, function(node) {
  results <- node %>% html_nodes("div") %>% html_text()
})
Hotel_Amenities <- outputlist[[1]][-9]
Room_Features <- outputlist[[2]][-9]
Room_Types <- outputlist[[3]]
Hotel_Amenities <- as.matrix(Hotel_Amenities)
Room_Features <- as.matrix(Room_Features)
print(Hotel_Amenities)
print(Room_Features)
print(Room_Types)
Attempting to run the same function for two URLs sequentially:
library(rvest)
country <- data.frame("d1447619-Reviews-Solana_del_Ter-Ripoll",
                      "d2219428-Reviews-La_Sequia_Molinar-Campdevanol")
for (i in country)
  fun <- function(node) {
    html_nodes(read_html(paste("https://www.tripadvisor.co.uk/Hotel_Review-g1063979-",
                               i, "_Province_of_Girona_Catalonia.html", sep = "")),
               "._1nAmDotd") %>%
      html_nodes("div") %>% html_text()
  }
lapply(country, fun)
# Produces two lists however the lists are the same list twice
I have tried more combinations. All either yield an error or, at most, obtain the result of only one of the two URLs, and in a text-string format. Any help will be welcomed.
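One thing that stands out in the code above: fun never uses its node argument; the URL is built from i, the loop variable, so every element of country ends up fetching the same page. A minimal sketch with the argument actually used (selector and URL pattern taken from the question; the argument name slug is just illustrative):

library(rvest)

country <- c("d1447619-Reviews-Solana_del_Ter-Ripoll",
             "d2219428-Reviews-La_Sequia_Molinar-Campdevanol")

fun <- function(slug) {
  url <- paste0("https://www.tripadvisor.co.uk/Hotel_Review-g1063979-",
                slug, "_Province_of_Girona_Catalonia.html")
  read_html(url) %>%
    html_nodes("._1nAmDotd") %>%
    html_nodes("div") %>%
    html_text()
}

lapply(country, fun)  # one result per URL instead of the same list twice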
I am using JSON to scrape the content of multiple (1,000) links. However, some of the links do not return valid JSON, so there is no content to be scraped. Because of this, my code stops working when it finds one of those links.
I have tried to use tryCatch to avoid the error, but it does not seem to be working.
Here is the code I am using:
library(jsonlite)
library(rvest)
lapply(links_jason[1:6], function(x) {
tryCatch(
{
json_data <- read_html(x) %>% html_text()%>%
jsonlite::fromJSON(.)%>%
select(1)
},
error = function(cond) return(NULL),
finally = print(x)
)
})
This is the issue I am getting:
Debug location is approximate because the source is not available
Here are some examples of the links I am trying to scrape. Links 1, 2, and 6 work fine; 3, 4, and 5 need to be avoided:
> head(links_jason)
[1] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
[2] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
[3] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
[4] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
[5] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
[6] "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json"
I have also tried to use if statements with no results. Could anyone help? Thanks!
Read directly with jsonlite and test the length of the return:
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if (length(json_data) > 0) {
    print(x)
  }
})
Or something like:
library(jsonlite)
library(rvest)
library(magrittr)
links_jason <- c("https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/57833&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56774&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56748&_format=hal_json"
, "https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/56782&_format=hal_json"
,"https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/64341&_format=hal_json")
lapply(links_jason[1:6], function(x) {
  json_data <- jsonlite::read_json(x)
  if (length(json_data) == 0) {
    json_data <- NA
  } else {
    print('doing something with json_data')
  }
})
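If you would rather keep the tryCatch approach from the question, the key is to wrap the call that can fail and return something harmless (e.g. NULL) from the error handler. A minimal sketch along those lines:

library(jsonlite)

results <- lapply(links_jason[1:6], function(x) {
  tryCatch(
    jsonlite::fromJSON(x, flatten = TRUE),  # fails for the non-JSON links
    error = function(e) NULL                # return NULL instead of stopping
  )
})

# drop the links that failed
results <- Filter(Negate(is.null), results)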
I'm an R beginner and I'm trying to write a function to scrape all the song lyrics from a certain singer from a website, returning a tibble with the lyrics and the name of the song. I have already managed to get all the song links, but I'm stuck trying to write a function to actually get the lyrics.
The website in question is: https://www.letras.mus.br/belchior/44457/
Selector for the song title: #js-lyric-cnt > article > div.cnt-head.cnt-head--l > div.cnt-head_title > h1
Selector for the song lyrics: #js-lyric-cnt > article > div.cnt-letra-trad.g-pr.g-sp > div.cnt-letra.p402_premium
I wrote this function:
get_lyrics <- function(url){
url %>% read_html() %>%
um <- html_nodes('#js-lyric-cnt > article > div.cnt-letra-trad.g-pr.g-sp > div.cnt-letra.p402_premium')
um %>%
lyrics <- html_text()
url %>% read_html() %>%
dois <- html_nodes('#js-lyric-cnt > article > div.cnt-head.cnt-head--l > div.cnt-head_title > h1')
dois %>%
title <- html_text()
data_frame(title, lyrics)
}
But when I try to run it I get:
get_lyrics('https://www.letras.mus.br/belchior/1391391/')
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
I'm not sure what I can do to fix it, so I'd appreciate the help.
You could shorten your selectors (generally faster and more stable). Call read_html only once, then work with the retrieved content. I assume you want a dataframe with one entry for the title and one corresponding entry for the lyrics. The lyrics are within p tags inside a parent element with class cnt-letra; furthermore, individual lyric lines are separated by br tags. To preserve a sense of the original line spacing when collapsing to a single string, I add '\n' to account for these breaks.
I took the functions that work around rvest's lack of br handling from @rentrop here - though as that issue is quite old, perhaps I have missed the addition of this feature?
Be careful about the sequencing you use when chaining methods, to ensure the flow is as intended.
library(rvest)
library(magrittr)
html_text_collapse <- function(x, trim = FALSE, collapse = "\n"){
UseMethod("html_text_collapse")
}
html_text_collapse.xml_nodeset <- function(x, trim = FALSE, collapse = "\n"){
vapply(x, html_text_collapse.xml_node, character(1), trim = trim, collapse = collapse)
}
html_text_collapse.xml_node <- function(x, trim = FALSE, collapse = "\n"){
paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
}
get_lyrics <- function(url){
page <- read_html(url)
lyrics <- toString(page %>% html_nodes('.cnt-letra p') %>% html_text_collapse)
title <- page %>% html_node('.cnt-head_title') %>% html_text()
return(data.frame(title, lyrics))
}
get_lyrics('https://www.letras.mus.br/belchior/44457/')
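Since the question mentions already having all the song links, one way to apply get_lyrics() over the whole set and get a single data frame (the two URLs below are just the examples from this thread; swap in your scraped links) could be:

library(purrr)

song_urls <- c("https://www.letras.mus.br/belchior/44457/",
               "https://www.letras.mus.br/belchior/1391391/")

all_songs <- map_dfr(song_urls, get_lyrics)  # one row per song: title + lyrics
all_songs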
If the goal is just to get the lyrics, you can use the genius package.
genius::genius_lyrics("Belchior", "Na Hora do Almoco") will fetch the lyrics.
I've got a simple Map function that scrapes text files from a blog site. It's pretty easy to get a scraper that grabs all of the text files and downloads them to my working directory. My goal: use an ifelse() or a plain if statement to scrape a file only if it was posted on a certain date.
E.g., if four files were posted on 1/31/19 and I pointed my ifelse at that date, the function would return those four files. Code:
library(tidyverse)
library(rvest)
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Getting date elements
dates <- page %>%
html_nodes("node.dates") %>%
html_text()
dates <- parse_date_time(dates, "%m/%d/%Y", tz = "EST",
locale = Sys.getlocale("LC_TIME"))
# Function
out <- Map(function(ln) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
write <- writeBin(fun1$response$content)
ifelse(dates == '2019-01-31', write, "He's dead, Jim")
}, links)
I've tried various ways to get that if statement in there, and also moving the writeBin around. (Usually the writeBin would not be vectorized; I did it this way for easy viewing in my ifelse.) Error:
Error in ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok] :
replacement has length zero
If I leave out the if code, everything works great; it just returns many text files, when I only want the ones from the specified date.
Based on the description, it seems like you want to check the corresponding 'dates' value for each element of 'links' and then apply the if/else. If that is the case, we can pass two arguments to Map:
Map(function(ln, y) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
write <- writeBin(fun1$response$content)
if(y == '2019-01-31') {
write
} else "He's dead, Jim"
},
links, dates)
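Alternatively, if only the files from that date are wanted, you could subset links by dates before downloading, so the other pages are never fetched. A sketch, assuming links and dates line up one-to-one:

keep <- as.Date(dates) == as.Date("2019-01-31")

out <- Map(function(ln) {
  fun1 <- html_session(URLencode(paste0("https://www.example-blog", ln)),
                       config(ssl_verifypeer = FALSE))
  fun1$response$content  # raw content; write it out with writeBin() as needed
}, links[keep])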