Web scraping with rvest returning NA - web-scraping

I am quite new to web scraping and I am trying to scrape the 5-yr market value from a FiveThirtyEight page linked here (https://projects.fivethirtyeight.com/carmelo/kyrie-irving/). This is the code I am running, using the rvest package:
library(rvest)

kyrie_irving <-
  read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
kyrie_irving %>%
  html_node(".market-value") %>%
  html_text() %>%
  as.numeric()
However, the output looks like this:
> kyrie_irving <-
read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
> kyrie_irving %>%
+ html_node(".market-value") %>%
+ html_text() %>%
+ as.numeric()
[1] NA
I'm just wondering where I am going wrong with this?
EDIT: I have tried using RSelenium to do this and still get no value returned. I am really lost as to what the problem is. Here is the code:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
elem <- remDr$findElement(using="css selector", value=".market-value")
elemtxt <- elem$getElementAttribute("div")

RSelenium works; you just need to change the last line of code and you can get the result.
elem$getElementText()
[[1]]
[1] "$136.5m"
By the way, the result is a string, so you need to remove the $ and m before you can parse it into a number.
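For example, a minimal sketch of that clean-up step with base R's gsub (readr::parse_number would also work):
value <- elem$getElementText()[[1]]
as.numeric(gsub("[$m]", "", value))  # strip the "$" and "m", then convert
[1] 136.5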

Related

R web scraping doesn't work when URL is in a variable

I'm having trouble scraping in R. I want to scrape genre information for several titles on Goodreads.
If I do this, it works completely fine and gives me what I need:
library(polite)
library(rvest)
library(dplyr)
session <- bow("https://www.goodreads.com/book/show/29991718-royally-matched",
delay = 5)
genres <- scrape(session) %>%
html_elements(".bookPageGenreLink") %>%
html_text()
genres
However, since I'd like to loop over several pages, I need this to work, but it always returns character(0).
host <- "https://www.goodreads.com/book/show/29991718-royally-matched"
session <- bow(host,
delay = 5)
genres <- scrape(session) %>%
html_elements(".bookPageGenreLink") %>%
html_text()
genres
Something like this would also be fine for me, but it doesn't work either:
link = "29991718-royally-matched"
session <- bow(paste0("https://www.goodreads.com/book/show/29991718-royally-matched", link),
delay = 5)
genres <- scrape(session) %>%
html_elements(".bookPageGenreLink") %>%
html_text()
genres
If I open the website and disable JavaScript, it still works completely fine, so I don't think Selenium is necessary, and I really can't figure out why this doesn't work, which drives me crazy.
Thank you so much for your support!
Solution (kind of)
So I noticed that the success of my scrapings was kind of dependent on the random moods of the scraping gods.
So I did the following:
links <- c("31752345-black-mad-wheel", "00045101-The-Mad-Ship", "2767052-the-hunger-games", "18619684-the-time-traveler-s-wife", "29991718-royally-matched")
data <- data.frame(links)
for (link in links) {
  print(link)
  genres <- character(0)
  url <- paste0('https://www.goodreads.com/book/show/', link)
  # I don't know why, but resaving it kinda helped
  host <- url
  # I had the theory that repeating the scraping would eventually lead to a result. For me that didn't work though
  try <- 0
  while (identical(genres, character(0)) & (try < 10)) {
    try <- try + 1
    print(paste0(try, ": ", link))
    session <- bow(host,
                   delay = 5)
    scraping <- scrape(session)
    genres <- scraping %>%
      html_elements(".bookPageGenreLink") %>%
      html_text()
  }
  if (identical(genres, character(0))) {
    print("Scraping unsuccessful.. :( ")
  } else {
    print("Scraping success!!")
    genres.df <- data.frame(genres)
    data <- left_join(data,
                      genres.df, by = c("link"))
  }
}
## then I created a list of the missing titles
missing_titles <- data %>%
filter(is.na(genre_1))
missing_links <- unique(missing_titles$link)
So the next step(s) were closing R (while saving the workspace, of course), restarting it and re-feeding the loop with missing_links instead of links. It took me about 7 iterations of that to get everything I needed, and on the last run I had to insert the last remaining link directly into example 1, since it did not work inside the loop, for whatever reason.
I hope the code kind of works, since I wanted to spare you pages of wild data formatting.
If someone has an explanation for why I needed to go through this hassle, I would still very much appreciate it.
You can consider using the R package RSelenium as follows:
library(RSelenium)
library(rvest)
url <- "https://www.goodreads.com/book/show/29991718-royally-matched"
# start a Selenium server with Firefox in a Docker container
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
# grab the rendered page source and parse it with rvest
page_Content <- remDr$getPageSource()[[1]]
read_html(page_Content) %>% html_elements(".bookPageGenreLink") %>% html_text()
Afterwards, you can loop over the URLs you want.
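For instance, a minimal sketch of such a loop, reusing the remDr object from above (the book slugs are taken from the question; genre_list is just an illustrative name):
genre_list <- list()
links <- c("29991718-royally-matched", "2767052-the-hunger-games")
for (link in links) {
  remDr$navigate(paste0("https://www.goodreads.com/book/show/", link))
  Sys.sleep(5)  # give the page time to load
  genre_list[[link]] <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_elements(".bookPageGenreLink") %>%
    html_text()
}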

R How to web scrape data from StockTwits with RSelenium?

I want to get some information from tweets posted on the platform StockTwits.
Here you can see an example tweet: https://stocktwits.com/Kndihopefull/message/433815546
I would like to read the following information: number of replies, number of reshares, number of likes.
I think this is possible with the RSelenium package. However, I am not really getting anywhere with my approach.
Can someone help me?
library(RSelenium)
url<- "https://stocktwits.com/Kndihopefull/message/433815546"
# RSelenium with Firefox
rD <- RSelenium::remoteDriver(browser="firefox", port=4546L)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
I would like to have a list (or a data set) as a result, which looks like this:
$Reply
[1] 1
$Reshare
[1] 1
$Like
[1] 7
To get the required info, we can do:
library(rvest)
library(dplyr)
library(RSelenium)
#launch browser
driver = rsDriver(browser = c("firefox"))
url = "https://stocktwits.com/ArcherUS/message/434172145"
remDr <- driver$client
remDr$navigate(url)
#First we shall get the tags
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
[1] "Reply" "Reshare" "Like" "Share" "Search"
#then the number associated with it
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
[1] "" "" "2" "" ""
The last two items Share and Search will be empty.
The faster approach would be to use rvest directly.
library(rvest)
url = "https://stocktwits.com/ArcherUS/message/434172145"
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
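If you want the named list from the question, one option is to combine the two vectors; this is a minimal sketch (result is an illustrative name, and the counts stay character strings unless you convert them):
library(rvest)
page <- read_html("https://stocktwits.com/ArcherUS/message/434172145")
titles <- page %>% html_nodes('.st_3kvJrBm') %>% html_attr('title')
counts <- page %>% html_nodes('.st_3kvJrBm') %>% html_text()
# name each count by its tag title, then keep only the three of interest
result <- as.list(setNames(counts, titles))
result[c("Reply", "Reshare", "Like")]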

Google Play web scraping: how can you get the number of votes for each review in R?

I'm scraping the reviews of a Google Play app in R, but I can't get the number of votes for each review. This is the code I use: likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label"), and I get no value. How can it be done?
FULL CODE
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
url <- 'https://play.google.com/store/apps/details?id=com.gospace.parenteral&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label")
What it returns:
NA NA NA
What I want it to return:
3 3 2
Maybe you are using SelectorGadget to get the CSS selector. Like you, I tried to do that, but the selector it returns is not the correct one.
Inspecting the HTML code of the page, I realized that the correct element is contained in a tag with class = "jUL89d y92BAb".
So the code you should use is this one:
html_obj %>% html_nodes('.jUL89d') %>% html_text()
My personal recommendation is to always check the source code to confirm the output of SelectorGadget.
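If you then need the votes as numbers rather than text, a small follow-up step like this should work (a sketch, assuming each extracted string contains the digits you are after):
votes <- html_obj %>% html_nodes('.jUL89d') %>% html_text()
as.numeric(gsub("[^0-9]", "", votes))  # keep only the digits, then convert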

rvest handling hidden text

I don't see the data/text I am looking for when scraping a web page.
I tried googling the issue without any luck. I also tried using the XPath, but I get {xml_nodeset (0)}.
require(rvest)
url <- "https://www.nasdaq.com/market-activity/ipos"
IPOS <- read_html(url)
IPOS %>% xml_nodes("tbody") %>% xml_text()
Output:
[1] "\n \n \n \n \n \n "
I do not see any of the IPO data. Expected output should contain the table for the "Priced" IPOs: Symbol, Company Name, etc...
No need for the expensive RSelenium. There is an API call, which you can find in the network tab, that returns everything as JSON.
For example,
library(jsonlite)
data <- jsonlite::read_json('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')
View(data$data$priced$rows)
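If you would rather work with a data frame directly, jsonlite::fromJSON simplifies the nested structure for you (a minimal sketch; the exact columns depend on the API response):
library(jsonlite)
# fromJSON simplifies the "rows" element into a data frame
priced <- fromJSON('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')$data$priced$rows
head(priced)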
It seems that the table data are loaded by scripts. You can use the RSelenium package to get them.
library(rvest)
library(RSelenium)
rD <- rsDriver(port = 1210L, browser = "firefox", check = FALSE)
remDr <- rD$client
url <- "https://www.nasdaq.com/market-activity/ipos"
remDr$navigate(url)
IPOS <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_table(fill = TRUE)
str(IPOS)
PRICED <- IPOS[[3]]

Continue the loop search when Rselenium returns error "NoSuchElement"

I use RSelenium to scrape the "rent" information in advertisements from the website. However, it seems that not every advertisement contains the rent information. Therefore, when my loop reaches those that don't have the rent information, it hits the error 'NoSuchElement' and the loop stops. I want to:
1/ fill "NA" values to those cases which dont have rent information; and
2/ continue the loop to scrapt rent information.
I already tried "tryCatch" function, however, it seems doesnt work.R still throws me an error i.e. "Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
Further Details: run errorDetails method".
My code is below. I appreciate your time and help.
#add url
url <- "https://www.toimitilat.fi/toimitilahaku/?size_min=&size_max=&deal_type%5B%5D=1&language=fin&result_type=list&advanced=0&gbl=1&ref=main#searchresult"
rD <- rsDriver()
remDr <- rD$client
remDr$navigate(url)
for(i in 8:13){
  Sys.sleep(0.86)
  rent <- remDr$findElement(using = "css selector", paste("#objectList > div:nth-child(", i, ") > div.infoCont > div.priceCont", sep = ""))$getElementText()
  # checking if there is a rent or not
  if(!is.null(rent)){
    tryCatch({
      rent <- unlist(strsplit(rent[[1]][1], "\n"))
      rent_df <- rbind(rent_df, rent)
    }, error = function(e){
      return("NoSuchElement")
      i = i + 1
    })
  }
}
You can do this much more easily with rvest rather than using the sledgehammer of RSelenium. It also copes much better with missing information.
To get a dataframe with the addresses and rents, you can use html_nodes to create a list of the boxes containing the information, and then html_node to find the relevant information in each one. There will be one entry for each box, and any missing data will just appear as NA.
library(dplyr) #only needed for the pipe operator %>%
library(rvest)
url <- "https://www.toimitilat.fi/toimitilahaku/?size_min=&size_max=&deal_type%5B%5D=1&language=fin&result_type=list&advanced=0&gbl=1&ref=main#searchresult"
boxes <- read_html(url) %>% #read the page
html_nodes(".infoCont") #find the info boxes
address <- boxes %>%
html_node("h4 > a") %>% #find the address info in each box
html_text()
rent <- boxes %>%
html_node(".priceCont") %>% #find the rent info in each box
html_text() %>% #extract the text
trimws() #trim whitespace
#put together in a dataframe
rent_df <- data.frame(address = address,
rent = rent,
stringsAsFactors = FALSE)
head(rent_df)
address rent
1 Akaa, Airolantie 5 Myyntihinta: \nMyydään tarjousten perusteella...
2 Akaa, Hämeentie 18 <NA>
3 Akaa, Hämeentie 69, Akaa
4 Akaa, Keskuskatu 42 Vuokrahinta: \n300 e/kk + alv
5 Akaa, Kirkkotori 10, Toijala Vuokrahinta: \n450
6 Akaa, Palomäentie 6 Toijala Vuokrahinta: \n3€/m2+alv
You can then easily extract the information you need.
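For example, a minimal sketch for pulling the numeric part out of the rent strings with readr::parse_number (rent_numeric is just an illustrative column name):
library(readr)
# parse_number drops the surrounding text and keeps the first number it finds
rent_df$rent_numeric <- parse_number(rent_df$rent)
head(rent_df)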
A solution with rvest should be easier, but if you want or need to use RSelenium, this should work:
# Preparation
library(dplyr) # required for bind_rows
# add url
url <- "https://www.toimitilat.fi/toimitilahaku/?size_min=&size_max=&deal_type%5B%5D=1&language=fin&result_type=list&advanced=0&gbl=1&ref=main#searchresult"
rD <- rsDriver()
remDr <- rD$client
remDr$navigate(url)
# Checking that rD and remDr objects exist and work
## If you get an error here, that means the selenium objects don't work - usually because ports are busy, the selenium server or client has not been closed properly, or the browser drivers are out of date (or something else)
class(rD)
class(remDr)
# making separate function retrieving the rent and handling exceptions
giveRent <- function(i) {
  Sys.sleep(0.86)
  tryCatch({
    rent <- remDr$findElement(using = "css selector", paste("#objectList > div:nth-child(", i, ") > div.infoCont > div.priceCont", sep = ""))$getElementText()
    rent <- unlist(strsplit(rent[[1]][1], "\n"))
    rent <- rent[2]},
    warning = function(e){rent <<- NA},
    error = function(e){rent <<- NA})
  return(rent)
}
# adding rent to the dataframe in for-loop
rent_df <- c()
for(i in 1:33){rent_df <- bind_rows(rent_df, (data.frame(giveRent(i))))}
print(rent_df)
