rvest handling hidden text

I don't see the data/text I am looking for when scraping a web page.
I tried googling the issue without any luck. I also tried using the XPath, but I get {xml_nodeset (0)}.
require(rvest)
url <- "https://www.nasdaq.com/market-activity/ipos"
IPOS <- read_html(url)
IPOS %>% xml_nodes("tbody") %>% xml_text()
Output:
[1] "\n \n \n \n \n \n "
I do not see any of the IPO data. Expected output should contain the table for the "Priced" IPOs: Symbol, Company Name, etc...

There is no need for the heavyweight RSelenium here. The page gets its data from an API call, which you can find in the browser's network tab, and that call returns everything as JSON.
For example,
library(jsonlite)
data <- jsonlite::read_json('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')
View(data$data$priced$rows)
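If a data frame is more convenient than the nested list, the rows can be bound together in base R. This is a minimal sketch, assuming each row is a flat list of scalar fields. The field names and values below are made-up placeholders, not taken from the actual API response.

```r
# Hypothetical stand-in for data$data$priced$rows: one list per IPO.
# Field names and values are illustrative only.
rows <- list(
  list(symbol = "AAA", companyName = "Alpha Corp", price = "10.00"),
  list(symbol = "BBB", companyName = "Beta Inc", price = NULL)
)

rows_to_df <- function(rows) {
  do.call(rbind, lapply(rows, function(r) {
    # JSON nulls arrive as NULL; turn them into NA so every row has one value per field
    r[vapply(r, is.null, logical(1))] <- NA
    as.data.frame(r, stringsAsFactors = FALSE)
  }))
}

priced <- rows_to_df(rows)
print(priced)
```

With the real response, the call would be rows_to_df(data$data$priced$rows), provided the rows really are flat lists.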

It seems that the table data is loaded by scripts, so it is not part of the HTML that read_html() downloads. You can use the RSelenium package to get it.
library(rvest)
library(RSelenium)
rD <- rsDriver(port = 1210L, browser = "firefox", check = FALSE)
remDr <- rD$client
url <- "https://www.nasdaq.com/market-activity/ipos"
remDr$navigate(url)
IPOS <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_table(fill = TRUE)
str(IPOS)
PRICED <- IPOS[[3]]
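The empty result from plain rvest can be reproduced offline. The sketch below uses a made-up HTML snippet (not the actual Nasdaq markup) in which a script would fill the table in a browser. read_html() only parses the markup and never runs the script, so the table body stays empty while the script source is still present as text.

```r
library(rvest)

# Hypothetical page: the <tbody> is empty until the script runs in a browser
static_html <- '<html><body>
<table><tbody id="rows"></tbody></table>
<script>document.getElementById("rows").insertRow().insertCell().textContent = "ABC";</script>
</body></html>'

page <- read_html(static_html)

# No cells: the script was parsed as text, not executed
cells <- page %>% html_nodes("tbody td") %>% html_text()
length(cells)

# The script source itself is still in the document
script_src <- page %>% html_nodes("script") %>% html_text()
grepl("insertRow", script_src)
```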

Related

R How to web scrape data from StockTwits with RSelenium?

I want to get some information from tweets posted on the platform StockTwits.
Here you can see an example tweet: https://stocktwits.com/Kndihopefull/message/433815546
I would like to read the following information: Number of replies, number of reshares, number of likes:
I think this is possible with the RSelenium-package. However, I am not really getting anywhere with my approach.
Can someone help me?
library(RSelenium)
url<- "https://stocktwits.com/Kndihopefull/message/433815546"
# RSelenium with Firefox
rD <- RSelenium::remoteDriver(browser="firefox", port=4546L)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
I would like to have a list (or a data set) as a result, which looks like this:
$Reply
[1] 1
$Reshare
[1] 1
$Like
[1] 7

To get the required info, we can do the following:
library(rvest)
library(dplyr)
library(RSelenium)
#launch browser
driver = rsDriver(browser = c("firefox"))
url = "https://stocktwits.com/ArcherUS/message/434172145"
remDr <- driver$client
remDr$navigate(url)
#First we shall get the tags
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
[1] "Reply" "Reshare" "Like" "Share" "Search"
#then the number associated with it
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
[1] "" "" "2" "" ""
The last two items Share and Search will be empty.
A faster approach is to use rvest on its own:
library(rvest)
url = "https://stocktwits.com/ArcherUS/message/434172145"
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
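To get the named list asked for in the question, the two vectors can be combined. This is a minimal sketch using the output shown above as literal input. Treating an empty string as a count of 0 is an assumption about how the page displays zero counts.

```r
# Values copied from the html_attr('title') and html_text() output above
titles <- c("Reply", "Reshare", "Like", "Share", "Search")
counts <- c("", "", "2", "", "")

# Keep only the three counters of interest
keep <- titles %in% c("Reply", "Reshare", "Like")
values <- as.integer(counts[keep])

# Assumption: an empty string means the counter is zero
values[is.na(values)] <- 0L

result <- as.list(setNames(values, titles[keep]))
str(result)
```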

Scraping soundcloud.com with rvest package in R

I am attempting to scrape this URL to get the names of the top 50 SoundCloud artists in Canada.
Using SelectorGadget, I selected the artists' names and it told me the path is '.sc-link-light'.
My first attempt was as follows:
library(rvest)
library(stringr)
library(reshape2)
soundcloud <- read_html("https://soundcloud.com/charts/top?genre=all-music&country=CA")
artist_name <- soundcloud %>% html_nodes('.sc-link-light') %>% html_text()
which yielded artist_name as a list of 0.
My second attempt I changed the last line to:
artist_name <- soundcloud %>% html_node(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", ".sc-link-light", " " ))]') %>% html_text()
which again yielded the same result.
What exactly am I doing wrong? I believe this should give me the artists' names in a list.
Any help is appreciated, thank you.

The webpage you are attempting to scrape is dynamic. As a result you will need to use a library such as RSelenium. A sample script is below:
library(tidyverse)
library(RSelenium)
library(rvest)
library(stringr)
url <- "https://soundcloud.com/charts/top?genre=all-music&country=CA"
rD <- rsDriver(browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate(url)
pg <- read_html(remDr$getPageSource()[[1]])
artist_name <- pg %>% html_nodes('.sc-link-light') %>% html_text()
####clean up####
remDr$close()
rD$server$stop()
rm(rD, remDr)
gc()
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)

How to perform web scraping to get all the reviews of an app in Google Play?

I want to get all the reviews that users leave on Google Play about an app. I have this code, which was suggested in "Web scraping in R through Google playstore". The problem is that it only returns the first 40 reviews. Is there a way to get all the reviews of the app?
```
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) How much star they got
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>%
html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews,
stringsAsFactors = F)
```

You can get all the reviews from the web store of GooglePlay.
If you scroll through the reviews, you can see the XHR request is sent to:
https://play.google.com/_/PlayStoreUi/data/batchexecute
With form-data:
f.req: [[["rYsCDe","[[\"com.playrix.homescapes\",7]]",null,"55"]]]
at: AK6RGVZ3iNlrXreguWd7VvQCzkyn:1572317616250
And params of:
rpcids=rYsCDe
f.sid=-3951426241423402754
bl=boq_playuiserver_20191023.08_p0
hl=en
authuser=0
soc-app=121
soc-platform=1
soc-device=1
_reqid=839222
rt=c
After playing around with different parameters, I found that many are optional, and the request can be simplified to:
form-data:
f.req: [[["UsvDTd","[null,null,[2, $sort,[$review_size,null,$page_token]],[$package_name,7]]",null,"generic"]]]
params:
hl=$review_language
The response is cryptic, but it's essentially JSON data with the keys stripped, similar to protobuf. I wrote a parser for the response that translates it into a regular dict object:
https://gist.github.com/xlrtx/af655f05700eb76bb29aec876493ed90
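As an illustration of the simplified request, the f.req value can be assembled from the template above with sprintf(). This is only a sketch: build_f_req() is a hypothetical helper, the default values for sort and review_size are placeholders, and sending null as the page token for the first page is an assumption. The inner payload is a JSON string, so the package name needs escaped quotes, as in the original captured request.

```r
# Hypothetical helper that fills in the f.req template from the answer.
# The defaults for sort and review_size are placeholders, not documented values.
build_f_req <- function(package_name, sort = 2, review_size = 40, page_token = NULL) {
  # Assumption: the first page is requested with a null page token
  token <- if (is.null(page_token)) "null" else sprintf('\\"%s\\"', page_token)
  # Inner payload: quotes are backslash-escaped because it is embedded in a JSON string
  inner <- sprintf('[null,null,[2,%d,[%d,null,%s]],[\\"%s\\",7]]',
                   as.integer(sort), as.integer(review_size), token, package_name)
  sprintf('[[["UsvDTd","%s",null,"generic"]]]', inner)
}

f_req <- build_f_req("com.playrix.homescapes")
cat(f_req, "\n")
```

The resulting string would go into the f.req form field of a POST to the batchexecute URL, with hl passed as a query parameter.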

Web scraping with rvest. Returning as NA

I am quite new to web scraping and I am trying to scrape the 5-yr market value from a FiveThirtyEight page linked here (https://projects.fivethirtyeight.com/carmelo/kyrie-irving/). This is the code I am running with the rvest package to do so.
kyrie_irving <-
read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
kyrie_irving %>%
html_node(".market-value") %>%
html_text() %>%
as.numeric()
However the output looks like this:
> kyrie_irving <-
read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
> kyrie_irving %>%
+ html_node(".market-value") %>%
+ html_text() %>%
+ as.numeric()
[1] NA
I'm just wondering where I am going wrong with this?
EDIT: I have tried using RSelenium to do this and still get no value returned. I am really lost as to what the problem is. Here is the code:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
elem <- remDr$findElement(using="css selector", value=".market-value")
elemtxt <- elem$getElementAttribute("div")

RSelenium works; you just need to change the last line of code to get the result:
elem$getElementText()
[[1]]
[1] "$136.5m"
By the way, the result is a string, so you need to remove $ and m, then you can parse it into a number.
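The string clean-up described above can be sketched as follows. parse_market_value() is a hypothetical helper name, and the result is expressed in millions of dollars.

```r
# Strip the dollar sign, the "m" suffix and any thousands separators, then convert
parse_market_value <- function(x) {
  as.numeric(gsub("[$m,]", "", x))
}

value_millions <- parse_market_value("$136.5m")
value_millions  # 136.5
```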

No data when scraping with rvest

I am trying to scrape a website but it does not give me any data.
#Get the Data
require(tidyverse)
require(rvest)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#get data
url %>%
read_html() %>%
html_nodes(".green div:nth-child(1)") %>%
html_text()
character(0)
I have also tried to use the xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "green", " " ))]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a' but this gives me the same result with 0 data.
I am expecting horse names. Shouldn't I at least get some JavaScript code even if the data on the page is rendered by JavaScript?
I can't see what other CSS selector I should use here.

You can simply use the RSelenium package to scrape dynamic pages:
library(RSelenium)
library(rvest) # needed for read_html(), html_nodes() and html_text() below
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#Create the remote driver / navigator
rsd <- rsDriver(browser = "chrome")
remDr <- rsd$client
#Go to your url
remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
#get your horses data by parsing Selenium page with Rvest as you know to do
page %>% html_nodes(".green div:nth-child(1)") %>% html_text()
Hope that helps
Gottavianoni
