How to web scrape data from StockTwits with RSelenium?

I want to get some information from tweets posted on the platform StockTwits.
Here you can see an example tweet: https://stocktwits.com/Kndihopefull/message/433815546
I would like to read the following information: the number of replies, the number of reshares, and the number of likes.
I think this is possible with the RSelenium package. However, I am not really getting anywhere with my approach.
Can someone help me?
library(RSelenium)
url<- "https://stocktwits.com/Kndihopefull/message/433815546"
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser = "firefox", port = 4546L)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
I would like to have a list (or a data set) as a result, which looks like this:
$Reply
[1] 1
$Reshare
[1] 1
$Like
[1] 7

To get the required info, we can do the following:
library(rvest)
library(dplyr)
library(RSelenium)
#launch browser
driver <- rsDriver(browser = "firefox")
url = "https://stocktwits.com/ArcherUS/message/434172145"
remDr <- driver$client
remDr$navigate(url)
#First we shall get the tags
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
[1] "Reply" "Reshare" "Like" "Share" "Search"
#then the number associated with it
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
[1] "" "" "2" "" ""
The last two items, Share and Search, will be empty.
A faster approach would be to use rvest directly:
library(rvest)
url = "https://stocktwits.com/ArcherUS/message/434172145"
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
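To assemble the output into the named list the question asks for, pair the title attributes with the counts. A minimal sketch, assuming the .st_3kvJrBm class still matches and that an empty counter means zero:
library(rvest)
url <- "https://stocktwits.com/ArcherUS/message/434172145"
page <- read_html(url)
labels <- page %>% html_nodes('.st_3kvJrBm') %>% html_attr('title')
counts <- page %>% html_nodes('.st_3kvJrBm') %>% html_text()
# An empty counter means zero
counts[counts == ""] <- "0"
res <- setNames(as.list(as.integer(counts)), labels)
res[c("Reply", "Reshare", "Like")]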


character(0) after scraping webpage with read_html

I'm trying to scrape the employee number "1,335,000" from the page. I wrote the following code in R.
library(rvest)
t2 <- read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>%
rvest::html_text()
However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?
As Dave2e pointed out, the page uses JavaScript, so rvest alone can't render it.
url = "https://fortune.com/company/amazon-com/fortune500/"
#launch browser
library(RSelenium)
library(rvest)
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[#id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>%
html_text()
[1] "1,335,000"
Data is loaded dynamically from a script tag, so there is no need for the expense of a browser. You could either extract the entire JavaScript object within the script tag, pass it to jsonlite to handle as JSON, and then extract what you want, or, if you are just after the employee count, regex it out of the response text.
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)
page <- read_html('https://fortune.com/company/amazon-com/fortune500/')
data <- page %>% html_element('#preload') %>% html_text() %>%
stringr::str_match(. , "PRELOADED_STATE__ = (.*);") %>% .[, 2] %>% jsonlite::parse_json()
print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)
#shorter version
print(page %>% html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[, 2] %>% as.integer() %>% format(big.mark = ","))

Need an example of how to scrape this site

My question is a little unclear, but that's because I really don't know how to ask it more clearly at this stage.
If I get an answer, I will rename it more accurately.
I am a complete newbie to scraping and just learning how to do it.
I am trying to scrape just one value from this site:
library("rvest")
url <- "https://www.fxblue.com/market-data/tools/sentiment"
web <- read_html(url)
nodes <- html_nodes(web,".SentimentValueCaptionLong")
and I get:
html_text(nodes)
character(0)
My next try:
library(RSelenium)
rD <- rsDriver(browser = "chrome", port = 999L, verbose = FALSE, chromever = "95.0.4638.54")
remDr <- rD[["client"]]
remDr$maxWindowSize()
remDr$navigate("https://www.fxblue.com/market-data/tools/sentiment")
html <- remDr$getPageSource()[[1]]
page <- read_html(html)
nodes <- html_nodes(page, ".SentimentValueCaptionLong")
and I get the same:
html_text(nodes)
character(0)
Can someone show me how to do it right, and explain what you did?
library(rvest)
library(dplyr)
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
url <- "https://www.fxblue.com/market-data/tools/sentiment"
remDr$navigate(url)
Get the names:
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.SentimentRowCaption') %>%
html_text()
[1] "AUD/CAD" "AUD/JPY" "AUD/NZD" "AUD/USD" "CAD/JPY" "DAX" "EUR/AUD" "EUR/CAD" "EUR/CHF" "EUR/GBP" "EUR/JPY" "EUR/USD" "GBP/AUD" "GBP/CAD" "GBP/CHF"
[16] "GBP/JPY" "GBP/USD" "NZD/USD" "USD/CAD" "USD/CHF" "USD/JPY" "XAU/USD"
Get the long figures:
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.SentimentValueCaptionLong') %>%
html_text()
[1] "79.2%" "38.4%" "56.1%" "68.9%" "26.8%" "28.7%" "68.7%" "79.5%" "80.7%" "85.3%" "57.0%" "76.4%" "36.1%" "67.4%" "69.7%" "54.9%" "82.3%" "65.1%" "25.0%"
[20] "28.7%" "17.9%" "82.8%"
Get the short figures:
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.SentimentValueCaptionShort') %>%
html_text()
[1] "20.8%" "61.4%" "43.5%" "31.3%" "73.8%" "70.8%" "31.7%" "20.0%" "19.9%" "14.3%" "43.5%" "23.4%" "64.0%" "32.2%" "30.0%" "45.8%" "17.7%" "34.8%" "74.5%"
[20] "71.3%" "82.2%" "17.0%"

Scraping data from LinkedIn using RSelenium (and rvest)

I am trying to scrape some data from famous people on LinkedIn, and I have a few problems. I would like to do the following:
On Hadley Wickham's page (https://www.linkedin.com/in/hadleywickham/) I would like to use RSelenium to log in and "click" the "Show 1 more education" button, and also "Show 1 more experience" (note that Hadley does not have the option to "Show 1 more experience" but does have the option to "Show 1 more education").
Clicking "Show more experience/education" allows me to scrape the full education and experience from the page. Alternatively, Ted Cruz has an option to "Show 5 more experiences", which I would like to expand and scrape.
Code:
library(RSelenium)
library(rvest)
library(stringr)
library(xml2)
userID = "myEmailLogin" # The linkedIn email to login
passID = "myPassword" # and LinkedIn password
try(rsDriver(port = 4444L, browser = 'firefox'))
remDr <- remoteDriver()
remDr$open()
remDr$navigate("https://www.linkedin.com/login")
user <- remDr$findElement(using = 'id',"username")
user$sendKeysToElement(list(userID,key="tab"))
pass <- remDr$findElement(using = 'id',"password")
pass$sendKeysToElement(list(passID,key="enter"))
Sys.sleep(5) # give the page time to fully load
# Navgate to individual profiles
# remDr$navigate("https://www.linkedin.com/in/thejlo/") # Jennifer Lopez
# remDr$navigate("https://www.linkedin.com/in/cruzted/") # Ted Cruz
remDr$navigate("https://www.linkedin.com/in/hadleywickham/") # Hadley Wickham
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]
signals <- read_html(html)
personFullNameLocationXPath <- '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/ul[1]/li[1]'
personName <- signals %>%
html_nodes(xpath = personFullNameLocationXPath) %>%
html_text()
personTagLineXPath = '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2'
personTagLine <- signals %>%
html_nodes(xpath = personTagLineXPath) %>%
html_text()
personLocationXPath <- '//*[@id="ember49"]/div[2]/div[2]/div[1]/ul[2]/li[1]'
personLocation <- signals %>%
html_nodes(xpath = personLocationXPath) %>%
html_text()
personLocation %>%
gsub("[\r\n]", "", .) %>%
str_trim(.)
# Here is where I have problems
personExperienceTotalXPath = '//*[@id="experience-section"]/ul'
personExperienceTotal <- signals %>%
html_nodes(xpath = personExperienceTotalXPath) %>%
html_text()
The very last step, personExperienceTotal, is where I go wrong... I cannot seem to scrape the experience section. When I use my own LinkedIn URL (or some random person's), it seems to work...
My question is, how can I click the expand experience/education and scrape these sections?
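A sketch of the clicking part with RSelenium (untested; the button XPath is an assumption, since LinkedIn's markup changes frequently):
# Find any "Show ... more" buttons and click them before scraping
buttons <- remDr$findElements(using = "xpath", value = "//button[contains(., 'Show') and contains(., 'more')]")
for (b in buttons) {
  b$clickElement()
  Sys.sleep(2)  # give the expanded section time to load
}
# Re-read the page source after expanding, then scrape as before
signals <- read_html(remDr$getPageSource()[[1]])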

rvest handling hidden text

I don't see the data/text I am looking for when scraping a web page.
I tried googling the issue without any luck. I also tried using the XPath, but I get {xml_nodeset (0)}.
require(rvest)
url <- "https://www.nasdaq.com/market-activity/ipos"
IPOS <- read_html(url)
IPOS %>% xml_nodes("tbody") %>% xml_text()
Output:
[1] "\n \n \n \n \n \n "
I do not see any of the IPO data. Expected output should contain the table for the "Priced" IPOs: Symbol, Company Name, etc...
No need for the expense of RSelenium. There is an API call, visible in the browser's network tab, that returns everything as JSON.
For example,
library(jsonlite)
data <- jsonlite::read_json('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')
View(data$data$priced$rows)
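If you would rather have a data frame than nested lists, jsonlite::fromJSON() simplifies the same payload (the date parameter in the URL selects the month):
library(jsonlite)
resp <- jsonlite::fromJSON('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')
# With the default simplification, the rows become a data frame
priced <- resp$data$priced$rows
str(priced)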
It seems that the table data are loaded by scripts. You can use the RSelenium package to get them.
library(rvest)
library(RSelenium)
rD <- rsDriver(port = 1210L, browser = "firefox", check = FALSE)
remDr <- rD$client
url <- "https://www.nasdaq.com/market-activity/ipos"
remDr$navigate(url)
IPOS <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_table(fill = TRUE)
str(IPOS)
PRICED <- IPOS[[3]]

Web scraping with rvest. Returning as NA

I am quite new to web scraping, and I am trying to scrape the 5-yr market value from the FiveThirtyEight page linked here (https://projects.fivethirtyeight.com/carmelo/kyrie-irving/). This is the code I am running, using the rvest package:
kyrie_irving <-
read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
kyrie_irving %>%
html_node(".market-value") %>%
html_text() %>%
as.numeric()
However the output looks like this:
> kyrie_irving <-
read_html("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
> kyrie_irving %>%
+ html_node(".market-value") %>%
+ html_text() %>%
+ as.numeric()
[1] NA
I'm just wondering where I am going wrong with this?
EDIT: I have tried using RSelenium to do this and still get no value returned. I am really lost as to what the problem is. Here is the code:
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("https://projects.fivethirtyeight.com/carmelo/kyrie-irving/")
elem <- remDr$findElement(using="css selector", value=".market-value")
elemtxt <- elem$getElementAttribute("div")
RSelenium works; you just need to change the last line of code to get the result:
elem$getElementText()
[[1]]
[1] "$136.5m"
By the way, the result is a string, so you need to remove the $ and m before you can parse it into a number.
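For example, assuming the value always has that "$136.5m" shape:
val <- elem$getElementText()[[1]]   # e.g. "$136.5m"
as.numeric(gsub("[$m]", "", val))   # drop the "$" and "m"
[1] 136.5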
