My question is a little unclear, but that's because I really don't know how to ask it more clearly at this stage. If I get an answer I will rename it more accurately.
I am a complete newbie to scraping and am just learning how to do it.
I am trying to scrape just one value from this site:
library("rvest")
url <- "https://www.fxblue.com/market-data/tools/sentiment"
web <- read_html(url)
nodes <- html_nodes(web,".SentimentValueCaptionLong")
I get:
html_text(nodes)
character(0)
My next try:
library(RSelenium)
rD <- rsDriver(browser="chrome", port=4444L, verbose = F, chromever = "95.0.4638.54")
remDr <- rD[["client"]]
remDr$maxWindowSize()
remDr$navigate("https://www.fxblue.com/market-data/tools/sentiment")
html <- remDr$getPageSource()[[1]]
page <- read_html(html)
nodes <- html_nodes(page, ".SentimentValueCaptionLong")
I get the same result:
html_text(nodes)
character(0)
Can someone show me how to do it right, and explain what you did?
library(rvest)
library(dplyr)
library(RSelenium)
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
url <- "https://www.fxblue.com/market-data/tools/sentiment"
remDr$navigate(url)
Get the pair names:
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.SentimentRowCaption') %>%
html_text()
[1] "AUD/CAD" "AUD/JPY" "AUD/NZD" "AUD/USD" "CAD/JPY" "DAX" "EUR/AUD" "EUR/CAD" "EUR/CHF" "EUR/GBP" "EUR/JPY" "EUR/USD" "GBP/AUD" "GBP/CAD" "GBP/CHF"
[16] "GBP/JPY" "GBP/USD" "NZD/USD" "USD/CAD" "USD/CHF" "USD/JPY" "XAU/USD"
Get long figures
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.SentimentValueCaptionLong') %>%
html_text()
[1] "79.2%" "38.4%" "56.1%" "68.9%" "26.8%" "28.7%" "68.7%" "79.5%" "80.7%" "85.3%" "57.0%" "76.4%" "36.1%" "67.4%" "69.7%" "54.9%" "82.3%" "65.1%" "25.0%"
[20] "28.7%" "17.9%" "82.8%"
Get short figures
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.SentimentValueCaptionShort') %>%
html_text()
[1] "20.8%" "61.4%" "43.5%" "31.3%" "73.8%" "70.8%" "31.7%" "20.0%" "19.9%" "14.3%" "43.5%" "23.4%" "64.0%" "32.2%" "30.0%" "45.8%" "17.7%" "34.8%" "74.5%"
[20] "71.3%" "82.2%" "17.0%"
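The three vectors line up by position, so they can be combined into one data frame. A minimal sketch, using a few sample values in place of the live output (the figures on the site change constantly):

```r
# Sample values standing in for the scraped vectors (live figures change)
pairs <- c("AUD/CAD", "AUD/JPY", "EUR/USD")
long  <- c("79.2%", "38.4%", "76.4%")
short <- c("20.8%", "61.4%", "23.4%")

# Strip the percent sign and convert to numeric
pct <- function(x) as.numeric(sub("%", "", x, fixed = TRUE))

sentiment <- data.frame(pair = pairs, long = pct(long), short = pct(short))
sentiment
```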
I want to get some information from tweets posted on the platform StockTwits.
Here you can see an example tweet: https://stocktwits.com/Kndihopefull/message/433815546
I would like to read the following information: the number of replies, the number of reshares, and the number of likes.
I think this is possible with the RSelenium package. However, I am not really getting anywhere with my approach.
Can someone help me?
library(RSelenium)
url<- "https://stocktwits.com/Kndihopefull/message/433815546"
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser = "firefox", port = 4546L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
I would like to have a list (or a data set) as a result, which looks like this:
$Reply
[1] 1
$Reshare
[1] 1
$Like
[1] 7
To get the required info, we can do:
library(rvest)
library(dplyr)
library(RSelenium)
#launch browser
driver = rsDriver(browser = c("firefox"))
url = "https://stocktwits.com/ArcherUS/message/434172145"
remDr <- driver$client
remDr$navigate(url)
#First we shall get the tags
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
[1] "Reply" "Reshare" "Like" "Share" "Search"
#then the number associated with it
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
[1] "" "" "2" "" ""
The last two items Share and Search will be empty.
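To turn the two vectors into the requested named list, the tag titles can be paired with the counts, dropping Share and Search and treating empty strings as zero. A sketch using sample vectors that mirror the output above:

```r
# Sample vectors mirroring the scraped output above
tags   <- c("Reply", "Reshare", "Like", "Share", "Search")
counts <- c("", "", "2", "", "")

keep <- tags %in% c("Reply", "Reshare", "Like")
vals <- suppressWarnings(as.integer(counts[keep]))  # "" becomes NA
vals[is.na(vals)] <- 0L                             # treat missing counts as zero
result <- setNames(as.list(vals), tags[keep])
result
```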
A faster approach would be to use rvest directly.
library(rvest)
url = "https://stocktwits.com/ArcherUS/message/434172145"
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
I'm trying to scrape "1,335,000" from the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R.
library(rvest)
t2 <- read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>%
rvest::html_text()
However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?
As Dave2e pointed out, the page loads its content with JavaScript, so rvest alone can't see the data.
url <- "https://fortune.com/company/amazon-com/fortune500/"
# launch browser
library(rvest)
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$navigate(url)
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[#id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>%
html_text()
[1] "1,335,000"
Data is loaded dynamically from a script tag, so there is no need for the expense of a browser. You could either extract the entire JavaScript object within the script tag, pass it to jsonlite to handle as JSON, then extract what you want, or, if you are just after the employee count, regex it out of the response text.
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)
page <- read_html('https://fortune.com/company/amazon-com/fortune500/')
data <- page %>% html_element('#preload') %>% html_text() %>%
stringr::str_match(. , "PRELOADED_STATE__ = (.*);") %>% .[, 2] %>% jsonlite::parse_json()
print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)
#shorter version
print(page %>% html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[, 2] %>% as.integer() %>% format(big.mark = ","))
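The regex from the shorter version can be checked offline against a stand-in for the page text (the snippet below is made up, shaped like the embedded JSON):

```r
library(stringr)

# Made-up stand-in for the page text containing the embedded JSON
sample_text <- '{"title":"Amazon","employees":"1335000","rank":"2"}'

n <- str_match(sample_text, '"employees":"(\\d+)?"')[, 2]
format(as.integer(n), big.mark = ",")
#> [1] "1,335,000"
```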
I would like to collect email addresses by clicking each name on this website: https://ki.se/en/research/professors-at-ki. I created the following loop. For some reason some emails are not collected, and the code is very slow...
Do you have a better code idea?
Thank you very much in advance.
library(RSelenium)
library(stringr)
# use RSelenium to download emails
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://ki.se/en/research/professors-at-ki")
database <- as.data.frame(matrix(NA, nrow = length(name), ncol = 3))
for(i in 1:length(name)){
#first website
remDr$navigate("https://ki.se/en/research/professors-at-ki")
elems <- remDr$findElements(using = 'xpath', "//strong") #all elements to be selected
elem <- elems[[i]] #do search and click on each one
class(elem)
people<- elem$getElementText()
elem$clickElement()
page <- remDr$getPageSource()
#stringplit
p<-str_split(as.character(page), "\n")
a<-grep("#", p[[1]])
if(length(a)>0){
email<-p[[1]][a[2]]
email<-gsub(" ", "", email)
database[i,1]<-people
database[i,2]<-email
database[i,3]<-"Karolinska Institute"
}
}
RSelenium is usually not the fastest approach, as it requires a browser to load the page. There are cases when RSelenium is the only option, but here you can achieve what you need with the rvest library, which should be faster. As for the errors you receive: there are two professors whose links do not seem to work, and those are what trigger them.
library(rvest)
library(tidyverse)
# getting links to professors microsites as part of the KI main website
r <- read_html("https://ki.se/en/research/professors-at-ki")
people_links <- r %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("https://staff.ki.se/people/")
# accessing the obtained links, getting the e-mails
df <- tibble(people_links) %>%
# filtering out these links as they do not seem to be accessible
filter( !(people_links %in% c("https://staff.ki.se/people/gungra", "https://staff.ki.se/people/evryla")) ) %>%
rowwise() %>%
mutate(
mail = read_html(people_links) %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("mailto:") %>%
str_remove("mailto:")
)
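The mailto extraction inside mutate() can be tried offline on a minimal HTML fragment, since read_html() also accepts a literal string. The fragment below is made up, and html_attr("href") is a slight variant of the html_attrs()/as.character() route used above:

```r
library(rvest)
library(stringr)

# Made-up fragment shaped like a staff page
fragment <- '<html><body>
  <a href="https://staff.ki.se/people/abc">Profile</a>
  <a href="mailto:jane.doe@ki.se">E-mail</a>
</body></html>'

mail <- read_html(fragment) %>%
  html_nodes("a") %>%
  html_attr("href") %>%      # pull the href attribute of each link
  str_subset("^mailto:") %>% # keep only mail links
  str_remove("^mailto:")     # strip the scheme, leaving the address
mail
```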
I am trying to scrape some data from famous people on LinkedIn and I have a few problems. I would like to do the following:
On Hadley Wickham's page (https://www.linkedin.com/in/hadleywickham/) I would like to use RSelenium to log in and click "Show 1 more education" and also "Show 1 more experience" (note that Hadley does not have a "Show 1 more experience" option, but does have "Show 1 more education").
Clicking "Show more experience/education" lets me scrape the full education and experience from the page. Alternatively, Ted Cruz has a "Show 5 more experiences" option, which I would also like to expand and scrape.
Code:
library(RSelenium)
library(rvest)
library(stringr)
library(xml2)
userID = "myEmailLogin" # The linkedIn email to login
passID = "myPassword" # and LinkedIn password
try(rsDriver(port = 4444L, browser = 'firefox'))
remDr <- remoteDriver()
remDr$open()
remDr$navigate("https://www.linkedin.com/login")
user <- remDr$findElement(using = 'id',"username")
user$sendKeysToElement(list(userID,key="tab"))
pass <- remDr$findElement(using = 'id',"password")
pass$sendKeysToElement(list(passID,key="enter"))
Sys.sleep(5) # give the page time to fully load
# Navgate to individual profiles
# remDr$navigate("https://www.linkedin.com/in/thejlo/") # Jennifer Lopez
# remDr$navigate("https://www.linkedin.com/in/cruzted/") # Ted Cruz
remDr$navigate("https://www.linkedin.com/in/hadleywickham/") # Hadley Wickham
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]
signals <- read_html(html)
personFullNameLocationXPath <- '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/ul[1]/li[1]'
personName <- signals %>%
html_nodes(xpath = personFullNameLocationXPath) %>%
html_text()
personTagLineXPath = '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2'
personTagLine <- signals %>%
html_nodes(xpath = personTagLineXPath) %>%
html_text()
personLocationXPath <- '//*[@id="ember49"]/div[2]/div[2]/div[1]/ul[2]/li[1]'
personLocation <- signals %>%
html_nodes(xpath = personLocationXPath) %>%
html_text()
personLocation %>%
gsub("[\r\n]", "", .) %>%
str_trim(.)
# Here is where I have problems
personExperienceTotalXPath = '//*[@id="experience-section"]/ul'
personExperienceTotal <- signals %>%
html_nodes(xpath = personExperienceTotalXPath) %>%
html_text()
The very end, personExperienceTotal, is where I go wrong: I cannot seem to scrape the experience section. When I use my own LinkedIn URL (or some random person's), it seems to work.
My question is: how can I click to expand the experience/education sections and scrape them?
I don't see the data/text I am looking for when scraping a web page.
I tried googling the issue without any luck. I also tried using the XPath, but I get {xml_nodeset (0)}.
require(rvest)
url <- "https://www.nasdaq.com/market-activity/ipos"
IPOS <- read_html(url)
IPOS %>% xml_nodes("tbody") %>% xml_text()
Output:
[1] "\n \n \n \n \n \n "
I do not see any of the IPO data. Expected output should contain the table for the "Priced" IPOs: Symbol, Company Name, etc...
No need for the expense of RSelenium. There is an API call, visible in the browser's network tab, that returns everything as JSON.
For example,
library(jsonlite)
data <- jsonlite::read_json('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')
View(data$data$priced$rows)
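The rows come back as a list of lists, one per IPO; they can be flattened into a data frame with bind_rows(). A sketch over made-up rows that only mimic the shape of the response (the real field names in the API may differ):

```r
library(dplyr)

# Made-up rows mimicking the shape of the API response;
# the real field names may differ
rows <- list(
  list(ticker = "ABCD", company = "Acme Corp",  price = "10.00"),
  list(ticker = "WXYZ", company = "Widget Inc", price = "15.00")
)

# Each row becomes a one-row data frame, then they are stacked
priced <- bind_rows(lapply(rows, as.data.frame))
priced
```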
It seems that the table data are loaded by scripts. You can use the RSelenium package to get them.
library(rvest)
library(RSelenium)
rD <- rsDriver(port = 1210L, browser = "firefox", check = FALSE)
remDr <- rD$client
url <- "https://www.nasdaq.com/market-activity/ipos"
remDr$navigate(url)
IPOS <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_table(fill = TRUE)
str(IPOS)
PRICED <- IPOS[[3]]