Scraping soundcloud.com with the rvest package in R

I am attempting to scrape this URL to get the names of the top 50 SoundCloud artists in Canada.
Using SelectorGadget, I selected the artists' names and it told me the selector is '.sc-link-light'.
My first attempt was as follows:
library(rvest)
library(stringr)
library(reshape2)
soundcloud <- read_html("https://soundcloud.com/charts/top?genre=all-music&country=CA")
artist_name <- soundcloud %>% html_nodes('.sc-link-light') %>% html_text()
which yielded artist_name as a character vector of length 0.
My second attempt I changed the last line to:
artist_name <- soundcloud %>% html_node(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "sc-link-light", " " ))]') %>% html_text()
which again yielded the same result.
What exactly am I doing wrong? I believe this should give me the artists' names in a list.
Any help is appreciated, thank you.

The webpage you are attempting to scrape is dynamic. As a result, you will need to use a library such as RSelenium. A sample script is below:
library(tidyverse)
library(RSelenium)
library(rvest)
library(stringr)
url <- "https://soundcloud.com/charts/top?genre=all-music&country=CA"
rD <- rsDriver(browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate(url)
pg <- read_html(remDr$getPageSource()[[1]])
artist_name <- pg %>% html_nodes('.sc-link-light') %>% html_text()
####clean up####
remDr$close()
rD$server$stop()
rm(rD, remDr)
gc()
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE) #Windows only: kill the orphaned Selenium/Java process
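One caveat: getPageSource() can run before the page's JavaScript has rendered the chart, in which case the nodes will still be missing. A short pause after navigate() is a crude but common fix; a minimal sketch:
remDr$navigate(url)
Sys.sleep(5) #give the JavaScript time to render the chart before grabbing the source
pg <- read_html(remDr$getPageSource()[[1]])
artist_name <- pg %>% html_nodes('.sc-link-light') %>% html_text()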

Related

How to web scrape data from StockTwits with RSelenium in R?

I want to get some information from tweets posted on the platform StockTwits.
Here you can see an example tweet: https://stocktwits.com/Kndihopefull/message/433815546
I would like to read the following information: number of replies, number of reshares, and number of likes.
I think this is possible with the RSelenium-package. However, I am not really getting anywhere with my approach.
Can someone help me?
library(RSelenium)
url<- "https://stocktwits.com/Kndihopefull/message/433815546"
# RSelenium with Firefox
rD <- RSelenium::remoteDriver(browser="firefox", port=4546L)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
I would like to have a list (or a data set) as a result, which looks like this:
$Reply
[1] 1
$Reshare
[1] 1
$Like
[1] 7
To get the required info, we can do:
library(rvest)
library(dplyr)
library(RSelenium)
#launch browser
driver = rsDriver(browser = c("firefox"))
url = "https://stocktwits.com/ArcherUS/message/434172145"
remDr <- driver$client
remDr$navigate(url)
#First we shall get the tags
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
[1] "Reply" "Reshare" "Like" "Share" "Search"
#then the number associated with it
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
[1] "" "" "2" "" ""
The last two items, Share and Search, will be empty.
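To get the named list shown in the question, the titles and counts can be combined, treating empty counts as zero (a sketch; it assumes the icon order on the page is stable):
tags <- remDr$getPageSource()[[1]] %>% read_html() %>% html_nodes('.st_3kvJrBm')
titles <- tags %>% html_attr('title')
counts <- tags %>% html_text()
counts[counts == ""] <- "0" #empty counts are rendered as blanks
wanted <- titles %in% c("Reply", "Reshare", "Like")
as.list(setNames(as.integer(counts[wanted]), titles[wanted]))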
A faster approach would be to use rvest directly.
library(rvest)
url = "https://stocktwits.com/ArcherUS/message/434172145"
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()

RSelenium findElement returns error: Selenium message: Unable to locate element

I am scraping information from this page: https://lsf.uni-heidelberg.de/qisserver/rds?state=change&type=6&moduleParameter=personalSelect&nextdir=change&next=SearchSelect.vm&target=personSearch&subdir=person&init=y&source=state%3Dchange%26type%3D5%26moduleParameter%3DpersonSearch%26nextdir%3Dchange%26next%3Dsearch.vm%26subdir%3Dperson%26menuid%3Dsearch%26_form%3Ddisplay%26topitem%3Dmembers%26subitem%3D%26field%3DNachname&targetfield=Nachname&_form=display
I would like to search each individual to collect email addresses. I am doing the following, but I can't find a way to click the search button.
library(rvest)
library(stringr)
library(RSelenium)
#url
uni<-"https://lsf.uni-heidelberg.de/qisserver/rds?state=change&type=6&moduleParameter=personalSelect&nextdir=change&next=SearchSelect.vm&target=personSearch&subdir=person&init=y&source=state%3Dchange%26type%3D5%26moduleParameter%3DpersonSearch%26nextdir%3Dchange%26next%3Dsearch.vm%26subdir%3Dperson%26menuid%3Dsearch%26_form%3Ddisplay%26topitem%3Dmembers%26subitem%3D%26field%3DNachname&targetfield=Nachname&_form=display"
#people's name
r<-read_html(uni)
name <- r %>%
html_nodes("a") %>%
html_text()
name<-name[40:length(name)]
name<-gsub("\n","",name ,fixed = T)
name<-gsub("\t","",name ,fixed = T)
#people's first link
link <- r %>%
html_nodes("a") %>%
html_attrs() %>%
as.character()
link<-link[40:length(link)]
link<-str_split(link, '"')
link<-sapply(link, "[", 6)
#create a loop: with R selenium, click on search for each link and get emails which are in the next page
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
#remDr$navigate("https://ki.se/en/research/professors-at-ki")
for (i in 1:length(link)) {
#r<- read_html(link[i])
remDr$navigate(link[i])
webElem <- remDr$findElement(using = 'xpath', '//*[contains(concat( " ", @class, " " ), concat( " ", "abstand_search", " " ))]//font//input')
webElem$clickElement()
#here i get the error
}
Here are some pointers. To gather the links, I would go with CSS selectors, which are faster and more intuitive to read:
library(rvest)
links <- read_html('https://lsf.uni-heidelberg.de/qisserver/rds?state=change&type=6&moduleParameter=personalSelect&nextdir=change&next=SearchSelect.vm&target=personSearch&subdir=person&init=y&source=state%3Dchange%26type%3D5%26moduleParameter%3DpersonSearch%26nextdir%3Dchange%26next%3Dsearch.vm%26subdir%3Dperson%26menuid%3Dsearch%26_form%3Ddisplay%26topitem%3Dmembers%26subitem%3D%26field%3DNachname&targetfield=Nachname&_form=display') %>%
html_nodes('.regular[name]') %>%
html_attr('href')
Then, I would use the same strategy to target the search button:
webElem <- remDr$findElement(using = 'css selector', '.abstand_search + [value="Suche starten"]') # this matches for the element which is interactable
Finally, I would pick up the name and email from the destination page:
name <- remDr$findElement(using = 'css selector', '.regular')
email <- remDr$findElement(using = 'css selector', '[href*=mail]') # could also take 2nd match for .regular
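Putting those pieces together, a minimal loop might look like this (a sketch; it assumes a running remDr session and the links vector gathered above):
for (i in seq_along(links)) {
remDr$navigate(links[i])
#click through to the person's detail page
remDr$findElement(using = 'css selector', '.abstand_search + [value="Suche starten"]')$clickElement()
#getElementText() returns a list, so unlist to get the character value
name_i <- unlist(remDr$findElement(using = 'css selector', '.regular')$getElementText())
email_i <- unlist(remDr$findElement(using = 'css selector', '[href*=mail]')$getElementText())
print(c(name_i, email_i))
}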
I got around it by using rvest in the following way inside the loop:
#use RSelenium to download emails
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
emails<-list()
for (i in 1:length(links)) {
#r<- read_html(link[i])
remDr$navigate(links[i])
webElem <- remDr$findElement(using = 'css selector', '.abstand_search + [value="Suche starten"]') # this matches for the element which is interactable
webElem$clickElement()
r <- read_html(unlist(webElem$getCurrentUrl()))
mail <- r %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("mailto:") %>%
str_remove("mailto:")
if(length(mail)!=0){
a<- str_split(mail, "href")
a<-unlist(a)
w<-which((grepl("#",a, fixed = T)))
emails<-c(emails,a[w])
}else{ emails<-c(emails,NA)}
rm(mail)
}
Not the most elegant code, but it works. Getting the names is more complex, and I cannot find a way to get the right CSS or XPath. Let me know if you can think of more elegant and faster code, or if the issue can only be solved in a brute-force way.

Web Scraping on multiple pages with RSelenium and select emails with regular expression

I would like to collect email addresses by clicking each name on this website: https://ki.se/en/research/professors-at-ki. I created the following loop, but for some reason some emails are not collected, and the code is very slow...
Do you have a better code idea?
Thank you very much in advance
library(RSelenium)
library(stringr)
#use RSelenium to download emails
rD <- rsDriver(browser="firefox", port=4545L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://ki.se/en/research/professors-at-ki")
database <- data.frame(matrix(NA, nrow = length(name), ncol = 3)) #preallocate: name, email, institution
for(i in 1:length(name)){
#first website
remDr$navigate("https://ki.se/en/research/professors-at-ki")
elems <- remDr$findElements(using = 'xpath', "//strong") #all elements to be selected
elem <- elems[[i]] #do search and click on each one
class(elem)
people<- elem$getElementText()
elem$clickElement()
page <- remDr$getPageSource()
#stringplit
p<-str_split(as.character(page), "\n")
a<-grep("#", p[[1]])
if(length(a)>0){
email<-p[[1]][a[2]]
email<-gsub(" ", "", email)
database[i,1]<-people
database[i,2]<-email
database[i,3]<-"Karolinska Institute"
}
}
RSelenium is usually not the fastest approach, as it requires the browser to load the page. There are cases when RSelenium is the only option, but here you can achieve what you need with the rvest library, which should be faster. As for the errors you receive: there are two professors for whom the provided links do not seem to work, hence the errors.
library(rvest)
library(tidyverse)
# getting links to professors microsites as part of the KI main website
r <- read_html("https://ki.se/en/research/professors-at-ki")
people_links <- r %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("https://staff.ki.se/people/")
# accessing the obtained links, getting the e-mails
df <- tibble(people_links) %>%
# filtering out these links as they do not seem to be accessible
filter( !(people_links %in% c("https://staff.ki.se/people/gungra", "https://staff.ki.se/people/evryla")) ) %>%
rowwise() %>%
mutate(
mail = read_html(people_links) %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("mailto:") %>%
str_remove("mailto:")
)
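If you would rather not hard-code the failing links, purrr::possibly (purrr is loaded with the tidyverse above) can make the per-page scrape failure-tolerant; a sketch:
library(purrr)
#get_mail returns NA instead of aborting the whole pipeline when a profile page fails to load
get_mail <- possibly(function(link) {
read_html(link) %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("mailto:") %>%
str_remove("mailto:")
}, otherwise = NA_character_)
df <- tibble(people_links) %>%
rowwise() %>%
mutate(mail = get_mail(people_links))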

rvest handling hidden text

I don't see the data/text I am looking for when scraping a web page. I tried googling the issue without any luck. I also tried using the XPath, but I get {xml_nodeset (0)}.
require(rvest)
url <- "https://www.nasdaq.com/market-activity/ipos"
IPOS <- read_html(url)
IPOS %>% xml_nodes("tbody") %>% xml_text()
Output:
[1] "\n \n \n \n \n \n "
I do not see any of the IPO data. Expected output should contain the table for the "Priced" IPOs: Symbol, Company Name, etc...
No need for the expensive RSelenium. There is an API call, visible in the browser's network tab, that returns everything as JSON.
For example,
library(jsonlite)
data <- jsonlite::read_json('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')
View(data$data$priced$rows)
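read_json returns nested lists, so each element of data$data$priced$rows is itself a named list of fields. Assuming the field names are consistent across rows, they flatten neatly into a data frame, for example:
library(dplyr)
priced <- bind_rows(data$data$priced$rows) #one row per priced IPO
head(priced)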
It seems that the table data are loaded by scripts. You can use the RSelenium package to get them.
library(rvest)
library(RSelenium)
rD <- rsDriver(port = 1210L, browser = "firefox", check = FALSE)
remDr <- rD$client
url <- "https://www.nasdaq.com/market-activity/ipos"
remDr$navigate(url)
IPOS <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_table(fill = TRUE)
str(IPOS)
PRICED <- IPOS[[3]]
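Note that IPOS[[3]] relies on the table's position on the page, which may change. A slightly more defensive pick (assuming the priced table has a Symbol column) selects it by header instead:
has_symbol <- sapply(IPOS, function(tbl) "Symbol" %in% names(tbl))
PRICED <- IPOS[[which(has_symbol)[1]]]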

No data when scraping with rvest

I am trying to scrape a website but it does not give me any data.
#Get the Data
require(tidyverse)
require(rvest)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#get data
url %>%
read_html() %>%
html_nodes(".green div:nth-child(1)") %>%
html_text()
character(0)
I have also tried using xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "green", " " ))]//div[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a', but this gives the same empty result.
I am expecting horse names. Shouldn't I at least get some JavaScript code, even if the data on the page is rendered by JavaScript?
I can't see what other CSS selector I should use here.
You can use the RSelenium package to scrape dynamic pages:
library(RSelenium)
#specify the url
url <- 'https://www.travsport.se/sresultat?kommando=tevlingsdagVisa&tevdagId=570243&loppId=0&valdManad&valdLoppnr&source=S'
#Create the remote driver / navigator
rsd <- rsDriver(browser = "chrome")
remDr <- rsd$client
#Go to your url
remDr$navigate(url)
page <- read_html(remDr$getPageSource()[[1]])
#get the horse data by parsing the Selenium page source with rvest, as above
page %>% html_nodes(".green div:nth-child(1)") %>% html_text()
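As in the SoundCloud example above, remember to shut the session down when you are finished:
#clean up
remDr$close()
rsd$server$stop()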
Hope that helps.
