Scraping data from LinkedIn using RSelenium (and rvest)

I am trying to scrape some data from famous people on LinkedIn and I have a few problems. I would like to do the following:
On Hadley Wickham's page ( https://www.linkedin.com/in/hadleywickham/ ) I would like to use RSelenium to log in and "click" the "Show 1 more education" button, and also "Show 1 more experience" (note that Hadley does not have a "Show 1 more experience" option, but he does have "Show 1 more education").
Clicking "Show more experience/education" allows me to scrape the full education and experience sections from the page. Alternatively, Ted Cruz has a "Show 5 more experiences" option, which I would also like to expand and scrape.
Code:
library(RSelenium)
library(rvest)
library(stringr)
library(xml2)
userID = "myEmailLogin" # The linkedIn email to login
passID = "myPassword" # and LinkedIn password
try(rsDriver(port = 4444L, browser = 'firefox'))
remDr <- remoteDriver()
remDr$open()
remDr$navigate("https://www.linkedin.com/login")
user <- remDr$findElement(using = 'id',"username")
user$sendKeysToElement(list(userID,key="tab"))
pass <- remDr$findElement(using = 'id',"password")
pass$sendKeysToElement(list(passID,key="enter"))
Sys.sleep(5) # give the page time to fully load
# Navigate to individual profiles
# remDr$navigate("https://www.linkedin.com/in/thejlo/") # Jennifer Lopez
# remDr$navigate("https://www.linkedin.com/in/cruzted/") # Ted Cruz
remDr$navigate("https://www.linkedin.com/in/hadleywickham/") # Hadley Wickham
Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]
signals <- read_html(html)
personFullNameLocationXPath <- '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/ul[1]/li[1]'
personName <- signals %>%
html_nodes(xpath = personFullNameLocationXPath) %>%
html_text()
personTagLineXPath = '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2'
personTagLine <- signals %>%
html_nodes(xpath = personTagLineXPath) %>%
html_text()
personLocationXPath <- '//*[@id="ember49"]/div[2]/div[2]/div[1]/ul[2]/li[1]'
personLocation <- signals %>%
html_nodes(xpath = personLocationXPath) %>%
html_text()
personLocation %>%
gsub("[\r\n]", "", .) %>%
str_trim(.)
# Here is where I have problems
personExperienceTotalXPath = '//*[@id="experience-section"]/ul'
personExperienceTotal <- signals %>%
html_nodes(xpath = personExperienceTotalXPath) %>%
html_text()
The very last step, personExperienceTotal, is where I go wrong: I cannot seem to scrape the experience-section. When I use my own LinkedIn URL (or some random person's), it seems to work.
My question is: how can I click to expand the experience/education sections and then scrape them?
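One possible direction, as a rough and untested sketch: use RSelenium to click the expand buttons first and only then grab the page source, so the expanded sections are present in the HTML that rvest parses. The XPath below, which matches buttons whose text contains "more experience" or "more education", is an assumption about LinkedIn's markup and may need adjusting in your browser's inspector.
# click any "Show ... more experiences/educations" buttons (the locator is an assumption)
more_buttons <- remDr$findElements(
  using = "xpath",
  "//button[contains(., 'more experience') or contains(., 'more education')]"
)
for (btn in more_buttons) {
  btn$clickElement()
  Sys.sleep(2)  # give the expanded section time to render
}
# re-read the page source *after* clicking, then parse as before
html <- remDr$getPageSource()[[1]]
signals <- read_html(html)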

Related

R How to web scrape data from StockTwits with RSelenium?

I want to get some information from tweets posted on the platform StockTwits.
Here you can see an example tweet: https://stocktwits.com/Kndihopefull/message/433815546
I would like to read the following information: Number of replies, number of reshares, number of likes:
I think this is possible with the RSelenium-package. However, I am not really getting anywhere with my approach.
Can someone help me?
library(RSelenium)
url<- "https://stocktwits.com/Kndihopefull/message/433815546"
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser = "firefox", port = 4546L)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
I would like to have a list (or a data set) as a result, which looks like this:
$Reply
[1] 1
$Reshare
[1] 1
$Like
[1] 7
To get the required info we can do:
library(rvest)
library(dplyr)
library(RSelenium)
#launch browser
driver = rsDriver(browser = c("firefox"))
url = "https://stocktwits.com/ArcherUS/message/434172145"
remDr <- driver$client
remDr$navigate(url)
#First we shall get the tags
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
[1] "Reply" "Reshare" "Like" "Share" "Search"
#then the number associated with it
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()
[1] "" "" "2" "" ""
The last two items Share and Search will be empty.
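To get the output in the shape the question asks for, here is a small follow-up sketch that combines the two vectors above into a named list (it assumes the .st_3kvJrBm nodes come back in the order Reply, Reshare, Like, Share, Search, and that an empty count means zero):
nodes  <- remDr$getPageSource()[[1]] %>% read_html() %>% html_nodes('.st_3kvJrBm')
titles <- nodes %>% html_attr('title')
counts <- suppressWarnings(as.integer(nodes %>% html_text()))
counts[is.na(counts)] <- 0  # empty strings -> 0
as.list(setNames(counts, titles))[c("Reply", "Reshare", "Like")]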
A faster approach would be to use rvest directly.
library(rvest)
url = "https://stocktwits.com/ArcherUS/message/434172145"
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_attr('title')
url %>%
read_html() %>% html_nodes('.st_3kvJrBm') %>%
html_text()

Web Scraping on multiple pages with RSelenium and select emails with regular expression

I would like to collect email addresses by clicking each name on this website: https://ki.se/en/research/professors-at-ki . I created the following loop. For some reason some emails are not collected, and the code is very slow...
Do you have a better idea for the code?
Thank you very much in advance
library(RSelenium)
library(stringr)  # needed for str_split() below
# use RSelenium to download the emails
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = F)
remDr <- rD[["client"]]
remDr$navigate("https://ki.se/en/research/professors-at-ki")
# 'name' is assumed to be defined earlier (one entry per professor)
database <- as.data.frame(matrix(NA, nrow = length(name), ncol = 3))
for(i in 1:length(name)){
#first website
remDr$navigate("https://ki.se/en/research/professors-at-ki")
elems <- remDr$findElements(using = 'xpath', "//strong") #all elements to be selected
elem <- elems[[i]] #do search and click on each one
class(elem)
people<- elem$getElementText()
elem$clickElement()
page <- remDr$getPageSource()
#stringplit
p<-str_split(as.character(page), "\n")
a<-grep("#", p[[1]])
if(length(a)>0){
email<-p[[1]][a[2]]
email<-gsub(" ", "", email)
database[i,1]<-people
database[i,2]<-email
database[i,3]<-"Karolinska Institute"
}
}
RSelenium is usually not the fastest approach, as it requires the browser to load the page. There are cases where RSelenium is the only option, but here you can achieve what you need with the rvest library, which should be faster. As for the errors you receive: there are two professors whose links do not seem to be working, hence the errors.
library(rvest)
library(tidyverse)
# getting links to professors microsites as part of the KI main website
r <- read_html("https://ki.se/en/research/professors-at-ki")
people_links <- r %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("https://staff.ki.se/people/")
# accessing the obtained links, getting the e-mails
df <- tibble(people_links) %>%
# filtering out these links as they do not seem to be accessible
filter( !(people_links %in% c("https://staff.ki.se/people/gungra", "https://staff.ki.se/people/evryla")) ) %>%
rowwise() %>%
mutate(
mail = read_html(people_links) %>%
html_nodes("a") %>%
html_attrs() %>%
as.character() %>%
str_subset("mailto:") %>%
str_remove("mailto:")
)
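If some profile pages fail to load or contain no (or several) mailto links, the rowwise mutate above will error out. A slightly more defensive variant, sketched below and not part of the original answer, keeps one odd page from stopping the whole pipeline:
# helper: NA for pages that fail to load or have no mailto link,
# multiple addresses collapsed into one string
get_mail <- function(link) {
  page <- tryCatch(read_html(link), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)
  hrefs <- page %>% html_nodes("a") %>% html_attr("href")
  mails <- hrefs[!is.na(hrefs) & str_detect(hrefs, "^mailto:")] %>% str_remove("^mailto:")
  if (length(mails) == 0) NA_character_ else paste(unique(mails), collapse = "; ")
}
df <- tibble(people_links) %>%
  rowwise() %>%
  mutate(mail = get_mail(people_links)) %>%
  ungroup()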

Google Play web scraping: how can you get the number of votes for each review in R?

I'm web scraping the reviews of a Google Play app in R, but I can't get the number of votes each review received. This is the code I'm using: likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label") and I get no value back. How can it be done?
FULL CODE
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
url <- 'https://play.google.com/store/apps/details?id=com.gospace.parenteral&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
likes <- html_obj %>% html_nodes(".xjKiLb") %>% html_attr("aria-label")
What it returns:
NA NA NA
What I want it to return:
3 3 2
Maybe you are using SelectorGadget to get the CSS selector. Like you, I tried that, but the selector SelectorGadget returns is not the correct one.
Inspecting the HTML code of the page, I realized that the correct element is contained in the tag with class = "jUL89d y92BAb".
So the code you should use is this:
html_obj %>% html_nodes('.jUL89d') %>% html_text()
My personal recommendation is to always check the page's source code to confirm the output of SelectorGadget.
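To turn those strings into the numbers asked for in the question, a small follow-up sketch (assuming the .jUL89d nodes contain only the vote count, with an empty string when a review has no votes):
votes_txt <- html_obj %>% html_nodes('.jUL89d') %>% html_text()
votes <- suppressWarnings(as.integer(votes_txt))  # "" becomes NA
votes[is.na(votes)] <- 0                          # treat a missing count as zero votes
votes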

How to perform web scraping to get all the reviews of an app in Google Play?

I intend to get all the reviews that users leave on Google Play about an app. I have this code, which was suggested in Web scraping in R through Google playstore, but the problem is that it only gets the first 40 reviews. Is there a way to get all of the app's comments?
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
#Specifying the url for desired website to be scraped
url <- 'https://play.google.com/store/apps/details?id=com.phonegap.rxpal&hl=en_IN&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) name field (assuming that with 'name' you refer to the name of the reviewer)
names <- html_obj %>% html_nodes(".kx8XBd .X43Kjb") %>% html_text()
# 2) how many stars they gave
stars <- html_obj %>% html_nodes(".kx8XBd .nt2C1d [role='img']") %>% html_attr("aria-label")
# 3) review they wrote
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
# create the df with all the info
review_data <- data.frame(names = names, stars = stars, reviews = reviews, stringsAsFactors = F)
You can get all the reviews from the web store of GooglePlay.
If you scroll through the reviews, you can see the XHR request is sent to:
https://play.google.com/_/PlayStoreUi/data/batchexecute
With form-data:
f.req: [[["rYsCDe","[[\"com.playrix.homescapes\",7]]",null,"55"]]]
at: AK6RGVZ3iNlrXreguWd7VvQCzkyn:1572317616250
And params of:
rpcids=rYsCDe
f.sid=-3951426241423402754
bl=boq_playuiserver_20191023.08_p0
hl=en
authuser=0
soc-app=121
soc-platform=1
soc-device=1
_reqid=839222
rt=c
After playing around with different parameters, I found out that many are optional, and the request can be simplified as:
form-data:
f.req: [[["UsvDTd","[null,null,[2, $sort,[$review_size,null,$page_token]],[$package_name,7]]",null,"generic"]]]
params:
hl=$review_language
The response is cryptic, but it's essentially JSON data with the keys stripped, similar to protobuf. I wrote a parser for the response that translates it into a regular dict object:
https://gist.github.com/xlrtx/af655f05700eb76bb29aec876493ed90
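If you want to reproduce this request from R, a minimal sketch with httr might look like the following. The simplified f.req template comes from the answer above; the concrete values plugged in for $sort, $review_size and $page_token (1, 40 and null) and the package name are placeholders to substitute with your own, and the endpoint's format may have changed since the answer was written.
library(httr)
package_name <- "com.phonegap.rxpal"  # placeholder app id
freq <- sprintf(
  '[[["UsvDTd","[null,null,[2,1,[40,null,null]],[\\"%s\\",7]]",null,"generic"]]]',
  package_name
)
resp <- POST(
  "https://play.google.com/_/PlayStoreUi/data/batchexecute",
  query = list(hl = "en"),
  body = list(f.req = freq),
  encode = "form"
)
raw_txt <- content(resp, as = "text", encoding = "UTF-8")
# raw_txt still needs the custom parsing described above (the keys are stripped)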

Google Play web scraping: How to identify response to app reviews in R?

I am web scraping the reviews of a Google Play application in R, but I cannot identify which reviews received no response from the app.
Let me explain. I intend to set up a database with two columns: one with the text of the review and another with the app's response to that review. This last column should have empty values when there is no response. However, I only get the answers themselves and cannot identify the absence of an answer. How can this be done?
INPUT / OUTPUT (screenshots omitted): the desired result is a table where the reply column is blank for reviews the app has not answered. How can I get this, i.e. identify the absence of a response?
FULL CODE
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
url <- 'https://play.google.com/store/apps/details?id=com.gospace.parenteral&showAllReviews=true'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
#1 column
reviews <- html_obj %>% html_nodes(".UD7Dzf") %>% html_text()
#2 column
reply <- html_obj %>% html_nodes('.LVQB0b') %>% html_text()
# create the df with all the info
review_data <- data.frame(reviews = reviews, reply = reply, stringsAsFactors = F)
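One way to tell which reviews have no reply, sketched below, is to select the node that wraps each individual review first and then look for the reply inside it; html_node() (singular) returns a missing node when the reply element is absent, which html_text() turns into NA. The '.review-container' selector is only a placeholder: inspect the page to find the real class that wraps one whole review (text plus optional developer reply).
review_nodes <- html_obj %>% html_nodes('.review-container')  # placeholder selector
review_data <- data.frame(
  reviews = review_nodes %>% html_node('.UD7Dzf') %>% html_text(),
  reply   = review_nodes %>% html_node('.LVQB0b') %>% html_text(),  # NA when there is no reply
  stringsAsFactors = FALSE
)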
