Using Rvest to webscrape rankings - r

I want to scrape all rankings from the following website:
https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating
I have tried using CSS selectors, which tell me to use ".ratingNum", but it leaves me with blank data. I have also tried using the GET function, which results in a similar problem.
# Attempt 1
library(rvest)
url <- 'https://www.glassdoor.com/ratingsDetails/full.htm?employerId=432&employerName=McDonalds#trends-overallRating'
webpage <- read_html(url)
rank_data_html <- html_nodes(webpage, '.ratingNum')
rank_data <- html_table(rank_data_html)
head(rank_data)
# Attempt 2
library(httr)
res <- GET("https://www.glassdoor.com/ratingsDetails/full.htm",
           query = list(employerId = "432",
                        employerName = "McDonalds"))
doc <- read_html(content(res, as = "text"))
rank_data_html <- html_nodes(doc, ".ratingNum")
rank_data <- html_table(rank_data_html)
head(rank_data)
I expect the result to give me a list of all of the rankings, but instead it is giving me an empty list, or a list that doesn't include the rankings.

Your list is empty because you're GETing an unpopulated HTML document. Frequently when this happens you have to resort to RSelenium and co., but Glassdoor's public-facing API actually has everything you need – if you know where to look.
(Note: I'm not sure if this is officially part of Glassdoor's public API, but I think it's fair game if they haven't made more of an effort to conceal it. I tried to find some information, but their documentation is pretty meager. Usually companies will look the other way if you're just doing a smallish analysis and not slamming their servers or trying to profit from their data, but it's still a good idea to heed their ToS. You might want to shoot them an email describing what you're doing, or even ask about becoming an API partner. Make sure you adhere to their attribution rules. Continue at your own peril.)
Take a look at the network tab in your browser's developer tools. You will see some GET requests that return JSON, and one of them has the address you need. Send a GET request and parse the JSON:
library(httr)
library(purrr)
library(dplyr)
ratings <- paste0("https://www.glassdoor.com/api/employer/432-rating.htm?",
                  "locationStr=&jobTitleStr=&filterCurrentEmployee=false")
req_obj <- GET(ratings)
cont <- content(req_obj)
ratings_df <- map(cont$ratings, bind_cols) %>% bind_rows()
ratings_df
You should end up with a dataframe containing ratings data. Just don't forget that "ceoRating", "bizOutlook", and "recommend" are proportions from 0 to 1 (or percentages if multiplied by 100), while the rest reflect average user ratings on a 5-point scale:
# A tibble: 9 x 3
  hasRating type                value
  <lgl>     <chr>               <dbl>
1 TRUE      overallRating       3.3
2 TRUE      ceoRating           0.72
3 TRUE      bizOutlook          0.42
4 TRUE      recommend           0.570
5 TRUE      compAndBenefits     2.8
6 TRUE      cultureAndValues    3.1
7 TRUE      careerOpportunities 3.2
8 TRUE      workLife            3.1
9 TRUE      seniorManagement    2.9
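If you want everything on one scale for reporting, you could rescale those three proportion rows yourself. This is a small optional step on the ratings_df built above, not something the API returns:
library(dplyr)
# optional: express the proportion-based ratings as percentages so all rows
# are easier to compare side by side
ratings_df %>%
  mutate(value = ifelse(type %in% c("ceoRating", "bizOutlook", "recommend"),
                        value * 100, value))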

Related

Is there a way to put a wildcard character in a web address when using rvest?

I am new to web scraping and am using R and rvest to try to pull some info for a friend. This project might be a bit much for my first one, but hopefully someone can help, or tell me whether it is possible.
I am trying to pull info from https://www.veteranownedbusiness.com/mo such as business name, address, phone number, and description. I started by pulling all the names of the businesses and was going to loop through each page to pull the information by business. The problem I ran into is that the business URLs have numbers assigned to them:
www.veteranownedbusiness.com/business/32216/accel-polymers-llc
Is there a way to tell R to ignore this number or accept any entry in its spot so that I could loop through the business names?
Here is the code I have so far to get and clean the business titles if it helps:
library(rvest)
library(tibble)
library(stringr)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
HTML <- read_html(vet_name_list)
biz_names_html <- html_nodes(HTML, '.top_level_listing a')
biz_names <- html_text(biz_names_html)
biz_names <- biz_names[biz_names != ""]
biz_names_lower <- tolower(biz_names)
biz_names_sym <- gsub("[][!#$&%()*,.:;<=>#^_`|~.{}]", "", biz_names_lower)
biz_names_dub <- str_squish(biz_names_sym)
biz_name_clean <- chartr(" ", "-", biz_names_dub)
No, I'm afraid you can't use wildcards to get a valid URL. What you can do is scrape all the correct URLs from the page, numbers and all.
To do this, we find all the relevant nodes (I'm using XPath here rather than CSS selectors since it gives a bit more flexibility), then take the href attribute from each node.
This produces a data frame of business names and URLs. Here's a fully reproducible example:
library(rvest)
library(tibble)
vet_name_list <- "https://www.veteranownedbusiness.com/mo"
biz <- read_html(vet_name_list) %>%
  html_nodes(xpath = "//tr[@class='top_level_listing']/td/a[@href]")
tibble(business = html_text(biz),
       url = html_attr(biz, "href"))
#> # A tibble: 550 x 2
#>    business                                                url
#>    <chr>                                                   <chr>
#>  1 Accel Polymers, LLC                                     /business/32216/ac~
#>  2 Beacon Car & Pet Wash                                   /business/35987/be~
#>  3 Compass Quest Joplin Regional Veteran Services          /business/21943/co~
#>  4 Financial Assistance for Military Experiencing Divorce  /business/20797/fi~
#>  5 Focus Marines Foundation                                /business/29376/fo~
#>  6 Make It Virtual Assistant                               /business/32204/ma~
#>  7 Malachi Coaching & Training Ministries Int'l            /business/24060/ma~
#>  8 Mike Jackson - Author                                   /business/29536/mi~
#>  9 The Mission Continues                                   /business/14492/th~
#> 10 U.S. Small Business Conference & EXPO                   /business/8266/us-~
#> # ... with 540 more rows
Created on 2022-08-05 by the reprex package (v2.0.1)
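From there you can loop over the scraped links to reach each business page. Here is a minimal sketch, assuming you save the tibble above as biz_df and that the relative hrefs resolve against the site root; the selectors for address, phone number, and description on the detail pages still need to be worked out (e.g. with SelectorGadget):
library(rvest)
library(purrr)
base_url <- "https://www.veteranownedbusiness.com"
# read each business page once; the pause keeps the scraping polite
biz_pages <- map(paste0(base_url, biz_df$url), function(u) {
  Sys.sleep(1)
  read_html(u)
})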

How do you view your object list when using databricks notebooks?

Potentially a simple question...
When using the RStudio IDE you have the beautiful and very handy object explorer in the top right, which quickly lets you know what your objects are called, how many rows they have, etc. It is especially handy when you have many objects to keep track of, and even more so when your code is broken and it shows that your dataframe has a different number of rows than you're expecting.
But in Databricks notebooks I can't see anything that gives a view of your objects like that. Does it not exist, or am I missing something?
Unfortunately, there is no UI in Databricks for viewing objects.
For this feature you can submit a feature request (idea) to Databricks.
To inspect the names of the elements, use names(my_df); str(my_df) provides a richer overview of your data object.
Sample code –
library(tibble)
# assumes dbl_vec, chr_vec and log_vec already exist as vectors of length 20
my_df <- data_frame(dbl_vec, chr_vec, log_vec)  # data_frame() is tibble()'s older name
head(my_df)
Output -
  dbl_vec chr_vec log_vec
    <dbl> <chr>   <lgl>
1      1. a       TRUE
2      2. b       TRUE
3      3. c       FALSE
4      4. d       TRUE
5      5. e       FALSE
6      6. f       FALSE
dim(my_df)
[1] 20 3
Reference - https://therbootcamp.github.io/BaselRBootcamp_2018April/_sessions/D1S2_Objects/Objects_practical_answers.html
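As a rough stand-in for the environment pane, you can also build an object list yourself with base R:
# list the objects in the current session with their class and size
objs <- ls()
data.frame(
  object = objs,
  class  = vapply(objs, function(x) class(get(x))[1], character(1)),
  size   = vapply(objs, function(x) format(object.size(get(x)), units = "Kb"),
                  character(1)),
  row.names = NULL
)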

Why can't I read clickable links for webscraping with rvest?

I am trying to scrape this website: https://www.bankofengland.co.uk/news/speeches
The content I need is available after clicking on each title. I can get the content I want if I do this for example (I am using SelectorGadget):
library("rvest")
url_boe ="https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I need to get the text behind each clickable link on the website. So I usually do:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
I get an empty object though. I tried different variants of the code but with the same result.
How can I read those links and then apply the code in the first part to all the links?
Can anyone help me?
Thanks!
As @KonradRudolph noted before, the links are inserted dynamically into the webpage. Therefore, I have produced code using RSelenium and rvest to tackle this issue:
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)
# Get the link names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
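To answer the second part of the question, you can then apply the selector from your first snippet to every collected link (this assumes the df built above; a short pause between requests keeps things polite):
# pull the text of each speech with the selector from the question
speeches <- lapply(as.character(df$links), function(u) {
  Sys.sleep(1)
  html_text(html_nodes(read_html(u), "#output .page-section"))
})
names(speeches) <- df$links_names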

Extract total frequency of words from vector in R

This is the vector I have:
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)"
I want a data frame as a result that contains words and the number of times they occur.
So result should look something like:
word count
a 300
and 260
be 200
... ...
... ...
What I tried to do was use tm:
library(tm)
corpus <- VCorpus(VectorSource(posts))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
m <- DocumentTermMatrix(corpus)
Running findFreqTerms(m, lowfreq = 0, highfreq = Inf) just gives me the words, and I understand it's a sparse matrix. How do I extract the words and their frequencies?
Is there an easier way to do this, maybe by not using tm at all?
posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players. they have private message boards where it appears most of their work goes on. i would bet they are posting more there than in jita speakers corner. i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold. its sort of like ccp used to post here on the forums then they stopped. so they got a csm to represent players and use jita park forum to interact. now the csm no longer posts there as they have their internal forums where they hash things out. perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
posts <- gsub("[[:punct:]]", '', posts) # remove punctuations
posts <- gsub("[[:digit:]]", '', posts) # remove numbers
word_counts <- as.data.frame(table(unlist( strsplit(posts, "\ ") ))) # split vector by space
word_counts <- with(word_counts, word_counts[ Var1 != "", ] ) # remove empty characters
head(word_counts)
# Var1 Freq
# 2 a 8
# 3 about 3
# 4 allows 1
# 5 although 1
# 6 am 1
# 7 an 1
Plain base R solution, assuming all words are separated by spaces:
words <- strsplit(posts, " ", fixed = TRUE)
words <- unlist(words)
counts <- table(words)
names(counts) holds the words, and the values are the counts.
You might want to use gsub to get rid of (),.?: and 's, 't or 're, as in your example:
posts <- gsub("'s|'t|'re", "", posts)
posts <- gsub("[(),.?:]", " ", posts)
You've got two options, depending on whether you want word counts per document or across all documents.
All Documents
library(dplyr)
count <- as.data.frame(t(inspect(m)))
sel_cols <- colnames(count)
count$word <- rownames(count)
rownames(count) <- seq(length = nrow(count))
count$count <- rowSums(count[,sel_cols])
count <- count %>% select(word,count)
count <- count[order(count$count, decreasing=TRUE), ]
### RESULT of head(count)
# word count
# 140 the 14
# 144 they 10
# 4 and 9
# 25 csm 7
# 43 for 5
# 55 had 4
This should capture occurrences across all documents (by use of rowSums).
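An equivalent, slightly more direct route is to convert the DocumentTermMatrix from the question before summing, which avoids relying on inspect():
# m is the DocumentTermMatrix built in the question
freqs <- sort(colSums(as.matrix(m)), decreasing = TRUE)
total_counts <- data.frame(word = names(freqs), count = as.integer(freqs),
                           row.names = NULL)
head(total_counts)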
Per Document
I would suggest using the tidytext package if you want word frequencies per document.
library(tidytext)
m_td <- tidy(m)
The tidytext package allows fairly intuitive text mining, including tokenization. It is designed to work in a tidyverse pipeline, so it supplies a list of stop words ("a", "the", "to", etc.) to exclude with dplyr::anti_join. Here, you might do
library(dplyr) # or if you want it all, `library(tidyverse)`
library(tidytext)
data_frame(posts) %>%
  unnest_tokens(word, posts) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## # A tibble: 101 × 2
## word n
## <chr> <int>
## 1 csm 7
## 2 0.0 3
## 3 nda 3
## 4 bit 2
## 5 ccp 2
## 6 dominion 2
## 7 forum 2
## 8 forums 2
## 9 hard 2
## 10 internal 2
## # ... with 91 more rows

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
SYMBOL ENTREZID GENENAME
1 AFAP1 60312 actin filament associated protein 1
2 ANAPC11 51529 anaphase promoting complex subunit 11
3 ANAPC5 51433 anaphase promoting complex subunit 5
4 ATL2 64225 atlastin GTPase 2
5 AURKA 6790 aurora kinase A
6 CCNB2 9133 cyclin B2
7 CCND2 894 cyclin D2
8 CDCA2 157313 cell division cycle associated 2
9 CDCA7 83879 cell division cycle associated 7
10 CDCA7L 55536 cell division cycle associated 7-like
and I simply want to find their functions; it has been suggested that I use GO analysis tools.
I am not sure if that is the correct way to do so.
Here is my attempt:
library(org.Hs.eg.db)
x <- org.Hs.egGO
# Get the Entrez gene identifiers that are mapped to a GO ID
xx <- as.list(x[gene_list$ENTREZID])
So I've got a list in which each Entrez ID is assigned to several GO terms.
For example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is:
how can I find the function of each of these genes in a simpler way, and am I doing it right?
I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I understand what you are aiming at here.
BTW, for bioinformatics-related topics you can also have a look at Biostars, which has the same purpose as SO but for bioinformatics.
If you just want a list of the functions related to each gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an interface for querying BioMart databases.
You will need an internet connection to run the query, though.
Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes that walk you through the different steps of the analysis (and even highlight how you should structure your data and what some of the pitfalls are).
In your case, the code below comes directly from the biomaRt vignette, task 2 in particular.
Note: there are slightly quicker ways than the one I report below.
# load the library
library("biomaRt")
# I prefer Ensembl, so that's the one I will query, but you can
# query other marts; try out: listMarts()
ensembl <- useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# if you want other model organisms, have a look at:
# listDatasets(ensembl)
You need to create your query (your list of Entrez IDs). To see which filters you can query:
filters = listFilters(ensembl)
And then you want to retrieve attributes: your GO IDs and descriptions. To see the list of available attributes:
attributes = listAttributes(ensembl)
For you, the query would look like something as:
goids = getBM(
#you want entrezgene so you know which is what, the GO ID and
# name_1006 is actually the identifier of 'Go term name'
attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',
values=gene_list$ENTREZID,
mart=ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (though I wouldn't recommend it for anything other than reporting purposes).
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  tempo <- goids[goids$entrezgene == x, ]
  return(
    data.frame('ENTREZGENE' = x,
               'Go.ID'      = paste(tempo$go_id, collapse = ' ; '),
               'GO.term'    = paste(tempo$name_1006, collapse = ' ; '))
  )
}))
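Since you wanted the GO information as an extra column of gene_list, you can then merge the collapsed table back in (a sketch only, using the column names defined above):
# attach the collapsed GO annotation to the original gene_list by Entrez ID
gene_list_annotated <- merge(gene_list, Go.collapsed,
                             by.x = "ENTREZID", by.y = "ENTREZGENE",
                             all.x = TRUE)
head(gene_list_annotated)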
Edit:
If you want to query a past version of the Ensembl database:
ens82 <- useMart(host = 'sep2015.archive.ensembl.org',
                 biomart = 'ENSEMBL_MART_ENSEMBL',
                 dataset = 'hsapiens_gene_ensembl')
and then the query would be:
goids <- getBM(attributes = c('entrezgene', 'go_id', 'name_1006'),
               filters = 'entrezgene', values = gene_list$ENTREZID,
               mart = ens82)
However, if you had in mind to do a GO enrichment analysis, your list of genes is too short.
