I am working on a web scraping project, which aims to extract Google + reviews from a set of children's hospitals. My methodology is as follows:
1) Define a list of Google + urls to navigate to for review scraping. The urls are in a dataframe along with other variables defining the hospital.
2) Scrape reviews, number of stars, and post time for all reviews related to a given url.
3) Save these elements in a dataframe, and name the dataframe after another variable in the dataframe corresponding to the url.
4) Move on to the next url ... and so on till all urls are scraped.
Currently, the code is able to scrape from a single url. I have tried to create a function using map from the purrr package. However it doesn't seem to be working, I am doing something wrong.
Here is my attempt, with comments on the purpose of each step
#Load the necessary libraries
devtools::install_github("ropensci/RSelenium")
library(purrr)
library(dplyr)
library(stringr)
library(rvest)
library(xml2)
library(RSelenium)
#To avoid any SSL error messages
library(httr)
set_config( config( ssl_verifypeer = 0L ) )
Defining the URL dataframe
#Now to define the dataframe with the urls
urls_df =data.frame(Name=c("CHKD","AIDHC")
,ID=c("AAWZ12","AAWZ13")
,GooglePlus_URL=c("https://www.google.co.uk/search?ei=fJUKW9DcJuqSgAbPsZ3gDQ&q=Childrens+Hospital+of+the+Kings+Daughter+&oq=Childrens+Hospital+of+the+Kings+Daughter+&gs_l=psy-ab.3..0i13k1j0i22i10i30k1j0i22i30k1l7.8445.8445.0.9118.1.1.0.0.0.0.144.144.0j1.1.0....0...1c.1.64.psy-ab..0.1.143....0.qDMr7IDA-uA#lrd=0x89ba9869b87f1a69:0x384861b1e3a4efd3,1,,,",
"https://www.google.co.uk/search?q=Alfred+I+DuPont+Hospital+for+Children&oq=Alfred+I+DuPont+Hospital+for+Children&aqs=chrome..69i57.341j0j8&sourceid=chrome&ie=UTF-8#lrd=0x89c6fce9425c92bd:0x80e502f2175fb19c,1,,,"
))
Creating the function
extract_google_review=function(googleplus_urls) {
#Opens a Chrome session
rmDr=rsDriver(browser = "chrome",check = F)
myclient= rmDr$client
#Creates a sub-dataframe for the filtered hospital, which I will later use to name the dataframe
urls_df_sub=urls_df %>% filter(GooglePlus_URL %in% googleplus_urls)
#Navigate to the url
myclient$navigate(googleplus_urls)
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
# Save page source
pagesource= myclient$getPageSource()[[1]]
#simulate scroll down for several times-------------
count=read_html(pagesource) %>%
html_nodes(".p13zmc") %>%
html_text()
#Stores the number of reviews for the url, so we know how many times to scroll down
scroll_down_times=count %>%
str_sub(1,nchar(count)-5) %>%
as.numeric()
for(i in 1 :scroll_down_times){
webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
#the content needs time to load,wait 1.2 second every 5 scroll downs
if(i%%5==0){
Sys.sleep(1.2)
}
}
#loop and simulate clicking on all "click on more" elements-------------
webEles <- myclient$findElements(using = "css",value = ".review-more-link")
for(webEle in webEles){
tryCatch(webEle$clickElement(),error=function(e){print(e)})
}
pagesource= myclient$getPageSource()[[1]]
#this should get the full review, including translation and original text
reviews=read_html(pagesource) %>%
html_nodes(".review-full-text") %>%
html_text()
#number of stars
stars <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes("g-review-stars > span") %>%
html_attr("aria-label")
#time posted
post_time <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes(".dehysf") %>%
html_text()
#Consolidating everything into a dataframe
reviews=head(reviews,min(length(reviews),length(stars),length(post_time)))
stars=head(stars,min(length(reviews),length(stars),length(post_time)))
post_time=head(post_time,min(length(reviews),length(stars),length(post_time)))
reviews_df=data.frame(review=reviews,rating=stars,time=post_time)
#Assign the dataframe a name based on the value in column 'Name' of the dataframe urls_df, defined above
df_name <- tolower(urls_df_sub$Name)
if(exists(df_name)) {
assign(df_name, unique(rbind(get(df_name), reviews_df)))
} else {
assign(df_name, reviews_df)
}
} #End function
Feeding the urls into the function
#Now that the function is defined, it is time to create a vector of urls and feed this vector into the function
googleplus_urls=urls_df$GooglePlus_URL
googleplus_urls %>% map(extract_google_review)
There seems to be an error in the function ,which is preventing it from scraping and storing the data into separate dataframes like intended.
My Intended Output
2 dataframes, each with 3 columns
Any pointers on how this can be improved will be greatly appreciated.
Related
I wan wanting to automate downloading of some unicef data from https://data.unicef.org/indicator-profile/ using rvest or a simila r package. I have noticed that there are indicator codes, but I am having trouble identifying the correct codes and actually downloading the data.
Upon inspecting element, there is a data-inner-wrapper class that seems like it might be useful. You can access a download link by going to a page associated with an indicator and specifying a time period. For example, CME_TMY5T9 is the code for Deaths aged 5 to 9.
The data is available by going to
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=.CME_TMY5T9..&startPeriod=2017&endPeriod=2022` and then clicking a download link.
If anyone could help me figure out how to get all the data, that would be fantastic. Thanks
library(rvest)
library(dplyr)
library(tidyverse)
page = "https://data.unicef.org/indicator-profile/"
df = read_html(page) %>%
#html_nodes("div.data-inner-wrapper")
html_nodes(xpath = "//div[#class='data-inner-wrapper']")
EDIT: Alternatively, downloading all data for each country would be possible. I think that would just require getting the download link or getting at at the data within the table (since country codes arent much of an issue)
This shows all the data for Afghanistan. I just need to figure out a programmatic way of actually downloading the data....
https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=AFG..&startPeriod=1970&endPeriod=2022
You are on the right track! When you visit the website https://data.unicef.org/indicator-profile/, it does not directly contain the indicator codes, because these are loaded dynamically at a later point. You can try using the "network analysis" function of your webbrowser and look at the different requests your browser does to fully load a webpage. The one you are looking for, with all the indicator codes is here: https://uni-drp-rdm-api.azurewebsites.net/api/indicators
library(httr)
library(jsonlite)
library(glue)
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
You also correctly identied the URL, which lets you download individual datasets in the data browser. Now you just needed to find the one that pops up, when you actually download an excel file and recursively add in the differnt helix-codes from the indicators. I have not tried applying this to all indicators, for some the url might differ and you might get incomplete data or errors. But this should get you started.
GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[3]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv()
This might be a good place to get started on how to mimick requests that your browser executes. https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
Here is what I did based on the very helpful code from #Datapumpernickel
library(dplyr)
library(httr)
library(jsonlite)
library(glue)
library(tidyverse)
library(tictoc)
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
#browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
tic()
FULL_DF = NULL
for(i in seq(1,length(unique(indicators$helixCode)),1)){
# Set up a trycatch loop to keep on going when it encounters errors
tryCatch({
print(paste0("Processing : ", i, " of 546 ", indicators$helixCode[i]))
TMP = GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[i]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv(col_types = cols())
# # Basic formatting for variables I want
TMP = TMP %>%
select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
mutate(description = indicators$helixCode[i]) %>%
rename(country = `Geographic area`,
variablename = Indicator,
disaggregation = Sex,
year = TIME_PERIOD,
value = OBS_VALUE)
# rbind each indicator to the full dataframe
FULL_DF = FULL_DF %>% rbind(TMP)
},
error = function(cond){
cat("\n WARNING COULD NOT PROCESS : ", i, " of 546 ", indicators$helixCode[i])
message(cond)
return(NA)
}
)
}
toc()
# Save the data
rio::export(FULL_DF, "unicef-data.csv")
I am trying to grab Hawaii-specific data from this site: https://www.opentable.com/state-of-industry. I want to get the data for Hawaii from every table on the site. This is done after selecting the State tab.
In R, I am trying to use rvest library with SelectorGadget.
So far I've tried
library(rvest)
html <- read_html("https://www.opentable.com/state-of-industry")
html %>%
html_element("tbody") %>%
html_table()
However, this isn't giving me what I am looking for yet. I am getting the Global dataset instead in a tibble. So any suggestions on how grab the Hawaii dataset from the State tab?
Also, is there a way to download the dataset that clicks on Download dataset tab? I can also then work from the csv file.
All the page data is stored in a script tag where it is pulled from dynamically in the browser. You can regex out the JavaScript object containing all the data, and write a custom function to extract just the info for Hawaii as shown below. Function get_state_index is written to accept a state argument, in case you wish to view other states' information.
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
get_state_index <- function(states, state) {
return(match(T, map(states, ~ {
.x$name == state
})))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
date = fullbook$headers %>% unlist() %>% as.Date(),
yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
Regex:
I would like to extract the following data from four nodes all at the same level and sharing the same code name.
# I was able to extract the first of the four nodes - Property Amenities, using google chrome selector gadget as to identify the nodes.
library(rvest)
page0_url<-read_html ("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-
Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html")
result_amenities <- html_text (html_node(page0_url,"._1nAmDotd") %>% html_nodes("div") )
However, I cannot figure out how to pass the code to extract the elements within the second object named "Room Features". This is at the same node level and has the same name code as the one above =.This is also the case for the two objects following to this last one and by the names of "Room types" and "Good to know".
You need to query all of the nodes with same class using the html_nodes() function then parse each of those nodes individually.
For Example
library(rvest)
url<- "https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html"
page0_url<-read_html(url)
result_amenities <- html_text(html_nodes(page0_url,"._1nAmDotd") %>% html_nodes("div") )
names <- html_nodes(page0_url,"div._1mJdgpMJ") %>% html_text()
groupNodes <- html_nodes(page0_url,"._1nAmDotd")
outputlist <-lapply(groupNodes, function(node){
results <- node %>% html_nodes("div") %>% html_text()
})
On the reference page there is no corresponding "_1nAmDotd" node the "Good to Know" section thus leading to an unbalance in the results.
Almost all desirable data (including everything you requested) is available via the page manifest, within a script tag, as that is where it is loaded from. You can regex out that enormous amount of data with regex. Then write user defined functions to extract desired info.
I initially parse the regex matched group into a json object all_data. I then look through that list of lists to find strings only associated with the data of interest. For example, starRating is associated with the location data you are interested in. get_target_list returns that list and then I extract from that what I want. You can see
that location_info holds the data related to hotel amenities (including room amentities), the star rating (hotel class) and languages spoken etc.
E.g. location_info$hotelAmenities$languagesSpoken or location_info$hotelAmenities$highlightedAmenities$roomFeatures ........
N.B. As currently written, it is intended that search_string is unique to the desired list, within the list of lists initially held in the json object. I wasn't sure if the names, of the named lists, would remain constant, so chose to dynamically retrieve the right list.
R:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)
is_target_list <- function(x, search_string) {
return(str_detect(x %>% toString(), search_string))
}
get_target_list <- function(data_list, search_string) {
mask <- lapply(data_list, is_target_list, search_string) %>% unlist()
return(subset(data_list, mask))
}
r <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html") %>%
toString()
all_data <- gsub("pageManifest:", '"pageManifest":', stringr::str_match(r, "(\\{pageManifest:.*);\\(")[, 2]) %>%
jsonlite::parse_json()
data_list <- all_data$pageManifest$urqlCache
# target_info <- get_target_list(data_list, 'hotelAmenities')
location_info <- get_target_list(data_list, "starRating") %>%
unname() %>%
.[[1]] %>%
{
.$data$locations[[1]]$detail
}
Regex:
I'm using rvest to scrape the .txt files of a blog page, and I have a script that triggers every day, and scrapes the newest post. The base of that script is an lapply function that simply scrapes all of the posts, and I later sort out duplicates using Apache NiFi.
That's not an efficient way to sort duplicates, so I was wondering if there's a way to use the same script, and only scrape the newest posts?
The posts are labelled with numbers that count up, such as BLOG001, BLOG002, etc. I want to put a line of code that makes sure to scrape the newest posts (they may post several in any given day). How do I make sure that I only get BlOG002, and the next run only get BLOG003, and so on?
library(tidyverse)
library(rvest)
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Function
out <- Map(function(ln) {
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", ln)),
config(ssl_verifypeer = FALSE))
writeBin(fun1$response$content)
return(fun1$response$content)
}, links)
Assuming that all of the links you want start with 'BLOG' as in your post, and you only want to download the one with the maximum number each time the code is run. You could try something like this to achieve that.
library(tidyverse)
library(rvest)
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
html_nodes("td") %>%
html_nodes("a") %>%
html_attr("href")
# Make sure only 'BLOG' links are checked
links <- links[substr(links, 1, 4) == 'BLOG']
# Get numeric value from link
blog.nums <- as.numeric(substr(links, 5, nchar(links)))
# Get the maximum link value
max.link <- links[which(blog.nums == max(blog.nums))]
fun1 <- html_session(URLencode(
paste0("https://www.example-blog", max.link)),
config(ssl_verifypeer = FALSE))
writeBin(fun1$response$content)
I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team and collect and run the manipulations on every team. I have a dataframe with every team link, like the one above.
Psuedo code:
for (link in teamlist)
{scrape, manipulate, put into a table}
However, I can't figure out how to run loop through the links.
I've tried doing URL = teamlist$link[i], but I get an error when using readhtmltable(). I have no trouble manually pasting each team individual URL into the script, just only when trying to pull it from a table.
Current code:
library(XML)
library(gsubfn)
URL= 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
Thanks.
I agree with #ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.
Here I am generating a list of links that I will iterate through
rm(list=ls())
library(rvest)
mainweb="http://www.basketball-reference.com/"
urls=html("http://www.basketball-reference.com/teams") %>%
html_nodes("#active a") %>%
html_attrs()
Now that the list of links is complete I iterate through each link and pull a table from each
teamdata=c()
j=1
for(i in urls){
bball <- html(paste(mainweb, i, sep=""))
teamdata[j]= bball %>%
html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$","\\1", urls[j], perl=TRUE))) %>%
html_table()
j=j+1
}
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)
Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for(code in team_codes) {
URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats[[j]] = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
j <- j + 1
}