web scraping from the same level multiple nodes with the same names - r

I would like to extract the following data from four nodes all at the same level and sharing the same code name.
I was able to extract the first of the four nodes, Property Amenities, using the SelectorGadget extension for Google Chrome to identify the nodes.
library(rvest)
page0_url <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html")
result_amenities <- html_text (html_node(page0_url,"._1nAmDotd") %>% html_nodes("div") )
However, I cannot figure out how to extract the elements within the second object, named "Room Features". It is at the same node level and has the same class name as the one above. This is also the case for the two objects that follow it, named "Room types" and "Good to know".

You need to query all of the nodes with the same class using the html_nodes() function, then parse each of those nodes individually.
For example:
library(rvest)
url<- "https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html"
page0_url<-read_html(url)
result_amenities <- html_text(html_nodes(page0_url,"._1nAmDotd") %>% html_nodes("div") )
names <- html_nodes(page0_url,"div._1mJdgpMJ") %>% html_text()
groupNodes <- html_nodes(page0_url,"._1nAmDotd")
outputlist <-lapply(groupNodes, function(node){
results <- node %>% html_nodes("div") %>% html_text()
})
On the referenced page there is no corresponding "._1nAmDotd" node for the "Good to know" section, which leads to an imbalance between the names and the results.
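One way to keep section titles and items aligned is to iterate over a parent container that holds both, rather than querying titles and item lists separately. The following is only a sketch on an inline HTML fragment; the class names group, title and items are made up for illustration and are not the classes used on the real page.

```r
library(rvest)

# Hypothetical markup: three sections, the last one without an item list
html <- read_html('
<div class="group"><div class="title">Property amenities</div>
  <div class="items"><div>Free parking</div><div>Pool</div></div></div>
<div class="group"><div class="title">Room features</div>
  <div class="items"><div>Kitchenette</div></div></div>
<div class="group"><div class="title">Good to know</div></div>')

groups <- html_nodes(html, "div.group")

# One list element per section; a section with no items yields character(0)
items_by_group <- lapply(groups, function(node) {
  node %>% html_nodes(".items > div") %>% html_text()
})

# Name each element after the title found inside the same container
names(items_by_group) <- groups %>% html_node(".title") %>% html_text()
```

Because each title is read from the same container as its items, a section without items shows up as an empty element instead of shifting the remaining results.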

Almost all of the desirable data (including everything you requested) is available via the page manifest, within a script tag, as that is where the page loads it from. You can extract that enormous amount of data with a regex, then write user-defined functions to pull out the desired info.
I initially parse the regex-matched group into a JSON object, all_data. I then look through that list of lists to find strings associated only with the data of interest. For example, starRating is associated with the location data you are interested in. get_target_list returns that list, and I then extract what I want from it. You can see that location_info holds the data related to hotel amenities (including room amenities), the star rating (hotel class), languages spoken, etc.
E.g. location_info$hotelAmenities$languagesSpoken or location_info$hotelAmenities$highlightedAmenities$roomFeatures
N.B. As currently written, search_string is intended to be unique to the desired list within the list of lists initially held in the JSON object. I wasn't sure whether the names of the named lists would remain constant, so I chose to retrieve the right list dynamically.
R:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)
is_target_list <- function(x, search_string) {
return(str_detect(x %>% toString(), search_string))
}
get_target_list <- function(data_list, search_string) {
mask <- lapply(data_list, is_target_list, search_string) %>% unlist()
return(subset(data_list, mask))
}
r <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html") %>%
toString()
all_data <- gsub("pageManifest:", '"pageManifest":', stringr::str_match(r, "(\\{pageManifest:.*);\\(")[, 2]) %>%
jsonlite::parse_json()
data_list <- all_data$pageManifest$urqlCache
# target_info <- get_target_list(data_list, 'hotelAmenities')
location_info <- get_target_list(data_list, "starRating") %>%
unname() %>%
.[[1]] %>%
{
.$data$locations[[1]]$detail
}
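To make the mechanics easier to follow, here is a small offline sketch of the same idea: capture the object literal from a script tag, quote the bare top-level key so it becomes valid JSON, and pick the cache entry that mentions a search string. The page source, the variable name __DATA__ and the cache keys a1 and b2 are invented for illustration.

```r
library(stringr)
library(jsonlite)

# Toy page source: a JavaScript object literal with an unquoted top-level key
src <- '<script>window.__DATA__ = {pageManifest:{"urqlCache":{"a1":{"data":{"starRating":4}},"b2":{"data":{"other":1}}}}};(function(){})();</script>'

# Capture everything from the opening brace up to the ";(" that follows it
raw_json <- str_match(src, "(\\{pageManifest:.*);\\(")[, 2]

# Quote the bare key so the text parses as JSON
json <- sub("pageManifest:", '"pageManifest":', raw_json, fixed = TRUE)
cache <- parse_json(json)$pageManifest$urqlCache

# Flag entries whose flattened content mentions the search string
is_hit <- vapply(cache, function(x) str_detect(toString(x), "starRating"), logical(1))
target <- cache[is_hit][[1]]
```

Searching the flattened content rather than indexing by key is what lets the lookup survive a change in the cache key names.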

Related

Learning web scraping unable to understand : html_nodes("table") %>% `[[`(6) %>%

I am learning web scraping in R and have written the following code:
url <- "https://en.wikipedia.org/wiki/World_population"
library(rvest)
library(tidyr)
library(dplyr)
ten_most_df <- read_html(url)
ten_most_populous <- ten_most_df %>%
html_nodes("table") %>% `[[`(6) %>% html_table()
In the code above, what does `[[`(6) represent?
I have also consulted some documentation, which says the following, but it did not make things clear to me:
"For vectors and matrices the [[ forms are rarely used, although they have some slight semantic
differences from the [ form (e.g. it drops any names or dimnames attribute, and that partial
matching is used for character indices)"
Could you please explain this? It would be very helpful. Thanks.
It's just one way of selecting the 6th element from the nodeset.
The code ten_most_df %>% html_nodes("table") returns an xml_nodeset object with 26 elements, corresponding to the 26 tables on the page. `[[`(6) subsets the object and returns the 6th node.
In fact there's a quicker way using only html_table, which returns the tables in a list:
ten_most_df %>%
html_table() %>%
.[[6]]
Personally I find this a little easier to read; the . represents the list and [[n]] is the standard way to access list element number n.
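A quick way to convince yourself that these spellings are the same is to try them on a plain list. This sketch uses a stand-in list rather than real page nodes:

```r
library(magrittr)

tables <- list("first", "second", "third")  # stand-in for a list of nodes

# `[[` is an ordinary function, so these three expressions are equivalent
by_bracket <- tables[[2]]
by_call    <- `[[`(tables, 2)
by_pipe    <- tables %>% `[[`(2)

# By contrast, `[` keeps the container: the result is a list of length 1
kept_in_list <- tables[2]
```

The backticks are only needed because [[ is not a syntactically valid function name on its own.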

R - Using SelectorGadget to grab a dataset

I am trying to grab Hawaii-specific data from this site: https://www.opentable.com/state-of-industry. I want to get the data for Hawaii from every table on the site. This is done after selecting the State tab.
In R, I am trying to use rvest library with SelectorGadget.
So far I've tried
library(rvest)
html <- read_html("https://www.opentable.com/state-of-industry")
html %>%
html_element("tbody") %>%
html_table()
However, this isn't giving me what I am looking for yet. I am getting the Global dataset in a tibble instead. Any suggestions on how to grab the Hawaii dataset from the State tab?
Also, is there a way to download the dataset offered under the Download dataset tab? I could then work from the CSV file.
All the page data is stored in a script tag where it is pulled from dynamically in the browser. You can regex out the JavaScript object containing all the data, and write a custom function to extract just the info for Hawaii as shown below. Function get_state_index is written to accept a state argument, in case you wish to view other states' information.
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
# Return the position of the first state record whose name matches `state`
get_state_index <- function(states, state) {
return(match(TRUE, map(states, ~ {
.x$name == state
})))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
date = fullbook$headers %>% unlist() %>% as.Date(),
yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
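The lookup step can be tried without the network. The sketch below uses a toy stand-in for fullbook$states (the state names are real, the numbers are made up) and swaps in purrr::detect_index(), which is an alternative to the match()/map() combination rather than what the answer above uses:

```r
library(purrr)

# Toy stand-in for fullbook$states: a list of records with a name and values
states <- list(
  list(name = "Alaska", yoy = list(-10, -20)),
  list(name = "Hawaii", yoy = list(-30, -40))
)

# Position of the first record whose name matches (0 if none does)
idx <- detect_index(states, ~ .x$name == "Hawaii")

# Flatten that record's values into a plain numeric vector
hawaii_yoy <- unlist(states[[idx]]$yoy)
```

Checking for idx == 0 before indexing would guard against a misspelled state name.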

R Scrape a list of Google + urls using purrr package

I am working on a web scraping project, which aims to extract Google + reviews from a set of children's hospitals. My methodology is as follows:
1) Define a list of Google + urls to navigate to for review scraping. The urls are in a dataframe along with other variables defining the hospital.
2) Scrape reviews, number of stars, and post time for all reviews related to a given url.
3) Save these elements in a dataframe, and name the dataframe after another variable in the dataframe corresponding to the url.
4) Move on to the next url ... and so on till all urls are scraped.
Currently, the code is able to scrape from a single url. I have tried to create a function and apply it using map from the purrr package. However, it doesn't seem to be working, so I must be doing something wrong.
Here is my attempt, with comments on the purpose of each step
#Load the necessary libraries
devtools::install_github("ropensci/RSelenium")
library(purrr)
library(dplyr)
library(stringr)
library(rvest)
library(xml2)
library(RSelenium)
#To avoid any SSL error messages
library(httr)
set_config( config( ssl_verifypeer = 0L ) )
Defining the URL dataframe
#Now to define the dataframe with the urls
urls_df =data.frame(Name=c("CHKD","AIDHC")
,ID=c("AAWZ12","AAWZ13")
,GooglePlus_URL=c("https://www.google.co.uk/search?ei=fJUKW9DcJuqSgAbPsZ3gDQ&q=Childrens+Hospital+of+the+Kings+Daughter+&oq=Childrens+Hospital+of+the+Kings+Daughter+&gs_l=psy-ab.3..0i13k1j0i22i10i30k1j0i22i30k1l7.8445.8445.0.9118.1.1.0.0.0.0.144.144.0j1.1.0....0...1c.1.64.psy-ab..0.1.143....0.qDMr7IDA-uA#lrd=0x89ba9869b87f1a69:0x384861b1e3a4efd3,1,,,",
"https://www.google.co.uk/search?q=Alfred+I+DuPont+Hospital+for+Children&oq=Alfred+I+DuPont+Hospital+for+Children&aqs=chrome..69i57.341j0j8&sourceid=chrome&ie=UTF-8#lrd=0x89c6fce9425c92bd:0x80e502f2175fb19c,1,,,"
))
Creating the function
extract_google_review=function(googleplus_urls) {
#Opens a Chrome session
rmDr=rsDriver(browser = "chrome",check = F)
myclient= rmDr$client
#Creates a sub-dataframe for the filtered hospital, which I will later use to name the dataframe
urls_df_sub=urls_df %>% filter(GooglePlus_URL %in% googleplus_urls)
#Navigate to the url
myclient$navigate(googleplus_urls)
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
# Save page source
pagesource= myclient$getPageSource()[[1]]
#simulate scroll down for several times-------------
count=read_html(pagesource) %>%
html_nodes(".p13zmc") %>%
html_text()
#Stores the number of reviews for the url, so we know how many times to scroll down
scroll_down_times=count %>%
str_sub(1,nchar(count)-5) %>%
as.numeric()
for(i in 1 :scroll_down_times){
webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
#the content needs time to load,wait 1.2 second every 5 scroll downs
if(i%%5==0){
Sys.sleep(1.2)
}
}
#loop and simulate clicking on all "click on more" elements-------------
webEles <- myclient$findElements(using = "css",value = ".review-more-link")
for(webEle in webEles){
tryCatch(webEle$clickElement(),error=function(e){print(e)})
}
pagesource= myclient$getPageSource()[[1]]
#this should get the full review, including translation and original text
reviews=read_html(pagesource) %>%
html_nodes(".review-full-text") %>%
html_text()
#number of stars
stars <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes("g-review-stars > span") %>%
html_attr("aria-label")
#time posted
post_time <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes(".dehysf") %>%
html_text()
#Consolidating everything into a dataframe
reviews=head(reviews,min(length(reviews),length(stars),length(post_time)))
stars=head(stars,min(length(reviews),length(stars),length(post_time)))
post_time=head(post_time,min(length(reviews),length(stars),length(post_time)))
reviews_df=data.frame(review=reviews,rating=stars,time=post_time)
#Assign the dataframe a name based on the value in column 'Name' of the dataframe urls_df, defined above
df_name <- tolower(urls_df_sub$Name)
if(exists(df_name)) {
assign(df_name, unique(rbind(get(df_name), reviews_df)))
} else {
assign(df_name, reviews_df)
}
} #End function
Feeding the urls into the function
#Now that the function is defined, it is time to create a vector of urls and feed this vector into the function
googleplus_urls=urls_df$GooglePlus_URL
googleplus_urls %>% map(extract_google_review)
There seems to be an error in the function, which is preventing it from scraping and storing the data in separate dataframes as intended.
My Intended Output
2 dataframes, each with 3 columns
Any pointers on how this can be improved will be greatly appreciated.
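One possible way to organise the four steps is to have the scraping function return its data frame and let the caller collect the results in a named list. Note that assign() called inside a function writes to that function's own environment by default, so the variables it creates do not survive the call. The sketch below is only an outline under stated assumptions: scrape_reviews() is a hypothetical stub standing in for the RSelenium code, and the URLs are placeholders.

```r
library(purrr)

urls_df <- data.frame(
  Name = c("CHKD", "AIDHC"),
  URL = c("https://example.com/a", "https://example.com/b"),  # placeholder URLs
  stringsAsFactors = FALSE
)

# Hypothetical stub: the real version would drive the browser and parse the page
scrape_reviews <- function(url) {
  data.frame(
    review = paste("review from", url),
    rating = NA_character_,
    time = NA_character_
  )
}

# One data frame per URL, named after the lower-cased hospital name
reviews <- map(set_names(urls_df$URL, tolower(urls_df$Name)), scrape_reviews)
```

Each result is then available as reviews$chkd or reviews$aidhc, and the whole list can be combined later if a single table is needed.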

Using str_extract_all to extract publications from specific portal out of hundreds of portals in a string.

Using a simple code to extract the links to my articles (one by one)
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
final = list()
final = read_html(frontpagelinks[1]) %>%
html_nodes("h1 a") %>%
html_attr("href")%>%
paste0()
I used
a1onJune = str_extract_all(frontpage, ".*a1on.*")
to extract articles from the website a1on dot mk, which worked like a charm, finding only the articles I needed.
After getting some help here as to how to make my code more efficient, i.e. extract numerous links at once, via:
linksList <- lapply(frontpagelinks, function(i) {
read_html(frontpagelinks[i]) %>%
html_nodes("h1 a") %>%
html_attr("href")%>%
paste0()
})
which extracts all of the links I need, the same stringr code returns oddly enough something like this
"\"standard dot mk/germancite-ermenskiot-genocid/\", \"//plusinfo dot mk/vest/72702/turcija-ne-go-prifakja-zborot-genocid\", \"/a1on dot mk/wordpress/archives/618719\", \"sitel dot mk/na-povidok-nov-sudir-megju-turcija-i-germanija\",
Here I do extract the links to the website I need (the a1on entries), but also a bunch of other noise that I definitely don't want there. I tried a variety of regular expressions, but I have not managed to select only the entries that contain a1on posts.
Given that the list I am attempting to clean up outputs separate links, I am a bit baffled that when I use stringr it (as far as I can tell) randomly groups them into strings of multiple links:
[93] "http://telegraf dot mk /aktuelno/svet/ns-newsarticle-vo-znak-na-protest-turcija-go-povlece-svojot-ambasador-od-germanija.nspx"
[94] "http://tocka dot mk /1/197933/odnosite-pomegju-berlin-i-ankara-pred-totalen-kolaps-germanija-go-prizna-turskiot-genocid-nad-ermencite"
[95] "lokalno dot mk /merkel-vladata-na-germanija-e-podgotvena-da-pomogne-vo-dijalogot-megju-turcija-i-ermenija/"
Any thoughts as to how I can go about this? Perhaps something that is more general, given that I need to do the same type of cleaning for five different portals.
Thank you.
Using a simple code to extract the links to my articles (one by one)
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
# lapply returns a list of character vectors, so use unlist to flatten
linksList <- unlist( lapply(frontpagelinks, function(i) {
read_html(i) %>%
html_nodes("h1 a") %>%
html_attr("href") %>%
paste0()}))
# grab the lists of interest
a1onLinks <- linksList[grepl(".*a1on.*", linksList)]
# [1] "http://a1on.mk/wordpress/archives/621196" "http://a1on.mk/wordpress/archives/621038"
# [3] "http://a1on.mk/wordpress/archives/620576" "http://a1on.mk/wordpress/archives/620686"
# [5] "http://a1on.mk/wordpress/archives/620364" "http://a1on.mk/wordpress/archives/620399"
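Since the same cleaning is needed for several portals, the filtering step can be applied once per portal name. This is a sketch on a toy vector modelled on the links quoted in the question; the portal list is only an example and the last link is invented:

```r
# Toy links, modelled on the ones quoted in the question
links <- c(
  "http://a1on.mk/wordpress/archives/621196",
  "http://sitel.mk/na-povidok-nov-sudir-megju-turcija-i-germanija",
  "http://a1on.mk/wordpress/archives/621038",
  "http://telegraf.mk/aktuelno/svet/example"
)

portals <- c("a1on", "sitel", "telegraf")

# One character vector of links per portal; fixed = TRUE matches the name literally
links_by_portal <- lapply(setNames(portals, portals), function(p) {
  links[grepl(p, links, fixed = TRUE)]
})
```

links_by_portal$a1on then holds only the a1on links, and adding a portal is a matter of extending the portals vector.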

Scraping Movie reviews from IMDB using rvest

I have extracted the reviews of a movie on IMDB but the separate reviews have a lot of blank lines between them. It is unstructured and very difficult to view.
I have to apply certain functions on each of them separately and then store them together as 1 for some text mining for some other functions.
How can I structure (clean) them and access them one at a time and also how to combine them and store it together?
Here is my code for scraping the reviews
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
html_nodes("p") %>%
html_text()
I would suggest that you are more specific when you navigate the DOM. For instance, this code will only deliver reviews and none of the other information that you are presumably not looking to scrape:
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/tt", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>% html_nodes("#pagecontent") %>%
html_nodes("div+ p") %>%
html_text()
And here is a way to remove line breaks, applying a function to each review, and merging all reviews into one paragraph (also see this post on concatenating vector elements and this post on replacing line breaks):
ex_review <- gsub("[\r\n]", " ", ex_review) # replace line breaks
sapply(ex_review, function(x){}) # apply function to each review
ex_review <- paste(ex_review, collapse = "") # concatenate reviews into one paragraph
write(ex_review, "test.txt")
I think you were also missing a "tt" in the URL.
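The clean, apply and combine steps can be tried on a couple of toy strings before running them on the scraped reviews. This sketch uses invented review text and character count as a stand-in for whatever per-review function you need:

```r
# Toy reviews with the kind of stray whitespace html_text() tends to return
reviews <- c("\n  Great film.\n\n", "Too long,\r\nbut fun.  ")

# Collapse every run of whitespace (including line breaks) and trim the ends
clean <- trimws(gsub("\\s+", " ", reviews))

# Apply a function to each review separately (here: character count)
review_lengths <- vapply(clean, nchar, integer(1), USE.NAMES = FALSE)

# Combine everything into a single string, one space between reviews
combined <- paste(clean, collapse = " ")
```

Keeping clean as a vector gives access to one review at a time (clean[1], clean[2], ...), while combined is the single text for mining. Using collapse = " " rather than "" keeps the last word of one review from running into the first word of the next.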
