RStudio web scraping - rvest returns character(0)

I'm working on an undergraduate project that requires me to scrape the following data from multiple Airbnb listings.
Here is an example:
https://www.airbnb.com.sg/rooms/49091?_set_bev_on_new_domain=1582777903_ZWE4MTBjMGNmYmFh&source_impression_id=p3_1582778001_lB%2BjT8%2BWgIsL%2FrBV
The data I need to scrape is "1 guest, 1 bedroom, 1 bed, 1 bathroom".
However, when I use the CSS selector tool, the path it gives me is "._b2fuovg".
This returns character(0) when I run the following code.
library(rvest)
library(dplyr)
url1 <- read_html("https://www.airbnb.com.sg/rooms/49091?_set_bev_on_new_domain=1582777903_ZWE4MTBjMGNmYmFh&source_impression_id=p3_1582778001_lB%2BjT8%2BWgIsL%2FrBV")
url1 %>%
html_nodes("._b2fuovg") %>%
html_text()
The output is:
> url1 %>%
+ html_nodes("._b2fuovg") %>%
+ html_text()
character(0)
Any advice or guidance in the right direction is greatly appreciated! :)

I recommend using SelectorGadget to determine which node to scrape: https://selectorgadget.com/
You use it by clicking on the information you want. Any other elements that the current selector would also match are highlighted in yellow. If you don't want those, click on them to turn them red. A small bar at the bottom of the screen shows the resulting selector; this is what you pass to html_nodes(). In this case, I got "._1b3ij9t+ div". Sure enough, this seems to work:
url1 %>%
html_nodes("._1b3ij9t+ div") %>%
html_text()
[1] "1 guest · 1 bedroom · 1 bed · 1 bathroom"
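To see why one selector can return character(0) while another works, here is a minimal, self-contained sketch. The HTML string is a made-up stand-in for the listing page (the real Airbnb markup is not reproduced here, and its generated class names may change at any time), so it only illustrates the selector logic.

```r
library(rvest)

# Hypothetical stand-in for the HTML that read_html() receives;
# the markup and text are assumptions for illustration only.
page <- read_html('<html><body>
  <div class="_1b3ij9t">Private room in apartment</div>
  <div>1 guest, 1 bedroom, 1 bed, 1 bathroom</div>
</body></html>')

# A class that is absent from the parsed HTML matches nothing,
# so html_text() returns character(0).
not_found <- page %>% html_nodes("._b2fuovg") %>% html_text()

# "A + div" selects the <div> that immediately follows element A.
found <- page %>% html_nodes("._1b3ij9t + div") %>% html_text()

length(not_found)
found
```

Checking the length of the result before using it is a cheap way to notice that a selector no longer matches anything.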

Related

Web Scraping with R - {xml_nodeset (0)}

I'm new to R and I'm trying to get data from this website: https://spritacular.org/gallery.
I want to get the location, time and the hour. I am following this guide, using the SelectorGadget I clicked on the elements I wanted (.card-title , .card-subtitle , .mb-0).
However, it always outputs {xml_nodeset (0)} and I'm not sure why it's not getting those elements.
This is the code I have:
url <- "https://spritacular.org/gallery"
sprite_gallery <- read_html(url)
sprite_location <- html_nodes(sprite_gallery, ".card-title , .card-subtitle , .mb-0")
sprite_location
When I point the same code at a different website it works, so I'm not sure what I'm doing wrong or how to fix it. This is my first time doing something like this and I appreciate any insight you may have!
As noted in the comments, this website builds its content with JavaScript, so the information only appears once the page is loaded in a browser. If you open the browser's developer tools and go to the Network tab, you can see the underlying JSON data.
If you send a GET request to this API address, you get back a list with all the results. From there, you can slice and dice your way to the information you need.
Here is one way to do this. I took the name of the user who submitted each image and found that the same user can submit multiple images. Hence there are duplicate names and locations in the output, but the image URL is different on each row. Refer to this blog to learn how to drill down into JSON data to make useful data frames in R.
library(httr)
library(tidyverse)
getURL <- 'https://api.spritacular.org/api/observation/gallery/?category=&country=&cursor=cD0xMTI%3D&format=json&page=1&status='
# get the raw json into R
UOM_json <- httr::GET(getURL) %>%
httr::content()
exp_output <- pluck(UOM_json, 'results') %>%
enframe() %>%
unnest_longer(value) %>%
unnest_wider(value) %>%
select(user_data, images) %>%
unnest_wider(user_data) %>%
mutate(full_name = paste(first_name, last_name)) %>%
select(full_name, location, images) %>%
rename(., location_user = location) %>%
unnest_longer(images) %>%
unnest_wider(images) %>%
select(full_name, location, image)
Output of our exp_output
> head(exp_output)
# A tibble: 6 × 3
  full_name     location                          image
  <chr>         <chr>                             <chr>
1 Kevin Palivec Jones County,Texas,United States  https://d1dzduvcvkxs60.cloudfront.net/observation_image/1d4cc82f-f3d2…
2 Kamil Świca   Lublin,Lublin Voivodeship,Poland  https://d1dzduvcvkxs60.cloudfront.net/observation_image/3b6391d1-f839…
3 Kamil Świca   Lublin,Lublin Voivodeship,Poland  https://d1dzduvcvkxs60.cloudfront.net/observation_image/9bcf10d7-bd7c…
4 Kamil Świca   Lublin,Lublin Voivodeship,Poland  https://d1dzduvcvkxs60.cloudfront.net/observation_image/a7dea9cf-8d6e…
5 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/539e0870-c931…
6 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/c729ea03-e1f8…
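The drill-down from nested JSON to one row per image can also be tried offline. The sketch below uses a small made-up JSON string that only mimics the shape described above (results, then user_data and images); the names and URLs in it are placeholders, not real API output, and jsonlite plus base R are swapped in for the tidyr pipeline to keep the example short.

```r
library(jsonlite)

# Made-up JSON mimicking the nesting results -> user_data / images;
# every value here is a placeholder.
json_txt <- '{"results": [
  {"user_data": {"first_name": "Ada", "last_name": "Example"},
   "images": [{"image": "https://example.com/a1.jpg"},
              {"image": "https://example.com/a2.jpg"}]},
  {"user_data": {"first_name": "Bo", "last_name": "Sample"},
   "images": [{"image": "https://example.com/b1.jpg"}]}
]}'

# Keep nested objects as plain lists instead of simplifying them
parsed <- fromJSON(json_txt, simplifyVector = FALSE)

# One row per image; the user's name is recycled across their images.
# This assumes every observation has at least one image.
rows <- lapply(parsed$results, function(obs) {
  data.frame(
    full_name = paste(obs$user_data$first_name, obs$user_data$last_name),
    image = vapply(obs$images, function(img) img$image, character(1)),
    stringsAsFactors = FALSE
  )
})
gallery <- do.call(rbind, rows)
gallery
```

As in the real output, a user with several images appears on several rows, each with a different image URL.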

What syntax error is causing this specific error message?

I am using R in RStudio and I have an R script which performs web scraping. I am stuck with an error message when running these specific lines:
review<-ta1 %>%
html_node("body") %>%
xml_find_all("//div[contains#class,'location-review-review']")
The error message is as follows:
xmlXPathEval: evaluation failed
Error in `*tmp*` - review : non-numeric argument to binary operator
In addition: Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Invalid predicate [1206]
Note: I have dplyr and rvest libraries loaded in my R script.
I had a look at the answers to the following question on StackOverflow:
Non-numeric Argument to Binary Operator Error
I have a feeling my solution relates to the answer provided by Richard Border to the question linked above. However, I am having a hard time trying to figure out how to correct my R syntax based on that answer.
Thank you for looking into my question.
Sample of ta1 added:
{xml_document}
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="content-type" content="text/html; charset=utf-8">\n<link rel="icon" id="favicon" ...
[2] <body class="rebrand_2017 desktop_web Hotel_Review js_logging" id="BODY_BLOCK_JQUERY_REFLOW" data-tab="TAB ...
I'm going to make a few assumptions here, since your post doesn't contain enough information to generate a reproducible example.
First, I'm guessing that you are trying to scrape TripAdvisor, since the id and class fields match for that site and since your variable is called ta1.
Secondly, I'm assuming that you are trying to get the text of each review and the number of stars for each, since that is the relevant scrape-able information in each of the classes you appear to be looking for.
I'll need to start by getting my own version of the ta1 variable, since that wasn't reproducible from your edited version.
library(httr)
library(rvest)
library(xml2)
library(magrittr)
library(tibble)
"https://www.tripadvisor.co.uk/" %>%
paste0("Hotel_Review-g186534-d192422-Reviews-") %>%
paste0("Glasgow_Marriott_Hotel-Glasgow_Scotland.html") -> url
ta1 <- url %>% GET %>% read_html
The error comes from the XPath predicate itself: contains() is a function, so its arguments must be wrapped in parentheses, and the attribute has to be referenced as @class. Now write the correct XPaths for the data of interest:
# xpath for elements whose text contains reviews
xpath1 <- "//div[contains(@class, 'location-review-review-list-parts-Expand')]"
# xpath for the elements whose class indicates the ratings
xpath2 <- "//div[contains(@class, 'location-review-review-')]"
xpath3 <- "/span[contains(@class, 'ui_bubble_rating bubble_')]"
We can get the text of the reviews like this:
ta1 %>%
xml_find_all(xpath1) %>% # run first query
html_text() %>% # extract text
extract(!equals(., "Read more")) -> reviews # remove "blank" reviews
And the associated star ratings like this:
ta1 %>%
xml_find_all(paste0(xpath2, xpath3)) %>%
xml_attr("class") %>%
strsplit("_") %>%
lapply(function(x) x[length(x)]) %>%
as.numeric %>%
divide_by(10) -> stars
And our result looks like this:
tibble(rating = stars, review = reviews)
## A tibble: 5 x 2
# rating review
# <dbl> <chr>
#1 1 7 of us attended the Christmas Party on Satu~
#2 4 "We stayed 2 nights over last weekend to att~
#3 3 Had a good stay, but had no provision to kee~
#4 3 Booked an overnight for a Christmas shopping~
#5 4 Attended a charity lunch here on Friday and ~
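The rating extraction can be checked without hitting the live site. The sketch below runs the same kind of XPath against a made-up HTML fragment; the class suffix "RatingLine" and the two ratings are assumptions chosen for illustration, not TripAdvisor's actual markup.

```r
library(xml2)

# Hypothetical fragment: each review block holds a <span> whose class
# encodes the rating (bubble_40 meaning 4.0 stars).
doc <- read_html('<html><body>
  <div class="location-review-review-list-parts-RatingLine">
    <span class="ui_bubble_rating bubble_40"></span>
  </div>
  <div class="location-review-review-list-parts-RatingLine">
    <span class="ui_bubble_rating bubble_15"></span>
  </div>
</body></html>')

# contains() takes its arguments in parentheses; @class is the attribute
xpath <- paste0(
  "//div[contains(@class, 'location-review-review-')]",
  "/span[contains(@class, 'ui_bubble_rating bubble_')]"
)

spans <- xml_find_all(doc, xpath)
classes <- xml_attr(spans, "class")

# Pull the trailing digits out of the class and scale them to stars
stars <- as.numeric(sub(".*bubble_([0-9]+)$", "\\1", classes)) / 10
stars
```

Dropping the parentheses after contains in this XPath should reproduce an invalid-predicate error like the one in the question.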

Extract text and numbers from web page using regex in R

I want to use R to extract text and numbers from the following page: https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=PA0261696&pgm_sys_acrnm_in=NPDES
Specifically, I want the NPDES SIC code and the description, which is 6515 and "Operators of residential mobile home sites" here.
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
test <- test %>% html_nodes("tr") %>% html_text()
# This extracts 31 lines of text; here is what my target text looks like:
# [10] "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n"
Ideally, I'd like the following: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"
How would I do this? I'm trying and failing at regex here even just trying to extract the number 6515 alone, which I thought would be easiest...
as.numeric(sub(".*?NPDES.*?(\\d{4}).*", "\\1", test))
# 4424
Any advice?
From what I can see, your information resides in a table, so it is probably easier to extract the table as a data frame directly. This works:
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
tables <- html_nodes(test, "table")
tables
SIC <- as.data.frame(html_table(tables[5], fill = TRUE))
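Once the table is a data frame, building the desired string is a matter of pasting two columns together. The sketch below uses a made-up table; the header names Source, SIC and Description and the second row are assumptions for illustration, and the real page's table index and headers may differ.

```r
library(rvest)

# Hypothetical table standing in for the one on the EPA page
page <- read_html('<html><body><table>
  <tr><th>Source</th><th>SIC</th><th>Description</th></tr>
  <tr><td>NPDES</td><td>6515</td><td>OPERATORS OF RESIDENTIAL MOBILE HOME SITES</td></tr>
  <tr><td>OTHER</td><td>4952</td><td>SEWERAGE SYSTEMS</td></tr>
</table></body></html>')

# Parse the first <table> into a data frame
sic <- html_table(html_nodes(page, "table")[[1]])

# Keep the NPDES row and join the code with its description
npdes <- sic[sic$Source == "NPDES", ]
label <- paste(npdes$SIC, npdes$Description)
label
```

Working on columns avoids the regex entirely, which matters here because the text rows also contain other digit sequences that a loose pattern can pick up.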

rvest read_html for a specific table

I am trying to scrape a web page in R. In the table of contents here:
https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc
I am interested in the
Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52
Depending on the document the page number can vary where these statements are.
I am trying to locate these documents using html_nodes() but I cannot seem to get it working. When I inspect the url I find the table at <div align="CENTER"> == $0 but I cannot find a table ID key.
url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"
dat <- url %>%
read_html() %>%
html_table(fill = TRUE)
Any push in the right direction would be great!
EDIT: I know of the finreportr and finstr packages but they are taking the XML documents and not all .HTML pages have XML documents - I also want to do this using the rvest package.
EDIT:
Something like the following works:
url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
html_table()
x <- population[[1]]
It's very messy, but it does get the cash flows table. The XPath changes depending on the webpage.
For example this one is different:
url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
html_table()
x <- population[[1]]
Is there a way to "search" for the "Cash Flows" table and somehow extract its XPath?
Some more links to try.
[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"
[2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"
[3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"
[5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"
[6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"
[8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"
[9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"
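One possible direction, offered as a sketch rather than a tested solution for these filings: select tables by the text they contain instead of by position, then ask xml2 for the path of whatever matched. The HTML below is made up, and it assumes the words "cash flows" appear inside the table itself; in real filings the heading may sit outside the table, and nested tables would make outer tables match as well.

```r
library(rvest)

# Hypothetical document with two statement tables
page <- read_html('<html><body>
  <table><tr><td>Consolidated Statement of Earnings</td></tr>
         <tr><td>Net sales</td><td>100</td></tr></table>
  <table><tr><td>Consolidated Statement of Cash Flows</td></tr>
         <tr><td>Net cash from operations</td><td>42</td></tr></table>
</body></html>')

# XPath 1.0 has no lower-case function, so translate() is used
# to make the text comparison case-insensitive.
upper <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lower <- "abcdefghijklmnopqrstuvwxyz"
xpath <- sprintf(
  "//table[contains(translate(., '%s', '%s'), 'cash flows')]",
  upper, lower
)

hits <- html_nodes(page, xpath = xpath)

# Number of matching tables and the path of the first match
length(hits)
found_path <- xml2::xml_path(hits[[1]])
found_path
```

The matched node can then be passed to html_table() in place of a hard-coded div[623]-style path.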

Web-scraping in R

I am practicing web scraping in R and I get stuck at the same step no matter what website I try.
For example,
https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music
My goal is to extract the names of all 77 schools (Oxford to London Metropolitan).
So I tried...
library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()
Using F12, I found that all the school names are under the class '.league-table-institution-name', which is why I used that in html_nodes().
What have I done wrong?
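You appear to be running html_nodes() twice: first on college, an xml_document (which is correct), and then on info, which is already the set of nodes you selected. The second call looks for matching elements inside those nodes, finds none, and so returns nothing.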
Try this instead:
url_college %>%
read_html() %>%
html_nodes('.league-table-institution-name') %>%
html_text()
You'll then need an additional step to clean up the school names; this one, which uses str_replace_all() from the stringr package, was suggested:
%>%
str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
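Both points can be reproduced on a small made-up page. In the sketch below the markup and the rank prefixes are assumptions for illustration, and base R's gsub() stands in for str_replace_all() so that only rvest is needed.

```r
library(rvest)

# Hypothetical fragment of the league table
page <- read_html('<html><body>
  <a class="league-table-institution-name">
      1. Oxford
  </a>
  <a class="league-table-institution-name">
      77. London Metropolitan
  </a>
</body></html>')

# First selection: the nodes we want
info <- html_nodes(page, ".league-table-institution-name")

# Selecting again searches inside those nodes and finds nothing
nested <- html_nodes(info, ".league-table-institution-name")
length(nested)

# Extract the text once, then trim leading and trailing non-letters
raw_names <- html_text(info)
clean_names <- gsub("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "", raw_names)
clean_names
```

Note that this pattern would also strip legitimate leading or trailing characters that are not plain ASCII letters, so it may need adjusting for names that start or end with accented letters or digits.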
