I am using R in RStudio and I have an R script which performs web scraping. I am stuck with an error message when running these specific lines:
review<-ta1 %>%
html_node("body") %>%
xml_find_all("//div[contains#class,'location-review-review']")
The error message is as follows:
xmlXPathEval: evaluation failed
Error in `*tmp*` - review : non-numeric argument to binary operator
In addition: Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Invalid predicate [1206]
Note: I have the dplyr and rvest libraries loaded in my R script.
I had a look at the answers to the following question on StackOverflow:
Non-numeric Argument to Binary Operator Error
I have a feeling my solution relates to the answer provided by Richard Border to the question linked above. However, I am having a hard time trying to figure out how to correct my R syntax based on that answer.
Thank you for looking into my question.
Sample of ta1 added:
{xml_document}
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="content-type" content="text/html; charset=utf-8">\n<link rel="icon" id="favicon" ...
[2] <body class="rebrand_2017 desktop_web Hotel_Review js_logging" id="BODY_BLOCK_JQUERY_REFLOW" data-tab="TAB ...
I'm going to make a few assumptions here, since your post doesn't contain enough information to generate a reproducible example.
First, I'm guessing that you are trying to scrape TripAdvisor, since the id and class fields match for that site and since your variable is called ta1.
Secondly, I'm assuming that you are trying to get the text of each review and the number of stars for each, since that is the relevant scrape-able information in each of the classes you appear to be looking for.
I'll need to start by getting my own version of the ta1 variable, since that wasn't reproducible from your edited version.
library(httr)
library(rvest)
library(xml2)
library(magrittr)
library(tibble)
"https://www.tripadvisor.co.uk/" %>%
paste0("Hotel_Review-g186534-d192422-Reviews-") %>%
paste0("Glasgow_Marriott_Hotel-Glasgow_Scotland.html") -> url
ta1 <- url %>% GET %>% read_html
Now we can write valid XPaths for the data of interest. (XPath attribute tests use @class, and contains() needs its arguments in parentheses; the contains#class predicate in your original query is what produced the "Invalid predicate" warning.)
# xpath for elements whose text contains reviews
xpath1 <- "//div[contains(@class, 'location-review-review-list-parts-Expand')]"
# xpath for the elements whose class indicate the ratings
xpath2 <- "//div[contains(@class, 'location-review-review-')]"
xpath3 <- "/span[contains(@class, 'ui_bubble_rating bubble_')]"
We can get the text of the reviews like this:
ta1 %>%
xml_find_all(xpath1) %>% # run first query
html_text() %>% # extract text
extract(!equals(., "Read more")) -> reviews # remove "blank" reviews
And the associated star ratings like this:
ta1 %>%
xml_find_all(paste0(xpath2, xpath3)) %>%
xml_attr("class") %>%
strsplit("_") %>%
lapply(function(x) x[length(x)]) %>%
as.numeric %>%
divide_by(10) -> stars
And our result looks like this:
tibble(rating = stars, review = reviews)
# A tibble: 5 x 2
# rating review
# <dbl> <chr>
#1 1 7 of us attended the Christmas Party on Satu~
#2 4 "We stayed 2 nights over last weekend to att~
#3 3 Had a good stay, but had no provision to kee~
#4 3 Booked an overnight for a Christmas shopping~
#5 4 Attended a charity lunch here on Friday and ~
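For completeness, your original pipeline also runs once the predicate syntax is fixed (arguments in parentheses and @class instead of #class); a sketch, assuming the class substring from your query still matches nodes on the page:
review <- ta1 %>%
  html_node("body") %>%
  xml_find_all("//div[contains(@class, 'location-review-review')]")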
I'm new to R and I'm trying to get data from this website: https://spritacular.org/gallery.
I want to get the location, the time and the hour. I am following this guide; using the SelectorGadget, I clicked on the elements I wanted (.card-title, .card-subtitle, .mb-0).
However, it always outputs {xml_nodeset (0)} and I'm not sure why it's not getting those elements.
This is the code I have:
url <- "https://spritacular.org/gallery"
sprite_gallery <- read_html(url)
sprite_location <- html_nodes(sprite_gallery, ".card-title , .card-subtitle , .mb-0")
sprite_location
When I change the website and grab something from a different site it works, so I'm not sure what I'm doing wrong here or how to fix it. This is my first time doing something like this, and I appreciate any insight you may have!
As per the comment, this website renders its content with JavaScript, so the information only appears once a browser has executed it. If you go to the developer tools and the Network tab, you can see the underlying JSON data.
If you send a GET request to this API address, you will get a list back with all the results. From there, you can slice and dice your way to the information you need.
One way to do this: I used the name of the user who submitted each image and found that the same user has submitted multiple images, hence there are duplicate names and locations in the output but the image URLs differ. Refer to this blog to learn how to drill down into JSON data to build useful data frames in R.
library(httr)
library(tidyverse)
getURL <- 'https://api.spritacular.org/api/observation/gallery/?category=&country=&cursor=cD0xMTI%3D&format=json&page=1&status='
# get the raw json into R
UOM_json <- httr::GET(getURL) %>%
httr::content()
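# flatten the nested JSON into one row per image, keeping each submitter's name and location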
exp_output <- pluck(UOM_json, 'results') %>%
enframe() %>%
unnest_longer(value) %>%
unnest_wider(value) %>%
select(user_data, images) %>%
unnest_wider(user_data) %>%
mutate(full_name = paste(first_name, last_name)) %>%
select(full_name, location, images) %>%
rename(., location_user = location) %>%
unnest_longer(images) %>%
unnest_wider(images) %>%
select(full_name, location, image)
Output of exp_output:
> head(exp_output)
# A tibble: 6 × 3
full_name location image
<chr> <chr> <chr>
1 Kevin Palivec Jones County,Texas,United States https://d1dzduvcvkxs60.cloudfront.net/observation_image/1d4cc82f-f3d2…
2 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/3b6391d1-f839…
3 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/9bcf10d7-bd7c…
4 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/a7dea9cf-8d6e…
5 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/539e0870-c931…
6 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/c729ea03-e1f8…
I was presented a problem at work and am trying to think and work my way through it. However, I am very new to web scraping and need some help, or just good starting points.
I have a website from the education commission.
http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA
This site contains 50 tables, one for each state, with two columns in a question / answer format. My first attempt has been this...
library(tidyverse)
library(httr)
library(XML)
tibble(url = "http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA") %>%
mutate(get_data = map(.x = url,
~GET(.x))) %>%
mutate(list_data = map(.x = get_data,
~readHTMLTable(doc=content(.x, "text")))) %>%
pull(list_data)
My first thought was to create multiple dataframes, one for each state, in a list format.
This idea does not seem to have worked as anticipated. I was expecting a list of 50 responses, but it seems like a list of one. It appears that this one response read every line but did not differentiate one table from the next. I'm confused about next steps; anyone have any ideas? Web scraping is odd to me.
My second attempt was to copy and paste each table into R as a tribble, one state at a time. This sort of worked, but not every column is formatted the same way. I attempted to use tidyr::separate() to break the columns up on "\t", and that worked for some columns but not all.
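Roughly, that second attempt looked like this (the rows below are made-up placeholders for a single state; only the separate() call reflects what I actually tried):
library(tibble)
library(tidyr)
alabama <- tribble(
  ~raw,
  "Statewide policy in place\tYes",
  "Definition or title of program\tDual Enrollment"
)
separate(alabama, raw, into = c("question", "answer"), sep = "\t")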
Any help on this problem, or even just pointers on where to look to learn more about web scraping, would be very helpful. This did not seem all that difficult at first, but it seems there are a couple of things I am missing. Maybe rvest? I have never used it, but I know it is common in web scraping work.
Thanks in advance!
As you already guessed, rvest is a very good choice for web scraping. Using rvest you can get the table from your desired website in just two steps, and with some additional data wrangling it can be transformed into a nice data frame.
library(rvest)
#> Loading required package: xml2
library(tidyverse)
html <- read_html("http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA")
df <- html %>%
html_table(fill = TRUE, header = FALSE) %>%
.[[1]] %>%
# Remove empty rows and rows containing the table header
filter(!(X1 == "" & X2 == ""), !(grepl("^Dual", X1) & grepl("^Dual", X2))) %>%
# Create state column
mutate(is_state = X1 == X2, state = ifelse(is_state, X1, NA_character_)) %>%
fill(state) %>%
filter(!is_state) %>%
select(-is_state)
head(df, 2)
#> X1
#> 1 Statewide policy in place
#> 2 Definition or title of program
#> X2
#> 1 Yes
#> 2 Dual Enrollment – Postsecondary Institutions. High school students are allowed to take college courses for credit either at a high school or on a college campus.
#> state
#> 1 Alabama
#> 2 Alabama
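If you would rather have one row per state with each question as a column, tidyr (already attached via the tidyverse) can reshape the result; a sketch, assuming each question appears at most once per state (if not, pivot_wider() warns and returns list-columns):
df_wide <- df %>%
  pivot_wider(names_from = X1, values_from = X2)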
Hello to all professionals out here,
I have created a CSV which consists of cities and the corresponding Tripadvisor_Urls. If I now search for a specific link in my list, for example Munich as shown here, the subset function returns the URL. I then try to read this URL, which is stored under search_url, using read_html, unfortunately without success.
The relevant part of my code is the following.
search_url <- subset(data, city %in% "München", select = url)
pages <- read_html(search_url)
pages <- pages %>%
html_nodes("._15_ydu6b") %>%
html_attr('href')
When I run search_url I get the following output:
https://www.tripadvisor.de/Restaurants-g187323-Berlin.html
But when I use the above code and want to execute read_html, the following error occurs:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "data.frame"
I have now spent several hours on this, but unfortunately have not found a suitable tip anywhere. It would be wonderful if you could help me here.
That's because the result of subset() is a data frame here, although the real result is simply one string. Check this simple example with mtcars:
# this will be data.frame although the result is one numeric value 21.4
class(subset(mtcars, disp == 258, select = mpg))
# [1] "data.frame"
So you can probably use
pages <- read_html(as.character(search_url))
if you are sure that your subset returns only 1 character value, otherwise
pages <- read_html(search_url[1, 1])
should work as well for the first result of your subset.
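Another option is to pull the column out as a plain character vector before calling read_html(); a sketch, assuming your data frame is called data and has columns city and url as in your question:
library(dplyr)
library(rvest)
search_url <- data %>%
  filter(city %in% "München") %>%
  pull(url)   # a character vector, not a data frame
pages <- read_html(search_url[1]) %>%
  html_nodes("._15_ydu6b") %>%
  html_attr("href")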
I'm working on an undergraduate project for which I am required to web scrape the following data from multiple Airbnb listings.
Here is an example:
https://www.airbnb.com.sg/rooms/49091?_set_bev_on_new_domain=1582777903_ZWE4MTBjMGNmYmFh&source_impression_id=p3_1582778001_lB%2BjT8%2BWgIsL%2FrBV
The data I am required to scrape is the "1 guest, 1 bedroom, 1 bed, 1 bathroom" summary.
However, when I use the CSS selector tool, the path it gives me is "._b2fuovg".
This returns character(0) when I run the following code.
library(rvest)
library(dplyr)
url1 <- read_html("https://www.airbnb.com.sg/rooms/49091?_set_bev_on_new_domain=1582777903_ZWE4MTBjMGNmYmFh&source_impression_id=p3_1582778001_lB%2BjT8%2BWgIsL%2FrBV")
url1 %>%
html_nodes("._b2fuovg") %>%
html_text()
and the following output is
> url1 %>%
+ html_nodes("._b2fuovg") %>%
+ html_text()
character(0)
Any advice or guidance in the right direction is greatly appreciated! :)
I recommend using the Selector Gadget to determine what node to scrape: https://selectorgadget.com/
It works by clicking on the information you want; anything else that would also be captured is shown in yellow, and if you don't want it, click on it to turn it red. At the bottom of your screen you will notice a little bar with some text: that is what you want to pass to html_nodes(). In this case, I got "._1b3ij9t+ div". Sure enough, this seems to work:
url1 %>%
html_nodes("._1b3ij9t+ div") %>%
html_text()
[1] "1 guest · 1 bedroom · 1 bed · 1 bathroom"
I'm attempting to extract data from a PDF located at https://www.dol.gov/ui/data.pdf. The data I'm interested in are on page 4 of the PDF: the 3 observations of Initial Claims (NSA), the 3 observations of Insured Unemployment (NSA), and the covered employment figure used for the most recent week (footnote 2).
I've read the PDF into R using pdftools, but the text output which is generated is quite ugly (kind of to be expected - due to the nature of PDFs). Is there any way I can extract specific data from this text output? I believe the data will always be in the same place in the output, which is helpful.
The output I'm looking at can be seen with the following script:
library(pdftools)
download.file("https://www.dol.gov/ui/data.pdf", "data.pdf", mode="wb")
uidata <- pdf_text("data.pdf")
uidata[4]
I've searched for people with similar questions and fiddled around with scan() and grep(), but can't seem to figure out a way to isolate and extract the data I need from the text output. Thanks in advance if anyone stumbles upon this and can point me in the right direction; if not, I'll keep trying to figure it out!
With grep and a little regex, you can get everything you need into a usable structure:
library(magrittr)
x <- pdftools::pdf_text('https://www.dol.gov/ui/data.pdf')
x2 <- readLines(textConnection(x[4]))
r <- grep('WEEK ENDING', x2)
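# split the text into one chunk per 'WEEK ENDING' table, collapse runs of
# whitespace into ';' separators, and parse each chunk as a small data frame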
l <- lapply(seq_along(r), function(i){
x2[r[i]:(na.omit(c(r[i + 1], grep('FOOTNOTE', x2)))[1] - 1)] %>%
trimws() %>%
gsub('\\s{2,}', ';', .) %>%
paste(collapse = '\n') %>%
read.csv2(text = ., dec = '.')
})
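# footnote 2 holds the covered employment figure; drop the leading '2' and every non-digit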
from_footnote <- as.numeric(gsub('^2|\\D', '', x2[grep('2\\.', x2)]))
l[[1]][3,]
#> WEEK.ENDING December.17 December.10 Change
#> Initial Claims (NSA) 315,613 305,333 +10,280 352,534
#> December.3
#> Initial Claims (NSA) 319,641
from_footnote
#> [1] 138322138
You'll still need to parse the numbers, but at least it's usable.
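For example, to coerce the comma-formatted counts in the first table to numeric (a sketch, assuming the parsed columns come through as character; anything that isn't really a number becomes NA with a warning):
claims <- l[[1]]
claims[] <- lapply(claims, function(col) as.numeric(gsub("[+,]", "", col)))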