Web Scraping with R - {xml_nodeset (0)}

I'm new to R and I'm trying to get data from this website: https://spritacular.org/gallery.
I want to get the location, time and the hour. I am following this guide, using the SelectorGadget I clicked on the elements I wanted (.card-title , .card-subtitle , .mb-0).
However, it always outputs {xml_nodeset (0)} and I'm not sure why it's not getting those elements.
This is the code I have:
url <- "https://spritacular.org/gallery"
sprite_gallery <- read_html(url)
sprite_location <- html_nodes(sprite_gallery, ".card-title , .card-subtitle , .mb-0")
sprite_location
When I point the same code at a different website it works, so I'm not sure what I'm doing wrong or how to fix it. This is my first time doing something like this, and I appreciate any insight you may have!

As noted in the comments, this page is rendered with JavaScript, so the gallery content only appears once the page runs in a browser. read_html() sees just the initial HTML, which is why the selectors match nothing. If you open the browser's developer tools and look at the Network tab, you can see the underlying JSON data.
If you send a GET request to this API address, you get back a list with all the results. From there, you can slice and dice your way to the information you need.
One way to do this is shown below. I included the name of the user who submitted each image and found that the same user has often submitted multiple images, so names and locations repeat in the output while the image URL differs. Refer to this blog to learn how to drill down into JSON data and build useful data frames in R.
library(httr)
library(tidyverse)
getURL <- 'https://api.spritacular.org/api/observation/gallery/?category=&country=&cursor=cD0xMTI%3D&format=json&page=1&status='
# get the raw json into R
UOM_json <- httr::GET(getURL) %>%
httr::content()
exp_output <- pluck(UOM_json, 'results') %>%
enframe() %>%
unnest_longer(value) %>%
unnest_wider(value) %>%
select(user_data, images) %>%
unnest_wider(user_data) %>%
mutate(full_name = paste(first_name, last_name)) %>%
select(full_name, location, images) %>%
rename(., location_user = location) %>%
unnest_longer(images) %>%
unnest_wider(images) %>%
select(full_name, location, image)
Output of our exp_output
> head(exp_output)
# A tibble: 6 × 3
full_name location image
<chr> <chr> <chr>
1 Kevin Palivec Jones County,Texas,United States https://d1dzduvcvkxs60.cloudfront.net/observation_image/1d4cc82f-f3d2…
2 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/3b6391d1-f839…
3 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/9bcf10d7-bd7c…
4 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/a7dea9cf-8d6e…
5 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/539e0870-c931…
6 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/c729ea03-e1f8…
>
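The same drill-down can also be sketched without a network call. The payload below is a made-up stand-in shaped like the `results` element used above (the field names `user_data`, `first_name`, `last_name`, `images`, `location` and `image` follow the answer's code; every value is invented), and the flattening uses jsonlite and base R rather than the tidyr verbs from the answer.

```r
library(jsonlite)

# Toy payload shaped like the API's `results` element; all values are made up.
toy_json <- '{
  "results": [
    {"user_data": {"first_name": "Ann", "last_name": "Lee"},
     "images": [
       {"location": "Town A", "image": "https://example.com/1.jpg"},
       {"location": "Town A", "image": "https://example.com/2.jpg"}
     ]},
    {"user_data": {"first_name": "Bo", "last_name": "Kim"},
     "images": [
       {"location": "Town B", "image": "https://example.com/3.jpg"}
     ]}
  ]
}'

# simplifyVector = FALSE keeps the JSON as nested lists,
# similar to what httr::content() returns for a JSON response.
parsed <- fromJSON(toy_json, simplifyVector = FALSE)

# Build one data frame per observation (one row per image), then stack them.
rows <- lapply(parsed$results, function(obs) {
  data.frame(
    full_name = paste(obs$user_data$first_name, obs$user_data$last_name),
    location  = vapply(obs$images, function(img) img$location, character(1)),
    image     = vapply(obs$images, function(img) img$image, character(1)),
    stringsAsFactors = FALSE
  )
})
gallery <- do.call(rbind, rows)
```

As in the real output, a user with several images appears on several rows, once per image URL.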

Related

How can I scrape the complete dataset from Yahoo Finance with rvest

I'm trying to get the complete historical data set for Bitcoin from Yahoo Finance via web scraping. This is my first attempt:
library(rvest)
library(tidyverse)
crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url,css = "table")
cryp_table <- html_table(cryp_table,fill = T) %>%
as.data.frame()
In the link that I pass to read_html(), a long period of time is already selected. However, it only gets the first 101 rows, and the last row is the loading message that you get when you keep scrolling. This is my second attempt, but I get the same result:
col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <-
col_page %>%
html_nodes(xpath = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>%
html_table(fill = T)
cryp_final <- cryp_table[[1]]
How can I get the whole dataset?
I think you can use the download link instead. If you look at the Network tab, you can see it; in this case:
"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"
This link closely resembles the URL of the page, so we can modify the page URL to get the download link and read the CSV. See the code:
library(stringr)
library(magrittr)
site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"
download_link <- site %>%
stringr::str_remove_all(".+(?<=quote/)|/history?|&frequency=1d") %>%
stringr::str_replace("filter", "events") %>%
stringr::str_c(base_download, .)
readr::read_csv(download_link)
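An alternative to editing the page URL with regular expressions is to assemble the download link from its parts. The sketch below uses base R only; `build_download_url()` is a hypothetical helper, not part of any package, and it assumes that `period1` and `period2` are Unix timestamps at midnight UTC (which matches the values in the link above). It only builds the string and does not check whether the endpoint still answers such requests.

```r
# Hypothetical helper: builds the CSV download link from a symbol and two dates.
build_download_url <- function(symbol, from, to, interval = "1d") {
  # Convert a date (string or Date) to seconds since 1970-01-01 UTC, as text
  to_epoch <- function(d) sprintf("%.0f", as.numeric(as.POSIXct(d, tz = "UTC")))
  paste0(
    "https://query1.finance.yahoo.com/v7/finance/download/", symbol,
    "?period1=", to_epoch(from),
    "&period2=", to_epoch(to),
    "&interval=", interval,
    "&events=history&includeAdjustedClose=true"
  )
}

download_link <- build_download_url("BTC-USD", "2016-11-30", "2021-11-30")
```

The resulting string can be passed to readr::read_csv() in the same way as before, and changing the symbol or dates no longer depends on the shape of the page URL.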

Web Scraping Education Data in R

Was presented a problem at work and am trying to think / work my way through it. However, I am very new at web scraping, and need some help, or just good starting points, on web scraping.
I have a website from the education commission.
http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA
This site contains 50 tables, one for each state, with two columns in a question / answer format. My first attempt has been this...
library(tidyverse)
library(httr)
library(XML)
tibble(url = "http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA") %>%
mutate(get_data = map(.x = url,
~GET(.x))) %>%
mutate(list_data = map(.x = get_data,
~readHTMLTable(doc=content(.x, "text")))) %>%
pull(list_data)
My first thought was to create multiple dataframes, one for each state, in a list format.
This idea does not seem to have worked as anticipated. I was expecting a list of 50 tables, but I seem to get a list with one response rather than 50. It appears that this one response read each line but did not differentiate one table from the next. I'm confused about next steps; does anyone have any ideas? Web scraping is odd to me.
Second attempt was to copy and paste the table into R as a tribble, one state at a time. This sort of worked, but not every column is formatted the same way. Attempted to use tidyr::separate() to break up the columns by "/t" and that worked for some columns, but not all.
Any help on this problem, or even just where to look to learn more about web scraping, would be very helpful. This did not seem all that difficult at first, but it seems there are a couple of things I am missing. Maybe rvest? I have never used it, but I know it is common for web scraping.
Thanks in advance!
As you already guessed, rvest is a very good choice for web scraping. Using rvest, you can get the table from your desired website in just two steps. With some additional data wrangling, this can be transformed into a nice data frame.
library(rvest)
#> Loading required package: xml2
library(tidyverse)
html <- read_html("http://ecs.force.com/mbdata/mbprofgroupall?Rep=DEA")
df <- html %>%
html_table(fill = TRUE, header = FALSE) %>%
.[[1]] %>%
# Remove empty rows and rows containing the table header
filter(!(X1 == "" & X2 == ""), !(grepl("^Dual", X1) & grepl("^Dual", X2))) %>%
# Create state column
mutate(is_state = X1 == X2, state = ifelse(is_state, X1, NA_character_)) %>%
fill(state) %>%
filter(!is_state) %>%
select(-is_state)
head(df, 2)
#> X1
#> 1 Statewide policy in place
#> 2 Definition or title of program
#> X2
#> 1 Yes
#> 2 Dual Enrollment – Postsecondary Institutions. High school students are allowed to take college courses for credit either at a high school or on a college campus.
#> state
#> 1 Alabama
#> 2 Alabama
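The key step above is the fill-down: rows where both columns hold the same text are treated as state headers, and that value is carried down to the question/answer rows that follow. A minimal offline sketch of that step on a made-up two-column table (the state names and answers are placeholders, not data from the site), with an optional pivot_wider() at the end to get one row per state:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the scraped table; header rows repeat the state in both columns.
raw <- data.frame(
  X1 = c("State A", "Statewide policy in place", "Definition or title of program",
         "State B", "Statewide policy in place", "Definition or title of program"),
  X2 = c("State A", "Yes", "Program A",
         "State B", "No", "Program B"),
  stringsAsFactors = FALSE
)

long <- raw %>%
  # Mark header rows and copy the state name into its own column
  mutate(is_state = X1 == X2, state = ifelse(is_state, X1, NA_character_)) %>%
  # Carry each state name down to the rows that belong to it
  fill(state) %>%
  # Drop the header rows themselves
  filter(!is_state) %>%
  select(state, question = X1, answer = X2)

# Optional: one row per state, one column per question
wide <- pivot_wider(long, names_from = question, values_from = answer)
```

The wide form assumes each question appears at most once per state; if a question repeats within a state, pivot_wider() returns list-columns instead.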

Rstudio Webscraping - Rvest returns character(0)

I'm working on an undergraduate project in which I am required to scrape the following data from multiple Airbnb listings.
Here is an example:
https://www.airbnb.com.sg/rooms/49091?_set_bev_on_new_domain=1582777903_ZWE4MTBjMGNmYmFh&source_impression_id=p3_1582778001_lB%2BjT8%2BWgIsL%2FrBV
The data I am required to scrape is: 1 guest, 1 bedroom, 1 bed, 1 bathroom.
However, when I use the CSS selector tool, the path I get is "._b2fuovg".
This returns character(0) when I run the following code.
library(rvest)
library(dplyr)
url1 <- read_html("https://www.airbnb.com.sg/rooms/49091?_set_bev_on_new_domain=1582777903_ZWE4MTBjMGNmYmFh&source_impression_id=p3_1582778001_lB%2BjT8%2BWgIsL%2FrBV")
url1 %>%
html_nodes("._b2fuovg") %>%
html_text()
and the following output is
> url1 %>%
+ html_nodes("._b2fuovg") %>%
+ html_text()
character(0)
Any advice or guidance in the right direction is greatly appreciated! :)
I recommend using the Selector Gadget to determine what node to scrape: https://selectorgadget.com/
It works by clicking on the information you want. Other information that will also be included will be shown in yellow. If you don't want those, click on them to turn them red. You will notice at the bottom of your screen a little bar with some text. This is what you want to include in html_nodes(). In this case, I got "._1b3ij9t+ div". Sure enough, this seems to work:
url1 %>%
html_nodes("._1b3ij9t+ div") %>%
html_text()
[1] "1 guest · 1 bedroom · 1 bed · 1 bathroom"
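Once the text is extracted, it still has to be split into the four separate values. A minimal base R sketch, assuming the string always has the "count label" pattern separated by middle dots as in the output above (the variable names are made up for illustration):

```r
# The scraped string, with the middle dot written as a Unicode escape
listing_text <- "1 guest \u00b7 1 bedroom \u00b7 1 bed \u00b7 1 bathroom"

# Split on the middle dot and trim the surrounding spaces
parts <- trimws(strsplit(listing_text, "\u00b7", fixed = TRUE)[[1]])

# Separate the leading number from the label in each part
counts <- as.integer(sub(" .*$", "", parts))
labels <- sub("^[0-9]+ ", "", parts)

listing_info <- setNames(counts, labels)
```

Listings with plural labels ("2 guests") or fractional counts ("1.5 bathrooms") would need a looser pattern than this sketch uses.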

rvest read_html for a specific table

I am trying to scrape a web page in R. In the table of contents here:
https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc
I am interested in the
Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52
Depending on the document the page number can vary where these statements are.
I am trying to locate these documents using html_nodes() but I cannot seem to get it working. When I inspect the url I find the table at <div align="CENTER"> == $0 but I cannot find a table ID key.
url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"
dat <- url %>%
read_html() %>%
html_table(fill = TRUE)
Any push in the right direction would be great!
EDIT: I know of the finreportr and finstr packages but they are taking the XML documents and not all .HTML pages have XML documents - I also want to do this using the rvest package.
EDIT:
Something like the following works:
url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
html_table()
x <- population[[1]]
It's very messy, but it does get the cash flows table. The XPath changes depending on the web page.
For example this one is different:
url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
html_table()
x <- population[[1]]
Is there a way to "search" for the "Cash Flows" table and somehow extract its XPath?
Some more links to try.
[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"
[2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"
[3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"
[5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"
[6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"
[8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"
[9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"
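One possible direction is to select every table and keep only those whose text mentions the statement you are after, instead of relying on a fixed XPath. The sketch below runs on a small made-up HTML string, not on a real filing, and it assumes the statement title sits inside the table itself. In actual filings the title is often in a separate element before the table, so this would likely need adapting.

```r
library(rvest)

# Made-up page with two tables; only the second mentions cash flows.
toy_html <- '<html><body>
<table>
  <tr><td>Consolidated Balance Sheet</td><td>2014</td></tr>
  <tr><td>Total assets</td><td>100</td></tr>
</table>
<table>
  <tr><td>Consolidated Statement of Cash Flows</td><td>2014</td></tr>
  <tr><td>Net cash from operations</td><td>42</td></tr>
</table>
</body></html>'

doc <- read_html(toy_html)

# Grab all tables, then test each table's text for the phrase of interest
tables <- html_nodes(doc, "table")
is_cash_flow <- grepl("cash flows", html_text(tables), ignore.case = TRUE)

# Parse only the matching tables into data frames
cash_flow_tables <- html_table(tables[is_cash_flow])
```

If the XPath of a match is needed, xml2::xml_path() can be called on `tables[is_cash_flow]`.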

Web-scraping in R

I am practicing web scraping in R, and I cannot get past one step no matter what website I try.
For example,
https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music
My goal is to extract the names of all 77 schools (Oxford to London Metropolitan).
So I tried...
library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()
Using F12, I found that all the school names are under the class '.league-table-institution-name', which is why I used it in html_nodes().
What have I done wrong?
You appear to be running html_nodes() twice: first on college, an xml_document (which is correct), and then on info, which is already an xml_nodeset of the selected nodes rather than the document, so the second call is not what you want.
Try this instead:
url_college %>%
read_html() %>%
html_nodes('.league-table-institution-name') %>%
html_text()
and then you'll need an additional step to clean up the school names; this one was suggested:
%>%
str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
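To see what that pattern does, here is a small offline sketch using base R's gsub() with the same regular expression, applied to made-up strings that imitate the whitespace and rank numbers that can surround a scraped name. The sample strings are illustrative, not taken from the site.

```r
# Made-up examples of raw scraped text: padding, newlines and rank numbers
raw_names <- c("\n  1 University of Oxford \n",
               "77  London Metropolitan University\t")

# Remove any run of non-letters at the very start or the very end of each string
clean_names <- gsub("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "", raw_names)
```

Spaces inside a name are kept because the pattern is anchored to the start and end of the string. A name that legitimately ends in a non-letter, such as a closing parenthesis, would lose that character.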
