rvest read_html for a specific table - r

I am trying to scrape a web page in R. In the table of contents here:
https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc
I am interested in:
Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52
The page on which these statements appear varies from document to document.
I am trying to locate these tables using html_nodes(), but I cannot seem to get it working. When I inspect the page in the browser I find the table inside a <div align="CENTER">, but I cannot find a table ID to select on.
url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"
dat <- url %>%
read_html() %>%
html_table(fill = TRUE)
Any push in the right direction would be great!
EDIT: I know of the finreportr and finstr packages, but they work from the XML documents, and not all filings have XML documents. I also want to do this using the rvest package.
EDIT:
Something like the following works:
url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
population <- url %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
  html_table()
x <- population[[1]]
It's very messy, but it does get the cash flow table. The XPath changes depending on the webpage.
For example, this one is different:
url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"
population <- url %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
  html_table()
x <- population[[1]]
Is there a way to "search" for the "Cash Flow" table and somehow extract its XPath? (One possible approach is sketched after the links below.)
Some more links to try:
[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"
[2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"
[3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"
[5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"
[6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"
[8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"
[9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"

Related

Web Scraping with R - {xml_nodeset (0)}

I'm new to R and I'm trying to get data from this website: https://spritacular.org/gallery.
I want to get the location, the time, and the hour. I am following a guide: using SelectorGadget, I clicked on the elements I wanted (.card-title, .card-subtitle, .mb-0).
However, it always outputs {xml_nodeset (0)} and I'm not sure why it's not getting those elements.
This is the code I have:
url <- "https://spritacular.org/gallery"
sprite_gallery <- read_html(url)
sprite_location <- html_nodes(sprite_gallery, ".card-title , .card-subtitle , .mb-0")
sprite_location
When I point the same code at a different website it works, so I'm not sure what I'm doing wrong or how to fix it. This is my first time doing something like this, and I appreciate any insight you may have!
As noted in the comments, this website renders its content with JavaScript, so the information only appears once a browser has executed the page. If you open the developer tools and go to the Network tab, you can see the underlying JSON data.
If you send a GET request to this API address, you will get a list back with all the results. From there, you can slice and dice your way to the information you need.
One way to do this: I keyed on the name of the user who submitted each image, and it turns out the same user has submitted multiple images. Hence there are duplicate names and locations in the output, but the image URL differs. Refer to this blog to see how to drill down into the JSON data to build useful data frames in R.
library(httr)
library(tidyverse)

getURL <- 'https://api.spritacular.org/api/observation/gallery/?category=&country=&cursor=cD0xMTI%3D&format=json&page=1&status='

# get the raw JSON into R
UOM_json <- httr::GET(getURL) %>%
  httr::content()

# flatten the nested results: one row per image, keeping the submitter's name
exp_output <- pluck(UOM_json, 'results') %>%
  enframe() %>%
  unnest_longer(value) %>%
  unnest_wider(value) %>%
  select(user_data, images) %>%
  unnest_wider(user_data) %>%
  mutate(full_name = paste(first_name, last_name)) %>%
  select(full_name, location, images) %>%
  rename(location_user = location) %>%
  unnest_longer(images) %>%
  unnest_wider(images) %>%
  select(full_name, location, image)
Output of exp_output:
> head(exp_output)
# A tibble: 6 × 3
full_name location image
<chr> <chr> <chr>
1 Kevin Palivec Jones County,Texas,United States https://d1dzduvcvkxs60.cloudfront.net/observation_image/1d4cc82f-f3d2…
2 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/3b6391d1-f839…
3 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/9bcf10d7-bd7c…
4 Kamil Świca Lublin,Lublin Voivodeship,Poland https://d1dzduvcvkxs60.cloudfront.net/observation_image/a7dea9cf-8d6e…
5 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/539e0870-c931…
6 Evelyn Lapeña Bulacan,Central Luzon,Philippines https://d1dzduvcvkxs60.cloudfront.net/observation_image/c729ea03-e1f8…

How to retrieve hyperlinks in google search using rvest

I am using rvest to get the hyperlinks from a Google search. User @AllanCameron helped me in the past to sketch this code, but now I do not know how to change the XPath or what else I need to do in order to get the links. Here is my code:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
links <- html_nodes(first_page, xpath = "//div/div/a/h3") %>%
  html_attr('href')
This returns only NA values.
I would like to get the link for each item that appears in the search results.
Is it possible to get that stored in a dataframe? Many thanks!
Look at the parent <a> of each h3 node and take its href attribute. This ensures you have the same number of links as main titles, which allows for easy arrangement in a dataframe.
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")

titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
[1] "https://www.linkedin.com/in/mario-torres-b5796315b"
[2] "https://mariolopeztorres.com/"
[3] "https://www.instagram.com/mario_torres25/%3Fhl%3Den"
[4] "https://www.1stdibs.com/buy/mario-torres-lopez/"
[5] "https://m.facebook.com/2064681987175832"
[6] "https://www.facebook.com/mariotorresmx"
[7] "https://www.transfermarkt.us/mario-torres/profil/spieler/28167"
[8] "https://en.wikipedia.org/wiki/Mario_Garc%25C3%25ADa_Torres"
[9] "https://circawho.com/press-and-magazines/mario-lopez-torress-legacy-is-still-being-woven-in-michoacan-mexico/"

Scraping html nodes from multiple pages

I'm trying to scrape some data for a potential stats project, but I can't seem to get all of the nodes per page. Instead, it only grabs the first one before moving to the next page.
library(rvest)
pages <- "https://merrimackathletics.com/sports/" %>%
  paste0(c("baseball", "mens-basketball", "mens-cross-country") %>%
           paste0("/roster"))

Major <- lapply(pages,
                function(url){
                  url %>%
                    read_html() %>%
                    html_node(".sidearm-roster-player-major") %>%
                    html_text()
                })
Subsequently, the above only returns:
> Major
[[1]]
[1] "Business Adminstration"
[[2]]
[1] "Communications"
[[3]]
[1] "Global Management"
How should I go about indexing the node such that I get more than just the first "major" per page? Thanks!
The function html_node only extracts the first element. html_nodes will do what you want.
From the documentation:
html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length; the length of the html_nodes result might be longer or shorter.
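A minimal sketch of that fix, reusing the pages vector from the question; swapping html_node() for html_nodes() should return every player's major on each roster page rather than just the first:
Major <- lapply(pages, function(url) {
  url %>%
    read_html() %>%
    html_nodes(".sidearm-roster-player-major") %>%  # all matches, not just the first
    html_text()
})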

Web scraping - selection of a table

I'm trying to extract the table of historical data from Yahoo Finance website.
First, by inspecting the source code I've found that it's actually a table, so I suspect that html_table() from rvest should be able to work with it. However, I can't find a way to reach it from R. I've tried giving the function the full page, but it did not fetch the right table:
url <- "https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo"
read_html(url) %>% html_table(fill = TRUE)
# Returns only:
# [[1]]
# X1 X2
# 1 Show all results for Tip: Use comma to separate multiple quotes Search
Second, I've found an xpath selector for the particular table, but I am still unsuccessful in fetching the data:
xpath1 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]/table'
read_html(url) %>% html_node(xpath = xpath1)
# Returns an empty nodeset:
# {xml_nodeset (0)}
By removing the last term from the selector I get a non-empty nodeset; however, there is still no table:
xpath2 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]'
read_html(url) %>% html_node(xpath = xpath2) %>% html_table(fill = TRUE)
# Error: html_name(x) == "table" is not TRUE
What am I doing wrong? Any help would be appreciated!
EDIT: I've found that html_text() with the last XPath returns
read_html(url) %>% html_node(xpath = xpath2) %>% html_text()
[1] "Loading..."
which suggests that the table has not yet loaded when R reads the page. That would explain why it fails to see the table. Question: is there any way of bypassing that loading placeholder?
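Since the table is filled in by JavaScript after the initial HTML arrives, one common workaround is to render the page in a real browser and then hand the rendered source to rvest. A hedged sketch using RSelenium, assuming a working Selenium driver is available locally; the fixed 5-second wait is an arbitrary choice, not a guarantee the table has finished loading:
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remote <- driver$client

remote$navigate(url)   # url as defined above
Sys.sleep(5)           # crude wait for the JavaScript to render the table

rendered <- read_html(remote$getPageSource()[[1]])
tables <- html_table(rendered, fill = TRUE)

remote$close()
driver$server$stop()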

Scraping Movie reviews from IMDB using rvest

I have extracted the reviews of a movie on IMDB, but the separate reviews have a lot of blank lines between them. The result is unstructured and very difficult to view.
I have to apply certain functions to each of them separately and then store them together as one for some text mining and other functions.
How can I structure (clean) them, access them one at a time, and also combine them and store them together?
Here is my code for scraping the reviews
library(rvest)

ID <- 1490017
URL <- paste0("http://www.imdb.com/title/", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
  html_nodes("p") %>%
  html_text()
I would suggest being more specific when you navigate the DOM. For instance, this code will deliver only the reviews and none of the other information that you are presumably not looking to scrape:
ID <- 1490017
URL <- paste0("http://www.imdb.com/title/tt", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
  html_nodes("#pagecontent") %>%
  html_nodes("div+ p") %>%
  html_text()
And here is a way to remove line breaks, apply a function to each review, and merge all reviews into one paragraph (also see this post on concatenating vector elements and this post on replacing line breaks):
ex_review <- gsub("[\r\n]", " ", ex_review) # replace line breaks
sapply(ex_review, function(x){}) # placeholder: apply a function of your choice to each review
ex_review <- paste(ex_review, collapse = "") # concatenate reviews into one paragraph
write(ex_review, "test.txt")
I think you were also missing a "tt" in the URL.
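As a small follow-up, here is a sketch of how you might instead keep each review as its own element, trim the stray whitespace, and write one review per line. It reuses ex_review from the code above; the output file name is just an example:
clean_reviews <- ex_review %>%
  gsub("[\r\n]+", " ", .) %>%  # collapse line breaks inside each review
  trimws()                     # drop leading/trailing whitespace

writeLines(clean_reviews, "reviews.txt")  # one review per line
Keeping the reviews as a character vector makes it easier to apply per-review functions later, while paste(..., collapse = "") is still available if you need a single paragraph.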
