I'm trying to scrape some data for a potential stats project, but I can't seem to get all of the nodes per page. Instead, it only grabs the first one before moving to the next page.
library(rvest)
pages <- "https://merrimackathletics.com/sports/" %>%
  paste0(c("baseball", "mens-basketball", "mens-cross-country") %>%
           paste0("/roster"))

Major <- lapply(pages,
                function(url){
                  url %>%
                    read_html() %>%
                    html_node(".sidearm-roster-player-major") %>%
                    html_text()
                })
However, the above only returns:
> Major
[[1]]
[1] "Business Adminstration"
[[2]]
[1] "Communications"
[[3]]
[1] "Global Management"
How should I go about indexing the node such that I get more than just the first "major" per page? Thanks!
The function html_node only extracts the first element. html_nodes will do what you want.
From the documentation:
html_node is like [[ it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, the length of html_nodes might be longer or shorter.
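For example, keeping the rest of the original loop unchanged, swapping in html_nodes returns every listed major per page (a minimal sketch of the fix):
Major <- lapply(pages,
                function(url){
                  url %>%
                    read_html() %>%
                    html_nodes(".sidearm-roster-player-major") %>%
                    html_text()
                })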
I am using rvest to get the hyperlinks in a Google search. User @AllanCameron helped me in the past to sketch this code, but now I do not know how to change the xpath or what I need to do in order to get the links. Here is my code:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
links <- html_nodes(first_page, xpath = "//div/div/a/h3") %>%
html_attr('href')
This returns only NA.
I would like to get the link for each item that appears in the search results. Is it possible to get that stored in a dataframe? Many thanks!
Look at the parent a of each h3 node and take its href attribute. This ensures you have the same number of links as main titles, allowing easy arrangement in a dataframe.
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")

titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
[1] "https://www.linkedin.com/in/mario-torres-b5796315b"
[2] "https://mariolopeztorres.com/"
[3] "https://www.instagram.com/mario_torres25/%3Fhl%3Den"
[4] "https://www.1stdibs.com/buy/mario-torres-lopez/"
[5] "https://m.facebook.com/2064681987175832"
[6] "https://www.facebook.com/mariotorresmx"
[7] "https://www.transfermarkt.us/mario-torres/profil/spieler/28167"
[8] "https://en.wikipedia.org/wiki/Mario_Garc%25C3%25ADa_Torres"
[9] "https://circawho.com/press-and-magazines/mario-lopez-torress-legacy-is-still-being-woven-in-michoacan-mexico/"
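To answer the dataframe part of the question: because the href comes from the parent of each h3, titles and links line up one-to-one and can be combined directly (a small sketch building on the objects above; tibble() comes from the already-loaded tidyverse):
results <- tibble(
  title = titles %>% html_text(),
  link  = titles %>%
    html_elements(xpath = "./parent::a") %>%
    html_attr("href") %>%
    str_extract("https.*?(?=&)")
)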
I would like to extract the following data from four nodes, all at the same level and sharing the same class name.
I was able to extract the first of the four nodes (Property Amenities), using the Google Chrome selector gadget to identify the nodes.
library(rvest)
page0_url <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html")
result_amenities <- html_text(html_node(page0_url, "._1nAmDotd") %>% html_nodes("div"))
However, I cannot figure out how to write the code to extract the elements within the second object, named "Room Features". It sits at the same node level and has the same class name as the one above. The same is true for the two objects that follow it, named "Room types" and "Good to know".
You need to query all of the nodes with the same class using the html_nodes() function and then parse each of those nodes individually.
For example:
library(rvest)
url <- "https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html"
page0_url <- read_html(url)
result_amenities <- html_text(html_nodes(page0_url, "._1nAmDotd") %>% html_nodes("div"))
names <- html_nodes(page0_url, "div._1mJdgpMJ") %>% html_text()
groupNodes <- html_nodes(page0_url, "._1nAmDotd")
outputlist <- lapply(groupNodes, function(node){
  node %>% html_nodes("div") %>% html_text()
})
On the referenced page there is no corresponding "_1nAmDotd" node for the "Good to Know" section, which leads to an imbalance between the section names and the results.
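If you want to label the grouped output anyway, one rough option (a sketch only, since the missing "Good to Know" block means the headings and groups may not line up exactly) is:
# Pair section headings with their node groups; trim to the shorter length first
n <- min(length(names), length(outputlist))
named_output <- setNames(outputlist[seq_len(n)], names[seq_len(n)])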
Almost all of the desirable data (including everything you requested) is available via the page manifest, within a script tag, as that is where the page loads it from. You can pull that large block of data out with a regex, then write user-defined functions to extract the desired info.
I initially parse the regex-matched group into a JSON object, all_data. I then look through that list of lists to find strings associated only with the data of interest. For example, starRating is associated with the location data you are interested in. get_target_list returns that list, and I then extract from it what I want. You can see that location_info holds the data related to hotel amenities (including room amenities), the star rating (hotel class), the languages spoken, etc.
E.g. location_info$hotelAmenities$languagesSpoken or location_info$hotelAmenities$highlightedAmenities$roomFeatures, and so on.
N.B. As currently written, it is intended that search_string be unique to the desired list within the list of lists initially held in the JSON object. I wasn't sure whether the names of the named lists would remain constant, so I chose to retrieve the right list dynamically.
R:
library(rvest)
library(jsonlite)
library(stringr)
library(magrittr)
is_target_list <- function(x, search_string) {
  return(str_detect(x %>% toString(), search_string))
}

get_target_list <- function(data_list, search_string) {
  mask <- lapply(data_list, is_target_list, search_string) %>% unlist()
  return(subset(data_list, mask))
}

r <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1447619-Reviews-Solana_del_Ter-Ripoll_Province_of_Girona_Catalonia.html") %>%
  toString()

all_data <- gsub("pageManifest:", '"pageManifest":', stringr::str_match(r, "(\\{pageManifest:.*);\\(")[, 2]) %>%
  jsonlite::parse_json()

data_list <- all_data$pageManifest$urqlCache

# target_info <- get_target_list(data_list, 'hotelAmenities')
location_info <- get_target_list(data_list, "starRating") %>%
  unname() %>%
  .[[1]] %>%
  {
    .$data$locations[[1]]$detail
  }
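A rough usage sketch of what comes back (the field names below are just the ones mentioned above and could change whenever TripAdvisor reshuffles its manifest):
# Inspect a few of the extracted pieces (assumed field names, per the note above)
languages     <- location_info$hotelAmenities$languagesSpoken
room_features <- location_info$hotelAmenities$highlightedAmenities$roomFeatures
str(languages)
str(room_features)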
I am trying to scrape a web page in R. In the table of contents here:
https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc
I am interested in the
Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52
Depending on the document the page number can vary where these statements are.
I am trying to locate these statements using html_nodes() but I cannot seem to get it working. When I inspect the page I find the table inside <div align="CENTER">, but I cannot find a table ID to key on.
url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"
dat <- url %>%
read_html() %>%
html_table(fill = TRUE)
Any push in the right direction would be great!
EDIT: I know of the finreportr and finstr packages, but they work from the XML documents, and not all filings have XML documents. I also want to do this using the rvest package.
EDIT:
Something like the following works:
url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
html_table()
x <- population[[1]]
It's very messy, but it does get the cash flow table. The XPath changes depending on the webpage.
For example, this one is different:
url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
html_table()
x <- population[[1]]
Is there a way to "search" for the "Cash Flow" table and somehow extract the XPath?
Some more links to try.
[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"
[2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"
[3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"
[5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"
[6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"
[8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"
[9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"
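One rough way to "search" for the cash flow table, rather than hard-coding an XPath, is to match tables by the text they contain. A sketch only (shown against the DTE Energy filing from the edit above); the text to match and the choice among the candidate tables would need checking for each filing:
library(rvest)

url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
doc <- read_html(url)

# Uppercase the relevant letters with translate(), then look for "CASH FLOW" anywhere in a table
xp <- "//table[.//text()[contains(translate(., 'cashflow', 'CASHFLOW'), 'CASH FLOW')]]"
candidates <- html_nodes(doc, xpath = xp)

# Several tables may match (e.g. the table of contents); inspect their sizes before picking one
tables <- lapply(candidates, html_table, fill = TRUE)
sapply(tables, nrow)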
I am using simple code to extract the links to my articles (one by one):
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
final = list()
final = read_html(frontpagelinks[1]) %>%
  html_nodes("h1 a") %>%
  html_attr("href") %>%
  paste0()
I used
a1onJune = str_extract_all(frontpage, ".*a1on.*") to extract articles from the website a1on.mk, which worked like a charm, finding only the articles I needed.
After getting some help here as to how to make my code more efficient, i.e. extract numerous links at once, via:
linksList <- lapply(frontpagelinks, function(i) {
  read_html(frontpagelinks[i]) %>%
    html_nodes("h1 a") %>%
    html_attr("href") %>%
    paste0()
})
which extracts all of the links I need. However, the same stringr code now, oddly enough, returns something like this:
"\"standard.mk/germancite-ermenskiot-genocid/\", \"//plusinfo.mk/vest/72702/turcija-ne-go-prifakja-zborot-genocid\", \"/a1on.mk/wordpress/archives/618719\", \"sitel.mk/na-povidok-nov-sudir-megju-turcija-i-germanija\",
As shown above, I extract the links to the website I need, but also a bunch of other noise that I definitely don't want there. I tried a variety of regular expressions; however, I've not managed to isolate only the strings that contain a1on posts.
Given that the list I am attempting to clean up outputs the links separately, I am a bit baffled that when I use stringr it (as far as I can tell) randomly divides them into strings of multiple links:
[93] "http://telegraf.mk/aktuelno/svet/ns-newsarticle-vo-znak-na-protest-turcija-go-povlece-svojot-ambasador-od-germanija.nspx"
[94] "http://tocka.mk/1/197933/odnosite-pomegju-berlin-i-ankara-pred-totalen-kolaps-germanija-go-prizna-turskiot-genocid-nad-ermencite"
[95] "lokalno.mk/merkel-vladata-na-germanija-e-podgotvena-da-pomogne-vo-dijalogot-megju-turcija-i-ermenija/"
Any thoughts as to how I can go about this? Perhaps something that is more general, given that I need to do the same type of cleaning for five different portals.
Thank you.
Here is the full pipeline, with grepl() used at the end to keep only the a1on.mk links:
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
print(frontpage)
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
# lapply returns a list of lists, so use unlist to flatten
linksList <- unlist(lapply(frontpagelinks, function(i) {
  read_html(i) %>%
    html_nodes("h1 a") %>%
    html_attr("href") %>%
    paste0()
}))
# grab the lists of interest
a1onLinks <- linksList[grepl(".*a1on.*", linksList)]
# [1] "http://a1on.mk/wordpress/archives/621196" "http://a1on.mk/wordpress/archives/621038"
# [3] "http://a1on.mk/wordpress/archives/620576" "http://a1on.mk/wordpress/archives/620686"
# [5] "http://a1on.mk/wordpress/archives/620364" "http://a1on.mk/wordpress/archives/620399"
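To generalise this to the other portals mentioned in the question, the same grepl() filter can be applied once per portal (a sketch; the portal substrings below are just the ones that appear in the question and may need adjusting):
# Filter linksList once per portal of interest
portals <- c("a1on", "telegraf", "tocka", "lokalno", "sitel")
byPortal <- lapply(portals, function(p) linksList[grepl(p, linksList, fixed = TRUE)])
names(byPortal) <- portals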
I'm trying to extract the table of historical data from Yahoo Finance website.
First, by inspecting the source code I found that it is actually a table, so I suspect that html_table() from rvest should be able to work with it; however, I can't find a way to reach it from R. I tried giving the function the full page, but it did not fetch the right table:
url <- "https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo"
read_html(url) %>% html_table(fill = TRUE)
# Returns only:
# [[1]]
# X1 X2
# 1 Show all results for Tip: Use comma to separate multiple quotes Search
Second, I've found an xpath selector for the particular table, but I am still unsuccessful in fetching the data:
xpath1 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]/table'
read_html(url) %>% html_node(xpath = xpath1)
# Returns an empty nodeset:
# {xml_nodeset (0)}
By removing the last term from the selector I get a non-empty nodeset, however, still no table:
xpath2 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]'
read_html(url) %>% html_node(xpath = xpath2) %>% html_table(fill = TRUE)
# Error: html_name(x) == "table" is not TRUE
What am I doing wrong? Any help would be appreciated!
EDIT: I've found that html_text() with the last xpath returns
read_html(url) %>% html_node(xpath = xpath2) %>% html_text()
[1] "Loading..."
which suggests that the table is not yet loaded when R reads the page. This would explain why it fails to see the table. Question: are there any ways of bypassing that loading text?
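The table is filled in by JavaScript after the initial HTML arrives, and read_html() never runs that script, so there is nothing in the static page for rvest to wait for. Two hedged ways around it: render the page in a headless browser (e.g. RSelenium) before parsing, or skip scraping and pull the same monthly series through quantmod, which queries Yahoo's data API directly. A sketch of the latter (the date range mirrors the period1/period2 values in the URL above):
library(quantmod)

# Monthly FTSE prices over roughly the same window as the scraped URL
ftse <- getSymbols("^FTSE", src = "yahoo",
                   from = "2000-01-01", to = "2016-08-06",
                   periodicity = "monthly", auto.assign = FALSE)
head(ftse)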