I am scraping stock market prices using the rvest package in R. I would like to exclude nodes when using html_nodes().
The following classes appear on the website with stock prices:
[4] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_DifferenceBlock_lblRelativeDifferenceDown" class="ValueDown">-0,51%</span>
[5] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_ctl02_lblDifference" class="ValueDown Difference">-51%</span>
Now I would like to include only the text of the nodes with class="ValueDown" and exclude the text of the nodes with class="ValueDown Difference".
For this I use the following code:
urlIEX <- "https://www.iex.nl/Koersen/Europa_Lokale_Beurzen/Amsterdam/AMX.aspx"
webpageIEX <- read_html(urlIEX)
percentage_change <- webpageIEX %>%
html_nodes(".ValueDown") %>%
html_text()
However, this gives me both the values -0,51% and -51%. Is there a way to include everything with class="ValueDown" and exclude everything with class="ValueDown Difference"?
I'm not an expert, but I think you should use the attribute selector:
percentage_change <- webpageIEX %>%
html_nodes("[class='ValueDown']") %>%
html_text()
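Another option, if some nodes might carry additional classes besides ValueDown, is a CSS :not() selector. A minimal sketch along the same lines:
# keep nodes that have class ValueDown but do not also have class Difference
percentage_change <- webpageIEX %>%
  html_nodes(".ValueDown:not(.Difference)") %>%
  html_text()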
I am using rvest to get the hyperlinks in a Google search. User @AllanCameron helped me in the past to sketch this code, but now I do not know how to change the xpath or what I need to do in order to get the links. Here is my code:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
links <- html_nodes(first_page, xpath = "//div/div/a/h3") %>%
html_attr('href')
This returns only NA values.
I would like to get the link for each item that appears in the search results (sorry for the quality of the images). Is it possible to get that stored in a dataframe? Many thanks!
Look at the parent a nodes of the h3 nodes and take their href attribute. This ensures you have the same number of links as main titles, which makes it easy to arrange them in a dataframe.
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")
titles %>%
html_elements(xpath = "./parent::a") %>%
html_attr("href") %>%
str_extract("https.*?(?=&)")
[1] "https://www.linkedin.com/in/mario-torres-b5796315b"
[2] "https://mariolopeztorres.com/"
[3] "https://www.instagram.com/mario_torres25/%3Fhl%3Den"
[4] "https://www.1stdibs.com/buy/mario-torres-lopez/"
[5] "https://m.facebook.com/2064681987175832"
[6] "https://www.facebook.com/mariotorresmx"
[7] "https://www.transfermarkt.us/mario-torres/profil/spieler/28167"
[8] "https://en.wikipedia.org/wiki/Mario_Garc%25C3%25ADa_Torres"
[9] "https://circawho.com/press-and-magazines/mario-lopez-torress-legacy-is-still-being-woven-in-michoacan-mexico/"
I'm trying to get sold dates from eBay using R and rvest web scraping.
The url is, literally:
https://www.ebay.com/sch/Star%20Wars%20%20BARC%20Speeder%20Bike%20Trooper%20Buzz%20-2009%20-Red%20-Obi-wan%20-Kenobi%20-Jesse%20-halmark%20-Funko%20-Pop%20-Black%20-snaptite%20-model%20-30th%20-Saga%20-Lego%20-McDonalds%20-McDonald%27s%20-Topps%20-Heroes%20-Playskool%20-Transformers%20-Titanium%20-Die-Cast%20-2003%20-2004%20-2005%20-2006%20-2007%20-2008%20-2012%20-2013%20%28Clone%20Wars%29&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1
The full xpath to the first item's sold date is: //*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span/span[1]
If I use that xpath and then html_text(), I get nothing: character(0).
When I remove the spans and instead select the node with class POSITIVE, I get the date, but also a bunch of extra text.
R code:
readHTML <- url %>%
read_html()
SoldDate <- readHTML %>%
html_nodes(xpath='//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div') %>%
html_nodes("[class='POSITIVE']") %>%
html_text(trim = TRUE)
Result:
"SoYlPd N Feb 316,Z RM9USI2021"
I should get:
"Feb 16, 2021"
There are two great answers with more detailed specifics on this issue here:
Rvest Split Data by Class Name where the class names change
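One workaround for the decoy characters, assuming they come from child elements nested inside the POSITIVE node while the real date sits in its direct text nodes (an assumption about eBay's markup, whose class names change), is to read only the direct text nodes:
library(rvest)
library(xml2)
date_node <- readHTML %>%
  html_nodes(xpath = '//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div') %>%
  html_nodes("[class='POSITIVE']")
# assumption: the real date lives in the direct text nodes of the POSITIVE
# element, while the scrambled characters come from nested child elements
xml_find_all(date_node, "./text()") %>%
  xml_text() %>%
  paste(collapse = "") %>%
  trimws()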
Starting from the following data frame, I am trying to use the rvest package to scrape each word's part of speech and synonyms from the website https://www.thesaurus.com/browse/research?s=t into a csv.
I am not sure how to have R search each word of the data frame and pull its part of speech and synonyms.
install.packages("rvest")
install.packages("xml2")
library(xml2)
library(rvest)
library(dplyr)
words <- data.frame("keywords" = c("research","survey","staff","outpatient","consent"))
html<- read_html("https://www.merriam-webster.com/thesaurus/research")
html %>% html_nodes(".mw-list") %>% html_text() %>%
  head(n = 1) # take the first record
If you search [your term] on thesaurus.com, you will end up on the following HTML page: "https://www.thesaurus.com/browse/[your term]". If you know this, you can build the URLs of all the pages for the terms you're interested in. After that you should be able to iterate with the map() function from the purrr package to get the information you want:
# It makes more sense to just keep "words" as a vector for now
words <- c("research","survey","staff","outpatient","consent")
htmls <- paste0("https://www.thesaurus.com/browse/", words)
library(purrr)

info_list <- map(htmls, ~ .x %>%
  read_html() %>%
  html_node(".mw-list") %>%
  html_text())
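To get from info_list to the csv the question asks for, one could collapse each element and bind everything into a dataframe. A sketch only, with the caveat that the .mw-list selector above comes from the Merriam-Webster snippet and may not match thesaurus.com's markup:
library(tibble)
library(readr)
info_df <- tibble(
  keyword  = words,
  synonyms = map_chr(info_list, ~ paste(.x, collapse = "; "))
)
write_csv(info_df, "synonyms.csv")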
I want to use R to extract text and numbers from the following page: https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=PA0261696&pgm_sys_acrnm_in=NPDES
Specifically, I want the NPDES SIC code and the description, which is 6515 and "Operators of residential mobile home sites" here.
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
test <- test %>% html_nodes("tr") %>% html_text()
# This extracts 31 lines of text; here is what my target text looks like:
# [10] "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n"
Ideally, I'd like the following: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"
How would I do this? I'm trying and failing with regex here, even just trying to extract the number 6515 on its own, which I thought would be the easiest part...
as.numeric(sub(".*?NPDES.*?(\\d{4}).*", "\\1", test))
# 4424
Any advice?
From what I can see, your information resides in a table, so it might be better to just extract the information as a dataframe. This works:
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
tables <- html_nodes(test, "table")
tables
SIC <- as.data.frame(html_table(tables[5], fill = TRUE))
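Once the table is in a dataframe, the code and description can be pulled out of it. This is only a sketch: the column layout of tables[5] is an assumption, so print SIC first and adjust the indices:
# assumption: the first column holds the program acronym ("NPDES"), the
# second the SIC code, the third the description
sic_row <- SIC[grepl("NPDES", SIC[[1]]), ]
paste(sic_row[[2]], sic_row[[3]])
# goal: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"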
I am trying to scrape a web page in R. In the table of contents here:
https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc
I am interested in the
Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52
Depending on the document, the page on which these statements appear can vary.
I am trying to locate these tables using html_nodes(), but I cannot seem to get it working. When I inspect the page I find the table at <div align="CENTER">, but I cannot find a table ID to key on.
url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"
dat <- url %>%
read_html() %>%
html_table(fill = TRUE)
Any push in the right direction would be great!
EDIT: I know of the finreportr and finstr packages, but they work from the XML documents, and not all .HTML filings have XML documents; I also want to do this using the rvest package.
EDIT:
Something like the following works:
url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
html_table()
x <- population[[1]]
It's very messy, but it does get the cash flows table. The XPath changes depending on the webpage.
For example this one is different:
url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"
population <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
html_table()
x <- population[[1]]
Is there a way to "search" for the "Cash Flow" table and somehow extract the XPath? (A rough approach is sketched after the links below.)
Some more links to try:
[1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"
[2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"
[3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"
[4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"
[5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"
[6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"
[7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"
[8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"
[9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"