I'm trying to get sold dates from eBay using R and rvest web scraping.
The URL is, literally:
https://www.ebay.com/sch/Star%20Wars%20%20BARC%20Speeder%20Bike%20Trooper%20Buzz%20-2009%20-Red%20-Obi-wan%20-Kenobi%20-Jesse%20-halmark%20-Funko%20-Pop%20-Black%20-snaptite%20-model%20-30th%20-Saga%20-Lego%20-McDonalds%20-McDonald%27s%20-Topps%20-Heroes%20-Playskool%20-Transformers%20-Titanium%20-Die-Cast%20-2003%20-2004%20-2005%20-2006%20-2007%20-2008%20-2012%20-2013%20%28Clone%20Wars%29&LH_Sold=1&LH_ItemCondition=3&_dmd=7&_ipg=200&LH_Complete=1&LH_PrefLoc=1
The full XPath to the first item's sold date is: //*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div/span/span[1]
If I apply html_text() to that path, I get nothing: character(0).
When I remove the spans and select the POSITIVE node instead, I get the date, but also a bunch of extra text.
R code:
readHTML <- url %>%
  read_html()
SoldDate <- readHTML %>%
  html_nodes(xpath = '//*[@id="srp-river-results"]/ul/li[1]/div/div[2]/div[2]/div') %>%
  html_nodes("[class='POSITIVE']") %>%
  html_text(trim = TRUE)
Result:
"SoYlPd N Feb 316,Z RM9USI2021"
I should get:
"Feb 16, 2021"
There are two great answers with more specific detail on this issue here:
Rvest Split Data by Class Name where the class names change
I am trying to extract all transcripts available on this webpage. I have been able to successfully extract the dates and titles of the speeches using the following code in R:
library(purrr)
library(rvest)
url_kremlin <- "http://kremlin.ru/events/president/transcripts/page/"
map(1:10, safely(function(i) {
  pg <- read_html(paste0(url_kremlin, i))
  data.frame(date = html_text(html_nodes(pg, ".dt-published")),
             title = html_text(html_nodes(pg, ".p-name")),
             link = html_nodes(pg, ".p-name") %>%
               html_node("p") %>% html_attr("href"))
})) -> kremlin_df
I am unable to extract the text of the transcripts, though. Does anyone know what I am doing wrong? What code should I use to successfully extract the transcripts?
Edit: when I run the code above, this is what I get: [screenshot omitted]. The link should contain the text of the speeches (or at least that's what I want it to contain).
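A hedged sketch of a possible fix, not a verified answer: the href most likely needs to come from an <a> element rather than a <p>, and the transcript text then has to be read from each linked page. The body selector ".read__content" below is a guess for illustration; inspect the live page and substitute the real one:
library(purrr)
library(rvest)
library(xml2)

url_kremlin <- "http://kremlin.ru/events/president/transcripts/page/"
pg <- read_html(paste0(url_kremlin, 1))

# Cover both markup variants: .p-name on the anchor itself, or wrapping it.
links <- pg %>%
  html_nodes("a.p-name, .p-name a") %>%
  html_attr("href") %>%
  url_absolute(base = "http://kremlin.ru")  # resolve relative links

# ".read__content" is an assumed selector for the speech body.
transcripts <- map(links, safely(function(lnk) {
  read_html(lnk) %>%
    html_nodes(".read__content") %>%
    html_text(trim = TRUE)
}))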
From the following data frame, I am trying to use the rvest package to scrape each word's part of speech and synonyms from the website https://www.thesaurus.com/browse/research?s=t into a CSV.
I am not sure how to have R search each word of the data frame and pull its part of speech and synonyms.
install.packages("rvest")
install.packages("xml2")
library(xml2)
library(rvest)
library(dplyr)
words <- data.frame("keywords" = c("research", "survey", "staff", "outpatient", "consent"))
html<- read_html("https://www.merriam-webster.com/thesaurus/research")
html %>% html_nodes(".mw-list") %>% html_text() %>%
  head(n = 1) # take the first record
If you search [your term] on thesaurus.com, you end up on the following HTML page: "https://www.thesaurus.com/browse/[your term]". Knowing this, you can build the URLs for all the terms you're interested in. After that you should be able to iterate with the map() function from the purrr package to get the information you want:
# It makes more sense to just keep "words" as a vector for now
words <- c("research","survey","staff","outpatient","consent")
htmls <- paste0("https://www.thesaurus.com/browse/", words)
info_list <- map(htmls, ~ .x %>%
  read_html() %>%
  html_node(".mw-list") %>%  # swap in a selector that matches thesaurus.com's markup
  html_text())
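A small follow-up, assuming the map() call succeeds: naming the results by word makes the list easier to index.
names(info_list) <- words
info_list[["research"]]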
I am scraping stock market prices using the rvest package in R. I would like to exclude nodes when using html_nodes().
The following classes appear on the website with stock prices:
[4] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_DifferenceBlock_lblRelativeDifferenceDown" class="ValueDown">-0,51%</span>
[5] <span id="ctl00_ctl00_Content_LeftContent_IssueList_StockList_repIssues_ctl02_ctl02_lblDifference" class="ValueDown Difference">-51%</span>
Now I would like to include only the text after class="ValueDown", and I would like to exclude the text after class="ValueDown Difference".
For this I use the following code:
urlIEX <- "https://www.iex.nl/Koersen/Europa_Lokale_Beurzen/Amsterdam/AMX.aspx"
webpageIEX <- read_html(urlIEX)
percentage_change <- webpageIEX %>%
  html_nodes(".ValueDown") %>%
  html_text()
However, this gives me both the values -0,51% and -51%. Is there a way to include everything with class="ValueDown" and exclude everything with class="ValueDown Difference"?
I'm no expert, but I think you should use the attribute selector:
percentage_change <- webpageIEX %>%
  html_nodes("[class='ValueDown']") %>%
  html_text()
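A possible alternative, assuming the selectr backend that rvest uses supports the :not() pseudo-class, is to match the class and exclude the Difference variant; unlike the exact attribute match above, this keeps working if the site adds further classes to the element:
percentage_change <- webpageIEX %>%
  html_nodes(".ValueDown:not(.Difference)") %>%
  html_text()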
I am practicing my web scraping coding in R and I cannot pass one phase no matter what website I try.
For example,
https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music
My goal is to extract all 77 schools' names (Oxford to London Metropolitan).
So I tried...
library(rvest)
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music"
college <- read_html(url_college)
info <- html_nodes(college, css = '.league-table-institution-name')
info %>% html_nodes('.league-table-institution-name') %>% html_text()
From F12, I could see that all the school names are under the class '.league-table-institution-name'... and that's why I wrote that in html_nodes.
What have I done wrong?
You appear to be running html_nodes() twice: first on college, an xml_document (which is correct), and then again on info, the nodeset you already extracted; the second call searches within those nodes for the same class and finds nothing.
Try this instead:
url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text()
and then you'll need an additional step to clean up the school names; this one was suggested (str_replace_all() comes from the stringr package):
%>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
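Put together, assuming stringr is installed, the full pipeline looks like this; the regular expression strips any non-letter characters from the start and end of each name:
library(rvest)
library(stringr)

url_college %>%
  read_html() %>%
  html_nodes('.league-table-institution-name') %>%
  html_text() %>%
  str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")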
I'm trying to extract the table of historical data from Yahoo Finance website.
First, by inspecting the source code I found that it's actually a table, so I suspect that html_table() from rvest should be able to work with it; however, I can't find a way to reach it from R. I've tried feeding the function the full page, but it did not fetch the right table:
url <- "https://finance.yahoo.com/quote/^FTSE/history?period1=946684800&period2=1470441600&interval=1mo&filter=history&frequency=1mo"
read_html(url) %>% html_table(fill = TRUE)
# Returns only:
# [[1]]
# X1 X2
# 1 Show all results for Tip: Use comma to separate multiple quotes Search
Second, I've found an xpath selector for the particular table, but I am still unsuccessful in fetching the data:
xpath1 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]/table'
read_html(url) %>% html_node(xpath = xpath1)
# Returns an empty nodeset:
# {xml_nodeset (0)}
By removing the last term from the selector I get a non-empty nodeset, however, still no table:
xpath2 <- '//*[@id="main-0-Quote-Proxy"]/section/div[2]/section/div/section/div[3]'
read_html(url) %>% html_node(xpath = xpath2) %>% html_table(fill = TRUE)
# Error: html_name(x) == "table" is not TRUE
What am I doing wrong? Any help would be appreciated!
EDIT: I've found that html_text() with the last xpath returns
read_html(url) %>% html_node(xpath = xpath2) %>% html_text()
[1] "Loading..."
which suggests that the table had not yet loaded when R read the page. This would explain why it failed to see the table. Question: is there any way of bypassing that loading text?
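A sketch of one common workaround, not a guaranteed fix: the table is filled in by JavaScript after the initial HTML arrives, so a plain read_html() only ever sees the "Loading..." placeholder. Rendering the page in a real browser first, here via RSelenium (assumed to be installed with a working browser driver), lets rvest parse the finished DOM:
library(RSelenium)
library(rvest)

rd <- rsDriver(browser = "firefox", verbose = FALSE)
rd$client$navigate(url)
Sys.sleep(5)  # crude wait for the JavaScript to fill in the table

rd$client$getPageSource()[[1]] %>%
  read_html() %>%
  html_node(xpath = xpath1) %>%  # the original table XPath
  html_table(fill = TRUE)

rd$client$close()
rd$server$stop()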