I am scraping blog text with rvest and am struggling to find a simple way to exclude specific nodes. The following pulls the text:
AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")
testpost <- AllandSundry_test %>%
html_node("#contentmiddle") %>%
html_text() %>%
as.character()
I want to exclude the two nodes with IDs "contenttitle" and "commentblock". Below, I try excluding just the comments using the ID "commentblock".
testpost <- AllandSundry_test %>%
html_node("#contentmiddle") %>%
html_node(":not(#commentblock)")
html_text() %>%
as.character()
When I run this, the result is simply the date -- all the rest of the text is gone. Any suggestions?
I have spent a lot of time searching for an answer, but I am new to R (and html), so I appreciate your patience if this is something obvious.
You were almost there. You should use html_nodes instead of html_node.
html_node retrieves the first element it encounters, while html_nodes returns every matching element on the page as a list.
The toString() function collapses the list of strings into one.
library(rvest)
AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")
testpost <- AllandSundry_test %>%
html_nodes("#contentmiddle>:not(#commentblock)") %>%
html_text() %>%
as.character() %>%
toString()
testpost
#> [1] "\n\t\tMar\n\t\t3\n\t, Mar, 3, \n\t\tLet's go back to
#> commenting on the weather\n\t\t\n\t\t, Let's go back to commenting on
#> the weather, Let's go back to commenting on the weather, I have just
#> returned from the grocery store, and I need to get something off my chest.
#> When did "Got any big plans for the rest of the day?" become
#> the default small ...<truncated>
You still need to clean up the string a bit.
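For example, collapsing the runs of whitespace and trimming the ends goes a long way (a minimal cleanup sketch):
# Collapse runs of whitespace/newlines into single spaces, then trim
testpost <- trimws(gsub("\\s+", " ", testpost))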
It certainly looks like GGamba solved it for you. However, on my machine, I had to remove the > after #contentmiddle, so that line was instead:
html_nodes("#contentmiddle:not(#commentblock)")
Best of luck!
Jesse
I am struggling with the rvest package in R, most likely due to my lack of knowledge about CSS and HTML. Here is an example (my guess is that ".quote-header-info" is what is wrong; I also tried the ".Trsdu ..." selector, but no luck either):
library(rvest)
url="https://finance.yahoo.com/quote/SPY"
website=read_html(url) %>%
html_nodes(".quote-header-info") %>%
html_text() %>% toString()
website
Below is the webpage I am trying to scrape; specifically, I am looking to grab the value "416.74". I took a peek at the documentation (https://cran.r-project.org/web/packages/rvest/rvest.pdf), but I think the issue is that I don't understand the breakdown of the webpage I am looking at.
The tricky part is determining the correct set of attributes to select only this one html node.
In this case, it is the span tag with the classes Trsdu(0.3s) and Fz(36px):
library(rvest)
url="https://finance.yahoo.com/quote/SPY"
#read page once
page <- read_html(url)
#now extract information from the page
price <- page %>% html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
html_text()
price
Note: "(", ")", and "." are all special characters thus the need to double escape "\\" them.
Those classes are dynamic and change much more frequently than other parts of the html, so they should be avoided. You have at least two more robust options:
1. Extract the javascript object housing that data (plus a lot more) from a script tag, then parse it with jsonlite.
2. Use positional matching against other, more stable, html elements.
I show both below. The advantage of the first is that you can extract lots of other page data from the generated json object.
library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)
page <- read_html('https://finance.yahoo.com/quote/SPY')
# Option 1: extract the embedded json from a script tag and parse it
data <- page %>%
  toString() %>%
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>% .[2]
json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)
# Option 2: positional matching against more stable elements
print(page %>% html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>% html_text() %>% as.numeric())
I am new to web scraping and am working on a test project in which I try to scrape every table of data on the following website for this particular team. There should be 15 tables, but when I run my code, it only seems to pull the first 6 of the 15. How do I go about getting the rest of the tables?
Here is the code:
library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(magrittr)
iowa_stats<- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
iowa_stats %>% html_table()
Edit: I decided to dig a little deeper into the problem to see if I could get any more insight. I started with the first table that doesn't appear when you call html_table, the 'Totals' table, and followed the path of the html all the way down to it to see if I could figure out what's wrong. To do so, I used the following code.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper")
This is as far as I can get before hitting an error. At the next step there should be div#div_totals.table_container.is_setup, in which the table is stored, but if I add that to the above code, it doesn't exist. When I type the following, it doesn't exist either.
iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper") %>% html_nodes("div")
Does someone who is better with html/css have any idea why this is the case?
It looks like this webpage is storing some of the tables inside html comments. To solve this, read and save the web page, remove the comment tags, and then process normally.
library(rvest)
library(dplyr)
iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
# Only save and work with the body
body <- html_node(iowa_stats, "body")
write_xml(body, "temp.xml")
# Find and remove comments
lines <- readLines("temp.xml")
lines <- lines[-grep("<!--", lines)]
lines <- lines[-grep("-->", lines)]
writeLines(lines, "temp2.xml")
# Read the file back in and process normally
body <- read_html("temp2.xml")
html_nodes(body, "table") %>% html_table()
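If you would rather avoid the temporary files, the same idea works in memory. A sketch, assuming the tables are wrapped in plain <!-- --> markers as on this site:
# Strip the comment markers from the raw html, then re-parse
page_text <- toString(iowa_stats)
page_text <- gsub("<!--|-->", "", page_text)
read_html(page_text) %>% html_nodes("table") %>% html_table()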
Background: Using rvest I'd like to scrape all details of all art pieces for the painter Paulo Uccello on wikiart.org. The endgame will look something like this:
> names(uccello_dt)
[1] title year style genre media imgSRC infoSRC
Problem: When a scraping attempt doesn't go as planned, I get back character(0). This isn't helpful in understanding exactly where along the path the scrape went wrong. I'd like my scrape attempts to report the path they took so that I can better troubleshoot my failures.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector tool to make sure that I am using the correct css selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been trial and error that's taking much longer than I think it should. Here's a cleaned-up example of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)
sample_url <-
read_html(
"https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
)
imgSrc <-
sample_url %>%
html_nodes(".wiki-detailed-item-container") %>% html_nodes(".masonry-detailed-artwork-item") %>% html_nodes("aside") %>% html_nodes(".wiki-layout-artist-image-wrapper") %>% html_nodes("img") %>%
html_attr("src") %>%
as.character()
title <-
sample_url %>%
html_nodes(".masonry-detailed-artwork-title") %>%
html_text() %>%
as.character()
Thank you in advance.
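One way to get that visibility is to walk the selector chain one step at a time and report how many nodes each step matches. A minimal sketch (check_path is a hypothetical helper, not an rvest function):
library(rvest)
# Hypothetical helper: apply css selectors one at a time, reporting match counts
check_path <- function(page, selectors) {
  nodes <- page
  for (css in selectors) {
    nodes <- html_nodes(nodes, css)
    message(css, ": ", length(nodes), " node(s) matched")
  }
  nodes
}
# Example: a step that matches 0 nodes pinpoints where the chain breaks
imgs <- check_path(sample_url, c(".wiki-detailed-item-container",
                                 ".masonry-detailed-artwork-item",
                                 "img"))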
I am trying to scrape data using rvest. I cannot scrape the number/text from the share counter at this link: "753 udostępnienia" ("753 shares" in Polish).
I used Google Chrome plugin XPath helper to find Xpath. I prepared a simple R code:
library(rvest)
url2<- "https://www.siepomaga.pl/kacper-szlyk"
share_url<-html(url2)
share_url
share <- share_url %>%
html_node(xpath = "/html[@class='turbolinks-progress-bar']/body/div[@id='page']/div[@class='causes-show']/div[@class='ui container']/div[@id='column-container']/div[@id='right-column']/div[@class='ui sticky']/div[@class='box with-padding']/div[@class='bordered-box share-box']/div[@class='content']/div[@class='ui grid two columns']/div[@class='share-counter']") %>%
html_text()
share
However, the result is NA.
Where did I go wrong?
I came up with a solution using rvest, without the xpath = argument. This also uses the pipe operator (%>%, re-exported by the dplyr package) to simplify things:
library(tidyverse) # Contains the dplyr package
library(rvest)
siep_url <- "https://www.siepomaga.pl/kacper-szlyk"
counter <- siep_url %>%
read_html() %>%
html_node(".share-counter") %>% # The node comes from https://selectorgadget.com/, a useful selector tool
html_text()
The output for this comes up like so:
[1] "\n\n755\nudostępnień\n"
You can clean that up using gsub(), collapsing the whitespace rather than hardcoding the current count:
counter <- gsub("\\s+", " ", trimws(counter))
This returns 755 udostępnień, as a character. Hope this helps.
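If you need the count as a number rather than a character, stripping the non-digits works (a sketch):
# Keep only the digits, then convert
shares <- as.numeric(gsub("\\D", "", counter))
shares
#> [1] 755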
Disclaimer: Rather large language barrier, but translate.google.com did wonders.
I've got a dataframe called base_table with a lot of 311 data and URLs that point to a broader description of each call.
I'm trying to create a new variable called case_desc by applying a series of rvest functions to each URL.
base_table$case_desc <-
read_html(base_table$case_url) %>%
html_nodes("rc_descrlong") %>%
html_text()
But this doesn't work, for what I suppose are obvious reasons that I can't work out right now. I've tried playing around with functions, but can't seem to nail the right format.
Any help would be awesome! Thank you!
It doesn't work because read_html doesn't work with a vector of URLs. It will throw an error if you give it a vector...
> read_html(c("http://www.google.com", "http://www.yahoo.com"))
Error: expecting a single value
You probably have to use an apply function...
library("rvest")
base_table$case_desc <- sapply(base_table$case_url, function(x)
read_html(x) %>%
html_nodes("rc_descrlong") %>%
html_text())
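One caveat: if a page has several matches (or none), sapply can return a list instead of a character vector, and if rc_descrlong is a class rather than a tag name the selector needs a leading dot. A sketch that guards against both (assuming it is a class):
base_table$case_desc <- sapply(base_table$case_url, function(x) {
  read_html(x) %>%
    html_nodes(".rc_descrlong") %>%  # assumes rc_descrlong is a class
    html_text() %>%
    paste(collapse = " ")  # one string per URL, even with 0 or many matches
})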