Applying rvest pipes to a dataframe - r

I've got a dataframe called base_table with a lot of 311 data and URLs that point to a broader description of each call.
I'm trying to create a new variable called case_desc by running a series of rvest functions over each URL.
base_table$case_desc <-
  read_html(base_table$case_url) %>%
  html_nodes("rc_descrlong") %>%
  html_text()
But this doesn't work, for reasons that are probably obvious but that I can't work out right now. I've tried playing around with the functions, but can't seem to nail the right format.
Any help would be awesome! Thank you!

It doesn't work because read_html doesn't work with a vector of URLs. It will throw an error if you give it a vector...
> read_html(c("http://www.google.com", "http://www.yahoo.com"))
Error: expecting a single value
You probably have to use an apply function...
library("rvest")
base_table$case_desc <- sapply(base_table$case_url, function(x)
  read_html(x) %>%
    html_nodes("rc_descrlong") %>%
    html_text())
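One caveat worth noting: sapply() will quietly fall back to a list if any page yields zero or multiple matches. A stricter sketch (reusing base_table and the "rc_descrlong" selector from the question; the paste(collapse = ...) step is my own assumption about how multiple matches should be combined) uses vapply() to insist on exactly one string per URL:

```r
library(rvest)

# vapply() enforces one character value per URL and fails loudly otherwise.
base_table$case_desc <- vapply(base_table$case_url, function(x) {
  read_html(x) %>%
    html_nodes("rc_descrlong") %>%
    html_text() %>%
    paste(collapse = " ")  # assumption: join multiple matches into one string
}, character(1), USE.NAMES = FALSE)
```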

Related

How to log scrape paths rvest used?

Background: Using rvest I'd like to scrape all details of all art pieces for the painter Paulo Uccello on wikiart.org. The endgame will look something like this:
> names(uccello_dt)
[1] "title"   "year"    "style"   "genre"   "media"   "imgSRC"  "infoSRC"
Problem: When a scraping attempt doesn't go as planned, I get back character(0). This isn't helpful for me in understanding exactly what path the scrape took to get character(0). I'd like to have my scrape attempts output what path it specifically took so that I can better troubleshoot my failures.
What I've tried:
I use Firefox, so after each failed attempt I go back to the web inspector tool to make sure that I am using the correct CSS selector / element tag. I've been keeping the rvest documentation by my side to better understand its functions. It's been trial and error that's taking much longer than I think it should. Here's a cleaned-up source of one of many failures:
library(tidyverse)
library(data.table)
library(rvest)

sample_url <-
  read_html(
    "https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed"
  )

imgSrc <-
  sample_url %>%
  html_nodes(".wiki-detailed-item-container") %>%
  html_nodes(".masonry-detailed-artwork-item") %>%
  html_nodes("aside") %>%
  html_nodes(".wiki-layout-artist-image-wrapper") %>%
  html_nodes("img") %>%
  html_attr("src") %>%
  as.character()

title <-
  sample_url %>%
  html_nodes(".masonry-detailed-artwork-title") %>%
  html_text() %>%
  as.character()
Thank you in advance.
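One way to see where a chain like this stops matching is to check how many nodes each selector returns on its own. This is a troubleshooting sketch (the selector names are taken from the question), not a fix; if every selector matches zero nodes, the content is likely injected by JavaScript and never present in the static HTML that rvest downloads.

```r
library(rvest)

page <- read_html("https://www.wikiart.org/en/paolo-uccello/all-works#!#filterName:all-paintings-chronologically,resultType:detailed")

selectors <- c(".wiki-detailed-item-container",
               ".masonry-detailed-artwork-item",
               ".masonry-detailed-artwork-title")

# Report the match count for each selector so the failing step is visible.
for (sel in selectors) {
  message(sel, ": ", length(html_nodes(page, sel)), " node(s) matched")
}
```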

webscraping a table with no html class

I'm exploring webscraping some weather data, specifically the table in the right panel of this page: https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988
I'm able to navigate to the appropriate location (see below), but have not been able to pull out the table, e.g. with html_nodes("table").
library(tidyverse)
library(rvest)
url <- read_html("https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988")

url %>%
  html_nodes("frame") %>%
  magrittr::extract2(2)
# {html_node}
# <frame src="/cgi-bin/cliRECtM.pl?ak4988" name="Graph">
I've also looked at the namespace with no luck
xml_ns(url)
# <->
This works for me.
library(rvest)
library(magrittr)
library(plyr)

# Doing URLs one by one
url <- "https://wrcc.dri.edu/cgi-bin/cliRECtM.pl?ak4988"

pricesdata <- read_html(url) %>%
  html_nodes(xpath = "//table[1]") %>%
  html_table(fill = TRUE)

df <- ldply(pricesdata, data.frame)
Originally I was hitting the wrong URL. The comment from Mogzol pointed me in the right direction. I'm not sure how or why different URLs feed into the same page; maybe it has something to do with the separate scrolling frames within a single window. I'd be interested in hearing how this works, if someone has some insight. Thanks!!
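For what it's worth, the redirection can be made explicit: the original page is a frameset, and each &lt;frame&gt; loads its content from its own src URL. A sketch that extracts that src and resolves it, instead of hard-coding the inner URL:

```r
library(rvest)

frameset <- read_html("https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988")

# The second <frame> carries the table; pull its relative src and resolve it.
frame_src <- frameset %>%
  html_nodes("frame") %>%
  magrittr::extract2(2) %>%
  html_attr("src")

table_page <- read_html(paste0("https://wrcc.dri.edu", frame_src))
```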

R: rvest - share counter, xpath

I am trying to scrape data using rvest. I cannot scrape the number/text from the share counter at this link: "753 udostępnienia" ("753 shares").
I used the Google Chrome plugin XPath Helper to find the XPath. I prepared a simple R script:
library(rvest)

url2 <- "https://www.siepomaga.pl/kacper-szlyk"
share_url <- html(url2)
share_url

share <- share_url %>%
  html_node(xpath = "/html[#class='turbolinks-progress-bar']/body/div[#id='page']/div[#class='causes-show']/div[#class='ui container']/div[#id='column-container']/div[#id='right-column']/div[#class='ui sticky']/div[#class='box with-padding']/div[#class='bordered-box share-box']/div[#class='content']/div[#class='ui grid two columns']/div[#class='share-counter']") %>%
  html_text()
share
However, the result is NA. Where did I go wrong?
I came up with a solution using rvest, without using the xpath = method. This also uses the pipe operator from the dplyr package, to simplify things:
library(tidyverse) # Contains the dplyr package
library(rvest)
siep_url <- "https://www.siepomaga.pl/kacper-szlyk"
counter <- siep_url %>%
  read_html() %>%
  html_node(".share-counter") %>% # The node comes from https://selectorgadget.com/, a useful selector tool
  html_text()
The output for this comes up like so:
[1] "\n\n755\nudostępnień\n"
You can clean that up using gsub():
counter <- gsub("\n\n755\nudostępnień\n", "755 udostępnień", counter)
This returns 755 udostępnień, as a character. Hope this helps.
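Since the hard-coded pattern breaks as soon as the share count changes, a more general cleanup (my own suggestion, not part of the original answer) is to trim and collapse the whitespace instead:

```r
counter <- "\n\n755\nudostępnień\n"            # example value from the answer above
counter <- gsub("\\s+", " ", trimws(counter))  # trim the ends, collapse inner runs
counter
#> [1] "755 udostępnień"
```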
Disclaimer: Rather large language barrier, but translate.google.com did wonders.

Excluding Nodes RVest

I am scraping blog text using rvest and am struggling to figure out a simple way to exclude specific nodes. The following pulls the text:
AllandSundry_test <- read_html(
  "http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/"
)

testpost <- AllandSundry_test %>%
  html_node("#contentmiddle") %>%
  html_text() %>%
  as.character()
I want to exclude the two nodes with ID's "contenttitle" and "commentblock". Below, I try excluding just the comments using the tag "commentblock".
testpost <- AllandSundry_test %>%
  html_node("#contentmiddle") %>%
  html_node(":not(#commentblock)") %>%
  html_text() %>%
  as.character()
When I run this, the result is simply the date -- all the rest of the text is gone. Any suggestions?
I have spent a lot of time searching for an answer, but I am new to R (and html), so I appreciate your patience if this is something obvious.
You were almost there: use html_nodes instead of html_node.
html_node retrieves only the first element it encounters, while html_nodes returns every matching element in the page as a list.
The toString() function then collapses the list of strings into one.
library(rvest)

AllandSundry_test <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")

testpost <- AllandSundry_test %>%
  html_nodes("#contentmiddle>:not(#commentblock)") %>%
  html_text() %>%
  as.character() %>%
  toString()
testpost
#> [1] "\n\t\tMar\n\t\t3\n\t, Mar, 3, \n\t\tLet's go back to
#> commenting on the weather\n\t\t\n\t\t, Let's go back to commenting on
#> the weather, Let's go back to commenting on the weather, I have just
#> returned from the grocery store, and I need to get something off my chest.
#> When did "Got any big plans for the rest of the day?" become
#> the default small ...<truncated>
You still need to clean up the string a bit.
It certainly looks like GGamba solved it for you. However, on my machine I had to remove the > after #contentmiddle, so that section was instead:
html_nodes("#contentmiddle:not(#commentblock)")
Best of luck!
Jesse
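If the :not() selector keeps misbehaving, another route (a sketch using xml2, the package rvest is built on) is to delete the unwanted nodes first and then extract whatever text remains:

```r
library(rvest)
library(xml2)

page <- read_html("http://www.sundrymourning.com/2017/03/03/lets-go-back-to-commenting-on-the-weather/")

# xml_remove() modifies the document in place: drop both unwanted nodes,
# then take the text of what is left in #contentmiddle.
content <- html_node(page, "#contentmiddle")
xml_remove(html_nodes(content, "#contenttitle, #commentblock"))
testpost <- html_text(content)
```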

Difference between GET(), read_html(), getURL() in R

I'm attempting to scrape realtor.com for a school project. I have a working solution that combines the rvest and httr packages, but I want to migrate it to the RCurl package, specifically using the getURLAsynchronous() function. I know my algorithm will scrape much faster if I can download multiple URLs at once. Here's what I have so far:
library(RCurl)
library(rvest)
library(httr)

urls <- c("http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-1?pgsz=50",
          "http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-2?pgsz=50")

prop.info <- vector("list", length = 0)
for (j in 1:length(urls)) {
  prop.info <- c(prop.info, urls[[j]] %>%        # recursively builds the list using each URL
    GET(add_headers("user-agent" = "r")) %>%
    read_html() %>%                              # creates the html object
    html_nodes(".srp-item-body") %>%             # grabs the appropriate html element
    html_text())                                 # converts it to a text vector
}
This gets me output that I can readily work with: I request each page with GET(), read the HTML from the output, find the nodes, and convert them to text. The trouble comes when I attempt to implement something similar using RCurl.
Here is what I have for that using the same URLs:
getURLAsynchronous(urls) %>%
  read_html() %>%
  html_node(".srp-item-details") %>%
  html_text()
When I call getURLAsynchronous() on the urls vector, not all of the information is downloaded. I'm honestly not sure exactly what is being scraped, but I know it's considerably different than my current solution.
Any ideas what I'm doing wrong? Or maybe an explanation on how getURLAsynchronous() should be working?
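One likely culprit: getURLAsynchronous() returns a character vector of page sources, one element per URL, while read_html() parses a single document, so each element has to be parsed separately. Note too that this RCurl call sends no user-agent header, unlike the GET(add_headers(...)) version, which may itself change what the server returns. A sketch under those assumptions, reusing the urls vector defined above:

```r
library(RCurl)
library(rvest)

pages <- getURLAsynchronous(urls)   # one page source per URL

# Parse each downloaded source separately, then flatten the results.
prop.info <- unlist(lapply(pages, function(p) {
  read_html(p) %>%
    html_nodes(".srp-item-body") %>%
    html_text()
}))
```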
