R: rvest - scraping a share counter with XPath

I am trying to scrape data using rvest. I cannot scrape the number/text from the share counter on this page, which shows "753 udostępnienia" (Polish for "753 shares").
I used the Google Chrome plugin XPath Helper to find the XPath, and prepared some simple R code:
library(rvest)

url2 <- "https://www.siepomaga.pl/kacper-szlyk"
share_url <- html(url2)
share_url

share <- share_url %>%
  html_node(xpath = "/html[#class='turbolinks-progress-bar']/body/div[#id='page']/div[#class='causes-show']/div[#class='ui container']/div[#id='column-container']/div[#id='right-column']/div[#class='ui sticky']/div[#class='box with-padding']/div[#class='bordered-box share-box']/div[#class='content']/div[#class='ui grid two columns']/div[#class='share-counter']") %>%
  html_text()
share
However, the result is NA.
Where did I go wrong?

Two quick notes on the original code: html() is deprecated in current versions of rvest (read_html() is its replacement), and the XPath itself is likely the problem, since attribute predicates are written [@class='...'], not [#class='...']. That said, I came up with a simpler solution using rvest without the xpath argument at all. It also uses the pipe operator (%>%, from the magrittr package, re-exported by dplyr) to simplify things:
library(tidyverse) # contains dplyr, which re-exports the pipe
library(rvest)

siep_url <- "https://www.siepomaga.pl/kacper-szlyk"
counter <- siep_url %>%
  read_html() %>%
  html_node(".share-counter") %>% # selector found with https://selectorgadget.com/, a useful tool
  html_text()
The output for this comes up like so:
[1] "\n\n755\nudostępnień\n"
You can clean that up with trimws() and gsub(), without hard-coding the number:
counter <- gsub("\\s+", " ", trimws(counter))
This returns "755 udostępnień" as a character string. Hope this helps.
Disclaimer: Rather large language barrier, but translate.google.com did wonders.
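If only the number itself is needed, a minimal follow-up sketch (stripping every non-digit character from the cleaned string):
share_count <- as.numeric(gsub("\\D", "", counter)) # remove everything that is not a digit
share_count
# [1] 755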

Related

Confusion Regarding HTML Code For Web Scraping With R

I am struggling with the rvest package in R, most likely due to my lack of knowledge about CSS and HTML. Here is an example (my guess is that ".quote-header-info" is what's wrong; I also tried ".Trsdu ..." but had no luck either):
library(rvest)

url <- "https://finance.yahoo.com/quote/SPY"
website <- read_html(url) %>%
  html_nodes(".quote-header-info") %>%
  html_text() %>%
  toString()
website
I am trying to scrape the quote page, specifically to grab the value "416.74" shown in the quote header. I took a peek at the documentation (https://cran.r-project.org/web/packages/rvest/rvest.pdf), but I think the issue is that I don't understand the breakdown of the webpage I am looking at.
The tricky part is determining a set of attributes that selects only this one html node.
In this case it is the span tag with the classes Trsdu(0.3s) and Fz(36px):
library(rvest)

url <- "https://finance.yahoo.com/quote/SPY"
# read the page once
page <- read_html(url)

# now extract information from the page
price <- page %>%
  html_nodes("span.Trsdu\\(0\\.3s\\).Fz\\(36px\\)") %>%
  html_text()
price
Note: "(", ")", and "." are all special characters in CSS selectors, hence the need to escape them with a double backslash "\\".
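An alternative that sidesteps the escaping entirely is an XPath contains() test on one of the classes. A sketch, assuming Fz(36px) appears only on this span on the page:
price <- page %>%
  html_nodes(xpath = "//span[contains(@class, 'Fz(36px)')]") %>% # no CSS escaping needed in XPath
  html_text()
price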
Those classes are dynamic and change much more frequently than other parts of the html, so they should be avoided. You have at least two more robust options:
1. Extract the JavaScript object housing that data (plus a lot more) from a script tag, then parse it with jsonlite.
2. Use positional matching against other, more stable, html elements.
I show both below. The advantage of the first is that you can extract lots of other page data from the generated JSON object.
library(magrittr)
library(rvest)
library(stringr)
library(jsonlite)

page <- read_html('https://finance.yahoo.com/quote/SPY')

# Option 1: pull the JSON assigned to root.App.main out of the page source
data <- page %>%
  toString() %>%
  stringr::str_match('root\\.App\\.main = (.*?[\\s\\S]+)(?=;[\\s\\S]+\\(th)') %>%
  .[2]
json <- jsonlite::parse_json(data)
print(json$context$dispatcher$stores$StreamDataStore$quoteData$SPY$regularMarketPrice$raw)

# Option 2: positional matching against more stable elements
print(page %>%
  html_node('#quote-header-info div:nth-of-type(2) ~ div div:nth-child(1) span') %>%
  html_text() %>%
  as.numeric())
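The parsed json object is deeply nested, so before digging for other fields it helps to orient yourself with str() at a capped depth (a small sketch, reusing the path from the code above):
# list the top-level stores without printing the whole object
str(json$context$dispatcher$stores, max.level = 1)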

Rvest and xpath return misleading information

I am struggling with some scraping issues, using rvest and xpath.
The objective is to scrape the following page
https://www.barchart.com/futures/quotes/BT*0/futures-prices
and to extract the names of the futures
BTF21
BTG21
BTH21
etc., for the full list of names.
The XPath for those elements seems to be xpath='//a'.
The following code, however, returns nothing of relevance, hence my query:
library(rvest)

url <- 'https://www.barchart.com/futures/quotes/BT*0'
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//a')
value <- valuation_col %>% html_text()
Any hint on how to proceed to get this information would be much appreciated. Thanks in advance!
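One thing worth checking before going further: pages like this often build their quote tables with JavaScript after the initial load, in which case the static HTML that read_html() sees contains none of the contract links. A small diagnostic sketch, reusing the objects from the code above:
# how many anchors did the static HTML actually contain, and where do they point?
length(valuation_col)
head(html_attr(valuation_col, "href"))
# if no href contains a contract code such as BTF21, the names are injected
# client-side, and a browser-driven tool (e.g. RSelenium) or the site's
# underlying data API would be needed instead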

Difference between GET(), read_html(), getURL() in R

I'm attempting to scrape realtor.com for a school project. I have a working solution that uses a combination of the rvest and httr packages, but I want to migrate it to the RCurl package, specifically using the getURLAsynchronous() function. I know my algorithm will scrape much faster if it can download multiple URLs at once. Here's what I have so far:
library(RCurl)
library(rvest)
library(httr)

urls <- c("http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-1?pgsz=50",
          "http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-2?pgsz=50")

prop.info <- vector("list", length = 0)
for (j in 1:length(urls)) {
  prop.info <- c(prop.info, urls[[j]] %>% # iteratively grows the list with each url
    GET(add_headers("user-agent" = "r")) %>%
    read_html() %>%                       # creates the html object
    html_nodes(".srp-item-body") %>%      # grabs the appropriate html elements
    html_text())                          # converts them to a text vector
}
This gets me output that I can readily work with: GET() fetches each webpage, read_html() parses the response, html_nodes() grabs the appropriate elements, and html_text() converts them to text. The trouble starts when I attempt to implement something similar using RCurl.
Here is what I have for that using the same URLs:
getURLAsynchronous(urls) %>%
  read_html() %>%
  html_node(".srp-item-details") %>%
  html_text()
When I call getURLAsynchronous() on the urls vector, not all of the information is downloaded. I'm honestly not sure exactly what is being scraped, but I know it's considerably different than my current solution.
Any ideas what I'm doing wrong? Or maybe an explanation of how getURLAsynchronous() should work?
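For what it's worth, one likely explanation: getURLAsynchronous() returns a character vector with one page source per URL, while read_html() expects a single document rather than a vector. A hedged sketch of parsing each downloaded page separately, under that assumption:
pages <- getURLAsynchronous(urls) # assumed: one raw HTML string per url

prop.info <- unlist(lapply(pages, function(src) {
  read_html(src) %>%              # parse each page source individually
    html_nodes(".srp-item-body") %>%
    html_text()
}))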

Web Scraping using Rvest on a Tennis table from Wiki

Here I am, a total beginner in R, trying to learn more about rvest and how to scrape from the web. Here is the wiki page (https://en.wikipedia.org/wiki/Andy_Murray); I want to transfer one of its tables into R.
Using a CSS selector tool, I found that the table matches ".wikitable". Following some tutorials on other webpages, here is the code that I used:
library(rvest)
tennis <- read_html("https://en.wikipedia.org/wiki/Andy_Murray")
trial <- tennis %>% html_nodes(".wikitable") %>% html_table(fill = T)
trial
I could not isolate the result to the table that I wanted. Can someone please teach me how? Another thing: what does the pipe (%>%) do?
You were almost there. What you extracted was a list. To get to your desired element you need to use indexing:
trial[[2]]
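If you are unsure which list element holds the table you want, a quick way to inspect the candidates (a small sketch):
length(trial)      # how many tables html_table() returned
lapply(trial, dim) # the dimensions of each, to spot the right one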
To clean it further, use:
df <- trial[[2]]
df <- df[-1,]       # drop the first row
df[,17:20] <- NULL  # drop columns 17 to 20
%>% is the pipe operator, from the magrittr package (re-exported by dplyr). It passes the result of its left-hand side as the first argument to the function on its right-hand side.
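A two-line illustration of that equivalence, using the objects above:
# these two lines are equivalent:
html_nodes(tennis, ".wikitable")
tennis %>% html_nodes(".wikitable")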

Applying rvest pipes to a dataframe

I've got a dataframe called base_table with a lot of 311 data, including URLs that point to a broader description of each call.
I'm trying to create a new variable called case_desc by applying a series of rvest functions to each URL:
base_table$case_desc <- read_html(base_table$case_url) %>%
  html_nodes("rc_descrlong") %>%
  html_text()
But this doesn't work, for reasons that are probably obvious but that I can't articulate right now. I've tried playing around with functions but can't seem to nail the right format.
Any help would be awesome! Thank you!
It doesn't work because read_html doesn't work with a vector of URLs. It will throw an error if you give it a vector...
> read_html(c("http://www.google.com", "http://www.yahoo.com"))
Error: expecting a single value
You probably have to use an apply function...
library("rvest")
base_table$case_desc <- sapply(base_table$case_url, function(x)
read_html(x) %>%
html_nodes("rc_descrlong") %>%
html_text())
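One hedged refinement: if any page yields zero or multiple matching nodes, sapply() will return a list rather than a character vector. Collapsing the matches per page keeps the result rectangular (selector reused from the question):
base_table$case_desc <- vapply(base_table$case_url, function(x) {
  read_html(x) %>%
    html_nodes("rc_descrlong") %>%
    html_text() %>%
    paste(collapse = " ") # "" when nothing matches, one string otherwise
}, character(1), USE.NAMES = FALSE)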
