Scraping text from webpage using 'rvest' and SelectorGadget - r

I am trying to get a piece of text from a webpage. To simplify my question, let me use @RonakShah's Stack Overflow account as an example and extract the reputation value. With SelectorGadget showing "div, div", I used the following code:
library(rvest)
so <- read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div") %>% html_nodes("div") %>% html_text()
This gave an object so with as many as 307 items.
Then, I turned the object into a dataframe:
so <- as.data.frame(so)
View(so)
Then I manually went through all items in the data frame until I found the correct value, so$so[69]. My question is how to quickly find the specific target value. In my real case, doing this manually is more complicated, as there are multiple items with the same value and I need to identify the correct one by its position. Thanks.

You need to find a specific tag, together with its class, that sits closer to your target. You can find it using SelectorGadget.
library(rvest)
read_html('https://stackoverflow.com/users/3962914/ronak-shah') %>%
html_nodes("div.grid--cell.fs-title") %>%
html_text()
#[1] "254,328"
As far as scraping StackOverflow is concerned it has an API to get the information about users/question/answers. In R, there is a wrapper package around it called stackr (not on CRAN) which makes it very easy.
library(stackr)
data <- stack_users(3962914)
data$reputation
#[1] 254328
data also holds a lot of other information about the user.
3962914 is the ID of the user you are interested in; it can be found in their profile link (https://stackoverflow.com/users/3962914/ronak-shah).
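As a minimal sketch of the difference between a broad and a narrow selector, the snippet below parses a small made-up HTML fragment rather than the live page (the markup and values are assumptions; only the class names are borrowed from the selector above). When several nodes still match the narrow selector, indexing the result by position picks the one you need.

```r
library(rvest)

# Hypothetical miniature of a profile page; the markup is made up for
# illustration and only the class names mirror the selector used above.
page <- read_html('<html><body>
  <div class="grid">
    <div class="grid--cell fs-title">254,328</div>
    <div class="grid--cell">reputation</div>
    <div class="grid--cell fs-title">1,024</div>
  </div>
</body></html>')

# A broad selector returns every div, including the wrapper.
all_divs <- page %>% html_nodes("div") %>% html_text(trim = TRUE)

# A tag-plus-class selector returns only the targeted cells.
cells <- page %>% html_nodes("div.grid--cell.fs-title") %>% html_text()

# Several nodes share the selector, so pick by position, then strip the
# thousands separator and convert to a number.
reputation <- as.numeric(gsub(",", "", cells[1]))
print(reputation)
```

Narrowing the selector first keeps the positional index small and stable, which is easier to maintain than an index into hundreds of generic div nodes.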

Related

Parse table data into R but it's blank, javascript?

My first post and a beginner with R so patience requested if I should have found an answer to my question elsewhere.
I'm trying to cobble together a table with data pulled from multiple sites from CME (https://www.cmegroup.com/trading/energy/crude-oil/western-canadian-select-wcs-crude-oil-futures.html is one).
I've tried using rvest but get a blank table.
I think this is because of the Javascript that is being used to populate the table in real time? I've fumbled my way around this site to look for similar problems and haven't quite figured out how best to pull this data. Any help is much appreciated.
library(rvest)
library(dplyr)
WCS_page <- "https://www.cmegroup.com/trading/energy/crude-oil/canadian-heavy-crude-oil-net-energy-index-futures_quotes_globex.html"
WCS_diff <- read_html(WCS_page)
month <- WCS_diff %>%
rvest::html_nodes('th') %>%
xml2::xml_find_all("//scope[contains(@col, 'Month')]") %>%
rvest::html_text()
price <- WCS_diff %>%
rvest::html_nodes('tr') %>%
xml2::xml_find_all("//td[contains(@class, 'quotesFuturesProductTable1_CLK0_last')]") %>%
rvest::html_text()
WTI_df <- data.frame(month, price)
knitr::kable(
WTI_df %>% head (10))
Yes, the page is using JS to load the data.
The easy way to check is to view source and then search for some of the text you saw in the table. For example the word "May" never shows up in the raw HTML, so it must have been loaded later.
The next step is to use something like the Chrome DevTools to inspect the network requests that were made. In this case there is a clear winner, and your structured data is coming down from here:
https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/6038/G
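Once the endpoint is known, the data can usually be read as JSON rather than scraped from HTML. The sketch below parses a made-up response body; the field names quotes, expirationMonth and last, and the values, are assumptions for illustration, so check the real payload in the Network tab first.

```r
library(jsonlite)

# Hypothetical response body: the real endpoint's field names may differ,
# so inspect the actual JSON in DevTools before relying on these names.
body <- '{"quotes":[
  {"expirationMonth":"MAY 2019","last":"-12.50"},
  {"expirationMonth":"JUN 2019","last":"-13.10"}]}'

# fromJSON() accepts a JSON string, a file path, or a URL.
parsed <- fromJSON(body)

# An array of flat objects is simplified to a data frame by default.
quotes <- parsed$quotes
quotes$last <- as.numeric(quotes$last)
print(quotes)
```

Passing the endpoint URL to fromJSON() in place of body would be the natural next step, provided the server answers plain requests without extra headers.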

Rvest is unable to find the node specified by css selector, how do I fix it?

I am scraping data from this website and for some reason, I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with Rvest.
I managed to scrape the seller's name with RSelenium but that takes too much time. Anyway, here's the link of the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
html_nodes(".link-4200870613") %>%
html_text()
You can easily extract the seller name with a regex, because it is contained in a script tag in the response (presumably the page is populated from there when a browser runs JavaScript, which rvest does not do).
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p,'"sellerName":"(.*?)"')[[1]][,2][1]
print(seller_name)
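The pattern matches the literal text "sellerName":" and then lazily captures everything up to the next double quote. A dependency-free sketch of the same idea in base R, run on a made-up fragment of page text (the sample string and the name in it are hypothetical):

```r
# Hypothetical fragment of the page text; the real script content is longer.
p <- 'window.__data={"adId":1,"sellerName":"Jane Doe","location":"Bedford"}'

# regexec() returns the full match followed by each capture group.
# (.*?) is lazy, so it stops at the first closing quote.
m <- regmatches(p, regexec('"sellerName":"(.*?)"', p, perl = TRUE))

# Element 1 is the whole match, element 2 is the captured name.
seller_name <- m[[1]][2]
print(seller_name)
```

A greedy (.*) would run on to the last double quote in the text, which is why the lazy form matters here.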

Xpath found with Elements but not readable/scrapeable via rvest

I am trying to scrape the amounts listed on a set of donation websites. In this example, I would like to get
$3, $10, $25, $100, $250, $1500, $2800
The xpath indicates that one of them should be
/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/
form/div/div[1]/div/div/ul/li[2]/label
and the css selector
li.btn--wrapper:nth-child(2) > label:nth-child(1)
Up to the following, I see something in the xml_nodeset:
library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>% html_nodes(
xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div'
)
Then I add the second part of the xpath and the result comes up blank. The same happens with
X %>% html_nodes("li")
which gives a bunch of things, but all the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT turn blank.
I have worked with rvest for a fair bit now, but this one's baffling. And I am not quite sure how RSelenium will help here, although I have knowledge on how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the wayback machine---there's only the background and nothing else.
I have even tried just taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is white and in a rather nonstandard font. Yes, I've tried image_negate and image_resize to see if they helped, but that only showed that relying on OCR is a rather bad idea, as the result depends on screenshot size.
Any advice on how to best extract what I want in this situation? Thanks.
You can use a regex to extract the numbers from a script tag. The result is a character vector holding the comma-separated amounts:
library(rvest)
library(stringr)
con <- url('https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website', "rb")
page = read_html(con)
res <- page %>%
html_nodes(xpath=".//script[contains(., 'preloadedState')]")%>%
html_text() %>% as.character %>%
str_match_all(.,'(?<="amounts":\\[)(\\d+,?)+')
print(res[[1]][,1])
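To turn that comma-separated string into numbers, split it and convert. A base R sketch on a made-up excerpt of the script text (the sample values are assumptions, not taken from the live page):

```r
# Hypothetical excerpt of the script text; the real preloaded state is larger.
script_text <- 'window.preloadedState={"amounts":[3,10,25,100,250,1500,2800],"other":[1,2]}'

# The fixed-length lookbehind keeps only what follows "amounts":[ and the
# character class stops at the closing bracket.
hit <- regmatches(
  script_text,
  regexpr('(?<="amounts":\\[)[0-9,]+', script_text, perl = TRUE)
)

# Split the comma-separated string and convert the pieces to numbers.
amounts <- as.numeric(strsplit(hit, ",", fixed = TRUE)[[1]])
print(amounts)
```

If the script tag holds valid JSON, parsing it with a JSON library would be sturdier than a regex, but that depends on how the page embeds the state.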

How can I scrape this recipe?

I am trying to webscrape some recipes for my own personal collection. It works great on some sites because the website structure sometimes easily allows for scraping, but some are harder. This one I have no idea how to deal with:
https://www.koket.se/halloumigryta-med-tomat-linser-och-chili
For the moment, let's just assume I want the ingredients on the left. If I inspect the website it looks like what I want are the two article class="ingredients" chunks. But I can't seem to get there.
I start with the following:
library(rvest)
library(tidyverse)
read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
html_nodes(".recipe-column-wrapper") %>%
html_nodes(xpath = '//*[@id="react-recipe-page"]')
However, running the above code shows that all of the ingredients are stored in data-item like so:
<div id="react-recipe-page" data-item="{
"chefNames":"<a href='/kockar/siri-barje'>Siri Barje</a>",
"groupedIngredients":[{
"header":"Kokosris",
"ingredients":[{
"name":"basmatiris","unit":"dl","amount":"3","amount_info":{"from":3},"main":false,"ingredient":true
}
<<<and so on>>>
So I am a little bit puzzled, because from inspecting the website everything seems to be neatly placed in things I can extract, but now it's not. Instead, I'd need some serious regular expressions in order to get everything like I want it.
So my question is: am I missing something? Is there some way I can get the contents of the ingredients articles?
(I tried SelectorGadget, but it just gave me No valid path found).
You can extract attributes by using html_attr("data-item") from the rvest package.
Furthermore, the data-item attribute appears to hold JSON, which you can convert to a list using fromJSON() from the jsonlite package:
library(rvest)
library(jsonlite)
html <- read_html("https://www.koket.se/halloumigryta-med-tomat-linser-och-chili") %>%
html_nodes(".recipe-column-wrapper") %>%
html_nodes(xpath = '//*[@id="react-recipe-page"]')
# Read the JSON string stored in the attribute and parse it
recipe <- html %>% html_attr("data-item") %>%
fromJSON()
Lastly, the recipe list contains many values that are not relevant here, but the ingredients and measurements are there as well, in the element recipe$ingredients.
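A self-contained sketch of the same approach on an inline fragment, assuming the JSON follows the groupedIngredients layout shown in the question (the second ingredient is invented for illustration, and the live page may be structured differently). It flattens the nested list into one data frame:

```r
library(rvest)
library(jsonlite)

# Hypothetical fragment: the attribute value is single-quoted so that the
# JSON inside it can keep its double quotes.
html <- paste0(
  '<div id="react-recipe-page" data-item=\'',
  '{"groupedIngredients":[{"header":"Kokosris","ingredients":[',
  '{"name":"basmatiris","unit":"dl","amount":"3"},',
  '{"name":"kokosmjolk","unit":"ml","amount":"400"}]}]}',
  '\'></div>'
)

# Pull the attribute and parse it, keeping plain nested lists.
recipe <- read_html(html) %>%
  html_nodes(xpath = '//*[@id="react-recipe-page"]') %>%
  html_attr("data-item") %>%
  fromJSON(simplifyVector = FALSE)

# Build one row per ingredient, tagged with the header of its group.
ingredients <- do.call(rbind, lapply(recipe$groupedIngredients, function(g) {
  data.frame(
    group  = g$header,
    name   = vapply(g$ingredients, function(i) i$name, character(1)),
    amount = vapply(g$ingredients, function(i) i$amount, character(1)),
    unit   = vapply(g$ingredients, function(i) i$unit, character(1)),
    stringsAsFactors = FALSE
  )
}))
print(ingredients)
```

Using simplifyVector = FALSE keeps the structure predictable when ingredient groups differ in size or fields.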

How to write a loop to extract articles from website archive which links to numerous external sources?

I am trying to extract articles for a period of 200 days from the Time dot mk archive, e.g. http://www.time.mk/week/2016/22. Each day has its top 10 headings, each of which links to all articles related to it (via a link at the bottom of each heading, e.g. "320 поврзани вести", meaning "320 related news items"). Following this link leads to a list of all related articles.
This is what I've managed so far:
library(rvest)
url = ("http://www.time.mk/week/2016/22")
frontpage = read_html(url) %>%
html_nodes(".other_articles") %>%
html_attr("href") %>%
paste0()
mark = "http://www.time.mk/"
frontpagelinks = paste0(mark, frontpage)
At this point I have the primary links that lead to the related news.
The following extracts all of the links to related news for the first heading, from where I clear my data for only those links that I need.
final = list()
final = read_html(frontpagelinks[1]) %>%
html_nodes("h1 a") %>%
html_attr("href")%>%
paste0()
My question is how I could instruct R, whether via loop or some other option so as to extract links from all 10 headings from "frontpagelinks" at once - I tried a variety of options but nothing really worked.
Thanks!
EDIT
Parfait's response worked like a charm! Thank you so much.
I've run into an inexplicable issue however after using that solution.
Whereas before, when I was going link by link, I could easily sort out the data for only those portals that I need via:
a1onJune = str_extract_all(dataframe, ".a1on.")
Which provided me with a clean output: [130] "a1on dot mk/wordpress/archives/618719"
with only the links I needed, now if I try to run the same code on the larger df of all links I inexplicably get many variants of this:
"\"alsat dot mk/News/255645\", \"a1on dot mk/wordpress/archives/620944\", , \"http://www dot libertas dot mk/sdsm-poradi-kriminalot-na-vmro-dpmne-makedonija-stana-slepo-tsrevo-na-balkanot/\",
As you can see, it returns my desired link (the a1on one), but also many others (I've edited out most for clarity's sake) that occur both before and after it.
Given that I'm using the same expression, I don't see why this would be happening.
Any thoughts?
Simply run lapply to return a list of links, with one element for each entry of frontpagelinks:
linksList <- lapply(frontpagelinks, function(i) {
read_html(i) %>%
html_nodes("h1 a") %>%
html_attr("href")
})
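Regarding the edit: a likely cause, though it cannot be confirmed from the post alone, is that the pattern is applied to a data frame or list, which gets coerced to one long string per column before matching. Flattening the list into a plain character vector and filtering that avoids the problem. A sketch with made-up links standing in for the scraped ones:

```r
# Hypothetical stand-in for the lapply() result: one character vector of
# links per heading. The URLs are invented for illustration.
linksList <- list(
  c("http://alsat.example/News/255645",
    "http://a1on.example/wordpress/archives/620944"),
  c("http://a1on.example/wordpress/archives/618719",
    "http://libertas.example/some-article/")
)

# Flatten the list into a single character vector, one link per element.
allLinks <- unlist(linksList)

# Keep only the elements that contain the portal name.
a1onLinks <- grep("a1on", allLinks, value = TRUE, fixed = TRUE)
print(a1onLinks)
```

Because each link is its own element, the filter returns whole links for the chosen portal and nothing from neighbouring entries.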
