Cleaning HTML code in R: how to clean this list? - r

I know that this question has been asked here tons of times, but after reading a bunch of topics I'm still stuck on this :( . I have a list of scraped HTML nodes like this
http://bit.d o/bnRinN9
and I just want to clean out all the code parts. Unfortunately I'm a newbie and the only thing that comes to mind is the Cthulhu way (regex, argh!). How can I do this?
*I put a space between "d" and "o" in the domain name because SO doesn't allow posting that link

This uses the data linked in "Why R can't scrape these links?", which was downloaded locally.
library(rvest)
library(stringr)
# read the saved htm page and make one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")
# the links are within a table, within spans. There isn't much structure
# and no identifiers, so it needs a little hacking to get the right elements
# There probably are smarter css selectors that could avoid the hacks
spans <- read_html(text) %>% xml_nodes(css = "table tbody tr td span")
# extract all the short links -- but remove the links to edit
# note these links have a trailing dash - links to the statistics
# not the content
short_links <- spans %>% xml_nodes("a") %>% xml_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]
# the real urls are in the html text, prefixed with http
span_text <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]
# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"

The rvest package includes many simple functions for scraping and processing HTML. It depends on the xml2 package. Generally you can scrape and filter in one step.
It's not clear whether you want to extract the href value or the HTML text, which are the same in your example. This code extracts the href value by finding the a nodes and then reading the href attribute. Alternatively, you can use html_text to get the link display text.
library(rvest)
links <- list('
<a href="http://anydomain.com/bnRinN9">http://anydomain.com/bnRinN9</a>
<a href="domain.com/page">
')
# make one string
text <- paste0(links, collapse = "\n")
hrefs <- read_html(text) %>% xml_nodes("a") %>% xml_attr("href")
hrefs
## [1] "http://anydomain.com/bnRinN9" "domain.com/page"

Related

Extracting a specific link from an element based on a previous text element

I want to extract all available links and dates of the available documents ("Referentenentwurf", "Kabinett", "Bundesrat" and "Inkrafttreten") for each legislative process (each of the gray boxes) from the page. My data set should have the following structure:
Each legislative process is represented by one row, and the information about the related documents is in the columns.
Here is the HTML structure of the seventh legislative process:
This is one example of the HTML structure of the elements containing the legislative processes.
Extracting the dates of each document per legislative process is not a problem (simply done by checking whether the "text()" element includes e.g. "Kabinett").
But extracting the right URL is much more difficult, because the "text()" elements (indicating the document type) are not directly linked to the "<a>" elements (containing the URL).
I'm trying to find a solution for the seventh legislative process ("Zwanzigste Verordnung zur Änderung von Anlagen des Betäubungsmittelgesetzes") in order to apply this solution to every legislative process.
This is my current work status:
if(!require("rvest")) install.packages("rvest")
library(rvest) #for html_attr & read_html
if(!require("dplyr")) install.packages("dplyr")
library(dplyr) # for %>%
if(!require("stringr")) install.packages("stringr")
library(stringr) # for str_detect()
if(!require("magrittr")) install.packages("magrittr")
library(magrittr) # for extract() [within pipes]
page <- read_html("https://www.bundesgesundheitsministerium.de/service/gesetze-und-verordnungen.html")
#Gesetz.Link -> here "Inkrafttreten"
#Gesetz.Link <- lapply(1:72, function(x){
x <- 7 # for demonstration reasons
node.with.data <- html_nodes(page, css = paste0("#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div:nth-child(",x*2,") > div > div > div.panel-body > p")) %>%
extract(
str_detect(html_text(html_nodes(page, css = paste0("#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div:nth-child(",x*2,") > div > div > div.panel-body > p"))),
"Inkrafttreten")
)
link <- node.with.data %>%
html_children() %>%
extract(
str_detect(html_text(html_nodes(node.with.data, xpath = paste0("text()"))),
"Inkrafttreten")
) %>%
html_attr("href")
ifelse(length(node.with.data)==0, NA, link) # set link to NA if there is no link to the document
#}) %>%
# unlist()
(I have commented out the application to the entire website so that the solution can be demonstrated on the seventh element.)
The problem is that there can be several URLs linked to each document (here "Download" & "Stellungnahmen" are linked to "Referentenentwurf"). This leads to an error in my syntax.
Is there any way to extract the nth element after another element? Then there could be a check whether the "text()" element is "Referentenentwurf", followed by extracting the first element behind it
-> "<a href="/fileadmin/Dateien/3_Downloads/Gesetze_und_Verordnungen/GuV/B/2020-03-04_RefE_20-BtMAEndV.pdf" ...>".
I would be very grateful for tips on how to solve this problem!
I took the liberty of changing a few things in your code to try to get you where you want:
My stab at this is to go into the list of Verordnungen/Gesetze/etc., find the div.panel-body > p as you do, and within that the first link that refers to a downloadable document, by searching for an href containing "/fileadmin/Dateien" using XPath.
Looks like this:
library(purrr)
library(xml2)
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
map(~{
.x %>%
xml_find_first('./div/div/div[contains(@class,"panel-body")]/p//a[contains(@href,"/fileadmin/Dateien")]') %>%
xml_attr('href')
})
//update:
If the above assumption doesn't work for you and you really just want to check for "the first a tag after 'Referentenentwurf' in the p element", the following gets you that. However, I couldn't make it as "elegant" and just used a regex :)
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
map(~{
.x %>%
xml_find_first('./div/div/div[contains(@class,"panel-body")]/p') %>%
as.character() %>%
str_extract_all('(?<=Referentenentwurf.{0,10000})(?<=<a href=")[^"]*(?=")') %>%
unlist() %>%
first()
})
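For completeness, if you would rather stay in XPath than fall back to a regex, something along these lines might also work (a sketch, assuming the "Referentenentwurf" text node and its download link sit as siblings inside the same p element, as in the posted HTML):
html_nodes(page, css = '#skiplink2maincontent > div.col-xs-12.col-sm-10.col-sm-offset-1.col-md-8.col-md-offset-2 > div') %>%
  map(~{
    .x %>%
      # find the text() node containing the label, then the first <a> sibling after it
      xml_find_first('./div/div/div[contains(@class,"panel-body")]/p/text()[contains(., "Referentenentwurf")]/following-sibling::a[1]') %>%
      xml_attr('href')
  })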

webscraping loop over url and html node in R rvest

I have a dataframe pubs with two columns: url, html.node. I want to write a loop that reads each URL, retrieves the HTML contents, extracts the information indicated by the html.node column, and accumulates it in a data frame or list.
All URLs are different, and all HTML nodes are different. My code so far is:
score <- vector()
k <- 1
for (r in 1:nrow(pubs)){
art.url <- pubs[r, 1] # column 1 contains URL
art.node <- pubs[r, 2] # column 2 contains html nodes as characters
art.contents <- read_html(art.url)
score <- art.contents %>% html_nodes(art.node) %>% html_text()
k<-k+1
print(score)
}
I appreciate your help.
First of all, be sure that each site you're going to scrape allows you to scrape its data; you can incur legal issues if you break the rules.
(Note: I've used only http://toscrape.com/, a sandbox site for scraping, since you did not provide your data.)
After that, you can proceed with this; hope it helps:
# first, your data; I assume it looks similar to this
pubs <- data.frame(site = c("http://quotes.toscrape.com/",
"http://quotes.toscrape.com/"),
html.node = c(".text",".author"), stringsAsFactors = F)
Then the loop you required:
library(rvest)
# an empty list, to fill with the scraped data
empty_list <- list()
# here you are going to fill the list with the scraped data
for (i in 1:nrow(pubs)){
art.url <- pubs[i, 1] # choose the site as you did
art.node <- pubs[i, 2] # choose the node as you did
# scrape it!
empty_list[[i]] <- read_html(art.url) %>% html_nodes(art.node) %>% html_text()
}
Now the result is a list, but with:
names(empty_list) <- pubs$site
you add the name of the site to each element of the list, with the result:
$`http://quotes.toscrape.com/`
[1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
[2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
[3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
[4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
[5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
[6] "“Try not to become a man of success. Rather become a man of value.”"
[7] "“It is better to be hated for what you are than to be loved for what you are not.”"
[8] "“I have not failed. I've just found 10,000 ways that won't work.”"
[9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
[10] "“A day without sunshine is like, you know, night.”"
$`http://quotes.toscrape.com/`
[1] "Albert Einstein" "J.K. Rowling" "Albert Einstein" "Jane Austen" "Marilyn Monroe" "Albert Einstein" "André Gide"
[8] "Thomas A. Edison" "Eleanor Roosevelt" "Steve Martin"
Clearly it should work with different sites, and different nodes.
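If you'd rather accumulate the results in a data frame, as the question mentions, a minimal base-R sketch on the filled list is:
df <- data.frame(site  = rep(pubs$site, lengths(empty_list)),
                 node  = rep(pubs$html.node, lengths(empty_list)),
                 value = unname(unlist(empty_list)),
                 stringsAsFactors = FALSE)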
You could also use map from the purrr package instead of a loop:
expand.grid(c("http://quotes.toscrape.com/", "http://quotes.toscrape.com/tag/inspirational/"), # vector of urls
c(".text",".author"), # vector of nodes
stringsAsFactors = FALSE) %>% # assuming that the same nodes are relevant for all urls, otherwise you would have to do something like join
as_tibble() %>%
set_names(c("url", "node")) %>%
mutate(out = map2(url, node, ~ read_html(.x) %>% html_nodes(.y) %>% html_text())) %>%
unnest()
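If your pubs data frame already pairs each URL with the node you want (as in the question), a sketch that skips expand.grid and maps over the existing columns directly (assuming the site and html.node column names from the example pubs above, and the same packages loaded) would be:
pubs %>%
  as_tibble() %>%
  set_names(c("url", "node")) %>%
  mutate(out = map2(url, node, ~ read_html(.x) %>% html_nodes(.y) %>% html_text())) %>%
  unnest()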

downloading zip files using R

I am trying to download a bunch of zip files from the website
https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml
Any suggestions? I have tried using rvest to identify the href, but have not had any luck.
We can avoid the platform-specific issues of download.file() by handling the downloads with httr.
First, we'll read in the page:
library(xml2)
library(httr)
library(rvest)
library(tidyverse)
pg <- read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml")
Now, we'll target all the .zip file links. They're relative paths (e.g. Zip) so we'll prepend the URL prefix to them as well:
html_nodes(pg, xpath=".//a[contains(@href, '.zip')]") %>% # this href gets _all_ of them
html_attr("href") %>%
sprintf("https://mesonet.agron.iastate.edu%s", .) -> zip_urls
Here's a sample of what ^^ looks like:
head(zip_urls)
## [1] "https://mesonet.agron.iastate.edu/data/gis/shape/4326/us/current_ww.zip"
## [2] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip"
## [3] "https://mesonet.agron.iastate.edu/pickup/wwa/1986_tsmf.zip"
## [4] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_all.zip"
## [5] "https://mesonet.agron.iastate.edu/pickup/wwa/1987_tsmf.zip"
## [6] "https://mesonet.agron.iastate.edu/pickup/wwa/1988_all.zip"
There are 84 of them:
length(zip_urls)
## [1] 84
So we'll make sure to include a Sys.sleep(5) in our download walker so we aren't hammering their servers since our needs are not more important than the site's.
Make a place to store things:
dir.create("mesonet-dl")
This could also be done with a for loop but using purrr::walk makes it fairly explicit we're generating side effects (i.e. downloading to disk and not modifying anything in the R environment):
walk(zip_urls, ~{
message("Downloading: ", .x) # keep us informed
# this is way better than download.file(). Read the httr man page on write_disk
httr::GET(
url = .x,
httr::write_disk(file.path("mesonet-dl", basename(.x)))
)
Sys.sleep(5) # be kind
})
We use file.path() to construct the save-file location in a platform-agnostic way, and basename() to extract the filename portion instead of regex hacking, since it's a C-backed R internal function that is aware of platform idiosyncrasies.
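To make that concrete, here is what those two helpers return for one of the URLs above:
basename("https://mesonet.agron.iastate.edu/pickup/wwa/1986_all.zip")
## [1] "1986_all.zip"
file.path("mesonet-dl", "1986_all.zip")
## [1] "mesonet-dl/1986_all.zip"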
This should work
library(tidyverse)
library(rvest)
setwd("YourDirectoryName") # set the directory where you want to download all files
read_html("https://mesonet.agron.iastate.edu/request/gis/watchwarn.phtml") %>%
html_nodes(".table-striped a") %>%
html_attr("href") %>%
lapply(function(x) {
  filename <- str_extract(x, pattern = "(?<=wwa/).*") # this extracts the filename from the url
  paste0("https://mesonet.agron.iastate.edu", x) %>%  # this creates the relevant url from the href
    download.file(destfile = filename, mode = "wb")
  Sys.sleep(5)
})

Get text from href tag after specific class

I am trying to scrape a webpage
library(RCurl)
webpage <- getURL("https://somewebpage.com")
webpage
<div class='CredibilityFacts'><span id='qZyoLu'><a class='answer_permalink'
action_mousedown='AnswerPermalinkClickthrough' href='/someurl/answer/my_id'
id ='__w2_yeSWotR_link'>
<a class='another_class' action_mousedown='AnswerPermalinkClickthrough'
href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>
<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough'
href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>
class(webpage)
[1] "character"
I am trying to extract all the href values, but only when they are preceded by the answer_permalink class.
The output of this should be
[1] "/someurl/answer/my_id" "/another_url/answer/new_id"
/ignore_url/answer/some_id should be ignored, as it is preceded by another_class and not the answer_permalink class.
Right now, I am thinking of an approach with regex. I think something like this can be used for regex in stri_extract_all
class='answer_permalink'.*href='
but this isn't exactly what I want.
How can I achieve this? Moreover, apart from regex, is there a function in R that can extract elements by class, as in JavaScript?
With dplyr and rvest we could do:
library(rvest)
library(dplyr)
"https://www.quora.com/profile/Ronak-Shah-96" %>%
read_html() %>%
html_nodes("[class='answer_permalink']") %>%
html_attr("href")
[1] "/How-can-we-adjust-in-engineering-if-we-are-not-in-IITs-or-NITs-How-can-we-enjoy-engineering-if-we-are-pursuing-it-from-a-local-private-college/answer/Ronak-Shah-96"
[2] "/Do-you-think-it-is-worth-it-to-change-my-career-path-For-the-past-2-years-I-was-pursuing-a-career-in-tax-advisory-in-a-BIG4-company-I-just-got-a-job-offer-that-will-allow-me-to-learn-coding-It-is-not-that-well-paid/answer/Ronak-Shah-96"
[3] "/Why-cant-India-opt-for-40-hours-work-a-week-for-all-professions-when-it-is-proved-and-working-well-in-terms-of-efficiency/answer/Ronak-Shah-96"
[4] "/Why-am-I-still-confused-and-thinking-about-my-career-after-working-more-than-one-year-in-software-engineering/answer/Ronak-Shah-96"
[5] "/Would-you-rather-be-a-jack-of-all-trades-or-the-master-of-one-trade/answer/Ronak-Shah-96"
Instead of string parsing, you could use a package like rvest or xml2:
library(xml2)
xml <- read_html(webpage)
l <- as_list(xml)[[1]][[1]][[1]][[1]] #not sure why you need to go this deep.
l2 <- l[sapply(l, attr, ".class") == "answer_permalink"]
sapply(l2, attr, "href")
a a
"/someurl/answer/my_id" "/another_url/answer/new_id"
require(XML)
require(RCurl)
doc <- getURL("https://www.quora.com/profile/Ronak-Shah-96" )
html <- htmlTreeParse(doc, useInternalNodes = TRUE)
nodes <- getNodeSet(html, "//a[@class='answer_permalink']")
sapply(nodes, function(x) xmlAttrs(x)[["href"]])
[1] "/Do-you-think-it-is-worth-it-to-change-my-career-path-For-the-past-2-years-I-was-pursuing-a-career-in-tax-advisory-in-a-BIG4-company-I-just-got-a-job-offer-that-will-allow-me-to-learn-coding-It-is-not-that-well-paid/answer/Ronak-Shah-96"
[2] "/Why-cant-India-opt-for-40-hours-work-a-week-for-all-professions-when-it-is-proved-and-working-well-in-terms-of-efficiency/answer/Ronak-Shah-96"
[3] "/Why-am-I-still-confused-and-thinking-about-my-career-after-working-more-than-one-year-in-software-engineering/answer/Ronak-Shah-96"
[4] "/Would-you-rather-be-a-jack-of-all-trades-or-the-master-of-one-trade/answer/Ronak-Shah-96"
[5] "/Is-software-engineering-a-good-career-choice-I-know-it-pays-well-initially-but-if-you-look-at-the-managing-directors-of-most-companies-they-are-people-with-MBAs/answer/Ronak-Shah-96"

R software - rvest package, error in "download number"

I want to download Amazon book review counts, but I have one problem.
I tried the following:
library(rvest)
url<-paste0("http://www.amazon.com/s/ref=lp_4_nr_p_72_3?",
"fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2C",
"n%3A4%2Cp_72%3A1250224011&bbn=4&ie=UTF8&qid",
"=1440446201&rnid=1250219011")
html<-html(url)
Reviews <- try({html_nodes(html, "#s-results-list-atf .a-text-normal:nth-child(2)") %>%
html_text()}, silent = TRUE)
But I only get 4 review counts in my R console and not 12 (using Selector Gadget). What did I do wrong?
When I tried to download the books' names I didn't have the same problem... only with the review counts.
Book <- try({ html_nodes(html, ".s-access-title") %>%
html_text()}, silent = TRUE)
Page link: Amazon Page
This is probably not the canonical approach, but here's what I did that works:
#via Inspect element in Chrome, the relevant info is
# in an <a> tag with class 'a-size-small a-link-normal a-text-normal'
# but this does not uniquely identify the review counts
# (e.g., the $12.00 Buy used & new... bit is also there)
# so we take a step up and find that both the rating
# and the review count are stored in a <div> tag
# with class 'a-row a-spacing-mini'
x<-html(url) %>% html_nodes("div.a-row.a-spacing-mini") %>%
html_nodes("a.a-size-small.a-link-normal.a-text-normal") %>%
html_text
#upon inspection of x, we can see that the relevant numbers
# always appear by themselves, thus:
> x[!is.na(as.integer(gsub(",","",x)))]
[1] "168" "232" "1,607" "2,226" "1,060" "25" "731" "2,374" "345" "7,205"
[11] "1,134" "1,137"
