I am new to web scraping and R. I have been trying to build a function that will scrape multiple items from each node with a particular name. In my search for an answer I came across https://github.com/hadley/rvest/issues/12, which has given me a good start.
Here is my question. I use:
nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
read_html %>%
html_nodes("div.col-md-6")
to give me an xml_nodeset. If I use:
html_node(nodes[1], xpath = "div[1]//a") %>% html_text()
I get the information I am looking for. So I need a way to loop over my xml_nodeset and apply the above function; however, I have been unsuccessful.
I originally tried to just use
column <- function(x) nodes %>% html_node(xpath = "div[1]//a") %>% html_text()
like the link at the top did, but I get the error "Error in eval(expr, envir, enclos) : No matches". I have also tried using xpathApply, but it said:
"Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "xml_nodeset""
Any direction you could give me would be most helpful.
This will give you all the titles and links for each video:
library(RCurl)
library(XML)
nodes <- "http://pyvideo.org/category/50/pycon-us-2014"
doc <- htmlParse(nodes)
titles <- xpathSApply(doc, "//div[@class='col-md-6']//strong/a", xmlValue)
links <- paste("http://pyvideo.org", xpathSApply(doc, "//div[@class='col-md-6']//strong//@href"), sep = "")
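For reference, the loop-over-nodes idea the question asks about can also be expressed directly in rvest, by applying the per-node extraction to each element of the xml_nodeset. A minimal sketch, assuming the page structure from the question:
library(rvest)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  read_html() %>%
  html_nodes("div.col-md-6")

# apply the per-node XPath extraction to every element of the xml_nodeset
titles <- sapply(nodes, function(node) {
  node %>% html_node(xpath = "div[1]//a") %>% html_text()
})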
Related
I am trying to scrape multiple webpages using a list of URLs (a csv file).
This is my dataset: https://www.mediafire.com/file/9qh516tdcto7is7/nyp_data.csv/file
The url column includes all the links that I am trying to use and scrape.
I tried to use a for() loop:
library(readr)
library(rvest)

news_urls <- read_csv("nyp_data.csv")

content_list <- vector()

for (i in 1:nrow(news_urls)) {
  nyp_url <- news_urls[i, 'url']
  nyp_html <- read_html(nyp_url)

  nyp_nodes <- nyp_html %>%
    html_elements(".single__content")

  tag_name <- ".single__content"
  nyp_texts <- nyp_html %>%
    html_elements(tag_name) %>%
    html_text()

  content_list[i] <- nyp_texts[1]
}
However, I am getting an error that says:
Error in UseMethod("read_xml") : no applicable method for
'read_xml' applied to an object of class "c('tbl_df', 'tbl',
'data.frame')"
I believe the links that I have work well; they aren't broken, and I can access them by clicking an individual link.
If a for loop isn't what I should be using here, do you have any other idea of how to scrape the content?
I also tried:
urls <- news_urls[, 5]  # identify the column with the urls
url_xml <- try(apply(urls, 1, read_html))  # apply the function read_html() to the `url` vector

textScraper <- function(x) {
  html_text(html_nodes(x, ".single__content")) %>%  # in this data, my text is in a node called ".single__content"
    str_replace_all("\n", "") %>%
    str_replace_all("\t", "") %>%
    paste(collapse = '')
}
article_text <- lapply(url_xml, textScraper)
article_text[1]
but it kept giving me an error:
Error in open.connection(x, "rb") : HTTP error 404
The error occurs in this line:
nyp_html <- read_html(nyp_url)
As the error message tells you, the argument to read_xml (which is what read_html calls internally) is a data.frame (among other classes; it is actually a tibble).
This is because in this line:
nyp_url <- news_urls[i, 'url']
you are using single brackets to subset your data. Single brackets return a data.frame containing the filtered data. You can avoid this by using double brackets like this:
nyp_url <- news_urls[[i, 'url']]
or this (which I usually find more readable):
nyp_url <- news_urls[i, ]$url
Either should fix your problem.
If you want to read more about using these notations you could look at this answer.
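To illustrate the difference between the two notations, here is a small self-contained example (the data is made up):
library(tibble)

# made-up data frame standing in for news_urls
news_urls <- tibble(url = c("https://example.com/a", "https://example.com/b"))

news_urls[1, 'url']    # single brackets: a 1x1 tibble (still a data frame)
news_urls[[1, 'url']]  # double brackets: the bare character string
news_urls[1, ]$url     # also the bare character string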
I'm building a web scraper for some News websites in Switzerland. After some trial & error and a lot of help from StackOverflow (thx everyone!), I've gotten to a point where I can get text data from all articles.
# install packages
install.packages("rvest")
install.packages("tidyverse")
install.packages("dplyr")
library(rvest)
library(stringr)

# read in the feed
apisrf <- read_xml('https://www.srf.ch/news/bnf/rss/1646')
urls_srf <- apisrf %>% html_nodes('link') %>% html_text()
zeit_srf <- apisrf %>% html_nodes('pubDate') %>% html_text()

# build the data frame
dfsrf_titel_text <- data.frame(Text = character())
# scrape
for (i in 1:length(urls_srf)) {
  link <- urls_srf[i]
  artikel <- read_html(link)
  # extract the information
  textsrf <- artikel %>% html_nodes('p') %>% html_text()
  # structure it into a data frame
  dfsrf_text <- data.frame(Text = textsrf)
  dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
Running this gives me dfsrf_titel_text. (I'm going to combine it with the titles of the articles at some point, but let that be my problem.)
However, my data is now pretty untidy and I can't really figure out how to clean it in a way that works for me. Especially annoying is that the texts from the different articles are not kept together but get a new row whenever there is a paragraph break in the text. I can't simply combine the paragraphs because the texts all have different lengths. (The first article, starting at point 3, is super long because it's a live ticker covering the corona crisis, so don't get confused if you run my code.)
How can I get R to create a new row in my data frame only if the text is from a new article (meaning from a new URL)?
thx for your help!
Can you provide a sample of your data? You can use the strsplit(string, pattern) function, where the pattern you specify is something that only occurs between articles. Perhaps the URL?
strsplit(dfsrf_text,"www.\\w+.ch")
That will split your text any time a URL in the .ch domain is found. You can use this regular expression cheat sheet to help you identify the pattern that separates your articles.
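For example, on a toy string (made up here just to show the splitting behaviour):
text <- "First article text. www.srf.ch Second article text. www.nzz.ch Third article text."
strsplit(text, "www\\.\\w+\\.ch")
# returns a list with one element per input string, split wherever a .ch URL occurs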
You should correct this while creating the data frame itself. Here I am binding all the data for each article together using paste0, adding newline characters between the paragraphs (\n\n).
library(rvest)

dfsrf_titel_text <- data.frame(Text = character())

for (i in 1:length(urls_srf)) {
  link <- urls_srf[i]
  artikel <- read_html(link)
  # extract the information, collapsing all paragraphs of one article into a single string
  textsrf <- paste0(artikel %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
  # structure it into a data frame
  dfsrf_text <- data.frame(Text = textsrf)
  dfsrf_titel_text <- rbind(dfsrf_titel_text, dfsrf_text)
}
However, growing data in a loop is highly inefficient and can slow the process terribly, especially when you have a lot of data to scrape like this. Try using sapply instead.
dfsrf_titel_text <- data.frame(text = sapply(urls_srf, function(x) {
  paste0(read_html(x) %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
}))
This will give you the same number of rows as the length of urls_srf.
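If you also want to keep track of which article each row came from, one option (a sketch, not part of the original answer) is to store the source URL alongside the text:
dfsrf_titel_text <- data.frame(
  url  = urls_srf,
  text = sapply(urls_srf, function(x) {
    paste0(read_html(x) %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
  }),
  stringsAsFactors = FALSE
)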
I'm trying to extract a bit of information under the node /html/head/script[16] from a website (here) but am unable to do so.
nykaa <- "https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934"
obj <- read_html(nykaa)
extracted_json <- obj %>%
  html_nodes(xpath = "/html/head/script[16]") %>%
  html_text(trim = TRUE)
Currently, my output for the above code is empty. But I would like to extract the data under the above-mentioned node in an organized manner.
You can use a regex to grab the JavaScript object inside that script tag and then pass it to jsonlite to parse. You need to root around a bit to get what you want from the result, but it is all there.
library(rvest)
library(magrittr)
library(stringr)
library(jsonlite)
p <- read_html('https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934') %>% html_text()
all_data <- jsonlite::parse_json(str_match_all(p,'window\\.__PRELOADED_STATE__ = (.*)')[[1]][,2])
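From there you can inspect the parsed object to find the fields you need; the exact keys depend on the page, so the calls below are just for exploration:
# look at the top-level structure of the parsed JSON
names(all_data)
str(all_data, max.level = 1)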
I'm trying to scrape tabulated data on previous US statewide election results, and I think ballotpedia.org is a good place to get this data from, as the URLs are in a consistent format for all states.
Here's the code I set up to test it:
library(dplyr)
library(rvest)
# STEP 1 - URL COMPONENTS TO SCRAPE FROM
senate_base_url <- "https://ballotpedia.org/United_States_Senate_elections_in_"
senate_state_urls <- gsub(" ", "_", state.name)
senate_year_urls <- c(",_2012", ",_2014", ",_2016")
# TEST
test_url <- paste0(senate_base_url, senate_state_urls[10], senate_year_urls[2])
This results in the following URL: https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014
Using the 'selectorgadget' Chrome plugin, I selected the table in question containing the election result, and tried parsing it into R as follows:
test_data <- read_html(test_url)
test_data <- test_data %>%
  html_node(xpath = '//*[@id="collapsibleTable0"]') %>%
  html_table()
However, I'm getting the following error:
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
Furthermore, the R object test_data yields a list with 2 empty elements.
Can anyone tell me what I'm doing wrong here? Is the html_table() function the wrong one? Using html_text() simply returns an NA character vector. Any help would be greatly appreciated, thanks very much :).
Your xpath statement is incorrect, thus the html_node function is returning a null value.
Here is a solution using the HTML tags: "look for a table tag within a center tag".
library(rvest)
test_data <- read_html(test_url)
test_data <- test_data %>% html_nodes("center table") %>% html_table()
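Since html_table() applied to a node set returns a list of data frames, the table of interest is then picked out by position (the index below is only illustrative):
length(test_data)   # how many tables were matched
test_data[[1]]      # the first matched table as a data frame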
Or, to retrieve the full collapsible table, use the table tag with its class name:
collapsedtable <- read_html(test_url) %>% html_nodes("table.collapsible") %>%
  html_table(fill = TRUE)
This works for me:
library(httr)
library(XML)
r <- httr::GET("https://ballotpedia.org/United_States_Senate_elections_in_Georgia,_2014")
XML::readHTMLTable(rawToChar(r$content))[[2]]
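To run this across all the state/year combinations built in the question, either approach can be wrapped in a loop. A sketch using the httr/XML variant, assuming the table of interest sits at index 2 on every page (which may not hold for all states and years):
library(httr)
library(XML)

results <- list()
for (state in senate_state_urls) {
  for (year in senate_year_urls) {
    url  <- paste0(senate_base_url, state, year)
    r    <- httr::GET(url)
    tbls <- XML::readHTMLTable(rawToChar(r$content))
    results[[paste0(state, year)]] <- tbls[[2]]  # index 2 matched the Georgia 2014 page
  }
}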
I imported the csv file that I want to use in R. Here, I am trying to call one of the columns from the csv file. This column, titled "URLs", has a list of urls. Then I want my code to scrape data from each url. In short, I want a more efficient way than listing all the urls in a c() function, since I have about 200 links.
https://www.nytimes.com/2018/04/07/health/health-care-mergers-doctors.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/11/well/move/why-exercise-alone-may-not-be-the-key-to-weight-loss.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/07/health/antidepressants-withdrawal-prozac-cymbalta.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/well/why-you-should-get-the-new-shingles-vaccine.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/fda-essure-bayer-contraceptive-implant.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/hot-pepper-thunderclap-headaches.html?rref=collection%2Fsectioncollection%2Fhealth
The error appears when running this: article <- links %>% map(read_html).
It gives me this message:
(Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "factor")
Here is the code:
setwd("C:/Users/Majed/Desktop")
d <- read.csv("NYT.csv")
d
links<- d$URLs
article <- links %>% map(read_html)
title <-
article %>% map_chr(. %>% html_node("title") %>% html_text())
content <-
article %>% map_chr(. %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = ""))
article_table <- data.frame("Title" = title, "Content" = content)
Pay attention to the meaning of your error message: read_html expects a character string, but you're giving it a factor. read.csv converts strings to factors, unless you include the argument stringsAsFactors = F. read_csv from readr is a good alternative if you, like me, forget that you don't want strings automatically turned into factors.
I can't reproduce the problem without your data, but try converting the URLs to strings:
links <- as.character(d$URLs)
article <- links %>% map(read_html)
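A minimal sketch of the readr alternative mentioned above (the column name URLs is taken from the question's code):
library(readr)
library(purrr)
library(rvest)

d <- read_csv("NYT.csv")   # read_csv keeps strings as character, not factors
links <- d$URLs
article <- links %>% map(read_html)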