Using a for loop to scrape webpages in R

I am trying to scrape multiple webpages using a list of URLs (a csv file).
This is my dataset: https://www.mediafire.com/file/9qh516tdcto7is7/nyp_data.csv/file
The url column includes all the links that I am trying to use and scrape.
I tried to use a for() loop:
library(readr)
library(rvest)

news_urls <- read_csv("nyp_data.csv")
content_list <- vector()

for (i in 1:nrow(news_urls)) {
  nyp_url <- news_urls[i, 'url']
  nyp_html <- read_html(nyp_url)
  nyp_nodes <- nyp_html %>%
    html_elements(".single__content")
  tag_name <- ".single__content"
  nyp_texts <- nyp_html %>%
    html_elements(tag_name) %>%
    html_text()
  content_list[i] <- nyp_texts[1]
}
However, I am getting an error that says:
Error in UseMethod("read_xml") : no applicable method for
'read_xml' applied to an object of class "c('tbl_df', 'tbl',
'data.frame')"
I believe the links work; they aren't broken, and I can access them by clicking an individual link.
If a for loop isn't what I should be using here, do you have any other ideas for scraping the content?
I also tried:
urls <- news_urls[,5] #identify the column with the urls
url_xml <- try(apply(urls, 1, read_html)) #apply the function read_html() to the `url` vector
textScraper <- function(x) {
  html_text(html_nodes(x, ".single__content")) %>% # in this data, my text is in a node called ".single__content"
    str_replace_all("\n", "") %>%
    str_replace_all("\t", "") %>%
    paste(collapse = '')
}
article_text <- lapply(url_xml, textScraper)
article_text[1]
but it kept giving me an error:
Error in open.connection(x, "rb") : HTTP error 404

The error occurs in this line:
nyp_html <- read_html(nyp_url)
As the error message tells you, the argument to read_xml (which is what read_html calls internally) is a data.frame (actually a tibble, which is also a data.frame).
This is because in this line:
nyp_url <- news_urls[i, 'url']
you are using single brackets to subset your data. Single brackets return a data.frame (here, a tibble) containing the filtered data. You can avoid this by using double brackets, like this:
nyp_url <- news_urls[[i, 'url']]
or this (which I usually find more readable):
nyp_url <- news_urls[i, ]$url
Either should fix your problem.
If you want to read more about using these notations you could look at this answer.
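For completeness, a minimal sketch of the corrected loop (assuming the same CSV and the same .single__content selector as in the question):
library(readr)
library(rvest)

news_urls <- read_csv("nyp_data.csv")
content_list <- vector("character", nrow(news_urls))

for (i in 1:nrow(news_urls)) {
  nyp_url <- news_urls[[i, 'url']]  # double brackets: a character string, not a tibble
  nyp_texts <- read_html(nyp_url) %>%
    html_elements(".single__content") %>%
    html_text()
  content_list[i] <- nyp_texts[1]
}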

Related

Rvest: html_nodes returns an empty list and string, weird website

For this website, https://www.coinopsy.com/dead-coins/, I'm using R and the rvest package to scrape names, summaries, and that kind of info to build my own table. I've done this with other websites and it was really successful, but this one is odd.
I used SelectorGadget, which was useful in my previous jobs, to figure out the CSS node names, but html_nodes and html_text return an empty character vector. I don't know if it's because the website is structured in a totally different format!
An example of the HTML:
<td class="all sorting_1"><a class="coin_name" href="007coin">007Coin</a></td>
<a class="coin_name" href="007coin">007Coin</a>
url <- "https://www.coinopsy.com/dead-coins/"
webpage <- read_html(url)
Item_html <- html_nodes(webpage,'.coin_name')
Item <- html_text(Item_html)
> Item
character(0)
Can someone help me out on this issue?
If you disable JavaScript in the browser, you will see that that content is not loaded. If you then inspect the HTML, you will see the data is stored in a script tag and presumably loaded into the table when JavaScript runs in the browser. JavaScript doesn't run with the method you are using. You can extract the JavaScript array of arrays from the response HTML and then parse it into a dataframe. I am new to R, so I am looking into how this can be done in this case; I will include a full example with Python at the end and will update if my research yields something. Otherwise, you can regex out the contents from the returned string.
library(rvest)
library(stringr)
library(magrittr)

url <- 'https://www.coinopsy.com/dead-coins/'
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
data <- str_match_all(r, 'var table_data = (.*?);')
data <- data[[1]][,2] # string representation of a list of lists
# step to convert string to object
# step to convert object to dataframe
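One possible way to fill in those two steps, assuming the JavaScript array literal becomes valid JSON once its single quotes are swapped for double quotes (something you would need to verify against the actual response), is jsonlite:
library(jsonlite)

# Hypothetical completion: treat the JS array of arrays as JSON.
json_text <- gsub("'", '"', data)  # swap quote style; assumes no embedded quotes in the values
listings <- fromJSON(json_text)    # character matrix, one row per coin
df <- as.data.frame(listings, stringsAsFactors = FALSE)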
In Python there is the ast library, which makes the conversion easy; the result of the code below is the table you see on the page.
import requests
import re
import ast
import pandas as pd
r = requests.get('https://www.coinopsy.com/dead-coins/')
p = re.compile(r'var table_data = (.*?);') #p1 = re.compile(r'(\[".*?"\])')
data = p.findall(r.text)[0]
listings = ast.literal_eval(data)
df = pd.DataFrame(listings)
print(df)
Edit:
Currently I can't find a library which does the conversion I mentioned. Below is an ugly way of combining the pieces that feels inefficient; I would welcome suggestions for improvement (though that may be for Code Review later). I'm still looking at this, so I will update.
library(rvest)
library(stringr)
library(magrittr)

url <- 'https://www.coinopsy.com/dead-coins/'
headers <- c("Column To Drop","Name","Summary","Project Start Date","Project End Date","Founder","urlId")
# https://www.coinopsy.com/dead-coins/bigone-token/ where bigone-token is urlId
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
data <- str_match_all(r, 'var table_data = (.*?);')
data <- data[[1]][,2]
z <- substr(data, start = 2, stop = nchar(data) - 1) %>% str_match_all(., "\\[(.*?)\\]")
z <- z[[1]][,2]
for (i in seq(1, length(z))) {
  if (i == 1) {
    df <- rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x)))
  } else {
    df <- rbind(df, rapply(as.list(strsplit(z[i], ",")[[1]][2:7]), function(x) trimws(sub("'(.*?)'", "\\1", x))))
  }
}
Maybe it will help someone: I had the same problem. The solution was that I had to specify the tag the selector targets, followed by the ".". In your case you want to address a class named coin_name, but when specifying that class in the html_nodes function you don't specify the tag, the same mistake I made. To solve it, I only had to add the tag, which in your case is the "a" tag, so it would look like this:
Item_html <- html_nodes(webpage,'a.coin_name')
That way the html_nodes function will not return an empty result.
I know you already solved it, but I hope this helps someone.

undefined columns when trying to use separate function

I have code that I built to scrape player data from Yahoo's fantasy football player page so I can get a list of players and the rank that Yahoo gives them.
The code worked fine last year, but now I am getting an error when I run the separate function:
> temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
Error in `[.data.frame`(x, x_vars) : undefined columns selected
In addition: Warning message:
Expected 6 pieces. Missing pieces filled with `NA` in 1 rows [1].
I cannot figure out why it is giving this error; the column I am trying to separate looks correct. I have another script that uses this function to do something similar, and when I tried it there it worked fine.
The "Missing pieces filled with `NA`" warning shouldn't be a problem; the code just won't run because of the undefined columns error.
The minimal code that I use to get to where I am is this:
library(rvest) ## For read_html
library(tidyr) ## For the separate function
#scrapes the data
url <- 'https://football.fantasysports.yahoo.com/f1/107573/players?status=A&pos=O&cut_type=9&stat1=S_S_2017&myteam=0&sort=PR&sdir=1&count=0'
web <- read_html(url)
table = html_nodes(web, 'table')
temp <- html_table(table)[[2]]
#
colnames(temp) <- c('one','two',3:26)
temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
The data is scraped in without column names, so I quickly assign some, spelling out the column in question so it works with the separate function. I have tried putting quotation marks around two in separate, but it gives the same error.
After removing the first row of temp, your code works.
library(dplyr)
colnames(temp) <- c('one', 'two', 3:ncol(temp)) # use ncol(temp) to make sure the column count is correct
temp2 <- temp %>%
  filter(row_number() > 1) %>%
  separate(two, c('Note', 'Player', 'a', 'b', 'c', 'Opp'), sep = "\n", remove = TRUE)

Calling a column from a csv file to extract its data

I imported the csv file that I want to use in R. Here, I am trying to call one of the columns from the csv file. This column, titled "URLs", holds a list of urls. I then want my code to scrape data from each url. In short, I want a more efficient way than listing all the urls in the c() function, since I have about 200 links.
https://www.nytimes.com/2018/04/07/health/health-care-mergers-doctors.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/11/well/move/why-exercise-alone-may-not-be-the-key-to-weight-loss.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/07/health/antidepressants-withdrawal-prozac-cymbalta.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/well/why-you-should-get-the-new-shingles-vaccine.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/fda-essure-bayer-contraceptive-implant.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/hot-pepper-thunderclap-headaches.html?rref=collection%2Fsectioncollection%2Fhealth
The error appears when running this: article <- links %>% map(read_html).
It gives me this message:
(Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "factor")
Here is the code:
library(rvest)
library(purrr)

setwd("C:/Users/Majed/Desktop")
d <- read.csv("NYT.csv")
d
links <- d$URLs
article <- links %>% map(read_html)
title <-
  article %>% map_chr(. %>% html_node("title") %>% html_text())
content <-
  article %>% map_chr(. %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = ""))
article_table <- data.frame("Title" = title, "Content" = content)
Pay attention to the meaning of your error message: read_html expects a character string, but you're giving it a factor. read.csv converts strings to factors unless you include the argument stringsAsFactors = F (that conversion was the default before R 4.0). read_csv from readr is a good alternative if you, like me, forget that you don't want strings automatically turned into factors.
I can't reproduce the problem without your data, but try converting the URLs to strings:
links <- as.character(d$URLs)
article <- links %>% map(read_html)
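For reference, a short sketch of the readr alternative mentioned above (assuming the same NYT.csv file), which sidesteps the factor issue entirely:
library(readr)
library(purrr)
library(rvest)

d <- read_csv("NYT.csv")  # read_csv never converts strings to factors
article <- d$URLs %>% map(read_html)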

Web Scraping (in R) - readHTMLTable error

I have a file called Schedule.csv, which is structured as follows:
URLs
http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=27&year=2015
http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=28&year=2015
I am trying to use the explanation provided in the following question to scrape the HTML tables, but it isn't working: How to scrape HTML tables from a list of links
My current code is as follows:
library(XML)
schedule <- read.csv("Schedule.csv")
stats <- list()
for (i in seq_along(schedule))
{
  print(i)
  total <- readHTMLTable(schedule[i])
  n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
  stats[[i]] <- as.data.frame(total[[which.max(n.rows)]])
}
I get an error when I run this code as follows:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"data.frame"’
If I manually type the URL's in a vector as per below I get exactly what I want when I run the readHTMLTable code.
schedule<-c("http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=27&year=2015","http://www.basketball-reference.com/friv/dailyleaders.cgi?month=10&day=28&year=2015")
Can someone please explain to me why read.csv is not giving me a usable vector of information to input into the readHTMLTable function?
read.csv creates a data.frame in your schedule variable, and you want to access it by rows, but seq_along(schedule) and schedule[i] work along the columns of a data frame.
In your case you can do:
for (i in 1:nrow(schedule)) {
  total <- readHTMLTable(schedule[i, 1])
As I understand it, you want the first column of your data.frame; change the 1 in schedule[i, 1] or use a column name if you need a different column.
Also notice that read.csv will read your first column as a factor, so you may prefer to read it as character:
schedule<-read.csv("Schedule.csv", as.is = TRUE)
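Putting the pieces together, a sketch of the corrected loop (assuming the URLs sit in the first column of Schedule.csv):
library(XML)

schedule <- read.csv("Schedule.csv", as.is = TRUE)
stats <- list()
for (i in 1:nrow(schedule)) {
  total <- readHTMLTable(schedule[i, 1])
  n.rows <- unlist(lapply(total, function(t) dim(t)[1]))
  stats[[i]] <- as.data.frame(total[[which.max(n.rows)]])
}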
Another alternative, if your file has a single column, is to use readLines, and then you can keep your loop as it was:
schedule <- readLines("Schedule.csv")
stats <- list()
for (i in seq_along(schedule))
{
  print(i)
  total <- readHTMLTable(schedule[i])
  ...
But be careful with the column names: the header row will end up as the first element of your schedule vector.

looping over xml_nodeset in R

I am new to web scraping and R. I have been trying to build a function that will scrape multiple items from each node with a particular name. In my search for an answer I came across https://github.com/hadley/rvest/issues/12, which has given me a good start.
Here is my question. I use:
nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  read_html %>%
  html_nodes("div.col-md-6")
to give me an xml_nodeset. If I use:
html_node(nodes[1],xpath = "div[1]//a") %>% html_text()
I get the information I am looking for. So I need a way to loop over my xml_nodeset and apply the above function; however, I have been unsuccessful.
I originally tried to just use
column <- function(x) nodes %>% html_node(xpath = "div[1]//a") %>% html_text()
like the link at the top did, but I get the error "Error in eval(expr, envir, enclos) : No matches". I have also tried using xpathApply, but it said:
"Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "xml_nodeset""
Any direction you could give me would be most helpful.
This will give you all the titles and links for each video:
library(RCurl)
library(XML)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014"
doc <- htmlParse(nodes)
titles <- xpathSApply(doc, "//div[@class='col-md-6']//strong/a", xmlValue)
links <- paste("http://pyvideo.org", xpathSApply(doc, "//div[@class='col-md-6']//strong//@href"), sep = "")
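If you would rather stay within rvest, here is a minimal sketch of the looping approach the question asked about, applying the per-node extraction to each element of the nodeset with sapply (assuming the page structure from the question; note that in recent rvest versions html_node returns NA for nodes without a match rather than erroring):
library(rvest)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  read_html() %>%
  html_nodes("div.col-md-6")

# Apply the single-node extraction from the question to every node.
titles <- sapply(nodes, function(node) {
  node %>% html_node(xpath = "div[1]//a") %>% html_text()
})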
