I have a CSV file that contains information about a set of articles, and the 9th column contains the URLs. I have successfully scraped the title and abstract from a single URL with the following code:
library('rvest')
url <- 'https://link.springer.com/article/10.1007/s10734-019-00404-5'
webpage <- read_html(url)
title_data_html <- html_nodes(webpage,'.u-h1')
title_data <- html_text(title_data_html)
head(title_data)
abstract_data_html <- html_nodes(webpage,'#Abs1-content p')
abstract_data <- html_text(abstract_data_html)
head(abstract_data)
myTable = data.frame(Title = title_data, Abstract = abstract_data)
View(myTable)
Now I want to use R to scrape the title and abstract of each article. My question is how to import the URLs contained in the CSV file and how to write a for loop to scrape the data I need. I'm quite new to R, so thanks in advance for your help.
Try This:
library(rvest)
URLs <- read.csv("urls.csv")
n <- nrow(URLs)

URLs2 <- character()
for (i in 1:n) {
  URLs2[i] <- as.character(URLs[i, 1])
}

df <- data.frame(Row = as.integer(), Title = as.character(), Abstract = as.character(),
                 stringsAsFactors = FALSE)

for (i in 1:n) {
  webpage <- tryCatch(read_html(URLs2[i]), error = function(e) {'empty page'})
  if (!"empty page" %in% webpage) {
    title_data_html <- html_nodes(webpage, '.u-h1')
    title_data <- html_text(title_data_html)
    abstract_data_html <- html_nodes(webpage, '#Abs1-content p')
    abstract_data <- html_text(abstract_data_html)
    temp <- as.data.frame(cbind(Row = match(URLs2[i], URLs2), Title = title_data, Abstract = abstract_data))
    if (ncol(temp) == 3) {
      df <- rbind(df, temp)
    }
  }
}
View(df)
Edit: The code has been edited so that it works even if some of the URLs are broken (it skips them). The output rows are numbered with each entry's corresponding row number in the CSV.
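Since your URLs sit in the 9th column rather than the first, adjust the import step along these lines (a minimal sketch; the column position, or a column name such as URL if your file has a header, is an assumption you should match to your own CSV):

URLs <- read.csv("urls.csv", stringsAsFactors = FALSE)
# take the 9th column (or select it by name, e.g. URLs$URL, if the CSV has headers)
URLs2 <- as.character(URLs[[9]])
n <- length(URLs2)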
I'm looking to download fundamental data for public companies. Using the quantmod package, I tried getFinancials() to pull the data; it works for some companies but with varied results (I read and understand the disclaimer about free data), and I want to confirm that I am pulling this correctly.
For JPM:
On the Yahoo Finance website I can see the financials populated, but the call below seems to pull from "google" as the src instead of "yahoo", and the Google-sourced financials are only sparsely populated.
Google - https://www.google.com/finance?q=NYSE%3AJPM&fstype=ii&ei=9kh-WejLE5e_etbzmpgP
Yahoo - https://finance.yahoo.com/quote/JPM/financials?p=JPM
library(quantmod)
JPM <- getFinancials("JPM", src = "yahoo", auto.assign = FALSE)
viewFin(JPM, type = "IS", period = "A")
Is there a correct way to specify the src? Also, is there a way to use getFinancials() but, if there is an NA in an indicative column (Revenues, for example), switch the source (google vs. yahoo)?
The top of the help page for getFinancials says (emphasis added),
Download Income Statement, Balance Sheet, and Cash Flow Statements from *Google Finance*.
There is currently no way to specify Yahoo Finance as a source. Doing so would require someone to write a method to scrape and parse the HTML from Yahoo Finance, since there's no way to download it in a file like there is for price data.
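On the second part of your question: since Google is the only source, there is nothing to switch to, but you can at least flag a sparse pull before using it. A minimal sketch, assuming the Google download succeeds and the income statement contains a row whose name includes "Revenue" (check rownames(JPM$IS$A) yourself, since the row labels vary):

library(quantmod)
JPM <- getFinancials("JPM", auto.assign = FALSE)   # Google is the only source
IS_A <- JPM$IS$A                                   # annual income statement matrix
# flag the pull as suspect if no revenue row is found or it is entirely NA
rev_row <- grep("Revenue", rownames(IS_A), value = TRUE)[1]
suspect <- is.na(rev_row) || all(is.na(IS_A[rev_row, ]))
suspect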
I think Yahoo changed its API very recently. Download the file from the link titled "Get Excel Spreadsheet to Download Bulk Historical Stock Data from Google Finance"
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
That is for Excel, which you can easily load into R.
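Once you have that spreadsheet, it is straightforward to read it into R; a small sketch (the readxl package and the file/sheet names are assumptions, so point them at wherever you saved the workbook):

library(readxl)
# hypothetical path and sheet -- adjust to the downloaded workbook
fundamentals <- read_excel("multiple-stock-quote-downloader.xlsx", sheet = 1)
head(fundamentals)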
You could try something like this, as well.
# assumes codes are known beforehand
codes <- c("MSFT","SBUX","S","AAPL","ADT")
urls <- paste0("https://www.google.com/finance/historical?q=",codes,"&output=csv")
paths <- paste0(codes, ".csv")
missing <- !(paths %in% dir("."))
missing

# simple error handling in case a file doesn't exist
downloadFile <- function(url, path, ...) {
  # remove the file if it exists already
  if (file.exists(path)) file.remove(path)
  # download the file
  tryCatch(
    download.file(url, path, ...), error = function(c) {
      # remove the file if an error occurred
      if (file.exists(path)) file.remove(path)
      # create an error message from the file name
      c$message <- paste(sub("\\.csv$", "", basename(path)), "failed")
      message(c$message)
    }
  )
}
# wrapper of mapply
Map(downloadFile, urls[missing], paths[missing])
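If the downloads work, you can then stack the files into one data frame. A minimal sketch, assuming each downloaded CSV has a header row with the same columns:

downloaded <- paths[file.exists(paths)]
prices <- do.call(rbind, lapply(downloaded, function(p) {
  d <- read.csv(p, stringsAsFactors = FALSE)
  d$Code <- sub("\\.csv$", "", basename(p)) # tag each row with its ticker
  d
}))
head(prices)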
Or, this.
## downloads historic prices for all constituents of SP500
library(zoo)
library(tseries)
## read in list of constituents, with company name in first column and
## ticker symbol in second column
## CREATE A FILE TO READ DATA FROM!!!
spComp <- read.csv("C:/Users/Excel/Desktop/stocks.csv" )
## specify time period
dateStart <- "2013-01-01"
dateEnd <- "2015-05-08"
## extract symbols and number of iterations
symbols <- spComp[, 2]  ## ticker symbols are in the second column, per the comment above
nAss <- length(symbols)
## download data on first stock as zoo object
z <- get.hist.quote(instrument = symbols[1], start = dateStart,
                    end = dateEnd, quote = "AdjClose",
                    retclass = "zoo", quiet = TRUE)
## use ticker symbol as column name
dimnames(z)[[2]] <- as.character(symbols[1])
## download remaining assets in for loop
for (i in 2:nAss) {
  ## display progress by showing the current iteration step
  cat("Downloading ", i, " out of ", nAss, "\n")
  result <- try(x <- get.hist.quote(instrument = symbols[i],
                                    start = dateStart,
                                    end = dateEnd, quote = "AdjClose",
                                    retclass = "zoo", quiet = TRUE))
  if (inherits(result, "try-error")) {
    next
  } else {
    dimnames(x)[[2]] <- as.character(symbols[i])
    ## merge with already downloaded data to get assets on the same dates
    z <- merge(z, x)
  }
}
## save data
# CREATE A FILE TO WRITE DATA TO!!!
write.zoo(z, file = "C:/Users/Excel/Desktop/all_sp500_price_data.csv", sep = ",", index.name = "time")
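To pick the data up again in a later session, read.zoo can rebuild the zoo object from that file (a small sketch, assuming the comma-separated output written above):

library(zoo)
z <- read.zoo("C:/Users/Excel/Desktop/all_sp500_price_data.csv",
              header = TRUE, sep = ",", format = "%Y-%m-%d")
str(z)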
Here is yet another option for you to consider.
Method #1:
This article illustrates how to download stock price data files from Google, save them to a local drive, and merge them into a single data frame. The script is slightly modified from one that downloads RStudio package download log data. The original source can be found [here](https://github.com/hadley/cran-logs-dplyr/blob/master/1-download.r).
First of all, the following three packages are used.
{% highlight r %}
library(knitr)
library(lubridate)
library(stringr)
library(plyr)
library(dplyr)
{% endhighlight %}
The script begins with creating a folder to save data files.
{% highlight r %}
# create data folder
dataDir <- paste0("data","_","2014-11-20-Download-Stock-Data-1")
if(file.exists(dataDir)) {
unlink(dataDir, recursive = TRUE)
dir.create(dataDir)
} else {
dir.create(dataDir)
}
{% endhighlight %}
After creating the URLs and file paths, the files are downloaded using the `Map` function, which is a wrapper of `mapply`. Note that, in case the call breaks with an error (e.g. when a file doesn't exist), `download.file` is wrapped in another function that includes an error handler (`tryCatch`).
{% highlight r %}
# assumes codes are known beforehand
codes <- c("MSFT", "TCHC") # codes <- c("MSFT", "1234") for testing
urls <- paste0("http://www.google.com/finance/historical?q=NASDAQ:",
codes,"&output=csv")
paths <- paste0(dataDir,"/",codes,".csv") # back slash on windows (\\)
# simple error handling in case file doesn't exists
downloadFile <- function(url, path, ...) {
# remove file if exists already
if(file.exists(path)) file.remove(path)
# download file
tryCatch(
download.file(url, path, ...), error = function(c) {
# remove file if error
if(file.exists(path)) file.remove(path)
# create error message
c$message <- paste(substr(path, 1, 4),"failed")
message(c$message)
}
)
}
# wrapper of mapply
Map(downloadFile, urls, paths)
{% endhighlight %}
Finally, the files are read back using `llply` and combined using `rbind_all`. Note that, as the merged data holds multiple stocks' records, a `Code` column is created.
{% highlight r %}
# read all csv files and merge
files <- dir(dataDir, full.names = TRUE)
dataList <- llply(files, function(file) {
  data <- read.csv(file, stringsAsFactors = FALSE)
  # get code from the file path
  pattern <- "/[A-Z][A-Z][A-Z][A-Z]"
  code <- substr(str_extract(file, pattern), 2, nchar(str_extract(file, pattern)))
  # first column's name is funny
  names(data) <- c("Date", "Open", "High", "Low", "Close", "Volume")
  data$Date <- dmy(data$Date)
  data$Open <- as.numeric(data$Open)
  data$High <- as.numeric(data$High)
  data$Low <- as.numeric(data$Low)
  data$Close <- as.numeric(data$Close)
  data$Volume <- as.integer(data$Volume)
  data$Code <- code
  data
}, .progress = "text")
data <- rbind_all(dataList)
{% endhighlight %}
Some of the values are shown below.
|Date | Open| High| Low| Close| Volume|Code |
|:----------|-----:|-----:|-----:|-----:|--------:|:----|
|2014-11-26 | 47.49| 47.99| 47.28| 47.75| 27164877|MSFT |
|2014-11-25 | 47.66| 47.97| 47.45| 47.47| 28007993|MSFT |
|2014-11-24 | 47.99| 48.00| 47.39| 47.59| 35434245|MSFT |
|2014-11-21 | 49.02| 49.05| 47.57| 47.98| 42884795|MSFT |
|2014-11-20 | 48.00| 48.70| 47.87| 48.70| 21510587|MSFT |
|2014-11-19 | 48.66| 48.75| 47.93| 48.22| 26177450|MSFT |
This approach is less efficient than reading the files directly without saving them to a local drive. It may be useful, however, if the files are large or the API server breaks the connection abruptly.
I hope this article is useful; the next article shows the second way.
Method #2:
In an [earlier article](http://jaehyeon-kim.github.io/r/2014/11/20/Download-Stock-Data-1/), a way to download stock price data files from Google, save them to a local drive, and merge them into a single data frame was illustrated. If the files are not large, however, saving them is not necessary, and in this article the files are downloaded and merged in memory.
The following packages are used.
{% highlight r %}
library(knitr)
library(lubridate)
library(stringr)
library(plyr)
library(dplyr)
{% endhighlight %}
Taking the URLs as file locations, the files are read directly using `llply` and combined using `rbind_all`. As the merged data holds multiple stocks' records, a `Code` column is created. Note that, when an error occurs, the function returns a dummy data frame so the loop does not break; the dummy rows are filtered out at the end.
{% highlight r %}
# assumes codes are known beforehand
codes <- c("MSFT", "TCHC") # codes <- c("MSFT", "1234") for testing
files <- paste0("http://www.google.com/finance/historical?q=NASDAQ:",
codes,"&output=csv")
dataList <- llply(files, function(file, ...) {
# get code from file url
pattern <- "Q:[0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z]"
code <- substr(str_extract(file, pattern), 3, nchar(str_extract(file, pattern)))
# read data directly from a URL with only simple error handling
# for further error handling: http://adv-r.had.co.nz/Exceptions-Debugging.html
tryCatch({
data <- read.csv(file, stringsAsFactors = FALSE)
# first column's name is funny
names(data) <- c("Date","Open","High","Low","Close","Volume")
data$Date <- dmy(data$Date)
data$Open <- as.numeric(data$Open)
data$High <- as.numeric(data$High)
data$Low <- as.numeric(data$Low)
data$Close <- as.numeric(data$Close)
data$Volume <- as.integer(data$Volume)
data$Code <- code
data
},
error = function(c) {
c$message <- paste(code,"failed")
message(c$message)
# return a dummy data frame
data <- data.frame(Date=dmy(format(Sys.Date(),"%d%m%Y")), Open=0, High=0,
Low=0, Close=0, Volume=0, Code="NA")
data
})
})
# dummy data frame values are filtered out
data <- filter(rbind_all(dataList), Code != "NA")
{% endhighlight %}
Some of the values are shown below.
|Date | Open| High| Low| Close| Volume|Code |
|:----------|-----:|-----:|-----:|-----:|--------:|:----|
|2014-11-26 | 47.49| 47.99| 47.28| 47.75| 27164877|MSFT |
|2014-11-25 | 47.66| 47.97| 47.45| 47.47| 28007993|MSFT |
|2014-11-24 | 47.99| 48.00| 47.39| 47.59| 35434245|MSFT |
|2014-11-21 | 49.02| 49.05| 47.57| 47.98| 42884795|MSFT |
|2014-11-20 | 48.00| 48.70| 47.87| 48.70| 21510587|MSFT |
|2014-11-19 | 48.66| 48.75| 47.93| 48.22| 26177450|MSFT |
It took a bit longer to complete this script, as I had to teach myself how to handle errors in R, and that is why I started writing articles on this blog.
I hope this article is useful.
Summarize Stock Returns from Multiple Files:
This is a slight extension of the previous two articles ( [2014-11-20-Download-Stock-Data-1](http://jaehyeon-kim.github.io/r/2014/11/20/Download-Stock-Data-1/), [2014-11-20-Download-Stock-Data-2](http://jaehyeon-kim.github.io/r/2014/11/20/Download-Stock-Data-2/) ) and it aims to produce gross returns, standard deviation and correlation of multiple shares.
The following packages are used.
{% highlight r %}
library(knitr)
library(lubridate)
library(stringr)
library(reshape2)
library(plyr)
library(dplyr)
{% endhighlight %}
The script begins with creating a data folder in the format of *data_YYYY-MM-DD*.
{% highlight r %}
# create data folder
dataDir <- paste0("data","_",format(Sys.Date(),"%Y-%m-%d"))
if(file.exists(dataDir)) {
unlink(dataDir, recursive = TRUE)
dir.create(dataDir)
} else {
dir.create(dataDir)
}
{% endhighlight %}
Given company codes, URLs and file paths are created. Then data files are downloaded by `Map`, which is a wrapper of `mapply`. Note that R's `download.file` function is wrapped by `downloadFile` so that the function does not break when an error occurs.
{% highlight r %}
# assumes codes are known beforehand
codes <- c("MSFT", "TCHC")
urls <- paste0("http://www.google.com/finance/historical?q=NASDAQ:",
codes,"&output=csv")
paths <- paste0(dataDir,"/",codes,".csv") # backward slash on windows (\)
# simple error handling in case file doesn't exists
downloadFile <- function(url, path, ...) {
# remove file if exists already
if(file.exists(path)) file.remove(path)
# download file
tryCatch(
download.file(url, path, ...), error = function(c) {
# remove file if error
if(file.exists(path)) file.remove(path)
# create error message
c$message <- paste(substr(path, 1, 4),"failed")
message(c$message)
}
)
}
# wrapper of mapply
Map(downloadFile, urls, paths)
{% endhighlight %}
Once the files are downloaded, they are read back and combined using `rbind_all`. Some more details about this step are listed below.
* only Date, Close and Code columns are taken
* codes are extracted from file paths by matching a regular expression
* data is arranged by date as the raw files are sorted in a descending order
* errors are handled by returning a dummy data frame whose Code value is "NA"
* individual data files are merged in a long format
* 'NA' is filtered out
{% highlight r %}
# read all csv files and merge
files <- dir(dataDir, full.names = TRUE)
dataList <- llply(files, function(file) {
  # get code from the file path
  pattern <- "/[A-Z][A-Z][A-Z][A-Z]"
  code <- substr(str_extract(file, pattern), 2, nchar(str_extract(file, pattern)))
  tryCatch({
    data <- read.csv(file, stringsAsFactors = FALSE)
    # first column's name is funny
    names(data) <- c("Date", "Open", "High", "Low", "Close", "Volume")
    data$Date <- dmy(data$Date)
    data$Close <- as.numeric(data$Close)
    data$Code <- code
    # optional
    data$Open <- as.numeric(data$Open)
    data$High <- as.numeric(data$High)
    data$Low <- as.numeric(data$Low)
    data$Volume <- as.integer(data$Volume)
    # select only 'Date', 'Close' and 'Code'
    # raw data should be arranged in an ascending order
    arrange(subset(data, select = c(Date, Close, Code)), Date)
  },
  error = function(c) {
    c$message <- paste(code, "failed")
    message(c$message)
    # return a dummy data frame so the loop doesn't break
    data <- data.frame(Date = dmy(format(Sys.Date(), "%d%m%Y")), Close = 0, Code = "NA")
    data
  })
}, .progress = "text")
# data is combined to create a long format
# dummy data frame values are filtered out
data <- filter(rbind_all(dataList), Code != "NA")
{% endhighlight %}
Some values of this long-format data are shown below.
|Date | Close|Code |
|:----------|-----:|:----|
|2013-11-29 | 38.13|MSFT |
|2013-12-02 | 38.45|MSFT |
|2013-12-03 | 38.31|MSFT |
|2013-12-04 | 38.94|MSFT |
|2013-12-05 | 38.00|MSFT |
|2013-12-06 | 38.36|MSFT |
The data is converted into a wide format where the x and y variables are Date and Code respectively (`Date ~ Code`), while the value variable is Close (`value.var="Close"`). Some values of the wide-format data are shown below.
{% highlight r %}
# data is converted into a wide format
data <- dcast(data, Date ~ Code, value.var="Close")
kable(head(data))
{% endhighlight %}
|Date | MSFT| TCHC|
|:----------|-----:|-----:|
|2013-11-29 | 38.13| 13.52|
|2013-12-02 | 38.45| 13.81|
|2013-12-03 | 38.31| 13.48|
|2013-12-04 | 38.94| 13.71|
|2013-12-05 | 38.00| 13.55|
|2013-12-06 | 38.36| 13.95|
The remaining steps take the log of the close prices, difference them to obtain daily log returns, and then apply `sum`, `sd`, and `cor`.
{% highlight r %}
# select except for Date column
data <- select(data, -Date)
# apply log difference column wise
dailyRet <- apply(log(data), 2, diff, lag=1)
# obtain daily return, variance and correlation
returns <- apply(dailyRet, 2, sum, na.rm = TRUE)
std <- apply(dailyRet, 2, sd, na.rm = TRUE)
correlation <- cor(dailyRet)
returns
{% endhighlight %}
{% highlight text %}
## MSFT TCHC
## 0.2249777 0.6293973
{% endhighlight %}
{% highlight r %}
std
{% endhighlight %}
{% highlight text %}
## MSFT TCHC
## 0.01167381 0.03203031
{% endhighlight %}
{% highlight r %}
correlation
{% endhighlight %}
{% highlight text %}
## MSFT TCHC
## MSFT 1.0000000 0.1481043
## TCHC 0.1481043 1.0000000
{% endhighlight %}
Finally the data folder is deleted.
{% highlight r %}
# delete data folder
if(file.exists(dataDir)) { unlink(dataDir, recursive = TRUE) }
{% endhighlight %}
I am trying to concatenate text files from a URL, but I don't know how to handle the HTML and the different folders.
This is the code I tried, but it only lists the text files mixed in with a lot of HTML code. How do I fix this so that I can combine the text files into one CSV file?
library(RCurl)
library(readr)

url <- "http://weather.ggy.uga.edu/data/daily/"
dir <- getURL(url, dirlistonly = TRUE)
filenames <- unlist(strsplit(dir, "\n")) # split into filenames

# append the files one after another
for (i in 1:length(filenames)) {
  file <- paste(url, filenames[i], sep = "") # build the full URL
  if (i == 1) {
    cp <- read_delim(file, delim = ",", col_names = FALSE)
  } else {
    temp <- read_delim(file, delim = ",", col_names = FALSE)
    cp <- rbind(cp, temp) # append to the existing data
    rm(temp)              # remove the temporary object
  }
}
Here is a code snippet that works for me. I like to use rvest over RCurl, just because that's what I've learned. In this case, I was able to use the html_nodes function to isolate each link ending in .txt. The resulting table has the times saved as character strings, but you could fix that later. Let me know if you have any questions.
library(rvest)
library(readr)
url <- "http://weather.ggy.uga.edu/data/daily/"
doc <- xml2::read_html(url)
text <- rvest::html_text(rvest::html_nodes(doc, "tr td a:contains('.txt')"))
# define column types for the fixed-width data ("c" = character, "n" = number)
ctypes <- paste0("c", paste0(rep("n", 11), collapse = ""))
data <- data.frame()
for (i in 1:2) { # first two files as a demo; use seq_along(text) for all of them
  file <- paste0(url, text[i])
  date <- as.Date(read_lines(file, n_max = 1), "%m/%d/%y")
  # read the file once to determine the column widths
  columns <- fwf_empty(file, skip = 3)
  # manually expand the `solar` column to be 3 spaces wider
  columns$begin[8] <- columns$begin[8] - 3
  data <- rbind(data, cbind(date, read_fwf(file, columns,
                                           skip = 3, col_types = ctypes)))
}
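Since the end goal was a single CSV, you can write the combined table out once the loop finishes (the output file name below is just an example):

# write the combined data to one CSV file (the file name is an example)
write.csv(data, "weather_daily_combined.csv", row.names = FALSE)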
I would like to read the list of .html files to extract data. I appreciate your help.
library(rvest)
library(XML)
library(stringr)
library(data.table)
library(RCurl)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- ("C:/R/BNB/")
pages <- html_text(html_node(u1, ".results_count"))
Total_Pages <- substr(pages, 4, 7)
TP <- as.numeric(Total_Pages)
# read each results page and write it as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page=/", i, sep = "")
  download.file(url, paste(download_folder, i, ".html", sep = ""))
  # create html object
  html <- html(paste(download_folder, i, ".html", sep = ""))
}
Here is a potential solution:
library(rvest)
library(stringr)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- paste0(getwd(), "/") # note change in output directory
TP <- max(as.integer(html_text(html_nodes(u1, "a.page-numbers"))), na.rm = TRUE)

# read each results page and write it as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page/", i, "/", sep = "")
  print(url)
  download.file(url, paste(download_folder, i, ".html", sep = ""))
  # create html object
  html <- read_html(paste(download_folder, i, ".html", sep = ""))
}
I could not find the .results_count class in the HTML, so instead I looked for the page-numbers class and picked the highest returned value.
Also, the html() function is deprecated, so I replaced it with read_html().
Good luck
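If it helps, once the pages are saved you can loop back over them and pull text out with html_nodes. Below is a rough sketch; the "h3" selector is only a guess, so inspect the page source to find the right node for the fields you want:

# rough sketch: read the saved pages back and extract some text
# (the "h3" selector is an assumption -- check the page source for the right one)
pages <- paste(download_folder, 1:TP, ".html", sep = "")
job_titles <- unlist(lapply(pages, function(p) {
  html_text(html_nodes(read_html(p), "h3"))
}))
head(job_titles)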