This is a newbie question in R. I am downloading Yahoo Finance monthly stock price data using R, where the ticker names are read from a text file. I am using a loop over the ticker names to download the data and put it in a list. My problem is that some ticker names may not be correct, so my code stops when it encounters such a case. I want the following:
1. Skip the ticker name if it is not correct.
2. Each element in the list is a data frame; I want the ticker names appended to the variable names in each element's data frame.
3. An efficient way to create a data frame that has the closing prices as variables.
Here is the sample code for the simplified version of my problem.
library(tseries)
tckk <- c("MSFT", "C", "VIA/B", "MMM") # ticker names defined
numtk <- length(tckk);
ustart <- "2000-12-30";
uend <- "2007-12-30" # start and end date
all_dat <- list(); # empty list to fill in the data
for(i in 1:numtk)
{
all_dat[[i]] <- xxx <- get.hist.quote(instrument = tckk[i], start=ustart, end=uend, quote = c("Open", "High", "Low", "Close"), provider = "yahoo", compression = "m")
}
The code stops at the third entry, but I want to skip this ticker and move on to "MMM". I have heard about the tryCatch() function but do not know how to use it.
As for question 2, I want the variable names for the first element of the list to be "MSFTopen", "MSFThigh", "MSFTlow", and "MSFTclose". Is there a better way to do it than a combination of a loop and the paste() function?
Finally, for question 3, I need a dataframe with three columns corresponding to closing prices. Again, I am trying to avoid a loop here.
Thank you.
Your best bet is to use quantmod and store the results as a time series (in this case, it will be xts):
library(quantmod)
library(plyr)
symbols <- c("MSFT","C","VIA/B","MMM")
#1
l_ply(symbols, function(sym) try(getSymbols(sym)))
symbols <- symbols[symbols %in% ls()]
#2
sym.list <- llply(symbols, get)
#3
data <- xts()
for(i in seq_along(symbols)) {
symbol <- symbols[i]
data <- merge(data, get(symbol)[,paste(symbol, "Close", sep=".")])
}
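If you also want the ticker prefixed onto each field name, as asked in question 2, here is a hedged sketch; it assumes each element of sym.list is an xts object whose columns follow quantmod's usual SYMBOL.Field naming (e.g. "MSFT.Open"):
sym.list <- lapply(sym.list, function(x) {
  sym <- sub("\\..*$", "", colnames(x)[1])                              # e.g. "MSFT" from "MSFT.Open"
  colnames(x) <- paste0(sym, tolower(sub("^.*\\.", "", colnames(x))))   # "MSFTopen", "MSFThigh", ...
  x
})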
This is also a little late... If you want to grab data with just R's base functions, without dealing with any add-on packages, use the function read.csv(URL), where the URL is a string pointing to the right place at Yahoo. The data will be pulled in as a data frame, and you will need to convert the 'Date' column from a string to a Date type for any plots to look nice. A simple code snippet is below.
URL <- "http://ichart.finance.yahoo.com/table.csv?s=SPY"
dat <- read.csv(URL)
dat$Date <- as.Date(dat$Date, "%Y-%m-%d")
Using R's base functions may give you more control over the data manipulation.
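To tie this back to the original question, here is a hedged sketch that loops over several tickers, skips any that fail with try(), and merges just the closing prices; note the ichart URL no longer resolves (see the note further down), so treat this purely as a pattern:
tickers <- c("MSFT", "C", "MMM")
close.list <- lapply(tickers, function(tk) {
  url <- paste0("http://ichart.finance.yahoo.com/table.csv?s=", tk)
  res <- try(read.csv(url), silent = TRUE)          # skip tickers that fail
  if (inherits(res, "try-error")) return(NULL)
  out <- data.frame(Date = as.Date(res$Date), Close = res$Close)
  names(out)[2] <- tk                               # one closing-price column per ticker
  out
})
closes <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE),
                 Filter(Negate(is.null), close.list))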
I'm a little late to the party, but I think this will be very helpful to other latecomers.
The stockSymbols function in TTR fetches instrument symbols from nasdaq.com, and adjusts the symbols to be compatible with Yahoo! Finance. It currently returns ~6,500 symbols for AMEX, NYSE, and NASDAQ. You could also take a look at the code in stockSymbols that adjusts tickers to be compatible with Yahoo! Finance to possibly adjust some of the tickers in your file.
NOTE: stockSymbols in the version of TTR on CRAN is broken due to a change on nasdaq.com, but it is fixed in the R-forge version of TTR.
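For example, a hedged sketch of the idea, assuming your ticker file has one symbol per line (the file name here is made up) and that stockSymbols() still returns a data frame with a Symbol column:
library(TTR)
all.syms   <- stockSymbols()                  # listings for AMEX, NYSE and NASDAQ
my.tickers <- readLines("tickers.txt")        # hypothetical ticker file
valid      <- my.tickers[my.tickers %in% all.syms$Symbol]
valid                                         # only these get passed to the download loop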
I do it like this, because I need to have the historical price list and a daily update file in order to run other packages:
library(fImport)
fecha1 <- "03/01/2009"
fecha2 <- "02/02/2010"
Sys.time()
y <- format(Sys.time(), "%y")
m <- format(Sys.time(), "%m")
d <- format(Sys.time(), "%d")
fecha3 <- paste(c(m, "/", d, "/", "20", y), collapse = "")  # today's date, mm/dd/yyyy
write.table(yahooSeries("GCI", from=fecha1, to=fecha2), file = "GCI.txt", sep="\t", quote = FALSE, eol="\r\n", row.names = TRUE)
write.table(yahooSeries("GCI", from=fecha2, to=fecha3), file = "GCIupdate.txt", sep="\t", quote = FALSE, eol="\r\n", row.names = TRUE)
GCI <- read.table("GCI.txt")
GCI1 <- read.table("GCIupdate.txt")
GCI <- rbind(GCI1, GCI)
GCI <- unique(GCI)
write.table(GCI, file = "GCI.txt", sep="\t", quote = FALSE, eol="\r\n", row.names = TRUE)
If your ultimate goal is to get the data.frame of three columns of closing prices, then the new package tidyquant may be better suited for this.
library(tidyquant)
symbols <- c("MSFT", "C", "VIA/B", "MMM")
# Download data in tidy format.
# Will remove VIA/B and warn you.
data <- tq_get(symbols)
# Ticker symbols as column names for closing prices
data %>%
select(.symbol, date, close) %>%
spread(key = .symbol, value = close)
This will scale to any number of stocks, so the file of 1000 tickers should work just fine!
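For instance, a short hedged sketch of how that might look with a ticker file (the file name is an assumption; one symbol per line):
tickers <- readLines("tickers.txt")              # hypothetical file of ticker names
data <- tq_get(tickers, get = "stock.prices")    # invalid tickers are dropped with a warning
# then reshape to wide closing prices exactly as shown above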
Slightly modified from the above solutions... (thanks Shane and Stotastic)
symbols <- c("MSFT", "C", "MMM")
# 1. retrieve data
for(i in seq_along(symbols)) {
URL <- paste0("http://ichart.finance.yahoo.com/table.csv?s=", symbols[i])
dat <- read.csv(URL)
dat$Date <- as.Date(dat$Date, "%Y-%m-%d")
assign(paste0(symbols[i], "_data"), dat)  # creates MSFT_data, C_data, MMM_data
dat <- NULL
}
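If you later want those per-ticker data frames gathered into a single list, mget() can collect them (a small sketch, assuming the "_data" suffix used above):
all_dat <- mget(paste0(symbols, "_data"))   # list with elements MSFT_data, C_data, MMM_data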
Unfortunately, the URL "ichart.finance.yahoo.com" is dead now. As far as I know, Yahoo closed it and it seems it will not be reopened.
Several days ago I found a nice alternative (https://eodhistoricaldata.com/) with an API very similar to Yahoo Finance.
Basically, for the R script described above you just need to change this part:
URL <- paste0("ichart.finance.yahoo.com/table.csv?s=", symbols[i])
to this:
URL <- paste0("eodhistoricaldata.com/api/table.csv?s=", symbols[i])
Then add an API key and it will work the same way as before. It saved me a lot of time in my R scripts.
Maybe give the BatchGetSymbols library a try. What I like about it over quantmod is that you can specify a time period for your data.
library(BatchGetSymbols)
# set dates
first.date <- Sys.Date() - 60
last.date <- Sys.Date()
freq.data <- 'daily'
# set tickers
tickers <- c('FB','MMM','PETR4.SA','abcdef')
l.out <- BatchGetSymbols(tickers = tickers,
first.date = first.date,
last.date = last.date,
freq.data = freq.data,
cache.folder = file.path(tempdir(),
'BGS_Cache') ) # cache in tempdir()
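The return value is a list; based on the package documentation at the time of writing (names may differ in your version, so treat this as a sketch), df.control reports which tickers succeeded and df.tickers holds the prices in long format:
l.out$df.control                              # per-ticker download status ('abcdef' should show as failed)
library(reshape2)                             # assumption: reshape2 is installed, used only for dcast()
closes <- dcast(l.out$df.tickers, ref.date ~ ticker, value.var = "price.close")
head(closes)                                  # wide table of closing prices, one column per ticker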
Related
I am downloading data on stocks from Yahoo Finance with the tseries package. The issue is that I am not getting the most recent date: the last price is always from two days ago.
Below is my code; can you please advise what I should correct to get all the available prices?
Thank you!
dir <- "D:/Yahoo stock prices" # location
setwd(dir)
# Packages needed
require(tseries)
require(zoo)
YH <- read.csv2(file="SBI.csv",header=T, sep=";", dec=".")
date <- "2012-09-20"
penny_stocks <- c("SMDS.L", "MNDI.L", "SKG.L")
prices <- NULL
for(i in 1:length(YH[,1])){
  prices <- try(get.hist.quote(as.character(YH[i,1]),
                               start = date,
                               quote = 'Open'),
                silent = TRUE)
  if(!is.character(prices)){
    if(as.character(YH[i,1]) %in% penny_stocks) prices <- prices / 100
    prices <- as.data.frame(prices)
    prices <- cbind(rownames(prices), prices)
    colnames(prices) <- c("date", as.character(YH[i,1]))
    if(length(prices) > 1){
      if(i == 1){
        allprices <- prices
        names <- c("date", as.character(YH[i,1]))
      } else {
        names <- append(colnames(allprices), as.character(YH[i,1]))
        allprices <- merge(allprices, prices, by = "date", all.x = TRUE)
        colnames(allprices) <- names
      }
    }
  }
}
write.csv2(allprices,"Prices 200511.csv")
warnings()
At the time of writing, the data available on the Yahoo site runs up to 2020-05-12. You need to specify the end date explicitly, because tseries defaults it to Sys.Date() - 1. So tseries::get.hist.quote("SMDS.L", end = Sys.Date(), quote = "Open") will return the data up to 2020-05-12. You would expect the default to be good enough, but there are a lot of issues with the Yahoo data and getting the correct last records when the exchange is not located in the US; there is probably a process in place that loads the data a day after closing.
Note that the default settings of tseries::get.hist.quote differ slightly from those of the underlying call to quantmod::getSymbols: tseries uses Sys.Date() - 1 as the end date where quantmod uses Sys.Date(), and the start dates also differ, with tseries using "1991-01-02" and quantmod "2007-01-01".
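A minimal sketch of the fix, assuming all you need is the explicit end date (the same argument can be added inside the loop above):
library(tseries)
smds <- get.hist.quote("SMDS.L",
                       start = "2012-09-20",
                       end   = Sys.Date(),   # explicit end instead of the Sys.Date() - 1 default
                       quote = "Open")
tail(smds)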
I want to extract data from the OECD website, particularly the dataset "REGION_ECONOM" with the dimensions "GDP" (GDP of the respective regions) and "POP_AVG" (the average population of the respective region).
This is the first time I am doing this:
I picked all the required dimensions on the OECD website and copied the SDMX (XML) link.
I tried to load them into R and convert them to a data frame with the following code:
(in the link I replaced the list of all regions with "ALL" as otherwise the link would have been six pages long)
if (!require(rsdmx)) install.packages('rsdmx'); library(rsdmx)
url2 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.ALL.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx2 <- readSDMX(url2)
stats2 <- as.data.frame(sdmx2)
head(stats2)
Unfortunately, this returns a "400 Bad request" error.
When just selecting a couple of regions the error does not appear:
if (!require(rsdmx)) install.packages('rsdmx'); library(rsdmx)
url1 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.AUS+AU1+AU101+AU103+AU104+AU105.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx1 <- readSDMX(url1)
stats1 <- as.data.frame(sdmx1)
head(stats1)
I also tried to use the "OECD" package to get the data. There I had the same problem. ("400 Bad Request")
if (!require(OECD)) install.packages('OECD'); library(OECD)
df1<-get_dataset("REGION_ECONOM", filter = "GDP+POP_AVG",
start_time = 2008, end_time = 2009, pre_formatted = TRUE)
However, when I use the package for other data sets it does work:
df <- get_dataset("FTPTC_D", filter = "FRA+USA", pre_formatted = TRUE)
Does anyone know where my mistake could lie?
The SDMX-ML API does not seem to work as explained (using the ALL parameter), whereas the JSON API works just fine. The following query returns the values for all countries as JSON; I simply replaced the ALL in the region position with an empty field.
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
Transforming it to a readable format is not so trivial. I played around a bit to find the following work-around:
# send a GET request using httr; jsonlite provides parse_json()
library(httr)
library(jsonlite)
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
dat_raw <- GET(query)
dat_parsed <- parse_json(content(dat_raw, "text")) # parse the JSON content
Next, access the observations from the nested list and transform them to a matrix. Also extract the features from the keys:
dat_obs <- dat_parsed[["dataSets"]][[1]][["observations"]]
dat0 <- do.call(rbind, dat_obs) # get a matrix
new_features <- matrix(as.numeric(do.call(rbind, strsplit(rownames(dat0), ":"))), nrow = nrow(dat0))
dat1 <- cbind(new_features, dat0) # add feature columns
dat1_df <- as.data.frame(dat1) # optionally transform to data frame
Finally, you want to find out about the keys. Those are hidden in the "structure". This one you also need to parse correctly, so I wrote a function to make it easier to extract the values and ids:
## Get keys of features
keys <- dat_parsed[["structure"]][["dimensions"]][["observation"]]
for (i in 1:length(keys)) print(paste("id position:", i, "is feature", keys[[i]]$id))
# apply keys
get_features <- function(data_input, keys_input, feature_index, value = FALSE) {
keys_temp <- keys_input[[feature_index]]$values
keys_temp_matrix <- do.call(rbind, keys_temp)
keys_temp_out <- keys_temp_matrix[, value + 1][unlist(data_input[, feature_index])+1] # column 1 is id, 2 is value
return(unlist(keys_temp_out))
}
head(get_features(dat1_df, keys, 7))
head(get_features(dat1_df, keys, 2, value = FALSE))
head(get_features(dat1_df, keys, 2, value = TRUE))
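If it is useful, here is one hedged way to attach the decoded features back onto the data frame; the indices 2 and 7 just follow the id-position printout above, and the column labels are my own assumptions about this particular dataset:
dat1_df$region <- get_features(dat1_df, keys, 2, value = TRUE)
dat1_df$year   <- get_features(dat1_df, keys, 7)
head(dat1_df)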
I hope that helps you in your project.
Best, Tobias
For all you Bloomberg and R users out there:
I usually have no problem pulling Bloomberg data into R via the Rblpapi package, but have run across an issue when trying to pull index-level data.
The problem is that the code below returns erroneous results: it begins pulling data in 1986 (not 1950) and leaves many values NA that should be populated. Using the Excel API, the data pulls in fine, but I need to add "days = a" for some of the fields since they don't begin until after 1950.
Reproducible example (assuming you have Bloomberg access):
# Load packages ----------------------------------------------------------
library("Rblpapi")
library("tidyverse")
library("lubridate")
# Connect to Bloomberg --------------------------------------------------
blpConnect()
# Pull equity index-level specific data over time for S&P 500, S&P Mid Cap (400) and S&P Small Cap (600) indices ----------------------
# Index tickers
tickers <- c("SPX Index", "MID Index", "SML Index")
# Bloomberg inputs
myField <- c("PX_LAST", "TRAIL_12M_EPS", "TRAIL_12M_DILUTED_EPS", "BEST_EPS", "PE_RATIO", "BEST_PE_RATIO",
"TRAIL_12M_EBITDA_PER_SHARE", "PX_TO_EBITDA", "PX_TO_BOOK_RATIO", "PX_TO_SALES_RATIO",
"PX_TO_FREE_CASH_FLOW", "EQY_DVD_YLD_12M", "TOT_DEBT_TO_EBITDA", "EV_TO_T12M_SALES", "EV_TO_T12M_EBITDA",
"TRAIL_12M_GROSS_MARGIN", "EBITDA_MARGIN", "TRAIL_12M_OPER_MARGIN", "TRAIL_12M_PROF_MARGIN",
"RETURN_ON_ASSET", "RETURN_COM_EQY", "RETURN_ON_CAP", "NET_DEBT_TO_EBITDA", "CUR_MKT_CAP", "AVERAGE_MARKET_CAP"
)
# Pull data
sp_indices_fundmtls_raw <- as.data.frame(bdh(tickers,
myField,
start.date = as.Date("1950-01-01"),
end.date = Sys.Date(),
include.non.trading.days = TRUE
)
)
Since this didn't work, I tried pulling the data using SPX Index only. Same issue. I then tried the pull with fewer fields:
# Bloomberg inputs
myField <- c("PX_LAST", "TRAIL_12M_EPS", "TRAIL_12M_DILUTED_EPS", "BEST_EPS", "PE_RATIO", "BEST_PE_RATIO",
"TRAIL_12M_EBITDA_PER_SHARE", "PX_TO_EBITDA", "PX_TO_BOOK_RATIO", "PX_TO_SALES_RATIO",
"PX_TO_FREE_CASH_FLOW", "EQY_DVD_YLD_12M",
"TOT_DEBT_TO_EBITDA", "EV_TO_T12M_SALES", "EV_TO_T12M_EBITDA"
)
That worked better, but the data still started in 1964, not 1950. Again, the Excel API works fine and simply returns NA where earlier data is missing, which is what I expected R to do.
This makes me think that there must be a field that needs an option or an override to pull the data correctly. I tried adding
ovrd <- c("PERIODICITY_OVERRIDE" = "D")
# Pull data
sp_indices_fundmtls_raw <- as.data.frame(bdh(tickers,
myField,
start.date = as.Date("1950-01-01"),
end.date = Sys.Date(),
include.non.trading.days = TRUE,
overrides = ovrd
)
)
But no luck.
Can anyone figure out the issue?
Thanks!
After much trial and error, I figured out a way to get the data.
I created a function to pull the data:
# Function to pull data
sp_indices_pull_fx <- function(myField, index_ticker) {
  df <- as.data.frame(bdh(index_ticker,
                          myField,
                          start.date = as.Date("1950-01-01"),
                          end.date = Sys.Date(),
                          include.non.trading.days = TRUE
                          )
                      )
  df
}
Then I used lapply to cycle through each field, one index ticker at a time. For example:
# SP500
sp_500_pull <- lapply(myField, sp_indices_pull_fx, index_ticker = "SPX Index")
Then I combined those results into a single data frame:
# Merge
sp_500_fundmtls_raw = Reduce(function(...) merge(..., all = TRUE), sp_500_pull)
So in short, what worked was creating a function and calling it for one field and one index ticker at a time, as opposed to trying to pull multiple tickers and fields at once with a single bdh call.
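As a hedged follow-up sketch, the same per-field pull can be repeated for the other two indices and kept in a named list; this just reuses sp_indices_pull_fx and myField as defined above:
index_tickers <- c("SPX Index", "MID Index", "SML Index")
all_pulls <- lapply(index_tickers, function(tk) {
  pulls <- lapply(myField, sp_indices_pull_fx, index_ticker = tk)
  Reduce(function(...) merge(..., all = TRUE), pulls)   # one wide data frame per index
})
names(all_pulls) <- index_tickers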
I would like to download daily data from yahoo for the S&P 500, the DJIA, and 30-year T-Bonds, map the data to the proper time zone, and merge them with my own data. I have several questions.
My first problem is getting the tickers right. From yahoo's website, it looks like the tickers are: ^GSPC, ^DJI, and ^TYX. However, ^DJI fails. Any idea why?
My second problem is that I would like to constrain the time zone to GMT (I would like to ensure that all my data is on the same clock, and GMT seems like a neutral choice), but I couldn't get it to work.
My third problem is that I would like to merge the yahoo data with my own data, obtained by other means and available in a different format. It is also daily data.
Here is my attempt at constraining the data to the GMT time zone. Executed at the top of my R script.
Sys.setenv(TZ = "GMT")
# > Sys.getenv("TZ")
# [1] "GMT"
# the TZ variable is properly set
# but does not affect the time zone in zoo objects, why?
Here is my code to get the yahoo data:
library("tseries")
library("xts")
date.start <- "1999-12-31"
date.end <- "2013-01-01"
# tickers <- c("GSPC","TYX","DJI")
# DJI Fails, why?
# http://finance.yahoo.com/q?s=%5EDJI
tickers <- c("GSPC","TYX") # proceed without DJI
z <- zoo()
index(z) <- as.Date(format(time(z)),tz="")
for ( i in 1:length(tickers) )
{
cat("Downloading ", i, " out of ", length(tickers) , "\n")
x <- try(get.hist.quote(
instrument = paste0("^",tickers[i])
, start = date.start
, end = date.end
, quote = "AdjClose"
, provider = "yahoo"
, origin = "1970-01-01"
, compression = "d"
, retclass = "zoo"
, quiet = FALSE )
, silent = FALSE )
print(x[1:4]) # check that it's not empty
colnames(x) <- tickers[i]
z <- try( merge(z,x), silent = TRUE )
}
Here is the dput(head(df)) of my dataset:
df <- structure(list(A = c(-0.011489000171423, -0.00020300000323914,
0.0430639982223511, 0.0201549995690584, 0.0372899994254112, -0.0183669999241829
), B = c(0.00110999995376915, -0.000153000000864267, 0.0497750006616116,
0.0337960012257099, 0.014121999964118, 0.0127800004556775), date = c(9861,
9862, 9863, 9866, 9867, 9868)), .Names = c("A", "B", "date"
), row.names = c("0001-01-01", "0002-01-01", "0003-01-01", "0004-01-01",
"0005-01-01", "0006-01-01"), class = "data.frame")
I'd like to merge the data in df with the data in z. I can't seem to get it to work.
I am new to R and very much open to your advice about efficiency, best practice, etc. Thanks.
EDIT: SOLUTIONS
On the first problem: following GSee's suggestions, the Dow Jones Industrial Average data may be downloaded with the quantmod package. Instead of the "^DJI" ticker, which is no longer available from Yahoo, use the "DJIA" ticker; note that there is no caret in "DJIA".
On the second problem, Joshua Ulrich points out in the comments that "Dates don't have timezones because days don't have a time component."
On the third problem: The data frame appears to have corrupted dates, as pointed out by agstudy in the comments.
My solutions rely on the quantmod package and the attached zoo/xts packages:
library(quantmod)
Here is the code I have used to get proper dates from my csv file:
toDate <- function(x) { as.Date(as.character(x), format = "%Y%m%d") }
dtz <- read.zoo("myData.csv"
, header = TRUE
, sep = ","
, FUN = toDate
)
dtx <- as.xts(dtz)
The dates in the csv file were stored in a single column in the format "19861231". The key to getting correct dates was to wrap the date in "as.character()". Part of this code was inspired by R - Stock market data from csv to xts. I also found the zoo/xts manuals helpful.
I then extract the date range from this dataset:
date.start <- start(dtx)
date.end <- end(dtx)
I will use those dates with quantmod's getSymbols function so that the other data I download will cover the same period.
Here is the code I have used to get all three tickers.
tickers <- c("^GSPC","^TYX","DJIA")
data <- new.env() # the data environment will store the data
do.call(cbind, lapply( tickers
, getSymbols
, from = date.start
, to = date.end
, env = data # data saved inside an environment
)
)
ls(data) # see what's inside the data environment
data$GSPC # access a particular ticker
Also note, as GSee pointed out in the comments, that the option auto.assign=FALSE cannot be used in conjunction with the option env=data (otherwise the download fails).
A big thank you for your help.
Yahoo doesn't provide historical data for ^DJI. Currently, it looks like you can get the same data by using the ticker "DJIA", but your mileage may vary.
It does work in this case because you're only dealing with Dates.
The df object you provided is yearly data beginning in the year 0001, so that's probably not what you wanted.
Here's how I would fetch and merge those series (or use an environment and only make one call to getSymbols)
library(quantmod)
do.call(cbind, lapply(c("^GSPC", "^TYX"), getSymbols, auto.assign=FALSE))
Is it possible to get the publication date of CRAN packages from within R? I would like to get a list of the k most recently published CRAN packages, or alternatively all packages published after date dd-mm-yy, similar to the information on the available_packages_by_date.html page.
The available.packages() command has a "fields" argument, but this only extracts fields from the DESCRIPTION. The date field on the package description is not always up-to-date.
I can get it with a smart regex from the HTML page, but I am not sure how reliable and up-to-date this HTML file is... At some point Kurt might decide to give the layout a makeover, which would break the script. An alternative is to use timestamps from the CRAN FTP, but I am also not sure how good that solution is. I am not sure whether there is a formally structured file with publication dates somewhere; I assume the HTML page is automatically generated from some DB.
It turns out there is an undocumented file, "packages.rds", which contains the publication dates (not times) of all packages. I suppose these data are used to recreate the HTML file every day.
Below is a simple function that extracts publication dates from this file:
recent.packages.rds <- function(){
mytemp <- tempfile();
download.file("http://cran.r-project.org/web/packages/packages.rds", mytemp);
mydata <- as.data.frame(readRDS(mytemp), row.names=NA);
mydata$Published <- as.Date(mydata[["Published"]]);
#sort and get the fields you like:
mydata <- mydata[order(mydata$Published),c("Package", "Version", "Published")];
return(mydata);
}
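Usage sketch: with the function above returning the sorted data frame, the k most recent packages are just the last k rows.
pkgs <- recent.packages.rds()
tail(pkgs, 10)   # the 10 most recently published packages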
The best approach is to take advantage of the fact that the package DESCRIPTION is published on the CRAN mirror, and since the DESCRIPTION comes from the built package, it contains information about exactly when it was packaged:
pkgs <- unname(available.packages()[, 1])[1:20]
desc_urls <- paste("http://cran.r-project.org/web/packages/", pkgs, "/DESCRIPTION", sep = "")
desc <- lapply(desc_urls, function(x) read.dcf(url(x)))
sapply(desc, function(x) x[, "Packaged"])
sapply(desc, function(x) x[, "Date/Publication"])
(I'm restricting it to the first 20 packages here to illustrate the basic idea)
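As a small hedged extension, the publication stamps can be turned into dates and sorted; this assumes each DESCRIPTION carries a Date/Publication field, which seems to be the case for packages built on CRAN:
pub <- setNames(sapply(desc, function(x) x[, "Date/Publication"]), pkgs)
head(sort(as.Date(sub(" .*$", "", pub)), decreasing = TRUE))   # newest first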
Here is a function that uses the HTML and regular expressions. I would still rather get the information from a more formal place, though, in case the HTML layout ever changes.
recent.packages <- function(number=10){
#html is malformed
maxlines <- number*2 + 11
mytemp <- tempfile()
if(getOption("repos") == "#CRAN#"){
repo <- "http://cran.r-project.org"
} else {
repo <- getOption("repos");
}
newurl <- paste(repo,"/web/packages/available_packages_by_date.html", sep="");
download.file(newurl, mytemp);
datastring <- readLines(mytemp, n=maxlines)[12:maxlines];
#we only find packages from after 2010-01-01
myexpr1 <- '201[0-9]-[0-9]{2}-[0-9]{2} </td> <td> <a href="../../web/packages/[a-zA-Z0-9\\.]{2,}/'
myexpr2 <- '^201[0-9]-[0-9]{2}-[0-9]{2}'
myexpr3 <- '[a-zA-Z0-9\\.]{2,}/$'
newpackages <- unlist(regmatches(datastring, gregexpr(myexpr1, datastring)));
newdates <- unlist(regmatches(newpackages, gregexpr(myexpr2, newpackages)));
newnames <- unlist(regmatches(newpackages, gregexpr(myexpr3, newpackages)));
newdates <- as.Date(newdates);
newnames <- substring(newnames, 1, nchar(newnames)-1);
returndata <- data.frame(name=newnames, date=newdates);
return(head(returndata, number));
}
So here is a solution that uses the directory listing from the FTP server. It is a little tricky because the FTP gives the date in Linux format with either a timestamp or a year. Other than that it does its job. I'm still not convinced this is reliable, though: if packages are copied over to another server, all timestamps might be reset.
recent.packages.ftp <- function(){
setwd(tempdir())
download.file("ftp://cran.r-project.org/pub/R/src/contrib/", destfile=tempfile(), method="wget", extra="--no-htmlify");
#because of --no-htmlify the destfile argument does not work
datastring <- readLines(".listing");
unlink(".listing");
myexpr1 <- "(?<date>[A-Z][a-z]{2} [0-9]{2} [0-9]{2}:[0-9]{2}) (?<name>[a-zA-Z0-9\\.]{2,})_(?<version>[0-9\\.-]*).tar.gz$"
matches <- gregexpr(myexpr1, datastring, perl=TRUE);
packagelines <- as.logical(sapply(regmatches(datastring, matches), length));
#subset proper lines
matches <- matches[packagelines];
datastring <- datastring[packagelines];
N <- length(matches)
#from the ?regexpr manual
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
#parse all records
mydf <- data.frame(date=rep(NA, N), name=rep(NA, N), version=rep(NA,N))
for(i in 1:N){
mydf[i,] <- parse.one(datastring[i], matches[[i]]);
}
row.names(mydf) <- NULL;
#convert dates
mydf$date <- strptime(mydf$date, format="%b %d %H:%M");
#So Linux only displays timestamps for packages less than six months old.
#However strptime will assume the current year for packages that don't have a timestamp.
#Therefore for dates that are in the future, we subtract a year. We can use some margin for timezones.
infuture <- (mydf$date > Sys.time() + 31*24*60*60);
mydf$date[infuture] <- mydf$date[infuture] - 365*24*60*60;
#sort and return
mydf <- mydf[order(mydf$date),];
row.names(mydf) <- NULL;
return(mydf);
}
You could process the page http://cran.r-project.org/src/contrib/, and split the fields by whitespace in order to obtain the fully specified package source filename, which includes the version number and the .tar.gz suffix.
There are a few other items in the list that are not package files, such as the .rds files, various subdirectories, and so on.
Barring changes in how the directory structure is presented or the locations of the files, I can't think of anything more authoritative than this.
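A rough hedged sketch of that idea, using a regular expression on the listing rather than a plain whitespace split, might look like this:
listing  <- readLines("http://cran.r-project.org/src/contrib/")
tarballs <- unlist(regmatches(listing,
                              regexpr("[A-Za-z0-9.]+_[0-9][0-9.-]*\\.tar\\.gz", listing)))
head(tarballs)   # e.g. "abc_1.3-0.tar.gz"; everything else in the listing is ignored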