I made an empty matrix to populate with time series data and forecasts using
pred <- matrix(rep(NA,80),20,4)
But when I try to populate the matrix with a for loop, I get this error message: "Error in pred[i, 2] <- forecast(fit.season, h = 1) : number of items to replace is not a multiple of replacement length"
beer1 <- window(ausbeer, start=1990,end=c(2009,4))
n.end <- 2004.75 # 2004Q4
fit.season <- tslm(beer1 ~ season, data=beer1)
fit.trend <- tslm(beer1 ~ season + trend, data=beer1)
for(i in 1:20){
tmp0 <- 1990
tmp1 <- n.end+(i-1)*.25
tmp <- window(beer1,tmp0,tmp1)
pred[i,1] <- window(beer1,tmp1+.25,tmp1+.25) # actual data
# compute forecasts
pred[i,2] <- forecast(fit.season, h=1)
pred[i,3] <- forecast(fit.trend, h=1)
}
I understand the error to mean that the lengths don't match, so I checked the matrix and found that only the first element (row 1, column 1) was populated.
My window seems fine, so I tried another set of functions in the loop.
for(i in 1:20){
tmp0 <- 1992
tmp1 <- n.end+(i-1)*.25
tmp <- window(beer1,tmp0,tmp1)
pred[i,1] <- window(beer1,tmp1+.25,tmp1+.25) # actual
# compute forecasts
pred[i,2] <- meanf(tmp, h=1)$mean
pred[i,3] <- rwf(tmp, h=1)$mean
pred[i,4] <- snaive(tmp, h=1)$mean
}
And the whole matrix was populated.
What's wrong with the one I initially did?
The forecast function returns an object of class forecast (a list with several components), not a single number, so it cannot be stored in one matrix cell. Replace the last two lines in your loop with
pred[i,2] <- forecast(fit.season, h=1)$mean
pred[i,3] <- forecast(fit.trend, h=1)$mean
to extract just the point forecasts.
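To see why the original assignment fails, here is a minimal base R sketch. The `fake_forecast` list is a hypothetical stand-in for a forecast object (an assumption for illustration, so the forecast package is not needed); a real forecast object has more components, but the principle is the same.

```r
# Hypothetical stand-in for a forecast object: a list with several components.
fake_forecast <- list(mean = 42.5, lower = c(40.1, 38.9), upper = c(44.9, 46.1))

pred <- matrix(NA_real_, nrow = 2, ncol = 2)

# Assigning the whole list to a single cell fails: 3 items, 1 slot.
msg <- tryCatch({
  pred[1, 2] <- fake_forecast
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Assigning only the length-one point forecast works.
pred[1, 2] <- fake_forecast$mean
print(pred)
```

The second loop in the question worked for the same reason: it already used `$mean` on each forecast.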
I want to calculate the Moran index for several random phylogenetic trees from a data set.
As a result, I want a list containing the Moran index values and the tree selected for each calculation (one per loop iteration).
I need to know which tree was used for each Moran index.
I was able to get Moran values for the random trees, but I could not find a way to save the random tree that was selected at each iteration.
Since my data set is big, I created the example below:
library(ape)
x <- rmtree(20, 20)
names(x) <- paste("t", 1:10, sep = "")
var <- as.data.frame(matrix(rnorm(20),nrow=20))
nams <- data.frame(paste0("t",1:20,sep=""))
var_list <- cbind(nams,var)
colnames(var_list) <- c("names","variable")
var_list1 <- var_list$variable
names(var_list1) <- c(var_list[,1])
# Loop to select a random tree and run the Moran index
tr <- function(x, var_list1){
resulist <- as.list(1:15)
# treelist <- as.list(as.phylo(x[[i]])) #### not working
for(i in 1:15){
tutreer <- sample(x,size=1)[[1]]
inr <- 1/cophenetic(tutreer)
diag(inr) <- 0
mran <- Moran.I(var_list1,inr,scaled = TRUE)
mran
resulist[[i]] <- list(obs=mran$observed,expect=mran$expected,
sd=mran$sd,pval=mran$p.value)
# treelist[[i]] <- list(tutreer) #### Here, not working
}
return(resulist)
# return(treelist)
}
tr(x,var_list1)
Any ideas on how I could save the tree that was selected (and used to calculate the index) at each iteration?
Why not just add it to resulist[[i]] on each iteration?
resulist[[i]] <- list(obs=mran$observed,expect=mran$expected,
sd=mran$sd,pval=mran$p.value, tree=tutreer)
Or, return the entire sampled tree, like this:
tr <- function(x, var_list1){
resulist <- as.list(1:15)
for(i in 1:15){
sampled_tree <- sample(x,size=1)
inr <- 1/cophenetic(sampled_tree[[1]])
diag(inr) <- 0
mran <- Moran.I(var_list1,inr,scaled = TRUE)
resulist[[i]] <- c(mran, list(tree = sampled_tree))
}
return(resulist)
}
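The same idea can be sketched in base R without ape. This is only an illustration: the `items` list stands in for the multiPhylo object, and `mean()` stands in for the Moran index calculation (both are assumptions, not the original method). Sampling an index instead of the object makes it easy to record the name of what was picked.

```r
set.seed(1)

# Hypothetical stand-ins for the trees: a named list of numeric vectors.
items <- setNames(lapply(1:5, function(k) rnorm(10, mean = k)),
                  paste0("t", 1:5))

run_once <- function(items) {
  idx <- sample(seq_along(items), size = 1)   # pick one element at random
  list(stat   = mean(items[[idx]]),           # placeholder for the statistic
       picked = names(items)[idx],            # name of the sampled element
       item   = items[[idx]])                 # the sampled element itself
}

# One result per iteration, each carrying the statistic and what was sampled.
results <- lapply(1:15, function(i) run_once(items))
picked  <- vapply(results, function(r) r$picked, character(1))
print(picked)
```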
I need to repeat the sampling procedure in the loop below 1000 times using a second loop.
This is the simplified, reproducible code I produced for the inner loop.
##Number of iterations
N = 8
##Store data from inner loop in vectors
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
for (I in 1:N){
PolynomialDegree [I] <- I
PMSE [I] <- I*rnorm(1)
}
Now, using a second, outer loop, I want to repeat this "sampling procedure" 1000 times and store the data from all those vectors in a single data frame. I'm struggling to write the outer loop and was hoping for some assistance.
This is my attempt with non-reproducible code; I hope it is clear what I am attempting to do.
##Set number of iterations
N <- 8
M <- 1000
##Store data
OUTPUT <- rep(1,M)
##Outer loop starts
for (J in 1:M){
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
training <- tempraindata[sample,]
testing <- tempraindata[-sample,]
##Inner loop starts
for (I in 1:N){
##Set up linear model with x polynomial of degree I x = year, y = temp
mymodel <- lm(tem ~ poly(Year, degree = I), data = training)
##fit model on testing set and save predictions
predictions <- predict(mymodel, newdata = testing, raw = FALSE)
##define and store PMSE
PMSE[I] <- (1/(nrow(tempraindata)- nrow(training)))*(sum(testing$tem-predictions))^2
PolynomialDegree [I] <- I
} ## End of inner loop
OUTPUT[J] <- ##THIS IS WHERE I WANT TO SAVE THE DATA
} ##End outer loop
I want to store all the data inside OUTPUT and make it a data frame. If done correctly, it should contain 8000 values of PMSE and 8000 values of PolynomialDegree.
Avoid the bookkeeping of initializing vectors and then assigning elements by index. Consider a single sapply (or vapply) passing both iterations to build a matrix of 8,000 elements of the PSME calculations within a 1000 X 8 structure. Every column would then be a model run (or PolynomialDegree) and every row the training/testing data pair.
## Set number of iterations
N <- 8
M <- 1000
## Defined method to generalize process
calc_PSME <- function(M, N) {
## Randomly build training/testing sets
set.seed(M+N) # TO REPRODUCE RANDOM SAMPLES
sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
training <- tempraindata[sample,]
testing <- tempraindata[-sample,]
## Set up linear model with x polynomial of degree I x = year, y = temp
mymodel <- lm(tem ~ poly(Year, degree = N), data = training)
## Fit model on testing set and save predictions
predictions <- predict(mymodel, newdata = testing, raw = FALSE)
## Return single PSME value
## (mean of the squared errors; the squaring goes inside the sum)
mean((testing$tem - predictions)^2)
}
# RETURN (1000 X 8) MATRIX: ONE ROW PER SAMPLE, ONE COLUMN PER POLYNOMIAL DEGREE
# calc_PSME RETURNS A SINGLE VALUE, SO CALL IT ONCE PER (SAMPLE, DEGREE) PAIR
PSME_matrix <- sapply(1:N, function(n) vapply(1:M, calc_PSME, numeric(1), N = n))
Should you need an 8,000-row data frame of two columns, consider reshaping to long format:
long_df <- reshape(
data.frame(PSME_matrix),
varying = 1:8,
timevar = "PolynomialDegree",
v.names = "PSME",
ids = NULL,
new.row.names = 1:1E4,
direction = "long"
)
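If the long data frame is the only goal, it can also be built directly. The sketch below reuses the question's reproducible inner loop (`I * rnorm(1)` as a placeholder for the PMSE) and is an assumption-laden illustration: `one_run` and the `Run` column are names introduced here, not part of the original code.

```r
N <- 8      # polynomial degrees per run
M <- 1000   # number of repeated runs
set.seed(42)

# One outer iteration: returns N rows, one per polynomial degree.
one_run <- function(j, N) {
  data.frame(Run              = j,
             PolynomialDegree = seq_len(N),
             PMSE             = seq_len(N) * rnorm(N))  # placeholder values
}

# Stack the M per-run data frames into a single 8000-row data frame.
OUTPUT <- do.call(rbind, lapply(seq_len(M), one_run, N = N))
str(OUTPUT)
```

In the real problem, the body of `one_run` would draw the training/testing split once and then fit the N models, so that all degrees in a run share the same split.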
OK, so I have a loop that works out the annualized / cumulative return of a stock price series.
I wish to do the same thing over many files, so I made a loop to do so.
First some dummy data:
# Create dummy data
# Use lubridate to change timestamp to date format
# Use dplyr to arrange by ascending order
# Use fread from data.table to read .csv to data frame
require(lubridate)
require(data.table)
require(dplyr)
MSFT <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
MSFT$timestamp <- ymd(MSFT$timestamp)
MSFT <- arrange(MSFT,timestamp)
AAPL <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=AAPL&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
AAPL$timestamp <- ymd(AAPL$timestamp)
AAPL <- arrange(AAPL,timestamp)
NFLX <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=NFLX&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
NFLX$timestamp <- ymd(NFLX$timestamp)
NFLX <- arrange(NFLX,timestamp)
TSLA <- fread("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=TSLA&outputsize=full&apikey=6RSYX9BPXKZVXUS9&datatype=csv")
TSLA$timestamp <- ymd(TSLA$timestamp)
TSLA <- arrange(TSLA,timestamp)
# Place data frames in a list
df.list <- list(MSFT,AAPL,NFLX,TSLA)
# Specify file names
file.names <- c("MSFT","AAPL","NFLX","TSLA")
Now that prepares the data.
Next I want to calculate the cumulative and annualized returns for each series. I place this in a function then call the function with a loop:
# Create function for performing commands.
genAnnualized = function(x){
next.file <- data.frame(df.list[[1]],stringsAsFactors=FALSE)
next.name <- paste0(file.names[i])
new.df <- data.frame(next.file)
# Calculate annualized return
# Make prices vector
prices <- new.df[, "close", drop = FALSE]
# Denote n the number of time periods:
n <- nrow(prices)
# Calculate close to close returns
# lead in with rep(NA, 1) to keep the vector the same length as the data frame
close_ret <- c(rep(NA, 1),(prices[2:n, 1] - prices[1:(n-1), 1])/prices[1:(n-1), 1])
close_ret[1] <- 0
# Compute continuously compounded returns (log returns)
close_ccret <- log(prices[2:n, 1]) - log(prices[1:(n-1), 1])
# Compute gross returns
close_gret <- 1 + close_ret # use close to close ret
# Compute future values
close_fv <- cumprod(close_gret)
# Obtain first and last values
ret.last <- tail(close_fv, n=1)
ret.first <- head(close_fv, n=1)
cum.ret <- (ret.last-ret.first)/ret.first
# Get First And Last row to calculate time between
ret.first.row <- head(new.df$timestamp, n=1)
ret.last.row <- tail(new.df$timestamp, n=1)
# Time diff
#trading.years.between <- as.numeric(difftime(as.Date(ret.last.row), as.Date(ret.first.row), unit="weeks"))/52.25
# Find time diff
ret.time <- ret.last.row - ret.first.row
ret.trading.years.between <- ret.time/365 #252 trading days or 365
ret.trading.years.between <- as.numeric(ret.trading.years.between, units="days") # Extract numerical value from time difference 'Time difference of 2837.208 days'
# Annualized return
# (1 + % diff of final) / (last balance to the power of 1/time first and last balance) -1
ret.annual.return <- (1+cum.ret) ^ (1/ret.trading.years.between) -1
########## Store annualized and cumulative return in data frame for each iteration #########
# Store file name as a row name :: next.name variable
# Store final annualized return :: ret.annual.return
# Store final cumulative return :: cum.ret
output.df <- cbind(cum.ret,ret.annual.return)
rownames(output.df) <- next.name
##################################################################
# Sanity check, use PerformanceAnalytics for annualized return
# TTR for returns
# Calculate Close-to-Close returns
require(TTR)
require(PerformanceAnalytics)
new.df$clret <- ROC(new.df$close, type = c("discrete"))
new.df$clret[1] <- 0
# Make time series object of returns and date
require(xts)
xts1 = xts(new.df$clret, order.by=as.Date(new.df$timestamp, format="%m/%d/%Y"))
Return.annualized(xts1)
Return.cumulative(xts1, geometric=TRUE)
}
And call the function to loop through each data frame in the data frame list:
for (i in 1:length(df.list)){
tryCatch({
genAnnualized(df.list[[i]])
}, error = function(e) { print(paste("i =", i, "failed:")) })
}
This should make a reproducible example.
On each iteration, I wish to store the cumulative and annualized return of each series along with the name of the data set (so it's identifiable later).
I am attempting this with the below within my function:
output.df <- cbind(cum.ret,ret.annual.return)
rownames(output.df) <- next.name
I am specifying the names with:
file.names <- c("MSFT","AAPL","NFLX","TSLA")
and then in the function calling it:
next.name <- paste0(file.names[i])
I was hoping to paste the file name so I can tag my final output in the data frame.
I think I might need to repeat the name twice when naming each row or column, so that it tags both the cumulative return and the annualized return.
I think I have the general idea, but I have been wrestling with this for a few weeks, so I am looking for some assistance.
Essentially, with the output data frame I can then organise the results into quartiles etc. for further analysis.
My simplest solution was to rbind a one-row data frame for each element of file.names, and give that row the name of the corresponding file.
I've deleted comments for clarity (and added some of my own).
genAnnualized <- function(df_list) {
next.file <- data.frame(df_list, stringsAsFactors=FALSE) # Put the parameter of the function here
next.name <- paste0(file.names[i])
new.df <- data.frame(next.file)
prices <- new.df[, "close", drop = FALSE]
n <- nrow(prices)
close_ret <- c(rep(NA, 1),(prices[2:n, 1] - prices[1:(n-1), 1])/prices[1:(n-1), 1])
close_ret[1] <- 0
close_ccret <- log(prices[2:n, 1]) - log(prices[1:(n-1), 1])
close_gret <- 1 + close_ret
close_fv <- cumprod(close_gret)
ret.last <- tail(close_fv, n=1)
ret.first <- head(close_fv, n=1)
cum.ret <- (ret.last-ret.first)/ret.first
ret.first.row <- head(new.df$timestamp, n=1)
ret.last.row <- tail(new.df$timestamp, n=1)
ret.time <- ret.last.row - ret.first.row
ret.trading.years.between <- ret.time/365
ret.trading.years.between <- as.numeric(ret.trading.years.between, units="days")
ret.annual.return <- (1+cum.ret) ^ (1/ret.trading.years.between) -1
output.df <- cbind(cum.ret,ret.annual.return)
rownames(output.df) <- next.name
##################################################################
new.df$clret <- TTR::ROC(new.df$close, type = c("discrete"))
new.df$clret[1] <- 0
xts1 = xts::xts(new.df$clret, order.by=as.Date(new.df$timestamp, format="%m/%d/%Y"))
# Create the output of the function : a named data.frame
out_df <- data.frame("Annualized Return" = PerformanceAnalytics::Return.annualized(xts1),
"Cumulative Return" = PerformanceAnalytics::Return.cumulative(xts1, geometric=TRUE))
return(out_df)
}
# Initialize the output dataframe to which we will rowbind the results
cum_ret <- data.frame()
for (i in 1:length(df.list)){
temp <- genAnnualized(df.list[[i]] )
rownames(temp) <- file.names[i]
cum_ret <- rbind.data.frame(cum_ret, temp)
}
This gives a data frame with number of named rows equal to the number of
files in df.list and 2 columns for the annualized and cumulative returns.
> cum_ret
     Annualized.Return Cumulative.Return
MSFT       -0.02279597        -0.3361359
AAPL        0.02039616         0.4314812
NFLX        0.17454862        10.8991045
TSLA        0.44666765        13.8233571
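A more compact variant of the same pattern, sketched in base R on made-up prices (the tickers `AAA` and `BBB`, the dates, and the helper name `annualized` are all hypothetical). It relies on the fact that the cumulative product of gross returns reduces to last price over first price, and on `rbind` using list names as row names for one-row data frames.

```r
# Cumulative and annualized return from a data frame with
# 'timestamp' (Date) and 'close' (numeric) columns, sorted by date.
annualized <- function(df) {
  n       <- nrow(df)
  cum_ret <- df$close[n] / df$close[1] - 1
  years   <- as.numeric(difftime(df$timestamp[n], df$timestamp[1],
                                 units = "days")) / 365
  data.frame(cum.ret = cum_ret,
             ret.annual.return = (1 + cum_ret)^(1 / years) - 1)
}

# Made-up price series keyed by hypothetical ticker names.
dates  <- as.Date("2015-01-01") + c(0, 365, 730)
prices <- list(
  AAA = data.frame(timestamp = dates, close = c(100, 110, 121)),
  BBB = data.frame(timestamp = dates, close = c(50, 45, 40))
)

# One row per series; the list names become the row names.
out <- do.call(rbind, lapply(prices, annualized))
print(out)
```

Keeping the series in a named list avoids the separate file.names vector and the dependence on the loop counter inside the function.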
I'm running a loop that has 8 responses and 12 predictors, so the resulting matrix of results I get is made up of 96 unlabeled rows. Is there a way I can get R to automatically label those rows for me based on the response and predictor plugged into the model without needing to list every combination myself? This is an example of the code I've been using:
library(mgcv) # provides gamSim() and gamm()
set.seed(0)
set.seed(1)
dat <- gamSim(1,n=100,scale=2)
dat2 <- gamSim(1,n=100,scale=2)
names(dat2)[1:5]<-c("y1", paste0("x", 4:7))
d<-cbind(dat[, 1:5], dat2[, 1:5])
resp <- d[ c("y", "y1")]
pred <- d[, !(colnames(d) %in% c("y", "y1"))]
results<- vector("list", length=ncol(resp)*ncol(pred))
dim(results) <- c(ncol(resp), ncol(pred))
for(i in 1:ncol(resp)){
for(j in 1:ncol(pred)){
results[i, j][[1]] <- gamm(resp[, i] ~ s(pred[, j]))
}
}
resultsl <- do.call("list", results)
pspline<-sapply(resultsl, function(l) summary(l$lme)$tTable[,5])
pspline2<-plyr::ldply(pspline, rbind)
pspline2
1 6.949984e-39
2 7.174833e-01
3 1.665304e-40
4 4.928242e-01
....
I would like the rows to be labelled automatically as "y-x0", "y1-x0", "y-x1", "y1-x1", etc., or in any way that makes it clear which response and predictor each row refers to. I tried different variations of dimnames and rownames on the results matrix before running the loop but can't get it to work. Any ideas?
You can store the name of each formula in your results & use them to name the columns of your matrix:
for(i in 1:ncol(resp)){
for(j in 1:ncol(pred)){
results[i, j][[1]] <- mgcv::gamm(resp[, i] ~ s(pred[, j]))
attr(results[i, j][[1]], "formula") <- paste(colnames(resp)[i],
colnames(pred)[j], sep = "-")
}
}
resultsl <- do.call("list", results)
pspline<-sapply(resultsl, function(l) summary(l$lme)$tTable[,5])
colnames(pspline) <- sapply(resultsl, function(l) attr(l, "formula"))
> pspline
y-x0 y1-x0 y-x1 y1-x1 y-x2 y1-x2
X(Intercept) 1.889636e-40 1.072054e-35 4.272656e-47 1.179033e-35 5.963889e-53 1.004778e-35
Xs(pred[, j])Fx1 6.794519e-01 6.142264e-01 2.529175e-09 8.959402e-01 1.050719e-01 5.192141e-01
y-x3 y1-x3 y-x4 y1-x4 y-x5 y1-x5
X(Intercept) 2.021544e-40 1.110827e-35 2.030668e-40 1.560173e-36 2.020669e-40 1.709149e-39
Xs(pred[, j])Fx1 9.175815e-01 6.840799e-01 9.958544e-01 2.489406e-02 9.137317e-01 3.383139e-06
y-x6 y1-x6 y-x7 y1-x7
X(Intercept) 1.223498e-40 5.539263e-48 1.757977e-40 1.158747e-35
Xs(pred[, j])Fx1 2.731945e-01 1.911815e-02 5.586884e-01 8.059418e-01
I'm not sure why you used ldply to bind the matrix in your code. You end up with everything in the same column, without a label to distinguish X(Intercept) from Xs(pred[, j]). If you simply want it in long format as a data frame, you can use as.data.frame(t(pspline)):
> head(as.data.frame(t(pspline)))
X(Intercept) Xs(pred[, j])Fx1
y-x0 1.889636e-40 6.794519e-01
y1-x0 1.072054e-35 6.142264e-01
y-x1 4.272656e-47 2.529175e-09
y1-x1 1.179033e-35 8.959402e-01
y-x2 5.963889e-53 1.050719e-01
y1-x2 1.004778e-35 5.192141e-01
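The labels can also be generated up front, without touching the fitted models. A minimal sketch, assuming the response and predictor names from the example (`y`, `y1` and `x0` to `x7`): `outer()` builds every combination, and flattening it column by column gives the same order in which the list matrix is unrolled.

```r
resp_names <- c("y", "y1")          # response column names (from the example)
pred_names <- paste0("x", 0:7)      # predictor column names (from the example)

# All response-predictor labels as a 2 x 8 character matrix.
label_matrix <- outer(resp_names, pred_names, paste, sep = "-")

# Flatten column-major: y-x0, y1-x0, y-x1, y1-x1, ...
labels <- as.vector(label_matrix)
print(labels)
```

These could then be assigned with something like `colnames(pspline) <- labels`, provided the results were filled in the same i, j layout.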
I created a function to run cross validation on whichever model I enter into it:
# vectors to hold results
master_modelname <- vector()
master_modelmea <- vector()
# model coefficients
master_sink <- attr(mini_kitchensink$terms,"term.labels")
master_stepF <- attr(stepF$terms,"term.labels")
master_stepB <- attr(stepB$terms,"term.labels")
# folds
myKfolds <- function(preds, dataset, num_folds=5, dependant=c("loss")) {
# create vectors for holding predictions and actuals from k folds iterations
cv_prediction <- data.frame()
testsetCopy <- data.frame()
kmeas <- vector()
folds <- createFolds(dataset[,dependant], k=num_folds)
for ( f in folds ) {
ktrain <- dataset[-f,]
ktest <- dataset[f,]
kmodel <- lm(paste(dependant,"~", paste(preds, collapse="+"),sep=""), data=ktrain)
predictions <- predict(kmodel, interval="prediction", newdata=ktest)
temp <- as.data.frame(predictions)
cv_prediction <- rbind(cv_prediction, temp)
testsetCopy <- rbind(testsetCopy, ktest)
errors <- ktest[,dependant] - as.numeric(predictions[,"fit"])
kmeas <- c(kmeas,mean(abs(errors)))
}
master_modelname <- c(master_modelname, "bla")
master_modelmea <- c(master_modelmea, mean(kmeas))
}
After calling the function:
myKfolds(master_stepB, ptrain, num_folds=5)
It finishes after about 10 minutes.
But my vectors master_modelname and master_modelmea are still empty.
However, if I highlight and run just the last two lines of my function, it works: the two vectors now hold the values 'bla' and a single number, which is mean(kmeas).
I have to "force run" the final two lines of the function to get the expected results. Does this mean that, for some reason, the last two lines of my function are not running?
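A likely explanation is R's scoping rules rather than skipped lines: an assignment with `<-` inside a function creates a local variable, which disappears when the function returns, so the global vectors are never changed. A minimal sketch of the effect (the function names `add_local` and `add_returned` are hypothetical, introduced only for this illustration):

```r
master_modelname <- vector()

# The assignment below creates a LOCAL master_modelname inside the function.
add_local <- function() {
  master_modelname <- c(master_modelname, "bla")
  invisible(NULL)
}
add_local()
print(length(master_modelname))   # the global vector is still empty

# Returning the new value and assigning it at the call site updates it.
add_returned <- function(names_so_far) {
  c(names_so_far, "bla")
}
master_modelname <- add_returned(master_modelname)
print(master_modelname)
```

Applied to myKfolds, this would mean returning something like `list(name = "bla", mea = mean(kmeas))` and appending it to the master vectors outside the function. Running the last two lines by hand works because they then execute in the global environment.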