Exporting forecast to a dataset while running a loop - r

I am trying to take one dataset and run a forecast on it for a series of different cutoffs. The for loop starts, but it stops with the error "replacement has 76 rows, data has 0".
Here is a reproducible example:
library(tidyverse)
library(forecast)
library(scales)
library(growthcurver)
options(scipen = 12) # Scientific Notation
options(digits = 6) # Specify Digits
noup<-3 #Days without update
claims <- tribble(~perdaycases, 3,1,1,0,0,0,
1,8, 7, 2,
8, 8, 12,
13, 15,
21, 27,
47, 65,
47, 30,
62, 74,
23, 38)
claims$cases <- cumsum(claims$perdaycases)
claims$id<-1:nrow(claims)
inds <- seq(as.Date("2020-03-11"), as.Date(Sys.Date()-noup), by = "day")
set.seed(1)
## Forecast length
h0 = 30
#Here, I create the empty dataset
estimates<-data.frame(Simulation=numeric(),Forecast=numeric()) #Empty Dataset
for(i in 1:length(claims$id)) {
cap<-subset(claims,id<14+i) #First cutoff then it runs again
cts <- ts(cap$cases, start = 1,frequency = 365) #Time Series
cfore <- forecast(auto.arima(cts), h= h0, level = c(80)) #Do the Arima
gc_fit <- SummarizeGrowth(seq(1,nrow(cap)),cap$cases) #Fit the Growth curve
tt <- seq(from=nrow(cap)+1,to=90,by=1)
forelog <- predict(gc_fit$model,newdata=list(t=tt)) #Prediction
forecast<-forelog #Create the item with mean projection
len<-as.numeric(length(forecast)) #Length of each forecast
estimates$Simulation<-as.numeric(rep(i,len)) #id each iteration
estimates$Forecast<-forecast #Here I try to export the forecast
}
The Error I get is ...
Error in $<-.data.frame(*tmp*, "Simulation", value = c(1, 1, 1, 1, :
replacement has 76 rows, data has 0
I am guessing it has to do with the last 2 lines but after 4 hours of struggling I have decided to ask for help.
SOS.
Thanks,

Try to provide a minimal example that reproduces your error. Your problem boils down to this simple problem:
test <- data.frame(
a=numeric(),
b=numeric()
)
test$a <- c(1,2,3)
# Error in `$<-.data.frame`(`*tmp*`, a, value = c(1, 2, 3)) :
# replacement has 3 rows, data has 0
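This doesn't work, because a column of length 3 cannot be assigned to a data frame that has zero rows. Instead, you could build a small data frame in each iteration and append it with bind_rows() (from dplyr, which tidyverse loads):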
test <- data.frame(
a=numeric(),
b=numeric()
)
for (i in 1:3){
a <- c(rep(i, 3))
b <- c(rep(i-1, 3))
df <- data.frame(
a = a,
b = b
)
test <- bind_rows(test,df)
}
And for your specific example, you could replace the last two lines of the loop as follows:
# estimates$Simulation <-as.numeric(rep(i,len)) #id each iteration
# estimates$Forecast<-forecast #Here I try to export the forecast
df <- data.frame(
Simulation = as.numeric(rep(i,len)),
Forecast = forecast # name must match the Forecast column of estimates
)
estimates <- bind_rows(estimates,df)
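As an alternative, here is a minimal base R sketch of the same idea: store one data frame per iteration in a list and combine them once after the loop, instead of growing `estimates` inside it. The forecast values are placeholder numbers (an assumption for illustration), standing in for the output of predict().

```r
# Collect one data frame per iteration, then combine them after the loop.
# The "forecast" values are placeholders, not real model output.
results <- vector("list", 3)
for (i in 1:3) {
  forecast_i <- seq(10 * i, by = 1, length.out = 4)  # placeholder forecast
  results[[i]] <- data.frame(
    Simulation = rep(i, length(forecast_i)),
    Forecast = forecast_i
  )
}
estimates <- do.call(rbind, results)
str(estimates)
```

Binding once at the end avoids copying the accumulated data frame on every pass, which can matter when the loop runs many times.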

Related

Can't create a data.frame with my findings from a loop

I have a for loop that returns four different answers, which are correct, but when I try to store these values in my data.frame I get "Error in [<-.data.frame(*tmp*, p, 1, value = 29.1520685791182) :
missing values are not allowed in subscripted assignments of data frames"
Goal: I'm trying to get the printed values (29, 485, -14, 12) into a data.frame.
library("xts")
library("quantmod")
library("fredr")
Tesla <- getSymbols("TSLA", from=as.Date("2014-11-03"),to=as.Date("2019-11-03"))
Amazon <- getSymbols("AMZN", from=as.Date("2014-11-03"),to=as.Date("2019-11-03"))
Equinor <- getSymbols("EQNR",from="2014-11-03",to="2019-11-03")
FTSE100 <- getSymbols("^FTSE",from="2014-11-03",to="2019-11-03")
dftest <- data.frame(merge(TSLA$TSLA.Close, AMZN$AMZN.Close, EQNR$EQNR.Close,FTSE$FTSE.Close))
df <- data.frame(matrix(nrow = 1, ncol = 4)) #The data.frame where I want the values from print(pros) to end up.
colnames(df) <- c("TESLA", "AMAZON","EQUINOR","FTSE")
for (p in dftest) {
pros <- ((last(as.numeric(p)))-(first(as.numeric(p))))/(first(as.numeric(p)))*100
print(pros) #this print out 29, 485,-14,12
df[p,1] <- pros #the problem
}
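One possible fix, sketched with a small made-up data frame in place of the downloaded prices (the numbers are illustrative assumptions): `for (p in dftest)` hands the loop each column's values rather than a position, so `df[p, 1]` tries to index rows by prices, and any NA among them triggers the error. Computing the percentage change per column with sapply() avoids the index altogether. The helper name pct_change is hypothetical, and unlike the loop above it drops missing values before taking the first and last observation.

```r
# Made-up closing prices standing in for the downloaded series (assumption)
prices <- data.frame(
  TESLA = c(100, 110, 129),
  AMAZON = c(50, 200, 292.5),
  EQUINOR = c(20, 18, 17.2)
)

# Percentage change from the first to the last non-missing value of a column
pct_change <- function(p) {
  p <- p[!is.na(p)]
  (p[length(p)] - p[1]) / p[1] * 100
}

# sapply() returns a named numeric vector; as.list() turns it into
# one column per series, giving a one-row data frame
dfProsent <- data.frame(as.list(sapply(prices, pct_change)))
dfProsent
```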

"invalid time series parameters specified" error in R

I am trying to modify the Nelson/Siegel example from the YieldCurve documentation using data from the USTYC package (https://cran.r-project.org/web/packages/YieldCurve/YieldCurve.pdf).
The original code is:
library(YieldCurve)
### Nelson.Siegel function and Fed data-set ###
data(FedYieldCurve)
rate.Fed = first(FedYieldCurve,'5 month')
maturity.Fed <- c(3/12, 0.5, 1,2,3,5,7,10)
NSParameters <- Nelson.Siegel( rate= rate.Fed, maturity=maturity.Fed )
y <- NSrates(NSParameters[5,], maturity.Fed)
My modified code is below
library(ustyc)
library(YieldCurve)
xlist = getYieldCurve() # 2.5 mins
yields <- xlist$df
maturities <- c(1/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30)
curve <- yields["2018-05-21",1:11]
NSParameters <- Nelson.Siegel(curve,maturities)
y <- NSrates(NSParameters[1,],maturities)
However, I am getting the error:
Error in attr(x, "tsp") <- c(1, NROW(x), 1) :
invalid time series parameters specified
What am I doing wrong? Thanks in advance
Solved it: 'yields' is in the wrong format. The fourth line needs to be:
yields <- as.xts(xlist$df)

R function for creating, naming and lagging variables

I have some data like so:
a <- c(1, 2, 9, 18, 6, 45)
b <- c(12, 3, 34, 89, 108, 44)
c <- c(0.5, 3.3, 2.4, 5, 13,2)
df <- data.frame(a, b,c)
I need to create a function that lags many variables at once, for a very large time series analysis with dozens of variables, so I want to avoid typing each lag out by hand. In short, I would like to create the variables a.lag1, b.lag1 and c.lag1 and add them to the original df specified above. I figure the best way to do so is with a custom function, something along the lines of:
lag.fn <- function(x) {
assign(paste(x, "lag1", sep = "."), lag(x, n = 1L)
return (assign(paste(x, "lag1", sep = ".")
}
The desired output is:
a.lag1 <- c(NA, 1, 2, 9, 18, 6, 45)
b.lag1 <- c(NA, 12, 3, 34, 89, 108, 44)
c.lag1 <- c(NA, 0.5, 3.3, 2.4, 5, 13, 2)
However, I don't get what I am looking for. Should I change the environment to the global environment? I would like to be able to use cbind to add the new variables to the original df. Thanks.
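This is easy with dplyr. Avoid naming a data frame df, since that can be confused with the function of the same name (stats::df, the density of the F distribution). I'm using df1 here.
library(dplyr)
df1 <- data.frame(a, b, c) # the data from the question
df1 <- df1 %>%
mutate(a.lag1 = lag(a),
b.lag1 = lag(b),
c.lag1 = lag(c))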
The data frame statement in the question is invalid since a, b and c are not the same length. What you can do is create a zoo series. Note that the lag specified in lag.zoo can be a vector of lags as in the second example below.
library(zoo)
z <- merge(a = zoo(a), b = zoo(b), c = zoo(c))
lag(z, -1) # lag all columns
lag(z, 0:-1) # each column and its lag
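We can use mutate_all. Note that funs() is deprecated in recent dplyr releases; with dplyr 1.0.0 or later the equivalent uses across():
library(dplyr)
df %>%
mutate(across(everything(), lag, .names = "{.col}.lag1"))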
If everything else fails, you can use a simple base R function:
my_lag <- function(x, steps = 1) {
c(rep(NA, steps), x[1:(length(x) - steps)])
}
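As a rough sketch of how such a helper could be applied to every column at once in base R, using the data from the question: the version of my_lag below uses seq_len() instead of 1:(length(x) - steps), a small change of mine so that the index is empty rather than counting backwards when steps equals the vector length.

```r
a <- c(1, 2, 9, 18, 6, 45)
b <- c(12, 3, 34, 89, 108, 44)
c <- c(0.5, 3.3, 2.4, 5, 13, 2)
df1 <- data.frame(a, b, c)

# Shift a vector down by `steps` positions, padding the start with NA
my_lag <- function(x, steps = 1) {
  c(rep(NA, steps), x[seq_len(length(x) - steps)])
}

# Lag every column, rename the results, and bind them to the original
lagged <- as.data.frame(lapply(df1, my_lag))
names(lagged) <- paste(names(df1), "lag1", sep = ".")
df1 <- cbind(df1, lagged)
df1
```

Because lapply() walks over all columns, the same three lines work unchanged for dozens of variables.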

How to subset evenly spaced samples from a dataframe *without duplicates* in R?

I'm trying to create an evenly spaced (in time or in depth) subset of a larger dataset in R. My original data isn't evenly spaced.
These are the functions that need improvement:
# calculate step size and subsets df accordingly
spacedSS <- function(df, n, var){
stp <- (max(var)-min(var))/(n - 1) #calculate step size
stps <- min(var)+0:(n-1)*stp #calculate step values
res <- lookupDepth(df, stps, var)
return(as.data.frame(res))
}
# finds values in var closest to stps, returns subsetted df
lookupDepth <- function(df, stps, var){
indxs <- rep(0, times=length(stps)) # create empty index vector
for(i in seq_along(stps)) { # for every subsample row
# find the one closest to the step value
# TODO: only if it isn't already in the df
indxs[i] <- which.min((var - stps[i])^2)
}
sampls <- df[indxs, ] #subset by these new indexes
return(as.data.frame(sampls))
}
And here they are applied to data similar to my own to illustrate the problem:
# generate data
depth <- c(seq(650, 750, length.out = 50), seq(750, 760, length.out = 3),
seq(760, 780, length.out = 5), seq(780, 800, length.out = 20))
age <- c(seq(40, 41, length.out = 50), seq(41, 42, length.out = 3),
seq(42, 47, length.out = 5), seq(47, 48, length.out = 20))
id <- seq_along(age)
dat <- data.frame(id, depth, age)
# subset 10 samples of dat evenly spaced in depth/age
ss.depth <- spacedSS(dat, 10, dat$depth)
ss.age <- spacedSS(dat, 10, dat$age)
Here's a plot of the data:
# plot it using my depthplotter function
source("https://raw.githubusercontent.com/japhir/DepthPlotter/master/DepthPlotter.R")
DepthPlotter(dat[, c("depth", "age")], xlab = "Age (Ma)")
segments(30, ss.depth$depth, ss.depth$age, col = "blue")
segments(ss.age$age, 640, y1 = ss.age$depth, col = "red")
The problem I'm trying to solve is that the subset function currently doesn't check which indices have already been used:
# the problem I'm trying to solve:
length(unique(ss.age$id)) != length(unique(ss.depth$id))
TRUE
# it picked the same samples sometimes because they were the closest ones!
ss.age$id
[1] 1 45 53 55 55 56 57 57 61 78
As the output shows, rows 55 and 57 were each selected twice, because the subsetting doesn't take into account the samples that have already been selected. Any idea how to fix this?
So I ended up asking a friend to help me out, and we've constructed a rather complicated Simulated Annealing approach.
Basically we created a function to see if there are any duplicate index values, and if so fixes them really simply. A mutation function then randomly changes the index values. The loss of this new subset is checked against the original dataset, random mutations are generated and selected if they are better than the previous selection. The selection criteria are rather loose at first but get more stringent over time, resulting in a pretty cool optimised subset of the data!
If you are interested in the code we used, comment below and we'll put it up somewhere.
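For comparison, here is a much simpler greedy sketch. It is not the simulated annealing approach described above, whose code is not shown: for each evenly spaced target value it takes the closest row that has not been used yet. The function name spaced_unique is hypothetical. It guarantees unique rows but, unlike an optimiser, makes no attempt to minimise the overall deviation from the targets.

```r
# Greedy selection: for each evenly spaced target value, pick the closest
# row that has not been picked before (requires n <= nrow(df)).
spaced_unique <- function(df, n, var) {
  stopifnot(n >= 2, n <= nrow(df))
  targets <- seq(min(var), max(var), length.out = n)
  used <- rep(FALSE, length(var))
  idx <- integer(n)
  for (i in seq_len(n)) {
    d <- abs(var - targets[i])
    d[used] <- Inf              # exclude rows that are already selected
    idx[i] <- which.min(d)
    used[idx[i]] <- TRUE
  }
  df[idx, , drop = FALSE]
}

# Data from the question
depth <- c(seq(650, 750, length.out = 50), seq(750, 760, length.out = 3),
           seq(760, 780, length.out = 5), seq(780, 800, length.out = 20))
age <- c(seq(40, 41, length.out = 50), seq(41, 42, length.out = 3),
         seq(42, 47, length.out = 5), seq(47, 48, length.out = 20))
dat <- data.frame(id = seq_along(age), depth, age)

ss.age <- spaced_unique(dat, 10, dat$age)
anyDuplicated(ss.age$id)  # 0 means no row was selected twice
```

The result depends on the order in which the targets are visited, so an early target can take a row that would have suited a later one better; that trade-off is what a global search such as simulated annealing tries to avoid.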

How to apply a custom function to each participant in a data frame

I created a function that calculates dPrime. The function takes a data frame as its argument. This works fine, however the columns must be called "stimDiff" and "stimSame", as the function calculates dPrime using these specific names. I would like to apply this function to a data frame that has multiple subjects, and be able to calculate dPrime for each subject, with the result being a new data frame with the dPrime score of each subject. The test data frame looks like this:
stimDiff0 <- c(rep("diff", 20), rep("same", 5))
stimSame0 <- c(rep("diff", 10), rep("same", 15))
stimDiff1 <- c(rep("diff", 10), rep("same", 15))
stimSame1 <- c(rep("diff", 10), rep("same", 15))
stimDiff2 <- c(rep("diff", 19), rep("same", 6))
stimSame2 <- c(rep("diff", 11), rep("same", 14))
stimDiff3 <- c(rep("diff", 21), rep("same", 4))
stimSame3 <- c(rep("diff", 9), rep("same", 16))
stimDiff4 <- c(rep("diff", 18), rep("same", 7))
stimSame4 <- c(rep("diff", 12), rep("same", 13))
stimDiff5 <- c(rep("diff", 22), rep("same", 3))
stimSame5 <- c(rep("diff", 14), rep("same", 11))
stimDiff <- c(stimDiff0, stimDiff1, stimDiff2,
stimDiff3, stimDiff4, stimDiff5)
stimSame <- c(stimSame0, stimSame1, stimSame2,
stimSame3, stimSame4, stimSame5)
subject <- rep(0:5, each = 25)
x <- data.frame(subject = subject, stimDiff = stimDiff, stimSame = stimSame)
I am trying to obtain a dPrime-by-subject data frame using the following code:
tapply(c(x$stimDiff, x$stimSame), x$subject, data = x, FUN = dPrime)
I get the following error:
Error in tapply(list(x$stimDiff, x$stimSame), x$subject, data = x, FUN = dPrime) :
arguments must have same length
I am aware of the fact that there are packages that can calculate dPrime. I am doing this in order to learn how to write functions. I would prefer to find a solution using base R.
Here is the code for the function dPrime:
dPrime <- function(x) {
# Calculate number of same, diff and total responses
# for the stimuli that were actually different
stimDiffRdiff <- nrow(x[x$stimDiff == 'diff', ])
stimDiffRsame <- nrow(x[x$stimDiff == 'same', ])
stimDiffTotal <- length(x$stimDiff)
# Calculate number of same, diff and total responses
# for the stimuli that were actually the same
stimSameRdiff <- nrow(x[x$stimSame == 'diff', ])
stimSameRsame <- nrow(x[x$stimSame == 'same', ])
stimSameTotal <- length(x$stimSame)
# Hit rate = the number of correct responses 'diff'
# when the stimuli were actually diff, divided by
# the total number of responses
hitRate <- stimDiffRdiff / stimDiffTotal
# Miss rate = the number of incorrect responses
# 'same' when the stimuli were actually diff
# divided by the total number of responses
missRate <- stimDiffRsame / stimDiffTotal
# False alarm = the number responses 'diff'
# when the stimuli were actually the same
# divided by the total number of responses
falseAlarm <- stimSameRdiff / stimSameTotal
# Correct rejection = the number of responses
# same when the stimuli were actually the same
# divided by the number of total responses
corrReject <- stimSameRsame / stimSameTotal
# Calculate z-critical values for hit rate
# and false alarm rate
zHitRate <- qnorm(hitRate)
zFalseAlarm <- qnorm(falseAlarm)
# Calculate d prime
dPrime <- zHitRate - zFalseAlarm
print(dPrime)
}
To build on @jvcasill's original function and on other users' responses:
dPrime <- function (data, subj = 1, stimDiff = 2, stimSame = 3) {
# dPrime() returns a vector of the length of the number of subjects
#+ in data[, subj] that contains the sensitivity index "d'" for each.
# `data`: data frame
# `subj`: index of "subject" column in `data`; default is 1
# `stimDiff`: index of "stimDiff" column in `data`; default is 2
# `stimSame`: index of "stimSame" column in `data`; default is 3
if (is.data.frame(data)) {
# Divide `data` by subject with split(), as have done others who've
#+ responded to this question
data.by.subj <- split(data, data[, subj])
# Calculate number of subjects and create vector of same length
#+ to return
n.subj <- length(data.by.subj)
dPrime.by.subj <- vector(mode = "double", length = n.subj)
# Loop through "data.by.subj" subject by subject and calculate d'
for (subj in seq_len(n.subj)) {
# For clarity, create temporary data set with data of
#+ current "subj"
data.tmp <- data.by.subj[[subj]]
stimDiffRdiff <- nrow(data.tmp[data.tmp[, stimDiff] == "diff", ])
stimDiffRsame <- nrow(data.tmp[data.tmp[, stimDiff] == "same", ])
stimDiffTotal <- length(data.tmp[, stimDiff])
stimSameRdiff <- nrow(data.tmp[data.tmp[, stimSame] == "diff", ])
stimSameRsame <- nrow(data.tmp[data.tmp[, stimSame] == "same", ])
stimSameTotal <- length(data.tmp[, stimSame])
hitRate <- stimDiffRdiff / stimDiffTotal
missRate <- stimDiffRsame / stimDiffTotal
falseAlarm <- stimSameRdiff / stimSameTotal
# The following appears unused in the original function
# corrReject <- stimSameRsame / stimSameTotal
zHitRate <- qnorm(hitRate)
zFalseAlarm <- qnorm(falseAlarm)
dPrime <- zHitRate - zFalseAlarm
dPrime.by.subj[subj] <- dPrime
}
# For clarity, give each d' value in vector to be returned,
#+ "dPrime.by.subj", name of corresponding subject
names(dPrime.by.subj) <- names(data.by.subj)
return(dPrime.by.subj)
} else stop("'data' is not a data frame")
}
Note that I'm not sure whether the values it returns make sense. For the example data set provided by @jvcasill, they are the same as those obtained with @Splendour's method.
Try data.table (using length function rather than dPrime):
library(data.table)
xt = data.table(x)
xt[,list(len=length(c(stimSame,stimDiff))),by=subject]
subject len
1: 0 50
2: 1 50
3: 2 50
4: 3 50
5: 4 50
6: 5 50
With base R:
sapply(split(x, x$subject), dPrime)
[1] 1.094968
[1] 0
[1] 0.8572718
[1] 1.352917
[1] 0.6329951
[1] 1.024018
0 1 2 3 4 5
1.0949683 0.0000000 0.8572718 1.3529167 0.6329951 1.0240176
The duplicated output comes from the 'print(dPrime)' statement in your dPrime function; replace it with a return() call. Better still, since dPrime is also the name of the function, change the variable in the 'dPrime <- zHitRate - zFalseAlarm' statement to some other name, say 'ret':
ret <- zHitRate - zFalseAlarm
return(ret)
Here's a (somewhat inelegant) solution in base R:
Split the dataframe into lists, one per subject:
by.subject <- split(x, x$subject)
Calculate dPrime for every chunk, returning a named numeric vector:
dPrime.values <- unlist(lapply(by.subject, dPrime), use.names=TRUE)
Construct a new dataframe:
df <- data.frame(dPrime=dPrime.values)
df$subject <- as.numeric(rownames(df))
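Putting the suggestions above together, a minimal base R sketch could look like this. It assumes the column names stimDiff and stimSame from the question, uses only the first two subjects of the test data to stay short, and computes each rate with mean() on a logical vector, which is my shortcut for dividing a count by the total.

```r
# Test data: the first two subjects from the question
x <- data.frame(
  subject = rep(0:1, each = 25),
  stimDiff = c(rep("diff", 20), rep("same", 5),
               rep("diff", 10), rep("same", 15)),
  stimSame = c(rep("diff", 10), rep("same", 15),
               rep("diff", 10), rep("same", 15))
)

# d' = z(hit rate) - z(false alarm rate), returned rather than printed
dPrime <- function(d) {
  hitRate <- mean(d$stimDiff == "diff")
  falseAlarm <- mean(d$stimSame == "diff")
  qnorm(hitRate) - qnorm(falseAlarm)
}

# One d' value per subject, then a two-column data frame
dp <- sapply(split(x, x$subject), dPrime)
result <- data.frame(subject = as.numeric(names(dp)), dPrime = unname(dp))
result
```

Because the function returns its value instead of printing it, sapply() yields a single named vector and nothing is printed twice.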

Resources