I am having some trouble storing arima predictions into an empty vector. The problem is arima predictions give you predictions and standard errors. There are two columns of values. I cannot seem to store the values in an empty vector. I tried to create two empty vectors and bind them together, but it did not solve the problem.
My intention is to simulate 1000 observations and use the first 900 observations to make 100 predictions. The set of values has to update at each step. For example, use 900 observations to predict the value of the 901st observation. Now use 901 observations, including the predicted 901st observation, to predict the 902nd observation. Repeat until you use 999 observations to predict the 1000th observation. I hope to figure out how to store multiple values into a vector.
The empty vector that I hope will contain the 100 predictions is called Predictions1.
# Create Arima Series #
ArimaSeries1 = arima.sim(n=1000, list(ar=c(0.99), ma=c(0.1)))+50
ts.plot(ArimaSeries1)
acf(ArimaSeries1)
ArimaSeries2 = arima.sim(n=1000, list(ar=c(0.7,0.2), ma=c(0.1,0.1)))+50
ts.plot(ArimaSeries2)
acf(ArimaSeries2)
ArimaSeries3 = arima.sim(n=1000, list(ar=c(0.6,0.2,0.1), ma=c(0.1,0.1,0.1)))+50
ts.plot(ArimaSeries3)
acf(ArimaSeries3)
# Estimate Arima Coefficients using maximum likelihood #
ARC1 = arima(ArimaSeries1, order = c(1,0,1))
ARC2 = arima(ArimaSeries2, order = c(2,0,2))
ARC3 = arima(ArimaSeries3, order = c(3,0,3))
# Estimate Arima Coefficients with 900 observations #
AR1 = arima(ArimaSeries1[1:900], order = c(1,0,1))
AR2 = arima(ArimaSeries2[1:900], order = c(2,0,2))
AR3 = arima(ArimaSeries3[1:900], order = c(3,0,3))
# Create for-loop to make one prediction ahead for 100 times #
PredictionsA = rep(0,100)
PredictionsB = rep(0,100)
Predictions1 = cbind(PredictionsA,PredictionsB)
for(a in 1:100){ Forcasting1 = predict(arima(ArimaSeries1[1:900+a], order=c(1,0,1)), n.ahead=1)}
Predictions1[a] = Forcasting1
R would give me this error message:
Warning message: In Predictions1[a] = Forcasting1 : number of items
to replace is not a multiple of replacement length
I would be grateful for any suggestions. Any explanations on where I went wrong is also appreciated. Thank you for your time.
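Maybe something like this, with one row per forecast so that each row holds the prediction and its standard error:
Predictions1 <- array(NA, c(100,2))
for(a in 1:100){
  # 1:900+a parses as (1:900)+a, which shifts the window; 1:(899+a) grows the sample by one observation per step
  Forcasting1 = predict(arima(ArimaSeries1[1:(899+a)], order=c(1,0,1)), n.ahead=1)
  # unlist() flattens the returned list into c(pred, se), which fills one row
  Predictions1[a,] = unlist(Forcasting1)
}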
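A minimal, self-contained sketch of the same idea, in case it helps to see it run end to end. The series length, the ARMA coefficients, and the names `series`, `n_start`, `n_pred` and `forecasts` are assumptions chosen to keep the example quick; they are not from the question.

```r
set.seed(42)

# Shorter, clearly stationary series than in the question, so 20 refits run quickly.
series <- arima.sim(model = list(ar = 0.7, ma = 0.1), n = 200) + 50

n_start <- 180   # observations used for the first fit
n_pred  <- 20    # number of one-step-ahead forecasts

# One row per forecast: column 1 is the prediction, column 2 its standard error.
forecasts <- matrix(NA_real_, nrow = n_pred, ncol = 2,
                    dimnames = list(NULL, c("pred", "se")))

for (a in seq_len(n_pred)) {
  n_obs <- n_start + a - 1                          # 180, 181, ..., 199
  fit   <- arima(series[1:n_obs], order = c(1, 0, 1))
  fc    <- predict(fit, n.ahead = 1)                # list with $pred and $se
  forecasts[a, ] <- c(fc$pred, fc$se)
}

head(forecasts)
```

This refits on the observed values each time. If the predicted values should be fed back in instead, as the question describes, the loop would have to append `fc$pred` to a working copy of the series before the next fit.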
I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[,x] # prediction based on x components
  confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the results of prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
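Since the mixOmics objects are not available here, the pattern can be sketched with base R only. Everything in this block is a hypothetical stand-in: `truth`, `pred_by_comp` and the small `ber()` helper (mean of the per-class error rates) take the place of the validation labels, `myPredictions$class$max.dist` and `get.BER`.

```r
# Hypothetical true classes for six validation samples.
truth <- factor(c("a", "a", "b", "b", "a", "b"))

# Hypothetical predictions, one column per number of components.
pred_by_comp <- cbind(
  comp1 = c("a", "b", "b", "b", "a", "a"),
  comp2 = c("a", "a", "b", "b", "a", "b")
)

# Balanced error rate: average of the per-class misclassification rates.
ber <- function(truth, predicted) {
  predicted <- factor(predicted, levels = levels(truth))
  cm <- table(truth, predicted)        # rows = truth, columns = predicted
  mean(1 - diag(cm) / rowSums(cm))
}

# Apply the same function to every column; sapply returns a named numeric vector.
results <- sapply(colnames(pred_by_comp),
                  function(k) ber(truth, pred_by_comp[, k]))
results
```

Using `sapply` over the column names keeps the component labels attached to the error rates, which makes it easier to pick the number of components afterwards.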
I am a beginner to R and am having trouble with something that feels basic, but I am not sure how to do it. I have a data set with 1319 rows and I want to set up training data from observations 1 to 1000 and test data from observations 1001 to 1319.
In the notes from my class, the professor set this up by creating a Boolean vector from the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns the True/False statements.
I understand that and would be able to set up a Boolean vector if I were subsetting my data by a variable, but instead I have to split strictly by row number, which I do not understand how to accomplish. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?
You get logical(0) because nrow is not a column of your data frame, so data$nrow is NULL and the comparison NULL < 1001 has length zero.
You can instead subset your data frame by row numbers:
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]
But be careful: unless the order of rows in your data frame is completely random, you probably want 1000 rows chosen at random rather than the first 1000. You can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff function is used to get all rows not selected in train.
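If a Boolean vector like the professor's is preferred, one can be built from the row positions. This is a small sketch on a made-up data frame; the columns `id` and `x` are placeholders, not from the question.

```r
set.seed(1)

# Hypothetical stand-in for the 1319-row data set.
data <- data.frame(id = 1:1319, x = rnorm(1319))

# TRUE for rows 1..1000, FALSE for rows 1001..1319.
train <- seq_len(nrow(data)) <= 1000

train_data <- data[train, ]
test_data  <- data[!train, ]

c(train = nrow(train_data), test = nrow(test_data))
```

The logical vector can then be used exactly like `train=(Year<2005)` in the class notes.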
The issue with splitting your data set by rows is the potential to introduce bias into your training and testing set - particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = T),
data = sample(seq(0, 1, by = 0.01), 1000, replace = T))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to using cross validation in your models.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
p = 0.8, list = FALSE)
Note the argument y = data$year: this tells caret to sample with respect to the variable year, so that the training and test sets have a similar distribution of years and you do not introduce bias from ordered data into the model.
The p argument tells caret what proportion of the original data should go to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000
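For readers without caret installed, the idea of sampling within groups can be approximated in base R. This is only a rough sketch and not caret's actual algorithm: it draws about 80% of the rows from each year separately. The names `idx_by_year` and `in_train` and the column `value` are made up for the example.

```r
set.seed(123)

# Same shape as the example data above, rebuilt here so the block is self-contained.
data <- data.frame(year  = sample(2000:2019, 1000, replace = TRUE),
                   value = runif(1000))

# Row indices grouped by year.
idx_by_year <- split(seq_len(nrow(data)), data$year)

# Draw roughly 80% of the rows within each year.
in_train <- unlist(lapply(idx_by_year, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train <- data[in_train, ]
test  <- data[-in_train, ]

nrow(train) + nrow(test)
```

Indexing with `idx[sample.int(...)]` rather than `sample(idx, ...)` avoids the surprise where `sample()` treats a single number as `1:n`.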
I have been trying to calculate Cook's distance manually for a multiple linear regression dataset, but running into problems with the for loop. What I have been doing is this:
This is the original linear model, and the associated fitted values, length = 'n'.
{fitted = lm10$fitted.values}
This is the new, n X n, blank matrix, I created to hold the new fitted values.
{lev.mat <- matrix(rep(0, nrow(X.des)^2), nrow = nrow(X.des))}
I wanted to save time, so I filled in the first column of the matrix manually.
{newData = as.data.frame(X.des[-1,])
newModel = lm(fev~., data = newData - 1)
newFitted = newModel$fitted.values
newDist = c(fitted[1],newFitted)
lev.mat[,1] = newDist}
I then tried to fill in the rest of the columns of the lev.mat similarly, using the for loop.
for(i in 2:nrow(lev.mat)){
newData = as.data.frame(X.des[-i, ])
newModel = lm(fev~., data = newData - 1)
newFitted = newModel$fitted.values
newDist = c(newFitted[1:(i-1)],fitted[i],newFitted[i:length(newFitted)])
lev.mat[,i] = newDist
}
But I keep getting this error repeatedly:
{Error in lev.mat[, i] <- newDist :
number of items to replace is not a multiple of replacement length}
I have been at this for three hours now, and it's getting frustrating. Can anybody point out the error and help me move along? My next steps are to calculate the difference between the original fitted values and each column of values in the new fitted values matrix, sum the differences, and divide by the product of the number of predictors and the MSE.
Thanks!
Thanks a lot to @Harlan Nelson for providing me with a wonderful link! I used the background provided in the link here to complete my work. Here is the rest of my code:
Hmat = hatvalues(lm10)
Leverage = Hmat/(1 - Hmat)
mse = (lm10$residuals)^2/var(lm10$residuals)
CooksD <- (1/6)*(mse)*Leverage
lm10 was the name of my linear model, and I had 6 predictors in the model. This helped me calculate Cook's Distance for the model. Thanks again!
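For anyone checking their own numbers: the textbook definition of Cook's distance is e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2, where p counts all coefficients including the intercept and s^2 is the residual mean square. The snippet above appears to differ from that in a few details (h/(1 - h) rather than h/(1 - h)^2, var() of the residuals, and 6 rather than the number of coefficients), so it may be worth comparing against the built-in cooks.distance(). The sketch below uses mtcars and a model called `fit` purely as stand-ins for the poster's data and lm10.

```r
# Illustrative model on a built-in data set; it stands in for lm10 from the post.
fit <- lm(mpg ~ wt + hp, data = mtcars)

h  <- hatvalues(fit)                  # leverages h_ii
e  <- residuals(fit)                  # raw residuals e_i
p  <- length(coef(fit))               # number of coefficients, intercept included
s2 <- sum(e^2) / df.residual(fit)     # residual mean square (the usual MSE)

# Textbook Cook's distance, computed by hand.
cooks_manual <- (e^2 / (p * s2)) * h / (1 - h)^2

# Compare with R's built-in implementation.
all.equal(cooks_manual, cooks.distance(fit))
```

Computing it this way needs only one model fit, so the leave-one-out loop from the question is not required.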
I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to get a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code is returning an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) on all other variables.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate it if someone could take a look and see what is wrong.
Sample Data
# Set panel data structure
alldata <- plm.data (alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm +0
# Create 1 month time lag for TRx
alldata <- ddply(
alldata, .(DMA), transform,
# This assumes that the data is sorted
Mkt_TRx_norm_prev = c(NA,Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <-function(decay, period, data_vector, pool_vector=0){
data_vector <-alldata$var
pool_vector <- alldata$DMA
data2<-data_vector
l<-length(data_vector)
#if no pool apply zero to vector
if(length(pool_vector)==1)pool_vector<-rep(0,l)
#outer loop: extract data to decay from observation i
for( i in 1:l){
x<-data_vector[i]
#inner loop: apply decay onto following observations after i
for(j in 1:min(period,l)){
#constrain decay to same pool (if data is pooled)
if( pool_vector[i]==pool_vector[min(i+j,l)]){data2[(i+j)]<- data2[(i+j)]+(x*(decay)^j)}
}
}
#reduce length of edited data to equal length of initial data
data2<-data2[1:l]
#regression - excludes NA values
alldata <- plm.data (alldata, index = c("DMA", "Month_Num"))
var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2, data = alldata , model = "within", na.action = na.exclude)
se <- summary(var_fe)$coefficients["data2","Std. Error"]
return(se)
}
# Optimize decay for adstock variable
result <- optimize(adstockreg, interval=c(0,1), period = 6)
print(result)
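One thing that stands out when reading the function: its first two lines overwrite the data_vector and pool_vector arguments with alldata$var and alldata$DMA, so whatever is passed in is ignored, which may be related to getting the same optimum for every variable. As a way to test the transformation in isolation, here is a small sketch of a finite-window geometric adstock applied within each pool. The names `adstock` and `adstock_by_pool` are made up for this example, and it is meant as an approximation of the loops above, not a drop-in replacement.

```r
# Geometric adstock with a finite carry-over window:
# out[t] = x[t] + decay * x[t-1] + ... + decay^period * x[t-period]
adstock <- function(x, decay, period) {
  n   <- length(x)
  out <- x
  for (j in seq_len(min(period, n - 1))) {
    # add the contribution of observations j steps back
    out[(j + 1):n] <- out[(j + 1):n] + decay^j * x[1:(n - j)]
  }
  out
}

# Apply the transformation separately within each pool (for example each DMA),
# assuming rows are already sorted by time within a pool.
adstock_by_pool <- function(x, pool, decay, period) {
  ave(x, pool, FUN = function(v) adstock(v, decay, period))
}

# Tiny check: a single impulse decays geometrically and does not leak across pools.
x    <- c(1, 0, 0, 0, 1, 0)
pool <- c("A", "A", "A", "A", "B", "B")
adstock_by_pool(x, pool, decay = 0.5, period = 2)
```

Passing the variable and the pool explicitly makes it possible to reuse the same function for other columns inside the objective given to optimize().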
My predicted values are all negative. I would have expected 0s or 1s. Can anyone see where I am going wrong?
fold = 10
end = nrow(birthwt)
fold_2 = floor(end/fold)
df_i = birthwt[sample(nrow(birthwt)),] # random sort the dataframe birthwt
tester = df_i[1:fold_2,] # remove first tenth of rows - USE PREDICT ON THIS DATA
trainer = df_i[-c(1:fold_2),] # all other than the first tenth of rows - USE GLM ON THIS DATA
mod = glm(low~lwt,family=binomial,data=trainer)
ypred = predict(mod,data=tester) # predicted values
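The default for predict.glm is to give you the value of the link (on the scale of the linear predictors) before transformation. If you want to predict the response, use type="response". Note also that the argument for new observations is called newdata, not data; data=tester is silently ignored, so the predictions are returned for the rows used to fit the model.
ypred <- predict(mod, newdata=tester, type="response")
This returns predicted probabilities between 0 and 1 rather than 0s and 1s. It may be helpful to read the ?predict.glm help file.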
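To make the difference between the scales concrete, here is a short sketch using the birthwt data from the MASS package, which I am assuming is the data set in the question. The 0.5 cut-off used to turn probabilities into classes is an arbitrary choice for illustration.

```r
library(MASS)   # provides the birthwt data set

set.seed(1)
shuffled <- birthwt[sample(nrow(birthwt)), ]      # random row order
n_test   <- floor(nrow(shuffled) / 10)            # size of one fold

tester  <- shuffled[seq_len(n_test), ]            # held-out tenth
trainer <- shuffled[-seq_len(n_test), ]           # remaining rows

mod <- glm(low ~ lwt, family = binomial, data = trainer)

link <- predict(mod, newdata = tester)                     # log-odds, can be negative
prob <- predict(mod, newdata = tester, type = "response")  # probabilities in (0, 1)
cls  <- as.integer(prob > 0.5)                             # 0/1 labels, assumed cut-off

head(cbind(link, prob, cls))
```

The probabilities are simply the inverse logit of the link values, which is why the default output can be negative.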