I am using the package depmixS4 to fit a HMM on time-series data. Here is an example with some high vol and low vol data.
In the getpars function we can see the parameter value estimates.
What is happening is that sometimes the first two values in the density belong to the low vol state and sometimes the second two values do. Is there any way to fix this (maybe by setting initial priors)?
set.seed(1)
a <- data.frame(v1 = c(rnorm(n = 100, sd = 10), rnorm(n=100, sd = 1)))
a <- sample(a)
my_model <- depmixS4::depmix(response = v1 ~ 1 , nstates = 2 , data = a)
fitted_model <- depmixS4::fit(my_model)
depmixS4::getpars(fitted_model)
for (i in 100:200) {
my_model2 <- depmixS4::depmix(response = v1 ~ 1 , nstates = 2 , data = a[1:i, , drop = FALSE])
fitted_model2 <- depmixS4::fit(my_model2)
pars <- depmixS4::getpars(fitted_model2)
if (pars[8] > 8) {
print(i)
}
}
This is called label switching.
Models in which you swap the label of states (e.g., relabel state 1 as state 2 and state 2 as state 1) have the same likelihood and hence are both valid maximum likelihood solutions.
You can try to "fix" this issue by:
setting initial values for the parameters (which make it more likely that the EM algorithm will converge to a particular solution, although this is not guaranteed!);
or by setting order constraints (e.g. forcing the mean for state 1 to be larger than the mean for state 2). Such constraints can be supplied to the fit method in depmixS4 (see examples in ?fit);
a final option is to switch the labels of a fitted depmixS4 object.
Here is a function to relabel a fitted depmix object I have used before (not tested well though!):
label_switch <- function(mod,labels) {
  # labels is a vector: its first element is the new integer label for original state 1,
  # its second element is the new integer label for original state 2, etc.
  if(!is(mod,"depmix") || !is(mod,"depmix.fitted")) stop("this function is for depmix models")
  n_states <- mod@nstates
  if(length(labels) != n_states || length(unique(labels)) != n_states || !all(labels %in% 1:n_states)) {
    stop("labels needs to be a vector of unique integers between 1 and ", n_states)
  }
  inv_labels <- sapply(1:n_states,function(x) which(labels == x))
  tmp <- mod
  # relabel prior
  ppars <- getpars(mod@prior)
  fpars <- getpars(mod@prior,which="fixed")
  out_pars <- as.numeric(t(matrix(ppars,nrow=length(ppars)/n_states,byrow = TRUE)[,inv_labels]))
  out_fixed <- as.logical(t(matrix(fpars,nrow=length(fpars)/n_states,byrow = TRUE)[,inv_labels]))
  if(!tmp@prior@family$link=="identity") tmp@prior@family$base <- labels[tmp@prior@family$base]
  # relabel transition
  for(i in 1:n_states) {
    ppars <- getpars(mod@transition[[inv_labels[i]]])
    fpars <- getpars(mod@transition[[inv_labels[i]]],which="fixed")
    out_pars <- c(out_pars,as.numeric(t(matrix(ppars,nrow=length(ppars)/n_states,byrow = TRUE)[,inv_labels])))
    out_fixed <- c(out_fixed,as.logical(t(matrix(fpars,nrow=length(fpars)/n_states,byrow = TRUE)[,inv_labels])))
    tmp@transition[[i]] <- mod@transition[[inv_labels[i]]]
    if(!tmp@transition[[i]]@family$link=="identity") tmp@transition[[i]]@family$base <- labels[tmp@transition[[i]]@family$base]
    #out_pars <- c(out_pars,getpars(mod@transition[[inv_labels[i]]]))
  }
  # relabel response
  for(i in 1:n_states) {
    out_pars <- c(out_pars,unlist(lapply(mod@response[[inv_labels[i]]],getpars)))
    out_fixed <- c(out_fixed,unlist(lapply(mod@response[[inv_labels[i]]],getpars,which="fixed")))
  }
  tmp <- setpars(tmp,out_fixed,which="fixed")
  tmp <- setpars(tmp,out_pars)
  if(is(tmp,"depmix.fitted")) tmp@posterior <- viterbi(tmp)
  return(tmp)
}
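To decide which labels to pass, one option is to rank the states by their estimated standard deviation, so that state 1 always ends up as the low vol state. A minimal base R sketch, assuming you have already pulled one standard deviation per state out of the fitted model into a numeric vector (labels_by_sd is a made-up helper name, not part of depmixS4):

```r
# Hypothetical helper: build the `labels` argument for label_switch() from
# the per-state standard deviations.
# labels[k] is the new label for original state k, so the state with the
# smallest sd is relabelled as state 1, the next smallest as state 2, etc.
labels_by_sd <- function(sds) {
  rank(sds, ties.method = "first")
}

# Example: original state 1 is high vol (sd near 10), state 2 is low vol (sd near 1)
new_labels <- labels_by_sd(c(9.8, 1.1))
print(new_labels)  # [1] 2 1

# Assumed usage with a fitted model (not run here):
# relabelled <- label_switch(fitted_model, new_labels)
```

Applying the same rule after every fit should make the state numbering comparable across the refits in the loop from the question.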
I am trying to make a binary prediction (predicting QQQ states) using 16 input variables. My data set covers 2001-2022. Here is what my data set looks like (I am predicting X0, which is 5 days ahead)
First I use cross validation with an 80-20 train/test split on data from 2001-2017 in order to test the accuracy of a potential model.
However, since I want the model to make forward predictions, I train the model on the 2001-2017 data set and make a chronological prediction for the 2018-2022 data set.
Understandably, the accuracy drops off.
In order to improve the accuracy, I run an expanding window prediction model, where I keep retraining the model using all prior available observations in order to predict the next state in the data set. For each model I increment the training set by one date. The output is a 2018-2022 prediction of states where the state for each date was predicted using a different training set. Ideally this should also help the model train on new market conditions/phases. The accuracy improves.
However, when I change the lags, I notice that the accuracy does not drop off as the lag increases…
The code has been checked extensively and it seems like the lags for each dataset are legitimate. This leads to the question…what is wrong with my model? Might there be a model better suited for our purposes? It also makes me wonder why there is such variability in the Sharpe ratio across models; is the 15th lag having the highest Sharpe purely coincidental? One theory was that the training set is quite comprehensive, so the model is good at making predictions regardless of lag in the near term. However, when I took the lags to an extreme, the accuracy still did not drop off:
Should I try using a different model? Any advice or guidance would be greatly appreciated. Please see my code below (the loop that is commented out is the expanding window RandomForest application).
library(ggplot2)
library(BatchGetSymbols)
library(data.table)
library(plyr)
library(quantmod)
library(PerformanceAnalytics)
defaultW <- getOption("warn")
options(warn = -1)
library(dplyr)
library(caret)
library(ranger)
### Data Import ######
states_full <- read.csv(file = "rolling_qqq_hidden_states_full_five_back.csv")
states_full$formatted_date <- as.Date(states_full$formatted_date)
states_full <- states_full[!duplicated(states_full$formatted_date),]
tickers <- c("QQQ", "^VXN")
l.out <- BatchGetSymbols(tickers = tickers,
first.date = states_full$formatted_date[1],
last.date = states_full$formatted_date[nrow(states_full)]+1, do.cache=FALSE, be.quiet = TRUE)
price_data <- data.frame(l.out$df.tickers$price.adjusted,l.out$df.tickers$ret.adjusted.prices, l.out$df.tickers$ref.date, l.out$df.tickers$ticker)
colnames(price_data) <- c("Value", "Daily Return", "Date", "Label")
QQQ_full <- price_data[which(price_data$Label == "QQQ"),]
# Make sure dates match
mylist <- c()
for (i in 1:nrow(QQQ_full)){
if (sum(QQQ_full$Date[i] == states_full$formatted_date) != 1){
mylist <- c(mylist, i)
}
}
if(length(mylist) > 0){
QQQ_full <- QQQ_full[-mylist,]
}
mylist <- c()
for (i in 1:nrow(states_full)){
if (sum(states_full$formatted_date[i] == QQQ_full$Date) != 1){
mylist <- c(mylist, i)
}
}
if(length(mylist) > 0){
states_full <- states_full[-mylist,]
}
# split the data into 2001-2017, 2018-2022
states_01_17 <- states_full[1:which(states_full$formatted_date == "2017-12-29"),]
states_17_22 <- states_full[(nrow(states_01_17)+1):nrow(states_full),]
QQQ_01_17<- QQQ_full[1:which(QQQ_full$Date == "2017-12-29"),]
QQQ_17_22 <- QQQ_full[(which(QQQ_full$Date == "2017-12-29")+1):nrow(QQQ_full),]
# build QQQ portfolio
QQQ_portfolio <- as.data.frame(matrix(nrow = nrow(QQQ_17_22) , ncol = 3))
colnames(QQQ_portfolio) <- c("Value", "Date", "Label")
QQQ_portfolio$Value <- 100
QQQ_portfolio$Label <- "QQQ Portfolio"
QQQ_portfolio$Date <- QQQ_17_22$Date
for(m in 2:nrow(QQQ_portfolio)){
QQQ_portfolio$Value[m] <- QQQ_portfolio$Value[m-1] * (1+QQQ_17_22$`Daily Return`[m])
}
# build non-lagged states portfolio
states_portfolio <- as.data.frame(matrix(nrow = nrow(QQQ_17_22) , ncol = 3))
colnames(states_portfolio) <- c("Value", "Date", "Label")
states_portfolio$Value <- 100
states_portfolio$Label <- "0 Lag RandomForest Prediction of MSDR"
states_portfolio$Date <- QQQ_17_22$Date
for(i in 2:nrow(states_portfolio)){
if (states_17_22$X0[i-1] == 1){
states_portfolio$Value[i] <- states_portfolio$Value[i-1] * (1+QQQ_17_22$`Daily Return`[i])
} else {
states_portfolio$Value[i] <- states_portfolio$Value[i-1]
}
}
# Calculate non-lagged sharpe as benchmark
#states_portfolio_returns <- data.frame(Delt(states_portfolio$Value)[-1])
#states_portfolio_returns_xts <- xts(states_portfolio_returns,states_portfolio$Date[-1])
#as.numeric(SharpeRatio.annualized(states_portfolio_returns_xts))
# bind portfolios together for plotting
port_comp <- rbind(QQQ_portfolio,states_portfolio)
# data set that will hold performance metrics
loop_output <- as.data.frame(matrix(0, nrow = 22, ncol = 8))
colnames(loop_output) <- c("Lag", "Cross Validation Accuracy 01-17","Forward Accuracy 18-22","Sharpe", "Average 1YR Rolling Sharpe",
"Median 1YR Rolling Sharpe","Min 1YR Rolling Sharpe","Max 1YR Rolling Sharpe")
# read macro data (do it each time because)
macro_full <- data.frame(read.csv("macroindicators3.csv"))
for (j in 2:ncol(macro_full)){
macro_full[j] <- as.numeric(nafill(macro_full[,j], type = "locf"))
}
macro_full$Date <- as.Date(macro_full[,1], "%m/%d/%Y")
macro_full <- macro_full[,-1]
macro_full <- macro_full[-1,]
# Remove NA columns, can try with more columns values later...
macro_no_na_full <- macro_full[,colSums(is.na(macro_full))==0]
# make sure dates match
mylist <- c()
for (k in 1:nrow(states_full)){
if (sum(states_full$formatted_date[k] == macro_full$Date) != 1){
mylist <- c(mylist, k)
}
}
if(length(mylist) > 0){
states_full <- states_full[-mylist,]
}
mylist <- c()
for (l in 1:nrow(macro_full)){
if (sum(macro_full$Date[l] == states_full$formatted_date) != 1){
mylist <- c(mylist, l)
}
}
if(length(mylist) > 0){
macro_full <- macro_full[-mylist,]
}
# states are a factor
states_full$X0 <- as.factor(states_full$X0)
set.seed(42)
for (i in 1:22){
if (i <= 8){
lag = i*5 # increment lag by 5 until 40
} else if (i <= 14){
lag = 40 + (i-8)*10 # increment lag by 10 until 100
} else {
lag = 100+(i-14)*100 # increment lag by 100 until 900
}
print(lag)
#Save lag
loop_output$Lag[i] <- lag
#Create a lagged data frame
full <- cbind(macro_no_na_full[1:(nrow(macro_no_na_full)-lag),], states_full[(lag+1):nrow(states_full),])
full_01_17 <- full[1:(which(full$Date == "2017-12-29")-lag),]
full_17_22 <- full[-(1:nrow(full_01_17)),]
# save version with dates to verify lags later
full_w_dates <- full
full_01_17_w_dates <- full_01_17
full_17_22_w_dates <- full_17_22
# remove dates for ml
full <- full[,-c(17,18)]
full_01_17 <- full_01_17[,-c(17,18)]
full_17_22 <- full_17_22[,-c(17,18)]
# this is just for cross validation model
x_01_17 <- data.frame(full_01_17[,-ncol(full_01_17)])
y_01_17 <- full_01_17$X0
# run cross validation model
train=sample(nrow(full_01_17),nrow(full_01_17)*.8,replace=FALSE) #Train/Test
rf.reg = ranger(y = y_01_17[train], x= x_01_17[train,] ,mtry=round(sqrt(ncol(x_01_17))),num.trees=200,importance = "impurity")
y.rf.pred = predict(rf.reg, x_01_17[-train,])$predictions # Predict with bagging
# cross validation model accuracy
rf.acc = mean(y.rf.pred==y_01_17[-train]) # Directly compute the accuracy
#rf.acc
#table(y.rf.pred,y_01_17[-train])
loop_output$`Cross Validation Accuracy 01-17`[i] <- rf.acc
# Expanding window models - takes a while
# prediction <- as.data.frame(matrix(0,nrow = nrow(full_17_22), ncol= 2)) # data set to store predictions
# prediction$V1 <- as.factor(c(0,1))[1] # store predictions as a factor
# previous = 0 # progress bar
# for(a in nrow(full_01_17):(nrow(full)-1)){ #expanding window starts with 2001-2017, next iteration is 2001-2017+1day
# progress = (a-nrow(full_01_17))/(nrow(full)-1-nrow(full_01_17)) # progress bar
# progress = round_any(progress, 0.01) # progress bar
# if (progress != previous){ # progress bar
# print(progress) # progress bar
# }
# previous = progress # progress bar
# rf.reg = ranger(full$X0[1:a]~.,data=full[1:a,],mtry=round(sqrt(ncol(x_01_17))),num.tree=800,importance = 'impurity') # ranger model
# y.rf.pred = predict(rf.reg, full[a+1,])$prediction # make the prediction on the a+1 observation
# prediction$V1[a-nrow(full_01_17)+1] <- y.rf.pred #save the prediction
# prediction$V2<-as.Date(prediction$V2) # save the date so we can verify lags
# prediction$V2[a-nrow(full_01_17)+1] <- as.Date(full_w_dates$formatted_date[a+1])
# if (a == nrow(full)-1) message("Done!") # gives a status update
# }
#
# write.csv(prediction, paste(lag,"lagprediction.csv", sep = "")) # save the prediction so we don't have to rerun
####
### to read in results from already completed backtests
prediction <- read.csv(paste(lag,"lagprediction.csv", sep = ""))[2]
###
full_17_22_w_pred <- full_17_22_w_dates
full_17_22_w_pred$prediction <- prediction$V1
# Evaluate the accuracy
rf.acc = mean(full_17_22_w_pred$prediction==full_17_22_w_pred$X0)
loop_output$`Forward Accuracy 18-22`[i] <- rf.acc
# build a portfolio out of the predicted states
portfolio <- as.data.frame(matrix(0,nrow = nrow(full_17_22), ncol= 3))
colnames(portfolio) <- c("Value", "Date", "Label")
portfolio$Date <- full_17_22_w_pred$formatted_date
portfolio$Value <- 100
portfolio$Label <- paste(lag,"Lag RandomForest Prediction of MSDR", sep = " ")
for(b in 2:nrow(portfolio)){
if (full_17_22_w_pred$prediction[b-1] == 1){
portfolio$Value[b] <- portfolio$Value[b-1] * (1+QQQ_17_22$`Daily Return`[b])
} else {
portfolio$Value[b] <- portfolio$Value[b-1]
}
}
# save it to dataset containing port
port_comp <- rbind(port_comp, portfolio)
# calculate Sharpe
portfolio_returns <- data.frame(Delt(portfolio$Value)[-1])
portfolio_returns_xts <- xts(portfolio_returns, portfolio$Date[-1])
loop_output$Sharpe[i] <- as.numeric(SharpeRatio.annualized(portfolio_returns_xts))
# rolling sharpe
mylist <- c()
for (z in 1:(nrow(portfolio_returns)-252)){
portfolio_xts_rolling <- portfolio_returns_xts[z:(z+252)]
mylist <- c(mylist, as.numeric(SharpeRatio.annualized(portfolio_xts_rolling)))
}
loop_output$`Average 1YR Rolling Sharpe`[i]<- mean(mylist)
loop_output$`Median 1YR Rolling Sharpe`[i]<- median(mylist)
loop_output$`Min 1YR Rolling Sharpe`[i]<- min(mylist)
loop_output$`Max 1YR Rolling Sharpe`[i]<- max(mylist)
}
options(warn = defaultW)
# plot output
ggplot(port_comp, aes(x = Date, y = Value, color = Label, group = Label))+geom_line()
#loop_output_v1 <- rbind(loop_output_v1, loop_output)
loop_output
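One quick way to sanity-check the lag construction used in the loop above (features from row t paired with the state from row t + lag) is to run the same indexing on toy data where the expected offset is known. This is only a sketch: make_lagged and the toy columns are made up for illustration and are not part of the code above.

```r
# Hypothetical helper mirroring the cbind() indexing used in the loop:
# feature row t is paired with the target from row t + lag.
make_lagged <- function(features, target, lag) {
  n <- nrow(features)
  stopifnot(nrow(target) == n, lag >= 0, lag < n)
  cbind(features[1:(n - lag), , drop = FALSE],
        target[(lag + 1):n, , drop = FALSE])
}

# Toy data: the row number doubles as a "date", so the offset is easy to read
features <- data.frame(feature_day = 1:10)
target   <- data.frame(target_day = 1:10)

lagged <- make_lagged(features, target, lag = 3)

# Every target should sit exactly `lag` rows after its features
offset_ok <- all(lagged$target_day - lagged$feature_day == 3)
print(offset_ok)
```

A check like this only confirms the row alignment; it says nothing about whether the features themselves already contain forward-looking information.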
While using regsubsets from package leaps on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that, when linear dependencies are found, reordering changes the position of the coefficients and coef() returns the wrong values.
I use mtcars just to "simulate" the problem I had with other data. In the first example there are no linear dependencies, the best model by BIC is mpg~wt+cyl, and coef() and summary()$which give the same result. In the second example I add a dummy variable so that perfect multicollinearity is possible, but with the variables in this order (dummy in the last column) the problem does not occur. In the last example, after changing the order of the variables in the dataset, the problem finally appears and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta2)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
summary.regsubsets() and regsubsets() order the variables differently. The coef() method for regsubsets objects relies on both, so the results get mixed up if you use force.in or a formula with a fixed variable order. Changing a few lines in the coef() function might help. Try the code below and see if it works!
coef.regsubsets <- function (object, id, vcov = FALSE, ...)
{
s <- summary(object)
invars <- s$which[id, , drop = FALSE]
betas <- vector("list", length(id))
for (i in 1:length(id)) {
# added
var.name <- names(which(invars[i, ]))
thismodel <- which(object$xnames %in% var.name)
names(thismodel) <- var.name
# deleted
#thismodel <- which(invars[i, ])
qr <- .Fortran("REORDR", np = as.integer(object$np),
nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
d = as.double(object$d), rbar = as.double(object$rbar),
thetab = as.double(object$thetab), rss = as.double(object$rss),
tol = as.double(object$tol), list = as.integer(thismodel),
n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
tol = as.double(qr$tol), beta = numeric(length(thismodel)),
nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
names(beta) <- object$xnames[qr$vorder[1:qr$n]]
reorder <- order(qr$vorder[1:qr$n])
beta <- beta[reorder]
if (vcov) {
p <- length(thismodel)
R <- diag(qr$np)
R[row(R) > col(R)] <- qr$rbar
R <- t(R)
R <- sqrt(qr$d) * R
R <- R[1:p, 1:p, drop = FALSE]
R <- chol2inv(R)
dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
object$xnames[qr$vorder[1:p]])
V <- R * s$rss[id[i]]/(object$nn - p)
V <- V[reorder, reorder]
attr(beta, "vcov") <- V
}
betas[[i]] <- beta
}
if (length(id) == 1)
beta
else betas
}
Another solution that works for me is to randomize the order of the columns (independent variables) in your dataset before running regsubsets. The idea is that after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reordering behavior in the regsubsets algorithm.
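If you only need the coefficients, yet another option is to skip coef() altogether and refit the selected model with lm(), using the variable names from one row of summary()$which. A rough sketch, assuming all predictors are numeric columns (so the names in which match the column names of the data). coef_from_which is a made-up helper, and which_row is typed by hand here so the example runs without leaps:

```r
# Hypothetical helper: refit the model described by one row of summary()$which
coef_from_which <- function(which_row, data, response) {
  # keep the selected variable names, dropping the intercept entry
  vars <- setdiff(names(which_row)[which_row], "(Intercept)")
  coef(lm(reformulate(vars, response = response), data = data))
}

# Stand-in for best.sum$which[w, ] in the mpg ~ wt + cyl case
which_row <- c("(Intercept)" = TRUE, cyl = TRUE, am = FALSE, wt = TRUE, hp = FALSE)

b <- coef_from_which(which_row, mtcars, "mpg")
print(names(b))  # [1] "(Intercept)" "cyl" "wt"
```

Because the refit depends only on variable names, the column order of the dataset should not matter for this route.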
I have created the correct number of indices for my vectors and I am trying to use the i'th element from the for loop as the index that holds the classification error value. But I get the error:
Error in indeces.gen_error[[i]] <- paste(classification_error) :
attempt to select less than one element
uspscl.txt
uspsdata.txt
My code:
library(e1071)
library(caret)
set.seed(733)
uspscldf = read.table('uspscl.txt', header=F, sep=',')
uspsdatadf = read.table('uspsdata.txt', header=F, sep='\t')
trainIndex <- createDataPartition(uspscldf$V1,list=FALSE, p = .80,times=1)
dataTrain <- uspsdatadf[ trainIndex,]
dataTest <- uspsdatadf[-trainIndex,]
classTrain <- uspscldf[ trainIndex,]
classTest <- uspscldf[-trainIndex,]
indeces = seq(0.00001, 1, by=0.001)
indeces.gen_error = NULL
indeces.softmargin = NULL
for (i in seq(0.00001, 1, by=0.001)){
# For svm(): soft margin is "cost"
# Gaussian kernel bandwidth (sigma) is implicitly defined by "gamma"
# kernel="radial" is non-linear while kernel="linear" is linear
svm.model <- svm(classTrain ~ ., data = dataTrain, cost = i,type="C-classification",kernel = "linear")
svm.pred <- predict(svm.model, dataTrain)
# confusion matrix
tab <- table(pred = svm.pred, true = classTrain)
classification_error <- 1- sum(svm.pred == classTrain)/length(svm.pred)
indeces.gen_error[[i]] <- paste(classification_error)
indeces.softmargin[[i]]<-i
}
I printed i in the first iteration and it gives 1e-5, which is correct, so I am at a loss as to why it says I am selecting less than one element.
Any help would be appreciated. Thanks
ANSWER:::
I did not see Pierre's answer to this until I solved the answer myself but his explanation is better so I am accepting his answer. My new code now is:
indeces = seq(0.00001, 1, by=0.001)
indeces.gen_error = NULL
indeces.softmargin = NULL
count=0
for (i in indeces){
count=count+1
# For svm(): soft margin is "cost"
# Gaussian kernel bandwidth (sigma) is implicitly defined by "gamma"
# kernel="radial" is non-linear while kernel="linear" is linear
svm.model <- svm(classTrain ~ ., data = dataTrain, cost = i,type="C-classification",kernel = "linear")
svm.pred <- predict(svm.model, dataTrain)
# confusion matrix
tab <- table(pred = svm.pred, true = classTrain)
classification_error <- 1- sum(svm.pred == classTrain)/length(svm.pred)
indeces.gen_error[[count]] <- paste(classification_error)
indeces.softmargin[[count]]<-i
}
#Example
x <- NULL
for( i in seq(0.01, 1, .01)) {
a <- 10 * i
x[[i]] <- paste("b", a)
}
# Error in x[[i]] <- paste("b", a) :
# attempt to select less than one element
#The right way
x <- NULL
myseq <- seq(0.01, 1, 0.01)
for( i in 1:length(myseq)) {
a <- 10 * myseq[i]
x[i] <- paste("b", a)
}
Why the first way fails: for( i in seq(0.01, 1, .01)) uses the values of the sequence as 'i'. Whenever a loop fails, the first way to troubleshoot is to step through the iterations one by one. Each iteration takes one value of the sequence and substitutes it wherever there is an i. The first iteration looks like:
for (i in seq(0.00001, 1, by=0.001)){
svm.model <- svm(classTrain ~ ., data = dataTrain, cost = 0.00001,type="C-classification",kernel = "linear")
svm.pred <- predict(svm.model, dataTrain)
# confusion matrix
tab <- table(pred = svm.pred, true = classTrain)
classification_error <- 1- sum(svm.pred == classTrain)/length(svm.pred)
indeces.gen_error[[0.00001]] <- paste(classification_error)
indeces.softmargin[[0.00001]]<- 0.00001
}
Do you see the problem? With indeces.gen_error[[0.00001]] pay attention to what is happening here. You did not mean to do this. You meant for indeces.gen_error[[1]] to be the first entry.
You are subsetting by a decimal. If we had:
x <- 1:10
What do you think would happen with x[2.5]? We are asking R for the element at the position 2.5. That doesn't make sense. There is no half position. There's either the 2nd or 3rd. Try it to see what is returned.
In your loop you are asking R for indeces.gen_error[[0.00001]]. Therefore, you are requesting the 1/100,000th position. That doesn't make sense. The evaluator will force the subset to be an integer. It goes to 0. And we get an error.
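A small base R sketch of the indexing behaviour described above. Nothing here comes from the svm code; x, y and msg are just illustration names.

```r
x <- 1:10
x[2.5]  # the index is truncated to 2, so this returns the 2nd element
x[0.5]  # the index is truncated to 0, so this returns an empty vector

# Assigning with [[ ]] at a fractional index below 1 is what raises the error
msg <- tryCatch({
  y <- NULL
  y[[0.00001]] <- "a"
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)
```

Looping over seq_along() of the sequence, or keeping a separate counter as in the accepted fix, avoids the problem because the index is then always a whole number starting at 1.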
I am trying to simulate unequal sample sizes in a multilevel model. I have four groups with sample sizes of 100, 200, 300, and 400, respectively.
So the total sample size is 1000. The variables w, u0, and u1 are at level 2; x and r0 are at level 1. y is the outcome.
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4 ## 4 groups
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
dataLevel1 <- mat.or.vec(sum(nSubWithinGroup),4)
colnames(dataLevel1) <- c("Group","X","W","Y")
rowIndex <- 0
for (group in 1:nGroup) {
u0 <- rnorm(1,mean=0,sd=1)
u1 <- rnorm(1,mean=0,sd=1)
w <- rnorm(1,mean=0,sd=1)
for(i in 1:length(nSubWithinGroup)){
for (j in 1:nSubWithinGroup[i]){
r0 <- rnorm(1,mean=0,sd=1)
x <- rnorm(1,mean=0,sd=1)
y <- (gamma00+gamma01*w+u0)+(gamma10+gamma11*w+u1)*x+r0
rowIndex <- rowIndex + 1
dataLevel1[rowIndex,] <- c(group,x,w,y)
}
}
}
I ran the code, and the "Group" column only contains the value 1, never 2, 3, or 4. It also throws an error:
"Error in [<-(*tmp*, rowIndex, , value = c(2, -1.94476463667851, -0.153516782293473, :
subscript out of bounds"
Your original issue was a bit hard to find with all the for-loops, but you were looping twice over your grouping level (once in 1:nGroup and then again in 1:length(nSubWithinGroup)). This led to more combinations than you had allowed for in your matrix, and thus your error. (If you want to check, run your loop without assigning to dataLevel1 and see what value rowIndex has at the end.)
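That check can be done with just the loop skeleton from the question, keeping only the counter:

```r
nSubWithinGroup <- c(100, 200, 300, 400)
nGroup <- 4

rowIndex <- 0
for (group in 1:nGroup) {
  for (i in 1:length(nSubWithinGroup)) {
    for (j in 1:nSubWithinGroup[i]) {
      rowIndex <- rowIndex + 1   # count iterations instead of writing rows
    }
  }
}

rowIndex              # 4000 iterations in total
sum(nSubWithinGroup)  # but dataLevel1 only has 1000 rows
```

The first 1000 iterations all belong to group 1, which is why the "Group" column only ever shows 1 before the assignment runs out of rows.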
However, generating data like this in R can be notoriously slow and every function you use with n=1 can just as easily be used to generate nTotal numbers. I have rewritten your code to something that's (hopefully) more readable, but also more vectorized.
#set seed; you can never reproduce your result if you don't do this
set.seed(289457)
#set constants
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
#set size parameters
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4
nTotal <- sum(nSubWithinGroup)
#simulate group-level data
level2_data <- data.frame(group=1:nGroup,
size=nSubWithinGroup, #not really necessary here, but I like to have everything documented/accessible
u0 = rnorm(nGroup,mean=0,sd=1),
u1 = rnorm(nGroup,mean=0,sd=1),
w = rnorm(nGroup,mean=0,sd=1)
)
#simulate individual-level data (in the example code, x and r0 were generated in the same way for each individual)
level1_data <- data.frame(id=1:nTotal,
group=rep(1:nGroup, nSubWithinGroup),
r0 = rnorm(nTotal,mean=0,sd=1),
x = rnorm(nTotal, mean=0,sd=1)
)
#several possibilities here, you can merge the two dataframes together or reference the level2data when calculating the outcome
#merging generates more data, but is also readable
combined_data <- merge(level1_data,level2_data,by="group",all.x=TRUE)
#calculate outcome. This can be shortened for instance by calculating some linear parts before
#merging but wanted to stay as close to original code as possible.
combined_data$y <- (gamma00+gamma01*combined_data$w+combined_data$u0)+
(gamma10+gamma11*combined_data$w+combined_data$u1)*combined_data$x+combined_data$r0
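For the other possibility mentioned in the comments (referencing the level-2 data directly instead of merging), a minimal sketch could look like the following. It rebuilds small versions of the two data frames so that it runs on its own; the group sizes, the seed and the names level2, level1 and g are arbitrary choices for illustration.

```r
set.seed(1)
gamma00 <- 1; gamma01 <- 1; gamma10 <- 1; gamma11 <- 1

n_sub   <- c(2, 3)            # small illustrative group sizes
n_group <- length(n_sub)
n_total <- sum(n_sub)

level2 <- data.frame(group = 1:n_group,
                     u0 = rnorm(n_group), u1 = rnorm(n_group), w = rnorm(n_group))
level1 <- data.frame(id = 1:n_total,
                     group = rep(1:n_group, n_sub),
                     r0 = rnorm(n_total), x = rnorm(n_total))

# level2 has one row per group, in the order 1..n_group, so indexing by the
# group id repeats each group-level value for every individual in that group
g <- level1$group
level1$y <- (gamma00 + gamma01 * level2$w[g] + level2$u0[g]) +
            (gamma10 + gamma11 * level2$w[g] + level2$u1[g]) * level1$x + level1$r0
```

This relies on the rows of level2 being ordered by group id; if that is not guaranteed, match(level1$group, level2$group) can be used as the index instead.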
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?
test.df <- data.frame(a = sample(c(0,1),100, rep = T),
b = factor(sample(letters, 100, rep = T)),
c = factor(sample(c("apple", "orange"), 100, rep = T)))
set.seed(123)
partition <- function(data, train.size = .7){
train <- data[sample(1:nrow(data), round(train.size*nrow(data)), rep= FALSE), ]
test <- data[-as.numeric(row.names(train)), ]
partitioned.data <- list(train = train, test = test)
return(partitioned.data)
}
part.data <- partition(test.df)
table(part.data$train[,'b'])
table(part.data$test[,'b'])
EDIT - New function using 'caret' package and createDataPartition():
partition <- function(data, factor=NULL, train.size = .7){
if (("package:caret" %in% search()) == FALSE){
stop("Install and Load 'caret' package")
}
if (is.null(factor)){
train.index <- createDataPartition(as.numeric(row.names(data)),
times = 1, p = train.size, list = FALSE)
train <- data[train.index, ]
test <- data[-train.index, ]
}
else{
train.index <- createDataPartition(factor,
times = 1, p = train.size, list = FALSE)
train <- data[train.index, ]
test <- data[-train.index, ]
}
partitioned.data <- list(train = train, test = test)
return(partitioned.data)
}
Try the caret package, particularly the function createDataPartition(). It should do exactly what you need, available on CRAN, homepage is here:
caret - data splitting
The function below is partly some code I found a while back on the net, which I then modified slightly to better handle edge cases (like when you ask for a sample size larger than the set, or a subset).
stratified <- function(df, group, size) {
# USE: * Specify your data frame and grouping variable (as column
# number) as the first two arguments.
# * Decide on your sample size. For a sample proportional to the
# population, enter "size" as a decimal. For an equal number
# of samples from each group, enter "size" as a whole number.
#
# Example 1: Sample 10% of each group from a data frame named "z",
# where the grouping variable is the fourth variable, use:
#
# > stratified(z, 4, .1)
#
# Example 2: Sample 5 observations from each group from a data frame
# named "z"; grouping variable is the third variable:
#
# > stratified(z, 3, 5)
#
require(sampling)
temp = df[order(df[group]),]
colsToReturn <- ncol(df)
#Don't want to attempt to sample more than possible
dfCounts <- table(df[group])
if (size > min(dfCounts)) {
size <- min(dfCounts)
}
if (size < 1) {
size = ceiling(table(temp[group]) * size)
} else if (size >= 1) {
size = rep(size, times=length(table(temp[group])))
}
strat = strata(temp, stratanames = names(temp[group]),
size = size, method = "srswor")
(dsample = getdata(temp, strat))
dsample <- dsample[order(dsample[1]),]
dsample <- data.frame(dsample[,1:colsToReturn], row.names=NULL)
return(dsample)
}
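If you would rather not depend on caret or sampling, a base R sketch of the "at least one observation from every level" requirement could look like this. It is not what createDataPartition() does internally: it simply forces one randomly chosen row per level into the training set and then fills up the rest at random. partition_with_levels is a made-up name, and the sketch only handles a single factor column.

```r
# Hypothetical helper: random train/test split that keeps at least one row
# of every observed level of `factor_col` in the training set
partition_with_levels <- function(data, factor_col, train.size = 0.7) {
  n <- nrow(data)
  # row indices grouped by level (unused levels are dropped)
  idx_by_level <- split(seq_len(n), droplevels(as.factor(data[[factor_col]])))
  # one randomly chosen row per level is forced into the training set
  forced <- unname(vapply(idx_by_level,
                          function(idx) idx[sample.int(length(idx), 1)],
                          integer(1)))
  n_train <- max(round(train.size * n), length(forced))
  # fill the remaining training slots from the rows not yet chosen
  remaining <- setdiff(seq_len(n), forced)
  extra <- remaining[sample.int(length(remaining), n_train - length(forced))]
  train_idx <- sort(c(forced, extra))
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

set.seed(123)
test.df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                      b = factor(sample(letters, 100, replace = TRUE)))
part <- partition_with_levels(test.df, "b")
table(part$train$b)
```

Levels that occur only once in the data end up in the training set only, so the test set may still miss some levels; the model can at least be fitted with all of them.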