depmixs4 list of state estimates - r

I have fitted a time series of data to a Hidden Markov model using the fit(depmix()) functions in the package depmixs4. I want to obtain a time series of estimated states, where the estimate for state x is the mean value assigned to state x. Currently, I'm only getting a times series of values that say the index of the state (e.g. 1,3,5,2,5...).
This is my current code:
set.seed(9)
hmm9 <- depmix(volume ~ 1, data=data.frame(volume), nstates=9)
fitted_hmm9 <- fit(hmm9)
summary(fitted_hmm9)
state_ests9 <- posterior(fitted_hmm9)
state_ests9[,1]
The last part, state_ests9[,1], is the time series of state indices, while the actual expected values for states are stored somewhere in summary(fitted_hmm9).

By "estimated state" I assume you want the predicted response (i.e. "volume") according to each state. To get state-wise predicted values for responses, you can use the predict method, which however needs to be called on the response sub-model, which is somewhat hidden in the fitted depmix object.
In a fitted depmix object, there is a slot called "response", which is a list of lists, structured as
fitted_model#response[[state_id]][[response_id]]
In your case, I think "volume" is univariate, so that there is a single response and response_id is always 1. For a model with 9 states, state_id would vary from 1 to 9.
The following code (including randomly generated values for "volume" to make it reproducible) gives you what you want (I think):
set.seed(123)
volume <- rnorm(10000)
hmm9 <- depmix(volume ~ 1, data=data.frame(volume), nstates=9)
fitted_hmm9 <- fit(hmm9)
summary(fitted_hmm9)
state_ests9 <- posterior(fitted_hmm9)
state_ests9[,1]
# construct matrix for state-dependent predictions
pred_resp9 <- matrix(0.0,ncol=9,nrow=nrow(state_ests9))
# fill matrix column-wise by using the "predict" method of
# corresponding response model
for(i in 1:9) {
pred_resp9[,i] <- predict(fitted_hmm9#response[[i]][[1]])
}
## use MAP state estimates to extract a single time-series
## with predictions
pred_resp9[cbind(1:nrow(pred_resp9),state_ests9[,1])]

Related

Is calculated factor scores in robCompositions R package, correct?

Please have a look at the factor scores returned from robCompositions package in this example:
data(expenditures)
x <- expenditures
res.rob <- pfa(x, factors=1, score="regression")
according to pfa help, since the covariance is not specified,
the covariance is estimated from isometric log-ratio transformed data internally, but the data used for factor analysis are back-transformed to the "clr" space.
So the clr transformed data obtain as follows:
# ilr transformation
ilr <- function(x){
x.ilr=matrix(NA,nrow=nrow(x),ncol=ncol(x)-1)
for (i in 1:ncol(x.ilr)){
x.ilr[,i]=sqrt((i)/(i+1))*
log(((apply(as.matrix(x[,1:i]), 1, prod))^(1/i))/(x[,i+1]))
}
return(x.ilr)
}
#construct orthonormal basis:
#(matrix with ncol(x) rows and ncol(x)-1 columns)
V=matrix(0,nrow=ncol(x),ncol=ncol(x)-1)
for (i in 1:ncol(V)){
V[1:i,i] <- 1/i
V[i+1,i] <- (-1)
V[,i] <- V[,i]*sqrt(i/(i+1))
}
z=ilr(x) #ilr transformed data
y=z%*%t(V) #clr transformed data
now the factor scores using regression method might be calculated as follows:
loa<-c(0.970,0.830,0.986,0.876,0.977) #res.rob object
facscores<- y%*%loa
head(facscores)
-0.009485110
0.009680645
0.008426665
-0.015401000
-0.003610644
-0.004584145
but calling res.rob$scores returns us
head(res.rob$scores)
Factor1
-755.2681
705.5309
4196.5652
-778.6955
-628.2141
-663.4534
So please check am I wrong or there is probably a bug in the pfa command?
Yours,
Hamid

output the predicted value of random forest as a new column in the test dataset

I ran a random forest on a training data set (60% of the parent data set).
I used the "predict" function in R to evaluate how well the model performs on the test data set (40% of the parent data set)
I used
require(randomForest)
trained_model <- randomForest(y~.,data = training,
importance=TRUE,
keep.forest=TRUE)
result <- predict(trained_model, type = response, newdata=testing[-17])
removed 17th column since it's "y"
The problem is that the "result" is an array, this there a way to add this predicted value back into the testing data set as a new column
Something like
Id x1 x2 ...... Xn y predicted
1 0.1 0.12 23 no no
Do this:
testing[ ,(ncol(testing)+1)] <- result
When result will added back to testing dataset, there will be a column not array.

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly produce a multinomial logistic model with mlogit here (Formatting data for mlogit)
However, I'm trying now to analyze the effects of covariates in my model. I find the help file in mlogit.effects() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1) ).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social and Responsive) Would I just model whatever they picked as their first rank and ignore what they chose in other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesnt work. I confess although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item ItemK, then we're modeling
log[ Pr(Itemk is 1st) / Pr(ItemK is 1st) ] = αk + xTβk
for k = 1,...,K-1, where Itemk is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels for the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with Ranked Data, not just Multinomial Choice Data. The structure for the Ranked data in mlogit is that first set of records for a person are all options, then the second is all options except the one ranked first, and so on. But the index assumes equal number of options each time. So a bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

How to find an optimal adstock decay factor for an independent variable in panel analysis in R?

I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to get a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM+Br_act_norm_adstock".
So far, I've loaded the dataset in panel formal using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code is returning an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) on all other variables.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate if someone could take a look and see what is wrong.
Sample Data
# Set panel data structure
alldata <- plm.data (alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm +0
# Create 1 month time lag for TRx
alldata <- ddply(
alldata, .(DMA), transform,
# This assumes that the data is sorted
Mkt_TRx_norm_prev = c(NA,Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <-function(decay, period, data_vector, pool_vector=0){
data_vector <-alldata$var
pool_vector <- alldata$DMA
data2<-data_vector
l<-length(data_vector)
#if no pool apply zero to vector
if(length(pool_vector)==1)pool_vector<-rep(0,l)
#outer loop: extract data to decay from observation i
for( i in 1:l){
x<-data_vector[i]
#inner loop: apply decay onto following observations after i
for(j in 1:min(period,l)){
#constrain decay to same pool (if data is pooled)
if( pool_vector[i]==pool_vector[min(i+j,l)]){data2[(i+j)]<- data2[(i+j)]+(x*(decay)^j)}
}
}
#reduce length of edited data to equal length of initial data
data2<-data2[1:l]
#regression - excludes NA values
alldata <- plm.data (alldata, index = c("DMA", "Month_Num"))
var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2, data = alldata , model = "within", na.action = na.exclude)
se <- summary(var_fe)$coefficients["data2","Std. Error"]
return(se)
}
# Optimize decay for adstock variable
result <- optimize(adstockreg, interval=c(0,1), period = 6)
print(result)

plot multiple fit and predictions for logistic regression

I am running multiple times a logistic regression over more than 1000 samples taken from a dataset. My question is what is the best way to show my results ? how can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create vector replacing the Leage (A our N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model the player be league A in function of the Hits,Runs,Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
sp=sample(seq(1:length(PA)),30,replace=FALSE)
fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response when using the default family (gaussian), it should be changed to a factor and binomial family.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
sp <- sample(nrow(dataset), 30, replace=FALSE)
fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
family="binomial", data=dataset[sp[1:15],])
preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
truths[[k]] <- dataset$League[sp[1:15]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.

Resources