ChoiceModelR - Hierarchical Bayes Multinomial Logit Model - r

I hope that some of you have some experience with the R package ChoiceModelR by Sermas and Colias for estimating a Hierarchical Bayes Multinomial Logit Model. Actually, I am quite a newbie to both R and Hierarchical Bayes. However, I tried to get some estimates by using the script provided by Sermas and Colias in the help file. I have a data set in the same structure as they use (ID, choice set, alternative, independent variables, and choice variable). I have four independent variables, all of them binary coded as categorical variables, none of them restricted. I have eight choice sets with three alternatives within each set, plus a no-choice option as a fourth alternative. I tried the following script:
library(ChoiceModelR)
data <- read.delim("Z:/KLU/CSR/CBC/mp3_vio.txt")
xcoding=c(0,0,0,0)
mcmc = list(R = 10, use = 10)
options = list(none=FALSE, save=TRUE, keep=1)
attlevels=c(2,2,2,2)
c1=matrix(c(0,0,0,0),2,2)
c2=matrix(c(0,0,0,0),2,2)
c3=matrix(c(0,0,0,0),2,2)
c4=matrix(c(0,0,0,0),2,2)
constraints = list(c1, c2, c3, c4)
out = choicemodelr(data, xcoding, mcmc = mcmc, options = options, constraints = constraints)
and have got the following error message:
Error in 1:nalts[i] : result would be too long a vector
In addition: There were 50 or more warnings (use warnings() to see the first 50). The warnings all have the following form:
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
Actually, I have no idea what went wrong so far, as I used the same data structure, even though I have more independent variables, more choice sets, and more alternatives within a choice set. It would be fantastic if anybody could shed some light on this.

I know that this may not be helpful since you posted so long ago, but if it comes up again in the future, this could prove useful.
One of the most common reasons for this error (in my experience) has been that either the scenario variable or the alternative variable is not in ascending order within your data.
id scenario alt x1 ... y
 1        1   1  4     1
 1        1   2  1     0
 1        3   1  4     2
 1        3   2  5     0
 2        1   4  3     1
 2        1   5  1     0
 2        2   1  4     2
 2        2   2  3     0
This dataset will give you errors since the scenario and alternative variables must be ascending, and they must not skip any values. Just to fully reiterate what I mean, the scenario and alt variables must be reordered as follows in order to work:
id scenario alt x1 ... y
 1        1   1  4     1
 1        1   2  1     0
 1        2   1  4     2
 1        2   2  5     0
 2        1   1  3     1
 2        1   2  1     0
 2        2   1  4     2
 2        2   2  3     0
I work with ChoiceModelR quite frequently, and this is what has caused these errors for me in the past. If you have a GitHub account, you can also post your data (or modified data) there if you end up wanting other users to take a look.
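If the raw data uses arbitrary scenario or alternative codes, one way to renumber them is sketched below. This is not part of ChoiceModelR itself; it is a minimal sketch assuming dplyr and a data frame called data with the id, scenario, alt, and choice (y) columns shown above.
library(dplyr)

data <- data %>%
  arrange(id, scenario, alt) %>%
  group_by(id) %>%
  mutate(scenario = match(scenario, unique(scenario))) %>%  # 1, 2, 3, ... per respondent
  group_by(id, scenario) %>%
  mutate(alt = row_number()) %>%                            # 1, 2, 3, ... per choice set
  ungroup()
# Note: if the choice column (y) stores the number of the chosen alternative,
# it must be remapped to the new alt numbering as well.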

Related

Subset linear regression in R

I have the following formula:
Reg_Total <- In_Bigdata2 %>%
  lm(log(This_6) ~ This_1 + This_2 + This_3 + This_4 +
       This_5 + This_7 + This_8 +
       This_12 + This_13 + This_14 + This_15 + This_16 + This_17,
     This_18, data = .)
With that data, and with only the variable This_18 given as a subset, do you know why it gives me a perfect regression with an R^2 of 1?
OK, this was a good puzzle.
You have to dig a little bit to find out what the subset= argument does, as it gets passed to the model.frame() function inside lm(). From ?model.frame():
subset: a specification of the rows to be used: defaults to all rows.
This can be any valid indexing vector (see ‘[.data.frame’)
for the rows of ‘data’ or if that is not supplied, a data
frame made up of the variables used in ‘formula’.
(emphasis added). Usually people specify a logical expression for subset= (e.g. This_5>2) to restrict the regression to particular cases. If you put in an integer vector, lm()/model.frame() will select the rows corresponding to those integers.
So ... what lm()/model.frame() have done is to construct a data set for the linear model that consists of rows of the original data set indexed by This_18. In other words, since the first few elements of This_18 are (2,3,4,3,3,2, ...), the first row of the new data set will be row 2 of the original data set; the second row will be row 3; the third row will be row 4; the fourth row will be another copy of row 3; and so on ...
head(model.frame(This_6~.-This_18, data=dd, subset=This_18))
## This_6 This_1 This_2 This_3 This_4 This_5 This_7 This_8 This_9 This_10 ...
## 2 2 5 3 3 3 3 3 2 3 1 ...
## 3 3 3 3 3 3 3 3 4 4 4 ...
## 4 1 3 3 3 3 3 3 2 1 2 ...
## 3.1 3 3 3 3 3 3 3 4 4 4 ...
## 3.2 3 3 3 3 3 3 3 4 4 4 ...
## 2.1 2 5 3 3 3 3 3 2 3 1 ...
(you can also get this object by running model.frame(fitted_model)).
Therefore, since the only values of This_18 are the integers 1-6, you get a regression run only on multiple copies of rows 1-6 of the original data set. Thus it's not surprising that you get a perfect fit, since there are only 6 unique response/predictor combinations.
The remaining question is ... what did you intend to do by using subset=This_18 ... ? "subset" refers to a subset of observations, not a subset of predictors.
If you want to do best subset regression (i.e. find the subset of predictors that maximize some criterion) there is not a single easy answer (and in fact there are some potential statistical pitfalls if you are interested in inference rather than prediction). Googling "R best subset regression" should help you, or searching for those keywords on Stack Overflow. (Or see the glmulti package, or the leaps package, or the stepAIC function in the MASS package, or the MuMIn package, or ...)
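For instance, here is a minimal sketch of best-subset selection with the leaps package, assuming the In_Bigdata2 data frame and the This_* columns from the question (variable names taken from the original formula):
library(leaps)

best <- regsubsets(log(This_6) ~ This_1 + This_2 + This_3 + This_4 +
                     This_5 + This_7 + This_8 + This_12 + This_13 +
                     This_14 + This_15 + This_16 + This_17 + This_18,
                   data = In_Bigdata2, nvmax = 14)
summary(best)$which   # which predictors enter the best model of each size
summary(best)$adjr2   # adjusted R^2 for the best model of each size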

Like dcast but without sum of data

I have data organized for the R survival package, but want to export it to work in Graphpad Prism, which uses a different structure.
#Example data
Treatment<-c("A","A","A","A","A","B","B","B","B","B")
Time<-c(3,4,5,5,5,1,2,2,3,5)
Status<-c(1,1,0,0,0,1,1,1,1,1)
df<-data.frame(Treatment,Time,Status)
The R survival package data structure looks like this
Treatment Time Status
A 3 1
A 4 1
A 5 0
A 5 0
A 5 0
B 1 1
B 2 1
B 2 1
B 3 1
B 5 1
The output I need organizes each treatment as one column, and then sorts by time. Each individual is then recorded as a 1 or 0 according to its Status. The output should look like this:
Time  A  B
   1     1
   2     1
   2     1
   3  1  1
   4  1
   5  0  1
   5  0
   5  0
dcast() does something similar to what I want, but it sums up the Status values and merges them into one cell for all individuals with matching Time values.
Thanks for any help!
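For reference, one way to get this shape (not necessarily the approach referenced in the follow-up below) is to number the repeated Time values within each Treatment and then spread Treatment into columns. A minimal sketch using dplyr and tidyr on the example df:
library(dplyr)
library(tidyr)

wide <- df %>%
  group_by(Treatment, Time) %>%
  mutate(obs = row_number()) %>%   # index repeated Time values within each Treatment
  ungroup() %>%
  pivot_wider(names_from = Treatment, values_from = Status) %>%
  arrange(Time, obs) %>%
  select(Time, A, B)
wide   # NA marks cells where a treatment has no individual at that time/index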
I ran into a weird issue when trying to implement Sotos' code to my actual data. I got the error:
Error in Math.factor(var) : ‘abs’ not meaningful for factors
Which is weird, because Sotos' code works for the example. When I checked the example data frame using sapply() it gave me the result:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "numeric"
My issue, as far as I could tell, was that my Status variable was read as numeric in my example, but as an integer in my real data:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "integer"
I loaded my data from a .csv, so maybe that's what caused the change in variable class. I ended up changing my Status variable using as.numeric(), and then re-generating the dataframe.
Status<-as.numeric(df$Status)
df<-data.frame(Treatment, Time, Status)
And was able to apply Sotos' code to the new dataframe.

Displaying of factor levels and labels in R

I am having an issue with displaying the correct grouping of a factor variable after using MICE. I believe this is an R thing, but I included it with mice just to be sure.
So, I run my mice algorithm; here is a snippet of how I format the variable before passing it to mice. Note, I want it to be 0 for no drug and 1 for yes drug, so I coerce it to be a factor with levels 0 and 1 before I run it:
mydat$drug=factor(mydat$drug,levels=c(0,1),labels=c(0,1))
I then run mice and it runs logistic regression (this is the default) on drug, along with my other variables to be imputed.
I can extract the results of one of the imputations when it is complete by
drug=complete(imp,1)$drug
We can view it
> head(drug)
[1] 0 0 1 0 1 1
attr(,"contrasts")
2
0 0
1 1
Levels: 0 1
So the data is certainly 0,1.
However, when I do something with it, like cbind, it changes to 1's and 2's
> head(cbind(drug))
drug
[1,] 1
[2,] 1
[3,] 2
[4,] 1
[5,] 2
[6,] 2
Even when I coerce it to a numeric
> head(as.numeric(drug))
[1] 1 1 2 1 2 2
I want to say it has something to do with the contrasts, but when I delete the contrast by doing
attr(drug,"contrasts")=NULL
It still shows up as 1's and 2's when it is used and printed by other functions.
I am able to get it to print correctly by using I()
> head(I(drug))
[1] 0 0 1 0 1 1
Levels: 0 1
So, I believe that this is an R issue, but I don't know how to remedy it. Is using I() the correct solution, or is it just a workaround that happens to work here? What is actually happening behind the scenes that is making the output display as 1's and 2's?
Thanks
Factors start with the first level being represented internally by 1.
Your two options:
1) Adjust for 1-based index of levels:
as.numeric(drug) - 1
2) Take the labels of the factors and convert to numeric:
as.numeric(as.character(drug))
Some people will point you in the direction of the faster option that does the same thing:
as.numeric(levels(drug))[drug]
I'd also consider using logical values instead of factor in the first place.
mydat$drug = as.logical(mydat$drug)
The 0s and 1s are the names of your levels. The underlying integers corresponding to those names are 1 and 2. You can see this with str():
str(drug)
# Factor w/ 2 levels "0","1": 1 1 2 1 2 2 ...
When you coerce the factor to numeric, you drop the names and get the integer representation.
This is how R encodes factors. The underlying numeric representation of a factor always starts with 1, as you can see from the following two examples:
as.numeric(factor(c(0, 1)))
as.numeric(factor(c("A", "B")))
Not sure about the specifics of how MICE works, but if it requires a factor instead of a simple 0/1 numeric variable to use logistic regression, you can always hack the results with something like the following:
as.numeric(as.character(factor(c(0,1))))
or in your specific case
drug <- as.numeric(as.character(drug))
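As a small illustration of the behaviour both answers describe (using a made-up factor, not the actual MICE output):
f <- factor(c(0, 0, 1, 0, 1, 1))
cbind(f)                     # 1 1 2 1 2 2: cbind() builds a numeric matrix,
                             # so the factor collapses to its integer codes
as.numeric(as.character(f))  # 0 0 1 0 1 1: converting via the labels recovers 0/1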

Using mlogit in R with dependent and independent categorical variables

I have two vectors (A and B) with categorical data on 36 subjects: A_i is the type-1 category j that subject i falls into, and B_i is the type-2 category k of subject i, with i = 1:36, j = 1:5 and k = 1:6.
library(mlogit)
AB <- read.csv("C:/.../AB.csv")
head(AB)
Subject A B
1 1 1 3
2 2 3 3
3 3 1 6
4 4 1 3
5 5 1 2
6 6 1 4
I would like to find a probability for every category combination, i.e. with what probability does a subject choose categories j and k, for all j = 1:5 and k = 1:6?
I was told the probit/logit model was a great tool to use for this problem and I tried estimating it in R.
mldata<-mlogit.data(AB, choice="A", alt.var="B", shape="long", id.var = "Subject")
This gives me an error and I cannot find my mistake.
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.3", "1.3", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1.3’, ‘2.2’, ‘2.3’, ‘3.1’,‘3.5’,‘4.2’,‘4.3’, ‘5.3’, ‘5.4’, ‘6.5’, ‘7.3’, ‘8.2’, ‘8.3’
I tried looking through the help files, but that has not helped me a lot.
I hope someone can point out the mistake(s) I'm making.
Thank you very much for your help.
Post the output of dput(A) and dput(B) and specify what the first couple of answers should be. It looks like you want rowSums(.)/6 across some logical operation on those two matrices. Probably:
rowSums(A==B)/6
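That said, if the goal is simply the observed probability of each (A, B) category combination, a plain cross-tabulation may already be enough. A minimal sketch, assuming the AB data frame from the question:
joint <- prop.table(table(A = AB$A, B = AB$B))
round(joint, 3)   # 5 x 6 table of relative frequencies, one cell per (j, k) pair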

Not sure why dcast() this data set results in dropping variables

I have a data frame that looks like:
id fromuserid touserid from_country to_country length
1 1 54525953 47195889 US US 2
2 2 54525953 54361607 US US 1
3 3 54525953 53571081 US US 2
4 4 41943048 55379244 US US 1
5 5 47185938 53140304 US PR 1
6 6 47185938 54121387 US US 1
7 7 54525974 50928645 GB GB 1
8 8 54525974 53495302 GB GB 1
9 9 51380247 45214216 SG SG 2
10 10 51380247 43972484 SG US 2
Each row describes a number of messages (length) sent from one user to another user.
What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.
There are almost 200 countries. I use the function dcast as follows:
countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)
This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.
At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.
So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?
Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:
chats$from_country <- factor(chats$from_country,
                             levels = unique(c(chats$from_country,
                                               chats$to_country)))
chats$to_country <- factor(chats$to_country,
                           levels = levels(chats$from_country))
dcast(chats,from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# from_country US GB SG PR
# 1 US 5 0 0 1
# 2 GB 0 2 0 0
# 3 SG 1 0 1 0
# 4 PR 0 0 0 0
If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:
chats$from_country <- factor(chats$from_country,
                             levels = unique(c(levels(chats$from_country),
                                               levels(chats$to_country))))
Why is this necessary? If they are already factors, then c(chats$from_country, chats$to_country) will coerce the factors to numeric, and since that doesn't match with any of the character values of the factors, it will result in <NA>.
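One more note on the dcast() call itself: the "Aggregation function missing: defaulting to length" message means the table above counts rows rather than summing the length column. If the goal is the total number of messages per country pair, the value column and aggregation function can be named explicitly; a sketch, still using reshape2's dcast():
dcast(chats, from_country ~ to_country,
      value.var = "length", fun.aggregate = sum, drop = FALSE, fill = 0)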
