I have data from a discrete choice experiment (DCE) looking at hiring preferences for individuals from different sectors, which I've formatted into long format. I want to model it using mlogit. I have exported the data and can successfully run the model in Stata using the asclogit command, but I'm having trouble getting it to run in R.
Here's a snapshot of the first 25 rows of data:
> data[1:25,]
userid chid item sector outcome cul fit ind led prj rel
1 11275 211275 2 1 1 0 1 0 1 1 1
2 11275 211275 2 2 0 1 0 0 0 0 0
3 11275 211275 2 0 0 0 0 1 1 0 1
4 11275 311275 3 0 1 1 1 0 0 0 1
5 11275 311275 3 2 0 0 1 0 0 0 1
6 11275 311275 3 1 0 0 1 0 0 0 0
7 11275 411275 4 0 0 1 0 1 1 0 0
8 11275 411275 4 2 1 0 1 1 1 1 0
9 11275 411275 4 1 0 0 1 0 1 0 0
10 11275 511275 5 1 1 1 0 1 0 1 1
11 11275 511275 5 2 0 0 0 1 1 0 0
12 11275 511275 5 0 0 0 0 1 1 1 0
13 11275 611275 6 0 0 0 1 1 0 0 1
14 11275 611275 6 1 1 1 1 1 0 0 1
15 11275 611275 6 2 0 1 1 1 0 1 0
16 11275 711275 7 1 0 0 0 0 0 1 0
17 11275 711275 7 0 0 1 0 0 1 1 0
18 11275 711275 7 2 1 1 0 0 1 1 1
19 11275 811275 8 0 1 0 1 0 0 1 1
20 11275 811275 8 1 0 1 0 1 1 1 1
21 11275 811275 8 2 0 0 0 0 0 1 1
22 11275 911275 9 0 0 1 1 0 0 1 0
23 11275 911275 9 2 1 1 1 1 1 0 1
24 11275 911275 9 1 0 1 0 1 1 0 0
25 11275 1011275 10 0 0 0 0 0 0 0 0
userid and chid are factor variables; the rest are numeric. The variables:
userid is the unique respondent ID
chid is the unique choice set ID per respondent
item is the choice set ID (these are repeated across respondents)
sector is the alternative (3 different sectors)
outcome indicates the alternative selected by the respondent in the given choice set
cul through rel are binary, alternative-specific variables that vary across alternatives according to the experimental design.
Here is my mlogit syntax:
mlogit(outcome~cul+fit+ind+led+prj+rel,shape="long",
data=data,id.var=userid,chid.var="chid",
choice=outcome,alt.var="sector")
Here is the error I get:
Error in if (abs(x - oldx) < ftol) { :
missing value where TRUE/FALSE needed
I've made sure there are no missing data, and that each choice set has exactly 1 selected alternative.
Any ideas about why I'm getting this error, when the model runs fine in Stata with the exact same dataset? I've probably misread the mlogit syntax somewhere. If it helps, my Stata syntax is:
asclogit outcome cul fit rel ind fit led prj, case(chid) alternatives(sector)
Answering my own question here as I figured it out.
R's mlogit can't handle a choice set in which none of the alternatives is selected. It also needs the data ordered properly: the alternatives in a choice set must be in consecutive rows. I hadn't done that due to some data management. Interestingly, Stata can handle both of these conditions, which is why my Stata commands worked.
As an aside, for those interested, Stata's asclogit and R's mlogit give the exact same results. Always nice when that happens.
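The two conditions described in this answer can be checked in base R before the data goes anywhere near mlogit. What follows is only a minimal sketch: the column names mirror the question, but the values are made up, and the variable names dce, n_chosen, bad_sets and clean are hypothetical.

```r
# Toy long-format data; column names mirror the question, values are invented.
# Choice set 4 has no selected alternative, and the rows are not grouped by set.
dce <- data.frame(
  chid    = c(2, 3, 2, 3, 2, 3, 4, 4, 4),
  sector  = c(1, 1, 2, 2, 0, 0, 0, 1, 2),
  outcome = c(1, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Number of selected alternatives in each choice set
n_chosen <- tapply(dce$outcome, dce$chid, sum)

# Choice sets with zero selections or more than one selection
bad_sets <- names(n_chosen)[n_chosen != 1]

# Drop those sets, then sort so that the alternatives of each
# choice set sit in consecutive rows
clean <- dce[!(as.character(dce$chid) %in% bad_sets), ]
clean <- clean[order(clean$chid, clean$sector), ]
```

Whether to drop or repair the offending choice sets depends on why they are empty, so the filtering step is a choice made for this sketch rather than a recommendation.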
You may need to use mlogit.data() to shape the data. There are examples at ?mlogit. Hope that helps.
Related
I am trying to use a binary matrix containing transactions for the apriori algorithm, but I don't know how to implement it.
data_purchase
Txn Bag Blush Nail.Polish Brushes Concealer Eyebrowpencil Bronzer
1 1 0 1 1 1 1 0 1
2 2 0 0 1 0 1 0 1
3 3 0 1 0 0 1 1 1
4 4 0 0 1 1 1 0 1
5 5 0 1 0 0 1 0 1
6 6 0 0 0 0 1 0 0
7 7 0 1 1 1 1 0 1
8 8 0 0 1 1 0 0 1
9 9 0 0 0 0 1 0 0
10 10 1 1 1 1 0 0 0
11 11 0 0 1 0 0 0 1
12 12 0 0 1 1 1 0 1
The above is the data frame containing the binary matrix.
Have a look at the R package arules at https://cran.r-project.org/package=arules
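To get a feel for what an apriori-style search counts, the support of single items and of item pairs can be read straight off the binary matrix in base R. This is only a sketch of the counting step, not of the arules API: the matrix m re-enters the item columns shown in the question, and the names support, pair_counts and conf_rule are made up for the example.

```r
items <- c("Bag", "Blush", "Nail.Polish", "Brushes",
           "Concealer", "Eyebrowpencil", "Bronzer")

# The 12 transactions from the question, one row per transaction
m <- matrix(c(
  0, 1, 1, 1, 1, 0, 1,
  0, 0, 1, 0, 1, 0, 1,
  0, 1, 0, 0, 1, 1, 1,
  0, 0, 1, 1, 1, 0, 1,
  0, 1, 0, 0, 1, 0, 1,
  0, 0, 0, 0, 1, 0, 0,
  0, 1, 1, 1, 1, 0, 1,
  0, 0, 1, 1, 0, 0, 1,
  0, 0, 0, 0, 1, 0, 0,
  1, 1, 1, 1, 0, 0, 0,
  0, 0, 1, 0, 0, 0, 1,
  0, 0, 1, 1, 1, 0, 1
), ncol = 7, byrow = TRUE, dimnames = list(NULL, items))

# Support of each single item: share of transactions containing it
support <- colMeans(m)

# Co-occurrence counts: entry [i, j] is the number of transactions
# that contain both item i and item j
pair_counts <- crossprod(m)

# Confidence of the rule {Concealer} => {Bronzer}
conf_rule <- pair_counts["Concealer", "Bronzer"] / sum(m[, "Concealer"])
```

The arules package extends this idea to itemsets of any size and prunes infrequent candidates along the way, so for real data it is the better tool.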
I have a CSV file containing 600k lines and 3 columns: the first contains a disease name, the second a gene, and the third a number, something like the lines below. I have roughly 4k diseases and 16k genes, so the disease and gene names are sometimes repeated.
cholera xx45 12
Cancer xx65 1
cholera xx65 0
I would like to make a DTM (document-term matrix) using R. I've been trying to use the Corpus command from the tm library, but Corpus doesn't reduce the number of diseases and its size stays at around 600k. I'd love to understand how to transform that file into a DTM.
I'm sorry for not being that precise, totally starting with computer science things as a bio guy :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Create a table with 100k rows: 10k random gene names (at most 6,500 distinct, recycled to fill 100k rows) and 40 different diseases
df <- data.frame(gene=sapply(1:10000, function(x) paste(c(sample(LETTERS, size=2), sample(10, size=1)), collapse="")),
                 disease=sample(40, size=100000, replace=TRUE))
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first 6 rows (because it's so large and sparse):
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
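If the number in the third column does matter, one option is xtabs() from base R, which sums a value column into a disease-by-gene matrix instead of counting rows. A minimal sketch using the three example lines from the question; the column names disease, gene and n are assumptions, since the file has no header shown.

```r
# The three example rows from the question; column names are assumed
assoc <- data.frame(
  disease = c("cholera", "Cancer", "cholera"),
  gene    = c("xx45", "xx65", "xx65"),
  n       = c(12, 1, 0)
)

# Rows are diseases, columns are genes, cells hold the summed third column.
# Combinations that never occur in the file are filled with 0.
dtm <- xtabs(n ~ disease + gene, data = assoc)

# Look up a single cell by name
dtm["cholera", "xx45"]
```

With about 4k diseases and 16k genes the dense result has roughly 64 million cells. xtabs() also accepts sparse = TRUE, which returns a sparse matrix via the Matrix package and may be worth trying at that size.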
I'm trying to do discrete choice modeling on the data below. Basically, 30 customers each have 16 different choices of pizza. They can choose more than one type of pizza, and the ones they choose are indicated by the choice variable.
pizza cust choice pan thin pineapple veggie sausage romano mozarella oz
1 1 Cust1 0 1 0 1 0 0 1 0 1
2 2 Cust1 1 0 1 1 0 0 0 0 0
3 3 Cust1 0 0 0 1 0 0 0 1 1
4 4 Cust1 1 0 1 1 0 0 0 0 0
5 5 Cust1 1 1 0 0 1 0 0 0 1
6 6 Cust1 0 0 1 0 1 0 1 0 0
7 7 Cust1 0 0 0 0 1 0 0 0 1
8 8 Cust1 1 0 1 0 1 0 0 1 0
9 9 Cust1 0 1 0 0 0 1 0 1 0
10 10 Cust1 1 0 1 0 0 1 0 0 1
11 11 Cust1 0 0 0 0 0 1 1 0 0
12 12 Cust1 0 0 1 0 0 1 0 0 1
13 13 Cust1 0 1 0 0 0 0 0 0 0
14 14 Cust1 1 0 1 0 0 0 0 1 1
15 15 Cust1 0 0 0 0 0 0 0 0 0
16 16 Cust1 0 0 1 0 0 0 1 0 1
17 1 Cust10 0 1 0 1 0 0 1 0 1
18 2 Cust10 0 0 1 1 0 0 0 0 0
19 3 Cust10 0 0 0 1 0 0 0 1 1
20 4 Cust10 0 0 1 1 0 0 0 0 0
I use the command below to transform my data. I tried making a few changes, such as adding chid.var = "chid" and alt.levels=c(1:16). If I use both alt.levels and alt.var, I get an error saying pizza already exists and will be replaced. However, I get no error if I use only one of them.
pz <- mlogit.data(pizza,shape = "long",choice = "choice",
varying = 4:8, id = "cust", alt.var = "pizza")
Finally, when I use the mlogit command, I get this error.
mlogit(choice ~ pan + thin + pineapple + veggie + sausage + romano + mozarella + oz, pz)
Error in solve.default(H, g[!fixed]) :
system is computationally singular: reciprocal condition number = 8.23306e-19
This is my first post on stackoverflow. I visit this site very often and so far never needed to post as I found solutions already. I went through almost all similar posts like this one but in vain. I'm new to discrete choice modeling so I don't know if I'm making any fundamental mistake here.
Also, I'm not really sure what chid.var does.
I couldn't solve this problem with mlogit. However, you can use the multinom function from the nnet package. It seems to work, and I verified the answer.
The dataset remains the same as shown in the question, so no transformation is needed.
library("nnet")
pizza_model <- multinom(choice ~ Price + I_cPan + I_cThin, data=pizza_all)
summary(pizza_model)
where choice is the categorical dependent variable you want to predict, and Price, I_cPan, and I_cThin are independent variables. Below is the output:
Call:
multinom(formula = choice ~ Price + I_cPan + I_cThin, data = pizza_all)
Coefficients:
Values Std. Err.
(Intercept) 0.007192623 1.3298018
Price -0.149665357 0.1464976
I_cPan 0.098438084 0.3138538
I_cThin 0.624447867 0.2637110
Residual Deviance: 553.8519
AIC: 561.8519
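On the "computationally singular" error in the question, one possible cause worth ruling out (an assumption, not something confirmed in the thread) is that every attribute is a fixed property of the pizza. In that case the attribute columns carry no information beyond the alternative-specific intercepts, and the design matrix loses rank. A toy illustration in base R, with a hypothetical 4-pizza, 3-customer design:

```r
# Hypothetical toy design: 4 pizzas (alternatives) shown to 3 customers
toy <- expand.grid(pizza = factor(1:4), cust = factor(1:3))

# 'pan' is a property of the pizza, so it is identical for every customer
toy$pan <- as.numeric(toy$pizza %in% c("1", "2"))

# Design matrix with one intercept per pizza plus the attribute
X <- model.matrix(~ 0 + pizza + pan, data = toy)

# The rank is lower than the number of columns: 'pan' is an exact
# linear combination of the pizza intercepts
rank_X <- qr(X)$rank
c(columns = ncol(X), rank = rank_X)
```

If that is what is happening, dropping the alternative-specific intercepts from the model is one thing to try. In mlogit's formula notation that is usually written with `| 0` or `| -1` after the attribute terms, but check the package documentation for the version you have installed.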
I am a rookie in R, and I think my questions are basic ones. I want to know the frequency of a variable under a couple of conditions. I tried to use table(), but it did not work. I have searched a lot and still cannot find the answers.
My data looks like this
ID AGE LEVEL End_month
1 14 1 201005
2 25 2 201006
3 17 2 201006
4 16 1 201008
5 19 3 201007
6 33 2 201008
7 17 2 201006
8 15 3 201005
9 23 1 201004
10 25 2 201007
I want to know two things.
First, I want to know the frequency of each age within each level. Ages up to a certain cutoff are shown individually, and the rest are aggregated into a single category. It looks like this.
level
1 2 3 sum
age 14 1 0 0 1
16 1 0 0 1
15 0 0 1 1
17 0 2 0 2
19 0 0 1 1
20+ 1 3 0 4
sum 3 5 2 10
Second, I want to know the frequency of each age in each End_month for level 2 and level 3 customers. I want to get a table like this.
For level 2 customer
End_month
201004 201005 201006 201007 201008 sum
age 15 0 0 0 0 0 0
19 0 0 0 0 0 0
17 0 0 2 0 0 2
19 0 0 0 0 0 0
25 0 0 0 1 0 1
33 0 0 0 1 1 2
sum 0 0 2 2 1 5
For level 3 customer
End_month
201004 201005 201006 201007 201008 sum
age 15 0 1 0 0 0 1
19 0 0 0 1 0 1
17 0 0 0 0 0 0
19 0 0 0 0 0 0
25 0 0 0 0 0 0
33 0 0 0 0 0 0
sum 0 1 0 1 0 2
Many thanks in advance.
You can still achieve this with table, because it can take more than one variable.
For example, use
table(AGE, LEVEL)
to get the first two-way table.
Now, when you want to produce such a table for each subset according to LEVEL, you can do it this way, assuming we are going for level 1:
subset <- LEVEL == 1
table(AGE[subset], End_month[subset])
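Putting the pieces together on the ten rows from the question, a minimal sketch might look like the following. The data frame name cust and the helper column age_grp are made up for the example, and the cutoff of 20 for the "20+" group is read off the desired output in the question.

```r
# The ten rows from the question
cust <- data.frame(
  ID        = 1:10,
  AGE       = c(14, 25, 17, 16, 19, 33, 17, 15, 23, 25),
  LEVEL     = c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2),
  End_month = c(201005, 201006, 201006, 201008, 201007,
                201008, 201006, 201005, 201004, 201007)
)

# Collapse ages of 20 and above into a single "20+" group
cust$age_grp <- ifelse(cust$AGE >= 20, "20+", as.character(cust$AGE))

# First table: age group by level, with row and column totals
tab1 <- addmargins(table(age = cust$age_grp, level = cust$LEVEL))

# Second table: age by end month, level 2 customers only
lvl2 <- cust[cust$LEVEL == 2, ]
tab2 <- addmargins(table(age = lvl2$AGE, End_month = lvl2$End_month))
```

addmargins() labels the totals "Sum". Ages and months that do not occur in a subset are dropped from its table; converting AGE and End_month to factors with the full set of levels before subsetting would keep them as zero rows and columns.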