mlogit gives error: the two indexes don't define unique observations - r

My dataframe named longData looks like:
ID Set Choice Apple Microsoft IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia
1 1 1 0 1 0 0 0 0 0 0 0 0 0
2 1 2 0 0 1 0 0 0 0 0 0 0 0
3 1 3 0 0 0 1 0 0 0 0 0 0 0
4 1 4 1 0 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 0 0 0 0 0 0 0 1
6 1 6 0 -1 0 0 0 0 0 0 0 0 0
I am trying to run mlogit on it by:
logitModel = mlogit(Choice ~ Apple+Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, shape = "long")
it gives the following error:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
after looking for some time I found that this error was given by dfidx as seen in here as:
z <- data[, c(posid1[1], posid2[1])]
if (nrow(z) != nrow(unique(z)))
stop("the two indexes don't define unique observations")
but upon calling the following code, it runs without the error and gives the names of two idx that are uniquely able to identify a row in dataframe:
dfidx(longData)$idx
this gives expected output as:
~~~ indexes ~~~~
ID Set
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
indexes: 1, 2
So what am I doing wrong, I saw some related questions 1, 2 but couldn't find what I am missing.

It looks like your example comes from here: https://docs.displayr.com/wiki/MaxDiff_Analysis_Case_Study_Using_R
The code seems outdated, I remember it worked for me, but not anymore.
The error message is valid because every pair (ID, Set) appears several times, once for each alternative.
However this works:
# there will be complaint that choice can't be coerced to logical otherwise
longData$Choice <- as.logical(longData$Choice)
# create alternative number (nAltsPerSet is 5 in this example)
longData$Alternative <- 1+( 0:(nrow(longData)-1) %% nAltsPerSet)
# define dataset
mdata <- mlogit.data(data=longData,shape="long", choice="Choice",alt.var="Alternative",id.var="ID")
# model
logitModel = mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0,
data = mdata
)
summary(logitModel)

Related

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked.It shows up as either 1 or 0:
jb$finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to be able to add the number of "1" s in each row across the multiple of selected column/variables I just made (using the code above) into another column called "sum.content". I used the code below:
jb <- jb %>%
mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result of this was 14 for all my row.I was expecting something <9 since I only selected for 9 variables.I don't want to delete the other variables like V1 and V2, I just want to focus on summing some variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to add the number of 1's in each row(within the columns I want to select). How would I go about incorporating the adding of the 1's in code(from a set of variable/column)?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
data <- matrix(rnorm(100,100,30),nrow = 10)
# recode to binary
data <- apply(data,2,function(x){x <- ifelse(x > 100,1,0)})
# change some of the column names to illustrate impact of
# select() within mutate()
colnames(data) <- c(paste0("V",1:5),paste0("X",1:5))
as.data.frame(data) %>%
mutate(total = select(.,starts_with("V")) %>% rowSums)
...and the output, where the sums should equal the sum of V1 - V5 but not
X1 - X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
>

How to compare data frame with a factor as a variable?

I have a data frame, please see below.
How do I compare the Volume where Purchase == 1 to the previous Purchase == 1 Volume and create a factor variable V1 like shown in the Picture 2?
The df[5,"V1"] == 1 because df[5,"Volume"] > df[3,"Volume"].... and so on.
How to achieve this without using loops, how do I achieve this the vectorized way so calculation speed is faster(when dealing with millions of rows)?
I've tried sub-setting, then do the comparison but when tried to put them back to a factor variable, the number of rows of the result is not the same as the number of rows of the df therefore I cannot put the factor variable to the dataframe.
Picture 2
Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0
With data.table:
library(data.table)
data <- data.table(read.table(text=' Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0', header=T))
data[, V1 := 0]
data[Purchase == 1, V1 := as.integer(Volume > shift(Volume)) ]
data[, V1 := as.factor(V1)]
Here, I filtered data to where Purchase = 1, then I brought previous Volume with shift function.
Finally, I compared Volume to Previous volume and assigned 1 if Volume is larger than Previous.

Document-Term Matrix with Quanteda

I have a dataframe df with this structure :
Rank Review
5 good film
8 very goood film
..
Then I tried to create a DocumentTermMatris using quanteda package :
temp.tf <- df$Review %>% tokens(ngrams = 1:1) %>% # generate tokens
+ dfm %>% # generate dfm
+ convert(to = "tm")
I get this matrix :
> inspect(temp.tf)
<<DocumentTermMatrix (documents: 63023, terms: 23892)>>
Non-/sparse entries: 520634/1505224882
Sparsity : 100%
Maximal term length: 77
Weighting : term frequency (tf)
Sample :
Whith this structure :
Terms
Docs good very film my excellent heart David plus always so
text14670 1 0 0 0 1 0 0 0 2 0
text19951 3 0 0 0 0 0 0 1 1 1
text24305 7 0 2 1 0 0 0 2 0 0
text26985 6 0 0 0 0 0 0 4 0 1
text29518 4 0 1 0 1 0 0 3 0 1
text34547 5 2 0 0 0 0 2 3 1 3
text3781 3 0 1 4 0 0 0 3 0 0
text5272 4 0 0 4 0 5 0 3 1 2
text5367 3 0 1 3 0 0 1 4 0 1
text6001 3 0 9 1 0 6 0 1 0 1
So I think It is good , but I think that : text6001 , text5367, text5272 ... refer to document's name...
My question is that rows in this matrix are ordered? or randoms putted in the matrix?
Thank you
EDIT :
I created a document term frequency :
mydfm <- dfm(df$Review, remove = stopwords("french"), stem = TRUE)
Then, I created a tf-idf matrix :
tfidf <- tfidf(mydfm)[, 5:10]
Then I would like to merge the tfidf matrix to the Rank column to have something like this
features
Docs good very film my excellent heart David plus always so Rank
text14670 1 0 0 0 1 0 0 0 2 0 3
text19951 3 0 0 0 0 0 0 1 1 1 2
text24305 7 0 2 1 0 0 0 2 0 0 4
text26985 6 0 0 0 0 0 0 4 0 1 5
Can you help to make this merge?
Thank you
The rows (documents) are alphabetically ordered, which is why text14670 comes before text19951. It is possible that the conversion has reordered the documents, but you can test this using
sum(rownames(temp.tf) == sort(rownames(temp.tf))
If that is not 0, then they are not alphabetically ordered.
The feature ordering, at least in the quanteda dfm, come from the order in which they are found in the texts. You can resort both using dfm_sort().
In your code, the tokens(ngrams = 1:1) is unnecessary since dfm() does that and ngrams = 1 is the default.
Also, do you need to convert this to a tm object? Probably most of what you need can be done in quanteda.

counting the occurrences of a number and when it occurred in R data.frame and data.table

I have newly started to learn R, so my question may be utterly ridiculous. I have a data frame
data<- data.frame('number'=1:11, 'col1'=sample(10:20),'col2'=sample(10:20),'col3'=sample(10:20),'col4'=sample(10:20),'col5'=sample(10:20), 'date'= c('12-12-2014','12-11-2014','12-10-2014','12-09-2014', '12-08-2014','12-07-2014','12-06-2014','12-05-2014','12-04-2014', '12-04-2014', '12-03-2014') )
The number column is an 'id' column and the last column is a date.
I want to count the number of times that each number occurs across (not per column, but the whole data frame containing data) the columns 2:6 and when they occurred.
I am stuck on the first part having tried the following using data.table:
count <- function(){
i = 1
DT <-data.table(data[2:6])
for (i in 10:20){
DT[, .N, by =i]
i = i + 1
}
}
which gives an error that I don't begin to understand
Error in `[.data.table`(DT, , .N, by = i) :
The items in the 'by' or 'keyby' list are length (1). Each must be same length as rows in x or number of rows returned by i (11)
Can someone help, please. Also with the second part that I have not even attempted yet i.e. associating a date or a row number with each occurrence of a number
Perhaps you may want this
library(reshape2)
table(melt(data[,-1], id.var='date')[,-2])
# value
#date 10 11 12 13 14 15 16 17 18 19 20
# 12-03-2014 0 0 1 0 0 1 0 0 1 2 0
# 12-04-2014 2 0 0 2 2 0 1 0 1 1 1
# 12-05-2014 0 0 0 0 0 0 1 1 2 0 1
# 12-06-2014 1 1 0 0 0 1 0 1 0 0 1
# 12-07-2014 0 1 0 1 0 1 1 1 0 0 0
# 12-08-2014 1 1 0 0 1 0 0 1 1 0 0
# 12-09-2014 0 0 2 0 1 2 0 0 0 0 0
# 12-10-2014 0 0 1 1 0 0 1 0 0 1 1
# 12-11-2014 0 1 1 0 0 0 1 0 0 1 1
# 12-12-2014 1 1 0 1 1 0 0 1 0 0 0
Or if you need a data.table solution (from #Arun's comments)
library(data.table)
dcast.data.table(melt(setDT(data),
id="date", measure=2:6), date ~ value)

model.matrix() with na.action=NULL?

I have a formula and a data frame, and I want to extract the model.matrix(). However, I need the resulting matrix to include the NAs that were found in the original dataset. If I were to use model.frame() to do this, I would simply pass it na.action=NULL. However, the output I need is of the model.matrix() format. Specifically, I need only the right-hand side variables, I need the output to be a matrix (not a data frame), and I need factors to be converted to a series of dummy variables.
I'm sure I could hack something together using loops or something, but I was wondering if anyone could suggest a cleaner and more efficient workaround. Thanks a lot for your time!
And here's an example:
dat <- data.frame(matrix(rnorm(20),5,4), gl(5,2))
dat[3,5] <- NA
names(dat) <- c(letters[1:4], 'fact')
ff <- a ~ b + fact
# This omits the row with a missing observation on the factor
model.matrix(ff, dat)
# This keeps the NA, but it gives me a data frame and does not dichotomize the factor
model.frame(ff, dat, na.action=NULL)
Here is what I would like to obtain:
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.7266086 0 0 0 0
2 1 -0.6088697 0 0 0 0
3 NA 0.4643360 NA NA NA NA
4 1 -1.1666248 1 0 0 0
5 1 -0.7577394 0 1 0 0
6 1 0.7266086 0 1 0 0
7 1 -0.6088697 0 0 1 0
8 1 0.4643360 0 0 1 0
9 1 -1.1666248 0 0 0 1
10 1 -0.7577394 0 0 0 1
Joris's suggestion works, but a quicker and cleaner way to do this is via the global na.action setting. The 'Pass' option achieves our goal of preserving NA's from the original dataset.
Option 1: Pass
Resulting matrix will contain NA's in rows corresponding to the original dataset.
options(na.action='na.pass')
model.matrix(ff, dat)
Option 2: Omit
Resulting matrix will skip rows containing NA's.
options(na.action='na.omit')
model.matrix(ff, dat)
Option 3: Fail
An error will occur if the original data contains NA's.
options(na.action='na.fail')
model.matrix(ff, dat)
Of course, always be careful when changing global options because they can alter behavior of other parts of your code. A cautious person might store the original setting with something like current.na.action <- options('na.action'), and then change it back after making the model.matrix.
Another way is to use the model.frame function with argument na.action=na.pass as your second argument to model.matrix:
> model.matrix(ff, model.frame(~ ., dat, na.action=na.pass))
(Intercept) b fact2 fact3 fact4 fact5
1 1 -1.3560754 0 0 0 0
2 1 2.5476965 0 0 0 0
3 1 0.4635628 NA NA NA NA
4 1 -0.2871379 1 0 0 0
5 1 2.2684958 0 1 0 0
6 1 -1.3560754 0 1 0 0
7 1 2.5476965 0 0 1 0
8 1 0.4635628 0 0 1 0
9 1 -0.2871379 0 0 0 1
10 1 2.2684958 0 0 0 1
model.frame allows you to set the appropriate action for na.action which is maintained when model.matrix is called.
I half-stumbled across a simpler solution after looking at mattdevlin and Nathan Gould's answers:
model.matrix.lm(ff, dat, na.action = "na.pass")
model.matrix.default may not support the na.action argument, but model.matrix.lm does!
(I found model.matrix.lm from Rstudio's auto-complete suggestions — it appears to be the only non-default method for model.matrix if you haven't loaded any libraries that add others. Then I just guessed it might support the na.action argument.)
You can mess around a little with the model.matrix object, based on the rownames :
MM <- model.matrix(ff,dat)
MM <- MM[match(rownames(dat),rownames(MM)),]
MM[,"b"] <- dat$b
rownames(MM) <- rownames(dat)
which gives :
> MM
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.9583010 0 0 0 0
2 1 0.3266986 0 0 0 0
3 NA 1.4992358 NA NA NA NA
4 1 1.2867461 1 0 0 0
5 1 0.5024700 0 1 0 0
6 1 0.9583010 0 1 0 0
7 1 0.3266986 0 0 1 0
8 1 1.4992358 0 0 1 0
9 1 1.2867461 0 0 0 1
10 1 0.5024700 0 0 0 1
Alternatively, you can use contrasts() to do the work for you. Constructing the matrix by hand would be :
cont <- contrasts(dat$fact)[as.numeric(dat$fact),]
colnames(cont) <- paste("fact",colnames(cont),sep="")
out <- cbind(1,dat$b,cont)
out[is.na(dat$fact),1] <- NA
colnames(out)[1:2]<- c("Intercept","b")
rownames(out) <- rownames(dat)
which gives :
> out
Intercept b fact2 fact3 fact4 fact5
1 1 0.2534288 0 0 0 0
2 1 0.2697760 0 0 0 0
3 NA -0.8236879 NA NA NA NA
4 1 -0.6053445 1 0 0 0
5 1 0.4608907 0 1 0 0
6 1 0.2534288 0 1 0 0
7 1 0.2697760 0 0 1 0
8 1 -0.8236879 0 0 1 0
9 1 -0.6053445 0 0 0 1
10 1 0.4608907 0 0 0 1
In any case, both methods can be incorporated in a function that can deal with more complex formulae. I leave the exercise to the reader (what do I loath that sentence when I meet it in a paper ;-) )

Resources