R: how to format my data for multinomial logit?

I am reproducing some Stata code in R and I would like to perform a multinomial logistic regression with the mlogit function, from the package of the same name (I know that there is a multinom function in nnet but I don't want to use that one).
My problem is that, to use mlogit, I need my data to be formatted using mlogit.data and I can't figure out how to format it properly. Comparing my data to the data used in the examples in the documentation and in this question, I realize that it is not in the same form.
Indeed, the data I use is like:
df <- data.frame(ID = seq(1, 10),
type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
ID type age dum1 dum2
1 1 2 28 1 1
2 2 3 31 0 0
3 3 4 12 0 1
4 4 2 1 0 1
5 5 1 49 0 0
6 6 1 80 1 0
7 7 4 36 0 1
8 8 1 53 1 0
9 9 3 22 1 1
10 10 2 10 0 0
whereas the data they use is like:
key altkey A B C D
1 201005131 1 2.6 118.17 117 0
2 201005131 2 1.4 117.11 115 0
3 201005131 3 1.1 117.38 122 1
4 201005131 4 24.6 NA 122 0
5 201005131 5 48.6 91.90 122 0
6 201005131 6 59.8 NA 122 0
7 201005132 1 20.2 118.23 113 0
8 201005132 2 2.5 123.67 120 1
9 201005132 3 7.4 116.30 120 0
10 201005132 4 2.8 118.86 120 0
11 201005132 5 6.9 124.72 120 0
12 201005132 6 2.5 123.81 120 0
As you can see, in their case, there is a column altkey that details every category for each key and there is also a column D showing which alternative is chosen by the person.
However, I only have one column (type) which shows the choice of the individual but does not show the other alternatives or the value of the other variables for each of these alternatives. When I try to apply mlogit, I have:
library(mlogit)
mlogit(type ~ age + dum1 + dum2, df)
Error in data.frame(lapply(index, function(x) x[drop = TRUE]), row.names = rownames(mydata)) :
row names supplied are of the wrong length
Therefore, how can I format my data so that it corresponds to the type of data mlogit requires?
Edit: following the advice of @edsandorf, I modified my data frame and mlogit.data now works, but all the other explanatory variables have the same value for each alternative. Should I set these variables to 0 in the rows where the chosen alternative is 0 or FALSE? (In fact, could somebody show me the procedure from where I am to the mlogit results? I don't see where I'm going wrong in the estimation.)
The data I show here (df) is not my true data. However, it is exactly the same form: a column with the choice of the alternative (type), columns with dummies and age, etc.
Here's the procedure I've made so far (I did not set the alternatives to 0):
# create a dataframe with all alternatives for each ID
qqch <- data.frame(ID = rep(df$ID, each = 4),
choice = rep(1:4, 10))
# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type as 1 if it matches the alternative in choice, 0 otherwise
for (i in 1:length(df2$ID)){
df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
ID choice type age dum1 dum2
1.1 1 1 FALSE 28 1 1
1.2 1 2 TRUE 28 1 1
1.3 1 3 FALSE 28 1 1
1.4 1 4 FALSE 28 1 1
2.1 2 1 FALSE 31 0 0
2.2 2 2 FALSE 31 0 0
If I do :
mlogit(type ~ age + dum1 + dum2, df3)
I have the error:
Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number

Your data doesn't lend itself well to estimation with an MNL model unless we make more assumptions. In general, since all your variables are individual specific and do not vary across alternatives (types), the model cannot be identified. All of your individual specific characteristics will drop out unless we treat them as alternative specific. By the sounds of it, each professional program carries meaning in and of itself. In that case, we could estimate the MNL model using constants only, where the constant captures everything about the program that makes an individual choose it.
library(mlogit)
df <- data.frame(ID = seq(1, 10),
type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
Now, just to be on the safe side, I create dummy variables for each of the programs. type_1 refers to program 1, type_2 to program 2 etc.
qqch <- data.frame(ID = rep(df$ID, each = 4),
choice = rep(1:4, 10))
# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type as 1 if it matches the alternative in choice, 0 otherwise
for (i in 1:length(df2$ID)){
df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# Add alternative specific variables (here only constants)
df2$type_1 <- ifelse(df2$choice == 1, 1, 0)
df2$type_2 <- ifelse(df2$choice == 2, 1, 0)
df2$type_3 <- ifelse(df2$choice == 3, 1, 0)
df2$type_4 <- ifelse(df2$choice == 4, 1, 0)
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
Now we can run the model. I include the dummies for each of the alternatives keeping alternative 4 as my reference level. Only J-1 constants are identified, where J is the number of alternatives. In the second half of the formula (after the pipe operator), I make sure that I remove all alternative specific constants that the model would have created and I add your individual specific variables, treating them as alternative specific. Note that this only makes sense if your alternatives (programs) carry meaning and are not generic.
model <- mlogit(type ~ type_1 + type_2 + type_3 | -1 + age + dum1 + dum2,
reflevel = 4, data = df3)
summary(model)
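The long-format expansion above can also be built without the row-by-row loop. This is a minimal base-R sketch, assuming the same toy df as in the question; the names alts, long and chosen are illustrative and not part of mlogit:

```r
# Toy data from the question
df <- data.frame(ID   = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age  = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

# All alternatives that appear in the data (assumption: every alternative is chosen at least once)
alts <- sort(unique(df$type))

# One row per ID x alternative, with the individual-specific columns attached
long <- merge(expand.grid(ID = df$ID, choice = alts), df, by = "ID")
long <- long[order(long$ID, long$choice), ]
rownames(long) <- NULL

# Vectorised replacement for the loop: TRUE on the row of the chosen alternative
long$chosen <- long$type == long$choice

# Alternative-specific constants type_1, ..., type_4
for (a in alts) {
  long[[paste0("type_", a)]] <- as.integer(long$choice == a)
}

head(long)
```

The resulting data frame has one row per ID and alternative, so it could then be passed to mlogit.data(long, choice = "chosen", shape = "long", alt.var = "choice") in the same way as df2 above.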

Related

ifelse over each element of a vector

Looking at this post, I thought ifelse is vectorized in the sense that f(c(x1, x2, x3)) = c(f(x1), f(x2), f(x3)).
So I thought the code for z1 (provided below) would do the following for each element of the vector y:
Test whether it is unity or not.
If YES, draw a random number from {1, 3, 5, 7, 9}.
If NO, draw a random number from {0, 2, 4, 6, 8}.
But, unfortunately it doesn't do that. It generates once for each case, and returns that very random number always.
What exactly am I doing wrong? Or is this actually the expected behaviour of ifelse?
Just to note, if I use this as a wrapper function inside sapply, I get the expected output z2 (in the sense that it is not deterministic as z1 where observing one occurrence of each case is enough), as you can see below.
y <- rbinom(n = 20,
size = 1,
prob = 0.5)
z1 <- ifelse(test = (y == 1),
yes = sample(x = c(1, 3, 5, 7, 9),
size = 1),
no = sample(x = c(0, 2, 4, 6, 8),
size = 1))
z2 <- sapply(X = y,
FUN = function(w)
{
ifelse(test = (w == 1),
yes = sample(x = c(1, 3, 5, 7, 9),
size = 1),
no = sample(x = c(0, 2, 4, 6, 8),
size = 1))
})
data.frame(y, z1, z2)
#> y z1 z2
#> 1 0 2 2
#> 2 1 1 3
#> 3 1 1 9
#> 4 1 1 7
#> 5 0 2 0
#> 6 0 2 2
#> 7 1 1 7
#> 8 1 1 7
#> 9 0 2 0
#> 10 1 1 5
#> 11 0 2 0
#> 12 0 2 0
#> 13 0 2 6
#> 14 0 2 0
#> 15 0 2 2
#> 16 1 1 7
#> 17 1 1 7
#> 18 0 2 2
#> 19 0 2 2
#> 20 0 2 0
unique(x = z1[y == 1])
#> [1] 1
unique(x = z1[y == 0])
#> [1] 2
Created on 2019-03-13 by the reprex package (v0.2.1)
Any help will be appreciated.
ifelse isn't a function of one vector, it is a function of 3 vectors of the same length. The first vector, called test, is a boolean, the second vector yes and third vector no give the elements in the result, chosen item-by-item based on the test value.
A sample of size = 1 is a different size than test (unless the length of test is 1), so it will be recycled by ifelse (see note below). Instead, draw samples of the same size as test from the start:
ifelse(
test = (y == 1),
yes = sample(x = c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
no = sample(x = c(0, 2, 4, 6, 8), size = length(y), replace = TRUE)
)
The vectors don't actually have to be of the same length. The help page ?ifelse explains: "If yes or no are too short, their elements are recycled." This is the behavior you observed with "It generates once for each case, and returns that very random number always.".
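The recycling behaviour and the fix can be checked with a small sketch. The set.seed call and the names recycled, odd, even and z are only there for illustration:

```r
set.seed(1)  # only so the sketch is reproducible
y <- rbinom(n = 20, size = 1, prob = 0.5)

# Length-1 yes/no values are recycled to the length of test
recycled <- ifelse(c(TRUE, FALSE, TRUE), 10, 20)
recycled

# Draw one candidate per element up front, then let ifelse pick item-by-item
odd  <- sample(c(1, 3, 5, 7, 9), size = length(y), replace = TRUE)
even <- sample(c(0, 2, 4, 6, 8), size = length(y), replace = TRUE)
z <- ifelse(y == 1, odd, even)

data.frame(y, z)
```

Every element of z is odd where y is 1 and even where y is 0, and the values now vary across elements instead of repeating a single draw.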

Formula to substitute dataframe column names with categories defined in a second dataframe

Let's say I have data in wide format (samples in row and species in columns).
species <- data.frame(
Sample = 1:10,
Lobvar = c(21, 15, 12, 11, 32, 42, 54, 10, 1, 2),
Limtru = c(2, 5, 1, 0, 2, 22, 3, 0, 1, 2),
Pocele = c(3, 52, 11, 30, 22, 22, 23, 10, 21, 32),
Genmes = c(1, 0, 22, 1, 2,32, 2, 0, 1, 2)
)
And I want to automatically change the species names, based on a reference of functional groups that I have for all of the species (so it works even if I have more references than actual species in the dataset), for example:
reference <- data.frame(
Species_name = c("Lobvar", "Ampmis", "Pocele", "Genmes", "Limtru", "Secgio", "Nasval", "Letgos", "Salnes", "Verbes"),
Functional_group = c("Crustose", "Geniculate", "Erect", "CCA", "CCA", "CCA", "Geniculate", "Turf","Turf", "Crustose"),
stringsAsFactors = FALSE
)
EDIT
Thanks to @Dan Y's suggestions, I can now change the species names to their functional group names:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
However, in my actual data.frame I have more species, and this creates many functional groups with the same name in different columns. I now would like to sum the columns that have the same names. I updated the example to give a results in which there is more than one functional group with the same name.
So I get this:
Sample Crustose CCA Erect CCA Crustose
1 21 2 3 1 2
2 15 5 52 0 3
3 12 1 11 22 4
4 11 0 30 1 1
5 32 2 22 2 0
6 42 22 22 32 0
and the final result I am looking for is this:
Sample Crustose CCA Erect
1 23 3 3
2 18 5 52
3 16 22 11
4 12 1 30
5 32 4 22
6 42 54 22
How do you advise on approaching this? Thanks for your help and the amazing suggestions I already received.
Re Q1) We can use match to do the name lookup:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
Re Q2) Then we can mapply the rowSums function after some regular expression work on the colnames:
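# strip any ".1", ".2" suffixes that R may have appended to duplicated names
namevec <- gsub("\\.[[:digit:]]", "", names(species))
# sum the columns that share a name
mapply(function(x) rowSums(species[which(namevec == x)]), unique(namevec))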
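Both steps can also be combined without renaming the columns first. This is a minimal sketch using the species and reference data from the question; groups, summed and result are illustrative names:

```r
species <- data.frame(
  Sample = 1:10,
  Lobvar = c(21, 15, 12, 11, 32, 42, 54, 10, 1, 2),
  Limtru = c(2, 5, 1, 0, 2, 22, 3, 0, 1, 2),
  Pocele = c(3, 52, 11, 30, 22, 22, 23, 10, 21, 32),
  Genmes = c(1, 0, 22, 1, 2, 32, 2, 0, 1, 2)
)
reference <- data.frame(
  Species_name = c("Lobvar", "Ampmis", "Pocele", "Genmes", "Limtru",
                   "Secgio", "Nasval", "Letgos", "Salnes", "Verbes"),
  Functional_group = c("Crustose", "Geniculate", "Erect", "CCA", "CCA",
                       "CCA", "Geniculate", "Turf", "Turf", "Crustose"),
  stringsAsFactors = FALSE
)

# Functional group of each species column (everything except Sample)
groups <- reference$Functional_group[match(names(species)[-1], reference$Species_name)]

# Sum the columns belonging to each functional group
summed <- sapply(unique(groups), function(g) rowSums(species[-1][groups == g]))

result <- data.frame(Sample = species$Sample, summed)
head(result)
```

This assumes every species column has a match in reference; an unmatched column would get an NA group and would need to be handled separately.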

Chi square for filtered data from column

I tried googling this answer but I am at a loss.
So I have data like this:
PatientNum<- c(1, 2, 3, 4, 5,6,7)
Plastics <- c(1, 0, 1, 1, 0, 0,1)
Age <- c(19, 18, 35, 82, 45,46, 65)
BloodLoss<-c(5,4,5,10,5,15,9)
AgeGroup<-c("Teens","Teens","30s", "80s","40s","40s","60s")
dataset <- data.frame(PatientNum, Plastics,Age, BloodLoss,AgeGroup)
And I'm trying to recreate some stats from an earlier paper where they had less data. In it they would sometimes do a chi-square the way I would expect it, i.e.:
chisq.test(table(dataset$Plastics,dataset$AgeGroup))
But then in other tables they would do a chi-square comparing only the Teens from Plastics vs Non-plastics (1 and 0 in the Plastics column) against everyone else. This is easy enough for me to do in an online chi-square calculator where I fill in (teens+plastics)/(teens+nonPlastics) vs (non-teen+plastics)/(non-teen+nonPlastics)... but how do I do that in R?
Also, feel free to advise on if one of those statistical approaches should not be done.
You can make a binary variable for whether or not the age group is teens. I like to give binary or boolean variables names like isTeen to keep track. So using an ifelse call, I just give that new column a 1 if AgeGroup is "Teens", and a 0 otherwise.
PatientNum<- c(1, 2, 3, 4, 5,6,7)
Plastics <- c(1, 0, 1, 1, 0, 0,1)
Age <- c(19, 18, 35, 82, 45,46, 65)
BloodLoss<-c(5,4,5,10,5,15,9)
AgeGroup<-c("Teens","Teens","30s", "80s","40s","40s","60s")
dataset <- data.frame(PatientNum, Plastics,Age, BloodLoss,AgeGroup)
dataset$isTeen <- ifelse(dataset$AgeGroup == "Teens", 1, 0)
dataset
#> PatientNum Plastics Age BloodLoss AgeGroup isTeen
#> 1 1 1 19 5 Teens 1
#> 2 2 0 18 4 Teens 1
#> 3 3 1 35 5 30s 0
#> 4 4 1 82 10 80s 0
#> 5 5 0 45 5 40s 0
#> 6 6 0 46 15 40s 0
#> 7 7 1 65 9 60s 0
chisq.test(table(dataset$Plastics, dataset$isTeen))
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(dataset$Plastics, dataset$isTeen)
#> X-squared = 1.438e-32, df = 1, p-value = 1
Created on 2018-05-31 by the reprex package (v0.2.0).
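The same 2x2 table can be built directly from a logical comparison, without adding a column. With counts as small as in this toy data, chisq.test may warn that the approximation is inaccurate; Fisher's exact test is a commonly used alternative for 2x2 tables, though whether it suits the paper being reproduced is an assumption. A minimal sketch (tab and fisher_res are illustrative names):

```r
dataset <- data.frame(
  Plastics = c(1, 0, 1, 1, 0, 0, 1),
  AgeGroup = c("Teens", "Teens", "30s", "80s", "40s", "40s", "60s")
)

# Rows: Plastics (0/1); columns: whether the patient is a teen (FALSE/TRUE)
tab <- table(Plastics = dataset$Plastics, Teen = dataset$AgeGroup == "Teens")
tab

# Exact test on the same table, useful when expected cell counts are small
fisher_res <- fisher.test(tab)
fisher_res$p.value
```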

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(c(20, 25)); for ID 2 it would return NA because you can't calculate a variance on just one entry; for ID 3 it would be var(c(15, 32, 16)); and for ID 4 it would again return NA because it has no non-zero numbers at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len=nrow(d)
variances = numeric(len)
for (i in 1:len){
#get all nonzero values in ith row of data into a vector nonzerodat here
currentvar = var(nonzerodat)
variances[i]=currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
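To keep the IDs attached to the result, the same row-wise apply can be wrapped in a data frame. This is a sketch on the same toy data; the explicit length check is an addition that makes the NA cases visible in the code, and result is an illustrative name:

```r
Data <- data.frame(ID = 1:4,
                   Value1 = c(20, 2, 15, 0),
                   Value2 = c(25, 0, 32, 0),
                   Value3 = c(0, 0, 16, 0))

# Variance of the non-zero entries; NA when fewer than two remain
var_nonzero <- function(x) {
  x <- x[x != 0]
  if (length(x) < 2) NA_real_ else var(x)
}

# apply over rows of the value columns only (drop ID), then re-attach ID
result <- data.frame(ID = Data$ID,
                     variance = apply(Data[, -1], 1, var_nonzero))
result
```

Because the value columns are selected with Data[, -1], the same call works unchanged with 40 or more value columns.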
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
melt(., id.var = "id") %>%
group_by(id) %>%
summarise(variance = var(value[value!=0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA

How to visualize change in binary/categorical data over time?

>dput(data)
structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3), Dx = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1), Month = c(0,
6, 12, 18, 24, 0, 6, 12, 18, 24, 0, 6, 12, 18, 24), score = c(0,
0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)), .Names = c("ID",
"Dx", "Month", "score"), row.names = c(NA, -15L), class = "data.frame")
>data
ID Dx Month score
1 1 1 0 0
2 1 1 6 0
3 1 1 12 0
4 1 1 18 1
5 1 1 24 1
6 2 1 0 1
7 2 1 6 1
8 2 2 12 1
9 2 2 18 0
10 2 2 24 1
11 3 1 0 0
12 3 1 6 0
13 3 1 12 0
14 3 1 18 0
15 3 1 24 0
Suppose I have the above data.frame. I have 3 patients (ID = 1, 2 or 3). Dx is the diagnosis (Dx = 1 is normal, Dx = 2 is diseased). There is a month variable. And last but not least, there is a test score variable. The participants' test score is binary, and it can change from 0 to 1 or revert back from 1 to 0. I am having trouble coming up with a way to visualize this data. I would like an informative graph that looks at:
The trend of the participants' test scores over time.
How that trend compares to the participants' diagnosis over time
In my real dataset I have over 800 participants, so I do not want to construct 800 separate graphs ... I think the test score variable being binary really has me stumped. Any help would be appreciated.
With ggplot2 you can make faceted plots with subplots for each patient (see my solution for dealing with the large number of plots below). An example visualization:
library(ggplot2)
ggplot(data, aes(x=Month, y=score, color=factor(Dx))) +
geom_point(size=5) +
scale_x_continuous(breaks=c(0,6,12,18,24)) +
scale_color_discrete("Diagnosis",labels=c("normal","diseased")) +
facet_grid(.~ID) +
theme_bw()
which gives one panel per patient, with the points colored by diagnosis.
Including 800 patients in one plot might be a bit too much as already mentioned in the comments of the question. There are several solutions to this problem:
Aggregate the data.
Create patient subgroups and make a plot for each subgroup.
Filter out all the patients who have never been ill.
With regard to the last suggestion, you can do that with the following code (which I adapted from an answer to one of my own questions):
deleteable <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
data2 <- data[deleteable==0,]
You can use the same approach to create a new variable flagging patients who have never been ill:
data$neverill <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
Then you can for example aggregate the data with the several grouping variables (e.g. Month, neverill).
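A minimal sketch of such an aggregation in base R, using the data from the question (rebuilt here so the block is self-contained) and the mean score as the summary; agg is an illustrative name:

```r
data <- data.frame(
  ID    = rep(1:3, each = 5),
  Dx    = c(1, 1, 1, 1, 1,  1, 1, 2, 2, 2,  1, 1, 1, 1, 1),
  Month = rep(c(0, 6, 12, 18, 24), times = 3),
  score = c(0, 0, 0, 1, 1,  1, 1, 1, 0, 1,  0, 0, 0, 0, 0)
)

# 1 for patients whose diagnosis is always normal, 0 otherwise
data$neverill <- with(data, ave(Dx, ID, FUN = function(x) all(x == 1)))

# Proportion of positive scores per month, split by neverill
agg <- aggregate(score ~ Month + neverill, data = data, FUN = mean)
agg
```

With 800 patients this reduces the plot to one line per group, for example by mapping Month to x, score to y and factor(neverill) to color.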
Note: A lot of the following data manipulation needs to be done for part 2. Part 1 is less complex, and you can see it fit in below.
Uses
library(data.table)
library(ggplot2)
library(reshape2)
To Compare
First, recode Dx from 1/2 to 0/1 (assuming that a 0 in score corresponds to a 1 in the original Dx):
data$Dx <- data$Dx - 1
Now, create a matrix that returns a 1 for a 1 diagnosis with a 0 test, and a -1 for a 1 test with a 0 diagnosis.
compare <- matrix(c(0,1,-1,0),ncol = 2,dimnames = list(c(0,1),c(0,1)))
> compare
0 1
0 0 -1
1 1 0
Now, let's score every event. This simply looks up the matrix above for every row of your data:
data$calc <- diag(compare[as.character(data$Dx),as.character(data$score)])
*Note: This can be sped up for large matrices using matching, but it is a quick fix for smaller sets like yours
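One way to do that lookup without building the full n-by-n intermediate matrix is to index compare with a two-column character matrix, which R matches against the dimnames. A sketch with the recoded Dx and score values from the question written out as plain vectors; calc_diag and calc_fast are illustrative names:

```r
compare <- matrix(c(0, 1, -1, 0), ncol = 2, dimnames = list(c(0, 1), c(0, 1)))

# Dx (already recoded to 0/1) and score for the 15 rows of the question's data
Dx    <- c(0, 0, 0, 0, 0,  0, 0, 1, 1, 1,  0, 0, 0, 0, 0)
score <- c(0, 0, 0, 1, 1,  1, 1, 1, 0, 1,  0, 0, 0, 0, 0)

# Approach from the answer: n x n lookup, then take the diagonal
calc_diag <- diag(compare[as.character(Dx), as.character(score)])

# Matrix indexing: one (row name, column name) pair per observation
calc_fast <- compare[cbind(as.character(Dx), as.character(score))]

all.equal(unname(calc_diag), unname(calc_fast))
```

The second form only ever touches n cells, so its memory use grows linearly with the number of rows rather than quadratically.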
To allow us to use data.table aggregation:
data <- data.table(data)
Now we need to create our variables:
tograph <- melt(data[, list(ScoreTrend = sum(score)/.N,
Type = sum(calc)/length(calc[calc != 0]),
Measure = sum(abs(calc))),
by = Month],
id.vars = c("Month"))
ScoreTrend: The proportion of positive scores in each month. Shows the trend of scores over time.
Type: The balance of -1 vs 1 events over time. If this returns -1, all incorrect events were score = 1, Dx = 0. If it returns 1, all incorrect events were Dx = 1, score = 0. A zero means a balance between the two.
Measure: The raw number of incorrect events.
We melt this data frame along month so that we can create a facet graph.
If there are no incorrect events, we will get a NaN for Type. To set this to 0:
tograph[is.nan(value), value := 0]
Finally, we can plot
ggplot(tograph, aes(x = Month, y = value)) + geom_line() + facet_wrap(~variable, ncol = 1)
We can now see, in one plot:
The proportion of positive scores by month
The proportion of under vs. over diagnosis
The number of incorrect diagnoses.
