model.matrix() with na.action=NULL? - r

I have a formula and a data frame, and I want to extract the model.matrix(). However, I need the resulting matrix to include the NAs that were found in the original dataset. If I were to use model.frame() to do this, I would simply pass it na.action=NULL. However, the output I need is of the model.matrix() format. Specifically, I need only the right-hand side variables, I need the output to be a matrix (not a data frame), and I need factors to be converted to a series of dummy variables.
I'm sure I could hack something together using loops or something, but I was wondering if anyone could suggest a cleaner and more efficient workaround. Thanks a lot for your time!
And here's an example:
dat <- data.frame(matrix(rnorm(20),5,4), gl(5,2))
dat[3,5] <- NA
names(dat) <- c(letters[1:4], 'fact')
ff <- a ~ b + fact
# This omits the row with a missing observation on the factor
model.matrix(ff, dat)
# This keeps the NA, but it gives me a data frame and does not dichotomize the factor
model.frame(ff, dat, na.action=NULL)
Here is what I would like to obtain:
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.7266086 0 0 0 0
2 1 -0.6088697 0 0 0 0
3 NA 0.4643360 NA NA NA NA
4 1 -1.1666248 1 0 0 0
5 1 -0.7577394 0 1 0 0
6 1 0.7266086 0 1 0 0
7 1 -0.6088697 0 0 1 0
8 1 0.4643360 0 0 1 0
9 1 -1.1666248 0 0 0 1
10 1 -0.7577394 0 0 0 1

Joris's suggestion works, but a quicker and cleaner way to do this is via the global na.action setting. The 'Pass' option achieves our goal of preserving NA's from the original dataset.
Option 1: Pass
Resulting matrix will contain NA's in rows corresponding to the original dataset.
options(na.action='na.pass')
model.matrix(ff, dat)
Option 2: Omit
Resulting matrix will skip rows containing NA's.
options(na.action='na.omit')
model.matrix(ff, dat)
Option 3: Fail
An error will occur if the original data contains NA's.
options(na.action='na.fail')
model.matrix(ff, dat)
Of course, always be careful when changing global options because they can alter behavior of other parts of your code. A cautious person might store the original setting with something like current.na.action <- options('na.action'), and then change it back after making the model.matrix.
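The save-and-restore idea above can be sketched as a small wrapper. This is only an illustration: the helper name model_matrix_keep_na and the toy data frame are made up for the example, and on.exit() is used so the global option is restored even if model.matrix() fails.

```r
# Hypothetical helper: temporarily switch the global na.action to "na.pass",
# build the model matrix, then restore whatever setting was there before.
model_matrix_keep_na <- function(formula, data) {
  old <- options(na.action = "na.pass")  # options() returns the previous value
  on.exit(options(old), add = TRUE)      # restore on exit, even after an error
  model.matrix(formula, data)
}

# Toy data (made up): one missing value in the factor
toy <- data.frame(b = c(0.5, -1.2, 0.3, 2.1),
                  fact = factor(c("x", "y", NA, "y")))

before <- getOption("na.action")
mm <- model_matrix_keep_na(~ b + fact, toy)
after <- getOption("na.action")

nrow(mm)   # all four rows are kept; the dummy column is NA in row 3
```

Because the option is reset inside the function, code that runs afterwards still sees the original na.action.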

Another way is to use the model.frame function with argument na.action=na.pass as your second argument to model.matrix:
> model.matrix(ff, model.frame(~ ., dat, na.action=na.pass))
(Intercept) b fact2 fact3 fact4 fact5
1 1 -1.3560754 0 0 0 0
2 1 2.5476965 0 0 0 0
3 1 0.4635628 NA NA NA NA
4 1 -0.2871379 1 0 0 0
5 1 2.2684958 0 1 0 0
6 1 -1.3560754 0 1 0 0
7 1 2.5476965 0 0 1 0
8 1 0.4635628 0 0 1 0
9 1 -0.2871379 0 0 0 1
10 1 2.2684958 0 0 0 1
model.frame allows you to set the appropriate action for na.action which is maintained when model.matrix is called.

I half-stumbled across a simpler solution after looking at mattdevlin and Nathan Gould's answers:
model.matrix.lm(ff, dat, na.action = "na.pass")
model.matrix.default may not support the na.action argument, but model.matrix.lm does!
(I found model.matrix.lm from RStudio's auto-complete suggestions; it appears to be the only non-default method for model.matrix if you haven't loaded any libraries that add others. Then I just guessed it might support the na.action argument.)

You can mess around a little with the model.matrix object, based on the rownames :
MM <- model.matrix(ff,dat)
MM <- MM[match(rownames(dat),rownames(MM)),]
MM[,"b"] <- dat$b
rownames(MM) <- rownames(dat)
which gives :
> MM
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.9583010 0 0 0 0
2 1 0.3266986 0 0 0 0
3 NA 1.4992358 NA NA NA NA
4 1 1.2867461 1 0 0 0
5 1 0.5024700 0 1 0 0
6 1 0.9583010 0 1 0 0
7 1 0.3266986 0 0 1 0
8 1 1.4992358 0 0 1 0
9 1 1.2867461 0 0 0 1
10 1 0.5024700 0 0 0 1
Alternatively, you can use contrasts() to do the work for you. Constructing the matrix by hand would be :
cont <- contrasts(dat$fact)[as.numeric(dat$fact),]
colnames(cont) <- paste("fact",colnames(cont),sep="")
out <- cbind(1,dat$b,cont)
out[is.na(dat$fact),1] <- NA
colnames(out)[1:2]<- c("Intercept","b")
rownames(out) <- rownames(dat)
which gives :
> out
Intercept b fact2 fact3 fact4 fact5
1 1 0.2534288 0 0 0 0
2 1 0.2697760 0 0 0 0
3 NA -0.8236879 NA NA NA NA
4 1 -0.6053445 1 0 0 0
5 1 0.4608907 0 1 0 0
6 1 0.2534288 0 1 0 0
7 1 0.2697760 0 0 1 0
8 1 -0.8236879 0 0 1 0
9 1 -0.6053445 0 0 0 1
10 1 0.4608907 0 0 0 1
In any case, both methods can be incorporated in a function that can deal with more complex formulae. I leave the exercise to the reader (how I loathe that sentence when I meet it in a paper ;-) )

Related

mlogit gives error: the two indexes don't define unique observations

My dataframe named longData looks like:
ID Set Choice Apple Microsoft IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia
1 1 1 0 1 0 0 0 0 0 0 0 0 0
2 1 2 0 0 1 0 0 0 0 0 0 0 0
3 1 3 0 0 0 1 0 0 0 0 0 0 0
4 1 4 1 0 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 0 0 0 0 0 0 0 1
6 1 6 0 -1 0 0 0 0 0 0 0 0 0
I am trying to run mlogit on it by:
logitModel = mlogit(Choice ~ Apple+Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, shape = "long")
it gives the following error:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
After looking for some time, I found that this error is raised by dfidx in the following check:
z <- data[, c(posid1[1], posid2[1])]
if (nrow(z) != nrow(unique(z)))
stop("the two indexes don't define unique observations")
but upon calling the following code, it runs without the error and gives the names of two idx that are uniquely able to identify a row in dataframe:
dfidx(longData)$idx
this gives expected output as:
~~~ indexes ~~~~
ID Set
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
indexes: 1, 2
So what am I doing wrong? I looked at some related questions but couldn't find what I am missing.
It looks like your example comes from here: https://docs.displayr.com/wiki/MaxDiff_Analysis_Case_Study_Using_R
The code seems outdated, I remember it worked for me, but not anymore.
The error message is valid because every pair (ID, Set) appears several times, once for each alternative.
However this works:
# there will be complaint that choice can't be coerced to logical otherwise
longData$Choice <- as.logical(longData$Choice)
# create alternative number (nAltsPerSet is 5 in this example)
longData$Alternative <- 1+( 0:(nrow(longData)-1) %% nAltsPerSet)
# define dataset
mdata <- mlogit.data(data=longData,shape="long", choice="Choice",alt.var="Alternative",id.var="ID")
# model
logitModel = mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0,
data = mdata
)
summary(logitModel)
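The uniqueness problem described in this answer can be checked in base R before calling mlogit. The sketch below uses a made-up toy data frame (one respondent, two sets, three alternatives per set) and derives the alternative number within each (ID, Set) pair with ave() instead of the modulo trick; it assumes the rows of each set are stored together.

```r
# Toy long-format data (made up): 1 respondent, 2 choice sets, 3 alternatives each
longData <- data.frame(ID = rep(1, 6),
                       Set = rep(1:2, each = 3),
                       Choice = c(0, 1, 0, 0, 0, 1))

# (ID, Set) alone does not identify a row: each pair repeats once per alternative
pairs_duplicated <- anyDuplicated(longData[, c("ID", "Set")]) > 0

# Number the alternatives within each (ID, Set) group
longData$Alternative <- ave(seq_len(nrow(longData)),
                            longData$ID, longData$Set,
                            FUN = seq_along)

# With the alternative index added, every row is uniquely identified
triples_unique <- anyDuplicated(longData[, c("ID", "Set", "Alternative")]) == 0
```

If triples_unique is FALSE on real data, some set probably has repeated or missing rows and is worth inspecting before fitting.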

How can I create a new binary variable that categorizes people based on if they EVER had a certain response in the dataset?

I'm examining drug use over 5 years in about 100 people. I want to create a binary variable that indicates whether people could ever be considered drug users (0=never user, 1=user).
Below, 1 indicates drug use, 0 indicates none, and NA indicates missing data at that time. Here are some example cases:
0 0 0 1 1
0 1 0 1 1
NA 0 1 0 NA
NA 0 0 0 1
0 0 NA NA 0
Almost all of my cases have missing data for at least one time point.
I'm new to R so I'm really struggling to figure out how to create this new binary variable. Basically the code needs to scan all 5 time points to see if a "1" ever appears, and it needs to be able to handle NAs.
Any advice would be great!
We can use rowSums
df1$new <- +(rowSums(df1 == 1, na.rm = TRUE) > 0)
Assuming the individuals show up on the rows, and the years as columns:
d <- read.table(text="
0 0 0 1 1
0 1 0 1 1
NA 0 1 0 NA
NA 0 0 0 1
0 0 NA NA 0", header=F
)
d$true <- apply(d,1, function(x)any(x==1, na.rm = T))*1
d
V1 V2 V3 V4 V5 true
1 0 0 0 1 1 1
2 0 1 0 1 1 1
3 NA 0 1 0 NA 1
4 NA 0 0 0 1 1
5 0 0 NA NA 0 0
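Both answers treat a person with only 0s and NAs as a never-user. Whether that is appropriate is a judgment call; the sketch below shows the rowSums version on the example cases plus one made-up row with no data at all, and an optional variant that returns NA when nothing was observed. The names ever and ever_strict are only illustrative.

```r
# The five example cases, plus a hypothetical sixth person with no data at all
d <- data.frame(V1 = c(0, 0, NA, NA, 0, NA),
                V2 = c(0, 1, 0, 0, 0, NA),
                V3 = c(0, 0, 1, 0, NA, NA),
                V4 = c(1, 1, 0, 0, NA, NA),
                V5 = c(1, 1, NA, 1, 0, NA))

# 1 if a 1 ever appears in the row, 0 otherwise (NAs are ignored)
ever <- as.integer(rowSums(d == 1, na.rm = TRUE) > 0)

# Optional variant: report NA instead of 0 when every time point is missing
ever_strict <- ever
ever_strict[rowSums(!is.na(d)) == 0] <- NA
```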

values changes (avoid 0 1 to 1 2)

I want to convert a factor to numeric so that I can take its mean. as.numeric() changes the values (0 and 1 become 1 and 2), and numeric() doesn't work.
mtcars$vec <- factor(c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
num.cols <- c("vec" )
mtcars[num.cols] <- lapply(mtcars[num.cols], as.numeric)
str(mtcars)
mtcars$vec
expected results should be numeric and consist of only 0 and 1
mtcars$vec
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
many thanks in advance
We need to convert to character and then to numeric. If we apply as.numeric directly, the factor is coerced to its integer storage codes, which start from 1, instead of the actual values. The confusion arises here because the values are binary (0/1), so the codes come out as 1/2.
mtcars[num.cols] <- lapply(mtcars[num.cols],
function(x) as.numeric(as.character(x)))
mtcars$vec
#[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Or a faster option is also
mtcars[num.cols] <- lapply(mtcars[num.cols], function(x) as.numeric(levels(x)[x]))
If it is a single column, we can do this more easily
mtcars[[num.cols]] <- as.numeric(levels(mtcars[[num.cols]])[mtcars[[num.cols]]])
As an example
v1 <- factor(c(15, 15, 3, 3))
as.numeric(v1)
#[1] 2 2 1 1
as.numeric(as.character(v1))
#[1] 15 15 3 3
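Since the goal in the question is a mean, here is a minimal sketch of how the two conversions differ on a small made-up 0/1 factor. For a binary factor, comparing against the level label is another way to get the share of 1s without converting at all.

```r
# Small made-up binary factor
f <- factor(c(0, 0, 1, 1))

# Mean of the internal level codes (1, 1, 2, 2), not what we want
mean_codes <- mean(as.numeric(f))

# Mean of the original values (0, 0, 1, 1)
mean_values <- mean(as.numeric(as.character(f)))

# Proportion of observations whose level is "1"
share_ones <- mean(f == "1")
```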

Exporting data of unequal length to CSV

I'm preprocessing some data from sensor-created files into the format required for external analysis (ultimately, it needs to be output as a CSV). The end goal is something like this:
1 C3 C4 Cz Pz AllSites 2 C3 C4 Cz Pz AllSites 3 C3 C4 Cz Pz AllSites
50:23.9 0 0 0 0 0 53:15.0 0 0 0 0 0 09:15.0 0 0 0 0 0
50:24.9 1 0 0 1 0 53:16.0 1 0 0 1 0 09:16.1 0 0 1 0 0
50:26.0 1 0 0 0 0 53:17.1 1 0 0 1 0 09:17.1 0 0 1 0 0
50:27.0 1 0 0 1 0 53:18.1 1 1 1 0 0 09:18.1 0 0 1 1 0
50:28.0 0 1 0 0 0 53:19.2 1 0 0 0 0 09:19.2 0 0 1 0 0
50:29.1 1 1 1 1 1 53:20.2 1 0 0 1 0 09:20.2 0 0 1 0 0
50:30.2 0 1 1 0 0 53:21.2 1 0 0 0 0 09:21.2 0 0 0 1 0
50:31.2 0 0 0 0 0 53:22.3 0 0 0 0 0 09:22.3 0 0 0 1 0
Each set of columns is data from one session. The only catch is that sessions are of unequal length (and thus each group has a different number of observations), so at the moment, it's all in a list instead of a data frame. I have found a few different ways of exporting to CSV (e.g., this question), but they all involve converting to a data frame first. How do I export a list to CSV without converting it to a data frame first?
N.B.: I also found a bunch of questions about exporting a list of data frames to a series of CSV files, but for this application, all the data frames need to be in a single CSV.
Let's make some simple samples:
b1 = data.frame(C3=sample(c(0,1),8,TRUE),C4=sample(c(0,1),8,TRUE),Cz=sample(c(0,1),8,TRUE))
b2 = data.frame(C3=sample(c(0,1),3,TRUE),C4=sample(c(0,1),3,TRUE),Cz=sample(c(0,1),3,TRUE))
b3 = data.frame(C3=sample(c(0,1),8,TRUE),C4=sample(c(0,1),8,TRUE),Cz=sample(c(0,1),8,TRUE))
You can't just column-bind them and hope R pads out the smaller columns:
> cbind(b1,b2,b3)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 8, 3
So we need to paste them into a big enough data frame. Let's make one full of NAs to start:
b = data.frame(matrix(NA, ncol=ncol(b1)+ncol(b2)+ncol(b3), nrow=max(nrow(b1),nrow(b2),nrow(b3))))
dim(b)
[1] 8 9
Then this code puts each b data frame in the right place. Each one is a bit further along:
> b[1:nrow(b1),1:ncol(b1)]=b1
> b[1:nrow(b2),(1:ncol(b2))+ncol(b1)]=b2
> b[1:nrow(b3),(1:ncol(b3))+ncol(b1)+ncol(b2)]=b3
> b
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 1 1 1 1 0 0 0 0 1
2 1 1 0 0 0 0 0 1 0
3 0 0 1 0 1 1 0 1 1
4 1 1 1 NA NA NA 1 1 1
5 0 0 0 NA NA NA 0 0 0
6 0 1 0 NA NA NA 1 0 1
7 0 0 0 NA NA NA 1 1 1
8 0 1 0 NA NA NA 1 1 1
Easy enough to generalise in a loop over a list. Now:
> write.csv(b,na="")
"","X1","X2","X3","X4","X5","X6","X7","X8","X9"
"1",1,1,1,1,0,0,0,0,1
"2",1,1,0,0,0,0,0,1,0
"3",0,0,1,0,1,1,0,1,1
"4",1,1,1,,,,1,1,1
"5",0,0,0,,,,0,0,0
"6",0,1,0,,,,1,0,1
"7",0,0,0,,,,1,1,1
"8",0,1,0,,,,1,1,1
This gives us the empty cells where the shorter data frame ran out. You probably need to fiddle about to get the column headers back and repeated, but that's easy enough...
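One way the padding step could be generalised to a list of data frames is sketched below. The helper name pad_cbind and the toy data are made up; the sketch relies on the fact that indexing a data frame with row numbers past its end yields rows of NA.

```r
# Hypothetical helper: pad every data frame in a list to the same number of
# rows with NA, then bind them side by side.
pad_cbind <- function(dfs) {
  n <- max(vapply(dfs, nrow, integer(1)))   # rows in the longest session
  padded <- lapply(unname(dfs), function(d) {
    d <- d[seq_len(n), , drop = FALSE]      # out-of-range rows become NA rows
    rownames(d) <- NULL
    d
  })
  do.call(cbind, padded)
}

# Toy sessions of unequal length (made up)
s1 <- data.frame(C3 = c(1, 0, 1, 1, 0, 0, 1, 0), C4 = 0, Cz = 1)
s2 <- data.frame(C3 = c(0, 1, 1), C4 = 1, Cz = 0)

out <- pad_cbind(list(s1, s2))

# Write to a temporary file, leaving the padded cells empty
csv_path <- tempfile(fileext = ".csv")
write.csv(out, csv_path, na = "", row.names = FALSE)
```

The repeated column names (C3, C4, Cz per session) are kept as they are, which matches the layout asked for in the question.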
Not sure if this is what you need... but it's a shot...
a <- data.frame(small=letters)
b <- data.frame(big=LETTERS)
l <- list(a=a, b=b)
sapply(names(l), function(x)write.csv(l[[x]], file=paste0(x, ".csv")))
# or maybe all in the same file...
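# (write.table warns about appending column names, and needs sep="," to produce comma-separated output)
sapply(names(l), function(x)write.table(l[[x]], file="c.csv", sep=",", append=TRUE))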
A csv file is most often used to export data in tabular form. They map perfectly with the data.frame R objects. list objects are way more general and exhibit a lot of flexibility that a simple csv format cannot handle in many cases.
In your case you do have a list, but its components are data frames that (apparently) share the same structure (same number and names of columns). So it's pretty trivial to join them all into just one data frame. You only need an additional column that indicates the session. So, if mylist is your list, you can try:
mydf <- do.call(rbind, mylist)
# rows contributed by each session (length() of a data frame would count columns)
elLength <- vapply(mylist, nrow, integer(1))
mydf$Session <- rep(seq_along(mylist), times = elLength)
In this way you end up with a single data frame and you can extract the session through the Session column. You can use write.csv to export it to a CSV file.

else if loop running very slowly in R

I am performing a research project into the factors that make someone more likely to vote, with a focus on the distance people live from a polling place. I have the full voter registration and voter histories for millions of individuals. There are several ways in which someone can vote (in person, absentee, early, or provisional) or not vote (not registered, registered but didn't vote, or ineligible to vote). My data comes with a column (29) for how someone voted in a given election. NULL means not registered, V for in person, etc.
For regression analysis, I want to create a different column for each voter type (1 for yes, 0 for no, column numbers 68-74) and another 1/0 column (number 75) for whether or not someone voted at all. The code I wrote below should do the trick, but it's running impossibly slowly on my computer and hasn't even been able to get to the 1000th row after an hour. It works perfectly, except the speed. I've been approved to use my university's supercomputer*, but I want to figure out a faster algorithm. I have R and STATA both on my laptop and the supercomputer* and would be happy to use either.
dcv.new <- read.csv("VoterHist.csv", header=TRUE)
# I previously set columns 68-75 to default to 0
for(i in 1:nrow(dcv.new))
{
if(is.na(dcv.new[i,29]))
{
dcv.new[i,69] <- 1
}
else if(dcv.new[i,29]=="V")
{
dcv.new[i,68] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="A")
{
dcv.new[i,70] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="N")
{
dcv.new[i,71] <- 1
}
else if(dcv.new[i,29]=="E")
{
dcv.new[i,72] <- 1
}
else if(dcv.new[i,29]=="Y")
{
dcv.new[i,73] <- 1
}
else if(dcv.new[i,29]=="P")
{
dcv.new[i,74] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="X")
{
dcv.new[i,74] <- 1
dcv.new[i,75] <- 1
}
}
*Technically "High performance computing cluster", but let's be honest, supercomputer sounds way cooler.
R is vectorised, in the main, so look for vectorised operations in place of loops. In this case you can vectorise each operation so it works on the entire matrix rather than on individual rows.
Here are the first three of your if else statements:
dcv.new[is.na(dcv.new[,29]), 69] <- 1
# %in% gives FALSE rather than NA for missing values in column 29
dcv.new[dcv.new[,29] %in% "V", c(68,75)] <- 1
dcv.new[dcv.new[,29] %in% "A", c(70,75)] <- 1
....
You should get the idea.
Some explanation:
What we are doing is selecting rows from certain columns of dcv.new that meet criteria (such as being equal to "V") and then we assign the value 1 to each of those selected elements of dcv.new in a single operation. R recycles the 1 that we assigned such that it becomes the same length as that required to fill all the selected elements.
Note how we select more than one column at once for updating: dcv.new[x , c(68,75)] updates columns 68 and 75 for rows x only, where x is a logical vector indexing the rows we need to update. The logical vector is produced by statements like dcv.new[,29] %in% "V". These return TRUE if an element of dcv.new[,29] equals "V" and FALSE if not. A plain == comparison would return NA for the missing entries in column 29, and assignment into a data frame does not allow NA subscripts, which is why %in% is used here.
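The same vectorised recoding can be sketched on a small made-up data frame using column names rather than positions. The names how, in_person, not_registered and voted are hypothetical stand-ins for columns 29, 68, 69 and 75 in the question.

```r
# Toy voter history (made up); NA plays the role of "not registered"
votes <- data.frame(how = c("V", NA, "A", "N", "P"),
                    stringsAsFactors = FALSE)

# Each indicator is built for all rows at once; %in% never returns NA
votes$in_person      <- as.integer(votes$how %in% "V")
votes$not_registered <- as.integer(is.na(votes$how))
votes$voted          <- as.integer(votes$how %in% c("V", "A", "P", "X"))
```

Building the indicators this way also removes the need to pre-fill the columns with 0.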
However...!
In the case of regression, we can let R make the matrix of dummy variables for us, we don't need to do it by hand. Say the column dcv.new[, 29] was named voterType. If we coerce it to be a factor
dcv.new <- transform(dcv.new, voterType = factor(voterType))
when we fit a model using the formula notation we can do:
mod <- lm(response ~ voterType, data = dcv.new)
and R will create the appropriate contrasts to make voterType use the correct degrees of freedom. By default R uses the first level of a factor as the base level and hence model coefficients represent deviations from this reference level. To see what is the reference level for voterType after converting it to a factor do
with(dcv.new, levels(voterType)[1])
Note that most modelling functions that take a formula, like the one shown above, work as I described and show below. You aren't restricted to lm() models.
Here is a small example
set.seed(42)
dcv.new <- data.frame(response = rnorm(20),
voterType = sample(c("V","A","N","E","Y","P","X",NA), 20,
replace = TRUE),
stringsAsFactors = TRUE) # needed in R >= 4.0 so voterType is a factor
head(dcv.new)
> head(dcv.new)
response voterType
1 1.3709584 E
2 -0.5646982 E
3 0.3631284 V
4 0.6328626 <NA>
5 0.4042683 E
6 -0.1061245 <NA>
The model can then be fitted as
mod <- lm(response ~ voterType, data = dcv.new)
summary(mod)
giving in this case
> mod <- lm(response ~ voterType, data = dcv.new)
> summary(mod)
Call:
lm(formula = response ~ voterType, data = dcv.new)
Residuals:
Min 1Q Median 3Q Max
-2.8241 -0.4075 0.0000 0.5856 1.9030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.656 1.425 -1.864 0.0952 .
voterTypeE 2.612 1.593 1.639 0.1356
voterTypeN 3.040 1.646 1.847 0.0978 .
voterTypeP 2.742 1.646 1.666 0.1300
voterTypeV 2.771 1.745 1.588 0.1468
voterTypeX 2.378 2.015 1.180 0.2684
voterTypeY 3.285 1.745 1.882 0.0925 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.425 on 9 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.3154, Adjusted R-squared: -0.1411
F-statistic: 0.6909 on 6 and 9 DF, p-value: 0.6635
The magic all happens in the formula code: once R has located all the variables named in the formula, it ends up calling something like
model.matrix( ~ voterType, data = dcv.new)
which generates the covariate matrix needed for the underlying matrix algebra and QR decomposition. That code above, for the small example gives:
> model.matrix(~ voterType, data = dcv.new)
(Intercept) voterTypeE voterTypeN voterTypeP voterTypeV voterTypeX
1 1 1 0 0 0 0
2 1 1 0 0 0 0
3 1 0 0 0 1 0
5 1 1 0 0 0 0
8 1 0 0 1 0 0
10 1 0 0 0 0 0
11 1 0 1 0 0 0
12 1 0 1 0 0 0
13 1 1 0 0 0 0
14 1 0 0 0 0 1
15 1 0 0 0 1 0
16 1 0 0 1 0 0
17 1 0 0 1 0 0
18 1 0 0 0 0 0
19 1 0 1 0 0 0
20 1 0 0 0 0 0
voterTypeY
1 0
2 0
3 0
5 0
8 0
10 1
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 1
attr(,"assign")
[1] 0 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$voterType
[1] "contr.treatment"
Which is what you are wanting to do with your code. So if you really need it, you could use model.matrix() like I show to also generate the matrix - stripping off the attributes as you don't need them.
In this case the reference level is "A":
> with(dcv.new, levels(voterType)[1])
[1] "A"
which is represented by the (Intercept) column in the output from model.matrix. Note that these treatment contrasts code for deviations from the reference level. You can get dummy values by suppressing the intercept in the formula by adding -1 (or +0):
> model.matrix(~ voterType - 1, data = dcv.new)
voterTypeA voterTypeE voterTypeN voterTypeP voterTypeV voterTypeX voterTypeY
1 0 1 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0
5 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0
10 0 0 0 0 0 0 1
11 0 0 1 0 0 0 0
12 0 0 1 0 0 0 0
13 0 1 0 0 0 0 0
14 0 0 0 0 0 1 0
15 0 0 0 0 1 0 0
16 0 0 0 1 0 0 0
17 0 0 0 1 0 0 0
18 1 0 0 0 0 0 0
19 0 0 1 0 0 0 0
20 0 0 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$voterType
[1] "contr.treatment"
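Stripping the attributes, as suggested above, might look like the following sketch on a tiny made-up factor. Removing the voterType prefix from the column names is an optional extra, not something the answer requires.

```r
# Tiny made-up example with three voter types
d <- data.frame(voterType = factor(c("A", "E", "V", "A")))

# One 0/1 column per level (no intercept)
mm <- model.matrix(~ voterType - 1, data = d)

# Drop the bookkeeping attributes that only modelling functions need
attr(mm, "assign") <- NULL
attr(mm, "contrasts") <- NULL

# Optional: shorten the column names to the bare level labels
colnames(mm) <- sub("^voterType", "", colnames(mm))
```

The result is a plain numeric matrix that can be attached to the data with cbind() if separate indicator columns are really needed.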
You should vectorize your code and drop the long chain of ifs:
dcv.new[is.na(dcv.new[,29]),69] <- 1
# which() drops the NAs that == returns for missing values
dcv.new[which(dcv.new[,29] == "V"), c(68, 75)] <- 1
....
Continue as needed for the remaining codes.
