I am performing a research project into the factors that make someone more likely to vote, with a focus on the distance people live from a polling place. I have the full voter registration and voter histories for millions of individuals. There are several ways in which someone can vote (in person, absentee, early, or provisional) or not vote (not registered, registered but didn't vote, or ineligible to vote). My data comes with a column (29) recording how someone voted in a given election: NULL means not registered, V means in person, and so on.
For regression analysis, I want to create a separate 1/0 column for each voter type (columns 68-74) and another 1/0 column (75) for whether or not someone voted at all. The code I wrote below should do the trick, but it runs impossibly slowly on my computer; it hasn't even reached the 1000th row after an hour. It works perfectly, except for the speed. I've been approved to use my university's supercomputer*, but I'd rather figure out a faster algorithm first. I have both R and Stata on my laptop and on the supercomputer*, and I'd be happy to use either.
dcv.new <- read.csv("VoterHist.csv", header=TRUE)
# I previously set columns 68-75 to default to 0
for(i in 1:nrow(dcv.new))
{
if(is.na(dcv.new[i,29]))
{
dcv.new[i,69] <- 1
}
else if(dcv.new[i,29]=="V")
{
dcv.new[i,68] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="A")
{
dcv.new[i,70] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="N")
{
dcv.new[i,71] <- 1
}
else if(dcv.new[i,29]=="E")
{
dcv.new[i,72] <- 1
}
else if(dcv.new[i,29]=="Y")
{
dcv.new[i,73] <- 1
}
else if(dcv.new[i,29]=="P")
{
dcv.new[i,74] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="X")
{
dcv.new[i,74] <- 1
dcv.new[i,75] <- 1
}
}
*Technically "High performance computing cluster", but let's be honest, supercomputer sounds way cooler.
R is vectorised, in the main, so look for vectorised operations in place of loops. In this case you can vectorise each operation so it works on the entire matrix rather than on individual rows.
Here are vectorised versions of the first three of your if/else statements:
dcv.new[is.na(dcv.new[,29]), 69] <- 1
dcv.new[which(dcv.new[,29] == "V"), c(68, 75)] <- 1
dcv.new[which(dcv.new[,29] == "A"), c(70, 75)] <- 1
....
You should get the idea; the full set of assignments is sketched after the explanation below.
Some explanation:
What we are doing is selecting rows from certain columns of dcv.new that meet criteria (such as == "V") and then we assign the value 1 to each of those selected elements of dcv.new in a single operation. R recycles the 1 that we assigned such that it becomes the same length as that required to fill all the selected elements.
Note how we select more than one column at once for updating: dcv.new[x, c(68,75)] updates columns 68 and 75 for the rows picked out by x. The comparison dcv.new[,29]=="V" returns TRUE where an element of dcv.new[,29] equals "V", FALSE where it does not, and NA where the element is missing; because NAs are not allowed in subscripted assignments, we wrap the comparison in which() to convert it to row positions with the NAs dropped.
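For completeness, here is a sketch of the full replacement for your loop; the code-to-column mapping is copied straight from the question (P and X both set columns 74 and 75, so they can share one statement via %in%, which returns FALSE rather than NA for missing values):
dcv.new[is.na(dcv.new[,29]), 69] <- 1                   # not registered
dcv.new[which(dcv.new[,29] == "V"), c(68, 75)] <- 1     # in person
dcv.new[which(dcv.new[,29] == "A"), c(70, 75)] <- 1     # absentee
dcv.new[which(dcv.new[,29] == "N"), 71] <- 1
dcv.new[which(dcv.new[,29] == "E"), 72] <- 1
dcv.new[which(dcv.new[,29] == "Y"), 73] <- 1
dcv.new[dcv.new[,29] %in% c("P", "X"), c(74, 75)] <- 1  # provisional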
However...!
In the case of regression, we can let R build the matrix of dummy variables for us; we don't need to do it by hand. Say the column dcv.new[, 29] were named voterType. If we coerce it to be a factor
dcv.new <- transform(dcv.new, voterType = factor(voterType))
when we fit a model using the formula notation we can do:
mod <- lm(response ~ voterType, data = dcv.new)
and R will create the appropriate contrasts to make voterType use the correct degrees of freedom. By default R uses the first level of a factor as the base level and hence model coefficients represent deviations from this reference level. To see what is the reference level for voterType after converting it to a factor do
with(dcv.new, levels(voterType)[1])
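If you want a different reference level, say "V" for in-person voting, you can change it before fitting; a minimal sketch:
dcv.new <- transform(dcv.new, voterType = relevel(voterType, ref = "V"))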
Note that most modelling functions that take a formula, like the one shown above, work as I described and show below. You aren't restricted to lm() models.
Here is a small example
set.seed(42)
dcv.new <- data.frame(response = rnorm(20),
voterType = sample(c("V","A","N","E","Y","P","X",NA), 20,
replace = TRUE))
head(dcv.new)
> head(dcv.new)
response voterType
1 1.3709584 E
2 -0.5646982 E
3 0.3631284 V
4 0.6328626 <NA>
5 0.4042683 E
6 -0.1061245 <NA>
The model can then be fitted as
mod <- lm(response ~ voterType, data = dcv.new)
summary(mod)
giving in this case
> mod <- lm(response ~ voterType, data = dcv.new)
> summary(mod)
Call:
lm(formula = response ~ voterType, data = dcv.new)
Residuals:
Min 1Q Median 3Q Max
-2.8241 -0.4075 0.0000 0.5856 1.9030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.656 1.425 -1.864 0.0952 .
voterTypeE 2.612 1.593 1.639 0.1356
voterTypeN 3.040 1.646 1.847 0.0978 .
voterTypeP 2.742 1.646 1.666 0.1300
voterTypeV 2.771 1.745 1.588 0.1468
voterTypeX 2.378 2.015 1.180 0.2684
voterTypeY 3.285 1.745 1.882 0.0925 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.425 on 9 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.3154, Adjusted R-squared: -0.1411
F-statistic: 0.6909 on 6 and 9 DF, p-value: 0.6635
The magic all happens with the formula code, but essentially, once R has located all the variables named in the formula, it ends up calling something like
model.matrix( ~ voterType, data = dcv.new)
which generates the covariate matrix needed for the underlying matrix algebra and QR decomposition. For the small example, that code gives:
> model.matrix(~ voterType, data = dcv.new)
(Intercept) voterTypeE voterTypeN voterTypeP voterTypeV voterTypeX
1 1 1 0 0 0 0
2 1 1 0 0 0 0
3 1 0 0 0 1 0
5 1 1 0 0 0 0
8 1 0 0 1 0 0
10 1 0 0 0 0 0
11 1 0 1 0 0 0
12 1 0 1 0 0 0
13 1 1 0 0 0 0
14 1 0 0 0 0 1
15 1 0 0 0 1 0
16 1 0 0 1 0 0
17 1 0 0 1 0 0
18 1 0 0 0 0 0
19 1 0 1 0 0 0
20 1 0 0 0 0 0
voterTypeY
1 0
2 0
3 0
5 0
8 0
10 1
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 1
attr(,"assign")
[1] 0 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$voterType
[1] "contr.treatment"
This is what you are trying to do with your code. So if you really need the matrix itself, you can use model.matrix() as shown to generate it, stripping off the attributes as you don't need them.
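For example (the object name X is my own):
X <- model.matrix(~ voterType, data = dcv.new)
attr(X, "assign") <- NULL    # strip the attributes we don't need
attr(X, "contrasts") <- NULL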
In this case the reference level is "A":
> with(dcv.new, levels(voterType)[1])
[1] "A"
which is represented by the (Intercept) column in the output from model.matrix. Note that these treatment contrasts code for deviations from the reference level. You can get plain dummy values by suppressing the intercept in the formula, adding -1 (or +0):
> model.matrix(~ voterType - 1, data = dcv.new)
voterTypeA voterTypeE voterTypeN voterTypeP voterTypeV voterTypeX voterTypeY
1 0 1 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0
5 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0
10 0 0 0 0 0 0 1
11 0 0 1 0 0 0 0
12 0 0 1 0 0 0 0
13 0 1 0 0 0 0 0
14 0 0 0 0 0 1 0
15 0 0 0 0 1 0 0
16 0 0 0 1 0 0 0
17 0 0 0 1 0 0 0
18 1 0 0 0 0 0 0
19 0 0 1 0 0 0 0
20 0 0 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$voterType
[1] "contr.treatment"
You should vectorize your code and forget about so many ifs:
dcv.new[is.na(dcv.new[,29]), 69] <- 1
dcv.new[which(dcv.new[,29] == "V"), c(68, 75)] <- 1
....
Continue the same pattern for the remaining voter types (which() drops the NAs that a bare == comparison would leave in the logical index).
My dataframe named longData looks like:
ID Set Choice Apple Microsoft IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia
1 1 1 0 1 0 0 0 0 0 0 0 0 0
2 1 2 0 0 1 0 0 0 0 0 0 0 0
3 1 3 0 0 0 1 0 0 0 0 0 0 0
4 1 4 1 0 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 0 0 0 0 0 0 0 1
6 1 6 0 -1 0 0 0 0 0 0 0 0 0
I am trying to run mlogit on it by:
logitModel = mlogit(Choice ~ Apple+Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, shape = "long")
it gives the following error:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
After looking for some time, I found that this error is raised by dfidx, as seen here:
z <- data[, c(posid1[1], posid2[1])]
if (nrow(z) != nrow(unique(z)))
stop("the two indexes don't define unique observations")
but when I call the following code, it runs without the error and gives the names of two indexes that uniquely identify a row in the dataframe:
dfidx(longData)$idx
this gives expected output as:
~~~ indexes ~~~~
ID Set
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
indexes: 1, 2
So what am I doing wrong? I saw some related questions but couldn't find what I am missing.
It looks like your example comes from here: https://docs.displayr.com/wiki/MaxDiff_Analysis_Case_Study_Using_R
The code seems outdated; I remember it worked for me before, but it doesn't anymore.
The error message is valid because every pair (ID, Set) appears several times, once for each alternative.
However this works:
# otherwise there will be a complaint that Choice can't be coerced to logical
longData$Choice <- as.logical(longData$Choice)
# create alternative number (5 alternatives per set in this example)
nAltsPerSet <- 5
longData$Alternative <- 1 + (0:(nrow(longData) - 1) %% nAltsPerSet)
# define dataset
mdata <- mlogit.data(data=longData,shape="long", choice="Choice",alt.var="Alternative",id.var="ID")
# model
logitModel = mlogit(Choice ~ Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0,
data = mdata
)
summary(logitModel)
I am trying to run the R code from Network-Analysis on Attitudes: A Brief Tutorial. You can find it here.
First we loaded the cognitive attitudes.
library(foreign) # for read.dta
unzip('ANES2012.zip')
ANES2012 <- read.dta('anes_timeseries_2012_Stata12.dta') # loads the data into the object ANES2012
#########################
# Recode variables
# Items regarding Obama
ObamaCog <- data.frame(Mor = as.numeric(ANES2012$ctrait_dpcmoral), # the items tapping beliefs
                       Led = as.numeric(ANES2012$ctrait_dpclead),
                       Car = as.numeric(ANES2012$ctrait_dpccare),
                       Kno = as.numeric(ANES2012$ctrait_dpcknow),
                       Int = as.numeric(ANES2012$ctrait_dpcint),
                       Hns = as.numeric(ANES2012$ctrait_dpchonst))
ObamaCog[ObamaCog < 3] <- NA # values below 3 represent missing values
I had to change the code a little, as the binarize function didn't work (I couldn't load the "cmprsk" package that it needed). So I installed the biclust package and was able to binarize the data:
ObamaCog <- binarize(ObamaCog, threshold = 5)
Then we did the same for the affective attitudes:
ObamaAff <- data.frame(Ang = as.numeric(ANES2012$candaff_angdpc), # the items tapping feelings
                       Hop = as.numeric(ANES2012$candaff_hpdpc),
                       Afr = as.numeric(ANES2012$candaff_afrdpc),
                       Prd = as.numeric(ANES2012$candaff_prddpc))
ObamaAff[ObamaAff < 3] <- NA # values below 3 represent missing values
ObamaAff <- binarize(ObamaAff, 4) # endorsing (not endorsing) the feelings is encoded as 1 (0)
And created one Obama-matrix out of it:
Obama <- data.frame(ObamaCog,ObamaAff)
Then we are omitting the NA values:
Obama <- na.omit(Obama)
And I checked:
write.csv(Obama, file = "Obama-Excel1")
There are no more NA values in my matrix.
And I think it fits the required structure: nobs x nvars
Mor Led Car Kno Int Hns Ang Hop Afr Prd
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
...
60 0 0 0 0 0 0 0 0 0 0
61 1 1 1 1 1 1 0 0 0 0
62 0 0 0 0 0 0 0 0 0 0
63 0 0 0 0 0 0 0 0 0 0
65 0 1 1 0 0 1 0 0 0 0
66 1 1 1 1 1 1 0 0 0 0
67 0 0 0 0 0 0 0 0 0 0
down to row 5914. Rows that previously contained an NA value (for example, row 64) are now missing.
When I then try to run the IsingFit function:
ObamaFit <- IsingFit(Obama)
it doesn't work; I get the error message:
Error in y %*% rep(1, nc) : non-conformable arguments
I am a beginner in R, and I assumed that "non-conformable arguments" referred to NA values, but this doesn't seem to be the case. Can anyone tell me what the error message means and how I might solve the problem, so I can use the IsingFit function?
I have a dataframe that looks like this:
Date A B MONTH
2016-01-01 3 10 January
2016-01-02 5 13 January
2016-01-03 8 12 January
.
.
.
2016-12-29 4 13 December
2016-12-30 5 12 December
2016-12-31 6 4 December
With this dataframe, I want to run a regression model with the Month column as dummy variables.
I have tried two methods to run this, and each time the month "April" is excluded.
Any idea why this may be happening?
1st method:
lm(A ~ MONTH + B, data = df)
Example output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.248e+01 3.600e+01 0.902 0.36754
MONTHAugust 7.425e+02 3.630e+01 6.680 9.29e-11 ***
MONTHDecember -1.840e+02 3.277e+01 -5.613 4.02e-08 ***
MONTHFebruary -8.673e+00 2.855e+01 -0.129 0.89770
MONTHJanuary -4.084e+01 2.945e+01 -0.368 0.71291
MONTHJuly 9.407e+02 3.100e+01 4.540 7.73e-06 ***
MONTHJune 3.387e+01 3.077e+01 2.401 0.01687 *
MONTHMarch 2.797e+02 2.884e+01 6.231 1.32e-09 ***
MONTHMay -9.500e+01 3.122e+01 -3.043 0.00252 **
MONTHNovember -1.321e+01 3.555e+01 -1.778 0.07626 .
MONTHOctober 7.145e+01 3.200e+01 0.983 0.32637
MONTHSeptember 9.691e+02 3.916e+01 4.319 2.04e-05 ***
B 5.279e-02 1.161e-03 11.013 < 2e-16 ***
2nd Method:
A <- model.matrix(A ~ B + MONTH, df)
head(A)
  (Intercept) Sum.of.Media.Cost MONTHAugust MONTHDecember MONTHFebruary MONTHJanuary MONTHJuly MONTHJune MONTHMarch MONTHMay
1           1                 0           0             0             0            1         0         0          0        0
2           1                 0           0             0             0            1         0         0          0        0
3           1                 0           0             0             0            1         0         0          0        0
4           1                 0           0             0             0            1         0         0          0        0
5           1                 0           0             0             0            1         0         0          0        0
6           1                 0           0             0             0            1         0         0          0        0
MONTHNovember MONTHOctober MONTHSeptember
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Try A ~ B + MONTH - 1. If your dummies are complete, their linear combination equals the constant, so the model matrix is rank deficient; you cannot do that, so something has to give.
Either you keep the constant (and remove one monthly dummy) to get per-month offsets from the intercept, or, and that is what I would do, remove the constant to get a separate intercept per month.
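In code, using the question's own names (df, A, B, MONTH), the two options would look like:
mod1 <- lm(A ~ B + MONTH, data = df)      # keep the constant: one month becomes the reference level
mod2 <- lm(A ~ B + MONTH - 1, data = df)  # drop the constant: a separate intercept per month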
This is normal when you deal with dummy variables: a factor with n levels needs only n - 1 dummy variables, because the remaining level is the case where all the dummies are zero. April is the month excluded because it comes first in alphabetical order, so it is taken as the reference level.
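If the alphabetical ordering bothers you, a small sketch that imposes chronological order (making January the reference level) using R's built-in month.name constant:
df$MONTH <- factor(df$MONTH, levels = month.name)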
I have a data set like below:
head(df)
## Field.1 Complexity RQT.1 RQT.2 RQT.3 EQT.1 EQT.2 EQT.3 Outcome
## 1 Application 1 M 48 13 1 1594 945 50 832
## 2 Application 2 C 3 1 0 0 0 0 0
## 3 Application 3 C 1 31 2 0 0 0 0
## 4 Application 4 C 0 1 0 0 0 0 0
## 5 Application 5 M 11 5 0 0 0 0 0
## 6 Application 6 C 3 0 0 1 0 0 18
Now let's check the skewness of the numeric variables:
library(e1071)
df.num <- df[sapply(df, is.numeric)] # the numeric columns of df
sapply(df.num, skewness)
RQT.1 RQT.2 RQT.3 EQT.1 EQT.2 EQT.3 Outcome
5.228004 4.109652 6.842860 9.571051 7.520120 11.363172 9.922396
As we can see, the variables are highly right skewed. Since we also have a lot of zero-valued observations, a log transformation won't work in this case.
How can I transform these numeric variables into near-normal variables so that I can apply regression techniques and make predictions with the outcome variable as the dependent variable?
I have a formula and a data frame, and I want to extract the model.matrix(). However, I need the resulting matrix to include the NAs that were found in the original dataset. If I were to use model.frame() to do this, I would simply pass it na.action=NULL. However, the output I need is of the model.matrix() format. Specifically, I need only the right-hand side variables, I need the output to be a matrix (not a data frame), and I need factors to be converted to a series of dummy variables.
I'm sure I could hack something together using loops or something, but I was wondering if anyone could suggest a cleaner and more efficient workaround. Thanks a lot for your time!
And here's an example:
dat <- data.frame(matrix(rnorm(20),5,4), gl(5,2))
dat[3,5] <- NA
names(dat) <- c(letters[1:4], 'fact')
ff <- a ~ b + fact
# This omits the row with a missing observation on the factor
model.matrix(ff, dat)
# This keeps the NA, but it gives me a data frame and does not dichotomize the factor
model.frame(ff, dat, na.action=NULL)
Here is what I would like to obtain:
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.7266086 0 0 0 0
2 1 -0.6088697 0 0 0 0
3 NA 0.4643360 NA NA NA NA
4 1 -1.1666248 1 0 0 0
5 1 -0.7577394 0 1 0 0
6 1 0.7266086 0 1 0 0
7 1 -0.6088697 0 0 1 0
8 1 0.4643360 0 0 1 0
9 1 -1.1666248 0 0 0 1
10 1 -0.7577394 0 0 0 1
Joris's suggestion works, but a quicker and cleaner way to do this is via the global na.action setting. The 'Pass' option achieves our goal of preserving NA's from the original dataset.
Option 1: Pass
Resulting matrix will contain NA's in rows corresponding to the original dataset.
options(na.action='na.pass')
model.matrix(ff, dat)
Option 2: Omit
Resulting matrix will skip rows containing NA's.
options(na.action='na.omit')
model.matrix(ff, dat)
Option 3: Fail
An error will occur if the original data contains NA's.
options(na.action='na.fail')
model.matrix(ff, dat)
Of course, always be careful when changing global options because they can alter behavior of other parts of your code. A cautious person might store the original setting with something like current.na.action <- options('na.action'), and then change it back after making the model.matrix.
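A minimal sketch of that save-and-restore pattern:
current.na.action <- options('na.action') # store the original setting
options(na.action = 'na.pass')
mm <- model.matrix(ff, dat)               # NA rows are preserved
options(current.na.action)                # restore the original setting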
Another way is to give model.matrix a model.frame built with na.action=na.pass as its second argument:
> model.matrix(ff, model.frame(~ ., dat, na.action=na.pass))
(Intercept) b fact2 fact3 fact4 fact5
1 1 -1.3560754 0 0 0 0
2 1 2.5476965 0 0 0 0
3 1 0.4635628 NA NA NA NA
4 1 -0.2871379 1 0 0 0
5 1 2.2684958 0 1 0 0
6 1 -1.3560754 0 1 0 0
7 1 2.5476965 0 0 1 0
8 1 0.4635628 0 0 1 0
9 1 -0.2871379 0 0 0 1
10 1 2.2684958 0 0 0 1
model.frame allows you to set the appropriate action for na.action which is maintained when model.matrix is called.
I half-stumbled across a simpler solution after looking at mattdevlin's and Nathan Gould's answers:
model.matrix.lm(ff, dat, na.action = "na.pass")
model.matrix.default may not support the na.action argument, but model.matrix.lm does!
(I found model.matrix.lm from RStudio's auto-complete suggestions; it appears to be the only non-default method for model.matrix if you haven't loaded any libraries that add others. Then I just guessed it might support the na.action argument.)
You can mess around a little with the model.matrix object, based on the rownames:
MM <- model.matrix(ff,dat)
MM <- MM[match(rownames(dat),rownames(MM)),]
MM[,"b"] <- dat$b
rownames(MM) <- rownames(dat)
which gives:
> MM
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.9583010 0 0 0 0
2 1 0.3266986 0 0 0 0
3 NA 1.4992358 NA NA NA NA
4 1 1.2867461 1 0 0 0
5 1 0.5024700 0 1 0 0
6 1 0.9583010 0 1 0 0
7 1 0.3266986 0 0 1 0
8 1 1.4992358 0 0 1 0
9 1 1.2867461 0 0 0 1
10 1 0.5024700 0 0 0 1
Alternatively, you can use contrasts() to do the work for you. Constructing the matrix by hand would be:
cont <- contrasts(dat$fact)[as.numeric(dat$fact),]
colnames(cont) <- paste("fact",colnames(cont),sep="")
out <- cbind(1,dat$b,cont)
out[is.na(dat$fact),1] <- NA
colnames(out)[1:2]<- c("Intercept","b")
rownames(out) <- rownames(dat)
which gives:
> out
Intercept b fact2 fact3 fact4 fact5
1 1 0.2534288 0 0 0 0
2 1 0.2697760 0 0 0 0
3 NA -0.8236879 NA NA NA NA
4 1 -0.6053445 1 0 0 0
5 1 0.4608907 0 1 0 0
6 1 0.2534288 0 1 0 0
7 1 0.2697760 0 0 1 0
8 1 -0.8236879 0 0 1 0
9 1 -0.6053445 0 0 0 1
10 1 0.4608907 0 0 0 1
In any case, both methods can be incorporated into a function that can deal with more complex formulae. I leave that as an exercise for the reader (how I loathe that sentence when I meet it in a paper ;-) ).
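For instance, a rough sketch of such a wrapper (the function name is my own), generalising the first method; note that unlike the hand-tuned version above it leaves every column NA in the affected rows, so copy complete predictors back afterwards if you need them:
expand_model_matrix <- function(formula, data) {
  mm <- model.matrix(formula, data)
  # re-insert the rows that model.matrix dropped, as all-NA rows
  mm <- mm[match(rownames(data), rownames(mm)), , drop = FALSE]
  rownames(mm) <- rownames(data)
  mm
}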