I'm trying to generate a model.matrix with one dummy variable per level of a categorical variable, set to 1 if that level appears in either of a pair of factors. Here is an example:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
This results in 10 rows for the (5*(5-1)/2) pairs of (A,B,C,D,E):
df
group1 group2 val control1
1 A B 0.9788535 1.620103
2 A C 1.2101000 2.146025
3 A D 0.8841293 2.210699
4 A E 0.8622912 1.352755
5 B C 1.1840101 2.034643
6 B D 0.9730296 1.593481
7 B E 0.9574277 2.755427
8 C D 0.7279171 1.864196
9 C E 0.2472371 2.779127
10 D E 0.8517064 1.881325
I want to control for a fixed effect in a linear model when a particular level is in either group1 or group2. I can construct a model matrix for this:
tmp1 <- model.matrix(~ 0+group1,df)
tmp2 <- model.matrix(~ 0+group2,df)
tmp3 <- (tmp1|tmp2)*1
tmp3
group1A group1B group1C group1D group1E
1 1 1 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
5 0 1 1 0 0
6 0 1 0 1 0
7 0 1 0 0 1
8 0 0 1 1 0
9 0 0 1 0 1
10 0 0 0 1 1
A few questions:
Doing it this way does not leave me a lot of options in terms of other covariates. How can I construct such a dummy variable as represented by the model matrix tmp3 and then use it in a call to lm with other covariates such as control1?
The idea is that there is a fixed effect on whether an individual (A,B,C,D,E) is in either group1 or group2. This seems like a reasonable assumption, but I haven't found any references. Am I missing something obvious or does this have a common name in statistics?
Thanks for any help.
I am not sure whether model.matrix provides an option for this, but at least in your example you can build the data you are after from tmp3 without too much effort.
model_mat <- data.frame(tmp3[,-1], val = df$val, control1 = df$control1)
lm(val ~ ., data = model_mat)
You need to remove one of the dummies. I have removed A, but you can of course pick any of the others as the reference category.
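As a minimal sketch of another route (not necessarily what the answer above had in mind): build the either-group indicators directly and pass the matrix to lm as a single term next to control1. The names either and X are made up for illustration, and the data are rebuilt so the snippet runs on its own.

```r
group1 <- factor(c("A","A","A","A","B","B","B","C","C","D"),
                 levels = c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C","D","E","D","E","E"),
                 levels = levels(group1))
set.seed(8)
df <- data.frame(group1, group2,
                 val = rnorm(10, 1, .25),
                 control1 = rnorm(10, 2, .5))

# One column per level: 1 if the level appears in group1 or group2, else 0
either <- sapply(levels(df$group1),
                 function(lv) as.integer(df$group1 == lv | df$group2 == lv))

# Every row contains exactly two levels, so the columns sum to 2 and are
# collinear with the intercept; drop "A" to use it as the reference
X <- either[, -1]

# A matrix can be used as one term in a formula, alongside other covariates
fit <- lm(val ~ X + control1, data = df)
coef(fit)
```

The coefficients come out named XB, XC, XD and XE, plus the intercept and control1.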
Here's a solution using akrun's idea:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
tmpval <- as.data.frame(Reduce('|',lapply(df[1:2], function(group) model.matrix(~0+group)))*1)
indf <- cbind(df,tmpval)
mod1 <- lm(val ~ 0+groupA+groupB+groupC+groupD+groupE,
indf)
summary(mod1)
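To bring control1 into the same model without typing out every dummy name, one option is to assemble the formula with reformulate. This is only a sketch; ind, f and mod2 are names introduced here, and the data are rebuilt so the snippet is self-contained.

```r
group1 <- factor(c("A","A","A","A","B","B","B","C","C","D"),
                 levels = c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C","D","E","D","E","E"),
                 levels = levels(group1))
set.seed(8)
df <- data.frame(group1, group2,
                 val = rnorm(10, 1, .25),
                 control1 = rnorm(10, 2, .5))

# Same either-group dummies as above (columns groupA ... groupE)
ind <- as.data.frame(
  Reduce(`|`, lapply(df[c("group1", "group2")],
                     function(group) model.matrix(~ 0 + group))) * 1
)
indf <- cbind(df, ind)

# val ~ groupA + ... + groupE + control1 - 1, built from the column names
f <- reformulate(c(names(ind), "control1"), response = "val",
                 intercept = FALSE)
mod2 <- lm(f, data = indf)
coef(mod2)
```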
I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose column sum is 0. My attempts are below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above lines run, but neither actually removes the all-zero columns.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution while exchanging comments with me, but I want to leave an explanation here. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things: first, subset each data frame in my.list; second, return each data frame. I think he expected each data frame to be updated by the subsetting. But the subset was never assigned to anything, and a function returns the value of its last expression, so the second command simply returned each data frame as it was. (On the surface, he did not see any change.) Following his approach, I would do something like this:
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
What he wanted was to create a temporary object in the anonymous function and return that object. Alternatively, he could do the following:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
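A small sketch of the difference, using a cut-down version of df3. It also shows drop = FALSE, which is an addition of mine rather than part of the answer above: without it, a data frame with only one surviving column collapses to a plain vector. The names unchanged, filtered and one_col are made up for illustration.

```r
df3 <- data.frame(k = c(0,0,0,0,0), l = c(1,0,1,0,1), m = c(1,0,1,0,0))

# The subset is computed and thrown away; the last expression, x, is returned
unchanged <- (function(x) { x[, colSums(x) != 0]; x })(df3)

# Returning the subset itself drops the all-zero column k
filtered <- (function(x) x[, colSums(x) != 0, drop = FALSE])(df3)

# With a single surviving column, the result depends on drop
one_col   <- data.frame(p = c(0, 0), q = c(1, 2))
as_vector <- one_col[, colSums(one_col) != 0]                # plain vector
as_frame  <- one_col[, colSums(one_col) != 0, drop = FALSE]  # data frame
```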
I want a simple way to create a new variable indicating whether a boolean is ever true within a group in an R data frame.
Here is an example:
Suppose in the dataset I have 2 variables (among other variables which are not relevant) 'a' and 'b' and 'a' determines a group, while 'b' is a boolean with values TRUE (1) or FALSE (0). I want to create a variable 'c', which is also a boolean being 1 for all entries in groups where 'b' is at least once 'TRUE', and 0 for all entries in groups in which 'b' is never TRUE.
From entries like below:
a b
-----
1 1
2 0
1 0
1 0
1 1
2 0
2 0
3 0
3 1
3 0
I want to get variable 'c' like below:
a b c
-----------
1 1 1
2 0 0
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
3 0 1
3 1 1
3 0 1
-----------
I know how to do it in Stata, but I haven't done similar things in R yet, and it is difficult to find information on that on the internet.
In fact I am doing that only in order to later remove all the observations for which 'c' is 0, so any other suggestions would be fine as well. The application of that relates to multinomial logit estimation, where the alternatives that are never-chosen need to be removed from the dataset before estimation.
If X is your data frame:
library(dplyr)
X <- X %>%
group_by(a) %>%
mutate(c = any(b == 1))
A base R option would be
df1$c <- with(df1, ave(b, a, FUN=any))
Or
library(sqldf)
sqldf('select * from df1
left join(select a,
(sum(b))>0 as c
from df1
group by a)
using(a)')
Simple data.table approach
require(data.table)
data <- data.table(data)
data[, c := any(b), by = a]
Logical and numeric (0/1) columns behave the same in most arithmetic, but if you'd like a numeric result you can simply wrap the call to any in as.numeric.
An answer with base R, assuming a and b are in data frame x.
Each group in a maps to exactly one value of c, so I first build that mapping (named by group):
cmap <- ifelse(sapply(split(x, x$a), function(x) sum(x[, "b"])) > 0, 1, 0)
Then look up the mapped value for each row by group name, which also works when the group ids are not 1, 2, 3, ...:
x$c <- unname(cmap[as.character(x$a)])
Final output
> x
a b c
1 1 1 1
2 2 0 0
3 1 0 1
4 1 0 1
5 1 1 1
6 2 0 0
7 2 0 0
8 3 0 1
9 3 1 1
10 3 0 1
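Since the stated goal is to drop the groups where b is never 1, the intermediate column can be skipped altogether. A minimal base R sketch, with x rebuilt from the example in the question and keep and x_kept as made-up names:

```r
x <- data.frame(a = c(1,2,1,1,1,2,2,3,3,3),
                b = c(1,0,0,0,1,0,0,0,1,0))

# TRUE for every row whose group has at least one b == 1
keep <- ave(x$b == 1, x$a, FUN = any)

# Keep only those rows; group 2 is never chosen, so it disappears
x_kept <- x[keep, ]
x_kept
```

Comparing b == 1 first gives ave a logical vector to work on, so any is not applied to a numeric column.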
I'd like to create a new variable that contains 1 and 0. A 1 represents agreement between the raters (both raters 1 or both raters 0) and a 0 represents disagreement.
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- cbind(rater_A, rater_B)
The new variable would be like the following vector I created manually:
df$agreement <- c(1,0,0,0,1,0,1,1,1,1)
Maybe there's a package or a function I don't know. Any help would be great.
You could create df as a data.frame (instead of using cbind) and use within and ifelse:
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- data.frame(rater_A, rater_B)
##
df <- within(df,
agreement <- ifelse(
rater_A==rater_B,1,0))
##
> df
rater_A rater_B agreement
1 1 1 1
2 0 1 0
3 1 0 0
4 1 0 0
5 1 1 1
6 0 1 0
7 0 0 1
8 1 1 1
9 0 0 1
10 0 0 1
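A shorter variant of the same idea, as a sketch: comparing the two columns gives TRUE/FALSE, and as.integer turns that into 1/0, so ifelse is not strictly needed. Note that df still has to be a data frame; cbind returns a matrix, and $ assignment does not work on a matrix.

```r
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- data.frame(rater_A, rater_B)

# TRUE where the raters agree, converted to 1/0
df$agreement <- as.integer(df$rater_A == df$rater_B)
df
```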
I have a dataframe that looks like this:
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
etc.
Because I need to correlate variables a-g (their scores are in score1) with score2 at step 5 only, I think I need to change my dataset into this first:
a b c d e f g score2_step5
0 1   1 0   0 0
1 1   1 0 1   1
      1     0
etc.
I am pretty sure that the Reshape package should be able to help me to do the job, but I haven't been able to make it work yet.
Can anyone help me? Many thanks in advance!
Here's another version. If a block has no step 5, score2_step5 is set to 0. Assuming your data.frame is df:
require(reshape2)
out <- do.call(rbind, lapply(seq(1, nrow(df), by=5), function(ix) {
iy <- min(ix+4, nrow(df))
df.b <- df[ix:iy, ]
tt <- dcast(df.b, 1 ~ var1, fill = 0, value.var = "score1", drop = FALSE)
tt$score2_step5 <- 0
if (any(df.b$step == 5)) {
tt$score2_step5 <- df.b$score2[df.b$step == 5]
}
tt[,-1]
}))
> out
a b d e f g score2_step5
2 0 1 1 0 0 0 0
21 1 1 1 0 1 0 1
22 0 0 1 0 0 0 0
It looks like you want 7 correlations between the variables a-g and score2_step5--is that correct? First, you're going to need another variable. I'm assuming that step repeats continuously from 1 to 5; if not, this is going to be more complicated. I'm assuming your data is called df. I also prefer the newer reshape2 package, so I'm using that.
df$block <- rep(1:(nrow(df)/5),each=5)
df.molten <- melt(df,id.vars=c("var1", "step", "block"),measure.vars=c("score1"))
df2 <- dcast(df.molten, block ~ var1)
score2_step5 <- df$score2[df$step==5]
and then finally
cor(df2, score2_step5, use='pairwise')
There's an extra column (block) in df2 that you can get rid of or just ignore.
I added another row to your example data because my code doesn't work unless there is a step-5 observation in every block.
dat <- read.table(textConnection("
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
5 a 1 0"),header=TRUE)
Like @JonathanChristensen, I made another variable (I called it rep instead of block), and I made var1 into a factor (since there are no c values in the example data set given and I wanted a placeholder).
dat <- transform(dat,var1=factor(var1,levels=letters[1:7]),
rep=cumsum(step==1))
tapply makes the table of score1 values:
tab <- with(dat,tapply(score1,list(rep,var1),identity))
add the score2, step-5 values:
data.frame(tab,subset(dat,step==5,select=score2))
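If some blocks have no step-5 row, as in the original 12-row example, one possible workaround is to look up the step-5 score by block and leave NA where it is missing. This is a sketch under the same assumption that each block starts at step 1; block, wide, step5 and out are names introduced here.

```r
dat <- data.frame(
  step   = c(1,2,3,4,5, 1,2,3,4,5, 1,2),
  var1   = c("a","b","d","e","g", "b","a","d","e","f", "g","d"),
  score1 = c(0,1,1,0,0, 1,1,1,0,1, 0,1),
  score2 = c(0,1,1,0,0, 1,0,0,1,1, 1,1)
)

# A new block starts whenever step returns to 1
dat$block <- cumsum(dat$step == 1)
dat$var1  <- factor(dat$var1, levels = letters[1:7])

# One row per block, one column per variable; unseen combinations are NA
wide <- with(dat, tapply(score1, list(block, var1), function(v) v[1]))

# score2 at step 5 for each block, NA when the block has no step 5
step5 <- dat[dat$step == 5, ]
score2_step5 <- step5$score2[match(rownames(wide), step5$block)]

out <- data.frame(wide, score2_step5)
out
```

The third block ends up with NA in score2_step5, so a later call to cor would need use = "pairwise" or similar to skip it.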
So, my data set consists of 15 variables, one of them (sex) has only 2 levels. I want to use it as a dummy variable, but the levels are 1 and 2. How do I do this? I want to have levels 0 and 1, but I don't know how to manage this in R!
With most of R's modelling tools that have a formula interface, you don't need to create dummy variables; the underlying code that handles and interprets the formula will do this for you. If you want a dummy variable for some other reason, there are several options. The easiest (IMHO) is to use model.matrix():
set.seed(1)
dat <- data.frame(sex = sample(c("male","female"), 10, replace = TRUE))
model.matrix( ~ sex - 1, data = dat)
which gives:
> dummy <- model.matrix( ~ sex - 1, data = dat)
> dummy
sexfemale sexmale
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"
> dummy[,1]
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
You can use either column of dummy as a numeric dummy variable; choose the column whose level you want coded as 1. dummy[,1] codes female as 1 and dummy[,2] codes male as 1.
Cast this as a factor if you want it to be interpreted as a categorical object:
> factor(dummy[, 1])
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
Levels: 0 1
But that is defeating the object of factor; what is 0 again?
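For the case in the question, where sex has levels 1 and 2, a small sketch of the same idea: keeping the intercept in the formula makes model.matrix return a single 0/1 column for the second level. The toy vector and the names mm and sex_dummy are made up for illustration.

```r
sex <- factor(c(1, 2, 2, 1, 2))

# With an intercept, the first level (1) is the baseline and only
# the second level gets a dummy column, named "sex2"
mm <- model.matrix(~ sex)
sex_dummy <- unname(mm[, "sex2"])
sex_dummy
```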
Try this:
set.seed(001) # generating some data
sex <- factor(sample(1:2, 10, replace=TRUE)) # this is what you have
[1] 1 1 2 2 1 2 2 2 2 1
Levels: 1 2
sex<-factor(ifelse(as.numeric(sex)==2, 1,0)) # this is what you want
sex
[1] 0 0 1 1 0 1 1 1 1 0
Levels: 0 1
If you want labels to be 0 = Male and 1 = Female, then...
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
sex # this is what you want
[1] M M F F M F F F F M
Levels: M F
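One caveat about the recoding above, shown as a sketch: as.numeric on a factor returns the internal level codes (1, 2, ...), not the labels. That happens to coincide with the labels when the levels are 1 and 2, but not in general. Comparing against the label avoids relying on this. The names f, codes, values and sex01 are illustrative only.

```r
f <- factor(c(10, 20, 10, 20))
codes  <- as.numeric(f)                # internal codes: 1 2 1 2
values <- as.numeric(as.character(f))  # original values: 10 20 10 20

# Recode by label instead of by code
sex   <- factor(c(1, 1, 2, 2, 1))
sex01 <- as.integer(sex == "2")        # 0 0 1 1 0
```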
Actually, you don't need to create a dummy variable in order to estimate a model using lm. Let's look at this example:
set.seed(001) # Generating some data
N <- 100
x <- rnorm(N, 50, 20)
y <- 20 + 3.5*x + rnorm(N)
sex <- factor(sample(1:2, N, replace=TRUE))
# Estimating the linear model
lm(y ~ x + sex) # using the first category as the baseline (this means sex==1)
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sex2
19.97815 3.49994 -0.02719
# renaming the categories and labelling them
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
lm(y ~ x + sex) # the same results, baseline is 'Male'
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sexF
19.97815 3.49994 -0.02719
As you can see, R deals with the dummies pretty well: you just pass them into the formula as a factor variable and R will do the rest for you.
By the way, there's no need to recode the categories from c(1,2) to c(0,1); the results will be the same, as you can see in the example above.
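The equivalence can be checked directly. A minimal sketch, reusing the simulated data from above and comparing the factor fit against a hand-made 0/1 column (sex01, fit_factor and fit_dummy are names introduced here):

```r
set.seed(1)
N <- 100
x <- rnorm(N, 50, 20)
y <- 20 + 3.5 * x + rnorm(N)
sex <- factor(sample(1:2, N, replace = TRUE))

# Hand-made dummy: 1 for level "2", 0 for level "1"
sex01 <- as.integer(sex == "2")

fit_factor <- lm(y ~ x + sex)    # R builds the dummy from the factor
fit_dummy  <- lm(y ~ x + sex01)  # dummy supplied explicitly

# Same estimates; only the coefficient name differs (sex2 vs sex01)
all.equal(unname(coef(fit_factor)), unname(coef(fit_dummy)))
```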
As suggested by many above, turn it into a factor.
If you really want to dummy code the gender variable, consider this:
set.seed(100)
gender = rbinom(100,1,0.5)+1
gender_dummy = gender-1