Creating categorical variables from mutually exclusive dummy variables - r

My question regards an elaboration on a previously answered question about combining multiple dummy variables into a single categorical variable.
In the question previously asked, the categorical variable was created from dummy variables that were NOT mutually exclusive. For my case, my dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2X2 between-subjects factorial design (that also has a within subjects component which I'm not addressing here), so I don't think interaction does what I need to do.
For example, my data might look like this:
id conditionA conditionB conditionC conditionD
1 NA 1 NA NA
2 1 NA NA NA
3 NA NA 1 NA
4 NA NA NA 1
5 NA 2 NA NA
6 2 NA NA NA
7 NA NA 2 NA
8 NA NA NA 2
I'd now like to make categorical variables that combine ACROSS different types of conditions. For example, people who had values for condition A or B would be coded in one categorical variable, and people who had values for condition C or D in another.
id conditionA conditionB conditionC conditionD factor1 factor2
1 NA 1 NA NA 1 NA
2 1 NA NA NA 1 NA
3 NA NA 1 NA NA 1
4 NA NA NA 1 NA 1
5 NA 2 NA NA 2 NA
6 2 NA NA NA 2 NA
7 NA NA 2 NA NA 2
8 NA NA NA 2 NA 2
Right now, I'm doing this using ifelse() statements, which quite simply is a hot mess (and doesn't always work). Please help! There's probably some super-obvious "easier way."
EDIT:
The kinds of ifelse commands that I am using are as follows:
attach(df)
df$factor<-ifelse(conditionA==1 | conditionB==1, 1, NA)
df$factor<-ifelse(conditionA==2 | conditionB==2, 2, df$factor)
In reality, I'm combining across 6-8 columns each time, so a more elegant solution would help a lot.

Update (2019): Please use dplyr::coalesce(), it works pretty much the same.
My R package has a convenience function that picks the first non-NA value for each element in a list of vectors:
#library(devtools)
#install_github('kimisc', 'muelleki')
library(kimisc)
df$factor1 <- with(df, coalesce.na(conditionA, conditionB))
(I'm not sure if this works if conditionA and conditionB are factors. If necessary, convert them to numeric first with as.numeric(as.character(...)).)
Otherwise, you could give interaction a try, combined with recoding of the levels of the resulting factor -- but to me it looks like you're more interested in the first solution:
df$conditionAB <- with(df, interaction(coalesce.na(conditionA, 0),
coalesce.na(conditionB, 0)))
levels(df$conditionAB) <- c('A', 'B')
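For reference, here is a minimal sketch of the dplyr::coalesce() route mentioned in the update, applied to the example data from the question. It assumes dplyr is installed; the data frame is rebuilt here only so the snippet stands on its own.

```r
library(dplyr)

# Example data from the question
df <- read.table(header = TRUE, text = "
id conditionA conditionB conditionC conditionD
1 NA 1 NA NA
2 1 NA NA NA
3 NA NA 1 NA
4 NA NA NA 1
5 NA 2 NA NA
6 2 NA NA NA
7 NA NA 2 NA
8 NA NA NA 2")

# coalesce() returns, element by element, the first non-NA value
df$factor1 <- coalesce(df$conditionA, df$conditionB)
df$factor2 <- coalesce(df$conditionC, df$conditionD)
df
```

Rows with no value in either column stay NA, which is what the desired output in the question shows.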

I think this function gives you what you need (admittedly, this is a quick hack).
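to_indicator <- function(tbl, grp)
{
  # For each row, return its single non-NA value if that value
  # sits in one of the columns named in grp; otherwise return NA
  apply(tbl, 1,
        function (x)
        {
          idx <- which(!is.na(x))
          nm <- names(idx)
          if (length(nm) == 1 && nm %in% grp)
            x[[idx]]
          else
            NA
        })
}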
And here is how it's used with the example data you provided.
tbl <- read.table(header=TRUE, text="
conditionA conditionB conditionC conditionD
NA 1 NA NA
1 NA NA NA
NA NA 1 NA
NA NA NA 1
NA 2 NA NA
2 NA NA NA
NA NA 2 NA
NA NA NA 2")
tbl <- data.frame(tbl)
(tbl <- cbind(tbl,
factor1=to_indicator(tbl, c("conditionA", "conditionB")),
factor2=to_indicator(tbl, c("conditionC", "conditionD"))))

Well, I think you can do it simply with ifelse, something like :
factor1 <- ifelse(is.na(conditionA), conditionB, conditionA)
Another way could be :
factor1 <- conditionA
factor1[is.na(factor1)] <- conditionB[is.na(factor1)]
And a third solution, certainly more practical if you have more than two condition columns (note that a row where every column is NA comes out as 0 rather than NA):
factor1 <- apply(df[,c("conditionA","conditionB")], 1, sum, na.rm=TRUE)
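Since the question mentions combining 6-8 columns at a time, the ifelse idea above can be folded over any number of columns with Reduce(). This is only a sketch in base R; first_non_na is a made-up helper name, not a function from any package.

```r
# Hypothetical helper: first non-NA value per row across the given columns
first_non_na <- function(data, cols) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), data[cols])
}

# Example data from the question
df <- read.table(header = TRUE, text = "
conditionA conditionB conditionC conditionD
NA 1 NA NA
1 NA NA NA
NA NA 1 NA
NA NA NA 1
NA 2 NA NA
2 NA NA NA
NA NA 2 NA
NA NA NA 2")

df$factor1 <- first_non_na(df, c("conditionA", "conditionB"))
df$factor2 <- first_non_na(df, c("conditionC", "conditionD"))
df
```

Unlike the sum approach, rows with no value in any of the listed columns stay NA.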

Related

How to make new dataframe columns from factor levels (& troubleshoot mutate error)

My searches on SO & elsewhere are coming up with interesting solutions to problems that have similar search terms but not my issue. Thought I found a solution, but the error is leaving me quite puzzled. I'm trying to learn tidyverse approaches better, but I appreciate any solution strategies.
Aim: Create new vector columns in a dataframe where each new vector is named from the factor level of an existing dataframe vector.
The code solution should be dynamic so that it can be applied to factors with any number of levels.
Test data
df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # stringsAsFactors=TRUE is needed in R >= 4.0 for y to be a factor
Which produces as expected
> str(df)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> df
x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
and when finished should look like
> df
x y a b c d e
1 1 a NA NA NA NA NA
2 2 b NA NA NA NA NA
3 3 c NA NA NA NA NA
4 4 d NA NA NA NA NA
5 5 e NA NA NA NA NA
Tidy for loop approach
library(tidyverse)
for (i in 1:length(levels(df$y))) {
df <- mutate(df, levels(df$y)[i] = NA)
}
but that gives me the following error:
> for (i in 1:length(levels(df$y))) {
+ df <- mutate(df, levels(df$y)[i] = NA)
Error: unexpected '=' in:
"for (i in 1:length(levels(df$y))) {
df <- mutate(df, levels(df$y)[i] ="
> }
Error: unexpected '}' in "}"
Troubleshooting, I removed the loop and simplified the mutate to see if it works in general, which it will with or without the quotation marks (note, I reran the test data to start fresh).
levels(df$y)[1]
> "a"
df <- mutate(df, a = NA)
df <- mutate(df, "a" = NA) # works the same as the previous line
> df
x y a
1 1 a NA
2 2 b NA
3 3 c NA
4 4 d NA
5 5 e NA
Substituting the levels function back in, but without the loop returns the mutate error (note, I reran the test data to start fresh):
> df <- mutate(df, levels(df$y)[1] = NA)
Error: unexpected '=' in "df <- mutate(df, levels(df$y)[1] ="
I continue to get the same error if I try to use .data=df to specify the dataset, or if I wrap as.character(), paste(), or paste0() around the levels function, which I picked up from various other solutions online. Restructuring the code with the %>% pipe does not help either.
What about the equal sign is unexpected with my levels code substitution (and potential newb mistakes)?
Any assistance is greatly appreciated!
Posting solutions for others based on comments received, and so I can mark this question as solved. Please give upvotes to @arg0naut91 and @Gregor for their solutions and guided help.
Test data
df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # stringsAsFactors=TRUE is needed in R >= 4.0 for y to be a factor
Solution 1: base R
@arg0naut91 provided an elegant base R solution:
df[, levels(df$y)] <- NA
df
x y a b c d e
1 1 a NA NA NA NA NA
2 2 b NA NA NA NA NA
3 3 c NA NA NA NA NA
4 4 d NA NA NA NA NA
5 5 e NA NA NA NA NA
Solution 2: using !! and :=
@Gregor's guidance and useful links showed that some functions, including pretty much all of the tidyverse, do not evaluate objects as we might expect.
First test with a single new column:
df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # refresh test data
varlevel <- levels(df$y)[1] # where level 1=a
df <- mutate(df, !!varlevel := NA)
rm(varlevel) # cleanup
df
x y a
1 1 a NA
2 2 b NA
3 3 c NA
4 4 d NA
5 5 e NA
Then put it into the for loop to capture each factor level as a new column:
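df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # refresh test data
for (i in seq_along(levels(df$y))) {
  varlevel <- levels(df$y)[i]        # name of the column to create
  df <- mutate(df, !!varlevel := NA) # !! unquotes the name, := allows it on the left-hand side
  rm(varlevel) # cleanup
}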
df
x y a b c d e
1 1 a NA NA NA NA NA
2 2 b NA NA NA NA NA
3 3 c NA NA NA NA NA
4 4 d NA NA NA NA NA
5 5 e NA NA NA NA NA
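If you want to stay in dplyr but avoid the loop, one option is to build a named list of values and splice it into mutate() with !!!. This is a sketch under the assumption that dplyr is installed; new_cols is just an illustrative name.

```r
library(dplyr)

df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = TRUE)

# One NA per factor level, named after the levels: list(a = NA, b = NA, ...)
new_cols <- setNames(rep(list(NA), nlevels(df$y)), levels(df$y))

# !!! splices the list so each element becomes its own `name = value` argument
df <- mutate(df, !!!new_cols)
names(df)
```

Because the names live in the list rather than on the left-hand side of `=`, this sidesteps the "unexpected '='" syntax error from the question.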

Adapt layered R data frame so that values of variables match in rows (based on group and date)

I want to study group A's effect on B with respect to certain dependent variables I dubbed "target_n". Because of the way the data was generated, I have "layers" of information in my dataset that are ordered by group. That is, rows where Group=="B" hold B's values on "target_n", and rows where Group=="A" hold A's values on "X_n". Group "C" is basically an "other" category, but I need its values in the same row as A's and B's as well, to make sure that A's effects are on B and not on C. The following should add some clarity:
My data (df) are structured like this:
df<-data.frame(
"Date"=c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
"Group"=c("A","A","A","B","B","B","C","C","C"),
"X_1_A"=c(9,4,7,NA,NA,NA,NA,NA,NA),
"X_2_A"=c(1,2,6,NA,NA,NA,NA,NA,NA),
"target_1_B"=c(NA,NA,NA,0,2,9,NA,NA,NA),
"target_2_B"=c(NA,NA,NA,9,2,1,NA,NA,NA),
"target_1_C"=c(NA,NA,NA,NA,NA,NA,5,3,1),
"target_2_C"=c(NA,NA,NA,NA,NA,NA,1,9,2)
)
What I want is to compute new variables for both group "A" and group "C" so that everything falls within the same rows. If I were to do that manually, I would take A's "X_1_A" value at date "1990-03" and copy it into B's row for the same date.
So in the end, my data would look like this:
df<-data.frame(
"Date"=c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
"Group"=c("A","A","A","B","B","B","C","C","C"),
"X_1_A"=c(9,4,7,NA,NA,NA,NA,NA,NA),
"X_2_A"=c(1,2,6,NA,NA,NA,NA,NA,NA),
"target_1_B"=c(NA,NA,NA,0,2,9,NA,NA,NA),
"target_2_B"=c(NA,NA,NA,9,2,1,NA,NA,NA),
"target_1_C"=c(NA,NA,NA,NA,NA,NA,5,3,1),
"target_2_C"=c(NA,NA,NA,NA,NA,NA,1,9,2),
"NEW_X_1_A"=c(NA,NA,NA,9,4,7,NA,NA,NA),
"NEW_X_2_A"=c(NA,NA,NA,1,2,6,NA,NA,NA),
"NEW_target_1_C"=c(NA,NA,NA,5,3,1,NA,NA,NA),
"NEW_target_2_C"=c(NA,NA,NA,1,9,2,NA,NA,NA)
)
(I have a number of these "X_" variables and exactly the same number of "target_" variables. I also do not have just this one group of A, B and C, but A1, A2, A3, C1, C2, C3 and even more Bs. For each set of A1, B1, C1 I also have a "set" of dates that does not match another's "set". But that would be less of a problem, as I could simply slice my dataset horizontally into sets, do the trick for each of them separately and merge them again.)
But how would I bring A's and C's values into B's rows based on Group=="B" and based on date?
Using data.table you can try
df<-data.frame(
"Date"=c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
"Group"=c("A","A","A","B","B","B","C","C","C"),
"X1_A"=c(9,4,7,NA,NA,NA,NA,NA,NA),
"X2_A"=c(1,2,6,NA,NA,NA,NA,NA,NA),
"target_value_1_B"=c(NA,NA,NA,0,2,9,NA,NA,NA),
"target_value_2_B"=c(NA,NA,NA,9,2,1,NA,NA,NA),
"target_value_1_C"=c(NA,NA,NA,NA,NA,NA,5,3,1),
"target_value_2_C"=c(NA,NA,NA,NA,NA,NA,1,9,2)
)
library(data.table)
setDT(df)[,`:=` (NEW_X1 = ifelse(Group=="B",X1_A[Group=="A"],NA),
NEW_X2 = ifelse(Group=="B",X2_A[Group=="A"],NA),
NEW_target_value_1_C =ifelse(Group=="B",target_value_1_C[Group=="C"],NA),
NEW_target_value_2_C =ifelse(Group=="B",target_value_2_C[Group=="C"],NA)
)]
Which results in:
df
Date Group X1_A X2_A target_value_1_B target_value_2_B target_value_1_C target_value_2_C NEW_X1 NEW_X2 NEW_target_value_1_C NEW_target_value_2_C
1: 1990-03 A 9 1 NA NA NA NA NA NA NA NA
2: 2000-01 A 4 2 NA NA NA NA NA NA NA NA
3: 2010-09 A 7 6 NA NA NA NA NA NA NA NA
4: 1990-03 B NA NA 0 9 NA NA 9 1 5 1
5: 2000-01 B NA NA 2 2 NA NA 4 2 3 9
6: 2010-09 B NA NA 9 1 NA NA 7 6 1 2
7: 1990-03 C NA NA NA NA 5 1 NA NA NA NA
8: 2000-01 C NA NA NA NA 3 9 NA NA NA NA
9: 2010-09 C NA NA NA NA 1 2 NA NA NA NA
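Note that the ifelse() approach lines values up by position, so it relies on every group listing the same dates in the same order. If that is not guaranteed, a join keyed on Date is safer. Below is a minimal base R sketch of that idea using the column names from this answer; pick() is a hypothetical helper written for this example, and the result keeps only B's rows.

```r
df <- data.frame(
  Date  = rep(c("1990-03", "2000-01", "2010-09"), 3),
  Group = rep(c("A", "B", "C"), each = 3),
  X1_A = c(9, 4, 7, rep(NA, 6)),
  X2_A = c(1, 2, 6, rep(NA, 6)),
  target_value_1_B = c(rep(NA, 3), 0, 2, 9, rep(NA, 3)),
  target_value_2_B = c(rep(NA, 3), 9, 2, 1, rep(NA, 3)),
  target_value_1_C = c(rep(NA, 6), 5, 3, 1),
  target_value_2_C = c(rep(NA, 6), 1, 9, 2)
)

# Hypothetical helper: one group's rows, keeping Date plus the chosen
# columns, optionally renamed with a prefix
pick <- function(data, group, cols, prefix = "") {
  out <- data[data$Group == group, c("Date", cols)]
  names(out)[-1] <- paste0(prefix, cols)
  out
}

b_rows <- pick(df, "B", c("target_value_1_B", "target_value_2_B"))
a_rows <- pick(df, "A", c("X1_A", "X2_A"), prefix = "NEW_")
c_rows <- pick(df, "C", c("target_value_1_C", "target_value_2_C"), prefix = "NEW_")

# Left joins on Date: each B row receives A's and C's values for the same date
wide <- merge(merge(b_rows, a_rows, by = "Date", all.x = TRUE),
              c_rows, by = "Date", all.x = TRUE)
wide
```

With all.x = TRUE, a B date that has no match in A or C simply gets NA in the new columns instead of being dropped.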

ordering a vector in R while ignoring yet keeping NAs

If I have a vector a = c(1300,NA,NA,NA,NA,1500,NA,NA,6000,NA,NA,900)
How can I order this vector to result in:
b=[2,NA,NA,NA,NA,3,NA,NA,4,NA,NA,5]?
Side note: I tried filling the NAs with the preceding value, so that it was
a=[1300,1300,1300,1300,1300,1500,1500,1500,6000,6000,6000,900]
But when I use rank on that, I get some crazy half numbers. Any ideas? I'm at my wits' end trying to figure this out.
Keeping the number of NAs after each number is very important here, so I can't simply drop them.
The dplyr::dense_rank function behaves as you want:
library(dplyr)
dense_rank(a)
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
It also works on the dense vector:
b = c(1300,1300,1300,1300,1300,1500,1500,1500,6000,6000,6000,900)
dense_rank(b)
# [1] 2 2 2 2 2 3 3 3 4 4 4 1
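In base R, you can rank only the non-NA values and write the ranks back into their original positions:
replace(a, !is.na(a), rank(a[!is.na(a)], ties.method = "first"))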
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
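Take a ^ is.na(a) and multiply it by rank(a). The first factor is 1 wherever a is present (x ^ 0) and NA wherever a is missing (NA ^ 1), so the NAs stay in place. We use ties="first" (which partially matches rank's ties.method argument) to ensure tied values get increasing ranks, not averages.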
rank(a, ties="first") * a ^ is.na(a)
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
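If you would rather not load dplyr, a dense rank can also be sketched in base R by looking each value up in the sorted unique values. The "half numbers" from plain rank() come from its default ties.method = "average", which this lookup avoids entirely.

```r
a <- c(1300, NA, NA, NA, NA, 1500, NA, NA, 6000, NA, NA, 900)

# sort() drops the NA, so the lookup table is 900, 1300, 1500, 6000;
# match() returns each value's position in that table and NA for NA
dense <- match(a, sort(unique(a)))
dense
# [1]  2 NA NA NA NA  3 NA NA  4 NA NA  1

# The same lookup works on the filled-in vector, where ties share a rank
b <- c(1300, 1300, 1300, 1300, 1300, 1500, 1500, 1500, 6000, 6000, 6000, 900)
match(b, sort(unique(b)))
# [1] 2 2 2 2 2 3 3 3 4 4 4 1
```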

Merging multiple columns to one in data frame by sum r

I want to merge 7 columns into one column by sum, but I cannot find a good way to do this. The data frame contains 71 observations and 7 variables.
The first rows are
> head(df)
pop_exposed_1_1 pop_exposed_1_2 pop_exposed_1_3 pop_exposed_1_4
1 NA NA 15778358 NA
2 NA NA NA NA
3 NA NA NA 3971412
4 NA NA NA 2694625
5 NA NA NA NA
6 NA NA NA NA
pop_exposed_2_2 pop_exposed_2_3 pop_exposed_2_4
1 NA NA NA
2 38044072 NA NA
3 NA NA NA
4 NA NA NA
5 NA 1626335.0 NA
6 NA 429924.4 NA
All the NA values need to be replaced by a value from another variable and some rows have multiple values that need to be combined by sum. So that the outcome is just one variable pop_exposed. I have tried several things, but nothing worked the way I would like to.
Look up ?rowSums
rowSums(df, na.rm=TRUE)
rowMeans(df, na.rm=TRUE)
or the apply way
apply(df,1,sum ,na.rm = TRUE) # Sum by row '1' (for columns use '2')
apply(df,1,mean,na.rm = TRUE) # Mean by row '1' (for columns use '2')
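One thing to watch for: with na.rm = TRUE, a row in which every column is NA sums to 0, not NA. If such rows should stay missing, you can reset them afterwards. A small sketch with made-up numbers (the column names p1 and p2 are placeholders, not the real pop_exposed_* columns):

```r
# Toy data: the last row has no value in any column
df <- data.frame(p1 = c(NA, 5, NA),
                 p2 = c(3, 2, NA))

pop_exposed <- rowSums(df, na.rm = TRUE)
pop_exposed
# 1 2 3
# 3 7 0

# Put NA back where a row had no observed values at all
pop_exposed[rowSums(!is.na(df)) == 0] <- NA
pop_exposed
# 1  2  3
# 3  7 NA
```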

Conditionals calculations across rows R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count, across the rows of the dataset listed below, the number of times an observation meets a criterion.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with :
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to evaluate !is.na and then count the number of times the value was less than zero with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
A = c(NA, 3),
B = c(2, NA),
C = c(-1, -2),
D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1
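With 24 treatment columns, it may be more convenient to pick the numeric columns by type instead of dropping the first column by position. Here is a sketch using three rows from the question and the column names asked for there (NA.Count, lt0, gt0); it assumes every treatment column is stored as numeric, since a column containing only plain NA is logical and would be skipped.

```r
b01 <- data.frame(Gene = c("AARS_3", "AASDHPPT_21936", "ABCC2_22"),
                  A = c(NA, NA, 4.501518),
                  B = c(NA, NA, 3.17992),
                  C = c(4.168365, NA, NA),
                  D = c(NA, -3.221287, NA))

# Keep only the numeric columns, whatever their position
num_cols <- vapply(b01, is.numeric, logical(1))
vals <- as.matrix(b01[num_cols])

# Logical matrices sum as 0/1, so rowSums() counts the TRUEs per row
b01$NA.Count <- rowSums(is.na(vals))
b01$lt0 <- rowSums(vals < 0, na.rm = TRUE)
b01$gt0 <- rowSums(vals > 0, na.rm = TRUE)
b01
```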

Resources