My searches on SO & elsewhere are coming up with interesting solutions to problems that have similar search terms but not my issue. Thought I found a solution, but the error is leaving me quite puzzled. I'm trying to learn tidyverse approaches better, but I appreciate any solution strategies.
Aim: Create new vector columns in a dataframe where each new vector is named from the factor level of an existing dataframe vector.
The code solution should be dynamic so that it can be applied to factors with any number of levels.
Test data
df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # stringsAsFactors=TRUE is needed on R >= 4.0 for y to be a factor
Which produces as expected
> str(df)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> df
x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
and when finished should look like
> df
x y a b c d e
1 1 a NA NA NA NA NA
2 2 b NA NA NA NA NA
3 3 c NA NA NA NA NA
4 4 d NA NA NA NA NA
5 5 e NA NA NA NA NA
Tidy for loop approach
library(tidyverse)
for (i in 1:length(levels(df$y))) {
df <- mutate(df, levels(df$y)[i] = NA)
}
but that gives me the following error:
> for (i in 1:length(levels(df$y))) {
+ df <- mutate(df, levels(df$y)[i] = NA)
Error: unexpected '=' in:
"for (i in 1:length(levels(df$y))) {
df <- mutate(df, levels(df$y)[i] ="
> }
Error: unexpected '}' in "}"
Troubleshooting, I removed the loop and simplified the mutate to see if it works in general, which it will with or without the quotation marks (note, I reran the test data to start fresh).
levels(df$y)[1]
> "a"
df <- mutate(df, a = NA)
df <- mutate(df, "a" = NA) # works the same as the previous line
> df
x y a
1 1 a NA
2 2 b NA
3 3 c NA
4 4 d NA
5 5 e NA
Substituting the levels function back in, but without the loop returns the mutate error (note, I reran the test data to start fresh):
> df <- mutate(df, levels(df$y)[1] = NA)
Error: unexpected '=' in "df <- mutate(df, levels(df$y)[1] ="
I continue to get the same error if I use .data=df to specify the dataset, or if I wrap as.character(), paste(), or paste0() around the levels function, which are ideas I picked up from various other solutions online. Restructuring the code with the %>% pipe does not help either.
What about the equal sign is unexpected with my levels code substitution (and potential newb mistakes)?
Any assistance is greatly appreciated!
Posting solutions for others based on the comments received, and so I can mark this question as solved. Please upvote #arg0naut91 and #Gregor for their solutions & guided help.
Test data
df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # stringsAsFactors=TRUE is needed on R >= 4.0 for y to be a factor
Solution 1: base R
#arg0naut91 provided an elegant base R solution:
df[, levels(df$y)] <- NA
df
x y a b c d e
1 1 a NA NA NA NA NA
2 2 b NA NA NA NA NA
3 3 c NA NA NA NA NA
4 4 d NA NA NA NA NA
5 5 e NA NA NA NA NA
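Solution 2: using !! and :=
#Gregor's guidance & useful links showed how some functions, and pretty much all of the tidyverse, do not evaluate their arguments the way we might expect. The name on the left of = is taken literally, so a name computed at run time has to be unquoted with !! and assigned with := instead.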
First test with a single new column:
df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # refresh test data (y must be a factor)
varlevel <- levels(df$y)[1] # where level 1=a
df <- mutate(df, !!varlevel := NA)
rm(varlevel) # cleanup
df
x y a
1 1 a NA
2 2 b NA
3 3 c NA
4 4 d NA
5 5 e NA
Then put it into the for loop to capture each factor level as a new column:
df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # refresh test data (y must be a factor)
for (i in seq_along(levels(df$y))) {
  varlevel <- levels(df$y)[i]          # name of the i-th factor level
  df <- mutate(df, !!varlevel := NA)   # add a column with that name
  rm(varlevel) # cleanup
}
df
x y a b c d e
1 1 a NA NA NA NA NA
2 2 b NA NA NA NA NA
3 3 c NA NA NA NA NA
4 4 d NA NA NA NA NA
5 5 e NA NA NA NA NA
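For anyone wondering why the original attempt failed: inside a function call, the left-hand side of = has to be a bare name or a string literal, so R's parser rejects levels(df$y)[i] = NA before mutate() ever runs. A loop-free variant is also possible. The following is only a minimal sketch, not something from the comments: it builds a named list of NA values and splices it into mutate() with !!!. The name new_cols is purely illustrative, and stringsAsFactors = TRUE is set so that y is a factor on R >= 4.0.

```r
library(dplyr)

df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = TRUE)

# One NA per factor level, named after the level (new_cols is a hypothetical name)
new_cols <- setNames(rep(list(NA), nlevels(df$y)), levels(df$y))

# !!! splices each list element into mutate() as a separate name = value argument
df <- mutate(df, !!!new_cols)

names(df)
```

Because the list is built from levels(df$y), this should adapt to a factor with any number of levels, just like the loop.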
I want to research group A's effect on B regarding certain dependent variables I dubbed "target_n". Due to the way in which the data was generated, I have "layers" of information in my dataset that are ordered by group. That means that in rows where Group=="B" I have information on B's values on "target_n", and in rows where Group=="A" I have information on A's values on "X_n". Group "C" is basically an "other" category, but I need its values in the same row as A's and B's as well, to make sure that A's effects are on B and not on C. The following should add some clarity:
My data (df) are structured like this:
df<-data.frame(
"Date"=c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"), # quoted: unquoted 1990-03 would be evaluated as a subtraction
"Group"=c("A","A","A","B","B","B","C","C","C"),
"X_1_A"=c(9,4,7,NA,NA,NA,NA,NA,NA),
"X_2_A"=c(1,2,6,NA,NA,NA,NA,NA,NA),
"target_1_B"=c(NA,NA,NA,0,2,9,NA,NA,NA),
"target_2_B"=c(NA,NA,NA,9,2,1,NA,NA,NA),
"target_1_C"=c(NA,NA,NA,NA,NA,NA,5,3,1),
"target_2_C"=c(NA,NA,NA,NA,NA,NA,1,9,2)
)
What I want is to compute new variables for both group "A" and group "C" so that everything falls within the same rows as B. If I were to do that manually, I would take A's "X_1" value at date "1990-03" and copy it into B's row for the same date.
So in the end, my data would look like this:
df<-data.frame(
"Date"=c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
"Group"=c("A","A","A","B","B","B","C","C","C"),
"X_1_A"=c(9,4,7,NA,NA,NA,NA,NA,NA),
"X_2_A"=c(1,2,6,NA,NA,NA,NA,NA,NA),
"target_1_B"=c(NA,NA,NA,0,2,9,NA,NA,NA),
"target_2_B"=c(NA,NA,NA,9,2,1,NA,NA,NA),
"target_1_C"=c(NA,NA,NA,NA,NA,NA,5,3,1),
"target_2_C"=c(NA,NA,NA,NA,NA,NA,1,9,2),
"NEW_X_1_A"=c(NA,NA,NA,9,4,7,NA,NA,NA),
"NEW_X_2_A"=c(NA,NA,NA,1,2,6,NA,NA,NA),
"NEW_target_1_C"=c(NA,NA,NA,5,3,1,NA,NA,NA),
"NEW_target_2_C"=c(NA,NA,NA,1,9,2,NA,NA,NA)
)
(I have a number of these "X_"s and exactly the same number of "target_" variables. I also do not just have this one group of A, B and C, but A1,A2,A3,C1,C2,C3 and even more Bs. For each set of A1,B1,C1 I also have a "set" of dates that does not match another set's dates. But that would be less of a problem, as I could simply slice my dataset horizontally into sets, do the trick for each of them separately and merge them again.)
But how would I bring A's and C's values into B's rows based on Group=="B" and based on date?
Using data.table you can try
df<-data.frame(
"Date"=c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
"Group"=c("A","A","A","B","B","B","C","C","C"),
"X1_A"=c(9,4,7,NA,NA,NA,NA,NA,NA),
"X2_A"=c(1,2,6,NA,NA,NA,NA,NA,NA),
"target_value_1_B"=c(NA,NA,NA,0,2,9,NA,NA,NA),
"target_value_2_B"=c(NA,NA,NA,9,2,1,NA,NA,NA),
"target_value_1_C"=c(NA,NA,NA,NA,NA,NA,5,3,1),
"target_value_2_C"=c(NA,NA,NA,NA,NA,NA,1,9,2)
)
library(data.table)
setDT(df)[,`:=` (NEW_X1 = ifelse(Group=="B",X1_A[Group=="A"],NA),
NEW_X2 = ifelse(Group=="B",X2_A[Group=="A"],NA),
NEW_target_value_1_C =ifelse(Group=="B",target_value_1_C[Group=="C"],NA),
NEW_target_value_2_C =ifelse(Group=="B",target_value_2_C[Group=="C"],NA)
)]
Which results in:
df
Date Group X1_A X2_A target_value_1_B target_value_2_B target_value_1_C target_value_2_C NEW_X1 NEW_X2 NEW_target_value_1_C NEW_target_value_2_C
1: 1990-03 A 9 1 NA NA NA NA NA NA NA NA
2: 2000-01 A 4 2 NA NA NA NA NA NA NA NA
3: 2010-09 A 7 6 NA NA NA NA NA NA NA NA
4: 1990-03 B NA NA 0 9 NA NA 9 1 5 1
5: 2000-01 B NA NA 2 2 NA NA 4 2 3 9
6: 2010-09 B NA NA 9 1 NA NA 7 6 1 2
7: 1990-03 C NA NA NA NA 5 1 NA NA NA NA
8: 2000-01 C NA NA NA NA 3 9 NA NA NA NA
9: 2010-09 C NA NA NA NA 1 2 NA NA NA NA
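Note that the ifelse() calls above line the values up by position: X1_A[Group=="A"] is recycled along the rows, which happens to give the right result here because every group lists the same dates in the same order. If that cannot be guaranteed, the rows can be paired by Date explicitly. The following is a minimal base R sketch under that assumption; dat and copy_by_date are hypothetical names introduced only for illustration, and the new columns come out as NEW_X1_A and NEW_X2_A rather than NEW_X1 and NEW_X2.

```r
# Same layout as the example above, rebuilt as a plain data.frame
# (`dat` is a hypothetical name; the B target columns are left out for brevity)
dat <- data.frame(
  Date  = rep(c("1990-03", "2000-01", "2010-09"), times = 3),
  Group = rep(c("A", "B", "C"), each = 3),
  X1_A  = c(9, 4, 7, NA, NA, NA, NA, NA, NA),
  X2_A  = c(1, 2, 6, NA, NA, NA, NA, NA, NA),
  target_value_1_C = c(NA, NA, NA, NA, NA, NA, 5, 3, 1),
  target_value_2_C = c(NA, NA, NA, NA, NA, NA, 1, 9, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: copy `cols` from the rows of group `from` into the
# rows of group `to`, pairing rows by Date rather than by position.
copy_by_date <- function(data, cols, from, to = "B") {
  src   <- data[data$Group == from, ]
  is_to <- data$Group == to
  idx   <- match(data$Date[is_to], src$Date)  # NA where no matching date exists
  for (col in cols) {
    new_col <- rep(NA_real_, nrow(data))
    new_col[is_to] <- src[[col]][idx]
    data[[paste0("NEW_", col)]] <- new_col
  }
  data
}

dat <- copy_by_date(dat, c("X1_A", "X2_A"), from = "A")
dat <- copy_by_date(dat, c("target_value_1_C", "target_value_2_C"), from = "C")
dat[dat$Group == "B", ]
```

Since match() returns NA for a date that has no partner in the source group, B rows without a matching A or C row simply stay NA.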
If I have a vector a = c(1300,NA,NA,NA,NA,1500,NA,NA,6000,NA,NA,900)
How can I order this vector to result in:
b=[2,NA,NA,NA,NA,3,NA,NA,4,NA,NA,5]?
Side note: I tried filling in the NAs so that the values repeat, giving
a=[1300,1300,1300,1300,1300,1500,1500,1500,6000,6000,6000,900]
But when I use rank on that, it gives me some crazy half numbers. Any ideas? I'm at my wits' end trying to figure this out.
Keeping the number of NAs after each number is very important here, so I can't just ignore them.
The dplyr::dense_rank function behaves as you want:
library(dplyr)
dense_rank(a)
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
It also works on the dense vector:
b = c(1300,1300,1300,1300,1300,1500,1500,1500,6000,6000,6000,900)
dense_rank(b)
# [1] 2 2 2 2 2 3 3 3 4 4 4 1
replace(a, !is.na(a), rank(a[!is.na(a)], ties.method = "first"))
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
Take a ^ is.na(a), which is 1 where a has a value and NA where it is missing, and multiply it by rank(a). We use ties.method = "first" so that tied values get distinct, increasing ranks rather than averages.
rank(a, ties.method = "first") * a ^ is.na(a)
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
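If you would rather not load dplyr, a dense rank can also be sketched in base R by looking up each value's position among the sorted unique values. This is an alternative to the answers above, not a restatement of them; dense_a and dense_b are illustrative names. It relies on sort() dropping NA by default, so missing values find no match and stay NA.

```r
a <- c(1300, NA, NA, NA, NA, 1500, NA, NA, 6000, NA, NA, 900)

# Position of each value among the sorted distinct values; NA has no match and stays NA
dense_a <- match(a, sort(unique(a)))
dense_a
# [1]  2 NA NA NA NA  3 NA NA  4 NA NA  1

# Same idea on the filled-in vector: tied values share one rank
b <- c(1300, 1300, 1300, 1300, 1300, 1500, 1500, 1500, 6000, 6000, 6000, 900)
dense_b <- match(b, sort(unique(b)))
dense_b
# [1] 2 2 2 2 2 3 3 3 4 4 4 1
```

By default rank() averages the ranks of tied values, which is where fractional results on a vector with repeats can come from.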
First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. For each row, I want to count the number of values that meet a criterion. A sample of my dataset is listed below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with :
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to keep only the non-NA values (!is.na) and then count how many of them were less than zero:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
A = c(NA, 3),
B = c(2, NA),
C = c(-1, -2),
D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1
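Applied to the five sample rows from the question, the same approach can be sketched as follows. The data frame is retyped by hand from the question's printout, and the code assumes that every column except Gene is a numeric treatment column; vals is just an illustrative name.

```r
# Sample rows retyped from the question
b01 <- data.frame(
  Gene = c("AARS_3", "AASDHPPT_21936", "AATF_26432", "ABCC2_22", "ABCC2_26620"),
  A = c(NA, NA, NA, 4.501518, NA),
  B = c(NA, NA, NA, 3.17992, NA),
  C = c(4.168365, NA, NA, NA, NA),
  D = c(NA, -3.221287, NA, NA, NA)
)

vals <- b01[, -1]  # assumption: everything except Gene is a treatment column

b01$NA.Count <- rowSums(is.na(vals))            # missing values per row
b01$lt0      <- rowSums(vals < 0, na.rm = TRUE) # values below zero per row
b01$gt0      <- rowSums(vals > 0, na.rm = TRUE) # values above zero per row

b01[, c("Gene", "NA.Count", "lt0", "gt0")]
```

Assigning the counts directly as new columns avoids the separate cbind() step, and the counts match the NA.Count, lt0 and gt0 columns the question asks for.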