Passing a list of variable names to a custom function with mutate

I am trying to apply a function over each row and create a new column that considers multiple columns, using the tidyverse. I was initially using rowwise(), but that was very slow. I want the list of columns passed to my custom function to be a variable, but I can't get it to work unless I explicitly list the variable names. For example, this works:
low_risk_codes <- c(0, 1, 10)
vars <- c("V1", "V2")
m <- matrix(1:9, ncol = 3)

classify_low_risk_drug <- function(...){
  t <- cbind(...)
  return(apply(t, 1, function(x) ifelse(any(x %in% low_risk_codes), 1, 0)))
}
as.data.frame(m) %>%
  mutate(val4 = classify_low_risk_drug(V1, V2))
But if I want it to evaluate the columns named in vars:
as.data.frame(m) %>%
  mutate(val4 = classify_low_risk_drug(vars))
I can't get it to work even if I include !!. What am I missing?
Any suggestions for how to do this with map instead are also appreciated!

This sounds like it will do what you want, but I need to qualify it (a lot). First, FYI, I am still wrapping my mind around NSE in R, but I find this vignette very helpful.
Related to the solution, I tried to speed up the function by avoiding rowwise() and apply(). It should be quicker with rapply()/rowSums(), but I did not benchmark it. It may run into issues with very large data because rowSums() will convert the data frame into a matrix, but that probably won't be a problem. In theory, you should also be able to use select helpers / unquoted variable names / column positions (if you so dare).
Also, I find it a little quirky that you need to supply the dataframe as the first argument (i.e., as .), but there may be a way around that. I am certainly open to anyone who wants to edit this / use it as the base for their solution. Hope this helps / gets you going in the right direction!
classify_low_risk_drug <- function(.data, vars, codes, na.rm = FALSE){
  # flag cells whose value is in `codes`, then check each row for any hit
  df <- rapply(.data, function(x) x %in% codes, how = "replace")
  as.integer(rowSums(select(df, !!enquo(vars)), na.rm = na.rm) > 0)
}
as.data.frame(m) %>%
  mutate(val4 = classify_low_risk_drug(., vars = vars, codes = c(0, 1, 10)))
V1 V2 V3 val4
1 1 4 7 1
2 2 5 8 0
3 3 6 9 0
EDIT: you could improve the speed a little by avoiding the matrix conversion, using lapply() with pmax():
classify_low_risk_drug2 <- function(.data, vars, codes, na.rm = FALSE){
  # pmax() across the per-column %in% results gives a row-wise "any"
  as.integer(do.call(pmax, lapply(select(.data, !!enquo(vars)), `%in%`, codes)))
}

Related

How to use custom functions in mutate (dplyr)?

I'm rewriting all my code using dplyr and need help with the mutate / mutate_at functions. All I need is to apply a custom function to two columns in my table. Ideally, I would reference these columns by their indices, but for now I can't make it work even when referencing them by name.
The function is:
binom.test.p <- function(x) {
  if (is.na(x[1]) | is.na(x[2]) | (x[1] + x[2]) < 10) {
    return(NA)
  } else {
    return(binom.test(x, alternative = "two.sided")$p.value)
  }
}
My data:
table <- data.frame(geneId=c("a", "b", "c", "d"), ref_SG1_E2_1_R1_Sum = c(10,20,10,15), alt_SG1_E2_1_R1_Sum = c(10,20,10,15))
So I do:
table %>%
  mutate(Ratio = binom.test.p(c(ref_SG1_E2_1_R1_Sum, alt_SG1_E2_1_R1_Sum)))
Error: incorrect length of 'x'
If I do:
table %>%
  mutate(Ratio = binom.test.p(ref_SG1_E2_1_R1_Sum, alt_SG1_E2_1_R1_Sum))
Error: unused argument (c(10, 20, 10, 15))
The second error is probably because my function needs one vector and gets two parameters instead.
But even setting my function aside, this works:
table %>%
  mutate(sum = ref_SG1_E2_1_R1_Sum + alt_SG1_E2_1_R1_Sum)
This doesn't:
table %>%
  mutate(.cols = c(2:3), .funs = funs(sum = sum(.)))
Error: wrong result size (2), expected 4 or 1
So it's probably my misunderstanding of how dplyr works.
Your problem is with binom.test rather than dplyr: binom.test is not vectorized, so you cannot expect it to work on vectors. You can use mapply on the two columns within mutate:
table %>%
  mutate(Ratio = mapply(function(x, y) binom.test.p(c(x, y)),
                        ref_SG1_E2_1_R1_Sum,
                        alt_SG1_E2_1_R1_Sum))
# geneId ref_SG1_E2_1_R1_Sum alt_SG1_E2_1_R1_Sum Ratio
#1 a 10 10 1
#2 b 20 20 1
#3 c 10 10 1
#4 d 15 15 1
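For what it's worth, a purrr equivalent of the mapply() call (my sketch, not part of the original answer; map2_dbl() zips the two columns pairwise):
library(purrr)
table %>%
  mutate(Ratio = map2_dbl(ref_SG1_E2_1_R1_Sum, alt_SG1_E2_1_R1_Sum,
                          ~ binom.test.p(c(.x, .y))))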
As for the last one, you need mutate_at instead of mutate:
table %>%
  mutate_at(.vars = c(2:3), .funs = funs(sum = sum(.)))
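A side note beyond the original answer: in current dplyr releases funs() is deprecated, and the same computation would typically be written with across() (dplyr >= 1.0):
table %>%
  mutate(across(2:3, sum, .names = "{.col}_sum"))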
In many cases it's sufficient to create a vectorized version of the function:
your_function_V <- Vectorize(your_function)
The vectorized function is then usable in dplyr's mutate. See also this blog post.
The function posted in the question, however, takes a single two-element input built from two different columns. Therefore we need to modify it so the inputs are passed individually before we vectorize.
binom.test.p <- function(x, y) {
  # input x and y
  x <- c(x, y)
  if (is.na(x[1]) | is.na(x[2]) | (x[1] + x[2]) < 10) {
    return(NA)
  } else {
    return(binom.test(x, alternative = "two.sided")$p.value)
  }
}
# vectorized function
binom.test.p_V <- Vectorize(binom.test.p)

table %>%
  mutate(Ratio = binom.test.p_V(ref_SG1_E2_1_R1_Sum, alt_SG1_E2_1_R1_Sum))
# works!

R dataframe column multiplication with sapply

I need to multiply columns in an R data.frame, based on certain patterns in the column names. This is a very elementary task, but I struggle to make it work with sapply() or some related function. This is what I've tried thus far.
df <- data.frame("pA" = sample(1:100), "pB" = sample(1:100), "qA" = sample(1:100), "qB" = sample(1:100))
cols <- c("A","B")
multip <- function(df, col){
  dfp <- df[which(names(df) %in% paste0("p", col))]
  dfq <- df[which(names(df) %in% paste0("q", col))]
  dfv <- dfp * dfq
  setNames(dfv, paste0("v", col))
}

sapply(df, function(x) multip(x, cols))
I can make it work if I take it apart and forget the function and sapply parts but that would complicate my work. Is there some solution that would make this work?
You can use multip directly on 'df'
multip(df, cols)
Or without using multip
Map('*', df[grep('p', names(df))], df[grep('q', names(df))])
The problem with the sapply/lapply call is that the function receives only a single column at a time, which is not the set of arguments multip expects.
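If you also want the products bound back onto df under the v-prefixed names, here is a small sketch building on the Map() idea (the assignment form is my addition):
df[paste0("v", cols)] <- Map(`*`,
                             df[paste0("p", cols)],
                             df[paste0("q", cols)])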

Shorten code to filter data

Take this code:
quite_long_data_frame_name <- data.frame(variable.name = rnorm(50, 3, 2))
quite_long_data_frame_name$variable.name[quite_long_data_frame_name$variable.name > 2 & quite_long_data_frame_name$variable.name < 3] <- NA
In the last line, quite_long_data_frame_name$variable.name needs to be repeated 3 times. Is there any way to achieve the same result using quite_long_data_frame_name$variable.name just once? Can dplyr or magrittr achieve this?
Use subset and the negation of that logical vector:
subset( quite_long_data_frame_name, !(variable.name > 2 & variable.name < 3) )
If you want to destructively modify the original, then just assign that value to the original.
If you really do want a result with the NAs:
within(quite_long_data_frame_name,
       is.na(variable.name) <- (variable.name > 2 & variable.name < 3))
You will need to assign back to quite_long_data_frame_name if you want this result to replace the original.
In dplyr, I suppose you would do
quite_long_data_frame_name %>%
  mutate(variable.name = ifelse(variable.name > 2 & variable.name < 3,
                                NA, variable.name))
Now you only type the dataframe name once, but you have to type the variable name 4 times instead of 3. Could help if the variable names are short compared to the dataframe name. Unfortunately no more terse dplyr solution comes to mind.
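Since the question also mentions magrittr: the compound assignment pipe %<>% at least avoids retyping the data frame name for the assignment (a sketch using the same ifelse() as above):
library(magrittr)
quite_long_data_frame_name %<>%
  mutate(variable.name = ifelse(variable.name > 2 & variable.name < 3,
                                NA, variable.name))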
As an alternative to the attach solution (posted elsewhere), use with:
with(quite_long_data_frame_name,
     variable.name[variable.name > 2 & variable.name < 3] <- NA)
Still pretty long, and be aware that with() evaluates in a temporary environment, so, unlike within() above, it does not give you back a modified copy of the data frame. I don't know of any way to do this without typing variable.name at least 3 times.
Give your variables shorter names? :)
Note that if you wanted to actually filter (in the dplyr sense) it's easier
quite_long_data_frame_name %>%
  filter(variable.name > 2 & variable.name < 3)
but this is shorter in base R as well.
===== Edit ==========
For this specific conditional, you can use the %between% operator from the data.table package. Shorter, but not very general (note that %between% is inclusive at its bounds, unlike the strict inequalities above, and the same with() caveat applies). Aggregating everything here, we get
with(quite_long_data_frame_name, is.na(variable.name) <- variable.name %between% c(2, 3))

Drop columns per row based on a separate column value

Given a dummy data frame that looks like this:
Data1<-rnorm(20, mean=20)
Data2<-rnorm(20, mean=21)
Data3<-rnorm(20, mean=22)
Data4<-rnorm(20, mean=19)
Data5<-rnorm(20, mean=20)
Data6<-rnorm(20, mean=23)
Data7<-rnorm(20, mean=21)
Data8<-rnorm(20, mean=25)
Index<-rnorm(20,mean=5)
DF<-data.frame(Data1,Data2,Data3,Data4,Data5,Data6,Data7,Data8,Index)
What I'd like to do is remove (make NA) certain columns per row based on the Index column. I took the long way and did this to give you an idea of what I'm trying to do:
DF[DF$Index>5.0,8]<-NA
DF[DF$Index>=4.5 & DF$Index<=5.0,7:8]<-NA
DF[DF$Index>=4.0 & DF$Index<=4.5,6:8]<-NA
DF[DF$Index>=3.5 & DF$Index<=4.0,5:8]<-NA
DF[DF$Index>=3.0 & DF$Index<=3.5,4:8]<-NA
DF[DF$Index>=2.5 & DF$Index<=3.0,3:8]<-NA
DF[DF$Index>=2.0 & DF$Index<=2.5,2:8]<-NA
DF[DF$Index<=2.0,1:8]<-NA
This works fine as is, but is not very adaptable. If the number of columns change, or I need to tweak the conditional statements, it's a pain to rewrite the entire code (the actual data set is much larger).
What I would like to do is be able to define a few variables, and then run some sort of loop or apply to do exactly what the lines of code above do.
As an example, in order to replicate my long code, something along the lines of this kind of logic:
NumCol<-8
Max<-5
Min<-2.0
if index > Max, then drop NumCol
if index >= (Max - 0.5) & <= Max, then drop NumCol:(NumCol - 1)
repeat until you reach Min
I don't know if that's the most logical line of reasoning in R, and I'm pretty bad with Looping and apply, so I'm open to any line of thought that can replicate the above long lines of code with the ability to adjust the above variables.
If you don't mind changing your data.frame to a matrix, here is a solution that uses indexing by a matrix. The building of the two-column matrix of indices to drop is a nice review of the apply family of functions:
Seq <- seq(Min, Max, by = 0.5)
# for each row, the first column to drop, through column NumCol
col.idx <- lapply(findInterval(DF$Index, Seq) + 1, seq, to = NumCol)
# matching row numbers, repeated once per dropped column
row.idx <- mapply(rep, seq_along(col.idx), sapply(col.idx, length))
drop.idx <- as.matrix(data.frame(unlist(row.idx), unlist(col.idx)))

M <- as.matrix(DF)
M[drop.idx] <- NA
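If you need a data.frame again afterwards, converting back is trivial (my addition, not part of the original answer):
DF <- as.data.frame(M)  # same shape as before, now with the NAs in place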
Here is a memory-efficient (but I can't claim elegant) data.table solution.
It uses the very useful function findInterval() to replace your chain of less-than / greater-than comparisons.
library(data.table)
DT <- data.table(DF)

# create an index column in which 1:8 represents your greater-than /
# less-than bins
DT[, IND := findInterval(Index, c(-Inf, seq(2, 5, by = 0.5), Inf))]

# the columns you want to change
changing <- names(DT)[1:8]
setkey(DT, IND)

# loop through the indexes and alter by reference
for (.ind in DT[, unique(IND)]) {
  # bin 1 drops all 8 columns, bin 8 drops only the last one
  .which <- tail(changing, length(changing) - .ind + 1)
  # create a call to `:=`(a = as(NA, class(a)), b = as(NA, class(b)), ...)
  pairlist <- mapply(sprintf, .which, .which,
                     MoreArgs = list(fmt = '%s = as(NA, class(%s))'))
  char_exp <- sprintf('`:=`( %s )', paste(pairlist, collapse = ','))
  .e <- parse(text = char_exp)
  DT[J(.ind), eval(.e)]
}
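A simpler by-reference variant with data.table::set(), my sketch rather than the original answer's; it uses the fact that, with the IND coding above, column j should be set to NA exactly for rows with IND <= j:
for (j in seq_along(changing)) {
  # NA_real_ matches the numeric columns, avoiding a coercion warning
  set(DT, i = which(DT$IND <= j), j = changing[j], value = NA_real_)
}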

R loops: Adding a column to a table if does not already exist

I am trying to compile data from several files using for loops in R, and I would like to get all the data into one table. The following calculation is just an example.
library(reshape)
dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep = ""),
                   "Density_1" = rnorm(10, 4, 2),
                   "Density_2" = rnorm(10, 4, 2),
                   "Density_3" = rnorm(10, 4, 2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep = ""),
                   "Density_1" = rnorm(10, 4, 2),
                   "Density_2" = rnorm(10, 4, 2))
dat <- c("dat1", "dat2")

for(i in 1:length(dat)){
  data <- get(dat[i])
  melt.data <- melt(data, id = 1)
  assign(paste(dat[i], "tbl", sep = ""), cast(melt.data, ~ variable, mean))
}
rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column to dat2? I would like it to get the same column name ("Density_3" in this case) and be filled with zeros if it does not already exist. Assume that I have ~100 tables with the number of columns (Density_1, 2, 3, etc.) varying between 5 and 6.
I tried the following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
  dat.all$Density_3 <- 0
} else {
  dat.all$Density_3 <- dat.all$Density3
}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.
After staring at this question for a while I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill.
I would have constructed "dat" not as a character vector but as a list of two dataframes, used aggregate(..., FUN = mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill, that is), and then called do.call(rbind.fill, ...) on the resulting list. At any rate, this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
value Density_1 Density_2 Density_3
1 (all) 5.006709 4.088988 2.958971
2 (all) 4.178586 3.812362 NA
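For concreteness, a sketch of the list-based construction described above, staying with the question's melt/cast step rather than aggregate() (the object names are mine):
library(plyr)     # for rbind.fill
library(reshape)

# a list of data frames instead of a character vector of their names
dat_list <- list(dat1, dat2)

# per-column means for each data frame, then stack them; rbind.fill()
# pads the missing Density_3 column with NA
tbls <- lapply(dat_list, function(d) cast(melt(d, id = 1), ~ variable, mean))
do.call(rbind.fill, tbls)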
