R: using apply over two data.frames

I want to use apply instead of a for-loop. The problem is, my for-loop uses two data.frames as an input. For example:
x <- data.frame(col1=c(1,NA,3,NA), col2=c(9,NA,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
output <- rep(NA,2)
for(i in 1:2)
{
output[i] <- sum(is.na(x[,i]))+sum(y[,i])
}
The result here is, correctly, c(12,27).
But if I try a function with apply:
test <- function(vector1,vector2) sum(is.na(vector1))+sum(vector2)
apply(x,y,MARGIN=2,FUN=test)
With apply the result is c(38,37).
How can I fix this?

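apply iterates over a single array only; the extra argument y is passed whole to every call, so sum(y) = 36 gets added each time. You can use mapply instead of apply, which walks the columns of both data frames in parallel: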
x <- data.frame(col1=c(1,NA,3,NA), col2=c(9,NA,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
test <- function(vector1,vector2) sum(is.na(vector1))+sum(vector2)
mapply(test, x, y)
# col1 col2
# 12 27
?mapply
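As a quick check, here is a small sketch that runs the three variants side by side on the data from the question. The vapply call is an illustrative alternative, not part of the original answer, and the names via_apply, via_mapply and via_vapply are made up for this example.

```r
x <- data.frame(col1 = c(1, NA, 3, NA), col2 = c(9, NA, 11, 12))
y <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(5, 6, 7, 8))
test <- function(vector1, vector2) sum(is.na(vector1)) + sum(vector2)

# apply() iterates over x only; y is handed unchanged to every call,
# so each result is (NAs in the column) + the sum of ALL of y.
via_apply <- apply(x, MARGIN = 2, FUN = test, vector2 = y)

# mapply() pairs the columns of x and y, one call per pair.
via_mapply <- mapply(test, x, y)

# Index-based alternative; vapply() also checks that each result is one number.
via_vapply <- vapply(seq_along(x), function(i) test(x[[i]], y[[i]]), numeric(1))

print(via_apply)
print(via_mapply)
print(via_vapply)
```

Only the apply call reproduces the unexpected 38 and 37; the other two agree with the for-loop.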

Related

Applying a Function to a Data Frame : lapply vs traditional way

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function over rows, set MARGIN = 1; to apply it over columns, set MARGIN = 2.
lapply, as the name suggests, is a list-apply. Because a data.frame is a list of columns, you can use lapply to compute over columns, but for rectangular data apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: functions written for a single pair of values often turn out to be perfectly vectorized. Probably the best way to call some_function is without any function of the apply family, as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)
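If the function is not vectorized, the direct call above will not work, and mapply over the two columns is one way around it. The sketch below assumes a hypothetical some_function that uses if(), which only accepts a single TRUE/FALSE, and a smaller grid than the one in the question to keep the output short.

```r
# Smaller grid than in the question; columns are named explicitly.
data_frame <- expand.grid(Var1 = seq(1, 3, 1), Var2 = seq(1, 3, 1))

# Hypothetical function that is NOT vectorized: if() needs a single condition.
some_function <- function(x, y) {
  if (x > y) x - y else x + y
}

# mapply() calls some_function once per row, pairing the two columns.
data_frame$new_column <- mapply(some_function, data_frame$Var1, data_frame$Var2)
print(data_frame)
```

This is slower than a truly vectorized call, so it is worth trying the direct form first.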

Using sapply instead of loop in R

I have a function that requires 4 parameters:
myFun <- function(a,b,c,d){}
I have a matrix where each row contains the parameters:
myMatrix = matrix(c(a1,a2,b1,b2,c1,c2,d1,d2), nrow=2, ncol=4)
Currently I have a loop which feeds the parameters to myFun:
m <- myMatrix
i <- 1
someVector <- c()
while (i<(length(m[,1])+1)){
someVector[i] <-
myFun(m[i,1],m[i,2],m[i,3],m[i,4])
i = i+1
}
print(someVector)
What I would like to know is whether there is a better way to get this same result using sapply instead of a loop.
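You can use mapply() here, which accepts several vectors as arguments and calls the function on their elements in parallel. Turn your matrix into a data frame first:
df <- as.data.frame(myMatrix)
# as.data.frame() names the columns of an unnamed matrix V1, V2, ...
results <- mapply(myFun, df$V1, df$V2, df$V3, df$V4)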
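For a concrete comparison, here is a sketch with made-up values. The body of myFun and the numbers in myMatrix are hypothetical stand-ins for the question's placeholders; the three calls are meant to give the same result as the while loop.

```r
# Hypothetical stand-ins for the question's placeholders.
myFun <- function(a, b, c, d) a + b * c - d
myMatrix <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 2, ncol = 4)
# Row 1 holds (1, 3, 5, 7); row 2 holds (2, 4, 6, 8).

# mapply() over the four columns: one call per row, no data frame needed.
via_mapply <- mapply(myFun, myMatrix[, 1], myMatrix[, 2], myMatrix[, 3], myMatrix[, 4])

# apply() over rows, unpacking each row into the four arguments.
via_apply <- apply(myMatrix, 1, function(r) myFun(r[1], r[2], r[3], r[4]))

# sapply() over row indices, closest in spirit to the original while loop.
via_sapply <- sapply(seq_len(nrow(myMatrix)), function(i) {
  myFun(myMatrix[i, 1], myMatrix[i, 2], myMatrix[i, 3], myMatrix[i, 4])
})

print(via_mapply)
```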

How can I do a t.test on an entire data.frame and extract the p-values?

My dataset looks something like this:
a <- rnorm(2)
b <- rnorm(2)-3
x <- rnorm(13)
y <- rnorm(2)-1
z <- rnorm(2)-2
eg <- expand.grid(a,b,x,y,z)
treatment <- c(rep(1, 2), rep(0,3))
eg <- data.frame(t(eg))
row.names(eg) <- NULL
eg <- cbind(treatment, eg)
What I need to do is run t-tests on each column, comparing the treatment =1 group to the treatment=0 group. I'd like to then have a vector of p-values. I've tried (several versions of) doing this through a loop, but I continue to receive the same error message: "undefined columns selected." Here's my code currently:
p.values <- c(rep(NA, 208))
for (i in 2:209) {
x <- data.frame(eg[eg$treatment==1][,i][1:2])
y <- data.frame(eg[eg$treatment==0][,i][3:5])
value <- t.test(x=x, y=y)['p.value']
p.values[i] <- value
}
I added the data.frame() after reading someone mention that for loops only loop through dataframes, but it didn't change anything. I am sure there is an easier way to do this, perhaps by using something in the apply family? Does anyone have any suggestions? Thanks so much!
A couple of options, both using sapply:
sapply(
eg[-1], function(x) t.test(x[eg$treatment==1],x[eg$treatment==0])[["p.value"]]
)
Or looping over the names instead:
sapply(
names(eg[-1]),
function(x) t.test(as.formula(paste(x,"~ treatment")),data=eg)[["p.value"]]
)
Or even mapply:
mapply(function(x,y) t.test(x ~ y,data=cbind(x,y))[["p.value"]], eg[-1], eg[1])
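The first option can also be written with vapply, which additionally checks that every test yields exactly one number. This is a sketch on a small made-up data set (three observations per group and three measurement columns, both assumptions chosen for brevity), not the data from the question.

```r
set.seed(1)

# Made-up data: a treatment indicator plus three measurement columns.
eg <- data.frame(
  treatment = c(1, 1, 1, 0, 0, 0),
  X1 = rnorm(6),
  X2 = rnorm(6),
  X3 = rnorm(6)
)

# One Welch t-test per column (treatment column excluded), keeping the p-value.
p_values <- vapply(
  eg[-1],
  function(col) t.test(col[eg$treatment == 1], col[eg$treatment == 0])$p.value,
  numeric(1)
)
print(p_values)
```

The result is a named numeric vector with one p-value per column.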

Clip outliers in columns in df2,3,4... based on quantiles from columns in df.tr

I am trying to replace the "outliers" in each column of a dataframe with the Nth percentile.
n <- 1000
set.seed(1234)
df <- data.frame(a=runif(n), b=rnorm(n), c=rpois(n,1))
df.t1 <- as.data.frame(lapply(df, function(x) { q <- quantile(x,.9,names=F); x[x>q] <- q; x }))
I need the computed quantiles to truncate other dataframes. For example, I compute these quantiles on a training dataset and apply them; I want to use those same thresholds in several test datasets. Here's an alternative approach which allows that.
q.df <- sapply(df, function(x) quantile(x,.9,names=F))
df.tmp <- rbind(q.df, df.t1)
df.t2 <- as.data.frame(lapply(df.tmp, function(x) { x[x>x[1]] <- x[1]; x }))
df.t2 <- df.t2[-1,]
rownames(df.t2) <- NULL
identical(df.t1, df.t2)
The dataframes are very large and hence I would prefer not to use rbind and then delete the row later. Is it possible to truncate the columns in the dataframes using q.df but without having to rbind? Thanks.
So just compute the quantiles once, then write a function that applies the clipping directly to each column. The indexed assignment x[x>q] <- q inside your lapply call works, but you don't need it: ifelse returns the clipped column as a single vectorized expression. ifelse is your friend, for vectorization.
# Make up some dummy df2 output (it's supposed to have 1000 cols really)
df2 <- data.frame(d=runif(1000), e=rnorm(1000), f=runif(1000))
require(plyr)
print(colwise(summary)(df2)) # show the summary before we clamp...
# Compute quantiles on df1...
df1 <- df
df1.quantiles <- apply(df1, 2, function(x, prob=0.9) { quantile(x, prob, names=F) })
# ...now clamp by sweeping col-index across both quantile vector, and df2 cols
clamp <- function(x, xmax) { ifelse(x<=xmax, x, xmax) }
for (j in 1:ncol(df2)) {
df2[,j] <- clamp(df2[,j], df1.quantiles[j]) # don't know how to use apply(...,2,)
}
print(colwise(summary)(df2)) # show the summary after we clamp...
Reference:
[1] "Clip values between a minimum and maximum allowed value in R"
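The loop can be avoided by pairing each column with its threshold. Below is a sketch using Map and pmin from base R in place of the clamp helper; this is a swapped-in technique, not the answer's own code. The names df_train, df_test, q_train and df_clipped are made up, and the sketch assumes both data frames have the same columns in the same order.

```r
set.seed(1234)
n <- 1000

# Made-up training and test sets with matching columns.
df_train <- data.frame(a = runif(n), b = rnorm(n), c = rpois(n, 1))
df_test  <- data.frame(a = runif(n), b = rnorm(n), c = rpois(n, 1))

# One 90th-percentile threshold per training column (named vector a, b, c).
q_train <- sapply(df_train, quantile, probs = 0.9, names = FALSE)

# Map() pairs each test column with its threshold; pmin() caps the values.
df_clipped <- as.data.frame(Map(pmin, df_test, q_train))

print(sapply(df_clipped, max))
print(q_train)
```

No rbind is needed, and the same q_train can be reused for any number of test data frames.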

probe global variables to call inside function

I want to access variables in .GlobalEnv from inside a function. Basically, I want to concatenate any number of data frames into a matrix.
Here is some dummy code:
Alpha <- data.frame(lon=124.9167,lat=1.53333)
Alpha_2 <- data.frame(lon=3.13333, lat=42.48333)
Alpha_3 <- data.frame(lon=-91.50667, lat=27.78333)
myfunc <- function(x){
vars <- ls(.GlobalEnv, pattern=x)
mat <- as.matrix(rbind(vars[1], vars[2], vars[3]))
return(mat)
}
When calling myfunc('Alpha') I would like the same thing to be returned as when you run:
as.matrix(rbind(Alpha, Alpha_2, Alpha_3))
lon lat
1 124.91670 1.53333
2 3.13333 42.48333
3 -91.50667 27.78333
Any pointers would be appreciated, thanks!
You can use get to retrieve a variable by name, or mget to retrieve several at once. Here mget fetches all matching variables as a list, and do.call(rbind, ...) binds them together.
myfunc <- function(x){
vars <- ls(.GlobalEnv, pattern=x)
df <- do.call(rbind, mget(vars, .GlobalEnv)) # courtesy @Roland
return(df)
}
myfunc("Alpha")
#               lon      lat
# Alpha   124.91670  1.53333
# Alpha_2   3.13333 42.48333
# Alpha_3 -91.50667 27.78333
Note, in practice, you probably want to check that the variables that match the pattern actually are what you think they are, but this gives you the rough tools you want.
Old version (2nd line of func):
df <- do.call(rbind, lapply(vars, get, envir=.GlobalEnv))
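The check mentioned above, plus the conversion to a matrix that the question asked for, could look roughly like this. It is a sketch: collect_frames is a hypothetical name, and the demo uses a scratch environment instead of .GlobalEnv so that it does not depend on what is in your workspace.

```r
collect_frames <- function(pattern, envir = .GlobalEnv) {
  vars <- ls(envir, pattern = pattern)
  # Keep only the matches that really are data frames.
  frames <- Filter(is.data.frame, mget(vars, envir = envir))
  # unname() drops the variable names so rbind() does not use them as row names.
  as.matrix(do.call(rbind, unname(frames)))
}

# Demo in a scratch environment (an assumption, to leave .GlobalEnv untouched).
env <- new.env()
assign("Alpha",      data.frame(lon = 124.9167,  lat = 1.53333),  envir = env)
assign("Alpha_2",    data.frame(lon = 3.13333,   lat = 42.48333), envir = env)
assign("Alpha_3",    data.frame(lon = -91.50667, lat = 27.78333), envir = env)
assign("Alpha_note", "not a data frame",                          envir = env)

m <- collect_frames("^Alpha", envir = env)
print(m)
```

Anchoring the pattern with ^ avoids picking up variables that merely contain "Alpha" somewhere in their name.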
