R, apply function on every second column of a data frame? - r

How to apply a function on every second column of a data frame? That is to say, how to modify df2 <- sapply(df1, fun) such that df2 equals df1 but with fun applied to every second column? Here is what I tried:
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
df1 <- data.frame(a,b)
df2 <- sapply(df1[c(TRUE, FALSE)], function(x) x^2)
identical(dim(df1), dim(df2)) # FALSE
The problem with this code is that it drops every column to which fun was not applied (dim(df2) # 5 1).

Assigning variables to slices
You can assign new values for subsets of an object. Say for:
x <- c(1,2,3)
x[2] <- 4
Now x will be c(1,4,3). Similarly, you can do this for rows/columns of a matrix or data frame. Here we use the apply function with 2 as the second argument to work over columns (1 would work over rows). I recommend the seq function to generate the indices: from=1, by=2 gives the odd indices and from=2, by=2 gives the even ones. Specifying it this way generalises to other subsets and makes it straightforward to check that you got it right.
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
df1 <- data.frame(a,b)
df2 <- df1
# drop = FALSE keeps a single selected column as a data frame, which apply needs
df2[,seq(1, ncol(df2), 2)] <- apply(df2[,seq(1, ncol(df2), 2), drop = FALSE], 2, function(x) x^2)
Loops
Note that you can also do this with a loop:
df2 <- df1
for(col in seq(1, ncol(df2), 2)) df2[,col] <- sapply(df2[,col], function(x) x^2)
Vectorised functions
Since the squared operation is "vectorised" in R, in this case you could also do:
df2 <- df1
for(col in seq(1, ncol(df2), 2)) df2[,col] <- df2[,col]^2
Or use vectorisation completely:
df2 <- df1
df2[,seq(1, ncol(df2), 2)] <- df2[,seq(1, ncol(df2), 2)]^2
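The same idea can be wrapped in a small helper for functions that are not vectorised over a whole data frame. This is only a sketch: apply_every_second and its from argument are names made up for illustration, and it assumes the data frame has at least from columns.

```r
# Hypothetical helper: apply `fun` to every second column, starting at `from`,
# and leave all other columns untouched.
apply_every_second <- function(df, fun, from = 1) {
  idx <- seq(from, ncol(df), by = 2)   # from = 1 -> odd columns, from = 2 -> even columns
  df[idx] <- lapply(df[idx], fun)      # lapply keeps one result per selected column
  df
}

df1 <- data.frame(a = 1:5, b = 6:10, c = 11:15)
df2 <- apply_every_second(df1, function(x) x^2)
print(df2)
```

Because the result is assigned back into the selected columns only, df2 keeps the dimensions of df1.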

Related

ANOVA repeated measure on multiple data frames r

I have hundreds of data frames. I need to perform ANOVA RM tests on each of these data frames. The output should be one single data frame with the mean of each p-value.
I tried:
#create data frames
df1 <- data.frame(replicate(16,sample(-10:10,10,rep=TRUE)))
df2 <- data.frame(replicate(16,sample(-10:10,10,rep=TRUE)))
df3 <- data.frame(replicate(16,sample(-10:10,10,rep=TRUE)))
Group <- c(rep("A",8),rep("B",8))
Time <- c(rep("before",4),rep("after",4),rep("before",4),rep("after",4))
Name <- rep(rep(1:4, 4))
conds <- data.frame(Name,Time,Group)
#create list
list <- list(df1,df2,df3)
#for loop ANOVA repeated measures
for ( i in list){
data <- cbind(conds,i)
t=NULL
name <- colnames(data)[4:ncol(data)]
for(i in 4:ncol(data)) { z <- aov(data[,i] ~ Group*Time+Error(Name/(Group*Time)), data=data)
sz <- as.list(summary(z))
t <- as.data.frame(c(t,sz[4]$`Error: Name:Group:Time`[[1]]$`Pr(>F)`[1]))
t
}
}
mean(t)
R is a vectorized language, so explicit for loops can often be avoided. Here you could use an sapply approach.
When you build the list of data frames, name its elements (df1=, ...); the names carry through to the result and show which data frame each value was computed from.
(And don't use list as an object name, since it masks the list function and gets confusing. data, df and friends are also "bad" names; you can always check whether a name is already taken with e.g. ?list.)
list1 <- list(df1=df1, df2=df2, df3=df3)
res <- sapply(list1, function(x) {
  dat <- cbind(conds, x)
  sapply(dat[-(1:3)], function(y) {
    z <- aov(y ~ Group*Time + Error(Name/(Group*Time)), data=dat)
    sz <- summary(z)
    p <- sz$`Error: Name:Group:Time`[[1]][1, 5]
    p
  })
})
From the resulting matrix we take the column means.
colMeans(res)
# df1 df2 df3
# 0.4487419 0.4806528 0.4847789
Data:
set.seed(42)
df1 <- data.frame(replicate(16,sample(-10:10,16,rep=TRUE)))
df2 <- data.frame(replicate(16,sample(-10:10,16,rep=TRUE)))
df3 <- data.frame(replicate(16,sample(-10:10,16,rep=TRUE)))
conds <- data.frame(Name=c(rep("A",8),rep("B",8)),
Time=c(rep("before",4),rep("after",4),
rep("before",4),rep("after",4)),
Group=rep(1:4, 4))
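To see how the element names of the list end up in the result, here is a stripped-down sketch of the same nested sapply pattern. The inner computation (a column mean) is only a stand-in for the p-value extraction, and frames is a made-up name.

```r
set.seed(1)
# Named list: the names df1/df2 become the column names of the result
frames <- list(
  df1 = data.frame(x = rnorm(5), y = rnorm(5)),
  df2 = data.frame(x = rnorm(5), y = rnorm(5))
)

# Outer sapply runs over data frames, inner sapply over their columns
res <- sapply(frames, function(d) sapply(d, mean))
print(res)            # rows: x, y; columns: df1, df2
print(colMeans(res))  # one value per data frame, labelled df1 and df2
```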

Loop through df and create new df in R

I have a df (10 rows, 15 columns)
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
I want to loop over each column, do something to each row and create a new df with the answer.
I actually want to do a linear regression on each column, which gives me back a list for each column. For example, I have a second df with what I want to put into the lm:
df2<-data.frame(replicate(2,sample(0:1,10,rep=TRUE)))
I then want to do something like:
new_df <- data.frame()
for (i in 1:ncol(df)){
j<-lm(df[,i] ~ df2$X1 + df2$X2)
temp_df<-j$residuals
new_df[,i]<-cbind(new_df,temp_df)
}
I get the error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 0, 8
I have checked other similar posts but they always seem to involve a function or something similarly complex for a newbie like me. Please help
This can be done without loops, but for your understanding, here is how to do it with a loop:
new_df <- df
for (i in names(df)) {
j<-lm(df[,i] ~ df2$X1 + df2$X2)
new_df[i] <- j$residuals
}
You initialise new_df as an empty data frame with 0 rows and 0 columns, so assigning a column of residuals to it gives an error. Instead, assign the original df to new_df, since both share the same structure, and then use the loop above.
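A minimal sketch of the difference, using made-up normal data instead of the 0/1 sample from the question; empty_df and err are illustrative names:

```r
set.seed(24)
df  <- data.frame(replicate(3, rnorm(10)))   # responses X1..X3
df2 <- data.frame(replicate(2, rnorm(10)))   # predictors X1, X2

# Assigning a 10-row column into a data frame with 0 rows fails
empty_df <- data.frame()
err <- tryCatch({ empty_df[, 1] <- rnorm(10); NULL }, error = function(e) e)
print(conditionMessage(err))

# Starting from a copy of df gives the right shape up front
new_df <- df
for (nm in names(df)) {
  new_df[[nm]] <- residuals(lm(df[[nm]] ~ df2$X1 + df2$X2))
}
print(dim(new_df))
```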
Update
Based on the new example
# df and df2 both contain columns named X1 and X2, so take the predictors from df2 only
lst1 <- lapply(df, function(y) lm(y ~ X1 + X2, data = df2)$residuals)
out <- setNames(data.frame(lst1), names(df))
Also, this doesn't need any loop
out2 <- lm(as.matrix(df) ~ X1 + X2, data = df2)$residuals
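As a quick sanity check, the per-column fits and the single matrix-response fit should give the same residuals. The sketch below assumes the predictors are taken from df2 and uses made-up normal data; per_column, all_at_once and same are illustrative names.

```r
set.seed(24)
df  <- data.frame(replicate(4, rnorm(10)))   # responses
df2 <- data.frame(replicate(2, rnorm(10)))   # predictors X1, X2

# One lm per response column
per_column <- sapply(df, function(y) residuals(lm(y ~ X1 + X2, data = df2)))

# One lm with a matrix response: fits every column in a single call
all_at_once <- residuals(lm(as.matrix(df) ~ X1 + X2, data = df2))

# Compare values only, ignoring row/column labels
same <- isTRUE(all.equal(unname(per_column), unname(all_at_once)))
print(same)
```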
Old
We can do this easily without any loop
new_df <- df + 10
---
If we need a loop, it can be done with `lapply`
new_df <- df
new_df[] <- lapply(df, function(x) x + 10)
---
Or with a `for` loop
lst1 <- vector('list', ncol(df))
for(i in seq_along(df)) lst1[[i]] <- df[, i] + 10
new_df <- as.data.frame(lst1)
data
set.seed(24)
df <- data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
df2 <- data.frame(replicate(2,sample(0:1,10,rep=TRUE)))
I would do as suggested by akrun. But if you do need (or want) to loop for some reason, you can use:
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
new_df <- data.frame(replicate(15, rep(NA, 10)))
for (i in 1:ncol(df)){
new_df[ ,i] <- df[ , i] + 10
}

apply a function with two dataframes as input in r

I want to get the total number of NAs that mismatch between two data frames.
I have found a way to get this for two vectors as follows:
compareNA <- function(v1,v2) {
same <- (v1 == v2) | (is.na(v1) & is.na(v2))
same[is.na(same)] <- FALSE
n <- 0
for (i in 1:length(same))
if (same[i] == "FALSE"){
n <- n+1
}
return(n)
}
Let's say I have vectors a and b; comparing them gives 2 as a result:
a <- c(1,2,NA, 4,5,6,NA,8)
b <- c(NA,2,NA, 4,NA,6,NA,8)
h <- compareNA(a,b)
h
[1] 2
My question is: how do I apply this function to data frames instead of vectors?
Take these data frames as an example:
a2 <- c(1,2,NA,NA,NA,6,NA,8)
b2 <- c(1,NA,NA,4,NA,6,NA,NA)
df1 <- data.frame(a,b)
df2 <- data.frame(a2,b2)
What I expect as a result is 5, since this is the total number of NAs that appear in df2 but not in df1. Any suggestions on how to make this work?
Here's a second thought.
xy1 <- data.frame(a = c(NA, 2, 3), b = rnorm(3))
xy2 <- data.frame(a = c(NA, 2, 4), b = rnorm(3))
com <- intersect(colnames(xy1), colnames(xy2))
sum(xy1[, com] == xy2[, com], na.rm = TRUE)
Note that this counts the non-NA cells that are equal in both data frames. If you don't want to worry about column names (but you should), you can make sure the columns align perfectly. In that case, the intersect step is redundant.
sum(xy1 == xy2, na.rm = TRUE)
A third way (assuming dimensions of df1 & df2 are same):
sum(sapply(1:ncol(df1), function(x) compareNA(df1[,x], df2[,x])))
# 5
It would be easier to force both data frames to have the same column names and compare column by column where the names match. You can then simply loop over the columns and add the result of your function to a running total.
compareNA.df <- function(df1, df2) {
total <- 0
common_columns <- intersect(colnames(df1), colnames(df2))
for (col in common_columns) {
total <- total + compareNA(df1[[col]], df2[[col]])
}
return(total)
}
colnames(df2) <- c("a", "b")
compareNA.df(df1, df2)
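The counting loop inside compareNA can also be replaced by a sum over the logical vector, and the column-by-column comparison by mapply. This is a sketch that pairs columns by position, so it assumes both data frames have the same number of columns in the same order; count_mismatch is a made-up name.

```r
# Vectorised variant of compareNA: number of positions where the vectors differ,
# treating NA vs NA as a match and NA vs value as a mismatch
count_mismatch <- function(v1, v2) {
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  sum(!same)
}

a  <- c(1, 2, NA, 4, 5, 6, NA, 8)
b  <- c(NA, 2, NA, 4, NA, 6, NA, 8)
a2 <- c(1, 2, NA, NA, NA, 6, NA, 8)
b2 <- c(1, NA, NA, 4, NA, 6, NA, NA)
df1 <- data.frame(a, b)
df2 <- data.frame(a2, b2)

h <- count_mismatch(a, b)
# mapply walks both data frames column by column (by position)
total <- sum(mapply(count_mismatch, df1, df2))
print(c(h = h, total = total))
```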

Clip outliers in columns in df2,3,4... based on quantiles from columns in df.tr

I am trying to replace the "outliers" in each column of a dataframe with the Nth percentile.
n <- 1000
set.seed(1234)
df <- data.frame(a=runif(n), b=rnorm(n), c=rpois(n,1))
df.t1 <- as.data.frame(lapply(df, function(x) { q <- quantile(x,.9,names=F); x[x>q] <- q; x }))
I need the computed quantiles to truncate other dataframes. For example, I compute these quantiles on a training dataset and apply it; I want to use those same thresholds in several test datasets. Here's an alternative approach which allows that.
q.df <- sapply(df, function(x) quantile(x,.9,names=F))
df.tmp <- rbind(q.df, df.t1)
df.t2 <- as.data.frame(lapply(df.tmp, function(x) { x[x>x[1]] <- x[1]; x }))
df.t2 <- df.t2[-1,]
rownames(df.t2) <- NULL
identical(df.t1, df.t2)
The dataframes are very large and hence I would prefer not to use rbind and then delete the row later. Is it possible to truncate the columns in the dataframes using q.df but without having to rbind? Thx.
So just write a function that computes the quantiles once, then applies the clipping directly to each column. Instead of the conditional <- assignment inside your lapply call, you can use ifelse, which returns a vectorized result for the entire column. ifelse is your friend for vectorization.
# Make up some dummy df2 output (it's supposed to have 1000 cols really)
df2 <- data.frame(d=runif(1000), e=rnorm(1000), f=runif(1000))
require(plyr)
print(colwise(summary)(df2)) # show the summary before we clamp...
# Compute quantiles on df1...
df1 <- df
df1.quantiles <- apply(df1, 2, function(x, prob=0.9) { quantile(x, prob, names=F) })
# ...now clamp by sweeping col-index across both quantile vector, and df2 cols
clamp <- function(x, xmax) { ifelse(x<=xmax, x, xmax) }
for (j in 1:ncol(df2)) {
df2[,j] <- clamp(df2[,j], df1.quantiles[j]) # don't know how to use apply(...,2,)
}
print(colwise(summary)(df2)) # show the summary after we clamp...
Reference:
[1] "Clip values between a minimum and maximum allowed value in R"
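An alternative sketch of the same clamping step uses base R's pmin together with Map, which pairs each test column with the matching training threshold by position. The names train, test, q and clipped are made up for illustration, and the data are smaller than in the question.

```r
set.seed(1234)
train <- data.frame(a = runif(100), b = rnorm(100), c = rpois(100, 1))
test  <- data.frame(a = runif(50),  b = rnorm(50),  c = rpois(50, 1))

# One 90% quantile per training column, as a named numeric vector
q <- vapply(train, quantile, numeric(1), probs = 0.9, names = FALSE)

# pmin(x, threshold) caps every value of x at the threshold;
# Map applies it to column j of test with threshold j of q
clipped <- test
clipped[] <- Map(pmin, test, q)

print(q)
print(sapply(clipped, max))
```

This assumes the test data frame has the same columns, in the same order, as the training data.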

How to modify some but not all variables of a data frame?

Suppose there is a data.frame where some variables are coded as integers:
a <- c(1,2,3,4,5)
b <- as.integer(c(2,3,4,5,6))
c <- as.integer(c(5,1,0,9,2))
d <- as.integer(c(5,6,7,3,1))
e <- c(2,6,1,2,3)
df <- data.frame(a,b,c,d,e)
str(df)
Suppose I want to convert columns b to d to numeric:
varlist <- names(df)[2:4]
lapply(varlist, function(x) {
df$x <- as.numeric(x, data=x)
})
str(df)
does not work.
I tried:
df$b <- as.numeric(b, data=df)
df$c <- as.numeric(c, data=df)
df$d <- as.numeric(d, data=df)
str(df)
which works fine.
Questions:
How do I do this (in a loop or, better, with lapply; I'm a Stata person and as such used to writing loops)?
And more generally: how do I apply any function to a list of variables in a data.frame (e.g. multiply each variable on the list by some other variable, which always stays the same; BONUS: or changes with each variable on the list)?
For the first question you can use sapply:
df[2:4] <- sapply(df[2:4],as.numeric)
For the second you can use mapply. For example, to multiply the 3 variables (2 to 4) by 3 different random scalars:
df[2:4] <- mapply(function(x,y)df[[x]]*y,2:4,rnorm(3))
df[,2:4] <- sapply(df[,2:4], as.numeric)
As for your second question, if you want to, say, multiply column c by 5:
df$c <- df$c * 5
Or by any vector the same length as c, e.g. a new column multiplying c by d:
df$cd <- df$c * df$d
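Both questions can also be handled with lapply and Map, which return one result per column and so assign back cleanly. A sketch, where cols, multipliers and scaled are made-up names and the multipliers are arbitrary:

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = 2:6,
                 c = c(5L, 1L, 0L, 9L, 2L),
                 d = c(5L, 6L, 7L, 3L, 1L),
                 e = c(2, 6, 1, 2, 3))
cols <- c("b", "c", "d")

# Question 1: convert only the selected columns to numeric
df[cols] <- lapply(df[cols], as.numeric)
str(df)

# Question 2 (bonus): a different multiplier for each selected column
multipliers <- c(b = 2, c = 3, d = 4)
scaled <- df
scaled[cols] <- Map(`*`, df[cols], multipliers)
print(scaled)
```

Selecting the columns by name rather than by position keeps the code working if columns are later added or reordered.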

Resources