lapply to add column to existing dataframes - r

I have a list of data frames, and want to perform a function on each column in the data frame.
I've been googling for a while, but the issue I have is this:
df.1 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
df.2 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
names(df.1) <- c("a", "b", "c")
names(df.2) <- c("a", "b", "c")
ls.1<- list(df.1,df.2)
res <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
This returns a new list res containing unnamed data frames (res[[1]], res[[2]], etc.):
[[1]]
a b c d
1 2.2378686 3.640607 4.793172 8.433780
2 -0.4411046 3.690850 5.290814 8.981664
3 -1.1490879 3.081092 4.982820 8.063912
4 -0.3024211 1.929033 4.743569 6.672602
5 1.3658726 3.395564 2.800131 6.195695
[[2]]
a b c d
1 0.3452530 3.264709 7.384127 10.648836
2 -1.2031949 3.118633 4.840496 7.959129
3 0.6177369 1.119107 4.938917 6.058024
4 -1.0470713 1.942357 5.747748 7.690106
5 0.8732836 2.704501 5.805754 8.510254
I'm interested in adding columns to the original data frames (df.1, df.2). How would I do this?

You can name your list elements, or use tibble::lst which will do it for you:
ls.1<- list(df.1 = df.1,df.2 = df.2)
ls.2<- tibble::lst(df.1, df.2)
res1 <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
res2 <- lapply(ls.2, function(x){
x$d <- x$b + x$c
return(x)
})
# $df.1
# a b c d
# 1 0.6782608 4.0774244 2.845351 6.922776
# 2 2.3620601 1.9395314 5.438832 7.378364
# 3 -0.5913838 2.0579972 4.312360 6.370357
# 4 0.5532147 0.8581389 5.867889 6.726027
# 5 -0.3251044 1.9838598 4.321008 6.304867
#
# $df.2
# a b c d
# 1 1.9918131 3.195105 5.715858 8.910963
# 2 0.2525537 2.507358 5.040691 7.548050
# 3 0.5038298 3.112855 5.265974 8.378830
# 4 0.4873384 3.377182 5.685714 9.062896
# 5 -0.6539881 0.157948 5.407508 5.565456
To overwrite the original data.frames you can use list2env on the output.
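For example, a minimal sketch (assuming you want the modified copies in res1 written back over df.1 and df.2 in the global environment):
# res1 is a named list (df.1, df.2), so list2env can write its elements
# back into the global environment, overwriting the originals
list2env(res1, envir = .GlobalEnv)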

In order to add the columns, you will have to either overwrite your ls.1 with res or manually assign the results back to your original data frames, e.g. df.1 <- res[[1]]. But there are a hundred ways to skin a cat (pun intended), and there may be other, better approaches.

Related

R: Subsetting over all data frames inside a list

I'm new to R and Stack Overflow. I'm trying to deal with a list of data frames and have the following problem (I hope this is a good example for reproducing it). Assume that I have a list of 3 data frames with 4 columns (my real code contains 10 data frames with 20 columns):
df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)
For each data frame I have a different condition for subsetting. For example:
# If I were to subset them individually, outside of the list:
df1_s <- df1[which(df1$k <= 12 & df1$k > 0), "h_1"] # Taking only rows with k = 12 down to k = 1 and only the column h_1
df2_s <- df2[which(df2$k <= 4 & df2$k > 0), "h_3"]
df3_s <- df3[which(df3$k <= 12 & df3$k > 0), "h_2"]
How can I subset the three data frames in the list in the most efficient way?
I think something with lapply, putting the subsetting numbers in a vector, would be a good approach, but I have no idea how to do it or how to subset within lists.
I hope you can help me. Before posting, I tried to find a solution in other posts dealing with subsetting data frames in lists, but that didn't work for my code.
Here's an mapply approach (same idea as the other answer):
# function: arguments are a data frame and a vector = [column name, upper limit, lower limit]
rook <- function(df, par) {
  # par is a character vector, so coerce the limits back to numeric before filtering on k
  keep <- df$k <= as.numeric(par[2]) & df$k > as.numeric(par[3])
  df[keep, par[1]]
}
# list of parameters
par_list <- list(
c('h_1', 12, 0),
c('h_3', 4 , 0),
c('h_2', 12, 0)
)
# call mapply
mapply(rook, df_list, par_list)
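One small caveat (my addition, not part of the original answer): mapply tries to simplify its result, so if the subsets happened to have equal lengths it would return a matrix rather than a list. Passing SIMPLIFY = FALSE keeps the result as a named list in every case:
# keep the result as a list regardless of the subset lengths
res <- mapply(rook, df_list, par_list, SIMPLIFY = FALSE)
str(res)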
Here's a solution using base R. As @www mentioned, the idea is to use an apply-type function (mapply, or pmap from purrr) to apply multiple arguments to a function in sequence. This solution also makes use of the eval-parse construct to do flexible subsetting; see e.g. the discussion here: http://r.789695.n4.nabble.com/using-a-condition-given-as-string-in-subset-function-how-td1676426.html.
subset_fun <- function(data, criteria, columns) {
subset(data, eval(parse(text = criteria)), columns)
}
criterion <- list("k <= 12 & k > 0", "k <= 4 & k > 0", "k <= 12 & k > 0")
cols <- list("h_1", "h_3", "h_2")
out <- mapply(subset_fun, df_list, criterion, cols)
str(out)
# List of 3
# $ df1.h_1: num [1:12] -0.0589 1.0677 0.2122 1.4109 -0.6367 ...
# $ df2.h_3: num [1:4] -0.826 -1.506 -1.551 0.862
# $ df3.h_2: num [1:12] 0.8948 0.0305 0.9131 -0.0219 0.2252 ...
We can use the pmap function from the purrr package. The key is to define a function that takes arguments for the k limits and the column name, organize a list with these arguments, and then call pmap.
library(tidyverse)
# Define a function
subset_fun <- function(dat, k1, k2, col){
dat2 <- dat %>%
filter(k <= k1, k > k2) %>%
pull(col)
return(dat2)
}
# Define lists for the function arguments
par <- list(dat = df_list, # List of data frames
k1 = list(12, 4, 12), # The first number
k2 = list(0, 0, 0), # The second number
col = list("h_1", "h_3", "h_2")) # The column name
# Apply the subset_fun
df_list2 <- pmap(par, subset_fun)
df_list2
# $df1
# [1] -0.6868529 -0.4456620 1.2240818 0.3598138 0.4007715 0.1106827 -0.5558411 1.7869131
# [9] 0.4978505 -1.9666172 0.7013559 -0.4727914
#
# $df2
# [1] -0.9474746 -0.4905574 -0.2560922 1.8438620
#
# $df3
# [1] -0.2803953 0.5629895 -0.3724388 0.9769734 -0.3745809 1.0527115 -1.0491770 -1.2601552
# [9] 3.2410399 -0.4168576 0.2982276 0.6365697
DATA
set.seed(123)
df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)
Consider Map, the wrapper around mapply, to return a list of data frames. Because you subset a single column, cast the result back with data.frame (to avoid it being returned as a vector) and use setNames to rename it.
Here, mapply or Map, siblings of lapply, are chosen because you want to iterate element-wise across several equal-length objects. mapply takes an unlimited number of arguments, here four, whose lengths must be equal or multiples of one another:
low_limits <- c(0, 0, 0)
high_limits <- c(12, 4, 12)
h_cols <- c("h_1", "h_2", "h_3")
subset_fct <- function(df, lo, hi, col)
setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)
new_df_list <- Map(subset_fct, df_list, low_limits, high_limits, h_cols)
# EQUIVALENT CALL
new_df_list <- mapply(subset_fct, df_list, low_limits,
high_limits, h_cols, SIMPLIFY = FALSE)
Output (uses set.seed(456) at top to reproduce random numbers)
new_df_list
# $df1
# h_1
# 1 1.0073523
# 2 0.5732347
# 3 -0.9158105
# 4 1.3110974
# 5 0.9887263
# 6 1.6539287
# 7 -1.4408052
# 8 1.9473564
# 9 1.7369362
# 10 0.3874833
# 11 2.2800340
# 12 1.5378833
# $df2
# h_2
# 1 0.11815133
# 2 0.86990262
# 3 -0.09193621
# 4 0.06889879
# $df3
# h_3
# 1 -1.4122604
# 2 -0.9997605
# 3 -2.3107388
# 4 0.9386188
# 5 -1.3881885
# 6 -0.6116866
# 7 0.3184948
# 8 -0.2354058
# 9 1.0750520
# 10 -0.1007956
# 11 1.0701526
# 12 1.0358389

rename column in dataframe using variable name R

I have a number of data frames, each with the same format, like this:
A B C
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I would like to change the name of the third column, C, so that it includes part of the name of the variable associated with the data frame.
For the variable df_elephant the data frame should look like this:
A B C.elephant
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I have a function which will change the column name:
rename_columns <- function(x) {
  colnames(x)[colnames(x) == 'C'] <-
    paste('C', strsplit(deparse(substitute(x)), '_')[[1]][2], sep = '.')
  return(x)
}
This works with my data frames. However, I would like to provide a list of data frames so that I do not have to call the function multiple times by hand. If I use lapply like so:
lapply( list (df_elephant, df_horse), rename_columns )
The function renames the columns with an NA rather than a portion of the variable name:
[[1]]
A B C.NA
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.02344361
[[2]]
A B C.NA
1 0.45387054 0.02279488 1.6746280
2 -1.47271378 0.68660595 -0.2505752
3 1.26475917 -1.51739927 -1.3050531
Is there some way that I can provide a list of data frames to my function and produce the desired result?
You are working with the data frame that lapply passes into the function instead of the actual list element names, and this is why it's not working: inside lapply, deparse(substitute(x)) no longer sees the original variable name, so the split on "_" has no second element and you get NA.
# Generating random data
n = 3
item1 = data.frame(A = runif(n), B = runif(n), C = runif(n))
item2 = data.frame(A = runif(n), B = runif(n), C = runif(n))
myList = list(df_elephant = item1, df_horse = item2)
# 1- Why your code doesnt work: ---------------
names(myList) # This will return the actual names that you want to use: [1] "df_elephant" "df_horse"
lapply(myList, names) # This returns the data frames' column names, not the list names you need
# 2- How to make it work: ---------------
lapply(seq_along(myList), # This will return an array of indicies
function(i){
dfName = names(myList)[i] # Get the list name
dfName.animal = unlist(strsplit(dfName, "_"))[2] # Split on underscore and take the second element
df = myList[[i]] # Copy the actual Data frame
colnames(df)[colnames(df) == "C"] = paste("C", dfName.animal, sep = ".") # Change column names
return(df) # Return the new df
})
# [[1]]
# A B C.elephant
# 1 0.8289368 0.06589051 0.2929881
# 2 0.2362753 0.55689663 0.4854670
# 3 0.7264990 0.68069346 0.2940342
#
# [[2]]
# A B C.horse
# 1 0.08032856 0.4137106 0.6378605
# 2 0.35671556 0.8112511 0.4321704
# 3 0.07306260 0.6850093 0.2510791
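As a side note (my addition, not from the original answer), you can see directly why the original rename_columns produced NA: inside lapply the recorded call is always FUN(X[[i]]), so deparse(substitute(x)) never sees the original variable name:
lapply(myList, function(x) deparse(substitute(x)))
# both elements come back as "X[[i]]" -- there is no "_" to split on, hence the NA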
You can also try the following. It is somewhat similar to Akrun's answer, also using Map in the end:
# Your data
d <- read.table("clipboard", header = TRUE)
# create a list with names A and B
d_list <- list(A=d, B=d)
# function
foo <- function(x, y){
gr <- which(colnames(x) == "C") # get index of colnames C
tmp <- colnames(x) #new colnames vector
tmp[gr] <- paste(tmp[gr], y, sep=".") # replace the old with the new colnames.
setNames(x, tmp) # set the new names
}
# Result
Map(foo, d_list, names(d_list))
$A
A B C.A
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
$B
A B C.B
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
We can try with Map. Get the datasets in a list (here we use mget to return the values of the strings as a list); then, using Map, we change the name of the third column using the corresponding vector of names.
Map(function(x, y) {names(x)[3] <- paste(names(x)[3], sub(".*_", "", y), sep="."); x},
mget(c("df_elephant", "df_horse")), c("df_elephant", "df_horse"))
#$df_elephant
# A B C.elephant
#1 -0.02299388 0.7140416 0.8492423
#2 -1.43027866 -1.9642077 -1.2886368
#3 -1.01827712 -0.9414119 -2.0234436
#$df_horse
# A B C.horse
#1 0.4538705 0.02279488 1.6746280
#2 -1.4727138 0.68660595 -0.2505752
#3 1.2647592 -1.51739927 -1.3050531

Forcing zero instead of NA in R

I have this function called newBamAD and a data frame x. What this function does is match the letters in the REF and ALT columns and grab the respective numbers for the REF and ALT values in x. What I need to know is how to make this function give 0 in the ref or alt column instead of NA. How do I replace NA with zero here?
x <- as.matrix(read.csv(text="start,A,T,G,C,REF,ALT,TYPE
chr20:5363934,95,29,14,59,C,T,snp
chr5:8529759,,,,,G,C,snp
chr14:9620689,65,49,41,96,T,G,snp
chr18:547375,94,1,51,67,G,C,snp
chr8:5952145,27,80,25,96,T,T,snp
chr14:8694382,68,94,26,30,A,A,snp
chr16:2530921,49,15,79,72,A,T,snp:2530921
chr16:2530921,49,15,79,72,A,G,snp:2530921
chr16:2530921,49,15,79,72,A,T,snp:2530921flat
chr16:2530331,9,2,,,A,T,snp:2530331
chr16:2530331,9,2,,,A,G,snp:2530331
chr16:2530331,9,2,,,A,T,snp:2530331flat
chr16:2533924,42,13,19,52,G,T,snp:flat
chr16:2543344,4,13,13,42,G,T,snp:2543344flat
chr16:2543344,42,23,13,42,G,A,snp:2543344
chr14:4214117,73,49,18,77,G,A,snp
chr4:7799768,36,28,1,16,C,A,snp
chr3:9141263,27,41,93,90,A,A,snp", stringsAsFactors=FALSE))
newBamAD <- function(x, base.types = c("A", "C", "G", "T")) {
  # the version above
  rownames(x) <- 1:nrow(x)
  ref <- x[cbind(1:nrow(x), x[, 'REF'])]
  alt <- x[cbind(1:nrow(x), x[, 'ALT'])]
  which.flat <- grep('flat$', x[, 'TYPE'])
  alt[which.flat] <- sapply(which.flat, function(i, base.types) {
    sum(as.numeric(x[i, base.types[!(base.types %in% x[i, 'REF'])]]), na.rm = TRUE)
  }, base.types)
  cbind(x[, c("start", "REF", "ALT", "TYPE")], bam.AD = paste(ref, alt, sep = ','))
  # cbind(x, bam.AD=paste(ref, alt, sep=','))
}
You could take the advice of thelatemail and switch to a data frame, then take the NAs out first:
df <- as.data.frame(x)
types <- c("A", "T", "G", "C")
df[types][is.na(df[types])] <- 0
head(newBamAD(df))
# start REF ALT TYPE bam.AD
# 1 chr20:5363934 C T snp 59,29
# 2 chr5:8529759 G C snp 0, 0
# 3 chr14:9620689 T G snp 49,41
# 4 chr18:547375 G C snp 51,67
# 5 chr8:5952145 T T snp 80,80
# 6 chr14:8694382 A A snp 68,68
We can use gsub to do that
gsub('NA', 0, newBamAD(x)[,5])
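If you want the zeros inside the full result rather than just in the extracted fifth column, you could assign the gsub output back (a small extension of the same idea, not part of the original answer):
res <- newBamAD(x)
res[, "bam.AD"] <- gsub("NA", "0", res[, "bam.AD"])
head(res)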

How to combine a data frame and a vector

df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c(1,2)
How do I combine df and y into a new data frame that has all combinations of rows from df with elements from y? In this example, the output should be
data.frame(w=c("r","r","q","q"), x=c("a","a","b","b"),y=c(1,2,1,2))
w x y
1 r a 1
2 r a 2
3 q b 1
4 q b 2
This should do what you're trying to do, and without too much work.
dl <- unclass(df)
dl$y <- y
merge(df, expand.grid(dl))
# w x y
# 1 q b 1
# 2 q b 2
# 3 r a 1
# 4 r a 2
data.frame(lapply(df, rep, each = length(y)), y = y)
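A brief note on how that one-liner works (my reading, not part of the original answer): rep(..., each = length(y)) repeats every row of df once per element of y, and the shorter y is then recycled down the resulting rows:
# step by step, using df and y from the question
rep_df <- data.frame(lapply(df, rep, each = length(y)))  # rows: r, r, q, q
rep_df$y <- y                                            # y recycled to 1, 2, 1, 2
rep_df
#   w x y
# 1 r a 1
# 2 r a 2
# 3 q b 1
# 4 q b 2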
This should work:
library(combinat)
df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c("one", "two") #for generality
indices <- permn(seq_along(y))
combined <- NULL
for(i in indices){
current <- cbind(df, y=y[unlist(i)])
if(is.null(combined)){
combined <- current
} else {
combined <- rbind(combined, current)
}
}
print(combined)
Here is the output:
w x y
1 r a one
2 q b two
3 r a two
4 q b one
... or to make it shorter (and less obvious):
combined <- do.call(rbind, lapply(indices, function(i){cbind(df, y=y[unlist(i)])}))
First, convert the columns from factor to character:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
Then, use expand.grid to get an index matrix for all combinations of rows of df and elements of y:
ind.mat = expand.grid(1:length(y), 1:nrow(df))
Finally, loop through the rows of ind.mat to get the result:
data.frame(t(apply(ind.mat, 1, function(x){c(as.character(df[x[2], ]), y[x[1]])})))

How to find which elements of one set are in another set?

I have two sets: A with columns x and y, and B also with columns x and y.
I need to find the index of the rows of A which are inside B (both x and y must match).
I have come up with a simple solution (see below), but this comparison is inside a loop, and paste adds a lot of extra time.
B <- data.frame(x = sample(1:1000, 1000), y = sample(1:1000, 1000))
A <- B[sample(1:1000, 10),]
#change some elements
A$x[c(1,3,7,10)] <- A$x[c(1,3,7,10)] + 0.5
A$xy <- paste(A$x, A$y, sep='ZZZ')
B$xy <- paste(B$x, B$y, sep='ZZZ')
indx <- which(A$xy %in% B$xy)
indx
For example, for a single observation, an alternative to paste is almost 3 times faster:
ind <- sample(1:1000, 1)
xx <- B$x[ind]
yy <- B$y[ind]
ind <- which(with(B, x==xx & y==yy))
# [1] 0.0160000324249268 seconds
xy <- paste(xx,'ZZZ',yy, sep='')
ind <- which(B$xy == xy)
# [1] 0.0469999313354492 seconds
How about using merge() to do the matching for you?
A$id <- seq_len(nrow(A))
sort(merge(A, B)$id)
# [1] 2 4 5 6 8 9
Edit:
Or, to get rid of two unnecessary sorts, use the sort= option to merge()
merge(A, B, sort=FALSE)$id
# [1] 2 4 5 6 8 9
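As a quick sanity check (hypothetical, not part of the original answer), the merge() result should agree with the paste()-based index built in the question:
# assuming A$id was added as above; A and B still carry the xy helper column
setequal(merge(A, B, sort = FALSE)$id, which(A$xy %in% B$xy))
# should return TRUE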
