applying function to multiple dataframes programatically [duplicate]

applying function to multiple dataframes programatically [duplicate] - r

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.

Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.

Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5

In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5

Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.

Related

Loop to execute in different dataframes in r [duplicate]

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.

Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.

Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5

In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5

Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.

How to make this code more efficient in R?

I know this is a stupid question, but I'm kinda frustrated with my code because it takes so much time. Jere is one part of my code.
basically I have a matrix called "distance"...
a b c
1 2 5 7
2 6 8 4
3 9 2 3
and then lets say I have a column in a data frame, contains of {a,b,c}
c1 c2 c3
c ... ...
a
a just another column
b
c ... ...
so I want to do a match, I wanna make another matrix with ncol=nrow(distance), and nrow=nrow(c1). where replace the factor value with their distance value. Here's an example of the first column of matrix that I'm going to make
a will replaced by 2
b will replaced by 5
c will replaced by 7
and for the second column, i will take row number 2 from distance matrix, and so on... so the result will be like this
m1 m2 m3
7 4 3
2 6 9
2 6 9
5 8 2
7 4 3
That is just an easy example, and I'm running this code, but when it deals with large iterations, it's kinda stressful for me.
for(l in 1:ncol(d.cat)){
get.unique = sort(unique(d.cat[, l]))
for(j in 1:nrow(d.cat)){
value = as.character(d.cat[j, l])
index = which(get.unique == value)
d2[j,l] = (d[[l]][i, index])
}
}
d.cat is categorical data. And d[[...]] is the list of matrix distance for every column in d.cat.

Try to store the indices and do the updating in one go. Lets say your distance matrix is dmat and data frame is df and you want to create a matrix named newmat
a.ind = which(df$c1=="a")
b.ind = which(df$c1=="b")
c.ind = which(df$c1=="c")
newmat = matrix(0,nrow=length(df$c1),ncol=3)
newmat[a.ind,] = dmat[,1]
newmat[b.ind,] = dmat[,2]
newmat[c.ind,] = dmat[,3]

Here's some data
set.seed(123)
d = matrix(1:9, 3, dimnames=list(NULL, letters[1:3]))
df = data.frame(c1 = sample(letters[1:3], 10, TRUE), stringsAsFactors=FALSE)
and a solution
t(d[, match(df$c1, colnames(d))])
For example
> d
a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> df$c1
[1] "a" "c" "b" "c" "c" "a" "b" "c" "b" "b"
> t(d[,match(df$c1, colnames(d))])
[,1] [,2] [,3]
a 1 2 3
c 7 8 9
b 4 5 6
c 7 8 9
c 7 8 9
a 1 2 3
b 4 5 6
c 7 8 9
b 4 5 6
b 4 5 6

Your data
mat <- matrix(c(2,6,9,5,8,2,7,4,3), nrow=3)
rownames(mat) <- 1:3
colnames(mat) <- letters[1:3]
library(dplyr)
set.seed(1)
df <- as.data.frame(matrix(sample(letters[1:3], 12, replace=TRUE), nrow=4)) %>%
setNames(paste0("c", 1:3))
# c1 c2 c3
# 1 a a b
# 2 b c a
# 3 b c a
# 4 c b a
Using purrr::map2_df, iterate through columns of df and columns of tmat
library(purrr)
tmat <- t(mat)
map2_df(df, seq_len(ncol(tmat)), ~tmat[,.y][.x])
# # A tibble: 4 x 3
# c1 c2 c3
# <dbl> <dbl> <dbl>
# 1 2. 6. 2.
# 2 5. 4. 9.
# 3 5. 4. 9.
# 4 7. 8. 9.

Here is my attempt using the tidyverse :
library(tidyverse)
# Lets create some example
distance <- data_frame(a = sample(1:10, 1000, T), b = sample(1:10, 1000, T), c = sample(1:10, 1000, T))
c1 <- data_frame(c1 = sample(letters[1:3], 1000, T), c2 = sample(letters[1:3], 1000, T))
# First rearrange a little bit your data to make it more tidy
distance2 <- distance %>%
mutate(i = seq_len(n())) %>%
gather(col, value, -i)
c2 <- c1 %>%
mutate(i = seq_len(n()) %>%
gather(col, value, -i)
# Now just join the data and spread it again
c12 %>%
left_join(distance2, by = c("i", "value" = "col")) %>%
select(i, col, value.y) %>%
spread(col, value.y)

Combine vectors into data frame, using vector name as a column

library(dplyr)
I have a set of vectors:
Sp_A <- c("A",1,2,3,4,5,6,7,8)
Sp_B <- c("B",9,10,11,12,13,14,15,16)
Sp_C <- c("C",17,18,19,20,21,22,23,24)
which I have made into a list of vectors:
list <- ls(pattern = "Sp_")
I want to use this list to loop over each vector in the list and make it into a data frame . I currently do this for one vector using this:
A_df <- select(data.frame(rep(Sp_A[1], each = 4), c(Sp_A[c(2,4,6,8)]), c(Sp_A[c(3,5,7,9)])), name = 1, var1 = 2, var2 = 3)
I have tried to make this operation into a for loop like this:
for(i in list) {
test[i] <- select(A_df <- data.frame(rep(i[1], each = 4),
c(i[c(2,4,6,8)]),
c(i[c(3,5,7,9)]),
name = 1, var1 = 2, var2 = 3))
}
but to no avail.
I have heard that I might be able to use apply() for this sort of thing but I don't know how.

Maybe this:
lapply(list,function(x) data.frame(name=get(x)[1],matrix(get(x)[-1],ncol = 2)))
[[1]]
name X1 X2
1 A 1 5
2 A 2 6
3 A 3 7
4 A 4 8
[[2]]
name X1 X2
1 B 9 13
2 B 10 14
3 B 11 15
4 B 12 16
[[3]]
name X1 X2
1 C 17 21
2 C 18 22
3 C 19 23
4 C 20 24
Or a simple for loop to assign the dataframes to objects:
for (x in 1:length(list)){
assign(paste0("test",x),data.frame(name=get(list[x])[1],matrix(get(list[x])[-1],ncol = 2)))
}

Same function over multiple data frames in R

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.

Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.

Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5

In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5

Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.

R: Calling table() on multiple variables

I have to call the table() function on 10 variables in R. Is there any way of doing it in one shot, without calling them individually like table(v1), table(v2)... table(v10)?

If your variables are arranged as columns in a data.frame, you could use lapply:
df <- data.frame(aa = rpois(10, 4), bb = rpois(10, 3), c = rpois(10, 7))
tabList <- lapply(df, table)
Then you get a list with the various tables:
> tabList
$aa
1 3 4 5 6 7
2 3 2 1 1 1
$bb
1 2 3 4 5
1 2 4 1 2
$c
3 4 5 6 7 9 11 12
1 1 1 3 1 1 1 1
EDIT:
For variables across multiple data.frames, you might try putting them into a list and then using lapply again:
df2 <- df[sample(rownames(df), 15, replace = TRUE), ]
df3 <- df[sample(rownames(df), 20, replace = TRUE), ]
dfList <- list(df = df, df2 = df2, df3 = df3)
lapply(dfList, function(x) lapply(x, FUN = table))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

applying function to multiple dataframes programatically [duplicate] - r

Related

Loop to execute in different dataframes in r [duplicate]

How to make this code more efficient in R?

Combine vectors into data frame, using vector name as a column

Same function over multiple data frames in R

R: Calling table() on multiple variables

Categories

Resources