I am trying to use a lapply within transform.
So this works as expected:
dfX1 = data.frame(a = rpois(100, 10),
b = rpois(100, 10))
dfX2 = transform(dfX1, c = a %in% c(6, 7))
However, when I try something like:
dfX3 = transform(dfX1,
c = apply(
do.call('cbind',
lapply(c(a, b),
function(x) x %in% c(6, 7))), 1, sum))
I get the weird result:
> head(dfX3)
a b c
1 9 9 26
2 9 8 26
3 6 7 26
4 9 11 26
5 11 9 26
6 11 16 26
My expectation is that the lapply would return a list of vectors that would be coerced into a matrix by cbind, and the apply would apply the function sum across rows.
Not sure what I am missing.
In this case, the problem comes from c(a, b), which is the concatenation of cols a and b. The result will be the number of values that are either equal to 6 or 7 in both a and b, repeated along the lines.
Replacing c(a, b) by list(a, b) gives the correct result.
Also, a more synthetic way or doing this is
transform(dfX1, c= apply( data.frame(a,b), 1, function(u) sum(u %in% 6:7)) )
Related
I have a dataframe and i want to calculate the sum of variables present in a vector in every row and make the sum in other variable after i want the name of new variable created to be from the name of the variable in vector
for example
data
Name A_12 B_12 C_12 D_12 E_12
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 16 9 6
r4 7 8 0 7 18
let's say i have two vectors
vector_1 <- c("A_12","B_12","C_12")
vector_2 <- c("B_12","C_12","D_12","E_12")
The result i want is :
New_data >
Name A_12 B_12 C_12 ABC_12 D_12 E_12 BCDE_12
r1 1 5 12 18 21 15 54
r2 2 4 7 13 10 9 32
r3 5 15 16 36 9 6 45
r4 7 8 0 15 7 18 40
I created for loop to get the sum of the rows in a vector but i didn't get the correct result
Please tell me ig you need any more informations or clarifications
Thank you
You can use rowSums and simple column-subsetting:
dat$ABC_12 <- rowSums(dat[,vector_1])
dat$BCDE_12 <- rowSums(dat[,vector_2])
dat
# Name A_12 B_12 C_12 D_12 E_12 ABC_12 BCDE_12
# 1 r1 1 5 12 21 15 18 53
# 2 r2 2 4 7 10 9 13 30
# 3 r3 5 15 16 9 6 36 46
# 4 r4 7 8 0 7 18 15 33
Note that if your frames inherit from data.table, then you'll need to use either subset(dat, select=vector_1) or dat[,..vector_1] instead of simply dat[,vector_1]; if you aren't already using data.table, then you can safely ignore this paragraph.
Like this (using dplyr/tidyverse)
df %>%
rowwise() %>%
mutate(
ABC_12 = sum(c_across(vector_1)),
BCDE_12 = sum(c_across(vector_2))
)
Though I'm not sure the sums are correct in your example
-=-=-=EDIT-=-=-=-
Here's a function to help with the naming.
ex_fun <- function(vec, n_len){
paste0(paste(substr(vec,1,n_len), collapse = ""), substr(vec[1],n_len+1,nchar(vec[1])))
}
Which can then be implemented like so.
df %>%
rowwise() %>%
mutate(
!!ex_fun(vector_1, 1) := sum(c_across(vector_1)),
!!ex_fun(vector_2, 1) := sum(c_across(vector_2)),
)
-=-= Extra note -=--=
If you list your vectors up you could then combine this with r2evans answer and stick into a loop if you prefer.
vectors = list(vector_1, vector_2)
for (v in vectors){
df[ex_fun(v, 1)] <- rowSums(df[,v])
}
I believe this might work, so long as only the starting digits are different:
library("tidyverse")
#Input dataframe.
data <- data.frame(Name =c("r1", "r2", "r3", "r4"), A_12 = c(1, 2, 5, 7), B_12 = c(5, 4, 15, 8),
C_12 = c(12, 7, 16, 0), D_12 = c(21, 10, 9, 7), E_12 = c(15, 9, 6, 18))
#add all vectors to the "vectors" list. I have added vector_1 and vector_2, but
#there can be as many vectors as needed, they just need to be put in the list.
vector_1 <- c("A_12","B_12","C_12")
vector_2 <- c("B_12","C_12","D_12","E_12")
vector_list<-list(vector_1, vector_2)
vector_sum <- function(data, vector_list){
output <- data |>
dplyr::select(1, all_of(vector_list[[1]]))
for (i in vector_list) {
name1 <- substring(as.character(i), 1,1) |> paste(collapse = '')
name2 <- substring(as.character(i[1]), 2)
input_temp <- dplyr::select(data, all_of(i))
input_temp <- mutate(input_temp, temp=rowSums(input_temp))
names(input_temp)[names(input_temp) == "temp"] <- paste(name1, name2)
output = cbind(output, input_temp)
}
output[, !duplicated(colnames(output))]
}
vector_sum(data, vector_list)
I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.
I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.
If I had a data.frame X and wanted to apply a function foo to each of its rows, I would just run apply(X, 1, foo). This is all well-known and simple.
Now imagine I have another data.frame Y and the following function:
mean_of_sum <- function(x,y) {
return(mean(x+y))
}
Is there a way to write an "apply equivalent" to the following loop:
my_loop_fun <- function(X, Y)
results <- numeric(nrow(X))
for(i in 1: length(results)) {
results[i] <- mean_of_sum(X[i,], Y[i,])
}
return(results)
If such an "apply syntax" exists, would it be more efficient than my "good" old loop?
this should work:
sapply(seq_len(nrow(X)), function(i) mean_of_sum(X[i,], Y[i,]))
You apply the function on the sequence 1, 2, ..., n (where n is the number of rows ) and in each "iteration" you evaluate mean_of_sum for the i-th row.
We can split every row of X and Y in list and use mapply to apply the function. Changing the function mean_of_sum a bit to convert one-row dataframe to numeric
mean_of_sum <- function(x,y) {
return(mean(as.numeric(x) + as.numeric(y)))
}
Consider an example,
X <- data.frame(a = 1:5, b = 6:10)
Y <- data.frame(c = 11:15, d = 16:20)
mapply(mean_of_sum, split(X, seq_len(nrow(X))), split(Y, seq_len(nrow(Y))))
# 1 2 3 4 5
#17 19 21 23 25
where X and Y are
X
# a b
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Y
# c d
#1 11 16
#2 12 17
#3 13 18
#4 14 19
#5 15 20
So the first value 17 is counted as
mean(c(1 + 11, 6 + 16))
#[1] 17
and so on for next values.
I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.