Groupby bins and aggregate in R - r

I have data like (a,b,c)
a b c
1 2 1
2 3 1
9 2 2
1 6 2
where 'a' range is divided into n (say 3) equal parts and aggregate function calculates b values (say max) and grouped by at 'c' also.
So the output looks like
a_bin b_m(c=1) b_m(c=2)
1-3 3 6
4-6 NaN NaN
7-9 NaN 2
Which is MxN where M=number of a bins, N=unique c samples or all range
How do I approach this? Can any R package help me through?

A combination of aggregate, cut and reshape seems to work
df <- data.frame(a = c(1,2,9,1),
b = c(2,3,2,6),
c = c(1,1,2,2))
breaks <- c(0, 3, 6, 9)
# Aggregate data
ag <- aggregate(df$b, FUN=max,
by=list(a=cut(df$a, breaks, include.lowest=T), c=df$c))
# Reshape data
res <- reshape(ag, idvar="a", timevar="c", direction="wide")

There would be easier ways.
If your dataset is dat
res <- sapply(split(dat[, -3], dat$c), function(x) {
a_bin <- with(x, cut(a, breaks = c(1, 3, 6, 9), include.lowest = T, labels = c("1-3",
"4-6", "7-9")))
c(by(x$b, a_bin, FUN = max))
})
res1 <- setNames(data.frame(row.names(res), res),
c("a_bin", "b_m(c=1)", "b_m(c=2)"))
row.names(res1) <- 1:nrow(res1)
res1
a_bin b_m(c=1) b_m(c=2)
1 1-3 3 6
2 4-6 NA NA
3 7-9 NA 2

I would use a combination of data.table and reshape2 which are both fully optimized for speed (not using for loops from apply family).
The output won't return the unused bins.
v <- c(1, 4, 7, 10) # creating bins
temp$int <- findInterval(temp$a, v)
library(data.table)
temp <- setDT(temp)[, list(b_m = max(b)), by = c("c", "int")]
library(reshape2)
temp <- dcast.data.table(temp, int ~ c, value.var = "b_m")
## colnames(temp) <- c("a_bin", "b_m(c=1)", "b_m(c=2)") # Optional for prettier table
## temp$a_bin<- c("1-3", "7-9") # Optional for prettier table
## a_bin b_m(c=1) b_m(c=2)
## 1 1-3 3 6
## 2 7-9 NA 2

Related

Writing a summation formula using variables from multiple observations

I am trying to create a new variable for each observation using the following formula:
Index = ∑(BAj / DISTANCEij)
where:
j = focal observation; i= other observation
Basically, I'm taking the focal individual (i) and finding the euclidean distance between it and another point and dividing the other points BA by that distance. Do that for all the other points and then sum them all and repeat all of this for each point.
Here is some sample data:
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
print(df)
ID BA x y
1 1 3 0 1
2 2 5 2 3
3 3 6 3 4
4 4 9 7 9
Currently, I've extracted out vectors and created a formula to calculate part of the formula shown here:
vec1 <- df[1, ]
vec2 <- df[2, ]
dist <- function(vec1, vec2) vec1$BA/sqrt((vec2$x - vec1$x)^2 +
(vec2$y - vec1$y)^2)
My question is how do I repeat this with the x and y values for vec2 changing for each new other point with vec1 remaining the same and then sum them all together?
We may loop over the row sequence, extract the data and apply the dist function
library(dplyr)
library(purrr)
df %>%
mutate(dist_out = map_dbl(row_number(), ~ {
othr <- cur_data()[-.x,]
cur <- cur_data()[.x, ]
sum(dist(cur, othr))
}))
-output
ID BA x y dist_out
1 1 3 0 1 2.049983
2 2 5 2 3 5.943485
3 3 6 3 4 6.593897
4 4 9 7 9 3.404545
Here are two base R ways.
1. for loop
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
n <- nrow(df)
d <- dist(df[c("x", "y")], upper = TRUE)
d <- as.matrix(d)
Index <- numeric(n)
for(j in seq_len(n)) {
d_j <- d[-j, j, drop = TRUE]
Index[j] <- sum(df$BA[j]/d_j)
}
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)
2. sapply loop
Index <- sapply(seq_len(n), \(j) sum(df$BA[j]/d[-j, j, drop = TRUE]))
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)

Loop to execute in different dataframes in r [duplicate]

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.

applying function to multiple dataframes programatically [duplicate]

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.

Sum of data frame's rows in range defined by columns

I have an integer based dataframe with positional coordinates in one column and a variable in the second. The coordinates range from 1-10 million, the variables from 0-950 - I'm interested in returning the sum of the variables from ranges defined within a separate frame containing the start and end points of the desired range.
To make things a bit easier to compute I've shortened the example:
Data:
a = seq(1,5)
b = c(0,0,1,0,2)
df1 <- data.frame(a, b)
c = c(1,1,2,2,3)
d = c(3,4,3,5,4)
df2 <- data.frame(c,d)
df1:
1, 0
2, 0
3, 1
4, 0
5, 2
df2:
1, 3
1, 4
2, 3
2, 5
3, 4
magic
output:
1,
1,
1,
3,
1,
Where magic is pulling the start and end positions in df2 columns 1 and 2 to pass to rowSums for df1 extraction.
Edit: #Frank's data.table solution: short and fast.
df2[, s := df1[df2, on=.(a >= c, a <= d), sum(b), by=.EACHI]$V1]
# output
c d s
1: 1 3 1
2: 1 4 1
3: 2 3 1
4: 2 5 3
5: 3 4 1
Another way (may be slower but works):
library(data.table)
setDT(df1)
setDT(df2)
## magic function
get_magic <- function(x)
{
spell <- c()
one <- unlist(x[1])
two <- unlist(x[2])
a <- df1[between(a, one, two), sum(b)]
spell <- append(spell, a)
return(spell)
}
# applies to row
d <- apply(df2, 1, get_magic)
print(d)
# output
[1] 1 1 1 3 1
One possible solution is by using mapply. I have used a custom function but one can write an inline function as part of mapply statement.
mapply(row_sum, df2$c, df2$d)
row_sum <- function(x, y){
sum(df1[x:y,2])
}
#Result
#[1] 1 1 1 3 1
Data
a = seq(1,5)
b = c(0,0,1,0,2)
df1 <- data.frame(a, b)
c = c(1,1,2,2,3)
d = c(3,4,3,5,4)
df2 <- data.frame(c,d)

Same function over multiple data frames in R

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but I'm not sure how to set it up properly to call data frames. It also seems most prefer the lapply approach with R. I've played with the get function as well to no avail. I apologize if this is a duplicated question. Any help would be greatly appreciated!
Here's my over simplified example:
2 data frames: df1, df2
df1
start stop ID
0 10 x
10 20 y
20 30 z
df2
start stop ID
0 10 a
10 20 b
20 30 c
what I want is a 4th column with the average of start and stop for both dfs
df1
start stop ID Avg
0 10 x 5
10 20 y 15
20 30 z 25
I can do this one data frame at a time with:
df1$Avg <- rowMeans(subset(df1, select = c(start, stop)), na.rm = TRUE)
but I want to run it on all of the dataframes.
Make a list of data frames then use lapply to apply the function to them all.
df.list <- list(df1,df2,...)
res <- lapply(df.list, function(x) rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE))
# to keep the original data.frame also
res <- lapply(df.list, function(x) cbind(x,"rowmean"=rowMeans(subset(x, select = c(start, stop)), na.rm = TRUE)))
The lapply will then feed in each data frame as x sequentially.
Put them into a list and then run rowMeans over the list.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
lapply(list(df1, df2), function(w) { w$Avg <- rowMeans(w[1:2]); w })
[[1]]
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
[[2]]
x y ID Avg
1 5 2 f 3.5
2 5 3 g 4.0
3 5 4 h 4.5
4 5 5 i 5.0
5 5 6 j 5.5
In case you want all the outputs in the same file this may help.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
z=list(df1,df2)
df=NULL
for (i in z) {
i$Avg=(i$x+i$y)/2
df<-rbind(df,i)
print (df)
}
> df
x y ID Avg
1 3 1 a 2.0
2 3 2 b 2.5
3 3 3 c 3.0
4 3 4 d 3.5
5 3 5 e 4.0
6 5 2 f 3.5
7 5 3 g 4.0
8 5 4 h 4.5
9 5 5 i 5.0
10 5 6 j 5.5
Here's another possible solution using a for loop. I've had the same problem (with more datasets) a few days ago and other solutions did not work.
Say you have n datasets :
df1 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[24:26])
df2 <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[1:3])
...
dfn <- data.frame(start = seq(0,20,10), stop = seq(10,30,10), ID = letters[n:n+2])
The first thing to do is to make a list of the dfs:
df.list<-lapply(1:n, function(x) eval(parse(text=paste0("df", x)))) #In order to store all datasets in one list using their name
names(df.list)<-lapply(1:n, function(x) paste0("df", x)) #Adding the name of each df in case you want to unlist the list afterwards
Afterwards, you can use the for loop (that's the most important part):
for (i in 1:length(df.list)) {
df.list[[i]][["Avg"]]<-rowMeans(df.list[[i]][1:2])
}
And you have (in the case your list only includes the two first datasets):
> df.list
[[1]]
start stop ID Avg
1 0 10 x 5
2 10 20 y 15
3 20 30 z 25
[[2]]
start stop ID Avg
1 0 10 a 5
2 10 20 b 15
3 20 30 c 25
Finally, if you want your modified datasets from the list back in the global environment, you can do:
list2env(df.list,.GlobalEnv)
This technique can be applied to n datasets and other functions.
I find it to be the most flexible solution.

Resources