Calculate means from different data frames


My goal is to calculate a final data frame that contains the means from several different data frames. Given data like this:
A <- c(1,2,3,4,5,6,7,8,9)
B <- c(2,2,2,3,4,5,6,7,8)
C <- c(1,1,1,1,1,1,2,2,1)
D <- c(5,5,5,5,6,6,6,7,7)
E <- c(4,4,3,5,6,7,8,9,7)
DF1 <- data.frame(A,B,C)
DF2 <- data.frame(E,D,C)
DF3 <- data.frame(A,C,E)
DF4 <- data.frame(A,D,E)
I'd like to calculate, for each row and column position, the mean across the four data frames. To do this I put together a for loop:
All <- data.frame(matrix(ncol = 3, nrow = 9))
for(i in seq(1:ncol(DF1))){
All[,i] <- mean(c(DF1[,i], DF2[,i], DF3[,i], DF4[,i]))
}
X1 X2 X3
1 5.222222 4.277778 3.555556
2 5.222222 4.277778 3.555556
3 5.222222 4.277778 3.555556
4 5.222222 4.277778 3.555556
5 5.222222 4.277778 3.555556
6 5.222222 4.277778 3.555556
7 5.222222 4.277778 3.555556
8 5.222222 4.277778 3.555556
9 5.222222 4.277778 3.555556
But the end result was that I calculated entire column means (as opposed to a mean for each individual row).
For example, the first row and first column of the 4 data frames contain 1, 4, 1, 1. So I would expect the first row and column of the final data frame to be 1.75 (mean(c(1,4,1,1))).

We place the datasets in a list, get the sum (+) of corresponding elements using Reduce, and divide it by the number of datasets:
Reduce(`+`, mget(paste0("DF", 1:4)))/4
# A B C
#1 1.75 3.25 2.5
#2 2.50 3.25 2.5
#3 3.00 3.25 2.0
#4 4.25 3.50 3.0
#5 5.25 4.25 3.5
#6 6.25 4.50 4.0
#7 7.25 5.00 5.0
#8 8.25 5.75 5.5
#9 8.50 5.75 4.0
NOTE: This should be faster than apply-based solutions, and the output is a data.frame, like the original datasets. The column names are taken from the first dataset.
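The same idea can be written without hard-coding the number of data frames. This is only a minimal sketch, assuming the data frames are collected in a list (called dfs here, a name that is not in the original) and that they all have the same dimensions:

```r
# Rebuild the example data from the question
A <- c(1,2,3,4,5,6,7,8,9)
B <- c(2,2,2,3,4,5,6,7,8)
C <- c(1,1,1,1,1,1,2,2,1)
D <- c(5,5,5,5,6,6,6,7,7)
E <- c(4,4,3,5,6,7,8,9,7)
DF1 <- data.frame(A,B,C)
DF2 <- data.frame(E,D,C)
DF3 <- data.frame(A,C,E)
DF4 <- data.frame(A,D,E)

# dfs is an illustrative name for the list of data frames
dfs <- list(DF1, DF2, DF3, DF4)

# Add the data frames cell by cell, then divide by how many there are
cell_means <- Reduce(`+`, dfs) / length(dfs)

# First row: A = 1.75, B = 3.25, C = 2.5
cell_means[1, ]
```

Dividing by length(dfs) rather than by 4 keeps the result correct if data frames are later added to or removed from the list.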
If we want the tidyverse, then another option is
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
mget(paste0("DF", 1:4)) %>%
  map(rownames_to_column, "rn") %>%
  map(setNames, c("rn", LETTERS[1:3])) %>%
  bind_rows() %>%
  group_by(rn) %>%
  # summarise_each() and funs() are deprecated; across() replaces them
  summarise(across(everything(), mean))
# A tibble: 9 × 4
# rn A B C
# <chr> <dbl> <dbl> <dbl>
#1 1 1.75 3.25 2.5
#2 2 2.50 3.25 2.5
#3 3 3.00 3.25 2.0
#4 4 4.25 3.50 3.0
#5 5 5.25 4.25 3.5
#6 6 6.25 4.50 4.0
#7 7 7.25 5.00 5.0
#8 8 8.25 5.75 5.5
#9 9 8.50 5.75 4.0

Since what you're describing is effectively an array, you can actually make it one with abind::abind, which makes the operation pretty simple:
apply(abind::abind(DF1, DF2, DF3, DF4, along = 3), 1:2, mean)
## A D E
## [1,] 1.75 3.25 2.5
## [2,] 2.50 3.25 2.5
## [3,] 3.00 3.25 2.0
## [4,] 4.25 3.50 3.0
## [5,] 5.25 4.25 3.5
## [6,] 6.25 4.50 4.0
## [7,] 7.25 5.00 5.0
## [8,] 8.25 5.75 5.5
## [9,] 8.50 5.75 4.0
The column names are meaningless, and the result is a matrix, not a data.frame, but even if you wrap it in data.frame, it's still very fast.
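If abind is not available, a similar 3-D array can be built in base R. The following is a sketch under the assumption that every data frame is all-numeric and has the same dimensions; the names dfs, arr and cell_means are illustrative:

```r
# Rebuild the example data from the question
A <- c(1,2,3,4,5,6,7,8,9)
B <- c(2,2,2,3,4,5,6,7,8)
C <- c(1,1,1,1,1,1,2,2,1)
D <- c(5,5,5,5,6,6,6,7,7)
E <- c(4,4,3,5,6,7,8,9,7)
DF1 <- data.frame(A,B,C)
DF2 <- data.frame(E,D,C)
DF3 <- data.frame(A,C,E)
DF4 <- data.frame(A,D,E)

dfs <- list(DF1, DF2, DF3, DF4)

# Stack the data frames into a rows x columns x data-frames array
arr <- array(unlist(lapply(dfs, as.matrix)),
             dim = c(nrow(DF1), ncol(DF1), length(dfs)))

# dims = 2 keeps the first two dimensions and averages over the third
cell_means <- rowMeans(arr, dims = 2)
cell_means[1, ]
```

As with the abind version, the result is a matrix rather than a data.frame.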

A combination of tidyverse and base:
#install.packages('tidyverse')
library(tidyverse)
transpose(list(DF1, DF2, DF3, DF4)) %>%
  map(function(x)
    rowMeans(do.call(rbind.data.frame,
                     transpose(x)))) %>%
  bind_cols()
Should yield:
# A B C
# <dbl> <dbl> <dbl>
# 1 1.75 3.25 2.5
# 2 2.50 3.25 2.5
# 3 3.00 3.25 2.0
# 4 4.25 3.50 3.0
# 5 5.25 4.25 3.5
# 6 6.25 4.50 4.0
# 7 7.25 5.00 5.0
# 8 8.25 5.75 5.5
# 9 8.50 5.75 4.0

Related

Interpolating or spline all columns of a data frame

If a data frame has M rows, how can it be interpolated or splined to create a new data frame with N rows? Here is an example:
# Start with some vectors of constant length (M=7) with data at each time point t
df <- tibble(t = c(1, 2, 3, 4, 5, 6, 7),
y1 = c(0.0, 0.5, 1.0, 3.0, 5.0, 2.0, 0.0),
y2 = c(0.0, 0.75, 1.5, 3.5, 6.0, 4.0, 0.0),
y3 = c(0.0, 1.0, 2.0, 4.0, 3.0, 2.0, 0.0))
# How to interpolate or spline these to other numbers of points (rows)?
# By individual column, to spline results to a new vector with length N=15:
spline(x=df$t, y=df$y1, n=15)
spline(x=df$t, y=df$y2, n=15)
spline(x=df$t, y=df$y3, n=15)
So by vector this is trivial. Question is, how can this spline be applied to all columns across the dataset with M rows to create a new dataset with N rows, preferably with tidyverse approach, e.g.:
df15 <- df %>% mutate(...replace(?)...(spline(x=?, y=?, n=15)... ???))
Again, I would like to have this spline be applied across ALL columns without having to specify syntax that includes column names. The intent is to apply this to data frames with something on the order of 100 columns and where names and numbers of columns may vary. It is of course not necessary to include the t (or x) column in the data frame if that simplifies the approach at all. Thanks for any insight.

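spline returns a list, so we can loop across the columns with summarise and then unpack the resulting columns. summarise can return any number of rows, whereas mutate must return the same number of rows as the input. (From dplyr 1.1.0 onwards, returning more than one row per group from summarise is deprecated in favour of reframe, which accepts the same across() call.)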
library(dplyr)
library(tidyr)
library(stringr)
df %>%
summarise(across(y1:y3, ~spline(t, .x, n = 15) %>%
as_tibble %>%
rename_with(~ str_c(cur_column(), .)))) %>%
unpack(everything())
Output:
# A tibble: 15 × 6
y1x y1y y2x y2y y3x y3y
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0
2 1.43 0.319 1.43 0.404 1.43 0.542
3 1.86 0.468 1.86 0.673 1.86 0.905
4 2.29 0.566 2.29 0.907 2.29 1.18
5 2.71 0.752 2.71 1.21 2.71 1.56
6 3.14 1.18 3.14 1.68 3.14 2.30
7 3.57 1.93 3.57 2.43 3.57 3.33
8 4 3 4 3.5 4 4
9 4.43 4.24 4.43 4.84 4.43 3.83
10 4.86 4.99 4.86 5.85 4.86 3.21
11 5.29 4.56 5.29 5.90 5.29 2.67
12 5.71 3.12 5.71 4.96 5.71 2.29
13 6.14 1.47 6.14 3.46 6.14 1.82
14 6.57 0.269 6.57 1.74 6.57 1.09
15 7 0 7 0 7 0
NOTE: Here, we renamed the columns as the output from spline is a list with names x and y and data.frame/tibble wants unique column names
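For comparison, the same per-column spline can be sketched in base R without naming any of the value columns. This assumes the x column is called t, as in the question; value_cols, n_out and df15 are illustrative names, and a plain data.frame stands in for the tibble:

```r
# Example data from the question, as a plain data.frame
df <- data.frame(t  = c(1, 2, 3, 4, 5, 6, 7),
                 y1 = c(0.0, 0.5, 1.0, 3.0, 5.0, 2.0, 0.0),
                 y2 = c(0.0, 0.75, 1.5, 3.5, 6.0, 4.0, 0.0),
                 y3 = c(0.0, 1.0, 2.0, 4.0, 3.0, 2.0, 0.0))

n_out <- 15
# Every column except the x column is splined
value_cols <- setdiff(names(df), "t")

# Keep only the interpolated y values of each spline
splined <- lapply(df[value_cols],
                  function(y) spline(x = df$t, y = y, n = n_out)$y)

# The interpolated x grid is the same for every column, so take it once
new_t <- spline(x = df$t, y = df[[value_cols[1]]], n = n_out)$x

df15 <- data.frame(t = new_t, splined)
dim(df15)
```

Keeping a single t column avoids the repeated x columns, which is why no renaming is needed here.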

Here is an option with data.table
library(data.table)
setDT(df)[,
lapply(.SD, function(v) list2DF(spline(t, v, n = 15))),
.SDcols = patterns("^y\\d+")
]
which gives
y1.x y1.y y2.x y2.y y3.x y3.y
1: 1.000000 0.0000000 1.000000 0.0000000 1.000000 0.0000000
2: 1.428571 0.3194303 1.428571 0.4039226 1.428571 0.5423159
3: 1.857143 0.4680242 1.857143 0.6731712 1.857143 0.9052687
4: 2.285714 0.5655593 2.285714 0.9065841 2.285714 1.1770242
5: 2.714286 0.7515972 2.714286 1.2081346 2.714286 1.5555866
6: 3.142857 1.1773997 3.142857 1.6848330 3.142857 2.3039184
7: 3.571429 1.9306220 3.571429 2.4271800 3.571429 3.3318454
8: 4.000000 3.0000000 4.000000 3.5000000 4.000000 4.0000000
9: 4.428571 4.2387392 4.428571 4.8368010 4.428571 3.8340703
10: 4.857143 4.9919616 4.857143 5.8546581 4.857143 3.2089361
11: 5.285714 4.5551878 5.285714 5.8976389 5.285714 2.6706702
12: 5.714286 3.1239451 5.714286 4.9619776 5.714286 2.2875045
13: 6.142857 1.4724741 6.142857 3.4632587 6.142857 1.8204137
14: 6.571429 0.2685633 6.571429 1.7399284 6.571429 1.0868916
15: 7.000000 0.0000000 7.000000 0.0000000 7.000000 0.0000000

How to create a table in R from a data set by taking the average of rows? [duplicate]

I have this data
> acic2
# A tibble: 21 x 9
PCC V1.1 V2.2 V3.3 V4.4 V5.5 V6.6 V7.7 Vtotal
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 8.33 5.33 6 6.5 7.67 7 5.33 6.60
2 A 8.67 4.33 6.25 7 7.5 7 5.67 6.63
3 B 9 4.33 7 7.25 7.83 6.8 6.17 6.91
4 C 5.17 3.33 5.25 2.75 3.17 4 4.5 4.02
5 C 8 6 6.25 3.5 6.17 5.6 6 5.93
6 D 6.5 5.67 7.25 5.75 5.33 6.4 4 5.84
7 D 6 4.67 6 5.25 3.67 4.6 5 5.03
8 E 6.5 7 6 7 4.33 5.4 5.67 5.99
9 E 9 5.67 6.5 8 6.17 3.6 5 6.28
10 F 9.17 8 6.5 6.25 7 6.4 6.67 7.14
# ... with 11 more rows
I want to create a separate data set called acic3 by averaging the rows that share the same letter in PCC, so that I get one row per letter containing the average score for each column.

You can use group_by and summarize(across(...)) in dplyr:
library(dplyr)
acic3 <- acic2 %>%
  group_by(PCC) %>%
  summarize(across(starts_with("V"), mean))
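A base R alternative is aggregate(), which is a different function from the dplyr approach above. The sketch below uses only the first five rows and two of the score columns from the question, so the data frame is a cut-down stand-in for acic2:

```r
# Cut-down stand-in for acic2 (first five rows, two score columns)
acic2 <- data.frame(PCC  = factor(c("A", "A", "B", "C", "C")),
                    V1.1 = c(8.33, 8.67, 9, 5.17, 8),
                    V2.2 = c(5.33, 4.33, 4.33, 3.33, 6))

# Average every other column within each level of PCC
acic3 <- aggregate(. ~ PCC, data = acic2, FUN = mean)
acic3
```

Note that the formula interface of aggregate() drops rows containing NA by default, which may or may not be what you want.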

How do I reduce the dimensions of my data frame in terms of columns by averaging between columns?

I have a data frame df1 that summarises water temperature every 2 meters until 39 meters depth over time. As an example:
df1<-data.frame(Datetime=c("2016-08-18 00:00:00","2016-08-18 00:01:00","2016-08-18 00:02:00","2016-08-18 00:03:00"),
Site=c("BD","HG","BD","HG"),
m0=c(2,5,6,1),
m2=c(3,5,2,4),
m4=c(4,1,9,3),
m6=c(2,5,6,1),
m8=c(3,5,2,4),
m10=c(2,5,6,1),
m12=c(4,1,9,3),
m14=c(3,5,2,4),
m16=c(2,5,6,1),
m18=c(4,1,9,3),
m20=c(3,5,2,4),
m22=c(2,5,6,1),
m24=c(4,1,9,3),
m26=c(3,5,2,4),
m28=c(2,5,6,1),
m30=c(4,1,9,3),
m32=c(3,5,2,4),
m34=c(2,5,6,1),
m36=c(4,1,9,3),
m38=c(3,5,2,4)
)
> df1
Datetime Site m0 m2 m4 m6 m8 m10 m12 m14 m16 m18 m20 m22 m24 m26 m28 m30 m32 m34 m36 m38
1 2016-08-18 00:00:00 BD 2 3 4 2 3 2 4 3 2 4 3 2 4 3 2 4 3 2 4 3
2 2016-08-18 00:01:00 HG 5 5 1 5 5 5 1 5 5 1 5 5 1 5 5 1 5 5 1 5
3 2016-08-18 00:02:00 BD 6 2 9 6 2 6 9 2 6 9 2 6 9 2 6 9 2 6 9 2
4 2016-08-18 00:03:00 HG 1 4 3 1 4 1 3 4 1 3 4 1 3 4 1 3 4 1 3 4
I would like to calculate the water temperature for layers of 8 meters instead of 2 meters by averaging the water temperatures across the corresponding columns. For instance, I would like to combine columns m0, m2, m4 and m6 into a single column called m3.5 that reflects the mean water temperature between 0 and 7 meters depth.
As my desired result:
> df1
Datetime Site m3.5 m11.5 m19.5 m27.5 m35.5
1 2016-08-18 00:00:00 BD 2.75 3.00 2.75 3.25 3.00
2 2016-08-18 00:01:00 HG 4.00 4.00 4.00 3.00 4.00
3 2016-08-18 00:02:00 BD 5.75 4.75 5.75 6.50 4.75
4 2016-08-18 00:03:00 HG 2.25 3.00 2.25 2.75 3.00
Does anyone know how to do that with dplyr?

Here is a solution that works with any number of columns:
num_meters <- 39
grp <- as.factor(cumsum(seq(0,num_meters, 2) %% 8 == 0))
df <- data.frame(df1[,c(1,2)],
t(apply(df1[,-c(1,2)], 1, function(x) tapply(x, grp, mean))))
# Datetime Site X1 X2 X3 X4 X5
#1 2016-08-18 00:00:00 BD 2.75 3.00 2.75 3.25 3.00
#2 2016-08-18 00:01:00 HG 4.00 4.00 4.00 3.00 4.00
#3 2016-08-18 00:02:00 BD 5.75 4.75 5.75 6.50 4.75
#4 2016-08-18 00:03:00 HG 2.25 3.00 2.25 2.75 3.00
# in case you also need the colnames that you have specified
colnames(df)[-c(1,2)] <- paste("m", tapply(seq(0,num_meters, 2), grp, mean) + 0.5, sep = "")

With tidyverse you could as well do something like this:
df1 %>%
gather(var, val, -Datetime, -Site) %>%
mutate(group = rep(seq(3.5, 35.5, 8), each = 16)) %>%
group_by(group, Site, Datetime) %>%
summarise(value = mean(val)) %>%
spread(group, value)
Site Datetime `3.5` `11.5` `19.5` `27.5` `35.5`
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BD 2016-08-18 00:00:00 2.75 3 2.75 3.25 3
2 BD 2016-08-18 00:02:00 5.75 4.75 5.75 6.5 4.75
3 HG 2016-08-18 00:01:00 4 4 4 3 4
4 HG 2016-08-18 00:03:00 2.25 3 2.25 2.75 3

You're probably looking for rowMeans:
df1$m3.5 <- rowMeans(df1[, c("m0", "m2", "m4", "m6")])
No need for dplyr.
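The rowMeans idea can be extended to every 8-meter layer at once. This is a sketch, assuming the depth columns are in increasing order and spaced 2 meters apart; the helper names (vals, pattern, depths, temps, grp, layer_means, result) are illustrative, and the data are rebuilt compactly from the three repeating value patterns in the question:

```r
# Rebuild df1 from the question: each depth column repeats one of three patterns
vals <- list(a = c(2, 5, 6, 1), b = c(3, 5, 2, 4), c = c(4, 1, 9, 3))
pattern <- strsplit("abcabacbacbacbacbacb", "")[[1]]
depths <- seq(0, 38, by = 2)
temps <- as.data.frame(do.call(cbind, vals[pattern]))
names(temps) <- paste0("m", depths)
df1 <- data.frame(Datetime = c("2016-08-18 00:00:00", "2016-08-18 00:01:00",
                               "2016-08-18 00:02:00", "2016-08-18 00:03:00"),
                  Site = c("BD", "HG", "BD", "HG"),
                  temps)

# Depths 0-6 fall in layer 0, 8-14 in layer 1, and so on
grp <- depths %/% 8

# Split the depth columns by layer and take row means within each layer
layer_means <- sapply(split.default(df1[-(1:2)], grp), rowMeans)
colnames(layer_means) <- paste0("m", tapply(depths, grp, mean) + 0.5)

result <- data.frame(df1[1:2], layer_means)
result
```

split.default() is used because split() on a data frame would split rows, whereas here the columns need to be grouped.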

The following does it.
library(dplyr)
df1 %>%
mutate(m3.5 = rowMeans(.[3:6]),
m11.5 = rowMeans(.[7:10]),
m19.5 = rowMeans(.[11:14]),
m27.5 = rowMeans(.[15:18]),
m35.5 = rowMeans(.[19:22])) %>%
select(Datetime, Site, m3.5:m35.5)
# Datetime Site m3.5 m11.5 m19.5 m27.5 m35.5
#1 2016-08-18 00:00:00 BD 2.75 3.00 2.75 3.25 3.00
#2 2016-08-18 00:01:00 HG 4.00 4.00 4.00 3.00 4.00
#3 2016-08-18 00:02:00 BD 5.75 4.75 5.75 6.50 4.75
#4 2016-08-18 00:03:00 HG 2.25 3.00 2.25 2.75 3.00

How do I remove NA from a data frame with the intention of using sapply on the data frame [duplicate]

I have a data frame:
colA colB
1 15.3 1.76
2 10.8 1.34
3 8.1 1.27
4 19.5 1.47
5 7.2 1.27
6 5.3 1.49
7 9.3 1.31
8 11.1 1.09
9 7.5 1.18
10 12.2 1.22
11 6.7 1.25
12 5.2 1.19
13 19.0 1.95
14 15.1 1.28
15 6.7 1.52
16 8.6 NA
17 4.2 1.12
18 10.3 1.37
19 12.5 1.19
20 16.1 1.05
21 13.3 1.32
22 4.9 1.03
23 8.8 1.12
24 9.5 1.70
How would I be able to remove/change the value of all NAs such that when I use sapply (i.e. sapply(x, mean)), I am taking the mean of 24 rows in the case of colA and 23 rows for colB?
I understand that data frames have to have the same number of rows so using something like na.omit() would not work because it'd remove, in this case, row 16; I'd lose a row of data when I'm calculating the mean for colA.
Thanks!

You should be able to pass na.rm = TRUE and get the mean.
Example:
df <- data.frame(A = 1:3, B = c(NA, 1, 2))
apply(df, 2, mean, na.rm = TRUE)
# A B
# 2.0 1.5
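Since the question uses sapply, the same argument can be passed there too. A minimal sketch with a four-row stand-in for the question's data (rows 1, 2, 3 and 16), which also counts how many values each mean is based on; col_means and n_used are illustrative names:

```r
# Four rows taken from the question's data, including the row with NA
x <- data.frame(colA = c(15.3, 10.8, 8.1, 8.6),
                colB = c(1.76, 1.34, 1.27, NA))

# Extra arguments to sapply are forwarded to mean, so NA is dropped per column
col_means <- sapply(x, mean, na.rm = TRUE)

# Number of non-missing values that went into each mean
n_used <- colSums(!is.na(x))

col_means
n_used
```

No row is removed from the data frame: colA is averaged over all of its rows, and colB over its non-missing rows only.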

Cut Function in R program

Time Velocity
0 0
1.5 1.21
3 1.26
4.5 1.31
6 1.36
7.5 1.41
9 1.46
10.5 1.51
12 1.56
13 1.61
14 1.66
15 1.71
16 1.76
17 1.81
18 1.86
19 1.91
20 1.96
21 2.01
22.5 2.06
24 2.11
25.5 2.16
27 2.21
28.5 2.26
30 2.31
31.5 2.36
33 2.41
34.5 2.4223
36 2.4323
I have data with Time and Velocity columns. I want to use the cut or the which function to separate my data into 6-minute intervals. My maximum Time usually goes up to 3000 minutes.
So I would want the output to be similar to this...
Time Velocity
0 0
1.5 1.21
3 1.26
4.5 1.31
6 1.36
Time Velocity
6 1.36
7.5 1.41
9 1.46
10.5 1.51
12 1.56
Time Velocity
12 1.56
13 1.61
14 1.66
15 1.71
16 1.76
17 1.81
18 1.86
So far I have read the data using data = read.delim("clipboard").
I decided to use the function which, but I would need to repeat it for every interval up to 3000 minutes:
dat <- data[which(data$Time >= 0 & data$Time < 6), ]
dat1 <- data[which(data$Time >= 6 & data$Time < 12), ]
etc.
This is not convenient when the time goes up to 3000 minutes.
I would also like all my results to be contained in one output variable.

I will assume here that you really don't want to duplicate the values across the bins.
cuts = cut(data$Time, seq(0, max(data$Time)+6, by=6), right=FALSE)
x <- by(data, cuts, FUN=I)
x
## cuts: [0,6)
## Time Velocity
## 1 0.0 0.00
## 2 1.5 1.21
## 3 3.0 1.26
## 4 4.5 1.31
## ------------------------------------------------------------------------------------------------------------
## cuts: [6,12)
## Time Velocity
## 5 6.0 1.36
## 6 7.5 1.41
## 7 9.0 1.46
## 8 10.5 1.51
## ------------------------------------------------------------------------------------------------------------
## <snip>
## ------------------------------------------------------------------------------------------------------------
## cuts: [36,42)
## Time Velocity
## 28 36 2.4323
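If a plain named list is easier to work with than the object returned by by(), the same cut() factor can be passed to split(). A minimal sketch using the first ten rows of the question's data; pieces is an illustrative name:

```r
# First ten rows of the question's data
data <- data.frame(
  Time     = c(0, 1.5, 3, 4.5, 6, 7.5, 9, 10.5, 12, 13),
  Velocity = c(0, 1.21, 1.26, 1.31, 1.36, 1.41, 1.46, 1.51, 1.56, 1.61)
)

# Left-closed 6-minute bins: [0,6), [6,12), [12,18)
cuts <- cut(data$Time, seq(0, max(data$Time) + 6, by = 6), right = FALSE)

# One data frame per bin, all held in a single list
pieces <- split(data, cuts)
names(pieces)
pieces[["[6,12)"]]
```

Each element of the list is an ordinary data frame, so it can be processed further with lapply().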

I don't think that you want duplicated bounds. Here is a simple solution without using cut (similar to @Mathew's solution).
dat <- transform(dat, index = dat$Time %/% 6)
by(dat,dat$index,FUN=I)
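To see what the integer division does, here is a small sketch on the first ten rows of the question's data, assuming dat holds the full data frame. It uses split() instead of by() so that the result is a plain list; pieces is an illustrative name:

```r
# First ten rows of the question's data
dat <- data.frame(
  Time     = c(0, 1.5, 3, 4.5, 6, 7.5, 9, 10.5, 12, 13),
  Velocity = c(0, 1.21, 1.26, 1.31, 1.36, 1.41, 1.46, 1.51, 1.56, 1.61)
)

# Times in [0,6) get index 0, times in [6,12) get index 1, and so on
dat$index <- dat$Time %/% 6

# One data frame per index value
pieces <- split(dat, dat$index)
dat$index
```

The index is the zero-based bin number, so it lines up with the left-closed intervals produced by cut(..., right = FALSE).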

If you really need duplicate timestamps at integer multiples of 6, then you will have to do some data duplication before splitting.
txt <- "Time Velocity\n0 0\n1.5 1.21\n3 1.26\n4.5 1.31\n6 1.36\n7.5 1.41\n9 1.46\n10.5 1.51\n12 1.56\n13 1.61\n14 1.66\n15 1.71\n16 1.76\n17 1.81\n18 1.86\n19 1.91\n20 1.96\n21 2.01\n22.5 2.06\n24 2.11\n25.5 2.16\n27 2.21\n28.5 2.26\n30 2.31\n31.5 2.36\n33 2.41\n34.5 2.4223\n36 2.4323"
DF <- read.table(text = txt, header = TRUE)
# Create duplicate timestamps where the timestamp is a multiple of 6
posinc <- DF[DF$Time%%6 == 0, ]
neginc <- DF[DF$Time%%6 == 0, ]
posinc <- posinc[-1, ]
neginc <- neginc[-1, ]
# Add tiny +ve and -ve increments to these duplicated timestamps
posinc$Time <- posinc$Time + 0.01
neginc$Time <- neginc$Time - 0.01
# Bind the original data frame (minus the timestamps that are multiples of 6) with the duplicated timestamps above
DF2 <- do.call(rbind, list(DF[!DF$Time%%6 == 0, ], posinc, neginc))
# Order by timestamp
DF2 <- DF2[order(DF2$Time), ]
# Split the dataframe by quotient of timestamp divided by 6
SL <- split(DF2, DF2$Time%/%6)
# Round back up the timestamps of split data to 1 decimal place
RESULT <- lapply(SL, function(x) {
x$Time <- round(x$Time, 1)
return(x)
})
RESULT
## $`0`
## Time Velocity
## 2 1.5 1.21
## 3 3.0 1.26
## 4 4.5 1.31
## 51 6.0 1.36
##
## $`1`
## Time Velocity
## 5 6.0 1.36
## 6 7.5 1.41
## 7 9.0 1.46
## 8 10.5 1.51
## 91 12.0 1.56
##
## $`2`
## Time Velocity
## 9 12 1.56
## 10 13 1.61
## 11 14 1.66
## 12 15 1.71
## 13 16 1.76
## 14 17 1.81
## 151 18 1.86
##
## $`3`
## Time Velocity
## 15 18.0 1.86
## 16 19.0 1.91
## 17 20.0 1.96
## 18 21.0 2.01
## 19 22.5 2.06
## 201 24.0 2.11
##
## $`4`
## Time Velocity
## 20 24.0 2.11
## 21 25.5 2.16
## 22 27.0 2.21
## 23 28.5 2.26
## 241 30.0 2.31
##
## $`5`
## Time Velocity
## 24 30.0 2.3100
## 25 31.5 2.3600
## 26 33.0 2.4100
## 27 34.5 2.4223
## 281 36.0 2.4323
##
## $`6`
## Time Velocity
## 28 36 2.4323
##
