There is a data frame x with 5753 observations of 4 variables.
The column names are: date, Depth, var1, and var2. I converted date and Depth to factor before performing aggregate().
I want to calculate the mean and standard deviation of the two variables, grouped by date and Depth.
When applying aggregate(x[,3:4], by = list(x$date, x$Depth), FUN = function(x) c(avg = mean(x, na.rm = TRUE), SD= sd)), I got the average of var1 and the average of var2 grouped by date and Depth, but I did not get the SD.
When applying aggregate(. ~ date+Depth, data = x, FUN = function(x) c(avg = mean(x, na.rm = TRUE), SD= sd)), I got an error message: "Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate".
After counting the NAs in the two columns, I found that there are 5622 NAs in var1 and 5049 NAs in var2. I do not want to remove the NAs before applying aggregate() yet.
My questions are:
Why did I not get the SD with the first syntax?
Why does the second syntax not work? I learned this syntax from Stack Overflow, and it worked with the following data frame:
x3 <- read.table(text = " id1 id2 val1 val2
1 a x 1 9
2 a x 2 4
3 a y 3 NA
4 a y 4 NA
5 b x 1 NA
6 b y 4 NA
7 b x 3 9
8 b y 2 8", header = TRUE)
We can use dplyr, where we pass the grouping columns in group_by and the columns to summarise in summarise with across
library(dplyr) #1.0.0
x3 %>%
group_by(id1, id2) %>%
summarise(across(starts_with('val'),
list(mean = ~ mean(., na.rm = TRUE) , sd = ~sd(., na.rm = TRUE))))
# A tibble: 4 x 6
# Groups: id1 [2]
# id1 id2 val1_mean val1_sd val2_mean val2_sd
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a x 1.5 0.707 6.5 3.54
#2 a y 3.5 0.707 NaN NA
#3 b x 2 1.41 9 NA
#4 b y 3 1.41 8 NA
If the version of dplyr is < 1.0.0, we can use summarise_at
x3 %>%
group_by(id1, id2) %>%
summarise_at(vars(-group_cols()), list(mean = ~ mean(., na.rm = TRUE),
sd = ~ sd(., na.rm = TRUE)))
With aggregate, the error occurs because of the NA elements: the formula method uses na.action = na.omit by default, which removes a row if there is any NA in that row, so here no rows are left to aggregate. Specifying either na.action = na.pass or na.action = NULL resolves that issue. However, when multiple functions are combined with c, the result is a matrix column. In order to get ordinary data.frame columns, we can wrap the call with data.frame in do.call
do.call(data.frame, aggregate(. ~ id1 + id2, data = x3, FUN = function(x)
c(avg = mean(x, na.rm = TRUE), SD= sd(x, na.rm = TRUE)), na.action = NULL))
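Both points from the question can be checked on the x3 sample data. The following is only a sketch on that small data frame, not on the asker's full data set, and the names bad and res are made up for the illustration. It shows that writing SD = sd inside c() stores the function itself instead of calling it, and that the formula interface drops rows containing NA unless na.action says otherwise.

```r
x3 <- read.table(text = "id1 id2 val1 val2
a x 1 9
a x 2 4
a y 3 NA
a y 4 NA
b x 1 NA
b y 4 NA
b x 3 9
b y 2 8", header = TRUE)

# c() of a number and a function yields a list, not a numeric vector,
# so SD = sd never computes a standard deviation
bad <- c(avg = mean(c(1, 2)), SD = sd)
is.list(bad)
is.function(bad$SD)

# The default na.action (na.omit) keeps only the complete rows;
# if every row had an NA, nothing would be left to aggregate
nrow(na.omit(x3))

# Calling sd() and passing the NAs through to FUN gives both statistics
res <- aggregate(. ~ id1 + id2, data = x3,
                 FUN = function(v) c(avg = mean(v, na.rm = TRUE),
                                     SD = sd(v, na.rm = TRUE)),
                 na.action = na.pass)
res
```

Here res$val1 and res$val2 are matrix columns, which is why the answer wraps the call in do.call(data.frame, ...).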
Let's say I have the data frames with the same column names
DF1 = data.frame(a = c(0,1), b = c(2,3), c = c(4,5))
DF2 = data.frame(a = c(6,7), c = c(8,9))
and want to apply some basic calculation on them, for example add each column.
Since I also want the goal data frame to display missing data, I appended such a column to DF2, so I have
> DF2
a c b
1 6 8 NA
2 7 9 NA
What I tried here now is to create the data frame
for(i in names(DF2)){
DF3 = data.frame(i = DF1[i] + DF2[i])
}
(and then bind this together) but this obviously doesn't work since the order of the columns is mashed up.
SO,
what's the best way to do this pairwise calculation when the order of the columns is not the same, without reordering them?
I also tried doing (since this is what I thought would be a fix)
for(i in names(DF2)){
DF3 = data.frame(i = DF1$i + DF2$i)
}
but this doesn't work because DF1$i is NULL for all i.
Conclusion: I want the data frame
>DF3
a b c
1 6+0 NA 4+8
2 1+7 NA 5+9
Any help would be appreciated.
This may help -
#Get column names from DF1 and DF2
all_cols <- union(names(DF1), names(DF2))
#Fill missing columns with NA in both the dataframe
DF1[setdiff(all_cols, names(DF1))] <- NA
DF2[setdiff(all_cols, names(DF2))] <- NA
#add the two dataframes arranging the columns
DF1[all_cols] + DF2[all_cols]
# a b c
#1 6 NA 12
#2 8 NA 14
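As for why the loop in the question fails: DF1$i looks for a column literally named "i", whereas [[ evaluates i first. A minimal sketch of a name-based version, assuming DF2 already has its NA column b appended as described in the question (DF3 is built here with sapply rather than a for loop):

```r
DF1 <- data.frame(a = c(0, 1), b = c(2, 3), c = c(4, 5))
DF2 <- data.frame(a = c(6, 7), c = c(8, 9))
DF2$b <- NA  # appended column, so DF2's column order is a, c, b

# `[[` looks each column up by the name held in i,
# so the differing column order in DF2 does not matter
DF3 <- as.data.frame(
  sapply(names(DF1), function(i) DF1[[i]] + DF2[[i]], simplify = FALSE)
)
DF3
```

The result keeps the column order of DF1, because the names are taken from names(DF1).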
We can use bind_rows
library(dplyr)
library(data.table)
bind_rows(DF1, DF2, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 6 NA 12
2 8 NA 14
Another base R option using aggregate + stack + reshape
aggregate(
. ~ rid,
transform(
reshape(
transform(rbind(
stack(DF1),
stack(DF2)
),
rid = ave(seq_along(ind), ind, FUN = seq_along)
),
direction = "wide",
idvar = "rid",
timevar = "ind"
),
rid = 1:nrow(DF1)
),
sum,
na.action = "na.pass"
)[-1]
gives
values.a values.b values.c
1 6 NA 12
2 8 NA 14
I'm using the following dataframe in R:
ID <- c(LETTERS[1:10])
GLUC <- c(88,NA,110,NA,90,88,120,110,NA,90)
TGL <- c(NA,150,NA,200,210,NA,164,170,190,NA)
HDL <- c(32,60,NA,65,NA,32,NA,70,NA,75)
LDL <- c(99,NA,120,165,150,210,NA,188,190,NA)
patient_num <- data.frame(ID,GLUC,TGL,HDL,LDL)
And I want to create a matrix that has GLUC, TGL, HDL and LDL as the row names and mean, median, sd, n and n_miss as the column names. When I put in the following code:
table1 <- function(patient_num, varname){
r <- c(mean(patient_num[[varname]],na.rm=TRUE),
median(patient_num[[varname]],na.rm=TRUE),
sd(patient_num[[varname]],na.rm=TRUE),
sum(!is.na(patient_num[[varname]])),
sum(is.na(patient_num[[varname]]))
)
if (length(varname) == 1){
r <- matrix(r,nrow=T)
} else{
for (index in 2:length(varname)){
oneRow = table1(patient_num,varname[[index]])
r <- rbind(r,oneRow)
}
}
rownames(r) <- varname
colnames(r) <- c("mean","median","sd","n","n_miss")
return(r)
}
table1(patient_num,c("GLUC","TGL","HDL","LDL"))
I get an error message:
Error in .subset2(x, i, exact = exact) : recursive indexing failed at level 2
Can't seem to figure out what's wrong
There's a simpler solution using sapply() from base R:
new_df <- sapply(patient_num, function(x) list(
mean = mean(x, na.rm = T),
sd = sd(x, na.rm = T),
n = sum(!is.na(x)),
is_na = sum(is.na(x))))
t(new_df)
#> mean sd n is_na
#>ID NA NA 10 0
#>GLUC 99.42857 13.45185 7 3
#>TGL 180.6667 23.0362 6 4
#>HDL 55.66667 19.00175 6 4
#>LDL 160.2857 40.06126 7 3
If you do not want a row for ID in the result, you can just remove ID from patient_num and run the same code.
Note that you might want to transform new_df back to a data.frame.
As for the error: [[ can select only one column at a time, so patient_num[[varname]] fails when varname holds several names.
Here is an alternative way using dplyr functions.
library(dplyr)
table1 <- function(data, varname) {
data %>%
select(all_of(varname)) %>%
tidyr::pivot_longer(cols = everything()) %>%
group_by(name) %>%
summarise(mean = mean(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
n = sum(!is.na(value)),
n_miss = sum(is.na(value)))
}
table1(patient_num,c("GLUC","TGL","HDL","LDL"))
# A tibble: 4 x 6
# name mean median sd n n_miss
# <chr> <dbl> <dbl> <dbl> <int> <int>
#1 GLUC 99.4 90 13.5 7 3
#2 HDL 55.7 62.5 19.0 6 4
#3 LDL 160. 165 40.1 7 3
#4 TGL 181. 180 23.0 6 4
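If a plain matrix with the variables as row names is preferred, as in the question, a base R version of table1 could look roughly like this. It is a sketch rather than the asker's original function: the helper stats is a made-up name, and the recursion is replaced by sapply over the selected columns.

```r
ID <- LETTERS[1:10]
GLUC <- c(88, NA, 110, NA, 90, 88, 120, 110, NA, 90)
TGL <- c(NA, 150, NA, 200, 210, NA, 164, 170, 190, NA)
HDL <- c(32, 60, NA, 65, NA, 32, NA, 70, NA, 75)
LDL <- c(99, NA, 120, 165, 150, 210, NA, 188, 190, NA)
patient_num <- data.frame(ID, GLUC, TGL, HDL, LDL)

table1 <- function(data, varname) {
  # summary statistics for a single numeric vector
  stats <- function(v) c(mean   = mean(v, na.rm = TRUE),
                         median = median(v, na.rm = TRUE),
                         sd     = sd(v, na.rm = TRUE),
                         n      = sum(!is.na(v)),
                         n_miss = sum(is.na(v)))
  # data[varname] accepts several names at once, unlike data[[varname]];
  # sapply returns one column per variable, so transpose the result
  t(sapply(data[varname], stats))
}

res <- table1(patient_num, c("GLUC", "TGL", "HDL", "LDL"))
res
```

Because data[varname] also works for a single name, no special case for length(varname) == 1 is needed.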
I have the data frame below. I need to find the row min and max, excluding the few columns that are character.
df
x y z
1 1 1 a
2 2 5 b
3 7 4 c
I need
df
x y z Min Max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
Another dplyr possibility could be:
df %>%
mutate(Max = do.call(pmax, select_if(., is.numeric)),
Min = do.call(pmin, select_if(., is.numeric)))
x y z Max Min
1 1 1 a 1 1
2 2 5 b 5 2
3 7 4 c 7 4
Or a variation proposed by @G. Grothendieck:
df %>%
mutate(Min = pmin(!!!select_if(., is.numeric)),
Max = pmax(!!!select_if(., is.numeric)))
Another base R solution. Subset only the columns with numbers and then use apply in each row to get the minimum and maximum value with range.
cbind(df, t(apply(df[sapply(df, is.numeric)], 1, function(x)
setNames(range(x, na.rm = TRUE), c("min", "max")))))
# x y z min max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
1) This one-liner uses no packages:
transform(df, min = pmin(x, y), max = pmax(x, y))
giving:
x y z min max
1 1 1 a 1 1
2 2 5 b 2 5
3 7 4 c 4 7
2) If you have many columns and don't want to list them all or determine yourself which are numeric then this also uses no packages.
ix <- sapply(df, is.numeric)
transform(df, min = apply(df[ix], 1, min), max = apply(df[ix], 1, max))
If your actual data has NAs and if you want to ignore them when taking the min or max then min, max, pmin and pmax all take an optional na.rm = TRUE argument.
Note
Lines <- "x y z
1 1 1 a
2 2 5 b
3 7 4 c"
df <- read.table(text = Lines)
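To illustrate the remark about na.rm, here is a small sketch on a made-up variant of the data with one missing value. The names df_na, strict, lenient and lenient2 are hypothetical and exist only for this example.

```r
df_na <- data.frame(x = c(1, NA, 7), y = c(1, 5, 4), z = c("a", "b", "c"))

# Without na.rm, any NA in a row propagates to that row's min and max
strict <- transform(df_na, min = pmin(x, y), max = pmax(x, y))

# With na.rm = TRUE, the NA is ignored and the remaining value is used
lenient <- transform(df_na,
                     min = pmin(x, y, na.rm = TRUE),
                     max = pmax(x, y, na.rm = TRUE))

# The same idea for the apply-based variant over all numeric columns
ix <- sapply(df_na, is.numeric)
lenient2 <- transform(df_na,
                      min = apply(df_na[ix], 1, min, na.rm = TRUE),
                      max = apply(df_na[ix], 1, max, na.rm = TRUE))
lenient2
```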
1) We can use select_if to select the columns that are numeric, then get the rowwise min and max with pmin and pmax, and bind the result to the original dataset
library(dplyr)
library(purrr)
df %>%
select_if(is.numeric) %>%
transmute(Min = reduce(., pmin, na.rm = TRUE),
Max = reduce(., pmax, na.rm = TRUE)) %>%
bind_cols(df, .)
# x y z Min Max
#1 1 1 a 1 1
#2 2 5 b 2 5
#3 7 4 c 4 7
NOTE: Here, we use only a single expression of select_if
2) The same can be done in base R (no packages used)
i1 <- names(which(sapply(df, is.numeric)))
df['Min'] <- do.call(pmin, c(df[i1], na.rm = TRUE))
df['Max'] <- do.call(pmax, c(df[i1], na.rm = TRUE))
Also, as stated in the comments, this is a generalized option. If there are only two columns, simply calling pmin(x, y) or pmax(x, y) is possible, but that does not check whether the columns are numeric and is not a general solution.
NOTE: All of the solutions mentioned here were either answered first or come from the comments with the OP.
data
df <- structure(list(x = c(1L, 2L, 7L), y = c(1L, 5L, 4L), z = c("a",
"b", "c")), class = "data.frame", row.names = c("1", "2", "3"
))
I want to aggregate data by group. However, each observation can belong to several groups (e.g. observation 1 belongs to groups A and B). I could not find a nice way to achieve this with data.table. Currently I create, for each of the possible groups, a logical variable that is TRUE if the observation belongs to that group. I am looking for a better way to do this than the one presented below. I would also like to know how I could achieve this with the tidyverse.
library(data.table)
# Data
set.seed(1)
TF <- c(TRUE, FALSE)
time <- rep(1:4, each = 5)
df <- data.table(time = time, x = rnorm(20), groupA = sample(TF, size = 20, replace = TRUE),
groupB = sample(TF, size = 20, replace = TRUE),
groupC = sample(TF, size = 20, replace = TRUE))
# This should be nicer and less repetitive
df[groupA == TRUE, .(A = sum(x)), by = time][
df[groupB == TRUE, .(B = sum(x)), by = time], on = "time"][
df[groupC == TRUE, .(C = sum(x)), by = time], on = "time"]
# desired output
time A B C
1: 1 NA 0.9432955 0.1331984
2: 2 1.2257538 0.2427420 0.1882493
3: 3 -0.1992284 -0.1992284 1.9016244
4: 4 0.5327774 0.9438362 0.9276459
Here is a solution with data.table:
df[, lapply(.SD[, .(groupA, groupB, groupC)]*x, sum), time]
# > df[, lapply(.SD[, .(groupA, groupB, groupC)]*x, sum), time]
# time groupA groupB groupC
# 1: 1 0.0000000 0.9432955 0.1331984
# 2: 2 1.2257538 0.2427420 0.1882493
# 3: 3 -0.1992284 -0.1992284 1.9016244
# 4: 4 0.5327774 0.9438362 0.9276459
or (thanks to @chinsoon12 for the comment) more programmatically:
df[, lapply(.SD*x, sum), by=.(time), .SDcols=paste0("group", c("A","B","C"))]
If you want the result in the long format you can do:
df[, colSums(.SD*x), by=.(time), .SDcols=paste0("group", c("A","B","C"))]
### with indicator for the group:
df[, .(colSums(.SD*x), c("A","B","C")), by=.(time), .SDcols=paste0("group", c("A","B","C"))]
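One difference from the desired output is worth noting: multiplying by the logical flag gives 0 for a time point where a group has no members, whereas the desired output shows NA there. The following base R sketch shows the two behaviours side by side on a tiny made-up data set. The data frame d and the helper sum_members are hypothetical, and base R is used only to keep the example dependency-free.

```r
d <- data.frame(time   = c(1, 1, 2, 2),
                x      = c(10, 20, 30, 40),
                groupA = c(FALSE, FALSE, TRUE, TRUE),
                groupB = c(TRUE, FALSE, TRUE, FALSE))
groups <- c("groupA", "groupB")

# Flag-times-value approach: a group with no members at a time sums to 0
by_mult <- aggregate(d[groups] * d$x, by = list(time = d$time), FUN = sum)

# Summing only the member rows: a group with no members stays NA
times <- sort(unique(d$time))
sum_members <- function(g) {
  s <- tapply(d$x[d[[g]]], d$time[d[[g]]], sum)
  unname(s[as.character(times)])  # times absent from s come back as NA
}
by_filter <- data.frame(time = times, sapply(groups, sum_members))

by_mult
by_filter
```

Which of the two is appropriate depends on whether an empty group should count as a sum of zero or as missing.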
I think it's easier here to work in long format. First I gather the observations to long format, then keep only the values where the observation belongs to the corresponding group. Then I remove the logical column, and rename the groups to single letters. Then I aggregate across groups and time (summarise in dplyr).
Finally I spread back to wide format.
library(dplyr)
library(tidyr)
set.seed(1)
TF <- c(TRUE, FALSE)
time <- rep(1:4, each = 5)
df <- data.frame(time = time, x = rnorm(20), groupA = sample(TF, size = 20, replace = TRUE),
groupB = sample(TF, size = 20, replace = TRUE),
groupC = sample(TF, size = 20, replace = TRUE))
df %>%
gather(group, belongs, groupA:groupC) %>%
filter(belongs) %>%
select(-belongs) %>%
mutate(group = gsub("group", "", group)) %>%
group_by(time, group) %>%
summarise(x = sum(x)) %>%
spread(group, x)
Output
# A tibble: 4 x 4
# Groups: time [4]
time A B C
<int> <dbl> <dbl> <dbl>
1 1 NA 0.943 0.133
2 2 1.23 0.243 0.188
3 3 -0.199 -0.199 1.90
4 4 0.533 0.944 0.928
One option is to use the tidyr and dplyr packages in combination with data.table: work on the data in long format and then convert it to wide format.
library(data.table) # provides melt() for the data.table df from the question
library(dplyr)
library(tidyr)
melt(df, id.vars = c("time", "x")) %>%
filter(value) %>%
group_by(time, variable) %>%
summarise(sum = sum(x)) %>%
spread(variable, sum)
# # A tibble: 4 x 4
# # Groups: time [4]
# time groupA groupB groupC
# * <int> <dbl> <dbl> <dbl>
# 1 1 NA 0.943 0.133
# 2 2 1.23 0.243 0.188
# 3 3 -0.199 -0.199 1.90
# 4 4 0.533 0.944 0.928
I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
In the past, I've only had to deal with cases of mis-matched repetitions, which has been easy with aggregate(cbind(var1, var2) ~ name, test, FUN = mean, na.action = na.omit) (or the default setting). I'll get averages for each lot over three values for var1 and over four values for var2.
Unfortunately, this will leave me with a dataset completely missing lot A in this case:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.omit)
name var1 var2 var3
1 B 2 6 2
2 C 2 10 6
If I use na.pass, however, I also don't get what I want:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.pass)
name var1 var2 var3
1 A NA 2.5 NA
2 B NA 6.5 2.5
3 C NA 10.5 6.5
Now I lose the good data I had in var1 since it contained instances of NA.
What I'd like is:
NA as the output of mean() if all unique combinations of varN ~ name are NAs
Output of mean() if there are one or more actual values for varN ~ name
I'm guessing this is pretty simple, but I just don't know how. Do I need to use ddply for something like this? If so... the reason I tend to avoid it is that I end up writing really long equivalents to aggregate() like so:
ddply(test, .(name), summarise,
var1 = mean(var1, na.rm = T),
var2 = mean(var2, na.rm = T),
var3 = mean(var3, na.rm = T))
Yeah... so the result of that apparently does what I want. I'll leave the question anyway in case there's 1) a way to do this with aggregate() or 2) shorter syntax for ddply.
Pass both na.action=na.pass and na.rm=TRUE to aggregate. The former tells aggregate not to delete rows where NAs exist; and the latter tells mean to ignore them.
aggregate(cbind(var1, var2, var3) ~ name, test, mean,
na.action=na.pass, na.rm=TRUE)
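One caveat: mean() with na.rm = TRUE returns NaN, not NA, when every value in a group is missing (lot A for var3 here). If a plain NA is wanted in that case, as the question asks, a small wrapper can be passed instead. This is a sketch, and mean_or_na is a made-up helper name:

```r
test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
                   var1 = rep(c(1:3, NA), 3),
                   var2 = 1:12,
                   var3 = c(rep(NA, 4), 1:8))

# NA when a group has no observed values, otherwise the mean of the rest
mean_or_na <- function(v) {
  if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
}

res <- aggregate(cbind(var1, var2, var3) ~ name, test, mean_or_na,
                 na.action = na.pass)
res
```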