This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
In the data frame below, is it possible to find the Date at which the maximum Value occurred for each group?
df
Date Var Value
27/9/2019 A 56
28/9/2019 A 50
1/10/2019 B 90
2/10/2019 B 100
df1
Var Max Date Mean
A 56 27/9/2019 53
B 100 2/10/2019 95
We can group by Var, calculate the mean of Value, and select the row with the maximum Value.
library(dplyr)
df %>%
  group_by(Var) %>%
  mutate(Mean = mean(Value)) %>%
  slice(which.max(Value))
# Date Var Value Mean
# <fct> <fct> <int> <dbl>
#1 27/9/2019 A 56 53
#2 2/10/2019 B 100 95
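On dplyr 1.0.0 and later, slice_max() expresses the same idea more directly; with_ties = FALSE mirrors which.max(), which keeps only the first row on ties:
library(dplyr)

df %>%
  group_by(Var) %>%
  mutate(Mean = mean(Value)) %>%
  slice_max(Value, n = 1, with_ties = FALSE)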
Base R, split-apply-combine (Edited):
# Create df, making sure the date vector has the appropriate type
# (note the four-digit %Y; %y would misparse "2019"):
df <- data.frame(
  Date = as.Date(c("27/9/2019", "28/9/2019", "1/10/2019", "2/10/2019"), "%d/%m/%Y"),
  Var = c("A", "A", "B", "B"),
  Value = c(56, 50, 90, 100),
  stringsAsFactors = FALSE
)
# Split df by "Var" values and summarise each piece:
split_applied_combined <- lapply(split(df, df$Var), function(x) {
  # Date(s) at which the maximum Value occurs:
  max_date <- x$Date[which(x$Value == max(x$Value))]
  # Mean and standard deviation of Value:
  mean_val <- mean(x$Value)
  sd_val <- sd(x$Value)
  # Combine the vectors into a one-row data frame:
  data.frame(max_date, mean_val, sd_val)
})
# Combine the list back into a single data frame,
# storing each list element's name in a "Var" column:
split_applied_combined <- do.call(
  rbind,
  mapply(cbind, "Var" = names(split_applied_combined),
         split_applied_combined, SIMPLIFY = FALSE)
)
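For the sample data, the combined result should look like this (assuming the %d/%m/%Y parsing above; row names carry over from the split):
#   Var   max_date mean_val   sd_val
# A   A 2019-09-27       53 4.242641
# B   B 2019-10-02       95 7.071068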
Dplyr alternative:
library(dplyr)
# Group by Var, summarise, and store the result as a data frame.
# Note: the date of the maximum Value is Date[which.max(Value)];
# max(Date) would instead return the latest date in each group.
summarised_df <-
  df %>%
  group_by(Var) %>%
  summarise(max_date_per_group = Date[which.max(Value)],
            mean_val_per_group = mean(Value),
            sd_per_group = sd(Value)) %>%
  ungroup()
There might be a better way to do this, since you want a summary that reduces multiple values down to a single value. You could *_join the filtered rows with the summary table as follows:
library(dplyr)
df1 <- df %>%
  group_by(Var) %>%
  filter(Value == max(Value)) %>%
  select(Var, Value, Date)

df2 <- df %>%
  group_by(Var) %>%
  summarise_at(.vars = vars(Value),
               .funs = c(mean = "mean", sd = "sd"))

df2 %>%
  left_join(df1, by = "Var") %>%
  select(Var, Value, Date, mean, sd)
# -------------------------------------------------------------------------
# # A tibble: 2 x 5
# Var Value Date mean sd
# <chr> <dbl> <chr> <dbl> <dbl>
# 1 A 56 27/9/2019 53 4.24
# 2 B 100 2/10/2019 95 7.07
Data
df <- data.frame(
  Date = c("27/9/2019", "28/9/2019", "1/10/2019", "2/10/2019"),
  Var = c("A", "A", "B", "B"),
  Value = c(56, 50, 90, 100),
  stringsAsFactors = FALSE
)
Hope that is what you want.
I have a data.frame with 150 columns. For each column, I want to extract the maximum and minimum values (values may repeat across rows) and the row names of each extreme value. I have extracted the min and max values into another data.frame but don't know how to match them to their rows.
I have found functions that come very close, like this one for minimum values:
head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
sapply(cars, which.min)
speed dist
1 1
Here, it only gives the first index for minimum speed.
And I've tried with loops like:
for (i in colnames(cars)) {
  print(min(cars[[i]]))
}
[1] 4
[1] 2
But that just gives me the minimum values, not whether they are repeated or the row name of each occurrence.
I want something like:
min.value column rowname freq.times
4 speed 1,2 2
2 dist 1 1
Thanks, and sorry for any spelling mistakes; I'm not a native speaker.
One option is to use the tidyverse. It was a little unclear whether you want min and max in the same dataframe, so I included both. First, I create an index column with row numbers. Then, I pivot to long format and flag which values are the minimum and maximum (using case_when). Then, I drop the rows that are neither min nor max (i.e., NA in category). Finally, I use summarise to collapse the row names into a single character string and to get the frequency of each minimum or maximum value.
library(tidyverse)
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "value") %>%
  group_by(column) %>%
  mutate(category = case_when(value == min(value) ~ "min",
                              value == max(value) ~ "max")) %>%
  drop_na(category) %>%
  group_by(column, value, category) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2:3, 1, 4, 5)
Output
# A tibble: 4 × 5
# Groups: column, value [4]
value category column rowname freq.times
<dbl> <chr> <chr> <chr> <int>
1 2 min dist 1 1
2 120 max dist 49 1
3 4 min speed 1, 2 2
4 25 max speed 50 1
However, if you want to produce the data frames separately, you could adjust it to something like this. Here, I don't use category; instead I use filter to drop all rows that are not the minimum for a column. Then, we summarise as we did above. You can do the same thing for max as well (see the sketch after the output below).
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "min.value") %>%
  group_by(column) %>%
  filter(min.value == min(min.value)) %>%
  group_by(column, min.value) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2, 1, 3, 4)
Output
# A tibble: 2 × 4
# Groups: column [2]
min.value column rowname freq.times
<dbl> <chr> <chr> <int>
1 2 dist 1 1
2 4 speed 1, 2 2
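For completeness, the max version is the same pipeline with max() (a sketch, directly analogous to the min version above):
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "max.value") %>%
  group_by(column) %>%
  filter(max.value == max(max.value)) %>%
  group_by(column, max.value) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2, 1, 3, 4)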
Here is another tidyverse approach:
which.min(.) gives only the first index, whereas which(. == min(.)) gives all indices that satisfy the condition.
Analogously, we can get the frequency with length(which(. == min(.))).
We summarise across all columns to get min.value, rowname and freq.times,
then pivot to bring the column names into position.
library(tidyverse)
cars %>%
  summarise(across(dplyr::everything(),
                   list(min.value = min,
                        rowname = ~ list(which(. == min(.))),
                        freq.times = ~ length(which(. == min(.)))))) %>%
  pivot_longer(
    cols = contains("_"),
    names_to = "key",
    values_to = "val",
    values_transform = list(val = as.character)
  ) %>%
  separate(key, c("column", "name"), sep = "_") %>%
  pivot_wider(
    names_from = name,
    values_from = val
  ) %>%
  mutate(rowname = str_replace(rowname, '\\:', '\\,'))
column min.value rowname freq.times
<chr> <chr> <chr> <chr>
1 speed 4 1,2 2
2 dist 2 1 1
# Base R (the \(x) lambda shorthand requires R >= 4.1):
min.value <- sapply(cars, min)
columns <- names(min.value)
# All row indices that match each column's minimum:
row.values <- sapply(columns, \(x) which(cars[[x]] == min.value[which(names(min.value) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(min.value) <- names(row.values) <- names(freq.times) <- NULL
data.frame(min.value = min.value,
           columns = columns,
           row.values = row.values,
           freq.times = freq.times)
min.value columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
Here it is wrapped in a function, so that you can use it with whatever data frame and summary function you need:
create_table <- function(df, FUN) {
  values <- sapply(df, FUN)
  columns <- names(values)
  row.values <- sapply(columns, \(x) which(df[[x]] == values[which(names(values) == x)]))
  freq.times <- sapply(row.values, length)
  row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
  names(values) <- names(row.values) <- names(freq.times) <- NULL
  data.frame(values = values,
             columns = columns,
             row.values = row.values,
             freq.times = freq.times)
}
create_table(cars, min)
values columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
create_table(cars, max)
values columns row.values freq.times
1 25 speed 50 1
2 120 dist 49 1
You can use which to obtain the positions. sapply should work. Since you need multiple summary statistics for each column, you just have to wrap them up in a list. Something like this:
as.data.frame(sapply(cars, \(x) {
  extrema <- range(x)
  min.row <- which(x == extrema[[1L]])
  max.row <- which(x == extrema[[2L]])
  list(
    min.value = extrema[[1L]], max.value = extrema[[2L]],
    min.row = min.row, max.row = max.row,
    freq.min = length(min.row), freq.max = length(max.row)
  )
}))
Output
speed dist
min.value 4 2
max.value 25 120
min.row 1, 2 1
max.row 50 49
freq.min 2 1
freq.max 1 1
Using group_by(), I want to get the value of column value2 at the largest value of column value:
df = data.frame(id = c(1,1,1,1,2,2,2,2),
                value = c(4,5,1,3,1,2,3,1),
                value2 = c("a","b","c","d","e","f","g","h"))
df %>% group_by(id) %>%
  summarise(value2_of_largest_value = f(value, value2))
1 b
2 g
We can use which.max to get the index of the largest value and use that to subset value2:
library(dplyr)
f1 <- function(x, y) y[which.max(x)]
df %>%
  group_by(id) %>%
  summarise(value2 = f1(value, value2))
# or simply
# summarise(value2 = value2[which.max(value)])
# A tibble: 2 x 2
# id value2
# <dbl> <fct>
#1 1 b
#2 2 g
Another approach in dplyr:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(value == max(value))
or in data.table:
library(data.table)
setDT(df)[df[, .I[value == max(value)], by = id]$V1]
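If you only want id and value2, as in the expected output, the same which.max idea works in data.table (a sketch, assuming df as defined in the question):
library(data.table)
# one row per id: the value2 sitting at the position of the largest value
setDT(df)[, .(value2 = value2[which.max(value)]), by = id]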
I have a data frame with the fields PARTIDA (a date), Operação (a factor with 4 levels) and TT (numeric).
I need to group by the PARTIDA column, pivot the Operação column while counting the frequency of each level, and sum the TT column.
Like this:
I already tried something with dplyr but I could not get this result, can anyone help me?
Here's a two-step process that may get you what you want:
library(dplyr)
library(tidyr) # for spread() and replace_na()
df <-
  tibble(
    partida = c("date1", "date2", "date3", "date1", "date2"),
    operacao = c("D", "J", "C", "D", "M"),
    tt = c(1, 2, 3, 4, 5)
  )
tt_sums <-
  df %>%
  group_by(partida) %>%
  count(wt = tt)

operacao_counts <-
  df %>%
  group_by(partida, operacao) %>%
  count() %>%
  ungroup() %>%
  spread(operacao, n) %>%
  mutate_if(is.numeric, replace_na, 0)

final_df <-
  operacao_counts %>%
  left_join(tt_sums, by = "partida")
> final_df
# A tibble: 3 x 6
partida C D J M n
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 date1 0 2 0 0 5
2 date2 0 0 1 1 7
3 date3 1 0 0 0 3
Similar to #cardinal40's answer, but in one go, as I try to limit the number of objects added to my environment when possible. Either answer will do the trick.
df %>%
  group_by(partida) %>%
  mutate(tt = sum(tt)) %>%
  group_by(partida, operacao, tt) %>%
  count() %>%
  ungroup() %>%
  spread(operacao, n) %>%
  mutate_if(is.numeric, replace_na, 0)
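In current tidyr, spread() is superseded by pivot_wider(), which can also do the zero filling itself. A sketch of the same one-pass approach (assuming the df defined above; here the tt column holds the per-partida sum):
library(dplyr)
library(tidyr)

df %>%
  group_by(partida) %>%
  mutate(tt = sum(tt)) %>%
  count(operacao, tt) %>%
  pivot_wider(names_from = operacao, values_from = n, values_fill = 0)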
Let's say I have a few columns in my data frame that come from a bunch of similar factors:
For example: A1_Factor1, A1_Factor2, A1_Factor3, B1_Factor1, B1_Factor2, C1_Factor1, etc.
What I want is to create additional columns using this data. So:
A1_Mean - This should be the average of columns starting with A1
B1_Mean - This should be the average of columns starting with B1
A1_Min - This should be the minimum value of columns starting with A1
B1_Min - This should be the minimum value of columns starting with B1
A1_SD - This should be the Standard Deviation of columns starting with A1
B1_SD - This should be the Standard Deviation of columns starting with B1
How can this be done in R, so that the code first extracts the columns sharing an initial prefix, performs the required calculations on them, and then creates new columns named with the same prefix?
Thanks for your help in advance! :)
You can do this using the tidyverse package.
Input:
library(tidyverse)
set.seed(123)
df <- tibble(A1_abc = sample(1:10, 5),
             A1_cde = sample(10:15, 5),
             B1_abc = sample(1:10, 5),
             B1_cde = sample(15:20, 5))
df
# A tibble: 5 x 4
A1_abc A1_cde B1_abc B1_cde
<int> <int> <int> <int>
1 3 10 10 20
2 8 12 5 16
3 4 13 6 15
4 7 11 9 18
5 6 15 1 19
Method:
df %>%
  gather(key, value) %>%
  separate(key, c("gp", "rand"), sep = "_") %>%
  select(-rand) %>%
  group_by(gp) %>%
  mutate(id = 1:n()) %>%
  spread(gp, value) %>%
  summarise_at(vars(2:3), funs(Min = min(.),
                               Max = max(.),
                               Mean = mean(.),
                               SD = sd(.)))
Output:
# A tibble: 1 x 8
A1_Min B1_Min A1_Max B1_Max A1_Mean B1_Mean A1_SD B1_SD
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3. 1. 15. 20. 8.90 11.9 3.96 6.61
If you want to add more functions, just add them inside the funs() call in summarise_at(). Note that funs() is deprecated in dplyr 0.8.0 and later; see the sketch below.
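On current tidyverse versions, where funs() and gather()/spread() are deprecated, a sketch of the same computation that returns one row per prefix instead of one wide row:
library(dplyr)
library(tidyr)

df %>%
  # split names like "A1_abc" into a prefix ("A1") and a remainder
  pivot_longer(everything(), names_to = c("gp", "rand"), names_sep = "_") %>%
  group_by(gp) %>%
  summarise(Min = min(value), Max = max(value),
            Mean = mean(value), SD = sd(value))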
I created a small example, and this is what I have:
df <- data.frame("A1_factor1" = rnorm(5), "A1_factor2" = rnorm(5),
                 "B1_factor1" = rnorm(5), "B1_factor2" = rnorm(5))
col.names <- names(df)
group <- unique(substr(col.names, 1, 2))
for (i in 1:length(group)) {
  group.df <- df[, substr(names(df), 1, 2) == group[i]]
  # Row-wise summary statistics across the columns of each group:
  df[, ncol(df) + 1] <- apply(group.df, 1, mean)
  df[, ncol(df) + 1] <- apply(group.df, 1, min)
  df[, ncol(df) + 1] <- apply(group.df, 1, sd)
  df[, ncol(df) + 1] <- apply(group.df, 1, max)
  names(df)[(ncol(df) - 3):ncol(df)] <- paste(group[i], c("Mean", "Min", "SD", "Max"), sep = "_")
}
df
I hope this helps!
This question already has answers here:
R group by, counting non-NA values
(3 answers)
Here is my example:
mydf <- data.frame('col_1' = c('A','A','B','B'), 'col_2' = c(100, NA, 90, 30))
I would like to group by col_1 and count non-NA elements in col_2
I would like to do it with dplyr. Here is what I tried:
mydf %>% group_by(col_1) %>% summarise_each(funs(!is.na(col_2)))
mydf %>% group_by(col_1) %>% mutate(non_na_count = length(col_2, na.rm=TRUE))
mydf %>% group_by(col_1) %>% mutate(non_na_count = count(col_2, na.rm=TRUE))
Nothing worked. Any suggestions?
You can use this
mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))
# A tibble: 2 x 2
col_1 non_na_count
<fctr> <int>
1 A 1
2 B 2
We can filter out the NA elements in 'col_2' and then count 'col_1':
mydf %>%
  filter(!is.na(col_2)) %>%
  count(col_1)
# A tibble: 2 x 2
# col_1 n
# <fctr> <int>
#1 A 1
#2 B 2
or using data.table
library(data.table)
setDT(mydf)[, .(non_na_count = sum(!is.na(col_2))), col_1]
Or with aggregate from base R
aggregate(cbind(col_2 = !is.na(col_2))~col_1, mydf, sum)
# col_1 col_2
#1 A 1
#2 B 2
Or using table
table(mydf$col_1[!is.na(mydf$col_2)])
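For the example data this prints:
# A B
# 1 2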
Note that this variant counts the NA values per group (the complement of the non-NA count) and prints the result with knitr::kable():
library(knitr)
library(dplyr)

mydf <- data.frame("col_1" = c("A", "A", "B", "B"),
                   "col_2" = c(100, NA, 90, 30))

mydf %>%
  group_by(col_1) %>%
  select_if(function(x) any(is.na(x))) %>%
  summarise_all(funs(sum(is.na(.)))) -> NA_mydf

kable(NA_mydf)
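On dplyr 1.0.0+, summarise_all() and funs() are superseded by across(); a sketch of the non-NA count per group:
library(dplyr)

mydf %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ sum(!is.na(.x)), .names = "{.col}_non_na"))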