Count non-NA values by group [duplicate]

This question already has answers here:
R group by, counting non-NA values
(3 answers)
Closed 4 years ago.
Here is my example
mydf<-data.frame('col_1' = c('A','A','B','B'), 'col_2' = c(100,NA, 90,30))
I would like to group by col_1 and count non-NA elements in col_2
I would like to do it with dplyr. Here is what I tried:
mydf %>% group_by(col_1) %>% summarise_each(funs(!is.na(col_2)))
mydf %>% group_by(col_1) %>% mutate(non_na_count = length(col_2, na.rm=TRUE))
mydf %>% group_by(col_1) %>% mutate(non_na_count = count(col_2, na.rm=TRUE))
Nothing worked. Any suggestions?

You can use this
mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))
# A tibble: 2 x 2
  col_1 non_na_count
  <fctr>       <int>
1 A                1
2 B                2
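The same count can be phrased with tally(), which sums its wt argument within each group (a minimal sketch; the logical vector coerces to 1/0):
mydf %>%
  group_by(col_1) %>%
  tally(wt = !is.na(col_2), name = "non_na_count")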

We can filter the NA elements in 'col_2' and then do a count of 'col_1'
mydf %>%
  filter(!is.na(col_2)) %>%
  count(col_1)
# A tibble: 2 x 2
# col_1 n
# <fctr> <int>
#1 A 1
#2 B 2
or using data.table
library(data.table)
setDT(mydf)[, .(non_na_count = sum(!is.na(col_2))), col_1]
Or with aggregate from base R
aggregate(cbind(col_2 = !is.na(col_2))~col_1, mydf, sum)
# col_1 col_2
#1 A 1
#2 B 2
Or using table
table(mydf$col_1[!is.na(mydf$col_2)])
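tapply() computes the same per-level counts without subsetting first (a sketch; sum() over a logical counts the TRUEs):
with(mydf, tapply(!is.na(col_2), col_1, sum))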

If instead you want the NA counts per group (the complement of what the question asks), you can keep only the columns that contain NAs, tabulate, and render the result with knitr:
library(knitr)
library(dplyr)
mydf <- data.frame("col_1" = c("A", "A", "B", "B"),
                   "col_2" = c(100, NA, 90, 30))
mydf %>%
  group_by(col_1) %>%
  select_if(function(x) any(is.na(x))) %>%
  summarise_all(funs(sum(is.na(.)))) -> NA_mydf
kable(NA_mydf)
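funs() was deprecated in dplyr 0.8; an across() rewrite of the same NA tabulation (a sketch, assuming dplyr >= 1.0, counting NAs in every non-grouping column) would be:
mydf %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ sum(is.na(.x)), .names = "na_{.col}"))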


Summarize with the latest record for each group [duplicate]

This question already has answers here:
Select row with most recent date by group
(5 answers)
Closed 2 years ago.
I have a dataframe:
df <- data.frame(Xdate = c("21-jul-2020", "29-jul-2020", "20-jul-2020", "13-may-2020"),
                 names = c("peter", "lisa", "peter", "lisa"),
                 score = c(1, 3, 5, 7))
What is the most elegant way of getting the latest score out:
df_result <- data.frame(names = c("peter", "lisa"),
                        score = c(1, 3))
The latest score for peter is 1, achieved on 21-jul-2020, and the latest score for lisa is 3, achieved on 29-jul-2020.
You can use slice_max() from dplyr, which supersedes top_n() as of version 1.0.0, to select the most recent date.
library(dplyr)
df %>%
  mutate(Xdate = as.Date(Xdate, "%d-%b-%Y")) %>%
  group_by(names) %>%
  slice_max(Xdate, n = 1) %>%
  ungroup()
# # A tibble: 2 x 3
# Xdate names score
# <date> <chr> <dbl>
# 1 2020-07-29 lisa 3
# 2 2020-07-21 peter 1
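slice_max() keeps every row tied for the maximum by default; if you need exactly one row per group even when dates are duplicated, with_ties = FALSE does that (a sketch):
df %>%
  mutate(Xdate = as.Date(Xdate, "%d-%b-%Y")) %>%
  group_by(names) %>%
  slice_max(Xdate, n = 1, with_ties = FALSE) %>%
  ungroup()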
Here is a dplyr solution.
library(dplyr)
df %>%
  mutate(Xdate = as.Date(Xdate, "%d-%b-%Y")) %>%
  group_by(names) %>%
  arrange(Xdate) %>%
  summarise_all(last)
# A tibble: 2 x 3
# names Xdate score
# <chr> <date> <dbl>
#1 lisa 2020-07-29 3
#2 peter 2020-07-21 1
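The same arrange-then-take-last idea can be written with slice_tail() instead of summarise_all(last) (a sketch, dplyr >= 1.0):
df %>%
  mutate(Xdate = as.Date(Xdate, "%d-%b-%Y")) %>%
  arrange(Xdate) %>%
  group_by(names) %>%
  slice_tail(n = 1) %>%
  ungroup()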
A base R one-liner could be (converting Xdate to Date class first, so the ordering is chronological rather than alphabetical):
aggregate(score ~ names, data = df[order(as.Date(df$Xdate, "%d-%b-%Y")), ], function(x) x[length(x)])
# names score
#1 lisa 3
#2 peter 1
Here is one alternative from dplyr package
library(dplyr)
df$Xdate <- as.Date(df$Xdate, format = "%d-%b-%Y")
df %>%
  group_by(names) %>%
  arrange(desc(Xdate)) %>%
  mutate(names = first(names),
         score = first(score)) %>%
  select(!Xdate) %>%
  distinct(names, score) %>%
  ungroup()
# names score
# <fct> <dbl>
#1 lisa 3
#2 peter 1
or
df %>% group_by(names) %>% arrange(desc(Xdate)) %>% filter(row_number() == 1)
or, with the superseded top_n() (specify wt explicitly so it ranks by date rather than by the last column):
df %>% group_by(names) %>% top_n(n = 1, wt = Xdate)
Using ave in base R :
subset(transform(df, Xdate = as.Date(Xdate, "%d-%b-%Y")),
       Xdate == ave(Xdate, names, FUN = max))
# Xdate names score
#1 2020-07-21 peter 1
#2 2020-07-29 lisa 3
With transform() we first convert Xdate to Date class; ave() then computes the maximum date within each names group, and subset() keeps the rows where Xdate equals that maximum.
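For completeness, a data.table version of the same idea (a sketch; the %d-%b-%Y format assumes an English locale for the month abbreviations):
library(data.table)
setDT(df)[, Xdate := as.Date(Xdate, "%d-%b-%Y")][, .SD[which.max(Xdate)], by = names]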

R: Break a data.frame according to value of column with dplyr

I have this data.frame
MWE <- data.frame(x = c("a", "a", "a", "b", "b", "b"), y = c(1,2,3,4,5,6))
and what I want to obtain is this data.frame
data.frame(a = c(1,2,3), b = c(4,5,6))
Actually, what I originally want is to sum the two vectors a and b (in reality I have many more vectors, but two keep the example simple), which is why I thought of this transformation. I can then do a rowSums(), or something equivalent.
I tried to use pivot_wider from tidyr but I had an error.
Any idea of how to do this with dplyr or tidyr?
Continuing from @Mr.Flick's tidyverse attempt, you can create a row id within each group and then, grouping on that id, calculate the sum:
library(dplyr)
MWE %>%
  group_by(x) %>%
  mutate(row = row_number()) %>%
  group_by(row) %>%
  mutate(total_sum = sum(y)) %>%
  tidyr::pivot_wider(names_from = x, values_from = y) %>%
  ungroup() %>%
  select(-row)
# A tibble: 3 x 3
# total_sum a b
# <dbl> <dbl> <dbl>
#1 5 1 4
#2 7 2 5
#3 9 3 6
We can use unstack from base R
unstack(MWE, y ~ x)
# a b
#1 1 4
#2 2 5
#3 3 6
Or using rowid from data.table with pivot_wider from tidyr
library(dplyr)
library(data.table)
library(tidyr)
MWE %>%
  mutate(rn = rowid(x)) %>%
  pivot_wider(names_from = x, values_from = y) %>%
  select(-rn)
# A tibble: 3 x 2
# a b
# <dbl> <dbl>
#1 1 4
#2 2 5
#3 3 6
Using base R:
data.frame(with(MWE, split(y, x)))
a b
1 1 4
2 2 5
3 3 6
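Since the stated end goal is to sum across the reshaped vectors, any of the wide results above can be finished off with rowSums() (a sketch using the split() version):
wide <- data.frame(with(MWE, split(y, x)))
wide$total <- rowSums(wide)
wide
#  a b total
#1 1 4     5
#2 2 5     7
#3 3 6     9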

Finding the Date at which the maximum occurred [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
In the data frame below, is it possible to find the date at which the maximum occurred, by group?
df
Date       Var  Value
27/9/2019  A    56
28/9/2019  A    50
1/10/2019  B    90
2/10/2019  B    100

df1
Var  Max  Date       Mean
A    56   27/9/2019  53
B    100  2/10/2019  95
We can group_by Var, calculate the mean of Value and select the row with maximum value.
library(dplyr)
df %>%
  group_by(Var) %>%
  mutate(Mean = mean(Value)) %>%
  slice(which.max(Value))
# Date Var Value Mean
# <fct> <fct> <int> <dbl>
#1 27/9/2019 A 56 53
#2 2/10/2019 B 100 95
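If you only need the df1 layout, a single summarise() call gets there directly (a sketch):
df %>%
  group_by(Var) %>%
  summarise(Max = max(Value),
            Date = Date[which.max(Value)],
            Mean = mean(Value))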
Base R, split-apply-combine (Edited):
# Create df, ensure date vec has appropriate type:
df <- data.frame(
  Date = as.Date(c("27/9/2019", "28/9/2019", "1/10/2019", "2/10/2019"), "%d/%m/%Y"),
  Var = c("A", "A", "B", "B"),
  Value = c(56, 50, 90, 100),
  stringsAsFactors = FALSE
)
# Split df by "Var" values:
split_applied_combined <- lapply(split(df, df$Var), function(x) {
  # Date at which the maximum Value occurred:
  max_date <- x$Date[which(x$Value == max(x$Value))]
  # Calculate the mean:
  mean_val <- mean(x$Value)
  # Calculate the std dev:
  sd_val <- sd(x$Value)
  # Combine vectors into a one-row data frame:
  summarised_df <- data.frame(max_date, mean_val, sd_val)
})
# Combine list back into dataframe:
split_applied_combined <- do.call(rbind,
  # Store each df's name as a "Var" column:
  mapply(cbind,
         "Var" = names(split_applied_combined),
         split_applied_combined,
         SIMPLIFY = FALSE))
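by() condenses the split/apply/recombine steps into one call; a sketch of the same computation:
summarised_by <- do.call(rbind, by(df, df$Var, function(x) {
  data.frame(Var = x$Var[1],
             max_date = x$Date[which.max(x$Value)],
             mean_val = mean(x$Value),
             sd_val = sd(x$Value))
}))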
Dplyr alternative:
require("dplyr")
# Group by Var and summarise; note that Date[which.max(Value)] gives the date at
# which the maximum Value occurred, whereas max(Date) would give the latest date:
summarised_df <-
  df %>%
  group_by(Var) %>%
  summarise(max_date_per_group = Date[which.max(Value)],
            mean_val_per_group = mean(Value),
            sd_per_group = sd(Value)) %>%
  ungroup()
There might be a better way to do this, since you want a summary that reduces multiple values down to a single value. You could *_join the output of the filter with the summary table as follows:
library(dplyr)
df1 <- df %>%
  group_by(Var) %>%
  filter(Value == max(Value)) %>%
  select(Var, Value, Date)
df2 <- df %>%
  group_by(Var) %>%
  summarise_at(.vars = vars(Value),
               .funs = c(mean = "mean", sd = "sd"))
df2 %>%
  left_join(df1, by = "Var") %>%
  select(Var, Value, Date, mean, sd)
# -------------------------------------------------------------------------
# # A tibble: 2 x 5
# Var Value Date mean sd
# <chr> <dbl> <chr> <dbl> <dbl>
# 1 A 56 27/9/2019 53 4.24
# 2 B 100 2/10/2019 95 7.07
Data
df <- data.frame(
Date = c("27/9/2019", "28/9/2019", "1/10/2019", "2/10/2019"),
Var = c("A", "A", "B", "B"),
Value = c(56, 50, 90, 100), stringsAsFactors = F
)
Hope that is what you want.

Value based on largest value by neighbouring column

Using group_by() I want to get the value of column value based on the largest value of column value2:
df = data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2),
                value = c(4, 5, 1, 3, 1, 2, 3, 1),
                value2 = c("a", "b", "c", "d", "e", "f", "g", "h"))
df %>%
  group_by(id) %>%
  summarise(value2_of_largest_value = f(value, value2))
1 b
2 g
We can use which.max to get the index of the value and use that to subset the value2
library(dplyr)
f1 <- function(x, y) y[which.max(x)]
df %>%
  group_by(id) %>%
  summarise(value2 = f1(value, value2))
#or simply
# summarise(value2 = value2[which.max(value)])
# A tibble: 2 x 2
# id value2
# <dbl> <fct>
#1 1 b
#2 2 g
Another approach in dplyr:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(value == max(value))
or in data.table:
library(data.table)
setDT(df)
df[df[, .I[value == max(value)], by = id]$V1]
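A data.table variant that mirrors which.max (keeping only the first maximum per group, whereas the value == max(value) filter keeps ties):
setDT(df)[, .SD[which.max(value)], by = id]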

Group by, pivot, count and sum in DF in R

I have a data frame with the fields PARTIDA (a date), Operação (a factor with 4 levels) and TT (numeric).
I need to group by the PARTIDA column, pivot the Operação column, counting the frequency of each level, and sum the TT column (the desired output was shown as an image in the original post).
I already tried something with dplyr but I could not get this result. Can anyone help me?
Here's a two-step process that may get you what you want:
library(dplyr)
library(tidyr) # needed for spread() and replace_na()
df <-
  tibble(
    partida = c("date1", "date2", "date3", "date1", "date2"),
    operacao = c("D", "J", "C", "D", "M"),
    tt = c(1, 2, 3, 4, 5)
  )
tt_sums <-
  df %>%
  group_by(partida) %>%
  count(wt = tt)
operacao_counts <-
  df %>%
  group_by(partida, operacao) %>%
  count() %>%
  ungroup() %>%
  spread(operacao, n) %>%
  mutate_if(is.numeric, replace_na, 0)
final_df <-
  operacao_counts %>%
  left_join(tt_sums, by = "partida")
> final_df
# A tibble: 3 x 6
partida C D J M n
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 date1 0 2 0 0 5
2 date2 0 0 1 1 7
3 date3 1 0 0 0 3
Similar to @cardinal40's answer, but in one go, as I try to limit the number of objects added to my environment when possible. Either answer will do the trick.
df %>%
  group_by(partida) %>%
  mutate(tt = sum(tt)) %>%
  group_by(partida, operacao, tt) %>%
  count() %>%
  ungroup() %>%
  spread(operacao, n) %>%
  mutate_if(is.numeric, replace_na, 0)
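spread() is superseded in tidyr; the same reshape reads a little more directly with count() plus pivot_wider() (a sketch, assuming tidyr >= 1.1 for the scalar values_fill):
df %>%
  group_by(partida) %>%
  mutate(tt_sum = sum(tt)) %>%
  ungroup() %>%
  count(partida, tt_sum, operacao) %>%
  pivot_wider(names_from = operacao, values_from = n, values_fill = 0)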
