I have a data frame in which several data sources are merged, which creates rows with the same id. Now I want to define which values from which row should be kept.
So far I have been using dplyr with group_by and summarise_all to keep the first value if it is not NA.
Here's an example:
# function f for summarizing
f <- function(x) {
x <- na.omit(x)
if (length(x) > 0) first(x) else NA
}
# test data
test <- data.frame(id = c(1,2,1,2), value1 = c("a",NA,"b","c"), value2 = 0:3)
  id value1 value2
1  1      a      0
2  2   <NA>      1
3  1      b      2
4  2      c      3
The following result is obtained after summarising:
test <- test %>% group_by(id) %>% summarise_all(funs(f))
  id value1 value2
1  1      a      0
2  2      c      1
Now the question: skipping NA values (via na.omit) already works, but how can I specify that the value 0 should also be skipped, so that the first value not equal to 0 is kept? The expected result looks like this:
  id value1 value2
1  1      a      2
2  2      c      1
You can just modify your f function by subsetting the vector where it is different from zero:
f <- function(x) {
x <- na.omit(x)
x <- x[x != 0]
if (length(x) > 0) first(x) else NA
}
Sidenote: as of dplyr 0.8.0, funs() is deprecated. You should use a lambda, a list of functions, or a list of lambdas instead. In this case I used a single lambda:
test %>%
group_by(id) %>%
summarise_all(~f(.))
# A tibble: 2 x 3
id value1 value2
<dbl> <chr> <int>
1 1 a 2
2 2 c 1
You can write the f function as:
library(dplyr)
f <- function(x) x[!is.na(x) & x != 0][1]
test %>% group_by(id) %>% summarise(across(everything(), f))
# id value1 value2
# <dbl> <chr> <int>
#1 1 a 2
#2 2 c 1
Using [1] would return NA automatically if there is no non-zero, non-NA value in your data.
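To see that behaviour in isolation, here is a minimal check (my example, not from the original answer):
x <- c(NA, 0, NA)
x[!is.na(x) & x != 0]    # numeric(0): nothing survives the subset
x[!is.na(x) & x != 0][1] # [1] NA: indexing an empty vector at [1] yields NA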
As a sidenote to the sidenote of @RicS: as of dplyr 1.0, summarise_all() is superseded. You should rather use across():
test %>%
group_by(id) %>%
summarise(across(everything(), f))
I have a case where I am filtering a dataframe inside a function, but the dataframe has a column with the same name as the variable I want to filter on.
example:
d = tibble(cond = c(1,2), b = c(1,2))
f_ = function(data, cond) {
data = data %>% filter(b == cond)
return(data)
}
f_(d, cond = 2)
# A tibble: 2 x 2
cond b
<dbl> <dbl>
1 1 1
2 2 2
No filtering happens, because filter() resolves cond to the data column cond rather than to the function argument, and here b == cond holds for every row.
This becomes an issue when I do not control the number of columns in the data, but at a minimum I know it has the b column.
We can change the function so that cond is taken from the function environment rather than from the data, by unquoting it with !!:
f_ = function(data, cond) {
data %>%
filter(b == !!cond)
}
f_(d, cond = 2)
# A tibble: 1 x 2
# cond b
# <dbl> <dbl>
#1 2 2
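Another option, assuming a recent rlang/dplyr (this variant is my addition, not part of the original answer), is the .env pronoun, which makes the intended scoping explicit:
f_ = function(data, cond) {
data %>%
# .env$cond is always looked up in the function environment, never in the data
filter(b == .env$cond)
}
f_(d, cond = 2)
# A tibble: 1 x 2
#    cond     b
#   <dbl> <dbl>
# 1     2     2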
This feels like it should be more straightforward and I'm just missing something. The goal is to filter the data into a new df where both var values 1 & 2 are represented in the group
here's some toy data:
grp <- c(rep("A", 3), rep("B", 2), rep("C", 2), rep("D", 1), rep("E",2))
var <- c(1,1,2,1,1,2,1,2,2,2)
id <- c(1:10)
df <- as.data.frame(cbind(id, grp, var))
only grp A and C should be present in the new data because they are the only ones where var 1 & 2 are present.
I tried dplyr, but obviously '&' won't work since no single row can have var equal to both values, and '|' just returns the same df:
df.new <- df %>% group_by(grp) %>% filter(var==1 & var==2) #returns no rows
Here is another dplyr method. This can work for more than two factor levels in var.
library(dplyr)
df2 <- df %>%
group_by(grp) %>%
filter(all(levels(var) %in% var)) %>%
ungroup()
df2
# # A tibble: 5 x 3
# id grp var
# <fct> <fct> <fct>
# 1 1 A 1
# 2 2 A 1
# 3 3 A 2
# 4 6 C 2
# 5 7 C 1
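Note that levels(var) works here only because as.data.frame(cbind(...)) coerces every column to a factor (the default before R 4.0; since then you get character columns instead). If var were character or numeric, a sketch along these lines (my addition) would do the same job:
df %>%
group_by(grp) %>%
# compare each group against every distinct value seen in the full column
filter(all(unique(df$var) %in% var)) %>%
ungroup()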
We can condition on there being at least one instance of var == 1 and at least one instance of var == 2 by doing the following:
library(tidyverse)
df1 <- tibble(grp, var, id) # tibble() avoids coercion to character/factor (data_frame() is deprecated)
df1 %>%
group_by(grp) %>%
filter(sum(var == 1) > 0 & sum(var == 2) > 0)
grp var id
<chr> <dbl> <int>
1 A 1 1
2 A 1 2
3 A 2 3
4 C 2 6
5 C 1 7
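The same condition can also be written with any(), which reads a little more directly (my rewording, not part of the original answer); comma-separated conditions in filter() are combined with AND:
df1 %>%
group_by(grp) %>%
filter(any(var == 1), any(var == 2))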
I am trying to create a new column based other column criteria where my data looks like the following:
ID  Column 1  Column 2  Column 3
1   2         Y         "2013-10-22T10:09"
1   2         Y         "2013-10-23T10:09"
2   3         N         "2013-10-23T10:09"
3   0         Y         "2013-10-23T10:09"
For each ID, I would like to keep only the earliest date/time as long as column 1 is greater than 0 and column 2 is not N. The results would look like this:
ID  Column 1  Column 2  Column 3            Column 4
1   2         Y         "2013-10-22T10:09"  2013-10-22
Here is my current (non-working) attempt; I was wondering how to do this and whether there is an elegant way of doing it:
library(dplyr)
ifelse(Column 1 >0 and Column 2 !="N",
(new %>%
group_by(ID) %>%
arrange(Column 3) %>%
slice(1L)))
Column 4 <- as.Date(Column 3, format='%Y-%m-%dT%H:%M')
library(dplyr)
df %>%
filter(Column1 > 0 & Column2 != 'N') %>% # filter out non-matching rows
group_by(ID) %>%
top_n(-1, Column3) %>% # select only the row with the earliest date-time
mutate(Date = as.Date(Column3)) # create date column
#
# # A tibble: 1 x 5
# # Groups: ID [1]
# ID Column1 Column2 Column3 Date
# <int> <int> <chr> <chr> <date>
# 1 1 2 Y 2013-10-22T10:09 2013-10-22
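If you are on dplyr 1.0+, where top_n() is superseded, slice_min() states the intent more directly; a sketch assuming that version:
df %>%
filter(Column1 > 0 & Column2 != 'N') %>%
group_by(ID) %>%
slice_min(Column3, n = 1, with_ties = FALSE) %>% # keep the single earliest date-time per ID
mutate(Date = as.Date(Column3))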
A base R alternative using a loop:
df <- data.frame(id = c(1,1,2,3),
                 column_1 = c(2,2,3,0),
                 column_2 = c("Y","Y","N","Y"),
                 column_3 = as.Date(c("2013-10-22","2013-10-23","2013-10-23","2013-10-23"),
                                    format = "%Y-%m-%d"))
ids <- unique(df$id)
datalist <- list()
for (i in seq_along(ids)) {
  # keep rows for this id that pass both constraints and carry
  # the earliest date among this id's rows (not the global minimum)
  z <- df[df$id == ids[i] & df$column_1 > 0 & df$column_2 != "N" &
            df$column_3 == min(df$column_3[df$id == ids[i]]), ]
  datalist[[i]] <- z
}
do.call(rbind, datalist)
This loop should help you. The constraints for each column are hard-coded here, but you can change them as per your convenience. For the example data it returns only the first row (id 1 on 2013-10-22), since id 2 fails the column_2 check and id 3 fails the column_1 check.
Thanks
I am handling a large dataset. First, for certain columns (X1, X2, ...), I am trying to identify a range of values (a, b) made up of repeated values (each occurring more than n times). Next, I wish to filter rows based on a condition which matches the respective columns against the result from the previous step.
Here is a reproducible example simulating the scenario I am facing,
library(tidyverse)
set.seed(1122)
vecs <- lapply(X = 1:2, function(x) rep(c(1, 2, 3), times = 10) %>% sample() %>% head(10))
names(vecs) <- paste0("col_", 1:2)
dat <- vecs %>% as.data.frame()
dat
col_1 col_2
1 3 2
2 1 1
3 1 1
4 1 2
5 1 2
6 3 3
7 3 3
8 2 1
9 1 3
10 2 2
I am able to identify the range by the following method,
# Which col has repeated value more than 3 appearances?
more_than_3 <- function(df, var){
var <- rlang::sym(var)
df %>%
group_by(!!var) %>%
summarise(n = n()) %>%
filter(n > 3) %>%
pull(!!var) %>%
range()
}
cols_name <- c("col_1", "col_2")
some_range <- purrr::map(cols_name, more_than_3, df = dat)
names(some_range) <- cols_name
some_range
$col_1
[1] 1 1
$col_2
[1] 2 2
However, to filter out values that fall outside the upper limit, this is what I do.
dat %>%
filter(col_1 <= some_range[["col_1"]][2],
col_2 <= some_range[["col_2"]][2])
col_1 col_2
1 1 1
2 1 1
3 1 2
4 1 2
I believe there must be a more efficient and elegant way of filtering the result based on tidy evaluation. Can someone point me in the right direction?
Many thanks in advance.
First let's try to create a small function that builds a single filter expression for one column. This function takes a symbol and transforms it to a string, but it could work the other way around:
library(rlang) # for expr() and ensyms()

new_my_filter_call_upper <- function(sym, range) {
col_name <- as.character(sym)
col_range <- range[[col_name]]
if (is.null(col_range)) {
stop(sprintf("Can't find column `%s` to compute range", col_name), call. = FALSE)
}
expr(!!sym < !!col_range[[2]])
}
Let's try it:
new_my_filter_call_upper(quote(foobar), some_range)
#> Error: Can't find column `foobar` to compute range
# It works!
new_my_filter_call_upper(quote(col_1), some_range)
#> col_1 < 3
Now we're ready to create a pipeline verb that takes a data frame and bare column names.
# Probably cleaner to pass range as argument. Prefix with dot to allow
# columns named `range`.
my_filter <- function(.data, ..., .range) {
# ensyms() guarantees there won't be complex expressions
syms <- rlang::ensyms(...)
# Now let's map the function to create many calls:
calls <- purrr::map(syms, new_my_filter_call_upper, range = .range)
# And we're ready to filter with those expressions:
dplyr::filter(.data, !!!calls)
}
Let's try it:
dat %>% my_filter(col_1, col_2, .range = some_range)
#>   col_1 col_2
#> 1     2     1
#> 2     2     2
We could use map2
library(purrr)
map2(dat, some_range, ~ .x < .y[2]) %>%
reduce(`&`) %>%
dat[.,]
# col_1 col_2
#1 2 2
#2 1 1
#3 1 2
#6 1 1
Or with pmap
pmap(list(dat, some_range %>%
map(2)), `<`) %>%
reduce(`&`) %>%
dat[.,]
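On dplyr 1.0.4+ you could also stay inside filter() with if_all(), using cur_column() to look up the matching range (a sketch assuming those versions, and using the question's <= comparison):
dat %>%
filter(if_all(all_of(cols_name),
~ .x <= some_range[[cur_column()]][2]))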
Consider this simple example
library(dplyr)
dataframe <- tibble(id = c(1,2,3,4),
group = c('a','b','c','c'),
value = c(200,400,120,300))
> dataframe
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
and this tidyeval function that uses dplyr to aggregate my dataframe according to some input column.
func_tidy <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
Now, this works:
> func_tidy(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
But doing the same thing from within a loop FAILS:
for(col in c(group)){
func_tidy(dataframe, col)
}
Error in grouped_df_impl(data, unname(vars), drop) : Column `col` is unknown
What is the problem here? How can I use my tidyeval function in a loop?
Thanks!
For looping through column names you will need to use character strings.
for(col in "group")
When you pass this variable to your function, you will need to convert it from a character string to a symbol using rlang::sym. You use !! to unquote so the expression is evaluated.
So your loop would look like (I add a print to see the output):
for(col in "group"){
print( func_tidy(dataframe, !! rlang::sym(col) ) )
}
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2