How to sum n highest values by row using dplyr without reshaping?

I would like to create a new column based on the n highest values per row of a data frame.
Take the following example:
library(tibble)
df <- tribble(
  ~name, ~q_1, ~q_2, ~q_3, ~sum_top_2,
  "a",   4,    1,    5,    9,
  "b",   2,    8,    9,    17
)
Here, the sum_top_2 column sums the 2 highest values of columns prefixed with "q_". I would like to generalize to the n highest values by row. How can I do this using dplyr without reshaping?

One option is pmap from purrr to loop over the rows of the columns selected with starts_with('q_'): sort each row in decreasing order, take the first n sorted elements with head, and sum them.
library(dplyr)
library(purrr)
library(stringr)
n <- 2
df %>%
  mutate(!! str_c("sum_top_", n) := pmap_dbl(select(cur_data(), starts_with('q_')),
                                             ~ sum(head(sort(c(...), decreasing = TRUE), n))))
Output:
# A tibble: 2 x 5
name q_1 q_2 q_3 sum_top_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 4 1 5 9
2 b 2 8 9 17
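As a side note, cur_data() is deprecated as of dplyr 1.1.0 in favor of pick(). A minimal sketch of the same pipeline, assuming dplyr >= 1.1.0:
df %>%
  mutate(!! str_c("sum_top_", n) := pmap_dbl(pick(starts_with('q_')),
                                             ~ sum(head(sort(c(...), decreasing = TRUE), n))))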
Or use rowwise from dplyr.
df %>%
  rowwise() %>%
  mutate(!! str_c("sum_top_", n) := sum(head(sort(c_across(starts_with("q_")),
                                                  decreasing = TRUE), n))) %>%
  ungroup()
# A tibble: 2 x 5
name q_1 q_2 q_3 sum_top_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 4 1 5 9
2 b 2 8 9 17
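If dplyr is not a hard requirement, the same row-wise top-n sum works in base R without reshaping. A minimal sketch (the q_cols helper variable is mine, for illustration):
n <- 2
q_cols <- grep("^q_", names(df))  # positions of the columns prefixed with "q_"
df[[paste0("sum_top_", n)]] <- apply(df[q_cols], 1, function(x)
  sum(sort(x, decreasing = TRUE)[seq_len(n)]))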

Count number of observations per distinct group inside summarise with dplyr (n_distinct equivalent?)

Is there a function that counts the number of observations within unique groups and not the number of distinct groups as n_distinct() does?
I'm summarising data with dplyr and group_by(), and I'm trying to calculate the mean number of observations per level of a different grouping variable.
df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = sample(1:10, 8))
df %>%
  group_by(id.2) %>%
  summarise(n.mean = mean(n_distinct(id)),
            v.mean = mean(v))
# A tibble: 2 × 3
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 3 5
2 2 2 4.5
What I instead need:
id.2 n.mean v.mean
1 1 5
2 2 4.5
because for
id.2 == 1, n.mean is the mean of 1 observation for A, 2 for B, and 1 for C:
> mean(1, 2, 1)
[1] 1
and for id.2 == 2, n.mean is the mean of 2 observations for A, 0 for B, and 2 for C:
> mean(2, 0, 2)
[1] 2
I tried grouping by group_by(id, id.2) first to count the observations and then passing those counts on when grouping by only id.2 in a subsequent step, but that didn't work (though I probably just don't know how to implement this with dplyr, as I'm not very experienced with tidyverse solutions).
You are not using mean correctly. mean(1, 2, 1) ignores all but the first argument and therefore will return 1 no matter what other numbers are in the second and third positions. For id.2 == 1, you'd want mean(c(1, 2, 1)), which returns 1.333.
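A quick illustration of the difference: in mean(1, 2, 1), the 2 and 1 are positionally matched to mean()'s trim and na.rm parameters, so they are never part of the average.
> mean(1, 2, 1)     # x = 1, trim = 2, na.rm = 1: only x is averaged
[1] 1
> mean(c(1, 2, 1))  # a single vector of three values
[1] 1.333333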
We can use table to quickly calculate the frequencies of id within each grouping of id.2, and then take the mean of those. We can compute v.mean in the same step.
library(tidyverse)
df %>%
  group_by(id.2) %>%
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 2 6
Your example notes that id.2 == 2 does not have any values for id == B. It is not clear whether your desired solution counts this as a zero-length category, or simply ignores it. The solution above ignores it. The following includes it as a zero-length category by first complete-ing the input data (note new row #7, which has NA data):
df_complete <- complete(df, id.2, id)
id.2 id v
<chr> <chr> <int>
1 1 A 9
2 1 B 1
3 1 B 2
4 1 C 5
5 2 A 4
6 2 A 7
7 2 B NA
8 2 C 3
9 2 C 10
We can convert id to factor data, which will force table to preserve its unique levels even in groupings of zero length:
df_complete %>%
  group_by(id.2) %>%
  mutate(id = factor(id)) %>%
  filter(!is.na(v)) %>%
  summarize(
    n.mean = mean(table(id)),
    v.mean = mean(v, na.rm = TRUE)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Or an alternate recipe that does not rely on table:
df_complete %>%
  group_by(id.2, id) %>%
  summarize(
    n_rows = sum(!is.na(v)),
    id_mean = mean(v)
  ) %>%
  group_by(id.2) %>%
  summarize(
    n.mean = mean(n_rows),
    v.mean = weighted.mean(id_mean, n_rows, na.rm = TRUE)
  )
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Note that when providing randomized example data, you should use set.seed to control the randomization and ensure reproducibility. Here is what I used:
set.seed(0)
df <- data.frame(id = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 id.2 = c('1', '2', '2', '1', '1', '1', '2', '2'),
                 v = sample(1:10, 8))

How to keep only rows that have the highest value in a certain column in R

I have a dataframe that looks like this:
library(tidyverse)
df <- tribble(
  ~Species, ~North, ~South, ~East, ~West,
  "a", 4, 3, 2, 3,
  "b", 2, 3, 4, 5,
  "C", 2, 3, 3, 3,
  "D", 3, 2, 2, 2
)
I want to filter for species where the highest value is in a given column, e.g. North.
In this case, species a and D would be selected. The expected output would be a df with only species a and D in it.
I used a workaround like this:
df %>%
  group_by(Species) %>%
  mutate(rowmean = mean(c(North, South, East, West))) %>%
  filter(North > rowmean) %>%
  ungroup() %>%
  select(!rowmean)
which seems like a lot of code for a simple task!
I can't, however, find a way to do this in a more code-friendly manner. Is there a (preferably tidyverse) way to perform this task in a cleaner way?
Kind regards
An easier approach is with max.col in base R. Select the columns that are numeric, get the column index of the maximum value in each row, check whether that index equals 1, i.e. the first selected column (as we selected only from the 2nd column onwards), and subset the rows:
subset(df, max.col(df[-1], 'first') == 1)
# A tibble: 2 x 5
# Species North South East West
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 a 4 3 2 3
#2 D 3 2 2 2
If it is based on the rowwise mean
subset(df, North > rowMeans(df[-1]))
Or if we prefer to use dplyr
library(dplyr)
df %>%
  filter(max.col(cur_data()[-1], 'first') == 1)
Similarly, if it is based on the row-wise mean
df %>%
  filter(North > rowMeans(cur_data()[-1]))
# base
df[df$North > rowMeans(df[-1]), ]
# A tibble: 2 x 5
Species North South East West
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 4 3 2 3
2 D 3 2 2 2
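Another compact dplyr option, offered as a hedged sketch rather than one of the answers above: keep a row when North equals the row-wise maximum computed by pmax(). Because North is the first of the compared columns, ties resolve the same way as max.col(..., 'first'):
df %>%
  filter(North == pmax(North, South, East, West))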

Row-wise test if multiple (not all) columns are equal

I want to do a row-wise check of whether multiple columns are all equal or not. I came up with a convoluted approach that counts the occurrences of each value per group, but this seems somewhat... cumbersome.
sample data
sample_df <- data.frame(id = letters[1:6], group = rep(c('r', 'l'), 3), stringsAsFactors = FALSE)
set.seed(4)
for (i in 3:5) {
  sample_df[i] <- sample(1:4, 6, replace = TRUE)
}
sample_df
Desired output:
library(tidyverse)
sample_df %>%
  gather(var, value, V3:V5) %>%
  mutate(n_var = n_distinct(var)) %>% # get the number of columns
  group_by(id, group, value) %>%
  mutate(test = n_distinct(var) == n_var) %>% # check whether each value occurs in every var
  spread(var, value) %>%
  select(-n_var)
#> # A tibble: 6 x 6
#> # Groups: id, group [6]
#> id group test V3 V4 V5
#> <chr> <chr> <lgl> <int> <int> <int>
#> 1 a r FALSE 3 3 1
#> 2 b l FALSE 1 4 4
#> 3 c r FALSE 2 4 2
#> 4 d l FALSE 2 1 2
#> 5 e r TRUE 4 4 4
#> 6 f l FALSE 2 2 3
Created on 2019-02-27 by the reprex package (v0.2.1)
It does not need to be dplyr; I just used it to show what I want to achieve.
There are a bunch of ways to check for equality row-wise. Two good ways:
# test that all values equal the first column
rowSums(df == df[, 1]) == ncol(df)
# count the unique values, see if there is just 1
apply(df, 1, function(x) length(unique(x)) == 1)
If you only want to test some columns, then use a subset of columns rather than the whole data frame:
cols_to_test = c(3, 4, 5)
rowSums(df[cols_to_test] == df[, cols_to_test[1]]) == length(cols_to_test)
# count the unique values, see if there is just 1
apply(df[cols_to_test], 1, function(x) length(unique(x)) == 1)
Note I use df[cols_to_test] instead of df[, cols_to_test] when I want to be sure the result is a data.frame even if cols_to_test has length 1.
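For a tidyverse flavor, dplyr's if_all() (available since dplyr 1.0.4) can express the same row-wise test directly inside mutate(). A small sketch on the question's sample data:
library(dplyr)
sample_df %>%
  mutate(test = if_all(V4:V5, ~ .x == V3))  # TRUE where V3, V4, and V5 all agree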

R - (Tidyverse) Turn one row into multiple rows based on column integer

Let's say that I have a dataset which consists of multiple observations. Sometimes a single observation is actually multiple ones that have been condensed into one. An integer-valued variable keeps track of how many observations were merged together.
What I want to do is to reverse this process.
Example code:
library(tidyverse)
# Example tibble
df_ex <- tibble(
  var1 = seq(1, 3),
  var2 = c('Some', 'Random', 'Text'),
  var3 = c(1, 3, 2)
)
The code above produces the following tibble:
# A tibble: 3 x 3
var1 var2 var3
<int> <chr> <dbl>
1 1 Some 1
2 2 Random 3
3 3 Text 2
The desired tibble after some tidyverse magic would be:
# A tibble: 6 x 3
var1 var2 var3
<dbl> <chr> <dbl>
1 1 Some 1
2 2 Random 1
3 2 Random 1
4 2 Random 1
5 3 Text 1
6 3 Text 1
There are multiple ways to do this in the tidyverse.
1) Group by 'var1' (assuming it is unique), create a list column for 'var3' by replicating 1 as many times as the value of 'var3', and then unnest
df_ex %>%
  group_by(var1) %>%
  mutate(var3 = list(rep(1, var3))) %>%
  unnest(var3)
2) Use map to build the list column for 'var3' and unnest
df_ex %>%
  mutate(var3 = map(var3, ~ rep(1, .x))) %>%
  unnest(var3)
3) With base R, replicate the sequence of rows to expand the data and then transform the 'var3' to 1
transform(df_ex[rep(seq_len(nrow(df_ex)), df_ex$var3),], var3 = 1)
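tidyr also ships a purpose-built verb for this, uncount(), which repeats each row according to a weights column. A short sketch; note that uncount() drops the weights column by default, so var3 is recreated afterwards:
df_ex %>%
  uncount(var3) %>% # repeat each row var3 times (drops var3 by default)
  mutate(var3 = 1)  # recreate var3 as 1 for every expanded row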

R - Find a sequence of row elements based on time constraints in a dataframe

Consider the following dataframe (ordered by id and time):
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 32, 1, 2, 6, 17, 24))
df
id event time
1 1 a 1
2 1 b 3
3 1 b 6
4 1 b 12
5 1 a 24
6 1 b 30
7 1 a 42
8 2 a 1
9 2 a 2
10 2 b 6
11 2 a 17
12 2 a 24
I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
It means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", another event "a" must start no earlier than 12 and no later than 18 after event "b".
Some rules for creating sequences:
Events don't need to be consecutive with respect to the "time" column. For example, seq can be constructed from rows 1, 3, and 5.
To be counted, sequences must have different first events. For example, if the sequence from rows 8, 10, and 11 was counted, then the sequence from rows 8, 10, and 12 must not be counted.
The events may be included in many constructed sequences as long as they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.
The expected result:
df1
id count
1 1 2
2 2 2
There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.
Is there a way to solve the problem using "dplyr"?
I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
  full_join(df, by = 'id', suffix = c('1', '2')) %>%
  full_join(df, by = 'id') %>%
  rename(event3 = event, time3 = time) %>%
  filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
  filter(time1 %>% between(time_LB[1], time_UB[1])) %>%
  filter((time2 - time1) %>% between(time_LB[2], time_UB[2])) %>%
  filter((time3 - time2) %>% between(time_LB[3], time_UB[3])) %>%
  group_by(id, time1) %>%
  slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
  group_by(id) %>%
  count()
Here's the output:
# A tibble: 2 x 2
id n
<dbl> <int>
1 1 2
2 2 2
Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3
<dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1 1 a 1 b 6 a 24
2 1 a 24 b 30 a 42
3 2 a 1 b 6 a 24
4 2 a 2 b 6 a 24
EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes, it is possible to generalize this to sequences of arbitrary length, but it requires some R voodoo. Most notably, note the use of Reduce, which allows you to apply a common function over a list of objects, and foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
multi_full_join <- function(df1, df2) full_join(df1, df2, by = 'id')
df_list <- foreach(i = seq_along(seq)) %do% { df }
df2 <- Reduce(multi_full_join, df_list)
names(df2)[grep('event', names(df2))] <- paste0('event', seq_along(seq))
names(df2)[grep('time', names(df2))] <- paste0('time', seq_along(seq))
df2 <- df2 %>% mutate_if(is.factor, as.character)
df2 <- df2 %>%
  mutate(seq_string = Reduce(paste0, df2 %>% select(grep('event', names(df2))) %>% as.list)) %>%
  filter(seq_string == paste0(seq, collapse = ''))
time_diff <- df2 %>%
  select(grep('time', names(df2))) %>%
  t %>%
  as.data.frame() %>%
  lapply(diff) %>%
  unlist %>%
  matrix(ncol = length(seq) - 1, byrow = TRUE) %>% # one column per consecutive time gap
  as.data.frame
foreach(i = seq_along(time_diff), .combine = data.frame) %do%
  {
    time_diff[[i]] %>% between(time_LB[i + 1], time_UB[i + 1])
  } %>%
  Reduce(`&`, .) %>%
  which %>%
  slice(df2, .) %>%
  filter(time1 %>% between(time_LB[1], time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
  group_by(id, time1) %>%
  slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3 seq_string
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 1 a 1 b 6 a 24 aba
2 1 a 24 b 30 a 42 aba
3 2 a 1 b 6 a 24 aba
4 2 a 2 b 6 a 24 aba
If you want just the counts, you can group_by(id) then count() as in the original code snippet.
Perhaps it's easier to represent event sequences as strings and use regex:
df.str = lapply(split(df, df$id), function(d) {
  z = rep('-', tail(d, 1)$time)      # one character per time unit, up to the last event
  z[d$time] = as.character(d$event)  # place each event at its time position
  z })
df.str = lapply(df.str, paste, collapse = '')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
df1 = lapply(df.str, function(s) length(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl=T)[[1]]))
data.frame(id = names(df1), count = unlist(df1))
# id count
# 1 1 2
# 2 2 2
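One caveat, offered as a hedged refinement rather than part of the original answer: gregexpr() returns -1 when a pattern has no match, so length(...) would report 1 for an id with zero occurrences. A small guard, using a hypothetical count_matches() helper:
count_matches <- function(s, pattern) {
  m <- gregexpr(pattern, s, perl = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)  # -1 signals "no match", not one match
}
df1 = lapply(df.str, count_matches, pattern = '(?=a.{1,7}b.{11,17}a)')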
