Mark top entries of subsets in R with tidyverse

I'd like to flag the top-ranked values within each group, using the tidyverse if possible.
Assume the following data:
test = tibble(group=c(1,1,1,1,2,2,2,2), values = c(1,2,3,4,7,6,5,2))
I'd now like to mark the top two values in each group, which would be 3 and 4 for group 1 and 7 and 6 for group 2, yielding:
# A tibble: 8 x 3
group values marker
<dbl> <dbl> <lgl>
1 1 1 FALSE
2 1 2 FALSE
3 1 3 TRUE
4 1 4 TRUE
5 2 7 TRUE
6 2 6 TRUE
7 2 5 FALSE
8 2 2 FALSE
I thought about ranking them and then doing a comparison to get the boolean values, or using purrr, but I could not figure out how.

After grouping by 'group', rank the 'values' with dense_rank, then check which ranks are %in% the last two elements of the sorted ranks to create a logical vector:
library(tidyverse)
test %>%
group_by(group) %>%
mutate(marker = dense_rank(values),
marker = marker %in% tail(sort(marker), 2))
Or use order directly and test the values against the tail with %in%:
test %>%
group_by(group) %>%
mutate(marker = values %in% tail(values[order(values)], 2))
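Or rank in descending order and keep the first two ranks (comparing dense_rank(values) against n() - 2 only works when a group has no tied values):
test %>%
group_by(group) %>%
mutate(marker = dense_rank(desc(values)) <= 2)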
Or it can be done in a single line with data.table
library(data.table)
setDT(test)[order(values), marker := values %in% tail(values, 2), group]
Another option: after grouping by 'group', take the top_n rows (n = 2, wt = 'values'), add a 'marker' column of TRUE, right_join with the original dataset, and then replace the resulting NA elements with FALSE. (In dplyr 1.0.0 and later, top_n is superseded by slice_max.)
test %>%
group_by(group) %>%
top_n(2, values) %>%
mutate(marker = TRUE) %>%
right_join(test) %>%
mutate(marker = replace_na(marker, FALSE))
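As a quick sanity check of the ranking idea, here is a minimal sketch run on the question's data. It uses min_rank(desc(values)) rather than the calls shown above, which is a swapped-in variant; the name marked is only illustrative.

```r
library(dplyr)

# Data from the question
test <- tibble(group  = c(1, 1, 1, 1, 2, 2, 2, 2),
               values = c(1, 2, 3, 4, 7, 6, 5, 2))

# Rank in descending order within each group and flag ranks 1 and 2.
# With ties, min_rank() may flag more than two rows per group.
marked <- test %>%
  group_by(group) %>%
  mutate(marker = min_rank(desc(values)) <= 2) %>%
  ungroup()

marked$marker
# FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
```

Whether tied values should all be flagged is a design choice; dense_rank and min_rank differ only in that case.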

Related

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.
In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.
Some reproducible code:
library(dplyr)
df1 = data.frame(ids = c(1:5),
var = c("a","a","c","b","b"))
df2 = data.frame(var = c('c','a','b','b','d'),
links = 0)
Ideally, I would like a resulting dataframe that looks like:
var links
1 c 3
2 a 1 or 2
3 b 4 or 5
4 b 4 or 5
5 d 0
where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".
Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".
Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.
So far, I've gone this route, but I'm unsure about what to put after the ~
linked_df = df2 %>%
mutate(links=case_when(links==0 & var %in% df1$var ~
sample(c(df1$ids),n(),replace=T), # unsure about this line
TRUE ~ links))
I think this is what you want. I've left the ids column in the result, but
it can be removed when the sampling is complete.
library(dplyr)
library(tidyr)
df1_nest = df1 %>%
group_by(var) %>%
summarize(ids = list(ids))
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
set.seed(47)
df2 %>%
left_join(df1_nest) %>%
mutate(
links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
)
# Joining, by = "var"
# var links ids
# 1 c 3 3
# 2 a 1 1, 2
# 3 b 4 4, 5
# 4 b 5 4, 5
# 5 d 0 NULL
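Something like this could do the trick: map over var to collect the matching ids from the first dataframe, then keep one randomly chosen id per row. Collecting the ids first and sampling with slice_sample avoids calling sample() on a single id, which would draw from 1:id instead. Note that unmatched rows end up as NA rather than 0.
library(purrr)
df2 %>%
as_tibble() %>%
mutate(links = map(var, ~ df1$ids[df1$var == .x]),
index = row_number()) %>%
unnest(links, keep_empty = TRUE) %>%
group_by(index) %>%
slice_sample(n = 1) %>%
ungroup() %>%
select(-index)
# One possible result (the draws for a and b are random):
# # A tibble: 5 × 2
# var links
# <chr> <int>
# 1 c 3
# 2 a 1
# 3 b 4
# 4 b 5
# 5 d NA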
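For comparison, a base R sketch that keeps links at 0 when there is no match, as the question requires. The helper name pick_link is hypothetical, and the seed is only there to make the random draws repeatable.

```r
# Data from the question
df1 <- data.frame(ids = 1:5,
                  var = c("a", "a", "c", "b", "b"))
df2 <- data.frame(var = c("c", "a", "b", "b", "d"),
                  links = 0)

# Hypothetical helper: return one randomly chosen matching id, or 0 if none.
pick_link <- function(v) {
  candidates <- df1$ids[df1$var == v]
  if (length(candidates) == 0) return(0L)
  # Index with sample.int() so a single candidate is returned as-is
  # instead of being treated as the upper bound of 1:candidate.
  candidates[sample.int(length(candidates), 1)]
}

set.seed(47)
df2$links <- vapply(as.character(df2$var), pick_link, integer(1),
                    USE.NAMES = FALSE)
df2
```

Row 1 always gets 3 and row 5 always gets 0; the rows for "a" and "b" vary between their two candidate ids.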

heading rows with different n by group in R

I'm trying to get the first n rows of an object, but with a different n per group, according to values I have in another object.
I have the following reproducible example:
a<- tibble(id = c(1,2,3,4,5,6,7,8,9,10),
group = c(1,1,1,1,1,2,2,2,2,2))
b<- tibble(group=c(1,2),
n = c(3,4))
What I want is the first 3 rows of a when the group is 1, and the first 4 rows of a when the group is 2.
I tried this:
cob<- a %>% group_by(group) %>% arrange(id, .by_group = TRUE) %>%
group_map(~head(.x, b$n))
But I just get the first 3 rows in both groups, and not different size for each group.
We can do a join and then filter
library(dplyr)
a %>%
left_join(b) %>%
group_by(group) %>%
filter(row_number() <= first(n)) %>%
ungroup %>%
select(-n)
or another option is
a %>%
group_by(group) %>%
slice(seq_len(b$n[match(cur_group()$group, b$group)]))
Here is a data.table solution.
library(data.table)
setDT(a) # only needed because you started with a tibble
setDT(b) # same
a[b, on=.(group)][, .(id=id[1:n]), by=.(group, n)]
group n id
1: 1 3 1
2: 1 3 2
3: 1 3 3
4: 2 4 6
5: 2 4 7
6: 2 4 8
7: 2 4 9
The first clause: a[b, on=.(group)] joins b to a creating a data.table with columns group, id, and n. The second clause: [, .(id=id[1:n]), by=.(group, n)] groups by group, taking the first n elements of id in each group.

How to find elements common in at least half of elements in an R tibble

I have a tibble of values:
raw = tibble(
labels = rep(rep(1:4,each=3),2),
group = rep(c("A","B"), each=12),
value = c(1,2,3,3,4,5,6,7,2,2,12,1,7,3,3,3,4,5,6,3,2,2,7,1))
For each group (A and B) separately, I want to select the values that appear in at least half of the group's four labels. The result would be
Res = tibble(group = c("A","B"),
value = c("1,2,3","2,3,7"))
It would be helpful to have a flexible function that does the same selection for, say, at least 1/3 of the labels.
Here is one option: group by 'group' and 'value' and count the number of distinct 'labels' each value appears in, then group by 'group' alone and filter the rows where that count 'n' is at least half the number of distinct 'labels' in the group (i.e. 50%), and finally keep the distinct rows of 'group' and 'value'.
library(dplyr)
raw %>%
group_by(group, value) %>%
mutate(n = n_distinct(labels)) %>%
group_by(group) %>%
filter(n >= n_distinct(labels)/2) %>%
select(-n) %>%
ungroup %>%
distinct(group, value)
# A tibble: 6 x 2
# group value
# <chr> <dbl>
#1 A 1
#2 A 2
#3 A 3
#4 B 7
#5 B 3
#6 B 2
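The question also asks for a flexible version with an adjustable threshold. A minimal sketch under that reading follows; the function name common_values and its frac argument are hypothetical, and the result is collapsed into one comma-separated string per group to match the requested Res layout.

```r
library(dplyr)

# Data from the question
raw <- tibble(
  labels = rep(rep(1:4, each = 3), 2),
  group  = rep(c("A", "B"), each = 12),
  value  = c(1, 2, 3, 3, 4, 5, 6, 7, 2, 2, 12, 1,
             7, 3, 3, 3, 4, 5, 6, 3, 2, 2, 7, 1))

# Hypothetical helper: keep values seen in at least `frac` of a group's labels.
common_values <- function(data, frac = 1/2) {
  data %>%
    group_by(group) %>%
    mutate(n_labels = n_distinct(labels)) %>%   # number of labels in the group
    group_by(group, value) %>%
    filter(n_distinct(labels) >= frac * first(n_labels)) %>%
    ungroup() %>%
    distinct(group, value) %>%
    group_by(group) %>%
    summarise(value = paste(sort(value), collapse = ","), .groups = "drop")
}

res_half <- common_values(raw)        # at least half of the labels
res_half$value
# "1,2,3" "2,3,7"

res_third <- common_values(raw, 1/3)  # at least a third of the labels
```

With four labels per group, a threshold of 1/3 still requires a value to appear in two labels, so both calls return the same values for this data.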

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number()))) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3, fill = "extend")) %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from zoo, the same package that provides rollsumr():
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Because of the argument partial = TRUE, rows that have fewer than three observations available in the window are still summed over whatever observations exist.
Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or, since you know the window size beforehand, you could check with row_number() as well:
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))
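The same result can also be sketched without zoo by subtracting a lagged cumulative sum from the cumulative sum. This is a different technique from the answers above, not a zoo feature. The data below is hypothetical; ID 2 deliberately has fewer rows than the window.

```r
library(dplyr)

# Hypothetical data: ID 1 mirrors the question, ID 2 has only two rows.
df <- tibble(ID  = c(1L, 1L, 1L, 1L, 2L, 2L),
             VAL = c(2L, 1L, 3L, 4L, 5L, 6L))

k <- 3  # window length

res <- df %>%
  group_by(ID) %>%
  mutate(cs  = cumsum(VAL),
         # Sum of the last k values = cumulative sum now minus the
         # cumulative sum k rows earlier (0 where no such row exists).
         SUM = cs - lag(cs, n = k, default = 0L)) %>%
  ungroup() %>%
  select(-cs)

res$SUM
# 2 3 6 8 5 11
```

Rows with fewer than k earlier observations simply get the running total, so short groups need no special handling.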

how to count repetitions of first occurring value with dplyr

I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>% mutate(is_first = as.integer(state == first(state))) %>%
summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
as.character %>%
rle %$%
lengths %>%
first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
There is no need for dplyr here, but you can modify this example to use it with dplyr. The key is the function rle:
state = c(rep("A", 3), rep("B",2), rep("A",2))
x = rle(state)
DF = data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A",]$len)
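If only the length of the first block is needed, a cumulative product over the comparison with the first element is another way to get it. This is a sketch of an alternative to rle, and count_first_run is a hypothetical name. It assumes the vector contains no NA values.

```r
# Hypothetical helper: length of the initial run of the first value.
count_first_run <- function(x) {
  x <- as.character(x)
  # cumprod() stays at 1 until the first mismatch, then drops to 0 for good,
  # so the sum counts only the leading block.
  sum(cumprod(x == x[1]))
}

state <- c(rep("A", 3), rep("B", 2), rep("A", 2))
n_first <- count_first_run(state)
n_first
# 3
```

Inside a dplyr chain it can be called the same way as the rle version, for example within summarize() after group_by().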
