Taking the head of each group with a different n per group in R

I'm trying to get the first n rows of an object, but with a different n per group, according to values I have in another object.
I have the following reproducible example:
library(dplyr)
a <- tibble(id = c(1,2,3,4,5,6,7,8,9,10),
            group = c(1,1,1,1,1,2,2,2,2,2))
b <- tibble(group = c(1,2),
            n = c(3,4))
What I want is the first 3 rows of a when the group is 1, and the first 4 rows of a when the group is 2.
I've tried doing this:
cob<- a %>% group_by(group) %>% arrange(id, .by_group = TRUE) %>%
group_map(~head(.x, b$n))
But I just get the first 3 rows in both groups, and not different size for each group.

We can do a join and then filter:
library(dplyr)
a %>%
  left_join(b, by = "group") %>%  # attach each group's n to its rows
  group_by(group) %>%
  filter(row_number() <= first(n)) %>%
  ungroup() %>%
  select(-n)
Another option is to slice each group, looking up its n in b:
a %>%
  group_by(group) %>%
  slice(seq_len(b$n[match(cur_group()$group, b$group)]))
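The attempt in the question fails because b$n is the whole vector c(3, 4), not the value that belongs to the current group. A minimal sketch that keeps the per-group head() idea, assuming dplyr 0.8.1 or later for group_modify(), where .y inside the formula is the one-row tibble of group keys:

```r
library(dplyr)

a <- tibble(id = 1:10, group = rep(c(1, 2), each = 5))
b <- tibble(group = c(1, 2), n = c(3, 4))

res <- a %>%
  group_by(group) %>%
  arrange(id, .by_group = TRUE) %>%
  # .x holds the rows of the current group, .y its key (a one-row tibble)
  group_modify(~ head(.x, b$n[b$group == .y$group])) %>%
  ungroup()

res$id
# 1 2 3 6 7 8 9
```

The lookup b$n[b$group == .y$group] assumes every group in a appears exactly once in b.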

Here is a data.table solution.
library(data.table)
setDT(a) # only needed because you started with a tibble
setDT(b) # same
a[b, on=.(group)][, .(id=id[1:n]), by=.(group, n)]
   group n id
1:     1 3  1
2:     1 3  2
3:     1 3  3
4:     2 4  6
5:     2 4  7
6:     2 4  8
7:     2 4  9
The first clause: a[b, on=.(group)] joins b to a creating a data.table with columns group, id, and n. The second clause: [, .(id=id[1:n]), by=.(group, n)] groups by group, taking the first n elements of id in each group.
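One caveat with id[1:n]: it pads the result with NA when n exceeds the group size, and 1:n counts down when n is 0. A hedged variant of the same join, using head() so that n is capped at the group size (the value n = 7 below is made up to show the difference):

```r
library(data.table)

a <- data.table(id = 1:10, group = rep(1:2, each = 5))
b <- data.table(group = 1:2, n = c(3L, 7L))  # 7 exceeds the 5 rows of group 2

# Join b onto a, so every row of a carries its group's n
joined <- a[b, on = .(group)]

# n is constant within a group, so n[1] is that group's value;
# head() returns at most the rows that exist, with no NA padding
res <- joined[, .(id = head(id, n[1])), by = group]

res$id
# 1 2 3 6 7 8 9 10
```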

Related

Create column with a certain week value by group

I would like to create a column, by group, with a certain week's value from another column.
In this example New_column is created with the Number from the 2nd week for each group.
Group  Week  Number  New_column
A      1     19      8
A      2     8       8
A      3     21      8
A      4     5       8
B      1     4       12
B      2     12      12
B      3     18      12
B      4     15      12
C      1     9       4
C      2     4       4
C      3     10      4
C      4     2       4
I've used this method, which works, but it feels like a really messy way to do it:
library(dplyr)
df <- df %>%
  group_by(Group) %>%
  mutate(New_column = ifelse(Week == 2, Number, NA))
df <- df %>%
  group_by(Group) %>%
  mutate(New_column = sum(New_column, na.rm = TRUE))
Several solutions are possible, depending on what exactly you need. With your sample data, however, all of them give the same result.
1) Look up the row where Week == 2. This works even if the data frame is not sorted:
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[Week == 2])
Note that this always looks for Week == 2 specifically, even if the weeks in a group do not start from 1.
2) If df is already sorted by Week inside each group, you could use
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[2])
This does not take the Number from the row where Week == 2, but from the second row within each group, regardless of its actual Week value.
3) If df is not sorted by Week, sort it within each group first:
df %>%
  group_by(Group) %>%
  arrange(Week, .by_group = TRUE) %>%
  mutate(New_column = Number[2])
This uses the same rationale as solution 2).
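A small variation on the Week == 2 lookup, offered as a sketch: Number[Week == 2] has length zero for a group without a Week 2 row, which mutate() cannot recycle, whereas match() yields NA for such a group. The data below is rebuilt from the table in the question.

```r
library(dplyr)

df <- tibble(
  Group  = rep(c("A", "B", "C"), each = 4),
  Week   = rep(1:4, times = 3),
  Number = c(19, 8, 21, 5, 4, 12, 18, 15, 9, 4, 10, 2)
)

res <- df %>%
  group_by(Group) %>%
  # match(2, Week) is the position of the first Week == 2, or NA if absent
  mutate(New_column = Number[match(2, Week)]) %>%
  ungroup()

unique(res[, c("Group", "New_column")])
```

With the sample data every group has a Week 2 row, so the result matches solution 1).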

Counting highest number occurrences of character and return in separate data table/frame in R

I am looking for a string of code that will count the number of occurrences of a certain variable, sort it in order, and then limit it to the first X results. Example of what I am looking for:
Dataframe:
ID Group
1000 A
1001 A
100a A
100g D
1004 C
100f B
100z B
1293 B
2412 B
3040 B
3452 C
Result: a table or data frame showing the top 3 results (of 4), ordered from highest to lowest
Group Count
B 5
A 3
C 2
Thanks in advance!
In dplyr, we can count Group values, select top 3 values and arrange them in decreasing order.
library(dplyr)
df %>% count(Group) %>% top_n(3, n) %>% arrange(desc(n))
# Group n
# <fct> <int>
#1 B 5
#2 A 3
#3 C 2
We can also use
df %>% count(Group) %>% arrange(desc(n)) %>% head(3)
Or in base R
stack(head(sort(table(df$Group), decreasing = TRUE), 3))
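In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which also returns the rows already sorted in decreasing order. A minimal sketch, with the data frame rebuilt from the question and the count column named Count to match the desired output:

```r
library(dplyr)

df <- tibble(
  ID    = c("1000", "1001", "100a", "100g", "1004", "100f",
            "100z", "1293", "2412", "3040", "3452"),
  Group = c("A", "A", "A", "D", "C", "B", "B", "B", "B", "B", "C")
)

top3 <- df %>%
  count(Group, name = "Count") %>%  # one row per Group with its frequency
  slice_max(Count, n = 3)           # three largest counts, highest first

top3
```

By default slice_max() keeps ties (with_ties = TRUE), so it can return more than three rows when counts are equal; pass with_ties = FALSE for exactly three.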

Mark top entries of subsets in R with tidyverse

I'd like to mark the top-ranked values in each group with a marker using the tidyverse, if possible.
Assume the following data
test = tibble(group=c(1,1,1,1,2,2,2,2), values = c(1,2,3,4,7,6,5,2))
I'd now like to mark the top two values per group, which would be the values 3 and 4 for group 1 and 7 and 6 for group 2, yielding:
# A tibble: 8 x 3
group values marker
<dbl> <dbl> <lgl>
1 1 1 FALSE
2 1 2 FALSE
3 1 3 TRUE
4 1 4 TRUE
5 2 7 TRUE
6 2 6 TRUE
7 2 5 FALSE
8 2 2 FALSE
I thought about ranking them and then doing a comparison to get the boolean values, or utilizing purrr, but I could not figure out how.
After grouping by 'group', rank the 'values' and check which ranks are %in% the last two elements of the sorted ranks, which creates a logical vector:
library(tidyverse)
test %>%
group_by(group) %>%
mutate(marker = dense_rank(values),
marker = marker %in% tail(sort(marker), 2))
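Or apply %in% directly to the tail of the ordered values:
test %>%
  group_by(group) %>%
  mutate(marker = values %in% tail(values[order(values)], 2))
Or rank in descending order and keep ranks 1 and 2 (comparing dense_rank(values) against n() - 2 is only correct when a group has no tied values):
test %>%
  group_by(group) %>%
  mutate(marker = dense_rank(desc(values)) <= 2)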
Or it can be done in a single line with data.table
library(data.table)
setDT(test)[order(values), marker := values %in% tail(values, 2), group]
Another option: after grouping by 'group', take the top_n rows (n = 2, wt = 'values'), add a 'marker' column of TRUE, right_join with the original dataset, and replace the resulting NA elements with FALSE:
test %>%
  group_by(group) %>%
  top_n(2, values) %>%
  mutate(marker = TRUE) %>%
  right_join(test, by = c("group", "values")) %>%
  mutate(marker = replace_na(marker, FALSE))
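The ranking idea from the question can also be written as a single comparison. A minimal sketch using min_rank() on the descending values; k is a hypothetical name for the number of values to mark per group:

```r
library(dplyr)

test <- tibble(group  = c(1, 1, 1, 1, 2, 2, 2, 2),
               values = c(1, 2, 3, 4, 7, 6, 5, 2))
k <- 2  # how many top values to mark per group (hypothetical name)

res <- test %>%
  group_by(group) %>%
  # rank 1 is the largest value in the group, so ranks 1..k are the top k
  mutate(marker = min_rank(desc(values)) <= k) %>%
  ungroup()

res$marker
# FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
```

With tied values min_rank() gives the tied rows the same rank, so more than k rows can be marked in a group.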

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to fill the NAs that appear in the first two rows of each group:
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number()))) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3, fill = "extend")) %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package (zoo):
library(dplyr)
library(zoo)
df %>%
  group_by(ID) %>%
  mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
     ID   VAL   SUM
  <int> <int> <int>
1     1     2     2
2     1     1     3
3     1     3     6
4     1     4     8
Because partial = TRUE, the function is also applied to the leading rows where fewer than three observations are available, so those rows are summed over whatever values exist.
Not a direct answer, but one way would be to replace the values that are NA with the cumsum of VAL:
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or, since you know the window size beforehand, you could check with row_number() as well:
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))
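The same partial sums can be computed without zoo, which may help when groups are shorter than the window. This is only a sketch: a sum over the last k values equals the cumulative sum minus the cumulative sum k positions earlier. The helper roll_sum_partial is a hypothetical name, and the second group is made-up data with only two rows.

```r
library(dplyr)

# Partial rolling sum over the last k values, built from cumulative sums
# (roll_sum_partial is a hypothetical helper, not part of zoo or dplyr)
roll_sum_partial <- function(x, k) {
  cs <- cumsum(x)
  # cumulative sum k positions earlier, with 0 where fewer than k values precede
  shifted <- c(rep(0, min(k, length(x))), head(cs, -k))
  cs - shifted
}

df <- tibble(ID = c(1, 1, 1, 1, 2, 2), VAL = c(2, 1, 3, 4, 5, 6))

res <- df %>%
  group_by(ID) %>%
  mutate(SUM = roll_sum_partial(VAL, 3)) %>%
  ungroup()

res$SUM
# 2 3 6 8 5 11
```

An NA in VAL would propagate through cumsum() to every later row of the group, so this variant assumes complete data.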

Filter and return all rows of a group where specific row fulfills one condition

I am looking to filter and retrieve all rows from all groups where a specific row meets a condition, in my example when the value is more than 3 at the highest day per group. This is obviously simplified but breaks it down to the essential.
# Dummy data
id = rep(letters[1:3], each = 3)
day = rep(1:3, 3)
value = c(2,3,4,2,3,3,1,2,4)
my_data = data.frame(id, day, value, stringsAsFactors = FALSE)
My approach works, but it seems somewhat clumsy:
library(dplyr)
foo <- my_data %>%
group_by(id) %>%
slice(which.max(day)) %>% # gets the highest day
filter(value>3) # filters the rows with value >3
## semi_join with the original data frame gives the required result:
semi_join(my_data, foo, by = 'id')
id day value
1 a 1 2
2 a 2 3
3 a 3 4
4 c 1 1
5 c 2 2
6 c 3 4
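Is there a more succinct way to do this?
You can filter on a per-group condition directly; within each group, value[which.max(day)] is the value on the highest day:
my_data %>% group_by(id) %>% filter(value[which.max(day)] > 3)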
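A self-contained sketch of that grouped filter on the dummy data from the question. The condition evaluates to a single TRUE or FALSE per group, which filter() recycles to all rows of that group:

```r
library(dplyr)

my_data <- data.frame(
  id    = rep(letters[1:3], each = 3),
  day   = rep(1:3, times = 3),
  value = c(2, 3, 4, 2, 3, 3, 1, 2, 4),
  stringsAsFactors = FALSE
)

res <- my_data %>%
  group_by(id) %>%
  # one logical per group: TRUE keeps every row of the group, FALSE drops it
  filter(value[which.max(day)] > 3) %>%
  ungroup()

res$id
# "a" "a" "a" "c" "c" "c"
```

Note that which.max() picks the first row if the highest day occurs more than once in a group.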
