Count subgroups in group_by with dplyr [duplicate] - r

This question already has answers here:
Add count of unique / distinct values by group to the original data
(3 answers)
Closed 4 years ago.
I'm stuck trying to do some counting on a data frame. The gist is to group by one variable and then break each group into subgroups based on a second variable. From there I want to count the number of subgroups in each group. The sample code is this:
set.seed(123456)
df <- data.frame(User = c(rep("A", 5), rep("B", 4), rep("C", 6)),
Rank = c(rpois(5,1), rpois(4,2), rpois(6,3)))
#This results in an error
df %>% group_by(User) %>% group_by(Rank) %>% summarize(Res = n_groups())
So what I want is 'User A' to have 3, 'User B' to have 4, and 'User C' to have 5. In other words the data frame df would end up looking like:
   User Rank Result
1     A    2      3
2     A    2      3
3     A    1      3
4     A    0      3
5     A    0      3
6     B    1      4
7     B    2      4
8     B    0      4
9     B    6      4
10    C    1      5
11    C    4      5
12    C    3      5
13    C    5      5
14    C    5      5
15    C    8      5
I'm still learning dplyr, so I'm unsure how I should do it. How can this be achieved? Non-dplyr answers are also very welcome. Thanks in advance!

Try this:
df %>% group_by(User) %>% mutate(Result=length(unique(Rank)))
Or, equivalently, with dplyr's n_distinct():
df %>% group_by(User) %>% mutate(Result=n_distinct(Rank))

A base R option would be using ave
df$Result <- with(df, ave(Rank, User, FUN = function(x) length(unique(x))))
df$Result
#[1] 3 3 3 3 3 4 4 4 4 5 5 5 5 5 5
and a data.table option is
library(data.table)
setDT(df)[, Result := uniqueN(Rank), by = User]
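
For reference, the attempt in the question fails for two reasons: a second group_by() replaces the first grouping by default rather than nesting inside it, and n_groups() expects a grouped data frame as its argument. The sketch below is a self-contained version of the n_distinct() approach. It assumes the Rank values shown in the question's table, typed in by hand rather than regenerated from the seed, so the result does not depend on the random number generator.

```r
library(dplyr)

# Rank values copied from the table in the question (not regenerated from the seed)
df <- data.frame(
  User = c(rep("A", 5), rep("B", 4), rep("C", 6)),
  Rank = c(2, 2, 1, 0, 0, 1, 2, 0, 6, 1, 4, 3, 5, 5, 8)
)

# Count the distinct Rank values within each User and attach the count to every row
res <- df %>%
  group_by(User) %>%
  mutate(Result = n_distinct(Rank)) %>%
  ungroup()

res$Result
```

With these values, each row of User A should carry 3, User B 4, and User C 5, matching the table in the question.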

Related

How can I remove rows with the same value in 2 or more rows in R

I have a dataframe in the following format, with IDs and a type of A or B. The dataframe is very long, with over 3000 IDs.
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B
... ...
I need to remove all rows (both A and B) wherever two or more A's follow one another. So I don't want to simply remove the duplicates: if there is a run of 2 or more A's, I want to remove all of those A's and the B that follows, up to the next A. The expected result is:
id type
1 A
2 B
6 A
7 B
8 A
9 B
... ...
Do I need a loop for this problem? Any help is appreciated, thank you!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
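row_sequence <- function(value) {
  # positions where a value equals the next one (lead() is from dplyr)
  inds <- which(value == lead(value))
  # mark each such row plus the two rows after it
  sort(unique(c(inds, inds + 1, inds + 2)))
}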
Apply the function to your dataframe by first extracting the rows that you want to remove into df1 and second anti_joining df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
I assumed there is only one B after each run of duplicated A values. If that is not the case, let me know and I will modify the code:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rles = data.table::rleid(type)) %>%
group_by(rles) %>%
mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
ungroup() %>%
mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
drop_na() %>%
select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")
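
As an alternative that needs no loop and no join, the same rule can be sketched with base R's rle(). This is not the method of either answer; it is a minimal sketch that assumes every run of B directly following a run of repeated A's should be dropped in full. The names bad_run, after_bad and kept are illustrative only.

```r
df <- data.frame(
  id = 1:13,
  type = c("A", "B", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "B")
)

# Run-length encode the type column (as.character guards against factors)
runs <- rle(as.character(df$type))

# Runs of A that are longer than one element
bad_run <- runs$values == "A" & runs$lengths > 1

# Runs of B that come immediately after such a run of A
after_bad <- c(FALSE, head(bad_run, -1)) & runs$values == "B"

# Expand the per-run flags back to one flag per row and drop the flagged rows
drop <- rep(bad_run | after_bad, runs$lengths)
kept <- df[!drop, ]
kept
```

On the sample data this should keep ids 1, 2, 6, 7, 8 and 9, the same rows as the expected result in the question.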

Filter for first 5 observations per group in tidyverse

I have precipitation data of several different measurement locations and would like to filter for only the first n observations per location and per group of precipitation intensity using tidyverse functions.
So far, I've grouped the data by location and by precipitation intensity.
This is a minimal example (in the real data there are several observations of each rainfall intensity per location):
df <- data.frame(location = c(rep(1, 7), rep(2, 7)),
rain = c(1:7, 1:7))
location rain
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 2 1
9 2 2
10 2 3
11 2 4
12 2 5
13 2 6
14 2 7
I thought that it should be quite easy using group_by() and filter(), but so far, I haven't found an expression that would return only the first n observations per rain group per location.
df %>% group_by(rain, location) %>% filter(???)
You can do:
df %>%
group_by(location) %>%
slice(1:5)
location rain
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
Or, using filter() with row_number():
library(dplyr)
df %>%
group_by(location) %>%
filter(row_number() %in% 1:5)
Non-dplyr solutions (that also rearrange the rows)
# Base R
df[unlist(lapply(split(row.names(df), df$location), "[", 1:5)), ]
# data.table
library(data.table)
setDT(df)[, .SD[1:5], by = location]
An option in data.table
library(data.table)
setDT(df)[, .SD[seq_len(.N) <=5], location]
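
The answers group by location only, whereas the question asks for the first n observations per rain group per location. A minimal sketch of that variant follows, assuming dplyr 1.0.0 or later for slice_head(). The data frame rain_df and its obs column are hypothetical and exist only so that each location and rain combination has several rows.

```r
library(dplyr)

# Hypothetical data: 3 observations for each location/rain combination
rain_df <- data.frame(
  location = rep(1:2, each = 6),
  rain     = rep(c(1, 1, 1, 2, 2, 2), times = 2),
  obs      = 1:12
)

n_keep <- 2  # number of observations to keep per group (an arbitrary choice)

first_n <- rain_df %>%
  group_by(location, rain) %>%
  slice_head(n = n_keep) %>%
  ungroup()

first_n
```

Groups with fewer than n_keep rows are returned whole, so no NA rows are introduced.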

Sum previous instances that match the same ID [duplicate]

This question already has answers here:
cumsum by group [duplicate]
(2 answers)
Closed 4 years ago.
I have this example dataset:
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2), A = c("2018-10-12",
"2018-10-12", "2018-10-13", "2018-10-14", "2018-10-15", "2018-10-16"),
B = c(1, 5, 7, 2, 54, 202))
ID A B
1 1 2018-10-12 1
2 1 2018-10-12 5
3 1 2018-10-13 7
4 2 2018-10-14 2
5 2 2018-10-15 54
6 2 2018-10-16 202
What I'm trying to do is create a column C that is the running sum of B within each ID, covering the dates up to and including each respective row. For instance, the output I'm seeking is:
ID A B C
1 1 2018-10-12 1 1
2 1 2018-10-12 5 6
3 1 2018-10-13 7 13
4 2 2018-10-14 2 2
5 2 2018-10-15 54 56
6 2 2018-10-16 202 258
I generally will use subsets to do individual sumifs when I have those questions, but I'm not sure how to do this in a new column.
My end goal is to determine the dates that each ID (if applicable) crosses 50.
Thanks!
We can do a group by cumulative sum to create the 'C' column
library(dplyr)
df %>%
group_by(ID) %>%
mutate(C = cumsum(B))
Or use data.table
library(data.table)
setDT(df)[, C := cumsum(B), by = ID]
or with base R
df$C <- with(df, ave(B, ID, FUN = cumsum))
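
For the end goal stated in the question, finding the date on which each ID crosses 50, one possible follow-up is to keep the first row per ID whose cumulative sum reaches the threshold. This is a sketch only: it assumes "crosses" means the running total is at least 50, and the names threshold and crossed are illustrative.

```r
library(dplyr)

df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  A  = c("2018-10-12", "2018-10-12", "2018-10-13",
         "2018-10-14", "2018-10-15", "2018-10-16"),
  B  = c(1, 5, 7, 2, 54, 202),
  stringsAsFactors = FALSE
)

threshold <- 50  # assumption: "crossing" means a running total of at least 50

crossed <- df %>%
  group_by(ID) %>%
  mutate(C = cumsum(B)) %>%   # running total within each ID
  filter(C >= threshold) %>%  # rows at or past the threshold
  slice(1) %>%                # first such row per ID
  ungroup()

crossed
```

IDs that never reach the threshold (ID 1 in the sample data, whose total stops at 13) simply do not appear in the result.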

How to subset data based on combination of criteria in R

I have several million rows of data and I need to create a subset. I have had no success despite trying hard and searching all over the web. The question is:
How to create a subset including only the smallest values of value for all ID & item combinations?
The data structure looks like this:
> df = data.frame(ID = c(1,1,1,1,2,2,2,2),
item = c('A','A','B','B','A','A','B','B'),
value = c(10,5,3,2,7,8,9,10))
> df
ID item value
1 1 A 10
2 1 A 5
3 1 B 3
4 1 B 2
5 2 A 7
6 2 A 8
7 2 B 9
8 2 B 10
The result should look like this:
ID item value
1 A 5
1 B 2
2 A 7
2 B 9
Any hints greatly appreciated. Thank you!
We can use aggregate from base R with grouping variables 'ID' and 'item' to get the min of 'value'
aggregate(value~., df, min)
# ID item value
#1 1 A 5
#2 2 A 7
#3 1 B 2
#4 2 B 9
Or using dplyr
library(dplyr)
df %>%
group_by(ID, item) %>%
summarise(value = min(value))
Or with data.table
library(data.table)
setDT(df)[, .(value = min(value)) , .(ID, item)]
Or another option would be to order and get the first row after grouping
setDT(df)[order(value), head(.SD, 1), .(ID, item)]
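
If the real data has further columns that should travel with the minimum, the summarising approaches would drop them. A sketch of a dplyr variant that keeps whole rows follows, assuming dplyr 1.0.0 or later for slice_min(). The batch column is hypothetical and is there only to show that extra columns survive.

```r
library(dplyr)

df <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 2),
  item  = c("A", "A", "B", "B", "A", "A", "B", "B"),
  value = c(10, 5, 3, 2, 7, 8, 9, 10),
  batch = letters[1:8]  # hypothetical extra column
)

smallest <- df %>%
  group_by(ID, item) %>%
  # keep the single row with the smallest value in each ID/item group
  slice_min(value, n = 1, with_ties = FALSE) %>%
  ungroup()

smallest
```

Setting with_ties = FALSE returns exactly one row per group even when the minimum is repeated; leave it at its default to keep all tied rows.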

Counting complete cases by ID for several variables

I'm just beginning to learn R, so my apologies if this is simpler than I think it is, but I'm really struggling to find an answer.
What I'm attempting to do is to create a vector with a count of complete cases, by ID, for multiple variables.
For example, in this data frame:
ID<-c(1:5)
score.1<-c(1, 7, 3, 5, NA, 4, 6, 9, 11, NA)
score.2<-c(2, NA, 7, 6, NA, 5, NA, 7, 10, 1)
sample<-data.frame(ID, score.1, score.2)
ID score.1 score.2
1 1 2
2 7 NA
3 3 7
4 5 6
5 NA NA
1 4 5
2 6 NA
3 9 7
4 11 10
5 NA 1
The output I'm looking for is something like:
ID Complete
1 4
2 2
3 4
4 4
5 1
Is there a way to do this that I'm missing? I've tried count(complete.cases(sample)) with plyr and sum(complete.cases()), but it's not giving me what I actually want.
Any help with this is appreciated.
You can use dplyr:
library(dplyr)
sample %>%
mutate(new_var = rowSums(!is.na(sample[,2:3]))) %>%
group_by(ID) %>%
summarize(Complete = sum(new_var))
The output is exactly what you are looking for:
ID Complete
(int) (dbl)
1 4
2 2
3 4
4 4
5 1
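With package dplyr and the base function complete.cases (note that this counts the rows in which every score is present, so on the sample data it gives 2, 0, 2, 2, 0 rather than the output shown in the question), try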
require(dplyr)
sample %>%
mutate(complete = complete.cases(sample)) %>%
group_by(ID) %>%
summarise(complete = sum(complete))
This should do it (count() here is the plyr function mentioned in the question):
score.1_complete <- sample[complete.cases(sample$score.1), ]
score.2_complete <- sample[complete.cases(sample$score.2), ]
total <- rbind(score.1_complete, score.2_complete)
output <- count(total, "ID")
My reasoning:
score.1_complete selects the rows where score.1 (though not necessarily score.2) is non-missing. score.2_complete selects the rows where score.2 (though not necessarily score.1) is non-missing. Therefore, counting how many times an ID shows up in total gives the number of times score.1 is non-missing for that ID plus the number of times score.2 is non-missing for that ID, which is what you want.
Here is another option with gather/summarise. We convert the 'wide' to 'long' format with gather (from tidyr), get the sum of non-NA 'value' grouped by 'ID'.
library(tidyr)
library(dplyr)
gather(sample, score, value,-ID) %>%
group_by(ID) %>%
summarise(value= sum(!is.na(value)) )
# ID value
# (int) (int)
#1 1 4
#2 2 2
#3 3 4
#4 4 4
#5 5 1
Or a base R approach would be
tapply(rowSums(!is.na(sample[-1])), sample$ID, FUN=sum)
# 1 2 3 4 5
# 4 2 4 4 1
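
The answers rest on two different readings of "complete": the number of non-missing score values per ID, or the number of rows per ID in which every score is present. The sketch below computes both side by side so the difference is visible. The data frame is named scores here (an illustrative rename, to avoid masking base R's sample() function), and the column name CompleteRows is likewise an assumption.

```r
library(dplyr)

scores <- data.frame(
  ID      = rep(1:5, times = 2),
  score.1 = c(1, 7, 3, 5, NA, 4, 6, 9, 11, NA),
  score.2 = c(2, NA, 7, 6, NA, 5, NA, 7, 10, 1)
)

res <- scores %>%
  group_by(ID) %>%
  summarise(
    # number of non-missing values across both score columns
    Complete     = sum(!is.na(score.1)) + sum(!is.na(score.2)),
    # number of rows where both scores are present
    CompleteRows = sum(complete.cases(score.1, score.2))
  )

res
```

On the sample data, Complete reproduces the output requested in the question (4, 2, 4, 4, 1), while CompleteRows gives 2, 0, 2, 2, 0.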
