Obtaining a summary of grouped counts in R - r

This should be simple but I have been stumped by it: I am trying to figure out an efficient method for obtaining summary stats of a grouped count. Here's a toy example:
df = tibble(pid = c(1,2,2,3,3,3,4,4,4,4), y = rnorm(10))
df %>% group_by(pid) %>% count(pid)
which outputs the expected
# A tibble: 4 × 2
# Groups:   pid [4]
    pid     n
  <dbl> <int>
1     1     1
2     2     2
3     3     3
4     4     4
However, what if I want a summary of those grouped counts? Attempting to mutate a new variable or use add_count hasn't worked, I assume because the variables are different sizes. For instance:
df %>% group_by(pid) %>% count(pid) %>% mutate(count = summary(n))
generates an error. What would be a simple way to generate summary statistics of the grouped counts (e.g., min, max, mean, etc.)?

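mutate is for adding columns to a data frame, which is not what you want here. Instead, pull the count column out of the data frame and summarise it as a plain vector. The group_by is also unnecessary, since count(pid) already counts per pid.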
df %>%
  count(pid) %>%
  pull(n) %>%
  summary()
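If you would rather have the statistics as columns of a data frame than as printed summary() output, one option is to summarise the count column directly. This is only a sketch on the toy data from the question; the output column names (n_groups, min_n and so on) are made up for illustration.

```r
library(dplyr)

df <- tibble(pid = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4), y = rnorm(10))

# count() returns one row per pid with the count in column n;
# summarise() then reduces that column to a single row of statistics.
count_stats <- df %>%
  count(pid) %>%
  summarise(
    n_groups = n_distinct(pid),  # number of groups that were counted
    min_n    = min(n),
    max_n    = max(n),
    mean_n   = mean(n),
    median_n = median(n)
  )

count_stats
```

The result is a one-row tibble, so it can be bound or joined to other summaries if needed.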

Related

Balance observations in data.frame by factor level [duplicate]

I would like to subsample a dataframe that has an imbalanced number of observations by factor level.
The output I want is another dataframe built from data from the original one where the number of observations by factor level is similar across factor levels (doesn't need to be exactly the same number for each level, but roughly similar).
I am not sure if this is called "thinning" the data or "undersampling" the data.
Consider for instance this dataframe:
data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))
How can I slice rows so that I extract ~200 rows, 50 for each class A, B, C and D?
I can do this manually, but I would like to find a method that I can use with larger datasets and based on a factor with more levels.
I would also be thankful for advice on the name of what I need (thinning? undersampling? stratified sampling?). Thanks!
You can use slice_sample in dplyr:
library(dplyr)
data %>%
  group_by(class) %>%
  slice_sample(n = 50)
In dplyr 1.1.0 and above:
slice_sample(data, n = 50, by = class)
A base R option: split the data by group, sample 50 rows from each piece with lapply, and then combine the pieces back with rbind:
df = lapply(split(data, data$class), function(x) x[sample(nrow(x), 50),])
df_sampled = do.call(rbind, df)
# Check number of observations
library(dplyr)
df_sampled %>%
group_by(class) %>%
summarise(n = n())
#> # A tibble: 4 × 2
#> class n
#> <chr> <int>
#> 1 A 50
#> 2 B 50
#> 3 C 50
#> 4 D 50
Created on 2023-02-17 with reprex v2.0.2
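One thing to watch with the base R version: sample() throws an error if a group has fewer rows than the requested size (slice_sample, as far as I know, silently returns the whole group in that case). A sketch of one way to cap the sample size per group is below. The helper name sample_up_to and the target of 60 rows are just assumptions for illustration.

```r
set.seed(1)  # only so the example is reproducible

data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

# Hypothetical helper: draw at most n rows from one group's data frame,
# falling back to all rows when the group is smaller than n.
sample_up_to <- function(x, n) {
  x[sample(nrow(x), min(n, nrow(x))), , drop = FALSE]
}

sampled <- do.call(rbind, lapply(split(data, data$class), sample_up_to, n = 60))

# Classes C and D only have 50 rows, so they contribute 50 each.
table(sampled$class)
```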

how to determine the number of unique values based on multiple criteria dplyr

I've got a df that looks like:
df(site=c(A,B,C,D,E), species=c(1,2,3,4), Year=c(1980:2010).
I would like to calculate the number of different years in which each species appears at each site, creating a new column called nYear. I've tried filtering by group and using mutate combined with n_distinct, but it is not quite working.
Here is part of the code I have been using:
Df1 <- Df %>%
filter(Year>1985)%>%
mutate(nYear = n_distinct(Year[Year %in% site]))%>%
group_by(Species,Site, Year) %>%
arrange(Species, .by_group=TRUE)
ungroup()
The approach is good, a few things to correct.
First, let's make some reproducible data (your code gave errors).
df <- data.frame("site"=LETTERS[1:5], "species"=1:5, "Year"=1981:2010)
You should have used summarise instead of mutate when you're looking to summarise values across groups. It will give you a shortened tibble as an output, with only the groups and the summary figures present (fewer columns and rows).
mutate on the other hand aims to modify an existing tibble, keeping all rows and columns by default.
The order of your functions in the chains also needs to change.
df %>%
  filter(Year > 1985) %>%
  group_by(species, site) %>%
  summarise(nYear = length(unique(Year))) %>% # instead of mutate
  arrange(species, .by_group = TRUE) %>%
  ungroup()
First, group_by(species,site), not year, then summarise and arrange.
# A tibble: 5 × 3
species site nYear
<int> <chr> <int>
1 1 A 5
2 2 B 5
3 3 C 5
4 4 D 5
5 5 E 5
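You can use distinct() on the filtered frame, restricted to the grouping columns plus Year so that other columns cannot inflate the count, and then count by your groups of interest:
Df %>%
  filter(Year > 1985) %>%
  distinct(Site, Species, Year) %>%
  count(Site, Species, name = "nYear")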
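Since the question asked for nYear as a new column, here is a sketch showing both variants with n_distinct(), which does the same job as length(unique()). It reuses the reproducible data from the first answer (lowercase column names); the object names n_years and with_col are made up for the example.

```r
library(dplyr)

df <- data.frame(site = LETTERS[1:5], species = 1:5, Year = 1981:2010)

# One row per species/site combination.
n_years <- df %>%
  filter(Year > 1985) %>%
  group_by(species, site) %>%
  summarise(nYear = n_distinct(Year), .groups = "drop")

# Same count, but kept as an extra column on every original row.
with_col <- df %>%
  filter(Year > 1985) %>%
  group_by(species, site) %>%
  mutate(nYear = n_distinct(Year)) %>%
  ungroup()
```

The key point in both cases is that group_by() comes before the counting step, so n_distinct() is evaluated once per group rather than over the whole table.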

R and dplyr: case_when throws 'incorrect length error' despite not being asked to evaluate group

I have a panel dataset where some groups have observations starting at an earlier year than others and would like to calculate the change in value from the earliest possible time period. I expected that by using case_when within mutate, R would not try to evaluate the code for groups where the earlier dates do not exist, but this does not seem to be the case. I have included a reprex below.
library("dplyr")
dataset <- data.frame(names=c("a","a","a","b","b"),
values=c(2,3,4,2,3),
dates=c("2010","2011","2012","2011","2012"))
dataset_calc <- dataset %>%
group_by(names) %>%
mutate(new_val = case_when(names=="a" ~ values-values[dates=="2010"],
TRUE ~ values-values[dates=="2011"]))
Is there a better solution for what I would like to do?
The resulting dataframe should be something like:
names values dates new_val
1 a 2 2010 0
2 a 3 2011 1
3 a 4 2012 2
4 b 2 2011 0
5 b 3 2012 1
If you arrange the data by date, then you can just subtract the first value within each group:
dataset %>%
  group_by(names) %>%
  arrange(dates) %>%
  mutate(new_val = values - first(values))
If you wanted to hard-code different reference years, you would apply case_when to the reference year rather than to the values. For example:
dataset %>%
  group_by(names) %>%
  mutate(
    ref_year = case_when(names == "a" ~ "2010", TRUE ~ "2011"),
    new_val = values - values[dates == ref_year],
    ref_year = NULL
  )
(you don't need to use the temporary ref_year variable, I just added it here for clarity of how the function was working)
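As for why the original attempt fails: to my understanding, case_when evaluates every right-hand side for each group, whether or not its condition matches. In group "b" there is no 2010 row, so values[dates=="2010"] has length zero and cannot be recycled to the group size. The sketch below checks those lengths in base R and then shows a variant that takes the earliest date per group without re-sorting the rows. It assumes each group has exactly one row at its earliest date and that the dates sort correctly as strings (true for four-digit years).

```r
library(dplyr)

dataset <- data.frame(names  = c("a", "a", "a", "b", "b"),
                      values = c(2, 3, 4, 2, 3),
                      dates  = c("2010", "2011", "2012", "2011", "2012"))

# How many 2010 values does each group have? Group "b" has none.
n_2010 <- sapply(split(dataset, dataset$names),
                 function(g) length(g$values[g$dates == "2010"]))

# Subtract the value at each group's earliest date; row order is kept.
dataset_calc <- dataset %>%
  group_by(names) %>%
  mutate(new_val = values - values[dates == min(dates)]) %>%
  ungroup()
```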

R: how can I calculate the percentages a variable takes on a certain value by group?

I'm trying to get R to report the share of observations in which a certain variable takes on a specific value within a group.
For example: let's consider a dataset which consists of groups 1, 2 and 3. Now I would like to know the percentage of observations in which Var1 takes on the value 500 in each of groups 1, 2 and 3, and incorporate this as a new variable.
Is there a convenient way to get to a solution?
So it should look something like this:
df
Group Var1 Var1_perc
1 0 50
1 400 50
1 500 50
1 500 50
and so on for the other groups
I would use the tidyverse to do this.
To calculate how often a variable takes on a certain value in a group:
library(tidyverse)
df %>%
  group_by(Group, Var1) %>%
  summarise(count = n())
To calculate the percentage in a group:
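df %>%
  left_join(df %>%
              group_by(Group) %>%
              summarise(n = n()), by = "Group") %>%
  group_by(Group, Var1) %>%
  # n (the group total) is constant within a group, so take its first value
  summarise(percentage = 100 * n() / first(n), .groups = "drop")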
The left_join is only there to attach each group's total row count to the table. I couldn't think of a better way right now.
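A shorter route that should give the column the question asks for: the mean of a logical vector is the share of TRUE values, so mean(Var1 == 500) inside a grouped mutate gives the per-group share directly. This is a sketch on made-up data (the values below are invented for illustration; only the column names Group, Var1 and Var1_perc come from the question).

```r
library(dplyr)

# Hypothetical example data: two groups of four rows each.
df <- data.frame(Group = c(1, 1, 1, 1, 2, 2, 2, 2),
                 Var1  = c(0, 400, 500, 500, 500, 100, 100, 100))

result <- df %>%
  group_by(Group) %>%
  # Share of rows in the group where Var1 equals 500, as a percentage.
  mutate(Var1_perc = 100 * mean(Var1 == 500)) %>%
  ungroup()

result
```

Because this uses mutate rather than summarise, every original row is kept and the percentage is repeated within each group, as in the desired output.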

Adding a column with consecutive numbers in R

I apologize if this question is abhorrently simple, but I'm looking for a way to just add a column of consecutive integers to a data frame (if my data frame has 200 observations, for example, starting with 1 for the first observation, and ending with 200 on the last one).
How can I do this?
For a dataframe (df) you could use
df$observation <- 1:nrow(df)
but if you have a matrix you would rather want to use
ma <- cbind(ma, "observation"=1:nrow(ma))
as using the first option will transform your data into a list.
Source: http://r.789695.n4.nabble.com/adding-column-of-ordered-numbers-to-matrix-td2250454.html
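A small caveat on 1:nrow(df): if the data frame happens to have zero rows, 1:0 evaluates to c(1, 0) and the assignment fails. seq_len(nrow(df)) avoids that edge case. A minimal sketch, with made-up data:

```r
df <- data.frame(x = c(10, 20, 30))
df$observation <- seq_len(nrow(df))  # 1, 2, 3

# With an empty data frame, seq_len() yields a zero-length vector,
# so the assignment still works.
empty <- data.frame(x = numeric(0))
empty$observation <- seq_len(nrow(empty))
```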
Or use dplyr.
library(dplyr)
df %>% mutate(observation = 1:n())
You might want it to be the first column of df.
df %>% mutate(observation = 1:n()) %>% select(observation, everything())
If you are using the tidyverse, tibble::rowid_to_column is probably what you need.
library(tidyverse)
dat <- tibble(x=c(10, 20, 30),
y=c('alpha', 'beta', 'gamma'))
dat %>% rowid_to_column(var='observation')
# A tibble: 3 x 3
observation x y
<int> <dbl> <chr>
1 1 10 alpha
2 2 20 beta
3 3 30 gamma
