Balance observations in data.frame by factor level [duplicate] - r

This question already has answers here:
Take random sample by group
(9 answers)
Closed 3 days ago.
I would like to subsample a dataframe that has an imbalanced number of observations by factor level.
The output I want is another dataframe built from data from the original one where the number of observations by factor level is similar across factor levels (doesn't need to be exactly the same number for each level, but roughly similar).
I am not sure if this is called "thinning" the data, or "undersampling" the data.
Consider for instance this dataframe:
data <- data.frame(id = 1:1000,
class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))
How can I slice rows so that I extract ~200 rows, 50 for each class A, B, C and D?
I can do this manually, but I would like to find a method that I can use with larger datasets and based on a factor with more levels.
I would also be thankful for advice on the name of what I need (thinning? undersampling? stratified sampling?). Thanks!

You can use slice_sample in dplyr:
library(dplyr)
data %>%
group_by(class) %>%
slice_sample(n = 50)
In dplyr 1.1.0 and above:
slice_sample(data, n = 50, by = class)
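A minimal sketch of the same idea with a fixed seed and a check of the resulting group sizes, assuming dplyr 1.0.0 or later is installed. The target of 100 rows per class is only an illustration: according to the dplyr documentation, slice_sample silently truncates n to the group size when sampling without replacement, so classes smaller than the target should be kept whole.

```r
library(dplyr)

set.seed(42)  # make the random draw reproducible

data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

# Ask for up to 100 rows per class; C and D only have 50 rows each
balanced <- data %>%
  group_by(class) %>%
  slice_sample(n = 100) %>%
  ungroup()

# Inspect how many rows each class contributed
count(balanced, class)
```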

A base R option: split the data by group, use lapply to sample 50 rows from each piece, and then combine the pieces with rbind:
df = lapply(split(data, data$class), function(x) x[sample(nrow(x), 50),])
df_sampled = do.call(rbind, df)
# Check number of observations
library(dplyr)
df_sampled %>%
group_by(class) %>%
summarise(n = n())
#> # A tibble: 4 × 2
#> class n
#> <chr> <int>
#> 1 A 50
#> 2 B 50
#> 3 C 50
#> 4 D 50
Created on 2023-02-17 with reprex v2.0.2
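Note that sample(nrow(x), 50) fails for any group with fewer than 50 rows. A possible variant, sketched here with a hypothetical helper named sample_by_group (not part of any package), caps the sample size at the group size:

```r
set.seed(123)  # make the random draw reproducible

data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

# Hypothetical helper: draw up to n_per_group rows from each level of group_col
sample_by_group <- function(df, group_col, n_per_group) {
  pieces <- lapply(split(df, df[[group_col]]), function(x) {
    k <- min(n_per_group, nrow(x))          # never ask for more rows than exist
    x[sample(nrow(x), k), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL                     # drop the "A.17"-style row names
  out
}

balanced <- sample_by_group(data, "class", 50)
table(balanced$class)

# A larger target keeps the small classes whole instead of raising an error
larger <- sample_by_group(data, "class", 100)
table(larger$class)
```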

Related

Obtaining a summary of grouped counts in R

This should be simple but I have been stumped by it: I am trying to figure out an efficient method for obtaining summary stats of a grouped count. Here's a toy example:
df = tibble(pid = c(1,2,2,3,3,3,4,4,4,4), y = rnorm(10))
df %>% group_by(pid) %>% count(pid)
which outputs the expected
# A tibble: 4 × 2
# Groups: pid [4]
pid n
<dbl> <int>
1 1 1
2 2 2
3 3 3
4 4 4
However, what if I want a summary of those grouped counts? Attempting to mutate a new variable or use add_count hasn't worked, I assume because the variables are different sizes. For instance:
df %>% group_by(pid) %>% count(pid) %>% mutate(count = summary(n))
generates an error. What would be a simple way to generate summary statistics of the grouped counts (e.g., min, max, mean, etc.)?
mutate is for adding columns to a data frame - you don't want that here, you need to pull the column out of the data frame.
df %>%
count(pid) %>%
pull(n) %>%
summary()
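For comparison, a base R sketch of the same idea, assuming only the pid column matters for the counts: table() produces the per-group counts and summary() is applied to them as a plain vector.

```r
df <- data.frame(pid = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4))

# Number of rows per pid
counts <- table(df$pid)

# Summary statistics (min, quartiles, mean, max) of those counts
count_summary <- summary(as.vector(counts))
count_summary
```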

how to determine the number of unique values based on multiple criteria dplyr

I've got a df that looks like:
df(site=c(A,B,C,D,E), species=c(1,2,3,4), Year=c(1980:2010).
I would like to calculate the number of different years in which each species appears at each site, creating a new column called nYear. I've tried filtering by group and using mutate combined with n_distinct, but it is not quite working.
Here is part of the code I have been using:
Df1 <- Df %>%
filter(Year>1985)%>%
mutate(nYear = n_distinct(Year[Year %in% site]))%>%
group_by(Species,Site, Year) %>%
arrange(Species, .by_group=TRUE)
ungroup()
The approach is good, a few things to correct.
First, let's make some reproducible data (your code gave errors).
df <- data.frame("site"=LETTERS[1:5], "species"=1:5, "Year"=1981:2010)
You should have used summarise instead of mutate when you're looking to summarise values across groups. It will give you a shortened tibble as an output, with only the groups and the summary figures present (fewer columns and rows).
mutate on the other hand aims to modify an existing tibble, keeping all rows and columns by default.
The order of your functions in the chains also needs to change.
df %>%
filter(Year>1985) %>%
group_by(species,site) %>%
summarise(nYear = length(unique(Year))) %>% # instead of mutate
arrange(species, .by_group=TRUE) %>%
ungroup()
First, group_by(species,site), not year, then summarise and arrange.
# A tibble: 5 × 3
species site nYear
<int> <chr> <int>
1 1 A 5
2 2 B 5
3 3 C 5
4 4 D 5
5 5 E 5
You can use distinct() on the filtered frame to keep one row per Site, Species and Year combination, and then count by your groups of interest:
Df %>%
filter(Year > 1985) %>%
distinct(Site, Species, Year) %>%
count(Site, Species, name = "nYear")
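The same count can also be sketched in base R, using the reproducible data frame from the first answer (lower-case column names site, species and Year). This is only an illustration of the technique, not a replacement for the dplyr answers.

```r
df <- data.frame(site = LETTERS[1:5], species = 1:5, Year = 1981:2010)

# Keep years after 1985, then count distinct years per species/site pair
recent <- df[df$Year > 1985, ]
n_years <- aggregate(Year ~ species + site, data = recent,
                     FUN = function(y) length(unique(y)))

# aggregate() keeps the original column name, so rename it
names(n_years)[names(n_years) == "Year"] <- "nYear"
n_years
```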

How do I merge rows with the same name in R? [duplicate]

This question already has answers here:
How to get summary statistics by group
(14 answers)
Mean per group in a data.frame [duplicate]
(8 answers)
Closed 5 years ago.
I'm trying to merge rows of a data set by using the mean operator.
Basically, I want to convert data set 1 into data set 2 (see below)
1. ID  MEASUREMENT      2. ID  MEASURE
   A   20                  A   22.5
   B   30                  B   30
   A   25                  .
   .                       .
   .                       .
How can I do this in R?
Note that in contrast to the example I have given here, my data set is really large and I can't look through the data set, group rows according to their id's then find colMeans.
My thoughts are to order the dataset, separate the measures for each id, then find each mean and regroup the data. However, this will be very time consuming.
I would really appreciate if someone can assist me with a direct code or even a for loop.
This code should be able to do that for you.
library(data.table)
setDT(dat)
dat = dat[ , .(MEASURE = mean(MEASUREMENT)), by = .(ID)]
Just to be a little more complete, I'll throw in an example and a way to do this in base R.
Data:
dat = data.frame(ID = c("A","A","A","B","B","C"), MEASUREMENT = c(1:3,61,13,7))
With only base R functions:
aggregate(MEASUREMENT ~ ID, data = dat, FUN = mean)
ID MEASUREMENT
1 A 2
2 B 37
3 C 7
With data.table:
library(data.table)
setDT(dat)
dat = dat[ , .(MEASURE = mean(MEASUREMENT)), by = .(ID)]
> dat
ID MEASURE
1: A 2
2: B 37
3: C 7
You can also do this easily in dplyr, assuming your data is in df
library(dplyr)
df <- df %>%
group_by(ID) %>%
summarize(MEASURE = mean(MEASUREMENT))
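As a quick check against the numbers in the question, here is a small base R sketch using only the three rows the question shows (A 20, B 30, A 25). The rename to MEASURE is just to match the requested output.

```r
dat <- data.frame(ID = c("A", "B", "A"), MEASUREMENT = c(20, 30, 25))

# One row per ID, holding the mean of MEASUREMENT
res <- aggregate(MEASUREMENT ~ ID, data = dat, FUN = mean)
names(res)[names(res) == "MEASUREMENT"] <- "MEASURE"
res
```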

How to summarize a data frame into a new one that tells means of separate levels? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 7 years ago.
I have a data.frame that looks somewhat like this.
k <- data.frame(id = c(1,2,2,1,2,1,2,2,1,2), act = c('a','b','d','c','d','c','a','b','a','b'), var1 = 25:34, var2= 74:83)
I have to group the data into separate levels by the first 2 columns and write the mean of the next 2 columns (var1 and var2). It should look like this
id act varmean1 varmean2
1 1 a
2 1 c
3 2 a
4 2 b
5 2 d
The values of respective means are filled in varmean1 and varmean2.
My actual dataframe has 88 columns where I have to group the data into separate levels by the first 2 columns and find the respective means of the remaining. Please help me figure this out as soon as possible. Please try to use 'dplyr' package for the solution if possible. Thanks.
You have several options:
base R:
aggregate(. ~ id + act, k, mean)
or
aggregate(cbind(var1, var2) ~ id + act, k, mean)
The first option aggregates all the columns by id and act; the second option aggregates only the columns you specify. In this case both give the same result, but the difference matters when you have more columns and only want to aggregate some of them.
dplyr:
library(dplyr)
k %>%
group_by(id, act) %>%
summarise(across(everything(), mean)) # dplyr >= 1.0.0; replaces the deprecated summarise_each(funs(mean))
If you want to specify the columns for which to calculate the mean, you can name them explicitly in summarise:
k %>%
group_by(id, act) %>%
summarise(var1mean = mean(var1), var2mean = mean(var2))
data.table:
library(data.table)
setDT(k)[, lapply(.SD, mean), by = .(id, act)]
If you want to specify the columns for which to calculate the mean, you can add .SDcols like:
setDT(k)[, lapply(.SD, mean), by = .(id, act), .SDcols=c("var1", "var2")]
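For a wide data frame such as the 88-column one mentioned in the question, the value columns can be picked up by name rather than typed out. A base R sketch on the example data, where the "mean" suffix is an arbitrary naming choice:

```r
k <- data.frame(id = c(1, 2, 2, 1, 2, 1, 2, 2, 1, 2),
                act = c("a", "b", "d", "c", "d", "c", "a", "b", "a", "b"),
                var1 = 25:34, var2 = 74:83)

# Every column except the two grouping columns is treated as a value column
group_cols <- c("id", "act")
value_cols <- setdiff(names(k), group_cols)

# Mean of each value column within each id/act combination
means <- aggregate(k[value_cols], by = k[group_cols], FUN = mean)

# Mark the summarised columns with a suffix, e.g. var1 -> var1mean
names(means)[names(means) %in% value_cols] <- paste0(value_cols, "mean")
means
```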

Adding a column with consecutive numbers in R

I apologize if this question is abhorrently simple, but I'm looking for a way to just add a column of consecutive integers to a data frame (if my data frame has 200 observations, for example, starting with 1 for the first observation, and ending with 200 on the last one).
How can I do this?
For a data frame (df) you could use
df$observation <- seq_len(nrow(df))
but if you have a matrix, use cbind instead:
ma <- cbind(ma, "observation" = seq_len(nrow(ma)))
because assigning with $ as in the first option coerces the matrix to a list.
Source: http://r.789695.n4.nabble.com/adding-column-of-ordered-numbers-to-matrix-td2250454.html
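A short sketch of both points, using a made-up 3 x 2 matrix and an empty data frame: cbind keeps the matrix a matrix, and seq_len() is a safer way than 1:nrow() to build the index when the object might have zero rows.

```r
# cbind() adds the column and the result is still a matrix
ma <- matrix(1:6, nrow = 3)
ma_numbered <- cbind(ma, observation = seq_len(nrow(ma)))
is.matrix(ma_numbered)

# With zero rows, 1:nrow() counts down from 1 to 0, while seq_len() is empty
empty <- data.frame(x = numeric(0))
length(1:nrow(empty))        # two elements: 1 and 0
length(seq_len(nrow(empty))) # zero elements
```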
Or use dplyr.
library(dplyr)
df %>% mutate(observation = 1:n())
You might want it to be the first column of df.
df %>% mutate(observation = 1:n()) %>% select(observation, everything())
If you are using the tidyverse, tibble::rowid_to_column is probably what you need.
library(tidyverse)
dat <- tibble(x=c(10, 20, 30),
y=c('alpha', 'beta', 'gamma'))
dat %>% rowid_to_column(var='observation')
# A tibble: 3 x 3
observation x y
<int> <dbl> <chr>
1 1 10 alpha
2 2 20 beta
3 3 30 gamma
