Sampling where the number of samples per cluster varies in R

I have a dataframe,
df <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
                 score = c(1, 3, 5, 7, 3, 4, 7, 1, 2, 6, 3),
                 cluster = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3))
I also have a set of cluster IDs and the number of samples I'd like from each cluster,
sample_sizes <- data.frame(cluster = c(1, 2, 3), samples = c(1, 3, 2))
I would like a dataframe of rows sampled from df, where the number of rows drawn from each cluster is given by sample_sizes.
For instance, the following table would be a potential result:
id score cluster
 2     3       1
 3     5       2
 5     3       2
 6     4       2
 9     2       3
11     3       3
I have tried the following with dplyr:
df2 <- merge(df, sample_sizes)
df3 <- df2 %>%
  group_by(cluster) %>%
  sample_n(samples)
but I receive an error.
Is there a best method for doing this? A solution that could scale with larger numbers of clusters and samples would be ideal.
Thank you in advance!

We may use purrr::map2_df() along with split():
library(purrr)
map2_df(split(df, df$cluster), sample_sizes$samples, sample_n)
#   id score cluster
# 1  1     1       1
# 2  4     7       2
# 3  5     3       2
# 4  3     5       2
# 5  7     7       3
# 6  9     2       3
split(df, df$cluster) gives a list of data frames, one per cluster; map2_df() then applies sample_n() to each cluster with the matching sample size, just as you intended, and binds the resulting data frames into one.
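If you prefer to stay entirely within dplyr, here is a sketch of the same idea (this variant is mine, not from the original answers): join the requested sizes onto df, then keep a random subset of row numbers within each cluster.
library(dplyr)
df %>%
  left_join(sample_sizes, by = "cluster") %>%
  group_by(cluster) %>%
  # n() is the group size, first(samples) that cluster's requested size
  filter(row_number() %in% sample(n(), first(samples))) %>%
  ungroup() %>%
  select(-samples)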

Here is a way using tidyr::nest() and purrr::map2():
library(tidyverse)
df %>%
  group_by(cluster) %>%
  nest() %>%
  left_join(sample_sizes) %>%
  mutate(samp = map2(data, samples, sample_n)) %>%
  select(cluster, samples, samp) %>%
  unnest()
Joining, by = "cluster"
# A tibble: 6 x 4
  cluster samples    id score
    <dbl>   <dbl> <dbl> <dbl>
1       1       1     1     1
2       2       3     5     3
3       2       3     6     4
4       2       3     4     7
5       3       2     8     1
6       3       2    10     6
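A version note: since tidyr 1.0.0, unnest() warns unless the list-column is named explicitly, so the last step is better written as:
  unnest(cols = c(samp))
Passing by = "cluster" to left_join() likewise silences the joining message.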

creating randomisation order that is equal across both columns and rows

I'm trying to randomize the order of receipt of 6 drinks (each on a different day) for 40 participants. I want to ensure that every participant gets each drink once, and that every drink occurs roughly the same number of times across participants on each day.
I create the data, with participants in columns and days in rows.
library(ggplot2)
set.seed(123)
random_order <- as.data.frame(replicate(40, sample(1:6, 6, replace = FALSE)))
random_order$trial <- c(1:6)
random_order
Then I check the number of occurrences of each drink within each row / trial, which shows that the frequency of different drinks within trials is not uniform:
tidyr::pivot_longer(random_order, cols = c(1:40),
                    names_to = "participant", values_to = "drink_order") |>
  dplyr::group_by(trial, drink_order) |>
  dplyr::summarise(count = dplyr::n())
# # A tibble: 36 × 3
# # Groups:   trial [6]
#    trial drink_order count
#    <int>       <int> <int>
#  1     1           1     9
#  2     1           2     8
#  3     1           3     8
#  4     1           4     4
#  5     1           5     5
#  6     1           6     6
#  7     2           1     7
#  8     2           2     4
#  9     2           3    10
# 10     2           4     7
# # … with 26 more rows
and look at it with a density plot:
tidyr::pivot_longer(random_order, cols = c(1:40),
                    names_to = "participant", values_to = "drink_order") |>
  dplyr::group_by(trial, drink_order) |>
  dplyr::summarise(count = dplyr::n()) |>
  ggplot(aes(count)) +
  geom_density()
Basically, I want a very narrow density curve. How can I generate the data so that the count column above has a small range?
Thanks!
You’re looking for a variation on a Latin square, which is a set of ordered elements such that each element occurs exactly once per column and once per row. You can generate random Latin squares using agricolae::design.lsd(). In your case, instead of once per row, you want each element to occur the same number of times per row, which you can do by binding together multiple Latin squares.
library(agricolae)
set.seed(123)
# to get 40 columns, first get 7 Latin squares
# (7 squares x 6 columns per square = 42 columns)
orders <- replicate(
  7,
  design.lsd(1:6)$sketch,
  simplify = FALSE
)
# then column-bind and subset to 40 columns
random_order <- data.frame(do.call(cbind, orders))[, 1:40]
random_order$trial <- c(1:6)
Using the code from your question, we can see that all trials include 6 or 7 of each drink:
# A tibble: 36 × 3
# Groups:   trial [6]
   trial drink_order count
   <int> <chr>       <int>
 1     1 1               7
 2     1 2               7
 3     1 3               7
 4     1 4               6
 5     1 5               6
 6     1 6               7
 7     2 1               7
 8     2 2               6
 9     2 3               6
10     2 4               7
# … with 26 more rows
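If you would rather not depend on agricolae, here is a base-R sketch of the same construction (the latin_square() helper is mine, not from any package): build a cyclic Latin square, then shuffle its rows and columns for randomness.
set.seed(123)
# k x k cyclic Latin square: entry (i, j) = ((i + j - 2) mod k) + 1,
# so each value appears exactly once per row and per column;
# permuting rows and columns preserves that property
latin_square <- function(k) {
  m <- outer(seq_len(k), seq_len(k), function(i, j) ((i + j - 2) %% k) + 1)
  m[sample(k), sample(k)]
}
# 7 squares x 6 columns per square = 42 columns; keep the first 40
orders <- do.call(cbind, replicate(7, latin_square(6), simplify = FALSE))
random_order <- data.frame(orders[, 1:40])
random_order$trial <- 1:6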

Stepwise column sum in data frame based on another column in R

I have a data frame like this:
Team  x
A     3
B     5
A     2
A     3
B     1
B     6
Looking for output like this (just an additional column):
Team  x  avg(x)
A     3  0
B     5  0
A     2  3
A     3  2.5
B     1  5
B     6  3
avg(x) is the average of all previous instances of x where Team is the same. I have the following R code, which gets the overall average, but I'm looking for the "step-wise" average.
new_df <- df %>% group_by(Team) %>% summarise(avg_x = mean(x))
Is there a way to vectorize this while only evaluating the previous rows on each "iteration"?
You want the cummean() function from dplyr, combined with lag() (plus tidyr::replace_na() to fill in the leading NA):
library(dplyr)
library(tidyr)
df %>% group_by(Team) %>% mutate(avg_x = replace_na(lag(cummean(x)), 0))
Producing the following:
# A tibble: 6 × 3
# Groups:   Team [2]
  Team      x avg_x
  <chr> <dbl> <dbl>
1 A         3   0
2 B         5   0
3 A         2   3
4 A         3   2.5
5 B         1   5
6 B         6   3
As required.
Edit 1:
As @Ritchie Sacramento pointed out, the following is cleaner and clearer:
df %>% group_by(Team) %>% mutate(avg_x = lag(cummean(x), default = 0))
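For comparison, a base-R sketch of the same logic with no packages at all (assuming df has columns Team and x as above): compute each team's cumulative mean, then shift it down one row with a leading 0.
df$avg_x <- ave(df$x, df$Team,
                FUN = function(v) c(0, head(cumsum(v) / seq_along(v), -1)))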

How to use dplyr mutate function in R to calculate a running balance?

In the MWE code at the bottom, I'm trying to generate a running balance for each unique id when running from one row to the next. For example, when running the below code the output should be:
data2 <-
id plusA plusB minusC running_balance   [desired calculation for running balance]
 1     3     5     10              -2   3 + 5 - 10 = -2
 2     4     5      9               0   4 + 5 - 9 = 0
 3     8     5      8               5   8 + 5 - 8 = 5
 3     1     4      7               3   id doesn't change, so 5 from above + (1 + 4 - 7) = 3
 3     2     5      6               4   id doesn't change, so 3 from above + (2 + 5 - 6) = 4
 5     3     6      5               4   3 + 6 - 5 = 4
When id is the same from one row to the next, the MWE below adds the prior row's plusA amount rather than the prior row's running_balance amount. I've tried changing it to some form of lag(running_balance...) without luck so far.
I'm trying to minimize the number of packages I use. For example, I understand the purrr package offers an accumulate() function, but I'd rather stick to dplyr for now. Is there a simple way to do this using dplyr mutate()? I also tried fiddling with cumsum() (a base R function, though it works inside mutate()), which seems like it should apply here, but I'm unsure how to string several calls together.
MWE code:
data <- data.frame(id = c(1, 2, 3, 3, 3, 5),
                   plusA = c(3, 4, 8, 1, 2, 3),
                   plusB = c(5, 5, 5, 4, 5, 6),
                   minusC = c(10, 9, 8, 7, 6, 5))
library(dplyr)
data2 <- subset(
  data %>%
    mutate(extra = case_when(id == lag(id) ~ lag(plusA), TRUE ~ 0)) %>%
    mutate(running_balance = plusA + plusB - minusC + extra),
  select = -c(extra)
)
Using dplyr:
data %>%
  mutate(running_balance = plusA + plusB - minusC) %>%
  group_by(id) %>%
  mutate(running_balance = cumsum(running_balance)) %>%
  ungroup()
Output:
# A tibble: 6 x 5
     id plusA plusB minusC running_balance
  <dbl> <dbl> <dbl>  <dbl>           <dbl>
1     1     3     5     10              -2
2     2     4     5      9               0
3     3     8     5      8               5
4     3     1     4      7               3
5     3     2     5      6               4
6     5     3     6      5               4
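The two mutate() calls can also be collapsed into one, since cumsum() runs within each group after group_by():
data %>%
  group_by(id) %>%
  mutate(running_balance = cumsum(plusA + plusB - minusC)) %>%
  ungroup()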

R count() using dynamically generated list of variables/columns

If I have a tibble called observations with the following variables/columns:
category_1_red_length
category_1_red_width
category_1_red_depth
category_1_blue_length
category_1_blue_width
category_1_blue_depth
category_1_green_length
category_1_green_width
category_1_green_depth
category_2_red_length
category_2_red_width
category_2_red_depth
category_2_blue_length
category_2_blue_width
category_2_blue_depth
category_2_green_length
category_2_green_width
category_2_green_depth
Plus a load more. Is there a way to dynamically generate the following count()?
count(observations,
category_1_red_length,
category_1_red_width,
category_1_red_depth,
category_1_blue_length,
category_1_blue_width,
category_1_blue_depth,
category_1_green_length,
category_1_green_width,
category_1_green_depth,
category_2_red_length,
category_2_red_width,
category_2_red_depth,
category_2_blue_length,
category_2_blue_width,
category_2_blue_depth,
category_2_green_length,
category_2_green_width,
category_2_green_depth,
sort=TRUE)
I can create the list of columns I want to count with:
columns_to_count = list()
column_prefix = 'category'
aspects = c('red', 'blue', 'green')
dimensions = c('length', 'width', 'depth')
for (x in 1:2) {
  for (aspect in aspects) {
    for (dimension in dimensions) {
      columns_to_count = append(columns_to_count,
                                paste(column_prefix, x, aspect, dimension, sep = '_'))
    }
  }
}
But then how do I pass my list of columns in columns_to_count to the count() function?
In my actual data set there are about 170 such columns that I want to count, so writing the list of columns out by hand doesn't seem sensible.
I'm struggling to think of the name for what I'm trying to do, so I haven't been able to find useful search results.
Thanks.
You can use non-standard evaluation with rlang::syms() and the splice operator !!!. For example, using the mtcars dataset:
library(dplyr)
library(rlang)
cols <- c('am', 'cyl')
mtcars %>% count(!!!syms(cols), sort = TRUE)
# am cyl n
#1 0 8 12
#2 1 4 8
#3 0 6 4
#4 0 4 3
#5 1 6 3
#6 1 8 2
This is same as doing
mtcars %>% count(am, cyl, sort = TRUE)
# am cyl n
#1 0 8 12
#2 1 4 8
#3 0 6 4
#4 0 4 3
#5 1 6 3
#6 1 8 2
You don't need to type the names into cols one by one. If the column names share a pattern, you can build the vector with a regular expression, or you can select the columns by position.
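For instance, a sketch for the observations tibble, assuming every relevant column name starts with the category_ prefix:
library(dplyr)
library(rlang)
# build the vector of column names by pattern instead of nested loops
cols <- grep("^category_", names(observations), value = TRUE)
observations %>% count(!!!syms(cols), sort = TRUE)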
You can use .dots to pass strings as variables (note that .dots belongs to dplyr's older standard-evaluation interface):
count(observations, .dots=columns_to_count, sort=TRUE)
r$> d
V1 V2
1 1 4
2 2 5
3 3 6
r$> count(d, .dots=list('V1', 'V2'))
# A tibble: 3 x 3
V1 V2 n
<int> <int> <int>
1 1 4 1
2 2 5 1
3 3 6 1
r$> count(d, V1, V2)
# A tibble: 3 x 3
V1 V2 n
<int> <int> <int>
1 1 4 1
2 2 5 1
3 3 6 1
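In recent dplyr (1.0.0 and later), where .dots may no longer work, the documented replacement is across() with all_of(), applied to the same d:
cols <- c("V1", "V2")
count(d, across(all_of(cols)), sort = TRUE)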

Filter for first 5 observations per group in tidyverse

I have precipitation data of several different measurement locations and would like to filter for only the first n observations per location and per group of precipitation intensity using tidyverse functions.
So far, I've grouped the data by location and by precipitation intensity.
Here is a minimal example (in my real data there are several observations of each rainfall intensity per location):
df <- data.frame(location = c(rep(1, 7), rep(2, 7)),
                 rain = c(1:7, 1:7))
   location rain
1         1    1
2         1    2
3         1    3
4         1    4
5         1    5
6         1    6
7         1    7
8         2    1
9         2    2
10        2    3
11        2    4
12        2    5
13        2    6
14        2    7
I thought that it should be quite easy using group_by() and filter(), but so far, I haven't found an expression that would return only the first n observations per rain group per location.
df %>% group_by(rain, location) %>% filter(???)
You can do:
df %>%
  group_by(location) %>%
  slice(1:5)

   location  rain
      <dbl> <int>
 1        1     1
 2        1     2
 3        1     3
 4        1     4
 5        1     5
 6        2     1
 7        2     2
 8        2     3
 9        2     4
10        2     5
library(dplyr)
df %>%
  group_by(location) %>%
  filter(row_number() %in% 1:5)
Non-dplyr solutions (that also rearrange the rows)
# Base R
df[unlist(lapply(split(row.names(df), df$location), "[", 1:5)), ]
# data.table
library(data.table)
setDT(df)[, .SD[1:5], by = location]
An option in data.table
library(data.table)
setDT(df)[, .SD[seq_len(.N) <=5], location]
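Note that with the full data (several observations per rainfall intensity per location), you would group by both variables before slicing; a sketch assuming dplyr 1.0.0 or later, where slice_head() is the current spelling:
library(dplyr)
df %>%
  group_by(location, rain) %>%
  slice_head(n = 5) %>%
  ungroup()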
