Grouped sampling without duplication - r

I'm struggling to find a solution for the following problem. From a dataframe with 384 rows and 11 columns, I need to randomly draw 24 samples, each containing 16 items.
These 16 items correspond to the total number of factor-level combinations, all of which must be covered within each sample.
There are 4 grouping factors in the process:
Type, Valence, LT, Gender. Each of them has 2 levels, which gives 2^4 = 16 combinations. The dataframe looks essentially like this:
df2 <- data.frame(VNr=c(rep(1:8, 48)),
PId=c(rep(1:48, each = 8)),
Gender=rep(c("M", "F"), each=192),
Type=rep(c("E", "S"), each=4, times=48),
Valence=rep(c("P", "N"), each = 2, times=96),
LT=rep(c("L", "T"), each=1, times=192))
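My former approach used dplyr (and purrr for map_dfr) to do the job:
library(dplyr)
library(purrr)
N <- 24
df3 <- map_dfr(seq_len(N), ~ df2 %>%
  group_by(Type, Valence, LT, Gender) %>%
  slice_sample(n = 1) %>%                 # one row per factor combination
  mutate(sample_no = .x) %>%
  ungroup() %>%
  mutate(resample = duplicated(PId)) %>%  # flags a PId drawn more than once
  rowwise())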
Regarding the grouping, this works flawlessly. However, it produces duplicates, meaning the same PId can appear more than once in a single sample, which is not acceptable.
How can this be avoided?
LMc proposed a workaround in "Sampling by Group in R with no replacement but the final result cannot contain any repeats as well".
Unfortunately, I could not get this to work yet.
Any help on this issue is very much appreciated!
Thanks in advance!
-Marshal

Does this work?
library(tidyverse)
df2 <- tibble(
VNr=c(rep(1:8, 48)),
PId=c(rep(1:48, each = 8)),
Gender=rep(c("M", "F"), each=192),
Type=rep(c("E", "S"), each=4, times=48),
Valence=rep(c("P", "N"), each = 2, times=96),
LT=rep(c("L", "T"), each=1, times=192)
)
df2
df2 %>%
group_by(Type, Valence, LT, Gender) %>%
mutate(n_rows_initial = n()) %>%
slice_sample(n = 16, replace = FALSE) %>%
mutate(n_rows_sampled = n()) %>%
ungroup()
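One possible way to guarantee distinct PIds inside each sample, sketched here under an assumption taken from the example data: every PId has exactly 8 rows, and VNr (1 to 8) identifies the Type/Valence/LT combination within that PId, while Gender is fixed per PId. Under that assumption, a sample can be built by picking 8 different PIds per gender and giving each of them a different VNr. The helper name draw_one is hypothetical, and PIds may still recur across different samples.

```r
library(dplyr)

df2 <- data.frame(VNr = rep(1:8, 48),
                  PId = rep(1:48, each = 8),
                  Gender = rep(c("M", "F"), each = 192),
                  Type = rep(c("E", "S"), each = 4, times = 48),
                  Valence = rep(c("P", "N"), each = 2, times = 96),
                  LT = rep(c("L", "T"), each = 1, times = 192))

# Hypothetical helper: draws one sample of 16 rows with 16 distinct PIds.
# Assumes VNr indexes the 8 Type/Valence/LT combinations within each PId.
draw_one <- function(data) {
  data %>%
    distinct(Gender, PId) %>%
    group_by(Gender) %>%
    slice_sample(n = 8) %>%        # 8 different participants per gender
    mutate(VNr = sample(1:8)) %>%  # each participant gets a different combination
    ungroup() %>%
    inner_join(data, by = c("Gender", "PId", "VNr"))
}

set.seed(123)
N <- 24
df3 <- bind_rows(lapply(seq_len(N), function(i) {
  mutate(draw_one(df2), sample_no = i)
}))
```

Checking duplicated(df3[, c("sample_no", "PId")]) afterwards is a quick way to confirm that no PId repeats within a sample.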

Related

Randomly selecting rows based on all groups in two columns

I have a large dataset with about 167k rows. I would like to take a sample of 2000 rows of it while making sure I am taking rows from all groups in two columns (id & quality) in the data.
This is a snapshot of the data
df <- data.frame(id=c(1,2,3,4,5,1,2),
quality=c("a","b","c","d","z","g","t"))
df %>% glimpse()
Rows: 7
Columns: 2
$ id <dbl> 1, 2, 3, 4, 5, 1, 2
$ quality <chr> "a", "b", "c", "d", "z", "g", "t"
So, I need to ensure that the sampled data has rows from all combinations of these two group columns.
I hope someone can help out.
Thanks!
I think that's what you're looking for.
my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))
my_df <- my_df %>% group_by(id, quality) %>% mutate(Unique = cur_group_id())
my_df$Test <- seq.int(from = 1, to = nrow(my_df), by = 1)
my_a <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_b <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_c <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_d <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_e <- my_df %>% group_by(Unique) %>% sample_n(., 1)
You don't need that many data frames; they are just examples to show that one row is extracted at random for each unique group. The difference shows up in the column named "Test", especially for id = 2 and quality = t in the sample data.
If you want to make sure that each id and quality is represented in your new sample, you will need to group your data by these variables.
What you are looking for is the following,
df %>%
group_by(id,quality) %>%
sample_n(1, replace = TRUE)
You can change the sample size per group, and set replacement as desired.
It gives the following output,
# Groups: id, quality [7]
id quality
<dbl> <chr>
1 1 a
2 1 g
3 2 b
4 2 t
5 3 c
6 4 d
7 5 z
The data that you provided contains only unique groups, so sampling this way returns the same number of rows as your data.
Edit: sample_n is superseded by slice_sample, which I wasn't aware of. You can easily change the script to:
df %>%
group_by(id,quality) %>%
slice_sample(
n = 1
)
You can also sample a proportion of your data.frame by setting prop instead of n,
df %>%
group_by(id,quality) %>%
slice_sample(
prop = 0.25
)
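If the goal is a fixed total (such as the 2000 rows in the question) while still covering every group, one option is to take one row per group first and then fill the remainder at random from the rows not yet chosen. The following is only a sketch: sample_covering and the temporary .row column are hypothetical names, and the result is not a simple random sample, since small groups end up over-represented.

```r
library(dplyr)

# Hypothetical helper: returns `total` rows with at least one row per group.
# The grouping columns are passed through `...`.
sample_covering <- function(data, total, ...) {
  data <- mutate(data, .row = row_number())   # temporary row identifier
  core <- data %>%
    group_by(...) %>%
    slice_sample(n = 1) %>%                   # one guaranteed row per group
    ungroup()
  stopifnot(total >= nrow(core))              # cannot cover all groups otherwise
  rest <- data %>%
    filter(!.row %in% core$.row) %>%          # rows not already chosen
    slice_sample(n = total - nrow(core))      # random fill-up
  bind_rows(core, rest) %>% select(-.row)
}

my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
                    quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))

set.seed(1)
res <- sample_covering(my_df, total = 8, id, quality)
```

For the real data the call would look like sample_covering(df, 2000, id, quality), provided the number of distinct groups does not exceed 2000.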

How do you use forcats' fct_lump_min() function on a factor while keeping another identifying factor?

Let's consider this dummy dataset:
v1<- c("A","B", "C", "D", "E", "F")
v2<- c("Z","Y", "X", "X", "V", "U")
Count<- c(2, 5, 10, 5, 1)
df<- cbind.data.frame(v1, v2, Count)
I want to use fct_lump_min() to lump all v1 levels that have a count of 2 or less into another level named "Unique". If I completely disregard the v2 column, I have functional code like this:
df<-df %>%
mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
count(CombinedDGSequence, wt = Count, name = "Count")
However, doing so removes the corresponding v2 factor column completely. Is there any way I can maintain each v1 factor level's corresponding v2 value in the resulting dataframe after using fct_lump_min?
Thanks guys!
We may need add_count which creates a new column instead of summarizing
library(dplyr)
library(forcats)
df %>%
mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count,
other_level = "Unique")) %>%
add_count(CombinedDGSequence, wt = Count, name = "Count")
You may try this to combine all the v2 values in one string.
library(dplyr)
library(forcats)
df %>%
mutate(v1 = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
group_by(v1) %>%
summarise(v2 = toString(v2),
Count = sum(Count))
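A variation on the above that keeps the v2 values as a list column instead of a pasted string. Two things here are assumptions rather than part of the original post: the Count vector in the question has five values for six rows, so a sixth value is made up to let the data frame build, and min = 3 is used because, as far as I can tell from the forcats documentation, fct_lump_min() only lumps levels whose weighted count is below min, so min = 2 would keep a level with a count of exactly 2.

```r
library(dplyr)
library(forcats)

df <- data.frame(v1 = c("A", "B", "C", "D", "E", "F"),
                 v2 = c("Z", "Y", "X", "X", "V", "U"),
                 Count = c(2, 5, 10, 5, 1, 1))  # sixth count is an assumption

res <- df %>%
  # lump every v1 level whose weighted count is below 3 (that is, 2 or less)
  mutate(v1 = fct_lump_min(v1, min = 3, w = Count, other_level = "Unique")) %>%
  group_by(v1) %>%
  summarise(v2 = list(unique(v2)),  # keep the original v2 values per level
            Count = sum(Count))
```

The list column can be expanded again with tidyr::unnest(res, v2) if one row per v1/v2 pair is preferred.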

Creating duplicate in R

I have the following input data frame with 4 columns and 3 rows.
The time column can take values from 1 up to the value of the maturity column for that customer. I want to create additional observations for each customer until time equals maturity, with the other columns retaining their original values. Please see the links below for the input and expected output.
Input
Output
Here is a dplyr solution inspired by, but not identical to, this post.
library(dplyr)
df <- data.frame(custno = 1:3, time = 1, dept = c("A", "B", "A"))
df %>%
slice(rep(1:n(), each = 5)) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
Edit
After the comments by the OP, the following seems to be better.
First, the data:
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
And the solution.
df %>%
tidyr::uncount(maturity) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
We can also use slice with row_number
library(dplyr)
library(data.table)
df %>%
slice(rep(row_number(), maturity)) %>%
mutate(time = rowid(custno))
data
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
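For completeness, the same expansion can be sketched in base R without extra packages: rep() repeats each row index maturity times, and sequence() builds the 1..maturity counters. The object name out is only illustrative.

```r
df <- data.frame(custno = 1:3, time = 1,
                 dept = c("A", "B", "A"),
                 maturity = c(5, 4, 6))

# Repeat each row `maturity` times
out <- df[rep(seq_len(nrow(df)), times = df$maturity), ]

# Count from 1 to maturity within each customer
out$time <- sequence(df$maturity)

# Drop the duplicated row names left over from indexing
rownames(out) <- NULL
```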

summarise mean of a specific column in dplyr

I would like to summarise a grouped data.frame without knowing the name of the column. But what I know is, that the feature is always at position 3 (column) in this data.frame, is that possible?
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
group = rep(c("A", "B"), 3),
temperature = runif(6, -10, 30),
percipitation = runif(6, 0,5)
)
parameter <- "perc"
df1 <- df %>%
select(date, group, starts_with(parameter)) %>%
group_by(group) %>%
summarise(
avg = mean(percipitation)
)
In this example the code works, but of course only for the parameter 'perc' and not for 'temp' or so.
avg = mean(df[[3]])
or something like this doesn't work. Any suggestions?
You could keep just the grouping variable and the third column using select(group, 3). The function summarise_all() can then be used to calculate the mean.
df %>%
select(group, 3) %>%
group_by(group) %>%
summarise_all(
mean
)
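summarise_all() is superseded in current dplyr releases, so the same idea can also be written with across(). This is a sketch on the question's example data (with a seed added for reproducibility); it still selects the third column by position, so it does not depend on the column name.

```r
library(dplyr)

set.seed(42)
df <- tibble(date = rep(c("2017-01-01", "2017-01-02", "2017-01-03"), 2),
             group = rep(c("A", "B"), 3),
             temperature = runif(6, -10, 30),
             percipitation = runif(6, 0, 5))

res <- df %>%
  select(group, 3) %>%                     # grouping column plus column 3
  group_by(group) %>%
  summarise(across(everything(), mean))    # mean of every non-grouping column
```

The result keeps the original column name (temperature here); chaining rename(avg = 2) afterwards would give it a fixed name regardless of which feature sits in position 3.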

In nested data frame, pass information from one list column to function applied in another

I am working on a report for which I have to export a large number of similar data frames into nice looking tables in Word. My goal is to achieve this in one go, using flextable to generate the tables and purrr / tidyverse to apply all the formatting procedures to all rows in a nested data frame. This is what my data frame looks like:
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
I want to generate separate tables for all groups in column 'school' and started by using the nest() function within tidyr.
list <- df %>%
group_by(school) %>%
nest()
This gives me a nested data frame to which I can apply the functions in flextable using purrr:
list <- list %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade")))
The first mutate generates a new column with flextable objects for each school, and the second mutate applies header labels to the table, based on the column names that are saved in the object.
My goal is now to add another header that is based on the name of the school. This value resides in the list column entitled school, which corresponds row-wise to the tables generated in the list column ftables. How can I pass the name of the school to the add_header function within ftables, using purrr or any other procedure?
Expected output
I have been able to achieve what I want for individual schools with this procedure (identical header cells will later be merged):
school.name <- "A"
ftable.a <- df %>%
filter(school == "A") %>%
select(-school) %>%
flextable() %>%
set_header_labels(students = "No of students",
grade = "Grade") %>%
add_header(students = school.name,
grade = school.name)
ftable.a
The purrr package provides the function map2, which iterates over two inputs in parallel and is what you should use here:
library(flextable)
library(magrittr)
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
byschool <- df %>%
group_by(school) %>%
nest()
byschool <- byschool %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(
ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade"))) %>%
mutate(ftables = purrr::map2(ftables, school, function(ft, h){
add_header(ft, students = h, grade = h)
} ))
