Randomly selecting rows based on all groups in two columns - r

I have a large dataset with about 167k rows. I would like to take a sample of 2000 rows of it while making sure I am taking rows from all groups in two columns (id & quality) in the data.
This is a snapshot of the data
df <- data.frame(id=c(1,2,3,4,5,1,2),
quality=c("a","b","c","d","z","g","t"))
df %>% glimpse()
Rows: 7
Columns: 2
$ id <dbl> 1, 2, 3, 4, 5, 1, 2
$ quality <chr> "a", "b", "c", "d", "z", "g", "t"
So, I need to ensure that the sampled data has rows from all combinations of these two group columns.
I hope someone can help out.
Thanks!

I think that's what you're looking for.
library(dplyr)
my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))
my_df <- my_df %>% group_by(id, quality) %>% mutate(Unique = cur_group_id())
my_df$Test <- seq.int(from = 1, to = nrow(my_df), by = 1)
my_a <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_b <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_c <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_d <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_e <- my_df %>% group_by(Unique) %>% sample_n(., 1)
You don't need that many data frames; they are just examples to show that one row is extracted at random from each unique group. The difference shows up in the column named "Test", especially for id = 2 and quality = "t" in the sample data.

If you want to make sure that each id and quality is represented in your new sample, you will need to group your data by these variables.
What you are looking for is the following,
df %>%
group_by(id,quality) %>%
sample_n(1, replace = TRUE)
You can change the sample size per group and set replacement as desired.
It gives the following output,
# Groups: id, quality [7]
id quality
<dbl> <chr>
1 1 a
2 1 g
3 2 b
4 2 t
5 3 c
6 4 d
7 5 z
The data you provided has only unique groups, so sampling this way returns the same number of rows as your data.
Edit: sample_n is superseded by slice_sample, which I wasn't aware of. You can easily change the script to:
df %>%
group_by(id,quality) %>%
slice_sample(
n = 1
)
You can also sample a proportion of your data.frame by setting prop instead of n,
df %>%
group_by(id,quality) %>%
slice_sample(
prop = 0.25
)
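If the goal is a fixed total (2000 rows in the question) while still covering every id/quality combination, one possible approach is to take one guaranteed row per group and then fill the rest at random from the rows not yet chosen. This is only a sketch: the toy data, the row_id helper column and target_n are assumptions made for illustration, and it presumes there are no more groups than target_n.

```r
library(dplyr)

set.seed(42)  # only to make the illustration reproducible

# Toy data standing in for the real 167k-row table (assumption)
toy <- data.frame(
  id      = sample(1:5, 200, replace = TRUE),
  quality = sample(c("a", "b", "c"), 200, replace = TRUE)
)
toy$row_id <- seq_len(nrow(toy))  # helper key to track which rows were taken

target_n <- 50  # hypothetical total sample size (2000 in the question)

# Step 1: one guaranteed row per (id, quality) combination
guaranteed <- toy %>%
  group_by(id, quality) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Step 2: fill up to target_n from the rows not chosen in step 1
filler <- toy %>%
  anti_join(guaranteed, by = "row_id") %>%
  slice_sample(n = target_n - nrow(guaranteed))

sampled <- bind_rows(guaranteed, filler)
```

The result has exactly target_n rows, no row is used twice, and every combination present in toy appears at least once.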

Related

Grouped sampling without duplication

I'm struggling to find a solution for the following problem. From a dataframe with 384 rows and 11 columns, 24 samples need to be drawn randomly, each one containing 16 items.
Those 16 items also represent the total amount of combinations between factor levels which must be considered within each sample.
We have 4 grouping factors in the process:
Type, Valence, LT, Gender. All of them comprise 2 factor levels respectively. The dataframe looks essentially like this:
df2 <- data.frame(VNr=c(rep(1:8, 48)),
PId=c(rep(1:48, each = 8)),
Gender=rep(c("M", "F"), each=192),
Type=rep(c("E", "S"), each=4, times=48),
Valence=rep(c("P", "N"), each = 2, times=96),
LT=rep(c("L", "T"), each=1, times=192))
My former approach used dplyr to do the job:
N=24
df3 <- map_dfr(seq_len(N), ~df2 %>%
group_by(Type, Valence, LT, Gender) %>%
slice_sample(n = 1) %>%
mutate(sample_no = .x) %>%
ungroup() %>%
mutate(resample = duplicated(PId)) %>%
rowwise())
Regarding the grouping, this works flawlessly. However, it produces duplicates, meaning the same PId appears more than once in a single sample, which is not acceptable.
How can this be avoided?
LMc proposed a workaround here
Sampling by Group in R with no replacement but the final result cannot contain any repeats as well
Unfortunately, I could not get this to work yet.
Any help on this issue is very much appreciated!
Thanks in advance!
-Marshal
Does this work?
library(tidyverse)
df2 <- tibble(
VNr=c(rep(1:8, 48)),
PId=c(rep(1:48, each = 8)),
Gender=rep(c("M", "F"), each=192),
Type=rep(c("E", "S"), each=4, times=48),
Valence=rep(c("P", "N"), each = 2, times=96),
LT=rep(c("L", "T"), each=1, times=192)
)
df2
df2 %>%
group_by(Type, Valence, LT, Gender) %>%
mutate(n_rows_initial = n()) %>%
slice_sample(n = 16, replace = FALSE) %>%
mutate(n_rows_sampled = n()) %>%
ungroup()
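Another way to look at the duplicate-PId problem, given the example data: every PId belongs to one Gender and has exactly 8 rows, and VNr 1 to 8 identifies the Type/Valence/LT combination within a participant. Under that assumption (it holds for df2 above, but may not hold for the real data), one sample can be built by picking 8 different participants per gender and assigning each of them a different VNr. The function name draw_one is made up for this sketch.

```r
library(dplyr)

df2 <- data.frame(VNr = rep(1:8, 48),
                  PId = rep(1:48, each = 8),
                  Gender = rep(c("M", "F"), each = 192),
                  Type = rep(c("E", "S"), each = 4, times = 48),
                  Valence = rep(c("P", "N"), each = 2, times = 96),
                  LT = rep(c("L", "T"), each = 1, times = 192))

# Draw one sample of 16 rows: all 16 factor combinations, no PId used twice
draw_one <- function(data) {
  data %>%
    distinct(Gender, PId) %>%
    group_by(Gender) %>%
    slice_sample(n = 8) %>%        # 8 different participants per gender
    mutate(VNr = sample(1:8)) %>%  # each participant gets a different item
    ungroup() %>%
    inner_join(data, by = c("Gender", "PId", "VNr"))
}

N <- 24
samples <- bind_rows(lapply(seq_len(N), function(i) {
  mutate(draw_one(df2), sample_no = i)
}))
```

A PId can still show up in more than one of the 24 samples; this only rules out repeats within a single sample.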

Creating duplicate in R

I have the following input data frame with 4 columns and 3 rows.
The time column can take values from 1 up to the value of the maturity column for that customer. I want to create additional observations for each customer until time equals maturity, with the other columns retaining their original values. Please see the links below for the input and expected output.
Input
Output
Here is a dplyr solution inspired by, but not identical to, this post.
library(dplyr)
df <- data.frame(custno = 1:3, time = 1, dept = c("A", "B", "A"))
df %>%
slice(rep(1:n(), each = 5)) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
Edit
After the comments by the OP, the following seems to be better.
First, the data:
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
And the solution.
df %>%
tidyr::uncount(maturity) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
We can also use slice with row_number
library(dplyr)
library(data.table)
df %>%
slice(rep(row_number(), maturity)) %>%
mutate(time = rowid(custno))
data
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
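For comparison, the same expansion can probably be done without any packages. This is a minimal base R sketch on the same example data; the name expanded is just an illustration.

```r
df <- data.frame(custno = 1:3, time = 1,
                 dept = c("A", "B", "A"),
                 maturity = c(5, 4, 6))

# Repeat each row index as many times as its maturity
expanded <- df[rep(seq_len(nrow(df)), times = df$maturity), ]

# sequence() concatenates 1:5, 1:4 and 1:6, i.e. the running time per customer
expanded$time <- sequence(df$maturity)

rownames(expanded) <- NULL  # drop the repeated row names
```

Unlike uncount, this keeps the maturity column in the result.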

dplyr filter columns with value 0 for all rows with unique combinations of other columns

I have a dataframe that looks like this:
df <- tibble(date = c(2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01),
site = c("X", "X", "X", "X", "Z", "Z", "Z", "Z"),
treatment = c("a", "a", "b", "b", "a", "a", "b", "b"),
species = c("vetch", "clover", "vetch", "clover", "vetch", "clover", "vetch", "clover"),
frequency = c(0, 1, 1, 1, 1, 0, 1, 0))
But with lots of dates and sites and treatments. What I want is to filter out observations where all frequencies of that species (across all treatments and dates) are 0 for that site. So in the above I want to remove clover at site "Z" because it did not occur at any treatment or date at that site, but I want to leave clover in site "X" because it did occur in one of the treatments. So I want:
tibble(date = c(2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01),
site = c("X", "X", "X", "X", "Z", "Z"),
treatment = c("a", "a", "b", "b", "a", "b"),
species = c("vetch", "clover", "vetch", "clover", "vetch", "vetch"),
frequency = c(0, 1, 1, 1, 1, 1))
My first thought was to pivot_wider, select columns then pivot_longer again, but this didn't work because the clover column was still selected by having a 1 in site "X":
df %>%
pivot_wider(names_from = species, names_prefix = "spp.", values_from = frequency, values_fill = 0) %>%
group_by(site) %>%
select_if(~ !is.numeric(.) || sum(.) != 0) %>%
pivot_longer(starts_with("spp."), names_to = "species", names_prefix = "spp.", values_to = "frequency") -> df
So I guess I need to filter instead, but I can't figure out how to do that.
Maybe not for this dataset, but in general sum might not be the right approach: if the data contains negative numbers, they can cancel out positive ones and the wrong groups get removed. You can use all or any instead:
With dplyr :
library(dplyr)
df %>% group_by(date, site, species) %>% filter(any(frequency != 0))
#Also
#df %>% group_by(date, site, species) %>% filter(!all(frequency == 0))
# date site treatment species frequency
# <dbl> <chr> <chr> <chr> <dbl>
#1 2018 X a vetch 0
#2 2018 X a clover 1
#3 2018 X b vetch 1
#4 2018 X b clover 1
#5 2018 Z a vetch 1
#6 2018 Z b vetch 1
The same can be done in data.table as well :
library(data.table)
setDT(df)[, .SD[any(frequency != 0)], .(date, site, species)]
Or in base R :
subset(df, ave(frequency != 0, date, site, species, FUN = any))
An easy solution can be achieved by creating another column that contains the frequency of each species grouped by date, site and species (ignoring treatment). Then you can easily filter using this new column and afterwards eliminate it.
library(tidyverse)
df %>%
# Group by date site and species
group_by(date, site, species) %>%
# Create new column that sums frequency values by grouping variables
mutate(appears = sum(frequency)) %>%
# ignore rows where appears = 0
filter(appears != 0) %>%
# Eliminate appears column
select(-appears)
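Note that the question asks about species that are absent across all treatments and dates at a site. With more than one date in the data, grouping by site and species only (leaving date out) may be closer to that. A small sketch with made-up data for two dates; the data frame obs and its values are assumptions for illustration:

```r
library(dplyr)

# Hypothetical observations on two dates
obs <- data.frame(
  date      = rep(c("2020-01-01", "2020-02-01"), each = 4),
  site      = rep(c("X", "X", "Z", "Z"), times = 2),
  species   = rep(c("vetch", "clover"), times = 4),
  frequency = c(0, 1, 1, 0,   # first date
                1, 0, 1, 0)   # second date
)

# Keep a site/species pair if it has a non-zero frequency on any date
kept <- obs %>%
  group_by(site, species) %>%
  filter(any(frequency != 0)) %>%
  ungroup()
```

Here clover at site "Z" is dropped entirely, while vetch at site "X" keeps its zero row from the first date because it occurred on the second date. Grouping by date as well would drop that zero row.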

Frequency of unique values of one variable grouped in another variable - R?

Extreme newbie question: I have 2 variables, region ID and household ID, there are duplicate households within the regions. I'm just trying to find out how many unique households are in each region.
This is what I am trying:
library(dplyr)
table <- data %>% group_by(region) %>% summarise(hid = unique(hid))
Error message:
Error: Column hid must be length 1 (a summary value), not 142
Something like this might get you what you want:
library(tidyverse)
df <- tibble(region_id = c(1, 2, 3, 1, 2, 3),
household_id = c("a", "b", "b", "a", "a", "b"))
df %>%
group_by(region_id) %>%
count(household_id) %>%
summarize(unique_households = n())
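The error in the question comes from summarise expecting one value per group, while unique(hid) returns several. A shorter variant, sketched on the same example data, counts the distinct values directly with n_distinct; the name households is just for illustration.

```r
library(dplyr)

df <- data.frame(region_id = c(1, 2, 3, 1, 2, 3),
                 household_id = c("a", "b", "b", "a", "a", "b"))

# One row per region with the number of distinct households in it
households <- df %>%
  group_by(region_id) %>%
  summarise(unique_households = n_distinct(household_id))
```

For this data, region 2 has two distinct households and regions 1 and 3 have one each.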

Calculation between groups in one column in tidy data

I have data like this:
df <- (
tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
)
I want to calculate the ratio between "Height" and "Waist" and between "Waist" and "Hip".
I have the following solution. But my solution requires using spread() and delivers only the calculation for "Waist-to-hip".
df <- rbind(df,
spread(df, Parameter, Value)
%>% transmute(ID = ID,
Group = Group,
Parameter = "Ratio.Height-to-Hip",
Value = Height / Hip,
Parameter = "Ratio.Waist-to-Hip",
Value = Waist / Hip))
Is it possible to stay in tidy data format and avoid switching to the wide format? Why is the calculation for "Height-to-hip" missing?
Here is one possible solution:
# Calculate ratios "Height" vs "Waist" and "Waist" vs "Hip"
# 1. Load packages
library(tidyr)
library(dplyr)
# 2. Data set
df <- tibble(
id = rep(1:2, 4),
group = c("A", "B", "A", "B","A", "B", "A", "B"),
parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# 3. Filter and transform data set
df <- df %>%
filter(parameter %in% c("Height", "Waist", "Hip")) %>%
spread(parameter, value)
# 4. Convert column names to lower case
colnames(df) <- tolower(colnames(df))
# 5. Calculate ratios
df <- df %>%
mutate(
ratio_height_vs_waist = round(height / waist, 2),
ratio_waist_vs_hip = round(waist / hip, 2))
The main problem is that the data are not in a tidy format.
Two key features of the tidy format are (Wickham, 2013):
Each variable forms a column;
Each observation forms a row.
In its original format, your data violates these two rules. For example, the Parameter column contains four variables (Blood, Height, Waist, and Hip). The knock-on effect of grouping several variables within Parameter is that each observation has to be repeated across several rows. In general, repeated rows of an identifier (ID in this case) in the absence of repeated measures is a sign that two or more variables have been grouped under a single column.
Anyway, here's my attempt to clean the data (I have used mutate and not transmute for illustrative purposes).
# Load packages
library(dplyr)
library(tidyr)
library(magrittr) # For the %<>% function, which I love
# Make data frame, df
df <- tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# Wrangle df
df %<>%
# ID and Group appear to be repeated, so use them to group_by
group_by(ID, Group) %>%
# Spread the Value column by the Parameter column
spread(key = Parameter,
value = Value) %>%
# Ungroup, just because it's a good habit
ungroup() %>%
# Generate new columns.
mutate(Ratio_height_to_hip = Height / Hip,
Ratio_waist_to_hip = Waist / Hip)
# Print df
df
#> # A tibble: 2 x 8
#> ID Group Blood Height Hip Waist Ratio_height_to_hip
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 A 6.3 180 60 90 3.000000
#> 2 2 B 6.0 170 65 102 2.615385
#> # ... with 1 more variables: Ratio_waist_to_hip <dbl>
df <- df %>%
spread(Parameter, Value) %>%
mutate("Ratio.Height-to-Hip" = Height / Hip) %>%
mutate("Ratio.Waist-to-Hip" = Waist / Hip) %>%
gather("Parameter", "Value", -c("ID", "Group"))
Your data is not in tidy format ;) If you want your data in tidy format remove the last step.
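On the "why is Height-to-hip missing" part: in the original transmute call, Parameter and Value are each assigned twice, and a later assignment to the same column name overwrites the earlier one, so only the last ratio survives. One way around that, sketched here with pivot_wider and pivot_longer (the newer counterparts of spread and gather), is to give each ratio its own column first and only then go back to long format. The names ratios and df_long are made up for this example.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  ID = rep(1:2, 4),
  Group = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
  Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))

ratios <- df %>%
  # One column per parameter, one row per ID/Group
  pivot_wider(names_from = Parameter, values_from = Value) %>%
  # Each ratio gets its own column, so nothing is overwritten
  transmute(ID, Group,
            `Ratio.Height-to-Hip` = Height / Hip,
            `Ratio.Waist-to-Hip`  = Waist / Hip) %>%
  # Back to the Parameter/Value layout of the input
  pivot_longer(-c(ID, Group), names_to = "Parameter", values_to = "Value")

# Append the ratios to the original long data
df_long <- bind_rows(df, ratios)
```

The wide step still happens, but only inside the pipeline; the data you keep stays in the original long layout with four extra rows for the two ratios.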
