I have a large dataset with multiple groups I want to sample. Each group has a certain number of positive cases (value 1) and many more negative cases (value 0).
For each group, I want to select all the positive cases, and then a random selection of negative cases equal to 4x the number of positive cases in that group.
I also need something that runs quickly on a lot of data.
Semi-Update:
stratified_sample = data %>%
  group_by(group) %>%
  mutate(n_pos = sum(response == 1),
         n_neg = 4 * n_pos) %>%
  group_by(group, response) %>%
  mutate(rec_num = n(),
         random_val = runif(n()),
         random_order = rank(random_val)) %>%
  filter(response == 1 | random_order <= n_neg)
This should work if you sub in the correct names. If you have issues, provide a reproducible example.
library(dplyr)
stratified_sample = your_large_dataset %>%
  group_by(whatever_your_grouping_variable_is) %>%
  mutate(n_pos = sum(column_name_of_your_label == 1),
         n_neg = sum(column_name_of_your_label == 0),
         cutoff = 4 * n_pos / n_neg) %>%
  filter(column_name_of_your_label == 1 | runif(n()) < cutoff)
This gives each negative case a selection probability of 4 * (number of positive cases) / (number of negative cases), so the sampled count won't be exactly 4x in any one group, but its expected value is what you want.
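If you need the negative count to be exactly 4x the positives in every group rather than just in expectation, the Semi-Update code above already does that; here is a slightly condensed sketch of the same idea, using the same placeholder column names:

library(dplyr)

exact_sample = your_large_dataset %>%
  group_by(whatever_your_grouping_variable_is) %>%
  mutate(n_pos = sum(column_name_of_your_label == 1)) %>%
  group_by(whatever_your_grouping_variable_is, column_name_of_your_label) %>%
  # keep every positive; for negatives, keep a random subset of 4 * n_pos rows
  # (or all of them if the group has fewer negatives than that)
  filter(column_name_of_your_label == 1 | rank(runif(n())) <= 4 * n_pos) %>%
  ungroup() %>%
  select(-n_pos)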
I don't understand how weighting works in the dplyr::sample_n function. I have a list of very small numbers (ranging from 0.1020457 to 0.1789028) and I need to weight my sampling so that I get some from the lower end, the upper end, and the middle. But since the numbers are so similar, I'm not sure how to do it. I also don't want to restrict my sampling to a certain range (e.g. numbers > 0.16); I just want those to be far more likely to be sampled.
I can make the range larger (-1.552115 to 2.008253), but that means scaling my data, and I can't weight with negative numbers, so I have to do things like abs(numbers - maximum). Here's an example of what I'm doing:
sample_n(data.frame(scaledMeasurement$V1), 4,
         replace = FALSE,
         weight = abs(scaledMeasurement$V1 - max(scaledMeasurement$V1)))
Here's a section of my data:
Measurement    ID
 0.8022473      1
 1.6991193      2
 0.7262765      3
 0.3903775      4
-1.5521155      5
-0.7821887      6
If your goal is to get a sample that contains some values from the low end, some near the median, and some from the high end, it's far easier to avoid weights and just use group_by + sample_n.
library(tidyverse)

df = tibble(my_nums = runif(10, 0.1020457, 0.1789028))

df %>%
  mutate(quantile = case_when(
    my_nums <= quantile(my_nums, probs = c(0.33)) ~ "a",
    my_nums <= quantile(my_nums, probs = c(0.67)) ~ "b",
    TRUE ~ "c"
  )) %>%
  group_by(quantile) %>%
  sample_n(2)
Produces:
my_nums quantile
<dbl> <chr>
1 0.105 a
2 0.105 a
3 0.151 b
4 0.124 b
5 0.173 c
6 0.172 c
However, if you wanted to use weights, sample_n requires that the weights be the same length as the vector being sampled and that they sum to 1. You could add a weight column based on a subdivision of your data (quantiles, as shown above): group by that subdivision, generate a random integer between one and the group length for each row, ungroup, and then divide the values in that column by their sum. Like so:
df %>%
  mutate(quantile = case_when(
    my_nums <= quantile(my_nums, probs = c(0.33)) ~ "a",
    my_nums <= quantile(my_nums, probs = c(0.67)) ~ "b",
    TRUE ~ "c"
  )) %>%
  group_by(quantile) %>%
  mutate(weight = sample(seq(1, length(my_nums)), length(my_nums))) %>%
  ungroup() %>%
  arrange(quantile) %>%
  mutate(weight = weight / sum(weight)) %>%
  sample_n(6, weight = weight)
I have a question about how to count the occurrence of specified patterns (runs) in a data set in R.
I am currently working with continuous-glucose-monitoring data sets. Briefly, each data set has between 1500 and 2000 observations (each observation is a plasma glucose value measured every 5 minutes over 6 days).
I need to count the occurrences of glucose values below 3.9 lasting 15 minutes or more and less than 120 minutes in a row (>3 observations and <24 observations in a row with values < 3.9 on the numeric scale).
I have created a new indicator variable that is 1 or 0 depending on whether the plasma glucose value is below 3.9.
I would then like to count the number of runs of more than three 1's and fewer than twenty-four 1's in a row.
Is there a function in R for this, or what would be the easiest approach?
I'm not sure if I got your data structure right, but maybe the following code can still help.
I'm assuming a data structure that includes a measurement, a person ID, and a measurement ID.
library(dplyr)

# create dummy data
set.seed(123)
data_test = data.frame(measure = rnorm(100, 3.5, 2),
                       person_id = rep(1:10, each = 10),
                       measure_id = rep(1:10, 10))

# indicator for measures below the critical value
data_test$below_criterion = 0
data_test$below_criterion[which(data_test$measure < 3.9)] = 1

# shift the column so the current value can be compared with the previous one
data_test = data_test %>%
  group_by(person_id) %>%
  mutate(prev_below_crit = c(below_criterion[1], below_criterion[1:(n() - 1)]))

# start_of_run indicates whether the current measurement is the first one below
# the critical value in a possible series: current value below the criterion,
# previous value above it
data_test$start_of_run = 0
data_test$start_of_run[which(data_test$below_criterion == 1 & data_test$prev_below_crit == 0)] = 1

# helper variable to group all the possible series within a person
data_test = data_test %>%
  group_by(person_id) %>%
  mutate(grouper = cumsum(start_of_run))

# get rid of the helper variables created above
data_test = data_test %>%
  select(measure, person_id, measure_id, below_criterion, grouper)

# count the length of each series by summing the below-criterion indicators
# within each person and series
data_results = data_test %>%
  group_by(person_id, grouper) %>%
  summarise(count_below_crit = sum(below_criterion))

# count all series of the desired length for each person
data_results = data_results %>%
  group_by(person_id) %>%
  filter(count_below_crit >= 3 & count_below_crit <= 24) %>%
  summarise(n())

data_results
data.frame(data_test)
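As a side note, base R's rle() (run-length encoding) can count these runs more compactly. A minimal sketch, assuming the data_test created above with its below_criterion indicator:

# for each person, count runs of below-criterion values that last
# between 3 and 24 consecutive observations
data_test %>%
  group_by(person_id) %>%
  summarise(n_runs = {
    r <- rle(below_criterion)
    sum(r$values == 1 & r$lengths >= 3 & r$lengths <= 24)
  })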
I have a df (test) like this:
Now if you look at the data, the 6 to 10 combination is present in the second period but not in the first. Hence, when I use this code
a_summary <- test %>%
  group_by(from, to) %>%
  summarize(avg = mean(share, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, avg, fill = 0)
The output looks like this:
Now look at the 10 to 6 cell: it shows a value of 1 because the 10 to 6 combination exists only once. But when I take the average, I would like to consider all combinations in each period, so the expected value of that 10 to 6 cell is 0.5 and the overall matrix column and row sums should be 1.
a_summary <- test %>%
  group_by(from, to) %>%
  summarize(count = sum(n, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, count, fill = 0)
This will give you the counts of all combinations. You can then normalise this matrix by dividing by sum(test$n), or by using prop.table().
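For example, the whole-matrix normalisation could look like this (a sketch assuming test has columns from, to, and a count column n, as in the code above):

library(dplyr)
library(tidyr)

count_wide <- test %>%
  group_by(from, to) %>%
  summarize(count = sum(n, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, count, fill = 0)

# turn the wide table into a matrix (rows = to, columns = from) and
# normalise so that all cells sum to 1
m <- as.matrix(count_wide[, -1])
rownames(m) <- count_wide$to
prop.table(m)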
For my research I am trying to create a graph similar to one I found in a piece of literature:
My experiment involved the genre-tagging of 10 different songs. I saved the tags (the words people used to describe each song) separately.
The x-axis should represent all the participants that took part, in chronological order. The y-axis should represent how often a word is used in a tag. Consider this sample data:
df <- data.frame(tagid= numeric(0), participantid = numeric(0), tag = character(0))
newRow <-data.frame(tagid=1, participantid=1, tag = "triphop")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=2, participantid=1, tag = "electronic")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=3, participantid=2, tag = "mellow")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=4, participantid=2, tag = "electronic")
df <-rbind(df,newRow)
newRow <-data.frame(tagid=5, participantid=3, tag = "electronic")
df <-rbind(df,newRow)
Tagid 1 and 2 belong to the same participant and should have the same x coordinate. 3 and 4 belong to participant 2 and tagid 5 belongs to participant 3.
For this dataset I'd like to plot a graph like this (excuse the drawing):
The y-axis represents the percentage of participants that have used a specific word to describe this music piece. As 'electronic' is used by all three participants it stays at 100%. 'Triphop' was used by participant 1, but not by participant 2 and 3, decreasing from 100%, to 50%, to 33% at participant 3.
The code is a bit messy, but you probably want something like this? You need to complete the data frame so each participantid has rows for all three tag levels. Then, with the cumulative count of each tag and the running number of participants, you can compute the proportion.
library(dplyr)
library(tidyr)
library(ggplot2)

df %>%
  group_by(participantid, tag) %>%
  summarise(n = n()) %>%
  complete(tag, nesting(participantid), fill = list(n = 0)) %>%
  group_by(tag) %>%
  mutate(absolute = cumsum(n)) %>%
  ungroup() %>%
  mutate(id = rep(1:3, each = length(levels(tag)))) %>%
  mutate(proportion = ifelse(absolute / id != 0, absolute / id, NA)) %>%
  ggplot(aes(x = participantid, y = proportion, color = tag)) +
  geom_line(lwd = 1)
I've measured N2O flux from soil at multiple time points in the day (not equally spaced). I'm trying to calculate the total N2O flux from soil for a subset of days by finding the area under the curve for each given day. I know how to do this when using only measurements from the given day; however, I'd like to include the last measurement of the previous day and the first measurement of the following day to improve the estimation of the curve.
Here's an example to give a more concrete idea:
library(MESS)
library(lubridate)
library(dplyr)
# Generate a reproducible example
datetime <- seq(ymd_hm('2015-04-07 11:20'),ymd('2015-04-13'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))
useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate)
Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.
out <- dat %>%
  mutate(lagDateGood = lag(DateGood),
         leadDateGood = lead(DateGood)) %>%
  filter(lagDateGood != "No" | leadDateGood != "No")
Now I need to calculate the area under the curve, but this is not correct:
out2 <- out %>%
  group_by(day) %>%
  mutate(hourOfday = hour(datetime) + minute(datetime) / 60) %>%
  summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))
The trouble is that I don't include the measurements at the end of the previous day and the start of the following day when calculating the AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.
I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example the last measurement on day 8 would be used in estimating AUC for day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?
For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:
# Now I need to calculate the area under the curve for each day
n <- nrow(out)

# for each day's rows, also include the row just before and just after
extract <- function(ix) out[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)

calcTotalFlux <- function(df) {
  if (nrow(df) < 10) {  # make sure the day has at least 10 measures
    NA
  } else {
    day_midnight <- floor_date(df$datetime[2], "day")
    df %>%
      mutate(time = datetime - day_midnight) %>%
      summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
  }
}

do.call("rbind", lapply(res, calcTotalFlux))
   TotalFlux
7         NA
8   585230.2
9   579017.3
10        NA
11  563689.7
12        NA
Here's another way, more in line with the suggestions of @Alex Brown.
# Another way: copy each day's last measurement into the following day and
# each day's first measurement into the preceding day, then group by day
last <- out %>%
  group_by(day) %>%
  filter(datetime == max(datetime)) %>%
  ungroup() %>%
  mutate(day = day + 1)

first <- out %>%
  group_by(day) %>%
  filter(datetime == min(datetime)) %>%
  ungroup() %>%
  mutate(day = day - 1)

d <- rbind(out, last, first) %>%
  group_by(day) %>%
  arrange(datetime)

# drop days that end up with too few measurements
n_measures_per_day <- d %>%
  summarize(n = n())

d <- left_join(d, n_measures_per_day) %>%
  filter(n > 4)

TotalFluxDF <- d %>%
  mutate(timeAtMidnight = floor_date(datetime[3], "day"),
         time = datetime - timeAtMidnight) %>%
  summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))

TotalFluxDF
TotalFluxDF
Source: local data frame [3 x 2]
day auc
(dbl) (dbl)
1 8 585230.2
2 9 579017.3
3 11 563689.7