I have a data frame of 30 years of a response variable. I want to write a code that will subset that df into "x" number of years "n" times, and run a regression of the response in all of those subsets.
So if we started with 30 years, x=5 & n=2, we would end with 2 regressions, each using 5 random years out of the available 30. I wrote a function for that here:
# build df
df = data.frame(year=c(1:30),
response = runif(30,1,100))
# create function
subsample <- function(df, x, n ){
df %>%
# collapse the tibble
nest(data=everything()) %>%
# repeat the tibble for number of simulations
slice(rep(1:n(), each = n)) %>%
# add group number, which will be the "nth" trial
mutate(group = c(1:n)) %>%
# expand data
unnest(cols = c(data)) %>%
# group by group number, then subsample n times from each group
group_by(group) %>%
group_map(~ sample_n(.x, x, replace = F)) %>%
# stitch back together and add group number col back
bind_rows(.id="trial") %>%
# arrange by group and year
mutate(trial=as.numeric(trial)) %>%
arrange(trial,year) %>%
# group by subsample and run regression
group_by(trial) %>%
do({
mod = lm(response ~ year, data = .)
data.frame(Intercept = coef(mod)[1],
Slope = coef(mod)[2])
})
}
# test function
subsample(df, x=5, n=2)
# A tibble: 2 x 3
# Groups: simulation [2]
trial Intercept Slope
<dbl> <dbl> <dbl>
1 1 48.5 -0.895
2 2 35.4 -0.275
Great, so that works, and we get two regressions (all I want is slope and intercept) each using a subset of 5 out of 30 years.
However, now I want to do this with every possible combination of years (so x = c(2:30)), ending with a df that should look like this
# A tibble: 2 x 3
number_of_years trial Intercept Slope
<dbl> <dbl> <dbl> <dbl>
1 2 1 48.5 -0.895
2 2 2 35.4 -0.275
3 3 1 55.2 0.333
4 3 2 34.1 0.224
5 4 1 63.2 -0.359
6 4 2 45.5 -0.241
7 5 1 43.1 0.257
8 5 2 37.9 -0.657
9 6 1 51.0 -0.456
10 6 2 65.6 0.126
This would be showing regression values of 2 trials ("n") each using 2 random years (number_of_years, "x"), then 2 trials using 3 random years, 4 random years, etc... all the way until 30.
So I tried to follow the same logic as above, but now using group_map() with the custom function that I built:
df %>%
# collapse the tibble
nest(data=everything()) %>%
# repeat the tibble for the number of simulations we want to test (29, in this case)
slice(rep(1:n(), each = (nrow(df)-1))) %>%
# add column for number out of total and unnest
mutate(number_of_years = c(2:(nrow(.)+1))) %>%
select(number_of_years,data) %>% #reorder
unnest(cols =c(data)) %>%
# group by out of total
group_by(number_of_years) %>%
group_map( ~ subsample(.x, x=5, n=2,))
### this is the problematic line!
### this is giving us 2 trials (n=2) of a regression, each using
### x=5 years of sampling. but instead of x=5 years, I want x=number_of_years
### so x should be the same as the grouping variable.
So the problem is this: since my subsample() function needs 3 inputs (df, x, n), I need to figure out how to make "x" the same as the grouping variable for the dataset, i.e. x should be number_of_years. I've tried group_map( ~ subsample(.x, .x$number_of_years, 2)) and variations like that, but I can't figure out how to make it return 29 tibbles (one for each x from 2 to 30) of 2 trials each, meaning 2 regressions on subsamples of the original df, with each tibble using a different number of years to calculate the regression.
I would like to stay in the tidyverse/dplyr/purrr workflow if possible.
Thanks!
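One possible direction, sketched under assumptions: group_map() passes the group key to the function as a second argument (.y), so something like group_map(~ subsample(.x, x = .y$number_of_years, n = 2)) should hand each group its own x. It may be simpler to skip the nest/unnest step and map over the candidate values of x directly. In the sketch below, subsample_fit() is a hypothetical, compact stand-in for the subsample() function above (not the original code), and n_years plays the role of x.

```r
library(dplyr)
library(purrr)

set.seed(1)
df <- data.frame(year = 1:30, response = runif(30, 1, 100))

# Hypothetical stand-in for subsample(): n trials, each fitting lm() on x random years
subsample_fit <- function(df, x, n) {
  map(seq_len(n), function(trial) {
    mod <- lm(response ~ year, data = slice_sample(df, n = x))
    tibble(trial     = trial,
           Intercept = unname(coef(mod)[1]),
           Slope     = unname(coef(mod)[2]))
  }) %>%
    bind_rows()
}

# One call per subset size (2, 3, ..., 30), tagging each result with its size
result <- map(2:nrow(df), function(n_years) {
  subsample_fit(df, x = n_years, n = 2) %>%
    mutate(number_of_years = n_years, .before = 1)
}) %>%
  bind_rows()

head(result)
```

The result has one row per (number_of_years, trial) pair, which matches the shape of the desired output shown in the question.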
I am playing around with the brca dataset in r and am trying to create 2 x standardised mean values for each variable; one for the B group and one for the M group. This is so that I can calculate the difference between the standardised mean to see which variables have the highest difference.
I think what I want to do is:
scale each variable so they are standardised
group by the outcome (either B or M)
calculate the mean of each variable for each group
pivot from wide to long
I expect that B is one column and M is a second column at this point (and each variable mean is a row, with variable name being row name)
calculate the absolute difference between means for B & M for each variable and store as new column
arrange by desc
Does my logic sound correct?
If so, I 'think' I have managed to do steps 1-3, but I have never done these calculations before, let alone in R, so I have no idea if I am on the right track. Would anyone mind reviewing and seeing if it looks right?
Secondly - can someone help me with how to complete the pivot to a long table (my step 4)?
library(tidyverse)
library(purrrlyr)
library(ggplot2)
temp <- dslabs::brca
df <- cbind(as.data.frame(temp$x), outcome = temp$y)
scaled_df <- df %>%
mutate_if(is.numeric, scale) %>%
group_by(outcome) %>%
dmap(mean)
Something like this?
suppressPackageStartupMessages({
library(tidyverse)
library(purrrlyr)
})
temp <- dslabs::brca
df <- cbind(as.data.frame(temp$x), outcome = temp$y)
scaled_df <- df %>%
mutate_if(is.numeric, scale) %>%
group_by(outcome) %>%
purrrlyr::dmap(mean)
scaled_df %>%
pivot_longer(-outcome) %>%
group_by(name) %>%
summarise(diff_means = diff(value))
#> # A tibble: 30 × 2
#> name diff_means
#> <chr> <dbl>
#> 1 area_mean 1.47
#> 2 area_se 1.13
#> 3 area_worst 1.52
#> 4 compactness_mean 1.23
#> 5 compactness_se 0.605
#> 6 compactness_worst 1.22
#> 7 concave_pts_mean 1.60
#> 8 concave_pts_se 0.843
#> 9 concave_pts_worst 1.64
#> 10 concavity_mean 1.44
#> # … with 20 more rows
Created on 2022-08-04 by the reprex package (v2.0.1)
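For the remaining steps in the question (one column per group, absolute difference, descending order), here is a minimal sketch that avoids purrrlyr. It assumes dplyr >= 1.0 and tidyr >= 1.0, and it uses a small simulated data frame (toy, with made-up columns radius and texture) in place of dslabs::brca so that it runs on its own.

```r
library(dplyr)
library(tidyr)

set.seed(42)
# Simulated stand-in for the brca data: two numeric variables and a B/M outcome
toy <- data.frame(
  outcome = factor(rep(c("B", "M"), each = 50)),
  radius  = c(rnorm(50, 12, 2), rnorm(50, 17, 3)),
  texture = c(rnorm(50, 18, 4), rnorm(50, 21, 4))
)

ranked <- toy %>%
  # step 1: standardise every numeric column (as.numeric drops scale()'s matrix wrapper)
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)))) %>%
  # steps 2-3: mean of each standardised variable within each outcome group
  group_by(outcome) %>%
  summarise(across(everything(), mean)) %>%
  # step 4: long, then one column per outcome (B and M)
  pivot_longer(-outcome, names_to = "variable", values_to = "std_mean") %>%
  pivot_wider(names_from = outcome, values_from = std_mean) %>%
  # steps 5-6: absolute difference, largest first
  mutate(abs_diff = abs(M - B)) %>%
  arrange(desc(abs_diff))

ranked
```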
I'm testing the accuracy of an imputation model using training and test datasets. The model I'm running uses a categorical variable. Unfortunately, when I randomly split the dataset and run a model on the training set, I am unable to estimate a coefficient for some categorical variables which are present in the test dataset. I would like to split the data while ensuring that all categorical variables are present in both the training and test datasets. Is there an easy way to do this in R?
In the simulated data below, this would require the same sets of letters to be present in both datasets, so that I can test the accuracy of the model in the test dataset.
chars<-c("A","B","C","D")
complete_data<-data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))
In my dataset, the problem is a little trickier as some of the categorical variables are extremely scarce.
EDIT:
Thanks for the responses. I ended up looking up stratified sampling as Antimon suggested and came across the caret package which apparently works as well.
library(caret)
train.index <- createDataPartition(complete_data$v3, p = .7, list = FALSE)
train <- complete_data[ train.index,]
test <- complete_data[-train.index,]
This can be achieved quite simply.
library(tidyverse)
chars<-c("A","B","C","D")
complete_data <- tibble(v1=rnorm(100,2,100),
v2=rnorm(100,1,100),
v3=sample(chars, 100, replace=TRUE))
propCategory = function(data, category, prop){
category = enquo(category)
cat1 = data %>% pull(!!category)
unlist(sapply(as.list(unique(cat1)), function(x) {sample(which(cat1==x), sum(cat1==x)*prop)}))
}
complete_data %>% propCategory(v3, .2)
output
[1] 98 35 20 78 40 70 87 3 86 38 22 100 80 93 47 5 24 29 26
As you can see, my propCategory function returns row indices. But let's check if they contain what you need.
First, let's check the training indexes.
train = complete_data %>% propCategory(v3, .75)
complete_data[train,] %>% distinct(v3)
complete_data[train,] %>% nrow()
output
> complete_data[train,] %>% distinct(v3)
# A tibble: 4 x 1
v3
<chr>
1 B
2 A
3 D
4 C
> complete_data[train,] %>% nrow()
[1] 74
Now it's time for the test indexes.
complete_data[-train,] %>% distinct(v3)
complete_data[-train,] %>% nrow()
output
> complete_data[-train,] %>% distinct(v3)
# A tibble: 4 x 1
v3
<chr>
1 B
2 A
3 D
4 C
> complete_data[-train,] %>% nrow()
[1] 26
As you can see, both the training and test data include each of your categories.
A little note about the prop parameter.
My propCategory function is written so that, for each distinct value of the categorical variable, it returns a random sample of row indices whose size is prop * (the number of rows with that value), rounded down.
Take a good look at the results below.
complete_data %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
complete_data[train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
complete_data[-train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
output
> complete_data %>% group_by(v3) %>%
+ summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 26 0.26
2 B 35 0.35
3 C 24 0.24
4 D 15 0.15
> complete_data[train,] %>% group_by(v3) %>%
+ summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 19 0.257
2 B 26 0.351
3 C 18 0.243
4 D 11 0.149
> complete_data[-train,] %>% group_by(v3) %>%
+ summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 7 0.269
2 B 9 0.346
3 C 6 0.231
4 D 4 0.154
There are several ways of doing this. You need to divide your data by v3 and then split each group randomly:
chars <- c("A","B","C","D")
complete_data <- data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))
Now we'll use the by() function to split the data into groups by v3 and draw a random sample of half of the rownames in each group:
test <- as.numeric(unlist(by(complete_data, complete_data$v3, function(x) sample(rownames(x), length(rownames(x))/2))))
train_test <- rep("train", nrow(complete_data))
train_test[test] <- "test"
table(complete_data$v3, train_test)
# train_test
# test train
# A 11 12
# B 12 13
# C 13 14
# D 12 13
Now complete_data[train_test=="train", ] is your training set and complete_data[train_test=="test", ] is your test set.
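The same stratified split can also be sketched with dplyr alone. This assumes dplyr >= 1.0 (for slice_sample()); row_id is a helper column introduced here only to separate the two sets. Note that a level with a single row cannot appear in both sets, so very scarce categories would still need special handling.

```r
library(dplyr)

set.seed(123)
chars <- c("A", "B", "C", "D")
complete_data <- data.frame(v1 = rnorm(100, 2, 100),
                            v2 = rnorm(100, 1, 100),
                            v3 = sample(chars, 100, replace = TRUE))

# Tag rows so that the test set can be defined as "everything not in train"
with_id <- complete_data %>% mutate(row_id = row_number())

# Draw 70% of the rows within each level of v3 (rounded down per level)
train <- with_id %>%
  group_by(v3) %>%
  slice_sample(prop = 0.7) %>%
  ungroup()

test <- anti_join(with_id, train, by = "row_id")

# Check that every level shows up on both sides of the split
setequal(unique(train$v3), unique(test$v3))
```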
Hi I have a df that has variables as the columns and rows are time. The element of each intersection is a count.
Var_1 Var_2 Var_3
Time_1 5 4 5
Time_2 4 19 4
Time_3 2 2 87
This df has a lot of rows (> 30,000)
How can I calculate Z scores for every 20 rows? Thanks in advance! <3
Here is an answer that uses dplyr::summarise() to calculate means and standard deviations, then we merge them with the original data and use mutate() to calculate the z-scores. We'll illustrate the single variable case, but it can be extended to handle multiple variables.
Given the ambiguity of the original question, we assume the Time- column is structured in groups of 20, which allows us to use it as the main grouping variable for the solution. That is, there are 20 observations at Time-1, another 20 at Time-2, etc.
If the requirement is to create groups of 20 rows based on consecutive row identifiers, the solution can easily be modified to add a grouping variable to represent sets of 20 rows.
# simulate some data
y <- rpois(20000,3) # simulate counts
TimeVal <- paste0(rep("Time-",20000),
rep(1:1000,20))
data <-data.frame(TimeVal,y,stringsAsFactors = FALSE)
library(dplyr)
result <- data %>% group_by(TimeVal) %>% summarise(ybar = mean(y),
stDev = sd(y)) %>%
full_join(data, ., by = "TimeVal") %>% mutate(zScore = (y - ybar) / stDev)
head(result)
...and the output:
> head(result)
TimeVal y ybar stDev zScore
1 Time-1 6 3.45 1.276302 1.99795938
2 Time-2 2 2.95 1.700619 -0.55862010
3 Time-3 2 3.20 1.908430 -0.62878909
4 Time-4 3 3.10 1.916686 -0.05217339
5 Time-5 2 3.10 1.447321 -0.76002513
6 Time-6 2 3.30 1.809333 -0.71849700
>
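For the other reading of the question (z-scores within each consecutive block of 20 rows), a minimal sketch follows. It assumes the rows are already in time order; block is a helper column introduced here, and counts is simulated data rather than the asker's data frame.

```r
library(dplyr)

set.seed(1)
counts <- data.frame(Var_1 = rpois(100, 3))  # simulated counts, 100 rows

z <- counts %>%
  # rows 1-20 get block 0, rows 21-40 get block 1, and so on
  mutate(block = (row_number() - 1) %/% 20) %>%
  group_by(block) %>%
  # z-score of each count relative to its own block of 20 rows
  mutate(z_Var_1 = (Var_1 - mean(Var_1)) / sd(Var_1)) %>%
  ungroup()

head(z)
```

Using a grouped mutate() keeps every original row, so no join back to the original data is needed.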
Extending the solution: z-scores for multiple columns
To solve for multiple columns in the original input data frame, first we create a long form tidy data frame with tidyr::pivot_longer(), calculate means and standard deviations, merge them with the narrow data and calculate z-scores.
Converting the input data to a long form tidy data frame allows us to use the original column names in a dplyr::group_by(), eliminating a lot of code that would be otherwise required to calculate the z-scores for each column in the original data.
library(tidyr)
set.seed(95014) # set seed to make results reproducible
y2 <- rpois(20000,8)
y3 <- rpois(20000,15)
data <- data.frame(TimeVal,y,y2,y3,stringsAsFactors = FALSE)
# convert to narrow format tidy, calculate means, sds, and zScores
longData <- data %>%
group_by(TimeVal) %>%
pivot_longer(-TimeVal,
names_to = "variable",
values_to = "value")
result <- longData %>%
group_by(TimeVal,variable) %>%
summarise(avg = mean(value), stDev = sd(value)) %>%
full_join(longData,.) %>%
mutate(.,zScore = (value - avg) / stDev)
head(result)
...and the output:
> head(result)
# A tibble: 6 x 6
# Groups: TimeVal [2]
TimeVal variable value avg stDev zScore
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 Time-1 y 6 3.45 1.28 2.00
2 Time-1 y2 13 8.7 2.23 1.93
3 Time-1 y3 20 16.4 5.25 0.686
4 Time-2 y 2 2.95 1.70 -0.559
5 Time-2 y2 6 8.2 2.89 -0.760
6 Time-2 y3 12 14.8 3.34 -0.852
>
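If the wide layout needs to be kept, a grouped mutate() with across() is an alternative to reshaping. This is a sketch assuming dplyr >= 1.0; the data are simulated here with the same column names as above but fewer rows, and the z_ prefix is just a naming choice.

```r
library(dplyr)

set.seed(95014)
wide <- data.frame(
  TimeVal = rep(paste0("Time-", 1:5), each = 20),
  y  = rpois(100, 3),
  y2 = rpois(100, 8),
  y3 = rpois(100, 15)
)

z_wide <- wide %>%
  group_by(TimeVal) %>%
  # add one z-score column per variable, computed within each TimeVal group
  mutate(across(c(y, y2, y3),
                ~ (.x - mean(.x)) / sd(.x),
                .names = "z_{.col}")) %>%
  ungroup()

head(z_wide)
```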
I am trying to draw a stratified sample from a data set for which a variable exists that indicates how large the sample size per group should be.
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
In this example, grp refers to the group I want to sample by and frq is the sample size specified for that group.
Using split, I came up with this possible solution, which gives the desired result but seems rather inefficient :
s <- split(df, df$grp)
lapply(s,function(x) sample_n(x, size = unique(x$frq))) %>%
do.call(what = rbind)
Is there a way using just dplyr's group_by and sample_n to do this?
My first thought was:
df %>% group_by(grp) %>% sample_n(size = frq)
but this gives the error:
Error in is_scalar_integerish(size) : object 'frq' not found
This works:
df %>% group_by(grp) %>% sample_n(frq[1])
# A tibble: 9 x 3
# Groups: grp [3]
id grp frq
<int> <int> <dbl>
1 3 1 3
2 4 1 3
3 2 1 3
4 6 2 2
5 8 2 2
6 13 3 4
7 14 3 4
8 12 3 4
9 11 3 4
Not sure why it didn't work when you tried it.
library(tidyverse)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
set.seed(22)
df %>%
group_by(grp) %>% # for each group
nest() %>% # nest data
mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
unnest(v) # unnest the sampled values
# # A tibble: 9 x 2
# grp id
# <int> <int>
# 1 1 2
# 2 1 5
# 3 1 3
# 4 2 8
# 5 2 9
# 6 3 14
# 7 3 13
# 8 3 15
# 9 3 11
Function sample_n works if you pass as inputs a data frame of ids (not a vector of ids) and one frequency value (for each group).
An alternative version using map2 and generating the inputs for sample_n in advance:
df %>%
group_by(grp) %>% # for every group
summarise(d = list(data.frame(id=id)), # create a data frame of ids
frq = unique(frq)) %>% # get the unique frq value
mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
unnest(v) %>% # unnest sampled values
select(-frq) # remove frq column (if needed)
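Since sample_n() has been superseded by slice_sample() in dplyr 1.0, here is one more sketch of the same idea using group_modify(), which calls a function once per group with that group's rows as .x. It assumes dplyr >= 1.0 and that frq is constant within each group.

```r
library(dplyr)

set.seed(22)
df <- data.frame(id  = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

sampled <- df %>%
  group_by(grp) %>%
  # .x holds the rows of the current group; its first frq value is the sample size
  group_modify(~ slice_sample(.x, n = .x$frq[1])) %>%
  ungroup()

count(sampled, grp)
```

Each group contributes exactly frq rows (3, 2 and 4 here), for 9 rows in total.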
The following answer is not recommended, just shows a different approach without nests/maps that some people might find more comprehensible. Possibly of use to someone working with a smallish data set who wants to do something slightly different to the original question, is a bit scared or doesn't have time to play around with functions they don't really understand, and isn't too worried about efficiency. You just need to recall the behaviour of the original sample function in base R: when provided with a (positive) integer argument x, it outputs a vector randomly permuting the integers from 1:x.
> sample(5)
[1] 5 1 4 2 3
If we had five elements, we could then obtain a random sample of size three by only selecting the positions where 1, 2 and 3 were permuted - in this case we'd pick the second, fourth and fifth elements. All clear? Then similarly we can just do that within each group, assigning random integers from 1 to the group size, and choosing as our sample the places where the random id is less than or equal to the desired sample size for that group.
library(tidyverse)
# The iris data set has three different species
# I want to sample 2, 5 and 3 flowers respectively from each
sample_sizes <- data.frame(
Species = unique(iris$Species),
n_to_sample = c(2, 5, 3)
)
iris %>%
left_join(sample_sizes, by = "Species") %>% # adds column for how many to sample from this species
group_by(Species) %>% # each species is a group, the size of the group can be found by n()
mutate(random_id = sample(n())) %>% # give each flower in the group a random id between 1 and n()
ungroup() %>%
filter(random_id <= n_to_sample)
Which gave me the output:
# A tibble: 10 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species n_to_sample random_id
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <int>
1 4.9 3.1 1.5 0.1 setosa 2 1
2 5.7 4.4 1.5 0.4 setosa 2 2
3 6.2 2.2 4.5 1.5 versicolor 5 3
4 6.3 2.5 4.9 1.5 versicolor 5 2
5 6.4 2.9 4.3 1.3 versicolor 5 5
6 6 2.9 4.5 1.5 versicolor 5 4
7 5.5 2.4 3.8 1.1 versicolor 5 1
8 7.3 2.9 6.3 1.8 virginica 3 1
9 7.2 3 5.8 1.6 virginica 3 3
10 6.2 3.4 5.4 2.3 virginica 3 2
You can of course pipe through to select(-random_id, -n_to_sample) if you no longer have any use for the final two columns, but I left them in so it's clearer from the output how the code worked.
For the example data given in the question:
library(dplyr)
# example data
df <- data.frame(id = 1:15,
grp = rep(1:3,each = 5),
frq = rep(c(3,2,4), each = 5))
df %>%
group_by(grp) %>%
mutate(random_id = sample(n())) %>%
ungroup() %>%
filter(random_id <= frq) %>%
select(-random_id)
# A tibble: 9 x 3
id grp frq
<int> <int> <dbl>
1 1 1 3
2 2 1 3
3 3 1 3
4 8 2 2
5 9 2 2
6 11 3 4
7 12 3 4
8 13 3 4
9 15 3 4
NB if you're a safety fanatic and x might be zero, and you want to guarantee the length of the output is definitely the same as x, you're better to do sample(seq_len(x)) than sample(x). That way you get the zero-length vector integer(0) rather than the length-one vector 0 in the case where x is zero. In my code, the mutate will never be working on a row for which n() is zero (if n() were zero then that group is empty so there couldn't be a row there) and this isn't a problem. Just something to be aware of if you're taking this approach somewhere else.
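The difference described in that note is easy to check in base R:

```r
x <- 0

len_plain <- length(sample(x))           # sample(0) returns the single value 0
len_safe  <- length(sample(seq_len(x)))  # seq_len(0) is empty, so the result is integer(0)

len_plain  # 1
len_safe   # 0
```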
Benchmarks for comparison:
f1 <- function(df) { # @AntoniosK with nest and map
df %>%
group_by(grp) %>% # for each group
nest() %>% # nest data
mutate(v = map(data, ~sample_n(data.frame(id=.$id), unique(.$frq)))) %>% # sample using id values and (unique) frq value
unnest(v) # unnest the sampled values
}
f2 <- function(df) { # @AntoniosK with nest and map2
df %>%
group_by(grp) %>% # for every group
summarise(d = list(data.frame(id=id)), # create a data frame of ids
frq = unique(frq)) %>% # get the unique frq value
mutate(v = map2(d, frq, ~sample_n(.x, .y))) %>% # sample using data frame of ids and frq value
unnest(v) %>% # unnest sampled values
select(-frq) # remove frq column (if needed)
}
f3 <- function(df) { # @thc
df %>% group_by(grp) %>% sample_n(frq[1])
}
f4 <- function(df) { # @Silverfish
df %>%
group_by(grp) %>%
mutate(random_id = sample(n())) %>%
ungroup() %>%
filter(random_id <= frq) %>%
select(-random_id)
}
# example data of variable size
df_n <- function(n) {
data.frame(id = seq_len(3*n),
grp = rep(1:3,each = n),
frq = rep(c(3,2,4), each = n))
}
require(microbenchmark)
microbenchmark(f1(df_n(1e3)), f2(df_n(1e3)), f3(df_n(1e3)), f4(df_n(1e3)),
f1(df_n(1e6)), f2(df_n(1e6)), f3(df_n(1e6)), f4(df_n(1e6)),
times=20)
Results strongly favour @thc's df %>% group_by(grp) %>% sample_n(frq[1]), both for data frames with a few thousand rows and for data frames with a few million rows. My naive approach takes two or three times as long, and the faster of @AntoniosK's solutions is the one with nest and map2 (worse than mine for smaller data frames but better for the larger ones).
Unit: milliseconds
expr min lq mean median uq max neval
f1(df_n(1000)) 12.0007 12.27295 12.479760 12.34190 12.46475 13.6403 20
f2(df_n(1000)) 9.5841 9.82185 9.905120 9.87820 9.98865 10.2993 20
f3(df_n(1000)) 1.3729 1.53470 1.593015 1.56755 1.68910 1.8456 20
f4(df_n(1000)) 3.1732 3.21600 3.558855 3.27500 3.57350 5.4715 20
f1(df_n(1e+06)) 1582.3807 1695.15655 1699.288195 1714.13435 1727.53300 1744.2654 20
f2(df_n(1e+06)) 323.3649 336.94280 407.581130 346.95390 463.69935 911.6647 20
f3(df_n(1e+06)) 216.3265 235.85830 268.756465 247.63620 259.02640 395.9372 20
f4(df_n(1e+06)) 641.5119 663.03510 737.089355 682.69730 803.98205 1132.6586 20
I have a dataset that looks like the following:
id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5
I would like to perform several t.tests to compare the means for each factor in the S (samediff) condition to the means for that same factor in the D (samediff) condition.
I know I could do this in the following way:
dfgive<-filter(df, factor == "give")
t.test(value~samediff, dfgive)
dfimpact<-filter(df, factor == "impact")
t.test(value~samediff, dfimpact)
Is there a way to perform several t.tests in fewer lines? In the actual dataset, there are several more factors than are included here. I would like to be able to conduct all the t.tests necessary without creating separate dataframes in the same way I've shown above.
To augment existing answers, you can use broom::tidy to tidy the output from the t.test, e.g.
library(tidyverse)
library(broom)
df %>%
group_by(factor) %>%
summarise(ttest = list(t.test(value ~ samediff))) %>%
mutate(ttest = map(ttest, tidy)) %>%
unnest(cols = ttest) %>%
select(factor, estimate, estimate1, estimate2, p.value)
# # A tibble: 2 x 5
# factor estimate estimate1 estimate2 p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 give -0.5 2 2.5 0.712
# 2 impact 0 4.5 4.5 1
Here's a base-R approach:
results <- lapply(split(df, df$factor), function(X) {
out <- t.test(value ~ samediff, X)
data.frame(diff = out$statistic,
mean1 = out$estimate[1],
mean2 = out$estimate[2],
pval = out$p.value)
})
do.call(rbind, results)
# diff mean1 mean2 pval
# give -0.4472136 2.0 2.5 0.7117228
# impact 0.0000000 4.5 4.5 1.0000000
We can split the data by factor and apply t.test one by one. The final output is a list. We can access the result by lst$give or lst$impact.
library(tidyverse)
lst <- df %>%
split(.$factor) %>%
map(~t.test(value ~ samediff, .x))
DATA
df <- read.table(text = "id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5 ",
header = TRUE, stringsAsFactors = FALSE)
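If a compact table is more convenient than a list of htest objects, the pieces can be pulled out with map_dbl(). This is a sketch assuming purrr is installed; summary_tbl and its column names are introduced here for illustration.

```r
library(purrr)

df <- read.table(text = "id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5",
header = TRUE, stringsAsFactors = FALSE)

# One Welch t-test per factor, kept as a named list
lst <- map(split(df, df$factor), ~ t.test(value ~ samediff, data = .x))

# Extract the t statistic and p-value from each test into one data frame
summary_tbl <- data.frame(
  factor  = names(lst),
  t       = map_dbl(lst, ~ unname(.x$statistic)),
  p_value = map_dbl(lst, ~ .x$p.value),
  row.names = NULL
)

summary_tbl
```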
We can group by 'factor' and summarise the output of t.test in a list
library(dplyr)
out <- df %>%
group_by(factor) %>%
summarise(ttest = list(t.test(value ~ samediff)))
out
# A tibble: 2 x 2
# factor ttest
# <chr> <list>
#1 give <S3: htest>
#2 impact <S3: htest>
The output is stored in a list column which can be extracted with $ or [[
identical(out$ttest[[1]], t.test(value ~ samediff, dfgive))
#[1] TRUE