Why aren't stratum sample sizes maintained using rsample bootstraps?

[After seeing joran's comment below about the make_strata() function, I filed an issue with rsample on Github.]
I'm trying to take stratified bootstrap samples from a data frame. I want separate bootstrap samples to be taken within each stratum, so that the resulting bootstrap sample has the same number of observations in each stratum as the original data frame. However, that does not always happen when using the bootstraps() function of the rsample package. When I run this code:
library(rsample)
mydf <- data.frame(A=1:58, B=rep(1:4, c(6, 6, 23, 23)))
lboots <- bootstraps(mydf, times=3, strata="B")$splits
lbootsdf <- lapply(lboots, as.data.frame)
with(mydf, table(B))
lapply(lbootsdf, function(df) table(df$B))
These are the results I get:
B
1 2 3 4
6 6 23 23
$`1`
1 2 3 4
10 5 20 23
$`2`
1 2 3 4
3 8 24 23
$`3`
1 2 3 4
4 5 24 25
I was expecting to see 6 1's, 6 2's, 23 3's, and 23 4's in each of the three bootstrap samples.
How can I take the type of stratified bootstrap sample that I want?

This doesn't use rsample::bootstraps but instead constructs the bootstrap samples explicitly.
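library("dplyr")
library("tidyr")  # crossing() comes from tidyr, not dplyr
splits <- mydf %>%
  crossing(id = seq(2)) %>%          # one copy of the data per bootstrap replicate
  group_by(id, B) %>%
  sample_n(n(), replace = TRUE) %>%  # resample each stratum at its original size
  ungroup()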
lboots$splits[[id]]$data are copies of the original data.
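The same idea can be sketched in base R, without rsample or dplyr. This is only an illustration of resampling within each stratum; the helper name stratified_boot is hypothetical and not part of any package.

```r
set.seed(42)
mydf <- data.frame(A = 1:58, B = rep(1:4, c(6, 6, 23, 23)))

# Hypothetical helper: resample row indices with replacement inside each stratum
stratified_boot <- function(data, strata) {
  idx_by_stratum <- split(seq_len(nrow(data)), data[[strata]])
  picked <- lapply(idx_by_stratum, function(idx) {
    # index via sample.int() so a one-row stratum is not misread by sample()
    idx[sample.int(length(idx), replace = TRUE)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Three bootstrap replicates, each keeping the original stratum sizes
boots <- replicate(3, stratified_boot(mydf, "B"), simplify = FALSE)
lapply(boots, function(df) table(df$B))
```

Each replicate should show 6, 6, 23 and 23 rows for strata 1 to 4, because the draw size inside every stratum equals that stratum's original size.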

It doesn't look as if you're doing a bootstrap sample because you're not estimating the sampling distribution of a statistic. What it seems to me that you're trying to do is a stratified sample (i.e. instead of a simple random sample) of the data stored in mydf$A using mydf$B as the strata.
The package dplyr has a function that is purpose-built for this scenario, sample_frac:
library(dplyr)
mydf <- data.frame(A=1:58, B=rep(1:4, c(6, 6, 23, 23)))
data_grouped_by_stratum <- mydf %>% group_by(B)
data_sampled_by_stratum <- data_grouped_by_stratum %>% sample_frac(size=1, replace=TRUE)
# Now, a bit of cleanup on the resulting tibble object
df_of_data_sampled_by_stratum <- data_sampled_by_stratum %>% dplyr::ungroup() %>% as.data.frame()
In the call to sample_frac, size=1 means that the fraction of the rows to sample within each group is 1; i.e. 100% of the group's rows.
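In dplyr 1.0.0 and later, sample_frac() is superseded by slice_sample(). A minimal sketch of the same within-stratum resampling with the newer function, assuming a recent dplyr is installed:

```r
library(dplyr)

set.seed(123)
mydf <- data.frame(A = 1:58, B = rep(1:4, c(6, 6, 23, 23)))

# prop = 1 draws as many rows as each group has; replace = TRUE makes it a bootstrap draw
resampled <- mydf %>%
  group_by(B) %>%
  slice_sample(prop = 1, replace = TRUE) %>%
  ungroup()

table(resampled$B)
```

The table of B in the result should match the table of B in mydf, since each group is resampled at its own size.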

When I looked at the "B" component of the lboots object (made without subsetting splits), I saw consistency in the sampling fraction.
BUT: that is apparently not the designed outcome, as joran points out. It appears this is a package in early development, since the documentation is not in sync with the code:
maintainer("rsample")
[1] "Max Kuhn <max@rstudio.com>"
lboots <- bootstraps(mydf, times=3, strata="B")
str(lboots)
table(lboots$splits[['1']]$data$B)
1 2 3 4
6 6 23 23
> table(lboots$splits[['2']]$data$B)
1 2 3 4
6 6 23 23
> table(lboots$splits[['3']]$data$B)
1 2 3 4
6 6 23 23

Related

Group column values by a set numeric difference in R (big dataset)

I want a script to put column values in groups with a maximum range of 10, so that all values in a group are <10 different from each other. I'm trying to get a script that defines the groups, counts how many values are in each group, and finds the mean.
So if I had this df:
cat1 = c(85, 60, 60, 55, 55, 15, 0, 35, 35 )
cat2 = c("a","a","a","a","a","a","a","a","a")
df <- data.frame(cat1, cat2)
cat1 cat2
1 85 a
2 60 a
3 60 a
4 55 a
5 55 a
6 15 a
7 0 a
8 35 a
9 35 a
the output would be:
numValues cat1avg
1 85
4 57.5
1 15
1 0
2 35
I followed the top-rated answer here, but I'm getting weird outputs on some of my groups. Specifically, it doesn't seem like the script is properly adding the number of values in each group.
The only way I can think to do it is through for loops with a million if statements, and I have 1,000+ of these little dfs that I need to summarise.
I was also thinking of doing a fuzzy count. But I haven't been able to find anything about that anywhere either.
I also can't just cut up the cat1 range into groups of 10 and just allocate them all into a bin because it's less about the level and more about how close they are to each other.
Thanks!
I'd suggest working at this in the opposite order. Instead of assigning groups based on distance and seeing how many groups there are, we could specify a number of groups (k) and ask R to pick the most distinct clusterings, and compare how well those clusterings fit our purpose. There is a built-in algorithm in R, kmeans, to do this for us.
Let's say we expect between 1 and 6 groups. There are only 6 unique values when I run unique(cat1) so it can't be more than that. We can then use map from purrr in tidyverse to use each of 1:6 in a kmeans algorithm, and we can use augment from broom to extract the output from that in a tidy way.
library(tidyverse); library(broom)
kclusts <- tibble(k = 1:6) %>%
mutate(kclust = map(k, ~kmeans(cat1, .x)),
augmented = map(kclust, augment, df)
)
This will create a nested table with the results we want inside the augmented column. Let's pull those out:
assignments <- kclusts %>%
unnest(cols = c(augmented))
We could visualize these like so. Note that with k = 1, everything is in cluster 1. With k = 5, the 55 + 60s are paired. I think the trivial k = 6 case is just left out.
ggplot(assignments, aes(x = cat1, y = cat2)) +
geom_point(aes(color = .cluster), alpha = 0.8) +
facet_wrap(~ k)
We can compute the range within each cluster and find the widest cluster for each k. Dividing into four groups leaves at least one cluster with a range of 15 (see the k = 4 facet in the chart above), but five groups keep the within-cluster range at 5 or less, which satisfies the under-10 requirement. Because kmeans starts from random centers, the assignments can differ between runs.
assignments %>%
group_by(k, .cluster) %>%
summarize(range = max(cat1) - min(cat1)) %>%
summarize(max_range = max(range))
# A tibble: 6 × 2
k max_range
<int> <dbl>
1 1 85
2 2 35
3 3 30
4 4 15
5 5 5
6 6 0
And finally:
assignments %>%
filter(k == 5) %>%
group_by(.cluster) %>%
summarize(numValues = n(),
cat1avg = mean(cat1))
.cluster numValues cat1avg
<fct> <int> <dbl>
1 1 1 85
2 2 1 15
3 3 2 35
4 4 4 57.5
5 5 1 0
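If a deterministic grouping is preferred over clustering, one possible alternative is a greedy pass over the sorted values that opens a new group whenever a value is 10 or more above the first value of the current group. This is a sketch of a different technique from the kmeans approach above; the function name group_within and the width argument are assumptions made for illustration, and other valid partitions may exist for some data.

```r
cat1 <- c(85, 60, 60, 55, 55, 15, 0, 35, 35)

# Hypothetical helper: greedy grouping so every group spans less than `width`
group_within <- function(x, width = 10) {
  ord <- order(x)
  sorted <- x[ord]
  grp <- integer(length(x))
  current <- 0L
  start <- -Inf                      # smallest value of the current group
  for (i in seq_along(sorted)) {
    if (sorted[i] - start >= width) {
      current <- current + 1L        # value is too far away: open a new group
      start <- sorted[i]
    }
    grp[ord[i]] <- current           # store the label in the original row order
  }
  grp
}

grp <- group_within(cat1)
result <- data.frame(
  numValues = as.vector(table(grp)),
  cat1avg   = as.vector(tapply(cat1, grp, mean))
)
result
```

For the example data this gives groups of size 1, 1, 2, 4 and 1 with means 0, 15, 35, 57.5 and 85, the same groups as the expected output in the question (listed in ascending order). Since it is a single pass over a sorted vector, it should also be cheap to apply to many small data frames.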

sample function in R

I have just started learning R using RStudio and I have, perhaps, some basic questions.
One of them regards the "sample" function.
More specifically, my dataset consists of 402224 observations of 147 variables. My task is to take a sample of 50 observations and then produce a dataframe and so on.
But when the function sample is executed
y = sample(mydata, 50, replace = TRUE, prob = NULL)
the result is a dataset with 402224 observations of 50 variables. That is, the sampling is done over variables and not observations.
Do you have any idea why this happens?
Thank you in advance.
If you want to create a data frame of 50 observations with replacement from your data frame, you can try:
mydata[sample(nrow(mydata), 50, replace=TRUE), ]
Alternatively, you can use the sample_n function from the dplyr package:
sample_n(mydata, 50)
The other answers people have been giving are to select rows, but it looks like you are after columns. You can still accomplish this in a similar way.
Here's a sample df.
df = data.frame(a = 1:5, b = 6:10, c = 11:15)
> df
a b c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
Then, to randomly select 2 columns and all observations we could do this
> df[ , sample(1:ncol(df), 2)]
c a
1 11 1
2 12 2
3 13 3
4 14 4
5 15 5
So, what you'll want to do is something like this
y = mydata[ , sample(1:ncol(mydata), 50)]
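That is because sample() treats its first argument as a vector, and a data frame is a list of columns, so its elements are the columns rather than the rows.
To sample rows instead, try the following:
library(data.table)
set.seed(10)
dt_sample <- as.data.table(mydata)
dt_sample[sample(.N, 50)]  # .N is the number of rows; draw 50 of them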
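The column-versus-row behavior described in these answers can be checked on a small data frame. A minimal base R sketch, using a toy data frame rather than the asker's data:

```r
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# A data frame is a list of columns, so its length is the number of columns
length(df)

set.seed(1)
# Passing the data frame itself to sample() therefore picks columns
cols <- sample(df, 2)

# Sampling row indices and subsetting picks observations
rows <- df[sample(nrow(df), 3), ]

dim(cols)  # 5 rows, 2 columns
dim(rows)  # 3 rows, 3 columns
```

This is why sample(mydata, 50) returned all observations of 50 variables: the 50 sampled elements were columns.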

How to re-arrange a data.frame

I am interested in re-arranging a data.frame in R. Bear with me as I stumble through a reproducible example.
I have a nominal variable which can have one of two values. Currently this nominal variable is a column. Instead I would like to have two columns, representing the two values this nominal variable can have. Here is an example data frame. s is the nominal variable with values t and c.
n <- c(1,1,2,2,3,3,4,4)
s <- c("t","c","t","c","t","c","t","c")
b <- c(11,23,6,5,12,16,41,3)
mydata <- data.frame(n, s, b)
I would rather have a data frame that looked like this
n.n <- c(1,2,3,4)
trt <- c(11,6,12,41)
cnt <- c(23,5,16,3)
new.data <- data.frame(n.n, trt, cnt)
I am sure there is a way to use mutate or possibly tidyr but I am not sure what the best route is and my data frame that I would like to re-arrange is quite large.
You want spread from tidyr:
library(dplyr)
library(tidyr)
new.data <- mydata %>% spread(s,b)
n c t
1 1 23 11
2 2 5 6
3 3 16 12
4 4 3 41
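In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A sketch of the same reshaping with the newer function, assuming a recent tidyr is installed:

```r
library(tidyr)

mydata <- data.frame(
  n = c(1, 1, 2, 2, 3, 3, 4, 4),
  s = c("t", "c", "t", "c", "t", "c", "t", "c"),
  b = c(11, 23, 6, 5, 12, 16, 41, 3)
)

# One row per n, one column per level of s, filled with the values of b
wide <- pivot_wider(mydata, names_from = s, values_from = b)
wide
```

The result should have one row per value of n, with the t and c values of b in separate columns.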
How about unstack(mydata, b~s):
c t
1 23 11
2 5 6
3 16 12
4 3 41

How to omit rows of two highest and the lowest value by group in R

This seems a very basic question, but I just can't seem to find the solution.
How do you remove the (three) rows of the two highest and the lowest values of a variable by several factors in R? I have modified the airquality data a little to get an example (sorry, I am still a beginner):
set.seed(1)
airquality$var1 <- c(sample(1:3, 153, replace=T))
airquality$var2 <- c(sample(1:2, 153, replace=T))
airquality2 <- airquality
airquality2$Solar.R <- as.numeric(airquality2$Solar.R)
airquality2$Solar.R <- airquality2$Solar.R*2
airquality3 <- airquality
airquality3$Solar.R <- as.numeric(airquality3$Solar.R)
airquality3$Solar.R <- airquality3$Solar.R*2.5
test <- round(na.omit(rbind(airquality, airquality2, airquality3)))
test$var1 <- factor(test$var1)
test$var2 <- factor(test$var2)
head(test)
Which comes to:
head(test)
# Ozone Solar.R Wind Temp Month Day var1 var2
# 1 41 190 7 67 5 1 1 1
# 2 36 118 8 72 5 2 2 2
# 3 12 149 13 74 5 3 2 1
# 4 18 313 12 62 5 4 3 2
# 7 23 299 9 65 5 7 3 1
# 8 19 99 14 59 5 8 2 1
Now I would like to remove the rows with the two highest and the lowest values of Solar.R with something like group_by(Month, var1, var2). Since there are 30 factor combinations (5*3*2), 90 rows should be omitted. The rest of the data should stay the same. I looked at Min & Max, but could not get it to work. Any help would be gladly appreciated.
I think you're looking for slice:
library("dplyr")
sliced =
test %>%
group_by(Month, var1, var2) %>% # group
arrange(Solar.R) %>% # within-group, order by Solar.R
slice(3:(n() - 2)) # keep the 3rd through the 3rd-to-last row
nrow(sliced)
# [1] 233
Edit: I had 3:(n() - 3) at first, corrected to 3:(n() - 2). A nice sanity check is to think of (1:10)[3:(10 - 3)] vs (1:10)[3:(10 - 2)]. I didn't bother to read your simulation code, but when I checked things out with n_groups() I saw 27 groups, not 30 as stated in your question. (Perhaps a seed issue; with rawr's set.seed(1) there are 28 groups.)
More edits: Based on your edit, it looks like you want to omit the lowest value and the two highest values rather than the two lowest and two highest. Simply change 3:(n() - 2) to 2:(n() - 2) to make that adjustment.
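The 2:(n() - 2) adjustment can be checked on a small toy data frame. This is a sketch with made-up data (the names toy, g and x are illustrative only), and it sorts explicitly within each group using .by_group = TRUE:

```r
library(dplyr)

toy <- data.frame(
  g = rep(c("a", "b"), each = 6),
  x = c(5, 1, 9, 3, 7, 2, 10, 40, 20, 60, 30, 50)
)

trimmed <- toy %>%
  group_by(g) %>%
  arrange(x, .by_group = TRUE) %>%  # sort ascending inside each group
  slice(2:(n() - 2)) %>%            # drop the lowest row and the two highest rows
  ungroup()

trimmed
```

Group a keeps 2, 3 and 5, and group b keeps 20, 30 and 40. Groups with fewer than four rows would need separate handling, because 2:(n() - 2) then counts downward instead of selecting a middle range.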
Here is a data.table way of doing this, but I guess dplyr would be more verbose.
require(data.table)
set.seed(1)
airquality$var1 <- c(sample(1:3, 153, replace=T))
airquality$var2 <- c(sample(1:2, 153, replace=T))
airquality2 <- airquality
airquality2$Solar.R <- as.numeric(airquality2$Solar.R)
airquality2$Solar.R <- airquality2$Solar.R*2
airquality3 <- airquality
airquality3$Solar.R <- as.numeric(airquality3$Solar.R)
airquality3$Solar.R <- airquality3$Solar.R*2.5
test <- round(na.omit(rbind(airquality, airquality2, airquality3)))
test$var1 <- factor(test$var1)
test$var2 <- factor(test$var2)
dt_test <- as.data.table(test)
dt_test[,.SD[order(-Solar.R)][c(3:(.N-1))],.(Month,var1,var2)]
We can also use .I to get the row index in data.table and then subset it based on that.
library(data.table)
i1 <- setDT(test)[order(-Solar.R), .I[3:(.N-1)],.(Month, var1, var2)]$V1  # descending order, so this drops the two highest and the lowest
test[i1]

Reverse Scoring Items

I have a survey of about 80 items. Most items are positively valenced (higher scores indicate a better outcome), but about 20 are negatively valenced, and I need to find a way to reverse score those in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in a direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting @HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
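Both 6 - x and max(.) + 1 - . assume that the scale starts at 1, and the second also assumes that the highest possible score actually occurs in the data. A more general form is minimum + maximum - x, using the bounds of the scale itself. A small base R sketch; the function name reverse_score and its arguments lo and hi are hypothetical:

```r
# Hypothetical helper: reverse a score on a scale running from lo to hi
reverse_score <- function(x, lo, hi) {
  lo + hi - x
}

reverse_score(c(1, 2, 5), lo = 1, hi = 5)  # 5 4 1
reverse_score(c(0, 3, 4), lo = 0, hi = 4)  # 4 1 0
```

With lo = 1 and hi = 5 this reduces to 6 - x, matching the example above.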
Here is an alternative approach using the psych package. If you are working with survey data this package has lots of good functions. Building on @eipi10's data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1=sample(1:5,10,replace=TRUE),
Q2=sample(1:5,10,replace=TRUE),
Q3=sample(1:5,10,replace=TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the psych package
# mini is the minimum value and maxi is the maximum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys,original_data,mini=1,maxi=5)
new_data
The pro to this approach is that you can recode your entire survey in one function. The con to this is you need a library. The stock R approach is more elegant as well.
FYI, this is my first post on stack overflow. Long time listener, first time caller. So please give me feedback on my response.
Just converting @eipi10's answer using tidyverse:
library(dplyr)  # provides %>% and mutate()
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5,10, replace=TRUE),
Q2 = sample(1:5,10, replace=TRUE),
Q3 = sample(1:5,10, replace=TRUE))
# Reverse scores in the desired columns (Q2 and Q3)
dat <- dat %>%
mutate(Q2Reversed = 6 - Q2,
Q3Reversed = 6 - Q3)
Another example is to use recode in library(car).
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse questions Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by @eipi10 and the same goal of reversing q1 and q3:
set.seed(1)
dat <- data.frame(q1 =sample(1:5,10,replace=TRUE),
q2=sample(1:5,10,replace=TRUE),
q3 =sample(1:5,10,replace=TRUE))
You can use the reverse.code() function. The first argument is the keys. This is a vector of 1 and -1. -1 means that you want to reverse that item. These go in the same order as your data.
The second argument, called items, is simply the name of your dataset. That is, where are these items located?
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments as NULL and the function will use the lowest and highest values in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name to have a - behind it (i.e., q1 becomes q1- after being reverse scored).
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(dplyr)     # provides mutate()
library(magrittr)  # provides the %<>% assignment pipe
max <- 5
df <- data.frame(score=sample(1:max, 20, replace=TRUE))
df <- mutate(df, question = rownames(df))
df
df[c(4,13,17),] %<>% mutate(score = max + 1 - score)
df
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev){
# get maximum value per desired col: lapply(data[rev], max)
# subtract values in cols to reverse-code from max value plus 1
data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by a response from @catastrophic-failure relating to subtracting the max of a column from all entries in that column in R.
