Random sampling with different strata - r

I have a dataset from which I want to select a random sample of rows, following some pre-defined rules. This may be a very basic question, but I am very new to this and still trying to grasp the basic concepts. My dataset includes some 330 rows of data (I have included a simplified version here) with several columns. I want to sample 50 rows out of the 330 (I kept these numbers in the mock dataset for simplicity, as this is part of the problem I am having), with the option to add the predefined rules to the sampling process.
Here is a mock version of the data:
bank <- data.frame(matrix(0, nrow = 330, ncol = 5))
colnames(bank) <- c("id", "var1", "var2", "year", "lo")
bank$id   <- 1:330
bank$var1 <- sample(letters[1:5], 330, replace = TRUE)
bank$var2 <- sample(c("s", "r"), 330, replace = TRUE)
bank$year <- sample(2010:2018, 330, replace = TRUE)
bank$lo   <- sample(c("lo1", "lo2", "lo3", "lo4", "lo5", "lo6"), 330, replace = TRUE)
The code I used to try to sample the correct number of rows is:
library(splitstackshape)
x <- splitstackshape::stratified(indt = bank, group = c("var1", "var2", "year", "lo"), size = 0.151)
However this is not selecting 50 rows. I had initially tried to define size=50 but I got the following error:
Groups b s 2012 lo4,... [there is a very long list here],...contain fewer rows than requested. Returning all rows.
Then I tried to define size as a percentage: 0.151 (15.1%?), which should give 50 out of 330, but that samples 5 rows (I tried 0.5 and it samples 44 rows, and if I try 0.500000001 it samples 287 rows???).
What am I missing? For the moment I am stuck here.
Once I manage to sample the correct number of rows (50), I would like to define some rules, like: only up to 50% of the sample can have bank$year == 2018, AND only up to half of the bank$year == 2018 rows can have bank$var2 == "r". Obviously I don't expect someone to do this for me, but could you please provide some advice on
1- Why am I getting the wrong number of rows (probably just syntax?)
2- What package should I look into if splitstackshape::stratified() is not the best or a good choice to achieve this?
Many thanks!
M

I think the issue comes from the fact that your dataset (as you've shared here) is fairly small, you have a large number of strata (5 letters × 2 levels of var2 × 9 years × 6 lo categories = 540 combinations), and it's just not possible to take samples of the desired size from within each stratum. When I bump the sample size up to 33,000 and take a sample of 15.1%, I get a sample of size 4,994. Setting size = 50 requests a sample of size 50 from each stratum, which is not remotely possible with the data you've shared.
> bank<-data.frame(matrix(0,nrow=33000,ncol=5))
> colnames(bank)<-c("id","var1","var2","year","lo")
> bank$id<-c(1:33000)
> bank$var1<-sample(letters[1:5],33000,replace=T)
> bank$var2<-sample(c("s","r"),33000,replace=T)
> bank$var3<-sample(2010:2018,33000,replace=T)
> bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),33000,replace=T)
>
> k <- stratified(bank, group = c('var1', 'var2', 'var3', 'lo'), size = .151)
> dim(k)
[1] 4994 6
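To see this on the question's 330-row mock data, you can count how many rows each stratum actually contains; with up to 540 possible strata and only 330 rows, most non-empty strata hold one or two observations, so neither size = 50 per stratum nor an exact overall 15.1% is achievable. A quick check (a sketch, assuming the 330-row mock bank data from the question):

# tabulate the number of rows in each var1 x var2 x year x lo stratum
strata_counts <- table(interaction(bank$var1, bank$var2, bank$year, bank$lo, drop = TRUE))
summary(as.vector(strata_counts))   # most non-empty strata contain only one or two rows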

Another approach, adapted from an example by Jenny Bryan, samples a specified number of rows from each group: nest the data by group, supply a vector n of per-group sample sizes, and samp holds the randomized sample drawn from each group. The values in n will need to be adjusted to the proportion you want from each group:
library(dplyr)
library(purrr)
library(tidyr)

bank %>%
  group_by(var1) %>%
  nest() %>%
  mutate(n = c(7, 0, 9, 1, 13),              # desired sample size for each var1 group
         samp = map2(data, n, sample_n)) %>%
  select(var1, samp) %>%
  unnest(cols = samp)
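As for the pre-defined rules, one simple interpretation treats the caps as targets: draw the capped number of year == 2018 rows (with the var2 == "r" sub-cap) first, then fill the rest of the 50 from the remaining years. This is only a sketch, assuming the 330-row mock bank data from the question; the object names are illustrative:

set.seed(1)
n_total  <- 50
max_2018 <- n_total / 2                                    # rule 1: at most 50% of the sample from 2018

rows_2018_r <- which(bank$year == 2018 & bank$var2 == "r")
rows_2018_s <- which(bank$year == 2018 & bank$var2 != "r")
rows_other  <- which(bank$year != 2018)

n_2018   <- min(max_2018, length(rows_2018_r) + length(rows_2018_s))
n_2018_r <- min(floor(n_2018 / 2), length(rows_2018_r))    # rule 2: at most half of the 2018 rows are "r"
n_2018_s <- min(n_2018 - n_2018_r, length(rows_2018_s))

take <- c(sample(rows_2018_r, n_2018_r),
          sample(rows_2018_s, n_2018_s),
          sample(rows_other,  n_total - n_2018_r - n_2018_s))
mysample <- bank[take, ]
nrow(mysample)   # should be 50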


Identical values generated from random samples from a uniform distribution in dplyr

This is a follow-up to a previous question. My question was not fully formulated and therefore not fully answered in my last post. Forgive me, I'm new to using Stack Overflow.
My professor has assigned a problem set, and we are required to use dplyr and other tidyverse packages. I'm well aware that most (if not all) of the tasks I'm trying to execute are possible in base R, but that's not in line with my instructions.
First we are asked to generate a tibble of 1000 random samples from a uniform distribution:
2a. Create a new tibble called uniformDf containing a variable called unifSamples that contains 10000 random samples from a uniform distribution. You should use the runif() function to create the uniform samples. {r 2a}
uniformDf <- tibble(unifSamples = runif(1000))
This goes well.
Then we are asked to loop through this tibble 1000 times, each time choosing 20 random samples, computing their mean, and saving it to a tibble:
2c. Now let's loop through 1000 times, sampling 20 values from a uniform distribution and computing the mean of the sample, saving this mean to a variable called sampMean within a tibble called uniformSampleMeans. {r 2c}
unif_sample_size = 20 # sample size
n_samples = 1000 # number of samples
# set up a data frame to contain the results
uniformSampleMeans <- tibble(sampMean=rep(NA,n_samples))
# loop through all samples. for each one, take a new random sample,
# compute the mean, and store it in the data frame
for (i in 1:n_samples){
uniformSampleMeans$sampMean[i] <- uniformDf %>%
sample_n(unif_sample_size) %>%
summarize(sampMean = mean(sampMean))
}
This all runs well, I believe, until I look at my uniformSampleMeans tibble, which looks like this:
1 0.471271611726843
2 0.471271611726843
3 0.471271611726843
4 0.471271611726843
5 0.471271611726843
6 0.471271611726843
7 0.471271611726843
...
1000 0.471271611726843
All the values are identical! Does anyone have any insight as to why my output is like this? I'd be less concerned if they varied by +/- 0.000x values seeing as how this is from a distribution that ranges from 0 to 1 but the values are all identical even out to the 15th decimal place! Any help is much appreciated!
The following selects unif_sample_size random rows and gives their mean:
library(dplyr)
uniformDf %>% sample_n(unif_sample_size) %>% pull(unifSamples) %>% mean
#[1] 0.5563638
If you want to repeat this n times, use replicate:
n <- 10
replicate(n, uniformDf %>%
sample_n(unif_sample_size) %>%
pull(unifSamples) %>% mean)
#[1] 0.5070833 0.5259541 0.5617969 0.4695862 0.5030998 0.5745950 0.4688153 0.4914363 0.4449804 0.5202964
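If you'd rather keep the question's for-loop structure, the key fix is to reduce the piped result to a plain number (e.g. with pull()) before assigning it into the tibble, and to take the mean of unifSamples rather than of the not-yet-existing sampMean column. A sketch along those lines:

library(dplyr)
library(tibble)

unif_sample_size <- 20
n_samples <- 1000
uniformDf <- tibble(unifSamples = runif(1000))
uniformSampleMeans <- tibble(sampMean = rep(NA_real_, n_samples))

for (i in 1:n_samples) {
  uniformSampleMeans$sampMean[i] <- uniformDf %>%
    sample_n(unif_sample_size) %>%
    pull(unifSamples) %>%
    mean()
}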

Confirm successful 'extend' using simr package in R

I'm using the 'extend' function in simr, but I want to be able to confirm that it has appropriately extended the data set as I wanted it to. Is there a function I can use to show me the data set it has created?
I have a dataset including 17 participants in each of 2 groups. Each participant provided two ratings at each of 8 time points, so that I now have variables of participant (id), the difference between the two ratings (my dependent variable, rating_diff), time (8 levels) and group (2 levels, neutral and threat). As I understand it, id is nested within group.
I constructed the following model and calculated the power to detect an interaction between time and group:
library(lme4)
library(simr)

model_es <- lmer(rating_diff ~ time + group + time*group + (1|id),
                 data = data)
fixef(model_es)['time:groupthreat'] <- -0.16
interaction_power0 <- powerSim(model_es, nsim = 100,
                               test = fcompare(rating_diff ~ time + group))
# Power given varies between 86% and 93%, which is too high.
I now want to 'extend' the model to determine the power with only 15 participants in each group. First, I checked the number of rows in my existing dataset:
nrow(getData(model_es)) # gives 252 rows
I worked out that altering the dataset to 15 participants per group should yield 220 rows.
First, I thought I ought to be extending within id+group, but that gives too many rows:
model_es_extend0 <- extend(model_es, within = 'id+group', n=30)
nrow(getData(model_es_extend0)) # 954 rows
I tried extending along id instead:
model_es_extend1 <- extend(model_es, along = 'id', n=30)
nrow(getData(model_es_extend1)) #220 rows
This clearly gives the correct number of rows, but how can I verify that there are 15 participants per group, rather than 17 still in one group and 13 in the other?
You should be able to check with:
xtabs(~ group + time, data=getData(model_es_extend1))
I suspect the extend command you want is:
model_es_extend2 <- extend(model_es, within = 'time+group', n=15)
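To check the number of distinct participants per group directly, you can also count unique ids within each group of the extended data (a sketch; works for any of the extended models above):

d_ext <- getData(model_es_extend1)
tapply(d_ext$id, d_ext$group, function(x) length(unique(x)))   # distinct participants per group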

Resampling in R

Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(200,200,3000), x2=c(200,200,1000), y=c(100,100,3))
So I have 3 observations.
Now I want to select 2 observations randomly out of d, three times (without repetition; there are three combinations in total). For each of these three times I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get three numbers, of which I want to calculate the average. Can someone show how to do this with a loop in R?
Example:
There are three combinations in total, so in this case I can only end up with the same three selections. If I do the calculation manually, I get the following three results:
0.977 0.977 1
(The results could of course be in another order.)
And the mean of these three numbers is:
0.984
This is a simple example. In my real case there are many combinations and I don't select all of them (e.g. there could be, say, 1,000,000 combinations, of which I only select 1,000).
I think it's better to use sample.int and replicate instead of going through all the combinations; see my example:
nsample <- 2 # Number of selected observations
nboot <- 10 # Number of times you repeat the process
replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
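The replicate() call returns a numeric vector of length nboot, so the average you are after is simply its mean. A self-contained sketch using the data from the question:

library(Benchmarking)
d <- data.frame(x1 = c(200, 200, 3000), x2 = c(200, 200, 1000), y = c(100, 100, 3))
nsample <- 2
nboot   <- 10

res <- replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
                             weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
mean(res)   # average of the nboot weighted-mean efficiency scores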
I also checked the link you brought up regarding this issue. If I got it right, you want to extract two rows (observations) each time without replacement; you can use sample:
SelObs <- sample(1:nrow(d),2)
# for getting the selected observations just
dSel <- d[SelObs,]
And then do your calculations
If you want the already selected observations not to be selected in the next random selection, it is similar, but you need an index:
Obs <- 1:nrow(d)
SelObs <- sample(Obs, 2)
dSel <- d[SelObs, ]
# and now, for removing those already selected
Obs <- Obs[-SelObs]
# and keep going with next random selections and the above code
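If the random selections must also avoid repeating a combination that was already drawn (relevant when you sample, say, 1,000 of 1,000,000 possible combinations), one option is to enumerate the combinations with combn() and sample from those, as long as the full set fits in memory. A sketch using the question's data:

library(Benchmarking)
d <- data.frame(x1 = c(200, 200, 3000), x2 = c(200, 200, 1000), y = c(100, 100, 3))
nsample <- 2    # rows per draw
ndraw   <- 3    # number of distinct combinations to evaluate

combos <- combn(nrow(d), nsample)                        # one column per possible combination
sel    <- sample(ncol(combos), min(ndraw, ncol(combos)))

res <- apply(combos[, sel, drop = FALSE], 2, function(idx) {
  di <- d[idx, ]
  weighted.mean(eff(dea(di[c("x1", "x2")], di$y)), di$y)
})
mean(res)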

Compute new column based on values in current and following rows with dplyr in R

I have a big dataset (10+ Mil x 30 vars) and I am trying to compute some new variables based on complicated interactions of the current ones. For clarity I am including only the important variables in the question. I have the following code in R, but I am interested in other views and opinions. I am using the dplyr package to compute new columns based on current/following row values of 3 other columns (more explanation below the code).
I am wondering if there is a way to make this faster and more efficient, or maybe to completely rewrite it...
library(dplyr)
library(magrittr)   # for %<>%

# the main function - data is a dataframe, windowSize and ratio are ints
computeNewColumn <- function(data, windowSize, ratio){

  # helper function used in the second mutate down...
  # all args are ints, it returns a boolean
  windowAhead <- function(timeTo, window, reduction){
    # subset the original dataframe - only observations with values of
    # TimeToGo between timeTo-1 and window (basically the following X rows
    # from the current one)
    subframe <- data[(timeTo - 1 >= data$TimeToGo & data$TimeToGo >= window), ]
    isthere <- any(subframe$Price < reduction)
    return(isthere)
  }

  # I group by value of ID first and order by TimeToGo...
  data %<>% group_by(ID) %>%
    arrange(desc(TimeToGo)) %>%
    # ...create two new columns from simple interactions of existing ones...
    mutate(Window = ifelse(TimeToGo > windowSize, TimeToGo - windowSize, 0),
           Reduction = floor(Price - (ratio * Price))) %>%
    rowwise() %>%
    # ...now comes the more complex stuff - I want to compute a third column
    # depending on the next (TimeToGo - Window) number of values of Price
    mutate(Advice = ifelse(windowAhead(TimeToGo, Window, Reduction), 1, 0))

  return(data)
}
We have a dataset with the following columns: ID, Price, TimeToGo.
We first group by values of ID and compute two new columns based on current row values (Window from TimeToGo and Reduction from Price). The next thing we would like to do is compute a third column based on
1. the current value of Reduction
2. the next (TimeToGo - Window) values of Price in the dataframe.
I am wondering if there is a simple way to reference upcoming values of a column from within mutate(). I am ideally looking for a sliding-window function on one column, where the limits of the window are set from two other column values in the current row. My solution for now uses a custom function which manually subsets the original dataframe, does a comparison and returns a value back to the mutate() call. Any help and ideas would be much appreciated!
P.S. Here's a sample of the data; please let me know if you need any more info. Thanks!
> a
ID TimeToGo Price
1 AQSAFOTO30A 96 19
2 AQSAFOTO20A 95 19
3 AQSAFOTO30A 94 17
4 AQSAFOTO20A 93 18
5 AQSAFOTO25A 92 19
6 AQSAFOTO30A 91 17
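For reference, here is one way the windowed check could be expressed without rowwise() and without re-subsetting the full dataframe for every row: build Window and Reduction as whole columns and then flag each row by comparing against the group's own Price vector. This is only a sketch under the question's assumed semantics (look at rows of the same ID with TimeToGo between Window and TimeToGo - 1); the function name is illustrative:

library(dplyr)

compute_advice <- function(data, windowSize, ratio) {
  data %>%
    group_by(ID) %>%
    arrange(desc(TimeToGo), .by_group = TRUE) %>%
    mutate(
      Window    = pmax(TimeToGo - windowSize, 0),
      Reduction = floor(Price - ratio * Price),
      # for each row, check whether any Price of the same ID with TimeToGo
      # in [Window, TimeToGo - 1] falls below that row's Reduction
      Advice = sapply(seq_len(n()), function(i) {
        idx <- TimeToGo <= TimeToGo[i] - 1 & TimeToGo >= Window[i]
        as.integer(any(Price[idx] < Reduction[i]))
      })
    ) %>%
    ungroup()
}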

R: Shapiro test by group won't produce p-values and corrupt data frame warning

This question has been asked before, but the solutions posed only partially solve my problem, and I've been working on this for days now. I felt it was time to seek help, even if the topic has been addressed previously. I apologize for any inconvenience.
I have a very large data.frame in R with 6288 observations of 11 variables. I want to run a Shapiro test by group on each variable, but grouped by two different factors (Number and Treatment). A much reduced sample data set with one variable is provided for example:
data <- data.frame(Number=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
Treatment=c("High","High","High","High","High","High","Low",
"Low","Low","Low","Low","Low","High","High","High",
"High","High","High","Low","Low","Low","Low","Low",
"Low"),
FW=c(746,500,498,728,626,580,1462,738,1046,568,320,578,654,664,
660,596,1110,834,486,548,688,776,510,788))
I want to run a Shapiro test on FW by Number and by Treatment, so I'd have a test for 1High, 1Low, 2High, 2Low, etc. I'd like to have data for both the W statistic and the P-value. The original dataset contains 16 observations per group (1High,1Low,etc.; total groups=400), and an occasional NA; this sample dataset contains 6 observations per group (1High, 1Low, 2High, 2Low; groups=4).
The following code was previously posted as a solution to this problem of shapiro tests by groups:
res<-aggregate(cbind(P.value=data$FW)~data$Number+data$Treatment,data,FUN=shapiro.test)
I've also experimented with a number of other ways of grouping, but nothing seems to work. The above code comes closest.
The code above using aggregate groups my data appropriately, and gives me the W statistic, but it won't give me the P value (the column header says "P.value", but this is not the P value, it's the W statistic, I've confirmed this several ways). It also gives me the following warning message:
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
When I did a Google search for this warning, the results suggest it is a bug in the data.frame, but I can't figure out how to solve it. I'm not even sure it really is a bug in this case.
Can anyone help by providing some insight into the warning message, or another way to do the Shapiro test by group?
You're getting that warning because shapiro.test returns a list and aggregate expects the result of the aggregation to be a vector or a single number.
aggregate sees the list, takes the first element of the list by default, and tells you why it's unhappy (in admittedly vague terms). But it still gives you the Shapiro-Wilk statistic since that's the first element of the list returned from shapiro.test.
You can make a slight modification to your existing code that will get you what you want without issue:
aggregate(formula = FW ~ Number + Treatment,
          data = data,
          FUN = function(x) { y <- shapiro.test(x); c(y$statistic, y$p.value) })
#   Number Treatment       FW.W      FW.V2
# 1      1      High 0.88995051 0.31792857
# 2      2      High 0.78604502 0.04385663
# 3      1       Low 0.93305840 0.60391888
# 4      2       Low 0.86456934 0.20540230
Note that the rightmost columns correspond to the statistic and p-value.
This is directly extracting the statistic and p-value from the list, thereby making the result of aggregation a single vector, which makes aggregate happy.
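One wrinkle: when FUN returns a vector, aggregate stores the two values in a single matrix column, which is why the printout shows FW.W and FW.V2 under one FW heading. If you want ordinary columns, a common idiom is to flatten the result afterwards (a sketch, reusing the call above):

res <- aggregate(formula = FW ~ Number + Treatment,
                 data = data,
                 FUN = function(x) { y <- shapiro.test(x); c(W = unname(y$statistic), p.value = y$p.value) })
res <- do.call(data.frame, res)   # splits the matrix column into res$FW.W and res$FW.p.value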
Another option would be to use the data.table package, available from CRAN.
library(data.table)
DT <- data.table(data)
DT[,
   .(W = shapiro.test(FW)$statistic, P.value = shapiro.test(FW)$p.value),
   by = .(Number, Treatment)]
#    Number Treatment         W    P.value
# 1:      1      High 0.8899505 0.31792857
# 2:      1       Low 0.9330584 0.60391888
# 3:      2      High 0.7860450 0.04385663
# 4:      2       Low 0.8645693 0.20540230
The dplyr package is handy for groupwise operations:
library(dplyr)
data %>%
  group_by(Number, Treatment) %>%
  summarise(statistic = shapiro.test(FW)$statistic,
            p.value = shapiro.test(FW)$p.value)
  Number Treatment statistic    p.value
1      1      High 0.8899505 0.31792857
2      1       Low 0.9330584 0.60391888
3      2      High 0.7860450 0.04385663
4      2       Low 0.8645693 0.20540230
The simple dplyr answer didn't do it for me, as it did not run the Shapiro test on each grouped variable but only did it once, so here's my own solution using nesting:
library(dplyr)
library(purrr)
library(tidyr)
library(rstatix)   # provides shapiro_test()

# groupvar and quantvar hold the names of the grouping and measured columns
# as strings, e.g. groupvar <- "Treatment"; quantvar <- "FW" for the data above
shapiro <- data %>%
  group_by(!!sym(groupvar)) %>%
  group_nest() %>%
  mutate(shapiro = map(.data$data, ~ shapiro_test(.x, !!sym(quantvar)))) %>%
  select(-data) %>%
  unnest(cols = shapiro) %>%
  print()
