Subsampling from a set with the assumption that each member is picked at least once, in R

I need code or an idea for the following case: I have a dataset of 1000 rows and I want to subsample 800 of its rows multiple times (I don't know how many repetitions are needed).
How can I make sure that every member is picked in at least one run? I need the code in R.
To make the question clearer, let's define the row names as:
rownames(dataset) = A,B,C,D,E,F,G,H,J,I
if I subsample 3 times:
A,B,C,D,E,F,G,H
D,E,A,B,H,J,F,C
F,H,E,A,B,C,D,J
Row I does not appear in any of the subsample sets. I would like to subsample 80 or 90 percent of the data many times, but I expect every row to be chosen in at least one of the subsample sets. In the example above, element I should therefore appear in at least one of the subsamples.

One way to do this is to use random sampling without replacement to designate a set of "forced" picks: give each row a single guaranteed appearance and decide ahead of time which subsample that guaranteed appearance falls in. Then fill the rest of each subsample randomly.
num_rows = 1000
num_subsamples = 1000
subsample_size = 900
full_index = 1:num_rows
dat = data.frame(i = full_index)
# Randomly assign the subsample in which each row is guaranteed to appear.
# Make sure that we don't accidentally assign more guarantees to one subsample
# than the subsample size allows. If we're subsampling 90% of the data,
# this takes at most a few tries.
biggest_guaranteed_subsample = num_rows
while (biggest_guaranteed_subsample > subsample_size) {
  # Assign the subsample that each row is guaranteed to appear in
  dat$guarantee = sample(1:num_subsamples, size = num_rows, replace = TRUE)
  # Find the subsample with the most guaranteed slots taken
  biggest_guaranteed_subsample = max(table(dat$guarantee))
}
# Build the subsamples
for (ss in 1:num_subsamples) {
  # Pick out any rows guaranteed a slot in this subsample
  my_sub = dat[dat$guarantee == ss, 'i']
  # And randomly select the rest, without replacement
  my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)],
                            subsample_size - length(my_sub),
                            replace = FALSE))
  # Do your subsample calculation here
}
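As a quick sanity check (not part of the original answer, and reusing the objects defined above: dat, full_index, num_subsamples and subsample_size), you can collect the indices drawn for each subsample and confirm afterwards that every row really appears at least once:
# Collect every subsample's indices and verify full coverage
all_subsamples = vector("list", num_subsamples)
for (ss in 1:num_subsamples) {
  my_sub = dat[dat$guarantee == ss, 'i']
  my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)],
                            subsample_size - length(my_sub),
                            replace = FALSE))
  all_subsamples[[ss]] = my_sub
}
all(full_index %in% unlist(all_subsamples)) # should be TRUE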

Related

Generating testing and training datasets with replacement in R

I have mirrored some code to perform an analysis, and everything appears to be working correctly. However, I am trying to understand a few lines of code that split the data into a 40% testing and 60% training set.
As I currently understand it, the code randomly assigns each row to group 1 or 2. Subsequently, all the rows assigned to 1 are pulled into the training set, and the 2's into the testing set.
Later, I realized that sampling with replacement is not what I wanted for my data analysis, although in this case I am unsure what is actually being replaced. I do not believe it is the data itself being replaced, but rather the "1" and "2" placeholders. I am looking to understand exactly how these lines of code work. Based on my results, the code seems to be accomplishing what I want, but I need to confirm whether or not the data itself is being replaced.
To test the lines in question, I created a data frame with 10 unique values (1 through 10).
If the data values themselves were being sampled with replacement, I would expect to see duplicates in "training1" or "testing2". I ran these lines of code 10 times with 10 different set.seed numbers and the data values were never duplicated. To me, this suggests the data itself is not being replaced.
If I set replace = FALSE I get this error:
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
set.seed(8)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
training1 <- df[test == 1, ]
testing2 <- df[test == 2, ]
I'd like to split my data 60/40 into training and testing sets, although I am not sure that this is actually happening. I think the prob argument is not doing what I expect: it does not split the data into exactly 60 percent and 40 percent. In the n = 10 example, it can result in 7 training and 3 testing rows, or 6 training and 4 testing. With my actual larger dataset (n ≈ 2000+), it averages out to be pretty close to 60/40 (e.g., 60.3/39.7).
The way you are sampling is bound to produce an undesired, random split size unless the number of observations is huge (the law of large numbers). To make a deterministic split, decide on the number of observations for the training data and use it to sample from nrow(df):
set.seed(8)
# for a 60/40 train/test split
train_indx = sample(x = 1:nrow(df),
                    size = 0.6 * nrow(df),
                    replace = FALSE)
train_df <- df[train_indx, ]
test_df <- df[-train_indx, ]
I recommend splitting the data as in Mankind_008's answer. Since I ran quite a bit of analysis based on the original code, I spent a few hours looking into what it does exactly.
The original code:
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
Answer From ( https://www.datacamp.com/community/tutorials/machine-learning-in-r ):
"Note that the replace argument is set to TRUE: this means that you assign a 1 or a 2 to a certain row and then reset the vector of 2 to its original state. This means that, for the next rows in your data set, you can either assign a 1 or a 2, each time again. The probability of choosing a 1 or a 2 should not be proportional to the weights amongst the remaining items, so you specify probability weights. Note also that, even though you don’t see it in the DataCamp Light chunk, the seed has still been set to 1234."
One of my main concerns was that the data values themselves were being replaced. Instead, it seems that only the 1 and 2 placeholders can be assigned again, according to the given probabilities.
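For illustration, here is a minimal check along the lines of the 10-value test described in the question (the data frame df below is a made-up stand-in): only the 1/2 group labels are drawn with replacement; the rows themselves are never duplicated.
set.seed(8)
df <- data.frame(value = 1:10)            # hypothetical 10-row data frame
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
test                                      # one label (1 or 2) per row
training1 <- df[test == 1, , drop = FALSE]
testing2  <- df[test == 2, , drop = FALSE]
anyDuplicated(c(training1$value, testing2$value)) # 0: no value appears twice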

Loop over data frame comparing pairs

I have created the following dataframe:
set.seed(42)
df1 = data.frame(pair = rep(c(1:26),2), size = rnorm(52,5.4,1.89))
It represents random pairs of individuals of a certain size, as assigned by the 'pair' column.
The random distribution (5.4, 1.89) is based on observed data from the group that I sampled in my study (N=26 pairs).
I now want to ask a very basic question that I am unable to code my way to:
Imagine a horizontal line at the mean (5.4), severing the population in two:
What proportion of individuals are paired with another individual from the same side of the line? i.e. is there a tendency for small to be with small and big to be with big?
I want to compare the proportion I observed with the proportion generated from 'asking' the above question a lot of times (e.g. 1000 repetitions).
In my study, 18 of 26 individuals were paired with a similar-sized partner, so I want to ask: "out of 1000 repetitions, how many times was the proportion of similar pairs equal to or greater than 18/26?" This will be my "p-value".
I have no clue how to code this, but in my head it goes like this:
For each pair (rows with an equal 'pair' value):
is the larger individual equal to or bigger than 5.4, and is the smaller individual also equal to or bigger than 5.4? If so, return a "yes".
OR
is the larger individual equal to or smaller than 5.4, and is the smaller individual also equal to or smaller than 5.4? If so, return a "yes".
If none of the above is true, return a "no".
Output the proportion of yes and no answers and store it in a data frame. Repeat this process 1000 times, adding all the outputs to that data frame:
run1 24/26
run2 4/26
...
run999 13/26
I really hope someone can show me the start to this, or the relevant code/arguments/structure.
Is this what you want?
# Create empty output, for 10 iterations
same_group_list = replicate(10, 0)
diff_group_list = replicate(10, 0)
for (j in 1:10) { # For 10 iterations
  df1 = data.frame(pair = rep(c(1:26), 2), size = rnorm(52, 5.4, 1.89))
  # Sort by 'pair'
  df1 = df1[with(df1, order(pair)), ]
  # Assign a group based on whether 'size' is above or below mean(size)
  for (i in 1:nrow(df1)) {
    if (df1$size[i] <= mean(df1$size)) { # Use 5.4 explicitly instead of mean(df1$size) if you want
      df1$Group[i] = -1
    } else {
      df1$Group[i] = 1
    }
  }
  df1$Group = as.numeric(df1$Group) # Convert to numeric
  output2 = tapply(df1$Group, df1$pair, mean) # Group-wise mean per pair
  diff_group_list[j] = sum(output2 == 0) # A mean of 0 means the pair spans both groups
  same_group_list[j] = length(output2) - diff_group_list[j] # Everything else is the same group
}
output = data.frame("Same group out of 26" = same_group_list,
                    "Different group out of 26" = diff_group_list)
I created a data frame with the pairs side by side and then checked which individuals were higher than 5.4, then compared within pairs. The pairs with both sizes higher than 5.4 were counted, and the count was divided by 26.
The data frame proportions shows the proportion for each run.
proportions <- data.frame(run = (1:1000), prop = rep(NA, 1000))
for (i in 1:1000) {
  df = data.frame(pair = c(1:26),
                  size1 = rnorm(26, 5.4, 1.89),
                  size2 = rnorm(26, 5.4, 1.89))
  greaterPairs <- sum(df[, 2] > 5.4 & df[, 3] > 5.4)
  proportions[i, 2] = greaterPairs / 26
}
head(proportions)
I did not keep the proportions in the string format "18/26" because, if you later want to count how many runs satisfy some condition, you would otherwise have to do it visually, one by one. For example, to know how many of them are greater than or equal to 18/26:
sum(proportions$prop >= (18/26))
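If you also want to count pairs that are both below the mean (the "same side" definition from the question) and turn the observed 18/26 into a p-value, here is a sketch along the same lines, assuming the same parameters as above:
set.seed(42)
n_rep <- 1000
prop_same <- rep(NA, n_rep)
for (i in 1:n_rep) {
  size1 <- rnorm(26, 5.4, 1.89)
  size2 <- rnorm(26, 5.4, 1.89)
  same_side <- (size1 > 5.4) == (size2 > 5.4) # TRUE when both above or both below 5.4
  prop_same[i] <- sum(same_side) / 26
}
# One-sided "p-value": how often the simulated proportion is >= the observed 18/26
mean(prop_same >= 18/26)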

Forming a Wright-Fisher loop with "sample()"

I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead-set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that will keep running making random selections from successive sets. For example, if it takes some original set of values and samples a second set, I'd like the loop to take another random sample from the second set (using the probabilities that were defined earlier).
I'd like to just learn how to do this in a very general way. Therefore, the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) the size of the set must stay constant across generations, per Wright-Fisher. For an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, I don't have any code to share yet, precisely because I'm not sure how to start writing this kind of loop. I know that for() loops are used to repeat a function multiple times, so my guess is to start there. However, from what I've researched, it seems that you have to start with a variable (typically i), and I don't have any variable in this sampling that seems explicitly obvious, which isn't to say one couldn't be made up.
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144) # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  population <- sample(population, replace = TRUE)
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
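If you also want to watch the drift over time, here is a small sketch with the same assumed starting population that records the frequency of allele 1 in every generation, so the trajectory can be inspected or plotted afterwards:
set.seed(144)
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
population <- init.population
freq1 <- rep(NA, iter)
for (i in seq_len(iter)) {
  population <- sample(population, replace = TRUE) # constant size, per Wright-Fisher
  freq1[i] <- mean(population == 1)                # frequency of allele 1 this generation
}
plot(freq1, type = "l", xlab = "generation", ylab = "frequency of allele 1")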

Find a block of steady column values

Can anyone give me a hint to speed up the following program?
Situation: I have a huge amount of measurement data, and I need to extract data for "10 minutes of stable operation conditions" of 5 parameters, i.e. column values.
Here is my (working, but really slow) solution:
- Take the first 10 rows from the dataframe
- Compare the min and max of each column to the first value of the column
- If at least one column min or max is not within tolerance, delete the first row, repeat
- If they are within tolerance, calculate the mean of the results, store them, delete 10 rows, repeat.
- break when the dataframe has less than 10 rows
Since I am using a repeat loop, this takes 30 minutes to extract 610 operation points from 86,220 minutes of data.
Any help is appreciated. Thanks!
Edit: I created some code to explain. Please note that I deleted the checking routines for NA values and standby operation (values around 0):
n_cons <- 5 # Number of consistent minutes?
### Function to check whether a block is outside the tolerance band
### (returns TRUE if min or max deviates from the first value by more than tol)
f_cons <- function(min, max, value, tol) {
  z <- max > (value + tol) | min < (value - tol)
  return(z)
}
# Define the +/- tolerances
Vu_1_tol <- 5 # F_HT
Vu_2_tol <- 5 # F_LT
# Create empty result map
map <- c(rep(NA, 3))
dim(map) <- c(1, 3)
colnames(map) <- list("F_HT", "F_LT", "Result")
system.time(
  repeat {
    # Criterion to break
    if (nrow(t6) < n_cons) {break}
    # Subset of the data to check
    t_check <- NULL
    t_check <- cbind(t6$F_HT[1:n_cons],
                     t6$F_LT[1:n_cons])
    # Check for consistency; if a column is out of tolerance, drop the first row and retry
    if (f_cons(min(t_check[, 1]), max(t_check[, 1]), t_check[1, 1], Vu_1_tol)) {
      t6 <- t6[-1, ]
      next
    }
    if (f_cons(min(t_check[, 2]), max(t_check[, 2]), t_check[1, 2], Vu_2_tol)) {
      t6 <- t6[-1, ]
      next
    }
    # If the repeat loop passes the consistency check, store the means
    attach(t6[1:n_cons, ])
    # Create a new row with the means of the steady block
    new_row <- c(mean(F_HT), mean(F_LT), mean(Result))
    new_row[-1] <- round(as.numeric(new_row[-1]), 2)
    map <- rbind(map, new_row) # attach the new steady point to the map
    detach(t6[1:n_cons, ])
    t6 <- t6[-(1:n_cons), ] # delete the evaluated lines from the data
  }
)
The data I am using looks like this:
t6<-structure(list(F_HT = c(1499.71, 1500.68, 1500.44, 1500.19, 1500.31,
1501.76, 1501, 1551.22, 1500.01, 1500.52, 1499.53, 1500.78, 1500.65,
1500.96, 1500.25, 1500.76, 1499.49, 1500.24, 1500.47, 1500.25,
1735.32, 2170.53, 2236.08, 2247.48, 2250.71, 2249.59, 2246.68,
2246.69, 2248.27, 2247.79), F_LT = c(2498.96, 2499.93, 2499.73,
2494.57, 2496.94, 2507.71, 2495.67, 2497.88, 2499.63, 2506.18,
2495.57, 2504.28, 2497.38, 2498.66, 2502.17, 2497.78, 2498.38,
2501.06, 2497.75, 2501.32, 2500.79, 2498.17, 2494.82, 2499.96,
2498.5, 2503.47, 2500.57, 2501.27, 2501.17, 2502.33), Result = c(9125.5,
8891.5, 8624, 8987, 9057.5, 8840.5, 9182, 8755.5, 9222.5, 9079,
9175.5, 9458.5, 9058, 9043, 9045, 9309, 9085.5, 9230, 9346, 9234,
9636.5, 9217.5, 9732.5, 9452, 9358, 9071.5, 9063.5, 9016.5, 8591,
8447.5)), .Names = c("F_HT", "F_LT", "Result"), row.names = 85777:85806, class = "data.frame")
With this code and data, I get 3 steady operation points, which is what I want, but it is very slow.
Hopefully this helps to better explain my problem.
Eureka!
Thanks to the comment of Carl Witthoft, I was able to speed up the process by a factor of 15!
I used rollapply a lot, because rollmean and rollmax had some problems with NA values that did not occur when using rollapply.
Thanks for your help!
Here is what I did, using the same data as before:
library(zoo) # for rollapply
# Use only the values needed to check for stability
t7 <- as.data.frame(cbind(t6$F_HT, t6$F_LT))
n_cons <- 5 # Number of consistent minutes?
# Calculate the rolling mean of each column over 5 rows
t7_rm <- rollapply(t7, n_cons, mean, align = "left")
colnames(t7_rm) <- c("mean_F_HT", "mean_F_LT")
# Idem with the maximum
t7_max <- rollapply(t7, width = n_cons, FUN = max, na.rm = FALSE, align = "left")
colnames(t7_max) <- c("max_F_HT", "max_F_LT")
# Idem with the minimum
t7_min <- rollapply(t7, width = n_cons, FUN = min, na.rm = FALSE, align = "left")
colnames(t7_min) <- c("min_F_HT", "min_F_LT")
# Create a table with the maximum absolute deviation from the mean values
t7_dif <- pmax((t7_max - t7_rm[1:nrow(t7_max), ]), (t7_rm[1:nrow(t7_min), ] - t7_min))
colnames(t7_dif) <- c("diff_F_HT", "diff_F_LT")
# Enter tolerance limits
V1_tol <- 50 # F_HT
V2_tol <- 50 # F_LT
# Create a tolerance table
t7_tol <- cbind(rep(V1_tol, nrow(t7_dif)), rep(V2_tol, nrow(t7_dif)))
# Create a logical table: TRUE if the max deviation is within tolerance
t7_check <- (t7_dif < t7_tol)
# Replace all FALSE with NA (so that complete.cases can be used)
t7_check_NA <- apply(t7_check, c(1, 2), function(x) {ifelse(x == FALSE, NA, x)})
# Create a rolling mean over the complete data
t6_rm <- rollapply(t6, n_cons, mean, na.rm = TRUE, align = "left")
# Create a map of stable operation points with the means of the parameters and the result
t6_map <- t6_rm[complete.cases(t7_check_NA), ]
The result differs from my original one because no lines are omitted, but this works for me.

How should I combine two loops in R?

I want to ask your opinion, since I am not so sure how to do this. It concerns one part of my paper project, and my situation is:
Stage I
I have 2 groups, and for each group I need to compute the following steps:
Generate 3 random numbers from a normal distribution and square them.
Repeat step 1 fifteen times; at the end I will have 15 random numbers.
I have already done Stage I using a for loop.
n1 <- 3
n2 <- 3
miu <- 0
sd1 <- 1
sd2 <- 1
asim <- 15
w <- rep(NA, asim)
x <- rep(NA, asim)
for (i in 1:asim) {
  print(i)
  set.seed(i)
  data1 <- rnorm(n1, miu, sd1)
  data2 <- rnorm(n2, miu, sd2)
  w[i] <- sum(data1^2)
  x[i] <- sum(data2^2)
}
w
x
The second stage is:
Stage II
For each group, I need to:
Sort the group;
Find the trimmed mean of each group.
For the whole process (Stage I and Stage II) I need to run the simulation 5000 times. How should I proceed with Stage II? Do you think I need to add another loop for it?
Those are tasks you can do without explicit loops. First, note a few things: it makes no difference whether you generate the 3 × 15 × 2000 random numbers in small pieces inside loops or all at once; they still follow the same distribution.
Next: setting the seed inside the loop re-seeds the generator on every iteration, which makes your simulation deterministic. Call set.seed once at the start of your script instead.
So, what we will do is to generate all random numbers at once, then compute their squared norms for groups of three, then build groups of 15.
First some variable definitions:
set.seed(20131301)
repetitions <- 2000
numperval <- 3
numpergroup <- 15
miu <- 0
sd1 <- 1
sd2 <- 1
As we need two groups, we wrap the group-generation code into a custom function. This is not necessary, but it helps keep the code clean and readable.
generateGroup <- function(repetitions, numperval, numpergroup, m, s) {
  # Generate all random numbers at once
  data <- rnorm(repetitions * numperval * numpergroup, m, s)
  # Build groups of 3:
  data <- matrix(data, ncol = numperval)
  # And compute the squared norm of each group
  data <- rowSums(data * data)
  # Finally build a matrix with 15 columns: each row is one repetition,
  # each column one of the 15 numbers
  matrix(data, ncol = numpergroup)
}
Great, now we can generate random numbers for our group:
group1 <- generateGroup(repetitions, numperval, numpergroup, miu, sd1)
group2 <- generateGroup(repetitions, numperval, numpergroup, miu, sd2)
To compute the trimmed means, we use apply:
trimmedmeans_group1 <- apply(group1, 1, mean, trim=0.25)
trimmedmeans_group2 <- apply(group2, 1, mean, trim=0.25)
I used mean with the trim argument instead of sorting, discarding the extremes, and computing the mean by hand. If you need the sorted numbers explicitly, you could do it manually (shown for one group this time):
sorted <- t(apply(group1, 1, sort))
# We have to transpose, as apply returns a matrix with each observation in one
# column. I chose the other orientation above, so we stick with that convention
# and transpose.
Now, it would be easy to throw away the first and last two columns and generate the mean, if you want to do it manually.
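For completeness, here is a manual sketch along those lines (k and manual_trimmed are just illustrative names): drop k columns from each end of the sorted matrix and average the rest. Note that mean(..., trim = 0.25) on 15 values drops floor(0.25 * 15) = 3 values from each end, so k = 3 reproduces the apply() result above, while dropping only the first and last two columns corresponds to a smaller trim.
k <- 3
# Average the middle columns of each sorted row
manual_trimmed <- rowMeans(sorted[, (k + 1):(ncol(sorted) - k)])
all.equal(manual_trimmed, trimmedmeans_group1) # TRUE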
