Generating testing and training datasets with replacement in R

I have mirrored some code to perform an analysis, and everything is working correctly (I believe). However, I am trying to understand a few lines of code related to splitting the data up into 40% testing and 60% training sets.
To my current understanding, the code randomly assigns each row to group 1 or 2. Subsequently, all the rows assigned to 1 are pulled into the training set, and the 2s into the testing set.
Later, I realized that sampling with replacement is not what I wanted for my data analysis, although in this case I am unsure of what is actually being replaced. Currently, I do not believe it is the actual data itself being replaced, but rather the "1" and "2" placeholders. I am looking to understand exactly how these lines of code work. Based on my results, it seems to be accomplishing what I want, but I need to confirm whether or not the data itself is being replaced.
To test the lines in question, I created a dataframe with 10 unique values (1 through 10).
If the data values themselves were being sampled with replacement, I would expect to see duplicates in "training1" or "testing2". I ran these lines of code 10 times with 10 different set.seed numbers and the data values were never duplicated. To me, this suggests the data itself is not being replaced.
The lines in question:
set.seed(8)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
training1 <- df[test==1,]
testing2 <- df[test==2,]
If I instead set replace = FALSE, I get this error:
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
I'd like to split my data 60/40 into training and testing, although I am not sure that is actually happening. I think the prob argument is not doing what I expect: it does not split the data into exactly 60% and 40%. In the n=10 example, it can produce a 7/3 or even a 6/4 split. With my actual larger dataset of n=2000+, it averages out pretty close to 60/40 (e.g., 60.3/39.7).
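For example, tabulating the labels for a hypothetical 10-row data frame makes this variability visible (a quick sketch):
set.seed(1)
table(sample(2, 10, replace = TRUE, prob = c(.6, .4)))
# the counts of 1s and 2s change from seed to seed; only their
# expected proportions are 60/40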

The way you are sampling is bound to result in an undesired/random split size unless the number of observations is huge (this is the law of large numbers at work). To make the split deterministic, decide on the size/number of observations for the train data and use it to sample from nrow(df):
set.seed(8)
# for a 60/40 train/test split
train_indx = sample(x = 1:nrow(df),
                    size = 0.6 * nrow(df),
                    replace = FALSE)
train_df <- df[train_indx, ]
test_df <- df[-train_indx, ]
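A quick check of the resulting sizes (a sketch; the train size is fixed for a given df, up to truncation of 0.6*nrow(df)):
nrow(train_df) / nrow(df) # ~0.6, the same on every run
nrow(test_df) / nrow(df)  # the remaining ~0.4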

I recommend splitting the data based on Mankind_008's answer. Since I had already run quite a bit of analysis based on the original code, I spent a few hours looking into what exactly it does.
The original code:
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
Answer from https://www.datacamp.com/community/tutorials/machine-learning-in-r :
"Note that the replace argument is set to TRUE: this means that you assign a 1 or a 2 to a certain row and then reset the vector of 2 to its original state. This means that, for the next rows in your data set, you can either assign a 1 or a 2, each time again. The probability of choosing a 1 or a 2 should not be proportional to the weights amongst the remaining items, so you specify probability weights. Note also that, even though you don’t see it in the DataCamp Light chunk, the seed has still been set to 1234."
One of my main concerns was that the data values themselves were being replaced. Rather, it is the 1 and 2 placeholders that can be assigned over again, according to the given probabilities.
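A minimal sketch (re-using the 10-row example data frame from the question) confirming that only the labels, not the rows, are drawn with replacement:
set.seed(8)
df <- data.frame(value = 1:10)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
training1 <- df[test == 1, ]
testing2 <- df[test == 2, ]
any(duplicated(training1)) # FALSE: no data value appears twice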

Related

Subsampling from a set with the assumption that each member would be picked at least one time in R

I need code or an idea for the following case: we have a dataset of 1000 rows, and I want to subsample 800 of the rows multiple times (I don't know how many times I should repeat).
How do I ensure that every member is picked in at least one run? I need the code in R.
To make the question clearer, let's define the row names as:
rownames(dataset) = A,B,C,D,E,F,G,H,J,I
if I subsample 3 times:
A,B,C,D,E,F,G,H
D,E,A,B,H,J,F,C
F,H,E,A,B,C,D,J
Row I is not in any of the subsample sets. I would like to subsample 90 or 80 percent of the data many times, and I expect every row to be chosen in at least one of the subsample sets. In the example above, the element I should be picked in at least one of the subsamples.
One way to do this is to designate a set of "forced" random picks: give each row a single guaranteed appearance, and decide ahead of time (by sampling) which subsample that guaranteed appearance will be in. Then fill the rest of each subsample by random sampling without replacement.
num_rows = 1000
num_subsamples = 1000
subsample_size = 900
full_index = 1:num_rows
dat = data.frame(i = full_index)
# Randomly assign guaranteed subsamples
# Make sure that we don't accidentally assign more than the subsample size
# If we're subsampling 90% of the data, it'll take at most a few tries
biggest_guaranteed_subsample = num_rows
while (biggest_guaranteed_subsample > subsample_size) {
  # Assign the subsample that the row is guaranteed to appear in
  dat$guarantee = sample(1:num_subsamples, replace = TRUE)
  # Find the subsample with the most guaranteed slots taken
  biggest_guaranteed_subsample = max(table(dat$guarantee))
}
# Assign subsamples
for (ss in 1:num_subsamples) {
  # Pick out any rows guaranteed a slot in that subsample
  my_sub = dat[dat$guarantee == ss, 'i']
  # And randomly select the rest
  my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)],
                            subsample_size - length(my_sub),
                            replace = FALSE))
  # Do your subsample calculation here
}
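As a sanity check (a sketch; it stores each subsample in a list rather than discarding it), you can verify afterwards that every row really does appear at least once:
subsamples = vector("list", num_subsamples)
for (ss in 1:num_subsamples) {
  my_sub = dat[dat$guarantee == ss, 'i']
  my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)],
                            subsample_size - length(my_sub),
                            replace = FALSE))
  subsamples[[ss]] = my_sub
}
all(full_index %in% unlist(subsamples)) # TRUE: every row is covered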

Translating a for-loop to perhaps an apply through a list

I have an R code question that has kept me from completing several tasks for the last year, but I am relatively new to R. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a for loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)
n=100
d = NULL
col = c(0, .3, .5)
for (j in 1:length(col)){
  X.corr = matrix(c(1, col[j], col[j], 1), nrow=2, ncol=2)
  x = rmvnorm(n, mean=c(0,0), sigma=X.corr)
  x1 = x[,1]
  x2 = x[,2]
}
d = rbind(d, c(j))
Let me describe my code so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables from the mvtnorm function with 3 different correlation levels per pass, using 100 observations [toy data to get the coding correct]. d is an empty data frame.

The 3 correlation levels occur as follows: pass 1 uses correlation 0 to create the variables, and then other code runs; pass 2 uses correlation .3 to create 2 new variables, and then other code runs; pass 3 uses correlation .5 to create 2 new variables, and then other code runs. Within my larger code, the for-loop gets the job done. The last line puts the number of the correlation into the data frame. I realize that, as presented here, it will only put 1 number into the data frame, but when incorporated into my larger code it works as desired, putting 3 different numbers in a single column (1=0, 2=.3, and 3=.5).

To reiterate, the for-loop gets the job done, but I believe there is a better way, perhaps something in the apply family. I do not know how to construct this and still access which correlation is being used. Would someone help me develop this little piece of code? Thank you.
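One way to express this with the apply family (a sketch of the idea using the question's variable names; the per-pass "other code" is elided, not a drop-in for the larger simulation): lapply over the indices of col and return the correlation index along with the simulated variables, so each pass still knows which correlation it used:
library(mvtnorm)
n = 100
col = c(0, .3, .5)
results = lapply(seq_along(col), function(j) {
  X.corr = matrix(c(1, col[j], col[j], 1), nrow=2, ncol=2)
  x = rmvnorm(n, mean=c(0,0), sigma=X.corr)
  # ... other per-pass code would go here ...
  list(x1 = x[,1], x2 = x[,2], j = j) # keep track of which pass this was
})
d = data.frame(j = sapply(results, `[[`, "j")) # one row per correlation level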

TraMineR: Extract all present combinations of events as dummy variables

Let's say I have this data. My objective is to extract combinations of sequences.
I have one constraint: the time between two events may not be more than 5. Let's call this maxGap.
User <- c(rep(1,3)) # One user
Event <- c("C","B","C") # Random events; could be anything from LETTERS[1:4]
Time <- c(1,12,13) # This is a timeline
df <- data.frame(User=User,
                 Event=Event,
                 Time=Time)
I want to use these sequences as binary explanatory variables for analysis.
Given this dataframe the result should be like this.
res.df <- data.frame(User=1,
                     C=1,
                     B=1,
                     CB=0,
                     BC=1,
                     CBC=0)
(CB) and (CBC) will be 0 since the gap between those events exceeds maxGap = 5.
I was trying to write a function for this using many for-loops, but it becomes very complex as the sequences get longer and the number of different events grows, and also if the number of different Users grows to 100 000.
Is it possible to do this in TraMineR with the help of seqeconstraint?
Here is how you would do that with TraMineR:
df.seqe <- seqecreate(id=df$User, timestamp=df$Time, event=df$Event)
constr <- seqeconstraint(maxGap=5)
subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr)
(presence <- seqeapplysub(subseq, method="presence"))
which gives
                   (B) (B)-(C) (C)
1-(C)-11-(B)-1-(C)   1       1   1
presence is a table with a column for each subsequence that occurs at least once in the data set. So, if you have several individuals (event sequences), the table will have one row per individual and the columns will be the binary variable you are looking for. (See also TraMineR: Can I get the complete sequence if I give an event sub sequence? )
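If you then want the binary variables in a data frame keyed by user, something like this should work (a sketch, not part of the original answer; check.names = FALSE keeps column names such as (B)-(C) verbatim):
res <- data.frame(User = unique(df$User), presence, check.names = FALSE)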
However, be aware that TraMineR works well only with subsequences of length up to about 4 or 5. We suggest setting maxK=3 or 4 in seqefsub. The number of individuals should not be a problem, nor should the number of different possible events (the alphabet), as long as you restrict the maximal length of the subsequences you are looking for.
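For example (a sketch), the length cap is passed directly to seqefsub:
subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr, maxK=3)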
Hope this helps.

Restricted permutation (permute) fails using shuffleSet and runs using shuffle

I'm doing PRC using the vegan-package but run into trouble when I attempt to perform an Anova on the results. I get the following error-message:
Error in doShuffleSet(spln[[i]], nset = nset, control) :
number of items to replace is not a multiple of replacement length
The problem originates in the shuffleSet-function of the permute-package. I created a reproducible example below. The weird thing is that the shuffle-function does not cause trouble, but the shuffleSet-function does.
In my experiment 3 treatments were given to 4 animals. The animals received the treatments in different orders. On every day, 5 samples were collected over time.
I would like to permute my observations within animals and not between them. Therefore I use AnimalID as a block.
I would like to permute days (in my actual experiments animals received the same treatment multiple times) but keep the measurements within a day intact. Hence I chose to permute Days freely and have no permutations within Days.
require(permute)
TreatmentLevels=3
Animals=4
TimeSteps=5
AnimalID=rep(letters[1:Animals],each=TreatmentLevels*TimeSteps)
Time=rep(1:TimeSteps, Animals*TreatmentLevels)
#treatments were given in different order per animal.
Day=rep(c(1,2,3,2,3,1,3,2,1,2,3,1),each=TimeSteps)
Treatment=rep(rep(LETTERS[1:TreatmentLevels],each=TimeSteps),Animals)
dataset=as.data.frame(cbind(AnimalID,Treatment,Day,Time))
ctrl=how(blocks = dataset$AnimalID,
         plots = Plots(strata=dataset$Day, type = "free"),
         within=Within(type="none"), nperm = 999)
#this works
shuffle(60,control=ctrl)
#this gives an error
shuffleSet(60,nset=1,control=ctrl)
shuffleSet(60,nset=10,control=ctrl)
The problem seems to be in the block, because this works:
dataset$AnimalDay=factor(paste0(dataset$AnimalID,dataset$Day))
ctrl=how(plots = Plots(strata=dataset$AnimalDay, type = "free"),
         within=Within(type="none"), nperm = 999)
#this works
shuffle(60,control=ctrl)
shuffleSet(60,nset=1,control=ctrl)
shuffleSet(60,nset=10,control=ctrl)
The key problem seems to be nset = 1: the permutation is generated and shuffleSet works, but printing the result fails because one set is dropped to a vector and print expects a matrix. You can get the permutation, you can use the permutation, but you cannot print it.
We need to fix this.
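In the meantime, a sketch of the workaround implied above: assign the result so that nothing is auto-printed, then use the permutation directly:
p1 <- shuffleSet(60, nset=1, control=ctrl) # works: the result is not printed
perm <- as.vector(p1)                      # the single permutation as a plain vector
head(dataset[perm, ])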

Forming a Wright-Fisher loop with "sample()"

I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead-set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that will keep running, making random selections from successive sets. For example, if it takes some original set of values and samples a second set, I'd like the loop to take another random sample from the second set (using the probabilities that were defined earlier).
I'd like to just learn how to do this in a very general way. Therefore, the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) the size of the set must stay constant across generations, per Wright-Fisher. For an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, my issue is that I don't have any code to share yet precisely because I'm not sure of how to start writing this kind of loop. I know that for() loops are used to repeat a function multiple times, so my guess is to start there. However, from what I've researched about these, it seems that you have to start with a variable (typically i). I don't have any variables in this sampling that seem explicitly obvious; which isn't to say one couldn't be made up.
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144) # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  # each generation is a resample (with replacement) of the previous one
  population <- sample(population, replace=TRUE)
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
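To watch the drift itself (a sketch building on the same loop; freq and g are names introduced here), record the frequency of allele 1 in each generation:
set.seed(144)
population <- init.population
freq <- numeric(iter)
for (g in seq_len(iter)) {
  population <- sample(population, replace=TRUE)
  freq[g] <- mean(population == 1)
}
plot(freq, type="l", xlab="generation", ylab="frequency of allele 1")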
