Translating a for-loop to (perhaps) an apply over a list - R

I have an R code question that has kept me from completing several tasks for the last year, but I am relatively new to R. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a for loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)
n <- 100
d <- NULL
col <- c(0, .3, .5)
for (j in 1:length(col)) {
  X.corr <- matrix(c(1, col[j], col[j], 1), nrow = 2, ncol = 2)  # 2 x 2 correlation matrix for this pass
  x <- rmvnorm(n, mean = c(0, 0), sigma = X.corr)
  x1 <- x[, 1]
  x2 <- x[, 2]
}
d <- rbind(d, c(j))
Let me describe my code so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables with rmvnorm() at 3 different correlation levels, one level per pass, using 100 observations (toy data to get the coding correct). d starts out empty. The 3 correlation levels are used in the following way: pass 1 uses correlation 0 to create the variables, and then other code runs; pass 2 uses correlation .3 to create 2 new variables, and then other code runs; pass 3 uses correlation .5 to create 2 new variables, and then other code runs. Within my larger code, the for-loop gets the job done. The last line puts the number of the correlation into the data frame. I realize that, as presented here, it will only put 1 number into this data frame, but when it is incorporated into my larger code it works as desired by putting 3 different numbers in a single column (1 = 0, 2 = .3, and 3 = .5). To reiterate, the for-loop gets the job done, but I believe there is a better way, perhaps something in the apply family. I do not know how to construct this and still keep track of which correlation is being used. Would someone help me develop this little piece of code? Thank you.
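For what it's worth, here is one way this might translate to the apply family (a sketch under my own naming, assuming each pass should return a data frame that carries the correlation, and the pass number, alongside the two simulated variables so downstream code can still see which correlation was used):
library(mvtnorm)
n <- 100
cors <- c(0, .3, .5)
# One simulation per correlation level; the correlation value (and its index)
# travels with the simulated variables, so later code can tell the passes apart.
sim_one <- function(j) {
  r <- cors[j]
  X.corr <- matrix(c(1, r, r, 1), nrow = 2, ncol = 2)
  x <- rmvnorm(n, mean = c(0, 0), sigma = X.corr)
  data.frame(pass = j, corr = r, x1 = x[, 1], x2 = x[, 2])
}
results <- lapply(seq_along(cors), sim_one)  # a list with one data frame per pass
d <- do.call(rbind, results)                 # or keep the list and work per element
The "other code" for each pass could then live inside sim_one(), or be applied over the elements of results afterwards.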

Related

"grouping factor must have exactly 2 levels"

Hi y'all, I'm fairly new to R and I'm supposed to calculate an F statistic for this table.
The code I have inputted is as follows:
# F-test
res.ftest <- var.test(TotalLength ~ SwimSpeed, data = my_data)
res.ftest
I know I have more than two levels from the other posts I have read online, but I am not sure what to change to get the outcome I want.
FIRST AND FOREMOST... if you invoke
?var.test
you will note that the S3 method you called assumes the left-hand side is numeric and the right-hand side is a 2-level factor.
As for the rest: while I don't know the exact wording of your work/school assignment, it shouldn't really be "calculate an F-test"; it should be "analyze these data appropriately". While there are a number of routes you could take, this is normally treated as a regression problem, NOT as a comparison of two variances, which is what var.test() is designed to do. (Reading the documentation, for example at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/var.test, should make this clear and is something you should always do before invoking an R procedure.)
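To make the "2 levels" requirement concrete, here is a small made-up example (the toy data and the Tank grouping are mine, not from the question) of the shape the formula method of var.test() expects: a numeric response on the left and a factor with exactly two levels on the right.
# Hypothetical data: numeric response, two-level grouping factor
toy <- data.frame(
  TotalLength = c(27.1, 29.0, 33.0, 29.3, 30.2, 28.7),
  Tank        = factor(rep(c("A", "B"), each = 3))   # exactly 2 levels
)
var.test(TotalLength ~ Tank, data = toy)   # compares the variances of the two groups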
Using a subset of your data (please do this yourself for Stack helpers next time rather than making someone here do it for you)...
df <- data.frame(
  ID = 1:4,
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)
cor.test(df$TL, df$SS)       # reports a t statistic
# or
summary(lm(df$TL ~ df$SS))   # reports an F statistic
Note that F is simply t^2 here in the 2 variable case.
Lastly, I should add that it is remotely, vaguely possible the assignment is to check whether the variances of the 2 distributions are equal, even though I can see no reason why anyone would want to know this, since they are 2 different measures on two different underlying scales measuring 2 different things. However,
var.test(df$TL, df$SS)
will return a "result" should you take the assignment to mean compare the observed variances.

Generating testing and training datasets with replacement in R

I have mirrored some code to perform an analysis, and everything is working correctly (I believe). However, I am trying to understand a few lines of code related to splitting the data up into 40% testing and 60% training sets.
To my current understanding, the code randomly assigns each row to group 1 or 2. Subsequently, all the rows assigned 1 are pulled into the training set, and the 2's into the testing set.
Later, I realized that sampling with replacement is not what I wanted for my data analysis, although in this case I am unsure of what is actually being replaced. Currently, I do not believe it is the actual data itself being replaced, but rather the "1" and "2" placeholders. I am looking to understand exactly how these lines of code work. Based on my results, it seems to be accomplishing what I want, but I need to confirm whether or not the data itself is being replaced.
To test the lines in question, I created a dataframe with 10 unique values (1 through 10).
If the data values themselves were being sampled with replacement, I would expect to see some duplicates in "training1" or "testing2". I ran these lines of code 10 times with 10 different set.seed numbers and the data values were never duplicated. To me, this suggests the data itself is not being replaced.
If I set replace = FALSE I get this error:
Error in sample.int(x, size, replace, prob) :
  cannot take a sample larger than the population when 'replace = FALSE'
set.seed(8)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
training1 <- df[test == 1, ]
testing2  <- df[test == 2, ]
I'd like to split my data into 60/40 training and testing sets, although I am not sure that this is actually happening. I think the prob argument is not doing what I expect: I've noticed it does not split the data into exactly 60% and 40%. In the n = 10 example, it can result in 7 training and 2 testing, or even 6 training and 4 testing. With my actual larger dataset (~2000+ rows), it averages out to be pretty close to 60/40 (i.e., 60.3/39.7).
The way you are sampling is bound to result in a random (and possibly undesired) split size unless the number of observations is huge, by the law of large numbers. To make the split deterministic, decide on the size (number of observations) of the training data and use it to sample from nrow(df):
set.seed(8)
# for a 60/40 train/test split
train_indx <- sample(x = 1:nrow(df),
                     size = 0.6 * nrow(df),
                     replace = FALSE)
train_df <- df[train_indx, ]
test_df  <- df[-train_indx, ]
I recommend splitting the data based on Mankind_008's answer. Since I had already run quite a bit of analysis based on the original code, I spent a few hours looking into what exactly it does.
The original code:
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
Answer From ( https://www.datacamp.com/community/tutorials/machine-learning-in-r ):
"Note that the replace argument is set to TRUE: this means that you assign a 1 or a 2 to a certain row and then reset the vector of 2 to its original state. This means that, for the next rows in your data set, you can either assign a 1 or a 2, each time again. The probability of choosing a 1 or a 2 should not be proportional to the weights amongst the remaining items, so you specify probability weights. Note also that, even though you don’t see it in the DataCamp Light chunk, the seed has still been set to 1234."
One of my main concerns was that the data values themselves were being replaced. Instead, it seems that replace = TRUE only allows the 1 and 2 placeholders to be assigned over and over again based on the probabilities; the rows themselves are never duplicated.
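A quick way to convince yourself of this (my own toy check, not part of the original analysis) is to run the split on a small data frame of unique values and verify that no row ends up in both sets or appears twice:
df <- data.frame(value = 1:10)
set.seed(8)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
training1 <- df[test == 1, , drop = FALSE]
testing2  <- df[test == 2, , drop = FALSE]
# Only the 1/2 labels were drawn with replacement; each row is used exactly once.
anyDuplicated(c(training1$value, testing2$value))   # 0, i.e. no duplicates
nrow(training1) + nrow(testing2) == nrow(df)        # TRUE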

Vectorizing R custom calculation with dynamic day range

I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building a ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want last 3 days, but only 2 records exist this returns NA).
Here's a sample of the data and the function outlined above:
data <- data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                   day = c(1:5, 1:10),
                   device_repaired = sample(0:1, 15, replace = TRUE),
                   device_replaced = sample(0:1, 15, replace = TRUE))
# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays) {
  # Subset to this device's records in the window (fday - fpreviousdays, fday)
  df <- subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make the calculation
  if (nrow(df) < fpreviousdays) {
    calculation <- NA
  } else {
    calculation <- sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the amount of attributes available (e.g. device_repaired) and the features to calculate (e.g. device_reparations_on_last_3days) has grown exponentially and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to an lapply-style function.
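For what it's worth, a minimal sketch of how the extra arguments could be passed (this assumes the getCalculation() and data objects above; the feature name repaired_last_3d is just something I made up). mapply() takes the row-varying arguments directly and the fixed ones via MoreArgs, and parallel::mcmapply() is a near drop-in parallel version on non-Windows systems:
library(parallel)
# One value of the hypothetical "repairs over the last 3 days" feature per row
repaired_last_3d <- mapply(getCalculation,
                           fday = data$day,
                           fdeviceid = data$device_id,
                           MoreArgs = list(fdata = data,
                                           fattribute = "device_repaired",
                                           fpreviousdays = 3))
# The same call spread over cores (uses forking, so not available on Windows)
repaired_last_3d <- mcmapply(getCalculation,
                             fday = data$day,
                             fdeviceid = data$device_id,
                             MoreArgs = list(fdata = data,
                                             fattribute = "device_repaired",
                                             fpreviousdays = 3),
                             mc.cores = 2)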

Forming a Wright-Fisher loop with "sample()"

I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead-set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that will keep running making random selections from successive sets. For example, if it takes some original set of values and samples a second set, I'd like the loop to take another random sample from the second set (using the probabilities that were defined earlier).
I'd like to just learn how to do this in a very general way. Therefore, the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) the size of the set must stay constant across generations, per Wright-Fisher. For an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, my issue is that I don't have any code to share yet, precisely because I'm not sure how to start writing this kind of loop. I know that for() loops are used to repeat a function multiple times, so my guess is to start there. However, from what I've researched, it seems that you have to start with a variable (typically i). I don't have any variable in this sampling that seems explicitly obvious, which isn't to say one couldn't be made up.
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144)  # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  population <- sample(population, replace = TRUE)  # next generation, same size
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
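If it helps, the same loop can be extended a little (my own illustration, not part of the answer above) to record the allele-1 frequency in every generation, which is usually what you want to look at when studying drift:
set.seed(144)
population  <- c(1, 1, 2, 2, 2, 2)   # starting set; size stays constant
generations <- 100
freq1 <- numeric(generations)
for (g in seq_len(generations)) {
  population <- sample(population, replace = TRUE)  # resample from the previous generation
  freq1[g]   <- mean(population == 1)               # frequency of allele 1
}
head(freq1, 10)   # drift trajectory over the first 10 generations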

Show KMeans cluster results with clusters as columns

My data has 40+ variables and I am creating a 3 cluster model on it.
I have built a kmeans model:
teen_clusters <- kmeans(interests_z, 3)
It works fine; the issue is getting output that I can read.
When I screen print the model, it places the variables on the top (40 across) and the clusters as rows (3 deep). Very hard to read.
I want it the other way around. 3 cluster columns and 40 rows.
I have tried the below, but get the same thing. This does way too much screen wrap.
aggregate(interests_z, by = list(teen_clusters$cluster), FUN = mean)
Since we don't have your data, let's use mtcars...
ret <- kmeans(mtcars, 3)
ret$centers     # the default format
t(ret$centers)  # transposed, as you want
To see the components of ret, use str(ret).
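If you'd rather keep the aggregate() approach from the question, the same transpose trick applies (illustrative; this assumes your interests_z data and teen_clusters object):
centers_by_cluster <- aggregate(interests_z,
                                by = list(cluster = teen_clusters$cluster),
                                FUN = mean)
t(centers_by_cluster)   # clusters as columns, variables (plus the cluster id) as rows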
