Storing data from a nested loop in R

I need to repeat the sampling procedure of the loop below 1000 times using a second loop.
This is the simplified code I produced for reproducibility (the inner loop):
##Number of iterations
N = 8
##Store data from inner loop in vectors
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
for (I in 1:N){
PolynomialDegree [I] <- I
PMSE [I] <- I*rnorm(1)
}
Now, using a second, outer loop, I want to repeat this "sampling procedure" 1000 times and store the data from all those vectors in a single data frame. I'm struggling to write the outer loop and was hoping for some assistance.
This is my attempt with non-reproducible code; I hope it is clear what I am attempting to do.
##Set number of iterations
N <- 8
M <- 1000
##Store data
OUTPUT <- rep(1,M)
##Outer loop starts
for (J in 1:M){
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
training <- tempraindata[sample,]
testing <- tempraindata[-sample,]
##Inner loop starts
for (I in 1:N){
##Set up linear model with x polynomial of degree I x = year, y = temp
mymodel <- lm(tem ~ poly(Year, degree = I), data = training)
##fit model on testing set and save predictions
predictions <- predict(mymodel, newdata = testing, raw = FALSE)
##define and store PMSE
PMSE[I] <- (1/(nrow(tempraindata)- nrow(training)))*(sum(testing$tem-predictions))^2
PolynomialDegree [I] <- I
} ## End of inner loop
OUTPUT[J] <- ##THIS IS WHERE I WANT TO SAVE THE DATA
} ##End outer loop
I want to store all the data inside OUTPUT and make it a data frame. If done correctly, it should contain 8000 values of PMSE and 8000 values of PolynomialDegree.

Avoid the bookkeeping of initializing vectors and then assigning elements by index. Consider sapply (or vapply) over both iteration ranges to build a 1000 x 8 matrix holding the 8,000 PMSE values. Every column is then one polynomial degree and every row one training/testing split.
## Set number of iterations
N <- 8
M <- 1000
## Define a function that returns the PSME for one repetition and one degree
calc_PSME <- function(rep_id, degree) {
  ## Randomly build training/testing sets
  set.seed(rep_id) # one seed per repetition, so all degrees share the same split
  sample_rows <- sample(nrow(tempraindata), floor(nrow(tempraindata) * 0.7))
  training <- tempraindata[sample_rows, ]
  testing <- tempraindata[-sample_rows, ]
  ## Set up linear model with x polynomial of the given degree: x = Year, y = tem
  mymodel <- lm(tem ~ poly(Year, degree = degree), data = training)
  ## Predict on the testing set
  predictions <- predict(mymodel, newdata = testing)
  ## Return single PSME value: the mean squared prediction error.
  ## The question's code squares the summed errors; squaring each error
  ## before averaging is assumed to be the intent.
  mean((testing$tem - predictions)^2)
}
## Return (1000 x 8) matrix: rows = repetitions, columns = polynomial degrees
PSME_matrix <- sapply(1:N, function(d) vapply(1:M, calc_PSME, numeric(1), degree = d))
Should you need an 8,000-row data frame of two columns, consider reshaping to long format:
long_df <- reshape(
data.frame(PSME_matrix),
varying = 1:8,
timevar = "PolynomialDegree",
v.names = "PSME",
ids = NULL,
new.row.names = 1:1E4,
direction = "long"
)
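If you would rather keep the two explicit loops from the question, one option is to collect a small data frame per outer iteration in a preallocated list and row-bind everything once at the end. This is only a minimal sketch: it uses the question's reproducible stand-in (I * rnorm(1)) instead of the real model, and the Run column and the runs list are names introduced here for illustration.

```r
set.seed(1)   # only so the sketch is reproducible
N <- 8        # polynomial degrees (inner loop)
M <- 1000     # repetitions (outer loop)

# Preallocate one list slot per outer iteration
runs <- vector("list", M)

for (J in seq_len(M)) {
  PMSE <- numeric(N)
  for (I in seq_len(N)) {
    PMSE[I] <- I * rnorm(1)   # stand-in for the real PMSE calculation
  }
  # Store the whole inner-loop result as one small data frame
  runs[[J]] <- data.frame(Run = J, PolynomialDegree = seq_len(N), PMSE = PMSE)
}

# Bind once, after the loops, instead of growing a data frame inside them
OUTPUT <- do.call(rbind, runs)
```

OUTPUT then has one row per (Run, PolynomialDegree) pair, so 8,000 rows for these settings.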

Related

nested loop in r to correlate columns of df1 to columns of df2

I have two datasets with abundance data from groups of different species. Columns are species and rows are sites. The sites (rows) are identical between the two datasets, and I am trying to correlate the columns of the first dataset with the columns of the second to see whether each correlation is positive or negative.
library(Hmisc)
rcorr(otu.table.filter$sp1, new6$spA, type="spearman")$P
rcorr(otu.table.filter$sp1, new6$spA, type="spearman")$r
The first gives me the p-value of the relation between sp1 and spA, and the second the r value.
I initially created a loop that allowed me to check all species of the first data frame against a single column of the second data frame. Needless to say, to make this work I would have to repeat the process a few hundred times.
My simple loop for one column of df1 (new6) against all columns of df2 (otu.table.filter):
pvalues = list()
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$P
}
rvalues = list()
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$r
}
p<-NULL
for(i in 1:length(pvalues)){
tmp <-print(pvalues[[i]][2])
p <- rbind(p, tmp)
}
r<-NULL
for(i in 1:length(rvalues)){
tmp <-print(rvalues[[i]][2])
r <- rbind(r, tmp)
}
fdr<-as.matrix(p.adjust(p, method = "fdr", n = length(p)))
sprman<-cbind(r,p,fdr)
Using the above as a starting point, I tried to create a nested loop that would examine the first column of df1 against all columns of df2, then the second column of df1 against all columns of df2, and so on,
but here I am a bit lost and could not find a solution in R.
I would assume that the pvalues output should be a list of
pvalues[[i]][[j]]
and similarly the rvalues output
rvalues[[i]][[j]]
but I don't know how to do that. I tried:
pvalues = list()
rvalues = list()
for (j in 1:7){
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$P
}
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$r
}
}
but I cannot make it work because I am not sure how to direct the output into the lists. I would also appreciate help with the next part, which is to extract the p and r values for each comparison and apply the FDR correction (similar to what I did with my simple loop).
here is a subset of my two dataframes
Here is a small demo. Let's assume two matrices x and y with a sample size n. The correlations and approximate p-values can then be estimated as:
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)
y <- matrix(rnorm(5 * n), nrow = n)
## correlation matrix
r <- cor(x, y, method = "spearman")
## p-values
pval <- function(r, n) 2 * (1 - pt(abs(r)/sqrt((1 - r^2)/(n - 2)), n - 2))
pval(r, n)
## for comparison
cor.test(x[,1], y[,1], method = "spearman", exact = FALSE)
More details can be found here: https://stats.stackexchange.com/questions/312216/spearman-correlation-significancy-test
Edit
And finally a loop with cor.test:
## for comparison
p <- matrix(NA, nrow = ncol(x), ncol=ncol(y))
for (i in 1:ncol(x)) {
for (j in 1:ncol(y)) {
p[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$p.value
}
}
p
The values differ somewhat because the first approach uses the t-approximation, whereas the second uses the "exact AS 89 algorithm" of cor.test.
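The question also asked for the r value, the p-value, and an FDR correction for every pair of columns. A possible sketch, using the same t-approximation as the demo above, flattens the two matrices into one long data frame and applies p.adjust to the p column. The column names sp1..sp10 and grp1..grp5 and the result name res are made up for illustration.

```r
set.seed(42)   # only so the sketch is reproducible
n <- 100
x <- matrix(rnorm(10 * n), nrow = n, dimnames = list(NULL, paste0("sp", 1:10)))
y <- matrix(rnorm(5 * n), nrow = n, dimnames = list(NULL, paste0("grp", 1:5)))

# Spearman correlation of every column of x with every column of y
r <- cor(x, y, method = "spearman")

# Two-sided p-values from the t-approximation with n - 2 degrees of freedom
t_stat <- abs(r) / sqrt((1 - r^2) / (n - 2))
p <- 2 * pt(t_stat, df = n - 2, lower.tail = FALSE)

# One row per (x column, y column) pair
res <- data.frame(
  x_col = rownames(r)[as.vector(row(r))],
  y_col = colnames(r)[as.vector(col(r))],
  r     = as.vector(r),
  p     = as.vector(p)
)

# FDR (Benjamini-Hochberg) adjustment across all comparisons
res$fdr <- p.adjust(res$p, method = "fdr")
```

Whether to adjust across all pairs at once, as here, or separately per column is an analysis decision the sketch does not settle.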

How to speed up going through all independent variable combinations?

I want to write a function combination_rsquare(y, data, factor_number) where
y - a vector, the dependent variable
data - a data frame containing the independent variables
factor_number - a number or numeric vector giving how many variables each combination should include.
Let's consider my function:
combination_rsquare <- function(y, data, factor_number = c(2, 3)) {
name_vec <- c()
r_sq <- c()
for (j in seq_along(factor_number)) {
# Defining combinations
comb_names <- combn(colnames(data), factor_number[j])
for (i in 1:ncol(comb_names)) {
#Append model r-squared for each combination
r_sq<- append(
r_sq,
summary(lm(y ~ ., data = data[comb_names[1:factor_number[j], i]]))$r.squared
)
# Create vector containing model names seperated by "+"
name_vec <- append(
name_vec,
paste(comb_names[1:factor_number[j], i], collapse = "+")
)
}
}
data.frame(name_vec, r_sq)
}
Let's have a look at how my function works on data:
Norm <- rnorm(100)
Unif <- runif(100)
Exp <- rexp(100)
Pois <- rpois(100,1)
Weib <- rweibull(100,1)
df <- data.frame(Unif, Exp, Pois, Weib)
combination_rsquare(Norm, df)
name_vec r_sq
1 Unif+Exp 0.02727265
2 Unif+Pois 0.02912956
3 Unif+Weib 0.01613404
4 Exp+Pois 0.04853872
5 Exp+Weib 0.03252025
6 Pois+Weib 0.03573252
7 Unif+Exp+Pois 0.05138219
8 Unif+Exp+Weib 0.03571401
9 Unif+Pois+Weib 0.04112936
10 Exp+Pois+Weib 0.06209911
Okay, so everything works. However, when I pass a very large data frame to the function and add more statistics to compute (adjusted R-squared, AIC, BIC, and so on), it takes ages. Is there any way to make this function faster? For example, can the double loop be avoided, or is there a built-in R function for creating such combinations?
To summarize: how can I make combination_rsquare() run faster?
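One possible direction, sketched under assumptions and not benchmarked: avoid growing vectors with append(), build every combination up front with combn(..., simplify = FALSE), and fit each model with the low-level .lm.fit() instead of lm() plus summary(). The function name combination_rsquare_fast is hypothetical, and the sketch assumes all predictors are numeric (so as.matrix() yields a numeric matrix) and that every model includes an intercept.

```r
combination_rsquare_fast <- function(y, data, factor_number = c(2, 3)) {
  X <- as.matrix(data)              # assumes numeric predictors only
  tss <- sum((y - mean(y))^2)       # total sum of squares, computed once

  # All variable combinations for every requested size, as a flat list
  combos <- unlist(
    lapply(factor_number, function(k) combn(colnames(X), k, simplify = FALSE)),
    recursive = FALSE
  )

  # R-squared per combination from the residuals of a bare least-squares fit
  r_sq <- vapply(combos, function(vars) {
    fit <- .lm.fit(cbind(1, X[, vars, drop = FALSE]), y)
    1 - sum(fit$residuals^2) / tss
  }, numeric(1))

  data.frame(
    name_vec = vapply(combos, paste, character(1), collapse = "+"),
    r_sq = r_sq
  )
}

# Usage on data shaped like the example above
set.seed(1)
Norm <- rnorm(100)
df <- data.frame(Unif = runif(100), Exp = rexp(100),
                 Pois = rpois(100, 1), Weib = rweibull(100, 1))
res <- combination_rsquare_fast(Norm, df)
```

Statistics such as AIC or BIC are not returned by .lm.fit() and would have to be computed from the residual sum of squares, so whether this pays off depends on which statistics are needed.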

speed up replication of rows using model

I would like to create replicate predictions for one integer independent variable (iv1), given some model and a data frame called training. This is my current approach. I appreciate that it is not self-contained, but hopefully it is self-explanatory:
number_of_samples <- 10
results <- NULL
for (row in 1:nrow(training)) {
fake_iv1_values <- sample(1:100, number_of_samples)
case <- training[row,]
for (iv1 in fake_iv1_values) {
case$iv1 <- iv1
case$prediction <- predict(some_model, newdata = case)
results <- rbind(results, case)
}
}
Using loops is very slow. I wonder if this could be sped up. Thanks!
Try with this.
Reproducible fake data and model:
# create fake data
n_row <- 100
n_xs <- 100
training <- data.frame(y = rnorm(n_row), iv1 = rnorm(n_row))
training[, paste0("x",1:n_xs)] <- replicate(n_xs, list(rnorm(n_row)))
# example model
some_model <- lm(y~., training)
Rewritten code:
number_of_samples <- 10
results <- NULL
# vector of several fake_iv1_values vectors
fake_iv1_values <- as.numeric(replicate(nrow(training), sample(1:100, number_of_samples)))
# replicate each row of the original dataframe
results <- training[rep(seq_len(nrow(training)), each = number_of_samples), ]
# add fake values to the replicated dataframe
results$iv1 <- fake_iv1_values
# get predictions
results$prediction <- predict(some_model, newdata = results)

R list containing training set and test set objects

I am trying to create 10 folds of my data. What I want is a data structure of length 10 (the number of folds) in which each element holds an object with two elements: the training set and the test set for that fold. This is my R code.
I wanted to access, for example, the training set at fold 8 with View(data_pairs[[8]]$training_set), but it did not work. Any help would be appreciated :)
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(data), replace = TRUE)
data_pairs <- list()
for (j in i) {
test_ind <- which(folds==j,arr.ind=TRUE)
test <- data[test_ind,]
train <- data[-test_ind,]
data_pair <- list(training_set = list(train), test_set = list(test))
data_pairs <- append(x = data_pairs, values = data_pair)
}
You were very close; you just needed to wrap values in a list call (and drop the inner list() around train and test).
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(mtcars), replace = TRUE)
data_pairs <- list()
for (j in i) {
test_ind <- which(folds==j,arr.ind=TRUE)
test <- mtcars[test_ind,]
train <- mtcars[-test_ind,]
data_pair <- list(training_set = train, test_set = test)
data_pairs <- append(x = data_pairs, values = list(data_pair))
#data_pairs <- c(data_pairs, list(data_pair))
}
If your data is big, I would suggest you read these two posts on more efficient ways to grow a list:
Append an object to a list in R in amortized constant time, O(1)?
Here we go again: append an element to a list in R
I would also like to point out that you are not creating "folds" of your data. In your case you are attempting a 10-fold cross validation, which means your data should be separated into 10 "equal" sized chunks. Then you create 10 train/test data sets using each fold as the test data and the rest for training.
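As a rough sketch of that idea in base R (not the only way to do it), fold labels can be assigned so that the fold sizes differ by at most one row, and the list of pairs built with lapply instead of append. The name fold_id is introduced here for illustration, and mtcars stands in for your data.

```r
set.seed(123)   # only so the sketch is reproducible
k <- 10         # number of folds

# Repeat the labels 1..k to cover every row, then shuffle them:
# each row gets exactly one fold and fold sizes differ by at most one
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# One training/test pair per fold: fold j is the test set, the rest is training
data_pairs <- lapply(seq_len(k), function(j) {
  list(
    training_set = mtcars[fold_id != j, ],
    test_set     = mtcars[fold_id == j, ]
  )
})
```

The training set for fold 8 is then data_pairs[[8]]$training_set.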
It seems like the package modelr could help you here.
In particular I would point you to:
https://modelr.tidyverse.org/reference/resample_partition.html
library(modelr)
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))
mod <- lm(mpg ~ wt, data = ex$train)
rmse(mod, ex$test)
#> [1] 3.229756
rmse(mod, ex$train)
#> [1] 2.88216
Alternatively, producing a dataset of these partitions can be done with:
crossv_mc(data, n, test = 0.2, id = ".id")

How to store values from loop to a dataframe in R?

I am new to R and programming. I want to store values from a loop in a data frame in R: the ker, cValues, and accuracyValues values from the code below. I am not able to achieve this; the data frame only saves the last value, not all the values.
Can you please help me with this?
# Define a vector which has different kernel methods
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
"besseldot","anovadot","splinedot")
# Define a for loop to calculate accuracy for different values of C and kernel
for (ker in kerna){
cValues <- c()
accuracyValues <- c()
for (c in 1:100) {
model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
data = credit_card_data,
type ="C-svc",
kernel = ker,
C=c,
scaled =TRUE)
pred <- predict(model,credit_card_data[,1:10])
#pred
accuracy <- sum(pred== credit_card_data$V11)/nrow(credit_card_data)
cValues[c] <- c;
accuracyValues[c] <- accuracy;
}
for(i in 1:100) {
print(paste("kernal:",ker, "c=",cValues[i],"accuracy=",accuracyValues[i]))
}
}
Starting from your base code, set up the structure of the output data frame. Then, loop through and fill in the accuracy values on each iteration. This method also "flattens" the nested loop and gets rid of your c variable which conflicts with the built-in c() function.
kerna <- c("rbfdot","polydot","vanilladot","tanhdot","laplacedot",
"besseldot","anovadot","splinedot")
# Create dataframe to store output data
df <- data.frame(kerna = rep(kerna, each = 100),
cValues = rep(1:100, times = length(kerna)),
accuracyValues = NA,
stringsAsFactors = FALSE)
# Define a for loop to calculate accuracy for different values of C and kernel
for (i in 1:nrow(df)){
ker <- df$kerna[i]
j <- df$cValues[i]
model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
data = credit_card_data,
type ="C-svc",
kernel = ker,
C=j,
scaled =TRUE)
pred <- predict(model,credit_card_data[,1:10])
accuracy <- sum(pred== credit_card_data$V11)/nrow(credit_card_data)
# Insert accuracy into df$accuracyValues
df$accuracyValues[i] <- accuracy;
}
Consider Map to build a list of data frames, one for each pairing of kerna and c_value (1:100) generated by expand.grid, and then row-bind all elements together.
library(kernlab) # provides ksvm
k_c_pairs_df <- expand.grid(kerna = kerna, c_value = 1:100, stringsAsFactors = FALSE)
model_fct <- function(ker, c_value) {
  model <- ksvm(V11~V1+V2+V3+V4+V5+V6+V7+V8+V9+V10,
                data = credit_card_data,
                type = "C-svc",
                kernel = ker,
                C = c_value,
                scaled = TRUE)
  pred <- predict(model, credit_card_data[, 1:10])
  accuracy <- sum(pred == credit_card_data$V11) / nrow(credit_card_data)
  # Report progress using the values local to this call
  print(paste("kernel:", ker, "c =", c_value, "accuracy =", accuracy))
  return(data.frame(kernel = ker, cValues = c_value, accuracyValues = accuracy))
}
df_list <- Map(model_fct, k_c_pairs_df$kerna, k_c_pairs_df$c_value)
final_df <- do.call(rbind, df_list)
