How do I add new columns to a data set for each regression loop iteration? - r

I'm trying to test the predictive power of a model by breaking the observations into 1/4th and 3/4th groups (test and train respectively), running a first-order regression with the independent variable train sample, using these coefficients to produce predicted values from the independent variable test sample, and then I would like to add new columns of these predicted values to the dependent variable test data for each iteration of the loop.
For context: TSIP500 is the full sample; iv is independent variable; dv is dependent variable, a max of 50 iterations is simply a test that isn't too large in quantity of iterations.
I was having trouble with the predict function so I did the equation manually. My code is below:
for(i in 1:50){
test_index <- sample(nrow(TSIP500iv), (1/4)*nrow(TSIP500iv), replace=FALSE)
train_500iv <- TSIP500[-test_index,"distance"]
test_500iv <- TSIP500[test_index,"distance"]
train_500dv <- TSIP500[-test_index,"percent_of_max"]
test_500dv <- TSIP500[test_index,"percent_of_max"]
reg_model <- lm(train_500dv~train_500iv)
int <- reg_model$coeff[1]
B1 <- reg_model$coeff[2]
predicted <- (int + B1*test_500iv)
predicted <- data.frame(predicted)
test_500dv <- data.frame(test_500dv)
test_500dv[,i] <- apply(predicted)
}
I've tried different approaches for the last line, but I always just get a singular column added. Any help would be tremendously appreciated.

for(i in 1:50){
test_index <- sample(nrow(TSIP500iv), (1/4)*nrow(TSIP500iv), replace=FALSE)
train_500iv <- TSIP500[-test_index,"distance"]
test_500iv <- TSIP500[test_index,"distance"]
train_500dv <- TSIP500[-test_index,"percent_of_max"]
test_500dv <- TSIP500[test_index,"percent_of_max"]
reg_model <- lm(train_500dv~train_500iv)
int <- reg_model$coeff[1]
B1 <- reg_model$coeff[2]
temp_results <- paste('pred',i,sep='_')
assign(temp_results, as.data.frame(int + B1*test_500iv))
test_500dv <- cbind(data.frame(test_500dv),temp_results)
}

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models that I would like to compare their choices of important predictors over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience sake (The original names are extremely long). I would like to report the top features then rename these variables with a shorter more concise name.
My attempt to solve this is by extracting the top variables in each iterated model, put it into a list, then finding the mean of the top variables in X amount of loops. However, my issue is I still find variability with the top 10 most used predictors and so I cannot manually alter the variable names as each run on the code chunk yields different results. I suspect this is because I have so many variables in my analysis and due to CV causing the creation of new models every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coeffecients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above you would notice that the top 3 variables change and so if I had to rename these variables it would be difficult to do if they are not constant and changing every run. Any suggestions on how I could approach this?
You can use function set.seed() to ensure your sample will return the same sample each time. For example
set.seed(123)
When I add this to above code and then run twice, the following is returned both times:
wt carb hp
98 89 86

Is there a way to get the index of a list in R without match or which

I am trying to detect anomalies in the iris dataset by normalising the data into iris_norm, then splitting that into a training and testing set, then using the knn function to find anomalies. now I can extract those anomalies from the normalised iris_test set but not from the actual iris set, is there a way for me to use the indexes of the values in 'actual' as the indexes in iris? Here is my code
library(gmodels)
library(class)
library(tidyverse)
# STEP 1: Import your dataset, look at a summary
summary(iris)
# STEP 2: Generate a random number to split the dataset.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
# The normalization function is created
nor <-function(x) {(x -min(x))/(max(x)-min(x))}
# Run nomalisation on predictor columns
iris_norm <- as.data.frame(lapply(iris[,c(1,2,3,4)], nor))
##extract training set
iris_train <- iris_norm[ran,]
##extract testing set
iris_test <- iris_norm[-ran,]
# Extract 5th column of train dataset because it will be used as
#'cl' argument in knn function.
iris_target_category <- iris[ran,5]
##extract 5th column if test dataset to measure the accuracy
iris_test_category <- iris[-ran,5]
##run knn function
pr <- knn(iris_train,iris_test,cl=iris_target_category,k=15)
##create confusion matrix
tab <- table(pr,iris_test_category)
##this function divides the correct predictions by total number of predictions
#that tell us how accurate teh model is.
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
#create a cross table to see where the wrong predictions are
mytab <- CrossTable(iris_test_category, pr, FALSE)
#anomaly indexes
anomalies_index <- which(iris_test_category != pr)
# get the anomaly values
anomaly_value1 <- iris_test[iris_test_category != pr, "Sepal.Length"]
anomaly_value2 <- iris_test[iris_test_category != pr, "Sepal.Width"]
anomaly_value3 <- iris_test[iris_test_category != pr, "Petal.Length"]
anomaly_value4 <- iris_test[iris_test_category != pr, "Petal.Width"]
anomalies <- data.frame(anomaly_value1, anomaly_value2,
anomaly_value3, anomaly_value4)
actual <- iris_test[anomalies_index,]
print(anomalies)
print(actual)
I found the solution a few minutes later, all I had to do was
actual_index <- as.numeric(rownames(actual))
iris[actual_index,]
and I was able to extract the correct values

How to create correlation matrix after mice multiple imputation

I'm using the mice package to create multiple imputations. I want to create a correlations matrix (and a matrix of p-values for the correlation coefficients. I use miceadds::micombine.cor to do this. But this gives a dataframe with variables in the first to columns, and then a number of columns to contain r, p, t-values, and the like.
I'm looking for a way to turn this dataframe into a "good old" matrix with the correlation coefficient between x and y in position [x,y], and a matrix with p-values Does anyone have an easy way to do this?
Here's some code to reproduce:
data <- mtcars
mt.mis <- prodNA(mtcars, noNA = 0.1)
imputed <-mice(iris.mis, m = 5, maxit = 5, method = "pmm")
correlations<- miceadds::micombine.cor(mi.res=iris.mis, variables = c(1:3))
What I'm looking for is something like the output from cor(mtcars). Who can help?
I ended up writing my own function. Can probably be done much more efficiently, but this is what I made.
cormatrix <- function(r, N){
x <- 1
cormatrix <- matrix(nrow = N, ncol = N) # create empty matrix
for (i in 1:N) {
for (j in i:N) {
if(j>i){
cormatrix[i,j] <- r[x]
cormatrix[j,i] <- r[x]
x <- x + 1
}
}
}
diag(cormatrix) <- 1
cormatrix
}
You can call it with the output of micombine.cor and the number of variables in your model as arguments. So for example cormatrix(correlations$r,ncol(df)).

How to randomly select training and test subsets have same proportion of values for a specific variable in R?

I have a dataset data with 16 variables. One of the variables, DiseasePositive, indicates whether someone has been positive for a disease. Its values are therefore either 0 or 1.
What I want to do is as follows:
Randomly select a subset of 70% of my data to train the model.
Make sure that the train and test sets have approximately equal proportions of people with DiseasePositive==0 and people with DiseasePositive==1.
I read that I can use sample.split to do the 70% thing, but I don't know how to do the second thing. How can I do this using the sample.split function (from the caTools package)?
What I've done is this but I'm not sure if this is how the function works:
data$spl <- sample.split(data$DiseasePositive,SplitRatio = 0.7)
train <- subset(data, data$spl==TRUE)
test <- subset(data, data$spl==FALSE)
Here is a custom-made R solution:
stratified.sample <- function(var, p) {
obs <- seq_along(var)
grps <- unique(var)
inds <- numeric()
for(g in grps) {
inds <- c(inds, sample(obs[var==g], floor(sum(var==g)*p)))
}
inds
}
You can use the above function to stratify into test and train for any variable, even if it has more than 2 levels. Here is a demonstration using iris:
tinds <- stratified.sample(iris$Species, 0.7)
train <- iris[tinds,]
test <- iris[-tinds,]
Make sure that the class balances were preserved:
table(train$Species)
table(test$Species)
Using sample.split and your data:
inds <- sample.split(data$DiseasePositive, SplitRatio = 0.7)
train <- data[inds,]
test <- data[!inds,]

R: looped variable assignment, augmenting variable calculation each time

I am trying to calculate a regression variable based on a range of variables in my data set. I would like the regression variable (ei: Threshold 1) to be calculated using a different variable set in each iteration of running the regression.
Aim to collected SSR values for each threshold range, and thus identify the ideal threshold based on the data.
Data (df) variables: Yield, Prec, Price, 0C, 1C, 2C, 3C, 4C, 5C, 6C, 7C, 8C, 9C, 10C
Each loop calculates "thresholds" by selecting a different "b" each time.
a <- df$0C
b <- df$1C
Threshold1 <- (a-b)
Threshold2 <- (b)
Where "b" would be changing in each loop, ranging from 1C to 9C.
Each individual threshold set (1 and 2) should be used to run a regression, and save the SSR for comparison with the subsequent regression utilizing thresholds based on a new "b" value (ranging from 1C TO 9C)
Regression:
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
for each loop of the Regression, I vary the components of calculating thresholds in the following manner:
Current approach is centered around the following code:
df <- read.csv("Data.csv",header=TRUE)
names(df)
0C-9Cvarlist <- names(df)[9:19]
ssr.vec <- matrix(,21,1)
for(i in 1:length(varlist)){
a <- df$0C
b <- df$[i]
Threshold1 <- (a-b)
Threshold2 <- (b)
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
r2 <- summary(reg)$r.squared
ssr.vec[i,] <- c(varlist,r2)
}
colnames(ssr.vec) <- c("varlist","r2")
I am failing to achieve the desired result with the above approach.
Thank you.
I can spot quite a few mistakes...
You need to add variables of interest (Threshold1 anf Threshold2) to the data in the regression. Also, I think that you need to select varlist[i] and not varlist to create your ssr.vec. You need 2 columns to your ssr.vec which is a matrix, so you should call it matrix. You also cannot use something like df$[i] to extract a column! Why is the matrix of length 21 ?! Change the column name to C0,..,C9 and not 0C,..,9C.
For future reference, solve the simple errors before asking question... and include error messages in your post!
This should do the job:
df <- read.csv("Data.csv",header=TRUE)
names(df)[8:19] = paste0("C",0:10)
varlist <- names(df)[9:19]
ssr.vec <- matrix(,21,2)
for(i in 1:length(varlist)){
a <- df$C0
b <- df[,i+9]
df$Threshold1 <- (a-b)
df$Threshold2 <- (b)
reg <- lm(log(Yield)~Threshold1+Threshold2+log(Price)+prec+I(prec^2),data=df)
r2 <- summary(reg)$r.squared
ssr.vec[i,] <- c(varlist[i],r2)
}
colnames(ssr.vec) <- c("varlist","r2")

Resources