Scaling only some columns of a training set and a test set - r

I often have to deal with the following issue:
I have a test set and a training set
I want to scale all columns of a training set, except for a few ones which are identified by a character vector
then, based on the sample means and sample standard deviations of the selected columns of the training set, I want to rescale the test set too
Currently, my workflow is kludgy: I use an index vector and then partial assignment to scale only some columns of the train set. I store the means and standard deviations from the scaling operation on the training set, and I use them to scale the test set. I was wondering if there could be a simpler way, without having to install caret (for a series of reasons, I'm not a big fan of caret and I definitely won't start using it just for this problem).
Here is my current workflow:
# define dummy train and test sets
train <- data.frame(letters = LETTERS[1:10], months = month.abb[1:10], numbers = 1:10,
x = rnorm(10, 1), y = runif(10))
test <- train
test$x <- rnorm(10, 1)
test$y <- runif(10)
# names of variables I don't want to scale
varnames <- c("letters", "months", "numbers")
# index vector of columns which must not be scaled
index <- names(train) %in% varnames
# scale only the columns not in index
temp <- scale(train[, !index])
train[, !index] <- temp
# get the means and standard deviations from temp, to scale test too
means <- attr(temp, "scaled:center")
standard_deviations <- attr(temp, "scaled:center")
# scale test
test[, !index] <- scale(test[, !index], center = means, scale = standard_deviations)
Is there a simpler/more idiomatic way to do this?

It is a nice question and I have tried a lot to come up with an answer.
I think this is a bit more elegant code:
train0=train%>%select(-c(letters, months, numbers))%>%as.matrix%>%scale
means <- attr(train0, "scaled:center")
standard_deviations <- attr(train0, "scaled:center")
train0=cbind(select(train,c(letters, months, numbers)),train0)
test0=test%>%select(-c(letters, months, numbers))%>%as.matrix%>%scale(center = means, scale = standard_deviations)
test0=cbind(select(test,c(letters, months, numbers)),test0)
I have tried hard to work with mutate_at in order to avoid cbind extra code but with no lack

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models that I would like to compare their choices of important predictors over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience sake (The original names are extremely long). I would like to report the top features then rename these variables with a shorter more concise name.
My attempt to solve this is by extracting the top variables in each iterated model, put it into a list, then finding the mean of the top variables in X amount of loops. However, my issue is I still find variability with the top 10 most used predictors and so I cannot manually alter the variable names as each run on the code chunk yields different results. I suspect this is because I have so many variables in my analysis and due to CV causing the creation of new models every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coeffecients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above you would notice that the top 3 variables change and so if I had to rename these variables it would be difficult to do if they are not constant and changing every run. Any suggestions on how I could approach this?
You can use function set.seed() to ensure your sample will return the same sample each time. For example
set.seed(123)
When I add this to above code and then run twice, the following is returned both times:
wt carb hp
98 89 86

Logistic regression training and test data

I am a beginner to R and am having trouble with something that feels basic but I am not sure how to do it. I have a data set with 1319 rows and I want to setup training data for observations 1 to 1000 and the test data for 1001 to 1319.
Comparing with notes from my class and the professor set this up by doing a Boolean vector by the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns the True/False statements.
I understand that and would be able to setup a Boolean vector if I was subsetting my data by a variable but instead I have to strictly by the number of rows which I do not understand how to accomplish. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?
You get logical(0) because nrow is not a column
You can also subset your dataframe by using row numbers
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]
But be careful, unless the order of rows in your dataframe is completely random, you probably want to get 1000 rows randomly and not the 1000 first ones, you can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff function is used to get all rows not selected in train
The issue with splitting your data set by rows is the potential to introduce bias into your training and testing set - particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = T),
data = sample(seq(0, 1, by = 0.01), 1000, replace = T))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to using cross validation in your models.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
p = 0.8, list = FALSE)
Note y = data$year this tells R to use the variable year to sample from, ensuring you don't get ordered data and introduced bias to the model.
The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000

R: iterate a function over two lists simultaneously using lapply?

I have multiple factors dividing my data.
By one factor (uniqueGroup), I would like to subset my data, by another factor (distance), I want to first classify my data by "moving threshold", and then test statistical difference between groups.
I have created a function movThreshold to classify my data, and test it by wilcox.test. To vary the different threshold values, I just run
lapply(th.list, # list of thresholds
movThreshold, # my function
tab = tab, # original data
dependent = "infGrad") # dependent variable
Now I've realized, that in fact I need to firstly subset my data by uniqueGroup, and then vary the threshold value. But I am not sure, how to write it in my lapply code?
My dummy data:
set.seed(10)
infGrad <- c(rnorm(20, mean=14, sd=8),
rnorm(20, mean=13, sd=5),
rnorm(20, mean=8, sd=2),
rnorm(20, mean=7, sd=1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab<-data.frame(infGrad, distance, uniqueGroup)
# Create moving threshold function &
# test for original data
# ============================================
movThreshold <- function(th, tab, dependent, ...) {
# Classify data
tab$group<- ifelse(tab$distance < th, "a", "b")
# Calculate wincoxon test - as I have only two groups
test<-wilcox.test(tab[[dependent]] ~ as.factor(group), # specify column name
data = tab)
# Put results in a vector
c(th, unique(tab$uniqueGroup), dependent, uniqueGroup, round(test$p.value, 3))
}
# Define two vectors to run through
# unique group
gr.list<-unique(tab$uniqueGroup)
# unique threshold
th.list<-c(2,3,4)
How to run lapply over two lists??
lapply(c(th.list,gr.list), # iterate over two vectors, DOES not work!!
movThreshold,
tab = tab,
dependent = "infGrad")
In my previous question (Kruskal-Wallis test: create lapply function to subset data.frame?), I've learnt how to iterate through individual subsets within a table:
lapply(split(tab, df$uniqueGroup), movThreshold})
But how to iterate through subsets, and through thresholds at once?
If I understood correctly what you're trying to do, here is a data.table solution:
library(data.table)
setDT(tab)[, lapply(th.list, movThreshold, tab = tab, dependent = "infGrad"), by = uniqueGroup]
Also, you can just do a nested lapply.
lapply(gr.list, function(z) lapply(th.list, movThreshold, tab = tab[uniqueGroup == z, ], dependent = "infGrad"))
I apologize, If I misunderstood what you're trying to do.

Get different test and training sets from the same sample

I have some data for which I want to compare a few different linear models. I can use caTools::sample.split() to get one training/test set.
I would like to see how the model would change if I had used a different training/test set from the same sample. If I do not use set.seed() I should get a different set every time I call sample.split.
I am using lapply to call the function a certain number of times right now:
library(data.table)
library(caTools)
dat <- as.data.table(iris)
dat_list <- lapply(1:20, function(z) {
sample_indices <- sample.split(dat$Sepal.Length, SplitRatio = 3/4)
inter <- dat
inter$typ <- "test"
inter$typ[sample_indices] <- "train"
inter$set_no <- z
return(as.data.table(inter))})
And for comparing the coefficients:
coefs <- sapply(1:20, function(z){
m <- lm(Sepal.Length ~ Sepal.Width, data = dat_list[[z]][typ == "train"])
return(unname(m$coefficients))
})
The last few lines could be edited to return the RMS error when predicting values in the test set (typ=="test").
I'm wondering if there's a better way of doing this?
I'm interested in splitting the data efficiently (my actual data set is quite large)
I'm a big advocate of lists of data frames, but it doesn't make sense to duplicate your data in a list - especially if it's biggish data, you don't need 20 copies of your data to have 20 train-test splits.
Instead, just store the indices of the train and test sets, and give the appropriate subset to the model.
n = 5
train_ind = replicate(n = n, sample(nrow(iris), size = 0.75 * nrow(iris)), simplify = FALSE)
test_ind = lapply(train_ind, function(x) setdiff(1:nrow(iris), x))
# then modify your loop to subset the right rows
coefs <- sapply(seq_len(n), function(z) {
m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
return(m$coefficients)
})
It's also good to parameterize anything that is used more than once. If you want to change to 20 replicates, set up your code so you change n = 20 at the top and don't have to go through the whole thing looking for every time you used 5 to change it to 20. It might be nice to pull out the split_ratio = 0.75 and put it on it's own line at the top too, even though it's only used once.

caret:rfe get best performing variables for a certain size

I ran a rfe Model with around 400 variables and got the result that the optimal model uses 40 variables. However, plotting the standard deviations of the error based on cross validation shows that the 40 variable model performs only slightly better than a model with only 10 variables. That's why I'd like to go for the model with 10 variables. I would like to use the 10 variables which perform best for a ten- variable-model and train the model again.
How can I get the 10 variables which lead to the model performance shown in the rfe object?
Since I use rerank=TRUE, I cannot just pick the 10 highest ranked variables from varImp(rfeModel$fit) right? (Would this work if I was not using "rerank" ?)
I'm also struggling with the differences between the output from varImp(rfeModel$fit), varImp(rfeModel), pickVars(rfeModel$variables,40).
What is the right way to get the best performing variables with regard to the size of interest?
The following example can be used:
data(BloodBrain)
x <- scale(bbbDescr[,-nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)
set.seed(1)
rfProfile <- rfe(x, logBBB,
sizes = c(2, 5, 10, 20),
method="nnet",
maxit=10,
rfeControl(functions = caretFuncs,
returnResamp="all",
method="cv",
rerank=TRUE))
print(rfProfile), varImp(rfProfile$fit), varImp(rfProfile), pickVars(rfProfile$variables, rfProfile$optsize)
The simplest thing to do is to use the update function:
new_profile <- update(rfProfile, x = x, y = logBBB, size = 10)
Internally, it uses this code:
selectedVars <- rfProfile$variables
bestVar <- rfProfile$control$functions$selectVar(selectedVars, 10)
Max

Resources