Logistic regression training and test data - r

I am a beginner to R and am having trouble with something that feels basic but I am not sure how to do it. I have a data set with 1319 rows and I want to setup training data for observations 1 to 1000 and the test data for 1001 to 1319.
Comparing with notes from my class and the professor set this up by doing a Boolean vector by the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns the True/False statements.
I understand that and would be able to setup a Boolean vector if I was subsetting my data by a variable but instead I have to strictly by the number of rows which I do not understand how to accomplish. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?

You get logical(0) because nrow is not a column
You can also subset your dataframe by using row numbers
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]
But be careful, unless the order of rows in your dataframe is completely random, you probably want to get 1000 rows randomly and not the 1000 first ones, you can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff function is used to get all rows not selected in train

The issue with splitting your data set by rows is the potential to introduce bias into your training and testing set - particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = T),
data = sample(seq(0, 1, by = 0.01), 1000, replace = T))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to using cross validation in your models.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
p = 0.8, list = FALSE)
Note y = data$year this tells R to use the variable year to sample from, ensuring you don't get ordered data and introduced bias to the model.
The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000

Related

How to apply MICE imputations on test set?

I have two separate data sets: one for train (1000000 observation) and the other one for test (1000000 observation). I divided the train set into 3 sets (mytrain: 700000 observations, myvalid: 150000 observations, mytest:150000 observations). Thetest set with 1000000 observations doesn't include the target variable, so it should be used for the final test. Since there are some missing values for categorical variables, I need to use mice to impute them. I should reuse the imputation done on mytrain set to fill the missing values in the myvalid, mytest and test sets. Based on the answer to this question, I should do this:
data2 <- rbind(mytrain,myval,mytest,test)
data2$ST_EMPL <- as.factor(data2$ST_EMPL)
data2$TYP_RES <- as.factor(data2$TYP_RES)
imp <- mice(data2, method = "cart", m = 1, maxit = 1, seed = 123,
ignore = c(rep(FALSE, 700000),rep(TRUE, 1300000)))
data2.imp <- complete(imp,1)
summary(imp)
mytrainN <- data2.imp[1:700000,]
myvalN <- data2.imp[700001:850000,]
mytestN <- data2.imp[850001:1000000,]
testN <- data2.imp[1000001:2000000,]
However, since the test set does not have the target column, it is not possible to merge it with mytrain, mytest, and myvalid. Is it possible to add a hypothetical target column (with the value of say 10 for all 1000000 observations) to the test set?

Weighted sampling in R problems

enter image description here
I want to do weighted sampling in R, with original data imbalanced between 0 and 1, I used sample to do but result in still biased data.
nsample=4000
model_weights <- ifelse(train$Bankrupt == 1,0.9677419,0.03225806)
samp_idx <- sample(4107, nsample, replace=T, prob=model_weights)
data.weighted <- data[samp_idx, ]
table(data.weighted$Bankrupt)
0 1
3761 239
Look at the documentation for the stratified function. You want to do something like this, but it is impossible to tell from the data you provide.
stratified(DF, "Status", c(Bankrupt = 30, NotBankrupt = 1))
The column in the data holding the groups should be character and the groups should match those in your list of weights you pass to the stratified function.

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this seperately for 10 times, but I want do that a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without reproducible there is no way to test this but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusionmat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line to the function with return(list(prediction, confusionmat, get.BER(confusion.mat)). This will produce a list of the lists so that results[[1]][[1]] will be the results of prediction for 10 PCs and results[[1]][[2]] and results[[1]][[3]] will be confusionmat and get.BER(confusion.mat) respectively.

How to downsample within recursive feature elimination using caret?

Consider the data frame data created here:
set.seed(123)
num = sample(5:20, replace = T, 20)
id = letters[1:20]
loc <- rep(id, num)
data <- data.frame(Location = loc)
data[paste0('var', seq_along(1:10))] <- rnorm(length(id) * sum(num))
Assuming data is my training data; Each row represents measurements that were taken on a randomly sampled individuals from populations identified by the grouping variable Location. I want to use recursive feature elimination to identify the best subset of predictors for predicting Location. Analogously, I want to understand how much variation each of the predictors explain in Location (i.e., which ones are most important, and how much more important are they). I have read how this can be done using the caret package using something like this:
library(caret)
subsets <- 1:9
ctrl <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 10, verbose = F)
lmProfile <- rfe(data[,2:10], data[,1], sizes = subsets, rfeControl = ctrl)
In my data example, considering the unbalanced number of samples in each Location, I want to use down sampling to ensure that the same number of samples is being considered across the levels of Location upon each iteration. Could someone demonstrate how I might do this?

Scaling only some columns of a training set and a test set

I often have to deal with the following issue:
I have a test set and a training set
I want to scale all columns of a training set, except for a few ones which are identified by a character vector
then, based on the sample means and sample standard deviations of the selected columns of the training set, I want to rescale the test set too
Currently, my workflow is kludgy: I use an index vector and then partial assignment to scale only some columns of the train set. I store the means and standard deviations from the scaling operation on the training set, and I use them to scale the test set. I was wondering if there could be a simpler way, without having to install caret (for a series of reasons, I'm not a big fan of caret and I definitely won't start using it just for this problem).
Here is my current workflow:
# define dummy train and test sets
train <- data.frame(letters = LETTERS[1:10], months = month.abb[1:10], numbers = 1:10,
x = rnorm(10, 1), y = runif(10))
test <- train
test$x <- rnorm(10, 1)
test$y <- runif(10)
# names of variables I don't want to scale
varnames <- c("letters", "months", "numbers")
# index vector of columns which must not be scaled
index <- names(train) %in% varnames
# scale only the columns not in index
temp <- scale(train[, !index])
train[, !index] <- temp
# get the means and standard deviations from temp, to scale test too
means <- attr(temp, "scaled:center")
standard_deviations <- attr(temp, "scaled:center")
# scale test
test[, !index] <- scale(test[, !index], center = means, scale = standard_deviations)
Is there a simpler/more idiomatic way to do this?
It is a nice question and I have tried a lot to come up with an answer.
I think this is a bit more elegant code:
train0=train%>%select(-c(letters, months, numbers))%>%as.matrix%>%scale
means <- attr(train0, "scaled:center")
standard_deviations <- attr(train0, "scaled:center")
train0=cbind(select(train,c(letters, months, numbers)),train0)
test0=test%>%select(-c(letters, months, numbers))%>%as.matrix%>%scale(center = means, scale = standard_deviations)
test0=cbind(select(test,c(letters, months, numbers)),test0)
I have tried hard to work with mutate_at in order to avoid cbind extra code but with no lack

Resources