Exclude missing values from model performance calculation - R

I have a dataset and I want to build a model, preferably with the caret package. My data is actually a time series, but the question is not specific to time series; it's just that I use createTimeSlices for the data partition.
My data has a certain amount of missing values (NA), and I imputed them separately from the caret code. I also kept a record of their locations:
# a logical vector the same size as the data: TRUE where the value was imputed
imputed <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
imputed[imputed] <- NA; print(imputed)  # mark the imputed positions as NA
#### [1] FALSE FALSE FALSE NA FALSE FALSE
I know there is an option in caret's train function to either exclude the NAs or impute them with different techniques. That's not what I want. I need to build the model on the already imputed dataset, but I want to exclude the imputed points from the calculation of the error indicators (RMSE, MAE, ...).
I don't know how to do this in caret. In my first script I did the whole cross-validation manually, and used a customized error measure:
actual = c(5, 4, 3, 6, 7, 5)
predicted = c(4, 4, 3.5, 7, 6.8, 4)
Metrics::rmse(actual, predicted) # with all the points
#### [1] 0.7404953
sqrt(mean( (!imputed)*(actual-predicted)^2 , na.rm=T)) # excluding the imputed
#### [1] 0.676757
How can I do this in caret? Or is there another way to avoid coding everything by hand?

I don't know if this is what you are looking for, but here is a simple solution using a helper function:
i <- which(imputed == FALSE)  # indices of the non-imputed observations (which() drops the NA)
metric_na <- function(fun, actual, predicted, index) {
  fun(actual[index], predicted[index])
}
metric_na(Metrics::rmse, actual, predicted, index = i)
#### [1] 0.676757
metric_na(Metrics::mae, actual, predicted, index = i)
#### [1] 0.54
Alternatively, you can use the index directly when calculating the desired metrics:
Metrics::rmse(actual[i], predicted[i])
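If you want this inside caret itself, one possible route (a sketch, not tested) is a custom summaryFunction passed to trainControl: recent caret versions hand the summary function a data frame that includes a rowIndex column, which can be matched against your imputed vector (use the plain logical version, not the one with NAs). The model, data name and time-slice window sizes below are placeholders:
library(caret)
# Assumes `imputed` is a logical vector over the original rows, TRUE where imputed
maskedSummary <- function(data, lev = NULL, model = NULL) {
  keep <- !imputed[data$rowIndex]          # held-out rows that were NOT imputed
  err  <- data$obs[keep] - data$pred[keep]
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
}
ctrl <- trainControl(method = "timeslice",
                     initialWindow = 36, horizon = 12,  # placeholder window sizes
                     fixedWindow = TRUE,
                     summaryFunction = maskedSummary)
# fit <- train(y ~ ., data = my_data, method = "lm",
#              trControl = ctrl, metric = "RMSE")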

Related

Handling skip in rpart and random forest

I have a dataset containing 10 categorical variables. Each of these has missing values coded as (-9, -6, -3, -2, -1). I want to create one column that takes the mean of these 10 variables, excluding the negative values. I can collapse the negative values into NA and then median-impute them, but I need to retain -6, since -6 implies that the person skipped the question because it does not apply to them. For instance, parental relationship quality does not apply to single parents. I ultimately want to use this variable as a predictor in my random forest model, so I am not sure how to handle -6 in this case. One way I can think of is to impute each of the 10 variables as follows (let's say the 10 variables are a1 to a10):
missing_categs <- c(-9, -3, -2, -1)
# replace the missing codes in a1 with the median of the valid values (codes and -6 excluded)
df$a1[df$a1 %in% missing_categs] <- median(df$a1[!df$a1 %in% c(missing_categs, -6)])
After the above step, I calculate the average of a1 to a10. The ones that yield -6 are the ones that pertain to single parents (which means it does not apply to them). Then, I convert -6 to NA. So, now I have average values and NAs. Can rpart and random forest models handle NA? Other, better alternative solutions are most welcome. Thanks in advance!
Can rpart and random forest models handle NA?
I do not know what you mean by handle. If you mean that you can use NA in the predictors, then the answer is yes for rpart:
> library(rpart)
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> rpart(df, na.action=na.pass)
n= 3
node), split, n, deviance, yval
* denotes terminal node
but no for randomForest:
> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> randomForest(df, na.action=na.pass)
Error in randomForest.default(df, na.action = na.pass) :
NA not permitted in predictors
If you mean by handle that they are able to deal with NAs in some manner, for example by passing them a function, then the answer is yes for both.
rpart and randomForest both have an na.action parameter which you can use; see the documentation (?rpart and ?randomForest) for the details.
The default na.action for rpart is na.rpart, which deletes "all observations for which y is missing", while "those in which one or more predictors are missing" are kept.
The default na.action for randomForest is na.fail, which returns the given data structure unaltered if no NAs are found, and "signals an error" if at least one NA is found.
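As an example of the second case, randomForest ships a helper called na.roughfix that imputes NAs in the predictors with the column median (numeric) or mode (factor). A minimal sketch with toy data:
library(randomForest)
set.seed(1)
df <- data.frame(x = c(1, 2, NA, 4, 5, NA, 7, 8, 9, 10),
                 y = rnorm(10))
# na.roughfix fills the NAs in x with the median of the observed values before fitting
rf <- randomForest(y ~ x, data = df, na.action = na.roughfix)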

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the results of the prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER functions. But I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without reproducible data there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of components, so results[[1]] will be the result for 10 components. You will not get the values of prediction or confusion.mat unless they are included in what the function returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 components, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
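Put together, a sketch of that extended function (with named list elements, reusing myPredictions and data_validation from the question) could look like this:
confmat_all <- function(x) {
  prediction    <- myPredictions$class$max.dist[, x]   # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  list(prediction = prediction,
       confusion.mat = confusion.mat,
       BER = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat_all)
results[[1]]$BER   # BER for the 10-component model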

Correcting for multiple comparisons in permutation procedure using R and multtest

I have carried out a permutation test comprising a null distribution of distances and then 5 observed distances as statistics. Now I would like to correct for multiple comparisons using the max-T method, using the multtest package and its ss.maxT, ss.minP and/or sd.maxT functions.
But I have problems implementing the functions and making sense of the results; the first function only gives 1s as a result, the second only gives back the unadjusted p-values, and the third throws an error.
Please see example data below:
## Example data
# Observed distances
obs <- matrix(c(0.001, 0.2, 0.50, 0.9, .9999))
null_values <- runif(20)
# Null distribution of distances
null <- matrix(null_values, nrow = length(obs), ncol = length(c(1:20)), byrow=TRUE)
null
# Hypotheses
alternative <- "more"
# The unadjusted raw p-values
praw <- c(0, 0.1, 0.45, 0.85, 1)
# Only getting 1s as results
adjusted_p_values_max <- multtest::ss.maxT(null, obs, alternative, get.cutoff=FALSE,
get.cr = FALSE, get.adjp = TRUE, alpha = 0.05)
adjusted_p_values_max
# Should probably use this one: but getting praw back, which is supposedly correct (but perhaps odd)
# this is because of the null distribution being identical for all 5 variables.
# Hence, should each word be tested against its own unique null distribution?
adjusted_p_values_min <- multtest::ss.minP(null, obs, praw, alternative, get.cutoff=FALSE,
get.cr = FALSE, get.adjp = TRUE, alpha=0.05)
adjusted_p_values_min
# Throws an error
adjusted_p_values_sdmax <- sd.maxT(null, obs, alternative, get.cutoff=TRUE,
get.cr = TRUE, get.adjp = TRUE, alpha = 0.05)
adjusted_p_values_sdmax
Considering the very different conclusions from the first two methods, I'm wondering if my plan to implement these methods is incorrect in the first place. Basically, I want to examine several hundred distances against a null distribution of several thousand.
obs = The observed distances from different observed points in space to the same "original" point A. (Hence, the distances are not independent, since they all relate to the same point.)
null = The null distribution comprises distances between points that have been randomly selected (replacement = TRUE) from the different observed points and the same original point A.
Using ss.maxT seems way too conservative to me, whereas it seems unnecessary to use ss.minP if it "just" returns the raw p-values; or what am I missing?
Can I perhaps solve this situation by constructing individual null distributions for every observed distance?
Thank you in advance!

Logistic regression training and test data

I am a beginner with R and am having trouble with something that feels basic, but I am not sure how to do it. I have a data set with 1319 rows and I want to set up the training data for observations 1 to 1000 and the test data for 1001 to 1319.
Comparing with notes from my class, the professor set this up by creating a Boolean vector from the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns a vector of TRUE/FALSE values.
I understand that and would be able to set up a Boolean vector if I were subsetting my data by a variable, but instead I have to subset strictly by the number of rows, which I do not know how to accomplish. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?
You get logical(0) because nrow is not a column of your data frame.
You can also subset your data frame using row numbers:
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]
But be careful: unless the order of rows in your data frame is completely random, you probably want to pick 1000 rows at random rather than the first 1000. You can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff function is used to get all rows not selected in train
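If you want the random split to be reproducible across runs, set a seed before sampling, for example:
set.seed(123)   # any fixed seed makes the random split reproducible
train      <- sample(1:nrow(data), 1000)
train_data <- data[train, ]
test_data  <- data[setdiff(1:nrow(data), train), ]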
The issue with splitting your data set by rows is the potential to introduce bias into your training and testing set - particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = T),
                   data = sample(seq(0, 1, by = 0.01), 1000, replace = T))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to using cross-validation in your models.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
                               p = 0.8, list = FALSE)
Note y = data$year: this tells caret to sample based on the variable year, ensuring you don't get ordered data and introduce bias into the model.
The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000
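Since the end goal is a logistic regression, a minimal sketch of fitting on the training partition and scoring the test partition could look like this (the column name outcome is a placeholder for your binary response):
# `outcome` is a hypothetical binary (0/1) response column; replace with your own
fit  <- glm(outcome ~ ., data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
mean(pred == test$outcome)   # simple test-set accuracy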

R - warning for dissimilarity calculation, clustering with numeric matrix

Reproducible data:
Data <- data.frame(
  X = sample(c(0,1), 10, replace = TRUE),
  Y = sample(c(0,1), 10, replace = TRUE),
  Z = sample(c(0,1), 10, replace = TRUE)
)
Convert the data frame to a matrix:
Matrix_from_Data <- data.matrix(Data)
Check the structure
str(Matrix_from_Data)
num [1:10, 1:3] 1 0 0 1 0 1 0 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "X" "Y" "Z"
The question:
I have dataframe of binary, symmetric variables (larger than the example), and I'd like to do some hierarchical clustering, which I've never tried before. There are no missing or NA values.
I convert the dataframe into a matrix before attempting to run the daisy function from the 'cluster' package, to get the dissimilarity matrix. I'd like to explore the options for calculating different dissimilarity metrics, but am running into a warning (not an error):
library(cluster)
Dissim_Euc_Matrix_from_Data <- daisy(Matrix_from_Data, metric = "euclidean", type = list(symm =c(1:ncol(Matrix_from_Data))))
Warning message:
In daisy(Matrix_from_Data, metric = "euclidean", type = list(symm = c(1:ncol(Matrix_from_Data)))) :
with mixed variables, metric "gower" is used automatically
...which seems weird to me, since "Matrix_from_Data" is all numeric variables, not mixed variables. Gower might be a fine metric, but I'd like to see how the others impact the clustering.
What am I missing?
Great question.
First, that message is a warning and not an error. I'm not personally familiar with daisy, but my guess is that this particular warning message pops up when you run the function this way, without it doing any work to check whether the warning is actually relevant.
Regardless of why that warning appears, one simple way to compare the clustering done by several different distance measures in hierarchical clustering is to plot the dendrograms. For simplicity, let's compare the "euclidean" and "binary" distance metrics programmed into dist. You can use ?dist to read up on what the "binary" distance means here.
# When generating random data, always set a seed if you want your data to be reproducible
set.seed(1)
Data <- data.frame(
  X = sample(c(0,1), 10, replace = TRUE),
  Y = sample(c(0,1), 10, replace = TRUE),
  Z = sample(c(0,1), 10, replace = TRUE)
)
# Create distance matrices
mat_euc <- dist(Data, method="euclidean")
mat_bin <- dist(Data, method="binary")
# Plot the dendrograms side-by-side
par(mfrow=c(1,2))
plot(hclust(mat_euc))
plot(hclust(mat_bin))
I generally read dendrograms from the bottom up, since points lower on the vertical axis are more similar (i.e. less distant) to one another than points higher on the vertical axis.
We can pick up a few things from these plots:
4/6, 5/10, and 7/8 are grouped together using both metrics. We should hope this is true if the rows are identical :)
3 is most strongly associated with 7/8 for both distance metrics, although the degree of association is a bit stronger in the binary distance as opposed to the Euclidean distance.
1, 2, and 9 have some notably different relationships between the two distance metrics (e.g. 1 is most strongly associated with 2 in Euclidean distance but with 9 in binary distance). It is in situations like this where the choice of distance metric can have a significant impact on the resulting clusters. At this point it pays to go back to your data and understand why there are differences between the distance metrics for these three points.
Also remember that there are different methods of hierarchical clustering (e.g. complete linkage and single linkage), but you can use this same approach to compare the differences between methods as well. See ?hclust for a complete list of methods provided by hclust.
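For example, reusing mat_bin from above:
# Compare two linkage methods on the same binary distance matrix
par(mfrow = c(1, 2))
plot(hclust(mat_bin, method = "complete"), main = "Complete linkage")
plot(hclust(mat_bin, method = "single"),   main = "Single linkage")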
Hope that helps!
