I am trying to use an autoencoder for anomaly detection. I used the H2O R package to generate the reconstruction MSE for some sample data with the h2o.anomaly function. However, I also tried to calculate it manually, according to the MSE formula from the documentation link below:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#mse-mean-squared-error
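My reading of that page gives the per-row reconstruction error as

MSE = (1/N) * sum_{i=1}^{N} (x_i - xhat_i)^2

where N is the number of features, x_i is the original value of feature i, and xhat_i is its reconstruction (this paraphrase is mine, not a quote from the docs).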
The training data that I used to build the model consists of three features and 5 rows:
head(train_dat)
Feature1 Feature2 Feature3
1 68.18 0.1806535 3.871201
2 71.51 0.3987761 2.484907
3 67.77 0.4285304 3.332205
4 69.58 0.1823216 2.890372
5 70.98 0.4134333 1.791759
The test data that I used for prediction consists of three features and 5 rows:
head(test_dat)
Feature1 Feature2 Feature3
1 68.33000 0.4350239 2.708050
2 73.98000 0.5550339 3.044522
3 67.11000 0.7323679 2.639057
4 69.90395 0.9999787 4.499810
5 71.28867 0.4882539 3.091042
After running training and prediction, the reconstructed features are as below:
head(mod.out)
reconstr_Feature1 reconstr_Feature2 reconstr_Feature3
1 69.66297 0.4239244 2.346250
2 69.88329 0.3963843 2.381598
3 69.46544 0.4610502 2.233164
4 68.96117 0.4229165 2.676295
5 69.63208 0.3895452 2.530025
When I used the h2o.anomaly function for MSE calculation, I received MSE output as below:
head(mse.list)
Reconstruction.MSE
1 0.05310159
2 0.57037600
3 0.54427385
4 2.08407248
5 0.14251951
However, when I tried to calculate the MSE by applying the function below, I obtained a different output:
mod.anon.validate <- apply((test_dat - mod.out)^2, 1, mean)
mse.list.validate <- as.data.frame(mod.anon.validate)
head(mse.list.validate)
mod.anon.validate
1 0.6359438
2 5.7492281
3 1.9288268
4 1.5156829
5 1.0229217
What have I done wrong in my manual MSE calculation? Is the "reconstruction MSE" different from the general MSE? The full R script is below:
### H2O Autoencoder test run ###
#Load test and training data.
test_dat <- read.table("sample.test.dat", header=TRUE)
train_dat <- read.table("sample.train.dat", header=TRUE)
#Start H2O
library(h2o)
localH2O <- h2o.init(port = 54321)
#Training and deep learning
feature_names <- names(train_dat[1:3])
unmod.hex <- as.h2o(train_dat, destination_frame="train.hex")
mod.hex <- as.h2o(test_dat, destination_frame="test.hex")
unmod.dl <- h2o.deeplearning(x=feature_names,
training_frame=unmod.hex,
autoencoder = TRUE,
reproducible = T,
hidden = c(3,2,3), epochs = 50,
activation = "Tanh")
#Output result
mod.out <- as.data.frame(h2o.predict(unmod.dl, mod.hex))
mod.anon <- h2o.anomaly(unmod.dl, mod.hex, per_feature=FALSE)
mse.list <- as.data.frame(mod.anon)
mod.anon.validate <- apply((test_dat - mod.out)^2, 1, mean)
mse.list.validate <- as.data.frame(mod.anon.validate)
Thanks for your help.
The calculations don't match because the MSE is calculated in the normalised space. If you set the standardize = FALSE parameter in h2o.deeplearning(), they will match:
unmod.dl <- h2o.deeplearning(x=feature_names, standardize = FALSE,
training_frame=unmod.hex,
autoencoder = TRUE,
reproducible = T,
hidden = c(3,2,3), epochs = 50,
activation = "Tanh")
mod.out <- as.data.frame(h2o.predict(unmod.dl, mod.hex))
mod.anon <- h2o.anomaly(unmod.dl, mod.hex, per_feature=FALSE)
mse.list <- as.data.frame(mod.anon)
mse.list
> mse.list
Reconstruction.MSE
1 1512.740
2 1777.491
3 1458.438
4 1587.593
5 1648.999
> mod.anon.validate <- apply((test_dat - mod.out)^2, 1, mean)
> mse.list.validate <- as.data.frame(mod.anon.validate)
> mse.list.validate
mod.anon.validate
1 1512.740
2 1777.491
3 1458.438
4 1587.593
5 1648.999
If you want to keep the default standardization instead, here's an example of how to normalise the data yourself before comparing:
#Load test and training data.
test_dat <- sample.test
train_dat <- sample.train
#Start H2O
library(h2o)
localH2O <- h2o.init(port = 54321, strict_version_check = FALSE)
#Training and deep learning
feature_names <- names(train_dat[1:3])
unmod.hex <- as.h2o(train_dat, destination_frame="train.hex")
mod.hex <- as.h2o(test_dat, destination_frame="test.hex")
unmod.dl <- h2o.deeplearning(x=feature_names,
training_frame=unmod.hex,
autoencoder = TRUE,
reproducible = T,
hidden = c(3,2,3), epochs = 50,
activation = "Tanh")
# Anomaly Detection
mod.anon <- h2o.anomaly(unmod.dl, mod.hex, per_feature=FALSE)
mse.list <- as.data.frame(mod.anon)
# Manual MSE
mod.out <- as.data.frame(h2o.predict(unmod.dl, mod.hex))
# Scale Output
s <- apply(train_dat, 2, max) - apply(train_dat, 2, min)
m <- apply(train_dat, 2, mean)
original_scaled <- t(apply(test_dat, 1, function(x) (x-m)/s))
recreate_scaled <- t(apply(mod.out, 1, function(x) (x-m)/s))
mod.anon.validate <- apply((original_scaled - recreate_scaled)^2, 1, mean)
mse.list.validate <- as.data.frame(mod.anon.validate)
# Compare Outputs
print(mse.list)
print(mse.list.validate)
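If the manual scaling matches the statistics H2O used internally, the two printed MSE columns should now agree row by row.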
sample_size <- 200
sample_meanvector <- c(3, 4)
sample_covariance_matrix <- matrix(c(2, 1, 1, 2),
ncol = 2)
# create bivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
mu = sample_meanvector,
Sigma = sample_covariance_matrix)
#Convert the datatype
df_sample_distribution <- as.data.frame(sample_distribution)
df_sample_distribution$Y <- (1 + df_sample_distribution$V1*2 + df_sample_distribution$V2 + rnorm(200,0,1))
colnames(df_sample_distribution)[1] <- "X1"
colnames(df_sample_distribution)[2] <- "X2"
The code above is what I use to generate bivariate normal vectors, and the code below runs a regression over the generated data.
Test2 <- lm(Y ~ X1, data = df_sample_distribution)
# to extract only a specific coefficient (the slope on X1)
summary(Test2)$coefficients[2, 1]
My question is whether there is a way to regenerate the data and run the regression over it 200 times, saving all the outputs in a list. Here is the pseudo code in my head:
for (){
#generate data
for ()
{
#extract coeffiients and insert them in a list
}
}
In simple terms,
step 1: create data
step 2: run regression over it
step 3: get the coefficient (hopefully save them in a list)
I am looking for code that loops through steps 1 to 3 200 times and saves all the results. Any ideas or inspiration are welcome. Thank you guys in advance.
Just wrap your code in a for-loop, like your pseudo code:
library(MASS)
iterations <- 10 # In your example this should be 200
sample_size <- 200
sample_meanvector <- c(3, 4)
sample_covariance_matrix <- matrix(c(2, 1, 1, 2),
ncol = 2)
# create output data.frame
df_output <- data.frame(iteration = integer(0), coeff = double(0))
# loop over data generation and regression
for (i in seq_len(iterations)) {
  sample_distribution <- mvrnorm(n = sample_size,
                                 mu = sample_meanvector,
                                 Sigma = sample_covariance_matrix)
  # Convert the datatype
  df_sample_distribution <- as.data.frame(sample_distribution)
  df_sample_distribution$Y <- (1 + df_sample_distribution$V1*2 + df_sample_distribution$V2 + rnorm(200,0,1))
  colnames(df_sample_distribution)[1] <- "X1"
  colnames(df_sample_distribution)[2] <- "X2"
  df_output[i, 1] <- i
  df_output[i, 2] <- summary(lm(Y ~ X1, data = df_sample_distribution))$coefficients[2,1]
}
This returns df_output containing coefficients for each iteration:
iteration coeff
1 1 2.647886
2 2 2.274654
3 3 2.447453
4 4 2.451471
5 5 2.568877
6 6 2.428295
7 7 2.440396
8 8 2.478357
9 9 2.477211
10 10 2.367012
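If you prefer to avoid the index bookkeeping, the same generate-fit-extract cycle can be wrapped in a function and repeated with replicate(). A minimal sketch under the same data-generating assumptions as above (run_once is a name I made up):

library(MASS)

run_once <- function() {
  # generate one dataset, fit the regression, return the X1 slope
  d <- as.data.frame(mvrnorm(n = 200, mu = c(3, 4),
                             Sigma = matrix(c(2, 1, 1, 2), ncol = 2)))
  names(d) <- c("X1", "X2")
  d$Y <- 1 + 2 * d$X1 + d$X2 + rnorm(200, 0, 1)
  coef(lm(Y ~ X1, data = d))["X1"]
}

coeffs <- replicate(200, run_once())  # numeric vector of 200 slopes
head(coeffs)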
I am using the pROC package in R to calculate and compare the AUCs of multiple tests, to see which test best discriminates between patients and controls. However, I have a large number of tests and essentially want to run a series of pairwise comparisons of each test's AUC with every other test, and then correct for multiple comparisons. This is as far as I've gotten with my code (example with a simulated, replicable dataset below):
#load pROC
library(pROC)
#generate df with random numbers
set.seed(123)
df <- data.frame(disease_status = rbinom(n=100, size=1, prob=0.20),
test1 = rnorm(100, mean=15, sd=4),
test2 = rnorm(100, mean=30, sd=2),
test3 = rnorm(100, mean=50, sd=3))
#create roc object for test1, test2, test3
roc.out_test1<-roc(df$disease_status, df$test1, plot=TRUE, smooth = FALSE)
roc.out_test2<-roc(df$disease_status, df$test2, plot=TRUE, smooth = FALSE)
roc.out_test3<-roc(df$disease_status, df$test3, plot=TRUE, smooth = FALSE)
#compare the AUC of test1 and test 2
roc.test(roc.out_test1, roc.out_test2, reuse.auc=TRUE, method="delong", na.rm=TRUE)
#DeLong's test for two correlated ROC curves
#data: roc.out_test1 and roc.out_test2
#Z = 0.60071, p-value = 0.548
#alternative hypothesis: true difference in AUC is not equal to 0
#sample estimates:
#AUC of roc1 AUC of roc2
#0.5840108 0.5216802
#create a function to do above for all comparisons
vec_ROCs1 <- c("roc.out_test1,", "roc.out_test2,", "roc.out_test3,")
vec_ROCs2 <- c("roc.out_test1", "roc.out_test2", "roc.out_test3")
ROCs2_specifications <- paste0(vec_ROCs2, ",", "reuse.auc=TRUE")
test <- unlist(lapply(ROCs2_specifications, function(x) paste0(vec_ROCs1, x)))
test2 <- lapply(test, function(x) roc.test(x))
#Error in roc.test.default(x) :
# argument "predictor1" is missing, with no default
Please let me know your thoughts and suggestions on how to fix this!
Thank you.
The following should work; please check it. I didn't write out all the details, but you can ask follow-up questions if anything in the code is unclear.
#load pROC
library(pROC)
#> Type 'citation("pROC")' for a citation.
#>
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#>
#> cov, smooth, var
#generate df with random numbers
set.seed(123)
df <- data.frame(disease_status = rbinom(n=100, size=1, prob=0.20),
test1 = rnorm(100, mean=15, sd=4),
test2 = rnorm(100, mean=30, sd=2),
test3 = rnorm(100, mean=50, sd=3))
#create roc object for test1, test2, test3
roc.out_test1<-roc(df$disease_status, df$test1, plot=TRUE, smooth = FALSE)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
roc.out_test2<-roc(df$disease_status, df$test2, plot=TRUE, smooth = FALSE)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
roc.out_test3<-roc(df$disease_status, df$test3, plot=TRUE, smooth = FALSE)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
# compare the AUC of test1 and test 2
roc.test(roc.out_test1, roc.out_test2, reuse.auc = TRUE, method = "delong", na.rm = TRUE)
#>
#> DeLong's test for two correlated ROC curves
#>
#> data: roc.out_test1 and roc.out_test2
#> Z = 0.60071, p-value = 0.548
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2
#> 0.5840108 0.5216802
Now we generate a list of all possible combinations of the three tests and run the roc.test function using the same parameters that you set.
all_tests <- combn(
  list(
    "test1" = roc.out_test1,
    "test2" = roc.out_test2,
    "test3" = roc.out_test3
  ),
  FUN = function(x, ...) roc.test(x[[1]], x[[2]]),
  m = 2,
  simplify = FALSE,
  reuse.auc = TRUE,
  method = "delong",
  na.rm = TRUE
)
The output is a list of choose(3, 2) = 3 elements (i.e. the number of combinations of 3 elements taken 2 at a time), and each element of the list is a test. For example, this is the same as your previous test:
all_tests[[1]]
#>
#> DeLong's test for two correlated ROC curves
#>
#> data: x[[1]] and x[[2]]
#> Z = 0.60071, p-value = 0.548
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2
#> 0.5840108 0.5216802
The only problem here is that it's difficult to recognise which tests are used in the comparisons, so we can also add a list of names:
tests_names <- combn(
  list("test1", "test2", "test3"),
  m = 2,
  FUN = paste,
  simplify = TRUE,
  collapse = "_"
)
all_tests <- setNames(all_tests, tests_names)
This is the result.
names(all_tests)
#> [1] "test1_test2" "test1_test3" "test2_test3"
The names of the objects flag the tests that are used in the comparison.
all_tests$test1_test2
#>
#> DeLong's test for two correlated ROC curves
#>
#> data: x[[1]] and x[[2]]
#> Z = 0.60071, p-value = 0.548
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2
#> 0.5840108 0.5216802
Created on 2020-03-14 by the reprex package (v0.3.0)
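Since the question also asks about correcting for multiple comparisons: each element of all_tests is a standard htest object, so you can pull the p-values out and adjust them with base R's p.adjust. A minimal sketch (the choice of Holm's method is mine):

# extract the p-value from each pairwise test and adjust it
p_vals <- sapply(all_tests, function(tst) tst$p.value)
p.adjust(p_vals, method = "holm")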
The roc.test() function expects roc objects as input. Your list test contains only character strings of the arguments, which the function does not know what to do with. The list also includes comparisons of the tests with themselves, e.g. "roc.out_test1,roc.out_test1,reuse.auc=TRUE"; I assume you don't actually need those, and that there are only 3 comparisons you need: 1v2, 1v3, 2v3. The purrr package provides map functions similar to lapply, and map2 lets you iterate over 2 lists at the same time. You need to create 2 lists of the actual roc objects and iterate over those.
#load pROC
library(pROC)
library(dplyr)
library(purrr) #For map2 function
#generate df with random numbers
set.seed(123)
df <- data.frame(disease_status = rbinom(n=100, size=1, prob=0.20),
test1 = rnorm(100, mean=15, sd=4),
test2 = rnorm(100, mean=30, sd=2),
test3 = rnorm(100, mean=50, sd=3))
#create roc object for test1, test2, test3
roc.out_test1<-roc(df$disease_status, df$test1, plot=TRUE, smooth = FALSE)
roc.out_test2<-roc(df$disease_status, df$test2, plot=TRUE, smooth = FALSE)
roc.out_test3<-roc(df$disease_status, df$test3, plot=TRUE, smooth = FALSE)
#compare the AUC of test1 and test 2
roc.test(roc.out_test1, roc.out_test2, reuse.auc=TRUE, method="delong", na.rm=TRUE)
roc_new <- function(test1, test2) {
  roc.test(test1, test2, reuse.auc = TRUE, method = "delong", na.rm = TRUE)
}
#List of all tests
all_tests <- list(roc.out_test1,
roc.out_test2,
roc.out_test3)
#Create unique combos of tests
unique_combos <- expand.grid(1:3, 1:3) %>%
  filter(Var1 < Var2) %>% # excludes duplicate and self comparisons;
  # each column provides the index for the 2 lists to iterate over
  mutate(names = paste(Var1, " V ", Var2)) # column used to name the final output list
#Create 2 lists to iterate over
#Create list 1
(test1 <- all_tests[as.numeric(unique_combos$Var1)])
#Create list 2
(test2 <- all_tests[as.numeric(unique_combos$Var2)])
#Iterate over both lists
output <- map2(test1, test2, roc_new)
names(output) <- unique_combos$names
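As with the combn approach above, you can then collect the p-values for a multiple-comparison correction; map_dbl can extract a named element from each htest object in the list (a sketch, reusing the output object from above):

p_vals <- map_dbl(output, "p.value")
p.adjust(p_vals, method = "holm")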
How can I get the OA and Kappa values for each variable subset, like the table in the figure below?
The study used RFE with the caret package.
You get this table if you run rfe on a classification dataset. It looks like the article cleaned up and renamed some column names, but that is it.
library(caret)
data(mdrr)
mdrrDescr <- mdrrDescr[,-nearZeroVar(mdrrDescr)]
mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .8)]
set.seed(1)
inTrain <- createDataPartition(mdrrClass, p = .75, list = FALSE)[,1]
train <- mdrrDescr[ inTrain, ]
test <- mdrrDescr[-inTrain, ]
trainClass <- mdrrClass[ inTrain]
testClass <- mdrrClass[-inTrain]
set.seed(2)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5, verbose = FALSE)
rf_profile <- rfe(train, trainClass,
ntree = 50,
rfeControl = ctrl)
rf_profile$results contains the results that you can see in the table.
rf_profile$results
Variables Accuracy Kappa AccuracySD KappaSD
1 4 0.7355696 0.4599432 0.06290770 0.1274150
2 8 0.7934494 0.5736408 0.08328405 0.1725036
3 16 0.8060759 0.6011138 0.05961418 0.1222687
4 61 0.8260759 0.6411303 0.07101790 0.1483737
And if you want the variable names that go with those subset sizes, you can get them like this:
rf_profile$optVariables[rf_profile$results$Variables]
[1] "VRA1" "TI2" "Xt" "G.O..Cl."
When I use the mice package to impute data I have the following issue:
I can't seem to find a way to impute the NA values in new observations, given that I have already imputed the missing data in the training set.
Example 1
I have trained an algorithm on a data frame with 10 features and 1000 observations.
How can I predict a new observation (one that has missing data) using this algorithm?
Example 2
Suppose we have a data frame with NA values:
V1 V2 V3 R1
1 2 NA 1
1.4 -1 0 0
1.2 NA 0 1
1.6 NA 1 1
1.2 3 1 0
I impute the missing values using the mice package:
imp <- mice(df, m = 2, maxit = 100, meth = 'pmm', seed = 12345)
The object imp now contains 2 data frames with imputed values; the first one is:
(dfImp1)
V1 V2 V3 R1
1 2 0.5 1
1.4 -1 0 0
1.2 1.5 0 1
1.6 1.5 1 1
1.2 3 1 0
Now with this data frame, I can train an algorithm:
modl <- glm(R1 ~ ., data = dfImp1, family = binomial)
I want to predict the response for a new observation, e.g.:
obs1 <- data.frame(V1 = 1, V2 = 1.4, V3 = NA)
How do I impute the missing data of a new individual observation?
It seems that the mice package has no built-in solution for this, but we can write one.
The idea is to:
(1) use the same mice algorithm to fill the NAs both in the dataset used to train the GLM and in the new observation(s);
(2) predict only the new observation(s), now without NAs.
I'm going to use iris as a data example.
library(R6)
library(mice)
library(dplyr) # needed for %>%, filter() and bind_rows() used below
# Binary output to use Binomial
df <- iris %>% filter(Species != "virginica")
# The new observation
new_data <- tail(df, 1)
# the dataset used to train the model
df <- head(df,-1)
# Now, let insert some NAs
insert_nas <- function(x) {
  set.seed(123)
  len <- length(x)
  n <- sample(1:floor(0.2*len), 1)
  i <- sample(1:len, n)
  x[i] <- NA
  x
}
df$Sepal.Length <- insert_nas(df$Sepal.Length)
df$Petal.Width <- insert_nas(df$Petal.Width)
new_data$Sepal.Width = NA
summary(df)
In the fit method we apply mice to fill the NAs, fit a GLM and store it to be used in the predict method.
In the predict method we (1) append the new observation(s), still containing NAs, to the dataset, (2) fill the NAs again using mice, (3) take back the row(s) corresponding to the new observation(s), now without NAs, and then (4) apply the GLM to predict the response for them.
# R6 Class Generator
GLMWithMice <- R6Class("GLMWithMice", list(
  model = NULL,
  df = NULL,
  fitted = FALSE,
  initialize = function(df) {
    self$df <- df
  },
  fit = function(formula = "Species~.", family = binomial) {
    imp <- mice(self$df, m = 2, maxit = 100, meth = 'pmm', seed = 12345, print = FALSE)
    df_cleaned <- complete(imp, 1)
    self$model <- glm(formula, df_cleaned, family = family, maxit = 100)
    self$fitted <- TRUE
    return(cat("\n model fitted!"))
  },
  predict = function(new_data, type = "response") {
    n_rows <- nrow(self$df)
    df_new <- bind_rows(self$df, new_data)
    imp <- mice(df_new, m = 2, maxit = 100, meth = 'pmm', seed = 12345, print = FALSE)
    df_cleaned <- complete(imp, 1)
    new_data_cleaned <- tail(df_cleaned, nrow(df_new) - n_rows)
    return(predict(self$model, new_data_cleaned, type = type))
  }
))
#Let's create a new instance of "GLMWithMice" class
model <- GLMWithMice$new(df = df)
class(model)
model$fit(formula = Species~., family = binomial)
model$predict(new_data = new_data)
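As an aside, newer versions of mice (3.12 or later, if I remember correctly) expose an ignore argument that keeps flagged rows out of the imputation-model fitting while still imputing them, which avoids leaking the new observation into the imputation model. A hedged sketch, reusing df and new_data from above:

# rows flagged TRUE are imputed but not used to train the imputation models
df_all <- bind_rows(df, new_data)
flag <- c(rep(FALSE, nrow(df)), rep(TRUE, nrow(new_data)))
imp <- mice(df_all, m = 2, maxit = 100, meth = "pmm", seed = 12345,
            ignore = flag, print = FALSE)
new_cleaned <- tail(complete(imp, 1), nrow(new_data))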
Unfortunately, I am having problems using predict() in the following simple example:
library(e1071)
x <- c(1:10)
y <- c(0,0,0,0,1,0,1,1,1,1)
test <- c(11:15)
mod <- svm(y ~ x, kernel = "linear", gamma = 1, cost = 2, type="C-classification")
predict(mod, newdata = test)
The result is as follows:
> predict(mod, newdata = test)
1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA>
0 0 0 0 0 1 1 1 1 1
Can anybody explain why predict() only gives the fitted values of the training sample (x, y) and ignores the test data?
Thank you very much for your help!
Richard
It looks like this is because you misused the formula interface to svm(). Normally one supplies a data frame or similar object within which the variables in the formula are searched for. It usually doesn't matter if you don't do this, even though it is not best practice, but when you want to predict, not putting the variables in a data frame gets you in a right mess. The reason it returns the training data is that you don't give newdata an object containing a component named x. Hence it can't find the new data x, so it returns the fitted values. This is common to most R predict methods I know.
The solution then is to i) put your training data in a data frame and pass it to svm as the data argument, and ii) supply to predict() a new data frame containing x (from test). E.g.:
> DF <- data.frame(x = x, y = y)
> mod <- svm(y ~ x, data = DF, kernel = "linear", gamma = 1, cost = 2,
+ type="C-classification")
> predict(mod, newdata = data.frame(x = test))
1 2 3 4 5
1 1 1 1 1
Levels: 0 1
You need newdata to be of the same form, i.e. using a data.frame helps:
R> library(e1071)
Loading required package: class
R> df <- data.frame(x=1:10, y=sample(c(0,1), 10, rep=TRUE))
R> mod <- svm(y ~ x, kernel = "linear", gamma = 1,
+ cost = 2, type="C-classification", data=df)
R> newdf <- data.frame(x=11:15)
R> predict(mod, newdata=newdf)
1 2 3 4 5
0 0 0 0 0
Levels: 0 1
R>
By the way, this is also shown in the help page for svm():
## density-estimation
# create 2-dim. normal with rho=0:
X <- data.frame(a = rnorm(1000), b = rnorm(1000))
attach(X)
# traditional way:
m <- svm(X, gamma = 0.1)
# formula interface:
m <- svm(~., data = X, gamma = 0.1)
# or:
m <- svm(~ a + b, gamma = 0.1)
# test:
newdata <- data.frame(a = c(0, 4), b = c(0, 4))
predict(m, newdata)
So, in sum: use the formula interface and supply a data.frame; that is how essentially all modeling functions in R work.