Calculate the accuracy of an imputation function in R

I'm trying to test various imputation methods in R and I've written a function which takes a data frame, inserts some random NA values, imputes the missing values and then compares the imputation method back to the original data using MAE.
My function looks as follows:
pacman::p_load(tidyverse)
impute_diamonds_accuracy <- function(df, col, prop) {
require(tidyverse)
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[idx_na, col] - df[idx_na, col]), na.rm = TRUE)
return(list(original_data = df,imputed_data = imputed_df, accuracy = mae))
}
impute_diamonds_accuracy(df = diamonds, col = 'cut', prop = 0.02)
The function prints to the screen that it's doing the imputation but it fails when it performs that MAE calculation with the following error:
Error in imputed_df[idx_na, col] - df[idx_na, col] :
non-numeric argument to binary operator
How can I compare the original data against the imputed version to get a sense of the accuracy?

diamonds is a tibble.
> library(ggplot2)
> data(diamonds)
> is_tibble(diamonds)
[1] TRUE
so we need [[ to extract the column as a vector; single-bracket indexing of a tibble returns a one-column tibble, not a vector. Also, by the time the MAE is computed, df[idx_na, col] contains only the NAs that were just inserted, so there is nothing left to compare against. Make a copy of the original data before assigning the NAs, and then compare the imputed data with that copy:
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
Full code:
impute_diamonds_accuracy <- function(df, col, prop) {
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
df_cpy <- data.table::copy(df)
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- mice::complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
# Return the untouched copy as the original data (df now contains the inserted NAs)
return(list(original_data = df_cpy, imputed_data = imputed_df, accuracy = mae))
}
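One caveat: MAE is only meaningful for a numeric column. cut is an ordered factor, and subtracting factors yields NA with a warning, so for categorical columns a mismatch rate is a more natural score. The following is a minimal sketch of the scoring step only. score_imputation is a hypothetical helper, and mean imputation is used as a stand-in for mice/pmm to keep the example dependency-free:

```r
# Hypothetical helper: score imputed values against the held-out originals.
score_imputation <- function(original, imputed, idx) {
  if (is.numeric(original)) {
    # Mean absolute error for numeric columns
    mean(abs(imputed[idx] - original[idx]))
  } else {
    # Share of imputed values that differ from the original category
    mean(as.character(imputed[idx]) != as.character(original[idx]))
  }
}

# Numeric example: hide 5 values of mtcars$mpg
set.seed(42)
orig <- mtcars$mpg
idx <- sample(seq_along(orig), 5)
masked <- orig
masked[idx] <- NA
# Stand-in imputation (NOT pmm): fill with the mean of the observed values
filled <- masked
filled[idx] <- mean(masked, na.rm = TRUE)
mae_numeric <- score_imputation(orig, filled, idx)

# Categorical example: positions 1, 2 and 6 were "imputed"
orig_f <- factor(c("a", "b", "a", "c", "b", "a"))
filled_f <- factor(c("a", "a", "a", "c", "b", "b"), levels = levels(orig_f))
err_factor <- score_imputation(orig_f, filled_f, c(1, 2, 6))
```

In the function above, the same helper could replace the mae line, called with df_cpy[[col]], imputed_df[[col]] and idx_na.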

Related

Expand for-loop to accommodate list in R?

I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found this related question in stack overflow which is basically what I want and was able to run this well for gam() instead of gamm(), however I want to expand this to include multiple independent variables in the model, not just 1.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, and it would test combinations anywhere from 1 independent variable in the model, up to all of the possible covariates in "d_pred" in the model.
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[,y]])
## create a "matrix" list of dimensions i x j
results_m2 <-lapply(1:length(df_combinations2), matrix, data= NA, nrow=ncol(d_resp), ncol=2)
## for-loop
for(k in 1:length(df_combinations2)){
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(df_combinations2[[k]])){
results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][,1])+s(df_combinations2[[k]][,2]))
}
}}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Anyone know why I am getting this error/ how to fix it?
Any insight is much appreciated. Thanks!
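The error occurs because gam() cannot parse [[ inside a formula. Build the formula from column names instead and pass the data frame through the data argument. Personally, I would create a data.table() containing all combinations of target variables and predictor combinations and loop through its rows. See below.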
library(mgcv) # for gamSim() and gam()
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(1:length(predictors), FUN = function(x){
#create combination
combination <- combn(predictors, m = x) |> as.data.table()
#add s() to all for gam
combination <- sapply(combination, FUN = function(y) paste0("s(", y, ")")) |> as.data.table()
#collapse
combination <- summarize_all(combination, .funs = paste0, collapse = "+")
#unlist
combination <- unlist(combination)
#remove names
names(combination) <- NULL
#return
return(combination)
})
#merge combinations of predictors as vector
predictor_combinations <- do.call(c, predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictors combinations)
if(!file.exists("dev/hypergrid.csv")){
#create hypergrid and save to dev
hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
#add identifier
hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
#save to dev
fwrite(hypergrid, file = "dev/hypergrid.csv")
} else{
#if file exists read
hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progressbar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
#update progressbar
setTxtProgressBar(pb, i)
#select target
target <- hypergrid[i,]$target
#select predictors
predictors <- hypergrid[i,]$predictors
#create formula
gam.formula <- as.formula(paste0(target, "~", predictors))
#run gam
gam.model <- gam(gam.formula, data = d)
#save gam model to dev/models
saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
#read the right model
rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
#extract model performance, add to hypergrid
hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and r2
hypergrid <- dplyr::arrange(hypergrid, target, desc(R2))
Which would give
head(hypergrid)
   target                                predictors    model        R2
1:      y             s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2:      y       s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3:      y       s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4:      y                   s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5:      y       s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6:      y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder under their identifier, in case you want to reuse a model or extract more information from it.
Note that fitting every possible combination and picking the best one invites p-hacking, so I would be careful about conducting your analysis this way.
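The core of the approach above is building each formula from column names rather than from indexed data. A compact base R sketch of just that step follows; the predictor and target names are hypothetical, and no model is fitted here:

```r
# Hypothetical predictor and target names
predictors <- c("x0", "x1", "x2")
targets <- c("y", "y1")

# All non-empty subsets of the predictors, each wrapped in s() and joined with "+"
rhs <- unlist(lapply(seq_along(predictors), function(m) {
  combn(predictors, m, FUN = function(v) paste0("s(", v, ")", collapse = " + "))
}))

# One row per (target, predictor combination) pair
grid <- expand.grid(target = targets, rhs = rhs, stringsAsFactors = FALSE)

# Turn each row into a formula object that refers to columns by name
formulas <- lapply(seq_len(nrow(grid)), function(i) {
  as.formula(paste(grid$target[i], "~", grid$rhs[i]))
})
```

Each element of formulas can then be passed to gam() together with data = d, which avoids the [[ error because the formula contains only column names.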

Speeding up count of pairwise observations in R

I have a dataset where a subset of measurements for each entry are randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating correlations between each row in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of pairwise non-NA observations with a matrix product.
Let's use the following data generation code. I made the data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
With this data, your filter code takes about 85 seconds:
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix with values 1 for non-NAs in the original data. Then using matrix multiplication I calculate number of pairwise non-NAs. It ran in a fraction of a second.
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
A simple check shows that the results are the same:
all(filter == filter2)
# TRUE
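To see why this works: entry (i, j) of tcrossprod(NAmat) is the sum over columns of NAmat[i, k] * NAmat[j, k], which is 1 only where both rows are observed. A small sketch on the question's 10 x 10 example, applying the filter to the correlation matrix (the names obs, n_pair and cor_mat are illustrative):

```r
set.seed(1)
dat <- matrix(runif(100), nrow = 10)
rownames(dat) <- letters[1:10]
dat[sample(100, 25)] <- NA

# 1 where a value is observed, 0 where it is NA (dimensions are preserved)
obs <- ifelse(is.na(dat), 0, 1)

# n_pair[i, j] = number of columns observed in both row i and row j
n_pair <- tcrossprod(obs)

# Row-by-row correlations, then blank out pairs with fewer than 5 shared observations
cor_mat <- suppressWarnings(cor(t(dat), use = "pairwise.complete.obs"))
cor_mat[n_pair < 5] <- NA
```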

Error in R: subscript out of bounds

So I'm trying to do something very simple: loop over a data frame and calculate the maximum correlation coefficient between a pair of columns.
I am trying to do this in R.
My data frame has been read using fread()
Here's my code. I declared max = -1, a = 0 and b = 0 at the start.
for(i in 2:1933)
{
for(j in i+1:1934)
{
if(is.numeric(data[[i]]) && is.numeric(data[[j]]))
{
if(isTRUE(sd(data[[i]], na.rm=TRUE) !=0) && isTRUE(sd(data[[j]], na.rm=TRUE) !=0))
{
c = cor(data[[i]], data[[j]], use="pairwise.complete.obs")
if(isTRUE(c>=max))
{
max = c
a = i
b = j
}
}
}
}
}
The error I get is
Error in .subset2(x, i, exact = exact) : subscript out of bounds
I do have 1934 columns, I can't figure out the problem. Am I missing something fairly obvious?
There's a much easier way to do this: cor(...) takes a matrix (nr X nc) and returns a new matrix (nc X nc) with the correlation coefficient of every column against every other column. The rest is pretty straightforward:
library(data.table) # to simulate fread(...)
set.seed(1) # for reproducible example
dt <- as.data.table(matrix(1:50+rnorm(50,sd=5), ncol=5)) # create reproducible example
result <- cor(dt, use="pairwise.complete.obs") # matrix of correlation coefficients
diag(result) <- NA # set diagonals to NA
max(result, na.rm=TRUE) # maximum correlation coefficient
# [1] 0.7165304
which(result==max(result, na.rm=TRUE), arr.ind=TRUE) # location of max
# row col
# V3 3 2
# V2 2 3
There are two locations because of course the correlation between col 2 and 3 is the same as the correlation between cols 3 and 2.
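As for the original error: in R the colon operator binds more tightly than +, so i+1:1934 is evaluated as i + (1:1934) and j runs up to i + 1934, well past the last column. A short illustration with small, made-up values:

```r
i <- 2
n <- 5

# ':' is evaluated before '+', so this is i + (1:n)
wrong <- i + 1:n      # 3 4 5 6 7, runs past n

# Parentheses give the intended range from i + 1 to n
right <- (i + 1):n    # 3 4 5
```

In the question's loop, writing for(j in (i+1):1934) keeps j within the 1934 columns.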
Try this:
drop_list <- NULL
# Assume the first column is an ID column
feature.names <- names(data)[2:length(names(data))]
for(f in feature.names){
# Check the type first so that sd() is only called on numeric columns
if(!is.numeric(data[[f]]) || isTRUE(sd(data[[f]], na.rm=TRUE) == 0))
{
drop_list <- c(drop_list, f)
}
}
# as.data.frame() so that the logical column index also works for data read with fread()
data <- as.data.frame(data)[,!(names(data) %in% drop_list)]
corr_data <- cor(data, use="pairwise.complete.obs")
## Remove the correlation of each variable with itself
for(i in 1:dim(corr_data)[1]){corr_data[i,i] <- -99 }
# Then locate the maximum in the correlation matrix with which(), as Howard suggested
Cheers

Spearman correlation between two matrices of same dimensions

I have two matrices of equal dimensions (p and e) and I would like to compute a Spearman correlation between columns of the same name. I want the pairwise correlations collected in a matrix (M).
I used the corr.test() function from the psych package, and here is what I did:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for(rs in 1:ncol(p)){
M[rs,1] <- G[rs]
cor <- corr.test(p[,rs],e[,rs],method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
But I get an error message:
Error in 1:ncol(y) : argument of length 0
Could you please show me what is wrong? or suggest another method?
No need for all this looping and indexing etc:
# test data
p <- matrix(data = rnorm(100),nrow = 10)
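e <- matrix(data = rnorm(100),nrow = 10)
# give both matrices the same column names so colnames(p) is not NULL
colnames(p) <- colnames(e) <- letters[1:10]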
cor <- corr.test(p, e, method="spearman", adjust="none")
data.frame(name=colnames(p), r=diag(cor$r), p=diag(cor$p))
# name r p
#a a 0.36969697 0.2930501
#b b 0.16363636 0.6514773
#c c -0.15151515 0.6760652
# etc etc
If the columns of the two matrices are not already in the same order, match them by name:
cor <- corr.test(p, e[,match(colnames(p),colnames(e))], method="spearman", adjust="none")
Since the two matrices are huge, running corr.test() on all possible pairs would take a very long time. The loop that finally worked is as follows:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character(M[,1])
G <- colnames(p)
for(rs in 1:ncol(p)){
M[rs,1] <- G[rs]
cor <- corr.test(as.data.frame(p[,rs]),as.data.frame(e[,rs]),
method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
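If only the same-name pairs are needed, the test can also be run one column at a time with base R's cor.test(), which avoids computing every cross pair. This is a sketch under the assumption that both matrices carry matching column names; the simulated data is purely illustrative:

```r
set.seed(1)
# Illustrative data: two matrices with matching column names
p <- matrix(rnorm(50), nrow = 10, dimnames = list(NULL, letters[1:5]))
e <- matrix(rnorm(50), nrow = 10, dimnames = list(NULL, letters[1:5]))

# One Spearman test per shared column name; each call returns c(r, p)
res <- t(vapply(colnames(p), function(nm) {
  ct <- cor.test(p[, nm], e[, nm], method = "spearman")
  c(r = unname(ct$estimate), p = ct$p.value)
}, numeric(2)))

# Collect the results in a data frame with one row per column name
M <- data.frame(name = rownames(res), r = res[, "r"], p = res[, "p"],
                row.names = NULL)
```

Indexing e by name inside the loop means the column order of the two matrices does not have to match.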

How to compute a running cor.test() in a data.frame with NA values in R?

I'm trying to do a running() correlation between my daily climate data, and the problem is that I have many missing values (NA) in my data.frame. I'm using the cor.test() because I need to get the p.values. For example in some days I don't have precipitation or humidity values, and I would like to know how to compute this running correlation with my temperature data, but omitting the NA values.
Here is an example with NA values:
library(gtools)
df <- data.frame(temp=rnorm(100, 10:30), prec=rnorm(100, 1:300), humi=rnorm(100, 1:100))
df$prec[c(1:10, 25:30, 95:100)] <-NA
df$humi[c(15:19, 20:25, 80:90)] <-NA
corPREC <- t(running(df$temp, df$prec, fun = cor.test, width=10, by=10))
corHUMI <- t(running(df$temp, df$humi, fun = cor.test, width=10, by=10))
You can use complete.cases() to get a logical vector marking the complete rows (TRUE = complete), and subset with it inside an ad hoc function that also runs the test:
library(gtools)
df <- data.frame(temp=rnorm(100, 10:30), prec=rnorm(100, 1:300),
humi=rnorm(100, 1:100))
df$prec[c(1:10, 25:30, 95:100)] <-NA
df$humi[c(15:19, 20:25, 80:90)] <-NA
my.fun <- function(x,y) {
my.df <- data.frame(x,y)
my.df.cmpl <- my.df[complete.cases(my.df), ]
# cor.test needs at least 3 complete obs, and 4 to return a confidence interval
if (nrow(my.df.cmpl)<=3) {
return(rep(NA, 4))
} else {
my.test <- cor.test(my.df.cmpl$x,my.df.cmpl$y)
return(c(my.test$statistic, my.test$p.value,
my.test$conf.int))
}
}
corPREC <- t(running(df$temp, df$prec, fun = my.fun, width=10, by=10))
corHUMI <- t(running(df$temp, df$humi, fun = my.fun, width=10, by=10))
You could also consider the formula interface, which handles the missing rows itself:
my.test <- cor.test(~ x + y, na.action = "na.exclude", data = my.df)
but then the too-few-rows case cannot be handled in a straightforward manner.
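For reference, the same idea can be sketched without gtools, using a plain loop over non-overlapping windows in place of running(). safe_cor_test and the simulated series below are hypothetical, and the minimum of 4 complete pairs is an assumption chosen so that a confidence interval would also be available:

```r
# Hypothetical helper: Pearson test on complete pairs, NA when too few remain
safe_cor_test <- function(x, y, min_n = 4) {
  ok <- complete.cases(x, y)
  if (sum(ok) < min_n) {
    return(c(estimate = NA_real_, p.value = NA_real_))
  }
  ct <- cor.test(x[ok], y[ok])
  c(estimate = unname(ct$estimate), p.value = ct$p.value)
}

set.seed(1)
temp <- rnorm(100, mean = 20)
prec <- rnorm(100, mean = 50)
prec[c(1:10, 25:30, 95:100)] <- NA

# Non-overlapping windows of 10 days (the equivalent of width = 10, by = 10)
starts <- seq(1, 91, by = 10)
res <- t(sapply(starts, function(s) {
  w <- s:(s + 9)
  safe_cor_test(temp[w], prec[w])
}))
rownames(res) <- paste0(starts, ":", starts + 9)
```

Windows in which every precipitation value is missing, such as days 1 to 10 here, yield a row of NA instead of an error.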
