function to call the different columns for calculating Correlation and Confidence interval using Bootstrap in R - r

Here is the problem I am currently facing: I have a data frame (let's call A) of 200 observations (rows) and 12 variables (columns). where I am, trying to find out the confidence interval using Bootstrap based on Correlation between two variables in the data frame.
My Data:
library(boot)
library(tidyverse)
library(psychometric)
hsb2 <- read.table("https://stats.idre.ucla.edu/stat/data/hsb2.csv", sep=",", header=T)
here I am trying to find out the confidence interval by using bootstrap based correlation formula
I wrote code for that its work.
k<-CIr(r=orig.cor, n = 21, level = .95)
k
n<-length(hsb2$math)
#n
B<-5000
boot.cor.all<-NULL
for (i in 1:B){
index<-sample(1:n, replace=T)
boot.v2<-hsb2$math[index]
boot.v1<-hsb2$write[index]
boot.cor<-cor(boot.v1, boot.v2,method="spearman")
boot.cor.all<-c(boot.cor.all, boot.cor)
}
ci_boot<-quantile(boot.cor.all, prob=c(0.025, 0.975))
ci_boot
Result:
[1] 0.6439442
[1] 0.2939780 0.8416635
2.5% 97.5%
0.5556964 0.7211145
Here is the actual problem I am facing where I have to write a function to get
result for another variable but
this function not working
bo<-function(v1,v2,df){
orig.cor <- cor(df$v1,df$v2,method="spearman")
orig.ci<-CIr(r=orig.cor, n = 21, level = .95)
B<-5000
n<-length(df$v1)
boot.cor.all<-NULL
for (i in 1:B){
index<-sample(1:n, replace=T)
boot.hvltt2<-df$v1[index]
boot.hvltt<-df$v2[index]
boot.cor<-cor(boot.hvltt2, boot.hvltt,method="spearman")
boot.cor.all<-c(boot.cor.all, boot.cor)
}
ci_boot<-quantile(boot.cor.all, prob=c(0.025, 0.975))
return(orig.cor,orig.ci,ci_boot)
}
after calling this function I am getting error
bo(math,write,hsb2)
bo(math,read,hsb2)
bo(female,write,hsb2)
bo(female,read,hsb2)
I am getting this error
Error in cor(df$v1, df$v2, method = "spearman") : supply both 'x' and 'y' or a matrix-like 'x'
how to write a function correctly.
I want the result as each time a call function it needs to be stored in data frame like below
Variable1 variable2 Orig Cor Orig CI bootstrap CI
math wirte 0.643 0.2939780 0.8416635 0.5556964 0.7211145
math read 0.66 0.3242639 0.8511580 0.5736904 0.7400174
female read -0.059 -0.4787978 0.3820967 -0.20432743 0.08176896
female write
science write
science read

The logic was right, I just had to make some changes on how you access the elements on df. R doesn't recognized the objects math and write because they are columns inside the data.frame. One way to pass them as arguments to the function is to define them as strings v1 = "math" and then access them with df[,v1]
bo<-function(v1,v2,df){
orig.cor <- cor(df[,v1],df[,v2],method="spearman")
orig.ci<-CIr(r=orig.cor, n = 21, level = .95)
B<-5000
n<-nrow(df) #Changed length to nrow
boot.cor.all<-NULL
for (i in 1:B){
index<-sample(1:n, replace=T)
boot.hvltt2<-df[index,v1]
boot.hvltt<-df[index,v2]
boot.cor<-cor(boot.hvltt2, boot.hvltt,method="spearman")
boot.cor.all<-c(boot.cor.all, boot.cor)
}
ci_boot<-quantile(boot.cor.all, prob=c(0.025, 0.975))
return(list(orig.cor,orig.ci,ci_boot)) #wrap your returns in a list
}
bo("math","write",hsb2)

Related

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this seperately for 10 times, but I want do that a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without reproducible there is no way to test this but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusionmat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line to the function with return(list(prediction, confusionmat, get.BER(confusion.mat)). This will produce a list of the lists so that results[[1]][[1]] will be the results of prediction for 10 PCs and results[[1]][[2]] and results[[1]][[3]] will be confusionmat and get.BER(confusion.mat) respectively.

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
c <- cor.test(x,y, method="kendall")
return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
output <- cor.test(newdata$wt, newdata$qsec, method="kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
cor.test(newdata$wt, newdata$qsec, method="kendall")$p.value
})

extracting residuals from pixel by pixel regression

I am trying to extract the residuals from a regression run pixel by pixel on a raster stack of NDVI/precipitation. My script works when i run it with a small part of my data. But when i try to run the whole of my study area i get: "Error in setValues(out, x) : values must be numeric, integer, logical or factor"
The lm works, since I can extract both slope and intercept. I just cant extract the residuals.
Any idea of how this could be fixed?
Here is my script:
setwd("F:/working folder/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
residualfun = function(x) { if (is.na(x[1])){ NA } else { m <- lm(x[1:6] ~ x[7:12], na.action=na.exclude)
r <- residuals.lm(m)
return (r)}}
res <- calc(s,residualfun)
And here is my data: https://1drv.ms/u/s!AhwCgWqhyyDclJRjhh6GtentxFOKwQ
Your function only test if the first layer shows NA values to avoid fitting the model. But there may be NA in other layers. You know that because you added na.action = na.exclude in your lm fit.
The problem is that if the model removes some values because of NA, the residuals will only have the length of the non-NA values. This means that your resulting r vector will have different lengths depending on the amount of NA values in layers. Then, calc is not be able to combine results of different lengths in a stack a a defined number of layers.
To avoid that, you need to specify the length of r in your function and attribute residuals only to non-NA values.
I propose the following function that now works on the dataset your provided. I added (1) the possibility compare more layers of each if you want to extend your exploration (with nlayers), (2) avoid fitting the model if there are only two values to compare in each layer (perfect model), (3) added a try if for any reason the model can fit, this will output values of -1e32 easily findable for further testing.
library(raster)
setwd("/mnt/Data/Stackoverflow/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
# Number of layers of each
nlayers <- 6
residualfun <- function(x) {
r <- rep(NA, nlayers)
obs <- x[1:nlayers]
cov <- x[nlayers + 1:nlayers]
# Remove NA values before model
x.nona <- which(!is.na(obs) & !is.na(cov))
# If more than 2 points proceed to lm
if (length(x.nona) > 2) {
m <- NA
try(m <- lm(obs[x.nona] ~ cov[x.nona]))
# If model worked, calculate residuals
if (is(m)[1] == "lm") {
r[x.nona] <- residuals.lm(m)
} else {
# alternate value to find where model did not work
r[x.nona] <- -1e32
}
}
return(r)
}
res <- calc(s, residualfun)

Bootstrapping sample means in R using boot Package, Creating the Statistic Function for boot() Function

I have a data set with 15 density calculations, each from a different transect. I would like to resampled these with replacement, taking 15 randomly selected samples of the 15 transects and then getting the mean of these resamples. Each transect should have its own personal probability of being sampled during this process. This should be done 5000 times. I have a code which does this without using the boot function but if I want to calculate the BCa 95% CI using the boot package it requires the bootstrapping to be done through the boot function first.
I have been trying to create a function but I cant get any that seem to work. I want the bootstrap to select from a certain column (data$xs) and the probabilites to be used are in the column data$prob.
The function I thought might work was;
library(boot)
meanfun <- function (data, i){
d<-data [i,]
return (mean (d)) }
bo<-boot (data$xs, statistic=meanfun, R=5000)
#boot.ci (bo, conf=0.95, type="bca") #obviously `bo` was not made
But this told me 'incorrect number of dimensions'
I understand how to make a function in the normal sense but it seems strange how the function works in boot. Since the function is only given to boot by name, and no specification of the arguments to pass into the function I seem limited to what boot itself will pass in as arguments (for example I am unable to pass data$xs in as the argument for data, and I am unable to pass in data$prob as an argument for probability, and so on). It seems to really limit what can be done. Perhaps I am missing something though?
Thanks for any and all help
The reason for this error is, that data$xs returns a vector, which you then try to subset by data [i, ].
One way to solve this, is by changing it to data[i] or by using data[, "xs", drop = FALSE] instead. The drop = FALSE avoids type coercion, ie. keeps it as a data.frame.
We try
data <- data.frame(xs = rnorm(15, 2))
library(boot)
meanfun <- function(data, i){
d <- data[i, ]
return(mean(d))
}
bo <- boot(data[, "xs", drop = FALSE], statistic=meanfun, R=5000)
boot.ci(bo, conf=0.95, type="bca")
and obtain:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = bo, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% ( 1.555, 2.534 )
Calculations and Intervals on Original Scale
One can use boot.array to extract all or a subset of the resampled sets. In this case:
bo.ci<-boot.ci(boot.out = bo, conf = 0.95, type = "bca")
resampled.data<-boot.array(bo,1)
To extract the first and second sets of resampled data:
resample.1<-resampled.data[1,]
resample.2<-resampled.data[2,]
Then proceed to extract the individual statistic you'd want from any subset. For isntance, If you assume normality you could run a student's t.test on teh first subset:
t.test(resample.1)
Which for this example and particular seed value(s) gives:
data: resample.1
t = 6.5216, df = 14, p-value = 1.353e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.234781 10.365219
sample estimates:
mean of x
7.8
r resampling boot.array

Error in creating a function to perform ttests on multiple continuous variables

So I'm trying to create a function that will take in a string of continuous variables, a categorical variable and a dataframe and output a table that includes, for each continuous variable: mean group1, mean group2, teststat, confidence interval, p-value.
What is currently here gives me the error: Error in model.frame.default(formula = var ~ class, data = data) : variable lengths differ (found for 'class')
I would love any feedback on how to fix this error and make this function do what I like. I want to make this function way more substantial and flexible, but I can't even get the basic version (handling multiple variables) to work.
THANKS!
#Continuous must be an object of the form:
#vars<-c("cont1", "cont2", "cont3", etc)
#CREATE DATA
cat1<-sample(c(1,2), 100, replace=T)
cont1<-rnorm(100, 25, 8)
cont2<-rnorm(100, 0, 1)
cont3<-rnorm(100, 6, 14.23)
cont4<-rnorm(100, 25, 8)*runif(5, 0.1, 1)
one<-data.frame(cat1, cont1, cont2, cont3, cont4)
#FUNCTION
two.group.comp<-function(continvars,class,data){
attach(data)
descriptives<-function(var){
test<-t.test(var~class, data)
means<-data.frame(test[5])
mean1<-means[1,1]
mean2<-means[2,1]
teststatbig<-data.frame(test[1])
teststat<-teststatbig[1,1]
conf<-data.frame(test[4])
lconf<-conf[1,1]
uconf<-conf[2,1]
pvalues<-data.frame(test[3])
pvalue<-pvalues[1,1]
variablename<-deparse(substitute(var))
entry<-data.frame(variablename,mean1,mean2,lconf,uconf,teststat,pvalue)
}
var<-data.frame(continvars)
table<<-sapply(var,descriptives)
detach(data)
}
#VARIABLES
continvars<-c("cont1", "cont2", "cont3")
#CALL TO FUNCTION
two.group.comp(continvars=continvars, class=cat1, data=one)
Does this do what you want?
two.group.comp <- function(continvars,class,data){
get.stats <- function(x,cat){
f <- unique(cat)
x1 <- x[cat==f[1]]
x2 <- x[cat==f[2]]
tt <- t.test(x1,x2)
smry <- c(tt$estimate,tt$statistic,p=tt$p.value)
names(smry) <- c("mean.1","mean.2","t","p")
return(smry)
}
result <- do.call(rbind,lapply(data[,continvars],get.stats,cat=class))
return(result)
}
# create sample dataset
set.seed(1)
cat1 <-sample(c(1,2), 100, replace=T)
cont1<-rnorm(100, 25, 8)
cont2<-rnorm(100, 0, 1)
cont3<-rnorm(100, 6, 14.23)
cont4<-rnorm(100, 25, 8)*runif(5, 0.1, 1)
one <-data.frame(cat1, cont1, cont2, cont3, cont4)
continvars<-c("cont1", "cont2", "cont3")
# call the function...
two.group.comp(continvars,cat1,one)
# mean.1 mean.2 t p
# cont1 24.4223859 25.33275704 -0.6024497 0.54827955
# cont2 0.0330148 0.01168979 0.1013519 0.91947827
# cont3 10.5784201 4.00651493 2.4183031 0.01747468
Working from the inside out:
get.stats(...) takes a single column of data, splits it into x1 and x2 according to cat, runs the t-test, and returns the summary statistics as a named vector.
lapply(...) passes the continvars columns of data to get.stats(...) one at a time.
do.call(rbind,...) binds together the set of vectors returned from lapply(...), row-wise, to generate the final result table.
This will work also if you pass column numbers instead of column names.
A piece of advice: the way you have it set up, you pass the column names of the continuous variables, but you pass the grouping factor as a vector. It would be cleaner if you pass the column name of the grouping factor.

Resources