Permutation test in R

I want to test whether two variables are correlated by comparing the observed correlation against a null distribution obtained by breaking the association between them. I am supposed to do this using a permutation test with the Kendall correlation coefficient. I am not sure I am doing it the right way; below is my code.
### This is the original observed data
observed <- cor(myData$gene_dens, myData$qp.site, method = "kendall")
plot(myData$gene_dens, myData$qp.site, main = paste("Correlation =", observed))
### Permutation: shuffle one variable to break the association, then
### recompute the Kendall correlation on the shuffled data
perm <- function(dataframe)
{
  shuffled <- sample(dataframe$gene_dens, size = nrow(dataframe), replace = FALSE)
  cor(shuffled, dataframe$qp.site, method = "kendall")
}
### I am using 10000 replicates because I want a null distribution, so that
### I don't have to rely on the assumptions of the normal distribution
result <- replicate(10000, perm(myData))
### myData is the data frame containing the entire contents of the CSV file
hist(result)
### Two-sided p-value: the proportion of permuted correlations at least as
### extreme as the observed one
pvalue <- sum(abs(result) >= abs(observed)) / length(result)
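If it helps, base R's cor.test() gives an asymptotic Kendall test that the permutation p-value should roughly agree with (a sanity check, not part of the original code):
# Compare the permutation p-value with the asymptotic Kendall test
cor.test(myData$gene_dens, myData$qp.site, method = "kendall")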

Related

How do I loop different percentages of missing values using MCAR?

Using the Cleveland data from the UCI data repository, I want to generate missing values in the data so I can apply some imputation techniques.
heart.ds <- read.csv(file.choose())
head(heart.ds)
attach(heart.ds)
sum(is.na(heart.ds))
str(heart.ds)
#Changing Appropriate Variables to Factors
heart.ds$sex<-as.factor(heart.ds$sex)
heart.ds$cp<-as.factor(heart.ds$cp)
heart.ds$fbs<-as.factor(heart.ds$fbs)
heart.ds$exang<-as.factor(heart.ds$exang)
heart.ds$restecg<-as.factor(heart.ds$restecg)
heart.ds$slope<-as.factor(heart.ds$slope)
heart.ds$thal<-as.factor(heart.ds$thal)
heart.ds$target<-as.factor(heart.ds$target)
str(heart.ds)
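As an aside, the eight conversions above can be written in one step; a compact equivalent, assuming the same column names:
# Convert all categorical columns to factors at once
factor_cols <- c("sex", "cp", "fbs", "exang", "restecg", "slope", "thal", "target")
heart.ds[factor_cols] <- lapply(heart.ds[factor_cols], as.factor)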
Now I want to generate missing values using the MCAR mechanism. Below is the loop code:
p = c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
for(i in 1:length(p)){
  hd_mcar[i] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
The problem is that after running the code above, I don't get the generated data in its original form, i.e. as a data frame with n variables and n rows.
Below is a picture of the output I got from the code above:
[screenshot of the output]
But when I use only one missingness percentage, I get accurate results; below is the code for a single missing-data percentage:
#Missing Completely at Random(MCAR)
hd_mcar <- delete_MCAR(heart.ds, 0.05)
sum(is.na(hd_mcar))
Below is the output of the results:
[screenshot of the output]
Please, I need help solving the looping problem. Thank you.
Now I want to apply MICE and other imputation methods like Hmisc, Amelia, mi, and missForest inside the loop, but it gives me an error saying "Error: Data should be a matrix or data frame".
The code below is for MICE only:
#1. Method(MICE)
mice_mcar[[i]] <- mice(hd_mcar, m=ip, method = c("pmm","logreg","polyreg","pmm","pmm","logreg",
"polyreg","pmm","logreg","pmm","polyreg","pmm",
"polyreg","logreg"), maxit = 20)
#Diagnostic check
summary(heart.ds$age)
mice_mcar$imp$age
#Finding the means of the imputations
app1 <- apply(mice_mcar$imp$age, MARGIN = 2, FUN = mean)
min1 <- abs(app1-mean(heart.ds$age))
#Selecting the minimum index
sm1 <- which(min1==min(min1))
#Selecting final imputation
final_clean_hd_mcar <- mice::complete(mice_mcar, sm1)
mice.mcar = final_clean_hd_mcar
How do I go about making this fit into the loop so that it works properly?
Your problem was this line:
hd_mcar = rep(0, length(heart.ds)) #to generate empty bins of 10 different percentages of missingness using the MCAR package
You are creating a vector here rather than a list. You can't assign a data frame to an element of a vector without coercing it into something that is not a data frame. You want to do this:
p <- c(0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1)
hd_mcar <- vector(mode = "list", length = length(p))
for(i in 1:length(p)){
  hd_mcar[[i]] <- delete_MCAR(heart.ds, p[i]) #to generate 10 different percentages of missingness using the MCAR package
}
Note that because it's a list now, hd_mcar[[i]] uses the [[ rather than [ subscript.
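To address the second part of the question, the imputation can then run on each list element inside the same loop. A minimal sketch, assuming delete_MCAR comes from the missMethods package and using mice's defaults instead of the full method vector from the question:
library(mice)
library(missMethods)  # assumed source of delete_MCAR
p <- c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1)
hd_mcar   <- vector(mode = "list", length = length(p))
mice_mcar <- vector(mode = "list", length = length(p))
for (i in seq_along(p)) {
  hd_mcar[[i]]   <- delete_MCAR(heart.ds, p[i])  # a full data frame with p[i] missingness
  mice_mcar[[i]] <- mice(hd_mcar[[i]], m = 5, maxit = 20, printFlag = FALSE)
}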

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore, I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without reproducible data there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  #prediction based on x components
  confusion.mat = get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in what get.BER returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This produces a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
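Spelled out, that extended version might look like this (a sketch; the named list elements are my own choice):
confmat_full <- function(x) {
  prediction    <- myPredictions$class$max.dist[, x]
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  list(prediction = prediction, confusion.mat = confusion.mat, BER = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat_full)
results[[1]]$BER  # balanced error rate for 10 components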

Speed up functions involving (s)apply

I've profiled my code using the lineprof package and identified the bottlenecks to be in the three functions perm.stat.list, G.hat, and emp.FDR. The common theme seems to be the use of (s)apply, based on the output of the profiler.
Below is a simplified version of my functions, along with code to generate a reproducible example involving the three functions. I've added comments to better explain what each function is doing and the inputs required.
I'd like to speed up my code considerably: even with B=10, the process takes almost half an hour. The input is a large matrix (10000 x 10000), so speed is important. Ideally, I'd like to run B=5000 permutations, which increases computation time further.
Any tips to improve my code are greatly appreciated.
### Functions ###
perm.stat.list <- function(samp.dat, N1, N2, B){
  perm.list = vector("list", B)  #preallocate rather than growing from NULL
  for (b in 1:B){
    #Permute the row "labels", preserving information across columns
    perm.dat.tmp = samp.dat[sample(nrow(samp.dat)), ]
    #Compute the permutation-based test statistics,
    #saving each (1 x M) permutation vector into the list
    perm.list[[b]] = apply(perm.dat.tmp, 2, function(y) t.test(y[1:N1], y[(N1+1):(N1+N2)])$statistic)
  }
  #Bind the B vectors into the (B x M) matrix that G.hat expects
  return(do.call(rbind, perm.list))
}
G.hat = function(perm.mat, t){
  #Number of permutations
  B = nrow(perm.mat)
  #Compute an empirical distribution along each COLUMN of the permutation matrix
  out = apply(perm.mat, 2, function(x) sum(x > t, na.rm = TRUE))/B
  return(out)
}
emp.FDR <- function(t.vec, mat){
  #For each value in t.vec, apply the G.hat function
  out = sapply(t.vec, function(i) sum(G.hat(mat, i), na.rm = TRUE)/max(sum(t.vec > i, na.rm = TRUE), 1))
  return(out)
}
### Generate reproducible example ###
### Global variables ###
#Sample sizes (rows)
N1=3000
N2=7000
#Number of columns
M = 10000
#Number of permutations
B = 10
### Data ###
set.seed(1)
X1 = matrix(rnorm(N1*M),ncol=M)
X2 = matrix(rnorm(N2*M),ncol=M)
### Combine data in one large matrix of size (N1+N2) rows x M columns ###
samp.dat = rbind(X1,X2)
### Compute statistic for each column of samp.dat ###
t.stats = apply(samp.dat,2,
function(x) t.test(x[1:N1],x[(N1+1):(N1+N2)])$statistic)
### Sort t.stats in decreasing order (not necessarily needed for computation) ###
t.vec = sort(t.stats,decreasing=TRUE)
### Permutation matrix based on the data ###
perm.mat = perm.stat.list(samp.dat=samp.dat,N1=N1,N2=N2,B=B)
eFDR = emp.FDR(t.vec=t.vec,mat=perm.mat)
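The per-column t.test() calls dominate the runtime here. A minimal sketch of a vectorized alternative (my own rewrite, assuming the same N1/N2 row layout): the Welch t-statistic for all M columns can be computed at once from column-wise means and variances, avoiding apply() entirely.
#Vectorized Welch t-statistics; algebraically identical to
#t.test(...)$statistic with the default var.equal = FALSE
col.t.stats <- function(dat, N1, N2){
  x  <- dat[1:N1, , drop = FALSE]
  y  <- dat[(N1+1):(N1+N2), , drop = FALSE]
  mx <- colMeans(x); my <- colMeans(y)
  vx <- colSums(sweep(x, 2, mx)^2)/(N1 - 1)  #column-wise sample variances
  vy <- colSums(sweep(y, 2, my)^2)/(N2 - 1)
  (mx - my)/sqrt(vx/N1 + vy/N2)
}
#Drop-in use inside perm.stat.list:
# perm.list[[b]] = col.t.stats(perm.dat.tmp, N1, N2)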

Permutations and combinations of all the columns in R

I want to check all the permutations and combinations of columns while selecting models in R. I have 8 columns in my data set, and the piece of code below lets me check some of the models, but not all of them. Models such as columns 1+6 or 1+2+5 will not be covered by this loop. Is there a better way to accomplish this?
best_model <- rep(0, 3) #store the best model in this array
for(i in 1:8){
  for(j in 1:8){
    for(x in k){
      diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
      accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
      if( best_model[1] < accuracy[x] ){
        best_model[1] = accuracy[x]
        best_model[2] = i
        best_model[3] = j
      }
    }
  }
}
Well, this answer isn't complete, but maybe it'll get you started. You want to be able to subset by all possible subsets of columns. So instead of having i:j for some i and j, you want to be able to subset by c(1,6) or c(1,2,5), etc.
Using the sets package, you can form the power set (the set of all subsets) of a set. That's the easy part. I'm new to R, so the hard part for me is understanding the difference between sets, lists, vectors, etc. I'm used to Mathematica, in which they're all the same.
library(sets)
my.set <- 1:8 # you want column indices from 1 to 8
my.power.set <- set_power(my.set) # this creates the set of all subsets of those indices
my.names <- c("a") #I don't know how to index into sets, so I created names (that are numbers, but of type characters)
for(i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
names(my.power.set) <- my.names
my.indices <- vector("list",length(my.power.set)-1)
for(i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} #this is the line I couldn't get to work
I wanted to create a list of lists called my.indices, so that my.indices[i] was a subset of {1,2,3,4,5,6,7,8} that could be used in place of where you have i:j. Then, your for loop would have to run from 1:length(my.indices).
But alas, I have been spoiled by Mathematica, and thus cannot decipher the incredibly complicated world of R data types.
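For what it's worth, the same list of index vectors can be built in base R with combn, sidestepping the sets types entirely (a sketch, not tested against the setup above):
# All non-empty subsets of columns 1..8 as a list of index vectors
my.indices <- unlist(
  lapply(1:8, function(n) combn(1:8, n, simplify = FALSE)),
  recursive = FALSE
)
length(my.indices)  # 255 = 2^8 - 1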
Solved it, below is the code with explanatory comments:
# find out the best model for this data
number_of_columns_to_model <- ncol(diabetes_training) - 1
best_model <- c()
best_model_accuracy = 0
# i = 0 would represent the empty model, so loop over 1 .. 2^n - 1
for(i in 1:(2^number_of_columns_to_model - 1)){
  # convert the value of i to binary, e.g. i=5 will give combination = 0 0 0 0 0 1 0 1
  combination = as.binary(i, n = number_of_columns_to_model) # from the binaryLogic package
  model <- c()
  # j is used here so the outer loop variable i is not shadowed
  for(j in 1:length(combination)){
    # choose which columns to consider depending on the combination
    if(combination[j])
      model <- c(model, j)
  }
  for(x in k){
    # for the columns selected by model, find the accuracy of the model for k = 1:27
    diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
    accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/length(diabetes_test_labels)
    if( best_model_accuracy < accuracy[x] ){
      best_model_accuracy = accuracy[x]
      best_model = model
      print(model)
    }
  }
}
I trained with Pima.tr and tested with Pima.te. KNN accuracy was 78% with pre-processed predictors and 80% without pre-processing (because of the large influence of some variables).
The 80% performance is on par with a logistic regression model, and logistic regression needs no variable pre-processing.
Random forest and logistic regression both provide hints on which variables to drop, so you don't need to evaluate every possible combination.
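As an illustration of where those hints come from, a minimal sketch on the Pima data mentioned above (randomForest is an extra package; exact numbers will vary):
library(MASS)          # Pima.tr
library(randomForest)  # assumed available for variable importance
fit.glm <- glm(type ~ ., data = Pima.tr, family = binomial)
summary(fit.glm)$coefficients  # small p-values hint at useful predictors
fit.rf <- randomForest(type ~ ., data = Pima.tr)
importance(fit.rf)             # MeanDecreaseGini: higher = more useful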
Another way is to look at a scatterplot matrix.
You get a sense that there is a difference between type 0 and type 1 when it comes to npreg, glu, bmi, and age.
You also notice the highly skewed ped and age, and that there may be an anomalous data point between skin and the other variables (you may need to remove that observation before going further).
A skin vs type box plot shows that for the Yes type an extreme outlier exists (try removing it).
You also notice that most of the boxes for the Yes type are higher than for the No type, so these variables may add predictive power to the model (you can confirm this with a Wilcoxon rank-sum test).
The high correlation between skin and bmi means that you can use one or the other, or an interaction of both.
Another approach to reducing the number of predictors is to use PCA
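A minimal sketch of that PCA route with base R's prcomp, again on Pima.tr (how many components to keep is left open):
pca <- prcomp(Pima.tr[, c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")],
              center = TRUE, scale. = TRUE)
summary(pca)        # proportion of variance explained per component
head(pca$x[, 1:3])  # first three component scores, usable as predictors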

How to find an optimal adstock decay factor for an independent variable in panel analysis in R?

I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to get a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code returns an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) for all other variables.
I've attached a sample of my data, as well as the key parts of my code. I'd greatly appreciate it if someone could take a look and see what is wrong.
Sample Data
# Set panel data structure
alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm + 0
# Create 1 month time lag for TRx
alldata <- ddply(
  alldata, .(DMA), transform,
  # This assumes that the data is sorted
  Mkt_TRx_norm_prev = c(NA, Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <- function(decay, period, data_vector, pool_vector = 0){
  data_vector <- alldata$var
  pool_vector <- alldata$DMA
  data2 <- data_vector
  l <- length(data_vector)
  #if no pool, apply zeros to the vector
  if(length(pool_vector) == 1) pool_vector <- rep(0, l)
  #outer loop: extract data to decay from observation i
  for(i in 1:l){
    x <- data_vector[i]
    #inner loop: apply decay onto following observations after i
    for(j in 1:min(period, l)){
      #constrain decay to same pool (if data is pooled)
      if(pool_vector[i] == pool_vector[min(i+j, l)]){ data2[(i+j)] <- data2[(i+j)] + (x*(decay)^j) }
    }
  }
  #reduce length of edited data to equal length of initial data
  data2 <- data2[1:l]
  #regression - excludes NA values
  alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
  var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2, data = alldata, model = "within", na.action = na.exclude)
  se <- summary(var_fe)$coefficients["data2", "Std. Error"]
  return(se)
}
# Optimize decay for adstock variable
result <- optimize(adstockreg, interval=c(0,1), period = 6)
print(result)
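As an aside on the transformation itself, an unwindowed geometric adstock can be computed without nested loops via a recursive filter. A minimal sketch (it ignores the period cap and the DMA pooling used above, so it is an illustration rather than a drop-in replacement):
# y_t = x_t + decay * y_{t-1}, the standard geometric adstock
adstock <- function(x, decay){
  as.numeric(stats::filter(x, filter = decay, method = "recursive"))
}
adstock(c(100, 0, 0, 0), 0.5)  # 100.0 50.0 25.0 12.5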
