How to include p-values<0.05 in q-graphs? - r

I am following up on an old question that never got an answer (https://stackoverflow.com/questions/31653029/r-thresholding-networks-with-inputted-p-values-in-q-graph). I'm trying to assess the relations between my variables. For this, I've used a correlation network map. Now I would like to add a significance threshold, for instance to show only correlations with p-values < 0.05. Any idea how I could implement this in my code?
Data set: https://www.dropbox.com/s/xntc3i4eqmlcnsj/d100_partition_all3.csv?dl=0
My code:
library(qgraph)
cor_d100_partition_all3<-cor(d100_partition_all3)
qgraph(cor_d100_partition_all3, layout="spring",
label.cex=0.9, labels=names(d100_partition_all3),
label.scale=FALSE, details = TRUE)
Additionally, I have this small piece of code that computes the p-value of each pairwise correlation:
Code:
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
p.mat <- cor.mtest(d100_partition_all3)
Cheers

There are a few ways to plot only the significant correlations. First, you could pass additional arguments to the qgraph() function; see its documentation for details. The call below should have values close to what is needed.
qgraph(cor_d100_partition_all3
, layout="spring"
, label.cex=0.9
, labels=names(d100_partition_all3)
, label.scale=FALSE
, details = TRUE
, minimum="sig" # only draw edges that are statistically significant
, alpha=0.05 # significance level
, bonf=FALSE # whether a Bonferroni correction should be used
, sampleSize=6 # number of observations, e.g. nrow(d100_partition_all3)
)
A second option is to create a modified correlation matrix. When the correlations are not statistically significant based on your cor.mtest() function, the value is set to NA in the modified correlation matrix. This modified matrix is plotted. A main visual difference between the first and second solutions seems to be the relative line weights.
# initializing modified correlation matrix
cor_d100_partition_all3_mod <- cor_d100_partition_all3
# looping through all elements and setting values to NA when the p-value is greater than 0.05
for(i in 1:nrow(cor_d100_partition_all3)){
for(j in 1:nrow(cor_d100_partition_all3)){
if(p.mat[i,j] > 0.05){
cor_d100_partition_all3_mod[i,j] <- NA
}
}
}
# plotting result
qgraph(cor_d100_partition_all3_mod
,layout="spring"
,label.cex=0.7
,labels=names(d100_partition_all3)
,label.scale=FALSE
,details = F
)
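The double loop in the second option can also be written as a single logical-index assignment. The sketch below is only an illustration: it uses simulated data as a stand-in for d100_partition_all3 (the names dat, cor_mat, p_mat and cor_mod are hypothetical), and it builds the p-value matrix the same way cor.mtest() does.

```r
# Simulated stand-in for the real data set (assumption: numeric columns only)
set.seed(1)
dat <- as.data.frame(matrix(rnorm(40 * 5), ncol = 5))

cor_mat <- cor(dat)

# p-value matrix built like cor.mtest(): zeros on the diagonal, symmetric elsewhere
k <- ncol(dat)
p_mat <- matrix(0, k, k, dimnames = dimnames(cor_mat))
for (i in 1:(k - 1)) {
  for (j in (i + 1):k) {
    p_mat[i, j] <- p_mat[j, i] <- cor.test(dat[[i]], dat[[j]])$p.value
  }
}

# Vectorized thresholding: blank out every correlation with p > 0.05
cor_mod <- cor_mat
cor_mod[p_mat > 0.05] <- NA
```

The resulting cor_mod plays the same role as cor_d100_partition_all3_mod above and would be passed to qgraph() in the same way.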

Related

How to create condition for warnings in R

I am trying to create a "for" loop where each of 100 trials has a set of parameters, each randomly chosen from probability distributions. From there, a model will take in these parameters and spit out an output. The input and output will be stored in a matrix, with each row representing a successful run through. Eventually, this matrix will be converted into a dataframe. I am displaying a sample run through for one case of the for loop below:
#matrix M will have 100 rows for each trial, and 4 columns
#columns will be a val, b val, c val and output
M <- matrix(0, nrow=100, ncol=4)
for (i in 1:100){
#random values for a,b,c for 1st trial
a =runif(1)
b=runif(1)
c=runif (1)
v <- c(a,b,c)
#some model
output[i]=v[1]*v[2]/v[3]
M[i,4]=output[i]
#don't know how to populate first 3 columns with all diff values of a,b,c
}
I know this code will not work, but that's my first question. How do I get the a,b, and c values to regenerate from trial to trial so I can have new outputs for each trial. From there, I am pretty sure I know how to store them in the matrix.
My last question is about warning messages. If I have a warning message because my output did not generate for some trial (no problems with this one, but if I had to divide by 0 or something)... how could I just tell the program to skip that trial and keep going until we get to 100 working trials?
Please comment if I should edit or clarify something above. Thanks in advance.
To answer your first question, you can first generate parameter vectors and then apply your function to each parameter set.
ntrials <- 100
M <- matrix(0, nrow=ntrials, ncol=4)
## Generate parameter vectors
M[,1] <- runif(ntrials)
M[,2] <- runif(ntrials)
M[,3] <- runif(ntrials)
## Example model function
run_mod <- function(a, b, c) {
return(a+b+c)
}
## Create output
M[, 4] <- run_mod(a = M[, 1], b = M[, 2], c = M[, 3])
To address your second question, you could use a while statement to continue generating parameter sets and trying to obtain valid model results until you have enough valid results. Your model function will need a way to handle errors or warnings that could occur, such as tryCatch().
## Example model function with error handling
run_mod <- function(a, b, c) {
  ## Return the value of tryCatch(): the model result on success,
  ## or a character flag if an error or warning was raised
  tryCatch(
    a + b + c,
    error = function(e) "Error",
    warning = function(w) "Warning"
  )
}
i <- 0
while (i < ntrials) {
  ## Generate a single set of parameters
  a <- runif(1)
  b <- runif(1)
  c <- runif(1)
  ## Example error
  if (floor(100 * a) %% 2 == 0) {
    a <- "Bad parameter"
  }
  ## Try running your model
  output <- run_mod(a, b, c)
  ## If successful, move on to the next row and save the parameters and output
  if (!is.character(output)) {
    i <- i + 1
    M[i, 1] <- a
    M[i, 2] <- b
    M[i, 3] <- c
    M[i, 4] <- output
  }
}
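Since the question asks about warnings specifically, here is a minimal sketch of the same retry loop with a model that really does warn. The model log(a - 0.5) * b / c is purely hypothetical (log() of a negative number warns "NaNs produced"), and safe_model is an illustrative name, not part of the answer above.

```r
# Hypothetical model: warns whenever a < 0.5; failed runs are mapped to NA
safe_model <- function(a, b, c) {
  tryCatch(
    log(a - 0.5) * b / c,
    warning = function(w) NA_real_,
    error   = function(e) NA_real_
  )
}

set.seed(1)
ntrials <- 100
M <- matrix(NA_real_, nrow = ntrials, ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "output")))

filled <- 0
while (filled < ntrials) {
  p <- runif(3)                        # fresh a, b, c for every attempt
  out <- safe_model(p[1], p[2], p[3])
  if (!is.na(out)) {                   # keep only trials that ran cleanly
    filled <- filled + 1
    M[filled, ] <- c(p, out)
  }
}
```

Returning NA from the handlers keeps the matrix numeric, so as.data.frame(M) afterwards gives four numeric columns.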

R function writing - getting error: "NaNs produced" / "Error in tsort[U + 1]: only 0's may be mixed with negative subscripts"

I am creating an R function that calculates a bootstrapped bias-corrected and accelerated (BCa) interval, without using any additional packages. Most of my code seems to be working, but I am struggling to write the code for the lower and upper limits of the interval. Any suggestions would be helpful.
BCa <- function(stat,X,k,level=0.95,...){
if(!is.numeric(k)||k<=0){
stop("The number of bootstrap resamples 'k' must be a numeric value greater than 0")
}
t.star <- stat(X,...)
t.k <- rep(NA,k)
for(i in 1:k){
Xi <- sample(X,replace=TRUE)
t.k[i] <- stat(Xi,...)
}
z0 <- qnorm(mean(t.k<t.star))
n <- length(X)
t.minus.j <- rep(NA,n)
for(j in 1:n){
Xj <- X[-j]
t.minus.j[j]<- stat(Xj,...)
}
t.bar.minus <- mean(t.minus.j)
t.diff <- t.bar.minus - t.minus.j
a <- ((sum(t.diff^3))/(6*(t.diff^2)^3/2))
alpha <- 1-level
tsort <- sort(t.k, decreasing = FALSE)
L <- pnorm(z0 + ((z0+qnorm(alpha/2))/((1-a)*z0+qnorm(alpha/2))))
U <- qnorm(z0 + ((z0+qnorm(alpha/2))/((1-a)*z0+qnorm(alpha/2))))
if(!is.integer(L)){
L <- floor(L*(k+1))
}
if(!is.integer(U)){
U <- ceiling(U*(k+1))
}
lower.limit <- tsort[L]
upper.limit <- tsort[U+1]
return(list(t.star=t.star,ci=c(lower.limit,upper.limit)))
}
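No answer to this question is included here. As a hedged sketch only, the function below shows one way the limits could be computed under the standard BCa formulas; bca_interval and its internal names are hypothetical. It differs from the code above in four places that look like the source of the errors: the acceleration denominator is 6 * sum(d^2)^(3/2) (in the original, ^3/2 parses as cubing and then halving, and the sum is missing), both adjusted levels use pnorm() (qnorm() of a value outside [0, 1] produces NaN), the adjustment term is 1 - a * (z0 + z), and the upper limit uses the 1 - alpha/2 quantile.

```r
bca_interval <- function(stat, X, k, level = 0.95, ...) {
  stopifnot(is.numeric(k), length(k) == 1, k > 0)

  t_hat <- stat(X, ...)

  # Bootstrap replicates of the statistic
  t_boot <- vapply(seq_len(k),
                   function(i) stat(sample(X, replace = TRUE), ...),
                   numeric(1))

  # Bias correction
  z0 <- qnorm(mean(t_boot < t_hat))

  # Acceleration from the leave-one-out (jackknife) values
  t_jack <- vapply(seq_along(X), function(j) stat(X[-j], ...), numeric(1))
  d <- mean(t_jack) - t_jack
  a <- sum(d^3) / (6 * sum(d^2)^(3 / 2))

  # Adjusted percentile levels for the lower and upper limits
  alpha <- 1 - level
  z <- qnorm(c(alpha / 2, 1 - alpha / 2))
  p <- pnorm(z0 + (z0 + z) / (1 - a * (z0 + z)))

  list(t.star = t_hat, ci = quantile(t_boot, probs = p, names = FALSE))
}

# Usage on simulated data
set.seed(42)
x <- rexp(50)
res <- bca_interval(mean, x, k = 2000)
```

Using quantile() on the replicates avoids indexing the sorted vector by hand, which is where the "negative subscripts" error came from once L and U were NaN or out of range.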

Question regarding k fold cross validation for KNN using R

I am trying to run 5-fold cross-validation for several values of k. I used the OJ data set from the ISLR package.
My code so far is as follows:
library(ISLR)
library(class)
ks=c(1:5)
err.rate.test <- numeric(length = 5)
folds <- cut(seq(1,nrow(OJ)),breaks=5,labels=FALSE)
for (j in seq(along = ks)) {
set.seed(123)
cv.knn <- sapply(1:5, FUN = function(i) {
testID <- which(folds == i, arr.ind = TRUE)
test.X <- OJ[testID, 3]
test.Y <- OJ[testID, 1]
train.X <- OJ[-testID, 3]
train.Y <- OJ[-testID, 1]
knn.test <- knn(data.frame(train.X), data.frame(test.X), train.Y, k = ks[j])
cv.test.est <- mean(knn.test != test.Y)
return(cv.test.est)
})
err.rate.test[j] <- mean(cv.knn)
}
err.rate.test
[1] 0.3757009 0.3757009 0.3757009 0.3757009 0.3757009
The code doesn't give any errors, but for some reason my test error rate is the same for every value of k. This seems weird to me, so I assume there is something wrong with my code.
Can anyone help me figure it out?
Thank you
Remove set.seed(123); this is what causes the repeated error rates.
set.seed() is used for reproducibility: it fixes the random number stream so that anything random gives the same result on every execution. Because the seed is reset to the same value at the start of every iteration, each knn fit sees the same random draws, resulting in the same predictions and therefore the same error rates.
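The effect of re-seeding inside a loop can be seen in a small base-R sketch that does not involve knn at all (the variable names are illustrative): resetting the seed on every iteration replays the same random draw, while seeding once before the loop keeps the run reproducible but lets each iteration draw fresh values.

```r
# Seed reset inside the loop: every iteration replays the same draw
draws_reseeded <- sapply(1:3, function(j) {
  set.seed(123)
  runif(1)
})

# Seed set once before the loop: reproducible overall, but each draw differs
set.seed(123)
draws_once <- sapply(1:3, function(j) runif(1))
```

Applied to the question's code, this would mean moving set.seed(123) above the for loop rather than deleting it, assuming reproducibility is still wanted.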

Downsample matrix in R?

My question is about how to improve the performance of a function that downsamples from the columns of a matrix without replacement (a.k.a. "rarefaction" of a matrix). I know this has been mentioned here before, but I could not find a clear answer that a) does what I need and b) does it quickly.
Here is my function:
downsampled <- function(data,samplerate=0.8) {
data.test <- apply(data,2,function(q) {
names(q) <- rownames(data)
samplepool <- character()
for (i in names(q)) {
samplepool <- append(samplepool,rep(i,times=q[i]))
}
sampled <- sample(samplepool,size=samplerate*length(samplepool),replace = F)
tab <- table(sampled)
mat <- match(names(tab),names(q))
toret <- numeric(length = length(q))
names(toret) <- names(q)
toret[mat] <- tab
return(toret)
})
return(data.test)
}
I need to be downsampling matrices with millions of entries. I find this is quite slow (here I'm using a 1000x1000 matrix, which is about 20-100x smaller than my typical data size):
mat <- matrix(sample(0:40,1000*1000,replace=T),ncol=1000,nrow=1000)
colnames(mat) <- paste0("C",1:1000)
rownames(mat) <- paste0("R",1:1000)
system.time(matd <- downsampled(mat,0.8))
## user system elapsed
## 69.322 21.791 92.512
Is there a faster/easier way to perform this operation that I haven't thought of?
I think you can make this dramatically faster. If I understand what you are trying to do correctly, you want to down-sample each cell of the matrix, such that if samplerate = 0.5 and the cell of the matrix is mat[i,j] = 5, then you want to sample up to 5 things where each thing has a 0.5 chance of being sampled.
To speed things up, rather than doing all these operations on columns of the matrix, you can just loop through each cell of the matrix, draw n things from that cell by using runif (e.g., if mat[i,j] = 5, you can generate 5 random numbers between 0 and 1, and then add up the number of values that are < samplerate), and finally add the number of things to a new matrix. I think this effectively achieves the same down-sampling scheme, but much more efficiently (both in terms of running time and lines of code).
# Sample matrix
set.seed(23)
n <- 1000
mat <- matrix(sample(0:10,n*n,replace=T),ncol=n,nrow=n)
colnames(mat) <- paste0("C",1:n)
rownames(mat) <- paste0("R",1:n)
# Old function
downsampled<-function(data,samplerate=0.8) {
data.test<-apply(data,2,function(q){
names(q)<-rownames(data)
samplepool<-character()
for (i in names(q)) {
samplepool=append(samplepool,rep(i,times=q[i]))
}
sampled=sample(samplepool,size=samplerate*length(samplepool),replace = F)
tab=table(sampled)
mat=match(names(tab),names(q))
toret=numeric(length = length(q))
names(toret)<-names(q)
toret[mat]<-tab
return(toret)
})
return(data.test)
}
# New function
downsampled2 <- function(mat, samplerate=0.8) {
new <- matrix(0, nrow(mat), ncol(mat))
colnames(new) <- colnames(mat)
rownames(new) <- rownames(mat)
for (i in 1:nrow(mat)) {
for (j in 1:ncol(mat)) {
new[i,j] <- sum(runif(mat[i,j], 0, 1) < samplerate)
}
}
return(new)
}
# Compare times
system.time(downsampled(mat,0.8))
## user system elapsed
## 26.840 3.249 29.902
system.time(downsampled2(mat,0.8))
## user system elapsed
## 4.704 0.247 4.918
Using an example 1000 X 1000 matrix, the new function I provided runs about 6 times faster.
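Counting how many of n uniform draws fall below samplerate is one draw from a binomial distribution with size n and probability samplerate, so the double loop in downsampled2 can be collapsed into a single vectorized rbinom() call. The sketch below is an assumption-labelled variant (downsampled3 is a hypothetical name). Like downsampled2, it keeps each unit independently, so a column's total after sampling is random rather than exactly samplerate times the original total.

```r
# Vectorized variant: one binomial draw per cell, with no explicit loops
downsampled3 <- function(mat, samplerate = 0.8) {
  matrix(rbinom(length(mat), size = as.vector(mat), prob = samplerate),
         nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
}

# Small example matrix of counts
set.seed(23)
n <- 200
mat <- matrix(sample(0:10, n * n, replace = TRUE), nrow = n, ncol = n)
colnames(mat) <- paste0("C", 1:n)
rownames(mat) <- paste0("R", 1:n)

matd <- downsampled3(mat, 0.8)
```

Timing it with system.time(downsampled3(mat, 0.8)) on the 1000 x 1000 matrix above would show how it compares with the loop version.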
One source of savings would be to remove the for loop that appends samplepool using rep. Here is a reproducible example:
myRows <- 1:5
names(myRows) <- letters[1:5]
# get the repeated values for sampling
samplepool <- rep(names(myRows), myRows)
Within your function, this would be
samplepool <- rep(names(q), q)

Dependency matrix

I need to build a dependency matrix covering all 91 variables of my data set.
I tried some code, but I didn't succeed.
Here is the relevant part of it:
p<- length(dati)
chisquare <- matrix(dati, nrow=(p-1), ncol=p)
It should create a square matrix with all the variables.
system.time({for(i in 1:p){
for(j in 1:p){
a <- dati[, rn[i+1]]
b <- dati[, cn[j]]
chisquare[i, (1:(p-1))] <- chisq.test(dati[,i], dati[, i+1])$statistic
chisquare[i, p] <- chisq.test(dati[,i], dati, i+1])$p.value
}}
})
It should relate the "p" variables to each other to check whether they are dependent.
Error in `[.data.frame`(dati, , rn[i + 1]) :
undefined columns selected
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 32.23 0.11 32.69
warnings() #let's check
>: In chisq.test(dati[, i], dati[, i + 1]) :
Chi-squared approximation may be incorrect
chisquare #all the cells (except the last column, which seems to hold the p-values) have the same value within each row
I also tried another approach, provided by someone who knows R much better than I do:
#strange values I have in some columns
sum(dati == 'x')
#replacing "x" by x
x <- dati[dati=='x']
#distribution of answers for each question
answers <- t(sapply(1:ncol(dati), function(i) table(factor(dati[, i], levels = -2:9), useNA = 'always')))
rownames(answers) <- colnames(dati)
answers
#correlation for the pairs
I<- diag(ncol(dati))
#empty diagonal matrix
colnames(I) <- rownames(I) <- colnames(dati)
rn <- rownames(I)
cn <- colnames(I)
#loop
system.time({
for(i in 1:ncol(dati)){
for(j in 1:ncol(spain)){
a <- dati[, rn[i]]
b <- dati[, cn[j]]
r <- chisq.test(a,b)$statistic
r <- chisq.test(a,b)$p.value
I[i, j] <- r
}
}
})
user system elapsed
29.61 0.09 30.70
There were 50 or more warnings (use warnings() to see the first 50)
warnings() #let's check
-> : In chisq.test(a, b) : Chi-squared approximation may be incorrect
diag(I)<- 1
#result
head(I)
The printed columns stop at the 5th variable, whereas I need to check the dependency between every pair of variables.
I don't understand where I went wrong, but I hope I'm not too far off.
Any help would be appreciated.
You are apparently trying to compute the p-value of a chi-squared test,
for all pairs of variables in your dataset.
This can be done as follows.
# Sample data
n <- 1000
k <- 10
d <- matrix(sample(LETTERS[1:5], n*k, replace=TRUE), ncol=k)
d <- as.data.frame(d)
names(d) <- letters[1:k]
# Compute the p-values
k <- ncol(d)
result <- matrix(1, nrow=k, ncol=k)
rownames(result) <- colnames(result) <- names(d)
for(i in 1:k) {
  for(j in 1:k) {
    result[i,j] <- chisq.test( d[,i], d[,j] )$p.value
  }
}
In addition, there may be something wrong with your data,
leading to the warnings you get,
but we do not know anything about it.
Your code has too many problems for me to try to enumerate them
(you start to try to create a square matrix with a different number
of rows and columns, and then I am completely lost).
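Because the chi-squared test gives the same p-value for (x, y) and (y, x), the loop in the answer can visit each pair once and mirror the result, which roughly halves the work for 91 variables. This is a sketch on simulated data (d and pvals are illustrative names); it leaves the diagonal at 1, as the question does with diag(I) <- 1, rather than testing each variable against itself.

```r
# Simulated categorical data standing in for the real data set
set.seed(1)
d <- as.data.frame(matrix(sample(LETTERS[1:3], 300 * 4, replace = TRUE), ncol = 4))

k <- ncol(d)
pvals <- matrix(1, nrow = k, ncol = k, dimnames = list(names(d), names(d)))

# Test each unordered pair once and fill both triangles
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    pvals[i, j] <- pvals[j, i] <- chisq.test(d[[i]], d[[j]])$p.value
  }
}
```

If the "Chi-squared approximation may be incorrect" warning appears on the real data, it usually points to small expected cell counts; chisq.test() accepts simulate.p.value = TRUE to compute a Monte Carlo p-value instead, which may be worth trying.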

Resources