Dependency matrix - r

I need to build a dependency matrix for all 91 variables in my data set.
I tried some code, but it did not work.
Here is the relevant part:
p<- length(dati)
chisquare <- matrix(dati, nrow=(p-1), ncol=p)
It should create a square matrix with all the variables.
system.time({for(i in 1:p){
for(j in 1:p){
a <- dati[, rn[i+1]]
b <- dati[, cn[j]]
chisquare[i, (1:(p-1))] <- chisq.test(dati[,i], dati[, i+1])$statistic
chisquare[i, p] <- chisq.test(dati[,i], dati, i+1])$p.value
}}
})
It should relate the p variables to each other, to check whether they are dependent.
Error in `[.data.frame`(dati, , rn[i + 1]) :
undefined columns selected
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 32.23 0.11 32.69
warnings() #let's check
>: In chisq.test(dati[, i], dati[, i + 1]) :
Chi-squared approximation may be incorrect
chisquare #all the cells (except in the last column, which seems to hold the p-values) have the same value within each row
I also tried another approach, which was given to me by someone who knows R much better than I do:
#strange values I have in some columns
sum(dati == 'x')
#replacing "x" by x
x <- dati[dati=='x']
#distribution of answers for each question
answers <- t(sapply(1:ncol(dati), function(i) table(factor(dati[, i], levels = -2:9), useNA = 'always')))
rownames(answers) <- colnames(dati)
answers
#correlation for the pairs
I<- diag(ncol(dati))
#empty diagonal matrix
colnames(I) <- rownames(I) <- colnames(dati)
rn <- rownames(I)
cn <- colnames(I)
#loop
system.time({
for(i in 1:ncol(dati)){
for(j in 1:ncol(spain)){
a <- dati[, rn[i]]
b <- dati[, cn[j]]
r <- chisq.test(a,b)$statistic
r <- chisq.test(a,b)$p.value
I[i, j] <- r
}
}
})
user system elapsed
29.61 0.09 30.70
There were 50 or more warnings (use warnings() to see the first 50)
warnings() #let's check
-> : In chisq.test(a, b) : Chi-squared approximation may be incorrect
diag(I)<- 1
#result
head(I)
The columns stop at the 5th variable, whereas I need to check the dependency between all pairs of variables.
I don't understand where I went wrong, but I hope I'm not too far off.
Any help would be appreciated.

You are apparently trying to compute the p-value of a chi-squared test,
for all pairs of variables in your dataset.
This can be done as follows.
# Sample data
n <- 1000
k <- 10
d <- matrix(sample(LETTERS[1:5], n*k, replace=TRUE), ncol=k)
d <- as.data.frame(d)
names(d) <- letters[1:k]
# Compute the p-values
k <- ncol(d)
result <- matrix(1, nrow=k, ncol=k)
rownames(result) <- colnames(result) <- names(d)
for(i in 1:k) {
for(j in 1:k) {
result[i,j] <- chisq.test( d[,i], d[,j] )$p.value
}
}
In addition, there may be something wrong with your data,
leading to the warnings you get,
but we do not know anything about it.
Your code has too many problems for me to try to enumerate them
(you start to try to create a square matrix with a different number
of rows and columns, and then I am completely lost).
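Since the chi-squared p-value does not depend on the order of the two variables, only the pairs with i < j need to be tested. Here is a minimal sketch of that idea on sample data like the one above. The helper name `pairwise_chisq` and the `min_expected` matrix are illustrative and not part of the original answer, and the diagonal is simply left at 1. The second matrix records the smallest expected cell count for each pair, because `chisq.test()` issues the "Chi-squared approximation may be incorrect" warning when some expected counts are below 5.

```r
# Hypothetical helper: p-values of chisq.test() for every pair of columns.
pairwise_chisq <- function(d) {
  k <- ncol(d)
  nm <- list(names(d), names(d))
  p_values <- matrix(1, nrow = k, ncol = k, dimnames = nm)
  min_expected <- matrix(NA_real_, nrow = k, ncol = k, dimnames = nm)
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      # Warnings are silenced on purpose; min_expected keeps the diagnostic.
      test <- suppressWarnings(chisq.test(d[[i]], d[[j]]))
      p_values[i, j] <- p_values[j, i] <- test$p.value
      min_expected[i, j] <- min_expected[j, i] <- min(test$expected)
    }
  }
  list(p_values = p_values, min_expected = min_expected)
}

# Sample data (same shape as in the answer above)
set.seed(1)
d <- as.data.frame(matrix(sample(LETTERS[1:5], 1000 * 10, replace = TRUE), ncol = 10))
names(d) <- letters[1:10]

res <- pairwise_chisq(d)
dim(res$p_values)

# Pairs for which the chi-squared approximation is doubtful
which(res$min_expected < 5, arr.ind = TRUE)
```

With 91 variables this runs 4095 tests instead of 8281, and the `min_expected` matrix shows which pairs triggered the warning rather than only that a warning occurred.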

Related

List giving an empty values upto penultimate cell

Following is the code I am trying to run. The main objective is to fit the model for different k values and then calculate the accuracies in order to choose the best k.
So I thought of using a for loop in which each model result and its accuracy are stored in lists, indexed by the respective k value.
The problem is that, with the following code, the list has no values for elements 1 to 29; only element 30 holds predicted values.
k = 1:30
for(l in k){
pre[[l]] = knn(train_dataset,test_dataset,cl = labels_train, k = l)
}
Can someone help me understand why the list comes out like that, and what should be done to get the correct result?
Here is a solution, with the models fit using the code in tacoman's comment.
library(class)
set.seed(1) # Make the results reproducible
knn_list <- lapply(1:30, function(l){
knn(train_dataset, test_dataset, cl = labels_train, k = l)
})
ok <- sapply(knn_list, '==', labels_test)
acc <- colMeans(ok)
which(acc == max(acc))
plot(acc, type = "b")
The for loop in the question can also be run, as long as the results list is created beforehand. The results are identical.
set.seed(1) # Make the results reproducible
k <- 1:30
pre <- vector("list", length = 30)
for(l in k){
pre[[l]] <- knn(train_dataset, test_dataset, cl = labels_train, k = l)
}
identical(pre, knn_list)
#[1] TRUE
Example data
set.seed(2021)
n <- nrow(iris)
i <- sample(n, 0.7*n)
train_dataset <- iris[i, -5]
test_dataset <- iris[-i, -5]
labels_train <- iris[i, 5]
labels_test <- iris[-i, 5]

How to create condition for warnings in R

I am trying to create a "for" loop where each of 100 trials has a set of parameters, each randomly chosen from probability distributions. From there, a model will take in these parameters and spit out an output. The input and output will be stored in a matrix, with each row representing a successful run through. Eventually, this matrix will be converted into a dataframe. I am displaying a sample run through for one case of the for loop below:
#matrix M will have 100 rows for each trial, and 4 columns
#columns will be a val, b val, c val and output
M <- matrix(0, nrow=100, ncol=4)
for (i in 1:100){
#random values for a,b,c for 1st trial
a =runif(1)
b=runif(1)
c=runif (1)
v <- c(a,b,c)
#some model
output[i]=v[1]*v[2]/v[3]
M[i,4]=output[i]
#don't know how to populate first 3 columns with all diff values of a,b,c
}
I know this code will not work, but that's my first question. How do I get the a,b, and c values to regenerate from trial to trial so I can have new outputs for each trial. From there, I am pretty sure I know how to store them in the matrix.
My last question is about warning messages. If I have a warning message because my output did not generate for some trial (no problems with this one, but if I had to divide by 0 or something)... how could I just tell the program to skip that trial and keep going until we get to 100 working trials?
Please comment if I should edit or clarify something above. Thanks in advance.
To answer your first question, you can first generate parameter vectors and then apply your function to each parameter set.
ntrials <- 100
M <- matrix(0, nrow=ntrials, ncol=4)
## Generate parameter vectors
M[,1] <- runif(ntrials)
M[,2] <- runif(ntrials)
M[,3] <- runif(ntrials)
## Example model function
run_mod <- function(a, b, c) {
return(a+b+c)
}
## Create output
M[, 4] <- run_mod(a = M[, 1], b = M[, 2], c = M[, 3])
To address your second question, you could use a while statement to continue generating parameter sets and trying to obtain valid model results until you have enough valid results. Your model function will need a way to handle errors or warnings that could occur, such as tryCatch().
## Example model function with error handling
run_mod <- function(a, b, c) {
  ## Return the model value, or a character flag if the call fails
  tryCatch(
    a + b + c,
    error = function(e) "Error",
    warning = function(w) "Warning"
  )
}
i <- 1  # next row of M to fill
while (i <= ntrials) {
  ## Generate a single set of parameters
  a <- runif(1)
  b <- runif(1)
  c <- runif(1)
  ## Example error
  if (floor(100 * a) %% 2 == 0) {
    a <- "Bad parameter"
  }
  ## Try running your model
  output <- run_mod(a, b, c)
  ## If successful, save output and move on to the next row
  if (!is.character(output)) {
    M[i, 1] <- a
    M[i, 2] <- b
    M[i, 3] <- c
    M[i, 4] <- output
    i <- i + 1
  }
}
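The question asks about warnings in particular. Here is a sketch of the same retry loop for a model that emits a warning rather than an error. The model `sqrt(a - b) / c` is made up for illustration (it warns with "NaNs produced" whenever `a < b`), and `run_mod_safe` is a hypothetical name. `tryCatch()` turns any warning or error into `NA`, and the loop keeps drawing parameters until `ntrials` rows are filled.

```r
set.seed(42)
ntrials <- 100
M <- matrix(NA_real_, nrow = ntrials, ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "output")))

# Hypothetical model wrapper: NA signals a trial that must be skipped.
run_mod_safe <- function(a, b, c) {
  tryCatch(
    sqrt(a - b) / c,                  # warns when a < b
    warning = function(w) NA_real_,
    error   = function(e) NA_real_
  )
}

i <- 1          # next row to fill
attempts <- 0   # total draws, including skipped ones
while (i <= ntrials) {
  a <- runif(1)
  b <- runif(1)
  c <- runif(1)
  attempts <- attempts + 1
  output <- run_mod_safe(a, b, c)
  if (!is.na(output)) {
    M[i, ] <- c(a, b, c, output)
    i <- i + 1
  }
}

results <- as.data.frame(M)
head(results)
```

Counting `attempts` separately from `i` makes it easy to see how many draws were discarded. If a trial should be kept despite a warning, `withCallingHandlers()` is the usual alternative, since `tryCatch()` abandons the computation as soon as the warning is signalled.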

R: Stuck on a "simple" problem: calculating total sum of squares in a n*m matrix

Given a data matrix with n rows and m columns, I would like to calculate the total sum of squares in R.
For this I've tried a loop that iterates through the rows of each column and saves the results in a vector. These are then added to the "TSS" vector where each value is the SS of one column. The sum of this vector should be the TSS.
set.seed(2020)
m <- matrix(c(sample(1:100, 80)), nrow = 40, ncol = 2)
tss <- c()
for(j in 1:ncol(m)){
tssVec <- c()
for(i in 1:nrow(m)){
b <- sum(((m[i,]) - mean(m[,j]))^2)
tssVec <- c(tssVec, b)
}
tss <- c(tss, sum(tssVec))
}
sum(tss)
The output is equal to 136705.6. This is not feasible at all. As a novice coder, I am unfortunately stuck.
Any help is appreciated!
There are several ways to compute the TSS, and they all give the same result. Note that both snippets below return the total sum of squares of the first column (V1) only.
Method 1, which uses the ANOVA table of a linear model:
n <- as.data.frame(m)
mylm <- lm(V1 ~ V2, data = n)
SSTotal <- sum(anova(mylm)[, "Sum Sq"])
Method 2, which uses the sample variance:
SSTotal <- var(m[, 1]) * (nrow(m) - 1)
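If the total is meant to cover the whole matrix, with deviations taken from each column's own mean (an assumption about what the question intends), the loop in the question only needs `m[i, j]` in place of `m[i, ]`. Here is a sketch with the corrected loop and two vectorized equivalents; the names `tss_loop`, `tss_scale`, and `tss_var` are illustrative.

```r
set.seed(2020)
m <- matrix(sample(1:100, 80), nrow = 40, ncol = 2)

# Corrected loop: one element m[i, j] at a time, centred on its column mean
tss <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  for (i in seq_len(nrow(m))) {
    tss[j] <- tss[j] + (m[i, j] - mean(m[, j]))^2
  }
}
tss_loop <- sum(tss)

# Vectorized: centre every column, then square and sum
tss_scale <- sum(scale(m, center = TRUE, scale = FALSE)^2)

# Per-column variance times (n - 1), summed over the columns
tss_var <- sum(apply(m, 2, var) * (nrow(m) - 1))

all.equal(tss_loop, tss_scale)
all.equal(tss_loop, tss_var)
```

The original loop subtracted the mean of column j from the whole row `m[i, ]`, so every row contributed deviations of both columns against each column mean, which inflates the total.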

How to include p-values<0.05 in q-graphs?

I am following up on an old unanswered question (https://stackoverflow.com/questions/31653029/r-thresholding-networks-with-inputted-p-values-in-q-graph). I'm trying to assess relations between my variables. For this, I've used a correlation network map. I would now like to add a significance threshold, for instance to show only results with p-values < 0.05. Any idea how I could implement this in my code?
Data set: https://www.dropbox.com/s/xntc3i4eqmlcnsj/d100_partition_all3.csv?dl=0
My code:
library(qgraph)
cor_d100_partition_all3<-cor(d100_partition_all3)
qgraph(cor_d100_partition_all3, layout="spring",
label.cex=0.9, labels=names(d100_partition_all3),
label.scale=FALSE, details = TRUE)
Additionally, I have this small piece of code that computes the p-values of the pairwise correlation tests:
Code:
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
p.mat <- cor.mtest(d100_partition_all3)
Cheers
There are a few ways to plot only the significant correlations. First, you could pass additional arguments to the qgraph() function; see its documentation for details. The function call given below should have values that are close to what is needed.
qgraph(cor_d100_partition_all3
, layout="spring"
, label.cex=0.9
, labels=names(d100_partition_all3)
, label.scale=FALSE
, details = TRUE
, minimum='sig' # minimum based on statistical significance
,alpha=0.05 # significance criteria
,bonf=F # should Bonferroni correction be used
,sampleSize=6 # number of observations
)
A second option is to create a modified correlation matrix. When the correlations are not statistically significant based on your cor.mtest() function, the value is set to NA in the modified correlation matrix. This modified matrix is plotted. A main visual difference between the first and second solutions seems to be the relative line weights.
# initializing modified correlation matrix
cor_d100_partition_all3_mod <- cor_d100_partition_all3
# looping through all elements and setting values to NA when p-values is greater than 0.05
for(i in 1:nrow(cor_d100_partition_all3)){
for(j in 1:nrow(cor_d100_partition_all3)){
if(p.mat[i,j] > 0.05){
cor_d100_partition_all3_mod[i,j] <- NA
}
}
}
# plotting result
qgraph(cor_d100_partition_all3_mod
,layout="spring"
,label.cex=0.7
,labels=names(d100_partition_all3)
,label.scale=FALSE
,details = F
)
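The double loop above can also be written with logical indexing. Here is a minimal sketch on the built-in `mtcars` data (a stand-in, since the linked data set is not reproduced here), with p-values computed the same way as in the `cor.mtest()` function from the question. The names `dat`, `cor_mat`, and `cor_mod` are illustrative.

```r
dat <- mtcars[, 1:6]   # stand-in data, not the data set from the question
cor_mat <- cor(dat)

# p-values of cor.test() for every pair of columns (diagonal set to 0)
n <- ncol(dat)
p.mat <- matrix(0, n, n, dimnames = dimnames(cor_mat))
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    p.mat[i, j] <- p.mat[j, i] <- cor.test(dat[[i]], dat[[j]])$p.value
  }
}

# Blank out every correlation that is not significant at the 5% level
cor_mod <- cor_mat
cor_mod[p.mat > 0.05] <- NA

round(cor_mod, 2)
```

The single assignment `cor_mod[p.mat > 0.05] <- NA` does the work of the nested loop, and the resulting matrix can be passed to `qgraph()` in the same way as the modified matrix in the second option.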

Is there a consistent method for both cbind(1, <numeric>) and cbind(1, <matrix>)?

I'm working with some code that has crucial similarities to the following trivial script:
scores <- matrix(rnorm(4*20), ncol=4,nrow=20)
result <- matrix(NA, ncol=2, nrow=20)
index <- as.logical(rbinom(20,1,.2))
result[index, 1:3] <- cbind(1, scores[index,3:4])
where index is a logical vector and sum(index) is typically greater than 1 but can occasionally be 1 or 0.
The script fails in the case where sum(index) == 1:
> scores <- matrix(rnorm(4*20), ncol=4,nrow=20)
> result <- matrix(NA, ncol=3, nrow=20)
> index <- c(rep(FALSE, 19),TRUE)
> result[index, 1:3] <- cbind(1, scores[index,3:4])
Error in result[index, 1:3] <- cbind(1, scores[index, 3:4]) :
number of items to replace is not a multiple of replacement length
> cbind(1, scores[index,3:4])
[,1] [,2]
[1,] 1 -0.1780255
[2,] 1 -0.6840048
> #should be:
> c(1, scores[index,3:4])
[1] 1.0000000 -0.1780255 -0.6840048
and where sum(index) ==0:
> scores <- matrix(rnorm(4*20), ncol=4,nrow=20)
> result <- matrix(NA, ncol=3, nrow=20)
> index <- rep(FALSE, 20)
> result[index, 1:3] <- cbind(1, scores[index,3:4])
Warning message:
In cbind(1, scores[index, 3:4]) :
number of rows of result is not a multiple of vector length (arg 1)
> #cbinding to a zero-row matrix returns an error
The obvious solution to this problem is the following:
scores <- matrix(rnorm(4*20), ncol=4,nrow=20)
result <- matrix(NA, ncol=3, nrow=20)
index <- as.logical(rbinom(20,1,.1))
if(sum(index) > 1){
result[index, 1:3] <- cbind(1, scores[index,3:4])
}else{
if(sum(index) ==1){
result[index, 1:3] <- c(1, scores[index,3:4])
}
}
However, I'm interested in advice on how to code to avoid this error without having to write a bunch of if statements. Is there a trick to binding an atomic vector to an nx2 matrix OR 2-length vector (n=1) such that the result is always an nx3 matrix? Extra points if the script can do this without producing an error when n=0.
If it weren't for like an hour of debugging, I wouldn't have identified this issue--it was buried quite a few functions down in a batch processing script. Any general advice on coding in a way of avoiding such 'gotchas'?
Generally adding drop=FALSE to mtx[1,] calls will avoid the difficulties that arise with single row extractions and subsequent operations that assume a matrix structure:
result[index, 1:2] <- cbind(1, scores[1, 3:4, drop=FALSE]) # no error
# Also adding a third column to avoid dimension mismatch
scores <- matrix(rnorm(4*20), ncol=4,nrow=20)
result <- matrix(NA, ncol=3, nrow=20)
index <- as.logical(rbinom(20,1,.2))
result[index, 1:3] <- cbind(1, scores[index,3:4, drop=FALSE])
I haven't quite figured out what you want us to do to avoid problems when assigning zero-row objects to zero-row targets. You could instead check for sum(index) == 0 (that is, !any(index)) and skip the assignment in that case.
(The real problem is that you were assigning a three column matrix to a two column target. Oh, I see you tried to fix that, except you were still trying to assign to a third column that was not there in the dimensions.)
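Here is a sketch of the `drop = FALSE` approach wrapped in a small function, so that the same line handles n = 0, n = 1, and n > 1 selected rows. The name `with_ones` is hypothetical. The column of ones is built as a matrix with a matching number of rows, which avoids recycling a length-one vector against a zero-row matrix.

```r
# Hypothetical helper: prepend a column of ones to the selected rows.
with_ones <- function(scores, index, cols = 3:4) {
  sub <- scores[index, cols, drop = FALSE]   # stays a matrix for 0 or 1 rows
  cbind(matrix(1, nrow = nrow(sub), ncol = 1), sub)
}

set.seed(1)
scores <- matrix(rnorm(4 * 20), ncol = 4, nrow = 20)

cases <- list(
  rep(FALSE, 20),             # n = 0
  c(rep(FALSE, 19), TRUE),    # n = 1
  rep(c(TRUE, FALSE), 10)     # n = 10
)

for (index in cases) {
  result <- matrix(NA_real_, ncol = 3, nrow = 20)
  result[index, 1:3] <- with_ones(scores, index)
  print(dim(with_ones(scores, index)))
}
```

Keeping the subset a matrix at every step means the assignment target and the replacement always agree in shape, so no branching on `sum(index)` is needed.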
