cor function with NA values due to 0 variance

Beginner R user here. I am using the cor function to get Kendall's tau-b rank correlation coefficient between 2 columns of a data frame. An example of such columns is as follows:
A B
1 1
1 2
1 3
When I use cor(d, method = "kendall"),
the result is NA for the correlation between A and B. Shouldn't it be 0? And if not, is there a way to replace this NA result with 0 using a parameter in the cor function?

Consider what would happen if we slightly perturbed the constant column. We would get vastly different answers depending on the particular perturbation used; in fact, we can get any correlation we like with a suitable perturbation. As a result it really makes no sense to assign any particular value to the correlation, and it is best left as NA. (The demonstration below uses Spearman's rho, but the same argument applies to Kendall's tau.)
x <- c(1, 1, 1)
y <- 1:3
cor(x + (1:3) * 1e-10, y, method = "spearman")
## [1] 1
cor(x - (1:3) * 1e-10, y, method = "spearman")
## [1] -1
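As for the second part of the question: there is no cor() argument that does this, but if a 0 is nevertheless wanted, the NA entries of the returned matrix can be overwritten afterwards. A minimal sketch, assuming d is the two-column data frame from the question:
## cor() has no parameter for this; replace the NAs after the fact instead.
m <- cor(d, method = "kendall")
m[is.na(m)] <- 0   # treats undefined correlations as 0 (against the advice above);
                   # note this also zeroes the NA on A's own diagonal entry
m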

Related

OLS estimator in R

I'm trying to compute OLS estimators manually in R for given vectors and matrices, but when I apply the formula beta = (x'x)^-1 (x'y), R tells me there is a dimension issue, and I can't figure out why.
My code is
nr = 100
nc = 1000
x=matrix(rnorm(nr * nc, mean=1, sd=1), nrow = nr)
epsilon=matrix(rnorm(nr * nc, mean=0, sd=1), nrow = nr)
k=c(1,2,4,8)
eta1=((epsilon^1-mean(epsilon^1))/(mean(epsilon^(1*2))-mean(epsilon^1)^2)^(1/2))
eta2=((epsilon^2-mean(epsilon^2))/(mean(epsilon^(2*2))-mean(epsilon^2)^2)^(1/2))
eta4=((epsilon^4-mean(epsilon^4))/(mean(epsilon^(4*2))-mean(epsilon^4)^2)^(1/2))
eta8=((epsilon^8-mean(epsilon^8))/(mean(epsilon^(8*2))-mean(epsilon^8)^2)^(1/2))
y1=x+eta1
y2=x+eta2
y4=x+eta4
y8=x+eta8
beta1=inv(t(x)*x)*(t(x)*y1)
beta2=inv(t(x)*x)*(t(x)*y2)
beta4=inv(t(x)*x)*(t(x)*y4)
beta8=inv(t(x)*x)*(t(x)*y8)
Also, I feel that there should be a way to loop over the values of k to automate this, instead of computing each eta by hand, so a bit of help in this area would also be appreciated.
The output I'm looking for is a vector of beta for each of the different values of k.
You have several issues. Firstly, y and epsilon should be n×1 matrices, but you have n×m matrices instead. Secondly, you should use matrix multiplication, which is %*% in R, i.e. t(x) %*% y1; you are using elementwise multiplication (*) instead.
For the sake of simplicity, let's create a matrix with 5 columns. My approach is to create a dependent variable that is related to the columns of x (the independent variables, or the feature matrix in machine-learning terminology).
nr = 100
nc = 5
x=matrix(rnorm(nr * nc, mean=1, sd=1), nrow = nr)
epsilon=matrix(rnorm(nr, mean=0, sd=1), nrow = nr) # it should be nx1
k=c(1,2,4,8)
eta1=((epsilon^1-mean(epsilon^1))/(mean(epsilon^(1*2))-mean(epsilon^1)^2)^(1/2))
eta2=((epsilon^2-mean(epsilon^2))/(mean(epsilon^(2*2))-mean(epsilon^2)^2)^(1/2))
eta4=((epsilon^4-mean(epsilon^4))/(mean(epsilon^(4*2))-mean(epsilon^4)^2)^(1/2))
eta8=((epsilon^8-mean(epsilon^8))/(mean(epsilon^(8*2))-mean(epsilon^8)^2)^(1/2))
To check the output, we should create the y values deliberately. So, let's define some betas and create the y values from them; at the end, we can compare the output with the inputs we defined. Note that you should have 5 betas for the 5 columns.
# made up betas
beta1_real <- 1:5
beta2_real <- -4:0
beta4_real <- 7:11
beta8_real <- seq(0.25,1.25,0.25)
To create the y values,
y1= 10 + x %*% matrix(beta1_real) + eta1
y2= 20 + x %*% matrix(beta2_real) + eta2
y4= 30 + x %*% matrix(beta4_real) + eta4
y8= 40 + x %*% matrix(beta8_real) + eta8
Here, I also added a constant term to each y. To recover the constant term at the end, we should add a column of ones at the front of our x matrix:
x <- cbind(matrix(1,nrow = nr),x)
The rest is almost the same as yours. The only differences are that I used solve instead of inv and used matrix multiplication (%*%):
beta1=solve(t(x)%*%x)%*%(t(x)%*%y1)
beta2=solve(t(x)%*%x)%*%(t(x)%*%y2)
beta4=solve(t(x)%*%x)%*%(t(x)%*%y4)
beta8=solve(t(x)%*%x)%*%(t(x)%*%y8)
If we compare the outputs,
beta1_real was,
# [1] 1 2 3 4 5
and the output of beta1 is,
# [,1]
# [1,] 10.0049631
# [2,] 0.9632124
# [3,] 1.8987402
# [4,] 2.9816673
# [5,] 4.2111817
# [6,] 4.9529084
The results are similar. The 10 at the beginning is the constant term I added; the remaining differences stem from the added error term (eta).
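On the other part of the question (looping over k): a minimal sketch, reusing the objects defined above (x with its leading column of ones, the n×1 epsilon, and the made-up beta vectors collected into a list):
## Sketch: automate the eta / y / beta steps for every k.
k <- c(1, 2, 4, 8)
beta_real <- list(1:5, -4:0, 7:11, seq(0.25, 1.25, 0.25))  # the made-up betas above
betas <- lapply(seq_along(k), function(i) {
  e   <- epsilon^k[i]
  eta <- (e - mean(e)) / sqrt(mean(e^2) - mean(e)^2)        # standardise, as above
  y   <- 10 * i + x[, -1] %*% matrix(beta_real[[i]]) + eta  # x[, -1] drops the ones column
  solve(t(x) %*% x) %*% (t(x) %*% y)                        # OLS estimate
})
names(betas) <- paste0("k", k)
betas$k1  # compare with beta1 above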

glmnet: at what lambda is each coefficient shrunk to 0?

I am using LASSO (from package glmnet) to select variables. I have fitted a glmnet model and plotted coefficients against lambda's.
library(glmnet)
set.seed(47)
x = matrix(rnorm(100 * 3), 100, 3)
y = rnorm(100)
fit = glmnet(x, y)
plot(fit, xvar = "lambda", label = TRUE)
Now I want to get the order in which coefficients become 0. In other words, at what lambda does each coefficient become 0?
I can't find a function in glmnet to extract such a result. How can I get it?
Function glmnetPath in my initial answer is now in an R package called solzy.
## you may need to first install package "remotes" from CRAN
remotes::install_github("ZheyuanLi/solzy")
## Zheyuan Li's R functions on Stack Overflow
library(solzy)
## use function `glmnetPath` for your example
glmnetPath(fit)
#$enter
# i j ord var lambda
#1 3 2 1 V3 0.15604809
#2 2 19 2 V2 0.03209148
#3 1 24 3 V1 0.02015439
#
#$leave
# i j ord var lambda
#1 1 23 1 V1 0.02211941
#2 2 18 2 V2 0.03522036
#3 3 1 3 V3 0.17126258
#
#$ignored
#[1] i var
#<0 rows> (or 0-length row.names)
Interpretation of enter
As lambda decreases, variables (see i for numeric ID and var for variable names) enter the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 3 enters the model at lambda = 0.15604809, the 2nd value in fit$lambda;
variable 2 enters the model at lambda = 0.03209148, the 19th value in fit$lambda;
variable 1 enters the model at lambda = 0.02015439, the 24th value in fit$lambda.
Interpretation of leave
As lambda increases, variables (see i for numeric ID and var for variable names) leave the model in turn (see ord for the order). The corresponding lambda for the event is fit$lambda[j].
variable 1 leaves the model at lambda = 0.02211941, the 23rd value in fit$lambda;
variable 2 leaves the model at lambda = 0.03522036, the 18th value in fit$lambda;
variable 3 leaves the model at lambda = 0.17126258, the 1st value in fit$lambda.
Interpretation of ignored
If not an empty data.frame, it lists variables that never enter the model. That is, they are effectively ignored. (Yes, this can happen!)
Note: fit$lambda is decreasing, so j is in ascending order in enter but in descending order in leave.
To further explain indices i and j, take variable 2 as an example. It leaves the model (i.e., its coefficient becomes 0) at j = 18 and enters the model (i.e., its coefficient becomes non-zero) at j = 19. You can verify this:
fit$beta[2, 1:18]
## all zeros
fit$beta[2, 19:ncol(fit$beta)]
## all non-zeros
See Obtain variable selection order from glmnet for a more complicated example.
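If you only need the "enter" part and prefer not to install a package, here is a rough sketch of the same idea using only the fit object from above (fit$beta is the p x nlambda coefficient matrix and fit$lambda the matching lambda sequence); it is an approximation of what glmnetPath reports, not a replacement for it:
## For each variable, find the first lambda index at which its coefficient
## is non-zero; NA means the variable never enters the model.
nz <- as.matrix(fit$beta) != 0
j  <- apply(nz, 1, function(z) which(z)[1])
data.frame(var = rownames(fit$beta), j = j, lambda = fit$lambda[j])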

correlated binomial distribution in r

I want to generate a second vector in R that is correlated with my first vector.
The first vector is simply created as follows:
x <- rbinom(n=10,1,p=0.8)
x
[1] 0 0 1 0 1 1 1 1 0 0
My second vector should be generated with a defined correlation, e.g. 0.8.
I know that you can use mvrnorm() for the normal distribution, but I don't know how to do it for the binomial distribution. I tried to find a solution, but the suggestions were either too complicated for me or I could not apply them to my code.
I second the recommendation that you visit Cross Validated. It is not clear how you plan to use the correlated binomial distribution. Given your stipulation that you are starting with one vector and want to create a second based on the first, all you need to do is adjust the probabilities of the second vector:
set.seed(42)
x <- rbinom(n=1000, size=1, p=0.8) # Your first vector
y <- rbinom(n=1000, size=1, p=ifelse(x==1, .95, .05))
cor(x, y)
# [1] 0.8505885
y <- rbinom(n=1000, size=1, p=ifelse(x==1, .94, .06))
cor(x, y)
# [1] 0.821918
y <- rbinom(n=1000, size=1, p=ifelse(x==1, .93, .07))
cor(x, y)
# [1] 0.7679597
When generating the second vector, its success probability is set above .8 (the probability of a 1 in the first vector) whenever the first value is 1, and below .2 (the probability of a 0) whenever the first value is 0.
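If you need to do this repeatedly, the idea can be wrapped into a small helper. A sketch, where correlated_binary and its arguments p1 and p0 (the conditional probabilities P(y = 1 | x = 1) and P(y = 1 | x = 0)) are hypothetical names made up for illustration:
## Hypothetical helper: draw y conditionally on x with tunable probabilities.
correlated_binary <- function(x, p1 = 0.95, p0 = 0.05) {
  rbinom(length(x), size = 1, prob = ifelse(x == 1, p1, p0))
}
set.seed(42)
x <- rbinom(n = 1000, size = 1, p = 0.8)
y <- correlated_binary(x)
cor(x, y)  # roughly 0.85, as in the example above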

McNemar test in R - sparse data

I'm attempting to run a good-sized dataset through R, using the McNemar test to determine whether there is a difference in the proportion of objects detected by one method over another on paired samples. I've noticed that the test works fine when I have a 2x2 table of
test1
y n
y 34 2
n 12 16
but if I try and run something more like:
34 0
12 0
it errors, telling me that 'x' and 'y' must have the same number of levels (minimum 2).
I should clarify that I've tried converting wide data to a 2x2 matrix using the table function on my wide data set, but rather than appearing as above, it drops the final column, giving me:
test1
y
y 34
n 12
I've also run mcnemar.test using the factor object option, which gives me the same error, so I'm assuming it does something similar. I'm wondering whether there is a way to force the table function to generate the 2nd column despite there being no observations that fall under either of those categories, or a way to make the test overlook this missing data.
Perhaps there's a better way to do this, but you can force R to construct a sparse contingency table by ensuring that the tabulated factors have the same levels attribute and that there are exactly 2 distinct levels specified.
# Example data
x1 <- c(rep("y", 34), rep("n", 12))
x2 <- rep("n", 46)
# Set levels explicitly
x1 <- factor(x1, levels = c("y", "n"))
x2 <- factor(x2, levels = c("y", "n"))
table(x1, x2)
# x2
# x1 y n
# y 0 34
# n 0 12
mcnemar.test(table(x1, x2))
#
# McNemar's Chi-squared test with continuity correction
#
# data: table(x1, x2)
# McNemar's chi-squared = 32.0294, df = 1, p-value = 1.519e-08
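With the levels set explicitly as above, the factor interface mentioned in the question should also work and, as far as I can tell, give the same result, since it tabulates the two factors internally:
## Same test via the factor interface (uses x1 and x2 from above).
mcnemar.test(x1, x2)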

R package caret confusionMatrix with missing categories

I am using the function confusionMatrix in the R package caret to calculate some statistics for some data I have. I have been putting my predictions as well as my actual values into the table function to get the table to be used in the confusionMatrix function, like so:
table(predicted,actual)
However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always represent all the possibilities (e.g. only A, B, D). The resulting output of the table function does not include the missing outcome and looks like this:
A B C D
A n1 n2 n3 n4
B n5 n6 n7 n8
D n9 n10 n11 n12
# Note how there is no corresponding row for `C`.
The confusionMatrix function can't handle the missing outcome and gives the error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
Is there a way I can use the table function differently to get the missing rows with zeros or use the confusionMatrix function differently so it will view missing outcomes as zero?
As a note: Since I am randomly selecting my data to test with, there are times that a category is also not represented in the actual result as opposed to just the predicted. I don't believe this will change the solution.
You can use union to ensure similar levels:
library(caret)
# Sample Data
predicted <- c(1,2,1,2,1,2,1,2,3,4,3,4,6,5) # Levels 1,2,3,4,5,6
reference <- c(1,2,1,2,1,2,1,2,1,2,1,3,3,4) # Levels 1,2,3,4
u <- union(predicted, reference)
t <- table(factor(predicted, u), factor(reference, u))
confusionMatrix(t)
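Equivalently, since confusionMatrix also accepts factor arguments directly (see the note in the next answer), the intermediate table can be skipped; this sketch assumes the same u, predicted, and reference as above:
## Same result without building the table first.
confusionMatrix(factor(predicted, u), factor(reference, u))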
First note that confusionMatrix can be called as confusionMatrix(predicted, actual) in addition to being called with table objects. However, the function throws an error if predicted and actual (both regarded as factors) do not have the same number of levels.
This (and the fact that the caret package spat an error at me because its dependencies are not set up correctly in the first place) is why I'd suggest creating your own function:
# Create a confusion matrix from the given outcomes, whose rows correspond
# to the actual and the columns to the predicted classes.
createConfusionMatrix <- function(act, pred) {
  # You've mentioned that neither actual nor predicted may give a complete
  # picture of the available classes, hence:
  numClasses <- max(act, pred)
  # Sort predicted and actual as it simplifies what's next. You can make this
  # faster by storing `order(act)` in a temporary variable.
  pred <- pred[order(act)]
  act  <- act[order(act)]
  sapply(split(pred, act), tabulate, nbins = numClasses)
}
# Generate random data since you've not provided an actual example.
actual <- sample(1:4, 1000, replace=TRUE)
predicted <- sample(c(1L,2L,4L), 1000, replace=TRUE)
print( createConfusionMatrix(actual, predicted) )
which will give you:
1 2 3 4
[1,] 85 87 90 77
[2,] 78 78 79 95
[3,] 0 0 0 0
[4,] 89 77 82 83
I had the same problem and here is my solution:
tab <- table(my_prediction, my_real_label)
if (nrow(tab) != ncol(tab)) {
  missings <- setdiff(colnames(tab), rownames(tab))
  missing_mat <- mat.or.vec(nr = length(missings), nc = ncol(tab))
  tab <- as.table(rbind(as.matrix(tab), missing_mat))
  rownames(tab) <- colnames(tab)
}
my_conf <- confusionMatrix(tab)
Cheers
Cankut
