Confusion about calculating sample correlation in R

I have been tasked with manually calculating the sample correlation between two datasets (D$Nload and D$Pload), and then comparing the result with R's built-in cor() function.
I calculate the sample correlation with
cov(D$Nload,D$Pload, use="complete.obs")/(sd(D$Nload)*sd(D$Pload, na.rm=TRUE))
Which gives me the result 0.5693599
Then I try using R's cor() function
cor(D[, c("Nload","Pload")], use="pairwise.complete.obs")
which gives me the result:
          Nload     Pload
Nload 1.0000000 0.6244952
Pload 0.6244952 1.0000000
Which is a different result. Can anyone see where I've gone wrong?

This happens because when you call sd() on a single vector, it cannot check if the data is pairwise complete. Example:
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA
df <- data.frame(x = x, y = y)
So here we have
df[seq(2), ]
           x         y
1  1.0879645        NA
2 -0.3919369 0.2191193
We see that while the second row is pairwise complete (all columns used for your computation are not NA), the first row is not. However, if you calculate sd() on just a single column, it doesn't have any information about the pairs. So in your case, sd(df$x) will use all the available data, although it should avoid the first row.
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09301583
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
But if you remove the first row from your computation, the results are equal:
df <- df[complete.cases(df), ]
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09313766
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
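The same point can be sketched without modifying the data frame, by restricting every term of the manual formula to the rows where both variables are observed. The names x, y, ok, naive, manual and builtin are illustrative and the data are simulated, so this is only a minimal sketch of the idea, not the asker's actual data.

```r
set.seed(1)
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA

# Rows where both x and y are observed
ok <- complete.cases(x, y)

# Naive version: sd(x) still uses all 100 values, including the unpaired one
naive <- cov(x, y, use = "complete.obs") / (sd(x) * sd(y, na.rm = TRUE))

# Consistent version: covariance and both standard deviations use the same rows
manual <- cov(x[ok], y[ok]) / (sd(x[ok]) * sd(y[ok]))

builtin <- cor(x, y, use = "complete.obs")

isTRUE(all.equal(manual, builtin))  # the consistent version matches cor()
isTRUE(all.equal(naive, builtin))   # the naive version does not
```

Applied to the question, this would mean computing sd(D$Nload) only on the rows where D$Pload is not NA, and vice versa.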

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix because I only care about the correlation of LPS0 vs all other columns, not the correlations of the other columns with themselves. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p.values are returned as NA, so I switched to an if_else loop. However, I am now getting an error. My code is as follows:
for(i in 3:ncol(microbiota_lps_0_morm)) {
morm_0 <- cor.test(microbiota_lps_0_morm$LPS0, microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]], method = "spearman")
if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! true must be length 1 (length of condition), not 8.
Backtrace: 1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm_0 only when p.value <= 0.05?
Here's a long piece of code which automates the whole thing. It might be overkill, but you can just take the matrix and use whatever you need. It makes use of the tidyverse.
library(tidyverse)
df <- select_if(mtcars,is.numeric)
glimpse(df)
# keeping real names
dict <- cbind(original=names(df),new=paste0("v",1:ncol(df)))
# but changing names for better data viz
colnames(df) <- paste0("v",1:ncol(df))
# correlating between variables + p values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
pvals[[coln]] <- map(df, ~ cor.test(df[,coln], .)$p.value)
corss[[coln]] <- map(df, ~ cor(df[,coln], .))
}
# Keeping both matrices in a list
matrices <- list(
pvalues = matrix(data=unlist(pvals),
ncol=length(names(pvals)),
nrow=length(names(pvals))),
correlations = matrix(data=unlist(corss),
ncol=length(names(corss)),
nrow=length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)
# Creating a combined data frame
long_cors <- expand.grid(Var1=names(df),Var2=names(df)) %>%
mutate(cor=unlist(matrices["correlations"]),
pval=unlist(matrices["pvalues"]),
same=Var1==Var2,
significant=pval<0.05,
dpcate=duplicated(cor)) %>%
# Leaving no duplicates, non-significant or self-correlation results
filter(same ==F,significant==T,dpcate==F) %>%
select(-c(same,dpcate,significant))
# Plotting correlations
long_cors %>%mutate(negative=cor<0) %>%
ggplot(aes(x=Var1,y=Var2,
color=negative,size=abs(cor),fill=Var2,
label=round(cor,2)))+
geom_label(show.legend = F,alpha=0.2)+
scale_color_manual(values = c("black","darkred"))+
# Sizing each correlation by its magnitude
scale_size_area(seq(1,100,length=length(unique(long_cors$Var1))))+ theme_light()+
theme(axis.text = element_text(face = "bold",size=12))+
labs(title="Correlation between variables",
caption = "p < 0.05")+xlab("")+ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
~ cor.test(mtx[,1], mtx[,.], use="p", method="s")$p.value)
Whenever you feel like you need a for loop in R, chances are you should be using another approach. for is a very un-R construct, and R gives many better ways of handling the same issues. The map_* family of functions from purrr (part of the tidyverse) is but one of them. Another approach, in base R, would be to use apply:
significant <- apply(mtx[,-1], 2,
\(x) cor.test(x, mtx[,1], method="s", use="p")$p.value)
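As for the error in the question itself: dplyr::if_else() is vectorised and checks that its true and false arguments match the length of the condition, which is what the message reports (print(morm_0) returns the test object, a list with several components). A plain if / else if chain with an explicit is.na() check avoids this. The sketch below uses mtcars as a stand-in for the asker's data, with mpg playing the role of LPS0; those choices, and the names dat, target, others, pvals and sig, are assumptions made for illustration.

```r
# Stand-in data (assumption): mtcars, with mpg in the role of LPS0
dat <- mtcars
target <- "mpg"
others <- setdiff(names(dat), target)

# One Spearman p-value per remaining column;
# exact = FALSE avoids the warnings about ties
pvals <- vapply(others, function(nm) {
  cor.test(dat[[target]], dat[[nm]], method = "spearman", exact = FALSE)$p.value
}, numeric(1))

# Scalar if / else handles the NA case explicitly
for (nm in others) {
  p <- pvals[[nm]]
  if (is.na(p)) {
    message(nm, ": NA")
  } else if (p <= 0.05) {
    message(nm, ": p = ", signif(p, 3))
  } else {
    message(nm, ": Not Sig")
  }
}

# Or keep only the significant p-values as a named vector
sig <- pvals[!is.na(pvals) & pvals <= 0.05]
```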

Correlation matrix giving the wrong value in R

I wanted to get the correlation matrix for an actual value and a predicted value, but the results after the first column are wrong.
There are many missing values in my dataframe.
This is the code:
# Create different tables for overall (actual) and potential (predicted) values for every year
actual <- table %>%
select(starts_with('overall_'))
predicted <- table %>%
select(starts_with('potential_'))
# Create a matrix for r of every year
cormat <- round( cor(
x = actual, y = predicted,
use = "complete.obs",
method = "pearson"),
3)
cormat
However, when I calculate the correlation manually the result is different.
round(
cor(table$overall_15,
table$potential_15,
use = "complete.obs"), 3)
# Result: 0.804
Does anybody know why?
Thank you,
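One likely explanation, which cannot be confirmed without the asker's data: when cor() is given whole tables with use = "complete.obs", any row with an NA in any column is dropped from every pair, whereas the two-vector call only drops rows where one of those two columns is NA. With use = "pairwise.complete.obs" each cell of the matrix uses its own complete rows. The sketch below uses simulated data; tbl and all column values are made up for illustration.

```r
set.seed(42)
n <- 50
tbl <- data.frame(overall_15 = rnorm(n), overall_16 = rnorm(n))
tbl$potential_15 <- tbl$overall_15 + rnorm(n, sd = 0.5)
tbl$potential_16 <- tbl$overall_16 + rnorm(n, sd = 0.5)

# Missing values in a column that is NOT part of the 15-vs-15 comparison
tbl$overall_16[1:10] <- NA

actual    <- tbl[, c("overall_15", "overall_16")]
predicted <- tbl[, c("potential_15", "potential_16")]

listwise <- cor(actual, predicted, use = "complete.obs")           # drops rows 1-10 everywhere
pairwise <- cor(actual, predicted, use = "pairwise.complete.obs")  # per-pair complete rows
single   <- cor(tbl$overall_15, tbl$potential_15, use = "complete.obs")

# The pairwise cell agrees with the two-vector call; the listwise cell does not
isTRUE(all.equal(pairwise["overall_15", "potential_15"], single))
isTRUE(all.equal(listwise["overall_15", "potential_15"], single))
```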

Call sapply on vector of densities?

I'm generating a set of densities using functions like dweibull and dunif. I'd like to be able to extract their moments using sapply. When I call the mean function on individual distributions, I get the mean value as expected. But when I do this with sapply, it instead seems to call mean() on every individual value in the density (i.e. for every value of x I supplied to dweibull or dunif). Is there a way to get the correct behavior here? Thank you!
weib <- dweibull(seq(from=0, to=25, length=100), shape=1, scale=5)
unif <- dunif(seq(from=1, to=10, length=100), min=1, max=10)
mean(weib) #Works!
dists <- c(weib, unif)
means <- sapply(dists, mean) #Returns a very long list of values, not the mean of weib and unif
You should store the data in a list. c(weib, unif) creates a single combined vector, so sapply(dists, mean) takes the mean of each single number, i.e. it returns the number itself.
dists <- list(weib, unif)
means <- sapply(dists, mean)
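A quick way to see the difference is to compare the lengths of the two containers. The names flat and nested are illustrative only.

```r
weib <- dweibull(seq(from = 0, to = 25, length.out = 100), shape = 1, scale = 5)
unif <- dunif(seq(from = 1, to = 10, length.out = 100), min = 1, max = 10)

flat   <- c(weib, unif)                   # one numeric vector of 200 values
nested <- list(weib = weib, unif = unif)  # two elements, each a vector of 100

length(flat)
length(nested)

# sapply visits each element: 200 single numbers versus 2 vectors
flat_means   <- sapply(flat, mean)
nested_means <- sapply(nested, mean)
length(flat_means)
nested_means
```

Naming the list elements, as done here, also carries the names through to the result.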
You can also use a data frame:
dists <- data.frame(weib, unif)
sapply(dists, mean)
weib unif
0.04034828 0.11111111
lapply(dists,mean)
$weib
[1] 0.04034828
$unif
[1] 0.1111111
apply(dists, 2,mean)
weib unif
0.04034828 0.11111111
We may also use map_dbl from purrr:
dists <- list(weib, unif)
library(purrr)
means <- map_dbl(dists, mean)

Using Rollapply to return both the Coefficient and RSquare

I have a dataset that looks something like this:
dtset <- data.table(x=c(11:30),y=rnorm(20))
I would like to calculate the rolling regression coefficient and rsquared over the last 10 items:
dtset[,coefficient:=rollapply(1:20,width=10,FUN=function(a) {
subdtset <- dtset[a]
reg <- lm.fit(matrix(data=c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
return(coef(reg)[1])
},align="right",fill=NA)]
dtset[,rsquare:=rollapply(1:20,width=10,FUN=function(a) {
subdtset <- dtset[a]
reg <- lm.fit(matrix(data=c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
return(1 - sum((subdtset$y - reg$fitted.values)^2) / sum((subdtset$y - mean(subdtset$y, na.rm=TRUE))^2))
},align="right",fill=NA)]
The code above accomplishes this, but my dataset has millions of rows and I have multiple columns where I want to make these calculations so it is taking a very long time. I am hoping there is a way to speed things up:
Is there a better way to capture the last 10 items in rollapply rather than passing the row numbers as the variable a and then doing subdtset <- dtset[a]? I tried using .SD and .SDcols but was unable to get that to work. I can only figure out how to get rollapply to accept one column or vector as the input, not two columns/vectors.
Is there a way to return 2 values from one rollapply statement? I think I could get significant time savings if I only had to do the regression once, and then from that take the coefficient and calculate RSquare. It's pretty inefficient to do the same calculations twice.
Thanks for the help!
Use by.column = FALSE to pass both columns to the function. In the function calculate the slope and r squared directly to avoid the overhead of lm.fit. Note that rollapply can return a vector and that rollapplyr with an r on the end is right aligned. This also works if dtset consists of a single x column followed by multiple y columns as in the example below with the builtin anscombe data frame.
library(data.table)
library(zoo)
stats <- function(X, x = X[, 1], y = X[, -1]) {
c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
}
rollapplyr(dtset, 10, stats, by.column = FALSE, fill = NA)
a <- anscombe[c("x3", "y1", "y2", "y3")]
rollapplyr(a, 3, stats, by.column = FALSE, fill = NA)
Check
We check the formulas using the built-in BOD data frame.
fm <- lm(demand ~ Time, BOD)
c(coef(fm)[[2]], summary(fm)$r.squared)
## [1] 1.7214286 0.6449202
stats(BOD)
## slope rsq
## 1.7214286 0.6449202
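The same identity can be checked window by window without any packages. The sketch below is a base R stand-in for the rolling call, not a replacement for rollapplyr; the data are simulated and the names d, width, win_stats, ends and roll are made up for illustration.

```r
set.seed(1)
d <- data.frame(x = 11:30, y = rnorm(20))
width <- 10

# Slope and R-squared of a simple regression from moments only
win_stats <- function(x, y) c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)

# Right-aligned windows: one row of results per window end
ends <- width:nrow(d)
roll <- t(sapply(ends, function(e) {
  i <- (e - width + 1):e
  win_stats(d$x[i], d$y[i])
}))

# Compare the last window with an explicit lm() fit on rows 11 to 20
fm <- lm(y ~ x, data = d[11:20, ])
all.equal(unname(roll[nrow(roll), ]),
          c(coef(fm)[[2]], summary(fm)$r.squared))
```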

Remove NAs when using mapply for ttest in R

I would like to do a column-wise t-test between two data frames in R, that is, t.test(df1$col1, df2$col1), t.test(df1$col2, df2$col2) and so on. The best option here is to use the mapply or Map function. Something like:
mapply(t.test,tnav_DJF_histo.csv[,-1],tnav_DJF.csv[,-1])
works perfectly but if one of your df columns has NAs, it fails with this error:
Error in t.test.default(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
not enough 'y' observations
Question: how can I use na.rm to get the job done? For example, if a column in tnav_DJF.csv[,-1] has NAs but the corresponding column in tnav_DJF_histo.csv[,-1] has none, how can I tell mapply to ignore or skip the analyses for these columns?
Many thanks.
aez.
You can do this with mapply and an anonymous function as follows:
Example data:
df1 <- data.frame(a=runif(20), b=runif(20), c=rep(NA,20))
df2 <- data.frame(a=runif(20), b=runif(20), c=c(NA,1:18,NA))
#notice df1's third column is just NAs
Solution:
Use mapply with an anonymous function as follows:
# anonymous function testing for NAs
mapply(function(x, y) {
  if(all(is.na(x)) || all(is.na(y))) NULL else t.test(x, y, na.action=na.omit)
}, df1, df2)
Output:
$a
Welch Two Sample t-test
data: x and y
t = 1.4757, df = 37.337, p-value = 0.1484
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0543192 0.3458648
sample estimates:
mean of x mean of y
0.5217619 0.3759890
$b
Welch Two Sample t-test
data: x and y
t = 1.1689, df = 37.7, p-value = 0.2498
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0815067 0.3041051
sample estimates:
mean of x mean of y
0.5846343 0.4733351
$c
NULL
P.S. There is no na.rm argument in the t.test function. There is only an na.action argument (used by the formula interface; the default method drops NAs from x and y on its own), but even if you set it to na.omit (which I have) you will still get an error if all the column elements are NA.
P.S.2 If some of the elements of either x or y are NA then the t.test function will run properly by omitting those elements. If you want to ignore calculating the t.test if any of the columns contains even a single NA, then you need to change the all in the above function to any.
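If only the p-values are needed, a variant of the same idea returns NA for columns that cannot be tested, which lets mapply simplify the result to a named numeric vector. This is a sketch under the assumption that a column is skipped when either side has fewer than two non-missing values (the minimum the default two-sample test accepts); the name pvals is illustrative.

```r
set.seed(1)
df1 <- data.frame(a = runif(20), b = runif(20), c = rep(NA_real_, 20))
df2 <- data.frame(a = runif(20), b = runif(20), c = c(NA, 1:18, NA))

pvals <- mapply(function(x, y) {
  # Skip columns without enough observed values on either side
  if (sum(!is.na(x)) < 2 || sum(!is.na(y)) < 2) return(NA_real_)
  # t.test() drops the remaining NAs from x and y itself
  t.test(x, y)$p.value
}, df1, df2)

pvals
```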
You could do something like:
t.test2 <- function(col1, col2){
  df <- cbind(col1, col2)
  # keep only the rows where both columns are observed
  df <- df[complete.cases(df), , drop = FALSE]
  if(nrow(df) < 3){return(NA)}
  t.test(df[, 1], df[, 2])
}
mapply(t.test2, csv1[, -1], csv2[, -1])
