Correlation matrix giving the wrong value in R

I wanted to get the correlation matrix between actual and predicted values, but the results after the first column are wrong.
There are many missing values in my dataframe.
This is the code:
# Create different tables for overall (actual) and potential (predicted) values for every year
actual <- table %>%
  select(starts_with('overall_'))
predicted <- table %>%
  select(starts_with('potential_'))

# Create a matrix for r of every year
cormat <- round(
  cor(x = actual, y = predicted,
      use = "complete.obs",
      method = "pearson"),
  3)
cormat
However, when I calculate the correlation manually, the result is different.
round(
  cor(table$overall_15,
      table$potential_15,
      use = "complete.obs"),
  3)
# Result: 0.804
Does anybody know why?
Thank you!
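A likely explanation, offered as a hedged sketch rather than a confirmed answer: with use = "complete.obs", cor() first drops every row that has an NA in any of the supplied columns and only then computes all of the correlations, whereas the manual call only drops rows with an NA in overall_15 or potential_15, so it can use more rows. With many missing values the two computations therefore disagree. Pairwise deletion should reproduce the manual numbers:
# Sketch: pairwise deletion drops NAs separately for each pair of columns,
# which matches the single-pair calculation above
cormat_pairwise <- round(
  cor(x = actual, y = predicted,
      use = "pairwise.complete.obs",
      method = "pearson"),
  3)
cormat_pairwise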

Related

How to move several columns to front of dataframe in R?

I am writing a simulation where I am trying several multiple-testing methods. In my simulation, I want to vary the percentage of true null hypotheses and move the true null hypotheses to the beginning of my data frame. This is proving to be a little tricky when the number of null hypotheses is being varied.
I have looked into moving them by index, but this doesn't work in all cases (especially h0 = 0).
It looks like relocate() might do what I want, but can I use it with several columns, using only column indexes?
I am only including the "inner loop" of my simulation, where the error occurs. First you can see the levels I want to vary h0 at.
iter <- 100 # number of iterations for one data point
rho_vec <- c(0, 0.20, 0.40, 0.60, 0.80) # correlation values
h0_vec <- c(0, 0.20, 0.40, 0.60, 0.80)  # percentages of true h0

for (j in 1:iter) {
  # vector of variable means: true nulls have mean 0, false nulls have mean 1.5 (12 false h0)
  mu11 <- c(rep(0, h0*50), rep(1.5, (1-h0)*50))
  Sigma11 <- diag(k) + rho - diag(k)*rho # simple correlation matrix for the dependent variables
  corrdata1 <- mvrnorm(n, mu = mu11, Sigma = Sigma11)

  # now we simulate the uncorrelated data with (1-h0)*50 non-true null hypotheses; n and k are the same
  mu12 <- c(rep(0, h0*50), rep(1.5, (1-h0)*50))
  SigmaId <- diag(k) # correlation matrix (identity matrix) for the independent data
  indepdata1 <- mvrnorm(n, mu = mu12, Sigma = SigmaId)

  # total data matrix for both cases: 1000 observations of 100 variables
  data1 <- cbind(corrdata1, indepdata1)

  # reorder columns so the false nulls are the last columns (can check this by calling colMeans(data1))
  # data1 <- data1[, c(0:(h0*50), 51:(50+(h0*50)), (51-((1-h0)*50)):50, (101-(50*(1-h0))):100)] # I tried this version first
  data1 %>% relocate(c(0:(h0*50), 51:(50 + (h0*50))) %>% head()) # this is the relocate() approach
}
This produces an error in relocate():
"Error in UseMethod("relocate") :
no applicable method for 'relocate' applied to an object of class "c('matrix', 'double', 'numeric')"
Does anyone have any ideas on how to do this? Advice is greatly appreciated!
The error message tells you that relocate() is being applied to a matrix, which it cannot handle: relocate() must be applied to a data frame, so convert with as.data.frame() or as_tibble() beforehand, as mentioned in the comments.
Finally, you should reassign the result after using the function, otherwise it won't have any effect:
data1 <- data1 %>% as_tibble() %>% relocate(c(0:(h0*50), 51:(50 + (h0*50))) %>% head())
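As a small self-contained illustration (the toy matrix and its column names below are made up for the example, not taken from the question), converting to a tibble first lets relocate() work with numeric column positions:
library(dplyr)

m <- matrix(1:12, nrow = 3, dimnames = list(NULL, paste0("V", 1:4))) # toy 3 x 4 matrix
m %>% as_tibble() %>% relocate(3:4) # returns a tibble with columns V3 and V4 moved to the front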

Using boot::boot() function with grouped variables in R

This is a question both about using the boot() function with grouped variables and about passing multiple columns of data into boot(). Almost all examples of the boot() function seem to pass a single column of data to calculate a simple bootstrap of the mean.
My specific analysis is trying to use the stats::weighted.mean(x,w) function which takes a vector 'x' of values to calculate the mean and a second vector 'w' for weights. The main point is that I need two inputs into this function - and I'm hoping the solution will generalize to any function that takes multiple arguments.
I'm also looking for a solution to use this weighted.means function in a dplyr style workflow with group_by() variables. If the answer is that "it can't be done with dplyr", that's fine, I'm just trying to figure it out.
Below I simulate a dataset with three groups (A, B, C) that each have different ranges of counts. I also attempt to come up with a function, "my.function", that will be used to bootstrap the weighted average. Here might be my first mistake: is this how I would set up a function to pass the 'counts' and 'weights' columns of data into each bootstrapped sample? Is there some other way to index the data?
Inside the summarise() call, I reference the original data with "." - Possibly another mistake?
The end result shows that I was able to achieve appropriately grouped calculations using mean() and weighted.mean(), but the calls for confidence intervals using boot() have instead calculated the 95% confidence interval around the global mean of the dataset.
Suggestions on what I'm doing wrong? Why is the boot() function referencing the entire dataset and not the grouped subsets?
library(tidyverse)
library(boot)
set.seed(20)
sample.data = data.frame(letter = rep(c('A','B','C'), each = 50) %>% as.factor(),
                         counts = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
                         weights = sample(10, 150, replace = TRUE))

## Define function to bootstrap
## I'm using stats::weighted.mean() which needs to take in two arguments
##############
my.function = function(data, index){
  d = data[index, ] # create bootstrap sample of all columns of original data?
  return(weighted.mean(d$counts, d$weights)) # calculate weighted mean using 'counts' and 'weights' columns
}
##############

## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits
## I pass data to boot using "." thinking that this would only pass each grouped subset of data
## (e.g., only letter "A") to boot, but instead it seems to pass the entire dataset.
sample.data %>%
  group_by(letter) %>%
  summarise(avg = mean(counts),
            wtd.avg = weighted.mean(counts, weights),
            CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
            CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])
And below I've calculated a rough estimate of the 95% confidence interval around the global mean, to show that this is what was going on with boot() in my summarise() call above:
#Here is a rough 95% confidence interval estimate as +/- 1.96* Standard Error
mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))
The following base R solution solves the problem of bootstrapping by groups. Note that boot::boot() is called only once per group.
library(boot)

sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})
do.call(rbind, y)
#    wtd.avg    CI.LL    CI.UL
# A 19.49044 17.77139 21.16161
# B 50.49048 48.79029 52.55376
# C 82.36993 78.80352 87.51872
Final clean-up:
rm(sp)
A dplyr solution could be the following. It also calls map_dfr from package purrr.
library(boot)
library(dplyr)
sample.data %>%
  group_split(letter) %>%
  purrr::map_dfr(
    function(x){
      wtd.avg <- weighted.mean(x$counts, x$weights)
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      CI.LL <- basic[4]
      CI.UL <- basic[5]
      data.frame(wtd.avg, CI.LL, CI.UL)
    }
  )
#    wtd.avg    CI.LL    CI.UL
# 1 19.49044 17.77139 21.16161
# 2 50.49048 48.79029 52.55376
# 3 82.36993 78.80352 87.51872
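If you also want to keep the letter label in the output (a small extension of the above, not part of the original answer), you can build it into the data frame returned for each group:
sample.data %>%
  group_split(letter) %>%
  purrr::map_dfr(
    function(x){
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      data.frame(letter = unique(x$letter),
                 wtd.avg = weighted.mean(x$counts, x$weights),
                 CI.LL = basic[4],
                 CI.UL = basic[5])
    }
  )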

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data <- select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace = TRUE)
Random_sample_cats <- cbind(Random_sample_data, cat)

correlation <- function(c) {
  c <- cor.test(x, y, method = "kendall")
  return(c)
}

b <- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate() to do it n times. Here's a reproducible example on built-in data:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  output <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result; in this case, cor.test() returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (one column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  cor.test(newdata$wt, newdata$qsec, method = "kendall")$p.value
})
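If you want to keep more than one piece of each test (say, the Kendall tau estimate and the p-value), one option is to return a small data frame per replication and bind the rows afterwards. This is a sketch of one way to do it, not part of the original answer:
library(dplyr) # for sample_n() and bind_rows()

result <- replicate(n = 10, simplify = FALSE, expr = {
  newdata <- sample_n(mtcars, 10)
  test <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
  data.frame(tau = unname(test$estimate), p.value = test$p.value)
})
bind_rows(result) # one row per replication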

Simulation of normal distribution data contaminated with outliers

I need to simulate 1000 sets of normally distributed data (each with 60 subgroups, n = 5) using R. Each set is contaminated with 4 outliers (more than 1.5 IQR). Can anyone help?
Thanks in advance
A very simple approach to create a data.frame with a few outliers:
# Create a vector with normally distributed values and a few outliers
# N       - number of random values
# num.out - number of outliers
my.rnorm <- function(N, num.out, mean = 0, sd = 1){
  x <- rnorm(N, mean = mean, sd = sd)
  ind <- sample(1:N, num.out, replace = FALSE)
  x[ind] <- (abs(x[ind]) + 3*sd) * sign(x[ind])
  x
}

N = 60
num.out = 4

df <- data.frame(col1 = my.rnorm(N, num.out),
                 col2 = my.rnorm(N, num.out),
                 col3 = my.rnorm(N, num.out),
                 col4 = my.rnorm(N, num.out),
                 col5 = my.rnorm(N, num.out))
Please note that I used mean = 0 and sd = 1, as the values mean = 1, sd = 0 that you provided in the comments do not make much sense.
The above approach does not guarantee that there will be exactly 4 outliers. There will be at least 4, but in some rare cases there could be more, as rnorm() itself does not guarantee that it never produces outliers.
Another note is that data.frames might not be the best objects to store numeric values. If all your 1000 data.frames are numeric, it is better to store them in matrices.
Depending on the final goal and the type of object you store your data in (list, data.frame, or matrix), there are faster ways to create 1000 objects filled with random values.
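For example, one way to build all 1000 sets at once, reusing the my.rnorm() helper above (the list-of-matrices layout here is an assumption about what you need, not part of the original request):
# Sketch: a list of 1000 matrices, each 60 x 5, with 4 outliers per column
sets <- replicate(1000,
                  sapply(1:5, function(i) my.rnorm(N, num.out)),
                  simplify = FALSE)
length(sets)   # 1000
dim(sets[[1]]) # 60 5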

Confusion about calculating sample correlation in r

I have been tasked with manually calculating the sample correlation between two datasets (D$Nload and D$Pload), and then comparing the result with R's built-in cor() function.
I calculate the sample correlation with
cov(D$Nload,D$Pload, use="complete.obs")/(sd(D$Nload)*sd(D$Pload, na.rm=TRUE))
Which gives me the result 0.5693599
Then I try using R's cor() function
cor(D[, c("Nload","Pload")], use="pairwise.complete.obs")
which gives me the result:
          Nload     Pload
Nload 1.0000000 0.6244952
Pload 0.6244952 1.0000000
Which is a different result. Can anyone see where I've gone wrong?
This happens because when you call sd() on a single vector, it cannot check if the data is pairwise complete. Example:
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA
df <- data.frame(x = x, y = y)
So here we have
df[seq(2), ]
           x         y
1  1.0879645        NA
2 -0.3919369 0.2191193
We see that while the second row is pairwise complete (none of the columns used in the computation are NA), the first row is not. However, if you calculate sd() on just a single column, it doesn't have any information about the pairs. So in your case, sd(df$x) will use all the available data, even though, to match the pairwise computation, it should skip the first row.
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09301583
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
But if you remove the first row from your computation, the results are equal:
df <- df[complete.cases(df), ]
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09313766
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
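Equivalently, as long as the covariance and both standard deviations are computed on the same pairwise-complete rows, the manual formula matches cor(); for the two-column case this can be written in one line (just a rephrasing of the fix above):
with(df[complete.cases(df), ], cov(x, y) / (sd(x) * sd(y)))
[1] 0.09313766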
