R: calculate the correlation coefficient (r)

I have a data frame with three variables: "age", "confidence" and "countryname". I want to compare the correlation between age and confidence across the different countries, so I wrote the following command to calculate the correlation coefficient per country:
correlate <- evs %>% group_by(countryname) %>% summarise(c = cor(age, confidence))
But I found that there are a lot of missing values in the output column "c". I'm wondering: does that mean there is little correlation between the IV and DV for these countries, or is there something wrong with my commands?

An NA in the correlation output means that you have NA values (i.e. missing values) in your observations. The default behaviour of cor is to return a correlation of NA "whenever one of its contributing observations is NA" (from the manual).
That means that a single NA in the data will give a correlation of NA, even if it is the only NA among a thousand usable observations.
What you can do from here:
You should investigate these NAs, count them and determine whether your data set contains enough usable data. Find out which variables are affected by NAs and to what extent.
Add the argument use when calling cor. This way you specify how the algorithm should handle missing values. Check out the manual (with ?cor) to find out what options you have. In your case I would just use use="complete.obs". With only 2 variables, most (but not all) options will yield the same result.
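Applied to the grouped call from the question, that would look roughly like this (a sketch, assuming the data frame is called evs as above):
library(dplyr)
correlate <- evs %>%
  group_by(countryname) %>%
  # per-country correlation, ignoring rows where age or confidence is NA
  summarise(c = cor(age, confidence, use = "complete.obs"))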
Some more explanation:
age <- 18:35
confidence <- (age - 17) / 10 + rnorm(length(age))
cor(age, confidence)
#> [1] 0.3589942
Above is the correlation with all the data. Now let's set a few NAs and try again:
confidence[c(1, 6, 11, 16)] <- NA
cor(age, confidence) # the use argument will implicitly be "everything".
#> [1] NA
This gives NA because some confidence values are NA.
The next statement still gives a result:
cor(age, confidence, use="complete.obs")
#> [1] 0.3130549
Created on 2021-10-16 by the reprex package (v2.0.1)

I know two ways of calculating it in R:
via built-in cor() function,
manual calculation with code
Calculation with the built-in cor() function:
# importing df:
state_crime <- read.csv("~/Documents/R/state_crime.csv")
# checking colnames:
colnames(state_crime)
[1] "state" "year" "population"
[4] "murder_rate"
# correlation coefficient between population and murder rate:
cor(state_crime$population, state_crime$murder_rate,
method = "pearson")
[1] -0.0322388
Manual calculation with code:
# creating columns for "deviation from the mean" for both variables:
library(dplyr)
state_crime <- state_crime %>%
  mutate(dev_mean_murderrate = murder_rate - mean(murder_rate),
         dev_mean_population = population - mean(population))
# implementing the formula: r = sum((x - mx)*(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
sum(state_crime$dev_mean_population * state_crime$dev_mean_murderrate) /
sqrt(sum((state_crime$murder_rate - mean(state_crime$murder_rate))**2) *
sum((state_crime$population - mean(state_crime$population))**2)
)
[1] -0.0322388
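As a quick sanity check (not part of the original answer), the same formula can be wrapped in a small helper and compared against cor():
pearson_r <- function(x, y) {
  # same Pearson formula as above
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}
all.equal(pearson_r(state_crime$population, state_crime$murder_rate),
          cor(state_crime$population, state_crime$murder_rate))
# should return TRUE (up to floating-point tolerance)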

Related

Removing outliers: cannot run cor.test()

I am extracting outliers from a single column of a dataset, then attempting to run cor.test() on that column plus another column. I am getting the error: Error in cor.test.default(dep_delay_noout, distance) : 'x' and 'y' must have the same length. I assume this is because removing the outliers from one column made it a different length than the other column, but I am not sure what to do about it. I have tried mutating the dataset by adding a new column that lacked outliers, but unfortunately ran into the same problem. Does anybody know what to do? Below is my code.
dep_delay<-flights$dep_delay
dep_delay_upper<-quantile(dep_delay,0.997,na.rm=TRUE)
dep_delay_lower<-quantile(dep_delay,0.003,na.rm=TRUE)
dep_delay_out<-which(dep_delay>dep_delay_upper|dep_delay<dep_delay_lower)
dep_delay_noout<-dep_delay[-dep_delay_out]
distance<-flights$distance
cor.test(dep_delay_noout,distance)
You were almost there: in cor.test you also want to subset distance. Additionally, for the preprocessing you could use a quantile vector of length 2 and mapply to do both comparisons in one step; that is just a more concise way to write it, your original code is fine.
data('flights', package='nycflights13')
nna <- !is.na(flights$dep_delay)
(q <- quantile(flights$dep_delay[nna], c(0.003, 0.997)))
# 0.3% 99.7%
# -14 270
# logical index of the same length as flights: TRUE where dep_delay is not NA
# and lies strictly between the two quantiles
nout <- nna
nout[nna] <- rowSums(mapply(\(f, q) f(flights$dep_delay[nna], q), c(`>`, `<`), q)) == 2
with(flights, cor.test(dep_delay[nout], distance[nout]))
# Pearson's product-moment correlation
#
# data: dep_delay[nout] and distance[nout]
# t = -12.409, df = 326171, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.02515247 -0.01829207
# sample estimates:
# cor
# -0.02172252
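For completeness, the minimal change to the question's own code (a sketch, keeping the original variable names) is to drop the same positions from distance so the two vectors have equal length:
# cor.test() itself drops incomplete pairs, so the NAs still present in
# dep_delay_noout are not a problem
cor.test(dep_delay_noout, distance[-dep_delay_out])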

Handling skip in rpart and random forest

I have a dataset containing 10 categorical variables. Each of these has missing values coded as (-9, -6, -3, -2, -1). I want to create 1 column that takes the mean of these 10 variables excluding the negative values. I can collapse the negative values into NA and then median impute them but I need to retain -6 since -6 implies that the person skipped the question because it does not apply to them. For instance, parental relationship quality does not apply to single parents. I ultimately want to use this variable as a predictor in my random forest model so I am not sure how to handle -6 in this case. One way that I could think of is to impute each of the 10 variables as follows (Let's say that the 10 variables are a1 to a10):
missing_categs <- c(-9, -3, -2, -1)
df$a1[df$a1 %in% missing_categs] <- median(df$a1[!df$a1 %in% c(missing_categs, -6)], na.rm = TRUE)  # assign the median of the valid values of a1
After the above step, I calculate the average of a1 to a10. The rows that yield -6 are the ones that pertain to single parents (meaning the questions did not apply to them). Then I convert -6 to NA, so now I have average values and some NAs (a rough sketch of this step follows below). Can rpart and random forest models handle NA? Other, better alternative solutions are most welcome. Thanks in advance!
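Here is that rough sketch of the averaging/recoding step (assuming the columns are literally named a1 to a10 and that a skipped respondent has -6 in all ten variables):
vars <- paste0("a", 1:10)
df$a_mean <- rowMeans(df[vars])
df$a_mean[df$a_mean == -6] <- NA  # single parents: the question block did not apply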
Can rpart and random forest models handle NA?
I do not know what you mean by handle. If you mean that you can use NA in the predictors, then the answer is yes for rpart:
> library(rpart)
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> rpart(df, na.action=na.pass)
n= 3
node), split, n, deviance, yval
* denotes terminal node
but not for randomForest:
> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> randomForest(df, na.action=na.pass)
Error in randomForest.default(df, na.action = na.pass) :
NA not permitted in predictors
If by handle you mean that they are able to deal with NAs in some manner, for example via a function you supply, then the answer is yes for both.
rpart and randomForest both have an na.action parameter which you can use; see the documentation of ?rpart and ?randomForest for the details.
The default na.action for rpart is na.rpart, which deletes "all observations for which y is missing", while "those in which one or more predictors are missing" are kept.
The default na.action for randomForest is na.fail, which returns the given data structure unaltered if no NAs are found, and "signals an error" if at least one NA is found.
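If you go that route with randomForest, one ready-made option that ships with the randomForest package is na.roughfix, which imputes numeric columns with the column median and factors with the most frequent level. A minimal sketch with toy data (not the poster's real variables):
library(randomForest)
set.seed(1)
# toy data: one predictor containing an NA, one numeric response
dat <- data.frame(a1 = c(1, 2, NA, 4, 5, 6, 7, 8),
                  y  = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2))
# na.roughfix fills the NA (here: the median of a1) before the forest is grown
rf <- randomForest(y ~ a1, data = dat, na.action = na.roughfix)
rf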

Different output for PR AUC from different R packages

I get different numeric values for the Area Under the Precision-Recall Curve (PRAUC) on the dataset I am working with, depending on which of two R packages I use for the computation: yardstick or caret.
I am afraid I was not able to reproduce this mismatch with synthetic data, but only with my dataset (which is strange as well).
In order to make this reproducible, I am sharing the prediction output of my model, you can download it here https://drive.google.com/open?id=1LuCcEw-RNRcdz6cg0X5bIEblatxH4Rdz (don't worry, it's a small csv).
The csv contains a dataframe with 4 columns:
yes: probability estimate of being in class yes
no: equal to 1 - yes
obs: actual class label
pred: predicted class label (with a .5 threshold)
Here is the code that produces the two values of PRAUC:
require(data.table)
require(yardstick)
require(caret)
pr <- fread('pred_sample.csv')
# transform to factors
# put the positive class in the first level
pr[, obs := factor(obs, levels = c('yes', 'no'))]
pr[, pred := factor(pred, levels = c('yes', 'no'))] # this is actually not needed
# compute yardstick PRAUC
pr_auc(pr, obs, yes) # 0.315
# compute caret PRAUC
prSummary(pr, lev = c('yes', 'no')) # 0.2373
I could understand a small difference due to the approximation when computing the area (interpolating the curve), but this seems way too large.
I even tried a third package, PRROC, and the result is still different, namely around .26.
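As far as I can tell, caret's prSummary relies on the MLmetrics package for its AUC value, so a direct MLmetrics call on the same predictions (a sketch, using the pr object loaded above) should show whether caret's 0.2373 is simply MLmetrics' interpolation:
library(MLmetrics)
PRAUC(y_pred = pr$yes, y_true = as.integer(pr$obs == 'yes'))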

Clustering leads to very concentrated clusters

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster them based on how frequently each pair of items is ordered together. For each pair of items I counted the number of times they occur within the same order; the highest count between two items was 489. I then "calculated" the similarity/correlation as: frequency / "max frequency of all pairs" (489). Now I have the dataset that I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried something called "Jaccard's coefficient/index", but got almost the same results.
The dataset: The dataset contains material numbers V1 and V2, and N is the similarity/correlation between the two material numbers, between 0 and 1.
With help from someone else, I managed to create a distance matrix and use PAM clustering.
Why PAM clustering? A data scientist suggested this: you have more than 95% of pairs without information, which puts all these materials at the same distance and produces a single, very dispersed cluster. This problem can be solved using a PAM algorithm, but you will still have a very concentrated group. Another solution is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think for clustering I need the full 696x696 matrix, even though a lot of the entries are zeros. But I'm not sure.
Problem 2: Clustering does not do very well. I get very concentrated clusters. A lot of items are clustered in the first cluster. Also, according to how you verify PAM clusters, my clustering results are poor. Is it due to the similarity analysis? What else should I use? Is it due to the 95% of data being zeros? Should I change the zeros to something else?
The whole code and results:
# Suppose X is the dataset
library(data.table)
library(cluster)  # for pam()
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1]
ss <- as.matrix(ss)  # work with a plain matrix from here on
ss <- ss / max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
I'm not sure where you get the 696 number from. After the rbind you have a dataframe with 567 unique values for V1 and V2, and after the dcast you end up with a 567 x 567 matrix, as expected. Clustering-wise I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
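If you want a quick numeric check of the cluster quality beyond the cluster sizes, the pam object already carries silhouette information (a sketch, reusing pam2 from the question):
# average silhouette width overall and per cluster:
# values near 1 = well separated, near 0 = overlapping clusters
pam2$silinfo$avg.width
pam2$silinfo$clus.avg.widths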
@Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis, then from an academic viewpoint your current justification for why PAM is required does not hold much merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see why PAM is applicable here (like I mentioned in the comments, PAM works best for mixed variables).
Continuing further, here is an approach to detecting highly correlated variables in R:
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor<- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
dim(my_data)
[1] 32  4
Hope this helps.

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect in plants yields (and the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used in the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted as it was a wrong solution. Instead one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p-value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a (biased) estimate of the confidence interval, but the significance testing of the difference should be done by permutation over the complete dataset, as sketched below.
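A rough sketch of such a permutation test (my own illustration, not the original answer's code): reshuffle the group labels over the pooled data and recompute the difference in means each time.
set.seed(1)
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)
obs_diff <- mean(y) - mean(x)
perm_diff <- replicate(10000, {
  idx <- sample(length(total), length(y))  # positions that get the "y" label this round
  mean(total[idx]) - mean(total[-idx])
})
# one-sided p-value: how often a random relabelling beats the observed difference
mean(perm_diff >= obs_diff)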
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
# the statistic passed to boot() must accept the data and an index vector
boot_sum <- function(d, i) sum(d[i])
b_x <- boot(x, boot_sum, R = 10000)
b_y <- boot(y, boot_sum, R = 10000)
z <- (b_x$t0 - b_y$t0) / sqrt(var(b_x$t[, 1]) + var(b_y$t[, 1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
