Removing outliers then cannot run cor.test() - r

I am removing outliers from a single column of a dataset and then attempting to run cor.test() on that column plus another column. I am getting the error: Error in cor.test.default(dep_delay_noout, distance) : 'x' and 'y' must have the same length. I assume this is because removing the outliers from one column made it a different-length vector than the other column, but I am not sure what to do about it. I have tried mutating the dataset by adding a new column that lacked outliers, but unfortunately ran into the same problem. Does anybody know what to do? Below is my code.
dep_delay<-flights$dep_delay
dep_delay_upper<-quantile(dep_delay,0.997,na.rm=TRUE)
dep_delay_lower<-quantile(dep_delay,0.003,na.rm=TRUE)
dep_delay_out<-which(dep_delay>dep_delay_upper|dep_delay<dep_delay_lower)
dep_delay_noout<-dep_delay[-dep_delay_out]
distance<-flights$distance
cor.test(dep_delay_noout,distance)

You were almost there: in cor.test you also want to subset distance. Additionally, for the preprocessing you could use a quantile vector of length 2 and mapply to do the comparison in one step, just to make it more concise; your original code is fine.
data('flights', package='nycflights13')
nna <- !is.na(flights$dep_delay)
(q <- quantile(flights$dep_delay[nna], c(0.003, 0.997)))
# 0.3% 99.7%
# -14 270
# keep rows where dep_delay is present and falls within both quantile bounds
no_out <- nna & (rowSums(mapply(\(f, q) f(flights$dep_delay, q), c(`>`, `<`), q)) == 2)
with(flights, cor.test(dep_delay[no_out], distance[no_out]))
# Pearson's product-moment correlation
#
# data: dep_delay[no_out] and distance[no_out]
# t = -12.409, df = 326171, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.02515247 -0.01829207
# sample estimates:
# cor
# -0.02172252
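For completeness, the smallest change to the original code is simply to drop the same outlier rows from both vectors; a minimal sketch reusing the question's own variables (cor.test removes any remaining NA pairs itself):
dep_delay <- flights$dep_delay
distance <- flights$distance
q <- quantile(dep_delay, c(0.003, 0.997), na.rm = TRUE)
dep_delay_out <- which(dep_delay > q[2] | dep_delay < q[1])
cor.test(dep_delay[-dep_delay_out], distance[-dep_delay_out])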

Related

R calculate the correlation coefficient

I have a data frame with 3 variables "age", "confidence" and "countryname". I want to compare the correlation between age and confidence in different countries. So I wrote the following commands to calculate the correlation coefficient.
correlate <- evs %>% group_by(countryname) %>% summarise(c = cor(age, confidence))
But I found that there are a lot of missing values in the output "c". I'm wondering: does that mean there is little correlation between the IV and DV for these countries, or is there something wrong with my commands?
An NA in the correlation matrix means that you have NA values (i.e. missing values) in your observations. The default behaviour of cor is to return a correlation of NA "whenever one of its contributing observations is NA" (from the manual).
That means that a single NA in the data will give a correlation of NA, even when you have only one NA among a thousand useful data points.
What you can do from here:
You should investigate these NAs, count them and determine whether your data set contains enough usable data. Find out which variables are affected by NAs and to what extent.
Add the argument use when calling cor. This way you specify how the algorithm should handle missing values. Check out the manual (with ?cor) to find out what options you have. In your case I would just use use="complete.obs". With only 2 variables, most (but not all) options will yield the same result.
Some more explanation:
age <- 18:35
confidence <- (age - 17) / 10 + rnorm(length(age))
cor(age, confidence)
#> [1] 0.3589942
Above is the correlation with all the data. Now lets set a few NAs and try again:
confidence[c(1, 6, 11, 16)] <- NA
cor(age, confidence) # use argument will implicitly be "everything".
#> [1] NA
This gives NA because some confidence values are NA.
The next statement still gives a result:
cor(age, confidence, use="complete.obs")
#> [1] 0.3130549
Created on 2021-10-16 by the reprex package (v2.0.1)
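Applied back to the grouped pipeline from the question, that suggestion would look roughly like this (a sketch, assuming evs has numeric age and confidence columns):
library(dplyr)
correlate <- evs %>%
  group_by(countryname) %>%
  summarise(c = cor(age, confidence, use = "complete.obs"))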
I know two ways of calculating it in R:
via built-in cor() function,
manual calculation with code
Calculation with the built-in cor() function:
# importing df:
state_crime <- read.csv("~/Documents/R/state_crime.csv")
# checking colnames:
colnames(state_crime)
[1] "state" "year" "population"
[4] "murder_rate"
# correlation coefficient between population and murder rate:
cor(state_crime$population, state_crime$murder_rate,
method = "pearson")
[1] -0.0322388
Manual calculation with code:
# creating columns for "deviation from the mean" for both variables:
library(dplyr)
state_crime <- state_crime %>%
  mutate(dev_mean_murderrate = murder_rate - mean(murder_rate)) %>%
  mutate(dev_mean_population = population - mean(population)) %>%
  data.frame()
# implementing the formula: r = sum((x - mx) * (y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
sum(state_crime$dev_mean_population * state_crime$dev_mean_murderrate) /
sqrt(sum((state_crime$murder_rate - mean(state_crime$murder_rate))**2) *
sum((state_crime$population - mean(state_crime$population))**2)
)
[1] -0.0322388
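If you also want a p-value and confidence interval rather than just the coefficient, the same pair of columns can be passed to cor.test (a sketch on the data frame above):
cor.test(state_crime$population, state_crime$murder_rate, method = "pearson")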

Pearson coefficient per rows on large matrices

I'm currently working with a large matrix (4 cols and around 8000 rows).
I want to perform a correlation analysis using Pearson's correlation coefficient between the different rows composing this matrix.
I would like to proceed the following way:
Find Pearson's correlation coefficient between row 1 and row 2. Then between rows 1 and 3... and so on with the rest of the rows.
Then find Pearson's correlation coefficient between row 2 and row 3. Then between rows 2 and 4... and so on with the rest of the rows. Note I won't find the coefficient with row 1 again...
For coefficients higher than 0.7 or lower than -0.7, I would like to list in a separate file the row names corresponding to those coefficients, plus the coefficient. E.g.:
row 230 - row 5812 - 0.76
I wrote the following code for this aim. Unfortunately, it takes far too long to run (I estimated almost a week).
for (i in 1:7999) {
  print("Analyzing row:")
  print(i)
  for (j in (i+1):8000) {
    value <- cor(alpha1k[i,], alpha1k[j,], use = "everything", method = "pearson")
    if (value > 0.7 | value < (-0.7)) {
      aristi <- c(row.names(alpha1k)[i], row.names(alpha1k)[j], value)
      arist1p <- rbind(arist1p, aristi)
    }
  }
}
Then my question is whether there's any way I could do this faster. I read about making these calculations in parallel but I have no clue how to make that work. I hope I made myself clear enough, thank you in advance!
As Roland pointed out, you can use the matrix version of cor to simplify your task. Just transpose your matrix to get a "row" comparison.
mydf <- data.frame(a = c(1,2,3,1,2,3,1,2,3,4), b = rep(5,2,10), c = c(1:10))
cor_mat <- cor(t(mydf)) # correlation of your transposed matrix
idx <- which((abs(cor_mat) > 0.7), arr.ind = T) # get relevant indexes in a matrix form
cbind(idx, cor_mat[idx]) # combine coordinates and the correlation
Note that the parameters use = "everything" and method = "pearson" are used by default for correlation. There is no need to specify them.
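One caveat: which() here also picks up the diagonal (every row correlates perfectly with itself) and each pair twice. A small sketch of one way to keep each pair only once and attach the row names the question asks for, building on the cor_mat above:
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA   # mask the diagonal and duplicate pairs
idx <- which(abs(cor_mat) > 0.7, arr.ind = TRUE) # which() skips the NA entries
data.frame(row1 = rownames(cor_mat)[idx[, 1]],
           row2 = colnames(cor_mat)[idx[, 2]],
           cor = cor_mat[idx])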

Grabbing certain results out of multiple t.test outputs to create a table

I have run 48 t-tests (coded by hand instead of writing a loop) and would like to splice out certain results of those t.tests to create a table of the things I'm most interested in.
Specifically, I would like to keep only the p-value, confidence interval, and the mean of x and mean of y for each of these 48 tests and then build a table of the results.
Is there an elegant, quick way to do this beyond the top answer detailed here, wherein I would go in for all 48 tests and grab all three desired outputs with something along the lines of ttest$p.value? Perhaps a loop?
Below is a sample of the coded input for one t-test, followed by the output delivered by R.
# t.test comparing means of Change_Unemp for 2005 government employment (ix)
lowgov6 <- met_res[met_res$Gov_Emp_2005 <= 93310, "Change_Unemp"]
highgov6 <- met_res[met_res$Gov_Emp_2005 > 93310, "Change_Unemp"]
t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
Welch Two Sample t-test
data: lowgov6 and highgov6
t = 1.5896, df = 78.978, p-value = 0.1159
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1813909 1.6198399
sample estimates:
mean of x mean of y
4.761224 4.042000
Save all of your t-tests into a list:
tests <- list()
tests[[1]] <- t.test(lowgov6,highgov6,pool.sd=FALSE,na.rm=TRUE)
# repeat for all tests
# there are probably faster ways than doing all of that by hand
# extract your values using `sapply`
sapply(tests, function(x) {
  c(x$estimate[1],
    x$estimate[2],
    ci.lower = x$conf.int[1],
    ci.upper = x$conf.int[2],
    p.value = x$p.value)
})
The output is something like the following:
[,1] [,2]
mean of x 0.12095949 0.03029474
mean of y -0.05337072 0.07226999
ci.lower -0.11448679 -0.31771191
ci.upper 0.46314721 0.23376141
p.value 0.23534905 0.76434012
But will have 48 columns. You can t() the result if you'd like it transposed.
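On the "perhaps a loop?" part: if each of the 48 tests splits Change_Unemp at a cutoff on a different column, the list can be built programmatically instead of by hand. A hypothetical sketch, assuming cutoffs is a named vector whose names are columns of met_res:
cutoffs <- c(Gov_Emp_2005 = 93310)   # extend with the other 47 variables and their cutoffs
tests <- lapply(names(cutoffs), function(v) {
  lo <- met_res[met_res[[v]] <= cutoffs[[v]], "Change_Unemp"]
  hi <- met_res[met_res[[v]] >  cutoffs[[v]], "Change_Unemp"]
  t.test(lo, hi)
})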

Splitting a large vector into intervals in R [duplicate]

I'm not too good with R. I ran a loop and I have this huge resulting vector of 11,303,044 rows. I have another vector resulting from another loop with 1681 rows.
I wish to run a chisq.test to compare their distributions, but since they are of different lengths, it's not working.
I tried taking 1681-sized samples from the 11,303,044-sized vector to match the length of the 2nd vector, but I get different chisq.test results every time I run it.
I'm thinking of splitting the 2 vectors into an equal number of intervals.
Let's say
vector1:
temp.mat<-matrix((rnorm(11303044))^2, ncol=1)
head(temp.mat)
dim(temp.mat)
vector2:
temp.mat<-matrix((rnorm(1681))^2, ncol=1)
head(temp.mat)
dim(temp.mat)
How do I split them in equal intervals to result in same lengths vectors?
mat1<-matrix((rnorm(1130300))^2, ncol=1) # only one-tenth the size of your vector
smat=sample(mat1, 100000) #and take only one-tenth of that
mat2<-matrix((rnorm(1681))^2, ncol=1)
qqplot(smat,mat2) #and repeat the sampling a few times
What you see seems interesting from a statistical point of view. At the higher levels of departure from the mean, the large sample always departs from a good fit, which is not surprising because it has a higher number of really extreme values.
chisq.test is Pearson's chi-square test. It is designed for discrete data; with two input vectors it will coerce the inputs you pass in to factors, and it tests for independence, not equality in distribution. This means, for example, that the order of the data will make a difference.
> set.seed(123)
> x<-sample(5,10,T)
> y<-sample(5,10,T)
> chisq.test(x,y)
Pearson's Chi-squared test
data: x and y
X-squared = 18.3333, df = 16, p-value = 0.3047
Warning message:
In chisq.test(x, y) : Chi-squared approximation may be incorrect
> chisq.test(x,y[10:1])
Pearson's Chi-squared test
data: x and y[10:1]
X-squared = 16.5278, df = 16, p-value = 0.4168
Warning message:
In chisq.test(x, y[10:1]) : Chi-squared approximation may be incorrect
So I don't think that chisq.test is what you want, because it does not compare distributions. Maybe try something like ks.test, which will work with different length vectors and continuous data.
> set.seed(123)
> x<-rnorm(2000)^2
> y<-rnorm(100000)^2
> ks.test(x,y)
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.0139, p-value = 0.8425
alternative hypothesis: two-sided
> ks.test(sqrt(x),y)
Two-sample Kolmogorov-Smirnov test
data: sqrt(x) and y
D = 0.1847, p-value < 2.2e-16
alternative hypothesis: two-sided

What is the ggplot2/plyr way to calculate statistical tests between two subgroups?

I am a rather novice user of R and have come to appreciate the elegance of ggplot2 and plyr. Right now, I am trying to analyze a large dataset that I cannot share here, but I have reconstructed my problem with the diamonds dataset (shortened for convenience).
Without further ado:
diam <- diamonds[diamonds$cut=="Fair"|diamonds$cut=="Ideal",]
boxplots <- ggplot(diam, aes(x=cut, price)) + geom_boxplot(aes(fill=cut)) + facet_wrap(~ color)
print(boxplots)
What the plot produces is a set of boxplots, comparing the price of the two cuts "Fair" and "Ideal".
I would now very much like to proceed by statistically comparing the two cuts for each color subgroup (D,E,F,..,J) using either t.test or wilcox.test.
How would I implement this in a way that is as elegant as the ggplot2 syntax? I assume I would use ddply from the plyr package, but I couldn't figure out how to feed two subgroups into a function that calculates the appropriate statistics.
I think you're looking for:
library(plyr)
ddply(diam, "color",
      function(x) {
        w <- wilcox.test(price ~ cut, data = x)
        with(w, data.frame(statistic, p.value))
      })
(Substituting t.test for wilcox.test seems to work fine too.)
results:
color statistic p.value
1 D 339753.5 4.232833e-24
2 E 591104.5 6.789386e-19
3 F 731767.5 2.955504e-11
4 G 950008.0 1.176953e-12
5 H 611157.5 2.055857e-17
6 I 213019.0 3.299365e-04
7 J 56870.0 2.364026e-01
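For comparison, the same per-color test can be written in base R with split() and lapply(), without plyr (a sketch using diam as constructed in the question):
res <- lapply(split(diam, diam$color), function(x) wilcox.test(price ~ cut, data = x))
data.frame(color = names(res),
           statistic = sapply(res, function(w) unname(w$statistic)),
           p.value = sapply(res, function(w) w$p.value))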
ddply returns a data frame as output and, assuming that I am reading your question properly, that isn't what you are looking for. I believe you would like to conduct a series of t-tests using a series of subsets of data, so the only real task is compiling a list of those subsets. Once you have them you can use a function like lapply() to run a t-test for each subset in your list. I am sure this isn't the most elegant solution, but one approach would be to create a list of unique pairs of your colors using a function like this:
get.pairs <- function(v){
  l <- length(v)
  n <- sum(1:l-1)
  a <- vector("list", n)
  j <- 1
  k <- 2
  for(i in 1:n){
    a[[i]] <- c(v[j], v[k])
    if(k < l){
      k <- k + 1
    } else {
      j <- j + 1
      k <- j + 1
    }
  }
  return(a)
}
Now you can use that function to get your list of unique pairs of colors:
> (color.pairs <- get.pairs(levels(diam$color)))
[[1]]
[1] "D" "E"
[[2]]
[1] "D" "F"
...
[[21]]
[1] "I" "J"
Now you can use each of these lists to run a t.test (or whatever you would like) on your subset of your data frame, like so:
> t.test(price~cut,data=diam[diam$color %in% color.pairs[[1]],])
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
Now use lapply() to run your test for each subset in your list of color pairs:
> lapply(color.pairs,function(x) t.test(price~cut,data=diam[diam$color %in% x,]))
[[1]]
Welch Two Sample t-test
data: price by cut
t = 8.1594, df = 427.272, p-value = 3.801e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1008.014 1647.768
sample estimates:
mean in group Fair mean in group Ideal
3938.711 2610.820
...
[[21]]
Welch Two Sample t-test
data: price by cut
t = 0.8813, df = 375.996, p-value = 0.3787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-260.0170 682.3882
sample estimates:
mean in group Fair mean in group Ideal
4802.912 4591.726
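If you then want the 21 results in a single table rather than a long list, the sapply extraction shown in the earlier t-test answer carries over; a sketch:
res <- lapply(color.pairs, function(x) t.test(price ~ cut, data = diam[diam$color %in% x, ]))
names(res) <- sapply(color.pairs, paste, collapse = "-")
t(sapply(res, function(x) c(x$estimate, p.value = x$p.value)))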
