correlation failure - Pearson - r

I want to write to datafile information about correlation as follows:
*korelacja=cor(p2,d2,method="pearson",use = "complete.obs")
korelacja2=cor(p2,d2,method="kendall",use = "complete.obs")
korelacja3=cor(p2,d2,method="spearman",use = "complete.obs")
dane=paste(korelacja,korelacja2,korelacja3,sep=';')
write(dane,file=nazwa,append=TRUE)*
Results are strange for me - Pearson correlation is very high (always equal one), but Kendall and Spearman is very low. I create scatterplots and I don't see linear correlation.

It's not hard to replicate this pattern if you have some large outliers in your data that dominate the Pearson correlation but are relatively insignificant in the non-parametric (Kendall/Spearman) approaches. For example, here's a concocted data set with nothing going on except for one large outlier:
> set.seed(1001)
> x <- c(runif(1000),1e5)
> y <- c(runif(1000),1e5)
> cor(x,y,method="pearson")
[1] 1
> cor(x,y,method="kendall")
[1] -0.02216583
> cor(x,y,method="spearman")
[1] -0.03335352
This is consistent with your description so far, although you ought in this case to be able to see the outliers in your scatterplots ...

Related

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
library(bbmle)
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the data); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(FAdist)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Add p-values in corrplot matrix

I calculated the Spearman correlation between two matrices and I'm plotting the r values using corrplot. How can I plot only the significant correlations (so only those correlations having p value lower than 0.00 and delete those having higher p value, even if are strong correlations - high value of r). I generated the correlation matrix using corr.test in psych package, so I already have the p values in cor.matrix$p
This is the code I'm using:
library(corrplot)
library(psych)
corr.test(mydata_t1, mydata_t2, method="spearman")
M <- corrplot(cor.matrix$r, method="square",type="lower",col=col1(100),is.corr=T,mar=c(1,1,1,1),tl.cex=0.5)
How can I modify it to plot only significant corelations?
Take a look at the examples of corrplot. do ?corrplot. It has options for doing what you want.
You can plot the p-values on the graph itself, which I think is better than putting stars,
as people not familiar with that terminology have one more thing to look up.
to put p-values on graph do this corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "p-value") where cor.matrix is object holding the result of cor.test.
The insig option can put:
p-values (as shown above)
blank out insignificant correlations with corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "blank")`
Cross out (put a X on) insignificant correlations) with option corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "pch") (DEFAULT)
Do nothing with to the plot, with corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "n")
If you do want stars, p-value on the correlation matrix plot - take a look at this thread Correlation Corrplot Configuration
Though I have to say I really like #sven hohenstein's elegant subset solution.
Create a copy of cor.mat and replace the corresponding correlation coefficients with zero:
cor.matrix2 <- cor.matrix
# find cells with p-values > 0.05 and replace corresponding
# correlations coefficients with zero
cor.matrix2$r[cor.matrix2$p > 0.05] <- 0
# use this matrix for corrplot
M <- corrplot(cor.matrix2$r, method="square",type="lower",col=col1(100),
is.corr=T,mar=c(1,1,1,1),tl.cex=0.5)
The replaced values will appear as a white cell.
What you are asking is similar to what subset does:
Return subsets of vectors, matrices or data frames which meet
conditions.
So you can do:
cor.matrix <- subset(cor.matrix, p<0.00)
P <- corrplot(cor.matrix$r, method="square",type="lower",col=col1(100),is.corr=T,mar=c(1,1,1,1),tl.cex=0.5)

Splitting a large vector into intervals in R [duplicate]

This question already has answers here:
Split a vector into chunks
(22 answers)
Closed 9 years ago.
I'm not too good with R. I ran this loop and I have this huge resulting vector of 11,303,044 rows. I have another vector resulting from another loop with dimensions 1681 rows.
I wish to run a chisq.test to compare their distributions. but since they are of different length, it's not working.
I tried taking 1681-sized samples from the 11,303,044-sized vector to match the size length of the 2nd vector but I get different chisq.test results every time I run it.
I'm thinking splitting the 2 vectors into equal number of intervals.
Let's say
vector1:
temp.mat<-matrix((rnorm(11303044))^2, ncol=1)
head(temp.mat)
dim(temp.mat)
vector2:
temp.mat<-matrix((rnorm(1681))^2, ncol=1)
head(temp.mat)
dim(temp.mat)
How do I split them in equal intervals to result in same lengths vectors?
mat1<-matrix((rnorm(1130300))^2, ncol=1) # only one-tenth the size of your vector
smat=sample(mat1, 100000) #and take only one-tenth of that
mat2<-matrix((rnorm(1681))^2, ncol=1)
qqplot(smat,mat2) #and repeat the sampling a few times
What you see seems interesting from a statistical point of view. At the higher levels of "departure from the mean" the large sample is always departing from a "good fit" not surprisingly because it has a higher number of really extreme values.
chisq.test is Pearson's chi-square test. It is designed for discrete data, and with two input vectors, it will coerce the inputs you pass in to factors, and it tests for independence, not equality in distribution. This means, for example, that the order of the data will make a difference.
> set.seed(123)
> x<-sample(5,10,T)
> y<-sample(5,10,T)
> chisq.test(x,y)
Pearson's Chi-squared test
data: x and y
X-squared = 18.3333, df = 16, p-value = 0.3047
Warning message:
In chisq.test(x, y) : Chi-squared approximation may be incorrect
> chisq.test(x,y[10:1])
Pearson's Chi-squared test
data: x and y[10:1]
X-squared = 16.5278, df = 16, p-value = 0.4168
Warning message:
In chisq.test(x, y[10:1]) : Chi-squared approximation may be incorrect
So I don't think that chisq.test is what you want, because it does not compare distributions. Maybe try something like ks.test, which will work with different length vectors and continuous data.
> set.seed(123)
> x<-rnorm(2000)^2
> y<-rnorm(100000)^2
> ks.test(x,y)
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.0139, p-value = 0.8425
alternative hypothesis: two-sided
> ks.test(sqrt(x),y)
Two-sample Kolmogorov-Smirnov test
data: sqrt(x) and y
D = 0.1847, p-value < 2.2e-16
alternative hypothesis: two-sided

Confidence interval for Weibull distribution

I have wind data that I'm using to perform extreme value analysis (calculate return levels). I'm using R with packages 'evd', 'extRemes' and 'ismev'.
I'm fitting GEV, Gumbel and Weibull distributions, in order to estimate the return levels (RL) for some period T.
For the GEV and Gumbel cases, I can get RL's and Confidence Intervals using the extRemes::return.level() function.
Some code:
require(ismev)
require(MASS)
data(wind)
x = wind[, 2]
rperiod = 10
fit <- fitdistr(x, 'weibull')
s <- fit$estimate['shape']
b <- fit$estimate['scale']
rlevel <- qweibull(1 - 1/rperiod, shape = s, scale = b)
## CI around rlevel
## ci.rlevel = ??
But for the Weibull case, I need some help to generate the CI's.
I suspect the excruciatingly correct answer will be that the joint confidence region is an ellipse or some bent-sausage shape but you can extract variance estimates for the parameters from the fit object with the vcov function and then build standard errors for which +/- 1.96 SE's should be informative:
> sqrt(vcov(fit)["shape", "shape"])
[1] 0.691422
> sqrt(vcov(fit)["scale", "scale"])
[1] 1.371256
> s +c(-1,1)*sqrt(vcov(fit)["shape", "shape"])
[1] 6.162104 7.544948
> b +c(-1,1)*sqrt(vcov(fit)["scale", "scale"])
[1] 54.46597 57.20848
The usual way to calculate a CI for a single parameter is to assume Normal distribution and use theta+/- 1.96*SE(theta). In this case, you have two parameters so doing that with both of them would give you a "box", the 2D analog of an interval. The truly correct answer would be something more complex in the 'scale'-by-'shape' parameter space and might be most easily achieved with simulation methods, unless you have a better grasp of theory than I have.

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect in plants yields (and the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used in the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you have two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion then the plots would be statified samples and you'd want to treat them as such. But that isn't the case here.

Resources