Positive Spearman's r value with negative abline - r

I'm looking over some data I collected this weekend, and one of the Spearman correlation tests gave me a positive value, but when I add an abline to the plot, it slopes downward. I'm curious how this is possible.
Here are the data for Study:
Year Matches Total
1 1958 2 7
2 1959 2 14
3 1960 5 9
4 1961 2 20
5 1962 4 27
6 1963 5 20
7 1964 5 25
8 1965 5 20
9 1966 3 18
10 1967 5 28
11 1968 6 26
12 1969 4 24
13 1970 6 22
14 1971 7 32
And here's the procedure I used:
Results <- (Study$Matches / Study$Total) * 100
Year <- Study$Year
plot(Year, Results, main = "MAIN")
fit1 <- lm(Results ~ Year, data = Study)
abline(fit1, lty = "dashed")
cor.test(Year, Results, method = "s")

The fit produces a linear model. Pearson's correlation measures linear association, and here it is negative. Spearman's correlation is based on ranks and measures monotonic association, so the two can disagree in sign.
> cor.test(Year,Results,method="spearman")
Spearman's rank correlation rho
data: Year and Results
S = 438.9647, p-value = 0.9048
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.03524238
Warning message:
In cor.test.default(Year, Results, method = "spearman") :
Cannot compute exact p-value with ties
> cor(Year,Results,method="spearman")
[1] 0.03524238
> cor(Year,Results,method="pearson")
[1] -0.17501
The abline correctly plots the negative linear (Pearson) trend; it just happens that the Spearman correlation is slightly positive. Unintuitive things can happen when correlations are this close to zero, and the high p-value is a clue as well.
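To see how this can happen, here is a small made-up example (not the data above) in which one extreme value drags the linear fit downward while the ranks still trend gently upward:
x <- 1:6
y <- c(1, 2, 3, 4, 5, -100)     # one extreme value
cor(x, y, method = "pearson")   # negative (about -0.63): the outlier dominates
cor(x, y, method = "spearman")  # positive (about 0.14): most ranks increase
plot(x, y)
abline(lm(y ~ x), lty = "dashed")  # the fitted line slopes downward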

Related

Using a loop to run Linear Models over multiple populations, exporting AIC values and coefficients

I have data that includes multiple populations (each of which contains multiple individuals). I am trying to evaluate resource selection by using linear models to evaluate use. I want to do this at the population level, so I am hoping to use a loop that runs 6 models on each population and then produces an AIC table for each population as well as the coefficients of the best-fitting model.
Here is a sample of my data:
Population WLH_ID Used Var1 Var2 Var3 Var4 Var5
Tweed 1 1 15 2 10 21 22.1
Tweed 2 1 7 3 9 20 20
Lake 3 1 11 2 7 19 20
Lake 4 1 13 2 8 21 20
Hwy 5 1 14 1 6 12 23
Hwy 6 1 10 2 7 17 20
Jasper 7 1 12 4 7 19 22
Tweed 1 0 15 2 10 21 22.1
Tweed 2 0 7 3 9 20 20
Lake 3 0 11 2 7 19 20
Lake 4 0 11 2 8 21 20
Hwy 5 0 12 1 5 23 23
Hwy 6 0 14 7 7 17 20
Jasper 7 0 17 2 4 19 21.5
So far I have tried the following:
Model1 <- as.formula(Used ~ Var1 + Var2 + Var3 + Var4 + Var5 + (1 | WLH_ID))
Model2 <- as.formula(Used ~ Var1 + Var2 + Var3 + Var4 + (1 | WLH_ID))
Model3 <- as.formula(Used ~ Var1 + Var2 + Var3 + (1 | WLH_ID))
Model4 <- as.formula(Used ~ Var1 + Var2 + (1 | WLH_ID))
Model5 <- as.formula(Used ~ Var1 + (1 | WLH_ID))
Model6 <- as.formula(Used ~ 1)
### It will use the model formulas entered above, so make sure you have run those
SM.split <- split(mydata, mydata$Population)  # split the data into a list with one entry per population
Cand.models <- list()
Modnames <- list()
AICTable <- list()
for (i in 1:length(SM.split)) {
  poprun <- SM.split[[i]]
  Cand.models[[1]] <- glmer(Model1, family = binomial, data = poprun)
  Cand.models[[2]] <- glmer(Model2, family = binomial, data = poprun)
  Cand.models[[3]] <- glmer(Model3, family = binomial, data = poprun)
  Cand.models[[4]] <- glmer(Model4, family = binomial, data = poprun)
  Cand.models[[5]] <- glmer(Model5, family = binomial, data = poprun)
  Cand.models[[6]] <- glmer(Model6, family = binomial, data = poprun)
  Modnames[[i]] <- paste("mod", 1:length(Cand.models), sep = " ")
  AICTable[[i]] <- aictab(cand.set = Cand.models, modnames = Modnames[[i]], sort = TRUE)
}
I receive the following error:
Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev, :
pwrssUpdate did not converge in (maxit) iterations
Alternatively, what I am really trying to do is split the data, have R loop through each population separately, and then tell me, for that population, which model has the lowest AIC and what the coefficients of that model are. In reality I have 30+ populations and am comparing 6 models, so I am hoping not to have to write each one out.
NOTE: I do not have a good way to make up data to attach, and cannot attach my own.
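A rough, untested sketch of the kind of loop I am imagining (assuming lme4 and AICcmodavg are loaded; all.models is just a placeholder name, and the null model is given a random intercept here so that glmer() will accept it):
formulas <- list(Model1, Model2, Model3, Model4, Model5,
                 as.formula(Used ~ 1 + (1 | WLH_ID)))  # assumed null model
modnames <- paste0("mod", seq_along(formulas))
all.models <- list()
AICTable <- list()
for (i in seq_along(SM.split)) {
  poprun <- SM.split[[i]]
  all.models[[i]] <- lapply(formulas, function(f)
    glmer(f, family = binomial, data = poprun))
  AICTable[[i]] <- aictab(cand.set = all.models[[i]],
                          modnames = modnames, sort = TRUE)
  best <- match(as.character(AICTable[[i]]$Modnames[1]), modnames)
  cat("Population:", names(SM.split)[i], "- best model:", modnames[best], "\n")
  print(fixef(all.models[[i]][[best]]))  # fixed effects of the best model
}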

Linear regression using time-lagged predictors (independent variables) for forecasting purposes

I'm working on forecasting the monthly average precipitation of a geographical region in India (the Assam and Meghalaya subdivision). For this purpose, I'm using monthly average air temperature data and monthly averaged relative humidity data (which I extracted and spatially averaged from the netCDF4 files for this region on the NOAA website) as the independent variables (predictors).
For forecasting, I want to fit a linear regression with precipitation as the dependent variable and air temperature and relative humidity as the independent variables, such that they have a time-lagged effect in the regression.
The linear regression equation should look like:
Y(t) = b0 + b1 * X(t-1) + b2 * Z(t-1) + e(t)
Here, "Y" is Precipitation, "X" is Air Temperature and "Z" is Relative Humidity.
The sample "Training data" is as follows:
ID Time Precipitation Air_Temperature Relative_Humidity
1 1 1948-01-01 105 20.31194 81.64137
2 2 1948-02-01 397 21.21052 80.20120
3 3 1948-03-01 594 22.14363 81.94274
4 4 1948-04-01 2653 20.79417 78.89908
5 5 1948-05-01 7058 20.43589 82.99959
6 6 1948-06-01 5328 18.10059 77.91983
7 7 1948-07-01 4882 16.63936 76.25758
8 8 1948-08-01 3979 16.56065 76.89210
9 9 1948-09-01 2625 16.95542 76.80116
10 10 1948-10-01 2578 17.13323 75.62411
And a segment of "Test data" is as follows:
ID Time Precipitation Air_Temperature Relative_Humidity
1 663 2003-03-01 862 21.27210 79.77419
2 664 2003-04-01 1812 20.44042 79.42500
3 665 2003-05-01 1941 19.24267 79.57057
4 666 2003-06-01 4981 18.53784 80.67292
5 667 2003-07-01 4263 17.21581 79.97178
6 668 2003-08-01 2436 16.88686 81.37097
7 669 2003-09-01 2322 16.23134 77.63333
8 670 2003-10-01 2220 17.40589 81.14516
9 671 2003-11-01 131 19.01159 79.15000
10 672 2003-12-01 241 20.86234 79.05847
Any help would be highly appreciated. Thanks!
Reacting to your clarification in the comments, here is one of many ways to produce lagged variables, using the lag function within dplyr (I am also adding a new row here for later forecasting):
library(dplyr)
library(tibble)  # for add_row()

df %>%
  add_row(ID = 11, Time = "1948-11-01") %>%
  mutate(Air_Temperature_lagged = dplyr::lag(Air_Temperature, 1),
         Relative_Humidity_lagged = dplyr::lag(Relative_Humidity, 1)) -> df.withlags
You can then fit a straightforward linear regression using lm, with Precipitation as your dependent variable and the lagged versions of the two other variables as the predictors:
precip.model <- lm(Precipitation ~ Air_Temperature_lagged + Relative_Humidity_lagged,
                   data = df.withlags)
You could then apply your coefficients to your most recent values in Air_Temperature and Relative_Humidity to forecast the precipitation for November of 1948 using the predict function.
predict(precip.model, newdata = df.withlags)
1 2 3 4 5 6 7 8 9 10 11
NA 2929.566 3512.551 3236.421 3778.742 2586.012 3473.482 3615.884 3426.378 3534.965 3893.255
The model's prediction is 3893.255.
Note that this model will only allow you to forecast one time period into the future, since the lag-1 predictors are only available one step beyond the observed data.
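If you need to forecast more than one period ahead, one option is to regress on deeper lags; here is a sketch in the same dplyr style (the _lag2 column names are made up here), where lag-2 predictors buy you two periods:
df %>%
  add_row(ID = c(11, 12), Time = c("1948-11-01", "1948-12-01")) %>%
  mutate(Air_Temperature_lag2 = dplyr::lag(Air_Temperature, 2),
         Relative_Humidity_lag2 = dplyr::lag(Relative_Humidity, 2)) -> df.lag2
precip.model2 <- lm(Precipitation ~ Air_Temperature_lag2 + Relative_Humidity_lag2,
                    data = df.lag2)
tail(predict(precip.model2, newdata = df.lag2), 2)  # forecasts for the two added rows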

Coefplot for a chi-squared test

I was told to do a coefplot in R to visualise my data better.
I therefore first did a chi-squared test, and after I put my data into a table it looked like this:
1 2 3 5 6
5_min_blank 11 21 18 19 8
Boldstyle 6 7 14 10 2
Boldstyle_pause 9 22 19 8 0
Breaststroke 7 16 10 5 4
Breaststroke_pause 9 13 10 8 3
Diving 14 20 10 10 4
1-6 are categories and "Boldstyle" etc. are different sounds.
I then ran the test:
fit.swim <- chisq.test(X2, simulate.p.value = TRUE, B = 10000)
and got this result:
Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
data: X2
X-squared = 87.794, df = NA, p-value = 0.09479
Now I would like to do a coefplot with my data, but I only get this error:
coefplot(fit.swim)
Error: $ operator is invalid for atomic vectors
Any ideas how to draw a nice plot?
Thank you very much for the help!
All the best
Marie
I think the reason you are getting that error is that coefplot requires a fitted model as input, in the form of an lm, glm or rxLinMod object.
In your case you have carried out a chi-squared test on a contingency table, which compares the observed counts with the counts expected under independence. There is no fitted model to plot coefficients from.
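If the goal is simply a nicer picture of the table, one option (a sketch, not a coefficient plot) is to look at the Pearson residuals that chisq.test() already computes, for example with a shaded mosaic plot:
fit.swim$residuals            # (observed - expected) / sqrt(expected) per cell
mosaicplot(X2, shade = TRUE,  # shading encodes those residuals
           main = "Sounds vs. categories")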

Linear Regression - Append Predicted Values to Same dataset

I recently started with R programming. This is my dataset
WeekOfYear Production
1 202612
2 245633
3 299653
4 252612
5 299633
6 288993
7 254653
8 288612
9 277733
10 245633
I need to predict “Production” values for the remaining Weeks of the Year
relation <- lm(Production ~ WeekOfYear, data = dataset)
predict(relation, data.frame(WeekOfYear = 11))
How do I append predicted values for weeks 11 to 52 (the end of the year) to the same dataset, like below?
WeekOfYear Production
1 202612
2 245633
3 299653
4 252612
5 299633
6 288993
7 254653
8 288612
9 277733
10 245633
11 predicted value
12 predicted value
so on
-OR-
WeekOfYear Production Regression
1 202612 fitted value
2 245633 fitted value
3 299653 fitted value
4 252612 fitted value
5 299633 fitted value
6 288993 fitted value
7 254653 fitted value
8 288612 fitted value
9 277733 fitted value
10 245633 fitted value
11 predicted value
12 predicted value
13 predicted value
14 predicted value
.
.
52 predicted value
You could do it like this:
relation <- lm(Production ~ WeekOfYear, data = dat)
WeekOfYear <- 1:52
dat2 <- data.frame(WeekOfYear,
                   regression = predict(relation, data.frame(WeekOfYear)))
merge(dat, dat2, by = 'WeekOfYear', all.y = TRUE)
The result:
WeekOfYear Production regression
1 1 202612 250517.6
2 2 245633 253864.1
3 3 299653 257210.5
4 4 252612 260557.0
5 5 299633 263903.5
6 6 288993 267249.9
7 7 254653 270596.4
8 8 288612 273942.9
9 9 277733 277289.3
10 10 245633 280635.8
11 11 NA 283982.3
12 12 NA 287328.7
----
51 51 NA 417840.9
52 52 NA 421187.4
To append your values you can use the following:
test_data <- data.frame(WeekOfYear = 11:52)
test_data$Production <- predict(relation, test_data)
df <- rbind(df, test_data)
where df is your data frame, defined as:
df <- data.frame(WeekOfYear = 1:10,
                 Production = c(202612, 245633, 299653, 252612, 299633,
                                288993, 254653, 288612, 277733, 245633))
This will give you the behaviour you describe (I put a plot together very quickly to check, not shown here).
I am not sure, though, that your data follow a linear trend, but you may know your data better...
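As a quick check, here is a base-R sketch of the observed weeks against the fitted line:
plot(df$WeekOfYear, df$Production, xlim = c(1, 52),
     xlab = "Week of year", ylab = "Production")
abline(relation, lty = "dashed")  # fitted line from lm(Production ~ WeekOfYear)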

covariance structure for multilevel modelling

I have a multilevel repeated-measures dataset of around 300 patients, each with up to 10 repeated measures, predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random slope, random intercept model where effects vary between patients and the effect of time differs between patients. When I try to introduce a first-order covariance structure, to allow for the correlation of measurements over time, I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
library(nlme)

# baseline model includes only the intercept; the random intercept varies across patients
randomintercept <- lme(troponin ~ 1,
                       data = df, random = ~1 | record_id, method = "ML",
                       na.action = na.exclude,
                       control = list(opt = "optim"))
# random intercept and time as a fixed effect
timeri <- update(randomintercept, . ~ . + day)
# random slopes and intercept: effect of time is different in different people
timers <- update(timeri, random = ~ day | record_id)
# covariance structure: corAR1() first-order autoregressive, timepoints equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~ day | record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
                  correlation = corAR1(0, form = ~ day | record_id),
                  control = list(opt = "nlminb"))
However, if you look at the fitted model, you'll see you have problems: the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r = 0.998.
I think the problem is with the nature of the data. Most of the values are in the range 10-50, but there are excursions of one or two orders of magnitude (e.g. individual 6, up to about 45000). It can be hard to fit a model to data this spiky, so I would strongly suggest log-transforming your data. The standard diagnostic plot, plot(randomintercept), is dominated by the extreme values (plot not shown here), whereas fitting on the log scale
rlog <- update(randomintercept,log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR + random-slopes model fits OK:
ar.rlog <- update(rlog,
                  random = ~ day | record_id,
                  correlation = corAR1(0, form = ~ day | record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(ar.rlog, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))
