Spearman's rho between ordered factors in R

I have two ordered factors and simply want to find Spearman's rho between them.
However:
> cor(dat$UEMS.2,dat$SCIM23_SubScore1.2,use="pairwise.complete.obs",method="spearman")
Error in cor(dat$UEMS.2, dat$SCIM23_SubScore1.2, use = "pairwise.complete.obs", :
'x' must be numeric
And just as a sanity check:
> class(dat$UEMS.2)
[1] "ordered" "factor"
> class(dat$SCIM23_SubScore1.2)
[1] "ordered" "factor"
How do I find Spearman's rho for ordered factors using R?
I did find the following:
Calculate correlation - cor() - for only a subset of columns
Which raises the same issue: R's cor() function only accepts numeric data. This doesn't seem right to me, because Spearman's rho should be able to handle ordinal variables, and ordered factors are ordinal variables.
Thanks in advance.

You can use the pspearman package to handle ordinal variables:
a <- factor(c(1, 2, 3, 4, 4, 4, 3, 4, 2, 2, 1), ordered=TRUE)
b <- factor(c(1, 4, 2, 2, 4, 1, 1, 4, 4, 3, 3), ordered=TRUE)
library(pspearman)
spearman.test(a, b)
# Rsquare F df1 df2 pvalue n
# 0.001015235 0.009146396 1.000000000 9.000000000 0.925904654 11.000000000
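If installing an extra package is not an option, a base R sketch is to convert each ordered factor to its integer level codes, which preserve the ordering, and pass those to cor(). This assumes the factor levels are already in the intended order; a and b are the same vectors as above.

```r
a <- factor(c(1, 2, 3, 4, 4, 4, 3, 4, 2, 2, 1), ordered = TRUE)
b <- factor(c(1, 4, 2, 2, 4, 1, 1, 4, 4, 3, 3), ordered = TRUE)

# as.integer() on a factor returns the level codes (1 = lowest level),
# so the ordinal information is kept while cor() receives numeric input.
rho <- cor(as.integer(a), as.integer(b),
           use = "pairwise.complete.obs", method = "spearman")
rho
```

Spearman's rho only depends on the ranks, so the particular integer codes do not matter as long as they follow the level order.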

Related

variogram function in R returns one single observation

I'm trying to construct a variogram cloud in R using the variogram function from the gstat package. I'm not sure if there's something about the topic that I've misunderstood, but surely I should get more than one observation, right? Here's my code:
library(dplyr)  # rename()
library(sp)     # coordinates()
library(gstat)  # variogram()
data = data.frame(matrix(c(2, 4, 8, 7, 6, 4, 7, 9, 4, 4, -1.01, .05, .47, 1.36, 1.18), nrow=5, ncol=3))
data = rename(data, X=X1, Y=X2, Z=X3)
coordinates(data) = c("X","Y")
var.cld = variogram(Z ~ 1,data=data, cloud = TRUE)
And here's the output:
> var.cld
dist gamma dir.hor dir.ver id left right
1 1 0.0162 0 0 var1 5 4
I found the problem! Apparently the default value of the cutoff argument was too low for my specific set of data. Specifying a higher value resulted in additional observations.
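To see why only one pair survived, here is a base R sketch that rebuilds the variogram cloud by hand for the same five points. It assumes the documented gstat default for cutoff (the diagonal of the box spanning the data, divided by three); the variable names are made up for illustration.

```r
xy <- cbind(X = c(2, 4, 8, 7, 6), Y = c(4, 7, 9, 4, 4))
z  <- c(-1.01, 0.05, 0.47, 1.36, 1.18)

# All unordered point pairs, their separation distance and semivariance
pairs <- t(combn(nrow(xy), 2))
d     <- sqrt(rowSums((xy[pairs[, 1], ] - xy[pairs[, 2], ])^2))
gam   <- 0.5 * (z[pairs[, 1]] - z[pairs[, 2]])^2
cloud <- data.frame(left = pairs[, 2], right = pairs[, 1], dist = d, gamma = gam)

# Assumed default cutoff: bounding-box diagonal divided by three
default_cutoff <- sqrt(sum(apply(xy, 2, function(v) diff(range(v)))^2)) / 3

cloud[cloud$dist <= default_cutoff, ]  # pairs kept under the default cutoff
cloud[cloud$dist <= 10, ]              # pairs kept with a larger, arbitrary cutoff
```

With gstat itself, the corresponding change would be to pass a larger value, for example variogram(Z ~ 1, data = data, cloud = TRUE, cutoff = 10), where 10 is just an illustrative choice that exceeds the largest pairwise distance.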

Which packages in R allow for lagged variables for time series analysis?

I would like to include multiple lags of an exogenous variable in a regression. Let's say that I have the following data:
X = c(1, 4, 8, 9, 3, 5...)
X2 = c(4, 6, 7, 9, 7, 8...)
I want to use lags of X2 to predict X. Does anyone know which package allows me to do this? I have tried using dynlm and lag() from stats.
Thanks
library(zoo)
set.seed(1111)
x <- as.zoo(rnorm(10, 0, 0.02))
y <- lag(x, 2, na.pad = TRUE)
cbind(x, y)
This performs an ordinary linear regression of X on the first 2 lags of X2 with an intercept (fit2), on the first lag with an intercept (fit1) and just on an intercept (fit0). Note that in R one normally uses negative numbers to lag so for convenience we defined a Lag function which uses positive numbers to indicate lags. lag.zoo allows vector lags so Lag(z2, 1:2) has two columns, one column for each of the two lags.
library(dyn)
X = c(1, 4, 8, 9, 3, 5)
X2 = c(4, 6, 7, 9, 7, 8)
z <- zoo(X)
z2 <- zoo(X2)
Lag <- function(x, k = 1) lag(x, k = -k)
fit2 <- dyn$lm(z ~ Lag(z2, 1:2))
fit1 <- dyn$lm(z ~ Lag(z2))
fit0 <- dyn$lm(z ~ 1)
For example, here is fit2.
> fit2
Call:
lm(formula = dyn(z ~ Lag(z2, 1:2)))
Coefficients:
(Intercept) Lag(z2, 1:2)1 Lag(z2, 1:2)2
19.3333 -1.4242 -0.4242
Here is a comparison of the three fits showing that the one and two lag fits are not significantly better than just using the intercept; however, there is quite a drop in the residual sum of squares when the first lag is added to the intercept-only model, so you might want to ignore the statistical significance and use the first lag anyway.
> anova(fit0, fit1, fit2)
Analysis of Variance Table
Model 1: z ~ 1
Model 2: z ~ Lag(z2)
Model 3: z ~ Lag(z2, 1:2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 3 22.7500
2 2 8.4211 1 14.3289 2.1891 0.3784
3 1 6.5455 1 1.8756 0.2865 0.6871
It would also be possible to use ts class in place of the zoo class; however, lag.ts does not support vector lags so with ts each term would have to be written out separately. Lag is from above.
tt <- ts(X)
tt2 <- ts(X2)
fits12_ts <- dyn$lm(tt ~ Lag(tt2) + Lag(tt2, 2))
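No external R library is required, I would say. Note that the helper below shifts the values forward (a lead, matching the sign convention of R's lag()); for a conventional lag, pad at the front instead with c(rep(NA, x), head(X2, -x)).
X2 = c(4, 6, 7, 9, 7, 8)
lag = 2
lagged_data <- function(x) c(tail(X2, -x), rep(NA, x))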
lagged_data(lag)
# [1] 7 9 7 8 NA NA
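Taking the base R idea one step further, a minimal sketch of the whole regression without any package could look like this. lag_vec is a hypothetical helper (not part of stats) that pads at the front, so position t holds the value from k steps earlier.

```r
X  <- c(1, 4, 8, 9, 3, 5)
X2 <- c(4, 6, 7, 9, 7, 8)

# Hypothetical helper: shift x back by k steps, padding the start with NA,
# so that position t holds x[t - k].
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

d <- data.frame(X = X, X2_lag1 = lag_vec(X2, 1), X2_lag2 = lag_vec(X2, 2))

# With the default na.action (na.omit), lm() drops the rows containing NA
fit <- lm(X ~ X2_lag1 + X2_lag2, data = d)
coef(fit)
```

On this toy data the coefficients should agree with the fit2 output from the dyn answer (19.3333, -1.4242, -0.4242), since both regress X on the first two lags of X2 with an intercept.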

R's equivalent of numpy.linalg.lstsq

I have multiple linear regressions of the form vc = x1 * va + x2 * vb.
(The first example below is too minimal: all three vectors hold the same values, which leads to warnings in R. A second data set further down illustrates my issue.)
In Python, I programmed
#!/usr/bin/env python3
import numpy as np
va = np.array([1, 2, 3, 4, 5])
vb = np.array([1, 2, 3, 4, 5])
vc = np.array([1, 2, 3, 4, 5])
A = np.vstack([va, vb]).T
print(A)
result = np.linalg.lstsq(A, vc)
print(result)
Output:
(array([ 0.5, 0.5]), array([], dtype=float64), 1, array([ 1.04880885e+01, 3.14018492e-16]))
I thought the following code would be equivalent:
#!/usr/bin/Rscript
va <- c(1, 2, 3, 4, 5)
vb <- c(1, 2, 3, 4, 5)
vc <- c(1, 2, 3, 4, 5)
reg <- lm(vc ~ va + vb)
reg
summary(reg)
However, I get the following output (excerpt):
Coefficients:
A1 A2
1 NA
Residual standard error: 7.022e-16 on 4 degrees of freedom
In summary.lm(reg) : essentially perfect fit: summary may be unreliable
Even if I adjust the numbers somewhat, R still keeps complaining.
I assume I am doing something basic wrong, but I can't figure it out. I also tried to construct a matrix A (containing vb and vc as columns) and then use reg <- lm(vc ~ 0 + A). There, I get 3 degrees of freedom, but with the same coefficients.
2nd data set
va = np.array([1, 2, 3, 4, 5])
vb = np.array([2, 2, 2, 2, 2])
vc = np.array([3.1, 3.2, 3.3, 3.4, 3.5])
va <- c(1, 2, 3, 4, 5)
vb <- c(2, 2, 2, 2, 2)
vc <- c(3.1, 3.2, 3.3, 3.4, 3.5)
If I add 0 + (which results in lm(vc ~ 0 + va + vb)), I get 3 degrees of freedom and the same result. Looks good.
The 0 + removes the "implied intercept term" (whatever this is).
The problem is that you have a singular fit, and multiple combinations of coefficients will represent it equally well. IMHO, both numpy and R should really throw an error in this case by default. You can get R to give you an error by adding singular.ok = FALSE to your arguments. Additionally, although your intercept in this case is zero, your regression equation indicates that you're not looking to fit one. To fit a linear model without an intercept in R, use a formula of the form:
lm(vc ~ va + vb - 1)
So, to (properly) return an error in this singular fit, you would call:
reg <- lm(vc ~ va + vb - 1, singular.ok = FALSE)
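If the goal is to reproduce the numbers numpy prints rather than to get an error, note that np.linalg.lstsq returns the minimum-norm least-squares solution for a rank-deficient matrix, which is where the (0.5, 0.5) comes from. A rough base R sketch of the same idea uses svd(). The helper name lstsq_min_norm and the rcond threshold are assumptions made for illustration, not part of any package.

```r
# Hypothetical helper: minimum-norm least-squares solution of A x = b via SVD.
# Singular values below rcond * (largest singular value) are treated as zero.
lstsq_min_norm <- function(A, b, rcond = 1e-10) {
  s    <- svd(A)
  keep <- s$d > rcond * max(s$d)
  U    <- s$u[, keep, drop = FALSE]
  V    <- s$v[, keep, drop = FALSE]
  drop(V %*% (crossprod(U, b) / s$d[keep]))
}

# First (singular) data set: both columns are identical
va   <- c(1, 2, 3, 4, 5)
sol1 <- lstsq_min_norm(cbind(va, va), va)

# Second data set: full column rank, so the solution is unique
vb   <- c(2, 2, 2, 2, 2)
vc   <- c(3.1, 3.2, 3.3, 3.4, 3.5)
sol2 <- lstsq_min_norm(cbind(va, vb), vc)

sol1
sol2
```

For the full-rank case this should match coef(lm(vc ~ va + vb - 1)); the SVD route only matters when the columns are linearly dependent.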

Inverted ED95 and ED5 in drc package

My aim is to determine the distance between site of injection of a treatment and target of this treatment with a 0.95 probability of success.
The outcome variable was a binary variable (Success:1/failure:0)
I used Dixon up and down methodology with six distances tested : 0, 2, 4, 6, 8 and 10 mm.
Here are my data :
column 1 : distances used
column 2 : number of success
column 3 : total number of patients
data <- data.frame(1:6,1:6,1:6)
data[,1] <- c(0, 2, 4, 6, 8, 10)
data[,2] <- c(2, 12, 3, 2, 1, 0)
data[,3] <- c(2, 12, 15, 8, 4, 1)
names(data) <- c("Distance", "Success", "Total")
I built a model with the drc package (version 2.3-96) and R 3.1.2 on Windows Vista:
library(drc)
model <- drm(Success/Total~Distance, weights=Total,
data=data, fct=LL.2(), type="binomial")
summary(model)
plot(model, bp=.5, legend=FALSE
, xlab=paste("Distance"), ylab="Probability of success", lwd=2,
cex=1.2, cex.axis=1.2, cex.lab=1.2, log = "")
All seems to be OK,
but when it comes to estimating ED95 (effective dose 95: the distance required to have a 0.95 probability of success), I think that ED95 is inverted with ED5 (effective dose 5: the distance required to have a 0.05 probability of success):
ED(model, 95, interval="delta")
ED(model, 5, interval="delta")
ED95 : 8.0780 SE: 2.0723 CI 95 % (4.0165 ; 12.139)
ED5 : 1.58440 SE: 0.46413 CI 95 % (0.67472 ; 2.4941)
ED values in the drc package are by default calculated relative to the control level. In our case, we are looking for ED values calculated relative to the upper limit.
So we must change the reference value from "control" (default) to "upper" :
ED(model, 95, interval="delta", reference = "upper")
Many thanks to Christian Ritz
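As a sanity check on which value should be the smaller one, here is a small base R sketch. It assumes the two-parameter log-logistic form used by LL.2(), f(x) = 1 / (1 + exp(b * (log(x) - log(e)))), and uses made-up values for b and e rather than the fitted estimates. With b > 0 the curve decreases, as success does with distance here.

```r
# Two-parameter log-logistic curve (lower limit 0, upper limit 1)
ll2 <- function(x, b, e) 1 / (1 + exp(b * (log(x) - log(e))))

# Distance at which the curve equals probability p (inverse of ll2)
dist_for_prob <- function(p, b, e) e * ((1 - p) / p)^(1 / b)

b <- 3  # hypothetical slope, positive so that success falls with distance
e <- 4  # hypothetical distance giving 0.5 probability of success

d95 <- dist_for_prob(0.95, b, e)  # distance with 0.95 probability of success
d05 <- dist_for_prob(0.05, b, e)  # distance with 0.05 probability of success
c(d95 = d95, d05 = d05)
```

For a decreasing curve the distance giving 0.95 success is necessarily the smaller of the two, which is consistent with the 8.08 and 1.58 estimates above having been reported under swapped labels.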

How to grab coefficients with R when estimating a Zero Inflation Model

This is probably pretty easy, but how do I grab coefficients when using the zeroinfl command?
treatment <- factor(rep(c(1, 2), c(43, 41)),
levels = c(1, 2),labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)),
levels = c(1, 2, 3),labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 2)
healthvalue <- rpois(84,0.5)
y <- data.frame(healthvalue,numberofdrugs, treatment, improved)
require(pscl)
ZIP<-zeroinfl(healthvalue~numberofdrugs+treatment+improved, y)
summary(ZIP)
I usually use ZIP$coef[1] to grab a coefficient, but unfortunately here that grabs a whole bunch. So how can I grab one single coefficient from a ZIP model?
Use the coef extraction function to list all coefficients in one long vector, and then you can use single index notation to select them:
coef(ZIP)[1]
count_(Intercept)
0.1128742
Alternatively, you need to select which model you want to get the coefficients from first:
ZIP$coef$count[1]
(Intercept)
0.1128742
ZIP$coef[[1]][1]
(Intercept)
0.1128742
If you wanted to get fancy you could split the coefficients into a list:
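clist <- function(m) {
  cc <- coef(m)
  # component prefix ("count" or "zero") from names such as "count_(Intercept)"
  ptype <- gsub("_.+$", "", names(cc))
  ss <- split(cc, ptype)
  # strip the prefix from the names and return the coefficient vector itself
  lapply(ss, function(x) {
    names(x) <- gsub("^.*_", "", names(x))
    x
  })
}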
> clist(ZIP)
$count
(Intercept) numberofdrugs treatmenttreated improvedsome
-1.16112045 0.16126724 -0.07200549 -0.34807344
improvedmarked
0.23593220
$zero
(Intercept) numberofdrugs treatmenttreated improvedsome
7.509235 -14.449669 -58.644743 -8.060501
improvedmarked
58.034805
c1 <- clist(ZIP)
c1$count["numberofdrugs"]
