Inverted ED95 and ED5 in drc package

My aim is to determine the distance between the injection site of a treatment and the target of this treatment that gives a 0.95 probability of success.
The outcome variable is binary (success: 1 / failure: 0).
I used Dixon's up-and-down methodology with six distances tested: 0, 2, 4, 6, 8 and 10 mm.
Here are my data:
column 1: distances used
column 2: number of successes
column 3: total number of patients
data <- data.frame(
  Distance = c(0, 2, 4, 6, 8, 10),
  Success  = c(2, 12, 3, 2, 1, 0),
  Total    = c(2, 12, 15, 8, 4, 1)
)
I built a model with the drc package (version 2.3-96) under R 3.1.2 on Windows Vista:
library(drc)
model <- drm(Success/Total ~ Distance, weights = Total,
             data = data, fct = LL.2(), type = "binomial")
summary(model)
plot(model, bp = .5, legend = FALSE,
     xlab = "Distance", ylab = "Probability of success",
     lwd = 2, cex = 1.2, cex.axis = 1.2, cex.lab = 1.2, log = "")
All seems to be OK, but when it comes to estimating the ED95 (effective dose 95: the distance required to have a 0.95 probability of success), I think the ED95 was swapped with the ED5 (effective dose 5: the distance required to have a 0.05 probability of success):
ED(model, 95, interval="delta")
ED(model, 5, interval="delta")
ED95: 8.0780, SE: 2.0723, 95% CI (4.0165; 12.139)
ED5: 1.58440, SE: 0.46413, 95% CI (0.67472; 2.4941)

ED values in the drc package are by default calculated relative to the control level. In our case, we are looking for ED values calculated relative to the upper limit, so we must change the reference value from "control" (the default) to "upper":
ED(model, 95, interval="delta", reference = "upper")
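Since ED() accepts a vector of response levels, both estimates can also be requested in one call; a minimal sketch:
ED(model, c(5, 95), interval = "delta", reference = "upper")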
Many thanks to Christian Ritz

Related

variogram function in R returns one single observation

I'm trying to construct a variogram cloud in R using the variogram function from the gstat package. I'm not sure if there's something about the topic that I've misunderstood, but surely I should get more than one observation, right? Here's my code:
library(gstat)
library(sp)     # provides coordinates()
library(dplyr)  # provides rename()
data = data.frame(matrix(c(2, 4, 8, 7, 6, 4, 7, 9, 4, 4, -1.01, .05, .47, 1.36, 1.18), nrow=5, ncol=3))
data = rename(data, X=X1, Y=X2, Z=X3)
coordinates(data) = c("X","Y")
var.cld = variogram(Z ~ 1, data=data, cloud = TRUE)
And here's the output:
> var.cld
dist gamma dir.hor dir.ver id left right
1 1 0.0162 0 0 var1 5 4
I found the problem! Apparently the default value of the cutoff argument was too low for my specific set of data. Specifying a higher value resulted in additional observations.
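For reference, a sketch of the fix with the data above (gstat's default cutoff is one third of the diagonal of the bounding box, roughly 2.6 here, so only one point pair qualified):
var.cld = variogram(Z ~ 1, data=data, cloud = TRUE, cutoff = 10)
var.cld  # should now report all 10 point pairs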

How to find difference in mean using MCMCregress command?

I am trying to figure out how to find the difference in means for two categorical variables using MCMCregress and to plot the densities.
My code is
library(MCMCpack)
data("crabs", package = "MASS")
out <- MCMCregress(sex ~ sp, data = crabs, family = binomial)
summary(out)
I keep getting this error message:
Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, : NA/NaN/Inf in 'y'
What should I do to fix this?
I noticed that the sex variable is a factor. You can simply convert the factor to numeric and your code will work (MCMCregress fits a Gaussian linear regression, so the family argument is not used). Here is the code:
library(MCMCpack)
data("crabs", package = "MASS")
out <- MCMCregress(as.numeric(sex) ~ sp, data = crabs, family = binomial)
summary(out)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 1.5002783 0.05052 0.0005052 0.0005052
spO -0.0003147 0.07202 0.0007202 0.0007202
sigma2 0.2551607 0.02597 0.0002597 0.0002637
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) 1.4016 1.46639 1.5005847 1.53420 1.5996
spO -0.1433 -0.04842 -0.0009755 0.04696 0.1420
sigma2 0.2091 0.23688 0.2534471 0.27180 0.3105
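As for plotting the densities: the spO coefficient is the posterior difference in means between the two species, and MCMCregress returns a coda mcmc object, so the standard coda plotting methods apply. A minimal sketch:
# trace and density plots for all parameters
plot(out)
# density of the difference in means alone (the spO coefficient)
densplot(out[, "spO"])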

Stratified cluster sampling estimates from survey package

I want to estimate means and totals from a stratified sampling design in which single stage cluster sampling was used in each stratum. I believe I have the design properly specified using the svydesign() function of the survey package. But I'm not sure how to correctly specify the stratum weights.
Example code is shown below. I provide unadjusted stratum weights using the weights= argument. I expected that the estimate and the SE from svytotal() would be equal to the sum of the stratum weights (70, in the example) times the estimate and SE from svymean(). Instead the estimates differ by a factor of 530 (which is the sum of the stratum weights over all of the elements in the counts data) and the SEs differ by a factor of 898 (???). My questions are (1) how can I provide my 3 stratum weights to svydesign() in a way that it understands, and (2) why aren't the estimates and SEs from svytotal() and svymean() differing by the same factor?
library(survey)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum = rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster = rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element = c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count   = 1:21
)
# stratum weights
weights <- data.frame(
  Stratum = c("A", "B", "C"),
  W = c(10, 20, 40)
)
# combine counts and weights
both <- merge(counts, weights)
# estimate mean and total count
D <- svydesign(id=~Cluster, strata=~Stratum, weights=~W, data=both)
a <- svymean(~Count, D)
b <- svytotal(~Count, D)
sum(weights$W) # 70
sum(both$W) # 530
coef(b)/coef(a) # 530
SE(b)/SE(a) # 898.4308
First update
I'm adding a diagram to help explain my design. The entire population is a lake with known area (70 ha in this example). The strata have known areas, too (10, 20, and 40 ha). The number of clusters allocated to each stratum was not proportional. Also, the clusters are tiny relative to the number that could possibly be sampled, so the finite population correction is FPC = 1.
I want to calculate an overall mean and SE on a per unit area basis and a total that is equal to 70 times this mean and SE.
Second update
I wrote the code to do the calculations from scratch. I get a total estimate of 920 with SE 61.6.
library(survey)
library(tidyverse)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum = rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster = rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element = c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count   = c(5:1, 6:21)
)
# stratum weights
areas <- data.frame(
  Stratum = c("A", "B", "C"),
  A_h = c(10, 20, 40)
)
# calculate cluster means
step1 <- counts %>%
  group_by(Stratum, Cluster) %>%
  summarise(P_hi = sum(Count), m_hi = n())
step2 <- step1 %>%
  group_by(Stratum) %>%
  summarise(
    ybar_h = sum(P_hi) / sum(m_hi),
    n_h = n(),
    sh.numerator = sum((P_hi - ybar_h * m_hi)^2),
    mbar_h = mean(m_hi)
  ) %>%
  mutate(
    S_ybar_h = 1 / mbar_h * sqrt(sh.numerator / (n_h * (n_h - 1)))
  )
# now expand up to strata
step3 <- step2 %>%
  left_join(areas) %>%
  mutate(
    W_h = A_h / sum(A_h)
  ) %>%
  summarise(
    A = sum(A_h),
    ybar_strat = sum(W_h * ybar_h),
    S_ybar_strat = sum(W_h * S_ybar_h / sqrt(n_h))
  ) %>%
  mutate(
    tot = A * ybar_strat,
    S_tot = A * S_ybar_strat
  )
step2
step3
This gives the following output:
> step2
# A tibble: 3 x 6
Stratum ybar_h n_h sh.numerator mbar_h S_ybar_h
<fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 3.0 2 18.0 2.500000 1.200000
2 B 9.5 3 112.5 2.666667 1.623798
3 C 17.5 3 94.5 2.666667 1.488235
> step3
# A tibble: 1 x 5
A ybar_strat S_ybar_strat tot S_tot
<dbl> <dbl> <dbl> <dbl> <dbl>
1 70 13.14286 0.8800657 920 61.6046
(Revised answer to revised question)
In this case svytotal isn't what you want -- it's for the actual population total of the elements being sampled, and so doesn't make sense when the population is thought of as infinitely bigger than the sample. The whole survey package is really designed for discrete, finite populations, but we can work around it.
I think you want to get a mean for each stratum and then multiply it by the stratum weights. To do that,
D <- svydesign(id=~Cluster, strata=~Stratum, data=both)
means <- svyby(~Count, ~Stratum, svymean, design=D)
svycontrast(means, quote(10*A+20*B+40*C))
You'll get a warning
Warning message:
In vcov.svyby(stat) : Only diagonal elements of vcov() available
That's because svyby doesn't return covariances between the stratum means. It's harmless, because the strata really are independent samples (that's what stratification means) so the covariances are zero.
svytotal is doing what I think it should do here: weights are based on sampling probability, so they are only defined for sampling units. The svydesign call applied those weights to the clusters and (because of the cluster sampling) to the elements, giving the 530-fold higher total. You need to supply either observation weights or enough information for svydesign to calculate them itself. If this is cluster sampling with no subsampling, you can divide the stratum weight over the clusters to get the cluster weight, and then divide this over the elements within a cluster to get the observation weight. Or, if the stratum weight is the number of clusters in the population, you can use the fpc argument to svydesign.
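A sketch of the first option, assuming single-stage cluster sampling with no subsampling and an even split of each stratum weight over its sampled clusters, using the both data frame from the question:
library(dplyr)
both2 <- both %>%
  group_by(Stratum) %>%
  mutate(w_cluster = W / n_distinct(Cluster)) %>%  # stratum weight split over clusters
  group_by(Stratum, Cluster) %>%
  mutate(w_obs = w_cluster / n()) %>%              # cluster weight split over elements
  ungroup()
D2 <- svydesign(id = ~Cluster, strata = ~Stratum, weights = ~w_obs, data = both2)
svytotal(~Count, D2)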
The fact that the SE doesn't scale the same way as the point estimate is because the population size is unknown and has to be estimated. The mean is the estimated total divided by the estimated population size, and the SE estimate takes account of the variance of the denominator and its covariance with the numerator.
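One way to see this ratio structure in the example, using the original weighted design from the question (the column of ones is added here purely for illustration):
both$one <- 1
Dw <- svydesign(id = ~Cluster, strata = ~Stratum, weights = ~W, data = both)
svyratio(~Count, ~one, Dw)  # should match svymean(~Count, Dw) in estimate and SE
svytotal(~one, Dw)          # the estimated population size, 530, itself has a nonzero SE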

Caret using C5.0 method, how to plot the final tree

I am using caret's train function with method="C5.0" and would like to see the finalModel plotted as a tree.
The resulting model was reported as:
The final values used for the model were trials = 15, model = tree and winnow = FALSE.
When I tried to plot the tree using plot or rattle's fancyRpartPlot, I got the errors below:
Using plot:
plot(diabetes.c50$finalModel,trials=15)
Error in plot(diabetes.c50$finalModel, trials = 15) :
object 'diabetes.c50' not found
Using rattle:
fancyRpartPlot(diabetes.C50$finalModel,trials=15)
Error in if (model$method == "class") { : argument is of length zero
The finalModel is defined as:
> diabetes.C50$finalModel
Call:
C5.0.default(x = structure(c(6, 8, 0, 8, 4, 10, 10, 1, 5, 7, 1, 1, 3, 8, 7, 9, 11, 10, 7, 1, 13, 5, 5, 3, 6, 4, 11, 9, 4, 3, 9, 7, 0,
"outcome", seed = 2187L), .Names = c("subset", "bands", "winnow", "noGlobalPruning", "CF", "minCases", "fuzzyThreshold",
"sample", "earlyStopping", "label", "seed")), verbose = FALSE)
Classification Tree
Number of samples: 538
Number of predictors: 8
Number of boosting iterations: 15
Average tree size: 12.9
Non-standard options: attempt to group attributes
The data structure representing a C5.0 tree is different to that representing an rpart tree. Rattle's fancyRpartPlot() assumes an rpart tree hence you get an error (recent versions of rattle check for the model class and explain this error rather than failing with the above indecipherable message).
Your first error, though, looks like a typo, and the error message is self-explanatory: you meant diabetes.C50$finalModel rather than diabetes.c50$finalModel (capital C50 rather than lowercase c50).
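Once the name is corrected, the C50 package's own plot method can draw the tree; with boosting it draws one iteration at a time via the trial argument (0-indexed, so 0 through 14 here), and it works for tree-based models but not rule-based ones. A sketch, assuming diabetes.C50 is the train object from the question:
library(C50)
plot(diabetes.C50$finalModel, trial = 0)  # first boosted tree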

R's equivalent of numpy.linalg.lstsq

I have multiple linear regressions of the form vc = x1 * va + x2 * vb.
(A rather minimal example follows first; it uses identical values, which leads to warnings in R. A second data set illustrating my issue is given further below.)
In Python, I programmed
#!/usr/bin/env python3
import numpy as np
va = np.array([1, 2, 3, 4, 5])
vb = np.array([1, 2, 3, 4, 5])
vc = np.array([1, 2, 3, 4, 5])
A = np.vstack([va, vb]).T
print(A)
result = np.linalg.lstsq(A, vc)
print(result)
Output:
(array([ 0.5, 0.5]), array([], dtype=float64), 1, array([ 1.04880885e+01, 3.14018492e-16]))
I thought the following code would be equivalent:
#!/usr/bin/Rscript
va <- c(1, 2, 3, 4, 5)
vb <- c(1, 2, 3, 4, 5)
vc <- c(1, 2, 3, 4, 5)
reg <- lm(vc ~ va + vb)
reg
summary(reg)
However, I get the following output (excerpt):
Coefficients:
A1 A2
1 NA
Residual standard error: 7.022e-16 on 4 degrees of freedom
In summary.lm(reg) : essentially perfect fit: summary may be unreliable
Even if I adjust the numbers somehow, R still keeps complaining.
I assume I am doing something basic wrong, but I can't figure out what. I also tried to construct a matrix A (containing va and vb as columns) and then use reg <- lm(vc ~ 0 + A). There, I get 3 degrees of freedom, but the same coefficients.
2nd data set
va = np.array([1, 2, 3, 4, 5])
vb = np.array([2, 2, 2, 2, 2])
vc = np.array([3.1, 3.2, 3.3, 3.4, 3.5])
va <- c(1, 2, 3, 4, 5)
vb <- c(2, 2, 2, 2, 2)
vc <- c(3.1, 3.2, 3.3, 3.4, 3.5)
If I add 0 + (which results in lm(vc ~ 0 + va + vb)), I get 3 degrees of freedom and the same result. Looks good.
The 0 + removes the "implied intercept term" (whatever that is).
The problem is that you have a singular fit, and multiple combinations of coefficients represent it equally well. IMHO, both numpy and R should really throw an error in this case by default. You can get R to give you an error by adding singular.ok = FALSE to your arguments. Additionally, although your intercept in this case is zero, your regression equation indicates that you're not looking to fit one. To fit a linear model without an intercept in R, use a formula of the form:
lm(vc ~ va + vb - 1)
So, to (properly) return an error in this singular fit, you would call:
reg <- lm(vc ~ va + vb - 1, singular.ok = FALSE)
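As a quick check with the second data set from the question (here the design matrix is full rank, so the fit is exact and the least-squares solution is unique, matching numpy's):
va <- c(1, 2, 3, 4, 5)
vb <- c(2, 2, 2, 2, 2)
vc <- c(3.1, 3.2, 3.3, 3.4, 3.5)
reg <- lm(vc ~ va + vb - 1, singular.ok = FALSE)
coef(reg)  # va = 0.1, vb = 1.5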
