I was told to do a coefplot in R to visualise my data better.
Therefore I first did a chi-square test, and after I put my data into a table it looked like this:
                    1  2  3  5  6
5_min_blank        11 21 18 19  8
Boldstyle           6  7 14 10  2
Boldstyle_pause     9 22 19  8  0
Breaststroke        7 16 10  5  4
Breaststroke_pause  9 13 10  8  3
Diving             14 20 10 10  4
1-6 are the categories and "Boldstyle" etc. are the different sounds.
I then ran the test:
fit.swim <- chisq.test(X2, simulate.p.value = TRUE, B = 10000)
and got this result:
Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
data: X2
X-squared = 87.794, df = NA, p-value = 0.09479
Now I would like to do a coefplot with my data, but I only get this error:
coefplot(fit.swim)
Error: $ operator is invalid for atomic vectors
Any ideas how to draw a nice plot?
Thank you very much for the help!
All the best
Marie
I think the reason you are getting that error is that coefplot requires a fitted model as input, in the form of an lm, glm, or rxLinMod object.
In your case you have carried out a goodness of fit test that essentially compares the observed sample distribution with the expected probability distribution. There isn't a fitted model to plot the coefficients from.
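If the goal is simply a nicer visualisation of the test, one common option is to plot the Pearson residuals that chisq.test() stores in its result. A minimal sketch with the table from your question (the base-graphics barplot is just one illustrative choice):
X2 <- as.table(rbind(
  "5_min_blank"        = c(11, 21, 18, 19, 8),
  "Boldstyle"          = c( 6,  7, 14, 10, 2),
  "Boldstyle_pause"    = c( 9, 22, 19,  8, 0),
  "Breaststroke"       = c( 7, 16, 10,  5, 4),
  "Breaststroke_pause" = c( 9, 13, 10,  8, 3),
  "Diving"             = c(14, 20, 10, 10, 4)))
colnames(X2) <- c("1", "2", "3", "5", "6")
fit.swim <- chisq.test(X2, simulate.p.value = TRUE, B = 10000)
# Pearson residuals: (observed - expected) / sqrt(expected)
round(fit.swim$residuals, 2)
# one group of bars per category, one bar per sound; large |residual| marks the biggest departures
barplot(fit.swim$residuals, beside = TRUE, legend.text = TRUE,
        ylab = "Pearson residual", las = 2)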
I am trying to compute robust/clustered standard errors after using mlogit() to fit a Multinomial Logit (MNL) model in a discrete choice problem. Unfortunately, I suspect I am running into problems because I am using data in long format (this is a must in my case): I get the error "Error in ef/X : non-conformable arrays" after calling sandwich::vcovHC( , "HC0").
The Data
For illustration, please consider the following data. It represents data from 5 individuals (id_ind) that choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is described by two generic attributes (x1 and x2), and the choices are recorded in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
We can then fit an MNL model using mlogit() and try to extract its robust variance-covariance matrix as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2 | 0,
             method = "nr",
             data = df,
             idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see, there is an error produced by sandwich::vcovHC, which says that ef/X is non-conformable, where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the GitHub mirror, I spotted the problem: because the data are in long format, ef has dimensions 15 x 2 while X has dimensions 45 x 2.
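A quick way to see the mismatch directly, using the mo object fitted above (a sketch; these are the same quantities that vcovHC() builds internally as ef and X):
dim(estfun(mo))        # 15 x 2: one row per choice situation
dim(model.matrix(mo))  # 45 x 2: one row per alternative in the long-format data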
My workaround
Given that the show must go on, I am computing the robust and clustered standard errors manually, using some functions that I borrowed from sandwich and adjusted to match Stata's output.
Robust Standard Errors
These lines are inspired by the sandwich::meat() function.
psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
#            x1         x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
                 y:          y:
                x1          x2
y:x1     .23050262
y:x2     .09840356   .12765662
Clustered Standard Errors
Here, given that each individual answers 3 questions, it is highly likely that there is some degree of correlation within individuals; hence cluster corrections should be preferred in such situations. Below I compute the cluster correction for this case and show the equivalence with the Stata output of clogit, cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice)]
psi_2 <- rowsum(psi, group = id_ind_collapsed)
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
#           x1        x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
                 y:          y:
                x1          x2
y:x1     .17667075
y:x2      .1007703   .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noticed that there is a dedicated bread function in sandwich for mlogit, but I could not spot something like a meat method for mlogit; anyway, I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can only be applied to models with a single linear predictor, where the score function (aka estimating function) is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC documentation also pointed to this but did not explain it very well. I have now improved the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
Stata, however, divides by 1/(n-1) in logit models; see: Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for preferring one adjustment over the other, and in moderately large samples it makes hardly any difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
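For example, with the mo object from the question the three variants discussed above could be requested as follows (a sketch, assuming estfun()/bread() methods are available for mlogit as noted above; the numeric output is omitted here):
sandwich(mo, adjust = FALSE)  # basic sandwich estimator, meat scaled by 1/n (HC0-type)
sandwich(mo, adjust = TRUE)   # meat scaled by 1/(n - k) (HC1-type)
vcovCL(mo)                    # every observation its own cluster: 1/(n - 1) scaling, as in Stata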
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you just have to keep in mind that the cluster variable needs to be provided only once per choice situation (as opposed to stacked several times in long format).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
##            x1         x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
##           x1        x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004
I have this table and I want to analyse it statistically.
table(sci$category, sci$true_group)
               mono sim_rus_nen suc_balanced suc_nen_rus suc_rus_nen
generalization    9           3            9           4           3
description      35          16           15          13          17
scheme            2           1            1           1           2
syncretism        5           3            7          16           2
tautology         2           2            2           3           3
substitution      1           0            0           0           0
indefinite        7           5            5           6           9
no_answer        30          17           18          13          19
So I decided to apply Fisher's exact test. But I get this error (although it works fine with chisq.test):
fisher.test(table(sci$category, sci$true_group))
Error in fisher.test(my_tab) : Bug in fexact3, it[i=6]=0: negative
key -1099365618 (kyy=91)
How can I fix this?
For larger contingency tables / larger counts, it becomes resource-intensive to enumerate all the tables at least as extreme as the observed one in order to arrive at the exact p-value (which appears to be what that error is about).
So it is convenient to simulate the p-value for tables larger than 2x2:
df <- table(sci$category, sci$true_group)
fisher.test(df, simulate.p.value = TRUE, B = 1e6)
Fisher's Exact Test for Count Data with simulated p-value (based on 1e+06 replicates)
data: df
p-value = 0.1054
alternative hypothesis: two.sided
PS: Choosing between Fisher's exact test and the chi-squared test is a whole other discussion. I would refer you to this Cross Validated post for clarity: Alternatives to chisq-test
I want to use the BoxCoxTrans function in R to resolve a problem of skewness.
But I have a problem: I can't get the result as a data frame. This is my R code.
df <- read.csv("dataSetNA1.csv", header = TRUE)
dd1 <- apply(df[2:61], 2, BoxCoxTrans) #except the independent variable in the first column, all variables are numeric
dd1
$LT1Y_MXOD_AMT
Box-Cox Transformation
96249 data points used to estimate Lambda
Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0   19594       0 1600000
Lambda could not be estimated; no transformation is applied
$MOBL_PRIN
Box-Cox Transformation
96249 data points used to estimate Lambda
Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0  100000  191229  320000 1100000
Lambda could not be estimated; no transformation is applied
str(dd1)
I don't know how to get the result as a data frame.
If I use the as.data.frame function, this error message appears.
dd2<-as.data.frame(dd1)
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
  cannot coerce class ""BoxCoxTrans"" to a data.frame
please help me.
Here is one way to accomplish what you are after (I assume you are transforming the features):
library(caret)
data(cars)
#create a list with the BoxCox objects
g <- apply(cars, 2, BoxCoxTrans)
#use map2 from purrr to apply the models to new data
z <- purrr::map2(g, cars, function(x, y) predict(x, y))
#here the transformation is performed on the same data
#from which the Box-Cox lambda was estimated
B_trans = as.data.frame(do.call(cbind, z)) #to convert to data frame
head(data.frame(B_trans, cars), 20)
#output
speed dist speed.1 dist.1
1 4 0.8284271 4 2
2 4 4.3245553 4 10
3 7 2.0000000 7 4
4 7 7.3808315 7 22
5 8 6.0000000 8 16
6 9 4.3245553 9 10
7 10 6.4852814 10 18
8 10 8.1980390 10 26
9 10 9.6619038 10 34
10 11 6.2462113 11 17
11 11 8.5830052 11 28
12 12 5.4833148 12 14
13 12 6.9442719 12 20
14 12 7.7979590 12 24
15 12 8.5830052 12 28
16 13 8.1980390 13 26
17 13 9.6619038 13 34
18 13 9.6619038 13 34
19 13 11.5646600 13 46
20 14 8.1980390 14 26
The first two columns are the transformed data and the last two are the original data.
Another way is to incorporate the transformation of the features during training:
train(....preProcess = "BoxCox"...)
More on the matter: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/train
In order to perform a Box-Cox transformation your data have to be positive, i.e. the values should be greater than 0.
The reason for this is that the logarithm of 0 is -Inf.
If your data contain values of 0 you can just add 1 to each observation. This won't change your distribution/skewness.
A Box-Cox transformation is a transformation of your response variable. You could use the boxcox function of the MASS package to find out what transformation is needed. boxcox returns a lambda value. You should raise your response, say y, to the power lambda, which results in a new response variable, y*.
Then just replace the y column in your old data frame by y*.
Note that if the resulting lambda is 0, you should apply a logarithmic transformation, log(y), instead.
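As a minimal sketch of that workflow (the cars data and the lm formula are only placeholders for your own response and model):
library(MASS)
data(cars)
# profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(dist ~ speed, data = cars, lambda = seq(-2, 2, 0.1))
lambda <- bc$x[which.max(bc$y)]   # lambda with the highest log-likelihood
# build the transformed response y*: y^lambda, or log(y) when lambda is (close to) 0
cars$dist_star <- if (abs(lambda) < 0.05) log(cars$dist) else cars$dist^lambda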
I have a multilevel repeated-measures dataset of around 300 patients, each with up to 10 repeated measures, predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random-slope, random-intercept model where effects vary between patients and the effect of time differs between patients. When I try to introduce a first-order autoregressive covariance structure to allow for the correlation of measurements over time, I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
#baseline model includes only the intercept; the intercept varies randomly across patients
randomintercept <- lme(troponin ~ 1,
                       data = df, random = ~ 1 | record_id, method = "ML",
                       na.action = na.exclude,
                       control = list(opt = "optim"))
#random intercept and time as fixed effect
timeri <- update(randomintercept,.~. + day)
#random slopes and intercept: effect of time is different in different people
timers <- update(timeri, random = ~ day|record_id)
#model covariance structure. corAR1() first order autoregressive covariance structure, timepoints equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~day|record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
                  correlation = corAR1(0, form = ~ day | record_id),
                  control = list(opt = "nlminb"))
However, if you look at the fitted model, you'll see you have problems - the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r=0.998.
I think the problem is with the nature of the data. Most of the data seem to be in the range 10-50, but there are excursions by one or two orders of magnitude (e.g. individual 6, up to about 45000). It might be hard to fit a model to data this spiky. I would strongly suggest log-transforming your data; the standard diagnostic plot (plot(randomintercept)) looks like this:
whereas fitting on the log scale
rlog <- update(randomintercept,log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR+random-slopes model fits OK:
ar.rlog <- update(rlog,
                  random = ~ day | record_id,
                  correlation = corAR1(0, form = ~ day | record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(rlog,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))
Long story short, I want to write the code for the Kolmogorov-Smirnov one-sample statistic manually instead of using ks.test() in R. From what I understand, the K-S test statistic can be broken down into a ratio of a numerator and a denominator. I am interested in writing out the numerator, which, as I understand it, is the maximal absolute difference between the empirical distribution of a sample of observations and the theoretically expected distribution. Let's use the case below as an example:
Data Expected
1 0.01052632 0.008864266
2 0.02105263 0.010969529
13 0.05263158 0.018282548
20 0.06315789 0.031689751
22 0.09473684 0.046315789
24 0.26315789 0.210526316
26 0.27368421 0.220387812
27 0.29473684 0.236232687
28 0.30526316 0.252520776
3 0.42105263 0.365650970
4 0.42105263 0.372299169
5 0.45263158 0.398781163
6 0.49473684 0.452853186
7 0.50526316 0.460277008
8 0.73684211 0.656842105
9 0.74736842 0.665484765
10 0.75789474 0.691523546
11 0.77894737 0.718005540
12 0.80000000 0.735955679
14 0.84210526 0.791135734
15 0.86315789 0.809972299
16 0.88421053 0.838559557
17 0.89473684 0.857950139
18 0.96842105 0.958337950
19 0.97894737 0.968642659
21 0.97894737 0.979058172
23 0.98947368 0.989473684
25 1.00000000 1.000000000
Here, I want to obtain the maximal absolute difference (Data - Expected).
Anyone have an idea? I can rephrase this question, if necessary. Thanks!
I was looking for an answer along the lines of this code:
> A <- with(df, max(abs(Data - Expected)))
where df is the data frame.
Here, I obtain the differences between each Data and Expected value, take their absolute values, and select the maximum from the resulting vector of absolute differences. Thus, the answer is:
> A
[1] 0.082
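For the full one-sample K-S statistic, the empirical CDF also has to be evaluated just before each jump, not only at the observed points. A minimal sketch with a simulated sample and a hypothesised normal distribution (both chosen purely for illustration), which should agree with ks.test():
set.seed(1)
x <- sort(rnorm(30))
n <- length(x)
Fx <- pnorm(x)                             # theoretical CDF at the sorted sample points
D_plus  <- max(seq_len(n) / n - Fx)        # ECDF at each point minus theoretical CDF
D_minus <- max(Fx - (seq_len(n) - 1) / n)  # theoretical CDF minus ECDF just before each point
D <- max(D_plus, D_minus)
D
ks.test(x, "pnorm")$statistic              # should agree with D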