Add.times in R relsurv - r

I am new to coding and to R. Currently working with package relsurv. For this I would like to calculate the relative survival at certain timepoints.
I am using the following to assess RS at five years:
rcurve2 <- rs.surv(Surv(time_days17/365.241, event_17) ~ 1 +
                     ratetable(age = age_diagnosis*365.241, sex = sex,
                               year = year_diagnosis_days),
                   data = survdata, ratetable = swepop,
                   method = "ederer1", conf.int = 0.95, type = "kaplan-meier",
                   add.times = 5*365.241)
summary(rcurve2)
However, I get the same result in my summary output regardless of what number I put after add.times, i.e. for all event/censoring points (see below):
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0.205 177 1 0.9944 0.00562 0.9834 1.005
0.627 176 1 0.9888 0.00792 0.9734 1.004
0.742 175 1 0.9831 0.00968 0.9644 1.002
0.827 174 1 0.9775 0.01114 0.9559 1.000
0.849 173 1 0.9718 0.01242 0.9478 0.996
0.947 172 1 0.9662 0.01356 0.9400 0.993
...cont.
I am clearly not getting it right! Would be grateful for your help!

A very good question!
When adding "imaginary" times using add.times, they are automatically censored, and wont show up with the summary() function. To see your added times set censored = TRUE:
summary(rcurve2, censored = TRUE)
You should now find your added time in the list that follows.
EXAMPLE
Using the built-in data that ships with the relsurv package:
data(slopop)
data(rdata)
#note the last argument add.times=1000
rcurve2 <- rs.surv(Surv(time,cens)~sex+ratetable(age=age*365.241, sex=sex,
year=year), ratetable=slopop, data=rdata, add.times = 1000)
When using summary(rcurve2), the time 1000 won't show up:
>summary(rcurve2)
[...]
973 200 1 0.792 0.03081 0.734 0.855
994 199 1 0.790 0.03103 0.732 0.854
1002 198 1 0.783 0.03183 0.723 0.848
[...]
BUT using summary(rcurve2, censored=TRUE) it will!
>summary(rcurve2, censored=TRUE)
[...]
973 200 1 0.792 0.03081 0.734 0.855
994 199 1 0.790 0.03103 0.732 0.854
1000 198 0 0.791 0.03106 0.732 0.854
1002 198 1 0.783 0.03183 0.723 0.848
[...]
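A side note (not part of the original answer): since rs.surv() returns a survfit-style object, summary()'s times argument can also be used to read the estimate exactly at the added time point. Continuing the example above:
summary(rcurve2, times = 1000)        # estimate evaluated at the added time
For the original five-year question the analogous call would be summary(rcurve2, times = 5*365.241), after fitting with add.times = 5*365.241.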

Related

Emmeans is reporting different estimates and CIs for marginal means if printed as data.frame

After fitting an LMM, I am using the emmeans() function to extract the estimated marginal means, SEs and confidence intervals. However, depending on whether I extract the means directly or save them as a data frame, the estimates, their SEs and their confidence intervals differ. Any insight would be appreciated.
Example (was not able to use dput and provide raw data due to character limit):
> summary(model)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: asin(sqrt(r_index)) ~ year + prov_season + factor_month + group + prov_season * year * group + (1 | individual)
Extraction of emmeans directly:
mm <- emmeans(model, pairwise ~ prov_season*year | group, at = list(year = c(1:8))) # extract estimates, SEs and CIs
> print(mm$emmeans)
group = naive:
prov_season year emmean SE df lower.CL upper.CL
in 1 0.0112 0.1587 309 -0.3011 0.324
off 1 0.0872 0.1768 378 -0.2604 0.435
in 2 0.0229 0.1437 253 -0.2600 0.306
off 2 0.1186 0.1577 313 -0.1916 0.429
in 3 0.0345 0.1305 203 -0.2228 0.292
off 3 0.1500 0.1405 247 -0.1268 0.427
in 4 0.0461 0.1199 162 -0.1906 0.283
off 4 0.1814 0.1261 189 -0.0674 0.430
in 5 0.0577 0.1125 136 -0.1647 0.280
off 5 0.2128 0.1155 148 -0.0155 0.441
in 6 0.0693 0.1090 125 -0.1465 0.285
off 6 0.2442 0.1098 128 0.0268 0.462
in 7 0.0810 0.1099 129 -0.1364 0.298
off 7 0.2756 0.1098 129 0.0584 0.493
in 8 0.0926 0.1149 149 -0.1345 0.320
off 8 0.3070 0.1154 151 0.0790 0.535
group = provisioned:
prov_season year emmean SE df lower.CL upper.CL
in 1 0.4076 0.0924 314 0.2258 0.589
off 1 0.2519 0.1043 413 0.0469 0.457
in 2 0.4422 0.0907 307 0.2638 0.621
off 2 0.2528 0.1000 381 0.0561 0.449
in 3 0.4768 0.0899 305 0.2999 0.654
off 3 0.2538 0.0970 355 0.0630 0.444
in 4 0.5114 0.0902 308 0.3339 0.689
off 4 0.2547 0.0952 337 0.0674 0.442
in 5 0.5461 0.0915 315 0.3659 0.726
off 5 0.2557 0.0949 329 0.0690 0.442
in 6 0.5807 0.0938 325 0.3961 0.765
off 6 0.2566 0.0959 331 0.0680 0.445
in 7 0.6153 0.0970 339 0.4245 0.806
off 7 0.2576 0.0983 342 0.0643 0.451
in 8 0.6499 0.1010 355 0.4512 0.849
off 8 0.2585 0.1019 361 0.0581 0.459
Results are averaged over the levels of: factor_month
Degrees-of-freedom method: kenward-roger
Results are given on the asin(sqrt(mu)) (not the response) scale.
Confidence level used: 0.95
Extraction of emmeans as.data.frame():
> as.data.frame(mm)
group prov_season year contrast emmean SE df lower.CL upper.CL
naive in 1 . 0.011232 0.15872 309 -0.5897 0.61217
naive off 1 . 0.087219 0.17677 378 -0.5806 0.75500
naive in 2 . 0.022854 0.14365 253 -0.5225 0.56821
naive off 2 . 0.118613 0.15767 313 -0.4783 0.71550
naive in 3 . 0.034476 0.13049 203 -0.4628 0.53172
naive off 3 . 0.150007 0.14053 247 -0.3837 0.68374
naive in 4 . 0.046098 0.11986 162 -0.4128 0.50498
naive off 4 . 0.181401 0.12615 189 -0.2999 0.66275
naive in 5 . 0.057720 0.11249 136 -0.3749 0.49036
naive off 5 . 0.212795 0.11555 148 -0.2306 0.65616
naive in 6 . 0.069342 0.10904 125 -0.3511 0.48977
naive off 6 . 0.244189 0.10984 128 -0.1790 0.66738
naive in 7 . 0.080964 0.10988 129 -0.3423 0.50419
naive off 7 . 0.275583 0.10979 129 -0.1473 0.69850
naive in 8 . 0.092586 0.11491 149 -0.3483 0.53345
naive off 8 . 0.306977 0.11541 151 -0.1356 0.74957
provisioned in 1 . 0.407628 0.09240 314 0.0578 0.75742
provisioned off 1 . 0.251854 0.10425 413 -0.1416 0.64535
provisioned in 2 . 0.442235 0.09067 307 0.0989 0.78555
provisioned off 2 . 0.252805 0.10002 381 -0.1250 0.63063
provisioned in 3 . 0.476842 0.08994 305 0.1363 0.81742
provisioned off 3 . 0.253756 0.09698 355 -0.1128 0.62035
provisioned in 4 . 0.511450 0.09023 308 0.1698 0.85310
provisioned off 4 . 0.254708 0.09524 337 -0.1055 0.61493
provisioned in 5 . 0.546057 0.09154 315 0.1995 0.89257
provisioned off 5 . 0.255659 0.09488 329 -0.1033 0.61460
provisioned in 6 . 0.580665 0.09382 325 0.2257 0.93566
provisioned off 6 . 0.256610 0.09590 331 -0.1062 0.61941
provisioned in 7 . 0.615272 0.09700 339 0.2484 0.98214
provisioned off 7 . 0.257561 0.09827 342 -0.1141 0.62920
provisioned in 8 . 0.649879 0.10101 355 0.2681 1.03170
provisioned off 8 . 0.258513 0.10190 361 -0.1266 0.64363
as.data.frame() is applying a Bonferroni correction to the confidence intervals by default, based on both the contrasts and the means in the table.
You can change this behaviour by passing, e.g., adjust = "none" as an argument.
There is more detail on this behaviour (with respect to p-value adjustment) in answers to this question.
Why is converting emmeans contrasts to a data.frame not reporting correct p-values?
It can be difficult to predict what adjustment emmeans makes to p-values and confidence intervals, and it's not always obvious from the documentation, so it's usually better to control it explicitly.
By the way, even though you couldn't paste your full data, it would have been easy enough to make a small reproducible dataset to demonstrate your problem; emmeans with any linear model will behave in the same way.
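For illustration, a minimal sketch (the model and mm objects are the asker's and are not reproduced here) of switching the adjustment off when converting:
library(emmeans)
mm <- emmeans(model, pairwise ~ prov_season * year | group, at = list(year = 1:8))
# pass adjust = "none" through to summary(); the CIs should then match print(mm$emmeans)
as.data.frame(mm$emmeans, adjust = "none")
summary(mm$emmeans, adjust = "none", infer = TRUE)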

How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
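The thread contains no answer, but here is a rough sketch of one possible approach (the file name and layout assumptions are mine, not from the post): since QUAD and LAI now sit in a header row with a matching value row at the top of each sheet, the value row can be indexed by the header names instead of matching row labels.
library(stringr)
lines  <- readLines("datasheet.txt")                  # hypothetical file name
header <- str_split(str_trim(lines[1]), "\\s+")[[1]]  # "FILE" "DATE" ... "QUAD" "LAI" ...
values <- str_split(str_trim(lines[2]), "\\s+")[[1]]  # "20" "20210805" ... "1161" "2.80" ...
data_extract <- data.frame(
  QUAD = as.numeric(values[which(header == "QUAD")]),
  LAI  = as.numeric(values[which(header == "LAI")])
)
data_extract
#   QUAD LAI
# 1 1161 2.8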

How do I include p-value and R-square for the estimates in semPaths?

I am using semPaths (semPlot package) to draw my structural equation models. After some trial and error, I have a pretty good script to show what I want. Except, I haven’t been able to figure out how to include the p-value/significance levels of the estimates/regression coefficients in the figure.
Can I include significance levels, e.g. as a p-value in the edge labels below the estimate, or as a broken line for non-significance, or …? And if so, how?
I am also interested in including the R-square, but not as critically as the significance level.
This is the script I am using so far:
semPaths(fitmod.bac.class2,
what = "std",
whatLabels = "std",
style="ram",
edge.label.cex = 1.3,
layout = 'tree',
intercepts=FALSE,
residuals=FALSE,
nodeLabels = c("Negati-\nvicutes","cand_class\n_MB_A2_108", "CO2", "Bacilli","Ignavi-\nbacteria","C/N", "pH","Water\ncontent"),
sizeMan=7 )
Example of one of the SemPath outputs
In this example the following are not significant:
Ignavibacteria -> First_C_CO2_ugC_gC_day, p = 0.096
pH -> Ignavibacteria, p = 0.151
cand_class_MB_A2_108 <-> Bacilli correlation, p = 0.054
I am an R user and not really a coder, so I might just be missing a crucial point in the arguments.
I am testing a lot of different models at the moment, and would really like not to have to draw them all up by hand.
update:
Using semPlotModel: Am I right in understanding that semPlotModel doesn’t include the significance levels from the sem function (see my script and output below)? I am specifically looking to include the P(>|z|) for regressions and covariance.
Is it just me that is missing that, or is it not included? If it is not included, my solution is simply just to custom the edge labels.
{model.NA.UP.bac.class2 <- '
#LATENT VARIABLES
#REGRESSIONS
#soil organic carbon quality
c_Negativicutes ~ CN
#microorganisms
First_C_CO2_ugC_gC_day ~ c_Bacilli
First_C_CO2_ugC_gC_day ~ c_Ignavibacteria
First_C_CO2_ugC_gC_day ~ c_cand_class_MB_A2_108
First_C_CO2_ugC_gC_day ~ c_Negativicutes
#pH
c_Bacilli ~pH
c_Ignavibacteria ~pH
c_cand_class_MB_A2_108~pH
c_Negativicutes ~pH
#COVARIANCE
initial_water ~~ CN
c_cand_class_MB_A2_108 ~~ c_Bacilli
'
fitmod.bac.class2 <- sem(model.NA.UP.bac.class2, data=datapNA.UP.log, missing="ml", meanstructure=TRUE, fixed.x=FALSE, std.lv=FALSE, std.ov=FALSE)
summary(fitmod.bac.class2, standardized=TRUE, fit.measures=TRUE, rsq=TRUE)
out <- capture.output(summary(fitmod.bac.class2, standardized=TRUE, fit.measures=TRUE, rsq=TRUE))
}
Output:
lavaan 0.6-5 ended normally after 188 iterations
Estimator ML
Optimization method NLMINB
Number of free parameters 28
Number of observations 30
Number of missing patterns 1
Model Test User Model:
Test statistic 17.816
Degrees of freedom 16
P-value (Chi-square) 0.335
Model Test Baseline Model:
Test statistic 101.570
Degrees of freedom 28
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.975
Tucker-Lewis Index (TLI) 0.957
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) 472.465
Loglikelihood unrestricted model (H1) 481.373
Akaike (AIC) -888.930
Bayesian (BIC) -849.697
Sample-size adjusted Bayesian (BIC) -936.875
Root Mean Square Error of Approximation:
RMSEA 0.062
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.185
P-value RMSEA <= 0.05 0.414
Standardized Root Mean Square Residual:
SRMR 0.107
Parameter Estimates:
Information Observed
Observed information based on Hessian
Standard errors Standard
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
c_Negativicutes ~
CN 0.419 0.143 2.939 0.003 0.419 0.416
c_cand_class_MB_A2_108 ~
CN -0.433 0.160 -2.707 0.007 -0.433 -0.394
First_C_CO2_ugC_gC_day ~
c_Bacilli 0.525 0.128 4.092 0.000 0.525 0.496
c_Ignavibacter 0.207 0.124 1.667 0.096 0.207 0.195
c_c__MB_A2_108 0.310 0.125 2.475 0.013 0.310 0.301
c_Negativicuts 0.304 0.137 2.220 0.026 0.304 0.271
c_Bacilli ~
pH 0.624 0.135 4.604 0.000 0.624 0.643
c_Ignavibacteria ~
pH 0.245 0.171 1.436 0.151 0.245 0.254
c_cand_class_MB_A2_108 ~
pH 0.393 0.151 2.597 0.009 0.393 0.394
c_Negativicutes ~
pH 0.435 0.129 3.361 0.001 0.435 0.476
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
CN ~~
initial_water 0.001 0.000 2.679 0.007 0.001 0.561
.c_cand_class_MB_A2_108 ~~
.c_Bacilli -0.000 0.000 -1.923 0.054 -0.000 -0.388
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.c_Negativicuts 0.145 0.198 0.734 0.463 0.145 3.826
.c_c__MB_A2_108 1.038 0.226 4.594 0.000 1.038 25.076
.Frs_C_CO2_C_C_ -0.346 0.233 -1.485 0.137 -0.346 -8.115
.c_Bacilli 0.376 0.135 2.778 0.005 0.376 9.340
.c_Ignavibacter 0.754 0.170 4.424 0.000 0.754 18.796
CN 0.998 0.007 145.158 0.000 0.998 26.502
pH 0.998 0.008 131.642 0.000 0.998 24.034
initial_water 0.998 0.008 125.994 0.000 0.998 23.003
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.c_Negativicuts 0.001 0.000 3.873 0.000 0.001 0.600
.c_c__MB_A2_108 0.001 0.000 3.833 0.000 0.001 0.689
.Frs_C_CO2_C_C_ 0.001 0.000 3.873 0.000 0.001 0.408
.c_Bacilli 0.001 0.000 3.873 0.000 0.001 0.586
.c_Ignavibacter 0.002 0.000 3.873 0.000 0.002 0.936
CN 0.001 0.000 3.873 0.000 0.001 1.000
initial_water 0.002 0.000 3.873 0.000 0.002 1.000
pH 0.002 0.000 3.873 0.000 0.002 1.000
R-Square:
Estimate
c_Negativicuts 0.400
c_c__MB_A2_108 0.311
Frs_C_CO2_C_C_ 0.592
c_Bacilli 0.414
c_Ignavibacter 0.064
Warning message:
In lav_model_hessian(lavmodel = lavmodel, lavsamplestats = lavsamplestats, :
lavaan WARNING: Hessian is not fully symmetric. Max diff = 5.15131396241486e-05
This example is taken from ?semPaths since we don't have your object.
library('semPlot')
modFile <- tempfile(fileext = '.OUT')
download.file('http://sachaepskamp.com/files/mi1.OUT', modFile)
Use semPlotModel to get the object without plotting. There you can inspect what is to be plotted. I just dug around without reading the docs until I found what it seems to be using.
After you run semPlotModel, the object has a slot x@Pars which contains the edges, nodes, and the std which is being used for the edge labels in your case. semPaths also has an argument that allows you to make custom edge labels, so you can take the data you need from x@Pars and add your p-values:
x <- semPlotModel(modFile)
x@Pars
# label lhs edge rhs est std group fixed par
# 1 lambda[11]^{(y)} perfIQ -> pc 1.000 0.6219648 Group 1 TRUE 0
# 2 lambda[21]^{(y)} perfIQ -> pa 0.923 0.5664888 Group 1 FALSE 1
# 3 lambda[31]^{(y)} perfIQ -> oa 1.098 0.6550159 Group 1 FALSE 2
# 4 lambda[41]^{(y)} perfIQ -> ma 0.784 0.4609990 Group 1 FALSE 3
# 5 theta[11]^{(epsilon)} pc <-> pc 5.088 0.6131598 Group 1 FALSE 5
# 10 theta[22]^{(epsilon)} pa <-> pa 5.787 0.6790905 Group 1 FALSE 6
# 15 theta[33]^{(epsilon)} oa <-> oa 5.150 0.5709541 Group 1 FALSE 7
# 20 theta[44]^{(epsilon)} ma <-> ma 7.311 0.7874800 Group 1 FALSE 8
# 21 psi[11] perfIQ <-> perfIQ 3.210 1.0000000 Group 1 FALSE 4
# 22 tau[1]^{(y)} int pc 10.500 NA Group 1 FALSE 9
# 23 tau[2]^{(y)} int pa 10.374 NA Group 1 FALSE 10
# 24 tau[3]^{(y)} int oa 10.663 NA Group 1 FALSE 11
# 25 tau[4]^{(y)} int ma 10.371 NA Group 1 FALSE 12
# 11 lambda[11]^{(y)} perfIQ -> pc 1.000 0.6515609 Group 2 TRUE 0
# 27 lambda[21]^{(y)} perfIQ -> pa 0.923 0.5876948 Group 2 FALSE 1
# 31 lambda[31]^{(y)} perfIQ -> oa 1.098 0.6981974 Group 2 FALSE 2
# 41 lambda[41]^{(y)} perfIQ -> ma 0.784 0.4621919 Group 2 FALSE 3
# 51 theta[11]^{(epsilon)} pc <-> pc 5.006 0.5754684 Group 2 FALSE 14
# 101 theta[22]^{(epsilon)} pa <-> pa 5.963 0.6546148 Group 2 FALSE 15
# 151 theta[33]^{(epsilon)} oa <-> oa 4.681 0.5125204 Group 2 FALSE 16
# 201 theta[44]^{(epsilon)} ma <-> ma 8.356 0.7863786 Group 2 FALSE 17
# 211 psi[11] perfIQ <-> perfIQ 3.693 1.0000000 Group 2 FALSE 13
# 221 tau[1]^{(y)} int pc 10.500 NA Group 2 FALSE 9
# 231 tau[2]^{(y)} int pa 10.374 NA Group 2 FALSE 10
# 241 tau[3]^{(y)} int oa 10.663 NA Group 2 FALSE 11
# 251 tau[4]^{(y)} int ma 10.371 NA Group 2 FALSE 12
# 26 alpha[1] int perfIQ -2.469 NA Group 2 FALSE 18
As you can see, there are more edge labels than ones that are plotted, and I have no idea how it chooses which to use, so I am just taking the first four from each group (since there are four edges shown and the stds match those). Maybe there is an option to plot all of them or to select which ones you need; I haven't read the docs.
## take first four stds from each group, generate some p-values
l <- sapply(split(x@Pars$std, x@Pars$group), function(x) head(x, 4))
set.seed(1)
l <- sprintf('%.3f, p=%s', l, format.pval(runif(length(l)), digits = 2))
l
# [1] "0.622, p=0.27" "0.566, p=0.37" "0.655, p=0.57" "0.461, p=0.91" "0.652, p=0.20" "0.588, p=0.90" "0.698, p=0.94" "0.462, p=0.66"
Then you can plot the object with your new labels, edgeLabels = l
layout(1:2)
semPaths(
x,
edgeLabels = l,
ask = FALSE, title = FALSE,
what = 'std',
whatLabels = 'std',
style = 'ram',
edge.label.cex = 1.3,
layout = 'tree',
intercepts = FALSE,
residuals = FALSE,
sizeMan = 7
)
With the help from @rawr, I have worked it out. If anybody else needs to include estimates and p-values from lavaan in their semPaths, here is how it can be done.
#extracting the parameters from the sem model and selecting the interactions relevant for the semPaths (here, I need 12 estimates and p-values)
table2<-parameterEstimates(fitmod.bac.class2,standardized=TRUE) %>% head(12)
#turning the chosen parameters into text
b<-gettextf('%.3f \n p=%.3f', table2$std.all, digits=table2$pvalue)
I can honestly say that I do not understand how the last bit of script works. It is copied from rawr's answer, followed by a lot of trial and error until it worked. There might (quite possibly) be a nicer way to write it, but it works :)
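For what it's worth (my reading, not from the original answers): gettextf() simply passes its extra arguments on to sprintf(), so the digits = table2$pvalue part is just consumed as the second value substituted into the format string. An equivalent and arguably clearer version would be:
b <- sprintf('%.3f \n p=%.3f', table2$std.all, table2$pvalue)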
#putting that list into edgeLabels in sempaths
semPaths(fitmod.bac.class2,
what = "std",
edgeLabels = b,
style="ram",
edge.label.cex = 1,
layout = 'tree',
intercepts=FALSE,
residuals=FALSE,
nodeLabels = c("Negati-\nvicutes","cand_class\n_MB_A2_108", "CO2", "Bacilli","Ignavi-\nbacteria","C/N", "pH","Water\ncontent"),
sizeMan=7
)
Just a small but relevant detail to improve the above answer.
The above code requires inspecting the parameter table to count how many rows to keep, as in the %>% head(12) step above.
Instead, we can drop from the extracted parameter table the rows in which lhs and rhs are equal (the variance terms), keeping only the regressions and covariances.
#extracting the parameters from the sem model and selecting the interactions relevant for the semPaths
table2 <- parameterEstimates(fitmod.bac.class2, standardized=TRUE) %>% as.data.frame()
table2<-table2[!table2$lhs==table2$rhs,]
If the model syntax also contains extra lines, such as ones defined with ':=', those will appear in the parameter table as well and should also be removed.
The rest stays the same...
#turning the chosen parameters into text
b<-gettextf('%.3f \n p=%.3f', table2$std.all, digits=table2$pvalue)
#putting that list into edgeLabels in sempaths
semPaths(fitmod.bac.class2,
what = "std",
edgeLabels = b,
style="ram",
edge.label.cex = 1,
layout = 'tree',
intercepts=FALSE,
residuals=FALSE,
nodeLabels = c("Negati-\nvicutes","cand_class\n_MB_A2_108", "CO2", "Bacilli","Ignavi-\nbacteria","C/N", "pH","Water\ncontent"),
sizeMan=7
)

How to calculate the Bonferroni Lower and Upper limits in R?

With the following data, I am trying to calculate the Chi-square statistic and the Bonferroni lower and upper confidence limits. The column "Data_No" identifies the dataset (as the calculations need to be done separately for each dataset).
Data_No Area Observed
1 3353 31
1 2297 2
1 1590 15
1 1087 16
1 817 2
1 847 10
1 1014 28
1 872 29
1 1026 29
1 1215 21
2 3353 31
2 2297 2
2 1590 15
3 1087 16
3 817 2
The code I used is
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected,
           APU = Observed/sum(Observed), Alpha = 0.05/2*count(Data_No),
           Zvalue = qnorm(Alpha, lower.tail=FALSE),
           lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data, data)
}
write.csv(result_data,file='final_result.csv')
And the error message I get is:
Error in UseMethod("summarise_") : no applicable method for
'summarise_' applied to an object of class "c('integer', 'numeric')"
The column that I am calling "Alpha" is the alpha value of 0.05/(2k), where k is the number of categories. In my example, the first dataset (Data_No == 1) has 10 categories (rows), so "Alpha" needs to be 0.05/20 = 0.0025, and its corresponding Z value is 2.807. The second dataset has 3 categories (so 0.05/6) and the third has 2 categories (0.05/4) in my example table. Using the values from the newly calculated "Alpha" column, I then need to calculate the "Zvalue" column (Zvalue = qnorm(Alpha, lower.tail=FALSE)), which I then use to calculate the lower and upper confidence intervals.
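For reference, these Z values can be reproduced in R rather than looked up by hand (category counts taken from the description above):
qnorm(0.05/(2*10), lower.tail = FALSE)  # 10 categories -> 2.807
qnorm(0.05/(2*3),  lower.tail = FALSE)  #  3 categories -> 2.394
qnorm(0.05/(2*2),  lower.tail = FALSE)  #  2 categories -> 2.241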
From the above data, here are the results that I should get, but note that I have had to manually calculate Alpha column and Zvalue, rather than insert those calculations within the R code:
Data_No Area Observed RelativeArea Alpha Z value lower upper
1 3353 31 0.237 0.003 2.807 0.092 0.247
1 2297 2 0.163 0.003 2.807 -0.011 0.033
1 1590 15 0.113 0.003 2.807 0.025 0.139
1 1087 16 0.077 0.003 2.807 0.029 0.146
1 817 2 0.058 0.003 2.807 -0.011 0.033
1 847 10 0.060 0.003 2.807 0.007 0.102
1 1014 28 0.072 0.003 2.807 0.078 0.228
1 872 29 0.062 0.003 2.807 0.083 0.234
1 1026 29 0.073 0.003 2.807 0.083 0.234
1 1215 21 0.086 0.003 2.807 0.049 0.181
2 3353 31 0.463 0.008 2.394 0.481 0.811
2 2297 2 0.317 0.008 2.394 -0.027 0.111
2 1590 15 0.220 0.008 2.394 0.152 0.473
3 1087 16 0.571 0.013 2.241 0.723 1.055
3 817 2 0.429 0.013 2.241 -0.055 0.277
Please note that I only included some of the columns generated from the code.
# You need to check the closing bracket in the sqrt() term of the lower confidence limit. The following code should work.
data <- read.csv("test.csv")
data <- data %>% mutate(RelativeArea =
Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE =
Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU =
Observed/sum(Observed), lower =
APU-2.394*sqrt(APU*(1-APU)/sum(Observed)), upper =
APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))
#Answer to follow-up question.
#Sample Data
Data_No Area Observed
1 3353 31
1 2297 2
2 1590 15
2 1087 16
#Code to run
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           lower = APU-2.394*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data, data)
}
write.csv(result_data,file='final_result.csv')
#Issue in calculating Alpha. I have updated the code.
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
#Creating the NO_OF_CATEGORIES column based on your question.
total_data$NO_OF_CATEGORIES <- 0
total_data[which(total_data$Data_No==1),]$NO_OF_CATEGORIES <- 10
total_data[which(total_data$Data_No==2),]$NO_OF_CATEGORIES <- 3
total_data[which(total_data$Data_No==3),]$NO_OF_CATEGORIES <- 2
#Actual code
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           Alpha = 0.05/(2*(unique(data$NO_OF_CATEGORIES))),
           Zvalue = qnorm(Alpha, lower.tail=FALSE),
           lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data, data)
}
write.csv(result_data,file='final_result.csv')
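As a further sketch (not part of the original answer), the per-dataset loop and the hand-made NO_OF_CATEGORIES column can also be avoided with a grouped mutate, since n() inside group_by(Data_No) is exactly the number of categories k used in 0.05/(2k):
library(dplyr)
result_data <- total_data %>%
  group_by(Data_No) %>%
  mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed),
         OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected,
         APU = Observed/sum(Observed),
         Alpha = 0.05/(2*n()),                      # k = rows per Data_No
         Zvalue = qnorm(Alpha, lower.tail = FALSE),
         lower = APU - Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
         upper = APU + Zvalue*sqrt(APU*(1-APU)/sum(Observed))) %>%
  ungroup()
write.csv(result_data, file = 'final_result.csv')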

R lavaan sem categorical variable no standard error

I want to compute a structural equation model with the sem() function in R with the package lavaan.
There are two categorical variables, one latent exogenous and one latent endogenous, that I want to include in the final version of the model.
When I include one of the categorical variables in the model, however, R produces the following warning:
1: In estimateVCOV(lavaanModel, samplestats = lavaanSampleStats,
options = lavaanOptions, : lavaan WARNING: could not compute
standard errors!
2: In computeTestStatistic(lavaanModel, partable = lavaanParTable, :
lavaan WARNING: could not compute scaled test statistic
Code used:
model1 <- '
Wertschaetzung_Essen =~ abwechslungsreiche_M + schnell_zubereitbar + koche_sehr_gerne + koche_sehr_haeufig
Fleischverzicht =~ Ern_Index1
Fleischverzicht ~ Wertschaetzung_Essen
'
fit_model1 <- sem(model1, data=survey2_subset, ordered = c("Ern_Index1"))
Note: This is only a small version of the final model, in which I introduce only one categorical variable. The warning, however, is the same for more complex versions of the model.
Output
str(survey2_subset):
'data.frame': 3676 obs. of 116 variables:
$ abwechslungsreiche_M : num 4 2 3 4 3 3 4 3 3 3 ...
$ schnell_zubereitbar : num 0 3 2 0 0 1 3 2 1 1 ...
$ koche_sehr_gerne : num 1 3 3 1 3 1 4 4 4 3 ...
$ koche_sehr_haeufig : num 2 2 3 NA 3 2 2 4 3 3 ...
$ Ern_Index1 : num 1 1 1 1 0 0 1 0 1 0 ...
summary(fit_model1, fit.measures = TRUE, standardized=TRUE)
lavaan (0.5-15) converged normally after 31 iterations
Used Total
Number of observations 3469 3676
Estimator DWLS Robust
Minimum Function Test Statistic 13.716 NA
Degrees of freedom 4 4
P-value (Chi-square) 0.008 NA
Scaling correction factor NA
Shift parameter
for simple second-order correction (Mplus variant)
Model test baseline model:
Minimum Function Test Statistic 2176.159 1582.139
Degrees of freedom 10 10
P-value 0.000 0.000
User model versus baseline model:
Comparative Fit Index (CFI) 0.996 NA
Tucker-Lewis Index (TLI) 0.989 NA
Root Mean Square Error of Approximation:
RMSEA 0.026 NA
90 Percent Confidence Interval 0.012 0.042 NA NA
P-value RMSEA <= 0.05 0.994 NA
Parameter estimates:
Information Expected
Standard Errors Robust.sem
Estimate Std.err Z-value P(>|z|) Std.lv Std.all
Latent variables:
Wertschaetzung_Essen =~
abwchslngsr_M 1.000 0.363 0.436
schnll_zbrtbr 1.179 0.428 0.438
koche_shr_grn 2.549 0.925 0.846
koche_shr_hfg 2.530 0.918 0.775
Fleischverzicht =~
Ern_Index1 1.000 0.249 0.249
Regressions:
Fleischverzicht ~
Wrtschtzng_Es 0.302 0.440 0.440
Intercepts:
abwchslngsr_M 3.133 3.133 3.760
schnll_zbrtbr 1.701 1.701 1.741
koche_shr_grn 2.978 2.978 2.725
koche_shr_hfg 2.543 2.543 2.148
Wrtschtzng_Es 0.000 0.000 0.000
Fleischvrzcht 0.000 0.000 0.000
Thresholds:
Ern_Index1|t1 0.197 0.197 0.197
Variances:
abwchslngsr_M 0.562 0.562 0.810
schnll_zbrtbr 0.771 0.771 0.808
koche_shr_grn 0.339 0.339 0.284
koche_shr_hfg 0.559 0.559 0.399
Ern_Index1 0.938 0.938 0.938
Wrtschtzng_Es 0.132 1.000 1.000
Fleischvrzcht 0.050 0.806 0.806
Is the model not identified? There should be enough degrees of freedom and the loadings of the first manifest items are set to one.
How can I resolve this issue?
My first thought was:
You can't have missing values in the data frame, because with categorical variables WLSMV is used, and FIML (missing="ML") is only usable with ML estimation. Perhaps that's the problem.
Also: Does lavaan automatically fix the residual-variance of "Fleischverzicht" to 0 (or some other value)? A single-item latent variable would not be identified without that, I think.
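A hedged sketch (reusing the variable names from the question) of what explicitly fixing that residual variance would look like in lavaan syntax; whether this actually removes the warning will depend on the data and on the parameterization used for the ordered indicator:
library(lavaan)
model1b <- '
  Wertschaetzung_Essen =~ abwechslungsreiche_M + schnell_zubereitbar +
                          koche_sehr_gerne + koche_sehr_haeufig
  Fleischverzicht =~ 1*Ern_Index1
  Ern_Index1 ~~ 0*Ern_Index1   # fix the single-indicator residual variance to zero
  Fleischverzicht ~ Wertschaetzung_Essen
'
fit_model1b <- sem(model1b, data = survey2_subset, ordered = "Ern_Index1")
summary(fit_model1b, standardized = TRUE)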
