Survival analysis: AFT model, simexaft package in R

We are trying to reproduce the results of an accelerated failure time (AFT) model in R, which has been coded in SAS.
The data set we use is here
There you can find the SAS code as well.
formula <- Surv(Duration, Censor) ~ Acq_Expense + Acq_Expense_SQ + Ret_Expense + Ret_Expense_SQ + Crossbuy + Frequency + Frequency_SQ + Industry + Revenue + Employees
out1 <- survreg(formula = formula, data = daten[daten$Acquisition == 1, ], dist = "weibull")
summary(out1)
ind <- c("Duration", "Censor")
err.mat <- ???
out2 <- simexaft(formula = formula, data = daten[daten$Acquisition == 1, ], SIMEXvariable = ind, repeated = FALSE, err.mat = err.mat, dist = "weibull")
summary(out2)
Our question is: how do we define the err.mat term?
err.mat describes the measurement errors of the variables. Since our data set is right censored, I thought the variables with measurement error are probably Duration and/or Censor, but it is not as simple as that: err.mat must be a square, symmetric numeric matrix.

If you read the Journal of Statistical Software article describing the simexaft package (January 2012, Volume 46), it becomes clear that, in the situation without repeated measurements from which to estimate the measurement errors, you must supply those estimates yourself from domain knowledge. See the example on pages 6-8, and also the cited Statistics in Medicine article available at Dr. Yi's website. In that example, the variables with measurement error are the first two predictors, systolic blood pressure (SBP) and serum cholesterol (CHOL). If you are using the textbook from which you are extracting that data, you will need to read the chapter text (which does not appear to be available at that website) to determine what assumptions the authors make about the measurement errors.
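To make the mechanics concrete, here is a minimal sketch of how err.mat could be supplied once you have decided which covariates are error-prone and what their error variances are; the choice of Acq_Expense and Ret_Expense and the variance values below are purely illustrative assumptions, not values taken from the book:
library(simexaft)
# Assumed error-prone covariates (illustrative choice, not from the book)
ind <- c("Acq_Expense", "Ret_Expense")
# err.mat is the covariance matrix of the measurement errors: square and symmetric,
# with the assumed error variances on the diagonal (placeholder numbers)
err.mat <- diag(c(0.5^2, 0.3^2))
dimnames(err.mat) <- list(ind, ind)
out2 <- simexaft(formula = formula, data = daten[daten$Acquisition == 1, ],
                 SIMEXvariable = ind, repeated = FALSE,
                 err.mat = err.mat, dist = "weibull")
summary(out2)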


Master thesis help. Autocorrelation/Lagrange test for master thesis in political science

I'm currently trying to determine how many lags I should include in my linear regression analysis in R.
The study is about whether the presence of commercial military actors (CMA) correlates with, or causes, more military and/or civilian deaths. My supervisor is very keen on me using a Lagrange multiplier test to check how many lags I need. However, he is not an R user and can't help me implement it. He also wants me to include panel-corrected standard errors (PCSE) as proposed by Beck and Katz.
Short variable description
DV = log_military_cas; a log transformation of yearly military deaths on a country basis
IV = CMA; a dummy-coded variable indicating either CMA presence in a country-year combination (1) or no presence (0)
lag variable = lag_md; log_md lagged one year.
DATA = lagr
This is what my supervisor sent me:
Testing for serial correlation. This is what I wrote down in my notes as a grad student:
Using the Lagrange multiplier test first recommended by Engle (1984) (but also used by Beck and Katz (1996)), this is done in two steps: 1) estimate the model and save the residuals, and 2) regress these residuals on their first lag and the independent variable. If the lag of the residual is statistically significant in the last regression, more lags of the dependent variable are needed.
<-- So just do this, but with a model without any lags of the dependent variable. If you find serial correlation, include a lag of the DV and test again.
My question is twofold: 1) what am I doing wrong in the attached code, and 2) should the baseline regression include PCSE?
# no lag
lagtest_0a <- lm(log_military_cas ~ CMA + as.factor(country) + as.factor(year), data = lagr)
# save residuals
lagr$Risid_0 <- resid(lagtest_0a)
lagtest_0b <- lm(log_military_cas ~ CMA + Risid_0 + as.factor(country) + as.factor(year), data = lagr)
summary(lagtest_0b)
# Risid_0 is significant, so I need at least one lag
# lag 1
lagtest_1a <- lm(log_military_cas ~ CMA + lag_md + as.factor(country) + as.factor(year), data = lagr)
# save new residuals
lagr$Risid1 <- resid(lagtest_1a)
# here the following error arrives:
Error in `$<-.data.frame`(`*tmp*`, Risid1, value = c(`2` = 1.84005148256506, :
replacement has 2855 rows, data has 2856
# Then I'm thinking, maybe I shouldn't store Risid_0 in the lagr data frame. So I try without that, just storing it on its own.
# save new residuals in a new way
Rs_lagtest_md1 <- resid(lagtest_1a)
# rerun model
lagtest1 <- lm(log_military_cas ~ CMA + Rs_lagtest_md1 + as.factor(country) + as.factor(year), data = lagr)
# Then, the following error arrives:
Error in model.frame.default(formula = log_military_cas ~ CMA + Rs_lagtest_md1 + :
variable lengths differ (found for 'Rs_lagtest_md1')
It seems the problem is that when I include lag_md (which has NAs for the first year, since it is lagged), the lengths of the variables are no longer the same. However, as far as I know, R omits NAs by default; I even tried to specify this with na.action = na.omit, but the same error arrives.
I hope someone can help me.
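The length error comes from lm() dropping the NA rows, so the residual vector is one element shorter than lagr; fitting with na.action = na.exclude makes resid() pad those rows with NA and keeps everything aligned. The supervisor's note also describes step 2 as regressing the residuals on their own first lag plus the independent variable, so a sketch of that procedure might look as follows (object names m0, res0, res0_lag and lm_test are placeholders):
library(dplyr)
# Step 1: estimate the model and save the residuals; na.exclude keeps the
# residual vector the same length as lagr (NA where rows were dropped)
m0 <- lm(log_military_cas ~ CMA + as.factor(country) + as.factor(year),
         data = lagr, na.action = na.exclude)
lagr$res0 <- resid(m0)
# Step 2: regress the residuals on their first lag (within country) and CMA
lagr <- lagr %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(res0_lag = dplyr::lag(res0)) %>%
  ungroup()
lm_test <- lm(res0 ~ res0_lag + CMA, data = lagr)
summary(lm_test)   # a significant res0_lag suggests a lag of the DV is needed
# If a lag of the DV (lag_md) is then added, refit it the same way with
# na.action = na.exclude so the residuals stay aligned with lagr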

How to combine all datasets into a data frame after multiple imputation (mice)

I read this article (https://journal.r-project.org/archive/2021/RJ-2021-073/RJ-2021-073.pdf) about multiple imputation and propensity score matching - here is the code from this article:
# code from "MatchThem:: Matching and Weighting after Multiple Imputation", Pishgar et al, The R Journal Vol. XX/YY, AAAA 20ZZ:
library(MatchThem)
data('osteoarthritis')
summary(osteoarthritis)
library(mice)
imputed.datasets <- mice(osteoarthritis, m = 5)
matched.datasets <- matchthem(OSP ~ AGE + SEX + BMI + RAC + SMK,
                              datasets = imputed.datasets,
                              approach = 'within',
                              method = 'nearest',
                              caliper = 0.05,
                              ratio = 2)
weighted.datasets <- weightthem(OSP ~ AGE + SEX + BMI + RAC + SMK,
                                datasets = imputed.datasets,
                                approach = 'across',
                                method = 'ps',
                                estimand = 'ATM')
library(cobalt)
bal.tab(matched.datasets, stats = c('m', 'ks'),
        imp.fun = 'max')
bal.tab(weighted.datasets, stats = c('m', 'ks'),
        imp.fun = 'max')
library(survey)
matched.models <- with(matched.datasets,
                       svyglm(KOA ~ OSP, family = quasibinomial()),
                       cluster = TRUE)
weighted.models <- with(weighted.datasets,
                        svyglm(KOA ~ OSP, family = quasibinomial()))
matched.results <- pool(matched.models)
summary(matched.results, conf.int = TRUE)
As far as I understand, the author first performs multiple imputation with mice (m = 5) and then continues with the matching procedure in MatchThem. In the end, MatchThem returns a "mimids" object called matched.datasets, which contains the 5 imputed datasets.
There is the complete() function, which can extract one of the datasets, e.g.
newdataset <- complete(matched.datasets, 2) # extracts the second dataset
So newdataset is a data frame without NAs (because they were imputed) and can be used for any further tests.
Now, I would like to use a dataset as a data frame (as after using complete()), but this dataset should be some kind of "mean" of all datasets, because how else could I decide which of the 5 datasets to use for my further analyses? Is there a way of doing something like this:
meanofdatasets <- complete(matched.datasets, meanofall5datasets) # extracts a dataset containing something like the mean values of all datasets
In my data, for which I want to use this method, I would like to use an imputed and matched dataset of my original roughly 500 rows to run further tests, e.g. Cox regression, Kaplan-Meier plots or competing-risk analyses, as well as simple descriptive statistics with plots of the matched population. But on which of the 5 datasets should I run those tests? For those tests I need a real data frame, don't I?
Thank you for any help!
Here is a valuable source (from the creator of the mice package, Stef van Buuren) to learn why you should NOT average the multiple datasets but POOL the estimates from each imputed dataset, for instance when doing your Cox regression (see section 5.1, Workflow).
Quick steps for Cox regression:
Do the imputation with mice() and the matching with matchthem(), which gives you a mimids-class object.
Then run your Cox regression through the with() function on your mimids object.
Then pool your estimates through pool(), which will give you a mimira object.
Finally, the mimira object is easily handled by the gtsummary package (tbl_regression), which gives you a fine, readily publishable table.
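A minimal sketch of those steps, assuming the matched.datasets object from the article's code above and placeholder survival columns time and status (replace them with your own variable names):
library(survival)     # coxph(), Surv()
library(gtsummary)    # tbl_regression()
# MatchThem (already loaded above) supplies with() and pool() for these objects
# Cox model fitted within each matched, imputed dataset
cox.models <- with(matched.datasets, coxph(Surv(time, status) ~ OSP))
# Pool the per-dataset estimates (Rubin's rules) into one set of results
cox.pooled <- pool(cox.models)
summary(cox.pooled, conf.int = TRUE)
# Table of the results (per the answer above; see ?tbl_regression for the
# object classes it accepts)
tbl_regression(cox.models, exponentiate = TRUE)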

R and multiple time series and Error in model.frame.default: variable lengths differ

I am new to R and I am using it to analyse time series data (I am also new to this).
I have quarterly data for 15 years and I am interested in exploring the interplay between drinking and smoking rates in young people - treating smoking as the outcome variable. I was advised to use the gls command in the nlme package as this would allow me to include AR and MA terms. I know I could use more complex approaches like ARIMAX but as a first step, I would like to use simpler models.
After loading the data, I specify the time series:
data.ts = ts(data = data$smoke, frequency = 4, start = c(data[1, "Year"], data[1, "Quarter"]))
data.ts.dec = decompose(data.ts)
After decomposing the data and running some tests (KPSS and ADF), it is clear that the data are not stationary, so I differenced them:
diff_dv <- diff(data$smoke, differences = 1)
plot.ts(diff_dv, main = "differenced")
data.diff.ts = ts(diff_dv, frequency = 4, start = c(data[1, "Year"], data[1, "Quarter"]))
The ACF and PACF plots suggest AR(2) should also be included so I set up the model as:
mod.gls = gls(diff_dv ~ drink + time, data = data,
              correlation = corARMA(p = 2), method = "ML")
However, when I run this command I get the following:
"Error in model.frame.default: variable lengths differ".
I understand from previous posts that this is due to the differencing and the fact that diff_dv is now shorter. I have attempted to fix this by modifying the code, but neither approach works:
mod.gls = gls(diff_dv ~ drink + time, data = data[1:(length(data)-1), ],
              correlation = corARMA(p = 2), method = "ML")
mod.gls = gls(I(c(diff(smoke), NA)) ~ drink + time + as.factor(quarterly), data = data,
              correlation = corARMA(p = 2), method = "ML")
Can anyone help with this? Is there a workaround that would allow me to run the gls command, or is there an alternative approach that would be equivalent?
As a side question, is it OK to include time as I do, as a variable with values 1 to 60? A similar question applies to the quarters, which I included as dummies to adjust for possible seasonality: is this OK?
Your help is greatly appreciated!
Specify na.action = na.omit or na.action = na.exclude to omit the rows with NAs. Here is an example using the built-in Ovary data set; see ?na.fail for info on the differences between these two.
library(nlme)
Ovary2 <- transform(Ovary, dfoll = c(NA, diff(follicles)))
gls(dfoll ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary2,
    correlation = corAR1(form = ~ 1 | Mare), na.action = na.exclude)
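Applied to the question's own variables, the same pattern might look like the sketch below; the column names smoke, drink, time and quarterly are taken from the question, and the differenced outcome is padded with a leading NA so that every variable in the data frame has the same length:
library(nlme)
# Pad the differenced outcome so it matches nrow(data)
data$diff_smoke <- c(NA, diff(data$smoke))
mod.gls <- gls(diff_smoke ~ drink + time + as.factor(quarterly),
               data = data,
               correlation = corARMA(p = 2),
               method = "ML",
               na.action = na.exclude)
summary(mod.gls)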

Discrepancy between emmeans in R (using ezANOVA) and estimated marginal means in SPSS

So this is a bit of a hail mary, but I'm hoping someone here has encountered this before. I recently switched from SPSS to R, and I'm now trying to do a mixed-model ANOVA. Since I'm not confident in my R skills yet, I use the exact same dataset in SPSS to compare my results.
I have a dataset with
dv = RT
within = Session (2 levels), Cue (3 levels), Flanker (2 levels)
between = Group (3 levels).
no covariates.
unequal number of participants per group level (25, 25, 23)
In R I'm using the ezANOVA function from the ez package to run the mixed-model ANOVA:
results <- ezANOVA(
  data = ant_rt_correct
  , wid = subject
  , dv = rt
  , between = group
  , within = .(session, cue, flanker)
  , detailed = T
  , type = 3
  , return_aov = T
)
In SPSS I use the following GLM:
GLM rt.1.center.congruent rt.1.center.incongruent rt.1.no.congruent rt.1.no.incongruent
rt.1.spatial.congruent rt.1.spatial.incongruent rt.2.center.congruent rt.2.center.incongruent
rt.2.no.congruent rt.2.no.incongruent rt.2.spatial.congruent rt.2.spatial.incongruent BY group
/WSFACTOR=session 2 Polynomial cue 3 Polynomial flanker 2 Polynomial
/METHOD=SSTYPE(3)
/EMMEANS=TABLES(group*session*cue*flanker)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/WSDESIGN=session cue flanker session*cue session*flanker cue*flanker session*cue*flanker
/DESIGN=group.
The results of which line up great, i.e.:
R: Session F(1,70) = 46.123 p = .000
SPSS: Session F(1,70) = 46.123 p = .000
I also ask for the means per cell using:
descMeans <- ezStats(
  data = ant_rt_correct
  , wid = subject
  , dv = rt
  , between = group
  , within = .(session, cue, flanker)
  , within_full = .(location, direction)
  , type = 3
)
Which again line up perfectly with the descriptives from SPSS, e.g. for the cell:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 484.22
SPSS: M = 484.22
However, when I try to get to the estimated marginal means, using the emmeans package:
eMeans <- emmeans(results$aov, ~ group | session | cue | flanker)
I run into discrepancies compared to the Estimated Marginal Means table from the SPSS GLM output (for the same interactions), e.g.:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 522.5643
SPSS: M = 484.22
It's been my understanding that the estimated marginal means should be the same as the descriptive means in this case, as I have not included any covariates. Am I mistaken in this? And if so, how come the two give different results?
Since the group sizes are unbalanced, I also redid the analyses above after making the groups of equal size. In that case the emmeans became:
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 521.2954
SPSS: M = 482.426
So even with equal group sizes in both conditions, I end up with quite different means. Keep in mind that the rest of the statistics and the descriptive means are equal between SPSS and R. What am I missing?
Thanks!
EDIT:
The plot thickens... If I perform the ANOVA using the afex package:
results <- aov_ez(
  "subject"
  , "rt"
  , ant_rt_correct
  , between = c("group")
  , within = c("session", "cue", "flanker")
)
and then take the emmeans again:
eMeans <- emmeans(results, ~ group | session | cue | flanker)
I suddenly get values much closer to those of SPSS (and to the descriptive means):
Group(1) - Session(1) - Cue(center) - Flanker(1)
R: M = 484.08
SPSS: M = 484.22
So perhaps ezANOVA is doing something fishy somewhere?
I suggest you try this:
library(lme4) ### I'm guessing you need to install this package first
mod <- lmer(rt ~ session + cue + flanker + (1 | group),
            data = ant_rt_correct)
library(emmeans)
emm <- emmeans(mod, ~ session * cue * flanker)
pairs(emm, by = c("cue", "flanker"))      # simple comparisons for session
pairs(emm, by = c("session", "flanker"))  # simple comparisons for cue
pairs(emm, by = c("session", "cue"))      # simple comparisons for flanker
This fits a mixed model with random intercepts for each group. It uses REML estimation, which is likely to be what SPSS uses.
In contrast, ezANOVA fits a fixed-effects model (no within factor at all), and aov_ez uses the aov function which produces an analysis that ignores the inter-block effects. Those make a difference especially with unbalanced data.
An alternative is to use afex::mixed, which in fact uses lme4::lmer to fit the model.
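For completeness, a rough sketch of the afex::mixed route mentioned above, using the same data frame and column names; note that it places the random intercept on subject (the usual choice for repeated-measures data) rather than on group, so it is a different specification from the lmer call shown earlier:
library(afex)
library(emmeans)
# Full factorial of the between- and within-subject factors, with a random
# intercept per subject; method = "KR" gives Kenward-Roger degrees of freedom
mod.afex <- mixed(rt ~ group * session * cue * flanker + (1 | subject),
                  data = ant_rt_correct, method = "KR")
emmeans(mod.afex, ~ group | session | cue | flanker)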

How to create Naive Bayes in R for numerical and categorical variables

I am trying to implement a Naive Bayes model in R based on known information:
Age group, e.g. "18-24" and "25-34", etc.
Gender, "male" and "female"
Region, "London" and "Wales", etc.
Income, "£10,000 - £15,000", etc.
Job, "Full Time" and "Part Time", etc.
I am experiencing errors when implementing it. My code is below:
library(readxl)
iphone <- read_excel("~/Documents/iPhone_1k.xlsx")
View(iphone)
summary(iphone)
iphone
library(caTools)
library(e1071)
set.seed(101)
sample = sample.split(iphone$Gender, SplitRatio = .7)
train = subset(iphone, sample == TRUE)
test = subset(iphone, sample == FALSE)
nB_model <- naiveBayes(Gender ~ Region + Retailer, data = train)
pred <- predict(nB_model, test, type="raw")
In the above scenario, I have an excel file called iPhone_1k (1,000 entries relating to people who have visited a website to buy an iPhone). Each row is a person visiting the website and the above demographics are known.
I have been trying to make the model work and have resorted to following the below link that uses only two variables (I would like to use a minimum of 4 but introduce more, if possible):
https://rpubs.com/dvorakt/144238
I want to be able to use these demographics to predict which retailer they will go to (also known for each instance in the iPhone_1k file). There are only 3 options. Can you please advise how to complete this?
P.S. Below is a screenshot of a simplified version of the data I have used to keep it simple in R. Once I get some code to work, I'll expand the number of variables and entries.
You are setting up the problem incorrectly. It should be:
naiveBayes(Retailer ~ Gender + Region + AgeGroup, data = train)
or in short
naiveBayes(Retailer ~ ., data = train)
Also, you might need to convert the columns into factors if they are characters. You can do this for all columns, right after reading from Excel, with:
iphone[] <- lapply(iphone, factor)
Note that if you add numeric variables in the future, you should not apply this step to them.
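Putting those pieces together, a corrected end-to-end sketch could look like this; it reuses the file path and column names from the question, with Retailer as the target, and (as an assumption) stratifies the train/test split on Retailer rather than Gender:
library(readxl)
library(caTools)
library(e1071)
iphone <- read_excel("~/Documents/iPhone_1k.xlsx")
iphone[] <- lapply(iphone, factor)                  # all columns here are categorical
set.seed(101)
sample <- sample.split(iphone$Retailer, SplitRatio = .7)
train <- subset(iphone, sample == TRUE)
test <- subset(iphone, sample == FALSE)
nB_model <- naiveBayes(Retailer ~ ., data = train)  # predict retailer from all other columns
pred <- predict(nB_model, test)                     # predicted class labels
table(predicted = pred, actual = test$Retailer)     # quick confusion matrix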
