I have longitudinal panel data on 1000 individuals measured at two time points. Using the mice package I have imputed values for the variables with missing data. The imputation itself works fine, generating the required 17 imputed data frames. One of the imputed variables is fitness, and I would like to create a new, scaled version of it, scale(fitness). My understanding is that I should impute first and then create the new variable from the imputed data. How do I access each of the 17 imputed datasets and generate a scaled fitness variable in each?
My original data frame looks like this (some variables omitted):
id age school sex andersen ldl_c_trad pre_post
<dbl> <dbl> <fct> <fct> <int> <dbl> <fct>
1 2 10.7 1 1 951 2.31 1
2 2 11.3 1 1 877 2.20 2
3 3 11.3 1 1 736 2.88 1
4 3 11.9 1 1 668 3.36 2
5 4 10.1 1 0 872 3.31 1
6 4 10.7 1 0 905 2.95 2
7 5 10.5 1 1 925 2.02 1
8 5 11.0 1 1 860 1.92 2
9 8 10.7 1 1 767 3.41 1
10 8 11.2 1 1 709 3.32 2
My imputation code is:
imputed <- mice(imp_vars, method = meth, predictorMatrix = predM, m = 17)
imp_vars are the variables selected for imputation.
I have pre-specified both the method and predictor matrix.
Also, my assumption is that the scaling should be performed separately for each time point, as fitness is likely to have improved over time. Is it possible to perform the scaling filtered by pre_post for each imputed dataset?
Many thanks.
To access a single imputation, where x is a value from 1 to 17:
data <- complete(imputed, x)
or, if you just want the fitness variable:
complete(imputed, x)$fitness
If you want to filter observations by the value of another variable in the data frame, you could use
data[which(data$pre_post==1), "fitness"]
This returns the fitness observations for which pre_post == 1. From there it is a matter of scaling the observations within each level of pre_post, assigning them to a new variable fitness_scaled, and repeating for each of the 17 imputations.
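For a concrete end-to-end version, here is a minimal sketch using mice's long format rather than a manual loop: stack the imputations with complete(), scale fitness within each combination of imputation and pre_post, and convert back to a mids object. This assumes dplyr is available and that fitness and pre_post are among the imputed variables, as described in the question.
library(mice)
library(dplyr)
# stack the original data (.imp == 0) and all 17 imputations
long <- complete(imputed, action = "long", include = TRUE)
# scale fitness separately within each imputation and each time point
long <- long %>%
  group_by(.imp, pre_post) %>%
  mutate(fitness_scaled = as.numeric(scale(fitness))) %>%
  ungroup()
# convert back to a mids object for use with with() and pool()
imputed_scaled <- as.mids(as.data.frame(long))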
I am using flexsurvreg() to fit and extrapolate parametric models on survival data. I use the treatment group as a covariate to make a proportional hazards model. I need a variance-covariance matrix separately for the two treatment groups, but I cannot work out how to separate the groups after fitting the parametric model.
weib <- flexsurvreg(Surv(os_mnts, censoring) ~ treat, data = date_ex, dist = "weibull")
An example of the data is below. The data also contain rows with treat == "control", even though none show here.
#sx_date last_fup_date censoring sex treat os_mnts
# <date> <date> <dbl> <dbl> <chr> <dbl>
# 1 2010-06-03 2013-08-10 0 1 treatment 38.2
# 2 2013-06-10 2014-09-09 1 1 treatment 15.0
# 3 2014-11-05 2015-07-03 0 0 treatment 7.89
# 4 2011-03-07 2014-08-10 1 1 treatment 41.1
# 5 2010-03-06 2013-12-11 0 1 treatment 45.2
# 6 2011-09-08 2015-01-01 0 1 treatment 39.8
# 7 2008-10-09 2016-06-02 1 0 treatment 91.8
# 8 2010-02-11 2015-01-02 1 1 treatment 58.7
# 9 2009-08-06 2014-07-06 0 1 treatment 59.0
#10 2011-07-03 2016-04-03 0 0 treatment 57.0
When I call vcov(weib) to get the variance-covariance matrix, I get the following.
# shape scale treattreatment
#shape 0.0218074155 -0.004631324 -0.0001595603
#scale -0.0046313242 0.007912648 -0.0068951896
#treattreatment -0.0001595603 -0.006895190 0.0138593195
However, I need two variance-covariance matrices (one for each treatment group) containing shape and scale only.
I have tried searching for a way to split the matrix itself and for a way to subset the weib object, but I cannot find how to do either. Does anyone know how I can get separate matrices out of this?
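Since the fitted model shares a single shape and scale across groups (treatment enters only as a covariate), a per-group shape/scale variance-covariance matrix really only exists if each group is fitted on its own. A minimal sketch of that approach, assuming the data and variable names from the question:
library(flexsurv)
# intercept-only Weibull fits within each treatment group
weib_trt <- flexsurvreg(Surv(os_mnts, censoring) ~ 1,
                        data = subset(date_ex, treat == "treatment"),
                        dist = "weibull")
weib_ctl <- flexsurvreg(Surv(os_mnts, censoring) ~ 1,
                        data = subset(date_ex, treat == "control"),
                        dist = "weibull")
# each vcov() is now a 2x2 matrix with shape and scale only
vcov(weib_trt)
vcov(weib_ctl)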
I would like to assign a number to the days (events) where val > 0.36. I want to identify with a unique number any event that meets that rule (val > 0.36), preferably using the tidyverse.
library(lubridate)
library(tibble)
# create a df
date <- as_date(ymd("2020-11-01"):ymd("2020-11-25"))
val <- rnorm(25)
data <- tibble(date, val)
Could anyone help me?
Thanks
There are several ways to choose unique numbers that identify the events exceeding the threshold, and the tidyverse is not strictly necessary: here is a base R solution. It uses the index of each value of val that exceeds the threshold as the unique ID; all other values, at or below 0.36, are coded as 0.
# (a) initialize the 0s that later identify the non-events
data$flag <- rep(0, nrow(data))
# (b) flag all values which exceed the threshold, using the row index as the ID
data$flag[which(data$val > 0.36)] <- which(data$val > 0.36)
Output (random on my machine, since no seed was set):
> data
# A tibble: 25 x 3
date val flag
<date> <dbl> <dbl>
1 2020-11-01 0.0231 0
2 2020-11-02 -0.413 0
3 2020-11-03 0.240 0
4 2020-11-04 -0.465 0
5 2020-11-05 -0.929 0
6 2020-11-06 -0.409 0
7 2020-11-07 0.598 7
8 2020-11-08 0.970 8
9 2020-11-09 1.25 9
10 2020-11-10 0.244 0
# ... with 15 more rows
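Since the question asked for a tidyverse option where possible, here is an equivalent sketch with dplyr. row_number() reproduces the index-as-ID scheme above; swap in cumsum(val > 0.36) if you prefer events numbered consecutively 1, 2, 3, ...
library(dplyr)
data <- data %>%
  mutate(flag = if_else(val > 0.36, row_number(), 0L))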
I have produced a logistic regression model in R using the logistf function from the logistf package due to quasi-complete separation. I get the error message:
Error in solve.default(object$var[2:(object$df + 1), 2:(object$df + 1)]) :
system is computationally singular: reciprocal condition number = 3.39158e-17
The data are structured as shown below, though a lot of the data has been cut here. The numbers represent levels (i.e. 1 = very low, 5 = very high), not counts. Variables OrdA to OrdH are ordered factors; the variable Binary is a factor.
OrdA OrdB OrdC OrdE OrdF OrdG OrdH Binary
1 3 4 1 1 2 1 1
2 3 4 5 1 3 1 1
1 3 2 5 2 4 1 0
1 1 1 1 3 1 2 0
3 2 2 2 1 1 1 0
I have read here that this can be caused by multicollinearity, but have tested this and it is not the problem.
VIFModel <- lm(Binary ~ OrdA + OrdB + OrdC + OrdD + OrdE +
OrdF + OrdG + OrdH, data = VIFdata)
vif(VIFModel)
GVIF Df GVIF^(1/(2*Df))
OrdA 6.09 3 1.35
OrdB 3.50 2 1.37
OrdC 7.09 3 1.38
OrdD 6.07 2 1.57
OrdE 5.48 4 1.23
OrdF 3.05 2 1.32
OrdG 5.41 4 1.23
OrdH 3.03 2 1.31
The post also indicates that the problem can be caused by having "more variables than observations," but I have 8 independent variables and 82 observations. For context, each independent variable is ordinal with 5 levels, and the binary dependent variable has "successes" in 30% of observations. I'm not sure whether that is related to the issue. How do I fix this?
X <- model.matrix(Binary ~ OrdA+OrdB+OrdC+OrdD+OrdE+OrdF+OrdG+OrdH, Data3)
dim(X); Matrix::rankMatrix(X)
[1] 82 24
[1] 23
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.820766e-14
Short answer: your ordinal input variables are expanded into 24 predictor columns (the number of columns of the model matrix), but the rank of your model matrix is only 23, so you do indeed have multicollinearity among your predictor variables. I don't know what vif is doing ...
You can use svd(X) to help figure out which combination of columns is collinear ...
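A minimal sketch of that diagnostic: the right singular vector associated with the smallest singular value gives the coefficients of the (near-)linear dependency among the model-matrix columns, so the columns with non-zero loadings are the ones involved in the collinearity.
sv <- svd(X)
round(sv$d, 6)  # one value near 0 confirms the rank deficiency
# loadings of the dependency; non-zero entries mark the collinear columns
data.frame(column = colnames(X),
           loading = zapsmall(sv$v[, which.min(sv$d)]))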
I have a longitudinal data frame with multiple rows per id.
> data("dietox")
> head(dietox, 5)
Pig Evit Cu Litter Start Weight Feed Time
1 4601 Evit000 Cu000 1 26.5 26.50000 NA 1
2 4601 Evit000 Cu000 1 26.5 27.59999 5.200005 2
3 4601 Evit000 Cu000 1 26.5 36.50000 17.600000 3
4 4601 Evit000 Cu000 1 26.5 40.29999 28.500000 4
5 4601 Evit000 Cu000 1 26.5 49.09998 45.200001 5
I am trying to fit a GEE model to predict Weight for each row of the data frame.
library(gee)
library(geepack)  # the dietox data ships with geepack
library(dplyr)
> model1 <- gee(Weight ~ Start + Feed, id=Pig, data=dietox, corstr="exchangeable")
> model1
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Exchangeable
Call:
gee(formula = Weight ~ Start + Feed, id = Pig, data = dietox,
corstr = "exchangeable")
Number of observations : 789
Maximum cluster size : 11
Coefficients:
(Intercept) Start Feed
5.1539561 0.9384232 0.4294209
I now want to add a new column, prediction, to the data frame, containing the predicted weight for each row. The idea is that I can then compare the original Weight variable with the prediction variable at different points of the Time variable.
When I try to do this using the mutate and predict functions, I get an error saying that the number of observations used in the model fit (789) differs from the number of rows in the original data frame (861).
> new_df <- dietox %>%
+ mutate(prediction = predict(model1))
Error: Column `prediction` must be length 861 (the number of rows) or one, not 789
My questions are:
1. How do I extract the data frame of the 789 observations that were used in the model fit?
2. Why is the number of observations used in the model fit different from the total number of observations in the original data frame?
The 789 observations used in the model fit are the ones without NAs: there are 72 observations with NA in the Feed column,
sum(is.na(dietox$Feed))
#[1] 72
and 789 + 72 gives the complete 861 observations. To get all the predicted values aligned with the original rows you could do
dietox$Prediction <- NA
dietox$Prediction[!is.na(dietox$Feed)] <- predict(model1)
head(dietox)
# Weight Feed Time Pig Evit Cu Litter Prediction
#1 26.50000 NA 1 4601 1 1 1 NA
#2 27.59999 5.200005 2 4601 1 1 1 31.43603
#3 36.50000 17.600000 3 4601 1 1 1 36.76708
#4 40.29999 28.500000 4 4601 1 1 1 41.45324
#5 49.09998 45.200001 5 4601 1 1 1 48.63296
#6 55.39999 56.900002 6 4601 1 1 1 53.66306
Also, the response values that were actually used in the model are available in model1$y.
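To answer the first question directly, a one-line sketch: keep the rows that are complete on the model variables (here only Feed has NAs, so this reduces to !is.na(dietox$Feed)).
# the 789 rows that actually entered the fit
dietox_used <- dietox[complete.cases(dietox[, c("Weight", "Start", "Feed")]), ]
nrow(dietox_used)
#[1] 789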
I would like to conduct a linear regression in three steps: 1) run the regression on all data points; 2) take out the 10 outliers, as found by the absolute value of rstandard; 3) run the regression again on the new data frame.
I know how to do it manually, but this is very awkward. Is there a way to do it automatically? Can it be done for taking out columns as well?
Here is my toy data frame and code (I'll take out 2 top outliers):
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 16
608 0 1 5
123 1 17 7
321 1 8 7
226 0 2 7
556 0 20 3
334 1 6 3
225 0 1 1
999 0 3 11
987 0 30 1 ",header = TRUE)
model<- lm(target~ birds+ wolfs,data=df)
rstandard <- abs(rstandard(model))
df<-cbind(df,rstandard)
g<-subset(df,rstandard > sort(unique(rstandard),decreasing=T)[3])
g
userid target birds wolfs rstandard
4 543 1 2 3 1.189858
13 334 1 6 3 1.122579
modelNew<- lm(target~ birds+ wolfs,data=df[-c(4,13),])
I don't see how you could do this without estimating two models: the first to identify the most influential cases and the second fitted to the data without those cases. You could simplify your code and avoid cluttering the workspace, however, by doing it all in one shot, with the subsetting embedded in the call that estimates the "final" model. Here's code that does this for the example you gave:
model <- lm(target ~ birds + wolfs,
data = df[-(as.numeric(names(sort(abs(rstandard(lm(target ~ birds + wolfs, data=df))), decreasing=TRUE)))[1:2]),])
Here, the initial model, evaluation of influence, and ensuing subsetting of the data are all built into the code that comes after the first data =.
Also, note that the resulting model will differ from the one your code produced. That's because your g did not correctly identify the two most influential cases, as you can see if you just eyeball the results of abs(rstandard(lm(target ~ birds + wolfs, data=df))). I think it has to do with your use of unique(), which seems unnecessary, but I'm not sure.
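If you do this repeatedly, a small helper (my own sketch, not part of the answer above) makes the number of dropped cases a parameter; order() on the absolute standardized residuals replaces the sort()/names() bookkeeping, assuming the data have no missing values so that residual positions line up with rows.
drop_worst_cases <- function(formula, data, n = 2) {
  # fit once, rank rows by |rstandard|, then refit without the top n
  fit <- lm(formula, data = data)
  worst <- order(abs(rstandard(fit)), decreasing = TRUE)[1:n]
  lm(formula, data = data[-worst, ])
}
modelNew <- drop_worst_cases(target ~ birds + wolfs, df, n = 2)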