De-identifying survival or flexsurvreg objects in R - r

Please consider the following:
I need to provide some R code to analyse data with the flexsurv package. I am not allowed to receive or analyse the data directly or on-site. I am, however, allowed to receive the analysis results.
Problem
When we run the flexsurvreg() function on some data (here ovarian, from the survival package that flexsurv loads), the created object (here fitw) contains enough information to "re-create" or "back-engineer" the actual data. I would then technically have access to data I am not allowed to have.
# Load package
library("flexsurv")
#> Loading required package: survival
# Run flexsurvreg with data = ovarian
fitw <- flexsurvreg(formula = Surv(futime, fustat) ~ factor(rx) + age,
data = ovarian, dist="weibull")
# Look at first observation in ovarian
ovarian[1, ]
#> futime fustat age resid.ds rx ecog.ps
#> 1 59 1 72.3315 2 1 1
# With the following from the survival object, the data could be re-created
fitw$data$Y[1, ]
#> time status start stop time1 time2
#> 59 1 0 59 59 Inf
fitw$data$m[1, ]
#> Surv(futime, fustat) factor(rx) age (weights)
#> 1 59 1 72.3315 1
Potential solution
We could write the code so that it also sets all those data that might be used for this back-engineering to NA as follows:
# Setting all survival object observation to NA
fitw$data$Y <- NA
fitw$data$m <- NA
fitw$data$mml$scale <- NA
fitw$data$mml$rate <- NA
fitw$data$mml$mu <- NA
Created on 2021-08-27 by the reprex package (v2.0.0)
Question
If I proceed as above and set all these components to NA, could I then receive the fitw object (e.g. as an .RDS file) without ever being able to "back-engineer" the original data? Or is there any other way to share fitw without the attached data?
Thanks!

Setting fitw$data <- NULL removes all the individual-level data from the fitted model object. Some of the output functions may not work on objects stripped of data, however. In the current development version on GitHub, printing the model object should work. The summary and predict methods should also work, as long as covariate values are supplied in newdata; omitting them won't work, since the default is to take the covariate values from the observed data.
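As a minimal sketch of that approach, assuming the flexsurv and survival packages are installed: the code below strips the data component from a copy of the fit and saves it. The names fit_shared and out_file are made up for illustration, and the summary call is wrapped in try() because, as noted above, whether it works on a stripped object depends on the flexsurv version.

```r
library(flexsurv)  # also attaches survival, which provides `ovarian`

fitw <- flexsurvreg(Surv(futime, fustat) ~ factor(rx) + age,
                    data = ovarian, dist = "weibull")

# Work on a copy so the full fit stays available to the data holder
fit_shared <- fitw
fit_shared$data <- NULL   # drop the individual-level data component

# The estimated coefficients are stored outside $data, so they are unchanged
identical(coef(fit_shared), coef(fitw))

# Save the stripped object (hypothetical file name, written to a temp dir)
out_file <- file.path(tempdir(), "fitw_stripped.rds")
saveRDS(fit_shared, out_file)

# Covariate values must be supplied explicitly; this may still fail on
# older flexsurv versions, hence the try()
res <- try(summary(fit_shared,
                   newdata = data.frame(rx = 1, age = 60),
                   t = c(100, 500)),
           silent = TRUE)
```

Before the file is sent, the data holder could run str(fit_shared, max.level = 1) to check that no remaining component holds individual-level values.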

Related

Trying to create a data frame for mlogit and keep running into this error Error in names(data)[ix] : invalid subscript type 'language'

I am trying to use this data set https://data.cityofnewyork.us/Transportation/Citywide-Mobility-Survey-Person-Survey-2019/6bqn-qdwq to create an mnl model but every time I try to change my original data frame like this
nydata_df = dfidx(nydata, shape="wide",choice="work_mode",varying = sort)
I get this error here.
Error in names(data)[ix] : invalid subscript type 'language'
I'm unclear about what is causing this error. I think it is something wrong with dplyr, but I am not sure.
According to this vignette from the mlogit package, the varying argument should be used to specify which variables should be "lengthened" when converting a dataframe from wide to long using dfidx. Are you actively trying to lengthen your dataframe (like in the style of dplyr::pivot_longer())?
If you aren't, I don't believe that you need the varying argument (see ?stats::reshape for more info on varying). If you want to use the varying argument, you should specify specific variables rather than only "sort" (example1, example2). Additionally, when I run your models, I don't get a NaN for McFadden's R2, p-value, or chi-square test. Are your packages fully updated?
library(dfidx)
library(mlogit)
library(performance) # to extract McFadden's R2 easily
packageVersion("dfidx")
#> [1] '0.0.5'
packageVersion("mlogit")
#> [1] '1.1.1'
packageVersion("dplyr")
#> [1] '1.0.10'
# currently running RStudio Version 2022.7.2.576
nydata <- read.csv(url("https://data.cityofnewyork.us/api/views/6bqn-qdwq/rows.csv?accessType=DOWNLOAD"))
nydata_df <- dfidx(data = nydata,
shape = "wide",
choice = "work_mode")
m <- mlogit(work_mode ~ 1, nydata_df)
#summary(m)
r2_mcfadden(m)
#> McFadden's R2
#> 1.110223e-16
m3 <- mlogit(work_mode ~ 1 | harassment_mode + age, nydata_df)
#summary(m3)
r2_mcfadden(m3)
#> McFadden's R2
#> 0.03410362

In data(aml) : data set ‘aml’ not found

library(readr)
aml <- read.csv("~/Documents/MH4315/aml.dat", sep="")
View(aml)
data(aml)
aml.km<-survfit(Surv(time,status)~x, data = aml)
plot(aml.km, main="Estimated survival function of the two
groups", lty=c(1,2) )
When I run my code, I get an error saying that in data(aml): data set 'aml' is not found.
I am not sure what is wrong with my code. Also, is the code for my aml.km correct?
I'm not sure what you expected from the data(aml) call. The data function is used to load data objects from installed packages. There is an aml dataset in both the survival package and the boot package, and I think they are identical; at least they have the same number of rows and columns and similar descriptions in their help pages. Since I currently have the survival package loaded, when I do this:
data(aml)
Nothing happens at the console but in this case no-news-is-good-news. The dataset is in the workspace.
You, on the other hand, have read in an aml data object from your local disk, so we cannot really be sure what it contains. When I run your survfit call I get no error:
aml.km<-survfit(Surv(time,status)~x, data = aml)
> aml.km
Call: survfit(formula = Surv(time, status) ~ x, data = aml)
n events median 0.95LCL 0.95UCL
x=Maintained 11 7 31 18 NA
x=Nonmaintained 12 11 23 8 NA
#----------------------
png();plot(aml.km); dev.off()
#RStudioGD
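A minimal sketch of the same analysis using the packaged dataset, assuming the survival package is installed. It accesses the data as survival::aml instead of calling data(), since the package's datasets are lazy-loaded; the name aml_pkg is made up here to avoid clashing with a locally read aml object.

```r
library(survival)

# Take the packaged dataset explicitly; no data() call is needed
aml_pkg <- survival::aml

# Kaplan-Meier estimate per maintenance group
aml.km <- survfit(Surv(time, status) ~ x, data = aml_pkg)
print(aml.km)

# Plot both groups with different line types
plot(aml.km, main = "Estimated survival function of the two groups",
     lty = c(1, 2))
```

If the file read from disk has the same columns (time, status, x), the survfit call itself should work unchanged on that object too, without any data() call.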

Handling alternative-specific NA values in mlogit

It is common in mode choice models to have variables that vary with alternatives ("generic variables") but that are undefined for certain modes. For example, transit fare is present for bus and light rail, but undefined for automobiles and biking. Note that the fare is not zero.
I'm trying to make this work with the mlogit package for R. In this MWE I've asserted that price is undefined for fishing from the beach. This results in a singularity error.
library(mlogit)
#> Warning: package 'mlogit' was built under R version 3.5.2
#> Loading required package: Formula
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#> Loading required package: lmtest
data("Fishing", package = "mlogit")
Fishing$price.beach <- NA
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
head(Fish)
#> mode income alt price catch chid
#> 1.beach FALSE 7083.332 beach NA 0.0678 1
#> 1.boat FALSE 7083.332 boat 157.930 0.2601 1
#> 1.charter TRUE 7083.332 charter 182.930 0.5391 1
#> 1.pier FALSE 7083.332 pier 157.930 0.0503 1
#> 2.beach FALSE 1250.000 beach NA 0.1049 2
#> 2.boat FALSE 1250.000 boat 10.534 0.1574 2
mlogit(mode ~ catch + price | income, data = Fish, na.action = na.omit)
#> Error in solve.default(H, g[!fixed]): system is computationally singular: reciprocal condition number = 3.92205e-24
Created on 2019-07-08 by the reprex package (v0.2.1)
This happens when price is moved to the alternative-specific variable position as well. I think the issue may lie in the na.action function argument, but I can't find any documentation on this argument beyond the basic documentation tag:
na.action: a function which indicates what should happen when the data contains NAs
There appear to be no examples showing how this term is used differently and what the results are. There's a related unanswered question here.
There appear to be a few things going on.
I am not quite sure how na.action = na.omit works under the hood, but it sounds to me like it will drop the entire row. I always find it better to do this explicitly.
When you drop the entire row, you will have choice occasions where no choice was made. This is not going to work. Remember, we are working with logit type probabilities. Furthermore, if no choice is made, no information is gained, so we need to drop these choice observations entirely. Doing these two steps in combination, I am able to run the model you propose.
Here is a commented working example:
library(mlogit)
# Read in the data
data("Fishing", package = "mlogit")
# Set price for the beach option to NA
Fishing$price.beach <- NA
# Scale income
Fishing$income <- Fishing$income / 10000
# Turn into 'mlogit' data
fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
# Explicitly drop the alts with NA in price
fish <- fish[fish$alt != "beach", ]
# Dropping all NA also means that we now have choice occasions where no choice
# was made and we need to get rid of these as well
fish$choice_made <- rep(colSums(matrix(fish$mode, nrow = 3)), each = 3)
fish <- fish[fish$choice_made == 1, ]
fish <- mlogit.data(fish, shape = "long", alt.var = "alt", choice = "mode")
# Run an MNL model
mnl <- mlogit(mode ~ catch + price | income, data = fish)
summary(mnl)
In general, when working with these models, I find it very useful to always make all data transformations before running a model rather than rely on functions such as na.action.
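The "drop choice occasions where no choice was made" step can also be illustrated on a toy long-format table in base R. This is only a sketch: the data frame below is invented, and it uses tapply() by choice id instead of the matrix trick in the example above, which avoids relying on row order.

```r
# Invented long-format data: 3 choice occasions, 3 remaining alternatives
# each. Occasion 2 originally chose "beach", so after dropping that
# alternative none of its remaining rows is marked as chosen.
long <- data.frame(
  chid = rep(1:3, each = 3),
  alt  = rep(c("boat", "charter", "pier"), times = 3),
  mode = c(TRUE, FALSE, FALSE,
           FALSE, FALSE, FALSE,
           FALSE, FALSE, TRUE)
)

# Count chosen alternatives per choice occasion
n_chosen <- tapply(long$mode, long$chid, sum)

# Keep only occasions with exactly one chosen alternative
keep <- names(n_chosen)[n_chosen == 1]
long_kept <- long[as.character(long$chid) %in% keep, ]
long_kept
```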

Can we make prediction with nlxb from nlmrt package?

I'm asking this question because I couldn't figure out why the nlxb fitting function does not work with the predict() function.
I have been looking around to solve this but so far no luck:(
I use dplyr to group data and use do to fit each group using nlxb from nlmrt package.
Here is my attempt
set.seed(12345)
set =rep(rep(c("1","2","3","4"),each=21),times=1)
time=rep(c(10,seq(100,900,100),seq(1000,10000,1000),20000),times=1)
value <- replicate(1,c(replicate(4,sort(10^runif(21,-6,-3),decreasing=FALSE))))
data_rep <- data.frame(time, value,set)
> head(data_rep)
# time value set
#1 10 1.007882e-06 1
#2 100 1.269423e-06 1
#3 200 2.864973e-06 1
#4 300 3.155843e-06 1
#5 400 3.442633e-06 1
#6 500 9.446831e-06 1
* * * *
library(dplyr)
library(nlmrt)
d_step <- 1
f <- 1e9
d <- 32
formula = value~Ps*(1-exp(-2*f*time*exp(-d)))*1/(sqrt(2*pi*sigma))*exp(-(d-d_ave)^2/(2*sigma))*d_step
dffit = data_rep %>% group_by(set) %>%
do(fit = nlxb(formula ,
data = .,
start=c(d_ave=44,sigma=12,Ps=0.5),
control=nls.lm.control(maxiter = 100),
trace=TRUE))
--------------------------------------------------------
There are two points I would like to get finally,
1) First, how to get the fitted coefficients of each group as a continuation of the dffit pipeline.
2) Second, how to make predictions based on new x values.
For instance: range <- data.frame(x=seq(1e-5,20000,length.out=10000))
predict(fit,data.frame(x=range))
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "nlmrt"
Since nlxb works smoothly compared to nls (see the related question "r-minpack-lmnls-lm-failed-with-good-results"), I would prefer solutions with nlxb. But if you have a better solution, please let us know.
There are no coef or predict methods for "nlmrt" class objects, but the nlmrt package does provide wrapnls, which runs nlxb and then nls so that an "nls" object results. That object can then be used with all the "nls" class methods.
Also note that nls.lm.control is from the minpack.lm package (where it is used with nlsLM) and should not be used here; use list instead.
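To show what having an "nls" object buys you, here is a sketch that uses base nls() directly on invented data with a simple exponential model. It does not call wrapnls or use the model from the question. It fits one model per group, collects the coefficients, and predicts on new values. Note that the column in newdata has to carry the name used in the formula (time here, not x).

```r
set.seed(1)

# Invented data: two groups following a * exp(-b * time) plus small noise
toy <- data.frame(
  set  = rep(c("1", "2"), each = 20),
  time = rep(seq(1, 20), times = 2)
)
toy$value <- ifelse(toy$set == "1", 5, 8) * exp(-0.2 * toy$time) +
  rnorm(40, sd = 0.05)

# One nls fit per group
fits <- lapply(split(toy, toy$set), function(d) {
  nls(value ~ a * exp(-b * time), data = d, start = list(a = 4, b = 0.1))
})

# Coefficients of each group as a matrix (one row per group)
coefs <- t(sapply(fits, coef))
coefs

# Predictions on a new grid; the column name must match the formula
new_times <- data.frame(time = seq(1, 20, length.out = 50))
preds <- lapply(fits, predict, newdata = new_times)
```

An object returned by wrapnls should support the same coef() and predict() calls, since it has class "nls".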

r rms error using validate

I'm building a linear model using ols() from the rms package with:
model<-ols(nallSmells ~ rcs(size, 5) + rcs(minor,5)+rcs(change_churn,3)
+rcs(review_rate,0), data=quality,x=T, y=T)
When I want to validate my model using:
validate(model,B=100)
I get the following error:
Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 1164 missing values deleted
But if I decrease B, e.g. to B=10, it works. Why can't I use more iterations? I also noticed that the seed has an effect when I use this method.
Can someone give me some advice?
UPDATE:
I'm using rcs(review_rate,0) because I want to assign 0 knots to this predictor, according to my DOF budget. I noticed that the problem is with the data in review_rate. Even if I omit the parameter in rcs() and just put the name of the predictor, I get errors. This is the frequency of the data in review_rate:
count(quality$review_rate)
x freq
1 0.8571429 1
2 0.9483871 1
3 0.9789474 1
4 0.9887640 1
5 0.9940476 1
6 1.0000000 1159
I wonder if there is a relationship with the values of this vector? Because when I build the OLS model, I get the following warning:
Warning message:
In rcspline.eval(x, nk = nknots, inclx = TRUE, pc = pc, fractied = fractied) :
5 knots requested with 6 unique values of x. knots set to 4 interior values.
The values of the other predictors are positive real numbers, and if I omit the review_rate predictor I don't get any warning or error.
Thanks for your support.
I add the link for a sample of 100 of my data for replication
https://www.dropbox.com/s/oks2ztcse3l8567/examplestackoverflow.csv?dl=0
X represents the dependent variable and Y4 is the predictor that is giving me problems.
require (rms)
Data <- read.csv ("examplestackoverflow.csv")
testmodel<-ols(X~ rcs(Y1)+rcs(Y2)+rcs(Y3)+rcs(Y4),data=Data,x=T,y=T)
validate(testmodel,B=1000)
Kind regards,
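One possible explanation, offered as a hypothesis and not a confirmed diagnosis: validate() refits the model on bootstrap resamples, and with 1159 of the 1164 review_rate values equal to 1, a resample can by chance contain no other value, which leaves the predictor constant in that refit. The base-R sketch below only does the arithmetic for that chance. It assumes n = 1164 rows (the total of the frequency table above) and ordinary resampling with replacement.

```r
# Assumed from the frequency table: 1164 rows, 5 of them differ from 1
n      <- 1164
n_rare <- 5

# Probability that one bootstrap resample (n draws with replacement)
# contains none of the 5 rows whose review_rate differs from 1
p_degenerate <- ((n - n_rare) / n)^n

# Probability that at least one of B resamples is degenerate
p_any <- function(B) 1 - (1 - p_degenerate)^B

p_degenerate
p_any(10)
p_any(100)
```

Under these assumptions a single resample is degenerate well under 1% of the time, but over 100 resamples the chance of hitting at least one is roughly a half, versus a few percent for 10. That would be consistent with B=10 usually working, B=100 often failing, and the seed making a difference.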
