plm - industry + year fixed effects with firm-year data - r
My question is: how do I include industry and year fixed effects in plm when I have multiple firms in the same industry in the same year? A reprex of my data looks like this:
Year Industry CompanyID CEOID CEO.background MBA.CEO CEO.Tenure Female.CEO CEO.age Capex Log.TA Leverage
2005 6 1075 10739 0 0 6.92 0 55 0.08623238 9.199961396 0.330732917
2006 6 1075 10739 0 0 7.92 0 56 0.097455145 9.334559982 0.26575725
2007 6 1075 10739 0 0 8.92 0 56 0.113033772 9.346263914 0.285439531
2008 6 1075 10739 0 0 9.92 0 57 0.108640177 9.327564318 0.322985772
2009 6 1075 5835 0 0 0.67 0 54 0.08526524 9.360491034 0.333880116
2010 6 1075 5835 0 0 1.67 0 55 0.081452292 9.376545673 0.32197511
2005 6 1743 8379 0 0 17.43 0 65 0.236487293 6.693007633 0.021915227
2006 6 1743 26012 0 1 0.91 0 59 0.319264835 6.820455133 0.023157959
2007 6 1743 26012 0 1 1.91 0 58 0.207384938 6.844512984 0.020087012
2008 6 1743 26012 0 1 2.92 0 59 0.130632264 6.890964093 0.017103795
2009 6 1743 26012 0 1 3.92 0 60 0.112029325 6.879662342 0.017283796
2010 6 1743 30801 0 0 1 1 47 0.02804693 6.767971236 0.044755539
2005 7 1004 9249 0 0 9.65 0 53 0.076370794 6.596094672 0.31534354
2006 7 1004 9249 0 0 10.65 0 54 0.114891589 6.886346743 0.327808308
2007 7 1004 9249 0 0 11.65 0 55 0.097727719 6.973199328 0.307086799
2008 7 1004 9249 0 0 12.65 0 56 0.112119583 7.216716829 0.389800369
2009 7 1004 9249 0 0 13.65 0 57 0.086281135 7.228033526 0.331455792
2010 7 1004 9249 0 0 14.65 0 58 0.298922358 7.313914813 0.291147083
CEO.background, MBA.CEO, and Female.CEO are time-invariant dummies for each CEO, and Industry is a time-invariant dummy for each firm; the rest are time-varying firm/CEO attributes.
I would like to run the following fixed effects for industry/year regression code:
plm(Capex ~ CEO.background + MBA.CEO + CEO.Tenure + Female.CEO + CEO.age + Log.TA + Leverage, data=repexcapex, index = (c("Industry", "Year")), model = "within", effect = "twoways")
However, if I have multiple companies in the same industry, as in the data above (CompanyID 1075 and 1743 are both in industry 6), the code gives an error about duplicates:
Error in pdim.default(index[[1]], index[[2]]) :
duplicate couples (id-time)
In addition: Warning messages:
1: In pdata.frame(data, index) :
duplicate couples (id-time) in resulting pdata.frame
[...]
If I kill the first 5 rows and run it with just 1 firm per industry, the code works.
How should I formulate my regression so that it includes both industry and year fixed effects? Is running the code with industry dummies, as below, equivalent to industry fixed effects?
plm(Capex ~ CEO.background + MBA.CEO + CEO.Tenure + Female.CEO + CEO.age + Log.TA + Leverage + factor(Industry), data=repexcapex, index = (c("Year")), model = "within", effect = "individual")
This is the formatted data:
repexcapex <- read.table(text="
Year,Industry,CompanyID,CEOID,CEO.background,MBA.CEO,CEO.Tenure,Female.CEO,CEO.age,Capex,Log.TA,Leverage
2005,6,1075,10739,0,0,6.92,0,55,0.08623238,9.199961396,0.330732917
2006,6,1075,10739,0,0,7.92,0,56,0.097455145,9.334559982,0.26575725
2007,6,1075,10739,0,0,8.92,0,56,0.113033772,9.346263914,0.285439531
2008,6,1075,10739,0,0,9.92,0,57,0.108640177,9.327564318,0.322985772
2009,6,1075,5835,0,0,0.67,0,54,0.08526524,9.360491034,0.333880116
2010,6,1075,5835,0,0,1.67,0,55,0.081452292,9.376545673,0.32197511
2005,6,1743,8379,0,0,17.43,0,65,0.236487293,6.693007633,0.021915227
2006,6,1743,26012,0,1,0.91,0,59,0.319264835,6.820455133,0.023157959
2007,6,1743,26012,0,1,1.91,0,58,0.207384938,6.844512984,0.020087012
2008,6,1743,26012,0,1,2.92,0,59,0.130632264,6.890964093,0.017103795
2009,6,1743,26012,0,1,3.92,0,60,0.112029325,6.879662342,0.017283796
2010,6,1743,30801,0,0,1,1,47,0.02804693,6.767971236,0.044755539
2005,7,1004,9249,0,0,9.65,0,53,0.076370794,6.596094672,0.31534354
2006,7,1004,9249,0,0,10.65,0,54,0.114891589,6.886346743,0.327808308
2007,7,1004,9249,0,0,11.65,0,55,0.097727719,6.973199328,0.307086799
2008,7,1004,9249,0,0,12.65,0,56,0.112119583,7.216716829,0.389800369
2009,7,1004,9249,0,0,13.65,0,57,0.086281135,7.228033526,0.331455792
2010,7,1004,9249,0,0,14.65,0,58,0.298922358,7.313914813,0.291147083",
sep=",",header=TRUE)
As your dependent variable Capex seems to be a company-specific measure, likely the unit of observation (= what plm calls the individual dimension) is company (variable CompanyID) which is to be specified in the index argument.
Thus, a basic 2-way model can be estimated by:
plm(Capex ~ CEO.background + MBA.CEO + CEO.Tenure + Female.CEO + CEO.age + Log.TA + Leverage, data=repexcapex, index = (c("CompanyID", "Year")), model = "within", effect = "twoways")
To add industry fixed effects, include + factor(Industry) in the formula. This variable will likely drop out of the estimation: Industry does not vary over time within a firm, so it is perfectly collinear with the firm fixed effects (this is the case for the small sample data you provided).
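The collinearity can be checked without plm. The following is a minimal sketch, assuming that plain dummy-variable regressions with base lm() are an acceptable stand-in for the within estimator; it uses only a subset of the posted columns, and the object names d, fit_firm and fit_ind are made up for illustration.

```r
# Subset of the columns posted in the question (values copied unchanged)
d <- read.table(text = "
Year,Industry,CompanyID,Capex,Log.TA,Leverage
2005,6,1075,0.08623238,9.199961396,0.330732917
2006,6,1075,0.097455145,9.334559982,0.26575725
2007,6,1075,0.113033772,9.346263914,0.285439531
2008,6,1075,0.108640177,9.327564318,0.322985772
2009,6,1075,0.08526524,9.360491034,0.333880116
2010,6,1075,0.081452292,9.376545673,0.32197511
2005,6,1743,0.236487293,6.693007633,0.021915227
2006,6,1743,0.319264835,6.820455133,0.023157959
2007,6,1743,0.207384938,6.844512984,0.020087012
2008,6,1743,0.130632264,6.890964093,0.017103795
2009,6,1743,0.112029325,6.879662342,0.017283796
2010,6,1743,0.02804693,6.767971236,0.044755539
2005,7,1004,0.076370794,6.596094672,0.31534354
2006,7,1004,0.114891589,6.886346743,0.327808308
2007,7,1004,0.097727719,6.973199328,0.307086799
2008,7,1004,0.112119583,7.216716829,0.389800369
2009,7,1004,0.086281135,7.228033526,0.331455792
2010,7,1004,0.298922358,7.313914813,0.291147083",
sep = ",", header = TRUE)

# Firm + year dummies, then industry dummies on top:
# Industry is constant within each firm, so its dummy is aliased
fit_firm <- lm(Capex ~ Log.TA + Leverage + factor(CompanyID) + factor(Year) +
                 factor(Industry), data = d)
coef(fit_firm)["factor(Industry)7"]   # NA: absorbed by the firm dummies

# Industry + year dummies only (no firm effects): everything is estimable
fit_ind <- lm(Capex ~ Log.TA + Leverage + factor(Industry) + factor(Year), data = d)
coef(fit_ind)
```

If industry and year effects without firm effects are what is wanted, the same specification can be written in plm by keeping index = c("CompanyID", "Year"), adding factor(Industry) + factor(Year) to the formula, and setting model = "pooling".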
Related
How to store individual prediction models when looping over a C5.0 decision tree with cross validation in R?
I am new to R and am using a for loop to implement 5-fold cross-validation with a C5.0 decision tree for an assignment. My dataset looks as follows:
head(data_known)
  order_item_id order_date item_id item_size brand_id item_price user_id
1             1    2012-09    1507   UNSIZED      102       24.9    4694
2             2    2012-11    1745        10       64       75.0    6097
3             3    2013-01    2588       XXL       42       79.9    7223
4             4    2012-08     164        40       47       79.9    4124
5             5    2012-09    1640         L       97       69.9     881
6             6    2013-03    2378        38       72      129.9    1576
  user_title user_dob             user_state user_reg_date
1        Mrs  1964-11   Rhineland-Palatinate       2011-02
2        Mrs  1973-08            Brandenburg       2011-05
3        Mrs  1949-08               Saarland       2013-01
4        Mrs  1960-12              Thuringia       2012-08
5        Mrs  1971-06      Baden-Wuerttemberg      2012-01
6        Mrs  1965-10 North Rhine-Westphalia       2011-02
  delivery_time_days user_title_NA item_size_NA user_dob_NA    target
1                  2             0            0           0    Return
2                  4             0            0           0 No Return
3                  2             0            0           0    Return
4                  5             0            0           0    Return
5                  3             0            0           0    Return
6                 11             0            0           0    Return
Now, my loop is:
explanatory_variables.dt<-names(data_known)[-16]
form.dt<-as.formula(paste("target ~", paste(explanatory_variables.dt, collapse = "+")))
folds.dt<-split(data_known,cut(sample(1:nrow(data_known)),5))
errs.c50.dt<-rep(NA,length(folds.dt))
for (i in 1:length(folds.dt)) {
  test.dt<-ldply(folds.dt[i],data.frame)
  train.dt<-ldply(folds.dt[-i],data.frame)
  tmp.model.dt<-C5.0(form.dt,train.dt)
  tmp.predict.dt<-predict(tmp.model.dt, newdata=test.dt)
  conf.mat.dt<-table(test.dt$target,tmp.predict.dt)
  errs.c50.dt[i]<-1-sum(diag(conf.mat.dt))/sum(conf.mat.dt)
}
print(sprintf("average error using k-fold cross validation and C5.0 decision tree algorithm: %.3f percent", 100*mean(errs.c50.dt)))
How do I access/save the whole tree model in the loop so that I can predict the target variable in another dataset where its true realizations are still unknown? Or do I have to base the predictions on tmp.model.dt alone when using cross-validation?
Thank you in advance for your help.
Best, Nico
Here is a simple reproducible answer that expands upon Roman's comment.
list_models <- list()
for (i in 1:2){
  tmp_data <- mtcars[,c(1, i+1)]
  list_models[[i]] <- lm(mpg ~ ., data = tmp_data)
}
head(predict(list_models[[1]], newdata = mtcars))
head(predict(list_models[[2]], newdata = mtcars))
I am using lm here, but this will work just as well with C5.0, as the predict function will work on either model object.
Function throws error while using lapply, works fine otherwise
I have a list containing data tables. A sample list can be created using the following code.
mydata=read.table(textConnection("
MSA_id code variable Caucasian African.American Asian Hispanic Other
412 111011 1 64 2 0 0 0
412 111011 2 464 17 4 11 0
412 111021 1 2006 43 32 22 61
412 111021 2 559 18 6 10 0
412 111031 1 56 1 0 0 0
412 111031 2 1 0 0 0 0"),header=TRUE)
setDT(mydata)
z = split(mydata,mydata$code)
> z[1:2]
$`111011`
   MSA_id   code variable Caucasian African.American Asian Hispanic Other
1:    412 111011        1        64                2     0        0     0
2:    412 111011        2       464               17     4       11     0
$`111021`
   MSA_id   code variable Caucasian African.American Asian Hispanic Other
1:    412 111021        1      2006               43    32       22    61
2:    412 111021        2       559               18     6       10     0
I want to reformat the elements of this list (data.tables) based on their values. From my code, the elements of the reformatted list should look like this:
First Element:
     [,1] [,2]
[1,]   64    2
[2,]  464   32
Second Element
   Caucasian African.American Asian Hispanic
1:      2006               43    32       22
2:       559               18     6       10
The algorithm for this is:
Remove the first 3 columns and the last column.
If the minimum value of Caucasian is 0, or the sum of the minimum values of the other 3 categories (that is: African.American, Asian, Hispanic) is 0, then set the element to NA.
Else if the minimum of African.American is 0, or the sum of the minimum values of Asian and Hispanic is 0, then sum up African.American, Asian, and Hispanic as a single category.
Else if the minimum value of Asian is 0 or the minimum value of Hispanic is 0, sum up Asian and Hispanic as a single category.
Else keep the format as it is.
I created a function to do it. When I use this function on one element at a time, it works fine, but when I use lapply, it breaks.
formatTable <- function(z){
  a = z[[1]]
  b = a[,list(Caucasian,African.American,Asian,Hispanic),] # Deleting columns 1,2,3 and 8
  if ( min(b$Caucasian) == 0) {
    formatTable=NA
  } else if ( (min(b$African.American) + min(b$Asian) + min(b$Hispanic)) == 0) {
    formatTable=NA
  } else if ( (min(b$African.American) == 0) | (min(b$Asian) + min(b$Hispanic)==0)) {
    formatTable = cbind(b$Caucasian, b$African.American+b$Asian+b$Hispanic)
  } else if ( min(b$Asian)==0 | min(b$Hispanic)==0) {
    formatTable = cbind(b$Caucasian, b$African.American, b$Asian+b$Hispanic)
  } else formatTable = b
}
Using this function, t1=formatTable(z[1]) and t2=formatTable(z[2]) give the correct result; however, if I use tbls = lapply(z[1:2],formatTable), it says
Error in FUN(X[[1L]], ...) : object 'Caucasian' not found.
Please help me understand why lapply throws this error.
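One likely cause is the difference between z[1] and z[[1]]. The sketch below uses plain data frames rather than data.tables (an assumption, so that it runs without extra packages) and two made-up helper functions, unwrap_again and use_directly. It shows that lapply hands each element to the function as z[[i]], so a function that applies [[1]] once more ends up with a single column instead of the table.

```r
# Small stand-in for the split list, using values from the question
mydata <- data.frame(code      = c(111011, 111011, 111021, 111021),
                     Caucasian = c(64, 464, 2006, 559),
                     Asian     = c(0, 4, 32, 6))
z <- split(mydata, mydata$code)

class(z[1])    # "list": a list of length one that still wraps the table
class(z[[1]])  # "data.frame": the table itself

# Hypothetical helper that unwraps its argument, as formatTable does with z[[1]].
# Called via lapply it receives the table already, so [[1]] extracts column 1.
unwrap_again <- function(x) x[[1]]
wrong <- lapply(z, unwrap_again)
str(wrong[[1]])   # a bare numeric vector (the code column), no 'Caucasian' in it

# Hypothetical fix: treat the argument as the table itself
use_directly <- function(x) x[, c("Caucasian", "Asian")]
right <- lapply(z, use_directly)
right[[1]]
```

Under this reading, dropping the a = z[[1]] step (or calling the function on z[[1]] rather than z[1] when testing a single element) would make the direct call and the lapply call behave the same way.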
Product between two data.frames columns
I have two data.frames. The first one holds the coefficients of my regressions for each day:
> parametrosBase
               beta0        beta1       beta4
2015-12-15 0.1622824 -0.012956819 -0.04637442
2015-12-16 0.1641884 -0.007914548 -0.06170213
2015-12-17 0.1623660 -0.005618474 -0.05914809
2015-12-18 0.1643263  0.005380472 -0.08533237
2015-12-21 0.1667710  0.003824588 -0.09040071
The second one holds the independent (x) variables:
> head(ir_dfSTORED)
         ind   m h0x       h1x       h4x beta0_h0x beta1_h1x beta4_h4x
1 2015-12-15  21   1 0.5642792 0.2859359         0         0         0
2 2015-12-15  42   1 0.3606713 0.2831963         0         0         0
3 2015-12-15  63   1 0.2550200 0.2334554         0         0         0
4 2015-12-15  84   1 0.1943071 0.1883048         0         0         0
5 2015-12-15 105   1 0.1561231 0.1544524         0         0         0
6 2015-12-15 126   1 0.1302597 0.1297947         0         0         0
> tail(ir_dfSTORED)
           ind    m h0x         h1x         h4x beta0_h0x beta1_h1x beta4_h4x
835 2015-12-21 2415   1 0.006799321 0.006799321         0         0         0
836 2015-12-21 2436   1 0.006740707 0.006740707         0         0         0
837 2015-12-21 2457   1 0.006683094 0.006683094         0         0         0
838 2015-12-21 2478   1 0.006626457 0.006626457         0         0         0
839 2015-12-21 2499   1 0.006570773 0.006570773         0         0         0
840 2015-12-21 2520   1 0.006516016 0.006516016         0         0         0
What I want is to multiply the beta0 column of "parametrosBase" by the h0x column of "ir_dfSTORED" and store the result in the beta0_h0x column, and to do the same for the others, beta1 and beta4.
The problem I'm facing is with the dates in the "ind" column. The multiplication has to respect the dates: whenever the day changes in "ir_dfSTORED", I have to switch to the same day in "parametrosBase". For example, the first row of "parametrosBase",
2015-12-15 0.1622824 -0.012956819 -0.04637442
is fixed for the day 2015-12-15, and I compute the product with it. Once I reach 2015-12-16, I have to use the second row of "parametrosBase".
How can I do this? Thanks a lot. :)
Maybe you should merge the two datasets first:
parametrosBase$ind <- rownames(parametrosBase)
df <- merge(ir_dfSTORED,parametrosBase)
df <- within(df,{
  beta0_h0x <- beta0*h0x
  beta1_h1x <- beta1*h1x
  beta4_h4x <- beta4*h4x
})
Since I don't know the structure of the data, you may have to convert the dates from rownames to a date format for the merge to work. Using ind as the name of the date column in parametrosBase is key to making merge work; otherwise you'll have to specify the variables to merge by.
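As an alternative to merging, the date lookup can be done with match(), which keeps the original row order. This is only a sketch on a cut-down toy version of the two tables: the two parameter rows and the first two h1x values are taken from the question, the third h1x value is made up, and the object names params and x are placeholders.

```r
# Coefficients per day, with the dates stored as row names (as in parametrosBase)
params <- data.frame(beta0 = c(0.1622824, 0.1641884),
                     beta1 = c(-0.012956819, -0.007914548),
                     row.names = c("2015-12-15", "2015-12-16"))

# Regressors per day and maturity (toy stand-in for ir_dfSTORED)
x <- data.frame(ind = c("2015-12-15", "2015-12-15", "2015-12-16"),
                h0x = c(1, 1, 1),
                h1x = c(0.5642792, 0.3606713, 0.5),  # last value is invented
                stringsAsFactors = FALSE)

# For every row of x, find the row of params that has the same date
idx <- match(x$ind, rownames(params))

# Multiply each regressor by the coefficient of its own day
x$beta0_h0x <- params$beta0[idx] * x$h0x
x$beta1_h1x <- params$beta1[idx] * x$h1x
x
```

If ind is stored as a Date or factor in the real data, it would need to be converted with as.character() before the match() call so that it compares equal to the row names.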
Format of time series data with regression variables
I am new to R and am attempting an analysis in R with the tslm() function. Sample data in csv format (the first, unlabeled column is the year):
UnitSales GDP GDPPerCap CPI PropInvIndex DispIncTopDecile TransCommSecDecile CivilVehOwn Urban AutoFin
2000 1 1198243.4 949 81.62 4984 10643 618 1609 36.22 0
2001 2 1324337.8 1042 81.38 6344 14219 782 1802 37.66 0
2002 3 1453827.558 1135 80.8 7790.9223 18995.9 991.2 2053.17 39.08978381 0
2003 4 1640958.735 1274 81.83 10153.8009 21837.3 1106 2382.93 40.53022975 0
2004 5 1931644.33 1409 85.02 13158.2516 25377.2 1274.2 2693.71 41.76000862 0
2005 6 2256902.591 1731 86.56 15909.2471 28773.1 1590.3 3160 42.98999663 0
2006 7 2712950.885 2069 87.83 19422.9174 31967.3 1801 3697.3531 44.34301016 0
2007 9 3494055.942 2651 92.02 25288.8373 36784.5 2467.7 4358.355 45.8892446 0.1
2008 11 4521827.271 3414 97.45 31203.1942 43613.8 2632.9 5099.6094 46.98950317 0.12
2009 13 4990233.519 3749 96.76 36241.808 46826.1 3181.9 6280.6086 48.34170101 0.14
2010 15 5930502.27 4433 100 48259.403 51431.6 3630.6 7801.8259 49.94966105 0.16
2011 18 7321891.955 5447 105.45 61796.8858 58841.9 3963 9356.3163 51.27027127 0.18
2012 21 8229490.03 6093 108.22 71803.7869 63824.2 4304.1 10933.0912 52.57008656 0.22
I load the data and then attempt to run:
testc <- tslm(UnitSales~GPD+trend, data=lm0015c)
This is a simple attempt to model UnitSales from the GDP variable plus the trend. I get the following error:
Error in tslm(UnitSales ~ GPD + trend, data = lm0015c) : Not time series data
How do I designate the data as a time series?
You can create a time series formatted version of your data directly with the ts() function.
yourGreatData <- ts(d[,2:length(d)], start = 2000, end = 2012)
I found that an easier solution to this problem is simply to coerce your data frame to a time series (note that the regressor must be spelled GDP, as in the column name, not GPD):
testc <- tslm(UnitSales~GDP+trend, data=as.ts(lm0015c))
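The conversion itself can be checked with base R before calling tslm() from the forecast package. A minimal sketch, assuming a data frame d whose first column holds the year (the name d and the choice of columns are placeholders; the values are the first five rows of the sample data):

```r
# First five years of the sample data, UnitSales and GDP only
d <- data.frame(Year      = 2000:2004,
                UnitSales = c(1, 2, 3, 4, 5),
                GDP       = c(1198243.4, 1324337.8, 1453827.558,
                              1640958.735, 1931644.33))

# Drop the year column and let ts() carry the time index instead
d_ts <- ts(d[, -1], start = d$Year[1], frequency = 1)

is.ts(d_ts)      # TRUE: this is what tslm() checks for
start(d_ts)      # first period of the series
end(d_ts)        # last period of the series
colnames(d_ts)   # the variable names survive the conversion
```

The resulting object can then be passed as the data argument, with the formula referring to the column names kept by ts().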
mistake in multivePenal but not in frailtyPenal
The libraries used are:
library(survival)
library(splines)
library(boot)
library(frailtypack)
and the function used is in the frailtypack library.
In my data I have two recurrent events (delta.stable and delta.unstable) and one terminal event (delta.censor). There are some time-varying explanatory variables, such as the unemployment rate (u.rate), which is quarterly; that is why my dataset has been split by quarters.
Here is a link to the subsample used in the code just below, in case it helps to see the mistake. https://www.dropbox.com/s/spfywobydr94bml/cr_05_males_services.rda
The problem is that it runs for a long time before the warning message appears.
The main variables of the survival function are the following. I have two recurrent events:
delta.unstable (unst.): takes value one when the individual finds an unstable job.
delta.stable (stable): takes value one when the individual finds a stable job.
And one terminal event:
delta.censor (d.censor): takes value one when the individual has died, retired or emigrated.
row  id contadorbis unst. stable d.censor .t0  .t
  1  78           1     0      1        0   0  88
  2 101           2     0      1        0   0  46
  3 155           3     0      1        0   0  27
  4 170           4     0      0        0   0  61
  5 170           4     1      0        0  61  86
  6 213           5     0      0        0   0  92
  7 213           5     0      0        0  92 182
  8 213           5     0      0        0 182 273
  9 213           5     0      0        0 273 365
 10 213           5     1      0        0 365 394
 11 334           6     0      1        0   0   6
 12 334           7     1      0        0   0  38
 13 369           8     0      0        0   0  27
 14 369           8     0      0        0  27 119
 15 369           8     0      0        0 119 209
 16 369           8     0      0        0 209 300
 17 369           8     0      0        0 300 392
When I apply multivePenal I obtain the following message (translated from Spanish):
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
In addition: Lost warning messages
In Surv(.t0, .t, delta.stable) : Stop time must be > start time, NA created
#### multivePenal function
fit.joint.05_malesP<-multivePenal(Surv(.t0,.t,delta.stable)~cluster(contadorbis)+terminal(as.factor(delta.censor))+event2(delta.unstable),formula.terminalEvent=~1, formula2=~as.factor(h.skill),data=cr_05_males_serv,Frailty=TRUE,recurrentAG=TRUE,cross.validation=F,n.knots=c(7,7,7), kappa=c(1,1,1), maxit=1000, hazard="Splines")
I have checked whether Surv(.t0,.t,delta.stable) contains NA, and there are no NAs. In addition, when I apply the function frailtyPenal to the same data for both possible combinations, the function runs well and I get results. I have spent a week on this and cannot find the key. I would appreciate some light on this problem.
#delta unstable+death
fit.joint.05_males<-frailtyPenal(Surv(.t0,.t,delta.unstable)~cluster(id)+u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(non.manual)+as.factor(municipio)+as.factor(spanish.speakers)+ as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+ as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+ as.factor(responsabilities)+ terminal(delta.censor),formula.terminalEvent=~u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+ as.factor(responsabilities),data=cr_05_males_services,n.knots=12,kappa1=1000,kappa2=1000,maxit=1000, Frailty=TRUE,joint=TRUE, recurrentAG=TRUE)
###Be patient. The program is computing ...
###The program took 2259.42 seconds
#delta stable+death
fit.joint.05_males<-frailtyPenal(Surv(.t0,.t,delta.stable)~cluster(id)+u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(non.manual)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+as.factor(responsabilities)+terminal(delta.censor),formula.terminalEvent=~u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+as.factor(responsabilities),data=cr_05_males_services,n.knots=12,kappa1=1000,kappa2=1000,maxit=1000, Frailty=TRUE,joint=TRUE, recurrentAG=TRUE)
###The program took 3167.15 seconds
Because you provide neither information about the packages used nor the data necessary to run multivePenal or frailtyPenal, I can only help you with the Surv part (because I happened to have that package loaded).
The Surv warning message you provided (In Surv(.t0, .t, delta.stable) : Stop time must be > start time, NA created) suggests that something is strange with your variables .t0 (the time argument in Surv, referred to as 'start time' in the warning) and/or .t (the time2 argument, 'Stop time' in the warning). I check this possibility with a simple example:
library(survival)
# read the data you feed `Surv` with
df <- read.table(text = "row id contadorbis unst. stable d.censor .t0 .t
1 78 1 0 1 0 0 88
2 101 2 0 1 0 0 46
3 155 3 0 1 0 0 27
4 170 4 0 0 0 0 61
5 170 4 1 0 0 61 86
6 213 5 0 0 0 0 92
7 213 5 0 0 0 92 182
8 213 5 0 0 0 182 273
9 213 5 0 0 0 273 365
10 213 5 1 0 0 365 394
11 334 6 0 1 0 0 6
12 334 7 1 0 0 0 38
13 369 8 0 0 0 0 27
14 369 8 0 0 0 27 119
15 369 8 0 0 0 119 209
16 369 8 0 0 0 209 300
17 369 8 0 0 0 300 392", header = TRUE)
# create survival object
mysurv <- with(df, Surv(time = .t0, time2 = .t, event = stable))
mysurv
# create a new data set where one .t for some reason is less than .t0
# on row five .t0 is 61, so I set .t to 60
df2 <- df
df2$.t[df2$.t == 86] <- 60
# create survival object using new data which contains at least one Stop time that is less than Start time
mysurv2 <- with(df2, Surv(time = .t0, time2 = .t, event = stable))
# Warning message:
# In Surv(time = .t0, time2 = .t, event = stable) :
# Stop time must be > start time, NA created
# i.e. the same warning message as you got
# check the survival object
mysurv2
# as you can see, the fifth interval contains NA
# I would recommend you check .t0 and .t in your data set carefully
# one way to examine rows where Stop time (.t) is not greater than start time (.t0) is:
df2[which(df2$.t0 >= df2$.t), ]
I am not familiar with multivePenal, but it seems that it does not accept a survival object which contains intervals with NA, whereas frailtyPenal might do so.
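The warning text says the stop time must be strictly greater than the start time, so zero-length intervals are worth flagging alongside reversed ones. A small base R sketch of such a check, on made-up intervals with placeholder column names t0 and t standing in for .t0 and .t:

```r
# Toy intervals: row 2 is reversed, row 4 has zero length
intervals <- data.frame(id = c(170, 170, 213, 213),
                        t0 = c(0, 61, 0, 92),
                        t  = c(61, 60, 92, 92))

# Flag rows with a missing time or with stop <= start
bad <- is.na(intervals$t0) | is.na(intervals$t) | intervals$t <= intervals$t0

which(bad)         # row numbers to inspect
intervals[bad, ]   # the offending rows themselves
```

Running the same filter on the full data set, rather than on the printed excerpt, would show whether any interval outside the first 17 rows is responsible for the warning.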
The authors of the package have told me that the function is not finished yet, so perhaps that is the reason that it is not working well.
I encountered the same error and arrived at this solution. frailtyPenal() will not accept data.frames of different length: the data.frame used in Surv and the data.frame named in data= in frailtyPenal must be the same length. I used a Cox regression to identify the incomplete cases, reset the survival object to exclude the missing cases and, finally, ran frailtyPenal:
library(survival)
library(frailtypack)
data(readmission)
#Reproduce the error
#change the first start time to NA
readmission[1,3] <- NA
#create a survival object with one missing time
surv.obj1 <- with(readmission, Surv(t.start, t.stop, event))
#observe the error
frailtyPenal(surv.obj1 ~ cluster(id) + dukes, data=readmission, cross.validation=FALSE, n.knots=10, kappa=1, hazard="Splines")
#repair by resetting the surv object to omit the missing value(s)
#identify NAs using a Cox model
cox.na <- coxph(surv.obj1 ~ dukes, data = readmission)
#remove the NA cases from the original set to create complete cases
readmission2 <- readmission[-cox.na$na.action,]
#reset the survival object using the complete cases
surv.obj2 <- with(readmission2, Surv(t.start, t.stop, event))
#run frailtyPenal using the complete cases dataset and the complete cases Surv object
frailtyPenal(surv.obj2 ~ cluster(id) + dukes, data = readmission2, cross.validation = FALSE, n.knots = 10, kappa = 1, hazard = "Splines")