Calculated values on imputed data - r

I'd like to do something like the following: (myData is a data table)
#create some data
myData = data.table(invisible.covariate=rnorm(50),
visible.covariate=rnorm(50),
category=factor(sample(1:3,50, replace=TRUE)),
treatment=sample(0:1,50, replace=TRUE))
myData[,outcome:=invisible.covariate+visible.covariate+treatment*as.integer(category)]
myData[,invisible.covariate:=NULL]
#process it
myData[treatment == 0,untreated.outcome:=outcome]
myData[treatment == 1,treated.outcome:=outcome]
myPredictors = matrix(0,ncol(myData),ncol(myData))
myPredictors[5,] = c(1,1,0,0,0,0)
myPredictors[6,] = c(1,1,0,0,0,0)
myImp = mice(myData,predictorMatrix=myPredictors)
fit1 = with(myImp, lm(treated.outcome ~ category)) #this works fine
for_each_imputed_dataset(myImp, #THIS IS NOT A REAL FUNCTION but I hope you get the idea
function(imputed_data_table) {
imputed_data_table[,treatment.effect:=treated.outcome-untreated.outcome]
})
fit2 = with(myImp, lm(treatment.effect ~ category))
#I want fit2 to be an object similar to fit1
...
I would like to add a calculated value to each imputed data set, then do statistics using that calculated value. Obviously the structure above is probably not how you'd do it. I'd be happy with any solution, whether it involves preparing the data table somehow before the mice, a step before the "fit =" as sketched above, or some complex function inside the "with" call.

The complete() function will generate the "complete" imputed data set for each of the requested iterations. But note that mice expects to work with data.frames, so it returns data.frames and not data.tables. (Of course you can convert if you like). But here is one way to fit all those models
imp = mice(myData,predictorMatrix=predictors)
fits<-lapply(seq.int(imp$m), function(i) {
lm(I(treated.outcome-untreated.outcome)~category, complete(imp, i))
})
fits
The results will be in a list and you can extract particular lm objects via fits[[1]], fits[[2]], etc

Related

Joining two time-series in same function

I'm trying to join two function results in one and trying to obtain results in one column.
My code
myfun <-function(x){
fit <-Arima(x, order = c(1,1,1), seasonal = list(order = c(0,1,0), period = 52),include.mean=TRUE,
include.constant = FALSE, method = 'CSS')
fit_a <- forecast(fit$fitted)
fit_a <- data.frame(fit_a$fitted)
colnames(fit_a)[1] <- "load"
fit_a$load <- as.data.frame(fit_a$load)
fit_b <- data.frame(forecast(fit,h=400))
fit_b <- data.frame(fit_b$Point.Forecast)
colnames(fit_b)[1] <- "load"
fit_b$load <- as.data.frame(fit_b$load)
return(rbind(fit_a,fit_b))
}
I'm getting values individually like return(fit_a) and return(fit_b) but while doing rbind() I can't because of individual time-series data.
Tried : c(fit_a,fit_b) showing two different ts( which confirms we are having output and just failing over rbind()).
Can someone help me, how to extract both fitted and forecasted values in same function.
Thanks in advance!
One way to return more than one object is to create a list of the objects, then return that. In your case you could use this at the end of your function:
fit <- list(fit_a, fit_b)
return(fit)
Then you can access the elements using fit[[1]] or fit[[2]].
You also have the option of naming the elements so you can access them using the $, like so:
fit <- list(fit_a = fit_a, fit_b = fit_b)
return(fit)
Then you can use fit$fit_a and fit$fit_b

Include an object within a function only if it exists

I have a loop that needs to be executed; within which are 6 models. The objects that those models are stored in then need to get passed into a function that executes an AIC analysis. However, sometimes one of the models does not work, which then breaks the code for the AIC function because it does not recognize whatever model that failed because it was not stored as an object.
So, I need a way to pull those models that worked into the AIC function.
Here is an example, but keep in mind it is important that this can all be executed within a loop. Here are three hypothetical models:
hn.1 <- ds(data)
hn.1.obs <- ds(data,formula = ~OBSCODE)
hn.1.obs.mas <- ds(dataformula = ~OBSCODE+MAS)
And this would be my AIC function that compares the models:
summarize_ds_models(hn.1, hn.1.obs, hn.1.obs.mas)
But I get an error if say, the hn.1.obs.mas model failed.
I tried to use "get" and "ls" and I successfully pull the models that exist when I call:
get(ls(pattern='hn.15*'))
But that just returns a character vector, so that when I call:
summarize_ds_models(get(ls(pattern='hn.15*')))
it only conducts the AIC analysis on the first model in the above character vector.
Am I on the right track or is there a better way to do this?
UPDATE with a reproducible example.
Here is a simplified version of my problem:
create and fill two data frames that will be put into a list:
data.frame <- data.frame(x = integer(4),
y = integer(4),
z = integer(4),
i = integer(4))
data.frame$x <- c(1,2,3,4)
data.frame$y <- c(1,4,9,16)
data.frame$z <- c(1,3,8,10)
data.frame$i <- c(1,5,10,15)
data.frame.2 <- data.frame[1:4,1:3]
my.list <- list(data.frame,data.frame.2)
create df to fill with best models from AIC analyses
bestmodels <- data.frame(modelname = character(2))
Here is the function that will run the loop:
myfun <- function(list) {
for (i in 1:length(my.list)){
mod.1 = lm(y ~ x, data = my.list[[i]])
mod.2 = lm(y ~ x + z, data = my.list[[i]])
mod.3 = lm(y ~ i, data = my.list[[i]])
bestmodels[i,1] <- rownames(AIC(mod.1,mod.2,mod.3))[1]#bestmodel is 1st row
}
print(bestmodels)
}
However, on the second iteration of the loop, the AIC function will fail because mod.3 will fail. So, is there a generic way to make it so the AIC function will only execute for those models that worked? The outcome I would want here would be:
> bestmodels
modelname
1 mod.1
2 mod.1
since mod.1 would be chosen for both AIC analyses.
Gregor's comment:
Use a list instead of individual named objects. Then do.call(summarize_ds_models, my_list_of_models). If it isn't done already, you can Filter the list first to make sure only working models are in the list.
solved my problem. Thanks

Pass df column names to nested equation in Graph Printing Function

I need some clarification on the primary post on Passing a data.frame column name to a function
I need to create a function that will take a testSet, trainSet, and colName(aka predictor) as inputs to a function that prints a plot of the dataset with a GAM model trend line.
The issue I run into is:
plot.model = function(predictor, train, test) {
mod = gam(Response ~ s(train[[predictor]], spar = 1), data = train)
...
}
#Function Call
plot.model("Predictor1", 1.0, crime.train, crime.test)
I can't simply pass the predictor as a string into the gam function, but I also can't use a string to index the data frame values as shown in the link above. Somehow, I need to pass the colName key to the game function. This issue occurs in other similar scenarios regarding plotting.
plot <- ggplot(data = test, mapping = aes(x=predictor, y=ViolentCrimesPerPop))
Again, I can't pass a string value for the column name and I can't pass the column values either.
Does anyone have a generic solution for these situations. I apologize if the answer is buried in the above link, but it's not clear to me if it is.
Note: A working gam function call looks like this:
mod = gam(Response ~ s(Predictor1, spar = 1.0), data = train)
Where the train set is a data frame with column names "Response" & "Predictor".
Use aes_string instead of aes when you pass a column name as string.
plot <- ggplot(data = test, mapping = aes_string(x=predictor, y=ViolentCrimesPerPop))
For gam function:: Example which is copied from gam function's documentation. I have used vector, scalar is even easier. Its just using paste with a collapse parameter.
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
# String manipulate for formula
formula <- as.formula(paste("y~s(", paste(colnames(dat)[2:5], collapse = ")+s("), ")", sep =""))
b <- gam(formula, data=dat)
is same as
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)

How to use one variable in regression with many independent variables in lm()

I need to reproduce this code using all of these variables.
composite <- read.csv("file.csv", header = T, stringsAsFactors = FALSE)
composite <- subset(composite, select = -Date)
model1 <- lm(indepvariable ~., data = composite, na.action = na.exclude)
composite is a data frame with 82 variables.
UPDATE:
What I have done is found a way to create an object that contains only the significantly correlated variables, to narrow the number of independent variables down.
I have a variable now: sigvars, which is the names of an object that sorted a correlation matrix and picked out only the variables with correlation coefficients >0.5 and <-0.5. Here is the code:
sortedcor <- sort(cor(composite)[,1])
regvar = NULL
k = 1
for(i in 1:length(sortedcor)){
if(sortedcor[i] > .5 | sortedcor[i] < -.5){
regvar[k] = i
k = k+1
}
}
regvar
sigvars <- names(sortedcor[regvar])
However, it is not working in my lm() function:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
Error: Error in model.frame.default(formula = data.matrix(composite[1]) ~ sigvars, : variable lengths differ (found for 'sigvars')
Think about what sigvars is for a minute...?
After sigvars <- names(sortedcor[regvar]), sigvars is a character vector of column names. Say your data have 100 rows and 5 variables come out as significant using the method you've chosen (which doesn't sound overly defensible to be). The model formula you are using will result in composite[, 1] being a vector of length 100 (100 rows) and sigvars being a character vector of length 5.
Assuming you have the variables you want to include in the model, then you could do:
form <- reformulate(sigvars, response = names(composite)[1])
model1 <- lm(form, data = composite)
or
model1 <- lm(composite[,1] ~ ., data = composite[, sigvars])
In the latter case, do yourself a favour and write the name of the dependent variable into the formula instead of composite[,1].
Also, you don't seem to have appreciated the difference between [i] and [i,j] for data frames, hence you are doing data.matrix(composite[1]) which is taking the first component of composite, leaving it as a data frame, then converting that to a matrix via the data.matrix() function. All you really need is just the name of the dependent variable on the LHS of the formula.
The error is here:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
The sigvars is names(data). The equation is usually of the form lm(var1 ~ var2+var3+var4), you however have it as lm(var1 ~ var2 var3 var4).
Hopefully that helps.

apply series of commands to split data frame

I'm having some difficulties figuring out how to approach this problem. I have a data frame that I am splitting into distinct sites (link5). Once split I basically want to run a linear regression model on the subsets. Here is the code I'm working with, but it's definitely not correct. Also, It would be great if I could output the model results to a new data frame such that each site would have one row with the model parameter estimates - that is just a wish and not a necessity right now. Thank you for any help!
les_events <- split(les, les$link5)
result <- lapply(les_events) {
lm1 <-lm(cpe~K,data=les_events)
coef <- coef(lm1)
q.hat <- -coef(lm1)[2]
les_events$N0.hat <- coef(lm1[1]/q.hat)
}
You have a number of issues.
You haven't passed a function (the FUN argument) to lapply
Your closure ( The bit inside {} is almost, but not quite the body you want for your function)
something like th following will return the coefficients from your models
result <- lapply(les_events, function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
This will return a list of data.frames containing columns for each coefficient.
lapply(les_events, lm, formula = 'cpe~K')
will return a list of linear model objects, which may be more useful.
For a more general split / apply / combine approaches use plyr or data.table
data.table
library(data.table)
DT <- data.table(les)
result <- les[, {lm1 <- lm(cpe ~ K, data = .SD)
as.list(lm1)}, by = link5]
plyr
library(plyr)
result <- ddply(les, .(link5), function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
# or to return a list of linear model objects
dlply(les, link5, function(DD){ lm(cpe ~K, data =DD)})

Resources