For documentation purposes, my department requires that I keep copies of the raw output from models, not just an organized table of the parts I want to keep and publish.
For example, here is a function I wrote to run a CoxPH model and organize the output into a table. I have several versions of this function for various outcomes, and I use lapply() to call them all. Here is a simplified version of the code:
myfun <- function(x){
  model <- summary(coxph(Surv(FAIL, DEAD) ~ x + covariate1 + covariate2, data=df))
  # just pull out the parts I need
  results <- data.frame(cbind(model$conf.int[,c("exp(coef)",
                                                 "lower .95", "upper .95")],
                              coef(model)[,"Pr(>|z|)"]))
  names(results) <- c("RR", "LL", "UL", "pval")
  # format the results for the table
  results$answer <- paste0(format(round(results[,"RR"],2),2), " (",
                           format(round(results[,"LL"],2),2), ", ",
                           format(round(results[,"UL"],2),2), ")")
  # drop all the results for covariates
  results <- results[substr(row.names(results),1,nchar(expo))==expo,
                     c("answer","pval")]
  return(results)
}
The final "results" is just a nice clean data frame of the main independent variable, RR (95% CI) and a p-value. Everything runs fine.
When I call the function:
outcomes <- c("DEAD","HEARTDISEASE","LUNGCANCER")
final.results <- lapply(outcomes, function(x) myfun(x))
In this case I get a nice list of three data frames with my organized results. I can rbind() the elements of the list and export the result to Excel. -- in truth, there are a lot more outcomes, an age-only model and a fully adjusted model, and several main effect variables, but it all basically works like the code above --
Now, I need to print in the console the original summary() of the Cox model, and still keep my final list of organized output. I have done this by adding the following to the above code:
myfun <- function(x){
  # all that code listed above #
  # create a list to hold both raw output and my cleaned results
  final <- list()
  final$model <- model
  final$results <- results
  return(final)
}
Then after the function call, I do some separate coding to separate out the summary() portion for each iterative run, print them, then do the same for the final organized results.
Ideally I would just like each iterative run of the function to only return the nice clean table, but print the entire summary(model) to the console. How can I make that happen in the function that will still work with my lapply() call?
Is there a simpler method than just creating a giant list of everything and dealing with it after all the models are run?
Thank you.
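One way to get this behavior is to call print() on the summary inside the function and return only the cleaned table. The sketch below is a minimal illustration, not the original function: it swaps in lm() on the built-in mtcars data so that it runs without the survival package, and the names fit_and_report, outcomes and combined are hypothetical.

```r
# Hypothetical stand-in for the Cox model function: lm() on mtcars is used
# here only so the sketch runs with base R.
fit_and_report <- function(outcome) {
  fit <- summary(lm(reformulate(c("wt", "hp"), response = outcome),
                    data = mtcars))

  # Side effect: the raw summary goes to the console, labelled by outcome
  cat("\n==== Outcome:", outcome, "====\n")
  print(fit)

  # Return value: only the cleaned-up table
  est <- coef(fit)
  data.frame(outcome  = outcome,
             term     = rownames(est),
             estimate = round(est[, "Estimate"], 2),
             pval     = est[, "Pr(>|t|)"],
             row.names = NULL)
}

outcomes <- c("mpg", "qsec")
final_results <- lapply(outcomes, fit_and_report)  # summaries print as it runs
combined <- do.call(rbind, final_results)          # one tidy data frame
```

Because printing is a side effect, lapply() still collects only the returned data frames. If the raw output has to be kept in a file rather than the console, wrapping the lapply() call in sink() is one option.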
Let's say I have the following loop:
test <- mtcars
target_var <- c("mpg", "wt")
group_var <- c("gear", "carb")
library(tidyverse)
for (i in target_var) {
for (j in group_var) {
print(prop.table(table(test[, i], test[, j]), 2))
}
}
When I run it, I see four prop.tables reported, each with a heading of [[1]] through [[4]]. I can visually match up which prop.table goes with which variables by looking at the code and seeing the order of variables assigned to target_var and group_var.
But let's say I'm looping through a dozen variables or more. Obviously, it's a problem to match up one particular prop.table (listed as, say, "[[14]]") with the actual variables in that table.
Is there a way to print the command R used to generate each particular table?
I am not looking for a progress bar but, instead, something that would list the code above each table printed in the console. For example, the following code would be printed right above the appropriate prop.table:
prop.table(table(test$mpg, test$gear), 2)
# actual prop.table result from the first time through the loop here
prop.table(table(test$mpg, test$carb), 2)
# actual prop.table result from the second time through the loop here
This would help me when doing exploratory data analysis.
If I understand correctly, you want the results of your function to be printed "as you go" (i.e. at each iteration of the loop, not at the end)?
You can use cat instead of print (be sure to print a new line \n after you call cat to avoid a messy console):
for(i in 1:3) {
results <- mean(rnorm(10))
cat(results)
cat("\n")
}
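Applied to the nested loop from the question, the same cat() idea can write a label above each table. This is a minimal sketch: the label is built with sprintf() to mimic the command text the question asks for, and the labels vector is a hypothetical addition that only exists to keep a record of what was printed.

```r
test <- mtcars
target_var <- c("mpg", "wt")
group_var <- c("gear", "carb")

labels <- character(0)  # hypothetical: record of the labels printed
for (i in target_var) {
  for (j in group_var) {
    # Build the text of the equivalent command and print it above the table
    label <- sprintf("prop.table(table(test$%s, test$%s), 2)", i, j)
    labels <- c(labels, label)
    cat("\n", label, "\n", sep = "")
    print(prop.table(table(test[[i]], test[[j]]), 2))
  }
}
```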
I want to run a linear regression where my dependent variable is data$fs_deviation_score and the independent variables are multiple columns of my data frame (columns 660 to 675).
With this function it works, but I am not able to save the output (Coefficients: Estimate, Std. Error, p value...)
Reg<-lapply( data[660:675], function(x) summary(lm(data$fs_deviation_score ~ x)))
I am looking for a function to save the output (Coefficients: Estimate, Std. Error, p value...)
Thanks
If you have that Reg object produced by:
Reg<-lapply( data[660:675], function(x) summary(lm(data$fs_deviation_score ~ x)))
Then all you need to do to extract the coefficients matrix from each summary object is this:
CoefMats <- lapply( Reg, coef)
The summary-objects are actually named lists (as are lm and glm objects). ?summary.lm will bring up the specific help page and the Value subsection will give you all the names. The See Also links should also be reviewed.
Had you wanted the r.squared values, you could have used this:
Rsqds <- lapply(Reg, "[[", "r.squared")
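To actually save the coefficients, the list of matrices can be stacked into one data frame and written to disk. The following is a sketch under assumptions: mtcars stands in for the original data (mpg as the response and three columns as predictors), and coef_table and out_file are hypothetical names.

```r
# Stand-in for the original data: regress mpg on each of three columns
Reg <- lapply(mtcars[c("wt", "hp", "disp")],
              function(x) summary(lm(mtcars$mpg ~ x)))
CoefMats <- lapply(Reg, coef)

# Stack the per-predictor coefficient matrices into one data frame,
# keeping track of which predictor each row came from
coef_table <- do.call(rbind, lapply(names(CoefMats), function(nm) {
  m <- CoefMats[[nm]]
  data.frame(predictor = nm, term = rownames(m), m,
             row.names = NULL, check.names = FALSE)
}))

# Write the combined table to a CSV file (a temporary location here)
out_file <- file.path(tempdir(), "coefficients.csv")
write.csv(coef_table, out_file, row.names = FALSE)
```

Setting check.names = FALSE keeps column names such as "Std. Error" and "Pr(>|t|)" exactly as summary() reports them.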
Example:
m <- lm (as.integer (Species) ~ ., data = iris);
save (m, file = "iris_lm.RData");
Now in a new session.
load ("iris_lm.RData");
summary (m);
names (m);
You can use the save function to save the entire object and read and use it later. Note that this is a general way to store any R object.
If you want to store individual components of the model m, you can do so as follows, for example to store the residuals.
write.csv (m$residuals, "residuals.csv");
To access each component of the model, execute names(m) to see what the components are. For the components of the summary, see help(summary.lm).
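As an alternative to save() and load(), a single object can be stored with saveRDS() and restored with readRDS(), which returns the object instead of recreating it under its old name. A minimal sketch, assuming a small model on iris and a temporary file path (rds_path and m_restored are hypothetical names):

```r
# Fit a small model and store it as a single-object .rds file
m <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
rds_path <- file.path(tempdir(), "iris_lm.rds")
saveRDS(m, rds_path)

# Later (possibly in a new session): restore under any name you like
m_restored <- readRDS(rds_path)

# The restored object behaves like the original model
same_coefs <- all.equal(coef(m), coef(m_restored))
summary_components <- names(summary(m_restored))
```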
If you want to save the output of the regression model in Tibble format, you can use the broom package. This works for other model objects too.
Also, gtsummary package helps to get output in presentation-ready table format.
For examples; see this
I am writing a script that ultimately returns a data frame. My question is around if there are any good practices on how to use a unit test package to make sure that the data frame that is returned is correct. (I'm a beginning R programmer, plus new to the concept of unit testing)
My script effectively looks like the following:
# initialize data frame
df.out <- data.frame(...)
# function set
function1 <- function(x) {...}
function2 <- function(x) {...}
# do something to this data frame
df.out$new.column <- function1(df.out)
# do something else
df.out$other.new.column <- function2(df.out)
# etc ....
... and I ultimately end up with a data frame with many new columns. However, what is the best approach to test that the data frame that is produced is what is anticipated, using unit tests?
So far I have created unit tests that check the results of each function, but I want to make sure that running all of these together produces what is intended. I've looked at Hadley Wickham's page on testing but can't see anything obvious regarding what to do when returning data frames.
My thoughts to date are:
Create an expected data frame by hand
Check that the output equals this data frame, using expect_that or similar
Any thoughts / pointers on where to look for guidance? My Google-fu has let me down considerably on this one to date.
Your intuition seems correct. Construct a data.frame manually based on the expected output of the function and then compare that against the function's output.
library(testthat)
# manually created data
dat <- iris[1:5, c("Species", "Sepal.Length")]
# function
myfun <- function(row, col, data) {
data[row, col]
}
# result of applying function
outdat <- myfun(1:5, c("Species", "Sepal.Length"), iris)
# two versions of the same test
expect_true(identical(dat, outdat))
expect_identical(dat, outdat)
If your data.frame may not be identical, you could also run tests in parts of the data.frame, including:
dim(outdat), to check if the size is correct
attributes(outdat) or attributes of columns
sapply(outdat, class), to check variable classes
summary statistics for variables, if applicable
and so forth
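The partial checks listed above can be sketched as follows. To keep the example free of dependencies it uses base R (identical(), all.equal() and stopifnot()) rather than testthat, and make_output is a hypothetical function standing in for the script under test.

```r
# Hypothetical function under test: subsets iris and adds one derived column
make_output <- function(data) {
  out <- data[1:5, c("Species", "Sepal.Length")]
  out$Sepal.Length.sq <- out$Sepal.Length^2
  out
}

outdat <- make_output(iris)

checks <- c(
  # size of the result
  dims    = identical(dim(outdat), c(5L, 3L)),
  # column names, in order
  names   = identical(names(outdat),
                      c("Species", "Sepal.Length", "Sepal.Length.sq")),
  # class of each column
  classes = identical(unname(sapply(outdat, class)),
                      c("factor", "numeric", "numeric")),
  # a summary statistic, compared with numeric tolerance
  mean    = isTRUE(all.equal(mean(outdat$Sepal.Length), 4.86))
)

stopifnot(all(checks))
```

Keeping the checks in a named vector makes it easy to see which property failed by inspecting checks before the stopifnot() call.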
If you would like to test this at runtime, you should check out the excellent ensurer package, see here. At the bottom of the page you can see how to construct a template that you can test your dataframe against, you can make it as detailed and specific as you like.
I'm just using something like this
d1 <- iris
d2 <- iris
expect_that(d1, equals(d2)) # passes
d3 <- iris
d3[141,3] <- 5
expect_that(d1, equals(d3)) # fails
Function lm(...) returns an object of class 'lm'. How do I create an array of such objects? I want to do the following:
my_lm_array <- rep(as.lm(NULL), 20)
#### next, populate this array by running lm() repeatedly:
for(i in 1:20) {
my_lm_array[i] <- lm(my_data$results ~ my_data[i,])
}
Obviously the line "my_lm_array <- rep(as.lm(NULL), 20)" does not work. I'm trying to create an array of objects of type 'lm'. How do I do that?
Not sure it will answer your question, but if what you want to do is run a series of lm from a variable against different columns of a data frame, you can do something like this :
data <- data.frame(result=rnorm(10), v1=rnorm(10), v2=rnorm(10))
my_lms <- lapply(data[,c("v1","v2")], function(v) {
lm(data$result ~ v)
})
Then, my_lms would be a list of elements of class lm.
Well, you can create an array of empty/meaningless lm objects as follows:
z <- NA
class(z) <- "lm"
lm_array <- replicate(20,z,simplify=FALSE)
but that's probably not the best way to solve the problem. You could just create an empty list of the appropriate length (vector("list",20)) and fill in the elements as you go along: R is weakly enough typed that it won't mind you replacing NULL values with lm objects. More idiomatically, though, you can run lapply on your list of predictor names:
my_data <- data.frame(result=rnorm(10), v1=rnorm(10), v2=rnorm(10))
prednames <- setdiff(names(my_data),"result") ## extract predictor names
lapply(prednames,
function(n) lm(reformulate(n,response="result"),
data=my_data))
Or, if you don't feel like creating an anonymous function, you can first generate a list of formulae (using lapply) and then run lm on them:
formList <- lapply(prednames,reformulate,response="result") ## create formulae
lapply(formList,lm,data=my_data) ## run lm() on each formula in turn
will create the same list of lm objects as the first strategy above.
In general it is good practice to avoid using syntax such as my_data$result inside modeling formulae; instead, try to set things up so that all the variables in the model are drawn from inside the data object. That way methods like predict and update are more likely to work correctly ...
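The pre-allocated list mentioned earlier, combined with the advice about keeping variables inside the data object, might look like the sketch below. The seed, the new_rows data frame and the preds object are assumptions added for illustration.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible
my_data <- data.frame(result = rnorm(10), v1 = rnorm(10), v2 = rnorm(10))
prednames <- setdiff(names(my_data), "result")

# Pre-allocate a list of the right length and fill it in a loop
my_lms <- vector("list", length(prednames))
names(my_lms) <- prednames
for (n in prednames) {
  my_lms[[n]] <- lm(reformulate(n, response = "result"), data = my_data)
}

# Because every variable comes from `data`, predict() works on new data
new_rows <- data.frame(v1 = c(0, 1), v2 = c(0, 1))
preds <- lapply(my_lms, predict, newdata = new_rows)
```

Had the models been fitted with my_data$result ~ my_data$v1, predict() would not be able to match the columns of new_rows to the terms in the formula.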
For each of 100 data sets, I am using lm() to generate 7 different equations and would like to extract and compare the p-values and adjusted R-squared values.
Kindly assume that lm() is in fact the best regression technique possible for this scenario.
In searching the web I've found a number of useful examples for how to create a function that will extract this information and write it elsewhere, however, my code uses paste() to label each of the functions by the data source, and I can't figure out how to include these unique pasted names in the function I create.
Here's a mini-example:
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
for (i in testrun)
{
assign(paste(i,"test",sep=""),lm(temp$LogPre~temp$labels))
assign(paste(i,"test2",sep=""),lm(temp$LogPre~temp$labels2))
}
I would then like to extract the coefficients of each equation
But the following doesn't work:
summary(paste(i,"test",sep="")$coefficients)
and neither does this:
coef(summary(paste(i,"test",sep="")))
Both generate the error: $ operator is invalid for atomic vectors
EVEN THOUGH
summary(XXtest)$coefficients
and
coef(summary(XXtest))
work just fine.
How can I use paste() within summary() to allow me to do this for AAtest, AAtest2, ABtest, ABtest2, etc.
Thanks!
Hard to tell exactly what your purpose is, but some kind of apply loop may do what you want in a simpler way. Perhaps something like this?
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
names(testrun) <- testrun
out <- lapply(testrun, function(i) {
list(test1=lm(temp$LogPre~temp$labels),
test2=lm(temp$LogPre~temp$labels2))
})
Then to get all the p-values for the slopes you could do:
> sapply(out, function(i) sapply(i, function(x) coef(summary(x))[2,4]))
XX
test1 0.02392516
test2 0.02389790
Just using paste results in a character string, not the object with that name. You need to tell R to get the object with that name by using get.
summary(get(paste(i,"test",sep="")))$coefficients
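When there are many such objects, mget() can fetch them all at once by name, after which the p-values and adjusted R-squared values can be pulled out in one pass. This is a sketch under assumptions: the models are assigned into a dedicated environment (model_env, a hypothetical name) rather than the workspace, and the formulas use data = temp instead of temp$... columns.

```r
temp <- data.frame(labels = 1:10, LogPre = rnorm(10))
temp$labels2 <- temp$labels^2
testrun <- c("XX")

# Assumption: keep the generated objects in their own environment
model_env <- new.env()
for (i in testrun) {
  assign(paste0(i, "test"),  lm(LogPre ~ labels,  data = temp), envir = model_env)
  assign(paste0(i, "test2"), lm(LogPre ~ labels2, data = temp), envir = model_env)
}

# Build every pasted name, then fetch all the objects in one call
model_names <- as.vector(outer(testrun, c("test", "test2"), paste0))
models <- mget(model_names, envir = model_env)

# One row per model: slope p-value and adjusted R-squared
stats <- t(sapply(models, function(m) {
  s <- summary(m)
  c(slope_p = coef(s)[2, "Pr(>|t|)"], adj_r2 = s$adj.r.squared)
}))
```

With more entries in testrun (for example "AA" and "AB"), model_names grows automatically and stats gains one row per model.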