I understand that in the following
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(colnames(s)[c(21,259,330,380)], collapse='+'))))
I am missing x, but I really don't understand how and where to insert it so that the code is correct.
Thank you for any help.
Making this an answer instead of a comment due to the amount of text.
If I understand you correctly, you're trying to iterate over a list of variables, each of which you want to add in turn to a set of independent variables in a survival model. The issue in the code you gave is that you never give x a place to go. There are several ways to do that.
The first one is very similar to what you're doing, and creates the formulas. I demonstrate this using the 'cancer' dataset:
library(survival)
data(cancer)
myvars <- c("meal.cal","wt.loss")
a1 <- sapply(myvars, function(x){
  as.formula(sprintf("Surv(time, status)~age+sex+%s", x))
})
#then we can fit our models
lapply(a1,function(x){coxph(formula=x,data=cancer)})
In my opinion, this is a bit convoluted and can be done in one step:
models <- lapply(myvars, function(x){
  form <- as.formula(sprintf("Surv(time, status)~age+sex+%s", x))
  fit <- coxph(formula=form, data=cancer)
  return(fit)
})
Using the code you started with, we can simply add x to the vector of independent variables. However, this is not very readable code, and I'm always a bit nervous about feeding column indices to models; you might be safer using variable names instead.
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(c(x,colnames(s)[c(21,259,330,380)]), collapse='+'))))
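If you do switch to variable names, a sketch of the same call would look something like this (the covariate names below are placeholders I made up, since I don't know what columns 21, 259, 330, and 380 of s actually contain):
base_covars <- c("SEX", "SYSTBP", "SMOKING", "DIABETES")  # hypothetical stand-ins for colnames(s)[c(21,259,330,380)]
aa <- sapply(c("BMI","KOL"), function(x) {
  as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~',
                   paste(c(x, base_covars), collapse='+')))
})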
I'm trying to extract OLS coefficients from a set of models run under lapply.
The problem is that not all submodels from my list have all levels, and most of the time I end up with coefficients that are "out of bound". For example, in the code below the "No Reward" option is not present for Reward2, so result0$coef[3,1] will be out of bound as there is no third estimate reported.
Question: is it possible to force "lm" to report all the coefficients specified in the model even if there is no estimate available?
I would like to apologize to the community for not presenting reproducible code in my earlier attempt. Since then I have solved the problem by checking within the function for the presence of a particular estimate, but the question still remains. Here is the code:
RewardList<-c("Reward1","Reward2")
set.seed(1234)
GL<-rnorm(10)
RewardName<-rep(c("Reward1","Reward2"),each=5)
NoReward<-c(0,1,0,1,0, 0,0,0,0,0)
Under1<- c(1,0,0,0,1, 0,1,0,1,0)
Above1<- c(0,0,1,0,0, 1,0,1,0,1)
tinput<-as.data.frame(RewardName)
tinput<-cbind(tinput,GL,NoReward,Under1,Above1)
regMF <- lapply(seq_along(RewardList),
                function(n) {
                  tinput <- tinput[tinput$RewardName==RewardList[n],]
                  result0 <- summary(lm(GL~NoReward+Under1+Above1-1, tinput))
                  result1 <- result0$coef[1,1] # no rebate
                  result2 <- result0$coef[2,1]
                  result3 <- result0$coef[3,1]
                  return(list(result1, result2, result3))
                })
To make your function more robust, you could use tryCatch or dplyr::failwith. For instance, replacing
result0 <- summary(lm(GL~`No Reward`+`Under 1%`-1,tinput))
with
require(dplyr)
# failwith() wraps a function, so wrap the model fit in one and then call it
safe_summary <- failwith(NULL, function(dat) summary(lm(GL~`No Reward`+`Under 1%`-1, dat)))
result0 <- safe_summary(tinput)
may work, though it is difficult to tell without a reproducible example. On how to produce a reproducible example, please have a look here.
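If you prefer base R, a minimal tryCatch sketch of the same idea (using the variable names from your reproducible example) would be:
result0 <- tryCatch(
  summary(lm(GL~NoReward+Under1+Above1-1, tinput)),
  error = function(e) NULL  # return NULL instead of stopping on an error
)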
Although not exactly your question, I would like to point out a potentially easier way to arrange your data using broom, which avoids the list-of-lists output structure of your approach. With broom and dplyr, you can collect the output of your models in a data frame for easier access. For instance, have a look at the output of
library(dplyr)
library(broom)
mtcars %>% group_by(gear) %>% do(data.frame(tidy(lm(mpg~cyl, data=.), conf.int=T)))
Here, you can wrap the lm call in failwith as well.
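Applied to the reproducible data from your question, the broom/dplyr pattern would look roughly like this (a sketch, not a tested solution):
library(dplyr)
library(broom)
tinput %>%
  group_by(RewardName) %>%
  do(data.frame(tidy(lm(GL ~ NoReward + Under1 + Above1 - 1, data = .))))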
As I said, use lmList:
library(nlme)
fits <- lmList(GL~NoReward+Under1+Above1-1 | RewardName, data = tinput)
coef(fits)
# NoReward Under1 Above1
#Reward1 -1.034134 -0.3889705 1.0844412
#Reward2 NA -0.5695960 -0.3102046
I am writing a script that ultimately returns a data frame. My question is whether there are any good practices for using a unit-testing package to make sure that the returned data frame is correct. (I'm a beginning R programmer, and new to the concept of unit testing.)
My script effectively looks like the following:
# initialize data frame
df.out <- data.frame(...)
# function set
function1 <- function(x) {...}
function2 <- function(x) {...}
# do something to this data frame
df.out$new.column <- function1(df.out)
# do something else
df.out$other.new.column <- function2(df.out)
# etc ....
... and I ultimately end up with a data frame with many new columns. What is the best approach to testing, with unit tests, that the resulting data frame is what was anticipated?
So far I have created unit tests that check the results of each function, but I want to make sure that running all of these together produces what is intended. I've looked at Hadley Wickham's page on testing but can't see anything obvious regarding what to do when returning data frames.
My thoughts to date are:
Create an expected data frame by hand
Check that the output equals this data frame, using expect_that or similar
Any thoughts / pointers on where to look for guidance? My Google-fu has let me down considerably on this one to date.
Your intuition seems correct. Construct a data.frame manually based on the expected output of the function and then compare that against the function's output.
# manually created data
dat <- iris[1:5, c("Species", "Sepal.Length")]
# function
myfun <- function(row, col, data) {
  data[row, col]
}
# result of applying function
outdat <- myfun(1:5, c("Species", "Sepal.Length"), iris)
# two versions of the same test
expect_true(identical(dat, outdat))
expect_identical(dat, outdat)
If your data.frame may not be identical, you could also run tests on parts of the data.frame (see the sketch after this list), including:
dim(outdat), to check if the size is correct
attributes(outdat) or attributes of columns
sapply(outdat, class), to check variable classes
summary statistics for variables, if applicable
and so forth
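As a sketch, the first few of those checks could be written with testthat like this (using dat and outdat from above):
library(testthat)
expect_equal(dim(outdat), c(5, 2))  # size is correct
expect_equal(sapply(outdat, class),
             c(Species = "factor", Sepal.Length = "numeric"))  # column classes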
If you would like to test this at runtime, you should check out the excellent ensurer package, see here. At the bottom of the page you can see how to construct a template that you can test your dataframe against, you can make it as detailed and specific as you like.
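As a rough sketch of that idea (I'm writing the ensures_that() template syntax from memory of the package documentation, so treat it as an assumption and check the current docs):
library(ensurer)
# Template describing what a valid result should look like
ensure_result <- ensures_that(is.data.frame(.),
                              all(c("new.column", "other.new.column") %in% names(.)),
                              nrow(.) > 0)
df.out <- ensure_result(df.out)  # throws an informative error if any condition fails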
I'm just using something like this
d1 <- iris
d2 <- iris
expect_that(d1, equals(d2)) # passes
d3 <- iris
d3[141,3] <- 5
expect_that(d1, equals(d3)) # fails
I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame, but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zipcode, but with the intercept and coefficient for every zipcode equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
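Putting that together with the coefficient-collection step from your own code, the whole thing becomes roughly:
regressions <- dlply(Sealed, .(zipcode), lm, formula = hhincome ~ square_footage)
coefficients <- ldply(regressions, coef)  # one row of intercept and slope per zipcode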
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
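In other words, something along these lines should give you a separate fit per zipcode:
x <- function(df) {
  lm(df$hhincome ~ df$square_footage)  # refers to the zipcode-specific piece, not all of Sealed
}
regressions <- dlply(Sealed, .(zipcode), x)
coefficients <- ldply(regressions, coef)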
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As #Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and likewise for square_footage), and that will help.
I'm using the library poLCA. To use the main command of the library one has to create a formula as follows:
f <- cbind(V1,V2,V3)~1
After this a command is invoked:
poLCA(f,data0,...)
V1, V2, V3 are the names of variables in the dataset data0. I'm running a simulation and I need to change the formula several times. Sometimes it has 3 variables, sometimes 4, sometimes more.
If I try something like:
f <- cbind(get(names(data0)[1]),get(names(data0)[2]),get(names(data0)[3]))~1
it works fine. But then I have to know in advance how many variables I will use. I would like to define an arbitrary vector
vars0 <- c(1,5,17,21)
and then create the formula as follows
f <- cbind(get(names(data0)[vars0]))~1
Unfortunately I get an error. I suspect the answer may involve some form of apply, but I still don't understand very well how these functions work. Thanks in advance for any help.
Using data from the examples in ?poLCA this (possibly hackish) idiom seems to work:
library(poLCA)
vec <- c(1,3,4)
M4 <- poLCA(do.call(cbind,values[,vec])~1,values,nclass = 1)
Edit
As Hadley points out in the comments, we're making this a bit more complicated than we need. In this case values is a data frame, not a matrix, so this:
M1 <- poLCA(values[,c(1,2,4)]~1,values,nclass = 1)
generates an error, but this:
M1 <- poLCA(as.matrix(values[,c(1,2,4)])~1,values,nclass = 1)
works fine. So you can just subset the columns as long as you wrap it in as.matrix.
#DWin mentioned building the formula with paste and as.formula. I thought I'd show you what that would look like using the election dataset.
library("poLCA")
data(election)
vec <- c(1,3,4)
f <- as.formula(paste("cbind(",paste(names(election)[vec],collapse=","),")~1",sep=""))
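The resulting f can then be passed to poLCA like any hand-written formula, e.g. (a hypothetical call, with nclass chosen arbitrarily):
M5 <- poLCA(f, election, nclass = 2)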
For each of 100 data sets, I am using lm() to generate 7 different equations and would like to extract and compare the p-values and adjusted R-squared values.
Kindly assume that lm() is in fact the best regression technique possible for this scenario.
In searching the web I've found a number of useful examples for how to create a function that will extract this information and write it elsewhere, however, my code uses paste() to label each of the functions by the data source, and I can't figure out how to include these unique pasted names in the function I create.
Here's a mini-example:
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
for (i in testrun)
{
  assign(paste(i,"test",sep=""), lm(temp$LogPre~temp$labels))
  assign(paste(i,"test2",sep=""), lm(temp$LogPre~temp$labels2))
}
I would then like to extract the coefficients of each equation
But the following doesn't work:
summary(paste(i,"test",sep="")$coefficients)
and neither does this:
coef(summary(paste(i,"test",sep="")))
Both generate the error: $ operator is invalid for atomic vectors
EVEN THOUGH
summary(XXtest)$coefficients
and
coef(summary(XXtest))
work just fine.
How can I use paste() within summary() to allow me to do this for AAtest, AAtest2, ABtest, ABtest2, etc.?
Thanks!
Hard to tell exactly what your purpose is, but some kind of apply loop may do what you want in a simpler way. Perhaps something like this?
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
names(testrun) <- testrun
out <- lapply(testrun, function(i) {
  list(test1=lm(temp$LogPre~temp$labels),
       test2=lm(temp$LogPre~temp$labels2))
})
Then to get all the p-values for the slopes you could do:
> sapply(out, function(i) sapply(i, function(x) coef(summary(x))[2,4]))
XX
test1 0.02392516
test2 0.02389790
Just using paste results in a character string, not the object with that name. You need to tell R to get the object with that name by using get.
summary(get(paste(i,"test",sep="")))$coefficients
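For example, continuing your loop (a sketch using the objects created above):
for (i in testrun) {
  fit <- get(paste(i, "test", sep = ""))  # fetches the object named "XXtest"
  print(coef(summary(fit)))               # its coefficient table
}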