This is my first R question, and apparently it has not been discussed in any forum yet. Consider my fake dataset:
A<-matrix(c(1,2,3,4,5,2,3,4,5,6,3,4,5,6,7),5,3)
a<-c(2,4,6,8,9)
I want to regress each column of A on a and perform systemfit to test some restrictions, e.g.:
system.1<-list(A[,1]~a,A[,2]~a,A[,3]~a)
systemfit(system.1)
Now my problem is that my "real" matrix A has hundreds of columns, and I'm struggling to create a list that systemfit accepts. I've come up with the following code, which does not work:
varlist=NULL
for (i in 1:3){varlist[i] <- paste("A[,",i,"] ~ a",sep="")}
models <- lapply(varlist, function(x){
systemfit(substitute(j, list(j = as.name(x))))
})
If you hit
substitute(j, list(j = as.name(varlist)))
you can see that the solution
`A[,1] ~ a`
contains backticks (``), which seem to be causing the trouble for systemfit: the result is a name rather than a formula, so it is not accepted. Hence the problem seems to be the column-wise looping, but I don't see any alternative for the dataset at hand. Any ideas?
Any help would be highly appreciated!
Thanks!
The idiomatic way to do this is to create a list of formulas which reference columns in a data frame, then pass the list and the data frame to systemfit(...).
df <- data.frame(a,A) # data frame with columns a, X1, X2, X3, ...
forms <- lapply(paste0(colnames(df)[-1],"~a"),as.formula)
library(systemfit)
systemfit(forms,data=df)
# systemfit results
# method: OLS
#
# Coefficients:
# eq1_(Intercept) eq1_a eq2_(Intercept) eq2_a eq3_(Intercept) eq3_a
# -0.182927 0.548780 0.817073 0.548780 1.817073 0.548780
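As a hedged variation on the answer above, the formula list can also be built with base R's reformulate() instead of pasting strings. The sketch below reuses the fake data from the question; the names responses and ols are only illustrative, and the equation-by-equation lm() fit is a sanity check on the formulas, not a replacement for systemfit.

```r
# Fake data from the question
A <- matrix(c(1,2,3,4,5, 2,3,4,5,6, 3,4,5,6,7), 5, 3)
a <- c(2, 4, 6, 8, 9)
df <- data.frame(a, A)                      # columns: a, X1, X2, X3

# One formula per response column, built without string pasting
responses <- setdiff(names(df), "a")        # illustrative name
forms <- lapply(responses, function(y) reformulate("a", response = y))
names(forms) <- responses

# Sanity check: fit each equation separately by OLS with base lm()
ols <- t(sapply(forms, function(f) coef(lm(f, data = df))))
print(ols)
```

Each element of forms is a genuine formula object, so the list can be handed to systemfit(forms, data = df) in the same way as in the answer above.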
I understand that in the following
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(colnames(s)[c(21,259,330,380)], collapse='+'))))
I am missing x
but I really don't understand how and where to insert it correctly.
Thank you for any help.
Making this an answer instead of a comment due to amount of text.
If I understand you correctly, you're trying to iterate over a list of variables, which you want to add (each in turn) to a set of independent variables in a survival model. The issue in the code you gave is that you don't give x a place. There are several approaches to do so.
The first one is very similar to what you're doing, and creates the formulas. I demonstrate this using the 'cancer' dataset:
library(survival)
data(cancer)
myvars <- c("meal.cal","wt.loss")
a1 <- sapply(myvars,function(x){
as.formula(sprintf("Surv(time, status)~age+sex+%s",x))
}
)
#then we can fit our models
lapply(a1,function(x){coxph(formula=x,data=cancer)})
In my opinion, this is a bit convoluted and can be done in one step:
models <- lapply(myvars, function(x){
form <- as.formula(sprintf("Surv(time, status)~age+sex+%s",x))
fit <- coxph(formula=form, data=cancer)
return(fit)
})
Using the code you started with, we can simply add 'x' to the vector of independent variables. However, this is not very readable code and I'm always a bit nervous about feeding column indices to models. You might be safer using variable names instead.
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(c(x,colnames(s)[c(21,259,330,380)]), collapse='+'))))
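Following the suggestion to use variable names, here is a minimal sketch that builds the formulas with reformulate() rather than sprintf() or paste(). It assumes the same covariates as the cancer example above; the fitting step uses the lung data from the survival package (an assumption on my part, chosen because it contains the same columns) and only runs if that package is installed.

```r
myvars <- c("meal.cal", "wt.loss")

# Build one formula per extra covariate; the response is passed as an
# unevaluated call, so Surv() is not needed at this point
forms <- lapply(myvars, function(x) {
  reformulate(c("age", "sex", x), response = quote(Surv(time, status)))
})
names(forms) <- myvars
print(forms)

# Optional fitting step (assumes the survival package and its lung data)
if (requireNamespace("survival", quietly = TRUE)) {
  library(survival)
  fits <- lapply(forms, coxph, data = lung)
}
```

Because the covariates are given by name, the formulas stay valid if the column order of the data changes.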
I'm a new user trying to figure out lapply.
I have two data sets with the same 30 variables in each, and I'm trying to run t-tests to compare the variables in each sample. My ideal outcome would be a table that lists each variable along with the t stat and the p-value for the difference in that variable between the two datasets.
I tried to design a function to do the t test, so that I could then use lapply. Here's my code with a reproducible example.
height<-c(2,3,4,2,3,4,5,6)
weight<-c(3,4,5,7,8,9,5,6)
location<-c(0,1,1,1,0,0,0,1)
data_test<-cbind(height,weight,location)
data_north<-subset(data_test,location==0)
data_south<-subset(data_test,location==1)
variables<-colnames(data_test)
compare_t_tests<-function(x){
model<-t.test(data_south[[x]], data_north[[x]], na.rm=TRUE)
return(summary(model[["t"]]), summary(model[["p-value"]]))
}
compare_t_tests(height)
which gets the error:
Error in data_south[[x]] : attempt to select more than one element
My plan was to use the function in lapply like this, once I figure it out.
lapply(variables, compare_t_tests)
I'd really appreciate any advice. It seems to me like I might not even be looking at this right, so redirection would also be welcome!
You're very close. There are just a few tweaks:
Data:
height <- c(2,3,4,2,3,4,5,6)
weight <- c(3,4,5,7,8,9,5,6)
location <- c(0,1,1,1,0,0,0,1)
Use data.frame instead of cbind to get a data frame with real names ...
data_test <- data.frame(height,weight,location)
data_north <- subset(data_test,location==0)
data_south <- subset(data_test,location==1)
Don't include location in the set of variables ...
variables <- colnames(data_test)[1:2] ## skip location
Use the model, not the summary; return a vector
compare_t_tests<-function(x){
model <- t.test(data_south[[x]], data_north[[x]]) ## t.test() drops NAs itself; it has no na.rm argument
unlist(model[c("statistic","p.value")])
}
Compare with the variable in quotation marks, not as a raw symbol:
compare_t_tests("height")
## statistic.t p.value
## 0.2335497 0.8236578
Using sapply will automatically collapse the results into a table:
sapply(variables,compare_t_tests)
## height weight
## statistic.t 0.2335497 -0.4931970
## p.value 0.8236578 0.6462352
You can transpose this (t()) if you prefer ...
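For the table the question asks for (one row per variable), the transposed result can be turned into a data frame. This is only a sketch of one way to do it; the names result_table and t_and_p are made up for illustration, and the two groups are selected directly from data_test instead of from separate subsets.

```r
height   <- c(2,3,4,2,3,4,5,6)
weight   <- c(3,4,5,7,8,9,5,6)
location <- c(0,1,1,1,0,0,0,1)
data_test <- data.frame(height, weight, location)

variables <- setdiff(names(data_test), "location")

# Return the t statistic and p-value for one variable (south vs. north)
t_and_p <- function(v) {
  tt <- t.test(data_test[[v]][data_test$location == 1],
               data_test[[v]][data_test$location == 0])
  c(t = unname(tt$statistic), p.value = tt$p.value)
}

# sapply() gives variables in columns; t() puts them in rows
res <- t(sapply(variables, t_and_p))
result_table <- data.frame(variable = rownames(res), res, row.names = NULL)
print(result_table)
```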
I'm working with a set of results from the INLA package in R. These results are stored in objects with meaningful names, so I can have, for instance, model_a, model_b... in the current environment. For each of these models I'd like to do several processing tasks, including extracting the data to a separate data frame, which can then be merged with spatial data to create a map, etc.
Turning to a simpler, reproducible example, let's assume two results:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
model_a <- lm(weight ~ group)
model_b <- lm(weight ~ group - 1)
I can handle the steps for an individual model, for instance:
model_a_sum <- data.frame(var = character(1), model_a_value = numeric(1))
model_a_sum$var <- "Intercept"
model_a_sum$model_a_value <- model_a$coefficients[1]
png("model_a_plot.png")
plot(model_a, las = 1)
dev.off()
Now, I'd like to reuse this code for each of the models, essentially constructing the correct names depending on the model I'm using. I'm more of a Stata than an R person, and in Stata it would be trivial to use the stub of a name (model_a, or even just a) and construct a foreach loop that implements all the steps, adapting the names for each of the models.
In R, for loops have been bashed all over the internet so I presume I shouldn't attempt to venture into the territory of:
models <- c("model_a", "model_b", "model_c")
for (model in models) {
...
}
What would be a better solution for such a scenario?
Update 1: Since the comments suggested that for might indeed be an option, I'm trying to put all the tasks into a loop. So far I have managed to name the data frame correctly using assign and to get the correct data plotted under the correct name using get:
models <- c("model_a", "model_b")
for (i in 1:length(models)) {
# create df
name.df <- paste0(models[i], "_sum")
assign(name.df, data.frame(var = character(1), value = numeric(1)))
# replace variables of df with results from the model
# plot and save
name.plot <- paste0(models[i], "_plot.png")
png(name.plot)
plot(get(models[i]), which = 1, las = 1)
dev.off()
}
Is this a reasonable approach? Are there any better solutions?
One thing I cannot solve is having the second variable of the df named according to the model (i.e. model_a_value instead of the current value). Any ideas how to solve that?
Some general tips/advice:
As mentioned in comments, don't believe much of the negativity about for loops in R. The issue is not that they are bad, but more that they are correlated with some bad code patterns that are inefficient.
More important is to use the right data organization. Don't keep each model in a separate object! Put them in a list:
l <- vector("list",3)
l[[1]] <- lm(...)
l[[2]] <- lm(...)
l[[3]] <- lm(...)
Then name the list:
names(l) <- paste0("model_",letters[1:3])
Now you can loop over the list without resorting to awkward and unnecessary tools like assign and get, and more importantly when you're ready to step up from for loops to tools like lapply you're all good to go.
I would use similar strategies for your data frames as well.
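To make the list-based approach concrete, here is a small sketch using the two models from the question. The names models and sums are illustrative, and the plotting step is left out; it could go in the same loop by indexing models[[nm]].

```r
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Keep the models together in one named list
models <- list(
  model_a = lm(weight ~ group),
  model_b = lm(weight ~ group - 1)
)

# One summary data frame per model, with the value column named after the model
sums <- lapply(names(models), function(nm) {
  out <- data.frame(var = "Intercept",
                    value = unname(coef(models[[nm]])[1]))
  names(out)[2] <- paste0(nm, "_value")
  out
})
names(sums) <- names(models)

print(sums$model_a)
```

Nothing here needs assign or get: the model name is just a list name, so it can be used both to look up the model and to build the column name.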
See @joran's answer; this one shows how to use assign and get, which should be avoided when possible.
I would go this way for the for loop:
for (model in models) {
m <- get(model) # to get the real model object
# create the model_?_sum dataframe
assign(paste0(model,"_sum"), data.frame(var = "Intercept", value = m$coefficients[1]))
assign(paste0(model,"_sum"), setNames(get(paste0(model,"_sum")), c("var", paste0(model,"_value")))) # rename the value column, as requested in the comments (thanks to @Franck in chat for the guidance)
# paste0 to create the text
png(paste0(model,"_plot.png"))
plot(m, las = 1) # use the m object to graph
dev.off()
}
which gives the two images and these data frames:
> model_a_sum
                  var model_a_value
(Intercept) Intercept         5.032
> model_b_sum
               var model_b_value
groupCtl Intercept         5.032
I'm not sure why you want this data frame, but I hope this gives some clues on how to construct variable names and how to access them.
I am writing a script that ultimately returns a data frame. My question is whether there are any good practices for using a unit test package to make sure that the returned data frame is correct. (I'm a beginning R programmer, and also new to the concept of unit testing.)
My script effectively looks like the following:
# initialize data frame
df.out <- data.frame(...)
# function set
function1 <- function(x) {...}
function2 <- function(x) {...}
# do something to this data frame
df.out$new.column <- function1(df.out)
# do something else
df.out$other.new.column <- function2(df.out)
# etc ....
... and I ultimately end up with a data frame with many new columns. However, what is the best approach to test that the data frame that is produced is what is anticipated, using unit tests?
So far I have created unit tests that check the results of each function, but I want to make sure that running all of these together produces what is intended. I've looked at Hadley Wickham's page on testing but can't see anything obvious regarding what to do when returning data frames.
My thoughts to date are:
Create an expected data frame by hand
Check that the output equals this data frame, using expect_that or similar
Any thoughts / pointers on where to look for guidance? My Google-fu has let me down considerably on this one to date.
Your intuition seems correct. Construct a data.frame manually based on the expected output of the function and then compare that against the function's output.
library(testthat)
# manually created data
dat <- iris[1:5, c("Species", "Sepal.Length")]
# function
myfun <- function(row, col, data) {
data[row, col]
}
# result of applying function
outdat <- myfun(1:5, c("Species", "Sepal.Length"), iris)
# two versions of the same test
expect_true(identical(dat, outdat))
expect_identical(dat, outdat)
If your data.frame may not be identical, you could also run tests in parts of the data.frame, including:
dim(outdat), to check if the size is correct
attributes(outdat) or attributes of columns
sapply(outdat, class), to check variable classes
summary statistics for variables, if applicable
and so forth
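The piecewise checks listed above could look roughly like this with testthat. It is a sketch only: outdat is recreated here from iris so the block runs on its own, and the expected values are simply what the first five rows of iris contain.

```r
library(testthat)

# Stand-in for the function output being tested
outdat <- iris[1:5, c("Species", "Sepal.Length")]

test_that("output has the expected shape, names, classes and summary", {
  # size
  expect_equal(dim(outdat), c(5L, 2L))
  # column names
  expect_named(outdat, c("Species", "Sepal.Length"))
  # variable classes
  expect_equal(unname(vapply(outdat, class, character(1))),
               c("factor", "numeric"))
  # a summary statistic, compared with numeric tolerance
  expect_equal(mean(outdat$Sepal.Length), 4.86)
})
```

Checks like these are looser than expect_identical on the whole data frame, which can be useful when row names or attributes are allowed to differ.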
If you would like to test this at runtime, you should check out the excellent ensurer package, see here. At the bottom of the page you can see how to construct a template that you can test your dataframe against, you can make it as detailed and specific as you like.
I'm just using something like this
library(testthat)
d1 <- iris
d2 <- iris
expect_equal(d1, d2) # passes (expect_equal replaces the deprecated expect_that(d1, equals(d2)))
d3 <- iris
d3[141,3] <- 5
expect_equal(d1, d3) # fails
I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with Stata, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
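To illustrate the fix without plyr, here is a sketch in base R using split() and lapply(). The small Sealed data frame below is invented purely for illustration (SEALED.csv is not available), and fit_zip and coef_table are hypothetical names; the point is only that the function fits the model on the piece it is given.

```r
# Hypothetical stand-in for SEALED.csv: two zip codes with different relationships
Sealed <- data.frame(
  zipcode        = rep(c("10001", "10002"), each = 4),
  square_footage = c(800, 1000, 1200, 1500, 900, 1100, 1300, 1600)
)
Sealed$hhincome <- ifelse(Sealed$zipcode == "10001",
                          20000 + 30 * Sealed$square_footage,
                          50000 + 10 * Sealed$square_footage)

# The function uses its argument df, not the full Sealed data frame
fit_zip <- function(df) lm(hhincome ~ square_footage, data = df)

# One regression per zip code, then one row of coefficients per zip code
regressions <- lapply(split(Sealed, Sealed$zipcode), fit_zip)
coef_table  <- do.call(rbind, lapply(regressions, coef))
print(coef_table)
```

With the original function, which referred to Sealed directly, every row of this table would be identical.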
The issue is not with plyr but with the definition of the function: the function takes an argument, but its body never uses it.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable, because $hhincome and $square_footage look like they are the variables.
But you want what comes before the $ to vary with the function's argument. As @Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and same for $squ..) and that will help.