Making defining objects easier - r

So, I'm a newbie in R and want to make my experience with it as straightforward as possible. I work with multi-response datasets (50+ responses) and would like to avoid manually typing x1 <- dataset$x1, x2 <- dataset$x2, and so on.
Is there a script to make every column header an object?
Cheers!

There are two common approaches (these have also been mentioned in the comments):
You could attach() the dataset, and detach() when done.
You could also use with().
Suppose you have a data.frame named dataset containing the columns x1 and x2.
An example using attach() would be:
attach(dataset)   # puts dataset on the search path, so x1 and x2 are visible by name
newvar <- x1 + x2
newvar2 <- x1 - x2
detach(dataset)   # remove it from the search path again when you're done
And an example using with():
with(dataset, {
  newvar <- x1 + x2
  newvar2 <- x1 - x2
  # note: assignments made inside with() are local to this block;
  # assign the value returned by with() if you need to keep a result
})
I hope this answers your question; if not, feel free to rephrase or edit it.
For further examples, take a look at the example in ?attach and the boxplot example in ?with.
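As a small illustration (a sketch using the built-in mtcars data rather than your dataset): because assignments inside with() are local, keep a result by assigning the value that with() returns.
# with() evaluates the expression using the data frame's columns
newvar <- with(mtcars, mpg / wt)  # keep the result by assigning what with() returns
head(newvar)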

Here is a reproducible suggestion using only base R functions:
# mtcars is a dummy dataset to work with
list_objects <- as.list(mtcars)  # make a list with one element per column
# note that you can do lapply(list_objects, function) at this stage...
# but if you really want the columns as objects in your global environment, here is the trick:
list2env(list_objects, envir = globalenv())  # copy each element of the list into the global environment
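To illustrate, a quick check after running the snippet above (still assuming mtcars as the dummy dataset):
exists("mpg")  # TRUE: the column is now a free-standing object
head(mpg)      # the first values of what used to be mtcars$mpg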

Related

Missing function(x) in defined function

I understand that in the following
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(colnames(s)[c(21,259,330,380)], collapse='+'))))
I am missing x, but I really don't understand how and where to insert it to make this correct.
Thank you for any help.
Making this an answer instead of a comment due to the amount of text.
If I understand you correctly, you're trying to iterate over a list of variables, adding each in turn to a set of independent variables in a survival model. The issue in the code you posted is that x never gets used anywhere in the formula. There are several approaches to fix that.
The first one is very similar to what you're doing and creates the formulas up front. I demonstrate it using the cancer dataset from the survival package:
library(survival)
data(cancer)
myvars <- c("meal.cal", "wt.loss")
a1 <- sapply(myvars, function(x) {
  as.formula(sprintf("Surv(time, status) ~ age + sex + %s", x))
})
# then we can fit our models
lapply(a1, function(x) { coxph(formula = x, data = cancer) })
In my opinion, this is a bit convoluted and can be done in one step:
models <- lapply(myvars, function(x) {
  form <- as.formula(sprintf("Surv(time, status) ~ age + sex + %s", x))
  fit <- coxph(formula = form, data = cancer)
  return(fit)
})
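As a follow-up sketch (not part of the original answer), you can then name and inspect the fitted models in the usual way:
names(models) <- myvars
lapply(models, summary)  # full coxph output for each candidate variable
lapply(models, coef)     # just the coefficient estimates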
Using the code you started with, we can simply add x to the vector of covariates. However, this is not very readable code, and I'm always a bit nervous about feeding column indices to models; you might be safer using variable names instead.
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(c(x,colnames(s)[c(21,259,330,380)]), collapse='+'))))
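If you go this route, the resulting formulas can be fitted the same way as above. A sketch, assuming s and the variables in the formula exist in your session as in your question:
fits <- lapply(aa, function(f) coxph(f, data = s))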

How to write a testthat unit test for a function that returns a data frame

I am writing a script that ultimately returns a data frame. My question is whether there are any good practices on how to use a unit-testing package to make sure that the data frame that is returned is correct. (I'm a beginning R programmer, plus new to the concept of unit testing.)
My script effectively looks like the following:
# initialize data frame
df.out <- data.frame(...)
# function set
function1 <- function(x) {...}
function2 <- function(x) {...}
# do something to this data frame
df.out$new.column <- function1(df.out)
# do something else
df.out$other.new.column <- function2(df.out)
# etc ....
... and I ultimately end up with a data frame with many new columns. However, what is the best approach to test, using unit tests, that the data frame produced is what was anticipated?
So far I have created unit tests that check the results of each function, but I want to make sure that running all of these together produces what is intended. I've looked at Hadley Wickham's page on testing but can't see anything obvious regarding what to do when returning data frames.
My thoughts to date are:
Create an expected data frame by hand
Check that the output equals this data frame, using expect_that or similar
Any thoughts / pointers on where to look for guidance? My Google-fu has let me down considerably on this one to date.
Your intuition seems correct. Construct a data.frame manually based on the expected output of the function and then compare that against the function's output.
# manually created data
dat <- iris[1:5, c("Species", "Sepal.Length")]
# function
myfun <- function(row, col, data) {
  data[row, col]
}
# result of applying function
outdat <- myfun(1:5, c("Species", "Sepal.Length"), iris)
# two versions of the same test
expect_true(identical(dat, outdat))
expect_identical(dat, outdat)
If the whole data.frame may not be identical, you could also test parts of it (a sketch follows this list), including:
dim(outdat), to check if the size is correct
attributes(outdat) or attributes of columns
sapply(outdat, class), to check variable classes
summary statistics for variables, if applicable
and so forth
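For example, a minimal sketch of such partial checks for the outdat object above (assuming testthat is loaded):
library(testthat)
test_that("outdat has the expected shape and types", {
  expect_equal(dim(outdat), c(5, 2))  # size is correct
  expect_equal(sapply(outdat, class),
               c(Species = "factor", Sepal.Length = "numeric"))  # variable classes
  expect_equal(mean(outdat$Sepal.Length), mean(dat$Sepal.Length))  # a summary statistic
})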
If you would like to test this at runtime, you should check out the excellent ensurer package. Its documentation shows how to construct a template that you can test your data frame against; you can make it as detailed and specific as you like.
I'm just using something like this:
d1 <- iris
d2 <- iris
expect_that(d1, equals(d2)) # passes
d3 <- iris
d3[141,3] <- 5
expect_that(d1, equals(d3)) # fails
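With current testthat versions, the same checks are usually written with expect_equal() or expect_identical() directly; a minimal equivalent of the example above:
expect_equal(d1, d2)  # passes
expect_equal(d1, d3)  # fails: row 141, column 3 differs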

R: Columnwise loop over string variables

My first R question that, apparently, has not been discussed in any forum yet. Consider my fake dataset:
A <- matrix(c(1,2,3,4,5, 2,3,4,5,6, 3,4,5,6,7), 5, 3)
a <- c(2,4,6,8,9)
I want to regress each column of A on a and perform systemfit to test some restrictions, e.g.:
system.1 <- list(A[,1] ~ a, A[,2] ~ a, A[,3] ~ a)
systemfit(system.1)
Now my problem is that my "real" matrix A has hundreds of columns. I'm struggling to create a list that systemfit accepts. I've come up with the following non-working code:
varlist <- NULL
for (i in 1:3) { varlist[i] <- paste("A[,", i, "] ~ a", sep = "") }
models <- lapply(varlist, function(x) {
  systemfit(substitute(j, list(j = as.name(x))))
})
If you run
substitute(j, list(j = as.name(varlist)))
you can see that the result,
`A[,1] ~ a`
contains backticks, which seem to be causing the trouble for systemfit, since it is not accepted as a formula. Hence the problem seems to be the column-wise looping, but I don't see any alternative for the dataset at hand. Any ideas?
Any help would be highly appreciated!
Thanks!
The idiomatic way to do this is to create a list of formulas which reference columns in a data frame, then pass the list and the data frame to systemfit(...).
df <- data.frame(a, A)  # data frame with columns a, X1, X2, X3, ...
forms <- lapply(paste0(colnames(df)[-1], "~a"), as.formula)
library(systemfit)
systemfit(forms, data = df)
# systemfit results
# method: OLS
#
# Coefficients:
# eq1_(Intercept) eq1_a eq2_(Intercept) eq2_a eq3_(Intercept) eq3_a
# -0.182927 0.548780 0.817073 0.548780 1.817073 0.548780
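As a follow-up sketch (not part of the original answer), you can store the fit and inspect it in the usual way:
fit <- systemfit(forms, data = df)
summary(fit)  # per-equation coefficient tables and residual covariance
coef(fit)     # stacked coefficients: eq1_(Intercept), eq1_a, eq2_(Intercept), ...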

using lm(my_formula) inside [.data.table's j

I have gotten in the habit of accessing data.table columns in j even when I do not need to:
require(data.table)
set.seed(1); n <- 10
DT <- data.table(x = rnorm(n), y = rnorm(n))
frm <- formula(x ~ y)
DT[, lm(x ~ y)]     # 1 works
DT[, lm(frm)]       # 2 fails
lm(frm, data = DT)  # 3 what I'll do instead
I expected # 2 to work, since lm should search for variables in DT and then in the global environment... Is there an elegant way to get something like # 2 to work?
In this case, I'm using lm, which takes a "data" argument, so # 3 works just fine.
EDIT. Note that this works:
x1 <- DT$x
y1 <- DT$y
frm1 <- formula(x1~y1)
lm(frm1)
and this, too:
rm(x1,y1)
bah <- function() {
  x1 <- DT$x
  y1 <- DT$y
  frm1 <- formula(x1 ~ y1)
  lm(frm1)
}
bah()
EDIT2. However, this fails, illustrating @eddi's answer:
frm1 <- formula(x1~y1)
bah1 <- function() {
  x1 <- DT$x
  y1 <- DT$y
  lm(frm1)
}
bah1()
The way lm works, it looks for the variables used in the environment of the formula supplied. Since you create your formula in the global environment, it's not going to look in the j-expression's environment, so the only way to make the exact expression lm(frm) work would be to add the appropriate variables to the correct environment:
DT[, {assign('x', x, environment(frm));
assign('y', y, environment(frm));
lm(frm)}]
Now, obviously, this is not a very good solution; both Arun's and Josh's suggestions are much better, and I'm just putting it here to help understand the problem at hand.
edit: Another (hackier, and quite fragile) way would be to change the environment of the formula itself (I do it permanently here, but you could revert it back afterwards, or work on a copy):
DT[, {setattr(frm, '.Environment', get('SDenv', parent.frame(2))); lm(frm)}]
By the way, a funny thing happens here: whenever you use get() in the j-expression, all of the columns get constructed (so avoid it if you can), and this is why I don't also need to reference x and y in some way for data.table to know that those columns are needed.
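For completeness, a simpler pattern that is often enough (a sketch, not one of the answers referenced above): pass the data explicitly inside j, so lm() finds the variables in its data argument rather than in the formula's environment:
DT[, lm(frm, data = .SD)]  # .SD carries the columns, so frm finds x and y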

Creating formulas in R involving an arbitrary number of variables

I'm using the library poLCA. To use the main command of the library one has to create a formula as follows:
f <- cbind(V1,V2,V3)~1
After this a command is invoked:
poLCA(f,data0,...)
V1, V2, V3 are the names of variables in the dataset data0. I'm running a simulation and I need to change the formula several times. Sometimes it has 3 variables, sometimes 4, sometimes more.
If I try something like:
f <- cbind(get(names(data0)[1]),get(names(data0)[2]),get(names(data0)[3]))~1
it works fine. But then I have to know in advance how many variables I will use. I would like to define an arbitrary vector
vars0 <- c(1,5,17,21)
and then create the formula as follows
f <- cbind(get(names(data0)[vars0]))
Unfortunately, I get an error. I suspect the answer may involve some form of apply, but I still don't understand very well how these functions work. Thanks in advance for any help.
Using data from the examples in ?poLCA, this (possibly hackish) idiom seems to work:
library(poLCA)
vec <- c(1,3,4)
M4 <- poLCA(do.call(cbind, values[, vec]) ~ 1, values, nclass = 1)
Edit
As Hadley points out in the comments, we're making this a bit more complicated than we need to. In this case values is a data frame, not a matrix, so this:
M1 <- poLCA(values[,c(1,2,4)]~1,values,nclass = 1)
generates an error, but this:
M1 <- poLCA(as.matrix(values[,c(1,2,4)])~1,values,nclass = 1)
works fine. So you can just subset the columns as long as you wrap it in as.matrix.
@DWin mentioned building the formula with paste and as.formula. I thought I'd show you what that would look like using the election dataset.
library("poLCA")
data(election)
vec <- c(1,3,4)
f <- as.formula(paste("cbind(", paste(names(election)[vec], collapse = ","), ")~1", sep = ""))
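As a quick usage check of the constructed formula (a sketch; nclass = 3 is an arbitrary choice here):
M2 <- poLCA(f, election, nclass = 3)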
