I am writing a function to create some predicted variables within an existing data set that I am using to run some ML models. My function looks like this:
doall <- function(x1, x2){
J48 <- J48(ML, data=df1)
#summary(J48)
X1 <- predict(J48, df1, type="class")
X2 <- predict(J48, df2, type="class")
#return(X1)
}
doall(df1$DT_predict, df2$DT_predict1)
J48 is a decision tree model (via RWeka). I believe the code works properly when called as doall(df1$DT_predict1, df2$DT_predict1), because when I include the return statement, it returns the values of X1. However, the predicted variables are not being generated or stored in the data frames (df1 and df2). Ideally, I would like to have the data frame names within the function, but that's the next step.
Can someone show me how I can store the variables X1 and X2 within the data frames df1 and df2, respectively?
Ideally your question would have a bit more information about what your data frames look like, what X1 and X2 look like, and where your data frames are stored. For my answer I am assuming your data frames are stored in the global environment, and you want to modify them through a function.
This question has to do with scoping. For an in-depth description of scoping check out this article http://adv-r.had.co.nz/Functions.html#lexical-scoping
First, by assigning your variables within a function you are assigning them in a local environment. This means that the variables you assign do not carry over into the global environment (what you see when you type ls()).
I believe you want to change a 'global variable' from within a function. This is done with the <<- operator. For instance:
a <- 2
print(a)
returns 2
change_a<-function(x){
x<-x*4
}
change_a(a)
print(a)
still returns 2
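while
change_a<-function(){
a<<-a*4
}
change_a()
print(a)
would return 8, because <<- assigns to the name on its left in the enclosing (here, global) environment. Note that the function has to name a itself: writing x<<-x*4 with a passed in as x would create a global variable x and leave a at 2.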
I think you want to use the <<- operator instead of <- to accomplish what you want.
On a related note, assigning to and changing global variables from within a function is generally not considered good practice. The usual alternative is to return the new value and let the caller assign it.
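As a minimal sketch of that return-and-assign alternative (not the original poster's code): lm on the built-in mtcars data stands in for the J48 model so that the example runs without RWeka, and add_predictions is a hypothetical helper name chosen for illustration.

```r
# Hypothetical helper: returns a copy of `data` with a new prediction column.
add_predictions <- function(model, data, column) {
  data[[column]] <- predict(model, newdata = data)
  data  # return the modified copy; the caller decides where to store it
}

# Stand-in data and model (assumption: lm replaces the J48 tree here)
df1 <- mtcars[1:20, ]
df2 <- mtcars[21:32, ]
fit <- lm(mpg ~ wt, data = df1)

# Re-assign the returned data frames instead of using <<-
df1 <- add_predictions(fit, df1, "DT_predict1")
df2 <- add_predictions(fit, df2, "DT_predict1")
```

Because the function only returns a value, nothing outside it changes until the caller writes df1 <- ..., which keeps the data flow visible at the call site.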
Related
I have defined two global variables as lists, let's say:
importance.5maturity <- list()
importance.10maturity <- list()
I also have a function which runs randomForest and I want to add the Importance of this function in the list, in each loop (I am using a rolling window). I believe this can be done using list.append().
The input of this function has a variable named maturity. I want to have an if statement, in a way that if the list name in global variable has the same number as in maturity, the function stores the Importance information in that particular list. For example,
b <- randomForest(y ~., data= d.na, mtry=5, ntree=1000, importance=TRUE)
if(maturity==5){
importance.5maturity <- list.append(Importance(b))
}
But I don't know how to match the maturity and the number (5 and 10) in the list name so the function would choose the correct list to store the information in automatically.
I also don't want to use local variables, which the function would return, since I am returning another data frame from it.
I have created a function which computes statistics on various patients' data and, as well as outputting plots, generates data frames containing summary statistics for each patient.
If I copy and run the function within R, the outputs are available to me. However, I am now calling the function from a separate R script, and the data frames are no longer available.
Is there any way to correct this?
For example,
test=function(a){
A=a
B=2*a
C=3*a
D=4*a
DF=data.frame(A,B,C,D)
}
a=c(1,2,3,4)
test(a)
This does not return DF, yet if I were to type:
a=c(1,2,3,4)
A=a
B=2*a
C=3*a
D=4*a
DF=data.frame(A,B,C,D)
Then clearly DF is returned. Is there a simple way to fix this so that DF becomes available from the test function?
Try:
test=function(a){
A=a
B=2*a
C=3*a
D=4*a
DF=data.frame(A,B,C,D)
}
a=c(1,2,3,4)
df<-test(a)
print(df)
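By assigning the function's returned value to a new variable, it becomes accessible in the global environment. The last expression in test is an assignment, whose value R returns invisibly, which is why calling test(a) on its own prints nothing.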
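A small sketch of the difference between the two ways of ending the function, using base R's withVisible to inspect the result. The names test_assign and test_visible are made up for this illustration.

```r
# Version from the question: the last expression is an assignment,
# so the data frame is returned invisibly.
test_assign <- function(a) {
  DF <- data.frame(A = a, B = 2 * a, C = 3 * a, D = 4 * a)
}

# Variant that ends with the object itself, so the result auto-prints.
test_visible <- function(a) {
  DF <- data.frame(A = a, B = 2 * a, C = 3 * a, D = 4 * a)
  DF
}

a <- c(1, 2, 3, 4)
withVisible(test_assign(a))$visible   # FALSE
withVisible(test_visible(a))$visible  # TRUE

# Either way, the caller keeps the result by assigning it
df <- test_visible(a)
```

Both functions return the same data frame; only the printing behaviour at the console differs.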
If you want to assign an object from within a function to the global environment for easy retrieval, then your operators are "<<-" or "->>". For more info see:
?assignOps i.e.
test <- function(a) {
A=a
B=2*a
C=3*a
D=4*a
DF <<- data.frame(A,B,C,D)
}
# trial your dummy data
a=c(1,2,3,4)
test(a)
DF
Hey presto ... it works! Writing return(DF) within the function will not, on its own, create DF in the global environment; the caller would still have to assign the returned value to a name.
I am writing a script that ultimately returns a data frame. My question is whether there are any good practices for using a unit test package to make sure that the data frame that is returned is correct. (I'm a beginning R programmer, and new to the concept of unit testing.)
My script effectively looks like the following:
# initialize data frame
df.out <- data.frame(...)
# function set
function1 <- function(x) {...}
function2 <- function(x) {...}
# do something to this data frame
df.out$new.column <- function1(df.out)
# do something else
df.out$other.new.column <- function2(df.out)
# etc ....
... and I ultimately end up with a data frame with many new columns. However, what is the best approach to test that the data frame that is produced is what is anticipated, using unit tests?
So far I have created unit tests that check the results of each function, but I want to make sure that running all of these together produces what is intended. I've looked at Hadley Wickham's page on testing but can't see anything obvious regarding what to do when returning data frames.
My thoughts to date are:
Create an expected data frame by hand
Check that the output equals this data frame, using expect_that or similar
Any thoughts / pointers on where to look for guidance? My Google-fu has let me down considerably on this one to date.
Your intuition seems correct. Construct a data.frame manually based on the expected output of the function and then compare that against the function's output.
library(testthat)
# manually created data
dat <- iris[1:5, c("Species", "Sepal.Length")]
# function
myfun <- function(row, col, data) {
data[row, col]
}
# result of applying function
outdat <- myfun(1:5, c("Species", "Sepal.Length"), iris)
# two versions of the same test
expect_true(identical(dat, outdat))
expect_identical(dat, outdat)
If your data.frame may not be identical, you could also run tests in parts of the data.frame, including:
dim(outdat), to check if the size is correct
attributes(outdat) or attributes of columns
sapply(outdat, class), to check variable classes
summary statistics for variables, if applicable
and so forth
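The partial checks listed above might look like the following sketch. It uses base R's stopifnot so that it runs without extra packages; each condition could equally be wrapped in a testthat expectation such as expect_equal or expect_true.

```r
# Same function and data as in the example above
myfun <- function(row, col, data) {
  data[row, col]
}
outdat <- myfun(1:5, c("Species", "Sepal.Length"), iris)

stopifnot(
  # size of the result
  identical(dim(outdat), c(5L, 2L)),
  # column names (one of the attributes)
  identical(names(outdat), c("Species", "Sepal.Length")),
  # variable classes
  identical(sapply(outdat, class),
            c(Species = "factor", Sepal.Length = "numeric")),
  # a summary statistic, compared with numerical tolerance
  isTRUE(all.equal(mean(outdat$Sepal.Length), 4.86))
)
```

Checking pieces separately makes a failure easier to localise than a single whole-frame comparison, at the cost of not catching differences that none of the partial checks cover.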
If you would like to test this at runtime, you should check out the excellent ensurer package, see here. At the bottom of the page you can see how to construct a template that you can test your dataframe against, you can make it as detailed and specific as you like.
I'm just using something like this
d1 <- iris
d2 <- iris
expect_that(d1, equals(d2)) # passes
d3 <- iris
d3[141,3] <- 5
expect_that(d1, equals(d3)) # fails
Function lm(...) returns an object of class 'lm'. How do I create an array of such objects? I want to do the following:
my_lm_array <- rep(as.lm(NULL), 20)
#### next, populate this array by running lm() repeatedly:
for(i in 1:20) {
my_lm_array[i] <- lm(my_data$results ~ my_data[i,])
}
Obviously the line "my_lm_array <- rep(as.lm(NULL), 20)" does not work. I'm trying to create an array of objects of type 'lm'. How do I do that?
Not sure it will answer your question, but if what you want to do is run a series of lm fits of one variable against different columns of a data frame, you can do something like this:
data <- data.frame(result=rnorm(10), v1=rnorm(10), v2=rnorm(10))
my_lms <- lapply(data[,c("v1","v2")], function(v) {
lm(data$result ~ v)
})
Then, my_lms would be a list of elements of class lm.
Well, you can create an array of empty/meaningless lm objects as follows:
z <- NA
class(z) <- "lm"
lm_array <- replicate(20,z,simplify=FALSE)
but that's probably not the best way to solve the problem. You could just create an empty list of the appropriate length (vector("list",20)) and fill in the elements as you go along: R is weakly enough typed that it won't mind you replacing NULL values with lm objects. More idiomatically, though, you can run lapply on your list of predictor names:
my_data <- data.frame(result=rnorm(10), v1=rnorm(10), v2=rnorm(10))
prednames <- setdiff(names(my_data),"result") ## extract predictor names
lapply(prednames,
function(n) lm(reformulate(n,response="result"),
data=my_data))
Or, if you don't feel like creating an anonymous function, you can first generate a list of formulae (using lapply) and then run lm on them:
formList <- lapply(prednames,reformulate,response="result") ## create formulae
lapply(formList,lm,data=my_data) ## run lm() on each formula in turn
will create the same list of lm objects as the first strategy above.
In general it is good practice to avoid using syntax such as my_data$result inside modeling formulae; instead, try to set things up so that all the variables in the model are drawn from inside the data object. That way methods like predict and update are more likely to work correctly ...
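The preallocated-list approach mentioned above (vector("list", n) filled in a loop) can be sketched as follows, reusing the simulated my_data from this answer. The name fits is chosen for illustration. Note the double brackets: fits[[i]] stores the whole lm object as one list element, whereas single-bracket assignment as in the question's loop would not.

```r
set.seed(1)  # only to make the simulated data reproducible
my_data <- data.frame(result = rnorm(10), v1 = rnorm(10), v2 = rnorm(10))
prednames <- setdiff(names(my_data), "result")

# Empty list of the right length; each slot starts as NULL
fits <- vector("list", length(prednames))
names(fits) <- prednames

for (i in seq_along(prednames)) {
  # [[i]] replaces the i-th element with a complete lm object
  fits[[i]] <- lm(reformulate(prednames[i], response = "result"),
                  data = my_data)
}

class(fits$v1)
```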
I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
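A sketch of the corrected function, assuming the column names from the question. The data here are simulated (the original SEALED.csv is not available), and base R's split and lapply stand in for dlply and ldply so that the example runs without plyr; fit_one and coef_table are hypothetical names.

```r
set.seed(42)
# Simulated stand-in for the question's data (assumption)
Sealed <- data.frame(
  zipcode        = rep(c("10001", "10002", "10003"), each = 20),
  square_footage = runif(60, 500, 3000)
)
Sealed$hhincome <- 20000 + 30 * Sealed$square_footage + rnorm(60, sd = 5000)

# The function now uses its argument df, not the full Sealed data frame
fit_one <- function(df) {
  lm(hhincome ~ square_footage, data = df)
}

# One model per zipcode, then one row of coefficients per model
regressions <- lapply(split(Sealed, Sealed$zipcode), fit_one)
coef_table  <- do.call(rbind, lapply(regressions, coef))
coef_table

# With plyr, the same function could be passed as
# dlply(Sealed, .(zipcode), fit_one)
```

Because each call sees only its own subset, the intercept and slope now differ from zipcode to zipcode.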
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As @Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and the same for $square_footage) and that will help.