I am trying to write a function that adds a new column to a data frame when I call it, without any explicit assignment.
That is, I just want to call the function with arguments and have it modify the data frame:
input_data:
x y
1 2
2 6
column_creator<-function(data,column_name,...){
data$column_name <- newdata ...}
column_creator(input_data,new_col,...)
x y new_col
1 2 5
2 6 9
As opposed to:
input_data$new_col <- column_creator(input_data,new_col,...)
However, doing the assignment inside the function does not modify the global variable.
I am working around this by having the function return a statement of assignment (temp in the function below), but is there another way to do this?
Here is my function for reference; it should create a column of 1s between the supplied start and end dates with the name dummy_name.
dummy_creator<-function(data,date,dummy_name,start,end){
temp<-paste(data,"['",dummy_name,"'] <- ifelse(",data,"['",date,"'] > as.Date (","'" , start,"'" , ", format= '%Y-%m-%d') & ",data,"['",date,"'] < as.Date(", "'", end,"'" ,",format='%Y-%m-%d') ,1,0)",sep="")
print(temp)
return()
}
Thanks
I also tried:
dummy_creator<-function(data,date,dummy_name,start,end){
data[dummy_name] <<- ifelse(data[,date] > as.Date (start, format= "%Y-%m-%d") & data[,date] < as.Date(end,format="%Y-%m-%d") ,1,0)
}
But that attempt gave me the error "object of type 'closure' is not subsettable".
It’s generally a bad idea to modify global data or data passed into a function: R objects generally behave as immutable values (modifying one inside a function changes only a local copy), and tricks that modify the caller’s object break the user’s expectations and make it harder to reason about the program’s state.
It is good form to return the modified object instead:
input_data = column_creator(input_data, new_col, …)
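As a rough illustration of the return-the-object style applied to the dummy function from the question, here is a minimal sketch. The argument name date_col and the sample data are made up for the example, and it uses as.integer() on the logical condition rather than ifelse().

```r
# Sketch: return the modified data frame instead of changing it in place.
# `date_col` and the sample data are illustrative, not from the question.
dummy_creator <- function(data, date_col, dummy_name, start, end) {
  dates <- as.Date(data[[date_col]])
  # 1 for dates strictly between start and end, 0 otherwise
  data[[dummy_name]] <- as.integer(dates > as.Date(start) & dates < as.Date(end))
  data
}

input_data <- data.frame(
  day = as.Date(c("2020-01-01", "2020-01-15", "2020-02-01")),
  y   = c(2, 6, 4)
)

# The caller decides where the result goes
input_data <- dummy_creator(input_data, "day", "january", "2020-01-10", "2020-01-31")
input_data$january
```

The `<<-` attempt in the question fails for a separate reason: `data[dummy_name] <<- ...` looks up an object called data outside the function, finds no such variable, and picks up the built-in function data() instead, which is a closure and cannot be subset.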
That said, you have a few options. Generally, R has several mechanisms to allow modifiable objects. I recommend you look into R6 classes for this.
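One of those mechanisms is available in base R: environments have reference semantics, and R6 classes are built on top of them. The following is only a sketch of the idea; store and add_column are hypothetical names, and this is not how R6 itself is used.

```r
# Sketch: an environment is passed by reference, so a function can change
# what it holds without the caller reassigning anything.
# `store` and `add_column` are hypothetical names.
store <- new.env()
store$df <- data.frame(x = c(1, 2), y = c(2, 6))

add_column <- function(env, column_name, values) {
  env$df[[column_name]] <- values  # modifies the data frame held in `env`
  invisible(env)
}

add_column(store, "new_col", c(5, 9))
store$df
```

The trade-off is that the data frame now has to be reached through store$df everywhere, which makes the mutability explicit to whoever reads the code.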
You could also use non-standard evaluation to capture the passed object and modify it at the caller’s site. However, this is rarely advisable. I’m posting an example of this here because the mechanism is interesting and worth knowing, but I’ll reiterate that you shouldn’t use it here.
column_creator = function (df, new_col, new_data) {
# Get the unevaluated expression representing the data frame
df_obj = substitute(df)
new_col = substitute(new_col)
# Ensure that the input is valid
stopifnot(is.name(df_obj))
stopifnot(is.name(new_col))
stopifnot(is.data.frame(df))
# Add new column to data frame
df[[deparse(new_col)]] = new_data
# Assign back to object in caller scope
assign(deparse(df_obj), df, parent.frame())
invisible(df)
}
test = data.frame(A = 1 : 5, B = 1 : 5)
column_creator(test, C, 6 : 10)
test
#   A B  C
# 1 1 1  6
# 2 2 2  7
# 3 3 3  8
# 4 4 4  9
# 5 5 5 10
I learnt that in R you can pass a variable number of parameters to your function with ...
I'm now trying to create a function and loop through ..., for example to create a dataframe.
create_df <- function(...) {
for(i in ...){
... <- data.frame(...=c(1,2,3,4),column2=c(1,2,3,4))
}
}
create_df(hello,world)
I would like to create two data frames with this code, one named hello and the other world. One of the columns in each should also be named hello and world respectively. Thanks
This is my error:
Error in create_df(hello, there) : '...' used in an incorrect context
It's generally not a good idea for functions to inject variables into their calling environment. It's better to return values from functions and keep related values in a list. In this case you could do this instead:
create_df <- function(...) {
names <- sapply(match.call(expand.dots = FALSE)$..., deparse)
Map(function(x) {
setNames(data.frame(a1=c(1,2,3,4),a2=c(1,2,3,4)), c(x, "column2"))
}, names)
}
create_df(hello,world)
# $hello
# hello column2
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# $world
# world column2
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
This returns a named list which is much easier to work with in R. We use match.call to turn the ... names into strings and then use those strings with functions that expect them like setNames() to change data.frame column names. Map is also a great helper for generating lists. It's often easier to use map style functions rather than bothering with explicit for loops.
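If capturing unquoted names is not essential, the same result can be had by passing the names as strings, which avoids match.call altogether. This is a sketch of that variant; create_df_chr and make_one are hypothetical names, and the column contents are the same placeholder values as above.

```r
# Sketch: same idea with the names passed as strings (hypothetical variant).
create_df_chr <- function(names) {
  make_one <- function(nm) {
    # First column takes the supplied name, second stays "column2"
    setNames(data.frame(1:4, 1:4), c(nm, "column2"))
  }
  # lapply() builds the list; setNames() labels its elements
  setNames(lapply(names, make_one), names)
}

dfs <- create_df_chr(c("hello", "world"))
names(dfs)

# Work with the list: pick one data frame, or loop over all of them
dfs$hello$hello
for (nm in names(dfs)) print(dim(dfs[[nm]]))
```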
I'd like to perform different aggregations in a loop to be applied to different row subsets of my data, but it seems tricky to achieve (if possible at all):
t <- data.frame(agg=c(list("field1"=field1, "field2"=field2), ...),
fun=c(mean, ...))
f <- function(x) {
for (i in 1:nrow(t)) {
y <- aggregate(x, by=t$agg[i], FUN=t$fun[i])
# do something with y
}
}
One problem is that the field list agg triggers an error when trying to build the data frame ("object 'field1' not found"), and the other problem is that R does not like to assign a function value to fun ("cannot coerce class ""function"" to a data.frame").
Appendix:
A concrete example for my data (just to match the definitions above) could be:
> d <- data.frame(field1=round(rnorm(5, 10, 1)),field2=letters[round(rnorm(5, 10, 1))], field3=1:5)
> d
field1 field2 field3
1 11 j 1
2 11 i 2
3 10 j 3
4 12 i 4
5 11 j 5
> with(d, aggregate(d$field3,by=list(field1, field2),FUN=mean))
Group.1 Group.2 x
1 11 i 2
2 12 i 4
3 10 j 3
4 11 j 3
Playing tricks with the variable names in the data frame, I still get this:
> with(d,t <- data.frame(agg=c(list("field1"=field1, "field2"=field2)),fun=c(mean)))
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class ""function"" to a data.frame
There were several problems, mostly caused by R making exceptions to its general processing:
First, an atomic vector cannot be nested and all of its elements must have the same type; only lists can hold other lists or functions.
Second, data.frame() does special processing when constructing its columns (which is why closures cannot be stored), so it cannot be used here.
Finally, I had to refer to the variables to aggregate by name.
So the definition looks like this (where , ... means "add more similar items"):
t <- list(agg=list(c("field1", "field2"), ...),
fun=list(mean, ...))
f <- function(x) {
for (i in 1:length(t$agg)) {
agg <- t$agg[[i]]
aggList <- lapply(agg, FUN=function(e) x[[e]])
names(aggList) <- agg
y <- aggregate(x, by=aggList, FUN=t$fun[[i]])
# do something with y
}
}
Note: In the actual solution I added another list holding the names of the columns to select for the aggregated data frame to avoid warnings about mean returning NA.
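To make the note concrete, here is a self-contained sketch that stores the grouping columns, the columns to aggregate, and the function together. It assumes a slightly different layout from the one above (one list entry per aggregation rather than parallel lists); spec, run_aggregations and the data values are made up for the example.

```r
# Fixed example data (values chosen by hand so the result is reproducible)
d <- data.frame(field1 = c(11, 11, 10, 12, 11),
                field2 = c("j", "i", "j", "i", "j"),
                field3 = 1:5)

# Hypothetical spec: grouping columns, value columns, and function per entry
spec <- list(
  list(by = c("field1", "field2"), cols = "field3",              fun = mean),
  list(by = "field2",              cols = c("field1", "field3"), fun = max)
)

run_aggregations <- function(x, spec) {
  lapply(spec, function(s) {
    # x[s$cols] keeps only the columns to aggregate, which avoids applying
    # the function to non-numeric columns; x[s$by] is a named list of groups
    aggregate(x[s$cols], by = x[s$by], FUN = s$fun)
  })
}

results <- run_aggregations(d, spec)
results[[2]]
```

Because a data frame is itself a list of columns, x[s$by] can be handed to the by argument directly and the grouping columns keep their original names.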
I have searched quite a bit and not found a question that addresses this issue, but if this has been answered, forgive me; I am still quite green when it comes to coding in general. I have a data frame with a large number of variables, and I would like to combine them into new variables in a loop, based on names I've put in a second data frame. The data frame formulas should name the columns to create and to read from the main data frame data:
USDb = c(1,2,3)
USDc = c(4,5,6)
EURb = c(7,8,9)
EURc = c(10,11,12)
data = data.frame(USDb, USDc, EURb, EURc)
Now I'd like to create a new column data$USDa as defined by
data$USDa = data$USDb - data$USDc
and so on for EUR and other variables. This is easy enough to do manually, but I'd like to create a loop that pulls the names from formulas, something like this:
a = c("USDa", "EURa")
b = c("USDb", "EURb")
c = c("USDc", "EURc")
formulas = data.frame(a,b,c)
for (i in 1:length(formulas[,a])){
data$formulas[i,a] = data$formulas[i,b] - data$formulas[i,c]
}
Obviously data$formulas[i,a] returns NULL, so I tried data$paste0(formulas[i,a]), which returns Error: attempt to apply non-function.
How can I get these strings to be recognized as variables in this way? Thanks.
There are simpler ways to do this, but I'll stick to most of your code as a means of explanation. Your code should work so long as you edit your for loop to the following:
for (i in 1:length(formulas[,"a"])){
data[formulas[i,"a"]] = data[formulas[i,"b"]] - data[formulas[i,"c"]]
}
formulas[,a] won't work because the unquoted a refers to the variable a you defined earlier (the vector c("USDa", "EURa")), and those values are not column names of formulas. Use formulas[, "a"] instead if you want all rows from column "a" in data.frame formulas.
data$formulas literally searches for a column called "formulas" in the data.frame data. Instead, write data[formulas[i, "a"]]: formulas has to be indexed first so that it yields a proper string.
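For comparison, here is a self-contained sketch of the same loop using [[ ]], which selects a single column as a vector. The stringsAsFactors = FALSE argument is an addition for safety: before R 4.0.0, data.frame() converted strings to factors by default, and indexing with a factor uses its integer codes rather than its labels.

```r
data <- data.frame(USDb = c(1, 2, 3), USDc = c(4, 5, 6),
                   EURb = c(7, 8, 9), EURc = c(10, 11, 12))

formulas <- data.frame(a = c("USDa", "EURa"),
                       b = c("USDb", "EURb"),
                       c = c("USDc", "EURc"),
                       stringsAsFactors = FALSE)  # keep the names as strings

for (i in seq_len(nrow(formulas))) {
  # Each row of formulas names the target column and its two inputs
  data[[formulas$a[i]]] <- data[[formulas$b[i]]] - data[[formulas$c[i]]]
}

data$USDa
```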
The logic: iterate over each row of formulas with apply (which is a for loop internally) and do the calculation that row describes, column b minus column c.
x = apply(formulas, 1, function(x) data[[x[2]]] - data[[x[3]]])
colnames(x) = formulas$a
x
#     USDa EURa
#[1,]   -3   -3
#[2,]   -3   -3
#[3,]   -3   -3
cbind(data, x)
#  USDb USDc EURb EURc USDa EURa
#1    1    4    7   10   -3   -3
#2    2    5    8   11   -3   -3
#3    3    6    9   12   -3   -3
Another option is split with sapply:
sapply(setNames(split.default(as.matrix(formulas[-1]),
row(formulas[-1])), formulas$a), function(x) Reduce(`-`, data[x]))
#     USDa EURa
#[1,]   -3   -3
#[2,]   -3   -3
#[3,]   -3   -3
I have a general problem understanding how to create a user-defined function that accepts variables as arguments and manipulates them inside the function, passing them on to internal functions. It appears that many of the functions I want to use require c(), which requires quotes around the arguments.
So my function has to be able to pass the name of a variable from a data frame into the quotes for c() and other functions requiring quoted strings. I read through many posts on paste0, paste and cat(x), but I cannot figure out how to solve my problem completely.
Here is a simple dataset and shortened code to help structure the problem. Here I just want to be able to provide a data frame and three variables. The function should provide the mean of the variable in the y position for each combination of the x and z variables. The resulting aggregate table should have the names of the variables provided as arguments to XTABAR as column headers.
n=50
DataTest = data.frame( xcol=sample(1:3, n, replace=TRUE), ycol = rnorm(n, 5, 2), Catg=letters[1:5])
XTABAR<- function(DS,xcat,yvar,group){
library(plyr)
#library(ggplot2)
#library(dplyr)
#library(scales)
localenv<-environment()
gg<-data.frame(DS,x=DS[,xcat],y=DS[,yvar],z=DS[,group] )
cnames<-colnames(gg)
ag.gg<-aggregate(gg$y, by=list(gg$x,gg$z),FUN=mean)
colnames(ag.gg)<-c(cat('"',cnames[1],'"'),cat('"',cnames[2],'"'),cat('"',cnames[3],'"'))
return(ag.gg)
}
XTABAR(DataTest,"xcol","ycol","Catg")
This code is as close as I can get to solving the simple problem. I don't know how to remove the quotes from the column names nor how to get rid of the NA's.
Thank you for any help on the logic and or code.
Try the following. I was not too clear about the desire to quote the names but we put stars around them in the code below. If that is not needed then remove the setNames statement.
XTABAR <- function(DS, xcat, yvar, group) {
ag <- aggregate(DS[yvar], DS[c(xcat, group)], mean)
setNames(ag, paste0("*", names(ag), "*"))
}
Test it:
XTABAR(DataTest, "xcol", "ycol", "Catg")
giving:
*xcol* *Catg* *ycol*
1 1 a 5.700938
2 2 a 5.292628
3 3 a 5.204395
4 1 b 4.054289
5 2 b 5.119659
6 3 b 4.050799
7 1 c 2.937309
8 2 c 5.696256
9 3 c 6.773029
10 1 d 5.323572
11 2 d 3.430644
12 3 d 4.892041
13 1 e 4.024070
14 3 e 5.038122
I make heavy use of eval(parse(text=)) for this purpose. It evaluates a character string as though it is a command. For example:
> x <- "5 + 5"
> eval(parse(text=x))
[1] 10
Using your example, this should work if you input your parameters as character strings:
XTABAR<- function(DS,xcat,yvar,group){
library(plyr)
#library(ggplot2)
#library(dplyr)
#library(scales)
var1 <- eval(parse(text=paste(DS, "$", xcat, sep="")))
var2 <- eval(parse(text=paste(DS, "$", yvar, sep="")))
var3 <- eval(parse(text=paste(DS, "$", group, sep="")))
localenv<-environment()
gg<-data.frame(x=var1, y=var2, z=var3)
cnames<-colnames(gg)
ag.gg<-aggregate(gg$y, by=list(gg$x,gg$z),FUN=mean)
colnames(ag.gg)<-c(cat('"',cnames[1],'"'),cat('"',cnames[2],'"'),cat('"',cnames[3],'"'))
return(ag.gg)
}
I'm going to go ahead and anticipate a criticism of my answer.
> require(fortunes)
Loading required package: fortunes
> fortune(106)
If the answer is parse() you should usually rethink
the question.
-- Thomas Lumley
R-help (February 2005)
Mr. Lumley is probably correct in this case. There are probably simpler solutions, but this should at least get you going.
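To set the column names, replace the colnames(ag.gg) line above with colnames(ag.gg) <- c(xcat, group, yvar): cat() only prints its arguments and returns NULL, and aggregate() puts the two grouping columns before the aggregated value.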
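If the data frame really does have to be passed by name, get() can look it up without building and parsing code. This is a minimal sketch along those lines; XTABAR_byname and the small data set are hypothetical, and it assumes the data frame is visible from the calling environment.

```r
# Sketch: look the data frame up by name with get() instead of parsing code.
# `XTABAR_byname` is a hypothetical name; the data below is made up.
XTABAR_byname <- function(ds_name, xcat, yvar, group) {
  DS <- get(ds_name, envir = parent.frame())  # data frame named by a string
  stopifnot(is.data.frame(DS))
  # Indexing with strings keeps the original column names in the result
  aggregate(DS[yvar], by = DS[c(xcat, group)], FUN = mean)
}

DataSmall <- data.frame(xcol = c(1, 1, 2, 2),
                        ycol = c(2, 4, 6, 8),
                        Catg = c("a", "a", "a", "b"))

res <- XTABAR_byname("DataSmall", "xcol", "ycol", "Catg")
names(res)
```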