Can I aggregate with parameters taken from data frame? - r

I'd like to perform different aggregations in a loop to be applied to different row subsets of my data, but it seems tricky to achieve (if possible at all):
t <- data.frame(agg=c(list("field1"=field1, "field2"=field2), ...),
fun=c(mean, ...))
f <- function(x) {
for (i in 1:nrow(t)) {
y <- aggregate(x, by=t$agg[i], FUN=t$fun[i])
# do something with y
}
}
One problem is that the field list agg triggers an error when trying to build the data frame ("object 'field1' not found"), and the other problem is that R does not like to assign a function value to fun ("cannot coerce class "function" to a data.frame").
Appendix:
A concrete example for my data (just to match the definitions above) could be:
> d <- data.frame(field1=round(rnorm(5, 10, 1)),field2=letters[round(rnorm(5, 10, 1))], field3=1:5)
> d
field1 field2 field3
1 11 j 1
2 11 i 2
3 10 j 3
4 12 i 4
5 11 j 5
> with(d, aggregate(d$field3,by=list(field1, field2),FUN=mean))
Group.1 Group.2 x
1 11 i 2
2 12 i 4
3 10 j 3
4 11 j 3
Playing tricks with the variable names in the data frame, I still get this:
> with(d,t <- data.frame(agg=c(list("field1"=field1, "field2"=field2)),fun=c(mean)))
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class "function" to a data.frame

There were several problems, mostly caused by R making exceptions to its general processing:
First, a vector cannot be nested; only lists can. Even so, all elements of an atomic vector are required to have the same type.
Second, data.frame does some special processing when constructing its variables (which makes it unable to store closures), so it cannot be used here.
Finally, I had to refer to the variables to aggregate by name.
So the definition looks like this (where , ... means "add more similar items"):
t <- list(agg=list(c("field1", "field2"), ...),
          fun=list(mean, ...))
f <- function(x) {
  # seq_along() is safe even if the list is empty
  for (i in seq_along(t$agg)) {
    agg <- t$agg[[i]]
    # build a named list of the grouping columns
    aggList <- lapply(agg, FUN=function(e) x[[e]])
    names(aggList) <- agg
    y <- aggregate(x, by=aggList, FUN=t$fun[[i]])
    # do something with y
  }
}
Note: In the actual solution I added another list holding the names of the columns to select for the aggregated data frame to avoid warnings about mean returning NA.
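A self-contained sketch of the list-based approach above might look like the following. The sample values are invented, and the `cols` entry (the columns to aggregate) is an assumed addition based on the note above, which mentions such a list without showing it. The name `spec` is used instead of `t` only to avoid masking base R's `t()`.

```r
# Sample data; the values are invented for illustration.
d <- data.frame(field1 = c(11, 11, 10, 12, 11),
                field2 = c("j", "i", "j", "i", "j"),
                field3 = 1:5)

# One entry per aggregation: grouping columns, value columns, function.
# `cols` is an assumed addition, not part of the original definition.
spec <- list(agg  = list(c("field1", "field2"), "field2"),
             cols = list("field3", c("field1", "field3")),
             fun  = list(mean, max))

run_aggregations <- function(x, spec) {
  lapply(seq_along(spec$agg), function(i) {
    agg <- spec$agg[[i]]
    # x[agg] is a data frame, i.e. a named list of the grouping columns
    aggregate(x[spec$cols[[i]]], by = x[agg], FUN = spec$fun[[i]])
  })
}

res <- run_aggregations(d, spec)
res[[1]]
```

Selecting only the value columns in `x[spec$cols[[i]]]` keeps the function from being applied to non-numeric columns, which is presumably what caused the warnings about mean returning NA.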

Related

R: Make function change dataset [duplicate]

I am trying to write a function that will add a new column to a data frame, when I call it, without doing any explicit assignment.
That is, I just want to call the function with arguments and have it modify the data frame:
input_data:
x y
1 2
2 6
column_creator<-function(data,column_name,...){
data$column_name <- newdata ...}
column_creator(input_data,new_col,...)
x y new_col
1 2 5
2 6 9
As opposed to:
input_data$new_col <- column_creator(input_data,new_col,...)
However doing assignment inside the function is not modifying the global variable.
I am working around this by having the function return a statement of assignment (temp in the function below), however is there another way to do this?
Here is my function for reference; it should create a column named dummy_name that is 1 between the supplied start and end dates.
dummy_creator<-function(data,date,dummy_name,start,end){
temp<-paste(data,"['",dummy_name,"'] <- ifelse(",data,"['",date,"'] > as.Date (","'" , start,"'" , ", format= '%Y-%m-%d') & ",data,"['",date,"'] < as.Date(", "'", end,"'" ,",format='%Y-%m-%d') ,1,0)",sep="")
print(temp)
return()
}
Thanks
I also tried:
dummy_creator<-function(data,date,dummy_name,start,end){
data[dummy_name] <<- ifelse(data[,date] > as.Date (start, format= "%Y-%m-%d") & data[,date] < as.Date(end,format="%Y-%m-%d") ,1,0)
}
But that attempt gave me the error "object of type 'closure' is not subsettable".
It’s generally a bad idea to modify global data or data passed into a function: R objects are immutable, and using tricks to modify them inside a function breaks the user’s expectations and makes it harder to reason about the program’s state.
It is good form to return the modified object instead:
input_data = column_creator(input_data, new_col, …)
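Applied to the dummy_creator function from the question, the return-the-object style could be sketched as below. The column name `day`, the dummy name `mid_year`, and the dates are invented for illustration; column names are passed as strings here.

```r
# Returns a copy of `data` with a 0/1 column named `dummy_name` that is 1
# where the `date` column lies strictly between `start` and `end`.
dummy_creator <- function(data, date, dummy_name, start, end) {
  start <- as.Date(start, format = "%Y-%m-%d")
  end   <- as.Date(end,   format = "%Y-%m-%d")
  data[[dummy_name]] <- ifelse(data[[date]] > start & data[[date]] < end, 1, 0)
  data
}

# Hypothetical input; the column name and the dates are invented.
input_data <- data.frame(day = as.Date(c("2020-01-01", "2020-06-15", "2020-12-31")))
input_data <- dummy_creator(input_data, "day", "mid_year", "2020-03-01", "2020-09-30")
input_data$mid_year
```

The caller decides whether to overwrite `input_data` or keep both versions, which is the main benefit of this style.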
That said, you have a few options. Generally, R has several mechanisms to allow modifiable objects. I recommend you look into R6 classes for this.
You could also use non-standard evaluation to capture the passed object and modify it at the caller’s site. However, this is rarely advisable. I’m posting an example of this here because the mechanism is interesting and worth knowing, but I’ll reiterate that you shouldn’t use it here.
column_creator = function (df, new_col, new_data) {
# Get the unevaluated expression representing the data frame
df_obj = substitute(df)
new_col = substitute(new_col)
# Ensure that the input is valid
stopifnot(is.name(df_obj))
stopifnot(is.name(new_col))
stopifnot(is.data.frame(df))
# Add new column to data frame
df[[deparse(new_col)]] = new_data
# Assign back to object in caller scope
assign(deparse(df_obj), df, parent.frame())
invisible(df)
}
test = data.frame(A = 1 : 5, B = 1 : 5)
column_creator(test, C, 6 : 10)
test
# A B C
# 1 1 1 6
# 2 2 2 7
# 3 3 3 8
# 4 4 4 9
# 5 5 5 10
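For comparison, one base R mechanism for modifiable objects is the environment (R6 objects are themselves built on environments). The following is only a sketch; the names `store` and `add_column` are hypothetical.

```r
# An environment has reference semantics, so changes made inside a
# function are visible to the caller without any reassignment.
store <- new.env()
store$data <- data.frame(A = 1:5, B = 1:5)

# Hypothetical helper: adds a column to the data frame held in `env`.
add_column <- function(env, col_name, values) {
  env$data[[col_name]] <- values
  invisible(env)
}

add_column(store, "C", 6:10)
names(store$data)
```

Here the mutation is explicit in the design: anything that receives `store` is expected to change it, so the caller is not surprised.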

Assigning value to an R object without using its name with get()

I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
colnames(get(element))[2] <- paste(strsplit(colnames(get(element)) [1], '_')[[1]][1],
colnames(get(element))[2], sep='_')
}
The expressions on either side of the assignment operator inside the for loop each give the expected result when run separately, but when run together they produce the following error:
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.
This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list as suggested in a comment to the question might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e
e <- list(df = df)
then omit the line marked ## and the rest works as is.
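With a named list, the loop can also be written as a function applied to each element. This is a sketch with two invented data frames; `add_prefix` is a hypothetical helper name.

```r
# Two data frames with the common structure; the names are invented.
dfs <- list(df1 = data.frame(A_cat = 1:3, dog = 4:6),
            df2 = data.frame(B_cat = 1:3, dog = 7:9))

# Prefix the 2nd column name with everything up to the first "_" of the 1st.
add_prefix <- function(d) {
  nms <- names(d)
  names(d)[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
  d
}

dfs <- lapply(dfs, add_prefix)
sapply(dfs, names)
```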
Here would be one way to accomplish this goal if the data.frames have systematic names (here, df1 df2 df3, etc) and the prefix ends with "_" as in the example:
# as suggested by @roland, roll them up in a list:
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
# keep only the prefix up to and including the last "_" of column 1
names(myDfList[[dfName]])[2] <- paste0(gsub("^(.*_).*$", "\\1",
names(myDfList[[dfName]])[1]),
names(myDfList[[dfName]])[2])
}
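As for the original error: an assignment such as colnames(get(element))[2] <- value is evaluated as a nested replacement, so R looks for a replacement function named `get<-`, which does not exist. If the objects have to stay in the environment, a get/modify/assign pattern avoids this. A minimal sketch using the example data from the question:

```r
df <- data.frame(A_cat = 1:10, dog = 11:20)

for (element in c("df")) {
  tmp <- get(element)                            # fetch a copy by name
  prefix <- strsplit(colnames(tmp)[1], "_")[[1]][1]
  colnames(tmp)[2] <- paste(prefix, colnames(tmp)[2], sep = "_")
  assign(element, tmp)                           # write the copy back
}

colnames(df)
```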

Don't understand how apply gets its parameters in r

I am struggling to make my apply() work: I have two dataframes:
from <- c(1,2,3)
to <- c(2,3,4)
df1 <- data.frame(from, to)
long <-c(9,9.2,9.4,9.6)
lat <- c(45,45.2,45.4,45.6)
id <- c(1,2,3,4)
df2 <- data.frame(long, lat, id)
Now I want something like this:
myFunction <- function(arg){
# >>> How do I access arg$from and arg$to? <<<
}
apply(df1,1,myFunction)
In myFunction I need to make some calculations and return a value for each from-to pair. I don't understand how to access parts of the arg, since arg[0] gives me numeric(0) and arg$from just crashes.
The problem is that apply(...) requires a matrix or array as the first argument. If you pass a dataframe, it will coerce that to a matrix. Matrices are 1 indexed, so the upper left element is [1,1], not [0,0]. Also, matrix columns cannot be referenced using the $ notation.
So,
f <- function(x) {
from <- x[1]
to <- x[2]
# do stuff with from and to...
}
apply(df1,1,f)
would work.
One other thing to watch out for is that if your dataframe has (other) columns that have character strings, the conversion will make everything character (including the numbers!). This is because, by definition, all elements of a matrix must have the same data type. Your example does not have that problem, though.
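Both points can be checked with a short sketch. The `label` column is invented purely to show the coercion, and `row_sum` is a hypothetical helper.

```r
df1 <- data.frame(from = c(1, 2, 3), to = c(2, 3, 4))

# Each row arrives as a named numeric vector, so names work with [ ].
row_sum <- function(x) x["from"] + x["to"]
sums <- apply(df1, 1, row_sum)

# Adding a character column (invented here) turns every row into character.
df_mixed <- cbind(df1, label = c("a", "b", "c"))
row_types <- apply(df_mixed, 1, class)
```

Indexing by name with x["from"] is less fragile than x[1] if the column order ever changes.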
Try mapply(). It's a multivariate version of sapply(). For example:
> myFunction <- function(arg1, arg2){
+ return(sum(arg1, arg2))
+ }
>
> mapply(myFunction, df1$from, df1$to)
[1] 3 5 7
You can also use it to make a new variable in your data frame.
> df1$newvar <- mapply(myFunction, df1$from, df1$to)
> df1
from to newvar
1 1 2 3
2 2 3 5
3 3 4 7
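The same pattern extends to a calculation that looks up each from-to pair in df2. The question does not say which calculation is needed, so the distance below is a hypothetical stand-in (a straight-line distance in degrees, not a geodesic one), and `pair_distance` is an invented name.

```r
df1 <- data.frame(from = c(1, 2, 3), to = c(2, 3, 4))
df2 <- data.frame(long = c(9, 9.2, 9.4, 9.6),
                  lat  = c(45, 45.2, 45.4, 45.6),
                  id   = c(1, 2, 3, 4))

# Hypothetical calculation: straight-line distance in degrees between the
# coordinates of the `from` and `to` ids.
pair_distance <- function(from, to) {
  a <- df2[match(from, df2$id), ]
  b <- df2[match(to, df2$id), ]
  sqrt((a$long - b$long)^2 + (a$lat - b$lat)^2)
}

df1$dist <- mapply(pair_distance, df1$from, df1$to)
```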

Retain Vector Names as Dataframe Column Names

In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colnames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(..., "barWidth", ...)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
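If the vectors are kept in a named list from the start, get() is not needed at all. The following is a sketch under that assumption; the second vector `barHeight`, the column positions, and the size of the pre-allocated data frame are invented.

```r
# Pre-allocated data frame with placeholder columns (sizes invented).
df1 <- data.frame(matrix(NA_real_, nrow = 4, ncol = 3))

# Keeping the vectors in a named list means the names travel with the data.
vectors <- list(barWidth = 1:4, barHeight = c(2.5, 3, 3.5, 4))
columnNums <- c(barWidth = 2, barHeight = 3)

for (nm in names(columnNums)) {
  df1[[columnNums[[nm]]]] <- vectors[[nm]]   # fill the column by position
  names(df1)[columnNums[[nm]]] <- nm         # carry the vector's name over
}
names(df1)
```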
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1.
