Use sqldf inside a function with generic column references - r

I'm trying to use the sqldf package inside a user-defined function in R with generic column names. I can only get it to work if the column names match the placeholder names (x and y) used within the function. However, I want it to work regardless of the column names fed into the function.
Here is the form that works:
df<-data.frame(X=as.factor(c("a","a","a","b","b","b","c","c","c")), Y=c(2.5,3,4,4,5.3,6,6.555,7,8))
df
Bar_Prep1<-function(data,x,y){
library(sqldf)
dataframe<-sqldf("select a.[x] Grp, AVG(a.[y]) Mean, stdev(a.[y]) SD, Max(a.[y]) Max
from data a
group by a.[x]")
dataframe$RD<-round(dataframe$Mean,digits=0)
return(dataframe)
}
test<-Bar_Prep1(df,df$X,df$Y)
test
Which returns the following df:
Grp Mean SD Max RD
1 a 3.166667 0.7637626 4 3
2 b 5.100000 1.0148892 6 5
3 c 7.185000 0.7400507 8 7
BUT, I want to be able to use the function on various column names, so I tried this:
df1<-data.frame(a=as.factor(c("a","a","a","b","b","b","c","c","c")), b=c(2.5,3,4,4,5.3,6,6.555,7,8))
df1
test1<-Bar_Prep1(df1,df1$a,df1$b)
test1
Returns the following errors:
"Error: no such column: a.x"
"object 'test1' not found"
So the question is, how do I need to modify my function code to accept variable names other than "x" and "y"?

Pass the column names as strings rather than the columns themselves. Change the sqldf call to fn$sqldf, which enables string interpolation using $, and then use $x and $y in the select statement.
library(sqldf)
Bar_Prep1 <- function(data, x, y) {
dataframe <- fn$sqldf("select
a.[$x] Grp,
AVG(a.[$y]) Mean,
stdev(a.[$y]) SD,
Max(a.[$y]) Max
from data a
group by a.[$x]")
dataframe$RD <- round(dataframe$Mean, digits = 0)
return(dataframe)
}
Bar_Prep1(df, "X", "Y")
## Grp Mean SD Max RD
## 1 a 3.166667 0.7637626 4 3
## 2 b 5.100000 1.0148892 6 5
## 3 c 7.185000 0.7400507 8 7
Note that it would be possible to absorb the rounding into the SQL statement:
Bar_Prep1 <- function(data, x, y) {
fn$sqldf("with tmp as (select
a.[$x] Grp,
AVG(a.[$y]) Mean,
stdev(a.[$y]) SD,
Max(a.[$y]) Max
from data a
group by a.[$x])
select *, round(Mean) RD from tmp")
}
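To see what the interpolation amounts to, here is a minimal sketch that builds the same SQL text with base R's sprintf and checks the column names first. The helper name build_group_query is hypothetical and is not part of sqldf; the query text is copied from the answer above.

```r
# Hypothetical helper: builds the SQL text that fn$sqldf would produce
# after substituting $x and $y with the supplied column names.
build_group_query <- function(data, x, y) {
  # Require real column names, which guards against typos and stray SQL.
  stopifnot(is.character(x), length(x) == 1, x %in% names(data))
  stopifnot(is.character(y), length(y) == 1, y %in% names(data))
  sprintf(
    "select a.[%s] Grp, AVG(a.[%s]) Mean, stdev(a.[%s]) SD, Max(a.[%s]) Max from data a group by a.[%s]",
    x, y, y, y, x
  )
}

df1 <- data.frame(a = c("a", "a", "b"), b = c(2.5, 3, 4))
query <- build_group_query(df1, "a", "b")
cat(query, "\n")
```

The resulting string could then presumably be handed to sqldf() from inside a function whose data argument holds the data frame, which is an assumption about how you would wire it up rather than something shown in the answer.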

Related

How to create new column (using dplyr's mutate) based on conditions applied on the entire piped dataframe

I am looking for a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
a = rnorm(qq,0,1),
b = rnorm(qq,10,5))
myf <- function(dataframe,value){
result <- dataframe %>%
filter(rn<=value) %>%
nrow
return(result)
}
The above example is a rather simplified version for which I am trying to filter the piped dataframe (df) and obtain a new column (foo) whose values will depict how many rows there are with rn less than or equal to the current rn (each row's rn - coming from the piped df ). Below you can see the output I am getting vs the one I expect to obtain :
df %>%
mutate(
foo_i_am_getting = myf(.,rn),
foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION : Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn<=value)). It would be great if this was an argument of the custom function, to be passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe,rn,value)
Disclaimer: I've done my best to describe the problem at hand, however, if there are still unclear spots please let me know so I can elaborate further.
Thanks in advance for your support!
You need to do it step by step, because right now you are passing the whole vector to filter instead of one value at a time:
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq)
Now we pass 1 to filter for the rn column (and the function returns the number of rows), then 2, and so on for each value of rn.
The function could be:
myf <- function(vec_filter, dataframe, vec_rn) {
map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq,
foo_function = myf(rn, ., rn))
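The same one-value-at-a-time idea can be sketched in base R, without purrr, as a rough illustration. The helper name count_at_most is hypothetical, and the data frame is reduced to the rn column for brevity.

```r
qq <- 5
df <- data.frame(rn = 1:qq)

# Hypothetical helper: for each element of `values`, count how many
# entries of `column` are less than or equal to it.
count_at_most <- function(values, column) {
  vapply(values, function(v) sum(column <= v), integer(1))
}

# Each row gets the number of rows whose rn is <= its own rn.
df$foo <- count_at_most(df$rn, df$rn)
print(df)
```

Because vapply calls the inner function once per element, the comparison always sees a single value on the right-hand side, which is the step the original myf was missing.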

Various results with distinct() in a custom function

I want to create a function in R that will create a numerical column based on a character/categorical column. In order to do this I need to get the distinct values in the categorical column. I can do this outside a function well, but would like to make a reusable function to do it. The issue I've run into is that the same distinct() call that works outside the function doesn't behave the same way within the function. I've created a demo below:
# test of call to db to numericize
DF <- data.frame("a" = c("a","b","c","a","b","c"),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6],
stringsAsFactors = FALSE)
catnum <- function(db, inputcolname) {
x <- distinct(db,inputcolname);
print(x);
return(x);
}
y <- distinct(DF,a)
y
catnum(DF,'a')
While y gives the correct distinct one column answer (one column with (a,b,c) in it), x within the function is the entire dataframe. I have tried with and without the ' ', as in catnum(DF,a) but the results are the same.
Could someone tell me what is happening or suggest some code that would work?
One solution is to use the distinct_ function inside your function. distinct expects a bare column name and does not work with a column name stored in a variable.
For example, distinct(DF, "a") will not work. The actual syntax is distinct(DF, a); notice the missing quotes. When distinct is called from the function, the column name is supplied through a variable (i.e. inputcolname), which is evaluated rather than treated as a column name, hence the unexpected result. distinct_, on the other hand, accepts a variable holding the column name.
library(dplyr)
catnum <- function(db, inputcolname) {
x <- distinct_(db,inputcolname);
#print(x);
return(x);
}
#With modified function results were as expected.
catnum(DF,'a')
# a
# 1 a
# 2 b
# 3 c
Not sure what you are trying to do and where the distinct function is coming from. Are you looking for this?
catnum<-function(DF,var){
length(unique(DF[[var]]))
}
catnum(DF,'a')
Your inputs are not the same, and so you get different results. If you give distinct the same arguments you give catnum, you will get the same result:
isTRUE(all.equal(distinct(DF, a),
catnum(DF, "a")))
## [1] FALSE
isTRUE(all.equal(distinct(DF, "a"),
catnum(DF, "a")))
##[1] TRUE
Unfortunately, this does not work:
catnum(DF, a)
## a b c
## 1 a 0.1 a
## 2 b 1.1 b
## 3 c 2.1 c
## 4 a 3.1 d
## 5 b 4.1 e
## 6 c 5.1 f
The reason, as explained in
vignette("programming")
is that you must jump through several annoying hoops if you want to write functions that use functions from dplyr. The solution (as you will learn in the vignette) is as follows:
catnum <- function(db, inputcolname) {
inputcolname <- enquo(inputcolname)
distinct(db, !!inputcolname)
}
catnum(DF, a)
## a
## 1 a
## 2 b
## 3 c
Or you could conclude that this is all too confusing and do something like
catnum <- function(db, inputcolname) {
unique(db[, inputcolname, drop = FALSE])
}
catnum(DF, "a")
## a
## 1 a
## 2 b
## 3 c
instead.
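Since the stated goal is a numerical column derived from a categorical one, the base R route can be taken one step further. This is only a sketch: the helper name add_code_column and the "_num" suffix are hypothetical, and the codes simply follow the order in which values first appear.

```r
DF <- data.frame(a = c("a", "b", "c", "a", "b", "c"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: append an integer code column for a column that
# is named by a string.
add_code_column <- function(db, inputcolname) {
  values <- db[[inputcolname]]
  # match() returns the position of each value among the distinct values.
  db[[paste0(inputcolname, "_num")]] <- match(values, unique(values))
  db
}

out <- add_code_column(DF, "a")
print(out)
```

Taking the column name as a string and indexing with [[ avoids the quoting issues discussed above entirely.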

Transform elements in column with the column name as argument

I'm trying to modify the data in a data set based on a vector of columns to change. That way I could factorize the treatment based on a config file which would have the list of columns to change as a variable.
Ideally, I'd like to be able to use ddply like that :
column <- "var2"
df <- ddply(df, .(), transform, column = func(column))
This would replace each element of the column var2 with the result of passing it through func (func here is used to trim a chr in a particular way). In the example further down, the output would be the same dataframe, but in the column "B" each letter would have an "A" added behind it. I've tried several solutions, like :
df[do.call(func, df[,column]), ]
which doesn't accept the df[,column] as argument (not a list), or
param = c("var1", "var2")
for(p in param){
df <- df[func(df[,p]),]
}
which destroys the other data, or
df[, column] <- lapply(df[, column], func)
Which doesn't work because it takes the whole column as argument instead of changing each element 1 by 1. I'm kinda out of ideas on how to make this treatment more automatic.
Example :
df <- data.frame(A=1:10, B=letters[2:11])
colname <- "B"
addA <- function(text) { paste0(text, "A") }
And I would like to do something like this :
df <- ddply(df, .(), transform, colname = addA(colname))
Though if the solution does not use ddply, it's not an issue, it's just what I'm the most used to
You could use mutate_at from package dplyr for this.
library(dplyr)
mutate_at(df, colname, addA)
A B
1 1 bA
2 2 cA
3 3 dA
4 4 eA
5 5 fA
6 6 gA
7 7 hA
8 8 iA
9 9 jA
10 10 kA
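If you would rather avoid a dplyr dependency, a base R sketch along the same lines is possible. It assumes, as in the example, that the function is vectorised over the column; the variable name cols is hypothetical and stands for the list of columns read from the config file.

```r
df <- data.frame(A = 1:10, B = letters[2:11], stringsAsFactors = FALSE)
cols <- c("B")  # hypothetical: the columns listed in the config file
addA <- function(text) paste0(text, "A")

# df[cols] (list-style indexing, no comma) always returns a data frame,
# so lapply() visits whole columns and the result can be assigned back.
df[cols] <- lapply(df[cols], addA)
print(df)
```

The difference from the attempt in the question is the indexing: df[, column] with a single column name drops to a plain vector, so lapply then iterates over individual elements instead of columns.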

Can I aggregate with parameters taken from data frame?

I'd like to perform different aggregations in a loop to be applied to different row subsets of my data, but it seems tricky to achieve (if possible at all):
t <- data.frame(agg=c(list("field1"=field1, "field2"=field2), ...),
fun=c(mean, ...))
f <- function(x) {
for (i in 1:nrow(t)) {
y <- aggregate(x, by=t$agg[i], FUN=t$fun[i])
# do something with y
}
}
One problem is that the field list agg triggers an error when trying to build the data frame ("object 'field1' not found"), and the other problem is that R does not like to assign a function value to fun ("cannot coerce class ""function"" to a data.frame").
Appendix:
A concrete example for my data (just to match the definitions above) could be:
> d <- data.frame(field1=round(rnorm(5, 10, 1)),field2=letters[round(rnorm(5, 10, 1))], field3=1:5)
> d
field1 field2 field3
1 11 j 1
2 11 i 2
3 10 j 3
4 12 i 4
5 11 j 5
> with(d, aggregate(d$field3,by=list(field1, field2),FUN=mean))
Group.1 Group.2 x
1 11 i 2
2 12 i 4
3 10 j 3
4 11 j 3
Playing tricks with the variable names in the data frame, I still get this:
> with(d,t <- data.frame(agg=c(list("field1"=field1, "field2"=field2)),fun=c(mean)))
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class ""function"" to a data.frame
There were several problems, mostly caused by R making exceptions to general processing:
First, atomic vectors cannot be nested; only lists can. In addition, all elements of an atomic vector are required to have the same type.
Second, data.frame does some magic treatment when constructing the variables (causing the inability to assign closures), so it cannot be used.
Finally, I had to refer to the variables to aggregate by name.
So the definition looks like this (where , ... means "add more similar items"):
t <- list(agg=list(c("field1", "field2"), ...),
fun=list(mean, ...))
f <- function(x) {
for (i in 1:length(t$agg)) {
agg <- t$agg[[i]]
aggList <- lapply(agg, FUN=function(e) x[[e]])
names(aggList) <- agg
y <- aggregate(x, by=aggList, FUN=t$fun[[i]])
# do something with y
}
}
Note: In the actual solution I added another list holding the names of the columns to select for the aggregated data frame to avoid warnings about mean returning NA.
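For a concrete, runnable variant of this idea, here is a minimal sketch. It assumes one list entry per aggregation and a value element naming the column to aggregate (in the spirit of the note above); the names specs and run_specs are hypothetical.

```r
d <- data.frame(field1 = c(11, 11, 10, 12, 11),
                field2 = c("j", "i", "j", "i", "j"),
                field3 = 1:5)

# Each spec names the grouping columns, the column to aggregate, and
# the function to apply. A list can hold functions; a data.frame cannot.
specs <- list(
  list(by = c("field1", "field2"), value = "field3", fun = mean),
  list(by = "field2", value = "field3", fun = max)
)

run_specs <- function(x, specs) {
  lapply(specs, function(s) {
    # x[s$by] is a named list of grouping columns, as aggregate() expects.
    aggregate(x[s$value], by = x[s$by], FUN = s$fun)
  })
}

results <- run_specs(d, specs)
print(results)
```

Selecting only the value column before aggregating is one way to sidestep the warnings about mean returning NA for non-numeric columns.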

Assigning value to an R object without using its name with get()

I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
colnames(get(element))[2] <- paste(strsplit(colnames(get(element)) [1], '_')[[1]][1],
colnames(get(element))[2], sep='_')
}
The arguments within the for loop, on either side of the assignment operator, both give the expected result if I run them separately but when run together produce the following error.
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.
This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list as suggested in a comment to the question might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e
e <- list(df = df)
then omit the line marked ## and the rest works as is.
Here would be one way to accomplish this goal if the data.frames have systematic names (here, df1 df2 df3, etc) and the prefix ends with "_" as in the example:
# suggested by @roland: roll them up in a list
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
# keep only the prefix up to and including the last "_" of the first column name
names(myDfList[[dfName]])[2] <- paste0(sub("^(.*_).*$", "\\1",
names(myDfList[[dfName]])[1]),
names(myDfList[[dfName]])[2])
}
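Once the data frames live in a named list, the loop can also be written with lapply. The following is a sketch under the same assumption that the prefix ends at the first "_"; the helper name add_prefix and the sample data frames are hypothetical.

```r
dfs <- list(df1 = data.frame(A_cat = 1:3, dog = 4:6),
            df2 = data.frame(B_cat = 1:3, bird = 7:9))

# Hypothetical helper: prefix the second column name with everything up
# to and including the first "_" of the first column name.
add_prefix <- function(d) {
  prefix <- sub("_.*", "_", names(d)[1])
  names(d)[2] <- paste0(prefix, names(d)[2])
  d
}

# lapply keeps the list names, so each data frame stays addressable.
dfs <- lapply(dfs, add_prefix)
print(lapply(dfs, names))
```

Because each data frame is modified as an ordinary function argument and returned, there is no need for get() or assign() at all.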

Resources