I am writing an R function using dplyr 0.7.2 syntax that takes the input and output data frame names and a column name to sort on. The following is the code I have.
#test data frame creation
lb<- data.frame(study = replicate(25,"ABC"),
subjid = c("x1","x2","x3","x4","x5"),
visit = c("SCREENING","VISIT1","VISIT2","VISIT3","EOT"),
visitn = c(-1,1,2,3,4),
param = c("ALB","AST","HGB","HCT","LDL"),
aval = replicate(5, sample(c(20:100), 1, rep = TRUE)))
#sort function- user to provide input/output df names and column name to sort on
sortdf <- function(ind,outd,col){
col <- enquo(col)
outd <- ind %>% arrange(!!col)
outd <<- outd # return dataframe to workspace
}
sortdf(lb,lb_sort, visitn)
The above code works, but the output data frame name does not resolve to lb_sort: the output data frame is named after the parameter (outd) instead. I need some help!
Thanks,
Prasanna
You do not need to use <<- in this context. In effect, your function is a wrapper for arrange:
my_sort <- function(df, col) {
col <- enquo(col)
df %>%
arrange(!!col)
}
my_sort(df = lb, col = visitn)
Then you could create your objects as usual:
my_sort(df = lb, col = visitn) -> sorted_stuff
Edit
As per request, forcing creation of a named object in the parent environment.
my_sort <- function(df, col, some_name) {
col <- enquo(col)
df %>%
arrange(!!col) -> dta_a
# Gather env. inf
e <- environment() # current environment
p <- parent.env(e)
# Create object in parent env.
assign(x = some_name,
value = dta_a,
envir = p)
# If desired return another object
# return(some_other_data)
}
my_sort(df = lb, col = visitn, some_name ="created_data")
Explanation
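The e and p objects hold the function's current environment and its parent (enclosing) environment, i.e. the environment in which the function was defined.
assign takes the name as a string and creates the named object in that parent environment, which is the global environment when my_sort is defined at the top level, as in the example.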
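The difference between the enclosing environment (parent.env) and the calling environment (parent.frame) can be seen in a small base R sketch. The names make_reporter, report and marker are hypothetical and exist only for this illustration.

```r
# A function defined inside another function: its enclosing environment
# is the execution environment of make_reporter().
make_reporter <- function() {
  marker <- "defined here"
  function() {
    list(
      enclosing = parent.env(environment()),  # where the function was defined
      caller    = parent.frame()              # where the function was called from
    )
  }
}

report <- make_reporter()
envs <- report()

# `marker` lives in the enclosing environment, which differs from the caller's.
found_in_enclosing <- exists("marker", envir = envs$enclosing, inherits = FALSE)
same_env <- identical(envs$enclosing, envs$caller)
```

For a function defined and called at the top level, both environments happen to be the global environment, which is why the example above ends up writing there.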
Remarks
This is odd behaviour, when called as shown:
> ls()
[1] "lb" "my_sort"
> my_sort(df = lb, col = visitn, some_name ="created_data")
> ls()
[1] "created_data" "lb" "my_sort"
The function leaves "created_data" object in global environment. This is inconsistent with expected behaviour where the user would usually create objects:
my_sort(df = lb, col = visitn) -> created_data
and I wouldn't encourage using it. If the actual problem is concerned with returning multiple objects a potentially better approach may involve packing all the results into a list and returning one list:
list(result_1 = mtcars,
result_2 = airquality)
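As a minimal sketch of that list-returning pattern in base R (the helper sort_and_summarise and the toy data are made up for illustration, not part of the original question):

```r
# Hypothetical helper that returns two results packed into one named list.
sort_and_summarise <- function(df, col) {
  sorted <- df[order(df[[col]]), , drop = FALSE]
  list(
    sorted = sorted,
    n_rows = nrow(sorted)
  )
}

toy <- data.frame(id = c("a", "b", "c"), visitn = c(3, 1, 2))
res <- sort_and_summarise(toy, "visitn")

# The caller decides what each piece is called in the workspace.
toy_sorted <- res$sorted
toy_n      <- res$n_rows
```

This keeps the function free of side effects: nothing appears in the workspace unless the caller assigns it.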
I am writing a package where a large number of different methods will take the same pattern of arguments and build a dataframe. I am trying to make a helper function that would call substitute on various passed parameters to identify columns in a dataframe, and I can't figure out how to get it to work two levels up. Here is a small example (the real one would have Yobs, B, Z, siteID all as different variables to be fetched from the passed data):
worker <- function( YY,
data = NULL,
env = NULL ) {
vv = substitute(YY, env=env)
res <- tibble( V = eval(vv, data) )
return( res )
}
compare_methods <- function(Yobs, data = NULL ) {
env = rlang::env()
dat = worker( Yobs, data=data, env = env )
return( dat )
}
dat = tibble( kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10 )
compare_methods( kitty, data=dat )
This is so that the variable names don't all have to be quoted in the function call, as in tidyverse functions. I think I am fundamentally misunderstanding some of the magic behind how the tidyverse passes variable names as bare symbols rather than strings, however, and perhaps I should be using an entirely different toolset here?
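One possible way to get this working, sketched here in base R only, is to capture the expression in the user-facing function and hand the captured expression (not the bare argument) down to the helper. This is an assumption about what the helper should do, not the only approach; with rlang, enquo() and eval_tidy() play the analogous roles.

```r
# Inner helper: receives an already-captured expression and evaluates it,
# looking in `data` first and then in `env`.
worker <- function(expr, data = NULL, env = parent.frame()) {
  data.frame(V = eval(expr, data, env))
}

# User-facing function: capture the unevaluated argument here,
# one level above the worker, and pass the caller's environment along.
compare_methods <- function(Yobs, data = NULL) {
  yy <- substitute(Yobs)
  worker(yy, data = data, env = parent.frame())
}

dat <- data.frame(kitty = 1:10, pig = LETTERS[1:10], orc = 1:10 * 10)
res <- compare_methods(kitty, data = dat)
```

Calling substitute() inside worker on its own argument would only see the name used in compare_methods (Yobs), which is why the capture happens one level up.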
I need to compute an aggregate using the native R function IQR.
df1 <- SparkR::createDataFrame(iris)
df2 <- SparkR::agg(SparkR::groupBy(df1, "Species"),
IQR_Sepal_Length=IQR(df1$Sepal_Length, na.rm = TRUE)
)
returns
Error in as.numeric(x): cannot coerce type 'S4' to vector of type 'double'
How can I do it?
This is exactly what gapply, dapply and gapplyCollect were created for! Essentially, you can apply a user-defined function in Spark; it will not run as efficiently as native Spark functions, but at least you will get what you want.
I would suggest starting with gapplyCollect and then moving on to gapply.
df1 <- SparkR::createDataFrame(iris)
# gapplyCollect does not require you to specify output schema
# but, it will collect all the distributed workload back to driver node
# hence, it is not efficient if you expect huge sized output
df2 <- SparkR::gapplyCollect(
df1,
c("Species"),
function(key, x){
df_agg <- data.frame(
Species = key[[1]],
IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
)
}
)
# This is how you do the same thing using gapply - specify schema
df3 <- SparkR::gapply(
df1,
c("Species"),
function(key, x){
df_agg <- data.frame(
Species = key[[1]],
IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
)
},
schema = "Species STRING, IQR_Sepal_Length DOUBLE"
)
SparkR::head(df3)
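To sanity-check the Spark result, the same per-group IQR can be computed locally in plain R. This sketch assumes the small iris data fits in memory; the names local_iqr and local_df are made up, and the local column is Sepal.Length (with a dot), whereas the Spark code above uses Sepal_Length.

```r
# Plain R equivalent on the local iris data, grouped by Species.
local_iqr <- tapply(iris$Sepal.Length, iris$Species, IQR, na.rm = TRUE)

# Reshape into the same two-column layout as the Spark output.
local_df <- data.frame(
  Species = names(local_iqr),
  IQR_Sepal_Length = as.numeric(local_iqr)
)
```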
I have the following situation: I have several data frames and, for each one, I would like to create two data frames according to the value of one of the columns (log2FoldChange > 1 and log2FoldChange < -1).
For this I use the following code:
DJ29_T0_Overexpr = DJ29_T0[which(DJ29_T0$log2FoldChange > 1),]
DJ29_T0_Underexpr = DJ29_T0[which(DJ21_T0$log2FoldChange < -1),]
DJ29_T0 being one of my data frames.
First problem: the sign for the dataframe where log2FoldChange < -1 is not taken into account.
But the main problem is at the time of making the function, I wrote the following:
spliteOverUnder <- function(res){
nm <-deparse(substitute(res))
assign(paste(nm,"_Overexpr", sep=""), res[which(as.numeric(as.character(res$log2FoldChange)) > 1),])
assign(paste(nm,"_Underexpr", sep=""), res[which(as.numeric(as.character(res$log2FoldChange)) < -1),])
}
Which I then ran with :
spliteOverUnder(DJ29_T0)
No error message, but my objects are not created in my global environment. I tried return(paste(nm,"_Overexpr", sep="")), but that only returns the object name, not the associated data frame.
Using paste() forces the use of assign(), so I can't do :
spliteOverUnder <- function(res){
nm <-deparse(substitute(res))
paste(nm,"_Overexpr", sep="") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > 1),]
paste(nm,"_Underexpr", sep="") <<- res[which(as.numeric(as.character(res$log2FoldChange)) < -1),]
}
spliteOverUnder(DJ24_T0)
I encounter the following error:
Error in paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > :
could not find function "paste<-"
If you've encountered this difficulty before, I'd appreciate a little help.
Also, once the function works, I'd be glad to know how to use a for loop over a list containing all my data frames to apply this function to each of them.
Thanks
When assigning, use the pos argument to hoist the new objects out of the function.
function(){
assign(x = ..., value = ...,
pos = 1 ## see below
)
}
... where pos = 1 is the first entry on the search path (see search()), i.e. the global environment. The default, pos = -1, is the environment from which assign() is called, here the function's local environment.
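A minimal runnable sketch of this in base R follows. The function make_global and the object name hoisted_example are made up for illustration; envir = globalenv() is an equivalent and more explicit spelling of pos = 1.

```r
make_global <- function(name, value) {
  # pos = 1 is the first entry of search(), i.e. the global environment.
  assign(x = name, value = value, pos = 1)
}

make_global("hoisted_example", 42)

# The object now exists in the global environment, not only inside the function.
hoisted_found <- exists("hoisted_example", envir = globalenv(), inherits = FALSE)
hoisted_value <- get("hoisted_example", envir = globalenv())
```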
edit
A general function to create the split dataframes in your global environment follows. However, you might rather want to save the new dataframes (from within the function) or just forward them to downstream functions than cram your workspace with intermediary objects.
splitOverUnder <- function(the_name_of_the_frame){
df <- get(the_name_of_the_frame)
df$cat <- cut(df$log2FoldChange,
breaks = c(-Inf, -1, 1, Inf),
labels = c('underexpr', 'normal', 'overexpr')
)
split_data <- split(df, df$cat)
sapply(c('underexpr', 'overexpr'),
function(n){
new_df_name <- paste(the_name_of_the_frame, n, sep = '_')
assign(x = new_df_name,
value = split_data[[n]], # [[n]] looks up the element named by n; $n would look for one literally called "n"
envir = .GlobalEnv
)
}
)
}
## say, df1 and df2 are your initial dataframes to split:
sapply(c('df1', 'df2'), function(n) splitOverUnder(n))
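If cluttering the workspace is a concern, a possible alternative is to return both subsets in a named list and loop over a named list of data frames. This is only a sketch: split_over_under, the cutoff argument and the toy data are assumptions made for illustration.

```r
# Hypothetical helper: returns both subsets instead of assigning them.
split_over_under <- function(df, cutoff = 1) {
  lfc <- as.numeric(as.character(df$log2FoldChange))
  list(
    Overexpr  = df[which(lfc >  cutoff), , drop = FALSE],
    Underexpr = df[which(lfc < -cutoff), , drop = FALSE]
  )
}

# Toy stand-ins for the real result tables.
frames <- list(
  DJ29_T0 = data.frame(gene = c("g1", "g2", "g3"), log2FoldChange = c(2.5, -3, 0.2)),
  DJ24_T0 = data.frame(gene = c("g4", "g5"),       log2FoldChange = c(-1.5, 1.2))
)

# One call per data frame; the result is a named list of named lists.
splits <- lapply(frames, split_over_under)

over_29  <- splits$DJ29_T0$Overexpr
under_29 <- splits$DJ29_T0$Underexpr
```

Here lapply replaces the explicit for loop, and the list names take the place of the pasted object names.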
I have this function. I'm trying to store a tibble of tweets in an object.
tuits <- function(x, y) {
x <- search_tweets(y, n=5000, include_rts = FALSE, lang = "es",
since = since, until = until) %>%
filter(screen_name != y)
}
x is the name of the object and y is the query I want to search in Twitter.
The problem is that the objects are not accessible outside the function.
tuits(Juan, "JuanPerez")
tuits(Pedro, "PedroJimenez")
For instance, if I evaluate Juan, R returns: Error: object 'Juan' not found.
What can I do? I need those objects to be accessible outside the function because I want to save all of them to an XLS file.
Thanks
Update: this should now solve your problem, using the ensym() and assign() functions.
Essentially, we want to create (or overwrite) the global variable whose name is passed to the function. To do this, we capture the argument's name rather than its contents with ensym(), and then call assign(), telling it that the target lives in the global environment and has the name captured by ensym().
Here is a brief explanation showing how it works.
library(rlang)
f <- function(x) {
x <- ensym(x)
assign(as_string(x), 2, envir = globalenv())
}
john <- 1
f(john)
print(john)
#> [1] 2
Created on 2021-04-05 by the reprex package (v2.0.0)
For your function we would want to take this approach:
library(rlang)
library(dplyr)   # for %>% and filter()
library(rtweet)  # for search_tweets()
tuits <- function(x, y) {
# Get the name of the variable we want to store
x <- ensym(x)
tmp <- search_tweets(y, n=5000, include_rts = FALSE, lang = "es",
since = since, until = until) %>%
filter(screen_name != y)
# Assign the value to the variable in the global environment
assign(as_string(x), tmp, envir = globalenv())
}
tuits(Juan, "JuanPerez")
# to test
print(Juan)
Old Answer (improved on in the above section)
I believe this comes down to understanding scope, or environments. If an object is modified or created within a function's environment (its scope), the new value is only visible inside that function.
Usually the scope of a function contains the variables that are assigned inside the function body.
The usual way to solve this is to return the object with return(x) and assign the result of the function call to a name.
tuits <- function(x, y) {
x <- search_tweets(y, n=5000, include_rts = FALSE, lang = "es",
since = since, until = until) %>%
filter(screen_name != y)
return(x)
}
Juan <- tuits(Juan, "JuanPerez")
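Since the goal is to save several results together, one option is to collect the returned values in a named list rather than in separate objects. The sketch below runs without Twitter access by using fake_search, a made-up stand-in for search_tweets() that only builds a tiny toy data frame; tuits_list is likewise hypothetical.

```r
# Stand-in for search_tweets(); it returns a toy data frame so the
# sketch can run offline. The rows are placeholders, not real tweets.
fake_search <- function(query) {
  data.frame(screen_name = c(query, "someone_else"),
             text = c("own tweet", "mention"),
             stringsAsFactors = FALSE)
}

# Return the filtered result instead of assigning it inside the function.
tuits_list <- function(query) {
  res <- fake_search(query)
  res[res$screen_name != query, , drop = FALSE]
}

# Names of the query vector become the names of the result list.
queries <- c(Juan = "JuanPerez", Pedro = "PedroJimenez")
results <- lapply(queries, tuits_list)
```

Each element (results$Juan, results$Pedro) can then be written out in a loop.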
You could instead use superassignment (<<-), although this is usually not best practice; I include it only for completeness.
Superassignment looks for an existing variable in the enclosing environments and, if none is found, creates it in the global environment. Note, however, that this assigns the value to a variable literally named x, not to the object whose name was passed in (Juan).
tuits <- function(x, y) {
x <<- search_tweets(y, n=5000, include_rts = FALSE, lang = "es",
since = since, until = until) %>%
filter(screen_name != y)
}
tuits(Juan, "JuanPerez")
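A tiny base R sketch makes the pitfall visible. The names set_x and Juan_demo are hypothetical; because of lazy evaluation, the argument is never evaluated, so passing an undefined name does not raise an error.

```r
set_x <- function(x, value) {
  # Assigns to a variable literally named `x` outside the function,
  # not to whatever name the caller passed in.
  x <<- value
}

set_x(Juan_demo, 10)

x_exists    <- exists("x")          # a variable called x was created
juan_exists <- exists("Juan_demo")  # the name passed in was not
```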
How can you access the name of a list element within a function if you pass not the whole list but only the list element (dataframe)?
I have a named list of dataframes, e.g.
files <- list(BRX = -0.72, BRY = -0.72, BRZ = -0.156, BTX = -0.002, BTY = -0.002,
BTZ = -0.0034)
Later in the code, I will use a single list element as input for a plot function. This plot function shall also print the list element's name. How can I access it?
I have the following solution - it works but is a bit cumbersome:
map2(files, names(files),
function(file, filename) {
data.table::setattr(file, "filename", filename)
})
Later, I can retrieve the filename as attribute within the plot function by:
plotfunction <- function(list_element, ...) {
...
filename <- attr(list_element, "filename")
...
+ ggtitle(filename)
...
}
Is there a more elegant alternative solution, either by a different way to access the list element name, or by setting the filename attribute differently?
One straightforward approach might be to pass the data to the plot function plot_fun as a list (using single brackets [), instead of as a list element (using double brackets [[). In this way, the list element's name will directly be available inside the plot function:
## dummy list of datasets
data_ls <- list(`Dataset 1` = data.frame(x = 1:10, y = 1:10), `Dataset 2` = data.frame(x = 1:10, y = 2 * (1:10)))
## dummy plot function
plot_fun <- function(data_el, ...) {
plot(data_el[[1]], ...)
# base graphics calls are separate statements, not chained with `+`
title(names(data_el))
}
plot_fun(data_ls["Dataset 1"], type = "l")
plot_fun(data_ls[2], type = "l")
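The difference between the two kinds of brackets can be checked directly, without plotting. In this small sketch, data_demo is a made-up list used only for illustration.

```r
data_demo <- list(`Dataset 1` = data.frame(x = 1:3),
                  `Dataset 2` = data.frame(x = 4:6))

# Single brackets return a one-element list, so the element name is kept.
kept <- names(data_demo["Dataset 1"])

# Double brackets return the data frame itself; names() now gives its columns.
dropped <- names(data_demo[["Dataset 1"]])
```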
Edit: to call plot_fun for each list element in data_ls, we could modify plot_fun to accept a data and name argument, and then call lapply, Map, mapply or purrr's walk2 or map2 (walk2 is preferable, since plot_fun is called for its side-effects).
## modified dummy plot function
plot_fun <- function(data, name, ...) {
plot(data, ...)
# base graphics calls are separate statements, not chained with `+`
title(name)
}
## using lapply
lapply(seq_along(data_ls), function(i) plot_fun(data_ls[[i]], names(data_ls)[i], type = "l"))
## or with Map
Map(plot_fun, data = data_ls, name = names(data_ls), type = "l")
## or with purrr
purrr::walk2(.x = data_ls, .y = names(data_ls), .f = plot_fun, type = "l")