I have a very big dataset and I analyze it with R.
The problem is that I want to add some columns with different treatments on my dataset AND I need some recursive function which use some global variable. Each function modify some global variable et create some variables. So the duplication of my dataset in memory is a big problem...
I read some documentation: if I didn't misunderstand, neither the use of <<- nor assign() could help me...
What I want:
mydata <- list(read.table(), ...)
myfunction <- function(var1, var2) {
#modification of global mydata
mydata = ...
#definition of another variable with the new mydata
var3 <- ...
#recursive function
mydata = myfunction(var2, var3)
}
Do you have some suggestions for my problem?
Both <<- and assign will work:
myfunction <- function(var1, var2) {
# Modification of global mydata
mydata <<- ...
# Alternatively:
#assign('mydata', ..., globalenv())
# Assign locally as well
mydata <- mydata
# Definition of another variable with the new mydata
var3 <- ...
# Recursive function
mydata = myfunction(var2, var3)
}
That said, it’s almost always a bad idea to want to modify global data from a function, and there’s almost certainly a more elegant solution to this.
Furthermore, note that <<- is actually not the same as assigning to a variable in globalenv(), rather, it assigns to a variable in the parent scope, whatever that may be. For functions defined in the global environment, it’s the global environment. For functions defined elsewhere, it’s not the global environment.
Related
I'm using functions from an external package (that I cannot modify). These functions put a lot of stuff in the global environment, for instance the package does things like
the.data <<- data.frame(A=rnorm(10),B=rnorm(10),C=rnorm(10)) ## A sample dataset
package.plot <- function(){
x.coords <<- the.data$A/the.data$B
y.coords <<- the.data$C
plot(x.coords, y.coords)
}
(obviously hyper-simplified example... here the key is that x.coods and y.coords are rather complex derivations, sufficiently complex that I do not want to recode them but find it advantageous to re-use the existing code)
I want to use these functions in my own scripts, namely make the same graph with ggplot. Of course, a first, obvious solution is
my.better.plot <- function(){
package.plot()
tibble(x.coords,y.coords) %>% ggplot(aes(x=x.coords,y=y.coords))+geom_point() # etc.
}
However, this has two issues:
I end up plotting twice (a minor issue, it is sufficiently fast
to be unnoticeable);
I "pollute" the global environment with
global x.coords and y.coords
Hence, I would like to run package.plot() in a "pseudo-global" environment to avoid ending up with global variables that may be modified in an "uncontrolled" way.
A workaround, of course, is
my.better.plot <- function(){
package.plot()
tibble(x.coords,y.coords) %>% ggplot(aes(x=x.coords,y=y.coords))+geom_point() # etc.
rm(x.coords,envir=.GlobalEnv)
}
However, I'd prefer to do something like
my.better.and.cleaner.plot(){
within.envir(dummy_env,my.better.plot)
}
.. assuming that there is, indeed, a function "within.envir" that allows to run its second argument in a mock global environment.
Is something like this possible at all ? I did read http://adv-r.had.co.nz/Environments.html , but could not find the answer... (not one that I understood, at any rate). Bonus question : if this is possible, how can I extract the return value of ggplot from dummy_env and return it ?
This function avoids the side effects as much as possible:
library(ggplot2)
library(magrittr)
library(tibble)
my.better.plot <- function(){
x.coords <- 1
y.coords <- 1
environment(package.plot) <- environment()
bmp(tempfile())
package.plot()
dev.off()
print(tibble(x.coords,y.coords) %>% ggplot(aes(x=x.coords,y=y.coords))+geom_point()) # etc.
}
my.better.plot()
#creates only the ggplot in the current device
ls(globalenv())
#[1] "my.better.plot" "package.plot" "the.data"
So this is unfortunately very hacky, since the <<- operator will traverse the environment tree upwards if it does not find the variable name (hence why you should basically never use it.
The one workaround is to call the function from another environment that already has the variables in question initialized. Then it will assign it into those variables and not traverse further up into the globalEnv. You need to know the variable names beforehand though.
f <- function(x) a <<- x
f(5)
# a = 5 in GlobalEnv
rm(a)
CapturedCall <- function(fun, CapturedVars,...)
{
stopifnot(is.function(fun))
SandBox <- new.env()
for(varName in CapturedVars) assign(varName, NA,SandBox)
environment(fun) <- SandBox
fun(...)
}
CapturedCall(f,"a",1)
#Nothing in GlobalEnv
I have written a function that should make my script more streamlined, however I am having trouble with assigning variables or calling variables. In this case I want tax_count to be as if Domain_count was written outside the function. I have tried the solution by R: How can a function assign a value to a variable that will persist outside the function? but with no luck.
Here is my function and calling the function:
taxonomic_count <- function(tax_count, tax, col_num, data) {
tax_count <- aggregate(.~tax, data, sum)
tax_count$count <- rowSums(select(tax_count, col_num:7))
tax_count <- tax_count %>%
select(tax, count)
}
taxonomic_count(Domain_count, Domain, 2, Data1)
Help would be awesome!
I have created a function which cleans up my data and plots using ggplot. I want to name the cleaned data and plot with a suffix so that it can be recalled easily.
For example:
data_frame
data_frame_cleaned
data_frame_plot
I haven't managed to find anything that might pull this off.
I read about using deparse(substitute(x)) to turn the variable into a string, so I gave it a shot together with paste().
import a new data frame
my_data <- read.csv("my_data.csv")
analyze_data(my_data)
function with dpylr and ggplot.
Then, I want to store analyse_data and data_plot in the environment, here is what I thought might work, but no...
analyze_data <- function(x){
x_data <- x %>%
filter()%>%
group_by() %>%
summarize() %>%
mutate()
x_plot <- ggplot(x_data)
x_name <- deparse(substitute(x))
assign(paste(x_name,"cleaned",sep="_"),x_data)
assign(paste(x_name,"plot",sep="_"),x_plot)
}
I got warning message instead.
Warning messages:
1: In assign(paste(x_name, "cost_plot", sep = "_"), campg_data) :
only the first element is used as variable name
Using assign to assign variables is not the best idea. You can litter your environment with lots of variables, which can become confusing, and makes it difficult to handle them programmatically. It's better to store your objects in something like a list, which allows you to extract data easily or modify it in sequence using the *apply or map_* functions. That said…
I cannot replicate the warning when I run your function more or less as it is above. Nevertheless, although the function seems to run just fine, it doesn't do what is desired, i.e. no new variables appear in .GlobalEnv. The issue is that you haven't specified the environment in which the variables should be assigned, so they are assigned within the function's own local environment and vanish when the function completes.
You can use pos = 1 to assign your variables within the .GlobalEnv. The following code create variables mtcars_cleaned and mtcars_plot in my .GlobalEnv:
library(dplyr)
analyze_data <- function(x){
x_data <- x %>%
filter(cyl > 4)
x_plot <- ggplot(x_data, aes(mpg, disp)) + geom_point()
x_name <- deparse(substitute(x))
assign(paste(x_name,"cleaned", sep="_"), x_data, pos = 1)
assign(paste(x_name,"plot", sep="_"), x_plot, pos = 1)
}
analyze_data(mtcars)
Here is what I would expect the function to do:
datalist <- c("var1","var2",...)
my.function <- function(datalist){
n <- length(dlist)
varnames <- paste("data", dlist, sep = ".")
for (...) { # for each var in 'varnames'
... # grab each variable from some specific online dataset;
... # do some basic data manipulation for each variable
}
... # return all the results
}
The main difficulty for me is:
(1) how to do the loop so the grabbed data could be properly temporally stored, and
(2) how the multiple variables could be returned, after finishing the loop;
EDIT:
The loop can create variables I want during the loop, say VAR1 and VAR2, which were stored in the 'dlist' argument, but I cannot manipulate VAR1 or VAR2 in the function, dlist[1] or dlist[2] in the function would only give me a string but not the variable itself.
Thanks in advance.
I think I have solved the problem and make the function work as I expected.
As I described in the question, the main problem in fact is how to manipulate the variables while VAR1 and VAR2 themselves are strings in the function.
eval combined with as.name should work:
eval(as.name(dlist[i]))
Here's an example that assigns in two different ways, one which works and one which doesn't:
library(datasets)
dat <- as.data.frame(ChickWeight)
dat$test1 <- with(dat, Time + weight)
with(dat, test2 <- Time + weight)
> colnames(dat)
[1] "weight" "Time" "Chick" "Diet" "test1"
I've grown accustomed to this behavior. Perhaps more surprising is that test2 just disappears (instead of winding up in the base environment, as I'd expect):
> ls(pattern="test")
character(0)
Note that with is a fairly simple^H^H^H^H^H^H short function:
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
First let's replicate with's functionality:
eval( substitute(Time+weight), envir=dat, enclos=parent.frame() )
Now test with a different enclosure:
testEnv <- new.env()
eval( substitute(test3 <- Time+weight), envir=dat, enclos=testEnv )
ls( envir=testEnv )
Which still doesn't assign anywhere. This disproves my hunch that it was related to the enclosing environment being discarded, and rather points to something more fundamental to the ,enclos argument not doing what I think it does.
I'm curious about the mechanics of why this is going on and if there's an alternative which allows assignment.
Change with to within. with is only for making variables available, not changing them.
Edit: To elaborate, I believe that both with and within create a new environment and populate it with the given list-like object (such as a data frame), and then evaluate the given expression within that environhment. The difference is that with returns the result of the expression and discards the environment, while within returns the environment (converted back to whatever class it originally was, e.g. data.frame). Either way, any assignments made within the expression are presumably performed inside the created environment, which is discarded by with. This explains why test2 is nowhere to be found after doing with(dat, test2 <- Time + weight).
Note that since within returns the modified environment instead of editing it in place (i.e. call-by-value semantics), you need to do dat <- within(dat, test2 <- Time + weight).
If you want a function to do assignment to the current environment (or any specified environment), look at assign.
Edit 2: The modern answer is to embrace the tidyverse and use magrittr & dplyr:
library(datasets)
library(dplyr)
library(magrittr)
dat <- as.data.frame(ChickWeight)
dat %<>% mutate(test1 = Time + weight)
The last line is equivalent to
dat <- dat %>% mutate(test1 = Time + weight)
which is in turn equivalent to
dat <- mutate(dat, test1 = Time + weight)
Use whichever of the last 3 lines makes the most sense to you.
Inspired by the fact that the following works from the command line ...
eval(substitute(test <- Time + weight, dat))
... I put together the following, which seems to work.
myWith <- function(DAT, expr) {
X <- call("eval",
call("substitute", substitute(expr), DAT))
eval(X, parent.frame())
}
## Trying it out
dat <- as.data.frame(ChickWeight)
myWith(dat, test <- Time + weight)
head(test)
# [1] 42 53 63 70 84 103
(The complicated aspect of this problem is that we need substitute() to search for symbols in one environment (the current frame) while the "outer" eval() assigns into a different environment (the parent frame).)
I get the sense that this is being made way too complex. Both with and within return values calculated by operations on named columns of dataframes. If you don't assign them to anything, the value will get garbage collected. The usual way to store tehn is assignment to to a named object or possibly a component of an object with the <- operator. within returns the entire dataframe, whereas with returns only the vector that was calculated from whatever operations were performed on the column names. You could, of course, use assign instead of <-, but I think overuse of that function may obfuscate rather than clarify the code. The difference in use is just assignment to an entrire dataframe or just a column:
dat <- within(dat, newcol <- oldcol1*oldcol2)
dat$newcol <- with(dat, oldcol1*oldcol2)