Assignment to a data.frame with `with`

Here's an example that assigns in two different ways, one which works and one which doesn't:
library(datasets)
dat <- as.data.frame(ChickWeight)
dat$test1 <- with(dat, Time + weight)
with(dat, test2 <- Time + weight)
> colnames(dat)
[1] "weight" "Time" "Chick" "Diet" "test1"
I've grown accustomed to this behavior. Perhaps more surprising is that test2 just disappears (instead of winding up in the base environment, as I'd expect):
> ls(pattern="test")
character(0)
Note that with is a fairly simple^H^H^H^H^H^H short function:
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
First let's replicate with's functionality:
eval( substitute(Time+weight), envir=dat, enclos=parent.frame() )
Now test with a different enclosure:
testEnv <- new.env()
eval( substitute(test3 <- Time+weight), envir=dat, enclos=testEnv )
ls( envir=testEnv )
Which still doesn't assign anywhere. This disproves my hunch that it was related to the enclosing environment being discarded, and rather points to something more fundamental about the enclos argument not doing what I think it does.
I'm curious about the mechanics of why this is going on and if there's an alternative which allows assignment.

Change with to within. with is only for making variables available, not changing them.
Edit: To elaborate, I believe that both with and within create a new environment and populate it with the given list-like object (such as a data frame), and then evaluate the given expression within that environment. The difference is that with returns the result of the expression and discards the environment, while within returns the environment (converted back to whatever class it originally was, e.g. data.frame). Either way, any assignments made within the expression are presumably performed inside the created environment, which is discarded by with. This explains why test2 is nowhere to be found after doing with(dat, test2 <- Time + weight).
Note that since within returns the modified environment instead of editing it in place (i.e. call-by-value semantics), you need to do dat <- within(dat, test2 <- Time + weight).
If you want a function to do assignment to the current environment (or any specified environment), look at assign.
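To make those options concrete, here is a minimal sketch contrasting the three approaches described above (the test2/test3/test4 column names are just for illustration):
dat <- as.data.frame(ChickWeight)
# within() returns a modified copy of the data frame, so reassign it
dat <- within(dat, test2 <- Time + weight)
# with() only returns the computed vector; assign it to a column yourself
dat$test3 <- with(dat, Time + weight)
# assign() writes a name into whatever environment you specify
assign("test4", with(dat, Time + weight), envir = globalenv())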
Edit 2: The modern answer is to embrace the tidyverse and use magrittr & dplyr:
library(datasets)
library(dplyr)
library(magrittr)
dat <- as.data.frame(ChickWeight)
dat %<>% mutate(test1 = Time + weight)
The last line is equivalent to
dat <- dat %>% mutate(test1 = Time + weight)
which is in turn equivalent to
dat <- mutate(dat, test1 = Time + weight)
Use whichever of the last 3 lines makes the most sense to you.

Inspired by the fact that the following works from the command line ...
eval(substitute(test <- Time + weight, dat))
... I put together the following, which seems to work.
myWith <- function(DAT, expr) {
  X <- call("eval",
            call("substitute", substitute(expr), DAT))
  eval(X, parent.frame())
}
## Trying it out
dat <- as.data.frame(ChickWeight)
myWith(dat, test <- Time + weight)
head(test)
# [1] 42 53 63 70 84 103
(The complicated aspect of this problem is that we need substitute() to search for symbols in one environment (the current frame) while the "outer" eval() assigns into a different environment (the parent frame).)

I get the sense that this is being made way too complex. Both with and within return values calculated by operations on named columns of dataframes. If you don't assign them to anything, the values will get garbage collected. The usual way to store them is assignment to a named object, or possibly to a component of an object, with the <- operator. within returns the entire dataframe, whereas with returns only the vector that was calculated from whatever operations were performed on the column names. You could, of course, use assign instead of <-, but I think overuse of that function may obfuscate rather than clarify the code. The difference in use is just assignment to an entire dataframe versus just a column:
dat <- within(dat, newcol <- oldcol1*oldcol2)
dat$newcol <- with(dat, oldcol1*oldcol2)

Related

Can I create a value name from a function argument, and assign a function output to it?

I have created a function which cleans up my data and plots it using ggplot. I want to name the cleaned data and the plot with a suffix so that they can be recalled easily.
For example:
data_frame
data_frame_cleaned
data_frame_plot
I haven't managed to find anything that might pull this off.
I read about using deparse(substitute(x)) to turn the variable into a string, so I gave it a shot together with paste().
# import a new data frame
my_data <- read.csv("my_data.csv")
analyze_data(my_data)
# analyze_data() is a function built with dplyr and ggplot
Then, I want to store the cleaned data and the plot in the environment. Here is what I thought might work, but no...
analyze_data <- function(x){
  x_data <- x %>%
    filter() %>%
    group_by() %>%
    summarize() %>%
    mutate()
  x_plot <- ggplot(x_data)
  x_name <- deparse(substitute(x))
  assign(paste(x_name, "cleaned", sep = "_"), x_data)
  assign(paste(x_name, "plot", sep = "_"), x_plot)
}
I got a warning message instead.
Warning messages:
1: In assign(paste(x_name, "cost_plot", sep = "_"), campg_data) :
only the first element is used as variable name
Using assign to assign variables is not the best idea. You can litter your environment with lots of variables, which can become confusing and makes it difficult to handle them programmatically. It's better to store your objects in something like a list, which allows you to extract data easily or modify it in sequence using the *apply or map_* functions (see the sketch at the end of this answer). That said…
I cannot replicate the warning when I run your function more or less as it is above. Nevertheless, although the function seems to run just fine, it doesn't do what is desired, i.e. no new variables appear in .GlobalEnv. The issue is that you haven't specified the environment in which the variables should be assigned, so they are assigned within the function's own local environment and vanish when the function completes.
You can use pos = 1 to assign your variables within the .GlobalEnv. The following code creates variables mtcars_cleaned and mtcars_plot in my .GlobalEnv:
library(dplyr)
library(ggplot2)
analyze_data <- function(x){
  x_data <- x %>%
    filter(cyl > 4)
  x_plot <- ggplot(x_data, aes(mpg, disp)) + geom_point()
  x_name <- deparse(substitute(x))
  assign(paste(x_name, "cleaned", sep = "_"), x_data, pos = 1)
  assign(paste(x_name, "plot", sep = "_"), x_plot, pos = 1)
}
analyze_data(mtcars)
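If you want to avoid assign altogether, here is a rough sketch of the list-based approach mentioned at the start of this answer; the helper name analyze_data2 and the filter condition are only illustrative:
library(dplyr)
library(ggplot2)
analyze_data2 <- function(x) {
  # return both results instead of assigning them into the caller's environment
  x_data <- x %>% filter(cyl > 4)
  x_plot <- ggplot(x_data, aes(mpg, disp)) + geom_point()
  list(cleaned = x_data, plot = x_plot)
}
results <- analyze_data2(mtcars)
head(results$cleaned)
results$plot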

Check if column exists and if it does check something about it

Hi, I want to check if a column in a data.frame exists, and only if it does, check another condition.
I know I can use a nested if statement as I have in the example.
This is normally for checking inputs to functions. This is a working example which gives me the output I want; I was just wondering if there is a smarter way, as this can get messy, especially if I am doing it for a number of conditions. My example:
testfun <- function(dat, ...){
  library(dplyr)
  if ("Site" %in% colnames(dat)) {
    # for example check number of sites, this condition could be anything though
    if (n_distinct(dat$Site) > 1) stop("Function must have site specific data")
  }
  # do stuff
  return(1)
}
testdf1 <- data.frame(x = 1:10, y = 1:10)
testdf2 <- data.frame(x = 1:10, y = 1:10, Site = "A")
testdf3 <- data.frame(x = 1:10, y = 1:10, Site = rep(c("A", "B"), each = 5))
testfun(testdf1)
testfun(testdf2)
testfun(testdf3)
Edit with a bit more context: the user may input data that is site specific and therefore has no Site column (i.e. they have a data.frame with data for only one site, so they have never specified the site as a column), or they might be using a data.frame that has data for a number of sites specified in a column. If there is no Site column it is safe to assume that the data are for one site and it is valid to continue the calculations; but if there is a Site column, I have to check that it only has one distinct value (e.g. it might have been filtered on this column before applying the function, or applied through plyr::ddply).
There are a lot of other cases, however, where I want to check that my input data to a function is of the expected form, and if the input is a data.frame this often means checking for column names and something about those columns.
You can decide whether this is a smarter way or not, but one option is to separate the logic using map_if. Here we check the basic condition ("Site" %in% colnames(dat)) in the predicate part, and based on that we call one of two functions: one for TRUE and the other for FALSE. We still check similar conditions, but by keeping the functions separate we keep the code clean and it is easy to understand which part is doing what.
library(dplyr)
library(purrr)
testfun <- function(dat, ...) {
  unlist(map_if(list(dat), "Site" %in% colnames(dat), true_fun, .else = false_fun))
}
true_fun <- function(dat) {
  if (n_distinct(dat$Site) > 1) stop("Function must have site specific data")
  return(1)
}
false_fun <- function(dat) { return(1) }
testfun(testdf1)
#[1] 1
testfun(testdf2)
#[1] 1
testfun(testdf3)
Error in .f(.x[[i]], ...) : Function must have site specific data
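For comparison, a plainer sketch that relies on &&'s short-circuit evaluation (the n_distinct() check only runs when the Site column exists), without pulling in purrr:
library(dplyr)
testfun2 <- function(dat, ...) {
  # the second condition is only evaluated if the first is TRUE
  if ("Site" %in% colnames(dat) && n_distinct(dat$Site) > 1) {
    stop("Function must have site specific data")
  }
  # do stuff
  return(1)
}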

R character variables in function

I’ve worked with SAS and SQL previously; I’m trying to get into R via a course. I’ve been set the following task by my tutor:
“Using the Iris dataset, write an R function that takes as its arguments an Iris species and attribute name and returns the minimum and maximum values of the attribute for that species.”
Which sounded straightforward at first, but I’ve come unstuck trying to make the function. The below is as far as I've gotten
# write the function
question_2 <- function(x, y, data){
  new_table <- subset(data, Species == x)
  themin <- min(new_table$y)
  themax <- max(new_table$y)
  return(themin)
  return(themax)
}
# test the function - Species, Attribute, Data
question_2("setosa", Sepal.Width, iris)
I assumed I needed quotes around the species when running the function, but I get an error that there were "no non-missing arguments to min/max", which I'm guessing means my attempt at making 'new_table' has brought back zero observations.
Can anyone see where I'm going wrong?
edit: thanks all for the swift and insightful responses. i'll take that reading on board. thanks again!
Indeed, your teacher didn't give you the easiest thing to do in R. You were almost right, but you can't return twice from a function: the first return() exits the function, so the second is never reached.
question_2 <- function(x, y, data){
  new_table <- subset(data, Species == x)
  themin <- min(new_table[[y]])
  themax <- max(new_table[[y]])
  return(list(themin, themax))
}
question_2("setosa", "Sepal.Width", iris)
df$colname cannot be used with a variable to the right of $: it will look for a column literally named "colname" ("y" in your case), not for the column whose name is stored in the variable colname (if such a variable even exists).
The syntax df[["colname"]] is useful in this case because it allows for character input (which may also be a variable representing a character). This holds for both object types list and data.frame. In fact, a data.frame can be seen as a list of vectors.
Example
df <- data.frame(col1 = 5:7, col2 = letters[1:3])
a <- "col1"
# $ vs [[
df$col1 # works because "col1" is a column of df
df$a # does not work because "a" is not a column of df
df[["col1"]] # works because "col1" is a column of df
df[[a]] # works because "col1" is a column of df
# dataframes can be seen as list of vectors
ls <- list(col1 = 5:7, col2 = letters[1:3])
ls$col1 # works
ls[[a]] # works
One subtlety: passing Sepal.Width unquoted doesn't even produce an Object "Sepal.Width" not found error here, because R evaluates arguments lazily and the $ operator never evaluates y at all. new_table$y simply looks for a column literally named "y", finds none, and returns NULL, which is why min() and max() complain about "no non-missing arguments". And even if y did hold the character string "Sepal.Width", R would not know how to use it with the $ operator to get that named column from new_table, not without some needlessly advanced programming. #Flo.P's suggestion of using [[ is a good one.
You must pass y as "Sepal.Width".
Another approach: you can take advantage of subset by writing this:
question_2 <- function(x, y, data){
  newy <- subset(data, subset = Species == x, select = y)
  themin <- min(newy)
  themax <- max(newy)
  return(c(themin, themax))
}
question_2("setosa", "Sepal.Width", iris)

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
  ## Create an empty data frame
  new_frame <- data.frame()
  ## Add every data frame in the directory whose name is in number_seq to new_frame;
  ## the file variable specifies the path to the file
  for (i in number_seq){
    file <- paste("~/", directory, "/", sprintf("%03d", i), ".csv", sep = "")
    x <- read.csv(file)
    new_frame <- rbind.data.frame(new_frame, x)
  }
  ## calculate and return the mean
  mean(new_frame[, variable], na.rm = TRUE)
}
While calculating the mean I first tried to subset using the $ sign (new_frame$variable) and then using the subset function (subset(new_frame, select = variable)), but each would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subsetting approaches didn't work? It took me a really long time to figure out, and even though I managed to make it work, I still don't know why it didn't work the other ways. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the dataframe with the literal name variable.
On the other hand, new_frame[, variable] uses the value of the variable argument of f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd <- data.frame(
  id = 1:4,
  var = rnorm(4),
  value = runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted variables as well as literal column names, which would you expect: the dd$var column, or the dd$value column (because var == "value")? That's why dd[, var] is different: [ takes a character vector (or numeric index), not an expression referring to a column name, so dd[, var] gives you dd$value.
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
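To make the contrast concrete inside a function, here is a small sketch (the helper name get_col and the mtcars column are only illustrative):
get_col <- function(data, variable) {
  list(
    dollar  = data$variable,    # NULL: $ looks for a column literally named "variable"
    bracket = data[, variable], # uses the string stored in `variable`
    double  = data[[variable]]  # same idea, list-style extraction
  )
}
str(get_col(mtcars, "mpg"))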

Why doesn't my function save my assignment?

f1 <- function(x) {
  zx1 <- sample(1:nrow(zone4[[x]]), nrow(zone4[[x]]), replace = FALSE)
  zone4[[x]]$randnums <- zx1
}
f1(1)
## DOESN'T UPDATE zone4[[1]]
zx2 <- sample(1:nrow(zone4[[1]]), nrow(zone4[[1]]), replace = FALSE)
zone4[[1]]$randnums <- zx2
## DOES UPDATE zone4[[1]]
If I define the function f1() as shown above, the object zone4[[x]] is not updated. However, if I run the same commands but explicitly state x, as in the second snippet above, then zone4[[1]] is updated. Why could this be? I want to know because I want to run iterations of the code. If I add names(zone4[[x]]) inside the definition of f1(), the output tells me that the function did what it was supposed to, but when queried again, zone4[[x]] appears to be unchanged. Thank you for your help. The idea is to create random numbers for each subset of a data set for a given year and another variable, zone. The data set was originally a single data frame, but I used the split() function to separate the data according to year and zone, of which there are 4. Maybe there is a better way to assign random numbers to specific subsets of data without using the split() function?
R functions don't usually have side effects (i.e. changing things in global objects).
This is a good thing (most of the time), as we don't want unintended consequences.
The idiomatic approach is to assign the result to a new object (it can have the same name, to overwrite the original):
f1 <- function(x) {
  zx1 <- sample(1:nrow(zone4[[x]]), nrow(zone4[[x]]), replace = FALSE)
  zone4[[x]]$randnums <- zx1
  # usually a good idea to return the complete object,
  # especially when a replacement function (in your case `[[<-`)
  # is the last one called
  return(zone4)
}
zone4 <- f1(1)
An alternative would be to use data.table; its := operator modifies columns by reference, so the change made inside the function persists without reassigning:
library(data.table)
zone4 <- lapply(zone4, as.data.table)
f1 <- function(x) {
  zone4[[x]][, randnums := sample(.N)]
  invisible(NULL)
}
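Since the real goal is to do this for every zone/year subset, here is a hedged sketch of handling the whole list produced by split() in one pass with lapply (sample(nrow(z)) is equivalent to sample(1:nrow(z), nrow(z), replace = FALSE)):
# add a randnums column to every element of the list at once
zone4 <- lapply(zone4, function(z) {
  z$randnums <- sample(nrow(z))
  z
})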
