I have a function that roughly follows this structure:
TestFunc <- function(dat, point) {
  if (!(point %in% c("NW", "SE", "SW")))
    stop("point must be one of 'SW', 'SE', 'NW'")
  point <- enquo(point)
  return(dat %>% filter(point == !!point))
}
The issue is that I get the following error when I include the check for values:
Error in (function (x, strict = TRUE) :
the argument has already been evaluated
The error disappears when I remove the check. How can I keep both?
The thing to remember about the quosure framework is that it's a very clever, sophisticated piece of code that undoes another very clever, sophisticated piece of code to get you back where you started from. What you want is achievable in a very simple fashion, without going to NSE and coming back again.
TestFunc <- function(dat, point)
{
if(!(point %in% c("NW", "SE", "SW")))
stop("point must be one of 'SW', 'SE', 'NW'")
dat[dat$point == point, ]
}
(The difference between this and using match.arg, as @Frank suggests in a comment, is that match.arg will use the first value as the default if no input is supplied, whereas this will fail.)
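For illustration, here is a sketch of what that match.arg version might look like (the name TestFunc2 is made up to keep it distinct from the function above):

```r
TestFunc2 <- function(dat, point = c("SW", "SE", "NW")) {
  # match.arg validates the input against the choices in the default,
  # and silently falls back to the first value ("SW") when none is supplied
  point <- match.arg(point)
  dat[dat$point == point, ]
}

d <- data.frame(point = c("SE", "NW", "SE"), x = 1:3)
se_rows      <- TestFunc2(d, "SE")  # the two "SE" rows
default_rows <- TestFunc2(d)        # defaults to "SW": zero rows in this data
```

An invalid value such as TestFunc2(d, "XX") still errors, but the error message is generated by match.arg rather than your own stop() call.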
If you want to call other dplyr/tidyverse verbs, just do that after filtering the rows.
Because R evaluates function arguments lazily (they are passed as promises), you can only capture arguments that have not been evaluated yet. So you first have to capture with enquo() and then proceed to check the value. However, if you have to mix both quoting and value-based code, it often indicates a design problem.
As Hong suggested it seems that in your case you can directly unquote the value without capturing it. Unquoting will ensure the right value is found (since you gave the same name to that variable as the column in your data frame).
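The same "capture first, then check" ordering can be sketched in base R with substitute()/eval() standing in for enquo(); the function name f and the allowed values here are made up for illustration:

```r
f <- function(x) {
  x_expr <- substitute(x)                 # capture the unevaluated expression first
  x_val  <- eval(x_expr, parent.frame())  # only then evaluate it to validate
  if (!x_val %in% c("NW", "SE", "SW"))
    stop("x must be one of 'SW', 'SE', 'NW'")
  x_val
}
```

Both a literal string, f("SE"), and a variable holding one, v <- "NW"; f(v), pass the check, because the expression is captured before anything forces its evaluation.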
Evaluate point so filter can tell the difference between the argument and the data's column
aosmith has a good idea, so I'm putting it in answer form, along with a reproducible example:
f <- function(dat, point) {
if (!(point %in% c("NW", "SE", "SW")))
stop("point must be one of 'SW', 'SE', 'NW'")
filter(dat, point == !!point)
}
tbl <- tibble(point = c('SE', 'NW'))
f(tbl, 'SE')
f(tbl, 'XX')
If you're passing point in as a string, you only need to differentiate the argument point (= "SE", in this case) and the column dat$point (or tbl$point, depending on whether you're inside or outside the function). From the dplyr programming doc:
In dplyr (and in tidyeval in general) you use !! to say that you want to unquote an input so that it’s evaluated, not quoted.
You want to evaluate the argument point so that "SE" is passed to filter; that way filter can tell the difference between the column point and the value "SE" from the argument.
(I also tried using filter(dat, .data$point == point), but the RHS point still refers to the column, not the f argument.)
I hope this example helps with your real code. 🍟
I'm quite new to R and I've been learning with the available resources on the internet.
I came across this issue where I have a vector (a) with vars "1", "2", and "3". I want to use the count function to generate a new df with the categories for each of those variables and their frequencies.
The function I want to use in a loop is this
b <- count(mydata, var1)
However, when I use this loop below;
for (i in (a)) {
'j' <- count(mydata[, i])
print (j)
}
The loop runs, but the frequencies that get saved in j are only those of the categorical variable "var 3".
Can someone assist me on this code please?
TIA!
In R there are generally better ways than to use loops to process data. In your particular case, the “straightforward” way fails, because the idea of the “tidyverse” is to have the data in tidy format (I highly recommend you read this article; it’s somewhat long but its explanation is really fundamental for any kind of data processing, even beyond the tidyverse). But (from the perspective of your code) your data is spread across multiple columns (wide format) rather than being in a single column (long form).
The other issue is that count (like many other tidyverse functions) expects an unevaluated column name. It does not accept the column name via a variable. akrun’s answer shows how you can work around this (using tidy evaluation and the bang-bang operator), but that workaround isn’t necessary here.
The usual solution, instead of using a loop, would first require you to bring your data into long form, using pivot_longer.
After that, you can perform a single count on your data:
result <- mydata %>%
pivot_longer(all_of(a), names_to = 'Var', values_to = 'Value') %>%
count(Var, Value)
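If you do prefer an explicit loop over the column names, base R's table() sidesteps the quoting issue entirely, since it takes a plain vector rather than an unevaluated column name (a sketch with made-up data):

```r
mydata <- data.frame(
  var1 = c("a", "a", "b"),
  var2 = c("x", "y", "y"),
  var3 = c("p", "p", "p")
)
a <- c("var1", "var2", "var3")

# one frequency table per column; mydata[[v]] extracts the column as a vector
freqs <- lapply(a, function(v) table(mydata[[v]]))
names(freqs) <- a
```

Each element of freqs is then the frequency table for one variable, instead of j being overwritten on every iteration.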
Some comments regarding your current approach:
Be wary of cryptic variable names: what are i, j and a? Use concise but descriptive variable names. There are some conventions where i and j are used but, if so, they almost exclusively refer to index variables in a loop over vector indices. Using them differently is therefore quite misleading.
There’s generally no need to put parentheses around a variable name in R (except when that name is the sole argument to a function call). That is, instead of for (i in (a)) it’s conventional to write for (i in a).
Don’t put quotes around your variable names! R happens to accept the code 'j' <- … but since quotes normally signify string literals, its use here is incredibly misleading, and additionally doesn’t serve a purpose.
What is the reason behind quoting the name of your function? I have seen this in a couple of packages (for example line number 2 here in the quantmod package). Instead of writing
f <- function(x)
they write
"f" <- function(x)
Another example is from the gratia package (line 88), where functions are back-quoted:
`f` <- function(x)
Transferred from Allan Cameron's comment.
It doesn't make any difference for the functions in the link you shared. It appears to be more of a stylistic choice from the developer to allow the top of function declarations to stand out. Sometimes it is necessary to wrap function names in quotes if they contain illegal characters.
The most frequently seen ones in R are the [<- type operators. That is, if you want to define a function that writes to a subset of a custom class, so the user can do x[y] <- z, then you need to write a function like "[<-.myclass" <- function(x, i, value) {...}.
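As a minimal sketch (the class name myclass is made up), such a replacement method could look like:

```r
`[<-.myclass` <- function(x, i, value) {
  cls <- class(x)
  x <- unclass(x)   # drop the class so the assignment below doesn't recurse
  x[i] <- value
  class(x) <- cls   # restore the class before returning
  x
}

obj <- structure(c(1, 2, 3), class = "myclass")
obj[2] <- 99        # dispatches to `[<-.myclass`
```

Because the function name contains characters like [ and <, it must be written in quotes or backticks; that is the "illegal characters" case, as opposed to the purely stylistic quoting in quantmod and gratia.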
Basically in SAS I could just do an if statement without an else. For example:
if species='setosa' then species='regular';
there is no need for else.
How to do it in R? This is my script below which does not work:
attach(iris)
iris2 <- iris
iris2$Species <- ifelse(iris2$Species=='setosa',iris2$Species <- 'regular',iris2$Species <- iris2$Species)
table(iris2$Species)
A couple options. The best is to just do the replacement, this is nice and clean:
iris2$Species[iris2$Species == 'setosa'] <- 'regular'
ifelse returns a vector, so the way to use it in cases like this is to replace the column with a new one created by ifelse. Don't do assignment inside ifelse!
iris2$Species <- ifelse(iris2$Species=='setosa', 'regular', iris2$Species)
But there's rarely need to use ifelse if the else is "stay the same" - the direct replacement of the subset (the first line of code in this answer) is better.
New factor levels
Okay, so the code posted above doesn't actually work - this is because iris$Species is a factor (categorical) variable, and 'regular' isn't one of the categories. The easiest way to deal with this is to coerce the variable to character before editing:
iris2$Species <- as.character(iris2$Species)
iris2$Species[iris2$Species == 'setosa'] <- 'regular'
Other methods work as well, (editing the factor levels directly or re-factoring and specifying new labels), but that's not the focus of your question so I'll consider it out of scope for the answer.
Also, as I said in the comments, don't use attach. If you're not careful with it you can end up with your columns out of sync creating annoying bugs. (In the code you post, you're not using it anyway - the rest runs just as well if you delete the attach line.)
I would recommend looking at the base R documentation for help with this. You can find the documentation of if, else, and ifelse here. For use of if and else, refer to ?Control.
Regular control flow in code is done with the basic if and else statements, as in most languages. ifelse() is used for vectorized operations--it will return the same shape as your vector based on the test. Regular if and else expressions do not necessarily have those properties.
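A quick sketch of that difference:

```r
x <- c(1, 5, 10)

# ifelse() is vectorized: one result per element of the test vector
labels <- ifelse(x > 4, "big", "small")

# if/else works on a single TRUE/FALSE condition
first_label <- if (x[1] > 4) "big" else "small"
```

Passing the whole vector x to a plain if would use only a single condition (and errors in recent R versions), which is why the vectorized ifelse() exists.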
I am trying to make a function in R that calculates the mean of nitrate, sulfate and ID. My original dataframe has 4 columns (date, nitrate, sulfate, ID). So I designed the following code
prueba<-read.csv("C:/Users/User/Desktop/coursera/001.csv",header=T)
columnmean<-function(y, removeNA=TRUE){ #y will be a matrix
whichnumeric<-sapply(y, is.numeric)#which columns are numeric
onlynumeric<-y[ , whichnumeric] #selecting just the numeric columns
nc<-ncol(onlynumeric) #number of columns in onlynumeric
means<-numeric(nc)#empty vector for the means
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
}
columnmean(prueba)
When I run the steps row by row outside the function, I get the mean values. But if I call the function so that it performs all the steps by itself, it doesn't throw an error, yet it also doesn't compute any value; all that appears in my environment is the dataframe 'prueba' and the columnmean function.
what am I doing wrong?
A reproducible example would be nice (although not absolutely necessary in this case).
You need a final line return(means) at the end of your function. (Some old-school R users maintain that means alone is OK - R automatically returns the value of the last expression evaluated within the function whether return() is specified or not - but I feel that using return() explicitly is better practice.)
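Putting that together, a corrected version of the function might look like this (note that the original removeNA argument was never actually used; here it is passed through to na.rm):

```r
columnmean <- function(y, removeNA = TRUE) {
  onlynumeric <- y[, sapply(y, is.numeric), drop = FALSE]  # numeric columns only
  means <- numeric(ncol(onlynumeric))
  for (i in seq_len(ncol(onlynumeric))) {
    means[i] <- mean(onlynumeric[, i], na.rm = removeNA)
  }
  return(means)  # without this line, the function returns the for-loop's NULL
}

d <- data.frame(date = c("a", "b", "c"), nitrate = c(1, 2, 3), sulfate = c(2, 4, 6))
columnmean(d)  # c(2, 4)
```

The drop = FALSE guards against the subset collapsing to a bare vector when only one column is numeric.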
colMeans(y[sapply(y, is.numeric)], na.rm=TRUE)
is a slightly more compact way to achieve your goal (although there's nothing wrong with being a little more verbose if it makes your code easier for you to read and understand).
The result of an R function is the value of the last expression. Your last expression is:
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
It may seem strange that the value of that expression is NULL, but that's the way it is with for-loops in R. The means vector does get changed sequentially, which means that BenBolker's advice to use return(.) is correct (as his advice almost always is). For-loops in R are a notable exception to the functional programming paradigm: they provide a mechanism for looping (as do the various *apply functions), but the commands inside the loop exert their effects in the calling environment via side effects (unlike the apply functions).
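You can see this directly: the value of a for-loop is NULL (returned invisibly), even though its body updates variables in the calling environment as a side effect:

```r
total <- 0
loop_value <- for (i in 1:3) {
  total <- total + i  # side effect: updates total in the calling environment
}
```

After running this, total is 6, but loop_value, the value of the loop expression itself, is NULL, which is exactly what the original function was returning.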
I've been trying to learn more about environments in R. Through reading, it seemed that I should be able to use functions like with() and transform() to modify variables in a data.frame as if I was operating within that object's environment. So, I thought the following might work:
X <- expand.grid(
Cond=c("baseline","perceptual","semantic"),
Age=c("child","adult"),
Gender=c("male","female")
)
Z <- transform(X,
contrasts(Cond) <- cbind(c(1,0,-1)/2, c(1,-2,1))/4,
contrasts(Age) <- cbind(c(-1,1)/2),
contrasts(Gender) <- cbind(c(-1,1)/2)
)
str(Z)
contrasts(Z$Cond)
But it does not. I was hoping someone could explain why. Of course, I understand that contrasts(X$Cond) <- ... would work, but I'm curious about why this does not.
In fact, this does not work either [EDIT: false, this does work. I tried this quickly before posting originally and did something wrong]:
attach(X)
contrasts(Cond) <- cbind(c(1,0,-1)/2, c(1,-2,1))/4
contrasts(Age) <- cbind(c(-1,1)/2)
contrasts(Gender) <- cbind(c(-1,1)/2)
detach(X)
I apologize if this is a "RTFM" sort of thing... it's not that I haven't looked. I just don't understand. Thank you!
[EDIT: Thank you joran---within() instead of with() or transform() does the trick! The following syntax worked.]
Z <- within(X, {
contrasts(Cond) <- ...
contrasts(Age) <- ...
contrasts(Gender) <- ...
}
)
transform is definitely the wrong tool, I think. And you don't want with, you probably want within, in order to return the entire object:
X <- within(X,{contrasts(Cond) <- cbind(c(1,0,-1)/2, c(1,-2,1))/4
contrasts(Age) <- cbind(c(-1,1)/2)
contrasts(Gender) <- cbind(c(-1,1)/2)})
The only tricky part here is to remember the curly braces to enclose multiple lines in a single expression.
Your last example, using attach, works just fine for me.
transform is only set up to evaluate expressions of the form tag = value, and because of the way it evaluates those expressions, it isn't really set up to modify attributes of a column. It is more intended for direct modifications to the columns themselves. (Scaling, taking the log, etc.)
The difference between with and within is nicely summed up by the Value section of ?within:
Value For with, the value of the evaluated expr. For within, the modified object.
So with only returns the result of the expression. within is for modifying an object and returning the whole thing.
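A tiny sketch of that difference:

```r
d <- data.frame(x = 1:3)

with(d, x + 1)               # just the result of the expression: 2 3 4
d2 <- within(d, y <- x + 1)  # the whole data frame, now with a y column
```

with discards everything except the expression's value, so any assignment inside it would be lost; within collects the modified variables and returns the updated object.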
While I agree with @joran that within is the best strategy here, I will point out that it is possible to use transform; you just need to do so in a different way:
Z <- transform(X,
Cond = `contrasts<-`(Cond, value=cbind(c(1,0,-1)/2, c(1,-2,1))/4),
Age = `contrasts<-`(Age, value=cbind(c(-1,1)/2)),
Gender= `contrasts<-`(Gender, value=cbind(c(-1,1)/2))
)
Here we are explicitly calling the replacement function that is used when you run contrasts(a) <- b. Called directly, it returns a value that can be used in the a = b form that transform expects. And of course it leaves X unchanged.
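The same trick works with any replacement function; for example, calling `names<-` directly returns the renamed object as a value instead of modifying a variable in place:

```r
# equivalent to: x <- 1:2; names(x) <- c("a", "b")
x <- `names<-`(1:2, c("a", "b"))
```

That is all transform needs: an expression on the right-hand side of tag = value that evaluates to the new column.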
The within solution looks much cleaner of course.