Why is one_of() called that?

Why is dplyr::one_of() called that? All the other select_helpers names make sense to me, so I'm wondering if there's an aspect of one_of() that I don't understand.
My understanding of one_of() is that it just lets you select variables using a character vector of their names instead of putting their names into the select() call, but then you get all of the variables whose names are in the vector, not just one of them. Is that wrong, and if it's correct, where does the name one_of() come from?

one_of() allows for guessing or subset matching.
Let's say I know that, in general, my column names will come from c("mpg","cyl","garbage"), but I don't know which columns will be present because of interactivity/reactivity.
mtcars %>% select(one_of(c("mpg","cyl","garbage")))
evaluates but gives a warning:
Warning message:
Unknown variables: `garbage`
In contrast
mtcars %>% select(mpg, cyl, garbage)
does not evaluate and gives the error
Error in overscope_eval_next(overscope, expr) :
object 'garbage' not found
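The tolerant behaviour described above can be mimicked in base R, which may make it easier to see what one_of() is doing. This is only a sketch: select_known() is a hypothetical helper written for illustration, not part of dplyr, and dplyr's own implementation may differ.

```r
# Hypothetical helper (not part of dplyr): keep the requested columns that
# exist, and warn about the ones that do not.
select_known <- function(data, wanted) {
  missing_cols <- setdiff(wanted, names(data))
  if (length(missing_cols) > 0) {
    warning("Unknown variables: ", paste(missing_cols, collapse = ", "))
  }
  data[, intersect(wanted, names(data)), drop = FALSE]
}

# "garbage" is reported in a warning; "mpg" and "cyl" are still returned
res <- suppressWarnings(select_known(mtcars, c("mpg", "cyl", "garbage")))
names(res)
```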

The way I think about it is that select() eventually evaluates to a logical vector. So if you use starts_with it goes through the variables in the dataframe and asks whether the variable name starts with the right set of characters. one_of does the same thing but asks whether the variable name is one of the names listed in the character vector. But as they say, naming things is hard!
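To make the "logical vector" picture concrete, here is a base R sketch of the two questions being asked of each column name. It only illustrates the idea; the actual helpers in dplyr may be implemented differently.

```r
vars <- names(mtcars)

# starts_with("d") asks: does each name start with "d"?
starts_with_d <- startsWith(vars, "d")

# one_of(c("mpg", "cyl", "garbage")) asks: is each name one of these?
is_one_of <- vars %in% c("mpg", "cyl", "garbage")

vars[starts_with_d]
vars[is_one_of]
```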

The reason for its name seems to be that it lets you look for at least one of the variables contained in the vector.
For example:
select(flights, dep, arr_delay, sched_dep_time) won't work because the variable "dep" does not exist. It throws an error instead of returning a result.
select(flights, one_of(c("dep", "arr_delay", "sched_dep_time"))) will work, even though the variable "dep" does not exist. In this case, "arr_delay" and "sched_dep_time" will be shown.
The helper should be read as: at least one_of() the variables will be shown :)

Related

Warning message: In mean.default(., lead_time) : argument is not numeric or logical: returning NA

I ran into this problem.
The thing is, the column I am using is actually a numeric column, and I get correct results using:
mean(hotel_bookings_city$lead_time)
but when I try this:
hotel_bookings_city %>% mean(lead_time)
I receive the warning that the column 'lead_time' is not numeric/logical even though it is.
What am I not doing right?
As @Ritchi Sacramento already stated, in your second approach the whole data.frame is passed as the first argument to mean() because of the pipe operator %>% (from magrittr, re-exported by dplyr). You can also see this in your warning, where the dot refers to the current version of your data.frame (possibly modified by earlier steps in the pipe).
You can solve the problem with the code @Ritchi Sacramento already suggested, by (1) extracting the column before applying mean() or (2) using the summarise() function.
Another approach, which might be closer to your original one, is to use the pull() function:
hotel_bookings_city %>%
pull(lead_time) %>%
mean()
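What the pipe does here can be reproduced without any packages. The sketch below uses a small made-up data frame (bookings is a stand-in for hotel_bookings_city, whose contents are not shown) and writes out the call that x %>% mean(lead_time) turns into.

```r
# Toy stand-in for hotel_bookings_city (values are made up for illustration)
bookings <- data.frame(lead_time = c(10, 20, 30), adults = c(2, 1, 2))

# bookings %>% mean(lead_time) is rewritten to mean(bookings, lead_time),
# so mean() receives the whole data frame as its first argument.
piped_equivalent <- suppressWarnings(mean(bookings, lead_time))

# Passing the column itself gives the expected result.
direct <- mean(bookings$lead_time)
with_scope <- with(bookings, mean(lead_time))
```

Here piped_equivalent is NA (with the warning from the question suppressed), while direct and with_scope both hold the mean of the column.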

What distinguishes dplyr::pull from purrr::pluck and magrittr::extract2?

In the past, when working with a data frame and wanting to get a single column as a vector, I would use magrittr::extract2() like this:
mtcars %>%
mutate(wt_to_hp = wt/hp) %>%
extract2('wt_to_hp')
But I've seen that dplyr::pull() and purrr::pluck() also exist to do much the same job: return a single vector from a data frame, not unlike [[.
Assuming that I'm always loading all 3 libraries for any project I work on, what are the advantages and use cases of each of these 3 functions? Or more specifically, what distinguishes them from each other?
When you "should" use a function is really a matter of personal preference: which function expresses your intention most clearly? There are differences between them, though. For example, pluck() works better when you want to do multiple extractions. From the help file:
accessor(x[[1]])$foo
# is the same as
pluck(x, 1, accessor, "foo")
So while it can be used to just extract a column, it's most useful when you have more deeply nested structures or you want to compose with an accessor function.
The pull() function is meant to blend in with the rest of the dplyr functions. It can take the name of a column in any of the ways that other functions in the package accept. For example, it works with !!-style expansion, whereas extract2() does not.
irispull <- function(x) {
iris %>% pull(!!enquo(x))
}
irispull(Sepal.Length)
And extract2() is nothing more than a "more readable" wrapper for the base function [[. In fact, it's defined as .Primitive("[["), so it expects column names as character strings or column indexes as integers.
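Since extract2() is just [[ under another name, the base R forms below should behave the same way. This is a sketch using only base R; the nested list is invented to show the kind of structure where pluck() becomes more convenient.

```r
# extract2(df, "wt") and df[["wt"]] are the same call to the primitive `[[`
by_name  <- mtcars[["wt"]]
by_index <- mtcars[[6]]          # wt is the 6th column of mtcars
by_call  <- `[[`(mtcars, "wt")   # function-call form of the same primitive

# Nested extraction, the kind of case pluck() is aimed at
nested <- list(list(foo = 1:3), list(foo = 4:6))
second_foo <- nested[[2]][["foo"]]
```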

How to use apply() with my function

bmi<-function(x,y){
(x)/((y/100)^2)
}
bmi(70,177) works,
but it doesn't work with apply():
apply(Student,1:2,bmi(Student$weight,Student$height))
Error in match.fun(FUN) :
'bmi(Student$weight, Student$height)' is not a function, character or symbol
It's a bit unclear what the goal is. If it's just to get an answer, then the comments do answer it. If on the other hand, the goal is to understand what you are doing wrong, then read on. I'd say the first error going from left to right is passing the whole dataframe. I would have only passed the 'height' and 'weight' columns.
The next error, again going from left to right, is the use of 1:2 as the second argument to apply. You obviously want to do this "by rows", which means you should use only 1, i.e. the first dimension of the dataframe.
And the third error is using a function call rather than the function name. Functions with arguments in parentheses don't work when an R function (meaning apply in this case) is expecting a function name or an anonymous function as illustrated in comments.
The fourth error is not assigning the value to a column in your dataframe. So this probably would have succeeded in making the desired extra column via the apply method (but, as noted in the comments, this is not the most efficient method):
Student$bmi_val <- apply(Student[, c("weight", "height")], 1, function(r) bmi(r[["weight"]], r[["height"]]))
# didn't want my column name to be the same as the function name
The apply function was actually designed to work with matrices and arrays, so for many purposes it is ill-suited when used with dataframes. In this case, where all the arguments to the bmi function are numeric and you can control the order of the columns in the first argument to match the x and y positions, it's arguably an acceptable strategy, but not the most R-ish method. When working with dates or factor variables, you should definitely avoid apply.
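For comparison, here is a sketch of the more direct route: because the arithmetic inside bmi() is vectorised, the two columns can be passed straight to it. The Student data frame below is hypothetical, since the original data is not shown.

```r
bmi <- function(x, y) {
  x / ((y / 100)^2)
}

# Hypothetical data; the original Student data frame is not shown
Student <- data.frame(weight = c(70, 55, 82), height = c(177, 160, 185))

# Arithmetic in R is vectorised, so the columns can be passed directly
Student$bmi_val <- bmi(Student$weight, Student$height)

# mapply() is a row-wise alternative when a function is not vectorised
bmi_mapply <- mapply(bmi, Student$weight, Student$height)
```

Both approaches give one value per row, and neither needs the columns to be coerced to a matrix first, which is what apply() does.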

Error while grouping variables using the describeBy function

I am just starting to learn R.
I used the function psych::describeBy to group observations in the standard dataset airquality:
psych::describeBy(airquality, group = df$month)
However, got the error message:
"df$month : object of type 'closure' is not subsettable"
Still, cannot understand what is wrong.
UPDATE:
Ok, this question was my first shot on Stack Overflow. Not particularly successful, but I was doing my best :). Decided not to delete it, in case it may be useful for somebody who is doing his first steps (and to humble myself too).
What I did not realize, when I was dealing with this problem years ago, is that I need to specify my column for grouping using the name of dataset airquality$Month, rather than df$Month. It was not a spelling issue, but the misunderstanding of syntax basics. I believed that I've already stated that I want to use dataframe named airquality, therefore I can address its columns by name df$Month, meaning that df is a placeholder for airquality. Which is totally wrong, of course. However, my intuition was not 100% wrong, in fact this could have been accomplished by using this syntax:
psych::describeBy(airquality, group = "Month")
You don't need to write the name of the dataframe (because you named it in the first argument to describeBy); you just need to specify the name of the column of interest as the second argument (as a string, therefore in quotation marks).
Also, for some reason I wrote month, but the correct name of the column is Month (maybe it was renamed, I am not sure).
Hope that my blunder would be a help for somebody else!
Most likely the issue is with how you're loading/using the dataset. Make sure to refer to the dataset in a consistent manner, and note that the column is called Month, with a capital M.
You can try something like
library(psych)
describeBy(airquality, group = airquality$Month)
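The wording of the original error makes more sense once you know what df refers to when you have not created an object of that name. The base R sketch below shows this, along with the case-sensitivity of the column name.

```r
# With no object of your own called df, the name resolves to stats::df(),
# the density function of the F distribution.
df_is_function <- is.function(stats::df)

# Subsetting a function with $ is what triggers the "closure" error
msg <- tryCatch(stats::df$month, error = function(e) conditionMessage(e))

# Column names are case-sensitive: there is no `month` column, only `Month`
lower_is_null <- is.null(airquality$month)
months_present <- sort(unique(airquality$Month))
```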

ggplot iterate several columns

lapply(7:12, function(x) ggplot(mydf)+geom_histogram(aes(mydf[,x])))
will give an error Error in [.data.frame(mydf, , x) : undefined columns selected.
I have used several SO questions (e.g. this) as guidance, but can't figure out my error.
The code below works with the mtcars dataset. Just replace mtcars with mydf.
library(ggplot2)
lapply(1:3,function(i) {
ggplot(data.frame(x=mtcars[,i]))+
geom_histogram(aes(x=x))+
ggtitle(names(mtcars)[i])
})
Notice how the reference to i (the column index) was moved from the mapping argument (the call to aes(...)), to the data argument.
Your problem is actually quite subtle. ggplot evaluates the arguments to aes(...) first in the context of your data - e.g. it looks for column names in mydf. If that fails it jumps to the global environment. It does not look in the function's environment. See this post for another example of this behavior and some discussion.
The bottom line is that it is a really bad idea to use external variables in a call to aes(...). However, the data=... argument does not suffer from this. If you must refer to a column number, etc., do it in the call to ggplot(data=...).
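A different technique from the one in this answer is to loop over column names and use the .data pronoun inside aes(). This sketch assumes ggplot2 3.0.0 or later, where aes() supports tidy evaluation; it is offered as an alternative, not as what the answer above does.

```r
library(ggplot2)

cols <- names(mtcars)[1:3]

# One histogram per column; .data[[col]] looks the column up by name in the
# plot's data, so aes() never has to index the data frame by position.
plots <- lapply(cols, function(col) {
  ggplot(mtcars, aes(x = .data[[col]])) +
    geom_histogram(bins = 10) +
    ggtitle(col)
})
```

Looping over names rather than positions also means that a column index which does not exist in the data cannot be requested by accident.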

Resources