What distinguishes dplyr::pull from purrr::pluck and magrittr::extract2?

In the past, when working with a data frame and wanting to get a single column as a vector, I would use magrittr::extract2() like this:
mtcars %>%
  mutate(wt_to_hp = wt/hp) %>%
  extract2('wt_to_hp')
But I've seen that dplyr::pull() and purrr::pluck() also exist to do much the same job: return a single column from a data frame as a vector, not unlike [[.
Assuming that I'm always loading all 3 libraries for any project I work on, what are the advantages and use cases of each of these 3 functions? Or more specifically, what distinguishes them from each other?

When you "should" use a function is really a matter of personal preference. Which function expresses your intention most clearly. There are differences between them. For example, pluck works better when you want to do multiple extractions. From help file:
accessor(x[[1]])$foo
# is the same as
pluck(x, 1, accessor, "foo")
So while it can be used to just extract a column, it's most useful when you have more deeply nested structures or you want to compose the extraction with an accessor function.
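For instance, here is a minimal sketch (the nested list x and the use of max as the accessor are made up for illustration):
library(purrr)

x <- list(
  list(foo = 1:3, bar = "a"),
  list(foo = 4:6, bar = "b")
)

# base R: dig in step by step, reading inside-out
max(x[[1]][["foo"]])
## [1] 3

# pluck: the same extraction, reading left to right
pluck(x, 1, "foo", max)
## [1] 3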
The pull function is meant to blend in with the rest of dplyr. It can take the name of a column in any of the ways you can refer to columns with other functions in the package. For example, it works with !!-style unquoting, where extract2 will not:
irispull <- function(x) {
  iris %>% pull(!!enquo(x))
}
irispull(Sepal.Length)
And extract2 is nothing more than a "more readable" wrapper around the base extraction operator [[. In fact it's defined as .Primitive("[["), so it expects a column name as a character string or a column index as an integer.
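Put side by side on the example from the question, all three extract the same vector (a quick sketch to illustrate the equivalence):
library(dplyr)
library(purrr)
library(magrittr)

df <- mtcars %>% mutate(wt_to_hp = wt / hp)

v1 <- pull(df, wt_to_hp)        # tidy-eval style, unquoted column name
v2 <- pluck(df, "wt_to_hp")     # a data frame is a list, so pluck works too
v3 <- extract2(df, "wt_to_hp")  # same as df[["wt_to_hp"]]

identical(v1, v2) && identical(v2, v3)
## [1] TRUE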

Related

Using "count" function in a loop in R

I'm quite new to R and I've been learning with the available resources on the internet.
I came across this issue where I have a vector (a) containing vars "1", "2", and "3". I want to use the count function to generate a new df with the categories for each of those variables and their frequencies.
The function I want to use in a loop is this
b <- count(mydata, var1)
However, when I use this loop below;
for (i in (a)) {
  'j' <- count(mydata[, i])
  print(j)
}
The loop runs, but the frequencies that get saved in j are only those of the categorical variable "var 3".
Can someone assist me on this code please?
TIA!
In R there are generally better ways than to use loops to process data. In your particular case, the “straightforward” way fails, because the idea of the “tidyverse” is to have the data in tidy format (I highly recommend you read this article; it’s somewhat long but its explanation is really fundamental for any kind of data processing, even beyond the tidyverse). But (from the perspective of your code) your data is spread across multiple columns (wide format) rather than being in a single column (long form).
The other issue is that count (like many other tidyverse functions) expects an unevaluated column name; it does not accept the column name via a variable. akrun's answer shows how you can work around this (using tidy evaluation and the bang-bang operator), but that workaround isn't necessary here.
The usual solution, instead of using a loop, would first require you to bring your data into long form, using pivot_longer.
After that, you can perform a single count on your data:
result <- mydata %>%
  pivot_longer(all_of(a), names_to = 'Var', values_to = 'Value') %>%
  count(Var, Value)
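To make this concrete, here is a small self-contained sketch; the data and the names stored in a are made up, since the original mydata isn't shown:
library(dplyr)
library(tidyr)

a <- c("var1", "var2", "var3")   # assumed column names
mydata <- data.frame(
  var1 = c("yes", "no", "yes"),
  var2 = c("low", "low", "high"),
  var3 = c("x", "x", "y")
)

mydata %>%
  pivot_longer(all_of(a), names_to = "Var", values_to = "Value") %>%
  count(Var, Value)
## one row per (Var, Value) combination, with its frequency in n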
Some comments regarding your current approach:
Be wary of cryptic variable names: what are i, j and a? Use concise but descriptive variable names. There are some conventions where i and j are used but, if so, they almost exclusively refer to index variables in a loop over vector indices. Using them differently is therefore quite misleading.
There’s generally no need to put parentheses around a variable name in R (except when that name is the sole argument to a function call). That is, instead of for (i in (a)) it’s conventional to write for (i in a).
Don’t put quotes around your variable names! R happens to accept the code 'j' <- … but since quotes normally signify string literals, its use here is incredibly misleading, and additionally doesn’t serve a purpose.

R - Access an object's own name in apply functions

This is a problem I often encounter: I try to access an object's own name when using a function from the apply family and spend hours figuring out how to do it... For instance (this is not the core of my question), today I wanted to inspect an attached package to figure out whether it contained some non-function objects. After a lot of trial and error, I finally came up with this (for the rrapply package - I know looking at the documentation is also easy, but this one illustrates the problem well):
library(rrapply)
eapply(rlang::pkg_env('rrapply'), function(x) {if(!is.function(x)) x}) %>%
  `[`(sapply(., function(x) !is.null(x))) %>%
  names()
## [1] "renewable_energy_by_country" "pokedex"
I feel that this is really too complicated for such a simple test!
So my question: is there an easy way to loop through an object in base R (or maybe the tidyverse) and return only the names of those elements that correspond to a certain condition? rrapply seems to be able to achieve that, but:
it is fairly complicated
and it seems to work on lists only and to loop through all sub-elements as well, which is not desired
Thanks!
Identify the environment of interest, e, and then use eapply with the indicated function, taking the names of the extracted elements at the end. This isn't conceptually different from the code in the question, but it does seem somewhat less complex when done in base R in the following way:
e <- as.environment("package:rrapply")
names(Filter(`!`, eapply(e, is.function)))
or the same code written as a pipeline:
library(magrittr)
"package:rrapply" %>%
  as.environment %>%
  eapply(is.function) %>%
  Filter(`!`, .) %>%
  names
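Wrapped up as a small helper, the same idea generalizes to any attached package (a sketch: the function name non_functions is mine, and the order of the returned names may vary since environments are unordered):
# names of the non-function objects in an attached package's environment
non_functions <- function(pkg) {
  e <- as.environment(paste0("package:", pkg))
  names(Filter(`!`, eapply(e, is.function)))
}

non_functions("rrapply")
## [1] "renewable_energy_by_country" "pokedex"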

Creating a data frame by applying a function to each element of a vector and combining the results

I am working on a project where we frequently work with a list of usernames. We also have a function to take a username and return a dataframe with that user's data. E.g.
users = c("bob", "john", "michael")
get_data_for_user = function(user)
{
  data.frame(user = user, data = sample(10))
}
We often:
Iterate over each element of users
Call get_data_for_user to get their data
rbind the results into a single data frame
I am currently doing this in a purely imperative way:
ret = get_data_for_user(users[1])
for (i in 2:length(users))
{
  ret = rbind(ret, get_data_for_user(users[i]))
}
This works, but my impression is that all the cool kids are now using libraries like purrr to do this in a single line. I am fairly new to purrr, and the closest I can see is using map_df to convert the vector of usernames to a vector of dataframes. I.e.
dfs = map_df(users, get_data_for_user)
That is, it seems like I would still be on the hook for writing a loop to do the rbind.
I'd like to clarify whether my solution (which works) is currently considered best practice in R / amongst users of the tidyverse.
Thanks.
That looks right to me - map_df handles the rbind internally (you'll need {dplyr} in addition to {purrr}).
FWIW, purrr::map_dfr() will do the same thing, but the function name is a bit more explicit, noting that it will be binding rows; purrr::map_dfc() binds columns.
I would suggest a slight adjustment:
dfs = map_dfr(users, get_data_for_user)
map_dfr() explicitly states that you want to do a row bind, and I would be inclined to call this best practice when working with purrr.
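A quick sketch of what that returns, using the users and get_data_for_user from the question (map_dfr() row-binds internally via dplyr::bind_rows(), which is why {dplyr} needs to be installed):
library(purrr)

dfs <- map_dfr(users, get_data_for_user)
dim(dfs)  # 3 users x 10 rows each, 2 columns (user, data)
## [1] 30  2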
For the sake of completeness, here are some additional approaches:
Using built-in functions:
Reduce(rbind, lapply(users, get_data_for_user))
Using the data.table approach:
library(data.table)
rbindlist(lapply(users, get_data_for_user))

Sort a data.frame by multiple columns whose names are contained in a single object?

I want to sort a data.frame by multiple columns, ideally using base R without any external packages (though if necessary, so be it). Having read How to sort a dataframe by column(s)?, I know I can accomplish this with the order() function as long as I either:
Know the explicit names of each of the columns.
Have a separate object representing each individual column by which to sort.
But what if I only have one vector containing multiple column names, of length that's unknown in advance?
Say the vector is called sortnames.
data[order(data[, sortnames]), ] won't work, because order() treats that as a single sorting argument.
data[order(data[, sortnames[1]], data[, sortnames[2]], ...), ] will work if and only if I specify the exact correct number of sortname values, which I won't know in advance.
Things I've looked at but not been totally happy with:
eval(parse(text=paste("data[with(data, order(", paste(sortnames, collapse=","), ")), ]"))). Maybe this is fine, but I've seen plenty of hate for using eval(), so asking for alternatives seemed worthwhile.
I may be able to use the Deducer library to do this with sortData(), but like I said, I'd rather avoid using external packages.
If I'm being too stubborn about not using external packages, let me know. I'll get over it. All ideas appreciated in advance!
You can use do.call:
data <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
sortnames <- c("a", "b")
data[do.call("order", data[sortnames]), ]
This trick is useful when you want to pass multiple arguments to a function and those arguments are in a convenient named list.
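For example, the same trick lets you tack extra arguments onto the call; here is a sketch that adds decreasing = TRUE to sort on all the columns in sortnames in descending order:
data[do.call("order", c(data[sortnames], list(decreasing = TRUE))), ]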

Subsetting within a function

I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a <- c(1, 2, 3, 4)
b <- c(2, 2, 3, 5)
c <- c(1, 1, 2, 2)
D <- data.frame(a, b, c)
subbing <- function(Data, GroupVar, condition) {
  g = Data$c + 3
  h = Data$c + 1
  NewD <- data.frame(a, b, g, h)
  subset(NewD, select = c(a, b, GroupVar), GroupVar %in% condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using 1 would be great.
subbing(D, GroupVar = h, condition = 5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the convenience of the function, there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored how it's probably best to use the good old data[colnames()=="g",]. Thanks for the suggestion, I'll have a go.
There is however another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex design surveys (see package survey), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [] I would get the wrong s.e. for my parameters, so I guess this is quite an important issue.
Thank you
The error happens as soon as the function tries to evaluate GroupVar: R looks for an object named h on its own (not within the data frame).
The best thing to do is refer to the column names as quoted strings in the subset function. But of course, then you'd have to sidestep the condition part:
subbing <- function(Data, GroupVar, condition) {
  ....
  DF <- subset(Data, select = c("a", "b", GroupVar))
  DF <- DF[DF[, 3] %in% condition, ]
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.
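As a rough, self-contained sketch of how that fits the original example (this fills in the elided part with the g/h construction from the question and subsets NewD rather than Data, since that's where those columns live; note that GroupVar is now passed as a quoted string):
subbing <- function(Data, GroupVar, condition) {
  g <- Data$c + 3
  h <- Data$c + 1
  NewD <- data.frame(a = Data$a, b = Data$b, g, h)
  DF <- subset(NewD, select = c("a", "b", GroupVar))
  DF[DF[, 3] %in% condition, ]
}

subbing(D, GroupVar = "h", condition = 3)  # rows where h (= c + 1) equals 3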
