How does dplyr's select helper function everything() differ from copying?

What is the use case for
select(iris, everything())
as opposed to e.g. just copying the data.frame?

Looking for references to everything in ?select, they have an example use for reordering columns:
# Reorder variables: keep the variable "Species" in the front
select(iris, Species, everything())
In this case the Species column is moved to the first column, all columns are kept, and no columns are duplicated.
Select helpers are used for more than just the select function - for example, in dplyr version 1.0 and greater, you may want to use it in across() to mutate or summarize all columns.
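A minimal sketch of that across() usage (using iris for illustration; this example is not from the original answer):

```r
library(dplyr)

# everything() inside across() applies a function to all remaining columns:
# here, the mean of every numeric column of iris after dropping Species.
iris %>%
  select(-Species) %>%
  summarize(across(everything(), mean))
```

The same pattern works inside mutate() to transform all columns at once.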
Since this question was asked, the select helpers have been broken out into their own package, tidyselect. The tidyselect page on CRAN has a lengthy list of reverse imports - it's likely that many of the packages importing tidyselect have cases where everything() is useful.

Another example use case:
# Moves the variable Petal.Length to the end
select(iris, -Petal.Length, everything())
(I saw it here: https://stackoverflow.com/a/30472217/4663008)
Either way, both Gregor's answer and mine are confusing to me - I would have expected Species to be duplicated in Gregor's example or removed in mine. For example, if you try something more complicated based on the previous two examples, it doesn't work:
> dplyr::select(iris, Petal.Width, -Petal.Length, everything())
  Petal.Width Sepal.Length Sepal.Width Petal.Length Species
1         0.2          5.1         3.5          1.4  setosa
2         0.2          4.9         3.0          1.4  setosa
3         0.2          4.7         3.2          1.3  setosa
Update:
After a quick response from hadley on GitHub, I found out that there is special behaviour when everything() is combined with a negative variable in the first position of select(): select() then starts off with all the variables, and everything() draws the removed ones back out again. A negative variable in a non-first position does not work as one might expect.
I agree that the negative variable in first position and the everything() select helper function need to be better explained in the documentation.
Update 2: the documentation for ?select has now been updated to state "Positive values select variables; negative values to drop variables. If the first expression is negative, select() will automatically start with all variables."
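For what it's worth, dplyr 1.0 later added relocate() specifically for reordering columns, which sidesteps these negative/everything() subtleties entirely (a small sketch, not part of the original discussion):

```r
library(dplyr)

# relocate() moves the named columns; by default they go to the front,
# so this is equivalent to select(iris, Species, everything())
iris %>%
  relocate(Species) %>%
  names()
```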


R: How to subset with filtering multiple (date-)variables at once

I have a dataset with multiple date-variables and want to create subsets, where I can filter out certain rows by defining the wanted date of the date-variables.
To be more precise: Each row in the dataset represents a patient case in a psychiatric hospital and contains all the applied seclusions. So for each case there is either no seclusion, or they are documented as seclusion_date1, seclusion_date2..., seclusion_enddate1, seclusion_enddate2... (depending on how many seclusions occurred).
My plan is to create a subset with only those cases, where there is either no seclusion documented or the seclusion_date1 (first seclusion) is after 2019-06-30 and all the possible seclusion_enddates (1, 2, 3....) are before 2020-05-01. Cases with seclusions happening before 2019-06-30 and after 2020-05-01 would be excluded.
I'm very new in the R language so my tries are possibly very wrong. I appreciate any help or ideas.
I tried it with the subset function in R.
To filter all possible seclusion_enddates at once, I tried to use starts_with and I tried writing a loop.
all_seclusion_enddates <- function() { c(WMdata, any_of(c("seclusion_enddate")), starts_with("seclusion_enddate")) }
Error: `any_of()` must be used within a selecting function.
and then my plan would have been: cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & all_seclusion_enddates <= "2020-04-30")
loop:
for(i in 1:53) {
  cohort_2_before <- subset(WMdata,
    seclusion_date1 >= "2019-07-01" &
    ((paste0("seclusion_enddate", i))) <= "2020-04-30" &
    restraint_date1 >= "2019-07-01" &
    ((paste0("seclusion_enddate", i))) <= "2020-04-30")
}
Result: A subset with 0 obs. was created.
Since you don't provide a reproducible example, I can't see your specific problem, but I can help with the core issue.
any_of, starts_with and the like are functions used by the tidyverse set of packages to select columns within their functions. They can only be used within tidyverse selector functions to control their behavior, which is why you got that error. They probably are the tools I'd use to solve this problem, though, so here's how you can use them:
Starting with the default dataset iris, we use the filter_at function from dplyr (enter ?filter_at in the R console to read the help). This function filters (selects specific rows) from a data.frame (given to the .tbl argument) based on a criteria (given to .vars_predicate argument) which is applied to specific columns based on selectors given to the .vars argument.
library(dplyr)
iris %>%
  filter_at(vars(starts_with('Sepal')), all_vars(. > 4))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.7         4.4          1.5         0.4  setosa
2          5.2         4.1          1.5         0.1  setosa
3          5.5         4.2          1.4         0.2  setosa
In this example, we take the dataframe iris, pass it into filter_at with the %>% pipe command, then tell it to look only in columns which start with 'Sepal', then tell it to select rows where all the selected columns match the given condition: value > 4. If we wanted rows where any column matched the condition, we could use any_vars(.>4).
You can add multiple conditions by piping it into other filter functions:
iris %>%
  filter_at(vars(starts_with('Sepal')), all_vars(. > 4)) %>%
  filter(Petal.Width > 0.3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.7         4.4          1.5         0.4  setosa
Here we filter the previous result again to keep rows that also have Petal.Width > 0.3.
In your case, you'd want to make sure your date values are formatted as dates (with as.Date), then filter on seclusion_date1 and vars(starts_with('seclusion_enddate')).
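To make that concrete, here is a sketch with a small made-up data frame standing in for WMdata (the column names are assumptions based on the question; note also that filter_at() has since been superseded by if_all() in current dplyr):

```r
library(dplyr)

# Made-up stand-in for WMdata: one case inside the window, one outside
df <- data.frame(
  seclusion_date1    = as.Date(c("2019-08-01", "2019-05-01")),
  seclusion_enddate1 = as.Date(c("2019-09-01", "2019-06-01")),
  seclusion_enddate2 = as.Date(c("2020-01-01", "2020-06-01"))
)

# Keep cases whose first seclusion starts after the cutoff and whose
# end dates (however many exist) all fall before the end of the window
df %>%
  filter(seclusion_date1 >= as.Date("2019-07-01")) %>%
  filter_at(vars(starts_with("seclusion_enddate")),
            all_vars(is.na(.) | . <= as.Date("2020-04-30")))
```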

Dplyr select(contains()) with dynamic variable

I have a DF, which is updated every quarter with new data, i.e. it gets wider every couple of months. In a next step, I would like to calculate the difference of values between the first and the latest observation. Using contains() I would like to select the first and last observation dynamically, avoiding writing the names of the variables again after every update.
In other cases, I have been using !!sym() for mutate() like in the following example, which is working fine:
df <- df %>%
  mutate(var1 = ifelse(!!sym(first_year) == 0, 1, 0))
But when I try, the following, I get an error (first_year equals 2008 in this case):
df <- df %>%
  select(contains(!!sym(first_year)))
Error: object '2008' not found
Is there a way to use dynamic variables in select(contains()) or select(matches()) - or is this not possible?
Thanks for any help!
The documentation of ?contains states that the first argument should be a character vector:
match A character vector. If length > 1, the union of the matches is taken.
Therefore, you don't have to use any tidy evaluation function such as sym():
library(dplyr)
x="Spe"
iris %>% select(contains(x)) %>% head()
#> Species
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> 4 setosa
#> 5 setosa
#> 6 setosa
Created on 2021-03-15 by the reprex package (v1.0.0)
However, we have very little information about what you are working with (what do first_year and df look like?); this answer might be incorrect because of that.
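If first_year is numeric (as the "object '2008' not found" error suggests), converting it to a string is likely all that's needed; a sketch with a made-up data frame, since the real df isn't shown:

```r
library(dplyr)

first_year <- 2008
df <- data.frame(`2008_val` = 1:3, `2009_val` = 4:6, check.names = FALSE)

# contains() wants a character vector, so convert instead of using sym():
df %>% select(contains(as.character(first_year)))
```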

Using distinct() with a vector of column names

I have a question about using distinct() from dplyr on a tibble/data.frame. From the documentation it is clear that you can use it by explicitly naming the column names. I have a data frame with >100 columns and want to use the function on just a subset. My intuition said to put the column names in a vector and use it as an argument for distinct, but distinct uses only the first vector element.
Example on iris:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct_(iris, exclude.columns)
This is different from
exclude.columns <- c('Sepal.Width', 'Species')
distinct_(iris, exclude.columns)
I think distinct is not made for this operation. Another option would be to subset the data.frame, then use distinct and join again with the excluded columns. But my question is whether there is another option using just one function?
As suggested in my comment, you could also try:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct(iris, !!! syms(exclude.columns))
Output (first 10 rows):
   Sepal.Width Species
1          3.5  setosa
2          3.0  setosa
3          3.2  setosa
4          3.1  setosa
5          3.6  setosa
6          3.9  setosa
7          3.4  setosa
8          2.9  setosa
9          3.7  setosa
10         4.0  setosa
However, that was suggested more than 2 years ago. A more idiomatic approach with the latest dplyr functionality would be:
distinct(iris, across(all_of(exclude.columns)))
It is not entirely clear to me whether you would like to keep only the exclude.columns or actually exclude them; if the latter then you just put minus in front i.e. distinct(iris, across(-all_of(exclude.columns))).
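If the goal is instead to de-duplicate on those columns while keeping every column in the result, distinct() has a .keep_all argument (a small sketch):

```r
library(dplyr)

exclude.columns <- c('Species', 'Sepal.Width')

# Duplicates are judged only on the two named columns,
# but all 5 columns of iris are retained in the output:
distinct(iris, across(all_of(exclude.columns)), .keep_all = TRUE)
```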
Your objective sounds unclear. Are you trying to get all distinct rows across all columns except $Species and $Sepal.Width? If so, that doesn't make sense.
Let's say two rows are the same in all other variables except for $Sepal.Width. Using distinct() in the way you described would throw out the second row because it was not distinct from the first. Except that it was in the column you ignored.
You need to rethink your objective and whether it makes sense.
If you are just worried about duplicate rows, then
data %>%
  distinct(across(everything()))
will do the trick.

External functions inside filter of dplyr in R

How does an external function inside dplyr::filter know the columns just by their names without the use of the data.frame from which it is coming?
For example consider the following code:
filter(hflights, Cancelled == 1, !is.na(DepDelay))
How does is.na know that DepDelay is from hflights? There could possibly have been a DepDelay vector defined elsewhere in my code. (Assuming that hflights has columns named 'Cancelled', 'DepDelay').
In Python, we would have to use the column name along with the name of the data frame, so here I was expecting something like
!is.na(hflights$DepDelay)
Any help would be really appreciated.
While I'm not an expert enough to give a precise answer, hopefully I won't lead you too far astray.
It is essentially a question of environment. filter() first looks for any vector object within the data frame environment named in its first argument. If it doesn't find it, it will then go "up a level", so to speak, to the global environment and look for any other vector object of that name. Consider:
library(dplyr)
Species <- iris$Species
iris2 <- select(iris, -Species) # Remove the Species variable from the data frame.
filter(iris2, Species == "setosa")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          5.1         3.5          1.4         0.2
#> 2          4.9         3.0          1.4         0.2
#> 3          4.7         3.2          1.3         0.2
#> 4          4.6         3.1          1.5         0.2
#> 5          5.0         3.6          1.4         0.2
More information on the topic can be found here (warning, the book is a work in progress).
Most functions from the dplyr and tidyr packages are specifically designed to handle data frames, and all of those functions take the data frame as their first argument. This allows for usage of the pipe (%>%), which lets you build a more intuitive workflow. Think of the pipe as the equivalent of saying "... and then ...". In the context shown above, you could do:
iris %>%
  select(-Species) %>%
  filter(Species == "setosa")
And you get the same output as above. Combining the concept of the pipe and focusing the lexical scope of variables to the referenced data frames is meant to lead to more readable code for humans, which is one of the principles of the tidyverse set of packages, which both dplyr and tidyr are components of.
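When ambiguity is a real concern, dplyr (via rlang) also provides the .data and .env pronouns to state explicitly where a name should be looked up; a brief sketch:

```r
library(dplyr)

# A global object that shadows a column name
Species <- "a global object with the same name"

# .data$Species always refers to the column inside the data frame;
# .env$Species would refer to the global object instead.
nrow(filter(iris, .data$Species == "setosa"))
```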

Best way to subset RNAseq dataset for comparison in R

I have a single-cell RNAseq dataset that I have been using R to analyze. So I have a data frame with 205 columns and 15000 rows. Each column is a cell and each row is a gene.
I have an annotation matrix that has the identity of each cell. For example, patient ID, disease status, etc...
I want to do different comparisons based on the grouping info provided by the annotation matrix.
I know that in python, you can create a dictionary that is attached to the cell IDs.
What is an efficient way in R to perform subsetting of the same dataset in different ways?
So far what I have been doing is:
EC_index <-subset(annotation_index_LN, conditions == "EC_LN")
CP_index <-subset(annotation_index_LN, conditions =="CP_LN")
CD69pos <-subset(annotation_index_LN, CD69 == 100)
EC_CD69pos <- subset(EC_index, CD69 == 100)
EC_CD69pos <- subset(EC_CD69pos, id %in% colnames(manual_normalized))
CP_CD69pos <- subset(CP_index, CD69 == 100)
CP_CD69pos <- subset(CP_CD69pos, id %in% colnames(manual_normalized))
This probably won't entirely answer your question, but I think that even before you begin trying to subset your data etc. you might want to think about converting this into a SummarizedExperiment. This is a type of object that can hold annotation data for features and samples and will keep everything properly referenced if you decide to subset samples, remove rows, etc. This type of object is commonly implemented by packages hosted on Bioconductor. They have loads of tutorials on various genomics pipelines, and I'm sure you can find more detailed information there.
http://bioconductor.org/help/course-materials/
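For plain data-frame/matrix subsetting along the lines the question shows, one common base-R pattern is to derive the wanted cell IDs from the annotation and index the expression matrix with them (all names below are made up for illustration):

```r
# Made-up annotation table and expression matrix (genes x cells)
annotation <- data.frame(
  id         = c("cell1", "cell2", "cell3"),
  conditions = c("EC_LN", "CP_LN", "EC_LN"),
  CD69       = c(100, 0, 100)
)
expr <- matrix(1:12, nrow = 4,
               dimnames = list(paste0("gene", 1:4), annotation$id))

# IDs of cells matching the grouping criteria, then the matching columns
keep <- annotation$id[annotation$conditions == "EC_LN" & annotation$CD69 == 100]
expr[, keep, drop = FALSE]
```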
Following is from the iris data in R since you haven't given a minimal example of your data.
For that you need an R package that provides %>%: the magrittr package, which is also re-exported by dplyr. The str_match() call below additionally requires stringr.
If you have to do a lot of subsetting, put the following in a function where you pass the arguments to subset.
library(dplyr)    # for %>%
library(stringr)  # for str_match()
iris %>%
  subset(Species == "setosa" & Petal.Width == 0.2 & Petal.Length == 1.4) %>%
  subset(select = !is.na(str_match(colnames(iris), "Len")))
# Sepal.Length Petal.Length
# 1 5.1 1.4
# 2 4.9 1.4
# 5 5.0 1.4
# 9 4.4 1.4
# 29 5.2 1.4
# 34 5.5 1.4
# 48 4.6 1.4
# 50 5.0 1.4
