Dplyr select(contains()) with dynamic variable - r

I have a DF, which is updated every quarter with new data, i.e. it gets wider every couple of months. In a next step, I would like to calculate the difference of values between the first and the latest observation. Using contains() I would like to select the first and last observation dynamically, avoiding writing the names of the variables again after every update.
In other cases, I have been using !!sym() for mutate() like in the following example, which is working fine:
df <- df %>%
mutate(var1 = ifelse(!!sym(first_year) == 0, 1, 0))
But when I try the following, I get an error (first_year equals 2008 in this case):
df <- df %>%
select(contains(!!sym(first_year)))
Error: object '2008' not found
Is there a way to use dynamic variables in select(contains()) or select(matches()), or is this not possible?
Thanks for any help!

The documentation of ?contains states that the first argument should be a character vector:
match: A character vector. If length > 1, the union of the matches is taken.
Therefore, you don't have to use any tidy evaluation function such as sym():
library(dplyr)
x="Spe"
iris %>% select(contains(x)) %>% head()
#> Species
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> 4 setosa
#> 5 setosa
#> 6 setosa
Created on 2021-03-15 by the reprex package (v1.0.0)
However, we have very little information about what you are working with (what do first_year and df look like?); this answer might be incorrect because of that.
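Applying that to the question: since contains() just needs a character string, a dynamic first_year can be passed to it directly. A minimal sketch (the column names below are made up, as the asker's df is not shown):
library(dplyr)
# toy stand-in for the asker's df, with made-up column names
df <- tibble(id = 1:3, gdp_2008 = c(0, 1, 2), gdp_2023 = c(3, 4, 5))
first_year <- "2008"                               # a plain string, not a symbol
df %>% select(contains(first_year))                # selects gdp_2008
# if first_year is stored as a number (e.g. 2008), coerce it first:
df %>% select(contains(as.character(first_year)))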

Related

R: How to subset with filtering multiple (date-)variables at once

I have a dataset with multiple date-variables and want to create subsets, where I can filter out certain rows by defining the wanted date of the date-variables.
To be more precise: each row in the dataset represents a patient case in a psychiatric hospital and contains all the applied seclusions. So for each case there is either no seclusion, or the seclusions are documented as seclusion_date1, seclusion_date2, ..., seclusion_enddate1, seclusion_enddate2, ... (depending on how many seclusions occurred).
My plan is to create a subset with only those cases, where there is either no seclusion documented or the seclusion_date1 (first seclusion) is after 2019-06-30 and all the possible seclusion_enddates (1, 2, 3....) are before 2020-05-01. Cases with seclusions happening before 2019-06-30 and after 2020-05-01 would be excluded.
I'm very new to R, so my attempts are possibly very wrong. I appreciate any help or ideas.
I tried it with the subset function in R.
To filter all possible seclusion_enddates at once, I tried to use starts_with and I tried writing a loop.
all_seclusion_enddates <- function() { c(WMdata, any_of(c("seclusion_enddate")), starts_with("seclusion_enddate")) }
Error: `any_of()` must be used within a selecting function.
and then my plan would have been: cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & all_seclusion_enddates <= "2020-04-30")
loop:
for(i in 1:53) { cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & ((paste0("seclusion_enddate", i))) <= "2020-04-30" & restraint_date1 >= "2019-07-01" & ((paste0('seclusion_enddate', i))) <= "2020-04-30") }
Result: A subset with 0 obs. was created.
Since you don't provide a reproducible example, I can't see your specific problem, but I can help with the core issue.
any_of, starts_with and the like are functions used by the tidyverse set of packages to select columns within their functions. They can only be used within tidyverse selector functions to control their behavior, which is why you got that error. They probably are the tools I'd use to solve this problem, though, so here's how you can use them:
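As a minimal illustration of that point, the same helper works inside select() but errors when called on its own:
library(dplyr)
iris %>% select(starts_with("Sepal")) %>% head(2)   # fine: the helper is used inside a selecting function
# starts_with("Sepal")                              # errors: must be used within a selecting function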
Starting with the built-in dataset iris, we use the filter_at function from dplyr (enter ?filter_at in the R console to read the help). This function filters (selects specific rows of) a data.frame (given to the .tbl argument) based on a criterion (given to the .vars_predicate argument), which is applied to specific columns chosen by the selectors given to the .vars argument.
library(dplyr)
iris %>%
filter_at(vars(starts_with('Sepal')), all_vars(.>4))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
2 5.2 4.1 1.5 0.1 setosa
3 5.5 4.2 1.4 0.2 setosa
In this example, we take the dataframe iris, pass it into filter_at with the %>% pipe command, then tell it to look only in columns which start with 'Sepal', then tell it to select rows where all the selected columns match the given condition: value > 4. If we wanted rows where any column matched the condition, we could use any_vars(.>4).
You can add multiple conditions by piping it into other filter functions:
iris %>%
filter_at(vars(starts_with('Sepal')), all_vars(.>4)) %>%
filter(Petal.Width > 0.3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
Here we filter the previous result again to get rows that also have Petal.Width > 0.3.
In your case, you'd want to make sure your date values are formatted as Date (with as.Date), then filter on seclusion_date1 and on vars(starts_with('seclusion_enddate')).
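A rough sketch along those lines, using the names from the question (WMdata, seclusion_date1, seclusion_enddate...); the column classes and the handling of cases without any documented seclusion are assumptions:
library(dplyr)
cohort_2_before <- WMdata %>%
# keep cases with no first seclusion, or a first seclusion on/after 2019-07-01
filter(is.na(seclusion_date1) | seclusion_date1 >= as.Date("2019-07-01")) %>%
# and require every documented seclusion end date to be on/before 2020-04-30
filter_at(vars(starts_with("seclusion_enddate")),
all_vars(is.na(.) | . <= as.Date("2020-04-30")))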

Creating multiple variables from single call to piping?

I am new to tidyverse and want to use pipes to create two new variables, one representing the sum of the petal lengths by Species, and one representing the number of instances of each Species, and then to represent that in a new list alongside the Species names.
The following code does the job, but it relies on intermediate variables:
library(dplyr)
petal_lengths <- iris %>% group_by(Species) %>% summarise(total_petal_length = sum(Petal.Length))
totals_per_species <- iris %>% count(Species, name="Total")
combined_data <- modifyList(petal_lengths,totals_per_species)
My questions are:
Is it possible to do this without creating those two intermediate variables, petal_lengths and totals_per_species, i.e. through a single pipeline rather than two?
If so, is doing this desirable, either abstractly or according to standard conceptions of good tidyverse coding style?
I read here that
The pipe can only transport one object at a time, meaning it’s not so
suited to functions that need multiple inputs or produce multiple
outputs.
which makes me think maybe the answer to my first question is "No", but I'm not sure.
You could achieve your desired result in one pipeline like so:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(total_petal_length = sum(Petal.Length), Total = n())
#> # A tibble: 3 × 3
#> Species total_petal_length Total
#> <fct> <dbl> <int>
#> 1 setosa 73.1 50
#> 2 versicolor 213 50
#> 3 virginica 278. 50
I think Stefan's answer is the correct one for this particular example, and in general you can get the pipe to work with most data manipulation tasks without writing intermediate variables. However, there is perhaps a broader question here.
There are some situations in which the writing of intermediate variables is necessary, and other situations where you have to write more complicated code in the pipe to avoid creating intermediate variables.
I have used a little helper function in some situations to avoid this, which writes a new variable as a side effect. This variable can be re-used within the same pipeline:
branch <- function(.data, newvar, value) {
# capture the unquoted name supplied as `newvar` as a character string
newvar <- as.character(as.list(match.call())$newvar)
# as a side effect, assign `value` to that name in the calling environment of the pipeline
assign(newvar, value, parent.frame(2))
# return the data unchanged so the pipeline continues as normal
return(.data)
}
You would use it in the pipeline like this:
iris %>%
branch(totals_per_species, count(., Species, name = "Total")) %>%
group_by(Species) %>%
summarise(total_petal_length = sum(Petal.Length)) %>%
modifyList(totals_per_species)
#> # A tibble: 3 x 3
#> Species total_petal_length Total
#> <fct> <dbl> <int>
#> 1 setosa 73.1 50
#> 2 versicolor 213 50
#> 3 virginica 278. 50
This function works quite well in interactive sessions, but there are probably scoping problems when used in more complex settings. It's certainly not standard coding practice, though I have often wondered whether a more robust version might be a useful addition to the tidyverse.

how to use a loop | apply | map to slice a data-frame for multiple variable values and create multiple statistics summary() in r

I am trying to get multiple summary() outputs from a data frame. I want to subset it according to some characteristics multiple times, then get the summary() of a certain variable for each slice and put all the summary() outputs together in either a data frame or a list.
Ideally I would like to use the name of each building_id I slice the data by as a name for that row of the summary(), so I thought of using a for loop.
The data are fairly large (about 20 million rows); I am using the train and building_metadata data frames, joined into one, from the ASHRAE energy prediction competition on Kaggle.
I have created a tibble which holds the building ids I want to subset by. I want to get the summary() of the variable "energy_sqm" (which I have already created), so I am trying to put this slicing in a for loop:
Warning 1: my building_id tibble has values like 50, 67, 778, 1099, etc. One of the problems I have is with using these numbers for indexing or for naming my summary outputs; I think R tries to use them as row numbers (50, 67, etc.) in the several different attempts I made.
summaries_output <- tibble() # or list()
for (id in building_id){
temp_stats <- joined %>%
filter(building_id == "id") %>%
pull(energy_sqm) %>%
summary() %>%
broom::tidy()
summaries_output <- bind_rows(summaries_output, temp_stats, .id = "id")
}
My problems:
a) Whatever I use to initialize summaries_output, I can't get it to retain anything inside the loop, so I am guessing I am messing up the loop as well.
b) Ideally I would like to have the building_id as an identifier for each summary() row.
c) Could someone suggest the good-practice approach for this kind of loop, in terms of collecting results into a list, a tibble, or something else?
Details: the class() of a summary() object is "summaryDefault" "table", which I don't know anything about.
Thanks for the help.
We can also use the tidyverse. After grouping by 'Species', tidy the summary() output of 'Sepal.Length'. The tidy output is a tibble/data.frame; in dplyr 1.0.0 we could return it from summarise() without wrapping it in a list, but the result would then be a data-frame column named out, whose inner columns have to be accessed as out$minimum, out$q1, and so on. To avoid that, we wrap the output in a list and then unnest the created column.
library(dplyr)
library(broom)
library(tidyr)
iris %>%
group_by(Species) %>%
summarise(out = list(tidy(summary(Sepal.Length)))) %>%
unnest(c(out))
# A tibble: 3 x 7
# Species minimum q1 median mean q3 maximum
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 setosa 4.3 4.8 5 5.01 5.2 5.8
#2 versicolor 4.9 5.6 5.9 5.94 6.3 7
#3 virginica 4.9 6.22 6.5 6.59 6.9 7.9
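Translated to the asker's setup, the same pattern might look like the following sketch. Here joined, building_id, and energy_sqm come from the question, while ids_of_interest is a hypothetical plain vector of the wanted building ids (the question stores them in a tibble, so you may need to pull() that column out first):
library(dplyr)
library(broom)
library(tidyr)
joined %>%
filter(building_id %in% ids_of_interest) %>%            # keep only the buildings of interest
group_by(building_id) %>%
summarise(out = list(tidy(summary(energy_sqm)))) %>%    # one tidied summary per building
unnest(out)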
This appears to be summarizing by group. Here's a way to do it with data.table, although I am unsure of your exact expected output:
library(broom)
library(data.table)
dt_iris = as.data.table(iris)
dt_iris[, tidy(summary(Sepal.Length)), by = Species]
#> Species minimum q1 median mean q3 maximum
#> 1: setosa 4.3 4.800 5.0 5.006 5.2 5.8
#> 2: versicolor 4.9 5.600 5.9 5.936 6.3 7.0
#> 3: virginica 4.9 6.225 6.5 6.588 6.9 7.9
Created on 2020-07-11 by the reprex package (v0.3.0)

Using distinct() with a vector of column names

I have a question about using distinct() from dplyr on a tibble/data.frame. From the documentation it is clear that you can use it by naming the columns explicitly. I have a data frame with >100 columns and want to use the function on just a subset of them. My intuition was to put the column names in a vector and use it as an argument for distinct(). But distinct() uses only the first vector element.
Example on iris:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct_(iris, exclude.columns)
This is different from
exclude.columns <- c('Sepal.Width', 'Species')
distinct_(iris, exclude.columns)
I think distinct() is not made for this operation. Another option would be to subset the data.frame, then use distinct() and join again with the excluded columns. But my question is whether there is another option using just one function?
As suggested in my comment, you could also try:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct(iris, !!! syms(exclude.columns))
Output (first 10 rows):
Sepal.Width Species
1 3.5 setosa
2 3.0 setosa
3 3.2 setosa
4 3.1 setosa
5 3.6 setosa
6 3.9 setosa
7 3.4 setosa
8 2.9 setosa
9 3.7 setosa
10 4.0 setosa
However, that was suggested more than 2 years ago. A more idiomatic way with current dplyr would be:
distinct(iris, across(all_of(exclude.columns)))
It is not entirely clear to me whether you would like to keep only the exclude.columns or actually exclude them; if the latter, just put a minus in front, i.e. distinct(iris, across(-all_of(exclude.columns))).
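As a side note beyond the original answer: if the goal is to deduplicate on just those columns while keeping every column in the result, distinct() also has a .keep_all argument:
# keeps the first row of each Sepal.Width/Species combination, retaining all columns of iris
distinct(iris, across(all_of(exclude.columns)), .keep_all = TRUE)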
Your objective sounds unclear. Are you trying to get all distinct rows across all columns except $Species and $Sepal.Width? If so, that doesn't make sense.
Let's say two rows are the same in all other variables except for $Sepal.Width. Using distinct() in the way you described would throw out the second row because it was not distinct from the first. Except that it was in the column you ignored.
You need to rethink your objective and whether it makes sense.
If you are just worried about duplicate rows, then
data %>%
distinct(across(everything()))
will do the trick.

External functions inside filter of dplyr in R

How does an external function inside dplyr::filter know the columns just by their names without the use of the data.frame from which it is coming?
For example consider the following code:
filter(hflights, Cancelled == 1, !is.na(DepDelay))
How does is.na know that DepDelay is from hflights? There could possibly have been a DepDelay vector defined elsewhere in my code. (Assuming that hflights has columns named 'Cancelled', 'DepDelay').
In Python we would have to use the column name along with the name of the data frame. Therefore here I was expecting something like:
!is.na(hflights$DepDelay)
Any help would be really appreciated.
While I'm not an expert enough to give a precise answer, hopefully I won't lead you too far astray.
It is essentially a question of environments. filter() first looks for the name among the columns of the data frame given as its first argument. If it doesn't find it there, it goes "up a level", so to speak, to the calling (e.g. global) environment and looks for an object of that name there. Consider:
library(dplyr)
Species <- iris$Species
iris2 <- select(iris, -Species) # Remove the Species variable from the data frame.
filter(iris2, Species == "setosa")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
More information on the topic can be found here (warning, the book is a work in progress).
Most functions from the dplyr and tidyr packages are specifically designed to handle data frames, and all of those functions take the data frame as their first argument. This allows use of the pipe (%>%), which lets you build a more intuitive workflow. Think of the pipe as the equivalent of saying "... and then ...". In the context shown above, you could do:
iris %>%
select(-Species) %>%
filter(Species == "setosa")
And you get the same output as above. Combining the pipe with variable lookup that is scoped to the referenced data frame first is meant to lead to more readable code for humans, which is one of the principles of the tidyverse set of packages, of which both dplyr and tidyr are components.
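As an aside beyond the original answer: when the same name exists both as a column and as an object outside the data frame, dplyr's .data and .env pronouns (from rlang) let you state explicitly which one you mean:
library(dplyr)
Species <- "versicolor"                       # an object in the global environment
iris %>%
filter(.data$Species == .env$Species) %>%     # column on the left, external object on the right
head(2)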
