How to refer to variable instead of column with dplyr - r

When using dplyr::filter, I often compute a local variable that holds the viable choices:
df <- as_tibble(data.frame(id=c("a","b"), val=1:6))
ids <- c("b","c")
filter(df, id %in% ids)
# giving id %in% c("b","c")
However, if the dataset by chance has a column with the same name, this fails to achieve the intended purpose:
df$ids <- "a"
filter(df, id %in% ids)
# giving id %in% "a"
How should I explicitly refer to the ids variable instead of the ids column?

Unquote with !! to tell filter to look in the calling environment instead of the data frame:
library(tidyverse)
df <- tibble(id = rep(c("a","b"), 3), val = 1:6) # tibble() replaces the deprecated data_frame()
ids <- c("b", "c")
df %>% filter(id %in% ids)
#> # A tibble: 3 x 2
#> id val
#> <chr> <int>
#> 1 b 2
#> 2 b 4
#> 3 b 6
df <- df %>% mutate(ids = "a")
df %>% filter(id %in% ids)
#> # A tibble: 3 x 3
#> id val ids
#> <chr> <int> <chr>
#> 1 a 1 a
#> 2 a 3 a
#> 3 a 5 a
df %>% filter(id %in% !!ids)
#> # A tibble: 3 x 3
#> id val ids
#> <chr> <int> <chr>
#> 1 b 2 a
#> 2 b 4 a
#> 3 b 6 a
Of course, the better way to avoid such issues is to not put identically-named vectors in your global environment.
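As an alternative to unquoting, dplyr's data masking also provides the .data and .env pronouns, which make the intent explicit on both sides of the comparison. The following is a minimal sketch reusing the example data from above; it assumes a reasonably recent dplyr (1.0 or later).

```r
library(dplyr)

# Same data as above, including the clashing `ids` column
df <- tibble(id = rep(c("a", "b"), 3), val = 1:6, ids = "a")
ids <- c("b", "c")

# .data$id  -> always the column in the data frame
# .env$ids  -> always the variable in the calling environment
res <- df %>% filter(.data$id %in% .env$ids)
res
```

Unlike !!, which inlines the value before evaluation, .env$ids is looked up when the expression is evaluated, but for a simple filter like this one the result is the same.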

Related

Accessing variable name in for loop in R?

I am trying to run a for loop in which I randomly subsample a dataset using sample_n. I also want to name each new subsampled data frame "df1", "df2", "df3", where the number corresponds to i in the for loop. I know the way I wrote this code is wrong, and I understand why I am getting the error. How can I combine "df" and i in the for loop so that it reads as df1, df2, etc.? Happy to clarify if needed. Thanks!
for (i in 1:9){ print(get(paste("df", i, sep=""))) = sub %>%
group_by(dietAandB) %>%
sample_n(1) }
Error in print(get(paste("df", i, sep = ""))) = sub %>% group_by(dietAandB) %>% :
target of assignment expands to non-language object
Instead of using get you could use assign.
Using some fake example data:
library(dplyr, warn.conflicts = FALSE)
sub <- data.frame(
dietAandB = LETTERS[1:2]
)
for (i in 1:2) {
assign(paste0("df", i), sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
}
df1
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
df2
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
But the more R-ish way to do this would be to use a list instead of creating single objects:
df <- list()
for (i in 1:2) {
  df[[i]] <- sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup()
}
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
Or, more concisely, use lapply instead of a for loop:
df <- lapply(1:2, function(x) sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
It depends on the sample size, which is missing from your question. As an example, I used the mtcars dataset (32 rows) and drew three subsamples of size 20 from it:
library(dplyr)
for (i in 1:3) {
assign(paste0("df", i), sample_n(mtcars, 20))
}
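If the df1, df2, ... names matter but separate objects do not, the list approach and the naming can be combined. The sketch below is one possible way to do that, using mtcars and a sample size of 20 as in the example above; n_samples and samples are names chosen here for illustration.

```r
library(dplyr)

n_samples <- 3

# One subsample per list element instead of one object per subsample
samples <- lapply(seq_len(n_samples), function(i) sample_n(mtcars, 20))

# Name the elements df1, df2, df3 so they can be accessed as samples$df1 etc.
names(samples) <- paste0("df", seq_len(n_samples))

names(samples)
nrow(samples$df1)
```

Keeping the subsamples in one named list means later steps can iterate over them with lapply rather than calling get() on constructed names.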

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1),
y = c(3,4)
)
I want to find all groups in which every element of x (and only x) is NA, and then remove those groups. "B" is kept because it has at least one non-NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but that seems to remove a group as soon as it contains at least one NA. I need the right condition, which is evidently not this use of all.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both of its values in column x are NA.
We can use any with complete.cases:
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup()
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In dplyr 1.1.0 and later (the devel version at the time of writing), we can use .by in filter, so we don't need group_by/ungroup:
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
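The difference between the question's condition and the answers' conditions is easiest to see on the two groups' x values in isolation. This is a small base R sketch; x_a and x_b are just the x values of groups "A" and "B" from the example.

```r
x_a <- c(NA, NA)  # group "A": every value is missing
x_b <- c(NA, 1)   # group "B": one value is present

# Condition from the question: TRUE only if the group has no NA at all
q_a <- all(!is.na(x_a))
q_b <- all(!is.na(x_b))

# Condition from the first answer: TRUE unless every value is NA
ans_a <- !all(is.na(x_a))
ans_b <- !all(is.na(x_b))

# Equivalent formulation: TRUE if at least one value is not NA
any_a <- any(!is.na(x_a))
any_b <- any(!is.na(x_b))

c(question_A = q_a, question_B = q_b, answer_A = ans_a, answer_B = ans_b)
```

The question's condition is FALSE for both groups, so both are dropped, while the answers' condition is FALSE for "A" and TRUE for "B", which matches the desired output.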

Iterating name of a field with dplyr::summarise function

This is my first time posting here, so I'll try to explain my problem as clearly as possible.
I'm working with erosion data for farms, stored as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 data frames in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but my problem is that the field doesn't have the same name in every data frame (it is either ERO13 or ERO17). I don't want to rely on the position of the field because it could change between data frames; I only want to use the name, which varies.
Here's an example:
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
cur_df <- df
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
cur_df <- df
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(cur_camp = mean(cur_camp))
}
but this doesn't work because cur_camp is a character string rather than a reference to the column it names, and it still relies on the column's position.
How can I build the current_name_of_erosion_field here?
We can convert the string to a symbol and unquote it (!!), or pass the string to across. Because we are using a for loop, make sure to create a list to store the output. To name the output column from a string held in an object, use := with !! on the left-hand side:
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(!!cur_camp := mean(!! sym(cur_camp)))
out[[i]] <- cur_df
}
Output:
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Or we can use across:
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(across(all_of(cur_camp), mean))
out[[i]] <- cur_df
}
Output:
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
pivot_longer(starts_with('ERO'),
names_to = 'ERO',
values_drop_na = TRUE) %>%
group_by(ID, ERO) %>%
summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)
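Since the question asks to avoid relying on column position, the column name can also be looked up by pattern and then used through the .data pronoun. The following is a hedged sketch rather than a drop-in solution: it assumes each data frame has exactly one column whose name starts with "ERO", and it uses the glue-style "{name}" := syntax available in dplyr 1.0 and later.

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)

out <- lapply(lst_df, function(cur_df) {
  # Assumption: exactly one column name starts with "ERO"
  cur_camp <- grep("^ERO", names(cur_df), value = TRUE)
  cur_df %>%
    group_by(ID) %>%
    # "{cur_camp}" names the result; .data[[cur_camp]] reads the column by name
    summarise("{cur_camp}" := mean(.data[[cur_camp]]))
})
out
```

Using lapply returns the list of summaries directly, so no output list has to be pre-allocated.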

Spread data with non-unique keys with R

I have the following data frame:
ID  Group
1   A
1   B
2   C
2   D
And I want to reshape the data frame into a wider version in terms of ID. Thus, the new data frame looks like this:
ID  Group1  Group2
1   A       B
2   C       D
You can do this by adding a helper column and then using tidyr::pivot_wider():
library(dplyr)
library(tidyr)
data <- tibble(
id = c(1, 1, 2, 2),
group = letters[1:4]
)
# Add a helper column to use when pivoting. This uses the row number
# over each subgroup, i.e. over each value of `id`
transformed_data <- data %>%
group_by(id) %>%
mutate(helper = paste0("Group", row_number())) %>%
ungroup()
# Here's what the helper column looks like
transformed_data
#> # A tibble: 4 x 3
#> id group helper
#> <dbl> <chr> <chr>
#> 1 1 a Group1
#> 2 1 b Group2
#> 3 2 c Group1
#> 4 2 d Group2
# Pivot the data using the helper column
transformed_data %>%
pivot_wider(names_from = helper, values_from = group)
#> # A tibble: 2 x 3
#> id Group1 Group2
#> <dbl> <chr> <chr>
#> 1 1 a b
#> 2 2 c d
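The same idea can be written slightly more compactly by keeping the helper as a plain row number and letting pivot_wider add the "Group" prefix through its names_prefix argument. This sketch uses the column names and values from the question's table; the helper column name n is chosen here for illustration.

```r
library(dplyr)
library(tidyr)

data <- tibble(ID = c(1, 1, 2, 2), Group = c("A", "B", "C", "D"))

wide <- data %>%
  group_by(ID) %>%
  mutate(n = row_number()) %>%   # position of each row within its ID
  ungroup() %>%
  pivot_wider(names_from = n, values_from = Group, names_prefix = "Group")
wide
```

The helper is what makes each ID/column combination unique, which pivot_wider needs in order to place exactly one value per cell.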

tidyr - unique way to get combinations (using tidyverse only)

I wanted to get all unique pairwise combinations of a unique string column of a dataframe using the tidyverse (ideally).
Here is a dummy example:
library(tidyverse)
a <- letters[1:3] %>%
tibble::as_tibble()
a
#> # A tibble: 3 x 1
#> value
#> <chr>
#> 1 a
#> 2 b
#> 3 c
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2"))
#> # A tibble: 9 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a a
#> 2 a b
#> 3 a c
#> 4 b a
#> 5 b b
#> 6 b c
#> 7 c a
#> 8 c b
#> 9 c c
Is there a way to remove 'duplicate' combinations here. That is have the output be the following in this example:
#> # A tibble: 3 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a b
#> 2 a c
#> 3 b c
I was hoping there would be a nice purrr::map or filter approach to pipe into to complete the above.
EDIT: There are similar questions to this one, e.g. here, marked by @Sotos. Here I am specifically looking for tidyverse (purrr, dplyr) ways to complete the pipeline I have set up. The other answers use various other packages that I do not want to include as dependencies.
I wish there was a better way, but I usually use this:
library(tidyverse)
df <- tibble(value = letters[1:3])
df %>%
expand(value, value1 = value) %>%
filter(value < value1)
# # A tibble: 3 x 2
# value value1
# <chr> <chr>
# 1 a b
# 2 a c
# 3 b c
Something like this?
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2")) %>%
rowwise() %>%
mutate(words1 = sort(c(words1, words2))[1], # sort order of words for each row
words2 = sort(c(words1, words2))[2]) %>%
filter(words1 != words2) %>% # remove word combinations with itself
unique() # remove duplicates
# A tibble: 3 x 2
words1 words2
<chr> <chr>
1 a b
2 a c
3 b c
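For comparison, base R's combn (from the utils package that ships with R) produces the unique pairs directly. It is not a tidyverse function, so it does not fit the pipeline the question asks for, but it adds no extra dependency. A minimal sketch, with words and pairs as illustrative names:

```r
words <- letters[1:3]

# combn() returns a 2 x choose(3, 2) matrix with one pair per column;
# transposing gives one pair per row
pairs <- as.data.frame(t(combn(words, 2)), stringsAsFactors = FALSE)
names(pairs) <- c("words1", "words2")
pairs
```

The result can be passed to tibble::as_tibble() if a tibble is needed for the rest of the pipeline.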
