When using dplyr:filter, I often compute a local variable that holds the viable choices:
df <- as_tibble(data.frame(id=c("a","b"), val=1:6))
ids <- c("b","c")
filter(df, id %in% ids)
# giving id %in% c("b","c")
However, if the dataset by chance has a column with the same name, this fails to achieve the intended purpose:
df$ids <- "a"
filter(df, id %in% ids)
# giving id %in% "a"
How should I explicitly refer to the ids variable instead of the ids column?
Unquote with !! to tell filter to look in the calling environment instead of the data frame:
library(tidyverse)
df <- data_frame(id = rep(c("a","b"), 3), val = 1:6)
ids <- c("b", "c")
df %>% filter(id %in% ids)
#> # A tibble: 3 x 2
#> id val
#> <chr> <int>
#> 1 b 2
#> 2 b 4
#> 3 b 6
df <- df %>% mutate(ids = "a")
df %>% filter(id %in% ids)
#> # A tibble: 3 x 3
#> id val ids
#> <chr> <int> <chr>
#> 1 a 1 a
#> 2 a 3 a
#> 3 a 5 a
df %>% filter(id %in% !!ids)
#> # A tibble: 3 x 3
#> id val ids
#> <chr> <int> <chr>
#> 1 b 2 a
#> 2 b 4 a
#> 3 b 6 a
Of course, the better way to avoid such issues is to not put identically-named vectors in your global environment.
Related
I am trying to run a for loop where I randomly subsample a dataset using sample_n command. I also want to name each new subsampled dataframe as "df1" "df2" "df3". Where the numbers correspond to i in the for loop. I know the way I wrote this code is wrong and why i am getting the error. How can I access "df" "i" in the for loop so that it reads as df1, df2, etc.? Happy to clarify if needed. Thanks!
for (i in 1:9){ print(get(paste("df", i, sep=""))) = sub %>%
group_by(dietAandB) %>%
sample_n(1) }
Error in print(get(paste("df", i, sep = ""))) = sub %>% group_by(dietAandB) %>% :
target of assignment expands to non-language object
Instead of using get you could use assign.
Using some fake example data:
library(dplyr, warn=FALSE)
sub <- data.frame(
dietAandB = LETTERS[1:2]
)
for (i in 1:2) {
assign(paste0("df", i), sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
}
df1
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
df2
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
But the more R-ish way to do this would be to use a list instead of creating single objects:
df <- list(); for (i in 1:2) { df[[i]] = sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup() }
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
Or more concise to use lapply instead of a for loop
df <- lapply(1:2, function(x) sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
It depends on the sample size which is missing in your question. So, As an example I considered the mtcars dataset (32 rows) and sampling three subsamples of size 20 from the data:
library(dplyr)
for (i in 1:3) {
assign(paste0("df", i), sample_n(mtcars, 20))
}
tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1)
y = c(3,4),
)
I want to find all groups for which all elements of x and x only are all NA, then remove those groups. "B" is filtered in because it has at least 1 non NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but it seems that filters out if it finds at least 1 NA; I need the correct word, which is not all.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both cells in the column x are not defined.
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
-output
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In the devel version of dplyr, we could use .by in filter thus we don't need to group_by/ungroup
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
first time for me here, I'll try to explain you my problem as clearly as possible.
I'm working on erosion data contained in farms in the form of pixels (e.g. 1 farm = 10 pixels so 10 lines in my df), for this I have 4 df in a list, and I would like to calculate for each farm the mean of erosion. I thought about a loop on the name of erosion field but my problem is that my df don't have the exact name (either ERO13 or ERO17). I don't want to work the position of the field because it could change between the df, only with the name which is variable.
Here's a example :
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
cur_df <- df
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
cur_df <- df
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(cur_camp = mean(cur_camp))
}
but first doesn't work because it's a string character and not a variable containing the string character and it works with the position.
How can I build the current_name_of_erosion_field here ?
We may convert it to symbol and evaluate (!!) or may pass the string across. Also, as we are using a for loop, make sure to create a list to store the output. Also, to assign from an object created, use := with !!
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(!!cur_camp := mean(!! sym(cur_camp)))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Or may use across
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(across(all_of(cur_camp), mean))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
pivot_longer(starts_with('ERO'),
names_to = 'ERO',
values_drop_na = TRUE) %>%
group_by(ID, ERO) %>%
summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)
I have the following data frame:
ID
Group
1
A
1
B
2
C
2
D
And I want to reshape the data frame into a wider version in terms of ID. Thus, the new data frame looks like this:
ID
Group1
Group2
1
A
B
2
C
D
You can do this by adding a helper column and then using tidyr::pivot_wider():
library(dplyr)
library(tidyr)
data <- tibble(
id = c(1, 1, 2, 2),
group = letters[1:4]
)
# Add a helper column to use when pivoting. This uses the row number
# over each subgroup, i.e. over each value of `id`
transformed_data <- data %>%
group_by(id) %>%
mutate(helper = paste0("Group", row_number())) %>%
ungroup()
# Here's what the helper column looks like
transformed_data
#> # A tibble: 4 x 3
#> id group helper
#> <dbl> <chr> <chr>
#> 1 1 a Group1
#> 2 1 b Group2
#> 3 2 c Group1
#> 4 2 d Group2
# Pivot the data using the helper column
transformed_data %>%
pivot_wider(names_from = helper, values_from = group)
#> # A tibble: 2 x 3
#> id Group1 Group2
#> <dbl> <chr> <chr>
#> 1 1 a b
#> 2 2 c d
I wanted to get all unique pairwise combinations of a unique string column of a dataframe using the tidyverse (ideally).
Here is a dummy example:
library(tidyverse)
a <- letters[1:3] %>%
tibble::as_tibble()
a
#> # A tibble: 3 x 1
#> value
#> <chr>
#> 1 a
#> 2 b
#> 3 c
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2"))
#> # A tibble: 9 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a a
#> 2 a b
#> 3 a c
#> 4 b a
#> 5 b b
#> 6 b c
#> 7 c a
#> 8 c b
#> 9 c c
Is there a way to remove 'duplicate' combinations here. That is have the output be the following in this example:
# A tibble: 9 x 2
#> words1 words2
#> <chr> <chr>
#> 1 a b
#> 2 a c
#> 3 b c
I was hoping there would be a nice purrr::map or filter approach to pipe into to complete the above.
EDIT: There are similar questions to this one e.g. here, marked by #Sotos. Here I am specifically looking for tidyverse (purrr, dplyr) ways to complete the pipeline I have setup. The other answers use various other packages that I do not want to include as dependencies.
wish there was a better way, but I usually use this...
library(tidyverse)
df <- tibble(value = letters[1:3])
df %>%
expand(value, value1 = value) %>%
filter(value < value1)
# # A tibble: 3 x 2
# value value1
# <chr> <chr>
# 1 a b
# 2 a c
# 3 b c
Something like this?
tidyr::crossing(a, a) %>%
magrittr::set_colnames(c("words1", "words2")) %>%
rowwise() %>%
mutate(words1 = sort(c(words1, words2))[1], # sort order of words for each row
words2 = sort(c(words1, words2))[2]) %>%
filter(words1 != words2) %>% # remove word combinations with itself
unique() # remove duplicates
# A tibble: 3 x 2
words1 words2
<chr> <chr>
1 a b
2 a c
3 b c