Accessing variable name in for loop in R?

I am trying to run a for loop in which I randomly subsample a dataset using the sample_n command. I also want to name each new subsampled data frame "df1", "df2", "df3", and so on, where the number corresponds to i in the for loop. I know the way I wrote this code is wrong, and I understand why I am getting the error. How can I combine "df" and i inside the for loop so that it reads as df1, df2, etc.? Happy to clarify if needed. Thanks!
for (i in 1:9){ print(get(paste("df", i, sep=""))) = sub %>%
group_by(dietAandB) %>%
sample_n(1) }
Error in print(get(paste("df", i, sep = ""))) = sub %>% group_by(dietAandB) %>% :
target of assignment expands to non-language object

Instead of using get you could use assign.
Using some fake example data:
library(dplyr, warn=FALSE)
sub <- data.frame(
dietAandB = LETTERS[1:2]
)
for (i in 1:2) {
assign(paste0("df", i), sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
}
df1
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
df2
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
But the more R-ish way to do this would be to use a list instead of creating single objects:
df <- list()
for (i in 1:2) {
  df[[i]] <- sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup()
}
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
Or, more concisely, use lapply instead of a for loop:
df <- lapply(1:2, function(x) sub %>% group_by(dietAandB) %>% sample_n(1) |> ungroup())
df
#> [[1]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
#>
#> [[2]]
#> # A tibble: 2 × 1
#> dietAandB
#> <chr>
#> 1 A
#> 2 B
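If the "df1", "df2", ... names are still wanted, the list elements can carry them. The following is a minimal sketch under a few assumptions: the example data, the extra value column, and the seed are made up for illustration, and slice_sample (available in dplyr 1.0.0 and later) stands in for sample_n.

```r
library(dplyr, warn.conflicts = FALSE)

# Illustrative data: three rows per diet group (not from the question)
sub <- data.frame(dietAandB = rep(LETTERS[1:2], each = 3), value = 1:6)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Draw one row per group, three times, and collect the results in a list
dfs <- lapply(1:3, function(i) {
  sub %>%
    group_by(dietAandB) %>%
    slice_sample(n = 1) %>%
    ungroup()
})

# Name the elements df1, df2, df3 instead of creating separate objects
names(dfs) <- paste0("df", seq_along(dfs))

names(dfs)
dfs[["df2"]]
```

Each subsample is then available as dfs$df1, dfs$df2, and so on, and the whole set can still be processed with lapply.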

It depends on the sample size, which is missing from your question. As an example, I used the mtcars dataset (32 rows) and drew three subsamples of size 20 from it:
library(dplyr)
for (i in 1:3) {
assign(paste0("df", i), sample_n(mtcars, 20))
}

Related

Iterating name of a field with dplyr::summarise function

This is my first time posting here, so I'll try to explain my problem as clearly as possible.
I'm working on erosion data for farms, stored as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 data frames in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but my problem is that the data frames don't all use the same name (it is either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between data frames; I want to use only the name, which varies.
Here's an example:
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
cur_df <- df
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
cur_df <- df
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(cur_camp = mean(cur_camp))
}
but this doesn't work: cur_camp is a character string, not a reference to the column it names, and the approach still relies on the column position.
How can I build the current_name_of_erosion_field here?
We may convert the string to a symbol and evaluate it (!!), or pass the string to across. Because we are using a for loop, create a list beforehand to store the output. To use the value of an object as the name on the left-hand side of an assignment, use := with !!.
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(!!cur_camp := mean(!! sym(cur_camp)))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Or we may use across:
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(across(all_of(cur_camp), mean))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
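Both versions above still pick the column with names(cur_df)[2], that is, by position. A hedged sketch of a name-based variant follows; it assumes that exactly one column in each data frame starts with "ERO" (an assumption taken from the example data, not stated in the question) and uses the .data pronoun together with glue-style names on the left of :=.

```r
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)

out <- lapply(lst_df, function(d) {
  # Find the erosion column by name rather than by position
  # (assumes exactly one column name starts with "ERO")
  ero <- grep("^ERO", names(d), value = TRUE)
  stopifnot(length(ero) == 1)

  d %>%
    group_by(ID) %>%
    # "{ero}" builds the output name; .data[[ero]] looks the column up by string
    summarise("{ero}" := mean(.data[[ero]]))
})

out
```

Using lapply here also removes the need to pre-allocate the output list.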
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
pivot_longer(starts_with('ERO'),
names_to = 'ERO',
values_drop_na = TRUE) %>%
group_by(ID, ERO) %>%
summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)

R purrr how to rename column of nested df

I have a list of data frames, each with two columns named "place" and "data".
"place" is a character and "data" is a nested data frame with one numeric column named "value".
For each data frame from the list, I'd like to rename the "value" column of the nested data frame with the value of "place" column.
library(tidyverse)
some_dt = tibble(place = c("a","a", "b","b","c","c"),
value = c(1,2,1,4,5,6))
# here is a list of data frames...
ls_df <-
some_dt %>%
group_by(place) %>%
nest() %>%
split(.$place)
I tried:
map2(ls_df$data,
ls_df$place,
~rename(.x, .y = "value"))
or:
map2(ls_df$data,
ls_df$place,
~rename_with(.x, ~ .y, "value"))
but I'm getting an empty list as result.
How can I rename the "value" column with the content of the outer data frame column?
We may loop over the list ('ls_df') with map, extract the 'place' value, and then rename the 'value' column of the nested 'data' tibble to that 'place' value
library(dplyr)
library(purrr)
ls_df2 <- map(ls_df, ~ {
nm <- .x$place
.x$data[[1]] <- .x$data[[1]] %>%
rename_with(~ nm, "value")
.x
})
-checking
> map(ls_df2, ~ .x$data)
$a
$a[[1]]
# A tibble: 2 × 1
a
<dbl>
1 1
2 2
$b
$b[[1]]
# A tibble: 2 × 1
b
<dbl>
1 1
2 4
$c
$c[[1]]
# A tibble: 2 × 1
c
<dbl>
1 5
2 6
Note that splitting the data returns a list of data frames, so we cannot access the 'data' and 'place' columns directly on it, i.e.
> ls_df$data
NULL
> ls_df$place
NULL
Or another option is
some_dt %>%
nest_by(place) %>%
mutate(data = data %>%
rename_with(~ place, value) %>%
list(.)) %>%
ungroup
# A tibble: 3 × 2
place data
<chr> <list>
1 a <tibble [2 × 1]>
2 b <tibble [2 × 1]>
3 c <tibble [2 × 1]>
You can also iterate over each list element and then use mutate to rename the nested data frame using the place.
ls_df %>%
modify(~ mutate(.x,
data = map(data,
~ set_names(.x, first(place)))))
In this case, you can actually simplify this further.
ls_df %>%
modify(~ mutate(.x,
data = map2(data, place, set_names)))
# which can collapse down to as simple as this
ls_df %>%
modify(mutate, data = map2(data, place, set_names))
With that approach, it is worth considering whether you need the list at all. The nested tibble may be easier to work with directly.
ls_df %>%
bind_rows() %>%
mutate(data = map2(data, place, set_names))
You could also try something like this:
library(tidyverse)
map(ls_df,
~ map2(.x$place,
.x$data,
~rename(.y,
!!sym(.x) := value)
)
)
#> $a
#> $a[[1]]
#> # A tibble: 2 x 1
#> a
#> <dbl>
#> 1 1
#> 2 2
#>
#>
#> $b
#> $b[[1]]
#> # A tibble: 2 x 1
#> b
#> <dbl>
#> 1 1
#> 2 4
#>
#>
#> $c
#> $c[[1]]
#> # A tibble: 2 x 1
#> c
#> <dbl>
#> 1 5
#> 2 6
You can create a function which renames using base colnames() then map that over all the list elements as follows:
# The fn:
rnm <- function(x) {
colnames(x$data[[1]]) <- x$place
x
}
# Result:
res <- ls_df |> purrr::map(.f = rnm)
# Check if it's the desired output:
res$a$data
# [[1]]
# A tibble: 2 × 1
# a
# <dbl>
# 1 1
# 2 2

Mutate All columns in a list of tibbles

Lets suppose I have the following list of tibbles:
a_list_of_tibbles <- list(
a = tibble(a = rnorm(10)),
b = tibble(a = runif(10)),
c = tibble(a = letters[1:10])
)
Now I want to map them all into a single dataframe/tibble, which is not possible due to the differing column types.
How would I go about this?
I have tried this, but I want to get rid of the for loop
for(i in 1:length(a_list_of_tibbles)){
a_list_of_tibbles[[i]] <- a_list_of_tibbles[[i]] %>% mutate_all(as.character)
}
Then I run:
map_dfr(.x = a_list_of_tibbles, .f = as_tibble)
We could do the conversion within the map call, using across instead of the _all suffix (the scoped verbs such as mutate_all are superseded) to loop over the columns of each dataset
library(dplyr)
library(purrr)
map_dfr(a_list_of_tibbles,
~ .x %>%
mutate(across(everything(), as.character)))
-output
# A tibble: 30 × 1
a
<chr>
1 0.735200825884485
2 1.4741501589461
3 1.39870958697574
4 -0.36046362308853
5 -0.893860999301402
6 -0.565468636033674
7 -0.075270267983768
8 2.33534260196058
9 0.69667906338348
10 1.54213170143702
# … with 20 more rows
Another alternative is to use:
library(tidyverse)
map_depth(a_list_of_tibbles, 2, as.character) %>%
bind_rows()
#> # A tibble: 30 × 1
#> a
#> <chr>
#> 1 0.0894618169853206
#> 2 -1.50144637645091
#> 3 1.44795821718513
#> 4 0.0795342912030257
#> 5 -0.837985570593029
#> 6 -0.050845557103668
#> 7 0.031194556366589
#> 8 0.0989551909839589
#> 9 1.87007290229274
#> 10 0.67816212007413
#> # … with 20 more rows
Created on 2021-12-20 by the reprex package (v2.0.1)
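When the combined tibble should also record which list element each row came from, the .id argument of bind_rows can help. This is only a sketch: the seed and the column name "source" are illustrative choices, not part of the question.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)  # arbitrary seed, only for reproducibility
a_list_of_tibbles <- list(
  a = tibble(a = rnorm(10)),
  b = tibble(a = runif(10)),
  c = tibble(a = letters[1:10])
)

# Convert every column of every tibble to character
as_chr <- lapply(a_list_of_tibbles, function(d) {
  mutate(d, across(everything(), as.character))
})

# Stack the tibbles; .id stores the list names in a new column ("source" here)
combined <- bind_rows(as_chr, .id = "source")

dim(combined)
unique(combined$source)
```

Converting before binding avoids the type conflict between the numeric and character versions of column a.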

Modify a vector based on a vector of regular expressions (regex) using (if possible) a functional approach

I have a data frame with some columns that I want to modify depending on whether they match patterns in a vector of regular expressions.
library(fuzzyjoin)
library(tidyverse)
(df <- tribble(~a,
"GUA-ABC",
"REF-CDE",
"ACC.S93",
"ACC.ATN"))
#> # A tibble: 4 x 1
#> a
#> <chr>
#> 1 GUA-ABC
#> 2 REF-CDE
#> 3 ACC.S93
#> 4 ACC.ATN
Depending on the pattern, I want to append some text. For example, for values that contain GUA-, append "GUA001" to the end of the string, separated by a period, and for values that contain REF-, append "GUA002" in the same way, to obtain the following:
# This is the resulting data.frame I need
#> # A tibble: 4 x 1
#> a
#> <chr>
#> 1 GUA-ABC.GUA001
#> 2 REF-CDE.GUA002
#> 3 ACC.S93
#> 4 ACC.ATN
I have thought of some approaches.
Approach # 1
# list of patterns to search
patterns <- c("\\b^GUA\\b", "\\b^REF\\b")
# Create a named list for recoding
model_key <- list("\\b^GUA\\b" = "GUA001",
"\\b^REF\\b" = "GUA002")
# Create a data.frame of regexs
(k <- tibble(regex = patterns))
#> # A tibble: 2 x 1
#> regex
#> <chr>
#> 1 "\\b^GUA\\b"
#> 2 "\\b^REF\\b"
# perform a regex_left_join to identify the pattern
df %>%
regex_left_join(k, by = c(a = "regex")) %>%
mutate(
across(regex, recode, !!!model_key),
a = case_when(
!is.na(regex) ~ str_c(a, regex, sep = "."),
TRUE ~ a)
) %>% select(-regex)
#> # A tibble: 4 x 1
#> a
#> <chr>
#> 1 GUA-ABC.GUA001
#> 2 REF-CDE.GUA002
#> 3 ACC.S93
#> 4 ACC.ATN
Why is this approach not optimal? The original data frame has millions of rows and fuzzyjoin::regex_left_join takes too long to do this.
Approach # 2
patron <- c("GUA001" = "\\b^GUA\\b", "GUA002" = "\\b^REF\\b")
newtex <- c("GUA001", "GUA002")
pegar <- function(string, pattern, text_to_paste) {
if_else(condition = str_detect(string, pattern),
true = str_c(string, text_to_paste, sep = "."),
false = string)
}
map2_dfr(.x = patron, .y = newtex, ~ pegar(string = df$a,
pattern = .x,
text_to_paste = .y))
#> # A tibble: 4 x 2
#> GUA001 GUA002
#> <chr> <chr>
#> 1 GUA-ABC.GUA001 GUA-ABC
#> 2 REF-CDE REF-CDE.GUA002
#> 3 ACC.S93 ACC.S93
#> 4 ACC.ATN ACC.ATN
Created on 2021-05-20 by the reprex package (v2.0.0)
With approach # 2 I can't get a single column.
As a side note, using str_replace_all with a named vector to replace some of the values within the string has not looked like a good alternative so far.
Is there a more efficient way to do this?
One option utilizing stringr and purrr could be:
imap_dfr(model_key,
~ df %>%
filter(str_detect(a, .y)) %>%
mutate(a = str_c(a, .x, sep = "."))) %>%
bind_rows(df %>%
filter(str_detect(a, str_c(names(model_key), collapse = "|"), negate = TRUE)))
a
<chr>
1 GUA-ABC.GUA001
2 REF-CDE.GUA002
3 ACC.S93
4 ACC.ATN
What about a boring old loop?
## make df millions of rows
df <- df[rep(1:4,1e6),]
system.time({
val <- c("GUA\\-", "REF\\-", "ACC\\.", "QQQ\\.")
rpl <- c("GUA001", "GUA002", "ACC001", "QQQ001")
for(i in seq_along(val)) {
sel <- grepl(val[i], df$a)
df$a[sel] <- paste(df$a[sel], rpl[i], sep=".")
}
})
## user system elapsed
## 2.14 0.03 2.17
It takes about 2 seconds to complete.
df
## A tibble: 4,000,000 x 1
# a
# <chr>
# 1 GUA-ABC.GUA001
# 2 REF-CDE.GUA002
# 3 ACC.S93.ACC001
# 4 ACC.ATN.ACC001
# ...
If the functional approach is absolutely necessary, you can squish it into a Reduce function:
Reduce(
function(str, args) {
sel <- grepl(args[1], str)
str[sel] <- paste(str[sel], args[2], sep=".")
str
},
Map(c, val, rpl), init = df$a
)
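In the loop above, a string that matches several patterns would receive several suffixes. If each string should get at most one suffix, a flag vector can track which elements are already done. The sketch below is base R only; the function name append_first_match and the rule "first matching pattern wins" are assumptions for illustration, not something the question specifies.

```r
# Hypothetical helper: append the suffix of the FIRST matching pattern only
append_first_match <- function(x, patterns, suffixes) {
  stopifnot(length(patterns) == length(suffixes))
  out <- x
  done <- logical(length(x))  # TRUE once an element has received a suffix
  for (i in seq_along(patterns)) {
    # Only consider elements that no earlier pattern has matched
    sel <- !done & grepl(patterns[i], x)
    out[sel] <- paste(x[sel], suffixes[i], sep = ".")
    done <- done | sel
  }
  out
}

a <- c("GUA-ABC", "REF-CDE", "ACC.S93", "ACC.ATN")
res <- append_first_match(a, c("^GUA-", "^REF-"), c("GUA001", "GUA002"))
res
# values: "GUA-ABC.GUA001", "REF-CDE.GUA002", "ACC.S93", "ACC.ATN"
```

Because grepl and paste are vectorized over the strings, the loop runs once per pattern, not once per row, so it should scale in the same way as the loop above.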

How to refer to variable instead of column with dplyr

When using dplyr::filter, I often compute a local variable that holds the viable choices:
df <- as_tibble(data.frame(id=c("a","b"), val=1:6))
ids <- c("b","c")
filter(df, id %in% ids)
# giving id %in% c("b","c")
However, if the dataset by chance has a column with the same name, this fails to achieve the intended purpose:
df$ids <- "a"
filter(df, id %in% ids)
# giving id %in% "a"
How should I explicitly refer to the ids variable instead of the ids column?
Unquote with !! to tell filter to look in the calling environment instead of the data frame:
library(tidyverse)
df <- tibble(id = rep(c("a","b"), 3), val = 1:6)
ids <- c("b", "c")
df %>% filter(id %in% ids)
#> # A tibble: 3 x 2
#> id val
#> <chr> <int>
#> 1 b 2
#> 2 b 4
#> 3 b 6
df <- df %>% mutate(ids = "a")
df %>% filter(id %in% ids)
#> # A tibble: 3 x 3
#> id val ids
#> <chr> <int> <chr>
#> 1 a 1 a
#> 2 a 3 a
#> 3 a 5 a
df %>% filter(id %in% !!ids)
#> # A tibble: 3 x 3
#> id val ids
#> <chr> <int> <chr>
#> 1 b 2 a
#> 2 b 4 a
#> 3 b 6 a
Of course, the better way to avoid such issues is to not put identically-named vectors in your global environment.
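A related option is the .env pronoun, which dplyr's data-masking verbs accept alongside .data. The sketch below reuses the example data from this answer; it is meant as an illustration of the pronoun, not as a replacement for the !! approach.

```r
library(dplyr, warn.conflicts = FALSE)

# Same data as above, including the clashing "ids" column
df <- tibble(id = rep(c("a", "b"), 3), val = 1:6, ids = "a")
ids <- c("b", "c")

# .env$ids always refers to the variable in the calling environment,
# even though the data frame has a column with the same name
res <- df %>% filter(id %in% .env$ids)
res$val
# values: 2, 4, 6

# .data$ids is the explicit counterpart for the column
res_col <- df %>% filter(id %in% .data$ids)
res_col$val
# values: 1, 3, 5
```

Writing .env$ids and .data$ids makes the intent visible at the call site, which can help when a column and a variable share a name.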
