Run the same code with changed data and variable names in R

I need to run very similar code for 3 different data sets. My current code looks like this:
## data a
a_dat2 <- merge(a_dat, zip, by = "zip", all.x = T)
a_dat2 <- a_dat2 %>%
group_by(zip) %>%
summarize(dist_a_min = min(dist))
## data b
b_dat2 <- merge(b_dat, zip, by = "zip", all.x = T)
b_dat2 <- b_dat2 %>%
group_by(zip) %>%
summarize(dist_b_min = min(dist))
## data c
c_dat2 <- merge(c_dat, zip, by = "zip", all.x = T)
c_dat2 <- c_dat2 %>%
group_by(zip) %>%
summarize(dist_c_min = min(dist))
The code for the 3 data sets is the same except that the name of the data varies: a_dat, b_dat, c_dat. The name of the output variable varies too: dist_a_min, dist_b_min, dist_c_min. What function or loop can I use to shorten the code so that I don't need to copy and paste it for each data set separately?

An option is to collect the data sets in a list with mget, loop over the list with imap, join each one (?left_join) with the 'zip' data set, group by 'zip', and take the min of 'dist', building the new column name from the first character of each list element's name:
library(tidyverse)
# pick up a_dat, b_dat, c_dat from the workspace as a named list
mget(ls(pattern = "_dat$")) %>%
  imap(~ left_join(.x, zip, by = 'zip') %>%
    group_by(zip) %>%
    summarise(!! str_c('dist_', substr(.y, 1, 1), '_min') := min(dist)))
Or another option is to create a function for repeated tasks
joinSumm <- function(dat, groupName, colName, data2) {
  groupName <- enquo(groupName)
  colName <- enquo(colName)
  # build the output column name from the first letter of the data set's name
  nm1 <- str_c('dist_', str_sub(rlang::as_name(enquo(dat)), 1, 1), '_min')
  dat %>%
    left_join(data2, by = rlang::as_name(groupName)) %>%
    group_by(!! groupName) %>%
    summarise((!! nm1) := min(!! colName))
}
joinSumm(a_dat, zip, dist, zip)
joinSumm(b_dat, zip, dist, zip)
joinSumm(c_dat, zip, dist, zip)
A reproducible example with built-in dataset iris (without the join part)
list(a_dat = iris, b_dat = iris, c_dat = iris) %>%
imap(~ .x %>%
group_by(Species) %>%
summarise(!! str_c('dist_', substr(.y, 1, 1), '_min') := min(Sepal.Length)))
#$a_dat
# A tibble: 3 x 2
# Species dist_a_min
# <fct> <dbl>
#1 setosa 4.3
#2 versicolor 4.9
#3 virginica 4.9
#$b_dat
# A tibble: 3 x 2
# Species dist_b_min
# <fct> <dbl>
#1 setosa 4.3
#2 versicolor 4.9
#3 virginica 4.9
#$c_dat
# A tibble: 3 x 2
# Species dist_c_min
# <fct> <dbl>
#1 setosa 4.3
#2 versicolor 4.9
#3 virginica 4.9
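For a version that includes the join, here is a minimal sketch on made-up data. The toy tables `dat_list` and `zip_lookup` and the `region` column are assumptions for illustration, not the asker's data, and the sketch assumes `dist` is stored in each data set rather than in the lookup table.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-ins for a_dat, b_dat, c_dat and the zip lookup table
dat_list <- list(
  a_dat = data.frame(zip = c(1, 1, 2), dist = c(5, 3, 8)),
  b_dat = data.frame(zip = c(1, 2, 2), dist = c(7, 2, 9)),
  c_dat = data.frame(zip = c(1, 2), dist = c(4, 6))
)
zip_lookup <- data.frame(zip = c(1, 2), region = c("north", "south"))

# One summary per data set; the column name uses the first letter of the list name
res <- imap(dat_list, function(dat, nm) {
  out_col <- paste0("dist_", substr(nm, 1, 1), "_min")
  dat %>%
    left_join(zip_lookup, by = "zip") %>%
    group_by(zip) %>%
    summarise(!!out_col := min(dist), .groups = "drop")
})

# Optionally combine the three summaries into one table keyed by zip
combined <- reduce(res, full_join, by = "zip")
```

Because the output columns have distinct names, the per-data-set results can be joined into a single table without any renaming.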

Related

Search elements of a single character string in a dataframe column to subset it

I have two dataframes:
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);"
,k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site =c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
"XX(275);", "ABC(7);", "ABC(9);")
,p1 = rnorm(6, mean = 5, sd = 2)
,p2 = rnorm(6, mean = 6.5, sd = 2))
The first dataframe is in fact a list of often very long strings, made of "elements". Each "element" is made of a few letters/numbers, followed by a number in brackets, followed by a semicolon. In this example I only put 3 "elements" into each string, but in my real dataframe there are tens to hundreds of them.
> df1
k1 k2
1 AFD(1);Acf(2);Vgr7(2); ABC(7);BHG(46);TFG(675);
The second dataframe shares some of the "elements" with df1. Its first column, called site, contains some (not all) "elements" from the first dataframe, sometimes the "element" forms the whole string, and sometimes is a part of a longer string:
> df2
site p1 p2
1 AFD(1);AFD(2); 4.043700 3.745881
2 Acf(2); 5.835883 5.670011
3 TFG(677); 7.717359 5.711420
4 XX(275); 4.794425 6.381373
5 ABC(7); 5.775343 8.700051
6 ABC(9); 4.892390 8.026351
I would like to filter the whole df2 using df2$site and each k column from df1 (there are many k columns, and not all of them contain k in their names).
The easiest way to explain this is to show what the desired output would look like.
> outcome
k site p1 p2
1 k1 AFD(1);AFD(2); 4.043700 3.745881
2 k1 Acf(2); 5.835883 5.670011
3 k2 ABC(7); 5.775343 8.700051
The first column of the outcome dataframe corresponds to the column names in df1. The second column corresponds to the site column of df2 and contains only those sites that share at least one "element" with a df1 column. The other columns are from df2.
I appreciate that this question is made of two separate "problems", one grepping-related and one related to looping through df1 columns. I decided to show the task in its entirety in case there exists a solution that addresses both in one go.
FAILED SOLUTION 1
I can create a string to grep, but for each column separately:
# this replaces the semicolons with "|", but does not escape the brackets.
k1_pattern <- df1 %>%
select(k1) %>%
deframe() %>%
str_replace_all(";","|")
And then I am not sure how to use it. This (below) didn't work, maybe because I didn't escape brackets, but I am struggling with doing it:
k1_result <- df2 %>%
filter(grepl(pattern = k1_pattern, site))
But even if it did work, it would only deal with a single column from df1, and I have many, and would like to perform this operation on all df1 columns at the same time.
FAILED SOLUTION 2
I can create a list of sites to search in df2 from columns in df1:
k1_sites<- df1 %>%
select(k1) %>%
deframe() %>%
strsplit(., "[;]") %>%
unlist()
but the delimiter is lost here, and %in% cannot be used, as the match will sometimes be partial.
library(dplyr)
df2 %>%
mutate(site_list = strsplit(site, ";")) %>%
rowwise() %>%
filter(length(intersect(site_list,
unlist(strsplit(x = paste0(c(t(df1)), collapse=""),
split = ";")))) != 0) %>%
select(-site_list)
#> # A tibble: 3 x 3
#> # Rowwise:
#> site p1 p2
#> <chr> <dbl> <dbl>
#> 1 AFD(1);AFD(2); 3.75 7.47
#> 2 Acf(2); 5.37 7.98
#> 3 ABC(7); 5.66 9.52
Updated answer:
library(dplyr)
library(tidyr)
df1 %>%
tibble::rownames_to_column("id") %>%
pivot_longer(-id, names_to = "k", values_to = "site") %>%
separate_rows(site, sep = ";") %>%
filter(site != "") %>%
select(-id) -> df1_k
df2 %>%
tibble::rownames_to_column("id") %>%
separate_rows(site, sep = ";") %>%
full_join(., df1_k, by = c("site")) %>%
group_by(id) %>%
fill(k, .direction = "downup") %>%
filter(!is.na(id) & !is.na(k)) %>%
summarise(k = first(k),
site = paste0(site, collapse = ";"),
p1 = first(p1),
p2 = first(p2), .groups = "drop") %>%
select(-id)
#> # A tibble: 3 x 4
#> k site p1 p2
#> <chr> <chr> <dbl> <dbl>
#> 1 k1 AFD(1);AFD(2); 3.75 7.47
#> 2 k1 Acf(2); 5.37 7.98
#> 3 k2 ABC(7); 5.66 9.52
Here's a way going to a long format for exact matching (so no regex):
library(dplyr)
library(tidyr)
df1_long = df1 |> stack() |>
separate_rows(values, sep = ";") |>
filter(values != "")
df2 |>
mutate(id = row_number()) |>
separate_rows(site, sep = ";") |>
filter(site != "") |>
left_join(df1_long, by = c("site" = "values")) %>%
group_by(id) |>
filter(any(!is.na(ind))) %>%
summarize(
site = paste(site, collapse = ";"),
across(-site, \(x) first(na.omit(x)))
)
# # A tibble: 3 × 5
# id site p1 p2 ind
# <int> <chr> <dbl> <dbl> <fct>
# 1 1 AFD(1);AFD(2) 3.75 7.47 k1
# 2 2 Acf(2) 5.37 7.98 k1
# 3 5 ABC(7) 5.66 9.52 k2
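The same exact-matching idea can also be sketched in base R, which avoids having to escape the brackets for a regex. This is only an illustration on the question's example data; the helper names `elements`, `site_parts` and `matches` are made up here.

```r
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);",
                  k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site = c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
                           "XX(275);", "ABC(7);", "ABC(9);"),
                  p1 = rnorm(6, mean = 5, sd = 2),
                  p2 = rnorm(6, mean = 6.5, sd = 2))

# Split every df1 column and every df2 site into its individual elements
elements <- lapply(df1, function(s) strsplit(as.character(s), ";", fixed = TRUE)[[1]])
site_parts <- strsplit(as.character(df2$site), ";", fixed = TRUE)

# For each k column, keep the df2 rows that share at least one element with it
matches <- lapply(names(elements), function(k) {
  hit <- vapply(site_parts, function(p) any(p %in% elements[[k]]), logical(1))
  if (!any(hit)) return(NULL)
  cbind(k = k, df2[hit, , drop = FALSE])
})
outcome <- do.call(rbind, matches)
```

Since the comparison is done with `%in%` on whole elements, `TFG(677)` is not matched by `TFG(675)`, and a site that matches several k columns would appear once per matching column.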

Update single row of data frame in R

my dataframe:
Name        Value
Setosa      1
Versicolor  2
So first of all, I want to check if an input matches the name in any row.
My solution for the filter so far:
# input$df <- Versicolor
library(dplyr)
df_table <- df_table %>%
dplyr::filter(grepl(input$df, Name, ignore.case = TRUE))
If there is a match, I'd like to update/overwrite this row with new values, like in the following table:
Name        Value
Setosa      1
Versicolor  3
The name stays the same, but only the value changes.
Do you have any advice?
You can try the following:
df_table[df_table$Name == input$df, 'Value'] <- new_value
This will update the Value column for all rows where the value in Name is the same as input$df which in your example is Versicolor
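As a self-contained sketch of that assignment, the following adds the existence check and the case-insensitive comparison from the question. The names `input_name` and `new_value` are hypothetical stand-ins for `input$df` and the replacement value.

```r
df_table <- data.frame(Name = c("Setosa", "Versicolor"), Value = c(1, 2))

input_name <- "versicolor"  # hypothetical stand-in for input$df
new_value <- 3              # hypothetical replacement value

# Case-insensitive exact match on Name
hit <- tolower(df_table$Name) == tolower(input_name)

# Only overwrite when at least one row matches; other rows stay untouched
if (any(hit)) {
  df_table$Value[hit] <- new_value
}
df_table
```

An exact comparison is used here instead of `grepl`, so an input such as "color" would not update the Versicolor row by accident.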
We can use a join
library(dplyr)
df_table %>%
left_join(input, by = c("Name")) %>%
mutate(Value = coalesce(Value.y, Value.x), .keep = "unused")
Or with data.table
library(data.table)
setDT(df_table)[input, Value := i.Value, on = .(Name)]
We can use {powerjoin} :
library(powerjoin)
df1 <- data.frame(Name = c("Setosa", "Versicolor"), Value = c(1, 2))
df2 <- data.frame(Name = "Versicolor", Value = 3)
power_left_join(df1, df2, by = "Name", conflict = coalesce_yx)
#> Name Value
#> 1 Setosa 1
#> 2 Versicolor 3
# or doing a row-wise sum:
power_left_join(df1, df2, by = "Name", conflict = rw ~ sum(.x, .y, na.rm = TRUE))
#> Name Value
#> 1 Setosa 1
#> 2 Versicolor 5
Created on 2022-04-14 by the reprex package (v2.0.1)

Dplyr multiple piped dynamic variables?

I do this a lot:
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise(num_Species = n_distinct(Species)) %>%
mutate(perc_Species = 100 * num_Species / sum(num_Species))
So I would like to create a function that outputs the same thing but with dynamically named num_ and perc_ columns:
num_perc <- function(df, group_var, summary_var) {
}
I found this resource useful but it did not directly address how to reuse newly created column names in the way I want.
What you can do is use as_label(enquo()) on your group_var to extract the name of the variable passed in as a string and use it to generate your new column names. You can see a clear example of this in section 6.1.3 of the document you linked. In this way, we can dynamically prepend num_ and perc_ to your summary variable, and just have to pass in df and group_var.
library(dplyr)
num_perc <- function(df, group_var) {
summary_lbl <- as_label(enquo(group_var))
num_lbl <- paste0("num_", summary_lbl)
perc_lbl <- paste0("perc_", summary_lbl)
df %>%
group_by({{ group_var }}) %>%
summarize(!!num_lbl := n_distinct({{ group_var }})) %>%
mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}
num_perc(iris, Species)
#> # A tibble: 3 × 3
#> Species num_Species perc_Species
#> <fct> <int> <dbl>
#> 1 setosa 1 33.3
#> 2 versicolor 1 33.3
#> 3 virginica 1 33.3
In the case where group_var and summary_var actually differ, the solution is essentially the same.
num_perc <- function(df, group_var, summary_var) {
summary_lbl <- as_label(enquo(summary_var))
num_lbl <- paste0("num_", summary_lbl)
perc_lbl <- paste0("perc_", summary_lbl)
df %>%
group_by({{ group_var }}) %>%
summarize(!!num_lbl := n_distinct({{ summary_var }})) %>%
mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}
num_perc(iris, Species, Species)
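A variant of the same function can build the names with rlang's glue-style helper instead of paste0. This is a sketch that assumes rlang 1.0.0 or later, where `englue()` is available; the rest follows the function above.

```r
library(dplyr)

num_perc <- function(df, group_var, summary_var) {
  # englue() interpolates the name of the argument passed as summary_var
  num_lbl <- rlang::englue("num_{{ summary_var }}")
  perc_lbl <- rlang::englue("perc_{{ summary_var }}")
  df %>%
    group_by({{ group_var }}) %>%
    summarise(!!num_lbl := n_distinct({{ summary_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

num_perc(iris, Species, Sepal.Length)
```

Keeping the generated names in ordinary string variables is what makes it possible to refer to the new column again through `.data[[num_lbl]]` in the following step.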
Another possible solution, which uses deparse(substitute(...)) to get the name of the function parameters as strings:
library(tidyverse)
f <- function(df, group_var, summary_var)
{
group_var <- deparse(substitute(group_var))
summary_var <- deparse(substitute(summary_var))
df %>%
group_by(!!sym(group_var)) %>%
summarise(!!str_c("num_", summary_var) := n_distinct(!!sym(summary_var))) %>%
mutate(!!str_c("per_", summary_var) := 100 * !!sym(str_c("num_", summary_var)) / sum(!!sym(str_c("num_", summary_var))))
}
f(iris, Species, Species)
#> # A tibble: 3 × 3
#> Species num_Species per_Species
#> <fct> <int> <dbl>
#> 1 setosa 1 33.3
#> 2 versicolor 1 33.3
#> 3 virginica 1 33.3
Are you sure n_distinct is what you want to do? In the case of the iris dataset, there are three Species - setosa, versicolor, virginica. Therefore, each species is 1/3 unique species. The Iris dataset is balanced in the sense that there are 50 of each species, so each species represents 1/3 of the data set but more generally this will not be the case.
A function with data masking will cover imbalanced datasets for you:
library(dplyr)
my_func <- function(df, var, percent){
df %>%
count({{var}}) %>%
mutate(percent = 100 * n/sum(n))
}
my_func(iris, Species, percent)
iris %>%
my_func(Species, percent) #or with pipe

Adding name to summarised across columns using custom function

I want to use a custom function and return columns with an added "_cat_mean" to each column.
In the code below "$cat_mean" is added and I can't select it by that name.
summarise_categories <- function(x) {
tibble(
cat_mean = round(mean(x) * 2) / 2
)
}
iris_summarised = iris %>%
group_by(Species) %>%
summarise(across(ends_with("Length"), ~summarise_categories(.)))
Selecting the column by the name that is displayed doesn't work:
iris_summarised %>%
select(Species, Sepal.Length$cat_mean)
But this works
iris_summarised %>%
select(Species, Sepal.Length)
I want the column to be named "Sepal.Length_cat_mean"
You can use .names argument in across to give new column names.
library(dplyr)
summarise_categories <- function(x) {
round(mean(x) * 2) / 2
}
iris %>%
group_by(Species) %>%
summarise(across(ends_with("Length"), summarise_categories,
.names = '{col}_cat_mean')) -> iris_summarised
iris_summarised
# Species Sepal.Length_cat_mean Petal.Length_cat_mean
# <fct> <dbl> <dbl>
#1 setosa 5 1.5
#2 versicolor 6 4.5
#3 virginica 6.5 5.5
Using base R with colMeans and by
by(iris[-5], iris$Species, function(x) round(colMeans(x) * 2) /2)
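As background on the "$cat_mean" display in the question: because the custom function returns a tibble, each summarised column is itself a data frame column. If the function has to keep returning a tibble, one possible route is to flatten the result afterwards with tidyr::unpack(). This is a sketch under that assumption, reusing the question's function.

```r
library(dplyr)
library(tidyr)

# The question's function, which returns a one-column tibble
summarise_categories <- function(x) {
  tibble(cat_mean = round(mean(x) * 2) / 2)
}

iris_summarised <- iris %>%
  group_by(Species) %>%
  summarise(across(ends_with("Length"), ~ summarise_categories(.x)))

# Flatten the data frame columns; names_sep joins outer and inner names
iris_flat <- iris_summarised %>%
  unpack(c(Sepal.Length, Petal.Length), names_sep = "_")

names(iris_flat)
```

After unpacking, the columns can be selected by plain names such as Sepal.Length_cat_mean.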

How to create a column for each level of another column in R?

The goal I am trying to achieve is an expanded data frame in which I will have created a new column for each level of a specific column in R. Here is a sample of the initial data frame and the data frame I am trying to achieve:
Original Data Frame:
record crop_land fishing_ground
BiocapPerCap 1.5 3.4
Consumption 2.3 0.5
Goal Data Frame:
crop_land.BiocapPerCap crop_land.Consumption fishing_ground.BiocapPerCap fishing_ground.Consumption
1.5 2.3 3.4 0.5
We can use pivot_wider from the tidyr package as follows.
library(tidyr)
library(magrittr)
dat2 <- dat %>%
pivot_wider(names_from = "record", values_from = c("crop_land", "fishing_ground"),
names_sep = ".")
dat2
# # A tibble: 1 x 4
# crop_land.BiocapPerCap crop_land.Consumption fishing_ground.BiocapPer~ fishing_ground.Consumpti~
# <dbl> <dbl> <dbl> <dbl>
# 1 1.5 2.3 3.4 0.5
DATA
dat <- read.table(text = "record crop_land fishing_ground
BiocapPerCap 1.5 3.4
Consumption 2.3 0.5",
header = TRUE, stringsAsFactors = FALSE)
Using tidyr is one option.
tidyr::pivot_longer() converts crop_land and fishing_ground to variable-value pairs. tidyr::unite() combines the record and variable to new names.
tidyr::pivot_wider() creates the wide data frame you are after.
library(tidyr)
library(magrittr) # for %>%
tst <- data.frame(
record = c("BiocapPerCap", "Consumption"),
crop_land = c(1.5, 2.3),
fishing_ground = c(3.4, 0.5)
)
pivot_longer(tst, -record) %>%
unite(new_name, record, name, sep = '.') %>%
pivot_wider(names_from = new_name, values_from = value)
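For comparison, the same reshaping can be sketched in base R without tidyr. The helper names `value_cols`, `vals`, `new_names` and `wide` are made up for this illustration.

```r
dat <- data.frame(
  record = c("BiocapPerCap", "Consumption"),
  crop_land = c(1.5, 2.3),
  fishing_ground = c(3.4, 0.5)
)

# All columns except the one holding the levels
value_cols <- setdiff(names(dat), "record")

# Flatten the values column by column
vals <- unlist(dat[value_cols], use.names = FALSE)

# Pair each value with "<column>.<record>" in the same column-by-column order
new_names <- paste(rep(value_cols, each = nrow(dat)), dat$record, sep = ".")

# One-row data frame with one column per (column, record) combination
wide <- as.data.frame(as.list(setNames(vals, new_names)))
wide
```

This relies on unlist() walking the data frame column by column, which is why the names are built with `rep(..., each = nrow(dat))`.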
