mutate across with vectorized function parameters - r

I know the "across" paradigm is "many columns, one function", so this might not be possible. I want to apply the same function to several columns, with a parameter that varies by column.
I got this to work using cur_column(), but it amounts to looking up the parameters one by one rather than supplying a vector of parameters with one entry per column.
This first block produces what I want, but I'm wondering if there's a cleaner way.
library(dplyr)
df = data.frame(column1 = 1:100, column2 = 1:100)
parameters = data.frame(
column_names = c('column1','column2'),
parameters = c(10,100))
custom_function = function(x,addend){
x + addend
}
df2 = df %>% mutate(
  across(parameters$column_names,
    # look up the addend for the column currently being processed
    ~custom_function(., addend = parameters %>%
      filter(column_names == cur_column()) %>%
      pull(parameters))))
What I would like to write for the last step would look like this:
df2 = df %>% mutate(
  across(parameters$column_names, ~custom_function(., addend = parameters$parameters)))

We can do this in base with mapply:
mapply(`+`, df[,parameters$column_names], parameters$parameters)
##> column1 column2
##> [1,] 11 101
##> [2,] 12 102
##> [3,] 13 103
##> [4,] 14 104
##> [5,] 15 105
##> ...
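If a data frame is needed rather than the matrix that mapply returns, one option is to assign the result of Map back into the selected columns. This is a minimal sketch that rebuilds df, parameters and custom_function from the question; the name df2 is only for illustration.

```r
df <- data.frame(column1 = 1:100, column2 = 1:100)
parameters <- data.frame(
  column_names = c("column1", "column2"),
  parameters = c(10, 100)
)
custom_function <- function(x, addend) x + addend

# Map() pairs each selected column with its own parameter and returns a list,
# which can be assigned straight back into the same columns.
df2 <- df
df2[parameters$column_names] <- Map(
  custom_function,
  df[parameters$column_names],
  parameters$parameters
)
head(df2, 3)
```

Columns not listed in parameters$column_names are left untouched, which is the main difference from the mapply call above.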

I think a mapping function operating on the parameters would be easier than across on the data:
library(tidyverse)
with(parameters, as_tibble(map2(df[column_names], parameters, custom_function)))
#> # A tibble: 100 x 2
#> column1 column2
#> <dbl> <dbl>
#> 1 11 101
#> 2 12 102
#> 3 13 103
#> 4 14 104
#> 5 15 105
#> 6 16 106
#> 7 17 107
#> 8 18 108
#> 9 19 109
#> 10 20 110
#> # ... with 90 more rows
Created on 2022-12-15 with reprex v2.0.2

We could either use match on the column name (cur_column()) with the column_names column of 'parameters' and extract the 'parameters' column value to be used as input in custom_function
library(dplyr)
df %>%
mutate(across(all_of(parameters$column_names),
~ custom_function(.x, parameters$parameters[match(cur_column(),
parameters$column_names)])))
-output
column1 column2
1 11 101
2 12 102
3 13 103
4 14 104
5 15 105
6 16 106
7 17 107
8 18 108
...
Or convert the two column data.frame to a named vector (deframe) and directly extract the value from the name
library(tibble)
params <- deframe(parameters)
df %>%
mutate(across(all_of(names(params)),
~ custom_function(.x, params[cur_column()])))
-output
column1 column2
1 11 101
2 12 102
3 13 103
4 14 104
5 15 105
6 16 106
...
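If loading tibble just for deframe is not desirable, the same named vector can probably be built with base R's setNames. A sketch under the same assumptions as the question's data; params is an illustrative name, and the function body is inlined as .x + addend for brevity.

```r
library(dplyr)

df <- data.frame(column1 = 1:100, column2 = 1:100)
parameters <- data.frame(
  column_names = c("column1", "column2"),
  parameters = c(10, 100)
)

# Named numeric vector: names are the column names, values are the addends.
params <- setNames(parameters$parameters, parameters$column_names)

df2 <- df %>%
  mutate(across(all_of(names(params)),
                # [[ ]] extracts the value for the current column without its name
                ~ .x + params[[cur_column()]]))
head(df2, 3)
```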

Edit: this doesn't work on the second column. I don't think across is what you want here; one of the other answers is better.
You could also just add the column to your function like this:
library(dplyr)
df = data.frame(column1 = 1:100, column2 = 1:100)
parameters = data.frame(
column_names = c('column1','column2'),
parameters = c(10,100))
custom_function = function(x,column, parameters){
addend = parameters %>%
filter(column_names == column) %>%
pull(parameters)
x + addend
}
df2 = df %>% mutate(
across(parameters$column_names,custom_function,cur_column(),parameters))

Related

How can I merge rows by a common variable and sum numeric columns while adding character rows to one, comma separated column?

Below is a sample data frame of what I am working with.
df <- data.frame(
Sample = c('A', 'A', 'B', 'C'),
Length = c('100', '110', '99', '102'),
Molarity = c(5,4,6,7)
)
df
Sample Length Molarity
1 A 100 5
2 A 110 4
3 B 99 6
4 C 102 7
I would like the result listed below but am unsure how to approach the problem.
Sample Length Molarity
1 A 100,110 9
2 B 99 6
3 C 102 7
We may do group by summarisation
library(dplyr)
df %>%
group_by(Sample) %>%
summarise(Length = toString(Length), Molarity = sum(Molarity))
-output
# A tibble: 3 × 3
Sample Length Molarity
<chr> <chr> <dbl>
1 A 100, 110 9
2 B 99 6
3 C 102 7
A base R option:
df |>
within({
Length <- ave(Length, Sample, FUN = toString)
Molarity <- ave(Molarity, Sample, FUN = sum)
}) |>
unique()
Sample Length Molarity
1 A 100, 110 9
3 B 99 6
4 C 102 7

Create new columns based on count of occurrence in another data frame in r

I want to create new columns in df1 based on the count of occurrences in the columns of df2.
For ID 100 in df1, no RM is assigned in df2; for ID 103 there are 2 RMs in df2, and for ID 108 there are 3.
The final data frame should therefore add that count to df1 as a new column.
I tried merging with a left join, but I'm not sure how to count the number of non-empty cells.
Here is the sample dataset
df1 <- data.frame(ID = c(100,101,103,105,108,109),
channel = c("A","C","C","C","D","D"),
duration = c(12,23,56,89,73,76))
df2 <- data.frame(ID=c(100,103,109,105,101,108),
RM1= c("","john","","Miller","","Maddy"),
RM2 = c("","Ryan","","","","sean"),
RM3 = c("","","","","","Arvind"))
Add the row-wise count of non-empty cells to df2, then merge:
df2$Total <- rowSums(df2[, -1] != "")
merge(
  df1,
  df2[c("ID", "Total")],
  by = "ID",
  all.x = TRUE
)
ID channel duration Total
1 100 A 12 0
2 101 C 23 0
3 103 C 56 2
4 105 C 89 1
5 108 D 73 3
6 109 D 76 0
Here is a tidyverse solution. Join by ID, reshape to long format and count the values not left blank.
df1 <- data.frame(ID = c(100,101,103,105,108,109),
channel = c("A","C","C","C","D","D"),
duration = c(12,23,56,89,73,76))
df2 <- data.frame(ID=c(100,103,109,105,101,108),
RM1= c("","john","","Miller","","Maddy"),
RM2 = c("","Ryan","","","","sean"),
RM3 = c("","","","","","Arvind"))
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
inner_join(df1, df2) %>%
pivot_longer(cols = starts_with("RM")) %>%
group_by(ID, channel, duration) %>%
summarise(Total = sum(value != ""), .groups = "drop")
#> Joining, by = "ID"
#> # A tibble: 6 x 4
#> ID channel duration Total
#> <dbl> <chr> <dbl> <int>
#> 1 100 A 12 0
#> 2 101 C 23 0
#> 3 103 C 56 2
#> 4 105 C 89 1
#> 5 108 D 73 3
#> 6 109 D 76 0
Created on 2022-03-23 by the reprex package (v2.0.1)

Create numerical discrete values if values in a column equal in R

I have a column of IDs in a dataframe that sometimes has duplicates, take for example,
ID
209
315
109
315
451
209
What I want to do is take this column and create another column that indicates what ID the row belongs to. i.e. I want it to look like,
ID ID Category
209 1
315 2
109 3
315 2
451 4
209 1
Essentially, I want to loop through the IDs: if an ID equals a previous one, I mark it as belonging to the same ID, and if it is a new ID, I create a new indicator for it.
Does anyone know whether there is a quick function in R for this? Or have any other suggestions?
Convert to factor with levels ordered with unique (order of appearance in the data set) and then to numeric:
data$IDCategory <- as.numeric(factor(data$ID, levels = unique(data$ID)))
#> data
# ID IDCategory
#1 209 1
#2 315 2
#3 109 3
#4 315 2
#5 451 4
#6 209 1
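A closely related idiom skips the factor step and uses match against the unique IDs directly. A minimal sketch, assuming the same six IDs as in the question:

```r
data <- data.frame(ID = c(209, 315, 109, 315, 451, 209))

# match() returns, for each ID, its position in the vector of unique IDs,
# which are kept in order of first appearance.
data$IDCategory <- match(data$ID, unique(data$ID))
data$IDCategory
```

Both versions number the IDs by order of first appearance; match returns integers rather than doubles.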
library(tidyverse)
data <- tibble(ID= c(209,315,109,315,451,209))
data %>%
left_join(
data %>%
distinct(ID) %>%
mutate(`ID Category` = row_number())
)
#> Joining, by = "ID"
#> # A tibble: 6 × 2
#> ID `ID Category`
#> <dbl> <int>
#> 1 209 1
#> 2 315 2
#> 3 109 3
#> 4 315 2
#> 5 451 4
#> 6 209 1
Created on 2022-03-10 by the reprex package (v2.0.0)
df <- df %>%
dplyr::mutate(`ID Category` = as.numeric(interaction(ID, drop=TRUE)))
Answer with data.table
library(data.table)
df <- as.data.table(df)
df <- df[
j = `ID Category` := as.numeric(interaction(ID, drop=TRUE))
]
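The advantage of this solution is that you can create a unique ID for a combination of variables. Here you only need ID, but if you wanted a unique ID for, say, each (ID, Location) pair, you could. Note that the numbers follow the order of the factor levels rather than the order of first appearance in the data.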
data <- tibble(ID= c(209,209,209,315,315,315), Location = c("A","B","C","A","A","B"))
data <- data %>%
dplyr::mutate(`ID Category` = as.numeric(interaction(ID, Location, drop=TRUE)))
Another way:
merge(data,
      data.frame(ID = unique(data$ID),
                 ID.Category = seq_along(unique(data$ID))),
      sort = FALSE)
# ID ID.Category
# 1 209 1
# 2 209 1
# 3 315 2
# 4 315 2
# 5 109 3
# 6 451 4
data:
tibble(ID = c(209,315,109,315,451,209)) -> data

R output BOTH maximum and minimum value by group in dataframe

Let's say I have a dataframe of Name and Value. Is there any way to extract BOTH the minimum and maximum values within each Name with a single function?
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range to get max and min value and use it in summarise to get different rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
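In dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). A sketch of the same range idea, assuming dplyr >= 1.1.0 and typing in the values shown in the question rather than relying on the seed:

```r
library(dplyr)

df <- tibble(
  Name = rep(LETTERS[1:3], each = 3),
  Value = c(27L, 37L, 57L, 89L, 20L, 86L, 97L, 62L, 58L)
)

# reframe() allows any number of rows per group and returns an ungrouped tibble;
# range() gives the minimum followed by the maximum.
res <- df %>%
  group_by(Name) %>%
  reframe(Value = range(Value))
res
```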
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
group_by(Name) %>%
summarise(
maximum = max(Value),
minimum = min(Value)
)
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
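What's a little odd is that my original df object looks different from yours, in spite of the seed. This is probably because R 3.6.0 changed the default sampling algorithm used by sample(), so the same seed gives different draws on older and newer versions: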
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best way or the most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
df %>% group_by(Name) %>% slice_min(Value)) %>%
arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply/stack: do a grouped tapply to get a named list of ranges, stack it into a two-column data.frame, and change the column names if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
Using aggregate.
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82

What is this arrange function in the second line doing here?

I'm currently working through R for Data Science and came across this chunk of code.
My question about it is as follows. I don't understand why the arrange function is needed here. Doesn't arrange just reorder the rows?
library(tidyverse)
library(nycflights13)
flights %>%
arrange(tailnum, year, month, day) %>%
group_by(tailnum) %>%
mutate(delay_gt1hr = dep_delay > 60) %>%
mutate(before_delay = cumsum(delay_gt1hr)) %>%
filter(before_delay < 1) %>%
count(sort = TRUE)
However, it does output differently with or without the arrange function, as shown below:
#with the arrange function
tailnum n
<chr> <int>
1 N954UW 206
2 N952UW 163
3 N957UW 142
4 N5FAAA 117
5 N38727 99
6 N3742C 98
7 N5EWAA 98
8 N705TW 97
9 N765US 97
10 N635JB 94
# ... with 3,745 more rows
and
#Without the arrange function
tailnum n
<chr> <int>
1 N952UW 215
2 N315NB 161
3 N705TW 160
4 N961UW 139
5 N713TW 128
6 N765US 122
7 N721TW 120
8 N5FAAA 117
9 N945UW 104
10 N19130 101
# ... with 3,774 more rows
I'd appreciate it if you can help me understand this. Why is it necessary to include the arrange function here?
Yes, arrange just orders the rows, but you then filter on a cumulative sum, which depends on row order, so the result changes.
Here is a simplified example to demonstrate how the output differs with and without arrange.
library(dplyr)
df <- data.frame(a = 1:5, b = c(7, 8, 9, 1, 2))
df %>% filter(cumsum(b) < 20)
# a b
#1 1 7
#2 2 8
df %>% arrange(b) %>% filter(cumsum(b) < 20)
# a b
#1 4 1
#2 5 2
#3 1 7
#4 2 8
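The same effect can be reproduced per group, which is closer to the flights pipeline in the question. This is a sketch on a made-up three-row data frame; flights_toy and count_before_first_long_delay are hypothetical names, not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data: one plane, three flights stored out of date order.
flights_toy <- data.frame(
  tailnum   = c("N1", "N1", "N1"),
  day       = c(3, 1, 2),
  dep_delay = c(10, 5, 90)
)

# Count, per plane, the rows that come before the first delay above 60 minutes.
count_before_first_long_delay <- function(data) {
  data %>%
    group_by(tailnum) %>%
    mutate(before_delay = cumsum(dep_delay > 60)) %>%
    filter(before_delay < 1) %>%
    count()
}

unsorted <- count_before_first_long_delay(flights_toy)
sorted <- count_before_first_long_delay(arrange(flights_toy, tailnum, day))
c(unsorted = unsorted$n, sorted = sorted$n)
```

Only the sorted version counts flights that happened before the first long delay in time; the unsorted version counts whatever rows happen to sit above it in the data.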
