Tidy Way to Multiply One Set of Columns Against Another Set in R

I'm working on a survey research project in which we need to multiply one group of columns against another group of columns. I can do this easily in base R, but I can't figure out how to do it within a tidy, pipe-based workflow. I found many solutions for multiplying a set of columns against one column, but not for multiple columns. Any help is greatly appreciated.
The example below demonstrates how I solve the problem in base R.
library(tidyverse)
df <- tibble(
a1 = c(1, 2, 3, 4, 5),
a2 = c(5, 4, 3, 2, 1),
a3 = c(1, 5, 2, 4, 3),
b1 = c(1, 1, 2, 1, 1),
b2 = c(3, 3, 5, 4, 1),
b3 = c(2, 1, 4, 2, 1)
)
new_df <- df[, c("a1", "a2", "a3")] * df[, c("b1", "b2", "b3")]
names(new_df) <- c("c1", "c2", "c3")
Created on 2022-06-14 by the reprex package (v2.0.1)

Not everything needs to be tidy. The base R solution that you have works perfectly fine, is neat and scalable.
The same can be achieved with tidyverse but it would not be neat. Here's a way with across.
library(dplyr)
df %>%
mutate(across(starts_with('a'), .names = '{sub("a", "c", col)}') *
across(starts_with('b'))) %>%
select(starts_with('c'))
# c1 c2 c3
# <dbl> <dbl> <dbl>
#1 1 15 2
#2 2 12 5
#3 6 15 8
#4 4 8 8
#5 5 1 3
The base R code can also be written as -
df %>% select(starts_with('a')) * df %>% select(starts_with('b'))
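If the column pairing should be spelled out rather than left to positional matching inside mutate, one possible sketch uses purrr::map2() over the two selections. It assumes the a* and b* columns appear in matching order; a_cols, b_cols, and products are illustrative names, not part of the answer above.

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tibble(
  a1 = c(1, 2, 3, 4, 5),
  a2 = c(5, 4, 3, 2, 1),
  a3 = c(1, 5, 2, 4, 3),
  b1 = c(1, 1, 2, 1, 1),
  b2 = c(3, 3, 5, 4, 1),
  b3 = c(2, 1, 4, 2, 1)
)

# Select the two groups of columns (assumed to be in matching order)
a_cols <- select(df, starts_with("a"))
b_cols <- select(df, starts_with("b"))

# Multiply the i-th "a" column by the i-th "b" column
products <- map2(a_cols, b_cols, `*`)

# Rename a1, a2, a3 to c1, c2, c3 and collect into a tibble
names(products) <- sub("^a", "c", names(a_cols))
new_df <- as_tibble(products)
```

Because the result is built from the selections only, new_df holds just the c columns, like the base R version in the question.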

You could stack the two groups of columns pairwise, multiply them together, and then pivot the long data back to wide.
library(tidyverse)
df %>%
rowid_to_column("id") %>%
pivot_longer(-id, names_to = c(".value", "set"), names_pattern = "(.)(.)") %>%
mutate(c = a * b) %>%
pivot_wider(id_cols = id, names_from = set, values_from = c, names_prefix = "c")
# # A tibble: 5 × 4
# id c1 c2 c3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 15 2
# 2 2 2 12 5
# 3 3 6 15 8
# 4 4 4 8 8
# 5 5 5 1 3

Here is a mostly tidyverse option, except for split.default. We split the columns into a list of data frames based on their numeric suffix, use reduce to multiply the columns within each data frame, and return the results as a single data frame (finally adding the c prefix to the column names).
library(tidyverse)
df %>%
split.default(., sub("\\D+", "", names(.))) %>%
map_df(., ~ reduce(.x, .f = `*`)) %>%
rename_with( ~ paste0("c", .x))
Output
c1 c2 c3
<dbl> <dbl> <dbl>
1 1 15 2
2 2 12 5
3 6 15 8
4 4 8 8
5 5 1 3
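To see what the split step does on its own, here is a small base R sketch on a made-up two-row data frame (the data and the names suffix, groups, prods, and result are assumptions for illustration only). split.default treats the data frame as a list of columns and groups them by the suffix.

```r
# Hypothetical toy data, not the data from the question
df <- data.frame(
  a1 = c(1, 2), a2 = c(3, 4),
  b1 = c(5, 6), b2 = c(7, 8)
)

# Strip the leading non-digits so "a1" and "b1" both map to "1"
suffix <- sub("\\D+", "", names(df))

# One data frame per suffix: list("1" = (a1, b1), "2" = (a2, b2))
groups <- split.default(df, suffix)

# Multiply the columns within each group
prods <- lapply(groups, function(g) Reduce(`*`, g))

# Collect the products and add the "c" prefix
result <- as.data.frame(prods)
names(result) <- paste0("c", names(groups))
```

This is the same idea as the pipeline above, with Reduce and lapply standing in for purrr's reduce and map_df.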

I agree with @Ronak Shah that sometimes it's easier with base R functions, but here is the pipe solution you are asking for:
df <- tibble(
a1 = c(1, 2, 3, 4, 5),
a2 = c(5, 4, 3, 2, 1),
a3 = c(1, 5, 2, 4, 3),
b1 = c(1, 1, 2, 1, 1),
b2 = c(3, 3, 5, 4, 1),
b3 = c(2, 1, 4, 2, 1)
)
data.frame(df) %>%
mutate(c = select(cur_data(), a1:a3)*select(cur_data(), b1:b3)) %>%
invoke(.f = data.frame) %>%
rename_with(~str_remove(.,".a"))
The output looks like:
a1 a2 a3 b1 b2 b3 c1 c2 c3
1 1 5 1 1 3 2 1 15 2
2 2 4 5 1 3 1 2 12 5
3 3 3 2 2 5 4 6 15 8
4 4 2 4 1 4 2 4 8 8
5 5 1 3 1 1 1 5 1 3

Related

How to a create a new dataframe of consolidated values from multiple columns in R

I have a dataframe, df1, that looks like the following:
sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
A      2        3        1        7        4        6
B      5        9        7        0        3        7
C      6        8        9        2        3        0
D      3        9        0        5        8        3
I want to consolidate the dataframe by summing the values based on animal present in the header row, i.e. by "Ape", "Cat", "Dog", and end up with the following dataframe:
sample Ape Cat Dog
A      3   10  10
B      12  9   10
C      15  10  3
D      3   14  11
I have created a list that represents all the animals called "animals_list"
I have then created a list of dataframes that subsets each animal into a separate dataframe with:
species_extract <- list()
for (i in 1:length(animals_list)){
species_extract[[i]] <- df1[, grep(animals_list[i], names(df1))]
}
I am then trying to sum each variable in the row by sample:
for (i in 1:length(species_extract)){
species_extract[[i]]$total <- rowSums(species_extract[[i]])
}
and then create a dataframe 'animal_total' by binding all values in the new 'total' column.
animal_total <- NULL
for (i in 1:length(species_extract)){
animal_total[i] <- cbind(species_extract[[i]]$total)
}
Unfortunately, this doesn't seem to work at all and I think I may have taken the wrong route. Any help would be really appreciated!
EDIT: my dataframe has over 300 animals, meaning incorporating use of my list of identifiers (animals_list) would be highly appreciated! I would also note that some column names do not follow the structure, "number_animal_number" and therefore I can't use a repetitive search (sorry!).
a data.table approach
library(data.table)
library(rlist)
#set data to data.table format
setDT(df1)
# split column 2:n by regex on column names
L <- split.default(df1[,-1], gsub(".*_(.*)_.*", "\\1", names(df1)[-1]))
# Bind together again
data.table(sample = df1$sample,
as.data.table(list.cbind(lapply(L, rowSums))))
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Update: After clarification:
This may work, depending on the other names of your animals, but it is a start:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(
cols = -sample
) %>%
mutate(name1 = str_extract(name, '(?<=\\_)(.*?)(?=\\_)')) %>%
group_by(sample, name1) %>%
summarise(sum=sum(value)) %>%
pivot_wider(
names_from = name1,
values_from= sum
)
Output:
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
First answer:
Here is how we could do it with dplyr:
library(dplyr)
df %>%
mutate(Cat = rowSums(select(., contains("Cat"))),
Ape = rowSums(select(., contains("Ape"))),
Dog = rowSums(select(., contains("Dog")))) %>%
select(sample, Ape, Cat, Dog)
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
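Since the question asks for something driven by animals_list rather than hard-coded names, here is a minimal base R sketch along those lines. It assumes animals_list is a character vector of animal names and that each name appears as a literal substring of the relevant column names; the three-animal list and the data below are reconstructed from the question for illustration.

```r
# Example data from the question; check.names = FALSE keeps names like "99_Ape_1"
df1 <- data.frame(
  sample = c("A", "B", "C", "D"),
  `99_Ape_1` = c(2, 5, 6, 3),
  `93_Cat_1` = c(3, 9, 8, 9),
  `87_Ape_2` = c(1, 7, 9, 0),
  `84_Cat_2` = c(7, 0, 2, 5),
  `90_Dog_1` = c(4, 3, 3, 8),
  `92_Dog_2` = c(6, 7, 0, 3),
  check.names = FALSE
)

# Assumed form of the asker's list of identifiers
animals_list <- c("Ape", "Cat", "Dog")

# For each animal, sum across all columns whose name contains that animal
totals <- sapply(animals_list, function(animal) {
  cols <- grepl(animal, names(df1), fixed = TRUE)
  rowSums(df1[, cols, drop = FALSE])
})

# Bind the per-animal totals back to the sample identifiers
animal_total <- data.frame(sample = df1$sample, totals)
```

With several hundred animals, plain substring matching can over-match (a name that is contained in another name would pick up both), so a stricter pattern may be needed for real data.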
An alternative data.table solution
library(data.table)
# Construct data table
dt <- as.data.table(list(sample = c("A", "B", "C", "D"),
`99_Ape_1` = c(2, 5, 6, 3),
`93_Cat_1` = c(3, 9, 8, 9),
`87_Ape_2` = c(1, 7, 9, 0),
`84_Cat_2` = c(7, 0, 2, 5),
`90_Dog_1` = c(4, 3, 3, 8),
`92_Dog_2` = c(6, 7, 0, 3)))
# Alternatively convert existing dataframe
# dt <- setDT(df)
# Use Regex pattern to drop ids from column names
names(dt) <- gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", names(dt))
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to columns)
dcast(dt, sample ~ variable)
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Alternatively, leaving the column names as is (after comment from OP to previous answer) and assuming that there are multiple observations of the same samples:
dt <- as.data.table(list(sample = c("A", "B", "C", "D", "A"),
`99_Ape_1` = c(2, 5, 6, 3, 1),
`93_Cat_1` = c(3, 9, 8, 9, 1),
`87_Ape_2` = c(1, 7, 9, 0, 1),
`84_Cat_2` = c(7, 0, 2, 5, 1),
`90_Dog_1` = c(4, 3, 3, 8, 1),
`92_Dog_2` = c(6, 7, 0, 3, 1)))
dt
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 2 3 1 7 4 6
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
# 5: A 1 1 1 1 1 1
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to columns)
dcast(dt, sample ~ variable)
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 3 4 2 8 5 7
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3

Ratio of largest value per row in dataframe in R

I have a dataframe somewhat similar to the one below (df). I need to add a new column indicating the ratio of the largest value for each row (= largest value in row divided by sum of all values in the row). The output should look similar to df1.
df <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2))
df1 <- data.frame('x' = c(1, 4, 1, 4, 1), 'y' = c(4, 6, 5, 2, 3), 'z' = c(5, 3, 2, 3, 2), 'ratio' = c(0.5, 0.462, 0.625, 0.444, 0.5))
Thank you!
Here is a solution using dplyr:
library(dplyr)
df %>%
rowwise() %>%
mutate(max_value = max(x,y,z),
sum_values = sum(x,y,z),
ratio = max_value / sum_values) #%>%
#select(-max_value, -sum_values) # uncomment this line if you want df1 as in your question
# A tibble: 5 x 6
x y z max_value sum_values ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
library(tidyverse)
df %>%
rowwise() %>%
mutate(MAX = max(x,y,z, na.rm = TRUE ),
SUM = sum(x,y,z, na.rm = TRUE),
ratio = MAX / SUM)
# A tibble: 5 x 6
x y z MAX SUM ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 5 5 10 0.5
2 4 6 3 6 13 0.462
3 1 5 2 5 8 0.625
4 4 2 3 4 9 0.444
5 1 3 2 3 6 0.5
Another option with rowSums and pmax
library(dplyr)
library(purrr)
df %>%
mutate(ratio = reduce(., pmax)/rowSums(.))
# x y z ratio
#1 1 4 5 0.5000000
#2 4 6 3 0.4615385
#3 1 5 2 0.6250000
#4 4 2 3 0.4444444
#5 1 3 2 0.5000000
Or in base R
df$ratio <- do.call(pmax, df)/rowSums(df)
Additional solution
df$ratio <- apply(df, 1, function(x) max(x, na.rm = TRUE) / sum(x, na.rm = TRUE))
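The rowwise answers above name x, y, and z explicitly. When there are many columns, a possible variant is dplyr's c_across(), which accepts a column selection inside a rowwise mutate. This is a sketch assuming dplyr 1.0 or later; out is an illustrative name.

```r
library(dplyr)

df <- data.frame(
  x = c(1, 4, 1, 4, 1),
  y = c(4, 6, 5, 2, 3),
  z = c(5, 3, 2, 3, 2)
)

out <- df %>%
  rowwise() %>%
  # c_across() gathers the selected columns of the current row into one vector
  mutate(ratio = max(c_across(x:z)) / sum(c_across(x:z))) %>%
  ungroup()
```

For large data the vectorised pmax and rowSums versions are likely faster, since rowwise evaluates the expression once per row.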

Using purrr map to apply function to selection of columns in DataFrame in dplyr pipeline

I have the following dataframe
test <- data.frame(x = c(6, 9, 3, NA),
y = c(3, NA, 2, 3),
z = c(6, 3, NA, 5),
h = c(NA, 6, 7, 2))
This is the list of columns i would like to iterate over
mylist <- list(test$y, test$z)
I want to change columns "y" and "z" based on the condition in ifelse
Here is my attempt...seems not to work
test <- test %>%
map_df(mylist, if(is.na(mylist), 0, 1))
(in reality i have a much larger dataframe, this is just test data)
Do I need to use mutate?
Can I use select in the pipeline? like this?
test <- test %>%
map_df(select(y, z), if(is.na(.), 0, 1))
Here is the expected output
test <- data.frame(x = c(6, 9, 3, NA),
y = c(1, 0, 1, 1),
z = c(1, 1, 0, 1),
h = c(NA, 6, 7, 2))
Thanks for the help
We can use mutate_at to specify columns
library(dplyr)
test %>% mutate_at(vars(y, z), ~as.integer(!is.na(.)))
# x y z h
#1 6 1 1 NA
#2 9 0 1 6
#3 3 1 0 7
#4 NA 1 1 2
Or if ifelse is preferred
test %>% mutate_at(vars(y, z), ~ifelse(is.na(.), 0, 1))
We can also do the same in base R
cols <- c("y", "z")
test[cols] <- as.integer(!is.na(test[cols]))
As the OP mentioned about map from purrr, use map_at
library(tidyverse)
test %>%
map_at(c('y', 'z'), ~ +(!is.na(.x))) %>%
bind_cols
# A tibble: 4 x 4
# x y z h
# <dbl> <int> <int> <dbl>
#1 6 1 1 NA
#2 9 0 1 6
#3 3 1 0 7
#4 NA 1 1 2
Or we can do this in base R
test[c('y','z')] <- +(!is.na(test[c('y', 'z')]))
test
# x y z h
#1 6 1 1 NA
#2 9 0 1 6
#3 3 1 0 7
#4 NA 1 1 2
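In current dplyr, mutate_at is superseded by across(), so the same recoding can be sketched as below. This assumes dplyr 1.0 or later; out is an illustrative name.

```r
library(dplyr)

test <- data.frame(
  x = c(6, 9, 3, NA),
  y = c(3, NA, 2, 3),
  z = c(6, 3, NA, 5),
  h = c(NA, 6, 7, 2)
)

out <- test %>%
  # Recode only y and z: 1 where a value is present, 0 where it is NA
  mutate(across(c(y, z), ~ as.integer(!is.na(.x))))
```

Columns x and h are left untouched, including their NA values.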

Using mutate rowwise over a subset of columns

I am trying to create a new column that will contain a result of calculations done rowwise over a subset of columns of a tibble, and add this new column to the existing tibble. Like so:
df <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3)
)
I effectively want to do a dplyr equivalent of this code from base R:
df$SumA <- rowSums(df[,grepl("^A", colnames(df))])
My problem is that this doesn't work:
df %>%
select(starts_with("A")) %>%
mutate(SumA = rowSums(.))
# some code here
...because I got rid of the "ID" column in order to let mutate run the rowSums over the other (numerical) columns. I have tried to cbind or bind_cols in the pipe after the mutate, but it doesn't work. None of the variants of mutate work, because they work in-place (within each cell of the tibble, and not across the columns, even with rowwise).
This does work, but doesn't strike me as an elegant solution:
df %>%
mutate(SumA = rowSums(.[,grepl("^A", colnames(df))]))
Is there any tidyverse-based solution that does not require grepl or square brackets but only more standard dplyr verbs and parameters?
My expected output is this:
df_out <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3),
SumA = c(6, 6, 6)
)
Best
kJ
Here's one way to approach row-wise computation in the tidyverse using purrr::pmap. This is best used with functions that actually need to be run row by row; simple addition could probably be done a faster way. Basically we use select to provide the input list to pmap, which lets us use the select helpers such as starts_with or matches if you need regex.
library(tidyverse)
df <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3)
)
df %>%
mutate(
SumA = pmap_dbl(
.l = select(., starts_with("A")),
.f = function(...) sum(...)
)
)
#> # A tibble: 3 x 5
#> ID A1 A2 A3 SumA
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 one 1 2 3 6
#> 2 two 1 2 3 6
#> 3 three 1 2 3 6
Created on 2019-01-30 by the reprex package (v0.2.1)
Here's a different approach that doesn't move rowwise but instead exploits the vectorised nature of addition and the fact that addition commutes. That lets us repeatedly apply + with purrr::reduce.
library(tidyverse)
df <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3)
)
df %>%
mutate(
SumA = reduce(
.x = select(., starts_with("A")),
.f = `+`
)
)
#> # A tibble: 3 x 5
#> ID A1 A2 A3 SumA
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 one 1 2 3 6
#> 2 two 1 2 3 6
#> 3 three 1 2 3 6
Created on 2019-01-30 by the reprex package (v0.2.1)
1) To do it with rowSums try nesting a second pipeline in the mutate like this:
library(dplyr)
df %>% mutate(Sum = select(., starts_with("A")) %>% rowSums)
giving:
# A tibble: 3 x 5
ID A1 A2 A3 Sum
<chr> <dbl> <dbl> <dbl> <dbl>
1 one 1 2 3 6
2 two 1 2 3 6
3 three 1 2 3 6
2) An alternative is to reshape it to long form and then summarize:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(Sum = gather(., key, value, -ID) %>%
group_by(., ID) %>%
summarize(sum = sum(value)) %>%
ungroup %>%
pull(sum))
giving:
# A tibble: 3 x 5
ID A1 A2 A3 Sum
<chr> <dbl> <dbl> <dbl> <dbl>
1 one 1 2 3 6
2 two 1 2 3 6
3 three 1 2 3 6
[upd] I didn't notice that @Calum used nearly the same approach.
Another possible way to do that:
library(dplyr)
library(purrr)
dat %>%
mutate(SumA = pmap_dbl(select(., contains('A')), sum))
Data:
# dat <- tibble(
# ID = c("one", "two", "three"),
# A1 = c(1, 1, 1),
# A2 = c(2, 2, 2),
# A3 = c(3, 3, 3)
# )
Output:
# # A tibble: 3 x 5
# ID A1 A2 A3 SumA
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one 1 2 3 6
# 2 two 1 2 3 6
# 3 three 1 2 3 6
You could nest and use rowSums on the nested columns :
library(tidyverse)
df %>% nest(-ID) %>%
mutate(SumA = map_dbl(data,rowSums)) %>%
unnest
# # A tibble: 3 x 5
# ID SumA A1 A2 A3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one 6 1 2 3
# 2 two 6 1 2 3
# 3 three 6 1 2 3
Or this variant on the pmap approach :
df %>% mutate(SumA = pmap_dbl(.[-1],sum))
# # A tibble: 3 x 5
# ID A1 A2 A3 SumA
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one 1 2 3 6
# 2 two 1 2 3 6
# 3 three 1 2 3 6
And to show that base is sometimes easier :
df$SumA <- rowSums(df[-1])
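For readers on a recent dplyr, one more sketch avoids the dot placeholder entirely by using pick(), which returns the selected columns as a data frame inside mutate. This assumes dplyr 1.1.0 or later (pick() is not available in earlier versions); out is an illustrative name.

```r
library(dplyr)
library(tibble)

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

out <- df %>%
  # pick() selects the A columns; rowSums() then adds them row by row
  mutate(SumA = rowSums(pick(starts_with("A"))))
```

The ID column stays in place because the selection happens inside mutate rather than in a separate select step.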

How to sum a substring reference

I'm attempting to select the correct column to sum the total of a from within a data frame column using ddply:
df2 <- ddply(df1,'col1', summarise, total = sum(substr(variable,1,3)))
It appears not to be working because you can't sum a character, but I am trying to pass the reference to the column, not sum the literal result of the substring. Is there a way to get around this?
Example Data & Desired output:
variable = "Aug 2017"
col1 Jun Jul Aug
1 A 1 2 3
2 A 1 2 3
3 A 1 2 3
4 A 1 2 3
5 A 1 2 3
6 B 2 3 4
7 B 2 3 4
8 B 2 3 4
9 C 3 4 5
10 C 3 4 5
Desired Output:
1 A 15
2 B 12
3 C 10
This works with dplyr instead of plyr.
# create data
df1 <- data.frame(
col1 = c(rep('A', 5), rep('B', 3), rep('C', 2)),
Jun = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
Jul = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
Aug = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5))
variable = 'Aug 2017'
# load dplyr library
library(dplyr)
# summarize each column that matches some string
df1 %>%
select(col1, matches(substr(variable, 1, 3))) %>%
group_by(col1) %>%
summarize(across(everything(), sum))
# A tibble: 3 × 2
col1 Aug
<fctr> <dbl>
1 A 15
2 B 12
3 C 10
I also highly recommend reading about nonstandard and standard evaluation, here:
http://adv-r.had.co.nz/Computing-on-the-language.html
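As a small illustration of passing a column reference held in a string, current dplyr offers the .data pronoun, which looks a column up by name. The sketch below assumes dplyr 0.8 or later; month_col and totals are illustrative names.

```r
library(dplyr)

df1 <- data.frame(
  col1 = c(rep("A", 5), rep("B", 3), rep("C", 2)),
  Jun = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Jul = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Aug = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5)
)
variable <- "Aug 2017"

# The first three characters give the column name, here "Aug"
month_col <- substr(variable, 1, 3)

totals <- df1 %>%
  group_by(col1) %>%
  # .data[[month_col]] refers to the column named by the string
  summarise(total = sum(.data[[month_col]]))
```

Unlike the select-and-summarise version, this names the result column total, as in the original ddply call.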
