I am trying to create a new column that will contain a result of calculations done rowwise over a subset of columns of a tibble, and add this new column to the existing tibble. Like so:
df <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3)
)
I effectively want to do a dplyr equivalent of this code from base R:
df$SumA <- rowSums(df[,grepl("^A", colnames(df))])
My problem is that this doesn't work:
df %>%
select(starts_with("A")) %>%
mutate(SumA = rowSums(.))
# some code here
...because I got rid of the "ID" column in order to let mutate run the rowSums over the other (numerical) columns. I have tried to cbind or bind_cols in the pipe after the mutate, but it doesn't work. None of the variants of mutate work, because they work in-place (within each cell of the tibble, and not across the columns, even with rowwise).
This does work, but doesn't strike me as an elegant solution:
df %>%
mutate(SumA = rowSums(.[,grepl("^A", colnames(df))]))
Is there any tidyverse-based solution that does not require grepl or square brackets but only more standard dplyr verbs and parameters?
My expected output is this:
df_out <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3),
SumA = c(6, 6, 6)
)
Best
kJ
Here's one way to approach row-wise computation in the tidyverse using purrr::pmap. This is best used with functions that actually need to be run row by row; simple addition could probably be done a faster way. Basically we use select to provide the input list to pmap, which lets us use the select helpers such as starts_with or matches if you need regex.
library(tidyverse)
df <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3)
)
df %>%
mutate(
SumA = pmap_dbl(
.l = select(., starts_with("A")),
.f = function(...) sum(...)
)
)
#> # A tibble: 3 x 5
#> ID A1 A2 A3 SumA
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 one 1 2 3 6
#> 2 two 1 2 3 6
#> 3 three 1 2 3 6
Created on 2019-01-30 by the reprex package (v0.2.1)
Here's a different approach that doesn't move rowwise but instead exploits the vectorised nature of addition and the fact that addition commutes. That lets us repeatedly apply + with purrr::reduce.
library(tidyverse)
df <- tibble(
ID = c("one", "two", "three"),
A1 = c(1, 1, 1),
A2 = c(2, 2, 2),
A3 = c(3, 3, 3)
)
df %>%
mutate(
SumA = reduce(
.x = select(., starts_with("A")),
.f = `+`
)
)
#> # A tibble: 3 x 5
#> ID A1 A2 A3 SumA
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 one 1 2 3 6
#> 2 two 1 2 3 6
#> 3 three 1 2 3 6
Created on 2019-01-30 by the reprex package (v0.2.1)
1) To do it with rowSums try nesting a second pipeline in the mutate like this:
library(dplyr)
df %>% mutate(Sum = select(., starts_with("A")) %>% rowSums)
giving:
# A tibble: 3 x 5
ID A1 A2 A3 Sum
<chr> <dbl> <dbl> <dbl> <dbl>
1 one 1 2 3 6
2 two 1 2 3 6
3 three 1 2 3 6
2) An alternative is to reshape it to long form and then summarize:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(Sum = gather(., key, value, -ID) %>%
group_by(., ID) %>%
summarize(sum = sum(value)) %>%
ungroup %>%
pull(sum))
giving:
# A tibble: 3 x 5
ID A1 A2 A3 Sum
<chr> <dbl> <dbl> <dbl> <dbl>
1 one 1 2 3 6
2 two 1 2 3 6
3 three 1 2 3 6
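One caveat with the reshaping approach: summarize returns its groups sorted by ID, so pulling the sums back by position relies on the summary order matching the original row order. Here is a minimal sketch, assuming dplyr 1.0 or later and tidyr 1.0 or later (where pivot_longer supersedes gather), that joins the sums back by ID instead. The intermediate name `sums` is only for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

# Reshape the A columns to long form and sum the values per ID
sums <- df %>%
  pivot_longer(starts_with("A"), names_to = "key", values_to = "value") %>%
  group_by(ID) %>%
  summarize(SumA = sum(value), .groups = "drop")

# Join by ID so the result does not depend on row order
df_out <- left_join(df, sums, by = "ID")
```

The join costs an extra step, but it keeps each sum attached to the right ID even when the IDs are not already in sorted order.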
[upd] I didn't notice that @Calum used nearly the same approach.
Another possible way to do that:
library(dplyr)
library(purrr)
dat %>%
mutate(SumA = pmap_dbl(select(., contains('A')), sum))
Data:
# dat <- tibble(
# ID = c("one", "two", "three"),
# A1 = c(1, 1, 1),
# A2 = c(2, 2, 2),
# A3 = c(3, 3, 3)
# )
Output:
# # A tibble: 3 x 5
# ID A1 A2 A3 SumA
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one 1 2 3 6
# 2 two 1 2 3 6
# 3 three 1 2 3 6
You could nest and use rowSums on the nested columns :
library(tidyverse)
df %>% nest(-ID) %>%
mutate(SumA = map_dbl(data,rowSums)) %>%
unnest
# # A tibble: 3 x 5
# ID SumA A1 A2 A3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one 6 1 2 3
# 2 two 6 1 2 3
# 3 three 6 1 2 3
Or this variant on the pmap approach :
df %>% mutate(SumA = pmap_dbl(.[-1],sum))
# # A tibble: 3 x 5
# ID A1 A2 A3 SumA
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one 1 2 3 6
# 2 two 1 2 3 6
# 3 three 1 2 3 6
And to show that base is sometimes easier :
df$SumA <- rowSums(df[-1])
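With dplyr 1.0.0 or later, across() can do the column selection inside mutate, so rowSums only sees the A columns while ID stays in the tibble. A short sketch under that version assumption:

```r
library(dplyr)

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

# across() returns a tibble of the selected columns, which rowSums() accepts
df_out <- df %>%
  mutate(SumA = rowSums(across(starts_with("A"))))
```

This keeps the select helpers (starts_with, matches, and so on) and needs neither grepl nor square brackets.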
I'm working on a survey research project in which we need to multiply one group of columns against another group of columns. I can do this easily in base R, but I can't figure out how to do it within a tidy, pipe-based workflow. I found many solutions for multiplying a set of columns against one column, but not for multiple columns. Any help is greatly appreciated.
The example below demonstrates how I solve the problem in base R.
library(tidyverse)
df <- tibble(
a1 = c(1, 2, 3, 4, 5),
a2 = c(5, 4, 3, 2, 1),
a3 = c(1, 5, 2, 4, 3),
b1 = c(1, 1, 2, 1, 1),
b2 = c(3, 3, 5, 4, 1),
b3 = c(2, 1, 4, 2, 1)
)
new_df <- df[, c("a1", "a2", "a3")] * df[, c("b1", "b2", "b3")]
names(new_df) <- c("c1", "c2", "c3")
Created on 2022-06-14 by the reprex package (v2.0.1)
Not everything needs to be tidy. The base R solution that you have works perfectly fine, is neat and scalable.
The same can be achieved with tidyverse but it would not be neat. Here's a way with across.
library(dplyr)
df %>%
mutate(across(starts_with('a'), .names = '{sub("a", "c", col)}') *
across(starts_with('b'))) %>%
select(starts_with('c'))
# c1 c2 c3
# <dbl> <dbl> <dbl>
#1 1 15 2
#2 2 12 5
#3 6 15 8
#4 4 8 8
#5 5 1 3
The base R code can also be written as -
df %>% select(starts_with('a')) * df %>% select(starts_with('b'))
You could stack the two groups of columns pairwise, multiply them together, and then pivot the long data back to wide.
library(tidyverse)
df %>%
rowid_to_column("id") %>%
pivot_longer(-id, names_to = c(".value", "set"), names_pattern = "(.)(.)") %>%
mutate(c = a * b) %>%
pivot_wider(id_cols = id, names_from = set, values_from = c, names_prefix = "c")
# # A tibble: 5 × 4
# id c1 c2 c3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 15 2
# 2 2 2 12 5
# 3 3 6 15 8
# 4 4 4 8 8
# 5 5 5 1 3
Here is a mostly tidyverse option, except for using split.default. We can split into a list of dataframes based on the ending, then we can use reduce to perform the calculation for each dataframe, then return as 1 single dataframe (and finally add the c prefix to the column names).
library(tidyverse)
df %>%
split.default(., sub("\\D+", "", names(.))) %>%
map_df(., ~ reduce(.x, .f = `*`)) %>%
rename_with( ~ paste0("c", .x))
Output
c1 c2 c3
<dbl> <dbl> <dbl>
1 1 15 2
2 2 12 5
3 6 15 8
4 4 8 8
5 5 1 3
I agree with @Ronak Shah that sometimes it's easier with base R functions, but here is the pipe solution you are asking for:
df <- tibble(
a1 = c(1, 2, 3, 4, 5),
a2 = c(5, 4, 3, 2, 1),
a3 = c(1, 5, 2, 4, 3),
b1 = c(1, 1, 2, 1, 1),
b2 = c(3, 3, 5, 4, 1),
b3 = c(2, 1, 4, 2, 1)
)
data.frame(df) %>%
mutate(c = select(cur_data(), a1:a3)*select(cur_data(), b1:b3)) %>%
invoke(.f = data.frame) %>%
rename_with(~str_remove(.,".a"))
The output looks like:
a1 a2 a3 b1 b2 b3 c1 c2 c3
1 1 5 1 1 3 2 1 15 2
2 2 4 5 1 3 1 2 12 5
3 3 3 2 2 5 4 6 15 8
4 4 2 4 1 4 2 4 8 8
5 5 1 3 1 1 1 5 1 3
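If the two groups are known to line up column by column, purrr::map2 can walk over both selections in parallel. This is a minimal sketch, assuming the a and b columns appear in matching order. The helper names `a_cols` and `b_cols` are only for illustration.

```r
library(dplyr)
library(purrr)

df <- tibble(
  a1 = c(1, 2, 3, 4, 5),
  a2 = c(5, 4, 3, 2, 1),
  a3 = c(1, 5, 2, 4, 3),
  b1 = c(1, 1, 2, 1, 1),
  b2 = c(3, 3, 5, 4, 1),
  b3 = c(2, 1, 4, 2, 1)
)

a_cols <- select(df, starts_with("a"))
b_cols <- select(df, starts_with("b"))

# Multiply each a column by its b counterpart; the list keeps the a names,
# so rename a1..a3 to c1..c3 afterwards
new_df <- map2(a_cols, b_cols, `*`) %>%
  as_tibble() %>%
  rename_with(~ sub("^a", "c", .x))
```

Because the pairing is positional, this only makes sense when both selections return the same number of columns in the same order.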
I'd like to remove the "(N)" from the column names.
Example data:
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
I got this far, but I don't know how to figure out the rest of the regex:
df %>%
rename_with(stringr::str_replace,
pattern = "[//(],N//)]", replacement = "")
But the "N" at the start of "Number (N)" is gone:
name id N) umber (N)
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
One liner: rename_with(df, ~str_remove_all(., ' \\(N\\)'))
or dplyr only: rename_with(df, ~sub(' \\(N\\)', '', .))
We could use the rename_with function from the dplyr package and apply a function (in this case str_remove_all from the stringr package).
And then use \\ to escape (:
library(dplyr)
library(stringr)
df %>%
rename_with(~str_remove_all(., ' \\(N\\)'))
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
A possible solution:
library(tidyverse)
df <- tibble(
name = c("A", "B", "C", "D"),
`id (N)` = c(1, 2, 3, 4),
`Number (N)` = c(3, 1, 2, 8)
)
df %>% names %>% str_remove("\\s*\\(N\\)\\s*") %>% set_names(df,.)
#> # A tibble: 4 × 3
#> name id Number
#> <chr> <dbl> <dbl>
#> 1 A 1 3
#> 2 B 2 1
#> 3 C 3 2
#> 4 D 4 8
Perhaps you can try
setNames(df, gsub("\\s\\(.*\\)", "", names(df)))
which gives
name id Number
<chr> <dbl> <dbl>
1 A 1 3
2 B 2 1
3 C 3 2
4 D 4 8
A simple solution is
colnames(df) <- gsub(" \\(N\\)", "", colnames(df))
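Since " (N)" is a literal string rather than a pattern, stringr's fixed() can match it as-is, which avoids escaping the parentheses altogether. A small sketch of that variant (the name `df_clean` is only for illustration):

```r
library(dplyr)
library(stringr)

df <- tibble(
  name = c("A", "B", "C", "D"),
  `id (N)` = c(1, 2, 3, 4),
  `Number (N)` = c(3, 1, 2, 8)
)

# fixed() treats " (N)" as plain text, so no regex escaping is needed
df_clean <- df %>%
  rename_with(~ str_remove(.x, fixed(" (N)")))
```

Column names that do not contain " (N)", such as name, pass through unchanged.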
I'm trying to see how to apply rowSums() to specific columns only.
here is a reprex:
df <- tibble(
"ride" = c("bicycle", "motorcycle", "car", "other"),
"A" = c(1, NA, 1, NA),
"B" = c(NA, 2, NA, 2)
)
I can get the desired result, by index[2:3]
df %>%
mutate(total = rowSums(.[2:3], na.rm = TRUE))
# A tibble: 4 × 4
ride A B total
<chr> <dbl> <dbl> <dbl>
1 bicycle 1 NA 1
2 motorcycle NA 2 2
3 car 1 NA 1
4 other NA 2 2
however, if I try specifying columns by name, strange results occur
df %>%
mutate(total = sum(c_across(c("A":"B")), na.rm = TRUE))
# A tibble: 4 × 4
ride A B total
<chr> <dbl> <dbl> <dbl>
1 bicycle 1 NA 6
2 motorcycle NA 2 6
3 car 1 NA 6
4 other NA 2 6
What am I doing wrong?
I can achieve what I want, by something like this:
df %>%
mutate_all(~replace(., is.na(.), 0)) %>%
mutate(total = A + B)
but I'd like to specify column names by passing a vector, so I can change to different combination of column names in future.
Something like this is what I'd like to achieve:
cols_to_sum <- c("A","B")
df %>%
mutate(total = sum(across(cols_to_sum), na.rm = TRUE))
You may use select to specify the columns you want to sum.
library(dplyr)
cols_to_sum <- c("A","B")
df %>%
mutate(total = rowSums(select(., all_of(cols_to_sum)), na.rm = TRUE))
# ride A B total
# <chr> <dbl> <dbl> <dbl>
#1 bicycle 1 NA 1
#2 motorcycle NA 2 2
#3 car 1 NA 1
#4 other NA 2 2
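c_across is meant to be used with rowwise(). Without it, sum() adds up every value in columns A and B at once (1 + 1 + 2 + 2 = 6), which is why every row shows 6. With rowwise: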
df %>%
rowwise() %>%
mutate(total = sum(c_across(all_of(cols_to_sum)), na.rm = TRUE)) %>%
ungroup
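To make the column set easy to swap later, the rowSums version can be wrapped in a small function. This is a sketch assuming dplyr 1.0.0 or later. `add_row_total` is a hypothetical helper name, not part of dplyr.

```r
library(dplyr)

df <- tibble(
  ride = c("bicycle", "motorcycle", "car", "other"),
  A = c(1, NA, 1, NA),
  B = c(NA, 2, NA, 2)
)

# Hypothetical helper: sum the named columns row by row, ignoring NAs
add_row_total <- function(data, cols) {
  data %>%
    mutate(total = rowSums(across(all_of(cols)), na.rm = TRUE))
}

cols_to_sum <- c("A", "B")
result <- add_row_total(df, cols_to_sum)
```

Using all_of() means a misspelled or missing column name raises an error instead of being silently skipped.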
Consider the following tibbles:
library(tidyverse)
tbl_base_ids = tibble(base_id = c("ABC", "ABCDEF", "ABCDEFGHI"), base_id_length = c(3, 6, 9), record_id_length = c(10, 12, 15))
tbl_records = tibble(record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456"))
I'd like to join matching rows to produce a third tibble:
tbl_records_with_base
record_id
base_id
base_id_length
record_id_length
As you can see, this is not a matter of joining one or more variables from each of the first two. This requires matching variable derivatives. In SQL, I'd do this:
SELECT A.record_id,
B.base_id,
B.base_id_length,
B.record_id_length
FROM tbl_records A
JOIN tbl_base_ids B
ON LENGTH(a.record_id) = B.record_id_length
AND LEFT(a.record_id, B.base_id_length) = B.base_id
I've tried variations of dplyr joins and using the match function, but to no avail. Can someone help? Thank you.
You should come up with some logic to separate base_id from record_id, because joining only on record_id_length would not be enough. For this example we can get base_id by removing all digits from record_id. Adjust this step to suit your actual dataset if needed.
Once we do that we can join tbl_records with tbl_base_ids by base_id and record_id_length.
library(dplyr)
tbl_records %>%
mutate(base_id = sub('\\d+', '', record_id),
record_id_length = nchar(record_id)) %>%
inner_join(tbl_base_ids, by = c("base_id", "record_id_length")) -> result
result
# record_id base_id record_id_length base_id_length
# <chr> <chr> <dbl> <dbl>
#1 ABC1234567 ABC 10 3
#2 ABCDEF123456 ABCDEF 12 6
#3 ABCDEFGHI123456 ABCDEFGHI 15 9
I suggest using the fuzzyjoin package.
library(dplyr)
library(fuzzyjoin)
tbl_base_ids %>%
mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
regex_full_join(tbl_records, ., by = c("record_id" = "record_ptn"))
# # A tibble: 3 x 5
# record_id base_id base_id_length record_id_length record_ptn
# <chr> <chr> <dbl> <dbl> <chr>
# 1 ABC1234567 ABC 3 10 ^ABC.{7}$
# 2 ABCDEF123456 ABCDEF 6 12 ^ABCDEF.{6}$
# 3 ABCDEFGHI123456 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$
A note about this: the order of tables matters, where the regex must reside on the RHS of the by= settings. For instance, this does not work if we reverse it:
tbl_base_ids %>%
mutate(record_ptn = sprintf("^%s.{%i}$", base_id, pmax(0, record_id_length - base_id_length))) %>%
regex_full_join(., tbl_records, by = c("record_ptn" = "record_id"))
# # A tibble: 6 x 5
# base_id base_id_length record_id_length record_ptn record_id
# <chr> <dbl> <dbl> <chr> <chr>
# 1 ABC 3 10 ^ABC.{7}$ <NA>
# 2 ABCDEF 6 12 ^ABCDEF.{6}$ <NA>
# 3 ABCDEFGHI 9 15 ^ABCDEFGHI.{6}$ <NA>
# 4 <NA> NA NA <NA> ABC1234567
# 5 <NA> NA NA <NA> ABCDEF123456
# 6 <NA> NA NA <NA> ABCDEFGHI123456
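The SQL in the question can also be mirrored fairly literally: build every record/base pair, then keep the pairs that satisfy both conditions. This is a minimal sketch using tidyr::expand_grid for the cross join. It assumes the tables are small enough for the full cross product to fit in memory.

```r
library(dplyr)
library(tidyr)

tbl_base_ids <- tibble(
  base_id = c("ABC", "ABCDEF", "ABCDEFGHI"),
  base_id_length = c(3, 6, 9),
  record_id_length = c(10, 12, 15)
)
tbl_records <- tibble(
  record_id = c("ABC1234567", "ABCDEF123456", "ABCDEFGHI123456")
)

# Cross join, then apply the two ON conditions from the SQL query
tbl_records_with_base <- expand_grid(tbl_records, tbl_base_ids) %>%
  filter(
    nchar(record_id) == record_id_length,               # LENGTH(record_id) = record_id_length
    substr(record_id, 1, base_id_length) == base_id     # LEFT(record_id, base_id_length) = base_id
  )
```

Unlike the regex approach, this does not depend on base_id being free of regex metacharacters, at the cost of materialising all pairs first.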
I want to sort the columns alphabetically by name in every data frame in a list of data frames.
Example data:
A <- c(1, 2, 3, 4)
B <- c(1, 2, 3, 4)
C <- c(1, 2, 3, 4)
df1 <- tibble(B, C, A)
df2<- tibble(C, B, A)
list_df1_2 <- list(df1, df2)
list_df1_2
# For example tried
list_df1_2_ordered <- purrr::map(list_df1_2, function(.x) order(colnames(x)))
For a base R option, you may use lapply on your list and then sort the columns of each data frame by column name:
list_df1_2 <- lapply(list_df1_2, function(x) x[ , order(names(x))])
With tidyverse, we can wrap the order within select
library(dplyr)
library(purrr)
list_df1_2 <- map(list_df1_2, ~ .x %>%
select(order(names(.))))
-output
list_df1_2
#[[1]]
# A tibble: 4 x 3
# A B C
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#[[2]]
# A tibble: 4 x 3
# A B C
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
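A close variant sorts the names themselves and passes them to select through all_of(), which reads a little more directly than ordering by position. A short sketch (the name `list_sorted` is only for illustration):

```r
library(dplyr)
library(purrr)

A <- c(1, 2, 3, 4)
B <- c(1, 2, 3, 4)
C <- c(1, 2, 3, 4)
df1 <- tibble(B, C, A)
df2 <- tibble(C, B, A)
list_df1_2 <- list(df1, df2)

# For each data frame, select its columns in alphabetical name order
list_sorted <- map(list_df1_2, ~ select(.x, all_of(sort(names(.x)))))
```

Each element of the list is handled independently, so the data frames do not need to share the same set of columns.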