Cumulative Sum of String - r

I have a table which looks like:
Order  Col A  Col B
1      a      2,3,4,5
2      a      3,5,6,7,8
3      a      1,2,4,9
4      a      3,5,7,11,12
Within each Col A group, I want to build a running (cumulative) union of the values in Col B. The output should look like the following:
Order  Col A  Col B        Col C
1      a      2,3,4,5      2,3,4,5
2      a      3,5,6,7,8    2,3,4,5,6,7,8
3      a      1,2,4,9      1,2,3,4,5,6,7,8,9
4      a      3,5,7,11,12  1,2,3,4,5,6,7,8,9,11,12
How can I get the desired output in R?

This ought to do it:
library(dplyr)
df %>%
  group_by(ColA) %>%
  mutate(
    result = strsplit(Colb, split = ","),                      # split each string into its parts
    result = lapply(result, as.numeric),                       # convert so that sorting is numeric
    result = Reduce(f = union, x = result, accumulate = TRUE), # running union down the group
    result = lapply(result, sort),
    result = sapply(result, paste, collapse = ",")
  ) %>%
  ungroup()
# # A tibble: 4 × 4
# Order ColA Colb result
# <int> <chr> <chr> <chr>
# 1 1 a 2,3,4,5 2,3,4,5
# 2 2 a 3,5,6,7,8 2,3,4,5,6,7,8
# 3 3 a 1,2,4,9 1,2,3,4,5,6,7,8,9
# 4 4 a 3,5,7,11,12 1,2,3,4,5,6,7,8,9,11,12
Using this data:
df = read.table(text = "Order ColA Colb
1 a '2,3,4,5'
2 a '3,5,6,7,8'
3 a '1,2,4,9'
4 a '3,5,7,11,12' ", header = T)
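The key step in the pipeline above is Reduce() with accumulate = TRUE. A minimal base R sketch on a toy list (the names sets, running and labels are made up for illustration) shows what that step produces on its own:

```r
# Toy input: each list element stands for the parsed values of one row
sets <- list(c(2, 3), c(3, 5), c(1, 2))

# accumulate = TRUE keeps every intermediate union, one per input element
running <- Reduce(union, sets, accumulate = TRUE)

# Sort each running set numerically and collapse it back into a string
labels <- vapply(running, function(s) paste(sort(s), collapse = ","), character(1))
print(labels)
```

Each element of running is the union of all sets seen so far, which is why the result lines up row by row with the original column.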

library(dplyr)
library(purrr)
library(stringr)
# assumes the string column is named ColB
df %>%
  group_by(ColA) %>%
  mutate(ColC = map_chr(accumulate(strsplit(ColB, ','), ~union(.x, .y)),
                        str_c, collapse = ','))
# A tibble: 4 × 4
# Groups: ColA [1]
Order ColA ColB ColC
<int> <chr> <chr> <chr>
1 1 a 2,3,4,5 2,3,4,5
2 2 a 3,5,6,7,8 2,3,4,5,6,7,8
3 3 a 1,2,4,9 2,3,4,5,6,7,8,1,9
4 4 a 3,5,7,11,12 2,3,4,5,6,7,8,1,9,11,12
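The output above keeps values in first-seen order because the pieces are unioned as strings and never sorted. If sorted output is wanted, one option is to convert to numeric before sorting, since sorting the strings directly would compare them as text. A small sketch, with parts as a made-up example vector:

```r
# Pieces as they come out of strsplit(): character, not numeric
parts <- c("2", "3", "11", "1", "9")

# Convert first so that 11 sorts after 9 rather than as text
sorted_values <- sort(as.numeric(parts))
sorted_label <- paste(sorted_values, collapse = ",")
print(sorted_label)
```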

Related

How to get all combinations of 2 from a grouped column in a data frame

I could write a loop to do this, but I was wondering how this might be done in R with dplyr. I have a data frame with two columns. Column 1 is the group, Column 2 is the value. I would like a data frame that has every combination of two values from each group in two separate columns. For example:
input = data.frame(col1 = c(1,1,1,2,2), col2 = c("A","B","C","E","F"))
input
#> col1 col2
#> 1 1 A
#> 2 1 B
#> 3 1 C
#> 4 2 E
#> 5 2 F
and have it return
output = data.frame(col1 = c(1,1,1,2), col2 = c("A","B","C","E"), col3 = c("B","C","A","F"))
output
#> col1 col2 col3
#> 1 1 A B
#> 2 1 B C
#> 3 1 C A
#> 4 2 E F
I'd like to be able to include it within dplyr syntax:
input %>%
group_by(col1) %>%
???
I tried writing my own function that produces a data frame of combinations like what I need from a vector and sent it into the group_map function, but didn't have success:
combos = function(x, ...) {
x = t(combn(x, 2))
return(as.data.frame(x))
}
input %>%
group_by(col1) %>%
group_map(.f = combos)
Produced an error.
Any suggestions?
You can do:
library(dplyr)
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
# col1 X1 X2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 A C
#3 1 B C
#4 2 E F
library(dplyr)
library(tidyr)
library(purrr)
input %>%
  group_by(col1) %>%
  nest(data = -col1) %>%
  mutate(out = map(data, ~ t(combn(unlist(.x), 2)))) %>%
  unnest(out) %>%
  select(-data)
# A tibble: 4 x 2
# Groups: col1 [2]
col1 out[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Or:
combos = function(x, ...) {
return(tibble(col1=x[[1,1]],col2=t(combn(unlist(x[[2]], use.names=F), 2))))
}
input %>%
group_by(col1) %>%
group_map(.f = combos, .keep=T) %>% invoke(rbind,.) %>% tibble
# A tibble: 4 x 2
col1 col2[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Thank you! In terms of parsimony, I like both the answer from Ben
input %>%
group_by(col1) %>%
do(data.frame(t(combn(.$col2, 2))))
and Ronak
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
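For comparison, the same per-group combn() idea can be sketched in base R with split() and lapply(). This is only an illustration under the assumption that every group has at least two values (combn() stops with an error otherwise); pairs_by_group and result are names chosen here.

```r
input <- data.frame(col1 = c(1, 1, 1, 2, 2),
                    col2 = c("A", "B", "C", "E", "F"),
                    stringsAsFactors = FALSE)

# For each group, build all pairs of values and tag them with the group id
pairs_by_group <- lapply(split(input, input$col1), function(g) {
  m <- t(combn(g$col2, 2))  # one row per pair
  data.frame(col1 = g$col1[1], col2 = m[, 1], col3 = m[, 2],
             stringsAsFactors = FALSE)
})

# Stack the per-group results into a single data frame
result <- do.call(rbind, pairs_by_group)
rownames(result) <- NULL
print(result)
```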

Transpose and sum distinct values in R

Is there a way to transpose and sum distinct values in R? For example:
df
Cola Order Quantity Loc
ABC 1 4 LocA
ABC 1 4 LocB
CSD 4 6 LocA
CDS 3 2 LocB
The two ABC rows have the same values for Order and Quantity, but they still need to be summed.
Expected Output (Transpose with respect to Quantity)
Cola  Order  Quantity  LocA_Quantity  LocB_Quantity
ABC   2      8         4              4
CSD   4      6         6
CDS   3      2                        2
Create the dataset:
library(tibble)
df = tribble(
~Cola, ~Order, ~Quantity, ~Loc,
'ABC', 1, 4, 'LocA',
'ABC', 1, 4, 'LocB',
'CSD', 4, 6, 'LocA',
'CDS', 3, 2, 'LocB'
)
Create the summaries:
library(dplyr)
df %>%
group_by(Cola) %>%
summarise(
Order = sum(Order),
LocA_Quantity = sum(Quantity * if_else(Loc == "LocA", 1, 0)),
LocB_Quantity = sum(Quantity * if_else(Loc == "LocB", 1, 0)),
Quantity = sum(Quantity)
)
You can do it for both Quantity and Order and drop the columns you don't want at the end, i.e.
library(tidyverse)
df %>%
group_by(Cola) %>%
mutate_at(vars(2:3), list(new = sum)) %>%
pivot_wider(names_from = Loc, values_from = 2:3)
## A tibble: 3 x 7
## Groups: Cola [3]
# Cola Order_new Quantity_new Order_LocA Order_LocB Quantity_LocA Quantity_LocB
# <fct> <int> <int> <int> <int> <int> <int>
#1 ABC 2 8 1 1 4 4
#2 CSD 4 6 4 NA 6 NA
#3 CDS 3 2 NA 3 NA 2
1) dplyr/tidyr Using the data shown reproducibly in the Note at the end, sum the orders and quantity and create a Quantity_ column equal to Quantity by Cola. Then reshape the Quantity_ column to wide form.
library(dplyr)
library(tidyr)
df %>%
group_by(Cola) %>%
mutate(Quantity_ = Quantity,
Order = sum(Order),
Quantity = sum(Quantity)) %>%
ungroup %>%
pivot_wider(names_from = "Loc", values_from = "Quantity_",
names_prefix = "Quantity_", values_fill = list(Quantity_ = 0))
giving:
# A tibble: 3 x 5
Cola Order Quantity Quantity_LocA Quantity_LocB
<chr> <int> <int> <int> <int>
1 ABC 2 8 4 4
2 CSD 4 6 6 0
3 CDS 3 2 0 2
2) Base R We can do much the same in base R using transform/ave and reshape like this:
df2 <- transform(df,
Quantity_ = Quantity,
Quantity = ave(Quantity, Cola, FUN = sum),
Order = ave(Order, Cola, FUN = sum))
wide <- reshape(df2, dir = "wide", idvar = c("Cola", "Quantity", "Order"),
timevar = "Loc", sep = "")
wide
## Cola Order Quantity Quantity_LocA Quantity_LocB
## 1 ABC 2 8 4 4
## 3 CSD 4 6 6 NA
## 4 CDS 3 2 NA 2
Note
Lines <- "Cola Order Quantity Loc
ABC 1 4 LocA
ABC 1 4 LocB
CSD 4 6 LocA
CDS 3 2 LocB"
df <- read.table(text = Lines, header = TRUE, as.is = TRUE)
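Another way to look at the task is as two separate summaries: per-group totals, and a Cola by Loc cross-tabulation of Quantity. The sketch below uses aggregate() and xtabs() from base R and rebuilds the data inline so that it stands alone; totals, by_loc and wide are names picked for this example, and the two location columns are hard-coded as an assumption.

```r
df <- data.frame(Cola = c("ABC", "ABC", "CSD", "CDS"),
                 Order = c(1, 1, 4, 3),
                 Quantity = c(4, 4, 6, 2),
                 Loc = c("LocA", "LocB", "LocA", "LocB"),
                 stringsAsFactors = FALSE)

# Per-group totals of Order and Quantity
totals <- aggregate(cbind(Order, Quantity) ~ Cola, data = df, FUN = sum)

# Quantity summed for every Cola/Loc pair; missing pairs come out as 0
by_loc <- xtabs(Quantity ~ Cola + Loc, data = df)

# Look rows up by name so both pieces line up regardless of ordering
keys <- as.character(totals$Cola)
wide <- data.frame(totals,
                   LocA_Quantity = as.vector(by_loc[keys, "LocA"]),
                   LocB_Quantity = as.vector(by_loc[keys, "LocB"]))
print(wide)
```

Unlike the reshape() version, combinations that do not occur are reported as 0 rather than NA here.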

Split Dataframe into list of one-row Dataframes

I would like to split a dataframe
df <- data.frame(a = 1:4, b = letters[1:4])
a b
1 1 a
2 2 b
3 3 c
4 4 d
into a list of one-row dataframes
list(
data.frame(a = 1, b = letters[1])
, data.frame(a = 2, b = letters[2])
, data.frame(a = 3, b = letters[3])
, data.frame(a = 4, b = letters[4])
)
[[1]]
a b
1 1 a
[[2]]
a b
1 2 b
[[3]]
a b
1 3 c
[[4]]
a b
1 4 d
Is there an elegant solution to this?
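Using dplyr (rowid_to_column() comes from tibble):
library(dplyr)
library(tibble)
df %>%
  rowid_to_column() %>%
  group_split(rowid, keep = FALSE)  # newer dplyr versions name this argument .keep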
[[1]]
# A tibble: 1 x 2
a b
<int> <fct>
1 1 a
[[2]]
# A tibble: 1 x 2
a b
<int> <fct>
1 2 b
[[3]]
# A tibble: 1 x 2
a b
<int> <fct>
1 3 c
[[4]]
# A tibble: 1 x 2
a b
<int> <fct>
1 4 d
Or:
df %>%
mutate(rowid = 1:n()) %>%
group_split(rowid, keep = FALSE)
Or a shortened version (provided by @arg0naut91):
group_split(df, row_number(), keep = FALSE)
A simple way would be to use the split() function built into R:
split(df, seq_len(nrow(df)))
Because it splits on the row index, it is unaffected by duplicate values in df$a.
Another option is asplit():
lapply(asplit(df, 1), as.data.frame.list)
#[[1]]
# a b
#1 1 a
#[[2]]
# a b
#1 2 b
#[[3]]
# a b
#1 3 c
#[[4]]
# a b
#1 4 d
Or with pmap
library(purrr)
pmap(df, tibble)
#[[1]]
# A tibble: 1 x 2
# a b
# <int> <fct>
#1 1 a
#[[2]]
# A tibble: 1 x 2
# a b
# <int> <fct>
#1 2 b
#[[3]]
# A tibble: 1 x 2
# a b
# <int> <fct>
#1 3 c
#[[4]]
# A tibble: 1 x 2
# a b
# <int> <fct>
#1 4 d
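Whichever approach is used, it is easy to check that the result really is a list of one-row data frames and that nothing was lost. A small sketch using the base split() approach (rows, one_row_each and restored are names made up for this check):

```r
df <- data.frame(a = 1:4, b = letters[1:4], stringsAsFactors = FALSE)

# Split on the row index so every piece holds exactly one row
rows <- split(df, seq_len(nrow(df)))

# Every element should be a data frame with a single row
one_row_each <- all(vapply(rows, nrow, integer(1)) == 1L)

# Stacking the pieces again should give back the original columns
restored <- do.call(rbind, rows)
print(one_row_each)
print(restored)
```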

Removing mirrored combinations of variables in a data frame

I'm looking to get each unique combination of two variables:
library(purrr)
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`)
# A tibble: 6 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 1 2
4 3 2
5 1 3
6 2 3
How do I remove the mirrored combinations? That is, I want only one of rows 1 and 3 in the data frame above, only one of rows 2 and 5, and only one of rows 4 and 6. My desired output would be something like:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2
I don't care if a particular id value is in id1 or id2, so the below is just as acceptable as the output:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 1 2
2 1 3
3 2 3
A tidyverse version of Dan's answer:
library(tidyverse)
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`) %>%
  mutate(min = pmap_int(., min), max = pmap_int(., max)) %>% # Find the min and max in each row
  unite(check, c(min, max), remove = FALSE) %>% # Combine them in a "check" variable
  distinct(check, .keep_all = TRUE) %>% # Remove duplicates of the "check" variable
  select(id1, id2)
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2
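A Base R approach (with df holding the id1/id2 pairs):
# create a string key from the sorted elements of the row;
# the "_" separator keeps multi-digit ids unambiguous (e.g. 1,12 vs 11,2)
df$temp <- apply(df, 1, function(x) paste(sort(x), collapse = "_"))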
# then you can simply keep rows with a unique sorted-string value
df[!duplicated(df$temp), 1:2]
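The same "sort each pair, then drop duplicates" idea can be written without building a string key by using pmin() and pmax(). The following is a self-contained sketch that uses expand.grid() in place of cross_df(); pairs, lo, hi and unique_pairs are names chosen for the example.

```r
# All ordered pairs of 1:3, minus the pairs where both ids are equal
pairs <- expand.grid(id1 = 1:3, id2 = 1:3)
pairs <- pairs[pairs$id1 != pairs$id2, ]

# The smaller and larger id of each row identify a pair regardless of order
lo <- pmin(pairs$id1, pairs$id2)
hi <- pmax(pairs$id1, pairs$id2)

# Keep only the first row seen for each (lo, hi) combination
unique_pairs <- pairs[!duplicated(data.frame(lo, hi)), ]
print(unique_pairs)
```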

How to filter rows for every column independently using dplyr

I have the following tibble:
library(tidyverse)
df <- tibble::tribble(
~gene, ~colB, ~colC,
"a", 1, 2,
"b", 2, 3,
"c", 3, 4,
"d", 1, 1
)
df
#> # A tibble: 4 x 3
#> gene colB colC
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 b 2 3
#> 3 c 3 4
#> 4 d 1 1
What I want to do is filter every column after the gene column,
keeping only values greater than or equal to 2 (>= 2). This should result in:
gene, colB, colC
a NA 2
b 2 3
c 3 4
How can I achieve that?
In practice there are more than two columns after gene.
One solution: convert from wide to long format, so you can filter on just one column, then convert back to wide at the end if required. Note that this will drop genes where no values meet the condition.
library(tidyverse)
df %>%
gather(name, value, -gene) %>%
filter(value >= 2) %>%
spread(name, value)
# A tibble: 3 x 3
gene colB colC
* <chr> <dbl> <dbl>
1 a NA 2
2 b 2 3
3 c 3 4
The forthcoming dplyr 0.6 (install from GitHub now, if you like) has filter_at, which can be used to keep any rows that have a value greater than or equal to 2; values below 2 can then be set to NA through mutate_at with replace, so
df %>%
  filter_at(vars(-gene), any_vars(. >= 2)) %>%
  mutate_at(vars(-gene), funs(replace(., . < 2, NA)))
#> # A tibble: 3 x 3
#> gene colB colC
#> <chr> <dbl> <dbl>
#> 1 a NA 2
#> 2 b 2 3
#> 3 c 3 4
or similarly,
df %>%
  mutate_at(vars(-gene), funs(replace(., . < 2, NA))) %>%
  filter_at(vars(-gene), any_vars(!is.na(.)))
which can be translated for use with dplyr 0.5:
df %>%
  mutate_at(vars(-gene), funs(replace(., . < 2, NA))) %>%
  filter(rowSums(is.na(.)) < (ncol(.) - 1))
All return the same thing.
We can use data.table
library(data.table)
setDT(df)[df[, Reduce(`|`, lapply(.SD, `>=`, 2)), .SDcols = colB:colC]
][, (2:3) := lapply(.SD, function(x) replace(x, x < 2, NA)), .SDcols = colB:colC][]
# gene colB colC
#1: a NA 2
#2: b 2 3
#3: c 3 4
Or with melt/dcast
dcast(melt(setDT(df), id.var = 'gene')[value>=2], gene ~variable)
# gene colB colC
#1: a NA 2
#2: b 2 3
#3: c 3 4
Alternatively we could also try the below code
df %>% rowwise %>%
filter(any(c_across(starts_with('col'))>=2)) %>%
mutate(across(starts_with('col'), ~ifelse(!(.>=2), NA, .)))
Created on 2023-02-05 with reprex v2.0.2
# A tibble: 3 × 3
# Rowwise:
gene colB colC
<chr> <dbl> <dbl>
1 a NA 2
2 b 2 3
3 c 3 4
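The answers above all follow the same two steps: blank out values below the threshold, then drop rows that have nothing left. A base R sketch of those two steps, assuming every column other than gene is numeric (value_cols, masked and result are names introduced here):

```r
df <- data.frame(gene = c("a", "b", "c", "d"),
                 colB = c(1, 2, 3, 1),
                 colC = c(2, 3, 4, 1),
                 stringsAsFactors = FALSE)

# Every column except gene is treated as a value column
value_cols <- setdiff(names(df), "gene")

# Step 1: set values below 2 to NA, column by column
masked <- df
masked[value_cols] <- lapply(df[value_cols], function(x) replace(x, x < 2, NA))

# Step 2: keep rows that still have at least one non-missing value
result <- masked[rowSums(!is.na(masked[value_cols])) > 0, ]
print(result)
```

Because value_cols is computed from the column names, the same code applies when there are more than two value columns.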
