Concatenate by Group [duplicate] - r

Looking to concatenate a column of strings (separated by a ",") when grouped by another column. Example of raw data:
Column1 Column2
1 a
1 b
1 c
1 d
2 e
2 f
2 g
2 h
3 i
3 j
3 k
3 l
Results Needed:
Column1 Grouped_Value
1 "a,b,c,d"
2 "e,f,g,h"
3 "i,j,k,l"
I've tried using dplyr (code shown after the output), but I get the following result instead:
Column1 Grouped_Value
1 "a,b,c,d,e,f,g,h,i,j,k,l"
2 "a,b,c,d,e,f,g,h,i,j,k,l"
3 "a,b,c,d,e,f,g,h,i,j,k,l"
summ_data <-
df_columns %>%
group_by(df_columns$Column1) %>%
summarise(Grouped_Value = paste(df_columns$Column2, collapse =","))

We can do this with aggregate
aggregate(Column2 ~ Column1, df1, toString)
Or with dplyr
library(dplyr)
df1 %>%
group_by(Column1) %>%
summarise(Grouped_value =toString(Column2))
# A tibble: 3 x 2
# Column1 Grouped_value
# <int> <chr>
#1 1 a, b, c, d
#2 2 e, f, g, h
#3 3 i, j, k, l
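NOTE: toString is a wrapper for paste(., collapse=', '), so the values are separated by a comma and a space. For the exact output in the question, use paste(Column2, collapse=",") instead.
The issue in the OP's solution is that it pastes the whole column: a $ reference such as df1$Column2 (or df1[['Column2']]) bypasses the grouping and selects the entire column instead of the elements of each group.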
data
df1 <- structure(list(Column1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), Column2 = c("a", "b", "c", "d", "e", "f", "g", "h",
"i", "j", "k", "l")), class = "data.frame", row.names = c(NA,
-12L))
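
If the separator has to be exactly "," with no space, as in the question, one base R option is to pass paste with collapse = "," through aggregate. This is only a sketch: the sample data is rebuilt here so the snippet runs on its own, and the rename to Grouped_Value is added for illustration.

```r
# Sample data rebuilt for a self-contained run (same values as df1 above)
df1 <- data.frame(Column1 = rep(1:3, each = 4),
                  Column2 = letters[1:12],
                  stringsAsFactors = FALSE)

# Extra arguments after FUN are passed on to paste for each group
res <- aggregate(Column2 ~ Column1, data = df1, FUN = paste, collapse = ",")

# Rename the aggregated column to match the requested output
names(res)[2] <- "Grouped_Value"
print(res)
```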

First commandment of dplyr
Don't use dollar signs in dplyr commands!
Use
group_by(Column1)
and
summarise(Grouped_Value = paste(Column2, collapse =","))

Related

How to collapse rows by identical values in a column

Good evening,
I have a two columns tab separated .txt file, as the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse rows where the column "number" has identical value, by creating a comma separated value in the corresponding column "letter".
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have searched the web but did not find a working solution.
Thank you in advance,
Giuseppe
We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
-output
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
group_by(number) %>%
summarise(letter = str_c(letter, collapse=","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter~.,df,toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
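
Since the question starts from a tab-separated .txt file, here is a rough end-to-end sketch in base R: read the data, collapse it, and write it back out tab-separated. The inline text stands in for the real file (an assumption for illustration); with an actual file you would pass its path to read.delim instead of text =.

```r
# Inline stand-in for the tab-separated input file (hypothetical content)
txt <- "number\tletter\n1\ta\n1\tb\n2\ta\n2\tb\n3\tb"

# read.delim defaults to header = TRUE and sep = "\t"
df_in <- read.delim(text = txt, stringsAsFactors = FALSE)

# Collapse the letters within each number
collapsed <- aggregate(letter ~ number, data = df_in, FUN = paste, collapse = ",")

# Print tab-separated to the console; set file = "some_path.txt" to save instead
write.table(collapsed, sep = "\t", quote = FALSE, row.names = FALSE)
```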

Using a vector as a grep pattern

I am new to R. I am trying to search the columns using grep multiple times within an apply loop. I use grep to specify which rows are summed based on the vector individuals
individuals <-c("ID1","ID2".....n)
bcdata_total <- sapply(individuals, function(x) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
bcdata is of random size and contains random data but contains columns that have individuals in part of the string
>head(bcdata)
ID1-4 ID1-3 ID2-5
A 3 2 1
B 2 2 3
C 4 5 5
grep(individuals[1],colnames(bcdata_clean)) returns a vector that looks like
[1] 1 2, the indices of the columns whose names contain ID1. That vector is used to select the columns to be summed in bcdata_clean. This should occur n times, once for each element of individuals
However this returns the error
In grep(individuals, colnames(bcdata)) :
argument 'pattern' has length > 1 and only the first element will be used
And results in all the columns of bcdata being identical
Ideally individuals would increment each time the function is run like this for each iteration
apply(bcdata_clean[,grep(individuals[1,2....n], colnames(bcdata_clean))], 1, sum)
and would result in something like this
>head(bcdata_total)
ID1 ID2
A 5 1
B 4 3
C 9 5
But I'm not sure how to increment individuals. What is the best way to do this within the function?
You can use split.default to split data on similarly named columns and sum them row-wise.
sapply(split.default(df, sub('-.*', '', names(df))), rowSums, na.rm = TRUE)
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))
Passing individuals as my argument in function(x) fixed my issue
bcdata_total <- sapply(individuals, function(individuals) {
apply(bcdata_clean[, grep(individuals, colnames(bcdata_clean)), drop = FALSE], 1, sum)
})
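
The underlying point is that grep takes a single pattern, so the loop has to hand it one ID at a time through the function argument. A minimal sketch of that idea follows, with a few assumptions of mine: the sample data is rebuilt from the question, the argument is called id, and the pattern is anchored as "^ID1-" so that ID1 would not also match a hypothetical ID10 column. drop = FALSE keeps a single matching column as a data frame so rowSums still works.

```r
# Sample data rebuilt from the question; check.names = FALSE keeps the hyphens
bcdata_clean <- data.frame(`ID1-4` = c(3, 2, 4),
                           `ID1-3` = c(2, 2, 5),
                           `ID2-5` = c(1, 3, 5),
                           row.names = c("A", "B", "C"),
                           check.names = FALSE)
individuals <- c("ID1", "ID2")

bcdata_total <- sapply(individuals, function(id) {
  # One pattern per call: column names that start with this ID plus a hyphen
  cols <- grep(paste0("^", id, "-"), colnames(bcdata_clean))
  # drop = FALSE keeps a data frame even when only one column matches
  rowSums(bcdata_clean[, cols, drop = FALSE])
})
print(bcdata_total)
```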
An option with tidyverse
library(dplyr)
library(tidyr)
library(tibble)
df %>%
rownames_to_column('rn') %>%
pivot_longer(cols = -rn, names_to = c(".value", "grp"), names_sep="-") %>%
group_by(rn) %>%
summarise(across(starts_with('ID'), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
column_to_rownames('rn')
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))

check if numbers in a column are ascending by a certain value (R dataframe)

I have a column of numbers (index) in a dataframe like the below. I am attempting to check if these numbers are in ascending order by the value of 1. For example, group B and C do not ascend by 1. While I can check by sight, my dataframe is thousands of rows long, so I'd prefer to automate this. Does anyone have advice? Thank you!
group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2
...
I think this works. diff calculates the difference between each pair of consecutive numbers, and then we can use all to check whether every difference is 1. dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
group_by(group) %>%
summarize(Result = all(diff(index) == 1)) %>%
ungroup()
dat2
# # A tibble: 3 x 2
# group Result
# <chr> <lgl>
# 1 A TRUE
# 2 B FALSE
# 3 C FALSE
DATA
dat <- read.table(text = "group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2",
header = TRUE, stringsAsFactors = FALSE)
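
To see what diff is doing here, a small base R sketch may help. It rebuilds the sample data and uses tapply for the grouping, which is my substitution and not part of the answer above. Note that a group with a single row has no differences at all, and all() of an empty vector is TRUE, so such a group would be reported as ascending.

```r
# Sample data rebuilt from the question
dat <- data.frame(group = rep(c("A", "B", "C"), times = c(5, 4, 4)),
                  index = c(0:4, 0, 1, 2, 2, 0, 3, 1, 2),
                  stringsAsFactors = FALSE)

# diff returns consecutive differences, e.g. for group B: 1 1 0
print(diff(dat$index[dat$group == "B"]))

# One TRUE/FALSE per group: TRUE only if every step is exactly 1
is_step1 <- tapply(dat$index, dat$group, function(v) all(diff(v) == 1))
print(is_step1)
```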
Maybe aggregate could help
> aggregate(.~group,df1,function(v) all(diff(v)==1))
group index
1 A TRUE
2 B FALSE
3 C FALSE
We can group by group, get the difference between the current and previous value (shift), and check if all the differences are equal to 1.
library(data.table)
setDT(df1)[, .(Result = all((index - shift(index))[-1] == 1)), group]
# group Result
#1: A TRUE
#2: B FALSE
#3: C FALSE
data
df1 <- structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "C", "C", "C", "C"), index = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 2L, 0L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-13L))

Count # of IDs that meet both criteria

I have a dataset that has two columns. One is userid, the other is company type, like below:
userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A
I want to know how many unique userid's there are that have company.type of A and B or A and C, (but not B and C).
I'm assuming it's some sort of aggregate function, but I'm not sure how to place the qualifier that company.type has to be A and B or A and C only.
We can do this with base R using table
tbl <- table(df1) > 0
sum(((tbl[, 1] & tbl[,2]) | (tbl[,1] & tbl[,3])) & (!(tbl[,2] & tbl[,3])))
#[1] 2
Here's an idea with dplyr. setequal checks if two vectors are composed of the same elements, regardless of ordering:
library(dplyr)
df %>%
group_by(userid) %>%
summarize(temp = setequal(company.type, c("A", "B")) |
setequal(company.type, c("A", "C"))) %>%
pull(temp) %>%
sum()
# [1] 2
Data:
df <- structure(list(userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), company.type = c("A",
"A", "C", "B", "B", "B", "A")), .Names = c("userid", "company.type"
), class = "data.frame", row.names = c(NA, -7L))
See: Check whether two vectors contain the same (unordered) elements in R
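
The same setequal idea can be sketched without dplyr by splitting company.type by userid. This is only an illustration in base R; the names types_by_user and ok are mine. As with the dplyr version, a user who has exactly A and B, or exactly A and C, is counted, and any other combination is not.

```r
# Sample data rebuilt from the question
df <- data.frame(userid = c(1, 2, 3, 1, 2, 3, 4),
                 company.type = c("A", "A", "C", "B", "B", "B", "A"),
                 stringsAsFactors = FALSE)

# One character vector of company types per user
types_by_user <- split(df$company.type, df$userid)

# TRUE when a user's set of types is exactly {A, B} or exactly {A, C}
ok <- vapply(types_by_user,
             function(tp) setequal(tp, c("A", "B")) || setequal(tp, c("A", "C")),
             logical(1))

print(sum(ok))
```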
Sort DF and reduce it to one row per userid with a types column consisting of a comma-separated string of company types. Then filter it using the indicated condition. Finally use tally to get the number of rows left after filtering. To get the details omit the tally line.
library(dplyr)
DF %>%
arrange(userid, company.type) %>%
group_by(userid) %>%
summarize(types = toString(company.type)) %>%
ungroup %>%
filter(grepl("A.*B|A.*C", types) & ! grepl("B.*C", types)) %>%
tally
giving:
# A tibble: 1 x 1
n
<int>
1 2
Note
The input used, in reproducible form, is:
Lines <- "userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A"
DF <- read.table(text = Lines, header = TRUE)

Iterate through grouped rows to get different pair combinations

Having the following table:
read.table(text = "route origin dest seq
1 a b 1
1 b c 2
1 c d 3
1 d e 4
2 f g 1
2 g h 2
2 h i 3", header = TRUE)
I'm trying to find a way of going through each row, grouped by route, and iterate every potential combination of origin destination pairs, taking into account the seq variable and the route as mentioned.
The output should look something like this:
origin dest
a b
a c
a d
a e
b c
b d
(...) (...)
The idea behind this is that a train on, e.g., route 1 goes from a to e, and I want to list every origin-destination pair that the train can serve along the way. I tried with igraph but without success.
Any ideas with dplyr or so?
library(dplyr)
library(tidyr)
df %>%
mutate_if(is.factor, as.character) %>% #convert factor variable to character
group_by(route) %>%
expand(origin = paste(origin, seq, sep = "_"), dest = paste(dest, seq, sep = "_")) %>% #all possible combination of origin & destination grouped by route
rowwise() %>%
filter(strsplit(origin, split = "_")[[1]][1] != strsplit(dest, split = "_")[[1]][1] &
strsplit(origin, split = "_")[[1]][2] <= strsplit(dest, split = "_")[[1]][2]) %>%
mutate(origin = gsub("_.*$", "", origin),
dest = gsub("_.*$", "", dest))
Output is:
route origin dest
1 1 a b
2 1 a c
3 1 a d
4 1 a e
5 1 b c
...
Sample data:
df <- structure(list(route = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), origin = structure(1:7, .Label = c("a",
"b", "c", "d", "f", "g", "h"), class = "factor"), dest = structure(1:7, .Label = c("b",
"c", "d", "e", "g", "h", "i"), class = "factor"), seq = c(1L,
2L, 3L, 4L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
# route origin dest seq
#1 1 a b 1
#2 1 b c 2
#3 1 c d 3
#4 1 d e 4
#5 2 f g 1
#6 2 g h 2
#7 2 h i 3
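
An alternative sketch in base R, under the assumption that the legs of each route chain together (each dest is the next row's origin once sorted by seq): build the ordered list of stops per route and take every pair of positions with combn. The helper names (pairs_by_route, stops, idx) are mine.

```r
df <- read.table(text = "route origin dest seq
1 a b 1
1 b c 2
1 c d 3
1 d e 4
2 f g 1
2 g h 2
2 h i 3", header = TRUE, stringsAsFactors = FALSE)

pairs_by_route <- lapply(split(df, df$route), function(r) {
  r <- r[order(r$seq), ]
  # Ordered stops: first origin followed by every destination
  stops <- c(r$origin[1], r$dest)
  # Each row of idx is a pair of positions (i, j) with i < j
  idx <- t(combn(length(stops), 2))
  data.frame(route = r$route[1],
             origin = stops[idx[, 1]],
             dest = stops[idx[, 2]],
             stringsAsFactors = FALSE)
})

pairs <- do.call(rbind, pairs_by_route)
rownames(pairs) <- NULL
print(head(pairs))
```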
