How to collapse rows by identical values in a column - r

Good evening,
I have a two-column, tab-separated .txt file like the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse the rows that share the same value in the column "number", creating a comma-separated value in the corresponding column "letter".
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have searched the web but did not find a working solution.
Thank you in advance,
Giuseppe

We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
-output
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
group_by(number) %>%
summarise(letter = str_c(letter, collapse=","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
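Since the question starts from a tab-separated file, here is a minimal sketch of the full round trip, assuming the file has a header row. The inline string txt stands in for the file contents and the output name collapsed.txt is hypothetical; with a real file you would pass its path to read.delim instead of text =.

```r
# Hypothetical stand-in for the contents of the tab-separated input file.
txt <- "number\tletter\n1\ta\n1\tb\n2\ta\n2\tb\n3\tb"

# read.delim() defaults to sep = "\t" and header = TRUE.
df1 <- read.delim(text = txt, stringsAsFactors = FALSE)

# Collapse the letters within each number, separated by "," with no space.
res <- aggregate(letter ~ number, data = df1, FUN = paste, collapse = ",")
print(res)

# Write the result back out as tab-separated text (file name is an assumption).
# write.table(res, "collapsed.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```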

We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter~.,df,toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
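Note that toString separates elements with a comma followed by a space, so its result differs slightly from the "a,b" format requested in the question. A small sketch comparing the two (the names with_space and no_space are only for illustration):

```r
df <- data.frame(number = c(1L, 1L, 2L, 2L, 3L),
                 letter = c("a", "b", "a", "b", "b"),
                 stringsAsFactors = FALSE)

# toString() joins with ", " (comma plus space).
with_space <- aggregate(letter ~ number, df, toString)

# paste(..., collapse = ",") joins with a bare comma.
no_space <- aggregate(letter ~ number, df, paste, collapse = ",")

with_space$letter  # "a, b" "a, b" "b"
no_space$letter    # "a,b"  "a,b"  "b"
```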


Rename columns of a dataframe based on another dataframe except columns not in that dataframe in R

Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA; as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
By using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I managed to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7, which are positions in df2$Col1, not in df1 (df1 has only 5 columns).
To index the columns of df1, we should swap the arguments in match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This works.
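An alternative that avoids calling match twice is to compute the lookup once and only overwrite the names that were found. This is a minimal base R sketch; the helper names idx and found are introduced here for illustration.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(
  Col1 = c("H", "I", "K", "A", "B", "C", "D"),
  Col2 = c("a1", "a2", "a3", "E", "Q", "R", "Z"),
  stringsAsFactors = FALSE
)

# Position of each df1 column name in df2$Col1; NA where there is no match.
idx <- match(names(df1), df2$Col1)
found <- !is.na(idx)

# Rename only the matched columns; unmatched ones (here G) keep their name.
names(df1)[found] <- df2$Col2[idx[found]]
names(df1)
# "E" "Q" "R" "Z" "G"
```

Because the same idx is used on both sides of the assignment, the result does not depend on the row order of df2.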
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = all_of(df3$Col1), .fn = function(x) df3$Col2[match(x, df3$Col1)])
df4
# E Q R Z G
# 1 1 2 3 4 5

Using a vector as a grep pattern

I am new to R. I am trying to search the column names using grep multiple times within an apply loop. I use grep to specify which columns are summed, based on the vector individuals
individuals <-c("ID1","ID2".....n)
bcdata_total <- sapply(individuals, function(x) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
bcdata is of arbitrary size and contains arbitrary data, but its column names contain the entries of individuals as part of the string
>head(bcdata)
ID1-4 ID1-3 ID2-5
A 3 2 1
B 2 2 3
C 4 5 5
grep(individuals[1],colnames(bcdata_clean)) returns a vector that looks like
[1] 1 2, the indices of the columns whose names contain ID1. That vector is used to select the columns to be summed in bcdata_clean. This should occur n times, once per element of individuals
However this returns the error
In grep(individuals, colnames(bcdata)) :
argument 'pattern' has length > 1 and only the first element will be used
And results in all the columns of bcdata being identical
Ideally individuals would increment each time the function is run like this for each iteration
apply(bcdata_clean[,grep(individuals[1,2....n], colnames(bcdata_clean))], 1, sum)
and would result in something like this
>head(bcdata_total)
ID1 ID2
A 5 1
B 4 3
C 9 5
But I'm not sure how to increment individuals. What is the best way to do this within the function?
You can use split.default to split data on similarly named columns and sum them row-wise.
sapply(split.default(df, sub('-.*', '', names(df))), rowSums, na.rm = TRUE)
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))
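To see why the split.default approach works, it can help to look at the intermediate pieces. The sketch below uses the same data; the names key and parts are introduced here only for illustration.

```r
df <- data.frame(
  `ID1-4` = c(3L, 2L, 4L),
  `ID1-3` = c(2L, 2L, 5L),
  `ID2-5` = c(1L, 3L, 5L),
  row.names = c("A", "B", "C"),
  check.names = FALSE
)

# Strip everything from the first "-" onward to get one group label per column.
key <- sub("-.*", "", names(df))
key
# "ID1" "ID1" "ID2"

# split.default() treats the data frame as a list of columns and
# returns one smaller data frame per group label.
parts <- split.default(df, key)
names(parts)
# "ID1" "ID2"

# Row-wise sums within each group of columns.
totals <- sapply(parts, rowSums, na.rm = TRUE)
totals
```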
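Naming the function argument individuals instead of x fixed my issue: inside the function, individuals now refers to the single element that sapply passes in, not the whole vector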
bcdata_total <- sapply(individuals, function(individuals) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
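One caveat with a bare pattern such as "ID1" is that it would also match a column like ID10-1. The sketch below is a variant that anchors the pattern to the start of the name and includes the "-" separator; the ID10-1 column is hypothetical and added only to show the difference, and the argument is called id to avoid shadowing the vector.

```r
# Hypothetical data: the question's three columns plus an extra ID10 column.
bcdata_clean <- data.frame(
  `ID1-4`  = c(3L, 2L, 4L),
  `ID1-3`  = c(2L, 2L, 5L),
  `ID2-5`  = c(1L, 3L, 5L),
  `ID10-1` = c(7L, 7L, 7L),
  row.names = c("A", "B", "C"),
  check.names = FALSE
)
individuals <- c("ID1", "ID2")

bcdata_total <- sapply(individuals, function(id) {
  # "^ID1-" matches ID1-4 and ID1-3 but not ID10-1.
  cols <- grep(paste0("^", id, "-"), colnames(bcdata_clean))
  # drop = FALSE keeps a data frame even when only one column matches.
  rowSums(bcdata_clean[, cols, drop = FALSE])
})
bcdata_total
```

Using rowSums with drop = FALSE also avoids the failure that apply(..., 1, sum) would hit when an individual matches exactly one column and the subset collapses to a plain vector.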
An option with tidyverse
library(dplyr)
library(tidyr)
library(tibble)
df %>%
rownames_to_column('rn') %>%
pivot_longer(cols = -rn, names_to = c(".value", "grp"), names_sep="-") %>%
group_by(rn) %>%
summarise(across(starts_with('ID'), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
column_to_rownames('rn')
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))

Concatenate by Group [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 3 years ago.
Looking to concatenate a column of strings (separated by a ",") when grouped by another column. Example of raw data:
Column1 Column2
1 a
1 b
1 c
1 d
2 e
2 f
2 g
2 h
3 i
3 j
3 k
3 l
Results Needed:
Column1 Grouped_Value
1 "a,b,c,d"
2 "e,f,g,h"
3 "i,j,k,l"
I've tried using dplyr, but I seem to be getting the below as a result
Column1 Grouped_Value
1 "a,b,c,d,e,f,g,h,i,j,k,l"
2 "a,b,c,d,e,f,g,h,i,j,k,l"
3 "a,b,c,d,e,f,g,h,i,j,k,l"
summ_data <-
df_columns %>%
group_by(df_columns$Column1) %>%
summarise(Grouped_Value = paste(df_columns$Column2, collapse =","))
We can do this with aggregate
aggregate(Column2 ~ Column1, df1, toString)
Or with dplyr
library(dplyr)
df1 %>%
group_by(Column1) %>%
summarise(Grouped_value =toString(Column2))
# A tibble: 3 x 2
# Column1 Grouped_value
# <int> <chr>
#1 1 a, b, c, d
#2 2 e, f, g, h
#3 3 i, j, k, l
NOTE: toString is a wrapper for paste(., collapse=', ')
The issue in the OP's solution is that it pastes the whole column (df1$Column2 or df1[['Column2']] breaks the grouping and selects the whole column) instead of the grouped elements
data
df1 <- structure(list(Column1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), Column2 = c("a", "b", "c", "d", "e", "f", "g", "h",
"i", "j", "k", "l")), class = "data.frame", row.names = c(NA,
-12L))
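The difference between pasting the whole column and pasting per group can also be seen in base R alone. This is a small sketch using tapply, which is a swapped-in technique rather than one of the approaches above; the names whole, per_group and res are introduced for illustration.

```r
df1 <- data.frame(Column1 = rep(1:3, each = 4),
                  Column2 = letters[1:12],
                  stringsAsFactors = FALSE)

# Pasting the full column ignores any grouping: one long string.
whole <- paste(df1$Column2, collapse = ",")
whole
# "a,b,c,d,e,f,g,h,i,j,k,l"

# tapply() applies paste() separately to the values of each Column1 group.
per_group <- tapply(df1$Column2, df1$Column1, paste, collapse = ",")

res <- data.frame(Column1 = as.integer(names(per_group)),
                  Grouped_Value = as.vector(per_group),
                  stringsAsFactors = FALSE)
res$Grouped_Value
# "a,b,c,d" "e,f,g,h" "i,j,k,l"
```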
First commandment of dplyr
Don't use dollar signs in dplyr commands!
Use
group_by(Column1)
and
summarise(Grouped_Value = paste(Column2, collapse =","))

Count # of IDs that meet both criteria

I have a dataset that has two columns. One is userid, the other is company type, like below:
userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A
I want to know how many unique userid's there are that have company.type of A and B or A and C, (but not B and C).
I'm assuming it's some sort of aggregate function, but I'm not sure how to place the qualifier that company.type has to be A and B or A and C only.
We can do this with base R using table
tbl <- table(df1) > 0
sum(((tbl[, 1] & tbl[,2]) | (tbl[,1] & tbl[,3])) & (!(tbl[,2] & tbl[,3])))
#[1] 2
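The positional indices tbl[, 1], tbl[, 2], tbl[, 3] rely on the company types sorting as A, B, C. A sketch of the same logic with columns addressed by name may be easier to read; the helper has() is hypothetical and defined only for this example.

```r
df1 <- data.frame(userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
                  company.type = c("A", "A", "C", "B", "B", "B", "A"),
                  stringsAsFactors = FALSE)

# Logical matrix: one row per userid, one column per company type,
# TRUE where that user has at least one row of that type.
tbl <- table(df1$userid, df1$company.type) > 0

# Hypothetical helper: does each user have the given company type?
has <- function(type) tbl[, type]

keep <- ((has("A") & has("B")) | (has("A") & has("C"))) &
  !(has("B") & has("C"))

sum(keep)            # 2
rownames(tbl)[keep]  # "1" "2"
```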
Here's an idea with dplyr. setequal checks if two vectors are composed of the same elements, regardless of ordering:
library(dplyr)
df %>%
group_by(userid) %>%
summarize(temp = setequal(company.type, c("A", "B")) |
setequal(company.type, c("A", "C"))) %>%
pull(temp) %>%
sum()
# [1] 2
Data:
df <- structure(list(userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), company.type = c("A",
"A", "C", "B", "B", "B", "A")), .Names = c("userid", "company.type"
), class = "data.frame", row.names = c(NA, -7L))
See: Check whether two vectors contain the same (unordered) elements in R
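A few standalone calls illustrate how setequal behaves, which matters for how the condition is interpreted: order and duplicates are ignored, but a user with any extra company type does not match. The variable names below are only for illustration.

```r
same_any_order  <- setequal(c("B", "A"), c("A", "B"))       # TRUE
subset_only     <- setequal("A", c("A", "B"))               # FALSE
with_duplicates <- setequal(c("A", "A", "B"), c("A", "B"))  # TRUE
extra_type      <- setequal(c("A", "B", "C"), c("A", "B"))  # FALSE
```

So with this approach a user who has all of A, B and C is not counted, and neither is a user who has only A.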
Sort DF and reduce it to one row per userid with a types column consisting of a comma-separated string of company types. Then filter it using the indicated condition. Finally use tally to get the number of rows left after filtering. To get the details omit the tally line.
library(dplyr)
DF %>%
arrange(userid, company.type) %>%
group_by(userid) %>%
summarize(types = toString(company.type)) %>%
ungroup %>%
filter(grepl("A.*B|A.*C", types) & ! grepl("B.*C", types)) %>%
tally
giving:
# A tibble: 1 x 1
n
<int>
1 2
Note
The input used, in reproducible form, is:
Lines <- "userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A"
DF <- read.table(text = Lines, header = TRUE)

Count matching instances between two data frames [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I'm a newbie with R and can't find my answer/anything that works.
I've got two data frames that look like..
Teams
A
B
C
...
and
TCF
A
B
C
C
B
A
...
I need to count how many times each value in the first DF's column occurs in the second DF and return that count to the first DF. Thanks in advance!
You could use base R to do this:
sapply(unique(df1$Teams), function(x) sum(df2$TCF %in% x))
#A B C
#2 2 2
Or
setNames(table(match(df2$TCF, unique(df1$Teams))), unique(df1$Teams))
#A B C
#2 2 2
Or using data.table
library(data.table)
setkey(setDT(df1), Teams)
setkey(setDT(df2), TCF)
df2[J(unique(df1$Teams)),.N, by=.EACHI]
# TCF N
#1: A 2
#2: B 2
#3: C 2
data
df1 <- structure(list(Teams = c("A", "B", "C")), .Names = "Teams",
class = "data.frame", row.names = c(NA,-3L))
df2 <- structure(list(TCF = c("A", "B", "C", "C", "B", "A")), .Names = "TCF",
class = "data.frame", row.names = c(NA, -6L))
Would this option be easier on the eyes?
library(dplyr)
df2 %>% count(TCF) %>% filter(TCF %in% unique(df1$Teams))
# Source: local data frame [3 x 2]
# TCF n
# 1 A 2
# 2 B 2
# 3 C 2
Data
df1 <- structure(list(Teams = c("A", "B", "C")), .Names = "Teams", class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(TCF = structure(c(1L, 2L, 3L, 3L, 2L, 1L, 4L,
5L, 5L), .Label = c("A", "B", "C", "X", "Y"), class = "factor")), .Names = "TCF", row.names = c(NA,
-9L), class = "data.frame")
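The question also asks for the counts to be returned to the first data frame. A minimal base R sketch of that step, assuming the column name n for the new count column (an arbitrary choice), is to tabulate df2$TCF against the levels taken from df1$Teams:

```r
df1 <- data.frame(Teams = c("A", "B", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(TCF = c("A", "B", "C", "C", "B", "A", "X", "Y", "Y"),
                  stringsAsFactors = FALSE)

# Restricting the factor levels to df1$Teams drops X and Y and would
# report 0 for any team that never appears in df2.
counts <- table(factor(df2$TCF, levels = df1$Teams))

# Look the counts up by team name and store them in df1.
df1$n <- as.integer(counts[df1$Teams])
df1
```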
