Sort and concatenate values by group [duplicate] - r

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 6 years ago.
I've got a list of Groups and Names, as seen in DF below. I'm looking to arrange this list alphabetically and concatenate each name separated by a comma, as seen in DF2 below. I thought this would be simple, but it is proving to be more challenging than expected!
DF <- tibble::data_frame(
  Group = c(1, 1, 1, 2, 2, 3, 3, 3),
  Name = c("A", "B", "C", "B", "A", "B", "C", "A"))
DF2 <- tibble::data_frame(
  Group = c(1, 2, 3),
  Name = c("A, B, C", "A, B", "A, B, C"))
I'd appreciate any help in solving this to account for an unknown number of names listed per group, either with or without a dplyr pipeline.
Thanks!

We can use data.table
library(data.table)
setDT(DF)[order(Name), .(Comb = toString(Name)), by = Group]
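With the example DF, this returns (output added here for reference; spacing approximate):
#    Group    Comb
# 1:     1 A, B, C
# 2:     2    A, B
# 3:     3 A, B, C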

In base R:
aggregate(Name ~ Group, DF, function(x) paste0(sort(x), collapse = ","))
#  Group  Name
#1     1 A,B,C
#2     2   A,B
#3     3 A,B,C
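Since the question also asks about a dplyr pipeline, here is a minimal dplyr sketch (my addition, not part of the original answers):
library(dplyr)
DF %>%
  arrange(Name) %>%                                # sort names alphabetically first
  group_by(Group) %>%
  summarise(Name = paste(Name, collapse = ", "))   # one comma-separated string per group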

Related

How can I make a data.frame using chr vector [duplicate]

This question already has answers here:
aggregating unique values in columns to single dataframe "cell" [duplicate]
(2 answers)
Closed 2 years ago.
I have the data.frame below.
Chr Chr
A   E
A   F
A   E
B   G
B   G
C   H
C   I
D   E
I want to convert this dataset to the one below, collapsing all character values that share the same key into a single row:
chr chr
A E,F
B G
C H,I
D E
They are all character columns, so I tried several things to get the result I want.
First, I took the unique values of the key column, FILTER <- unique(chr[,15]), and tried to subset the data by those FILTER values, combining the results with rbind or bind_rows.
Second, I tested whether the idea works:
FILTER <- unique(Top[,15])
NN <- data.frame()
for (i in 1:nrow(FILTER)) {
  result = unique(Top10Data[TGT == FILTER[i]]$`NM`)
  print(result)
}
Up to this stage, it seems to work well.
The problem is that when I combine both steps, the resulting data frame has only one column and the other vector (the second variable in the data.frame above) is ignored entirely.
The functions only work for chr[1,1]; character vectors such as chr[1,n] cannot be coerced.
Here's my code for your reference:
FILTER <- unique(Top[,15])
NN <- data.frame()
for (i in 1:nrow(FILTER)) {
  CGONM <- rbind(NN, unique(Top10Data[TGT == FILTER[i]]$`NM`))
}
Base R solutions:
# Solution 1:
df_str_agg1 <- aggregate(var2 ~ var1, df, FUN = function(x) {
  paste0(unique(x), collapse = ",")
})

# Solution 2:
df_str_agg2 <- data.frame(
  do.call("rbind", lapply(split(df, df$var1), function(x) {
    data.frame(var1 = unique(x$var1),
               var2 = paste0(unique(x$var2), collapse = ","))
  })),
  row.names = NULL
)
Tidyverse solution:
library(tidyverse)
df_str_agg3 <- df %>%
  group_by(var1) %>%
  summarise(var2 = str_c(unique(var2), collapse = ",")) %>%
  ungroup()
Data:
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
                 var2 = c("E", "F", "E", "G", "G", "H", "I", "E"),
                 stringsAsFactors = FALSE)

How to replace parts of character cells in a list of dataframes in R

I have a list "L" of dataframes that looks like this (there are more than 2 dataframes in reality):
> L
[[1]]
VAR
1 "Ab", "B", "C", "Dd",
[[2]]
VAR
1 "Ee", "B", "Ab", "H",
I.e. each dataframe contains one variable called "VAR" with one observation that consists of a list of characters. I'm looking for a way to replace all characters that satisfy a given condition with a number. In the example above, I would like to replace all "Ab"s with the number 5 and all "B"s with the number 3. How can this be done so that it applies to every dataframe (i.e. all "A"s) in the list "L"? Thanks!
We can use chartr
lapply(L, function(x) transform(x, VAR = chartr('A', '5', VAR)))
#[[1]]
# VAR
#1 5, B, C, D
#[[2]]
# VAR
#1 E, F, 5, H
Update
We can use gsub to match a word that starts with 'A' followed by zero or more non-whitespace characters (\\S*) and replace it with 5.
lapply(L1, function(x) transform(x, VAR = gsub("\\bA\\S*", 5, VAR)))
If we are looking for an exact match, then replace A\\S* with \\bAb\\b
lapply(L1, function(x) transform(x, VAR = gsub("\\bAb\\b", 5, VAR)))
data
L <- list(data.frame(VAR = "A, B, C, D", stringsAsFactors=FALSE),
data.frame(VAR = "E, F, A, H", stringsAsFactors=FALSE))
L1 <- list(data.frame(VAR = "Ab, B, C, D", stringsAsFactors=FALSE),
data.frame(VAR = "E, F, Ab, H", stringsAsFactors=FALSE))
L <- list(data.frame(VAR = c("Ab", "B", "C", "D"), stringsAsFactors=FALSE),
data.frame(VAR = c("E", "F", "Ab", "H"), stringsAsFactors=FALSE))
you could also use purrr and replace
purrr::map(L, ~replace(.x,.x=="Ab",5))

Grouping factor levels in a data.table

I'm trying to combine factor levels in a data.table & wondering if there's a data.table-y way to do so.
Example:
library(data.table)
DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))
I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.
Here's what I've been doing, which has been quite slow in the full version of the problem:
DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]
Another approach, suggested by this related question, would I guess translate like so:
DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")
Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater)
DT[ , grp := ind]
lev <- letters[1:8]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?
For reference, I timed a beefed up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is slowest (having to run all those logic checks is costly), the second the fastest, and the third a close second. But I like the readability of that approach better.
(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods)
Update:
I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges, correspondence table, etc. necessary, just pass a named list to levels:
levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
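As a quick check (my addition, not part of the original answer), the factor should now carry only the three aggregated levels:
levels(DT$ind)
# [1] "A" "B" "C"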
Original Answer:
As suggested by #Arun we have the option of creating the correspondence as a separate data.table, then joining it to the original:
match_dt = data.table(ind = as.factor(1:12),
                      grp = as.factor(c("A", "B", "A", "B", "C", "C",
                                        "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
We can also do this in (what I consider to be) the more readable fashion like so (with marginal speed costs):
levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
                       grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]

How to calculate how many times a vector appears in a list in R

I have a list of 10,000 vectors, and each vector might have different elements and different lengths. I would like to know how many unique vectors I have and how often each unique vector appears in the list.
I guess the way to go is the function "unique", but I don't know how I could use it to also get the number of times each vector is repeated.
So what I would like to get is something like that:
"a" "b" "c" d" 301
"a" 277
"b" c" 49
being the letters, the contents of each unique vector, and the numbers, how often are repeated.
I would really appreciate any possible help on this.
thank you very much in advance.
Tina.
Maybe you should look at table:
Some sample data:
myList <- list(A = c("A", "B"),
               B = c("A", "B"),
               C = c("B", "A"),
               D = c("A", "B", "B", "C"),
               E = c("A", "B", "B", "C"),
               F = c("A", "C", "B", "B"))
Paste your vectors together and tabulate them.
table(sapply(myList, paste, collapse = ","))
#
# A,B A,B,B,C A,C,B,B B,A
#   2       2       1   1
You don't specify whether order matters (that is, whether A, B is the same as B, A). If order doesn't matter, you can sort before pasting:
table(sapply(myList, function(x) paste(sort(x), collapse = ",")))
#
# A,B A,B,B,C
#   3       3
Wrap this in data.frame for a vertical output instead of horizontal, which might be easier to read.
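For instance (a minimal sketch, not part of the original answer):
as.data.frame(table(sapply(myList, function(x) paste(sort(x), collapse = ","))))
#      Var1 Freq
# 1     A,B    3
# 2 A,B,B,C    3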
Also, do be sure to read How to make a great R reproducible example? as already suggested to you.
As it is, I'm just guessing at what you're trying to do.

Merge and paste duplicate columns in R

Suppose I have two data frames with some common variable x:
df1 <- data.frame(
  x = c(1, 2, 3, 4),
  y = c("a", "b", "c", "d")
)
df2 <- data.frame(
  x = c(1, 1, 2, 2, 3, 4, 5),
  z = c("A", "B", "C", "D", "E", "F", "G")
)
We can assume that each entry of the variable we're merging over, x, appears exactly once in df1; however, it may appear an arbitrary number of times in df2.
I want to merge df2 'into' df1, while preserving df1. Is there a fast way of merging these two data frames such that the merged output would be of the form (for example):
df_merged <- data.frame(
  x = c(1, 2, 3, 4),
  y = c("a", "b", "c", "d"),
  z = c("A B", "C D", "E", "F")
)
Essentially, I want df_merged to be a composition of the original df1, in addition to any variables in df2 coerced to match the format of df1. The various incantations of merge will append new rows to the merged output, which I want to avoid.
Speed is also a priority since I'll be merging fairly large data frames.
merge(df1,
      aggregate(df2$z, df2[1], FUN = paste, collapse = " ", sep = ""),
      by.x = "x", by.y = 1)
x y x
1 1 a A B
2 2 b C D
3 3 c E
4 4 d F
Warning message:
In merge.data.frame(df1, aggregate(df2$z, df2[1], FUN = paste, collapse = " ", :
column name ‘x’ is duplicated in the result
> M1 <- .Last.value
> names(M1)[3] <- "z"
> M1
x y z
1 1 a A B
2 2 b C D
3 3 c E
4 4 d F
Another option:
df2.z <- with(df2, tapply(z, x, paste, collapse=' '))
transform(df1, z=df2.z[match(x, names(df2.z))])
# x y z
# 1 1 a A B
# 2 2 b C D
# 3 3 c E
# 4 4 d F
If df1$x is in order, then use df2.z[names(df2.z) %in% x] in the transform statement.
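Spelled out, that alternative would be something like this (my reading of the suggestion; it assumes df1$x is sorted and every value of x has a match in df2):
transform(df1, z = df2.z[names(df2.z) %in% x])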
I'm submitting this question with my own potential answer, but it is fairly slow and I'm curious what other methods might be available.
by <- "x"
df2_processed <- as.data.frame(
sapply( names(df2), function(x) {
tapply( df2[[x]], df2[[by]], function(xx) {
if( x == by ) {
return(xx[1])
} else {
paste(xx, collapse=" ")
}
})
}), optional=TRUE, stringsAsFactors=FALSE )
merge( df1, df2_processed, all.x=TRUE )
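Since speed is a priority, a data.table sketch may also be worth timing (my addition, not one of the original answers; it assumes the data.table package):
library(data.table)
setDT(df1); setDT(df2)                                    # convert by reference
df2_agg <- df2[, .(z = paste(z, collapse = " ")), by = x] # collapse z within each x
df2_agg[df1, on = "x"][, .(x, y, z)]                      # join onto df1's rows, then reorder columns
This keeps exactly one row per value of df1$x, matching df_merged above.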
