How to keep all columns when concatenating rows with dplyr::summarise? - r

I want to aggregate one column (C) in a data frame according to one grouping variable, A, separating the individual values with a comma, while keeping the other column, B. However, B can either hold a character (which is always the same within a group) or be empty. In this case, I would like to keep the character whenever it is present on at least one row.
Here is a simplified example:
data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = c("", "", "", "a" , "", "a"), C = c(5:10))
data
Based on this question Collapse / concatenate / aggregate a column to a single comma separated string within each group, I have the following code:
library(dplyr)
data %>%
group_by(A) %>%
summarise(test = toString(C)) %>%
ungroup()
Here is what I would like to obtain:
A B C
1 111 5,6,7
2 222 a 8,9,10

Use summarise_all()
To keep all your columns, you can use summarise_all():
data %>%
group_by(A) %>%
summarise_all(toString)
# A tibble: 2 x 3
A B C
<dbl> <chr> <chr>
1 111 1, 2, 1 5, 6, 7
2 222 2, 1, 2 8, 9, 10
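In dplyr 1.0.0 and later, summarise_all() is superseded by across(). The following is a minimal sketch of the same idea, assuming a recent dplyr and the example data from the question (with stringsAsFactors = FALSE so that B stays a character column); the name collapsed is only illustrative.

```r
library(dplyr)

data <- data.frame(A = c(rep(111, 3), rep(222, 3)),
                   B = c("", "", "", "a", "", "a"),
                   C = 5:10,
                   stringsAsFactors = FALSE)

# across(everything(), ...) applies toString to every non-grouping column
collapsed <- data %>%
  group_by(A) %>%
  summarise(across(everything(), toString))

collapsed
```

Note that B is collapsed as well, empty strings included, so this keeps every column but does not yet reduce B to a single value.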
Edit for updated question
You can add a B column to summarise to achieve the desired results:
data <- data.frame(A = c(rep(111, 3), rep(222, 3)), B = c("", "", "", "a" , "", "a"), C = c(5:10))
data
library(dplyr)
data %>%
group_by(A) %>%
summarise(B = names(sort(table(B),decreasing=TRUE))[1],
C = toString(C)) %>%
ungroup()
# A tibble: 2 x 3
A B C
<dbl> <fct> <chr>
1 111 "" 5, 6, 7
2 222 a 8, 9, 10
This returns the most frequent value in column B within each group, because sort(..., decreasing = TRUE) puts the value with the highest count first.
Hope this helps.
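One caveat with taking the most frequent value: if a group has more empty rows than filled ones (say "a", "", ""), the empty string wins. The sketch below instead keeps the first non-empty value of B in each group. It assumes the question's example data; first_non_empty and result are hypothetical names introduced here, not part of dplyr.

```r
library(dplyr)

data <- data.frame(A = c(rep(111, 3), rep(222, 3)),
                   B = c("", "", "", "a", "", "a"),
                   C = 5:10,
                   stringsAsFactors = FALSE)

# Hypothetical helper: first non-empty, non-NA value, or "" if there is none
first_non_empty <- function(x) {
  x <- as.character(x)
  hit <- x[!is.na(x) & x != ""]
  if (length(hit) > 0) hit[1] else ""
}

result <- data %>%
  group_by(A) %>%
  summarise(B = first_non_empty(B),
            C = toString(C))

result
```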

You could write one function to return unique values
library(dplyr)
get_common_vars <- function(x) {
if(n_distinct(x) > 1) unique(x[x !='']) else unique(x)
}
and then use it on all the columns that you are interested in:
data %>%
group_by(A) %>%
mutate(C = toString(C)) %>%
summarise_at(vars(B:C), get_common_vars)
# ^------ Include all columns here
# A tibble: 2 x 3
# A B C
# <dbl> <fct> <chr>
#1 111 "" 5, 6, 7
#2 222 a 8, 9, 10

You can also use the paste() function and leverage the collapse argument.
data %>%
group_by(A) %>%
summarise(
B = paste(unique(B), collapse = ""),
C = paste(C, collapse = ", "))
# A tibble: 2 x 3
A B C
<chr> <chr> <chr>
1 111 "" 5, 6, 7
2 222 a 8, 9, 10

Related

Reassigning labels using dplyr

Each ID records a series of signal labels: "alpha", "beta" and "unknown".
If an ID has only two distinct labels, I want to assign the dominant label to all of its rows. For example, if the recorded labels of an ID are
c("alpha", "alpha", "unknown"), they become c("alpha", "alpha", "alpha").
Can someone please help me with this?
library(tidyverse)
# Data preparation (you can directly work with the tbl below)
ID <- c(rep("A", 14), rep("B", 14), rep("C", 10), rep("D", 22), rep("E", 2))
series <- c(11, 3, 12, 2, 8, 2, 11, 8, 3, 2)
label <- unlist(
sapply(series, function(x) {case_when(x < 5 ~ rep("unknown", x),
x >= 5 ~ case_when(x > 10 ~ rep("alpha", x),
x <= 10 ~ rep("beta", x)) )
}))
# tbl
tbl <- tibble(ID = ID,
label = label)
If I understood it correctly, from this
tbl %>% group_by(ID) %>% summarise(n_distinct(label))
1 A 2
2 B 2
3 C 2
4 D 3
5 E 1
We want to update the labels for IDs A, B and C but not D or E. We can use the table function to get the most frequent label within each of those IDs.
tbl2 <- tbl %>%
group_by(ID) %>%
mutate(label = if(n_distinct(label) == 2) names(which.max(table(label))) else label)
Which now gives the number of distinct labels per ID
tbl2 %>% group_by(ID) %>% summarise(n_distinct(label))
ID `n_distinct(label)`
<chr> <int>
1 A 1
2 B 1
3 C 1
4 D 3
5 E 1
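The same rule can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. This is only an illustration on a small made-up vector, not the tbl from the question; dominant_if_two and relabelled are hypothetical names.

```r
# Small illustrative data (not the question's tbl)
ID    <- c("A", "A", "A", "B", "B", "B", "B", "C", "D", "D")
label <- c("alpha", "alpha", "unknown",
           "beta", "unknown", "alpha", "beta",
           "beta",
           "beta", "alpha")

# Hypothetical helper: replace all labels by the most frequent one,
# but only when the group has exactly two distinct labels
dominant_if_two <- function(x) {
  if (length(unique(x)) == 2) {
    rep(names(which.max(table(x))), length(x))
  } else {
    x
  }
}

relabelled <- ave(label, ID, FUN = dominant_if_two)
relabelled
```

Group D shows the tie behaviour: table() orders the labels alphabetically and which.max() returns the first maximum, so a tie goes to the alphabetically first label ("alpha" here).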

How to group the data by id and get unique values of all columns in R?

I have a table with an ID column and other columns. I want to group the data by ID and get the unique values across all the other columns.
From the table above, group by ID and get unique(Alt1, Alt2, Alt3).
The result should be in vector form:
A -> 1,2,3,5
B ->1,3,4,5,7
We can get data in long format and for each ID make a list of unique values.
library(dplyr)
library(tidyr)
df1 <- df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>%
summarise(value = list(unique(value))) %>%
unnest(value)
df1
# ID value
# <fct> <dbl>
# 1 A 1
# 2 A 3
# 3 A 2
# 4 A 5
# 5 B 1
# 6 B 4
# 7 B 5
# 8 B 3
# 9 B 6
#10 B 7
We can store it as a list if needed using split.
split(df1$value, df1$ID)
#$A
#[1] 1 3 2 5
#$B
#[1] 1 4 5 3 6 7
data.table equivalent of the above would be :
library(data.table)
setDT(df)
df2 <- melt(df, id.vars = 'ID')[, .(value = list(unique(value))), ID]
The unique values are stored in df2$value as a list column, with one vector per ID.
data
df <- data.frame(ID = c('A', 'A', 'B', 'B'),
Alt1 = c(1, 2, 1, 3),
Alt2 = c(3, 5, 4, 6),
Alt3 = c(1, 3, 5, 7))
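For comparison, here is a base R sketch that needs no packages and returns the sorted unique values per ID as a named list, which is close to the vector form asked for. It assumes the df defined above; value_cols and unique_by_id are illustrative names.

```r
df <- data.frame(ID = c('A', 'A', 'B', 'B'),
                 Alt1 = c(1, 2, 1, 3),
                 Alt2 = c(3, 5, 4, 6),
                 Alt3 = c(1, 3, 5, 7))

# Every column except ID contributes values
value_cols <- setdiff(names(df), "ID")

# Split the value columns by ID, flatten each piece, keep sorted unique values
unique_by_id <- lapply(
  split(df[value_cols], df$ID),
  function(g) sort(unique(unlist(g, use.names = FALSE)))
)

unique_by_id
```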

dplyr approach for sumifs like excel

I have a key in tableA, and in tableB I have a key and a numeric column. How can I reproduce the Excel formula sumifs(numeric,tableB.key,tableA.key,tableA.key,1)
with dplyr without joining the two tables?
I already tried summarise_if within mutate:
mutate(newColumn = summarise_if(tableB, .predicate = tableB$Key == .$Key, .funs = sum(tableB$numeric)))
but i get this error
In tableB$Key == .$Key:
longer object length is not a multiple of shorter object length
tableA tableB
key key numeric
1 1 10
2 1 30
3
4
Expected
key newColumn
1 40
2
3
4
you could try
library(tidyverse)
tableA <- tibble(key = c(1, 2, 3, 4))
tableB <- tibble(key = c(1, 1, 2, 2),
numeric = c(10, 30, 10, 15))
(function(){
tmpDF <- tableB %>%
filter(key %in% tableA$key) %>%
group_by(key) %>%
summarise(newColumn = sum(numeric))
tableA %>%
# match() looks up each tableA key in tmpDF; keys without a match become 0
mutate(new = coalesce(tmpDF$newColumn[match(key, tmpDF$key)], 0))
})()
which gives
# A tibble: 4 x 2
# key new
# <dbl> <dbl>
# 1 40
# 2 25
# 3 0
# 4 0
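A more literal translation of SUMIF is to compute, for each key in tableA, the sum of the matching rows in tableB. The sketch below does this in base R on the same example tables, built as plain data frames; sum_if is a hypothetical helper name. It scans tableB once per key, so it is meant for illustration rather than for large tables.

```r
tableA <- data.frame(key = c(1, 2, 3, 4))
tableB <- data.frame(key = c(1, 1, 2, 2),
                     numeric = c(10, 30, 10, 15))

# Hypothetical helper: sum of tableB$numeric over rows whose key equals k
sum_if <- function(k) sum(tableB$numeric[tableB$key == k])

# One sum per tableA key; keys absent from tableB give 0
tableA$newColumn <- vapply(tableA$key, sum_if, numeric(1))

tableA
```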

Remove exact rows and frequency of rows of a data.frame that are in another data.frame in r

Consider the following two data.frames:
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])
I would like to remove the exact rows of a1 that are in a2 so that the result should be:
A B
4 d
5 e
4 d
2 b
Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?
The idea is to add a counter for duplicates to each table, so that each occurrence of a row gets a unique match. data.table is convenient because it makes counting duplicates easy (with .N) and also provides the set-operation function needed here (fsetdiff).
library(data.table)
a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])
# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]
# fsetdiff() returns the rows of a1 that have no match in a2
# "all = TRUE" allows duplicate rows to be returned
fsetdiff(a1, a2, all = TRUE)
# A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.
library(dplyr)
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)
## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <-
a1 %>%
group_by(A, B) %>%
mutate(tmp_id = row_number()) %>%
ungroup()
# Create a count
a2_tmp <-
a2 %>%
group_by(A, B) %>%
summarise(count = n()) %>%
ungroup()
## Keep all rows that have no entry in a2, or whose id > the count (i.e. the a2 entries are used up).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
ungroup() %>% filter(is.na(count) | tmp_id > count) %>%
select(-tmp_id, -count)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
EDIT
Here is a similar solution that is a little shorter. It (1) adds a row-number column to both data frames, counted within each (A, B) group, so that the join matches each occurrence of a row, and (2) adds a temporary column to a2 that comes back as NA after the join for rows found only in a1.
library(dplyr)
left_join(a1 %>% group_by(A,B) %>% mutate(rn = row_number()) %>% ungroup(),
a2 %>% group_by(A,B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
by = c('A', 'B', 'rn')) %>%
filter(is.na(tmpcol)) %>%
select(-tmpcol, -rn)
## # A tibble: 4 x 2
## A B
## <dbl> <chr>
## 1 4 d
## 2 5 e
## 3 4 d
## 4 2 b
I think this solution is a little simpler (perhaps very little) than the first.
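The occurrence-counter idea can also be written with anti_join(), which drops the matched rows directly so that no temporary flag column is needed. This is a sketch under the same assumptions as above (a1 and a2 built with stringsAsFactors = FALSE); add_occurrence and remaining are hypothetical names.

```r
library(dplyr)

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)],
                 stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)],
                 stringsAsFactors = FALSE)

# Hypothetical helper: number each occurrence of an (A, B) pair: 1, 2, 3, ...
add_occurrence <- function(d) {
  d %>%
    group_by(A, B) %>%
    mutate(rn = row_number()) %>%
    ungroup()
}

# Keep the rows of a1 whose (A, B, occurrence) triple does not appear in a2
remaining <- anti_join(add_occurrence(a1), add_occurrence(a2),
                       by = c("A", "B", "rn")) %>%
  select(-rn)

remaining
```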
I guess this is similar to DWal's solution but in base R
a1_temp = Reduce(paste, a1)
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))
a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))
a1[!a1_temp %in% a2_temp,]
# A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
Here's another solution with dplyr:
library(dplyr)
a1 %>%
arrange(A) %>%
group_by(A) %>%
filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))
Result:
# A tibble: 4 x 2
# Groups: A [3]
A B
<dbl> <fctr>
1 2 b
2 4 d
3 4 d
4 5 e
This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
with the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the $dg variable as T providing that rank is different for each unique id and F if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for a given id. I'm looking to say, for a given ID which has anywhere between 1 and 13 instances, mark dg TRUE if rank differs across instances.
Update: How to identify groups (ids) that only have one rank?
After clarification that OP provided this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
For another data set (df2, defined below), in which one id has both duplicated and non-duplicated ranks, the output is:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
#:OUTPUT:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data-no-2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, F,T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
A data.table solution would be:
library(data.table)
dt <- data.table(df)
# dg is TRUE when the (id, rank) pair occurs exactly once
dt[, dg := .N == 1, by = .(id, rank)]
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N. B. Unless you want a different identifier rather than TRUE/FALSE, using ifelse() is redundant and costs computationally. #DavidArenburg
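As for why the loops in the question always leave dg FALSE: the loop variable i is never used, df$dg == T is a comparison rather than an assignment, and the value returned by ifelse() is discarded. A minimal base R sketch of a loop that does what the comments describe (TRUE when an id has more than one distinct rank, or only a single row) could look like this; rows and ranks are illustrative names.

```r
df <- data.frame(id   = c(1, 2, 2, 3, 3, 4),
                 rank = c(1, 2, 1, 3, 3, 4))

df$dg <- logical(nrow(df))

for (i in unique(df$id)) {
  rows  <- df$id == i      # rows belonging to the current id
  ranks <- df$rank[rows]   # ranks recorded for that id
  # The logical expression is assigned directly; no ifelse() is needed
  df$dg[rows] <- length(unique(ranks)) > 1 || length(ranks) == 1
}

df
```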
