How to get the top element per group with multiple columns? - r

I have the use case shown below. I have a data frame with three columns. I want to group by two columns (c1, c2) and sum the third one (c3). Then, for each c1, I want to keep only the row with the maximum summed c3 among all c2. Sorting should be unnecessary, since I'm only interested in the maximum.
library(plyr)
df <- data.frame(c1=c('a','a','a','b','b','c'),c2=c('x','y','y','x','y','x'),c3=c(1,2,3,4,5,6))
df
c1 c2 c3
1 a x 1
2 a y 2
3 a y 3
4 b x 4
5 b y 5
6 c x 6
sel <- plyr::ddply(df, c('c1','c2'), plyr::summarize,c3=sum(c3))
sel[with(sel, order(c1,-c3)),]
c1 c2 c3
2 a y 5 <<< this one highest c3 for (c1,c2) combination
1 a x 1
4 b y 5 <<< this one highest c3 for (c1,c2) combination
3 b x 4
5 c x 6 <<< this one highest c3 for (c1,c2) combination
I could do this in a loop, but I'm wondering how it can be done in a vectorized fashion or with a high-level function.

Here's a base R approach:
df2 <- aggregate(c3~c1+c2, df, sum)
subset(df2[order(-df2$c3),], !duplicated(c1))
# c1 c2 c3
#3 c x 6
#4 a y 5
#5 b y 5
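If you want to avoid sorting altogether, as the question asks, one possible base R sketch is to compare each summed c3 with the maximum of its c1 group via ave(). The names df2 and top are only illustrative. Note that, unlike the !duplicated() approach, this keeps every tied row when a c1 has several maxima.

```r
df <- data.frame(c1 = c('a','a','a','b','b','c'),
                 c2 = c('x','y','y','x','y','x'),
                 c3 = c(1,2,3,4,5,6))

# Sum c3 for every (c1, c2) combination
df2 <- aggregate(c3 ~ c1 + c2, df, sum)

# ave() returns, for each row, the maximum c3 within that row's c1 group;
# keeping the rows equal to that maximum needs no sorting
top <- df2[df2$c3 == ave(df2$c3, df2$c1, FUN = max), ]
top
```

The rows come back in whatever order aggregate() produced them, so add an order() call only if the display order matters.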

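Another solution, using dplyr. After summarise() the result is still grouped by c1, so the filter keeps the maximum within each c1.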
library(dplyr)
df2 <- df %>%
group_by(c1, c2) %>%
summarise(c3 = sum(c3)) %>%
filter(c3 == max(c3))
df2
# A tibble: 3 x 3
# Groups: c1 [3]
c1 c2 c3
<fctr> <fctr> <dbl>
1 a y 5
2 b y 5
3 c x 6

Here is another option with data.table
library(data.table)
setDT(df)[, .(c3 = sum(c3)) , .(c1, c2)][, .SD[which.max(c3)], .(c1)]
# c1 c2 c3
#1: a y 5
#2: b y 5
#3: c x 6

Using dplyr:
df %>%
group_by(c1, c2) %>%
summarise(c3 = sum(c3)) %>%
top_n(1, c3)
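Alternatively, the last line can be slice(which.max(c3)), which guarantees a single row per c1 even when there are ties; top_n(1, c3) keeps all tied rows. In dplyr 1.0.0 and later, top_n() is superseded by slice_max().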
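To see how ties change the result, here is a small sketch on hypothetical data (the data frame tied is made up for illustration and is not from the question). It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical data: within c1 == "a", both c2 values sum to 3 (a tie)
tied <- data.frame(c1 = c("a", "a", "b"),
                   c2 = c("x", "y", "x"),
                   c3 = c(3, 3, 4))

sums <- tied %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3), .groups = "drop_last")  # result stays grouped by c1

keep_ties <- sums %>% filter(c3 == max(c3))  # every row equal to the group maximum
one_row   <- sums %>% slice(which.max(c3))   # only the first maximum per group

nrow(keep_ties)
nrow(one_row)
```

Which behaviour you want depends on whether tied winners should all be reported or exactly one row per c1 is required.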

Related

R: creating combinations of elements within a group and adding up numbers associated with combinations in a new data frame

I have the following dataset:
Letter ID Number
A A1 1
A A2 2
A A3 3
B B1 1
B B2 2
B B3 3
B B4 4
My aim is first to create all possible combinations of IDs within the same "Letter" group. For example, for the letter A, it would be only three combinations: A1-A2,A2-A3,and A1-A3. The same IDs ordered differently don't count as a new combination, so for example A1-A2 is the same as A2-A1.
Then, within those combinations, I want to add up the numbers from the "Number" column associated with those IDs. So for the combination A1-A2, which are associated with 1 and 2 in the "Number" column, this would result in the number 1+2=3.
Finally, I want to place the ID combinations, added numbers and original Letter in a new data frame. Something like this:
Letter Combination Add.Number
A A1-A2 3
A A2-A3 5
A A1-A3 4
B B1-B2 3
B B2-B3 5
B B3-B4 7
B B1-B3 4
B B2-B4 6
B B1-B4 5
How can I do this in R, ideally using the package dplyr?
library(dplyr)
letter <- c("A","A","A","B","B","B","B")
df <-
data.frame(letter) %>%
group_by(letter) %>%
mutate(
number = row_number(),
id = paste0(letter,number)
)
df %>%
full_join(df,by = "letter") %>%
filter(number.x < number.y) %>%
mutate(
combination = paste0(id.x,"-",id.y),
add_number = number.x + number.y) %>%
select(letter,combination,add_number)
# A tibble: 9 x 3
# Groups: letter [2]
letter combination add_number
<chr> <chr> <int>
1 A A1-A2 3
2 A A1-A3 4
3 A A2-A3 5
4 B B1-B2 3
5 B B1-B3 4
6 B B1-B4 5
7 B B2-B3 5
8 B B2-B4 6
9 B B3-B4 7
In base R, using combn:
df <- data.frame(
Letter = c("A","A","A","B","B","B","B"),
Id = c("A1","A2","A3","B1","B2","B3","B4"),
Number = c(1,2,3,1,2,3,4))
# combinations
l<-lapply(split(df$Id, df$Letter) ,function(x)
setNames(data.frame(t(combn(x,2))), c("L1","L2")))
n<-lapply(split(df$Number, df$Letter) ,function(x)
setNames(data.frame(t(combn(x,2))), c("N1","N2")))
# rbind all
result <- do.call(rbind, mapply(cbind, Letter=names(l), l, n, SIMPLIFY = FALSE))
result$combination <- paste(result$L1, result$L2, sep="-")
result$sum <- result$N1 + result$N2
result
#> Letter L1 L2 N1 N2 combination sum
#> A.1 A A1 A2 1 2 A1-A2 3
#> A.2 A A1 A3 1 3 A1-A3 4
#> A.3 A A2 A3 2 3 A2-A3 5
#> B.1 B B1 B2 1 2 B1-B2 3
#> B.2 B B1 B3 1 3 B1-B3 4
#> B.3 B B1 B4 1 4 B1-B4 5
#> B.4 B B2 B3 2 3 B2-B3 5
#> B.5 B B2 B4 2 4 B2-B4 6
#> B.6 B B3 B4 3 4 B3-B4 7
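The three steps from the question (pair up the IDs within a letter, add their numbers, collect everything in one data frame) can also be sketched in a single pass by taking combinations of row indices rather than of the values themselves. This is only one possible variant; the name pairs is illustrative, and it assumes every letter has at least two rows.

```r
df <- data.frame(
  Letter = c("A","A","A","B","B","B","B"),
  Id     = c("A1","A2","A3","B1","B2","B3","B4"),
  Number = c(1,2,3,1,2,3,4),
  stringsAsFactors = FALSE)

pairs <- do.call(rbind, lapply(split(df, df$Letter), function(g) {
  # 2 x choose(n, 2) matrix; each column holds one pair of row indices
  idx <- combn(nrow(g), 2)
  data.frame(Letter      = g$Letter[1],
             Combination = paste(g$Id[idx[1, ]], g$Id[idx[2, ]], sep = "-"),
             Add.Number  = g$Number[idx[1, ]] + g$Number[idx[2, ]])
}))
rownames(pairs) <- NULL
pairs
```

Because the same index matrix is used for both the IDs and the numbers, the labels and sums cannot get out of step.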

Replace NA in row with value in adjacent row "ROW" not column [duplicate]

Raw data:
V1 V2
1 c1 a
2 c2 b
3 <NA> c
4 <NA> d
5 c3 e
6 <NA> f
7 c4 g
Reproducible Sample Data
V1 = c('c1','c2',NA,NA,'c3',NA,'c4')
V2 = c('a','b','c','d','e','f','g')
df <- data.frame(V1,V2)
Expected output
V1_after V2_after
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
V1_after <- c('c1','c2','c3','c4')
V2_after <- c('a',paste('b','c','d'),paste('e','f'),'g')
data.frame(V1_after,V2_after)
This is sample data.
In the real data, the rows where V1 is NA do not follow a regular pattern.
This is too difficult for me.
You could use zoo::na.locf for this. It carries the most recent non-NA value forward, filling every NA that follows it:
library(dplyr)
library(zoo)
df %>%
mutate(V1 = zoo::na.locf(V1)) %>%
group_by(V1) %>%
summarise(V2 = paste0(V2, collapse = " "))
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
A base R option using na.omit + cumsum + aggregate. Here V2 is collected into a list column with c rather than pasted into a single string:
aggregate(
V2 ~ .,
transform(
df,
V1 = na.omit(V1)[cumsum(!is.na(V1))]
), c
)
gives
V1 V2
1 c1 a
2 c2 b, c, d
3 c3 e, f
4 c4 g
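The cumsum trick in that answer may be easier to follow step by step. The sketch below uses the sample vectors from the question; idx, filled and collapsed are illustrative names. It assumes the first element of V1 is not NA, since a leading NA would get index 0 and be dropped.

```r
V1 <- c('c1','c2',NA,NA,'c3',NA,'c4')
V2 <- c('a','b','c','d','e','f','g')

# Counts how many non-NA values have been seen so far: 1 2 2 2 3 3 4
idx <- cumsum(!is.na(V1))

# Index the non-NA values with that counter to carry each one forward
filled <- V1[!is.na(V1)][idx]

# Paste the V2 values of each filled group into one string
collapsed <- tapply(V2, filled, paste, collapse = " ")
collapsed
```

Using paste(collapse = " ") here gives plain strings, whereas aggregating with c gives a list column.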
You can fill the NA with the previous non-NA values and summarise the data.
library(dplyr)
library(tidyr)
df %>%
fill(V1) %>%
group_by(V1) %>%
summarise(V2 = paste(V2, collapse = ' '))
# V1 V2
# <chr> <chr>
#1 c1 a
#2 c2 b c d
#3 c3 e f
#4 c4 g

R calculating the sum of values according to condition

Here is a data frame:
ID<-c(rep("A",3),rep("B",2), rep("C",3),rep("D",5))
cell<-c("a1","a2","a3","a1","a2","a1","a2", "a3","a1","a2","a1","a2","a3")
value<-c(2,5,3,4,5,6,9,8,7,2,5,2,4)
# data.frame() keeps value numeric; cbind() would coerce every column to character
df<-data.frame(ID, cell, value)
I want to calculate the sum of all values for each ID up to and including cell "a2". The order of the cells and IDs must be taken into account. If no further "a2" follows once a sum has been calculated, the remaining rows should not be taken into account.
As a result I would like to get this table:
Could You please help me to code this condition?
Thanks in advance.
Best regards, Inna
Assuming the data is already correctly ordered by cell within each ID, take the running total per ID and keep the "a2" rows:
library( tidyverse )
df %>%
group_by( ID ) %>%
mutate( value = cumsum( value ) ) %>%
filter( cell == "a2" )
# # A tibble: 5 x 3
# # Groups: ID [4]
# ID cell value
# <chr> <chr> <dbl>
# 1 A a2 7
# 2 B a2 9
# 3 C a2 15
# 4 D a2 9
# 5 D a2 16
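Treating each occurrence of "a2" as the end of a separate group, so that the sum restarts after every "a2", we can do the following. This differs from a running total only for the second "a2" of ID D (7 rather than 16):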
library(dplyr)
df %>%
#Create a group column with every value of cell == 'a2' as different group
group_by(ID, grp = cumsum(lag(cell == 'a2', default = TRUE))) %>%
#Remove those groups that do not have 'a2' in them
filter(any(cell == 'a2')) %>%
#Sum till 'a2' value
summarise(value = sum(value[seq_len(match('a2', cell))]),
cell = last(cell)) %>%
select(-grp)
# ID value cell
# <chr> <dbl> <chr>
#1 A 7 a2
#2 B 9 a2
#3 C 15 a2
#4 D 9 a2
#5 D 7 a2
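The same "restart after every a2" idea can be sketched in base R, which makes the grouping step explicit. The names is_a2, block, running and res are illustrative, and the data frame is built with data.frame() so that value stays numeric.

```r
ID    <- c(rep("A",3), rep("B",2), rep("C",3), rep("D",5))
cell  <- c("a1","a2","a3","a1","a2","a1","a2","a3","a1","a2","a1","a2","a3")
value <- c(2,5,3,4,5,6,9,8,7,2,5,2,4)
df <- data.frame(ID, cell, value, stringsAsFactors = FALSE)

is_a2 <- df$cell == "a2"

# Block counter that increases on the row right after each "a2"
block <- cumsum(c(TRUE, head(is_a2, -1)))

# Running total within each (ID, block), so the sum restarts after every "a2"
running <- ave(df$value, df$ID, block, FUN = cumsum)

# Keep only the "a2" rows; blocks without an "a2" drop out automatically
res <- data.frame(ID = df$ID, cell = df$cell, value = running)[is_a2, ]
res
```

Dropping block from the ave() call turns this back into the plain running total per ID, which is what the cumsum-based answers compute.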
A succinct solution using ave.
r <- transform(df, value=ave(value, ID, FUN=cumsum))[df$cell == "a2", ]
r
# ID cell value
# 2 A a2 7
# 5 B a2 9
# 7 C a2 15
# 10 D a2 9
# 12 D a2 16
An option with data.table
library(data.table)
setDT(df)[, value := cumsum(value) , ID][cell == 'a2']
Output:
# ID cell value
#1: A a2 7
#2: B a2 9
#3: C a2 15
#4: D a2 9
#5: D a2 16

Subset a dataset to only groups with 2 or more unique subgroups in R [duplicate]

I have a dataset that I want to subset to only those observations whose group contains 2 or more unique subgroups. (I am trying to subset respondents in a survey who live in Nielsen DMAs that cross state lines.)
So if I have this dataframe:
start <- data.frame("obs"=seq(1,10, by=1),"grp"=c(rep("A",4), rep("B",3),rep("C",3)), "sub_grp"=c(rep("A1",2), rep("A2",2), rep("B1",3), "C1","C2","C3"))
What command would I need to subset it to this?
end <- data.frame("obs"=c(seq(1,4,by=1), seq(8,10, by=1)), "grp"=c(rep("A",4), rep("C",3)), "sub_grp"=c("A1","A1","A2","A2","C1","C2","C3"))
The datasets are all data.tables, so I figure there must be a special command in that package to do this.
Thank you for your help!
With data.table, you can count the unique values of sub_grp with uniqueN; if the count is larger than one, keep the group by returning .SD:
setDT(start)[, if(uniqueN(sub_grp) > 1) .SD, grp]
# grp obs sub_grp
#1: A 1 A1
#2: A 2 A1
#3: A 3 A2
#4: A 4 A2
#5: C 8 C1
#6: C 9 C2
#7: C 10 C3
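For comparison, the same count-then-keep logic can be sketched in base R without any packages. The names n_sub and keep are illustrative; start is rebuilt here so the snippet runs on its own.

```r
start <- data.frame(obs = 1:10,
                    grp = c(rep("A",4), rep("B",3), rep("C",3)),
                    sub_grp = c(rep("A1",2), rep("A2",2), rep("B1",3), "C1","C2","C3"),
                    stringsAsFactors = FALSE)

# Number of distinct sub_grp values per grp: A = 2, B = 1, C = 3
n_sub <- tapply(start$sub_grp, start$grp, function(x) length(unique(x)))

# Groups with at least two distinct subgroups
keep <- names(n_sub)[n_sub >= 2]

end <- start[start$grp %in% keep, ]
end
```

On large data.tables the uniqueN version is likely the better choice; this is mainly meant to show what the one-liner does.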
You could use the dplyr library:
library(dplyr)
start %>%
group_by(grp) %>%
filter(length(unique(sub_grp)) >= 2) %>%
ungroup()
This would give you the result:
# A tibble: 7 x 3
obs grp sub_grp
<dbl> <chr> <chr>
1 1 A A1
2 2 A A1
3 3 A A2
4 4 A A2
5 8 C C1
6 9 C C2
7 10 C C3

dplyr::mutate:- new column = difference between two comma-delimited list columns

Example that works:
df <- data.frame(c0=c(1, 2), c1=c("A,B,C", "D,E,F"), c2=c("B,C", "D,E"))
df
# c0 c1 c2
# 1 1 A,B,C B,C
# 2 2 D,E,F D,E
# Add a column d with difference between c1 and c2
df %>% mutate(d=setdiff(unlist(strsplit(as.character(c1), ",")), unlist(strsplit(as.character(c2), ","))))
# c0 c1 c2 d
# 1 1 A,B,C B,C A
# 2 2 D,E,F D,E F
I get what I expected above: d is assigned the difference between these two lists of characters (they are already sorted).
However, if I introduce more than one different character it no longer works:
df <- data.frame(c0=c(1, 2), c1=c("A,B,C", "D,E,F,G"), c2=c("B,C", "D,E"))
df
# c0 c1 c2
# 1 1 A,B,C B,C
# 2 2 D,E,F,G D,E
# Add a column d with difference between c1 and c2
df %>% mutate(d=setdiff(unlist(strsplit(as.character(c1), ",")), unlist(strsplit(as.character(c2), ","))))
Error: wrong result size (3), expected 2 or 1
What I wanted to get there is:
c0 c1 c2 d
1 1 A,B,C B,C A
2 2 D,E,F,G D,E F,G
I've tried adding a paste() around setdiff but that didn't help. In the end I actually want to be able to probably use tidyr::separate to split out the d column into new rows like:
c0 c1 c2 d
1 1 A,B,C B,C A
2 2 D,E,F,G D,E F
3 2 D,E,F,G D,E G
What am I doing wrong with the setdiff above?
Thanks
Tim
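You get the error because mutate() evaluates the expression once over the whole column, not row by row: unlist(strsplit(...)) flattens every row into one vector, so setdiff() returns A, F, G (length 3) for a data frame with 2 rows. The first example only worked because the result happened to have length 2. One way around this is to use rowwise() and wrap the result in a list so that it fits in a single cell, then use unnest() from tidyr to expand the list column into rows: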
library(dplyr)
library(tidyr)
df %>%
rowwise() %>%
mutate(d=list(setdiff(unlist(strsplit(as.character(c1), ",")),
unlist(strsplit(as.character(c2), ","))))) %>%
ungroup() %>%
unnest(cols = d)
# Source: local data frame [3 x 4]
# c0 c1 c2 d
# <dbl> <fctr> <fctr> <chr>
# 1 1 A,B,C B,C A
# 2 2 D,E,F,G D,E F
# 3 2 D,E,F,G D,E G
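If you would rather stay in base R, a rough sketch of the same row-by-row difference uses Map() over the split strings. It produces both shapes asked for in the question: the collapsed "F,G" column and the one-row-per-element version. The names diffs and long are illustrative, and stringsAsFactors = FALSE is set so that strsplit() receives character vectors.

```r
df <- data.frame(c0 = c(1, 2),
                 c1 = c("A,B,C", "D,E,F,G"),
                 c2 = c("B,C", "D,E"),
                 stringsAsFactors = FALSE)

# One setdiff() per row: list("A", c("F", "G"))
diffs <- Map(setdiff, strsplit(df$c1, ","), strsplit(df$c2, ","))

# Wide form: collapse each row's difference into one comma-separated string
df$d <- vapply(diffs, paste, character(1), collapse = ",")

# Long form: repeat each row once per element of its difference
long <- df[rep(seq_len(nrow(df)), lengths(diffs)), c("c0", "c1", "c2")]
long$d <- unlist(diffs)
rownames(long) <- NULL
long
```

A row whose difference is empty gets an empty string in the wide form and disappears from the long form, so handle that case separately if it can occur in your data.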
