Collapse duplicate rows by median value in R

I have a data frame with two columns. I would like to remove rows where there are duplicate entries in the first column. However, I would like to select a specific row to remain based on the value in the second column.
Specifically, if there are 2 duplicate entries in column 1, I would like the row with the lower value in column 2 removed.
Or, if there are more than 2 identical entries in column 1, I would like the row with the median value in column 2 to remain.
So for data frame
a <- c(rep("A", 3), rep("B", 3), rep("C",1), rep("D",1), rep("D",1))
b <- c(1,2,3,4,5,6,4,7,6)
df <-data.frame(a,b)
would become
a <- c(rep("A", 1), rep("B", 1), rep("C",1), rep("D",1))
b <- c(2,5,4,7)
df <-data.frame(a,b)
I have tried functions unique() and duplicated() but can't seem to find arguments that meet these criteria. Any help much appreciated.

You can try
library(data.table)
setDT(df)[, list(b=if(.N==2) min(b) else median(b)) , by = a]
# a b
#1: A 2
#2: B 5
#3: C 4
#4: D 6
Or a similar option with aggregate
aggregate(b~a, df, FUN=function(x) if(length(x)==2) min(x) else median(x))
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 6
Or
library(sqldf)
sqldf('select a,
case
when count(b) is 2 then min(b)
else median(b)
end b
from df
group by a')
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 6
Based on the expected output shown, the last row is D 7, so if we are selecting the first observation when the group length is 2,
setDT(df)[, list(b=if(.N==2) b[1L] else median(b)) , by = a]
# a b
#1: A 2
#2: B 5
#3: C 4
#4: D 7
Or
aggregate(b~a, df, FUN=function(x) if(length(x)==2) x[1L] else median(x))
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 7
Or
sqldf('select a,
case
when count(b) is 2 and min(rowid) then b
else median(b)
end b
from df
group by a')
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 7
EDIT: changed first observation to min after I saw @eipi10's post. I had not read the OP's post correctly, and the OP's expected output does not match the description.

Using dplyr:
library(dplyr)
df %>% group_by(a) %>%
summarise(b = ifelse(n() == 2, min(b), median(b)))
a b
1 A 2
2 B 5
3 C 4
4 D 6
In your question, you said you want the "lower" value, in case there are two rows, which would give D=6, rather than D=7. If you meant the first row that appears in the data frame, you can do this:
df %>% group_by(a) %>%
summarise(b = ifelse(n() == 2, b[1], median(b)))
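The two readings of the rule for a group of two (lower value versus first row) can also be compared side by side in base R. This is a minimal sketch rather than anything taken from the answers above: the helper name collapse_b and its pair_rule argument are made up for illustration. Note that median() averages the two middle values when a group has an even size larger than two, so the result is then not guaranteed to be a value that occurs in the data.

```r
# Data from the question
a <- c(rep("A", 3), rep("B", 3), "C", "D", "D")
b <- c(1, 2, 3, 4, 5, 6, 4, 7, 6)
df <- data.frame(a, b)

# Hypothetical helper: collapse one group's values to a single number.
# pair_rule decides what happens when the group has exactly two values.
collapse_b <- function(x, pair_rule = c("min", "first")) {
  pair_rule <- match.arg(pair_rule)
  if (length(x) == 2) {
    if (pair_rule == "min") min(x) else x[1]
  } else {
    median(x)
  }
}

# tapply() passes the extra argument through to collapse_b()
lower_res <- tapply(df$b, df$a, collapse_b, pair_rule = "min")
first_res <- tapply(df$b, df$a, collapse_b, pair_rule = "first")
```

Only group D differs between the two results, because it is the only group with exactly two rows.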

Related

keep last non missing observation for all variables by group

My data has multiple columns, and some of those columns have missing values in different rows. I would like to group (collapse) the data by the variable "g", keeping the last non-missing observation of each variable.
Input:
d <- data.table(a=c(1,NA,3,4),b=c(1,2,3,4),c=c(NA,NA,'c',NA),g=c(1,1,2,2))
Desired output
d_g <- data.table(a=c(1,4),b=c(2,4),c=c(NA,'c'),g=c(1,2))
A data.table (or dplyr) solution is preferred here.
Note: this is related to another question, but the main answers there seem to produce unnecessary NAs in some groups.
Using data.table :
library(data.table)
d[, lapply(.SD, function(x) last(na.omit(x))), g]
# g a b c
#1: 1 1 2 <NA>
#2: 2 4 4 c
One option using dplyr could be:
library(dplyr)
d %>%
group_by(g) %>%
summarise(across(everything(), ~ if(all(is.na(.))) NA else last(na.omit(.))))
g a b c
<dbl> <dbl> <dbl> <chr>
1 1 1 2 <NA>
2 2 4 4 c
In base R, aggregate could be used.
aggregate(.~g, d, function(x) tail(x[!is.na(x)], 1), na.action = NULL)
# g a b c
#1 1 1 2
#2 2 4 4 c
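The "last non-missing value, or NA if the whole group is missing" rule can also be written out explicitly without any packages. This is a hedged sketch, assuming a plain data.frame version of the input; the helper name last_non_na is invented for illustration.

```r
# Plain data.frame version of the question's input
d <- data.frame(a = c(1, NA, 3, 4),
                b = c(1, 2, 3, 4),
                c = c(NA, NA, "c", NA),
                g = c(1, 1, 2, 2),
                stringsAsFactors = FALSE)

# Hypothetical helper: last non-missing element of x.
# x[NA_integer_] yields a single NA of the same type as x,
# so an all-missing group keeps the column's type.
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0) x[NA_integer_] else keep[length(keep)]
}

# Collapse every column within each group, then stack the one-row results
pieces <- lapply(split(d, d$g), function(grp) {
  as.data.frame(lapply(grp, last_non_na), stringsAsFactors = FALSE)
})
d_g <- do.call(rbind, pieces)
rownames(d_g) <- NULL
```

Returning a typed NA, rather than a zero-length vector, is what keeps a group in the result even when one of its columns is entirely missing.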

Turning a data frame and a list into long format with dplyr

Here is a puzzle.
Assume you have a data frame and a list. The list has as many elements as the df has rows:
dd <- data.frame(ID=1:3, Name=LETTERS[1:3])
dl <- map(4:6, rnorm) %>% set_names(letters[1:3])
Is there a simple way (preferably with dplyr / tidyverse) to make a long format, such that the elements of the list are joined with the corresponding rows of the data frame? Here is what I have in mind, illustrated in a not-so-elegant way:
rows <- map(1:length(dl), ~ rep(., length(dl[[.]]))) %>% unlist()
dd <- dd[rows,]
dd$value <- unlist(dl)
As you can see, for each vector in dl, we replicated the corresponding row as many times as necessary to accommodate each value.
In base R, you can get your result with stack followed by merge:
res <- merge(stack(dl), dd, by.x="ind", by.y="Name")
head(res)
# ind values ID
#1 A -0.79616693 1
#2 A 0.37720953 1
#3 A 1.30273712 1
#4 A 0.19483859 1
#5 B 0.18770716 2
#6 B -0.02226917 2
NB: I assumed the names of dl were meant to be uppercase; if they are indeed lowercase, the following line needs to be used instead:
res <- merge(stack(setNames(dl, toupper(names(dl)))), dd, by.x="ind", by.y="Name")
Since a dplyr solution has already been provided, another option is to subset dl for each Name value in dd using data.table grouping
library(data.table)
setDT(dd)
dd[, .(values = dl[[tolower(Name)]]), by = .(ID, Name)]
# ID Name values
# 1: 1 A -1.09633600
# 2: 1 A -1.26238190
# 3: 1 A 1.15220845
# 4: 1 A -1.45741071
# 5: 2 B -0.49318131
# 6: 2 B 0.59912670
# 7: 2 B -0.73117632
# 8: 2 B -1.09646143
# 9: 2 B -0.79409753
# 10: 3 C -0.08205888
# 11: 3 C 0.21503398
# 12: 3 C -1.17541571
# 13: 3 C -0.10020616
# 14: 3 C -1.01152362
# 15: 3 C -1.03693337
We can create a list column and unnest
library(tidyverse)
dd %>%
mutate(value = dl) %>%
unnest(value)
# ID Name value
#1 1 A 1.57984385
#2 1 A 0.66831102
#3 1 A -0.45472145
#4 1 A 2.33807619
#5 2 B 1.56716709
#6 2 B 0.74982763
#7 2 B 0.07025534
#8 2 B 1.31174561
#9 2 B 0.57901536
#10 3 C -1.36629653
#11 3 C -0.66437155
#12 3 C 2.12506187
#13 3 C 1.20220402
#14 3 C 0.10687018
#15 3 C 0.15973401
Note that if the criterion is compactness of code, we can drop the %>% and write:
unnest(mutate(dd, value = dl))
Or another option is uncount and mutate
dd %>%
uncount(lengths(dl)) %>%
mutate(value = flatten_dbl(unname(dl)))
If it needs a join based on the names of the 'dl'
enframe(dl, name = 'Name') %>%
mutate(Name = toupper(Name)) %>%
left_join(dd) %>%
unnest(value)
In base R, we can replicate the rows of 'dd' with lengths of 'dl' and transform to create the 'value' as unlisted 'dl'
transform(dd[rep(seq_len(nrow(dd)), lengths(dl)),], value = unlist(dl))
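To make the row replication easier to follow, here is the same base R idea on a small deterministic list instead of rnorm() draws. It is only a sketch: the values in dl below are made up, and the approach assumes that the i-th list element belongs to the i-th row of dd (matching by position, not by name).

```r
dd <- data.frame(ID = 1:3, Name = LETTERS[1:3])

# Made-up values so the result is reproducible; lengths are 2, 1 and 3
dl <- list(a = c(0.1, 0.2), b = 1.5, c = c(2.1, 2.2, 2.3))

# Repeat each row index as many times as its list element has values
idx <- rep(seq_len(nrow(dd)), lengths(dl))

long <- dd[idx, ]
long$value <- unlist(dl, use.names = FALSE)
rownames(long) <- NULL  # drop the "1.1"-style row names left by indexing
```

Here idx is c(1, 1, 2, 3, 3, 3), so row 1 appears twice, row 2 once and row 3 three times, in the same order as the unlisted values.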

Count character values by group in a data.frame

I have got a data.frame which contains two columns: ID and Letter. I need to summarize the Letter observations by ID.
Here an example:
df = read.table(text = 'ID Letter
1 A
1 A
1 B
1 A
1 C
1 D
1 B
2 A
2 B
2 B
2 B
2 D
2 F
3 B
3 A
3 A
3 C
3 D', header = TRUE)
My output should be 3 data.frames as follows:
df_1
A 3
B 2
C 1
D 1
df_2
A 1
B 3
D 1
F 1
df_3
A 2
B 1
C 1
D 1
It is just the count of the letters within each ID group. I think I could use a combination of the functions table and aggregate, but how?
Thanks to @akrun, please see below how I managed to do the trick:
#create list of data.frames
library(dplyr)
lst = lapply(split(df, df$ID), function(x) count(x, ID, Letter) %>% ungroup() %>% select(-ID))
lst = lapply(lst, as.data.frame) #convert data into data.frames
This will also work (with base R):
lapply(split(df, df$ID), function(x) subset(as.data.frame(table(x$Letter)), Freq != 0))
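The Freq != 0 filter above is needed when Letter is a factor, because table() then reports every level for every ID, including letters that never occur in that group. A minimal sketch of a variant that avoids the zero counts by tabulating the values as character strings follows; the small data frame is a shortened, made-up version of the question's data, and the column name n is an arbitrary choice.

```r
# Shortened example data (not the full data from the question)
df <- data.frame(ID = c(1, 1, 1, 2, 2),
                 Letter = c("A", "A", "B", "B", "D"))

# One data frame of counts per ID
counts <- lapply(split(df$Letter, df$ID), function(x) {
  tab <- table(as.character(x))  # as.character() drops unused factor levels
  data.frame(Letter = names(tab), n = as.vector(tab))
})
```

The result is a list named by ID, so counts[["1"]] holds the counts for ID 1 and counts[["2"]] those for ID 2.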

How to divide between groups of rows using dplyr?

I have this dataframe:
x <- data.frame(
name = rep(letters[1:4], each = 2),
condition = rep(c("A", "B"), times = 4),
value = c(2,10,4,20,8,40,20,100)
)
# name condition value
# 1 a A 2
# 2 a B 10
# 3 b A 4
# 4 b B 20
# 5 c A 8
# 6 c B 40
# 7 d A 20
# 8 d B 100
I want to group by name and divide the value of rows with condition == "B" with those with condition == "A", to get this:
data.frame(
name = letters[1:4],
value = c(5,5,5,5)
)
# name value
# 1 a 5
# 2 b 5
# 3 c 5
# 4 d 5
I know something like this can get me pretty close:
x$value[which(x$condition == "B")]/x$value[which(x$condition == "A")]
but I was wondering if there was an easy way to do this with dplyr (My dataframe is a toy example and I got to it by chaining multiple group_by and summarise calls).
Try:
x %>%
group_by(name) %>%
summarise(value = value[condition == "B"] / value[condition == "A"])
Which gives:
#Source: local data frame [4 x 2]
#
# name value
# (fctr) (dbl)
#1 a 5
#2 b 5
#3 c 5
#4 d 5
I'd use spread from tidyr.
library(dplyr)
library(tidyr)
x %>%
spread(condition, value) %>%
mutate(value = B/A)
name A B value
1 a 2 10 5
2 b 4 20 5
3 c 8 40 5
4 d 20 100 5
You could then do select(-A, -B) to drop the extra columns.
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(x)) and, grouped by 'name', divide the 'value' that corresponds to condition 'B' by the one that corresponds to condition 'A'.
library(data.table)
setDT(x)[,.(value = value[condition=="B"]/value[condition=="A"]) , name]
# name value
#1: a 5
#2: b 5
#3: c 5
#4: d 5
Or reshape from 'long' to 'wide' and divide the 'B' column by 'A'.
dcast(setDT(x), name~condition, value.var='value')[, .(name, value = B/A)]
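The one-liner from the question divides the B values by the A values purely by position, which silently assumes that both subsets list the names in the same order. A hedged base R sketch that aligns the two subsets by name with match() is shown below; it assumes each name has at most one row per condition, and the intermediate names a_rows, b_rows and ratio are illustrative only.

```r
x <- data.frame(
  name = rep(letters[1:4], each = 2),
  condition = rep(c("A", "B"), times = 4),
  value = c(2, 10, 4, 20, 8, 40, 20, 100)
)

a_rows <- x[x$condition == "A", ]
b_rows <- x[x$condition == "B", ]

# For every B row, look up the A row with the same name before dividing
ratio <- data.frame(
  name = b_rows$name,
  value = b_rows$value / a_rows$value[match(b_rows$name, a_rows$name)]
)
```

If a name has no A row, match() returns NA for it and the ratio becomes NA instead of being paired with the wrong row.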

Create a variable capturing the most frequent occurence by group

Define:
df1 <-data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)
s.t.
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply (from the plyr package, which must be loaded first) and a custom function to pick out the most frequent value:
myFun <- function(x){
tbl <- table(x$v1)
x$freq <- rep(names(tbl)[which.max(tbl)],nrow(x))
x
}
ddply(df1,.(id),.fun=myFun)
Note that which.max will return the first occurrence of the maximum value, in the case of ties. See ??which.is.max in the nnet package for an option that breaks ties randomly.
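The tie behaviour is worth seeing on a concrete input. Because table() orders its entries by sorted value, "first occurrence of the maximum" means the first tied value in sorted order, not the value that appears first in the data. The following is a small sketch; the helper name most_frequent is made up, and df1 is rebuilt here with character columns so the example does not depend on the factor defaults of older R versions.

```r
# Hypothetical helper: most frequent value of x as a character string
most_frequent <- function(x) {
  tbl <- table(as.character(x))
  names(tbl)[which.max(tbl)]
}

# "a" and "b" both occur twice; table() sorts them, so "a" wins the tie
# even though "b" comes first in the input
tie_result <- most_frequent(c("b", "a", "a", "b"))

# Applied per group to the data from the question
df1 <- data.frame(id = c(rep(1, 3), rep(2, 3)),
                  v1 = c("a", "b", "b", rep("c", 3)),
                  stringsAsFactors = FALSE)
freq <- ave(df1$v1, df1$id, FUN = most_frequent)
```

If the value that appears first in the data should win ties instead, the tabulation would need to preserve input order, for example by tabulating factor(x, levels = unique(x)).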
Another way consists of using tidyverse functions:
grouping first, using group_by(), and counting the occurrence of the second variable using tally()
arranging by the number of occurrences with arrange()
summarizing and picking out the first row with summarize() and first()
Therefore:
df1 %>%
group_by(id, v1) %>%
tally() %>%
arrange(id, desc(n)) %>%
summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
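A base R alternative uses ave with a small helper (note that naming the helper mode masks base::mode):
mode <- function(x) names(table(x))[ which.max(table(x)) ]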
df1$freq <- ave(df1$v1, df1$id, FUN=mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c

Resources