R: group by key, get max value for multiple columns

I want to do something like this:
How to make a unique in R by column A and keep the row with maximum value in column B
Except my data.table has one key column, and multiple value columns. So say I have the following:
a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1
If the key is column a, then for each unique a I want the row with the maximum b; if there are ties on b, break them by the maximum c, and so on through the remaining columns. So the result should be:
a b c
1: 1 2 2
2: 2 3 3
3: 3 2 1
I'd also like this to be done for an arbitrary number of columns. So if my data.table had 20 columns, I'd want the max function to be applied in order from left to right.

Here is a suggested data.table solution using data.table::frankv, which ranks the rows of .SD lexicographically, column by column from left to right:
DT[, .SD[which.max(frankv(.SD, ties.method = "first"))], by = a]
frankv returns a rank for every row of the group; which.max picks the index of the top-ranked row, and .SD[...] subsets the group to that row.
Please let me know if it fails for your larger dataset.
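For reference, a quick sketch of this on the sample data, assuming it is loaded as a data.table named DT:
library(data.table)
DT <- data.table(a = c(1,1,1,2,2,2,3,3),
                 b = c(1,2,2,1,2,3,1,2),
                 c = c(1,1,2,1,5,3,4,1))
# rank rows left to right across all of .SD, keep the top-ranked row per group
DT[, .SD[which.max(frankv(.SD, ties.method = "first"))], by = a]
#    a b c
# 1: 1 2 2
# 2: 2 3 3
# 3: 3 2 1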

To make this work for any number of columns, a possible dplyr solution is arrange_all():
df <- data.frame(a = c(1,1,1,2,2,2,3,3),
                 b = c(1,2,2,1,2,3,1,2),
                 c = c(1,1,2,1,5,3,4,1))
df %>% group_by(a) %>% arrange_all() %>% filter(row_number() == n())
# A tibble: 3 x 3
# Groups: a [3]
# a b c
# 1 1 2 2
# 2 2 3 3
# 3 3 2 1
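Note that arrange_all() is superseded in current dplyr; an equivalent sketch with across() and slice_tail():
library(dplyr)
df %>%
  arrange(across(everything())) %>%  # sort by every column, left to right
  group_by(a) %>%
  slice_tail(n = 1)                  # keep the last (largest) row per group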

A generic solution for an arbitrary number of columns can be achieved using arrange_at. In the example below, c("a","b","c") is the arbitrary set of columns.
library(dplyr)
df %>%
  arrange_at(.vars = vars(c("a","b","c"))) %>%
  mutate(changed = a != lead(a)) %>%
  filter(is.na(changed) | changed) %>%
  select(-changed)
a b c
1 1 2 2
2 2 3 3
3 3 2 1
Another option could be using max and dplyr as below. The approach is to first group_by on a and filter for the max value of b, then group_by on both a and b and filter for rows with the max value of c.
library(dplyr)
df %>%
  group_by(a) %>%
  filter(b == max(b)) %>%
  group_by(a, b) %>%
  filter(c == max(c))
# Groups: a, b [3]
# a b c
# <int> <int> <int>
#1 1 2 2
#2 2 3 3
#3 3 2 1
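To push this pairwise-filter idea to an arbitrary number of columns, a hypothetical helper (max_by_cols is my name, not from the answer) could loop over the value columns, tightening the grouping one column at a time:
library(dplyr)

# Sketch: repeatedly filter on the max of each value column,
# adding that column to the grouping before moving to the next one.
max_by_cols <- function(df, key, cols) {
  grp <- key
  for (col in cols) {
    df <- df %>%
      group_by(across(all_of(grp))) %>%
      filter(.data[[col]] == max(.data[[col]]))
    grp <- c(grp, col)
  }
  ungroup(df)
}

max_by_cols(df, "a", c("b", "c"))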
Data
df <- read.table(text = "a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1", header = TRUE, stringsAsFactors = FALSE)

dat <- data.frame(a = c(1,1,1,2,2,2,3,3),
                  b = c(1,2,2,1,2,3,1,2),
                  c = c(1,1,2,1,5,3,4,1))
library(sqldf)
# Note: this leans on SQLite's non-standard handling of bare columns
# under GROUP BY, so it may not port to other SQL engines.
sqldf("with d as (select * from 'dat' group by a order by b, c desc) select * from d order by a")
a b c
1 1 2 2
2 2 3 3
3 3 2 1

Related

Frequency count for multiple columns with same values

I'd like to make a frequency count individually for multiple columns with same possible values. The idea is to keep all columns from original data table, just adding a new one for levels and aggregating.
Here is an example of input data:
foo <- data.table(a = c(1,3,2,3,3), b = c(2,3,3,1,1), c = c(3,1,2,3,2))
# a b c
#1: 1 2 3
#2: 3 3 1
#3: 2 3 2
#4: 3 1 3
#5: 3 1 2
And desired output:
data.table(levels = 1:3, a = c(1,1,3), b = c(2,1,2), c = c(1,2,2))
# levels a b c
#1: 1 1 2 1
#2: 2 1 1 2
#3: 3 3 2 2
Thanks for helping!
We may use
library(data.table)
dcast(melt(foo)[, .N, .(variable, levels = value)],
levels ~ variable, value.var = 'N')
Output:
Key: <levels>
levels a b c
<num> <int> <int> <int>
1: 1 1 2 1
2: 2 1 1 2
3: 3 3 2 2
Or using base R
table(stack(foo))
ind
values a b c
1 1 2 1
2 1 1 2
3 3 2 2
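For intuition, stack() reshapes the wide table into a two-column long form (values, ind), which table() then cross-tabulates:
head(stack(foo), 3)
#   values ind
# 1      1   a
# 2      3   a
# 3      2   a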
You could also use recast from reshape2:
reshape2::recast(foo, value~variable)
# No id variables; using all as measure variables
# Aggregation function missing: defaulting to length
value a b c
1 1 1 2 1
2 2 1 1 2
3 3 3 2 2
or even
reshape2::recast(foo, value~variable, length)
Here is an option using purrr and dplyr from the tidyverse:
library(purrr)
library(dplyr)
foo %>%
  imap(~ as.data.frame(table(.x, dnn = "levels"), responseName = .y)) %>%
  reduce(left_join, by = "levels")
Alternatively, you could use the pivot functions from tidyr:
library(dplyr)
library(tidyr)
foo %>%
  pivot_longer(everything(), values_to = "levels") %>%
  count(name, levels) %>%
  pivot_wider(id_cols = levels,
              names_from = name,
              values_from = n)
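One caveat: if a level never occurs in some column, pivot_wider() leaves an NA there; values_fill guards against that (a sketch):
library(dplyr)
library(tidyr)
foo %>%
  pivot_longer(everything(), values_to = "levels") %>%
  count(name, levels) %>%
  pivot_wider(id_cols = levels, names_from = name,
              values_from = n, values_fill = 0)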
Or, sticking with data.table and the native pipe:
foo |>
  melt() |>
  dcast(value ~ variable, fun.aggregate = length)
# value a b c
# 1: 1 1 2 1
# 2: 2 1 1 2
# 3: 3 3 2 2

Fill sequence by factor

I need to fill in the missing $Year values of the sequence, within each level of the factor $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year = full_seq(Year, 1), fill = list(Count = 0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to @PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to more columns besides Count: just add more defaults to def. The tidyverse counterpart is tidyr's complete() (with its fill argument), as used in the first answer above.
Another base R idea is to split on Country, use setdiff to find the missing values from seq(max(Year)), and rbind them to the original data frame. Use do.call to rbind the list back into a single data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
  x <- rbind(i, data.frame(Country = i$Country[1],
                           Year = setdiff(seq(max(i$Year)), i$Year),
                           Count = 0))
  x[with(x, order(Year)), ]
}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
setkey(DT, Country, Year)
DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
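To match the desired output, the remaining NAs can then be filled in place, consistent with the other data.table answers (a sketch, storing the join result as res):
res <- DT[setkey(DT[, .(Year = min(Year):max(Year)), by = Country], Country, Year)]
res[is.na(Count), Count := 0][]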
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  group_by(Country) %>%
  do(tibble(Country = unique(.$Country),
            Year = full_seq(.$Year, 1))) %>%
  full_join(dt, by = c("Country", "Year")) %>%
  replace_na(list(Count = 0))
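Since do() is superseded in recent dplyr, the same idea can be sketched with reframe() (requires dplyr >= 1.1):
library(dplyr)
library(tidyr)
dt %>%
  reframe(Year = full_seq(Year, 1), .by = Country) %>%  # full year grid per country
  full_join(dt, by = c("Country", "Year")) %>%
  replace_na(list(Count = 0))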
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the named list that is returned, merges it onto the original (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1

How to Index subjects using R [duplicate]

(Marked as a duplicate of: Numbering rows within groups in a data frame.)
I am working in R and I have a Data set that has multiple entries for each subject. I want to create an index variable that indexes by subject. For example:
Subject Index
1 A 1
2 A 2
3 B 1
4 C 1
5 C 2
6 C 3
7 D 1
8 D 2
9 E 1
The first A entry is indexed as 1, while the second A entry is indexed as 2. The first B entry is indexed as 1, etc.
Any help would be excellent!
Here's a quick data.table approach:
library(data.table)
setDT(df)[, Index := seq_len(.N), by = Subject][]
# Subject Index
# 1: A 1
# 2: A 2
# 3: B 1
# 4: C 1
# 5: C 2
# 6: C 3
# 7: D 1
# 8: D 2
# 9: E 1
Or with base R
with(df, ave(as.numeric(Subject), Subject, FUN = seq_along))
## [1] 1 2 1 1 2 3 1 2 1
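If Subject is plain character rather than a factor (where as.numeric() would fail), any numeric carrier works, since seq_along ignores the values; a sketch:
with(df, ave(seq_along(Subject), Subject, FUN = seq_along))
## [1] 1 2 1 1 2 3 1 2 1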
Or with dplyr (don't run this on a data.table class)
library(dplyr)
df %>%
group_by(Subject) %>%
mutate(Index = row_number())
Using dplyr
library(dplyr)
df %>% group_by(Subject) %>% mutate(Index = 1:n())
You get:
#Source: local data frame [9 x 2]
#Groups: Subject
#
# Subject Index
#1 A 1
#2 A 2
#3 B 1
#4 C 1
#5 C 2
#6 C 3
#7 D 1
#8 D 2
#9 E 1

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
                grp = sort(rep(1:5, 4)),
                var1 = rep(c('A','B'), 10))
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which computes the number of distinct var1 values so far, i.e. up to that point in time, for each group grp. This differs from what a plain n_distinct would give.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function, say cum_n_distinct, for this and use it as:
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer.
The logic is basically to set the 1st occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum:
df = df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%
  select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
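The flag-and-cumsum step can also be collapsed with duplicated(), since the first occurrence of each value is exactly the non-duplicated one; a sketch under the same assumptions:
library(dplyr)
df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(!duplicated(var1)))  # count each value only at its first appearance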
Assuming everything is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups (note, assumes var1 is factor), and then applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
With your new dataset, an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
  # Given a vector x, returns a corresponding vector y
  # where the ith element of y gives the number of unique
  # elements observed up to and including index i.
  # If na.include = TRUE (default), NA is counted as an
  # additional unique element; otherwise it's ignored.
  temp <- data.table(x, idx = seq_along(x))
  firsts <- temp[temp[, .I[1L], by = x]$V1]
  if(na.include == FALSE) firsts <- firsts[!is.na(x)]
  y <- rep(0, times = length(x))
  y[firsts$idx] <- 1
  y <- cumsum(y)
  return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
arrange(time) %>%
group_by(grp) %>%
mutate(var2 = cum_n_distinct(var1))

Create a variable capturing the most frequent occurrence by group

Define:
df1 <- data.frame(
  id = c(rep(1,3), rep(2,3)),
  v1 = as.character(c("a","b","b", rep("c",3)))
)
such that
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id, such that
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using plyr's ddply and a custom function to pick out the most frequent value:
library(plyr)
myFun <- function(x){
  tbl <- table(x$v1)
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}
ddply(df1, .(id), .fun = myFun)
Note that which.max returns the first occurrence of the maximum value in the case of ties. See which.is.max() in the nnet package for an option that breaks ties at random.
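A quick illustration of that tie behavior:
tbl <- table(c("a", "a", "b", "b"))
names(tbl)[which.max(tbl)]  # "a": ties go to the first maximum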
Another way consists of using tidyverse functions:
grouping first, using group_by(), and counting the occurrence of the second variable using tally()
arranging by the number of occurrences with arrange()
summarizing and picking out the first row with summarize() and first()
Therefore:
library(dplyr)
df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
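Completing that thought, a sketch of the join back onto the original rows (freq_map is my name for the mapping above):
library(dplyr)
freq_map <- df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))

df2 <- df1 %>% left_join(freq_map, by = "id")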
# Note: this helper masks base::mode(), which is unrelated.
mode <- function(x) { tbl <- table(x); names(tbl)[which.max(tbl)] }
df1$freq <- ave(df1$v1, df1$id, FUN = mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
