I have a dataset as below:
Group Class
A 1
A 2
A 1
A 1
B 2
B 2
B 2
B 1
B 3
B 1
C 1
C 1
C 1
C 2
C 3
I want to aggregate the table by the ‘Group’ column, where the value of the ‘Class’ column is the Class with the maximum count. For instance, for Group A, 1 appears three times, so the value for Class is 1. Similarly, for Group B, 2 appears three times, so the value for Class is 2. The result table should be the following:
Group Class
A 1
B 2
C 1
I am new to R programming and would appreciate your help in solving this problem. Thanks!
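For reference, here is a dplyr sketch of the same aggregation (this assumes dplyr >= 1.0 for slice_max and the df object from the dput further below):
library(dplyr)
df %>%
  count(Group, Class) %>%                     # frequency of each Class within each Group
  group_by(Group) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%  # keep the single most frequent Class per Group
  select(Group, Class)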
You could also do this without using aggregate, using table and max.col instead:
tb <- table(df$Group, df$Class)
data.frame("Group"=rownames(tb), "CLass"=max.col(tb))
# Group CLass
#1 A 1
#2 B 2
#3 C 1
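Note that max.col returns the column position of the maximum, which coincides with the Class value here only because the classes are 1, 2 and 3. A sketch that maps the position back to the actual class label (using the same tb as above):
data.frame("Group" = rownames(tb),
           "Class" = colnames(tb)[max.col(tb)])  # look up the class label rather than its position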
The table/max.col approach also seems to be faster:
library(microbenchmark)
# by Ronak Shah in comments
f1 <- function(df) aggregate(Class~Group, df, function(x) which.max(table(x)))
# this answer
f2 <- function(df) {tb <- table(df$Group, df$Class);
data.frame("Group"=rownames(tb), "Class"=max.col(tb))}
all(f1(df)==f2(df))
# [1] TRUE
microbenchmark(f1(df), f2(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# f1(df) 800.153 838.9130 923.6484 870.0115 918.988 1981.901 100
# f2(df) 298.367 319.0995 353.4915 338.6305 380.246 599.439 100
df <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Class = c(1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 1L,
1L, 2L, 3L)), .Names = c("Group", "Class"), class = "data.frame", row.names = c(NA,
-15L))
In my data:
data=structure(list(v1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
v2 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), x = c(10L,
1L, 2L, 3L, 4L, 3L, 2L, 30L, 3L, 5L)), .Names = c("v1", "v2",
"x"), class = "data.frame", row.names = c(NA, -10L))
There are 3 variables.
I need to get only those rows where x has the maximum value within each category of v1.
For example, take the first category of v1 and look at which category of v2 has the maximum x value.
It is
v1=1, v2=1, x=10
Take the second category of v1 and look at which category of v2 has the maximum x value.
It is v1=2, v2=3, x=30
So the desired output is:
v1 v2 x
1 1 10
2 3 30
How to do it?
Here is a solution using data.table:
library(data.table)
setDT(data)
data[, .SD[which.max(x)], keyby = v1]
v1 v2 x
1: 1 1 10
2: 2 3 30
And for completeness an ugly base-R solution:
t(sapply(split(data, data[["v1"]]), function(s) s[which.max(s[["x"]]),]))
v1 v2 x
1 1 1 10
2 2 3 30
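If a plain data frame is preferred over the transposed list-matrix above, a base-R sketch along the same lines (same data object as in the question):
do.call(rbind, lapply(split(data, data$v1), function(s) s[which.max(s$x), ]))
#   v1 v2  x
# 1  1  1 10
# 2  2  3 30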
Using dplyr:
data %>%
group_by(v1) %>%
filter(x == max(x))
# A tibble: 2 x 3
# Groups: v1 [2]
v1 v2 x
<int> <int> <int>
1 1 1 10
2 2 3 30
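One difference worth noting: filter(x == max(x)) keeps every tied row, whereas a slice-based sketch keeps exactly one row per group:
data %>%
  group_by(v1) %>%
  slice(which.max(x))  # a single row per group, even if the maximum is tied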
CustomerID MarkrtungChannel OrderID
1 A 1
2 B 2
3 A 3
4 B 4
5 C 5
1 C 6
1 A 7
2 C 8
3 B 9
3 B 10
I want to know which combinations of marketing channels are used by how many customers.
How can I calculate this with R?
E.g. the combination of marketing channels A and C is used by 1 customer (ID 1),
the combination of marketing channels B and C is also used by 1 customer (ID 2),
and so on...
And here's a tidyverse way:
library(tidyverse)
df %>%
group_by(CustomerID)%>%
summarize(combo=paste0(sort(unique(MarkrtungChannel)),collapse=""))%>%
ungroup()%>%
group_by(combo)%>%
summarize(n.users=n())
The final group_by/summarize counts the number of customers using each combination.
You can do it multiple ways. Here is data.table way:
# Here is your data
df<-structure(list(CustomerID = c(1L, 2L, 3L, 4L, 5L, 1L, 1L, 2L,
3L, 3L), MarkrtungChannel = structure(c(1L, 2L, 1L, 2L, 3L, 3L,
1L, 3L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor"),
OrderID = 1:10), .Names = c("CustomerID", "MarkrtungChannel",
"OrderID"), class = "data.frame", row.names = c(NA, -10L))
df[]<-lapply(df[],as.character)
# Here is the combination field
library(data.table)
setDT(df)
df[,Combo:=.(list(unique(MarkrtungChannel))), by=CustomerID]
# Or (to get the combination counts)
df[,list(combo=(list(unique(MarkrtungChannel)))), by=CustomerID][,uniqueN(CustomerID),by=combo]
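A base-R sketch of the same idea, building a combination key per customer and then counting customers per key (run on the df from the dput above, before the data.table modifications):
combos <- aggregate(MarkrtungChannel ~ CustomerID, df,
                    function(x) paste(sort(unique(x)), collapse = ""))
table(combos$MarkrtungChannel)  # number of customers per channel combination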
Given df as follows:
# group value
# 1 A 8
# 2 A 1
# 3 A 7
# 4 B 3
# 5 B 2
# 6 B 6
# 7 C 4
# 8 C 5
df <- structure(list(group = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), .Label = c("A", "B", "C"), class = "factor"), value = c(8L,
1L, 7L, 3L, 2L, 6L, 4L, 5L)), .Names = c("group", "value"), class = "data.frame", row.names = c(NA,
-8L))
And a vector of indices (possibly with NA):
inds <- c(2,1,NA)
How can we get the nth element of column value per group, preferably in base R?
For example, based on inds, we want the second element of value in group A, first element in group B, NA in group C. So the result would be:
#[1] 1 3 NA
Here is a solution with mapply and split:
mapply("[", with(df, split(value, group)), inds)
which returns a named vector
A B C
1 3 NA
with(df, split(value, group)) splits the value column by group and returns a list of vectors, one per group. mapply takes that list and "inds" and applies the subsetting function "[" to each pair of arguments.
Using levels and sapply you could do:
DF <- structure(list(group = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), .Label = c("A", "B", "C"), class = "factor"), value = c(8L,
1L, 7L, 3L, 2L, 6L, 4L, 5L)), .Names = c("group", "value"), class = "data.frame", row.names = c(NA,
-8L))
inds <- c(2,1,NA)
lvls = levels(DF$group)
groupInds = sapply(1:length(lvls),function(x) DF$value[DF$group==lvls[x]][inds[x]] )
groupInds
#[1] 1 3 NA
Using mapply again (though not nearly as elegant as Imo's answer):
mapply(function(x, y) subset(df, group == x, value)[y,] ,levels(df$group), inds)
I know you said preferably in base R, but just for the record, here is a data.table way
setDT(df)[, .SD[inds[.GRP], value], by=group][,V1]
#[1] 1 3 NA
I just came up with another solution:
diag(aggregate(value~group, df, function(x) x[inds])[,-1])
#[1] 1 3 NA
Benchmarking
library(microbenchmark)
library(data.table)
df <- structure(list(group = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), .Label = c("A", "B", "C"), class = "factor"), value = c(8L,
1L, 7L, 3L, 2L, 6L, 4L, 5L)), .Names = c("group", "value"), class = "data.frame", row.names = c(NA,
-8L))
inds <- c(2,1,NA)
f_Imo <- function(df) as.vector(mapply("[", with(df, split(value, group)), inds))
f_Osssan <- function(df) {lvls = levels(df$group);sapply(1:length(lvls),function(x) df$value[df$group==lvls[x]][inds[x]])}
f_User2321 <- function(df) unlist(mapply(function(x, y) subset(df, group == x, value)[y,] ,levels(df$group), inds))
f_dww <- function(df) setDT(df)[, .SD[inds[.GRP], value], by=group][,V1]
f_m0h3n <- function(df) diag(aggregate(value~group, df, function(x) x[inds])[,-1])
all(sapply(list(f_Osssan(df), f_User2321(df), f_dww(df), f_m0h3n(df)),
           function(res) isTRUE(all.equal(f_Imo(df), unname(res)))))
# [1] TRUE
microbenchmark(f_Imo(df), f_Osssan(df), f_m0h3n(df), f_User2321(df), f_dww(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# f_Imo(df) 71.004 85.1180 91.52996 91.748 96.8810 121.048 100
# f_Osssan(df) 252.788 276.5265 318.70529 287.648 301.5495 2651.492 100
# f_m0h3n(df) 1422.627 1555.4365 1643.47184 1618.740 1670.7095 4729.827 100
# f_User2321(df) 2889.738 3000.3055 3148.44916 3037.945 3118.7860 6013.442 100
# f_dww(df) 2960.740 3086.2790 3206.02147 3143.381 3250.9545 5976.229 100
I have trouble combining slice and map.
I am interested in doing something similar to this, which in my case is transforming a compact person-period file into a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I am interested in introducing a split(.$group).
How am I supposed to do so? For example,
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
is not working.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector or list in .x and a function in .f. It then applies .f to each element of .x.
The function you are passing to map is not formatted correctly. Try this:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.
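Depending on how the data is processed downstream, a grouped sketch without an explicit split may also work; it does not reduce peak memory, but it keeps everything in one pipe (same dt as below):
dt %>%
  group_by(group) %>%
  slice(rep(1:n(), dur)) %>%  # within each group, repeat each row dur times
  ungroup()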
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.
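As a side note, with tidyr >= 1.0 the nesting syntax changed slightly; an equivalent sketch of the nest-based approach above would be:
dt %>%
  tidyr::nest(data = c(var, ep)) %>%                        # tidyr >= 1.0 syntax
  mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
  select(-data) %>%
  tidyr::unnest(expanded)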
Because I am working on a very large dataset, I need to slice my dataset by groups in order to carry out my computations.
I have a person-period (melt) dataset that looks like this
group id var time
1 A 1 a 1
2 A 1 b 2
3 A 1 a 3
4 A 2 b 1
5 A 2 b 2
6 A 2 b 3
7 B 1 a 1
8 B 1 a 2
9 B 1 a 3
10 B 2 c 1
11 B 2 c 2
12 B 2 c 3
I need to do this simple transformation
library(reshape2)
library(dplyr)
dt %>% dcast(group + id ~ time, value.var = 'var')
In order to get
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
So far, so good.
However, because my database is too big, I need to do this separately for each group, such as
a = dt %>% filter(group == 'A') %>% dcast(group + id ~ time, value.var ='var')
b = dt %>% filter(group == 'B') %>% dcast(group + id ~ time, value.var = 'var')
bind_rows(a,b)
My problem is that I would like to avoid doing this by hand, i.e. having to store each group separately as a = ..., b = ..., c = ..., and so on.
Any idea how I could have a single pipe stream that would separate each group, compute the transformation and put it back together in a data frame?
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), time = structure(c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor")), .Names = c("group", "id",
"var", "time"), row.names = c(NA, -12L), class = "data.frame")
The purrr package is useful for working with lists. First split the dataset by group, then use map_df to dcast each list element and return everything in a single data.frame.
library(purrr)
dt %>%
split(.$group) %>%
map_df(~dcast(.x, group + id ~ time, value.var = "var"))
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
lapply is your friend here:
do.call(rbind, lapply(unique(dt$group), function(grp, dt){
dt %>% filter(group == grp) %>% dcast(group + id ~ time, value.var = "var")
}, dt = dt))
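If the data really is very large, a data.table sketch may also be worth considering, since data.table's dcast is designed for large tables (this assumes the dt given above):
library(data.table)
dcast(setDT(dt), group + id ~ time, value.var = "var")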