Define:
df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = as.character(c("a", "b", "b", rep("c", 3)))
)
s.t.
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply and a custom function to pick out the most frequent value:
library(plyr)

myFun <- function(x){
  tbl <- table(x$v1)
  # assign the most frequent v1 value to every row of the group
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}

ddply(df1, .(id), .fun = myFun)
Note that which.max returns the first occurrence of the maximum value in the case of ties. See which.is.max() in the nnet package for an option that breaks ties at random.
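For illustration, a variant of myFun that breaks ties at random might look like the sketch below (assuming the nnet package is installed; myFunRandomTies is a name introduced here, not part of the original answer):
library(nnet)  # provides which.is.max()

myFunRandomTies <- function(x){
  tbl <- table(x$v1)
  # which.is.max() breaks ties at random rather than taking the first maximum
  x$freq <- rep(names(tbl)[which.is.max(tbl)], nrow(x))
  x
}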
Another way is to use tidyverse functions:
grouping first with group_by() and counting the occurrences of the second variable with tally(),
arranging by the number of occurrences with arrange(),
then summarizing and picking out the first row with summarize() and first().
Therefore:
library(dplyr)

df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
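A minimal sketch of that join (assuming the mapping above has been stored in an object called freq_map, a name introduced here for illustration):
library(dplyr)

freq_map <- df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))

# attach the per-id most frequent value back onto the original rows
df2 <- df1 %>% left_join(freq_map, by = "id")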
# most frequent value per group (note: this masks base R's mode())
mode <- function(x) names(table(x))[which.max(table(x))]
df1$freq <- ave(df1$v1, df1$id, FUN = mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
Related
I am trying to "highlight" duplicates in my data frame. I found various tutorials on dropping duplicates or creating a new dataset containing only duplicates, but since I suspect something went wrong in an earlier stage of my data preparation, I would (for now) just like to see which observations appear to be duplicates, in order to understand what went wrong. I would like R to create column c:
# desired result: column c counts how often each (a, b) combination occurs
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <- data.frame(a, b, c)

# input data (without c)
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <- data.frame(a, b)
library(dplyr)
df %>%
  group_by(a, b) %>%  # for each combination of a and b
  mutate(c = n()) %>% # count times they appear
  ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
Say I have this data.frame:
library(dplyr)
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
# x y
# 1 a a
# 2 b b
# 3 b b
# 4 c c
# 5 c c
# 6 c c
I can group and count easily by mentioning the names:
df1 %>%
count(x,y)
# A tibble: 3 x 3
# x y n
# <fctr> <fctr> <int>
# 1 a a 1
# 2 b b 2
# 3 c c 3
How do I group by everything without mentioning individual column names, in the most compact/readable way?
We can pass the input itself to the ... argument and splice it with !!!:
df1 %>% count(., !!!.)
#> x y n
#> 1 a a 1
#> 2 b b 2
#> 3 c c 3
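(As a side note, not part of the original answer: the dot is the data frame supplied by the magrittr pipe, and !!! splices all of its columns into count()'s ..., so for this data the effect is the same as listing every column explicitly.)
library(dplyr)
df1 %>% count(x, y)  # same result as df1 %>% count(., !!!.)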
With base R we could do:
aggregate(setNames(df1[1], "n"), df1, length)
For those who wouldn't get the voodoo you are using in the accepted answer, if you don't need to use dplyr, you can do it with data.table:
library(data.table)
setDT(df1)
df1[, .N, names(df1)]
# x y N
# 1: a a 1
# 2: b b 2
# 3: c c 3
Have you considered the (now superseded) group_by_all()?
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
df1 %>% group_by_all() %>% count
df1 %>% group_by(across()) %>% count()
df1 %>% count(across()) # don't know why this returns a data.frame and not tibble
See the colwise vignette "other verbs" section for explanation... though honestly I get turned around myself sometimes.
I want to do something like this:
How to make a unique in R by column A and keep the row with maximum value in column B
Except my data.table has one key column, and multiple value columns. So say I have the following:
a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1
If the key is column a, I want, for each unique a, to return the row with the maximum b; if more than one row shares that maximum b, pick among them the one with the maximum c, and so on for the remaining columns. So the result should be:
a b c
1: 1 2 2
2: 2 3 3
3: 3 2 1
I'd also like this to be done for an arbitrary number of columns. So if my data.table had 20 columns, I'd want the max function to be applied in order from left to right.
Here is a suggested data.table solution. You might want to consider using data.table::frankv as follows:
DT[, .SD[which.max(frankv(.SD, ties.method = "first"))], by = a]
frankv() ranks the rows of .SD by all of its columns, left to right; which.max() picks the index of the top-ranked row, and .SD[...] subsets the group to that row.
Please let me know if it fails for your larger dataset.
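Applied to the example data, a sketch of the full call might look like this (DT is built here from the values in the question):
library(data.table)

DT <- data.table(a = c(1,1,1,2,2,2,3,3),
                 b = c(1,2,2,1,2,3,1,2),
                 c = c(1,1,2,1,5,3,4,1))

DT[, .SD[which.max(frankv(.SD, ties.method = "first"))], by = a]
#    a b c
# 1: 1 2 2
# 2: 2 3 3
# 3: 3 2 1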
To make this work for any number of columns, a possible dplyr solution is to use arrange_all():
library(dplyr)

df <- data.frame(a = c(1,1,1,2,2,2,3,3),
                 b = c(1,2,2,1,2,3,1,2),
                 c = c(1,1,2,1,5,3,4,1))

df %>% group_by(a) %>% arrange_all() %>% filter(row_number() == n())
# A tibble: 3 x 3
# Groups: a [3]
# a b c
# 1 1 2 2
# 2 2 3 3
# 3 3 2 1
A generic solution for an arbitrary number of columns can be achieved using arrange_at(). In the example below, c("a","b","c") stands in for the arbitrary set of columns.
library(dplyr)
df %>% arrange_at(.vars = vars(c("a","b","c"))) %>%
  mutate(changed = ifelse(a != lead(a), TRUE, FALSE)) %>%
  filter(is.na(changed) | changed) %>%
  select(-changed)
a b c
1 1 2 2
2 2 3 3
3 3 2 1
Another option uses max and dplyr, as below. The approach is to first group_by on a and filter for the maximum value of b, then group_by on both a and b and filter for rows with the maximum value of c.
library(dplyr)
df %>% group_by(a) %>%
  filter(b == max(b)) %>%
  group_by(a, b) %>%
  filter(c == max(c))
# Groups: a, b [3]
# a b c
# <int> <int> <int>
#1 1 2 2
#2 2 3 3
#3 3 2 1
Data
df <- read.table(text = "a b c
1: 1 1 1
2: 1 2 1
3: 1 2 2
4: 2 1 1
5: 2 2 5
6: 2 3 3
7: 3 1 4
8: 3 2 1", header = TRUE, stringsAsFactors = FALSE)
dat <- data.frame(a = c(1,1,1,2,2,2,3,3),
b = c(1,2,2,1,2,3,1,2),
c = c(1,1,2,1,5,3,4,1))
library(sqldf)
sqldf("with d as (select * from 'dat' group by a order by b, c desc) select * from d order by a")
a b c
1 1 2 2
2 2 3 3
3 3 2 1
I wish to add, as a new column, the first feature of each customer in the following dataset
mydf <- data.frame(customer = c(1,2,1,2,2,1,1),
                   feature = c("other", "a", "b", "c", "other", "b", "c"))
customer feature
1 1 other
2 2 a
3 1 b
4 2 c
5 2 other
6 1 b
7 1 c
by using dplyr. However, I wish my code to ignore the "other" feature and choose the first feature that is not "other".
So the following code is not sufficient:
library(dplyr)
new <- mydf %>%
  group_by(customer) %>%
  mutate(firstfeature = first(feature))
How can I ignore "other" so that I reach the following ideal output:
customer feature firstfeature
1 1 other b
2 2 a a
3 1 b b
4 2 c a
5 2 other a
6 1 b b
With dplyr we can group by customer and take the first non-"other" feature for every group.
library(dplyr)
mydf %>%
  group_by(customer) %>%
  mutate(firstfeature = feature[feature != "other"][1])
# customer feature firstfeature
# <dbl> <chr> <chr>
#1 1 other b
#2 2 a a
#3 1 b b
#4 2 c a
#5 2 other a
#6 1 b b
#7 1 c b
Similarly we can also do this with base R ave
mydf$firstfeature <- ave(mydf$feature, mydf$customer,
                         FUN = function(x) x[x != "other"][1])
Another option is data.table
library(data.table)
setDT(mydf)[, firstfeature := feature[feature != "other"][1], customer]
I have got a data.frame which contains two columns: ID and Letter. I need to summarize the Letter observations by ID.
Here an example:
df = read.table(text = 'ID Letter
1 A
1 A
1 B
1 A
1 C
1 D
1 B
2 A
2 B
2 B
2 B
2 D
2 F
3 B
3 A
3 A
3 C
3 D', header = TRUE)
My output should be 3 data.frames as follows:
df_1
A 3
B 2
C 1
D 1
df_2
A 1
B 3
D 1
F 1
df_3
A 2
B 1
C 1
D 1
It is just the count of the letters within each ID group. I think I could use a combination of the functions table and aggregate, but how?
Thanks to @akrun, please see below how I managed to do the trick:
# create a list of data.frames, one per ID
library(dplyr)
lst = lapply(split(df, df$ID), function(x) count(x, ID, Letter) %>% ungroup() %>% select(-ID))
lst = lapply(lst, as.data.frame)  # convert the tibbles to plain data.frames
This will also work (with base R):
lapply(split(df, df$ID), function(x) subset(as.data.frame(table(x$Letter)), Freq != 0))
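Since the question mentions aggregate, here is a minimal base-R sketch of that route as well (my own addition, not taken from the answers above): add a helper column of ones, sum it per ID/Letter combination, then split by ID.
counts <- aggregate(n ~ ID + Letter, data = transform(df, n = 1), FUN = sum)
split(counts[c("Letter", "n")], counts$ID)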