How to rank a column accordingly using case_when? - r

I wanted to create another column (called delayGrade) where the top 10% of values (closest to 0) from another column (averageDelay) get assigned the letter 'A', the next 25% 'B', and the remainder 'C'. I figured I could use a case_when call to do so, but I'm not sure how to go about it. Any ideas?

Here is a toy data frame and a solution:
library(tidyverse)

df <- tibble(
  averageDelay = rnorm(10)
)

df %>%
  mutate(
    delayGrade = case_when(
      averageDelay < quantile(averageDelay, .1) ~ "A",
      averageDelay < quantile(averageDelay, .35) ~ "B",
      TRUE ~ "C"
    )
  ) %>%
  arrange(averageDelay) # Not necessary, but improves readability
# A tibble: 10 x 2
   averageDelay delayGrade
          <dbl> <chr>
 1  -1.57878473 A
 2  -1.00129022 B
 3  -0.34245100 B
 4  -0.08652020 B
 5  -0.05240453 C
 6   0.15732711 C
 7   0.21509389 C
 8   0.34202367 C
 9   0.90296373 C
10   0.90820894 C
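As a side note (an addition, not part of the original answer): the same quantile-based grading can be written with cut() and quantile breaks instead of case_when. A minimal sketch:

df %>%
  mutate(
    # cut() labels each value by the interval it falls into; boundary handling
    # differs slightly from the strict < comparisons above (right-closed intervals)
    delayGrade = cut(
      averageDelay,
      breaks = quantile(averageDelay, c(0, .1, .35, 1)),
      labels = c("A", "B", "C"),
      include.lowest = TRUE
    )
  )

Note that cut() returns a factor; wrap it in as.character() if you need a character column to match the case_when output.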

Related

R: dplyr How to group by then filter rows based on the condition of each group's first row

I have a simple data frame such as
df <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y = c('a', 'b', 'a', 'c', 'e', 'd', 'e', 'a', 'f', 'c'))
I want to group by x; then, if the first row of each x group has y == 'a', keep only the rows with y == 'a' | y == 'c'.
So I expect the outcome to contain rows 1, 3, 4, 8, and 10.
Thank you very much.
After grouping by 'x', create an & condition: 1) check whether the first value of 'y' is 'a'; 2) check for the values 'a' and 'c' in the column.
library(dplyr)

df %>%
  group_by(x) %>%
  filter('a' == first(y), y %in% c('a', 'c')) %>%
  ungroup()
-output
# A tibble: 5 × 2
      x y
  <dbl> <chr>
1     1 a
2     1 a
3     1 c
4     3 a
5     3 c
If we have additional rules, create a named list where the names are the expected first values of 'y' and the elements are the vectors of values to keep. Then extract the list element based on the first value of 'y' and use that vector in the logical expression with %in%:
df %>%
  group_by(x) %>%
  filter(y %in% list(a = c('a', 'c'), e = 'e')[[first(y)]]) %>%
  ungroup()
-output
# A tibble: 7 × 2
      x y
  <dbl> <chr>
1     1 a
2     1 a
3     1 c
4     2 e
5     2 e
6     3 a
7     3 c
Here is another option, combining dplyr's filter with base R's ave:
> df %>%
+   filter(y %in% c("a", "c") & ave(y == "a", x, FUN = first))
  x y
1 1 a
2 1 a
3 1 c
4 3 a
5 3 c

Group data by factor level, then transform to data frame with colname being levels?

Here is my problem, which I can't solve:
Data:
df <- data.frame(f1 = c("a", "a", "b", "b", "c", "c", "c"),
                 v1 = c(10, 11, 4, 5, 0, 1, 2))
In this data frame, f1 is a factor:
f1 v1
a 10
a 11
b 4
b 5
c 0
c 1
c 2
# What I want (for example): fetch the data for levels with exactly 2 elements, then convert to a data frame with one column per level:
a b
10 4
11 5
Thanks in advance!
I might be missing something simple here, but the approach below using dplyr (plus tidyr for spread) works.
library(dplyr)
library(tidyr) # spread() comes from tidyr

nlevels <- 2

df1 <- df %>%
  add_count(f1) %>%
  filter(n == nlevels) %>%
  select(-n) %>%
  mutate(rn = row_number()) %>%
  spread(f1, v1) %>%
  select(-rn)
This gives
#      a     b
#  <int> <int>
#1    10    NA
#2    11    NA
#3    NA     4
#4    NA     5
Now, if you want to remove NA's we can do
do.call("cbind.data.frame", lapply(df1, function(x) x[!is.na(x)]))
# a b
#1 10 4
#2 11 5
Because we filtered the data frame down to levels with exactly nlevels observations, each column in the final data frame has the same number of rows.
split might be useful here to split df$v1 into parts corresponding to df$f1. Since you are always extracting equal-length chunks, the result can then simply be combined back into a data.frame:
spl <- split(df$v1, df$f1)
data.frame(spl[lengths(spl)==2])
# a b
#1 10 4
#2 11 5
Or do it all in one call by combining this with Filter:
data.frame(Filter(function(x) length(x)==2, split(df$v1, df$f1)))
# a b
#1 10 4
#2 11 5
Here is a solution using unstack:
unstack(
  droplevels(df[ave(df$v1, df$f1, FUN = function(x) length(x) == 2) == 1, ]),
  v1 ~ f1
)
# a b
# 1 10 4
# 2 11 5
A variant, similar to #thelatemail's solution:
data.frame(Filter(function(x) length(x) == 2, unstack(df,v1 ~ f1)))
My tidyverse solution would be:
library(tidyverse)
df %>%
  group_by(f1) %>%
  filter(n() == 2) %>%
  mutate(i = row_number()) %>%
  spread(f1, v1) %>%
  select(-i)
# # A tibble: 2 x 2
#       a     b
# * <dbl> <dbl>
# 1    10     4
# 2    11     5
or mixing approaches :
as_tibble(keep(unstack(df,v1 ~ f1), ~length(.x) == 2))
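A side note, not from the original answers: spread() is superseded in current tidyr, so the same pivot can be sketched with pivot_wider():

df %>%
  group_by(f1) %>%
  filter(n() == 2) %>%          # keep levels with exactly 2 observations
  mutate(i = row_number()) %>%  # row index within each level
  ungroup() %>%
  pivot_wider(names_from = f1, values_from = v1) %>%
  select(-i)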
Using all base functions (but you should use tidyverse)
# Work on a copy of df (the original snippet referred to it as x)
x <- df
# Add count of instances per level
x$len <- ave(x$v1, x$f1, FUN = length)
# Filter, drop the count
x <- x[x$len == 2, c('f1', 'v1')]
# Hacky pivot
result <- data.frame(
  lapply(unique(x$f1), FUN = function(y) x$v1[x$f1 == y])
)
colnames(result) <- unique(x$f1)
> result
   a b
1 10 4
2 11 5
I'd code it like this; maybe it helps you:
library(reshape2)
library(dplyr)

aa <- data.frame(v1 = c('a', 'a', 'b', 'b', 'c', 'c', 'c'),
                 f1 = c(10, 11, 4, 5, 0, 1, 2))
cc <- aa %>% group_by(v1) %>% summarise(id = length(v1)) # count per level
dd <- merge(aa, cc)                  # attach the level counts
ee <- dd[dd$id == 2, ]               # keep levels with exactly 2 observations
ee$id <- rep(c(1, 2), nrow(ee) / 2)  # reset index like (1, 2, 1, 2)
dcast(ee, id ~ v1, value.var = 'f1')
all done!

Select rows based on non-directed combinations of columns

I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)

set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1 != group2)
df$score <- runif(n = 12, min = 0, max = 1)
df
# A tibble: 12 × 3
   group1 group2       score
   <fctr> <fctr>       <dbl>
 1 a      b      0.113703411
 2 a      c      0.622299405
 3 a      d      0.609274733
 4 b      a      0.623379442
 5 b      c      0.860915384
 6 b      d      0.640310605
 7 c      a      0.009495756
 8 c      b      0.232550506
 9 c      d      0.666083758
10 d      a      0.514251141
11 d      b      0.693591292
12 d      c      0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)
# Create a function that computes an order-independent group ID
# from a data frame of group columns (one ID per row)
get_group_id <- function(groups) {
  apply(groups, 1, function(row) {
    paste0(sort(row), collapse = "_")
  })
}

group_id <- get_group_id(select(df, -score))

# Perform the computation
df %>%
  mutate(groupId = group_id) %>%
  group_by(groupId) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-groupId)
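As a variation (a sketch, not from the original answer, assuming exactly two group columns that can be coerced with as.character()): the order-independent key can also be built vectorised with pmin()/pmax() instead of a row-wise apply():

library(dplyr)

df %>%
  group_by(key = paste(pmin(as.character(group1), as.character(group2)),
                       pmax(as.character(group1), as.character(group2)),
                       sep = "_")) %>%  # order-independent pair ID
  slice(which.max(score)) %>%           # keep the higher-scoring duplicate
  ungroup() %>%
  select(-key)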

Bind data frames on longer identifiers R

I've got two data frames whose common unique identifiers differ in their number of observations. I would like to create a data frame that, for each common identifier, takes the observations from whichever frame has more observations for that identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split:
dd <- rbind(cbind(f1, s = "f1"), cbind(f2, s = "f2"))
keep <- unsplit(lapply(split(dd$s, dd$x), FUN = function(x) {
  y <- table(x)
  x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep, ]
Normally I'd prefer to use the ave function here, but because I'm changing data types from a factor to a logical it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
dplyr solution
library(dplyr)
First we combine the data with rbind() and introduce a new variable called ref to record where each observation came from:

both <- rbind(f1, f2)
both$ref <- rep(c("f1", "f2"), c(nrow(f1), nrow(f2)))
Then count the observations: make another new variable holding the number of observations for each ref and x combination:

both_with_counts <- both %>%
  group_by(ref, x) %>%
  mutate(counts = n())
Then filter for the largest count:

both_with_counts %>%
  group_by(x) %>%
  filter(counts == max(counts))
Note: you could also keep only the x and y columns with select(x, y).
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
##   x y ref counts
## 1 a 1  f1      2
## 2 a 1  f1      2
## 3 c 3  f1      3
## 4 c 3  f1      3
## 5 c 3  f1      3
## 6 b 5  f2      2
## 7 b 5  f2      2
Altogether now...
what_I_want <-
  rbind(cbind(f1, ref = "f1"), cbind(f2, ref = "f2")) %>%
  group_by(ref, x) %>%
  mutate(counts = n()) %>%
  group_by(x) %>%
  filter(counts == max(counts)) %>%
  select(x, y)
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x", "freq")
f1new <- merge(f1, f1table)

f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x", "freq")
f2new <- merge(f2, f2table)

table <- rbind(f1table, f2table)
table <- table[with(table, order(x, -freq)), ]
table <- table[!duplicated(table$x), ]

data <- rbind(f1new, f2new)
merge(data, table, by = c("x", "freq"))[, c(1, 3)]
  x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3
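For what it's worth (an addition, not from the original answers), with current dplyr the count-and-keep-the-larger idea above can be written more compactly. A minimal sketch using bind_rows() and add_count():

library(dplyr)

bind_rows(f1 = f1, f2 = f2, .id = "ref") %>%  # stack both frames, label the source
  add_count(ref, x, name = "counts") %>%      # observations per source/identifier pair
  group_by(x) %>%
  filter(counts == max(counts)) %>%           # keep rows from the larger source
  ungroup() %>%
  select(x, y)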

dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

I want to make a grouped filter using dplyr, in a way that within each group only that row is returned which has the minimum value of variable x.
My problem is: As expected, in the case of multiple minima all rows with the minimum value are returned. But in my case, I only want the first row if multiple minima are present.
Here's an example:
df <- data.frame(
  A = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  x = c(1, 1, 2, 2, 3, 4, 5, 5, 5),
  y = rnorm(9)
)
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, x == min(x))
As expected, all minima are returned:
Source: local data frame [6 x 3]
Groups: A

  A x           y
1 A 1 -1.04584335
2 A 1  0.97949399
3 B 2  0.79600971
4 C 5 -0.08655151
5 C 5  0.16649962
6 C 5 -0.05948012
With ddply, I would have approached the task this way:
library(plyr)
ddply(df, .(A), function(z) {
  z[z$x == min(z$x), ][1, ]
})
... which works:
  A x           y
1 A 1 -1.04584335
2 B 2  0.79600971
3 C 5 -0.08655151
Q: Is there a way to approach this in dplyr? (For speed reasons)
Update
With dplyr >= 0.3 you can use the slice function in combination with which.min, which would be my favorite approach for this task:
df %>% group_by(A) %>% slice(which.min(x))
#Source: local data frame [3 x 3]
#Groups: A
#
#  A x          y
#1 A 1  0.2979772
#2 B 2 -1.1265265
#3 C 5 -1.1952004
Original answer
For the sample data, it is also possible to chain two filter calls:
group_by(df, A) %>%
  filter(x == min(x)) %>%
  filter(1:n() == 1)
Just for completeness: Here's the final dplyr solution, derived from the comments of #hadley and #Arun:
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method = "first") == 1)
For what it's worth, here's a data.table solution for those who may be interested:
library(data.table)

# approach with setting keys
dt <- as.data.table(df)
setkey(dt, A, x)
dt[J(unique(A)), mult = "first"]

# without using keys
dt <- as.data.table(df)
dt[dt[, .I[which.min(x)], by = A]$V1]
This can be accomplished by using row_number combined with group_by. row_number handles ties by assigning a rank not only by the value but also by the relative order within the vector. To get the first row of each group with the minimum value of x:
df.g <- group_by(df, A)
filter(df.g, row_number(x) == 1)
For more information see the dplyr vignette on window functions.
dplyr offers the slice_min function, which does the job with the argument with_ties = FALSE:
library(dplyr)

df %>%
  group_by(A) %>%
  slice_min(x, with_ties = FALSE)
Output:
# A tibble: 3 x 3
# Groups:   A [3]
  A         x      y
  <fct> <dbl>  <dbl>
1 A         1  0.273
2 B         2 -0.462
3 C         5  1.08
Another way to do it:
set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))
x <- dplyr::arrange(x, a, b)
dplyr::filter(x, !duplicated(a))
Result:
a b
1 1 -0.8356286
2 2 -2.2146999
It could also easily be adapted to get the row in each group with the maximum value, as sketched below.
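A minimal sketch of that adaptation (an addition, not from the original answer; same toy data, sorted descending so each group's maximum comes first):

set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))
x <- dplyr::arrange(x, a, dplyr::desc(b))  # largest b first within each a
dplyr::filter(x, !duplicated(a))           # keep the first (max) row per group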
In case you are looking to filter the minima of x and then the minima of y: an intuitive way to do it is just using filtering functions:
> df
  A x            y
1 A 1  1.856368296
2 A 1 -0.298284187
3 A 2  0.800047796
4 B 2  0.107289719
5 B 3  0.641819999
6 B 4  0.650542284
7 C 5  0.422465687
8 C 5  0.009819306
9 C 5 -0.482082635
df %>%
  group_by(A) %>%
  filter(x == min(x), y == min(y))
# A tibble: 3 x 3
# Groups:   A [3]
  A         x      y
  <chr> <dbl>  <dbl>
1 A         1 -0.298
2 B         2  0.107
3 C         5 -0.482
This code keeps the rows where both x and y are at their group minimum. You can also do a double filter, which reads even more clearly:

df %>%
  group_by(A) %>%
  filter(x == min(x)) %>%
  filter(y == min(y))
# A tibble: 3 x 3
# Groups:   A [3]
  A         x      y
  <chr> <dbl>  <dbl>
1 A         1 -0.298
2 B         2  0.107
3 C         5 -0.482
I like sqldf for its simplicity:
library(sqldf)
sqldf("select A, min(X), y from 'df.g' group by A")
Output:
  A min(X)          y
1 A      1 -1.4836989
2 B      2  0.3755771
3 C      5  0.9284441
For the sake of completeness, here's the base R answer:
df[with(df, ave(x, A, FUN = \(x) rank(x, ties.method = "first")) == 1), ]
#  A x          y
#1 A 1  0.1076158
#4 B 2 -1.3909084
#7 C 5  0.3511618
