dplyr mutate: Excluding observations similar to the current one - r

I have some data like this:
X Y
-----
A 1
A 2
B 3
B 4
C 5
C 6
I would like to add a new column with values equal to the mean of all Ys in rows where X is not euqal to X of the current observation.
In this particlar case we would get
X Y Mean
-------------------
A 1 (3+4+5+6)/4
A 2 (3+4+5+6)/4
B 3 (1+2+5+6)/4
B 4 (1+2+5+6)/4
C 5 (1+2+3+4)/4
C 6 (1+2+3+4)/4
Thanks in advance!

You can likely do this more succinctly, but this will get you the result.
You essentially create a column which contains the total observations and sum of records for the whole data.frame. Then you group by the X column and repeat the process, by taking the difference you can calculate your mean.
data
df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
Y = c(1:6))
solution
library(tidyverse)
df %>%
mutate(total_sum = sum(Y),
total_obs = n()) %>%
group_by(X) %>%
mutate(group_sum = sum(Y),
group_obs = n()) %>%
ungroup() %>%
mutate(other_group_sum = total_sum - group_sum,
other_group_obs = total_obs - group_obs,
other_mean = other_group_sum/other_group_obs) %>%
select(X, Y, other_mean)
result
# A tibble: 6 x 3
X Y other_mean
<fct> <int> <dbl>
1 A 1 4.50
2 A 2 4.50
3 B 3 3.50
4 B 4 3.50
5 C 5 2.50
6 C 6 2.50

Related

Solution on R group by issue _ multiple combination

I'm using group by funciton in a dataset using R software. But the target of the id would duplicate. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
In tradtional groupby function by each id, I can do
DT<- data.table(dataset )
DT[,sum(Var1),by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I've to group ID by A+B and B+C and D
(PS. say that F=A+B ,G=B+C)
and the target result dataset below:
ID V1
F 6
G 6
D 2
IF I use recoding technique on ID, the duplicate B would be covered twice.
IS there any one have the solution?
MANY THANKS!
library(dplyr)
library(tidyr)
df <- df %>% mutate(F=ifelse(ID %in% c("A", "B"), 1, 0),
G = ifelse(ID %in% c("B", "C"), 1, 0),
D = ifelse(ID == "D", 1, 0))
df %>%
gather(var, val, F:D) %>%
filter(val==1) %>%
group_by(var) %>%
summarise(V1=sum(V1))
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6

Rank a dataframe based on multiple conditions [duplicate]

Suppose I have the following data
df = data.frame(name=c("A", "B", "C", "D"), score = c(10, 10, 9, 8))
I want to add a new column with the ranking. This is what I'm doing:
df %>% mutate(ranking = rank(score, ties.method = 'first'))
# name score ranking
# 1 A 10 3
# 2 B 10 4
# 3 C 9 2
# 4 D 8 1
However, my desired result is:
# name score ranking
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
Clearly rank does not do what I have in mind. What function should I be using?
It sounds like you're looking for dense_rank from "dplyr" -- but applied in a reverse order than what rank normally does.
Try this:
df %>% mutate(rank = dense_rank(desc(score)))
# name score rank
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
Other solution when you need to apply the rank to all variables (not just one).
df = data.frame(name = c("A","B","C","D"),
score=c(10,10,9,8), score2 = c(5,1,9,2))
select(df, -name) %>% mutate_all(funs(dense_rank(desc(.))))
#user101089 --- you can try out with this alternative way:
df = data.frame(name = c("A","B","C","D"),
score=c(10,10,9,8), score2 = c(5,1,9,2))
df %>% mutate(rank_score = dense_rank(desc(score)),
rank_score2 = dense_rank(desc(score2)))

subtracting values group wise on a tibble

I have a simple tibble and want to calculate the absolute difference of values after grouping them:
tibb <- tibble(id = c("A", "B", "C","A", "B", "C"), value = c(5,4,3,7,8,9))
# A tibble: 6 x 2
id value
<chr> <dbl>
1 A 5
2 B 4
3 C 3
4 A 7
5 B 8
6 C 9
tibb %>% group_by(id) %>% summarize(diff = function(x,y){abs(x-y)})
dplyr return an error stating diff is not supported.
The output should looks like this:
# A tibble: 3 x 2
id sum
<chr> <int>
1 A 2
2 B 4
3 C 6
Error in summarise_impl(.data, dots) :
Column `diff` is of unsupported type function
Is there any way to calculate this?
If there are 2 values in each group then try this:
tibb <- tibble(id = c("A", "B", "C","A", "B", "C"), value = c(5,4,3,7,8,9))
tibb
tibb %>% group_by(id) %>% summarize(diff = abs(diff(value)))
#or
tibb %>% group_by(id) %>% summarize(diff = abs(value[1] - value[2]))

Create a ranking variable with dplyr?

Suppose I have the following data
df = data.frame(name=c("A", "B", "C", "D"), score = c(10, 10, 9, 8))
I want to add a new column with the ranking. This is what I'm doing:
df %>% mutate(ranking = rank(score, ties.method = 'first'))
# name score ranking
# 1 A 10 3
# 2 B 10 4
# 3 C 9 2
# 4 D 8 1
However, my desired result is:
# name score ranking
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
Clearly rank does not do what I have in mind. What function should I be using?
It sounds like you're looking for dense_rank from "dplyr" -- but applied in a reverse order than what rank normally does.
Try this:
df %>% mutate(rank = dense_rank(desc(score)))
# name score rank
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
Other solution when you need to apply the rank to all variables (not just one).
df = data.frame(name = c("A","B","C","D"),
score=c(10,10,9,8), score2 = c(5,1,9,2))
select(df, -name) %>% mutate_all(funs(dense_rank(desc(.))))
#user101089 --- you can try out with this alternative way:
df = data.frame(name = c("A","B","C","D"),
score=c(10,10,9,8), score2 = c(5,1,9,2))
df %>% mutate(rank_score = dense_rank(desc(score)),
rank_score2 = dense_rank(desc(score2)))

dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

I want to make a grouped filter using dplyr, in a way that within each group only that row is returned which has the minimum value of variable x.
My problem is: As expected, in the case of multiple minima all rows with the minimum value are returned. But in my case, I only want the first row if multiple minima are present.
Here's an example:
df <- data.frame(
A=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
x=c(1, 1, 2, 2, 3, 4, 5, 5, 5),
y=rnorm(9)
)
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, x == min(x))
As expected, all minima are returned:
Source: local data frame [6 x 3]
Groups: A
A x y
1 A 1 -1.04584335
2 A 1 0.97949399
3 B 2 0.79600971
4 C 5 -0.08655151
5 C 5 0.16649962
6 C 5 -0.05948012
With ddply, I would have approach the task that way:
library(plyr)
ddply(df, .(A), function(z) {
z[z$x == min(z$x), ][1, ]
})
... which works:
A x y
1 A 1 -1.04584335
2 B 2 0.79600971
3 C 5 -0.08655151
Q: Is there a way to approach this in dplyr? (For speed reasons)
Update
With dplyr >= 0.3 you can use the slice function in combination with which.min, which would be my favorite approach for this task:
df %>% group_by(A) %>% slice(which.min(x))
#Source: local data frame [3 x 3]
#Groups: A
#
# A x y
#1 A 1 0.2979772
#2 B 2 -1.1265265
#3 C 5 -1.1952004
Original answer
For the sample data, it is also possible to use two filter after each other:
group_by(df, A) %>%
filter(x == min(x)) %>%
filter(1:n() == 1)
Just for completeness: Here's the final dplyr solution, derived from the comments of #hadley and #Arun:
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
For what it's worth, here's a data.table solution, to those who may be interested:
# approach with setting keys
dt <- as.data.table(df)
setkey(dt, A,x)
dt[J(unique(A)), mult="first"]
# without using keys
dt <- as.data.table(df)
dt[dt[, .I[which.min(x)], by=A]$V1]
This can be accomplished by using row_number combined with group_by. row_number handles ties by assigning a rank not only by the value but also by the relative order within the vector. To get the first row of each group with the minimum value of x:
df.g <- group_by(df, A)
filter(df.g, row_number(x) == 1)
For more information see the dplyr vignette on window functions.
dplyr offers slice_min function, wich do the job with the argument with_ties = FALSE
library(dplyr)
df %>%
group_by(A) %>%
slice_min(x, with_ties = FALSE)
Output :
# A tibble: 3 x 3
# Groups: A [3]
A x y
<fct> <dbl> <dbl>
1 A 1 0.273
2 B 2 -0.462
3 C 5 1.08
Another way to do it:
set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))
x <- dplyr::arrange(x, a, b)
dplyr::filter(x, !duplicated(a))
Result:
a b
1 1 -0.8356286
2 2 -2.2146999
Could also be easily adapted for getting the row in each group with maximum value.
In case you are looking to filter the minima of x and then the minima of y. An intuitive way of do it is just using filtering functions:
> df
A x y
1 A 1 1.856368296
2 A 1 -0.298284187
3 A 2 0.800047796
4 B 2 0.107289719
5 B 3 0.641819999
6 B 4 0.650542284
7 C 5 0.422465687
8 C 5 0.009819306
9 C 5 -0.482082635
df %>% group_by(A) %>%
filter(x == min(x), y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
This code will filter the minima of x and y.
Also you can do a double filter
that looks even more readable:
df %>% group_by(A) %>%
filter(x == min(x)) %>%
filter(y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
I like sqldf for its simplicity..
sqldf("select A,min(X),y from 'df.g' group by A")
Output:
A min(X) y
1 A 1 -1.4836989
2 B 2 0.3755771
3 C 5 0.9284441
For the sake of completeness, here's the base R answer:
df[with(df, ave(x, A, FUN = \(x) rank(x, ties.method = "first")) == 1), ]
# A x y
#1 A 1 0.1076158
#4 B 2 -1.3909084
#7 C 5 0.3511618

Resources