summarize in dplyr with the maximum value of the date

I have the following data,
data
date ID value1 value2
2016-04-03 1 0 1
2016-04-10 1 6 2
2016-04-17 1 7 3
2016-04-24 1 2 4
2016-04-03 2 1 5
2016-04-10 2 5 6
2016-04-17 2 9 7
2016-04-24 2 4 8
Now I want to group by ID and find the mean of value2 and the latest value of value1. By "latest" I mean the value1 at the latest date, i.e. here the value1 corresponding to 2016-04-24 for each ID. My output should be like:
ID max_value1 mean_value2
1 2 2.5
2 4 6.5
The following is the command I am using,
data %>% group_by(ID) %>% summarize(mean_value2 = mean(value2))
But I am not sure how to do the first one. Can anybody help me get the latest value of value1 while summarizing in dplyr?

One way would be the following. My assumption here is that date is a Date object. You arrange the rows by date within each ID using arrange(). Then you group the data by ID. In summarize(), you can use last() to take the last value1 for each ID.
arrange(data, ID, date) %>%
  group_by(ID) %>%
  summarize(mean_value2 = mean(value2), max_value1 = last(value1))
# ID mean_value2 max_value1
# <int> <dbl> <int>
#1 1 2.5 2
#2 2 6.5 4
DATA
data <- structure(list(date = structure(c(16894, 16901, 16908, 16915,
16894, 16901, 16908, 16915), class = "Date"), ID = c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L), value1 = c(0L, 6L, 7L, 2L, 1L, 5L, 9L,
4L), value2 = 1:8), .Names = c("date", "ID", "value1", "value2"
), row.names = c(NA, -8L), class = "data.frame")

Here is an option with data.table
library(data.table)
setDT(data)[, .(max_value1 = value1[which.max(date)],
                mean_value2 = mean(value2)), by = ID]
# ID max_value1 mean_value2
#1: 1 2 2.5
#2: 2 4 6.5

You can do this using the nth() function in dplyr, which returns the nth value of a vector.
data %>%
  group_by(ID) %>%
  summarize(max_value1 = nth(value1, n = length(value1)), mean_value2 = mean(value2))
This is based on the assumption that the data is ordered by date as in the example; otherwise use arrange as discussed above.
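For completeness, here is a minimal sketch that sidesteps the ordering issue by using the order_by argument of last(); treat it as an untested variant of the approaches above (it assumes date is a Date column as in the DATA block above):
library(dplyr)
data %>%
  group_by(ID) %>%
  summarize(max_value1 = last(value1, order_by = date),  # value1 at the latest date per ID
            mean_value2 = mean(value2))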

Related

How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I encountered the issue of not being able to calculate the sum of the absolute differences between all columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We may subtract the data frame without its last column from the data frame without its first column and use rowSums on the absolute values, in base R. This could be very efficient compared to a package solution.
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
Output:
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
This is a bit complicated. The idea is to generate a list of the column pairs to compare; I tried combn, but that returns all possible combinations, so I created the list by hand. With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~ {
  df <- tibble(a = data[[.[[1]]]] - data[[.[[2]]]])
  names(df) <- paste0(.[[1]], "_v_", .[[2]])
  df
}) %>%
  transmute(sum_diff = rowSums(abs(.))) %>%
  bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data (from akrun, many thanks!):
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
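As a side note, the hand-written combinations list above can also be built programmatically from the column names; a small sketch, assuming the columns to compare are all columns of data in their existing order:
library(purrr)
cols <- names(data)                                     # "A" "B" "C" "D"
combinations <- map2(cols[-length(cols)], cols[-1], c)  # consecutive pairs A-B, B-C, C-D
str(combinations)
# List of 3
#  $ : chr [1:2] "A" "B"
#  $ : chr [1:2] "B" "C"
#  $ : chr [1:2] "C" "D"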
Here is a dplyr version of @akrun's elegant approach that calculates the difference of the data frame with its shifted variant:
df %>%
  mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1)) -
                                  identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is used as a vector that gets subtracted by its shifted self.
df %>%
  rowwise() %>%
  mutate(sum_diff = map2_int(c_across(1:last_col(1)),
                             c_across(2:last_col()),
                             ~ abs(.x - .y)) %>% sum())

Set unique IDs which start from zero in R data.frame

I have a data frame that looks like this
column1
1
1
2
3
3
and I would like to give a unique ID to each element. My problem is that I cannot find a way to make the unique IDs start from zero, like this:
column1 column2
1 0
1 0
2 1
3 2
3 2
Any help is appreciated
Try this: cur_group_id() from dplyr will create the id starting from 1, but you can easily make it start from zero:
library(dplyr)
#Data
df <- structure(list(column1 = c(0L, 1L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,-5L))
#Mutate
df %>% group_by(column1) %>% mutate(id=cur_group_id()-1)
# A tibble: 5 x 2
# Groups: column1 [4]
column1 id
<int> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 3 3
We could use match
library(dplyr)
df1 %>%
mutate(column2 = match(column1, unique(column1)) - 1)
data
df1 <- structure(list(column1 = c(1L, 1L, 2L, 3L, 3L)), class = "data.frame",
row.names = c(NA,
-5L))
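The same match() idea also works in base R, without any grouping; a quick sketch using the data above:
df1$column2 <- match(df1$column1, unique(df1$column1)) - 1
df1
#   column1 column2
# 1       1       0
# 2       1       0
# 3       2       1
# 4       3       2
# 5       3       2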

Filtering using dplyr package

My dataset is set up as follows:
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1
I'm trying to find out the users that are present on all three days. I'm using the below code with the dplyr package:
MAU%>%
group_by(User)%>%
filter(c(1,2,3) %in% Day)
# but get this error message:
# Error in filter_impl(.data, quo) : Result must have length 12, not 3
Any idea how to fix this?
Using the input shown reproducibly in the Note at the end, count the distinct Users and filter out those for which there are 3 days:
library(dplyr)
DF %>%
distinct %>%
count(User) %>%
filter(n == 3) %>%
select(User)
giving:
# A tibble: 1 x 1
User
<int>
1 1
Note
Lines <- "
User Day
10 2
1 3
15 1
3 1
1 2
15 3
1 1"
DF <- read.table(text = Lines, header = TRUE)
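If the matching rows themselves are wanted rather than just the User values, one option is to semi_join the result back onto DF; a sketch (users_all_days is just a throwaway name introduced here):
library(dplyr)
users_all_days <- DF %>%
  distinct() %>%
  count(User) %>%
  filter(n == 3)
DF %>% semi_join(users_all_days, by = "User")  # keeps only the rows of users present on all 3 days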
We can use all to get a single TRUE/FALSE from the logical vector 1:3 %in% Day
library(dplyr)
MAU %>%
group_by(User)%>%
filter(all(1:3 %in% Day))
# A tibble: 3 x 2
# Groups: User [1]
# User Day
# <int> <int>
#1 1 3
#2 1 2
#3 1 1
data
MAU <- structure(list(User = c(10L, 1L, 15L, 3L, 1L, 15L, 1L), Day = c(2L,
3L, 1L, 1L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
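Yet another hedged variant of the same filter: since Day only takes the values 1 to 3 here, counting distinct days per User gives the same rows (a sketch):
library(dplyr)
MAU %>%
  group_by(User) %>%
  filter(n_distinct(Day) == 3) %>%  # users observed on 3 distinct days
  ungroup()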

Removing duplicates based on 3 columns in R

I have a data set of 300k+ cases where a customer id may be repeated several times. Each customer also has a date and rank associated with it. I'd like to keep only unique customer ids, selecting for each id the row with the latest date; if an id has duplicate dates, the tie should be broken by rank (keeping the rank closest to 1). An example of my data is like this:
Customer.ID Date Rank
576293 8/13/2012 2
576293 11/16/2015 6
581252 11/22/2013 4
581252 11/16/2011 6
581252 1/4/2016 5
581600 1/12/2015 3
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
Ideal outcome would then be like this:
Customer.ID Date Rank
576293 11/16/2015 6
581252 1/4/2016 5
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
With the desired output of the OP clarified:
We can also do this with base R, which will be faster than the below dplyr approach using group_by(Customer.ID) since we are not going to have to loop over all unique Customer.ID:
df <- df[order(-df$Customer.ID,as.Date(df$Date, format="%m/%d/%Y"),-df$Rank, decreasing=TRUE),]
res <- df[!duplicated(df$Customer.ID),]
Notes:
First, sort by Customer.ID in ascending order followed by Date in descending order followed by Rank in ascending order.
Remove the duplicates in Customer.ID so that only the first row for each Customer.ID is kept.
The result using your posted data as a data frame df (without converting the Date column) in ascending order for Customer.ID:
print(res)
## Customer.ID Date Rank
##2 576293 11/16/2015 6
##5 581252 1/4/2016 5
##7 581600 1/12/2015 2
##8 582560 4/13/2016 1
##10 586334 3/30/2014 1
##9 591674 3/21/2012 6
Data:
df <- structure(list(Customer.ID = c(591674L, 586334L, 582560L, 581600L,
581252L, 576293L), Date = c("3/21/2012", "3/30/2014", "4/13/2016",
"1/12/2015", "1/4/2016", "11/16/2015"), Rank = c(6L, 1L, 1L,
2L, 5L, 6L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(9L,
10L, 8L, 7L, 5L, 2L), class = "data.frame")
If you want to keep only the latest date (followed by lower rank) row for each Customer.ID, you can do the following using dplyr:
library(dplyr)
res <- df %>% group_by(Customer.ID) %>% arrange(desc(Date),Rank) %>%
summarise_all(funs(first)) %>%
ungroup() %>% arrange(Customer.ID)
Notes:
group_by Customer.ID and sort using arrange by Date in descending order and Rank by ascending order.
summarise_all to keep only the first row from each Customer.ID.
Finally, ungroup and sort by Customer.ID to get your desired result.
Using your data as a data frame df with the Date column converted to the Date class:
print(res)
## # A tibble: 6 x 3
## Customer.ID Date Rank
## <int> <date> <int>
##1 576293 2015-11-16 6
##2 581252 2016-01-04 5
##3 581600 2015-01-12 2
##4 582560 2016-04-13 1
##5 586334 2014-03-30 1
##6 591674 2012-03-21 6
Data:
df <- structure(list(Customer.ID = c(576293L, 576293L, 581252L, 581252L,
581252L, 581600L, 581600L, 582560L, 591674L, 586334L), Date = structure(c(15565,
16755, 16031, 15294, 16804, 16447, 16447, 16904, 15420, 16159
), class = "Date"), Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L,
6L, 1L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(NA,
-10L), class = "data.frame")
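A more compact dplyr sketch of the same idea, assuming Date is already a Date column as in the data block above: distinct() keeps the first row it meets per Customer.ID, so the preceding arrange() decides which row survives.
library(dplyr)
df %>%
  arrange(desc(Date), Rank) %>%               # latest date first, then best (lowest) Rank
  distinct(Customer.ID, .keep_all = TRUE) %>%
  arrange(Customer.ID)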

How to get the top cases for each group using dplyr? [duplicate]

This question already has answers here:
Getting the top values by group
(6 answers)
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Convert data from long format to wide format with multiple measure columns
(6 answers)
Closed 4 years ago.
I have a data table with 3 columns: ID, Type, and Count. For each ID, I want to get the two Types with the highest Count for that ID, and flatten the result into one row. For example, if my data table is like below:
ID Type Count
A 1 8
B 1 3
A 2 5
A 3 2
B 2 1
B 3 4
Then I want my output to be two rows like below:
ID Top1Type Top1TypeCount Top2Type Top2TypeCount
A 1 8 2 5
B 3 4 1 3
Can anyone tell me how to achieve this using the dplyr library in R? Thank you very much.
It's usually better to keep your data in a long/tidy format. To achieve that, you can use:
df1 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID)
which gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 5
3 B 1 3
4 B 3 4
When you have ties, you can use slice to select an equal number of observations for each group:
# some example data
df2 <- structure(list(ID = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Type = c(1L, 1L, 2L, 3L, 2L, 3L),
Count = c(8L, 3L, 8L, 8L, 1L, 4L)),
.Names = c("ID", "Type", "Count"), class = "data.frame", row.names = c(NA, -6L))
Without slice():
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID)
gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 8
3 A 3 8
4 B 1 3
5 B 3 4
With the use of slice():
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID) %>% slice(1:2)
gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 8
3 B 1 3
4 B 3 4
With arrange you can determine the order of the cases and thus which are selected by slice. The following:
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID, -Type) %>% slice(1:2)
gives this result:
ID Type Count
(fctr) (int) (int)
1 A 3 8
2 A 2 8
3 B 3 4
4 B 1 3
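In more recent dplyr (1.0 or later), slice_max() covers the same use case and makes the tie behaviour explicit; a sketch with the tied data df2:
library(dplyr)
df2 %>%
  group_by(ID) %>%
  slice_max(Count, n = 2, with_ties = FALSE) %>%  # exactly two rows per ID; ties broken by row order
  ungroup()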
Using data.table, we convert the 'data.frame' to 'data.table' (setDT(df1)), order by 'Count' in descending order and, grouped by 'ID', subset the first two rows (head(.SD, 2)). Then we create a sequence column ('N') grouped by 'ID' and dcast from 'long' to 'wide'. The data.table dcast can take multiple value.var columns.
library(data.table)#v1.9.6+
DT <- setDT(df1)[order(-Count), head(.SD, 2) , by = ID]
DT[, N:= 1:.N, by = ID]
dcast(DT, ID ~ paste0('Top', N),
      value.var = c('Type', 'Count'), fill = 0)
# ID Type_Top1 Type_Top2 Count_Top1 Count_Top2
#1: A 1 2 8 5
#2: B 3 1 4 3
data
df1 <- structure(list(ID = c("A", "B", "A", "A", "B", "B"),
Type = c(1L,
1L, 2L, 3L, 2L, 3L), Count = c(8L, 3L, 5L, 2L, 1L, 4L)),
.Names = c("ID",
"Type", "Count"), class = "data.frame", row.names = c(NA, -6L))
