Get first n rows for each date in a dataframe [duplicate] - r

This question already has answers here:
Selecting top N rows for each group based on value in column
(4 answers)
Closed 3 years ago.
I am currently trying to subset the first n-observations for each date in my dataset. Let's say n=2 for example purposes. This is what the data set looks like:
Date Measure
2019-02-01 5
2019-02-01 4
2019-02-01 3
2019-02-01 6
… …
2019-02-02 5
2019-02-02 5
2019-02-02 2
… …
I would like to see this output:
Date Measure
2019-02-01 5
2019-02-01 4
2019-02-02 5
2019-02-02 5
… …
Unfortunately, this is not something I can do by hard-coding the selection. I am dealing with over 10 million rows of data, so the solution needs to select the first n rows for each unique date dynamically.

An option is to group by 'Date' and slice the first 'n' rows of each group:
library(dplyr)
n <- 2
df1 %>%
  group_by(Date) %>%
  slice(seq_len(n))
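For comparison, the same selection can be sketched in base R without any packages. The example data only reproduces the rows visible in the question, and row_in_group and first_n are illustrative names. How this performs on 10 million rows has not been measured here.

```r
# Example data: only the rows visible in the question
df1 <- data.frame(
  Date = as.Date(c(rep("2019-02-01", 4), rep("2019-02-02", 3))),
  Measure = c(5, 4, 3, 6, 5, 5, 2)
)
n <- 2

# Number the rows within each date (1, 2, 3, ...) in their original order
row_in_group <- ave(seq_len(nrow(df1)), df1$Date, FUN = seq_along)

# Keep the rows whose within-date position is at most n
first_n <- df1[row_in_group <= n, ]
first_n
```

Because the rows are numbered in their existing order, "first" here means first as stored, not first after sorting by Measure.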

Related

Having aggregated data - wanna have data for each element [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
Hi,
My aim is to draw a histogram.
For that I need unaggregated data, but unfortunately I only have it in aggregated form.
My data:
tribble(~date,~groupsize,
"2020-09-01",3,
"2020-09-02",2,
"2020-09-03",1,
"2020-09-04",2)
I want to have:
tribble(~date,~n,
"2020-09-01",1,
"2020-09-01",1,
"2020-09-01",1,
"2020-09-02",1,
"2020-09-02",1,
"2020-09-03",1,
"2020-09-04",1,
"2020-09-04",1)
I think this is really simple, but I am at a loss. Sorry for that!
What can I do? I really like dplyr solutions :-)
Thank you!
Repeat each date according to groupsize, assuming the data above is stored in dat:
res <- data.frame(date=rep(dat$date, dat$groupsize), n=1)
res
# date n
# 1 2020-09-01 1
# 2 2020-09-01 1
# 3 2020-09-01 1
# 4 2020-09-02 1
# 5 2020-09-02 1
# 6 2020-09-03 1
# 7 2020-09-04 1
# 8 2020-09-04 1
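Since the question asks for a tidyverse-style solution, here is a minimal sketch using tidyr::uncount(), which duplicates each row according to a weights column. It assumes the tidyr package is installed and that the data is stored in a data frame called dat, as in the answer above.

```r
library(tidyr)

# The aggregated data from the question
dat <- data.frame(
  date = c("2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"),
  groupsize = c(3, 2, 1, 2)
)

# Repeat each row groupsize times; uncount() drops the weights column by default
res <- uncount(dat, groupsize)

# Add the constant column n = 1 requested in the question
res$n <- 1
res
```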

R Function error in calculating date difference [duplicate]

This question already has answers here:
Calculate difference between values in consecutive rows by group
(4 answers)
Closed 2 years ago.
I have a dataframe that looks like this:
Name Date
David 2019-12-23
David 2020-1-10
David 2020-2-13
Kevin 2019-2-12
Kevin 2019-3-19
Kevin 2019-5-1
Kevin 2019-7-23
Basically, I'm trying to calculate the date difference between each instance, specific to each person. I am currently using the following for-loop:
df$daysbetween <- with(df, ave(as.numeric(date) , name,
FUN=function(x) { z=c(NA,NA);
for( i in seq_along(x)[-(1:2)] ){
z <- c(z, (x[i]-x[i-1]))}
return(z) }) )
Currently, it calculates the difference between the second and third, and any following instance, perfectly fine. However, it doesn't calculate the difference between the first and second date and I need it to. Where is the error in my code coming from? Would appreciate any help.
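The first gap is missing because the loop is told to skip it: z starts with two NA values and seq_along(x)[-(1:2)] drops the first two positions, so the first difference computed is x[3] - x[2]. A minimal sketch of the smallest change, starting z with a single NA and looping from the second position. The data frame below recreates the question's data with the lowercase column names the loop expects.

```r
# Data from the question, with the lowercase names used in the loop
df <- data.frame(
  name = c("David", "David", "David", "Kevin", "Kevin", "Kevin", "Kevin"),
  date = as.Date(c("2019-12-23", "2020-01-10", "2020-02-13",
                   "2019-02-12", "2019-03-19", "2019-05-01", "2019-07-23"))
)

df$daysbetween <- with(df, ave(as.numeric(date), name,
  FUN = function(x) {
    z <- NA                        # only the first date has no predecessor
    for (i in seq_along(x)[-1]) {  # start at the second date, not the third
      z <- c(z, x[i] - x[i - 1])
    }
    z
  }))
df
```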
transform(df, diff = ave(Date, Name, FUN = function(x)c(NA,diff(as.Date(x)))))
Name Date diff
1 David 2019-12-23 <NA>
2 David 2020-1-10 18
3 David 2020-2-13 34
4 Kevin 2019-2-12 <NA>
5 Kevin 2019-3-19 35
6 Kevin 2019-5-1 43
7 Kevin 2019-7-23 83
Just use lag from the dplyr package:
Description:
Find the "previous" (lag()) or "next" (lead()) values in a vector. Useful for comparing values behind of or ahead of the current values.
library(dplyr)
# assumes columns named name and date, with date of class Date
df %>%
  group_by(name) %>%
  mutate(diff = date - lag(date))
Output:
name date diff
<chr> <date> <drtn>
1 David 2019-12-23 NA days
2 David 2020-01-10 18 days
3 David 2020-02-13 34 days
4 Kevin 2019-02-12 NA days
5 Kevin 2019-03-19 35 days
6 Kevin 2019-05-01 43 days
7 Kevin 2019-07-23 83 days

Summation of dataframes in a list [duplicate]

This question already has an answer here:
Aggregating across list of dataframes and storing all results
(1 answer)
Closed 3 years ago.
I am working on a script where I have two lists and I am trying to combine the results so I get a new list. Each list has a date and then two numbers. The lists look like this:
date clicks impressions
1 2019-06-01 1 2
2 2019-06-02 0 0
3 2019-06-03 100 120
and
date clicks impressions
1 2019-06-01 2 14
2 2019-06-02 3 14
3 2019-06-03 11 29
I'd like a single list that is
date clicks impressions
1 2019-06-01 3 16
2 2019-06-02 3 14
3 2019-06-03 111 149
What is the best way to accomplish this? In time I will have 20-30 more lists that will be added to this, so I'll want to pull the first list, combine it with the second, then a third, and so on. I don't know if I'll be able to assume that each date will be in each list.
Assuming your list is called list_df, you can bind them all together using bind_rows, group_by date and then sum all the other columns.
library(dplyr)
list_df %>%
  bind_rows() %>%
  group_by(date) %>%
  summarise_all(sum)
# A tibble: 3 x 3
# date clicks impressions
# <fct> <int> <int>
#1 2019-06-01 3 16
#2 2019-06-02 3 14
#3 2019-06-03 111 149
In base R, the same result could be achieved using Reduce and aggregate:
aggregate(.~date, Reduce(rbind, list_df), sum)
We can use data.table
library(data.table)
rbindlist(list_df)[, lapply(.SD, sum), date]
# date clicks impressions
#1: 2019-06-01 3 16
#2: 2019-06-02 3 14
#3: 2019-06-03 111 149
data
list_df <- mget(paste0("df", 1:2))
We can do:
cbind(date=df1[,1],do.call(`+`, list(df1[,-1],df2[,-1])),
row.names = NULL)
date clicks impressions
1 2019-06-01 3 16
2 2019-06-02 3 14
3 2019-06-03 111 149
If you are not sure about the presence of dates (the result can then be combined with cbind as above):
do.call(`+`,lapply(list(df1,df2), function(x) x[,-1]))
clicks impressions
1 3 16
2 3 14
3 111 149
This assumes that the data sets always have the same structure, i.e. the same dates in the same row order.
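The question also raises the case where a date is missing from one of the data frames. Elementwise addition then no longer lines up, whereas binding the rows and aggregating by date still works. A small base R sketch, using the question's data with the 2019-06-02 row removed from the second data frame purely for illustration:

```r
df1 <- data.frame(
  date = c("2019-06-01", "2019-06-02", "2019-06-03"),
  clicks = c(1, 0, 100),
  impressions = c(2, 0, 120)
)
# Hypothetical variation of the question's data: 2019-06-02 is absent here
df2 <- data.frame(
  date = c("2019-06-01", "2019-06-03"),
  clicks = c(2, 11),
  impressions = c(14, 29)
)
list_df <- list(df1, df2)

# Stack all data frames, then sum every other column within each date
totals <- aggregate(. ~ date, data = do.call(rbind, list_df), FUN = sum)
totals
```

A date that appears in only one data frame simply keeps its own values in the result.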

R - How to sum a column based on date range? [duplicate]

This question already has an answer here:
R // Sum by based on date range
(1 answer)
Closed 7 years ago.
Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 over the 3 days before 7/27/2015 (not including 7/27).
Q2: Sum of Var1 over the 3 days before 7/25/2015 (which is not the last row). Basically, I want to choose any day as the reference day and then calculate a rolling sum.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d = seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by = 'days'),
                 x = sample(20, size = 45, replace = TRUE))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
             head(zoo::rollsum(df$x, k = k), n = -1))
df
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The c(0, cumsum(...)) part pre-populates the first k rows (here three), which have fewer than k preceding values: rollsum(x, k) returns a vector of length length(x)-k+1, and the result is shifted down by one row. The head(..., n=-1) discards the last element, because you said that the nth entry should sum the previous 3 and not its own row.
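If only a single reference day is of interest, as in Q1 and Q2, the sum can also be computed directly by filtering on the date window. A minimal base R sketch; sum_previous_days is a hypothetical helper name, and the data contains only the last four rows shown in the question.

```r
# Only the last four rows visible in the question
df1 <- data.frame(
  Date = as.Date(c("07/24/2015", "07/25/2015", "07/26/2015", "07/27/2015"),
                 format = "%m/%d/%Y"),
  Var1 = c(1, 6, 23, 15)
)

# Sum Var1 over the k calendar days before ref_date, excluding ref_date itself
sum_previous_days <- function(data, ref_date, k = 3) {
  ref_date <- as.Date(ref_date)
  in_window <- data$Date >= ref_date - k & data$Date < ref_date
  sum(data$Var1[in_window])
}

sum_previous_days(df1, "2015-07-27")  # 1 + 6 + 23 = 30
```

This works on calendar days rather than row positions, so days missing from the data contribute nothing to the sum.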
