How to calculate the cumulative data difference with preceding data by group? - r

The reduced raw data is as follows:
Data group
2016/1/10 1
2016/2/4 1
2016/3/25 1
2016/4/13 1
2016/5/5 1
2016/7/1 2
2016/8/1 2
2016/10/1 2
2016/12/1 2
2016/12/31 2
The final data I want to get looks like this:
Data group cum_diff_preceding
2016/1/10 1 0
2016/2/4 1 25
2016/3/25 1 125
2016/4/13 1 182
2016/5/5 1 270
2016/7/1 2 0
2016/8/1 2 31
2016/10/1 2 153
2016/12/1 2 336
2016/12/31 2 456
The calculation method is as follows:
for row 2016/1/10, cum_diff_preceding is 0
for row 2016/2/4, cum_diff_preceding is (2016/2/4 - 2016/1/10)
for row 2016/3/25, cum_diff_preceding is (2016/3/25 - 2016/1/10) + (2016/3/25 - 2016/2/4)
for row 2016/4/13, cum_diff_preceding is (2016/4/13 - 2016/1/10) + (2016/4/13 - 2016/2/4) + (2016/4/13 - 2016/3/25)
for row 2016/5/5, cum_diff_preceding is (2016/5/5 - 2016/1/10) + (2016/5/5 - 2016/2/4) + (2016/5/5 - 2016/3/25) + (2016/5/5 - 2016/4/13)
for row 2016/7/1, cum_diff_preceding is 0
for row 2016/8/1, cum_diff_preceding is (2016/8/1 - 2016/7/1)
for row 2016/10/1, cum_diff_preceding is (2016/10/1 - 2016/7/1) + (2016/10/1 - 2016/8/1)
for row 2016/12/1, cum_diff_preceding is (2016/12/1 - 2016/7/1) + (2016/12/1 - 2016/8/1) + (2016/12/1 - 2016/10/1)
for row 2016/12/31, cum_diff_preceding is (2016/12/31 - 2016/7/1) + (2016/12/31 - 2016/8/1) + (2016/12/31 - 2016/10/1) + (2016/12/31 - 2016/12/1)
My main code is as follows:
>as.Date(df$Data,"%Y-%m-%d")
>fun_forcast<-function(df){for(i in 2:nrow(df)){df$cum_diff_preceeding[i]<-sum(df$data[i]-df$data[1:(i-1)])}}
>ddply(df,.(group),transform,cum_diff_preceding<-fun_forcast)
but it does not work.
Or when I change my code to
>fun_forcast<-function(df)(df$cum_diff_preceding<-sapply(1:NROW(df), function(i) sum(df$data[i] - df$data[1:(i-1)])))
ddply(df,.(group),fun_forcast)
it works, but the result format is
> ddply(df,.(group),fun_forcast)
group V1 V2 V3 V4 V5
1 1 0 25 125 182 270
2 2 0 31 153 336 380
I don't know how to get the results back into cum_diff_preceding in the original data.frame.

We can do this with ave from base R. ave applies the function within each group and returns a vector aligned with the original rows, so the result can be assigned straight back into the original data.frame:
df$Data <- as.Date(df$Data, "%Y/%m/%d")
fun_forcast <- function(v1) sapply(seq_along(v1), function(i) sum(v1[i] - v1[1:(i-1)]))
df$cum_diff_preceding <- with(df, ave(as.numeric(Data), group, FUN = fun_forcast))
df$cum_diff_preceding
#[1] 0 25 125 182 270 0 31 153 336 456
Or use dplyr
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(cum_diff_preceding = fun_forcast(Data))
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
#10 2016-12-31 2 456

By converting the dates to numeric, we can generalize the formula: the i-th value is the sum over j <= i of (x[i] - x[j]), which simplifies to i * x[i] - cumsum(x)[i], so the inner loop disappears entirely:
df %>%
  group_by(group) %>%
  mutate(numdata = as.numeric(Data),
         cum_diff_preceding = (1:n()) * numdata - cumsum(numdata)) %>%
  select(-numdata)
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
# 10 2016-12-31 2 456
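The same closed form ports directly to data.table; a minimal sketch, assuming df still has the Date-converted Data column from above:
library(data.table)
setDT(df)[, cum_diff_preceding := seq_len(.N) * as.numeric(Data) - cumsum(as.numeric(Data)),
          by = group]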

Related

Inexact joining data based on greater equal condition

I have some values in
df:
# A tibble: 7 × 1
var1
<dbl>
1 0
2 10
3 20
4 210
5 230
6 266
7 267
that I would like to compare to a second dataframe called
value_lookup
# A tibble: 4 × 2
var1 value
<dbl> <dbl>
1 0 0
2 200 10
3 230 20
4 260 30
In particular I would like to make a join based on >=, meaning that a value greater than or equal to the number in var1 gets that row's value. E.g. take the number 210 of the original dataframe: since it is >= 200 and < 230, it gets a value of 10.
Here is the expected output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
I thought it should be doable using {fuzzyjoin}, but I cannot get it done:
value_lookup <- tibble(var1 = c(0, 200, 230, 260),
                       value = c(0, 10, 20, 30))
df <- tibble(var1 = c(0,10,20,210,230,266,267))
library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
  x = df,
  y = value_lookup,
  by = "var1",
  match_fun = list(`>=`)
)
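One way to finish the fuzzyjoin attempt: with match_fun = list(`>=`), every lookup row at or below var1 matches, so each row of df comes back once per matching lookup row; keeping only the closest match then gives the expected output. A sketch, assuming dplyr is also loaded:
library(dplyr)
library(fuzzyjoin)
fuzzy_left_join(df, value_lookup, by = "var1", match_fun = list(`>=`)) %>%
  group_by(var1.x) %>%         # one group per original row
  slice_max(var1.y) %>%        # keep the largest lookup key not exceeding var1
  ungroup() %>%
  select(var1 = var1.x, value)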
An option is also findInterval (which assumes value_lookup$var1 is sorted in increasing order):
df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
Output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = T which would look for same or closest value preceding var1 in your df:
library(data.table)
setDT(value_lookup)[setDT(df), on = 'var1', roll = T]
You can use cut:
df$value <- value_lookup$value[cut(df$var1,
                                   c(value_lookup$var1, Inf),
                                   right = FALSE)]
# # A tibble: 7 x 2
# var1 value
# <dbl> <dbl>
# 1 0 0
# 2 10 0
# 3 20 0
# 4 210 10
# 5 230 20
# 6 266 30
# 7 267 30

R Lag value of previous calculated function

I am trying to use a lagged value from the previous row, which itself needs to be calculated from the previous row (unless it is the first entry).
I was trying something similar to:
test <- data.frame(account_id = c(123, 123, 123, 123, 444, 444, 444, 444),
                   entry = c(1, 2, 3, 4, 1, 2, 3, 4),
                   beginning_balance = c(100, 0, 0, 0, 200, 0, 0, 0),
                   deposit = c(10, 20, 5, 8, 10, 12, 20, 4),
                   running_balance = c(0, 0, 0, 0, 0, 0, 0, 0))
test2 <- test %>%
  group_by(account_id) %>%
  mutate(running_balance = if_else(entry == 1, beginning_balance + deposit,
                                   lag(running_balance) + deposit))
print(test2)
The running balance should be 110, 130, 135, 143, 210, 222, 242, 246.
The mutate above doesn't work because the right-hand side is evaluated against the existing columns, so lag(running_balance) sees the original zeros rather than the values being computed. Instead, for each account_id you can add the first beginning_balance to the cumulative sum of deposit:
library(dplyr)
test %>%
  group_by(account_id) %>%
  mutate(running_balance = first(beginning_balance) + cumsum(deposit))
# account_id entry beginning_balance deposit running_balance
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 123 1 100 10 110
#2 123 2 0 20 130
#3 123 3 0 5 135
#4 123 4 0 8 143
#5 444 1 200 10 210
#6 444 2 0 12 222
#7 444 3 0 20 242
#8 444 4 0 4 246
Same thing using data.table :
library(data.table)
setDT(test)[, running_balance := first(beginning_balance) + cumsum(deposit), account_id]
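If the update rule could not be reduced to a cumulative sum, a row-by-row alternative is purrr::accumulate, which carries each computed value into the next step; a sketch on the same data, seeding each group with its first balance:
library(dplyr)
library(purrr)
test %>%
  group_by(account_id) %>%
  mutate(running_balance = accumulate(deposit[-1], `+`,
                                      .init = first(beginning_balance) + first(deposit))) %>%
  ungroup()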
Using a for-loop over each unique account_id, adding the cumulative sum for each id:
for (i in unique(test$account_id)) {
  rows <- test$account_id == i
  test$running_balance[rows] <- cumsum(test$beginning_balance[rows] + test$deposit[rows])
}
print(test)
account_id entry beginning_balance deposit running_balance
1 123 1 100 10 110
2 123 2 0 20 130
3 123 3 0 5 135
4 123 4 0 8 143
5 444 1 200 10 210
6 444 2 0 12 222
7 444 3 0 20 242
8 444 4 0 4 246

Better way of binning data in a group in a data frame by equal intervals

I have a dataframe characterized by many different IDs. For every ID there are multiple events, characterized by the cumulative time between events (hours) and the duration of each event (seconds). It looks something like:
Id <- c(1,1,1,1,1,1,2,2,2,2,2)
cumulative_time<-c(0,3.58,8.88,11.19,21.86,29.54,0,5,14,19,23)
duration<-c(188,124,706,53,669,1506.2,335,349,395,385,175)
test = data.frame(Id,cumulative_time,duration)
> test
Id cumulative_time duration
1 1 0.00 188.0
2 1 3.58 124.0
3 1 8.88 706.0
4 1 11.19 53.0
5 1 21.86 669.0
6 1 29.54 1506.2
7 2 0.00 335.0
8 2 5.00 349.0
9 2 14.00 395.0
10 2 19.00 385.0
11 2 23.00 175.0
I would like to group by the ID and then restructure each group by binning the cumulative time into intervals of, say, 10 hours, summing the durations that occurred within each 10-hour interval. The bins should span, say, 0 to 30 hours, so there would be 3 bins.
I looked at the cut function and managed to hack something together within a dataframe; even as a new R user I know it isn't pretty:
test_cut = test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = c(0, 10, 20, 30),
                             labels = c("10", "20", "30"), include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
which gives the output:
test_cut
Id bin_durations total_duration
1 1 10 1018.0
2 1 20 53.0
3 1 30 2175.2
4 2 10 684.0
5 2 20 780.0
6 2 30 175.0
Ultimately I want the interval window and number of bins to be arbitrary: if I have a span of 5000 hours and want to bin in 1-hour samples, I would use breaks = seq(0, 5000, 1) for the bins and labels = as.character(seq(1, 5000, 1)).
This will also be applied to a very large data frame, so computational speed is desirable.
A dplyr solution would be great since I am applying the binning per group.
My guess is there is a nice interaction between cut and perhaps split to generate the desired output.
Thanks in advance.
Update
After testing, I find that even my current implementation isn't quite what I'd like. If I say:
n=3
test_cut = test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = seq(0, 30, n),
                             labels = as.character(seq(n, 30, n)), include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
I get
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 24 669
6 1 30 1506.
7 2 3 335
8 2 6 349
9 2 15 395
10 2 21 385
11 2 24 175
Where there are no occurrences in the bin sequence I should get 0 in the duration column rather than an omission.
Thus, it should look like:
test_cut
# A tibble: 11 x 3
# Groups: Id, bin_durations [11]
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 3 188
2 1 6 124
3 1 9 706
4 1 12 53
5 1 15 0
6 1 18 0
7 1 21 0
8 1 24 669
9 1 27 0
10 1 30 1506.
11 2 3 335
12 2 6 349
13 2 9 0
14 2 12 0
15 2 15 395
16 2 18 0
17 2 21 385
18 2 24 175
19 2 27 0
20 2 30 0
Here is one idea via integer division (%/%)
library(tidyverse)
test %>%
  group_by(Id, grp = cumulative_time %/% 10) %>%
  summarise(total_duration = sum(duration))
which gives,
# A tibble: 6 x 3
# Groups: Id [?]
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 1018
2 1 1 53
3 1 2 2175.
4 2 0 684
5 2 1 780
6 2 2 175
To address your updated issue, we can use complete to add the missing rows. So, for the same example, binning in intervals of 3 hours,
test %>%
  group_by(Id, grp = cumulative_time %/% 3) %>%
  summarise(total_duration = sum(duration)) %>%
  ungroup() %>%
  complete(Id, grp = seq(min(grp), max(grp)), fill = list(total_duration = 0))
which gives,
# A tibble: 20 x 3
Id grp total_duration
<dbl> <dbl> <dbl>
1 1 0 188
2 1 1 124
3 1 2 706
4 1 3 53
5 1 4 0
6 1 5 0
7 1 6 0
8 1 7 669
9 1 8 0
10 1 9 1506.
11 2 0 335
12 2 1 349
13 2 2 0
14 2 3 0
15 2 4 395
16 2 5 0
17 2 6 385
18 2 7 175
19 2 8 0
20 2 9 0
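Given the question's concern about computational speed, here is a data.table sketch of the same %/% idea (assuming the same test data), completing the empty bins with a cross join:
library(data.table)
dt <- setDT(test)[, .(total_duration = sum(duration)),
                  by = .(Id, grp = cumulative_time %/% 3)]
grid <- CJ(Id = unique(test$Id), grp = 0:max(dt$grp))
out <- dt[grid, on = c("Id", "grp")]
out[is.na(total_duration), total_duration := 0]
out[]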
We could make these changes:
- test$cumulative_time can be simply cumulative_time
- breaks could be factored out and then used in the cut as shown
- the second mutate could be changed to summarize, in which case the select and distinct are not needed
- it is always a good idea to close any group_by with a matching ungroup, or in the case of summarize we can use .groups = "drop"
- add complete to insert 0 for levels not present
Implementing these changes we have:
library(dplyr)
library(tidyr)
breaks <- seq(0, 40, 10)
test %>%
  mutate(bin_durations = cut(cumulative_time, breaks = breaks,
                             labels = breaks[-1], include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  summarize(total_duration = sum(duration), .groups = "drop") %>%
  complete(Id, bin_durations, fill = list(total_duration = 0))
giving:
# A tibble: 8 x 3
Id bin_durations total_duration
<dbl> <fct> <dbl>
1 1 10 1018
2 1 20 53
3 1 30 2175.
4 1 40 0
5 2 10 684
6 2 20 780
7 2 30 175
8 2 40 0
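Since the question asks for an arbitrary interval width and span, the pipeline above can also be wrapped in a function (bin_by is a hypothetical helper name); for example, 1-hour bins over 5000 hours would be bin_by(test, 1, 5000):
bin_by <- function(data, width, span) {
  breaks <- seq(0, span, width)
  data %>%
    mutate(bin_durations = cut(cumulative_time, breaks = breaks,
                               labels = breaks[-1], include.lowest = TRUE)) %>%
    group_by(Id, bin_durations) %>%
    summarize(total_duration = sum(duration), .groups = "drop") %>%
    complete(Id, bin_durations, fill = list(total_duration = 0))
}
bin_by(test, 10, 40)  # reproduces the table above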

Changing rows to columns through merging in R

actual_date=c('2018-01-03','2018-01-02','2018-01-25','2018-01-15','2018-01-06','2018-01-02','2018-01-16','2018-01-22','2018-01-03')
date_band=c('_201801','_201801','_201803','_201802','_201801', '_201801','_201803','_201804','_201801')
action=c('text','letter','call','letter','visit','letter','text','text','call')
unique_ref=c(1,1,2,1,2,3,3,4,4)
df1=as.data.frame(cbind(unique_ref,actual_date,date_band,action))
unique_ref=c(1,2,3,4)
priority_201801=c('3','2','3','0')
balance_201801=c('30','-20','35','-100')
priority_201802=c('1','1','1','2')
balance_201802=c('60','-40','35','0')
priority_201803=c('2','2','3','2')
balance_201803=c('30','-40','-50','100')
priority_201804=c('99','0','0','0')
balance_201804=c('0','-20','-50','-100')
df2=as.data.frame(cbind(unique_ref,priority_201801,balance_201801,priority_201802,balance_201802,priority_201803,
balance_201803,priority_201804,balance_201804))
The code above produces two examples of datasets I'm working with.
df1 looks like this:
unique_ref actual_date date_band action
1 1 2018-01-03 _201801 text
2 1 2018-01-02 _201801 letter
3 2 2018-01-25 _201803 call
4 1 2018-01-15 _201802 letter
5 2 2018-01-06 _201801 visit
6 3 2018-01-02 _201801 letter
7 3 2018-01-16 _201803 text
8 4 2018-01-22 _201804 text
9 4 2018-01-03 _201801 call
While df2 looks like:
unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803 priority_201804 balance_201804
1 1 3 30 1 60 2 30 99 0
2 2 2 -20 1 -40 2 -40 0 -20
3 3 3 35 1 35 3 -50 0 -50
4 4 0 -100 2 0 2 100 0 -100
What I want to do is add a column to df2 named action_<date_band> (i.e. action_201801, action_201802, etc.). This would be taken from the action in df1, using the date_band and matched on unique_ref.
The desired output looks as follows, where if there are two actions for one date band they are separated by a comma:
unique_ref priority_201801 balance_201801 action_201801 priority_201802 balance_201802 action_201802 priority_201803 balance_201803 action_201803
1 1 3 30 text,letter 1 60 letter 2 30
2 2 2 -20 visit 1 -40 2 -40 call
3 3 3 35 letter 1 35 3 -50 text
4 4 0 -100 call 2 0 2 100
priority_201804 balance_201804 action_201804
1 99 0
2 0 -20
3 0 -50
4 0 -100 text
library(tidyverse)
df2 %>%
  left_join(df1, by = c("unique_ref")) %>%             # join df1 to df2
  select(-actual_date) %>%                             # remove column you won't need
  mutate(date_band = paste0("action", date_band)) %>%  # update column values
  spread(date_band, action)                            # reshape to get the format you want
# unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803
# 1 1 3 30 1 60 2 30
# 2 2 2 -20 1 -40 2 -40
# 3 3 3 35 1 35 3 -50
# 4 4 0 -100 2 0 2 100
# priority_201804 balance_201804 action_201801 action_201802 action_201803 action_201804
# 1 99 0 text letter <NA> <NA>
# 2 0 -20 visit <NA> call <NA>
# 3 0 -50 letter <NA> text <NA>
# 4 0 -100 call <NA> <NA> text
For the case where you have multiple actions for a specific ref and date_band you can use this approach:
library(tidyverse)
# update df1
df1_upd = df1 %>%
  group_by(unique_ref, date_band) %>%                     # for every combination of ref and date_band
  summarise(action = paste0(action, collapse = ",")) %>%  # combine actions
  ungroup()                                               # forget the grouping
df2 %>%
  left_join(df1_upd, by = c("unique_ref")) %>%
  mutate(date_band = paste0("action", date_band)) %>%
  spread(date_band, action)
# unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803
# 1 1 3 30 1 60 2 30
# 2 2 2 -20 1 -40 2 -40
# 3 3 3 35 1 35 3 -50
# 4 4 0 -100 2 0 2 100
# priority_201804 balance_201804 action_201801 action_201802 action_201803 action_201804
# 1 99 0 text,letter letter <NA> <NA>
# 2 0 -20 visit <NA> call <NA>
# 3 0 -50 letter <NA> text <NA>
# 4 0 -100 call <NA> <NA> text
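With recent tidyr (>= 1.0), spread is superseded by pivot_wider, which can also collapse the duplicate actions in the same call; a sketch of the equivalent reshape:
library(dplyr)
library(tidyr)
df2 %>%
  left_join(df1, by = "unique_ref") %>%
  select(-actual_date) %>%
  pivot_wider(names_from = date_band, values_from = action,
              names_prefix = "action",
              values_fn = function(x) paste(x, collapse = ","))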

ddply-type functionality on multiple data frames

I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both these dataframes. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both dataframes? For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The varying frequencies at which the dataframes are measured make combining them not easily doable. The only solution I've found so far is to ddply the first dataframe to extract the average sqft per id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient or less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you make sure that all rows of dfb are kept and, in this case, the corresponding entries of dfa are repeated accordingly):
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
  left_join(., dfa, by = c("id", "month")) %>%
  group_by(id, month) %>%
  dplyr::mutate(newcol = mean(price) / sqft)  # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
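For reference, a data.table sketch of the same merge plus grouped calculation, assuming dfa and dfb as above:
library(data.table)
dfall <- merge(setDT(dfa), setDT(dfb), by = c("id", "month"), all.y = TRUE)
dfall[, newcol := mean(price) / sqft, by = .(id, month)]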
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
  left_join(dfb, by = c("id", "month")) %>%
  group_by(id, month) %>%
  summarize(
    avgPrice = mean(price),
    avgSqft = mean(sqft)) %>%
  mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
