Lag value of a previously calculated row in R

I am trying to use a lagged value from the previous row, which itself needs to be calculated from the row before it (unless it is the first entry). I was trying something similar to:

library(dplyr)

test <- data.frame(
  account_id = c(123, 123, 123, 123, 444, 444, 444, 444),
  entry = c(1, 2, 3, 4, 1, 2, 3, 4),
  beginning_balance = c(100, 0, 0, 0, 200, 0, 0, 0),
  deposit = c(10, 20, 5, 8, 10, 12, 20, 4),
  running_balance = c(0, 0, 0, 0, 0, 0, 0, 0)
)

test2 <- test %>%
  group_by(account_id) %>%
  mutate(running_balance = if_else(entry == 1,
                                   beginning_balance + deposit,
                                   lag(running_balance) + deposit))
print(test2)
The running balance should be 110, 130, 135, 143, 210, 222, 242, 246.

For each account_id you can add the first beginning_balance to the cumulative sum of deposit:

library(dplyr)

test %>%
  group_by(account_id) %>%
  mutate(running_balance = first(beginning_balance) + cumsum(deposit))
#  account_id entry beginning_balance deposit running_balance
#       <dbl> <dbl>             <dbl>   <dbl>           <dbl>
#1        123     1               100      10             110
#2        123     2                 0      20             130
#3        123     3                 0       5             135
#4        123     4                 0       8             143
#5        444     1               200      10             210
#6        444     2                 0      12             222
#7        444     3                 0      20             242
#8        444     4                 0       4             246
The same thing using data.table:
library(data.table)
setDT(test)[, running_balance := first(beginning_balance) + cumsum(deposit), account_id]
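For what it's worth, the lag() approach in the question fails because mutate() evaluates the whole right-hand side in one vectorized pass against the original running_balance column (all zeros); lag() never sees the freshly computed values. If a genuinely recursive carry-forward were ever needed (say, interest applied each row), purrr::accumulate() is one option. A minimal sketch, assuming the purely additive deposits of this example:

library(dplyr)
library(purrr)

test %>%
  group_by(account_id) %>%
  mutate(running_balance = accumulate(
    deposit[-1], `+`,                                  # fold in the remaining deposits
    .init = first(beginning_balance) + first(deposit)  # seed with the first row
  )) %>%
  ungroup()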

Using a for-loop over each unique account_id and taking the cumulative sum within each id (this works because beginning_balance is non-zero only at the first entry):

for (i in unique(test$account_id)) {
  test$running_balance[test$account_id == i] <-
    cumsum(test$beginning_balance[test$account_id == i] +
             test$deposit[test$account_id == i])
}
print(test)
  account_id entry beginning_balance deposit running_balance
1        123     1               100      10             110
2        123     2                 0      20             130
3        123     3                 0       5             135
4        123     4                 0       8             143
5        444     1               200      10             210
6        444     2                 0      12             222
7        444     3                 0      20             242
8        444     4                 0       4             246

Related

Eliminate rows that have a match in one of multiple columns of the preceding rows

I have a large data.frame. Here is a simpler version for clarity.

ID <- rep(c(1, 2, 3), each = 4)
Bed <- rep(c(1, 1, 2, 2), 3)
ERRBeg <- c(90, 140, 190, 200, 290, 340, 390, 100, 490, 540, 560, 610)
POST1Beg <- c(100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650)
POST2Beg <- c(110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610, 660)
DATA <- data.frame(ID, Bed, ERRBeg, POST1Beg, POST2Beg)
I want to delete all rows with the following match: the value of ERRBeg is found in POST1Beg or POST2Beg (I have more such variables) in one of the previous rows, but only if ID and Bed are the same. With lag columns added for illustration, the data looks like this:
      ID   Bed ERRBeg POST1Beg POST2Beg LAG_ERRBeg LAG_POST1Beg
   <dbl> <dbl>  <dbl>    <dbl>    <dbl>      <dbl>        <dbl>
 1     1     1     90      100      110         NA           NA
 2     1     1    140      150      160         90          100
 3     1     2    190      200      210         NA           NA
 4     1     2    200      250      260        190          200
 5     2     1    290      300      310         NA           NA
 6     2     1    340      350      360        290          300
 7     2     2    390      400      410         NA           NA
 8     2     2    100      450      460        390          400
 9     3     1    490      500      510         NA           NA
10     3     1    540      550      560        490          500
11     3     2    560      600      610         NA           NA
12     3     2    610      650      660        560          600
I tried the following, which gives me exactly the lines where the two variables match. However, if I invert it with filter(!ERRBeg == lag(POST1Beg)), it deletes all lines where ID and Bed have duplicates.
DATA %>%
  group_by(ID, Bed) %>%
  filter(ERRBeg == lag(POST1Beg)) %>%
  ungroup()
I also tried this, which does not work. I know I might be missing something trivial, but I do not see it.
DATA_xx <- DATA %>%
  group_by(ID, Bed) %>%
  filter(ERRBeg %in% c(lag(ERRBeg), lag(POST1Beg))) %>%
  ungroup()
Desired Output:
      ID   Bed ERRBeg POST1Beg POST2Beg LAG_ERRBeg LAG_POST1Beg
   <dbl> <dbl>  <dbl>    <dbl>    <dbl>      <dbl>        <dbl>
 1     1     1     90      100      110         NA           NA
 2     1     1    140      150      160         90          100
 3     1     2    190      200      210         NA           NA
 5     2     1    290      300      310         NA           NA
 6     2     1    340      350      360        290          300
 7     2     2    390      400      410         NA           NA
 8     2     2    100      450      460        390          400
 9     3     1    490      500      510         NA           NA
10     3     1    540      550      560        490          500
11     3     2    560      600      610         NA           NA
DATA %>%
  group_by(ID, Bed) %>%
  filter(!ERRBeg %in% POST1Beg) %>%
  ungroup()
I tried switching the lag() to an %in%, and I think it works.
Edit: it will not work forward, i.e. if an ERRBeg value appears in a POST1Beg later in the group, that row is wrongly dropped too. Putting lag() back around the POST column should fix this, I believe:
DATA %>%
  group_by(ID, Bed) %>%
  filter(!ERRBeg %in% lag(POST1Beg)) %>%
  ungroup()
Found the problem and the solution. :)
DATA %>%
  group_by(ID, Bed) %>%
  filter(!ERRBeg %in% c(lag(ERRBeg), lag(POST1Beg), lag(POST2Beg)) | is.na(lag(ERRBeg))) %>%
  ungroup()

The problem was that the expression in the filter does not only return TRUE and FALSE but also NA (on the first row of each group, where lag() has nothing to compare against), and filter() drops NA rows:
      ID   Bed ERRBeg POST1Beg POST2Beg FILTER
   <dbl> <dbl>  <dbl>    <dbl>    <dbl> <lgl>
 1     1     1     90      100      110 NA
 2     1     1    140      150      160 FALSE
 3     1     2    190      200      210 NA
 4     2     1    290      300      310 NA
 5     2     1    340      350      360 FALSE
 6     2     2    390      400      410 NA
 7     2     2    100      450      460 FALSE
 8     3     1    490      500      510 NA
 9     3     1    540      550      560 FALSE
10     3     2    560      600      610 NA
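Note that lag() only looks one row back, which happens to be enough for this data. If ERRBeg must be checked against all preceding rows of the group, as the question text asks, one hedged sketch is to build the comparison row by row inside filter(); this is an illustration, not the accepted answer:

library(dplyr)

DATA %>%
  group_by(ID, Bed) %>%
  filter(!sapply(seq_len(n()), function(i)
    # does this row's ERRBeg appear in any earlier POST1Beg/POST2Beg of the group?
    ERRBeg[i] %in% c(POST1Beg[seq_len(i - 1)], POST2Beg[seq_len(i - 1)]))) %>%
  ungroup()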

Better way of binning data in a group in a data frame by equal intervals

I have a data frame that is characterized by many different IDs. For every ID there are multiple events, characterized by the cumulative time between events (hours) and the duration of each event (seconds). It looks something like:

Id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
cumulative_time <- c(0, 3.58, 8.88, 11.19, 21.86, 29.54, 0, 5, 14, 19, 23)
duration <- c(188, 124, 706, 53, 669, 1506.2, 335, 349, 395, 385, 175)
test <- data.frame(Id, cumulative_time, duration)
> test
   Id cumulative_time duration
1   1            0.00    188.0
2   1            3.58    124.0
3   1            8.88    706.0
4   1           11.19     53.0
5   1           21.86    669.0
6   1           29.54   1506.2
7   2            0.00    335.0
8   2            5.00    349.0
9   2           14.00    395.0
10  2           19.00    385.0
11  2           23.00    175.0
I would like to group by Id and then restructure each group by binning the cumulative time into intervals of, say, 10 hours, summing the durations that fall within each 10-hour interval. The bins should run from 0 to 30 hours, so there would be 3 bins.
I looked at the cut function and managed to hack something together within a data frame; even as a new R user I know it isn't pretty:
test_cut <- test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = c(0, 10, 20, 30),
                             labels = c("10", "20", "30"), include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
which gives the output:
test_cut
  Id bin_durations total_duration
1  1            10         1018.0
2  1            20           53.0
3  1            30         2175.2
4  2            10          684.0
5  2            20          780.0
6  2            30          175.0
Ultimately I want the interval window and number of bins to be arbitrary: if I have a span of 5000 hours and want to bin in 1-hour samples, I would use breaks = seq(0, 5000, 1) and labels = as.character(seq(1, 5000, 1)).
This will also be applied to a very large data frame, so computational speed is somewhat desired.
A dplyr solution would be great since I am applying the binning per group.
My guess is there is a nice interaction between cut and perhaps split to generate the desired output.
Thanks in advance.
Update
After testing, I find that even my current implementation isn't quite what I'd like. If I set:
n <- 3
test_cut <- test %>%
  mutate(bin_durations = cut(test$cumulative_time, breaks = seq(0, 30, n),
                             labels = as.character(seq(n, 30, n)), include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  mutate(total_duration = sum(duration)) %>%
  select(Id, bin_durations, total_duration) %>%
  distinct()
I get
test_cut
# A tibble: 11 x 3
# Groups:   Id, bin_durations [11]
      Id bin_durations total_duration
   <dbl> <fct>                  <dbl>
 1     1 3                        188
 2     1 6                        124
 3     1 9                        706
 4     1 12                        53
 5     1 24                       669
 6     1 30                      1506.
 7     2 3                        335
 8     2 6                        349
 9     2 15                       395
10     2 21                       385
11     2 24                       175
Where there are no occurrences in a bin, I should get 0 in the duration column rather than an omission.
Thus, it should look like:
test_cut
# A tibble: 20 x 3
# Groups:   Id, bin_durations [20]
      Id bin_durations total_duration
   <dbl> <fct>                  <dbl>
 1     1 3                        188
 2     1 6                        124
 3     1 9                        706
 4     1 12                        53
 5     1 15                         0
 6     1 18                         0
 7     1 21                         0
 8     1 24                       669
 9     1 27                         0
10     1 30                      1506.
11     2 3                        335
12     2 6                        349
13     2 9                          0
14     2 12                         0
15     2 15                       395
16     2 18                         0
17     2 21                       385
18     2 24                       175
19     2 27                         0
20     2 30                         0
Here is one idea via integer division (%/%):
library(tidyverse)
test %>%
  group_by(Id, grp = cumulative_time %/% 10) %>%
  summarise(total_duration = sum(duration))
which gives:

# A tibble: 6 x 3
# Groups:   Id [?]
     Id   grp total_duration
  <dbl> <dbl>          <dbl>
1     1     0           1018
2     1     1             53
3     1     2           2175.
4     2     0            684
5     2     1            780
6     2     2            175
To address your updated issue, we can use complete in order to add the missing rows. So, for the same example, binning in 3-hour intervals:
test %>%
  group_by(Id, grp = cumulative_time %/% 3) %>%
  summarise(total_duration = sum(duration)) %>%
  ungroup() %>%
  complete(Id, grp = seq(min(grp), max(grp)), fill = list(total_duration = 0))
which gives:

# A tibble: 20 x 3
      Id   grp total_duration
   <dbl> <dbl>          <dbl>
 1     1     0            188
 2     1     1            124
 3     1     2            706
 4     1     3             53
 5     1     4              0
 6     1     5              0
 7     1     6              0
 8     1     7            669
 9     1     8              0
10     1     9           1506.
11     2     0            335
12     2     1            349
13     2     2              0
14     2     3              0
15     2     4            395
16     2     5              0
17     2     6            385
18     2     7            175
19     2     8              0
20     2     9              0
We could make these changes:

- test$cumulative_time inside the pipeline can be simply cumulative_time
- breaks can be factored out and then used in the cut as shown
- the second mutate can be changed to summarize, in which case the select and distinct are not needed
- it is always a good idea to close any group_by with a matching ungroup; in the case of summarize we can use .groups = "drop"
- add complete to insert 0 for levels that are not present

Implementing these changes we have:
library(dplyr)
library(tidyr)
breaks <- seq(0, 40, 10)
test %>%
  mutate(bin_durations = cut(cumulative_time, breaks = breaks,
                             labels = breaks[-1], include.lowest = TRUE)) %>%
  group_by(Id, bin_durations) %>%
  summarize(total_duration = sum(duration), .groups = "drop") %>%
  complete(Id, bin_durations, fill = list(total_duration = 0))
giving:
# A tibble: 8 x 3
     Id bin_durations total_duration
  <dbl> <fct>                  <dbl>
1     1 10                      1018
2     1 20                        53
3     1 30                      2175.
4     1 40                         0
5     2 10                       684
6     2 20                       780
7     2 30                       175
8     2 40                         0
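If speed matters on very large data, the same %/% plus complete idea can be expressed in data.table, which may be faster; a sketch, where the CJ() cross-join fills in the Id/bin combinations that had no rows:

library(data.table)

dt <- as.data.table(test)
out <- dt[, .(total_duration = sum(duration)),
          by = .(Id, grp = cumulative_time %/% 10)]
# join against the full Id x bin grid, then zero-fill the bins that were empty
out <- out[CJ(Id = unique(Id), grp = seq(min(grp), max(grp))), on = .(Id, grp)]
out[is.na(total_duration), total_duration := 0][]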

Dividing row values proportionally based on a common identifier and a specific column in a data frame

After a merging process, I got a data frame that looks like:
df <- data.frame(trip = c(315, 328, 422, 422, 458, 652, 652, 652, 699),
                 catch_kg = c(10, 8, 12, 2, 26, 4, 18, 14, 11),
                 age_1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
                 age_2 = c(2, 1, 7.5, 7.5, 8, 11, 11, 11, 13),
                 id = c(1, 2, 3, 3, 4, 5, 5, 5, 6))
  trip catch_kg age_1 age_2 id
   315       10     0   2    1
   328        8     0   1    2
   422       12     0   7.5  3
   422        2     0   7.5  3
   458       26     0   8    4
   652        4     0  11    5
   652       18     0  11    5
   652       14     0  11    5
   699       11     0  13    6
where trip is the fishing trip, catch_kg the amount of fish caught (in kg), age_1 and age_2 the number of individuals per age group in each trip, and id the haul identity within each trip.
Some fishing trips have more than one haul; this can be seen in the id column, where hauls of the same trip share an id number. For instance, trip number 422 has two hauls (id = 3).
At the moment, for a trip with more than one haul, the number of individuals within each age group is divided equally among the hauls of that trip. For example, trip 422 has a total of 15 individuals, but since there are 2 hauls, this number was divided by 2, giving 7.5 individuals per haul.
What I would like, however, is to compute the number of individuals within each age group as a proportion of the total catch in each haul group.
Thus, at the end I would like to have a data frame that looks like:
  trip catch_kg age_1 age_2 id
   315       10     0     2  1
   328        8     0     1  2
   422       12     0    13  3
   422        2     0     2  3
   458       26     0     8  4
   652        4     0     4  5
   652       18     0    16  5
   652       14     0    13  5
   699       11     0    13  6
This is basically a rule-of-three calculation. For trip 422 (2 hauls), for instance, the calculation would be:
haul 1: 12 * (7.5 + 7.5) / (12 + 2) ≈ 13 individuals
haul 2: 2 * (7.5 + 7.5) / (12 + 2) ≈ 2 individuals
Is there an easy way to compute these calculations?
Any help would be much appreciated.
-M
You could use dplyr to help with this:

library(dplyr)

df %>%
  group_by(trip) %>%
  mutate(age_2 = catch_kg / sum(catch_kg) * sum(age_2))
#    trip catch_kg age_1     age_2    id
#   <dbl>    <dbl> <dbl>     <dbl> <dbl>
# 1   315       10     0  2.000000     1
# 2   328        8     0  1.000000     2
# 3   422       12     0 12.857143     3
# 4   422        2     0  2.142857     3
# 5   458       26     0  8.000000     4
# 6   652        4     0  3.666667     5
# 7   652       18     0 16.500000     5
# 8   652       14     0 12.833333     5
# 9   699       11     0 13.000000     6
I'm not sure exactly what rounding rule you were using to get integer counts of individuals, but you'd likely run into trouble with parts not adding up to wholes in more complicated scenarios.
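If integer counts that still sum to each trip's total are required, largest-remainder rounding is one common fix. A sketch building on the dplyr pipeline above; the helper name round_preserve_sum is made up here:

# Floor everything, then hand the leftover units to the entries with the
# largest fractional parts, so each group's total is preserved.
round_preserve_sum <- function(x) {
  floored <- floor(x)
  leftover <- round(sum(x) - sum(floored))  # whole units still to distribute
  idx <- order(x - floored, decreasing = TRUE)[seq_len(leftover)]
  floored[idx] <- floored[idx] + 1
  floored
}

library(dplyr)

df %>%
  group_by(trip) %>%
  mutate(age_2 = round_preserve_sum(catch_kg / sum(catch_kg) * sum(age_2))) %>%
  ungroup()

On this data that reproduces the 13/2 split for trip 422 and the 4/16/13 split for trip 652 shown in the desired output.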
Another solution using data.table:
library(data.table)
setDT(df)
df[, age_2 := catch_kg * sum(age_2) / sum(catch_kg), trip]
#    trip catch_kg age_1     age_2 id
#1:   315       10     0  2.000000  1
#2:   328        8     0  1.000000  2
#3:   422       12     0 12.857143  3
#4:   422        2     0  2.142857  3
#5:   458       26     0  8.000000  4
#6:   652        4     0  3.666667  5
#7:   652       18     0 16.500000  5
#8:   652       14     0 12.833333  5
#9:   699       11     0 13.000000  6
If you want, you can round age_2 with round(): age_2 := round(catch_kg * sum(age_2) / sum(catch_kg)).

How to calculate the cumulative data difference with preceding data by group?

The reduced raw data is as follows:

Data       group
2016/1/10      1
2016/2/4       1
2016/3/25      1
2016/4/13      1
2016/5/5       1
2016/7/1       2
2016/8/1       2
2016/10/1      2
2016/12/1      2
2016/12/31     2
The final data I want to get looks like:

Data       group cum_diff_preceding
2016/1/10      1                  0
2016/2/4       1                 25
2016/3/25      1                125
2016/4/13      1                182
2016/5/5       1                270
2016/7/1       2                  0
2016/8/1       2                 31
2016/10/1      2                153
2016/12/1      2                336
2016/12/31     2                456
The calculation method is as follows: for each row, sum the differences between that row's date and every preceding date within the same group.

for row 2016/1/10, cum_diff_preceding is 0
for row 2016/2/4, it is (2016/2/4 - 2016/1/10)
for row 2016/3/25, it is (2016/3/25 - 2016/1/10) + (2016/3/25 - 2016/2/4)
for row 2016/4/13, it is (2016/4/13 - 2016/1/10) + (2016/4/13 - 2016/2/4) + (2016/4/13 - 2016/3/25)
for row 2016/5/5, it is (2016/5/5 - 2016/1/10) + (2016/5/5 - 2016/2/4) + (2016/5/5 - 2016/3/25) + (2016/5/5 - 2016/4/13)
and likewise within group 2, restarting from 0 at row 2016/7/1.
My major code is as follows:

df$Data <- as.Date(df$Data, "%Y/%m/%d")
fun_forcast <- function(df) {
  for (i in 2:nrow(df)) {
    df$cum_diff_preceding[i] <- sum(df$Data[i] - df$Data[1:(i - 1)])
  }
}
ddply(df, .(group), transform, cum_diff_preceding <- fun_forcast)

but it does not work. When I change my code to

fun_forcast <- function(df) (df$cum_diff_preceding <-
  sapply(1:NROW(df), function(i) sum(df$Data[i] - df$Data[1:(i - 1)])))
ddply(df, .(group), fun_forcast)

it works, but the result format is:
> ddply(df, .(group), fun_forcast)
  group V1 V2  V3  V4  V5
1     1  0 25 125 182 270
2     2  0 31 153 336 456
I don't know how to get the results back into cum_diff_preceding in the original data.frame.
We can do this with ave from base R:

df$Data <- as.Date(df$Data, "%Y/%m/%d")
fun_forcast <- function(v1) sapply(seq_along(v1), function(i) sum(v1[i] - v1[1:(i - 1)]))
df$cum_diff_preceding <- with(df, ave(as.numeric(Data), group, FUN = fun_forcast))
df$cum_diff_preceding
#[1]   0  25 125 182 270   0  31 153 336 456
Or use dplyr:

library(dplyr)

df %>%
  group_by(group) %>%
  mutate(cum_diff_preceding = fun_forcast(Data))
# A tibble: 10 x 3
# Groups:   group [2]
#    Data       group cum_diff_preceding
#    <date>     <int>              <dbl>
#  1 2016-01-10     1                  0
#  2 2016-02-04     1                 25
#  3 2016-03-25     1                125
#  4 2016-04-13     1                182
#  5 2016-05-05     1                270
#  6 2016-07-01     2                  0
#  7 2016-08-01     2                 31
#  8 2016-10-01     2                153
#  9 2016-12-01     2                336
# 10 2016-12-31     2                456
By converting the dates to numeric, the formula can be generalized into a closed form: the i-th value is sum(x[i] - x[1:i]) = i * x[i] - cumsum(x)[i], so no inner loop is needed:
df %>%
  group_by(group) %>%
  mutate(numdata = as.numeric(Data),
         cum_diff_preceding = (1:n()) * numdata - cumsum(numdata)) %>%
  select(-numdata)
# A tibble: 10 x 3
# Groups:   group [2]
#    Data       group cum_diff_preceding
#    <date>     <int>              <dbl>
#  1 2016-01-10     1                  0
#  2 2016-02-04     1                 25
#  3 2016-03-25     1                125
#  4 2016-04-13     1                182
#  5 2016-05-05     1                270
#  6 2016-07-01     2                  0
#  7 2016-08-01     2                 31
#  8 2016-10-01     2                153
#  9 2016-12-01     2                336
# 10 2016-12-31     2                456
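The same closed form carries over to data.table if a loop-free version is wanted there too; a sketch:

library(data.table)

setDT(df)[, cum_diff_preceding := seq_len(.N) * as.numeric(Data) -
                                    cumsum(as.numeric(Data)),
          by = group]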

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables (around 700+ of them) in my data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the values in the original variables, so I need to create a new rank variable for each numeric variable.
From reading other posts, I believe assign and transform along with rank may be able to solve this. I tried the implementation below (sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit, ysales, the output dataset should contain the variables xcount_rank, xvisit_rank, ysales_rank holding the ranked values.
input <- read.table(header = FALSE, text = "101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")
input1 <- input[, 2:4]  # need to rank the numeric variables besides id
for (i in 1:3) {
  transform(input1,
            assign(paste(input1[, i], "rank", sep = "_")) =
              FUN = rank(-input1[, i], ties.method = "first"))
}
I also tried:

input[paste(names(input)[2:4], "rank", sep = "_")] <-
  lapply(input[2:4], cut, breaks = 10)

The problem with this approach is that it creates the rank values as intervals such as (101,230], (230,450], etc., whereas I would like the rank variable to be populated as 1, 2, and so on up to 10 categories, as per the splits I did. Is there any way to achieve this? I also tried input[5:7] <- lapply(input[5:7], rank, ties.method = "first").
The approach I tried from the solutions provided below is:
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
   id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101     20      5      6  (1.15,286] (3.95,21.3] (5.86,52.3]
2 102      2      4      7  (1.15,286] (3.95,21.3] (5.86,52.3]
3 103      9     12     15  (1.15,286] (3.95,21.3] (5.86,52.3]
4 104    100      8      7  (1.15,286] (3.95,21.3] (5.86,52.3]
5 105    450     12     65   (286,570] (3.95,21.3] (52.3,98.7]
6 109     25     28    145  (1.15,286] (21.3,38.7]  (98.7,145]
7 112    854     56     93   (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
   id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101     20      5      6           1           1           1
2 102      2      4      7           1           1           1
3 103      9     12     15           1           1           1
4 104    100      8      7           1           1           1
5 105    450     12     65           2           1           2
6 109     25     28    145           1           2           3
I would like to see each record assigned to the rank group it falls under when ranking the interval values.
Using dplyr:

library(dplyr)

nm1 <- paste("rank", names(input)[2:4], sep = "_")
input[nm1] <- mutate_each(input[2:4], funs(rank(., ties.method = "first")))
input
#   id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
# 1 101      2      5      6           1           2           1
# 2 102      3      4      7           2           1           2
# 3 103      9     12     15           3           3           3
Update
Based on the new input and using cut:

input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks = 3, labels = FALSE)))
input
#   id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
# 1 101     20      5      6           1           1           1
# 2 102      2      4      7           1           1           1
# 3 103      9     12     15           1           1           1
# 4 104    100      8      7           1           1           1
# 5 105    450     12     65           2           1           2
# 6 109     25     28    145           1           2           3
# 7 112    854     56     93           3           3           2
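A side note for current dplyr (>= 1.0): mutate_each() has since been deprecated in favour of across(). A sketch of the same cut-based ranking under that assumption:

library(dplyr)

input %>%
  mutate(across(c(xcount, xvisit, ysales),
                ~ cut(.x, breaks = 3, labels = FALSE),  # integer bin codes
                .names = "rank_{.col}"))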
