How do I apply rollapplyr to the following data so that it is sensitive to the date field? Currently I can apply the rolling window (blind to the date) over the dataset with e.g. a 4-quarter period and a minimum of 2 observations in those 4 quarters.
# creating the data
set.seed(123)
data <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                   date = as.Date(as.character(c(20040930, 20041231, 20050331, 20050630, 20050930,
                                                 20040930, 20050331, 20050630, 20051231, 20060331)),
                                  format = "%Y%m%d"),
                   col_a = round(runif(10, 0, 100), 0),
                   col_b = round(runif(10, 0, 100), 0))
id date col_a col_b
1 1 2004-09-30 3 10
2 1 2004-12-31 8 5
3 1 2005-03-31 4 7
4 1 2005-06-30 9 6
5 1 2005-09-30 9 1
6 2 2004-09-30 0 9
<missing>
7 2 2005-03-31 5 2
8 2 2005-06-30 9 0
<missing>
9 2 2005-12-31 6 3
10 2 2006-03-31 5 10
This is what I have attempted so far, but it does not take the missing records into consideration, e.g. id = 2's missing 2005-09-30 record.
library(dplyr)
library(zoo)
data %>%
  group_by(id) %>%
  mutate(score = (col_a + col_b) /
           rollapplyr(col_b, 4, mean, fill = NA, by.column = TRUE, partial = 2)) %>%
  ungroup() %>%
  select(id, date, col_a, col_b, score)
And this is what I got after applying the above:
id date col_a col_b score
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2004-09-30 3 10 NA
2 1 2004-12-31 8 5 1.73
3 1 2005-03-31 4 7 1.5
4 1 2005-06-30 9 6 2.14
5 1 2005-09-30 9 1 2.11
6 2 2004-09-30 0 9 NA
7 2 2005-03-31 5 2 1.27
8 2 2005-06-30 9 0 2.45
9 2 2005-12-31 6 3 2.57
10 2 2006-03-31 5 10 4
However, I expect it to take the missing quarters into consideration automatically. This is my expected output:
id date col_a col_b score
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2004-09-30 3 10 NA
2 1 2004-12-31 8 5 1.73
3 1 2005-03-31 4 7 1.5
4 1 2005-06-30 9 6 2.14
5 1 2005-09-30 9 1 2.11
6 2 2004-09-30 0 9 NA
<missing>
7 2 2005-03-31 5 2 1.27
8 2 2005-06-30 9 0 2.45
<missing>
9 2 2005-12-31 6 3 **5.4**
10 2 2006-03-31 5 10 **3.46**
Note that "<missing>" will not be shown in the output; I only put it there for visual purposes. So e.g. row 10 will only use rows 8, 9 and 10, because the missing row counts as a window slot too. How do I achieve that?
Note also that for e.g. row 10, n = 3 should be used for the averaging, not n = 4, as the average shouldn't include the missing rows.
One option would be to create the complete rows of 'date' for all 'id's before the group_by:
library(tidyverse)
library(zoo)
complete(data, id, date, fill = list(col_a = 0, col_b = 0)) %>%
  group_by(id) %>%
  mutate(score = (col_a + col_b) /
           rollapplyr(col_b, 4, sum, fill = NA, by.column = TRUE, partial = 2)) %>%
  ungroup() %>%
  select(id, date, col_a, col_b, score) %>%
  right_join(data)
# A tibble: 10 x 5
# id date col_a col_b score
# <dbl> <date> <dbl> <dbl> <dbl>
# 1 1 2004-09-30 3 10 NA
# 2 1 2004-12-31 8 5 0.867
# 3 1 2005-03-31 4 7 0.5
# 4 1 2005-06-30 9 6 0.536
# 5 1 2005-09-30 9 1 0.526
# 6 2 2004-09-30 0 9 NA
# 7 2 2005-03-31 5 2 0.636
# 8 2 2005-06-30 9 0 0.818
# 9 2 2005-12-31 6 3 1.8
#10 2 2006-03-31 5 10 1.15
data
data <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                       date = structure(c(12691, 12783, 12873, 12964, 13056,
                                          12691, 12873, 12964, 13148, 13238), class = "Date"),
                       col_a = c(3, 8, 4, 9, 9, 0, 5, 9, 6, 5),
                       col_b = c(10, 5, 7, 6, 1, 9, 2, 0, 3, 10)),
                  row.names = c(NA, -10L), class = "data.frame")
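For what it's worth, if the aim is to reproduce the OP's expected 5.4 and 3.46 exactly, a small variation on the same complete() idea may do it: fill the created rows with NA (complete's default) so a missing quarter still occupies a window slot, and average with na.rm = TRUE so it is excluded from n. A sketch, not heavily tested:

```r
library(dplyr)
library(tidyr)
library(zoo)

# complete() fills the missing quarters with NA (its default fill), so a
# gap still takes up a slot in the 4-quarter window; na.rm = TRUE then
# drops it from the mean, giving n = 3 instead of n = 4 for such windows.
res <- complete(data, id, date) %>%
  group_by(id) %>%
  mutate(score = (col_a + col_b) /
           rollapplyr(col_b, 4, mean, na.rm = TRUE,
                      fill = NA, by.column = TRUE, partial = 2)) %>%
  ungroup() %>%
  right_join(data, by = c("id", "date", "col_a", "col_b"))
res
# id = 2, 2005-12-31: (6 + 3)  / mean(2, 0, 3)  = 5.4
# id = 2, 2006-03-31: (5 + 10) / mean(0, 3, 10) ~ 3.46
```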
Related
I have an unbalanced panel (with unequally spaced measurement points) and would like to create a lagged variable of x by group (variable: id), but only for consecutive time points. My data looks like this:
# simple example with an unbalanced panel
base <- data.frame(id = rep(1:2, each = 7),
                   time = c(1, 2, 3, 4, 7, 8, 10, 3, 4, 6, 9, 10, 11, 14),
                   x = rnorm(14, mean = 3, sd = 1))
I already tried this code using dplyr:
base_lag <- base %>% # Add lagged column
group_by(id) %>%
dplyr::mutate(lag1_x = dplyr::lag(x, n = 1, default = NA)) %>%
as.data.frame()
base_lag # Print updated data
However, this way I get a lagged value regardless of whether the two time points are actually consecutive.
My final data set should look like this:
id time x lag1_x
1 1 1 3.437416 NA
2 1 2 2.300553 3.437416
3 1 3 2.374212 2.300553
4 1 4 4.374009 2.374212
5 1 7 1.177433 NA
6 1 8 1.543353 1.177433
7 1 10 3.222358 NA
8 2 3 3.763765 NA
9 2 4 3.881182 3.763765
10 2 6 4.754420 NA
11 2 9 4.518227 NA
12 2 10 2.512486 4.518227
13 2 11 3.129230 2.512486
14 2 14 2.152509 NA
Does anyone here have a tip for me on how to create this lagged variable? Many thanks in advance!
You could use ifelse, testing whether diff(time) is equal to 1. If so, write the lag. If not, write an NA.
base %>%
group_by(id) %>%
mutate(lag1_x = ifelse(c(0, diff(time)) == 1, lag(x, n = 1, default = NA), NA)) %>%
as.data.frame()
#> id time x lag1_x
#> 1 1 1 1.852343 NA
#> 2 1 2 2.710538 1.852343
#> 3 1 3 2.700785 2.710538
#> 4 1 4 2.588489 2.700785
#> 5 1 7 3.252223 NA
#> 6 1 8 2.108079 3.252223
#> 7 1 10 3.435683 NA
#> 8 2 3 1.762462 NA
#> 9 2 4 2.775732 1.762462
#> 10 2 6 3.377396 NA
#> 11 2 9 3.133336 NA
#> 12 2 10 3.804190 3.133336
#> 13 2 11 2.942893 3.804190
#> 14 2 14 3.503608 NA
An option is also to create a grouping based on the difference
library(dplyr)
base %>%
group_by(id, grp = cumsum(c(TRUE, diff(time) != 1))) %>%
mutate(lag1_x = lag(x)) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 14 × 4
id time x lag1_x
<int> <dbl> <dbl> <dbl>
1 1 1 3.81 NA
2 1 2 2.79 3.81
3 1 3 3.04 2.79
4 1 4 1.76 3.04
5 1 7 1.72 NA
6 1 8 2.68 1.72
7 1 10 3.31 NA
8 2 3 2.92 NA
9 2 4 2.02 2.92
10 2 6 1.71 NA
11 2 9 2.56 NA
12 2 10 1.62 2.56
13 2 11 3.30 1.62
14 2 14 3.69 NA
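For reference, the same consecutive-run idea can be sketched in base R with ave(), no packages needed (a sketch under the assumption that base is already sorted by id and time):

```r
set.seed(42)
base <- data.frame(id = rep(1:2, each = 7),
                   time = c(1, 2, 3, 4, 7, 8, 10, 3, 4, 6, 9, 10, 11, 14),
                   x = rnorm(14, mean = 3, sd = 1))

# Run id within each id: bump a counter whenever the gap in time is not 1.
grp <- ave(base$time, base$id, FUN = function(t) cumsum(c(TRUE, diff(t) != 1)))

# Lag x within each (id, run) group: the first row of every run gets NA.
base$lag1_x <- ave(base$x, base$id, grp, FUN = function(v) c(NA, v[-length(v)]))
```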
This question already has answers here:
Create counter within consecutive runs of values
(3 answers)
Closed 1 year ago.
Unfortunately, I can't wrap my head around this, but I'm sure there is a straightforward solution. I have a data.frame that looks like this:
set.seed(1)
mydf <- data.frame(group=sample(c("a", "b"), 20, replace=T))
I'd like to create a new variable that counts, from top to bottom, how many times the group has occurred in a row. For the example above it should look like this:
mydf$question <- c(1, 2, 1, 2, 1, 1, 2, 3, 4, 1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 1)
> mydf[1:10,]
group question
1 a 1
2 a 2
3 b 1
4 b 2
5 a 1
6 b 1
7 b 2
8 b 3
9 b 4
10 a 1
Thanks for the help.
Using data.table::rleid and dplyr you could do:
set.seed(1)
mydf <- data.frame(group=sample(c("a", "b"), 20, replace=T))
library(dplyr)
library(data.table)
mydf %>%
mutate(id = data.table::rleid(group)) %>%
group_by(id) %>%
mutate(question = row_number()) %>%
ungroup()
#> # A tibble: 20 × 3
#> group id question
#> <chr> <int> <int>
#> 1 a 1 1
#> 2 b 2 1
#> 3 a 3 1
#> 4 a 3 2
#> 5 b 4 1
#> 6 a 5 1
#> 7 a 5 2
#> 8 a 5 3
#> 9 b 6 1
#> 10 b 6 2
#> 11 a 7 1
#> 12 a 7 2
#> 13 a 7 3
#> 14 a 7 4
#> 15 a 7 5
#> 16 b 8 1
#> 17 b 8 2
#> 18 b 8 3
#> 19 b 8 4
#> 20 a 9 1
Update: This is mostly the same as stefan's answer, but without the data.table package:
library(dplyr)
mydf %>%
mutate(myrleid = with(rle(group), rep(seq_along(lengths), lengths))) %>%
group_by(myrleid) %>%
mutate(question = row_number()) %>%
ungroup()
group myrleid question
<chr> <int> <int>
1 a 1 1
2 b 2 1
3 a 3 1
4 a 3 2
5 b 4 1
6 a 5 1
7 a 5 2
8 a 5 3
9 b 6 1
10 b 6 2
11 a 7 1
12 a 7 2
13 a 7 3
14 a 7 4
15 a 7 5
16 b 8 1
17 b 8 2
18 b 8 3
19 b 8 4
20 a 9 1
I have the following data set containing duplicate column names, and I would like to stack the repeats in the following way. I can get the desired output with bind_rows, but I would like to do it with tidyr functions:
df <- tibble(
runs = c(1, 2, 3, 4),
col1 = c(3, 4, 5, 5),
col2 = c(5, 3, 1, 4),
col3 = c(6, 4, 9, 2),
col1 = c(0, 2, 2, 1),
col2 = c(2, 3, 1, 7),
col3 = c(2, 4, 9, 9),
col1 = c(3, 4, 5, 7),
col2 = c(3, 3, 1, 4),
col3 = c(3, 2, NA, NA), .name_repair = "minimal")
df %>%
  select(runs, 2:4) %>%
  bind_rows(df %>% select(runs, 5:7)) %>%
  bind_rows(df %>% select(runs, 8:10))
# A tibble: 12 x 4   # This is my desired output: the runs column cycles 1 to 4
runs col1 col2 col3
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 6
2 2 4 3 4
3 3 5 1 9
4 4 5 4 2
5 1 0 2 2
6 2 2 3 4
7 3 2 1 9
8 4 1 7 9
9 1 3 3 3
10 2 4 3 2
11 3 5 1 NA
12 4 7 4 NA
However, when I use tidyr, runs is arranged differently:
df %>%
pivot_longer(-runs) %>%
group_by(name) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-id)
# A tibble: 12 x 4
runs col1 col2 col3
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 6
2 1 0 2 2
3 1 3 3 3
4 2 4 3 4
5 2 2 3 4
6 2 4 3 2
7 3 5 1 9
8 3 2 1 9
9 3 5 1 NA
10 4 5 4 2
11 4 1 7 9
12 4 7 4 NA
I would be grateful if you could let me know how to rearrange runs so that the numbers cycle 1 to 4 rather than repeating each value three times in a row. Thank you very much in advance.
There may be a more elegant way to do this, but you could simply group by runs and use the row numbers to arrange:
df %>%
pivot_longer(cols = starts_with("col"),
names_to = c(".value")) %>%
group_by(runs) %>%
mutate(grp_n = row_number()) %>%
ungroup() %>%
arrange(grp_n, runs)
# A tibble: 12 x 5
runs col1 col2 col3 grp_n
<dbl> <dbl> <dbl> <dbl> <int>
1 1 3 5 6 1
2 2 4 3 4 1
3 3 5 1 9 1
4 4 5 4 2 1
5 1 0 2 2 2
6 2 2 3 4 2
7 3 2 1 9 2
8 4 1 7 9 2
9 1 3 3 3 3
10 2 4 3 2 3
11 3 5 1 NA 3
12 4 7 4 NA 3
A base R option using split.default:
data.frame(runs = df$runs,
           sapply(split.default(df[-1], names(df)[-1]), unlist),
           row.names = NULL)
# runs col1 col2 col3
#1 1 3 5 6
#2 2 4 3 4
#3 3 5 1 9
#4 4 5 4 2
#5 1 0 2 2
#6 2 2 3 4
#7 3 2 1 9
#8 4 1 7 9
#9 1 3 3 3
#10 2 4 3 2
#11 3 5 1 NA
#12 4 7 4 NA
Question Summary
I want to multiply a vector of numbers (the Sum_By_Group column) by a vector of percentages (the Percent column) to distribute the total number for the group into each ID, round the result, and end up with the same total number that I started with. In other words, I want the Distribution_Post_Round column to be the same as the Sum_By_Group column.
Below is an example of the issue that I am running into. In Group A, I multiply Percent by Sum_By_Group and finish with 3 in ID 1, 3 in ID 2, and 1 in ID 5 for a total of 7. The Sum_By_Group column and Distribution_Post_Round column are the same for Group A and this is what I want. In Group B, I multiply Percent by Sum_By_Group and finish with 1 in ID 8 and 1 in ID 10 for a total of 2. I want the Distribution_Post_Round column to be 3 for Group B.
Is there a way to do this without using loops, subsetting data frames, and then rejoining the data frames together?
Example
library(dplyr)
df = data.frame('Group' = c(rep('A', 7), rep('B', 5)),
'ID' = c(1:12),
'Percent' = c(0.413797750, 0.385366840, 0.014417571, 0.060095668, 0.076399650,
0.019672573, 0.030249949, 0.381214519, 0.084121796, 0.438327886,
0.010665749, 0.085670050),
'Sum_By_Group' = c(rep(7,7), rep(3, 5)))
df$Distribute_By_ID = round(df$Percent * df$Sum_By_Group, 0)
df_round = aggregate(Distribute_By_ID ~ Group, data = df, sum)
names(df_round)[names(df_round) == 'Distribute_By_ID'] = 'Distribution_Post_Round'
df = left_join(df, df_round, by = 'Group')
df
Group ID Percent Sum_By_Group Distribute_By_ID Distribution_Post_Round
A 1 0.41379775 7 3 7
A 2 0.38536684 7 3 7
A 3 0.01441757 7 0 7
A 4 0.06009567 7 0 7
A 5 0.07639965 7 1 7
A 6 0.01967257 7 0 7
A 7 0.03024995 7 0 7
B 8 0.38121452 3 1 2
B 9 0.08412180 3 0 2
B 10 0.43832789 3 1 2
B 11 0.01066575 3 0 2
B 12 0.08567005 3 0 2
Thank you so much for your help. Please let me know if additional clarification is needed.
Wow, who knew someone had already written a package with a function to solve this... kudos to that team: https://cran.r-project.org/web/packages/sfsmisc/index.html
Since you already seem willing to use dplyr, hopefully this additional package is worth it, because it makes for an elegant solution.
# https://stackoverflow.com/questions/61667720
library(dplyr)
df = data.frame('Group' = c(rep('A', 7), rep('B', 5)),
'ID' = c(1:12),
'Percent' = c(0.413797750, 0.385366840, 0.014417571, 0.060095668, 0.076399650,
0.019672573, 0.030249949, 0.381214519, 0.084121796, 0.438327886,
0.010665749, 0.085670050),
'Sum_By_Group' = c(rep(7,7), rep(3, 5)))
glimpse(df)
#> Rows: 12
#> Columns: 4
#> $ Group <chr> "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "…
#> $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
#> $ Percent <dbl> 0.41379775, 0.38536684, 0.01441757, 0.06009567, 0.076399…
#> $ Sum_By_Group <dbl> 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3
df %>%
group_by(Group) %>%
mutate(Distribute_By_ID = sfsmisc::roundfixS(Percent * Sum_By_Group))
#> # A tibble: 12 x 5
#> # Groups: Group [2]
#> Group ID Percent Sum_By_Group Distribute_By_ID
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 1 0.414 7 3
#> 2 A 2 0.385 7 3
#> 3 A 3 0.0144 7 0
#> 4 A 4 0.0601 7 0
#> 5 A 5 0.0764 7 1
#> 6 A 6 0.0197 7 0
#> 7 A 7 0.0302 7 0
#> 8 B 8 0.381 3 1
#> 9 B 9 0.0841 3 0
#> 10 B 10 0.438 3 2
#> 11 B 11 0.0107 3 0
#> 12 B 12 0.0857 3 0
Created on 2020-05-07 by the reprex package (v0.3.0)
df %>%
mutate(dividend = floor(Percent*Sum_By_Group),
remainder= Percent*Sum_By_Group-dividend) %>%
group_by(Group) %>%
arrange(desc(remainder),.by_group=TRUE) %>%
mutate(delivered=sum(dividend),
rownumber=1:n(),
lastdelivery=if_else(rownumber<=Sum_By_Group-delivered,1,0),
Final=dividend+lastdelivery) %>%
ungroup()
# A tibble: 12 x 10
Group ID Percent Sum_By_Group dividend remainder delivered rownumber lastdelivery Final
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 A 1 0.414 7 2 0.897 4 1 1 3
2 A 2 0.385 7 2 0.698 4 2 1 3
3 A 5 0.0764 7 0 0.535 4 3 1 1
4 A 4 0.0601 7 0 0.421 4 4 0 0
5 A 7 0.0302 7 0 0.212 4 5 0 0
6 A 6 0.0197 7 0 0.138 4 6 0 0
7 A 3 0.0144 7 0 0.101 4 7 0 0
8 B 10 0.438 3 1 0.315 2 1 1 2
9 B 12 0.0857 3 0 0.257 2 2 0 0
10 B 9 0.0841 3 0 0.252 2 3 0 0
11 B 8 0.381 3 1 0.144 2 4 0 1
12 B 11 0.0107 3 0 0.0320 2 5 0 0
This is my solution, without any extra dependencies, relying on the Hare quota: I distributed all the integer "seats" first, then distributed the remaining "seats" in order of decreasing remainders.
The "Final" column then holds the result.
Note: it seems to give the same results as the other solution that uses a package.
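The floors-then-largest-remainders logic above can also be packaged as a small standalone base R helper (round_preserve_sum is a name I made up for illustration, not from any package):

```r
# Largest-remainder (Hare quota) rounding: take the floor of every share,
# then hand the leftover units to the entries with the biggest fractional
# parts, so the rounded values sum to the original total.
round_preserve_sum <- function(x) {
  f <- floor(x)
  leftover <- round(sum(x)) - sum(f)                      # units still to distribute
  idx <- order(x - f, decreasing = TRUE)[seq_len(leftover)]
  f[idx] <- f[idx] + 1
  f
}

round_preserve_sum(c(1.144, 0.252, 1.315, 0.032, 0.257))  # group B's shares
# -> 1 0 2 0 0, summing to 3
```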
Formulating this as an integer optimization problem:
library(CVXR)
A <- as.data.frame.matrix(t(model.matrix(~0+Group, df)))
prop <- df$Percent * df$Sum_By_Group
x <- Variable(nrow(df), integer=TRUE)
sums <- df$Sum_By_Group[!duplicated(df$Group)]
p <- Problem(Minimize(sum_squares(x - prop)), list(A %*% x == sums))
result <- solve(p)
df$Distribute_By_ID <- as.integer(round(result$getValue(x)))
output:
   Group ID    Percent Sum_By_Group Distribute_By_ID
1      A  1 0.41379775            7                3
2      A  2 0.38536684            7                3
3      A  3 0.01441757            7                0
4      A  4 0.06009567            7                0
5      A  5 0.07639965            7                1
6      A  6 0.01967257            7                0
7      A  7 0.03024995            7                0
8      B  8 0.38121452            3                1
9      B  9 0.08412180            3                0
10     B 10 0.43832789            3                2
11     B 11 0.01066575            3                0
12     B 12 0.08567005            3                0
I am trying to create a rank indicator over two columns, in this case both Account and DATE.
For example:
df <- data.frame(
Account = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3),
DATE = c(201901, 201902, 201903, 201904, 201902, 201903, 201904, 201905, 201906, 201907, 201904, 201905))
> df
Account DATE
1 201901
1 201902
1 201903
1 201904
2 201902
2 201903
2 201904
2 201905
2 201906
2 201907
3 201904
3 201905
I've tried to use rank and order, as well as rank(rank()) and order(order()), but with no luck:
df <- df %>%
mutate("rank" = rank(Account, DATE))
Account DATE rank
1 201901 2.5
1 201902 2.5
1 201903 2.5
1 201904 2.5
2 201902 7.5
2 201903 7.5
2 201904 7.5
2 201905 7.5
2 201906 7.5
2 201907 7.5
3 201904 11.5
3 201905 11.5
But what I want is to rank the dates in descending order within each account; it should look like this:
Account DATE RANK
1 201901 4
1 201902 3
1 201903 2
1 201904 1
2 201902 6
2 201903 5
2 201904 4
2 201905 3
2 201906 2
2 201907 1
3 201904 2
3 201905 1
library("dplyr")
df %>%
  group_by(Account) %>%
  mutate("rank" = rank(-DATE))
#> # A tibble: 12 x 3
#> # Groups:   Account [3]
#>    Account   DATE  rank
#>      <dbl>  <dbl> <dbl>
#>  1       1 201901     4
#>  2       1 201902     3
#>  3       1 201903     2
#>  4       1 201904     1
#>  5       2 201902     6
#>  6       2 201903     5
#>  7       2 201904     4
#>  8       2 201905     3
#>  9       2 201906     2
#> 10       2 201907     1
#> 11       3 201904     2
#> 12       3 201905     1
Created on 2020-03-09 by the reprex package (v0.3.0.9001)
We can use a descending order to create the ranks:
library(dplyr)
df %>%
group_by(Account) %>%
mutate("rank" = order(DATE, decreasing = TRUE))
Output:
# A tibble: 12 x 3
# Groups: Account [3]
Account DATE rank
<dbl> <dbl> <int>
1 1 201901 4
2 1 201902 3
3 1 201903 2
4 1 201904 1
5 2 201902 6
6 2 201903 5
7 2 201904 4
8 2 201905 3
9 2 201906 2
10 2 201907 1
11 3 201904 2
12 3 201905 1
Here you go:
df <- df %>% group_by(Account) %>% mutate(ranking = rank(-DATE))
In base R:
groups <- split(seq_len(nrow(df)), df$Account)
for (i in groups) {
  df[i, 3] <- rank(-df$DATE[i])
}