Create lagged variables for consecutive time points only using R

I have an unbalanced panel (with unequally spaced measurement points) and would like to create a lagged variable of x by group (Variable: id) but only for consecutive time points. My data looks like this:
# simple example with an unbalanced panel
base <- data.frame(id = rep(1:2, each = 7),
                   time = c(1, 2, 3, 4, 7, 8, 10, 3, 4, 6, 9, 10, 11, 14),
                   x = rnorm(14, mean = 3, sd = 1))
I already tried this code using dplyr:
library(dplyr)

base_lag <- base %>% # Add lagged column
  group_by(id) %>%
  dplyr::mutate(lag1_x = dplyr::lag(x, n = 1, default = NA)) %>%
  as.data.frame()
base_lag # Print updated data
However, this creates a lagged value for every row, even when the two time points are not consecutive.
My final data set should look like this:
id time x lag1_x
1 1 1 3.437416 NA
2 1 2 2.300553 3.437416
3 1 3 2.374212 2.300553
4 1 4 4.374009 2.374212
5 1 7 1.177433 NA
6 1 8 1.543353 1.177433
7 1 10 3.222358 NA
8 2 3 3.763765 NA
9 2 4 3.881182 3.763765
10 2 6 4.754420 NA
11 2 9 4.518227 NA
12 2 10 2.512486 4.518227
13 2 11 3.129230 2.512486
14 2 14 2.152509 NA
Does anyone here have a tip for me on how to create this lagged variable? Many thanks in advance!

You could use ifelse, testing whether diff(time) is equal to 1. If so, write the lag. If not, write an NA.
base %>%
  group_by(id) %>%
  mutate(lag1_x = ifelse(c(0, diff(time)) == 1, lag(x, n = 1, default = NA), NA)) %>%
  as.data.frame()
#> id time x lag1_x
#> 1 1 1 1.852343 NA
#> 2 1 2 2.710538 1.852343
#> 3 1 3 2.700785 2.710538
#> 4 1 4 2.588489 2.700785
#> 5 1 7 3.252223 NA
#> 6 1 8 2.108079 3.252223
#> 7 1 10 3.435683 NA
#> 8 2 3 1.762462 NA
#> 9 2 4 2.775732 1.762462
#> 10 2 6 3.377396 NA
#> 11 2 9 3.133336 NA
#> 12 2 10 3.804190 3.133336
#> 13 2 11 2.942893 3.804190
#> 14 2 14 3.503608 NA

Another option is to create a grouping variable based on the difference in time and take the lag within those groups:
library(dplyr)
base %>%
  group_by(id, grp = cumsum(c(TRUE, diff(time) != 1))) %>%
  mutate(lag1_x = lag(x)) %>%
  ungroup %>%
  select(-grp)
Output:
# A tibble: 14 × 4
id time x lag1_x
<int> <dbl> <dbl> <dbl>
1 1 1 3.81 NA
2 1 2 2.79 3.81
3 1 3 3.04 2.79
4 1 4 1.76 3.04
5 1 7 1.72 NA
6 1 8 2.68 1.72
7 1 10 3.31 NA
8 2 3 2.92 NA
9 2 4 2.02 2.92
10 2 6 1.71 NA
11 2 9 2.56 NA
12 2 10 1.62 2.56
13 2 11 3.30 1.62
14 2 14 3.69 NA
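If the data is in a data.table instead, the same conditional-lag idea can be sketched there as well (a minimal sketch, assuming data.table is an option; shift() plays the role of lag() and the diff() test is unchanged):

library(data.table)
setDT(base)  # convert the example data to a data.table in place
base[, lag1_x := fifelse(c(NA, diff(time)) == 1, shift(x), NA_real_), by = id]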


How can I distribute a vector of numbers by a vector of percentages, round the result, and always get the same total that I started with in R?

Question Summary
I want to multiply a vector of numbers (the Sum_By_Group column) by a vector of percentages (the Percent column) to distribute the total number for the group into each ID, round the result, and end up with the same total number that I started with. In other words, I want the Distribution_Post_Round column to be the same as the Sum_By_Group column.
Below is an example of the issue that I am running into. In Group A, I multiply Percent by Sum_By_Group and finish with 3 in ID 1, 3 in ID 2, and 1 in ID 5 for a total of 7. The Sum_By_Group column and Distribution_Post_Round column are the same for Group A and this is what I want. In Group B, I multiply Percent by Sum_By_Group and finish with 1 in ID 8 and 1 in ID 10 for a total of 2. I want the Distribution_Post_Round column to be 3 for Group B.
Is there a way to do this without using loops, subsetting data frames, and then rejoining the data frames together?
Example
library(dplyr)
df = data.frame('Group' = c(rep('A', 7), rep('B', 5)),
                'ID' = c(1:12),
                'Percent' = c(0.413797750, 0.385366840, 0.014417571, 0.060095668, 0.076399650,
                              0.019672573, 0.030249949, 0.381214519, 0.084121796, 0.438327886,
                              0.010665749, 0.085670050),
                'Sum_By_Group' = c(rep(7,7), rep(3, 5)))
df$Distribute_By_ID = round(df$Percent * df$Sum_By_Group, 0)
df_round = aggregate(Distribute_By_ID ~ Group, data = df, sum)
names(df_round)[names(df_round) == 'Distribute_By_ID'] = 'Distribution_Post_Round'
df = left_join(df, df_round, by = 'Group')
df
Group ID Percent Sum_By_Group Distribute_By_ID Distribution_Post_Round
A 1 0.41379775 7 3 7
A 2 0.38536684 7 3 7
A 3 0.01441757 7 0 7
A 4 0.06009567 7 0 7
A 5 0.07639965 7 1 7
A 6 0.01967257 7 0 7
A 7 0.03024995 7 0 7
B 8 0.38121452 3 1 2
B 9 0.08412180 3 0 2
B 10 0.43832789 3 1 2
B 11 0.01066575 3 0 2
B 12 0.08567005 3 0 2
Thank you so much for your help. Please let me know if additional clarification is needed.
Wow, who knew someone had already written a package that includes a function to solve this... kudos to that team https://cran.r-project.org/web/packages/sfsmisc/index.html
Since you already seem willing to use dplyr, hopefully this additional package is worth it, because it certainly makes for an elegant solution.
# https://stackoverflow.com/questions/61667720
library(dplyr)
df = data.frame('Group' = c(rep('A', 7), rep('B', 5)),
                'ID' = c(1:12),
                'Percent' = c(0.413797750, 0.385366840, 0.014417571, 0.060095668, 0.076399650,
                              0.019672573, 0.030249949, 0.381214519, 0.084121796, 0.438327886,
                              0.010665749, 0.085670050),
                'Sum_By_Group' = c(rep(7,7), rep(3, 5)))
glimpse(df)
#> Rows: 12
#> Columns: 4
#> $ Group <chr> "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "…
#> $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
#> $ Percent <dbl> 0.41379775, 0.38536684, 0.01441757, 0.06009567, 0.076399…
#> $ Sum_By_Group <dbl> 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3
df %>%
  group_by(Group) %>%
  mutate(Distribute_By_ID = sfsmisc::roundfixS(Percent * Sum_By_Group))
#> # A tibble: 12 x 5
#> # Groups: Group [2]
#> Group ID Percent Sum_By_Group Distribute_By_ID
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A 1 0.414 7 3
#> 2 A 2 0.385 7 3
#> 3 A 3 0.0144 7 0
#> 4 A 4 0.0601 7 0
#> 5 A 5 0.0764 7 1
#> 6 A 6 0.0197 7 0
#> 7 A 7 0.0302 7 0
#> 8 B 8 0.381 3 1
#> 9 B 9 0.0841 3 0
#> 10 B 10 0.438 3 2
#> 11 B 11 0.0107 3 0
#> 12 B 12 0.0857 3 0
Created on 2020-05-07 by the reprex package (v0.3.0)
df %>%
  mutate(dividend = floor(Percent * Sum_By_Group),
         remainder = Percent * Sum_By_Group - dividend) %>%
  group_by(Group) %>%
  arrange(desc(remainder), .by_group = TRUE) %>%
  mutate(delivered = sum(dividend),
         rownumber = 1:n(),
         lastdelivery = if_else(rownumber <= Sum_By_Group - delivered, 1, 0),
         Final = dividend + lastdelivery) %>%
  ungroup()
# A tibble: 12 x 10
Group ID Percent Sum_By_Group dividend remainder delivered rownumber lastdelivery Final
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 A 1 0.414 7 2 0.897 4 1 1 3
2 A 2 0.385 7 2 0.698 4 2 1 3
3 A 5 0.0764 7 0 0.535 4 3 1 1
4 A 4 0.0601 7 0 0.421 4 4 0 0
5 A 7 0.0302 7 0 0.212 4 5 0 0
6 A 6 0.0197 7 0 0.138 4 6 0 0
7 A 3 0.0144 7 0 0.101 4 7 0 0
8 B 10 0.438 3 1 0.315 2 1 1 2
9 B 12 0.0857 3 0 0.257 2 2 0 0
10 B 9 0.0841 3 0 0.252 2 3 0 0
11 B 8 0.381 3 1 0.144 2 4 0 1
12 B 11 0.0107 3 0 0.0320 2 5 0 0
This is my solution, without any other dependencies, relying on the Hare quota (largest-remainder method):
I first distribute all the whole-number "seats", then distribute the remaining "seats" in decreasing order of the remainders.
The "Final" column then holds the result.
Note: it appears to give the same results as the other solution that uses a package.
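The same largest-remainder logic can also be wrapped in a small standalone helper (a sketch; largest_remainder is just an illustrative name, not a function from any package):

largest_remainder <- function(p, total) {
  # p: proportions for one group; total: whole number to distribute across the group
  raw   <- p * total
  base  <- floor(raw)                                  # whole-number "seats" everyone gets for sure
  extra <- order(raw - base, decreasing = TRUE)[seq_len(total - sum(base))]
  base[extra] <- base[extra] + 1                       # hand the leftover seats to the largest remainders
  base
}
# e.g. per group with dplyr:
# df %>% group_by(Group) %>% mutate(Final = largest_remainder(Percent, first(Sum_By_Group)))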
Formulating this as an integer optimization problem:
library(CVXR)
A <- as.data.frame.matrix(t(model.matrix(~0+Group, df)))
prop <- df$Percent * df$Sum_By_Group
x <- Variable(nrow(df), integer=TRUE)
sums <- df$Sum_By_Group[!duplicated(df$Group)]
p <- Problem(Minimize(sum_squares(x - prop)), list(A %*% x == sums))
result <- solve(p)
df$Distribute_By_ID <- as.integer(round(result$getValue(x)))
output:
Group ID Percent Sum_By_Group
1 A 1 0.41379775 7
2 A 2 0.38536684 7
3 A 3 0.01441757 7
4 A 4 0.06009567 7
5 A 5 0.07639965 7
6 A 6 0.01967257 7
7 A 7 0.03024995 7
8 B 8 0.38121452 3
9 B 9 0.08412180 3
10 B 10 0.43832789 3
11 B 11 0.01066575 3
12 B 12 0.08567005 3
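Because the constraint A %*% x == sums is part of the problem, the optimized integers respect the group totals by construction; a quick sanity check (a sketch, run after the CVXR code above):

aggregate(Distribute_By_ID ~ Group, data = df, sum)  # should return 7 for A and 3 for B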

Create new column and carry forward value from previous group to next

I am trying to carry the value from the previous group forward into the next group. I tried to solve it with rleid but could not get the desired result.
df <- data.frame(signal = c(1,1,5,5,5,2,3,3,3,4,4,5,5,5,5,6,7,7,8,9,9,9,10),
                 desired_outcome = c(NA, NA, 1, 1, 1, 5, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 7, 8, 8, 8, 9))
# the desired_outcome column holds the expected result -
signal desired_outcome
1 1 NA
2 1 NA
3 5 1
4 5 1
5 5 1
6 2 5
7 3 2
8 3 2
9 3 2
10 4 3
11 4 3
12 5 4
13 5 4
14 5 4
15 5 4
16 6 5
17 7 6
18 7 6
19 8 7
20 9 8
21 9 8
22 9 8
23 10 9
rle() gives the lengths and values of runs of the same value. Then: drop the value of the last run, shift the remaining values one run over by prepending an NA, and repeat each value as many times as given by lengths (i.e. the lengths of the runs of equal values in the original vector).
with(rle(df$signal), rep(c(NA, head(values, -1)), lengths))
# [1] NA NA 1 1 1 5 2 2 2 3 3 4 4 4 4 5 6 6 7 8 8 8 9
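For completeness, the result can be assigned back to the data frame and compared against the expected column (a small usage sketch):

df$outcome <- with(rle(df$signal), rep(c(NA, head(values, -1)), lengths))
all.equal(df$outcome, df$desired_outcome)  # should be TRUE for this example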
Another way could be to first lag signal, then use rleid to create run-based groups and use mutate to broadcast the first value of each group to all rows in that group.
library(dplyr)
df %>%
  mutate(out = lag(signal)) %>%
  group_by(group = data.table::rleid(signal)) %>%
  mutate(out = first(out)) %>%
  ungroup() %>%
  select(-group)
# A tibble: 23 x 2
# signal out
# <dbl> <dbl>
# 1 1 NA
# 2 1 NA
# 3 5 1
# 4 5 1
# 5 5 1
# 6 2 5
# 7 3 2
# 8 3 2
# 9 3 2
#10 4 3
# … with 13 more rows
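Since data.table::rleid is already involved, the same idea can be written entirely in data.table (a sketch): take the ungrouped lag first, then broadcast the first lagged value within each run.

library(data.table)
setDT(df)
df[, out := shift(signal)]               # plain lag over the whole column
df[, out := out[1], by = rleid(signal)]  # first lagged value of each run, repeated over the run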

R how to fill in NA with rules

data = data.frame(person = c(1,1,1,2,2,2,2,3,3,3,3),
                  t = c(3,NA,9,4,7,NA,13,3,NA,NA,12),
                  WANT = c(3,6,9,4,7,10,13,3,6,9,12))
Basically, I want to create a new variable 'WANT' that fills each NA by taking the previous value of t and adding 3 to it; if there are several NAs in a row, it keeps adding 3. My attempt is:
library(dplyr)
data %>%
  group_by(person) %>%
  mutate(WANT_TRY = fill(t) + 3)
Here's one way -
data %>%
  group_by(person) %>%
  mutate(
    # cs = cumsum(!is.na(t)), # creates index for the reference value; uncomment if interested
    w = case_when(
      # rle() gives the lengths of the NA runs
      is.na(t) ~ t[cumsum(!is.na(t))] + 3 * sequence(rle(is.na(t))$lengths),
      TRUE ~ t
    )
  ) %>%
  ungroup()
# A tibble: 11 x 4
person t WANT w
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12
Here is another way. We can do linear interpolation with the imputeTS package.
library(dplyr)
library(imputeTS)
data2 <- data %>%
  group_by(person) %>%
  mutate(WANT2 = na.interpolation(t)) %>%
  ungroup()
data2
# # A tibble: 11 x 4
# person t WANT WANT2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3
# 2 1 NA 6 6
# 3 1 9 9 9
# 4 2 4 4 4
# 5 2 7 7 7
# 6 2 NA 10 10
# 7 2 13 13 13
# 8 3 3 3 3
# 9 3 NA 6 6
# 10 3 NA 9 9
# 11 3 12 12 12
This is harder than it seems because of the double NA at the end. If it weren't for that, then the following:
ifelse(is.na(data$t), c(0, data$t[-nrow(data)])+3, data$t)
...would give you what you want. The simplest way, which uses the same logic but doesn't look very clever (sorry!), would be:
.impute <- function(x) ifelse(is.na(x), c(0, x[-length(x)])+3, x)
.impute(.impute(data$t))
...which just cheats by doing it twice. Does that help?
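If a run of NAs could be longer than two, the same step can simply be repeated until there is nothing left to fill (a sketch; like the two-pass version above it ignores the person grouping, which is fine here because no group starts with NA):

.impute_step <- function(x) ifelse(is.na(x), c(0, x[-length(x)]) + 3, x)
out <- data$t
while (anyNA(out)) out <- .impute_step(out)  # each pass fills the first remaining NA of every run
out  # should reproduce data$WANT for this example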
You can use functional programming from purrr and "NA-safe" addition from hablar:
library(hablar)
library(dplyr)
library(purrr)
data %>%
  group_by(person) %>%
  mutate(WANT2 = accumulate(t, ~ .x %plus_% 3))
Result
# A tibble: 11 x 4
# Groups: person [3]
person t WANT WANT2
<dbl> <dbl> <dbl> <dbl>
1 1 3 3 3
2 1 NA 6 6
3 1 9 9 9
4 2 4 4 4
5 2 7 7 7
6 2 NA 10 10
7 2 13 13 13
8 3 3 3 3
9 3 NA 6 6
10 3 NA 9 9
11 3 12 12 12

Left join two tables on LHS rows that meet a condition, leave others as NA

Consider the following scenario:
test <- data.frame(Id1 = c(1, 2, 3, 4, 5, 10, 11),
                   Id2 = c(3, 4, 10, 11, 12, 15, 9),
                   Type = c(1, 1, 1, 2, 2, 2, 1))
test
#> Id1 Id2 Type
#> 1 1 3 1
#> 2 2 4 1
#> 3 3 10 1
#> 4 4 11 2
#> 5 5 12 2
#> 6 10 15 2
#> 7 11 9 1
I want to join test to itself by Id2 = Id1, but only for rows where Type has a certain value, e.g. Type == 1, in such a way that I get the following result:
#> Id1 Id2 Type.x Id2.y Type.y
#> 1 1 3 1 10 1 # matches row 3
#> 2 2 4 1 11 2 # matches row 4
#> 3 3 10 1 15 2 # matches row 6
#> 4 4 11 2 NA NA # matches row 7 but Type != 1
#> 5 5 12 2 NA NA # Type !=1
#> 6 10 15 2 NA NA # Type !=1
#> 7 11 9 1 NA NA # Type == 1 but no matches
Since test represents a hierarchy in this case, a join of this type would allow me to "expand" the hierarchy so that each row eventually terminates in an Id2 that does not match any value of Id1.
How can one achieve such a join?
Tidyverse is a great package for data manipulation. Here, we can do as follows:
library(tidyverse)
joined <- test %>% left_join(test %>% filter(Type==1), by = c("Id1" = "Id2"))
joined
UPDATE:
library(tidyverse)
joined <- test %>%
  filter(Type == 1) %>%
  left_join(test, by = c("Id2" = "Id1")) %>%
  bind_rows(test %>% filter(Type == 2) %>% rename(Type.x = Type))
joined
Id1 Id2 Type.x Id2.y Type.y
1 3 1 10 1
2 4 1 11 2
3 10 1 15 2
11 9 1 NA NA
4 11 2 NA NA
5 12 2 NA NA
10 15 2 NA NA
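Another way to reproduce the desired output while keeping the original row order is to join unconditionally and then blank out the joined columns wherever Type is not 1 (a sketch, assuming dplyr >= 1.0 for across()):

library(dplyr)
test %>%
  left_join(test, by = c("Id2" = "Id1")) %>%
  mutate(across(c(Id2.y, Type.y), ~ replace(.x, Type.x != 1, NA)))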

Sum over a range with start and end point variables in R

I have a dataframe with the following variables:
start_point end_point variable_X
1 5 0.3757
2 7 0.4546
3 7 0.1245
4 8 0.3455
5 11 0.2399
6 12 0.0434
7 15 0.4323
... ... ...
I would like to add a fourth column that sums variable_X from the start point to the end point defined in the first two columns, i.e. the entry in the first row would be the sum over rows 1 to 5 (inclusive): 0.3757 + 0.4546 + 0.1245 + 0.3455 + 0.2399 = 1.5402, the entry in the second row would be the sum over rows 2 to 7 (inclusive): 0.4546 + 0.1245 + 0.3455 + 0.2399 + 0.0434 + 0.4323 = 1.6402, and so forth.
I'm new to R, any help would be greatly appreciated.
There are probably slicker ways to do this, but here's a quick version:
df$sumX <- apply(df, 1, function(x) sum(df$variable_X[x[1]:x[2]]))
df
start_point end_point variable_X sumX
1 1 5 0.3757 1.5402
2 2 7 0.4546 1.6402
3 3 7 0.1245 1.1856
4 4 8 0.3455 NA
5 5 11 0.2399 NA
6 6 12 0.0434 NA
7 7 15 0.4323 NA
The last few rows are NA here because I don't have rows 8 through 15 of your data.
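If the coercion to a matrix that apply() performs is a concern (for example once the data frame gains non-numeric columns), the same idea can be written with mapply() over the two index columns (a sketch):

df$sumX <- mapply(function(s, e) sum(df$variable_X[s:e]), df$start_point, df$end_point)
# rows whose end_point reaches past the available rows still come out NA, as above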
A solution with dplyr, using another reproducible example to address the situation with NAs in end_point, as in the OP's comment (handled with ifelse):
# Reproducible example
mydf = data.frame(start_point = 1:9,
                  end_point = c(5, NA, 7, 8, 11, 12, 7, 15, NA),
                  variable_X = c(1, 5, 2, 3, 5, 4, 2, 1, 2))
library(dplyr)
mydf %>%
  rowwise() %>%
  mutate(sumX = ifelse(is.na(end_point), NA, sum(mydf$variable_X[start_point:end_point])))
# start_point end_point variable_X sumX
# <int> <dbl> <dbl> <dbl>
# 1 1 5 1 16
# 2 2 NA 5 NA
# 3 3 7 2 16
# 4 4 8 3 15
# 5 5 11 5 NA
# 6 6 12 4 NA
# 7 7 7 2 2
# 8 8 15 1 NA
# 9 9 NA 2 NA
