Lag variable by group/time indicator in dplyr - r

I have data that looks like this:
set.seed(13)
dt <- data.frame(group = c(rep("a", 3), rep("b", 4), rep("c", 3)), var = c(rep(0.1,3), rep(0.3, 4), rep(1.1,3)))
dt
group var
1 a 0.1
2 a 0.1
3 a 0.1
4 b 0.3
5 b 0.3
6 b 0.3
7 b 0.3
8 c 1.1
9 c 1.1
10 c 1.1
I'd like to lag var across the groups in group, so that every row of a group carries the var value of the previous group (lag1.var) and of the group before that (lag2.var). One difficulty is that the groups differ in size; otherwise I could simply pass the common group size as n when lagging. The result should look like the table below. How can I do this, for example with dplyr?
group var lag1.var lag2.var
1 a 0.1 NA NA
2 a 0.1 NA NA
3 a 0.1 NA NA
4 b 0.3 0.1 NA
5 b 0.3 0.1 NA
6 b 0.3 0.1 NA
7 b 0.3 0.1 NA
8 c 1.1 0.3 0.1
9 c 1.1 0.3 0.1
10 c 1.1 0.3 0.1

You can create a tibble with the lag variables for each group and then merge it with dt. Try this:
library(dplyr)
left_join(dt,
          dt %>%
            group_by(group) %>%
            mutate(var = first(var)) %>%   # one representative value per group
            distinct() %>%                 # collapse to one row per group
            ungroup() %>%
            mutate(lag1.var = lag(var, order_by = group),
                   lag2.var = lag(lag1.var, order_by = group)) %>%
            select(-var),
          by = "group")
# output
group var lag1.var lag2.var
1 a 0.1 NA NA
2 a 0.1 NA NA
3 a 0.1 NA NA
4 b 0.3 0.1 NA
5 b 0.3 0.1 NA
6 b 0.3 0.1 NA
7 b 0.3 0.1 NA
8 c 1.1 0.3 0.1
9 c 1.1 0.3 0.1
10 c 1.1 0.3 0.1
This assumes that var is always the same within each group.
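The same idea can be sketched in base R as a lookup: collapse to one value per group, shift that short vector, and index it back onto the rows. This is only an illustration under the same assumption (var constant within each group, groups in the desired order); shift is a hypothetical helper, not part of any package.

```r
dt <- data.frame(group = rep(c("a", "b", "c"), times = c(3, 4, 3)),
                 var   = rep(c(0.1, 0.3, 1.1), times = c(3, 4, 3)))

# One value per group, in order of first appearance
# (assumes var is constant within each group)
grp <- unique(dt$group)
val <- dt$var[match(grp, dt$group)]

# Hypothetical helper: shift a vector down by n positions, padding with NA
shift <- function(x, n) c(rep(NA, n), head(x, -n))

# Position of each row's group in the per-group table
idx <- match(dt$group, grp)
dt$lag1.var <- shift(val, 1)[idx]
dt$lag2.var <- shift(val, 2)[idx]
dt
```

Because the shifting happens on the per-group table, differing group sizes do not matter.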

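Here is another option: nest by group, compute the lagged values on the resulting one-row-per-group table, then unnest.
library(tidyverse)
dt %>%
  nest(data = -group) %>%   # tidyr >= 1.0 syntax; older versions used nest(-group)
  mutate(lag1.var = map_dbl(data, ~ .x$var[[1]]) %>% lag(),
         lag2.var = lag(lag1.var)) %>%
  unnest(cols = data)       # column order may differ from the output shown below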
#> group lag1.var lag2.var var
#> 1 a NA NA 0.1
#> 2 a NA NA 0.1
#> 3 a NA NA 0.1
#> 4 b 0.1 NA 0.3
#> 5 b 0.1 NA 0.3
#> 6 b 0.1 NA 0.3
#> 7 b 0.1 NA 0.3
#> 8 c 0.3 0.1 1.1
#> 9 c 0.3 0.1 1.1
#> 10 c 0.3 0.1 1.1

Related

Problems with Rollapplyr function due to NA values (that should not be removed) in R

I have a dataframe:
date comp ei
1 1/1/73 A NA
2 1/4/73 A 0.6
3 1/7/73 A 0.7
4 1/10/73 A 0.9
5 1/1/74 A 0.4
6 1/4/74 A 0.5
7 1/7/74 A 0.7
8 1/10/74 A 0.7
9 1/1/75 A 0.4
10 1/4/75 A 0.5
11 1/1/73 B 0.8
12 1/4/73 B 0.8
13 1/7/73 B 0.5
14 1/10/73 B 0.6
15 1/1/74 B 0.3
16 1/4/74 B 0.2
17 1/1/73 C NA
18 1/4/73 C 0.6
19 1/7/73 C 0.4
20 1/10/73 C 0.8
21 1/1/74 C 0.7
22 1/4/74 C 0.9
23 1/7/74 C 0.4
24 1/10/74 C 0.3
I want to calculate the rolling standard deviation of ei, grouped by comp, over the last 8 rows. If only 6 rows exist so far, it should still compute the standard deviation of those. So I use width = 8 and partial = 6 in this code:
roll <- function(z) rollapplyr(z, width = 8, FUN = sd, fill = NA, partial = 6)
df <- transform(df, roll = ave(ei, comp, FUN = roll))
However, due to the fact that some of my 'ei' values are 'NA' the partial part of the function doesn't work, since there is an NA in one of the past 8 lines. So of course after 6 lines the std. dev. is NA. Only for comp = B, the partial = 6 works. The results are seen below:
date comp ei roll
1 1/1/73 A NA NA
2 1/4/73 A 0.6 NA
3 1/7/73 A 0.7 NA
4 1/10/73 A 0.9 NA
5 1/1/74 A 0.4 NA
6 1/4/74 A 0.5 NA
7 1/7/74 A 0.7 NA
8 1/10/74 A 0.7 NA
9 1/1/75 A 0.4 0.1726888
10 1/4/75 A 0.5 0.1772811
11 1/1/73 B 0.8 NA
12 1/4/73 B 0.8 NA
13 1/7/73 B 0.5 NA
14 1/10/73 B 0.6 NA
15 1/1/74 B 0.3 NA
16 1/4/74 B 0.2 0.2503331
17 1/1/73 C NA NA
18 1/4/73 C 0.6 NA
19 1/7/73 C 0.4 NA
20 1/10/73 C 0.8 NA
21 1/1/74 C 0.7 NA
22 1/4/74 C 0.9 NA
23 1/7/74 C 0.4 NA
24 1/10/74 C 0.3 NA
I would rather have results like those below, where the first standard deviation for comp A is calculated in row 7 from the previous 6 non-NA values, and where comp C gets a standard deviation in rows 23 and 24:
date comp ei roll
1 1/1/73 A NA NA
2 1/4/73 A 0.6 NA
3 1/7/73 A 0.7 NA
4 1/10/73 A 0.9 NA
5 1/1/74 A 0.4 NA
6 1/4/74 A 0.5 NA
7 1/7/74 A 0.7 0.1751190
8 1/10/74 A 0.7 0.1618347
9 1/1/75 A 0.4 0.1726888
10 1/4/75 A 0.5 0.1772811
11 1/1/73 B 0.8 NA
12 1/4/73 B 0.8 NA
13 1/7/73 B 0.5 NA
14 1/10/73 B 0.6 NA
15 1/1/74 B 0.3 NA
16 1/4/74 B 0.2 0.2503331
17 1/1/73 C NA NA
18 1/4/73 C 0.6 NA
19 1/7/73 C 0.4 NA
20 1/10/73 C 0.8 NA
21 1/1/74 C 0.7 NA
22 1/4/74 C 0.9 NA
23 1/7/74 C 0.4 0.2065591
24 1/10/74 C 0.3 0.2267787
How can I do this without running a na.omit code before calculating the rolling std. dev? The reason why I don't want to remove NA's is that I need the lines with comp and dates (plus other columns in my real dataset). Also, removing my NA values might, in my real dataset, lead to removing NA's in the middle of a period so that the rolling std. dev. function won't fit with the dates and my results will be wrong.
Is there a way to deal with this without removing the NA values?
1) FUN computes sd if there are at least 6 non-NAs and otherwise returns NA.
Then proceed as in the question.
library(zoo)
df$date <- as.Date(df$date, "%d/%m/%y")
FUN <- function(x) if (length(na.omit(x)) >= 6) sd(x, na.rm = TRUE) else NA
roll <- function(z) rollapplyr(z, width = 8, FUN = FUN,
fill = NA, partial = 6)
transform(df, roll = ave(ei, comp, FUN = roll))
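To make explicit what option 1 computes, here is a minimal base R sketch of a right-aligned rolling standard deviation that requires a minimum number of non-NA values per window. The function name roll_sd_min and its arguments are illustrative assumptions, not part of zoo; the input is the ei column for comp A from the question.

```r
# Hypothetical helper: right-aligned rolling sd over up to `width` values,
# returning NA unless the window holds at least `min_obs` non-NA values
roll_sd_min <- function(x, width = 8, min_obs = 6) {
  vapply(seq_along(x), function(i) {
    w <- x[max(1, i - width + 1):i]   # window ending at position i
    w <- w[!is.na(w)]                 # ignore NAs but keep the row itself
    if (length(w) >= min_obs) sd(w) else NA_real_
  }, numeric(1))
}

# ei values for comp A from the question
ei_A <- c(NA, 0.6, 0.7, 0.9, 0.4, 0.5, 0.7, 0.7, 0.4, 0.5)
res_A <- roll_sd_min(ei_A)
round(res_A, 7)
```

The rows stay in place, so the result lines up with the original dates without any NA removal.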
2) The other possibility is to use na.omit and then merge the result back with the original data frame.
library(zoo)
df$date <- as.Date(df$date, "%d/%m/%y")
roll <- function(z) rollapplyr(z, width = 8, FUN = sd, fill = NA, partial = 6)
df_roll_0 <- transform(na.omit(df), roll = ave(ei, comp, FUN = roll))
df_roll_m <- merge(df, df_roll_0, all = TRUE)
o <- with(df_roll_m, order(comp, date))
df_roll <- df_roll_m[o, ]
2a) This could also be expressed using dplyr/tidyr:
library(dplyr)
library(tidyr)
library(zoo)
df$date <- as.Date(df$date, "%d/%m/%y")
roll <- function(z) rollapplyr(z, width = 8, FUN = sd, fill = NA, partial = 6)
df_roll_0 <- df %>%
  drop_na() %>%
  group_by(comp) %>%
  mutate(roll = roll(ei)) %>%
  ungroup()
df %>%
  left_join(df_roll_0, by = c("date", "comp", "ei"))  # join keys made explicit
Note: the input df used above, in reproducible form:
Lines <- " date comp ei
1 1/1/73 A NA
2 1/4/73 A 0.6
3 1/7/73 A 0.7
4 1/10/73 A 0.9
5 1/1/74 A 0.4
6 1/4/74 A 0.5
7 1/7/74 A 0.7
8 1/10/74 A 0.7
9 1/1/75 A 0.4
10 1/4/75 A 0.5
11 1/1/73 B 0.8
12 1/4/73 B 0.8
13 1/7/73 B 0.5
14 1/10/73 B 0.6
15 1/1/74 B 0.3
16 1/4/74 B 0.2
17 1/1/73 C NA
18 1/4/73 C 0.6
19 1/7/73 C 0.4
20 1/10/73 C 0.8
21 1/1/74 C 0.7
22 1/4/74 C 0.9
23 1/7/74 C 0.4
24 1/10/74 C 0.3"
df <- read.table(text = Lines)

Expand a dataframe IN R into a larger dataframe

I have used the code below to generate a data frame df1:
df1<-data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", "B", 'B', "B"),
"X_Fr"=c(NA, NA, NA, NA,NA,1,2,3,4,5), "X_Ax"=c(NA, NA, NA, NA, NA,
.2,.3,.4,.2, .3),
"Y_Fr"=c(1,2,3,4,5,1,2,3,4,5),
"Y_Ax"=c(.1,.2,.3,.4,.1,.3,.4,.5,.2,.3),"Z_Fr"=c(1,2,NA, NA, 3,
1,3,4,5,10),
"Z_Ax"=c(.1,.2,NA,NA,.5, .1,.2,.4,.3,.5) )
ID X_Fr X_Ax Y_Fr Y_Ax Z_Fr Z_Ax
1 A NA NA 1 0.1 1 0.1
2 A NA NA 2 0.2 2 0.2
3 A NA NA 3 0.3 NA NA
4 A NA NA 4 0.4 NA NA
5 A NA NA 5 0.1 3 0.5
6 B 1 0.2 1 0.3 1 0.1
7 B 2 0.3 2 0.4 3 0.2
8 B 3 0.4 3 0.5 4 0.4
9 B 4 0.2 4 0.2 5 0.3
10 B 5 0.3 5 0.3 10 0.5
I would like to expand the dataframe to give the following data frame as output
ID X_Fr Y_Fr Z_Fr X_Ax Y_Ax Z_Ax
1 A 1 1 1 NA 0.1 0.1
2 A 2 2 2 NA 0.2 0.2
3 A 3 3 3 NA 0.3 0.5
4 A 4 4 4 NA 0.4 NA
5 A 5 5 5 NA 0.1 NA
6 B 1 1 1 0.2 0.3 0.1
7 B 2 2 2 0.3 0.4 NA
8 B 3 3 3 0.4 0.5 0.2
9 B 4 4 4 0.2 0.2 0.4
10 B 5 5 5 0.3 0.3 0.3
11 B 6 6 6 NA NA NA
12 B 7 7 7 NA NA NA
13 B 8 8 8 NA NA NA
14 B 9 9 9 NA NA NA
15 B 10 10 10 NA NA 0.5
I have tried the following code to obtain the above dataframe
library(tidyr)
library(dplyr)
df2<-df1 %>% complete(ID, nesting(X_Fr=full_seq(na.omit(c(X_Fr, Y_Fr,
Z_Fr)),1), Y_Fr=full_seq(na.omit(c(X_Fr, Y_Fr, Z_Fr)),1),
Z_Fr=full_seq(na.omit(c(X_Fr, Y_Fr, Z_Fr)),1)))
This does not give the desired result. Could someone take a look?
I think this is a two-step process, so I create two new data frames, process them separately, and join them afterwards:
library(dplyr)
library(tidyr)
df1x <- df1 %>% select(ID, matches('^X|^Y'))  # ID plus every column starting with X or Y
df1y <- df1 %>% select(ID, matches('^Z'))     # ID plus every column starting with Z
df1y %>%
  group_by(ID) %>%
  arrange(Z_Fr, .by_group = TRUE) %>%   # sort Z_Fr ascending within each ID (NAs go last)
  mutate(Z_Fr = coalesce(Z_Fr, as.numeric(row_number()))) %>%   # fill NAs in Z_Fr with the row number; see ?dplyr::row_number
  ungroup() %>%                         # ungroup before calling complete()
  complete(ID, nesting(Z_Fr = full_seq(Z_Fr, 1))) %>%
  left_join(df1x, by = c('ID', 'Z_Fr' = 'Y_Fr'))   # join on ID, matching Z_Fr (df1y) to Y_Fr (df1x)
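The expansion step on its own can be sketched in base R: build the full sequence of Z_Fr values and left-merge the observed rows onto it. This is only an illustration for a single ID, using the Z columns of ID "B" from the question; the object names z, full and expanded are made up for the example.

```r
# Z columns for ID "B" from the question
z <- data.frame(ID   = "B",
                Z_Fr = c(1, 3, 4, 5, 10),
                Z_Ax = c(0.1, 0.2, 0.4, 0.3, 0.5))

# Every integer step between the smallest and largest observed Z_Fr
full <- data.frame(ID   = "B",
                   Z_Fr = seq(min(z$Z_Fr), max(z$Z_Fr), by = 1))

# Keep all rows of `full`; Z_Ax is NA where Z_Fr was not observed
expanded <- merge(full, z, by = c("ID", "Z_Fr"), all.x = TRUE)
expanded
```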

Lagging variable by group does not work in dplyr

I'm desperately trying to lag a variable by group. I found this post that deals with essentially the same problem I'm facing, but the solution does not work for me, no idea why.
This is my problem:
library(dplyr)
df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df %>%
arrange(monthvec) %>%
mutate(growth=ifelse(monthvec==1, 0.3,
ifelse(monthvec==2, 0.5,
ifelse(monthvec==3, 0.7,
ifelse(monthvec==4, 0.1,
ifelse(monthvec==5, 0.6,NA))))))
df%>%
group_by(monthvec) %>%
mutate(lag.growth = lag(growth, order_by=monthvec))
Source: local data frame [13 x 3]
Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 0.3
3 2 0.5 NA
4 2 0.5 0.5
5 3 0.7 NA
6 3 0.7 0.7
7 3 0.7 0.7
8 4 0.1 NA
9 4 0.1 0.1
10 4 0.1 0.1
11 5 0.6 NA
12 5 0.6 0.6
13 5 0.6 0.6
This is what I'd like it to be in the end:
df$lag.growth <- c(NA, NA, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7,0.7,0.7, 0.1,0.1,0.1)
monthvec growth lag.growth
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
I believe that one problem is that my groups are not of equal length...
Thanks for helping out.
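Here is an idea. We group by monthvec to get the number of rows (cnt) in each group, then ungroup and lag growth by the first value of cnt, i.e. the size of the first group. Finally we regroup on monthvec and replace the values in each group with the group's first value. Note that this relies on every group except the last having at least as many rows as the first group; otherwise the lag reaches past the previous group.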
library(dplyr)
df %>%
group_by(monthvec) %>%
mutate(cnt = n()) %>%
ungroup() %>%
mutate(lag.growth = lag(growth, first(cnt))) %>%
group_by(monthvec) %>%
mutate(lag.growth = first(lag.growth)) %>%
select(-cnt)
which gives,
# A tibble: 13 x 3
# Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
You may join your original data with a dataframe with a shifted "monthvec".
left_join(df, df %>% mutate(monthvec = monthvec + 1) %>% unique(), by = "monthvec")
# monthvec growth.x growth.y
# 1 1 0.3 NA
# 2 1 0.3 NA
# 3 2 0.5 0.3
# 4 2 0.5 0.3
# 5 3 0.7 0.5
# 6 3 0.7 0.5
# 7 3 0.7 0.5
# 8 4 0.1 0.7
# 9 4 0.1 0.7
# 10 4 0.1 0.7
# 11 5 0.6 0.1
# 12 5 0.6 0.1
# 13 5 0.6 0.1
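The shifted-key idea can also be written as a base R lookup with match(), which does not depend on group sizes at all. A minimal sketch, assuming growth is constant within each monthvec and that "previous group" means monthvec - 1:

```r
# Same data as in the question, already sorted by monthvec
monthvec <- rep(1:5, times = c(2, 2, 3, 3, 3))
df <- data.frame(monthvec = monthvec,
                 growth   = c(0.3, 0.5, 0.7, 0.1, 0.6)[monthvec])

# For each row, find the first row of the previous month and take its growth;
# match() returns NA when there is no previous month, which yields NA growth
df$lag.growth <- df$growth[match(df$monthvec - 1, df$monthvec)]
df
```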

How to reset row names?

Here is a sample data set:
sample1 <- data.frame(Names=letters[1:10], Values=sample(seq(0.1,1,0.1)))
When I'm reordering the data set, I'm losing the row names order
sample1[order(sample1$Values), ]
Names Values
7 g 0.1
4 d 0.2
3 c 0.3
9 i 0.4
10 j 0.5
5 e 0.6
8 h 0.7
6 f 0.8
1 a 0.9
2 b 1.0
Desired output:
Names Values
1 g 0.1
2 d 0.2
3 c 0.3
4 i 0.4
5 j 0.5
6 e 0.6
7 h 0.7
8 f 0.8
9 a 0.9
10 b 1.0
Try the following, where Ordersample2 is the reordered data frame:
rownames(Ordersample2) <- 1:10
or more generally
rownames(Ordersample2) <- NULL
I had a dplyr use case:
df %>% as.data.frame(row.names = 1:nrow(.))
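Putting the question and the first answer together, a minimal end-to-end sketch might look like this. The name ordered and the seed are assumptions added for reproducibility.

```r
set.seed(1)  # assumed seed, only so the example is reproducible
sample1 <- data.frame(Names  = letters[1:10],
                      Values = sample(seq(0.1, 1, 0.1)))

# Reordering keeps the original row names attached to each row
ordered <- sample1[order(sample1$Values), ]

# Assigning NULL resets the row names to the default 1..n
rownames(ordered) <- NULL
ordered
```

Assigning NULL is preferable to 1:10 because it works for any number of rows.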

Fill nth columns in a dataframe

I have this data frame:
df <- data.frame(A=c("a","b","c","d","e","f","g","h","i"),
B=c("1","1","1","2","2","2","3","3","3"),
C=c(0.1,0.2,0.4,0.1,0.5,0.7,0.1,0.2,0.5))
> df
A B C
1 a 1 0.1
2 b 1 0.2
3 c 1 0.4
4 d 2 0.1
5 e 2 0.5
6 f 2 0.7
7 g 3 0.1
8 h 3 0.2
9 i 3 0.5
I would like to add 1000 further columns and fill these columns with the values generated by:
transform(df, D=ave(C, B, FUN=function(b) sample(b, replace=TRUE)))
I've tried with a for loop but it does not work:
for (i in 4:1000){
df[, 4:1000] <- NA
df[,i] = transform(df, D=ave(C, B, FUN=function(b) sample(b, replace=TRUE)))
}
For efficiency reasons, I suggest running sample only once for each group. This can be achieved with this:
sample2 <- function(x, size)
{
if(length(x)==1) rep(x, size) else sample(x, size, replace=TRUE)
}
new_df <- do.call(rbind, by(df, df$B,
function(d) cbind(d, matrix(sample2(d$C, length(d$C)*1000),
ncol=1000))))
Notes:
I've created sample2 in case there is a group with only one C value. Check ?sample to see what I mean.
The names of the new columns will be numbers, from 1 to 1000. This can be changed as in the answer by @agstudy.
The row names are also changed. "Fixing" them is similar, just use rownames instead of colnames.
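The reason for sample2 can be illustrated with a short sketch: when sample() is given a single number x >= 1, it samples from 1:x instead of returning x itself. The seed and the object names a and b are assumptions for the example.

```r
sample2 <- function(x, size) {
  # A length-one x is repeated as-is instead of being treated as 1:x
  if (length(x) == 1) rep(x, size) else sample(x, size, replace = TRUE)
}

set.seed(42)  # assumed seed, only for reproducibility
a <- sample(5, 10, replace = TRUE)  # draws from 1:5, not just the value 5
b <- sample2(5, 10)                 # always the value 5
a
b
```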
Using replicate for example:
cbind(df,replicate(1000,ave(df$C, df$B,
FUN=function(b) sample(b, replace=TRUE))))
To add 4 columns for example:
cbind(df,replicate(4,ave(df$C, df$B,
FUN=function(b) sample(b, replace=TRUE))))
A B C 1 2 3 4
1 a 1 0.1 0.2 0.2 0.1 0.2
2 b 1 0.2 0.4 0.2 0.4 0.4
3 c 1 0.4 0.1 0.1 0.1 0.1
4 d 2 0.1 0.1 0.5 0.5 0.1
5 e 2 0.5 0.7 0.1 0.5 0.1
6 f 2 0.7 0.1 0.7 0.7 0.7
7 g 3 0.1 0.2 0.5 0.2 0.2
8 h 3 0.2 0.2 0.1 0.2 0.1
9 i 3 0.5 0.5 0.5 0.1 0.5
You may want to rename the new columns, for example (with res holding the result of the cbind call above):
gsub('([0-9]+)','D\\1',colnames(res))
[1] "A" "B" "C" "D1" "D2" "D3" "D4"
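An alternative to renaming afterwards is to name the resampled columns before binding them. A minimal sketch, assuming four new columns with a D prefix; the names draws and res and the seed are illustrative.

```r
df <- data.frame(A = letters[1:9],
                 B = rep(c("1", "2", "3"), each = 3),
                 C = c(0.1, 0.2, 0.4, 0.1, 0.5, 0.7, 0.1, 0.2, 0.5))

set.seed(1)  # assumed seed, only for reproducibility
# Each column resamples C with replacement within the groups defined by B
draws <- replicate(4, ave(df$C, df$B,
                          FUN = function(b) sample(b, replace = TRUE)))

# Name the columns D1..D4 before binding them to df
colnames(draws) <- paste0("D", seq_len(ncol(draws)))
res <- cbind(df, draws)
colnames(res)
```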
