Calculating cumsum of mortality before a specific date in R

I need help writing either a for loop or dplyr code for two things:
1. Calculating the cumulative sum of mortality (%), grouped on Unit, in a defined period (7 days) before a treatment.
2. Making a vector that counts the days post last treatment until the next treatment.
The data set looks like this:
Unit Date Prcent_daily.mortality Date.treatment
A 20.07.2020 0.2 NA
A 21.07.2020 0 NA
A 22.07.2020 0.4 NA
A 23.07.2020 0.3 NA
A 24.07.2020 0.6 NA
A 25.07.2020 0.05 NA
A 26.07.2020 0 NA
A 27.07.2020 0 NA
A 28.07.2020 0.01 28.07.2020
A 29.07.2020 0.1 NA
A 30.07.2020 0.2 NA
A 31.07.2020 0 NA
A 01.08.2020 0.2 NA
A 02.08.2020 0.3 NA
A 03.08.2020 0.3 NA
A 04.08.2020 0.05 NA
A 05.08.2020 0 NA
A 06.08.2020 0 NA
A 07.08.2020 0.01 05.08.2020
A 08.08.2020 0.1 NA
A 09.08.2020 0.2 NA
And I want to achieve this:
Unit Date Prcent_daily.mortality Date.treatment akkum.7dbt days.post.treatment
A 20.07.2020 0.2 NA NA NA
A 21.07.2020 0 NA 1.35 NA
A 22.07.2020 0.4 NA 1.35 NA
A 23.07.2020 0.3 NA 1.35 NA
A 24.07.2020 0.6 NA 1.35 NA
A 25.07.2020 0.05 NA 1.35 NA
A 26.07.2020 0 NA 1.35 NA
A 27.07.2020 0 NA 1.35 NA
A 28.07.2020 0.01 28.07.2020 1.35 0
A 29.07.2020 0.1 NA NA 1
A 30.07.2020 0.2 NA NA 2
A 31.07.2020 0 NA 0.85 3
A 01.08.2020 0.2 NA 0.85 4
A 02.08.2020 0.3 NA 0.85 5
A 03.08.2020 0.3 NA 0.85 6
A 04.08.2020 0.05 NA 0.85 7
A 05.08.2020 0 NA 0.85 8
A 06.08.2020 0 NA 0.85 9
A 07.08.2020 0.01 05.08.2020 0.85 0
A 08.08.2020 0.1 NA NA 1
A 09.08.2020 0.2 NA NA 2
Thanks for all help, from a self-taught R amateur.

Try this solution which combines base R and dplyr:
library(dplyr)
library(tidyr)
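#Note: the steps below assume the question's table is already in a data frame `df`;
#if needed, it can be rebuilt from the posted text like this
df <- read.table(text = "Unit Date Prcent_daily.mortality Date.treatment
A 20.07.2020 0.2 NA
A 21.07.2020 0 NA
A 22.07.2020 0.4 NA
A 23.07.2020 0.3 NA
A 24.07.2020 0.6 NA
A 25.07.2020 0.05 NA
A 26.07.2020 0 NA
A 27.07.2020 0 NA
A 28.07.2020 0.01 28.07.2020
A 29.07.2020 0.1 NA
A 30.07.2020 0.2 NA
A 31.07.2020 0 NA
A 01.08.2020 0.2 NA
A 02.08.2020 0.3 NA
A 03.08.2020 0.3 NA
A 04.08.2020 0.05 NA
A 05.08.2020 0 NA
A 06.08.2020 0 NA
A 07.08.2020 0.01 05.08.2020
A 08.08.2020 0.1 NA
A 09.08.2020 0.2 NA", header = TRUE, stringsAsFactors = FALSE)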
#Build index vectors: i1 marks the day before each treatment, i2 the 7th day before it
i1 <- which(!is.na(df$Date.treatment))
i2 <- i1 - 7
i1 <- i1 - 1
i3 <- 1:length(i1)
#Index for the second variable (days post treatment)
j1 <- which(!is.na(df$Date.treatment))
j2 <- 1:length(j1)
#Empty column, then mark the boundaries of each pre-treatment window
df$Var <- NA
df$Var[i1] <- i3
df$Var[i2] <- i3
df$Var[1] <- 0
df$Var <- ifelse(!is.na(df$Date.treatment), 0, df$Var)
#Fill
df %>% fill(Var) -> df1
#Create aggregations
df1 %>% filter(Var != 0) %>% group_by(Var) %>%
  mutate(Cum = cumsum(Prcent_daily.mortality)) %>%
  filter(Cum == max(Cum)) %>% filter(!duplicated(Cum)) %>%
  ungroup() %>% select(c(Unit, Cum)) -> Ag1
#Create another var
df$Var2 <- NA
df$Var2[j1] <- j2
df$Var2[1] <- 0
#Fill
df %>% fill(Var2) -> df2
#Create cums and days
df2 %>% group_by(Unit,Var2) %>% mutate(Day=(1:n())-1) %>% ungroup() %>% select(-c(Var2)) -> df3
#Empty var for cums
df3$Cum <- NA
df3$Cum[i1+1] <- Ag1$Cum
#Fill 2
df3 %>% fill(Cum,.direction = 'up') -> df4
#Some adjusts
df4$Day[1:i1[1]]<-NA
df4$Cum[1] <- NA
df4$Cum <- ifelse((df4$Day==1 | df4$Day==2) & !is.na(df4$Day),NA,df4$Cum)
This will produce:
Unit Date Prcent_daily.mortality Date.treatment Var Day Cum
1 A 20.07.2020 0.20 <NA> 0 NA NA
2 A 21.07.2020 0.00 <NA> 1 NA 1.35
3 A 22.07.2020 0.40 <NA> NA NA 1.35
4 A 23.07.2020 0.30 <NA> NA NA 1.35
5 A 24.07.2020 0.60 <NA> NA NA 1.35
6 A 25.07.2020 0.05 <NA> NA NA 1.35
7 A 26.07.2020 0.00 <NA> NA NA 1.35
8 A 27.07.2020 0.00 <NA> 1 NA 1.35
9 A 28.07.2020 0.01 28.07.2020 0 0 1.35
10 A 29.07.2020 0.10 <NA> NA 1 NA
11 A 30.07.2020 0.20 <NA> NA 2 NA
12 A 31.07.2020 0.00 <NA> 2 3 0.85
13 A 01.08.2020 0.20 <NA> NA 4 0.85
14 A 02.08.2020 0.30 <NA> NA 5 0.85
15 A 03.08.2020 0.30 <NA> NA 6 0.85
16 A 04.08.2020 0.05 <NA> NA 7 0.85
17 A 05.08.2020 0.00 <NA> NA 8 0.85
18 A 06.08.2020 0.00 <NA> 2 9 0.85
19 A 07.08.2020 0.01 05.08.2020 0 0 0.85
20 A 08.08.2020 0.10 <NA> NA 1 NA
21 A 09.08.2020 0.20 <NA> NA 2 NA
Update: working from df4 you can get the cumulative sum of Prcent_daily.mortality after each treatment with the following code:
#You can work with df4 to complete the rest of the aggregations
#First create a duplicate var
df4$DateD <- df4$Date.treatment
#Now fill and mutate
df4 %>% fill(DateD) -> df4
#Create index for replacement
k <- df4$Date.treatment==df4$DateD & !is.na(df4$Date.treatment)
#Assign a placeholder to the treatment rows so they are excluded from the cumsum groups
df4$DateD[k]<-'NULL'
#Cumsum
df4 %>% group_by(DateD) %>% mutate(CumAfter=cumsum(Prcent_daily.mortality)) -> df4
#Now remove redundant values in the cum and drop the reference var
df4 %>% ungroup() %>% mutate(CumAfter=ifelse(is.na(DateD) | DateD=='NULL',NA,CumAfter)) %>%
select(-DateD) -> df4
The output will be:
Unit Date Prcent_daily.mortality Date.treatment Var Day Cum CumAfter
1 A 20.07.2020 0.20 <NA> 0 NA NA NA
2 A 21.07.2020 0.00 <NA> 1 NA 1.35 NA
3 A 22.07.2020 0.40 <NA> NA NA 1.35 NA
4 A 23.07.2020 0.30 <NA> NA NA 1.35 NA
5 A 24.07.2020 0.60 <NA> NA NA 1.35 NA
6 A 25.07.2020 0.05 <NA> NA NA 1.35 NA
7 A 26.07.2020 0.00 <NA> NA NA 1.35 NA
8 A 27.07.2020 0.00 <NA> 1 NA 1.35 NA
9 A 28.07.2020 0.01 28.07.2020 0 0 1.35 NA
10 A 29.07.2020 0.10 <NA> NA 1 NA 0.10
11 A 30.07.2020 0.20 <NA> NA 2 NA 0.30
12 A 31.07.2020 0.00 <NA> 2 3 0.85 0.30
13 A 01.08.2020 0.20 <NA> NA 4 0.85 0.50
14 A 02.08.2020 0.30 <NA> NA 5 0.85 0.80
15 A 03.08.2020 0.30 <NA> NA 6 0.85 1.10
16 A 04.08.2020 0.05 <NA> NA 7 0.85 1.15
17 A 05.08.2020 0.00 <NA> NA 8 0.85 1.15
18 A 06.08.2020 0.00 <NA> 2 9 0.85 1.15
19 A 07.08.2020 0.01 05.08.2020 0 0 0.85 NA
20 A 08.08.2020 0.10 <NA> NA 1 NA 0.10
21 A 09.08.2020 0.20 <NA> NA 2 NA 0.30
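To match the column names the question asks for, df4 can be relabelled at the end (a small final step, assuming the df4 built above):
df4 %>% rename(akkum.7dbt = Cum, days.post.treatment = Day) %>% select(-Var)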

A data.table solution. Although you are looking for a dplyr solution, I just want to share.
The idea is to create flags marking the 7 (or 8?) days before each treatment and the days after each treatment.
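As a quick illustration of the flag idea: cumsum() over "is this a treatment row?" yields a group id that steps up at every treatment row (a toy vector, not the real data):
x <- c(NA, NA, "28.07.2020", NA, NA, "05.08.2020", NA)
cumsum(!is.na(x))
#> [1] 0 0 1 1 1 2 2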
library(data.table)
odt <- fread('Unit Date Prcent_daily.mortality Date.treatment
A 20.07.2020 0.2 NA
A 21.07.2020 0 NA
A 22.07.2020 0.4 NA
A 23.07.2020 0.3 NA
A 24.07.2020 0.6 NA
A 25.07.2020 0.05 NA
A 26.07.2020 0 NA
A 27.07.2020 0 NA
A 28.07.2020 0.01 28.07.2020
A 29.07.2020 0.1 NA
A 30.07.2020 0.2 NA
A 31.07.2020 0 NA
A 01.08.2020 0.2 NA
A 02.08.2020 0.3 NA
A 03.08.2020 0.3 NA
A 04.08.2020 0.05 NA
A 05.08.2020 0 NA
A 06.08.2020 0 NA
A 07.08.2020 0.01 05.08.2020
A 08.08.2020 0.1 NA
A 09.08.2020 0.2 NA')
#create group flags
odt[, postgrp := cumsum(!is.na(Date.treatment)), by = .(Unit)]
odt[, pregrp := c(if (postgrp - 1 < 0) 0 else postgrp - 1, rep(postgrp, .N - 1)), by = .(Unit, postgrp)]
treat_date <- odt[, .I[!is.na(Date.treatment)]]
pre7_date <- unlist(Map(seq, treat_date - 7, treat_date))
odt[!pre7_date, pregrp := NA][]
#> Unit Date Prcent_daily.mortality Date.treatment postgrp pregrp
#> 1: A 20.07.2020 0.20 <NA> 0 NA
#> 2: A 21.07.2020 0.00 <NA> 0 0
#> 3: A 22.07.2020 0.40 <NA> 0 0
#> 4: A 23.07.2020 0.30 <NA> 0 0
#> 5: A 24.07.2020 0.60 <NA> 0 0
#> 6: A 25.07.2020 0.05 <NA> 0 0
#> 7: A 26.07.2020 0.00 <NA> 0 0
#> 8: A 27.07.2020 0.00 <NA> 0 0
#> 9: A 28.07.2020 0.01 28.07.2020 1 0
#> 10: A 29.07.2020 0.10 <NA> 1 NA
#> 11: A 30.07.2020 0.20 <NA> 1 NA
#> 12: A 31.07.2020 0.00 <NA> 1 1
#> 13: A 01.08.2020 0.20 <NA> 1 1
#> 14: A 02.08.2020 0.30 <NA> 1 1
#> 15: A 03.08.2020 0.30 <NA> 1 1
#> 16: A 04.08.2020 0.05 <NA> 1 1
#> 17: A 05.08.2020 0.00 <NA> 1 1
#> 18: A 06.08.2020 0.00 <NA> 1 1
#> 19: A 07.08.2020 0.01 05.08.2020 2 1
#> 20: A 08.08.2020 0.10 <NA> 2 NA
#> 21: A 09.08.2020 0.20 <NA> 2 NA
#calculation: akkum.7dbty sums the 7 days before each treatment ([-.N] drops the treatment day itself)
odt[!is.na(pregrp), akkum.7dbty := sum(tail(Prcent_daily.mortality[-.N], 7)), by = .(Unit, pregrp)]
odt[postgrp!=0,days.post.treatment:= 0:(.N-1),by=.(Unit,postgrp)]
#result
odt[,c("postgrp","pregrp"):=NULL][]
#> Unit Date Prcent_daily.mortality Date.treatment akkum.7dbty days.post.treatment
#> 1: A 20.07.2020 0.20 <NA> NA NA
#> 2: A 21.07.2020 0.00 <NA> 1.35 NA
#> 3: A 22.07.2020 0.40 <NA> 1.35 NA
#> 4: A 23.07.2020 0.30 <NA> 1.35 NA
#> 5: A 24.07.2020 0.60 <NA> 1.35 NA
#> 6: A 25.07.2020 0.05 <NA> 1.35 NA
#> 7: A 26.07.2020 0.00 <NA> 1.35 NA
#> 8: A 27.07.2020 0.00 <NA> 1.35 NA
#> 9: A 28.07.2020 0.01 28.07.2020 1.35 0
#> 10: A 29.07.2020 0.10 <NA> NA 1
#> 11: A 30.07.2020 0.20 <NA> NA 2
#> 12: A 31.07.2020 0.00 <NA> 0.85 3
#> 13: A 01.08.2020 0.20 <NA> 0.85 4
#> 14: A 02.08.2020 0.30 <NA> 0.85 5
#> 15: A 03.08.2020 0.30 <NA> 0.85 6
#> 16: A 04.08.2020 0.05 <NA> 0.85 7
#> 17: A 05.08.2020 0.00 <NA> 0.85 8
#> 18: A 06.08.2020 0.00 <NA> 0.85 9
#> 19: A 07.08.2020 0.01 05.08.2020 0.85 0
#> 20: A 08.08.2020 0.10 <NA> NA 1
#> 21: A 09.08.2020 0.20 <NA> NA 2
Created on 2020-07-21 by the reprex package (v0.3.0)

Related

R Upsampling a time series in a dataframe filling missing values

I have collected data from two instruments; one is sampled at 10 Hz and the other at 100 Hz.
I would like to upsample the 10 Hz data to 100 Hz in one dataframe, and then align and merge the two dataframes.
The example data frame is:
DeltaT Speed Acc HR Player
48860,7 0,03 -0,05 0 Player1
48860,8 0,02 -0,05 0 Player1
48860,9 0,02 -0,04 0 Player1
48861,0 0,02 -0,03 0 Player1
48861,1 0,01 -0,02 0 Player1
Is there a package function that can help me create data between two points?
Manually with the approx function:
dt<- read.table(text=gsub(",", ".", 'DeltaT Speed Acc HR Player
48860,7 0,03 -0,05 0 Player1
48860,8 0,02 -0,05 0 Player1
48860,9 0,02 -0,04 0 Player1
48861,0 0,02 -0,03 0 Player1
48861,1 0,01 -0,02 0 Player1', fixed = TRUE),header=T)
upsampleDeltaT=seq(from=min(dt$DeltaT),to=max(dt$DeltaT),by=.01)
Speed<-approx(dt$DeltaT,dt$Speed,upsampleDeltaT)$y
Acc<-approx(dt$DeltaT,dt$Acc,upsampleDeltaT)$y
HR<-approx(dt$DeltaT,dt$HR,upsampleDeltaT)$y
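# Player is categorical, so it is repeated rather than interpolated:
# 10 new points per original interval plus the single final sample (4*10 + 1 = 41 rows)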
Player <- rep(dt$Player,c(rep(10,nrow(dt)-1),1))
data.frame(upsampleDeltaT,Speed,Acc,HR,Player)
#> upsampleDeltaT Speed Acc HR Player
#> 1 48860.70 0.030 -0.050 0 Player1
#> 2 48860.71 0.029 -0.050 0 Player1
#> 3 48860.72 0.028 -0.050 0 Player1
#> 4 48860.73 0.027 -0.050 0 Player1
#> 5 48860.74 0.026 -0.050 0 Player1
#> 6 48860.75 0.025 -0.050 0 Player1
#> 7 48860.76 0.024 -0.050 0 Player1
#> 8 48860.77 0.023 -0.050 0 Player1
#> 9 48860.78 0.022 -0.050 0 Player1
#> 10 48860.79 0.021 -0.050 0 Player1
#> 11 48860.80 0.020 -0.050 0 Player1
#> 12 48860.81 0.020 -0.049 0 Player1
#> 13 48860.82 0.020 -0.048 0 Player1
#> 14 48860.83 0.020 -0.047 0 Player1
#> 15 48860.84 0.020 -0.046 0 Player1
#> 16 48860.85 0.020 -0.045 0 Player1
#> 17 48860.86 0.020 -0.044 0 Player1
#> 18 48860.87 0.020 -0.043 0 Player1
#> 19 48860.88 0.020 -0.042 0 Player1
#> 20 48860.89 0.020 -0.041 0 Player1
#> 21 48860.90 0.020 -0.040 0 Player1
#> 22 48860.91 0.020 -0.039 0 Player1
#> 23 48860.92 0.020 -0.038 0 Player1
#> 24 48860.93 0.020 -0.037 0 Player1
#> 25 48860.94 0.020 -0.036 0 Player1
#> 26 48860.95 0.020 -0.035 0 Player1
#> 27 48860.96 0.020 -0.034 0 Player1
#> 28 48860.97 0.020 -0.033 0 Player1
#> 29 48860.98 0.020 -0.032 0 Player1
#> 30 48860.99 0.020 -0.031 0 Player1
#> 31 48861.00 0.020 -0.030 0 Player1
#> 32 48861.01 0.019 -0.029 0 Player1
#> 33 48861.02 0.018 -0.028 0 Player1
#> 34 48861.03 0.017 -0.027 0 Player1
#> 35 48861.04 0.016 -0.026 0 Player1
#> 36 48861.05 0.015 -0.025 0 Player1
#> 37 48861.06 0.014 -0.024 0 Player1
#> 38 48861.07 0.013 -0.023 0 Player1
#> 39 48861.08 0.012 -0.022 0 Player1
#> 40 48861.09 0.011 -0.021 0 Player1
#> 41 48861.10 0.010 -0.020 0 Player1
library(data.table)
library(zoo)
set.seed(123)
# 10Hz and 100Hz sample data
DT10 <- data.table(time = seq(0,1, by = 0.1), value = sample(1:10, 11, replace = TRUE))
DT100 <- data.table(time = seq(0,1, by = 0.01), value = sample(1:10, 101, replace = TRUE))
# you should use setDT() if your data is not already in data.table format
# join the DT10 to DT100
DT100[DT10, value2 := i.value, on = .(time)]
# interpolate NA values
DT100[, value2_inter := zoo::na.approx(value2)]
#output
head(DT100, 31)
# time value value2 value2_inter
# 1: 0.00 3 3 3.0
# 2: 0.01 9 NA 3.0
# 3: 0.02 9 NA 3.0
# 4: 0.03 9 NA 3.0
# 5: 0.04 3 NA 3.0
# 6: 0.05 8 NA 3.0
# 7: 0.06 10 NA 3.0
# 8: 0.07 7 NA 3.0
# 9: 0.08 10 NA 3.0
# 10: 0.09 9 NA 3.0
# 11: 0.10 3 3 3.0
# 12: 0.11 4 NA 3.7
# 13: 0.12 1 NA 4.4
# 14: 0.13 7 NA 5.1
# 15: 0.14 5 NA 5.8
# 16: 0.15 10 NA 6.5
# 17: 0.16 7 NA 7.2
# 18: 0.17 9 NA 7.9
# 19: 0.18 9 NA 8.6
# 20: 0.19 10 NA 9.3
# 21: 0.20 7 10 10.0
# 22: 0.21 5 NA 9.8
# 23: 0.22 7 NA 9.6
# 24: 0.23 5 NA 9.4
# 25: 0.24 6 NA 9.2
# 26: 0.25 9 NA 9.0
# 27: 0.26 2 NA 8.8
# 28: 0.27 5 NA 8.6
# 29: 0.28 8 NA 8.4
# 30: 0.29 2 NA 8.2
# 31: 0.30 1 NA 8.0
# time value value2 value2_inter
Have a look at the approx function. It interpolates data for new points by linear interpolation.
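A minimal illustration (toy numbers, just to show the call):
# interpolate y at new x positions between two known points
approx(x = c(0, 1), y = c(0, 10), xout = c(0.25, 0.5))$y
#> [1] 2.5 5.0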

Forward fill data with conditions

I have a dataframe, DF, looking like this:
date permno ret sue bm gpa
1 202001 10000 0.01 0.4 0.4 NA
2 202002 10000 0.04 NA NA 0.5
3 202003 10000 -0.01 NA NA NA
4 202004 10000 0.00 1.3 0.5 NA
5 202005 10000 0.02 NA NA 0.3
6 202006 10000 0.01 NA NA NA
7 202007 10000 0.03 NA NA NA
8 202008 10000 -0.02 NA NA 0.4
9 202001 11000 0.05 0.1 0.3 NA
10 202002 11000 0.02 NA NA NA
11 202003 11000 0.01 NA NA NA
12 202004 11000 0.00 NA NA 0.3
13 202005 11000 0.01 NA NA NA
14 202006 11000 -0.01 NA NA NA
15 202007 11000 0.04 0.5 0.4 NA
16 202008 11000 0.30 NA NA NA
I am using this code to forward fill the variables sue, bm and gpa:
DF1 <-
DF %>%
arrange(permno,date) %>%
group_by(permno) %>%
mutate_at(vars(c(sue,bm,gpa)), funs(na.locf(.,na.rm=FALSE)))
This results in the following:
date permno ret sue bm gpa
1 202001 10000 0.01 0.4 0.4 NA
2 202002 10000 0.04 0.4 0.4 0.5
3 202003 10000 -0.01 0.4 0.4 0.5
4 202004 10000 0 1.3 0.5 0.5
5 202005 10000 0.02 1.3 0.5 0.3
6 202006 10000 0.01 1.3 0.5 0.3
7 202007 10000 0.03 1.3 0.5 0.3
8 202008 10000 -0.02 1.3 0.5 0.4
9 202001 11000 0.05 0.1 0.3 NA
10 202002 11000 0.02 0.1 0.3 NA
11 202003 11000 0.01 0.1 0.3 NA
12 202004 11000 0 0.1 0.3 0.3
13 202005 11000 0.01 0.1 0.3 0.3
14 202006 11000 -0.01 0.1 0.3 0.3
15 202007 11000 0.04 0.5 0.4 0.3
16 202008 11000 0.3 0.5 0.4 0.3
I want to put a limit on how many months the data is forward filled. I want to forward fill the three variables until the next available value, but maximum 3 months. Thus, the results should look like this:
date permno ret sue bm gpa
1 202001 10000 0.01 0.4 0.4 NA
2 202002 10000 0.04 0.4 0.4 0.5
3 202003 10000 -0.01 0.4 0.4 0.5
4 202004 10000 0.00 1.3 0.5 0.5
5 202005 10000 0.02 1.3 0.5 0.3
6 202006 10000 0.01 1.3 0.5 0.3
7 202007 10000 0.03 1.3 0.5 0.3
8 202008 10000 -0.02 NA NA 0.4
9 202001 11000 0.05 0.1 0.3 NA
10 202002 11000 0.02 0.1 0.3 NA
11 202003 11000 0.01 0.1 0.3 NA
12 202004 11000 0.00 0.1 0.3 0.3
13 202005 11000 0.01 NA NA 0.3
14 202006 11000 -0.01 NA NA 0.3
15 202007 11000 0.04 0.5 0.4 0.3
16 202008 11000 0.30 0.5 0.4 NA
Does anyone know how I can do this in R?
We can write our own na.locf2() that allows us to make the adjustments you desire:
Code
library(zoo)
library(dplyr)
na.locf2 <- function(object, period = 3, ...){
  # lengths of the runs of NA / non-NA values
  tmp1 <- rle(is.na(object))
  # position within each run, along the length of the vector
  tmp2 <- unlist(sapply(tmp1[[1]], function(x){
    1:x
  }))
  # carry all values forward
  tmp3 <- na.locf(object, ...)
  # re-blank originally-NA positions that sit deeper in their run than `period`
  # (the is.na(object) guard keeps long runs of observed values intact)
  tmp3[tmp2 > period & is.na(object)] <- NA
  # return
  tmp3
}
# Then
DF %>%
  arrange(permno, date) %>%
  group_by(permno) %>%
  mutate(across(c(sue, bm, gpa), ~ na.locf2(.x, na.rm = FALSE)))
# Yields
# A tibble: 16 x 6
# Groups: permno [2]
# date permno ret sue bm gpa
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 202001 10000 0.01 0.4 0.4 NA
# 2 202002 10000 0.04 0.4 0.4 0.5
# 3 202003 10000 -0.01 0.4 0.4 0.5
# 4 202004 10000 0 1.3 0.5 0.5
# 5 202005 10000 0.02 1.3 0.5 0.3
# 6 202006 10000 0.01 1.3 0.5 0.3
# 7 202007 10000 0.03 1.3 0.5 0.3
# 8 202008 10000 -0.02 NA NA 0.4
# 9 202001 11000 0.05 0.1 0.3 NA
# 10 202002 11000 0.02 0.1 0.3 NA
# 11 202003 11000 0.01 0.1 0.3 NA
# 12 202004 11000 0 0.1 0.3 0.3
# 13 202005 11000 0.01 NA NA 0.3
# 14 202006 11000 -0.01 NA NA 0.3
# 15 202007 11000 0.04 0.5 0.4 0.3
# 16 202008 11000 0.3 0.5 0.4 NA
This sounds like a rolling-window thing. Since you need to limit the carry-forward, though, one problem is that when you look at a particular cell, its predecessor has already been fixed (un-NA'd), so we need to look at the vector in reverse.
A helper function, where 2:4 is based on your preference for no more than three months. In the context of a reversed rollapply, z[1] in this case is likely an NA with z[2:4] being the three preceding months.
func <- function(z) if (is.na(z[1])) na.omit(z[2:4])[1] else z[1]
Since we're using partial=TRUE in the rolling, it is feasible that z will not be length-4; this is fine, though, since even z[100000] will return NA, which is explicitly handled by na.omit. (This condition can also be addressed with a test on length(z).)
Another note: na.omit(.) can return a length-0 vector, which is obviously bad in this context. However, adding [1] after it forces it to return an NA, which is what we need it to be if no non-NA elements exist within z[2:4].
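A quick illustration of that [1] safeguard:
length(na.omit(c(NA, NA)))  # nothing left once the NAs are dropped
#> [1] 0
na.omit(c(NA, NA))[1]       # indexing past the end yields NA, which func relies on
#> [1] NA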
From here, most of the work is done by zoo::rollapply:
DF %>%
arrange(permno, date) %>%
group_by(permno) %>%
mutate(across(sue:gpa, ~ rev(zoo::rollapply(rev(.), 4, align="left", FUN = func, partial = TRUE)))) %>%
ungroup()
# # A tibble: 16 x 6
# date permno ret sue bm gpa
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 202001 10000 0.01 0.4 0.4 NA
# 2 202002 10000 0.04 0.4 0.4 0.5
# 3 202003 10000 -0.01 0.4 0.4 0.5
# 4 202004 10000 0 1.3 0.5 0.5
# 5 202005 10000 0.02 1.3 0.5 0.3
# 6 202006 10000 0.01 1.3 0.5 0.3
# 7 202007 10000 0.03 1.3 0.5 0.3
# 8 202008 10000 -0.02 NA NA 0.4
# 9 202001 11000 0.05 0.1 0.3 NA
# 10 202002 11000 0.02 0.1 0.3 NA
# 11 202003 11000 0.01 0.1 0.3 NA
# 12 202004 11000 0 0.1 0.3 0.3
# 13 202005 11000 0.01 NA NA 0.3
# 14 202006 11000 -0.01 NA NA 0.3
# 15 202007 11000 0.04 0.5 0.4 0.3
# 16 202008 11000 0.3 0.5 0.4 NA

R read.table fill empty data with value above

I have a text file with horrible formatting that I need to read into R. I am reading a bunch of other files that don't have horrible formatting with read.table, so I would like to continue to use this function, if possible.
The file looks like this:
M D YY CONC
7 1 78 15
0.00
0.15
1.06
1.21
10.91
34.55
69.09
87.27
73.67
38.65
12.27
2.27
6.52
0.45
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.19
0.96
4.59
4.55
4.59
7.25
7.13
11.60
1.06
0.15
1.50
1.16
0.00
0.00
0.00
0.00
0.00
7 1 78 16
0.00
0.00
0.00
0.00
7.25
1.50
9.00
20.25
51.25
55.00
53.75
3.13
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.80
0.98
4.00
2.47
5.63
3.50
7.88
0.43
2.30
0.00
0.00
0.00
0.00
0.00
0.00
0.00
7 1 78 17
4.15
0.00
0.00
0.15
2.27
16.36
54.37
67.96
58.07
3.58
0.89
0.20
0.52
0.59
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
5.44
0.00
3.09
3.26
7.17
9.39
8.65
3.09
0.45
7.41
3.18
0.00
2.05
0.00
There is one CONC per hour on the date provided in the first row. My ultimate goal will be to have the date repeat and add a column for hour. So the first bit should look like:
M D YY H CONC
7 1 78 1 15
7 1 78 2 0.00
7 1 78 3 0.15
7 1 78 4 1.06
7 1 78 5 1.21
7 1 78 6 10.91
7 1 78 7 34.55
7 1 78 8 69.09
I can read in the file using this:
monitor_datai <- read.table(file =file,header = TRUE, stringsAsFactors = FALSE, skip = 0, sep = "", fill = TRUE)
BUT the issue with that approach is that the first column gets filled with either the month (if provided on that line) or the concentration (if no month was provided on that line), looking something like this:
head(monitor_datai)
V1 V2 V3 V4
1 7.00 1 78 15
2 0.00 NA NA NA
3 0.15 NA NA NA
4 1.06 NA NA NA
5 1.21 NA NA NA
6 10.91 NA NA NA
So, I need help reading in the file and fixing the formatting.
Thanks!
Here is my approach, using the weapons of the data.table package.
I was not sure what the values of H should become... just 1:128, or a sequence by group, or ...? Please specify and I'll add it to the answer. (A sketch under one interpretation follows the final output below.)
I included comments and intermediate results in the code below, so you (hopefully) can follow the steps and adjust if/where needed.
library( data.table )
#read the file as-is, complete lines, no separator
DT <- fread( "./temp/testfile.txt", sep = "", skip = 1, header = FALSE )
# head(DT)
# V1
# 1: 7 1 78 15
# 2: 0.00
# 3: 0.15
# 4: 1.06
# 5: 1.21
# 6: 10.91
#get column names from the file, store in a vector
colnames = names( fread( "./temp/testfile.txt", sep = " ", nrows = 1, header = TRUE ) )
#split the rows that contain a space into the desired columns,
# using a space (or several in a row) as the separator
DT[ grepl(" ", V1), (colnames) := tstrsplit( V1, "[ ]+", perl = TRUE ) ]
# V1 M D YY CONC
# 1: 7 1 78 15 7 1 78 15
# 2: 0.00 <NA> <NA> <NA> <NA>
# 3: 0.15 <NA> <NA> <NA> <NA>
# 4: 1.06 <NA> <NA> <NA> <NA>
# 5: 1.21 <NA> <NA> <NA> <NA>
# ---
# 124: 7.41 <NA> <NA> <NA> <NA>
# 125: 3.18 <NA> <NA> <NA> <NA>
# 126: 0.00 <NA> <NA> <NA> <NA>
# 127: 2.05 <NA> <NA> <NA> <NA>
# 128: 0.00 <NA> <NA> <NA> <NA>
#where CONC is.na, copy the value of V1
DT[ is.na( CONC ), CONC := V1 ]
# V1 M D YY CONC
# 1: 7 1 78 15 7 1 78 15
# 2: 0.00 <NA> <NA> <NA> 0.00
# 3: 0.15 <NA> <NA> <NA> 0.15
# 4: 1.06 <NA> <NA> <NA> 1.06
# 5: 1.21 <NA> <NA> <NA> 1.21
# ---
# 124: 7.41 <NA> <NA> <NA> 7.41
# 125: 3.18 <NA> <NA> <NA> 3.18
# 126: 0.00 <NA> <NA> <NA> 0.00
# 127: 2.05 <NA> <NA> <NA> 2.05
# 128: 0.00 <NA> <NA> <NA> 0.00
#now we can drop the V1-column
DT[, V1 := NULL]
#set all columns to the right (numeric) type
DT[, (names(DT)) := lapply( .SD, as.numeric ) ]
#and fill down the missing values of M, D and YY
setnafill( DT, type = "locf", cols = c("M", "D", "YY") )
# M D YY CONC
# 1: 7 1 78 15.00
# 2: 7 1 78 0.00
# 3: 7 1 78 0.15
# 4: 7 1 78 1.06
# 5: 7 1 78 1.21
# ---
# 124: 7 1 78 7.41
# 125: 7 1 78 3.18
# 126: 7 1 78 0.00
# 127: 7 1 78 2.05
# 128: 7 1 78 0.00
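If H should count the rows within each header block (one possible reading of the desired output, where H runs 1, 2, 3, ... after each date line), one way is to tag the blocks while V1 is still present and then number the rows per block. A sketch, where the `block` helper column is my own addition:
#tag each block by its header row (run this before V1 is dropped above)
DT[, block := cumsum(grepl(" ", V1))]
#after the fill steps, number the rows within each block and drop the helper
DT[, H := seq_len(.N), by = block][, block := NULL]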

taking average by groups, excluding NA values

I'm struggling to find a way to aggregate my data frame by taking the mean while ignoring NA values, yet still showing NA in the final table for groups where all values are missing.
The data table looks, for instance, like this:
Guar1 Bucket2 1 2 3 4 Total Month
10 -10 NA NA NA NA 0 201110
10 -0.2 0 9.87 8.42 0 18.29 201110
10 0 0.81 7.49 3.32 5.92 17.54 201110
10 0.4 0 0 NA 0 0 201110
10 999 0.73 7.57 4.61 0.77 13.68 201110
20 -10 NA NA NA NA 0 201110
20 -0.2 NA NA 100 NA 100 201110
20 0 NA 0 0 0 0 201110
20 0.4 1.39 3.13 14.04 2.98 21.54 201110
20 999 1.38 3.11 17.08 2.97 24.54 201110
999 999 1.06 5.44 8.61 1.52 16.63 201110
10 -10 NA NA NA NA 0 201111
10 -0.2 0 0 8.54 0 8.54 201111
10 0 1.87 6.12 16.6 0 24.59 201111
10 0.4 0 0 0 1.47 1.47 201111
10 999 1.68 5.82 13.15 1.67 22.32 201111
20 -10 NA NA NA NA 0 201111
20 -0.2 NA 0 NA NA 0 201111
20 0 NA NA 0 0 0 201111
20 0.4 2.29 5.38 14.91 14.18 36.76 201111
20 999 2.29 5.35 13.09 14.1 34.83 201111
And the final table should be:
Guar1 Bucket2 1 2 3 4 Total
10 -10 NA NA NA NA 0
10 -0.2 0 4.935 8.48 0 13.415
10 0 1.34 6.805 9.96 2.96 21.065
10 0.4 0 0 0 0.735 0.735
10 999 1.205 6.695 8.88 1.22 18
20 -10 NA NA NA NA 0
20 -0.2 NA 0 100 NA 50
20 0 NA 0 0 0 0
20 0.4 1.84 4.255 14.475 8.58 29.15
20 999 1.835 4.23 15.085 8.535 29.685
999 999 1.06 5.44 8.61 1.52 16.63
I've tried
aggregate(. ~ Guar1 + Bucket2, df, mean, na.rm = FALSE)
but it then excludes all rows with NA from the final table.
And if I set all the NA values in df to 0, I would not get the desired average.
I hope that someone can help me with this. Thanks!
Check this example with the dplyr package.
You can group by more than one variable. The dplyr package is great for data editing, summarising, etc.
dataFrame <- data.frame(group = c("a","a","a", "b","b","b"), value = c(1,2,NA,NA,NA,3))
library("dplyr")
df <- dataFrame %>%
group_by(group) %>%
summarise(Mean = mean(value, na.rm = T))
Output
# A tibble: 2 × 2
group Mean
<fctr> <dbl>
1 a 1.5
2 b 3.0
To avoid the NA rows being removed, use na.action = na.pass; with na.rm = TRUE in mean(), only the non-NA elements are used to compute each group mean:
aggregate(.~ Guar1+Bucket2, df, mean, na.rm =TRUE, na.action = na.pass)
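One caveat: with na.rm = TRUE, groups that are all-NA come back as NaN rather than NA. If the final table should show NA (as in the desired output), they can be converted afterwards; a small sketch, assuming all columns are numeric as in the posted data:
res <- aggregate(. ~ Guar1 + Bucket2, df, mean, na.rm = TRUE, na.action = na.pass)
# turn the NaN produced by all-NA groups back into NA
res[] <- lapply(res, function(x) if (is.numeric(x)) replace(x, is.nan(x), NA) else x)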

getting the minor diagonal elements for every group of data in R

The dataset consists of 3 character variables and 21 numeric variables in time series format:
A, B, C, num_1, num_2, num_3, ..., num_21
Below is sample data:
Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10 Variable11
1 2005 A X 1AX 1 NA NA NA NA NA
1 2006 A X 1AX 2 0 0 0 0 0
1 2007 A X 1AX 3 0 0.02 0 0.04 0.06
1 2008 A X 1AX 4 0 0.03 0.04 0.09 0.13
1 2009 A X 1AX 5 0.01 0.05 0.08 0.16 0.27
1 2010 A X 1AX 6 0.01 0.04 0.15 0.29 0.42
1 2011 A X 1AX 7 0.01 0.03 0.12 0.26 0.4
1 2012 A X 1AX 8 0 0.04 0.15 0.3 0.44
1 2013 A X 1AX 9 0.01 0.04 0.1 0.17 0.27
1 2014 A X 1AX 10 0.01 0.05 0.1 0.14 0.21
1 2007 A Y 1AY 1 0.01 0.04 0.15 0.29 0.42
1 2008 A Y 1AY 2 0.01 0.03 0.12 0.26 0.4
1 2009 A Y 1AY 3 0 0.04 0.15 0.3 0.44
1 2010 A Y 1AY 4 0.01 0.04 0.1 0.17 0.27
1 2011 A Y 1AY 5 0 0.04 0.15 0.3 0.44
1 2012 A Y 1AY 6 0.01 0.04 0.1 0.17 0.27
1 1983 B Y 1BY 1 NA NA NA NA NA
1 1984 B Y 1BY 2 0 0.5 0 0 0
I have to get the minor diagonal elements for each group of Variable1, Variable3, and Variable4.
Variable5 is the concatenation of Variable1, Variable3, and Variable4, and Variable6 increments with each occurrence of the concatenated value.
newdata$variable5 <- apply(newdata, 1, function(x) paste0(toString(x[1]), toString(x[3]), toString(x[4])))
DT <- data.table(newdata)
DT[, variable6 := 1:.N, by = c("variable5")]
keys <- unique(unlist(DT$variable5))
for (i in 2:length(keys)){
  data_1 <- subset(DT, merge = keys[i])
  result <- diag(data_1[nrow(data_1):1, ])
}
I am getting the error below:
Error in diag(data_1[nrow(data_1):1, ]) :
(list) object cannot be coerced to type 'double'
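The error says diag() was handed a list: a data.table is a list of columns, so it has to be reduced to its numeric columns and converted with as.matrix() first. A sketch of that repair (the subset() call is also rewritten to filter on variable5, which appears to be the intent, and results are collected in a list instead of being overwritten):
results <- list()
for (i in 2:length(keys)){
  data_1 <- subset(DT, variable5 == keys[i])
  # diag() needs a matrix, not a list: keep the numeric columns and convert
  num_cols <- names(data_1)[sapply(data_1, is.numeric)]
  m <- as.matrix(data_1[, ..num_cols])
  # reversing the rows makes diag() read the minor (anti-)diagonal
  results[[keys[i]]] <- diag(m[nrow(m):1, , drop = FALSE])
}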
