R read.table fill empty data with value above

I have a text file with horrible formatting that I need to read into R. I am reading a bunch of other files that don't have horrible formatting with read.table, so I would like to continue to use this function, if possible.
The file looks like this:
M D YY CONC
7 1 78 15
0.00
0.15
1.06
1.21
10.91
34.55
69.09
87.27
73.67
38.65
12.27
2.27
6.52
0.45
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.19
0.96
4.59
4.55
4.59
7.25
7.13
11.60
1.06
0.15
1.50
1.16
0.00
0.00
0.00
0.00
0.00
7 1 78 16
0.00
0.00
0.00
0.00
7.25
1.50
9.00
20.25
51.25
55.00
53.75
3.13
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.80
0.98
4.00
2.47
5.63
3.50
7.88
0.43
2.30
0.00
0.00
0.00
0.00
0.00
0.00
0.00
7 1 78 17
4.15
0.00
0.00
0.15
2.27
16.36
54.37
67.96
58.07
3.58
0.89
0.20
0.52
0.59
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
0.00
5.44
0.00
3.09
3.26
7.17
9.39
8.65
3.09
0.45
7.41
3.18
0.00
2.05
0.00
There is one CONC per hour on the date provided in the first row. My ultimate goal will be to have the date repeat and add a column for hour. So the first bit should look like:
M D YY H CONC
7 1 78 1 15
7 1 78 2 0.00
7 1 78 3 0.15
7 1 78 4 1.06
7 1 78 5 1.21
7 1 78 6 10.91
7 1 78 7 34.55
7 1 78 8 69.09
I can read in the file using this:
monitor_datai <- read.table(file = file, header = TRUE, stringsAsFactors = FALSE, skip = 0, sep = "", fill = TRUE)
BUT the issue with that approach is that the first column gets filled with either the month (when the line contains a date) or the concentration (when it does not), so the result looks something like this:
head(monitor_datai)
V1 V2 V3 V4
1 7.00 1 78 15
2 0.00 NA NA NA
3 0.15 NA NA NA
4 1.06 NA NA NA
5 1.21 NA NA NA
6 10.91 NA NA NA
So, I need help reading in the file and fixing the formatting.
Thanks!

Here is my approach, using the tools of the data.table package.
I was not sure what the values of H should become... just 1:128, a sequence by group, or something else? Please specify and I'll add it to the answer.
I included comments and intermediate results in the code below, so you can (hopefully) follow the steps and adjust where needed.
library( data.table )
#read the file as-is, complete lines, no separator
DT <- fread( "./temp/testfile.txt", sep = "", skip = 1, header = FALSE )
# head(DT)
# V1
# 1: 7 1 78 15
# 2: 0.00
# 3: 0.15
# 4: 1.06
# 5: 1.21
# 6: 10.91
#get column names from the file, store in a vector
colnames = names( fread( "./temp/testfile.txt", sep = " ", nrows = 1, header = TRUE ) )
#split the rows that contain a space into the four desired columns,
# using one or more consecutive spaces as the separator
DT[ grepl(" ", V1), (colnames) := tstrsplit( V1, "[ ]+", perl = TRUE ) ]
# V1 M D YY CONC
# 1: 7 1 78 15 7 1 78 15
# 2: 0.00 <NA> <NA> <NA> <NA>
# 3: 0.15 <NA> <NA> <NA> <NA>
# 4: 1.06 <NA> <NA> <NA> <NA>
# 5: 1.21 <NA> <NA> <NA> <NA>
# ---
# 124: 7.41 <NA> <NA> <NA> <NA>
# 125: 3.18 <NA> <NA> <NA> <NA>
# 126: 0.00 <NA> <NA> <NA> <NA>
# 127: 2.05 <NA> <NA> <NA> <NA>
# 128: 0.00 <NA> <NA> <NA> <NA>
#where CONC is.na, copy the value of V1
DT[ is.na( CONC ), CONC := V1 ]
# V1 M D YY CONC
# 1: 7 1 78 15 7 1 78 15
# 2: 0.00 <NA> <NA> <NA> 0.00
# 3: 0.15 <NA> <NA> <NA> 0.15
# 4: 1.06 <NA> <NA> <NA> 1.06
# 5: 1.21 <NA> <NA> <NA> 1.21
# ---
# 124: 7.41 <NA> <NA> <NA> 7.41
# 125: 3.18 <NA> <NA> <NA> 3.18
# 126: 0.00 <NA> <NA> <NA> 0.00
# 127: 2.05 <NA> <NA> <NA> 2.05
# 128: 0.00 <NA> <NA> <NA> 0.00
#now we can drop the V1-column
DT[, V1 := NULL]
#set all columns to the right (numeric) type
DT[, (names(DT)) := lapply( .SD, as.numeric ) ]
#and fill down the missing values of M, D and YY
setnafill( DT, type = "locf", cols = c("M", "D", "YY") )
# M D YY CONC
# 1: 7 1 78 15.00
# 2: 7 1 78 0.00
# 3: 7 1 78 0.15
# 4: 7 1 78 1.06
# 5: 7 1 78 1.21
# ---
# 124: 7 1 78 7.41
# 125: 7 1 78 3.18
# 126: 7 1 78 0.00
# 127: 7 1 78 2.05
# 128: 7 1 78 0.00
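The OP has not yet specified what H should be; if it should simply number the observations within each date (1, 2, 3, ... per M/D/YY group), a minimal sketch continuing from DT above could be:
#add an hour counter per date group (assumes H is just the row position within each date)
DT[, H := seq_len(.N), by = .(M, D, YY)]
#put the columns in the order shown in the question
setcolorder(DT, c("M", "D", "YY", "H", "CONC"))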

Related

Calculating cumsum of mortality before a specific date in R

I need help writing either a for loop or dplyr code for two things:
Calculating the cumsum of mortality (%), grouped on Unit, in a defined period (7 days) before a treatment.
Making a vector that counts the days since the last treatment until the next treatment.
The data set looks like this:
Unit Date Prcent_daily.mortality Date.treatment
A 20.07.2020 0.2 NA
A 21.07.2020 0 NA
A 22.07.2020 0.4 NA
A 23.07.2020 0.3 NA
A 24.07.2020 0.6 NA
A 25.07.2020 0.05 NA
A 26.07.2020 0 NA
A 27.07.2020 0 NA
A 28.07.2020 0.01 28.07.2020
A 29.07.2020 0.1 NA
A 30.07.2020 0.2 NA
A 31.07.2020 0 NA
A 01.08.2020 0.2 NA
A 02.08.2020 0.3 NA
A 03.08.2020 0.3 NA
A 04.08.2020 0.05 NA
A 05.08.2020 0 NA
A 06.08.2020 0 NA
A 07.08.2020 0.01 05.08.2020
A 08.08.2020 0.1 NA
A 09.08.2020 0.2 NA
And I want to achieve this:
Unit Date Prcent_daily.mortality Date.treatment akkum.7dbt days.post.treatment
A 20.07.2020 0.2 NA NA NA
A 21.07.2020 0 NA 1.35 NA
A 22.07.2020 0.4 NA 1.35 NA
A 23.07.2020 0.3 NA 1.35 NA
A 24.07.2020 0.6 NA 1.35 NA
A 25.07.2020 0.05 NA 1.35 NA
A 26.07.2020 0 NA 1.35 NA
A 27.07.2020 0 NA 1.35 NA
A 28.07.2020 0.01 28.07.2020 1.35 0
A 29.07.2020 0.1 NA NA 1
A 30.07.2020 0.2 NA NA 2
A 31.07.2020 0 NA 0.85 3
A 01.08.2020 0.2 NA 0.85 4
A 02.08.2020 0.3 NA 0.85 5
A 03.08.2020 0.3 NA 0.85 6
A 04.08.2020 0.05 NA 0.85 7
A 05.08.2020 0 NA 0.85 8
A 06.08.2020 0 NA 0.85 9
A 07.08.2020 0.01 05.08.2020 0.85 0
A 08.08.2020 0.1 NA NA 1
A 09.08.2020 0.2 NA NA 2
Thanks for all help from a self-taught R amateur.
Try this solution which combines base R and dplyr:
library(dplyr)
library(tidyr)
#Create empty col for index
i1 <- which(!is.na(df$Date.treatment))
i2 <- i1-7
i1 <- i1-1
i3 <- 1:length(i1)
#Create index for second var
j1 <- which(!is.na(df$Date.treatment))
j2 <- 1:length(j1)
# i3 <- i1+1
df$Var <- NA
df$Var[i1]<-i3
df$Var[i2]<-i3
df$Var[1] <- 0
df$Var <- ifelse(!is.na(df$Date.treatment),0,df$Var)
#Fill
df %>% fill(Var) -> df1
#Create aggregations
df1 %>% filter(Var!=0) %>% group_by(Var) %>% mutate(Cum=cumsum(Prcent_daily.mortality)) %>%
filter(Cum==max(Cum)) %>% filter(!duplicated(Cum)) %>% ungroup() %>% select(c(Unit,Cum)) -> Ag1
#Create another var
df$Var2 <- NA
df$Var2[j1] <- j2
df$Var2[1] <- 0
#Fill
df %>% fill(Var2) -> df2
#Create cums and days
df2 %>% group_by(Unit,Var2) %>% mutate(Day=(1:n())-1) %>% ungroup() %>% select(-c(Var2)) -> df3
#Empty var for cums
df3$Cum <- NA
df3$Cum[i1+1] <- Ag1$Cum
#Fill 2
df3 %>% fill(Cum,.direction = 'up') -> df4
#Some adjusts
df4$Day[1:i1[1]]<-NA
df4$Cum[1] <- NA
df4$Cum <- ifelse((df4$Day==1 | df4$Day==2) & !is.na(df4$Day),NA,df4$Cum)
This will produce:
Unit Date Prcent_daily.mortality Date.treatment Var Day Cum
1 A 20.07.2020 0.20 <NA> 0 NA NA
2 A 21.07.2020 0.00 <NA> 1 NA 1.35
3 A 22.07.2020 0.40 <NA> NA NA 1.35
4 A 23.07.2020 0.30 <NA> NA NA 1.35
5 A 24.07.2020 0.60 <NA> NA NA 1.35
6 A 25.07.2020 0.05 <NA> NA NA 1.35
7 A 26.07.2020 0.00 <NA> NA NA 1.35
8 A 27.07.2020 0.00 <NA> 1 NA 1.35
9 A 28.07.2020 0.01 28.07.2020 0 0 1.35
10 A 29.07.2020 0.10 <NA> NA 1 NA
11 A 30.07.2020 0.20 <NA> NA 2 NA
12 A 31.07.2020 0.00 <NA> 2 3 0.85
13 A 01.08.2020 0.20 <NA> NA 4 0.85
14 A 02.08.2020 0.30 <NA> NA 5 0.85
15 A 03.08.2020 0.30 <NA> NA 6 0.85
16 A 04.08.2020 0.05 <NA> NA 7 0.85
17 A 05.08.2020 0.00 <NA> NA 8 0.85
18 A 06.08.2020 0.00 <NA> 2 9 0.85
19 A 07.08.2020 0.01 05.08.2020 0 0 0.85
20 A 08.08.2020 0.10 <NA> NA 1 NA
21 A 09.08.2020 0.20 <NA> NA 2 NA
Update: working on df4, you can get the cumsum of Prcent_daily.mortality with the following code:
#You can work with df4 to complete the rest of the aggregations
#First create a duplicate var
df4$DateD <- df4$Date.treatment
#Now fill and mutate
df4 %>% fill(DateD) -> df4
#Create index for replacement
k <- df4$Date.treatment==df4$DateD & !is.na(df4$Date.treatment)
#Assign a value for aggregations not considered
df4$DateD[k]<-'NULL'
#Cumsum
df4 %>% group_by(DateD) %>% mutate(CumAfter=cumsum(Prcent_daily.mortality)) -> df4
#Now remove redundant values in the cum and drop the reference var
df4 %>% ungroup() %>% mutate(CumAfter=ifelse(is.na(DateD) | DateD=='NULL',NA,CumAfter)) %>%
select(-DateD) -> df4
The output will be:
Unit Date Prcent_daily.mortality Date.treatment Var Day Cum CumAfter
1 A 20.07.2020 0.20 <NA> 0 NA NA NA
2 A 21.07.2020 0.00 <NA> 1 NA 1.35 NA
3 A 22.07.2020 0.40 <NA> NA NA 1.35 NA
4 A 23.07.2020 0.30 <NA> NA NA 1.35 NA
5 A 24.07.2020 0.60 <NA> NA NA 1.35 NA
6 A 25.07.2020 0.05 <NA> NA NA 1.35 NA
7 A 26.07.2020 0.00 <NA> NA NA 1.35 NA
8 A 27.07.2020 0.00 <NA> 1 NA 1.35 NA
9 A 28.07.2020 0.01 28.07.2020 0 0 1.35 NA
10 A 29.07.2020 0.10 <NA> NA 1 NA 0.10
11 A 30.07.2020 0.20 <NA> NA 2 NA 0.30
12 A 31.07.2020 0.00 <NA> 2 3 0.85 0.30
13 A 01.08.2020 0.20 <NA> NA 4 0.85 0.50
14 A 02.08.2020 0.30 <NA> NA 5 0.85 0.80
15 A 03.08.2020 0.30 <NA> NA 6 0.85 1.10
16 A 04.08.2020 0.05 <NA> NA 7 0.85 1.15
17 A 05.08.2020 0.00 <NA> NA 8 0.85 1.15
18 A 06.08.2020 0.00 <NA> 2 9 0.85 1.15
19 A 07.08.2020 0.01 05.08.2020 0 0 0.85 NA
20 A 08.08.2020 0.10 <NA> NA 1 NA 0.10
21 A 09.08.2020 0.20 <NA> NA 2 NA 0.30
A data.table solution. Although you are looking for a dplyr solution, I just want to share this one.
The idea is to create flags marking the 7 (or 8?) days before a treatment and the days after a treatment.
library(data.table)
odt <- fread('Unit Date Prcent_daily.mortality Date.treatment
A 20.07.2020 0.2 NA
A 21.07.2020 0 NA
A 22.07.2020 0.4 NA
A 23.07.2020 0.3 NA
A 24.07.2020 0.6 NA
A 25.07.2020 0.05 NA
A 26.07.2020 0 NA
A 27.07.2020 0 NA
A 28.07.2020 0.01 28.07.2020
A 29.07.2020 0.1 NA
A 30.07.2020 0.2 NA
A 31.07.2020 0 NA
A 01.08.2020 0.2 NA
A 02.08.2020 0.3 NA
A 03.08.2020 0.3 NA
A 04.08.2020 0.05 NA
A 05.08.2020 0 NA
A 06.08.2020 0 NA
A 07.08.2020 0.01 05.08.2020
A 08.08.2020 0.1 NA
A 09.08.2020 0.2 NA')
#create group flags
odt[,postgrp:=cumsum(!is.na(Date.treatment)),by=.(Unit)]
odt[,pregrp:= c(if (postgrp-1 < 0) 0 else postgrp-1,rep(postgrp,.N-1)),by=.(Unit,postgrp)]
treat_date <- odt[,.I[!is.na(Date.treatment)]]
pre7_date <- unlist(Map(seq,treat_date-7,treat_date))
odt[!pre7_date,pregrp:=NA][]
#> Unit Date Prcent_daily.mortality Date.treatment postgrp pregrp
#> 1: A 20.07.2020 0.20 <NA> 0 NA
#> 2: A 21.07.2020 0.00 <NA> 0 0
#> 3: A 22.07.2020 0.40 <NA> 0 0
#> 4: A 23.07.2020 0.30 <NA> 0 0
#> 5: A 24.07.2020 0.60 <NA> 0 0
#> 6: A 25.07.2020 0.05 <NA> 0 0
#> 7: A 26.07.2020 0.00 <NA> 0 0
#> 8: A 27.07.2020 0.00 <NA> 0 0
#> 9: A 28.07.2020 0.01 28.07.2020 1 0
#> 10: A 29.07.2020 0.10 <NA> 1 NA
#> 11: A 30.07.2020 0.20 <NA> 1 NA
#> 12: A 31.07.2020 0.00 <NA> 1 1
#> 13: A 01.08.2020 0.20 <NA> 1 1
#> 14: A 02.08.2020 0.30 <NA> 1 1
#> 15: A 03.08.2020 0.30 <NA> 1 1
#> 16: A 04.08.2020 0.05 <NA> 1 1
#> 17: A 05.08.2020 0.00 <NA> 1 1
#> 18: A 06.08.2020 0.00 <NA> 1 1
#> 19: A 07.08.2020 0.01 05.08.2020 2 1
#> 20: A 08.08.2020 0.10 <NA> 2 NA
#> 21: A 09.08.2020 0.20 <NA> 2 NA
#> Unit Date Prcent_daily.mortality Date.treatment postgrp pregrp
#calculation
odt[!is.na(pregrp),akkum.7dbty:=sum(tail(Prcent_daily.mortality[-.N],7)),by=.(Unit,pregrp)]
odt[postgrp!=0,days.post.treatment:= 0:(.N-1),by=.(Unit,postgrp)]
#result
odt[,c("postgrp","pregrp"):=NULL][]
#> Unit Date Prcent_daily.mortality Date.treatment akkum.7dbty
#> 1: A 20.07.2020 0.20 <NA> NA
#> 2: A 21.07.2020 0.00 <NA> 1.35
#> 3: A 22.07.2020 0.40 <NA> 1.35
#> 4: A 23.07.2020 0.30 <NA> 1.35
#> 5: A 24.07.2020 0.60 <NA> 1.35
#> 6: A 25.07.2020 0.05 <NA> 1.35
#> 7: A 26.07.2020 0.00 <NA> 1.35
#> 8: A 27.07.2020 0.00 <NA> 1.35
#> 9: A 28.07.2020 0.01 28.07.2020 1.35
#> 10: A 29.07.2020 0.10 <NA> NA
#> 11: A 30.07.2020 0.20 <NA> NA
#> 12: A 31.07.2020 0.00 <NA> 0.85
#> 13: A 01.08.2020 0.20 <NA> 0.85
#> 14: A 02.08.2020 0.30 <NA> 0.85
#> 15: A 03.08.2020 0.30 <NA> 0.85
#> 16: A 04.08.2020 0.05 <NA> 0.85
#> 17: A 05.08.2020 0.00 <NA> 0.85
#> 18: A 06.08.2020 0.00 <NA> 0.85
#> 19: A 07.08.2020 0.01 05.08.2020 0.85
#> 20: A 08.08.2020 0.10 <NA> NA
#> 21: A 09.08.2020 0.20 <NA> NA
#> Unit Date Prcent_daily.mortality Date.treatment akkum.7dbty
#> days.post.treatment
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA
#> 7: NA
#> 8: NA
#> 9: 0
#> 10: 1
#> 11: 2
#> 12: 3
#> 13: 4
#> 14: 5
#> 15: 6
#> 16: 7
#> 17: 8
#> 18: 9
#> 19: 0
#> 20: 1
#> 21: 2
#> days.post.treatment
Created on 2020-07-21 by the reprex package (v0.3.0)

Fill NA by mean values corresponding to unique in R

I want to fill the NA rows in a data.table with the mean of the values for each unique value of another column. Please see the intended output. How can I achieve this in R? I prefer data.table output.
data2 <- data.table(Plan=c(11,11,11,11,91,91,91,91), Price=c(4.4,4.4,4.4,NA,3.22,3.22,3.22,NA), factor=c(0.17,0.17,0.17,NA,0.15,0.15,0.15,NA), Type=c(4,4,4,4,3,3,3,3))
data2
Plan Price factor Type
1: 11 4.40 0.17 4
2: 11 4.40 0.17 4
3: 11 4.40 0.17 4
4: 11 NA NA 4
5: 91 3.22 0.15 3
6: 91 3.22 0.15 3
7: 91 3.22 0.15 3
8: 91 NA NA 3
Output
Plan Price factor Type
1: 11 4.40 0.17 4
2: 11 4.40 0.17 4
3: 11 4.40 0.17 4
4: 11 4.40 0.17 4
5: 91 3.22 0.15 3
6: 91 3.22 0.15 3
7: 91 3.22 0.15 3
8: 91 3.22 0.15 3
We can use na.locf, grouped by 'Plan', to replace the NAs with the preceding non-NA values:
library(zoo)
data2[, factor := na.locf(factor), by = Plan]
If we need the mean instead, use na.aggregate:
data2[, factor := na.aggregate(factor), by = Plan]
For multiple columns
nm1 <- c("Price", "factor")
data2[, (nm1) := lapply(.SD, na.aggregate), by = Plan, .SDcols = nm1]
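If you would rather avoid the zoo dependency, a rough data.table-only sketch of the same idea (replace each NA with the per-Plan mean) could be:
#data.table-only sketch: replace NAs with the group mean, per Plan
data2[, (nm1) := lapply(.SD, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))),
      by = Plan, .SDcols = nm1]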

r data.table use row n-1 value to calculate row n value [duplicate]

library(quantmod)
library(PerformanceAnalytics)
getSymbols("YHOO",src="google")
stock_dat=data.table(PerformanceAnalytics::
CalculateReturns(Cl(YHOO)[1:10],'discrete'))
stock_dat[,Price:=0]
stock_dat[1,2]=Cl(YHOO)[1]
stock_dat[,D:=(1+YHOO.Close)*shift(Price,1)]
The above code generates the below result:
stock_dat
YHOO.Close Price D
1: NA 25.61 NA
2: 0.048418586 0.00 26.85
3: 0.033147114 0.00 0.00
4: 0.006488825 0.00 0.00
5: -0.012177650 0.00 0.00
6: 0.040609137 0.00 0.00
7: 0.017421603 0.00 0.00
8: 0.008561644 0.00 0.00
9: -0.005432937 0.00 0.00
10: -0.008193923 0.00 0.00
YHOO.Close is assumed to be simulated returns, and I need to back out the prices from that, using the first price as the base. Ideally the above code would carry the computed price in D forward from row 3 onwards.
nrowsDF <- nrow(stock_dat)
for(i in 2:nrowsDF){
stock_dat[i,2]=(1+stock_dat[i,1,with=FALSE])*stock_dat[i-1,2,with=FALSE]
}
The above code solves the problem, but I am looking for a more efficient way to do it, as I have to repeat this for over 5000 simulated return series.
Below is the answer I actually need:
stock_dat
YHOO.Close Price
1: NA 25.61
2: 0.048418586 26.85
3: 0.033147114 27.74
4: 0.006488825 27.92
5: -0.012177650 27.58
6: 0.040609137 28.70
7: 0.017421603 29.20
8: 0.008561644 29.45
9: -0.005432937 29.29
10: -0.008193923 29.05
You can use the cumulative product like this:
DT <- fread(" YHOO.Close Price D
NA 25.61 NA
0.048418586 0.00 26.85
0.033147114 0.00 0.00
0.006488825 0.00 0.00
-0.012177650 0.00 0.00
0.040609137 0.00 0.00
0.017421603 0.00 0.00
0.008561644 0.00 0.00
-0.005432937 0.00 0.00
-0.008193923 0.00 0.00")
DT[, res := Price[1] * c(1, cumprod(1 + YHOO.Close[-1]))]
# YHOO.Close Price D res
# 1: NA 25.61 NA 25.61
# 2: 0.048418586 0.00 26.85 26.85
# 3: 0.033147114 0.00 0.00 27.74
# 4: 0.006488825 0.00 0.00 27.92
# 5: -0.012177650 0.00 0.00 27.58
# 6: 0.040609137 0.00 0.00 28.70
# 7: 0.017421603 0.00 0.00 29.20
# 8: 0.008561644 0.00 0.00 29.45
# 9: -0.005432937 0.00 0.00 29.29
#10: -0.008193923 0.00 0.00 29.05
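Since the question mentions over 5000 simulated return series, the same cumprod recipe can be applied column-wise; a sketch, assuming the return series sit as columns of a data.table ret_dt and all start from the same base price p0 (both names are made up here):
#apply the cumprod recipe to every return column at once (ret_dt and p0 are hypothetical)
p0 <- 25.61
ret_cols <- names(ret_dt)
ret_dt[, paste0(ret_cols, "_price") := lapply(.SD, function(r)
  p0 * c(1, cumprod(1 + r[-1]))), .SDcols = ret_cols]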

Error : missing value where TRUE/FALSE needed

WEEK PRICE QUANTITY SALE_PRICE TYPE
1 4992 5.99 2847.50 0.00 3
2 4995 3.33 36759.00 3.33 3
3 4996 5.99 2517.00 0.00 3
4 4997 5.49 2858.50 0.00 3
5 5001 3.33 32425.00 3.33 3
6 5002 5.49 4205.50 0.00 3
7 5004 5.99 4329.50 0.00 3
8 5006 2.74 55811.00 2.74 3
9 5007 5.49 4133.00 0.00 3
10 5008 5.99 4074.00 0.00 3
11 5009 3.99 12125.25 3.99 3
12 5017 2.74 77645.00 2.74 3
13 5018 5.49 5315.50 0.00 3
14 5020 2.74 78699.00 2.74 3
15 5021 5.49 5158.50 0.00 3
16 5023 5.99 5315.00 0.00 3
17 5024 5.49 6545.00 0.00 3
18 5025 3.33 63418.00 3.33 3
If there are consecutive entries with a sale price of 0, then I want to keep only the last of them. For example, I want to remove week 4996 and keep week 4997, and keep week 5004 but remove week 5002. Similarly, I want to delete weeks 5021 and 5023 and keep week 5024.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)). Create a grouping variable with rleid based on a logical vector of the presence of 0 in 'SALE_PRICE' (!SALE_PRICE). Using 'grp' as the grouping variable, we take the last row of the Subset of Data.table (.SD[.N]) if the 'SALE_PRICE' elements are all 0, or else take .SD, i.e. the full rows for that group.
library(data.table)
setDT(df1)[, grp:= rleid(!SALE_PRICE)
][,if(all(!SALE_PRICE)) .SD[.N] else .SD , grp
][, grp := NULL][]
# WEEK PRICE QUANTITY SALE_PRICE TYPE
# 1: 4992 5.99 2847.50 0.00 3
# 2: 4995 3.33 36759.00 3.33 3
# 3: 4997 5.49 2858.50 0.00 3
# 4: 5001 3.33 32425.00 3.33 3
# 5: 5004 5.99 4329.50 0.00 3
# 6: 5006 2.74 55811.00 2.74 3
# 7: 5008 5.99 4074.00 0.00 3
# 8: 5009 3.99 12125.25 3.99 3
# 9: 5017 2.74 77645.00 2.74 3
#10: 5018 5.49 5315.50 0.00 3
#11: 5020 2.74 78699.00 2.74 3
#12: 5024 5.49 6545.00 0.00 3
#13: 5025 3.33 63418.00 3.33 3
Or an option using dplyr: create a grouping variable with diff and cumsum, then filter to keep only the last row within each run where 'SALE_PRICE' is 0, or (|) the rows where 'SALE_PRICE' is not 0.
library(dplyr)
df1 %>%
group_by(grp = cumsum(c(TRUE,diff(!SALE_PRICE)!=0))) %>%
filter( !duplicated(!SALE_PRICE, fromLast=TRUE)|SALE_PRICE!=0) %>%
select(-grp)
# grp WEEK PRICE QUANTITY SALE_PRICE TYPE
# (int) (int) (dbl) (dbl) (dbl) (int)
#1 1 4992 5.99 2847.50 0.00 3
#2 2 4995 3.33 36759.00 3.33 3
#3 3 4997 5.49 2858.50 0.00 3
#4 4 5001 3.33 32425.00 3.33 3
#5 5 5004 5.99 4329.50 0.00 3
#6 6 5006 2.74 55811.00 2.74 3
#7 7 5008 5.99 4074.00 0.00 3
#8 8 5009 3.99 12125.25 3.99 3
#9 8 5017 2.74 77645.00 2.74 3
#10 9 5018 5.49 5315.50 0.00 3
#11 10 5020 2.74 78699.00 2.74 3
#12 11 5024 5.49 6545.00 0.00 3
#13 12 5025 3.33 63418.00 3.33 3

Weighted row average in time series join

Hello, I'm looking for the cleanest/fastest way to solve the following problem.
My setup looks like this:
library(data.table)
set.seed(1234)
DT1 <- data.table(replicate(12,runif(5)))
setnames(DT1,LETTERS[1:12])
DT1[,time:=100]
DT2 <- data.table(time=rep(100,12), grp=rep(c("X","Y","Z"),each=4),
sub=LETTERS[1:12], weight=sample(1:100,12))
options(digits=2)
DT1
A B C D E F G H I J K L time
1: 0.11 0.6403 0.69 0.84 0.32 0.811 0.46 0.76 0.55 0.50 0.074 0.50 100
2: 0.62 0.0095 0.54 0.29 0.30 0.526 0.27 0.20 0.65 0.68 0.310 0.49 100
3: 0.61 0.2326 0.28 0.27 0.16 0.915 0.30 0.26 0.31 0.48 0.717 0.75 100
4: 0.62 0.6661 0.92 0.19 0.04 0.831 0.51 0.99 0.62 0.24 0.505 0.17 100
5: 0.86 0.5143 0.29 0.23 0.22 0.046 0.18 0.81 0.33 0.77 0.153 0.85 100
> DT2
time grp sub weight
1: 100 X A 87
2: 100 X B 5
3: 100 X C 32
4: 100 X D 2
5: 100 Y E 23
6: 100 Y F 68
7: 100 Y G 29
8: 100 Y H 48
9: 100 Z I 99
10: 100 Z J 52
11: 100 Z K 11
12: 100 Z L 80
I want to compute a weighted average (per row) of the columns of DT1 by referencing the groups, subclasses & weights from DT2, while joining per time point.
E.g. DT1 then gets columns X, Y & Z bound to it, so in this case column X of the first row is (87*0.11 + 5*0.64 + 32*0.69 + 2*0.84) / (87 + 5 + 32 + 2).
There are millions of rows in DT1 with different time points, though, so memory might be a limiting factor.
Any advice would be much appreciated!
Something like this perhaps:
library(reshape2)
setkey(DT2, time, sub)
DT2[melt(DT1, id.var = 'time')[, row := 1:.N, by = list(time, variable)]][,
sum(weight * value) / sum(weight), by = list(time, grp, row)]
# time grp row V1
# 1: 100 X 1 0.29
# 2: 100 X 2 0.57
# 3: 100 X 3 0.51
# 4: 100 X 4 0.69
# 5: 100 X 5 0.69
# 6: 100 Y 1 0.67
# 7: 100 Y 2 0.36
# 8: 100 Y 3 0.52
# 9: 100 Y 4 0.71
#10: 100 Y 5 0.31
#11: 100 Z 1 0.50
#12: 100 Z 2 0.59
#13: 100 Z 3 0.51
#14: 100 Z 4 0.39
#15: 100 Z 5 0.59
You can also reshape the above result if you like:
# assuming you called the above table "res"
dcast.data.table(res, row + time ~ grp)
#Using 'V1' as value column. Use 'value.var' to override
# row time X Y Z
#1: 1 100 0.29 0.67 0.50
#2: 2 100 0.57 0.36 0.59
#3: 3 100 0.51 0.52 0.51
#4: 4 100 0.69 0.71 0.39
#5: 5 100 0.69 0.31 0.59
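If you want the X/Y/Z averages bound back onto DT1 as described in the question, one possible follow-up (a sketch; it assumes the long result above is called res, and numbers the rows of DT1 within each time point to match the melt step):
#sketch: attach the group averages back onto DT1 (res is the long result above)
res_wide <- dcast.data.table(res, time + row ~ grp, value.var = "V1")
DT1[, row := seq_len(.N), by = time]
DT1_out <- res_wide[DT1, on = c("time", "row")][, row := NULL][]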
