I have a dataframe, DF, looking like this:
date permno ret sue bm gpa
1 202001 10000 0.01 0.4 0.4 NA
2 202002 10000 0.04 NA NA 0.5
3 202003 10000 -0.01 NA NA NA
4 202004 10000 0.00 1.3 0.5 NA
5 202005 10000 0.02 NA NA 0.3
6 202006 10000 0.01 NA NA NA
7 202007 10000 0.03 NA NA NA
8 202008 10000 -0.02 NA NA 0.4
9 202001 11000 0.05 0.1 0.3 NA
10 202002 11000 0.02 NA NA NA
11 202003 11000 0.01 NA NA NA
12 202004 11000 0.00 NA NA 0.3
13 202005 11000 0.01 NA NA NA
14 202006 11000 -0.01 NA NA NA
15 202007 11000 0.04 0.5 0.4 NA
16 202008 11000 0.30 NA NA NA
I am using this code to forward fill the variables sue, bm and gpa:
library(zoo)
library(dplyr)

DF1 <- DF %>%
  arrange(permno, date) %>%
  group_by(permno) %>%
  mutate_at(vars(c(sue, bm, gpa)), funs(na.locf(., na.rm = FALSE)))
This results in the following:
date permno ret sue bm gpa
1 202001 10000 0.01 0.4 0.4 NA
2 202002 10000 0.04 0.4 0.4 0.5
3 202003 10000 -0.01 0.4 0.4 0.5
4 202004 10000 0 1.3 0.5 0.5
5 202005 10000 0.02 1.3 0.5 0.3
6 202006 10000 0.01 1.3 0.5 0.3
7 202007 10000 0.03 1.3 0.5 0.3
8 202008 10000 -0.02 1.3 0.5 0.4
9 202001 11000 0.05 0.1 0.3 NA
10 202002 11000 0.02 0.1 0.3 NA
11 202003 11000 0.01 0.1 0.3 NA
12 202004 11000 0 0.1 0.3 0.3
13 202005 11000 0.01 0.1 0.3 0.3
14 202006 11000 -0.01 0.1 0.3 0.3
15 202007 11000 0.04 0.5 0.4 0.3
16 202008 11000 0.3 0.5 0.4 0.3
I want to put a limit on how many months the data are forward filled: fill each of the three variables until the next available value, but for at most 3 months. Thus, the results should look like this:
date permno ret sue bm gpa
1 202001 10000 0.01 0.4 0.4 NA
2 202002 10000 0.04 0.4 0.4 0.5
3 202003 10000 -0.01 0.4 0.4 0.5
4 202004 10000 0.00 1.3 0.5 0.5
5 202005 10000 0.02 1.3 0.5 0.3
6 202006 10000 0.01 1.3 0.5 0.3
7 202007 10000 0.03 1.3 0.5 0.3
8 202008 10000 -0.02 NA NA 0.4
9 202001 11000 0.05 0.1 0.3 NA
10 202002 11000 0.02 0.1 0.3 NA
11 202003 11000 0.01 0.1 0.3 NA
12 202004 11000 0.00 0.1 0.3 0.3
13 202005 11000 0.01 NA NA 0.3
14 202006 11000 -0.01 NA NA 0.3
15 202007 11000 0.04 0.5 0.4 0.3
16 202008 11000 0.30 0.5 0.4 NA
Does anyone know how I can do this in R?
We can write our own version of na.locf() that makes the adjustment you want:
Code
library(zoo)
library(dplyr)
na.locf2 <- function(object, period = 3, ...) {
  # run lengths of NA / non-NA stretches
  tmp1 <- rle(is.na(object))
  # position of each element within its run
  tmp2 <- unlist(lapply(tmp1$lengths, seq_len))
  # carry the last observation forward over all NAs
  tmp3 <- na.locf(object, ...)
  # blank out filled values sitting more than `period` positions into an NA run
  # (the is.na(object) guard keeps long runs of real values untouched)
  tmp3[tmp2 > period & is.na(object)] <- NA
  # return
  tmp3
}
# Then
DF %>%
  arrange(permno, date) %>%
  group_by(permno) %>%
  mutate(across(c(sue, bm, gpa), ~ na.locf2(.x, na.rm = FALSE)))
# Yields
# A tibble: 16 x 6
# Groups: permno [2]
# date permno ret sue bm gpa
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 202001 10000 0.01 0.4 0.4 NA
# 2 202002 10000 0.04 0.4 0.4 0.5
# 3 202003 10000 -0.01 0.4 0.4 0.5
# 4 202004 10000 0 1.3 0.5 0.5
# 5 202005 10000 0.02 1.3 0.5 0.3
# 6 202006 10000 0.01 1.3 0.5 0.3
# 7 202007 10000 0.03 1.3 0.5 0.3
# 8 202008 10000 -0.02 NA NA 0.4
# 9 202001 11000 0.05 0.1 0.3 NA
# 10 202002 11000 0.02 0.1 0.3 NA
# 11 202003 11000 0.01 0.1 0.3 NA
# 12 202004 11000 0 0.1 0.3 0.3
# 13 202005 11000 0.01 NA NA 0.3
# 14 202006 11000 -0.01 NA NA 0.3
# 15 202007 11000 0.04 0.5 0.4 0.3
# 16 202008 11000 0.3 0.5 0.4 NA
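As a quick sanity check of the function on a bare vector (with the default period = 3, only the first three NAs after a value are filled):
na.locf2(c(1, NA, NA, NA, NA), na.rm = FALSE)
#> [1]  1  1  1  1 NA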
This sounds like a rolling-window thing. Since you need to limit the carry-forward, though, one problem is that when you look at a particular cell, its predecessor has already been fixed (un-NA'd), so we need to look at the vector in reverse.
A helper function, where 2:4 reflects your limit of at most three months. In the context of a reversed rollapply, z[1] is the current value (possibly NA) and z[2:4] are the three preceding months.
func <- function(z) if (is.na(z[1])) na.omit(z[2:4])[1] else z[1]
Since we're using partial=TRUE in the rolling, it is feasible that z will not be length-4; this is fine, though, since even z[100000] will return NA, which is explicitly handled by na.omit. (This condition can also be addressed with a test on length(z).)
Another note: na.omit(.) can return a length-0 vector, which is obviously bad in this context. However, adding [1] after it forces it to return an NA, which is what we need it to be if no non-NA elements exist within z[2:4].
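To see func in isolation, a quick illustration on made-up windows:
func(c(NA, NA, 0.4, 0.3))  # current value NA, filled from a prior month: 0.4
func(c(NA, NA, NA, NA))    # no value within three months: stays NA
func(c(1.3, 0.4, NA, NA))  # current value present: kept as 1.3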
From here, most of the work is done by zoo::rollapply:
DF %>%
  arrange(permno, date) %>%
  group_by(permno) %>%
  mutate(across(sue:gpa,
                ~ rev(zoo::rollapply(rev(.), 4, align = "left", FUN = func, partial = TRUE)))) %>%
  ungroup()
# # A tibble: 16 x 6
# date permno ret sue bm gpa
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 202001 10000 0.01 0.4 0.4 NA
# 2 202002 10000 0.04 0.4 0.4 0.5
# 3 202003 10000 -0.01 0.4 0.4 0.5
# 4 202004 10000 0 1.3 0.5 0.5
# 5 202005 10000 0.02 1.3 0.5 0.3
# 6 202006 10000 0.01 1.3 0.5 0.3
# 7 202007 10000 0.03 1.3 0.5 0.3
# 8 202008 10000 -0.02 NA NA 0.4
# 9 202001 11000 0.05 0.1 0.3 NA
# 10 202002 11000 0.02 0.1 0.3 NA
# 11 202003 11000 0.01 0.1 0.3 NA
# 12 202004 11000 0 0.1 0.3 0.3
# 13 202005 11000 0.01 NA NA 0.3
# 14 202006 11000 -0.01 NA NA 0.3
# 15 202007 11000 0.04 0.5 0.4 0.3
# 16 202008 11000 0.3 0.5 0.4 NA
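For completeness: if a recent vctrs is available, vctrs::vec_fill_missing() exposes a max_fill argument that caps the carry-forward directly. A sketch, assuming a vctrs version that has max_fill:
library(dplyr)
DF %>%
  arrange(permno, date) %>%
  group_by(permno) %>%
  mutate(across(sue:gpa, ~ vctrs::vec_fill_missing(.x, direction = "down", max_fill = 3))) %>%
  ungroup()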
What if I want to use "$" to get the sub-elements of a list in dplyr pipe?
I called boxplot.stats() on every combination of circumstances (4 factors: value, metallic, bumpiness, reference):
match.raw %>%
  select(-id) %>%
  group_by(value, metallic, bumpiness, reference) %>%
  group_split() %>%
  map(select, diff) %>%
  map(as.matrix) %>%
  map(boxplot.stats)
This gives a list of results from the boxplot.stats() calls. I want to extract the "out" component from every result, just like:
boxplot.stats(x)$out
Here is my data:
> head(match.raw, 20)
id value metallic bumpiness reference match diff
1 1 1.0 1.0 0.5 0.7 0.74 0.04
2 1 1.0 1.0 0.5 0.9 0.88 -0.02
3 1 1.0 0.0 0.5 0.3 0.30 0.00
4 1 0.0 0.5 0.5 0.3 0.32 0.02
5 1 1.0 0.0 0.5 0.7 0.46 -0.24
6 1 0.0 1.0 0.5 0.3 0.28 -0.02
7 1 0.0 1.0 0.5 0.7 0.72 0.02
8 1 1.0 0.5 0.5 0.5 0.56 0.06
9 1 0.5 0.0 0.5 0.9 0.84 -0.06
10 1 0.0 0.5 0.5 0.5 0.54 0.04
11 1 0.5 1.0 0.5 0.1 0.10 0.00
12 1 0.5 0.5 0.5 0.9 0.96 0.06
13 1 1.0 0.0 0.5 0.1 0.00 -0.10
14 1 0.0 0.5 0.5 0.9 0.92 0.02
15 1 1.0 0.5 0.5 0.9 0.94 0.04
16 1 1.0 0.5 0.5 0.1 0.28 0.18
17 1 1.0 1.0 0.5 0.1 0.10 0.00
18 1 0.0 0.5 0.5 0.1 0.22 0.12
19 1 1.0 0.0 0.5 0.5 0.00 -0.50
20 1 0.5 0.5 0.5 0.7 0.78 0.08
Try this.
library(purrr); library(dplyr)
match.raw %>%
  select(-id) %>%
  group_by(value, metallic, bumpiness, reference) %>%
  group_split() %>%
  map(select, diff) %>%
  map(as.matrix) %>%
  map(boxplot.stats) %>%
  map(\(.) .$out) %>%
  unlist()
Note, though, that with your example data "out" is numeric(0) everywhere.
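As a side note, map() also accepts a component name as the extractor, so the anonymous function can be replaced with a string. A small self-contained illustration:
library(purrr)
stats <- list(boxplot.stats(c(rnorm(50), 10)), boxplot.stats(rnorm(50)))
map(stats, "out")   # same as map(stats, \(.) .$out)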
I have chromatographic data in a table organized by peak position and integration value of various samples. All samples in the table have a repeated measurement as well with a different sample log number.
What I'm interested in is the repeatability of the measurements of the various peaks; the measure for that is the difference in peak integration, which should be 0 for each sample.
The data
Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03
with Log1 is the original sample log number, and Log2 is the repeat log number.
How can I construct a new variable for every peak (being the difference PeakX_Log1 - PeakX_Log2)?
Note that my example has only 5 peaks. The real-life situation is a complex mixture involving >20 peaks, so it is very hard to do by hand.
If you will only have two values for each sample, something like this could work:
df <- data.table::fread(
"Sample Log1 Log2 Peak1 Peak2 Peak3 Peak4 Peak5
A 100 104 0.20 0.80 0.30 0.00 0.00
B 101 106 0.25 0.73 0.29 0.01 0.04
C 102 103 0.20 0.80 0.30 0.00 0.07
C 103 102 0.22 0.81 0.31 0.04 0.00
A 104 100 0.21 0.70 0.33 0.00 0.10
B 106 101 0.20 0.73 0.37 0.00 0.03"
)
library(tidyverse)
new_df <- df %>%
  mutate(Log = ifelse(Log1 < Log2, "Log1", "Log2")) %>%
  select(-Log1, -Log2) %>%
  pivot_longer(cols = starts_with("Peak"), names_to = "Peak") %>%
  pivot_wider(values_from = value, names_from = Log) %>%
  mutate(Variation = Log1 - Log2)
new_df
# A tibble: 15 × 5
Sample Peak Log1 Log2 Variation
<chr> <chr> <dbl> <dbl> <dbl>
1 A Peak1 0.2 0.21 -0.0100
2 A Peak2 0.8 0.7 0.100
3 A Peak3 0.3 0.33 -0.0300
4 A Peak4 0 0 0
5 A Peak5 0 0.1 -0.1
6 B Peak1 0.25 0.2 0.05
7 B Peak2 0.73 0.73 0
8 B Peak3 0.29 0.37 -0.08
9 B Peak4 0.01 0 0.01
10 B Peak5 0.04 0.03 0.01
11 C Peak1 0.2 0.22 -0.0200
12 C Peak2 0.8 0.81 -0.0100
13 C Peak3 0.3 0.31 -0.0100
14 C Peak4 0 0.04 -0.04
15 C Peak5 0.07 0 0.07
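Since the stated measure of repeatability is how close Variation stays to 0, a per-peak summary is easy to bolt on. A sketch (the choice of summary statistics is my own):
new_df %>%
  group_by(Peak) %>%
  summarise(mean_abs_diff = mean(abs(Variation)),
            max_abs_diff  = max(abs(Variation)))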
I need help writing either a for loop or dplyr code for two things:
Calculating the cumsum of mortality (%), grouped on Unit, in a defined period (7 days) before a treatment.
Making a vector that counts the days since the last treatment, until the next treatment.
The data set looks like this:
Unit Date Prcent_daily.mortality Date.treatment
A 20.07.2020 0.2 NA
A 21.07.2020 0 NA
A 22.07.2020 0.4 NA
A 23.07.2020 0.3 NA
A 24.07.2020 0.6 NA
A 25.07.2020 0.05 NA
A 26.07.2020 0 NA
A 27.07.2020 0 NA
A 28.07.2020 0.01 28.07.2020
A 29.07.2020 0.1 NA
A 30.07.2020 0.2 NA
A 31.07.2020 0 NA
A 01.08.2020 0.2 NA
A 02.08.2020 0.3 NA
A 03.08.2020 0.3 NA
A 04.08.2020 0.05 NA
A 05.08.2020 0 NA
A 06.08.2020 0 NA
A 07.08.2020 0.01 05.08.2020
A 08.08.2020 0.1 NA
A 09.08.2020 0.2 NA
And I want to achieve this:
Unit Date Prcent_daily.mortality Date.treatment akkum.7dbt days.post.treatment
A 20.07.2020 0.2 NA NA NA
A 21.07.2020 0 NA 1.35 NA
A 22.07.2020 0.4 NA 1.35 NA
A 23.07.2020 0.3 NA 1.35 NA
A 24.07.2020 0.6 NA 1.35 NA
A 25.07.2020 0.05 NA 1.35 NA
A 26.07.2020 0 NA 1.35 NA
A 27.07.2020 0 NA 1.35 NA
A 28.07.2020 0.01 28.07.2020 1.35 0
A 29.07.2020 0.1 NA NA 1
A 30.07.2020 0.2 NA NA 2
A 31.07.2020 0 NA 0.85 3
A 01.08.2020 0.2 NA 0.85 4
A 02.08.2020 0.3 NA 0.85 5
A 03.08.2020 0.3 NA 0.85 6
A 04.08.2020 0.05 NA 0.85 7
A 05.08.2020 0 NA 0.85 8
A 06.08.2020 0 NA 0.85 9
A 07.08.2020 0.01 05.08.2020 0.85 0
A 08.08.2020 0.1 NA NA 1
A 09.08.2020 0.2 NA NA 2
Thanks for all help, from a self-taught R amateur.
Try this solution, which combines base R and dplyr:
library(dplyr)
library(tidyr)
# Indices of treatment rows and the rows around them
i1 <- which(!is.na(df$Date.treatment))
i2 <- i1 - 7
i1 <- i1 - 1
i3 <- 1:length(i1)
# Indices for the second variable
j1 <- which(!is.na(df$Date.treatment))
j2 <- 1:length(j1)
# Empty column for the index
df$Var <- NA
df$Var[i1] <- i3
df$Var[i2] <- i3
df$Var[1] <- 0
df$Var <- ifelse(!is.na(df$Date.treatment), 0, df$Var)
# Fill
df %>% fill(Var) -> df1
# Aggregate: cumulative mortality in the 7 days before each treatment
df1 %>% filter(Var != 0) %>% group_by(Var) %>%
  mutate(Cum = cumsum(Prcent_daily.mortality)) %>%
  filter(Cum == max(Cum)) %>% filter(!duplicated(Cum)) %>%
  ungroup() %>% select(c(Unit, Cum)) -> Ag1
# Another index variable
df$Var2 <- NA
df$Var2[j1] <- j2
df$Var2[1] <- 0
# Fill
df %>% fill(Var2) -> df2
# Days counter per treatment block
df2 %>% group_by(Unit, Var2) %>% mutate(Day = (1:n()) - 1) %>%
  ungroup() %>% select(-c(Var2)) -> df3
# Empty variable for the cumulative sums
df3$Cum <- NA
df3$Cum[i1 + 1] <- Ag1$Cum
# Fill upward
df3 %>% fill(Cum, .direction = 'up') -> df4
# Some adjustments
df4$Day[1:i1[1]] <- NA
df4$Cum[1] <- NA
df4$Cum <- ifelse((df4$Day == 1 | df4$Day == 2) & !is.na(df4$Day), NA, df4$Cum)
This will produce:
Unit Date Prcent_daily.mortality Date.treatment Var Day Cum
1 A 20.07.2020 0.20 <NA> 0 NA NA
2 A 21.07.2020 0.00 <NA> 1 NA 1.35
3 A 22.07.2020 0.40 <NA> NA NA 1.35
4 A 23.07.2020 0.30 <NA> NA NA 1.35
5 A 24.07.2020 0.60 <NA> NA NA 1.35
6 A 25.07.2020 0.05 <NA> NA NA 1.35
7 A 26.07.2020 0.00 <NA> NA NA 1.35
8 A 27.07.2020 0.00 <NA> 1 NA 1.35
9 A 28.07.2020 0.01 28.07.2020 0 0 1.35
10 A 29.07.2020 0.10 <NA> NA 1 NA
11 A 30.07.2020 0.20 <NA> NA 2 NA
12 A 31.07.2020 0.00 <NA> 2 3 0.85
13 A 01.08.2020 0.20 <NA> NA 4 0.85
14 A 02.08.2020 0.30 <NA> NA 5 0.85
15 A 03.08.2020 0.30 <NA> NA 6 0.85
16 A 04.08.2020 0.05 <NA> NA 7 0.85
17 A 05.08.2020 0.00 <NA> NA 8 0.85
18 A 06.08.2020 0.00 <NA> 2 9 0.85
19 A 07.08.2020 0.01 05.08.2020 0 0 0.85
20 A 08.08.2020 0.10 <NA> NA 1 NA
21 A 09.08.2020 0.20 <NA> NA 2 NA
Update: working on df4, you can get the cumulative sum of Prcent_daily.mortality after each treatment with the following code:
# You can work with df4 to complete the rest of the aggregations
# First create a duplicate variable
df4$DateD <- df4$Date.treatment
# Now fill and mutate
df4 %>% fill(DateD) -> df4
# Index for replacement
k <- df4$Date.treatment == df4$DateD & !is.na(df4$Date.treatment)
# Assign a value for aggregations not considered
df4$DateD[k] <- 'NULL'
# Cumulative sum
df4 %>% group_by(DateD) %>% mutate(CumAfter = cumsum(Prcent_daily.mortality)) -> df4
# Remove redundant values in the cumulative sum and drop the reference variable
df4 %>% ungroup() %>%
  mutate(CumAfter = ifelse(is.na(DateD) | DateD == 'NULL', NA, CumAfter)) %>%
  select(-DateD) -> df4
The output will be:
Unit Date Prcent_daily.mortality Date.treatment Var Day Cum CumAfter
1 A 20.07.2020 0.20 <NA> 0 NA NA NA
2 A 21.07.2020 0.00 <NA> 1 NA 1.35 NA
3 A 22.07.2020 0.40 <NA> NA NA 1.35 NA
4 A 23.07.2020 0.30 <NA> NA NA 1.35 NA
5 A 24.07.2020 0.60 <NA> NA NA 1.35 NA
6 A 25.07.2020 0.05 <NA> NA NA 1.35 NA
7 A 26.07.2020 0.00 <NA> NA NA 1.35 NA
8 A 27.07.2020 0.00 <NA> 1 NA 1.35 NA
9 A 28.07.2020 0.01 28.07.2020 0 0 1.35 NA
10 A 29.07.2020 0.10 <NA> NA 1 NA 0.10
11 A 30.07.2020 0.20 <NA> NA 2 NA 0.30
12 A 31.07.2020 0.00 <NA> 2 3 0.85 0.30
13 A 01.08.2020 0.20 <NA> NA 4 0.85 0.50
14 A 02.08.2020 0.30 <NA> NA 5 0.85 0.80
15 A 03.08.2020 0.30 <NA> NA 6 0.85 1.10
16 A 04.08.2020 0.05 <NA> NA 7 0.85 1.15
17 A 05.08.2020 0.00 <NA> NA 8 0.85 1.15
18 A 06.08.2020 0.00 <NA> 2 9 0.85 1.15
19 A 07.08.2020 0.01 05.08.2020 0 0 0.85 NA
20 A 08.08.2020 0.10 <NA> NA 1 NA 0.10
21 A 09.08.2020 0.20 <NA> NA 2 NA 0.30
A data.table solution. Although you are looking for a dplyr solution, I just want to share this.
The idea is to create flags marking the 7 (or 8?) days before each treatment and the days after each treatment.
library(data.table)
odt <- fread('Unit Date Prcent_daily.mortality Date.treatment
A 20.07.2020 0.2 NA
A 21.07.2020 0 NA
A 22.07.2020 0.4 NA
A 23.07.2020 0.3 NA
A 24.07.2020 0.6 NA
A 25.07.2020 0.05 NA
A 26.07.2020 0 NA
A 27.07.2020 0 NA
A 28.07.2020 0.01 28.07.2020
A 29.07.2020 0.1 NA
A 30.07.2020 0.2 NA
A 31.07.2020 0 NA
A 01.08.2020 0.2 NA
A 02.08.2020 0.3 NA
A 03.08.2020 0.3 NA
A 04.08.2020 0.05 NA
A 05.08.2020 0 NA
A 06.08.2020 0 NA
A 07.08.2020 0.01 05.08.2020
A 08.08.2020 0.1 NA
A 09.08.2020 0.2 NA')
# create group flags
odt[, postgrp := cumsum(!is.na(Date.treatment)), by = .(Unit)]
odt[, pregrp := c(if (postgrp - 1 < 0) 0 else postgrp - 1, rep(postgrp, .N - 1)), by = .(Unit, postgrp)]
treat_date <- odt[, .I[!is.na(Date.treatment)]]
pre7_date <- unlist(Map(seq, treat_date - 7, treat_date))
odt[!pre7_date, pregrp := NA][]
#> Unit Date Prcent_daily.mortality Date.treatment postgrp pregrp
#> 1: A 20.07.2020 0.20 <NA> 0 NA
#> 2: A 21.07.2020 0.00 <NA> 0 0
#> 3: A 22.07.2020 0.40 <NA> 0 0
#> 4: A 23.07.2020 0.30 <NA> 0 0
#> 5: A 24.07.2020 0.60 <NA> 0 0
#> 6: A 25.07.2020 0.05 <NA> 0 0
#> 7: A 26.07.2020 0.00 <NA> 0 0
#> 8: A 27.07.2020 0.00 <NA> 0 0
#> 9: A 28.07.2020 0.01 28.07.2020 1 0
#> 10: A 29.07.2020 0.10 <NA> 1 NA
#> 11: A 30.07.2020 0.20 <NA> 1 NA
#> 12: A 31.07.2020 0.00 <NA> 1 1
#> 13: A 01.08.2020 0.20 <NA> 1 1
#> 14: A 02.08.2020 0.30 <NA> 1 1
#> 15: A 03.08.2020 0.30 <NA> 1 1
#> 16: A 04.08.2020 0.05 <NA> 1 1
#> 17: A 05.08.2020 0.00 <NA> 1 1
#> 18: A 06.08.2020 0.00 <NA> 1 1
#> 19: A 07.08.2020 0.01 05.08.2020 2 1
#> 20: A 08.08.2020 0.10 <NA> 2 NA
#> 21: A 09.08.2020 0.20 <NA> 2 NA
#> Unit Date Prcent_daily.mortality Date.treatment postgrp pregrp
# calculation
odt[!is.na(pregrp), akkum.7dbty := sum(tail(Prcent_daily.mortality[-.N], 7)), by = .(Unit, pregrp)]
odt[postgrp != 0, days.post.treatment := 0:(.N - 1), by = .(Unit, postgrp)]
# result
odt[, c("postgrp", "pregrp") := NULL][]
#>     Unit       Date Prcent_daily.mortality Date.treatment akkum.7dbty days.post.treatment
#>  1:    A 20.07.2020                   0.20           <NA>          NA                  NA
#>  2:    A 21.07.2020                   0.00           <NA>        1.35                  NA
#>  3:    A 22.07.2020                   0.40           <NA>        1.35                  NA
#>  4:    A 23.07.2020                   0.30           <NA>        1.35                  NA
#>  5:    A 24.07.2020                   0.60           <NA>        1.35                  NA
#>  6:    A 25.07.2020                   0.05           <NA>        1.35                  NA
#>  7:    A 26.07.2020                   0.00           <NA>        1.35                  NA
#>  8:    A 27.07.2020                   0.00           <NA>        1.35                  NA
#>  9:    A 28.07.2020                   0.01     28.07.2020        1.35                   0
#> 10:    A 29.07.2020                   0.10           <NA>          NA                   1
#> 11:    A 30.07.2020                   0.20           <NA>          NA                   2
#> 12:    A 31.07.2020                   0.00           <NA>        0.85                   3
#> 13:    A 01.08.2020                   0.20           <NA>        0.85                   4
#> 14:    A 02.08.2020                   0.30           <NA>        0.85                   5
#> 15:    A 03.08.2020                   0.30           <NA>        0.85                   6
#> 16:    A 04.08.2020                   0.05           <NA>        0.85                   7
#> 17:    A 05.08.2020                   0.00           <NA>        0.85                   8
#> 18:    A 06.08.2020                   0.00           <NA>        0.85                   9
#> 19:    A 07.08.2020                   0.01     05.08.2020        0.85                   0
#> 20:    A 08.08.2020                   0.10           <NA>          NA                   1
#> 21:    A 09.08.2020                   0.20           <NA>          NA                   2
Created on 2020-07-21 by the reprex package (v0.3.0)
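Since the question asked for dplyr as well, the days.post.treatment counter alone translates quite compactly. This is a sketch mirroring the postgrp flag above, assuming df holds the data shown in the question:
library(dplyr)
df %>%
  group_by(Unit) %>%
  mutate(grp = cumsum(!is.na(Date.treatment))) %>%   # 0 before the first treatment
  group_by(Unit, grp) %>%
  mutate(days.post.treatment =
           if (first(grp) == 0) NA_integer_ else row_number() - 1L) %>%
  ungroup() %>%
  select(-grp)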
There are two sensors, and the collected data should change with time. How can I identify where the data are stuck and replace them with data from the other sensor?
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
So the data looks like this:
d
a b c
1 0.1 0.05
2 0.2 0.20
3 0.3 0.30
4 0.4 0.40
5 0.5 0.40
6 0.6 0.40
7 0.7 0.40
8 0.8 0.40
9 0.9 0.40
10 1.0 0.40
11 1.1 0.40
12 1.2 0.40
13 1.3 0.40
14 1.4 0.40
15 1.5 0.40
16 1.6 0.40
17 1.7 0.40
18 1.8 0.40
19 1.9 0.40
20 2.0 0.40
21 2.1 0.40
22 2.2 2.20
23 2.3 2.30
24 2.4 2.40
Sensor c is stuck at 0.4 from time a4 to a21. Is there a quick way to identify the stuck stretch and replace it using data from sensor b?
The new column c_updated is what you want. I've created some helpful columns (c_previous and c_is_stuck) which you can remove if you want.
library(dplyr)
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d %>%
mutate(c_previous = lag(c, default = 0), # get previous measurement for sensor c
c_is_stuck = ifelse(c == c_previous, 1 ,0), # flag stuck for sensor c when current measurement is same as previous one
c_updated = ifelse(c_is_stuck == 1, b, c)) # if sensor c is stuck use measurement from sensor b
# a b c c_previous c_is_stuck c_updated
# 1 1 0.1 0.05 0.00 0 0.05
# 2 2 0.2 0.20 0.05 0 0.20
# 3 3 0.3 0.30 0.20 0 0.30
# 4 4 0.4 0.40 0.30 0 0.40
# 5 5 0.5 0.40 0.40 1 0.50
# 6 6 0.6 0.40 0.40 1 0.60
# 7 7 0.7 0.40 0.40 1 0.70
# 8 8 0.8 0.40 0.40 1 0.80
# 9 9 0.9 0.40 0.40 1 0.90
# 10 10 1.0 0.40 0.40 1 1.00
# 11 11 1.1 0.40 0.40 1 1.10
# 12 12 1.2 0.40 0.40 1 1.20
# 13 13 1.3 0.40 0.40 1 1.30
# 14 14 1.4 0.40 0.40 1 1.40
# 15 15 1.5 0.40 0.40 1 1.50
# 16 16 1.6 0.40 0.40 1 1.60
# 17 17 1.7 0.40 0.40 1 1.70
# 18 18 1.8 0.40 0.40 1 1.80
# 19 19 1.9 0.40 0.40 1 1.90
# 20 20 2.0 0.40 0.40 1 2.00
# 21 21 2.1 0.40 0.40 1 2.10
# 22 22 2.2 2.20 0.40 0 2.20
# 23 23 2.3 2.30 2.20 0 2.30
# 24 24 2.4 2.40 2.30 0 2.40
This is a pretty simple way. Duplicate the c column with an offset of 1 and check if the two values are identical. If so, take the value from b.
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d$d <- c(NA, d$c[1:23])
d$replaced <- ifelse(d$c == d$d, d$b, d$c)
a b c d replaced
1 1 0.1 0.05 NA NA
2 2 0.2 0.20 0.05 0.2
3 3 0.3 0.30 0.20 0.3
4 4 0.4 0.40 0.30 0.4
5 5 0.5 0.40 0.40 0.5
6 6 0.6 0.40 0.40 0.6
7 7 0.7 0.40 0.40 0.7
8 8 0.8 0.40 0.40 0.8
9 9 0.9 0.40 0.40 0.9
10 10 1.0 0.40 0.40 1.0
11 11 1.1 0.40 0.40 1.1
12 12 1.2 0.40 0.40 1.2
13 13 1.3 0.40 0.40 1.3
14 14 1.4 0.40 0.40 1.4
15 15 1.5 0.40 0.40 1.5
16 16 1.6 0.40 0.40 1.6
17 17 1.7 0.40 0.40 1.7
18 18 1.8 0.40 0.40 1.8
19 19 1.9 0.40 0.40 1.9
20 20 2.0 0.40 0.40 2.0
21 21 2.1 0.40 0.40 2.1
22 22 2.2 2.20 0.40 2.2
23 23 2.3 2.30 2.20 2.3
24 24 2.4 2.40 2.30 2.4
The solution below is about as basic as it gets, I think. No additional packages required. Cheers!
a<- c(1:24)
b<- seq(0.1,2.4,0.1)
c<- c(0.05,0.2,0.3,rep(0.4,18),2.2,2.3,2.4)
d<- data.frame(a,b,c)
d$diff.b <- c(NA, diff(d$b))
d$diff.c <- c(NA, diff(d$c))
stuck.index <- which(d$diff.c==0)
d[stuck.index, "c"] <- d[stuck.index, "b"]
# changing to original data frame format
d$diff.b <- NULL
d$diff.c <- NULL
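All three answers flag a value as stuck as soon as it repeats once, so a single legitimate repeat would also be replaced. A run-length variant intervenes only when the same reading persists for several samples; this is a sketch, with min_run = 3 as an assumed threshold, to be run on a fresh copy of d (the code above already overwrote d$c):
min_run <- 3                                   # assumed: 3+ identical readings count as stuck
r <- rle(d$c)
in_long_run <- rep(r$lengths >= min_run, r$lengths)
pos_in_run  <- sequence(r$lengths)             # position of each element within its run
stuck <- in_long_run & pos_in_run > 1          # keep the first reading of each run
d$c_fixed <- ifelse(stuck, d$b, d$c)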
The dataset consists of 3 character variables and 21 numeric variables in time-series format:
A, B, C, num_1, num_2, num_3, ..., num_21
Below is sample data:
Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10 Variable11
1 2005 A X 1AX 1 NA NA NA NA NA
1 2006 A X 1AX 2 0 0 0 0 0
1 2007 A X 1AX 3 0 0.02 0 0.04 0.06
1 2008 A X 1AX 4 0 0.03 0.04 0.09 0.13
1 2009 A X 1AX 5 0.01 0.05 0.08 0.16 0.27
1 2010 A X 1AX 6 0.01 0.04 0.15 0.29 0.42
1 2011 A X 1AX 7 0.01 0.03 0.12 0.26 0.4
1 2012 A X 1AX 8 0 0.04 0.15 0.3 0.44
1 2013 A X 1AX 9 0.01 0.04 0.1 0.17 0.27
1 2014 A X 1AX 10 0.01 0.05 0.1 0.14 0.21
1 2007 A Y 1AY 1 0.01 0.04 0.15 0.29 0.42
1 2008 A Y 1AY 2 0.01 0.03 0.12 0.26 0.4
1 2009 A Y 1AY 3 0 0.04 0.15 0.3 0.44
1 2010 A Y 1AY 4 0.01 0.04 0.1 0.17 0.27
1 2011 A Y 1AY 5 0 0.04 0.15 0.3 0.44
1 2012 A Y 1AY 6 0.01 0.04 0.1 0.17 0.27
1 1983 B Y 1BY 1 NA NA NA NA NA
1 1984 B Y 1BY 2 0 0.5 0 0 0
I have to get the minor (anti-)diagonal elements for each group of Variable1, Variable3 and Variable4.
Variable5 is the concatenation of Variable1, Variable3 and Variable4, and Variable6 counts the occurrences of each concatenated value.
newdata$variable5 <- apply(newdata, 1, function(x) paste0(toString(x[1]), toString(x[3]), toString(x[4])))
DT <- data.table(newdata)
DT[, variable6 := 1:.N, by = c("variable5")]
keys <- unique(unlist(DT$variable5))
for (i in 2:length(keys)) {
  data_1 <- subset(DT, merge = keys[i])
  result <- diag(data_1[nrow(data_1):1, ])
}
I am getting the error below:
Error in diag(data_1[nrow(data_1):1, ]) :
(list) object cannot be coerced to type 'double'
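The error occurs because diag() needs a matrix or vector, while data_1[nrow(data_1):1, ] is still a data.table, i.e. a list. Note also that subset(DT, merge = keys[i]) probably does not filter anything, since subset() has no merge argument. A sketch of one way around both issues (selecting all numeric columns is my assumption about what should enter the diagonal):
for (i in seq_along(keys)) {                              # adjust the range if group 1 should be skipped
  data_1 <- DT[variable5 == keys[i]]                      # proper data.table filter
  num_cols <- names(data_1)[sapply(data_1, is.numeric)]   # numeric columns only
  m <- as.matrix(data_1[, ..num_cols])                    # diag() needs a matrix
  result <- diag(m[nrow(m):1, , drop = FALSE])            # minor (anti-)diagonal
}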