Replacing zeros and NAs with a recursive value - r

I'm trying to replace NA and zero values recursively. I'm working on time series data where an NA or zero is best replaced with the value from the previous week (one measurement every 15 min, so 672 steps back). My data contains about two years of 15-minute values, so it is a large set. Few NAs or zeros are expected, and runs of adjacent zeros or NAs longer than 672 are not expected either.
I found this thread (recursive replacement in R) where a recursive approach is shown, and adapted it to my problem.
load[is.na(load)] <- 0
o <- rle(load)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 672]
newload<-inverse.rle(o)
Now, is this the best or most elegant method?
And how can I protect my code from errors when a zero occurs within the first 672 values?
I'm used to MATLAB, where I would do something like:
% Replace NaN with 0
Load(isnan(Load))=0;
% Find zero values
Ind=find(Load==0);
for f=Ind
if f>672
fprintf('Replacing index %d with the load 1 week ago\n', f)
% Replace zero with previous week value
Load(f)=Load(f-672);
end
end
As I'm not familiar with R, how would I set up such an if/else loop?
A reproducible example (I changed the code because the example from the other thread didn't cope with adjacent zeros):
day<-1:24
load<-rep(day, times=10)
load[50:54]<-0
load[112:115]<-NA
load[is.na(load)] <- 0
load[load==0]<-load[which(load == 0) - 24]
This restores the original load vector, without zeros and NAs.
When a zero exists within the first 24 values, this goes wrong because there is no earlier value to replace it with:
loadtest[c(10,50:54)]<-0 # instead of load[50:54]<-0 gives:
Error in loadtest[which(loadtest == 0) - 24] :
only 0's may be mixed with negative subscripts
To work around this an if/else statement could be used, but I don't know how to apply it. Something like:
day<-1:24
loadtest<-rep(day, times=10)
loadtest[c(10,50:54)]<-0
loadtest[112:115]<-NA
loadtest[is.na(loadtest)] <- 0
if(INDEX(loadtest[loadtest==0])<24) {
# nothing / mean / standard value
} else {
loadtest[loadtest==0]<-loadtest[which(loadtest == 0) - 24]
}
Of course, INDEX isn't valid code.

You can use this example:
set.seed(42)
x <- sample(c(0,1,2,3,NA), 100, T)
stepback <- 6
x_old <- x
x_new <- x_old
repeat{
filter <- x_new==0 | is.na(x_new)
x_new[filter] <- c(rep(NA, stepback), head(x_new, -stepback))[filter]
if(identical(x_old,x_new)) break
x_old <- x_new
}
x
x_new
Result:
> x
[1] NA NA 1 NA 3 2 3 0 3 3 2 3 NA 1 2 NA NA 0 2 2 NA 0 NA NA 0
[26] 2 1 NA 2 NA 3 NA 1 3 0 NA 0 1 NA 3 1 2 0 NA 2 NA NA 3 NA 3
[51] 1 1 1 3 0 3 3 0 1 2 3 NA 3 2 NA 0 1 NA 3 1 0 0 1 2 0
[76] 3 0 1 2 0 2 0 1 3 3 2 1 0 0 1 3 0 1 NA NA 3 1 2 3 3
> x_new
[1] NA NA 1 NA 3 2 3 NA 3 3 2 3 3 1 2 3 2 3 2 2 2 3 2 3 2
[26] 2 1 3 2 3 3 2 1 3 2 3 3 1 1 3 1 2 3 1 2 3 1 3 3 3
[51] 1 1 1 3 3 3 3 1 1 2 3 3 3 2 1 2 1 3 3 1 1 2 1 2 3
[76] 3 1 1 2 2 2 3 1 3 3 2 1 3 1 1 3 2 1 3 1 3 1 2 3 3
Note that some values are still NA, because there is no prior information to use for them. If your data has sufficient prior information, this will not happen.
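For the loop-style guard asked about in the question (skip any position that has no value one period earlier), a minimal base R sketch follows. It uses the question's toy data with a period of 24; the names `lag` and `idx` are introduced here for illustration only. Walking the indices in ascending order means a replaced value can itself serve as the source for a later position.

```r
# Toy data from the question: a zero in the first period, a run of zeros, a run of NAs
loadtest <- rep(1:24, times = 10)
loadtest[c(10, 50:54)] <- 0
loadtest[112:115] <- NA

lag <- 24  # assumed period; this would be 672 for weekly 15-minute data

# Positions that need replacing, restricted to those with a value one period back
idx <- which(is.na(loadtest) | loadtest == 0)
idx <- idx[idx > lag]

# Ascending order, so earlier replacements are visible to later ones
for (i in idx) {
  loadtest[i] <- loadtest[i - lag]
}
```

Position 10 keeps its zero because nothing precedes it by a full period; it could be set to a mean or a standard value separately.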

One option would be to wrap your vector into a matrix with 672 rows:
load2 <- matrix(load, nrow=672)
Then apply the last observation carried forward (either from zoo, or the method above, or ...) to each row of the matrix:
load3 <- apply( load2, 1, locf.function )
Then take the resulting matrix back to a vector with the correct length:
load4 <- t(load3)[ seq_along(load) ]
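The matrix idea can be sketched end to end in base R. This is an illustration under assumptions: the period is 24 rather than 672, `locf` is a hypothetical helper standing in for `locf.function`, and the vector is padded with NA so that its length is a whole number of periods.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only)
locf <- function(v) {
  idx <- cumsum(!is.na(v))      # position of the latest non-NA value seen so far
  c(NA, v[!is.na(v)])[idx + 1]  # leading NAs stay NA
}

period <- 24                    # assumed period; 672 in the question
load <- rep(1:24, times = 10)
load[50:54] <- 0
load[112:115] <- NA
load[which(load == 0)] <- NA    # treat zeros as missing

n <- length(load)
padded <- c(load, rep(NA, (-n) %% period))  # pad to a full number of periods
m <- matrix(padded, nrow = period)          # one column per period
filled <- t(apply(m, 1, locf))              # fill each row across periods
newload <- as.vector(filled)[seq_len(n)]    # back to a vector of the original length
```

Each row of the matrix holds the same time slot across successive periods, so carrying values forward along a row is the same as looking one period back in the original vector.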

Related

Remove/replace values in a list using the previous number in R [duplicate]

I am trying to replace all the zeros with the previous number from a list.
The list is something like this:
x <- c(3,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,3,0,0,0,1,0,2,0)
I already tried the replace function:
replace (x, x==0, first(x))
[1] 3 3 3 3 3 1 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 3 2 3
But this replaces every zero with the first value of the list (3), so the 2s and 1s are ignored.
I also tried:
replace (x, x==0, x)
[1] 3 3 0 0 0 1 0 1 0 0 0 2 0 0 2 0 3 0 0 0 1 3 2 0
You can use approx after replacing all zeros with NA:
approx(replace(x, x == 0, NA),
xout = 1:length(x), method = "constant", f = 0, rule = 2)$y
# [1] 3 3 3 3 3 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 1 1 2 2
Another option, which you could modify:
fill = function(x){
ave(x, cumsum(x != 0), FUN = function(y) y[pmax(1, cumsum(y != 0))])
}
fill(x)
# [1] 3 3 3 3 3 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 1 1 2 2
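The idea behind these answers can also be written as a plain index lookup. The sketch below is a base R illustration, not one of the posted answers; the names `nonzero` and `filled` are introduced here. The running count of non-zero values tells each position which non-zero value it should take.

```r
x <- c(3,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,3,0,0,0,1,0,2,0)

nonzero <- x != 0
# cumsum(nonzero) is the index of the latest non-zero value at each position;
# the extra leading 0 covers zeros that come before the first non-zero value
filled <- c(0, x[nonzero])[cumsum(nonzero) + 1]
filled
```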

sapply function(x) where x is subsetted argument

I want to generate a new vector from the information in two existing numeric vectors: one sets the id of the participant, the other gives the observation number. Each participant has been observed a different number of times.
The new vector should be 0 when obs_no = 1, 1 when obs_no is the last observation for that id, and NA for the cases in between.
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
I figure I could do this separately for every id using code like this:
new_vector <- c(0, rep(NA, times=length(obs_no[id==1])-2), 1)
Or I could use max(), but that wouldn't make any difference.
Adding each participant manually is really inconvenient since I have a lot of cases, and I can't figure out how to write a generic function. I tried to define a function(x) using sapply but can't get it to work, since x sits inside the subsetting brackets.
Any advice would be helpful. Thanks.
ave to the rescue:
dat$newvar <- NA
dat$newvar <- with(dat,
ave(newvar, id, FUN=function(x) replace(x, c(length(x),1), c(1,0)) )
)
Or use a bit of duplicated() fun:
dat$newvar <- NA
dat$newvar[!duplicated(dat$id, fromLast=TRUE)] <- 1
dat$newvar[!duplicated(dat$id)] <- 0
Both giving:
# id obs_no new_vector newvar
#1 1 1 0 0
#2 1 2 NA NA
#3 1 3 NA NA
#4 1 4 NA NA
#5 1 5 1 1
#6 2 1 0 0
#7 2 2 1 1
#8 3 1 0 0
#9 3 2 NA NA
#10 3 3 1 1
You can also do this with dplyr
str <- "
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
"
dt <- read.table(textConnection(str), header = T)
library(dplyr)
dt %>% group_by(id) %>%
mutate(newvar = if_else(obs_no==1,0L,if_else(obs_no==max(obs_no),1L,as.integer(NA))))
We can use data.table
library(data.table)
i1 <- setDT(df1)[, .I[seq_len(.N) %in% c(1, .N)], id]$V1
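# rep() makes the recycling explicit (recent data.table versions do not recycle a length-2 RHS);
# this assumes every id has at least two rows
df1[i1, newvar := rep(c(0, 1), length.out = length(i1))]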
df1
# id obs_no new_vector newvar
# 1: 1 1 0 0
# 2: 1 2 NA NA
# 3: 1 3 NA NA
# 4: 1 4 NA NA
# 5: 1 5 1 1
# 6: 2 1 0 0
# 7: 2 2 1 1
# 8: 3 1 0 0
# 9: 3 2 NA NA
#10: 3 3 1 1
Use split:
result = lapply(split(obs_no, id), function (x) c(0, rep(NA, length(x) - 2), 1))
This gives you a list of vectors. You can paste them back together like this:
do.call(c, result)
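The rule from the question (0 for the first observation, 1 for the last, NA in between) can also be written directly against obs_no. This is a minimal base R sketch that rebuilds the sample data itself; the names `dat` and `last_obs` are assumptions for illustration, and it relies on obs_no starting at 1 within each id.

```r
# Sample data from the question
dat <- data.frame(
  id     = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  obs_no = c(1, 2, 3, 4, 5, 1, 2, 1, 2, 3)
)

# Largest observation number per id, repeated on every row of that id
last_obs <- ave(dat$obs_no, dat$id, FUN = max)

# 0 for the first observation, 1 for the last, NA in between
dat$newvar <- ifelse(dat$obs_no == 1, 0,
                     ifelse(dat$obs_no == last_obs, 1, NA))
dat
```

An id with a single observation gets 0 here, because the first-observation test is checked first.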

Referencing previous value in cell (R)

I have the following data.frame:
head(data.c)
mark high_mark mark_cum
5 0 0
7 1 1
7 1 2
NA 0 2
7 1 3
7 1 4
Because there are NAs, I need to construct an additional column with a normal sequence counting up from 1. However, where mark is NA, the cell has to take the previous value of the sequence. So it must look like this:
mark high_mark mark_cum mark_seq
5 0 0 1
7 1 1 2
7 1 2 3
NA 0 2 3
7 1 3 4
7 1 4 5
NA 0 4 5
1) cumsum This solution uses the fact that each mark_seq element equals the cumulative number of non-NA elements in mark at that point.
transform(data.c, mark_seq = cumsum(!is.na(mark)))
giving:
mark high_mark mark_cum mark_seq
1 5 0 0 1
2 7 1 1 2
3 7 1 2 3
4 NA 0 2 3
5 7 1 3 4
6 7 1 4 5
7 NA 0 4 5
Here data.c was created from the Lines input shown in the Note at the end:
data.c <- read.table(text = Lines, header = TRUE)
2) na.locf Here is a second solution using seq_along and na.locf (from zoo). It creates a sequence the same length as the number of non-NA elements in mark and uses replace to put them in the spots where the non-NA elements exist. Then na.locf is used to fill in the NAs with the prior values.
library(zoo)
# na.rm = FALSE keeps a leading NA in place so the column length still matches
transform(data.c, mark_seq = na.locf(replace(mark, !is.na(mark), seq_along(na.omit(mark))), na.rm = FALSE))
3) mark_cum It was not stated in the question how the input column mark_cum is constructed but in the sample output in the question the mark_seq column equals the mark_cum column plus 1 so if that is always the case then an easy solution is:
transform(data.c, mark_seq = mark_cum + 1)
Note: We used this as the input:
Lines <- "mark high_mark mark_cum
5 0 0
7 1 1
7 1 2
NA 0 2
7 1 3
7 1 4
NA 0 4"

Create a counting variable which I can use to group my unemployment data in R

I have the data below, where I created the variable "B" with this code:
index <- which(Count$unemploymentduration ==1)
Count$B[index]<-1:length(index)
ID unemployment B
1 1 1
1 2 NA
1 3 NA
1 4 NA
2 1 2
2 2 NA
2 0 NA
2 1 3
2 2 NA
2 3 NA
2 4 NA
2 5 NA
I want my data to look like this, and I have no real idea how to get there.
I thought of an "if" function but have never used one in R.
ID unemployment B
1 1 1
1 2 1
1 3 1
1 4 1
2 1 2
2 2 2
2 0 2
2 1 3
2 2 3
2 3 3
2 4 3
2 5 3
Could someone help me out?
We can use na.locf from library(zoo):
library(zoo)
# na.rm = FALSE keeps a leading NA in place so the column length still matches
Count$B <- na.locf(Count$B, na.rm = FALSE)
But B can also be created directly, without using an 'index':
Count$B <- cumsum(Count$unemployment==1)
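A self-contained check of the cumsum approach against the table in the question, as a sketch: the data frame is rebuilt by hand here, using the column name `unemployment` as in the answer.

```r
# Data from the question, without the hand-made B column
Count <- data.frame(
  ID           = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  unemployment = c(1, 2, 3, 4, 1, 2, 0, 1, 2, 3, 4, 5)
)

# Every row where unemployment == 1 starts a new spell;
# the running count of those starts labels each spell
Count$B <- cumsum(Count$unemployment == 1)
Count
```

If the data did not begin with a row where unemployment is 1, the rows before the first spell would get B = 0.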

Exclude a Specific Value from a Unique Value Counter

I am trying to count how many different responses a person gives during a trial of an experiment, but there is a catch.
There are supposed to be 6 possible responses (1,2,3,4,5,6) BUT sometimes 0 is recorded as a response (it's a glitch / flaw in design).
I need to count the number of different responses they give, BUT ONLY counting unique values within the range 1-6. This helps us calculate their accuracy.
Is there a way to exclude the value 0 from contributing to a unique value counter? Any other work-arounds?
Currently I am trying this method below, but it includes 0, NA, and I think any other entry in a cell in the Unique Value Counter Column (I have named "Span6"), which makes me sad.
# My Span6 calculator:
ASixImageTrials <- data.frame(eSOPT_831$T8.RESP, eSOPT_831$T9.RESP, eSOPT_831$T10.RESP, eSOPT_831$T11.RESP, eSOPT_831$T12.RESP, eSOPT_831$T13.RESP)
ASixImageTrials$Span6 = apply(ASixImageTrials, 1, function(x) length(unique(x)))
Use na.omit inside unique and sum the resulting logical vector, as below:
df$res = apply(df, 1, function(x) sum(unique(na.omit(x)) > 0))
df
Output:
X1 X2 X3 X4 X5 res
1 2 1 1 2 1 2
2 3 0 1 1 2 3
3 3 NA 1 1 3 2
4 3 3 3 4 NA 2
5 1 1 0 NA 3 2
6 3 NA NA 1 1 2
7 2 0 2 3 0 2
8 0 2 2 2 1 2
9 3 2 3 0 NA 2
10 0 2 3 2 2 2
11 2 2 1 2 1 2
12 0 2 2 2 NA 1
13 0 1 4 3 2 4
14 2 2 1 1 NA 2
15 3 NA 2 2 NA 2
16 2 2 NA 3 NA 2
17 2 3 2 2 2 2
18 2 NA 3 2 2 2
19 NA 4 5 1 3 4
20 3 1 2 1 NA 3
Data:
set.seed(752)
mat <- matrix(rbinom(100, 10, .2), nrow = 20)
mat[sample(1:100, 15)] = NA
data.frame(mat) -> df
df$res = apply(df, 1, function(x) sum(unique(na.omit(x)) > 0))
Could you edit your question and clarify why this doesn't solve your problem?
# here is a numeric vector with a bunch of numbers
mtcars$carb
# here is how to limit that vector to only 1-6
mtcars$carb[ mtcars$carb %in% 1:6 ]
# here is how to tabulate that result
table( mtcars$carb[ mtcars$carb %in% 1:6 ] )
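Combining the two answers, the %in% 1:6 filter can be applied per row before counting unique values. The sketch below uses a small made-up data frame (`responses` and its columns are hypothetical, not the asker's eSOPT_831 data), and `count_valid` is a helper named here for illustration.

```r
# Hypothetical responses: one row per trial, one column per response slot
responses <- data.frame(
  r1 = c(1, 3, 0, 6),
  r2 = c(2, 3, 4, NA),
  r3 = c(2, 0, 4, 6),
  r4 = c(5, NA, 0, 7)
)

# Count distinct values in 1:6 only; 0, NA and out-of-range values are dropped,
# because %in% returns FALSE for NA
count_valid <- function(x) length(unique(x[x %in% 1:6]))

responses$Span6 <- apply(responses, 1, count_valid)
responses
```

Unlike a test for values greater than 0, this also excludes anything above 6 should such a value ever be recorded.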

Resources