R: Remove intervals by criteria with overlap tolerance - r

I am searching for a solution to create some mask, with which I can remove some data (e.g. rows in data.frame) depending on some criteria, e.g.:
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
mask <- a == 0
mask
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
In my actual problem this cut is too harsh; I would like a smoother transition. The idea: I want to include some zeros before the non-zeros, and also add some zeros after them. Simple approach: given this vector, I would like to switch every TRUE adjacent to a FALSE into a FALSE, which adds an overlapping tolerance region to the data. So instead of
a[!mask]
[1] 3 5 6 3 4 5 8 5
I would rather have something like
a[!mask]
[1] 0 3 5 6 3 0 0 4 5 8 5 0
or (increasing the size of the tolerance window)
a[!mask]
[1] 0 0 3 5 6 3 0 0 0 4 5 8 5 0 0
In the last case the three zeros in the middle arise because the tolerance regions from the left and from the right start to overlap. My question: does anyone have a good approach for writing a function that creates such a mask with overlapping tolerance?
[EDIT] It took me some time to realise the error in my initial question (thanks #tospig). In my initial post I got the number of zeros in the middle part completely wrong! So, for clarification: with a tolerance window of 1 there really should be two zeros in the middle: one from the right bunch of valid data and one from the left bunch. Sorry for the confusion!
So, despite the really cool approach from #tospig (which I will keep in mind), the solution from #agenis solves my problem perfectly!

I think I would go with a classic moving average of order 3, which simply expands the "non-zeros" by one to the left and one to the right. As simple as that. You will just have to decide what to do with the first and last points of your vector, which are turned into NA (in my example I make them zeros).
And you have your desired result (for a bigger tolerance window, take order 5 instead of 3):
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
library(forecast)
a.ma <- ma(a, 3)
a.ma[is.na(a.ma)] <- 0
mask <- a.ma == 0
a[!mask]
#### [1] 0 3 5 6 3 0 0 4 5 8 5 0
Then you can easily transform this piece of code into a function.
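The function version can also be sketched in base R, without the forecast dependency, using stats::filter as the moving window (the function name and the edge handling are my own; edges are treated as zeros, mirroring the NA replacement above):

```r
## Sketch: build a removal mask with a tolerance window, base R only.
## 'tol' is the number of zeros to keep on each side of a non-zero run.
mask_with_tolerance <- function(x, tol = 1) {
  keep <- x != 0
  ## centered window of width 2*tol+1: the result is non-zero wherever
  ## any neighbour within 'tol' positions is non-zero
  widened <- stats::filter(as.numeric(keep), rep(1, 2 * tol + 1), sides = 2)
  widened[is.na(widened)] <- 0   # edges: treat the missing window as zero
  widened == 0                   # TRUE = remove, like 'mask' in the question
}

a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
a[!mask_with_tolerance(a, 1)]
# [1] 0 3 5 6 3 0 0 4 5 8 5 0
```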
[EDIT] this method does not ensure the conservation of the total number of zeros (see additional comments to clarify the OP initial question)

We can try
library(data.table)
lst1 <- split(a[!mask],rleid(mask)[!mask])
c(0,unlist(Map(`c`, lst1, 0), use.names=FALSE))
#[1] 0 3 5 6 3 0 4 5 8 5 0
Or another option is
n <- 1
i1 <- !inverse.rle(within.list(rle(mask), {
  lengths[values] <- lengths[values] - n
  lengths[!values] <- lengths[!values] + n
}))
c(a[i1],0)
#[1] 0 3 5 6 3 0 4 5 8 5 0
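To follow what the rle manipulation does, it can help to inspect the run-length encoding of mask first (same a and mask as in the question):

```r
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
mask <- a == 0
r <- rle(mask)
r$values   # [1]  TRUE FALSE  TRUE FALSE  TRUE
r$lengths  # [1] 3 4 3 4 5
## shrinking each TRUE (zero) run by n and growing each FALSE (non-zero)
## run by n moves every run boundary outward, widening the kept regions
```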

Here's a solution that allows you to specify the tolerance. At the moment it doesn't 'overlap' zeros.
We can use a data.table structure (or a data.frame, but I like using data.table) and control how many zeros we want to keep between the set of positive numbers. We can specify any tolerance value, but if it's greater than a sequence of zeros, only the maximum number of consecutive zeroes will be returned.
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
library(data.table)
tolerance <- 1
dt <- data.table(id = seq(1, length(a), by = 1), a = a)
## subset all the 0s, with their 'ids' for joining back on
dt_zero <- dt[a == 0]
## get the positions where the difference between values is greater than one,
## and create groups based on their length
changed <- which(c(TRUE, diff(dt_zero$id) > 1))
dt_zero$grps <- rep(changed, diff(c(changed, nrow(dt_zero) + 1)))
## we only need the 'tolerance' number of zeros;
## if 'tolerance' is greater than the number of entries in a group,
## it will return NA rows
dt_zero <- dt_zero[ dt_zero[ order(id) , .I[c(1:tolerance)], by=grps ]$V1, ]
## join back onto original data.table,
## and subset only relevant results
dt_zero <- dt_zero[, .(id, a)][ dt , on = "id"][(is.na(a) & i.a > 0) | a == 0]
res <- dt_zero$i.a
res
# [1] 0 3 5 6 3 0 4 5 8 5 0
## try different tolerances
tolerance <- 2
...
# 0 0 3 5 6 3 0 0 4 5 8 5 0 0
tolerance <- 6
...
# 0 0 0 3 5 6 3 0 0 0 4 5 8 5 0 0 0 0 0

Related

Insert missing natural numbers in a vector

I have a data frame column
df$col1=(1,2,3,4,5,6,7,8,9,...,500000)
and a vector
vc<-c(1,2,4,5,7,8,10,...,499999)
If I compare the two vectors, the second vector has some missing values. How can I insert 0s in the missing values' places? E.g. I want the second vector to be
vc<-c(1,2,0,4,5,0,7,8,9,10,...,499999,0)
You could use match and replace (thanks to #RonakShah)
Input
vc <- c(1,2,4,5,7,8,10)
x <- 1:15
Result
out <- replace(tmp <- vc[match(x, vc)], is.na(tmp), 0L)
out
# [1] 1 2 0 4 5 0 7 8 0 10 0 0 0 0 0
You could try using the larger vector containing all values as a template, and then assign zero to any value which does not match to the second smaller vector:
v_out <- df$col1
v_out[!(v_out %in% vc)] <- 0
v_out
[1] 1 2 0 4 5 0 7 8 0 10
Data
df$col1 <- c(1:10)
vc <- c(1,2,4,5,7,8,10)
A more cryptic, but maybe faster one-liner alternative (using Tim's data):
`[<-`(numeric(max(df$col1)),vc,vc)
#[1] 1 2 0 4 5 0 7 8 0 10
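The backtick form is just functional notation for subscript assignment; an equivalent, more explicit spelling (variable names are mine) is:

```r
vc <- c(1, 2, 4, 5, 7, 8, 10)
v_out <- numeric(10)   # max(df$col1) zeros
v_out[vc] <- vc        # place each value at its own index
v_out
# [1] 1 2 0 4 5 0 7 8 0 10
```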

Reset cumulative summation whenever there is zero in the vector using R

I need to reset the cumulative summation whenever there is a zero in the vector. E.g. for the input c(0,1,0,0,0,1,1,1,0,1,1) I want the output 0,1,0,0,0,1,2,3,0,1,2. I checked multiple answers on resetting sequences using functions, but those solutions did not work here.
Numeric sequence with condition,
Create counter with multiple variables,
Comp. Efficent way of resetting sequence if condition met ( R ) are some which I referred.
I tried different combinations of cumsum, ave and getanID but can't seem to get the output I want.
perhaps something like this:
vec <- c(0,1,0,0,0,1,1,1,0,1,1)
library(data.table)
as.numeric(unlist(by(vec, rleid(vec), cumsum))) # or, as in Maurits Evers' answer, use `unname`, or `unlist(..., use.names = FALSE)`, instead of `as.numeric`
#output
0 1 0 0 0 1 2 3 0 1 2
rleid makes a run-length type id column:
rleid(vec)
#output
1 2 3 3 3 4 4 4 5 6 6
this is then used as a grouping variable
EDIT: as per suggestion of #digEmAll:
Note that this works only if vec contains only 0 and 1. To make it more generic you should use
rleid(vec > 0)
Here is a base R solution using split:
v <- c(0,1,0,0,0,1,1,1,0,1,1)
unname(unlist(lapply(split(v, cumsum(c(0, diff(v) != 0))), cumsum)))
# [1] 0 1 0 0 0 1 2 3 0 1 2
The idea is to split the vector into chunks based on 0s, and then calculate the cumsum per chunk.
Instead of unname(unlist(...)) you can also use unlist(..., use.names = F).
Another possible solution using ave and rle:
ave(v,inverse.rle(with(rle(v>0),list(values=seq_along(values),lengths=lengths))),FUN=cumsum)
[1] 0 1 0 0 0 1 2 3 0 1 2
Note that :
inverse.rle(with(rle(v>0),list(values=seq_along(values),lengths=lengths)))
is equal to :
data.table::rleid(v>0)
and they return the "ids" of the runs of consecutive zero/non-zero elements of v:
[1] 1 2 3 3 3 4 4 4 5 6 6
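For completeness, the same reset-on-zero cumsum can also be phrased as a grouped operation in dplyr, using the run id as the group (a sketch, assuming dplyr and data.table are installed):

```r
library(dplyr)
library(data.table)  # for rleid()

vec <- c(0,1,0,0,0,1,1,1,0,1,1)
tibble(v = vec) |>
  group_by(run = rleid(v > 0)) |>   # one group per zero/non-zero run
  mutate(cs = cumsum(v)) |>         # cumsum restarts in every group
  pull(cs)
# [1] 0 1 0 0 0 1 2 3 0 1 2
```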

Add index to runs of positive or negative values of certain length

I have a dataframe which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
It is basically 0, but when 5 data points in a row are all positive OR all negative, it should give a 1. The run must be unbroken (e.g. when the run is positive and a negative number appears, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried converting the whole data frame to a list (and looping over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df = data.frame(Value = sample(-9:9,1000,replace=T))
sign = sign(df$Value)
library(zoo)
rolling = rollmean(sign,k=5,fill=0,align="right")
df$B = as.numeric(abs(rolling) == 1)
I generated 1000 values with positive and negative sets.
Extract the sign of the values - this will be -1 for negative, 1 for positive and 0 for 0
Calculate the right aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 if all the values in a row are positive or negative (respectively)
Take the absolute value and store the comparison against 1. This is a logical vector that turns into 0s and 1s based on your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
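If you'd rather avoid the zoo dependency, the same right-aligned window can be sketched with base stats::filter as a running sum of the last 5 signs (my own variant of the idea above, shown on the question's example data):

```r
v <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
s <- sign(v)
## running sum of the current and previous 4 signs
roll <- stats::filter(s, rep(1, 5), sides = 1)
roll[is.na(roll)] <- 0            # first 4 positions have no full window
B <- as.numeric(abs(roll) == 5)   # |sum| = 5 means all five signs agreed
B
# [1] 0 0 0 0 1 1 1 0 0
```

Using an integer sum and comparing against 5 avoids floating-point issues a mean-based comparison against 1 could run into.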
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)) {
  if (length(unique(sign(dat$Value[(x-4):x]))) == 1) {
    dat$new_col[x] <- 1
  } else {
    dat$new_col[x] <- 0
  }
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x) {
  as.integer(seq_along(x) > 4)
})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
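To see the idiom at work, the grouping variable on its own looks like this (same Value column as above):

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
## a new group id starts wherever the sign changes
grp <- cumsum(c(0, diff(sign(Value)) != 0))
grp
# [1] 0 0 0 0 0 0 0 1 2
## within each group, seq_along(x) > 4 flags the 5th element onward
```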

How to extract all rows between start signal and end signal?

I have the following df and I would like to extract all rows based on the following start and end signals.
Start signal : When status changes from 1 to 0
End signal : When status changes from 0 to -1.
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
time status
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
6 6 0
7 7 -1
8 8 0
9 9 1
10 10 0
11 11 0
12 12 0
13 13 -1
14 14 0
Desire:
time status
4 4 0
5 5 0
6 6 0
10 10 0
11 11 0
12 12 0
Here's a possible solution using the data.table package. I'm first grouping by status == 1 appearances, then checking per group whether there is also a status == -1; if so, I subset the group from the second row up to the row before the -1.
library(data.table)
setDT(df)[, indx := cumsum(status == 1)]
df[, if(any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
# indx time status
# 1: 2 4 0
# 2: 2 5 0
# 3: 2 6 0
# 4: 3 10 0
# 5: 3 11 0
# 6: 3 12 0
We count start and end markers, then use those values and the cumulative sum of (start - end) to filter rows. The (cumsum(start) - cumsum(end) > 1) condition is a slight fiddle to avoid the cumulative counts being upset by row 2, which starts but doesn't end; otherwise row 14 would unwantedly be included.
require(dplyr)
df %>%
  mutate(start = (status == 1), end = (status == -1)) %>%
  filter(!start & !end & (cumsum(start) - cumsum(end) > 1)) %>%
  select(-start, -end)
# time status
# 1 4 0
# 2 5 0
# 3 6 0
# 4 10 0
# 5 11 0
# 6 12 0
A little ugly, but you can always just loop over the values and keep a flag for determining whether the element should be kept or not.
keepers <- rep(FALSE, nrow(df))
flag <- FALSE
for (i in 1:(nrow(df) - 1)) {
  if (df$status[i] == 1 && df$status[i+1] == 0) {
    flag <- TRUE
    next  # keep signal index FALSE
  }
  if (df$status[i] == -1 && df$status[i+1] == 0) {
    flag <- FALSE
    next  # keep signal index FALSE
  }
  keepers[i] <- flag
}
keepers[nrow(df)] <- flag  # set the last element to the final flag value
newdf <- df[keepers, ]     # subset based on the TRUE/FALSE values determined
Do you have some more data (or can you generate some more data you know the outcome of) to see if this/these generalize?
Two similar approaches:
library(stringr)
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
dfr <- rle(df$status)
# first approach
find_seq_str <- function() {
  str_locate_all(paste(gsub("-1", "x", dfr$values), collapse=""), "10")[[1]][,2]
}
df[as.vector(sapply(find_seq_str(),
                    function(n) {
                      i <- sum(dfr$lengths[1:(n-1)])
                      tail(i:(i+dfr$lengths[n]), -1)
                    })), ]
# second approach
find_seq_ts <- function() {
  which(apply(embed(dfr$values, 2), 1, function(x) all(x == c(0, 1))))
}
df[as.vector(sapply(find_seq_ts(),
                    function(n) {
                      i <- sum(dfr$lengths[1:(n)]) + 1
                      head(i:(i+dfr$lengths[n+1]), -1)
                    })), ]
Both approaches need a run length encoding of the status vector.
The first does a single character replacement for -1 so we can build an unambiguous, contiguous string, then uses str_locate to find the pairs that tell us where the target sequence starts, and finally rebuilds the ranges of zeros from the rle lengths.
If it needs to be base R I can try to whip up something with regexpr.
The second builds a paired matrix and compares for the same target sequence.
Caveats:
I did no benchmarking
Both create potentially big things if status is big.
I'm not completely positive it generalizes (hence my initial q).
David's is far more readable, maintainable & transferable code, but you get to deal with all the "goodness" that comes with using data.table ;-)
I wrapped the approaches in functions as they could potentially then be parameterized, but you could just as easily just assign the value to a variable or shove it into the sapply (ugh, tho).

Apply in R: recursive function that operates on its own previous result

How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't get to 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions and half of them do nothing, the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set)
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This excel function is displayed without the 5-absence lock for easy readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use of the previous element's value, you might want to check out Reduce, with the accumulate parameter set to TRUE.
Here's an example:
##define your function that takes two parameters
##these are the 'previous' and the 'current' elements
runSum <- function(sum, x){
  res = 0
  if (x == 1){
    res = sum + 1
  } else if (x == 0 & sum < 5){
    res = 0
  } else {
    res = sum
  }
  res
}
#lets look at the absent values from subject B
x = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x=x, f=runSum, accumulate=T)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) rle(x)$value[rle(x)$length >=5] )
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) 1 %in% rle(x)$value[rle(x)$length >=5] )
A B
FALSE TRUE
I also struggled through to a Reduce solution (but am second in priority to #kithpradhan):
ave(dfrm$Absent, dfrm$Student,
    FUN = function(XX)
      Reduce(function(x, y) if (x[1] >= 5) { y + x[1]   # locked at 5: keep the total
             } else { x[1]*y + y },                     # resets to 0 if y = 0
             XX, accumulate = TRUE))
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce <- function(f, x, maxiter = 50){
  i <- 1
  while (!identical(f(x), x) && i <= maxiter) { x <- f(x); i <- i + 1 }
  x
}
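As a toy usage (the helper and example are my own invention), ireduce can collapse adjacent duplicates until the vector stops changing:

```r
## ireduce as defined above (repeated so this snippet runs standalone)
ireduce <- function(f, x, maxiter = 50){
  i <- 1
  while (!identical(f(x), x) && i <= maxiter) { x <- f(x); i <- i + 1 }
  x
}

## drop any element equal to its left neighbour; iterating this reaches
## a fixed point with no adjacent duplicates left
dedup_step <- function(x) x[c(TRUE, diff(x) != 0)]
ireduce(dedup_step, c(1, 1, 2, 2, 2, 3, 1, 1))
# [1] 1 2 3 1
```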
