How to extract all rows between start signal and end signal? - r

I have the following df and I would like to extract all rows between the following start and end signals.
Start signal: when status changes from 1 to 0.
End signal: when status changes from 0 to -1.
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
time status
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
6 6 0
7 7 -1
8 8 0
9 9 1
10 10 0
11 11 0
12 12 0
13 13 -1
14 14 0
Desired output:
time status
4 4 0
5 5 0
6 6 0
10 10 0
11 11 0
12 12 0

Here's a possible solution using the data.table package. I'm basically first grouping by appearances of status == 1, then checking per group whether there was also a status == -1; if so, I subset the group from the second row up to the row just before the -1.
library(data.table)
setDT(df)[, indx := cumsum(status == 1)]
df[, if(any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
# indx time status
# 1: 2 4 0
# 2: 2 5 0
# 3: 2 6 0
# 4: 3 10 0
# 5: 3 11 0
# 6: 3 12 0
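If the output should match the desired data frame exactly, the indx grouping column can be dropped afterwards (a small follow-up sketch to the code above):
res <- df[, if (any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
res[, indx := NULL]  # drop the helper grouping column
res
#    time status
# 1:    4      0
# 2:    5      0
# 3:    6      0
# 4:   10      0
# 5:   11      0
# 6:   12      0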

We count start and end markers, then use the cumulative sums of those markers to filter rows. The (cumsum(start) - cumsum(end) > 1) is a slight fiddle to avoid the cumulative counts being upset by row 2, which starts but is never matched by an end; otherwise row 14 would be included unwantedly.
require(dplyr)
df %>%
  mutate(start = (status == 1), end = (status == -1)) %>%
  filter(!start & !end & (cumsum(start) - cumsum(end) > 1)) %>%
  select(-start, -end)
# time status
# 1 4 0
# 2 5 0
# 3 6 0
# 4 10 0
# 5 11 0
# 6 12 0
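To see why the > 1 threshold works, it can help to inspect the running counter itself (a quick check, not part of the solution):
with(df, cumsum(status == 1) - cumsum(status == -1))
# [1] 0 1 2 2 2 2 1 1 2 2 2 2 1 1
The unmatched start at row 2 keeps the counter at 1 after each end signal, so rows 8 and 14 fail the > 1 test and are correctly excluded.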

A little ugly, but you can always just loop over the values and keep a flag that determines whether each element should be kept or not.
keepers <- rep(FALSE, nrow(df))
flag <- FALSE
for (i in 1:(nrow(df) - 1)) {
  if (df$status[i] == 1 && df$status[i + 1] == 0) {
    flag <- TRUE
    next  # keep signal index false
  }
  if (df$status[i] == -1 && df$status[i + 1] == 0) {
    flag <- FALSE
    next  # keep signal index false
  }
  keepers[i] <- flag
}
keepers[nrow(df)] <- flag  # set the last element to the final flag value
newdf <- df[keepers, ]     # subset based on the TRUE/FALSE values determined
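A quick check (with df as the original data frame) that the loop reproduces the desired rows:
newdf
#    time status
# 4     4      0
# 5     5      0
# 6     6      0
# 10   10      0
# 11   11      0
# 12   12      0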

Do you have some more data (or can you generate more data where you know the expected outcome) to see if this/these generalize?
Two similar approaches:
library(stringr)
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
dfr <- rle(df$status)
# first approach
find_seq_str <- function() {
  str_locate_all(paste(gsub("-1", "x", dfr$values), collapse = ""), "10")[[1]][, 2]
}
df[as.vector(sapply(find_seq_str(), function(n) {
  i <- sum(dfr$lengths[1:(n - 1)])
  tail(i:(i + dfr$lengths[n]), -1)
})), ]
# second approach
find_seq_ts <- function() {
  which(apply(embed(dfr$values, 2), 1, function(x) all(x == c(0, 1))))
}
df[as.vector(sapply(find_seq_ts(), function(n) {
  i <- sum(dfr$lengths[1:n]) + 1
  head(i:(i + dfr$lengths[n + 1]), -1)
})), ]
Both approaches need a run length encoding of the status vector.
The first replaces -1 with a single character so we can make an unambiguous, contiguous string, then uses str_locate_all to find the pairs that tell us where the target sequence starts, and finally rebuilds the ranges of zeroes from the rle lengths.
If it needs to be base R I can try to whip up something with regexpr.
The second builds a paired matrix and compares for the same target sequence.
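For reference, this is what the run length encoding computed above looks like for this df:
dfr
# Run Length Encoding
#   lengths: int [1:9] 1 2 3 1 1 1 3 1 1
#   values : num [1:9] 0 1 0 -1 0 1 0 -1 0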
Caveats:
I did no benchmarking.
Both create potentially large intermediate objects if status is long.
I'm not completely positive it generalizes (hence my initial question).
David's is far more readable, maintainable & transferable code, but you get to deal with all the "goodness" that comes with using data.table ;-)
I wrapped the approaches in functions since they could then be parameterized, but you could just as easily assign the value to a variable or shove it into the sapply (ugh, tho).

Related

Average length of runs selected from a rasterbrick in R

I am working with a rasterbrick "a" with thousands of layers; a closer description is not necessary for my problem. I am using the following function to create a rasterlayer of the total number of runs of at least 5 days with values greater than 1 (one layer in the brick is one day):
indices <- rep(1:69, each = 90)
ff <- function(x, na.rm = TRUE) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  sum(n == 5)
}
Y <- stackApply(a, indices, fun = ff)
This works great, I tested that. In a similar manner, I wrote new function:
fff <- function(x, na.rm = TRUE) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  mean(n[n >= 5])
}
X <- stackApply(a, indices, fun = fff)
Using this function, I wanted to create a rasterlayer of the average lengths of those runs of at least 5 days. It seems reasonable and fine, but it does not work correctly. For example, when there is a run of 6 days (satisfying my criterion of value > 1), it counts two runs, one of 5 and another of 6, and thus the average is 5.5 instead of 6. I am not sure how to adjust my function fff. If there is a way to do it, it would be great; otherwise I would be grateful if anyone shares another way to calculate the means of those runs. Thanks!
In the future, please include a minimal, reproducible, self-contained example. Do not describe the behavior of your code, but show it. Also, be very clear about the question. As-is, it is hard to see that your question is not about raster data at all, as you are looking for a function that works on any numeric vector (that you may then apply to raster data).
You are looking for a function that finds local maxima of at least 5 in the cumulative sum of neighbors that are > 1, and then averages these local maxima.
You have this function
ff <- function(x, na.rm = TRUE) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  sum(n == 5)
}
Example data
x <- c(-1:10, -1:3, -1:6)
x
# [1] -1 0 1 2 3 4 5 6 7 8 9 10 -1 0 1 2 3 -1 0 1 2 3 4 5 6
ff(x)
# [1] 2
(two local maxima that are at least 5)
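For contrast, running the question's fff on this same vector shows the reported problem: every cumulative count >= 5 is averaged, not just the run maxima.
fff(x)
# [1] 6.666667 (the mean of 5, 6, 7, 8, 9, 5 rather than of the maxima 9 and 5)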
To write the function you want we can start with what we have
y <- x > 1
n <- ave(y, cumsum(y == 0), FUN = cumsum)
n
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 0 0 0 1 2 0 0 0 1 2 3 4 5
In this case, you need to find the numbers 9 and 5. You can start with
n[n<5] <- 0
n
# [1] 0 0 0 0 0 0 0 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0 0 0 5
And now we can use diff to find the local maxima. These are the positions where the difference with the next value is negative. Note the zero appended to n to consider the last element of the vector.
i <- which(diff(c(n, 0)) < 0)
i
# [1] 12 25
n[i]
# [1] 9 5
Such that we can put the above together in a function like this
f <- function(x) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  n[n < 5] <- 0
  i <- which(diff(c(n, 0)) < 0)
  mean(n[i])
}
f(x)
# [1] 7
If you have NAs you may do
f <- function(x) {
  y <- x > 1
  y[is.na(y)] <- FALSE
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  n[n < 5] <- 0
  i <- which(diff(c(n, 0)) < 0)
  mean(n[i])
}
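A quick sanity check of the NA handling (an illustrative call, not from the original post): the NA breaks the run of values > 1, so only the six-long run counts and its maximum is averaged.
f(c(2, 2, 2, 2, 2, 2, NA, 2))
# [1] 6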

Row comparison in R

I have the below data frame,
R_Number A
1 0
2 15
3 10
4 11
5 12
6 18
7 19
8 15
9 17
10 11
Now I need to create another column B holding the result of comparing the values in A. The comparison is not between two consecutive rows: row number 1 is compared with row number 4, likewise row number 2 is compared with row number 5, and this iteration continues to the end of the data. The condition for the comparison result is:
if (A[1] >= 15 && A[4] <= 12) {
  B <- 1
} else if (A[1] <= 0 && A[4] >= 10) {
  B <- 2
} else {
  B <- 0
}
When it comes to rows 8, 9 and 10, there is no fourth-next row to compare with, hence the value should be 0.
Also, the comparison result of rows 1 and 4 is printed in row number 1; similarly, the comparison result of rows 2 and 5 is printed in row number 2.
So the resulting dataframe should be as shown below
R_Number A B
1 0 2
2 15 1
3 10 0
4 11 0
5 12 0
6 18 0
7 19 1
8 15 0
9 17 0
10 11 0
Following @nicola's comment, I tried to solve your problem as well.
I recreated your initial data frame:
df <- data.frame(R_Number = c(1:10), A = c(0,15,10,11,12,18,19,15,17,11), B = 0)
So I used an if statement inside a for loop:
for (i in 1:(length(df$A) - 3)) {
  if (df$A[i] >= 15 && df$A[i + 3] <= 12) {
    df$B[i] <- 1
  } else if (df$A[i] <= 0 && df$A[i + 3] >= 10) {
    df$B[i] <- 2
  } else {
    df$B[i] <- 0
  }
}
With the last edit I solved the problem that came up when the length of the data frame changed. Now you have a generic solution!
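As a check, the loop reproduces the desired column (rows 8 to 10 keep the 0 they were initialized with):
df$B
# [1] 2 1 0 0 0 0 1 0 0 0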
First lagging the variable and then computing your new variable should work. Something like this:
library(Hmisc)
df <- data.frame(R_Number = 1:10, A = c(0, 15, 10, 11, 12, 18, 19, 15, 17, 11))
A_Lag <- Lag(df$A, -3)
df$B <- rowSums(cbind(df$A >= 15 & A_Lag <= 12,
                      (df$A <= 0 & A_Lag >= 10) * 2), na.rm = TRUE)
df$B
I tried to avoid if statements. The Lag function can be found in the Hmisc package.
> df$B
[1] 2 1 0 0 0 0 1 0 0 0

R: Remove intervals by criteria with overlap tolerance

I am searching for a solution to create a mask with which I can remove some data (e.g. rows in a data.frame) depending on some criteria, e.g.:
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
mask <- a == 0
mask
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
In my actual problem this cut is too harsh; I would like to have a smoother transition. The idea: I want to include some zeros before the non-zeros, and also add some zeros after the non-zeros. Simple approach: if I have this vector, I would like to switch every TRUE adjacent to a FALSE into a FALSE, which adds an overlapping tolerance region to the data. So instead of
a[!mask]
[1] 3 5 6 3 4 5 8 5
I would rather have something like
a[!mask]
[1] 0 3 5 6 3 0 0 4 5 8 5 0
or (increasing the size of the tolerance window)
a[!mask]
[1] 0 0 3 5 6 3 0 0 0 4 5 8 5 0 0
In the last case the three zeros in the middle arise since the tolerances from the left and from the right start overlapping. My question: does anyone have a good approach for writing a function that creates such a mask with overlapping tolerance?
[EDIT] It took me some time to realise the error in my initial question (thanks @tospig). In my initial post I got the number of zeros in the middle part completely wrong! So, for clarification: in the case of a tolerance window of 1, there really should be two zeros in the middle: one from the left bunch of valid data and one from the right. Sorry for the confusion!
So, despite the really cool approach from @tospig (which I have to keep in mind), the solution from @agenis solves my problem perfectly!
I think I would go with a classic moving average of order 3 which simply expands the "non-zeros" by one to the left and one to the right. As simple as this. You will just have to figure out what you do with the first and the last point of your vector that are turned into NA (in my example I make them zeros).
And you have your desired result (for a bigger mask you take the order 5 instead of 3):
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
library(forecast)
a.ma <- ma(a, 3)
a.ma[is.na(a.ma)] <- 0
mask <- a.ma == 0
a[!mask]
#### [1] 0 3 5 6 3 0 0 4 5 8 5 0
Then you can easily transform this piece of code into a function.
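For instance, a minimal sketch of such a function (smooth_mask is a hypothetical name; it assumes the relation order = 2 * tolerance + 1 described above):
library(forecast)
smooth_mask <- function(a, tol = 1) {
  # a centered moving average of order 2 * tol + 1 spreads every non-zero
  # value over 'tol' neighbours on each side
  a.ma <- ma(a, 2 * tol + 1)
  a.ma[is.na(a.ma)] <- 0  # the ends of the vector become NA; treat them as zero
  a.ma == 0
}
a[!smooth_mask(a, 1)]
#### [1] 0 3 5 6 3 0 0 4 5 8 5 0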
[EDIT] This method does not ensure the conservation of the total number of zeros (see the additional comments clarifying the OP's initial question).
We can try
library(data.table)
lst1 <- split(a[!mask], rleid(mask)[!mask])
c(0, unlist(Map(`c`, lst1, 0), use.names = FALSE))
#[1] 0 3 5 6 3 0 4 5 8 5 0
Or another option is
n <- 1
i1 <- !inverse.rle(within.list(rle(mask), {
  lengths[values] <- lengths[values] - n
  lengths[!values] <- lengths[!values] + n
}))
c(a[i1],0)
#[1] 0 3 5 6 3 0 4 5 8 5 0
Here's a solution that allows you to specify the tolerance. At the moment it doesn't 'overlap' zeros.
We can use a data.table structure (or a data.frame, but I like using data.table) and control how many zeros we want to keep between the set of positive numbers. We can specify any tolerance value, but if it's greater than a sequence of zeros, only the maximum number of consecutive zeroes will be returned.
a <- c(0,0,0,3,5,6,3,0,0,0,4,5,8,5,0,0,0,0,0)
library(data.table)
tolerance <- 1
dt <- data.table(id = seq(1, length(a), by = 1),
                 a = a)
## subset all the 0s, with their 'ids' for joining back on
dt_zero <- dt[a == 0]
## get the positions where the difference between values is greater than one,
## and create groups based on their length
changed <- which(c(TRUE, diff(dt_zero$id) > 1))
dt_zero$grps <- rep(changed, diff(c(changed, nrow(dt_zero) + 1)))
## we only need the 'tolerance' number of zeros
## if 'tolerance' is greater than number of entries in a group,
## it will return 'na'
dt_zero <- dt_zero[dt_zero[order(id), .I[1:tolerance], by = grps]$V1, ]
## join back onto original data.table,
## and subset only relevant results
dt_zero <- dt_zero[, .(id, a)][dt, on = "id"][(is.na(a) & i.a > 0) | a == 0]
res <- dt_zero$i.a
res
# [1] 0 3 5 6 3 0 4 5 8 5 0
## try different tolerances
tolerance <- 2
...
# 0 0 3 5 6 3 0 0 4 5 8 5 0 0
tolerance <- 6
...
# 0 0 0 3 5 6 3 0 0 0 4 5 8 5 0 0 0 0 0

Apply in R: recursive function that operates on its own previous result

How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't get to 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions, and half of them do nothing while the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set).
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This excel function is displayed without the 5-absence lock for easy readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use of the previous element's result, you might want to check out Reduce with the accumulate parameter set to TRUE.
Here's an example:
## define your function that takes two parameters:
## the 'previous' result and the 'current' element
runSum <- function(sum, x) {
  res <- 0
  if (x == 1) {
    res <- sum + 1
  } else if (x == 0 && sum < 5) {
    res <- 0
  } else {
    res <- sum
  }
  res
}
# let's look at the absent values from student B
x <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x = x, f = runSum, accumulate = TRUE)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
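To get the full column grouped by student, this Reduce call can be wrapped in ave (a sketch assuming the df and runSum defined above):
df$Consecutive.Absences <- ave(df$Absent, df$Student,
                               FUN = function(v) Reduce(runSum, v, accumulate = TRUE))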
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) rle(x)$values[rle(x)$lengths >= 5])
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) 1 %in% rle(x)$values[rle(x)$lengths >= 5])
A B
FALSE TRUE
I also struggled through to a Reduce solution (but am second in priority to @kithpradhan):
ave(dfrm$Absent, dfrm$Student,
    FUN = function(XX)
      Reduce(function(x, y)
               if (x[1] >= 5) y + x[1]   # locked: once at 5 or more, never resets
               else x[1] * y + y,        # resets to 0 if y == 0
             XX, accumulate = TRUE)
)
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce <- function(f, x, maxiter = 50) {
  i <- 1
  while (!identical(f(x), x) && i <= maxiter) {
    x <- f(x)
    i <- i + 1
  }
  x
}
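For example, a toy fixed-point iteration (an illustrative call, not from the original answer):
# repeatedly apply x -> min(x + 1, 5) until the value stops changing
ireduce(function(x) pmin(x + 1, 5), 0)
# [1] 5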

Calculate number of changes of a variable per individual in a data frame

This might be a very simple question to ask, but I struggle to solve this problem in R.
I have a dataset containing four variables: ID (identifying the participants), Type (with one value this time), Decision (A or B) and Feedback (0 or 1). The data set for two participants looks like this:
ID Type Decision Feedback
1 1 A 0
1 1 A 0
1 1 B 1
1 1 B 1
1 1 B 0
2 1 A 0
2 1 A 1
2 1 A 1
2 1 A 0
2 1 B 1
etc...
I want to calculate the number of changes in the decision process as a function of the previous feedback. In other words, if the participant chose A and received a negative feedback, will she/he choose A again (Stay) or B (Shift)? So my code is the following for one participant:
Stay <- 0
Shift <- 0
for (i in 2:length(mydf$Type)) {
  if (mydf$Decision[i] == "A" && mydf$Feedback[i - 1] == 1 && mydf$Decision[i - 1] == "A") {
    Stay <- Stay + 1
  } else if (mydf$Decision[i] == "B" && mydf$Feedback[i - 1] == 1 && mydf$Decision[i - 1] == "B") {
    Stay <- Stay + 1
  } else if (mydf$Decision[i] == "A" && mydf$Feedback[i - 1] == 1 && mydf$Decision[i - 1] == "B") {
    Shift <- Shift + 1
  } else if (mydf$Decision[i] == "B" && mydf$Feedback[i - 1] == 1 && mydf$Decision[i - 1] == "A") {
    Shift <- Shift + 1
  }
}
However, my data frame contains 20 participants and I don’t know how to extend my code to get the number of stays and shifts for each participant (i.e., to get something like this at the end):
#ID Stay Shift
#1 10 10
#2 16 4
#etc...
Thank you very much for your help in advance.
This is best done using ddply from the plyr package (you'll have to install it), which splits up a data frame based on one of its columns, does some analysis on each piece, and then recombines the pieces into a new data frame.
First, write a function num.stay.shift that calculates your stay and shift values given a single subset of the data frame (explained in comments):
num.stay.shift <- function(d) {
  # TRUE/FALSE vector: whether d$Feedback is 1 (for all but the last row)
  negative.feedback <- (head(d$Feedback, -1) == 1)
  # TRUE/FALSE vector: whether the decision stayed the same at each step
  stay <- head(d$Decision, -1) == tail(d$Decision, -1)
  # summarize as two values: the number that stayed when feedback == 1,
  # and the number that shifted when feedback == 1
  c(Stay = sum(stay[negative.feedback]), Shift = sum(!stay[negative.feedback]))
}
Then, use ddply to apply that function to each of the individuals within the data frame, splitting it up by ID:
library(plyr)
print(ddply(mydf, "ID", num.stay.shift))
On the subset of the data frame you show, you would end up with
# ID Stay Shift
# 1 1 2 0
# 2 2 2 0
How about a nice breakdown by ID and Feedback:
library(data.table)
X <- data.table(mydf, key = "ID")
X[, list(Dif = abs(diff(as.numeric(Decision))),
         FB  = head(Feedback, -1)),
  by = ID][, list(Shifted = sum(Dif), Stayed = length(Dif) - sum(Dif)),
           by = list(ID, FB)]
# ID FB Shifted Stayed
# 1: 1 0 1 1
# 2: 1 1 0 2
# 3: 2 0 1 1
# 4: 2 1 0 2
or if you don't want the breakdown by Feedback, it is even more succinct:
X[, {Dif <- abs(diff(as.numeric(Decision)))
     list(Shifted = sum(Dif), Stayed = length(Dif) - sum(Dif))},
  by = list(ID)]
# ID Shifted Stayed
# 1: 1 1 3
# 2: 2 1 3
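If converting the factor with as.numeric feels fragile, the decisions can also be compared directly. A hedged alternative sketch using the same X as above:
# compare each decision with the previous one within each ID
X[, {ch <- Decision[-1] != Decision[-.N]
     list(Shifted = sum(ch), Stayed = sum(!ch))},
  by = ID]
#    ID Shifted Stayed
# 1:  1       1      3
# 2:  2       1      3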
This is a slightly hairier alternative using the embed function, as mentioned in the comments on @DavidRobinson's answer.
d <- read.table(text = "ID Type Decision Feedback
1 1 A 0
1 1 A 0
1 1 B 1
1 1 B 1
1 1 B 0
2 1 A 0
2 1 A 1
2 1 A 1
2 1 A 0
2 1 B 1", header=TRUE)
do.call(rbind,
  by(d, d$ID, function(x) {
    f <- function(x) length(unique(x)) == 1
    stay <- apply(embed(as.vector(x$Decision), 2), 1, f)
    # note the parentheses: we want indices 1..(nrow(x) - 1)
    neg.feedback <- x$Feedback[1:(nrow(x) - 1)] == 1
    c(Stay = sum(stay & neg.feedback), Shift = sum((!stay) & neg.feedback))
  })
)
# Stay Shift
# 1 2 0
# 2 2 0
