removing sequences of positive values between sequences of "0" - r

I would like to create a small function in a data frame, for detecting (and setting to 0) sequences of positive values which are located between sequences of values equal to 0, but only if these sequences of positive values are not more than 5 values long.
Here's just a small example for showing you how my data looks (initial_data column), and what I would like to obtain at the end (final_data column):
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
This sentence sums up the rule:
"If there's a sequence of positive values, not longer than 5 values, and located between at least two or three 0-values (before and after this sequence of positive values), then set also this sequence to 0"
Any advice for doing this easily?
Thanks a lot!!!

Here's a possible approach using the rle function:
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),
final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
# using rle create an object with the sequences of consecutive elements
# having the same sign (-1 means negative, 0 means zero, 1 means positive)
enc <- rle(sign(DF$initial_data))
# find the positive sequences having maximum 5 elements
posSequences <- which(enc$values == 1 & enc$lengths <= 5)
# remove index=1 or index=length(enc$values) if present because
# they can't be surrounded by 0
posSequences <- posSequences[posSequences != 1 &
posSequences != length(enc$values)]
# check if they're preceded and followed by at least 2 zeros
# (if not remove the index); vapply keeps the result a logical vector
# even when posSequences is empty
toForceToZero <- vapply(posSequences, FUN=function(idx){
enc$values[idx-1]==0 &&
enc$lengths[idx-1] >= 2 &&
enc$values[idx+1] == 0 &&
enc$lengths[idx+1] >= 2}, FUN.VALUE=logical(1))
posSequences <- posSequences[toForceToZero]
# reverse the run-length encoding, setting NA where we want to force to zero
v <- enc$values
v[posSequences] <- NA
# create the final data vector by forcing NAs to 0
final_data <- DF$initial_data
final_data[is.na(rep.int(v, enc$lengths))] <- 0
# check if is equal to your desired output
all(DF$final_data == final_data)
# > [1] TRUE
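Since the question asks for a small function, here is a minimal sketch that wraps the same rle idea. The name zero_short_runs and the parameters max_len and min_zeros are made up for illustration, and it assumes the input has no NA values:

```r
# Hypothetical helper: zero out positive runs of at most `max_len` values
# that sit between zero runs of at least `min_zeros` values on both sides.
zero_short_runs <- function(x, max_len = 5, min_zeros = 2) {
  enc <- rle(sign(x))
  n <- length(enc$values)
  # runs of zeros that are long enough to count as a boundary
  is_zero_run <- enc$values == 0 & enc$lengths >= min_zeros
  # shifted copies: does the previous / next run qualify as a boundary?
  prev_ok <- c(FALSE, is_zero_run[-n])
  next_ok <- c(is_zero_run[-1], FALSE)
  target <- enc$values == 1 & enc$lengths <= max_len & prev_ok & next_ok
  # expand the per-run flags back to one flag per element
  x[rep(target, enc$lengths)] <- 0
  x
}

initial <- c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)
expected <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)
result <- zero_short_runs(initial)
```

A run at the very start or end of the vector has no neighbour on one side, so the shifted flags leave it untouched, which matches the index filtering in the answer above.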

My best friend rle to the rescue:
notzero<-rle(as.logical(DF$initial_data))
Run Length Encoding
lengths: int [1:5] 4 3 6 8 7
values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
Now just find all locations where values is TRUE and lengths <= 5, and replace the values at those locations with FALSE. Then invoke inverse.rle to get a logical vector marking the positions to keep, and set every other position of the data to 0.
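A rough sketch of that mask idea, applied to the initial_data values only and reading the question's "not more than 5 values" as lengths <= 5. The variable names are mine, and unlike the longer answer it does not check how many zeros surround a run:

```r
x <- c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)

runs <- rle(x != 0)                       # TRUE runs are the non-zero stretches
short <- runs$values & runs$lengths <= 5  # short non-zero runs
runs$values[short] <- FALSE               # mark them as "do not keep"
keep <- inverse.rle(runs)                 # back to one logical per element
result <- ifelse(keep, x, 0)              # zero everything not kept
```

Note that this version would also zero a short run at the very beginning or end of the vector, so it is only equivalent to the stricter rule when the data starts and ends with zeros, as in the example.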

Related

How can I check for cluster patterns in a sequence of numbers and obtain the next value?

Given a set of sequences
seq1 <- c(3,3,3,7,7,7,4,4)
seq2 <- c(17,17,77,77,3)
seq3 <- c(5,5,23)
How can we create a function to check these sequences for cluster patterns and predict the next value of each sequence, which in this case would be 4, 3, and 23 respectively?
Edit: The sequence should first be checked for cluster patterns. If it does not contain this class of pattern, the sequence should be ignored or passed on to another function.
Edit 2: A pattern is defined by more than one of the same consecutive number, always grouped consistently, e.g. 1,1,1,2,2,2,3,3,3 is a pattern but 1,1,2,2,2,3,3 is not.
Here's a way with rle in base R. It checks whether all run lengths except the last are equal and, if TRUE, repeats the last value so that the last run has the same length as the others -
rl <- rle(seq1)$lengths
# check if all run-lengths, except last, are equal
if(all(head(rl, -1) == rl[1])) {
c(seq1, rep(seq1[length(seq1)], diff(range(rl))))
} else {
# do something else
}
# [1] 3 3 3 7 7 7 4 4 4
The same approach applies for seq2 and seq3.
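To run the check on all three sequences, the idea can be wrapped in a function. This is only a sketch: the name predict_next is invented here, it returns NA when the sequence is not a consistent pattern, and it also returns NA when the last cluster is already complete (an assumption on my part, since the next value is then unknown):

```r
# Hypothetical helper: return the next value if `s` is made of equal-sized
# clusters (size > 1) with an incomplete last cluster, otherwise NA.
predict_next <- function(s) {
  rl <- rle(s)$lengths
  n <- length(rl)
  if (n < 2) return(NA)
  size <- rl[1]
  # every run except the last must have the same length, greater than 1
  consistent <- size > 1 && all(rl[-n] == size)
  # the last run must be shorter than the others to have something to predict
  if (!consistent || rl[n] >= size) return(NA)
  s[length(s)]
}

seq1 <- c(3,3,3,7,7,7,4,4)
seq2 <- c(17,17,77,77,3)
seq3 <- c(5,5,23)
predictions <- c(predict_next(seq1), predict_next(seq2), predict_next(seq3))
```

A sequence such as 1,1,2,2,2,3,3 fails the consistency check and gives NA, so it can be passed on to another function.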

ifelse statement with string

double="true"
data=data.frame("var1"=c(1:10))
data$var2=ifelse(double=="true",2*data$var1,NA)
data$var2want=2*data$var1
I have a character variable, double, that is set to "true" when I want to double a variable. In this example var1 is 1:10 and double="true", so I want var2 to equal (1:10)*2. The desired output is var2want. However, when I apply my ifelse statement I just get var2=2 for all values. I am not sure how to fix this issue.
double is of length 1
length(double)
#[1] 1
whereas
length(data$var1)
#[1] 10
ifelse returns a value of the same length as its test, and
double == "true"
returns a vector of length 1, so you get only one value back, which is the first value of the calculation
2*data$var1[1]
#[1] 2
and this value is recycled across all rows.
For ifelse to work for all values we need to somehow make the lengths equal
ifelse(rep(double == "true", length(data$var1)), 2*data$var1, NA)
#[1] 2 4 6 8 10 12 14 16 18 20
However, if you have only one value to compare it is better to use simple if/else instead of ifelse
data$var2 <- if (double == "true") 2*data$var1 else NA
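A small side-by-side check makes the length difference visible. The names flag and x stand in for double and data$var1 (I renamed them only to avoid masking the base function double):

```r
flag <- "true"
x <- 1:10

# ifelse() takes its shape from the test, which has length 1 here
short <- ifelse(flag == "true", 2 * x, NA)

# a plain if/else returns the whole chosen expression
full <- if (flag == "true") 2 * x else NA

length(short)  # one element, later recycled when assigned to a column
length(full)   # one element per row
```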

get length of character matching between two string in R

I have a data frame where I need to compare two columns and find the number of matching characters between two elements.
For example, x and y are two elements to be compared, which look like this:
x<- "1/2"
y<-"2/3"
I split them on '/' and unlisted the result, as below:
unlist(strsplit(x,"/"))->a
unlist(strsplit(y,"/"))->b
Then I used pmatch:
pmatch(a,b,nomatch =0)
[1] 0 1
I used sum() to count how many characters match:
sum(pmatch(a,b,nomatch =0))
[1] 1
However, when the comparison is done the other way:
pmatch(b,a,nomatch = 0)
[1] 2 0
Since there is only one match between the two strings, why is it showing 2? It could be the index. But I need to get how many characters are the same between the strings, regardless of whether I compare a vs b or b vs a.
Could someone help me with how to get this?
Per ?pmatch, pmatch seeks matches for the elements of its first argument among those of its second.
For example, "2" in the first list matches the second element in the second list.
> pmatch(c("2", "1"),c("3","2"),nomatch =0)
# [1] 2 0
One way to get the number of matched elements is to count the non-zero entries:
sum(pmatch(c("2", "1"),c("3","2"),nomatch =0) != 0)
# [1] 1
Both
sum(pmatch(b, a, nomatch = 0) != 0) # 1
sum(pmatch(a, b, nomatch = 0) != 0) # 1
return the same value.
Another option could be
sum(b %in% a)
[1] 1
sum(a %in% b)
[1] 1
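Since the question mentions comparing two columns of a data frame, here is a rough sketch that applies the same idea row by row. The function name count_shared and the sample data are invented for illustration, and it counts distinct shared tokens via intersect, which differs from sum(a %in% b) when a token is repeated:

```r
# Hypothetical helper: number of distinct tokens shared by x[i] and y[i]
count_shared <- function(x, y, sep = "/") {
  a <- strsplit(x, sep, fixed = TRUE)
  b <- strsplit(y, sep, fixed = TRUE)
  # intersect() is symmetric, so the order of x and y does not matter
  mapply(function(p, q) length(intersect(p, q)), a, b)
}

df <- data.frame(x = c("1/2", "4/5/6", "7/8"),
                 y = c("2/3", "6/5/9", "1/2"),
                 stringsAsFactors = FALSE)
df$n_shared <- count_shared(df$x, df$y)
```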

Get index of vector between 1st and 2nd appearance of number 1

Suppose we have a vector:
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
Expected output:
v_index <- c(5,6,7)
v always starts and ends with 0. There is only one possibility of having cluster of zeros between two 1s.
Seems simple enough, can't get my head around...
I think this will do
which(cumsum(v == 1L) == 1L)[-1L]
## [1] 5 6 7
The idea here is to separate all the instances of "one"s to groups and select the first group while removing the occurrence of the "one" at the beginning (because you only want the zeroes).
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
v_index<-seq(which(v!=0)[1]+1,which(v!=0)[2]-1,1)
> v_index
[1] 5 6 7
Explanation: I ask which indices are not equal to 0:
which(v!=0)
then I take the first and second index from that vector and create a sequence out of it.
This is probably one of the simplest answers out there. Find which items are equal to one, then produce a sequence using the first two indexes, incrementing the first and decrementing the other.
block <- which(v == 1)
start <- block[1] + 1
end <- block[2] - 1
v_index <- start:end
v_index
[1] 5 6 7
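One caveat with start:end is that it counts downwards when the two 1s are adjacent (for example 3:2 gives 3 2). A sketch that avoids this with seq_len, using a made-up function name between_first_two:

```r
# Hypothetical helper: indices strictly between the first and second
# occurrence of `target`; empty if there is no gap or fewer than two hits.
between_first_two <- function(v, target = 1) {
  hits <- which(v == target)
  if (length(hits) < 2) return(integer(0))
  gap <- hits[2] - hits[1] - 1   # number of elements in between
  hits[1] + seq_len(gap)         # seq_len(0) is empty, so no backwards range
}

v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
v_index <- between_first_two(v)
```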

calculating length using na.omit in R

here is my code:
data <-setNames(lapply(paste0("80-20 ", file.number,".csv"),read.csv,stringsAsFactors=FALSE),paste(file.number,"participant"))
# imports the csv files into a named list of data frames
df <- data.frame(RT=1:100,rep.sw=sample(c("sw","rep"),100,replace=TRUE))
(error.sw.c <- lapply(data[control.data],function(df) with(df, na.omit(rep.sw == "sw" & accuracy == "wrong"))))
This code scans a bunch of files and assigns a value of 'TRUE' every time the accuracy is "wrong" for values labeled "sw". Then what I want to do is count the number of TRUE values, and put them in a data frame. This is what I tried:
(dataframe.c <- data.frame(switch.rt = sapply(sw.c,mean), repetition.rt = sapply(rep.c,mean), switch.error = sapply(error.sw.c,length), group = rep("control",each=length(control.data))))
However, when I do this, it gives me the length of all the values (TRUE & FALSE), not just the TRUE values.
If I do this:
length(error.sw.c)
I get the total of all the error values, not all the error values separately.
So my question is: Is there a way to get the length of each individual excel file so I can put it in a dataframe? Thank you StackOverflow community, you folks haven't let me down yet. Any help will be greatly appreciated. Let me know if any clarification is needed. :)
sum() can be used to count the number of TRUEs in a logical vector. Let's see why:
set.seed(555)
logicalVec <- rnorm(5) > 0 # create logical vector
logicalVec
[1] FALSE TRUE TRUE TRUE FALSE
Arithmetic functions coerce logical values to numeric values such that FALSE becomes 0 and TRUE becomes 1:
logicalVec*1
[1] 0 1 1 1 0
You can think of sum(logicalVec) as equivalent to sum(c(0,1,1,1,0)):
sum(c(0,1,1,1,0))
[1] 3
sum(logicalVec)
[1] 3
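Applied to a list with one logical vector per file, that means swapping length for sum inside sapply. A minimal sketch with made-up data standing in for error.sw.c:

```r
# Hypothetical stand-in for error.sw.c: one logical vector per participant
error_list <- list(
  "1 participant" = c(TRUE, FALSE, TRUE, FALSE),
  "2 participant" = c(FALSE, FALSE, TRUE)
)

n_values <- sapply(error_list, length)  # counts TRUE and FALSE together
n_errors <- sapply(error_list, sum)     # counts only the TRUE values
```

In the data.frame call from the question, this would correspond to switch.error = sapply(error.sw.c, sum).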

Resources