Given a set of sequences
seq1 <- c(3,3,3,7,7,7,4,4)
seq2 <- c(17,17,77,77,3)
seq3 <- c(5,5,23)
How can we create a function that checks each sequence for cluster patterns and predicts its next value, which in this case would be 4, 3, and 23 respectively?
Edit: The sequence should first be checked for cluster patterns, if it does not contain this class of pattern then the sequence should be ignored or passed onto another function
Edit 2: A pattern is defined by more than 1 of the same consecutive number, always grouped consistently, e.g. 1,1,1,2,2,2,3,3,3 is a pattern but 1,1,2,2,2,3,3 is not.
Here's a way with rle in base R. It checks whether all run-lengths, except the last, are equal; if TRUE, it repeats the last value until its run has the same length as the others:
rl <- rle(seq1)$lengths
# check if all run-lengths, except last, are equal
if(all(head(rl, -1) == rl[1])) {
  c(seq1, rep(seq1[length(seq1)], diff(range(rl))))
} else {
  # do something else
}
# [1] 3 3 3 7 7 7 4 4 4
The same approach applies for seq2 and seq3.
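The check above can also be wrapped in a small function that returns only the predicted next value. This is a rough sketch rather than a definitive implementation: the name predict_next is made up here, and it assumes that a sequence whose last run is already complete (or that is not a cluster pattern at all) should simply yield NA so the caller can pass it on to another function.

```r
# Hypothetical helper (name and NA convention are assumptions, not from the question)
predict_next <- function(x) {
  r <- rle(x)
  n <- length(r$lengths)
  if (n < 2) return(NA)            # a single run gives no group size to compare against
  k <- r$lengths[1]                # group size set by the first run
  is_cluster <- k > 1 &&           # "more than 1 of the same consecutive number"
    all(r$lengths[-n] == k) &&     # every complete run has the same length
    r$lengths[n] <= k              # the last run may be incomplete, but not longer
  if (!is_cluster) return(NA)      # not this class of pattern
  if (r$lengths[n] < k) r$values[n] else NA  # incomplete last run repeats its value
}

predict_next(c(3,3,3,7,7,7,4,4))   # seq1
predict_next(c(17,17,77,77,3))     # seq2
predict_next(c(5,5,23))            # seq3
predict_next(c(1,1,2,2,2,3,3))     # inconsistent grouping
```

The first three calls return 4, 3 and 23; the last returns NA because the runs are not grouped consistently.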
I have a data frame(DF) that is like so:
DF <- rbind (c(10,20,30,40,50), c(21,68,45,33,21), c(11,98,32,10,30), c(50,70,70,70,50))
10 20 30 40 50
21 68 45 33 21
11 98 32 10 30
50 70 70 70 50
I want to keep only the columns whose value in the last row equals some x. In my scenario x would be 50, so my resulting data frame (resultDF) will look like this:
10 50
21 21
11 30
50 50
How can I do this in R? I have attempted using subset as below, but it doesn't work as I expect:
resultDF <- subset(DF, DF[nrow(DF),] == 50)
Error in x[subset & !is.na(subset), vars, drop = drop] :
(subscript) logical subscript too long
I have solved it. My subsetting call was incorrect. I used the following piece of code to get the results I needed.
resultDF <- DF[, DF[nrow(DF),] == 50]
Your issue with subset() was only about the syntax for calling it with a logical column vector (its third arg, not its second). You can either use subset() or plain logical indexing. The latter is recommended.
The help page ?subset tells you its optional second arg ('subset') is a logical row-vector, and its optional third arg ('select') is a logical column-vector:
subset: logical expression indicating elements or rows to keep:
missing values are taken as false.
select: expression, indicating columns to select from a data frame.
So you want to call it with this logical column-vector:
> DF[nrow(DF),] == 50
[1] TRUE FALSE FALSE FALSE TRUE
There are two syntactical ways to leave subset()'s second arg default and pass the third arg:
# Explicitly pass the third arg by name...
> subset(DF, select=(DF[nrow(DF),] == 50) )
# Leave 2nd arg empty, so no row filter is applied (all rows are kept)...
> subset(DF, , (DF[nrow(DF),] == 50) )
[,1] [,2]
[1,] 10 50
[2,] 21 21
[3,] 11 30
[4,] 50 50
The second way is probably preferable as it looks like generic row,col-indexing, and also doesn't require you to know the third arg's name.
(As a mnemonic, in R and SQL terminology, understand that 'select' implicitly means 'column-indices', and 'filter'/'subset' implicitly means 'row-indices'. Or in data.table terminology they're called i-indices, j-indices respectively.)
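For comparison, the plain logical indexing recommended above can be sketched as follows. The variable name keep is introduced only for illustration, and drop = FALSE is an addition not present in the original code: it keeps the result a matrix even when only one column matches.

```r
DF <- rbind(c(10,20,30,40,50), c(21,68,45,33,21), c(11,98,32,10,30), c(50,70,70,70,50))

# Logical vector with one entry per column: TRUE where the last row equals 50
keep <- DF[nrow(DF), ] == 50

# Leave the row index empty (all rows) and index the columns with the logical vector
resultDF <- DF[, keep, drop = FALSE]
resultDF
```

Here keep plays the role of subset()'s 'select' argument, and the empty first index plays the role of the omitted 'subset' argument.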
I have a question about searching for values in R. It is similar to a question posted yesterday (Searching a vector/data table backwards in R), except that I think my problem is a bit more complicated (and that question does the opposite of what I want to do). Since I'm very new to R, I'm not sure how to solve it.
I have a data frame similar to one given below, and I wish to find a previous index value to my current one where the Times column is different to my current time and the Midquote column does not have an NA value.
Index Times | Midquote
-----------------------------
1 10:30:45.58 | 5.319
2 10:30:45.93 | 5.323
3 10:30:45.104 | 5.325
4 10:30:45.127 | 5.322
5 10:30:45.188 | 5.325
6 10:30:45.188 | NA
7 10:30:45.212 | NA
8 10:30:45.231 | 5.321
9 10:30:45.231 | 5.321
Suppose we start at the bottom of the data frame and take this to be the 'current' time: index 9, with a Times value of 10:30:45.231 and a Midquote value of 5.321. Searching backwards for the first index whose time differs from the current time gives index 7, with a time of 10:30:45.212 (index 8 has the same time as index 9). But at index 7 the Midquote value is NA, so I have to keep looking. Index 6 also has a different time (10:30:45.188), but again an NA in the Midquote column. Moving up to index 5, the time is still different from my current time (10:30:45.188 again) and the Midquote value is 5.325.
Therefore, since at index 5 the time is 10:30:45.188 (which is different to my current time which was 10:30:45.231) and since the Midquote value at index 5 is not NA, I wish to obtain the output '5' since it is the index value which fulfills both criteria.
My question is, is there a good way of doing this? I am sorry if this is an easy question, I am very new to R and I don't know much about working with data frames...
EDIT: I would also like to do it preferably without adding another column to the data frame (as is given in the top answer of the link I mentioned above), if that is possible
Working with dates is tough especially with fractional seconds.
If you could convert the times to doubles it would be easier to work with.
Assuming your 'Times' are in order you could use this
library(magrittr)
which(df$Times < df$Times[9] & !is.na(df$Midquote)) %>% max()
The which gives a vector of the row indices where 'Times' is less than that in row 9 AND 'Midquote' is not NA. The %>% sends the vector to max(), which gives the highest such index. Note that the < comparison is only reliable once 'Times' is numeric, hence the suggestion above to convert it to doubles. This is pretty inelegant, but will get the job done.
If I understood it correctly, please check if this is the output you are expecting.
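ind <- function(t, df) {
  ind <- t   # remember the 'current' row
  while (t > 1) {
    t <- t - 1
    # earlier row with a different time and a non-NA midquote
    if ((df$Times[t] != df$Times[ind]) && (!is.na(df$Midquote[t]))) {
      return(t)
    }
  }
}
sapply(nrow(df):1, FUN = ind, df)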
#[[1]]
#[1] 5
#[[2]]
#[1] 5
#[[3]]
#[1] 5
#[[4]]
#[1] 4
#[[5]]
#[1] 4
#[[6]]
#[1] 3
#[[7]]
#[1] 2
#[[8]]
#[1] 1
#[[9]]
#NULL
The output series corresponds to the associated index for your data.frame starting from the last row.
Explanation: ind stores the row number of the current row, while t steps backwards from ind-1 to 1. df is the entire data.frame. The while loop checks whether df$Times[t] and df$Midquote[t] satisfy the required conditions. If they do, the function returns the index t; otherwise the loop continues until it reaches the first row, and NULL is returned if no row qualifies.
Without using sapply for a particular current row:
ind(9,df)
[1] 5
Data.table solution, 1 line.
library(data.table)
dt <- data.table(Index = 1:9,
Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
Midquote = c('5.319','5.323','5.325','5.322','5.325',NA,NA,'5.321','5.321')
)
> dt[ Times != Times[.N] & !is.na(Midquote), max(Index) ]
[1] 5
EDIT
To remove the Index column you have (at least) two options
dt2 <- data.table(Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
Midquote = c('5.319','5.323','5.325','5.322','5.325',NA,NA,'5.321','5.321'))
# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause... so you need to save it)
dt2[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]
# Option 2 - simply check the last position of where your condition is met
dt2[, max(which(Times != Times[.N] & !is.na(Midquote))) ]
NB You can't do nrow because you can have, say, the 1st, 2nd, and 4th records matching your condition, and nrow would give you 3, which is wrong because the 3rd row does not match.
EDIT 2 (option 3 is not correct)
dt3 <- data.table(Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
Midquote = c('5.319','5.323', NA,'5.322','5.325', NA, NA,'5.321','5.321'))
# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause... so you need to save it)
dt3[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]
[1] 5
# Option 2 - simply check the last position of where your condition is met
dt3[, max(which(Times != Times[.N] & !is.na(Midquote))) ]
[1] 5
# Option 3 - good luck with this
nrow(dt3[Times != Times[.N] & !is.na(Midquote)])
[1] 4
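The "last position where the condition is met" idea from Option 2 can also be sketched in base R, without adding a column and for any current row rather than only the last one. The function name prev_index and the NA return value for "no match" are assumptions made for this illustration; Midquote is stored as numeric here rather than as character.

```r
df <- data.frame(
  Times = c('10:30:45.58', '10:30:45.93', '10:30:45.104', '10:30:45.127', '10:30:45.188',
            '10:30:45.188', '10:30:45.212', '10:30:45.231', '10:30:45.231'),
  Midquote = c(5.319, 5.323, 5.325, 5.322, 5.325, NA, NA, 5.321, 5.321)
)

# Hypothetical helper: latest row before i with a different time and a non-NA midquote
prev_index <- function(df, i) {
  earlier <- seq_len(i - 1)                       # candidate rows strictly before i
  ok <- df$Times[earlier] != df$Times[i] & !is.na(df$Midquote[earlier])
  hits <- earlier[ok]
  if (length(hits)) max(hits) else NA_integer_    # NA when nothing qualifies
}

prev_index(df, 9)
prev_index(df, 6)
prev_index(df, 1)
```

For the example data these calls give 5, 4 and NA. Only equality of the Times strings is tested, so no conversion of the timestamps is needed.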
Say I have a data frame df and want to subset it based on the value of column a.
df <- data.frame(a = 1:4, b = 5:8)
df
Is it necessary to include a which function in the brackets or can I just include the logical test?
df[df$a == "2",]
# a b
#2 2 6
df[which(df$a == "2"),]
# a b
#2 2 6
It seems to work the same either way... I was getting some strange results in a large data frame (i.e., getting empty rows returned as well as the correct ones) but once I cleaned the environment and reran my script it worked fine.
df$a == "2" returns a logical vector, while which(df$a=="2") returns indices. If there are missing values in the vector, the first approach will include them in the returned value, but which will exclude them.
For example:
x <- c(1, NA, 2, 10)
x[x == 2]
# [1] NA  2
x[which(x == 2)]
# [1] 2
x == 2
# [1] FALSE    NA  TRUE FALSE
which(x == 2)
# [1] 3
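The same effect on a data frame may explain the "empty rows" mentioned in the question. The sketch below uses a made-up variant of df with one NA in column a; the names by_logical and by_which are only for illustration.

```r
# Variant of the question's df with a missing value in column a (assumed for illustration)
df <- data.frame(a = c(1, NA, 2, 4), b = 5:8)

# Logical indexing: the NA in the condition produces an extra row filled with NA
by_logical <- df[df$a == 2, ]

# which() drops the NA and returns only the positions where the test is TRUE
by_which <- df[which(df$a == 2), ]

nrow(by_logical)
nrow(by_which)
```

With logical indexing the result has two rows (one all-NA row plus the match); with which() it has only the matching row.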
I would like to create a small function in a data frame, for detecting (and setting to 0) sequences of positive values which are located between sequences of values equal to 0, but only if these sequences of positive values are not more than 5 values long.
Here's just a small example for showing you how my data looks (initial_data column), and what I would like to obtain at the end (final_data column):
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
This sentence also summarizes the rule:
"If there's a sequence of positive values, not longer than 5 values, and located between at least two or three 0-values (before and after this sequence of positive values), then set also this sequence to 0"
Any advice for doing this easily?
Thanks a lot!!!
Here's a possible approach using rle function :
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),
final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
# using rle create an object with the sequences of consecutive elements
# having the same sign (-1 means negative, 0 means zero, 1 means positive)
enc <- rle(sign(DF$initial_data))
# find the positive sequences having maximum 5 elements
posSequences <- which(enc$values == 1 & enc$lengths <= 5)
# remove index=1 or index=length(enc$values) if present because
# they can't be surrounded by 0
posSequences <- posSequences[posSequences != 1 &
posSequences != length(enc$values)]
# check if they're preceded and followed by at least 2 zeros
# (if not remove the index); vapply always returns a logical vector,
# even when posSequences is empty
toForceToZero <- vapply(posSequences, FUN=function(idx){
  enc$values[idx-1] == 0 &&
  enc$lengths[idx-1] >= 2 &&
  enc$values[idx+1] == 0 &&
  enc$lengths[idx+1] >= 2}, FUN.VALUE=logical(1))
posSequences <- posSequences[toForceToZero]
# reverse the run-length encoding, setting NA where we want to force to zero
v <- enc$values
v[posSequences] <- NA
# create the final data vector by forcing NAs to 0
final_data <- DF$initial_data
final_data[is.na(rep.int(v, enc$lengths))] <- 0
# check if is equal to your desired output
all(DF$final_data == final_data)
# > [1] TRUE
My best friend rle to the rescue:
notzero<-rle(as.logical(DF$initial_data))
Run Length Encoding
lengths: int [1:5] 4 3 6 8 7
values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
Now just find all locations where values is TRUE and lengths <= 5, and replace the values at those locations with FALSE. Then invoke inverse.rle to get a logical vector marking the elements to keep, and set every other element of the data to 0.
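Those steps could look roughly like the sketch below. It is a simplified version under stated assumptions: the data has no negative values, and the requirement that a short run be surrounded by at least two zeros is not checked, so a short positive run at the very start or end of the vector would also be zeroed. The names short and keep are introduced only for illustration.

```r
initial_data <- c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)

# Runs of zero (FALSE) and non-zero (TRUE) values
notzero <- rle(initial_data != 0)

# Non-zero runs of at most 5 values are relabelled as if they were zero runs
short <- notzero$values & notzero$lengths <= 5
notzero$values[short] <- FALSE

# Expand back to one logical per element: TRUE means "keep the original value"
keep <- inverse.rle(notzero)
final_data <- ifelse(keep, initial_data, 0)
final_data
```

For the example data this zeroes the run 100, 2, 85 and leaves the eight-value run starting at 3 untouched.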