I have a string like this one:
0|294|314|20|314|SC49TST57ASG75A|1428.0
Using R, I want to extract only the data between two | characters (for example, SC49TST57ASG75A) and then count only the numbers that are greater than 20. In this case the numbers are 49, 57 and 75, so the code should return 3.
I want to apply this to a column in a data frame.
Eventually, I want a new column that specifies, for each row, how many numbers greater than 20 there are inside the |....|.
Thanks!
You can try strsplit with split = '\\|'. If you only want to count the values between two pipes, you should exclude the first and last elements. Since you want elements greater than 20, the solution uses the > sign (strictly greater than).
I am assuming here that your column has the same structure as the string given in your question.
st <- '0|294|314|20|314|SC5GSC12ASG266T|1428.0'
Solution:
lapply(strsplit(st, '\\|'), function(x)sum(as.numeric(x[2:(length(x)-1)]) > 20, na.rm=TRUE))
I am not sure if this is what you are looking for; if not, please tell me what your expected result is.
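If the goal is instead the count of numbers inside the alphanumeric field itself, applied to a whole data frame column, a minimal sketch could look like the following. The data frame df, its column x, the new column n_gt20, and the assumption that the field of interest is always the sixth pipe-delimited one are illustrative and not taken from the question.

```r
# Hypothetical data frame; the column name x is an assumption
df <- data.frame(x = c("0|294|314|20|314|SC49TST57ASG75A|1428.0",
                       "0|294|314|20|314|SC5GSC12ASG266T|1428.0"),
                 stringsAsFactors = FALSE)

# Take the sixth pipe-delimited field of every row (assumed position)
field <- vapply(strsplit(df$x, "|", fixed = TRUE), `[`, character(1), 6)

# Pull out every run of digits inside that field
nums <- regmatches(field, gregexpr("[0-9]+", field))

# Count, per row, how many of those numbers exceed 20
df$n_gt20 <- vapply(nums, function(n) sum(as.numeric(n) > 20), integer(1))
df$n_gt20
# [1] 3 1
```

Using fixed = TRUE avoids having to escape the pipe, and vapply keeps the result a plain integer vector that can be assigned directly as a column.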
cnt <- Map(function(x) sum(as.numeric(x) > 20),
           regmatches(r <- unlist(regmatches(s, gregexpr("(?<=\\|).*?(?=\\|)", s, perl = TRUE))),
                      gregexpr("\\d+\\.?\\d+?", r)))
such that
> cnt
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 1
[[5]]
[1] 1
DATA
s <- "0|294|314|20|314|SC5GSC12ASG266T|1428.0"
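To see what the two regmatches stages do, here is a step-by-step sketch on the same s. It swaps in the simpler pattern [0-9]+ for the inner stage, which assumes the fields contain only integers; the names fields and nums are illustrative.

```r
s <- "0|294|314|20|314|SC5GSC12ASG266T|1428.0"

# Stage 1: everything that sits between two pipes
# (five fields: 294, 314, 20, 314, SC5GSC12ASG266T)
fields <- unlist(regmatches(s, gregexpr("(?<=\\|).*?(?=\\|)", s, perl = TRUE)))

# Stage 2: the digit runs inside each field
nums <- regmatches(fields, gregexpr("[0-9]+", fields))

# Count per field, returned as a plain vector instead of a list
cnt <- vapply(nums, function(x) sum(as.numeric(x) > 20), integer(1))
cnt
# [1] 1 1 0 1 1
```

Returning an integer vector rather than a list makes the counts easier to assign to a data frame column.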
Is a one-line solution possible for this example?
df = data.frame('First' = c('T','T','V','V','A','E'),'Last' = c(rep('Ng',3),'Smith','Wolf','Wolf'))
matches = (df$First[-1] == df$First[-nrow(df)])
which(matches)
# [1] 1 3
I want the indices, but would rather not use a temporary variable.
Perhaps you could use the rleid function from data.table in combination with diff, like this:
which(diff(rleid(df$First)) == 0)
[1] 1 3
You could argue that it is the 2nd and 4th elements of df$First that match the previous value (rather than the 1st and 3rd). In that case, which(c(FALSE, diff(rleid(df$First)) == 0)) might be more appropriate, which yields: [1] 2 4
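For readers without data.table, the same indices can be sketched in base R. The vector x stands in for df$First, and the rle-based ids are only meant as an analogue of what rleid returns, not its implementation.

```r
x <- c("T", "T", "V", "V", "A", "E")

# Compare each element with its successor; both sides have length(x) - 1
which(x[-1] == x[-length(x)])
# [1] 1 3

# Run ids built from rle(): one id per run of equal consecutive values
r <- rle(x)
ids <- rep(seq_along(r$lengths), r$lengths)
which(diff(ids) == 0)
# [1] 1 3
```

Adding 1 to either result gives the positions of the elements that repeat their predecessor (2 and 4 here).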
Does anyone know if there's a built-in function in R that can return the indices of duplicated elements corresponding to the unique elements?
For instance I have a vector
a <- ["A","B","B","C","C"]
unique(a) will give ["A","B","C"]
duplicated(a) will give [F,F,T,F,T]
Is there a built-in function that returns a vector of indices of the same length as the original vector a, giving the location of each of a's elements in the unique vector (which is [1,2,2,3,3] in this example)?
I.e., something like the output variable "ic" of the MATLAB function "unique" (that is, if we let c = unique(a), then a = c(ic,:)).
http://www.mathworks.com/help/matlab/ref/unique.html
Thank you!
We can use match
match(a, unique(a))
#[1] 1 2 2 3 3
Or convert to factor and coerce to integer
as.integer(factor(a, levels = unique(a)))
#[1] 1 2 2 3 3
data
a <- c("A","B","B","C","C")
This should work:
cumsum(!duplicated(sort(a)))  # once you replace the MATLAB syntax with R syntax
Or just:
as.numeric(factor(a))
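The approaches above agree on the example because a is already sorted. On an unsorted input they differ, which the following sketch illustrates; the vector a2 is made up for the comparison.

```r
a2 <- c("C", "A", "C", "B")

# Positions within unique(a2), i.e. in order of first appearance
ic <- match(a2, unique(a2))
ic
# [1] 1 2 1 3

# The index vector rebuilds the input, like MATLAB's c(ic)
identical(unique(a2)[ic], a2)
# [1] TRUE

# factor() sorts its levels (A, B, C), so the indices refer to sorted values
as.integer(factor(a2))
# [1] 3 1 3 2
```

If the indices should refer to unique(a) in its original order, match (or factor with levels = unique(a)) is the safer choice.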
Suppose we have a vector:
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
Expected output:
v_index <- c(5,6,7)
v always starts and ends with 0. There can be only one cluster of zeros between two 1s.
Seems simple enough, but I can't get my head around it...
I think this will do
which(cumsum(v == 1L) == 1L)[-1L]
## [1] 5 6 7
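The idea here is that cumsum(v == 1L) increases at every "one", so it labels the stretch that starts at each "one". Selecting the positions labelled 1 gives the first "one" together with the zeroes that follow it, and dropping the first element removes the "one" itself (because you only want the zeroes).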
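The intermediate values make this easier to follow. A short sketch, where grp and first_run are illustrative names:

```r
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)

# The running count of ones increases at every 1
grp <- cumsum(v == 1L)
grp
# [1] 0 0 0 1 1 1 1 2 3 4 4 4

# Positions where exactly one 1 has been seen so far
first_run <- which(grp == 1L)
first_run
# [1] 4 5 6 7

# Drop the position of the 1 itself
first_run[-1L]
# [1] 5 6 7
```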
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
v_index <- seq(which(v != 0)[1] + 1, which(v != 0)[2] - 1, 1)
> v_index
[1] 5 6 7
Explanation: I ask which indices are not equal to 0:
which(v!=0)
Then I take the first and second indices from that vector, add 1 to the first and subtract 1 from the second, and create a sequence between them.
This is probably one of the simplest answers out there. Find which items are equal to one, then produce a sequence from the first two indexes, incrementing the first and decrementing the second.
block <- which(v == 1)
start <- block[1] + 1
end <- block[2] - 1
v_index <- start:end
v_index
[1] 5 6 7
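The sequence-based answers rely on the first two 1s enclosing the zeros. If the input could begin with adjacent 1s (an assumption that goes beyond the example in the question), a sketch that takes every zero lying strictly between the first and last 1 may be more robust. The function name zeros_between is hypothetical, and the sketch assumes v contains at least one 1.

```r
# Indices of zeros located strictly between the first and the last 1
zeros_between <- function(v) {
  ones  <- which(v == 1)
  zeros <- which(v == 0)
  zeros[zeros > min(ones) & zeros < max(ones)]
}

zeros_between(c(0,0,0,1,0,0,0,1,1,1,0,0))
# [1] 5 6 7

# Adjacent 1s before the cluster: block[1] + 1 would exceed block[2] - 1 here
zeros_between(c(0,1,1,0,1,0))
# [1] 4
```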
I would like to write a small function that works on a data frame column to detect (and set to 0) sequences of positive values located between sequences of zeros, but only if these sequences of positive values are no more than 5 values long.
Here's a small example showing what my data looks like (initial_data column) and what I would like to obtain in the end (final_data column):
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
This sentence also sums up the rule:
"If there's a sequence of positive values, not longer than 5 values, and located between at least two or three 0-values (before and after this sequence of positive values), then set this sequence to 0 as well"
Any advice on how to do this easily?
Thanks a lot!!!
Here's a possible approach using the rle function:
DF <- data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),
                 final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
# using rle create an object with the sequences of consecutive elements
# having the same sign (-1 means negative, 0 means zero, 1 means positive)
enc <- rle(sign(DF$initial_data))
# find the positive sequences having maximum 5 elements
posSequences <- which(enc$values == 1 & enc$lengths <= 5)
# remove index=1 or index=length(enc$values) if present because
# they can't be surrounded by 0
posSequences <- posSequences[posSequences != 1 &
                             posSequences != length(enc$values)]
# check if they're preceded and followed by at least 2 zeros
# (if not remove the index); vapply also copes with an empty posSequences
toForceToZero <- vapply(posSequences, FUN=function(idx){
  enc$values[idx-1] == 0 &&
  enc$lengths[idx-1] >= 2 &&
  enc$values[idx+1] == 0 &&
  enc$lengths[idx+1] >= 2}, FUN.VALUE=logical(1))
posSequences <- posSequences[toForceToZero]
# reverse the run-length encoding, setting NA where we want to force to zero
v <- enc$values
v[posSequences] <- NA
# create the final data vector by forcing NAs to 0
final_data <- DF$initial_data
final_data[is.na(rep.int(v, enc$lengths))] <- 0
# check if is equal to your desired output
all(DF$final_data == final_data)
# > [1] TRUE
My best friend rle to the rescue:
notzero<-rle(as.logical(unlist(DF)))
Run Length Encoding
lengths: int [1:7] 4 3 6 8 20 8 7
values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now just find all locations where values is TRUE and lengths <= 5, and replace the values at those locations with FALSE. Then invoke inverse.rle to get the desired output.
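A minimal sketch of those steps, applied to the initial_data column only rather than to unlist(DF). The names x, enc, keep and result are illustrative. Note that this simple version does not check how many zeros surround a run, and it would also clear a short positive run at the very start or end of the vector.

```r
x <- c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)

# Runs of zero (FALSE) and non-zero (TRUE) values
enc <- rle(x != 0)

# Mark short non-zero runs (at most 5 values) as FALSE
enc$values[enc$values & enc$lengths <= 5] <- FALSE

# Expand back to one logical per element and zero out everything not kept
keep <- inverse.rle(enc)
result <- ifelse(keep, x, 0)

expected <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)
all(result == expected)
# [1] TRUE
```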