R: Counting consecutive letters in a string

I currently have a string in R that looks like this:
df <- c ("BMMBMMBMMMMMBMMBM")
I need to determine how many runs of two or more consecutive M's appear in this string (in this example it's 4).
I've been using str_count(df, "MM"), but this counts every non-overlapping pair of M's in the string (which returns 5).
Any help would be great...
Thanks!

Here's a base R approach without regular expressions:
with(rle(unlist(strsplit(df, ""))), sum(values == "M" & lengths >= 2))
# [1] 4
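To see what the rle() one-liner is doing, here is a small sketch that spells out the intermediate steps. The names x, runs, run_table and m_runs are only illustrative and are not part of the answer above.

```r
# Split the string into single characters and run-length encode them.
x <- "BMMBMMBMMMMMBMMBM"
runs <- rle(strsplit(x, "")[[1]])

# One row per run: the letter and how many times it repeats in a row.
run_table <- data.frame(letter = runs$values, n = runs$lengths)

# Keep only the runs of "M" that are at least two letters long.
m_runs <- run_table[run_table$letter == "M" & run_table$n >= 2, ]

nrow(m_runs)  # 4
m_runs$n      # 2 2 5 2
```

The trailing single M is its own run of length 1, which is why it is not counted.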

A possible approach is:
stringr::str_count(df, "MM+")
#output
[1] 4
+ means one or more of the preceding character, so MM+ matches a whole run of two or more M's.
In base R:
lengths(gregexpr("MM+", df))
gregexpr returns a list, each element corresponds to one element of df.
lengths returns the length of each list element.
EDIT: as per the comment by @docendo discimus, the second option (base R) is a little dangerous: gregexpr returns -1 when the pattern is not found, so lengths reports 1 instead of 0.
lengths(gregexpr("xyz+", df))
#output
[1] 1
A safer option is:
lapply(gregexpr("MM+", df), function(x) length(x[x > 0]))
#output
[[1]]
[1] 4
lapply(gregexpr("xyz+", df), function(x) length(x[x > 0]))
#output
[[1]]
[1] 0
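If a plain integer vector is more convenient than a list, the same idea can be wrapped with vapply. This is only a sketch; count_matches is a hypothetical helper name, not a base R or stringr function.

```r
# Hypothetical helper: count non-overlapping matches of `pattern` in each string.
count_matches <- function(strings, pattern) {
  hits <- gregexpr(pattern, strings)
  # gregexpr() reports "no match" as a single -1, so count positive positions only.
  vapply(hits, function(pos) sum(pos > 0), integer(1))
}

counts <- count_matches(c("BMMBMMBMMMMMBMMBM", "BBB", "MM"), "MM+")
counts  # 4 0 1
```

Because it returns one integer per input string, it can be used directly on a character column.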

Base solution:
s <- "BMMBMMBMMMMMBMMBM"
lengths(gregexpr("MM+", s))
## [1] 4
Note that the input called df in the question is a character string, not a data frame, and c("X") is identical to "X" so the c and the parentheses are not needed.

Try the following pattern:
str_count(df,"(M)\\1+")
This counts each run of two or more M's as a single match.
Or, equivalently:
str_count(df,"M{2,}")
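To check that the patterns in these answers really pick out the same runs, one option is to look at the matched substrings instead of only counting them. The sketch below uses base R's regmatches; perl = TRUE is my own choice here so that all three patterns go through the same regex engine.

```r
df <- "BMMBMMBMMMMMBMMBM"
patterns <- c("MM+", "(M)\\1+", "M{2,}")

# For each pattern, extract the substrings it matches.
matched <- lapply(patterns, function(p) {
  regmatches(df, gregexpr(p, df, perl = TRUE))[[1]]
})
names(matched) <- patterns

matched[["M{2,}"]]  # "MM" "MM" "MMMMM" "MM"
```

Each pattern matches the run of five M's once, which is what brings the count down from 5 to 4.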

Related

ifelse conditional assignment of tibbles [duplicate]

I've found R's ifelse statements to be pretty handy from time to time. For example:
ifelse(TRUE,1,2)
# [1] 1
ifelse(FALSE,1,2)
# [1] 2
But I'm somewhat confused by the following behavior.
ifelse(TRUE,c(1,2),c(3,4))
# [1] 1
ifelse(FALSE,c(1,2),c(3,4))
# [1] 3
Is this a design choice that's above my paygrade?
The documentation for ifelse states:
ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.
Since you are passing test values of length 1, you are getting results of length 1. If you pass longer test vectors, you will get longer results:
> ifelse(c(TRUE, FALSE), c(1, 2), c(3, 4))
[1] 1 4
So ifelse is intended for the specific purpose of testing a vector of booleans and returning a vector of the same length, filled with elements taken from the (vector) yes and no arguments.
It is a common confusion, because of the function's name, to use this when really you want just a normal if () {} else {} construction instead.
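A short sketch of the difference, using made-up variable names: the length of the result follows test for ifelse, while if/else returns the whole chosen value.

```r
test <- c(TRUE, FALSE, TRUE)

# Element-wise: position i of the result comes from yes[i] or no[i].
elementwise <- ifelse(test, c(1, 2, 3), c(10, 20, 30))
elementwise  # 1 20 3

# A length-1 test gives a length-1 result, whatever the length of yes/no.
truncated <- ifelse(TRUE, c(1, 2), c(3, 4))
truncated  # 1

# if/else evaluates a single condition and returns the whole chosen value.
whole <- if (TRUE) c(1, 2) else c(3, 4)
whole  # 1 2
```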
I bet you want a simple if statement instead of ifelse - in R, if isn't just a control-flow structure, it can return a value:
> if(TRUE) c(1,2) else c(3,4)
[1] 1 2
> if(FALSE) c(1,2) else c(3,4)
[1] 3 4
Note that you can circumvent the problem if you assign the result inside the ifelse:
ifelse(TRUE, a <- c(1,2), a <- c(3,4))
a
# [1] 1 2
ifelse(FALSE, a <- c(1,2), a <- c(3,4))
a
# [1] 3 4
Use `if` as a function, e.g.
> `if`(T,1:3,2:4)
[1] 1 2 3
yeah, I think ifelse() is really designed for when you have a big long vector of tests and want to map each to one of two options. For example, I often do colors for plot() in this way:
plot(x,y, col = ifelse(x>2, 'red', 'blue'))
If you had a big long vector of tests but wanted pairs for outputs, you could use sapply() or plyr's llply() or something, perhaps.
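One possible reading of that suggestion, sketched with lapply so each test maps to a whole pair; the data and names here are invented for illustration.

```r
x <- c(1, 3, 5)

# One pair per test: c(1, 2) where x > 2, otherwise c(3, 4).
pairs <- lapply(x > 2, function(flag) if (flag) c(1, 2) else c(3, 4))

# Optionally stack the pairs into a matrix with one row per test.
pair_matrix <- do.call(rbind, pairs)
pair_matrix[1, ]  # 3 4
```

lapply keeps each pair intact in a list; sapply would simplify the same result into a matrix with one column per test.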
Sometimes the user just needs a switch statement instead of an ifelse. In that case:
condition <- TRUE
switch(2-condition, c(1, 2), c(3, 4))
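# [1] 1 2
(This is another way of writing the if/else from Ken Williams's answer: 2 - condition is 1 when condition is TRUE and 2 when it is FALSE, so switch picks the first or second vector.)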
Here is an approach similar to the one suggested by Cath, but it works with existing, pre-assigned vectors.
It is based on using get(), like so:
a <- c(1,2)
b <- c(3,4)
get(ifelse(TRUE, "a", "b"))
# [1] 1 2
In your case, using if_else from dplyr would have been helpful: if_else is more strict than ifelse, and throws an error for your case:
library(dplyr)
if_else(TRUE,c(1,2),c(3,4))
#> `true` must be length 1 (length of `condition`), not 2
Found on everydropr:
ifelse(rep(TRUE, length(c(1,2))), c(1,2),c(3,4))
#>[1] 1 2
You can replicate the condition so that test has the desired length, and the result will then have that length too.

Check if value is in data frame

I'm trying to check if a specific value is anywhere in a data frame.
I know the %in% operator should allow me to do this, but it doesn't seem to work the way I would expect when applying to a whole data frame:
A = data.frame(B=c(1,2,3,4), C=c(5,6,7,8))
1 %in% A
[1] FALSE
But if I apply this to the specific column the value is in it works the way I expect:
1 %in% A$B
[1] TRUE
What is the proper way of checking if a value is anywhere in a data frame?
You could do:
any(A==1)
#[1] TRUE
OR with Reduce:
Reduce("|", A==1)
OR
length(which(A==1))>0
OR
is.element(1,unlist(A))
To find the location of that value you can do, for example:
which(A == 1, arr.ind=TRUE)
# row col
#[1,] 1 1
Or simply
sum(A == 1) > 0
#[1] TRUE
Loop through the variables with sapply, then use any.
any(sapply(A, function(x) 1 %in% x))
[1] TRUE
or following digEmAll's comment, you could use unlist, which takes a list (data.frame) and returns a vector.
1 %in% unlist(A)
[1] TRUE
The trick to understanding why your first attempt doesn't work comes down to understanding what a data frame is: a list of vectors of equal length. 1 %in% A checks whether 1 matches one of the elements of that list (that is, a whole column), not whether it matches any of the values inside those vectors.
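A small sketch of that point, reusing the data frame from the question. per_column is just an illustrative name.

```r
A <- data.frame(B = c(1, 2, 3, 4), C = c(5, 6, 7, 8))

# A data frame is a list with one element per column.
is.list(A)  # TRUE
length(A)   # 2

# Testing against the data frame itself compares 1 with whole columns.
1 %in% A    # FALSE

# Testing inside each column gives one answer per column.
per_column <- vapply(A, function(col) 1 %in% col, logical(1))
per_column       # B: TRUE, C: FALSE
any(per_column)  # TRUE
```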
Try:
any(A == 1)
This returns TRUE if the value occurs anywhere in the data frame and FALSE otherwise.
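One caveat worth sketching: if the data frame contains NA and the value is not present, any() returns NA, not FALSE. A possible workaround is na.rm = TRUE. has_value and A_na below are hypothetical names used only for this illustration.

```r
# Hypothetical helper: TRUE if `value` occurs anywhere in `df`, ignoring NAs.
has_value <- function(df, value) any(df == value, na.rm = TRUE)

A_na <- data.frame(B = c(NA, 2), C = c(5, 6))

plain <- any(A_na == 1)
plain               # NA, because the NA cell might have been a 1

has_value(A_na, 1)  # FALSE
has_value(A_na, 5)  # TRUE
```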

Index a Particular Numeric Vector From a List of Vectors in R

In R, for the sake of example, I have a list composed of equal-length numeric vectors of form similar to:
list <- list(c(1,2,3),c(1,3,2),c(2,1,3))
[[1]]
[1] 1 2 3
[[2]]
[1] 1 3 2
[[3]]
[1] 2 1 3
...
Every element of the list is unique. I want to get the index number of the element x <- c(2,1,3), or any other particular numeric vector within the list.
I've attempted using match(x,list), which gives a vector full of NA, and which(list==c(1,2,3)), which gives me a "(list) object cannot be coerced to type 'double'" error. Coercing the list to different types didn't seem to make a difference for the which function. I also attempted various grep* functions, but these don't return exact numeric vector matches. Using find(c(1,2,3),list) or even some fancy combinations of sapply, which and %in% didn't give me what I was looking for. I feel like I have a type problem. Any suggestions?
--Update--
Summary of Solutions
Thanks for your replies. The method in the comment for this question is clean and works well (via akrun).
> which(paste(list)==deparse(x))
[1] 25
The next method didn't work correctly
> which(duplicated(c(x, list(y), fromLast = TRUE)))
[1] 49
> y
[1] 1 2 3
This sounds good, but in the next block you can see the problem
> y<-c(1,3,2)
> which(duplicated(c(list, list(y), fromLast = TRUE)))
[1] 49
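More fundamentally, there are only 48 elements in the list I was using, so 49 is the position of the vector I appended and not a match in the original list. (In these calls fromLast = TRUE ended up inside c() instead of being passed to duplicated(), which would explain the result.)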
The last method works well (via BondedDust), and I would guess it is more efficient using an apply function:
> which( sapply(list, identical, y ))
[1] 25
match works fine if you pass it the right data.
L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
match(list(c(2,1,3)), L)
#[1] 3
Beware that this works by coercing the lists to character, so some edge cases will fail (with a hat-tip to @nicola):
match(list(1:3),L)
#[1] NA
even though:
1:3 == c(1,2,3)
#[1] TRUE TRUE TRUE
Although arguably:
identical(1:3,c(1,2,3))
#[1] FALSE
identical(1:3,c(1L,2L,3L))
#[1] TRUE
You can use duplicated(). If we add the matching vector to the end of the original list and set fromLast = TRUE, we will find the duplicate(s). Then we can use which() to get the index.
which(duplicated(c(list, list(c(2, 1, 3))), fromLast = TRUE))
# [1] 3
Or you could add it as the first element and subtract 1 from the result.
which(duplicated(c(list(c(2, 1, 3)), list))) - 1L
# [1] 3
Note that the type always matters with this type of comparison. When comparing integers and numerics, you will need to convert doubles to integers for this to run without issue. For example, 1:3 is not the same type as c(1, 2, 3).
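A minimal sketch of the type issue, assuming the list stores doubles as in the question: converting the query vector with as.numeric before comparing makes identical() succeed. The names no_match and match_idx are illustrative.

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))

typeof(1:3)         # "integer"
typeof(c(1, 2, 3))  # "double"

# An integer query never matches the double vectors exactly.
no_match <- which(vapply(L, identical, logical(1), y = 1:3))
no_match   # integer(0)

# Converting the query to double first makes the types agree.
match_idx <- which(vapply(L, identical, logical(1), y = as.numeric(1:3)))
match_idx  # 1
```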
> L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
> sapply(L, identical, c(2,1,3))
[1] FALSE FALSE TRUE
> which( sapply(L, identical, c(2,1,3)) )
[1] 3
This would be slightly less restrictive in its test:
> which( sapply(L, function(x,y){all(x==y)}, c(1:3)) )
[1] 1
Try:
vapply(list,function(z) all(z==x),TRUE)
#[1] FALSE FALSE TRUE
Wrapping the above line in which() gives you the index of the matching list element.
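Putting that together, here is a sketch of a small helper. find_vector is a hypothetical name, and the length check is my own addition so that vectors of different lengths are not compared through recycling.

```r
# Hypothetical helper: indices of the elements of `lst` equal to `target`.
find_vector <- function(lst, target) {
  hit <- vapply(
    lst,
    # Compare values only when the lengths agree, to avoid recycling.
    function(z) length(z) == length(target) && all(z == target),
    logical(1)
  )
  which(hit)
}

L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
find_vector(L, c(2, 1, 3))  # 3
find_vector(L, 1:3)         # 1 (== compares values, so integer vs double is fine)
find_vector(L, c(9, 9))     # integer(0)
```

Unlike identical(), the == comparison ignores the integer/double distinction, which may or may not be what you want.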
