I currently have a string in R that looks like this:
df <- c ("BMMBMMBMMMMMBMMBM")
I need to count how many runs of two or more consecutive M's appear in this string (in this example it's 4).
I've been using str_count(df, "MM"), but that counts every non-overlapping pair of adjacent M's, which returns 5 here.
Any help would be great...
Thanks!
Here's a base R approach without regular expressions:
with(rle(unlist(strsplit(df, ""))), sum(values == "M" & lengths >= 2))
# [1] 4
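For reference, the intermediate rle() result on the question's string shows the character runs; the sum then counts the runs of "M" whose length is at least 2:
rle(unlist(strsplit(df, "")))
# Run Length Encoding
#   lengths: int [1:10] 1 2 1 2 1 5 1 2 1 1
#   values : chr [1:10] "B" "M" "B" "M" "B" "M" "B" "M" "B" "M"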
A possible approach is:
stringr::str_count(df, "MM+")
#output
[1] 4
+ means one or more of the preceding character, so "MM+" matches each run of two or more M's as a single match.
In base R:
lengths(gregexpr("MM+", df))
gregexpr returns a list in which each element corresponds to one element of df.
lengths returns the length of each list element.
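For illustration, the intermediate gregexpr() result on the question's string is a one-element list of match start positions, with the match lengths stored as an attribute, so lengths() gives the number of matches:
gregexpr("MM+", df)
# [[1]]
# [1]  2  5  8 14
# attr(,"match.length")
# [1] 2 2 5 2
# (remaining attributes omitted)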
EDIT: as per the comment by @docendo discimus, the second option is a little dangerous, since it will return 1 even if the pattern is not found:
lengths(gregexpr("xyz+", df))
#output
[1] 1
A safer option is:
lapply(gregexpr("MM+", df), function(x) length(x[x > 0]))
#output
[[1]]
[1] 4
lapply(gregexpr("xyz+", df), function(x) length(x[x > 0]))
#output
[[1]]
[1] 0
Base solution:
s <- "BMMBMMBMMMMMBMMBM"
lengths(gregexpr("MM+", s))
## [1] 4
Note that the input called df in the question is a character string, not a data frame, and c("X") is identical to "X" so the c and the parentheses are not needed.
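A quick check of the second point:
identical(c("X"), "X")
## [1] TRUE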
Try the following pattern:
str_count(df,"(M)\\1+")
This counts each run of two or more M's as a single match.
Or
str_count(df,"M{2,}")
I'm trying to create a calculator that multiplies permutation groups written in cyclic form (the process of which is described in this post, for anyone unfamiliar: https://math.stackexchange.com/questions/31763/multiplication-in-permutation-groups-written-in-cyclic-notation). Although I know this would be easier to do with Python or something else, I wanted to practice writing code in R since it is relatively new to me.
My game plan is to take an input, such as "(1 2 3)(2 4 1)", and split it into two separate lists or vectors. However, I am having trouble starting this because, from my understanding of character functions (which I researched here: https://www.statmethods.net/management/functions.html), I will ultimately have to use grep() to find the points where ")(" occurs in my string and split from there. However, grep only takes vectors for its argument, so I am trying to coerce my string into a vector. In researching this problem, I have mostly seen people suggest using as.integer(unlist(str_split())); however, this doesn't work for me because when I split, not everything is an integer and those values become NA, as seen in this example.
library(tidyverse)
x <- "(1 2 3)(2 4 1)"
x <- as.integer(unlist(str_split(x, " ")))
x
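For reference, the as.integer() call above raises a "NAs introduced by coercion" warning, and printing x then gives:
# [1] NA  2 NA  4 NA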
Is there an alternative way to turn a string into a vector when not everything involved is an integer? I also realize that the way I am trying to split up the two permutations is very roundabout, but based on the character functions I researched, it seemed like the only way. If there are other functions that would make this easier, please let me know.
Thank you!
Comments in the code.
x <- "(1 2 3)(2 4 1)"
out1 <- strsplit(x, split = ")(", fixed = TRUE)[[1]] # split on close and open bracket
out2 <- gsub("[()]", replacement = "", out1) # remove the parentheses
out3 <- strsplit(out2, " ") # tease out numbers between spaces
lapply(out3, as.integer)
[[1]]
[1] 1 2 3
[[2]]
[1] 2 4 1
There aren't really any scalars in R. Single values like 1, TRUE, and "a" are all 1-element vectors. grep(pattern, x) will work fine on your original string. As a starting point for getting towards your desired goal, I would suggest splitting the groups using:
> str_extract_all(x, "\\([0-9 ]+\\)")
[[1]]
[1] "(1 2 3)" "(2 4 1)"
If we need to split the string while keeping the brackets, we can split at the boundary between a closing and an opening bracket:
strsplit(x, "(?<=\\))(?=\\()", perl = TRUE)[[1]]
#[1] "(1 2 3)" "(2 4 1)"
Or we can use convenient wrapper from qdapRegex
library(qdapRegex)
ex_round(x, include.marker = TRUE)[[1]]
#[1] "(1 2 3)" "(2 4 1)"
Alternative: using library(magrittr)
x <- "(1 2 3)(2 4 1)"
x %>%
  gsub("^\\(", "c(", .) %>%
  gsub("\\)\\(", "),c(", .) %>%
  gsub("(?=\\s\\d)", ", ", ., perl = TRUE) %>%
  paste0("list(", ., ")") %>%
  {eval(parse(text = .))}
result:
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 4 1
You could use chartr with read.table:
read.table(text = chartr("()", " \n", x))
# V1 V2 V3
# 1 1 2 3
# 2 2 4 1
I've found R's ifelse statements to be pretty handy from time to time. For example:
ifelse(TRUE,1,2)
# [1] 1
ifelse(FALSE,1,2)
# [1] 2
But I'm somewhat confused by the following behavior.
ifelse(TRUE,c(1,2),c(3,4))
# [1] 1
ifelse(FALSE,c(1,2),c(3,4))
# [1] 3
Is this a design choice that's above my paygrade?
The documentation for ifelse states:
ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.
Since you are passing test values of length 1, you are getting results of length 1. If you pass longer test vectors, you will get longer results:
> ifelse(c(TRUE, FALSE), c(1, 2), c(3, 4))
[1] 1 4
So ifelse is intended for the specific purpose of testing a vector of booleans and returning a vector of the same length, filled with elements taken from the (vector) yes and no arguments.
It is a common confusion, because of the function's name, to use this when really you want just a normal if () {} else {} construction instead.
I bet you want a simple if statement instead of ifelse; in R, if isn't just a control-flow structure, it can return a value:
> if(TRUE) c(1,2) else c(3,4)
[1] 1 2
> if(FALSE) c(1,2) else c(3,4)
[1] 3 4
Note that you can circumvent the problem if you assign the result inside the ifelse:
ifelse(TRUE, a <- c(1,2), a <- c(3,4))
a
# [1] 1 2
ifelse(FALSE, a <- c(1,2), a <- c(3,4))
a
# [1] 3 4
Use `if`, e.g.
> `if`(T,1:3,2:4)
[1] 1 2 3
Yeah, I think ifelse() is really designed for when you have a big long vector of tests and want to map each to one of two options. For example, I often do colors for plot() this way:
plot(x,y, col = ifelse(x>2, 'red', 'blue'))
If you had a big long vector of tests but wanted pairs for outputs, you could use sapply() or plyr's llply() or something, perhaps.
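For example, a minimal sketch of that idea using base lapply() (the tests vector here is just illustrative):
tests <- c(TRUE, FALSE, TRUE)
lapply(tests, function(t) if (t) c(1, 2) else c(3, 4))
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 3 4
#
# [[3]]
# [1] 1 2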
Sometimes the user just needs a switch statement instead of an ifelse. In that case:
condition <- TRUE
switch(2-condition, c(1, 2), c(3, 4))
#### [1] 1 2
(which is another syntax for the approach in Ken Williams's answer)
Here is an approach similar to that suggested by Cath, but it can work with existing pre-assigned vectors.
It is based on using get(), like so:
a <- c(1,2)
b <- c(3,4)
get(ifelse(TRUE, "a", "b"))
# [1] 1 2
In your case, if_else from dplyr would have been helpful: if_else is stricter than ifelse and throws an error here:
library(dplyr)
if_else(TRUE,c(1,2),c(3,4))
#> `true` must be length 1 (length of `condition`), not 2
Found on everydropr:
ifelse(rep(TRUE, length(c(1,2))), c(1,2),c(3,4))
#>[1] 1 2
Replicating the condition to the length of the output vector makes ifelse return the desired result.
I've been looking around for quite a while now, but can't seem to solve this problem, although I feel like it should be an easy one.
I have 54 factors containing differing numbers of strings, names of pathways to be exact. For example, here are two factors with the elements they contain:
> PWe1
[1] Gene_Expression
[2] miR-targeted_genes_in_muscle_cell_-_TarBase
[3] Generic_Transcription_Pathway
> PWe2
[1] miR-targeted_genes_in_epithelium_-_TarBase
[2] miR-targeted_genes_in_leukocytes_-_TarBase
[3] miR-targeted_genes_in_lymphocytes_-_TarBase
[4] miR-targeted_genes_in_muscle_cell_-_TarBase
What I would like to do is take these and combine them into one big data frame with 54 columns, where each column holds the names from one corresponding factor. I've tried cbind, cbind.data.frame and a couple of other options, but those return numeric values instead of strings.
Expected output:
PWe1 PWe2
Gene_Expression miR-targeted_genes_in_epithelium_-_TarBase
miR-targeted_genes_in_muscle_cell_-_TarBase miR-targeted_genes_in_leukocytes_-_TarBase
Generic_Transcription_Pathway miR-targeted_genes_in_lymphocytes_-_TarBase
NA miR-targeted_genes_in_muscle_cell_-_TarBase
I'm quite a beginner when it comes to R, could anyone nudge me towards a possible solution?
Thanks in advance!
lst <- mget(ls(pattern="PW")) #<--- create a list with all necessary vectors
ind <- lengths(lst)           #<--- length of each vector
as.data.frame(do.call(cbind,
       lapply(lst, `length<-`, max(ind)))) #<--- pad to the maximum length and convert to data.frame
# PWe1 PWe2
# 1 Gene_Expression miR-targeted_genes_in_epithelium_-_TarBase
# 2 miR-targeted_genes_in_muscle_cell_-_TarBase miR-targeted_genes_in_leukocytes_-_TarBase
# 3 Generic_Transcription_Pathway miR-targeted_genes_in_lymphocytes_-_TarBase
# 4 <NA> miR-targeted_genes_in_muscle_cell_-_TarBase
v1 <- PWe1; v2 <- PWe2   # the two factors from the question
l1 <- max(length(v1), length(v2))
length(v1) <- l1
length(v2) <- l1
cbind(as.character(v1), as.character(v2))
# [,1] [,2]
#[1,] "Gene_Expression" "miR-#targeted_genes_in_epithelium_-_TarBase"
#[2,] "miR-targeted_genes_in_muscle_cell_-_TarBase" "miR-#targeted_genes_in_leukocytes_-_TarBase"
#[3,] "Generic_Transcription_Pathway" "miR-#targeted_genes_in_lymphocytes_-_TarBase"
#[4,] NA "miR-#targeted_genes_in_muscle_cell_-_TarBase"
If you convert your factors to characters before you use cbind, you don't get numeric values:
testFrame <- data.frame(cbind(as.character(PWe1), as.character(PWe2)))
If the lengths of the two vectors differ, cbind throws a warning and the elements of the shorter vector are recycled. If that is unsatisfactory in your case, maybe a data.frame object is not the right choice?
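For example, a quick illustration of the recycling (and the warning) with two toy vectors:
cbind(1:3, 1:2)
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    2
# [3,]    3    1
# Warning message:
# In cbind(1:3, 1:2) :
#   number of rows of result is not a multiple of vector length (arg 2)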
My current way is
coalesce <- function(x){
if (is.null(x)) NA else x
}
data[,aa:=sapply(JSON, function(x) coalesce(x$a))]
data[,bb:=sapply(JSON, function(x) x$b)]
> JSON <- list(list(a=1, b=1), list(b=2))
> JSON
[[1]]
[[1]]$a
[1] 1
[[1]]$b
[1] 1
[[2]]
[[2]]$b
[1] 2
> sapply(JSON, function(x) coalesce(x$a))
[1] 1 NA
> sapply(JSON, function(x) x$b)
[1] 1 2
JSON is a list of lists; each inner list may contain an element a, which I would like to grab. If a doesn't exist, NA is returned. Each list must contain b. Both a and b are always scalars.
My Rprof output tells me the majority of the time is spent in sapply, FUN, and coalesce.
I am wondering if there is any way to improve it?
Update
Sample data
x <- list(a=1, b=1)
y <- list(a=1)
JSON <- rep(list(x,y),300000)
system.time(sapply(JSON, function(x) x$a))
system.time(sapply(JSON, function(x) coalesce(x$b)))
Try coalescing after you extract the value, and stick to lapply; that should speed things up (and if you post a reasonable benchmarking sample, we can test it):
unlist(lapply(lapply(JSON, "[[", "a"), coalesce))
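On the small two-element JSON from the question (list(list(a = 1, b = 1), list(b = 2))), with the original coalesce(), this returns:
unlist(lapply(lapply(JSON, "[[", "a"), coalesce))
# [1]  1 NA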
There's an error in the way you're using sapply - what you want is:
sapply(JSON, function(x) coalesce(x)$a)
But that's really not optimal, and it returns NULL when coalesce returns NA (probably not what you want).
Modify coalesce:
coalesce <- function(x){
if (is.null(x$a)) NA else x$a
}
And do:
data[,b:=sapply(JSON, coalesce)]
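Applied to the small two-element JSON from the question, the modified coalesce() yields the expected values:
sapply(JSON, coalesce)
# [1]  1 NA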