Unexpected results from str_detect() - r

str_detect(c("abc", "xyz"), letters) does not return the expected results.
It should be a vector of
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE TRUE TRUE TRUE
But instead it returns
str_detect(c("abc", "xyz"), letters)
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE TRUE FALSE TRUE
Why? And how do I get the desired result?

This happens because str_detect() recycles its arguments: the two strings are paired with the 26 patterns in turn, so it checks abc against a, then xyz against b, then abc against c, and so on. Paste abc and xyz together into a single string, or just supply "abcxyz" directly, though I'm assuming this is a simplified version of a more complex problem.
library(stringr)
rgx <- paste0(c("abc", "xyz"), collapse = "")
str_detect(rgx, letters)
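If the strings cannot simply be pasted together, another option is to test each letter against every string and collapse the per-string results with any(). The following is a minimal base-R sketch of that idea, not the answerer's method; the names strings and found are made up for illustration, and grepl(fixed = TRUE) stands in for str_detect().

```r
# Hypothetical input: the two strings from the question
strings <- c("abc", "xyz")

# For each letter, check whether it occurs in ANY of the strings.
# vapply() guarantees one logical value per letter.
found <- vapply(
  letters,
  function(ch) any(grepl(ch, strings, fixed = TRUE)),
  logical(1)
)

# Letters that occur in at least one string
names(found)[found]
#> [1] "a" "b" "c" "x" "y" "z"
```

Because every letter is compared against every string explicitly, nothing depends on argument recycling.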

Related

subsetting by index in R

I have a vector with indexes:
indexes
[1] 25 2 16 23
and another vector with logical:
logical
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I want to keep all items of logical except those whose positions are stored in indexes.
I thought this would have an easy solution, but mine doesn't work:
for(index in indexes){
logical[index] = NULL
}
You could just use negative (-) indexing:
indexes <- c(25, 2, 16, 23)
logicals <- sample(c(TRUE, FALSE), 25, replace = TRUE)
logicals
#> [1] FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
#> [13] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
#> [25] FALSE
logicals[-indexes]
#> [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
#> [13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
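One caveat with negative indexing is worth a small sketch: if the vector of positions to drop happens to be empty, x[-integer(0)] returns an empty vector rather than x unchanged. The example below uses deterministic, made-up data (flags and drop are illustrative names) and shows an equivalent %in%-based form that avoids this edge case.

```r
# Hypothetical data: 25 alternating logical values
flags <- rep(c(TRUE, FALSE), length.out = 25)
drop  <- c(25, 2, 16, 23)

# Negative indexing removes the listed positions
kept <- flags[-drop]
length(kept)
#> [1] 21

# Equivalent form: keep positions that are NOT in drop
kept2 <- flags[!(seq_along(flags) %in% drop)]
identical(kept, kept2)
#> [1] TRUE

# Edge case: with nothing to drop, negative indexing returns an empty vector,
# while the %in% form returns the full vector
none <- integer(0)
length(flags[-none])
#> [1] 0
length(flags[!(seq_along(flags) %in% none)])
#> [1] 25
```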

Logical vector to see whether an element of a df is contained within a df inside a list

I tried:
mdf$CLAVE.EMISORA %in% BMV[[9]]$`CLAVE EMISORA`
But it only returns:
logical(0)
For some reason the reverse seems to work:
BMV[[9]]$`CLAVE EMISORA` %in% mdf$CLAVE.EMISORA
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[20] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[39] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[58] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[77] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
My data (mdf): I have it but I don't know how to embed
My list (BMV): .... I don't know how to copy a list to clipboard sorry...
logical(0) is a logical vector of length 0.
You're getting this because you're trying to check whether each element of a length-0 vector is present in BMV[[9]]$`CLAVE EMISORA`; with no elements to check, the result is empty.
If you run
length(mdf$CLAVE.EMISORA)
you'll get 0 as output.
The reverse works because you're checking whether each element of a non-zero-length vector is present in a vector of length 0, which gives one FALSE per element.
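One plausible cause, offered here only as an assumption since the data was never posted, is that mdf has no column literally named CLAVE.EMISORA (for example, it is named `CLAVE EMISORA` with a space). In that case $ returns NULL, which has length 0. The sketch below reproduces both behaviours with made-up data; mdf and ref are hypothetical stand-ins.

```r
# Hypothetical data frame whose column name contains a space
mdf <- data.frame(`CLAVE EMISORA` = c("AAA", "BBB"), check.names = FALSE)
ref <- c("BBB", "CCC")

# Asking for a column that does not exist returns NULL (length 0)
missing_col <- mdf$CLAVE.EMISORA
length(missing_col)
#> [1] 0

# %in% returns one value per element on the left-hand side
missing_col %in% ref   # nothing to check
#> logical(0)
ref %in% missing_col   # two elements, nothing to match against
#> [1] FALSE FALSE

# With the correct (backtick-quoted) name the comparison works
mdf$`CLAVE EMISORA` %in% ref
#> [1] FALSE  TRUE
```

Checking names(mdf) is a quick way to confirm the exact column name.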

create numeric vector based on values in logic vector- R

I have a logic vector in R something like this:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[55] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I want to construct another numeric vector that contains a 1 if the logic vector is true and a 0 if it is false. I have tried the following code
## create an empty vector
numericvec <- vector(mode="numeric", length=0)
## for loop
for (i in logicvec){
if(i == TRUE){
c(numericvec, 1)
} else {
c(numericvec, 0)
}
}
The for loop syntax seems OK because I don't get errors when I run it, but it isn't adding any values to the numeric vector.
This should work:
numericvec <- as.numeric(logicvec)
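No need for a for() loop: R is vectorised and typically operates on entire vectors. Your loop has no effect because the result of c(numericvec, 1) is never assigned back to numericvec.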
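As a small illustration with a made-up logicvec, the sketch below shows the vectorised conversion next to a version of the original loop that assigns the result of c() back to the vector. The loop is included only to show what was missing; the one-line conversion is the simpler choice.

```r
# Hypothetical input
logicvec <- c(FALSE, TRUE, FALSE, TRUE)

# Vectorised conversion: TRUE becomes 1, FALSE becomes 0
as.numeric(logicvec)
#> [1] 0 1 0 1

# as.integer() gives the same values stored as integers
as.integer(logicvec)
#> [1] 0 1 0 1

# The original loop, fixed by assigning the result of c() back
numericvec <- numeric(0)
for (flag in logicvec) {
  numericvec <- c(numericvec, if (flag) 1 else 0)
}
numericvec
#> [1] 0 1 0 1
```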

Search for vector of motifs in vector of sequences with dataframe output

I have a set of nucleotide sequences in a vector of strings called x.
I want to check whether some (say 10) motifs are present in x. I want to produce a data frame or table where the rows are the sequences in x and the columns are the patterns/motifs in the vector sdseqs.
sdframe <- data.frame
sdseqs = c("AGGAG.+ATG",
"AGAAG.+ATG","AAAGG.+ATG","GGAGG.+ATG","GAAGA.+ATG",
"GGAGA.+ATG","AAGGT.+ATG","AGGAA.+ATG","AAGGA.+ATG","GTGGA.+ATG")
for (i in 1:10) {
sdframe <- cbind(sdframe,(grepl(sdseqs[i], x)))
}
This code works, but the first column of the data frame is empty, filled with question marks. The other columns are populated with TRUE and FALSE, which is what I want.
I tried to define an empty data frame outside the loop at the beginning. I am new to R and I am coming from Perl. This is what I usually did in Perl: you define the variables used within a loop outside it. How can I do this in R?
Also, a viable option would be to delete the first column from my data frame, but that does not seem so straightforward to me.
Any help is appreciated.
The output I get with my code now:
sdframe
[1,] ? TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[2,] ? FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[3,] ? FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
[4,] ? TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] ? FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[6,] ? FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
[7,] ? FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[8,] ? FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[9,] ? FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] ? FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[11,] ? FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I want the same but without the first column of ?. Note that my x has 11 sequences; the motifs I checked for are the columns (10 columns, 11 counting the first one with ?).
A common R solution would use a function from the apply family to apply a function over a vector.
sdseqs = c(
"AGGAG.+ATG",
"AGAAG.+ATG",
"AAAGG.+ATG",
"GGAGG.+ATG",
"GAAGA.+ATG",
"GGAGA.+ATG",
"AAGGT.+ATG",
"AGGAA.+ATG",
"AAGGA.+ATG",
"GTGGA.+ATG"
)
sdframe <- sapply(sdseqs, function(one.motif) {
grepl(one.motif, x = x)
})
sdframe
AGGAG.+ATG AGAAG.+ATG AAAGG.+ATG GGAGG.+ATG GAAGA.+ATG GGAGA.+ATG AAGGT.+ATG AGGAA.+ATG AAGGA.+ATG GTGGA.+ATG
[1,] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
[2,] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
[3,] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
sdframe.t <- t(sdframe)
sdframe.t
[,1] [,2] [,3]
AGGAG.+ATG FALSE FALSE FALSE
AGAAG.+ATG TRUE TRUE TRUE
AAAGG.+ATG FALSE FALSE FALSE
GGAGG.+ATG FALSE FALSE FALSE
GAAGA.+ATG TRUE TRUE TRUE
GGAGA.+ATG TRUE TRUE TRUE
AAGGT.+ATG TRUE TRUE TRUE
AGGAA.+ATG FALSE FALSE FALSE
AAGGA.+ATG TRUE TRUE TRUE
GTGGA.+ATG FALSE FALSE FALSE
The first line does not actually create a data.frame: without parentheses, data.frame is the function itself, so your output is a list.
Instead of cbind you need rbind to add rows:
sdframe <- data.frame()
sdseqs = c("AGGAG.+ATG",
"AGAAG.+ATG","AAAGG.+ATG","GGAGG.+ATG","GAAGA.+ATG",
"GGAGA.+ATG","AAGGT.+ATG","AGGAA.+ATG","AAGGA.+ATG","GTGGA.+ATG")
for (i in 1:10) {
sdframe <- rbind(sdframe,(grepl(sdseqs[i], x)))
}
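To get the layout the question asks for (sequences as rows, motifs as columns) without growing an object inside a loop, one option is to build the logical matrix in a single call and convert it afterwards. This is a sketch under assumptions: the sequences in x are invented for illustration, and only three of the motifs are used.

```r
# Hypothetical sequences; the real x was not shown in the question
x <- c("AGGAGTTATG", "AGAAGCCATG", "TTTTTTTTTT")

# A subset of the motifs from the question
sdseqs <- c("AGGAG.+ATG", "AGAAG.+ATG", "AAAGG.+ATG")

# One column per motif, one row per sequence.
# vapply() checks that every grepl() call returns length(x) logicals.
hits <- vapply(sdseqs, function(motif) grepl(motif, x), logical(length(x)))

# Convert the matrix to a data frame and label the rows with the sequences
sdframe <- as.data.frame(hits)
rownames(sdframe) <- x

dim(sdframe)
#> [1] 3 3
```

Here the first sequence matches only the first motif and the second sequence matches only the second, so exactly two cells are TRUE.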

Create a logical or binary matrix/data.frame from a list of factors in R

I have a list of approximately 2 million elements. The list is made up of vectors of character strings. There are about 50 different character strings, so they can be considered factors. The vectors of character strings have different lengths, varying between 1 and 50 (i.e. the total number of character strings).
I want to convert the list to a logical or binary matrix/data.frame. Currently my method involves lapply and is incredibly slow; I would like to know if there is a vectorised approach.
library(dplyr); library(tidyr)
#create test data set
set.seed(123)
list1 <- list()
ListLength <-10
elementlength <- sample(1:5, ListLength, replace = TRUE )
for(i in 1:length(elementlength) ){
list1[[i]] <- sample(letters[1:15], elementlength[i])
}
#Create data frame from list using lapply
lapply(list1, function(n){
data.frame(type = n, value = TRUE) %>%
spread(., key = type, value )
}) %>% bind_rows()
I don't know whether it would be possible to preallocate the data frame and then fill it in somehow.
Type <- unique(unlist(list1, use.names = FALSE))
#Create empty dataframe
TypeMat <- data.frame(matrix(NA,
ncol = length(Type),
nrow = ListLength)) %>%
setNames(Type)
We could use mtabulate from qdapTools
library(qdapTools)
mtabulate(list1)!=0
# a b c d e f g h i j k l m o
#[1,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
#[3,] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
#[5,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#[6,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
#[8,] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
#[9,] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[10,]FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
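The preallocate-and-fill idea from the question can also be sketched in base R without extra packages: flatten the list once, compute a row and column position for every string, and set all those cells in a single matrix-indexed assignment. This is an illustrative alternative rather than the answerer's approach, and the small list1 below is made up.

```r
# Hypothetical small input list
list1 <- list(c("a", "c"), "b", c("c", "b", "d"))

# All distinct strings become the columns
values <- unlist(list1, use.names = FALSE)
types  <- sort(unique(values))

# Row position: which list element each string came from
row_id <- rep(seq_along(list1), lengths(list1))
# Column position: which type each string is
col_id <- match(values, types)

# Preallocate a FALSE matrix, then fill every (row, column) pair at once
mat <- matrix(FALSE, nrow = length(list1), ncol = length(types),
              dimnames = list(NULL, types))
mat[cbind(row_id, col_id)] <- TRUE

mat
#>          a     b     c     d
#> [1,]  TRUE FALSE  TRUE FALSE
#> [2,] FALSE  TRUE FALSE FALSE
#> [3,] FALSE  TRUE  TRUE  TRUE
```

Indexing with a two-column matrix avoids any per-element loop, which matters for a list of around 2 million elements.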
