Exclude multiple words from a vector with grepl [duplicate]

Here is some sample data:
exclude.words <- c("zoznam","azet","dovera","joj","alza","telecom","google","post","sme")
main.data <- c("zoznam","registration","azet","azet.com","dovera","dna","joj","alza","telecom","google","post","sme")
The following works when the words match exactly, but note that azet.com is not excluded. For that we could use agrepl().
main.data[!(main.data %in% exclude.words)]
So how to use agrepl with two vectors?
main.data[!agrepl(main.data, exclude.words)]

As commented, you can use:
main.data[!grepl(paste(exclude.words, collapse = "|"), main.data)]
to exclude any element of main.data that partially or completely matches one of the exclude.words.
paste(exclude.words, collapse = "|")
creates a single string with "|" (regular-expression alternation, i.e. OR) between the exclude.words, which can be used as a single pattern in grepl. Therefore, you don't need to loop over the individual words.
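As a minimal sketch of the approach above on the sample data from the question (the names pattern and kept are introduced here only for illustration). Note that each word is treated as a regular expression, so a word containing metacharacters such as "." would need escaping first:

```r
exclude.words <- c("zoznam", "azet", "dovera", "joj", "alza",
                   "telecom", "google", "post", "sme")
main.data <- c("zoznam", "registration", "azet", "azet.com", "dovera", "dna",
               "joj", "alza", "telecom", "google", "post", "sme")

# Join the words into one alternation pattern: "zoznam|azet|...|sme"
pattern <- paste(exclude.words, collapse = "|")

# grepl() is TRUE for every element that contains any of the words,
# so negating it keeps only the elements without a match
kept <- main.data[!grepl(pattern, main.data)]
kept
# [1] "registration" "dna"
```

Because grepl() looks for the pattern anywhere in the string, "azet.com" is dropped along with "azet".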

main.data[!as.logical(rowSums(sapply(exclude.words, function(x) agrepl(x, main.data))))]
# [1] "registration" "dna"
# clarification
sapply(exclude.words, function(x) agrepl(x, main.data))
# zoznam azet dovera joj alza telecom google post sme
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [7,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# [8,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# [9,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [10,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

You can use this functional programming approach:
library(functional)
funcs = lapply(exclude.words, function(u) function(x) x[!grepl(u, x)])
Reduce(Compose, funcs)(main.data)
#[1] "registration" "dna"
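If you would rather not depend on the functional package, roughly the same idea can be sketched in base R by letting Reduce() carry the filtered vector from one word to the next. This is only an illustration; the name remaining is not from the answer above:

```r
exclude.words <- c("zoznam", "azet", "dovera", "joj", "alza",
                   "telecom", "google", "post", "sme")
main.data <- c("zoznam", "registration", "azet", "azet.com", "dovera", "dna",
               "joj", "alza", "telecom", "google", "post", "sme")

# Start from main.data and, for each word u in turn,
# drop the elements of the running result that contain u
remaining <- Reduce(function(acc, u) acc[!grepl(u, acc)],
                    exclude.words, main.data)
remaining
# [1] "registration" "dna"
```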

Related

Logical vector to see whether an element of a df is contained within a df inside a List

I tried:
mdf$CLAVE.EMISORA %in% BMV[[9]]$`CLAVE EMISORA`
But it only returns:
logical(0)
For some reason the reverse seems to work:
BMV[[9]]$`CLAVE EMISORA` %in% mdf$CLAVE.EMISORA
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[20] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[39] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[58] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[77] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
My data (mdf): I have it but I don't know how to embed
My list (BMV): .... I don't know how to copy a list to clipboard sorry...

logical(0) is a vector of base type logical with 0 length.
You're getting this because you're trying to check whether any element of a zero-length vector is present in BMV[[9]]$`CLAVE EMISORA`.
If you run
length(mdf$CLAVE.EMISORA)
you'll get 0 as output.
The reverse works because you're checking whether each element of a non-zero-length vector is present in a zero-length vector, which gives one FALSE per element.
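A small sketch of this behaviour: the result of %in% always has the length of its left-hand side. The values and the data frame below are hypothetical, and the misspelled-column case is only one possible cause, since the original data is not shown:

```r
empty <- character(0)                # stands in for the zero-length column
codes <- c("AC", "ALFA", "BIMBO")    # hypothetical values

res_left_empty  <- empty %in% codes  # logical(0): nothing on the left to test
res_right_empty <- codes %in% empty  # FALSE FALSE FALSE: one result per element

# Asking a data frame for a column that does not exist returns NULL,
# which also has length 0 and therefore gives logical(0) with %in%
df <- data.frame(CODE = codes)
missing_col <- df$CLAVE.EMISORA
length(missing_col)
# [1] 0
```

Checking names(mdf) against the name used after $ is a quick way to rule this out.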

Is it possible to keep memory while using apply()?

I need to run the function lapply on an activation_status list t times so that iteration t of the function remembers the results from iteration t-1.
The list is basically a two-dimensional array representing the status of each item i over multiple periods t and looks like this:
n_items <<- 100
n_iterations <<- 10
activation_status <-
lapply(1:n_iterations,
FUN = function(t, bool, i) rep(bool, t),
FALSE, n_items)
Now during each iteration t, I randomly activate (set to TRUE) a number of items within the list but I want all the items already activated at time t-1 to stay active (note that I define activation_status within the update function so that it's accessible in the inner functions).
updateActivation <- function(t) {
activation_status[[t]] <- as.logical(rbinom(n_items, 1, prob = .5))
activation_status[[t]][activation_status[[t-1]] == TRUE] <- TRUE
}
But then
lapply(1:n_iterations, updateActivation)
throws an error:
Error in activation_status[[t - 1]] : attempt to select less than one element in get1index
I know I could use a loop, but I wonder whether it is possible to:
1. do something like this with the apply family of functions, and
2. do it faster.

Not sure if I fully understood the question, but it seems like you are looking for a recursion.
In that case Reduce() can be used instead of lapply():
activation_status <- rep(FALSE, 10)
n_iterations <- 5
Reduce(function(y, x) as.logical(rbinom(length(y), 1, prob=0.1)) | y,
x=1:n_iterations, init=activation_status, accumulate=TRUE
)
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[[3]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
[[4]]
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[[5]]
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[[6]]
[1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
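For comparison, here is a plain-loop sketch of the same idea. The error in the question arises at t = 1, where activation_status[[t - 1]] asks for element 0; in addition, assignments made inside updateActivation() only change a local copy of the list. The seed, the sizes and the name newly_active below are assumptions chosen for illustration:

```r
set.seed(1)            # assumed seed, only for reproducibility
n_items <- 10
n_iterations <- 5

# Preallocate the list and fill the first period directly
activation_status <- vector("list", n_iterations)
activation_status[[1]] <- as.logical(rbinom(n_items, 1, prob = 0.5))

# From the second period on, an item is active if it is newly activated
# OR was already active in the previous period
for (t in seq_len(n_iterations)[-1]) {
  newly_active <- as.logical(rbinom(n_items, 1, prob = 0.5))
  activation_status[[t]] <- newly_active | activation_status[[t - 1]]
}
```

Because each period depends on the previous one, the work is inherently sequential; Reduce() expresses the same dependency more compactly, but it is not expected to be fundamentally faster than this loop.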

We could probably do this without using any apply command.
#Set seed for reproducibility
set.seed(123)
#Create initialization demo data
activation_status <- rep(FALSE, 10)
#Number of values to select
n_iterations <- 5
#Sequence from 1:n_iterations
seq_n_iterations <- seq_len(n_iterations)
#Create matrix to hold output
output <- replicate(n_iterations, activation_status)
#Select n_iterations random values from 1:length(activation_status)
#You can change this if you want to use some specific distribution
points <- sample(length(activation_status), n_iterations)
#Create column indices
cols <- rep(seq_n_iterations, seq_n_iterations)
#Create row indices: column j gets the first j sampled points
rows <- points[ave(cols, cols, FUN = seq_along)]
#Change those values to TRUE
output[cbind(rows, cols)] <- TRUE
output
# [,1] [,2] [,3] [,4] [,5]
# [1,] FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE TRUE TRUE
# [3,] TRUE TRUE TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE FALSE TRUE
# [7,] FALSE FALSE FALSE FALSE FALSE
# [8,] FALSE FALSE FALSE TRUE TRUE
# [9,] FALSE FALSE FALSE FALSE FALSE
#[10,] FALSE TRUE TRUE TRUE TRUE
If you want them as a list:
asplit(output, 2)
#[[1]]
# [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[[2]]
# [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[[3]]
# [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[[4]]
# [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
#[[5]]
# [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE

Search for vector of motifs in vector of sequences with dataframe output

I have a set of nucleotide sequences in a vector of strings called x.
I want to check whether some (say 10) motifs are present in x. I want to produce a data frame or table where the rows are the sequences in x and the columns are the patterns/motifs in the vector sdseqs.
sdframe <- data.frame
sdseqs = c("AGGAG.+ATG",
"AGAAG.+ATG","AAAGG.+ATG","GGAGG.+ATG","GAAGA.+ATG",
"GGAGA.+ATG","AAGGT.+ATG","AGGAA.+ATG","AAGGA.+ATG","GTGGA.+ATG")
for (i in 1:10) {
sdframe <- cbind(sdframe,(grepl(sdseqs[i], x)))
}
This code works, but the first column of the result is filled with question marks. The other columns are populated with TRUE and FALSE, which is what I want.
I tried to define an empty data frame outside the loop at the beginning. I am new to R and I am coming from Perl. This is what I usually did in Perl: define the variables used within a loop outside of it. How can I do this in R?
Also, a viable option would be to delete the first column from my data frame, but that does not seem so straightforward to me.
Any help is appreciated.
The output I get with my code now:
sdframe
[1,] ? TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[2,] ? FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[3,] ? FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
[4,] ? TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] ? FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[6,] ? FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
[7,] ? FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[8,] ? FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[9,] ? FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] ? FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[11,] ? FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I want the same but without the first column of ?. Note that my x has 11 sequences, and the motifs I checked for are the columns (10 columns, 11 counting the first with ?).

A common R solution would use a function from the apply family to apply a function over a vector.
sdseqs = c(
"AGGAG.+ATG",
"AGAAG.+ATG",
"AAAGG.+ATG",
"GGAGG.+ATG",
"GAAGA.+ATG",
"GGAGA.+ATG",
"AAGGT.+ATG",
"AGGAA.+ATG",
"AAGGA.+ATG",
"GTGGA.+ATG"
)
sdframe <- sapply(sdseqs, function(one.motif) {
grepl(one.motif, x = x)
})
sdframe
AGGAG.+ATG AGAAG.+ATG AAAGG.+ATG GGAGG.+ATG GAAGA.+ATG GGAGA.+ATG AAGGT.+ATG AGGAA.+ATG AAGGA.+ATG GTGGA.+ATG
[1,] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
[2,] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
[3,] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
sdframe.t <- t(sdframe)
sdframe.t
[,1] [,2] [,3]
AGGAG.+ATG FALSE FALSE FALSE
AGAAG.+ATG TRUE TRUE TRUE
AAAGG.+ATG FALSE FALSE FALSE
GGAGG.+ATG FALSE FALSE FALSE
GAAGA.+ATG TRUE TRUE TRUE
GGAGA.+ATG TRUE TRUE TRUE
AAGGT.+ATG TRUE TRUE TRUE
AGGAA.+ATG FALSE FALSE FALSE
AAGGA.+ATG TRUE TRUE TRUE
GTGGA.+ATG FALSE FALSE FALSE
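If an actual data frame is wanted rather than a matrix, the result can be converted afterwards. A minimal sketch, assuming a small made-up x and only two of the motifs (the sequences and the name hits are hypothetical):

```r
x <- c("AGGAGCCATG", "TTTTAGAAGTTATG", "CCCC")   # hypothetical sequences
sdseqs <- c("AGGAG.+ATG", "AGAAG.+ATG")

# One logical column per motif, one row per sequence;
# vapply() checks that every motif returns length(x) logical values
hits <- vapply(sdseqs, function(one.motif) grepl(one.motif, x),
               logical(length(x)))

# Convert the logical matrix to a data frame
sdframe <- as.data.frame(hits)
dim(sdframe)
# [1] 3 2
```

Building the whole matrix in one call also avoids growing an object inside a loop, so there is no empty first column to remove.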

The first line does not actually create a data frame: data.frame without parentheses assigns the function itself, so your output is a list.
Use data.frame() and rbind instead of cbind to add one row per motif:
sdframe <- data.frame()
sdseqs = c("AGGAG.+ATG",
"AGAAG.+ATG","AAAGG.+ATG","GGAGG.+ATG","GAAGA.+ATG",
"GGAGA.+ATG","AAGGT.+ATG","AGGAA.+ATG","AAGGA.+ATG","GTGGA.+ATG")
for (i in 1:10) {
sdframe <- rbind(sdframe,(grepl(sdseqs[i], x)))
}

Create a logical or binary matrix/data.frame from a list of factors in R

I have a list of approximately 2 million elements. The list is made up of vectors of character strings. There are about 50 different character strings, so they can be considered factors. The vectors of character strings have different lengths, varying between 1 and 50 (i.e. the total number of character strings).
I want to convert the list to a logical or binary matrix/data.frame. Currently my method involves lapply and is incredibly slow, I would like to know if there is a vectorised approach.
require(dplyr); require(tidyr)
#create test data set
set.seed(123)
list1 <- list()
ListLength <-10
elementlength <- sample(1:5, ListLength, replace = TRUE )
for(i in 1:length(elementlength) ){
list1[[i]] <- sample(letters[1:15], elementlength[i])
}
#Create data frame from list using lapply
lapply(list1, function(n){
data.frame(type = n, value = TRUE) %>%
spread(., key = type, value )
}) %>% bind_rows()
I don't know whether there is a way to preallocate the data frame and then fill it in somehow.
Type <- unique(unlist(list1, use.names = FALSE))
#Create empty dataframe
TypeMat <- data.frame(matrix(NA,
ncol = length(Type),
nrow = ListLength)) %>%
setNames(Type)

We could use mtabulate from qdapTools
library(qdapTools)
mtabulate(list1)!=0
# a b c d e f g h i j k l m o
#[1,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
#[3,] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
#[5,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#[6,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
#[8,] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
#[9,] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[10,]FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
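A base R alternative along the lines of the preallocation idea in the question: build a FALSE matrix once and set the matching cells with a two-column index matrix. This is only a sketch; the test data is generated in the spirit of the question's code, and the names types, rows, cols and mat are introduced for illustration:

```r
set.seed(123)
# Test data: a list of character vectors of varying length
list1 <- lapply(sample(1:5, 10, replace = TRUE),
                function(n) sample(letters[1:15], n))

# All distinct strings become the columns
types <- sort(unique(unlist(list1, use.names = FALSE)))

# Preallocate a logical matrix filled with FALSE
mat <- matrix(FALSE, nrow = length(list1), ncol = length(types),
              dimnames = list(NULL, types))

# Row index: element i is repeated once per string it contains
rows <- rep(seq_along(list1), lengths(list1))
# Column index: position of each string among the distinct types
cols <- match(unlist(list1, use.names = FALSE), types)

# Set all (row, column) pairs to TRUE in a single vectorised assignment
mat[cbind(rows, cols)] <- TRUE
```

No per-element function call is involved, so this should scale better than the lapply/spread version, although that has not been measured here.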

R Programming- Permutation regarding repetition and order

I have searched for and tried solutions using combn and the gtools library, to no avail.
I want to take a vector of the following:
x<-c(TRUE,FALSE)
and have it look like the following output:
Permutations with repetition (n=2, r=5)
Using Items: t,f
List has 32 entries.
{t,t,t,t,t} {t,t,t,t,f} {t,t,t,f,t} {t,t,t,f,f} {t,t,f,t,t} {t,t,f,t,f} {t,t,f,f,t} {t,t,f,f,f} {t,f,t,t,t} {t,f,t,t,f} {t,f,t,f,t} {t,f,t,f,f} {t,f,f,t,t} {t,f,f,t,f} {t,f,f,f,t} {t,f,f,f,f} {f,t,t,t,t} {f,t,t,t,f} {f,t,t,f,t} {f,t,t,f,f} {f,t,f,t,t} {f,t,f,t,f} {f,t,f,f,t} {f,t,f,f,f} {f,f,t,t,t} {f,f,t,t,f} {f,f,t,f,t} {f,f,t,f,f} {f,f,f,t,t} {f,f,f,t,f} {f,f,f,f,t} {f,f,f,f,f}
Any suggestions? I am quite a newbie at this, so any help is appreciated. I used the following online calculator to produce the output above. https://www.mathsisfun.com/combinatorics/combinations-permutations-calculator.html
Thanks!

Using the gtools library, I believe this is:
library(gtools)
permutations(2,5,v=c(TRUE,FALSE),repeats.allowed=TRUE)
## [,1] [,2] [,3] [,4] [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE TRUE
## [3,] FALSE FALSE FALSE TRUE FALSE
## [4,] FALSE FALSE FALSE TRUE TRUE
## [5,] FALSE FALSE TRUE FALSE FALSE
## [6,] FALSE FALSE TRUE FALSE TRUE
## [7,] FALSE FALSE TRUE TRUE FALSE
## [8,] FALSE FALSE TRUE TRUE TRUE
## [9,] FALSE TRUE FALSE FALSE FALSE
##[10,] FALSE TRUE FALSE FALSE TRUE
##[11,] FALSE TRUE FALSE TRUE FALSE
##[12,] FALSE TRUE FALSE TRUE TRUE
##[13,] FALSE TRUE TRUE FALSE FALSE
##[14,] FALSE TRUE TRUE FALSE TRUE
##[15,] FALSE TRUE TRUE TRUE FALSE
##[16,] FALSE TRUE TRUE TRUE TRUE
##[17,] TRUE FALSE FALSE FALSE FALSE
##[18,] TRUE FALSE FALSE FALSE TRUE
##[19,] TRUE FALSE FALSE TRUE FALSE
##[20,] TRUE FALSE FALSE TRUE TRUE
##[21,] TRUE FALSE TRUE FALSE FALSE
##[22,] TRUE FALSE TRUE FALSE TRUE
##[23,] TRUE FALSE TRUE TRUE FALSE
##[24,] TRUE FALSE TRUE TRUE TRUE
##[25,] TRUE TRUE FALSE FALSE FALSE
##[26,] TRUE TRUE FALSE FALSE TRUE
##[27,] TRUE TRUE FALSE TRUE FALSE
##[28,] TRUE TRUE FALSE TRUE TRUE
##[29,] TRUE TRUE TRUE FALSE FALSE
##[30,] TRUE TRUE TRUE FALSE TRUE
##[31,] TRUE TRUE TRUE TRUE FALSE
##[32,] TRUE TRUE TRUE TRUE TRUE
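The same set of rows can also be produced without gtools by using expand.grid() from base R. A minimal sketch (the names r and perms are chosen here for illustration); note that the row order differs from the output above because expand.grid() varies the first column fastest:

```r
x <- c(TRUE, FALSE)
r <- 5

# One copy of x per position; expand.grid() forms every combination
perms <- expand.grid(rep(list(x), r))

nrow(perms)
# [1] 32

# as.matrix() gives a logical matrix if that is more convenient
perm_matrix <- as.matrix(perms)
```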