Find a partial text match in a vector row [r]

Basically I'm looking to write a function that will take a vector of strings and a search term as input, and output a boolean vector. After this, I'd also like to take a list of strings and run it through this same function to output multiple results vectors, one for each string.
So the initial data looks like:
> searchVector <- cbind(c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
> searchVector
[,1]
[1,] "aaa1"
[2,] "aaa2"
[3,] ""
[4,] "bbb1,aaa1,ccc1"
[5,] "ddd1,ccc1,aaa1"
and this is what we'd hope to see:
>findTrigger(c("aaa","bbb"),searchVector)
[aaa] [bbb]
[1,] 1 0
[2,] 1 0
[3,] 0 0
[4,] 1 1
[5,] 1 0
I've made the following attempt:
searchfunction <- function (searchTerms, searchVector) {
output = matrix( nrow = length(searchVector),
ncol = length(searchTerms),
dimnames = searchTerms)
for (j in seq(1,length(searchTerms)))
{
for (i in seq(1,length(searchVector)))
{
output[i,j]=is.numeric(pmatch(searchTerms[j], searchVector[i]))
}
}
return(as.numeric(output))
}
But I just get a matrix of all 1's. I'm fairly new to R and I've looked around online, but haven't had any luck. Any help would be greatly appreciated, Thanks!

The key is to use the function grepl. This should get you started:
searchVector <- c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1")
res <- lapply(c('aaa','bbb'),function(pattern,x) as.numeric(grepl(pattern = pattern,x = x)),x = searchVector)
do.call(cbind,res)
To explore this a bit, start with just grepl:
> grepl('aaa',searchVector)
[1] TRUE TRUE FALSE TRUE TRUE
> as.numeric(grepl('aaa',searchVector))
[1] 1 1 0 1 1
Then I'm just wrapping that up in lapply, to loop over the vector c('aaa','bbb'). This will return a list of vectors, which we then combine into the matrix you indicated using do.call and cbind.
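As for why the attempt in the question returns all 1's, a small check (a sketch added for illustration, base R only) suggests the cause: pmatch returns either an integer index or an integer NA, and both count as numeric, so is.numeric is TRUE either way.

```r
# pmatch gives an integer index on a match and NA_integer_ otherwise
hit  <- pmatch("aaa", "aaa1")            # partial match at the start of the string
miss <- pmatch("aaa", "bbb1,aaa1,ccc1")  # no match: pmatch only looks at the start

# Both results have integer type, so is.numeric() cannot tell them apart
is.numeric(hit)
is.numeric(miss)

# grepl() searches anywhere in the string and returns TRUE/FALSE directly
found <- grepl("aaa", "bbb1,aaa1,ccc1")
found
```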

mapply and grep or grepl (thanks joran) are your friends:
searchTerms <- c("aaa", "bbb")
searchVector <- cbind(c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
M <- mapply(grepl, searchTerms, MoreArgs=list(x=searchVector))
M
aaa bbb
[1,] TRUE FALSE
[2,] TRUE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] TRUE FALSE
If you want it as 1,0: apply(M,2,as.numeric)
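The pieces above could be wrapped into a single function, as the question asks. This is only a sketch: the name findTrigger is taken from the call in the question, and using fixed = TRUE (literal matching rather than regular expressions) is an assumption about what the asker wants.

```r
# Hypothetical helper named after the call shown in the question
findTrigger <- function(searchTerms, searchVector) {
  x <- as.character(searchVector)  # drops the matrix dimensions added by cbind()
  # One column per search term; fixed = TRUE treats each term as literal text
  hits <- vapply(searchTerms,
                 function(term) grepl(term, x, fixed = TRUE),
                 logical(length(x)))
  # Convert TRUE/FALSE to 1/0 while keeping dimensions and column names
  hits + 0L
}

searchVector <- cbind(c("aaa1", "aaa2", "", "bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
res <- findTrigger(c("aaa", "bbb"), searchVector)
res
```

Note that with a single-element searchVector, vapply returns a named vector rather than a one-row matrix, so that case may need extra handling.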

Related

how to see if any element of a list contains only a certain value in R

I have a list of many matrices, say my_list. I want to check whether any of those matrices has only zeros as its elements and, if so, which matrices in the list those are.
library(R.utils)
output_vec <- vector()
for(i in 1:length(my_list)){
asZero(as.vector(my_list[[i]]))}
This gives me TRUE/FALSE values, but I am not able to return the indices of the matrices whose elements are all zero. I'd appreciate any help with this.
We may need to wrap with all: loop over the list of matrices with sapply, create a logical expression (x == 0), and wrap it with all to return a single TRUE/FALSE. If all values excluding NAs (na.rm = TRUE) are 0, this returns TRUE, or else FALSE.
sapply(my_list, function(x) all(x == 0, na.rm = TRUE))
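Since the question asks for the positions of the all-zero matrices, the logical vector can be passed to which. A minimal sketch, with a made-up my_list for illustration:

```r
# Illustrative data (not from the question): matrices 1 and 3 are all zeros
my_list <- list(matrix(0, 3, 3),
                matrix(c(1, 0, 1, 0), 2, 2),
                matrix(0, 2, 2))

# TRUE for each matrix whose non-NA entries are all zero
all_zero <- sapply(my_list, function(x) all(x == 0, na.rm = TRUE))

# Indices of the all-zero matrices
zero_idx <- which(all_zero)
zero_idx
# [1] 1 3
```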
You can use norm to check whether all entries in a matrix are zero, e.g.,
sapply(my_list, norm) == 0
since the norm of a matrix is 0 if and only if all of its values are zero.
Example
> my_list
[[1]]
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
[[2]]
[,1] [,2]
[1,] 1 1
[2,] 0 0
> sapply(my_list, norm) == 0
[1] TRUE FALSE

Mapply on a function with conditional expressions

Background: PDF Parse My program looks for data in scanned PDF documents. I've created a CSV with rows representing various parameters to be searched for in a PDF, and columns for the different flavors of document that might contain those parameters. There are different identifiers for each parameter depending on the type of document. The column headers use dot separation to uniquely identify the document by type, subtype... , like so: type.subtype.s_subtype.s_s_subtype.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
p1 str1 str2
p2 str3 str4
p3 str5 str6
p4 str7
...
I'm reading in PDF files, and based on the filepaths they can be uniquely categorized into one of these types. I can apply various logical conditions to a substring of a given filepath, and based on that I'd like to output an NxM Boolean matrix, where N = NROW(filepath_vector), and M = ncol(params_csv). This matrix would show membership of a given file in a type with TRUE, and FALSE elsewhere.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
fpath1 FALSE FALSE TRUE FALSE
fpath2 FALSE TRUE FALSE FALSE
fpath3 FALSE TRUE FALSE FALSE
fpath4 FALSE FALSE FALSE TRUE
...
My solution: I'm trying to apply a function to a matrix that takes a vector as argument, and applies the first element of the vector to the first row, the second element to the second row, etc... however, the function has conditional behavior depending on the element of the vector being applied.
I know this is very similar to the question below (my reference point), but the conditionals in my function are tripping me up. I've provided a simplified reproducible example of the issue below.
R: Apply function to matrix with elements of vector as argument
set.seed(300)
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
f <- function(x) {
sapply(v, g <- function(y) {
if(nchar(y)==8) {x=x*2
} else if (nchar(y)==7) {
if(grepl("^[[:alpha:]]*$", substr(y, 1, 1))) {x=x*3}
else {x}
} else if (nchar(y)<3) {x=x*4
} else {x=x-2}
})
}
mapply(f, as.data.frame(t(m)))
Desired output:
# [,1] [,2] [,3] [,4] [,5]
# [1,] -1 0 -1 -1 -1
# [2,] 4 4 0 4 0
# [3,] 3 0 3 3 0
# [4,] 2 0 2 2 0
# [5,] 1 1 1 1 0
But I get this error:
Error in if (y == 8) { : missing value where TRUE/FALSE needed
Can't seem to figure out the error or if I'm misguided elsewhere in my entire approach, any thoughts are appreciated.
Update (03April2018):
I had provided this as a toy example for the sake of reproducibility, but I think it would be more informative to use something similar to my actual code with @grand_chat's excellent solution. Hopefully this helps someone who's struggling with a similar issue.
chk <- c(NA, "abc.TRO", "def.TRO", "ghi.TRO", "kjl.TRO", "mno.TRO")
len <- c(8, NA, NA)
seed <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
A = matrix(seed, nrow=3, ncol=6, byrow=TRUE)
pairs <- mapply(list, as.data.frame(t(A)), len, SIMPLIFY=F)
f <- function(pair) {
x = unlist(pair[[1]])
y = pair[[2]]
if(y==8 & !is.na(y)) {
x[c(grep("TRO", chk))] <- (x[c(grep("TRO", chk))] & TRUE)
} else {x <- (x & FALSE)}
return(x)
}
t(mapply(f, pairs))
Output:
# $v1
# [1,] FALSE TRUE TRUE FALSE FALSE FALSE
# $v2
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE
# $v3
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE
You're processing the elements of vector v and the rows of your matrix m (columns of data frame t(m)) in parallel, so you could zip the corresponding elements into a list of pairs and process the pairs. Try this:
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
# Zip into pairs:
pairs <- mapply(list, as.data.frame(t(m)), v, SIMPLIFY=F)
# Define a function that acts on pairs:
f <- function(pair) {
x = pair[[1]] # a row of m
y = pair[[2]] # the matching element of v
if(nchar(y)==8) {x=x*2
} else if (nchar(y)==7) {
if(grepl("^[[:alpha:]]*$", substr(y, 1, 1))) {x=x*3}
else {x}
} else if (nchar(y)<3) {x=x*4
} else {x=x-2}
x # return the (possibly modified) row explicitly
}
# Apply it:
mapply(f, pairs, SIMPLIFY=F)
with result:
$V1
[1] -2 -1 -2 -2 -1
$V2
[1] 4 4 0 0 4
$V3
[1] 3 3 3 3 0
$V4
[1] 2 0 2 2 0
$V5
[1] 0 0 3 0 3
(This doesn't agree with your desired output because you don't seem to have applied your function f properly.)
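If a matrix is wanted instead of a list, the per-row results can be bound back together with do.call and rbind. The following is a simplified sketch of the same pairing idea using Map; the matrix, the keys, and the function name rule are made up for illustration and the conditions are reduced to three branches.

```r
# Illustrative inputs: one key per row of m
m <- matrix(c(1, 0, 1,
              0, 1, 1,
              1, 1, 0), nrow = 3, byrow = TRUE)
v <- c("ABCDEFGH", "", "321")

# Hypothetical rule: the key's length decides what happens to the row
rule <- function(row, key) {
  n <- nchar(key)
  if (n == 8) {
    row * 2
  } else if (n < 3) {
    row * 4
  } else {
    row - 2
  }
}

# split(m, row(m)) yields one plain vector per row, in row order
rows <- split(m, row(m))

# Map walks over rows and keys in parallel; rbind restores the matrix shape
out <- do.call(rbind, Map(rule, rows, v))
out
```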

R Loop and Matrices

I am trying to get this simple 'for loop' to work. I can't get dim(F4) to be a 6848x2 matrix. I just want to divide the row entries of two matrices. Here's what I have...
> dim(F3)
[1] 6848 2
> head(F3)
[,1] [,2]
[1,] 140.9838 516.0239
[2,] 140.9838 516.0239
[3,] 140.9838 516.0239
[4,] 140.9838 516.0239
[5,] 140.9838 516.0239
[6,] 175.5093 515.2280
> dim(scale)
[1] 6848 1
F4 <- matrix(, nrow = nrow(F1), ncol = 1)
for (i in 1:t){
F4[i,]<-(F3[i]/scale[i])} #ONLY WANT F3(i) ROW TO BE DIVIDED BY SCALE(i) ROW
> dim(F4) #DOESN'T GIVE ME 6848x2 Matrix
[1] 6848 1
No need to use a for loop here. Here is a vectorized solution:
F3/as.vector(scale) ## BAD! use of built-in function "scale" as a variable name!
Example :
mat <- matrix(1:8,4,2)
sx <- matrix(1:4,4,1)
mat /as.vector(sx)
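as.vector drops the dimensions of the one-column matrix, so the division recycles the resulting vector down each column (row i is divided by element i) instead of failing with a non-conformable arrays error.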
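To make the row-wise effect visible, here is the same example with its result, plus an equivalent call to sweep for comparison (a sketch; sweep is offered as an alternative, not as part of the original answer).

```r
mat <- matrix(1:8, 4, 2)
sx  <- matrix(1:4, 4, 1)

# Row i of mat is divided by sx[i]; the vector is recycled down each column
res <- mat / as.vector(sx)
res

# Same operation spelled out: sweep over rows (MARGIN = 1) with "/"
res_sweep <- sweep(mat, 1, as.vector(sx), "/")
```

Here the first column of res is all 1 (1/1, 2/2, 3/3, 4/4) and the second is 5, 3, 7/3, 2.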

Why does this vectorized matrix comparison fail?

I am trying to compare 1st row of a matrix with all rows of the same matrix. But the vectorized comparison is not returning correct results. Any reason why this may be happening?
m <- matrix(c(1,2,3,1,2,4), nrow=2, ncol=3, byrow=TRUE)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
> # Why does the first row not have 3 TRUE values?
> m[1,] == m
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
> m[1,] == m[1,]
[1] TRUE TRUE TRUE
> m[1,] == m[2,]
[1] TRUE TRUE FALSE
Follow-up. My actual data has a large number of rows (at least 10 million), so both time and memory add up. Are there any additional suggestions beyond the approaches below, which were suggested by others?
m <- matrix(rep(c(1,2,3), 1000000), ncol=3, byrow=TRUE)
> # by @alexis_laz
> m1 <- matrix(m[1,], nrow = nrow(m), ncol = ncol(m), byrow = T)
> system.time(m == m1)
user system elapsed
0.21 0.03 0.31
> object.size(m1)
24000112 bytes
> # by @PaulHiemstra
> system.time( t(apply(m, 1, function(x) x == m[1,])) )
user system elapsed
35.18 0.08 36.04
Follow-up 2. @alexis_laz, you are correct. I want to compare every row with each other and have posted a follow-up question on that (How to vectorize comparing each row of matrix with all other rows).
In the comparison m[1,] == m, the first term m[1,] is recycled (once) to equal the length of m. The comparison is then done column-wise.
You're comparing c(1,2,3) with c(1,1,2,2,3,4), thus c(1,2,3,1,2,3) with c(1,1,2,2,3,4), so you have one TRUE followed by five FALSE (packaged as a matrix to match the dimensions of m).
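The recycling can be reproduced by hand, which may make the column-wise behavior easier to see. A small sketch using the matrix from the question:

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

# A matrix is stored column by column, so flattening it gives 1 1 2 2 3 4
flat <- as.vector(m)

# m[1, ] is repeated until it is as long as the flattened matrix: 1 2 3 1 2 3
recycled <- rep(m[1, ], length.out = length(m))

# Element-by-element comparison: only the first position matches
by_hand <- recycled == flat
by_hand
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE
```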
As @MatthewLundberg pointed out, the recycling rules of R do not behave as you expected. In my opinion it is always better to state explicitly what to compare and not rely on R's assumptions. One way to make the correct comparison:
t(apply(m, 1, function(x) x == m[1,]))
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or:
m == rbind(m[1,], m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or by making R's recycling work in your favor (thanks to @Arun):
t(t(m) == m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
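Another way to line the first row up with the column-wise storage, sketched here as an alternative rather than taken from the answers, is to repeat each element of the first row nrow(m) times. No timing or memory claims are made for it; it builds a vector as long as m.

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

# 1 1 2 2 3 3: each first-row value repeated once per row, matching column order
first_row_by_col <- rep(m[1, ], each = nrow(m))

# The result keeps the dimensions of m
cmp <- m == first_row_by_col
cmp
```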

Select an element from each row of a matrix in R

The question is the same as here, but in R. I have a matrix and a vector such that
length(vec) == nrow(mat)
How do I get a vector v such that
v[i] == mat[i, vec[i]]
I tried to achieve this by using logical matrix:
> a = matrix(runif(12),4,3)
> a
[,1] [,2] [,3]
[1,] 0.6077585 0.5354680 0.2802681
[2,] 0.2596180 0.6358106 0.9336301
[3,] 0.5317069 0.4981082 0.8668405
[4,] 0.6150885 0.5164009 0.5797668
> sel = col(a) == c(1,3,2,1)
> sel
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE FALSE
> a[sel]
[1] 0.6077585 0.6150885 0.4981082 0.9336301
It selects the right elements but messes up the order, because logical indexing returns them in column order. I also thought of using mapply, but I don't know how to make it iterate through rows, like in apply.
upd: @gsk3 suggested using as.list(as.data.frame(t(a))), and this works. But I would still like to know if there is a more vectorized way, without lists.
I am not 100% sure I understand your question, but it seems like this may be close?
> b=c(1,3,2,1)
> i=cbind(1:nrow(a),b)
> a[i]
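Indexing a matrix with a two-column matrix of (row, column) pairs returns one element per pair, in the order of the pairs. A small sketch with a deterministic matrix (made up here so the result is easy to check by eye):

```r
# Columns are 1:4, 5:8 and 9:12
a <- matrix(1:12, nrow = 4, ncol = 3)
b <- c(1, 3, 2, 1)  # column to pick from each row

# Each row of idx is one (row, column) pair
idx <- cbind(seq_len(nrow(a)), b)

# Picks a[1,1], a[2,3], a[3,2], a[4,1], in row order
picked <- a[idx]
picked
# [1]  1 10  7  4
```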
