Background: PDF Parse
My program looks for data in scanned PDF documents. I've created a CSV with rows representing the parameters to search for in a PDF, and columns for the different flavors of document that might contain those parameters. Each parameter has a different identifier depending on the type of document. The column headers use dot separation to uniquely identify the document by type, subtype, and so on, like so: type.subtype.s_subtype.s_s_subtype.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
p1 str1 str2
p2 str3 str4
p3 str5 str6
p4 str7
...
I'm reading in PDF files, and based on the filepaths they can be uniquely categorized into one of these types. I can apply various logical conditions to a substring of a given filepath, and based on that I'd like to output an NxM Boolean matrix, where N = NROW(filepath_vector), and M = ncol(params_csv). This matrix would show membership of a given file in a type with TRUE, and FALSE elsewhere.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
fpath1 FALSE FALSE TRUE FALSE
fpath2 FALSE TRUE FALSE FALSE
fpath3 FALSE TRUE FALSE FALSE
fpath4 FALSE FALSE FALSE TRUE
...
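One way to build the membership matrix described above, as a minimal sketch. The type keys, the file paths, and the `classify` helper are hypothetical stand-ins for the real column headers and path rules, which are not shown in the question.

```r
# Hypothetical type keys (the CSV column headers) and file paths.
types  <- c("memo.a", "receipt.a", "invoice.a", "letter.a")
fpaths <- c("scans/invoice/f1.pdf", "scans/receipt/f2.pdf",
            "scans/receipt/f3.pdf", "scans/letter/f4.pdf")

# Hypothetical rule set: map one path to exactly one type key.
classify <- function(path) {
  if (grepl("/invoice/", path, fixed = TRUE)) {
    "invoice.a"
  } else if (grepl("/receipt/", path, fixed = TRUE)) {
    "receipt.a"
  } else if (grepl("/letter/", path, fixed = TRUE)) {
    "letter.a"
  } else {
    "memo.a"
  }
}

assigned <- vapply(fpaths, classify, character(1), USE.NAMES = FALSE)

# N x M logical matrix: TRUE where a file's assigned type equals the column's type.
membership <- outer(assigned, types, "==")
dimnames(membership) <- list(fpaths, types)
membership
```

Because each path is assigned exactly one type, every row of `membership` contains a single TRUE.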
My solution: I'm trying to apply a function to a matrix that takes a vector as argument, and applies the first element of the vector to the first row, the second element to the second row, etc... however, the function has conditional behavior depending on the element of the vector being applied.
I know this is very similar to the question below (my reference point), but the conditionals in my function are tripping me up. I've provided a simplified reproducible example of the issue below.
R: Apply function to matrix with elements of vector as argument
set.seed(300)
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
f <- function(x) {
sapply(v, g <- function(y) {
if(nchar(y)==8) {x=x*2
} else if (nchar(y)==7) {
if(grepl("^[[:alpha:]]*$", substr(y, 1, 1))) {x=x*3}
else {x}
} else if (nchar(y)<3) {x=x*4
} else {x=x-2}
})
}
mapply(f, as.data.frame(t(m)))
Desired output:
# [,1] [,2] [,3] [,4] [,5]
# [1,] -1 0 -1 -1 -1
# [2,] 4 4 0 4 0
# [3,] 3 0 3 3 0
# [4,] 2 0 2 2 0
# [5,] 1 1 1 1 0
But I get this error:
Error in if (y == 8) { : missing value where TRUE/FALSE needed
Can't seem to figure out the error or if I'm misguided elsewhere in my entire approach, any thoughts are appreciated.
Update (03April2018):
I had provided this as a toy example for the sake of reproducibility, but I think it would be more informative to show something closer to my actual code, using @grand_chat's excellent solution. Hopefully this helps someone who's struggling with a similar issue.
chk <- c(NA, "abc.TRO", "def.TRO", "ghi.TRO", "kjl.TRO", "mno.TRO")
len <- c(8, NA, NA)
seed <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
A = matrix(seed, nrow=3, ncol=6, byrow=TRUE)
pairs <- mapply(list, as.data.frame(t(A)), len, SIMPLIFY=F)
f <- function(pair) {
x = unlist(pair[[1]])
y = pair[[2]]
if(!is.na(y) && y==8) {
x[c(grep("TRO", chk))] <- (x[c(grep("TRO", chk))] & TRUE)
} else {x <- (x & FALSE)}
return(x)
}
t(mapply(f, pairs))
Output:
# $v1
# [1,] FALSE TRUE TRUE FALSE FALSE FALSE
# $v2
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE
# $v3
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE
You're processing the elements of vector v and the rows of your matrix m (columns of data frame t(m)) in parallel, so you could zip the corresponding elements into a list of pairs and process the pairs. Try this:
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
# Zip into pairs:
pairs <- mapply(list, as.data.frame(t(m)), v, SIMPLIFY=F)
# Define a function that acts on pairs:
f <- function(pair) {
x = pair[[1]]  # one row of m
y = pair[[2]]  # the matching element of v
if(nchar(y)==8) {x=x*2
} else if (nchar(y)==7) {
if(grepl("^[[:alpha:]]*$", substr(y, 1, 1))) {x=x*3}
} else if (nchar(y)<3) {x=x*4
} else {x=x-2}
x  # return the transformed row explicitly
}
# Apply it:
mapply(f, pairs, SIMPLIFY=F)
with result:
$V1
[1] -2 -1 -2 -2 -1
$V2
[1] 4 4 0 0 4
$V3
[1] 3 3 3 3 0
$V4
[1] 2 0 2 2 0
$V5
[1] 0 0 3 0 3
(This doesn't agree with your desired output because you don't seem to have applied your function f properly.)
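An alternative to zipping, sketched here on a small hand-made matrix rather than the random one above: index the rows directly and bind the results back into a matrix. The helper name `rule` and the sample values are assumptions made for illustration.

```r
# Small deterministic inputs (hypothetical, chosen so the result is easy to check).
m <- matrix(c(1, 0, 1,
              0, 1, 1), nrow = 2, byrow = TRUE)
v <- c("7IDJOPLI", "")

# Same branching as f above, written for one row x and one string y.
rule <- function(x, y) {
  n <- nchar(y)
  if (n == 8) {
    x * 2
  } else if (n == 7) {
    if (grepl("^[[:alpha:]]", y)) x * 3 else x
  } else if (n < 3) {
    x * 4
  } else {
    x - 2
  }
}

# Apply rule() to row i of m with element i of v, then stack the rows.
rows <- lapply(seq_len(nrow(m)), function(i) rule(m[i, ], v[i]))
out <- do.call(rbind, rows)
out
```

Here the first row is doubled (8 characters) and the second is multiplied by 4 (fewer than 3 characters), and the result keeps the matrix shape of `m`.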
Related
I would like to create a function that takes a string of at least one character, which may contain the digits 2 through 5, and determines whether some digit N occurs in a run of at least N consecutive copies.
If so, return the string true, otherwise return the string false.
For Example:
Input: "555123"
Output: false
Because 5 is found only 3 times instead of 5.
Or:
Input: "57333"
Output: true
Because 3 is found exactly 3 times.
Try rle + strsplit if you are working with base R
f <- function(s) {
with(
rle(unlist(strsplit(s, ""))),
any(as.numeric(values) <= lengths & lengths > 1)
)
}
and you will see
> f("555123")
[1] FALSE
> f("57333")
[1] TRUE
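To see why this works, it may help to look at what rle returns for one input. The second half is a stricter variant, under the assumption that only the digits 2 through 5 should count and that a run must be at least as long as the digit's value; `has_run` is a hypothetical name.

```r
s <- "57333"
runs <- rle(strsplit(s, "")[[1]])
runs$values   # "5" "7" "3"
runs$lengths  # 1 1 3

# Stricter variant (an assumption about the intended rule): only digits 2-5
# count, and a run must be at least as long as the digit's value.
has_run <- function(s) {
  runs <- rle(strsplit(s, "")[[1]])
  vals <- suppressWarnings(as.integer(runs$values))
  any(vals %in% 2:5 & runs$lengths >= vals)
}

has_run("555123")  # FALSE: the run of 5s has length 3
has_run("57333")   # TRUE: the run of 3s has length 3
```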
Late to the party but maybe still worth your while:
Data:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
Define vector with allowed numbers:
digits <- 2:5
Define alternation pattern with multiple backreferences:
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}", collapse = "|")
Input patt into str_detect:
library(stringr)
str_detect(x, patt)
[1] FALSE TRUE FALSE FALSE TRUE TRUE
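For readers without stringr, the same idea can be sketched in base R. This assumes one capture group per allowed digit, numbered in order, and uses grepl with perl = TRUE; it is an illustration rather than the answer's exact code.

```r
digits <- 2:5

# Group k captures digits[k]; the backreference \k then repeats it
# digits[k] - 1 more times, so digit N needs N consecutive copies.
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}",
               collapse = "|")
cat(patt, "\n")
# (2)\1{1}|(3)\2{2}|(4)\3{3}|(5)\4{4}

x <- c("555123", "57333", "22144")
grepl(patt, x, perl = TRUE)
# [1] FALSE  TRUE  TRUE
```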
You could check if the values in table correspond to the names.
x <- c('555123', '57333')
f <- \(x) {
s <- strsplit(x, '')
lapply(s, \(x) {
tb <- table(x)
names(tb) == tb
}) |> setNames(x)
}
f(x)
# $`555123`
# x
# 1 2 3 5
# TRUE FALSE FALSE FALSE
#
# $`57333`
# x
# 3 5 7
# TRUE FALSE FALSE
Another way would be:
my_func <- function(x) {
as.numeric(unlist(strsplit(x, ""))) -> all
table(all[all %in% 2:5]) -> f
any(names(f) == f)
}
# Input <- "555123"
# (my_func(Input))
# FALSE
# Input <- "57333"
# (my_func(Input))
# TRUE
In the following, logical operators don't seem to work properly.
a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)
wtf = data.frame(a, b, c, d)
wtf$huh = apply(wtf, 1, function(row) {
if (row['a'] == T) { return('we win') }
if (row['c'] < 5) { return('hooray') }
if (row['d'] == 1) { return('a thing') }
return('huh?')
})
Producing:
> wtf
a b c d huh
1 TRUE a 1 0 hooray
2 FALSE b 2 0 hooray
3 TRUE c 3 0 hooray
4 FALSE de 4 0 hooray
5 TRUE f 5 0 huh?
6 TRUE g 6 1 a thing
Where naively one would expect that in rows 1, 3, 5, and 6, there would be we win.
Can someone explain to me (1) why it does this, (2) how can this be fixed such that it doesn't happen, (3) why all my logical columns are seemingly changed to characters, and (4) how can a function be type-safely applied to rows in a data frame?
Why does this happen? Because apply is made for matrices. When you give it a data frame, the first thing that happens is that it gets converted to a matrix:
m = as.matrix(wtf)
m
# a b huh huh1
# [1,] " TRUE" "a" "huh?" "hooray"
# [2,] "FALSE" "b" "huh?" "huh?"
# [3,] " TRUE" "c" "huh?" "hooray"
# [4,] "FALSE" "de" "huh?" "huh?"
# [5,] " TRUE" "f" "huh?" "hooray"
# [6,] " TRUE" "g" "huh?" "hooray"
When that happens, your different data types are lost and your data frame-style indexing doesn't work anymore:
m['a']
# [1] NA
Solution? Use a simple for loop:
wtf$huh1 = NA
for (i in 1:nrow(wtf)) {
wtf$huh1[i] = if(wtf[i, 'a']) "hooray" else "huh?"
}
If you have a function foo then
wtf$huh2 = NA
for (i in 1:nrow(wtf)) {
wtf$huh2[i] = foo(wtf[i, 'a'])
}
Functions that aren't vectorized can be vectorized to avoid the need for loops:
foov = Vectorize(foo)
# then you can
wtf$huh4 = foov(wtf$a)
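A minimal sketch of the Vectorize step, assuming a hypothetical scalar function `foo` (the answer does not define one):

```r
# Hypothetical scalar function: works on one logical value at a time.
foo <- function(flag) if (flag) "hooray" else "huh?"

# Vectorize() wraps foo so it is called once per element of its argument.
foov <- Vectorize(foo)

flags <- c(TRUE, FALSE, TRUE)
res <- foov(flags)
res
```

Calling `foo(flags)` directly would fail or only look at the first element, depending on the R version, because `if` expects a single value.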
Probably the easiest way to fix this is using ifelse which is vectorized, so you don't need to deal with loops, or apply:
myfunc <- function(df) {
ifelse(df$a, 'hooray', 'huh?')
}
wtf$huh <- myfunc(wtf)
a b huh
1 TRUE a hooray
2 FALSE b huh?
3 TRUE c hooray
4 FALSE de huh?
5 TRUE f hooray
6 TRUE g hooray
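The same idea extends to the three conditions in the question by nesting ifelse calls. This is a sketch using the question's columns a, c, and d, with the checks in the order the question lists them:

```r
wtf <- data.frame(a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE),
                  c = c(1, 2, 3, 4, 5, 6),
                  d = c(0, 0, 0, 0, 0, 1))

# Each ifelse() works on whole columns, so every column keeps its own type.
wtf$huh <- ifelse(wtf$a, "we win",
           ifelse(wtf$c < 5, "hooray",
           ifelse(wtf$d == 1, "a thing", "huh?")))
wtf$huh
# [1] "we win" "hooray" "we win" "hooray" "we win" "we win"
```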
One advantage of a data.frame is that it can contain variables of different types.
lapply(wtf, class)
$a
[1] "logical"
$b
[1] "factor"
$huh
[1] "character"
As noted by Gregor, apply requires a matrix and will convert the object you give it to one if possible. But matrices cannot contain multiple variable types and so as.matrix will look for a lowest common denominator that can represent the data, in this case, character.
typeof(as.matrix(wtf))
[1] "character"
class(as.matrix(wtf))
[1] "matrix"
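The coercion can be reproduced on a small example. The data frame below is hypothetical; it shows that once a character column is present, as.matrix turns every cell into a string, and that comparisons on those strings are lexical.

```r
df <- data.frame(flag = c(TRUE, FALSE), label = c("a", "b"), n = c(1, 10))

m <- as.matrix(df)
typeof(m)   # "character": every cell is now a string
m[, "n"]    # the numbers have become strings as well

# A number compared with a string is coerced to a string first,
# so this is the lexical comparison "10" < "5".
"10" < 5    # TRUE

# Indexing the data frame directly keeps each column's own type.
is.logical(df$flag)     # TRUE
is.numeric(df[2, "n"])  # TRUE
```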
I have two character vectors a, b with different dimensions. I have to take each element in a and compare with all elements in b and note the element if there is a close match. For matching I'm using agrepl function.
Following is the sample data
a <- c("US","Canada","United States","United States of America")
b <- c("United States","U.S","United States","Canada", "America", "Spain")
Following is the code that I'm using to match. Please help me avoid the for loops, as my real data has more than 900 and 5000 records, respectively.
for(i in 1:4)
{
for(j in 1:6)
{
bFlag <- agrepl(a[i],b[j], max.distance = 0.1,ignore.case = TRUE)
if(bFlag)
{
#Custom logic
}
else
{
#Custom logic
}
}
}
You don't need a double loop, since agrepl's second argument accepts vectors of length >= 1. So you could do something like:
lapply(a, function(x) agrepl(x, b, max.distance = 0.1, ignore.case = TRUE))
# [[1]]
# [1] TRUE TRUE TRUE FALSE FALSE TRUE
#
# [[2]]
# [1] FALSE FALSE FALSE TRUE FALSE FALSE
#
# [[3]]
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
#
# [[4]]
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
You can add some custom logic inside the lapply call if needed, but that's not specified in the question so I'll just leave the output as a list of logicals.
If you want indices (of TRUEs) instead of logicals, you can use agrep instead of agrepl:
lapply(a, function(x) agrep(x, b, max.distance = 0.1,ignore.case = TRUE))
# [[1]]
# [1] 1 2 3 6
#
# [[2]]
# [1] 4
#
# [[3]]
# [1] 1 3
#
# [[4]]
# integer(0)
If you only want the first TRUE index, you can use:
sapply(a, function(x) agrep(x, b, max.distance = 0.1,ignore.case = TRUE)[1])
# US Canada United States United States of America
# 1 4 1 NA
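If the custom logic needs every (a, b) pair at once, the per-element results can be collected into one logical matrix. This is a sketch built on the same agrepl call; the name `hits` is an assumption.

```r
a <- c("US", "Canada", "United States", "United States of America")
b <- c("United States", "U.S", "United States", "Canada", "America", "Spain")

# One column per element of a, one row per element of b.
hits <- vapply(a,
               function(x) agrepl(x, b, max.distance = 0.1, ignore.case = TRUE),
               logical(length(b)))
rownames(hits) <- b

# Row/column positions of every close match, ready for custom logic.
matched <- which(hits, arr.ind = TRUE)
head(matched)
```

The matrix form makes it possible to handle matches and non-matches with vectorized indexing (`hits` and `!hits`) instead of an if/else inside a double loop.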
Basically I'm looking to write a function that will take a vector of strings and a search term as input, and output a boolean vector. After this, I'd also like to take a list of strings and run it through this same function to output multiple results vectors, one for each string.
So the initial data looks like:
> searchVector <- cbind(c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
> searchVector
[,1]
[1,] "aaa1"
[2,] "aaa2"
[3,] ""
[4,] "bbb1,aaa1,ccc1"
[5,] "ddd1,ccc1,aaa1"
and this is what we'd hope to see:
>findTrigger(c("aaa","bbb"),searchVector)
[aaa] [bbb]
[1,] 1 0
[2,] 1 0
[3,] 0 0
[4,] 1 1
[5,] 1 0
I've made the following attempt:
searchfunction <- function (searchTerms, searchVector) {
output = matrix( nrow = length(searchVector),
ncol = length(searchTerms),
dimnames = searchTerms)
for (j in seq(1,length(searchTerms)))
{
for (i in seq(1,length(searchVector)))
{
output[i,j]=is.numeric(pmatch(searchTerms[j], searchVector[i]))
}
}
return(as.numeric(output))
}
But I just get a matrix of all 1's. I'm fairly new to R and I've looked around online, but haven't had any luck. Any help would be greatly appreciated, Thanks!
The key is to use the function grepl. This should get you started:
searchVector <- c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1")
res <- lapply(c('aaa','bbb'),function(pattern,x) as.numeric(grepl(pattern = pattern,x = x)),x = searchVector)
do.call(cbind,res)
To explore this a bit, start with just grepl:
> grepl('aaa',searchVector)
[1] TRUE TRUE FALSE TRUE TRUE
> as.numeric(grepl('aaa',searchVector))
[1] 1 1 0 1 1
Then I'm just wrapping that up in lapply, to loop over the vector c('aaa','bbb'). This will return a list of vectors, which we then combine into the matrix you indicated using do.call and cbind.
mapply and grep or grepl (thanks joran) are your friend:
searchTerms <- c("aaa", "bbb")
searchVector <- cbind(c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
M <- mapply(grepl, searchTerms, MoreArgs=list(x=searchVector))
M
aaa bbb
[1,] TRUE FALSE
[2,] TRUE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] TRUE FALSE
If you want it as 1,0: apply(M,2,as.numeric)
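As for why the original attempt returned all 1's: pmatch returns an integer index or an integer NA, and is.numeric is TRUE for both. A sketch of `findTrigger` that returns the 0/1 matrix directly follows; it uses vapply and assumes the search terms are literal strings (hence fixed = TRUE).

```r
searchTerms  <- c("aaa", "bbb")
searchVector <- c("aaa1", "aaa2", "", "bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1")

# pmatch() yields NA here, but NA_integer_ still counts as numeric.
is.numeric(pmatch("bbb", "aaa1"))  # TRUE, although there is no match

findTrigger <- function(terms, x) {
  # One integer column per term; vapply names the columns after the terms.
  vapply(terms,
         function(p) as.integer(grepl(p, x, fixed = TRUE)),
         integer(length(x)))
}

res <- findTrigger(searchTerms, searchVector)
res
#      aaa bbb
# [1,]   1   0
# [2,]   1   0
# [3,]   0   0
# [4,]   1   1
# [5,]   1   0
```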
I've got a first vector, say x, that consists only of 1's and -1's. Then I have a second vector y that consists of 1's, -1's, and zeros. Now I'd like to create a vector z that contains a 1 at index i if x[i] equals 1 and a 1 occurs in y among the n preceding elements (y[(i-n):i]), and a 0 otherwise.
more formally: z <- ifelse(x == 1 && 1 %in% y[(index(y)-n):index(y)],1,0)
I'm looking to create such a vector in R without looping or recursion. The expression above does not work because y[(index(y)-n):index(y)] is not evaluated element by element.
Thanks a lot for your support
Here's an approach that uses the cumsum function to test for the number of ones that have been seen so far. If the number of ones at position i is larger than the number of ones at position i-n, then the condition on the right will be satisfied.
## Generate some random y's.
> y <- sample(-1:1, 25, replace=T)
> y
[1] 0 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 0 0 -1 -1 -1 1 -1 1 1 0 0 0 1
> n <- 3
## Compute number of ones seen at each position.
> cs <- cumsum(ifelse(y == 1, 1, 0))
> lagged.cs <- c(rep(0, n), cs[1:(length(cs)-n)])
> (cs - lagged.cs) > 0
[1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[25] TRUE
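Combining this with x gives z. The sketch below uses small hand-made vectors and assumes 0 < n < length(y); note that with this lag the window is the n elements ending at position i, that is y[(i-n+1):i].

```r
x <- c(1, -1, 1, 1, -1, 1)
y <- c(0, 1, 0, 0, 0, -1)
n <- 2  # window length; assumes 0 < n < length(y)

cs <- cumsum(y == 1)                  # ones seen up to each position
lagged <- c(rep(0, n), head(cs, -n))  # the same count, n positions earlier
recent_one <- (cs - lagged) > 0       # TRUE if a 1 falls inside the window

z <- as.integer(x == 1 & recent_one)
z
# [1] 0 0 1 0 0 0
```

Only position 3 qualifies: x[3] is 1 and the window y[2:3] contains a 1.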
You could use lapply like this. It is essentially a tidy way to write a loop, so I'm not sure whether it will be faster (it may or may not).
y1 <- unlist(lapply(seq_along(x), function(i) 1 %in% y[max(1, i - n):i]))
z <- as.numeric(x==1) * as.numeric(y1)