I have two character vectors a, b with different lengths. I have to take each element in a, compare it with all elements in b, and note the element if there is a close match. For matching I'm using the agrepl function.
Following is the sample data
a <- c("US","Canada","United States","United States of America")
b <- c("United States","U.S","United States","Canada", "America", "Spain")
Following is the code that I'm using to match. Please help me avoid the for loop, as my real data has more than 900 and 5000 records respectively.
for(i in 1:4)
{
for(j in 1:6)
{
bFlag <- agrepl(a[i],b[j], max.distance = 0.1,ignore.case = TRUE)
if(bFlag)
{
#Custom logic
}
else
{
#Custom logic
}
}
}
You don't need a double loop, since agrepl's second argument accepts vectors of length >= 1. So you could do something like:
lapply(a, function(x) agrepl(x, b, max.distance = 0.1, ignore.case = TRUE))
# [[1]]
# [1] TRUE TRUE TRUE FALSE FALSE TRUE
#
# [[2]]
# [1] FALSE FALSE FALSE TRUE FALSE FALSE
#
# [[3]]
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
#
# [[4]]
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
You can add some custom logic inside the lapply call if needed, but that's not specified in the question so I'll just leave the output as a list of logicals.
If you want indices (of TRUEs) instead of logicals, you can use agrep instead of agrepl:
lapply(a, function(x) agrep(x, b, max.distance = 0.1,ignore.case = TRUE))
# [[1]]
# [1] 1 2 3 6
#
# [[2]]
# [1] 4
#
# [[3]]
# [1] 1 3
#
# [[4]]
# integer(0)
If you only want the first TRUE index, you can use:
sapply(a, function(x) agrep(x, b, max.distance = 0.1,ignore.case = TRUE)[1])
# US Canada United States United States of America
# 1 4 1 NA
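If the custom logic needs the whole comparison at once, one option is to collect the per-element results into a logical matrix instead of a list. The following is only a sketch using the a and b from the question; the name match_mat is made up for illustration and is not part of the original answer.

```r
a <- c("US", "Canada", "United States", "United States of America")
b <- c("United States", "U.S", "United States", "Canada", "America", "Spain")

# One column per element of `a`, one row per element of `b`.
# vapply() checks that every result is a logical vector of length(b).
match_mat <- vapply(
  a,
  function(x) agrepl(x, b, max.distance = 0.1, ignore.case = TRUE),
  logical(length(b))
)
rownames(match_mat) <- b

# Which elements of `b` are close to "Canada"?
b[match_mat[, "Canada"]]

# Does each element of `a` have at least one close match in `b`?
colSums(match_mat) > 0
```

With 900 by 5000 records the matrix has 4.5 million cells, which should still fit comfortably in memory as logicals.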
I would like to create a function that takes a string of at least one character, containing the digits 2 through 5, and determines whether any digit occurs in a consecutive run of length at least N, where N is the value of that digit.
If so, return the string true, otherwise return the string false.
For Example:
Input: "555123"
Output: false
Because 5 is found only 3 times instead of 5.
Or:
Input: "57333"
Output: true
Because 3 is found exactly 3 times.
Try rle + strsplit if you are working with base R
f <- function(s) {
with(
rle(unlist(strsplit(s, ""))),
any(as.numeric(values) <= lengths & lengths > 1)
)
}
and you will see
> f("555123")
[1] FALSE
> f("57333")
[1] TRUE
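To see why this works, it may help to look at what rle returns for one input. The sketch below is illustrative only; the names runs and has_run are made up, and the restriction to the digits 2 to 5 is an assumption taken from the question that f() itself does not apply.

```r
s <- "57333"
runs <- rle(strsplit(s, "")[[1]])

runs$values   # the digit that starts each run: "5" "7" "3"
runs$lengths  # how long each run is: 1 1 3

# A run qualifies when it is at least as long as its own digit
# and the digit is one of 2 to 5 (assumption from the question).
digit <- as.numeric(runs$values)
has_run <- any(digit %in% 2:5 & runs$lengths >= digit)
has_run
```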
Late to the party but maybe still worth your while:
Data:
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
Define vector with allowed numbers:
digits <- 2:5
Define alternation pattern with multiple backreferences:
patt <- paste0("(", digits, ")\\", seq_along(digits), "{", digits - 1, "}", collapse = "|")
Input patt into str_detect:
library(stringr)
str_detect(x, patt)
[1] FALSE TRUE FALSE FALSE TRUE TRUE
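Since each alternative repeats one fixed digit, the same test can probably be written without backreferences and without stringr. This is a minimal base R sketch, assuming the same x and digits as above; patt_simple is a made-up name.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# Builds "2{2}|3{3}|4{4}|5{5}": each digit repeated its own value in a row.
# A longer run still matches because it contains a run of that length.
patt_simple <- paste0(digits, "{", digits, "}", collapse = "|")

grepl(patt_simple, x)
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE
```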
You could check if the values in table correspond to the names.
x <- c('555123', '57333')
f <- \(x) {
s <- strsplit(x, '')
lapply(s, \(x) {
tb <- table(x)
names(tb) == tb
}) |> setNames(x)
}
f(x)
# $`555123`
# x
# 1 2 3 5
# TRUE FALSE FALSE FALSE
#
# $`57333`
# x
# 3 5 7
# TRUE FALSE FALSE
Another way would be:
my_func <- function(x) {
as.numeric(unlist(strsplit(x, ""))) -> all
table(all[all %in% 2:5]) -> f
any(names(f) == f)
}
# Input <- "555123"
# (my_func(Input))
# FALSE
# Input <- "57333"
# (my_func(Input))
# TRUE
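Note that the table-based answers count every occurrence of a digit and require an exact count, while the rle answer looks at consecutive runs of at least that length. The two agree on the examples in the question but can differ elsewhere. The sketch below uses made-up names (f_runs, f_counts) and made-up inputs to show the difference.

```r
# Run-based check (same idea as the rle answer)
f_runs <- function(s) {
  r <- rle(strsplit(s, "")[[1]])
  any(as.numeric(r$values) <= r$lengths & r$lengths > 1)
}

# Count-based check (same idea as the table answers)
f_counts <- function(s) {
  d <- as.numeric(strsplit(s, "")[[1]])
  tb <- table(d[d %in% 2:5])
  any(names(tb) == tb)
}

f_runs("35353")    # FALSE: the three 3s are not adjacent
f_counts("35353")  # TRUE: 3 occurs three times in total

f_runs("3333")     # TRUE: a run of four 3s contains a run of three
f_counts("3333")   # FALSE: the count is 4, not exactly 3
```

Which behaviour is wanted depends on how "consecutive" and "at least" in the question are read.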
I have a question about a vectorized operation with logical vectors. In my problem, there are two vectors: main and secondary. They're both of the same length. I want to replace some elements in the main vector with NA, based on insights I gather from the secondary vector.
The main vector is comprised of TRUE and FALSE that can appear in any random order.
The secondary vector is either:
a sequence of TRUE then a sequence of FALSE with/without NA as the last element; or
all TRUE; or
all FALSE; or
all FALSE with last element as NA; or
all TRUE with last element as NA
I'll provide several examples below and explain the desired algorithm.
A - The most common case
replace x values with NA for positions that are FALSE in y
# here, `x` is the main vector
x <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, NA)
# `y` is the secondary vector
y <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
# `z` is the desired output
z <- c(FALSE, FALSE, TRUE, FALSE, FALSE, NA, NA) # in this case only index 6 in `x` actually changed (index 7 was already NA)
B - The secondary vector is all FALSE
x2 <- c(FALSE, NA)
y2 <- c(FALSE, FALSE)
# desired output
z2 <- c(NA, NA)
C - The secondary vector is all TRUE
x3 <- rep(FALSE, 4)
y3 <- rep(TRUE, 4)
# desired output
z3 <- rep(FALSE, 4)
My attempt
I've almost figured this out, but unfortunately it fails with scenario B.
my_func <- function(main, secondary) {
idx_last_true_in_secondary <- max(which(secondary))
if(idx_last_true_in_secondary == length(secondary)) {
return(main)
}
main[(idx_last_true_in_secondary + 1): length(main)] <- NA
main
}
# case A
my_func(x, y)
#> [1] FALSE FALSE TRUE FALSE FALSE NA NA
# case B
my_func(x2, y2)
#> Warning in max(which(secondary)): no non-missing arguments to max; returning
#> -Inf
#> Error in (idx_last_true_in_secondary + 1):length(main): result would be too long a vector
# case C
my_func(x3, y3)
#> [1] FALSE FALSE FALSE FALSE
My question is whether anyone sees a better way to approach the problem?
This seems to work as expected:
my_func <- function(main,secondary ) {
main[!secondary] <- NA
return(main)
}
my_func(x,y)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
my_func(x2,y2)
[1] NA NA
my_func(x3,y3)
[1] FALSE FALSE FALSE FALSE
We could do it this way:
my_func <- function(x, y) {
replace(x, !y, NA)
}
How it works:
# A
replace(x, !y, NA)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
# B
replace(x2, !y2, NA)
[1] NA NA
# C
replace(x3, !y3, NA)
[1] FALSE FALSE FALSE FALSE
my_func(x,y)
my_func(x2,y2)
my_func(x3,y3)
output:
> my_func(x,y)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
> my_func(x2,y2)
[1] NA NA
> my_func(x3,y3)
[1] FALSE FALSE FALSE FALSE
We can try ifelse like below
> ifelse(y,x,NA)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
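The question allows the secondary vector to end in NA, and the approaches above treat that case differently. The sketch below uses a made-up input (x4, y4) to show the difference; which result is desired is not stated in the question, so this is an illustration rather than a recommendation.

```r
x4 <- c(TRUE, FALSE, TRUE, FALSE)
y4 <- c(TRUE, TRUE, FALSE, NA)   # hypothetical input with a trailing NA

# Logical-index assignment: an NA index selects nothing, so x4[4] is kept
by_index <- x4
by_index[!y4] <- NA
by_index
# [1]  TRUE FALSE    NA FALSE

# ifelse(): an NA in the test propagates to the result
by_ifelse <- ifelse(y4, x4, NA)
by_ifelse
# [1]  TRUE FALSE    NA    NA

# To blank out the NA positions with indexing as well, test for "not TRUE"
by_not_true <- replace(x4, !(y4 %in% TRUE), NA)
by_not_true
# [1]  TRUE FALSE    NA    NA
```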
I have a function with a ... argument and want to check whether a variable was passed through it. How can I do this? I have already looked at the solution provided at this link: How to check if object (variable) is defined in R?. However, it does not solve my problem.
# Scenario 1
exists("a")
# [1] FALSE
# Scenario 2
a <- 10
exists("a")
# [1] TRUE
# Define a function for remaining scenarios
f = function(...){exists("a", inherits = F)}
# Scenario 3
f()
# [1] FALSE
# Scenario 4
a <- 10
f()
# [1] FALSE
# Scenario 5
a <- 10
f(a = 5)
# [1] FALSE
I want the answer to be TRUE in Scenario 5.
Generally you use ... when you are passing parameters to other functions, not when you are using them in the function itself. It also makes a difference if you want to evaluate the parameter value or if you want to leave it unevaluated. If you need the latter, then you can do something like
f = function(...) {
mc <- match.call(expand.dots = TRUE)
"a" %in% names(mc)
}
This will return true for both
f(a = 4)
f(a = foo)
even when foo doesn't exist.
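The difference between the evaluated and unevaluated approaches can be seen side by side. This is a small sketch with made-up function names (f_call, f_list); no_such_object is assumed not to exist in the session.

```r
# Looks only at the call, so the argument is never evaluated
f_call <- function(...) {
  "a" %in% names(match.call(expand.dots = TRUE))
}

# Evaluates every argument in ... before looking at the names
f_list <- function(...) {
  "a" %in% names(list(...))
}

f_call(a = no_such_object)  # TRUE
f_call(5)                   # FALSE: the argument is unnamed
f_list(a = 5)               # TRUE

# f_list() fails here because `no_such_object` has to be evaluated
res <- tryCatch(f_list(a = no_such_object), error = function(e) "error")
res                         # "error"
```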
Does this suffice?
# Define a function for remaining scenarios
f = function(...){"a" %in% names(list(...))}
# Scenario 3
f()
# [1] FALSE
# Scenario 4
a <- 10
f()
# [1] FALSE
# Scenario 5
f(a = 5)
# [1] TRUE
Background: PDF Parse
My program looks for data in scanned PDF documents. I've created a CSV with rows representing various parameters to be searched for in a PDF, and columns for the different flavors of document that might contain those parameters. There are different identifiers for each parameter depending on the type of document. The column headers use dot separation to uniquely identify the document by type, subtype... , like so: type.subtype.s_subtype.s_s_subtype.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
p1 str1 str2
p2 str3 str4
p3 str5 str6
p4 str7
...
I'm reading in PDF files, and based on the filepaths they can be uniquely categorized into one of these types. I can apply various logical conditions to a substring of a given filepath, and based on that I'd like to output an NxM Boolean matrix, where N = NROW(filepath_vector), and M = ncol(params_csv). This matrix would show membership of a given file in a type with TRUE, and FALSE elsewhere.
t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 t.s.s2.s3 ...
fpath1 FALSE FALSE TRUE FALSE
fpath2 FALSE TRUE FALSE FALSE
fpath3 FALSE TRUE FALSE FALSE
fpath4 FALSE FALSE FALSE TRUE
...
My solution: I'm trying to apply a function to a matrix that takes a vector as argument, and applies the first element of the vector to the first row, the second element to the second row, etc... however, the function has conditional behavior depending on the element of the vector being applied.
I know this is very similar to the question below (my reference point), but the conditionals in my function are tripping me up. I've provided a simplified reproducible example of the issue below.
R: Apply function to matrix with elements of vector as argument
set.seed(300)
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
f <- function(x) {
sapply(v, g <- function(y) {
if(nchar(y)==8) {x=x*2
} else if (nchar(y)==7) {
if(grepl("^[[:alpha:]]*$", substr(y, 1, 1))) {x=x*3}
else {x}
} else if (nchar(y)<3) {x=x*4
} else {x=x-2}
})
}
mapply(f, as.data.frame(t(m)))
Desired output:
# [,1] [,2] [,3] [,4] [,5]
# [1,] -1 0 -1 -1 -1
# [2,] 4 4 0 4 0
# [3,] 3 0 3 3 0
# [4,] 2 0 2 2 0
# [5,] 1 1 1 1 0
But I get this error:
Error in if (y == 8) { : missing value where TRUE/FALSE needed
Can't seem to figure out the error or if I'm misguided elsewhere in my entire approach, any thoughts are appreciated.
Update (03April2018):
I had provided this as a toy example for the sake of reproducibility, but I think it would be more informative to use something similar to my actual code with @grand_chat's excellent solution. Hopefully this helps someone who's struggling with a similar issue.
chk <- c(NA, "abc.TRO", "def.TRO", "ghi.TRO", "kjl.TRO", "mno.TRO")
len <- c(8, NA, NA)
seed <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
A = matrix(seed, nrow=3, ncol=6, byrow=TRUE)
pairs <- mapply(list, as.data.frame(t(A)), len, SIMPLIFY=F)
f <- function(pair) {
x = unlist(pair[[1]])
y = pair[[2]]
if(!is.na(y) && y==8) {
x[c(grep("TRO", chk))] <- (x[c(grep("TRO", chk))] & TRUE)
} else {x <- (x & FALSE)}
return(x)
}
t(mapply(f, pairs))
Output:
# $v1
# [1,] FALSE TRUE TRUE FALSE FALSE FALSE
# $v2
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE
# $v3
# [3,] FALSE FALSE FALSE FALSE FALSE FALSE
You're processing the elements of vector v and the rows of your matrix m (columns of data frame t(m)) in parallel, so you could zip the corresponding elements into a list of pairs and process the pairs. Try this:
x <- y <- 5
m <- matrix(rbinom(x*y,1,0.5),x,y)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")
# Zip into pairs:
pairs <- mapply(list, as.data.frame(t(m)), v, SIMPLIFY=F)
# Define a function that acts on pairs:
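f <- function(pair) {
  x <- pair[[1]]
  y <- pair[[2]]
  if (nchar(y) == 8) {
    x <- x * 2
  } else if (nchar(y) == 7) {
    # Only scale when the first character is a letter
    if (grepl("^[[:alpha:]]*$", substr(y, 1, 1))) x <- x * 3
  } else if (nchar(y) < 3) {
    x <- x * 4
  } else {
    x <- x - 2
  }
  x  # return the transformed row explicitly
}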
# Apply it:
mapply(f, pairs, SIMPLIFY=F)
with result:
$V1
[1] -2 -1 -2 -2 -1
$V2
[1] 4 4 0 0 4
$V3
[1] 3 3 3 3 0
$V4
[1] 2 0 2 2 0
$V5
[1] 0 0 3 0 3
(This doesn't agree with your desired output because you don't seem to have applied your function f properly.)
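The zipping step can likely be skipped, because Map (like mapply) already walks several inputs in parallel. The sketch below is one way that might look, not the method from the answer: scale_row is a made-up name, the branching mirrors f() above, and m is a fixed all-ones matrix chosen only so the result is easy to check by hand.

```r
# Hypothetical fixed input so the result is easy to check by hand
m <- matrix(1, nrow = 5, ncol = 3)
v <- c("321", "", "A160470", "7IDJOPLI", "ACEGIKM")

# Same branching as f() above, but taking the row and the string separately
scale_row <- function(x, y) {
  n <- nchar(y)
  if (n == 8) {
    x * 2
  } else if (n == 7) {
    if (grepl("^[[:alpha:]]", y)) x * 3 else x
  } else if (n < 3) {
    x * 4
  } else {
    x - 2
  }
}

rows <- split(m, row(m))            # list with one numeric vector per row
res <- do.call(rbind, Map(scale_row, rows, v))
res[, 1]                            # first column: -1 4 3 2 3
```

Binding the pieces back with rbind gives a matrix of the same shape as m, which is closer to the matrix output the question asks for.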
I have a matrix A,
A = as.matrix(data.frame(col1 = c(1,1,2,3,1,2), col2 = c(-1,-1,-2,-3,-1,-2), col3 = c(2,6,1,3,2,4)))
And I have a vector v,
v = c(-1, 2)
How can I get a vector of TRUE/FALSE that compares the last two columns of the matrix and returns TRUE if the last two columns match the vector, or false if they don't?
I.e., If I try,
A[,c(2:3)] == v
I obtain,
col2 col3
[1,] TRUE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] TRUE FALSE
[6,] FALSE FALSE
That is not what I want. I want both columns to match vector v, giving something like:
result = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
Since the first and fifth rows match the vector v entirely.
Here's a simple alternative
> apply(A[, 2:3], 1, function(x) all(x==v))
[1] TRUE FALSE FALSE FALSE TRUE FALSE
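The reason A[, 2:3] == v does not work is that R recycles v down each column, so v[1] and v[2] alternate along the rows instead of lining up with col2 and col3. A sketch of a comparison without apply, assuming the A and v from the question (v_by_col and result are made-up names):

```r
A <- as.matrix(data.frame(col1 = c(1, 1, 2, 3, 1, 2),
                          col2 = c(-1, -1, -2, -3, -1, -2),
                          col3 = c(2, 6, 1, 3, 2, 4)))
v <- c(-1, 2)

# Stretch v so each value lines up with a whole column:
# rep(v, each = nrow(A)) is -1 six times, then 2 six times
v_by_col <- rep(v, each = nrow(A))

# A row matches when all of its compared cells are TRUE
result <- rowSums(A[, 2:3] == v_by_col) == length(v)
result
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE
```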
Oops, by looking through the R mailing list I found an answer: https://stat.ethz.ch/pipermail/r-help/2010-September/254096.html
check.equal <- function(x, y)
{
isTRUE(all.equal(y, x, check.attributes=FALSE))
}
result = apply(A[,c(2:3)], 1, check.equal, y=v)
Not sure I need to define a function and do all that, maybe there are easier ways to do it.
Here's another straightforward option:
which(duplicated(rbind(A[, 2:3], v), fromLast=TRUE))
# [1] 1 5
results <- rep(FALSE, nrow(A))
results[which(duplicated(rbind(A[, 2:3], v), fromLast=TRUE))] <- TRUE
results
# [1] TRUE FALSE FALSE FALSE TRUE FALSE
Alternatively, as one line:
duplicated(rbind(A[, 2:3], v), fromLast=TRUE)[-(nrow(A)+1)]
# [1] TRUE FALSE FALSE FALSE TRUE FALSE
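One caveat with the duplicated approach: it also flags rows that repeat each other without matching v. The sketch below uses a made-up matrix A2 to show this; flag_dup and flag_cmp are illustrative names.

```r
# Hypothetical matrix: rows 1 and 2 are identical but do not match v
A2 <- rbind(c(-2, 1),
            c(-2, 1),
            c(-1, 2))
v <- c(-1, 2)

flag_dup <- duplicated(rbind(A2, v), fromLast = TRUE)[-(nrow(A2) + 1)]
flag_dup
# [1]  TRUE FALSE  TRUE   (row 1 is flagged because it repeats row 2)

# A direct row-by-row comparison flags only the real match
flag_cmp <- apply(A2, 1, function(x) all(x == v))
flag_cmp
# [1] FALSE FALSE  TRUE
```

For the A in the question the two give the same answer, since no non-matching row is repeated there.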
A dirty one:
result <- c()
for(n in 1:nrow(A)){result[n] <-(sum(A[n,-1]==v)==2)}
> result
[1] TRUE FALSE FALSE FALSE TRUE FALSE