How can I print or display "Not Available" when any of my search keywords in Table_search is not found in the input I supply? The input has three lines and I have three keywords to search for across those lines; for each keyword, if it is present, print the matching line, otherwise print "Not Available", as shown in the desired output below.
My code just prints the lines that are available, but that doesn't help, because I also need to know where a keyword is missing.
Table_search <- list("Table 14", "Source Data:","VERSION")
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)
Input:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry
Desired Output:
Table 14.1.1.1 (Page 1 of 2)
Source Data: Listing 16.2.1.1.1
NA
@r2evans
sapply(unlist(Table_search), grepl, x = dat)
I get a good output with this code actually, but instead of TRUE or FALSE I would like to print the actual data.
I think a single regex will do it:
replace(dat, !grepl(paste(unlist(Table_search), collapse="|"), dat), NA)
# [1] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
# [3] NA
One problem with using sapply(., grep) is that grep returns integer indices, and if no match is made it returns a length-0 vector. For sapply (a class-unsafe function), this means you may or may not get an integer vector back: each return may be length 0 (nothing found) or length 1 (something found), and when sapply sees that the return values are not all the same length, it returns a list instead (ergo my "class-unsafe" verbiage above).
This doesn't change when you use value=TRUE: swap my reasoning above about "0 or 1 integer" for "0 or 1 character", and it's exactly the same problem.
Because of this, I suggest grepl: it always returns one logical per input element, indicating found or not found.
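To see that behaviour concretely, here is a small sketch using the Table_search and dat objects defined under Data below:
sapply(Table_search, grep, x = dat, value = TRUE)
# [[1]]
# [1] "Table 14.1.1.1 (Page 1 of 2)"
# [[2]]
# [1] "Source Data: Listing 16.2.1.1.1"
# [[3]]
# character(0)
# The third pattern matches nothing, so the return lengths differ (1, 1, 0) and
# sapply falls back to a list. grepl always returns one logical per element of dat:
grepl("VERSION", dat)
# [1] FALSE FALSE FALSE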
Further, since you don't appear to need to differentiate which of the patterns is found, just "at least one of them", then we can use a single regex, joined with the regex-OR operator |. This works with an arbitrary length of your Table_search list.
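For reference, the single joined pattern looks like this:
paste(unlist(Table_search), collapse = "|")
# [1] "Table 14|Source Data:|VERSION"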
If you somehow needed to know which of the patterns was found, then you might want something like:
sapply(unlist(Table_search), grepl, x = dat)
# Table 14 Source Data: VERSION
# [1,] TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE FALSE
and then figure out what to do with the different columns (each row indicates a string within the dat vector).
One way (that is doing the same as my first code suggestion, albeit less efficiently) is
rowSums(sapply(unlist(Table_search), grepl, x = dat)) > 0
# [1] TRUE TRUE FALSE
where the logical return value indicates whether something was found. If, for instance, you want to know whether two or more of the patterns were found, one might use rowSums(.) >= 2.
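For example (a quick sketch; with this dat, no line matches two or more patterns):
rowSums(sapply(unlist(Table_search), grepl, x = dat)) >= 2
# [1] FALSE FALSE FALSE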
Data
Table_search <- list("Table 14", "Source Data:","VERSION")
dat <- c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1", "Summary of Subject Status by Respiratory/Non-Ambulatory at Event Entry")
(reproducible example added)
I cannot quite grasp why the following is FALSE (I am aware they are double and integer, respectively):
identical(1, as.integer(1)) # FALSE
?identical reveals:
num.eq:
logical indicating if (double and complex non-NA) numbers should be compared using == (‘equal’), or by bitwise comparison. The latter (non-default)
differentiates between -0 and +0.
sprintf("%.8190f", as.integer(1)) and sprintf("%.8190f", 1) return exactly equal bit pattern. So, I think that at least one of the following must return TRUE. But, I get FALSE in each of the following:
identical(1, as.integer(1), num.eq=TRUE) # FALSE
identical(1, as.integer(1), num.eq=FALSE) # FALSE
My current thinking is: if sprintf reflects only the printed notation, not the storage, then identical() must be comparing based on storage, i.e.
identical(bitpattern1, bitpattern2) returns FALSE when the stored bit patterns differ. I could not find any other logical explanation for the FALSE/FALSE results above.
I do know that in both the 32-bit and 64-bit builds of R, integers are stored as 32-bit values.
They are not identical precisely because they have different types. If you look at the documentation for identical you'll find the example identical(1, as.integer(1)) with the comment ## FALSE, stored as different types. That's one clue. The R language definition reminds us that:
Single numbers, such as 4.2, and strings, such as "four point two" are still vectors, of length 1; there are no more basic types (emphasis mine).
So, basically everything is a vector with a type (that's also why [1] shows up every time R returns something). You can check this by explicitly creating a vector with length 1 by using vector, and then comparing it to 0:
x <- vector("double", 1)
identical(x, 0)
# [1] TRUE
That is to say, both vector("double", 1) and 0 output vectors of type "double" and length == 1.
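A quick check of the types involved in the original comparison (my own sketch):
typeof(1)               # "double"
typeof(as.integer(1))   # "integer"
identical(typeof(1), typeof(as.integer(1)))
# [1] FALSE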
typeof and storage.mode point to the same thing, so you're kind of right when you say "this means identical() compares based on storage". I don't think this necessarily means that "bit patterns" are being compared, although I suppose it's possible. See what happens when you change the storage mode using storage.mode:
## Assign integer to x. This is really a vector length == 1.
x <- 1L
typeof(x)
# [1] "integer"
identical(x, 1L)
# [1] TRUE
## Now change the storage mode and compare again.
storage.mode(x) <- "double"
typeof(x)
# [1] "double"
identical(x, 1L) # This is no longer TRUE.
# [1] FALSE
identical(x, 1.0) # But this is.
# [1] TRUE
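(A side note of my own: if you only care about numeric equality rather than type, == coerces the integer to double and returns TRUE, whereas identical() never coerces.)
1 == as.integer(1)           # TRUE:  == coerces the integer to double before comparing
identical(1, as.integer(1))  # FALSE: the underlying types differ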
One last note: The documentation for identical states that num.eq is a…
logical indicating if (double and complex non-NA) numbers should be compared using == (‘equal’), or by bitwise comparison.
So, changing num.eq doesn't affect any comparison involving integers. Try the following:
# Comparing integers with integers.
identical(+0L, -0L, num.eq = T) # TRUE
identical(+0L, -0L, num.eq = F) # TRUE
# Comparing integers with doubles.
identical(+0, -0L, num.eq = T) # FALSE
identical(+0, -0L, num.eq = F) # FALSE
# Comparing doubles with doubles.
identical(+0.0, -0.0, num.eq = T) # TRUE
identical(+0.0, -0.0, num.eq = F) # FALSE
I'm trying to determine if a string is a subset of another string. For example:
chars <- "test"
value <- "es"
I want to return TRUE if "value" appears as part of the string "chars". In the following scenario, I would want to return false:
chars <- "test"
value <- "et"
Use the grepl function
grepl( needle, haystack, fixed = TRUE)
like so:
grepl(value, chars, fixed = TRUE)
# TRUE
Use ?grepl to find out more.
Answer
Sigh, it took me 45 minutes to find the answer to this simple question. The answer is: grepl(needle, haystack, fixed=TRUE)
# Correct
> grepl("1+2", "1+2", fixed=TRUE)
[1] TRUE
> grepl("1+2", "123+456", fixed=TRUE)
[1] FALSE
# Incorrect
> grepl("1+2", "1+2")
[1] FALSE
> grepl("1+2", "123+456")
[1] TRUE
Interpretation
grep is named after the Unix command-line utility of the same name, which is itself an acronym of "Global Regular Expression Print": it reads lines of input and prints them if they match the arguments you give it. "Global" means the match can occur anywhere on the input line; I'll explain "Regular Expression" below, but the idea is that it's a smarter way to match a string (R calls this type "character", e.g. class("abc")); and "Print" because, as a command-line program, emitting output means printing to its output stream.
Now, the grep program is basically a filter, from lines of input to lines of output. R's grep function similarly takes a vector of inputs. For reasons that are utterly unknown to me (I only started playing with R about an hour ago), it returns a vector of the indices that match, rather than a vector of the matches.
But, back to your original question: what we really want is to know whether we found the needle in the haystack, a true/false value. They apparently decided to name this function grepl, as in "grep" but with a "logical" return value (R calls TRUE and FALSE logical values, e.g. class(TRUE)).
So, now we know where the name came from and what it's supposed to do. Let's get back to regular expressions. The arguments, even though they are strings, are used to build regular expressions (henceforth: regex). A regex is a way to match a string (if this definition irritates you, let it go). For example, the regex a matches the character "a", the regex a* matches the character "a" 0 or more times, and the regex a+ matches the character "a" 1 or more times. Hence in the example above, the needle we are searching for, 1+2, when treated as a regex, means "one or more 1s followed by a 2"... but in our literal string the 1 is followed by a plus, not a 2!
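To make those quantifiers concrete (the example strings here are my own):
grepl("1+2", "12")    # TRUE:  one "1" followed by a "2"
grepl("1+2", "1112")  # TRUE:  several "1"s followed by a "2"
grepl("1+2", "1+2")   # FALSE: the "1" is followed by "+", not "2"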
So, if you use grepl without setting fixed, your needles are accidentally treated as regexes, and that will accidentally work quite often; we can see it even works for the OP's example. But that's a latent bug! We need to tell it the input is a literal string, not a regex, which is apparently what fixed is for. Why "fixed"? No clue; bookmark this answer, because you're probably going to have to look it up 5 more times before you get it memorized.
A few final thoughts
The better your code is, the less history you have to know to make sense of it. Every argument can have at least two interesting values (otherwise it wouldn't need to be an argument); the docs list 9 arguments here, which means there are at least 2^9 = 512 ways to invoke it. That's a lot of work to write, test, and remember, so decouple such functions: split them up and remove dependencies on each other, because string things are different from regex things are different from vector things. Some of the options are also mutually exclusive; don't give users incorrect ways to use the code, i.e. the problematic invocation should be structurally nonsensical (such as passing an option that doesn't exist), not logically nonsensical (where you have to emit a warning to explain it). Put metaphorically: replacing the front door in the side of the 10th floor with a wall is better than hanging a sign that warns against its use, but either is better than neither. In an interface, the function defines what the arguments should look like, not the caller: the caller depends on the function, and inferring everything that everyone might ever want to call it with makes the function depend on the callers too, and this kind of cyclical dependency will quickly clog a system up without ever providing the benefits you expect. Be very wary of equivocating types; it's a design flaw that things like TRUE and 0 and "abc" are all vectors.
You want grepl:
> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE
This can also be done using the "stringr" library:
> library(stringr)
> chars <- "test"
> value <- "es"
> str_detect(chars, value)
[1] TRUE
### For multiple value case:
> value <- c("es", "l", "est", "a", "test")
> str_detect(chars, value)
[1] TRUE FALSE TRUE FALSE TRUE
Use this function from the stringi package:
> stri_detect_fixed("test",c("et","es"))
[1] FALSE TRUE
Some benchmarks:
library(stringi)
set.seed(123L)
value <- stri_rand_strings(10000, ceiling(runif(10000, 1, 100))) # 10000 random ASCII strings
head(value)
chars <- "es"
library(microbenchmark)
microbenchmark(
grepl(chars, value),
grepl(chars, value, fixed=TRUE),
grepl(chars, value, perl=TRUE),
stri_detect_fixed(value, chars),
stri_detect_regex(value, chars)
)
## Unit: milliseconds
## expr min lq median uq max neval
## grepl(chars, value) 13.682876 13.943184 14.057991 14.295423 15.443530 100
## grepl(chars, value, fixed = TRUE) 5.071617 5.110779 5.281498 5.523421 45.243791 100
## grepl(chars, value, perl = TRUE) 1.835558 1.873280 1.956974 2.259203 3.506741 100
## stri_detect_fixed(value, chars) 1.191403 1.233287 1.309720 1.510677 2.821284 100
## stri_detect_regex(value, chars) 6.043537 6.154198 6.273506 6.447714 7.884380 100
In case you would also like to check whether a string (or a set of strings) contains multiple substrings, you can also use '|' between the substrings.
>substring="as|at"
>string_vector=c("ass","ear","eye","heat")
>grepl(substring,string_vector)
You will get
[1] TRUE FALSE FALSE TRUE
since the 1st word contains the substring "as" and the last word contains the substring "at".
Use grep or grepl but be aware of whether or not you want to use regular expressions.
By default, grep and related take a regular expression to match, not a literal substring. If you're not expecting that, and you try to match on an invalid regex, it doesn't work:
> grep("[", "abc[")
Error in grep("[", "abc[") :
invalid regular expression '[', reason 'Missing ']''
To do a true substring test, use fixed = TRUE.
> grep("[", "abc[", fixed = TRUE)
[1] 1
If you do want regex, great, but that's not what the OP appears to be asking.
You can use grep
grep("es", "Test")
[1] 1
grep("et", "Test")
integer(0)
Similar problem here: Given a string and a list of keywords, detect which, if any, of the keywords are contained in the string.
Recommendations from this thread suggest stringr's str_detect and grepl. Here are the benchmarks from the microbenchmark package:
Using
map_keywords = c("once", "twice", "few")
t = "yes but only a few times"
mapper1 <- function (x) {
r = str_detect(x, map_keywords)
}
mapper2 <- function (x) {
r = sapply(map_keywords, function (k) grepl(k, x, fixed = T))
}
and then
microbenchmark(mapper1(t), mapper2(t), times = 5000)
we find
Unit: microseconds
expr min lq mean median uq max neval
mapper1(t) 26.401 27.988 31.32951 28.8430 29.5225 2091.476 5000
mapper2(t) 19.289 20.767 24.94484 23.7725 24.6220 1011.837 5000
As you can see, over 5,000 iterations of the keyword search using str_detect and grepl over a practical string and vector of keywords, grepl performs quite a bit better than str_detect.
The outcome is the boolean vector r which identifies which, if any, of the keywords are contained in the string.
Therefore, I recommend using grepl to determine if any keywords are in a string.
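If you want the matching keywords themselves rather than the logical vector, you can subset with the result (a small follow-up sketch using the same objects):
r <- mapper2(t)
map_keywords[r]
# [1] "few"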
I have the following list
test_list=list(list(a=1,b=2),list(a=3,b=4))
and I want to extract all elements whose name is a.
I can do this via
sapply(test_list,`[[`,"a")
which gives me the correct result
#[1] 1 3
When I try the same with R's dollar operator $, I get NULL:
sapply(test_list,`$`,"a")
#[[1]]
#NULL
#
#[[2]]
#NULL
However, if I use it on a single element of test_list it works as expected
`$`(test_list[[1]],"a")
#[1] 1
Am I missing something obvious here?
evaluation vs. none
[[ evaluates its argument whereas $ does not. L[[a]] gets the component of L whose name is held in the variable a. $ just passes the argument name itself as a character string so L$a finds the "a" component of L. a is not regarded as a variable holding the component name -- just a character string.
Below, L[[b]] returns the component of L named "a" because the variable b holds the value "a", whereas L$b returns the component of L named "b" because with that syntax b is not regarded as a variable but as a character string which is itself passed.
L <- list(a = 1, b = 2)
b <- "a"
L[[b]] # same as L[["a"]] since b holds a
## [1] 1
L$b # same as L[["b"]] since b is regarded as a character string to be passed
## [1] 2
sapply
Now that we understand the key difference between $ and [[, consider this example to see what is going on with sapply. We have made each element of test_list into a "foo" object and defined our own $.foo and [[.foo methods, which simply show what R is passing to the method via the name argument:
foo_list <- test_list
class(foo_list[[1]]) <- class(foo_list[[2]]) <- "foo"
"$.foo" <- "[[.foo" <- function(x, name) print(name)
result <- sapply(foo_list, "$", "a")
## "..."
## "..."
result2 <- sapply(foo_list, "[[", "a")
## [1] "a"
## [1] "a"
What is happening in the first case is that sapply is calling whatever$..., and ... is not evaluated, so $ looks for a list component literally named "..."; of course there is no such component, so whatever$... is NULL, hence the NULLs shown in the output in the question. In the second case, whatever[[...]] evaluates to whatever[["a"]], hence the observed result.
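The same mechanism can be shown without the foo class (a sketch of my own): wrap the lookup in a function, and $ looks for the literal formal name while [[ uses its value.
f <- function(x, nm) x$nm     # $ sees the literal name "nm", not the value held in nm
f(test_list[[1]], "a")
# NULL
g <- function(x, nm) x[[nm]]  # [[ evaluates nm, so this is x[["a"]]
g(test_list[[1]], "a")
# [1] 1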
From what I've been able to determine it's a combination of two things.
First, the second argument of $ is matched but not evaluated, so it cannot be a variable.
Second, when arguments are passed to functions, they are assigned to the corresponding variables in the function call. When passed through sapply, "a" is assigned to a variable and therefore no longer works with $. We can see this occurring by running
sapply("a", print)
[1] "a"
a
"a"
This can lead to peculiar results like this
sapply(test_list, function(x, a) {`$`(x, a)})
[1] 1 3
where, despite a being a variable (which has not even been assigned a value), $ matches it against the names of the elements in the list.
Given a list of objects that are either symbols or language, what is the best way to search for containment?
For example, consider the following example:
> a = list(substitute(1 + 2), substitute(2 + 3))
> substitute(1 + 2) %in% a
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
> a == substitute(1 + 2)
[1] TRUE FALSE
Warning message:
In a == substitute(1 + 2) :
longer object length is not a multiple of shorter object length
The second method seems to work, but I am unsure of what the warning means.
Another idea is to use deparse and then compare characters, but this becomes complicated when the parsed expressions are long enough to exceed the width.cutoff in deparse.
Not sure why you would need to do this, but you can use identical to do the comparison. However, since identical only compares two arguments, you will have to loop over your list, preferably using lapply...
lapply( a , function(x) identical( substitute(1 + 2) , x ) )
#[[1]]
#[1] TRUE
#[[2]]
#[1] FALSE
Or, similarly, you can still use ==. Inspection of substitute(1 + 2) reveals it to be a language object of length 3, whilst your list a is obviously of length 2, hence the warning about vector recycling. Therefore you just need to loop over the elements in your list, which you can do thus:
lapply( a , `==` , substitute(1 + 2) )
#[[1]]
#[1] TRUE
#[[2]]
#[1] FALSE
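To see where the length-3 in the warning comes from (my own quick inspection): a call object behaves like a list made up of its operator and arguments.
e <- substitute(1 + 2)
length(e)
# [1] 3
as.list(e)
# [[1]]
# `+`
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 2
length(a)
# [1] 2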