Using grepl in R

In which cases would these two different implementations give different results?
data(mtcars)
firstWay <- mtcars[grepl('6',mtcars$cyl),]
SecondWay <- mtcars[mtcars$cyl=='6',]
If they always give the same results, which of them is recommended and why? Thanks

mtcars$cyl is a numeric column, so you would be better off comparing it to a number using mtcars[mtcars$cyl == 6, ].
But the difference between the equality operator == and grepl is that == will only be TRUE for members of the vector which are equal to "6", while grepl will match any member of the vector which has a 6 anywhere within it.
So, for example:
String                                                     ==      grepl
6                                                          TRUE    TRUE
123456                                                     FALSE   TRUE
6ABC                                                       FALSE   TRUE
This is a long sentence which happens to have a 6 in it    FALSE   TRUE
Whereas this long sentence does not                        FALSE   FALSE
The equivalent grepl pattern would be "^6$". There's a tutorial (one of many) on regex at http://www.regular-expressions.info/tutorial.html.
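As a quick illustration of the table above (a minimal sketch you can paste into a console):
x <- c("6", "123456", "6ABC")
x == "6"          # exact equality: TRUE FALSE FALSE
grepl("6", x)     # substring match: TRUE TRUE TRUE
grepl("^6$", x)   # anchored pattern: TRUE FALSE FALSE, same as ==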

Well, I think the first difference is that with grepl you can subset even when you don't already know the exact value, for example 6; you can instead search for rows that start or end with 6.
If you try to do this with the normal subsetting technique you'll get an empty object because, for example, ^6 is not recognized as a regular expression but as a literal string containing the symbol ^ and a 6.
I am sure there are other differences, but I am sure professional users will provide more detailed answers.
As for which one should be preferred, maybe there are reasons of efficiency:
system.time(mtcars[grepl('^6', mtcars$cyl), ])
  user  system elapsed
 0.029   0.002   0.035
system.time(mtcars[mtcars$cyl == '6', ])
  user  system elapsed
 0.031   0.002   0.046
This little example can only be a rough guide; as @Nick K suggested, further (and more precise) investigation should be done with microbenchmark. Of course, with a big dataset I doubt a professional user (or one in need of speed) would prefer either of them; they would more likely rely on data.table, or on tools like dplyr that are written in a lower-level language and are therefore faster.

Using the package microbenchmark, we can see which is faster
library(microbenchmark)
m <- microbenchmark(mtcars[grepl('6',mtcars$cyl),], mtcars[mtcars$cyl=='6',], times=10000)
Unit: microseconds
                             expr     min      lq     mean  median      uq      max neval
 mtcars[grepl("6", mtcars$cyl), ] 229.080 234.738 247.5324 236.693 239.417 6713.914 10000
      mtcars[mtcars$cyl == "6", ] 214.902 220.210 231.0240 221.956 224.471 7759.507 10000
It looks like == is faster, so when possible you should use that.
However, the two do not do exactly the same thing: grepl tests whether the pattern occurs anywhere in the string, whereas == checks whether the values are exactly equal.
grepl("6", mtcars$disp)
[1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mtcars$disp == "6"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Related

Flatten boolean vector in R

How could I get a single boolean value that is TRUE if all values in a vector are TRUE, and FALSE otherwise? For instance:
> grepl("ABC",c("ABC","ABC","123ABC"))
[1] TRUE TRUE TRUE
my desired result:
[1] TRUE
Another example:
> grepl("ABC",c("ABC","ABC","123ABA"))
[1] TRUE TRUE FALSE
my desired result:
[1] FALSE
I know it could be solved with a for loop, but that would be a time-consuming method. Perhaps there is a ready-made, simple solution. Please advise.
Use all:
all(grepl("ABC",c("ABC","ABC","123ABC")))
#[1] TRUE
all(grepl("ABC",c("ABC","ABC","123ABA")))
#[1] FALSE
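For the complementary check (TRUE if at least one value matches), base R's any works the same way:
any(grepl("ABC",c("XYZ","123ABA")))
#[1] FALSE
any(grepl("ABC",c("ABC","123ABA")))
#[1] TRUE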

Find where in a boolean vector TRUE is followed by FALSE

I have a boolean vector and need to indicate where it changes from TRUE to FALSE.
input <- c(rep(TRUE,3), rep(FALSE,2), TRUE, FALSE)
input
[1] TRUE TRUE TRUE FALSE FALSE TRUE FALSE
The result should be c(4, 7). Does something for this already exist in base R? thx, J
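One base R sketch: a TRUE-to-FALSE change is a step of -1 once the logical vector is coerced to integers, so diff finds it directly:
which(diff(input) == -1) + 1
# [1] 4 7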

Improving the speed of a list-cleaning step

I'm trying to improve the speed of one of my R scripts; one of the parts that takes a long time is the cleaning of a list under certain conditions.
It might not be necessary to understand perfectly what I want to do; you can go directly to the code that I need to improve.
So here's the thing:
I have a list; each element of the list is itself a list of 2 elements:
- The first element is a vector of integers with a length between 1 and 4
- The second element is a vector of booleans of length 6
Here's a piece of code to create such a list (of 1000 elements):
numberList <- 1000
l.gen <- lapply(1:numberList, function(i) {
  list(var = floor(runif(floor(runif(1, 1, 5)), 1, 7)),
       vec = as.logical(floor(runif(6, 0, 1.99))))
})
It kind of looks like this:
> l.gen
[[1]]
[[1]]$var
[1] 1 4 2
[[1]]$vec
[1] FALSE FALSE FALSE FALSE TRUE TRUE
[[2]]
[[2]]$var
[1] 3
[[2]]$vec
[1] FALSE FALSE FALSE TRUE TRUE FALSE
[[3]]
[[3]]$var
[1] 6
[[3]]$vec
[1] TRUE FALSE TRUE FALSE TRUE TRUE
[[4]]
[[4]]$var
[1] 6
[[4]]$vec
[1] TRUE TRUE TRUE FALSE FALSE FALSE
Now to the cleaning part.
I want to remove from this list every element l that meets two conditions:
The $vec of element l has more than 3 TRUEs in common with the $vec of another element of the list:
for example:
$vec
[1] TRUE TRUE FALSE TRUE TRUE FALSE
and
$vec
[1] TRUE FALSE FALSE TRUE TRUE FALSE
have 3 TRUEs in common (the ones in the 1st, 4th and 5th positions), so they do not meet the condition.
The second condition is tested only when the first one is met:
the $var vectors of the two elements should have at least one value in common (their respective positions don't matter)
so
[[1]]$var
[1] 1 4 2
and
[[1]]$var
[1] 3 1
meet that condition (because the value 1 is in both vectors)
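In code, the two pairwise tests can be written like this (a small sketch using the example vectors above):
v1 <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
v2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
sum(v1 & v2)                  # 3 TRUEs in common; condition 1 needs more than 3
any(c(1, 4, 2) %in% c(3, 1))  # TRUE: the value 1 is shared, condition 2 is met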
When two elements of the list meet both conditions, I delete the one with the shorter $var.
For example, of these two:
[[3]]
[[3]]$var
[1] 6
[[3]]$vec
[1] TRUE FALSE TRUE FALSE TRUE TRUE
[[3]]
[[3]]$var
[1] 6 3 5
[[3]]$vec
[1] TRUE TRUE TRUE FALSE TRUE TRUE
this element should be deleted:
[[3]]
[[3]]$var
[1] 6
[[3]]$vec
[1] TRUE FALSE TRUE FALSE TRUE TRUE
So here is the code I've tried; it meets my requirements:
limit <- 3  # more than `limit` TRUEs in common triggers the second check
res <- lapply(l.gen, function(l) {
  for (i in seq_along(l.gen)) {
    if (length(l$var) < length(l.gen[[i]]$var)) {
      in.common <- sum(l$vec & l.gen[[i]]$vec)
      if (in.common > limit) {
        var.in.common <- sum(l$var %in% l.gen[[i]]$var)
        if (var.in.common > 0) {
          return(NULL)  # both conditions met: drop this element
        }
      }
    }
  }
  return("OK")  # no conflicting element found: keep it
})
It works fine, but it's kind of slow when the list is very big.
I've tried replacing the for loop with another lapply, but it takes more time when the list is big: the return() in the for loop acts like a break, which can't be done in lapply(), which always visits every element of the list.
I'm open to any suggestion that might help.
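One hedged sketch of a vectorized approach (not benchmarked, and assuming every $vec has length 6 as generated above): stack the $vec vectors into a matrix, get all pairwise counts of shared TRUEs in one tcrossprod call, and only test $var overlap for the few pairs that pass the first condition, reusing the limit defined above:
V <- do.call(rbind, lapply(l.gen, `[[`, "vec"))  # n x 6 logical matrix
shared <- tcrossprod(V)                          # shared[i, j] = TRUEs in common between i and j
lens <- lengths(lapply(l.gen, `[[`, "var"))
cand <- which(shared > limit, arr.ind = TRUE)    # pairs passing condition 1
cand <- cand[cand[, 1] != cand[, 2], , drop = FALSE]
drop <- logical(length(l.gen))
for (k in seq_len(nrow(cand))) {
  i <- cand[k, 1]; j <- cand[k, 2]
  if (lens[i] < lens[j] && any(l.gen[[i]]$var %in% l.gen[[j]]$var))
    drop[i] <- TRUE                              # shorter $var loses: mark for deletion
}
res <- l.gen[!drop]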

How does ada::predict.ada work?

I tried on Cross Validated but got no response, and this is a technical, implementation-centric question.
I used ada::ada in R to create a boosted model, which is based on decision trees.
It normally returns a matrix with stats comparing the predicted results to the expected outcome.
It's something like this:
        FALSE  TRUE
FALSE   11023  1023
TRUE      997  5673
That's cool, good accuracy.
Now it's time to predict on new data. So I went with:
predict(myadamodel, newdata=giveinputs())
But instead of a simple TRUE/FALSE answer, I got:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[25] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
[49] FALSE FALSE
Levels: FALSE TRUE
I presume that this ada object is an ensemble and I received an answer from each classifier.
But in the end I need a final, straight answer: TRUE or FALSE. If that's all I can get, I need to know how the ada function computes the final answer that was used to build the statistics above. I would check that myself, but the ada function is precompiled.
How do I get a final TRUE/FALSE answer that complies with the statistics ada returns from the learning phase?
I've attached an example that you can copy-paste:
library(ada)  # for ada()

mydata <- data.frame(a = numeric(0), b = double(0), r = logical(0))
for (i in -10:10)
  for (j in 20:-4)
    mydata[length(mydata[, 1]) + 1, ] <- c(a = i, b = j, r = (j > i))
myada <- ada(mydata[, c("a", "b")], mydata[, "r"])
print(myada)
predict(myada, data.frame(a = 4, b = 7))
Please note that the r column is for some reason expressed as 0/1. I don't know why, or how to tell data.frame not to convert TRUE/FALSE to 0/1, but the idea stays the same.
OK. The reproducible example helped. It looks to be a quirk in the way predict works when you pass new data that has just one row. In this case, you're getting an estimate from each of the iterations (the default number of iterations is 50). Note that you only get two values returned when you do
predict(myada, data.frame(a=4:3,b=7:8))
This is basically because of a use of sapply within the predict function. We can make our own which doesn't have this problem.
predict.ada <- ada:::predict.ada
body(predict.ada)[[12]] <- quote(
  tmp <- t(do.call(rbind, lapply(1:iter, function(i)
    f(f = object$model$trees[[i]], dat = newdata))))
)
and then we can run
predict.ada(myada, newdata=data.frame(a=4,b=7))
# [1] TRUE
# Levels: FALSE TRUE
so this new value is predicted to be TRUE. This was tested with ada_2.0-3 and may break in other versions.
Also, in your test data, when you use c() to merge elements, they must all be the same data type or they will be converted to the lowest-common-denominator type that can hold all the values. If you're mixing types, it's better to use list(). For example
mydata[length(mydata[,1])+1,] = list(a=i,b=j, r= (j > i))
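A quick illustration of the coercion (plain base R behaviour):
c(a = 1L, b = 2L, r = TRUE)       # r is silently coerced to the integer 1
# a b r
# 1 2 1
list(a = 1L, b = 2L, r = TRUE)$r  # a list keeps each element's type
# [1] TRUE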

Lazy evaluation of `which` function arguments?

If there are multiple boolean expressions as arguments to the which function, are they evaluated lazily?
For example:
which(test1 & test2)
If test1 returns FALSE, is test2 then not evaluated, since the compound expression will be FALSE anyway?
With && there can be efficiency gains as a result of that behavior. It is documented to work that way, and I don't think it is due to lazy evaluation. Even if you force()-ed that expression, it would still only evaluate a series of &&'s until it hit a single FALSE. See this help page:
?Logic
@XuWang probably deserved the credit for emphasizing the difference between "&" and "&&". The "&" operator works on vectors and returns vectors. The "&&" operator acts on scalars (actually vectors of length 1) and returns a vector of length 1. When offered a vector of length > 1 on either side, it will work only on the first value of each and emit a warning. It is only the "&&" version that does what is being called "lazy" evaluation. You can see that the "&" operator is not acting in a "lazy" fashion with a simple test:
fn1 <- function(x) print(x)
fn2 <- function(x) print(x)
x1 <- sample(c(TRUE, FALSE), 10, replace=TRUE)
fn1(x1) & fn2(x1) # the first two indicate evaluation of both sides regardless of first value
# [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
# [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
# [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
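To see the difference directly, here is a tiny sketch with a hypothetical noisy() helper; && stops as soon as the result is determined, while & always evaluates both sides:
noisy <- function(x) { message("evaluated"); x }
FALSE && noisy(TRUE)   # no "evaluated" message: && short-circuits
# [1] FALSE
FALSE & noisy(TRUE)    # "evaluated" is printed: & evaluates both sides
# evaluated
# [1] FALSE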
