I asked this on Cross Validated but got no response, and it is really a technical, implementation-centric question.
I used ada::ada in R to create a boosted model which is based on decision trees.
It normally returns a matrix with statistics on the predicted results compared to the expected outcome.
It looks something like this:
      FALSE TRUE
FALSE 11023 1023
TRUE    997 5673
That's cool, good accuracy.
Now it's time to predict on new data. So I went with:
predict(myadamodel, newdata=giveinputs())
But instead of a simple answer TRUE/FALSE I've got:
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[25] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
[49] FALSE FALSE
Levels: FALSE TRUE
I presume that this ada object is an ensemble and that I received an answer from each classifier.
But in the end I need a single, straight answer: TRUE/FALSE. If this is all I can get, I need to know how the "ada" function computes the final answer that was used to build the statistic. I would check that myself, but the "ada" function is precompiled.
How do I get the final TRUE/FALSE answer in a way that is consistent with the statistic that ada returns from the learning phase?
I've attached an example that you can copy-paste:
library(ada)
mydata = data.frame(a=numeric(0),b=double(0),r=logical(0))
for(i in -10:10)
  for(j in 20:-4)
    mydata[length(mydata[,1])+1,] = c(a=i,b=j, r= (j > i))
myada = ada(mydata[,c("a","b")], mydata[,"r"])
print(myada)
predict(myada, data.frame(a=4,b=7))
Please note that the r column is for some reason expressed as "0"/"1". I don't know why, or how to tell data.frame not to convert TRUE/FALSE to 0/1, but the idea stays the same.
OK. The reproducible example helped. It looks to be a quirk in the way predict works when you pass new data that has just one row. In this case, you're getting an estimate from each of the iterations (the default number of iterations is 50). Note that you only get two values returned when you do
predict(myada, data.frame(a=4:3,b=7:8))
This is basically because of a use of sapply within the predict function. We can make our own which doesn't have this problem.
predict.ada <- ada:::predict.ada
body(predict.ada)[[12]] <- quote( tmp <- t(do.call(rbind,
    lapply(1:iter, function(i) f(f = object$model$trees[[i]],
        dat = newdata)))))
and then we can run
predict.ada(myada, newdata=data.frame(a=4,b=7))
# [1] TRUE
# Levels: FALSE TRUE
so this new value is predicted to be TRUE. This was tested in ada_2.0-3 and may break in other versions.
Also, in your test data, when you use c() to merge elements they must be all the same data type or they will be converted to the lowest common denominator data type that can hold all values. If you're mixing types, it's better to use a list(). For example
mydata[length(mydata[,1])+1,] = list(a=i,b=j, r= (j > i))
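As a small illustration of the coercion described above (a sketch only; the names row_vec and row_list are made up for this example), compare what c() and list() do to a row that mixes numbers and a logical:

```r
# c() must return a single atomic type, so the logical TRUE is coerced to 1
row_vec <- c(a = 4, b = 7, r = TRUE)
print(class(row_vec))            # "numeric"
print(row_vec[["r"]])            # 1

# list() keeps each element's own type
row_list <- list(a = 4, b = 7, r = TRUE)
print(sapply(row_list, class))   # numeric, numeric, logical

# Appending the list as a new row keeps the r column logical
mydata <- data.frame(a = numeric(0), b = numeric(0), r = logical(0))
mydata[nrow(mydata) + 1, ] <- row_list
print(class(mydata$r))           # "logical"
```

With the list() version the r column stays TRUE/FALSE, so the response is no longer shown as 0/1.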
How could I get a single boolean value that is TRUE if all values in vector are TRUE and FALSE otherwise? For instance:
> grepl("ABC",c("ABC","ABC","123ABC"))
[1] TRUE TRUE TRUE
my desired result:
[1] TRUE
Another example:
> grepl("ABC",c("ABC","ABC","123ABA"))
[1] TRUE TRUE FALSE
my desired result:
[1] FALSE
I know that this could be solved with a for loop, but that would be a time-consuming method. Perhaps there is another, ready-made and simple solution. Please advise.
Use all():
all(grepl("ABC",c("ABC","ABC","123ABC")))
#[1] TRUE
all(grepl("ABC",c("ABC","ABC","123ABA")))
#[1] FALSE
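A few edge cases of all() may be worth knowing; the following is a minimal sketch using made-up inputs. It also shows any(), the counterpart that is TRUE if at least one value is TRUE:

```r
hits <- grepl("ABC", c("ABC", "ABC", "123ABA"))

all_hit <- all(hits)    # FALSE: the last element does not match
any_hit <- any(hits)    # TRUE: at least one element matches

# all() of an empty vector is TRUE (nothing contradicts the condition)
empty_all <- all(logical(0))

# A missing value makes the result NA unless a FALSE settles it,
# or unless NAs are dropped with na.rm = TRUE
na_all      <- all(c(TRUE, NA))                # NA
na_all_rm   <- all(c(TRUE, NA), na.rm = TRUE)  # TRUE
na_with_false <- all(c(FALSE, NA))             # FALSE

print(c(all_hit, any_hit, empty_all, na_all, na_all_rm, na_with_false))
```

If the result feeds an if() condition, wrapping it as isTRUE(all(x)) guards against the NA case.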
I'm following along with this question here: efficiently locf by groups in a single R data.table
This seems perfect for my data, as I have grouped data with multiple columns, where I am trying to carry the last observation forward. However, I would like to limit how far forward it is carried. The relevant part of the code is !is.na(x). Let's say I want to limit it to two, then given the sequence TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE, I would like to have it as TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE.
This in effect carries a value of TRUE forward up to n times (very similar to xts), which seems to make this method redundant compared with xts.na.locf, but I'm hoping there is an efficient way to do this that avoids xts. Thanks for any help.
One possibility is to modify the Run Length Encoding of the vector by shifting the unwanted repetitions of FALSE onto the next TRUE:
mx <- 2
v <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
r <- rle(v)
if(!r$values[length(r$values)]) {
r$values <- c(r$values,TRUE)
r$lengths <- c(r$lengths,0)
}
changes <- pmax(0,r$lengths-mx) * (r$values == FALSE)
r$lengths <- r$lengths - changes + c(0,head(changes,-1))
You'd obviously have to test if this is more efficient for your use case.
Edit: Output is as expected:
> print(inverse.rle(r))
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
Edit 2: Short explanation:
pmax(0,r$lengths-mx) is a vector whose components are either zero (if the run length is at most mx) or the difference between the run length and mx. Since only the repetitions of FALSE are relevant, it is necessary to multiply by (r$values == FALSE), which zeroes any entries of the vector corresponding to TRUE.
Because of the if block, the last element of r$values is known to be TRUE. Thus we can move the unwanted FALSEs to the following TRUE. This is achieved by first subtracting from the number of FALSEs and then adding to the number of TRUEs. Since we know that the last entry of changes belongs to a TRUE, taking c(0,head(changes,-1)) simply shifts all changes (for FALSE) one position to the right (and thus onto a TRUE).
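For reuse, the steps above could be wrapped in a function. This is only a sketch: the name cap_false_runs and the empty-input guard are my additions, and the input is assumed to be a logical vector without NAs.

```r
# Hypothetical helper: every run of FALSE longer than mx keeps its first
# mx FALSEs, and the remainder of the run is turned into TRUE.
cap_false_runs <- function(v, mx) {
  if (length(v) == 0) return(v)          # rle() of an empty vector has no runs
  r <- rle(v)
  # Make sure the encoding ends in a TRUE run (possibly of length 0)
  if (!r$values[length(r$values)]) {
    r$values  <- c(r$values, TRUE)
    r$lengths <- c(r$lengths, 0)
  }
  # Excess length of each FALSE run; zero for TRUE runs
  changes <- pmax(0, r$lengths - mx) * (r$values == FALSE)
  # Remove the excess from the FALSE run and add it to the TRUE run after it
  r$lengths <- r$lengths - changes + c(0, head(changes, -1))
  inverse.rle(r)
}

v <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
capped <- cap_false_runs(v, 2)
print(capped)

# A trailing FALSE run is handled by the appended zero-length TRUE run
trailing <- cap_false_runs(c(TRUE, FALSE, FALSE, FALSE), 2)
print(trailing)
```

In a data.table this could then be applied per group, but whether it beats the xts approach would still need to be measured.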
In which cases would these two implementations give different results?
data(mtcars)
firstWay <- mtcars[grepl('6',mtcars$cyl),]
SecondWay <- mtcars[mtcars$cyl=='6',]
If these ways always give the same results which of them is recommended and why? Thanks
mtcars$cyl is a numeric column, so you would be better off comparing it to a number using mtcars[mtcars$cyl == 6, ].
But the difference between the equality operator == and grepl is that == will only be TRUE for members of the vector which are equal to "6", while grepl will match any member of the vector which has a 6 anywhere within it.
So, for example:
String                                                   ==     grepl
6                                                        TRUE   TRUE
123456                                                   FALSE  TRUE
6ABC                                                     FALSE  TRUE
This is a long sentence which happens to have a 6 in it  FALSE  TRUE
Whereas this long sentence does not                      FALSE  FALSE
The equivalent grepl pattern would be "^6$". There's a tutorial (one of many) on regex at http://www.regular-expressions.info/tutorial.html.
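A minimal sketch of that equivalence, using a made-up vector x rather than mtcars: the unanchored pattern matches a 6 anywhere, while the anchored pattern "^6$" agrees with ==.

```r
x <- c("6", "123456", "6ABC", "16")

eq_result       <- x == "6"           # exact equality
loose_result    <- grepl("6", x)      # a 6 anywhere in the string
anchored_result <- grepl("^6$", x)    # only the string "6" itself

print(rbind(eq_result, loose_result, anchored_result))

# grepl() coerces numbers to character first, so the same trap exists
# for numeric columns: 16 contains a 6 but is not equal to 6
cyl <- c(4, 6, 8, 16)
print(grepl("6", cyl))   # TRUE for 6 and for 16
print(cyl == 6)          # TRUE only for 6
```

For mtcars$cyl the two approaches happen to agree only because the column contains just 4, 6 and 8.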
Well, I think the first difference is that with grepl you can subset even if you do not know the exact value in advance: for example, you can search for rows that start or end with 6.
If you try to do this with the normal subsetting technique you'll get an empty object because, for example, ^6 is not recognized as a regular expression but as a literal string consisting of the symbol ^ and 6.
I am sure there are other differences, and I am sure professional users will provide more detailed answers.
As for which one should be preferred, there may be reasons of efficiency:
system.time(mtcars[grepl('^6',mtcars$cyl),])
user system elapsed
0.029 0.002 0.035
system.time(mtcars[mtcars$cyl=='6',])
user system elapsed
0.031 0.002 0.046
This little example is only a rough guide; as @Nick K suggested first, a further (and more precise) investigation should be done with microbenchmark. Of course, with a big dataset I doubt that a professional user (or one in need of speed) would prefer either of them; they would more likely rely on data.table, or on tools like dplyr that are written in a lower-level language and are therefore faster.
Using the package microbenchmark, we can see which is faster:
library(microbenchmark)
m <- microbenchmark(mtcars[grepl('6',mtcars$cyl),], mtcars[mtcars$cyl=='6',], times=10000)
m
Unit: microseconds
expr min lq mean median uq max neval
mtcars[grepl("6", mtcars$cyl), ] 229.080 234.738 247.5324 236.693 239.417 6713.914 10000
mtcars[mtcars$cyl == "6", ] 214.902 220.210 231.0240 221.956 224.471 7759.507 10000
It looks like == is faster, so you should use it when possible.
However, the two do not do exactly the same thing: grepl checks whether the pattern is present anywhere in the string, whereas == checks whether the values are equal.
grepl("6", mtcars$disp)
[1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mtcars$disp == "6"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If there are multiple boolean expressions as arguments to the which function, are they evaluated lazily?
For example:
which(test1 & test2)
That is, if test1 returns FALSE, is test2 left unevaluated because the compound expression will be FALSE anyway?
With if there can be efficiency gains as a result of that behavior. It is documented to work that way, and I don't think it is due to lazy evaluation. Even if you "force()-ed" that expression it would still only evaluate a series of &'s until it had a single FALSE. See this help page:
?Logic
@XuWang probably deserves the credit for emphasizing the difference between "&" and "&&". The "&" operator works on vectors and returns vectors. The "&&" operator acts on scalars (actually vectors of length 1) and returns a vector of length 1. When offered a vector of length > 1 on either side, it will work on only the first value of each and emit a warning (as of R 4.3.0 this is an error). It is only the "&&" version that does what is being called "lazy" evaluation. You can see that the "&" operator is not acting in a "lazy" fashion with a simple test:
fn1 <- function(x) print(x)
fn2 <- function(x) print(x)
x1 <- sample(c(TRUE, FALSE), 10, replace=TRUE)
fn1(x1) & fn2(x1) # the first two indicate evaluation of both sides regardless of first value
# [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
# [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
# [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
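For contrast, here is a small sketch of the short-circuit behaviour of "&&". The helper functions lhs and rhs and the calls log are made up for this example; they simply record which side was evaluated.

```r
calls <- character(0)
lhs <- function() { calls <<- c(calls, "lhs"); FALSE }
rhs <- function() { calls <<- c(calls, "rhs"); TRUE }

# && stops as soon as the left side is FALSE, so rhs() is never called
result_and_and <- lhs() && rhs()
calls_and_and <- calls

# & always evaluates both sides
calls <- character(0)
result_and <- lhs() & rhs()
calls_and <- calls

print(calls_and_and)   # only "lhs"
print(calls_and)       # "lhs" and "rhs"
```

So inside which(test1 & test2) both tests are fully evaluated; any saving would have to come from restructuring the code, for example by subsetting on test1 first.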
I'm trying to run a randomForest on a large-ish data set (5000x300). Unfortunately I'm getting an error message as follows:
> RF <- randomForest(prePrior1, postPrior1[,6]
+ ,,do.trace=TRUE,importance=TRUE,ntree=100,,forest=TRUE)
Error in randomForest.default(prePrior1, postPrior1[, 6], , do.trace = TRUE, :
NA/NaN/Inf in foreign function call (arg 1)
So I tried to find any NA's using:
> df2 <- prePrior1[is.na(prePrior1)]
> df2
character(0)
> df2 <- postPrior1[is.na(postPrior1[,6])]
> df2
numeric(0)
which leads me to believe that it's Inf's that are the problem as there don't seem to be any NA's.
Any suggestions for how to root out Inf's?
You're probably looking for is.finite, though I'm not 100% certain that the problem is Infs in your input data.
Be sure to read the help for is.finite carefully about which combinations of missing, infinite, etc. it picks out. Specifically, this:
> is.finite(c(1,NA,-Inf,NaN))
[1] TRUE FALSE FALSE FALSE
> is.infinite(c(1,NA,-Inf,NaN))
[1] FALSE FALSE TRUE FALSE
One of these things is not like the others. Not surprisingly, there's an is.nan function as well.
randomForest's 'NA/NaN/Inf in foreign function call' error is often misleading, and really irritating:
- you will get this if any of the variables passed is character
- actual NaNs and Infs almost never happen in clean data
My quick-and-dirty trick to narrow things down: do a binary search on your variable list, and use token parameters like ntree=2 to get an instant pass/fail on each subset of variables:
RF <- randomForest(prePrior1[m:n],ntree=2,...)
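Before bisecting, it may be quicker to screen the columns directly. The following is a sketch on a made-up data frame (df and its columns are hypothetical, standing in for prePrior1); it flags character columns and numeric columns that contain NA, NaN or Inf:

```r
df <- data.frame(
  x = c(1, 2, Inf),
  y = c("a", "b", "c"),
  z = c(0.5, NaN, 1.5),
  w = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Columns stored as character
is_char <- vapply(df, is.character, logical(1))

# Numeric columns with at least one value that is not finite
# (is.finite() is FALSE for NA, NaN, Inf and -Inf)
has_nonfinite <- vapply(
  df,
  function(col) is.numeric(col) && any(!is.finite(col)),
  logical(1)
)

char_cols <- names(df)[is_char]
bad_cols  <- names(df)[has_nonfinite]
print(char_cols)
print(bad_cols)
```

Any column that shows up in either list is a candidate for the error and can be converted or cleaned before calling randomForest.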
In analogy to is.na, you can use is.infinite to find occurrences of infinites.
Take a look at with, e.g.:
> with(df, df == Inf)
foo bar baz abc ...
[1,] FALSE FALSE TRUE FALSE ...
[2,] FALSE TRUE FALSE FALSE ...
...
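Note that comparing with == Inf only finds positive infinity. A sketch on a made-up data frame (the column names are borrowed from the output above, the values are invented) that uses is.infinite per column and which() with arr.ind = TRUE to report the row and column of each infinite cell:

```r
df <- data.frame(foo = c(1, 2), bar = c(3, Inf), baz = c(-Inf, 4))

# df == Inf misses the -Inf in baz
plus_inf_only <- sum(df == Inf)

# is.infinite() catches both signs; sapply() gives a logical matrix here
inf_cells <- sapply(df, is.infinite)

# Row/column positions of every infinite cell
positions <- which(inf_cells, arr.ind = TRUE)
print(positions)

# Names of the affected columns
inf_cols <- colnames(inf_cells)[unique(positions[, "col"])]
print(inf_cols)
```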
joran's answer is informative and is what you want. For more details about is.na() and is.infinite(), you should check out https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/is.na-methods.html
Besides, once you have the logical vector that says whether each element of the original vector is NA/Inf, you can use the which() function to get the indices, like this:
> v1 <- c(1, Inf, 2, NaN, Inf, 3, NaN, Inf)
> is.infinite(v1)
[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
> which(is.infinite(v1))
[1] 2 5 8
> is.na(v1)
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
> which(is.na(v1))
[1] 4 7
The documentation for which() is here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/any.html