I'm trying to speed up one of my R scripts. One of the parts that takes a long time is cleaning a list under certain conditions.
You may not need to understand exactly what I want to do; feel free to go straight to the code that I need to improve.
So here's the thing:
I have a list, and each element of the list is itself a list of 2 elements:
- The first element is a vector of integers with a length between 1 and 4
- The second element is a logical vector of length 6
Here's a piece of code to create such a list (of 1000 elements):
numberList<-1000
l.gen<-lapply(1:numberList,function(i){
return(list(var = floor(runif(floor(runif(1,1,5)),1,7)),vec = as.logical(floor(runif(6,0,1.99)))))
})
It looks something like this:
> l.gen
[[1]]
[[1]]$var
[1] 1 4 2
[[1]]$vec
[1] FALSE FALSE FALSE FALSE TRUE TRUE
[[2]]
[[2]]$var
[1] 3
[[2]]$vec
[1] FALSE FALSE FALSE TRUE TRUE FALSE
[[3]]
[[3]]$var
[1] 6
[[3]]$vec
[1] TRUE FALSE TRUE FALSE TRUE TRUE
[[4]]
[[4]]$var
[1] 6
[[4]]$vec
[1] TRUE TRUE TRUE FALSE FALSE FALSE
Now to the cleaning part.
I want to remove from this list every element "l" that meets two conditions.
The first condition: the "$vec" of the element "l" has more than 3 TRUE values in common with another element of the list.
For example:
$vec
[1] TRUE TRUE FALSE TRUE TRUE FALSE
and
$vec
[1] TRUE FALSE FALSE TRUE TRUE FALSE
have 3 TRUE values in common (the ones in the 1st, 4th and 5th positions), so they don't meet the condition.
The second condition is only tested when the first one holds:
the $var of the two elements should have at least one value in common (their respective positions don't matter).
So
[[1]]$var
[1] 1 4 2
and
[[1]]$var
[1] 3 1
meet that condition (because the "1" is in both vectors).
When two elements of the list meet both conditions, I delete the one with the shorter $var.
For example, in:
[[3]]
[[3]]$var
[1] 6
[[3]]$vec
[1] TRUE FALSE TRUE FALSE TRUE TRUE
[[3]]
[[3]]$var
[1] 6 3 5
[[3]]$vec
[1] TRUE TRUE TRUE FALSE TRUE TRUE
this element should be deleted :
[[3]]
[[3]]$var
[1] 6
[[3]]$vec
[1] TRUE FALSE TRUE FALSE TRUE TRUE
So here is the code I've tried to meet these requirements:
limit <- 3  # an element is dropped when it shares more than this many TRUE values
res <- lapply(l.gen, function(l) {
  for (i in seq_along(l.gen)) {
    other <- l.gen[[i]]
    # only an element with a longer $var can cause l to be dropped
    if (length(l$var) < length(other$var)) {
      in.common <- sum(l$vec & other$vec)
      if (in.common > limit && any(l$var %in% other$var)) {
        return(NULL)  # both conditions met: drop l
      }
    }
  }
  "OK"  # no conflicting element found
})
It works fine, but it's rather slow when the list is very big.
I've tried replacing the for loop with another lapply, but that takes more time on a big list: the return() in the for loop works like a "break", which can't be done in lapply(), since it visits every single element of the list.
I'm open to any suggestion that might help.
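One possible direction is to replace the pairwise loop with matrix operations. The sketch below is only an illustration under some assumptions: the values in $var are integers from 1 to 6 (as in the generator above), limit is 3, and the names vec.mat, var.mat and to.drop are made up for this example. It builds an n x n table of shared TRUE counts with tcrossprod(), so memory grows quadratically with the length of the list.

```r
set.seed(1)
n <- 1000
l.gen <- lapply(seq_len(n), function(i) {
  list(var = sample(1:6, sample(1:4, 1), replace = TRUE),
       vec = sample(c(TRUE, FALSE), 6, replace = TRUE))
})
limit <- 3  # assumed threshold: drop when more than 3 TRUE values are shared

# n x 6 matrix of the $vec flags (as 0/1), one row per list element
vec.mat <- t(vapply(l.gen, function(l) as.numeric(l$vec), numeric(6)))
# n x 6 indicator matrix: does $var contain the value 1, 2, ..., 6?
var.mat <- t(vapply(l.gen, function(l) as.numeric(1:6 %in% l$var), numeric(6)))
# length of each $var
len <- vapply(l.gen, function(l) length(l$var), integer(1))

common.true <- tcrossprod(vec.mat)      # [i, j] = number of TRUE shared by i and j
shared.var  <- tcrossprod(var.mat) > 0  # [i, j] = i and j share a $var value
shorter     <- outer(len, len, "<")     # [i, j] = $var of i is shorter than $var of j

# element i is dropped if any j satisfies all three conditions
to.drop <- rowSums(common.true > limit & shared.var & shorter) > 0
cleaned <- l.gen[!to.drop]
```

The diagonal of shorter is always FALSE, so an element is never compared against itself.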
How could I get a single boolean value that is TRUE if all values in vector are TRUE and FALSE otherwise? For instance:
> grepl("ABC",c("ABC","ABC","123ABC"))
[1] TRUE TRUE TRUE
my desired result:
[1] TRUE
Another example:
> grepl("ABC",c("ABC","ABC","123ABA"))
[1] TRUE TRUE FALSE
my desired result:
[1] FALSE
I know that this could possibly be solved with a for loop, but that would be a time-consuming method. Perhaps there is a ready-made, simpler solution. Please advise.
Use all :
all(grepl("ABC",c("ABC","ABC","123ABC")))
#[1] TRUE
all(grepl("ABC",c("ABC","ABC","123ABA")))
#[1] FALSE
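For completeness, here is a small sketch of how all() behaves next to any(), and what happens with missing values or an empty vector. The variable names are just for illustration.

```r
hits <- grepl("ABC", c("ABC", "ABC", "123ABA"))

all.match <- all(hits)   # TRUE only if every element is TRUE
any.match <- any(hits)   # TRUE if at least one element is TRUE

# NA propagates unless na.rm = TRUE is given
with.na    <- all(c(TRUE, NA))
without.na <- all(c(TRUE, NA), na.rm = TRUE)

# all() of an empty logical vector is TRUE
empty.case <- all(logical(0))
```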
Let Loss be a 500x24 matrix. I check whether it contains any infinite element by typing:
> any(abs(Loss)==Inf)
[1] TRUE
The result tells us there is at least one infinite element in the matrix.
So I look for the column(s) that contain the infinite element, using the code below:
n1<-dim(Loss)[2]
xx<-sapply(1:n1, function(i) {any(abs(Loss[i])==Inf)})
> xx
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This time it seems that no column contains an infinite element.
Why is that? I will be very glad for any help. Thanks a lot.
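A likely explanation is the indexing: on a matrix, Loss[i] with a single index picks the i-th element (counting down the first column), not the i-th column, which would be Loss[, i]. The sketch below uses a made-up Loss matrix with two infinite entries placed by hand, so the data and the names by.element, by.column and inf.cols are assumptions for illustration only.

```r
set.seed(42)
# hypothetical stand-in for the real Loss matrix
Loss <- matrix(rnorm(500 * 24), nrow = 500, ncol = 24)
Loss[10, 3]   <- Inf
Loss[200, 17] <- -Inf

# Loss[i] is a single element, so this only inspects the first 24 cells of column 1
by.element <- sapply(seq_len(ncol(Loss)), function(i) any(is.infinite(Loss[i])))

# Loss[, i] is the whole i-th column
by.column <- sapply(seq_len(ncol(Loss)), function(i) any(is.infinite(Loss[, i])))

# the same result without an explicit loop
inf.cols <- which(colSums(is.infinite(Loss)) > 0)
```

is.infinite() catches both Inf and -Inf, so abs() is not needed.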
In which cases would these two implementations give different results?
data(mtcars)
firstWay <- mtcars[grepl('6',mtcars$cyl),]
SecondWay <- mtcars[mtcars$cyl=='6',]
If they always give the same results, which one is recommended, and why? Thanks
mtcars$cyl is a numeric column, so you would be better off comparing it to a number using mtcars[mtcars$cyl == 6, ].
But the difference between the equality operator == and grepl is that == will only be TRUE for members of the vector which are equal to "6", while grepl will match any member of the vector which has a 6 anywhere within it.
So, for example:
String                                                    ==     grepl
6                                                         TRUE   TRUE
123456                                                    FALSE  TRUE
6ABC                                                      FALSE  TRUE
This is a long sentence which happens to have a 6 in it   FALSE  TRUE
Whereas this long sentence does not                       FALSE  FALSE
The equivalent grepl pattern would be "^6$". There's a tutorial (one of many) on regex at http://www.regular-expressions.info/tutorial.html.
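The comparison above can be reproduced with a short sketch. The vector x and the result names are made up for illustration; the mtcars line relies on the built-in dataset.

```r
x <- c("6", "123456", "6ABC", "no digit here")

exact    <- x == "6"          # TRUE only for the string that is exactly "6"
contains <- grepl("6", x)     # TRUE wherever a 6 appears anywhere
anchored <- grepl("^6$", x)   # anchored pattern: behaves like == here

# on mtcars, cyl only takes the values 4, 6 and 8, so both approaches agree
same.rows <- identical(mtcars[grepl("6", mtcars$cyl), ],
                       mtcars[mtcars$cyl == 6, ])
```

On a column such as mtcars$disp, where 6 can appear as one digit among others, the two approaches would select different rows.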
Well, I think the first difference is that with grepl you can subset even if you don't know the exact value in advance: instead of matching exactly 6, you can search for rows that start or end with 6.
If you try to do this with ordinary subsetting you'll get an empty object because, for example, ^6 is not treated as a regular expression but as a literal string made of the symbol ^ and 6.
I am sure there are other differences, and that more experienced users will provide more detailed answers.
As for which one should be preferred, efficiency may be a reason:
system.time(mtcars[grepl('^6',mtcars$cyl),])
user system elapsed
0.029 0.002 0.035
system.time(mtcars[mtcars$cyl=='6',])
user system elapsed
0.031 0.002 0.046
This little example is only a rough guide; as @Nick K suggested first, a more precise investigation has to be done with microbenchmark. Of course, with a big dataset I doubt that a professional user (or anyone in need of speed) would pick either of them; they would more likely rely on data.table, or on tools like dplyr that are written in a lower-level language and are therefore faster.
Using the package microbenchmark, we can see which is faster
library(microbenchmark)
m <- microbenchmark(mtcars[grepl('6',mtcars$cyl),], mtcars[mtcars$cyl=='6',], times=10000)
Unit: microseconds
expr min lq mean median uq max neval
mtcars[grepl("6", mtcars$cyl), ] 229.080 234.738 247.5324 236.693 239.417 6713.914 10000
mtcars[mtcars$cyl == "6", ] 214.902 220.210 231.0240 221.956 224.471 7759.507 10000
It looks like == is faster, so you should use it when possible.
However, the functions do not do exactly the same thing: grepl checks whether the string is present anywhere, whereas == checks whether the values are equal.
grepl("6", mtcars$disp)
[1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mtcars$disp == "6"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I have a vector of three-character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings...
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*"):
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know if I could extend this to the case where I have longer strings. Say I have strings that are 5 characters long, could I use grepl in such a way to return strings where W is the third character?
I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos= 2
n=pos-1
grepl(paste0("^.{",n,"}W"),example)
[1] TRUE FALSE FALSE TRUE
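If you would rather stay with glob2rx, a sketch like the following may help. In a glob, * stands for any number of characters (including none), so "*W*" only asks for a W somewhere, while ? stands for exactly one character. The result names are just for illustration.

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# "*W*" matches any string containing a W, which is every string here
anywhere <- grepl(glob2rx("*W*"), example)

# "?W*" requires exactly one character before the W (W in 2nd position)
second <- grepl(glob2rx("?W*"), example)

# "??W*" requires exactly two characters before the W (W in 3rd position)
third <- grepl(glob2rx("??W*"), example)
```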
If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
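The same idea extends to longer strings and other positions. Here is a minimal sketch; has_char_at and words are hypothetical names used only for this example.

```r
# hypothetical helper: is `char` the character at position `pos` of each string?
has_char_at <- function(x, pos, char) {
  substr(x, pos, pos) == char
}

words <- c("ABWDE", "WWAWW", "XYWZZ", "AB")
third.is.w <- has_char_at(words, 3, "W")
```

Strings shorter than pos give an empty substring, so they simply come back as FALSE.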
I have a data frame named newdata. It has two columns named BONUS and GENDER.
When I write the following code in R:
> newdata <- within(newdata,{
PROMOTION=ifelse(BONUS>=1500,1,0)})
it works even though I haven't used a loop, but the following code doesn't work without a loop. Why?
> add <- with(newdata,
if(GENDER==F)sum(PROMOTION))
Warning message:
In if (GENDER == F) sum(PROMOTION) :
the condition has length > 1 and only the first element will be used
My question is: why were all elements used in the first code?
ifelse is vectorized, but if is not. For example:
> x <- rbinom(20,1,.5)
> ifelse(x,TRUE,FALSE)
[1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[13] FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
> if(x) {TRUE} else {FALSE}
[1] TRUE
Warning message:
In if (x) { :
the condition has length > 1 and only the first element will be used
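For the sum in the question, a vectorized alternative is to subset with a logical vector instead of using if. The sketch below assumes GENDER holds strings such as "F" and "M" (an unquoted F in R means FALSE), and the data frame is made up for illustration. Note that from R 4.2.0 onwards, an if condition of length greater than one is an error rather than a warning.

```r
# hypothetical data with the two columns from the question
newdata <- data.frame(BONUS  = c(1000, 1500, 2000, 1200, 1800),
                      GENDER = c("F", "M", "F", "F", "M"))

# ifelse() works element by element, so no loop is needed
newdata <- within(newdata, PROMOTION <- ifelse(BONUS >= 1500, 1, 0))

# GENDER == "F" is a logical vector; use it to pick rows before summing
add <- with(newdata, sum(PROMOTION[GENDER == "F"]))
```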