I'm only just getting a handle on vectorizing code with R (links to useful examples would help), and I'm trying to find a faster way of handling this loop. a, b, and c all contain a bunch of numbers, and I'm trying to find any number that occurs in all three columns. The loop works, but is super slow:
for (i in 1:length(a)) {
  if (any(a[i] == b))
    if (any(a[i] == c))
      print(a[i])
}
Is there an apply function that would work really well here?
Maybe this?
x <- 1:5
y <- 4:10
z <- 4:8
Reduce(intersect, list(x, y, z))
## [1] 4 5
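For clarity, Reduce simply folds the binary intersect over the list, so the call above is equivalent to nesting the calls by hand:
intersect(intersect(x, y), z)
## [1] 4 5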
I see you have accepted @joran's solution, but it is really a hidden loop. This is a truly "vectorized" solution:
> x <- 1:5
> y <- 4:10
> z <- 4:8
> x[ (x %in% y) & (x %in% z) ]
[1] 4 5
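One caveat (my addition, not part of the original answer): if x contains duplicates, each common value is returned once per occurrence; wrap the expression in unique() to report each value once:
x2 <- c(1:5, 4)  # duplicate 4
unique(x2[(x2 %in% y) & (x2 %in% z)])
## [1] 4 5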
You could also count the total number of times each value appears (assuming there are no duplicates within each vector; if there are, run unique on them first). This code returns the desired numbers as characters; they can be converted back as needed.
x <- 1:5; y <- 4:10; z <- 4:8
foo <- table(c(x,y,z))
names(foo)[foo==3]
## [1] "4" "5"
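As a sketch of the deduplicate-first variant mentioned above (assuming the same x, y, z as before):
foo <- table(c(unique(x), unique(y), unique(z)))
as.numeric(names(foo)[foo == 3])  # also converts back from character
## [1] 4 5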
You can also improve your for loop by using intersect within a loop (this is basically what Reduce does internally):
intersect.list <- function(list) {  ## code from stabperf package
  if (is.null(list)) return(NA)
  # Handle empty list
  if (length(list) < 1) return(NA)
  # Start with first element of list
  int <- list[[1]]
  for (v in list[-1]) { int <- intersect(int, v) }
  return(int)
}
intersect.list(list(x,y,z))
[1] 4 5
Benchmarking:
library(microbenchmark)
set.seed(1)
N <- 1e6
x <- sample(1:100,N,rep=T)
y <- sample(1:100,N,rep=T)
z <- sample(1:100,N,rep=T)
vectorized <- function()x[ (x %in% y) & (x %in% z) ]
microbenchmark(intersect.list(list(x, y, z)),
               vectorized(),
               Reduce(intersect, list(x, y, z)), times = 10)
Unit: milliseconds
expr min lq median uq max neval
intersect.list(list(x, y, z)) 73.2862 75.14838 76.77792 85.54216 121.8442 10
vectorized() 131.9560 132.40266 134.47248 139.93902 172.7829 10
Reduce(intersect, list(x, y, z)) 88.4308 90.06320 92.72929 128.05930 133.2982 10
As you can see, the for loop is slightly faster than both Reduce and the vectorized solution.
I am trying to sum the odd numbers below a specific number (excluding the number itself); for example, for N = 5: 1 + 3 = 4.
a <- 5
sum <- function(x){
  k <- 0
  for (n in x) {
    if (n %% 2 == 1)
      k <- k + 1
  }
  return(k)
}
sum(a)
# [1] 1
But the function is not working, because it counts the odd numbers instead of summing them.
We may use a vectorized approach:
a1 <- head(seq_len(a), -1)
sum(a1[a1%%2 == 1])
[1] 4
If we want a loop, perhaps
f1 <- function(x) {
  s <- 0
  k <- 1
  while (k < x) {
    if (k %% 2 == 1) {
      s <- s + k
    }
    k <- k + 1
  }
  s
}
f1(5)
The issue in OP's code is
for(n in x)
where x is just a single value, so the body runs only once - i.e. if our input is 5, the loop executes a single time with n equal to 5. Instead, it should iterate over seq_len(x - 1). The correct loop would be something like
f2 <- function(x){
  k <- 0
  for (n in seq_len(x - 1)) {
    if (n %% 2 == 1) {
      k <- k + n
    }
  }
  k
}
f2(5)
NOTE: sum is a base R function, so it is better to give the custom function a different name.
Mathematically, we can try the following code to calculate the sum (N could be odd or even)
(ceiling((N - 1) / 2))^2
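The reasoning: there are ceiling((N - 1) / 2) odd numbers below N, and the sum of the first m odd numbers is m^2. A quick sanity check against a brute-force loop (my addition):
Ns <- 1:20
closed_form <- (ceiling((Ns - 1) / 2))^2
brute <- sapply(Ns, function(n) { k <- seq_len(n - 1); sum(k[k %% 2 == 1]) })
all(closed_form == brute)
## [1] TRUE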
It's simple and it does what it says:
sum(seq(1, length.out = floor(N/2), by = 2))
The closed-form (multiplication) solution is probably going to be quicker, though.
NB - an earlier version of this answer was
sum(seq(1, N - 1, 2))
which, as @tjebo points out, silently gives the wrong answer for N = 1.
We could use a recycled logical vector to access the values:
a <- 5
a1 <- head(seq_len(a), -1)
sum(a1[c(TRUE, FALSE)])
output:
[1] 4
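This works because c(TRUE, FALSE) is recycled along a1, keeping positions 1, 3, 5, ...; since a1 starts at 1, those positions hold exactly the odd values:
a1[c(TRUE, FALSE)]
## [1] 1 3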
Fun benchmarking. Is it a surprise that Thomas's simple formula is by far the fastest solution...?
count_odds_thomas <- function(x){
  (ceiling((x - 1) / 2))^2
}
count_odds_akrun <- function(x){
  a1 <- head(seq_len(x), -1)
  sum(a1[a1 %% 2 == 1])
}
count_odds_dash2 <- function(x){
  sum(seq(1, x - 1, 2))
}
m <- microbenchmark::microbenchmark(
akrun = count_odds_akrun(10^6),
dash2 = count_odds_dash2(10^6),
thomas = count_odds_thomas(10^6)
)
m
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> akrun 22117564 26299922.0 30052362.16 28653712 31891621 70721894 100
#> dash2 4016254 4384944.0 7159095.88 4767401 8202516 52423322 100
#> thomas 439 935.5 27599.34 6223 8482 2205286 100
ggplot2::autoplot(m)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Moreover, Thomas's solution works on really big numbers (also no surprise): on my machine, count_odds_akrun exhausts the memory at a “mere” 10^10, but Thomas's formula works fine up to infinity…
count_odds_thomas(10^10)
#> [1] 2.5e+19
count_odds_akrun(10^10)
#> Error: vector memory exhausted (limit reached?)
I have asked this question previously (see here) and received a satisfactory answer using the purrr package. However, it has proved to be a bottleneck in my program, so I would like to rewrite the section using the Rcpp package.
Proper subset: A proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note I am also excluding the empty set).
Suppose you have the following vectors in a list:
a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)
My aim is to keep only vectors which have no proper subset within the list, which in this example would be a, b and c.
Previous Solution
library(purrr)
possibilities <- list(a, b, c, d, e, f)
keep(possibilities,
     map2_lgl(.x = possibilities,
              .y = seq_along(possibilities),
              ~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))
The idea here is to avoid the O(N^3) approach and use something of lower order instead. The other answer provided here will still be slow, since it is worse than O(N^2). Here is a solution below O(N^2) in the typical case; the worst case is O(N^2), which occurs when all the elements are unique.
onlySet <- function(x){
  i <- 1
  repeat{
    # TRUE for each later set that does NOT contain x[[1]] (i.e. is not a superset of it)
    y <- sapply(x[-1], function(el) !all(is.element(x[[1]], el)))
    if (all(y)){
      if (i == length(x)) break
      else i <- i + 1
    }
    # drop the supersets of x[[1]] and rotate it to the end
    x <- c(x[-1][y], x[1])
  }
  x
}
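A quick check on the example list from the question (possibilities as defined above; note that the rotation means the survivors come back in a different order than the input, here b, c, a):
onlySet(possibilities)
# [[1]]
# [1] 1 3
#
# [[2]]
# [1] 2 4
#
# [[3]]
# [1] 1 2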
Now to show the time difference, check out the following:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
  mat <- outer(a, a, match_fun)
  a[colSums(mat) == 1]
}
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(method1(poss), onlySet(poss))
Unit: milliseconds
expr min lq mean median uq max neval cld
method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077 100 b
onlySet(poss) 1.9845 2.07005 2.191647 2.15945 2.24245 3.3656 100 a
Have you tried optimising the solution in base R first? For example, the following reproduces your expected output and uses (faster) base R array routines:
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(possibilities, possibilities, match_fun)
possibilities[colSums(mat) == 1]
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4
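For intuition: mat[i, j] is TRUE when set i is a subset of set j, and every set is a subset of itself, so a column sum of 1 means no other set in the list is contained in that column's set. For the example list above:
colSums(mat)
#[1] 1 1 1 5 2 3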
Inspired by Onyambu's performant solution, here is another base R option using a recursive function:
f_recursive <- function(x, i = 1) {
  if (i > length(x)) return(x)
  idx <- which(sapply(x[-i], function(el) all(x[[i]] %in% el))) + 1
  if (length(idx) == 0) f_recursive(x, i + 1) else f_recursive(x[-idx], i + 1)
}
f_recursive(possibilities)
The performance is on par with Onyambu's solution.
poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(
method1(poss),
onlySet(poss),
f_recursive(poss))
#Unit: milliseconds
# expr min lq mean median uq
# method1(poss) 682.558602 710.974831 750.325377 730.627996 765.040976
# onlySet(poss) 1.700646 1.782713 1.870972 1.819820 1.918669
# f_recursive(poss) 1.681120 1.737459 1.884685 1.806384 1.901582
# max neval
# 1200.562889 100
# 2.371646 100
# 3.217013 100
I know that the insertion sort algorithm is faster than selection sort, but the difference in speed seems too huge to me.
Here are my two implementations:
# Selection sort algorithm:
u <- round(runif(100, 1, 100))
selection_sort <- function(x){
  s <- vector('numeric')
  while (length(x) != 0){
    minimum <- x[1]
    for (i in 1:length(x)){
      ifelse(x[i] < minimum, minimum <- x[i], next())
    }
    x <- x[-match(minimum, x)]
    s <- c(s, minimum)
  }
  s
}
# Insertion sort algorithm:
u <- round(runif(100, 1, 100))
insertion_sort <- function(x){
  s <- vector('numeric')
  while (length(x) != 0){
    num <- x[1]
    x <- x[-match(num, x)]
    if (length(s) == 0){
      s <- c(s, num)
    } else {
      for (i in 1:length(s)){
        if (s[i] >= num){
          s <- append(s, num, i - 1)
          break
        }
      }
      if (s[length(s)] < num){
        s <- c(s, num)
      }
    }
  }
  s  # return the sorted vector
}
I have checked the speed of my code with the microbenchmark command and got the following results:
microbenchmark(b <- insertion_sort(u), times = 10)
                   expr      min       lq     mean   median       uq      max neval
 b <- insertion_sort(u) 2.793573 2.873704 3.159338 2.920087 3.136996 5.066089    10
microbenchmark(b <- selection_sort(u), times = 10)
                   expr      min       lq    mean   median       uq      max neval
 b <- selection_sort(u) 21.50502 21.61436 31.7791 22.71371 40.64712 68.17705    10
Is this difference in speed normal? I know my code may not be efficient; if such a difference is not normal, how can I correct it?
P.S. Both functions sort correctly.
Selection sort
1. Given a vector x, let the initial unsorted vector u be equal to x, and the initial sorted vector s be a vector of length 0.
2. Find the smallest element of u, then remove it from u and add it to the end of s.
3. If u is not empty, go back to step 2.
Insertion sort
1. Given a vector x, let the initial unsorted vector u be equal to x, and the initial sorted vector s be a vector of length 0.
2. Remove the last element of u and insert it into s so that s is still sorted.
3. If u is not empty, go back to step 2.
Here is a (relatively) efficient way to implement selection sort in R:
selection_sort <- function(x){
  s <- numeric(length(x))
  for (i in seq_len(length(x))) {
    ind <- which.min(x)
    s[i] <- x[ind]
    x[ind] <- NA  # which.min ignores NA, so this element is skipped from now on
  }
  s
}
set.seed(42)
v <- rnorm(10)
selection_sort(v)
#[1] -0.56469817 -0.10612452 -0.09465904 -0.06271410 0.36312841 0.40426832 0.63286260 1.37095845 1.51152200 2.01842371
Note how I avoid resizing a vector and how I use a for loop, thereby avoiding the test in a while or repeat loop.
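For symmetry with the step-by-step description above, here is a minimal insertion sort sketch in the same spirit (my addition, not from the original answer). Unlike the selection sort above, it does grow the result vector, trading some speed for brevity; findInterval locates the insertion point in the already-sorted part:
insertion_sort2 <- function(x) {
  s <- numeric(0)
  for (v in x) {
    pos <- findInterval(v, s)   # number of sorted elements <= v
    s <- append(s, v, after = pos)
  }
  s
}
insertion_sort2(v)  # same output as selection_sort(v)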
A similar idea (by @Roland) could also be implemented in JuliaLang (included here because iteration is often fast in JuliaLang):
srand(42)
v = randn(10)
v1 = deepcopy(v)
function sel_sort(x)
    s = zeros(length(x))
    for i in eachindex(x)
        ind = indmin(x)
        s[i] = x[ind]
        x[ind] = maximum(x) + 1
    end
    s
end
sel_sort(v)
#10-element Array{Float64,1}:
#-2.64199
#-1.1449
#-0.556027
#-0.468606
#-0.444383
#-0.299484
# 0.0271553
# 0.156143
# 1.00331
# 1.77786
In addition, we can also use the sorting algorithms already implemented:
sort(v1, alg = InsertionSort)
# 10-element Array{Float64,1}:
#-2.64199
#-1.1449
#-0.556027
#-0.468606
#-0.444383
#-0.299484
#0.0271553
#0.156143
#1.00331
#1.77786
The sort! variant modifies the original vector in place:
sort!(v1, alg = InsertionSort)
I am wondering about the simple task of splitting a vector into two at a certain index:
splitAt <- function(x, pos){
  list(x[1:pos-1], x[pos:length(x)])
}
a <- c(1, 2, 2, 3)
splitAt(a, 4)
[[1]]
[1] 1 2 2
[[2]]
[1] 3
My question: there must be some existing function for this, but I can't find it. Is split perhaps a possibility? My naive implementation also does not work if pos = 0 or pos > length(a).
An improvement would be:
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
which can now take a vector of positions:
splitAt(a, c(2, 4))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2 2
#
# [[3]]
# [1] 3
And it does behave properly (subjectively) if pos <= 1 or pos > length(x), in the sense that it returns the whole original vector in a single list item. If you'd like it to error out instead, use stopifnot at the top of the function.
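For example, with a from the question:
splitAt(a, 0)
# [[1]]
# [1] 1 2 2 3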
I tried to use flodel's answer, but it was too slow in my case with a very large x (and the function has to be called repeatedly). So I created the following function, which is much faster but also very ugly and doesn't behave properly. In particular, it doesn't check anything and will return buggy results, at least for pos >= length(x) or pos <= 0 (you can add those checks yourself if you're unsure about your inputs and not too concerned about speed), and perhaps in some other cases as well, so be careful.
splitAt2 <- function(x, pos) {
  out <- list()
  pos2 <- c(1, pos, length(x) + 1)
  for (i in seq_along(pos2[-1])) {
    out[[i]] <- x[pos2[i]:(pos2[i + 1] - 1)]
  }
  return(out)
}
However, splitAt2 runs about 20 times faster with an x of length 10^6:
library(microbenchmark)
W <- rnorm(1e6)
splits <- cumsum(rep(1e5, 9))
tm <- microbenchmark(
splitAt(W, splits),
splitAt2(W, splits),
times=10)
tm
Another alternative that might be faster and/or more readable/elegant than flodel's solution:
splitAt <- function(x, pos) {
  unname(split(x, findInterval(seq_along(x), pos)))
}
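A quick check that it matches the earlier output, with a from the question:
splitAt(a, c(2, 4))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2 2
#
# [[3]]
# [1] 3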
I'd like to show the names of columns in a large dataframe that contain missing values. Basically, I want the equivalent of complete.cases(df) but for columns, not rows. Some of the columns are non-numeric, so something like
names(df[is.na(colMeans(df))])
returns "Error in colMeans(df) : 'x' must be numeric." So, my current solution is to transpose the dataframe and run complete.cases, but I'm guessing there's some variant of apply (or something in plyr) that's much more efficient.
nacols <- function(df) {
  names(df[, !complete.cases(t(df))])
}
w <- c("hello","goodbye","stuff")
x <- c(1,2,3)
y <- c(1,NA,0)
z <- c(1,0, NA)
tmp <- data.frame(w,x,y,z)
nacols(tmp)
[1] "y" "z"
Can someone show me a more efficient function to identify columns that have NAs?
This is the fastest way that I know of:
unlist(lapply(df, function(x) any(is.na(x))))
EDIT:
I guess everyone else wrote theirs out in full, so here it is in full:
nacols <- function(df) {
  colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
And if you microbenchmark the 4 solutions on a WIN 7 machine:
Unit: microseconds
expr min lq median uq max
1 ANDRIE 85.380 91.911 106.375 116.639 863.124
2 MANOEL 87.712 93.778 105.908 118.971 8426.886
3 MOIRA 764.215 798.273 817.402 876.188 143039.632
4 TYLER 51.321 57.853 62.518 72.316 1365.136
(A plot of these timings was included in the original answer.)
Edit: At the time I wrote this, anyNA did not exist or I was unaware of it. It may speed things up even more; per the help manual for ?anyNA:
The generic function anyNA implements any(is.na(x)) in a possibly faster way (especially for atomic vectors).
nacols <- function(df) {
  colnames(df)[unlist(lapply(df, function(x) anyNA(x)))]
}
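Applied to the tmp data frame from the question:
nacols(tmp)
## [1] "y" "z"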
Here is one way: is.na(tmp) produces a logical matrix, and colSums counts the NAs in each column.
colnames(tmp)[colSums(is.na(tmp)) > 0]
Hope it helps,
Manoel
One way...
nacols <- function(x){
  y <- sapply(x, function(xx) any(is.na(xx)))
  names(y[y])
}
nacols(tmp)
nacols(tmp)
[1] "y" "z"
Explanation: since the result y is a logical vector, names(y[y]) returns the names of y for only those cases where y is TRUE.