I am quite familiar with R, having used it for a few years now. Unfortunately, I am not very well versed in writing functions that involve looping or repeating a calculation. The problem is as follows:
I have a vector containing over 1000 values. I would like to calculate the absolute difference between two juxtaposing means of equal size from a subset of that vector.
Here is an example.
I have the vector (vec) of length 8
[1] 0.12472963 1.15341289 -1.09662288 -0.73241639 0.06437658 -0.13647136 -1.52592048 1.46450084
I would like to calculate the mean of the first 2 values (0.12472963, 1.15341289) and take the absolute difference from the mean of the following 2 values (-1.09662288, -0.73241639), and so on, working my way down the vector.
In this case, I can easily use the following equation:
abs(mean(vec[1:2])-mean(vec[3:4]))
and incrementally increase each index by 1 so as to work my way down manually to the end of the vector. I would obtain the following vector:
[1] 1.553591 0.3624149 0.8784722 0.497176 0.005337574
What I would like, however, is an automated routine that enables me to do this over long vectors and to change the number of values from which the means are calculated.
It appears to me that it should be relatively simple, but I do not know where to start.
Use filter:
c(abs(filter(vec, c(0.5, 0.5, -0.5, -0.5), sides=1)[-(1:3)]))
#[1] 1.55359090 0.36241491 0.87847224 0.49717601 0.00533757
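With sides = 1, the weights c(0.5, 0.5, -0.5, -0.5) make stats::filter() compute the mean of the later pair minus the mean of the earlier pair in a single pass; the first 3 entries are NA and are dropped. A minimal sketch of the same idea for an arbitrary block size n (my generalization, not part of the original answer):
n <- 2                              # block size; change as needed
w <- c(rep(1, n), rep(-1, n)) / n   # +1/n over one block, -1/n over the other
c(abs(stats::filter(vec, w, sides = 1)[-(1:(2 * n - 1))]))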
Using rollapply from zoo
library(zoo)
n <- 2
n1 <- length(vec)
abs(rollapply(vec[1:(n1-n)], 2, mean)-rollapply(vec[(n+1):n1], 2,mean))
#[1] 1.55359090 0.36241491 0.87847224 0.49717601 0.00533757
Other variations of the above code (from comments by @G. Grothendieck, one of the authors of the zoo package) are:
abs(rollmean(vec[1:(n1-n)], 2) - rollmean(vec[(n+1):n1], 2))  # using `rollmean` instead of `rollapply`
or
rollapply(vec, 4, function(x) abs(mean(x[1:2]) - mean(x[3:4])))
or
abs(rollapply(vec, 4, "%*%", c(1, 1, -1, -1)/2))
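For an arbitrary block size n, the last variant generalizes to a window of width 2*n with weights +1/n and -1/n (my sketch, not from the original comments):
n <- 2   # block size; change as needed
abs(rollapply(vec, 2 * n, "%*%", c(rep(1, n), rep(-1, n)) / n))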
As always, I chime in with:
vec<-rep(c( 0.12472963 , 1.15341289, -1.09662288, -0.73241639 , 0.06437658, -0.13647136 ,-1.52592048 , 1.46450084 ),100)
microbenchmark(roland(vec),akrun(vec),times=3)
Unit: microseconds
        expr      min        lq      mean   median       uq      max neval
 roland(vec)  564.128  565.2275  647.3353  566.327  688.939  811.551     3
  akrun(vec) 3717.410 3982.1535 4218.3057 4246.897 4468.753 4690.610     3
I have two matrices as follows in R:
M<-matrix(c(1,4,1,3,1,4,2,3,1,2,1,2),3)
1 3 2 2
4 1 3 1
1 4 1 2
N<-matrix(c(1,1,2,2,3,4,-2,2,1,4,3,-1),3)
1 2 -2 4
1 3 2 3
2 4 1 -1
I want to find a vector (a 1x3 matrix) whose elements are the minimum element of each row of M multiplied by the maximum element of the corresponding row of N. For example, the first element of the vector is the minimum of the first row of M (which is 1) multiplied by the maximum of the first row of N (which is 4), so the first element of the vector is 1*4 = 4.
The final answer is: (1*4, 1*3,1*4)=(4,3,4)
To find this vector (or matrix) I have written the below code:
c(min(M[1,])*max(N[1,]),min(M[2,])*max(N[2,]),min(M[3,])*max(N[3,]))
But it is rather long. Could anyone write shorter (or simpler, or easier) code?
apply(M, 1, min) * apply(N, 1, max)
The most straightforward way to tackle this, and maybe also the most readable, is to use apply (as already suggested by @Jan):
apply(M, 1, min) * apply(N, 1, max)
However, if you have a lot of data, the apply approach - which loops through all the data - can be slow. A faster way is to use built-in vectorized functions to perform fast operations on all the rows together.
The R max.col(m) function returns the index of the column with the highest value in each row of a matrix m. There isn't a min.col(m) function, but you can get the same result by using max.col(-m).
So, the vectorized approach is:
M_min_of_each_row <- M[cbind(seq_len(nrow(M)), max.col(-M))]
N_max_of_each_row <- N[cbind(seq_len(nrow(N)), max.col(N))]
answer <- M_min_of_each_row * N_max_of_each_row
How much faster is this for a big matrix? We can use microbenchmark to test:
using_apply=function(M,N) apply(M,1,min)*apply(N,1,max)
using_maxcol=function(M,N) M[cbind(seq_len(nrow(M)),max.col(-M))]*N[cbind(seq_len(nrow(N)),max.col(N))]
library(microbenchmark)
M=matrix(sample(1:100,40000,replace=T),ncol=4);N=matrix(sample(1:100,40000,replace=T),ncol=4)
microbenchmark(using_apply(M,N),using_maxcol(M,N))
# Unit: milliseconds
# expr min lq mean median uq max neval
# using_apply(M, N) 25.319694 28.411979 31.762766 30.829093 33.789692 71.893174 100
# using_maxcol(M, N) 1.608357 1.876968 2.117926 2.042053 2.270023 4.858531 100
# check that the results are the same:
all(using_apply(M,N)==using_maxcol(M,N))
# [1] TRUE
So the vectorized approach is about 15x faster. But, of course, you might consider the apply approach good enough: it's more concise and (arguably) more readable...
I have a table of data in which I've labeled each row with the cluster it falls into and calculated the average of each row's column values. I would like to select the median row for each cluster.
For example's sake, just looking at one cluster, I would like to use:
median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
I can see that
> as.numeric(as.vector(subset(df,df$cluster == i )$avg))
[1] 48.11111111 47.77777778 49.44444444 49.33333333 47.55555556 46.55555556 47.44444444 47.11111111 45.66666667 45.44444444
And yet, the median is
> median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
[1] 47.5
I would like to find the median record by matching the returned median against the averages in the column, but that isn't possible with this return value.
I've found some documentation and questions on rounding with the mean function, but that doesn't seem to apply to this unfortunately.
I could also limit the number of decimal places in the data, but some records are so close that duplicates would be common if rounded to one decimal.
When the input has an even number of values (like the 10 values you have), there is no single value directly in the middle. The standard definition of the median (which R implements) averages the two middle values when the number of inputs is even. You could rank the data and, for an even-length input, select either the n/2 or the n/2 + 1 record.
So, if your data was x = c(8, 6, 7, 5), the median is 6.5. You seem to want the index of "the median", that is either 2 or 3.
If we assume there are no ties, then we can get these answers with
which(rank(x) == length(x) / 2)
# [1] 2
which(rank(x) == length(x) / 2 + 1)
# [1] 3
If ties are a possibility, then rank's default tie-breaking method will cause you some problems. Have a look at ?rank and figure out which option you'd like to use.
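For instance, with the default ties.method = "average" the comparison can miss entirely; a small illustration with made-up data (not from the original question):
x <- c(8, 6, 6, 5)
which(rank(x) == length(x) / 2)                          # integer(0): the tied ranks are 2.5
which(rank(x, ties.method = "first") == length(x) / 2)   # [1] 2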
We can, of course, turn this into a little utility function:
median_index = function(x) {
  lx = length(x)
  if (lx %% 2 == 1) {
    return(match(median(x), x))
  }
  which(rank(x, ties.method = "first") == lx / 2 + 1)
}
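A quick check of the helper (values as I'd expect, not taken from the original answer):
median_index(c(8, 6, 7, 5))   # even length: index of the upper-middle value, here 3
median_index(c(3, 1, 2))      # odd length: index of the median itself, here 3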
There is an easier way to do that: use dplyr
library(dplyr)
df %>%
  group_by(cluster) %>%
  summarise(Median = median(avg))
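Note that summarise() returns the median value per cluster rather than the median row itself. If the goal is the row whose avg is closest to that median, a sketch along these lines might work (assuming a dplyr version with slice_min(), i.e. >= 1.0):
df %>%
  group_by(cluster) %>%
  slice_min(abs(avg - median(avg)), n = 1, with_ties = FALSE)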
I am a bit stuck with this basic problem, but I cannot find a solution.
I have two data frames (dummies below):
x<- data.frame("Col1"=c(1,2,3,4), "Col2"=c(3,3,6,3))
y<- data.frame("ColA"=c(0,0,9,4), "ColB"=c(5,3,20,3))
I need to use the location of the median value of one column in df x to then retrieve a value from df y. For this, I am trying to get the row number of the median value in e.g. x$Col1 to then retrieve the value using something like y[,"ColB"][row.number]
Is there an elegant way/function for doing this? Solutions might need to account for two cases: when the sample has an even number of values and when it has an odd number (when the count is even, the median might be a value not found in the sample, since it is the mean of the two middle values).
The problem is a little underspecified.
What should happen when the median isn't in the data?
What should happen if the median appears in the data multiple times?
Here's a solution which takes the (absolute) difference between each value and the median, then returns the index of the first row for which that difference vector achieves its minimum.
with(x, which.min(abs(Col1 - median(Col1))))
# [1] 2
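Tying this back to the second data frame in the question (my illustration, using the dummy data above): the returned row number can index y directly.
idx <- with(x, which.min(abs(Col1 - median(Col1))))
y$ColB[idx]
# [1] 3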
The quantile function with type = 1 (i.e. no averaging) may also be of interest, depending on your desired behavior. It returns the lower of the two "sides" of the median, while the which.min method above can depend on the ordering of your data.
quantile(x$Col1, .5, type = 1)
# 50%
# 2
An option using quantile is
with(x, which(Col1 == quantile(Col1, .5, type = 1)))
# [1] 2
This could possibly return multiple row-numbers.
Edit:
If you want it to only return the first match, you could modify it as shown below
with(x, which.min(Col1 != quantile(Col1, .5, type = 1)))
Here, something like y$ColB[which(x$Col1 == round(median(x$Col1)))] would do the trick.
The problem is that x has an even number of rows, so the median 2.5 is not an integer. In this case you have to choose between 2 and 3.
Note: The above works for your example, not for general cases (e.g. c(-2L,2L) or with rational numbers). For the more general case see #IceCreamToucan's solution.
I'm hoping someone more knowledgeable than myself can help optimize this code. I've tried a number of methods, including foreach with doParallel (and snow) and compiler, but I think there may be simpler ways to improve the code, such as changing data frames to data tables/matrices, and perhaps preallocating objects instead of concatenating vectors in a loop.
Most of the variables listed below must be allowed to change in length depending on previous steps in the pipeline. Dimensions listed are taken from 1 example to show relative magnitude.
s.ids = a factor with length 66510. Haven't noticed a difference in speed when changed to a character vector.
g.list = a character vector with length 978.
l_signatures = a 978x66511 matrix.
d_g_up and d_g_down = small dataframes (nx10, n ranging from 5-200) with metadata related to g.list
c_score_new() computes a score. It's complex enough that it's essentially unchangeable in this scenario. It expects e_signature to have 2 columns, 1 made of g.list ("ids"), and the other as corresponding "rank"s generated by: rank(-1 * l_signatures[,as.character(id)], ties.method="random")
d_scores <- c()  # scores accumulate here, one per id
for (id in s.ids) {
  e_signature <- data.frame(g.list,
                            rank(-1 * l_signatures[, as.character(id)],
                                 ties.method = "random"))
  colnames(e_signature) <- c("ids", "rank")
  d_scores <- c(d_scores, c_score_new(d_g_up$Symbol, d_g_down$Symbol, e_signature))
}
In total, this takes 5-10 minutes to compute, with 3-5 minutes attributable to the generation of e_signature, which is not computationally complex. That's where I suspect optimization would be of the most benefit.
If we generated e_signature in a more efficient way and combined the results into one object (978x66510) before calling c_score_new(), might it be faster?
I'm having trouble working out the details, and I'm not confident this is the best method anyhow. So before I chase this wild goose, I thought the community might be able to steer me in the right direction.
Most of the time is taken by rank. It is possible to reduce computation time by more than 50% by replacing base::rank inside the for loop with Rfast::colRanks; please see below:
library(microbenchmark)
library(Rfast)
n <- 978
m <- 40000 #66510
x <- matrix(rnorm(n * m), ncol = m)
microbenchmark(
  Initial = {
    for (i in 1:ncol(x)) {
      base::rank(x[, i], ties.method = "random")
    }
  },
  Optimized = {
    colRanks(x, method = "min")
  },
  times = 1
)
Output:
Unit: seconds
expr min lq mean median uq max neval
Initial 8.092186 8.092186 8.092186 8.092186 8.092186 8.092186 1
Optimized 3.397526 3.397526 3.397526 3.397526 3.397526 3.397526 1
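To wire this back into the original loop, one option (my sketch, untested on the real data) is to rank every column once with colRanks() and then build each e_signature from the precomputed matrix, using vapply() so that d_scores is preallocated instead of grown with c(). Caveats: colRanks() may break ties differently from ties.method = "random" (the benchmark above uses method = "min"); it's worth checking dim() of the result on your Rfast version in case the orientation differs; and c_score_new() is assumed to return a single number per id, as the original c() accumulation suggests.
library(Rfast)

all_ranks <- colRanks(-1 * l_signatures, method = "min")  # rank all columns in one pass
colnames(all_ranks) <- colnames(l_signatures)             # keep the id -> column mapping

d_scores <- vapply(as.character(s.ids), function(id) {
  e_signature <- data.frame(ids = g.list, rank = all_ranks[, id])
  c_score_new(d_g_up$Symbol, d_g_down$Symbol, e_signature)
}, numeric(1))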
I want the first 'n' consecutive composite numbers.
I searched for a command to find consecutive composite numbers, but I only got results about proving a theorem. I didn't find any command for this. Please help me solve this problem in R.
Here is another option:
n_composite <- function(n) {
  s <- 4L
  i <- 1L
  vec <- numeric(n)
  while (i <= n) {
    if (any(s %% 2:(s - 1) == 0L)) {
      vec[i] <- s
      i <- i + 1L
    }
    s <- s + 1L
  }
  vec
}
It uses basic control flow to cycle through the positive integers, storing each composite it finds.
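A quick sanity check (output as I'd expect, not from the original answer):
n_composite(5)
# [1]  4  6  8  9 10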
benchmark
all.equal(find_N_composites(1e4), n_composite(1e4))
[1] TRUE
library(microbenchmark)
microbenchmark(
Mak = find_N_composites(1e4),
plafort = n_composite(1e4),
times=5
)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
     Mak 2304.8671 2347.9768 2397.0620 2376.4306 2475.2368 2480.7988     5   b
 plafort  508.8132  509.3055  522.1436  509.3608  530.4311  552.8076     5  a
The code by @Pierre Lafortune is neat and not too slow, but I'd like to propose another approach which is substantially faster.
Tackling the problem from another perspective, finding the first n composite numbers in R can be translated to "get the first n+k integers and remove the primes". This is fast because generating the sequence 1:(n+k) takes almost no time and there are very sophisticated algorithms to find primes available, one implementation being numbers::Primes().
The sequence needs to end with n+k because within the first n integers there will be some (k1) primes that need to be replaced. Note that the range (n+1):(n+k1) might also contain k2 primes, which need to be replaced as well. And on, and on, and on, … This will require a recursive structure.
Pierre's answer basically does something similar: he iteratively checks whether an integer is a composite number (non-prime) and continues until enough composites have been found. However, this has one drawback: the algorithm used to find (non-)primes is rather naive compared to other prime-finding algorithms (no offense intended). On the other hand, that solution doesn't run into the recursive problem of possible primes in any range of integers mentioned above.
The recursive solution I'd like to suggest is the following:
library(numbers)
n_composite2 <- function(n, from = 2) {
  endRange <- from + n - 1
  numbers <- seq(from = from, to = endRange)
  primes <- Primes(n1 = from, n2 = endRange)
  composites <- numbers[!(numbers %in% primes)]
  nPrimes <- length(primes)
  if (nPrimes >= 1) return(c(composites, n_composite2(nPrimes, from = endRange + 1)))
  return(composites)
}
This generates a sequence of integers (potential composites), then uses numbers::Primes() to find the primes in that range and removes them from the sequence. If some numbers have been removed, the function calls itself again, this time computing [number of primes in previous step] composites and starting the sequence from where the previous step stopped.
If there are doubts whether this actually works, here is the check against Pierre's solution (n_composite()):
> all(n_composite(1e4) == n_composite2(1e4))
[1] TRUE
Comparing both functions, n_composite2() is approximately 19 times faster:
library(microbenchmark)
microbenchmark(
"n_composite2" = n_composite2(1e4),
"n_composite" = n_composite(1e4),
times=5
)
Unit: milliseconds
expr min lq mean median uq max neval
n_composite2 34.44039 34.51352 35.10659 34.71281 35.21145 36.65476 5
n_composite 642.34106 661.15725 666.02819 662.99657 671.52093 692.12512 5
As a final remark: there are many solutions "between" Pierre's approach and the solution presented here. One could use numbers::Primes() in a while loop, very similar to what happens in n_composite(). One could also start with a "sufficiently long" sequence of integers, remove the primes, and then take the first n remaining numbers. To be efficient, this approach requires a good approximation of the number of primes in a given range, which is also not trivial (for low numbers).
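As a rough illustration of that last idea (my sketch, not part of the answer): overshoot the range using the x/log(x) approximation to the prime count, widen the range if the guess turns out to be too small, and keep the first n composites.
library(numbers)

n_composite3 <- function(n) {
  upper <- n + 2 * ceiling(n / log(n + 2)) + 10   # crude allowance for the primes in the range
  repeat {
    candidates <- 2:upper
    composites <- candidates[!candidates %in% Primes(2, upper)]
    if (length(composites) >= n) return(composites[1:n])
    upper <- 2 * upper                            # the allowance was too small; widen and retry
  }
}

all(n_composite3(1e4) == n_composite2(1e4))   # should be TRUE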
That is indeed a lazy way of asking a question, but nevertheless, this should do it:
is_composite <- function(x) {
  sapply(x, function(y) if (y < 3) { FALSE } else { any(y %% (2:(y - 1)) == 0) })
}
which(is_composite(1:100))

find_N_composites <- function(N) {
  which(is_composite(1:(2 * N + 2)))[1:N]
}
find_N_composites(10)

system.time({
  x <- find_N_composites(1e+04)
})
The idea is to check, for each number in turn, whether it has any divisors other than 1 and itself. The function I provided finds the first 10,000 composite numbers in about 2 seconds. If you want greater speed on large numbers, it would be better to optimize it, for example by looking for divisors only among prime numbers.
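A hedged sketch of that last suggestion's spirit (my addition, not part of the answer): testing divisors only up to sqrt(y) already removes most of the work; restricting the divisors to primes would additionally require a prime list.
is_composite2 <- function(x) {
  sapply(x, function(y) {
    if (y < 4) return(FALSE)              # 0, 1, 2 and 3 are not composite
    any(y %% 2:floor(sqrt(y)) == 0)       # a composite number has a divisor <= sqrt(y)
  })
}
identical(which(is_composite2(1:100)), which(is_composite(1:100)))   # should be TRUE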