Convolution of positively supported functions in R

I want the convolution of two functions defined on [0,Inf), say
f <- function(x) (1 + 0.5*cos(2*pi*x)) * (x >= 0)
and
g <- function(x) exp(-2*x) * (x > 0)
Using R's integrate function, I can do this:
cfg <- function(x) integrate(function(y) f(y)*g(x - y), 0, x)$value
From searching the web, it seems there are more efficient (and more accurate) ways of doing this (say, using fft() or convolve()). Can anyone with such experience explain how, please?
Thanks!

convolve or fft solutions give you a discrete result, rather than a function as you have defined in cfg. They can give you the numeric values of cfg on some regular, discrete grid of inputs.
fft performs circular convolution, i.e. it treats its input as periodic, so on its own it is not going to help. However, convolve has a mode of operation, type="open", which emulates the operation performed by cfg.
Note that with type="open", you must reverse the second sequence (see ?convolve, "Details"). You also have to use only the first half of the result. Here is a pictorial example of the convolution of c(2,3,5) with c(7,11,13) as performed by convolve(c(2,3,5), rev(c(7,11,13)), type='open'):
       2  3  5
13 11  7              2*7                =  14

    2  3  5
13 11  7              2*11 + 3*7        =  43

 2  3  5
13 11  7              2*13 + 3*11 + 5*7 =  94

 2  3  5
   13 11  7           3*13 + 5*11       =  94

 2  3  5
      13 11  7        5*13              =  65

Result: 14 43 94 94 65
Note that the first three elements are analogous to the results of your integration. The last three would be used for the reverse convolution.
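You can check this directly; the full open convolution of two length-3 sequences has length 3 + 3 - 1 = 5:
convolve(c(2,3,5), rev(c(7,11,13)), type='open')
## [1] 14 43 94 94 65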
Here is a comparison with your functions. Your function, vectorized, is plotted with:
y <- seq(0,10,by=.01)
plot(y, Vectorize(cfg)(y), type='l')
And here is an application of convolve, plotted with the following code. Note that there are 100 points per unit interval in y, so division by 100 (i.e., multiplication by the step size 0.01) converts the discrete sum into an approximation of the integral.
plot(y, convolve(f(y), rev(g(y)), type='open')[1:1001]/100, type='l')
These do not quite agree, but the convolution is much faster:
max(abs(Vectorize(cfg)(y) - convolve(f(y), rev(g(y)), type='open')[1:1001]/100))
## [1] 0.007474999
library(rbenchmark)
benchmark(Vectorize(cfg)(y),
          convolve(f(y), rev(g(y)), type='open')[1:1001]/100,
          columns=c('test', 'elapsed', 'relative'))
## test elapsed relative
## 2 convolve(f(y), rev(g(y)), type = "open")[1:1001]/100 0.056 1
## 1 Vectorize(cfg)(y) 5.824 104

Related

Number in results from R

In R, when I press return for a line of code (for example, a histogram), what does the [1] that comes up in the results mean?
If there's another line, it comes up as [18], then [35].
The numbers that you see in the console in the situation you are describing are the indices of the first element printed on each line.
1:20
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
# [13] 13 14 15 16 17 18 19 20
How many values are displayed per line depends by default on the width of the console (at least in RStudio); it is controlled by options(width = ...).
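For instance, with a console width of 80 all twenty values fit on one line, so only the [1] shows:
options(width = 80)
1:20
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20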
The value I printed is a numeric vector of length 20. A single number is technically also a numeric vector, but of length 1; R has no separate concept for the two, so when you print a single value the [1] still shows.
42
# [1] 42
It's not like this for everything: for example, there is no function of length 2, and c(mean, median) is a list (containing functions). But it works this way for the atomic modes (see ?atomic) and usually for the classes built on them.
You might not always see these numbers on all objects because they depend on what print methods are called, which itself depends on the class.
library(glue)
glue("a")
# a # <- we don't see [1]!
mode(glue("a"))
# character
class(glue("a"))
# [1] "glue" "character"
The print method that is called when typing print(1:20) is print.default; it can be overridden to avoid displaying the bracketed numbers:
print.default <- function(x) cat(x,"\n")
print(1:20)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
rm(print.default) # necessary cleanup!
The auto-print (what you get when not calling print explicitly) won't change, however, as auto-printing only involves method dispatch for explicit classes (objects with a class attribute).
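For example, auto-printing does dispatch for an object with an explicit class; a minimal sketch using a made-up class name:
x <- structure(list(val = 1), class = "myclass")
print.myclass <- function(x, ...) cat("a myclass object with val =", x$val, "\n")
x  # auto-printing dispatches to print.myclass
# a myclass object with val = 1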
Type methods(print) to see all the available methods.

Check if two intervals overlap in R

Given values in four columns (FromUp, ToUp, FromDown, ToDown), two pairs of them always define a range (FromUp/ToUp and FromDown/ToDown). How can I test whether the two ranges overlap? It is important to note that the range values are not sorted, so the "From" value can be higher than the "To" value and vice versa.
Some Example data:
FromUp<-c(5,32,1,5,15,1,6,1,5)
ToUp<-c(5,31,3,5,25,3,6,19,1)
FromDown<-c(1,2,8,1,22,2,1,2,6)
ToDown<-c(4,5,10,6,24,4,1,16,2)
ranges<-data.frame(FromUp,ToUp,FromDown,ToDown)
So that the result would look like:
FromUp ToUp FromDown ToDown Overlap
5 5 1 4 FALSE
32 31 2 5 FALSE
1 3 8 10 FALSE
5 5 1 6 TRUE
15 25 22 24 TRUE
1 3 2 4 TRUE
6 6 1 1 FALSE
1 19 2 16 TRUE
5 1 6 2 TRUE
I tried a few things but did not get them to work; in particular, the fact that the intervals are not "sorted" makes it too difficult for my R skills to figure out a solution.
I thought about finding the min and max values of each pair of columns (e.g. FromUp, ToUp) and then comparing them.
Any help would be appreciated.
Sort them
rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
            pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))
and write the condition
olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])
In one step this might be
(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
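Applied to the example data, this reproduces the expected Overlap column:
ranges$Overlap <- (pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
                  (pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
ranges$Overlap
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE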
The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any ranges, but you're looking for 'parallel' (within-row) overlaps.
The logic of the solution here is the same as in @Julius's answer, but it is 'vectorized' (e.g., one call to pmin() rather than nrow(ranges) calls to sort()) and should be much faster (though use more memory) for longer vectors of ranges.
In general:
apply(ranges, 1, function(x) {
  y <- c(sort(x[1:2]), sort(x[3:4]))
  max(y[c(1,3)]) <= min(y[c(2,4)])
})
or, in case intervals cannot overlap at just one point (e.g. because they are open); the idea is that after sorting all four endpoints, the intervals are disjoint exactly when the two smallest values form one of the original intervals:
!apply(ranges, 1, function(x) {
  y <- sort(x)[1:2]
  all(y == sort(x[1:2])) | all(y == sort(x[3:4]))
})

Understanding an RLE coverage value

Using R and bioconductor.
I'm not sure how to interpret an integer Rle that you get from functions like coverage(), such as this:
integer-Rle of length 3312 with 246 runs
Lengths: 25 34 249 16 7 11 16 ... 2 32 2 26 34 49
Values : 0 1 0 1 2 3 2 ... 1 2 1 0 1 0
Okay, so I get that it represents coverage of one range vs. other ranges, in this case reads of an experiment over a given range. What do the 'runs' mean? What about the 'Lengths' and 'Values'? I thought that maybe Lengths represent a position and Values the number of times it is covered, but then why would there be multiples of the same position, such as 2 above? And why would they be out of order?
I ask because I'm using
sum(coverage)
to compare the coverage of one range to another of a different length and I was wondering if that was appropriate.
Probably it's better to ask about Bioconductor packages on the Bioconductor support site.
The interpretation is that there is a run of 25 nucleotides with 0 coverage, then a run of 34 nucleotides with coverage 1 (i.e., a single read), then another run of 249 nucleotides with no coverage, and then things start to get interesting as multiple reads overlap positions. From the summary line at the top of the output, your region spans 3312 nucleotides, maybe a single transcript? If you were to
plot(as.integer(coverage))
you'd get a quick plot of how coverage varies along the length of the transcript.
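To see how Lengths and Values expand into per-position coverage, you can rebuild the first few runs by hand; a small sketch using the Rle constructor from S4Vectors:
library(S4Vectors)
r <- Rle(values = c(0L, 1L, 0L), lengths = c(25L, 34L, 249L))
table(as.integer(r))  # 274 positions with coverage 0, 34 with coverage 1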
Maybe sum(coverage) is appropriate; a more usual metric is to count reads rather than coverage, e.g., with GenomicRanges::summarizeOverlaps(), illustrated in the DESeq2 work flow in the context of RNA-seq.
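If you want to go the read-counting route, here is a rough sketch (features and reads are hypothetical placeholders for your annotation and aligned reads; in current Bioconductor releases summarizeOverlaps() lives in GenomicAlignments):
library(GenomicAlignments)
# features: a GRanges or GRangesList of regions of interest (hypothetical)
# reads: a BamFile or GAlignments object with the aligned reads (hypothetical)
se <- summarizeOverlaps(features, reads, mode = "Union")
head(assay(se))  # one read count per feature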
This might help to understand the concept of RLE: https://www.youtube.com/watch?v=ypdNscvym_E
Here is an easy example:
> library(IRanges)
> x <- IRanges(start=c(-2L, 1L, 3L),
+              width=c( 5L, 4L, 6L))
> x
IRanges of length 3
start end width
[1] -2 2 5
[2] 1 4 4
[3] 3 8 6
> coverage(x)
integer-Rle of length 8 with 2 runs
Lengths: 4 4
Values : 2 1
The output means that the first 4 positions are covered by two ranges (packs of two) and the next four positions by a single range. All positions at 0 and below were ignored!
The length means that the complete range we are looking at, i.e. all positions together, is 8.
The runs are the stretches of constant coverage that occur. Here, we only have stretches where two ranges overlap (packs of two) and stretches where one range sits alone (single packs).
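Expanding the Rle makes the run/length encoding concrete:
as.integer(coverage(x))
## [1] 2 2 2 2 1 1 1 1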

Understanding Dynamic Time Warping

We want to use the dtw library for R in order to shrink and expand certain time series data to a standard length.
Consider three time series with equivalent columns. moref has 105 rows, mobig has 130, and mosmall has 100. We want to project mobig and mosmall to a length of 105.
moref <- good_list[[2]]
mobig <- good_list[[1]]
mosmall <- good_list[[3]]
Therefore, we compute two alignments.
ali1 <- dtw(mobig, moref)
ali2 <- dtw(mosmall, moref)
If we print out the alignments the result is:
DTW alignment object
Alignment size (query x reference): 130 x 105
Call: dtw(x = mobig, y = moref)
DTW alignment object
Alignment size (query x reference): 100 x 105
Call: dtw(x = mosmall, y = moref)
So, exactly what we want? From my understanding, we need to use the warping functions ali1$index1 or ali1$index2 in order to shrink or expand the time series. However, if we invoke the following commands
length(ali1$index1)
length(ali2$index1)
length(ali1$index2)
length(ali2$index2)
the result is
[1] 198
[1] 162
[1] 198
[1] 162
These are vectors of indices (probably referring to other vectors). Which of these can we use for the mapping? Aren't they all too long?
First of all, we need to agree that index1 and index2 are two vectors of the same length that map query/input data to reference/stored data and vice versa.
Since you did not provide any data, here is some dummy data to give people an idea.
# Reference data is the template that we use as reference.
# say perfect pronunciation from CNN
data_reference <- 1:10
# Query data is the input data that we want to map to our reference
# say random youtube audio
data_query <- seq(1,10,0.5) + rnorm(19)
library(dtw)
alignment <- dtw(x=data_query, y=data_reference, keep=TRUE)
alignment$index1
alignment$index2
lcm <- alignment$costMatrix
image(x=1:nrow(lcm), y=1:ncol(lcm), lcm)
plot(alignment, type="threeway")
Here are the outputs:
> alignment$index1
[1] 1 2 3 4 5 6 7 7 8 9 10 11 12 13 13 14 14 15 16 17 18 19
> alignment$index2
[1] 1 1 1 2 2 3 3 4 5 6 6 6 6 6 7 8 9 9 9 9 10 10
So basically, the mapping from index1 to index2 is how to map input data to the reference data.
That is, the 10th data point of the input has been matched to the 6th data point of the template.
index1: Warping function φx(k) for the query
index2: Warping function φy(k) for the reference
-- Toni Giorgino
Per your question about the length of the index vectors: since they are basically the coordinates of the optimal path, their length can be as large as m+n (a really shallow path) and as small as max(m,n) (a perfect diagonal; the path must visit every index of both series, so it cannot be shorter). Clearly, it is not a one-to-one mapping, which might bother people a little bit; I guess you can do more research from here on how to pick the mapping you want.
I don't know if there is built-in functionality to pick the best one-to-one mapping, but here is one way.
library(plyr)
mapping <- data.frame(index1=alignment$index1, index2=alignment$index2)
mapping <- ddply(mapping, .(index1), summarize, index2_new = max(index2))
Now mapping contains a one-to-one mapping from query to reference. You can then map the query to the reference and scale the mapped input in whatever way you want.
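For instance, to actually project the query onto the reference's length, one option (my assumption: keep the last query point matched to each reference point) is to group by index2 instead:
mapping2 <- ddply(data.frame(index1 = alignment$index1,
                             index2 = alignment$index2),
                  .(index2), summarize, index1_new = max(index1))
query_warped <- data_query[mapping2$index1_new]
length(query_warped)  # 10, the length of the reference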
I am not exactly sure about the content below the line, and anyone is more than welcome to improve how the mapping and scaling should work.

Affinity Propagation results do not match

I am trying to implement the Affinity Propagation clustering algorithm in C++. As part of testing, I want to compare my results with well-established implementations of the algorithm in Matlab (Link) and in R (package apcluster). Unfortunately, the clusterings do not agree.
To be more precise, the (test) data set is:
0.9411760 0.9702140
0.9607826 0.9744693
0.9754896 0.9574479
0.9852929 0.9489372
0.9950962 0.9234050
1.0000000 0.8936175
1.0000000 0.8723408
0.9852929 0.8595747
1.0000000 0.8893622
1.0000000 0.9191497
In R I typed:
library(apcluster)
S <- negDistMat(data)
A <- apcluster(S, maxits=1000, convits=100, lam=0.9, q=0.5)
and got:
> A@idx
2 2 2 5 5 9 9 9 9 5
2 2 2 5 5 9 9 9 9 5
In Matlab I just typed:
[idx,netsim,dpsim,expref]=apcluster(S,diag(S));
From the apcluster.m file implementing apcluster (line 77):
maxits=1000; convits=100; lam=0.9; plt=0; details=0; nonoise=0;
This explains the parameters for R; in Matlab these are the default values. Since I'm more comfortable with R concerning Affinity Propagation, I stuck with Matlab's defaults for the comparison, just to avoid messing something up unintentionally.
...but got:
>> idx'
ans =
3 3 3 3 5 9 9 9 9 5
In both cases the similarity matrices matched. What could I have missed?
Update:
I've also implemented the Matlab code proposed by Frey & Dueck in their original publication (you may notice that I omitted the noise), and although I can replicate the indices produced by the former Matlab implementation, the availability and responsibility matrices differ in some values. The error is less than 0.01, but this is significant.
Their code is:
function [idx,A,R]=frey(S);
N=size(S,1);
A=zeros(N,N);
R=zeros(N,N);
lam=0.9; % Set damping factor
for iter=1:122
    % Compute responsibilities
    Rold=R;
    AS=A+S;
    [Y,I]=max(AS,[],2);
    for i=1:N
        AS(i,I(i))=-realmax;
    end;
    [Y2,I2]=max(AS,[],2);
    R=S-repmat(Y,[1,N]);
    for i=1:N
        R(i,I(i))=S(i,I(i))-Y2(i);
    end;
    R=(1-lam)*R+lam*Rold; % Dampen responsibilities
    % Compute availabilities
    Aold=A;
    Rp=max(R,0);
    for k=1:N
        Rp(k,k)=R(k,k);
    end;
    A=repmat(sum(Rp,1),[N,1])-Rp;
    dA=diag(A);
    A=min(A,0);
    for k=1:N
        A(k,k)=dA(k);
    end;
    A=(1-lam)*A+lam*Aold; % Dampen availabilities
end;
E=R+A; % Pseudomarginals
I=find(diag(E)>0); K=length(I); % Indices of exemplars
[tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c); % Assignments
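For reference, these loops implement the standard message updates from the Frey & Dueck paper (damping with $\lambda = 0.9$ is applied after each update):

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\bigl(a(i,k') + s(i,k')\bigr)$$

$$a(i,k) \leftarrow \min\Bigl(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0, r(i',k)\bigr)\Bigr) \quad (i \neq k), \qquad a(k,k) \leftarrow \sum_{i' \neq k} \max\bigl(0, r(i',k)\bigr)$$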
I have tried all your code, and the problem is caused by the way you supply the input preference. In the first case (R), you specify q=0.5. This means that the input preference p is set to the median of the off-diagonal similarities (in your example, this is -0.05129912). If I run the Matlab code as follows (I used Octave, but Matlab should give the same result), I get:
octave:7> [idx,netsim,dpsim,expref]=apcluster(S,-0.05129912);
octave:8> idx'
ans =
2 2 2 5 5 9 9 9 9 5
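You can verify in R where that number comes from, since q=0.5 sets the preference to the median of the off-diagonal similarities:
median(S[row(S) != col(S)])
## [1] -0.05129912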
This is exactly the same as the R result. If I run your Matlab code (with diag(S) as the second argument) and if I run
apcluster(S, p=diag(S))
in R (both of which set the input preference to 0 for all samples), I get 10 one-sample clusters in both cases. So the two results match again, though I could not recover your Matlab result
3 3 3 3 5 9 9 9 9 5
I hope that makes the difference clear.
Cheers, UBod
