Calculate a specified time period in IDL

I have light curve data consisting of time (t) and flux (f) (4067 rows x 2 columns):
nx=4067
t=fltarr(nx)
f=fltarr(nx)
data=read_table('kplr4830001.dat')
;print,data(0,*) ;this is t
;print,data(1,*) ;this is f
window,0
plot,data(0,*)/data(0,0),data(1,*)/data(1,0),xrange=[1.045,1.13],yrange=[0.98,1.03],xstyle=1,ystyle=1
I managed to calculate the threshold (thr = 0.0067621339).
I want to calculate a specific time period, defined by (t_start) and (t_end).
t_start: The time at which the flux first exceeds the threshold (0.0067621339).
t_end: The time at which the flux first becomes smaller than (3*exp(-9/2)).
This is how I did it:
;t_start
for i=0,nx-2 do begin
IF (data(1,i)/data(1,0) GT (thr)) THEN begin
print, data(1,i)/data(1,0)
endif
endfor
;t_end
for i=0,nx-2 do begin
IF (data(1,i)/data(1,0) LT (3*exp(-9/2))) THEN begin
print, data(0,i)/data(0,0)
endif
endfor
end
What I need is just the first value of data(0,i)/data(0,0) that meets these conditions. How can I do it?

Setting aside some questions I have, here is some sample code to determine when the flux first exceeds the threshold and then dips below the second threshold (assuming the data are ordered ascending in time):
thr1 = 0.0067621339
thr2 = 3.*exp(-9./2.)
time = reform(data[0, *])              ; time column, matching data(0,*) in the question
flux = reform(data[1, *])              ; flux column, matching data(1,*) in the question
ind1 = where(flux gt thr1)             ; all indices where flux exceeds thr1
t_start = time[ind1[0]]                ; time of the first such index
ind2 = where(flux[ind1[0]:*] lt thr2)  ; indices at or after t_start where flux drops below thr2
t_end = (time[ind1[0]:*])[ind2[0]]     ; time of the first such index
A few additional notes: you have specified that you'd like your flux to exceed a threshold, thr, but in your code you compare flux/flux[0] GT thr. I will assume that you want to measure the relative flux of your object (but in that case, why divide by the first element of the light curve and not by median(flux)?). In your plot statement above you also seem to normalize your times (data[0,*]/data[0,0]) - is this what you intended to do?
In defining your second threshold of 3*exp(-9/2), be careful: in IDL, 9/2 is integer division and gives 4, not 4.5. Instead use 3.*exp(-9./2.).
Finally, I'm assuming your data is noisy (rather than a noiseless simulation). Do you really want to find the very first data point that exceeds the threshold (which could be due to a particularly large positive outlier), or rather the point where a running mean/median (or other smoothed version of your data) exceeds the threshold? You can median filter your fluxes using, e.g., median(flux, boxwidth).

Related

Identification of most frequent observation in numeric vector using minimal observation ranges

Problem:
Say I have a numeric vector x of observations of some distance (in R). For instance it could be the throwing length of a number of people.
x <- c(3,3,3,7,7,7,7,8,8,12,15,15,15,20,30)
h <- hist(x, breaks = 30, xlim = c(1,30))
I then want to define a set S of "selectors" (ranges) that select as many of my observations as possible while spanning as little distance as possible (the cost is the sum of the ranges in S). Each selector si must span a range of at least 3 (its resolution).
Example:
In the toy data x I could put the first selector s1 at [6;8], which selects 4+2 observations (distances 7 and 8), uses 3 distance units and covers 6/15 observations in total ([7;9] would give the same, but for simplicity I put the selector midpoint at the max frequency). Next would be adding s2 at [14;16] (6 distance units, covering 9/15). In summary, S would be built along these steps:
[6;8] (3, 6/15) #s1
[6;8], [14;16] (6, 9/15) #s2
[3;8], [14;16] (9, 12/15) #Extending s1 (cheapest)
[3;8], [12;16] (11, 13/15) #Extending s2
[3;8], [12;16], [29;31], (14, 14/15) #s3
[3;8], [12;20], [29;31], (18, 15/15) #Extending s2
One would stop the iterations when a certain total distance is used (the sum of ranges in S) or when a certain fraction of the data is covered by S. Or plot the sum of S against the fraction of data covered and decide from that.
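For illustration only, here is a rough R sketch of just the "add a new selector" step described above (the extension step and the cost-based choice between adding and extending are omitted); greedy_selectors, width and frac are made-up names, not an existing function:
x <- c(3,3,3,7,7,7,7,8,8,12,15,15,15,20,30)
greedy_selectors <- function(x, width = 3, frac = 0.9) {
  covered <- rep(FALSE, length(x))   # which observations are already selected
  S <- list()
  while (mean(covered) < frac) {     # stop once a fraction of the data is covered
    centres <- sort(unique(x[!covered]))
    gain <- sapply(centres, function(m)
      sum(!covered & x >= m - width/2 & x <= m + width/2))
    m <- centres[which.max(gain)]    # centre the next selector where it covers most
    s <- c(m - width/2, m + width/2)
    covered <- covered | (x >= s[1] & x <= s[2])
    S[[length(S) + 1]] <- s
  }
  S
}
greedy_selectors(x)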
For very large data (100,000s of clustered observations in a distance space of 1,000,000s) I could probably be more sloppy by increasing the minimum step allowed (above it is 1; maybe try 100) and coarsening the resolution (above it is 3; one could try maybe 1000).
Since it's equivalent to maximizing the area under density(x) while minimizing the ranges of x, my intuition is that one could approximate the steps described (for time and memory considerations) using density() and optim(). Maybe it's even a well-known maximization/minimization problem.
Any suggestions that could get me started would be very appreciated.

Algorithmically detecting jumps in a time-series

I have about 50 datasets that include all trades within a timeframe of 30 days for about 10 pairs on 5 exchanges. All pairs are of the same asset class, meaning they are strongly correlated and expected to have similar properties, but are on different scales. An example of this data would be
set.seed(1)
n <- 1000
dates <- seq(as.POSIXct("2019-08-05 00:00:00", tz="UTC"), as.POSIXct("2019-08-05 23:59:00", tz="UTC"), by="1 min")
x <- data.frame("t" = sort(sample(dates, 1000)),"p" = cumsum(sample(c(-1, 1), n, TRUE)))
Roughly, I need to identify the relevant local minima and maxima, which happen daily. The yellow marks are my points of interest. Unlike this example, there is usually only one such point per day and I consider each day separately. However, it is hard to filter out noise from my actual points of interest.
My actual goal is to find the exact point, at which the pair started to make a jump and the exact point, at which the jump is over. This needs to be as accurate as possible, as I want to observe which asset moved first and which asset followed at which point in time (as said, they are highly correlated).
Between two extreme values, I want to minimize the distance and maximize the relative/absolute change, as my points of interest are usually close to each other and their difference is quite large.
I already looked at other questions like
Finding local maxima and minima and Algorithm to locate local maxima, and also this algorithm that has the same goal. However, my dataset is extremely noisy. I already reduced the dataset to 5-minute intervals; however, this led to the relevant points being omitted by the functions that identify local minima & maxima. Therefore, this was not a good solution given my goal.
How can I achieve my goal with a reasonably accurate algorithm? Manually skimming through all the time series is not an option, since this would require me to evaluate 50 * 30 time series by hand, which is too time-consuming. I'm really puzzled and have been trying to find a suitable solution for a week.
If more code snippets are wanted, I'm happy to share them; however, they didn't give me meaningful results, which would run against the idea of providing a minimal working example, so I decided to leave them out for now.
EDIT:
First off, I updated the plot and added timestamps to the dataset to give you an idea of the actual resolution. Ideally, the algorithm would detect both jumps on the left: the inner two dots because they're closer together and jump without interruption, and the outer dots because they're more extreme in value. In fact, this may answer the question of whether the algorithm is allowed to look into the future. Yes: if there's another local extremum in the range of, say, 30 observations (or 30 minutes), then ignore the intermediate local extrema.
In my data, jumps have ranged from 2% to ~15%, so a jump needs to be at least 2% to be considered, and only if a threshold of 15 (this might be adaptable) consecutive steps in the same direction before/after the peaks and valleys is reached.
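For illustration, a rough R sketch of this run-based criterion might look as follows (find_jumps, min_steps and min_jump are illustrative names, and the thresholds are just the example values above):
find_jumps <- function(p, min_steps = 15, min_jump = 0.02) {
  d <- sign(diff(p))                     # direction of each step
  r <- rle(d)                            # runs of consecutive same-direction steps
  ends   <- cumsum(r$lengths)            # last step index of each run
  starts <- ends - r$lengths + 1         # first step index of each run
  keep <- r$lengths >= min_steps & r$values != 0
  runs <- data.frame(start  = starts[keep],
                     end    = ends[keep] + 1,   # index of the last price in the run
                     change = p[ends[keep] + 1] / p[starts[keep]] - 1)
  runs[abs(runs$change) >= min_jump, ]   # keep runs whose total move is large enough
}
find_jumps(x$p + 100)   # shifted by 100 so relative changes make sense for the toy data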
A very naive approach was to subset the data around the global minimum and maximum of a day. In most cases, this denoised the data and worked as an indicator. However, it is not robust when the global extrema are not in the range of the jump.
Hope this clarifies why this isn't a statistical question (there are some tests to determine whether a jump has happened, but not for jump arrival time afaik).
In case anyone wants a real example:
this is a corresponding graph, this is the raw data of the relevant period and this is the reduced dataset.
Perhaps as a starting point, look at the function streaks in package PMwR (which I maintain). A streak is defined as a move of a specified size that is uninterrupted by a countermove of the same size. The function works with returns, not differences, so I add 100 to your data.
For instance:
set.seed(1)
n <- 1000
x <- 100 + cumsum(sample(c(-1, 1), n, TRUE))
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.12)
abline(v = s[, 1])
abline(v = s[, 2])
The vertical lines show the starts and ends of streaks.
Perhaps you can then filter the identified streaks by required criteria such as length. Or you may play around with different thresholds for up and down moves (this is not really recommended in the current implementation, but perhaps the results are good enough). For instance, up streaks might look as follows. A green vertical line shows the start of a streak; a red line shows its end.
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.05)
s <- s[!is.na(s$state) & s$state == "up", ]
abline(v = s[, 1], col = "green")
abline(v = s[, 2], col = "red")

Plotting large time series

Summary of Question:
Are there any easy to implement algorithms for reducing the number of points needed to represent a time series without altering how it appears in a plot?
Motivating Problem:
I'm trying to interactively visualize 10 to 15 channels of data logged from an embedded system at ~20 kHz. Logs can cover upwards of an hour, which means that I'm dealing with between 1e8 and 1e9 points. Further, I care about potentially small anomalies that last for very short periods of time (i.e. less than 1 ms), so simple decimation isn't an option.
Not surprisingly, most plotting libraries get a little sad if you do the naive thing and try to hand them arrays of data larger than the dedicated GPU memory. It's actually a bit worse than this on my system; using a vector of random floats as a test case, I'm only getting about 5e7 points out of the stock Matlab plotting function and Python + matplotlib before my refresh rate drops below 1 FPS.
Existing Questions and Solutions:
This problem is somewhat similar to a number of existing questions such as:
How to plot large data vectors accurately at all zoom levels in real time?
How to plot large time series (thousands of administration times/doses of a medication)?
[Several Cross Validated questions]
but deals with larger data sets and/or is more stringent about fidelity at the cost of interactivity (it would be great to get 60 FPS silky smooth panning and zooming, but realistically, I would be happy with 1 FPS).
Clearly, some form of data reduction is needed. There are two paradigms that I have found while searching for existing tools that solve my problem:
Decimate but track outliers: A good example of this is Matlab + dsplot (i.e. the tool suggested in the accepted answer of the first question I linked above). dsplot decimates down to a fixed number of evenly spaced points, but then adds back in outliers identified using the standard deviation of a high pass FIR filter. While this is probably a viable solution for several classes of data, it potentially has difficulties if there is substantial frequency content past the filter cutoff frequency and may require tuning.
Plot min and max: With this approach, you divide the time series up into intervals corresponding to each horizontal pixel and plot just the minimum and maximum values in each interval. Matlab + Plot (Big) is a good example of this, but uses an O(n) calculation of min and max, making it a bit slow by the time you get to 1e8 or 1e9 points. A binary search tree in a mex function or Python would solve this problem, but is complicated to implement. (A minimal sketch of this min/max reduction is shown below.)
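As a minimal illustration of the min/max paradigm (sketched in R purely for brevity; the same idea ports directly to MATLAB or Python, and minmax_reduce is a made-up name):
minmax_reduce <- function(y, n_pixels = 1000) {
  bins <- cut(seq_along(y), breaks = n_pixels, labels = FALSE)  # one bin per pixel column
  lo <- tapply(y, bins, min)
  hi <- tapply(y, bins, max)
  # interleave min and max so short spikes survive and line plots stay connected
  data.frame(x = rep(seq_len(n_pixels), each = 2),
             y = as.vector(rbind(lo, hi)))
}
y <- cumsum(rnorm(2e6))
r <- minmax_reduce(y, 2000)
plot(r$x, r$y, type = "l")
This is the plain O(n) single-pass version the question mentions; tree-based speedups would only matter once that single pass over the raw data is itself too slow.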
Are there any simpler solutions that do what I want?
Edit (2018-02-18): Question refactored to focus on algorithms instead of tools implementing algorithms.
I had the very same problem displaying pressure time series of hundreds of sensors, with samples every minute for several years. In some cases (like when cleaning the data) I wanted to see all the outliers; in others I was more interested in the trend. So I wrote a function that can reduce the number of data points using two methods: Visvalingam and Douglas-Peucker. The first tends to remove outliers, and the second keeps them. I've optimized the function to work over large datasets.
I did that after realizing that none of the plotting methods were capable of handling that many points, and the ones that did were decimating the dataset in a way that I couldn't control. The function is the following:
function [X, Y, indices, relevance] = lineSimplificationI(X,Y,N,method,option)
%lineSimplificationI Reduce the number of points of the line described by X
%and Y to N, preserving the most relevant ones.
% Uses an adaptation of the Visvalingam and Douglas-Peucker algorithms.
% The number of points of the line is reduced iteratively until reaching
% N non-NaN points. Repeated NaN points in original data are deleted but
% non-repeated NaNs are preserved to keep line breaks.
% The two available methods are
%
% Visvalingam: The relevance of a point is proportional to the area of
% the triangle defined by the point and its two neighbors.
%
% Douglas-Peucker: The relevance of a point is proportional to the
% distance between it and the straight line defined by its two neighbors.
% Note that the implementation here is iterative but NOT recursive as in
% the original algorithm. This allows better handling of large data sets.
%
% DIFFERENCES: Visvalingam tends to remove outliers while Douglas-Peucker
% keeps them.
%
% INPUTS:
% X: X coordinates of the line points
% Y: Y coordinates of the line points
% N: Target number of non-NaN points to keep
% method: Either 'Visvalingam' or 'DouglasPeucker' (default)
% option: Either 'silent' (default) or 'verbose' if additional outputs
% of the calculations are desired.
%
% OUTPUTS:
% X: X coordinates of the simplified line points
% Y: Y coordinates of the simplified line points
% indices: Indices to the positions of the points preserved in the
% original X and Y. Therefore Output X is equal to the input
% X(indices).
% relevance: Relevance of the returned points. It can be used to further
% simplify the line dynamically by keeping only points with
% higher relevance. But this will produce bigger distortions of
% the line shape than calling lineSimplificationI again with a
% smaller value for N, as removing a point changes the relevance
% of its neighbors.
%
% Implementation by Camilo Rada - camilo#rada.cl
%
if nargin < 3
error('Line points positions X, Y and target point count N MUST be specified');
end
if nargin < 4
method='DouglasPeucker';
end
if nargin < 5
option='silent';
end
doDisplay=strcmp(option,'verbose');
X=double(X(:));
Y=double(Y(:));
indices=1:length(Y);
if length(X)~=length(Y)
error('Vectors X and Y MUST have the same number of elements');
end
if N>=length(Y)
relevance=ones(length(Y),1);
if doDisplay
disp('N is greater than or equal to the number of points in the line. Original X,Y were returned. Relevances were not computed.')
end
return
end
% Removing repeated NaN from Y
% We find all the NaNs with another NaN to the left
repeatedNaNs= isnan(Y(2:end)) & isnan(Y(1:end-1));
%We also treat the first element as a repeated NaN if it is NaN
repeatedNaNs=[isnan(Y(1)); repeatedNaNs(:)];
Y=Y(~repeatedNaNs);
X=X(~repeatedNaNs);
indices=indices(~repeatedNaNs);
%Removing trailing NaN if any
if isnan(Y(end))
Y=Y(1:end-1);
X=X(1:end-1);
indices=indices(1:end-1);
end
pCount=length(X);
if doDisplay
disp(['Initial point count = ' num2str(pCount)])
disp(['Non repeated NaN count in data = ' num2str(sum(isnan(Y)))])
end
iterCount=0;
while pCount>N
iterCount=iterCount+1;
% If the vertices of a triangle are at the points (x1,y1), (x2,y2) and
% (x3,y3), the area of such a triangle is
% area = abs((x1*(y2-y3)+x2*(y3-y1)+x3*(y1-y2))/2)
% now twice the areas of the triangles defined by each point of X,Y and its
% two neighbors are
twiceTriangleArea =abs((X(1:end-2).*(Y(2:end-1)-Y(3:end))+X(2:end-1).*(Y(3:end)-Y(1:end-2))+X(3:end).*(Y(1:end-2)-Y(2:end-1))));
switch method
case 'Visvalingam'
% In this case the relevance is given by the area of the
% triangle formed by each point and its two neighbors
relevance=twiceTriangleArea/2;
case 'DouglasPeucker'
% In this case the relevance is given by the minimum distance
% from the point to the line formed by its two neighbors
neighborDistances=ppDistance([X(1:end-2) Y(1:end-2)],[X(3:end) Y(3:end)]);
relevance=twiceTriangleArea./neighborDistances;
otherwise
error(['Unknown method: ' method]);
end
relevance=[Inf; relevance; Inf];
%We remove the pCount-N least relevant points as long as they are not contiguous
[srelevance, sortorder]= sort(relevance,'descend');
firstFinite=find(isfinite(srelevance),1,'first');
startPos=uint32(firstFinite+N+1);
toRemove=sort(sortorder(startPos:end));
if isempty(toRemove)
break;
end
%Now we have to deal with contiguous elements, as removing one will
%change the relevance of the neighbors. Therefore we have to
%identify pairs of contiguous points and only remove the one with
%lesser relevance
%Contiguous will be true for an element if the next or the previous
%element is also flagged for removal
contiguousToKeep=[diff(toRemove(:))==1; false] | [false; (toRemove(1:end-1)-toRemove(2:end))==-1];
notContiguous=~contiguousToKeep;
%And the relevances associated to the elements flagged for removal
contRel=relevance(toRemove);
% Now we rearrange contiguous so it is sorted in two rows, therefore
% if both rows are true in a given column, we have a case of two
% contiguous points that are both flagged for removal
% this process is dependent on the rearrangement, as contiguous
% elements can end up in different columns, so it has to be done
% twice to make sure no contiguous elements are removed
nContiguous=length(contiguousToKeep);
for paddingMode=1:2
%The rearrangement is only possible if we have an even number of
%elements, so we add one dummy zero at the end if needed
if paddingMode==1
if mod(nContiguous,2)
pcontiguous=[contiguousToKeep; false];
pcontRel=[contRel; -Inf];
else
pcontiguous=contiguousToKeep;
pcontRel=contRel;
end
else
if mod(nContiguous,2)
pcontiguous=[false; contiguousToKeep];
pcontRel=[-Inf; contRel];
else
pcontiguous=[false; contiguousToKeep(1:end-1)];
pcontRel=[-Inf; contRel(1:end-1)];
end
end
contiguousPairs=reshape(pcontiguous,2,[]);
pcontRel=reshape(pcontRel,2,[]);
%finding columns with contiguous elements
contCols=all(contiguousPairs);
if ~any(contCols) && paddingMode==2
break;
end
%finding the row of the least relevant element of each column
[~, lesserElementRow]=max(pcontRel);
%The index in contigous of the first element of each pair is
if paddingMode==1
firstElementIdx=((1:size(contiguousPairs,2))*2)-1;
else
firstElementIdx=((1:size(contiguousPairs,2))*2)-2;
end
% and the index in contiguous of the most relevant element of each
% pair is
lesserElementIdx=firstElementIdx+lesserElementRow-1;
%now we set the least relevant element as NOT contiguous, so it is
%removed
contiguousToKeep(lesserElementIdx(contCols))=false;
end
%and now we delete the relevant contiguous points from the toRemove
%list
toRemove=toRemove(contiguousToKeep | notContiguous);
if any(diff(toRemove(:))==1) && doDisplay
warning([num2str(sum(diff(toRemove(:))==1)) ' contiguous elements removed in one iteration.'])
end
toRemoveLogical=false(pCount,1);
toRemoveLogical(toRemove)=true;
X=X(~toRemoveLogical);
Y=Y(~toRemoveLogical);
indices=indices(~toRemoveLogical);
pCount=length(X);
nRemoved=sum(toRemoveLogical);
if doDisplay
disp(['Iteration ' num2str(iterCount) ', Point count = ' num2str(pCount) ' (' num2str(nRemoved) ' removed)'])
end
if nRemoved==0
break;
end
end
end
function d = ppDistance(p1,p2)
d=sqrt((p1(:,1)-p2(:,1)).^2+(p1(:,2)-p2(:,2)).^2);
end

How can you map a set of numbers, full of "holes" into a smaller one without "holes"

Can anyone figure out a function that can perform a mapping from a finite set of N numbers X = {x0, x1, x2, ..., xN}, where each x can take a value from 0 to 999999999 and N < 999999999, to the set Y = {0, 1, 2, 3, ..., N}?
In my case, I have about 24000000 elements in the first set, whose values can range as X describes. These elements come in contiguous blocks (for example 53000 to 1234500, then 8000000 to 9000000, and so on), and I have to remap these elements to 0 to 2400000. I don't need to maintain order.
I need a (possibly simple and fast) math function, or a bitwise transformation, not something like putting the values in a sorted array and then binary searching for their positions.
Many thanks to whoever can figure out a way to solve this!
Luca
If you don't want to keep a straight lookup map of some gigabytes, then an augmented segment tree is a reasonable approach. The tree should contain the intervals and the shift of every interval (the sum of the lengths of the intervals to its left). Of course, finding the appropriate interval (and shift) in this method is close to a binary search.
For example, say you get X = 8000015. The interval for this value is 8000000 to 9000000. The shift of this interval is 1181501 (1234500 - 53000 + 1, the length of the preceding block), so X maps to
X => 1181501 + 8000015 - 8000000 = 1181516
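A small illustration of this interval-plus-shift lookup (sketched in R with a plain sorted-interval search via findInterval() rather than a full segment tree; the block boundaries are the examples from the question, and remap is a made-up name):
starts <- c(53000, 8000000)          # block starts, ascending
ends   <- c(1234500, 9000000)        # block ends
shift  <- c(0, cumsum(ends - starts + 1)[-length(starts)])  # elements before each block
remap <- function(x) {
  i <- findInterval(x, starts)       # which block x falls into
  stopifnot(i >= 1, x <= ends[i])    # x must lie inside one of the blocks
  shift[i] + (x - starts[i])
}
remap(8000015)   # 1181501 + 15 = 1181516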
For sparse elements, add a counting stage: find the rank R of every number M and put a (key=M, value=R) pair in a hash table.
X = (3, 19, 20, 101)
table: [(3:0), (19:1), (20:2), (101:3)]
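As a tiny illustration (in R, just to show the idea), such a rank table can be built directly from the sorted values:
X <- c(3, 19, 20, 101)
rank_map <- setNames(seq_along(sort(X)) - 1L, sort(X))  # value -> rank, starting at 0
rank_map["20"]   # 2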
Note that one should keep a balance between speed and space: for long filled intervals it is better to store only the interval ends.

Expand a Time Series to a specific number of periods

I'm new to R and I am attempting to take a set of time series and run them through a Conditional Inference Tree to help classify the shape of the time series. The problem is that not all of the time series have the same number of periods. I am trying to expand each time series to be 30 periods long while still maintaining the same "shape". This is as far as I have got:
require(zoo)
test<-c(606,518,519,541,624,728,560,512,777,728,1014,1100,930,798,648,589,680,635,607,544,566)
accordion<-function(A,N){
x<-ts(scale(A), start=c(1,1), frequency=1)
X1 <- zoo(x,seq(from = 1, to = N, by =(N-1)/(length(x)-1) ))
X2<-merge(X1, zoo(order.by=seq(start(X1), end(X1)-1, by=((N-1)/length(x))/(N/length(x)))))
X3<-na.approx(X2)
return(X3)}
expand.test<-accordion(test,30)
plot(expand.test); lines(scale(test))
length(expand.test)
The above code scales the time series, spaces it out evenly to 30 periods, and interpolates the missing values. However, the length of the returned series is 42 units and not 30, although it retains the same "shape" as the original time series. Does anyone know how to modify this so that the results produced by the function accordion are 30 periods long and the time series shape remains relatively unchanged?
I think there's a base R solution here. Check out approx(), which does linear (or constant) interpolation with as many points n as you specify. Here I think you want n = 30.
test2 <- approx(test, n=30)
plot(test2)
points(test, pch="*")
This returns a list test2 whose second element y holds your interpolated values. I haven't yet used your time series object, but it seems that was entirely interior to your function, correct?
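If it helps, a short sketch combining the scaling step from the question with approx() so the result is exactly N points (accordion2 is just an illustrative name, not part of the answer above):
accordion2 <- function(A, N) {
  approx(seq_along(A), as.numeric(scale(A)), n = N)$y   # N evenly spaced, interpolated points
}
expand.test <- accordion2(test, 30)
length(expand.test)   # 30
plot(expand.test, type = "l")
points(seq(1, 30, length.out = length(test)), as.numeric(scale(test)))   # original points overlaid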
