Fast Fourier Transform in R. What am I doing wrong?

I am a non-expert in Fourier analysis and don't quite get what R's fft() function does. Even after a lot of cross-reading I couldn't figure it out.
I built an example.
require(ggplot2)
freq <- 200 #sample frequency in Hz
duration <- 3 # length of signal in seconds
#arbitrary sine wave
x <- seq(-4*pi,4*pi, length.out = freq*duration)
y <- sin(0.25*x) + sin(0.5*x) + sin(x)
which looks like this (plot not shown).
fourier <- fft(y)
#frequency "amounts" and associated frequencies
amo <- Mod(fourier)
freqvec <- 1:length(amo)
I ASSUME that fft expects a vector recorded over a timespan of 1 second, so I divide by the timespan
freqvec <- freqvec/duration
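To make that assumption explicit: fft() itself knows nothing about time. For a record of N samples spanning T seconds, bin k of the result (1-based, as R indexes it) corresponds to frequency (k - 1)/T Hz, so dividing the bin index by the duration is only approximately right, and only if duration really is the length of the record. A minimal sketch, assuming freq, duration and y from above:
N <- length(y)
T_record <- duration                      # only valid if this matches the real record length
bin_freq <- (seq_len(N) - 1) / T_record   # frequency (in Hz) of FFT bin k is (k - 1)/T
head(bin_freq)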
#and put this into a data.frame
df <- data.frame(freq = freqvec, ammount = amo)
Now I PRESUMABLY can/have to omit the second half of the data.frame, since due to Nyquist the frequency "amounts" are only meaningful up to half the sampling rate.
df <- df[(1:as.integer(0.5*freq*duration)),]
For plotting I discretize a bit
df.disc <- data.frame(freq = 1:100)
cum.amo <- numeric(100)
for (i in 1:100){
cum.amo[i] <- sum(df$ammount[c(3*i-2,3*i-1,3*i)])
}
df.disc$ammount <- cum.amo
The plotting code for the first 20 frequencies:
df.disc$freq <- as.factor(df.disc$freq)
ggplot(df.disc[1:20,], aes(x=freq, y=ammount)) + geom_bar(stat = "identity")
The result (plot not shown):
Is this really a correct spectrum of the above function? Are my two assumptions correct? Where is my mistake? If there is none, what does this plot tell me?
EDIT:
Here is a picture without discretization (not shown here).
THANKS to all of you,
Micha.

Okay, okay. My mistake was a rather basic one, so the solution is quite trivial.
I wrote freq = 200 and duration = 3, but the signal actually runs from -4*pi to 4*pi, i.e. over a span of 8*pi, giving a "real" sample frequency of 1/((8*pi)/600) = 23.87324, which is nowhere near 200.
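A quick way to see this is to compute the sampling interval actually implied by the original x vector (a small check using the values from the question):
x_old <- seq(-4*pi, 4*pi, length.out = 200*3)
1 / mean(diff(x_old))   # roughly 24 Hz, nowhere near the assumed 200 Hz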
Replacing the respective lines in the example code by
freq <- 200 #sample frequency in Hz
duration <- 6 # length of signal in seconds
x <- seq(0,duration, length.out = freq*duration)
y <- sin(4*pi*x) + sin(6*pi*x) + sin(8*pi*x)
(with a more illustrative function) yields the correct frequencies, as demonstrated by the resulting plot (restricted to the important part of the frequency domain; not shown here).
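As a rough numeric check of the corrected example (assuming the corrected freq, duration, x and y above), the three largest magnitudes in the lower half of the spectrum should sit at 2, 3 and 4 Hz:
N    <- freq * duration
amo  <- Mod(fft(y))[1:(N/2)]              # keep the lower half (Nyquist)
fvec <- (seq_len(N/2) - 1) / duration     # bin k corresponds to (k - 1)/T Hz
sort(fvec[order(amo, decreasing = TRUE)[1:3]])
# expected: 2 3 4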

Related

How to find the maximum peak and duration of a signal in R?

I am trying to find the duration in seconds of a signal. To do this I use an approach from (Finding start and end of a peak in time series in R).
y = signal
y$startEndPeak <- NA
which.min(abs(diff(y$peak)))
i.mins <- which(diff(sign(diff(c(Inf, y$peak, Inf)))) == 2)  # indices of local minima
i.mins
i.mx <- which.max(y$peak)  # index of the global maximum (the peak)
i.mx
y$startEndPeak[i.mx] <- "peak"
ind <- sort(c(i.mins, i.mx))
ix <- ind[which(ind == i.mx) + c(-1, 1)]  # minima immediately before and after the peak
ix
y$startEndPeak[ix] <- c("start", "end")
## start, end, peak
ix2<- c(ix,i.mx)
ix2
dat<- y[ix2,]
dat
The output:

  Time      Peak startEndPeak
  4.46  -649.774        start
  5.86 19791.226          end
  5.00 48655.226         peak
I was able to detect the maximum point (peak), but the code fails to correctly detect the minimum points around the peak, so it is impossible to determine the duration of the signal correctly.
This is what I want to identify in the plot (not shown), with x = time and y = signal.
I would like to know:
Is there any way to find these points accurately? I have tried multiple functions and packages in R, and even Matlab, and I have not been able to achieve the precision I need.
Might it be possible to draw a horizontal line from the minimum point on the left of the peak and find the intersection point on the other side of the peak?
You can download one example of the signal here
Thank you for your time!
Here is a simple solution using pspline and the time-series derivative of the data. You may have to tune the cutoff values and the ignore region based on the other peaks as well:
library(pspline)  # provides sm.spline()

get_duration <- function(df) {
  # first derivative of a smoothing-spline fit to the signal
  deriv <- as.vector(predict(sm.spline(df$time, df$peak), df$time, 1))
  # start of the peak: first point where the derivative exceeds the cutoff
  start <- which.max(deriv > 7000)
  # skip a region around the peak before looking for the end
  ignore_region <- 60
  # end: first point after the ignored region where the derivative is back above -100
  end <- start + ignore_region + which.max(deriv[(start + ignore_region):length(deriv)] > -100)
  df$time[end] - df$time[start]
}
get_duration(df)
#[1] 3.22
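The question also mentions drawing a horizontal line from the minimum to the left of the peak and finding where the signal comes back down on the other side. A rough sketch of that idea (the column names Time and peak are assumptions taken from the question's code and output, and the result is NA if the signal never returns to the baseline):
i.mx  <- which.max(y$peak)                 # index of the global peak
left  <- which.min(y$peak[1:i.mx])         # lowest point left of the peak
base  <- y$peak[left]                      # height of the horizontal baseline
right <- i.mx + which(y$peak[i.mx:nrow(y)] <= base)[1] - 1   # first return to the baseline
y$Time[right] - y$Time[left]               # duration in seconds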

r raster cumulative sum calculation according to pre-defined time steps

I have a time series of 10-daily rasters, and I need to calculate cumulative sums over the time steps -5 to +5 relative to the pixel values (= time step 0) of a reference raster.
I am able to calculate the yearly cumulative sum (called 'sum_YY' below).
Step 1: create the 10-daily raster time series
require(raster)
idx <- seq(as.Date("2010/1/1"), as.Date("2011/12/31"), '10 days')
r <- raster(ncol=5, nrow=5)
s <- stack(lapply(1:length(idx), function(x) setValues(r, values = sample(x = c(-5:5),size = ncell(r), replace = T))))
s <- setZ(s, idx)
Step 2: calculate the cumulative sum by year
Y <- unique(format(as.Date(getZ(s), format = "X%Y.%m.%d"), format = "%Y"))
sum_YY <- stack()
for (i in Y) {
dates <- format(getZ(s), "%Y") == i
t <- calc(subset(s, which(dates)), cumsum)
sum_YY <- stack(sum_YY,t)
}
But this is not the output I am looking for.
I need to use cumsum according to the reference raster 'refer'.
Step 3: create the reference raster
refer <- raster(ncol=5, nrow=5)
refer <- setValues(refer,sample(x = c(15:25),size = ncell(refer), replace = T))
The 'refer' raster pixel values indicate the time step X, and I want to calculate the cumulative sums of the 's' stack from time steps X-5 to X+5 for each of the years.
I.e., if a 'refer' pixel value is X = 20, I need to calculate the cumulative sums of layers 15 to 25 of the 's' stack for each of the years; the same approach applies to the other pixels in 'refer'.
I have no idea how to introduce the reference raster into the calculation. Any help?
Thanks a million!
I can't think of a clean approach. But here is one way to do it:
rs <- stack(refer, s)  # refer becomes layer 1, so s layer j is rs layer j + 1
a <- calc(rs, fun=function(x) sum(x[(x[1]-4):(x[1]+6)]))  # i.e. s layers (x[1]-5):(x[1]+5)
stackSelect was designed for problems like this, but I now realize that it needs more flexibility: it would be nice to be able to select multiple layers at once and to apply a function. But you could do this (probably inefficiently):
x <- calc(refer, function(x) (x-5):(x+5), forceapply=T)
y <- stackSelect(s, x)
z <- sum(y)
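A quick way to convince yourself that the two approaches agree (a check assuming the objects a and z created above):
all.equal(values(a), values(z))   # the per-cell sums should be identical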

(Simple) Generating Multiple (1000) random variables in R

I'm attempting to create 1000 samples of a variable Z, where I first generate 12 uniform RVs Ui and then set Z = ∑ (Ui - 6), i = 1, ..., 12. I can generate one Z with
u <- runif(12)
Z <- sum(u-6)
However, I am not sure how to go about repeating that 1000 times. In the end I want to plot a histogram of the Z's, and ideally it should resemble the normal curve. Sorry, I'm clearly as much of a beginner as you can get in this area. Thank you!
If I understand the question, this is a pretty straightforward way to do it: use replicate() to perform the calculation as many times as you want.
# number of values to draw per iteration
n_samples <- 12
# number of iterations
n_iters <- 1000
# get samples, subtract 6 from each element, sum them (1000x)
Zs <- replicate(n_iters, sum(runif(n_samples) - 6))
# print a histogram
hist(Zs)
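As a rough sanity check (assuming Zs from above): with Z defined as the sum of (Ui - 6), the mean is 12*0.5 - 12*6 = -66 and the standard deviation is sqrt(12 * 1/12) = 1, so the histogram should look roughly normal around -66.
mean(Zs)   # close to -66
sd(Zs)     # close to 1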
Is this what you're after?
set.seed(2017);
n <- 1000;
u <- matrix(runif(12 * n), ncol = 12);
z <- apply(u, 1, function(x) sum(x - 6));
# Density plot
require(ggplot2);
ggplot(data.frame(z = z), aes(x = z)) + geom_density();
Explanation: draw 12 * 1000 uniform samples in one go, store them in a 1000 x 12 matrix, and then sum the row entries of x - 6.
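A slightly more idiomatic variant of the same idea (a sketch assuming the matrix u from above): rowSums() avoids the explicit apply() loop.
z2 <- rowSums(u - 6)   # same as apply(u, 1, function(x) sum(x - 6))
all.equal(z, z2)       # should be TRUE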

XYZ data to 2D plot in R

I need to make a 2D plot of distance travelled versus my value at that point ("intensity").
My data is formatted as:
        lon      lat intensity
1 -85.01478 37.99030  -68.3167
2 -85.00752 37.97601  -68.0247
3 -85.00027 37.96172  -67.9565
4 -84.99302 37.94743  -67.8917
and it continues like this for 282 rows. I was looking at a few packages that calculate the distance between longitude (lon) and latitude (lat) points (such as geosphere), but I couldn't understand how to get my data into the format they want. I know the total distance travelled should be 4.01538 degrees, evenly spread over the 282 points, but I don't know how I could make a column in R with this in mind.
dfrm$dist <- cumsum(c(0, with(dfrm, sqrt((lat[-1] - lat[-nrow(dfrm)])^2 +
                                         (lon[-1] - lon[-nrow(dfrm)])^2))))
with(dfrm, plot(dist, intensity, type="b"))
Or choose a more "geographic" distance measure with the lagged column values. But given the increments, I doubt the error from using a naive distance measure can be that much.
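If you do want a geographic measure, one option (a sketch assuming the data frame is called dfrm with lon and lat columns as above, and using the geosphere package mentioned in the question) is the haversine distance between successive points:
library(geosphere)
# cumulative great-circle distance (in metres) between successive points
dfrm$dist_m <- cumsum(c(0, distHaversine(dfrm[-nrow(dfrm), c("lon", "lat")],
                                         dfrm[-1, c("lon", "lat")])))
with(dfrm, plot(dist_m, intensity, type = "b"))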
From here I found some packages to calculate distance between coordinates. Assuming your data is called dtf and using the RSEIS package:
dtf <- data.frame(rbind(c(-85.01478,37.99030,-68.3167),
c(-85.00752,37.97601,-68.0247),c(-85.00027,37.96172,-67.9565),
c(-84.99302,37.94743,-67.8917)))
names(dtf) <- c('lon','lat','int')
library(RSEIS)
travelint <- function(i, data){
  # great-circle distance (degrees) and intensity change between points i and i+1
  ddeg <- GreatDist(data$lon[i], data$lat[i], data$lon[i+1], data$lat[i+1])$ddeg
  dint <- data$int[i+1] - data$int[i]
  list(ddeg, dint)
}
out <- sapply(1:(nrow(dtf)-1), travelint, data=dtf)
out <- data.frame(matrix(as.numeric(out),ncol=2,byrow=T))
out$X1 <- cumsum(out$X1)
This will take your data, calculate the distance traveled between points and the intensity change between them. After that it can be plotted like this:
library(ggplot2)
ggplot(out, aes(X1, X2)) + geom_line() +
  labs(x = "Distance (Degrees)", y = "Intensity Change")
If instead you want increasing intensity, you can use cumsum again to get the cumulative change in intensity and then add it to the first intensity:
out2 <- out
out2 <- rbind(c(0,0),out2)
out2$X2 <- cumsum(out2$X2) + dtf$int[1]
ggplot(out2,aes(X1,X2)) + geom_line() +
labs(x="Distance (Degrees)",y="Intensity")
As mentioned by DWin, you can use a naive measure or a geographic distance measure. Here I am using the gdist function from the Imap package, which calculates great-circle distance.
library(Imap)
library(lattice)
#Dummy data
longlat <- read.table(text="lon lat intensity
1. -85.01478 37.99030 -68.3167
2. -85.00752 37.97601 -68.0247
3. -85.00027 37.96172 -67.9565
4. -84.99302 37.94743 -67.8917", header=TRUE)
ll <- lapply(seq(nrow(longlat)-1), function(x){
start <- longlat[x,]
end <- longlat[x+1,]
cbind(distance = gdist(start$lon, start$lat, end$lon, end$lat,units = "m"),
intensity = end$intensity - start$intensity)
})
dd <- as.data.frame(do.call(rbind,ll))
xyplot(intensity~distance,dd,type= c('p','l'),pch=20,cex=2)

Detecting dips in a 2D plot

I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I'm only interested in the "main" dips, meaning the dips have to span a minimum length in the x axis. The number of dips is unknown, i.e., different plots will contain different numbers of dips. Any ideas?
Update:
As requested, here's the sample data, together with an attempt to smooth it using median filtering, as suggested by vines.
Looks like I need now a robust way to approximate the derivative at each point that would ignore the little blips that remain in the data. Is there any standard approach?
y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512)
yy <- runmed(y, 41)
plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5)
points(yy, col="blue", type="l", lwd=2)
EDITED: the function now strips the regions down to nothing but their lowest part, if wanted.
Actually, using the mean is easier than using the median. This allows you to find regions where the real values are continuously below the mean. The median is not smooth enough for an easy application.
One example function to do this would be :
FindLowRegion <- function(x, n=length(x)/4, tol=length(x)/20, p=0.5){
  nx <- length(x)
  n <- 2*(n %/% 2) + 1
  # smooth out based on means
  sx <- rowMeans(embed(c(rep(NA,n/2), x, rep(NA,n/2)), n), na.rm=TRUE)
  # find which series are far from the mean
  rlesx <- rle((sx-x) > 0)
  # construct start and end of regions
  int <- embed(cumsum(c(1, rlesx$lengths)), 2)
  # which regions fulfill requirements
  id <- rlesx$values & rlesx$lengths > tol
  # Cut regions to be in general smaller than median
  regions <- apply(int[id,], 1, function(i){
    i <- min(i):max(i)
    tmp <- x[i]
    id <- which(tmp < quantile(tmp, p))
    id <- min(id):max(id)
    i[id]
  })
  # return
  unlist(regions)
}
where
n determines how many values are used to calculate the running mean,
tol determines how many consecutive values should be lower than the running mean to talk about a low region, and
p determines the cutoff used (as a quantile) for stripping the regions to their lowest part. When p=1, the complete lower region is shown.
The function is tweaked to work on the data as you presented it, but the numbers might need to be adjusted a bit to work with other data.
This function returns a set of indices, which allows you to find the low regions. Illustrated with your y vector :
Lows <- FindLowRegion(y)
newx <- seq_along(y)
newy <- ifelse(newx %in% Lows,y,NA)
plot(y, col="blue", type="l", lwd=2)
lines(newx, newy, col="red", lwd=3)
This gives a plot with the low regions marked in red (not shown here).
You have to smooth the graph in some way. Median filtering is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you simply have to search for the minima, just as usual (i.e. search for the points where the 1st derivative switches from negative to positive).
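A minimal sketch of that recipe, reusing the runmed() smoothing from the question (the window width 41 is just the value used above; flat stretches in the smoothed series, where the difference is exactly zero, may need extra handling):
yy   <- runmed(y, 41)
dy   <- diff(yy)
mins <- which(diff(sign(dy)) == 2) + 1          # first difference switches from negative to positive
points(mins, yy[mins], col = "red", pch = 19)   # add to the existing plot of y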
A simpler answer (which also does not require smoothing) could be provided by adapting the maxdrawdown() function from the tseries package. A drawdown is commonly defined as the retreat from the most recent maximum; here we want the opposite. Such a function could then be used in a sliding window over the data, or over segmented data.
maxdrawdown <- function(x) {
  if(NCOL(x) > 1)
    stop("x is not a vector or univariate time series")
  if(any(is.na(x)))
    stop("NAs in x")
  cmaxx <- cummax(x) - x
  mdd <- max(cmaxx)
  to <- which(mdd == cmaxx)
  from <- double(NROW(to))
  for (i in 1:NROW(to))
    from[i] <- max(which(cmaxx[1:to[i]] == 0))
  return(list(maxdrawdown = mdd, from = from, to = to))
}
So instead of using cummax(), one would have to switch to cummin() etc.
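For illustration, a sketch of that adaptation (a "maxdrawup" measuring the largest rise from the most recent minimum; the name and details are mine, not part of tseries):
maxdrawup <- function(x) {
  if (any(is.na(x)))
    stop("NAs in x")
  cminx <- x - cummin(x)            # rise above the running minimum
  mdu <- max(cminx)
  to <- which(mdu == cminx)         # where the largest rise is reached
  from <- double(NROW(to))
  for (i in 1:NROW(to))
    from[i] <- max(which(cminx[1:to[i]] == 0))   # most recent minimum before it
  list(maxdrawup = mdu, from = from, to = to)
}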
My first thought was something much cruder than filtering. Why not look for the big drops followed by long enough stable periods?
span.b <- 20          # lookback span for the drop
threshold.b <- 0.2    # minimum size of the backward drop
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b))   # backward difference
span.f <- 10          # lookahead span for the stable period
threshold.f <- 0.05   # maximum movement allowed over the lookahead
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f))   # forward difference
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f)
abline(v = down)      # mark the detected drops on the existing plot
The plot shows that it's not perfect, but it doesn't discard the outliers (I guess it depends on your take on the data).
