Cross-correlation of 5 time series (distance) and interpretation - r

I would appreciate some input in this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step is a vertical profile of species sightings in the ocean, taken 6 h apart. Within each profile the observations are spaced vertically by 0.1 m; the profiles themselves are 6 h apart in time.
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example (figure omitted).
I find the R documentation on this topic not very helpful. What I have done so far is use the MTS package's ccm function to create cross-correlation matrices. However, interpreting the resulting figures is difficult given the sparse documentation, so I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save it as cross_correlation_stack.csv, or change the file name as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)
# using package MTS: cross-correlation matrices
mod1 <- ccm(d2, lag = 1000, level = TRUE)
# using base R
acf(d2, lag.max = 1000)
# mq (multivariate Ljung-Box test) plot, also from the MTS package
mq(d2, lag = 1000)
Running this produces several figures from the ccm command, and the acf command produces a further plot (figures omitted).
My question is whether I am going in the right direction, or whether there are better-suited packages and commands for this.
The default figures don't get any titles etc., so what exactly am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it ...calculates autocovariance or autocorrelation..., which I assume is not what I want. But then again it is the only command that seems to work on multivariate input, so I am confused.
The plot with the significance values shows that the p-values increase after a lag of about 150 (i.e. 15 m). How would you interpret that with respect to my data, given the 0.1 m spacing of the sightings and the fact that many lags up to roughly 100-150 are significant? Would that mean that peaks in sightings are stable over the 5 time steps on a scale of about 150 lags, i.e. 15 m?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at. Any input is highly appreciated!

You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of d2 (the data frame read in above). Something like:
# one CCF per unordered pair of columns
cc <- vector("list", choose(ncol(d2), 2))
par(mfrow = c(ceiling(choose(ncol(d2), 2) / 2), 2))
cnt <- 1
for (i in 1:(ncol(d2) - 1)) {
  for (j in (i + 1):ncol(d2)) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
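Once the loop has run you can, for example, pull out the lag at which each pair is most strongly (in absolute value) correlated. This is a small sketch of my own, building on the cc list above:
# lag of the largest absolute cross-correlation for each pair
best_lags <- sapply(cc, function(est) est$lag[which.max(abs(est$acf))])
best_lags
Note that ccf() only computes a modest number of lags by default; pass lag.max explicitly inside the loop above if you need more.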
All of that said, however, the CCF is really only well behaved for data that are more or less normally distributed, and your data are clearly overdispersed, with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other measures of "association", such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
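For example, a minimal sketch with infotheo (my suggestion, untested on your data; d2 is the data frame read in above) could look like this:
library(infotheo)
# mutual information needs discrete inputs, so bin each column first
# (equal-frequency binning is the package default and an arbitrary choice here)
d2_disc <- discretize(d2)
mi_mat <- mutinformation(d2_disc)   # pairwise mutual information, in nats
mi_mat
This gives an overall (lag-0) measure of association; to examine specific lags you would shift one column relative to another (e.g. with head() and tail()) before computing the mutual information.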

Related

Cross-correlation for time series with weights?

Part 1: Previously I was using the ccf function in R to compute cross-correlations between two time series ts_1 and ts_2. Now I want to compute the cross-correlation, but I have a vector of weights which is the same length as ts_1 and ts_2. From what I understand, the built-in ccf function I was using does not have an argument for weights. Does another function/package have this capability?
Part 2: On a more conceptual note, I understand that Pearson's cross-correlation cannot exceed 1 (or rather, its absolute value cannot). However, does this also hold true for weighted cross-correlation? If so, what does this mean and how do I interpret it?
Thank you in advance for your help!
You can use the wcc function from the ptw package to calculate cross-correlations. For example, the code below puts more weight on the tail of the series:
library(ptw)
data(gaschrom)
wcc(gaschrom[1, ], gaschrom[2, ], trwdth = 20, wghts = seq_along(gaschrom[1, ]))
Output:
[1] 0.9997758
The wcc is a suitable measure of the similarity of two patterns when features may be shifted; identical patterns lead to a wcc value of 1.
This means that even if two patterns are identical but shifted along the time axis, the weighted cross-correlation will still give you 1 (perfectly correlated).
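If what Part 1 is really after is a plain weighted Pearson cross-correlation at a given lag (rather than ptw's warping-based wcc), a hand-rolled sketch is straightforward. The function below is my own illustration, with the lag convention chosen to match ccf(x, y), i.e. the value at lag k estimates cor(x[t+k], y[t]):
weighted_ccf_at_lag <- function(x, y, w, k) {
  n <- length(x)
  if (k >= 0) {
    xs <- x[(1 + k):n]; ys <- y[1:(n - k)]; ws <- w[1:(n - k)]
  } else {
    xs <- x[1:(n + k)]; ys <- y[(1 - k):n]; ws <- w[(1 - k):n]
  }
  ws <- ws / sum(ws)                        # renormalise over the overlap
  mx <- sum(ws * xs); my <- sum(ws * ys)    # weighted means
  cw <- sum(ws * (xs - mx) * (ys - my))     # weighted covariance
  sx <- sqrt(sum(ws * (xs - mx)^2))
  sy <- sqrt(sum(ws * (ys - my)^2))
  cw / (sx * sy)
}
# example: y is x delayed by 3 steps, with weights growing towards the tail
set.seed(1)
x <- rnorm(200)
y <- c(rep(0, 3), head(x, -3)) + rnorm(200, sd = 0.1)
weighted_ccf_at_lag(x, y, w = seq_len(200), k = -3)   # close to 1
Regarding Part 2: this is still a Pearson correlation computed under a weighted inner product, so by the Cauchy-Schwarz inequality its absolute value cannot exceed 1 (for non-negative weights). The interpretation is the same as in the unweighted case, except that observations with larger weights dominate the estimate.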

Trying to do a simulation in R

I'm pretty new to R, so I hope you can help me!
I'm trying to do a simulation for my Bachelor's thesis, where I want to simulate how a stock evolves.
I've done the simulation in Excel, but the problem is that I can't make the simulation that large, as the program crashes. Therefore I'm trying it in R.
The stock evolves as follows (everything except $\epsilon$ consists of constants which are known):
$$W_{t+\Delta t} = W_t\, e^{r \Delta t}\left(1+\pi\left(e^{(\sigma \lambda -0.5\sigma^2) \Delta t+\sigma \epsilon_{t+\Delta t} \sqrt{\Delta t}}-1\right)\right)$$
The only stochastic element here is $\epsilon$, which represents the Brownian-motion shock and is distributed N(0,1).
What I've done in Excel:
I made 100 samples of size 40, all drawn from the standard normal distribution N(0,1).
These draws are then used to calculate how the stock is affected by them (the normal values represent the shocks from the economy).
My problem in R:
I've used the sample function:
x <- sample(rnorm(1000), 1000, replace = TRUE)
So I have 1000 samples, which are normally distributed. Now I don't know how to put these results into the formula I have for the evolution of my stock. Can anyone help?
Using R for (discrete) simulation
There are two aspects to your question: conceptual and coding.
Let's deal with the conceptual first, starting with the meaning of your equation:
1. Conceptual issues
The first thing to note is that your evolution equation is continuous in time, so running your simulation as described above means accepting a discretisation of the problem. Whether or not that is appropriate depends on your model and how you have obtained the evolution equation.
If you do run a discrete simulation, then the key decision you have to make is what stepsize $\Delta t$ you will use. You can explore different step-sizes to observe the effect of step-size, or you can proceed analytically and attempt to derive an appropriate step-size.
Once you have your step-size, your simulation consists of pulling new shocks (samples of your standard normal distribution), and evolving the equation iteratively until the desired time has elapsed. The final state $W_t$ is then available for you to analyse however you wish. (If you retain all of the $W_t$, you have a distribution of the trajectory of the system as well, which you can analyse.)
So:
your $x$ values are a sample of your shocks, i.e. they are $\epsilon_{t=0}$.
To simulate the evolution of the $W_t$, you will need some initial condition $W_0$. What this is depends on what you're modelling. If you're modelling the likely values of a single stock starting at an initial price $W_0$, then your initial state is a 1000-element vector with a constant value.
Now evaluate your equation, plugging in all your constants, $W_0$, and your initial shocks $\epsilon_0 = x$ to get the distribution of prices $W_1$.
Repeat: sample $x$ again -- this is now $\epsilon_1$. Plugging this in gives you $W_2$, etc.
2. Coding the simulation (simple example)
One of the useful features of R is that most operators work element-wise over vectors.
So you can type in your equation more or less as it is.
I've made a few assumptions about the parameters in your equation, and I've ignored the $\pi$ function -- you can add that in later.
So you end up with code that looks something like this:
dt <- 0.5 # step-size
r <- 1 # parameters
lambda <- 1
sigma <- 1 # std deviation
w0 <- rep(1,1000) # presumed initial condition -- prices start at 1
# Show an example iteration -- incorporate into one line for production code...
x <- rnorm(1000,mean=0,sd=1) # random shock
w1 <- w0*exp(r*dt)*(1 + (exp((sigma*lambda - 0.5*sigma^2)*dt +
                             sigma*x*sqrt(dt)) - 1))   # evolution (pi ignored)
When you're ready to let the simulation run, then merge the last two lines, i.e. include the sampling statement in the evolution statement. You then get one line of code which you can run manually or embed into a loop, along with any other analysis you want to run.
# General simulation step
w <- w*exp(r*dt)*(1 + (exp((sigma*lambda - 0.5*sigma^2)*dt +
                           sigma*rnorm(1000, mean = 0, sd = 1)*sqrt(dt)) - 1))
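For example, a minimal sketch of such a loop (my own illustration, reusing the parameters defined above and assuming 40 time steps, as in the question) could be:
n_steps <- 40
n_paths <- 1000
W <- matrix(NA_real_, nrow = n_steps + 1, ncol = n_paths)
W[1, ] <- 1                               # assumed initial wealth
for (s in 1:n_steps) {
  eps <- rnorm(n_paths)                   # fresh shocks for this step
  W[s + 1, ] <- W[s, ]*exp(r*dt)*(1 + (exp((sigma*lambda - 0.5*sigma^2)*dt +
                                           sigma*eps*sqrt(dt)) - 1))
}
matplot(W[, 1:20], type = "l", xlab = "step", ylab = "W")   # 20 trajectories
Keeping the whole W matrix gives you the distribution of the entire trajectory; if you only need the terminal values, W[n_steps + 1, ] is enough.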
You can also easily visualise the changes and obtain summary statistics (the five-number summary plus the mean):
hist(w)
summary(w)
Of course, you'll still need to work through the details of what you actually want to model and how you want to go about analysing it --- and you've got the $\pi$ function to deal with --- but this should get you started toward using R for discrete simulation.

Interpreting the phom R package - persistent homology - topological analysis of data - Cluster analysis

I am learning to analyze the topology of data with the pHom package of R.
I would like to understand (characterize) a set of data (a matrix with 3,500 rows and 10 columns). To achieve this, the R package phom runs a persistent homology analysis that describes the data.
(Reference: this 4-minute video describes what we are seeking to do with homology in topology: http://www.youtube.com/embed/XfWibrh6stw?rel=0&autoplay=1)
Using the R-package "phom" (link: http://cran.r-project.org/web/packages/phom/phom.pdf) the following example can be run.
I need help in order to properly understand how the phom function works and how to interpret the data (plot).
Using Example 1 from the reference manual of the phom package, run the following in R:
Load Packages
library(phom)
library(Rcpp)
Example 1
x <- runif(100)
y <- runif(100)
points <- t(as.matrix(rbind(x, y)))
max_dim <- 2
max_f <- 0.2
intervals <- pHom(points, max_dim, max_f, metric="manhattan")
plotPersistenceDiagram(intervals, max_dim, max_f,
title="Random Points in Cube with l_1 Norm")
I would kindly appreciate if someone would be able to help me with:
Question:
a) What does the value max_f mean and where does it come from? From my data? Do I set it?
b) The plot produced by plotPersistenceDiagram (if you run the example in R you will see it): how do I interpret it?
Thank you.
Note: in order to run the "phom" package you need the "Rcpp" package and a recent version of R (3.0.3).
The example above was run in R after loading the "phom" and "Rcpp" packages.
This is totally the wrong venue for this question, but just in case you're still struggling with it a year later I happen to know the answer.
Computing persistent homology has two steps:
Turn the point cloud into a filtration of simplicial complexes
Compute the homology of the simplicial complex
The "filtration" part of step 1 means you have to compute a simplicial complex for a whole range of parameters. The parameter in this case is epsilon, the distance threshold within which points are connected. The max_f variable caps the range of epsilon sweep from zero to max_f.
plotPersistenceDiagram displays the homological "persistence barcodes" as points instead of lines. The x-coordinate of a point is the birth time of that topological feature (the value of epsilon at which it first appears), and the y-coordinate is its death time (the value of epsilon at which it disappears). Points far from the diagonal therefore persist over a wide range of epsilon and are likely to reflect genuine topological structure, whereas points close to the diagonal are born and die almost immediately and are usually noise.

Meta analysis from mean-effectsizes for overlapping samples

I am running a meta-analysis and use the metafor package to calculate Fisher z-transformed values from correlations.
meta1 <- escalc(ri = TESTR, ni = N, measure = "ZCOR", data = subdata2)
As some of the studies I include in my meta-analysis overlap in samples (i.e. in study XY, 5 effect sizes are reported for the same N), I need to calculate means of the standardized z-values. To indicate overlapping samples, I gave all effect sizes IDs (in Excel) that are equal whenever the samples overlap.
To run the final meta-analysis, I would like R to sum the standardized effect sizes within each ID and calculate their means.
So the idea is:
If Effect_SIZE_ID (a variable) is the same in two rows of my data frame, then sum both effect sizes and divide by two (i.e. calculate the mean), and provide this result in a new column.
As I am a full newbie, please let me know if you require further specification!
Thank you so much in advance.
Leon
Have a look at the summaryBy command in the doBy package.
mymean <- summaryBy(SD_effect ~ ID, FUN = mean, data = data)
This should work in general (if you provide some sample data, it is easy to check whether it does what you need).
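A base-R alternative along the same lines (a sketch on my part, assuming the effect sizes from escalc() are in metafor's default column yi and that your overlap indicator column is called ID):
# mean Fisher-z effect size per sample ID
meta_means <- aggregate(yi ~ ID, data = meta1, FUN = mean)
# or attach the per-ID mean to every row as a new column, as requested
meta1$yi_mean <- ave(meta1$yi, meta1$ID, FUN = mean)
Bear in mind that simply averaging the corresponding sampling variances (vi) is not quite right, because effect sizes from the same sample are dependent; how to combine the variances deserves separate attention.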

Fitting multiple peaks to a dataset and extracting individual peak information in R

I have a number of datasets composed of counts of features at various elevations. Currently there is data for each 1m interval from 1-30m. When plotted, many of my datasets exhibit 3-4 peaks, which are indicative of height layers.
Here is a sample dataset:
Height <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
Counts <-c(4000,2000,500,300,200,100,0,0,400,700,800,800,500,1000,1500,2000,2500,2200,1700,1100,500,0,0,1000,1500,2000,3000,4000,4000,2000)
I would like to fit some manner of curve function to these datasets in order to determine the total number of ‘peaks’, the peak center location (i.e. height) and peak width.
Some time ago I performed this kind of analysis by manually fitting multiple Gaussian functions in the fityk software; however, I would like to know whether such a process can be performed automatically through R.
I've explored a number of other posts concerning fitting peaks to histograms, for instance with the mixtools package, but I do not know whether individual peak information can be extracted.
Any help you can supply would be greatly appreciated.
"How do I fit a curve to my data" is way too broad of a question, because there are countless ways to do this. It's also probably more suited for https://stats.stackexchange.com/ than here. However, ksmooth from base R is a pretty good starting point for a basic smoother:
plot(Height, Counts)
smoothCounts <- ksmooth(Height, Counts, kernel = "normal", bandwidth = 2)
dsmooth <- diff(smoothCounts$y)
# local maxima: the smoothed curve rises before the point and falls after it
locmax <- sign(c(0, dsmooth)) > 0 & sign(c(dsmooth, 0)) < 0
lines(smoothCounts)
points(smoothCounts$x[locmax], smoothCounts$y[locmax], cex = 3, col = 2)
A simple peak identification could be along the following lines. Looks reasonable?
library(data.table)
dt <- data.table(
  Height = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30),
  Counts = c(4000,2000,500,300,200,100,0,0,400,700,800,800,500,1000,1500,2000,2500,2200,1700,1100,500,0,0,1000,1500,2000,3000,4000,4000,2000)
)
# crude first difference of Counts with respect to Height (dCounts/dHeight)
dt[, d1 := c(NA, diff(Counts))]
# the following first difference (a second derivative would be even cruder,
# so consecutive first differences are compared instead)
dt[, d2 := c(tail(d1, -1), NA)]
# local maxima: the counts rise into the point and fall after it
dtpeaks <- dt[d1 >= 0 & d2 <= 0]
I'm not quite sure how you would calculate the FWHM for the peaks; if you can explain the process, I should be able to help.
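As a complement, one possible way to get per-peak centres, widths and FWHM values, sketched here under the assumption that a mixture of normal components is a reasonable description of the layers (untested on the real data, with the number of components chosen by eye), is to expand the binned counts into pseudo-observations and fit a mixture model with mixtools:
library(mixtools)
# one pseudo-observation per counted feature, placed at its height
pseudo <- rep(Height, times = Counts)
set.seed(42)                        # EM results can depend on the starting values
fit <- normalmixEM(pseudo, k = 4)   # k = 4 components, judged from the plot
# per-peak summaries: centre, width, FWHM and mixing proportion
data.frame(centre = fit$mu,
           sd     = fit$sigma,
           fwhm   = 2 * sqrt(2 * log(2)) * fit$sigma,   # FWHM of a Gaussian
           prop   = fit$lambda)
The large spike at 1 m sits at the edge of the height range, so the mixture may need more components or a different model there; treat this purely as a starting point.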
