CCF - general problems - R

I am working on my bachelor thesis, where I want to look into the lagged cross-correlation between a time series of search query volumes (= x) and the price of bitcoin (= y).
I have already created several CCF plots using the ccf() function in R.
See picture:
I saw in the description of R's acf function that ccf only works with one y and one x series. I was wondering whether someone knows a way to put several of those plots into one, especially since I can categorize them into positively and negatively correlated ones.
Further, I was wondering about the dashed blue line representing the confidence bound: at what level is it drawn? 0.05? 0.01?

These are two questions in one.
Question 1: combining plots
This question has been asked before; please look it up (a minimal base-graphics sketch also follows below):
Combining plots created by R base, lattice, and ggplot2
Combine plots in R
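For base R graphics specifically, here is a minimal sketch using par(mfrow = ...); the series here are illustrative stand-ins for your query-volume and price data:

set.seed(1)
y  <- cumsum(rnorm(300))   # stand-in for the bitcoin price series
x1 <- rnorm(300)           # stand-in for a positively correlated query
x2 <- rnorm(300)           # stand-in for a negatively correlated query
op <- par(mfrow = c(1, 2))                   # one row, two plot panels
ccf(x1, y, lag.max = 30, main = "query 1 vs. price")
ccf(x2, y, lag.max = 30, main = "query 2 vs. price")
par(op)                                      # restore the previous settings

You can group the panels however you like, e.g. one mfrow layout for the positively correlated queries and another for the negatively correlated ones.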
Question 2: confidence intervals in the ccf plot
The plot gives you confidence intervals, but the manual advises caution with them. The default setting is ci.type = "white", which bluntly draws bounds based on the quantiles of a standard normal distribution and does not take the statistical properties of your data into account; in my opinion it is altogether useless. The manual recommends ci.type = "ma", but that only works for autocorrelations. If you try using it with cross-correlations, you will get the warning "can use ci.type='ma' only if first lag is 0". For autocorrelations the lags run from 0 to +k, so the first lag is zero; ccf computes lags from -k to +k, so it is not.
Further support
I hope it is not against the code of conduct to offer further support.
The ccf function has some peculiarities that aren't well explained in the manual. Since I had trouble with ccf myself, I wrote it all down here for everybody.
Because I wanted meaningful confidence intervals, I developed an improved version of ccf myself (link to repository in case anyone is interested). It offers confidence intervals. The ccf object returned by the new function is compatible with the output of stats::ccf() but contains more information, and additional functions make it more useful.

Related

R vegan: adjusted p values for permanova (adonis2)

I am running an analysis of variance on a large distance matrix using adonis2 as described here: https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/adonis
That method is frequently used in microbiome analysis to calculate beta diversity. That's also what I would like to do, i.e. to find out whether my community composition differs in response to a (continuous) environmental variable.
PERMANOVA returns one p-value and there is no "official" post hoc test yet. That's where my question comes in:
I've come across publications saying they adjusted their PERMANOVA results using the FDR/BH method. I cannot wrap my head around this. I'm confident I understand how the FDR correction is calculated; I just don't see how that would be done for PERMANOVA, let alone how I would code it.
Can anyone help me out here?
It would be clearer if you provided an example of such a publication. You are right that PERMANOVA returns one p-value per variable; however, if the model includes many variables, you get one p-value for each of them and you need to correct those for FDR.
For example in this publication looking at variation in gut microbiome, they wrote:
To calculate the variation explained by each of our collected host
factors, we performed an Adonis test implemented in QIIME. Each host
factor was calculated according to its explanation rate, and P values
were generated based on 1,000 permutations. All P values were then
adjusted using the Benjamini–Hochberg method.
You can also see an example of this in Table S2; I attached a screenshot here:
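A minimal sketch of what that adjustment could look like in R, using the example data shipped with vegan (the terms here are illustrative stand-ins for the host factors):

library(vegan)
data(dune, dune.env)                              # example community + metadata
fit <- adonis2(dune ~ A1 + Moisture + Management, data = dune.env,
               permutations = 999, by = "terms")  # one p-value per term
p_raw <- fit$`Pr(>F)`[1:3]                        # drop the NA rows (Residual, Total)
p_adj <- p.adjust(p_raw, method = "BH")           # Benjamini-Hochberg / FDR
data.frame(term = c("A1", "Moisture", "Management"), p_raw, p_adj)

The correction is applied across the family of p-values (one per factor), exactly as for any other set of tests; the PERMANOVA itself is unchanged.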

How to take a Probability Proportional to Size (PPS) Unequal Probability sample using R?

I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?
Also just a note on the formulas: pi_i is inclusion probability and pi_ij is joint inclusion probability.
There is a package for this in R, pps, and the documentation is here.
Also, there is another package called survey with a bit of documentation here.
I'm not sure of the difference between the two and haven't used them myself. Hope this is what you're looking for.
Yes, that's called weighted sampling. Simply set the weight to the size of the state; strictly speaking you don't even need to normalize the weights by 1/sum(sizes), although it's always good practice to. There are tons of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join() of the datasets ds1 and ds2. Show us what code you've tried if that's causing problems; I recommend you use either dplyr or data.table.
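A minimal sketch of that approach (the column names state, value and population are illustrative; adapt them to your actual ds1/ds2):

library(dplyr)
ds1 <- data.frame(state = state.name, value = rnorm(50))                 # parameter of interest
ds2 <- data.frame(state = state.name, population = runif(50, 5e5, 4e7))  # state sizes
ds  <- inner_join(ds1, ds2, by = "state")                # combine the two datasets
n   <- 10
idx <- sample(nrow(ds), size = n, prob = ds$population)  # size-proportional draw; sample() normalizes prob itself
pps_sample <- ds[idx, ]
pps_sample$pi_i <- n * pps_sample$population / sum(ds$population)  # approximate inclusion probability
pps_sample

Note that sample() without replacement gives a successive (sequential) size-weighted draw rather than a strict fixed-pi_i PPS design; the pps package mentioned above implements exact unequal-probability designs if you need them.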
Your second question should be asked as a separate question, and it is off-topic on SO, or at least won't get a great response there - it's best to ask statistical questions on the sister site Cross Validated.

1 sample t-test from summarized data in R

I can perform a one-sample t-test in R with the t.test command, but this requires actual data; I can't use summary statistics (sample size, sample mean, standard deviation) with it. I can work around this using the BSDA package. But are there any other ways to accomplish this one-sample t-test in R without the BSDA package?
Many ways. I'll list a few:
directly calculate the p-value by computing the statistic and calling pt with that statistic and the df as arguments, as commenters suggest above (it can be done with a single short line in R - ekstroem shows the two-tailed case; for the one-tailed case you wouldn't double it; see the sketch after this list)
alternatively, if it's something you need a lot, you could convert that into a nice robust function, even adding in tests against non-zero mu and confidence intervals if you like. Presumably if you go this route you'll want to take advantage of the functionality built around the htest class
(code and even a reasonably complete function can be found in the answers to this stats.SE question.)
If samples are not huge (smaller than a few million, say), you can simulate data with the exact same mean and standard deviation and call the ordinary t.test function. If m and s and n are the mean, sd and sample size, t.test(scale(rnorm(n))*s+m) should do (it doesn't matter what distribution you use, so runif would suffice). Note the importance of calling scale there. This makes it easy to change your alternative or get a CI without writing more code, but it wouldn't be suitable if you had millions of observations and needed to do it more than a couple of times.
call a function in a different package that will calculate it -- there's at least one or two other such packages (you don't make it clear whether using BSDA was a problem or whether you wanted to avoid packages altogether)
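A minimal sketch covering the direct calculation and the simulation trick (the values of m, s, n and mu0 are illustrative):

# one-sample t-test from summary statistics alone
t_from_summary <- function(m, s, n, mu0 = 0,
                           alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  tstat <- (m - mu0) / (s / sqrt(n))
  df <- n - 1
  p <- switch(alternative,
              two.sided = 2 * pt(-abs(tstat), df),
              less      = pt(tstat, df),
              greater   = pt(tstat, df, lower.tail = FALSE))
  c(t = tstat, df = df, p.value = p)
}
t_from_summary(m = 5.2, s = 1.4, n = 30, mu0 = 5)

# simulation trick: fabricate data with exactly that mean and sd, then call t.test()
m <- 5.2; s <- 1.4; n <- 30
fake <- as.numeric(scale(rnorm(n))) * s + m   # scale() forces mean 0, sd 1 exactly
t.test(fake, mu = 5)                          # same t statistic and p-value as above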

Determining the direction of a significant spearman's rho correlation

I asked the following question over on Stack Exchange https://stats.stackexchange.com/questions/272657/determining-the-direction-of-a-significant-spearmans-rho-correlation - someone pointed me in the direction of this site as I am using SPSS, so if anyone has any advice, that would be much appreciated.
I have conducted Spearman's rho tests with two ordinal variables (one with 4 possible answers and the other with 6). I have obtained a statistically significant correlation between the two. My question is: how can I graphically (or in some other way) determine which answers of each variable go together, as a scatterplot would not work with my data (since it is not on a scale/continuous level)?
A fluctuation plot is often a good way to look at the distribution of pairs of categorical variables. There is a custom dialog available for this if you don't want to figure out the GPL code. It is available from the Community site, but if you can't find it, send me an email (jkpeck#gmail.com), and I'll send it to you.
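To illustrate what a fluctuation plot shows, here is a quick sketch in R with made-up ordinal data (var_a has 4 answers, var_b has 6 and is illustrative only); the symbol size reflects how many cases share each pair of answers, which reveals where the association comes from:

set.seed(1)
var_a <- sample(1:4, 200, replace = TRUE)
var_b <- sample(1:6, 200, replace = TRUE, prob = 6:1)
tab <- table(var_a, var_b)                              # 4 x 6 table of answer pairs
symbols(rep(1:4, times = 6), rep(1:6, each = 4),
        circles = sqrt(as.vector(tab)), inches = 0.25,
        xlab = "Variable A (4 answers)", ylab = "Variable B (6 answers)")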

Histogram matching - image processing - C/C++

I have two histograms.
int Hist1[10] = {1,4,3,5,2,5,4,6,3,2};
int Hist2[10] = {1,4,3,15,12,15,4,6,3,2};
Hist1's distribution is of type multi-modal;
Hist2's distribution is of type uni-modal with single prominent peak.
My questions are
Is there any way that I could determine the type of distribution programmatically?
How to quantify whether these two histograms are similar/dissimilar?
Thanks
Raj,
I posted a C function in your other question (automatically compare two series - Dissimilarity test) that will compute the divergence between two sets of similar data. It's actually intended to tell you how closely real data matches predicted data, but I suspect you could use it for your purpose.
Basically, the smaller the error, the more similar the two sets are.
These are just guesses, but I would try fitting each distribution as a Gaussian and use something like the R-squared value to determine whether the distribution is uni-modal or not.
As to the similarity between the two distributions, I would try doing an autocorrelation and using the peak positive value in the autocorrelation as a similarity measure. These ideas are pretty rough, but hopefully they give you some ideas.
For #2, you could calculate their cross-correlation (so long as the buckets themselves can be sorted). That would give you a rough estimate of their "similarity".
Comparison of Histograms (For Use in Cloud Modeling).
(That's an MS .doc file.)
There are a variety of software packages that will "fit" your distributions to known discrete distributions for you - Minitab, STATA, R, etc. A reference to fitting distributions in R is here. I wouldn't advise programming this from scratch.
Regarding distribution comparisons, if neither distribution fits a known distribution (Poisson, Binomial, etc.), then you need to use non-parametric methods described here.
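Since R keeps coming up above, here is a quick sketch of one simple comparison you could try there before writing anything from scratch: a chi-squared test of homogeneity on the two sets of bucket counts (the counts are the ones from the question):

hist1 <- c(1, 4, 3,  5,  2,  5, 4, 6, 3, 2)
hist2 <- c(1, 4, 3, 15, 12, 15, 4, 6, 3, 2)
# Treat the two histograms as rows of a contingency table and test
# whether the counts follow the same underlying distribution.
chisq.test(rbind(hist1, hist2), simulate.p.value = TRUE, B = 10000)

A small p-value indicates the two distributions differ; the simulated p-value avoids relying on the chi-squared approximation when expected counts are small.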