I am trying to replicate the results of acf() manually with cor(), and I get differences which I cannot understand.
Consider the following toy example:
# two slightly different lines, with a bit of noise
# both are cast into time series
foo <- ts(seq(1:20) + rnorm(20))
bar <- ts((1.5*seq(1:20)-1) + rnorm(20))
foobar <- cbind(foo, bar)
# lagged (auto)correlations, up to a lag of 10
pippo <- acf(foobar, 10)
# look at the results; I see the two 'slices' of the resulting
# datacube: [ , , 1] and [ , , 2]. So far, so good.
pippo$acf
# 1st slice: [1,1,1] = 1 (obviously), and
# [2,1,1] = 0.82796109 (autocorrelation of foo with itself, lagged 1)
# but if I do:
cor(foobar[1:19, 1], foobar[2:20, 1])
[1] 0.9669717
Obviously, 0.82796109 is quite different from 0.9669717, and I cannot understand why. The same happens with the cross-correlation between the two time series, one lagged with respect to the other, even taking into account the (n-1)/n biased/unbiased factor that sometimes creeps in.
What am I missing here?
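For reference, here is a minimal sketch of how the lag-1 value reported by acf() can be reproduced by hand (assuming the default type = "correlation"): acf() centres both lagged segments on the full-sample mean and divides by n, whereas cor() on the two 19-point segments uses each segment's own mean and standard deviation.
x    <- foobar[, 1]          # the 'foo' series
n    <- length(x)
xbar <- mean(x)              # full-sample mean
c0   <- sum((x - xbar)^2) / n                              # lag-0 autocovariance
c1   <- sum((x[1:(n - 1)] - xbar) * (x[2:n] - xbar)) / n   # lag-1 autocovariance
c1 / c0                      # should reproduce pippo$acf[2, 1, 1]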
Many thanks in advance to whoever wants to help, and my apologies in advance if my question stems from a lack of understanding.
Luca
Environment: R 4.1.12, running on Win7
I am generating a very large data frame consisting of a large number of combinations of values. As such, my coding has to be as efficient as possible or else 1) I get errors like - R cannot allocate vector of size XX or 2) the calculations take forever.
I am at the point where I need to calculate r deviations from the mean for each sample (in the example below r = 3; 1 sample per row of the df), labeled dev1 - dev3 in the pic below:
These are my data in R:
I tried this (r is the number of values in each sample, here set to 3):
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
When I try this, I get:
I am guessing that this code is attempting to calculate the difference between each row of X1 (x) and the entire vector of X1$x.bar instead of 81 for the 1st row, 81.25 for the 2nd row, etc.
Once again, I can easily do this using for loops, but I'm assuming that is not the most efficient way.
Can someone please steer me in the right direction? Any assistance is appreciated.
Here is the whole code for the small sample version with r<-3. WARNING: This computes all possible combinations, so the df's get very large very quick.
options(scipen = 999)
dp <- function(x) {
  dp1 <- nchar(sapply(strsplit(sub('0+$', '', as.character(format(x, scientific = FALSE))),
                               ".", fixed = TRUE),
                      function(x) x[2]))
  ifelse(is.na(dp1), 0, dp1)
}
retain1 <- function(x, minuni) length(unique(floor(x))) >= minuni
# =======================================================
r <- 3
x0 <- seq(80, 120, .25)
X0 <- data.frame(t(combn(x0, r)))
names(X0) <- paste("x", 1:r, sep = "")
X <- X0[apply(X0, 1, retain1, minuni = r), ]
rm(X0)
gc()
X$x.bar <- rowMeans(X)
dp1 <- dp(X$x.bar)
X1 <- X[dp1 <= 2, ]
rm(X)
gc()
X2 <- apply(X1[, 1:r], 1, function(x) x - X1$x.bar)
Because R is vectorized, you only need to subtract x.bar from x1, x2, x3 collectively:
devs <- X1[ , 1:3] - X1[ , 4]
X1devs <- cbind(X1, devs)
That's it...
I think you just got the margin wrong: in apply you're using 1, which applies the function row-wise, but you want column-wise, so use 2:
X2<-apply(X1[,1:r], 2, function(x) x-X1$x.bar)
But from what I quickly searched, the apply family isn't better in performance than loops, only in clarity. Check this post: Is R's apply family more than syntactic sugar?
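As a quick sanity check on a small, made-up data frame (the values and names below are purely illustrative), the vectorized subtraction and the column-wise apply give the same deviations:
# hypothetical miniature version of X1
X1_toy <- data.frame(x1 = c(80, 81), x2 = c(82, 83), x3 = c(84, 85))
X1_toy$x.bar <- rowMeans(X1_toy)
vec_devs   <- X1_toy[, 1:3] - X1_toy$x.bar                           # vectorized
apply_devs <- apply(X1_toy[, 1:3], 2, function(x) x - X1_toy$x.bar)  # column-wise apply
all.equal(as.matrix(vec_devs), apply_devs, check.attributes = FALSE) # TRUE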
I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson Process from the renewal processes perspective. I've done my best in R but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints a random number of arrays, and only the last one is saved in tarr afterwards.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.
Adding on to the previous comment, there are a few things happening in the MATLAB script that are not in the R code:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]: from my understanding, you are adding another row to your matrix and populating it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need to use a different method as Matlab and R handle this type of thing differently.
One glaring thing that stands out to me is that you seem to be treating R's tarr[i] and MATLAB's tarr(i, :) as equivalent, but these are very different. What I think you are trying to get is all the columns of a given row i, which in R would be tarr[i, ].
The use of min is also different: R's min() returns the minimum of the whole matrix (just one number), while MATLAB's min() returns the minimum value of each column. For this in R you can use Rfast::colMins from the Rfast package.
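A tiny illustration of that difference (the matrix below is made up):
library(Rfast)
m <- matrix(c(3, 1, 4, 2, 5, 9), nrow = 2)  # 2 x 3 matrix
min(m)                    # 1     : single minimum over the whole matrix
colMins(m, value = TRUE)  # 1 2 5 : per-column minima, as MATLAB's min() would give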
The stairs part is something I am not very familiar with, but something like ggplot2::qplot(..., geom = "step") may work.
Now I have tried to create something that works in R but am not sure really what the required output is. But nevertheless, hopefully some of the basics can help you get it done on your side. Below is a quick try to achieve something!
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
  # Major alteration, create a temporary row from previous row in tarr
  temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
  # Join temp row to tarr matrix
  tarr <- rbind(tarr, temp)
  i <- i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I have sorted min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is still a problem, since in the MATLAB code 0:size(tarr, 1)-1 gives the number of rows less 1, but I can't figure out why, if we grab colMins (and there are 40 columns), we would end up with around 20 steps. But I might be completely misunderstanding! Also, I have changed T to T0, since T already exists in R as TRUE and is not good to overwrite!
Hope this helps!
I downloaded GNU Octave today so I could actually run the MATLAB code. After watching the code run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
  temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1)  # fixed parenthesis
  tarr <- rbind(tarr, temp)
  i <- i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: some extra plotting tweaks -- this seems to be closer to the original:
qplot(seq_along(min_for_each_col), c(1:length(min_for_each_col)), geom="step", ylab="", xlab="")
#or with ggplot2
library(magrittr)  # provides the %>% pipe used below
df1 <- cbind(min_for_each_col, 1:length(min_for_each_col)) %>% as.data.frame
colnames(df1)[2] <- "index"
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
I'm not too familiar with renewal processes or MATLAB, so bear with me if I've misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates an empty vector with mode 'list'.
The seventh line starts a while loop. I suspect this line is supposed to say while the min value of tarr is less than or equal to T ...
or it's supposed to say while i is less than or equal to T ...
It actually takes the minimum of a single boolean value (tarr[i] <= T). Now this can work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), this could lead to the same code running differently each time it is executed. This might explain why the code "prints a random number of arrays".
The eighth line seems to overwrite the assignment of tarr with the computation on the right. Thus it takes the single value tarr[i] and subtracts from it the natural log of runif(nproc) divided by 4 (lambda) -- which gives 40 different values. These forty values from the last time through the loop are what end up stored in tarr.
If you want to store all forty values from each time through the loop, I'd suggest storing them in a matrix or data frame instead. If that's what you want to do, here's an example of storing them in a matrix:
for(i in 1:nrow(yourMatrix)){
  # computations
  yourMatrix[i, ] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop.
Then, after the loop, tarr2 is assigned 1/tarr. Again, this will be the 40 values from the last time through the loop (line 8).
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
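(On that note, the expression -log(runif(nproc))/lambda -- with the parenthesis fix from the other answers -- is the inverse-transform way of drawing exponential inter-arrival times with rate lambda, which is exactly what a Poisson process needs. The same draws can be taken directly from the exponential distribution; a minimal sketch:)
set.seed(1)
lambda <- 4
manual_waits <- -log(runif(5)) / lambda   # inverse-transform sampling: -log(U)/lambda, U ~ Uniform(0, 1)
direct_waits <- rexp(5, rate = lambda)    # draws from the same Exponential(rate = lambda) distribution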
Hope this helped! Let me know if there's more I can do to help.
I do not understand the letter ordering produced by the function multcompLetters from multcompView. According to the documentation it should follow the group means, but in the following example the middle group got c (from abc) when it should have got b. Is this a bug?
require(multcompView)
# Data
datacol <- c(21.1,20.2,21.8,20.9,23.3,21.1,20.2,21.8,20.9,23.3,19.8,16.4,
16.9,16.0,17.6,17.5,16.9,13.3,18.0,17.6,13.5,12.2,15.2,15.1,15.2,14.0)
# Group
faccol <- c(rep(c(1,2),each=10),rep(3,6))
# Combined Dataframe
tukeyset <- data.frame(datacol,as.factor(faccol))
colnames(tukeyset)[2] <- "faccol"
# Tukeytest
tukeyres <- TukeyHSD(x=aov(lm(datacol~faccol,data=tukeyset)))
Tlevels <- tukeyres$faccol[,4]
multcompLetters(Tlevels) # WRONG ORDER, even reversed
# Boxplot
boxplot(tukeyset$datacol~tukeyset$faccol)
# adding the labels
text(x=c(1,2,3),y=c(aggregate(data=tukeyset,datacol~faccol,mean)$datacol),
labels=as.character(multcompLetters(Tlevels,reversed=TRUE)$Letters)[order(names(multcompLetters(Tlevels,reversed=TRUE)['Letters']$Letters))])
Oh no just encountered this problem! Spent 2 hours finally sorting it out!
You cannot call multcompLetters. It will give you the completely wrong order.
You have to use multcompLetters2, multcompLetters3 or multcompLetters4.
Another very important point: you have to convert your input dataset into a data frame, not a tibble! A tibble doesn't work for this.
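For the data in the question, a minimal sketch of the multcompLetters4 route might look like this (it takes the fitted model plus the Tukey results and orders the letters by the group means):
library(multcompView)
# tukeyset is the data frame built in the question (a plain data frame, not a tibble)
model <- aov(datacol ~ faccol, data = tukeyset)
tuk   <- TukeyHSD(model)
multcompLetters4(model, tuk)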
I am ashamed that I need assistance with such a simple task. I want to create 20 normally distributed numbers, add them up, and then do this x times; then plot a histogram of these sums. This is an exercise in Gelman and Hill's text "Data Analysis Using Regression and Multilevel/Hierarchical Models".
I thought this would be simple, but I am about 10 hours into it now. Web searches and looking in "The Art of R Programming" by Norman Matloff and "R for Everyone" by Jared Lander have not helped. I suspect the answer is so simple that no one would suspect a problem. The syntax in R is something I am having difficulty with.
> # chapter 2 exercise 3
> n.sim <- 10 # number of simulations
>
> sumNumbers <- rep(NA, n.sim) # generate vector of NA's
> for (i in 1:n.sim) # begin for loop
+{
+ numbers <- rnorm(20,0,1)
+ sumNumbers(i) <- sum(numbers) # defined as a vector but R
+ # thinks it's a function
+ }
Error in sumNumbers(i) <- sum(numbers) :
could not find function "sumNumbers<-"
>
> hist(sumNumbers)
Error in hist.default(sumNumbers) : 'x' must be numeric
3 stop("'x' must be numeric")
2 hist.default(sumNumbers)
1 hist(sumNumbers)
>
A few things:
When you put parentheses after a variable name, the R interpreter assumes that it's a function. In your case, you want to reference an index of a variable, so it should be sumNumbers[i] <- sum(numbers), which uses square brackets instead. This will solve your problem.
You can initialize sumNumbers as sumNumbers <- numeric(n.sim). It's a bit easier to read in a simple case like this.
By default, rnorm(n) is the same as rnorm(n,0,1). This can save you some time typing.
You can replicate an operation a specified number of times with the replicate function:
set.seed(144) # For consistent results
(simulations <- replicate(10, sum(rnorm(20))))
# [1] -9.3535884 1.4321598 -1.7812790 -1.1851263 -1.9325988 2.9652475 2.9559994
# [8] 0.7164233 -8.1364348 -7.3428464
After simulating the proper number of samples, you can plot with hist(simulations).
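Putting points 1-3 together, a minimal corrected version of the loop from the question might look like this:
n.sim <- 10                     # number of simulations
sumNumbers <- numeric(n.sim)    # pre-allocated result vector
for (i in 1:n.sim) {
  numbers <- rnorm(20)          # rnorm(20) defaults to mean 0, sd 1
  sumNumbers[i] <- sum(numbers) # square brackets, not parentheses
}
hist(sumNumbers)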
I've been working on a project for a little bit for a homework assignment and I've been stuck on a logistical problem for a while now.
What I have at the moment is a list of 10000 values in this format:
[[10000]]
X-squared
0.1867083
(This is the 10000th value of the list)
What I really would like is to just have the chi-squared value alone so I can do things like create a histogram of the values.
Is there any way I can do this? I'm fine with repeating the test from the start if necessary.
My current code is:
nsims <- 10000
malig <- numeric(nsims)               # pre-allocate the per-simulation counts
for (i in 1:nsims) {
  cancer.cells <- c(rep("M", 24), rep("B", 13))
  malig[i] <- sum(sample(cancer.cells, 21) == "M")
}
benign  <- 21 - malig
rbenign <- 13 - benign
rmalig  <- 24 - malig
cancerchi <- vector("list", nsims)    # pre-allocate the results list
for (i in 1:nsims) {
  test <- cbind(c(rbenign[i], benign[i]), c(rmalig[i], malig[i]))
  cancerchi[i] <- chisq.test(test, correct = FALSE)
}
It gives me all I need, I just cannot perform follow-up analysis on it such as creating a histogram.
Thanks for taking the time to read this!
I'll provide an answer at the suggestion of @Dr. Mike.
hist requires a vector as input. The reason that hist(cancerchi) will not work is because cancerchi is a list, not a vector.
There are several ways to convert cancerchi from a list into a format that hist can work with. Here are three ways:
hist(unlist(cancerchi))
Note that if you do not reassign cancerchi it will still be a list and cannot be passed directly to hist.
# i.e
class(cancerchi)
hist(cancerchi) # will still give you an error
If you reassign, it can be another type of object:
(class(cancerchi2 <- unlist(cancerchi)))
(class(cancerchi3 <- as.data.frame(unlist(cancerchi))))
# using the ldply function in the plyr package
library(plyr)
(class(cancerchi4 <- ldply(cancerchi)))
These new objects can be passed to hist directly:
hist(cancerchi2)
hist(cancerchi3[,1]) # specify column because cancerchi3 is a data frame, not a vector
hist(cancerchi4[,1]) # specify column because cancerchi4 is a data frame, not a vector
A little extra information: other useful commands for looking at your objects include str and attributes.
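As an aside, if the full chisq.test objects had been stored with double brackets (cancerchi[[i]] <- chisq.test(...)), the statistics could be pulled out afterwards with sapply; a minimal sketch, assuming that storage scheme:
chi.values <- sapply(cancerchi, function(res) unname(res$statistic))  # one X-squared value per test
hist(chi.values)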