I think the title is fairly self-explanatory. I want to compute the cross-correlation between two time series controlled for the values at other lags. I can't find any existing R code to do this, and I'm not at all confident enough in my knowledge of statistics (or R) to try to write something myself. It would be analogous to the partial autocorrelation function, just for the cross-correlation instead of the autocorrelation.
If it helps at all, my larger objective is to look for lagged correlations between different measurements of a physical system (to start with, flux and photon index from gamma ray measurements of blazars), with the goal of building a general linear model to try to predict flaring events.

Look at my answer to my own question (same as the one you posted).
You can make use of the pacf function in R by passing it a matrix with 2 or more time series. I have checked the results of the multivariate acf against ccf and they agree, so the same can be expected of the multivariate pacf and the non-existent pccf.
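A minimal sketch of that check, on simulated data (the series, the seed and the lag of 3 are assumptions made up for illustration). It compares the non-negative-lag half of ccf with the matching slice of the multivariate acf, then calls pacf on the same matrix:

```r
set.seed(1)
n <- 200
x <- as.numeric(arima.sim(n = n, list(ar = 0.9)))   # toy AR(1) series
y <- c(rnorm(3), x[1:(n - 3)]) + rnorm(n, 0, 0.5)   # toy series that follows x by 3 steps

m  <- acf(cbind(x, y), lag.max = 5, plot = FALSE)   # multivariate acf: 6 x 2 x 2 array
cc <- ccf(x, y, lag.max = 5, plot = FALSE)          # lags -5..5, lag 0 sits at position 6

# the lags 0..5 of ccf agree with the [, 1, 2] slice of the multivariate acf
same <- all.equal(as.numeric(cc$acf)[6:11], as.numeric(m$acf[, 1, 2]))

# multivariate partial autocorrelations; the off-diagonal slices are the cross terms
pm <- pacf(cbind(x, y), lag.max = 5, plot = FALSE)
dim(pm$acf)
```

The array returned by pacf starts at lag 1, so it has one row fewer than the acf array for the same lag.max.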

I believe this works:
library(dplyr) # tibble(), mutate(), bind_rows(), %>%
pccf <- function(x,y,nlags=7,partial=TRUE){
  # x (numeric): variable that leads y
  # y (numeric): variable of interest
  # nlags (integer): number of lags (including zero)
  # partial (boolean): partial or absolute correlation
  # trim y so that it lines up with the rows of the lagged matrix
  y <- y[-(1:(nlags-1))]
  # lagged matrix of x
  x_lagged <- embed(x,nlags)
  # process for each lag
  rho <- lag <- NULL
  for(i in 1:(nlags)){
    if(partial){
      # residuals of x at lag of interest regressed on all other lags of x
      ex <- lm(x_lagged[,i] ~ x_lagged[,-i])$residuals
      # residuals of y regressed on all lags of x but the one of interest
      ey <- lm(y ~ x_lagged[,-i])$residuals
    }else{
      # plain cross-correlation: use the raw series
      ex <- x_lagged[,i]
      ey <- y
    }
    # calculate correlation
    rho[i] = cor(ex,ey, use="pairwise.complete.obs")
    lag[i] = i-1
  }
  tibble(lag=lag, rho=rho)
}
# test
library(ggplot2) # dplyr is also needed, as in pccf()
n <- 200 # count
nlag <- 6 # number of lags
x <- as.numeric(arima.sim(n=n,list(ar=c(phi=0.9)),sd=1)) # simulate time series x
y <- dplyr::lag(x,nlag) + rnorm(n,0,0.5) # simulate y to lag x
y <- y[(nlag+1):n] # remove NAs from lag
x <- x[(nlag+1):n] # align with y
bind_rows(
  pccf(x,y,nlags=10,partial=FALSE) %>%
    mutate(type='Cross correlation'),
  pccf(x,y,nlags=10,partial=TRUE) %>%
    mutate(type='Partial cross correlation')
) %>%
  ggplot() +
  geom_col(aes(-lag,rho),width=0.1) +
  facet_wrap(~type,scales='free_y', ncol=1) +
  scale_x_continuous(breaks=-10:0)


Plotting distribution of variances

My dataset has 2 fields:
Time stamp t --- Varies between 0 to 60
Variable x – variance in value of a variable (say, A) from t-1 to t. Varies between -100% to 100%
There are roughly 500 records for each value of time stamp- e.g.
500 records where t= 0 and x takes any value between -100% to 100%
490 records where t= 1 and x takes any value between -100% to 100%, and so on.
Note, the value of x is 0 for ~80% of the records
The aim here is to determine at what value of t (Can be one value, or a range, e.g., when t= 22, or is between 20 -25), is the day-on-day change in A the minimum: Which effectively translates to finding out t when x is very frequently= 0, and when not, is at least close to zero.
To this purpose, I aim to plot the variance of x for each day. I can think of using a violin plot with x (Y axis) and t (X-axis), but there being 60 values of t makes it difficult to plot all in one chart. Can you suggest any alternative plot for the intended visual analysis?
Does it help if you take the absolute value of the variance (so it's concentrated in 0-100) and try a log scale?
When you say smallest, you mean closest to 0, right? In that case it's better to work with the absolute variance (on a 0-1 scale), as you can then treat it like zero-inflated binomial data, e.g. with the VGAM package:
I've had a play around, and below is an example that I think makes sense. I've only had some experience with zero-inflated models, so would be good if anyone has some feedback :)
library(data.table)
library(ggplot2)
library(VGAM)
# simulate some data
N_t <- 60 # number of t
N_o <- 500 # number of observations at t
t_smallest <- 30 # best value
# simulate some data crudely
dataL <- lapply(1:N_t, function(t){
  dist <- abs(t_smallest-t)+10
  values <- round(rbeta(N_o, 10/dist, 300/dist), 2) * sample(c(-1,1), N_o, replace=TRUE)
  data.table(t, values)
})
data <- rbindlist(dataL)
# raw
ggplot(data, aes(factor(t), values)) + geom_boxplot() +
coord_cartesian(ylim=c(0, 0.1))
# log transformed - may look better with your data
ggplot(data, aes(factor(t), log(abs(values)+1))) + geom_boxplot()
# use absolute values, package needs it as integer p & n, so approximate these
data[, abs.values := abs(values)]
data[, p := round(1000*abs.values, 0)]
data[, n := 1000]
# with a gam, so smooth fit on t. Found it to be unstable though
fit <- vgam(cbind(p, n-p) ~ s(t), zibinomialff, data = data, trace = TRUE)
# glm, with a coefficient for each t, so treats independently
fit2 <- vglm(cbind(p, n-p) ~ factor(t), zibinomialff, data = data, trace = TRUE)
# predict
output <- data.table(t=1:N_t)
output[, prediction := predict(fit, newdata=output, type="response")]
output[, prediction2 := predict(fit2, newdata=output, type="response")]
# plot out with predictions
ggplot(data, aes(factor(t), abs.values)) +
geom_boxplot(col="darkgrey") +
geom_line(data=output, aes(x=t, y=prediction2)) +
geom_line(data=output, aes(x=t, y=prediction), col="darkorange") +
geom_vline(xintercept = output[prediction==min(prediction), t]) +
coord_cartesian(ylim=c(0, 0.1))
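If 60 boxplots or violins are too crowded, a lighter alternative is to plot one or two summary numbers per t instead of the full distribution. The sketch below uses only base R and hypothetical toy data (about 80% exact zeros, with the spread of the non-zero values smallest near t = 30); the variable names and the simulation are assumptions, not the real dataset.

```r
set.seed(42)
N_t <- 60    # number of time stamps (toy value)
N_o <- 500   # observations per time stamp (toy value)
t <- rep(1:N_t, each = N_o)

# toy data: ~80% exact zeros, non-zero spread grows away from t = 30
spread <- (abs(30 - t) + 10) / 100
x <- ifelse(runif(length(t)) < 0.8, 0, rnorm(length(t), 0, spread))
x <- pmax(pmin(x, 1), -1)   # keep x within -100% .. 100%

# per-t summaries: share of exact zeros and mean absolute change
prop_zero <- tapply(x == 0, t, mean)
mean_abs  <- tapply(abs(x), t, mean)

# t with the smallest mean absolute change
best_t <- as.integer(names(which.min(mean_abs)))

# one line per summary instead of 60 distributions
plot(1:N_t, mean_abs, type = "b", xlab = "t", ylab = "mean |x|")
abline(v = best_t, lty = 2)
```

A second panel with prop_zero against t shows whether the minimum comes from more zeros or from smaller non-zero changes.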

Sampling from a multivariate distribution including gender in R

I'm trying to simulate a wider population from a small one in R as follows:
idata <- subset(data, select=c(WT, AGE, HT, BFP, SEX) )
M= cor(idata)
mu <- sapply(idata, mean)
sd <- sapply(idata, stdev)
sigma=cor2cov(M, sd)
simulation <-, mu, sigma))
But the problem is that, for SEX, the code assumes a continuous distribution, while it has to be binary: the effect of sex has to be either fully considered (SEX==1) or not at all (SEX==0). I'd appreciate any help in this regard.
What you should do is consider that your data consists of two sub-populations, and then draw data from them, based on their proportions.
So, first estimate the proportions pi_m and pi_f (= 1 - pi_m), which are the proportions of SEX == 1 and SEX == 0 respectively. This should be something like
pi_m = sum(idata$SEX == 1)/ nrow(idata)
Then estimate parameters for the two populations, mu_f, mu_m, sigma_f and sigma_m, which are mean and covariance parameters for the two SEX populations (now without the SEX variable).
Then draw a random number r <- runif(1); if it is less than or equal to pi_m, generate a sample from N(mu_m, sigma_m), otherwise from N(mu_f, sigma_f).
You can do this step 1000 times to get 1000 samples from your distribution.
Of course, you can vectorize this by first generating 1000 samples from runif. For example
n_m <- sum(runif(1000) <= pi_m)
n_f <- 1000 - n_m
X_m <- rmvnorm(n_m, mu_m, sigma_m)
X_f <- rmvnorm(n_f, mu_f, sigma_f)
X <- rbind(X_m, X_f)
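Putting the steps together, here is a minimal end-to-end sketch that also attaches the binary SEX column to the simulated rows. The toy idata, its numbers, and the helper rmvn_chol are all hypothetical. To stay within base R, the helper draws multivariate normals through a Cholesky factor instead of calling rmvnorm.

```r
set.seed(123)

# hypothetical stand-in for the small observed population
n0  <- 60
SEX <- rbinom(n0, 1, 0.5)
idata <- data.frame(
  WT  = rnorm(n0, 65 + 12 * SEX, 8),
  AGE = rnorm(n0, 40, 10),
  HT  = rnorm(n0, 163 + 13 * SEX, 6),
  BFP = rnorm(n0, 28 - 8 * SEX, 5),
  SEX = SEX
)
vars <- c("WT", "AGE", "HT", "BFP")

# hypothetical helper: n draws from N(mu, sigma) using the Cholesky factor of sigma
rmvn_chol <- function(n, mu, sigma) {
  z   <- matrix(rnorm(n * length(mu)), nrow = n, ncol = length(mu))
  out <- sweep(z %*% chol(sigma), 2, mu, "+")
  colnames(out) <- names(mu)
  out
}

# per-group parameters, estimated without the SEX column
d_m <- idata[idata$SEX == 1, vars]
d_f <- idata[idata$SEX == 0, vars]
pi_m <- nrow(d_m) / nrow(idata)

# group sizes for 1000 simulated subjects, then draw each group separately
n_m <- sum(runif(1000) <= pi_m)
n_f <- 1000 - n_m
X <- rbind(
  data.frame(rmvn_chol(n_m, colMeans(d_m), cov(d_m)), SEX = 1),
  data.frame(rmvn_chol(n_f, colMeans(d_f), cov(d_f)), SEX = 0)
)
```

Because each group is drawn from its own mean vector and covariance matrix, SEX stays strictly 0 or 1 and the other variables keep their within-group correlations.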

Calculation of DFFITS as diagnostic for Leverage and Influence in regression

I am trying to calculate DFFITS by hand. The value obtained should be equal to the first value obtained by dffits function. However there must be something wrong with my own calculation.
x1 <- lm(speed ~ dist, data = cars) # all observations
x2 <- lm(speed ~ dist, data = cars[-1,]) # without first obs
x <- model.matrix(speed ~ dist, data = cars) # x matrix
h <- diag(x%*%solve(crossprod(x))%*%t(x)) # hat values
num_dffits <- x1$fitted.values[1] - x2$fitted.values[1] #Numerator
denom_dffits <- sqrt(anova(x2)$`Mean Sq`[2]*h[1]) #Denominator
df_fits <- num_dffits/denom_dffits #DFFITS
dffits(x1)[1] # DFFITS function
Your numerator is wrong. Since you removed the first observation when fitting the second model, its predicted value is not in fitted(x2); fitted(x2)[1] is actually the fit for the second observation. We need to use predict(x2, cars[1, ]) in place of fitted(x2)[1].
Hat values can be efficiently computed by
h <- rowSums(qr.Q(x1$qr) ^ 2)
or using its R wrapper function
h <- hat(x1$qr, FALSE)
R also has a generic function for getting hat values:
h <- lm.influence(x1, FALSE)$hat
or its wrapper function
h <- hatvalues(x1)
You also don't have to call anova to get MSE:
c(crossprod(x2$residuals)) / x2$df.residual
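Combining the corrected numerator with the shortcuts above, a minimal sketch of the full hand calculation could look as follows (same variable names as in the question, plus mse_loo, which is a name introduced here for the leave-one-out MSE):

```r
x1 <- lm(speed ~ dist, data = cars)         # all observations
x2 <- lm(speed ~ dist, data = cars[-1, ])   # without the first observation

h <- hatvalues(x1)                          # hat values from the full model

# numerator: full-model fit minus leave-one-out prediction for observation 1
num_dffits <- fitted(x1)[1] - predict(x2, cars[1, ])

# denominator: leave-one-out residual standard error times sqrt(h_1)
mse_loo <- c(crossprod(x2$residuals)) / x2$df.residual
denom_dffits <- sqrt(mse_loo * h[1])

df_fits <- num_dffits / denom_dffits

# compare with the built-in function
all.equal(unname(df_fits), unname(dffits(x1)[1]))
```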

How to perform a multivariate linear regression when y is an indicator matrix in r?

This is the first time I am posting a question; I hope it is not confusing. Thanks very much for your time.
I am working on a zipcode dataset, which can be downloaded here:
In general, my goal is to fit a principal component regression model with the top 3 PCs on the training data, restricted to the observations whose response is one of the handwritten digits 2, 3, 5, and 8, and then predict using the test data. My main problem is that after performing PCA on the X matrix, I am not sure whether I did the regression part correctly. I turned the response variable into a 2487*4 indicator matrix and want to fit a multivariate linear regression model. But the predictions are not binary indicators, so I am confused about how to map them back to the original response, i.e., which observations are predicted as 2, 3, 5, or 8. Or did I do the regression part totally wrong? Here is my code:
First of all, I built the subset with those response variables are equal to 2, 3, 5, and 8:
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix, and calculated the top 3 principal components by using svd:
cxtrain <- scale(xtrain)
svd.xtrain <- svd(cxtrain)
cxtest <- scale(xtest)
svd.xtest <- svd(cxtest)
utrain.r3 <- svd.xtrain$u[,c(1:3)] # this is the u_r
vtrain.r3 <- svd.xtrain$v[,c(1:3)] # this is the v_r
dtrain.r3 <- svd.xtrain$d[c(1:3)]
Dtrain.r3 <- diag(x=dtrain.r3,ncol=3,nrow=3) # create the diagonal matrix D with r=3
ztrain.r3 <- cxtrain %*% vtrain.r3 # this is the scores, the new components
utest.r3 <- svd.xtest$u[,c(1:3)]
vtest.r3 <- svd.xtest$v[,c(1:3)]
dtest.r3 <- svd.xtest$d[c(1:3)]
Dtest.r3 <- diag(x=dtest.r3,ncol=3,nrow=3)
ztest.r3 <- cxtest %*% vtest.r3
Third, which is the part I was not sure if I did in the correct way, I turned the response variables into an indicator matrix, and performed a multivariate linear regression like this:
ytrain.ind <-cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain.r3,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest.r3)
pred <- predict(model_train,newdata=new)
However, pred was not an indicator matrix, so I am lost as to how to map the predictions back to digits and compare them with the real test data to calculate the prediction error.
I finally figured out how to perform multivariate linear regression with a categorical y. First we turn y into an indicator matrix, so that the 0s and 1s in this matrix can be interpreted as probabilities. Then we regress this matrix on x to build a linear model, and finally use the model to predict on the test set of x. The result is a matrix with the same dimensions as the test indicator matrix. Its entries should be read as probability-like scores too, although they can be larger than 1 or smaller than 0 (that is what confused me before). So we find the maximum per row to see which digit has the highest score, and take that digit as the final prediction. In this way we convert the continuous numbers back into categories, and can then make a table to compare them with the test set of y. I updated my previous code as below.
First of all, I built the subset with those response variables are equal to 2, 3, 5, and 8 (the code remains the same as the one I posted in my question):
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix, and calculated the top 3 principal components by using eigen(). I updated this part of code, because I standardized x instead of centering it in my previous code, leading to a wrong computation of the covariance matrix of x and eigenvectors of cov(x).
cxtrain <- scale(xtrain, center = TRUE, scale = FALSE)
eigenxtrain <- eigen(t(cxtrain) %*% cxtrain / (nrow(cxtrain) -1)) # same as get eigen(cov(xtrain)), because I have already centered x before
cxtest <- scale(xtest, center = TRUE, scale = FALSE)
eigenxtest <- eigen(t(cxtest) %*% cxtest/ (nrow(cxtest) -1))
r=3 # set r=3 to get the top 3 principal components
vtrain <- eigenxtrain$vectors[,c(1:r)]
ztrain <- scale(xtrain) %*% vtrain # these are the scores, the new components
vtest <- eigenxtrain$vectors[,c(1:r)]
ztest <- scale(xtest) %*% vtest
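Note that scale() with its default arguments also divides each column by its standard deviation, so the two projection lines above standardize rather than only center. If centering alone is intended, one option is to subtract the training column means from both sets and reuse the training loadings for the test set. The sketch below shows this on toy data and checks it against prcomp. The iris split, the seed and all variable names are assumptions for illustration, not the zip data.

```r
set.seed(1)
X   <- as.matrix(iris[, 1:4])   # toy predictors standing in for the pixel columns
idx <- sample(nrow(X), 100)     # hypothetical train/test split
Xtr <- X[idx, ]
Xte <- X[-idx, ]

r   <- 3
ctr <- colMeans(Xtr)                       # centering vector from the training set only
V   <- eigen(cov(Xtr))$vectors[, 1:r]      # top r loadings from the training covariance

Ztr <- sweep(Xtr, 2, ctr) %*% V            # training scores
Zte <- sweep(Xte, 2, ctr) %*% V            # test scores: same centers, same loadings

# cross-check against prcomp (columns may differ in sign only)
pc     <- prcomp(Xtr, center = TRUE, scale. = FALSE)
Zte_pc <- predict(pc, Xte)[, 1:r]
max_diff <- max(abs(abs(unname(Zte)) - abs(unname(Zte_pc))))
```

Using the training centers and loadings for the test set keeps the test scores on the same axes as the scores the regression was fitted on.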
Third, I turned the response variables into an indicator matrix, and performed a multivariate linear regression on the training set. And then use this linear model to predict.
ytrain.ind <- cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest)
pred<- predict(model_train,newdata=new)
pred is a matrix of probability-like scores, so we need to convert it back into a vector of categorical y.
pred.ind <- matrix(rep(0,690*4),nrow=690,ncol=4) # build a matrix with the same dimensions as pred, and all the entries are 0.
for (i in 1:690){
  j=which.max(pred[i,]) # j is the column number of the highest score per row
  pred.ind[i,j]=1 # set 1 in the column with the highest score per row, which turns pred back into an indicator matrix
}
pred.col1=as.matrix(pred.ind[,1]*2) # first column are those predicted as digit 2
pred.col2=as.matrix(pred.ind[,2]*3) # second column are those predicted as digit 3
pred.col3=as.matrix(pred.ind[,3]*5) # third column are those predicted as digit 5
pred.col4=as.matrix(pred.ind[,4]*8) # fourth column are those predicted as digit 8
pred.col5 <- cbind(pred.col1,pred.col2,pred.col3,pred.col4)
pred.list <- NULL
for (i in 1:690){
  pred.list[i]=max(pred.col5[i,]) # the only non-zero entry in the row is the predicted digit
} # In this way, we could finally get a list with categorical y
tt=table(pred.list,ytest) # confusion table of predicted vs. actual digits
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.3289855
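The row-wise "pick the largest score" step can also be written without loops. Below is a minimal sketch on toy data, where iris with its 3 species stands in for the 4 digits; the names classes, scores and pred_class are assumptions for illustration. It uses model.matrix to build the indicator matrix and max.col for the row-wise argmax.

```r
classes <- levels(iris$Species)

# indicator matrix: one 0/1 column per class
Y <- model.matrix(~ Species - 1, data = iris)

# multivariate linear regression of the indicator matrix on the predictors
fit <- lm(Y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

# real-valued scores: each row sums to 1 here, but entries can fall outside [0, 1]
scores <- predict(fit, newdata = iris)

# row-wise argmax turns the scores back into class labels
pred_class <- classes[max.col(scores, ties.method = "first")]

# confusion table and error rate (on the training data, for illustration only)
tt  <- table(predicted = factor(pred_class, levels = classes), actual = iris$Species)
err <- 1 - sum(diag(tt)) / sum(tt)
```

Fixing the factor levels of the predictions keeps the table square even if some class is never predicted, so diag(tt) stays meaningful.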
For the third part, we could also fit a multinomial logistic regression instead. In that case we don't need to convert y into an indicator matrix; we just turn it into a factor. The code looks as below:
library(nnet) # provides multinom()
trainmodel <- data.frame(cbind(ztrain, ytrain))
mul <- multinom(factor(ytrain) ~., data=trainmodel)
new <- as.matrix(ztest)
colnames(new) <- colnames(trainmodel)[1:r]
predict<- predict(mul,new)
tt=table(predict,ytest) # confusion table of predicted vs. actual digits
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.2627907
So the logistic model does perform better than the linear model here.

R: Calculating Pearson correlation and R-squared by group

I am trying to extend the answer of a question R: filtering data and calculating correlation.
To obtain the correlation of temperature and humidity for each month of the year (1 = January), we would have to do the same for each month (12 times).
cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
Is there any way to do each month automatically?
In my case I have more than 30 groups (not months but species) to which I would like to test for correlations, I just wanted to know if there is a faster way than doing it one by one.
Thank you!
cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
gives you a 2 * 2 correlation matrix rather than a number. I bet you want a single number for each Month, so use
## cor(Temp, Humidity | Month)
with(airquality, mapply(cor, split(Temp, Month), split(Humidity, Month)) )
and you will obtain a vector.
Have a read around ?split and ?mapply; they are very useful for "by group" operations, although they are not the only option. Also read around ?cor, and compare the difference between
a <- rnorm(10)
b <- rnorm(10)
cor(a, b)
cor(cbind(a, b))
The answer you linked in your question is doing something similar to cor(cbind(a, b)).
Reproducible example
The airquality dataset in R does not have Humidity column, so I will use Wind for testing:
## cor(Temp, Wind | Month)
x <- with(airquality, mapply(cor, split(Temp, Month), split(Wind, Month)) )
#          5          6          7          8          9
# -0.3732760 -0.1210353 -0.3052355 -0.5076146 -0.5704701
We get a named vector, where names(x) gives Month, and unname(x) gives correlation.
Thank you very much! It worked just perfectly! I was trying to figure out how to obtain a vector with the R^2 for each correlation too, but I can't... Any ideas?
cor(x, y) is like fitting a standardised linear regression model:
coef(lm(scale(y) ~ scale(x) - 1)) ## remember to drop intercept
The R-squared of this simple linear regression is just the square of the slope, i.e. the squared correlation. The vector x from the previous snippet stores the correlation per group, so the R-squared per group is just x ^ 2.
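A quick way to convince yourself, sketched on the same airquality columns (r, r2 and r2_lm are names introduced here for illustration): square the per-month correlations and compare them with the R-squared reported by a separate lm fit for each month.

```r
# correlation of Temp and Wind within each Month
r <- with(airquality, mapply(cor, split(Temp, Month), split(Wind, Month)))

# R-squared per group as the squared correlation
r2 <- r ^ 2

# the same quantity from one simple linear regression per Month
r2_lm <- sapply(split(airquality, airquality$Month),
                function(d) summary(lm(Temp ~ Wind, data = d))$r.squared)

all.equal(r2, r2_lm)
```

The lm route is only a cross-check; with 30 or more groups, squaring the correlation vector is the cheaper option.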
