Creating an autocorrelation matrix from scratch in R

I'm making a very dirty version of an autocorrelation function in R.
I have a loop that works up to a specified max lag and then returns all the correlations as a matrix, as the acf() function does.
The idea is to replicate the output of the acf() function as shown:
Autocorrelations of series ‘acfData’, by lag
0 1 2 3 4 5 6 7 8
1.000 -0.038 0.253 0.266 0.250 0.267 -0.182 0.281 -0.013
9 10 11 12 13
-0.067 -0.122 -0.115 -0.023 -0.337
What I have so far: the function takes the data and a specified maximum lag, slides the data frame back by the required amount at each lag, and then computes the covariance and the standard deviations over the overlapping range of the two columns. This is repeated over the range of lags and each result is appended to a vector; I also run cor() on the same data frames as a check.
My problem is that the code returns the correct value on the first iteration (the lag-0 slide) and then returns slightly wrong values from then on.
library(DataCombine)   # slide() comes from the DataCombine package

myAcf <- function(dat, lg){
  dataF <- data.frame(acfData = dat)
  lagMat <- c()
  testMat <- c()
  for(i in 0:lg){
    ## shift the series back by i positions
    dataLag <- slide(dataF, "acfData", slideBy = -i)
    ## covariance and standard deviations over the overlapping rows only
    covacf <- cov(dataLag[(1 + i):nrow(dataLag), 1], dataLag[(1 + i):nrow(dataLag), 2])
    sd1 <- sd(dataLag[(1 + i):nrow(dataLag), 1])
    sd2 <- sd(dataLag[(1 + i):nrow(dataLag), 2])
    corrCalc <- covacf/(sd1 * sd2)
    lagMat <- c(lagMat, corrCalc)
    ## cor() over the same rows, as a check
    a <- cor(dataLag[(1 + i):nrow(dataLag), 1], dataLag[(1 + i):nrow(dataLag), 2])
    testMat <- c(testMat, a)
  }
  plot(lagMat)
  return(list(lagMat, testMat))
}
For the same data used for the acf() output above, my code returns:
[[1]]
[1] 1.00000000 -0.03786539 0.27700596 0.30197418 0.31009956
[6] 0.37123797 -0.19520518 0.44399863 0.05684766 0.02063488
[11] -0.03724332
[[2]]
[1] 1.00000000 -0.03842146 0.27502462 0.29292583 0.35052131
[6] 0.40931426 -0.23637159 0.52320559 0.07270497 0.02555461
[11] -0.04524035
Any help is greatly appreciated!
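For reference, the sample autocorrelation that acf() reports uses the full-series mean in both factors and the lag-0 sum of squares as the denominator at every lag, rather than recomputing the mean and standard deviations on the trimmed overlap. A minimal sketch of that textbook formula, added here only for comparison with the loop above:
## sample ACF: r_k = sum_{t=1..n-k} (x_t - xbar)(x_{t+k} - xbar) / sum_{t=1..n} (x_t - xbar)^2
manualAcf <- function(x, lg) {
  n     <- length(x)
  xbar  <- mean(x)
  denom <- sum((x - xbar)^2)
  sapply(0:lg, function(k) {
    sum((x[1:(n - k)] - xbar) * (x[(1 + k):n] - xbar)) / denom
  })
}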

Related

Why do I get the message "'names' attribute must be the same length as vector" when I run a for loop with ANOVA (R)?

I'm trying to create a for loop that runs ANOVA. I have a list of dataframes for which I need to run ANOVA, and I want to create a list with the resulting p-values. However, when I run the for loop, I get the following message:
'names' attribute [5] must be the same length as the vector [3]
This is the code I used:
## split my dataframe into a list of dataframes based on what gene the data represents
gene_data_list <- split(df_test, f = df_test$Gene)
## create an empty list to collect p-values
p_values <- list()
## run ANOVA on the list of dataframes
for(i in 1:length(gene_data_list)) {
  anova <- aov(value ~ variable, data = gene_data_list[[i]])
  summary <- anova_summary(anova)
  append(p_values, summary$p)
}
When I run the same code on gene_data_list[[1]] outside the loop, it runs fine.
I've attached a link to the dataframe I've been using:
https://easyupload.io/z5avik
You can do all this with slightly simpler coding using lapply instead of a for loop:
df_test <- read.csv("Test_dataset.csv")
gene_data_list <- split(df_test, f = df_test$Gene)
# Run ANOVAs
gene_anova <- lapply(gene_data_list, function(x) aov(value ~ variable, data = x))
# Get summaries
summary_list <- lapply(gene_anova, rstatix::anova_summary)
# Create single table
results_table <- do.call(rbind, summary_list)
Output:
head(results_table)
# Effect DFn DFd F p p<.05 ges
# 1 1 2 24 1.3570e+00 0.276 1 0.10200
# AAGAB 1 2 15 1.9334e+27 0.000 2 1.00000
# AAK1 1 2 42 4.0000e-03 0.997 1 0.00017
# AAMDC 1 2 6 4.4182e+29 0.000 2 1.00000
# AATF 1 2 24 8.5433e+28 0.000 2 1.00000
# ABCA2 1 2 33 1.0470e+00 0.362 1 0.06000
Note that if you just wanted a list of p values, you could do
p_list <- lapply(gene_anova, function(x) rstatix::anova_summary(x)$p)
This gives you a list with one p-value per gene.
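If you prefer a flat structure, that list can be collapsed into a small table; a hedged sketch (the column names are just illustrative):
## one row per gene, with its ANOVA p-value
p_df <- data.frame(Gene = names(p_list), p = unlist(p_list), row.names = NULL)
head(p_df)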

How can I show simultaneously all QAP correlation coefficients and QAP p-values between my independent and all dependent matrices with sna in R?

I am running Quadratic Assignment Procedure (QAP) to find the correlation of a dependent network matrix with four independent monadic node covariate matrices in R using the sna package.
I am using the following code:
cor2 <- sna::gcor(list(bilateralaid_network, GDP_capita_receiver,
                       GDP_capita_sender, HDI_receiver, HDI_sender))
p2 <- qaptest(list(bilateralaid_network, GDP_capita_receiver,
                   GDP_capita_sender, HDI_receiver, HDI_sender),
              gcor, g1 = 1, g2 = 2, reps = 1000)
p3 <- qaptest(list(bilateralaid_network, GDP_capita_receiver,
                   GDP_capita_sender, HDI_receiver, HDI_sender),
              gcor, g1 = 1, g2 = 3, reps = 1000)
# etc
However, I do not want to run a separate qaptest() for each independent matrix, as in the code above.
Is there a way of returning all correlation coefficients and p-values in table or matrix format in R?
I read that it is possible with UCINET software, but I personally do not have access to it.
UCINET can produce a combined table of this kind (screenshot not reproduced here); I would like to produce a similar output in R.
A very simple solution would be the following. This can be optimized in several ways, but this is quite straightforward, utilizing the functionality from the sna package.
# some random networks
g <- array(dim = c(3, 10, 10))
g[1, , ] <- sna::rgraph(10)
g[2, , ] <- sna::rgraph(10, tprob = g[1, , ] * 0.8)
g[3, , ] <- 1
g[3, 1, 2] <- 0

socmat <- network::as.sociomatrix(g)
reps <- 1000

testval <- sna::gcor(g)
out <- array(dim = c(dim(g)[1], dim(g)[1], reps))
for (i in 1:reps) {
  out[, , i] <- sna::gcor(sna::rmperm(socmat)) - testval
}
(pgreq <- rowMeans(out > 0, dims = 2))
(pleeq <- rowMeans(out <= 0, dims = 2))
testval
Here, pgreq contains the p-values for the permuted correlations being greater than the observed ones, pleeq contains the p-values for the permuted correlations being smaller than or equal to the observed ones, and testval contains the correlations between the observed networks.
> pgreq
[,1] [,2] [,3]
[1,] 0.000 0.000 0.436
[2,] 0.000 0.000 0.619
[3,] 0.436 0.619 0.000
> pleeq
[,1] [,2] [,3]
[1,] 1.000 1.000 0.564
[2,] 1.000 1.000 0.381
[3,] 0.564 0.381 1.000
> testval
1 2 3
1 1.00000000 0.7135061 -0.09480909
2 0.71350607 1.0000000 -0.13287777
3 -0.09480909 -0.1328778 1.00000000
You can easily package this into a function and make it more efficient. However, this code will probably already be sufficient in many cases.
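As a sketch of that packaging (the function name qap_cor_matrix and the returned element names are mine, not part of sna):
qap_cor_matrix <- function(g, reps = 1000) {
  socmat  <- network::as.sociomatrix(g)
  testval <- sna::gcor(g)
  out <- array(dim = c(dim(g)[1], dim(g)[1], reps))
  for (i in 1:reps) {
    out[, , i] <- sna::gcor(sna::rmperm(socmat)) - testval
  }
  list(cor       = testval,                      # observed correlations
       p_greater = rowMeans(out > 0, dims = 2),  # p-values as defined above
       p_lesseq  = rowMeans(out <= 0, dims = 2))
}
res <- qap_cor_matrix(g, reps = 1000)   # same example networks as above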

Obtaining mean phylogenetic tree branch lengths from an ensemble of phylogenetic trees

I have a set of phylogenetic trees, some with different topologies and different branch lengths. Here is an example set:
(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);
(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);
(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);
(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);
(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);
(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);
(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);
(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);
In this example there are 2 unique topologies; R's ape::unique.multiPhylo confirms this (assuming the example above is saved to a file tree.fn):
tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = F, use.edge.length = F)
> length(tree)
[1] 8
> length(unique.tree)
[1] 2
My question is: how do I get a list of trees, one per unique topology in the input list, where the branch lengths are a summary statistic, such as the mean or median, across all trees sharing that topology?
In the example above, this would return the first tree as is, because its topology is unique, plus one tree with the topology shared by the remaining trees and mean or median branch lengths.
If I understand correctly, you want to sort the trees into groups by unique topology (e.g. in your example, the first group contains one tree, etc.) and then compute some statistics for each group?
You can do that by first grouping the topologies into a list:
library(ape)

set.seed(5)
## Generating 20 four-tip trees (hopefully some will share topologies!)
tree_list <- rmtree(20, 4)

## How many unique topologies?
length(unique(tree_list))

## Sorting the trees by topology
tree_list_tmp <- tree_list
sorted_tree_list <- list()
counter <- 0
while(length(tree_list_tmp) != 0) {
  counter <- counter + 1
  ## Is the first tree equal to any of the trees in the list?
  equal_to_tree_one <- unlist(lapply(tree_list_tmp,
                                     function(x, base) all.equal(x, base, use.edge.length = FALSE),
                                     base = tree_list_tmp[[1]]))
  ## Saving the identical trees
  sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
  ## Removing them from the list
  tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
  ## Repeat while there are still some trees!
}

## The list of topologies should be equal to the number of unique trees
length(sorted_tree_list) == length(unique(tree_list))

## Giving them names for fanciness
names(sorted_tree_list) <- paste0("topology", 1:length(sorted_tree_list))
Then, for all the trees in each unique topology group, you can extract different summary statistics with a small function. Here, for example, I measure the mean, standard deviation, and 5%/95% quantiles of the branch lengths.
## function for getting some stats
get.statistics <- function(unique_topology_group) {
  ## Extract the branch lengths of all the trees
  branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))
  ## Apply some statistics
  return(c(n = length(unique_topology_group),
           mean = mean(branch_lengths),
           sd = sd(branch_lengths),
           quantile(branch_lengths, prob = c(0.05, 0.95))))
}

## Getting all the stats
all_stats <- lapply(sorted_tree_list, get.statistics)

## and making it into a nice table
round(do.call(rbind, all_stats), digits = 3)
# n mean sd 5% 95%
# topology1 3 0.559 0.315 0.113 0.962
# topology2 2 0.556 0.259 0.201 0.889
# topology3 4 0.525 0.378 0.033 0.989
# topology4 2 0.489 0.291 0.049 0.855
# topology5 2 0.549 0.291 0.062 0.882
# topology6 1 0.731 0.211 0.443 0.926
# topology7 3 0.432 0.224 0.091 0.789
# topology8 1 0.577 0.329 0.115 0.890
# topology9 1 0.473 0.351 0.108 0.833
# topology10 1 0.439 0.307 0.060 0.795
Of course you can tweak it to get your own desired statistics, or even get the stats per tree within each group (using a double lapply, e.g. lapply(sorted_tree_list, lapply, get.statistics), or something like that).
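If you want actual trees back, as in the original question, one hedged sketch is to take the first tree of each group as a template and overwrite its branch lengths with the element-wise mean across the group. This assumes the trees within a group list their edges in the same order, which may not hold in general and could require a reordering step first:
## one representative tree per topology, with mean branch lengths
mean_branch_trees <- lapply(sorted_tree_list, function(group) {
  template <- group[[1]]
  ## matrix of branch lengths: one column per tree in the group
  lengths_mat <- sapply(group, function(x) x$edge.length)
  template$edge.length <- rowMeans(as.matrix(lengths_mat))
  template
})
class(mean_branch_trees) <- "multiPhylo"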

How to perform Tukey's pairwise test in R?

I have the following data (dat):
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform a Tukey's HSD pairwise test on the above data set.
library(tidyr)                    # gather() comes from tidyr, not reshape2
dat1 <- gather(dat)               # convert to long form
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, when I try to use the wide/long format, I run into an error that says
" Error in UseMethod("TukeyHSD") :
no applicable method for 'TukeyHSD' applied to an object of class "data.frame"
We need 'x' to be dat1$value: because the arguments are not named, the first argument is taken as 'x' and the second as 'g'.
pairwise.t.test( dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we can name the arguments and pass them in whatever order we want:
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
Regarding the TukeyHSD
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
#$key
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
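If you only need the significant pairs, the comparison table can be pulled out of the TukeyHSD object and filtered; a small sketch (the 0.05 cut-off is just an example):
tk <- TukeyHSD(aov(value ~ key, data = dat1), ordered = TRUE)
tk_key <- tk$key                       # the list component is named after the factor
tk_key[tk_key[, "p adj"] < 0.05, ]     # keep comparisons below the chosen cut-off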

Fit many formulae at once, faster options than lapply?

I have a list of formulas I want to fit to data; rather than running a loop, I'd like to do this all at once, for performance's sake. The estimations should still be separate; I'm not trying to estimate a SUR or anything.
The following code does what I want
x <- matrix(rnorm(300), ncol = 3)
y <- x %*% c(1, 2, 3) + rnorm(100)
formulae <- list(y ~ x[, 1],
                 y ~ x[, 2],
                 y ~ x[, 1] + x[, 2])
lapply(formulae, lm)
Unfortunately this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are coefficients, and some standard errors.
As I said in my comment, what you really need is a more efficient yet numerically stable fitting routine than lm(). Here I provide a well-tested one I wrote myself, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
Pearson estimate of residual standard error, as you get from summary(lm(...))$sigma;
adjusted-R.squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
  ## stage 0: get response vector and model matrix
  ## we did not follow the normal route: match.call, model.frame, model.response, model.matrix, etc.
  y <- data[[as.character(formula[[2]])]]
  X <- model.matrix(formula, data)
  n <- nrow(X); p <- ncol(X)
  ## stage 1: XtX and Jacobi diagonal preconditioner
  XtX <- crossprod(X)
  D <- 1 / sqrt(diag(XtX))
  ## stage 2: pivoted Cholesky factorization
  R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
  piv <- attr(R, "pivot")
  r <- attr(R, "rank")
  if (r < p) {
    warning("Model is rank-deficient!")
    piv <- piv[1:r]
    R <- R[1:r, 1:r]
  }
  ## stage 3: solve linear system for coefficients
  D <- D[piv]
  b <- D * crossprod(X, y)[piv]
  z <- forwardsolve(t(R), b)
  RSS <- sum(y * y) - sum(z * z)
  sigma <- sqrt(RSS / (n - r))
  para <- D * backsolve(R, z)
  beta.hat <- rep(NA, p)
  beta.hat[piv] <- para
  ## stage 4: get standard errors
  Rinv <- backsolve(R, diag(r))
  se <- rep(NA, p)
  se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
  ## stage 5: t-statistics and p-values
  t.statistic <- beta.hat / se
  p.value <- 2 * pt(-abs(t.statistic), df = n - r)
  ## stage 6: construct coefficient summary matrix
  coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  rownames(coefficients) <- colnames(X)
  ## stage 7: compute adjusted R-squared
  adj.R2 <- 1 - sigma * sigma / var(y)
  ## return model fitting results
  attr(coefficients, "sigma") <- sigma
  attr(coefficients, "adj.R2") <- adj.R2
  coefficients
}
Here I would offer three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol() based on Cholesky factorization with complete pivoting and lm() based on QR factorization with partial pivoting have shrunk different coefficients to NA. But the two estimations are equivalent, with the same fitted values and residuals.
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is 3 to 4 times faster than lm(). If you want to know the reason, read this answer of mine.
Remark
I have focused on improving performance of the computational kernel. You can take it one step further by using Ben Bolker's parallelism suggestion. If my approach gives a 3-fold boost and parallel computing gives a 3-fold boost on 4 cores, you end up with a 9-fold boost!
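A hedged sketch of what that combination could look like, assuming the formulas in formulae refer to columns of a data frame df and lm.chol() is defined as above:
## fit many formulas with lm.chol() across cores (sketch only)
library(parallel)
cl <- makeCluster(4)                      # adjust to the number of cores available
clusterExport(cl, c("lm.chol", "df"))     # workers need the function and the data
fits <- parLapply(cl, formulae, function(f) lm.chol(f, data = df))
stopCluster(cl)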
There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel package).
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA
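Since the question also asks for standard errors, a hedged follow-up (assuming MuMIn::get.models accepts the pdredge result, as it does for dredge) is to refit the selected models and read the errors from the usual lm summaries:
fits <- get.models(p, subset = TRUE)   # refit every model retained by pdredge()
se_list <- lapply(fits, function(m) summary(m)$coefficients[, "Std. Error"])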
