R: Calculating Pearson correlation and R-squared by group

I am trying to extend the answer of a question R: filtering data and calculating correlation.
To obtain the correlation of temperature and humidity for each month of the year (1 = January), we would have to do the same for each month (12 times).
cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
Is there any way to do each month automatically?
In my case I have more than 30 groups (species, not months) that I would like to test for correlations, so I wanted to know whether there is a faster way than doing them one by one.
Thank you!

cor(airquality[airquality$Month == 1, c("Temp", "Humidity")])
gives you a 2 * 2 correlation matrix rather than a number. I bet you want a single number for each Month, so use
## cor(Temp, Humidity | Month)
with(airquality, mapply(cor, split(Temp, Month), split(Humidity, Month)) )
and you will obtain a vector.
Have a read around ?split and ?mapply; they are very useful for "by group" operations, although they are not the only option. Also read around ?cor, and compare the difference between
a <- rnorm(10)
b <- rnorm(10)
cor(a, b)
cor(cbind(a, b))
The answer you linked in your question is doing something similar to cor(cbind(a, b)).
Reproducible example
The airquality dataset in R does not have a Humidity column, so I will use Wind for testing:
## cor(Temp, Wind | Month)
x <- with(airquality, mapply(cor, split(Temp, Month), split(Wind, Month)) )
# 5 6 7 8 9
#-0.3732760 -0.1210353 -0.3052355 -0.5076146 -0.5704701
We get a named vector, where names(x) gives Month, and unname(x) gives correlation.
Thank you very much! It worked just perfectly! I was trying to figure out how to obtain a vector with the R^2 for each correlation too, but I can't... Any ideas?
cor(x, y) is like fitting a standardised linear regression model:
coef(lm(scale(y) ~ scale(x) - 1)) ## remember to drop intercept
The R-squared in this simple linear regression is just the square of the slope. In the example above, the vector x stores the correlation per group, so the per-group R-squared is just x ^ 2.
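As a quick sanity check of the claim that squaring the per-group correlation gives the per-group R-squared, the sketch below compares it with the R-squared reported by lm() fitted separately within each month. The names r, r_squared and r_squared_lm are introduced here for illustration only.

```r
# Per-group Pearson correlation of Temp and Wind (same call as in the answer)
r <- with(airquality, mapply(cor, split(Temp, Month), split(Wind, Month)))
r_squared <- r^2

# Cross-check: R-squared from a simple regression fitted within each month
r_squared_lm <- sapply(split(airquality, airquality$Month),
                       function(d) summary(lm(Temp ~ Wind, data = d))$r.squared)

# Should be TRUE up to floating point tolerance
all.equal(r_squared, r_squared_lm)
```

This only holds for a simple regression with one predictor and an intercept; with several predictors the R-squared is no longer the square of a single pairwise correlation.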

Related

R function to find difference in mean greater than or equal to a specific number

I have just started a basic statistics course using R, and we're studying paired t-tests. I have come across questions where we're given two sets of data and asked to test whether the difference in means is equal to 0, greater than 0, and so on. The function we use for two samples x and y with unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test that the difference in means is greater than or equal to a specified number, against the alternative that it is less than that number, rather than 0?
For example, say we're given two data sets with the before and after weights of 10 people. How do we test that the mean difference in weight is greater than or equal to, say, 3kg against the alternative that the mean difference in weight is less than 3kg? Is there a way to do this? Would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you're in need of more theoretical proof. Is it ok to add/subtract the 3kg from either x or y and then use the t-test to check for similarity? I think this would tell you at least which outcome is more likely, if that's the end goal. It would be good to get feedback on this
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3 assuming your x variables are in the units you describe above.
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange here: https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317
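For the hypotheses exactly as worded in the question (null: mean difference of at least 3kg, alternative: less than 3kg), the direction of the alternative would be "less" rather than "greater". Below is a minimal sketch on simulated data; before and after are hypothetical vectors, and the difference is taken as after minus before because t.test(x, y, paired=TRUE) works on x - y.

```r
set.seed(1)
# Hypothetical weights of 10 people; the simulated true mean change is 1 kg
before <- round(rnorm(10, mean = 70, sd = 10), 1)
after  <- before + rnorm(10, mean = 1, sd = 2)

# H0: mean(after - before) >= 3   vs   H1: mean(after - before) < 3
res <- t.test(after, before, mu = 3, alternative = "less", paired = TRUE)
res$p.value

# The paired test is the same as a one-sample test on the differences
res_diff <- t.test(after - before, mu = 3, alternative = "less")
all.equal(res$statistic, res_diff$statistic)
```

A small p-value here is evidence that the mean change is below 3kg; a large one means the data are compatible with a change of 3kg or more, not that such a change has been demonstrated.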

Fama Macbeth Regression in R pmg

In the past few days I have been trying to find out how to do Fama Macbeth regressions in R. The usual advice is to use the plm package with pmg, but every attempt returns an error saying that I have an insufficient number of time periods.
My Dataset consists of 2828419 observations with 13 columns of variables of which I am looking to do multiple cross-sectional regressions.
My firms are identified by seriesid, I have a date variable, and I want to run the following Fama Macbeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value + size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid"))
Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset as a data.table, a data.frame and a pdata.frame. Switching the order of the index does not work either.
My data contains NAs as well.
Who can fix this, or find a different way for me to do Fama Macbeth?
This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful - it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" due to missing data being dropped.
You have two options - impute the missing data or drop observations with missing data (the latter being a quick test that the model works without missing points before deciding what you want to do that is valid for estimation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
EDIT: I dug into the source code to see where the error was triggered and here is what the issue is - again likely caused by missing data, but it does interact with your degrees of freedom:
...
# part of the code where error is triggered below, here is context:
# X = matrix of the RHS of your model including intercept, so X[,1] is all 1s
# k = number of coefficients used determined by length(coef(plm.model))
# ind = vector of ID values
# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[,1], ind, length))
# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate.
if (t < (k + 1))
stop("Insufficient number of time periods")
That is what is triggering your error. So imputation is definitely a solution, but there might be a single offender in your data and importantly, once this condition is satisfied your model will run just fine with missing data.
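To look for the "single offender" before calling pmg, the same count can be reproduced by hand. This is only a sketch on a made-up panel: the data frame Data, its values, and the choice k = 2 (intercept plus one regressor) are assumptions for illustration, and the grouping variable should be whichever one pmg treats as the unit ID in your call.

```r
set.seed(42)
# Hypothetical panel: firm "C" is observed in only two periods
Data <- data.frame(
  seriesid = rep(c("A", "B", "C"), times = c(12, 12, 2)),
  date     = c(1:12, 1:12, 1:2),
  totret   = rnorm(26),
  momentum = rnorm(26)
)
Data$momentum[c(3, 15)] <- NA  # some missing values

# Keep only rows that are complete for the variables in the formula,
# since rows with NAs are dropped before the check runs
vars <- c("totret", "momentum")
complete <- Data[complete.cases(Data[, vars]), ]

# Usable observations per ID, compared against the k + 1 threshold
k <- 2
periods_per_id <- table(complete$seriesid)
too_short <- names(periods_per_id)[periods_per_id < k + 1]
too_short

# Remove the offending IDs before estimating
Data_ok <- complete[!complete$seriesid %in% too_short, ]
```

With more regressors in the formula, k grows and more IDs may fall below the threshold, so the check is worth repeating for each specification.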
Lately, I fixed the Fama Macbeth regression in R.
From a Data Table with all of the characteristics within the rows, the following works and gives the opportunity to equally weight or apply weights to the regression (remove the ",weights = marketcap" for equally weighted). totret is a total return variable, logmarket is the logarithm of market capitalization.
library(dplyr)  # provides %>%, group_by() and summarise()
logmarket <- df %>%
  group_by(date) %>%
  summarise(constant = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[1],
            rsquared = summary(lm(totret ~ logmarket, weights = marketcap))$r.squared,
            beta = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[2])
You obtain a DataFrame with monthly alphas (constant), betas (beta), the R squared (rsquared).
To retrieve coefficients with t-statistics in a dataframe:
library(lmtest)  # provides coeftest()
Summarystatistics <- as.data.frame(matrix(data=NA, nrow=6, ncol=1))
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant","t-stat", "beta", "tstat", "R^2", "observations")
Summarystatistics[1,1] <- mean(logmarket$constant)
Summarystatistics[2,1] <- coeftest(lm(logmarket$constant~1))[1,3]
Summarystatistics[3,1] <- mean(logmarket$beta)
Summarystatistics[4,1] <- coeftest(lm(logmarket$beta~1))[1,3]
Summarystatistics[5,1] <- mean(logmarket$rsquared)
Summarystatistics[6,1] <- nrow(subset(df, !is.na(logmarket)))
Some values of "seriesid" occur only once in the data, which is why pmg gives the error. If you do something like this (with the variable names you use), it will stop the error:
try2 <- try2 %>%
group_by(cusip) %>%
mutate(flag = (if (length(cusip)==1) {1} else {0})) %>%
ungroup() %>%
filter(flag == 0)

Sampling from a multivariate distribution including gender in R

I'm trying to simulate a wider population from a small one in R as follows:
idata <- subset(data, select=c(WT, AGE, HT, BFP, SEX) )
M= cor(idata)
mu <- sapply(idata, mean)
sd <- sapply(idata, sd)
sigma=cor2cov(M, sd)
simulation <- as.data.frame(mvrnorm(1000, mu, sigma))
But the problem is that, for SEX, the code assumes a continuous distribution, while it has to be binary: the effect of sex has to be either fully included (SEX==1) or not at all (SEX==0). I'd appreciate any help in this regard.
Thanks
What you should do is consider that your data consists of two sub-populations, and then draw data from them, based on their proportions.
So, first estimate the proportions pi_m and pi_f (= 1 - pi_m), which are the proportions of SEX == 1 and SEX == 0 respectively. This should be something like
pi_m = sum(idata$SEX == 1)/ nrow(idata)
Then estimate parameters for the two populations, mu_f, mu_m, sigma_f and sigma_m, which are mean and covariance parameters for the two SEX populations (now without the SEX variable).
Then draw a random number r <- runif(1); if it is less than or equal to pi_m, generate a sample from N(mu_m, sigma_m), otherwise from N(mu_f, sigma_f).
You can do this step 1000 times to get 1000 samples from your distribution.
Of course, you can vectorise this by first generating 1000 samples from runif. For example (rmvnorm comes from the mvtnorm package)
n_m <- sum(runif(1000) <= pi_m)
n_f <- 1000 - n_m
X_m <- rmvnorm(n_m, mu_m, sigma_m)
X_f <- rmvnorm(n_f, mu_f, sigma_f)
X <- rbind(X_m, X_f)
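Putting the steps above together, here is a self-contained sketch that also carries the SEX label through to the simulated data. The small idata sample is made up for illustration, simulate_group is a hypothetical helper, and MASS::mvrnorm (as in the question) is used in place of rmvnorm. Drawing SEX with rbinom is equivalent to comparing runif draws against the proportion.

```r
library(MASS)  # mvrnorm(); MASS ships with standard R installations

set.seed(123)
# Hypothetical small source sample standing in for 'idata'
n0  <- 60
SEX <- rbinom(n0, 1, 0.5)
idata <- data.frame(
  WT  = rnorm(n0, 65 + 12 * SEX, 8),
  AGE = rnorm(n0, 40, 10),
  HT  = rnorm(n0, 163 + 13 * SEX, 6),
  BFP = rnorm(n0, 28 - 8 * SEX, 5),
  SEX = SEX
)

vars  <- c("WT", "AGE", "HT", "BFP")
n_sim <- 1000

# Step 1: draw the simulated SEX from the observed proportion of SEX == 1
pi_1    <- mean(idata$SEX == 1)
sex_sim <- rbinom(n_sim, 1, pi_1)

# Step 2: within each SEX level, sample the continuous variables from
# that group's own multivariate normal (group mean and covariance)
simulate_group <- function(level) {
  grp <- idata[idata$SEX == level, vars]
  n   <- sum(sex_sim == level)
  out <- as.data.frame(mvrnorm(n, colMeans(grp), cov(grp)))
  out$SEX <- rep(level, n)
  out
}
simulation <- rbind(simulate_group(0), simulate_group(1))
```

Because each group has its own mean vector and covariance matrix, SEX stays strictly binary while still shifting the other variables the way it does in the source data.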

Partial Cross-correlation in R

I think the title is fairly self-explanatory. I want to compute the cross-correlation between two time series controlled for the values at other lags. I can't find any existing R code to do this, and I'm not at all confident enough in my knowledge of statistics (or R) to try to write something myself. It would be analogous to the partial autocorrelation function, just for the cross-correlation instead of the autocorrelation.
If it helps at all, my larger objective is to look for lagged correlations between different measurements of a physical system (to start with, flux and photon index from gamma ray measurements of blazars), with the goal of building a general linear model to try to predict flaring events.
Look at my answer to my own question (same as the one you posted).
You can make use of the pacf function in R, extending it to a matrix with 2 or more time series. I have checked results between the multivariate acf and ccf functions and they yield the same results, so the same can be concluded about the multivariate pacf and the non-existing pccf.
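A minimal sketch of the multivariate pacf call described above, on simulated series. The data and the 3-step delay are made up for illustration. Which of the two off-diagonal slices corresponds to "x leads y" is an assumption worth confirming on data where the lead is known, as here.

```r
set.seed(1)
n <- 200
x <- as.numeric(arima.sim(n = n, list(ar = 0.9)))
# Hypothetical y that follows x with a delay of 3 steps, plus noise
y <- c(rep(NA, 3), x[1:(n - 3)]) + rnorm(n, 0, 0.5)
ok <- !is.na(y)  # pacf() does not accept NAs by default

# Passing a two-column matrix gives partial auto- and cross-correlations
p <- pacf(cbind(x = x[ok], y = y[ok]), lag.max = 8, plot = FALSE)

dim(p$acf)      # lags x series x series
p$acf[, 1, 2]   # one off-diagonal slice of partial cross terms
p$acf[, 2, 1]   # the other direction
```

Note that pacf starts at lag 1, so the contemporaneous (lag 0) term is not part of the output.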
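I believe this works:
library(dplyr)    # %>%, tibble(), arrange(), mutate(), bind_rows(), and the vector-shifting lag() used in the test
library(ggplot2)  # plotting in the test below
pccf <- function(x,y,nlags=7,partial=TRUE){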
# x (numeric): variable that leads y
# y (numeric): variable of interest
# nlags (integer): number of lags (including zero)
# partial (boolean): partial or absolute correlation
# trim y
y <- y[-(1:(nlags-1))]
# lagged matrix of x
x_lagged <- embed(x,nlags)
# process for each lag
rho <- lag <- NULL
for(i in 1:(nlags)){
if(partial){
# residuals of x at lag of interest regressed on all other lags of x
ex <- lm(x_lagged[,i] ~ x_lagged[,-i])$residuals
# residuals of y regressed on all lags of x but the one of interest
ey <- lm(y ~ x_lagged[,-i])$residuals
}else{
ex <- x_lagged[,i]
ey <- y
}
# calculate correlation
rho[i] = cor(ex,ey, use="pairwise.complete.obs")
lag[i] = i-1
}
return(
tibble(lag=lag, rho=rho) %>%
arrange(lag)
)
}
# test
n <- 200 # count
nlag <- 6 # number of lags
x <- as.numeric(arima.sim(n=n,list(ar=c(phi=0.9)),sd=1)) # simulate times series x
y <- dplyr::lag(x,nlag) + rnorm(n,0,0.5) # simulate y to lag x (dplyr::lag shifts the vector and pads with NA; stats::lag would not)
y <- y[(nlag+1):n] # remove NAs from lag
x <- x[(nlag+1):n] # align with y
pccf(x,y,nlags=10,partial=FALSE) %>%
mutate(type='Cross correlation') %>%
bind_rows(
pccf(x,y,nlags=10,partial=TRUE) %>%
mutate(type='Partial cross correlation')
) %>%
ggplot() +
geom_col(aes(-lag,rho),width=0.1) +
facet_wrap(~type,scales='free_y', ncol=1) +
scale_x_continuous(breaks=-10:0) +
theme_bw(base_size=20)

Applying an lm function to different ranges of data and separate groups using data.table

How do I perform a linear regression using different intervals for data in different groups in a data.table?
I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.
I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.
I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot and day combination using a different interval of counts during each period. Period 1 should regress CO2 during counts 1-5, period 2 using 1-7 and period 3 using 1-9.
CO2 <- rep((runif(10, 350,359)), 300) # 10 days, 10 plots, 3 fences
count <- rep((1:10), 300) # 10 days, 10 plots, 3 fences
DOY <-rep(rep(152:161, each=10),30) # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each=1000) # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each=100),3) # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)
I expect an output which gives me the R2 fit and slope of the line for each plot, fence and DOY.
The data I have provided is a small subsample, my real data has 1*10^6 rows. The following works, but is slow:
model <- function(df)
{lm(CO2 ~ count, data = subset(df, ifelse(df$period == 1,count>1 &count<5,
ifelse(df$period == 2,count>1 & count<7,count>1 & count<9))))}
model_flux <- dlply(flux, .(fence, plot, DOY), model)
rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")
Here is a "data.table" way to do this:
library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[,include:=(period==1 & count %in% 2:4) |
(period==2 & count %in% 2:6) |
(period==3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)
model <- function(df) {
fit <- lm(CO2 ~ count, data = df)
return(list(intercept=coef(fit)[1],
slope=coef(fit)[2],
rsquare=summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]
Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that your plyr code returns a data frame and this code produces a data table. It isn't much faster on this test dataset.
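Since most of the per-group cost is the overhead of lm() and summary(), one option for large data is to compute the slope, intercept and R-squared of a one-predictor regression in closed form. This is a sketch of that swapped-in technique, not part of the answer above; simple_fit is a hypothetical helper, checked here against lm() on made-up data.

```r
# Closed-form simple regression of y on x: same numbers as lm(y ~ x)
simple_fit <- function(x, y) {
  slope <- cov(x, y) / var(x)
  list(intercept = mean(y) - slope * mean(x),
       slope     = slope,
       rsquare   = cor(x, y)^2)
}

set.seed(1)
# Hypothetical measurements for a single fence/plot/DOY group
count <- 2:8
CO2   <- 350 + 0.8 * count + rnorm(length(count), 0, 0.3)

fit <- lm(CO2 ~ count)
res <- simple_fit(count, CO2)

# Both comparisons should be TRUE up to floating point tolerance
all.equal(unname(coef(fit)), c(res$intercept, res$slope))
all.equal(summary(fit)$r.squared, res$rsquare)
```

Because the helper returns a named list, it could be used in the grouped call in place of model(.SD), for example flux.subset[, simple_fit(count, CO2), by = .(fence, plot, DOY)].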
