Calculating correlation between residuals of linear regression with NAs and independent variable in R - r

I am trying to calculate the correlation coefficient between the residuals of a linear regression and the independent variable p.
Basically, the linear regression estimates the current sales as a function of the current price p and the past price p1.
The vector of current prices mydf$p has length 8, but the residuals is a vector of length 7 because one entry has been deleted due to the NA value of p1.
# lag vector and pad with NAs
# Source: http://heuristically.wordpress.com/2012/10/29/lag-function-for-data-frames/
lagpad <- function(x, k) {
if (!is.vector(x))
stop('x must be a vector')
if (!is.numeric(x))
stop('x must be numeric')
if (!is.numeric(k))
stop('k must be numeric')
if (1 != length(k))
stop('k must be a single number')
c(rep(NA, k), x)[1 : length(x)]
}
mydf <- data.frame(p = c(10, 8, 10, 9, 10, 9, 10, 8))
mydf$p1 <- lagpad(mydf$p,1)
mydf$sales <- with(mydf, 200 - 15 * p + 5 * p1) + rnorm(nrow(mydf), 0,0.13)
model <- lm(data = mydf, formula = 'sales ~ p + p1')
print(summary(model))
print(cor(residuals(model), mydf$p))
# Error in cor(residuals(model), mydf$p) : incompatible dimensions
In this particular case, it is easy to use mydf$p[2:8] instead of mydf$p.
However, in general, there may be multiple rows at random locations where then NAs are deleted.
How do I access the independent variables that were actually used in the regression after removing the rows containing NA?
One of my attempts was based on the R documentation for lm. I tried to access the "x" matrix through model[['x']] but that did not work.

You can get the actual data used to fit the model from model$model, and from there the p column:
cor(residuals(model), model$model$p)
Alternatively, is.na(mydf$p1) will tell you which rows in mydf have an NA in column p1:
cor(residuals(model), mydf$p[!is.na(mydf$p1)])
In general, is.na(x) tells us whether elements in x are NA or not:
> is.na(c(1,2,NA,4,NA,6))
[1] FALSE FALSE TRUE FALSE TRUE FALSE

model.matrix(model) seems to be what you are looking for
Then you can select the variables you want with [] and the column number or name

The x matrix is only created if you specify x=T in your call to lm. Then model$x will give you the value of x (this is more idiomatic that model[['x']].
lm handles missing values by just completely omitting an observation where a value is missing. Maybe you want to do something like:
cor(residuals(model), mydf$p[!is.na(mydf$p)])
?

Related

R function to find which of 3 variables correlates most with another value?

I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship for these F2 values with 3 other variables, and if there is, which one is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes" "no" answers or whether responses are positive, neutral or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a T-Test but am unsure if this will give me what I need.
A quick solution might look like this. Here, the cor function is used. Read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below return the variable with the highest Pearson correlation with respect to the reference variable.
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
val <- sapply(vars, cor, y = ref)
val[which.max(val)]
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)

Generate numbers with specific correlation [with only positive values in the output]

I want to obtain a dataframe with simulated values which have a specific correlation to each other.
I need to use this function, but in the returned output there are also negative values, which do not have meaning for my purposes:
COR <- function (n, xmean, xsd, ymean, ysd, correlation) {
x <- rnorm(n)
y <- rnorm(n)
z <- correlation * scale(x)[,1] + sqrt(1 - correlation^2) *
scale(resid(lm(y ~ x)))[,1]
xresult <- xmean + xsd * scale(x)[,1]
yresult <- ymean + ysd * z
data.frame(x=xresult,y=yresult)
}
Please note that my question starts from this previous post (currently closed):
another similar discussion
Is there a method able to exclude from the final output all the rows which have at least one negative value? (in another terms, x and y must be always positives).
I spent many hours without any concrete result.....
Filtering rows which have at least one negative value can be done with the apply function, e.g.
df <- simcor(100, 1, 1, 1, 1, 0.8)
filter <- apply(df, 1, function(x) sum(x < 0) > 0)
df <- df[!filter,]
plot(df)
First, I create a dataframe df from your funcion. Then, I apply the function sum(x < 0) > 0 rowwise to the dataframe (the second argument of apply, 1 indicates to go along the first dimension of the dataframe or array). This will create a logical vector that is TRUE for every row with at least one negative value. Subsetting the dataframe with the inverse of that (!filter) leaves you with all rows that have no negative values.
UPDATE:
Seems like the package VineCopula offers functions to create distributions with a given correlation. However, I did not dive into the math as deep so I was not able to fully grasp how copulas (i.e. multivariate probability distributions) work. Using this package, you can at least create e.g. two gaussian distributions.
library(VineCopula)
BC <- BiCop(family = 1, par = 0.9)
sim <- BiCopSim(N = 1000, obj = BC)
cor(sim[,1], sim[,2])
plot(sim)
You might be able to then scale the resulting matrix to achieve a certain standard derivation.

extracting residuals from pixel by pixel regression

I am trying to extract the residuals from a regression run pixel by pixel on a raster stack of NDVI/precipitation. My script works when i run it with a small part of my data. But when i try to run the whole of my study area i get: "Error in setValues(out, x) : values must be numeric, integer, logical or factor"
The lm works, since I can extract both slope and intercept. I just cant extract the residuals.
Any idea of how this could be fixed?
Here is my script:
setwd("F:/working folder/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
residualfun = function(x) { if (is.na(x[1])){ NA } else { m <- lm(x[1:6] ~ x[7:12], na.action=na.exclude)
r <- residuals.lm(m)
return (r)}}
res <- calc(s,residualfun)
And here is my data: https://1drv.ms/u/s!AhwCgWqhyyDclJRjhh6GtentxFOKwQ
Your function only test if the first layer shows NA values to avoid fitting the model. But there may be NA in other layers. You know that because you added na.action = na.exclude in your lm fit.
The problem is that if the model removes some values because of NA, the residuals will only have the length of the non-NA values. This means that your resulting r vector will have different lengths depending on the amount of NA values in layers. Then, calc is not be able to combine results of different lengths in a stack a a defined number of layers.
To avoid that, you need to specify the length of r in your function and attribute residuals only to non-NA values.
I propose the following function that now works on the dataset your provided. I added (1) the possibility compare more layers of each if you want to extend your exploration (with nlayers), (2) avoid fitting the model if there are only two values to compare in each layer (perfect model), (3) added a try if for any reason the model can fit, this will output values of -1e32 easily findable for further testing.
library(raster)
setwd("/mnt/Data/Stackoverflow/test")
gimms <- list.files(pattern="*ndvi.tif")
ndvi <- stack(gimms)
precip <- list.files(pattern="*pre.tif")
pre <- stack(precip)
s <- stack(ndvi,pre)
# Number of layers of each
nlayers <- 6
residualfun <- function(x) {
r <- rep(NA, nlayers)
obs <- x[1:nlayers]
cov <- x[nlayers + 1:nlayers]
# Remove NA values before model
x.nona <- which(!is.na(obs) & !is.na(cov))
# If more than 2 points proceed to lm
if (length(x.nona) > 2) {
m <- NA
try(m <- lm(obs[x.nona] ~ cov[x.nona]))
# If model worked, calculate residuals
if (is(m)[1] == "lm") {
r[x.nona] <- residuals.lm(m)
} else {
# alternate value to find where model did not work
r[x.nona] <- -1e32
}
}
return(r)
}
res <- calc(s, residualfun)

R: interaction between continuous and categorical vars in 'isat' regression ('gets' package)

I want to calculate the differential response of y to x (continuous) depending on the categorical variable z.
In the standard lm setup:
lm(y~ x:z)
However, I want to do this while allowing for Impulse Indicator Saturation (IIS) in the 'gets' package. However, the following syntax produces an error:
isat(y, mxreg=x:z, iis=TRUE)
The error message is of the form:
"Error in solve.qr(out, tol = tol, LAPACK = LAPACK) :
singular matrix 'a' in 'solve"
1: In x:z :
numerical expression has 96 elements: only the first used
2: In x:z :
numerical expression has 96 elements: only the first used"
How should I modify the syntax?
Thank you!
At the moment, alas, isat doesn't provide the same functionality as lm on categorical/character variables, nor on using * and :. We hope to address that in a future release.
In the meantime you'll have to create distinct variables in your dataset representing the interaction. I guess something like the following...
library(gets)
N <- 100
x <- rnorm(N)
z <- c(rep("A",N/4),rep("B",N/4),rep("C",N/4),rep("D",N/4))
e <- rnorm(N)
y <- 0.5*x*as.numeric(z=="A") + 1.5*x*as.numeric(z=="B") - 0.75*x*as.numeric(z=="C") + 5*x*as.numeric(z=="D") + e
lm.reg <- lm(y ~ x:z)
arx.reg.0 <- arx(y,mxreg=x:z)
data <- data.frame(y,x,z,stringsAsFactors=F)
for(i in z[duplicated(z)==F]) {
data[[paste("Zx",i,sep=".")]] <- data$x * as.numeric(data$z==i)
}
arx.reg.1 <- arx(data$y,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")])
isat.1 <- isat(data$y,mc=TRUE,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")],max.block.size=20)
Note that as you'll be creating dummies for each category, there's a chance those dummies will cause singularity of your matrix of explanatory variables (if, as in my example, isat automatically uses 4 blocks). Using the argument max.block.size enables you to avoid this problem.
Let me know if I haven't addressed your particular point.

R: multicollinearity issues using glib(), Bayesian Model Averaging (BMA-package)

I am experiencing difficulties estimating a BMA-model via glib(), due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij);
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both via levels and factors, but of course not in the same model. Since I can decide to include i and j as factors, levels, or not include them at all and for k either to include as level, or exclude, there are a total of 18 (3x3x2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
Next, via the following declaration I specify which variables to consider in each of the 18 models. According to me, no linear dependence exists in these specifications.
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function first checks whether the matrix is f.c.r, before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there any other way to include all 18 models in the glib() function?
Thank you in advance.

Resources