Finding non-linear correlations in R - r

I have about 90 variables stored in data[2-90]. I suspect about 4 of them will have a parabola-like correlation with data[1]. I want to identify which ones have the correlation. Is there an easy and quick way to do this?
I have tried building a model like this (which I could do in a loop for each variable i = 2:90):
y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2
quadratic.model = lm(y ~ x + x2)
And then look at the R^2/coefficient to get an idea of the correlation. Is there a better way of doing this?
Maybe R could build a regression model with the 90 variables and chose the ones which are significant itself? Would that be in any way possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression with R for all the variables at ones. Therefore I was manually trying to see if I could see which ones are correlated in advance. It would be helpful if there was a function to use for that.

You can use nlcor package in R. This package finds the nonlinear correlation between two data vectors.
There are different approaches to estimate a nonlinear correlation, such as infotheo. However, nonlinear correlations between two variables can take any shape.
nlcor is robust to most nonlinear shapes. It works pretty well in different scenarios.
At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments. The segment correlations are aggregated to yield the nonlinear correlation. The output is a number between 0 to 1. With close to 1 meaning high correlation. Unlike a pearson correlation, negative values are not returned because it has no meaning in nonlinear relationships.
More details about this package here
To install nlcor, follow these steps:
install.packages("devtools")
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)
After you install it,
# Implementation
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")
# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = T)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot
As shown in the example the linear correlation was close to zero although there was a clear relationship between the variables that nlcor could detect.
Note: The order of x and y inside the nlcor is important. nlcor(x,y) is different from nlcor(y,x). The x and y here represent 'independent' and 'dependent' variables, respectively.

Fitting a generalized additive model, will help you identify curvature in the
relationships between the explanatory variables. Read the example on page 22 here.

Another option would be to compute mutual information score between each pair of variables. For example, using the mutinformation function from the infotheo package, you could do:
set.seed(1)
library(infotheo)
# corrleated vars (x & y correlated, z noise)
x <- seq(-10,10, by=0.5)
y <- x^2
z <- rnorm(length(x))
# list of vectors
raw_dat <- list(x, y, z)
# convert to a dataframe and discretize for mutual information
dat <- matrix(unlist(raw_dat), ncol=length(raw_dat))
dat <- discretize(dat)
mutinformation(dat)
Result:
| | V1| V2| V3|
|:--|---------:|---------:|---------:|
|V1 | 1.0980124| 0.4809822| 0.0553146|
|V2 | 0.4809822| 1.0943907| 0.0413265|
|V3 | 0.0553146| 0.0413265| 1.0980124|
By default, mutinformation() computes the discrete empirical mutual information score between two or more variables. The discretize() function is necessary if you are working with continuous data transform the data to discrete values.
This might be helpful at least as a first stab for looking for nonlinear relationships between variables, such as that described above.

Related

exponential regression with R ( and negative values)

I am trying to fit a curve to a set of data points but did not succeed. So I ask you.
plot(time,val) # look at data
exponential.model <- lm(log(val)~ a) # compute model
fit <- exp(predict(exponential.model,list(Time=time))) # create the fitted curve
plot(time,val)#plot it again
lines(time, fit,lwd=2) # show the fitted line
My only problem is, that my data contains negative values and so log(val) produces a lot of NA making the model computation crash.
I know that my data does not necessarily look like exponential , but I want to see the fit anyway. I also used another program which shows me val=27.1331*exp(-time/2.88031) is a nice fit but I do not know, what I am doing wrong.
I want to compute it with R.
I had the idea to shift data so no negative values remain, but result is poor and quite sure wrong.
plot(time,val+20) # look at data
exponential.model <- lm(log(val+20)~ a) # compute model
fit <- exp(predict(exponential.model,list(Time=time))) # create the fitted curve
plot(time,val)#plot it again
lines(time, fit-20,lwd=2) # show the (BAD) fitted line
Thank you!
I figured some things out and have a satisfying solution.
exponential.model <- lm(log(val)~ a) # compute model
The log(val) term is trying to rescale the values, so a linear model can be applied. Since this not possible to my values, you have to use a non-linear model (nls).
exponential.model <- nls(val ~ a*exp(b*time), start=c(b=-0.1,h=30))
This worked fine for me.
satisfying fit

R - linear model does not match experimental data

I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check for the reliability of the experimental data. For each condition I have ~5k-10k observations stored in a data frame df:
[1] cond1 repA cond1 repB cond2 repA cond2 repB ...
[2] 4.158660e+06 4454400.703 ...
[3] 1.458585e+06 4454400.703 ...
[4] NA 887776.392 ...
...
[5024] 9571785.382 9.679092e+06 ...
I use the following code to plot scatterplot + lm + R^2 values (stored in rdata) for the different conditions:
for (i in seq(1,13,2)){
vec <- matrix(0, nrow = nrow(df), ncol = 2)
vec[,1] <- df[,i]
vec[,2] <- df[,i+1]
vec <- na.exclude(vec)
plot(log10(vec[,1]),log10(vec[,2]), xlab = 'rep A', ylab = 'rep B' ,col="#00000033")
abline(fit<-lm(log10(vec[,2])~log10(vec[,1])), col='red')
legend("topleft",bty="n",legend=paste("R2 is",rdata[1,((i+1)/2)] <- format(summary(fit)$adj.r.squared,digits=4)))
}
However, the lm seems to be shifted so that it does not fit the trend I see in the experimental data:
It consistently occurs for every condition. I unsuccesfully tried to find an explanation by looking up the scource code and browsing different forums and posts (this or here).
Would have like to simply comment/ask a few questions, but can't.
From what I've understood, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least square procedure, which only takes into account the error in Y (some might argue a weighted OLS may work, however I'm not skilled enough to discuss that). Your question seem linked to this one.
What you can use is a total least square procedure: it takes into account the error in X and Y. In the example below, I've used a "normal" TLS assuming there is the same error in X and Y (thus error.ratio=1). If it is not, you can specify the error ratio by entering error.ratio=var(y1)/var(x1) (at least I think it's var(Y)/var(X): check on the documentation to ensure that).
library(mcr)
MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values=MCResultAnalytical.calcResponse(MCR_reg,x_to_predict,alpha=0.05)
CI_low=predicted_values[,4]
CI_up=predicted_values[,5]
Please note that, in Deming/TLS regressions, your x- and y-errors are supposed to follow normal distribution, as explained here. If it's not the case, go for a Passing-Bablok regressions (and the R code is here).
Also note that the R2 isn't defined for Deming nor Passing Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not exactly provide the same information. Since you're studying a linear correlation between two factors, see Pearson's product moment correlation coefficient, and use e.g. the rcorrfunction.

extract residuals from aov()

I've run an anova using the following code:
aov2 <- aov(amt.eaten ~ salt + Error(bird / salt),data)
If I use view(aov2) I can see the residuals within the structure of aov2, but I would like to extract them in a way that doesn't involve cutting and pasting. Can someone help me out with the syntax?
Various versions of residuals(aov2) I have been using only produce NULL
I just learn that you can use proj:
x1 <- gl(8, 4)
block <- gl(2, 16)
y <- as.numeric(x1) + rnorm(length(x1))
d <- data.frame(block, x1, y)
m <- aov(y ~ x1 + Error(block), d)
m.pr <- proj(m)
m.pr[['Within']][,'Residuals']
The reason that you cannot extract residuals from this model is that you have specified a random effect due to the bird salt ratio (???). Here, each unique combination of bird and salt are treated like a random cluster having a unique intercept value but common additive effect associated with a unit difference in salt and the amount eaten.
I can't conceive of why we would want to specify this value as a random effect in this model. But in order to sensibly analyze residuals, you may want to calculate fitted differences in each stratum according to the fitted model and optimal intercept. I think this is tedious work and not very informative, however.

Error returned predicting new data using GAM with periodic smoother

Apologies if this is better suited in CrossValidated.
I am fitting GAM models to binomial data using the mgcv package in R. One of the covariates is periodic, so I am specifying the bs = "cc" cyclic cubic spline. I am doing this in a cross validation framework, but when I go to fit my holdout data using the predict function I get the following error:
Error in pred.mat(x, object$xp, object$BD) :
can't predict outside range of knots with periodic smoother
Here is some code that should replicate the error:
# generate data:
x <- runif(100,min=-pi,max=pi)
linPred <- 2*cos(x) # value of the linear predictor
theta <- 1 / (1 + exp(-linPred)) #
y <- rbinom(100,1,theta)
plot(x,theta)
df <- data.frame(x=x,y=y)
# fit gam with periodic smoother:
gamFit <- gam(y ~ s(x,bs="cc",k=5),data=df,family=binomial())
summary(gamFit)
plot(gamFit)
# predict y values for new data:
x.2 <- runif(100,min=-pi,max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit,newdata=df.2)
Any suggestions on where I'm going wrong would be greatly appreciated. Maybe manually specifying knots to fall on -pi and pi?
I did not get an error on the first run but I did replicate the error on the second try. Perhaps you need to use set.seed(123) #{no error} and set.seed(223) #{produces error}. to see if that creates partial success. I think you are just seeing some variation with a relatively small number of points in your derivation and validation datasets. 100 points for GAM fit is not particularly "generous".
Looking at the gamFit object it appears that the range of the knots is encoded in gamFit$smooth[[1]]['xp'], so this should restrict your inputs to the proper range:
x.2 <- runif(100,min=-pi,max=pi);
x.2 <- x.2[findInterval(x.2, range(gamFit$smooth[[1]]['xp']) )== 1]
# Removes the errors in all the situations I tested
# There were three points outside the range in the set.seed(223) case
The problem is that your test set contains values that were not in the range of your training set. Since you used a spline, knots were created at the minimum and maximum value of x, and your fitted function is not defined outside of that range. So, when you test the model, you should exclude those points that are outside the range. Here is how you would exclude the points in the test set:
set.seed(2)
... <Your code>
predict(gamFit,newdata=df.2[df.2$x>=min(df$x) & df.2$x<=max(df$x),,drop=F])
Or, you could specify the "outer" knot points in the model to the min and max of your whole data. I don't know how to do that offhand.

Is there an R library that estimates a multivariate natural cubic spline (or similar) function?

note: originally posted on Cross Validated (stats SE) on 07-26-2011, with no correct answers to date.
Background
I have a model, f, where Y=f(X)
X is an n x m matrix of samples from m parameters and Y is the n x 1 vector of model outputs.
f is computationally intensive, so I would like to approximate f using a multivariate cubic spline through (X,Y) points, so that I can evaluate Y at a larger number of points.
Question
Is there an R function that will calculate an arbitrary relationship between X and Y?
Specifically, I am looking for a multivariate version of the splinefun function, which generates a spline function for the univariate case.
e.g. this is how splinefun works for the univariate case
x <- 1:100
y <- runif(100)
foo <- splinefun(x,y, method = "monoH.FC")
foo(x) #returns y, as example
The test that the function interpolates exactly through the points is successful:
all(y == foo(1:100))
## TRUE
What I have tried
I have reviewed the mda package, and it seems that the following should work:
library(mda)
x <- data.frame(a = 1:100, b = 1:100/2, c = 1:100*2)
y <- runif(100)
foo <- mars(x,y)
predict(foo, x) #all the same value
however the function does not interpolate exactly through the design points:
all(y == predict(foo,x))
## FALSE
I also could not find a way to implement a cubic-spline in either the gam, marss, or earth packages.
Actually several packages can do it. The one I use is the "rms" package which has rcs, but the survival package also has pspline and the splines package has the ns function {}. "Natural splines" (constructed with ns) are also cubic splines. You will need to form multivariate fitting function with the '*' operator in the multivariate formula creating "crossed" spline terms.
that the example you offered was not sufficiently rich.
I guess I am confused that you want exact fits. R is a statistical package. Approximate estimation is the goal. Generally exact fits are more of a problem because they lead to multicollinearity.
Have a look at the DiceKriging package which was developed to undertake tasks like this.
http://cran.r-project.org/web/packages/DiceKriging/index.html
I've provided an example application at
https://stats.stackexchange.com/questions/13510/fitting-multivariate-natural-cubic-spline/65012#65012
I'm not sure if this is precisely what you are looking for, but you could try Tps() in the R package fields. It's meant for doing thin-plate splines interpolations (2D equivalent of cubic splines) for spatial data, but will take up to four covariates, although it will expect them to be euclidean x,y,z + time, so you need to be clear that you are selecting the correct options for your particular case. If you want to interpolate, set the smoothing parameter lambda to zero. You might also try the function polymars() in the R package polspline.

Resources