Interpreting a pattern in a residual plot produced by gam.check() - r

I'm working on creating a model that examines the effect of ocean characteristics on fishing outcomes. I have spatial data on a 0.5 degree grid and I created the following model:
gam(asinh(yvar) ~ s(lat, lon, bs = "sos") + s(xvar1) +  # asinh() = inverse hyperbolic sine
    s(xvar2) + s(xvar3), data = dat, method = "REML")
The QQ plot and histogram of residuals look okay. However, gam.check() produces an odd pattern in the residuals plot. I know that the points should be scattered around 0, but instead there is a distinct line of points in the residuals. Can anyone provide some insight on how to interpret this plot?

Those will be either all the 0s (most likely) or the 1s/smallest value in your original data. You don't say what these data are, but as you mention fishing outcomes, it is highly likely that they have some natural lower bound, and this line in the residuals comprises all the observations that take this lower bound (before transformation).
As you don't say exactly what your data are, it is difficult to comment further on how to proceed (this may not be an issue, or you may need to drop the transform and instead use a GLM or other non-Gaussian response), but:
1. such patterns are common in ecological/biological data, and
2. transforming your response rarely works well for ecological data.
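For example, if yvar is a non-negative catch-like quantity with many exact zeros, one option in that direction is to drop the transform and model the raw response with a Tweedie family. A minimal sketch, assuming the model and data from the question:
library(mgcv)
## rough sketch, not a drop-in fix: the Tweedie family accommodates
## exact zeros in non-negative data, so no transformation is needed
m_tw <- gam(yvar ~ s(lat, lon, bs = "sos") + s(xvar1) + s(xvar2) + s(xvar3),
            data = dat, family = tw(), method = "REML")
gam.check(m_tw)  # re-inspect the residual diagnostics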

Related

Interpretation of a GAM plot with a square-root-transformed response variable

I have a simple GAM where my goal is to understand how the distance to a feature varies over the year. I originally ran it with the following formula:
m1 <- gam(dist ~ s(month, bs = "cc", k = 12) + s(id, bs = "re"), data = db1, method = "REML")
Where "dist" is the distance in meters to a feature, and "id" is the animal id. When plotting the GAM I obtain the following plot:
First question: if I were interpreting the plot/writing a figure caption, is it correct to say something like:
"GAM plot showing the partial effect of month (x-axis) on the distance to a feature (y-axis). GAM smooths are centered at zero, therefore the zero line reflects the overall mean of the distance to the feature. Thus, values below zero on the y-axis reflect closer proximity to the feature, while values above zero reflect longer distances to the feature."
I say that a negative value (below zero) would mean proximity to the feature as that's also the way to interpret distance coefficients in a GLM, but I would also like to make sure that this is correct and that I'm not misinterpreting the plot.
Second question, are the values on the y-axis directly interpretable? If so, what is the scale? Is it a % of change? (Based on a comment here, but I'm not sure if I understood it properly)
Then I transform the response variable to achieve normality (the original scale was a bit left skewed), and I run this model (residuals look better with the transformation):
m2 <- gam(sqrt(dist) ~ s(month, bs = "cc", k = 12) + s(id, bs = "re"), data = db1, method = "REML")
And I obtain this plot:
Pretty similar to the previous one, and I believe I can interpret it in the same way as described above. But, third question: if I wanted to state exactly what the y-axis means, what would be the most correct way to describe it under the transformation?
Any help with this is very appreciated! Many thanks in advance!
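One way to make the y-axis directly interpretable is to shift the centred smooth by the model intercept, so the curve reads on the scale of the response rather than as a deviation from its mean. A minimal sketch using plot.gam()'s shift and seWithMean arguments, assuming m1 and m2 as above (for m2 the shifted axis is in square-root metres):
# partial effect of month shifted onto the response scale (metres)
plot(m1, select = 1, shift = coef(m1)[1], seWithMean = TRUE)
# same idea for the transformed model; units are sqrt(metres)
plot(m2, select = 1, shift = coef(m2)[1], seWithMean = TRUE)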

Correlation coefficient between nominal and cardinal scale variables

I have to describe the correlation between the variable "Average passes completed per game" (cardinal scale) and the variable "Position" (nominal scale), and measure the strength of the association. For that I have to choose the correlation coefficient correctly, considering the scales. Does anyone know the best way to do that? I am not sure what to use since the two variables are on different scales. The full dataset consists of the following variables:
PLAYER: Name of the player
COUNTRY: Country of origin
BIRTHDATE: Date of birth
HEIGHT_IN_CM: Height of the player
POSITION: Position of the player
PASSES_COMPLETED: Passes completed by the player
DISTANCE_COVERED: Distance covered by the player in km
MINUTES_PLAYED: Minutes played
AVG_PASSES_COMPLETED: Average passes completed by the player
I would very much appreciate if someone could give me some advice on this.
Thank you!
OK, so you need to redefine your question somewhat. Without two continuous variables, correlations cannot be used to "describe" a relationship in the way I guess you are asking. You can, however, test whether there are statistically significant differences in pass rates between positions. As for the questions on the statistics, I agree with Maurtis: CV is the best place. As for the code to do the tests, try this:
Firstly, make sure you have the right packages installed. You will definitely need ggplot2 and ggfortify, plus dplyr for glimpse(), and maybe others if you have to manipulate data. Then load the libraries:
library(ggplot2)
library(ggfortify)
library(dplyr)  # for glimpse()
Next, make sure that your data is tidy, i.e. variables in columns.
Then import your data into R:
#find file
data.location = file.choose()
#Import data
curr.data <- read.csv(data.location)
#Check data import
glimpse(curr.data)
Then plot using ggplot:
ggplot(curr.data, aes(x = POSITION, y = AVG_PASSES_COMPLETED)) +
geom_boxplot() +
theme_bw()
Then model using the linear model function (lm()) to see if there is a significant difference in pass rates with regard to position:
passrate_model <- lm(AVG_PASSES_COMPLETED ~ POSITION, data = curr.data)
Before you test your hypothesis, you need to check the appropriateness of the model:
autoplot(passrate_model, smooth.colour = NA)
If the residual plots look fine, then we are ready to test. If not, you will have to use another type of model (which I won't go into here).
The appropriate test for this (I think) is Tukey's HSD test, which requires an ANOVA. The ANOVA summary should show you whether there is variance attributable to position:
passrate_av <- aov(passrate_model)
summary(passrate_av)
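Since the question asks for the strength of the association and not just significance, one correlation-like effect size you can read off the ANOVA is eta-squared: the proportion of variance in AVG_PASSES_COMPLETED accounted for by POSITION. A minimal sketch, assuming passrate_av from above:
# eta-squared: between-group sum of squares over total sum of squares
av_tab <- summary(passrate_av)[[1]]
eta_sq <- av_tab[["Sum Sq"]][1] / sum(av_tab[["Sum Sq"]])
eta_sq  # ranges from 0 to 1, like a squared correlation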
This will perform the Tukey test and give pair-wise comparisons including difference in means, 95% confidence intervals, and adjusted p-values:
tukey.test <- TukeyHSD(passrate_av)
tukey.test
And it can even do a nice plot for you too:
plot(tukey.test)

Scale LDA decision boundary

I have a rather unconventional problem and am having a hard time finding a solution. I would really appreciate your help.
I have 4 genes (features) and my classification is binary (0 and 1). After a lot of back and forth, I have settled on LDA for my classification. I have different studies, each comparing the same two classes, and I trained my model using these 4 genes on each of these studies.
I want to visualize the LDA scores as a points plot, something like below, where each section represents a different study/dataset: samples of that dataset on the x-axis, and on the y-axis the LD1 value I get using:
library(MASS)
lda_model <- lda(class ~ ., data = train)
lda_scores <- predict(lda_model, train)$x[, "LD1"]  # plotted on the y-axis
Since I trained a different model on each dataset, we can clearly see that the decision boundary (which I assume is the black line) for each dataset is different and on a different scale. However, I want to scale the values on the y-axis in such a way that all my datasets are on the same scale, and I can represent this plot with a single decision boundary (again, something I can clearly draw on the plot, like the red line).
The LD1 values here are a*GeneA + b*GeneB + c*GeneC + d*GeneD - mean(a*GeneA + b*GeneB + c*GeneC + d*GeneD), computed for each dataset individually. However, this is not exactly equal to a*GeneA + b*GeneB + c*GeneC + d*GeneD + intercept, which is what we would get from logistic regression. I am trying to find that intercept-like value, or some method that can put the y-axis on a common scale across all the datasets using LDA.
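For what it's worth, for a two-class LDA with equal priors the decision boundary on LD1 sits at the midpoint of the two projected class means, since predict.lda centres the scores at the prior-weighted grand mean. A sketch of how that intercept-like cut point might be computed per dataset, assuming the lda_model above (with unequal priors the boundary shifts by a log-prior-odds term):
# project the class means the same way predict.lda projects the samples
grand_mean <- colSums(lda_model$prior * lda_model$means)
class_proj <- scale(lda_model$means, center = grand_mean, scale = FALSE) %*%
  lda_model$scaling
threshold <- mean(class_proj[, "LD1"])  # equal-prior cut point on LD1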
Thanks for your help!
I did a min-max scaling and that seemed to work: it put my data points across all datasets on a common scale, with the decision boundary at zero.
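For concreteness, a minimal sketch of what that per-dataset rescaling could look like, where ld1 and study are hypothetical names for the pooled LD1 scores and their dataset labels:
# rescale LD1 scores to [0, 1] within each study
scale_minmax <- function(x) (x - min(x)) / (max(x) - min(x))
ld1_scaled <- ave(ld1, study, FUN = scale_minmax)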

geom_smooth: what is its meaning (why is it lower than the mean?)

I have data on the number of trips people make to work per week, along with the distance of each trip, and I am interested in the relationship between the two variables. (Frequency is expected to fall as distance increases, i.e. a negative relationship.) cor.test() supports this hypothesis: a correlation of -0.08993444 with a p-value < 2.2e-16.
When I come to plot this, the distance clearly tends to decrease for more frequent trips. To make sense of the vast number of points I used geom_smooth(), but I don't fully understand the result. According to the help pages, it's a "conditional mean". However, it never seems to approach the true mean,
> mean(aggs3$Distance)
[1] 9.766497
in the plot below, where the curve never goes above 8.
What's going on here? I think I really want a rolling mean, but I found rollmean() from the zoo package a hassle to implement (you need to sort the data first), so I would like to ask for the best approach before forging ahead. Many thanks.
p <- ggplot(data = aggs3, aes(x = N.trips.week, y = Distance))
p + geom_point(alpha = 0.1) +
  geom_smooth() +
  # note: se is a geom_smooth() argument, so the stray se = 0.1 is dropped here
  stat_density2d(aes(fill = ..level..), geom = "polygon", alpha = 0.5, na.rm = TRUE) +
  ylim(0, 30) + xlim(0, 25) +
  ylab("Distance (miles)")
(Secondary unrelated question: how do I make the 2d density layer contours smoother?)
(P.S. I know there are better ways to visualise this, but for the sake of learning I want a better understanding of how to use geom_smooth.)
The curve geom_smooth produces is indeed an estimate of the conditional mean function, i.e. an estimate of the mean distance in miles conditional on the number of trips per week (by default it uses a particular kind of estimator called LOESS). The number you calculated, in contrast, is an estimate of the unconditional mean, i.e. the mean over all the data.
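To see the distinction directly, you can compute the raw conditional means and compare them with the smoother. A quick sketch with the aggs3 data from the question:
# mean Distance at each observed value of N.trips.week - this is what the
# smoother estimates, as opposed to the overall mean(aggs3$Distance)
aggregate(Distance ~ N.trips.week, data = aggs3, FUN = mean)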
If it's the relationship between the two variables you're interested in, there are plenty of ways you could model it. If you just want a linear relationship, fitting a linear model (lm()) will do the trick, and if that's what you want to plot, passing method = 'lm' as an argument to geom_smooth will show you what it looks like. But your data really doesn't look like there's a simple linear relationship between the two variables, so you may want to think a bit harder about what exactly it is you want to do!
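Concretely, the linear version of the plot would be something like:
library(ggplot2)
# straight-line fit instead of the default LOESS smoother
ggplot(aggs3, aes(x = N.trips.week, y = Distance)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm")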

Is there an implementation of loess in R with more than 3 parametric predictors or a trick to a similar effect?

Calling all experts on local regression and/or R!
I have run into a limitation of the standard loess() function in R and hope you have some advice. The current implementation supports only one to four predictors. Let me set out our application scenario to show why this can easily become a problem as soon as we want to employ globally fitted parametric covariables.
Essentially, we have a spatial distortion s(x,y) overlaid over a number of measurements z:
z_i = s(x_i,y_i) + v_{g_i}
These measurements z can be grouped by the same underlying undistorted measurement value v for each group g. The group membership g_i is known for each measurement, but the underlying undistorted measurement values v_g for the groups are not known and should be determined by (global, not local) regression.
We need to estimate the two-dimensional spatial trend s(x,y), which we then want to remove. In our application, say there are 20 groups of at least 35 measurements each, in the most simple scenario. The measurements are randomly placed. Taking the first group as reference, there are thus 19 unknown offsets.
The below code for toy data (with a spatial trend in one dimension x) works for two or three offset groups.
Unfortunately, the loess call fails for four or more offset groups with the error message:
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
  only 1-4 predictors are allowed
I tried overriding the restriction and got
k>d2MAX in ehg136. Need to recompile with increased dimensions.
How easy would that be to do? I cannot find a definition of d2MAX anywhere, and it seems this might be hardcoded -- the error is apparently triggered by line #1359 in loessf.f
if(k .gt. 15) call ehg182(105)
Alternatively, does anyone know of an implementation of local regression with global (parametric) offset groups that could be applied here?
Or is there a better way of dealing with this? I tried lme with correlation structures but that seems to be much, much slower.
Any comments would be greatly appreciated!
Many thanks,
David
###
#
# loess with parametric offsets - toy data demo
#
x <- seq(0, 9, 0.1)
x.N <- length(x)
o <- c(0.4, -0.8, 1.2  # , -0.2  # works for three but not four
)  # these are the (unknown) offsets
o.N <- length(o)
# indicator matrix: which offset group each observation belongs to
# (observations beyond the last segment form the zero-offset reference)
f <- sapply(seq(o.N), function(n) {
  ifelse(seq(x.N) <= n * x.N / (o.N + 1) &
         seq(x.N) > (n - 1) * x.N / (o.N + 1),
         1, 0)
})
f <- f[sample(NROW(f)), ]  # randomly place the measurements
y <- sin(x) + rnorm(length(x), 0, 0.1) + f %*% o
s.fs <- sapply(seq(NCOL(f)), function(i) paste("f", i, sep = ""))
s <- paste(c("y~x", s.fs), collapse = "+")
d <- data.frame(x, y, f)
names(d) <- c("x", "y", s.fs)
l <- loess(formula(s), parametric = s.fs, drop.square = s.fs,
           normalize = FALSE, data = d, span = 0.4)
yp <- predict(l, newdata = d)
plot(x, y, pch = "+", ylim = c(-3, 3), col = "red")  # input data
points(x, yp, pch = "o", col = "blue")               # fit of that
d0 <- d
d0$f1 <- d0$f2 <- d0$f3 <- 0
yp0 <- predict(l, newdata = d0)
points(x, y - f %*% o)    # spatial distortion
lines(x, yp0, pch = "+")  # estimate of that
op <- sapply(seq(NCOL(f)), function(i) (yp - yp0)[!!f[, i]][1])
cat("Demo offsets:", o, "\n")
cat("Estimated offsets:", format(op, digits = 1), "\n")
Why don't you use an additive model for this? Package mgcv will handle this sort of model, if I understand your question, just fine. I might have this wrong, but the code you show relates y to x, whereas your question mentions z ~ s(x, y) + g. What I show below for gam() is for a response z modelled by a spatial smooth in x and y, with g estimated parametrically and stored as a factor in the data frame:
require(mgcv)
m <- gam(z ~ s(x, y) + g, data = foo)
Or have I misunderstood what you wanted? If you want to post a small snippet of data I can give a proper example using mgcv...?
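As a concrete illustration (not part of the original answer), here is that approach applied to the one-dimensional toy data from the question, where the smooth is s(x) rather than s(x, y). It assumes d and f as constructed in the demo above:
library(mgcv)
# collapse the indicator columns f1-f3 into a single factor g
# (0 = the zero-offset reference segment)
d$g <- factor(drop(f %*% seq_len(NCOL(f))))
mg <- gam(y ~ s(x) + g, data = d, method = "REML")
coef(mg)[c("g1", "g2", "g3")]  # estimated offsets relative to the reference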
