geom_smooth: what is its meaning (why is it lower than the mean?) - r

I have data on the number of trips people make to work per week, along with the distance of the trip. I am interested in the relationship between the two variables. (Frequency is expected to fall as distance increases, essentially a negative relationship.) cor.test supports this hypothesis: -0.08993444 with a p value of 2.2e-16.
When I come to plot this, the distance clearly tends to decrease for more frequent trips. To make sense of the vast number of points I used geom_smooth. But I don't fully understand the result. According to the help pages, it's a "conditional mean". However, it seems never to approach the true mean,
> mean(aggs3$Distance)
[1] 9.766497
in the plot below, which seems never to go above 8.
What's going on here? I think I really want the rolling mean, but found rollmean from the zoo package a hassle to implement (you need to sort the data first), and I would like to ask for the optimal solution before forging ahead. Many thanks.
p <- ggplot(data=aggs3, aes(x=N.trips.week, y=Distance))
p + geom_point(alpha = 0.1) + geom_smooth() +
ylim(0,30) + xlim(0,25) + ylab("Distance (miles)") +
stat_density2d(aes(fill = ..level..), geom="polygon", alpha=0.5,na.rm=T, se=0.1)
(Secondary unrelated question: how do I make the 2d density layer contours smoother?)
(P.S. I know there are better ways to visualise this, e.g. below, but for the sake of learning I need a better understanding of how to use geom_smooth.)

The curve geom_smooth produces is indeed an estimate of the conditional mean function, i.e. an estimate of the mean distance in miles conditional on the number of trips per week. (By default the estimator is LOESS for fewer than 1,000 observations and a GAM smoother otherwise.) The number you calculate, in contrast, is an estimate of the unconditional mean, i.e. the mean over all the data.
If it's the relationship between the two variables you're interested in there are plenty of ways you could model that. If you just want a linear relationship, fitting a linear model (lm()) will do the trick and if that's what you want to plot, passing method='lm' as an argument to geom_smooth will show you what that looks like. But your data really doesn't look like there's just a simple linear relationship between the two variables so you may want to think a bit harder about what it is exactly you want to do!
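To make the distinction concrete, here is a minimal sketch on simulated data. The variables `trips` and `distance` are made up for illustration and are not the `aggs3` data from the question. It shows that the overall mean is the count-weighted average of the conditional means, so the conditional means must straddle it, and that discarding values above 30 lowers the mean.

```r
# Simulated stand-in data (assumption: not the aggs3 data from the question)
set.seed(42)
n <- 5000
trips <- sample(1:10, n, replace = TRUE, prob = (10:1) / 55)  # few trips are more common
distance <- rexp(n, rate = 1 / (12 - trips))                  # mean distance falls with trips

# Unconditional mean: one number for all the data
overall_mean <- mean(distance)

# Conditional means: one number per value of trips
cond_means <- tapply(distance, trips, mean)
counts <- table(trips)

# The overall mean is the count-weighted average of the conditional means
weighted_mean <- sum(as.numeric(cond_means) * as.numeric(counts)) / n
print(all.equal(overall_mean, weighted_mean))

# Dropping distances above 30 before averaging pulls the mean down
trimmed_mean <- mean(distance[distance <= 30])
print(c(overall = overall_mean, trimmed = trimmed_mean))
```

The last step matters because `ylim(0, 30)` in ggplot2 drops observations outside the limits before the smoother is fitted, whereas `coord_cartesian(ylim = c(0, 30))` only zooms the view. Whether that explains the gap in the question's plot depends on how many trips exceed 30 miles, which the question does not say.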

Related

interpretation of a GAM plot with a square rooted response variable

I have a simple GAM where my goal is to understand the variation of the distance to a feature along the year, and I originally ran it with the following formula:
m1 <- gam(dist ~ s(month, bs="cc",k = 12) + s(id, bs="re"), data=db1, method = "REML")
Where "dist" is the distance in meters to a feature, and "id" is the animal id. When plotting the GAM I obtain the following plot:
First question: if I were interpreting the plot or writing a figure caption, is it correct to say something like:
"GAM plot showing the partial effects of month (x-axis) on the distance to a feature (y-axis). GAM smooths are centered at zero, therefore the zero line reflects the overall mean of the distance to the feature. Thus, values below zero on the y-axis reflect higher proximity to the feature, while values above zero reflect longer distances to the feature."
I say that a negative value (below zero) would mean proximity to the feature as that's also the way to interpret distance coefficients in a GLM, but I would also like to make sure that this is correct and that I'm not misinterpreting the plot.
Second question, are the values on the y-axis directly interpretable? If so, what is the scale? Is it a % of change? (Based on a comment here, but I'm not sure if I understood it properly)
Then I transform the response variable to achieve normality (the original scale was a bit left skewed), and I run this model (residuals look better with the transformation):
m2 <- gam(sqrt(dist) ~ s(month, bs="cc",k = 12) + s(id, bs="re"), data=db1, method = "REML")
And I obtain this plot:
Pretty similar to the previous one, and I believe I can interpret it in the same way as described above. But, third question: if I wanted to say exactly what the y-axis means, what would be the most correct way to describe it with the transformation?
Any help with this is very appreciated! Many thanks in advance!

Interpreting results from emmeans comparison

I have a glm model with two fixed effects, Treatment and Date, to estimate Temperature from data collected in a time series. Within Treatment there are three different categories: Fucus, Terrycloth or Control, and temperature is measured beneath those canopies. The model is created like so:
mod1 <- glm(Temp ~ Treatment * Date, data = aveTerry.df)
I am trying to tell if Terrycloth has a similar effect as Fucus canopy (i.e. replicates it).
I found the emmeans package and believe it could help me compare these levels within Treatment using my model. I used it like so to find the estimated marginal means:
terry.emmeans <- emmeans(modAllTerry, poly ~ Treatment | Date)
and plotted the comparisons via:
plot(terry.emmeans.average, comparison = TRUE) + theme_bw()
Giving me this output linked here.
I am looking for some help understanding what this graphical output is, especially what exactly the comparisons are (they are shown by the red arrows). I somewhat understand that the blue boxes are the confidence intervals for the mean temperature for each treatment on one day (based on the model), but how is the comparison made? And why do some days only have a one-sided arrow?
As described in the documentation for plot.emmGrid, the comparison arrows are created in such a way that two arrows are disjoint if and only if their respective means are significantly different at the stated level.
The lowest mean in the set has only a right-pointing arrow because that mean will not be compared with anything smaller, obviating the need for a left-pointing arrow. For similar reasons, the highest mean has only a left-pointing arrow. These arrows do not define intervals; their only purpose is depicting comparisons.
In situations where the SEs of pairwise comparisons vary widely, it may not be possible to construct comparison arrows. If that happens, an error message is displayed.
Confidence intervals are available as well, but those CIs should not be used for comparing means.
More information and examples may be found via vignette("comparisons", "emmeans"). Also, details of how the arrows are actually constructed are given in vignette("xplanations", "emmeans")
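As a rough illustration of the arrows described above, the following sketch fits a similar model to simulated data and draws the comparison arrows alongside the pairwise tests they summarise. The data frame `sim.df`, the effect sizes, and the object names are all hypothetical; Date is treated as a factor here, which is an assumption about the original model. The plot step uses emmeans' default ggplot2 engine, so ggplot2 must be installed.

```r
library(emmeans)

# Simulated stand-in for the temperature data (hypothetical values)
set.seed(11)
sim.df <- expand.grid(Treatment = c("Control", "Fucus", "Terrycloth"),
                      Date = c("day1", "day2"),
                      rep = 1:10)
sim.df$Temp <- 20 -
  3.0 * (sim.df$Treatment == "Fucus") -
  2.5 * (sim.df$Treatment == "Terrycloth") +
  rnorm(nrow(sim.df))

sim.mod <- glm(Temp ~ Treatment * Date, data = sim.df)

# Estimated marginal means of Treatment, separately for each Date
sim.emm <- emmeans(sim.mod, ~ Treatment | Date)

# Pairwise comparisons within each Date (Tukey-adjusted by default)
sim.pairs <- pairs(sim.emm)
print(sim.pairs)

# Bars are confidence intervals; arrows that do not overlap correspond
# to pairs whose means differ significantly in the table above
plot(sim.emm, comparisons = TRUE)
```

Reading the printed table next to the plot is a quick way to check the "disjoint arrows if and only if significant" rule for a given data set.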

Correlation coefficient between nominal and cardinal scale variables

I have to describe the correlation between a variable "Average passes completed per game" (cardinal scale) and a variable "Position" (nominal scale) and measure the strength of the correlation. For that I have to choose the correlation coefficient correctly considering the Scales. Does anyone know what the best way to do that would be? I am not sure what to use since it is two different scales. The full dataset consists of the following variables:
PLAYER: Name of the player
COUNTRY: Country of origin
BIRTHDATE: Birthday Date
HEIGHT_IN_CM: Height of the player
POSITION: Position of the player
PASSES_COMPLETED: Passes completed by the player
DISTANCE_COVERED: Distance covered by the player in km
MINUTES_PLAYED: Minutes played
AVG_PASSES_COMPLETED: Average passes completed by the player
I would very much appreciate if someone could give me some advice on this.
Thank you!
OK, so you need to redefine your question somewhat. Without two continuous variables, correlations cannot be used to "describe" a relationship, as I guess you are asking. You can, however, see if there are statistically significant differences in pass rates between different positions. As for the questions on the statistics, I agree with Maurtis: CV (Cross Validated) is the best place. As for the code to do the tests, try this:
First you need to make sure you have the right packages installed. You will definitely need ggplot2 and ggfortify, plus dplyr for glimpse(), and maybe others if you have to manipulate data or do other things. Then load the libraries:
library(ggplot2)
library(ggfortify)
library(dplyr)  # provides glimpse()
Next, make sure that your data is tidy, i.e. variables in columns.
Then import your data into R:
#find file
data.location = file.choose()
#Import data
curr.data <- read.csv(data.location)
#Check data import
glimpse(curr.data)
Then plot using ggplot:
ggplot(curr.data, aes(x = POSITION, y = AVG_PASSES_COMPLETED)) +
geom_boxplot() +
theme_bw()
Then model using the linear model function (lm()) to see if there is a significant difference in pass rates with regards to position.
passrate_model <- lm(AVG_PASSES_COMPLETED ~ POSITION, data = curr.data)
Before you test your hypothesis, you need to check the appropriateness of the model:
autoplot(passrate_model, smooth.colour = NA)
If the residual plots look fine, then we are ready to test. If not then you will have to use another type of model (and I'm not going into that here now....).
The appropriate test for this (I think) would be a Tukey test, which requires an ANOVA. This will give a summary, and should show you if there is variance due to position:
passrate_av <- aov(passrate_model)
summary(passrate_av)
This will perform the Tukey test and give pair-wise comparisons including difference in means, 95% confidence intervals, and adjusted p-values:
tukey.test <- TukeyHSD(passrate_av)
tukey.test
And it can even do a nice plot for you too:
plot(tukey.test)
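If a single number for the strength of the association is still wanted, one common choice for a nominal predictor and a continuous response is eta squared (the correlation ratio), which falls out of the same ANOVA. The sketch below uses simulated data; `sim.data` and its values are made up and only borrow the column names from the question.

```r
# Simulated stand-in data (hypothetical values, column names from the question)
set.seed(7)
sim.data <- data.frame(
  POSITION = rep(c("Defender", "Midfielder", "Forward"), each = 30),
  AVG_PASSES_COMPLETED = c(rnorm(30, mean = 40, sd = 8),
                           rnorm(30, mean = 55, sd = 8),
                           rnorm(30, mean = 30, sd = 8))
)

sim_av <- aov(AVG_PASSES_COMPLETED ~ POSITION, data = sim.data)

# Sums of squares: first entry is POSITION, second is the residual
ss <- summary(sim_av)[[1]][["Sum Sq"]]

# Eta squared: share of total variance explained by POSITION (0 to 1)
eta_sq <- ss[1] / sum(ss)
print(eta_sq)

# For a one-way design this equals R-squared of the matching linear model
r_sq <- summary(lm(AVG_PASSES_COMPLETED ~ POSITION, data = sim.data))$r.squared
print(all.equal(eta_sq, r_sq))
```

Eta squared describes how much of the variation in pass rate is associated with position; it has no sign, since a nominal variable has no direction.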

Interpreting a pattern in a residual plot produced by gam.check()

I'm working on creating a model that examines the effect of ocean characteristics on fishing outcomes. I have spatial data on a 0.5 degree grid and I created the following model:
gam(asinh(yvar) ~ s(lat, lon, bs = "sos") + s(xvar1) +
    s(xvar2) + s(xvar3), data = dat, method = "REML")
The QQ plot and histogram of residuals look okay. However, gam.check() produces an odd pattern in the residuals plot. I know that the points should be scattered around 0, but I have a very odd pattern in the residuals. Can anyone provide some insight on the interpretation of this plot:
Those will be either all the 0s (most likely) or the 1s/smallest value in your original data. You don't say what these data are, but as you mention fishing outcomes it is highly likely that they have some natural lower bound, and this line in the residuals is all the observations that take this lower bound (before transformation).
As you don't say exactly what your data are, it is difficult to comment further on how to proceed (this may not be an issue, or you may need to drop the transform and instead use a GLM or other non-Gaussian response), but
Such patterns are common in ecological/biological data, and
Transforming your response invariably doesn't work for ecological data.
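The mechanism can be reproduced in a few lines. This is a sketch on simulated data, using a plain lm() as a stand-in for the GAM in the question; the variables are invented. Every observation sitting at the lower bound has the same transformed response, so its residual is exactly minus its fitted value, which draws a straight line in the residuals-versus-fitted plot.

```r
# Simulated response with a natural lower bound at 0 (hypothetical data)
set.seed(3)
n <- 300
x <- runif(n)
latent <- 2 * x + rnorm(n, sd = 1)
y <- pmax(latent, 0)          # values below the bound are recorded as 0

# Same transform as in the question; lm() stands in for gam()
fit <- lm(asinh(y) ~ x)
res <- residuals(fit)
fv <- fitted(fit)

# asinh(0) is 0, so for bound observations: residual = 0 - fitted
at_bound <- y == 0
print(all.equal(unname(res[at_bound]), -unname(fv[at_bound])))

# The bound observations form a line with slope -1
plot(fv, res, col = ifelse(at_bound, "red", "grey40"),
     xlab = "Fitted values", ylab = "Residuals")
```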

Is it feasible to denoise time irrelevant sensor reading with Kalman Filter and how to code it?

After doing some research, I understand how to implement it with time-relevant functions. However, I'm not sure whether I can apply it to time-irrelevant scenarios.
Given that we have a simple function y=a*x^2, where both y and x are measured at a constant interval (say 1 min/sample) and a is a constant. However, both y and x measurements have white noise.
More specifically, x and y are two independently measured variables. For example, x is the air flow rate in a duct and y is the pressure drop across the duct. Because the air flow varies with the fan speed, the pressure drop across the duct also varies. The relation between the pressure drop y and flow rate x is y=a*x^2, but both measurements contain white noise. Is it possible to use a Kalman Filter to estimate a more accurate y? Both x and y are recorded at a constant time interval.
Here are my questions:
Is it feasible to implement a Kalman Filter to reduce the noise in the y reading? In other words, to get a better estimate of y?
If this is feasible, how do I code it in R or C?
P.S.
I tried applying a Kalman Filter to a single variable and it works well. The result is below. I'll try Ben's suggestion and see whether I can make it work.
I think you can apply some Kalman-filter-like ideas here.
Make your state a, with variance P_a. Your update is just F=[1], and your measurement is just H=[1] with observation y/x^2. In other words, you measure x and y and estimate a by solving for a in your original equation. Update your scalar KF as usual. Approximating R will be important. If x and y both have zero mean Gaussian noise, then y/x^2 certainly doesn't, but you can come up with an approximation.
Now that you have a running estimate of a (which is a random constant, so Q=0 ideally, but maybe Q=[tiny] to avoid numerical issues) you can use it to get a better y.
You have y_meas and y_est=a*x_meas^2. Combine those using your variances as (R_y * a * x^2 + (P_a + R_x2) * y_meas) / (R_y + P_a + R_x2). Over time as P_a goes to zero (you become certain of your estimate of a) you can see you end up combining information from your x and y measurements proportional to your trust in them individually. Early on, when P_a is high you are mostly trusting the direct measurement of y_meas because you don't know the relationship.
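The steps above can be sketched in R as follows. This is one possible reading of the idea, not a definitive implementation: the noise levels, the true value of a, and the simulated signals are all invented, and two choices go beyond the text. First, R for the pseudo-measurement y/x^2 is approximated with a first-order (delta-method) expansion. Second, the variance of the prediction a*x^2 is taken as x^4 * P_a plus a first-order term for the noise in x, which puts P_a on the scale of y before combining.

```r
# Simulated duct data (hypothetical values)
set.seed(1)
n <- 500
a_true <- 2
x_true <- runif(n, 1, 3)              # flow rate, kept away from zero
y_true <- a_true * x_true^2           # pressure drop
sd_x <- 0.05
sd_y <- 0.5
x_meas <- x_true + rnorm(n, sd = sd_x)
y_meas <- y_true + rnorm(n, sd = sd_y)

a_hat <- 0        # state estimate of a
P_a <- 1e3        # state variance: start very uncertain
Q <- 1e-8         # tiny process noise to avoid numerical issues
R_y <- sd_y^2
y_fused <- numeric(n)

for (k in seq_len(n)) {
  # Predict step: F = 1, so only the variance grows by Q
  P_a <- P_a + Q

  # Pseudo-measurement of a with H = 1
  z <- y_meas[k] / x_meas[k]^2
  # Assumed first-order approximation of Var(y / x^2)
  R_z <- R_y / x_meas[k]^4 + (2 * z / x_meas[k])^2 * sd_x^2

  # Scalar Kalman update
  K <- P_a / (P_a + R_z)
  a_hat <- a_hat + K * (z - a_hat)
  P_a <- (1 - K) * P_a

  # Combine the model prediction with the direct measurement of y,
  # weighting each by the other's variance
  y_pred <- a_hat * x_meas[k]^2
  V_pred <- x_meas[k]^4 * P_a + (2 * a_hat * x_meas[k])^2 * sd_x^2
  y_fused[k] <- (R_y * y_pred + V_pred * y_meas[k]) / (R_y + V_pred)
}

mse_raw <- mean((y_meas - y_true)^2)
mse_fused <- mean((y_fused - y_true)^2)
print(c(a_hat = a_hat, mse_raw = mse_raw, mse_fused = mse_fused))
```

On the first sample P_a is large, so the fused value is essentially y_meas; as P_a shrinks, the weight shifts toward whichever of the two sources has the smaller variance. How much the fused estimate improves on the raw reading depends on how noisy x is relative to y.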
