Correlation coefficient between nominal and cardinal scale variables - r

I have to describe the correlation between the variable "Average passes completed per game" (cardinal scale) and the variable "Position" (nominal scale) and measure its strength. For that I have to choose a correlation coefficient that suits the two scales. Does anyone know the best way to do that? I am not sure what to use, since the two variables are on different scales. The full dataset consists of the following variables:
PLAYER: Name of the player
COUNTRY: Country of origin
BIRTHDATE: Birthday Date
HEIGHT_IN_CM: Height of the player
POSITION: Position of the player
PASSES_COMPLETED: Passes completed by the player
DISTANCE_COVERED: Distance covered by the player in km
MINUTES_PLAYED: Minutes played
AVG_PASSES_COMPLETED: Average passes completed by the player
I would very much appreciate it if someone could give me some advice on this.
Thank you!

OK, so you need to redefine your question somewhat. Without two continuous variables, a correlation cannot be used to "describe" the relationship in the way I guess you are asking. You can, however, test whether there are statistically significant differences in pass rates between positions. As for the questions on the statistics, I agree with Maurtis... CV is the best place. As for the code to do the tests, try this:
First, make sure you have the right packages installed. You will definitely need ggplot2 and ggfortify, plus dplyr for glimpse(), and maybe others if you have to manipulate the data. Then load the libraries:
library(ggplot2)
library(ggfortify)
library(dplyr)  # provides glimpse(), used below to check the import
Next, make sure that your data is tidy, i.e. one variable per column.
Then import your data into R:
#find file
data.location = file.choose()
#Import data
curr.data <- read.csv(data.location)
#Check data import
glimpse(curr.data)
Then plot using ggplot:
ggplot(curr.data, aes(x = POSITION, y = AVG_PASSES_COMPLETED)) +
  geom_boxplot() +
  theme_bw()
Then fit a model with the linear model function (lm()) to see whether pass rates differ significantly by position:
passrate_model <- lm(AVG_PASSES_COMPLETED ~ POSITION, data = curr.data)
Before you test your hypothesis, you need to check that the model is appropriate by looking at the diagnostic plots:
autoplot(passrate_model, smooth.colour = NA)
If the residual plots look fine, then we are ready to test. If not, you will have to use another type of model (and I'm not going into that here).
The appropriate test for this (I think) would be a Tukey test, which requires an ANOVA. This will give a summary, and should show you if there is variance due to position:
passrate_av <- aov(passrate_model)
summary(passrate_av)
This will perform the Tukey test and give pair-wise comparisons including difference in means, 95% confidence intervals, and adjusted p-values:
tukey.test <- TukeyHSD(passrate_av)
tukey.test
And it can even do a nice plot for you too:
plot(tukey.test)
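On the original question of measuring the strength of the relationship: one measure often used for a nominal predictor and a metric response is eta squared, the share of the total variance in pass rate that lies between positions (its square root is sometimes called the correlation ratio). This is not part of the steps above, just a minimal sketch of how it could be read off the same kind of ANOVA. The data below are simulated and the position labels and means are invented; only the column names follow the question.

```r
set.seed(1)

# Simulated stand-in for the real data (all values invented for illustration)
curr.data <- data.frame(
  POSITION = factor(rep(c("Defender", "Midfielder", "Forward"), each = 30)),
  AVG_PASSES_COMPLETED = c(rnorm(30, 40, 8), rnorm(30, 55, 8), rnorm(30, 25, 8))
)

passrate_av <- aov(AVG_PASSES_COMPLETED ~ POSITION, data = curr.data)

# Sums of squares from the ANOVA table: first row is POSITION, second is Residuals
ss <- summary(passrate_av)[[1]][["Sum Sq"]]

eta_sq <- ss[1] / sum(ss)  # share of variance explained by position
eta    <- sqrt(eta_sq)     # correlation ratio, between 0 and 1

print(c(eta_squared = eta_sq, eta = eta))
```

For a one-way design like this, eta squared equals the R-squared reported by summary() for the corresponding lm() fit, so either can be quoted.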

Related

interpretation of a GAM plot with a square rooted response variable

I have a simple GAM where my goal is to understand the variation of the distance to a feature along the year, and I originally ran it with the following formula:
m1 <- gam(dist ~ s(month, bs="cc",k = 12) + s(id, bs="re"), data=db1, method = "REML")
Where "dist" is the distance in meters to a feature, and "id" is the animal id. When plotting the GAM I obtain the following plot:
First question: if I were interpreting the plot or writing a figure caption, would it be correct to say something like:
"GAM plot showing the partial effects of month (x-axis) on the distance to a feature (y-axis). GAM smooths are centered at zero, therefore the zero line reflects the overall mean of the distance to the feature. Thus, values below zero on the y-axis reflect higher proximity to the feature, while values above zero reflect longer distances to the feature."
I say that a negative value (below zero) would mean proximity to the feature as that's also the way to interpret distance coefficients in a GLM, but I would also like to make sure that this is correct and that I'm not misinterpreting the plot.
Second question, are the values on the y-axis directly interpretable? If so, what is the scale? Is it a % of change? (Based on a comment here, but I'm not sure if I understood it properly)
Then I transform the response variable to achieve normality (the original scale was a bit left skewed), and I run this model (residuals look better with the transformation):
m2 <- gam(sqrt(dist) ~ s(month, bs="cc",k = 12) + s(id, bs="re"), data=db1, method = "REML")
And I obtain this plot:
Pretty similar to the previous one, and I believe I can interpret it in the same way as described above. But, third question: if I wanted to say exactly what the y-axis means, what would be the most correct way to describe it with the transformation?
Any help with this is much appreciated! Many thanks in advance!
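One way to make the y-axis of the second model concrete is to predict from it and undo the transformation. The sketch below assumes simulated data (the values are invented; only the names dist, month, id and db1 come from the question). It relies on two properties of mgcv: smooths are centred, so the plotted partial effect is a deviation from the model intercept, and for m2 everything is on the sqrt(dist) scale, so the y-axis is in square-root metres. Squaring a prediction returns to metres, although the squared value is not exactly the mean distance, because the mean of a square is not the square of a mean.

```r
library(mgcv)

set.seed(42)

# Simulated stand-in for db1: 20 animals observed in each of 12 months (invented values)
db1 <- data.frame(
  month = rep(1:12, times = 20),
  id    = factor(rep(1:20, each = 12))
)
animal_offset <- rnorm(20, sd = 2)
db1$dist <- (20 + 5 * sin(2 * pi * db1$month / 12) +
             animal_offset[as.integer(db1$id)] + rnorm(nrow(db1)))^2

m2 <- gam(sqrt(dist) ~ s(month, bs = "cc", k = 12) + s(id, bs = "re"),
          data = db1, method = "REML")

# Partial effect of month on the sqrt(dist) scale; centred, so it averages to ~0 over the data
month_effect <- predict(m2, type = "terms")[, "s(month)"]

# Predictions for a "typical" animal: drop the random effect, keep intercept + month smooth
newd     <- data.frame(month = 1:12, id = db1$id[1])
sqrt_hat <- predict(m2, newdata = newd, exclude = "s(id)")
dist_hat <- as.numeric(sqrt_hat)^2  # back on the metre scale (approximate, see note above)

print(round(dist_hat, 1))
```

A caption for the transformed model could then describe the y-axis as "partial effect of month on the square root of distance", with a back-transformed curve like dist_hat used when values in metres are needed.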

Latent class growth modelling in R/flexmix with multinomial outcome variable

How to run Latent Class Growth Modelling (LCGM) with a multinomial response variable in R (using the flexmix package)?
And how to stratify each class by a binary/categorical dependent variable?
The idea is to let gender shape the growth curve by cluster (cf. Mikolai and Lyons-Amos (2017, p. 194/3) where the stratification is done by education. They used Mplus)
I think I might have come close with the following syntax:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + gender:age)
lcgm <- flexmix::stepFlexmix(. ~ . | id,
  data = d,
  k = nr_of_classes, # would be 1:12 in real analysis
  nrep = 1, # would be 50 in real analysis to avoid local maxima
  control = list(iter.max = 500, minprior = 0),
  model = flexmix::FLXMRmultinom(lcgm_formula, varFix = T, fixed = ~0))
This is close to what Wardenaar (2020, p. 10) suggests in his methodological paper for a continuous outcome:
stepFlexmix(.~ .|ID, k = 1:4,nrep = 50, model = FLXMRglmfix(y~ time, varFix=TRUE), data = mydata, control = list(iter.max = 500, minprior = 0))
The only difference is that FLXMRmultinom probably does not support the varFix and fixed parameters, although adding them does produce different results. The binomial equivalent of FLXMRmultinom in flexmix might be FLXMRglm (with family="binomial") as opposed to FLXMRglmfix, so I suspect that the restrictions of the LCGM (e.g. fixed slope and intercept per class) are not specified the way they should be.
The results are otherwise sensible, but the model fails to put men and women with similar trajectories in the same classes (below are the fitted probabilities for each relationship status in each class by gender):
We should have the following matches by cluster and gender...
1<->1
2<->2
3<->3
...but instead we have
1<->3
2<->1
3<->2
That is, if for example men in class one and women in class three were forced into the same group, the resulting group would be more homogeneous than the current first row of the plot grid.
Here is the full MVE to reproduce the code.
I got similar results with another dataset with a different number of classes and up to 50 iterations per class. I have tried two alternative ways to predict the probabilities, with identical results. I conclude that the problem most likely lies in the model specification (stepFlexmix(..., model = FLXMRmultinom(...))) or that this is some sort of label switching issue.
If the model is specified correctly and the issue is that similar trajectories for men and women end up in different classes, is there a way to fix that, for example by restricting the parameters?
Any assistance will be highly appreciated.
This seems to be an identifiability issue, apparently common in mixture modelling. In other words, the labels are switched: while there might not be a problem with the modelling as such, men and women end up in different groups, and that has to be dealt with one way or another.
In the new linked code, I have swapped the order manually and calculated the predictions by hand.
I will be happy to hear if someone has an alternative approach to deal with the label switching issue (like restricting parameters or switching labels algorithmically). I am also curious whether the model could or should be specified in some other way.
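As one possible way of switching labels algorithmically, the classes of one gender can be matched to those of the other by trying every relabelling and keeping the one with the smallest squared distance between fitted trajectories. The sketch below is not flexmix functionality: perms() and match_classes() are hypothetical helpers, and the trajectories are invented numbers arranged to mimic the 1<->3, 2<->1, 3<->2 pattern described above. With a multinomial outcome, the fitted probabilities of all categories could be placed side by side in each row.

```r
# All orderings of the elements of v (fine for a handful of classes)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# men, women: matrices with one row per class and one column per point on the age grid.
# Returns p such that women's class p[j] is the best match for men's class j.
match_classes <- function(men, women) {
  cost <- function(p) sum((men - women[p, , drop = FALSE])^2)
  candidates <- perms(seq_len(nrow(men)))
  candidates[[which.min(vapply(candidates, cost, numeric(1)))]]
}

# Invented fitted probabilities for three classes over five ages
men <- rbind(seq(0.1, 0.9, length.out = 5),
             rep(0.5, 5),
             seq(0.9, 0.1, length.out = 5))
# Same shapes for women, but stored under switched labels
women <- men[c(2, 3, 1), ] + 0.02

relabel <- match_classes(men, women)
print(relabel)  # women's class matched to men's classes 1, 2, 3 in turn
```

The exhaustive search grows factorially with the number of classes, so for the 12-class case mentioned in the comments an assignment algorithm would be a better fit than brute force.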
A few remarks:
I believe that this is indeed performing a LCGM as we do not specify random effects for the slopes or intercepts. Therefore I assume that intercepts and slopes are fixed within classes for both sexes. That would mean that the model performs LCGM as intended. By the same token, it seems that running GMM with random intercept, slope or both is not possible.
Since we are calculating the predictions by hand, we need to be able to separate parameters between the sexes. Therefore I also added an interaction term gender x age^2. The calculations seem to slow down somewhat, but the estimates are similar to the original. It also makes conceptual sense to include the interaction for age^2 if we already have it for age.
varFix=T,fixed = ~0 seem to be redundant: specifying them does not change anything. The subsampling procedure (of my real data) was unaffected by the set.seed() command for some reason.
The new model specification becomes:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + age:gender + I(age^2):gender)
lcgm <- flexmix::flexmix(. ~ . | id,
  data = d,
  k = nr_of_classes, # would be 1:12 in real analysis
  # nrep = 1, # would be 50 in real analysis to avoid local maxima (and we would use the stepFlexmix function instead)
  control = list(iter.max = 500, minprior = 0),
  model = flexmix::FLXMRmultinom(lcgm_formula))
And the plots:

Interpreting results from emmeans comparison

I have a glm model with two fixed effects, Treatment and Date, to estimate Temperature from data collected in a time series. Within Treatment there are three different categories: Fucus, Terrycloth or Control, and temperature is measured beneath those canopies. The model is created like so:
mod1 <- glm(Temp ~ Treatment * Date, data = aveTerry.df)
I am trying to tell if Terrycloth has a similar effect as Fucus canopy (i.e. replicates it).
I found the emmeans package and believe it could help me compare these levels within Treatment using my model. I used it like so to find the estimated marginal means:
terry.emmeans <- emmeans(modAllTerry, poly ~ Treatment | Date)
and plotted the comparisons via:
plot(terry.emmeans.average, comparison = TRUE) + theme_bw()
Giving me this output linked here.
I am looking for some help understanding what this graphical output is, especially what exactly the comparisons are (they are shown by the red arrows). I somewhat understand that the blue boxes are the confidence intervals for the mean temperature for each treatment on one day (based on the model), but how is the comparison made? And why do some days only have a one-sided arrow?
As described in the documentation for plot.emmGrid, the comparison arrows are created in such a way that two arrows are disjoint if and only if their respective means are significantly different at the stated level.
The lowest mean in the set has only a right-pointing arrow because that mean will not be compared with anything smaller, obviating the need for a left-pointing arrow. For similar reasons, the highest mean has only a left-pointing arrow. These arrows do not define intervals; their only purpose is depicting comparisons.
In situations where the SEs of pairwise comparisons vary widely, it may not be possible to construct comparison arrows. If that happens, an error message is displayed.
Confidence intervals are available as well, but those CIs should not be used for comparing means.
More information and examples may be found via vignette("comparisons", "emmeans"). Also, details of how the arrows are actually constructed are given in vignette("xplanations", "emmeans")
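To see the tests that the arrows summarise, the pairwise comparisons can also be printed directly. This is a minimal sketch, assuming simulated data (the temperatures, dates and effect sizes are invented; only the variable names come from the question) and using pairs() on the emmeans grid:

```r
library(emmeans)

set.seed(1)

# Simulated stand-in for aveTerry.df: 3 treatments x 2 dates x 5 replicates (invented)
aveTerry.df <- expand.grid(
  Treatment = c("Control", "Fucus", "Terrycloth"),
  Date      = c("day1", "day2"),
  replicate = 1:5
)
# Invented effect sizes: both canopies cool the surface by a similar amount
cooling <- c(Control = 0, Fucus = -3, Terrycloth = -2.5)
aveTerry.df$Temp <- 20 + cooling[as.character(aveTerry.df$Treatment)] +
  rnorm(nrow(aveTerry.df))

mod1 <- glm(Temp ~ Treatment * Date, data = aveTerry.df)

# Estimated marginal means of each treatment, separately for each date
terry.emmeans <- emmeans(mod1, ~ Treatment | Date)

# Pairwise differences between treatments within each date, with adjusted p-values
comparisons <- pairs(terry.emmeans)
print(comparisons)

# The same comparisons drawn as arrows on top of the confidence intervals
plot(terry.emmeans, comparisons = TRUE)
```

In this output the Fucus - Terrycloth row for each date is the one that speaks to whether terrycloth replicates the Fucus canopy; overlapping arrows for those two treatments correspond to a non-significant difference in that row.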

Interpreting a pattern in a residual plot produced by gam.check()

I'm working on creating a model that examines the effect of ocean characteristics on fishing outcomes. I have spatial data on a 0.5 degree grid and I created the following model:
gam(asinh(yvar) ~ s(lat, lon, bs = "sos") + s(xvar1) +  # asinh() is the inverse hyperbolic sine
    s(xvar2) + s(xvar3), data = dat, method = "REML")
The QQ plot and histogram of residuals look okay. However, gam.check() produces an odd pattern in the residuals plot. I know that the points should be scattered around 0, but I have a very odd pattern in the residuals. Can anyone provide some insight on the interpretation of this plot:
Those will be either all the 0s (most likely) or 1s/smallest value in your original data. You don’t say what these data are, but as you mention fishing outcomes it is highly likely that they have some natural lower bound, and this line in the residuals is made up of all the observations that take this lower bound (before transformation).
As you don’t say exactly what your data are, it is difficult to comment further on how to proceed (this may not be an issue, or you may need to drop the transform and instead use a GLM or other non-Gaussian response), but:
Such patterns are common in ecological/biological data, and
Transforming your response invariably doesn’t work for ecological data.
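The origin of such a line can be reproduced in a few lines. The sketch below uses simulated catch data with many zeros (all values invented) and, for brevity, a plain lm() in place of the GAM from the question; the mechanism is the same. Because asinh(0) is 0, every zero observation has a residual equal to minus its fitted value, so those points fall on a straight line of slope -1 in the residuals-versus-fitted plot.

```r
set.seed(7)

n <- 500
dat <- data.frame(xvar1 = runif(n))
# Zero-heavy counts: many grid cells have no catch at all (invented process)
dat$yvar <- rpois(n, lambda = exp(-1 + 2 * dat$xvar1))

# lm() stands in for gam() here; the transformed response is what matters
fit <- lm(asinh(yvar) ~ xvar1, data = dat)
res <- residuals(fit)
fv  <- fitted(fit)
is_zero <- dat$yvar == 0

# Zero observations in red: they line up along residual = -fitted
plot(fv, res, col = ifelse(is_zero, "red", "grey40"),
     xlab = "Fitted values", ylab = "Residuals")
abline(a = 0, b = -1, lty = 2)
```

If the line in a real residual plot matches the observations at the lower bound in this way, a count or other non-Gaussian family fitted to the untransformed response is the kind of alternative the answer has in mind.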

geom_smooth: what is its meaning (why is it lower than the mean?)

I have data on the number of trips people make to work per week, along with the distance of each trip, and I am interested in the relationship between the two variables. (Frequency is expected to fall as distance increases, essentially a negative relationship.) cor.test supports this hypothesis: -0.08993444 with a p-value of 2.2e-16.
When I come to plot this, the distance clearly tends to decrease for more frequent trips. To make sense of the vast number of points I used geom_smooth. But I don't fully understand the result. According to the help pages, it's a "conditional mean". However, it seems never to approach the true mean,
> mean(aggs3$Distance)
[1] 9.766497
in the plot below, which seems never to go above 8.
What's going on here? I think I really want the rolling mean, but found rollmean from the zoo package a hassle to implement (you need to sort the data first), and I would like to ask for the optimal solution before forging ahead. Many thanks.
p <- ggplot(data=aggs3, aes(x=N.trips.week, y=Distance))
p + geom_point(alpha = 0.1) + geom_smooth() +
ylim(0,30) + xlim(0,25) + ylab("Distance (miles)") +
stat_density2d(aes(fill = ..level..), geom="polygon", alpha=0.5,na.rm=T, se=0.1)
(Secondary unrelated question: how do I make the 2d density layer contours smoother?)
(P.S. I know there are better ways to visualise this, e.g. below, but for the sake of learning I need a better understanding of how to use geom_smooth.)
The curve geom_smooth produces is indeed an estimate of the conditional mean function, i.e. an estimate of the mean distance in miles conditional on the number of trips per week. (By default ggplot2 fits it with LOESS for fewer than 1,000 observations and with a GAM otherwise.) The number you calculate, in contrast, is an estimate of the unconditional mean, i.e. the mean over all the data.
If it's the relationship between the two variables you're interested in there are plenty of ways you could model that. If you just want a linear relationship, fitting a linear model (lm()) will do the trick and if that's what you want to plot, passing method='lm' as an argument to geom_smooth will show you what that looks like. But your data really doesn't look like there's just a simple linear relationship between the two variables so you may want to think a bit harder about what it is exactly you want to do!
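The difference between the two kinds of mean can be checked numerically. This is a small sketch with simulated data (values invented; only the column names follow the question): the per-frequency means, weighted by group size, average back to the overall mean. It also shows a second effect that may be at work in the plot from the question: in ggplot2, ylim(0, 30) drops observations outside the limits before the smoother is fitted, which pulls the curve down for right-skewed distances, whereas coord_cartesian(ylim = c(0, 30)) only zooms.

```r
set.seed(123)

n <- 5000
aggs3 <- data.frame(N.trips.week = sample(1:10, n, replace = TRUE))
# Right-skewed distances whose mean falls as trip frequency rises (invented process)
aggs3$Distance <- rexp(n, rate = 1 / (15 - aggs3$N.trips.week))

# Unconditional mean: one number for the whole data set
overall_mean <- mean(aggs3$Distance)

# Conditional means: one number per value of N.trips.week
cond_means  <- as.numeric(tapply(aggs3$Distance, aggs3$N.trips.week, mean))
group_sizes <- as.numeric(table(aggs3$N.trips.week))

# Weighted by group size, the conditional means recover the overall mean
recovered_mean <- weighted.mean(cond_means, group_sizes)

# Discarding distances above 30 (the effect of ylim(0, 30)) lowers the mean
truncated_mean <- mean(aggs3$Distance[aggs3$Distance <= 30])

print(c(overall = overall_mean, recovered = recovered_mean, truncated = truncated_mean))
```

So a smooth that sits below the overall mean everywhere is a hint that some large distances were removed before fitting; replacing ylim() with coord_cartesian() in the plotting code is one way to check.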
