How to check whether Variable A affects Variable B? - R

I was analyzing the Moore dataset in the carData package, and I wanted to see whether partner.status affects conformity or not.
set.seed(200)
library(carData)
library(ggplot2)
Then, using ggplot, I plotted the two variables as a boxplot.
ggplot(data = Moore, aes(x = partner.status, y = conformity)) +
  geom_boxplot()
The plot only shows that partners with high status have a higher median number of conforming responses and those with low status have a lower median.
Question: how do I show that there's evidence that partner.status affects conformity? What statistical methods do I have to use?

A boxplot is a great start; it gives you an idea of what results to expect.
Next, you need to find out whether the data are normally distributed (shapiro.test(Moore$conformity)) and whether the variance is homoscedastic (fligner.test(conformity ~ partner.status, data = Moore), read as "conformity by partner.status"). Look at the resulting p-values and draw a conclusion.
There are tons of other tests, but these two are quite robust for this purpose.
Now, assuming you have normality and homoscedasticity, you can do a t-test.
If you have normality but heteroscedasticity, you can do a oneway.test. If you don't have normality, you can use the Kruskal-Wallis test.
Analysing the output of whichever test applies, you can then reject or fail to reject the hypothesis of equal means.
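A minimal sketch of that workflow on the Moore data from the question (base R functions only; the usual 0.05 threshold is assumed when reading the p-values):
library(carData)
# 1. Normality of the response
shapiro.test(Moore$conformity)
# 2. Homogeneity of variance across the two status groups
fligner.test(conformity ~ partner.status, data = Moore)
# 3. Pick the test that matches what you found
t.test(conformity ~ partner.status, data = Moore, var.equal = TRUE)  # normal + homoscedastic
oneway.test(conformity ~ partner.status, data = Moore)               # normal + heteroscedastic (Welch)
kruskal.test(conformity ~ partner.status, data = Moore)              # not normal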

You have two groups for partner.status (low and high), and conformity is a continuous dependent variable. I recommend an independent samples t-test.
Check that you do not violate any of its assumptions.
Relevant information:
https://statistics.laerd.com/statistical-guides/independent-t-test-statistical-guide.php
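A minimal example of that test on the question's data (note that R's t.test defaults to the Welch variant, so equal variances are not assumed unless you set var.equal = TRUE):
library(carData)
t.test(conformity ~ partner.status, data = Moore)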

Related

Multiple meta-regression with metafor

I have performed a multiple meta-regression with the package metafor, but struggle with the interpretation of the Test of Moderators (i.e., QM). My model includes two variables: (1) sample type (dummy: community vs. forensic) and (2) proportion of females in sample (continuous).
This is the output I get:
The results indicate that proportion_females significantly predicts the effect size while controlling for sample type. However, QM shows a non-significant result (p > 0.05).
How is that possible? It was my understanding that QM tests the hypothesis H0: β_sample = β_females = 0. If β_females is clearly ≠ 0, why does QM not yield a significant result?
This can happen just like in regular regression (where the overall/omnibus F-test can fail to be significant, but an individual coefficient is found to be significant). This is more likely to happen when the model includes multiple non-relevant predictors/moderators, since this will decrease the power of the omnibus test. It can also go the other way around where none of the individual coefficients are found to be significant, but the omnibus test is.
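As a hedged sketch of how the two tests relate in metafor (the data frame and moderator names below are assumptions based on the question, not the asker's actual objects):
library(metafor)
# Hypothetical data: effect sizes yi, sampling variances vi, a sample-type dummy,
# and the proportion of females in each sample
res <- rma(yi, vi, mods = ~ sample_type + proportion_females, data = dat)
summary(res)           # z-tests of the individual coefficients
# QM in this output is the omnibus test of H0: beta_sample_type = beta_proportion_females = 0;
# the same joint test can be requested explicitly (coefficient 1 is the intercept):
anova(res, btt = 2:3)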

ANOVA on ranks vs. Kruskal-Wallis: how different is it?

I'm not sure that this is the perfect place for such a question, but maybe you can help me.
I want to check for differences in a quantitative variable between 3 treatments, i.e. perform an ANOVA.
Unfortunately, the residuals of my model aren't normally distributed.
I usually have two solutions here: transform my data or use a non-parametric equivalent of my test (here, a Kruskal-Wallis rank test).
None of the transformations I tried managed to satisfy normality (log, 1/x, square root, Tukey and Box-Cox power transformations), so I wanted to use a Kruskal-Wallis test and move on.
However, my project manager insisted on having only ANOVAs and talked about ANOVA on ranks as a magic solution.
Working in R, I looked for examples and found the function art from the ARTool package, which performs an ANOVA on ranks.
library(ARTool)
model <- art(variable ~ treatment, data = data)
anova(model)
Basically, it takes your variable and replaces it with its rank (dealing with ties by averaging the ranks), so that:
model2 <- lm(rank(variable, ties.method = "average")~treatment,data)
anova(model2)
gives exactly the same output.
I'm not an expert statistician, and I wonder how valid this solution/transformation is.
It seems quite brutal to me, and not that far from the logic of the Kruskal-Wallis test,
even though the statistic is not computed directly on the ranks.
I find it very confusing to have an 'ANOVA on ranks' test that is different from the Kruskal-Wallis test (also known as one-way ANOVA on ranks), and I don't know how to choose between those two tests.
I don't know if I've been very clear or whether anyone can help me, but anyway,
Thanks for your attention and comments!
PS: here is an example on dummy data:
library(ARTool)
# note that dummy data are random so we shouldn't have the same results
treatment <- as.factor(c(rep("A", 100), rep("B", 100), rep("C", 100)))
variable <- as.numeric(c(sample(0:30, 100, replace = TRUE),
                         sample(10:40, 100, replace = TRUE),
                         sample(5:35, 100, replace = TRUE)))
dummy <- data.frame(treatment, variable)
model <- art(variable ~ treatment, data = dummy)
anova(model)   #f.value = 30.746 and p = 7.312e-13
model2 <- lm(rank(variable, ties.method = "average") ~ treatment, data = dummy)
anova(model2)  #f.value = 30.746 and p = 7.312e-13
kruskal.test(variable ~ treatment, data = dummy)

Removing outliers that are skewing data

I am looking at the relationship between agricultural intensity and functional diversity of birds.
In my GLM model I have included a number of other variables including forest, semi-natural habitat, temperature, pesticides etc.
When checking whether my variables are normally distributed, I used a QQ plot, and there appear to be these 3 outliers.
I wondered how I could remove these outliers to make my data more normally distributed.
I tried to use the outliers package but all the examples I found failed to work, or I failed to understand how they worked!
Any help would be appreciated. This is my QQ plot for my functional dispersion model and a scatter of functional dispersion x agricultural intensity.
[QQ plot]
[Scatter: functional dispersion x agricultural intensity]
You could remove the observations that appear out of place. Given the number of observations, this is unlikely to change the estimates, but please make sure this is indeed the case. Also, when reporting your work, make sure you justify why you removed those points based on your domain knowledge of the variable.
You can remove the observations using
model.data.scaled <- model.data.scaled[model.data.scaled$agri > -5, ]
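As a hedged sketch of the "make sure this is indeed the case" step, you can refit the model with and without the flagged points and compare the coefficients (the formula below is a placeholder; substitute your actual response and predictors):
fit_all  <- glm(func_dispersion ~ agri + forest + semi_natural + temperature,
                data = model.data.scaled)
fit_trim <- glm(func_dispersion ~ agri + forest + semi_natural + temperature,
                data = model.data.scaled[model.data.scaled$agri > -5, ])
# The coefficients should be essentially unchanged if the points were not influential
cbind(all = coef(fit_all), trimmed = coef(fit_trim))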

Testing a General Linear Hypothesis in R

I'm working my way through a Linear Regression Textbook and am trying to replicate the results from a section on the Test of the General Linear Hypothesis, but I need a little bit of help on how to do so in R.
I've already taken a look at a number of other posts, but am hoping someone can give me some example code. I have data on twenty-six subjects which has the following form:
Group, Weight (lb), HDL Cholesterol (mg/dL)
1,163.5,75
1,180,72.5
1,178.5,62
2,106,57.5
2,134,49
2,216.5,74
3,163.5,76
3,154,55.5
3,139,68
Given these data, I am trying to test whether the regression lines fit to the three groups of subjects have a common slope. The models postulated are:
Group 1: y = β0 + β1⋅x + ϵ
Group 2: y = γ0 + γ1⋅x + ϵ
Group 3: y = δ0 + δ1⋅x + ϵ
So the hypothesis of interest is H0: β1 = γ1 = δ1.
I have been trying to do this using the linearHypothesis function in the car library, but have been having trouble knowing what the model object should be, and am not confident that this is the correct approach (or package) to be using.
Any help would be much appreciated – Thanks!
Tim, your question doesn't seem so much to be about R code. Instead, it appears that you have questions about how to test the interaction of your Group and Weight (lb) variables on the outcome HDL Cholesterol (mg/dL). You don't state this specifically, but I'm taking a guess that these are your predictors and outcome, respectively.
So essentially, you're trying to see if the predictor Weight (lb) has differential effects depending on the level of the variable Group. This can be done in a number of ways using the linear model. A simple regression approach would be lm(hdl ~ 1 + group + weight + group*weight). And then the coefficient for the interaction term group*weight would tell you whether or not there is a significant interaction (i.e., moderation) effect.
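A sketch of that approach using the rows listed in the question (only part of the data is shown there, so the numbers are illustrative; the common-slope hypothesis is exactly the test of the interaction terms):
library(car)
# Group must be a factor, otherwise it is treated as numeric
dat <- data.frame(
  group  = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3)),
  weight = c(163.5, 180, 178.5, 106, 134, 216.5, 163.5, 154, 139),
  hdl    = c(75, 72.5, 62, 57.5, 49, 74, 76, 55.5, 68)
)
fit_full   <- lm(hdl ~ group * weight, data = dat)  # separate intercept and slope per group
fit_common <- lm(hdl ~ group + weight, data = dat)  # separate intercepts, common slope
# F-test of H0: beta1 = gamma1 = delta1
anova(fit_common, fit_full)
# The same test via linearHypothesis on the interaction coefficients
linearHypothesis(fit_full, c("group2:weight = 0", "group3:weight = 0"))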
However, I think we would have a major concern. In particular, we ought to worry that the hypothesized effect is that the group and weight variables do not interact in predicting hdl. That is, you're essentially predicting the null. Furthermore, you're predicting the null despite having a small sample size. Therefore, it would be rather unlikely that there is sufficient statistical power to detect an effect, even if there were one to be observed.

randomForest using R for regression: does it make sense?

I want to examine which variable has the biggest impact on the outcome in my data, which is the stock yield. My data looks like the below.
And my code is also attached.
library(randomForest)
library(data.table)

data <- fread("C:/stockcrazy.csv")
PEratio <- data$offeringPE / data$industryPE   # offering P/E relative to industry P/E
data_update <- data.frame(data, PEratio)
train <- data_update[1:47, ]                   # first 47 rows for training
test  <- data_update[48:57, ]                  # last 10 rows held out for testing
For the above train and test subsets, I am not sure whether I need to do cross-validation on this data, and I don't know how to do it.
data.model <- randomForest(yield ~ offerings + offerprice + PEratio + count + bingo + purchase,
                           data = train, importance = TRUE)
par(mfrow = c(1, 1))
varImpPlot(data.model, n.var = 6, main = "Random Forests: Top 6 Important Variables")
importance(data.model)
plot(data.model)
model.pred <- predict(data.model, newdata=test)
model.pred
d <- data.frame(test,model.pred)
I am not sure if the result of %IncMSE is good or bad. Can someone interpret this?
Additionally, I found that the predicted values for the test data are not a good prediction of the real data. So how can I improve this?
Let's see. Let's start with %IncMSE.
I found this really good answer on Cross Validated about %IncMSE, which I quote:
If a predictor is important in your current model, then assigning other values for that predictor randomly but 'realistically' (i.e.: permuting this predictor's values over your dataset) should have a negative influence on prediction, i.e.: using the same model to predict from data that is the same except for the one variable should give worse predictions.
So, you take a predictive measure (MSE) with the original dataset and then with the 'permuted' dataset, and you compare them somehow. One way, particularly since we expect the original MSE to always be smaller, the difference can be taken. Finally, for making the values comparable over variables, these are scaled.
This means that in your case the most important variable is purchase, i.e. when the variable purchase was permuted (the order of its values randomly changed), the resulting model was about 12% worse in terms of mean squared error than with the variable in its original order. In other words, the MSE was 12% higher with a permuted purchase variable, which is why this variable comes out as the most important. Variable importance is just a measure of how important your predictor variables were in the model you used: in your case purchase was the most important and the P/E ratio the least (among those 6 variables). This is not something you can interpret as good or bad, because it doesn't show how well the model fits unseen data. I hope this is clear now.
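For reference, the permutation importance described above can be pulled straight from the fitted object in the question (importance = TRUE must have been set in the randomForest call, as it was):
importance(data.model, type = 1)   # %IncMSE (permutation importance) for each predictor
importance(data.model, type = 2)   # IncNodePurity, the impurity-based alternative
varImpPlot(data.model, type = 1)   # plot only the %IncMSE column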
For the cross-validation:
You do not need to do cross-validation during the training phase because something equivalent happens automatically: approximately 2/3 of the records are used for building each tree, and the 1/3 that is left out (the out-of-bag data) is used to assess the tree afterwards (the R-squared is computed using the OOB data).
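If you want to inspect those out-of-bag quantities yourself, they are stored on the fitted object (again using data.model from the question):
predict(data.model)   # out-of-bag predictions (no newdata supplied)
data.model$mse        # OOB mean squared error after 1, 2, ..., ntree trees
data.model$rsq        # OOB pseudo R-squared after 1, 2, ..., ntree trees
plot(data.model)      # OOB error as a function of the number of trees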
As for the improvement of the model:
By looking at just the first 10 lines of the predicted and actual values of yield, you cannot make a safe decision about whether the model is good or bad. What you need is a measure of fit. The most common one is the R-squared. It is simplistic, but for comparing models and getting a first opinion about your model it does its job. It is calculated by the model for every tree that you grow and can be accessed via data.model$rsq. It ranges from 0 to 1, with 1 being the perfect model and 0 showing a really poor fit (it can sometimes even take negative values, which indicates a very bad fit). If your rsq is bad, you can try the following to improve your model, although it is not certain that you will get the results you wish for (a rough sketch of these steps follows the list):
Calibrate your trees in a different way: change the number of trees grown, and prune the trees by specifying a larger nodesize. (Here you use the default 500 trees and a nodesize of 5, which might overfit your model.)
Increase the number of variables, if possible.
Choose a different model. There are cases where a random forest would not work well.
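A rough sketch of those suggestions using the objects from the question (the ntree and nodesize values are only examples to experiment with, not recommendations):
tail(data.model$rsq, 1)   # final OOB R-squared of the current model
# Refit with more trees and larger terminal nodes (i.e. smaller, less overfit trees)
data.model2 <- randomForest(yield ~ offerings + offerprice + PEratio + count + bingo + purchase,
                            data = train, importance = TRUE,
                            ntree = 2000, nodesize = 10)
# Evaluate on the held-out test rows rather than eyeballing the predictions
pred <- predict(data.model2, newdata = test)
mse  <- mean((test$yield - pred)^2)
1 - mse / var(test$yield)   # a rough test-set R-squared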
