Removing outliers that are skewing data - r

I am looking at the relationship between agricultural intensity and functional diversity of birds.
In my GLM model I have included a number of other variables including forest, semi-natural habitat, temperature, pesticides etc.
When checking whether my variables are normally distributed, I used a QQ plot, and there appear to be 3 outliers.
I wondered how I would remove these outliers to make my data more normally distributed?
I tried to use the outliers package but all the examples I found failed to work, or I failed to understand how they worked!
Any help would be appreciated. This is my QQ plot for my functional dispersion model and a scatter of functional dispersion x agricultural intensity.
QQ plot
functional dispersion x agriculture scatter

You could remove the observations that appear out of place. Given the number of observations, this is unlikely to change the estimates much, but please check that this is indeed the case by comparing the fit with and without them. Also, when reporting your work, justify why you removed those points based on your domain knowledge about the variable.
You can remove the observations using
model.data.scaled <- model.data.scaled[model.data.scaled$agri > -5, ]
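To check that dropping the points really leaves the estimates unchanged, one option is to fit the model twice and compare coefficients. The sketch below is only an illustration on simulated data: the columns fdis and forest, the planted outlier values, and the 4/n Cook's distance cutoff (a common rule of thumb) are all assumptions, not taken from the question.

```r
# Simulated stand-in for the asker's data; all names and values here are hypothetical.
set.seed(1)
n <- 300
model.data.scaled <- data.frame(agri = rnorm(n), forest = rnorm(n))
model.data.scaled$fdis <- 0.5 - 0.2 * model.data.scaled$agri +
  0.1 * model.data.scaled$forest + rnorm(n, sd = 0.3)

# Plant three extreme agri values, mimicking the three outliers in the question.
model.data.scaled$agri[1:3] <- c(-6.2, -5.8, -5.5)

# Fit the same model with and without the flagged observations.
fit_all  <- glm(fdis ~ agri + forest, data = model.data.scaled)
trimmed  <- model.data.scaled[model.data.scaled$agri > -5, ]
fit_trim <- glm(fdis ~ agri + forest, data = trimmed)

# Side-by-side coefficients: large shifts mean the outliers are influential.
comparison <- cbind(all = coef(fit_all), trimmed = coef(fit_trim))
print(round(comparison, 3))

# Cook's distance flags influential rows without a hand-picked cutoff on agri.
influential <- which(cooks.distance(fit_all) > 4 / n)
print(influential)
```

If the two coefficient columns are close, removing the points is harmless; if they differ noticeably, the removal needs a substantive justification.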

Related

Probit Regression in R

My question might seem very basic, but since I am new to R, I am struggling with it. I would be grateful for some insight.
I need to conduct a probit regression of people's preference (desire warmer or desire cooler) on temperature. The preference is therefore my binary dependent variable. I want a plot with the percentage of respondents desiring warmer/cooler on the y axis and temperature on the x axis. The plot must therefore have two probit curves: one representing desire warmer and the other indicating desire cooler. I am interested in the intersection of the two curves.
I would be grateful if someone could help me with the R code for this.
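One possible way to approach this is a binomial GLM with a probit link. The following is a minimal sketch on simulated data, assuming every respondent chose exactly one of the two options, so the "cooler" curve is the complement of the "warmer" curve. The variable names temp and want_warmer and the simulated neutral temperature of 24 are hypothetical.

```r
# Simulated survey; the column names and the "true" neutral point of 24 are hypothetical.
set.seed(42)
n <- 400
temp <- runif(n, 16, 32)
# 1 = respondent wants it warmer, 0 = respondent wants it cooler
want_warmer <- rbinom(n, 1, pnorm((24 - temp) / 3))
survey <- data.frame(temp, want_warmer)

# Probit regression: a binomial GLM with a probit link.
fit <- glm(want_warmer ~ temp, family = binomial(link = "probit"), data = survey)

# Predicted share of respondents wanting warmer; the cooler curve is its complement.
grid <- data.frame(temp = seq(16, 32, length.out = 200))
grid$p_warmer <- predict(fit, newdata = grid, type = "response")
grid$p_cooler <- 1 - grid$p_warmer

# The curves cross where both probabilities are 0.5, i.e. where the linear predictor is 0.
b <- coef(fit)
crossing <- unname(-b["(Intercept)"] / b["temp"])

if (!interactive()) pdf(NULL)  # skip writing Rplots.pdf when run as a script
plot(grid$temp, 100 * grid$p_warmer, type = "l", col = "red", ylim = c(0, 100),
     xlab = "Temperature", ylab = "% of respondents")
lines(grid$temp, 100 * grid$p_cooler, col = "blue")
abline(v = crossing, lty = 2)
legend("right", legend = c("Desire warmer", "Desire cooler"),
       col = c("red", "blue"), lty = 1)
```

If the survey also allowed a "no change" answer, the two curves would no longer be complements and each preference would need its own model.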

How to check Variable A affects Variable B?

I was analyzing the Moore dataset in the carData package, and I wanted to see whether partner.status affects conformity or not.
set.seed(200)
library(carData)
library(ggplot2)
And then using ggplot, I plotted the two variables using boxplot.
ggplot(data = Moore, aes(x = partner.status, y = conformity)) +
  geom_boxplot()
The plot only shows that the high-status group has a higher median conformity than the low-status group.
Question: how can I show that there is evidence that partner.status affects conformity? Which statistical methods should I use?
Boxplot is a great start, it gives you an idea of what results you should expect.
Now you need to find out whether the data are normally distributed (shapiro.test(Moore$conformity)) and whether the variance is homoscedastic (fligner.test(Moore$conformity ~ Moore$partner.status), read as conformity by partner.status). Interpret the p-values of both tests and draw a conclusion.
There are many other tests, but these two are quite robust for this purpose.
Assuming you have normality and homoscedasticity, you can do a t-test.
If you have normality and heteroscedasticity, you can use oneway.test. If you don't have normality, you can use the Kruskal-Wallis test.
By analysing the output of whichever test applies, you can reject or fail to reject the hypothesis that the groups are equal.
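The decision procedure described above could be sketched roughly as follows. The 0.05 cutoff is a conventional assumption rather than something stated in the answer, and the simulated fallback data are hypothetical; they are only used if carData is not installed.

```r
# Use carData::Moore when available; otherwise fall back to a simulated stand-in
# with the same two columns (the simulated values are purely illustrative).
if (requireNamespace("carData", quietly = TRUE)) {
  Moore <- carData::Moore
} else {
  set.seed(200)
  Moore <- data.frame(
    partner.status = factor(rep(c("low", "high"), times = c(22, 23))),
    conformity = c(rnorm(22, mean = 9, sd = 4), rnorm(23, mean = 14, sd = 5))
  )
}

# Assumption checks, using the conventional 0.05 cutoff.
normal_ok   <- shapiro.test(Moore$conformity)$p.value > 0.05
equalvar_ok <- fligner.test(conformity ~ partner.status, data = Moore)$p.value > 0.05

# Pick the test that matches the assumptions that held.
result <- if (!normal_ok) {
  kruskal.test(conformity ~ partner.status, data = Moore)
} else if (equalvar_ok) {
  t.test(conformity ~ partner.status, data = Moore, var.equal = TRUE)
} else {
  oneway.test(conformity ~ partner.status, data = Moore)  # Welch correction by default
}

print(result)
```

A small p-value in the printed result is evidence of a difference between the two groups; it does not by itself establish that partner.status causes the difference.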
You have 2 groups for partner status (low and high). Conformity is a continuous dependent variable. I recommend an independent samples t-test.
Check and make sure you do not violate any assumptions.
Relevant information:
https://statistics.laerd.com/statistical-guides/independent-t-test-statistical-guide.php

How can I plot my lmer() mixed model growth curves in r?

I have constructed a mixed effect model using lmer() with the aim of comparing the growth in reading scores for four different groups of children as they age.
I would like to plot a graph of the 4 different slopes with confidence intervals in R in order to visualize this relationship but I keep getting stuck.
I have tried to use the plot function and some versions of the ggplot as I have done for previous lm() models but it isn't working so far. Here is my attempted model which I hope looks at how the change in reading scores over time(age) interacts with a child's SESDLD grouping (this indicated whether a child has a language problem and whether or not they are high or low income).
AgeSES.model <- lmer(ReadingMeasure ~ Age.c*SESDLD1 + (1|childid), data = reshapedomit, REML = FALSE)
ReadingMeasure is a continuous score, and Age.c is centered age measured in months. SESDLD1 is a categorical measure with 4 levels. I would expect four positive slopes of ReadingMeasure growth with different intercepts and probably differing slopes.
I would really appreciate any pointers on how to do this!
Thank you so much!!
The type of plot I would like to achieve - this was done in Stata

Plotting backtransformed data with LS means plot

I have used the package lsmeans in R to get the average estimate for each level of my treatment factor (averaged across the levels of a block factor in the experimental design, which has been included as a fixed effect because it only has 3 levels). I have used a sqrt transformation for my response variable.
Thus I have used the following commands in R.
First defining model
model<-lm(sqrt(response)~treatment+block)
Then applying lsmeans
model_lsmeans<-lsmeans(model,~treatment)
Then plotting this
plot(model_lsmeans,ylab="treatment", xlab="response(with 95% CI)")
This gives a very nice graph with estimates and 95% confidence intervals for the different treatments.
The problem is just that this graph is for the transformed response.
How do I get this same plot with the backtransformed response (so the squared response)?
I have tried to create a new data frame and extract the lsmean, lower.CL, and upper.CL:
a<-summary(model_lsmeans)
New_dataframe<-as.data.frame(a[c("treatment","lsmean","lower.CL","upper.CL")])
And then make these squared
New_dataframe$lsmean<-New_dataframe$lsmean^2
New_dataframe$lower.CL<-New_dataframe$lower.CL^2
New_dataframe$upper.CL<-New_dataframe$upper.CL^2
New_dataframe
This gives me the estimates and CI boundaries squared that I need.
The problem is that I cannot make the same graph for these estimates and CIs as the one that I made with lsmeans above.
How can I do this? The reason that I ask is that I want to have graphs that are all of a similar style for my article. Since I very much like this LSmeans plot, and it is very convenient for me to use on the non-transformed response variables, I would like to have all my graphs in this style.
Thank you very much for your help! Hope everything is clear!
Kind regards
Ditlev
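One way to get a similar dot-and-interval graph on the original scale is to build it from the squared estimates in base graphics. The sketch below is an illustration under assumptions: the data are simulated, the factor levels are hypothetical, and the least-squares means are computed by hand instead of with lsmeans so that the example only needs base R. The summary and plot methods in lsmeans/emmeans also accept a type = "response" argument, which may be worth checking in the package documentation as a shorter route.

```r
# Simulated stand-in data; factor levels and effect sizes are hypothetical.
set.seed(7)
dat <- expand.grid(treatment = factor(c("A", "B", "C", "D")),
                   block = factor(1:3), rep = 1:5)
dat$response <- (2 + as.integer(dat$treatment) + 0.3 * as.integer(dat$block) +
                   rnorm(nrow(dat), sd = 0.5))^2

fit <- lm(sqrt(response) ~ treatment + block, data = dat)

# Least-squares means by hand: average the design rows over blocks per treatment.
grid <- expand.grid(treatment = levels(dat$treatment), block = levels(dat$block))
X <- model.matrix(~ treatment + block, data = grid)
L <- apply(X, 2, function(col) tapply(col, grid$treatment, mean))

est  <- drop(L %*% coef(fit))
se   <- sqrt(diag(L %*% vcov(fit) %*% t(L)))
crit <- qt(0.975, df.residual(fit))

# Back-transform the estimate and both limits by squaring
# (valid here because all limits are positive on the sqrt scale).
bt <- data.frame(treatment = rownames(L), estimate = est^2,
                 lower = (est - crit * se)^2, upper = (est + crit * se)^2)

# Horizontal dot-and-interval plot, one row per treatment.
if (!interactive()) pdf(NULL)  # skip writing Rplots.pdf when run as a script
y <- seq_len(nrow(bt))
plot(bt$estimate, y, xlim = range(bt$lower, bt$upper), yaxt = "n", pch = 19,
     xlab = "response (with 95% CI)", ylab = "treatment")
axis(2, at = y, labels = bt$treatment, las = 1)
segments(bt$lower, y, bt$upper, y, lwd = 3, col = "grey40")
points(bt$estimate, y, pch = 19)
```

The same plotting lines can be reused with the New_dataframe built in the question by swapping in its lsmean, lower.CL and upper.CL columns.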

Simulating data using existing data and probability

I have measured multiple attributes (height, species, crown width, condition etc) for about 1500 trees in a city. Using remote sensing techniques I also have the heights for the rest of the 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees by using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships for the species, species-condition relationships and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes using probability theory. So a tree with a height of, say, 25 m is more likely to be a cedar (height range 5 - 30 m) than a mulberry (height range 2 - 8 m), and more likely to be a cedar (50% of population) than an oak (same height range but 2% of population), and hence will have a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, based on the proportion in the population.
Is there a way to do this using probability theory in R preferably utilising Bayesian or machine learning methods?
I'm not asking for someone to write the code for me - I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider a classification tree, a method available in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and you need to slot your observations into those categories. I think those packages would work in your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can also predict categorical outcomes; unfortunately, multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you then want to predict the remaining attributes for individual trees, first fit a regression for each attribute on the measured trees, with species and the other measured variables as predictors. Then predict out-of-sample for the unmeasured trees, using the species labels you predicted as predictors for the unmeasured variable of interest, say crown width. That way the regression predicts the average crown width for that species (via its dummy variable), plus some error, while incorporating any other information you have on that out-of-sample tree.
If you want to use a Bayesian method, you could consider a hierarchical regression to model these out-of-sample predictions. Sometimes hierarchical models do better at predicting, as they tend to be fairly conservative. Consider looking at the package rstanarm for some examples.
I suggest looking into Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over causal relationships between variables. The Bayesian network structure can be specified by hand or learned from data by an algorithm.
R has several implementations of Bayesian Networks with bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
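The mixture approach in the last answer can be sketched in base R. This is an illustration only: every number in the species table below (shares, means, standard deviations, correlations) is a made-up placeholder that would in practice be estimated from the measured trees, and the function name simulate_trees is hypothetical. Condition is left out because the answer does not say how to model it.

```r
# Hypothetical per-species summaries that would be estimated from the measured trees:
# population share, mean/sd of height (m) and crown width (m), and their correlation.
species <- data.frame(
  name   = c("cedar", "oak", "mulberry"),
  weight = c(0.50, 0.02, 0.48),
  mu_h   = c(18, 18, 5),  sd_h = c(6, 6, 1.5),
  mu_w   = c(8, 12, 4),   sd_w = c(2, 3, 1),
  rho    = c(0.7, 0.6, 0.5),
  stringsAsFactors = FALSE
)

# Draw species and crown width for trees of known height.
simulate_trees <- function(height, sp = species) {
  out <- data.frame(height = height, species = NA_character_, crown_width = NA_real_,
                    stringsAsFactors = FALSE)
  for (i in seq_along(height)) {
    # Conditional mixing weights: prior share times the height density (Bayes' rule).
    w <- sp$weight * dnorm(height[i], sp$mu_h, sp$sd_h)
    k <- sample(nrow(sp), 1, prob = w)  # sample() normalises prob internally
    # Crown width given height for a bivariate Gaussian bump.
    cond_mean <- sp$mu_w[k] +
      sp$rho[k] * sp$sd_w[k] / sp$sd_h[k] * (height[i] - sp$mu_h[k])
    cond_sd   <- sp$sd_w[k] * sqrt(1 - sp$rho[k]^2)
    out$species[i]     <- sp$name[k]
    out$crown_width[i] <- rnorm(1, cond_mean, cond_sd)
  }
  out
}

set.seed(123)
sim <- simulate_trees(rep(25, 2000))
print(round(prop.table(table(sim$species)), 3))
```

With these placeholder numbers, a 25 m tree is almost always assigned cedar, occasionally oak, and essentially never mulberry, which is the behaviour the question asks for.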
Sounds like a good problem. Good luck and have fun.
