Plot linear and multiple linear regression on the same graph (ggplot) - r

I have for instance this data frame :
data <- data.frame(
x=c(1:12)
, case=c(3,5,1,8,2,4,5,0,8,2,3,5)
, rain=c(1,8,2,1,4,5,3,0,8,2,3,4)
, country=c("A","A","A","A","B","B","B","B","C","C","C","C")
, year=rep(seq(2000,2003,1),3)
)
I would like to perform 2 linear regressions and plot them on one graph.
In a nutshell, I would like to compare, on one and the same graph, the crude trend of cases over time (simple lm) with the same trend of cases adjusted for rainfall over the years 2000 to 2003.
model<-lm(case~ year, data=data)
The second one would be a multiple linear regression. I used this code for the purpose, but I am not sure it is ideal.
modelrain<-lm(case~ I(year +rain), data=data)
I did it with a simple plot using abline, but I don't know how to do it with ggplot. I created a new data frame, but it doesn't seem to work properly (so I won't post the rest of my code here).
Thank you very much

Building off the suggestions in the comments, there are three valid regression models:
model1 <- lm(case ~ year, data = data)
summary(model1)
model2 <- lm(case ~ year + rain, data = data)
summary(model2)
model3 <- lm(case ~ year * rain, data = data)
summary(model3)
With the limited data we have, there doesn't seem to be a lot going on.
The first question on how to plot the regression line for model1 using ggplot is just:
ggplot(data, aes(x = year, y = case)) + geom_point() + geom_smooth(method = "lm")
As others have noted, it is unclear what user3355655 means by "adjusted" for rain (since rain and year can't truly share the same x-axis), but if we're willing to take the simplest course and simply treat rain as a factor, then:
ggplot(data, aes(x = year, y = case, color = factor(rain))) +
  geom_point() +
  geom_smooth(method = "lm", fill = NA) +
  scale_y_continuous(limits = c(-1, 10))
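If the goal is literally to overlay the crude and the rain-adjusted trend on one graph, one option is to predict from model1 and model2 over the observed years and draw both lines. A minimal sketch, assuming the adjusted trend is evaluated at the mean observed rainfall:
library(ggplot2)

# Predict cases over the observed years; hold rain at its mean for the
# adjusted model so both trends share the same year axis.
newdat <- data.frame(year = seq(2000, 2003, by = 0.1))
newdat$crude    <- predict(model1, newdata = newdat)
newdat$adjusted <- predict(model2,
                           newdata = transform(newdat, rain = mean(data$rain)))

ggplot(data, aes(x = year, y = case)) +
  geom_point() +
  geom_line(data = newdat, aes(y = crude, colour = "crude")) +
  geom_line(data = newdat, aes(y = adjusted, colour = "adjusted (mean rain)")) +
  labs(colour = "trend")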

Related

Interaction in GLM (Ai and Norton 2003) given controls

When including an interaction effect in a logistic regression, I learned from Ai and Norton (2003) that one cannot readily interpret it.
First, I import the data and then run the model
Example <- read.csv("~/Desktop/Example.csv")
summary(model <- glm(Job_filled ~ log10_wage*Online_sentiment + Department + Industry + CompanySize, Example, family=binomial("logit")))
Second, I use the DAMisc package and its intEff() command to mirror their example from: https://www.rdocumentation.org/packages/DAMisc/versions/1.7.2/topics/intEff:
library(DAMisc)
out <- intEff(obj=model, vars=c("log10_wage", "Online_sentiment"), data=Example)
out <- out$byobs$int
plot(out$phat, out$int_eff, xlab="Predicted Pr(Y=1|X)", ylab = "Interaction Effect")
My first question is: What I have on the y-axis is the cross-derivative of the expected value of the dependent variable, correct?
My second question: How can I plot the cross-derivative across all possible combinations of the control variables? The goal is to see whether, for most (e.g., more than 95%) of the possible combinations, the interaction effect is positive or negative.
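One starting point for the second question is to summarise the observation-level interaction effects already computed above. This is only a sketch using the columns referenced in the plot call (phat and int_eff), and it covers the observed covariate combinations rather than every possible one:
# Share of observations with a positive / negative interaction effect
mean(out$int_eff > 0)
mean(out$int_eff < 0)

# Same scatter as above, coloured by the sign of the interaction effect
plot(out$phat, out$int_eff,
     col = ifelse(out$int_eff > 0, "blue", "red"),
     xlab = "Predicted Pr(Y=1|X)", ylab = "Interaction Effect")
abline(h = 0, lty = 2)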

Visualising variance from random effects in a mixed model by group

I have run a linear mixed model in R using lmer. I am attempting to visualise the random effect structure. To produce a graph I have used print(dotplot(ranef(RT.model.4, condVar=T))[['part_no']]) where part_no is the random effect from the mixed model. It creates something like this:
This is great. However I want to be able to visually tell the difference between my two groups of participants (the random effect being discussed) in the graph. I have group A and group B. In my dataset I have a column for participant type and for each row it gives a value of A or B.
I would like to either colour code the graph to show participants from groups A and B. Or perhaps better would be to create two separate panels, one for each group.
Any suggestions on how to do this would be very much appreciated.
Here is a way using ggplot rather than lattice (just because I am more familiar with it), using code from the examples in ?dotplot.ranef.mer. You need to match your treatment group in the data to the random effects grouping variables returned by ranef. I don't see how this can be done automatically within dotplot.ranef.mer.
Create a small example with a treatment group; each subject is assigned to one treatment group.
library(lme4)
library(ggplot2)
sleepstudy$trt = as.integer(sleepstudy$Subject %in% 308:340)
m = lmer(Reaction ~ trt + (1|Subject), sleepstudy)
Convert the random effects to a dataframe and match in the treatment groups
dd = as.data.frame(ranef(m, condVar=TRUE), "Subject")
dd$trt = with(sleepstudy, trt[match(dd$grp, Subject)])
You can then plot how you want, say by assigning a colour to each group or by faceting (facet_wrap()):
ggplot(dd, aes(y = grp, x = condval, colour = factor(trt))) +
  geom_point() +
  facet_wrap(~ term, scales = "free_x") +
  geom_errorbarh(aes(xmin = condval - 2 * condsd,
                     xmax = condval + 2 * condsd), height = 0)

ggplot(dd, aes(y = grp, x = condval)) +
  geom_point() +
  geom_errorbarh(aes(xmin = condval - 2 * condsd,
                     xmax = condval + 2 * condsd), height = 0) +
  facet_wrap(~ trt)
You should be able to use the groups= option in dotplot(). Assuming your data is in a dataframe called df with the group variable being in group, you could use
print(dotplot(ranef(RT.model.4, condVar=T), groups=df$group)[['part_no']])

Scale LDA decision boundary

I have a rather unconventional problem and am having a hard time finding a solution. I would really appreciate your help.
I have 4 genes (features) and my classification here is binary (0 and 1). After a lot of back and forth, I have settled on LDA for my classification. I have different studies, each comparing the same two classes, and I trained my model using these 4 genes on each of these studies.
I want to visualize the LDA scores in the form of a points plot, something like below, where each section represents a different study/dataset: samples of that dataset are on the X axis, and on the Y axis is the LD1 value I get using
lda_model = lda(formula = class ~ ., data = train)
predict(lda_model, train)
Since I trained a different model on each dataset, we can clearly see that the decision boundary (which I assume is the black line) for each dataset is different and on a different scale. However, I want to scale the values on the Y axis in such a way that all my datasets are on the same scale and I can represent this plot with a single decision boundary (again, something I can clearly draw on the plot, like the red line).
The LD1 values here are a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD) - mean(a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD)), computed for each dataset individually. However, this is not exactly equal to a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD) + intercept, which we could get using logistic regression. I am trying to find that value, or some method, which can scale my Y axis across all the datasets using LDA.
Thanks for your help!
I did a min-max scaling and that seemed to work. It scaled all my data points across all datasets with decision boundary at zero.
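A minimal sketch of that per-dataset scaling, with illustrative names (datasets is assumed to be a list of per-study data frames, each with the class column and the four gene columns used above):
library(MASS)

# Min-max scale a vector to [0, 1]
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# For each study: fit LDA, extract the LD1 scores, then rescale them so
# every study shares the same 0-1 range on the Y axis.
scaled_scores <- lapply(datasets, function(d) {
  fit <- lda(class ~ ., data = d)
  ld1 <- predict(fit, d)$x[, "LD1"]
  scale01(ld1)
})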

How do I fix the abline warning, only using first two coefficients?

I have been unable to resolve a warning when using abline(). I keep getting the message: In abline(model): only using the first two of 7 regression coefficients. I've been searching and have seen many instances of others with this warning, but their examples involve multiple linear regressions. I'm new to R and below is a simple example I'm working with. Thanks for any help!
year = c('2010','2011','2012','2013','2014','2015','2016')
population = c(25244310,25646389,26071655,26473525,26944751,27429639,27862596)
Texas=data.frame(year,population)
plot(population~year,data=Texas)
model = lm(population~year,data=Texas)
abline(model)
You probably want something like the following where we make sure that year is interpreted as a numeric variable in your model:
plot(population ~ year, data = Texas)
model <- lm(population ~ as.numeric(as.character(year)), data = Texas)
abline(model)
This makes lm estimate an intercept (corresponding to year 0) and a slope (the mean increase in population each year), which is correctly interpreted by abline, as can also be seen on the plot.
The reason for the warning is that year becomes a factor with 7 levels, so your lm call estimates the mean value for the reference year 2010 (the intercept) and 6 contrasts to the other years. Hence you get many coefficients, and abline incorrectly uses only the first two.
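You can see the difference by comparing the coefficients of the two models on the data above:
# year as a factor: intercept + 6 year contrasts = 7 coefficients
coef(lm(population ~ year, data = Texas))

# year as numeric: just the intercept and slope that abline() expects
coef(lm(population ~ as.numeric(as.character(year)), data = Texas))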
Edit: With that said, you probably want to change the way year is stored to numeric. Then your code works, and plot also makes a proper scatter plot with a regression line.
Texas$year <- as.numeric(as.character(Texas$year))
plot(population ~ year, data = Texas, pch = 16)
model <- lm(population ~ year, data = Texas)
abline(model)
Note that as.character is needed in general; it only works in lm without it here by coincidence (because the years are consecutive).
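A quick illustration of why as.character matters in general:
# as.numeric on a factor returns the level codes, not the original values
f <- factor(c("2010", "2012", "2015"))
as.numeric(f)                  # 1 2 3  (level codes)
as.numeric(as.character(f))    # 2010 2012 2015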

Predict the next values based on the given data in R

I have a vector which represents violations in each year. How can I predict the violations in the next years in R?
year <- c(190519, 223721, 235321, 101934)
Please help me out
To illustrate the comments made by akash87 and Dominic Comtols that it would be futile to predict with so little information, here's a linear model and a visualisation with ggplot:
year <- c(190519, 223721, 235321, 101934)
df <- data.frame(year = 1:4, crime = year)
library(ggplot2)
ggplot(df, aes(x = year, y = crime)) +
  geom_point() +
  geom_smooth(method = "lm", fullrange = TRUE) +
  xlim(1, 6)
As seen from the plot, the value predicted by extrapolating the linear model to year 6 can lie anywhere within the grey area, i.e. between -339737 and 537576. You're better off just guessing...
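Those bounds can be reproduced numerically; a quick sketch using the df defined above (the ribbon drawn by geom_smooth is a confidence interval around the fitted line):
fit <- lm(crime ~ year, data = df)
# Confidence interval for the fitted line at year 6
predict(fit, newdata = data.frame(year = 6), interval = "confidence")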
The dataset is too small for a reliable forecast, but you could try the following, just to illustrate how time series forecasts could be obtained in principle:
year <- c(190519, 223721, 235321, 101934)
library(forecast)
yearforecasts <- HoltWinters(as.ts(year), beta=FALSE, gamma=FALSE)
yearforecasts2 <- forecast(yearforecasts, h = 1)
> yearforecasts2
# Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#5 190518.3 95821.09 285215.5 45691.42 335345.2
plot(yearforecasts2)
The forecast is inaccurate and has a large error margin due to the very small number of data points. As pointed out at the beginning of this answer and in the comments, more data is required for a useful forecast. For the same reason, it is not possible to forecast more than one year ahead with this method.
