geom_smooth on a subset of data - r

Here is some data and a plot:
set.seed(18)
data = data.frame(y=c(rep(0:1,3),rnorm(18,mean=0.5,sd=0.1)),colour=rep(1:2,12),x=rep(1:4,each=6))
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+geom_point()+ geom_smooth(method='lm',formula=y~x,se=F)
As you can see, the linear regression is highly influenced by the values at x = 1.
Can I get the linear regressions calculated for x >= 2 but still display the values at x = 1 (where y is either 0 or 1)?
The resulting graph would be exactly the same except for the linear regressions: they would not "suffer" from the influence of the values at x = 1.

It's as simple as geom_smooth(data=subset(data, x >= 2), ...). If this plot is just for yourself that may not matter, but be aware that a plot like this would be misleading to others unless you mention how the regression was performed. I'd also recommend making the excluded points more transparent:
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+
geom_point(data=subset(data, x >= 2)) + geom_point(data=subset(data, x < 2), alpha=.2) +
geom_smooth(data=subset(data, x >= 2), method='lm',formula=y~x,se=F)

The regular lm function has a weights argument which you can use to assign a weight to each observation. In this way you can play with how much influence each observation has on the outcome. I think this is a more general way of dealing with the problem than subsetting the data. Of course, assigning weights ad hoc does not bode well for the statistical soundness of the analysis; it is always best to have a rationale behind the weights, e.g. low-weight observations have a higher uncertainty.
I think under the hood ggplot2 uses the lm function, so you should be able to pass the weights argument. You can add the weights through the aesthetics (aes), assuming the weights are stored in a vector:
ggplot(data,aes(x=x,y=y,colour=factor(colour))) +
geom_point()+ stat_smooth(aes(weight = runif(nrow(data))), method='lm')
you could also put weight in a column in the dataset:
ggplot(data,aes(x=x,y=y,colour=factor(colour))) +
geom_point()+ stat_smooth(aes(weight = weight), method='lm')
where the column is called weight.
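For the specific question above, a minimal sketch of that idea (the weight values are just an illustrative choice, not a recommendation) would be to down-weight the x = 1 observations:
# give the x = 1 points a near-zero weight so they barely influence the fit
data$weight <- ifelse(data$x == 1, 0.01, 1)
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point() +
  stat_smooth(aes(weight = weight), method = 'lm', formula = y ~ x, se = FALSE)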

I tried @Matthew Plourde's solution, but subset did not work for me: using the subset made no difference compared to the original data. Replacing subset with plain logical indexing (dplyr::filter works just as well) fixed it:
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+
geom_point(data=data[data$x >= 2,]) + geom_point(data=data[data$x < 2,], alpha=.2) +
geom_smooth(data=data[data$x >= 2,], method='lm',formula=y~x,se=F)
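For reference, the dplyr::filter version mentioned above would look like this (assuming dplyr is installed):
library(dplyr)
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point(data = filter(data, x >= 2)) +
  geom_point(data = filter(data, x < 2), alpha = .2) +
  geom_smooth(data = filter(data, x >= 2), method = 'lm', formula = y ~ x, se = FALSE)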

Related

R Syntax Simple Slopes MEM

Question regarding the syntax of a mixed effects model in R.
I have run the following code to examine the simple slope to determine the effect of one of my variables (variability) within another one of my variables (ambiguity):
lmer.E1.v2 <- lmer(logRT ~ Variability.c / Ambiguity.c + (Variability.c + Ambiguity.c|ID),
data=data %>% filter(Experiment == "E1"),
control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
summary(lmer.E1.v2)
When I reverse these two variables, so that the code looks like this:
lmer.E1.v2 <- lmer(logRT ~ Ambiguity.c / Variability.c + (Ambiguity.c + Variability.c|ID),
data=data %>% filter(Experiment == "E1"),
control=lmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)))
summary(lmer.E1.v2)
…I get different output from the first block of code than from the second. What is the difference in interpretation when I reverse the order of the two variables in the syntax?
The primary issue is that the / operator is not commutative (i.e. a/b != b/a): a/b expands to a + a:b, while b/a expands to b + a:b. You should get the same overall fit (predictions, likelihood, etc.), at least up to some degree of numeric fuzz, but the model parameterization will be different.
There do exist cases where (a+b|g) gives different answers from (b+a|g) (see here), but this is unusual.
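A quick way to see the difference in parameterization is to look at how R expands the two formulas (y, a, and b here are just placeholder names):
# y ~ a/b expands to the main effect a plus the a:b interaction ...
attr(terms(y ~ a/b), "term.labels")
# ... while y ~ b/a expands to the main effect b plus the same interaction
attr(terms(y ~ b/a), "term.labels")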

lavaan WARNING: some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate

I am trying to fit an SEM model to a dataset. Some of the variables are on a Likert scale (1-5), and some are counts generated from the computer log of some of the activities.
While performing the fit, lavaan gives me the following warning:
lavaan WARNING: some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate
To address this warning I want to scale some of the variables, but I can't figure out how to do that.
Log_And_SurveyResult <- read_excel("C:/Users/Aakash/Desktop/analysis/Log-And-SurveyResult.xlsx")
model <- '
Reward =~ REW1 + REW2 + REW3 + REW4
ECA =~ ECA1 + ECA2 + ECA3
Feedback =~ FED1 + FED2 + FED3 + FED4
Motivation =~ Reward + ECA + Feedback
Satisfaction =~ a*MaxTimeSpentInAWeek + a*TotalTimeSpent + a*TotalLearningActivityView
Motivation ~ Satisfaction'
fit <- sem(model,data = Log_And_SurveyResult)
summary(fit, standardized=T, std.lv = T)
fitMeasures(fit, c("cfi", "rmsea", "srmr"))
I want to scale some of the variables like MaxTimeSpentInAWeek and TotalTimeSpent
Could you please help me figure out how to scale the variables? Thank you very much.
As Elias pointed out, the difference in magnitude between the variables is huge, so scaling the variables is recommended.
The warning gives a hint and inspecting varTable(fit) returns summary information about the variables in a fitted lavaan object.
Rather than running scale() separately on each column, you could use apply() on a subset or on your whole data.frame:
## Scale the variables in the 4th and 7th columns
Log_And_SurveyResult[, c(4, 7)] <- apply(Log_And_SurveyResult[, c(4, 7)], 2, scale)
## Scale the whole data.frame
Log_And_SurveyResult <- apply(Log_And_SurveyResult, 2, scale)
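One caveat worth noting about the whole-data.frame version: apply() returns a matrix, so if downstream code expects a data.frame (and all the columns are numeric), it may help to coerce the result back:
## apply() drops the data.frame class; wrap the result to restore it
Log_And_SurveyResult <- as.data.frame(apply(Log_And_SurveyResult, 2, scale))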
You can just use scale(MaxTimeSpentInAWeek). This will scale your variable to mean = 0 and variance = 1. E.g.:
Log_And_SurveyResult$MaxTimeSpentInAWeek <-
scale(Log_And_SurveyResult$MaxTimeSpentInAWeek)
Log_And_SurveyResult$TotalTimeSpent <-
scale(Log_And_SurveyResult$TotalTimeSpent)
Or did I misunderstand your question?
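A small additional caveat: scale() returns a one-column matrix rather than a plain vector, so if that causes trouble later it can be wrapped in as.numeric():
## keep the scaled column as a plain numeric vector
Log_And_SurveyResult$MaxTimeSpentInAWeek <-
  as.numeric(scale(Log_And_SurveyResult$MaxTimeSpentInAWeek))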

Plot "regression line" from multiple regression in R

I ran a multiple regression with several continuous predictors, a few of which came out significant, and I'd like to create a scatterplot or scatter-like plot of my DV against one of the predictors, including a "regression line". How can I do this?
My plot looks like this
D = my.data; plot( D$probCategorySame, D$posttestScore )
If it were simple regression, I could add a regression line like this:
lmSimple <- lm( posttestScore ~ probCategorySame, data=D )
abline( lmSimple )
But my actual model is like this:
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
I would like to add a regression line that reflects the coefficient and intercept from the actual model instead of the simplified one. I think I'd be happy to assume mean values for all other predictors in order to do this, although I'm ready to hear advice to the contrary.
This might make no difference, but I'll mention it just in case: the situation is complicated slightly by the fact that I probably will not want to plot the original data. Instead, I'd like to plot mean values of the DV for binned values of the predictor, like so:
D[,'probCSBinned'] = cut( my.data$probCategorySame, as.numeric( seq( 0,1,0.04 ) ), include.lowest=TRUE, right=FALSE, labels=FALSE )
D = aggregate( posttestScore~probCSBinned, data=D, FUN=mean )
plot( D$probCSBinned, D$posttestScore )
Just because it happens to look much cleaner for my data when I do it this way.
To plot the individual terms in a linear or generalised linear model (ie, fit with lm or glm), use termplot. No need for binning or other manipulation.
# plot everything on one page
par(mfrow=c(2,3))
termplot(lmMultiple)
# plot individual term
par(mfrow=c(1,1))
termplot(lmMultiple, terms="pretestScore")
You need to create a vector of x-values spanning the domain of your plot and predict the corresponding y-values from your model. To do this, you inject this vector into a data frame whose other variables match those in your model. You stated that you are OK with keeping the other variables fixed at their mean values, so I have used that approach in my solution. Whether the x-values you are predicting are actually plausible given the other values in your plot should probably be something you consider when setting this up.
Without sample data I can't be sure this will work exactly for you, so I apologize if there are any bugs below, but this should at least illustrate the approach.
# Setup
xmin = 0; xmax = 10 # domain of your plot
D = my.data
plot( D$probCategorySame, D$posttestScore, xlim=c(xmin,xmax) )
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
# create a dummy dataframe where all variables are held at their mean value,
# except the variable we want to plot, which will vary incrementally over the
# domain of the plot. We need this object to get the predicted values we
# want to plot.
N = 1e4
means = colMeans(D)
dummyDF = as.data.frame(lapply(means, rep, N)) # each column is its mean, repeated N times
xv = seq(xmin, xmax, length.out=N)
dummyDF$probCategorySame = xv # vary only the predictor we want on the x-axis
# Getting and plotting predictions over our dummy data.
yv = predict(lmMultiple, newdata=dummyDF)
lines(xv, yv)
Look at the Predict.Plot function in the TeachingDemos package for one option to plot one predictor vs. the response at a given value of the other predictors.

R Nonlinear Least Squares (nls) Model Fitting

I'm trying to fit the G function of my data to the following mathematical model: y = A / ((1 + (B^2)*(x^2))^((C+1)/2)). The shape of this curve can be seen here:
http://www.wolframalpha.com/input/?i=y+%3D+1%2F+%28%281+%2B+%282%5E2%29*%28x%5E2%29%29%5E%28%282%2B1%29%2F2%29%29
Here's a basic example of what I've been doing:
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat) #Gest is a function within spatstat (explained below)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
GvsR_dataframe <- data.frame(R = Rvalues, G = rev(Gvalues))
themodel <- nls(rev(Gvalues) ~ (1 / (1 + (B^2)*(R^2))^((C+1)/2)), data = GvsR_dataframe, start = list(B=0.1, C=0.1), trace = FALSE)
"Gest" is a function found within the 'spatstat' library. It is the G function, or the nearest-neighbour function, which displays the distance between particles on the independent axis, versus the probability of finding a nearest neighbour particle on the dependent axis. Thus, it begins at y=0 and hits a saturation point at y=1.
If you plot simdat.Gest, you'll notice that the curve is 's' shaped, meaning that it starts at y = 0 and ends up at y = 1. For this reason, I reversed the vector Gvalues, which contains the dependent variable. Thus, the information is in the correct orientation to be fitted by the above model.
You may also notice that I've automatically set A = 1. This is because G(r) always saturates at 1, so I didn't bother keeping it in the formula.
My problem is that I keep getting errors. For the above example, I get this error:
Error in nls(rev(Gvalues) ~ (1/(1 + (B^2) * (R^2))^((C + 1)/2)), data = GvsR_dataframe, :
singular gradient
I've also been getting this error:
Error in nls(Gvalues1 ~ (1/(1 + (B^2) * (x^2))^((C + 1)/2)), data = G_r_dataframe, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562
I haven't a clue as to where the first error is coming from. The second, however, I believe was occurring because I did not pick suitable starting values for B and C.
I was hoping that someone could help me figure out where the first error was coming from. Also, what is the most effective way to pick starting values to avoid the second error?
Thanks!
As noted your problem is most likely the starting values. There are two strategies you could use:
Use brute force to find starting values. See package nls2 for a function to do this (a sketch follows after the linearization notes below).
Try to get a sensible guess for starting values.
Depending on your values it could be possible to linearize the model.
G = (1 / (1 + (B^2)*(R^2))^((C+1)/2))
ln(G)=-(C+1)/2*ln(B^2*R^2+1)
If B^2*R^2 is large, this becomes approx. ln(G) = -(C+1)*(ln(B)+ln(R)), which is linear.
If B^2*R^2 is close to 1, it is approx. ln(G) = -(C+1)/2*ln(2), which is constant.
(Please check for errors, it was late last night due to the soccer game.)
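Going back to the first strategy, a minimal sketch of a brute-force search for starting values with nls2 could look like this (the grid ranges are only illustrative guesses, not recommendations):
library(nls2)
## evaluate the model over a coarse grid of (B, C) values and keep the best combination
start_grid <- expand.grid(B = seq(0.1, 20, length.out = 20),
                          C = seq(0.1, 20, length.out = 20))
rough_fit <- nls2(G ~ 1 / (1 + (B^2)*(R^2))^((C+1)/2),
                  data = GvsR_dataframe,
                  start = start_grid,
                  algorithm = "brute-force")
## use these as starting values for a subsequent nls() call
coef(rough_fit)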
Edit after additional information has been provided:
The data looks like it follows a cumulative distribution function. If it quacks like a duck, it most likely is a duck. And in fact ?Gest states that a CDF is estimated.
library(spatstat)
data(simdat)
simdat.Gest <- Gest(simdat)
Gvalues <- simdat.Gest$rs
Rvalues <- simdat.Gest$r
plot(Gvalues~Rvalues)
#let's try the normal CDF
fit <- nls(Gvalues~pnorm(Rvalues,mean,sd),start=list(mean=0.4,sd=0.2))
summary(fit)
lines(Rvalues,predict(fit))
#Looks not bad. There might be a better model, but not the one provided in the question.

Calculating a correlation coefficient that includes missing values

I'm looking to calculate some form of correlation coefficient in R (or any common stats package actually) in which the value of the correlation is influenced by missing values. I am not sure if this is possible and am looking for a method. I do not want to impute data, but actually want the correlation to be reduced based on the number of incomplete cases included in some systematic fashion. The data are a series of time points generated by different individuals and the correlation coefficient is being used to compute reliability. In many cases, one individual's data will include several more time points than the other individual...
Again, not sure if there is any standard procedure for dealing with such a situation.
One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship then that provides support for assuming that the missing values won't provide any information. If that is your case then you won't have to impute anything and can just perform your computation without the missing values. glm in R can be used for logistic regression.
Also on a different note, see the use="pairwise.complete.obs" argument to cor which may or may not apply to you.
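A minimal sketch of that missingness check (the variable and data names are purely illustrative):
## model whether y is missing as a function of another variable;
## a flat relationship supports treating the missingness as uninformative
miss_fit <- glm(is.na(y) ~ x, data = mydata, family = binomial)
summary(miss_fit)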
EDIT: I have revised this answer based on rereading the question.
My feeling is that when a data pair has an NA in one of the time series, that pair cannot be used for calculating a correlation, as there is no information at that point. And since there is no information at that point, there is no way to know how it would have influenced the correlation. Specifying that an NA reduces the correlation seems tricky; had an observation been present at that point, it could just as easily have improved the correlation.
Default behavior in R is to return NA for the correlation if there is an NA present. This behavior can be tweaked using the 'use' argument. See the documentation of that function for more details.
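To illustrate the default behaviour and the 'use' argument (x1 and x2 being two numeric vectors that may contain NAs):
cor(x1, x2)                                 # returns NA if either vector contains an NA
cor(x1, x2, use = "pairwise.complete.obs")  # correlation over the complete pairs only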
As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.
Under these assumptions, a simple way to penalize non-matching missing values is to compute correlation for complete cases and multiply by the proportion of observations that are matched in terms of their NA-status. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB))). A simple illustration follows.
myfun = function(x1, x2, idx_rm) {
  temp = x2
  # remove 'idx_rm' points from x2
  temp[idx_rm] = NA
  # calculate correlations
  r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
  r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
  penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
                     (!is.na(temp) & is.na(x1)))
  r_pen = round(r_NA * penalty, 2)
  # plot
  plot(x1, temp, main = paste('r_full =', r_full,
                              '; r_NA =', r_NA,
                              '; r_pen =', r_pen),
       xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
  points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)
  regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
  regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
  abline(regr_full[1], regr_full[2])
  abline(regr_NA[1], regr_NA[2], lty = 2)
}
Run a simple simulation to illustrate the possible effects of missing values and penalization:
set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artificially inflate the correlation,
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))
# A case when NA's DEflate the correlation,
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))
# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)
# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))
