What does the span argument control in geom_smooth? - r

I am using geom_smooth from the ggplot2 package to create a smoothed line on a time series scatter plot (one point for each day of the year, so I have 365 points). One of the arguments is called span, and going into the help file (?geom_smooth) the following description is given:
span controls the amount of smoothing for the default loess smoother. Smaller numbers produce wigglier lines, larger numbers produce smoother lines.
However, this doesn't actually tell me what the span argument is controlling. Setting it to 1 is useless, and setting it to 0.1 provides something that looks good.
[Plot: span = 0.5]
[Plot: span = 0.1]
However, when describing the plot, since I'm not totally sure what span actually changes, I'm not sure how to describe the smoothed line. Any pointers?

The span (also called alpha) determines the width of the moving window used when smoothing your data.
"In a loess fit, the alpha parameter determines the width of the sliding window. More specifically, alpha gives the proportion of observations that is to be used in each local regression. Accordingly, this parameter is specified as a value between 0 and 1. The alpha value used for the loess curve in Fig. 2 is 0.65; so, each of the local regressions used to produce that curve incorporates 65% of the total data points. "
Taken from:
Jacoby (2000) Loess: a nonparametric, graphical tool for depicting relationships between variables. Electoral Studies 19(4). (Paywalled paper)
For more details check the referenced paper.

LOESS smoothing is a non-parametric form of regression that fits a weighted local regression inside a sliding window to build up a line of best fit. Within each "window", the points nearest the centre get the most weight, and the window passes along the x-axis.
You control the size of this window with the span argument. The span sets alpha, the proportion of the data used in each local fit, and hence the degree of smoothing. The smaller the span, the smaller the "window", and so the noisier and more jagged the line.
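As a rough sketch of what this means for 365 daily points, the snippet below fits loess() from base R to simulated data with two spans. The data, the variable names, and the roughness() helper are assumptions made for illustration, not part of the original question.

```r
# Simulated stand-in for a daily series (assumption: not the asker's data)
set.seed(1)
day   <- 1:365
value <- sin(2 * pi * day / 365) + rnorm(365, sd = 0.3)

# span is the fraction of points that enters each local regression
0.1 * length(day)  # 36.5, so roughly 36 to 37 points per local fit
0.5 * length(day)  # 182.5, so roughly half the year per local fit

fit_narrow <- loess(value ~ day, span = 0.1)
fit_wide   <- loess(value ~ day, span = 0.5)

# Hypothetical helper: sum of squared second differences of the fitted
# curve; larger values mean a wigglier line
roughness <- function(fit) sum(diff(fitted(fit), differences = 2)^2)
roughness(fit_narrow) > roughness(fit_wide)
```

In ggplot2 the equivalent call would be geom_smooth(method = "loess", span = 0.1), so a line drawn that way could be described as a loess fit in which each local regression uses about 10% of the observations.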
Look for documentation under LOESS rather than span.

Related

Box plot with only one Whisker line?

I wanted to compare a variable among two different places. For this I used the following code in R:
boxplot(variable~place)
As my place had two levels. However, I get a weird looking box plot. Do you know what seems to be the problem?
As has been noted in comments, there's nothing wrong with the boxplots. The fact that there is just one whisker each is a reflection of your data.
Boxplots are an extremely instructive type of graphic: they not only show the location and spread of data but also indicate skewness. The interpretation of the boxplot depends on its ‘syntax’, i.e., its main graphical elements. There are five such elements:
1. bold horizontal line: depicts the median (the middle value of a sorted distribution)
2. box: represents the interquartile range (IQR)
3. whiskers: separate values lying outside the IQR (but still somewhat typical of the data) from outliers
4. empty circles: depict outliers (values surprisingly large or small given all values considered)
5. notches (optional): give a rough impression of the significance of the difference between the medians
The fact that there's just one whisker in each boxplot is, then, due to the extreme skewness of your data: in the case of box 1 the upper limit of the values is the upper limit of the IQR, and in the case of box 2 there exists no value smaller than the median!
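To see how skewed data can make a whisker disappear, here is a minimal sketch with made-up values (an assumption, not your data). boxplot.stats() returns the five numbers that boxplot() draws: lower whisker end, lower hinge, median, upper hinge, upper whisker end.

```r
# Made-up, strongly right-skewed sample: most values sit at the minimum
x <- c(rep(1, 6), 2, 3, 5, 8)

s <- boxplot.stats(x)
s$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
s$out    # values drawn as separate outlier circles

# When the lower whisker end equals the lower hinge, the whisker has zero
# length, so only the upper whisker is visible in the plot
s$stats[1] == s$stats[2]

boxplot(x)
```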
Hope this helps.

Does this curve represent non-linearity in my residuals vs fitted plot? (simple linear regression)

Hi,
I am running a simple linear regression model in R at the moment and wanted to check my assumptions. As the plot shows, my red line is not flat but curves in places.
I am having a little difficulty interpreting this - does this imply non-linearity? And if so, what does this say about my data?
Thank you.
The observation marked 19 on your graph (bottom right corner) seems to have significant influence and is pulling your line down more than the other points are pulling it up. All in all, the relationship looks linear. Either increasing the sample size, so that a single point carries less weight, or removing the outlier(s) should fix your problem without compromising the story your data is trying to tell you, and should give you the nice graph you're looking for.
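One way to check whether a single observation is driving the curve is Cook's distance. The sketch below uses simulated data with a planted outlier at row 19 (an assumption standing in for your data) and compares the coefficients with and without the most influential point.

```r
# Simulated data with one planted outlier (assumption: not the asker's data)
set.seed(42)
dat <- data.frame(x = 1:20)
dat$y <- 2 * dat$x + rnorm(20)
dat$y[19] <- dat$y[19] - 25

fit_all <- lm(y ~ x, data = dat)

# Cook's distance measures how much the fit changes when a point is left out
cooks <- cooks.distance(fit_all)
most_influential <- unname(which.max(cooks))
most_influential

# Refit without that point and compare the coefficients
fit_drop <- lm(y ~ x, data = dat[-most_influential, ])
rbind(all = coef(fit_all), dropped = coef(fit_drop))
```

If the residuals vs fitted line flattens after the refit, the curvature came from that point rather than from genuine non-linearity.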

How to generate mean curve of non-function?

I am currently working on curves generated in tensile tests of polymer specimens. Here, I try to generate a mean curve of five data sets generated at the same composition of the samples. Unfortunately, the resulting curve is not a function but has a vertical section which is why a simple smooth is not sufficient. Is there a way to fix the smoothed curve to a defined end point in R? Or an even better way that I did not see yet?
I already tried geom_smooth() from ggplot2 on all data points, but it did not work as I had hoped.
My current approach:
library(ggplot2)
data <- read.csv("data.csv", header = TRUE, sep = ";")
ggplot(data, aes(x = strain, y = stress)) + geom_point() + geom_smooth()
In the figure, you can see that the blue average curve does not fit the actual curves near their end points, probably due to the vertical sections. That's why I want to fix it to the mean end point. Additionally, I would like to fix it to the origin (0, 0), as the blue mean curve starts somewhere above it, which does not match the actual behaviour.

Forest plot from cox object

Please be tolerant :) I am a dummy user of R and I am using the code and sample data to learn how to make forest plot that was shown in the previous post -
Optimal/efficient plotting of survival/regression analysis results
I was wondering whether it is possible to set a user-defined x-axis scale with the code shown there. So far, the x-axis scale is somehow defined automatically.
Thank you for any tips.
I'm unimpressed with the precision of the documentation, since one might assume that the limits argument takes values on the relative risk scale rather than on the log-transformed scale. One gets a ridiculous result if that is done. That quibble notwithstanding, it's relatively easy to use that parameter to create an expanded plot:
install.packages('devtools') # then use it to get current package
# executing the install and load of the package referenced at the top of that answer
print(forest_model(lung_cox, limits=log( c(.5, 50) ) ))
Trying for a lower limit of 0 on the relative risk scale is not sensible, since it would imply a value of -Inf on the log-transformed scale. Trying a very low value, say log(0.001), confuses the pretty printing of the scale in my tests.
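The arithmetic behind this can be checked in base R without the plotting package. The variable names here are made up for illustration.

```r
# Limits chosen on the relative risk scale, then moved to the log scale
rr_limits  <- c(0.5, 50)
log_limits <- log(rr_limits)
log_limits

# Back-transforming recovers the relative risk limits
exp(log_limits)

# A lower limit of 0 on the relative risk scale has no finite log value
log(0)
```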

how to fit baseline/background in R

I am trying to fit the background shape in NMR spectra. For this I have been using the loess function so far.
First I try to identify all the peaks (which works more or less) and remove them from the spectrum. Then I try to fit the rest of the spectrum with the loess function.
My problem now is that if the removal of peaks doesn't work perfectly there are still some points left which are clearly not background.
Is there a way to tell the fit not to go over the data, i.e. having the fitted line always below the data points (which is clearly what you want from a baseline)? My hope is that, if I am able to constrain the fit to be below the data points I can find suitable parameters, so that the remaining points from the peaks are ignored.
Thanks
John
