ggplot trend line is flat - r

I'm trying to plot a trend line along with a 95% confidence interval for my data in this csv file. When I issue this command:
ggplot(trimmed_data, aes(x=week, y=V4)) +
geom_smooth(fill='blue', alpha=.2, color='blue')
I get this plot, which is great:
However, when I use the since_weeks column (which is the correct one I'd like to use), I get a flat line:
ggplot(trimmed_data, aes(x=since_weeks, y=V4)) +
geom_smooth(fill='blue', alpha=.2, color='blue')
The week column has a range of 0-51, while the since_weeks column has a range of 1-52. Essentially I'm just re-ordering the rows.
I get this warning with both plots:
geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
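The warning only says that a default smoother was picked. As a minimal sketch of choosing the smoother explicitly, the code below uses synthetic data (the demo data frame and its values are assumptions standing in for trimmed_data). It also prints the type of the x column, since a smoother needs a numeric x; whether that is the cause of the flat line cannot be told from the question alone.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for trimmed_data: 1200 rows, weeks 1-52
demo <- data.frame(since_weeks = sample(1:52, 1200, replace = TRUE))
demo$V4 <- sin(demo$since_weeks / 8) + rnorm(1200, sd = 0.3)

# Check that x is numeric (not a factor or character column)
str(demo$since_weeks)

# Same smoother the warning reports, but requested explicitly
p_gam <- ggplot(demo, aes(x = since_weeks, y = V4)) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),
              fill = "blue", alpha = .2, color = "blue")

# Alternative: a loess smoother instead of a GAM
p_loess <- ggplot(demo, aes(x = since_weeks, y = V4)) +
  geom_smooth(method = "loess", formula = y ~ x,
              fill = "blue", alpha = .2, color = "blue")
```

Printing p_gam or p_loess draws the plot; specifying method and formula also silences the message.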

Related

Plot bin averaged values with error bars in R

I have a dataframe with three columns "DateTime", "T_ET", and "LAI". I want to plot T_ET (on y-axis) against LAI (on x-axis) along with 0.1-bin LAI averaged values of T_ET on the same plot something like below (Wei et al., 2017):
In the figure above, the y-axis is T_ET, or T/(E+T), and the x-axis is LAI. The red open diamonds with error bars are the 0.1-bin LAI averages of the black points and their standard deviations; the solid line is a regression of the individual data points (estimated from the bin averages); n is the number of available data points. The dashed lines are 95% confidence bounds.
How can I obtain a plot similar to the one above? Please find the sample data using the following link: file
or use following sample data:
df <- structure(list(DateTime = structure(c(1478088000, 1478347200, 1478692800, 1478779200, 1478865600, 1478952000, 1479124800, 1479211200, 1479297600, 1479470400), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
T_ET = c(0.996408350852751, 0.904748351479432, 0.28771236118773, 0.364402232484906, 0.452348409759872, 0.415408041501318, 0.629291202120187, 0.812083112145703, 0.992414777441755, 0.818032913071265),
LAI = c(1.3434, 1.4669, 1.6316, 1.6727, 1.8476, 2.0225, 2.3723, 2.5472, 2.7221, 3.0719)),
row.names = c(NA, 10L),
class = "data.frame")
You can do this directly while plotting via stat_summary_bin(). By default, it uses the pointrange geom and mean_se() as the summary function. bins= controls the number of bins, but you can also supply binwidth=. Note that with the pointrange geom, fatten controls the size of the central point:
ggplot(df, aes(LAI, T_ET)) + geom_point() + theme_classic() +
stat_summary_bin(bins=3, color='red', shape=5, fatten=5)
Your sample data is a little light, so here's another example via the diamonds dataset. Here, I'm constructing the same look as the example plot you show by combining the errorbar and point geoms. Please note that apparently setting the width of the errorbar doesn't work correctly with stat_summary_bin().
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
theme_classic()
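The example figure in the question uses the standard deviation for the error bars, whereas mean_se() gives the standard error. A minimal sketch of one way to get mean +/- 1 SD per 0.1-wide LAI bin is below. The helper mean_sd is a hypothetical name (not part of ggplot2), and df_demo is synthetic data standing in for the real file.

```r
library(ggplot2)

# Hypothetical helper: returns mean and mean +/- 1 SD in the
# y / ymin / ymax layout that fun.data expects
mean_sd <- function(v) {
  data.frame(y = mean(v), ymin = mean(v) - sd(v), ymax = mean(v) + sd(v))
}

# Synthetic stand-in for the T_ET vs LAI data (assumption)
set.seed(10)
df_demo <- data.frame(LAI = runif(300, 1, 3))
df_demo$T_ET <- pmin(1, pmax(0, 0.3 + 0.2 * df_demo$LAI + rnorm(300, sd = 0.1)))

p <- ggplot(df_demo, aes(LAI, T_ET)) +
  geom_point(size = 0.5) +
  stat_summary_bin(fun.data = mean_sd, geom = "errorbar",
                   binwidth = 0.1, color = "red") +
  stat_summary_bin(fun.data = mean_sd, geom = "point",
                   binwidth = 0.1, shape = 5, color = "red") +
  theme_classic()
```

Bins that contain a single point have no defined SD, so their error bars are dropped with a warning.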
EDIT: Showing Regression for Binned Data
As indicated in the comments, drawing a regression line based on the binned data and not the original data is possible, but not through the stat_summary_bin() function unless you are okay with using loess. If you're looking for linear regression, you'll need to bin the data outside of ggplot, then plot the regression on the binned data.
The reason for this is probably by design. It's inherently not a good idea to draw a regression line (a way of summarizing data) that is based on summarized data. Regardless, here's one way to do this via the diamonds dataset. We can use the cut() function to cut into separate bins, then summarize the data on those binned values. Due to the way the cut() function labels the output, we have to create our own labels. Since we're cutting into 12 equal pieces in this example, I'm creating 12 evenly-spaced positions on the x axis for our data values to sit into - this may be different in your case, just take care you label according to what the data represents and what makes the most statistical sense.
df <- diamonds
# setting interval labeling
bin_width <- diff(range(df$carat)/12)
bin_labels <- c((range(df$carat)[1] + (bin_width/2))+(0:11*bin_width))
# cutting the data
df$bins <- cut(df$carat, breaks=12, labels=bin_labels)
df$bins <- as.numeric(levels(df$bins)[df$bins]) # convert factor to numeric
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
geom_smooth(data=df, aes(x=bins), method='lm', color='blue') +
theme_classic()
Note that the regression line above weights all binned values equally, regardless of how many points fall in each bin. This is generally not a good idea unless your data points are spread evenly across the bins. If you're going to draw a regression line, I'd still recommend linking it to the original data, which is much more representative of the reality within your data. That would look like this:
ggplot(diamonds, aes(carat, price)) + geom_point(size=0.3) +
stat_summary_bin(geom='errorbar', color='red', bins=12, width=0.001) +
stat_summary_bin(geom='point', size=3, shape=5, color='red', bins=12) +
geom_smooth(method='lm', color='green') +
theme_classic()
When it comes down to it, drawing a regression line for binned data is summarizing the summarized data rather than summarizing your original data. It's statistical heresy, so use at your own risk. But if you simply must for whatever strange reason... I can't stop you. ;)
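To make the weighting concern concrete, here is a minimal base-R sketch that bins the data, then compares an unweighted fit on the bin means with one weighted by the number of points per bin. The synthetic data and all variable names are assumptions, not taken from the answer above.

```r
set.seed(1)
# Synthetic, skewed data: most points sit at small x (assumption)
x <- rexp(2000, rate = 1)
y <- 2 + 0.5 * x + rnorm(2000, sd = 0.3)

# Cut x into 12 equal-width bins and summarise each bin
bins <- cut(x, breaks = 12)
binned <- data.frame(
  x_mean = as.vector(tapply(x, bins, mean)),
  y_mean = as.vector(tapply(y, bins, mean)),
  n      = as.vector(table(bins))
)
binned <- binned[binned$n > 0, ]  # drop empty bins

# Every bin counts the same, however few points it holds
fit_unweighted <- lm(y_mean ~ x_mean, data = binned)

# Bins count in proportion to the number of points they summarise
fit_weighted <- lm(y_mean ~ x_mean, data = binned, weights = n)

coef(fit_unweighted)
coef(fit_weighted)
```

With skewed data like this, the sparse bins at the high end pull on the unweighted line far more than their point counts justify; weighting by n reduces that effect.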

Trouble with fitting a power curve in ggplot2

I have a dataset in r with the following columns:
> names(dataset)
[1] "Corp.Acct.Name" "Product-name" "Package.Type" "Total.Quantity" "ASP.Ex.Works"
What I am trying to do is create a scatter plot with Total.Quantity on the x axis and ASP.Ex.Works on the y axis, and then fit a power curve to the scatterplot.
I have tried the following using stat_smooth:
p <- ggplot(data = dataset, # specify dataset
aes(x = Total.Quantity, y = ASP.Ex.Works)) + # Quantity on x, ASP on Y
geom_point(pch = 1) + # plot points (pch = 1: circles, type '?pch' for other options)
xlim(0, xlimmax) +
ylim(0, ylimmax) +
xlab("Quantity (lbs)") +
ylab("Average Sale Price Ex Freight ($)") +
#Add line using non-linear regression
stat_smooth(method="nls",formula = ASP.Ex.Works ~a*exp(-Total.Quantity*b),method.args=list(start=c(a=2,b=2)),se=F,color="red")
p
but am thrown the following error:
Warning message: Computation failed in stat_smooth(): parameters
without starting value in 'data': ASP.Ex.Works, Total.Quantity
I have tried several different methods, including specifying the model outside of ggplot, but haven't had any luck. I am trying to recreate excel's power curve option in r for a dynamic visual in Power BI.
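One likely cause: inside stat_smooth the formula is evaluated against the aesthetics x and y, so nls does not find ASP.Ex.Works or Total.Quantity in its data and treats them as parameters without starting values. Also, a*exp(-x*b) is an exponential curve, while a power curve has the form a*x^b. A minimal sketch under those assumptions follows; the synthetic dataset and the way the starting values are derived are illustrations, not the asker's data or method.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for the asker's dataset, generated from y = 50 * x^(-0.3)
dataset <- data.frame(Total.Quantity = runif(200, 1, 1000))
dataset$ASP.Ex.Works <- 50 * dataset$Total.Quantity^(-0.3) *
  exp(rnorm(200, sd = 0.05))

# Starting values from a log-log linear fit: log(y) = log(a) + b * log(x)
start_fit <- lm(log(ASP.Ex.Works) ~ log(Total.Quantity), data = dataset)
start_vals <- list(a = exp(coef(start_fit)[[1]]), b = coef(start_fit)[[2]])

p <- ggplot(dataset, aes(x = Total.Quantity, y = ASP.Ex.Works)) +
  geom_point(pch = 1) +
  # The formula uses y and x (the aesthetics), not the column names
  stat_smooth(method = "nls", formula = y ~ a * x^b,
              method.args = list(start = start_vals),
              se = FALSE, color = "red")
```

se = FALSE is needed because predict() for nls models does not return standard errors. The log-log fit requires strictly positive x and y values.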

R : stat_smooth groups (x axis)

I have a Database, and want to show a figure using stat_smooth.
I can show the avg_time vs Scored_Probabilities figure, which looks like this:
c <- ggplot(dataset1, aes(x=Avg.time, y=Scored.Probabilities))
c + stat_smooth()
But when changing Avg.time to time or Age, an error occurs:
c <- ggplot(dataset1, aes(x=Age, y=Scored.Probabilities))
c + stat_smooth()
error: geom_smooth: Only one unique x value each group. Maybe you want aes(group = 1)?
How could I fix it?
The error message says to set group=1, but doing that gives another error:
ggplot(dataset1, aes(x=Age, y=Scored.Probabilities, group=1))+stat_smooth()
geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Error in smooth.construct.cr.smooth.spec(object, data, knots) :
x has insufficient unique values to support 10 knots: reduce k.
Now the number of unique x values is not enough for the default smoother.
So there are two solutions: i) use another summary function such as the mean, or ii) use jitter to shift Age slightly.
ggplot(dataset1, aes(x=Age, y=Scored.Probabilities, group=1))+
geom_point()+
stat_summary(fun.y=mean, colour="red", geom="line", size = 3) # draw a mean line in the data
Or
ggplot(dataset1, aes(x=jitter(as.numeric(as.character(Age))), y=Scored.Probabilities, group=1))+
geom_point()+stat_smooth()
Note the use of as.numeric because Age is a factor.
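The second error message itself suggests a third option: reduce k, the basis dimension of the spline. A minimal sketch, assuming Age has already been converted to numeric and using synthetic data (the demo data frame is an assumption, not the asker's dataset1):

```r
library(ggplot2)

set.seed(7)
# Hypothetical stand-in: only 6 unique Age values, more than 1000 rows
demo <- data.frame(Age = sample(c(20, 30, 40, 50, 60, 70), 1500, replace = TRUE))
demo$Scored.Probabilities <- plogis((demo$Age - 45) / 15 + rnorm(1500, sd = 0.5))

p <- ggplot(demo, aes(x = Age, y = Scored.Probabilities)) +
  geom_point(alpha = 0.1) +
  # k must not exceed the number of unique x values (6 here)
  stat_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 4))
```

The value k = 4 is an arbitrary choice for this example; pick one that is no larger than the number of distinct Age values in your data.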

ggplot error on aesthetics on predicted values of Regression in R

I am trying to plot a graph of predicted values in ggplot. The script is shown below:
Program1
lumber.predict.plm1=lm(lumber.1980.2000 ~ scale(woman.1980.2000) +
I(scale(woman.1980.2000)^2), data=lumber.unemployment.women)
xmin=min(lumber.unemployment.women$woman.1980.2000)
xmax=max(lumber.unemployment.women$woman.1980.2000)
predicted.lumber.all=data.frame(woman.1980.2000=seq(xmin,xmax,length.out=100))
predicted.lumber.all$lumber=predict(lumber.predict.plm1,newdata=predicted.lumber.all)
lumber.predict.plot=ggplot(lumber.unemployment.women,mapping=aes(x=woman.1980.2000,
y=lumber.1980.2000)) +
geom_point(colour="red") +
geom_line(data=predicted.lumber.all,size=1)
lumber.predict.plot
Error: Aesthetics must either be length one, or the same length as the dataProblems:woman.1980.2000
I believe we do not need to match the number of observations in the base dataset with the number in the predicted-values dataset. The same logic/program works when I try it on the 'cars' dataset.
speed.lm = lm(speed ~ dist, data = cars)
xmin=10
xmax=120
new = data.frame(dist=seq(xmin,xmax,length.out=200))
new$speed=predict(speed.lm,newdata=new,interval='none')
sp <- ggplot(cars, aes(x=dist, y=speed)) +
geom_point(colour="grey40") + geom_line(data=new, colour="green", size=.8)
The above code works fine.
Unable to figure out the problem with my first program.
You should use the same y variable name in the predicted data. Change this line
predicted.lumber.all$lumber=
predict(lumber.predict.plm1,newdata=predicted.lumber.all)
to this one:
predicted.lumber.all$lumber.1980.2000= ## very bad variable name!
predict(lumber.predict.plm1,newdata=predicted.lumber.all)
Or override the y aesthetic in the geom_line call:
geom_line(data=predicted.lumber.all,aes(y=lumber),
size=1)
The basic problem is that in your code,
...
geom_line(data=predicted.lumber.all,size=1)
...
ggplot does not know which column from predicted.lumber.all to use for y. As @agstudy says, you can specify this with aes(...) in geom_line:
...
geom_line(data=predicted.lumber.all, aes(y=lumber), size=1)
...
Since you're just plotting the regression curve, you could accomplish the same thing with less code using stat_smooth. Note that its formula must be written in terms of the aesthetics x and y, not the column names:
df <- lumber.unemployment.women
model <- y ~ scale(x) + I(scale(x)^2) # x and y refer to the mapped aesthetics
ggplot(df, aes(x=woman.1980.2000, y=lumber.1980.2000)) +
geom_point(color="red") +
stat_smooth(formula=model, method="lm", se=TRUE, color="green", size=0.8)
Note that se=TRUE gives you the confidence limits on the regression curve.

Plotting average of multiple variables in time-series using ggplot

I have a file which contains time-series data for multiple variables from a to k.
I would like to create a graph that plots the average of the variables a to k over time and above and below that average line adds a smoothed area representing maximum and minimum variation on each day.
So something like confidence intervals but in a smoothed version.
Here's the dataset:
https://dl.dropbox.com/u/22681355/co.csv
and here's the code I have so far:
library(ggplot2)
library(reshape2)
meltdf <- melt(df,id="Year")
ggplot(meltdf,aes(x=Year,y=value,colour=variable,group=variable)) + geom_line()
This depicts bootstrapped 95 % confidence intervals:
ggplot(meltdf,aes(x=Year,y=value,colour=variable,group=variable)) +
stat_summary(fun.data = "mean_cl_boot", geom = "smooth")
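This depicts the mean of all values of all variables +-1SD (mean_sdl requires the Hmisc package; in current ggplot2 versions its arguments are passed via fun.args):
ggplot(meltdf,aes(x=Year,y=value)) +
stat_summary(fun.data ="mean_sdl", fun.args = list(mult=1), geom = "smooth")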
You might want to calculate the year means before calculating the means and SD over the variables, but I leave that to you.
However, I believe a bootstrap confidence interval would be more sensible, since the distribution is clearly not symmetric. It would also be narrower. ;)
And of course you could log-transform your values.
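If the band should show the daily minimum and maximum across variables rather than a confidence interval, as the question describes, one option is a custom summary function. This is a minimal sketch with synthetic data; mean_range is a hypothetical helper name and the generated meltdf only imitates the long format produced by melt(). The band here is not smoothed.

```r
library(ggplot2)

set.seed(3)
# Hypothetical long-format data: variables a to k, one value per year
meltdf <- expand.grid(Year = 1990:2010, variable = letters[1:11])
meltdf$value <- rlnorm(nrow(meltdf), meanlog = 1, sdlog = 0.5)

# Hypothetical helper: mean with min/max as the band limits
mean_range <- function(v) {
  data.frame(y = mean(v), ymin = min(v), ymax = max(v))
}

p <- ggplot(meltdf, aes(x = Year, y = value)) +
  stat_summary(fun.data = mean_range, geom = "ribbon", alpha = 0.2) +
  stat_summary(fun.data = mean_range, geom = "line")
```

Using the ribbon and line geoms separately keeps the band and the mean line independent of how a given ggplot2 version draws geom = "smooth".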
