The difference between geom_density in ggplot2 and density in base R - r

I have data in R like the following:
bag_id location_type event_ts
2 155 sorter 2012-01-02 17:06:05
3 305 arrival 2012-01-01 07:20:16
1 155 transfer 2012-01-02 15:57:54
4 692 arrival 2012-03-29 09:47:52
10 748 transfer 2012-01-08 17:26:02
11 748 sorter 2012-01-08 17:30:02
12 993 arrival 2012-01-23 08:58:54
13 1019 arrival 2012-01-09 07:17:02
14 1019 sorter 2012-01-09 07:33:15
15 1154 transfer 2012-01-12 21:07:50
where class(event_ts) is POSIXct.
I want to find the density of bags at each location at different times.
I used geom_density() from ggplot2 and the plot looks very nice. I wonder whether there is any difference between density() in base R and this function, for example in the method they use or in the default bandwidth.
I need to add the densities to my data frame. If I had used density(), I would know how to use approxfun() to add these values to my data frame, but I wonder whether the same approach works with geom_density().

A quick perusal of the ggplot2 documentation for geom_density() reveals that it wraps the functionality in stat_density(). The first argument documented there is adjust, which refers to the adjust parameter of the base function density(). So, to answer your direct question: they are built on the same function, though the exact parameters used may differ. You have some control over those parameters, but you may not get all the flexibility you want.
One alternative to using geom_density() is to calculate the density that you want outside of ggplot() and then plot it with geom_line(). For example:
library(ggplot2)
#100 random variables
x <- data.frame(x = rnorm(100))
#Calculate own density, set parameters as you desire
d <- density(x$x)
x2 <- data.frame(x = d$x, y = d$y)
#Using geom_density()
ggplot(x, aes(x)) + geom_density()
#Using home grown density
ggplot(x2, aes(x,y)) + geom_line(colour = "red")
Here, they give nearly identical plots, though they may vary more significantly with your data and your settings.
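To attach the densities to the data frame, as the question asks, one option is to interpolate either estimate with approxfun(). The sketch below is not from the original answer: it uses made-up normal data, and the column names dens_base and dens_gg are invented for illustration. It relies on ggplot_build() to pull out the values that geom_density() computed, so the two estimates can be compared at the observed points.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = rnorm(100))  # made-up data, stands in for your own variable

# Base R estimate with default settings, interpolated back onto the observations
d <- density(df$x)
f_base <- approxfun(d$x, d$y)
df$dens_base <- f_base(df$x)

# Values computed by geom_density(), extracted from the built plot
p <- ggplot(df, aes(x)) + geom_density()
gd <- ggplot_build(p)$data[[1]]
# rule = 2 guards against NA from rounding at the ends of the evaluation grid
f_gg <- approxfun(gd$x, gd$y, rule = 2)
df$dens_gg <- f_gg(df$x)

# Largest disagreement between the two estimates at the observed points
max(abs(df$dens_base - df$dens_gg))
```

If the two columns agree closely on your data, either route should do; if they do not, computing density() yourself keeps every setting explicit.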

Related

Scatterplot with different colors for each filtered dataset

I am working on a project that tries to evaluate the driving efficiency of bus drivers. The data I am working with looks like this:
Driver fvec_aprox_md
1 2561
2 1245
2 2315
2 1264
3 1256
3 1235
1 2145
2 3265
5 2121
9 1256
5 1785
46 1945
2 1261
3 1245
So I would like to make a scatterplot that shows fvec_aprox_md just for some specific drivers (for instance, drivers 1 and 2), with each driver's data in a different color.
I have just started to learn R, and so far I have only been able to get a plot like this:
I also added a filter, but it only prints the filtered data to the console and does not affect the scatterplot. This is the code I have used:
library(ggplot2)
library(dplyr)
filter(Ruta268, num_conductor==4327)
b<-ggplot(Ruta268, aes(y=Ruta268$fvec_aprox_md, x=seq(1,length(Ruta268$fvec_aprox_md)), group=num_conductor, color=num_conductor))
b + geom_point() +
geom_smooth()
If you'd like a colour code for each driver, you need to put the colour argument inside the aesthetics, aes(). In ggplot2, whenever you want a layer to refer to something within your data, you need to put it into aes(). When you just want to set a fixed value for a geom layer, independent of your data, you pass the argument outside aes(), such as geom_point(colour = 'red').
This is also why your geom_smooth() does not work. The group argument needs to be in the aesthetics as well, so that geom_smooth() can find the data and fit a smooth line for each of your drivers.
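# Refer to columns by bare name inside aes(); the Ruta268$ prefix is not needed there
b <- ggplot(Ruta268,
            aes(y = fvec_aprox_md,
                x = seq_along(fvec_aprox_md),
                colour = factor(num_conductor),  # factor() gives one discrete colour per driver
                group = num_conductor))
b + geom_point() +
  geom_smooth()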
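The difference between mapping an aesthetic and setting it can be seen on a built-in dataset. This is a small illustrative sketch using mtcars rather than the asker's data; the object names p_mapped and p_fixed are invented for the example.

```r
library(ggplot2)

# Mapped: colour is taken from a column, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Set: one fixed colour for the whole layer, so it goes outside aes()
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "red")

# Inspect the colours ggplot2 actually assigned to the points
table(ggplot_build(p_mapped)$data[[1]]$colour)
unique(ggplot_build(p_fixed)$data[[1]]$colour)
```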
If you're new to R, I strongly recommend this open book written by the tidyverse developers to get you started more smoothly.
And if you only want a few drivers out of your whole dataset, you should filter them beforehand, as @s_t suggested, and pass the filtered data to ggplot() instead of Ruta268.
filtered.df <- Ruta268 %>%
filter(num_conductor == 4327)
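The question asked for several drivers at once (drivers 1 and 2), which can be sketched with %in%. The data frame below is rebuilt from the sample table in the question, under the assumption that its Driver column corresponds to num_conductor in the code; the obs column is a hypothetical helper that records the original row order.

```r
library(ggplot2)
library(dplyr)

# Sample rows from the question (assumes Driver == num_conductor)
Ruta268 <- data.frame(
  num_conductor = c(1, 2, 2, 2, 3, 3, 1, 2, 5, 9, 5, 46, 2, 3),
  fvec_aprox_md = c(2561, 1245, 2315, 1264, 1256, 1235, 2145,
                    3265, 2121, 1256, 1785, 1945, 1261, 1245)
)

filtered.df <- Ruta268 %>%
  mutate(obs = row_number()) %>%          # keep the original observation index
  filter(num_conductor %in% c(1, 2))      # keep only drivers 1 and 2

p <- ggplot(filtered.df,
            aes(x = obs, y = fvec_aprox_md, colour = factor(num_conductor))) +
  geom_point() +
  labs(colour = "Driver")
p
```

Assigning the filtered result to a new object is what makes the filter affect the plot; calling filter() on its own only prints the result.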

Plotting different y-axis scaling using ggplot facet_grid()?

I'm running into trouble plotting some data on two separate y-scales. Here are two visualizations of some air quality data I've been working with. The first figure shows each pollutant on a parts per billion y-scale. In this figure, co dominates the y-axis, and the variation in the other pollutants is not properly represented. Within air quality science, the pollutant co is conventionally reported in parts per million instead of parts per billion. The second figure shows the same no, no2, and o3 data, but with the co concentration converted from ppb to ppm (divided by 1000). However, while no, no2, and o3 look better, the variation in co is now poorly represented.
Is there an easy way using ggplot() to normalize the scale of the y-axis and best represent each type of pollutant? I'm also trying to work through some other examples that use gridExtra to stitch together two separate plots, each retaining its original y-scale.
The data required to generate these figures is huge (26,295 observations), so I'm still working on a reproducible example. Hopefully a solution can be found within the ggplot() code shown below:
plt <- ggplot(df, aes(x=date, y = value, color = pollutant)) +
geom_point() +
facet_grid(id~pollutant, labeller = label_both, switch = "y")
plt
Here's what the head(df) looks like (before converting the co to ppm):
date id pollutant value
1 2017-06-16 10:00:00 Pohl co 236.00
2 2017-06-16 10:00:00 Pohl no 23.06
3 2017-06-16 10:00:00 Pohl no2 12.05
4 2017-06-16 10:00:00 Pohl o3 8.52
5 2017-06-16 11:00:00 Pohl co 207.00
6 2017-06-16 11:00:00 Pohl no 20.82
Marius pointed out that including scales = "free_y" in the facet_grid() function would provide the desired output. Thanks!
Solution:
plt <- ggplot(df, aes(x=date, y = value, color = pollutant)) +
geom_point() +
facet_grid(pollutant~id, scales = "free_y", labeller = label_both, switch = "y")
plt
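Note that the solution also swaps the facet formula from id~pollutant to pollutant~id. In facet_grid(), a free y scale varies by row only, so pollutant has to be the row variable for each pollutant to get its own axis. A minimal sketch using only the six rows shown in head(df) (the UTC time zone is an assumption):

```r
library(ggplot2)

# The six rows printed by head(df) in the question; tz = "UTC" is assumed
df <- data.frame(
  date = as.POSIXct(c(rep("2017-06-16 10:00:00", 4),
                      rep("2017-06-16 11:00:00", 2)), tz = "UTC"),
  id = "Pohl",
  pollutant = c("co", "no", "no2", "o3", "co", "no"),
  value = c(236.00, 23.06, 12.05, 8.52, 207.00, 20.82)
)

# pollutant in rows: each row of panels gets its own y scale
plt <- ggplot(df, aes(x = date, y = value, color = pollutant)) +
  geom_point() +
  facet_grid(pollutant ~ id, scales = "free_y",
             labeller = label_both, switch = "y")
plt
```

If a grid layout is not required, facet_wrap(~pollutant, scales = "free_y") is another option, since it gives every panel its own y axis.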
Output:

R ggplot multiple series curved line

I am plotting multiple series of data on one plot.
I have data that looks like this:
count_id AMV Hour duration_in_traffic AMV_norm
1 16012E 4004 14 99 0
2 16012E 4026 12 94 22
3 16012E 4099 15 93 95
4 16012E 4167 11 100 163
5 16012E 4239 10 97 235
I am plotting in R using:
ggplot(td_results, aes(AMV,duration_in_traffic)) + geom_line(aes(colour=count_id))
This is giving me:
However, rather than straight lines linking the points, I would like curved lines.
I found the following question, Equivalent of curve() for ggplot, but got an unexpected output.
I used: ggplot(td_results, aes(AMV,duration_in_traffic)) + geom_line(aes(colour=count_id)) + stat_function(fun=sin)
Thus giving:
How can I get a curve with some form of higher order polynomial?
As @MrFlick mentions in the comments, there are serious statistical ways of getting curved lines, which are probably off topic here.
If you just want your graph to look nicer, however, you could try interpolating your data with spline(), then adding the result as another layer.
First we make some spline data, using 10 times the number of data points you had (you can increase or decrease this as desired):
library(dplyr)
dat2 <- td_results %>% select(count_id, AMV, duration_in_traffic) %>%
group_by(count_id) %>%
do(as.data.frame(spline(x= .[["AMV"]], y= .[["duration_in_traffic"]], n = nrow(.)*10)))
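To see what spline() contributes on its own, here is a small sketch on the five sample rows from the question, treated as a single series. The names td and smooth are invented for the example.

```r
# The five sample rows from the question (one count_id)
td <- data.frame(
  AMV = c(4004, 4026, 4099, 4167, 4239),
  duration_in_traffic = c(99, 94, 93, 100, 97)
)

# spline() returns a list with components x and y, evaluated at n
# equally spaced points across the range of the input x values
sp <- spline(x = td$AMV, y = td$duration_in_traffic, n = nrow(td) * 10)
smooth <- as.data.frame(sp)

str(smooth)
```

The result is an interpolating curve through the original points, not a statistical fit, so it only changes how the line looks.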
Then we plot, using your original data for points, but then using lines from the spline data (dat2):
library(ggplot2)
ggplot(td_results, aes(AMV, duration_in_traffic)) +
geom_point(aes(colour = factor(count_id))) +
geom_line(data = dat2, aes(x = x, y = y, colour = factor(count_id)))
This gives me the following graph from your test data:

Using ggplot, how can I add a point at a time in a loop and then connect them all by a line after the loop?

I am going to be dealing with a lot of data, and I was thinking of plotting one part of the data at a time using a loop.
Here's a sample of the data:
Department Period Sales
1005 1 3354.256
1005 1 5587.164
1005 2 3946.055
1005 2 5739.555
1005 3 3990.139
1005 3 6208.411
1005 4 3887.84
1005 4 6397.811
1008 1 4014.629
1008 1 5370.781
1008 2 4311.249
1008 2 5403.442
1008 3 4028.125
1008 3 6660.305
1008 4 4564.816
My initial idea was to plot one point at a time and then connect the points with a line after exiting the loop.
gp <- ggplot()
for (i in 1:4) {
dat <- qdat[qdat$Period == i,]
gp <- gp + stat_summary(data = dat , aes(x=Period , y=Sales), geom="point", fun.y="sum")
print(gp)
}
final_plot <- gp + geom_line()
However, I only get the points, but am not able to generate any lines connecting the points.
Ideally, I would also like to know if it's possible to plot different line segments at a time to make one continuous line using a loop.
Thanks a lot!!
As was pointed out in the comments, with ggplot you should not add points to the plot one by one. It is easy to plot what you want in one step. I assume (tell me if I'm wrong) that you want Period on the x-axis and the sum of all sales belonging to that period on the y-axis. This can be done as follows.
First, I use aggregate() to sum up the sales per period:
plot.data <- aggregate(Sales~Period,data=qdat,FUN=sum)
With this data set, the plot can be done in a single line:
ggplot(plot.data,aes(x=Period,y=Sales)) + geom_point() + geom_line()
Note that I use geom_point() and geom_line() in order to get points connected by a line. Using the data sample you gave, I get the following picture:
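For reference, the same steps run end to end on the sample rows from the question. The second plot is an alternative that is not part of the original answer: it lets stat_summary() do the summing, and assumes a ggplot2 version where the argument is named fun (older versions used fun.y, as in the question).

```r
library(ggplot2)

# Sample rows from the question
qdat <- data.frame(
  Department = rep(c(1005, 1008), times = c(8, 7)),
  Period = c(1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4),
  Sales = c(3354.256, 5587.164, 3946.055, 5739.555, 3990.139, 6208.411,
            3887.84, 6397.811, 4014.629, 5370.781, 4311.249, 5403.442,
            4028.125, 6660.305, 4564.816)
)

# Sum the sales per period, then plot points connected by a line
plot.data <- aggregate(Sales ~ Period, data = qdat, FUN = sum)
p1 <- ggplot(plot.data, aes(x = Period, y = Sales)) +
  geom_point() +
  geom_line()

# Alternative: summarise inside ggplot2, once for points and once for the line
p2 <- ggplot(qdat, aes(x = Period, y = Sales)) +
  stat_summary(fun = sum, geom = "point") +
  stat_summary(fun = sum, geom = "line")
```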

set x/y limits in facet_wrap with scales = 'free'

I've seen similar questions asked, and this discussion about adding functionality to ggplot: Setting x/y lim in facet_grid. In my research I often want to produce several panel plots, say for different simulation trials, where the axis limits remain the same to highlight differences between the trials. This is especially useful when showing the panel plots in a presentation. In each panel plot I produce, the individual plots require independent y axes, as they are often weather variables: temperature, relative humidity, wind speed, etc. Using
ggplot() + ... + facet_wrap(~ ..., scales = 'free_y')
works great as I can easily produce plot panels of different weather variables.
When I compare different panel plots, it's nice to have consistent axes. Unfortunately, ggplot provides no way of setting the individual limits of each plot within a panel plot. It defaults to using the range of the given data. The Google Group discussion linked above discusses this shortcoming, but I was unable to find any updates as to whether this could be added. Is there a way to trick ggplot into setting the individual limits?
A first suggestion that somewhat sidesteps the solution I'm looking for is to combine all my data into one data table and use facet_grid on my variable and simulation
ggplot() + ... + facet_grid(variable~simulation, scales = 'free_y')
This produces a fine looking plot that displays the data in one figure, but can become unwieldy when considering many simulations.
To 'hack' the plotting into producing what I want, I first determined which limits I desired for each weather variable. These limits were found by looking at the greatest extents for all simulations of interest. Once determined I created a small data table with the same columns as my simulation data and appended it to the end. My simulation data had the structure
'year' 'month' 'variable' 'run' 'mean'
1973 1 'rhmax' 1 65.44
1973 2 'rhmax' 1 67.44
... ... ... ... ...
2011 12 'windmin' 200 0.4
So I created a new data table with the same columns
ylims.sims <- data.table(year = 1, month = 13,
variable = rep(c('rhmax','rhmin','sradmean','tmax','tmin','windmax','windmin'), each = 2),
run = 201, mean = c(20, 100, 0, 80, 100, 350, 25, 40, 12, 32, 0, 8, 0, 2))
Which gives
'year' 'month' 'variable' 'run' 'mean'
1 13 'rhmax' 201 20
1 13 'rhmax' 201 100
1 13 'rhmin' 201 0
1 13 'rhmin' 201 80
1 13 'sradmean' 201 100
1 13 'sradmean' 201 350
1 13 'tmax' 201 25
1 13 'tmax' 201 40
1 13 'tmin' 201 12
1 13 'tmin' 201 32
1 13 'windmax' 201 0
1 13 'windmax' 201 8
1 13 'windmin' 201 0
1 13 'windmin' 201 2
While the choice of year and run is arbitrary, month needs to be a value outside 1:12. I then appended this to my simulation data
sim1data.ylims <- rbind(sim1data, ylims.sims)
ggplot() + geom_boxplot(data = sim1data.ylims, aes(x = factor(month), y = mean)) +
facet_wrap(~variable, scales = 'free_y') + xlab('month') +
xlim('1','2','3','4','5','6','7','8','9','10','11','12')
When I plot these data with the y limits, I limit the x-axis values to those in the original data. The appended data table with the y limits has month values of 13. As ggplot still scales the axes to the entire dataset, even when the axes are limited, this gives me the y limits I desire. Note that this will not work if there are data values beyond the limits you specify.
Before: Notice the differences in the y limits for each weather variable between the panels.
After: Now the y limits remain consistent for each weather variable between the panels.
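A related technique, not the one used above, is to pass the limits table to geom_blank(), which draws nothing but still takes part in scale training. This avoids appending dummy rows to the real data. The sketch below uses made-up simulation values and only two variables; sim and ylims are hypothetical names.

```r
library(ggplot2)

set.seed(1)
# Made-up simulation output: two weather variables, 12 months, 20 runs
sim <- expand.grid(month = 1:12, run = 1:20, variable = c("tmax", "rhmax"))
sim$mean <- ifelse(sim$variable == "tmax",
                   rnorm(nrow(sim), mean = 30, sd = 2),
                   rnorm(nrow(sim), mean = 60, sd = 5))

# Desired limits: two rows per variable (lower and upper)
ylims <- data.frame(variable = rep(c("tmax", "rhmax"), each = 2),
                    mean = c(25, 40, 20, 100))

p <- ggplot(sim, aes(x = factor(month), y = mean)) +
  geom_boxplot() +
  # Invisible layer: only its y values matter, so x is not inherited
  geom_blank(data = ylims, aes(y = mean), inherit.aes = FALSE) +
  facet_wrap(~variable, scales = "free_y") +
  xlab("month")
p
```

As with the appended-rows approach, this can only widen a panel's range; data values beyond the chosen limits will still stretch the axis.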
I hope to edit this post in the coming days and add a reproducible example for better explanation. Please comment if you've heard anything about adding this functionality to ggplot.
