Using 'difftime' data in a ggplot2 boxplot in R - r

I created a difftime object to determine the number of hours it takes to report a crime that has occurred. In the same dataset I also have a variable which indicates whether the crime occurred on a weekday or on the weekend. Now I'd like to create a ggplot2 boxplot with 'weekday' and 'weekend' on the x-axis and the difftime values on the y-axis.
I used:
ggplot(data = data, aes(x = workday, y = difftime_var)) +
geom_boxplot()
However, this gives the warning: Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
I'd like to adjust the boxplot in such a way that it looks like a 'real' boxplot, showing the mean amount of time it takes etc. Right now, it's basically a flat line at the bottom of the graph with a few dots above. The y-axis goes from 0 to 40,000, probably because the min and max values of the difftime object are very small / large.
Thanks in advance for helping out!

Please provide a reproducible example dataset with your question.
I guess the problem is that difftime_var has a huge range, so the box is squashed flat against the bottom of the plot. The first thing you can try is hiding the outlier points:
ggplot(data = data, aes(x = workday, y = difftime_var)) +
geom_boxplot(outlier.shape=NA)
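Another (less elegant) way is to set limits on the y-axis, where ymin and ymax are values you choose. Note that ylim() removes observations outside the limits before the boxplot statistics are computed: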
ggplot(data = data, aes(x = workday, y = difftime_var)) +
geom_boxplot() + ylim(ymin, ymax)
For more information, there was a similar question asked before:
How to remove outliers in boxplot in R?
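The warning in the question appears because ggplot2 has no default scale for difftime values; converting them to a plain number of hours avoids it. Below is a minimal sketch under assumptions: the data frame `crimes`, its simulated values, and the 95% cut-off are all hypothetical, and only the column names `workday` and `difftime_var` come from the question.

```r
library(ggplot2)

# Hypothetical data: right-skewed reporting delays with two extreme values,
# loosely mimicking the 0 to 40,000 range described in the question
set.seed(42)
hours <- rexp(200, rate = 1 / 50)
hours[c(1, 101)] <- c(40000, 35000)
crimes <- data.frame(
  workday = rep(c("weekday", "weekend"), each = 100),
  difftime_var = as.difftime(hours, units = "hours")
)

# Convert difftime to plain numeric hours so a continuous scale applies cleanly
crimes$hours <- as.numeric(crimes$difftime_var, units = "hours")

# Upper zoom limit (assumed choice): the 95th percentile of the delays
upper <- unname(quantile(crimes$hours, 0.95))

p <- ggplot(crimes, aes(x = workday, y = hours)) +
  geom_boxplot(outlier.shape = NA) +
  # Zoom in without discarding data, so the box statistics are unchanged
  coord_cartesian(ylim = c(0, upper)) +
  labs(y = "Hours until report")
```

If all delays are strictly positive, adding scale_y_log10() instead of zooming is another option worth trying.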

Related

Making a geom_bar from a dataframe in R

Background
I have a dataframe, df, of athlete injuries:
df <- data.frame(number_of_injuries = c(1,2,3,4,5,6),
number_of_people = c(73,52,43,12,7,2),
stringsAsFactors=FALSE)
The Problem
I'd like to use ggplot2 to make a bar chart or histogram of this simple data using geom_bar or geom_histogram. Important point: I'm pretty novice with ggplot2.
I'd like something where the x-axis shows bins of the number of injuries (number_of_injuries), and the y-axis shows the counts in number_of_people. Like this (from Excel):
What I've tried
I know this is the most trivial dang ggplot issue, but I keep getting errors or weird results, like so:
ggplot(df, aes(number_of_injuries)) +
geom_bar(stat = "count")
Which yields:
I've been in the tidyverse reference website for an hour at this and I can't crack the code.
The difference between geom_bar and geom_col can be confusing. If you already have the counts, do not count the data again with geom_bar(stat = "count"); otherwise you simply get 1 in every category, because each value of number_of_injuries appears in exactly one row. You want to plot those values as they are with geom_col:
ggplot(df, aes(x = number_of_injuries, y = number_of_people)) + geom_col()
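For completeness, here is a self-contained sketch of the geom_col approach next to an alternative that keeps geom_bar but weights each row by its precomputed total. Wrapping the x variable in factor() and the axis labels are assumptions made here for presentation, not part of the original answer.

```r
library(ggplot2)

df <- data.frame(
  number_of_injuries = c(1, 2, 3, 4, 5, 6),
  number_of_people = c(73, 52, 43, 12, 7, 2)
)

# geom_col() draws the precomputed values directly as bar heights
p_col <- ggplot(df, aes(x = factor(number_of_injuries), y = number_of_people)) +
  geom_col() +
  labs(x = "Number of injuries", y = "Number of people")

# Alternative: let geom_bar() do the counting, but weight each row
# by number_of_people so the bar heights match the totals
p_bar <- ggplot(df, aes(x = factor(number_of_injuries), weight = number_of_people)) +
  geom_bar() +
  labs(x = "Number of injuries", y = "Number of people")
```

Both plots should show the same bar heights; geom_col is usually the more direct choice when the totals already exist.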

Stop graph touching zero in ggplot geom_freqpoly function

I am creating a frequency plot using the geom_freqpoly function in ggplot2. I have a large data set of social media comments across 14 months and am plotting the number of comments for each week of that data. I am using this code, first converting the UTC timestamps to POSIXct and then doing the frequency plot:
ggplot(data = TRP) +
geom_freqpoly(mapping = aes(x = created_utc), binwidth = 604800)
This is creating a plot that looks like this:
However, I want to top and tail the plot, as it touches 'zero' at both the start and end, making it look like there was rapid growth and rapid decline. This is not the case, as this is simply a snapshot of the data, which exists before and after my analysis. The data begins at the 4,000 mark and ends at around 2,000, and I want it represented like that. I have checked the 'pad' argument and have ensured it is set to FALSE.
Any help as to why this may be occurring would be greatly appreciated.
Thanks!
Rather than adjusting the geom_freqpoly to work differently than intended, it might be simpler to calculate the weekly totals yourself and use geom_line:
library(ggplot2); library(lubridate); library(dplyr)
set.seed(1)
df <- data.frame(
datetime = ymd_h(2018010101) + dhours(runif(1000, 0, 14*30*24))
)
df %>%
count(week_count = floor_date(datetime, "1 week")) %>%
ggplot(aes(week_count, n)) +
geom_line()
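One caveat with weekly totals is that the first and last weeks of a snapshot are often only partially covered, so the line can still dip at both ends. A hedged sketch of one way to handle that, reusing the simulated data from the answer (the names `weekly` and `weekly_trimmed` are hypothetical, and dropping the edge weeks is an assumption about what is wanted):

```r
library(ggplot2)
library(lubridate)
library(dplyr)

# Simulated timestamps spread over roughly 14 months
set.seed(1)
df <- data.frame(
  datetime = ymd_h("2018-01-01 01") + dhours(runif(1000, 0, 14 * 30 * 24))
)

# Weekly totals, one row per calendar week
weekly <- df %>%
  count(week_count = floor_date(datetime, "week"))

# Drop the first and last weeks, which are likely incomplete
weekly_trimmed <- weekly %>%
  filter(week_count > min(week_count), week_count < max(week_count))

p <- ggplot(weekly_trimmed, aes(week_count, n)) +
  geom_line()
```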

ggplot2 in R: Calculate percentage and make a graph that might be a geom_area plot

I'm a beginner in R, so please be patient with me if there are very obvious mistakes in my code and for my question! For a homework problem, I am struggling to make what I think is a geom_area plot look like this:
As background, we are using the diamonds dataframe from ggplot2 library. We were given the plot and asked to reproduce it. My biggest problem is with the y-axis. The graph given indicated that the y-axis represents density, which I think is the percentage/proportion of each clarity grade given the title. Originally, I thought perhaps I needed to create a new dataframe with "Price" and "Clarity Proportion (or, density)", but I wasn't sure how to do that. The professor hinted that we should not need to create a new variable for this problem.
Here's what I have so far. It produces the error message: "In Ops.ordered(left, right): '/' is not meaningful for ordered factors":
set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds),5000),]) #these were given in the homework
d + geom_area(aes(x = price, y = lapply(count(diamonds$clarity), FUN = count(diamonds$clarity)/53940), colour = clarity), position = "fill") +
labs(title = "Clarity Proportion by Price")
I know my y-axis is wrong, but I'm just not sure how to transform it. Your explanation and insight are greatly appreciated!
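The target plot is not reproduced here, so the following is only a guess at what was intended: a density estimate per clarity grade, stacked and rescaled to fill the panel, which needs no new variable. The error in the question arises because arithmetic is attempted on an ordered factor instead of letting a stat compute the y values. The name `sampled` is introduced here for illustration.

```r
library(ggplot2)

set.seed(123)
# Same 5000-row sample as in the homework setup
sampled <- diamonds[sample(nrow(diamonds), 5000), ]
d <- ggplot(sampled)

# stat_density computes the y values; position = "fill" rescales the
# stacked densities so they sum to 1 at every price
p <- d +
  geom_density(aes(x = price, fill = clarity), position = "fill") +
  labs(title = "Clarity Proportion by Price")
```

Because each grade's density integrates to 1 regardless of how many diamonds it contains, mapping y = after_stat(count) inside aes() may be closer to a true proportion by price; which variant matches the assigned plot is an open question.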

R ggplot: eliminating empty date range from time series plot?

On the bottom image, I have a graph produced by this code:
library(lubridate)
shangPM$date <- with(shangPM, ymd_h(paste(year, month, day, hour, sep= ' ')))
ggplot(data = shangPM, aes(x = date, y = PM_US.Post)) +
geom_line()
However, there are four years shown on my x-axis with no data, making the graph look weird. I tried using xlim and coord_cartesian, but this does not seem to work well with my date variable (maybe I'm wrong?).
A bit of a noob here - can someone help me zoom in on only the dates I have data for in my plot?
Here is my error:
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
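An error like this is likely to appear when axis limits for a date-time axis are given as bare numbers rather than POSIXct values. Below is a sketch of two possible approaches on made-up data. The data frame `pm`, its date range, and the assumption that the empty years are NA readings are all hypothetical stand-ins for shangPM.

```r
library(ggplot2)

# Hypothetical stand-in for shangPM: daily timestamps over six years,
# with readings missing (NA) for the first four
dates <- seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"),
             as.POSIXct("2015-12-31 00:00", tz = "UTC"), by = "day")
pm <- data.frame(date = dates, PM_US.Post = NA_real_)
has_data <- pm$date >= as.POSIXct("2014-01-01", tz = "UTC")
set.seed(1)
pm$PM_US.Post[has_data] <- runif(sum(has_data), 10, 200)

# Option 1: drop rows without a reading, so the axis spans only observed dates
pm_obs <- pm[!is.na(pm$PM_US.Post), ]
p1 <- ggplot(pm_obs, aes(x = date, y = PM_US.Post)) +
  geom_line()

# Option 2: keep all rows, but pass the limits as POSIXct values
p2 <- ggplot(pm, aes(x = date, y = PM_US.Post)) +
  geom_line() +
  scale_x_datetime(limits = range(pm_obs$date))
```

Option 2 emits a warning about removed rows when drawn, since the rows outside the limits are dropped from the plot.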

R incorrect y-axis in ggplots geom_bar()

I have a dataframe with Wikipedia edits, with information about the sequence number of each edit for the user (1st edit, 2nd edit and so on), the timestamp when the edit was made, and how many words were added.
In the actual dataset, I have up to 20,000 edits per user, and in some edits they add up to 30,000 words.
However, here is a downloadable small example dataset to exemplify my problem. The header looks like this:
I am trying to plot the distribution of added words across the Edit Progression and across time. If I use the regular R barplot, it works just as expected:
barplot(UserFrame3$NoOfAdds,UserFrame3$EditNo)
But I want to do it in ggplot for nicer graphics and more customizing options.
If I plot this as a scatterplot, I get the same result:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) + geom_point(size = 0.1)
Same for a linegraph:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) +geom_line(size = 0.1)
But when I try to plot it as a bargraph in ggplot, I get this result:
ggplot(data = UserFrame3, aes(x = UserFrame3$EditNo, y = UserFrame3$NoOfAdds)) + geom_bar(stat = "identity", position = "dodge")
There appear to be a lot more holes on the X-axis and the maximum is nowhere close to where it should be (y = 317).
I suspect that ggplot somehow groups the bars and uses means instead of the actual values despite the "dodge" parameter? How can I avoid this? And how would I go about plotting the time progression as a bar graph as well, without ggplot averaging over multiple edits?
You should expect more x-axis "holes" using bars as compared with lines. Lines connect the zero values together; bars do not.
I used geom_col with your data download (with dplyr and ggplot2 loaded), and it looks as expected:
UserFrame3 %>%
ggplot(aes(EditNo, NoOfAdds)) + geom_col()
