Stop graph touching zero in ggplot geom_freqpoly function - r

I am creating a frequency plot using the geom_freqpoly function in ggplot2. I have a large data set of social media comments across 14 months and am plotting the number of comments for each week of that data. I am using this code, first converting the UTC timestamps to POSIXct and then doing the frequency plot:
ggplot(data = TRP) +
geom_freqpoly(mapping = aes(x = created_utc), binwidth = 604800)
This is creating a plot that looks like this:
However, I want to top and tail the plot, as it touches 'zero' at both the start and the end, making it look like there was rapid growth and rapid decline. This is not the case, as this is simply a snapshot of the data, which exists before and after my analysis. The data begins at the 4,000 mark and ends at around 2,000, and I want it represented like that. I have checked the 'pad' argument and have ensured it is set to FALSE.
Any help as to why this may be occurring would be greatly appreciated.
Thanks!

Rather than adjusting the geom_freqpoly to work differently than intended, it might be simpler to calculate the weekly totals yourself and use geom_line:
library(lubridate); library(dplyr); library(ggplot2)
set.seed(1)
df <- data.frame(
datetime = ymd_h(2018010101) + dhours(runif(1000, 0, 14*30*24))
)
df %>%
count(week_count = floor_date(datetime, "1 week")) %>%
ggplot(aes(week_count, n)) +
geom_line()
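As for why it happens: as far as I can tell from the ggplot2 source, geom_freqpoly() sets pad = TRUE itself whenever it is used with the default bin stat, so a pad = FALSE passed by the caller is overridden. If you would rather keep the binning inside ggplot2, one option is to draw the bin stat with a line geom, where pad defaults to FALSE. A minimal sketch on simulated timestamps (the data frame `ts_df` and its contents are made up for illustration):

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the real data: 1000 timestamps over about 14 months
ts_df <- data.frame(
  created_utc = as.POSIXct("2018-01-01", tz = "UTC") +
    runif(1000, 0, 14 * 30 * 24 * 3600)
)

week_secs <- 7 * 24 * 3600  # 604800; bin widths on a POSIXct axis are in seconds

# geom_freqpoly(): padded with an empty bin at each end
p_padded <- ggplot(ts_df, aes(created_utc)) +
  geom_freqpoly(binwidth = week_secs)

# Same binning drawn as a plain line, without the padding bins
p_unpadded <- ggplot(ts_df, aes(created_utc)) +
  geom_line(stat = "bin", binwidth = week_secs, pad = FALSE)

p_unpadded
```

Note that the first and last bins may still cover only part of a week, so they can read lower than their neighbours even without padding.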

Related

Making multi-histogram in ggplot, not recognizing grouping

I'm trying to make a stack of histograms (or a ridgeplot) so I can compare distributions at certain timepoints in my observations.
I used this source for the histogram, and this for the ridge plots.
However, I cannot figure out how to set up my code to make a stacked histogram of length (L) by week, so that I can see the L distributions at different weeks. I have tried the fill option in ggplot (which in the example seems to produce automatic color differences for the weeks because it is inside aes()?) and other "stacks" using the y= argument, but haven't had much success, I think due to the way my data is set up. If anyone can help me figure out how to make multiple histograms by week, that would be useful!
Thanks!
#fake data
library(ggplot2)
t = c((rep.int(7,10)), (rep.int(14,20)), rep.int(21,30), rep.int(28,20), (rep.int(31, 20)), (rep.int(36,10)))
L = rnorm(length(t), mean=10, sd=2) # t has 110 entries, so draw one L per entry to avoid recycling
fake = data.frame(L, t)
#subset data into weeks for convenience
dayofweek = seq(7,120,7)
fake2 = as.data.frame(subset(fake, t %in% dayofweek))
fake2$week <- floor(fake2$t/7)
#Plots, basic code
ggplot(fake2, aes(x=L, fill=week)) +
geom_histogram()
I tried facet_grid before, but for some reason facet_wrap actually at least separated the graphs correctly, AND magically made the color fill work:
ggplot(fake2, aes(x=L, fill = week)) +
geom_histogram()+
facet_wrap(.~week)
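A likely reason the fill is ignored in the first plot is that week is numeric, so ggplot2 treats it as continuous and does not split the data into groups. Converting it to a factor usually fixes that without faceting. A minimal sketch with regenerated fake data (position = "identity", the alpha value and the binwidth are just illustrative choices, not requirements):

```r
library(ggplot2)

set.seed(1)
# Regenerated fake data: only the days that fall on whole weeks
t <- c(rep.int(7, 10), rep.int(14, 20), rep.int(21, 30), rep.int(28, 20))
fake2 <- data.frame(L = rnorm(length(t), mean = 10, sd = 2), t = t)

# A discrete (factor) week lets ggplot2 form one group per week
fake2$week <- factor(fake2$t / 7)

p <- ggplot(fake2, aes(x = L, fill = week)) +
  geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.5)
p
```

With position = "identity" the histograms overlap; dropping that argument gives the default stacked bars instead.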

How to plot a total of multiple rows across a timeline in ggplot2

I want to show change in job numbers within certain time period. Ideally, I'd like to use a ggplot2 geom_dotplot and then color those dots by the column that they are in for that month. One idea I have not tried yet: do I need to reformat my data using tidyr from a wide to a long format in order to plot this?
Example data
Month  Finance  Tech   Construction  Manufacturing
Jan    14,000   6,800  11,000        17,500
Feb    11,500   8,400  9,480         15,000
Mar    15,250   4,200  7,200         12,400
Apr    12,000   6,400  10,300        8,500
My current r code attempt: I know that I need to fill the dot color by a factor of industry type. Maybe I have to have the data in a long format to do so.
library(tidyverse)
g <- ggplot(dat, aes(x = Month)) +
geom_dotplot(stackgroups = TRUE, binwidth = 1000, binpositions = "all") +
theme_light()
g
Here's how the plot I'm trying to make could look. Ideally I'd like to bin the dots as one dot per 1000 in the column value. Is that possible?
Thank you for taking the time to help someone who is new to R and is studying in school. Much appreciated as always,
I could not get geom_dotplot to work; the y-axis always comes out wrong. Try something like this instead: first pivot to long format, then repeat each Month/category row once per 1000 units. Note that the solution below rounds up:
library(dplyr)
library(tidyr)
library(ggplot2)
test = pivot_longer(dat,-Month,names_to="category") %>%
group_by(Month,category) %>%
summarize(bins=ceiling(value/ 1000)) %>%
uncount(bins)
If you would prefer to round down to the nearest 1000, use floor() instead of ceiling().
Then plot:
test$Month = factor(test$Month,levels=dat$Month) # dat$Month also works when dat is a tibble
test %>% ggplot(aes(x=Month,y=1,col=category)) +
geom_point(position=position_stack()) +
scale_y_continuous(labels=scales::number_format(scale=1000))
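To check the approach end to end, here is a self-contained version using the example table from the question, typed in as plain numbers. This is a sketch under that assumption: if your real columns are text such as "14,000", they would need converting to numbers first. It uses mutate() rather than summarize(), which is only a stylistic variation.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Example table from the question, entered as numeric columns
dat <- data.frame(
  Month         = c("Jan", "Feb", "Mar", "Apr"),
  Finance       = c(14000, 11500, 15250, 12000),
  Tech          = c(6800, 8400, 4200, 6400),
  Construction  = c(11000, 9480, 7200, 10300),
  Manufacturing = c(17500, 15000, 12400, 8500)
)

# One row per (Month, category), then one row per started 1000 (rounds up)
dots <- dat %>%
  pivot_longer(-Month, names_to = "category") %>%
  mutate(bins = ceiling(value / 1000)) %>%
  uncount(bins)

# Keep the months in table order rather than alphabetical order
dots$Month <- factor(dots$Month, levels = dat$Month)

p <- ggplot(dots, aes(x = Month, y = 1, colour = category)) +
  geom_point(position = position_stack()) +
  scale_y_continuous(labels = scales::label_number(scale = 1000))
p
```

Each point stands for up to 1000 jobs, so Jan/Finance (14,000) contributes 14 points and Mar/Finance (15,250) contributes 16.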

Using multiple summary statistics in a ggplot2 plot

I'm analysing some house sale transaction data, and I want to produce a geographic plot with the colour indicating average price per (hex-binned) region. Some regions have limited data, and I want to indicate this by adjusting the opacity to reflect the number of points in each region.
This would require me to calculate two statistics for each hex: average price and number of points. The ggplot2 package makes it very easy to calculate and plot one statistic in a chart, but I can't figure out how to calculate two.
To illustrate the point:
library(ggplot2)
N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) # dummy data
# I want to produce a hex-binned version of this:
ggplot(data=df_demo) + geom_point(mapping=aes(x=A, y=B, color=C))
# It's easy to get each hex's average price *or* its point density:
ggplot(data=df_demo) + stat_summary_hex(mapping=aes(x=A,y=B,z=C), fun=mean) # color = average of C across hex, but opacity can't be adjusted
ggplot(data=df_demo) + geom_hex(mapping=aes(x=A, y=B, color=C, alpha=..ndensity..)) # opacity = normalised # of points, but color is *total* value which is wrong
I would like to combine the effects of the last two lines, but that doesn't seem to be an option: the ..ndensity.. statistic doesn't work in the context of stat_summary_hex(), and geom_hex() won't calculate the mean value.
Is there a way to do this that I'm overlooking? Alternatively, is there an obvious way of precomputing the statistics needed before constructing the plot? E.g. by determining the expected hex for each datum during my dplyr pipeline.
One hint that there may not be an easy solution is this non-CRAN package which - if I've understood correctly - solves more or less this problem. However, I'd rather not rely on out-of-CRAN code if at all possible, so I'm holding onto hope that I've missed something obvious.
What about a different geom? E.g. geom_tile - you can create cuts for each dimension (A/B) and then pre-calculate mean and number for each tile and then plot like this:
library(tidyverse)
N = 1000;
df_demo = data.frame(A=runif(N), B=runif(N), C=runif(N)) %>%
mutate(cuts_a= cut(A, breaks = 20), cuts_b= cut(B, breaks = 20)) %>%
group_by(cuts_a, cuts_b) %>%
summarise(mean_c = mean(C), n_obs = n(), .groups = "drop") # one row per tile, so each tile is drawn (and its alpha applied) once
# Tile-binned version: fill = mean of C, alpha = number of points in the tile
ggplot(data=df_demo) +
geom_tile(mapping=aes(x=cuts_a, y=cuts_b, fill=mean_c, alpha = n_obs))
Created on 2020-02-13 by the reprex package (v0.3.0)

Differentiate missing values from main data in a plot using R

I create a dummy timeseries xts object with missing data on date 2-09-2015 as:
library(xts)
library(ggplot2)
library(scales)
set.seed(123)
seq <- seq(as.POSIXct("2015-09-01"),as.POSIXct("2015-09-02"), by = "1 hour")
ob1 <- xts(rnorm(length(seq),150,5),seq)
seq2 <- seq(as.POSIXct("2015-09-03"),as.POSIXct("2015-09-05"), by = "1 hour")
ob2 <- xts(rnorm(length(seq2),170,5),seq2)
final_ob <- rbind(ob1,ob2)
plot(final_ob)
# with ggplot
df <- data.frame(time = index(final_ob), val = coredata(final_ob) )
ggplot(df, aes(time, val)) + geom_line()+ scale_x_datetime(labels = date_format("%Y-%m-%d"))
After plotting my data looks like this:
The red coloured rectangular portion represents the date on which data is missing. How should I show that data was missing on this day in the main plot?
I think I should show this missing data with a different colour. But, I don't know how should I process data to reflect the missing data behaviour in the main plot.
Thanks for the great reproducible example.
I think you are best off to omit that line in your "missing" portion. If you have a straight line (even in a different colour) it suggests that data was gathered in that interval, that happened to fall on that straight line. If you omit the line in that interval then it is clear that there is no data there.
The problem is that you want the hourly data to be connected by lines, and then no lines in the "missing data section" - so you need some way to detect that missing data section.
You have not given a criterion for this in your question, so based on your example I will say that each line on the plot should consist of data at hourly intervals; if there's a break of more than an hour then there should be a new line. You will have to adjust this criterion to your specific problem. All we're doing is splitting up your data frame into pieces that each get plotted as their own line.
So first create a variable that says which "group" (i.e. line) each data point is in:
df$grp <- factor(c(0, cumsum(diff(df$time) > 1)))
Then you can use the group= aesthetic which geom_line uses to split up lines:
ggplot(df, aes(time, val)) + geom_line(aes(group=grp)) + # <-- only change
scale_x_datetime(labels = date_format("%Y-%m-%d"))
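One caveat with diff() on date-times: the units of the result are chosen automatically (seconds, minutes, hours, days) depending on the data, so a threshold of 1 only means "one hour" when the units happen to be hours, as they do in this example. A slightly more defensive sketch makes the units explicit. The time zone and the rebuilt data frame are assumptions so that the example is self-contained:

```r
library(ggplot2)

set.seed(123)
# Hourly timestamps with 2015-09-02 (after midnight) missing
time <- c(
  seq(as.POSIXct("2015-09-01", tz = "UTC"), as.POSIXct("2015-09-02", tz = "UTC"), by = "1 hour"),
  seq(as.POSIXct("2015-09-03", tz = "UTC"), as.POSIXct("2015-09-05", tz = "UTC"), by = "1 hour")
)
df <- data.frame(time = time, val = rnorm(length(time), 160, 5))

# Gap between consecutive observations, explicitly in hours
gap_hours <- as.numeric(diff(df$time), units = "hours")

# Start a new group whenever the gap exceeds one hour
df$grp <- factor(c(0, cumsum(gap_hours > 1)))

p <- ggplot(df, aes(time, val)) +
  geom_line(aes(group = grp)) +
  scale_x_datetime(date_labels = "%Y-%m-%d")
p
```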

ggplot time series plotting: group by dates

I would like to plot several time series on the same panel graph, instead of in separate panels. I took the below R code from another stackoverflow post.
Please note how the 3 time series are in 3 different panels. How can I layer the 3 time series on 1 panel, with each line in a different color?
Time = Sys.time()+(seq(1,100)*60+c(rep(1,100)*3600*24, rep(2, 100)*3600*24, rep(3, 100)*3600*24))
Value = rnorm(length(Time))
Group = c(0, cumsum(diff(Time) > 1))
library(ggplot2)
g <- ggplot(data.frame(Time, Value, Group)) +
geom_line (aes(x=Time, y=Value, color=Group)) +
facet_grid(~ Group, scales = "free_x")
If you run the above code, you get this:
When the facet_grid() part is eliminated, I get a graph that looks like this:
Basically, I would like ggplot to ignore the differences in the dates, and only consider the times. And then use group to identify the differing dates.
This problem could potentially be solved by creating a new column that only contains the times (eg. 22:01, format="%H:%M"). However, when as.POSIXct() function is used, I get a variable that contains both date and time. I can't seem to escape the date part.
Since the data file has different days for each group's time, one way to get all the groups onto the same plot is to just create a new variable, giving all groups the same "dummy" date but using the actual times collected.
experiment <- data.frame(Time, Value, Group) #creates a data frame
experiment$hms <- as.POSIXct(paste("2015-01-01", substr(experiment$Time, 12, 19))) # pastes dummy date 2015-01-01 onto the HMS of Time
Now that you have the times with all the same date, you then can plot them easily.
experiment$Grouping <- as.factor(experiment$Group) # ggplot needs Group to be a factor to give each line its own color
ggplot(experiment, aes(x=hms, y=Value, color=Grouping)) + geom_line(size=2)
Below is the resulting image (you can change/modify the basic plot as you see fit):
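A variation on the same idea, if you would rather not rely on character positions with substr(): format() can extract the clock time directly. A minimal sketch on regenerated data, with a fixed start time and time zone (both are assumptions so that the example is reproducible, and the name `Day` is just illustrative):

```r
library(ggplot2)

set.seed(1)
# Three runs of 100 one-minute readings, each on a different day
start <- as.POSIXct("2015-03-01 22:00:00", tz = "UTC")
Time  <- start + seq(1, 100) * 60 + rep(1:3, each = 100) * 3600 * 24
Value <- rnorm(length(Time))

# The real calendar day identifies each series
Day <- factor(as.Date(Time, tz = "UTC"))

# Same dummy date for every row, real clock time kept
hms <- as.POSIXct(
  paste("2015-01-01", format(Time, "%H:%M:%S", tz = "UTC")),
  tz = "UTC"
)

p <- ggplot(data.frame(hms, Value, Day), aes(x = hms, y = Value, colour = Day)) +
  geom_line() +
  scale_x_datetime(date_labels = "%H:%M")
p
```

Series that run past midnight would wrap around to the start of the dummy day, so this only works cleanly when each series stays within one calendar day.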
