Time series difference in ggplot - r

I'm trying to plot the (first) difference of a time series with ggplot.
As the difference (by definition) contains one less element than the data, I (predictably) get the error message: "Error: Aesthetics must be either length 1 or the same as the data".
I solved this by defining my y aesthetic as c(NA, diff(data)) instead of just diff(data), which works.
However, this feels like a clumsy workaround, and it gets me into trouble when, for instance, I try to facet several plots. (Also, you need to keep adding NAs if a higher order of difference, or a longer lag, is needed.)
Does anyone know of a more robust solution?
The ultimate problem is this:
What I want (this was made using patchwork::)
What I get using faceting (NB: if I put the NA at the end, it's the third chart which becomes correct)

While adding NA to the differenced vector is not clumsy, doing this within the ggplot aesthetic is. Compare the following two snippets:
# Method 1: difference computed inside the aesthetic
ggplot(data = data.long, aes(x = date, y = c(NA, count %>% diff()))) +
  geom_point()
# Method 2: difference computed before plotting
data.long %>%
  mutate(diff_count = c(NA, diff(count))) %>%
  ggplot(aes(x = date, y = diff_count)) +
  geom_point()
Both give the same graph, but the second is the preferred method: the data used for plotting (the differenced count) is calculated before being sent to ggplot, which makes the code easier to read and modify. In other words, do the data management first, then visualise the data. As you say, the first method can get you into trouble later, for example with more complicated graphs such as faceted ones.
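For the faceting case specifically, one possible sketch is to compute the difference within each group before plotting, so that every panel gets its own leading NA. The data frame below, its `series` column, and its values are made up for illustration; `lag()` here is the dplyr function, and its `n` argument could be raised for a longer lag.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format data: one row per date and series
data.long <- data.frame(
  date   = rep(seq(as.Date("2021-01-01"), by = "day", length.out = 5), times = 2),
  series = rep(c("a", "b"), each = 5),
  count  = c(1, 3, 6, 10, 15, 2, 4, 8, 16, 32)
)

# Difference each series separately; the first row of each group becomes NA
data.diff <- data.long %>%
  group_by(series) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(diff_count = count - lag(count)) %>%
  ungroup()

# One panel per series; na.rm = TRUE silences the warning about the NA rows
p <- ggplot(data.diff, aes(x = date, y = diff_count)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ series)
```

Because the NA padding happens per group, the result has the same number of rows as the original data, which is what ggplot expects.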

Related

Best way to visually compare an individual's value to multiple subgroup means in R

For an individual feedback sheet generated by a Shiny app in R, I would like to visually compare an individual's value in variable X to the mean of the whole group, the mean of people of the same age, and the mean of people playing the same sport. I was considering a bar plot with four bars, one for each value, and since I keep reading that ggplot2 is neat for making plots, I tried to figure out how to do it in ggplot2. However, the factor on the x axis would conceptually be the subsets of the dataset. Since the subsets are built from different variables, and one individual can be in more than one subset, I can't wrap my head around how to feed that into any bar plot syntax I have found. I wondered whether you could just make a list along the lines of c(your_value, mean(group), mean(age_subset), mean(sports_subset)), but I couldn't find out whether that is possible. Also, first making a list or even a second data frame seems kind of messy to me. Isn't there an easier and more elegant way to do something like that?
Below I start with arbitrary numbers (equivalent to the list you considered starting with). The code might give you an idea how to make a general function of the kind you're seeking.
library(ggplot2)
library(dplyr)
own_result <- 5.4
mean_age <- 5.6
mean_sport <- 4.5
data.frame(group = c("age", "sport"),
means = c(mean_age, mean_sport)) %>%
ggplot(aes(x = group, y = means)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = own_result, lty = 2, col = "red")
Created on 2021-07-20 by the reprex package (v2.0.0)
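As a rough sketch of what such a general function might look like, the means could be computed from the raw data for a given person. The data frame `people`, its column names, and the helper `comparison_means()` are all hypothetical, and each subset mean here includes the person themselves.

```r
library(ggplot2)

# Hypothetical data: one row per person; column names are assumptions
people <- data.frame(
  id    = 1:6,
  age   = c("20-29", "20-29", "30-39", "30-39", "20-29", "30-39"),
  sport = c("tennis", "golf", "tennis", "golf", "tennis", "tennis"),
  X     = c(5, 6, 4, 7, 3, 8)
)

# Build a small data frame of comparison means for one person
comparison_means <- function(df, person_id) {
  me <- df[df$id == person_id, ]
  data.frame(
    group = c("everyone", "same age", "same sport"),
    means = c(mean(df$X),
              mean(df$X[df$age == me$age]),
              mean(df$X[df$sport == me$sport]))
  )
}

summary_df <- comparison_means(people, person_id = 1)
own_result <- people$X[people$id == 1]

# Bars for the group means, dashed line for the individual's own value
p <- ggplot(summary_df, aes(x = group, y = means)) +
  geom_col() +
  geom_hline(yintercept = own_result, lty = 2, col = "red")
```

Here `geom_col()` is used as the equivalent of `geom_bar(stat = "identity")`.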

Reordering data based on a column in [r] to order x-value items from lowest to highest y-values in ggplot

I have a dataframe that I want to reorder to make a ggplot so I can easily see which items have the highest and lowest values in them. In my case, I've grouped the data into two groups, and it'd be nice to have a visual representation of which group tends to score higher. Based on this question I came up with:
library(ggplot2)
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- line that doesn't seem to be working
ggplot(cor.data.sorted,aes(x=pic,y=r.val,size=df.val,color=exp)) + geom_point()
which produces this:
I've tried quite a few variants to reorder the data, and I feel like this should be pretty simple to achieve. To clarify, if I had successfully reorganised the data, the y-values would increase as the plot moves along the x axis. So maybe I'm focussing on the wrong part of the code to achieve this in a ggplot figure?
You could do something like this?
library(tidyverse);
cor.data %>%
mutate(pic = factor(pic, levels = as.character(pic)[order(r.val)])) %>%
ggplot(aes(x = pic, y = r.val, size = df.val, color = exp)) + geom_point()
This obviously still needs some polishing to deal with the x axis label clutter etc.
Rather than try to order the data before creating the plot, I can reorder the data at the time of writing the plot:
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- This line controls the order in which points are drawn, to make a (slightly) more readable plot
ggplot(cor.data.sorted,aes(x=reorder(pic,r.val),y=r.val,size=df.val,color=exp)) + geom_point()
to create

Display maximum frequency point of each bin in ggplot2 stat_binhex

I have a data set in which a coordinate can be repeated several times.
I want to make a hexbinplot displaying the maximum number of times a coordinate is repeated within that bin. I am using R and I would prefer to make it with ggplot so the graph is consistent with other graphs in the same report.
Minimum working example (the bins display the count not the max):
library(ggplot2)
library(data.table)
set.seed(41)
dat<-data.table(x=sample(seq(-10,10,1),1000,replace=TRUE),
y=sample(seq(-10,10,1),1000,replace=TRUE))
dat[,.N,by=c("x","y")][,max(N)]
# No bin should be over 9
p1 <- ggplot(dat,aes(x=x,y=y))+stat_binhex(bins=10)
p1
I believe the approach should be related to this question:
calculating percentages for bins in ggplot2 stat_binhex but I am not sure how to adapt it to my case.
Also, I am concerned about this issue ggplot2: ..count.. not working with stat_bin_hex anymore as it can make my objective harder than what I initially thought.
Is it possible to make the bins display the maximum number of times a point is repeated?
I think, after playing with the data a bit more, I now understand. Each bin in the plot represents multiple points, e.g., (9,9); (9,10); (10,9); (10,10) are all in a single bin in the plot. I must caution that this is the expected behavior. It is unclear to me why you do not want to do it this way. Instead, you seem to want to display the value of just one of those points (e.g. 9,9).
I don't think you will be able to do this directly in a call to geom_hex or stat_binhex, as those functions are trying to faithfully represent all of the data. In fact, they are not necessarily expecting discrete coordinates like yours at all; they work equally well on continuous data.
For your purpose, if you want finer control, you may want to use geom_tile instead and count the values yourself, e.g. (using dplyr and magrittr):
library(dplyr)     # for %>%
library(magrittr)  # for %$%
# Count how often each (x, y) coordinate occurs
countedData <-
  dat %$%
  table(x, y) %>%
  as.data.frame()
ggplot(countedData,
       aes(x = x, y = y, fill = Freq)) +
  geom_tile()
and you might play with the representation a bit from there, but it would at least display each of the separate coordinates more faithfully.
Alternatively, you could filter your raw data to only include the points that are the maximum within a bin. That would require you to match the binning, but could at least be an option.
For completeness, here is how to adapt the stat_summary_hex solution that @Jon Nagra (OP) linked to. Note that there are a few additional steps, so I don't think that this is quite a duplicate. Specifically, the table step above is required to generate something that can be used as a z for the summaries, and then you need to convert x and y back from factors to the original scale.
ggplot(countedData
, aes(x = as.numeric(as.character(x))
, y = as.numeric(as.character(y))
, z = Freq)) +
stat_summary_hex(fun = max, bins = 10
, col = "white")
Of note, I still think that geom_tile may be more useful, even if it is not quite as flashy.

R ggplot geom_text Aesthetic Length

I'm working with a really big data set containing one dummy variable and a factor variable with 14 levels, a sample of which I have posted here. I'm trying to make a stacked proportional bar graph using the following code:
ggplot(data,aes(factor(data$factor),fill=data$dummy))+
geom_bar(position="fill")+
ylab("Proportion")+
theme(axis.title.y=element_text(angle=0))
It works great and it's almost the plot I need. I just want to add small text labels reporting the number of observations of each factor level. My intuition tells me that something like this should work:
Labels<-c("n=1853" , "n=392", "n=181" , "n=80", "n=69", "n=32" , "n=10", "n=6", "n=4", "n=5", "n=3", "n=3", "n=2", "n=1" )
ggplot(data,aes(factor(data$factor),fill=data$dummy))+
geom_bar(position="fill")+
geom_text(aes(label=Labels,y=.5))+
ylab("Proportion")+
theme(axis.title.y=element_text(angle=0))
But it spits out a blank graph and the error
Aesthetics must either be length one, or the same length as the dataProblems:Labels
This really doesn't make sense to me, because I know for a fact that the number of factor levels is the same as the number of labels I muscled in. I've been trying to figure out how to get it to print what I need without creating a vector of values for the number of observations like this example, but no matter what I try I always get the same Aesthetics error.
How about this:
library(dplyr)
# Create a separate data frame of counts for the count labels
counts = data %>% group_by(factor) %>%
summarise(n=n()) %>%
mutate(dummy=NA)
counts$factor = factor(counts$factor, levels=0:10)
ggplot(data, aes(factor(factor), fill=factor(dummy))) +
geom_bar(position="fill") +
geom_text(data=counts, aes(label=n, x=factor, y=-0.03), size=4) +
ylab("Proportion")+
theme(axis.title.y=element_text(angle=0))
Your method is the right idea, but Labels needs to be a data frame, rather than a vector. geom_text needs to be given the name of the data frame using the data argument. Then, the label argument inside aes tells geom_text which column to use for the labels. Also, even though geom_text doesn't use the dummy column, it has to be in the data frame or you'll get an error.
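A self-contained sketch of the same idea, on made-up data, is shown here. The names `dat`, `level`, and `counts` are illustrative, not the asker's. As an alternative to adding an NA `dummy` column, this version passes `inherit.aes = FALSE` to geom_text so that the layer does not inherit the `fill` mapping from the main plot.

```r
library(ggplot2)

# Made-up data standing in for the asker's data set
set.seed(1)
dat <- data.frame(
  level = factor(sample(0:3, 200, replace = TRUE)),
  dummy = factor(sample(0:1, 200, replace = TRUE))
)

# One row per factor level, with a ready-made label column
counts <- as.data.frame(table(level = dat$level))
counts$label <- paste0("n=", counts$Freq)

p <- ggplot(dat, aes(x = level, fill = dummy)) +
  geom_bar(position = "fill") +
  # inherit.aes = FALSE stops geom_text from looking for `dummy` in counts
  geom_text(data = counts, aes(x = level, y = -0.03, label = label),
            inherit.aes = FALSE, size = 4) +
  ylab("Proportion")
```

The label layer works because `counts` has exactly one row per bar, so each aesthetic in that layer has the same length as the layer's own data.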

Re-ordering by multi-dimensional data for ggplot2 plotting

I'm having some trouble producing what I think should be a fairly straightforward ggplot2 graph.
I have some experimental data in a data frame. Each data entry is identified by the system that was being measured, and the instance (problem) it was run on. Each entry also has a value measured for the particular system and instance.
For instance:
mydata <- data.frame(System=c("a","b","a","b","a","b"), Instance=factor(c(1,1,2,2,3,3)), Value=c(10,5,4,2,7,8))
Now, I'd like to plot this data in a bar plot where the x-axis contains the instance identifier, and the color of the bar indicates which system the value is for. The bar heights should be weighted by the value in the dataframe.
This almost does what I want:
qplot(data=mydata, weight=Value, Instance, fill=System, position="dodge")
The final thing that I would like to do is reorder the bars so they are sorted by the value of system A. However, I can't figure out an elegant way to do this.
My first instinct was to use qplot(data=mydata, weight=Value, reorder(Instance, Value), fill=System, position="dodge"), but this will order by the mean value for each instance, and I just want to use the value from A. I could use qplot(data=mydata, weight=Value, reorder(Instance, Value, function(x) { x[1] } ), fill=System, position="dodge") to order the instances by "the first value", but this is dangerous (what if the order changes?) and unclear to a reader.
What is a more elegant solution?
I'm sure there is a better way than this, but making Instance an ordered factor works, and would continue to work even if the data changes:
qplot(data=mydata, weight=Value,
ordered(Instance,
levels=mydata[System=='a','Instance'] [order(mydata[System=='a','Value'])])
,fill=System, position="dodge")
Perhaps a slightly more elegant way of writing the same thing:
qplot(data=mydata, weight=Value,
ordered(Instance,
levels=Instance [System=='a'] [order(Value [System=='a'])]) # Corrected
,fill=System, position="dodge")
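Since qplot() is deprecated in recent ggplot2 releases, the same ordering could also be sketched with ggplot() by setting the factor levels before plotting. The helper names `sys_a` and `instance_order` are introduced here only for illustration.

```r
library(ggplot2)

mydata <- data.frame(System   = c("a", "b", "a", "b", "a", "b"),
                     Instance = factor(c(1, 1, 2, 2, 3, 3)),
                     Value    = c(10, 5, 4, 2, 7, 8))

# Take the rows for system "a" and order its instances by value
sys_a <- mydata[mydata$System == "a", ]
instance_order <- as.character(sys_a$Instance[order(sys_a$Value)])

# Re-level Instance so the x axis follows system a's ordering
mydata$Instance <- factor(mydata$Instance, levels = instance_order)

p <- ggplot(mydata, aes(x = Instance, y = Value, fill = System)) +
  geom_col(position = "dodge")
```

Selecting rows by `System == "a"` rather than by position keeps the ordering stable if the row order of the data changes.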
