I'm having some trouble producing what I think should be a fairly straightforward ggplot2 graph.
I have some experimental data in a data frame. Each data entry is identified by the system that was being measured, and the instance (problem) it was run on. Each entry also has a value measured for the particular system and instance.
For instance:
mydata <- data.frame(System=c("a","b","a","b","a","b"), Instance=factor(c(1,1,2,2,3,3)), Value=c(10,5,4,2,7,8))
Now, I'd like to plot this data in a bar chart where the x-axis contains the instance identifier, and the color of the bar indicates which system the value is for. The bar heights should be weighted by the value in the data frame.
This almost does what I want:
qplot(data=mydata, weight=Value, Instance, fill=System, position="dodge")
The final thing that I would like to do is reorder the bars so they are sorted by the value of system A. However, I can't figure out an elegant way to do this.
My first instinct was to use qplot(data=mydata, weight=Value, reorder(Instance, Value), fill=System, position="dodge"), but this will order by the mean value for each instance, and I just want to use the value from A. I could use qplot(data=mydata, weight=Value, reorder(Instance, Value, function(x) { x[1] } ), fill=System, position="dodge") to order the instances by "the first value", but this is dangerous (what if the order changes?) and unclear to a reader.
What is a more elegant solution?
I'm sure there is a better way than this, but making Instance an ordered factor works, and would continue to work even if the data changes:
qplot(data=mydata, weight=Value,
      ordered(Instance,
              levels=mydata[System=='a','Instance'][order(mydata[System=='a','Value'])]),
      fill=System, position="dodge")
Perhaps a slightly more elegant way of writing the same thing:
qplot(data=mydata, weight=Value,
      ordered(Instance,
              levels=Instance[System=='a'][order(Value[System=='a'])]), # Corrected
      fill=System, position="dodge")
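For readers who prefer to do the reordering before plotting, here is a minimal sketch of the same idea using ggplot() and geom_col() instead of qplot(). The helper names a_rows and instance_order are made up for illustration; the data is the mydata example from the question.

```r
library(ggplot2)

mydata <- data.frame(
  System   = c("a", "b", "a", "b", "a", "b"),
  Instance = factor(c(1, 1, 2, 2, 3, 3)),
  Value    = c(10, 5, 4, 2, 7, 8)
)

# Keep only the rows for system "a" and sort the instances by their Value
a_rows <- mydata[mydata$System == "a", ]
instance_order <- as.character(a_rows$Instance[order(a_rows$Value)])

# Re-level Instance so the x-axis follows system a's ordering
mydata$Instance <- factor(mydata$Instance, levels = instance_order)

p <- ggplot(mydata, aes(x = Instance, y = Value, fill = System)) +
  geom_col(position = "dodge")
p
```

Because the levels are looked up by System == "a" rather than by row position, this should keep working if the rows are shuffled.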
I'm trying to plot the (first) difference of a time series with ggplot.
As the difference (by definition) contains one less element than the data, I (predictably) get the error message: "Error: Aesthetics must be either length 1 or the same as the data".
I solved this by defining my y aesthetic as c(NA, diff(data)) instead of just diff(data), which works.
However, this feels like a clumsy workaround and only works up to a point: it gets me into trouble, for instance, when I try to facet several plots. (Also, you need to keep adding NAs if a higher order of difference, or a longer lag, is needed.)
Does anyone know of a more robust solution?
The ultimate problem is this:
What I want (this was made using patchwork::)
What I get using faceting (NB: if I put the NA at the end, it's the third chart which becomes correct)
While adding NA to the differenced vector is not clumsy, doing this within the ggplot aesthetic is. Compare the following two snippets:
ggplot(data = data.long, aes(x = date, y = c(NA, count %>% diff()))) +
  geom_point()

data.long %>%
  mutate(diff_count = c(NA, diff(count))) %>%
  ggplot(aes(x = date, y = diff_count)) +
  geom_point()
Both give the same graph, but the second snippet is the preferred method: the data used for plotting (the differenced count) is calculated before being sent to ggplot, which makes the code easier to read and modify. In other words, do the data management first, then visualise the data. As you say, the first method can get you into trouble later, for example with more complicated graphs such as faceted ones.
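For the faceting case, one way this could look is to difference within each group before plotting, so that every panel gets its own leading NA. This is only a sketch: the grouping column series and the numbers in data.long are invented here, since the real data is not shown.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format data: three series observed on the same five dates
data.long <- data.frame(
  date   = rep(as.Date("2020-01-01") + 0:4, times = 3),
  series = rep(c("a", "b", "c"), each = 5),
  count  = c(1, 3, 6, 10, 15,  2, 4, 8, 16, 32,  5, 5, 6, 8, 11)
)

# First difference per series; lag() restarts in every group,
# so the first row of each series is NA
data.diff <- data.long %>%
  group_by(series) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(diff_count = count - lag(count)) %>%
  ungroup()

p <- ggplot(data.diff, aes(x = date, y = diff_count)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~series)
p
```

A longer lag would be lag(count, n = 2) in the same place, with no manual padding of NAs.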
I have a stacked barchart that looks like this.
If I have a second data frame that has the same layout as the one that created the plot, and I want to group both datasets by position while still keeping the stacked percentages, how would I go about this? I'm not sure how to do it in ggplot2.
It's hard to say without seeing the data and without more information about what you actually want to achieve, but the general approach I would use is to combine your data frames, especially if the variables are the same. You just want to make sure to record which dataset each row came from, and that will be your identifying column.
So, if your data is in myData1 and myData2:
# add identifying columns
myData1$id <- 'dataset1'
myData2$id <- 'dataset2'
# put them together
newData <- rbind(myData1, myData2)
You are not clear on what you're looking for in the combined plot, so you can go about that any number of ways (depending on what you want to do). Maybe the simplest example would be to use facet_grid() or facet_wrap() from ggplot2 to show them in side-by-side plots:
ggplot(newData, aes(x=name, y=value)) +
geom_col(aes(fill=gene)) +
facet_wrap(~id)
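If the goal is one panel per position with the two datasets side by side as 100% stacked bars, a possible variation is to put id on the x-axis and facet by position instead. This sketch assumes that name is the position column and uses made-up numbers in place of the real data.

```r
library(ggplot2)

# Hypothetical stand-ins for the two data frames; the columns name, gene and
# value follow the plotting call above, the numbers are invented
myData1 <- data.frame(name  = rep(c("p1", "p2"), each = 2),
                      gene  = rep(c("g1", "g2"), times = 2),
                      value = c(3, 1, 2, 2))
myData2 <- data.frame(name  = rep(c("p1", "p2"), each = 2),
                      gene  = rep(c("g1", "g2"), times = 2),
                      value = c(1, 1, 4, 1))
myData1$id <- "dataset1"
myData2$id <- "dataset2"
newData <- rbind(myData1, myData2)

# One panel per position; within each panel, one bar per dataset,
# rescaled to proportions by position = "fill"
p <- ggplot(newData, aes(x = id, y = value, fill = gene)) +
  geom_col(position = "fill") +
  facet_wrap(~name)
p
```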
I have a data set in which a coordinate can be repeated several times.
I want to make a hexbinplot displaying the maximum number of times a coordinate is repeated within that bin. I am using R and I would prefer to make it with ggplot so the graph is consistent with other graphs in the same report.
Minimum working example (the bins display the count not the max):
library(ggplot2)
library(data.table)
set.seed(41)
dat<-data.table(x=sample(seq(-10,10,1),1000,replace=TRUE),
y=sample(seq(-10,10,1),1000,replace=TRUE))
dat[,.N,by=c("x","y")][,max(N)]
# No bin should be over 9
p1 <- ggplot(dat,aes(x=x,y=y))+stat_binhex(bins=10)
p1
I believe the approach should be related to this question:
calculating percentages for bins in ggplot2 stat_binhex but I am not sure how to adapt it to my case.
Also, I am concerned about this issue ggplot2: ..count.. not working with stat_bin_hex anymore as it can make my objective harder than what I initially thought.
Is it possible to make the bins display the maximum number of times a point is repeated?
I think, after playing with the data a bit more, I now understand. Each bin in the plot represents multiple points, e.g., (9,9); (9,10); (10,9); (10,10) are all in a single bin in the plot. I must caution that this is the expected behavior. It is unclear to me why you do not want to do it this way. Instead, you seem to want to display the values of just one of those points (e.g. 9,9).
I don't think you will be able to do this directly in a call to geom_hex or stat_binhex, as those functions are trying to faithfully represent all of the data. In fact, they are not necessarily expecting discrete coordinates like you have at all; they work equally well on continuous data.
For your purpose, if you want finer control, you may want to instead use geom_tile and count the values yourself, e.g. (using dplyr and magrittr):
library(magrittr)  # provides %$% and %>%

# count how often each (x, y) coordinate occurs
countedData <-
  dat %$%
  table(x, y) %>%
  as.data.frame()

ggplot(countedData
       , aes(x = x
             , y = y
             , fill = Freq)) +
  geom_tile()
and you might play with the representation a bit from there, but it would at least display each of the separate coordinates more faithfully.
Alternatively, you could filter your raw data to only include the points that are the maximum within a bin. That would require you to match the binning, but could at least be an option.
For completeness, here is how to adapt the stat_summary_hex solution that @Jon Nagra (OP) linked to. Note that there are a few additional steps, so I don't think that this is quite a duplicate. Specifically, the table step above is required to generate something that can be used as a z for the summaries, and then you need to convert x and y back from factors to the original scale.
ggplot(countedData
, aes(x = as.numeric(as.character(x))
, y = as.numeric(as.character(y))
, z = Freq)) +
stat_summary_hex(fun = max, bins = 10
, col = "white")
Of note, I still think that geom_tile may be more useful, even if it is not quite as flashy.
I am a new R user.
I am having a difficult time figuring out how to combine different barplots into one graph.
For example,
Suppose the top five professions in China are government employees, CEOs, doctors, athletes, and artists, with incomes (in dollars) of 20,000, 17,000, 15,000, 14,000, and 13,000 respectively, while the top five professions in the US are doctors, athletes, artists, lawyers, and teachers, with incomes (in dollars) of 40,000, 35,000, 30,000, 25,000, and 20,000 respectively.
I want to show the differences in one graph.
How am I supposed to do that? Note that the professions are not the same in the two countries.
The answer to the question is fairly straightforward. As a new R user, I recommend you make liberal use of the 'ggplot2' package. For many R users, this one package is enough.
To get the "combined" barchart described in the original post, the answer is to put all of the data into one dataset and then add grouping variables, like so:
Step 1: Make the dataset.
data <- read.table(text="
Country,Profession,Income
China,Government employee,20000
China,CEO,17000
China,Doctor,15000
China,Athlete,14000
China,Artist,13000
USA,Doctor,40000
USA,Athlete,35000
USA,Artist,30000
USA,Lawyer,25000
USA,Teacher,20000", header=TRUE, sep=",")
You'll notice I'm using the 'read.table' function here. This is not required and is purely for readability in this example. The important part is that we have our values (Income) and our grouping variables (Country, Profession).
Step 2: Create a barchart with Income as the height of the bars, Profession as the x-axis, and color the bars by Country.
library(ggplot2)
ggplot(data, aes(x=Profession, y=Income, fill=Country)) +
geom_bar(stat="identity", position="dodge") +
theme(axis.text.x = element_text(angle = 90))
Here we are first loading the 'ggplot2' package. You may need to install this.
Then, we specify what data we want to use and how to separate it.
ggplot(data, aes(x=Profession, y=Income, fill=Country))
This tells 'ggplot' to use our dataset in the 'data' data frame. The aes() command specifies how 'ggplot' should read the data. We map the grouping variable Profession onto the x-axis, map the Income onto the y-axis, and change the color (fill) of each bar according to the grouping variable Country.
Next, we specify what kind of barchart we want.
geom_bar(stat="identity", position="dodge")
This tells 'ggplot' to make a barchart (geom_bar()). By default, 'geom_bar' counts the rows for each x value and uses that count as the bar height (stat="count"), but we already have the totals we want to use. We tell it to plot the values in Income as they are by specifying stat="identity". Finally, I made a judgement call about how to display the data and decided to place the bars side by side when a single profession has more than one income value (position="dodge").
Finally, we need to rotate the x-axis labels, since some of them are quite long. We do this with a simple 'theme' command that changes the rotation of the x-axis text elements.
theme(axis.text.x = element_text(angle = 90))
We chain all of these commands together with the +, and it's done!
I have the following data.frame:
sample <- data.frame(day=c(1,2,5,10,12,12,14))
sample.table <- as.data.frame(table(sample$day))
Now what I'd like to do is graph the day against the count of days, so something like:
require(ggplot2)
qplot(Var1, Freq, data=sample.table)
I realized though that Var1 really really really wants to be a factor. This works fine for a small number of days, but is terrible when days becomes much larger because the graph becomes unreadable. If I change it to a numeric or integer, then instead of plotting day on the x-axis, it plots the count of day, e.g. 1,2,3,4,5,6,7.
What can I do so that if I have, say 5000 days, it is still visible well?
This is because when you use table you get a vector with names (which are characters), and when you convert to data.frame these get converted to factors with the default settings.
You could avoid this by using your original data and getting ggplot2 to count the data:
qplot(day, ..count.., data=sample, stat="bin", binwidth=1)
or just use a histogram,
qplot(day, data=sample, geom="histogram", binwidth=1)
Note that you can adjust the binwidth argument to count in larger groups.
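In more recent ggplot2 versions the same histogram can be written with ggplot() directly. A minimal sketch, using the sample data from the question; layer_data() is only there to inspect the counts ggplot2 computed.

```r
library(ggplot2)

sample <- data.frame(day = c(1, 2, 5, 10, 12, 12, 14))

# One bar per day; a larger binwidth (say 7 or 30) keeps the plot readable
# when there are thousands of days
p <- ggplot(sample, aes(x = day)) +
  geom_histogram(binwidth = 1)
p

# The per-bin counts behind the bars
counts <- layer_data(p)$count
```

Since day stays numeric here, the x-axis is continuous and ggplot2 picks a manageable number of axis breaks on its own.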
Figured out a hack for this: convert the factor to character first, then to integer.
as.integer(as.character(sample.table$Var1))
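To show where that conversion would go, here is a short sketch that applies it to the factor column Var1 of sample.table and stores the result in a new column, named day here purely for illustration, before plotting.

```r
library(ggplot2)

sample <- data.frame(day = c(1, 2, 5, 10, 12, 12, 14))
sample.table <- as.data.frame(table(sample$day))

# as.integer() on its own would return the factor codes 1, 2, 3, ...;
# going through as.character() recovers the original day values
sample.table$day <- as.integer(as.character(sample.table$Var1))

p <- ggplot(sample.table, aes(x = day, y = Freq)) +
  geom_col()
p
```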