In UpSetR, how to show decimal number on the intersection bar - r

I am making an upset diagram for the following data in percentages. This is a dummy example for my more complicated data.
x <- c(a=80, b=9.9, c=5, 'a&b'=0.1, 'a&c'=1.65, 'c&b'=3.35)
upset(fromExpression(x), order.by = "freq")
I want these percentages to appear as decimal numbers and all the bars visible even if it is 0.1%. All the data is important in this plot.

The upset'ting plot
library(UpSetR)
x <- c(a=80, b=9.9, c=5, 'a&b'=0.1, 'a&c'=1.65, 'c&b'=3.35)
upset(fromExpression(x), order.by = "freq", show.numbers = 'yes')
Your question
So you want two things:
percentages to appear as decimal numbers
bars visible even if it is 0.1%
Percentages to appear as decimal numbers
You start by converting your vector of percentages to counts (integer) with fromExpression. So the input to upset is then a dataframe:
library(UpSetR)
x <- c(a=80, b=9.9, c=5, 'a&b'=0.1, 'a&c'=1.65, 'c&b'=3.35)
str(fromExpression(x))
#> 'data.frame': 98 obs. of 3 variables:
#> $ a: num 1 1 1 1 1 1 1 1 1 1 ...
#> $ b: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ c: num 0 0 0 0 0 0 0 0 0 0 ...
upset internally then gets the labels from this data, so the link to your original percentages is no longer present inside upset.
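The 98 rows in the str() output are what you would get if each value were cut down to a whole number before being expanded into rows, which would also explain why the 0.1% intersection vanishes entirely. A rough base-R sketch of that arithmetic follows. The truncation is an assumption inferred from the row count, not a reading of the package internals, and `scaled` is a name introduced here for illustration:

```r
x <- c(a=80, b=9.9, c=5, 'a&b'=0.1, 'a&c'=1.65, 'c&b'=3.35)

# Dropping the fractional part of every value reproduces the 98 rows seen above;
# note that 'a&b' contributes 0 rows, so its bar cannot appear at all.
whole <- trunc(x)
sum(whole)

# Rescaling to hundredths of a percent first keeps every intersection non-empty.
# round() guards against floating-point results such as 164.99999.
scaled <- round(x * 100)
scaled
```

Passing `scaled` to fromExpression() would keep all six intersections, at the cost of the bar labels reading as hundredths of a percent rather than percentages.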
Having labels as percentages, or some other custom labels, does not seem to be a supported option for the function upset from the UpSetR package at the moment.
There is the show.numbers argument, but it only lets you show the absolute frequencies on top of the bars (show.numbers = "yes" or show.numbers = "Yes") or hide them (any other value for show.numbers). Here is the code involved:
https://github.com/hms-dbmi/UpSetR/blob/fe2812c8cbe87af18c063dcee9941391c836e7b2/R/MainBar.R#L130-L132
So I think you need to change that piece of code, i.e., the geom_text and aes_string, to use a different aesthetic mapping (your relative frequencies). So maybe ask the developer to do it?
Bars visible even if it is 0.1%
Well, this ultimately depends on your y-axis dynamic range and the size of your plot, i.e., if the tallest bar is much taller than the shortest, then it might be impossible to see both in the same chart (unless you make the y-axis discontinuous).
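To put a number on the dynamic range problem, here is a small sketch that converts each percentage to a bar height in pixels on a linear axis. The 400-pixel plotting height is a made-up figure for illustration; the real value depends on your device and layout:

```r
x <- c(a=80, b=9.9, c=5, 'a&b'=0.1, 'a&c'=1.65, 'c&b'=3.35)

# Hypothetical height of the bar area in pixels (an assumption, not an UpSetR setting)
panel_px <- 400

# On a linear axis the tallest bar fills the panel and the rest scale proportionally
px <- x / max(x) * panel_px
round(px, 2)
```

Under that assumption the 'a&b' bar is half a pixel tall, which is why it is effectively invisible next to the 80% bar.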
Conclusion
I understand this is not really a solution to your problem but it is an answer that hopefully points you in the direction of the solution to your problem.

Two facts are standing in the way of a quick and easy solution to this problem:
UpSetR is very strongly oriented toward discrete sets of countable objects.
A potential solution would be to use fractional objects instead of whole objects, but the first thing upset() does is check which columns of your data frame have "0" and "1" as their only levels. This is hardcoded. If this check fails, the startend object becomes NULL and there is no way the function will be able to do anything.
UpSetR does not give very good access to the plots it creates.
Once the plots are made, you are left with no return value from upset(). This means you cannot modify the plot objects themselves or change the way they are plotted beyond what the arguments to upset() allow.
So, what can you do?
Depending on how complicated your real plot is (and how often you have to replot it) you might just do this:
x <- c(a=80, b=9.9, c=5, 'a&b'=0.1, 'a&c'=1.65, 'c&b'=3.35)
upset(fromExpression(x*100), order.by = "freq")
and then edit the labels in Inkscape/Illustrator. (BAD)
Fork UpSetR and hijack the scale.intersections and scale.sets parameters. In the Make_main_bar() function you would change the way it handles a "percent" argument to scale.intersections, and change the way Make_size_plot() handles the same argument to scale.sets. The call would then become:
x <- c(a=80, b=9.9, c=5, 'a&b'=0.1, 'a&c'=1.65, 'c&b'=3.35)
upset(fromExpression(x*100), order.by = "freq",
scale.intersections="percent", scale.sets="percent")
I have forked UpSetR myself for other purposes, but the package in general needs a major refactoring before it can be applied to additional use cases. The authors may have wanted to prevent uses of the concept outside of the one they had in mind.

Related

R: ggplot:: geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

I am trying to make a line graph in ggplot and I am having difficulty diagnosing my error. I've read through nearly all the similar threads, but have been unable to solve my issue.
I am trying to plot Japanese CPI. I downloaded the data online from FRED.
my str looks like:
str(jpycpi)
data.frame: 179 obs. of 2 variables:
$ DATE : Factor w/ 179 levels "2000-04-01","2000-05-01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ JPNCPIALLMINMEI: num 103 103 103 102 103 ...
My code to plot:
ggplot(jpycpi, aes(x=jpycpi$DATE, y=jpycpi$JPNCPIALLMINMEI)) + geom_line()
it gives me an error saying:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
I have tried the following and have been able to plot it, but the x-axis of the graph is distorted for some odd reason. That code is below:
ggplot(jpycpi, aes(x=jpycpi$DATE, y=jpycpi$JPNCPIALLMINMEI, group=1)) + geom_line()
The "Each group consists of only one observation" error message happens because your x aesthetic is a factor. ggplot takes that to mean that your independent variable is categorical, which doesn't make sense in conjunction with geom_line.
In this case, the right way to fix it is to convert that column of the data to a Date vector. ggplot understands how to use all of R's date/time classes as the x aesthetic.
Converting from a factor to a Date is a little tricky. A direct conversion,
jpycpi$DATE <- as.Date(jpycpi$DATE)
works in R version 3.3.1, but, if I remember correctly, would give nonsense results in older versions of the interpreter, because as.Date would look only at the ordinals of the factor levels, not at their labels. Instead, one should write
jpycpi$DATE <- as.Date(as.character(jpycpi$DATE))
Conversion from a factor to a character vector does look at the labels, so the subsequent conversion to a Date object will do the Right Thing.
You probably got a factor for $DATE in the first place because you used read.table or read.csv to load up the data set. The default behavior of these functions (before R 4.0.0, which changed the stringsAsFactors default to FALSE) is to attempt to convert each column to a numeric vector, and failing that, to convert it to a factor. (See ?type.convert for the exact behavior.) If you're going to be importing lots of data with date columns, it's worth learning how to use the colClasses argument to read.table; this is more efficient and doesn't have gotchas like the above.
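Both routes can be sketched as follows. The CSV text is a hypothetical stand-in for the FRED download (the CPI values are made up), and `jpycpi2` is a name introduced here only to compare the two approaches:

```r
library(ggplot2)

# Hypothetical stand-in for the FRED file; the values are invented for illustration.
csv_text <- "DATE,JPNCPIALLMINMEI
2000-04-01,103.1
2000-05-01,103.2
2000-06-01,103.0"

# Route 1: read the date column as a factor (the old default), then convert
# through character so the labels, not the level codes, are parsed.
jpycpi <- read.csv(text = csv_text, stringsAsFactors = TRUE)
jpycpi$DATE <- as.Date(as.character(jpycpi$DATE))

# Route 2: request a Date column at import time with colClasses.
jpycpi2 <- read.csv(text = csv_text, colClasses = c("Date", "numeric"))

# With a Date on the x axis, geom_line() no longer needs the group = 1 workaround.
p <- ggplot(jpycpi, aes(x = DATE, y = JPNCPIALLMINMEI)) + geom_line()
```

Note that the aesthetics refer to bare column names (DATE) rather than jpycpi$DATE, which is the usual ggplot2 idiom.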

ggplot2: how to read the scale transformation from a plot object

I'm trying to extract information about the limits and transform of an existing ggplot object. I'm getting close, but need some help. Here's my code
data = data.frame(x=c(1,10,100),y=(c(1,10,100)))
p = ggplot(data=data,aes(x=x,y=y)) + geom_point()
p = p + scale_y_log10()
q = ggplot_build(p)
r = q$panel$y_scales
trans.y = (q$panel$y_scales)[[1]]$trans$name
range.y = (q$panel$y_scales)[[1]]$range
print(trans.y) gives me exactly what I want
[1] "log-10"
But range.y is a funky S4 object (see below).
> print(range.y)
Reference class object of class "Continuous"
Field "range":
[1] 0 2
> unclass(range.y)
<S4 Type Object>
attr(,".xData")
<environment: 0x11c9a0630>
I don't really understand S4 objects or how to query their attributes and methods. Or, if I'm just going down the wrong rabbit hole here, a better solution would be great :) In Matlab, I could just use the commands "get(gca,'YScale')" and "get(gca,'YLim')", so I wonder if I'm making this harder than it needs to be.
As @MikeWise points out in the comments, this all becomes a lot easier if you update ggplot2 to v2.0. It now uses ggproto objects instead of proto, and these are more convenient to get info from.
It's easy to find now what you need. Just printing ggplot_build(p) gives you a nice list of all that's there.
ggplot_build(p)$panel$y_scales[[1]]$range here gives you a ggproto object. You can see that contains several parts, one of which is range (again), which contains the data range. All the way down, you end up with:
ggplot_build(p)$panel$y_scales[[1]]$range$range
# [1] 0 2
Where 0 is 10^0 = 1 and 2 is 10^2 = 100.
Another way might be to just look it up in $data part like this:
apply(ggplot_build(p)$data[[1]][1:2], 2, range)
#      y   x
# [1,] 0   1
# [2,] 2 100
You can also get the actual range of the plotting window with:
ggplot_build(p)$panel$ranges[[1]]$y.range
# [1] -0.1 2.1
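The $panel paths above match the ggplot2 2.0 layout of the built object and may no longer resolve in later releases, where the built object was reorganized. A sketch that avoids the internal path, assuming a ggplot2 version that exports layer_scales(), could look like this (`y_scale` and `rng` are names introduced here):

```r
library(ggplot2)

data <- data.frame(x = c(1, 10, 100), y = c(1, 10, 100))
p <- ggplot(data = data, aes(x = x, y = y)) + geom_point() + scale_y_log10()

# layer_scales() returns the trained x and y scales for a layer of the plot
y_scale <- layer_scales(p)$y

# Range of the data on the transformed (log10) scale
rng <- y_scale$range$range
rng

# The transformed y values themselves are also available per layer
layer_data(p)$y
```

As in the answer above, the range is reported in log10 units, so it has to be back-transformed with 10^rng to recover the original data limits.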

How do I plot data by splitting it into 5 second intervals?

I'm completely new to R, and I have been tasked with making a script to plot the protocols used by a simulated network of users into a histogram by a) identifying the protocols they use and b) splitting everything into a 5-second interval and generate a graph for each different protocol used.
Currently we have
data$bucket <- cut(as.numeric(format(data$DateTime, "%H%M")),
c(0,600, 2000, 2359),
labels=c("00:00-06:00", "06:00-20:00", "20:00-23:59")) #Split date into dates that are needed to be
to split the codes into 3-zones for another function.
What should the code be changed to for 5 second intervals?
Sorry if the question isn't very clear, and thank you
The histogram function hist() can aggregate and/or plot all by itself, so you really don't need cut().
Let's create 1,000 random time stamps across one hour:
set.seed(1)
foo <- as.POSIXct("2014-12-17 00:00:00")+runif(1000)*60*60
(Look at ?POSIXct on how R treats POSIX time objects. In particular, note that "+" assumes you want to add seconds, which is why I am multiplying by 60^2.)
Next, define the breakpoints in 5 second intervals:
breaks <- seq(as.POSIXct("2014-12-17 00:00:00"),
as.POSIXct("2014-12-17 01:00:00"),by="5 sec")
(This time, look at ?seq.POSIXt.)
Now we can plot the histogram. Note how we assign the output of hist() to an object bar:
bar <- hist(foo,breaks)
(If you don't want the plot, but only the bucket counts, use plot=FALSE.)
?hist tells you that hist() (invisibly) returns the counts per bucket. We can look at this by accessing the counts element of bar:
bar$counts
[1] 1 2 0 1 0 1 1 2 3 3 0 ...
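The question also asks for one graph per protocol. A sketch of how the same hist() approach could be applied per group is below. The protocol labels are invented, and `stamps`, `protocol`, and `counts` are names introduced here; your real data would supply the time stamps and protocol column:

```r
set.seed(1)

# Hypothetical data: 1,000 time stamps within one hour, each with a made-up protocol
start <- as.POSIXct("2014-12-17 00:00:00", tz = "UTC")
stamps <- start + runif(1000) * 60 * 60
protocol <- sample(c("TCP", "UDP", "ICMP"), 1000, replace = TRUE)

# Break points every 5 seconds across the hour (721 breaks, 720 buckets)
breaks <- seq(start, start + 60 * 60, by = "5 sec")

# One vector of bucket counts per protocol; plot = FALSE skips the drawing
counts <- lapply(split(stamps, protocol),
                 function(s) hist(s, breaks = breaks, plot = FALSE)$counts)

# To draw one histogram per protocol instead, loop over the groups:
# for (nm in names(counts)) hist(stamps[protocol == nm], breaks = breaks, main = nm)
```

Using the same break vector for every protocol keeps the buckets aligned, so the per-protocol counts can be compared bucket by bucket.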

Plot frequencies of factor variable

I'm trying to get a handle on all of the various tools for manipulating data structures - I've looked into apply, sapply, tapply, reshape, etc. and I still feel very unsure about which to use in each situation.
For my current problem, I have data that looks like:
ID T1Measure T2Measure ...
1 1 1
2 1 2
...
where T1Measure represents the measure of a factor/categorical variable at time 1, T2Measure is the measure of the same variable for the same user at time 2, etc.
My goal is to produce graphs of how the distribution of this measure changes over time (both the frequency of each factor and the proportion of each factor).
I know this is simple, but I'm having a hard time wrapping my head around how I can get what I want.
I believe that for ggplot, I want something like:
FactorID variable value
1 T1 2
2 T1 0
1 T2 1
2 T2 1
...
I want to know which package I should be looking at to do this, but more generally, a good way of thinking about data structures, and how to recognize the best way to manipulate them.
I'm not sure I would use any apply statements here, but the reshape2 package would help.
#sample data
dd<-data.frame(
ID=c(1,2,3,4,5,6),
T1=c(1,2,2,1,1,2),
T2=c(1,1,2,1,1,2),
T3=c(2,1,1,2,1,1)
)
library(reshape2)
mm<-melt(dd,id.vars="ID", variable.name="Measure", value.name="FactorID")
#option 1 (useful for counts of discrete values)
as.data.frame(with(mm, table(FactorID, Measure)))
#option 2 (useful for collapsing data)
aggregate(ID~FactorID+Measure, mm, FUN=length)
I used standard base functions for collapsing the data and making counts. I tend to prefer the syntax of reshape2 to the base reshape() function, but that might work as well.
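The question also asks for proportions and for graphs. One possible way to get from the melted data to both, sketched with the same sample data (using prop.table() and ggplot2 is a choice made here, not part of the answer above):

```r
library(reshape2)
library(ggplot2)

# Same sample data as above
dd <- data.frame(ID = 1:6,
                 T1 = c(1, 2, 2, 1, 1, 2),
                 T2 = c(1, 1, 2, 1, 1, 2),
                 T3 = c(2, 1, 1, 2, 1, 1))
mm <- melt(dd, id.vars = "ID", variable.name = "Measure", value.name = "FactorID")

# Counts of each factor level per time point
tab <- with(mm, table(FactorID, Measure))

# Proportions within each time point (each column sums to 1)
props <- prop.table(tab, margin = 2)

# Stacked bars: position = "fill" shows proportions, the default shows counts
p_prop  <- ggplot(mm, aes(x = Measure, fill = factor(FactorID))) +
  geom_bar(position = "fill")
p_count <- ggplot(mm, aes(x = Measure, fill = factor(FactorID))) +
  geom_bar()
```

The long format produced by melt() is what lets a single aes() mapping cover all time points at once.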

Simple line plot using R ggplot2

I have the following data in .csv format. I am new to ggplot2 graphs and have not been able to plot it.
T L
141.5453333 1
148.7116667 1
154.7373333 1
228.2396667 1
148.4423333 1
131.3893333 1
139.2673333 1
140.5556667 2
143.719 2
214.3326667 2
134.4513333 3
169.309 8
161.1313333 4
I tried to plot a line graph using the following code
data<-read.csv("sample.csv",head=TRUE,sep=",")
ggplot(data,aes(T,L))+geom_line()
but I got the following image, which is not what I want
I want an image like the following
Can anybody help me?
You want to use a variable for the x-axis that has lots of duplicated values and expect the software to guess that the order you want those points plotted is given by the order they appear in the data set. This also means the values of the variable for the x-axis no longer correspond to the actual coordinates in the coordinate system you're plotting in, i.e., you want to map a value of "L=1" to different locations on the x-axis depending on where it appears in your data.
This type of fairly non-sensical thing does not work in ggplot2 out of the box. You have to define a separate variable that has a proper mapping to values on the x-axis ("id" in the code below) and then overwrite the labels with the values for "L".
The code below shows you how to do this, but it seems like a different graphical display would probably be better suited for this kind of data.
data <- as.data.frame(matrix(scan(text="
141.5453333 1
148.7116667 1
154.7373333 1
228.2396667 1
148.4423333 1
131.3893333 1
139.2673333 1
140.5556667 2
143.719 2
214.3326667 2
134.4513333 3
169.309 8
161.1313333 4
"), ncol=2, byrow=TRUE))
names(data) <- c("T", "L")
data$id <- 1:nrow(data)
ggplot(data,aes(x=id, y=T))+geom_line() + xlab("L") +
scale_x_continuous(breaks=data$id, labels=data$L)
You have an error in your code, try this:
ggplot(data,aes(x=L, y=T))+geom_line()
Default arguments for aes are:
aes(x, y, ...)
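A quick way to check which column ends up on which axis is to inspect the mapping object itself. This is only a sketch; `m_pos` and `m_named` are names introduced here, and rlang::as_label() is used just to print the captured expressions:

```r
library(ggplot2)

# Positional arguments: the first is x, the second is y
m_pos <- aes(T, L)
names(m_pos)                 # which aesthetics were set
rlang::as_label(m_pos$x)     # the column mapped to x
rlang::as_label(m_pos$y)     # the column mapped to y

# Naming the aesthetics removes the dependence on argument order
m_named <- aes(y = T, x = L)
rlang::as_label(m_named$x)
```

So aes(T, L) in the question puts T on the x-axis and L on the y-axis, the opposite of what the named call in this answer requests.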
