Plotly histogram is missing data - r

I'm trying to create a histogram which involves a lot of repeated values in one of the cases. One of the data points is not being represented in the graph. Here is the smallest, simplest subset I could find that still reproduced my issue.
cleanVar <- c(rep(1,9),1.25,1.5)
plot_ly(data.table(cleanVar),
x = ~cleanVar,
type = "histogram")
The above graph shows only two bars. One centered at 1 of height 9, and one centered at 1.2 of height 1.
Also strangely, the hover-over shows "1" for the first bar, despite it covering the range [.9,1.1], and it shows "1.25" for the second bar, despite it covering the range [1.1,1.3].
If we change the 1 to only be repeated 8 times cleanVar <- c(rep(1,8),1.25,1.5), so that there are 10 total values in the histogram, it works better, but still, the three bins it creates are .25 wide according to the hover-over, yet they are only .2 wide on the graph itself.
What is plotly doing? How can I properly show 3 bins of height 9,1,1 and width .25? binning options in layout() aren't working.

By default plotly uses the following procedure to define the bins:
start:
Sets the starting value for the x axis bins. Defaults to the minimum data value, shifted down if necessary to make nice round values and to remove ambiguous bin edges. For example, if most of the data is integers we shift the bin edges 0.5 down, so a size of 5 would have a default start of -0.5, so it is clear that 0-4 are in the first bin, 5-9 in the second, but continuous data gets a start of 0 and bins [0,5), [5,10) etc. Dates behave similarly, and start should be a date string. For category data, start is based on the category serial numbers, and defaults to -0.5. If multiple non-overlaying histograms share a subplot, the first explicit start is used exactly and all others are shifted down (if necessary) to differ from that one by an integer number of bins.
end:
Sets the end value for the x axis bins. The last bin may not end
exactly at this value, we increment the bin edge by size from
start until we reach or exceed end. Defaults to the maximum data
value. Like start, for dates use a date string, and for category
data end is based on the category serial numbers.
You can find this information via:
library(listviewer)
schema(jsonedit = interactive())
Navigate as follows: object ► traces ► histogram ► attributes ► xbins ► start
To avoid the default behaviour just make your x variable a factor:
library(plotly)
library(data.table)
cleanVar <- c(rep(1, 9), 1.25, 1.5)
plot_ly(data.table(cleanVar),
x = ~factor(cleanVar),
type = "histogram")
Result:

Related

Forecast plot with x axis labels as date

I have a dataset like revenue and date.
I used arima to plot the data.
ts_data = ts(dataset$Revenue,frequency = 7)
arima.ts = auto.arima(ts_data)
pred = forecast(arima.ts,h=30)
plot(pred,xaxt="n")
When I plot the data, it produces plot like below.
My expectations are below,
I need to display values in Million for predicted values like 13.1M.
I need to show x-axis as date instead of data points numbers.
I tried several links but couldn't crack it. Below are the experiments I made,
Tried with start date and end date in ts_data that also doesnt work.My start date is "2019-09-27" and end date is "2020-07-02"
tried wit axis_date in plot function that also doesnt work.
Please help me to crack the same.
Thanks a lot.
You can specify axis tick values and labels with axis()
plot(pred,xaxt="n", yaxt="n") # deactivate x and y axis
yl<-seq(-5000000,15000000,by=5000000) # position of y-axis ticks
axis(2, at=yl, label=paste(yl/1000000, "M")) # 2 = left axis
You can specify the desired position of y axis ticks with at and the labels to be associated with label. In order to obtain the values like 10 M I have used the function paste to join the numbers with the letter M.
You could use the same method also for x-axis, even tough more efficient methods probably exist. Not having the specific data available I have used a generic spacing and labels only to give you the idea. Once you have set the desired position you can generate the sequence of dates associated with it (to generate a sequence of dates see https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/seq.Date)
axis(1, at=seq(1,40,by=1), label=seq(as.Date("2019-09-27"),as.Date("2020-07-02"),by="week")) # 1 = below axis
You can also change the format of the dates displayed with format() for example label=format(vector_of_date, "%Y-%b-%d") to have year-month(in letter)-day.

set the last bin of histogram to a interval between a number and infinity in ggplot

I have a really skewed data and I am want to set my histogram's last bin to include a threshold number to infinity so that my histogram will be not skewed. I know we can set xlim or coord_cartisian to zooming but I want to keep all the data.
x=data.frame(100*rbeta(10000,2,50))
ggplot(data=x,aes(x))+geom_histogram(bins=20)+scale_x_continuous(breaks =seq(1,100,by=5))
The accepted answer will get a little ugly if the aggreagted bin gets too big. You can map the values:
x <- mapvalues(x,
from = c(aggBinLow:aggBinHigh),
to = c(rep.int(aggBinLow,aggBinHigh-aggBinLow+1)))
and add a scale with distinct values:
g +
scale_x_continuous(breaks=min:aggBinLow,labels=c(sprintf("%s",min:aggBinLow-1),">aggBinLow-1"))
Use geom_histogram(breaks=c(...)) to set customised bins, where c(...) is the vector of values you want. For example c(seq(from=1,to=11,by=1),100000)

R: setting label spacing and position on a bar plot

I want to present percentages over a 24h period in 15 min intervals as a bar plot.
When I use barplot(), the labels for those timepoints are more or less randomly chosen by R (depending on how I format the window. I know it's not random, but it's not what I want either). I would rather have them evenly spaced at 1 h intervals (that is every 4th bar).
I have searched extensively on this and know i can add labels later with axis() but I have not found a way to set which bars are labeled and which are left blank.
So here is an example. Sorry for the long lines:
x<-sample(1:100,96)
Labels<-c("09","09:15","09:30","09:45","10:00","10:15","10:30","10:45","11","11:15","11:30","11:45","12","12:15","12:30","12:45","13","13:15","13:30","13:45","14","14:15","14:30","14:45","15","15:15","15:30","15:45","16","16:15","16:30","16:45","17","17:15","17:30","17:45","18","18:15","18:30","18:45","19","19:15","19:30","19:45","20","20:15","20:30","20:45","21","21:15","21:30","21:45","22","22:15","22:30","22:45","23","23:15","23:30","23:45","00","00:15","00:30","00:45","01","01:15","01:30","01:45","02","02:15","02:30","02:45","03","03:15","03:30","03:45","04","04:15","04:30","04:45","05","05:15","05:30","05:45","06","06:15","06:30","06:45","07","07:15","07:30","07:45","08","08:15","08:30","08:45")
names(x)<-Labels
barplot(x)
I do not think you can force R to show every label if it does not have enough space. But at least if you want to add the labels every 1h, the following code should work :
x<-sample(1:100,96)
Labels<-c("09","09:15","09:30","09:45","10","10:15","10:30","10:45","11","11:15","11:30","11:45","12","12:15","12:30","12:45","13","13:15","13:30","13:45","14","14:15","14:30","14:45","15","15:15","15:30","15:45","16","16:15","16:30","16:45","17","17:15","17:30","17:45","18","18:15","18:30","18:45","19","19:15","19:30","19:45","20","20:15","20:30","20:45","21","21:15","21:30","21:45","22","22:15","22:30","22:45","23","23:15","23:30","23:45","00","00:15","00:30","00:45","01","01:15","01:30","01:45","02","02:15","02:30","02:45","03","03:15","03:30","03:45","04","04:15","04:30","04:45","05","05:15","05:30","05:45","06","06:15","06:30","06:45","07","07:15","07:30","07:45","08","08:15","08:30","08:45")
b=barplot(x,axes = F)
axis(2)
axis(1,at=c(b[seq(1,length(Labels),4)],b[length(b)]+diff(b)[1]),labels = c(Labels[seq(1,length(Labels),4)],"09"))

Binary spark lines with R

I'm looking to plot a set of sparklines in R with just a 0 and 1 state that looks like this:
Does anyone know how I might create something like that ideally with no extra libraries?
I don't know of any simple way to do this, so I'm going to build up this plot from scratch. This would probably be a lot easier to design in illustrator or something like that, but here's one way to do it in R (if you don't want to read the whole step-by-step, I provide my solution wrapped in a reusable function at the bottom of the post).
Step 1: Sparklines
You can use the pch argument of the points function to define the plotting symbol. ASCII symbols are supported, which means you can use the "pipe" symbol for vertical lines. The ASCII code for this symbol is 124, so to use it for our plotting symbol we could do something like:
plot(df, pch=124)
Step 2: labels and numbers
We can put text on the plot by using the text command:
text(x,y,char_vect)
Step 3: Alignment
This is basically just going to take a lot of trial and error to get right, but it'll help if we use values relative to our data.
Here's the sample data I'm working with:
df = data.frame(replicate(4, rbinom(50, 1, .7)))
colnames(df) = c('steps','atewell','code','listenedtoshell')
I'm going to start out by plotting an empty box to use as our canvas. To make my life a little easier, I'm going to set the coordinates of the box relative to values meaningful to my data. The Y positions of the 4 data series will be the same across all plotting elements, so I'm going to store that for convenience.
n=ncol(df)
m=nrow(df)
plot(1:m,
seq(1,n, length.out=m),
# The following arguments suppress plotting values and axis elements
type='n',
xaxt='n',
yaxt='n',
ann=F)
With this box in place, I can start adding elements. For each element, the X values will all be the same, so we can use rep to set that vector, and seq to set the Y vector relative to Y range of our plot (1:n). I'm going to shift the positions by percentages of the X and Y ranges to align my values, and modified the size of the text using the cex parameter. Ultimately, I found that this works out:
ypos = rev(seq(1+.1*n,n*.9, length.out=n))
text(rep(1,n),
ypos,
colnames(df), # These are our labels
pos=4, # This positions the text to the right of the coordinate
cex=2) # Increase the size of the text
I reversed the sequence of Y values because I built my sequence in ascending order, and the values on the Y axis in my plot increase from bottom to top. Reversing the Y values then makes it so the series in my dataframe will print from top to bottom.
I then repeated this process for the second label, shifting the X values over but keeping the Y values the same.
text(rep(.37*m,n), # Shifted towards the middle of the plot
ypos,
colSums(df), # new label
pos=4,
cex=2)
Finally, we shift X over one last time and use points to build the sparklines with the pipe symbol as described earlier. I'm going to do something sort of weird here: I'm actually going to tell points to plot at as many positions as I have data points, but I'm going to use ifelse to determine whether or not to actually plot a pipe symbol or not. This way everything will be properly spaced. When I don't want to plot a line, I'll use a 'space' as my plotting symbol (ascii code 32). I will repeat this procedure looping through all columns in my dataframe
for(i in 1:n){
points(seq(.5*m,m, length.out=m),
rep(ypos[i],m),
pch=ifelse(df[,i], 124, 32), # This determines whether to plot or not
cex=2,
col='gray')
}
So, piecing it all together and wrapping it in a function, we have:
df = data.frame(replicate(4, rbinom(50, 1, .7)))
colnames(df) = c('steps','atewell','code','listenedtoshell')
BinarySparklines = function(df,
L_adj=1,
mid_L_adj=0.37,
mid_R_adj=0.5,
R_adj=1,
bottom_adj=0.1,
top_adj=0.9,
spark_col='gray',
cex1=2,
cex2=2,
cex3=2
){
# 'adJ' parameters are scalar multipliers in [-1,1]. For most purposes, use [0,1].
# The exception is L_adj which is any value in the domain of the plot.
# L_adj < mid_L_adj < mid_R_adj < R_adj
# and
# bottom_adj < top_adj
n=ncol(df)
m=nrow(df)
plot(1:m,
seq(1,n, length.out=m),
# The following arguments suppress plotting values and axis elements
type='n',
xaxt='n',
yaxt='n',
ann=F)
ypos = rev(seq(1+.1*n,n*top_adj, length.out=n))
text(rep(L_adj,n),
ypos,
colnames(df), # These are our labels
pos=4, # This positions the text to the right of the coordinate
cex=cex1) # Increase the size of the text
text(rep(mid_L_adj*m,n), # Shifted towards the middle of the plot
ypos,
colSums(df), # new label
pos=4,
cex=cex2)
for(i in 1:n){
points(seq(mid_R_adj*m, R_adj*m, length.out=m),
rep(ypos[i],m),
pch=ifelse(df[,i], 124, 32), # This determines whether to plot or not
cex=cex3,
col=spark_col)
}
}
BinarySparklines(df)
Which gives us the following result:
Try playing with the alignment parameters and see what happens. For instance, to shrink the side margins, you could try decreasing the L_adj parameter and increasing the R_adj parameter like so:
BinarySparklines(df, L_adj=-1, R_adj=1.02)
It took a bit of trial and error to get the alignment right for the result I provided (which is what I used to inform the default values for BinarySparklines), but I hope I've given you some intuition about how I achieved it and how moving things using percentages of the plotting range made my life easier. In any event, I hope this serves as both a proof of concept and a template for your code. I'm sorry I don't have an easier solution for you, but I think this basically gets the job done.
I did my prototyping in Rstudio so I didn't have to specify the dimensions of my plot, but for posterity I had 832 x 456 with the aspect ratio maintained.

How to set the hist graph properly?

Here is my data file in (update 2021: link is dead... http://s.yunio.com/87HT7f), please download it and save it as mydata.
y<-scan("mydata")
hist(y,breaks=c(0,60,70,80,90,100),freq=TRUE)
axis(2,at=seq(0,20,length.out=5),labels=c(0,5,10,15,20))
There are two problems:
1.Warning message:
In plot.histogram(r, freq = freq1, col = col, border = border, angle = angle, :
the AREAS in the plot are wrong -- rather use freq=FALSE
I just want freq not probability, times counted on the y axis, how to make the warning message vanish?
2.When run
axis(2,at=seq(0,20,length.out=5),labels=c(0,5,10,15,20))
There no 20 on y axis.
For the first problem, it is a warning, not the error. This warning says that visually areas of each bar do not correspond to their actual frequency - you can see it from the first bar that has the largest area but frequency is only 5.
For the second problem, you have to set ylim=c(0,20) inside hist() to see also number 20 because y axis is shorter than 20. Function axis() plots just labels, it doesn't change length of axis (originally there is no space for number 20).
hist(y,breaks=c(0,60,70,80,90,100),freq=TRUE,ylim=c(0,20))
axis(2,at=seq(0,20,length.out=5),labels=c(0,5,10,15,20))
Check the manual for hist:
freq:
Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant
(and 'probability' is not specified).

Resources