Tufte tables: convert quartile plots into standard error plots by hacking the qTable function from the NMOF package - r

If you remember, there is a nice version of a table conceived by Tufte that includes small quartile plots running next to the corresponding data rows:
There is an implementation of such a solution in R using the package NMOF and the function qTable, which basically creates the table shown above and outputs it as LaTeX code:
require(NMOF)
x <- rnorm(100, mean = 0, sd = 2)
y <- rnorm(100, mean = 1, sd = 2)
z <- rnorm(100, mean = 1, sd = 0.5)
X <- cbind(x, y, z)
qTable(X, filename = "res.tex")  # saves the qTable as LaTeX in the current dir
This method of visualization seems to be especially useful if you have a small amount of information to present and you don't want to waste space on a full graph. But I would like to hack qTable a bit: instead of displaying quantile plots, I would prefer to display standard errors of the mean. I am not great at hacking such functions, and I used brute force to do it. I replaced the line from the qTable function that computes the quantiles:
A <- apply(X, 2L, quantile, c(0.25, 0.5, 0.75))
with something very brutal that computes standard errors instead:
require(psych)  # provides SD() for column standard deviations
M <- colMeans(X)               # column means
SE <- SD(X) / sqrt(nrow(X))    # standard error of the mean
# (base R works too: SE <- apply(X, 2, sd) / sqrt(nrow(X)))
SELo <- M - SE                 # lower bound
SEHi <- M + SE                 # upper bound
A <- t(data.frame(SELo, M, SEHi))  # combine into the 3-row matrix qTable expects
I know, I know, it's possibly an unsustainable approach, but it actually works to some extent: it plots standard errors but keeps this gap in the plot:
and I would like this gap to disappear.
Here is the qTable function with the modification discussed above.

To remove the gaps, you can insert these two lines:
B[2, ] <- B[3, ]
B[4, ] <- B[3, ]
right before the for loop that starts with
for (cc in 1L:dim(X)[2L]) {...
Why does it work? Reading the graph from left to right, the five rows of B correspond to where
1) the left segments start
2) the left segments end
3) the dots are
4) the right segments start
5) the right segments end
so by forcing B[2, ] and B[4, ] to B[3, ] you are effectively getting rid of the gaps.
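For intuition, here is a minimal standalone sketch of that five-row drawing logic and the fix. This is an assumed reconstruction for illustration only, not the literal qTable source (whose internals may differ across NMOF versions):
X <- cbind(rnorm(100), rnorm(100, mean = 1), rnorm(100, mean = 1, sd = 0.5))
M <- colMeans(X)
SE <- apply(X, 2, sd) / sqrt(nrow(X))
B <- rbind(M - SE, M - SE, M, M + SE, M + SE)  # rows 1-5 as listed above
B[2, ] <- B[3, ]  # left segment now ends exactly at the dot
B[4, ] <- B[3, ]  # right segment now starts exactly at the dot
op <- par(mfrow = c(ncol(B), 1), mar = c(1, 2, 1, 1))
for (cc in 1:ncol(B)) {
  plot(range(B[, cc]), c(1, 1), type = "n", axes = FALSE, ann = FALSE)
  segments(B[1, cc], 1, B[2, cc], 1)  # left whisker
  points(B[3, cc], 1, pch = 16)       # mean dot
  segments(B[4, cc], 1, B[5, cc], 1)  # right whisker
}
par(op)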

Related

How to smooth a curve in R?

location_difference <- c(0, 0.5, 1, 1.5, 2)
Power <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
plot(location_difference, Power)
The author of the paper said he smoothed the curve using a weighted moving average with the weights vector w = (0.25, 0.5, 0.25), but he did not explain how he did this or which function he used. I am really confused.
Up front, as @MartinWettstein cautions, be careful about when you smooth data and what you do with it (what you infer from it). Having said that, a simple weighted moving average might look like this.
# replacement data
x <- seq(0, 2, len = 5)
y <- c(0, 0.02, 0.65, 1, 1)
# smoothed
ysm <- zoo::rollapply(c(NA, y, NA), 3,
                      function(a) Hmisc::wtd.mean(a, c(0.25, 0.5, 0.25), na.rm = TRUE),
                      partial = FALSE)
# plot
plot(x, y, type = "b", pch = 16)
lines(x, ysm, col = "red")
Notes:
the zoo package provides a rolling window (3-wide here), calling the function once for indices 1-3, then again for indices 2-4, then 3-5, 4-6, etc.
with rolling-window operations, realize that they can be center-aligned (the default for zoo::rollapply) or left/right-aligned. There are some good explanations in: How to calculate 7-day moving average in R?
I surround the y data with NAs so that I can mimic a partial window. Normally with rolling-window ops, if k = 3, then the resulting vector is length(y) - (k-1) long. I'm inferring that you want to include data at the ends, so the first smoothed data point is effectively (0.5*0 + 0.25*0.02)/0.75, the second is (0.25*0 + 0.5*0.02 + 0.25*0.65)/1, and the last is (0.25*1 + 0.5*1)/0.75; that is, omitting the 0.25 weight on a missing data point. That's a guess and can easily be adjusted based on your real needs.
I'm using Hmisc::wtd.mean, though it is trivial to write this weighted-mean function yourself; a base-R sketch follows.
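For instance (my own addition, not from the original answer), stats::filter applies the same centered weights without extra packages; note it leaves NA at the ends rather than the partial-window treatment above:
# centered weighted moving average via base R; ends become NA
ysm2 <- as.numeric(stats::filter(y, c(0.25, 0.5, 0.25), sides = 2))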
This is suggestive only, and not meant to be authoritative. Just to help you begin exploring your smoothing processes.

Customising intervals/bins with the cut function to tabulate data

I have a variable which I want to use in a contingency table, and so I want to cut the variable's (discrete) values into bins (or rather intervals) into which I can then sort my data from a population. However, I cannot find any way online that allows me to choose my bins in the following way:
[-30, -20) [-20, -10) [-10, 0) 0 (0, 10] (10, 20] (20, 30]
i.e. I want some intervals to be left-open and right-closed, some the other way around, and in the middle, zero being a different category altogether. Is there any way I can do this? I just want to tabulate data.
I think you will need two calls to cut for this:
x <- sample(-30:30, 1000, replace = TRUE)
The key is the right argument, which controls which side of each interval is closed:
x_lower <- as.character(cut(x, breaks = c(-30, -20, -10, 0), right = FALSE))
x_upper <- as.character(cut(x, breaks = c(0, 10, 20, 30), right = TRUE))
And then combine them with ifelse (they are mutually exclusive, and the two sets of intervals cover the whole range except zero, so this should be fine):
x_new <- ifelse(is.na(x_lower), ifelse(is.na(x_upper), "0", x_upper), x_lower)
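Since the end goal is tabulation, you may also want the bins in numeric order; a small follow-up sketch (the level strings below are cut's default labels for these breaks, assuming you kept them):
lvls <- c("[-30,-20)", "[-20,-10)", "[-10,0)", "0", "(0,10]", "(10,20]", "(20,30]")
x_new <- factor(x_new, levels = lvls)
table(x_new)  # counts per interval, in order, with 0 as its own category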

R - Histogram Doesn't show density due to magnitude of the Data

I have a vector called data with length approximately 444000, and most of the numeric values are between 1 and 100 (almost all of them). I want to draw the histogram and draw the appropriate density on it. However, when I draw the histogram I get this:
hist(data,freq=FALSE)
What can I do to actually see a more detailed histogram? I tried to use the breaks argument, and it helped, but it's really hard to see the histogram because it's so small. For example, I used breaks = 2000 and got this:
Is there something that I can do? Thanks!
Since you don't show your data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
Dealing with outliers like this, you can display the histogram of the logs, but that may be difficult to interpret:
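The call itself isn't shown above; a minimal version might be (base-10 logs are my assumption):
hist(log10(d))  # bins are equal-width in log10(d), not in d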
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, you should be up-front about the removal. (This is terse ... it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
quantile(d, seq(0, 1, len = 11))   # inspect deciles to choose a cutoff
d2 <- d[d < quantile(d, 0.90)]     # dynamic: drop the top 10%
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)
d3 <- d[d < 10]                    # manual: fixed threshold
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)

Want to plot a function on [a,b] and, on the same plot, shade the area under it on [c,d] where c and d lie between a and b

I'm trying to plot a function using ggplot, which I can do. For example, y = x, plotted between -1 and 4: works great. On the same graph, I now want to shade the area under the curve between 1 and 3. I cannot get it to work, nor can I find any documentation on it. Can someone help me?
Skeleton code that I'm trying:
eq <- function(x) { x }
ggplot(data.frame(x = c(-1, 4)), aes(x)) +
  stat_function(fun = eq, geom = "line", color = "red") +
  stat_function(fun = eq, geom = "area", fill = "blue")
I tried all different permutations. If there were a way to limit the second stat_function to a different domain, it might work. Any ideas?
It appears that stat_function works over the full range of the data rather than over the particular values supplied. One option is to generate the data you want to plot first and then pass it to a specific geom_* (see the sketch after the stat_function examples below). However, if you really want to stay with stat_function, you need a bit more of a workaround. One approach is to create two functions, one of which limits its output to the range you want:
eq <- function(x) { x }
eqB <- function(x) { ifelse(x < 3 & x > 0, x, NA) }
ggplot(data.frame(x = c(-1, 4)), aes(x)) +
  stat_function(fun = eq, geom = "line", color = "red") +
  stat_function(fun = eqB, geom = "area", fill = "blue")
A more robust solution is to create a single function that accepts range limits and then pass those in using args:
eqC <- function(x, upr = max(x), lwr = min(x)) {
  ifelse(x <= upr & x >= lwr, x, NA)
}
ggplot(data.frame(x = c(-1, 4)), aes(x)) +
  stat_function(fun = eqC, geom = "line", color = "red") +
  stat_function(fun = eqC, geom = "area", fill = "blue",
                args = list(upr = 3, lwr = 0))
Both of these generate a plot that looks like this:
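And here is the generate-the-data-first option mentioned above, as a minimal sketch (my own illustration, not part of the original answer):
library(ggplot2)
df <- data.frame(x = seq(-1, 4, length.out = 200))
df$y <- df$x                                   # y = x
ggplot(df, aes(x, y)) +
  geom_area(data = subset(df, x >= 0 & x <= 3), fill = "blue") +
  geom_line(color = "red")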

Graphing a polynomial output of calc.poly

I apologize first for bringing what I imagine to be a ridiculously simple problem here, but I have been unable to glean from the help file for package 'polynom' how to solve this problem. For one out of several years, I have two vectors of x (d for day of year) and y (e for an index of egg production) data:
d=c(169,176,183,190,197,204,211,218,225,232,239,246)
e=c(0,0,0.006839425,0.027323127,0.024666883,0.005603878,0.016599262,0.002810977,0.005603878,0,0.002810977,0.002810977)
I want, for each year, to use the poly.calc function to create a polynomial function that I can use to interpolate the timing of maximum egg production. I then want to superimpose the function on a plot of the data. To begin, I have no problem with the poly.calc function:
egg1996<-poly.calc(d,e)
egg1996
3216904000 - 173356400*x + 4239900*x^2 - 62124.17*x^3 + 605.9178*x^4 - 4.13053*x^5 +
0.02008226*x^6 - 6.963636e-05*x^7 + 1.687736e-07*x^8
I can then simply
plot(d,e)
But when I try to use the lines function to superimpose the function on the plot, I get confused. The help file states that the output of poly.calc is an object of class polynomial, and so I assume that "egg1996" will be the "x" in:
lines(x, len = 100, xlim = NULL, ylim = NULL, ...)
But I cannot seem to, based on the example listed:
lines(poly.calc(2:4), lty = 2)
Or based on the arguments:
x an object of class "polynomial".
len size of vector at which evaluations are to be made.
xlim, ylim the range of x and y values with sensible defaults
come up with a command that successfully graphs the polynomial "egg1996" onto the raw data.
I understand that this question is beneath you folks, but I would be very grateful for a little help. Many thanks.
I don't work with the polynom package, but the resultant data set is on a completely different scale (on both the X and Y axes) than the first plot() call. If you don't mind having it in two separate panels, this provides both plots for comparison:
library(polynom)
d <- c(169,176,183,190,197,204,211,218,225,232,239,246)
e <- c(0,0,0.006839425,0.027323127,0.024666883,0.005603878,
0.016599262,0.002810977,0.005603878,0,0.002810977,0.002810977)
egg1996 <- poly.calc(d,e)
par(mfrow = c(1, 2))
plot(d, e)
plot(egg1996)
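If you do want the curve on top of the raw data, one approach (a sketch, assuming polynom's predict method for "polynomial" objects) is to evaluate the polynomial on a fine grid yourself and overlay it with the default lines():
dd <- seq(min(d), max(d), length.out = 400)
plot(d, e, ylim = range(e))             # constrain y to the data range
lines(dd, predict(egg1996, dd), col = "red")
Be aware that a degree-11 interpolating polynomial oscillates between the data points, so parts of the curve may shoot far outside the range of e; the ylim above keeps the data visible.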
