Customising intervals/bins with the cut function to tabulate data - r

I have a variable which I want to use in a contingency table, and so I want to cut the variable's (discrete) values into bins (or rather intervals) which I can then sort my data from a population into. However, I cannot find any way online that allows me to choose my bins in the following way:
[-30, -20) [-20, -10) [-10, 0) 0 (0, 10] (10, 20] (20, 30]
i.e. I want some intervals to be left-open and right-closed, some the other way around, and zero in the middle as a separate category altogether. Is there any way I can do this? I just want to tabulate data.

I think you will need two calls to cut for this:
x <- sample(-30:30, 1000, replace = TRUE)
The key is using the right parameter to get the closure:
x_lower <- as.character(cut(x, breaks = c(-30,-20,-10,0), right = FALSE))
x_upper <- as.character(cut(x, breaks = c(0,10,20,30), right = TRUE ))
And then combine them with ifelse (they are mutually exclusive and the two sets of intervals cover the whole range except zero so this should be fine):
x_new <- ifelse(is.na(x_lower), ifelse(is.na(x_upper), "0", x_upper), x_lower)
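Since the goal is a table, it may help to turn the combined labels back into a factor whose levels are in interval order; otherwise table() sorts the character labels alphabetically. The following is only a sketch of that idea: the names lower, upper, lab and x_binned are made up for illustration, and the test on x == 0 is a variation on the is.na() approach above, not a replacement for it.

```r
set.seed(1)
x <- sample(-30:30, 1000, replace = TRUE)

# Left-closed bins for the negative side, right-closed bins for the positive side
lower <- cut(x, breaks = c(-30, -20, -10, 0), right = FALSE)
upper <- cut(x, breaks = c(0, 10, 20, 30), right = TRUE)

# Zero gets its own label; every other value takes the label from its side
lab <- ifelse(x == 0, "0",
              ifelse(x < 0, as.character(lower), as.character(upper)))

# Rebuild a factor so that the table columns appear in interval order
x_binned <- factor(lab, levels = c(levels(lower), "0", levels(upper)))
table(x_binned)
```

The resulting factor can be passed to table() together with a second variable to get the contingency table.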

Related

R - Histogram Doesn't show density due to magnitude of the Data

I have a vector called data with a length of approximately 444000, and almost all of the numeric values are between 1 and 100. I want to draw the histogram and draw the appropriate density on it. However, when I draw the histogram I get this:
hist(data,freq=FALSE)
What can I do to actually see a more detailed histogram? I tried to use the breaks argument, and it helped, but it's really hard to see the histogram because it's so small. For example, I used breaks = 2000 and got this:
Is there something that I can do? Thanks!
Since you don't show data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
When dealing with outliers like this, you can display the histogram of the logs, but that may be difficult to interpret.
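A minimal sketch of the log-scale histogram, assuming all values are strictly positive (as they are in the simulated data here). The choice of log10, the number of breaks and the axis label are illustrative assumptions.

```r
set.seed(1)
d <- c(rexp(1e4, 100), runif(100, max = 5e4))

# Histogram of the base-10 logs; each unit on the x axis is a factor of 10
h <- hist(log10(d), breaks = 50,
          main = "Histogram of log10(d)", xlab = "log10(d)")
```

The bulk of the data and the outliers now both fit on one axis, at the cost of readers having to translate the axis back to the original scale.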
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, then be up-front about the removal. (The note below is terse ... it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
quantile(d, seq(0, 1, len=11))
d2 <- d[ d < quantile(d, 0.90) ]
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)
d3 <- d[ d < 10 ]
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)
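The question also asked for a density drawn on top of the histogram. One possible way to do that on the filtered data is sketched below; the from = 0 argument is an assumption that suits this non-negative simulated data and may not suit yours.

```r
set.seed(1)
d <- c(rexp(1e4, 100), runif(100, max = 5e4))
d2 <- d[d < quantile(d, 0.90)]

# freq = FALSE puts the histogram on the density scale so the curve is comparable
hist(d2, freq = FALSE, breaks = 50)

# Kernel density estimate, started at 0 because the data cannot be negative
dens <- density(d2, from = 0)
lines(dens, lwd = 2)
```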

Breaks not unique error when using cut and ddply

I am trying to break a dataset into quantiles based on a group.
I have the following code. If I do the cut using seq(0,1,.5) it works fine, but when I change to seq(0,1,.2) it gives:
Error in cut.default(x = fwd_quarts$v, breaks =
quantile(fwd_quarts$v, : 'breaks' are not unique
Trying different code, I can't get away from the error. How do I adjust this so that, when it expands to larger data sets, the quantiles will be created without the error?
ddf <- vector(mode="numeric", length=0)
df <- vector(mode="numeric", length=0)
g<-data.frame( g= c(1,1,1,1,2,2,2,2,3,3))
v<-data.frame( v= c(1,4,4,5,NA,2,6,NA,7,8))
df<-cbind(g,v)
df<-df[complete.cases(df), ]
ddf<-ddply(df, "g", function(fwd_quarts){
eps_quartile <- cut(x = fwd_quarts$v, breaks =quantile(fwd_quarts$v, probs = seq(0, 1, 0.5)),na.rm=TRUE, labels = FALSE, include.lowest = TRUE)
cbind(ddf,eps_quartile)
})
df<-cbind(df,fwde_quart=ddf$eps_quartile)
This has nothing to do with ddply.
If your data is not generating unique breaks, you can make them unique by wrapping the breaks in a call to unique().
breaks =unique(quantile(fwd_quarts$v, probs = seq(0, 1, 0.2)))
However, this will lower the number of levels from what you originally desired.
Generally speaking, if you have data like c(1,1,1,2) you can't break it into 3 quantile groups, because several of the quantile breaks coincide. The number of groups should be less than or equal to the number of unique values in your data. HTH.
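To see the effect on a small case, here is a sketch with a made-up vector v in which the lower quantiles coincide. It also passes na.rm = TRUE to quantile(), which is the function that actually accepts that argument.

```r
v <- c(1, 1, 1, 2, 5, 8)   # made-up data with repeated low values

# Quintile breaks: the first few coincide, so cut() would reject them
brks <- quantile(v, probs = seq(0, 1, 0.2), na.rm = TRUE)
anyDuplicated(brks) > 0

# Dropping the duplicated breaks lets cut() run, with fewer groups than requested
bins <- cut(v, breaks = unique(brks), labels = FALSE, include.lowest = TRUE)
table(bins)
```

Five groups were requested, but only as many as the distinct breaks allow are produced.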
I got the same problem in leaflet: if there are not enough observations to make the map, it gives the same error. As a solution, I just combined the clusters that had few observations.

R: Create a more readable X-axis after binning data in ggplot2. Turn bins into whole numbers

I have a dummy variable call it "drink" and a corresponding age variable that represents a precise age estimate (several decimal points) for each person in a dataset. I want to first "bin" the age variable, extracting the mean value for each bin based on the "drink" dummy, and then graph the result. My code to do so looks like this:
df$bins <- cut(df$age, seq(from = 17, to = 31, by = .2), include.lowest = TRUE)
df.plot <- ddply(df, .(bins), summarise, avg.drink = mean(drinks_alcohol))
qplot(bins, avg.drink, data = df.plot)
This works well enough, but the x-axis in the graph is unreadable because every one of the narrow bins gets its own label. Is there a way to modify the x-axis to show, for example, ages 19-23 only, with the "ticks" still aligning with the correct bins? For example, in my current code there is a bin for (19, 19.2] and another bin for (20, 20.2]. I would want only the bins that start at whole numbers to be identified on the x-axis, and labelled with the first number (19, 20), not the second (19.2, 20.2).
Is there any straightforward way to do this?
The most direct way to specify axis labels is with the appropriate scale function... in the case of factors on the x axis, scale_x_discrete. It will use whatever labels you give it with the labels argument, or you can give it a function that formats things as you like.
To "manually" specify the labels, you just need to create a vector of appropriate length. In this case, if your factor values are intervals beginning with seq(17, 31.8, by = 0.2) and you want to label bins beginning with integers, then your labels vector will be
bin_starts = seq(17, 31.8, by = 0.2)
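bin_labels = ifelse(abs(bin_starts - round(bin_starts)) < 0.0001, as.character(bin_starts), "")
(I compare against a small tolerance in case of precision problems, and use abs(a - round(a)) rather than a - trunc(a) so that a value that lands just below an integer is still caught, though it shouldn't be a problem in this particular case).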
A more robust solution would be to label the factor levels with the number at the start of the interval from the beginning. cut also has a labels argument.
my_breaks = seq(17, 32, by = 0.2)
df$bins <- cut(df$age, breaks = my_breaks, labels = head(my_breaks, -1),
include.lowest = TRUE)
You could then fairly easily write a formatter (following templates from the scales package) to print only the ones you want:
int_only = function(x) {
  # test if we can coerce to numeric, if not do nothing
  num = suppressWarnings(as.numeric(x))
  if (any(is.na(num))) return(x)
  # otherwise return integers and blanks as labels;
  # abs(num - round(num)) also catches values that sit just below an integer
  return(ifelse(abs(num - round(num)) < 1e-10, as.character(num), ""))
}
Then, using the nicely formatted data created above, you should be able to pass int_only as a formatter function to labels to get the labels you want. (Note: untested! necessary tweaks left as an exercise for the reader, though I'll gladly accept edits :) )
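To check what such a formatter would produce without drawing anything, it can be applied directly to the factor levels. The sketch below uses made-up ages (the age vector is an assumption, not the asker's data) and restates the formatter so that the block runs on its own.

```r
set.seed(42)
age <- runif(200, min = 17, max = 32)   # made-up ages for illustration

# Bins labelled by the number at the start of each interval
my_breaks <- seq(17, 32, by = 0.2)
bins <- cut(age, breaks = my_breaks, labels = head(my_breaks, -1),
            include.lowest = TRUE)

# Formatter: keep labels that are whole numbers, blank out the rest
int_only <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  if (anyNA(num)) return(x)
  ifelse(abs(num - round(num)) < 1e-8, as.character(num), "")
}

tick_labels <- int_only(levels(bins))
tick_labels[tick_labels != ""]
```

In a ggplot2 plot the same function would be supplied through the labels argument of scale_x_discrete, which calls it on the axis breaks.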

avoiding over-crowding of labels in r graphs

I am trying to avoid overcrowding of the labels in the following plot:
set.seed(123)
position <- c(rep (0,5), rnorm (5,1,0.1), rnorm (10, 3,0.1), rnorm (3, 4, 0.2), 5, rep(7,5), rnorm (3, 8,2), rnorm (10,9,0.5),
rep (0,5), rnorm (5,1,0.1), rnorm (10, 3,0.1), rnorm (3, 4, 0.2), 5, rep(7,5), rnorm (3, 8,2), rnorm (10,9,0.5))
group <- c(rep (1, length (position)/2),rep (2, length (position)/2) )
mylab <- paste ("MR", 1:length (group), sep = "")
barheight <- 0.5
y.start <- c(group-barheight/2)
y.end <- c(group+barheight/2)
mydf <- data.frame (position, group, barheight, y.start, y.end, mylab)
plot(0,type="n",ylim=c(0,3),xlim=c(0,10),axes=F,ylab="",xlab="")
#Create two horizontal lines
require(fields)
yline(1,lwd=4)
yline(2,lwd=4)
#Create text for the lines
text(10,1.1,"Group 1",cex=0.7)
text(10,2.1,"Group 2",cex=0.7)
#Draw vertical bars
lng = length(position)/2
lg1 = lng+1
lg2 = lng*2
segments(mydf$position[1:lng],mydf$y.start[1:lng],y1=mydf$y.end[1:lng])
segments(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2],y1=mydf$y.end[lg1:lg2])
text(mydf$position[1:lng],mydf$y.start[1:lng]+0.65, mydf$mylab[1:lng], srt = 90)
text(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2]+0.65, mydf$mylab[lg1:lg2], srt = 90)
You can see that some areas are crowded with labels, where the x values are the same or similar. I want to display only one label when there are multiple labels at the same point. For example,
mydf$position[1:5] are all 0,
but corresponding labels mydf$mylab[1:5] -
MR1 MR2 MR3 MR4 MR5
I just want to display the first one "MR1".
Similarly, when points are too close together (say, within a difference of 0.35), they should be considered a single cluster and only the first label should be displayed. In this way I would be able to get rid of the overcrowding of labels. How can I achieve this?
If you space the labels out and add some extra lines you can label every marker.
clpl <- function(xdata, names, y=1, dy=0.25, add=FALSE){
  # sort markers (and their labels) from left to right
  o = order(xdata)
  xdata = xdata[o]
  names = names[o]
  if(!add) plot(0,type="n",ylim=c(y-1,y+2),xlim=range(xdata),axes=FALSE,ylab="",xlab="")
  abline(h=y,lwd=4)  # baseline at height y
  segments(xdata,y-dy,xdata,y+dy)  # marker ticks
  # spread the label positions evenly along the x range
  tpos = seq(min(xdata),max(xdata),len=length(xdata))
  text(tpos,y+2*dy,names,srt=90,adj=0)
  segments(xdata,y+dy,tpos,y+2*dy)  # leader lines from ticks to labels
}
Then using your data:
clpl(mydf$position[lg1:lg2],mydf$mylab[lg1:lg2])
gives:
You could then think about labelling clusters underneath the main line.
I've not given much thought to doing multiple lines in a plot, but I think with a bit of mucking with my code and the add parameter it should be possible. You could also use colour to show clusters. I'm fairly sure these techniques are present in some of the clustering packages for R...
Obviously with a lot of markers even this is going to get smushed, but with a lot of clusters the same thing is going to happen. Maybe you end up labelling clusters with this technique?
In general, I agree with @Joran that cluster labelling can't be automated, but you've said that labelling a group of lines with the first label in the cluster would be OK, so it is possible to automate some of the process.
Putting the following code after the line lg2 = lng*2 gives the result shown in the image below:
clust <- cutree(hclust(dist(mydf$position[1:lng])),h=0.75)
u <- rep(T,length(unique(clust)))
clust.labels <- sapply(c(1:lng),function (i)
{
if (u[clust[i]])
{
u[clust[i]] <<- F
as.character(mydf$mylab)[i]
}
else
{
""
}
})
segments(mydf$position[1:lng],mydf$y.start[1:lng],y1=mydf$y.end[1:lng])
segments(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2],y1=mydf$y.end[lg1:lg2])
text(mydf$position[1:lng],mydf$y.start[1:lng]+0.65, clust.labels, srt = 90)
text(mydf$position[lg1:lg2],mydf$y.start[lg1:lg2]+0.65, mydf$mylab[lg1:lg2], srt = 90)
(I've only labelled the clusters on the lower line -- the same principle could be applied to the upper line too). The parameter h of cutree() might have to be adjusted case-by-case to give the resolution of labels that you want, but this approach is at least easier than labelling every cluster by hand.
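The "first label per cluster" step can also be written without the loop and the global flag vector, using duplicated() on the cluster ids. This is a sketch on a shortened, made-up position vector; the names clust and clust_labels are chosen here for illustration.

```r
set.seed(123)
position <- c(rep(0, 5), rnorm(5, 1, 0.1), rnorm(10, 3, 0.1))
mylab <- paste0("MR", seq_along(position))

# Cluster the positions; h sets how close points must be to share a cluster
clust <- cutree(hclust(dist(position)), h = 0.75)

# Keep the label of the first member of each cluster, blank out the others
clust_labels <- ifelse(duplicated(clust), "", mylab)
clust_labels
```

The resulting vector can be passed to text() in place of the full label vector.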

Tufte tables: convert quartile plots into standard error plots hacking qTable function from NMOF package

As you may remember, there is a nice version of a table conceived by Tufte that includes small quartile plots running next to the corresponding data rows:
There is an implementation of this in R, using the package NMOF and its function qTable, which creates the table shown above and outputs it as LaTeX code:
require(NMOF)
x <- rnorm(100, mean = 0, sd = 2)
y <- rnorm(100, mean = 1, sd = 2)
z <- rnorm(100, mean = 1, sd = 0.5)
X <- cbind(x, y, z)
qTable(X,filename="res.tex")#this will save qTable in tex format in current dir
This method of visualization seems to be especially useful if you have a small amount of information to present and you don't want to waste space on a full graph. But I would like to hack qTable a bit. Instead of displaying quartile plots, I would prefer to display standard errors of the mean. I am not great at hacking such functions, so I used brute force to do it. I replaced the line from the qTable function which computes quantiles:
A <- apply(X, 2L, quantile, c(0.25, 0.5, 0.75))
with something very brutal, that computes standard errors:
require(psych)#provides the 'SD' function to compute column standard deviations
M = colMeans(X)#column means
SE = SD(X)/sqrt(nrow(X))#standard error
SELo = M-SE#lower bound
SEHi = M+SE#upper bound
A = t(data.frame(SELo,M,SEHi))#combines it together
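For comparison, the same three rows (lower bound, mean, upper bound) can be built in base R without the psych package. This is a sketch under the assumption that X is a numeric matrix with one variable per column, as in the example above.

```r
set.seed(1)
X <- cbind(x = rnorm(100, mean = 0, sd = 2),
           y = rnorm(100, mean = 1, sd = 2),
           z = rnorm(100, mean = 1, sd = 0.5))

M  <- colMeans(X)                           # column means
SE <- apply(X, 2L, sd) / sqrt(nrow(X))      # standard error of each mean

# Rows: mean - SE, mean, mean + SE; one column per variable
A <- rbind(M - SE, M, M + SE)
A
```

The result has the same 3 x ncol(X) shape as the matrix of quartiles it replaces.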
I know, I know, it's possibly an unsustainable approach, but it actually works to some extent: it plots standard errors, but keeps this gap in the plot:
and I would like this gap to disappear.
Here is a qTable function with modification discussed above.
To remove the gaps, you can insert these two lines:
B[2, ] <- B[3, ]
B[4, ] <- B[3, ]
right before the for loop that starts with
for (cc in 1L:dim(X)[2L]) {...
Why does it work? Reading the graph from left to right, the five rows of B correspond to where
1) the left segments start
2) the left segments end
3) the dots are
4) the right segments start
5) the right segments end
so by forcing B[2, ] and B[4, ] to B[3, ] you are effectively getting rid of the gaps.
