I'm a beginner R programmer attempting to plot a histogram of an insurance claims dataset with 100,000+ observations that is heavily skewed (mean = $61,000, median = $20,000, max = $15M).
I've run the following code to plot the adj_unl_claim variable over the $0-$100,000 range:
hist(test$adj_unl_claim, freq=FALSE, ylim=c(0,1), xlim=c(0,100000),
prob=TRUE, breaks=10, col='red')
The result is an empty graph: axes appear, but no histogram bars.
I suspect the problem is related to the skewed nature of my data, but I've tried every combination of breaks and xlim and nothing works. Any solutions are much appreciated!
If you've set freq = FALSE, then you are getting a histogram of probability densities, and on dollar-scale data these are likely much less than 1. Consequently, your histogram bars are probably drawn so short that they hug the x-axis. Try again without setting ylim, and R will automatically calculate reasonable y-axis limits.
Note also that setting xlim doesn't change the actual plot, just how much of it you see. So you might not actually see 10 breaks if some of them fall beyond the 100000 limit in your plot. You might actually want to subset your data to exclude values over 100000 first, and then draw a histogram of the reduced dataset to get the plot you want. Maybe - I'm not sure what your objective is here.
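As a quick check (my sketch, assuming your data frame is called test as in your code), you can look at the largest density hist would draw; on dollar-scale data it sits far below your ylim of 1:
h <- hist(test$adj_unl_claim, plot = FALSE)  # compute the bins without plotting
max(h$density)  # roughly on the order of 1e-5 here, so the bars vanish under ylim = c(0,1)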
This might give you something to play with, using some of Tyler's suggestions.
> claim <- c(15000000, rexp(99999, rate = 1/400)^1.76)
> summary(claim)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
       0     4261    20080    61730    67790 15000000
>
> hs <- 100000 # highest value to show on histogram
> br <- 10 # number of bars to show on histogram
>
> hist(claim, xlim = c(0,hs), freq = FALSE, breaks = br*max(claim)/hs, col='red')
>
> length(claim[claim<hs]) / length(claim) #proportion of claims shown
[1] 0.82267
> sum(claim[claim<hs]) / sum(claim) #proportion of value shown
[1] 0.3057994
where hist produced something like this: [Image: red histogram of the simulated claims, truncated at $100,000]
The problem with this is that although the histogram covers about 82% of the claims in this pseudo-data, it covers only about 31% of the value of the claims. So unless the only point you want to make is that most claims are small, you might want to consider a different graph.
My guess is that the real point from your data is that while most claims are fairly small, most of the cost is in the big claims. The big claims will not show up in a histogram, even if you extend the scale. Instead break the claims up into groups of differing widths, including for example 0-$1000 and $1M+, and show with a dot plot (a) what proportion of claims fall into each group and (b) what proportion of the values of claims fall into each group.
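Here is a minimal sketch of that idea, reusing the simulated claim vector from above; the group boundaries are illustrative assumptions, not from the original post:
brks <- c(0, 1e3, 1e4, 1e5, 1e6, Inf)
grp <- cut(claim, breaks = brks, include.lowest = TRUE,
           labels = c("0-1K", "1K-10K", "10K-100K", "100K-1M", "1M+"))
prop_claims <- tapply(claim, grp, length) / length(claim)  # (a) share of claims per group
prop_value <- tapply(claim, grp, sum) / sum(claim)         # (b) share of total value per group
dotchart(cbind("share of claims" = prop_claims, "share of value" = prop_value))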
Two things to try:
hist(test$adj_unl_claim[test$adj_unl_claim < 100000])
will plot a histogram of all claims of less than $100k. This omits the tail in the interest of showing the bulk of the data. Alternatively,
hist(log(test$adj_unl_claim))
will log-transform your claim size, effectively bringing the long tail back in.
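One caveat to the log approach (my note, not the answerer's): log is undefined at zero, so if any claims are exactly $0 you may want to drop them first, e.g.:
hist(log10(test$adj_unl_claim[test$adj_unl_claim > 0]))  # log10 so the axis reads in powers of ten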
Thanks, subsetting my data did the trick. I also added two lines of code that calculate the proportion of observations in each histogram bin and then plot them with specific y and x limits:
k <- hist(gb2_agg$adj_unl_claim, prob = TRUE, breaks = 100000)
k$counts <- k$counts / sum(k$counts)  # convert bin counts to proportions
plot(k, ylim = c(0, .02), xlim = c(0, 50000), col = 'blue')
I have performed PCA using the prcomp function (from base R, used together with the FactoMineR/factoextra packages) on quite a substantial dataset of 3000 x 500.
I have tried plotting the principal components that cover up to 100% of the cumulative variance proportion with an fviz_eig plot. However, this produces a very wide plot because of the large dimensions of the dataset. Is there any way in R to split a plot into multiple plots, using a for loop or any other way?
Here is a visual of my plot; it only covers 80% of the variance because the full plot is so large. Could I split this plot into 2 plots?
[Image: Large Dataset Visualisation]
I have tried splitting the plot up using a for loop...
for(i in data[1:20]) {
fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edited Reproducible example:
This is only a small reproducible example using an already available dataset in R but I used a similar method for my large dataset. It will show you how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # provides fviz_eig()
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has a lot more PCs than this one (possibly 100 or more to cover up to 100% of the cumulative variance proportion), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have performed what was said by @G5W below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim=c(0,0.22))
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
pos=3)
and I have now got a graph as follows: [Image: barplot of PCs 1 - 40]
But I am finding it difficult getting the labels to appear above each bar. Can someone help or suggest something for this please?
I am not 100% sure what you want as your result, but I am 100% sure that you need to take more control over what is being plotted, i.e. do more of it yourself. So let me show an example of doing that. The frets data that you used has only 4 dimensions, so it is hard to illustrate what to do with more dimensions; I will instead use the nuclear data, also available in the boot package. I am going to start by reproducing the type of graph that you displayed and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig plot that you displayed, but has three main differences. First, it shows the actual variances, not the percent of variance explained. Second, it does not contain the line that connects the tops of the bars. Third, it does not have the text labels that give the heights of the bars.
Percent of variance explained. The return value of prcomp contains the raw information. str(N_PCA) shows that it holds the standard deviations, not the variances - and we want the proportion of total variation, so we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot. Regarding the line, you can easily add it if you feel you need it, but I recommend against it. What does that line tell you that you can't already see from the barplot? If you are concerned about too much clutter obscuring the information, get rid of the line. But just in case you really want it, you can add it with
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
However, I will leave it out, as I view it as clutter. Finally, you can add the text with
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
This is also somewhat redundant, but particularly if you change scales (as I am about to do), it can be helpful for making comparisons.
OK, now that we have the substance of your original graph, it is easy to separate it into several parts. For my data, the first two bars are big, so the rest are hard to see; in fact, PCs 5-11 show up as zero. Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1), pos=3)
barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4), pos=3, cex=0.8)
Now we can see that even though PC 5 is much smaller than any of PCs 1-4, it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you can find an appropriate way to group your components, you can zoom in on whichever PCs you want.
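If you do want the splitting fully automated, here is a sketch of my own (not from the answer above) that loops over chunks of 20 PCs; capturing barplot's return value gives the true bar midpoints, which also solves the label-placement problem from the update above:
chunk <- 20
for (start in seq(1, length(POEV), by = chunk)) {
  idx <- start:min(start + chunk - 1, length(POEV))
  mids <- barplot(POEV[idx], ylim = c(0, max(POEV[idx]) * 1.15),
                  main = paste("PCs", idx[1], "-", idx[length(idx)]))
  text(mids, POEV[idx], labels = round(100 * POEV[idx], 1), pos = 3, cex = 0.8)
}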
I have a matrix with dimensions of approximately 20000 x 1. I would like to plot the values in a histogram with bins of width 0.01 from -0.05 to +0.15. However, the values in the matrix are pretty random - e.g., 0.0123421, 0.0124523, 0.124523, -0.011234, etc. Thus, I need to first count the number of values that fall into a particular bin, and then plot a histogram. For the numbers I gave, I'd have 2 values between 0.01 and 0.02, 1 between -0.02 and -0.01, and so on, which I need in a histogram. Is there an easy way to do this? I'm relatively new to R, so any help is appreciated!
As an example illustrating breaks (content summarized from an excellent post on R-bloggers, which you can refer to here), let's assume that you start with some normally distributed data. In R, you can generate normal data using the rnorm() function:
data <- rnorm(n = 1000, mean = 24.2, sd = 2.2)
We can then generate a simple histogram using the following call:
hist(data)
Now, let's assume that you want to have coarser or finer groups for your bins. There are a number of ways to do this. You could, for example, use the breaks argument. Below is a tidy example illustrating this:
hist(data, breaks=20, main="Breaks=20")
hist(data, breaks=5, main="Breaks=5")
Now, if you want more control over exactly where the breakpoints between bins fall, you can be more precise with the breaks argument and give it a vector of breakpoints, like this:
hist(data, breaks=c(17,20,23,26,29,32), main="Breaks is vector of breakpoints")
This dictates exactly the start and end point of each bin. Of course, you could give the breaks vector as a sequence like this to cut down on the messiness of the code:
hist(data, breaks=seq(17,32,by=3), main="Breaks is vector of breakpoints")
Note that when giving breakpoints, the default for R is that the histogram cells are right-closed (left open) intervals of the form (a,b]. You can change this with the right=FALSE option, which would change the intervals to be of the form [a,b). This is important if you have a lot of points exactly at the breakpoint.
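Here is a tiny illustration of that difference, using made-up values that sit exactly on the breakpoints:
v <- c(20, 23, 26)  # each value is exactly a breakpoint
hist(v, breaks = c(17,20,23,26,29,32), plot = FALSE)$counts  # (a,b] intervals: 1 1 1 0 0
hist(v, breaks = c(17,20,23,26,29,32), right = FALSE, plot = FALSE)$counts  # [a,b): 0 1 1 1 0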
hist(x, breaks = seq(-.05, .15, .01))
See ?hist
This is for research I am doing for my Masters Program in Public Health
I am graphing data against each other, a standard x,y type deal, and over the top of that I am plotting a predicted line. I get what I think is the most funky-looking point/boxplot-looking thing ever, with an x-axis that is only half filled out, and I don't understand why, as I do not call a boxplot function. When I call the plot function, my understanding is that only the points will be plotted.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of TOTAL.LACE is from 0 to 19, and DAYS.TO.FAILURE ranges from 0 to 30.
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use, we will use psymbol; it takes
# values 1 and 2
library(survival)  # needed for Surv() and survreg()
psymbol <- unique(FAILURE + 1)
# Build a test model that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
                dist = "logistic")
pred <- predict(test, type = "response")  # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch = unique(psymbol))  # produces the goofy graph
lines(tl_ord, pred_ord)  # this produces the line, not the boxplots
[Image: the resulting plot]
I'm not sure how to proceed from here. This is an offshoot of another problem I had with the same data set, at this link here. I do not understand why boxplots are being drawn, since I did not call the boxplot() command, yet they appear along with the plotted points. When I issue the command plot(DAYS.TO.FAILURE, TOTAL.LACE), I only get points on the resulting plot, as I expected; but when I swap what is plotted on x and y, the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @DWin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)  # 1% chance of a non-zero outcome
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)  # zero out the remaining 99%
And here is the plot: [Image: boxplots of DAYS.TO.FAILURE by TOTAL.LACE]
First, the problem with some of the categories not being printed on the x-axis is that they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. To do this, use cex.axis and set its value to less than 1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
Evidently, you have converted TOTAL.LACE to a factor without meaning to (it is the first argument to plot, and a factor first argument is what triggers the boxplot method). Presumably this was done via the code psymbol <- unique(FAILURE + 1) feeding the pch=unique(psymbol) argument above. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.
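A quick way to verify this (hypothetical code, assuming the variables are in your workspace):
str(TOTAL.LACE)  # is it a factor or numeric?
plot(as.numeric(as.character(TOTAL.LACE)), DAYS.TO.FAILURE)  # coerce back to numeric to force a scatterplot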
I have a qqnorm plot of a data set with 1000+ points. I want to draw a line between two quantiles at a time (say 30% and 70%) just as qqline does with 25% and 75%, but with the freedom of choosing which quantiles to use. I've poked around with ggplot a bit but can't seem to get something that works. What's the best way to do this?
edit:
My data are the changes in the dollar to pound exchange rate in the Garch data set of Ecdat:
library(Ecdat)  # provides the Garch data set
data(Garch)
diffbp <- diff(Garch[, 5])  # differences of the dollar/pound exchange rate
After that, all I'm doing is
qqnorm(diffbp)
qqline(diffbp)
but I also need plots where qqline can go through different quantiles, for comparison.
edit 2:
I've tried the following:
> qqnorm(diffbp)
> quantile(diffbp,probs=c(0.01,0.99))
1% 99%
-0.029850 0.032675
> qqline(c(-0.029850,0.032675))
I don't think the resulting lines are exactly what I'm going for. Namely, I believe that the line always goes through -1 and 1 on the x axis, when I want it to go through the appropriate quantiles.
edit 3: Figured it out! Computed the sample and theoretical quantiles by hand, computed the equation of the line through them, and then drew the line using abline(). Rather time-consuming way of going about it, though.
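For anyone following along, here is a sketch of that by-hand approach (my reconstruction, not the asker's exact code):
probs <- c(0.3, 0.7)
y <- quantile(diffbp, probs)  # sample quantiles
x <- qnorm(probs)             # theoretical normal quantiles
slope <- diff(y) / diff(x)
int <- y[1] - slope * x[1]
qqnorm(diffbp)
abline(int, slope)  # line through the chosen quantile pair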
I think you just need
qqline(diffbp,probs=c(0.3,0.7))
edit: I see from the comments on "R: qqline connects the first and third quartiles. How do I draw a line between different quantiles (ie 30% and 70%)?" that the difference is that I am using R-devel, and (as @caracal points out) that this feature is new since R 2.15.1 patched (see http://developer.r-project.org/blosxom.cgi/R-2-15-branch/NEWS , 7 July 2012)
Here's an example that seems to show that qqline() works (I used 0.1, 0.9 for greater contrast):
set.seed(101)
z <- rgamma(1000,shape=1)
newprobs <- c(0.1,0.9)
## png("qq.png")
qqnorm(z,pch=".")
qqline(z,col="red")
qqline(z,col="blue",probs=newprobs)
## add reference lines
abline(h=quantile(z,c(0.25,0.75)),col="red",lty=2)
abline(h=quantile(z,newprobs),col="blue",lty=2)
abline(v=qnorm(c(0.25,0.75)),col="red",lty=2)
abline(v=qnorm(newprobs),col="blue",lty=2)
## dev.off()
EDIT:
The information below is valid for R (and stats package) version 2.15.1 - apparently later versions will incorporate the built-in capability of plotting arbitrary quantiles.
qqline seems to be hard-coded to plot the .25 and .75 quantiles. But if you don't mind creating your own function, something like this might do:
myQqplot <- function(data, probs){
  qqnorm(data)
  theQuants <- quantile(data, probs = probs)  # sample quantiles at the chosen probs
  lm1 <- lm(theQuants ~ qnorm(probs))         # line through the two quantile points
  abline(lm1)
  invisible()
}
myQqplot(diffbp, c(.5,.99))
There's no check to make sure that you supply only two quantiles to the probs argument, but you could add it if you want.
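If you do want such a check, a one-line guard at the top of the function body would do, e.g.:
stopifnot(length(probs) == 2)  # require exactly two probabilities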
Make a new function by copying the code of qqline (type qqline at the prompt to print it) and changing the two lines that compute the quantiles to use your chosen probabilities, e.g.:
y <- quantile(y[!is.na(y)], c(0.30, 0.70))
x <- qnorm(c(0.30, 0.70))
I have two related problems.
Problem 1: I'm currently using the code below to generate a histogram overlayed with a density plot:
hist(x,prob=T,col="gray")
axis(side=1, at=seq(0,100, 20), labels=seq(0,100,20))
lines(density(x))
I've pasted the data (i.e. x above) here.
I have two issues with the code as it stands:
the last tick and label (100) of the x-axis does not appear on the histogram/plot. How can I put these on?
I'd like the y-axis to be of count or frequency rather than density, but I'd like to retain the density plot as an overlay on the histogram. How can I do this?
Problem 2: using a similar solution to problem 1, I now want to overlay three density plots (not histograms), again with frequency on the y-axis instead of density. The three data sets are at:
http://pastebin.com/z5X7yTLS
http://pastebin.com/Qg8mHg6D
http://pastebin.com/aqfC42fL
Here's an answer to your first two questions:
myhist <- hist(x, prob=FALSE, col="gray", xlim=c(0,100))
dens <- density(x)
axis(side=1, at=seq(0,100,20), labels=seq(0,100,20))
lines(dens$x, dens$y * (1/sum(myhist$density)) * length(x))  # rescale the density to the count scale
The histogram has a bin width of 5, which is also equal to 1/sum(myhist$density). The density(x)$x values, by contrast, advance in small steps of about 0.2 in your case (512 evenly spaced points). sum(density(x)$y) is not 1, because the curve is sampled at those small steps; dividing by the x step size gives approximately 1: sum(density(x)$y)/(1/diff(density(x)$x)[1]). You don't need to correct for this, because the y values are already matched up with their own x values. So scale the density curve by 1) the bin width of hist() and 2) the number of observations, length(x), as DWin says. The last axis tick became visible after setting the xlim argument.
To do your problem 2, set up a plot with the correct dimensions (xlim and ylim) and type = "n", then draw 3 lines for the densities, each scaled in the same way as the density line above. Think, however, about whether you want those semi-continuous lines to reflect the heights of imaginary bars with bin width 5... You see how that might make the density lines exaggerate the counts at any particular point?
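Here is a hedged sketch for problem 2, assuming the three pastebin data sets have been read into numeric vectors x1, x2, and x3 (names assumed) and keeping the bin width of 5 from above as the count scale:
bw <- 5  # assumed bin width for the count scaling
d1 <- density(x1); d2 <- density(x2); d3 <- density(x3)
plot(NA, type = "n", xlim = range(d1$x, d2$x, d3$x),
     ylim = c(0, max(d1$y * bw * length(x1), d2$y * bw * length(x2), d3$y * bw * length(x3))),
     xlab = "x", ylab = "frequency per bin of width 5")
lines(d1$x, d1$y * bw * length(x1), col = "red")
lines(d2$x, d2$y * bw * length(x2), col = "blue")
lines(d3$x, d3$y * bw * length(x3), col = "darkgreen")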
Although this is an aged thread, in case anyone comes across it: whether it is a good idea to forego translating the y-axis density to a count scale depends on what the user is attempting to do.
There are perfectly good reasons for using frequency as the y value. One idea in particular that comes to mind is that using counts for the y-scale value can give an analyst a good idea of where to begin the 'data hunt' for stratifying heterogeneous data, if a mixed-distribution model cannot soundly or intuitively be applied.
In practice, overlaying a density estimate on the observed histogram can be very useful in data quality checks. For example, if I were looking at the above graphic as a single source of data, assuming it describes "1 thing" and wishing to model it as "1 thing", I would have an issue: the data are heterogeneous and may require some level of stratification. The density overlay then becomes a simple visual tool for detecting heterogeneity (apart from using log transformations to smooth between-interval variation), and points to the locations of the mixed distributions for stratifying the data.