I have the following matrix:
test <- matrix(c(2006,100,
2007,105,
2008,98,
2009,102,
2010,107),ncol=2,byrow=TRUE)
And I want to draw its boxplot with
boxplot.matrix(test)
However, I only get two flat lines:
I can't pinpoint what I am doing wrong. What could be the problem?
If you examine the nature of your data, you will see that there are 2 groups that are far apart but within each group, the data points are close together.
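Because both columns are drawn against one shared y-axis that has to span roughly 98 to 2010, the few units of spread within each column are squashed into what look like flat lines.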
If you examine each column separately, you will get a "typical" box plot
> boxplot(test[,1], main="boxplot of column 1")
> boxplot(test[,2], main="boxplot of column 2")
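To see why the combined plot collapses, it can help to compare the range of each column with the range the shared axis has to cover. This is only a minimal sketch using the matrix from the question; the side-by-side layout via par(mfrow) is one possible way to give each column its own scale, not the only one.

```r
test <- matrix(c(2006,100,
                 2007,105,
                 2008,98,
                 2009,102,
                 2010,107), ncol=2, byrow=TRUE)

# Range of each column: the spread inside a column is only a few units
col_ranges <- apply(test, 2, range)
print(col_ranges)

# Range the shared y-axis must cover when both columns are drawn together
print(range(test))

# Draw each column in its own panel so that each gets its own y-axis
op <- par(mfrow = c(1, 2))
boxplot(test[,1], main="boxplot of column 1")
boxplot(test[,2], main="boxplot of column 2")
par(op)
```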
I have a big data frame with IDs, date, and test results in it - I want to run a for loop to go through all the IDs and plot a line graph that shows the evolution of the results across time, but also to add some key-indicators as points in the graph. (key indicators come from a different data frame).
Not all the IDs have all the indicators - the problem is that when I plot these points, I use dplyr filtering to filter for ID == i and something like key_ind != 0. My problem is that when there are no key indicators for a certain ID, the filtered data frame has 0 observations and ggplot returns an error.
What I want is that, when an ID has no key-indicator points, the points are simply skipped while the line graph of the test results is still plotted. Does that make sense? How can I do that? I have tried using tryCatch() but it didn't work.
I'd like to put multiple plots onto a single visual output in R, based on data that I have in a CSV that looks something like this:
user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852
I can read this just fine, using, e.g.:
data = read.csv('data.csv')
For the moment, a fairly simple plot is fine, so I'm just trying plot(), without much to it (setting type='o' to get lines and points), and, from solving a past problem, I know that I can do, e.g., the following, to get data for just fred:
plot(data$time[which(data$user == 'fred')], data$size[which(data$user == 'fred')], type='o')
What I'd like, though, is to have the data for each user all showing up on one set of axes, with color coding (and a legend to match users to colors) to identify different user data.
And if another user shows up, I'd like another line to show up, with another color (perhaps recycling if I have too many users at once).
However, just this doesn't do it:
plot(data$size, data$time, type='o',col=c("red", "blue", "green"))
Because it doesn't seem to group by the user.
And just this:
plot(data, type='o')
gives me an error:
Error in plot.default(...) :
formal argument "type" matched by multiple actual arguments
This:
plot(data)
does do something, but not what I want.
I've poked around, but I'm new enough to R that I'm not quite sure how best to search for this, nor where to look for examples that would hit a use-case like this.
I even got somewhat closer with this:
plot(data$size[which(data$user == 'wilma')], data$time[which(data$user == 'wilma')], type='o', col=c('red'))
lines(data$size[which(data$user == 'fred')], data$time[which(data$user == 'fred')], type='o', col=c('green'))
lines(data$size[which(data$user == 'barney')], data$time[which(data$user == 'barney')], type='o', col=c('blue'))
This gives me a plot (which I'd post inline, but as a new user, I'm not allowed to yet):
not-quite-right plot
which is kind of close to what I want, except that it:
doesn't have a legend
has ugly axis labels, instead of just time and size
is scaled to the first plot, and thus is missing data from some of the others
isn't sorted by x-axis, which I could do externally, though I'm guessing I could do it fairly easily in R.
So, the question, ultimately, is this:
What's an easy way to plot data like this which:
has multiple lines based on the labels in the first column of the CSV
uses the same set of axes for the data in columns 2 and 3, regardless of the label
has a legend and color-coding for which label is being used for a particular line (or set of points)
will adapt to adding new labels to the data file, hopefully without change to the R code.
Thanks in advance for any help or pointers on this.
P.S. I looked around for similar questions, and found one that's sort of close, but it's not quite the same, and I failed to figure out how to adapt it to what I'm trying to do.
Good question. This is doable in base plot, but it's even easier and more intuitive using ggplot2. Below is an example of how to do this with random data in ggplot2.
First download and install the package
install.packages("ggplot2",repos='http://cran.us.r-project.org')
library(ggplot2)
Next generate the data
a <- c(rep('a',3),rep('b',3),rep('c',3))
b <- rnorm(9,50,30)
c <- rep(seq(1,3),3)
dat <- data.frame(a,b,c)
Finally, make the plot
ggplot(data=dat, aes(x=c, y=b , group=a, colour=a)) + geom_line() + geom_point()
Basically, you are telling ggplot that your x axis corresponds to the c column (dat$c), your y axis corresponds to the b column (dat$b), and that it should group (draw separate lines) by the a column (dat$a). Colour specifies that you want to colour by the a column as well.
The resulting graph looks like this:
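For comparison, here is a rough base-graphics sketch of the same idea applied to the user/size/time data from the question. The colour palette and the legend position are arbitrary choices of mine, and the CSV is inlined only to keep the example self-contained; in practice you would keep read.csv('data.csv').

```r
dat <- read.csv(text = "user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852", stringsAsFactors = FALSE)

# Sort by user, then by size, so each line runs left to right
dat <- dat[order(dat$user, dat$size), ]

users <- unique(dat$user)
# Recycle a small palette if there are more users than colours
cols <- rep_len(c("red", "blue", "darkgreen"), length(users))

# Empty plot first, so the axes cover every user's data
plot(dat$size, dat$time, type = "n", xlab = "size", ylab = "time")
for (i in seq_along(users)) {
  d <- dat[dat$user == users[i], ]
  lines(d$size, d$time, type = "o", col = cols[i])
}
legend("topleft", legend = users, col = cols, lty = 1, pch = 1)
```

Because the users are read from the data, a new label in the CSV gets its own line and legend entry without any change to the code.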
Sequential portions of my time series are under different treatments, and I'd like to separately color a line connecting observations in each portion.
For example, in the series under treatment A I'd have a red line, and in the succeeding series under treatment B I'd have a blue line.
plot(response, type="l",col="treatment") failed - all observations were connected with a line the same color.
This listhost posting proposed just splitting the data by treatment and then separately plotting each subset on the same plot. (http://r.789695.n4.nabble.com/Can-R-plot-multicolor-lines-td791081.html).
Is there a more elegant way?
An alternative using Map that avoids manually plotting segments:
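dat <- data.frame(treatment=factor(rep(LETTERS[1:2],3:4)),
                  response=c(6,5,2,1,5,6,7),time=1:7)
# Empty plot sets up the axes; each treatment is then drawn as its own line
plot(response ~ time, data=dat, type="n")
# treatment is a factor, so its integer code (1, 2, ...) picks the colour
invisible(Map(
  function(x) lines(response ~ time, data=x, col=as.integer(x$treatment[1])),
  split(dat, dat$treatment)
))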
There are two popular, more elegant ways. One is to use the ggplot2 package; without more information it's hard to advise beyond pointing you to its help pages and examples. The other is the function matplot. It requires you to first restructure your data as a matrix, but it can then easily do what you want. Keep in mind that although the help says "Plot the columns of one matrix against the columns of another", x can be a single vector whose length equals the number of rows of the matrix containing your line information; the function will just recycle the x vector across the columns.
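As a rough illustration of the matplot route, here is a sketch that reuses the small treatment/response/time example from the other answer. Filling the matrix with NA wherever a treatment does not apply, and the particular colours, are my own assumptions about how one might restructure the data.

```r
dat <- data.frame(treatment = rep(LETTERS[1:2], 3:4),
                  response = c(6,5,2,1,5,6,7), time = 1:7)

# One matrix column per treatment; NA wherever that treatment does not apply
trts <- unique(as.character(dat$treatment))
y <- sapply(trts, function(tr) ifelse(dat$treatment == tr, dat$response, NA))

# A single x vector is recycled across the columns of y
matplot(dat$time, y, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "time", ylab = "response")
legend("top", legend = colnames(y), col = c("red", "blue"), lty = 1)
```

Note that the two coloured segments are not joined to each other, because each column only contains the observations for its own treatment.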
This is for research I am doing for my Masters Program in Public Health
I am graphing two variables against each other, a standard x,y type deal, and on top of that I am plotting a predicted line. What I get is the funkiest-looking point/boxplot hybrid, with an x axis that is only half labelled, and I don't understand why, since I never call a boxplot function. My understanding is that when I call the plot function, only the points will be plotted.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
library(survival)  # provides Surv() and survreg()

# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
                dist = "logistic")
pred <- predict(test, type="response")  # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol))  # produces goofy graph
lines(tl_ord, pred_ord)  # this produces the line, not boxplots
Here is the resulting picture
I'm not too sure how to proceed from here. This is an offshoot of another problem I had with the same data set (at this link here). I don't understand why boxplots are being drawn: I did not call the boxplot() command, so I don't know why they appear along with the points. When I issue plot(DAYS.TO.FAILURE, TOTAL.LACE) I get only points on the resulting plot, as I expected, but when I swap what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @Dwin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, some of the categories are not printed on the x-axis because their labels don't fit. When you have so many categories, you have to display them in a smaller font to make them all fit. To do this, use cex.axis and set the value below 1 (see ?par for details):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
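To make those rules concrete, here is a small sketch with a made-up vector (not the asker's data) in which 80% of the values are zero. boxplot.stats() returns the same five numbers and outliers that boxplot() draws, so it is a convenient way to check what a given plot should look like.

```r
# Made-up data: 16 zeros and 4 positive values
x <- c(rep(0, 16), 3, 5, 8, 12)

# coef = 1.5 is the default whisker multiplier
s <- boxplot.stats(x, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$stats)

# Points beyond the whisker limits, drawn individually
print(s$out)

boxplot(x, main = "Box collapsed onto zero")
```

Since both hinges and the median are 0, the inter-quartile range is 0, the whiskers have zero length, and every non-zero value is drawn as an individual point.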
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
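Evidently, TOTAL.LACE (the variable on the x-axis) has been converted to a factor without your meaning to; class(TOTAL.LACE) will confirm this. Separately, pch=unique(psymbol), built from psymbol <- unique(FAILURE + 1) above, passes only the distinct symbol codes, so they are recycled across the points instead of being matched to each observation. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1), together with passing TOTAL.LACE as a numeric variable, will accomplish your goals.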
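A quick way to check for this, and to undo it, is sketched below. The variable names and the simulated values are hypothetical stand-ins for the asker's data; the conversion through as.character() matters because as.numeric() applied directly to a factor returns the internal level codes, not the original numbers.

```r
set.seed(1)
# Hypothetical stand-ins for TOTAL.LACE and DAYS.TO.FAILURE
lace_num <- sample(0:19, 200, replace = TRUE)
days <- rpois(200, lambda = 15)

# The same values stored as a factor, as can happen on import
lace_fac <- factor(lace_num)
print(class(lace_num))
print(class(lace_fac))

# Recover the original numbers from the factor labels
lace_back <- as.numeric(as.character(lace_fac))

op <- par(mfrow = c(1, 2))
plot(lace_fac, days, main = "factor x: boxplots")
plot(lace_back, days, main = "numeric x: scatterplot")
par(op)
```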
I am completely new to ggplot2 but have heard of its great plotting capabilities. I have a list of different samples and, for each sample, observations from three instruments. I would like to turn that into a figure with boxplots. I cannot include a figure, but the code to make an example figure is included below. The idea is to have, for each instrument, a figure with boxplots for each sample.
In addition, next to the plots I would like to make a sort of legend giving a name to each of the sample numbers. I have no idea on how to start doing this with ggplot2.
Any help will be appreciated
The R-code to produce the example image is:
#Make data example
Data<-list();
Data$Sample1<-matrix(rnorm(30),10,3);
Data$Sample2<-matrix(rnorm(30),10,3);
Data$Sample3<-matrix(rnorm(30),10,3);
Data$Sample4<-matrix(rnorm(30),10,3);
#Make the plots
par(mfrow=c(3,1)) ;
boxplot(data.frame(Data)[seq(1,12,by=3)],names=c(1:4),xlab="Sample number",ylab="Instrument 1");
boxplot(data.frame(Data)[seq(2,12,by=3)],names=c(1:4),xlab="Sample number",ylab="Instrument 2");
boxplot(data.frame(Data)[seq(3,12,by=3)],names=c(1:4),xlab="Sample number",ylab="Instrument 3");
First, you'll want to set your data up differently: as a data.frame rather than a list of matrices. You want one column for sample, one column for instrument, and one column for the observed value. Here's a fake dataset:
df <- data.frame(sample = rep(c("One","Two","Three","Four"),each=30),
instrument = rep(rep(c("My Instrument","Your Instrument","Joe's Instrument"),each=10),4),
value = rnorm(120))
> head(df)
sample instrument value
1 One My Instrument 0.08192981
2 One My Instrument -1.11667766
3 One My Instrument 0.34117450
4 One My Instrument -0.42321236
5 One My Instrument 0.56033804
6 One My Instrument 0.32326817
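If you would rather convert the existing list of matrices than build a new dataset, something along these lines should work. It is only a sketch: it assumes every matrix has one column per instrument, and the "Instrument 1" to "Instrument 3" labels are placeholders of mine.

```r
set.seed(42)
Data <- list(Sample1 = matrix(rnorm(30), 10, 3),
             Sample2 = matrix(rnorm(30), 10, 3),
             Sample3 = matrix(rnorm(30), 10, 3),
             Sample4 = matrix(rnorm(30), 10, 3))

# Stack each matrix into long format: one row per observation
long <- do.call(rbind, lapply(names(Data), function(nm) {
  m <- Data[[nm]]
  data.frame(sample = nm,
             # as.vector() unrolls column by column, so repeat each
             # instrument label once per row of the matrix
             instrument = rep(paste("Instrument", seq_len(ncol(m))),
                              each = nrow(m)),
             value = as.vector(m))
}))

print(head(long))
```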
To get three plots, we're going to use faceting. To get boxplots we use geom_boxplot. The code looks like this:
library(ggplot2)
ggplot(df, aes(x=sample,y=value)) +
  geom_boxplot() +
  facet_wrap(~ instrument, ncol=1)
Rather than including a legend for the sample numbers, if you put the names directly in the sample variable it will print them below the plots. That way people don't have to reference numbers to names: it's immediately clear what sample each plot is for. Note that ggplot puts the factors in alphabetical order by default; if you want a different ordering you have to change it manually.
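One way to change the ordering manually is to turn the column into a factor with explicit levels before plotting. A minimal sketch using the fake dataset from above:

```r
df <- data.frame(sample = rep(c("One","Two","Three","Four"), each=30),
                 instrument = rep(rep(c("My Instrument","Your Instrument",
                                        "Joe's Instrument"), each=10), 4),
                 value = rnorm(120))

# Default ordering is alphabetical
print(levels(factor(df$sample)))

# Set the levels explicitly to control the order on the x-axis
df$sample <- factor(df$sample, levels = c("One","Two","Three","Four"))
print(levels(df$sample))
```

ggplot will then draw the boxplots in the order of the factor levels rather than alphabetically.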