ggplot making a descriptive bar graph with no clear y variable - r

I've been trying to create a proportional stacked bar graph using ggplot and a huge data set that consists of one column holding a dummy variable and one column holding a factor variable with 14 different levels.
I posted a small sample of the data here.
Despite not having a clear y-variable in my data, I can produce a plot, but it is only really useful for the factor levels that have a lot of observations; when a level has only one or two, you can't see the proportion at all. The code I used is here.
ggplot(data,aes(factor(data$factor),fill=data$dummy))+
geom_bar()
The ggplot documentation says you need to apply a ddply function to the data frame.
ce<-ddply(data,"factor",transform, percent_y=y/sum(y)*100)
Their example doesn't really apply in the case of this data since there's no clear y-variable to call in the plot; just counts of each factor that is 1 or 0.
My best guess for a ddply function spits out an error about differing numbers of rows.
ce<-ddply(plot,"factor(data$factor)",transform,
percent=sum(data$dummy)*100/(dim(data$dummy)[1]))
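One possible way around the missing y variable, sketched here on made-up data, is to let geom_bar() count the rows itself and rescale each bar with position = "fill". The data frame dat and its columns group and dummy are hypothetical stand-ins for the real data set, and the prop.table() call is only there to show the numbers behind the bars.

```r
library(ggplot2)

# Hypothetical stand-in for the real data: a factor column and a 0/1 dummy
set.seed(1)
dat <- data.frame(
  group = factor(sample(letters[1:4], 200, replace = TRUE,
                        prob = c(0.6, 0.3, 0.08, 0.02))),
  dummy = rbinom(200, 1, 0.4)
)

# Share of dummy == 0 and dummy == 1 within each level (each row sums to 1)
props <- prop.table(table(dat$group, dat$dummy), margin = 1)

# position = "fill" stretches every bar to height 1, so rare levels stay
# readable and no y aesthetic is needed
p <- ggplot(dat, aes(x = group, fill = factor(dummy))) +
  geom_bar(position = "fill") +
  labs(x = "factor level", y = "proportion", fill = "dummy")
print(p)
```

Because every bar has the same height, a level with one or two observations is as visible as a level with thousands, at the cost of hiding how many observations each bar is based on.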

Related

Adding multiple lines to plot, without ggplot

I would like to plot multiple lines on the same plot, without using ggplot.
I have scores for different individuals across a set time period and wish to plot a line between yearly scores for each individual. Data is organised with each row representing an individual and each column an observed value in a given year.
Currently I am using a for loop, but am aware that this is often not efficient in R, and am interested if there are any more suitable approaches available within base R.
I will be working with up to 100,000 individuals.
Thanks.
Code:
df=data.frame(runif(10,0,100),runif(10,0,100),runif(10,0,100),runif(10,0,100))
df=data.frame(t(df))
Years=seq(1,10,1)
plot(1,type="n",xlab="Year",ylab="Score", xlim=c(1,10), ylim=c(0,100))
for(x in 1:4){lines(Years,df[x,])}
Efficiency is not much of a consideration when plotting since plotting to a device is a slow operation in itself. You can use matplot (which uses a loop internally). It's basically a more sophisticated version of your code wrapped in a function.
matplot(Years, t(df), xlab="Year", ylab="Score", type = "l")
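If the number of separate drawing calls ever does become a concern, one base R option is to draw everything with a single lines() call, relying on the fact that lines() breaks the line wherever it meets an NA. This is only a sketch: scores, n_people and years are hypothetical names, with one row per individual as in the question.

```r
set.seed(42)
n_people <- 4
years <- 1:10
# One row per individual, one column per year
scores <- matrix(runif(n_people * length(years), 0, 100), nrow = n_people)

# Put an NA after each individual's series so the segments are not joined
x <- rep(c(years, NA), times = n_people)
y <- as.vector(rbind(t(scores), NA))

plot(1, type = "n", xlab = "Year", ylab = "Score",
     xlim = range(years), ylim = c(0, 100))
lines(x, y)  # one call draws all individuals
```

All lines share one colour and line type this way; matplot is the better fit when each individual needs its own style.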

Plotting pre aggregated data

I have pre-calculated data with amount on the x axis and the count (as a proportion) which I'm using as the y axis.
What I would like to have is the functionality I would get if I had used stat="bin". I can't use rep to simply explode the data to its original form and then rebin it, because of the large size of the dataset.
For example:
I would like to be able to smooth the data, like I would have been able to by using binwidth.
Also, I'm plotting this data using geom_freqpoly. However, if I don't have a specific amount on the x axis I'd like to have it as a 0 value, instead of joining to the next point, which binning using ggplot does.
Since no one had a response for ggplot, I used rep to re-expand and sample the data.
So, if I had 18 million observations originally, I used 180,000 for the times argument of rep, and multiplied this by my previously calculated proportion of the data. I'm not sure what the threshold would then be for the times argument (if it's less than 1 will no data point be created?). This means I lose the less frequent observations altogether, but this is OK in my case.
Many of the ggplot stat functions will accept a weight as part of the aesthetic, e.g.: aes(x=X, y=Y, weight=n). Depending on your version, a couple even complain about the "unused aesthetic, 'weight'", but then go on to do the right thing! I've used this on geom_histogram, stat_bin2d, and probably others.
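As a rough illustration of the weight approach, the sketch below bins a small pre-aggregated table directly, without expanding it with rep. The data frame agg and its columns amount and prop are invented for the example; layer_data() is used only to inspect what the binning produced.

```r
library(ggplot2)

# Hypothetical pre-aggregated data: one row per amount, with its proportion
agg <- data.frame(
  amount = c(1, 2, 3, 5, 8, 9),
  prop   = c(0.10, 0.25, 0.30, 0.20, 0.10, 0.05)
)

# weight makes each row count for prop instead of 1 when it is binned;
# binwidth controls the smoothing as it would on the raw data
p <- ggplot(agg, aes(x = amount, weight = prop)) +
  geom_freqpoly(binwidth = 1)

# Inspect the binned values that the layer will draw
binned <- layer_data(p)
print(binned[, c("x", "count")])
```

Bins that contain no amount get a count of zero rather than being skipped, so the line drops to zero there instead of joining straight to the next observed point.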

How to structure data for R?

So... newbie R user here. I have some observations that I'd like to record using R and be able to add to later.
The items are sorted by weights, and the number at each weight recorded. So far what I have looks like this:
weights <- c(rep(171.5, times=1), rep(171.6, times=2), rep(171.7, times=4), rep(171.8, times=18), rep(171.9, times=39), rep(172.0, times=36), rep(172.1, times=34), rep(172.2, times=25))
There will be a total of 500 items being observed.
I'm going to be taking additional observations over time to (hopefully) see how the distribution of weights changes with use/wear. I'd like to be able to make plots showing either stacked histograms or boxplots.
What would be the best way to format / store this data to facilitate this kind of use case? A matrix, dataframe, something else?
As other comments have suggested, the most versatile (and perhaps most useful) container for your data would be a data frame, which works well with library(ggplot2) for your future plotting and graphing needs (such as box plots and various histograms).
Toy example
All the code below does is use your weights vector above, to create a data frame with some dummy IDs and plot a box and whisker plot, and results in the below plot.
library(ggplot2)
IDs<-sample(LETTERS[1:5],length(weights),TRUE) #dummy ID values
df<-data.frame(ID=IDs,Weights=weights) #make data frame with your
#original `weights` vector
ggplot(data=df,aes(factor(ID),Weights))+geom_boxplot() #box-plot
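For recording the observations themselves, one layout that is easy to add to later is a long data frame with one row per weight per observation session. The sketch below uses the counts from the question for the first session; the second session, its label, and its counts are invented purely to show how a later batch could be appended.

```r
# Counts from the question, stored as one row per weight value
obs <- data.frame(
  session = "baseline",
  weight  = c(171.5, 171.6, 171.7, 171.8, 171.9, 172.0, 172.1, 172.2),
  count   = c(1, 2, 4, 18, 39, 36, 34, 25)
)

# A later session is appended with rbind (these values are made up)
later <- data.frame(
  session = "after_use",
  weight  = c(171.6, 171.7, 171.8),
  count   = c(5, 10, 12)
)
obs <- rbind(obs, later)

# Expand to one row per item only when a plot needs it
long <- obs[rep(seq_len(nrow(obs)), obs$count), c("session", "weight")]
boxplot(weight ~ session, data = long, ylab = "Weight")
```

Keeping the counts compact and expanding on demand avoids typing long rep() calls, and the session column gives plotting functions something to group by.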

Uniform plot points in R -- Research / HW

This is for research I am doing for my Masters Program in Public Health
I am graphing two variables against each other, a standard x,y type deal, and plotting a predicted line over the top of that. What I get is the funkiest-looking point/boxplot thing ever, with an x axis that is only half filled out, and I don't understand why, as I do not call a boxplot function. It is my understanding that when I call the plot function only the points will plot.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
library(survival) # provides survreg() and Surv()
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
dist = "logistic")
pred <- predict(test, type="response") # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol)) # produces goofy graph
lines(tl_ord, pred_ord) # this produces the line, not boxplots
Here is the resulting picture
Not too sure how to proceed from here. This is an offshoot of another problem I had with the same data set, at this link here. I don't understand why boxplots are being drawn: I did not call the boxplot() command, so I don't know why they appeared along with the point plots. When I issue the command plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot, as I expected, but when I swap what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @Dwin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, some of the categories are not printed on the x-axis because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. To do this, use cex.axis and set its value below 1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
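To see these rules on a small case, here is a sketch using boxplot.stats(), the function that computes the numbers a boxplot is drawn from. The vector x is invented: 80% of its values are 0, which mimics the zero-heavy groups described above.

```r
# Hypothetical sample: 80 zeros and 20 positive values
x <- c(rep(0, 80), 10:29)

s <- boxplot.stats(x)
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # values drawn as individual points

boxplot(x)  # a flat line at 0 with the positive values as separate points
```

Since both hinges are 0, the box has no height and the whisker limits collapse onto it, so every non-zero value falls outside them and is drawn as a point.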
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
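Evidently, TOTAL.LACE (the variable on the x-axis) has become a factor without your meaning it to, since plot() draws boxplots when its first argument is a factor and its second is numeric; this often happens when the data are read in. Checking class(TOTAL.LACE) will confirm it, and as.numeric(as.character(TOTAL.LACE)) will convert it back. Separately, pch=unique(psymbol) supplies only the two distinct symbol values, which are recycled across the points rather than matched to each observation. Although I haven't had time to try this, I suspect eliminating the psymbol <- unique(FAILURE + 1) line and using pch=(FAILURE + 1) will give you the symbols you want.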
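A quick way to check what plot() will do is to look at the class of the x variable before plotting. The sketch below uses small invented vectors, lace and days, rather than the real data, and shows the conversion back to numbers.

```r
# Hypothetical values: lace has ended up as a factor, days is numeric
lace <- factor(c("9", "16", "3", "16", "0"))
days <- c(15, 7, 22, 3, 30)

class(lace)       # "factor", so plot(lace, days) draws boxplots
plot(lace, days)

# Convert via character; as.numeric(lace) alone would return the level codes
lace_num <- as.numeric(as.character(lace))
plot(lace_num, days)  # now an ordinary scatterplot
```

Going through as.character() first matters because a factor stores integer codes internally, and those codes are not the original values.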

Boxplots using ggplot2

I am completely new to using ggplot2 but have heard of its great plotting capabilities. I have a list of different samples, and for each sample there are observations from three instruments. I would like to turn that into a figure with boxplots. I cannot include a figure but the code to make an example figure is included below. The idea is to have, for each instrument, a figure with boxplots for each sample.
In addition, next to the plots I would like to make a sort of legend giving a name to each of the sample numbers. I have no idea on how to start doing this with ggplot2.
Any help will be appreciated
The R-code to produce the example image is:
#Make data example
Data<-list();
Data$Sample1<-matrix(rnorm(30),10,3);
Data$Sample2<-matrix(rnorm(30),10,3);
Data$Sample3<-matrix(rnorm(30),10,3);
Data$Sample4<-matrix(rnorm(30),10,3);
#Make the plots
par(mfrow=c(3,1)) ;
boxplot(data.frame(Data)[seq(1,12,by=3)],names=c(1:4),xlab="Sample number",ylab="Instrument 1");
boxplot(data.frame(Data)[seq(2,12,by=3)],names=c(1:4),xlab="Sample number",ylab="Instrument 2");
boxplot(data.frame(Data)[seq(3,12,by=3)],names=c(1:4),xlab="Sample number",ylab="Instrument 3");
First, you'll want to set your data up differently: as a data.frame rather than a list of matrices. You want one column for sample, one column for instrument, and one column for the observed value. Here's a fake dataset:
df <- data.frame(sample = rep(c("One","Two","Three","Four"),each=30),
instrument = rep(rep(c("My Instrument","Your Instrument","Joe's Instrument"),each=10),4),
value = rnorm(120))
> head(df)
sample instrument value
1 One My Instrument 0.08192981
2 One My Instrument -1.11667766
3 One My Instrument 0.34117450
4 One My Instrument -0.42321236
5 One My Instrument 0.56033804
6 One My Instrument 0.32326817
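If the data already exist as a list of matrices like the one in the question, they can be reshaped into this layout instead of being retyped. The following is a minimal sketch in base R; the instrument labels ("Instrument 1" and so on) are placeholders, since the question does not name the instruments.

```r
# Two samples in the question's format: 10 observations x 3 instruments
set.seed(1)
Data <- list(Sample1 = matrix(rnorm(30), 10, 3),
             Sample2 = matrix(rnorm(30), 10, 3))

# Build one long data frame per sample, then stack them
long <- do.call(rbind, lapply(names(Data), function(nm) {
  m <- Data[[nm]]
  data.frame(
    sample     = nm,
    # as.vector() reads the matrix column by column, so repeat each
    # instrument label once per row
    instrument = rep(paste("Instrument", seq_len(ncol(m))), each = nrow(m)),
    value      = as.vector(m)
  )
}))
head(long)
```

The result has one row per observation with sample, instrument and value columns, which is the shape the plotting code expects.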
To get three plots, we're going to use faceting. To get boxplots we use geom_boxplot. The code looks like this:
ggplot(df, aes(x=sample,y=value)) +
geom_boxplot() +
facet_wrap(~ instrument, ncol=1)
Rather than including a legend for the sample numbers, if you put the names directly in the sample variable it will print them below the plots. That way people don't have to reference numbers to names: it's immediately clear what sample each plot is for. Note that ggplot puts the factors in alphabetical order by default; if you want a different ordering you have to change it manually.
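Changing the order manually usually means setting the factor levels yourself before plotting. A small sketch, using a cut-down version of the fake dataset with invented values:

```r
set.seed(2)
df <- data.frame(sample = rep(c("One", "Two", "Three", "Four"), each = 3),
                 value  = rnorm(12))

# Default level order is alphabetical: Four, One, Three, Two
levels(factor(df$sample))

# Spell out the order you want; plots then follow this order
df$sample <- factor(df$sample, levels = c("One", "Two", "Three", "Four"))
levels(df$sample)
```

The same approach works for the instrument column if the facets should appear in a particular order.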
