Uniform plot points in R -- Research / HW - r

This is for research I am doing for my Masters Program in Public Health
I am graphing data against each other, a standard x,y type deal, over top of that I am plotting a predicted line. I get what I think to be the most funky looking point/boxplot looking thing ever with an x axis that is half filled out and I don't understand why as I do not call a boxplot function. When I call the plot function it is my understanding that only the points will plot.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
dist = "logistic")
pred <- predict(test, type="response") <-- produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol)) <-- Produces goofy graph
lines(tl_ord, pred_ord) <-- this produces the line not boxplots
Here is the resulting picture
Not to sure how to proceed from here, this is an off shoot of another problem I had with the same data set at this link here I am not understanding why boxplots are being drawn, the reason being is I did not specifically call the boxplot() command so I don't know why they appeared along with point plots. When I issue the following command: plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot like I expected, but when I change the order of what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem as pointed out by #Dwin et all Some Sample Data
Thank you,

Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The code to do this is to use cex.axis and set the value <1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
Evidently, you have converted DAYS.TO.FAILURE to a factor without meaning to. Presumably this was done in the pch=unique(psymbol) argument via the code psymbol <- unique(FAILURE + 1) above. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.

Related

Linear regression with R: How do I get labels on data point in qq plot, scale location plot, Residuals vs Leverage etc

I have a small dataset with EU member states that contains values on their degree of negotiation success and the activity level the member states showed in the negotiations.
I am doing a linear regression with R.
In short the hypothesis is:
The more activity a member state shows, the more success it will have in negotiations.
I played around a lot with the data, transformed it etc.
What I have done so far:
# Stored the dataset from a csv file in object linData
linData = read.csv(file.choose(), sep = ";", encoding = "de_DE.UTF-8")
# As I like to switch variables and test different models, I send the relevant ones to objects x and y.
# So it is easier for me to change it in the future.
x = linData$ALL_Non_Paper_Art.Ann.Recit.Nennung
y = linData$Success_high
# I put the label for each observation in a factor lab
lab = linData$MS_short
# After this I run the linear model
linModel = lm(y~x, data = linData)
summary(linModel)
# I create a simple scatterplot. Here the labels from the factor lab work fine
plot(x, y)
text(x, y, labels=lab, cex= 0.5, pos = 4)
So far so good. Now I want to check for model quality. For visual insepection I found out I can use the command
plot(linModel)
This produces 4 plots in a row:
Residuals vs Fitted
Normal Q-Q
Scale Location
Residuals vs Leverage
As you can see in every picture R marks problematic observations by a number. It would be very convenient if R could just use the column "MS_short" from te dataset and add the label to the marked observations. I am sure this is possible... but how?
I work with R for 2 months now. I found some stuff here and via googe but nothing helped me to solve the problem. I have no one I can ask. This is my 1st post here an stackoverflow.
Thank you in advance
Rainer
With the help of G. Grothendieck I solved the problem.
After entering the R-help of plot, more specific the help for plot and linear regression (plot.lm) with the command
?plot.lm
I read the box with the "arguments and usage" part and identified the labels.id argument AND the id.n argument.
id.n is "number of points to be labelled in each plot, starting with the most extreme."
I needed that. I was interested in the identification of this extreme points. R already marked the 3 most extreme points in all graphics (see initial post) but used the observations numbers and not any useful labels. Any other labelling would mess up the graphics. So, we remember: In my case I want the 3 most extreme values to be labelled.
Now let's add this to the command:
I started the same as above, with a plot of my already computed linear model -> plot(linModel). After that I added "id.n =" and set the value to "3". That looked like that:
plot(linModel, id.n = 3,
So far so good, now R knows what to label BUT still not what should be used as label.
For this we have to add the labels.id to the command.
labels.id is the "vector of labels, from which the labels for extreme points will be chosen."
I assumed that one column in my dataset (NOT the linear model!) has the property of a vector and so I added a comma and then "labels.id =" to the command and typed in the name of my dataset and then the column, so in my case: "linData$MS_short" where linData is the dataset and MS_short the column with the 2 letter string for each member state. The final command looked like this:
plot(linModel, id.n = 3, labels.id = linData$MS_short)
And then it worked (see here). End of story.
Hope this can help some other newbies. Greetings.

Plotting pre aggregated data

I have pre-calculated data with amount on the x axis and the count (as a proportion) which I'm using as the y axis.
What I would like to have is the functionality I would get if I had used stat="bin". I can't use rep to simply explode the data to it's original form and then rebin it, because of the large size of the dataset.
For example:
I would like to be able to smooth the data, like I would have been able to by using binwidth.
Also, I'm plotting this data using geom_freqpoly. However, if I don't have a specific amount on the x axis I'd like to have it as a 0 value, instead of joining to the next point, which binning using ggplot does.
Since no one had a response for ggplot, I used rep to re-expand and sample the data.
So, if I had 18 million observations originally, I used 180,000 for the times argument of rep, and multiplied by this by my previously calculated proportion of the data. I'm not sure what the threshold would then be for the times argument (if it's less than 1 will no data point be created?). This means I lose the less frequent observations altogether, but this is OK in my case.
Many of the ggplot stat functions will accept a weight as part of the aesthetic, e.g.: aex(x=X, y=Y, weight=n). Depending on you versions, a couple even complain about the "unused aesthetic, 'weight'", but then go on to do the right thing! I've used this on geom_hist, bin2d, and probably others.

R histogram showing time spent in each bin

I'm trying to create a plot similar to the ones here:
Basically I want a histogram, where each bin shows how long was spent in that range of cadence (e.g 1 hour in 0-20rpm, 3 hours in 21-40rpm, etc)
library("rjson") # 3rd party library, so: install.packages("rjson")
# Load data from Strava API.
# Ride used for example is http://app.strava.com/rides/13542320
url <- "http://app.strava.com/api/v1/streams/13542320?streams[]=cadence,time"
d <- fromJSON(paste(readLines(url)))
Each value in d$cadence (rpm) is paired with the same index in d$time (the number of seconds from the start).
The values are not necessarily uniform (as can be seen if you compare plot(x=d$time, y=d$cadence, type='l') with plot(d$cadence, type='l') )
If I do the simplest possible thing:
hist(d$cadence)
..this produces something very close, but the Y value is "frequency" instead of time, and ignores the time between each data point (so the 0rpm segment in particular will be underrepresented)
You need to create a new column to account for the time between samples.
I prefer data.frames to lists for this kind of thing, so:
d <- as.data.frame(fromJSON(paste(readLines(url))))
d$sample.time <- 0
d$sample.time[2:nrow(d)] <- d$time[2:nrow(d)]-d$time[1:(nrow(d)-1)]
now that you've got your sample times, you can simply "repeat" the cadence measures for anything with a sample time more than 1, and plot a histogram of that
hist(rep(x=d$cadence, times=d$sample.time),
main="Histogram of Cadence", xlab="Cadence (RPM)",
ylab="Time (presumably seconds)")
There's bound to be a more elegant solution that wouldn't fall apart for non-integer sample times, but this works with your sample data.
EDIT: re: the more elegant, generalized solution, you can deal with non-integer sample times with something like new.d <- aggregate(sample.time~cadence, data=d, FUN=sum), but then the problem becomes plotting a histogram for something that looks like a frequency table, but with non-integer frequencies. After some poking around, I'm coming to the conclusion you'd have to roll-your-own histogram for this case by further aggregating the data into bins, and then displaying them with a barchart.

Clustering and heatmap in R

I am a newbie to R and I am trying to do some clustering on a data table where rows represent individual objects and columns represent the features that have been measured for these objects. I've worked through some clustering tutorials and I do get some output, however, the heatmap that I get after clustering does not correspond at all to the heatmap produced from the same data table with another programme. While the heatmap of that programme does indicate clear differences in marker expression between the objects, my heatmap doesn't show much differences and I cannot recognize any clustering (i.e., colour) pattern on the heatmap, it just seems to be a randomly jumbled set of colours that are close to each other (no big contrast). Here is an example of the code I am using, maybe someone has an idea on what I might be doing wrong.
mydata <- read.table("mydata.csv")
datamat <- as.matrix(mydata)
datalog <- log(datamat)
I am using log values for the clustering because I know that the other programme does so, too
library(gplots)
hr <- hclust(as.dist(1-cor(t(datalog), method="pearson")), method="complete")
mycl <- cutree(hr, k=7)
mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(mycl)]
heatmap(datamat, Rowv=as.dendrogram(hr), Colv=NA,
col=colorpanel(40, "black","yellow","green"),
scale="column", RowSideColors=mycol)
Again, I plot the original colours but use the log-clusters because I know that this is what the other programme does.
I tried to play around with the methods, but I don't get anything that would at least somehow look like a clustered heatmap. When I take out the scaling, the heatmap becomes extremely dark (and I am actually quite sure that I have somehow to scale or normalize the data by column). I also tried to cluster with k-means, but again, this didn't help. My idea was that the colour scale might not be used completely because of two outliers, but although removing them slightly increased the range of colours plotted on the heatmap, this still did not reveal proper clusters.
Is there anything else I could play around with?
And is it possible to change the colour scale with heatmap so that outliers are found in the last bin that has a range of "everything greater than a particular value"? I tried to do this with heatmap.2 (argument "breaks"), but I didn't quite succeed and also I didn't manage to put the row side colours that I use with the heatmap function.
If you are okay with using heatmap.2 from the gplots package that will allow you to add breaks to assign colors to ranges represented in your heatmap.
For example if you had 3 colors blue, white, and red with the values going from low to high you could do something like this:
my.breaks <- c(seq(-5, -.6, length.out=6),seq(-.5999999, .1, length.out=4),seq(.100009,5, length.out=7))
result <- heatmap.2(mtscaled, Rowv=T, scale='none', dendrogram="row", symm = T, col=bluered(16), breaks=my.breaks)
In this case you have 3 sets of values that correspond to the 3 colors, the values will differ of course depending on what values you have with your data.
One thing you are doing in your program is to call hclust on your data then to call heatmap on it, however if you look in the heatmap manual page it states:
Defaults to hclust.
So I don't think you need to do that. You might want to take a look at some similar questions that I had asked that might help to point you in the right direction:
Heatmap Question 1
Heatmap Question 2
If you post an image of the heatmap you get and an image of the heatmap that the other program is making it will be easier for us to help you out more.

Plotting scatterplots with pairs in R, in log scale with data containing zeros

I am trying to plot some pairs of scatterplots using "pairs".
My dataframe look like :
>e
X Y Z
0 0 0
2 3 4
0 3 4
3 3 3
A completely standard dataframe here.
I use this to plot my scatter plots, again nothing fancy:
pairs(~X+Y+Z, data=e, log="xy")
It works great, but it doesn't plot the labels. However if I remove the log="xy" in the command, then the labels are plotted nicely. So I guess it has to do with the fact that I want my scatterplots to be in log scale.
So my question is what shall I do?
Shall I remove all lines with zeros in it before hand (how do you do that?)
Is there a magic trick that will let me have log="xy" and my scatterplots labeled?
Please let me know if it is not clear.
You ignored this (where I called your data frame DF):
R> pairs(~X+Y+Z, data=df, log="xy")
There were 30 warnings (use warnings() to see them)
and if you look at these thirty warnings, you will see that
you cannot plot data containing zeros on a log scale (and I guess you know why)
log is not a recognised parameter for pairs()
So if you want a pairs plot in logs, you may have to takes logs yourself (and either add a small epsilon or use a transformation like log(1 + x) and call pairs() on that data.
Edit The easiest is probably pairs(~X+Y+Z, data=log(1+DF))

Resources