Decreasing the range of a dataset - R

I have a dataset which ranges from 0.00000787 to 1.39151821, quite a large disparity when it comes to plotting the data. I'd like to decrease the range of the data so that the plot (I'm using a colour-coded plot, and right now it's pretty monotonous) is easier to read. I tried using log(dataset), but this creates some negative numbers, which my software doesn't like.
Mathematics is not my strong point; if someone could recommend a method of fitting my data into a smaller range, it would be much appreciated.
Thanks.

Try log(x + 1), like this:
x <- seq(0.00000787, 1.39151821, 0.01)
plot(log(x + 1))
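A side note, not in the original answer: for values this close to zero, base R's log1p() computes log(1 + x) with better numerical accuracy than writing the expression out:
x <- seq(0.00000787, 1.39151821, 0.01)
plot(log1p(x))  # same transform as log(x + 1), but accurate near zero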

Related

How can I convert a scatterplot into a hexagonal/honeycomb chart?

I have data in a scatterplot that places colors according to their lightness (x-axis) and saturation (y-axis):
Here's the data set in a spreadsheet.
I would like to transform this into a hexagonal/honeycomb chart. I did this by hand: I found some "lines" in the data for the edges, and then filled in the middle based on intuition:
I'm not sure if that's the "best" honeycomb representation of the data, but it's something that looks okay to me.
Does anyone have a suggestion on how I could turn this process into an algorithm? I have a feeling there's some math or an algorithm that fits this problem which I'm unaware of.
Thanks!
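One possible starting point, not from the original thread: hexagonal binning, as implemented in R's hexbin package, tiles the plane with hexagons and assigns each point to a cell. It bins by count rather than by color, so you would still need to fill each cell with, say, the average color of its points, but it automates the honeycomb layout. The lightness and saturation vectors below are hypothetical stand-ins for the spreadsheet data:
library(hexbin)
# Hypothetical stand-ins for the lightness (x) and saturation (y) columns
lightness  <- runif(500)
saturation <- runif(500)
# Bin the points into a 15-column hexagonal grid and plot the cells
plot(hexbin(lightness, saturation, xbins = 15))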

heatmap.2 color legend custom bins

Hi there stackoverflow community!
I am a graduate student looking for some advice on an aesthetics problem I am encountering in R.
The data I am working with is in the form of a VERY large matrix (49x51).
My problem is that my data ranges from very small to very large, with the bulk of my data falling within the "very large" end of the spectrum, so unless I convert my data to log10, the heatmap is rather boring and almost entirely the same color.
The spectrum of my data is totally within the range I am expecting, but I am hoping to display it in a more aesthetic way.
Proposed solution: I think I need to bin my data in a non-uniform way. If you look at the attached image, you will see that their heatmap looks nice and the color key shows the heat spectrum in a non-fixed-bin format. I would like to do something like that; however, I am not sure how to declare the cutoffs for each bin, which is what I would ideally like to do.
For example: bin 1 (0-1), bin 2 (2-50), bin 3 (51-5000). As you can see, my bins would not be fixed in equal increments.
I have been using heatmap.2 for this. Thanks so much in advance!
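A minimal sketch of declaring your own cutoffs via heatmap.2's breaks argument; the matrix here is simulated, and breaks must contain exactly one more value than col, with each color filling one bin:
library(gplots)
# Simulated stand-in for the real 49x51 matrix, capped at the top cutoff
m <- pmin(matrix(rexp(49 * 51, rate = 1/500), nrow = 49), 5000)
# Non-uniform cutoffs: (0,1], (1,50], (50,5000]
my.breaks <- c(0, 1, 50, 5000)
my.cols   <- c("lightyellow", "orange", "darkred")  # one color per bin
heatmap.2(m, breaks = my.breaks, col = my.cols, trace = "none")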
heatmap with color legend in non-uniform bins:
Hey @Punintended and @S Rivero,
I think I have reached the point where my heatmap will only improve marginally. Both of you contributed deeply to this success, so thanks! First, to condense the matrix values as much as possible, I normalized by column. I was then able to assign gradients. This turned out much better than I had hoped. As you can see, most of my data is clustered at very low values (check out the density in the key); this is okay, though, for I am interested in the higher values. I had to use custom color gradients to account for colorblind attendees who might look at my poster. Anyway, if you have comments or recommendations, they will be much appreciated :). Again, thanks a bunch!

R plotting strangeness with large dataset

I have a data frame with several million points in it - each having two values.
When I plot this like this:
plot(myData)
All the points are plotted, but the plot is quite busy, so I thought I'd plot it as a line:
plot(myData, type="l")
But while the x axis doesn't change (i.e. it still goes from 0 to 7e+07), the actual plotting stops at about 3e+07, and I don't get a proper line plot either.
Is there a limitation on line plotting?
Update
If I use
plot(myData, type="h")
I get correct and usable output, but I still wonder why the type="l" option fails so badly.
Further update
I am plotting a time series - here is one output using type="h":
That's perfectly usable, but having a line would allow me to compare several outputs.
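If the line plot is failing from memory pressure, as a comment further down suggests, one workaround (not from the original thread) is to thin the series before drawing the line; for data this dense the result is usually visually indistinguishable:
# myData is the question's two-column data frame; keep every 100th row
idx <- seq(1, nrow(myData), by = 100)
plot(myData[idx, ], type = "l")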
Graphical representation of high-dimensional data is a growing issue in data analysis. The problem, really, is not creating the graph; the problem is making the graph communicate information that we can turn into useful knowledge. Allow me to illustrate the point with a dataset of a million observations, which is not even that big.
x <- rnorm(10^6, 0, 1)
y <- rnorm(10^6, 0, 1)
Let's plot it. R can easily manage such a problem, but can we? Probably not. After all, what kind of information can we extract from a hard stain of ink? Probably no more than a tasseographer trying to divine the future in the patterns of tea leaves, coffee grounds, or wine sediments.
plot(x, y)
A different approach is the smoothScatter function, which creates a density plot of bivariate data. Here are two examples.
First, with the defaults.
smoothScatter(x, y)
Second, with the bandwidth specified to be a little larger than the default, and five points shown using a different symbol (pch = 3).
smoothScatter(x, y, bandwidth=c(5,1)/(1/3), nrpoints=5, pch=3)
As you can see, the problem is not solved. Nevertheless, we get a better grasp of the distribution of our data. This kind of approach is still in development, and several aspects of it are still being discussed and refined. If this approach seems more suitable for representing your big dataset, I suggest you visit this blog, which discusses the issue thoroughly.
For what it's worth, all the evidence I have is that this computer, even though it was a lump of big iron, ran out of memory.

Intelligent Y Axis Scaling BarPlot R

I want to plot some data with barplot. Rather, I want to make a bar graph, and barplot seemed the logical choice. I am plotting just fine, but I was wondering if there is a way to intelligently scale the y axis to round up from the highest count.
For example, I set the y axis in this case to be 30, because I knew that Strand.22 had 27 counts in it: barplot(unlist(d), ylim=c(0,30), xlab="Forward Reverse", ylab="Counts")
In the future I want this script to run on its own, so it would be optimal for the y-axis to choose its own ylim. Short of pulling the information out of my 'd' variable, I can't think of a good way to do this. Is there an easy way to do this with barplot? Would some other plotter work better? I have seen things about ggplot2, but it seemed super complex and I wasn't sure it would do anything better.
EDIT: If I do not choose a ylim, R picks one automatically, and this is what it decided was best.
I disagree with its choice.
If you don't specify ylim, R will come up with something based on the data. (Sounds like you don't like its choice, which is fair.)
If you specify something based on the data like:
barplot(unlist(d), ylim=c(0, 1.1*max(unlist(d))))
R will draw a plot that reflects the maximum value of the data. That example just takes the maximum of your values and multiplies it by 1.1 (this could be any number) to give the plot a little extra headroom. R does something similar when you make a scatterplot, but it handles barplots slightly differently.
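If you would rather round up to a tidy number than add a fixed 10%, base R's pretty() (not mentioned in the original answer) computes nice axis breaks; the 'd' below is a hypothetical stand-in for the question's counts:
# Hypothetical counts; Strand.22 holds the maximum of 27
d <- list(Strand.21 = 14, Strand.22 = 27)
top <- max(pretty(c(0, unlist(d))))  # rounds 27 up to a tidy 30
barplot(unlist(d), ylim = c(0, top), xlab = "Forward Reverse", ylab = "Counts")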

Actuarial survival analysis, divided into intervals

I'm trying to create an actuarial survival analysis in R (I'm following some worked examples). I think the best way to do this is using the survival package. So something like:
library(survival)
surv.test <- survfit(Surv(TIME, STATUS)~1, data=test)
However, to get the correct answer I will need to divide the TIME variable into 365-day intervals, and I can't quite work out how to do this so that it matches the given result.
As far as I can make out, there is no option within the survfit function that will do this. I went through several documentation examples and none of them were trying to create a stairstep type of plot (there is a type='interval' option, but it seems to do something different). So I guess I need to regroup my data before I apply the survival function?
Any ideas?
P.S: In SPSS this would be INTERVAL = THRU 10000 BY 365; in Stata intervals(365) ... connect(stairsteps)
I am guessing that you want to divide the TIME variable into intervals because you want to plot a Kaplan-Meier curve. In R, that isn't necessary; you can just call plot on the survfit object. For example,
s=survfit(Surv(futime, fustat)~rx, data=ovarian)
plot(s)
I think I understand your question a little better now. The reason you are getting a thick black line is that you have a lot of censoring, and a + is plotted at every single point where there is censoring; you can turn this off with mark.time=F. (You can see other options in ?survival:::plot.survfit.)
However, if you still want to aggregate by year, simply divide your follow-up time by 365 and round up with ceiling. Here is an example of aggregating at different time levels without censoring.
par(mfrow=c(1,3))
plot(survfit(Surv(ceiling(futime), fustat)~rx, data=ovarian),col=c('blue','red'),main='Day',mark.time=F)
plot(survfit(Surv(ceiling(futime/30), fustat)~rx, data=ovarian),col=c('blue','red'),main='Month',mark.time=F)
plot(survfit(Surv(ceiling(futime/365), fustat)~rx, data=ovarian),col=c('blue','red'),main='Year',mark.time=F)
par(mfrow=c(1,1))
But I think that plotting the Kaplan-Meier without the censoring symbols will look very nice, and provide more insight.
Hurray, I should be able to post the images now:
1) this is what the basic R survival plot looks like at the moment
2) and this is what it should look like (the SPSS example)
That was exactly what I was missing! Thanks!
Solution:
vas.surv <- survfit(Surv(ceiling(TIME/365), STATUS)~1, conf.type="none", data=vasectomy)
plot(vas.surv, ylim=c(0.975,1), mark.time=F, xlab="Years", ylab="Cumulative Survival")
A nice touch would be to display the days on the x-axis instead of the years (as in the SPSS example), but I'm not too bothered about this.
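An untested sketch of that last touch: plot.survfit has an xscale argument that divides the plotted x values before labelling, so fitting in years and passing xscale = 1/365 should label the axis in days:
vas.surv <- survfit(Surv(ceiling(TIME/365), STATUS)~1, conf.type="none", data=vasectomy)
# xscale = 1/365 turns the year-based fit back into day labels on the axis
plot(vas.surv, ylim=c(0.975,1), mark.time=F, xscale=1/365, xlab="Days", ylab="Cumulative Survival")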

Resources