Graph with Gnuplot of statistics for three different input data files

I am plotting the frequency of the data sampling in an interval. The code is:
n=50 #number of intervals
plot "xxx.csv" u ($0):1 #To get the max and min value
max=GPVAL_Y_MAX
min=GPVAL_Y_MIN
width=(max-min)/n #interval width
#function used to map a value to the intervals
hist(x,width)=width*floor(x/width)
set ytic auto
set xtic auto
plot "xxx.csv" u (hist($1,width)):(1.0) smooth freq w histeps ls 1 title "xxx"
This works, but I would like to overlay two similar graphs for different data. The problem is that the data differ, so max, min and width are not the same. The data are in separate files, yyy.csv and zzz.csv. How can I do this?

Do you have gnuplot >= 4.6? If so you can use the stats command to get statistics for those files easily, otherwise it would probably be a matter of doing what you did in your script (plot, then use GPVAL_Y_MIN, etc.) and create a set of variables for each data set.
(Posting my earlier comment as an answer.)

Related

Why does the histogram look different with different breaks arguments in R?

I want to plot the distribution of my dataset using a histogram in R. I tried different breaks arguments (the default, Freedman-Diaconis, and Scott) to get the best representation. I am considering a log scale later, but first I want to see the raw distribution without any scaling. However, the results look different. Why is that? The dataset I use can be downloaded from here or here. The code I'm running is:
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = 200)
result is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = "Scott")
Result is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks="Freedman-Diaconis")
result is
Please help. Thank you very much.
Histograms are very sensitive to the choice of cell break points. Even for the same (!) number of cells, the histogram can become considerably different by just a small shift of the cell borders. It is thus generally preferable to use kernel density estimators instead of histograms, because they do not depend on random cell border placement:
# increase n if you have a wide range of values
d <- density(as.matrix(deviation_all_genes_all_spots), n=512)
plot(d$x, d$y)
In your second and third call of hist, you ask for an automatic way to select the number of cells and the cell borders. Obviously, this results in more cells than in your first call with breaks=200. You can query the cells from the return value of hist, e.g.
h <- hist(as.matrix(deviation_all_genes_all_spots))
cat(sprintf("number of cells = %i\n", length(h$mids)))
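To make the sensitivity to cell borders concrete, here is a minimal sketch on simulated data (the log-normal sample is an assumption standing in for deviation_all_genes_all_spots). It bins the same values twice with the same bin width but with the borders shifted by half a bin, and then counts the cells each automatic rule ends up using.

```r
# Simulated, skewed data standing in for the real matrix (assumption).
set.seed(1)
x <- rlnorm(5000, meanlog = 5, sdlog = 1)

# Same bin width, borders shifted by half a bin: the counts change.
width <- 100
breaks_a <- seq(0, max(x) + width, by = width)
breaks_b <- seq(-width / 2, max(x) + width, by = width)
h_a <- hist(x, breaks = breaks_a, plot = FALSE)
h_b <- hist(x, breaks = breaks_b, plot = FALSE)
print(head(h_a$counts))
print(head(h_b$counts))

# Number of cells actually used by each automatic rule.
n_cells <- sapply(c("Sturges", "Scott", "FD"), function(rule) {
  length(hist(x, breaks = rule, plot = FALSE)$counts)
})
print(n_cells)
```

Note that hist treats a rule name or a single number in breaks as a suggestion and rounds the borders to "pretty" values, so the resulting number of cells can differ from what was requested.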

Creating Histograms with R. Questions regarding possibilities of use and problems with overlapping values

For my thesis I want to create a histogram of standardized earnings. This histogram should ideally have the following properties:
The intervals (bins) of the data should be adjustable.
Since I have my data in a spreadsheet, it should be possible to consider more than one column.
It should be possible to set the range of the data included in the histogram, for example from -50 million to 200 million (but I could also do this in my input).
Sadly I was not able to do this on my own.
I downloaded the data from Orbis as a spreadsheet (xlsx). Afterwards I cleaned my data of symbols that R can't read, saved everything as a tab-separated .txt file and imported it into RStudio:
setwd("/path")
getwd()
df<- read.table("importFile", header = TRUE)
View(df)
This worked nicely.
Then I tried creating the histogram:
library(ggplot2)
myplot=ggplot(df, aes(JuStandartisiert2007))
myplot+ stat_count(width = 1000)
Then I received the following warning:
position_stack requires non-overlapping x intervals
My histogram looks horrible:
This perplexes me: I tried making a histogram of the airquality dataset and it works without problems.
Also note that I have to use stat_count for my histogram; in a YouTube video I saw, they did it the following way:
myplot+ geom_histogram(binwidth = 10)
My questions are now:
What is wrong with my data? Why do I have overlapping x values? To my naked eye my data looks the same as R's airquality dataset.
How can I separate my x values?
Can I set max and min values for the data that enters my histogram?
Can I consider more than one column in my dataset?
Here is my Dataset as TAB separated txt file.
https://www.dropbox.com/sh/jbscj6cftpcqaxh/AADglvv_xnG2wWN-o2SIrTwpa?dl=0
I would rather begin with base plotting such as:
hist(df$JuStandartisiert2007,breaks=1000,xlim=c(-2,2))
Note the limits for the x-axis, set with xlim.
To plot two columns against each other, try:
plot(df$JuStandartisiert2007,df$BilanzsummeAktiva2007,xlim = c(-5,5),ylim=c(-1,1000))
Once again, note the x and y limits, set with xlim and ylim.
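As a rough sketch of how the three wishes from the question (adjustable bins, a restricted data range, more than one column) could be covered in base R, the example below uses simulated values. The second column name JuStandartisiert2008, the range and the bin width are assumptions made up for illustration and are not taken from the posted dataset.

```r
# Simulated stand-in for the imported data frame (assumption).
# JuStandartisiert2008 is a hypothetical second column.
set.seed(2)
df <- data.frame(
  JuStandartisiert2007 = rnorm(500, mean = 0, sd = 1),
  JuStandartisiert2008 = rnorm(500, mean = 0.3, sd = 1.2)
)

lo <- -2          # lower limit of the data to include
hi <- 2           # upper limit of the data to include
binwidth <- 0.25  # change this to play with the bins
breaks <- seq(lo, hi, by = binwidth)

# Keep only the values inside [lo, hi] so they fit the explicit breaks.
in_range <- function(v) v[!is.na(v) & v >= lo & v <= hi]
x1 <- in_range(df$JuStandartisiert2007)
x2 <- in_range(df$JuStandartisiert2008)

# Compute both histograms first so that a common y range can be used.
h1 <- hist(x1, breaks = breaks, plot = FALSE)
h2 <- hist(x2, breaks = breaks, plot = FALSE)
plot(h1, col = rgb(0, 0, 1, 0.4), ylim = c(0, max(h1$counts, h2$counts)),
     main = "Standardized earnings", xlab = "Value")
plot(h2, col = rgb(1, 0, 0, 0.4), add = TRUE)
```

Subsetting before calling hist changes which values are counted, whereas xlim only changes which part of the finished plot is visible.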

Uniform plot points in R -- Research / HW

This is for research I am doing for my Master's program in Public Health.
I am graphing data against each other, a standard x,y type deal, and on top of that I am plotting a predicted line. I get what I think is the most funky-looking point/boxplot thing ever, with an x-axis that is only half filled out, and I don't understand why, as I do not call a boxplot function. It is my understanding that when I call the plot function only the points will be plotted.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
library(survival)  # provides Surv() and survreg()
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
                dist = "logistic")
pred <- predict(test, type = "response")  # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch = unique(psymbol))  # produces the goofy graph
lines(tl_ord, pred_ord)  # this produces the line, not the boxplots
Here is the resulting picture
Not too sure how to proceed from here. This is an offshoot of another problem I had with the same data set (linked here). I do not understand why boxplots are being drawn: I did not call the boxplot() command, so I don't know why they appeared along with the points. When I issue the command plot(DAYS.TO.FAILURE, TOTAL.LACE), I get only points on the resulting plot, as I expected, but when I swap what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @Dwin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The code to do this is to use cex.axis and set the value <1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
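Evidently, one of your variables has been converted to a factor without your meaning to. Since the boxplots only appear when TOTAL.LACE is on the x-axis, that is the one to check (for example with is.factor(TOTAL.LACE)). Separately, psymbol <- unique(FAILURE + 1) together with pch=unique(psymbol) passes only the two distinct symbol values, which are then recycled across the points regardless of their FAILURE status. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.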
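The dispatch behaviour can be checked directly. The sketch below uses simulated scores (an assumption, not the poster's data) to show that the class of the first argument decides which kind of plot is drawn, and how a factor can be turned back into numbers.

```r
# Simulated stand-ins for TOTAL.LACE and DAYS.TO.FAILURE (assumption).
set.seed(3)
lace <- sample(0:19, 200, replace = TRUE)
days <- rpois(200, lambda = 15)

print(class(lace))        # a numeric (integer) vector
lace_f <- as.factor(lace)
print(class(lace_f))      # a factor

plot(lace, days)          # numeric x: scatterplot
plot(lace_f, days)        # factor x: one box per level

# Convert via as.character() so the original values are recovered,
# not the internal level codes.
lace_back <- as.numeric(as.character(lace_f))
plot(lace_back, days)     # scatterplot again
```

Calling as.numeric() directly on a factor returns the level codes (1, 2, 3, ...), which is why the detour through as.character() is used here.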

R histogram showing time spent in each bin

I'm trying to create a plot similar to the ones here:
Basically I want a histogram, where each bin shows how long was spent in that range of cadence (e.g 1 hour in 0-20rpm, 3 hours in 21-40rpm, etc)
library("rjson") # 3rd party library, so: install.packages("rjson")
# Load data from Strava API.
# Ride used for example is http://app.strava.com/rides/13542320
url <- "http://app.strava.com/api/v1/streams/13542320?streams[]=cadence,time"
d <- fromJSON(paste(readLines(url)))
Each value in d$cadence (rpm) is paired with the same index in d$time (the number of seconds from the start).
The values are not necessarily uniform (as can be seen if you compare plot(x=d$time, y=d$cadence, type='l') with plot(d$cadence, type='l') )
If I do the simplest possible thing:
hist(d$cadence)
This produces something very close, but the Y value is "frequency" instead of time, and it ignores the time between data points (so the 0 rpm segment in particular will be underrepresented).
You need to create a new column to account for the time between samples.
I prefer data.frames to lists for this kind of thing, so:
d <- as.data.frame(fromJSON(paste(readLines(url))))
d$sample.time <- 0
d$sample.time[2:nrow(d)] <- d$time[2:nrow(d)]-d$time[1:(nrow(d)-1)]
Now that you've got your sample times, you can simply "repeat" each cadence measure as many times as its sample time (in whole seconds), and plot a histogram of that:
hist(rep(x=d$cadence, times=d$sample.time),
main="Histogram of Cadence", xlab="Cadence (RPM)",
ylab="Time (presumably seconds)")
There's bound to be a more elegant solution that wouldn't fall apart for non-integer sample times, but this works with your sample data.
EDIT: re: the more elegant, generalized solution, you can deal with non-integer sample times with something like new.d <- aggregate(sample.time~cadence, data=d, FUN=sum), but then the problem becomes plotting a histogram for something that looks like a frequency table, but with non-integer frequencies. After some poking around, I'm coming to the conclusion you'd have to roll-your-own histogram for this case by further aggregating the data into bins, and then displaying them with a barchart.
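One way to roll your own histogram along those lines is sketched below: cut() assigns each cadence value to a bin, tapply() sums the sample times per bin, and barplot() displays the totals. The data are simulated with non-integer sample times (an assumption, since the Strava URL may not be reachable), and the 20 rpm bin width is arbitrary.

```r
# Simulated stand-in for the Strava stream (assumption): irregular
# time stamps in seconds and cadence in rpm.
set.seed(4)
d <- data.frame(
  time = cumsum(c(0, runif(199, min = 0.5, max = 3))),
  cadence = pmax(0, round(rnorm(200, mean = 80, sd = 25)))
)

# Time elapsed since the previous sample; the first sample gets 0.
d$sample.time <- c(0, diff(d$time))

# Bins of 20 rpm covering the whole cadence range.
breaks <- seq(0, ceiling(max(d$cadence) / 20) * 20, by = 20)
d$bin <- cut(d$cadence, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Total time per bin; bins without samples come back as NA, so set them to 0.
time_per_bin <- tapply(d$sample.time, d$bin, sum)
time_per_bin[is.na(time_per_bin)] <- 0

barplot(time_per_bin, space = 0,
        main = "Time spent per cadence range",
        xlab = "Cadence (RPM)", ylab = "Time (seconds)")
```

Because the times are summed rather than used as repeat counts, this works for non-integer sample times as well.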

R histogram results in empty graph

I'm a beginner R programmer attempting to plot a histogram of an insurance claims dataset with 100,000+ observations which is heavily skewed (mean=$61,000, median=$20,000, max value=$15M).
I've submitted the following code to graph the adj_unl_claim variable over the $0-$100,000 domain:
hist(test$adj_unl_claim, freq=FALSE, ylim=c(0,1), xlim=c(0,100000),
prob=TRUE, breaks=10, col='red')
with the result being an empty graph with axes but no histogram bars - just an empty graph.
I suspect the problem is related to the skewed nature of my data, but I've tried every combination of breaks and xlim and nothing works. Any solutions are much appreciated!
If you've set freq = FALSE, then you are getting a histogram of probability densities. These are likely much less than 1. Consequently, your histogram bars are probably printed super-tiny along the x-axis. Try again without setting the ylim, and R will automatically calculate reasonable y axis limits.
Note also that setting the xlim doesn't change the actual plot, just how much of it you see. So you might not actually see 10 breaks, if some of them fall beyond the 100000 limit in your plot. You might actually want to subset your data to exclude values over 100000 first, and then do a histogram on the reduced dataset to get the plot you want. Maybe, I'm not sure what your objective is here.
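A quick check on simulated data (an exponential sample with a mean of about $60,000, which is an assumption and not the real claims) illustrates both points: the densities are tiny compared with a y limit of 1, and subsetting before calling hist gives bars that fill the plot.

```r
# Simulated skewed claims (assumption).
set.seed(5)
claim <- rexp(10000, rate = 1 / 60000)

h <- hist(claim, breaks = 50, plot = FALSE)
print(max(h$density))                    # far below 1
print(sum(h$density * diff(h$breaks)))   # bar areas add up to 1

# Subset first, then let R choose the y-axis limits.
small <- claim[claim <= 100000]
hist(small, freq = FALSE, breaks = 10, col = "red")
```

With freq = FALSE the bar heights are densities per dollar, so for values spread over tens of thousands of dollars they are necessarily several orders of magnitude below 1.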
This might give you something to play with, using some of Tyler's suggestions.
> claim <- c(15000000, rexp(99999, rate = 1/400)^1.76)
> summary(claim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 4261 20080 61730 67790 15000000
>
> hs <- 100000 # highest value to show on histogram
> br <- 10 # number of bars to show on histogram
>
> hist(claim, xlim = c(0,hs), freq = FALSE, breaks = br*max(claim)/hs, col='red')
>
> length(claim[claim<hs]) / length(claim) #proportion of claims shown
[1] 0.82267
> sum(claim[claim<hs]) / sum(claim) #proportion of value shown
[1] 0.3057994
where hist produced something like
The problem with this is that although the histogram covers about 82% of the claims in this pseudo-data, it only covers about 31% of the value of the claims. So unless the only point you want to make is that most claims are small, you might want to consider a different graph.
My guess is that the real point from your data is that while most claims are fairly small, most of the cost is in the big claims. The big claims will not show up in a histogram, even if you extend the scale. Instead break the claims up into groups of differing widths, including for example 0-$1000 and $1M+, and show with a dot plot (a) what proportion of claims fall into each group and (b) what proportion of the values of claims fall into each group.
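A possible sketch of that grouped view is given below. It reuses the pseudo-data recipe from the transcript above; the group edges and labels are assumptions chosen for illustration.

```r
# Pseudo-data built with the same recipe as in the transcript above.
set.seed(6)
claim <- c(15000000, rexp(99999, rate = 1 / 400)^1.76)

# Groups of differing widths (edges and labels are illustrative assumptions).
edges <- c(0, 1e3, 1e4, 1e5, 1e6, Inf)
labels <- c("0-$1k", "$1k-$10k", "$10k-$100k", "$100k-$1M", "$1M+")
group <- cut(claim, breaks = edges, labels = labels, right = FALSE)

# (a) share of the number of claims, (b) share of the total claim value.
share_count <- as.numeric(table(group)) / length(claim)
share_value <- as.numeric(tapply(claim, group, sum)) / sum(claim)
share_value[is.na(share_value)] <- 0   # guard against empty groups

op <- par(mfrow = c(1, 2))
dotchart(share_count, labels = labels, xlab = "Proportion of claims",
         main = "By number")
dotchart(share_value, labels = labels, xlab = "Proportion of value",
         main = "By value")
par(op)
```

Placing the two dot plots side by side makes it easy to compare how the groups rank by number of claims with how they rank by total value.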
Two things to try:
hist(test$adj_unl_claim[test$adj_unl_claim < 100000])
will plot a histogram of all claims of less than $100k. This omits the tail in the interest of showing the bulk of the data. Alternatively,
hist(log(test$adj_unl_claim))
will log-transform your claim size, effectively bringing the long tail back in.
Thanks, subsetting my data did the trick. I also added a few lines of code that calculate the proportion of observations in each histogram bin and then plot them with specific y and x limits:
k<-hist(gb2_agg$adj_unl_claim,prob=TRUE,breaks=100000)
k$counts<-k$counts/sum(k$counts)
plot(k,ylim=c(0,.02),xlim=c(0,50000),col='blue')
