I have a data frame data, and I would like to draw a proper density plot for it. When I draw the plot, the interval shown covers a wider range than my data.
input:
X Y
1 0.4078791 0.471845
2 0.2892282 0.205871
3 0.4254774 0.407548
4 0.4749196 0.396765
5 0.2763627 0.142572
6 0.3942402 0.457668
7 0.2427948 0.248003
8 0.3117754 0.322484
9 0.4350599 0.450679
10 0.4459200 0.338858
That's how kernel density estimation works. The result must cover a wider range than your data. You can try different kernels and bandwidth algorithms or fiddle with the adjust parameter, but you actually want the density estimate to cover a wider range than your data. Otherwise it wouldn't be a proper density estimate.
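For illustration, a quick sketch of those knobs, assuming the data frame is called data as above (the cut argument of density() controls how far past the data range the estimate extends; adjust scales the bandwidth):

d.default <- density(data$X)                  # default bandwidth (bw.nrd0)
d.narrow  <- density(data$X, adjust = 0.5)    # half the default bandwidth
d.clipped <- density(data$X, cut = 0)         # stop the estimate at the data range
plot(d.default, main = "Density of X")
lines(d.narrow, col = "red")
lines(d.clipped, col = "blue")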
Not sure whether this should go on Cross Validated or not, but we'll see. Basically, I recently obtained data from an instrument (masses of compounds from 0 to 630), which I binned into 0.025-wide bins before plotting a histogram, as seen below.
I want to identify the bins that have high frequency and stand out against the background noise (the background noise increases as you move from right to left on the x-axis). Imagine drawing a curve on top of the points that have almost blurred together into a black lump, and then selecting the bins that sit above that curve for further investigation; that's what I'm trying to do. I plotted a kernel density estimate to see if I could overlay it on my histogram and use it to identify points above the curve. However, the density plot makes no headway with this, as the density values are far too low (see the second plot). Does anyone have any recommendations on how I can go about solving this problem? The blue line represents the overlaid density function, and the red line represents the ideal solution (I need a way of somehow automating this in R).
The data below is only part of my dataset, so it's not really a good representation of my plot (which contains about 300,000 points), and as my bin sizes are quite small (0.025) there's just a huge spread of data (in total there are 25,000 or so bins).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
look at a window around each bin, say ±50 bins:
current.bin <- 1
window.size <- 50
# note the parentheses: ":" binds tighter than "+"/"-" in R
window <- bin.densities[max(1, current.bin - window.size):
                        min(length(bin.densities), current.bin + window.size)]
find the 5% lower and 95% upper quantile values (or really any values you think work):
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant)
this.is.too.low  <- (bin.densities[current.bin] < lower.quant)

# final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but this is the general approach I would take. You can play around with the window size and the quantile percentages until the results look reasonable. Again, it's not exactly complex math, but hopefully it helps.
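For completeness, here is roughly what that approach might look like wrapped into a function (still a sketch in the same untested spirit; the window is clipped at the edges of the vector):

find.outlier.bins <- function(bin.densities, window.size = 50,
                              lower.p = 0.05, upper.p = 0.95) {
  n <- length(bin.densities)
  outlier <- logical(n)
  for (current.bin in seq_len(n)) {
    # clip the window at the ends of the vector
    lo <- max(1, current.bin - window.size)
    hi <- min(n, current.bin + window.size)
    window <- bin.densities[lo:hi]
    lower.quant <- quantile(window, lower.p)
    upper.quant <- quantile(window, upper.p)
    outlier[current.bin] <- bin.densities[current.bin] < lower.quant ||
                            bin.densities[current.bin] > upper.quant
  }
  outlier
}

# e.g. flag bins that stand out from their local background:
# is.outlier <- find.outlier.bins(bin.densities)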
I need to plot more than one confidence interval in one plot, in a particular order.
For example, my data is:
N Est. Lower Upper
1 5 3 6
2 1 0 4
3 3 0 7
I use the following command to plot:
proc sgplot data=confidence;
scatter y=N x=est / xerrorlower=lower xerrorupper=upper
markerattrs=(symbol=circlefilled size=9);
run;
SAS will always plot the confidence intervals in the order of N from 1 to 3. However, I need to show the trend of the est values, i.e. the order I need is N=2 first, followed by N=3 and then N=1, corresponding to est = 1, 3, 5. Even after sorting by est, SAS still does the same thing. I know I can sort and add a new order variable to my data to get the result I want, but I still want to show the correct N in my final plot so I can identify each confidence interval. Thanks.
You can request a discrete vertical axis, and specify the ordering method using the yaxis statement:
yaxis discreteorder = data type = discrete;
This will tell SAS to ignore the values in N and display them based on the order in which they are read from the dataset. You will have to sort your data in advance.
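Putting that together with the scatter statement from the question, the full step might look like this (a sketch; the proc sort puts the rows in the est order you want before plotting):

proc sort data=confidence;
by est;
run;

proc sgplot data=confidence;
scatter y=N x=est / xerrorlower=lower xerrorupper=upper
markerattrs=(symbol=circlefilled size=9);
yaxis discreteorder=data type=discrete;
run;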
I have used sm.density.compare to plot 3 density functions for data with values between -90 and +10. The Y axis is labeled "density" and ranges from 0 to 1.0, as for proportions or probabilities.
I then plot 4 density functions for data with values between 0 and 1.0. I get a useful plot and the Y axis still reads "density", but the values apparently look like counts, ranging between 0 and 12 or so.
The function sm.options does not seem to offer control over which you get. I'd like both to be probabilities or proportions.
I'm new to R but have a substantial history with other software.
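For reference, a minimal sketch of the kind of comparison described (the data here are made up for illustration):

library(sm)
set.seed(1)
# three hypothetical groups with values roughly between -90 and +10
x <- c(rnorm(100, -40, 15), rnorm(100, -20, 10), rnorm(100, -60, 8))
group <- factor(rep(1:3, each = 100))
sm.density.compare(x, group, xlab = "value")
# note: the y-axis is an estimated density, which is not bounded by 1;
# its scale varies inversely with the spread of the data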
I would like to show a simple box plot of data from 3 time points (0, 7 and 28) against abundance. I want to split the plots by treatment (i.e. CO2 level/temperature), which will be nested within time point. Essentially I have 2 box plots per time point, indicating the 2 different treatments. I was going to use an overlay, but because I have 2 box plots for each time point I am finding it tricky to work out the correct code.
Thanks
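For what it's worth, a minimal sketch of one way to nest the treatments within time points in base R, assuming a data frame with columns time, treatment and abundance (all names and values here are made up):

# hypothetical data: abundance by time point and treatment
set.seed(1)
abund <- data.frame(
  time      = factor(rep(c(0, 7, 28), each = 20)),
  treatment = factor(rep(c("ambient", "elevated"), times = 30)),
  abundance = rnorm(60, mean = 10)
)

# interaction of treatment and time gives two boxes side by side per time point
boxplot(abundance ~ treatment + time, data = abund,
        col = c("white", "grey"),
        xlab = "Time (days)", ylab = "Abundance")
legend("topright", fill = c("white", "grey"),
       legend = levels(abund$treatment))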
I have a number of coordinates and I want to plot them in a gridded interface by using R.
The problem is that the relative distances between observations are large. The coordinates are in a geographic coordinate system and the study area is Switzerland. Moreover, the ids of the points need to be plotted.
Two clusters of the points are dense, while some other points lie a large distance away. How can I plot them so that the presentation is readable? Any suggestions for plotting the data?
Preferably, do not use ggplot, as I have used it before and it did not produce proper results.
Data:
id x y
2 7.1735 45.86880001
3 7.17254 45.86887001
4 7.171636 45.86923601
5 7.18018 45.87158001
6 7.17807 45.87014001
7 7.177229 45.86923001
8 7.17524 45.86808001
9 7.181409 45.87177001
10 7.179299 45.87020001
11 7.178359 45.87070001
12 7.175189 45.86974001
13 7.179379 45.87081001
14 7.175509 45.86932001
15 7.176839 45.86939001
17 7.18099 45.87262001
18 7.18015 45.87248001
19 7.18122 45.87355001
20 7.17491 45.86922001
25 7.15497 45.87058001
28 7.153399 45.86954001
29 7.152649 45.86992001
31 7.154419 45.87004001
32 7.156099 45.86983001
GSBi_1 7.184 45.896
GSBi__1 7.36 45.901
GSBj__1 7.268 45.961
GSBj_1 7.276 45.836
GSB 7.272 45.899
GSB_r 7.166667 45.866667
Location of points:
As you can see in the plot, the points' ids are not readable, both in the dense parts and elsewhere.
Practically, it is not always possible to ensure that all points are visually separable on the screen when plotting a set of points that contains very close and very far points at the same time.
Think of a 1000x800-pixel screen. Say we have three points A, B and C located on the same horizontal line, such that the distance between A and B is 1 unit and the distance between A and C is 4000 units.
If you map this maximum distance (4000 units) to the width of the screen (1000 px), then one pixel corresponds to 4 horizontal units. That means A and B will fall into the same pixel, since the distance between them is only 1 unit, so they will not be visually separable on the screen.
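The same arithmetic as a quick R sketch (using the made-up coordinates from the example):

screen.px <- 1000
xA <- 0; xB <- 1; xC <- 4000              # A-B distance 1 unit, A-C distance 4000 units
units.per.px <- (xC - xA) / screen.px     # 4 units per pixel
floor(xA / units.per.px)                  # pixel column 0
floor(xB / units.per.px)                  # also pixel column 0, so A and B overlap on screen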
Your points are far too close to really do too much with, but an idea might be spread.labels from plotrix:
library(plotrix)                      # for spread.labels()
opar <- par(xpd = TRUE)               # allow labels to be drawn outside the plot region
plot(dat$x, dat$y)                    # dat: the data frame with columns id, x, y
spread.labels(dat$x, dat$y, dat$id)   # spread the id labels away from the points
par(opar)                             # restore the previous settings
You may want to consider omitting all the numerical labels and placing them in a different graph.