I want to identify a couple of points with high leverage on the plot below, but unfortunately, their row number is illegible, because there must be a couple of such points, and their id is printed out one on top of the other. They are all the way to the right of the plot:
How can the print out of these labels on the plot be resized and spread out so that they can be legible?
The easiest way to find the cooks distance is the built in function:
LM = lm(speed ~ dist, cars)
cooks.distance(LM)
You can pick out whatever values you want:
> which(cooks.distance(LM) > 0.05)
1 2 23 35 39 49
1 2 23 35 39 49
Related
I have a graph with three plots. I made them with the package ggplot2.
I want to calculate to what time the three plots reach the same value of X C using R programming.
I want also want to mark the point on the graph e.g. with a longitudinal dashed line from the x-axis.
I tried to use which.max, but it only gives me one value for one of the plots that reach plateau first. I want the the time-value where all plots reach 37 C at the same time.
You could do something like this, which gives you the index of the first time they are all equal to 37:
# generate example data
x <- c(3:37, rep(37,3))
y <- c(5:37, rep(37,5))
z <- c(7:37, rep(37,7))
# first instance when all are equal to 37
min(which(x == 37 & y == 37 & z ==37))
#> [1] 35
Created on 2021-03-04 by the reprex package (v1.0.0)
Right off the bat, I'm a newbie to R and maybe I'm misunderstanding the concept of what my code does vs what I want it to do. Here is the code I've written so far.
his <- hist(x, breaks=seq(floor(min(x)) , ceiling(max(x)) + ifelse(ceiling(max(x)) %% 5 != 0, 5, 0), 5)
Here is some sample data:
Autonr X
1 -12
2 -6
3 -17
4 8
5 -11
6 -10
7 10
8 -22
I'm not able to upload one of the histograms that did work, but it should show bins of 5, no matter how large the spread of the data. The amount of bins should therefore be flexible.
The idea of the code above is to make sure that the outer ranges of my data always fall within neatly defined 5mm bins. Maybe I lost oversight. but I can't seem to understand why this does not always work. In some cases it does, but with other datasets it doesn't.
I get: some 'x' not counted; maybe 'breaks' do not span range of 'x'.
Any help would be greatly appreciated, as I don't want to have to tinker around with my breaks and bins everytime I get a new dataset to run through this.
Rather than passing a vector of breaks, you can supply a single value, in this case the calculation of how many bins are needed given the range of the data and a bindwidth of 5.
# Generate uniformly distributed dummy data between 0 and 53
set.seed(5)
x <- runif(1000, 0, 53)
# Plot histogram with binwidths of 5.
hist(x, breaks = ceiling(diff(range(x)) / 5 ))
For the sake of completeness, here is another approach which uses breaks_width() from the scales package. scales is part of Hadley Wickham's ggplot2 ecosphere.
# create sample data
set.seed(5)
x <- runif(1000, 17, 53)
# plot histogram
hist(x, breaks = scales::breaks_width(5)(range(x)))
Explanation
scales::breaks_width(5) creates a function which is then called with the vector of minimum and maximum values of x as returned by range(x). So,
scales::breaks_width(5)(range(x))
returns a vector of breaks with the required bin size of 5:
[1] 15 20 25 30 35 40 45 50 55
The nice thing is that we have full control of the bins. E.g., we can ask for weird things like a bin size of 3 which is offset by 2:
scales::breaks_width(3, 2)(range(x))
[1] 17 20 23 26 29 32 35 38 41 44 47 50 53 56
Not sure whether this should go on cross validated or not but we'll see. Basically I obtained data from an instrument just recently (masses of compounds from 0 to 630) which I binned into 0.025 bins before plotting a histogram as seen below:-
I want to identify the bins that are of high frequency and that stands out from against the background noise (the background noise increases as you move from right to left on the a-xis). Imagine drawing a curve line ontop of the points that have almost blurred together into a black lump and then selecting the bins that exists above that curve to further investigate, that's what I'm trying to do. I just plotted a kernel density plot to see if I could over lay that ontop of my histogram and use that to identify points that exist above the plot. However, the density plot in no way makes any headway with this as the densities are too low a value (see the second plot). Does anyone have any recommendations as to how I Can go about solving this problem? The blue line represents the density function plot overlayed and the red line represents the ideal solution (need a way of somehow automating this in R)
The data below is only part of my dataset so its not really a good representation of my plot (which contains just about 300,000 points) and as my bin sizes are quite small (0.025) there's just a huge spread of data (in total there's 25,000 or so bins).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
look at a window around each bin, say +- 50 bins
current.bin <- 1
window.size <- 50
window <- bin.densities[current.bin-window.size : current.bin+window.size]
find the 95% upper and lower quantile value (or really any value you think works)
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant
this.is.too.low <- (bin.densities[current.bin] < lower.quant)
#final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but this is the general approach I would take. You can play around with window size and the quantile percentages until the results look reasonable. Again, not exactly super complex math but hopefully it helps.
I'm fairly new to R but I am trying to create line graphs that monitor growth of bacteria over the course of time. I can successfully do this but the resulting graph isn't to my satisfaction. This is because I'm not using evenly spaced time increments although R plots these increments equally. Here is some sample data to give you and idea of what I'm talking about.
x=c(.1,.5,.6,.7,.7)
plot(x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")
axis(1,at=1:5,lab=c(0,24,72,96,120))
As you can see there are 48 hours between 24 and 72 but this is evenly distributed on the graph, is there anyway I can adjust the scale to more accurately display my data?
It's always best in R to use data structures that exhibit the relationships between your data. Instead of defining growth and time as two separate vectors, use a data frame:
growth <- c(.1,.5,.6,.7,.7)
time <- c(0,24,72,96,120)
df <- data.frame(time,growth)
print(df)
time growth
1 0 0.1
2 24 0.5
3 72 0.6
4 96 0.7
5 120 0.7
plot(df, type="o")
Not sure if this produces the exact x-axis labels that you want, but you should be free to edit the graph now without changing the relationship between the growth and time variables.
x=data.frame(x=c(.1,.5,.6,.7,.7), y=c(0,24,72,96,120))
plot(x$y, x$x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")
I have a number of coordinates and I want to plot them in a gridded interface by using R.
The problem is that the relative distance between observations is large. Coordinates are in a geographic coordinate system and the study area is Switzerland. Moreover, id of the points is required to be plotted.
The problem is that two clusters of the points are dense and some other points are separated with a large distance. How I can plot them in a proper way to have readable presentation? Any suggestion for plotting the data?
Preferably, do not use ggplot as I used it before and it did not present proper results.
Data:
id x y
2 7.1735 45.86880001
3 7.17254 45.86887001
4 7.171636 45.86923601
5 7.18018 45.87158001
6 7.17807 45.87014001
7 7.177229 45.86923001
8 7.17524 45.86808001
9 7.181409 45.87177001
10 7.179299 45.87020001
11 7.178359 45.87070001
12 7.175189 45.86974001
13 7.179379 45.87081001
14 7.175509 45.86932001
15 7.176839 45.86939001
17 7.18099 45.87262001
18 7.18015 45.87248001
19 7.18122 45.87355001
20 7.17491 45.86922001
25 7.15497 45.87058001
28 7.153399 45.86954001
29 7.152649 45.86992001
31 7.154419 45.87004001
32 7.156099 45.86983001
GSBi_1 7.184 45.896
GSBi__1 7.36 45.901
GSBj__1 7.268 45.961
GSBj_1 7.276 45.836
GSB 7.272 45.899
GSB_r 7.166667 45.866667
Location of points:
As you can see in the plot, the points' ids are not readable both for the dense parts and others.
Practically, it is not always possible to ensure that all points are visually separable on the screen when plotting a set of points that contains very close and very far points at the same time.
Think of a 1000x800 pixel screen. Let's say we have three points A, B and C that are located respectively on the same horizontal line such that: the distance between A and B is 1 unit and the distance between A and C is 4000 unit.
If you map this maximum distance (4000 unit) to the width of the screen (1000px). Then a pixel will correspond to 4 units in horizontal. That means A and B will fit into one pixel since the distance between them is only 1 unit. So, they will not be visually separable on the screen.
Your points are far too close to really do too much with, but an idea might be spread.labels from plotrix:
opar <- par()
par(xpd=TRUE)
plot(dat$x, dat$y)
spread.labels(dat$x,dat$y,dat$id)
par(opar)
You may want to consider omitting all the numerical labels and placing them in a different graph.