Here is a plot of several different time series that I made in R:
I made these using a simple loop:
for(i in 1:ngroups){
x[paste0("Group_",i)] = apply(x[,group == i],1,mean)
}
plot(x$Group_1,type="l",ylim=c(0,300))
for(i in 2:ngroups){
lines(x[paste0("Group_",i)],col=i)
}
I also could have made this plot using matplot. Now, as you can see, each group is the mean of several other columns. What I would like to do is plot the series as in the plot above, but additionally show the range of the underlying data contributing to that mean. For example, the purple line would be bounded by a region shaded light purple. At any given time index, the purple region will extend from the lowest value in the purple group to the highest value (or, say, the 5 to 95 percentiles). Is there an elegant/clever way to do this?
Here is an answer using the graphics package (the base graphics that ship with R). I also try to explain how the polygon that draws the CI is constructed. You can repurpose this for your problem, for which I do not have the exact data.
# Values for noise and CI size
s.e. <- 0.25 # standard error of noise
interval <- s.e.*qnorm(0.975) # standard error * 97.5% quantile
# Values for Fake Data
x <- 1:10 # x values
y <- (x-1)*0.5 + rnorm(length(x), mean=0, sd=s.e.) # generate y values
# Main Plot
ylim <- c(min(y)-interval, max(y)+interval) # account for CI when determining ylim
plot(x, y, type="l", lwd=2, ylim=ylim) # plot x and y
# Determine the x values that will go into CI
CI.x.top <- x # x values going forward
CI.x.bot <- rev(x) # x values backwards
CI.x <- c(CI.x.top, CI.x.bot) # polygons are drawn clockwise
# Determine the Y values for CI
CI.y.top <- y+interval # top of CI
CI.y.bot <- rev(y)-interval # bottom of CI, but rev Y!
CI.y <- c(CI.y.top,CI.y.bot) # forward, then backward
# Add a polygon for the CI
CI.col <- adjustcolor("blue",alpha.f=0.25) # Pick a pretty CI color
polygon(CI.x, CI.y, col=CI.col, border=NA) # draw the polygon
# Point out path of polygon
arrows(CI.x.top[1], CI.y.top[1]+0.1, CI.x.top[3], CI.y.top[3]+0.1)
arrows(CI.x.top[5], CI.y.top[5]+0.1, CI.x.top[7], CI.y.top[7]+0.1)
arrows(CI.x.bot[1], CI.y.bot[1]-0.1, CI.x.bot[3], CI.y.bot[3]-0.1)
arrows(CI.x.bot[6], CI.y.bot[6]-0.1, CI.x.bot[8], CI.y.bot[8]-0.1)
# Add legend to explain what the arrows are
legend("topleft", legend="Arrows indicate path\nfor drawing polygon", xjust=0.5, bty="n")
And here is the final result:
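To connect this back to the question, here is a minimal sketch of the same polygon technique applied to grouped series, shading from the lowest to the highest value in each group at every time index. The data are simulated, and the names `n_time`, `per_group`, `time` and `band` are assumptions made for illustration; `x` and `group` stand in for the asker's objects, whose exact structure I do not know.

```r
set.seed(1)
n_time    <- 50                     # number of time points (assumed)
ngroups   <- 3                      # number of groups (assumed)
per_group <- 5                      # series per group (assumed)
group     <- rep(seq_len(ngroups), each = per_group)

# Simulated data: one column per series, one row per time index
x <- sapply(group, function(g) 50 * g + cumsum(rnorm(n_time, sd = 3)))
time <- seq_len(n_time)

# Empty plot sized to hold every series
plot(time, x[, 1], type = "n", ylim = range(x),
     xlab = "Time index", ylab = "Value")

band <- vector("list", ngroups)
for (i in seq_len(ngroups)) {
  cols <- x[, group == i, drop = FALSE]
  lo <- apply(cols, 1, min)         # swap in quantile(v, 0.05) for percentiles
  hi <- apply(cols, 1, max)         # swap in quantile(v, 0.95) for percentiles
  band[[i]] <- cbind(lo = lo, mean = rowMeans(cols), hi = hi)

  # Upper edge going forward, lower edge coming back
  polygon(c(time, rev(time)), c(hi, rev(lo)),
          col = adjustcolor(i, alpha.f = 0.25), border = NA)
  lines(time, band[[i]][, "mean"], col = i, lwd = 2)
}
```

For a 5 to 95 percentile band, replace `min` and `max` with `function(v) quantile(v, 0.05)` and `function(v) quantile(v, 0.95)`.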
I have made a df using some random data.
Here's the df
df
x y
1 1 3.1667912
2 1 3.5301539
3 1 3.8497014
4 1 4.4494311
5 1 3.8306889
6 1 4.7681518
7 1 2.8516945
8 1 1.8350802
9 1 5.8163498
10 1 4.8589443
11 2 0.3419090
12 2 2.7940851
13 2 1.9688636
14 2 1.3475315
15 2 0.9316124
16 2 1.3208475
17 2 3.0367743
18 2 3.2340156
19 2 1.8188969
20 2 2.5050162
Here is the plot using stat_summary with mean_cl_normal and geom = "smooth":
ggplot(df,aes(x=x,y=y))+geom_point() +
stat_summary(fun.data=mean_cl_normal, geom="smooth", colour="red")
As someone commented, mean_cl_boot may be a better choice, so here is the same plot using it:
ggplot(df,aes(x=x,y=y))+geom_point() +
stat_summary(fun.data=mean_cl_boot, geom="smooth", colour="red")
They are indeed a little different. You could also adjust the conf.int parameter depending on your needs.
I have looked around for solutions to this problem and the closest I found was: Get upper triangular matrix from nonsymmetric matrix, but this did not work for me.
I have a matrix in R with 2 columns and over 3000 rows.
head(diag_calc)
X Y
[1,] 0.4991733 0.05358506
[2,] 1.1758962 0.70707194
[3,] 0.2197383 -0.00148791
[4,] 0.6389240 0.24411083
[5,] 0.8708275 0.16959840
[6,] 0.9784328 0.10341456
When I plot them against each other they look like this:
I want to identify all the rows containing points on either extreme of the diagonals. I have tried labeling points by being in the 3rd quartile of X and 1st quartile of Y and colored them orange. I did the reverse and colored them purple. However, this metric does not capture the true biological variability in my system and it appears that identifying cells at the extremes off a diagonal (which starts at the inflection point of the quartile that I labeled) would provide a better result.
I have tried using diag, upper.tri, and lower.tri from base R, but these do not work, I think due to the asymmetrical nature of my matrix. diag does work to calculate the inflection point that each diagonal line passes through, though, like so:
diag_calc <- Ad_SF7_fc_scored_NK %>%
select(one_of("X", "Y")) %>%
as.matrix(.)
diag(diag_calc) -> diag_test
diag_test
[1] 0.4991733 0.7070719
I can get the other inflection point by swapping the X and Y variables when generating my matrix.
Does anyone have a solution or advice on potential approaches to use?
Thanks!
Here is a way to proceed making an assumption about how you defined your diagonal lines. First create reproducible data and get the quantiles:
set.seed(42)
X <- rnorm(500, 1.5, .5)
Y <- rnorm(500)
Xq <- quantile(X)
Yq <- quantile(Y)
df <- data.frame(X, Y)
Now plot the data and identify a line that passes through the lower left quantile intersection and the upper right quantile intersection. Then use the slope to identify parallel lines that pass through the upper left and lower right intersections:
plot(X~Y, df, pch=20)
abline(v=Yq[2:4], lty=3)
abline(h=Xq[2:4], lty=3)
diag <- lm(Xq[c(2, 4)]~Yq[c(2, 4)])
points(Yq[c(2, 4)], Xq[c(2, 4)], cex=2, col="red", lwd=2)
abline(diag)
b <- coef(diag)[2]
a1 <- Xq[4] - b * Yq[2]
a2 <- Xq[2] - b * Yq[4]
abline(a1, b)
abline(a2, b)
Now identify the points above and below these two lines:
res1 <- X - (a1 + b * Y)
res2 <- (a2 + b * Y) - X
clr <- c("black", "purple", "darkorange")
idx <- ifelse(res1 > 0, 3, ifelse(res2 > 0, 2, 1))
plot(X~Y, pch=20, col=clr[idx])
abline(a1, b, col="red")
abline(a2, b, col="red")
Finally add the identification of the outliers to the data:
position <- c("inside", "below", "above")
df$outlier <- position[idx]
head(df, 10)
# X Y outlier
# 1 2.185479 1.029140719 inside
# 2 1.217651 0.914774868 below
# 3 1.681564 -0.002456267 inside
# 4 1.816431 0.136009552 inside
# 5 1.702134 -0.720153545 inside
# 6 1.446938 -0.198124330 inside
# 7 2.255761 -1.029208806 above
# 8 1.452670 -0.966955896 inside
# 9 2.509212 -1.220813089 above
# 10 1.468643 0.836207704 inside
I'm looking to create a model that classifies a set of points that are near a pre-defined point.
For example, let's say I have points:
 X  Y
 1  1
 1  2
 1  3
 2  1
 2  3
 3  1
 3  2
 3  3
 6  6
 8  7
 8  5
 9  3
10  7
My goal is to identify which points are closest to predefined point (2,2) and ideally output which points those are.
I tried using KNN, but I could not figure out how to get the KNN model to train results near (2,2). Any guidance to how I may accomplish this would be awesome. :)
Plot of Points
df <- data.frame( x = c(1,1,1,2,2,2,3,3,3,6,8,8,9,10), y = c(1,2,3,1,2,3,1,2,3,6,7,5,3,7))
df
goal_point <- c(x=2,y=2)
goal_point
You might approach this by calculating distance from goal as a feature.
df$dist = sqrt((df$x - goal_point["x"])^2 +
(df$y - goal_point["y"])^2)
df$clust = kmeans(df, 2)$cluster
library(ggplot2)
ggplot(df, aes(x, y, color = clust)) +
geom_point()
In this case kmeans is using x, y, and distance from goal. You could also use just distance from goal by using df$clust = kmeans(df[,3], 2)$cluster, which would lead here to the same clustering.
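If the goal is simply to list which points are closest to (2,2), clustering may not be needed at all. The sketch below sorts by the same distance feature and keeps either the `k` nearest points or every point within a radius; `k` and `radius` are hypothetical values chosen for illustration, not something given in the question.

```r
df <- data.frame(x = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 6, 8, 8, 9, 10),
                 y = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 6, 7, 5, 3, 7))
goal_point <- c(x = 2, y = 2)

# Euclidean distance from every point to the goal point
df$dist <- sqrt((df$x - goal_point[["x"]])^2 +
                (df$y - goal_point[["y"]])^2)

# Option 1: the k nearest points (k is an assumed value)
k <- 5
nearest_k <- df[order(df$dist), ][seq_len(k), ]

# Option 2: every point within a chosen radius (radius is an assumed value)
radius <- 1.5
within_radius <- df[df$dist <= radius, ]

nearest_k
within_radius
```

Which option fits better depends on whether "near" means a fixed number of neighbours or a fixed distance.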
I'm trying to find sites to collect snails by using a semi-random selection method. I have set a 10km2 grid around the region I want to collect snails from, which is broken into 10,000 10m2 cells. I want to randomly sample this grid in R to select 200 field sites.
Randomly sampling a matrix in R is easy enough:
dat <- matrix(1:10000, nrow = 100)
sample(dat, size = 200)
However, I want to bias the sampling to pick cells closer to a single position (representing sites closer to the research station). It's easier to explain this with an image;
The yellow cell with a cross represents the position I want to sample around. The grey shading is the probability of picking a cell in the sample function, with darker cells being more likely to be sampled.
I know I can specify sampling probabilities using the prob argument in sample, but I don't know how to create a 2D probability matrix. Any help would be appreciated, I don't want to do this by hand.
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station? (edit: not snails nest)
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2
dist2 takes the matrices of coordinates coords1 and coords2 and
returns the inter-Euclidean distances between coordinates.
We can visualize this using the image function.
XY <- (matrix(D, nr=nx, byrow=TRUE))
image(XY) # axes are scaled to 0-1
# Create a scaling function - scales x to lie in [0-1)
scale_prop <- function(x, m=0)
(x - min(x)) / (m + max(x) - min(x))
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
# Sampling probabilities decrease linearly with distance from the station. Distances are scaled to lie in [0, 1) so that the most distant cell does not get a scaled distance of exactly 1, and hence a probability of 0 (m=1).
prob <- 1 - scale_prop(D, m=1); range (prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # These are your (**MY!**) 5 samples
x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
sam <- sample(1:nrow(xy), size = nsamples, prob=prob)
apply(xy[sam,], 1, function(x) paste(x[1], x[2], sep=","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although the sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, then you can rescale the probabilities, which I think is ok to do, to save costs on travelling, but these weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
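For reference, the 2D probability matrix the question asks about can also be built in base R without an extra package. This is a minimal sketch under the same assumptions as above (probability falling off linearly with Euclidean distance, station at row 60 and column 70 of a 100 x 100 grid); the names `n`, `station`, `d`, `cells` and `sites` are made up for illustration.

```r
n <- 100                          # grid is n x n cells
station <- c(row = 60, col = 70)  # assumed station position

# d[i, j] is the Euclidean distance from cell (i, j) to the station
d <- sqrt(outer((seq_len(n) - station[["row"]])^2,
                (seq_len(n) - station[["col"]])^2, "+"))

# Probability weights: 1 at the station, decreasing with distance,
# and strictly positive even for the most distant cell
prob <- 1 - d / (max(d) + 1)

set.seed(42)
# sample() treats the matrix as a vector of weights, one per cell
cells <- sample(length(prob), size = 200, prob = prob)

# Convert the sampled cell numbers back to (row, column) positions
sites <- arrayInd(cells, dim(prob))
colnames(sites) <- c("row", "col")
head(sites)
```

Because `prob` has the same shape as the grid, `image(prob)` gives a quick visual check of the weighting before any sampling is done.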
There may be a better way than this, but a quick way to do it is to randomly sample on both the x and y axes using a distribution (I used the normal, bell-shaped distribution, but you can really use any). The trick is to make the mean of the distribution the position of the research station. You can change the bias towards the research station by changing the standard deviation of the distribution.
Then use the randomly selected positions as your x and y coordinates to select the positions.
dat <- matrix(1:10000, nrow = 100)
#randomly selected a position for the research station
rs <- c(80,30)
# you can change the sd to change the bias
x <- round(rnorm(400,mean = rs[1], sd = 10))
y <- round(rnorm(400, mean = rs[2], sd = 10))
position <- rep(NA, 200)
j = 1
i = 1
# as some of the numbers sampled can be outside of the area you want, I oversampled
# and then only selected the first 200 that were in the area of interest.
while (j <= 200) {
if(x[i] > 0 & x[i] < 100 & y[i] > 0 & y [i]< 100){
position[j] <- dat[x[i],y[i]]
j = j +1
}
i = i +1
}
Plot the results:
plot(x,y, pch = 19)
points(x =80,y = 30, col = "red", pch = 19) # position of the station
I have data that contains information about sub-plots with different numbers and their corresponding species types (more than 3 species within each subplot). Every species have X & Y coordinates.
> df
subplot species X Y
1 1 Apiaceae 268675 4487472
2 1 Ceyperaceae 268672 4487470
3 1 Vitaceae 268669 4487469
4 2 Ceyperaceae 268665 4487466
5 2 Apiaceae 268662 4487453
6 2 Magnoliaceae 268664 4487453
7 3 Magnoliaceae 268664 4487453
8 3 Apiaceae 268664 4487456
9 3 Vitaceae 268664 4487458
With these data, I have created a ppp object for the points of each subplot within the window of the general (big) plot.
grp <- factor(df$subplot)
win <- ripras(df$X, df$Y)
p.patt <- ppp(df$X, df$Y, window = win, marks = grp)
Now I want to divide the plot into 3 x 3 equal sub-plots, because there are 9 subplots. The general plot is not rectangular; it looks similar to a rhombus when I plot it.
I could use the quadrats() function as below, but it divides my plot into unequal subplots. Some are quadrats, others are triangles etc., which I don't want. I want all the subplots to be equal-sized quadrats (divided by lines that are parallel to the sides). Can anyone guide me on this?
divide <-quadrats(p.patt,3,3)
plot(divide)
Thank you!
Could you break up the plot canvas into 3x3, then run each plot?
> par(mfrow=c(3,3))
> # run code for plot 1
> # run code for plot 2
...
> # run code for plot 9
To return back to one plot on the canvas type
> par(mfrow=c(1,1))
This is a question about the spatstat package.
You can use the function quantess to divide the window into tiles of equal area. If you want the tile boundaries to be vertical lines, and you want 7 tiles, use
B <- quantess(Window(p.patt), "x", 7)
where p.patt is your point pattern.
I am quite new to R and programming in general. So please forgive my ignorance, I am trying to learn.
I have two sets of data and I would like to plot them against each other. Both have 27 rows and 3 columns; one set is called "range" and the other is called "rangePx".
Column “Comp” has the different components, column “Min” is the minimum concentration in % and column “Max” is the maximum concentration in %.
I want to make a 2-y axis dumbbell plot, with the y axis being the different components and x axis being the concentration.
I can create a dumbbell plot with one y axis, but I have trouble adding the second y axis.
Here is a snap from the "range" data
head(range)
# A tibble: 6 x 3
Comp Min Max
<chr> <dbl> <dbl>
1 Methane 0.0100 100
2 Ethane 0.0100 65.0
3 Ethene 0.100 20.0
4 Propane 0.0100 40.0
5 Propene 0.100 6.00
6 Propadien 0.0500 2.00
and here is a snap from the "rangePx" data
head(rangePx)
# A tibble: 6 x 3
Comp Min Max
<chr> <dbl> <dbl>
1 Methane 50.0 100
2 Ethane 0.00800 14.0
3 Ethene 0 0
4 Propane 0.00800 8.00
5 Propene 0 0
6 Propadien 0 0
Here is the piece of code that I use:
library(ggplot2)
library(ggalt)
library(readxl)
theme_set(theme_classic())
range <- read_excel("range.xlsx")
rangePx <- read_excel("rangePx.xlsx")
p <- ggplot(range, aes(x=Max, xend=Min, y = Comp, group=Comp))
p <- p + geom_dumbbell(color="blue")
p
px <- ggplot(rangePx, aes(x=Max, xend=Min, y = Comp, group=Comp))
px <- px + geom_dumbbell(color="green")
p <- p + geom_dumbbell(aes(y=px, color="red"))
p
and here is the error I get when I call p:
Error: Aesthetics must be either length 1 or the same as the data (27): y, colour, x, xend, group
Above I show only the first 6 rows of each data frame, but my original data are 27x3.
Can anyone help me?
Thanks in advance
ggplot2 does not have the ability to plot 2 independent y-axes. This is an intentional decision by Hadley Wickham, who wrote the package. You can see his response to a similar question here, where he comments on his reasons for not including them:
Plot with 2 y axes, one y axis on the left, and another y axis on the right
As mentioned in the comments and in reply to the question, if you want to use ggplot2 you have to use faceting to compare. Otherwise you need to use a different plotting package.
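As a rough illustration of the faceting suggestion, the sketch below stacks the two data sets with a `source` column and draws one dumbbell panel per source. It uses only ggplot2 (`geom_segment` plus `geom_point`) rather than ggalt, and only the six rows shown in the question; the names `range_df`, `rangePx_df`, `both` and `p_facets` are assumptions made for this example.

```r
library(ggplot2)

comp <- c("Methane", "Ethane", "Ethene", "Propane", "Propene", "Propadien")

# First six rows of each data set, copied from the question
range_df <- data.frame(Comp = comp,
                       Min = c(0.01, 0.01, 0.1, 0.01, 0.1, 0.05),
                       Max = c(100, 65, 20, 40, 6, 2))
rangePx_df <- data.frame(Comp = comp,
                         Min = c(50, 0.008, 0, 0.008, 0, 0),
                         Max = c(100, 14, 0, 8, 0, 0))

# Stack both data sets and record where each row came from
range_df$source   <- "range"
rangePx_df$source <- "rangePx"
both <- rbind(range_df, rangePx_df)

# One segment from Min to Max per component, with a point at each end
p_facets <- ggplot(both, aes(y = Comp, colour = source)) +
  geom_segment(aes(x = Min, xend = Max, yend = Comp)) +
  geom_point(aes(x = Min)) +
  geom_point(aes(x = Max)) +
  facet_wrap(~ source) +
  labs(x = "Concentration (%)", y = "Component")

print(p_facets)
```

Dropping `facet_wrap()` puts both sets in a single panel, where overlapping ranges may be harder to tell apart.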