I have a map done in R with stat_density2d. This is the code:
ggplot(data, aes(x=Lon, y=Lat)) +
stat_density2d(aes(fill = ..level..), alpha=0.5, geom="polygon",show.legend=FALSE)+
geom_point(colour="red")+
geom_path(data=map.df,aes(x=long, y=lat, group=group), colour="grey50")+
scale_fill_gradientn(colours=rev(brewer.pal(7,"Spectral")))+
xlim(-10,+2.5) +
ylim(+47,+60) +
coord_fixed(1.7) +
theme_void()
And it produces this:
Great. It works. However, I do not know what the legend means. I did find this Wikipedia page:
https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
And the example they used (which contains red, orange and yellow) stated:
The coloured contours correspond to the smallest region which contains
the respective probability mass: red = 25%, orange + red = 50%, yellow
+ orange + red = 75%
However, using stat_density2d, I have 11 contours in my map. Does anyone know how stat_density2d works and what the legend means? Ideally I want to be able to state something like "the red contour contains 25% of the points", etc.
I have read this: https://ggplot2.tidyverse.org/reference/geom_density_2d.html and I am still none the wiser.
Let's take the faithful example from ggplot2:
ggplot(faithful, aes(x = eruptions, y = waiting)) +
stat_density_2d(aes(fill = factor(stat(level))), geom = "polygon") +
geom_point() +
xlim(0.5, 6) +
ylim(40, 110)
(apologies in advance for not making this prettier)
The level is the height at which the 3D "mountains" were sliced. I don't know of a way (others might) to translate that into a percentage, but I do know how to get you the percentages you're after.
If we look at that chart, level 0.002 contains the vast majority of the points (all but 2). Level 0.004 is actually 2 polygons and they contain all but about a dozen of the points. If I'm getting the gist of what you're asking, that's what you want to know, except as a percentage rather than a count of the points encompassed by the polygons at a given level. That's straightforward to compute using the methodology from the various ggplot2 "stats" involved.
Note that while we're importing the tidyverse and sp packages we'll use some other functions fully-qualified. Now, let's reshape the faithful data a bit:
library(tidyverse)
library(sp)
xdf <- select(faithful, x = eruptions, y = waiting)
(easier to type x and y)
Now, we'll compute the two-dimensional kernel density estimation the way ggplot2 does:
h <- c(MASS::bandwidth.nrd(xdf$x), MASS::bandwidth.nrd(xdf$y))
dens <- MASS::kde2d(
xdf$x, xdf$y, h = h, n = 100,
lims = c(0.5, 6, 40, 110)
)
zdf <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
breaks <- pretty(range(zdf$z), 10)
z <- tapply(zdf$z, zdf[c("x", "y")], identity) # mirrors ggplot2's internal reshaping; not used directly below
cl <- grDevices::contourLines(
x = sort(unique(dens$x)), y = sort(unique(dens$y)), z = dens$z,
levels = breaks
)
I won't clutter the answer with str() output but it's kinda fun looking at what happens there.
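As an aside (not part of the original answer), if you just want an approximate probability-mass reading for each level straight from the density grid, you can sum the estimated density over the grid cells at or above each contour height and multiply by the cell area:
# Rough sketch: approximate the probability mass enclosed by each contour level
# by integrating the KDE over the region where the density is >= that level.
cell_area <- diff(dens$x[1:2]) * diff(dens$y[1:2])
data.frame(
  level = breaks,
  approx_mass = sapply(breaks, function(lvl) sum(dens$z[dens$z >= lvl]) * cell_area)
)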
We can use spatial ops to figure out how many points fall within given polygons, then we can group the polygons at the same level to provide counts and percentages per-level:
SpatialPolygons(
lapply(1:length(cl), function(idx) {
Polygons(
srl = list(Polygon(
matrix(c(cl[[idx]]$x, cl[[idx]]$y), nrow=length(cl[[idx]]$x), byrow=FALSE)
)),
ID = idx
)
})
) -> cont
coordinates(xdf) <- ~x+y
tibble(
ct = sapply(over(cont, geometry(xdf), returnList = TRUE), length),
id = 1:length(ct),
lvl = sapply(cl, function(x) x$level)
) %>%
count(lvl, wt=ct) %>%
mutate(
pct = n/length(xdf),
pct_lab = sprintf("%s of the points fall within this level", scales::percent(pct))
)
## # A tibble: 12 x 4
## lvl n pct pct_lab
## <dbl> <int> <dbl> <chr>
## 1 0.002 270 0.993 99.3% of the points fall within this level
## 2 0.004 259 0.952 95.2% of the points fall within this level
## 3 0.006 249 0.915 91.5% of the points fall within this level
## 4 0.008 232 0.853 85.3% of the points fall within this level
## 5 0.01 206 0.757 75.7% of the points fall within this level
## 6 0.012 175 0.643 64.3% of the points fall within this level
## 7 0.014 145 0.533 53.3% of the points fall within this level
## 8 0.016 94 0.346 34.6% of the points fall within this level
## 9 0.018 81 0.298 29.8% of the points fall within this level
## 10 0.02 60 0.221 22.1% of the points fall within this level
## 11 0.022 43 0.158 15.8% of the points fall within this level
## 12 0.024 13 0.0478 4.8% of the points fall within this level
I only spelled it out like this to avoid blathering on even more, but note that the percentages will change depending on how you modify the various parameters of the density computation (the same holds true for my ggalt::geom_bkde2d(), which uses a different estimator).
If there is a way to tease out the percentages without re-performing the calculations there's no better way to have that pointed out than by letting other SO R folks show how much more clever they are than the person writing this answer (hopefully in more diplomatic ways than seem to be the mode of late).
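One last aside that is not part of the original answer: if what you ultimately want are contours labelled by probability mass, as in the Wikipedia figure the question links to, the ks package computes those highest-density-region contours directly. A rough sketch:
# Sketch: ks draws contours that enclose a given share of the probability mass
library(ks)
fit <- kde(x = faithful[, c("eruptions", "waiting")])
plot(fit, cont = c(25, 50, 75)) # contours enclosing ~25%, 50% and 75% of the mass
points(faithful$eruptions, faithful$waiting, pch = 16, cex = 0.4)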
I have a dataframe containing 2479 peptides with their sequence, p-value and logfold change.
# A tibble: 6 x 3
Sequence p log2fold
<chr> <dbl> <dbl>
1 FLENEDR 0.343 1.21
2 DTEEEDFHVDQATTVK 0.270 0.771
3 DTEEEDFHVDQATTVK 0.112 1.18
4 SCRASQSVSSSF 0.798 0.139
5 RLSCTTSGF 0.739 0.110
6 SCRASQSVSSSY 0.209 0.375
I'm trying to make a volcano plot while labelling the up and downregulated peptides. However, for some reason, ggplot only uses 6 labels. I have no idea why.
I have tried loads of different things. I tried using up- and downregulation in an expression column, and I tried increasing and decreasing my cut-off values to check whether that was the problem. I used ggrepel to try to spread the labels out more. Nothing seems to be working. My latest attempt is in the code below.
Basically, as a last resort, I made a new group and kept only the significant and fold-change peptides, resulting in 39 peptides. Then I used this as a header and matched peptides between the two data frames.
Another problem: since I started using geom_text_repel, an extra character appears in my legend. I have no idea how or why this is happening.
library(ggplot2)
library(ggrepel)
library(tidyverse)
Volc <- R_volcano
expression <- ifelse(Volc$p < 0.05 & abs(Volc$log2fold) >= 1, ifelse(Volc$log2fold > 1, 'up', 'down'), 'stable')
Volc <- cbind(Volc, expression)
colnames(Volc)[1] <- 'Sequencenames'
Volc["group"] <- "NotSignificant"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) < 1 ),"group"] <- "Significant"
Volc[which(Volc['p'] > 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "FoldChange"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "Significant&FoldChange"
VolcFilter <- Volc %>% filter(group=="Significant&FoldChange")
p <- ggplot(data = Volc, aes(x = log2fold, y = -log10(p), colour=expression, label='Sequencenames')) +
geom_point(alpha=0.4, size=2) +
scale_color_manual(values=c("blue", "grey","red"))+
xlim(c(-4.5, 4.5)) +
geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.8) +
geom_text_repel(data=head(VolcFilter), aes(label=Sequencenames))+
labs(x="log2(fold change)",
y="-log10 (p-value)",
title="Differential expression") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5),
legend.position="right",
legend.title = element_blank())
p
Any help is much appreciated. Fairly new to R.
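An aside that is not part of the original post: geom_text_repel() is given head(VolcFilter), and head() returns only the first six rows by default, which would explain why only six labels appear; show.legend = FALSE on the text layer also keeps the stray character out of the legend. A minimal sketch, reusing Volc and VolcFilter from the code above:
# Sketch: label all 39 filtered peptides instead of head()'s default six rows
ggplot(Volc, aes(x = log2fold, y = -log10(p), colour = expression)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_text_repel(data = VolcFilter, aes(label = Sequencenames),
                  max.overlaps = Inf,    # available in ggrepel >= 0.9.0
                  show.legend = FALSE)   # keep the text glyph out of the legend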
I have colored a graph with ggplot2 based on a threshold value of 1. Scores greater than 1 are colored azure and scores less than 1 are colored beige. Here is my sample code.
library(ggplot2)
setwd("F:/SUST_mutation/Graph_input")
d <- read.csv(file = "N.csv", sep = ",", header = TRUE)
ggplot(d, aes(x= Position,y= wild_Score)) + xlab("Positions") + ylab("Scores") +
geom_ribbon(aes(ymin=pmin(wild_Score,1), ymax=1), fill="beige", alpha= 1.5) +
geom_ribbon(aes(ymin=1, ymax=pmax(wild_Score,1)), fill="azure", alpha= 1.5)
My problem is that where the curve passes from the upper surface to the lower surface, I expect the two surfaces to meet in a single line.
But if you look at the figure, you will see that they do not. Around the threshold line, the lower surface does not meet the upper surface; instead it creates some extra surface. For convenience, I have marked those portions with a red circle.
extra surface on the negative portion close to threshold:
Position Wild_Score
4 1.048
5 1.052
6 1.016
7 0.996
8 0.97
9 0.951
10 0.971
11 1.047
12 1.036
13 1.051
14 1.124
15 1.172
16 1.172
17 1.164
18 1.145
19 1.186
20 1.197
21 1.197
22 1.216
23 1.193
24 1.216
25 1.216
26 1.262
Problem-2:
I have a data frame like following.
Position Score_1 Score_2
4 1.048 1.048
5 1.052 1.052
6 1.016 1.016
7 0.996 1.433
8 0.97 1.432
9 0.951 1.567
10 0.971 1.231
11 1.047 1.055
12 1.036 1.036
13 1.051 1.051
14 1.124 1.124
15 1.172 1.172
16 1.172 1.172
17 1.164 1.164
I plot the surface for Position vs Score_1 (via a tibble) and a line graph on that surface for the same positions vs Score_2, like the following:
desired graph
As the line only differs at some points, I subsetted the main dataset (both columns and rows).
I get the following error:
"Error: Aesthetics must be either length 1 or the same as the data (13): x". I guess this is because I used two different data frames for the graphs.
Here is my code:
d <- read.csv(file = "E.csv", sep = ",", header = TRUE)
d1 <- tibble::tibble(
x = seq(min(d$Position), max(d$Position), length.out = 1000),
y = approx(d$Position, d$Score_1, xout = x)$y
)
ggplot(d1, aes(x= x,y= y)) + xlab("Positions") + ylab("Scores") +
geom_ribbon(aes(ymin=pmin(y,1), ymax=1), fill="red", alpha= 1.5) +
geom_ribbon(aes(ymin=1, ymax=pmax(y,1)), fill="blue", alpha= 1.5) +
geom_line(aes(y=1)) + geom_line(d = d[c(3:10), c(1,3)],aes(y =
Score_2), color = "blue", size = 1)
I want to know what is causing the problem and how should I deal with it?
It's because the negative surface at, for example, rows 3 and 4 starts from 1 and goes to 0.996, instead of going from 1.016 to 0.996. There is relevant discussion and there are other examples at ggplot2's issue tracker.
This problem is typically only visible when the number of observations is small-ish, so the usual way people overcome it is to interpolate the data. You can find an example of that below (I've omitted your colours because they were hard to see):
library(ggplot2)
# txt <- "your_example_table" # Omitted for brevity
df <- read.table(text = txt, sep = "\t", header = TRUE)
data2 <- tibble::tibble(
x = seq(min(df$Position), max(df$Position), length.out = 1000),
y = approx(df$Position, df$Wild_Score, xout = x)$y
)
ggplot(data2, aes(x= x,y= y)) + xlab("Positions") + ylab("Scores") +
geom_ribbon(aes(ymin=pmin(y,1), ymax=1, fill = "A")) +
geom_ribbon(aes(ymin=1, ymax=pmax(y,1), fill = "B"))
This is great for hiding the problem, but calculating the exact line intersection points is a bit of a pain. I apologise for the self-promotion, but I ran into this too and wrapped my solution for finding these intersection points in a function in the dev version of my package ggh4x, which you might find useful.
library(ggh4x) # devtools::install_github("teunbrand/ggh4x")
ggplot(df, aes(x= Position,y= Wild_Score)) +
stat_difference(aes(ymin = 1, ymax = Wild_Score))
Created on 2021-08-15 by the reprex package (v1.0.0)
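If you would rather not pull in another package, here is a rough base-R sketch (my own addition, assuming the df object from above with columns Position and Wild_Score) of doing the intersection points by hand: wherever consecutive observations straddle the threshold, linearly interpolate the Position at which the curve crosses 1 and add that point to the data before drawing the ribbons.
thr <- 1
x <- df$Position; y <- df$Wild_Score
i <- which(diff(sign(y - thr)) != 0)  # segments that cross the threshold
x_cross <- x[i] + (thr - y[i]) * (x[i + 1] - x[i]) / (y[i + 1] - y[i])
df2 <- rbind(df, data.frame(Position = x_cross, Wild_Score = thr))
df2 <- df2[order(df2$Position), ]
ggplot(df2, aes(Position, Wild_Score)) +
  geom_ribbon(aes(ymin = pmin(Wild_Score, 1), ymax = 1), fill = "beige") +
  geom_ribbon(aes(ymin = 1, ymax = pmax(Wild_Score, 1)), fill = "azure")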
I have a data frame that looks like this:
Teff logg M_div_H U B V R I J H K L Lprime M
1 2000 4.0 -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345 4.765 5.680
2 2000 4.5 -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352 4.772 5.687
3 2000 5.0 -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361 4.779 5.685
4 2000 5.5 -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355 4.753 5.638
5 2200 3.5 -0.1 -11.866 -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929 4.289 4.842
6 2200 4.5 -0.1 -11.845 -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947 4.305 4.939
...
Let's say I have two values:
input_Teff = 4.8529282904170595E+003
input_log_g = 1.9241934741026787E+000
Notice how every V value has a unique Teff, logg combination. From the input values, I would like to interpolate a value for V. Is there a way to do this in R?
Edit 1: Here is the link to the full data frame: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Building on Ian Campbell's observation that you can consider your data as points on a two-dimensional plane, you can use spatial interpolation methods. The simplest approach is inverse-distance weighting, which you can implement like this
library(data.table)
d <- fread("https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=1")
setnames(d,"#Teff","Teff")
First rescale the data as appropriate (not shown here, see Ian's answer)
library(gstat)
# fit model
idw <- gstat(id="V", formula = V~1, locations = ~Teff+logg, data=d, nmax=7, set=list(idp = .5))
# new "points" to predict to
newd <- data.frame(Teff=c(4100, 4852.928), logg=c(1.5, 1.9241934741026787))
p <- predict(idw, newd)
#[inverse distance weighted interpolation]
p$V.pred
#[1] -0.9818571 -0.3602857
For higher dimensions you could use fields::Tps (I think you can force that to be an exact method, that is, exactly honor the observations, by making each observation a node)
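That might look roughly like the following sketch (my own addition, not part of the original answer, reusing d and newd from above; lambda = 0 asks for an exact interpolator, and as in Ian's answer you may want to rescale Teff and logg first):
# Sketch: thin-plate spline interpolation with fields::Tps
library(fields)
fit <- Tps(as.matrix(d[, c("Teff", "logg")]), d$V, lambda = 0)
predict(fit, as.matrix(newd))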
We can imagine that Teff and logg exist in a 2-dimensional plane. We can see that your input point exists in that same space:
library(tidyverse)
ggplot(data,aes(x = Teff, y = logg)) +
geom_point() +
geom_point(data = data.frame(Teff = 4.8529282904170595e3, logg = 1.9241934741026787),
color = "orange")
However, we can see that the scales of Teff and logg are not the same. Simply taking log(Teff) gets us pretty close, but not quite. So we can rescale between 0 and 1 instead. We can create a custom rescale function. It will become clear why we can't use scales::rescale in a moment.
rescale = function(x,y){(x - min(y))/(max(y)-min(y))}
We can now rescale the data:
data %>%
mutate(Teff.scale = rescale(Teff,Teff),
logg.scale = rescale(logg,logg)) -> data
From here, we might use raster::pointDistance to calculate the distance from the input point to all of the scaled values:
raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),
data[,c("Teff.scale","logg.scale")],
lonlat = FALSE)
We can use which.min to find the row with the minimum distance:
data[which.min(raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),
data[,c("Teff.scale","logg.scale")],
lonlat = FALSE)),]
Teff logg M_div_H U B V R I J H K L Lprime M Teff.scale logg.scale
1: 4750 2 -0.1 -2.447 -1.438 -0.355 0.159 0.589 1.384 1.976 1.881 2.079 2.083 2.489 0.05729167 0.4631902
Here we can visualize the result:
ggplot(data,aes(x = Teff.scale, y = logg.scale)) +
geom_point() +
geom_point(data = data[which.min(raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),data[,c("Teff.scale","logg.scale")], FALSE)),],
color = "blue") +
geom_point(data = data.frame(Teff.scale = rescale(input_Teff,data$Teff),logg.scale = rescale(input_log_g,data$logg)),
color = "orange")
And access the appropriate value for V:
data[which.min(raster::pointDistance(cbind(rescale(input_Teff,data$Teff),rescale(input_log_g,data$logg)),data[,c("Teff.scale","logg.scale")], FALSE)),"V"]
V
1: -0.355
Data:
library(data.table)
data <- fread("https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=1")
setnames(data,"#Teff","Teff")
input_Teff = 4.8529282904170595E+003
input_log_g = 1.9241934741026787E+000
I have a grid of 10x10m coordinates that I extracted from a raster. I have a set of 'starting points'. For each starting point, I want to find the location (coordinates) of cells within a 10-50m radius around it.
I am aware of functions to do this with a raster starting point, but additional analyses that I have not included here require that I perform the search from a grid of coordinates in the format shown below.
The code below achieves my aim; however, the outer function produces vectors that are far too large (> 10 Gb) on my actual dataset (which is a grid of 9 million 10x10m cells, with 3000 starting points).
I am looking for alternatives that achieve the same result as the following (simplified) code, but do not require large vector storage or looping over each starting point separately.
library(raster)
library(tidyverse)
#Set up the mock raster
orig=raster(nrows=100, ncols=100)
res(orig)=10
vals <- rep(c(1, 2, 3, 1, 2, 3, 1, 3, 2), times = c(72, 72, 72, 72, 72, 72, 72, 72, 72))
values(orig) <- vals  # equivalent to orig <- setValues(orig, vals)
xygrid <- as.data.frame(orig, xy = TRUE) %>% .[,1:2]
head(xygrid)
x y
1 -175 85
2 -165 85
3 -155 85
4 -145 85
5 -135 85
6 -125 85
#the initial starting points
init_locs <- c(5, 10, 15, 20)
#calculate the distance to every surrounding cell from starting point
Rx <- outer(xygrid[init_locs, 1], xygrid[, 1], "-")
Ry <- outer(xygrid[init_locs, 2], xygrid[, 2], "-")
R <- sqrt(Rx^2+Ry^2) #overall distance
for (i in 1:length(R[,1])) {
expr2 <- (R[i,] > 10 & R[i,] <= 50) #extract the location of cells within 10-50m
inv <- xygrid[expr2,] #extract the coordinates of these cells
}
head(inv)
x y
15 -35 85
16 -25 85
17 -15 85
18 -5 85
22 35 85
23 45 85
(Raster and spatial data are not my specialty, but this made me think of a naive approach that might work acceptably. I don't know anything about the methods @Robert Hijmans mentioned; those are likely much more performant. I just thought this sounded like an interesting question to explore with basic methods.)
Approach
The main challenge here is that you have 9 million cells, but only around 80 of those will be within 50 m of any given point. If you calculate all those cells' distances to 3,000 starting points and then filter for those under 50 m, that's 9M x 3k = 27 billion calculations, and a gigantic data structure, almost all of which is unnecessary.
We can quickly get ~1,000x more efficient by splitting this into two problems: first, what general region of potentially-within-50-m points should we look at, and second, what is the actual distance to the points in those regions?
We can precalculate a modestly sized (<2 MB) hash table for step 1. Then, by joining it to our locations (a very fast operation), we can focus our calculations on the 1/1000th of points that have a chance of being within 50 m. I arbitrarily split the original cells into 100 x 100 = 10k sectors, each sector holding 30x30 cells.
1. Creating hash table
For the hash table, I'll assign each point to a sector, somewhat arbitrarily as 30x30 cells, so we have 100x100 = 10k sectors. This could be tuned based on speed vs. memory tradeoffs.
max_dist = 30 # sector width, in cells
xygrid2 <- expand_grid(
x = seq(0, 2999, by = 1), # 3000x3000 location grid
y = seq(0, 2999, by = 1))
xygrid2$sector_x = xygrid2$x %/% max_dist # 100 x 100 sectors
xygrid2$sector_y = xygrid2$y %/% max_dist
y_range = max(xygrid2$sector_y) + 1
xygrid2$sector_num = xygrid2$sector_x*y_range + xygrid2$sector_y
We now have 10,000 sectors assigned. Now, which sectors are adjacent to which others? In every case, the adjacent sectors follow the same pattern. In this case, I have 100 sectors across x, so the sectors adjacent to sector S will have sector numbers that differ from S by -101, -100, -99, -1, 0, 1, 99, 100, 101. We can use this pattern to assign all the adjacencies instantaneously. For simplicity, I leave in sectors outside our range; they will be ignored later anyway.
sector_num_deltas <- rep(-1:1, times = 3) + rep(-1:1, each = 3) * y_range
distinct(xygrid2, sector_num) %>%
uncount(9) %>% # copy each row 9 times, one for each adjacency
mutate(sector_num_adj = sector_num + sector_num_deltas) -> adjacencies
2. Join and calculate
Now that we have that, the rest goes much faster, since we can do the calculations only on the 1/1000th of sectors that are nearby. With that, we can now identify the 240,000 points that are within 50m of the 3,000 starting positions in under 4 seconds:
# Here are 3,000 random starting locations
set.seed(42)
sample_starts <- xygrid2 %>%
slice_sample(n = 3000) %>%
mutate(sample_num = row_number())
# Join each location to all the adjacent sectors, and then add all the
# locations within those sectors, and then calculate distances.
sample_starts %>% # 3,000 starting points...
# join each position to the nine adjacent sectors = ~27,000 rows
left_join(adjacencies, by = "sector_num") %>%
# join each sector to the (30x30 = 900) cells in those sectors --> 24 million rows
# That's a lot, but it's only 1/1000th of the starting problem with
# 3k x 9M = 27 billion comparisons!
left_join(xygrid2, by = c("sector_num_adj" = "sector_num")) %>%
select(-contains("sector")) %>%
mutate(dist = sqrt((x.x-x.y)^2 + (y.x-y.y)^2)) %>%
filter(dist <= 5) -> result
The result tells us that our 3,000 sample starting points are within 5 cells (i.e. 50 m) of 242,575 cells, about 80 for each starting point.
result
# A tibble: 242,575 x 6
x.x y.x sample_num x.y y.y dist
<dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1069 140 1 1064 140 5
2 1069 140 1 1065 137 5
3 1069 140 1 1065 138 4.47
4 1069 140 1 1065 139 4.12
5 1069 140 1 1065 140 4
6 1069 140 1 1065 141 4.12
7 1069 140 1 1065 142 4.47
8 1069 140 1 1065 143 5
9 1069 140 1 1066 136 5
10 1069 140 1 1066 137 4.24
# … with 242,565 more rows
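A small aside, not from the original answer: the question asked for cells in a 10-50 m ring around each start, i.e. excluding the immediate neighbours; in these cell units (1 cell = 10 m) that is just an extra lower bound on dist:
# Sketch: keep only cells between 10 m and 50 m (1 to 5 cells) from each start
result_ring <- result %>% filter(dist > 1, dist <= 5)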
Here's a sample to see how that's working in a small corner of our data:
# `a` is presumably the joined data from above, saved before the sector columns were dropped
ggplot(a %>% mutate(sample_grp = sector_num_adj %% 8 %>% as.factor),
aes(x.y, y.y, color = sample_grp)) +
geom_point(data = adjacencies %>% filter(sector_num_adj == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray80", shape = 21,
aes(x, y)) +
geom_point(data = adjacencies %>% filter(sector_num == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray70", shape = 21,
aes(x, y)) +
annotate("text", alpha = 0.5,
x = c(1725, 1750),
y = c(1960, 1940),
label = c("Lookup area", "sector of\nstarting location")) +
geom_point(size = 1) +
scale_color_discrete(guide = FALSE) +
coord_equal() -> my_plot
library(gganimate)
animate(
my_plot +
gganimate::view_zoom_manual(pan_zoom = -1, ease = "quadratic-in-out",
xmin = c(0, 1700),
xmax = c(3000, 1800),
ymin = c(0, 1880),
ymax = c(3000, 1980)),
duration = 3, fps = 20, width = 300)
Example data: you were using a lon/lat example, but based on your code, I am assuming that you are using planar data.
library(raster)
r <- raster(nrows=100, ncols=100, xmn=0, xmx=100, ymn=0, ymx=100, crs="+proj=utm +zone=1 +datum=WGS84")
values(r) <- 1:ncell(r) # for display only
xygrid <- as.data.frame(r, xy = TRUE)[,1:2]
locs <- c(8025, 1550, 5075)
dn <- 2.5 # min dist
dx <- 5.5 # max dist
The simplest approach would be to use pointDistance
p <- xyFromCell(r, locs)
d <- pointDistance(xygrid, p, lonlat=FALSE)
u <- unique(which(d>dn & d<dx) %% nrow(d))
pts <- xygrid[u,]
plot(r)
points(pts)
But you will probably run out of memory with that, and it is inefficient to compute all distances. Instead, you may intersect the points with a buffer around the points of interest:
b1 <- buffer(SpatialPoints(p, proj4string=crs(r)), dx)
b2 <- buffer(SpatialPoints(p, proj4string=crs(r)), dn)
b <- erase(b1, b2)
x <- intersect(SpatialPoints(xygrid, proj4string=crs(r)), b)
plot(r)
points(x, cex=.5)
points(xyFromCell(r, locs), col="red", pch="x")
With terra it goes like this, and it works well for large datasets in version 1.1-11, which should be on CRAN this week:
library(terra)
rr <- rast(r)
pp <- xyFromCell(rr, locs)
bb1 <- buffer(vect(pp), dx)
bb2 <- buffer(vect(pp), dn)
bb <- erase(bb1, bb2)
xx <- intersect(vect(as.matrix(xygrid)), bb)
You can do similar things with sf.
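For what it's worth, here is a rough sf sketch of the same ring selection (my own addition, assuming the xygrid, p, dn and dx objects from above and planar coordinates with no CRS set):
library(sf)
pts_all <- st_as_sf(xygrid, coords = c("x", "y"))
pts_loc <- st_as_sf(as.data.frame(p), coords = c("x", "y"))
ring <- st_difference(st_union(st_buffer(pts_loc, dx)),
                      st_union(st_buffer(pts_loc, dn)))
sel <- st_filter(pts_all, ring)  # grid points within the dn-dx ring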
Given that you have so many data points, you might want to start with removing all points that are clearly not of interest
xySel <- lapply(locs, function(i) {
xy <- xygrid[i,]
s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx & xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
xygrid[s,]
})
xySel = do.call(rbind, xySel)
dim(xySel)
# [1] 363 2
dim(xygrid)
#[1] 10000 2
And now you could run pointDistance as above on all data (or else inside the lapply function)
You say that you need to use points, and not a raster. I have seen that idea many times, and 9 out of 10 times it is wrong. Maybe it is true in your case. For others who stumble upon this question, here are two raster-based approaches.
With the raster package you could use extract(..., cellnumbers=TRUE) or adjacent. With adjacent, you would first make a weights matrix using one of the buffers made above:
buf <- disaggregate(b)[2,]
rb <- crop(r, buf)
w <- as.matrix(rasterize(buf, rb, background=NA) )
w[6,6]=0
And then use the weight matrix like this
a <- adjacent(r, locs, w, pairs=FALSE)
pts <- xyFromCell(r, a)
plot(r)
points(pts)
With terra you could use the cells method
d <- cells(rr, bb)
xy <- xyFromCell(rr, d[,2])
plot(rr)
points(xy, cex=.5)
lines(bb, col="red", lwd=2)
I am trying to figure out the proportion of an area that has a slope within +/- 5 degrees of 0. Another way of saying it: anything steeper than +/- 5 degrees is bad. I am trying to find the actual number, and a graphic.
To achieve this I turned to R and using the Raster package.
Let's use a generic country, in this case, the Philippines
list.of.packages <- c("sp", "raster", "rasterVis", "maptools", "rgeos")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
library(sp) # classes for spatial data
library(raster) # grids, rasters
library(rasterVis) # raster visualisation
library(maptools)
library(rgeos)
Now let's get the altitude information and plot the slopes.
elevation <- getData("alt", country = "PHL")
x <- terrain(elevation, opt = c("slope", "aspect"), unit = "degrees")
plot(x$slope)
Not very helpful due to the scale, so let's simply look at the Island of Palawan
e <- drawExtent(show=TRUE) #to crop out Palawan (it's the long skinny island that is roughly midway on the left and is oriented between 2 and 8 O'clock)
gewataSub <- crop(x,e)
plot(gewataSub, 1)## Now visualize the new cropped object
A little bit better to visualize. I get a sense of the magnitude of the slopes and that with a 5 degree restriction, I am mostly confined to the coast. But I need a little bit more for analysis.
I would like the results to be in two parts:
1. "35% (made up) of the selected area has a slope exceeding +/- 5 degrees" or "65% of the selected area is within +/- 5 degrees" (with the code to get it).
2. A picture where everything within +/- 5 degrees is one color (call it good, or green), and everything else is another color (call it bad, or red).
Thanks
There are no negative slopes, so I assume you want those that are less than 5 degrees
library(raster)
elevation <- getData('alt', country='CHE')
x <- terrain(elevation, opt='slope', unit='degrees')
z <- x <= 5
Now you can count cells with freq
f <- freq(z)
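As an aside that is not in the original answer, if all you need is the headline percentage and your cells are equal-area, you can get it straight from those counts (the lon/lat correction follows below):
# Sketch: z holds 1 where slope <= 5 degrees and 0 elsewhere, so its mean
# (NAs ignored) is the fraction of the area within +/- 5 degrees
cellStats(z, mean)
# or, from the freq() table, excluding NA cells:
fz <- freq(z, useNA = "no")
fz[fz[, "value"] == 1, "count"] / sum(fz[, "count"])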
If you have a planar coordinate reference system (that is, with units in meters or similar) you can do
f <- cbind(f, area=f[,2] * prod(res(z)))
to get areas. But for lon/lat data, you would need to correct for different sized cells and do
a <- area(z)
zonal(a, z, fun=sum)
And there are different ways to plot, but the most basic one
plot(z)
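And if you want the good/green versus bad/red picture from the question, a minimal sketch (the colour choices are mine, not from the original answer):
# 1 = slope <= 5 degrees ("good"), 0 = steeper ("bad")
plot(z, col = c("red", "green"), legend = FALSE)
legend("topright", fill = c("green", "red"),
       legend = c("within +/- 5 degrees", "steeper than 5 degrees"))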
You can use reclassify from the raster package to achieve that. The function assigns each cell value that lies within a defined interval a certain value. For example, you can assign cell values within interval (0,5] to value 0 and cell values within the interval (5, maxSlope] to value 1.
library(raster)
library(rasterVis)
elevation <- getData("alt", country = "PHL")
x <- terrain(elevation, opt = c("slope", "aspect"), unit = "degrees")
plot(x$slope)
e <- drawExtent(show = TRUE)
gewataSub <- crop(x, e)
plot(gewataSub$slope, 1)
m <- c(0, 5, 0, 5, maxValue(gewataSub$slope), 1)
rclmat <- matrix(m, ncol = 3, byrow = TRUE)
rc <- reclassify(gewataSub$slope, rclmat)
levelplot(
rc,
margin = F,
col.regions = c("wheat", "gray"),
colorkey = list(at = c(0, 1, 2), labels = list(at = c(0.5, 1.5), labels = c("<= 5", "> 5")))
)
After the reclassification you can calculate the percentages:
length(rc[rc == 0]) / (length(rc[rc == 0]) + length(rc[rc == 1])) # <= 5 degrees
[1] 0.6628788
length(rc[rc == 1]) / (length(rc[rc == 0]) + length(rc[rc == 1])) # > 5 degrees
[1] 0.3371212