I want to perform what I would call a union on 2 sf objects.
I have the following code:
library(sf)
east.west.sf <- st_sfc(st_polygon(list(cbind(c(1,2,2,1,1), c(0,0,2,2,0)))),
                       st_polygon(list(cbind(c(0,1,1,0,0), c(0,0,2,2,0)))))
east.west.df <- data.frame(var1 = c("east", "west"), var2 = c(1, 2))
east.west <- st_sf(east.west.df, geom = east.west.sf)
north.south.sf <- st_sfc(st_polygon(list(cbind(c(0,2,2,0,0), c(1,1,2,2,1)))),
                         st_polygon(list(cbind(c(0,2,2,0,0), c(0,0,1,1,0)))))
north.south.df <- data.frame(var3 = c("north", "south"), var4 = c(FALSE, TRUE))
north.south <- st_sf(north.south.df, geom = north.south.sf)
map.overlay <- st_union(east.west, north.south)
st_area(map.overlay)
I was expecting four areas of 1 square unit each. Why is each of them 3 square units?
An analogous union() operation with sp objects gives me what I expect.
How can I get four features of 1 unit each with my two initial sf objects?
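For reference, the result I'm after looks like an intersection rather than a union; a minimal sketch, if I understand sf's overlay semantics correctly:
# st_union(x, y) unions every pair of features (a 1x2 rectangle with a 2x1
# rectangle covers 3 square units), whereas st_intersection() returns the
# pairwise overlaps, combining the attributes of both inputs
map.overlay <- st_intersection(east.west, north.south)
st_area(map.overlay) # expected: four features of area 1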
I am calculating the dissimilarity index of several groups, compared to the total population, with the seg function from the identically named package.
The data consists of about 450 rows, each a different district, and around 20 columns (groups that may be segregated). The values are the number of people from respective group living in respective district. Here are the first few rows of my csv file:
Region,Germany,EU15 without Germany,Poland,Former Yugoslavia and successor countries,Former Soviet Union and successor countries,Turkey,Arabic states,West Afrika,Central Afrika,East Afrika,North America,Central America and the Carribean,South America,East and Central Asia,South and Southeast Asia - excluding Vietnam,Australia and Oceania,EU,Vietnam,Non EU Europe,Total Population
1011101,1370,372,108,35,345,91,256,18,6,3,73,36,68,272,98,3,1979,19,437,3445
1011102,117,21,6,0,0,0,6,0,0,0,7,0,6,0,7,0,156,0,3,188
1011103,2180,482,181,102,385,326,358,48,12,12,73,24,75,175,129,12,3152,34,795,5159
Since the seg function only works with two columns as input, my current code to create a table with the index for all groups looks like this:
DI_table <- as.data.frame(0)
DI_table[1,1] <- print (seg(data =dfplrcountrygroups2019[, c( "Germany", "Total.Population")]))
DI_table[1,2] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[3], "Total.Population")]))
DI_table[1,3] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[4], "Total.Population")]))
DI_table[1,4] <- print (seg(data =dfplrcountrygroups2019[, c( colnames(dfplrcountrygroups2019)[5], "Total.Population")]))
# and so on...
colnames(DI_table)<- (colnames(dfplrcountrygroups2019[2:20]))
This works well, but it's a hassle to re-code every time I change something in my data, and I would like to use this method for other datasets too.
I thought I might try something like the loop below, but the column selection fails:
for (i in colnames(dfplrcountrygroups2019)) {
di_matrix [i] <- seg(data =dfplrcountrygroups2019[, c( "i", "Total.Population")])
}
Error in [.data.frame(dfplrcountrygroups2019, , c("i",
"Total.Population")) : undefined columns selected
I also thought of the apply function, but I'm not sure how to make it repeat while just changing the column that "Germany" occupies in the example above. How do I change which columns are selected on each repetition of the seg function?
my_function <- seg(data =dfplrcountrygroups2019[, c("Germany", "Total.Population")])
apply(X = dfplrcountrygroups2019,
FUN = my_function,
MARGIN = 2
)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'my_function' of mode 'function' was not found
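Before the main point, a quick note on why both attempts fail: in the loop, c("i", "Total.Population") selects a column literally named "i" (use the unquoted loop variable instead), and apply() expects FUN to be a function, while my_function above is already the result of a seg() call. A minimal mechanical fix of the loop could look like this (a sketch mirroring the question's code; the mutual-exclusivity issue discussed below still applies):
di_matrix <- list()
for (i in colnames(dfplrcountrygroups2019)[2:20]) {
  # i is unquoted here: the loop variable, not a column named "i"
  di_matrix[[i]] <- seg(data = dfplrcountrygroups2019[, c(i, "Total.Population")])
}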
The seg package's functions such as dissim (seg::seg is being deprecated in its favor) have a specific expected data format. From the docs:
data - a numeric matrix or data frame with two columns that represent mutually exclusive population groups (e.g., Asians and non-Asians). If more than two columns are given, only the first two will be used for computing the index.
To get a data frame of the d values seg::dissim returns, where each column is a region's dissimilarity index, you can iterate over the columns, making a temporary data frame and calculating the index. Because the data you're starting with isn't made up of mutually-exclusive categories, you'll have to subtract each population from the total population column to get a not-X counterpart for each group X.
A base R option with sapply will return a named list, which you can then convert into a data frame.
di_table <- sapply(names(dat)[2:20], function(col) {
  tmp_df <- dat[col]
  tmp_df$other <- dat$Total.Population - dat[[col]]
  seg::dissim(data = tmp_df)$d
}, simplify = FALSE)
as.data.frame(di_table)
#> Germany EU15.without.Germany Poland
#> 1 0.03127565 0.03989693 0.02770549
#> Former.Yugoslavia.and.successor.countries
#> 1 0.160239
#> Former.Soviet.Union.and.successor.countries Turkey Arabic.states West.Afrika
#> 1 0.08808277 0.2047 0.02266828 0.1415519
#> Central.Afrika East.Afrika North.America Central.America.and.the.Carribean
#> 1 0.08004711 0.213581 0.1116014 0.2095969
#> South.America East.and.Central.Asia
#> 1 0.08486598 0.2282734
#> South.and.Southeast.Asia...excluding.Vietnam Australia.and.Oceania EU
#> 1 0.0364721 0.213581 0.04394527
#> Vietnam Non.EU.Europe
#> 1 0.05505789 0.06624686
A couple of tidyverse options: you can use purrr functions to do the same as above in one step.
dat[2:20] %>%
  purrr::map(~ data.frame(value = ., other = dat$Total.Population - .)) %>%
  purrr::map_dfc(~ seg::dissim(data = .)$d)
# same output
Or with reshaping the data and splitting by group. This takes more steps, but might fit a larger workflow better.
library(dplyr)
dat %>%
  tidyr::pivot_longer(c(-Region, -Total.Population)) %>%
  mutate(other = Total.Population - value) %>%
  split(.$name) %>%
  purrr::map_dfc(~ seg::dissim(data = .[c("value", "other")])$d)
# same output
I have a shapefile. I want to get the names of the neighboring counties for all regions, according to the latitude and longitude in the file. Some regions obviously have neighboring counties, but I didn't get any neighbors when I ran the code. I don't know what went wrong.
library(tidyverse)
library(plyr)
library(sf)
library(readxl)
county <- st_read('D:/county.shp', stringsAsFactors = FALSE)
neighbor_counties <- function(subcounty) {
  name <- st_touches(subcounty, county)
  county[unlist(name), ]$NAME
}
output <- vector("list", nrow(county))
names(output) <- county$NAME
for (i in seq_len(nrow(county))) {
  output[[i]] <- suppressWarnings(neighbor_counties(county[i, ]))
}
output
head(output)
neighbor <- output %>%
  ldply(data.frame) %>%
  set_names("ori_county", "neighbor_county")
Your example is not exactly reproducible, but we are lucky to have the nc.shp shapefile that ships with {sf} available.
So consider this code; it is built on the sf::st_touches() function, with the county shapefile passed as an argument twice (once for the touching counties, and once for the counties being touched). Setting sparse = TRUE makes it return a list of indices of neighboring counties.
To find names of neighbors of a particular county you need to know the index of the county of interest, and then subset the list of neighbors accordingly. You will get indices of the neighboring counties.
As for the second part of your question (expressed in comments), i.e. how to get from a list of indices to a data frame of neighbors: I suggest creating a function returning a data frame, and then applying it via purrr::map_dfr() over the vector of indices as starting points; consider the code provided and amend as necessary. It should give you a start...
library(sf)
shape <- st_read(system.file("shape/nc.shp", package="sf")) # included with sf package
# a list of neighbors
neighbors <- st_touches(shape, # first
                        shape, # second
                        sparse = TRUE)
# neighbors of County Mecklenburg (as in Charlotte of Mecklenburg-Strelitz)
# index of Mecklenburg cnty
idx_strelitz <- which(shape$NAME == 'Mecklenburg')
# index of neighbors of Mecklenburg cnty
nbr_mecklenburg <- neighbors[idx_strelitz][[1]]
# names of neighbours of cnty Meckl.
shape$NAME[nbr_mecklenburg]
# [1] "Iredell" "Lincoln" "Cabarrus" "Gaston" "Union"
# a visual check
plot(st_geometry(shape))
plot(shape[idx_strelitz, ], col = "blue", add = T)
plot(shape[nbr_mecklenburg,], col = "red", add = T)
# second question: get pairs of names as a data frame
# a function returning data frame of neighbors of a given cnty
nbr_pairs <- function(idx) {
  data.frame(ori_county = rep(shape$NAME[idx], length(neighbors[[idx]])),
             neighbor_county = shape$NAME[neighbors[[idx]]])
}
# check - cnty Mecklemburg
nbr_pairs(idx_strelitz)
# ori_county neighbor_county
# 1 Mecklenburg Iredell
# 2 Mecklenburg Lincoln
# 3 Mecklenburg Cabarrus
# 4 Mecklenburg Gaston
# 5 Mecklenburg Union
# apply to list of indices
pairs_of_names <- purrr::map_dfr(seq_along(neighbors),
                                 nbr_pairs)
I have 2 sets of pairwise alignments, where query genome 1 (q1) is aligned to the reference genome and query genome 2 (q2) is aligned to the same reference genome. Therefore, I have both alignments with a coordinate system in the reference genome. The alignments are in the form of GRanges objects.
I would like to project the breakpoints of q2 onto q1, by aligning the breakpoints of q1 in the center, and look for any clustering of q2 breakpoints around the q1 breakpoints, all in the reference genome coordinate system.
Therefore, I make a GRanges object of q1 with its breakpoints in the center. For example, if there is a breakpoint in q1 relative to the reference genome at scaffold 1, bp 833, then, taking a window of 500 bp either side of this, the q1 GRanges object will have this element:
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] S1 333-1333 *
-------
seqinfo: 576 sequences from an unspecified genome; no seqlengths
I then construct a GRanges object of the breakpoints on q2, but with all ranges of width 1. I intersect this with the q1 GRanges object, so that q2 only retains points that can be projected onto q1.
The CoverageHeatmap function requires:
windows:
A set of GRanges of equal length
track:
A GRanges or RleList object specifying coverage
When I call the CoverageHeatmap function, I always get this error and warning message:
Error: subscript contains out-of-bounds ranges
In addition: Warning message:
In e1 == Rle(e2) :
longer object length is not a multiple of shorter object length
Called from: S4Vectors:::.subscript_error("subscript contains out-of-bounds ",
"ranges")
I've tried a number of things to make this work and still get the same error and warning message. This is my code (including attempts passing q2 both as a GRanges object and as an RleList):
## BP Pairwise comparison, using 3rd genome as co-ordinate reference
# q1 is used as the centre point reference, with q2 bps projected on to it.
# gr_ref_q1 is the pw alignment between the reference and query genome 1
# gr_ref_q2 is the pw alignment between the reference and query genome 2
# We construct two GRanges objects to feed into CoverageHeatMaps
library(schoolmath)     # for is.even()
library(heatmaps)
library(GenomicRanges)  # provides GRanges() and seqinfo()
library(IRanges)
bp_3gen_v2 <- function(gr_ref_q1, gr_ref_q2, win){
  # Failsafes (check ref genome is the same, etc)
  if(!(is.even(win))){stop("win should be an even number")}
  ## Construct g1_rco (1st GRanges object)
  # IRanges object: windows of width win centred on each q1 breakpoint
  q1_starts1 <- start(ranges(gr_ref_q1)) - (win*0.5)
  q1_starts2 <- end(ranges(gr_ref_q1)) - (win*0.5)
  q1_starts <- c(q1_starts1, q1_starts2)
  q1_ends1 <- start(ranges(gr_ref_q1)) + (win*0.5)
  q1_ends2 <- end(ranges(gr_ref_q1)) + (win*0.5)
  q1_ends <- c(q1_ends1, q1_ends2)
  q1_ir_ob <- IRanges(start = q1_starts, end = q1_ends)
  # GR object
  g1_vec_seq <- as.vector(seqnames(gr_ref_q1))
  gr1_seqnames <- c(g1_vec_seq, g1_vec_seq)
  g1_rco <- GRanges(seqnames = gr1_seqnames, ranges = q1_ir_ob,
                    seqinfo = seqinfo(gr_ref_q1))
  # Remove negative ranges from GR object
  g1_rco <- g1_rco[!(start(ranges(g1_rco)) < 0)]
  ## Construct g2_rco (2nd GRanges object)
  # IRanges object: width-1 points at each q2 breakpoint
  q2_starts <- start(ranges(gr_ref_q2))
  q2_ends <- end(ranges(gr_ref_q2))
  q2_bps <- c(q2_starts, q2_ends)
  q2_ir_ob <- IRanges(start = q2_bps, end = q2_bps)
  # GR object
  g2_vec_seq <- as.vector(seqnames(gr_ref_q2))
  gr2_seqnames <- c(g2_vec_seq, g2_vec_seq)
  g2_rco <- GRanges(seqnames = gr2_seqnames, ranges = q2_ir_ob,
                    seqinfo = seqinfo(gr_ref_q2))
  # Try removing anywhere in g2_rco that is not present in g1_rco
  # find intersection of seqnames
  g_inter <- intersect(g1_vec_seq, g2_vec_seq)
  # apply to g2_rco to remove out-of-bound scaffolds
  # (note: this == comparison recycles g_inter, which is what triggers the
  # length warning; %in% may be what was intended)
  g2_rco <- g2_rco[seqnames(g2_rco) == g_inter]
  # now to remove out of bound ranges (GRanges object)
  g2_red <- intersect(g1_rco, g2_rco)
  # And try as RleList object
  g2_red_rle <- coverage(g2_red)
  # Heatmap
  heat_map <- CoverageHeatmap(windows = g1_rco, track = g2_red_rle)
  heat_map
}
To avoid these problems and to achieve what you need, the simplest solution is to give both GRanges the same seqlevels and seqlengths. If you know these for your reference genome, set them directly; if not, derive them from the data as shown below.
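For the direct case, a sketch assuming a hypothetical Seqinfo object ref_seqinfo describing the shared reference genome (a stand-in name, not from the question):
# ref_seqinfo is a hypothetical Seqinfo for the shared reference genome
seqlevels(gr1) = seqlevels(ref_seqinfo)
seqinfo(gr1) = ref_seqinfo
seqlevels(gr2) = seqlevels(ref_seqinfo)
seqinfo(gr2) = ref_seqinfo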
Otherwise, start from example datasets:
library(heatmaps)
gr1 = GRanges(seqnames = c(1, 2, 3),
              IRanges(start = c(1, 101, 1001), end = c(500, 600, 1500)))
gr2 = GRanges(seqnames = c(2, 2, 3, 3),
              IRanges(start = c(1, 301, 1, 1201), end = c(2500, 4800, 3500, 9700)))
Then we make a combined range to get the levels and lengths:
combined = range(c(gr1, gr2))
seqlevels(gr1) = as.character(seqnames(combined))
seqlevels(gr2) = as.character(seqnames(combined))
seqlengths(gr1) = end(combined)
seqlengths(gr2) = end(combined)
Then the heatmap can be easily obtained by:
CoverageHeatmap(gr1, coverage(gr2))
Or if you only want to look at gr1 windows that have some values in gr2, then do:
CoverageHeatmap(gr1[countOverlaps(gr1, gr2) > 0], coverage(gr2))
I have a list of co-ordinates of certain bus stops in this format
Bus_Stop_ID lat long
A -34.04199 18.61747
B -33.92312 18.44649
I then have a list of certain shops
Shop_ID lat long
1 -34.039350 18.617964
2 -33.927820 18.410520
I would like to check whether each shop falls within a 500-metre radius of a bus stop. Ultimately, the final dataset would look something like this, where the Bus_Stop column indicates TRUE/FALSE and Bus_ID shows the relevant bus ID(s) for that shop if Bus_Stop == TRUE:
Shop_ID lat long Bus_Stop Bus_ID
1 -34.039350 18.617964 TRUE A
2 -33.927820 18.410520 FALSE #NA
Does anyone have an idea how I can go about this using R? I've seen the geosphere package, but I've struggled to understand it given my relative inexperience in the spatial domain. Any ideas or packages you could recommend? Thank you.
Updated to more scalable solution:
The previous answer (still included below) is not suited to large data sets. The reason is that we need to compute the distance for each pair of shops and bus stops, so both memory and computation scale as O(N*M) for N shops and M bus stops. A more scalable solution uses a data structure such as a KD-tree to perform a nearest-neighbor search for each shop. The computational complexity then becomes O(M log M) for building the KD-tree over the bus stops and O(N log M) for searching the nearest neighbor of each shop.
To do this, we can use nn2 from the RANN package. The complication is that nn2 deals only with Euclidean distances and knows nothing about lat/long. Therefore, we need to convert the lat/long coordinates to some map projection (e.g., UTM) in order to use it correctly, i.e., so that Euclidean distances between shops and bus stops are computed correctly.
Note: The following borrows heavily from Josh O'Brien's solutions for determining the UTM zone from a longitude and for converting lat/long to UTM, so he should take a bow.
## First define a function from Josh OBrien's answer to convert
## a longitude to its UTM zone
long2UTM <- function(long) {
  (floor((long + 180)/6) %% 60) + 1
}
## Assuming that all points are within a zone (within 6 degrees in longitude),
## we use the first shop's longitude to get the zone.
z <- long2UTM(shops[1,"long"])
library(sp)
library(rgdal)
## convert the bus lat/long coordinates to UTM for the computed zone
## using the other Josh O'Brien linked answer
bus2 <- bus
coordinates(bus2) <- c("long", "lat")
proj4string(bus2) <- CRS("+proj=longlat +datum=WGS84")
bus.xy <- spTransform(bus2, CRS(paste0("+proj=utm +zone=", z, " +ellps=WGS84")))
## convert the shops lat/long coordinates to UTM for the computed zone
shops2 <- shops
coordinates(shops2) <- c("long", "lat")
proj4string(shops2) <- CRS("+proj=longlat +datum=WGS84")
shops.xy <- spTransform(shops2, CRS(paste0("+proj=utm +zone=", z, " +ellps=WGS84")))
library(RANN)
## find the nearest neighbor in bus.xy@coords for each shops.xy@coords
res <- nn2(bus.xy@coords, shops.xy@coords, 1)
## res$nn.dists gives the distance to the nearest bus stop for each shop
## res$nn.idx gives the index into bus.xy of that nearest bus stop
shops$Bus_Stop <- res$nn.dists <= 500
shops$Bus_ID <- ifelse(res$nn.dists <= 500, as.character(bus$Bus_Stop_ID)[res$nn.idx], NA)  # as.character() avoids ifelse() dropping factor levels
Although more complicated, this approach is much better suited for realistic problems where you may have large numbers of shops and bus stops. Using the same supplied data:
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
You can do this using the package geosphere. Here, I'm assuming that your first data frame is named bus, and your second data frame is named shops:
library(geosphere)
g <- expand.grid(1:nrow(shops), 1:nrow(bus))
d <- matrix(distGeo(shops[g[,1], c("long","lat")], bus[g[,2], c("long","lat")]),
            nrow = nrow(shops))
shops$Bus_Stop <- apply(d, 1, function(x) any(x <= 500))
shops$Bus_ID <- bus[apply(d, 1, function(x) {
  idx <- which(x <= 500)
  if (length(idx) == 0) NA else idx[1]
}), "Bus_Stop_ID"]
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
Notes:
We first use expand.grid to enumerate all pair combinations of shops and bus stops. These are ordered by shops first.
We then compute the distance matrix d using geosphere::distGeo. Note here that the input expects (lon, lat) coordinates, and distGeo returns distances in meters. The resulting d matrix is nrow(shops) by nrow(bus), so each row gives the distances from one shop to each bus stop.
We then see if there is a bus stop within 500 meters of each shop by applying the function any(x <= 500) for each row x in d using apply with MARGIN=1.
Similarly, we can extract the column of d (corresponding to the row in bus) for the first bus stop within 500 meters of each shop, using which instead of any in our applied function. We then use this result to select the Bus_Stop_ID from bus.
By the way, we don't have to apply the condition x <= 500 twice. The following will also work:
shops$Bus_ID <- bus[apply(d, 1, function(x) {
  idx <- which(x <= 500)
  if (length(idx) == 0) NA else idx[1]
}), "Bus_Stop_ID"]
shops$Bus_Stop <- !is.na(shops$Bus_ID)
and is more efficient.
Data:
bus <- structure(list(Bus_Stop_ID = structure(1:2, .Label = c("A", "B"
), class = "factor"), lat = c(-34.04199, -33.92312), long = c(18.61747,
18.44649)), .Names = c("Bus_Stop_ID", "lat", "long"), class = "data.frame", row.names = c(NA,
-2L))
shops <- structure(list(Shop_ID = 1:2, lat = c(-34.03935, -33.92782),
long = c(18.617964, 18.41052), Bus_ID = structure(c(1L, NA
), .Label = c("A", "B"), class = "factor"), Bus_Stop = c(TRUE,
FALSE)), .Names = c("Shop_ID", "lat", "long", "Bus_ID", "Bus_Stop"
), row.names = c(NA, -2L), class = "data.frame")
My first approach would be to just use the Euclidean distance and check whether the resulting value is less than or equal to 500 m. You could then use an if clause to fill in the TRUE/FALSE column.
I hope this helps.
PS: Over a distance of 500 m, the Earth's surface is effectively flat, so I don't think you need any geodesic packages.
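A minimal sketch of that idea, assuming the bus and shops data frames defined above; the metres-per-degree constant and the function name are this sketch's own assumptions:
# Rough planar ("flat-Earth") distance in metres between two lat/long points;
# adequate at the ~500 m scale, away from the poles
m_per_deg_lat <- 111320  # approximate metres per degree of latitude
flat_dist <- function(lat1, lon1, lat2, lon2) {
  dy <- (lat2 - lat1) * m_per_deg_lat
  dx <- (lon2 - lon1) * m_per_deg_lat * cos((lat1 + lat2) / 2 * pi / 180)
  sqrt(dx^2 + dy^2)
}
# e.g. is shop 1 within 500 m of bus stop A?
flat_dist(shops$lat[1], shops$long[1], bus$lat[1], bus$long[1]) <= 500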
I want to dissolve all polygons but one from a shape file. Is there a way to do this?
Here is a reproducible example:
library(rgeos)
library(UScensus2000tract)
# load data
data("oregon.tract")
# plot map
plot(oregon.tract)
# Dissolve all polygons
d <- gUnaryUnion(oregon.tract, id = oregon.tract@data$state)
plot(d)
In this example, is it possible to dissolve the polygons while keeping tract number 9501?
I assume this is what you are looking for. It would be slightly different if you wanted to merge the contiguous members of the tract together: in that case you would remove the first element (the entire state) from the polygon, run gUnaryUnion on the remainder, and then re-add the merged tract members to a copy of the dissolved state (see the sketch at the end of this answer).
oregon = oregon.tract
names(attributes(oregon.tract))
#[1] "bbox" "proj4string" "polygons" "plotOrder" "data"
#[6] "class"
selected_tract_indices = which(oregon.tract@data$tract == 9501)
oregon <- gUnaryUnion(oregon.tract, id = oregon.tract@data$state)
d = oregon
npolygons = 1
for (selected_tract_index in selected_tract_indices){
  d@polygons[[npolygons + 1]] = oregon.tract@polygons[[selected_tract_index]]
  npolygons = npolygons + 1
  d@plotOrder = c(d@plotOrder, as.integer(npolygons))
}
plot(d)
The output of this operation is a SpatialPolygons object. In case you want to convert it back to a SpatialPolygonsDataFrame, here is a simple way to do it:
# Extract polygon ID's
( did <- sapply(slot(d, "polygons"), function(x) slot(x, "ID")) )
# Create dataframe with correct rownames
( d.df <- data.frame( ID=1:length(d), row.names = did) )
# Coerce back to SpatialPolygonsDataFrame and check class
d <- SpatialPolygonsDataFrame(d, d.df)
class(d)
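As an aside, here is a sketch of the variant mentioned at the top of this answer, using gUnaryUnion()'s id argument instead of the loop. Note that this merges all members of tract 9501 into one feature (contiguous or not) while the rest of the state collapses into another:
# group ids: one level for tract 9501, one for everything else
ids <- ifelse(oregon.tract@data$tract == 9501, "tract9501", "rest")
d2 <- gUnaryUnion(oregon.tract, id = ids)
plot(d2)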