I have a data frame with information about crimes (variable x) and the latitude and longitude of where each crime happened. I also have a shape file with the districts of São Paulo city. This is df:
latitude longitude n_homdol
1 -23.6 -46.6 1
2 -23.6 -46.6 1
3 -23.6 -46.6 1
4 -23.6 -46.6 1
5 -23.6 -46.6 1
6 -23.6 -46.6 1
And a shape file for the districts of São Paulo, sp.dist.sf:
geometry NOME_DIST
1 POLYGON ((352436.9 7394174,... JOSE BONIFACIO
2 POLYGON ((320696.6 7383620,... JD SAO LUIS
3 POLYGON ((349461.3 7397765,... ARTUR ALVIM
4 POLYGON ((320731.1 7400615,... JAGUARA
5 POLYGON ((338651 7392203, 3... VILA PRUDENTE
6 POLYGON ((320606.2 7394439,... JAGUARE
With the help of @Humpelstielzchen, I joined both datasets with:
sf_df = st_as_sf(df, coords = c("longitude", "latitude"), crs = 4326)
shape_df<-st_join(sp.dist.sf, sf_df, join=st_contains)
My final goal is to compute a local Moran's I statistic, and I'm trying to do this with:
sp_viz <- poly2nb(shape_df, row.names = shape_df$NOME_DIST)
xy <- st_coordinates(shape_df)
ww <- nb2listw(sp_viz, style = 'W', zero.policy = TRUE)
shape_df[is.na(shape_df)] <- 0
locMoran <- localmoran(shape_df$n_homdol, ww)
sids.shade <- auto.shading(c(locMoran[, 1], -locMoran[, 1]),
                           cols = brewer.pal(5, "PRGn"))
choropleth(shape_df, locMoran[, 1], shading = sids.shade)
choro.legend(-46.5, -20, sids.shade, fmt = "%6.2f")
title("Criminalidade (Local Moran's I)", cex.main = 2)
But when I run the code, this line takes hours to compute:
sp_viz <- poly2nb(shape_df, row.names = shape_df$NOME_DIST)
I have 15,000 observations for 93 districts. I tried to run the above code with only 100 observations, and it was fast and everything went right. But with the 15,000 observations I never saw the result, because the computation goes on forever. What may be happening? Am I doing something wrong? Is there a better way to do this local Moran's I test?
As I can't just comment, here are some questions one might ask:
- How long do you mean by fast? Some of my scripts run in seconds and I still call them slow.
- Are all your observations identically structured? Maybe the poly2nb() function is looping endlessly on an item with an uncommon structure. You can use the unique() function to check this (see the sketch after this list).
- Did you try to cut your dataset into pieces and run each piece separately? This would help to see 1) whether one of your parts has something that needs to be corrected and 2) whether R is loading all the data at once and overloading your computer's memory. Beware, this happens really often with huge datasets in R (and by huge, I mean data tables of more than about 50 MB).
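A minimal diagnostic sketch along those lines, assuming the shape_df object from your code and the sf/spdep packages already loaded:
library(sf)
library(spdep)

# st_join() returns one row per matched point, so each of the 93 district
# polygons may come back duplicated thousands of times; poly2nb() would then
# be comparing ~15,000 polygons instead of 93, which could explain the runtime.
nrow(shape_df)                      # rows after the join
length(unique(shape_df$NOME_DIST))  # should be 93 districts

# Time the neighbour search on a small piece before committing to the full run.
piece <- shape_df[1:100, ]
system.time(poly2nb(piece, row.names = piece$NOME_DIST))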
Glad to have tried to help you; do not hesitate to question my answer!
I have two problems I'm trying to solve; the first issue is the main one. Hopefully I've explained the second one decently.
1) My initial issue is trying to create a spatial polygon data frame from a tibble. For example, I have a tibble that outlines U.S. states, from the urbnmapr library, and I want to be able to plot spatial polygons for all 50 states. (Note: I have already made a map from these data in ggplot, but I specifically want spatial polygons to plot and animate in leaflet):
> states <- urbnmapr::states
> states
# A tibble: 83,933 x 10
long lat order hole piece group state_fips state_abbv state_name fips
<dbl> <dbl> <int> <lgl> <fct> <fct> <chr> <chr> <chr> <chr>
1 -88.5 31.9 1 FALSE 1 01.1 01 AL Alabama 01
2 -88.5 31.9 2 FALSE 1 01.1 01 AL Alabama 01
3 -88.5 31.9 3 FALSE 1 01.1 01 AL Alabama 01
...
2) Once I do this, I will want to join additional data from a separate tibble to the spatial polygons by state name. What would be the best way to do that if I have different data for each year? I.e. for the 50 states I have three years of data, so would I create 150 different polygons for the states across years, or have 50 state polygons with all the information in each, so I can make three different plots of all states for the different years?
I can propose the following (unchecked, because I don't have access to the urbnmapr package with my R version).
Problem 1
If you specifically want polygons, I think the best approach would be to join a data frame to an object that comes from a shapefile.
If you still want to do it on your own, you need to do two things:
Convert your tibble into a spatial object with a point geometry
Aggregate points by state
The sf package can do both. For the first step (the easy one), use the st_as_sf function.
library(sf)
states
states_spat <- states %>% st_as_sf(., coords = c("long", "lat"))
For the second step, you will need to aggregate the geometries. I can propose something that will give you a MULTIPOINT geometry, not polygons. To convert to polygons, you may find this thread helpful; there is also a rough sketch after the code below.
states_spat <- states_spat %>%
  group_by(state_name) %>%
  dplyr::summarise(x = n())
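For reference, a rough sketch of that conversion (untested against urbnmapr; it assumes the points of each piece are already ordered along the outline via the order column, and it does not handle holes):
library(dplyr)
library(sf)

states_poly <- states %>%
  arrange(state_name, piece, order) %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%
  group_by(state_name, piece) %>%
  summarise(do_union = FALSE) %>%  # combine the ordered points of each piece
  st_cast("POLYGON")               # close each ring into a polygon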
Problem 2
That's a standard join based on a common attribute between your data and a spatial object (e.g. a state code). merge or the *_join functions from dplyr work with sf data as they would with tibbles. You can find some pointers there.
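A minimal sketch of such a join, assuming states_spat from above and a hypothetical tibble yearly_data with columns state_name, year and value:
library(dplyr)

# Each state geometry is repeated once per matching year, so with three years
# you get 150 rows but still only 50 distinct geometries.
states_yearly <- states_spat %>%
  left_join(yearly_data, by = "state_name")

# then filter on year before plotting a given map, e.g.
# states_yearly %>% filter(year == 2016)
So rather than building 150 separate polygon sets, keeping the 50 geometries and filtering on year is usually the simpler option.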
By the way, I think it is better for you to do that than to create your own polygons from a series of points.
I want to get correlation values between two variables for each county.
I have subset my data as shown below and get the appropriate value for Adams county individually, but now I want to do the same for the other counties:
CorrData<-read.csv("H://Correlation
Datasets/CorrelationData_Master_Regression.csv")
CorrData2<-subset(CorrData, CountyName=="Adams")
dzCases<-(cor.test(CorrData2$NumVisit, CorrData2$dzdx,
method="kendall"))
dzCases
I want to do a for loop or something similar that will make the process more efficient, so that I don't have to write 20 different variable correlations for each of the 93 counties.
When I run the following in R, it doesn't give an error, but it doesn't give me the response I was hoping for either. Rather than the Spearman's Correlation for each county, it seems to be ignoring the loop portion and just giving me the correlation between the two variables for ALL counties.
CorrData<-read.csv("H:\\CorrelationData_Master_Regression.csv")
for (i in CorrData$CountyName)
{
dzCasesYears<-cor.test(CorrData$NumVisit, CorrData$dzdx,
method="spearman")
}
A very small sample of my data looks similar to this:
CountyName Year NumVisits dzdx
Adams 2010 4.545454545 1.19
Adams 2011 20.83333333 0.20
Elmore 2010 26.92307692 0.24
Elmore 2011 0 0.61
Brown 2010 0 -1.16
Brown 2011 17.14285714 -1.28
Clark 2010 25 -1.02
Clark 2011 0 1.13
Cass 2010 17.85714286 0.50
Cass 2011 27.55102041 0.11
I have tried to find a similar example online, but am not having luck!
Thank you in advance for all your help!
You are looping, but you are not using your iterator 'i' anywhere in your code, so every pass runs the same test on the full dataset. Based on the comments, you might also want to make sure you are working with numerics. In addition, I noticed that you are not storing each iteration's cor.test result in an output list. I'm not sure a loop is the most efficient way to do this, but it will work fine, and since you started with a loop, you should have something of the kind:
dzCasesYears = list() # Prepare a list to store your cor.test results
counter = 0           # Counter used to index the list while iterating
for (i in unique(CorrData$CountyName))
{
  counter = counter + 1
  # Creating new variables makes the code clearer
  x = as.numeric(CorrData[CorrData$CountyName == i, ]$NumVisit)
  y = as.numeric(CorrData[CorrData$CountyName == i, ]$dzdx)
  dzCasesYears[[counter]] <- cor.test(x, y, method = "spearman")
}
And it's always good to put a unique() there when you are iterating over values like the county names.
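As an optional follow-up (a sketch building on the loop above), you can name the list by county and then pull the estimates and p-values out in one go:
names(dzCasesYears) <- unique(CorrData$CountyName)

# one column per county: Spearman's rho and the p-value
sapply(dzCasesYears, function(ct) c(rho = unname(ct$estimate), p = ct$p.value))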
data.table makes operations like this very simple.
library('data.table')
CorrData <- as.data.table(read.csv("H:\\CorrelationData_Master_Regression.csv"))
CorrData[, cor(dzdx, NumVisits), CountyName]
With the sample data, it's all negative ones because there are only two points per county, so the correlation is perfect. The full dataset should be more interesting!
CountyName V1
1: Adams -1
2: Elmore -1
3: Brown -1
4: Clark -1
5: Cass -1
Edit to include p values from cor.test as OP asked in the comment
This is also quite simple!
CorrData[, .(cor = cor(dzdx, NumVisits),
             p   = cor.test(dzdx, NumVisits)$p.value),
         CountyName]
...But it won't work with your sample data, as two points per county is not enough for cor.test to compute a p value. Perhaps you could take @smci's advice and dput() a larger subset of the data to make your question truly reproducible.
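If you want Spearman's rho, as in the original loop, that is just one extra argument (a sketch):
CorrData[, .(rho = cor(dzdx, NumVisits, method = "spearman")), by = CountyName]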
I have a data frame containing a list of all accidents, and their locations (about 10,000 location coordinates), that took place in a city in one month.
Accident ID Name latitudes longitudes Intensity time
1 citycentre -25.5567 +54.00087 minor morning
2
3
4
I need to find how many of those accidents took place within a given set of longitudes and latitudes (e.g. between latitudes 25°S and 27°S, and longitudes 54°E and 55°E).
Is there a way to do such a thing in R?
I am new to R, so any help will be much appreciated. I have to do the same process with data from a large number of months and between different pairs of coordinates, so running a loop and using a counter variable will be very time consuming.
Is there a function available that will tell me the number of accidents if I make my study area into a polygon shape file?
I'm sure there are many ways to do it. Here's just one using non-equi joins:
library(data.table)
points <- fread("Accident_ID Name latitudes longitudes Intensity time
1 citycentre -28.5567 +54.50087 minor morning
2 citycentre -28.5567 +54.50087 minor morning
3 citycentre 0 0 minor morning
4 citycentre 100 100 minor morning")
extents <- data.table(
extent_id=1:3,
x1=c(-30, -20, 1000),
x2=c(-27, 20, 1100),
y1=c(54, -54, 100),
y2=c(55, 55, 100)
)
points[extents, on = .(latitudes >= x1, latitudes <= x2,
                       longitudes >= y1, longitudes <= y2)][
  , .(N = sum(!is.na(Accident_ID))), by = extent_id]
# extent_id N
# 1: 1 2
# 2: 2 1
# 3: 3 0
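For the last part of the question (using a polygon shape file as the study area), here is a hedged sketch with the sf package; "study_area.shp" is a hypothetical file name, and the points are assumed to be WGS84 longitude/latitude:
library(sf)

pts  <- st_as_sf(points, coords = c("longitudes", "latitudes"), crs = 4326)
area <- st_read("study_area.shp")        # hypothetical study-area polygon(s)
area <- st_transform(area, st_crs(pts))  # make sure both layers share a CRS

# number of accidents whose location falls inside the study area
sum(lengths(st_intersects(pts, area)) > 0)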
I've been researching this for a while now but haven't come across any solution that fits my needs or that I can adapt sufficiently to work in my case:
I have a large car sharing data set for multiple cities in which I have the charging demand per location (e.g. row = carID, 55.63405, 12.58818, charging demand). I would now like to split the area of the city (the example above is Copenhagen) into a hexagonal grid and tag every parking location with an ID (e.g. row = carID, 55.63405, 12.58818, charging demand, cell ABC) so I know which hexagonal cell it belongs to.
So my question is twofold:
(1) How can I create such a honeycomb grid with a side length of 124 meters (about 40,000 sqm per cell, which is equivalent to 200x200 meters but nicer as hexagons) in this area:
my_area <- structure(list(longitude = c(12.09980, 12.09980, 12.67843, 12.67843),
latitude = c(55.55886, 55.78540, 55.55886, 55.78540)),
.Names = c("longitude", "latitude"),
class = "data.frame", row.names = c(NA, -4L))
(2) How can I then associate all of my points on the map with a certain grid cell?
I'm really lost at this point. I tried to use tons of packages like rgdal, hexbin, sp, raster, rgeos, rasterVis, dggridR, ... but none of them got me to where I want to go. Help is much appreciated!
Parking data example:
id latitude longitude timestamp charging_demand
1: WBY1Z210X0V307780 55.68387 12.60167 2016-07-30 12:35:07 22
2: WBY1Z210X0V307780 55.63405 12.58818 2016-07-30 16:35:07 27
3: WBY1Z210X0V307780 55.68401 12.49015 2016-08-02 16:00:08 44
4: WBY1Z210X0V307780 55.68694 12.49146 2016-08-03 13:40:07 1
5: WBY1Z210X0V307780 55.68564 12.48824 2016-08-03 14:00:07 66
6: WBY1Z210X0V307780 55.66065 12.60569 2016-08-04 16:19:15 74
I think you can indeed use the hexbin package. Call the function like this:
h <- hexbin(data_x, data_y, xbins = nbins, xbnds = range_x, ybnds = range_y, IDs = TRUE)
The result has a slot cID which tells you, for each observation, the cell it falls in. You can use this to e.g. calculate the average charging demand per cell:
tapply(charging_demand, h@cID, FUN = function(z) sum(z) / length(z))
Additionally you can use hcell2xy to get coordinates you can use for plotting with ggplot. For an example you can look at this answer.
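A small sketch of that workflow, using a hypothetical parking data frame with the columns shown in the question; note that hexbin works in the units of the input coordinates, so to get roughly 124 m cells you would first project the coordinates to a metric CRS and choose xbins/xbnds accordingly:
library(hexbin)

h <- hexbin(parking$longitude, parking$latitude, xbins = 50, IDs = TRUE)

# average charging demand per hexagon cell (one value per occupied cell)
cell_means <- tapply(parking$charging_demand, h@cID, mean)

# cell centre coordinates, in the same order as the occupied cells
centers <- hcell2xy(h)
head(data.frame(centers, demand = as.numeric(cell_means)))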
I would simply like to delete some polygons from a SpatialPolygonsDataFrame object, based on corresponding attribute values in the @data data frame, so that I can plot a simplified/subsetted shapefile. So far I haven't found a way to do this.
For example, let's say I want to delete all polygons from this world shapefile that have an area of less than 30000. How would I go about doing this?
Or, similarly, how can I delete Antarctica?
require(maptools)
getinfo.shape("TM_WORLD_BORDERS_SIMPL-0.3.shp")
# Shapefile type: Polygon, (5), # of Shapes: 246
world.map <- readShapeSpatial("TM_WORLD_BORDERS_SIMPL-0.3.shp")
class(world.map)
# [1] "SpatialPolygonsDataFrame"
# attr(,"package")
# [1] "sp"
head(world.map@data)
# FIPS ISO2 ISO3 UN NAME AREA POP2005 REGION SUBREGION LON LAT
# 0 AC AG ATG 28 Antigua and Barbuda 44 83039 19 29 -61.783 17.078
# 1 AG DZ DZA 12 Algeria 238174 32854159 2 15 2.632 28.163
# 2 AJ AZ AZE 31 Azerbaijan 8260 8352021 142 145 47.395 40.430
# 3 AL AL ALB 8 Albania 2740 3153731 150 39 20.068 41.143
# 4 AM AM ARM 51 Armenia 2820 3017661 142 145 44.563 40.534
# 5 AO AO AGO 24 Angola 124670 16095214 2 17 17.544 -12.296
If I do something like this, the plot does not reflect any changes.
world.map@data = world.map@data[world.map@data$AREA > 30000,]
plot(world.map)
Same result if I do this:
world.map@data = world.map@data[world.map@data$NAME != "Antarctica",]
plot(world.map)
Any help is appreciated!
Looks like you're overwriting the data, but not removing the polygons. If you want to cut down the dataset, including both data and polygons, try e.g.
world.map <- world.map[world.map$AREA > 30000,]
plot(world.map)
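The same indexing answers the Antarctica part of the question:
world.map <- world.map[world.map$NAME != "Antarctica", ]
plot(world.map)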
[[Edit 19 April, 2016]]
That solution used to work, but @Bonnie reports otherwise for a newer R version (though perhaps the data has changed too?):
world.map <- world.map[world.map@data$AREA > 30000, ]
Upvote @Bonnie's answer if that helped.
When I tried to do this in R 3.2.1, tim riffe's technique above did not work for me, although modifying it slightly fixed the problem. I found that I had to specifically reference the data slot as well before specifying the attribute to subset on, as below:
world.map <- world.map[world.map@data$AREA > 30000, ]
plot(world.map)
Adding this as an alternative answer in case others come across the same issue.
Just to mention that subset also does the job, and avoids having to write the data frame's name in the condition.
world.map <- subset(world.map, AREA > 30000)
plot(world.map)
I used the above technique to make a map of just Australia:
australia.map <- world.map[world.map$NAME == "Australia", ]
plot(australia.map)
The comma after "Australia" is important, as it turns out.
One flaw with this method is that it appears to retain all of the attribute columns and rows for all of the other countries, and just populates them with zeros. I found that if I wrote out a .shp file, then read it back in using readOGR (rgdal package), it automatically removes the null geographic data. Then I could write another shape file with only the data I want.
writeOGR(australia.map, ".", "australia", driver = "ESRI Shapefile")
australia.map <- readOGR(".", "australia")
writeOGR(australia.map, ".", "australia_small", driver = "ESRI Shapefile")
On my system, at least, it's the "read" function that removes the null data, so I have to write the file after reading it back once (and if I try to re-use the filename, I get an error). I'm sure there's a simpler way, but this seems to work well enough for my purposes anyway.
As a second pointer: this does not work for shapefiles with "holes" in the shapes, because it is subsetting by index.