I have two data frames. One is a SpatialPolygonsDataFrame and the other is a SpatialPointsDataFrame. Unfortunately I can't reproduce the entire example here, but the SpatialPolygonsDataFrame looks like this:
head(electorate)
ELECT_DIV STATE NUMCCDS ACTUAL PROJECTED POPULATION OVER_18 AREA_SQKM SORTNAME
Adelaide SA 318 0 0 0 0 76.0074 Adelaide
Aston VIC 191 0 0 0 0 99.0122 Aston
Ballarat VIC 274 0 0 0 0 4651.5400 Ballarat
Banks NSW 229 0 0 0 0 49.3189 Banks
Barker SA 343 0 0 0 0 63885.7100 Barker
Barton NSW 234 0 0 0 0 44.1112 Barton
As you can see, it's the spatial polygons for the Australian electorates. The second data frame is a SpatialPointsDataFrame with the longitude and latitude of polling places. It looks like this:
head(ppData)
State PollingPlaceID PollingPlaceNm Latitude Longitude
1 ACT 8829 Barton -35.3151 149.135
2 ACT 11877 Bonython -35.4318 149.083
3 ACT 11452 Calwell -35.4406 149.116
4 ACT 8794 Canberra Hospital -35.3453 149.099
5 ACT 8761 Chapman -35.3564 149.042
6 ACT 8763 Chisholm -35.4189 149.123
My goal is to match each polling place (PollingPlaceID) to the appropriate electoral division (ELECT_DIV). There will be many polling places within each division. It's no problem to plot them over each other, and it seems only natural that R will also let me add a new column to my polling place data frame (ppData) assigning each polling place the electorate (ELECT_DIV) it falls within.
I know I can extract the coordinates for each ELECT_DIV from electorate with coordinates(electorate) but I'm not sure that actually helps. Any advice?
You need over from sp and you can use it like this:
require( sp )
ID <- over( SpatialPoints( ppData ) , electorate )
ppData@data <- cbind( ppData@data , ID )
This returns a data.frame where each row corresponds to a row of the first argument (each of your polling points) and contains the data of the polygon that point fell in. You can just cbind them afterwards, and you then have the polygon data that relates to each point.
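If you prefer the newer sf package, a roughly equivalent spatial join looks like the sketch below. This is only a hedged alternative, not the approach above, and it assumes ppData and electorate can be converted with st_as_sf() and share the same CRS:

require( sf )
# Convert both sp objects to sf
pp_sf   <- st_as_sf( ppData )
elec_sf <- st_as_sf( electorate )
# Attach the attributes of the polygon each point falls in, including ELECT_DIV
pp_joined <- st_join( pp_sf , elec_sf , join = st_within )
head( pp_joined$ELECT_DIV )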
I have a shapefile of the Philippines that has the correct labels for each province. After removing some of the provinces I won't be using, aggregating the data into a single data frame, and then attaching my covariates to the shapefile, I run into trouble. When I use tmap to create some maps, the provinces are mislabeled, and therefore different data is applied to different provinces. I am doing a spatial-temporal analysis with this data, so it's important that the provinces are in the correct locations.
I have tried reprojecting parts of the shapefile, but it doesn't seem to help.
#reading in shapefile
shp <- readOGR(".","province.csv")
#removing provinces not in data from shapefile
myshp82=shp
shp@data$prov=as.character(shp@data$prov)
ind=shp@data$prov %in% mydata$prov
shp.subset=shp[ind,]
#attaching covariates to shapefile for plotting, myagg is my data frame.
#The shape files are divided in four different time periods.
myagg_time1=myagg[myagg$period==1,]
myagg_time2=myagg[myagg$period==2,]
myagg_time3=myagg[myagg$period==3,]
myagg_time4=myagg[myagg$period==4,]
myshptime1=myshptime2=myshptime3=myshptime4=shp
myshptime1@data=merge(myshptime1@data, myagg_time1, by='prov',all.x=TRUE)
myshptime2@data=merge(myshptime2@data, myagg_time2, by='prov',all.x=TRUE)
myshptime3@data=merge(myshptime3@data, myagg_time3, by='prov',all.x=TRUE)
myshptime4@data=merge(myshptime4@data, myagg_time4, by='prov',all.x=TRUE)
#desc maps. Here's the code I've been using for one of the maps.
Per1= tm_shape(myshptime1)+
tm_polygons(c('total_incomeMed','IRA_depMean','pov'), title=c('Total Income', 'IRA', 'Poverty (%)'))+
tm_facets(sync = TRUE, ncol=3)
#sample data from my data frame "myagg". The row names show the provinces.
period counts total_income_MED IRA_depMean
Agusan del Norte.1 1 2 119.33052 0.8939136
Agusan del Norte.2 2 0 280.96928 0.8939136
Agusan del Norte.3 3 1 368.30082 0.8939136
Agusan del Norte.4 4 0 368.30082 0.8950379
Aklan.5 1 0 129.63132 0.8716863
Aklan.6 2 3 282.95535 0.8716863
Aklan.7 3 3 460.29969 0.8716863
Aklan.8 4 0 460.29969 0.8437920
Albay.9 1 0 280.12221 0.8696165
Albay.10 2 3 453.05098 0.8696165
Albay.11 3 1 720.40732 0.8696165
Albay.12 4 0 720.40732 0.8254676
Essentially, the above tmap code creates three maps for this time period side by side, one for each of the covariates ('total_incomeMed', 'IRA_depMean', 'pov'). That part works, but the provinces are mislabeled, and since the data is tied to the province name, the wrong data ends up attached to the wrong province. I just need the provinces properly labeled!
Sorry if this doesn't make sense. Happy to clarify more if needed.
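One possible culprit (a hedged guess, not a confirmed diagnosis): base merge() does not preserve row order, so assigning its result back into the @data slot can silently mis-align attributes with the polygon geometries. A minimal order-preserving sketch, assuming prov uniquely identifies each polygon:

# Keep the shapefile's polygon order and align the covariates to it
idx <- match(myshptime1@data$prov, myagg_time1$prov)
myshptime1@data <- cbind(myshptime1@data,
                         myagg_time1[idx, setdiff(names(myagg_time1), "prov")])
# Or let sp do the bookkeeping, which keeps geometry and attributes aligned
myshptime1 <- sp::merge(shp, myagg_time1, by = "prov", all.x = TRUE)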
I have a dataset called dolls.csv that I imported using
dolls <- read.csv("dolls.csv")
This is a snippet of the data
Name Review Year Strong Skinny Weak Fat Normal
Bell 3.5 1990 1 1 0 0 0
Jan 7.2 1997 0 0 1 0 1
Tweet 7.6 1987 1 1 0 0 0
Sall 9.5 2005 0 0 0 1 0
I am trying to run some preliminary analysis of this data. Name is the name of the doll, Review is a rating from 1 to 10, Year is the year it was made, and all columns after that are binary: 1 if the doll possesses the characteristic and 0 if it doesn't.
I ran
summary(dolls)
and get the headers, means, mins and maxes of the values.
I am trying to see what the correlations are between the characteristics and Year or Review, for example to check whether certain dolls have very high ratings despite having unfavorable traits. I'm not sure how to construct charts or which functions to use for this. I was considering some ANOVA-style tests for outliers and comparisons of group means, but I'm not sure how to compare values like this (in Python I'd write an if-then statement, but I don't know how to do that in R).
This is for a personal study I wanted to conduct and improve my R skills.
Thank you!
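As a starting point, here is a minimal sketch of the kind of exploratory checks described above; it assumes dolls has exactly the columns shown in the snippet:

# Correlations between rating, year and the binary trait columns
num_cols <- c("Review", "Year", "Strong", "Skinny", "Weak", "Fat", "Normal")
round(cor(dolls[, num_cols]), 2)
# Quick visual overview of pairwise relationships
pairs(dolls[, num_cols])
# Mean rating with and without a given trait (the if/else-style comparison), e.g. Skinny
aggregate(Review ~ Skinny, data = dolls, FUN = mean)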
I have a data frame with information about crimes (variable x) and the latitude and longitude of where each crime happened. I also have a shapefile with the districts of São Paulo city. This is df:
latitude longitude n_homdol
1 -23.6 -46.6 1
2 -23.6 -46.6 1
3 -23.6 -46.6 1
4 -23.6 -46.6 1
5 -23.6 -46.6 1
6 -23.6 -46.6 1
And a shapefile for the districts of São Paulo, sp.dist.sf:
geometry NOME_DIST
1 POLYGON ((352436.9 7394174,... JOSE BONIFACIO
2 POLYGON ((320696.6 7383620,... JD SAO LUIS
3 POLYGON ((349461.3 7397765,... ARTUR ALVIM
4 POLYGON ((320731.1 7400615,... JAGUARA
5 POLYGON ((338651 7392203, 3... VILA PRUDENTE
6 POLYGON ((320606.2 7394439,... JAGUARE
With the help of @Humpelstielzchen, I joined both datasets with:
sf_df = st_as_sf(df, coords = c("longitude", "latitude"), crs = 4326)
shape_df<-st_join(sp.dist.sf, sf_df, join=st_contains)
My final goal is to compute a local Moran's I statistic, and I'm trying to do this with:
sp_viz <- poly2nb(shape_df, row.names = shape_df$NOME_DIST)
xy <- st_coordinates(shape_df)
ww <- nb2listw(sp_viz, style ='W', zero.policy = TRUE)
shape_df[is.na(shape_df)] <- 0
locMoran <- localmoran(shape_df$n_homdol, ww)
sids.shade <- auto.shading(c(locMoran[,1],-locMoran[,1]),
cols=brewer.pal(5,"PRGn"))
choropleth(shape_df, locMoran[,1], shading=sids.shade)
choro.legend(-46.5, -20, sids.shade,fmt="%6.2f")
title("Criminalidade (Local Moran's I)",cex.main=2)
But when I run the code, this line takes hours to compute:
sp_viz <- poly2nb(shape_df, row.names = shape_df$NOME_DIST)
I have 15,000 observations for 93 districts. I tried to run the above code with only 100 observations, and it was fast and everything went right. But with the 15,000 observations I never saw the result, because the computation goes on forever. What may be happening? Am I doing something wrong? Is there a better way to do this local Moran's I test?
As I can't just comment, here are some questions one might ask:
- How long do you mean by fast? Some of my scripts run in seconds and I still call them slow.
- Are all your observations identically structured? Maybe the poly2nb() function is looping endlessly on an item with an uncommon structure. You can use the unique() function to check this.
- Did you try cutting your dataset into pieces and running each piece separately? This would help to see (1) whether one of the parts has something that needs correcting, and (2) whether R is loading all the data at the same time and overloading your computer's memory. Beware, this happens really often with huge datasets in R (and by huge, I mean data tables weighing more than about 50 MB).
Glad to have tried to help; do not hesitate to question my answer!
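A further hedged suggestion along the same lines: poly2nb() works on polygon geometries, so it is normally run on the 93 district polygons rather than on a join carrying 15,000 point rows. A rough sketch of that ordering, assuming sp.dist.sf holds one row per district and shape_df is the joined object from the question:

library(sf)
library(spdep)
# 1. Collapse the joined data to one row per district
counts <- aggregate(n_homdol ~ NOME_DIST,
                    data = st_drop_geometry(shape_df), FUN = sum)
# 2. Attach the totals back onto the 93 district polygons
crime_by_dist <- merge(sp.dist.sf, counts, by = "NOME_DIST", all.x = TRUE)
crime_by_dist$n_homdol[is.na(crime_by_dist$n_homdol)] <- 0
# 3. Neighbours, weights and local Moran's I on 93 polygons, not 15,000 rows
nb <- poly2nb(crime_by_dist, row.names = crime_by_dist$NOME_DIST)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
locM <- localmoran(crime_by_dist$n_homdol, lw)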
I am working on spatial datasets using R.
Data Description
My master dataset is in SpatialPointsDataFrame format and has surface temperature data (column names - "ruralLSTday", "ruralLSTnight") for every month. Data snippet is shown below:
Master Data - (in SpatialPointsDataFrame format)
TOWN_ID ruralLSTday ruralLSTnight year month
2920006.11 2920006 303.6800 289.6400 2001 0
2920019.11 2920019 302.6071 289.0357 2001 0
2920015.11 2920015 303.4167 290.2083 2001 0
3214002.11 3214002 274.9762 293.5325 2001 0
3214003.11 3214003 216.0267 293.8704 2001 0
3207010.11 3207010 232.6923 295.5429 2001 0
Coordinates:
longitude latitude
2802003.11 78.10401 18.66295
2802001.11 77.89019 18.66485
2803003.11 79.14883 18.42483
2809002.11 79.55173 18.00016
2820004.11 78.86179 14.47118
I want to add columns about rainfall and air temperature to the above data. This data is available for every month in SpatialGridDataFrame format in the table "secondary_data". A snippet of "secondary_data" is shown below:
Secondary Data - (in SpatialGridDataFrame format)
month meant.69_73 rainfall.69_73
1 1 25.40968 0.6283871
2 2 26.19570 0.4580542
3 3 27.48942 1.0800000
4 4 28.21407 4.9440000
5 5 27.98987 9.3780645
Coordinates:
longitude latitude
[1,] 76.5 8.5
[2,] 76.5 8.5
[3,] 76.5 8.5
[4,] 76.5 8.5
[5,] 76.5 8.5
Question
How do I add the columns from the secondary data to my master data by matching on latitude, longitude and month? Currently the latitude/longitude information in the two tables above will not match exactly, as the master data is a set of points and the secondary data is a grid.
Is there a way to find the square of the grid in the "Secondary Data" that each lat/long of my master data falls into, and interpolate?
If your SpatialPointsDataFrame object is called x, and your SpatialGridDataFrame is called y, then
x <- cbind(x, over(x, y))
will add the attributes (grid cell values) of y matching to the locations of x, to the attributes of x. Match is done by point-in-grid cell.
Interpolation is a different question; a simple way would be inverse distance with the four nearest neighbours, e.g. by
library(gstat)
x = idw(meant.69_73~1, y, x, nmax = 4)
Whether you want one or the other really depends on what your grid cells mean: do they refer to (i) the point value at the grid cell centre, (ii) a value that is constant throughout the grid cell, or (iii) an average value over the whole grid cell? In the first case, interpolate; in the second, use over; in the third, use area-to-point interpolation (not explained here).
The R package raster offers similar functionality, but uses different names.
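For illustration, a rough sketch of the raster route just mentioned, assuming the grid y can be coerced to a RasterLayer; extract() with default settings matches point-in-cell like over(), and method = "bilinear" interpolates from the four surrounding cell centres:

library(raster)
# Coerce the single-attribute grid to a RasterLayer
r <- raster(y["meant.69_73"])
# Point-in-cell lookup, comparable to over()
x$meant.69_73 <- extract(r, x)
# Or interpolate from the four surrounding cell centres
x$meant.69_73_bil <- extract(r, x, method = "bilinear")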
Can anyone tell me how to constrain the output and selected variables of a neural network, such that the influence of a characteristic is positive, using the function nnet in R? I have a database (real estate) with numerical values (surface, price) and categorical values (parking Y/N, area code, etcetera). The output of the model is the price. The thing is that the model currently estimates that, in a few area codes, homes with a parking spot are worth less than homes without one. I would like to constrain the output (price) so that in each area code the influence of a parking spot on the price is positive. Of course a really small house with a parking spot can still be cheaper than a big house without one.
Example data (of 80,000 observations):
Price Surface Parking Y Areacode 1 Areacode 2 Areacode 3
100000 100 0 1 0 0
110000 99 1 0 1 0
200000 110 0 0 0 1
150000 130 0 0 1 0
190000 130 1 0 0 1
(thanks for putting the table in a decent format)
I had this modelled in R using nnet.
model = nnet(Price~ . , data=data6, MaxNWts=2500, size=12, skip=TRUE, linout=TRUE, decay=0.025, na.action=na.omit)
I used nnet because I hope to find different values for parking spots per area code. If there is a better way to do this, please let us know.
I'm using RStudio Version 0.98.976 on Windows XP (yes, I know ;)
Thanks in advance for your replies