I have presence points of a certain species all over the United States. I completed a spatial join between the US and said points. However, I am unsure how to normalize the data. There is a "percent of total" option, but I am unsure whether that is appropriate, or whether it is as simple as just normalizing by the counts themselves.
It depends on what comparison you're trying to make with the normalized data.
If you want to look at the occurrence of that species by state, you could do a spatial join on a US States layer, then calculate a new field where the value is the species count for each state divided by the total area of the state. That would give you the normalized 'count per square mile' (or whatever unit you want).
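For instance, here is a minimal sketch with the sf package (the layer names states and species_pts are placeholders, and it assumes both layers share a projected CRS in metres):
library(sf)
# placeholder file names; substitute your own layers
states      <- st_read("us_states.shp")
species_pts <- st_read("species_points.shp")
# number of presence points falling within each state
states$species_count <- lengths(st_intersects(states, species_pts))
# normalize by state area (st_area() returns m^2 for a metre-based CRS)
states$area_sqmi      <- as.numeric(st_area(states)) / 2589988.11
states$count_per_sqmi <- states$species_count / states$area_sqmi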
I have blocks of census data (shapefile with the column of interest being pop20) and polygons of areas of interest (shapefile with the column of interest being site). I am trying to get a sum of the population within each of the areas of interest (see example of one area of interest and the census blocks below). I don't know how to join the population estimates (column: pop20) to the areas of interest and account for polygons that are only partially within the areas of interest.
Hence I am interested in the following:
what is the population within each census block within each area of interest, accounting for some blocks being only partially inside (so if 1/2 of a block is within the area of interest, assume the population is 1/2 of the value in pop20).
Then, what is the sum over all the blocks within each area of interest, using the weighted values from part 1 for the blocks that are only partially inside.
I have essentially imported my shapefiles using the sf package, but then I don't know what to do (do I use st_intersection or st_join or something else)?
pop<-st_read("...\\pop_census2020_prj.shp")
buff<-st_read("...\\trap_mcpbuff_prj.shp")
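Here is a minimal sketch of the area-weighted approach I have in mind (untested; it assumes both layers share a projected CRS and uses st_intersection so that the clipped pieces keep the attributes of both inputs):
library(sf)
library(dplyr)
# keep each block's full area before intersecting
pop$block_area <- st_area(pop)
# clip the census blocks to the areas of interest; attributes of both layers are kept
pieces <- st_intersection(buff, pop)
# weight each block's population by the share of the block falling inside the area of interest
pieces$w     <- as.numeric(st_area(pieces) / pieces$block_area)
pieces$pop_w <- pieces$pop20 * pieces$w
# sum the weighted populations per area of interest
pop_by_site <- pieces %>%
  st_drop_geometry() %>%
  group_by(site) %>%
  summarise(pop_est = sum(pop_w, na.rm = TRUE))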
Thank you for your help.
I have a made-up dataset of polling stations in Wales and I've attached a date column to it. We can imagine this date is the date this polling station was visited to check the facilities (for example).
What I'd like to work out is:
whether geographic points are within a certain distance
This I've managed by self-joining and using st_buffer and st_within to find points within 1000 m, then counting the number of neighbours.
and also the interval between the sample dates
This is what I'm having a bit of a problem with.
What I'd like to do, I think, is
for each polling station
calculate the number of neighbours (so far so easy)
for each neighbour determine the interval between the sampling dates
return a spatial object (probably for plotting in tmap)
Here's some test code that I've got that generates the sf dataset, calculates the number of neighbours and returns that.
It's really the date interval that's stumping me. It's not so much the calculation of the date interval as the way to generate these clusters of polling stations with date intervals.
Is it better to generate the (in this case) 108 polling station clusters?
What I'm trying to do in my larger dataset is calculate clusters of points over time.
I have ~2000 records with a date. I'd like to say:
for each of these 2000 records calculate the number of neighbours within a distance and within a timeframe.
I think it's probably better to
calculate each cluster of neighbouring points and visualise
then
remove neighbours from the cluster that are outside of the time frame and visualise that
Although, on typing this, I wonder if excluding points that didn't fall within a timeframe first and then calculating neighbours would be more efficient?
library(sf)
library(dplyr)

polls <- st_as_sf(read.csv(url("https://www.caerphilly.gov.uk/CaerphillyDocs/FOI/Datasets_polling_stations_csv.aspx")),
                  coords = c("Easting", "Northing"), crs = 27700) %>%
  mutate(date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/31'), by = "day"), 147))

test_stack <- polls %>%
  st_join(polls %>% st_buffer(dist = 1000), join = st_within) %>%
  filter(Ballot.Box.Polling.Station.x != Ballot.Box.Polling.Station.y) %>%
  add_count(Ballot.Box.Polling.Station.x) %>%
  rename(number_of_neighbours = n) %>%
  mutate(interval_date = date.x - date.y) %>%
  subset(select = -c(6:8, 10, 11, 13:18)) # %>%  ## uncomment the %>% and the two lines below to summarise so that only the number of neighbours is returned
  # distinct(Ballot.Box.Polling.Station.x, number_of_neighbours, date.x) %>%
  # filter(number_of_neighbours >= 2)
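For the time part, the simplest thing I can picture is filtering the joined pairs on the date interval before counting, e.g. keeping only neighbours sampled within 30 days of each other (the 30-day window is just an assumption for illustration):
time_window <- 30  # assumed window in days

test_stack_time <- polls %>%
  st_join(polls %>% st_buffer(dist = 1000), join = st_within) %>%
  filter(Ballot.Box.Polling.Station.x != Ballot.Box.Polling.Station.y) %>%
  mutate(interval_date = abs(as.numeric(date.x - date.y))) %>%
  filter(interval_date <= time_window) %>%   # drop neighbours outside the time frame
  add_count(Ballot.Box.Polling.Station.x) %>%
  rename(number_of_neighbours = n)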
I think it might be as simple as
library(tmap)
tm_shape(test_stack) + tm_dots(col = "number_of_neighbours", clustering = TRUE, size = 0.5)
I'm not sure how clustering works in leaflet, but that works quite nicely on this test data.
I am using the sf package to work with spatial data in R. At this stage, I want to make a spatial join so that the tax lots of my area of study inherit the attributes of the floodplain on which they are located. For example, tax lots may be located in a floodplain categorized as X, VE, A, A0, or V (these are codes that relate to the intensity of the flood in each area).
To do this, I tested the sf function st_join, which will by default rely on st_intersects to determine the spatial join for each entity of my tax lots.
However, I am trying to figure out the criteria used by the function when a tax lot intersects with two different floodplain areas (as in the photo below, in which several lots intersect both an A floodplain and an AE floodplain). Does it take the value of the zone that covers the largest area of the lot, or is it a matter of which zone appears higher up in the dataframe?
Note that I am not interested in partitioning the intersecting lots so that they are divided according to their areas intersecting one or another floodplain zone.
Photo of tax lots intersecting with more than one floodplain category
By default, st_join(x, y, join = st_intersects) duplicates all features in x that intersect with more than one feature from y.
If you set the argument largest = TRUE, st_join() returns the x features augmented with the fields of y that have the largest overlap with each of the features of x.
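For example (taxlots and floodplain are illustrative layer names):
library(sf)
# one row per tax lot, carrying the attributes of the flood zone with the largest overlap
lots_zoned <- st_join(taxlots, floodplain, largest = TRUE)
# compare with the default behaviour, which duplicates lots that intersect several zones
lots_dup <- st_join(taxlots, floodplain, join = st_intersects)
nrow(lots_dup) >= nrow(lots_zoned)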
See https://r-spatial.github.io/sf/reference/st_join.html and https://github.com/r-spatial/sf/issues/578 for more details.
I cannot wrap my mind around reading the plots generated by coplot().
For example, from help(coplot):
## Tonga Trench Earthquakes
coplot(lat ~ long | depth, data = quakes)
What do the gray bars above represent? Why are there 2 rows of lat/long boxes?
How do I read this graph?
I can shed some more light on the second chart's interpretation. The gray bars for both mag and depth represent intervals of their respective variables. Andy gave a nice description of how they are created above.
When you are reading them keep in mind that they are meant to show you the range of the observations for the respective conditioning variable (mag or depth) represented in each column or row. Therefore, in Andy's example the largest mag bar is just showing that the topmost row contains observations for earthquakes of approx. 4.6 to 7. It makes sense that this bar is the largest, since as Andy mentioned, they are created to have roughly similar numbers of observations and stronger earthquakes are not as common as weaker ones. The same logic holds true for depth where a larger range of depths was required to get a roughly proportional number of observations.
Regarding reading the chart, you would read the columns as representing the three depth groups (left to right) and the rows as representing the four mag groups (bottom to top). Thus, as you read up the chart you're progressively slicing the data into groups of observations with increasing magnitudes. So, for example, the bottom row represents earthquakes with magnitudes of 4 to 4.5 with each column representing a different range of depths. Similarly, you read the columns as holding depth constant while allowing you to see various ranges of magnitudes.
Putting it all together, as mentioned by Andy, we can see that as we read up the rows (progressing up in magnitude) the distribution of earthquakes remains relatively unchanged. However, when reading across the columns (progressing up in depth) we see that the distribution does slightly change. Specifically, the grouping of quakes on the right, between longitudes 180 and 185, grows tighter and more clustered towards the top of the cell.
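If it helps, you can print the exact ranges behind those gray bars with co.intervals(), the same function coplot() uses to build them (here with the 3 depth groups and 4 mag groups described above, and the default overlap of 0.5):
# conditioning intervals shown by the gray bars
co.intervals(quakes$depth, number = 3, overlap = 0.5)
co.intervals(quakes$mag,   number = 4, overlap = 0.5)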
This is a method for visualizing interactions in your dataset. More specifically, it lets you see how some set of variables varies conditional on some other set of variables.
In the example given, you're asking to visualize how lat and long vary with depth. Because you didn't specify number, and the formula indicates you're interested in only one conditioning variable, the function assumes you want number = 6 depth cuts (passed to co.intervals, which tries to make the number of data points approximately equal within each interval) and simply maximizes the data-to-ink ratio by stacking individual plot frames. The value of depth increases from left to right, starting with the lowest row and moving up (hence the top-right frame represents the largest depth interval). You can set rows or columns to change this behavior, e.g.:
coplot(lat ~ long | depth, data = quakes, columns=6)
but I think the power of this tool becomes more apparent when you inspect two or more conditioning variables. For example:
coplot(lat ~ long | depth * mag, data = quakes, number=c(3,4))
gives a rich view of how earthquakes vary in space, and demonstrates that there is some interaction with depth (the pattern changes from left to right), and little-to-no interaction with magnitude (the pattern does not change from top to bottom).
Finally, I would highly recommend reading Cleveland's Visualizing Data -- a classic text.
I am quite new to the area of spatial statistics, but I'm very interested. For learning and demo purposes, I've created three datasets.
Dataset - Persons: This describes individuals at a certain location with a few variables. Please note that the persons are located in the provided cities. A short explanation:
POINT_X: X-coordinate of city.
POINT_Y: Y-coordinate of city.
city: The name of the city, where they live.
ill: "1" states that they are ill. For learning purposes, all persons are ill.
job: Whether they have a job or not. "1" means they have one, "0" means they do not.
disnw: The distance to the nearest waterpoint.
wID: not relevant.
Dataset - City: This describes a number of cities including some variables. A short explanation of these:
city: The name of the city.
population: The population of the city.
POINT_X: X-coordinate of city.
POINT_Y: Y-coordinate of city.
ill: Number of ill persons in the city.
notill: Number of healthy persons in the city.
disnw: The distance (in km) to the nearest waterfeature.
wID: not relevant.
rate_ill: The rate of ill persons in the city.
rate_notill: The rate of healthy persons in the city.
Dataset - Waterfeatures: This is a collection of spatial points describing waterfeatures. Please note that the villages are at the same locations as the persons.
POINT_X: X-coordinate of a waterfeature.
POINT_Y: Y-coordinate of a waterfeature.
Geographic overview of the setting (red: persons, blue: waterfeatures, yellow: cities)
Now I want to check the hypothesis that cities that are nearer to waterfeatures (i.e. where the variable disnw is lower) have a higher number of ill persons. So, is there a correlation between the number (or rate) of ill persons and the proximity to waterfeatures? I know the datasets are possibly not representative or suitable for my hypothesis, but for now this fact shouldn't matter.
I've already looked at some functions and packages, but I'm very unsure about a suitable method. Methods that might be useful (at least from my point of view): semivariogram, variogram, Ripley's K function, G-function, correlation coefficient.
To give you a better overview, I've created example datasets. You can find these here:
persons = read.csv("http://pastebin.com/raw.php?i=3aMGi9Ax", header = TRUE, stringsAsFactors=FALSE)
city = read.csv("http://pastebin.com/raw.php?i=Lk3KXLQT", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("http://pastebin.com/raw.php?i=hQRvMZwE", header = TRUE, stringsAsFactors=FALSE)
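To make the question a bit more concrete, the most naive starting point I can imagine is a plain (non-spatial) correlation or a binomial GLM on the city-level data, though I suspect this ignores the spatial structure entirely:
# naive, non-spatial check using the city-level columns above
cor.test(city$disnw, city$rate_ill, method = "spearman")
# or model ill vs. healthy counts as a function of distance to the nearest waterfeature
fit <- glm(cbind(ill, notill) ~ disnw, family = binomial, data = city)
summary(fit)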
It would be awesome to get some input from your side. Maybe you have a tip on how to perform this kind of analysis.
Thanks in advance!