How to eliminate no-neighbour data when doing spatial clustering - r

Hello wonderful people!
I'm trying to run a cluster analysis on some data that I've mapped on a choropleth map. It's the % participation rates per constituency.
I'm trying to run a Moran's I test, but am unable to convert the neighbours into a weights list. I keep getting the error: "Error in nb2listw(nb) : Empty neighbour sets found".
I assume this is because some constituencies (like the Isle of Wight) have no neighbours. I can't find online which constituencies are islands or have no neighbours, and wondered if there is a speedier way to bypass or solve this issue in R, rather than googling all 573 England and Wales constituencies.
Can you help?
Ideas: I thought that maybe I could create "fake" polygons to surround all constituencies with no value so at the very least they could be listed. Or maybe there is a way of searching which have no neighbours and then removing them? Both of these I'm unsure how to do.
My goal: I want to get a few spatial clusters where the participation rates are similar and then extract that data so I can compare it to a regression model I have. If you know of another way to do this, other than above, please let me know.
I've tried: new_dataframe <- filter(election_merged_sf, !is.na(nb)), but this doesn't actually remove any objects. I assume this is because it is testing whether there are numeric neighbours, when it needs to be done spatially.
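A minimal sketch of the "search for regions with no neighbours, then remove them" idea from the question, assuming the sf and spdep packages. The toy polygons and the names toy_sf, isolated and toy_connected are made up for illustration; with the real data, election_merged_sf would take the place of toy_sf. spdep::card() returns the number of neighbours per region, so zeros mark the islands.

```r
library(sf)
library(spdep)

# Toy data (assumption): a 3 x 3 grid of touching squares plus one detached "island".
grid <- st_make_grid(
  st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3, ymax = 3))),
  n = c(3, 3)
)
island <- st_sfc(st_polygon(list(rbind(
  c(10, 10), c(11, 10), c(11, 11), c(10, 11), c(10, 10)
))))
toy_sf <- st_sf(
  turnout = c(60, 62, 61, 55, 57, 56, 70, 72, 71, 40),
  geometry = c(grid, island)
)

# Build contiguity neighbours; recent spdep versions warn here about empty sets.
nb <- poly2nb(toy_sf, queen = TRUE)

# card() gives the neighbour count per region; 0 means "no neighbours".
isolated <- which(card(nb) == 0)
isolated

# Option 1: drop the isolated regions, rebuild the neighbours, then test.
toy_connected <- toy_sf[card(nb) > 0, ]
nb_connected <- poly2nb(toy_connected, queen = TRUE)
lw <- nb2listw(nb_connected, style = "W")
moran.test(toy_connected$turnout, lw)

# Option 2: keep every region and allow empty neighbour sets explicitly.
lw_all <- nb2listw(nb, style = "W", zero.policy = TRUE)
moran.test(toy_sf$turnout, lw_all, zero.policy = TRUE)
```

Which option is appropriate depends on whether the islands should count towards the statistic; dropping them changes the sample, while zero.policy = TRUE keeps them with zero weights.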

Related

Is there an R function/package for determining WWF biomes from latlong coordinates?

Very new here, hi, postgraduate student who is tearing their hair out.
I have inherited a dataset of secondary data collected from research papers on species populations and their genetic diversity and have been adding more appropriate data to this sheet in preparation to perform some analyses. Part of the analysis will include subsetting the data by biome type to create comparisons between the biomes, and therefore I've been cleaning up and trying to add this information to the data I've added. I have latlong coordinates for each population (in decimal degrees) and it appears that the person working on this before me was able to use these to determine the biome for each point, specifically following the Olson et al. (2001)/WWF 14 biome categorisation, but at this point I'll take anything.
However I have no idea how this was achieved and truly can't find anything to help. After googling just about every combination of "r package biomes WWF latitude longitude species assign derive convert" that you can think of, the only packages that I have located are non functioning in my version of RStudio (e.g. biomeara, ggbiome), leaving me with no idea if they'd even work, and all other pages that I have dragged up seem to already have biome data included with their dataset. Other research papers I have found describe assigning biomes based on latlong coords and give 0 steps on how to actually achieve this. Is it possible in R? Am I losing my mind? Does anyone know of a way to do this, whether in R or not, and that preferably doesn't take forever as I have over 8000 populations to assess? Many thanks!
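One generic way to attach a polygon attribute to points is a spatial join. The sketch below assumes the sf package and uses two made-up rectangles in place of a real biome layer; the column name BIOME, the ids and the coordinates are all illustrative assumptions. With a real polygon layer, the toy object would be replaced by the result of sf::st_read() on that file.

```r
library(sf)

# Helper to build a lon/lat rectangle (toy geometry, for illustration only).
make_box <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Toy "biome" polygons (assumption): two boxes with made-up biome codes.
biomes <- st_sf(
  BIOME = c(1, 13),
  geometry = st_sfc(make_box(0, 0, 10, 10), make_box(20, 0, 30, 10), crs = 4326)
)

# Toy population table with decimal-degree coordinates.
populations <- data.frame(
  pop_id = c("p1", "p2", "p3"),
  lon = c(5, 25, 50),
  lat = c(5, 5, 50)
)
pts <- st_as_sf(populations, coords = c("lon", "lat"), crs = 4326)

# Left spatial join: each point gets the BIOME of the polygon it falls in,
# or NA when it falls outside every polygon.
joined <- st_join(pts, biomes["BIOME"], join = st_intersects)
joined
```

Points that come back as NA (for example coastal populations just outside a polygon edge) would need separate handling, such as matching to the nearest polygon.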

Why is PCA analysis in R using order as a variable?

I am doing PCA analysis in R. I am not by any means a programmer so please have some patience with me if I'm too vague or use incorrect terminology :)
So, for context, I am doing PCA of a giant dataset of US counties, with a ton of demographic data!
pcatest <- prcomp(countydata, center = TRUE, scale = TRUE)
Beforehand, this prcomp function was not accepting my countydata dataframe, saying it was "not numeric," so I needed to unlist it, use the as.numeric function, create a matrix and turn it back into a dataframe.
Anyways, after doing this, I noticed that the PCA analysis was definitely a bit weird. For most counties in the US, PC1 was around -0.9, but in nearly every county in Iowa, as well as some in Illinois and Indiana, values ranged from 20-40. Counties in Alabama, Alaska, and Arizona also had significantly lower than average values, despite being highly demographically different. I meticulously checked my data, nothing seemed off about the information that would lead to this PCA failure? I checked to see if numerical order or row number was accidentally made a variable analyzed by PCA, and it didn't seem like it!
Now, I do not know what to do. Maybe this error has something to do with what I had to do in order to use the prcomp function, maybe not. Has anyone else had this issue? If so, I would really like help. Thank you! :)
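One thing worth ruling out, as an assumption rather than a diagnosis: calling as.numeric() on a factor returns its integer level codes (ordered alphabetically by level), not the printed values, so an unlist/as.numeric round trip can silently turn names or text-encoded numbers into an ordering variable. The sketch below uses a made-up stand-in for countydata and converts column by column instead.

```r
# Hypothetical stand-in for countydata: a name column plus numeric columns
# that were read in as factors (an assumption about how the data was loaded).
countydata <- data.frame(
  county = factor(c("Adams", "Boone", "Clay", "Dallas", "Emmet")),
  pop    = factor(c("100", "9", "250", "30", "75")),
  income = factor(c("40.5", "52.1", "47.3", "60.2", "55.0"))
)

# Pitfall: as.numeric() on a factor gives level codes, not the values.
codes <- as.numeric(countydata$pop)

# Safer: go through character, one column at a time, and leave the
# name column out of the PCA input (kept only as row labels).
numeric_only <- as.data.frame(
  lapply(countydata[c("pop", "income")],
         function(col) as.numeric(as.character(col)))
)
rownames(numeric_only) <- as.character(countydata$county)

pcatest <- prcomp(numeric_only, center = TRUE, scale. = TRUE)
summary(pcatest)
head(pcatest$x)
```

Comparing str() of the data before and after conversion is a quick way to check that every column still holds the values it started with.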

Pre-defining clusters in r

I have a pretty big data table (about 100,000 observations) that I'd like to use for clustering. Since some of the data is categorical, I've tried using "gower distance" and then hclust() with the "ward" method.
The data itself is very heterogeneous, which is why I'd like to sort of "pre-cluster" the data and then do the actual cluster analysis. Have any of you done this before and can point me in the right direction? I'm at a loss at the moment :(
With the mentioned methods, I don't really get useful clusters.
Thanks guys, I really appreciate every tip I can get.
Edit: I think that I didn't really explain my problem right, so here's another attempt: let's say that I have a dataset containing brands of cars and some of their features. Before clustering them by features I would like to pre-cluster them by brand, so that all BMWs, for example, are in the same cluster, and so on. Only after that would I like to cluster by features, so I should get a cluster with fast cars etc.
Does anybody know how to do this in R?
This does not describe my dataset, but maybe the question I'm having is clearer now.
You should start with a sample first.
Once you get good results on the sample, try to reproduce them on a different sample. Once the results are stable, you can either try to scale the algorithm to the entire data set (maybe try doubling first), or you can train a classifier and predict the clusters of the remaining data. With most clustering algorithms, a 1-nearest-neighbor classifier will be very good.
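The sample-then-classify workflow described in this answer could look roughly like the sketch below. It uses simulated numeric data and Euclidean distance for simplicity; all object names are illustrative. For mixed categorical data, a Gower dissimilarity (for example from cluster::daisy) would replace dist(), and the nearest-neighbour step would need a matching distance, which class::knn does not provide.

```r
library(class)  # knn(); a recommended package that ships with R

set.seed(1)

# Toy data (assumption): two well-separated groups of 100 rows each.
x <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2)
)

# 1. Cluster a random sample only.
idx <- sample(nrow(x), 50)
hc <- hclust(dist(x[idx, ]), method = "ward.D2")
cl_sample <- cutree(hc, k = 2)

# 2. Assign every remaining row to the cluster of its nearest sampled row.
cl_rest <- knn(train = x[idx, ], test = x[-idx, ],
               cl = factor(cl_sample), k = 1)

# 3. Combine both into one label vector for the full data set.
clusters <- integer(nrow(x))
clusters[idx] <- cl_sample
clusters[-idx] <- as.integer(as.character(cl_rest))
table(clusters)
```

Repeating steps 1 to 3 with a different seed and comparing the resulting tables is one simple check of whether the clusters are stable across samples.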

Adding spatial clustering data in Map by R

I have the results for spatial clustering; in these results I have the ids for some cities in the USA. I would like to show these clustering results on a nice map. Is this feasible in R?
Yes, this is feasible.
You need to map the city ids to geographical data, then visualize it.
With the extensive drawing capabilities of R, this is not very hard; there are several R packages that will do the heavy lifting, and tutorials to guide you. Just pick whatever package you prefer.
We cannot give you a complete source, of course, because we don't know what kind of ids you have. For example many people use zip codes, others use FIPS ids, etc.
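As a rough illustration of the two steps in this answer (map ids to geography, then visualize), here is a base R sketch. The ids, cluster labels and lookup table are hypothetical, and the coordinates are approximate; a real workflow would use whatever id scheme the clustering output has and a proper lookup for it.

```r
# Hypothetical clustering output keyed by a made-up city id.
clusters <- data.frame(
  city_id = c("NYC", "LAX", "CHI", "HOU"),
  cluster = c(1, 2, 1, 2)
)

# Hypothetical lookup table from id to approximate coordinates.
cities <- data.frame(
  city_id = c("NYC", "LAX", "CHI", "HOU"),
  lon = c(-74.0, -118.2, -87.6, -95.4),
  lat = c(40.7, 34.1, 41.9, 29.8)
)

# Step 1: attach coordinates to the cluster results via the shared id.
mapped <- merge(clusters, cities, by = "city_id")

# Step 2: plot the points, coloured by cluster.
plot(mapped$lon, mapped$lat, col = mapped$cluster, pch = 19,
     xlab = "Longitude", ylab = "Latitude", asp = 1)
text(mapped$lon, mapped$lat, labels = mapped$city_id, pos = 3)
```

A mapping package would add state outlines or a basemap underneath; the merge step stays the same whichever one is chosen.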

Point pattern similarity and comparison

I recently started to work with a huge dataset provided by a medical emergency service. I have approximately 25,000 spatial points of incidents.
I have been searching books and the internet for quite some time and am getting more and more confused about what to do and how to do it.
The points are, of course, very clustered. I calculated the K, L and G functions for them and they confirm serious clustering.
I also have a population point dataset, one point for every citizen, that is similarly clustered to the incidents dataset (incidents happen to people, so there is a strong link between these two datasets).
I want to compare these two datasets to figure out if they are similarly distributed. I want to know if there are places where there are more incidents relative to the population. In other words, I want to use the population dataset to explain intensity and then figure out if the incident dataset corresponds to that intensity. The assumption is that incidents should appear randomly with respect to population.
I want to get a plot of the region with information where there are more or less incidents than expected if the incidents were randomly happening to people.
How would you do it with R?
Should I use Kest or Kinhom to calculate K function?
I read the description, but still don't understand what the basic difference between them is.
I tried using Kcross, but as I figured out, one of the two datasets used should be CSR (completely spatially random).
I also found Kcross.inhom, should I use that one for my data?
How can I get a plot (image) of incident deviations regarding population?
I hope I asked clearly.
Thank you for taking the time to read my question, and even more thanks if you can answer any of my questions.
Best regards!
Jernej
I do not have time to answer all your questions in full, but here are some pointers.
DISCLAIMER: I am a coauthor of the spatstat package and the book Spatial Point Patterns: Methodology and Applications with R so I have a preference for using these (and I genuinely believe these are the best tools for your problem).
Conceptual issue: How big is your study region and does it make sense to treat the points as distributed everywhere in the region or are they confined to be on the road network?
For now I will assume they can occur anywhere in the region.
A simple approach would be to estimate the population density using density.ppp and then use ppm to fit a Poisson model to the incidents with intensity proportional to the population density. This would probably be a reasonable null model, and if it fits the data well you can basically say that incidents happen "completely at random in space when controlling for the uneven population density". More info on density.ppp and ppm is in chapters 6 and 9 of the book, respectively, and of course in the spatstat help files.
If you use summary statistics like the K/L/G/F/J-functions you should always use the inhom versions to take the population density into account. This is covered in chapter 7 of the book.
Also it could probably be interesting to see the relative risk (relrisk) if you combine all your points into a marked point pattern with two types (background and incidents). See chapter 14 of the book.
Unfortunately, only chapters 3, 7 and 9 of the book are available as free-to-download sample chapters, but I hope you have access to it at your library or have the option of buying it.
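A rough sketch of the steps suggested in this answer, assuming the spatstat package and simulated data in place of the real population and incident patterns. The window, intensity function, thinning probability and bandwidths are arbitrary illustrative choices, not recommendations.

```r
library(spatstat)

set.seed(42)
W <- square(10)

# Simulated stand-ins (assumption): a clustered "population" pattern and
# "incidents" obtained by keeping each person with probability 0.1.
pop <- rpoispp(function(x, y) 20 * exp(-((x - 5)^2 + (y - 5)^2) / 8),
               lmax = 20, win = W)
incidents <- rthin(pop, P = 0.1)

# Kernel estimate of population density; positive = TRUE so log() is safe.
pop_dens <- density(pop, sigma = 1, positive = TRUE)

# Poisson null model: incident intensity proportional to population density.
fit <- ppm(incidents ~ offset(log(pop_dens)),
           data = list(pop_dens = pop_dens))

# Smoothed residuals: positive areas have more incidents than the model expects.
plot(Smooth(residuals(fit), sigma = 1),
     main = "Smoothed residuals (observed minus expected)")

# Inhomogeneous K-function using the fitted intensity.
K <- Kinhom(incidents, lambda = fit)
plot(K)

# Relative risk from a two-type marked pattern (background vs incidents).
both <- superimpose(background = pop, incidents = incidents)
rr <- relrisk(both, sigma = 1)
plot(rr, main = "Estimated probability that a point is an incident")
```

The smoothed residual image is one way to get the requested picture of where incidents are over- or under-represented relative to population; the bandwidth sigma controls how local that picture is.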

Resources