Choropleth: plotting polygons on a map with ggplot2 in R

I realise this has been asked about 100 times prior, but none of the answers I've read so far on SO seem to fit my problem.
I have data. I have the lat and lon values. I've read around about something called sp and made a bunch of shape objects in a dataframe. I have matched this dataframe with the variable I am interested in mapping.
I cannot for the life of me figure out how the hell to get ggplot2 to draw polygons. Sometimes it wants explicit x,y values (which are a PART of the shape anyway, so seems redundant), or some other shape files externally which I don't actually have. Short of colouring it in with highlighters, I'm at a loss.
Each individual sps object is built with the following function (after importing, cleaning, and wrangling a shitload of data):
createShape = function(sub){
  # This function takes the list of lat/lng values and returns a SHAPE which should be plottable on ggmap/ggplot
  tempData = as.data.frame(do.call(rbind, as.list(VICshapes[which(VICshapes$Suburb==sub),] %>% select(coords))[[1]][[1]]))
  names(tempData) = c('lat', 'lng')
  # Note: Polygon() reads the first column as x (longitude) and the second as y (latitude)
  p = Polygon(tempData)
  ps = Polygons(list(p),1)
  sps = SpatialPolygons(list(ps))
  return(sps)
}
These shapes are then stored in the same dataframe as my data - which only this afternoon for some reason, I can't even look at, as trying to look at it yields the following error.
head(plotdata)
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : first argument must be atomic
I realise I'm really annoyed at this now, but I've about 70% of a grade riding on this, and my university has nobody capable of assisting.
I have pasted the first few rows of data here - https://pastebin.com/vFqy5m5U - apparently you can't print data with an S4 object - the shape file that I'm trying to plot.
Anyway. I'm trying to plot each of those shapes onto a map. Polygons want an x,y value. I don't have ANY OTHER SHAPE FILES. I created them based on a giant list of lat and long values, and the code chunk above. I'm genuinely at a loss here and don't know what question to even ask. I have the variable of interest based on locality, and the shape for each locality. What am I missing?
edit: I've pasted the summary data (BEFORE making them into shapes) here. It's a massive list of lat/lng values for EACH tile/area, so it's pretty big...
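One common way to get from per-area coordinate lists to something ggplot2 can draw is a single long data frame with one row per vertex and a grouping column, which is the layout geom_polygon works with. The sketch below is only an illustration under assumptions: suburb_coords, values, and the suburb names are hypothetical stand-ins for the data described above, and the plotting step runs only if ggplot2 is installed.

```r
# Hypothetical stand-in: one matrix of (lng, lat) vertices per suburb
suburb_coords <- list(
  Alpha = rbind(c(144.90, -37.80), c(144.95, -37.80),
                c(144.95, -37.85), c(144.90, -37.85)),
  Beta  = rbind(c(144.95, -37.80), c(145.00, -37.80),
                c(145.00, -37.85), c(144.95, -37.85))
)
# Hypothetical variable of interest, one value per suburb
values <- data.frame(Suburb = c("Alpha", "Beta"), value = c(10, 25),
                     stringsAsFactors = FALSE)

# Stack the vertices into one long data frame, keeping the suburb as group id
poly_df <- do.call(rbind, lapply(names(suburb_coords), function(s) {
  m <- suburb_coords[[s]]
  data.frame(Suburb = s, lng = m[, 1], lat = m[, 2],
             stringsAsFactors = FALSE)
}))

# Attach the variable to every vertex of its suburb
poly_df$value <- values$value[match(poly_df$Suburb, values$Suburb)]

# Draw one filled polygon per suburb (only if ggplot2 is available)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(poly_df, aes(x = lng, y = lat, group = Suburb, fill = value)) +
    geom_polygon(colour = "grey30") +
    coord_quickmap()
  print(p)
}
```

Note that x is longitude and y is latitude; swapping them is an easy way to end up with polygons in the wrong place.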

Answered on gis.stackexchange.com (link not provided).

Related

R Adding polygon name data to point data

I have two sets of data: point-level data and polygon data. I am aiming to add the name of the polygon in which a point is located as an extra column on the point-level data.
I have found and used the code below utilising the sf library.
new_point_data <- point_data %>% mutate(
  intersection = as.integer(st_intersects(geometry, polygon_data))
  , area = if_else(is.na(intersection), "", polygon_data$Name[intersection])
)
This works in 90% of cases; however, when a point intersects more than one polygon the code does not bring any data back. I'm assuming this is because it returns two (or more) values and cannot determine which to use. How can I update this to select any one value, e.g. the first?
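For each point, st_intersects returns the indices of all polygons it intersects, so one possible approach is to keep the first index per point before looking up the name. The sketch below illustrates only that selection step in base R: hits is a plain list standing in for the st_intersects result, and polygon_names is a hypothetical stand-in for polygon_data$Name.

```r
# Stand-in for st_intersects(points, polygons):
# one integer vector of polygon indices per point
hits <- list(2L, integer(0), c(1L, 3L))

# Hypothetical polygon names (polygon_data$Name in the question)
polygon_names <- c("North", "South", "East")

# Keep the first matching polygon per point, or NA when there is none
first_hit <- vapply(
  hits,
  function(i) if (length(i) > 0) i[1] else NA_integer_,
  integer(1)
)

# Look up the name; points without a match get NA
area <- polygon_names[first_hit]
```

With a real sf result the same vapply call could be applied to the returned list, since it is indexed by point in the same way.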

Occurrence of a certain word within a text stored in columns using R

I'm facing quite a lot of challenges currently while doing text analysis with R.
Therefore I have in a table the columns Date, Text and Likes
I want to count how often a certain word occurs within the texts of a column (max 1 per column) and how often not.
I want to plot the results by displaying the result like in this picture
but I would like dots for "occurrence" and "not occurrence" of the searched word with different colors as dots and aggregate it monthly on y-axis and likes on x-axis
It would be great if you could help me with this challenge
As update I have here the sample data available https://drive.google.com/file/d/1IWqDoRFBTL8er8VmvisHDeB5uM3BGgJe/view?usp=sharing
It looks like there are several moving parts here so let me outline the tasks I think you are looking for assistance with:
Determine if a word appears in text, row by row.
Plot this information.
Display the information by category, i.e. word found or not found.
Provide some sort of smoothed fit over the data.
You can accomplish the first task by using your choice of pattern matching function. grepl for example will search with the pattern as its first argument. You may want to look into other parameters such as case sensitivity to ensure they match your needs. You'll want to store this result into another column, assuming you use ggplot. Then, you can pass the data to ggplot and use the col argument to have it separate out categories for you.
It doesn't appear that your data is readily available from your question. In the future, it generally helps if you can share some sample data. I have made my own sample which should be similar to what you describe. See the example code below.
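library(tidyverse)  # attaches ggplot2, dplyr and the %>% pipe
set.seed(5)
dates <- seq.Date(from = as.Date("2021-01-01"),
                  to = as.Date("2021-03-01"),
                  by = "day")
data <- data.frame(Date = dates,
                   # draw one fruit per row rather than recycling a 3-element vector
                   fruit = sample(c("banana", "orange", "apple"),
                                  length(dates), replace = TRUE),
                   likes = runif(length(dates), 100, 1000))
# flag each row by whether the search word occurs in the text
data$good_fruit <- ifelse(grepl("orange", data$fruit), "orange", "not orange")
data %>%
  ggplot() +
  geom_point(aes(Date, likes, col = good_fruit)) +
  geom_smooth(aes(Date, likes))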
Since I threw together literally random data, there is not much of a pattern here, but I think this illustrates the general idea of what you wanted to show. If you wanted a more specific kind of aggregation, I would recommend performing that manipulation before passing the data to ggplot, but for a rough fit this should work.
Sample Image
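If a monthly aggregation is wanted before plotting, one way to prepare it in base R is sketched below. The posts data frame and its values are made up for illustration, with columns named after the Date, Text and Likes columns described in the question; ignore.case = TRUE is used so that "Oranges" and "ORANGE" also count as matches.

```r
# Hypothetical sample standing in for the Date / Text / Likes table
posts <- data.frame(
  Date  = as.Date(c("2021-01-05", "2021-01-20", "2021-02-03", "2021-02-14")),
  Text  = c("I like Oranges", "apples only", "orange juice", "ORANGE peel"),
  Likes = c(120, 300, 250, 410),
  stringsAsFactors = FALSE
)

# Case-insensitive search: at most one hit per row, as the question asks
posts$found <- ifelse(grepl("orange", posts$Text, ignore.case = TRUE),
                      "found", "not found")

# Month label used as the aggregation key
posts$month <- format(posts$Date, "%Y-%m")

# Mean likes per month and per category
monthly <- aggregate(Likes ~ month + found, data = posts, FUN = mean)
```

The resulting monthly table has one row per month and category and can be passed to ggplot in the same way as the raw data.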

How to use corrplot with is.corr=FALSE

I previously made a beautiful, functional and perfect correlation plot with corrplot (my plot). Now I have to show the underlying data in the same look. So my goal is to have triangular similarity matrices in the same colours as my correlation plot. Imagine it like the conditional formatting in Excel.
My Data: my Data from excel
Link to CSV Data file
It is loaded in as a CSV and R can read the CSV perfectly.
My Code: corrplot(Phylogeny, is.corr=FALSE,method="number", cl.lim=c(0,1))
The error it throws me: Error in if (any(corr < cl.lim[1]) || any(corr > cl.lim[2])) { : Missing value, where TRUE/FALSE is required
I made sure all columns are numeric
I made sure to fill the missing bits with NAs (because that was a problem somewhere before)
I made sure all my values are between 0 and 1, like I want the limit to be (at one point it told me that my values were not within the limit, when I tried around with some stuff)
the error does not change when I change the limit
the error does not change when I take the is.corr=FALSE out (default=TRUE)
I played around with corrplot.mixed and it's still not working
I have been referencing information from Corrplot Intro
I have looked into the condformat function, but I am not really sure if it can fill each cell with one colour according to the overall gradient, like I used for my correlation plot.
What am I missing here that it does not want to give me my table back with pretty colours?
I had the same error, but I was able to fix it by converting my data.frame to a matrix. I ended up with corrplot(as.matrix(df), is.corr = FALSE).
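A likely reason the conversion matters is what as.matrix does with a data frame that still contains a non-numeric column. The sketch below uses a small made-up data frame (df and its values are assumptions, not the poster's data) to show that a name column turns the whole matrix into character, while dropping it first gives a numeric matrix with the names kept as row names.

```r
# Made-up triangular similarity table with a name column
df <- data.frame(
  Species = c("A", "B", "C"),
  A = c(1.0, NA, NA),
  B = c(0.8, 1.0, NA),
  C = c(0.3, 0.5, 1.0),
  stringsAsFactors = FALSE
)

# With the name column included, every cell becomes a character string
m_bad <- as.matrix(df)

# Dropping the name column first gives a numeric matrix
m_good <- as.matrix(df[, -1])
rownames(m_good) <- df$Species
```

Checking is.numeric() on the matrix before plotting is a quick way to confirm which case applies.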
If I am understanding correctly, your posted data are already a correlation matrix - although not a fully symmetrical one of the sort that would be produced with the call cor on raw data.
In that case, the problem is just that you have variable names (Species) as a column in your data. Change this column to row names, drop the variable names, and call corrplot as user9536160 suggests:
library(readr)     # read_csv
library(dplyr)     # %>% and select
library(corrplot)
# read in your data
phyl <- as.data.frame(read_csv("Phylogeny.csv"))
# name rows and drop variable names in the df itself
row.names(phyl) <- phyl$Species
phyl <- phyl %>%
  select(-Species)
# call corrplot
corrplot(as.matrix(phyl), is.corr = FALSE)
The result:

R spplot labels in wrong places

I am working with the US census gazetteer data file (zcta5) which is publicly available. The version I am using has files named tl_2015_us_zcta510.shp, dbf... Plotting the file works fine.
The issue I am having seems to happen when I subset the SpatialPolygonsDataFrame with a larger number of polygons. But when I use a small subset the labels work fine.
The labels I need identify assigned groupings of postal codes an individual 5-digit polygon area belongs to. For example - for Ashtabula, OH postal codes I need all the postal codes to have a label in the middle of it that reads "503". I have labels for all the other Ohio postal code groupings - called "PostalGroupNumber" and in table form the data all checks out to be correct.
So I load libraries and read the full spatial data frame into memory:
library(sp)
library(maps)
library(mapdata)
library(maptools)
library(foreign)
#Load in the entire census gazetteer data file
zcta5=readShapeSpatial("~/R/PostalCodes/USA/US Postal Codes/ZCTA5/tl_2015_us_zcta510.shp")
Next: create vector of Ashtabula, OH postal codes:
ashtab.zips <- c("44003","44004","44005","44010","44030","44032","44041","44047","44048","44068","44076","44082","44084","44085","44088","44093","44099")
Next - subset zcta5 Spatial Data Frame to include only these postal codes:
ashtab <- zcta5[which(zcta5@data$GEOID10 %in% ashtab.zips),]
Next - add labels to new ashtab spatial data frame and plot:
ashtab@data <- cbind(ashtab@data, "PostalGroupNumber"="503")
l1 = list("sp.text", coordinates(ashtab), as.character(ashtab@data$PostalGroupNumber),col="black", cex=0.7,font=2)
spplot(ashtab,zcol="GEOID10", sp.layout=list(l1)
,main=list(label="PostalGroupNumber 503 Postal Areas",cex=2,font=1)
)
Which works and gives the following and correct plot of the postal areas of northeast Ohio with correct labels in them:
Pretty good - BUT - the scale on the right looks like it retained a huge number of GEOID10 levels where I expected only the subset of the 17 in the ashtab.zips vector. Side Question (extra credit ;-)- why are those levels still there?
Now on to the main problem. Ohio postal codes all start with a 43... or a 44... - I have a csv file for just the 5-digit codes that are in Ohio, each with their assigned PostalGroupNumber which I read into a data frame, clean up and use to subset the main data frame like I did above:
oh <- read.csv("~/R/PostalCodes/OhioPostalGroupings/OH-PGAs-PostalCodes Only.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = c("character", "character", "character"))
oh$ZIP_CODE <- trimws(oh$ZIP_CODE)
ohzcta5 <- zcta5[which(zcta5@data$GEOID10 %in% oh$ZIP_CODE),]
l1 = list("sp.text", coordinates(ohzcta5), as.character(ohzcta5@data$GEOID10),col="black", cex=0.7,font=2)
spplot(ohzcta5,zcol="GEOID10", sp.layout=list(l1)
,main=list(label="Ohio Postal Code - PostalGroupNumbers",cex=2,font=1)
)
This time - just plot with labels of the GEOID10 value to see if it plots correctly and it does - hard to read here but zooming in shows correct postal codes in each polygon (this is not a great image but shape of OH is right and labels are correct...):
Now I need to add the PostalGroupNumber labels to the spatial data frame, and make a factor to color all the groups of postal codes together as the same color per group. So Ashtabula should all be one color and all have "503" labels in them - but they do not:
ohzcta5@data <- merge(ohzcta5@data, oh, by.x="GEOID10", by.y="ZIP_CODE", all.x=TRUE)
ohzcta5@data <- cbind(ohzcta5@data, "TAcolor"=as.factor(ohzcta5@data$PostalGroupNumber))
l1 = list("sp.text", coordinates(ohzcta5), as.character(ohzcta5@data$PostalGroupNumber),col="black", cex=0.7,font=2)
spplot(ohzcta5,zcol="GEOID10", sp.layout=list(l1)
,main=list(label="Ohio Postal Code - PostalGroupNumber",cex=2,font=1)
)
Which now looks like this:
A closer look at Ashtabula (northeast corner) now looks like this - What happened to the labels?:
The labels are all wrong - and yet when examining the ohzcta5@data the PostalGroupNumber is on the correct GEOID10 records.
Help!!!! Losing my mind.
Answers to two issues:
1) The issue of too many levels being retained in the spatial data frame and appearing on the spplot scale is resolved by calling the base R function droplevels on each of the factors in the spatial data frame.
2) Don't use merge, because it reorders the rows so the data no longer align with the correct polygons. Instead use match, as shown in this post: https://stackoverflow.com/a/3652472/4017087 (Thanks Ramnath!)
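Both points can be checked on toy data in base R. The sketch below uses made-up postal codes and group numbers (attr_data and lookup are hypothetical stand-ins for the @data slot and the CSV lookup table); it shows that subsetting a factor keeps unused levels until droplevels is called, and that merge sorts rows by the key by default while a match-based lookup leaves the row order untouched.

```r
# 1) Unused factor levels survive subsetting until droplevels() is called
zips <- factor(c("43001", "44003", "44004", "45001"))
sub_zips <- zips[zips %in% c("44003", "44004")]
n_before <- nlevels(sub_zips)             # still counts all original levels
n_after  <- nlevels(droplevels(sub_zips)) # only the levels actually present

# 2) merge() reorders rows; match() keeps the polygon order intact
attr_data <- data.frame(GEOID10 = c("44099", "44003", "44041"),
                        stringsAsFactors = FALSE)   # polygon order
lookup <- data.frame(ZIP_CODE = c("44003", "44041", "44099"),
                     PostalGroupNumber = c("503", "504", "505"),
                     stringsAsFactors = FALSE)

# merge sorts by the key, so rows no longer line up with the polygons
merged <- merge(attr_data, lookup, by.x = "GEOID10", by.y = "ZIP_CODE",
                all.x = TRUE)

# match-based lookup adds the column without touching the row order
attr_data$PostalGroupNumber <-
  lookup$PostalGroupNumber[match(attr_data$GEOID10, lookup$ZIP_CODE)]
```

Because the polygons and the rows of the attribute table are paired by position, any operation that changes the row order silently attaches labels to the wrong shapes.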

Creating SpatialLinesDataFrame from SpatialLines object and basic df

Using leaflet, I'm trying to plot some lines and set their color based on a 'speed' variable. My data start at an encoded polyline level (i.e. a series of lat/long points, encoded as an alphanumeric string) with a single speed value for each EPL.
I'm able to decode the polylines to get lat/long series of points (thanks to Max, here), and I'm able to create segments from those series of points and format them as a SpatialLines object (thanks to Kyle Walker, here).
My problem: I can plot the lines properly using leaflet, but I can't join the SpatialLines object to the base data to create a SpatialLinesDataFrame, and so I can't code the line color based on the speed var. I suspect the issue is that the IDs I'm assigning SL segments aren't matching to those present in the base df.
The objects I've tried to join, with SpatialLinesDataFrame():
"sl_object", a SpatialLines object with ~140 observations, one for each segment; I'm using Kyle's code, linked above, with one key change - instead of creating an arbitrary iterative ID value for each segment, I'm pulling the associated ID from my base data. (Or at least I'm trying to.) So, I've replaced:
id <- paste0("line", as.character(p))
with
lguy <- data.frame(paths[[p]][1])
id <- unique(lguy[,1])
"speed_object", a df with ~140 observations of a single speed var and row.names set to the same id var that I thought I created in the SL object above. (The number of observations will never exceed but may be smaller than the number of segments in the SL object.)
My joining code:
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object)
And the result:
row.names of data and Lines IDs do not match
Thanks, all. I'm posting this in part because I've seen some similar questions - including some referring specifically to changing the ID output of Kyle's great tool - and haven't been able to find a good answer.
EDIT: Including data samples.
From sl_obj, a single segment:
print(sl_obj)
Slot "ID":
[1] "4763655"
[[151]]
An object of class "Lines"
Slot "Lines":
[[1]]
An object of class "Line"
Slot "coords":
lon lat
1955 -74.05228 40.60397
1956 -74.05021 40.60465
1957 -74.04182 40.60737
1958 -74.03997 40.60795
1959 -74.03919 40.60821
And the corresponding record from speed_obj:
row.names speed
... ...
4763657 44.74
4763655 34.8 # this one matches the ID above
4616250 57.79
... ...
To get rid of this error message, either make the row.names of data and Lines IDs match by preparing sl_object and/or speed_object, or, in case you are certain that they should be matched in the order they appear, use
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object, match.ID = FALSE)
This is documented in ?SpatialLinesDataFrame.
All right, I figured it out. The error was caused by my speed_obj not being the same length as my sl_obj, as mentioned here ("data = object of class data.frame; the number of rows in data should equal the number of Lines elements in sl").
Resolution: used a quick loop to pull out all of the unique lines IDs, then performed a left join against that list of uniques to create an exhaustive speed_obj (with NAs, which seem to be OK).
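library(plyr)  # join() is assumed to be plyr's
# collect the ID of every Lines element in sl_obj
ids <- data.frame()
for (i in seq_along(sl_obj@lines)) {
  id <- data.frame(sl_obj@lines[[i]]@ID)
  ids <- rbind(ids, id)
}
colnames(ids)[1] <- "linkId"
# left join, so IDs without a speed value get NA
speed_full <- join(ids, speed_obj)
# drop the ID column, keeping one row per line ID
speed_full_short <- data.frame(speed_full[,c(-1)])
row.names(speed_full_short) <- speed_full$linkId
splndf <- SpatialLinesDataFrame(sl_obj, data = speed_full_short, match.ID = TRUE)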
Works fine now!
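The same alignment can also be sketched without a loop or a join, using match() on the IDs. The example below is a base R illustration on made-up data: ids stands in for the Lines IDs pulled out of sl_obj, and speed_obj is a small data frame keyed by row names as in the sample above; the ID "9999999" is invented to show a line with no speed record.

```r
# Stand-in for the IDs of the Lines elements, in the order they appear in sl_obj
ids <- c("4763655", "4616250", "9999999", "4763657")

# Speed table keyed by row names; it may have fewer rows than there are lines
speed_obj <- data.frame(speed = c(44.74, 34.8, 57.79),
                        row.names = c("4763657", "4763655", "4616250"))

# One row per line ID, in line order; IDs without a speed get NA
speed_full <- speed_obj[match(ids, row.names(speed_obj)), , drop = FALSE]
row.names(speed_full) <- ids
```

The result has exactly one row per line and row names equal to the line IDs, which are the two conditions the error messages above complain about.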
I may have deciphered the issue.
When I pull in my spatial lines data and check its class, it reads as "SpatialLinesDataFrame" even though I know it's a simple linear shapefile. I'm using readOGR to bring the data in, and I believe this is where the conversion occurs. With that in mind, the speed assignment is relatively easy.
sl_object$speed <- speed_object[ match( sl_object$ID , row.names( speed_object ) ) , "speed" ]
This should do the trick, as I'm willing to bet your class(sl_object) is "SpatialLinesDataFrame".
EDIT: I had received the same error as OP, which drove me to check class().
I am under the impression that the error you got arose because you were trying to coerce a data frame into a data frame, and R wasn't a fan of that.
