rworldmap package - Warning if the number of quantiles was reduced

I am using this R code:
library(rworldmap)
Data <- read.table("D:/Bla/Maps/Test.txt", header = TRUE, sep = "\t")
sPDF <- joinCountryData2Map(Data, joinCode = "ISO3", nameJoinColumn = "ISO3CountryCode")
mapCountryData(sPDF, nameColumnToPlot = "Data")
This produces a map but I get:
You asked for 7 quantiles, only 1 could be created in quantiles classification
I googled and it pointed me to this code, but I'm not sure whether it is relevant.
This is the data I have used:
ISO3CountryCode Data
JPN 7
AUS 6
IND 6
CHN 5
GBR 5
CHE 4
IRN 4
DEU 3
EGY 3
ESP 3
LBY 3
TUN 3
USA 3
ARG 2
AUT 2
BRA 2
EST 2
GRC 2
ITA 2
TUR 2
URY 2
CHL 1
ETH 1
FRA 1
JOR 1
KEN 1
KOR 1
LTU 1
MEX 1
NLD 1
NZL 1
PER 1
POL 1
SAU 1
SRB 1
SVK 1
SVN 1
TZA 1
ZAF 1

It looks like by default mapCountryData() tries to bin the data into quantiles. With so many tied values (most of your countries are 1 or 2), the quantile breaks collapse, hence the warning. You'll need to help it along a little by tweaking the catMethod parameter.
I'm not sure what your values 1 through 7 mean. If they are categories (and you want them all explicitly displayed in the legend), try:
mapCountryData(sPDF, nameColumnToPlot = "Data", catMethod="categorical")
If you want to treat all values equally on a continuous scale, try:
mapCountryData(sPDF, nameColumnToPlot = "Data", catMethod="fixedWidth")
If neither of these does what you want, you might try altering numCats and/or catMethod; see ?mapCountryData for the possible values and their meaning.
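For instance, a sketch that forces seven fixed-width bins (both arguments are documented in ?mapCountryData):
mapCountryData(sPDF, nameColumnToPlot = "Data", catMethod = "fixedWidth", numCats = 7)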

Related

How do you match a numeric value to a categorical value in another data set

I have two data sets. One with a numeric value assigned to individual categorical variables (country name) and a second with survey responses including a person's nationality. How do I assign the numeric value to a new column in the survey dataset with matching nationality/country name?
Here is the head of data set 1 (my.data1):
EN HCI
1 South Korea 0.845
2 UK 0.781
3 USA 0.762
Here is the head of data set 2 (my.data2):
Nationality OIS IR
1 South Korea 2 2
2 South Korea 3 3
3 USA 3 4
4 UK 3 3
I would like to make it look like this:
Nationality OIS IR HCI
1 South Korea 2 2 0.845
2 South Korea 3 3 0.845
3 USA 3 4 0.762
4 UK 3 3 0.781
I have tried this but unsuccessfully:
my.data2$HCI <- NA
for (i in 1:nrow(my.data2)) {
  my.data2$HCI[i] <- my.data1$HCI[my.data1$EN == my.data2$Nationality[i]]
}
We can use a left_join
library(dplyr)
left_join(my.data2, my.data1, by = c("Nationality" = "EN"))
Or with merge from base R
merge(my.data2, my.data1, by.x = "Nationality", by.y = "EN", all.x = TRUE)
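Or, staying close in spirit to your original loop but vectorized, a base R lookup with match (a sketch against your column names):
my.data2$HCI <- my.data1$HCI[match(my.data2$Nationality, my.data1$EN)]
Unmatched nationalities simply become NA instead of raising an error.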

How to Merge Shapefile and Dataset?

I want to create a spatial map showing drug mortality rates by US county, but I'm having trouble merging the drug mortality dataset, crude_rate, with the shapefile, usa_county_df. Can anyone help out?
I've created a key variable, "County", in both sets to merge on but I don't know how to format them to make the data mergeable. How can I make the County variables correspond? Thank you!
head(crude_rate, 5)
Notes County County.Code Deaths Population Crude.Rate
1 Autauga County, AL 1001 74 975679 7.6
2 Baldwin County, AL 1003 440 3316841 13.3
3 Barbour County, AL 1005 16 524875 Unreliable
4 Bibb County, AL 1007 50 420148 11.9
5 Blount County, AL 1009 148 1055789 14.0
head(usa_county_df, 5)
long lat order hole piece id group County
1 -97.01952 42.00410 1 FALSE 1 0 0.1 1
2 -97.01952 42.00493 2 FALSE 1 0 0.1 2
3 -97.01953 42.00750 3 FALSE 1 0 0.1 3
4 -97.01953 42.00975 4 FALSE 1 0 0.1 4
5 -97.01953 42.00978 5 FALSE 1 0 0.1 5
crude_rate$County <- as.factor(crude_rate$County)
usa_county_df$County <- as.factor(usa_county_df$County)
merge(usa_county_df, crude_rate, "County")
[1] County long lat order hole
[6] piece id group Notes County.Code
[11] Deaths Population Crude.Rate
<0 rows> (or 0-length row.names)
My take on this. First, you cannot expect a full answer with code because you did not provide a link to your data. Next time, please provide a full description of the problem along with the data.
I just used the data you provided here to illustrate.
require(tidyverse)
# Load the data
crude_rate = read.csv("county_crude.csv", header = TRUE)
usa_county = read.csv("usa_county.csv", header = TRUE)
# Create the variable "county_join" within county_crude to "left_join" on with the usa_county data.
# Note that the join columns must have the same data type and the same values in both tables.
crude_rate = crude_rate %>%
mutate(county_join = c(1:5))
# Join the dataframes using a left join on the county_join and County variables
df_all = usa_county %>%
left_join(crude_rate, by = c("County"="county_join")) %>%
distinct(order,hole,piece,id,group, .keep_all = TRUE)
Data link: county_crude
Data link: usa_county
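As a quick sanity check before any such join, you can confirm that the two key columns actually share values; with the county_join variable created above:
# 0 shared values means the merge will return zero rows, as in your attempt
length(intersect(unique(usa_county$County), unique(crude_rate$county_join)))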

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The value of the residual will be (UN_$sector = Total) minus the sum of column UN for the sectors c("1", "2", "3", "4", "5"), for a given year AND country.
This is what it should look like:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code, I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating the residual first and then stacking it with the other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector != 'Total'), sum),
                    aggregate(UN ~ country + year, data = subset(df, sector == 'Total'), sum),
                    by = c("country", "year")),
              {UN <- UN.y - UN.x
               sector <- 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector != 'Total'),
                  agg[c("country", "year", "sector", "UN")],
                  subset(df, sector == 'Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
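As a spot check, the AT 1990 residual agrees with a manual calculation from the sample data:
# Total minus the sum of sectors 1-5 for AT 1990
7.869005 - sum(1.407555, 1.037137, 4.769618, 2.455139, 2.238618)
# [1] -4.039062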
I think there are multiple ways you can do this. What I would recommend is to take advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can talk about the power of dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. Also, there are tons of great tutorials, cheat sheets, and documentation on all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bring it back into your original data frame.
Step 1: Calculating all residual values
We want to calculate the residual, i.e. the Total value minus the sum of the UN values for sectors 1 through 5, grouped by country and year. We can achieve this with:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == "Total"], na.rm = T) - sum(UN[sector != "Total"], na.rm = T))
Step 2: Add sector column to res_UN with value 'residual'
This should yield a data frame which contains country, year, and UN; we now need to add a column sector with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3 : Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns and they can now be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == "Total"], na.rm = T) - sum(UN[sector != "Total"], na.rm = T))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
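To verify the result for a single group (assuming dplyr is loaded as above):
# The AT 1990 block should now contain sectors 1-5, then Residual, then Total
UN_ %>% filter(country == "AT", year == 1990)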

r - Join data frame coordinates by shapefile regions aka Join Attributes by Location

I have a large data set, loaded in R as a data.frame. It contains observations associated with coordinate points (lat/lon).
I also have a shape file of North America.
In the empty column (NA filled) in my data frame, labelled BCR, I want to insert the region name which each coordinate falls into according to the shapefile.
I know how to do this is QGIS using the Vector > Data Management Tools > Join Attributes by Location
The shapefile can be downloaded by clicking HERE.
My data, right now, looks like this (a sample):
LATITUDE LONGITUDE Year EFF n St PJ day BCR
50.406752 -104.613 2009 1 0 SK 90 2 NA
50.40678 -104.61256 2009 2 0 SK 120 3 NA
50.40678 -104.61256 2009 2 1 SK 136 2 NA
50.40678 -104.61256 2009 3 2 SK 149 4 NA
43.0026385 -79.2900467 2009 2 0 ON 112 3 NA
43.0026385 -79.2900467 2009 2 1 ON 122 3 NA
But I want it to look like this:
LATITUDE LONGITUDE Year EFF n St PJ day BCR
50.406752 -104.613 2009 1 0 SK 90 2 Prairie Potholes
50.40678 -104.61256 2009 2 0 SK 120 3 Prairie Potholes
50.40678 -104.61256 2009 2 1 SK 136 2 Prairie Potholes
50.40678 -104.61256 2009 3 2 SK 149 4 Prairie Potholes
43.0026385 -79.2900467 2009 2 0 ON 112 3 Lower Great Lakes/St.Lawrence Plain
43.0026385 -79.2900467 2009 2 1 ON 122 3 Lower Great Lakes/St.Lawrence Plain
Notice the BCR column is now filled with the appropriate BCR region name.
My code so far is just importing and formatting the data and shapefile:
library(rgdal)
library(proj4)
library(sp)
library(raster)
# PFW data, full 2.5m observations
df = read.csv("MyData.csv")
# Cleaning out empty coordinate data
pfw = df[(df$LATITUDE != 0) & (df$LONGITUDE != 0) & (!is.na(df$LATITUDE)) & (!is.na(df$LONGITUDE)),]
# Creating a new column to be filled with associated Bird Conservation Regions
pfw["BCR"] = NA
# Making a duplicate data frame to conserve data
toSPDF = pfw
# Ensuring spatial formatting
#coordinates(toSPDF) = ~LATITUDE + LONGITUDE
SPDF <- SpatialPointsDataFrame(toSPDF[, c("LONGITUDE", "LATITUDE")],
                               toSPDF,
                               proj4string = CRS("+init=epsg:4326"))
# BCR shape file, no state borders
shp = shapefile("C:/Users/User1/Desktop/BCR/BCR_Terrestrial_master_International.shx")
spPoly = spTransform(shp, CRS("+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
# Check
isTRUE(proj4string(spPoly) == proj4string(SPDF))
# Trying to join attributes by location
#try1 = point.in.polygon(spPoly, SPDF) # Sounds good doesn't work
#a.data <- over(SPDF, spPoly[,"BCRNAME"]) # Error: cannot allocate vector of size 204.7 Mb
I think you want to do a spatial query with points and polygons, that is, to assign polygon attributes to the corresponding points. You can do that like this:
Example data
library(terra)
f <- system.file("ex/lux.shp", package="terra")
polygons <- vect(f)
points <- spatSample(polygons, 10)
Solution
e <- extract(polygons, points)
e
# id.y ID_1 NAME_1 ID_2 NAME_2 AREA POP
#1 1 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#2 2 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#3 3 2 Grevenmacher 6 Echternach 188 18899
#4 4 1 Diekirch 2 Diekirch 218 32543
#5 5 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#6 6 1 Diekirch 4 Vianden 76 5163
#7 7 3 Luxembourg 11 Mersch 233 32112
#8 8 2 Grevenmacher 7 Remich 129 22366
#9 9 1 Diekirch 3 Redange 259 18664
#10 10 3 Luxembourg 9 Esch-sur-Alzette 251 176820
With the older spatial packages you can use raster::extract or sp::over.
Example data:
library(raster)
pols <- shapefile(system.file("external/lux.shp", package="raster"))
set.seed(20180121)
pts <- data.frame(coordinates(spsample(pols, 5, 'random')), name=letters[1:5])
plot(pols); points(pts)
Solution:
e <- extract(pols, pts[, c('x', 'y')])
pts$BCR <- e$NAME_2
pts
# x y name BCR
#1 6.009390 49.98333 a Wiltz
#2 5.766407 49.85188 b Redange
#3 6.268405 49.62585 c Luxembourg
#4 6.123015 49.56486 d Luxembourg
#5 5.911638 49.53957 e Esch-sur-Alzette
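If you prefer the newer sf package, the same point-in-polygon join can be written with st_join; a sketch against the question's objects (the path mirrors the question's, pointing at the .shp rather than the .shx, and BCRNAME is the attribute referenced in the question; points falling in no polygon get NA):
library(sf)
# Points as an sf object in WGS84
pts_sf <- st_as_sf(pfw, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)
# Polygons, transformed to the same CRS
bcr <- st_transform(st_read("C:/Users/User1/Desktop/BCR/BCR_Terrestrial_master_International.shp"), 4326)
# Each point inherits the attributes of the polygon it falls in
pfw$BCR <- st_join(pts_sf, bcr["BCRNAME"])$BCRNAME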

Create count per item by year/decade

I have data in a data.table that is as follows:
> x<-df[sample(nrow(df), 10),]
> x
Importer Exporter Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and Exporter. So, in the above example, the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried the aggregate and data.table methods as suggested by the post here, but both of them seem to just give me counts of the number of Importers/Exporters per year (or decade, as I am more interested in that).
> x$Decade<-year(x$Date)-year(x$Date)%%10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Considering that aggregate uses the formula interface, I tried adding another criteria, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary according to the decade and the importer/ exporter? It does not matter if the summary for importer and exporter are in different tables.
We can do this using data.table methods: create the 'Decade' column by assignment (:=), then melt the data from 'wide' to 'long' format by specifying the measure columns, and reshape it back to 'wide' using dcast with length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
Decade + Country~variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
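If you want to see the intermediate 'long' format that dcast then reshapes, you can run the melt on its own:
# One row per country appearance, tagged Importer or Exporter in 'variable'
head(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"))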
I think with combined with aggregate will work in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
year = as.numeric(format(my.data$my.Date, format = "%Y")),
month = as.numeric(format(my.data$my.Date, format = "%m")),
day = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1
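A quick sanity check against the raw data confirms, for example, the United States exporter count for the 2000s:
# Three rows from the 2000s list United States as Exporter
with(my.data, sum(Exporter == "United States" & my.decade == 2000))
# [1] 3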
