R data from sf package is missing data from small island states - r

I am creating a map with R that should include all SADC economies. This map should be coloured in a later step according to an additional data set that I want to merge with the map. At the moment I have been using the sf package to map the SADC economies.
These include the following 16 Member States: Angola, Botswana, Comoros, Democratic Republic of Congo, Eswatini, Lesotho, Madagascar, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, United Republic Tanzania, Zambia and Zimbabwe.
While selecting the countries for my map, I could not find data for the three island states: Comoros, Mauritius & Seychelles
Is there a way to **manually add the geom (MULTIPOLYGON) data**, and if so, where do I find the information?
Alternatively: is there an alternative package, which includes all SADC country coordinates with which I could plot the map?
I have not found the missing data in the iso_a2 column (containing all ISO2 codes), in the name_long column (containing all names), or when filtering for all countries on the continent Africa in the continent column.
Here is my sample code:
# load packages
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(sf)     # classes and functions for vector geographic data
library(spData) # assumed source of the `world` dataset used below
# show African countries
Africa <- world %>%
  filter(continent == "Africa")
View(Africa) # find all SADC economies with the right name
# problem: missing: Comoros, Mauritius & Seychelles
# create SADC vector according to country names in dataset
SADCvector2 <- SADCvector <- c("Angola", "Botswana", "Democratic Republic of the Congo", "eSwatini", "Lesotho", "Madagascar", "Malawi",
                               "Mozambique", "Namibia", "Seychelles", "South Africa", "Tanzania", "Zambia", "Zimbabwe")
# select SADC countries
SADC1 <- world %>%
  filter(name_long %in% SADCvector2) %>%
  # select only variables of interest
  select(name_long, geom)
plot(SADC1)

Related

I need to find a way to count the number of repetition of value x where another column value is y

I am using the Twitter sentiment analysis dataset for airline flights. It has a column called negative result and another column called airline name. I need to count how often the value "Bad Flight" appears in the negative result column where the airline name is "Virgin America", repeat this for "Late Flight" and "Virgin America", then compare the two counts and use the larger one for plotting.
For example:
Negative Result    Airline Name
Bad Flight         Virgin America
Bad Flight         Virgin America
Bad Flight         Virgin America
Late Flight        Virgin America
Late Flight        Virgin America
Bad Flight         United
Damaged Luggage    United
Bad Flight         United
Late Flight        United
Late Flight        United
Bad Flight         Virgin America
Bad Flight         Virgin America
Late Flight        Virgin America
The expected output is 5 for "Bad Flight" and 3 for "Late Flight", so after comparing, "Bad Flight" is the value to be plotted.
If your dataframe is called df you can just do table(df).
Using dplyr:
library(dplyr)
df %>%
  filter(`Airline Name` == "Virgin America") %>%
  group_by(`Negative Result`) %>%
  summarize(n = n())
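For the comparison step the question asks about, here is a minimal base R sketch. The toy data mirrors the example table, and the column names `negative_result` and `airline_name` are assumptions chosen for illustration; the real dataset's names may differ.

```r
# Toy data mirroring the example in the question (column names are assumed)
df <- data.frame(
  negative_result = c("Bad Flight", "Bad Flight", "Bad Flight", "Late Flight",
                      "Late Flight", "Bad Flight", "Damaged Luggage", "Bad Flight",
                      "Late Flight", "Late Flight", "Bad Flight", "Bad Flight",
                      "Late Flight"),
  airline_name = c(rep("Virgin America", 5), rep("United", 5),
                   rep("Virgin America", 3)),
  stringsAsFactors = FALSE
)

# Count each negative result for a single airline
virgin <- df[df$airline_name == "Virgin America", ]
counts <- table(virgin$negative_result)

# Name of the most frequent negative result for that airline
top_reason <- names(counts)[which.max(counts)]
```

`counts` holds one count per negative result, and `top_reason` is the value with the largest count, which can then be passed on to the plotting code.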

search and replace multiple patterns in R

I'm attempting to use grepl in nested ifelse statements to simplify a column of data containing researchers' institutional affiliations with the country they belong to, i.e. '1234 University Way, University, Washington, United States' would become 'United States'. The column contains universities in over 100 countries. At first I tried nested ifelse statements with grepl:
H$FAF1 <- ifelse(grepl("Hungary", H$AF1), "Hungary",
ifelse(grepl("United States", H$AF1), "United States", ...
etc., but I realized the limit is 50 for nested ifelse statements. Does anyone know another way to do this? I tried writing a function but am unfortunately not that adept at R yet.
As an alternative to the regex approach by csgroen, where you have to write down the countries manually, you could try the countrycode package, which already includes them and might save you some time. Try:
countrycode::countrycode(sourcevar = "1234 University Way, University, Washington, United States",
origin = "country.name",
destination = "country.name")
Maybe using str_extract? I've made a little example.
min_ex <- c("1234 University Way, University, Washington, United States",
            "354 A Road, University B, City A, Romania",
            "447 B Street, National C University, City B, China")
library(stringr)
str_extract(min_ex, regex("United States|Romania|China"))
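With over 100 countries, typing the alternation by hand gets tedious. A possible base R sketch is to keep the names in a vector and build the pattern with `paste()`. The `countries` vector and the extra "No country given" row below are made up for illustration.

```r
# Hypothetical lookup vector; in practice it would list every country of interest
countries <- c("United States", "Romania", "China", "Hungary")

affiliations <- c(
  "1234 University Way, University, Washington, United States",
  "354 A Road, University B, City A, Romania",
  "447 B Street, National C University, City B, China",
  "No country given"
)

# Build a single alternation pattern from the vector
pattern <- paste(countries, collapse = "|")

# regexpr() returns -1 where nothing matches; those rows stay NA
hits <- regexpr(pattern, affiliations)
country <- rep(NA_character_, length(affiliations))
country[hits > 0] <- regmatches(affiliations, hits)
```

This avoids nested `ifelse` calls entirely: adding a country means adding one element to the vector. If a name contains regex metacharacters, it would need escaping before being pasted into the pattern.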

Extract characters from a string by a succession of colons

I am trying to pull some information out of a variable in a data frame. I am using R 3.3.3.
The information formatted as follows:
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
I would like to break down each section into a separate variable like so:
w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
z = "DOMINCAN REPUBLIC: Is a country located in the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
I am having some difficulty trying to extract this information. SO questions such as this and this have been very helpful. From these, I gathered that some form of stringr/ gsub can be used to pull this information but I can't figure out how to specify the ranges within a gsub statement.
I have been able to work out how to pull the first portion:
>test4 <- gsub("(.*{1})(:.*)","\\1", t)
which gives
[1] "CHINA: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"
My overall question is:
It would be nice if I did not have to clean up the "DOMINICAN REPUBLIC" part from the end of the string.
In summary:
1. How do you extract characters from a string by a succession of colons? (1st to 2nd colon, 2nd to 3rd, etc.)
2. Is there a way to keep the words in front of the colon as well?
Any information or guidance would be greatly appreciated.
You can use strsplit with an appropriate regex:
strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)
or
stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")
Notes:
\\.\\s matches a literal dot and a space.
(?=[\\w\\s]+:) is a positive lookahead that matches one or more word characters or spaces followed by a colon.
\\.\\s(?=[\\w\\s]+:) thus matches a dot and a space only if it is immediately followed by either a word character or a space one or more times and a colon. This would be the end of each paragraph.
Since I am using the regex within strsplit, I am splitting by whatever is matched by the regex. This results in splitting by the end of each paragraph.
perl=TRUE is needed to enable lookaheads/behinds.
Result:
[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
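The lookahead split can be checked on a shorter string. The sketch below uses a made-up sample (`txt`) and the same regex. Because the split consumes the period, the last step puts it back where it is missing; that step is an addition for illustration, not part of the answer above.

```r
# Shortened, made-up sample in the same "NAME: text." layout
txt <- "China: A country in East Asia. GUAM: A territory. It lies in Micronesia. DOMINCAN REPUBLIC: A country."

# Split on ". " only when a run of word characters/spaces ending in ":" follows
parts <- strsplit(txt, "\\.\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

# The split removes the period at each boundary; restore it where it is missing
parts <- ifelse(grepl("\\.$", parts), parts, paste0(parts, "."))
```

The period inside the GUAM entry does not trigger a split, because the text after it reaches another period before it reaches a colon.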
How about the following in base R?
# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";
# Get position of regexp matches
matches <- data.frame(
  idx = unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t)),
  len = c(diff(unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))), nchar(t))
)
# Get substrings based on positions and store in list
lst <- apply(matches, 1, function(x) {
  trimws(substr(t, x[1], sum(x) - 1))
})
lst
#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
Note: Regexp-matching countries is a bit awkward because your example contains all caps multi-word countries (DOMINCAN REPUBLIC), all caps single-word countries (e.g. GUAM), and "first-letter-caps" countries (China).

Categorizing a data frame column in R using grep on a list of a list

I have a column of a data frame that I want to categorize.
> df$orgName
[1] "Hank Rubber" "United Steel of Chicago"
[3] "Muddy Lakes Solar" "West cable"
I want to categorize the column using the categories list below that contains a list of subcategories.
metallurgy <- c('steel', 'iron', 'mining', 'aluminum', 'metal', 'copper' ,'geolog')
energy <- c('petroleum', 'coal', 'oil', 'power', 'petrol', 'solar', 'nuclear')
plastics <- c('plastic', 'rubber')
wiring <- c('wire', 'cable')
categories = list(metallurgy, energy, plastics, wiring)
So far I've been able to use a series of nested ifelse statements to categorize the column as shown below, but the number of categories and subcategories keeps increasing.
df$commSector <-
ifelse(grepl(paste(metallurgy,collapse="|"),df$orgName,ignore.case=TRUE), 'metallurgy',
ifelse(grepl(paste(energy,collapse="|"),df$orgName,ignore.case=TRUE), 'energy',
ifelse(grepl(paste(plastics,collapse="|"),df$orgName,ignore.case=TRUE), 'plastics',
ifelse(grepl(paste(wiring,collapse="|"),df$orgName,ignore.case=TRUE), 'wiring',''))))
I've thought about using a set of nested lapply statements, but I'm not too sure how to execute it.
Lastly, does anyone know of any R libraries that may have functions to do this?
Thanks a lot for everyone's time.
Cheers.
One option would be to get the vectors as a named list using mget, paste the elements of each vector together (as shown by the OP), and use grep to find the indexes of the elements in 'orgName' that match (or use value = TRUE). Then extract those elements and stack the list to create a data.frame.
res <- setNames(stack(lapply(mget(c("metallurgy", "energy", "plastics", "wiring")),
function(x) df$orgName[grep(paste(x, collapse="|"),
tolower(df$orgName))])), c("orgName", "commSector"))
res
#                  orgName commSector
#1 United Steel of Chicago metallurgy
#2       Muddy Lakes Solar     energy
#3             Hank Rubber   plastics
#4              West cable     wiring
If we have other columns in 'df', do a merge
merge(df, res, by = "orgName")
#                  orgName commSector
#1             Hank Rubber   plastics
#2       Muddy Lakes Solar     energy
#3 United Steel of Chicago metallurgy
#4              West cable     wiring
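If the category should be written straight into the original data frame, in row order and with an empty string for unmatched rows (as the nested ifelse does), a variation along these lines may work. It is a sketch, not the approach above: the named `categories` list and the extra "Acme Bakery" row are assumptions added for illustration.

```r
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
                             "Muddy Lakes Solar", "West cable", "Acme Bakery"),
                 stringsAsFactors = FALSE)

# Named list: the names become the category labels
categories <- list(
  metallurgy = c("steel", "iron", "mining", "aluminum", "metal", "copper", "geolog"),
  energy     = c("petroleum", "coal", "oil", "power", "petrol", "solar", "nuclear"),
  plastics   = c("plastic", "rubber"),
  wiring     = c("wire", "cable")
)

# One logical column per category: does any keyword occur in orgName?
hits <- sapply(categories, function(keywords) {
  grepl(paste(keywords, collapse = "|"), df$orgName, ignore.case = TRUE)
})

# First matching category per row; "" when none matches, like the nested ifelse
df$commSector <- apply(hits, 1, function(row) {
  if (any(row)) names(categories)[which(row)[1]] else ""
})
```

Adding a category then only requires a new list element. Note that `sapply` returns a matrix only when `df` has more than one row; a one-row data frame would need `matrix()` or `vapply` handling.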
data
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
"Muddy Lakes Solar", "West cable"), stringsAsFactors=FALSE)

Creating a function in R

I'm trying to create code that looks at two CSV files: one is a world list of all bird species and their ranges, and the other is a file of all the birds in the Himalayas. I need to check each species in the CSV file with the matching species on the IOC world list one and see if the bird is actually in range (meaning it would say either "India" or "himalayas" or "s e Asia" under the Range column). I want to create a function that can input both data sets, find where names match, check if range contains those words and returns where it does NOT, so I can check those birds specifically. Here is what I have so far (I'm using RStudio):
myfunc <- function() {
if ((bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")) {
print(eval(bird_data$Common.Name[bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")]))
}
}
save("myfunc", file = "myfunc.Rdata")
source("myfunc.Rdata")
I think I'm messed up in not having inputs. So I'm trying a new approach with:
compare = function(data1, data2) {
....
}
But for the above, I don't know how to let the function recognize the appropriate subsets of data (like I can't say data1$Scientific.name).
It's difficult to answer this question without a minimal reproducible example - without any knowledge of the two dataframes you are comparing it is hard to formulate a solution - see the link in the comment by alistaire above for how to provide this.
I suggest you change your question title to make it more informative - "Creating a function in R" suggests you want to know the syntax required for a function in R - I would recommend "Subsetting a dataframe with Grep and then filtering results in R" - which is what I think you are actually trying to do.
Assuming you obtained your IOC world list data from the International Ornithological Committee website, I am unsure whether the approach you describe in your function would work, as the data in the column Breeding Range-Subregion(s) is very messy. For example:
w Himalayas to s Siberia and w Mongolia
Himalayas to c China
e Afghanistan to nw India and w Nepal
e Afghanistan to w Tibetan plateau and n India
Africa south of the Sahara, s and se Asia
None of these values is identical to "India" or "himalayas" or "SE Asia" and none will be returned by your function which looks for an exact match. You would need to use grep to find the substring present within your data.
Let's create a toy data set.
bird_data <- data.frame(
Scientific.name=c(
"Chicken Little",
"Woodstock",
"Woody Woodpecker",
"Donald Duck",
"Daffy Duck",
"Big Bird",
"Tweety Pie",
"Foghorn Leghorn",
"The Road Runner",
"Angry Birds"))
ioc_data <- data.frame(
Scientific.name=c(
"Chicken Little",
"Woodstock",
"Woody Woodpecker",
"Donald Duck",
"Daffy Duck",
"Big Bird",
"Tweety Pie",
"The Road Runner",
"Angry Birds"),
subrange=c(
"Australia, New Zealand",
"w Himalayas to s Siberia and w Mongolia",
"Himalayas to c China",
"e Afghanistan to nw India and w Nepal",
"e Afghanistan to w Tibetan plateau and n India",
"Africa south of the Sahara, s and se Asia",
"Amazonia to n Argentina",
"n Eurasia",
"n North America"))
I would break what you are attempting to do into two steps.
Step 1
Use grep to subset the ioc_data dataframe based upon whether your search terms are found in the subrange column:
searchTerms <- c("India", "himalayas", "SE Asia")
#Then we use grep to return the indexes of matching rows:
matchingIndexes <- grep(paste(searchTerms, collapse="|"),
ioc_data$subrange,
ignore.case=TRUE) #Important so search such as "SE Asia" will match "se asia"
#We can then use our matching indexes to subset our ioc_data dataframe producing
#a subset of data corresponding to our range of interest:
ioc_data_subset <- ioc_data[matchingIndexes,]
Step 2
If I understand your question correctly, you now want to extract the rows from bird_data that ARE NOT present in ioc_data_subset (i.e. which rows in bird_data are for birds that ARE NOT recorded as inhabiting the subrange "India", "SE Asia", or "Himalayas" in the IOC data).
I would use Hadley Wickham's dplyr package for this - a good cheat sheet can be found here. After installing dplyr:
library(dplyr)
#Create a merged dataframe containing all the data in one place.
merged_data <- dplyr::left_join(bird_data,
ioc_data,
by = "Scientific.name")
#Use an anti_join to select any rows in merged_data that are NOT present in
#ioc_data_subset
results <- dplyr::anti_join(merged_data,
ioc_data_subset,
by = "Scientific.name")
The left_join is required first because otherwise we would not have the subrange column in our final data frame. Note that any species in bird_data that is not in ioc_data will have NA in the subrange column to indicate that no data was found.
results
  Scientific.name                subrange
1     Angry Birds         n North America
2 The Road Runner               n Eurasia
3 Foghorn Leghorn                    <NA>
4      Tweety Pie Amazonia to n Argentina
5  Chicken Little  Australia, New Zealand
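Since the question asked for a function that takes both data sets as inputs, the two steps can be wrapped up roughly as follows. This is a base R sketch under stated assumptions: the function name `compare`, its `key` and `range_col` arguments, and the small example data frames are all hypothetical, and `merge` is swapped in for the dplyr joins.

```r
# Return the rows of `birds` whose range in `ioc` matches none of `terms`
# (species missing from `ioc` are returned too, with NA as their range)
compare <- function(birds, ioc, terms,
                    key = "Scientific.name", range_col = "subrange") {
  # Keep every bird; unmatched species get NA in the range column
  merged <- merge(birds, ioc, by = key, all.x = TRUE)
  pattern <- paste(terms, collapse = "|")
  # grepl() returns FALSE for NA, so species without IOC data count as out of range
  in_range <- grepl(pattern, merged[[range_col]], ignore.case = TRUE)
  merged[!in_range, , drop = FALSE]
}

# Small made-up inputs
birds <- data.frame(Scientific.name = c("Woodstock", "Tweety Pie", "Foghorn Leghorn"),
                    stringsAsFactors = FALSE)
ioc <- data.frame(Scientific.name = c("Woodstock", "Tweety Pie"),
                  subrange = c("w Himalayas to s Siberia and w Mongolia",
                               "Amazonia to n Argentina"),
                  stringsAsFactors = FALSE)

out_of_range <- compare(birds, ioc, c("India", "himalayas", "SE Asia"))
```

Inside the function, columns are reached with `[[` and a column name held in a variable, which addresses the concern about not being able to write `data1$Scientific.name`.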
