R: changing values across multiple tables

I've been searching for a solution here and trying multiple methods to achieve what I want, but to no avail! I would really appreciate some help.
I have several tables with data on different countries. I need to merge these tables by country, but the same country is often referred to differently in each table, so I need to standardize them first.
Example table1:
birth_country mean_age
China 37
Germany 42
Example table2:
birth_country proportion_male
Federal Republic of Germany 54
China, People's Republic of 43
So I want to do something like this (which works when I do it as follows for a single table):
table1$birth_country[table1$birth_country == "China"] <- "China, People\'s Republic of"
table1$birth_country[table1$birth_country == "Federal Republic of Germany"] <- "Germany"
But no matter what I try, I can't seem to apply this sort of process to ALL of my tables. I've tried lapply and a for loop, in at least ten variations of the following...:
standardizeCountryNames<-function(x){
x[x == "China"] <- "China, People\'s Republic of"
x[x == "Federal Republic of Germany"] <- "Germany"
}
tables<-list(table1, table2, table3)
lapply(tables, function(i) {standardizeCountryNames(i$birth_country)})
and
for (k in 1:length(tables)){
tables[[k]]$birth_country[tables[[k]]$birth_country == "China"] <- "China, People\'s Republic of" }
I've tried referring to the birth_country variable in different ways, such as using with(table) and attach(table).
Any help would be greatly appreciated! (:

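You were almost there. Your function has to take the whole data frame, modify its birth_country column, and return the modified data frame as its last expression: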
table1 <- read.table(
text = "birth_country mean_age
China 37
Germany 42",
header = TRUE, stringsAsFactors = FALSE)
table2 <- read.table(
text = 'birth_country proportion_male
"Federal Republic of Germany" 54
"China, People\'s Republic of" 43',
header = TRUE, stringsAsFactors = FALSE)
standardizeCountryNames<-function(x){
x$birth_country[x$birth_country == "China"] <- "China, People\'s Republic of"
x$birth_country[x$birth_country == "Federal Republic of Germany"] <- "Germany"
x
}
tables<-list(table1, table2)
lapply(tables, function(i) {standardizeCountryNames(i)})
# [[1]]
# birth_country mean_age
# 1 China, People's Republic of 37
# 2 Germany 42
#
# [[2]]
# birth_country proportion_male
# 1 Germany 54
# 2 China, People's Republic of 43
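Note that lapply returns a new list rather than modifying table1 and table2 in place, so the result has to be assigned before merging. A minimal sketch of the whole workflow, assuming a named lookup vector (country_lookup and standardize_countries are hypothetical names, not from the question) and a final merge on birth_country:

```r
table1 <- data.frame(birth_country = c("China", "Germany"),
                     mean_age = c(37, 42),
                     stringsAsFactors = FALSE)
table2 <- data.frame(birth_country = c("Federal Republic of Germany",
                                       "China, People's Republic of"),
                     proportion_male = c(54, 43),
                     stringsAsFactors = FALSE)

# Hypothetical lookup: names are the variants, values are the standard names
country_lookup <- c("China" = "China, People's Republic of",
                    "Federal Republic of Germany" = "Germany")

standardize_countries <- function(df, lookup) {
  hit <- df$birth_country %in% names(lookup)
  # Replace only the values that have an entry in the lookup
  df$birth_country[hit] <- unname(lookup[df$birth_country[hit]])
  df  # return the modified data frame
}

# Assign the result: lapply does not change the original tables
tables <- lapply(list(table1, table2), standardize_countries,
                 lookup = country_lookup)

# Merge all standardized tables by country
merged <- Reduce(function(a, b) merge(a, b, by = "birth_country", all = TRUE),
                 tables)
merged
```

Keeping the replacements in one lookup vector means a new variant only needs one extra entry, rather than one extra line inside the function.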

Related

Renaming names in a dataset with commas and fullstops

I am trying to rename some countries in my data but I seem to be having an issue with one. I wish to rename the following countries, but I get an error when trying to rename 'Egypt, Arab Rep.' to 'Egypt'.
data
Country_Name value
Egypt, Arab Rep. 2192
Syrian Arab Republic 4998
Turkiye 8230
code used to rename
data = data %>% rename(Egypt = "Egypt, Arab Rep.") %>% rename(Syria = "Syrian Arab Republic") %>% rename(Turkey = "Turkiye")
error message received
Error in `stop_subscript()`:
! Can't rename columns that don't exist.
✖ Column_Name `Egypt, Arab Rep.` doesn't exist.
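rename() changes column names, not the values inside a column, which is why it reports that a column called `Egypt, Arab Rep.` doesn't exist. To change the values themselves, a dplyr way of doing this is the case_when function: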
library(dplyr)
df <- data.frame(Country = c("Egypt, Arab Rep.", "Syrian Arab Republic", "Turkiye"),
value = c(2192, 4998, 8230))
df <- df %>%
dplyr::mutate(Country = dplyr::case_when(
Country == "Egypt, Arab Rep." ~ "Egypt",
Country == "Syrian Arab Republic" ~ "Syria",
Country == "Turkiye" ~ "Turkey",
TRUE ~ Country
))
RESULT:
Country value
1 Egypt 2192
2 Syria 4998
3 Turkey 8230
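To make the distinction between renaming a column and recoding its values concrete, here is a small base R sketch. It assumes the question's data frame has a Country_Name column; new_names is a hypothetical lookup vector introduced only for illustration.

```r
data <- data.frame(Country_Name = c("Egypt, Arab Rep.",
                                    "Syrian Arab Republic",
                                    "Turkiye"),
                   value = c(2192, 4998, 8230),
                   stringsAsFactors = FALSE)

# Renaming: changes the column header only
names(data)[names(data) == "Country_Name"] <- "Country"

# Recoding: changes the values stored in the column
new_names <- c("Egypt, Arab Rep." = "Egypt",
               "Syrian Arab Republic" = "Syria",
               "Turkiye" = "Turkey")
idx <- match(data$Country, names(new_names))
# Keep the original value wherever the lookup has no entry
data$Country <- ifelse(is.na(idx), data$Country, unname(new_names[idx]))
data
```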

How to group a column with character values in a new column in r

I have a data set with a countries column, and I want to create a new column that classifies the countries into the following categories: first world, second world, third world.
I'm relatively new to R and I'm finding it difficult to find a proper function that deals with characters!
My dataset contains the countries like this, and I have three vectors with a list of countries as shown below:
nt_final_table$`Country name`
#[1] "Finland" "Denmark" "Switzerland"
#[4] "Iceland" "Netherlands" "Norway"
#[7] "Sweden" "Luxembourg" "New Zealand"
#[10] "Austria" "Australia" "Israel"
first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea",
"Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
Second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
Third_world_countries <- c("Somalia","Niger","South Sudan")
I would want a new column that contains the following values :
First World, Second World, Third World based on the Country name column
Any help would be appreciated!
Thanks!
Here are 2 ways you could do this.
Using dplyr package
You could use case_when from the dplyr package to do this.
library(dplyr)
country_name <-c("Finland", "Denmark", "Switzerland","Iceland", "Netherlands", "Norway", "Sweden", "Luxembourg", "New Zealand",
"Austria", "Australia", "Israel")
nt_final_table <- data.frame(country_name)
first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea", "Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
third_world_countries <- c("Somalia","Niger","South Sudan")
nt_final_table_categorized <- nt_final_table %>% mutate(category = case_when(country_name %in% first_world_countries ~ "First",
country_name %in% second_world_countries ~ "Second",
country_name %in% third_world_countries ~ "Third",
TRUE ~"Not listed"))
nt_final_table_categorized
Sample output
country_name category
1 Finland Not listed
2 Denmark First
3 Switzerland First
4 Iceland First
5 Netherlands First
6 Norway First
7 Sweden First
8 Luxembourg First
9 New Zealand First
10 Austria First
11 Australia First
12 Israel First
Using base R
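In base R we could create a data frame that lists the countries and their categories, then use merge to perform a left join on the 2 data frames. Note that merge sorts the result by the key column by default, and countries without a match get NA.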
country_name <-c("Finland", "Denmark", "Switzerland","Iceland", "Netherlands", "Norway", "Sweden", "Luxembourg", "New Zealand",
"Austria", "Australia", "Israel")
nt_final_table <- data.frame(country_name)
first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea", "Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
third_world_countries <- c("Somalia","Niger","South Sudan")
country_name <- c(first_world_countries,second_world_countries,third_world_countries)
categories <- c(rep("First", length(first_world_countries)),
rep("Second",length(second_world_countries)),
rep("Third",length(third_world_countries)))
all_countries_categorised <- data.frame(country_name, categories)
nt_final_table_categorized <-merge(nt_final_table, all_countries_categorised, by ="country_name", all.x=TRUE)
nt_final_table_categorized
Sample output
country_name categories
1 Australia First
2 Austria First
3 Denmark First
4 Finland <NA>
5 Iceland First
6 Israel First
7 Luxembourg First
8 Netherlands First
9 New Zealand First
10 Norway First
11 Sweden First
12 Switzerland First
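If the original row order matters, a lookup with match avoids the re-sorting that merge does. This is only a sketch under the assumption that the category lists are abbreviated as below; category_lists and lookup are hypothetical names.

```r
nt_final_table <- data.frame(country_name = c("Finland", "Denmark", "Somalia"),
                             stringsAsFactors = FALSE)

# Abbreviated category lists, for illustration only
category_lists <- list(First  = c("Australia", "Austria", "Denmark"),
                       Second = c("Albania", "Armenia", "China"),
                       Third  = c("Somalia", "Niger", "South Sudan"))

# stack() turns the named list into a two-column data frame:
# 'values' holds the countries, 'ind' holds the list name (category)
lookup <- stack(category_lists)

# match() returns the lookup row for each country (NA if not listed),
# so the rows of nt_final_table keep their original order
nt_final_table$category <- as.character(
  lookup$ind[match(nt_final_table$country_name, lookup$values)]
)
nt_final_table
```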

R - Convert a Character Vector into a Data Frame

This seems like it should be a fairly simple problem, but I can't seem to find a straightforward solution.
I have a character vector that looks like this:
my_info <- c("Fruits",
"North America",
"Apples",
"Michigan",
"Europe",
"Pomegranates",
"Greece",
"Oranges",
"Italy",
"Vegetables",
"North America",
"Potatoes",
"Idaho",
"Avocados",
"California",
"Europe",
"Artichokes",
"Italy",
"Meats",
"North America",
"Beef",
"Illinois")
I want to parse this character vector into a data frame that looks like this:
screenshot of R console
The food types and the region lists will always remain the same, but the foods and their locations are subject to change.
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
I was thinking I needed to use something like str_split, but use the food_types and regions as some sort of a delimiter? But I am not sure how to proceed. The character vector does have an order to it.
Thank you.
One solution is to first convert your my_info vector into a matrix using ncol = 4, which splits the vector into rows of a matrix/data frame.
Now you can apply the rules for food_type and region, swapping any food_type or region value that ended up in another column into its proper column.
Note: I would ask the OP to check the data; with the description provided, every 4 elements do not make up a complete row (my_info has 22 elements, which is not a multiple of 4, so the last row is filled by recycling).
df <- as.data.frame(matrix(my_info, ncol = 4, byrow = TRUE))
names(df) <- c("Foodtype", "Region", "Food", "Location")
food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")
t(apply(df,1,function(x){
for(i in seq_along(x)){
#One can think of writing a swap function here.
if(x[i] %in% region ){
temp = x[i]
x[i] = x[2]
x[2] = temp
}
#Swap any food_type wrongly placed in other column
if(x[i] %in% food_type ){
temp = x[i]
x[i] = x[1]
x[1] = temp
}
}
x
}))
# Foodtype Region Food Location
# [1,] "Fruits" "North America" "Apples" "Michigan"
# [2,] "Pomegranates" "Europe" "Greece" "Oranges"
# [3,] "Vegetables" "North America" "Italy" "Potatoes"
# [4,] "Idaho" "Europe" "California" "Avocados"
# [5,] "Meats" "North America" "Artichokes" "Italy"
# [6,] "Fruits" "North America" "Beef" "Illinois"
#
I have a long solution, but it should work as long as food and location always appear in the same order.
First, create a few data frames (tibbles) with dplyr. Note that tibble() replaces the deprecated data_frame().
library(dplyr)
info <- tibble(my_info = my_info)
region <- tibble(region_id = region, region = region)
food_type <- tibble(food_type_id = food_type, food_type = food_type)
Next, create a data.frame that joins all of these together, fill the missing values with tidyr, and remove the rows we do not need. The most important trick is the last one: creating a cols column based on the assumption that the order is always the same!
library(tidyr)
df <- info %>%
left_join(food_type, by = c("my_info" = "food_type_id")) %>%
left_join(region, by = c("my_info" = "region_id")) %>%
fill(food_type) %>%
group_by(food_type) %>%
fill(region) %>%
filter(!is.na(region) & !(my_info == region)) %>%
ungroup %>%
mutate(cols = rep(c("food", "location"), group_size(.)/2 ))
This returns:
# A tibble: 14 x 4
my_info food_type region cols
<chr> <chr> <chr> <chr>
1 Apples Fruits North America food
2 Michigan Fruits North America location
3 Pomegranates Fruits Europe food
4 Greece Fruits Europe location
5 Oranges Fruits Europe food
6 Italy Fruits Europe location
7 Beef Meats North America food
8 Illinois Meats North America location
9 Potatoes Vegetables North America food
10 Idaho Vegetables North America location
11 Avocados Vegetables North America food
12 California Vegetables North America location
13 Artichokes Vegetables Europe food
14 Italy Vegetables Europe location
Next use tidyr to spread the cols into food and location columns.
df <- df %>%
group_by(food_type, region, cols) %>%
mutate(ind = row_number()) %>%
spread(cols, my_info) %>%
select(-ind)
# A tibble: 7 x 4
# Groups: food_type, region [5]
food_type region food location
<chr> <chr> <chr> <chr>
1 Fruits Europe Pomegranates Greece
2 Fruits Europe Oranges Italy
3 Fruits North America Apples Michigan
4 Meats North America Beef Illinois
5 Vegetables Europe Artichokes Italy
6 Vegetables North America Potatoes Idaho
7 Vegetables North America Avocados California
This can all be done in one go, just remove the intermediate step of creating a data.frame.
Here are three alternatives. All of them use the zoo package (na.locf0 or na.locf) and the cn vector, which is defined only in the first.
1) Let cn be a vector the same length as my_info that identifies which output column each element of my_info belongs to. Let cdef be an output column definition vector of 1:4 with the output column names as its names. Then, for each output column, create a vector the same length as my_info that holds the elements belonging to that column and NA everywhere else. Use na.locf0 to fill in the NA values, and keep only the positions corresponding to column 4 (location), since each of those positions completes one output row.
library(zoo)
cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4
cdef <- c(food_type = 1, region = 2, food = 3, location = 4)
m <- sapply(cdef, function(i) na.locf0(ifelse(cn == i, my_info, NA))[cn == 4])
giving:
> m
food_type region food location
[1,] "Fruits" "North America" "Apples" "Michigan"
[2,] "Fruits" "Europe" "Pomegranates" "Greece"
[3,] "Fruits" "Europe" "Oranges" "Italy"
[4,] "Vegetables" "North America" "Potatoes" "Idaho"
[5,] "Vegetables" "North America" "Avocados" "California"
[6,] "Vegetables" "Europe" "Artichokes" "Italy"
[7,] "Meats" "North America" "Beef" "Illinois"
We have created a character matrix since the output is entirely character, but if you want a data frame anyway then use:
as.data.frame(m, stringsAsFactors = FALSE)
2) Alternatively, we can create the result from cn by putting my_info[i] into position (i, cn[i]) of an n x 4 matrix of NAs, using na.locf to fill in the NAs and taking those rows corresponding to column 4.
n <- length(my_info)
m2 <- na.locf(replace(matrix(NA, n, 4), cbind(1:n, cn), my_info))[cn == 4, ]
colnames(m2) <- c("food_type", "region", "food", "location")
identical(m2, m) # test
## [1] TRUE
3) A third alternative for creating m from cn is to construct the matrix column by column like this:
m3 <- cbind( food_type = na.locf0(ifelse(cn == 1, my_info, NA))[cn == 3],
region = na.locf0(ifelse(cn == 2, my_info, NA))[cn == 3],
food = my_info[cn == 3],
location = my_info[cn == 4])
identical(m, m3) # test
## [1] TRUE
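For comparison, the same forward-fill idea can be sketched in base R without zoo. This assumes, as the question states, that each food is immediately followed by its location and that the vector starts with a food type; type_filled, region_filled and the other helper names are hypothetical.

```r
my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")
food_type <- c("Fruits", "Vegetables", "Meats")
region <- c("North America", "Europe")

is_type <- my_info %in% food_type
is_region <- my_info %in% region

# Forward fill: cumsum() counts how many headers have been seen so far,
# which indexes the most recent header; the leading NA covers count 0
type_filled <- c(NA, my_info[is_type])[cumsum(is_type) + 1]
region_filled <- c(NA, my_info[is_region])[cumsum(is_region) + 1]

# Remaining elements alternate food, location, food, location, ...
other <- which(!is_type & !is_region)
food_idx <- other[c(TRUE, FALSE)]
location_idx <- other[c(FALSE, TRUE)]

result <- data.frame(food_type = type_filled[food_idx],
                     region = region_filled[food_idx],
                     food = my_info[food_idx],
                     location = my_info[location_idx],
                     stringsAsFactors = FALSE)
result
```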

Get continent name from country name in R

I have a data frame with one column representing country names. My goal is to add one more column which gives the continent information. Please check the following use case:
my.df <- data.frame(country = c("Afghanistan","Algeria"))
Is there a package that I can use to append a column of data containing the continent names without having the original data?
You can use the countrycode package for this task.
library(countrycode)
df <- data.frame(country = c("Afghanistan",
"Algeria",
"USA",
"France",
"New Zealand",
"Fantasyland"))
df$continent <- countrycode(sourcevar = df[, "country"],
origin = "country.name",
destination = "continent")
#warning
#In countrycode(sourcevar = df[, "country"], origin = "country.name", :
# Some values were not matched unambiguously: Fantasyland
Result
df
# country continent
#1 Afghanistan Asia
#2 Algeria Africa
#3 USA Americas
#4 France Europe
#5 New Zealand Oceania
#6 Fantasyland <NA>
Expanding on Markus' answer, countrycode takes its continent values from the 'continent' column of its codelist data set.
?codelist
Definition of continent:
continent: Continent as defined in the World Bank Development Indicators
The question asked for continents, but sometimes continents don't provide enough groups to delineate the data. For example, the continent grouping combines North and South America into Americas.
What you might want is region:
region: Regions as defined in the World Bank Development Indicators
It is unclear how the World Bank groups regions, but the code below shows that this destination is more granular. (The exact region labels returned may differ between countrycode versions.)
library(countrycode)
egnations <- c("Afghanistan","Algeria","USA","France","New Zealand","Fantasyland")
countrycode(sourcevar = egnations, origin = "country.name",destination = "region")
Output:
[1] "Southern Asia"
[2] "Northern Africa"
[3] "Northern America"
[4] "Western Europe"
[5] "Australia and New Zealand"
[6] NA
You can try
my.df <- data.frame(country = c("Afghanistan","Algeria"),
continent= as.factor(c("Asia","Africa")))
merge(my.df, raster::ccodes()[,c("NAME", "CONTINENT")], by.x="country", by.y="NAME", all.x=TRUE)
# country continent CONTINENT
# 1 Afghanistan Asia Asia
# 2 Algeria Africa Africa
Some country values might need an adjustment; I dunno since you did not provide all values.

Searching for multiple text patterns in R

This question is related to: Searching a data.frame in R
I want to search for multiple patterns, e.g. 'america' and 'united', either in all fields or in a given field.
How can this be done? The case needs to be ignored.
Data:
ddf
id country area
1 1 United States of America North America
2 2 United Kingdom Europe
3 3 United Arab Emirates Arab
4 4 Saudi Arabia Arab
5 5 Brazil South America
ddf = structure(list(id = 1:5, country = c("United States of America",
"United Kingdom", "United Arab Emirates", "Saudi Arabia", "Brazil"
), area = c("North America", "Europe", "Arab", "Arab", "South America"
)), .Names = c("id", "country", "area"), class = "data.frame", row.names = c(NA,
-5L))
EDIT: To clarify, I have to search with AND and not OR. In this example, only 'United States of America' (row number 1) should come. If I search for 'brazil' and 'america', row number 5 should come (i.e. different search strings can be in different columns).
This first attempt actually fails for the "brazil" & "america" case, but it was a useful test-bed for diagnosing the logical problems:
hasAm <- sapply( ddf, grepl, patt="america", ignore.case=TRUE)
ddf[ rowSums(hasAm) > 0 , ]
#----------
id country area
1 1 United States of America North America
5 5 Brazil South America
#---------
hasUn <- sapply( ddf, grepl, patt="united", ignore.case=TRUE)
#---------
ddf[ rowSums( hasAm & hasUn) > 0 , ]
#-----------
id country area
1 1 United States of America North America
This edited version generalizes that strategy although it requires entering the selection criteria as a formula. I needed to first collapse each matrix so that summing across the cbind()-ed values didn't pick up multiple hits on a single term. So I have two rowSums, the outer one being done on m-column matrices where m is the number of terms in the formula, and the inner one being done on n-column matrices where n is the number of columns in the data-argument:
dfsel <- function(form, data) {
vars = all.vars(form)
selmatx <- lapply( vars, function(v)
sapply (data, grepl, patt=v, ignore.case=TRUE))
data[ rowSums( do.call(cbind,
lapply(selmatx,
function(L) {rowSums(L) > 0}) ) ) == length(vars)
, ] }
Demonstration:
> res <- dfsel( ~ united + america , ddf)
> res
id country area
1 1 United States of America North America
> res <- dfsel( ~ brazil + america , ddf)
> res
id country area
5 5 Brazil South America
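Dumb way of solving it (note that this returns rows matching any of the patterns, i.e. OR rather than AND, is case-sensitive, and only looks at the country column). Interested in other answers.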
pattern<-c('America','United')
ddf1<-NULL
for (i in 1:length(pattern)){
new<-ddf[grep(paste0(pattern[i]),ddf$country),]
ddf1<-rbind(ddf1,new)
}
Going on the logic that no country in the world has "America" before "United" in its name, you could do
> f <- lapply(ddf, grep, pattern = "(united)(.*)(america)", ignore.case = TRUE)
> ddf[unique(unlist(f)), ]
# id country area
# 1 1 United States of America North America
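A more general AND search that does not depend on the order of the terms can be sketched as follows. match_all is a hypothetical helper, not from any package: for each pattern it checks whether any of the chosen columns matches (ignoring case), then keeps the rows where every pattern was found somewhere.

```r
ddf <- data.frame(id = 1:5,
                  country = c("United States of America", "United Kingdom",
                              "United Arab Emirates", "Saudi Arabia", "Brazil"),
                  area = c("North America", "Europe", "Arab", "Arab",
                           "South America"),
                  stringsAsFactors = FALSE)

match_all <- function(data, patterns, cols = names(data)) {
  # For one pattern: TRUE for rows where at least one column matches
  hit_any_col <- function(p) {
    Reduce(`|`, lapply(data[cols], grepl, pattern = p, ignore.case = TRUE))
  }
  # AND across patterns: every pattern must be found in some column
  keep <- Reduce(`&`, lapply(patterns, hit_any_col))
  data[keep, , drop = FALSE]
}

match_all(ddf, c("united", "america"))                    # all fields
match_all(ddf, c("brazil", "america"))                    # terms in different columns
match_all(ddf, c("united", "america"), cols = "country")  # a given field
```

Restricting cols to a single column covers the "in a given field" case, while the default searches all fields.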
