How to apply multiple if statements in R? - r

I have a data frame (df) that lists the countries associated with every site
Site Country
Site1 USA
Site2 Vietnam
Site3 Spain
Site4 Germany
Site5 China
I want to attach a column, where for each country I associate its corresponding continent. I wrote a simple if loop to do this:
df$Continent <- NA
if(df$Country == "USA" |df$Country == "Canada" |df$Country == "Mexico")
{df$Continent <- "North America"}
if(df$Country == "Spain" |df$Country == "France" |df$Country == "Germany")
{df$Continent <- "Europe"}
## .. etc
summary(df)
However, each time I run it, I find that df has North America assigned to all the countries. I understand that this may sound trivial, but does it make a difference if I use if statements everywhere rather than else or if else? Any suggestions for correcting this?

Build a lookup table and merge() it with the data.
For example:
lookup <- data.frame(Country = c("USA", "Canada", "Mexico",
"Spain", "France", "Germany",
"Vietnam", "China"),
Continent = rep(c("North America", "Europe", "Asia"),
times = c(3,3,2)))
Using your snippet of data as data frame df, we can add Continent via merge() (a join in database terminology):
> merge(df, lookup, sort = FALSE, all.x = TRUE)
Country Site Continent
1 USA Site1 North America
2 Vietnam Site2 Asia
3 Spain Site3 Europe
4 Germany Site4 Europe
5 China Site5 Asia
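The same lookup idea can be sketched without merge(), using a named character vector indexed by country. The name continent_of is illustrative and not from the original answer, and the sketch assumes Country is stored as character rather than as a factor:

```r
# Example data from the question.
df <- data.frame(
  Site    = c("Site1", "Site2", "Site3", "Site4", "Site5"),
  Country = c("USA", "Vietnam", "Spain", "Germany", "China"),
  stringsAsFactors = FALSE
)

# Named vector (hypothetical name): names are countries, values are continents.
continent_of <- c(
  USA = "North America", Canada = "North America", Mexico = "North America",
  Spain = "Europe", France = "Europe", Germany = "Europe",
  Vietnam = "Asia", China = "Asia"
)

# Index by name; countries missing from the lookup become NA.
df$Continent <- unname(continent_of[df$Country])
df
```

If Country were a factor, wrapping it in as.character() before indexing would avoid indexing by the integer codes. Unlike merge(), this keeps the original row and column order.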

If you're working with a factor you can also do some nonsense with levels, or levels<- to be exact:
`levels<-`(df$Country, list(
`North America` = c("USA","Canada","Mexico"),
`Europe` = c("Spain","France","Germany"),
`Asia` = c("Vietnam","China")
))
#[1] North America Asia Europe Europe Asia
#Levels: North America Europe Asia

I like ifelse() for things like this. You could use it with the %in% operator like this:
df$Continent <- ifelse(df$Country %in% c("USA", "Canada", "Mexico"),
"North America", df$Continent)
df$Continent <- ifelse(df$Country %in% c("Spain", "France", "Germany"),
"Europe", df$Continent)
df
Site Country Continent
1 Site1 USA North America
2 Site2 Vietnam <NA>
3 Site3 Spain Europe
4 Site4 Germany Europe
5 Site5 China <NA>
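As for why the original attempt misbehaves: if() expects a single TRUE or FALSE, not one value per row. Older versions of R used only the first element of the condition (with a warning), and R 4.2.0 and later raise an error. A minimal sketch of the vectorised alternative using logical indexing, with illustrative variable names:

```r
df <- data.frame(
  Site    = paste0("Site", 1:5),
  Country = c("USA", "Vietnam", "Spain", "Germany", "China"),
  stringsAsFactors = FALSE
)

df$Continent <- NA_character_

# Each condition is a logical vector with one entry per row.
in_north_america <- df$Country %in% c("USA", "Canada", "Mexico")
in_europe        <- df$Country %in% c("Spain", "France", "Germany")

# Assign only to the rows where the condition is TRUE.
df$Continent[in_north_america] <- "North America"
df$Continent[in_europe]        <- "Europe"
df
```

Rows that match none of the conditions keep NA, as in the ifelse() output above.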

Related

Trying to find values within excel cell based on given pairs in R df

I am using this excel sheet that I have currently read into R: https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx
dput(head(remittance, 5))
The output is:
structure(list(`Remittance-receiving country (across) - Remittance-sending country (down)` = c("Australia",
"Brazil", "Canada"), Brazil = c("27.868809286999106", "0", "31.284184411144214"
), Canada = c("46.827693406219382", "1.5806325278762619", "0"
), `Czech Republic` = c("104.79905129342241", "3.0488843262423089",
"176.79676736179096"), Finland = c("26.823089572300752", "1.3451674211686246",
"37.781150857376964"), France = c("424.37048861305249", "123.9763417712491",
"1296.7352242506483"), Germany = c("556.4140279523856", "66.518143815367239",
"809.9621650533453"), Hungary = c("200.08597014449356", "11.953328254521287",
"436.0811601171776"), Indonesia = c("172.0021287331823", "1.3701340430259537",
"33.545925908780198"), Italy = c("733.51652291459231", "116.74264895322995",
"1072.1119887588022"), `Korea, Rep.` = c("259.97044386689589",
"20.467939414361016", "326.94157937864327"), Netherlands = c("133.48932759488602",
"4.7378343766684532", "181.28828076733771"), Philippines = c("1002.3593555086774",
"1.5863355979877207", "2369.5223195675494"), Poland = c("109.73486651698796",
"5.8313637459523129", "341.10408952685464"), `Russian Federation` = c("19.082541158574934",
"1.0136604494838692", "58.760989426089431"), `Saudi Arabia` = c("13.578431465294949",
"0.32506772760873404", "15.511213677040857"), Sweden = c("91.887827513176489",
"5.1132733094740352", "65.860232580192786"), Thailand = c("383.08245004577498",
"2.7410805494977684", "79.370683058792849"), `United Kingdom` = c("1084.0742194994727",
"4.2050614573174592", "568.62605950140266"), `United States` = c("188.06242727403128",
"49.814372612310521", "661.98049661387927"), WORLD = c("5578.0296723604206",
"422.37127035334271", "8563.264510816849")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I currently have a dataframe of two columns "Source" and "Destination" where each row is a pair of countries which I created by doing:
countries = c("Australia","Brazil", "Canada", "Czech Republic", "Germany", "Finland", "United Kingdom", "Italy", "Poland", "Russian Federation", "Sweden", "United States", "Philippines", "France", "Netherlands", "Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs = t(combn(countries, 2))
I would like to use each pair to extract its corresponding value from the excel sheet above. (In the Excel sheet "Source" is the first column of countries-down and "Destination" is the first row of countries-across.)
For example a sample of the df that I have looks as follows (it currently contains 190 pairs):
pairs = data.frame(Source = c("Australia", "Australia", "Australia"), Destination = c("Brazil", "Canada", "Czech Republic"))
Where the first pair in my df is (Australia, Brazil) which corresponds to a value of 27.868809286999106 from the excel sheet that I reproduced above. Is there a built-in R function that would match the pairs from my df to extract its corresponding value? Thanks
Perhaps what you need is tidyr::pivot_longer?
library(dplyr)
library(tidyr) # pivot_longer() lives in tidyr, not dplyr
colnames(remittance)[1] <- 'source'
remittance %>% pivot_longer(-source, names_to = 'destination')
#----
# A tibble: 60 x 3
source destination value
<chr> <chr> <chr>
1 Australia Brazil 27.868809286999106
2 Australia Canada 46.827693406219382
3 Australia Czech Republic 104.79905129342241
4 Australia Finland 26.823089572300752
Note remittance is the dataframe in the OP dput.
Probably you are interested in keeping the flexibility of your nice combn approach.
To loop over your pairs data frame (it's actually a matrix though) you may use apply with MARGIN=1 to go row-wise. In the FUN= argument we create data frames of one row each, with source corresponding to column 1 of pairs and destination to column 2. The distance (or whatever this value is) we get by subsetting remittance at the corresponding rows and columns (for brevity I shortened the name to rem).
Since we will get a list of single-line data frames, we want to rbind, and because we have multiple objects we need do.call.
res <- do.call(rbind,
apply(pairs, MARGIN=1, FUN=function(x)
data.frame(source=x[1], destination=x[2],
dist=as.integer(rem[rem[, 1] == x[1], rem[1, ] == x[2]])))
)
Since the .xlsx has zeros where there should actually be NAs, we should declare them as such in the result.
res[res == 0] <- NA
Result
head(res, 25)
# source destination dist
# 1 Australia Brazil 721
# 2 Australia Canada 24721
# 3 Australia Czech Republic 1074
# 4 Australia Germany 13938
# 5 Australia Finland 1121
# 6 Australia United Kingdom 135000
# 7 Australia Italy 19350
# 8 Australia Poland 974
# 9 Australia Russian Federation 543
# 10 Australia Sweden 3988
# 11 Australia United States 93179
# 12 Australia Philippines 4118
# 13 Australia France 8475
# 14 Australia Netherlands 10697
# 15 Australia Hungary 997
# 16 Australia Saudi Arabia NA
# 17 Australia Thailand 11298
# 18 Australia Korea, Rep. 5381
# 19 Australia Indonesia 11094
# 20 Brazil Canada 26647
# 21 Brazil Czech Republic 742
# 22 Brazil Germany 44000
# 23 Brazil Finland 1378
# 24 Brazil United Kingdom 55772
# 25 Brazil Italy 104779
Data:
u <- "https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx"
rem <- openxlsx::read.xlsx(u)
countries <- c("Australia", "Brazil", "Canada", "Czech Republic", "Germany",
"Finland", "United Kingdom", "Italy", "Poland", "Russian Federation",
"Sweden", "United States", "Philippines", "France", "Netherlands",
"Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs <- t(combn(countries, 2))
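The pair lookup can also be sketched without a loop, by indexing a matrix with a two-column character matrix of (row name, column name) pairs. The table rem_small below is a small stand-in with values rounded from the question's dput, not the real spreadsheet; in the real data the values are stored as text and would first need as.numeric().

```r
# Small stand-in for the remittance table (illustrative values).
rem_small <- data.frame(
  source = c("Australia", "Brazil", "Canada"),
  Brazil = c(27.87, 0, 31.28),
  Canada = c(46.83, 1.58, 0),
  `Czech Republic` = c(104.80, 3.05, 176.80),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Numeric matrix: source countries as row names, destinations as column names.
m <- as.matrix(rem_small[, -1])
rownames(m) <- rem_small$source

pairs_df <- data.frame(
  Source      = c("Australia", "Australia", "Australia"),
  Destination = c("Brazil", "Canada", "Czech Republic"),
  stringsAsFactors = FALSE
)

# A two-column character matrix is matched against the dimnames,
# giving one value per (Source, Destination) pair.
pairs_df$value <- m[cbind(pairs_df$Source, pairs_df$Destination)]
pairs_df
```

This relies on every Source appearing among the row names and every Destination among the column names; a pair that is missing from the dimnames raises a subscript error rather than returning NA.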

Changing spelling for multiple words at a time in R/replacing many words at once

I have a dataset (survey) and a column of birth_country, where people have written their country of birth. An example of it:
1 america
2 usa
3 american
4 us of a
5 united states
6 england
7 english
8 great britain
9 uk
10 united kingdom
how I would like it to look:
1 america
2 america
3 america
4 america
5 america
6 uk
7 uk
8 uk
9 uk
10 uk
I have tried using str_replace to manually insert the different spellings, to replace them with 'america' but when I look at my dataset, nothing has changed
e.g.
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")
thank you in advance
Come up with some patterns that only match for each country and basically loop over what you are already doing (you can change the replacement below with your favorite function)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
america = 'amer|us|states',
uk = 'eng|brit|king|uk',
'another country' = 'ano|an co',
chaz = 'chaz|chop'
)
f <- function(x, list) {
for (ii in seq_along(list)) {
x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
}
x
}
## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk" "uk" "uk" "uk" "uk"
within(survey, {
clean <- f(birth_country, l)
})
# birth_country clean
# 1 america america
# 2 usa america
# 3 american america
# 4 us of a america
# 5 united states america
# 6 england uk
# 7 english uk
# 8 great britain uk
# 9 uk uk
# 10 united kingdom uk
Note that 1) if you don't give a pattern that matches, nothing will change, but 2) if you give a pattern that matches both countries (e.g., "united"), the first in the list will be used (unless the replacement itself is also matched)
Looks like the problem is in how you specified your regular expression. Try this (updated based on @Gabriella's comment, and another tidyverse approach, similar to @MarBIo's):
library(tidyverse)
survey <- survey %>%
mutate(birth_country = if_else(
str_detect(birth_country,
"(united state)|(united statea)|(united states of america)"), #If your regular expression matches any in birth_country
"america", #Change it to "america"
birth_country #Otherwise, keep as is.
) #end of if_else
) #end of mutate
Other people are suggesting you come up with a more complex regular expression, which you can certainly do as well. Consecutive "or" (i.e. "|") alternatives in your regular expression work fine, though.
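To illustrate what goes wrong with the pattern in the question, here is a minimal base R sketch (grepl() and sub() stand in for the stringr functions; the test strings are made up). The double quotes inside the single-quoted pattern are literal characters, so the pattern only matches text that itself contains quotes:

```r
x <- c("united state", "united states of america", "england")

# Pattern from the question: the inner double quotes (and the leading space)
# are part of the pattern, so nothing in x matches.
quoted <- ' "united state"|"united statea"|"united states of america"'
grepl(quoted, x)

# Without the inner quotes, each alternative is plain text.
plain <- "united state|united statea|united states of america"
grepl(plain, x)

# Replacing only the matched part leaves the rest of the string behind.
sub(plain, "america", "united states")
```

The last call returns "americas", which is one reason to detect a match and then overwrite the whole value, as the if_else() version above does, rather than substituting within the string.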
In case you allow tidyverse's mutate you can do:
library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
americas <- c("america", "usa", "american", "us of a", "united states")
englands <- c("england", "english", "great britain")
survey %>%
mutate(birth_country = ifelse(birth_country %in% americas, 'america', 'UK'))
#> birth_country
#> 1 america
#> 2 america
#> 3 america
#> 4 america
#> 5 america
#> 6 UK
#> 7 UK
#> 8 UK
#> 9 UK
#> 10 UK

Replace a value in a data frame based on a conditional statement

I have a question very similar to this question
country continent
<chr> <chr>
1 Taiwan Asia
2 New Zealand Oceania
3 Bulgaria Europe
4 Bahamas Americas
5 Serbia Europe
6 Tajikistan Asia
7 Southern Sub-Saharan Africa NA
8 Cameroon Africa
9 Indonesia Asia
10 Democratic Republic of Congo Africa
How do I use a function/write a loop so that when the country is "Bahamas" that it converts the continent so that it now says South America?
The page that I linked was the closest answer I could find but it differed from my question because I am trying to manipulate one column based on the values in a different column.
I tried using ifelse() but that did not work:
gm %>%
ifelse(country == "Bahamas", continent == "S America", continent)
Any insight would be greatly appreciated!
You need to mutate:
library(dplyr)
gm %>%
mutate(continent = ifelse(country == "Bahamas", "S America", continent))
This works:
gm[,'continent'][gm[,'country'] == "Bahamas"] <- "South America"
You might get a warning message like this if "South America" is not already in the dataframe:
Warning message:
In `[<-.factor`(`*tmp*`, gm[, "country"] == "Bahamas", value = c(2L, :
invalid factor level, NA generated
This means you are trying to assign a level that does not exist yet, so you need to add the level first:
levels(gm$continent) <- c(levels(gm$continent), "South America")
gm[,'continent'][gm[,'country'] == "Bahamas"] <- "South America"
(run time on this approach [5M entries in a dataframe, 10 repeated measures] was 4x faster than the dplyr method)
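The factor-level warning only arises when continent is a factor; the tibble printed in the question shows it as character, where plain assignment works. A small sketch of both routes on made-up data (the data frames gm and gm2 below are illustrative, not the asker's data):

```r
# Route 1: continent is a factor, so add the new level before assigning.
gm <- data.frame(
  country   = c("Taiwan", "Bahamas", "Serbia"),
  continent = factor(c("Asia", "Americas", "Europe")),
  stringsAsFactors = FALSE
)
levels(gm$continent) <- c(levels(gm$continent), "South America")
gm$continent[gm$country == "Bahamas"] <- "South America"

# Route 2: convert to character first, then no level bookkeeping is needed.
gm2 <- data.frame(
  country   = c("Taiwan", "Bahamas", "Serbia"),
  continent = factor(c("Asia", "Americas", "Europe")),
  stringsAsFactors = FALSE
)
gm2$continent <- as.character(gm2$continent)
gm2$continent[gm2$country == "Bahamas"] <- "South America"

gm
gm2
```

After route 1 the old level "Americas" is still listed in levels(gm$continent) even though no row uses it; droplevels() removes such unused levels.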

Find groups that contain all elements, but do not overlap [closed]

I've been given a set of country groups and I'm trying to get a set of mutually exclusive regions so that I can compare them. The problem is that my data contains several groups, many of which overlap. How can I get a set of groups which contain all countries, but do not overlap with each other?
For example, assume that this is the list of countries in the world:
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
Assume that this is my set of groups:
df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania", "Oceania", "Commonwealth Countries"),
element = c("Angola", "France", "Germany", "France", "Australia", "New Zealand", "Australia"))
group element
1 Africa Angola
2 Western Europe France
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
7 Commonwealth Countries Australia
How could I remove overlapping groups (in this case Western Europe) to get a set of groups that contains all countries like the following:
df_solved <- data.frame(group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
element = c("Angola", "France", "Germany", "Australia", "New Zealand"))
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
One possible rule could be to minimize the number of groups, e.g. to associate an element with that group which includes the most elements.
library(data.table)
setDT(df)[, n.elements := .N, by = group][
order(-n.elements), .(group = group[1L]), by = element]
element group
1: Germany Europe
2: France Europe
3: Australia Oceania
4: New Zealand Oceania
5: Angola Africa
Explanation
setDT(df)[, n.elements := .N, by = group][]
returns
group element n.elements
1: Africa Angola 1
2: Western Europe France 1
3: Europe Germany 2
4: Europe France 2
5: Oceania Australia 2
6: Oceania New Zealand 2
7: Commonwealth Countries Australia 1
Now, the rows are ordered by decreasing number of elements and for each country the first, i.e., the "largest", group is picked. This should return a group for each country as requested.
In case of ties, i.e., an element belongs to several groups with equally many elements, you can add additional criteria when ordering, e.g., length of the group name, or just alphabetical order.
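A base R sketch of the same rule with an explicit tie-breaker, assuming that alphabetical order of the group name is an acceptable way to resolve ties (the column n and the name by_size are illustrative):

```r
df <- data.frame(
  group   = c("Africa", "Western Europe", "Europe", "Europe",
              "Oceania", "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France",
              "Australia", "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

# Number of elements in each row's group.
df$n <- as.integer(table(df$group)[df$group])

# Largest group first; ties broken alphabetically by group name.
by_size <- df[order(-df$n, df$group), ]

# Keep the first (preferred) row for every element.
result <- by_size[!duplicated(by_size$element), c("group", "element")]
result
```

Swapping df$group in the order() call for another column, such as nchar(df$group), changes the tie-breaking rule without touching the rest.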
1) If you want to simply eliminate duplicate elements then use !duplicated(...) as shown. No packages are used.
subset(df, !duplicated(element))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
2) set partitioning If each group must be wholly in or wholly out and each element may only appear once then this is a set partitioning problem:
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, "=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
3) set covering Of course there may be no exact set partition, so we could consider the set covering problem instead (same code except that "=" is replaced by ">=" in the lp line).
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, ">=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
and we could optionally then apply (1) to remove any duplicates in the cover.
4) Non-dominated groups Another approach is to remove any group whose elements form a strict subset of the elements of some other group. For example, every element in Western Europe is in Europe and Europe has more elements than Western Europe so the elements of Western Europe are a strict subset of the elements of Europe and we remove Western Europe. Using const.mat from above:
# returns TRUE if jth column of const.mat is dominated by some other column
is_dom_fun <- function(j) any(apply(const.mat[, j] <= const.mat[, -j], 2, all) &
sum(const.mat[, j]) < colSums(const.mat[, -j]))
is_dom <- sapply(seq_len(ncol(const.mat)), is_dom_fun)
subset(df, group %in% colnames(const.mat)[!is_dom])
giving:
group element
1 Africa Angola
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
If there are any duplicates left we can use (1) to remove them.
library(dplyr)
df %>% distinct(element, .keep_all=TRUE)
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
Shoutout to Axeman for beating me with this answer.
Update
Your question is ill-defined. Why is 'Europe' preferred over 'Western Europe'? Put another way, each country is assigned several groups. You want to reduce it to one group per country. How do you decide which group?
Here's one way, we always prefer the biggest:
groups <- df %>% count(group)
df %>% inner_join(groups, by='group') %>%
arrange(desc(n)) %>% distinct(element, .keep_all=TRUE)
group element n
1 Europe France 2
2 Europe Germany 2
3 Oceania Australia 2
4 Oceania New Zealand 2
5 Africa Angola 1
Here is one option with data.table
library(data.table)
setDT(df)[, head(.SD, 1), element]
Or with unique
unique(setDT(df), by = 'element')
# group element
#1: Africa Angola
#2: Europe France
#3: Europe Germany
#4: Oceania Australia
#5: Oceania New Zealand
Packages are used and it is data.table
A completely different approach would be to ignore the given groups and just look up the country names in the catalogue of UN regions, which is available in the countrycode or ISOcodes packages.
The countrycode package seems to offer the simpler interface and it also warns about country names which cannot be found in its database:
# given country names - note the deliberately misspelled last entry
World <- c("Angola", "France", "Germany", "Australia", "New Zealand", "New Sealand")
# regions
countrycode::countrycode(World, "country.name.en", "region")
[1] "Middle Africa" "Western Europe" "Western Europe" "Australia and New Zealand"
[5] "Australia and New Zealand" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "region") :
Some values were not matched unambiguously: New Sealand
# continents
countrycode::countrycode(World, "country.name.en", "continent")
[1] "Africa" "Europe" "Europe" "Oceania" "Oceania" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "continent") :
Some values were not matched unambiguously: New Sealand

Aggregate factors in Variable in R

I have this data.frame with a variable V21 in which many countries are recorded. I want to make it smaller by specifying just the continent rather than all those countries. For example, rather than 'Cuba', 'Peru' and 'Argentina' being separate levels of V21, I want them to become the level 'South America'. Here's the code I tried to use:
recode(WaveOne.test$V21, "levels("Cuba","Colombia","Costa Rica","Argentina","Chile","Ecuador","Peru","Venezuela")= 'South America'")
levels(V21)
Can you suggest what is wrong with my code or maybe a different method?
I am a complete newbie in R and its syntax.
Thank you!
========UPDATE=========
SA_countries <- c("Cuba", "Mexico", "Argentina","Jamaica", "Haiti","West Indies", "Chile", "Ecuador", "Venezuela", "Other South America", "El Salvador", "Guatemala", "Nicaragua", "Dominican Republic", "Panama", "Costa Rica", "Peru")
Asia_countries <- c("Philippines", "Vietnam", "Laos", "Cambodia", "Hmong", "Other Asia", "China", "Hong Kong", "Taiwan", "Japan", "Korea", "India", "Pakistan")
Europe_Canada <- c("Europe/Canada")
MiddleEast_Africa <- c("Middle East/Africa")
continents <- list(`South America`= SA_countries, `Asia` = Asia_countries, `Europe_Canada` = Europe_Canada, `Middle East & Africa` = MiddleEast_Africa)
levels(WaveOne.test$V21) <- c(levels(WaveOne.test$V21), names(continents))
for(i in seq_along(continents)) WaveOne.test$V21[WaveOne.test$V21 %in% continents[[i]]] <- names(continents)[i]
levels(WaveOne.test$V21)
My output however is:
levels(WaveOne.test$V21)
[1] "Cuba" "Mexico" "Nicaragua" "Colombia" "Dominican Republic" "El Salvador" "Guatemala"
[8] "Honduras" "Costa Rica" "Panama" "Argentina" "Chile" "Ecuador" "Peru"
[15] "Venezuela" "Other South America" "Haiti" "Jamaica" "West Indies" "Philippines" "Vietnam"
[22] "Laos" "Cambodia" "Hmong" "Other Asia" "China" "Hong Kong" "Taiwan"
[29] "Japan" "Korea" "India" "Pakistan" "Middle East/Africa" "Europe/Canada" "South America"
[36] "Asia" "Europe_Canada" "Middle East & Africa"
You can create a list with all of your countries and continents then reassign the values accordingly:
continents <- list(`South America`=SA_countries,
`North America` = NA_countries,
Europe=Euro_countries)
levels(df$V21) <- c(levels(df$V21), names(continents)) #necessary to add new levels
for(i in seq_along(continents)) {
df$V21[df$V21 %in% continents[[i]]] <- names(continents)[i]}
Reproducible Example
set.seed(123)
SA_countries <- c("Cuba","Colombia","Costa Rica","Argentina","Chile","Ecuador","Peru","Venezuela")
NA_countries <- c("Mexico", "USA", "Canada")
Euro_countries <- c("Germany", "France")
df <- data.frame(V21=sample(c(NA_countries, SA_countries, Euro_countries), 20, TRUE), stringsAsFactors = TRUE) # V21 must be a factor for levels() to apply
df
# V21
# 1 Cuba
# 2 Venezuela
# 3 Costa Rica
# 4 Germany
# 5 France
# 6 Mexico
# 7 Argentina
# 8 Germany
# 9 Chile
# 10 Costa Rica
# 11 France
# 12 Costa Rica
# 13 Ecuador
# 14 Chile
# 15 USA
# 16 Germany
# 17 Cuba
# 18 Mexico
# 19 Colombia
# 20 France
continents <- list(`South America`=SA_countries, `North America` = NA_countries, Europe=Euro_countries)
levels(df$V21) <- c(levels(df$V21), names(continents))
for(i in seq_along(continents)) df$V21[df$V21 %in% continents[[i]]] <- names(continents)[i]
df
# V21
# 1 South America
# 2 South America
# 3 South America
# 4 Europe
# 5 Europe
# 6 North America
# 7 South America
# 8 Europe
# 9 South America
# 10 South America
# 11 Europe
# 12 South America
# 13 South America
# 14 South America
# 15 North America
# 16 Europe
# 17 South America
# 18 North America
# 19 South America
# 20 Europe
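The loop above leaves the old country levels in place, which is what the output in the question's update shows. One way to avoid that, sketched here on a made-up factor f, is to assign a named list to levels(): each list name becomes a new level and absorbs the old levels listed under it.

```r
# Illustrative factor (not the asker's data).
f <- factor(c("Cuba", "Peru", "Germany", "USA", "France", "Chile"))

# Each list name is a new level; its value lists the old levels it replaces.
levels(f) <- list(
  "South America" = c("Cuba", "Peru", "Chile"),
  "North America" = c("USA"),
  "Europe"        = c("Germany", "France")
)

f
levels(f)
```

Old levels that are not mentioned in the list become NA, so every country has to be listed. With the loop approach, calling droplevels() on the column afterwards should remove the unused country levels instead.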

Resources