Convert list to data frame, adjusting compound names - r

I have the following hypothetical list
test <- list(a = c("United", "States", "of", "America", "2021", "North", "America"),
b = c("Canada", "2021", "North", "America"),
c = c("Morocco", "2021", "Africa"),
d = c("South", "Africa", "2021", "Africa"),
e = c("Faroe", "Islands", "2021", "Europe"),
f = c("Spain", "2021", "Europe"))
I would like to produce the following tibble:
country                   year  continent
United States of America  2021  North America
Canada                    2021  North America
Morocco                   2021  Africa
South Africa              2021  Africa
Faroe Islands             2021  Europe
Spain                     2021  Europe
I tried to use the ldply() function of the plyr package. However, my list elements have unequal lengths, because of compound names.
How could I join this data in a tibble with the variables: country, year and continent, for example?

An option is to use rleid to create a grouping based on the occurrence of digits in each list element, then paste the pieces within each group and rbind the results
library(data.table)
out <- type.convert(do.call(rbind.data.frame, lapply(test, function(x)
tapply(x, rleid(grepl('\\d+', x)), paste, collapse=' '))), as.is = TRUE)
colnames(out) <- c('country', 'year', 'continent')
row.names(out) <- NULL
Output:
out
# country year continent
#1 United States of America 2021 North America
#2 Canada 2021 North America
#3 Morocco 2021 Africa
#4 South Africa 2021 Africa
#5 Faroe Islands 2021 Europe
#6 Spain 2021 Europe
Or use a similar option with rle from base R
out <- type.convert(do.call(rbind.data.frame,
lapply(test, function(x) tapply(x, with(rle(grepl('\\d+', x)),
rep(seq_along(values), lengths)), FUN = paste, collapse=' '))),
as.is = TRUE)
colnames(out) <- c('country', 'year', 'continent')
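To see what the run-length grouping is doing, here is a minimal sketch for a single list element, using only base R. The variable names (x, run_id, parts) are illustrative and do not come from the answer above.

```r
# One element of the list from the question
x <- c("United", "States", "of", "America", "2021", "North", "America")

# TRUE where the token contains digits, FALSE otherwise
is_num <- grepl("\\d+", x)

# Give every run of consecutive equal values its own id
r <- rle(is_num)
run_id <- rep(seq_along(r$values), r$lengths)

# Paste the tokens within each run back together
parts <- as.vector(tapply(x, run_id, paste, collapse = " "))
parts
```

Each run id marks one field, so the words before the year, the year itself, and the words after it end up as three separate strings.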

Slightly less general/efficient than the other solutions here but maybe more transparent?
cfun <- function(x) {
## find position of numeric value
numpos <- grep("^[0-9]+$", x)
## combine elements appropriately
list(country=paste(x[1:(numpos-1)], collapse=" "),
year=x[numpos],
continent=paste(x[(numpos+1):length(x)], collapse=" "))
}
purrr::map_dfr(test,cfun)
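For readers without purrr, the same idea can be sketched in base R. The helper name split_parts is hypothetical, and the sketch assumes that each element contains one all-digit token that is neither the first nor the last entry.

```r
test <- list(a = c("United", "States", "of", "America", "2021", "North", "America"),
             b = c("Canada", "2021", "North", "America"),
             c = c("Morocco", "2021", "Africa"),
             d = c("South", "Africa", "2021", "Africa"),
             e = c("Faroe", "Islands", "2021", "Europe"),
             f = c("Spain", "2021", "Europe"))

# Hypothetical helper: split one character vector around its year token
split_parts <- function(x) {
  numpos <- grep("^[0-9]+$", x)[1]  # position of the first all-digit token
  data.frame(country   = paste(x[seq_len(numpos - 1)], collapse = " "),
             year      = as.integer(x[numpos]),
             continent = paste(x[seq(numpos + 1, length(x))], collapse = " "),
             stringsAsFactors = FALSE)
}

# One single-row data frame per element, stacked into one result
out <- do.call(rbind, lapply(test, split_parts))
rownames(out) <- NULL
out
```

The result is a plain data frame; it can be passed to tibble::as_tibble() if a tibble is required.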

In base R you could do:
a <- do.call(rbind, lapply(test, function(x) paste(sub("(\\d+)",",\\1,", x), collapse = " ")))
read.csv(text=a, col.names = c("Country","Year","Continent"), h=FALSE)
Country Year Continent
1 United States of America 2021 North America
2 Canada 2021 North America
3 Morocco 2021 Africa
4 South Africa 2021 Africa
5 Faroe Islands 2021 Europe
6 Spain 2021 Europe

Here is another base R option
type.convert(setNames(
data.frame(
do.call(
rbind,
lapply(
test,
function(v) {
tapply(v,
cumsum(c(1, diff(grepl("\\d+", v)) != 0)), paste0,
collapse = " "
)
}
)
)
), c("Country", "Year", "Continent")
),
as.is = TRUE
)
which gives
Country Year Continent
a United States of America 2021 North America
b Canada 2021 North America
c Morocco 2021 Africa
d South Africa 2021 Africa
e Faroe Islands 2021 Europe
f Spain 2021 Europe

Related

How to re-order the columns after splitting it?

I have a data frame that contains a list of countries, and it has been split using the cSplit function.
The code is as follows:
df <- data.frame(country = c("India, South Africa", "United Kingdom, United States, India",
"England, Australia, South Africa, Germany, United States"))
splitstackshape::cSplit(df, "country", sep = ", ")
# country_1 country_2 country_3 country_4 country_5
#1: India South Africa <NA> <NA> <NA>
#2: United Kingdom United States India <NA> <NA>
#3: England Australia South Africa Germany United States
I wish to rearrange the columns in such a manner that the country_1 column contains either United States or <NA>. Similarly, country_2 and country_3 should contain India or <NA> and United Kingdom or <NA>, respectively. From country_4 onwards, the remaining values can follow the order in which they appear in the row.
Expected output is as follows,
#Expected Output
# country_1 country_2 country_3 country_4 country_5 country_6 country_7
#1 <NA> India <NA> South Africa <NA> <NA> <NA>
#2 United States India United Kingdom <NA> <NA> <NA> <NA>
#3 United States <NA> <NA> England Australia South Africa Germany
A very ugly solution using apply:
df1 <- splitstackshape::cSplit(df, "country", sep = ", ")
n <- length(unique(na.omit(unlist(df1))))
as.data.frame(t(apply(df1, 1, function(x) {
x1 <- rep(NA, n)
if(any(x == 'United States', na.rm = TRUE)) x1[1] <- 'United States'
if(any(x == 'India', na.rm = TRUE)) x1[2] <- 'India'
if(any(x == 'United Kingdom', na.rm = TRUE)) x1[3] <- 'United Kingdom'
temp <- setdiff(x, x1)
if(length(temp)) x1[4:(4 + length(temp) - 1)] <- temp
x1
})))
# V1 V2 V3 V4 V5 V6 V7
#1 <NA> India <NA> South Africa <NA> <NA> <NA>
#2 United States India United Kingdom <NA> <NA> <NA> <NA>
#3 United States <NA> <NA> England Australia South Africa Germany

Replace a value in a data frame based on a conditional statement

I have a question very similar to this question
country continent
<chr> <chr>
1 Taiwan Asia
2 New Zealand Oceania
3 Bulgaria Europe
4 Bahamas Americas
5 Serbia Europe
6 Tajikistan Asia
7 Southern Sub-Saharan Africa NA
8 Cameroon Africa
9 Indonesia Asia
10 Democratic Republic of Congo Africa
How do I use a function or write a loop so that when the country is "Bahamas", the continent is changed to South America?
The page that I linked was the closest answer I could find but it differed from my question because I am trying to manipulate one column based on the values in a different column.
I tried using ifelse() but that did not work:
gm %>%
ifelse(country == "Bahamas", continent == "S America", continent)
Any insight would be greatly appreciated!
You need to mutate:
library(dplyr)
gm %>%
mutate(continent = ifelse(country == "Bahamas", "S America", continent))
This works:
gm[,'continent'][gm[,'country'] == "Bahamas"] <- "South America"
You might get a warning message like this if continent is a factor and "South America" is not already one of its levels:
Warning message:
In `[<-.factor`(`*tmp*`, gm[, "country"] == "Bahamas", value = c(2L, :
invalid factor level, NA generated
This means you are trying to assign a level that doesn't exist, so you need to add the level first:
levels(gm$continent) <- c(levels(gm$continent), "South America")
gm[,'continent'][gm[,'country'] == "Bahamas"] <- "South America"
(run time on this approach [5M entries in a dataframe, 10 repeated measures] was 4x faster than the dplyr method)
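If the factor structure is not needed, one way to sidestep the warning is to convert the column to character before assigning. This is a minimal sketch, assuming a two-row excerpt of the data in the question stands in for gm.

```r
# Stand-in for gm: continent stored as a factor, as in the warning above
gm <- data.frame(country   = c("Bahamas", "Serbia"),
                 continent = factor(c("Americas", "Europe")),
                 stringsAsFactors = FALSE)

# Character vectors accept any new value, so no level has to be added first
gm$continent <- as.character(gm$continent)
gm$continent[gm$country == "Bahamas"] <- "South America"
gm
```

Whether to keep the factor and extend its levels, or to switch to character, depends on how the column is used later (for example in plotting or modelling).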

R column mapping

How can I map a column of one CSV file to a column of another CSV file in R, when both have the same data type?
For example, the first column of data frame A consists of text that contains a country name, while a column of the second data frame B contains a standard list of all countries. I need to map every row of the first data frame to the standard country column.
For example, the column (location) of data frame A consists of 10000 rows of data like this:
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have another column (country) of data frame B as
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location column matches the country column, I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore
At first this looks like a trivial question, but it is not. The approach works like this:
1. Convert the location string into a vector using strsplit and unlist.
2. Check whether each string in the vector appears in the country column. If it does, store the country name in res; if not, store 'notfound'.
3. Finally, check whether res contains a country name. If it does, return it; otherwise return the original location.
df1 <- data.frame(location = c('Sydney, Australia',
'Aarhus C, Central Region, Denmark',
'Auckland, New Zealand',
'Mumbai Area, India',
'Singapore'),stringsAsFactors = F)
df2 <- data.frame(country = c('India',
'USA',
'New Zealand',
'UK',
'Singapore',
'Denmark',
'China'),stringsAsFactors = F)
library(stringr)  # str_trim() comes from stringr
get_values <- function(i)
{
val <- unlist(strsplit(i, split = ','))
val <- sapply(val, str_trim)
res <- c()
for(j in val)
{
if(j %in% df2$country) res <- append(res, j)
else res <- append(res, 'notfound')
}
if(all(res == 'notfound')) return (i)
else return (res[res!='notfound'])
}
df1$location2 <- sapply(df1$location, get_values)
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore
A solution using tidyverse. First, please convert your col2 to character by setting stringsAsFactors = FALSE because that is easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse.
library(tidyverse)  # dplyr for mutate()/select(), stringr for str_extract()
df3 <- df1 %>%
mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
col2 = ifelse(is.na(Country), col2, Country)) %>%
select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
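The same extract-then-fall-back logic can also be sketched without any packages, using regexpr() and regmatches() from base R. The variable names below are illustrative, and the sketch assumes the country names contain no regular-expression metacharacters (otherwise they would need escaping first).

```r
locations <- c("Sydney, Australia", "Aarhus C, Central Region, Denmark",
               "Auckland, New Zealand", "Mumbai Area, India", "Singapore")
countries <- c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China")

# One alternation pattern covering every known country
pattern <- paste(countries, collapse = "|")

# Position of the first match in each location (-1 when nothing matches)
m <- regexpr(pattern, locations)

# Start from the original strings and overwrite only the matched rows
result <- locations
result[m != -1] <- regmatches(locations, m)
result
```

Locations without a known country, such as "Sydney, Australia" here, are left unchanged.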
We can also start with df1, use separate_rows to separate the country name. After that, use semi_join to check if the country names are in df2. Finally, we can combine the data frame with the original df1 by rows, and then filter the first one for each id in col1. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
separate_rows(col2, sep = ", ") %>%
semi_join(df2, by = "col2") %>%
bind_rows(df1) %>%
group_by(col1) %>%
slice(1) %>%
ungroup() %>%
arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
stringsAsFactors = FALSE)
df2 <- data.frame(col1=1:7,
col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
stringsAsFactors = FALSE)
If you are looking for the countries, and they come after the cities then you can do something like this.
transform(df1,col3= sub(paste0(".*,\\s*(",paste0(df2$col2,collapse="|"),")"),"\\1",col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
> A=sub(".*,\\s(.*)","\\1",df1$col2)
> B=sapply(A,grep,df2$col2,value=T)
> transform(df1,col3=replace(A,!lengths(B),col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore

Find groups that contain all elements, but do not overlap [closed]

I've been given a set of country groups and I'm trying to get a set of mutually exclusive regions so that I can compare them. The problem is that my data contains several groups, many of which overlap. How can I get a set of groups which contain all countries, but do not overlap with each other?
For example, assume that this is the list of countries in the world:
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
Assume that this is my set of groups:
df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania", "Oceania", "Commonwealth Countries"),
element = c("Angola", "France", "Germany", "France", "Australia", "New Zealand", "Australia"))
group element
1 Africa Angola
2 Western Europe France
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
7 Commonwealth Countries Australia
How could I remove overlapping groups (in this case Western Europe) to get a set of groups that contains all countries like the following:
df_solved <- data.frame(group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
element = c("Angola", "France", "Germany", "Australia", "New Zealand"))
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
One possible rule could be to minimize the number of groups, e.g. to associate an element with that group which includes the most elements.
library(data.table)
setDT(df)[, n.elements := .N, by = group][
order(-n.elements), .(group = group[1L]), by = element]
element group
1: Germany Europe
2: France Europe
3: Australia Oceania
4: New Zealand Oceania
5: Angola Africa
Explanation
setDT(df)[, n.elements := .N, by = group][]
returns
group element n.elements
1: Africa Angola 1
2: Western Europe France 1
3: Europe Germany 2
4: Europe France 2
5: Oceania Australia 2
6: Oceania New Zealand 2
7: Commonwealth Countries Australia 1
Now, the rows are ordered by decreasing number of elements and for each country the first, i.e., the "largest", group is picked. This should return a group for each country as requested.
In case of ties, i.e., when several groups contain equally many elements, you can add additional criteria when ordering, e.g., length of the group name, or just alphabetical order.
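As an illustration of such a tie-break, here is a minimal base R sketch that orders by group size first and by group name second. It is a base R stand-in for the data.table code above, and the column name n and the object names ord and picked are made up for this example.

```r
df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe",
                           "Oceania", "Oceania", "Commonwealth Countries"),
                 element = c("Angola", "France", "Germany", "France",
                             "Australia", "New Zealand", "Australia"),
                 stringsAsFactors = FALSE)

# Number of elements in each row's group
df$n <- ave(seq_along(df$group), df$group, FUN = length)

# Largest groups first; ties between groups broken alphabetically by name
ord <- df[order(-df$n, df$group), ]

# Keep the first (i.e. preferred) group for every element
picked <- ord[!duplicated(ord$element), c("element", "group")]
picked
```

With this data the tie-break never decides the outcome, because every element that appears in two groups belongs to one group that is strictly larger than the other.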
1) If you want to simply eliminate duplicate elements then use !duplicated(...) as shown. No packages are used.
subset(df, !duplicated(element))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
2) set partitioning If each group must be wholly in or wholly out and each element may only appear once then this is a set partitioning problem:
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, "=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
3) set covering Of course there may be no exact set partition, so we could consider the set covering problem instead (same code, except that "=" is replaced by ">=" in the lp line).
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, ">=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
and we could optionally then apply (1) to remove any duplicates in the cover.
4) Non-dominated groups Another approach is to remove any group whose elements form a strict subset of the elements of some other group. For example, every element in Western Europe is in Europe and Europe has more elements than Western Europe so the elements of Western Europe are a strict subset of the elements of Europe and we remove Western Europe. Using const.mat from above:
# returns TRUE if jth column of const.mat is dominated by some other column
is_dom_fun <- function(j) any(apply(const.mat[, j] <= const.mat[, -j], 2, all) &
sum(const.mat[, j]) < colSums(const.mat[, -j]))
is_dom <- sapply(seq_len(ncol(const.mat)), is_dom_fun)
subset(df, group %in% colnames(const.mat)[!is_dom])
giving:
group element
1 Africa Angola
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
If there are any duplicates left we can use (1) to remove them.
library(dplyr)
df %>% distinct(element, .keep_all=TRUE)
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
Shoutout to Axeman for beating me with this answer.
Update
Your question is ill-defined. Why is 'Europe' preferred over 'Western Europe'? Put another way, each country is assigned several groups. You want to reduce it to one group per country. How do you decide which group?
Here's one way, we always prefer the biggest:
groups <- df %>% count(group)
df %>% inner_join(groups, by='group') %>%
arrange(desc(n)) %>% distinct(element, .keep_all=TRUE)
group element n
1 Europe France 2
2 Europe Germany 2
3 Oceania Australia 2
4 Oceania New Zealand 2
5 Africa Angola 1
Here is one option with data.table
library(data.table)
setDT(df)[, head(.SD, 1), element]
Or with unique
unique(setDT(df), by = 'element')
# group element
#1: Africa Angola
#2: Europe France
#3: Europe Germany
#4: Oceania Australia
#5: Oceania New Zealand
This uses a package, namely data.table.
A completely different approach would be to ignore the given groups and simply look up the country names in the catalogue of UN regions, which is available in the countrycode and ISOcodes packages.
The countrycode package seems to offer the simpler interface, and it also warns about country names which cannot be found in its database:
# given country names - note the deliberately misspelled last entry
World <- c("Angola", "France", "Germany", "Australia", "New Zealand", "New Sealand")
# regions
countrycode::countrycode(World, "country.name.en", "region")
[1] "Middle Africa" "Western Europe" "Western Europe" "Australia and New Zealand"
[5] "Australia and New Zealand" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "region") :
Some values were not matched unambiguously: New Sealand
# continents
countrycode::countrycode(World, "country.name.en", "continent")
[1] "Africa" "Europe" "Europe" "Oceania" "Oceania" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "continent") :
Some values were not matched unambiguously: New Sealand

How to apply multiple if statements in R?

I have a data frame (df) that lists the countries associated with every site
Site Country
Site1 USA
Site2 Vietnam
Site3 Spain
Site4 Germany
Site5 China
I want to attach a column, where for each country I associate its corresponding continent. I wrote a simple if loop to do this:
df$Continent <- NA
if(df$Country == "USA" |df$Country == "Canada" |df$Country == "Mexico")
{df$Continent <- "North America"}
if(df$Country == "Spain" |df$Country == "France" |df$Country == "Germany")
{df$Continent <- "Europe"}
## .. etc
summary(df)
However, each time I run it, I find that it assigns North America to all the countries in df. I understand that this may sound trivial, but does it make a difference if I use if statements everywhere rather than else or else if? Any suggestions for correcting this?
Build a lookup table and merge() it with the data.
For example:
lookup <- data.frame(Country = c("USA", "Canada", "Mexico",
"Spain", "France", "Germany",
"Vietnam", "China"),
Continent = rep(c("North America", "Europe", "Asia"),
times = c(3,3,2)))
Using your snippet of data as data frame df, we can add Continent via merge() (a join in database terminology):
> merge(df, lookup, sort = FALSE, all.x = TRUE)
Country Site Continent
1 USA Site1 North America
2 Vietnam Site2 Asia
3 Spain Site3 Europe
4 Germany Site4 Europe
5 China Site5 Asia
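For a small mapping like this, a named character vector can serve as the lookup table, which avoids the column reordering that merge() introduces. This is a minimal sketch, assuming Country is stored as character; the name continent_of is hypothetical.

```r
df <- data.frame(Site = paste0("Site", 1:5),
                 Country = c("USA", "Vietnam", "Spain", "Germany", "China"),
                 stringsAsFactors = FALSE)

# Names are countries, values are continents
continent_of <- c(USA = "North America", Canada = "North America", Mexico = "North America",
                  Spain = "Europe", France = "Europe", Germany = "Europe",
                  Vietnam = "Asia", China = "Asia")

# Index the lookup by country name; countries missing from it give NA
df$Continent <- unname(continent_of[df$Country])
df
```

Indexing is vectorized over the whole column, unlike if(), which evaluates a single condition rather than one per row.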
If you're working with a factor you can also do some nonsense with levels, or levels<- to be exact:
`levels<-`(factor(df$Country), list(
`North America` = c("USA","Canada","Mexico"),
`Europe` = c("Spain","France","Germany"),
`Asia` = c("Vietnam","China")
))
#[1] North America Asia Europe Europe Asia
#Levels: North America Europe Asia
I like ifelse() for things like this. You could use it with the %in% operator like this:
df$Continent <- ifelse(df$Country %in% c("USA", "Canada", "Mexico"),
"North America", df$Continent)
df$Continent <- ifelse(df$Country %in% c("Spain", "France", "Germany"),
"Europe", df$Continent)
df
Site Country Continent
1 Site1 USA North America
2 Site2 Vietnam <NA>
3 Site3 Spain Europe
4 Site4 Germany Europe
5 Site5 China <NA>
