How to use pipes in condition matching in R?

I am getting an error when I try to filter the first data frame by the countries present in a second data frame, using the pipe operator inside the filter condition.
Referencing countries df
Overall_top5
########### output ###########
continent country gdpPercap
Africa Botswana 8090
Africa Equatorial Guinea 20500
Africa Gabon 19600
Africa Libya 12100
Africa Mauritius 10900
Americas Canada 51600
Americas Chile 15100
Americas Trinidad and Tobago 17100
main df
gap_longer
########### output #############
country year gdpPercap continent
Australia 2019 57100 Oceania
Botswana 2019 8090 Africa
Canada 2019 51600 Americas
Chile 2019 15100 Americas
Denmark 2019 65100 Europe
When I try the code below, I get an error:
gap_longer %>%
filter(year == 2019,
country %in% Overall_top5 %>% select(country) )
Error: Problem with `filter()` input `..1`.
x no applicable method for 'select_' applied to an object of class "logical"
i Input `..1` is `country %in% Overall_top5 %>% select(country)`.
Run `rlang::last_error()` to see where the error occurred.
How can I run this using pipes? I am able to run it using `$` but don't know how to make it work using pipes.
gap_longer %>%
filter(year == 2019,
country %in% Overall_top5$country )
Raw data
Overall_top5 <- structure(list(continent = c("Africa", "Africa", "Africa", "Africa", "Africa", "Americas", "Americas", "Americas"), country = c("Botswana", "Equatorial Guinea", "Gabon", "Libya", "Mauritius", "Canada", "Chile", "Trinidad and Tobago"), gdpPercap = c(8090L, 20500L, 19600L, 12100L, 10900L, 51600L, 15100L, 17100L)), row.names = c(NA, -8L), class = "data.frame")
gap_longer <- structure(list(country = c("Australia", "Botswana", "Canada", "Chile", "Denmark"), year = c(2019L, 2019L, 2019L, 2019L, 2019L), gdpPercap = c(57100L, 8090L, 51600L, 15100L, 65100L), continent = c("Oceania", "Africa", "Americas", "Americas", "Europe")), class = "data.frame", row.names = c(NA, -5L))

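First, you want to use pull rather than select, because select returns a data frame rather than a vector (but that alone doesn't solve your problem).
Your problem comes from operator precedence. %in% and %>% have the same precedence and are evaluated left to right, so your condition is read as (country %in% Overall_top5) %>% select(country): the logical result of %in% is piped into select, which is what the error message reports. To fix this, put the piped expression in parentheses.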
gap_longer %>%
filter(
year == 2019,
country %in% (Overall_top5 %>% pull(country))
)
#> # A tibble: 3 x 4
#> country year gdpPercap continent
#> <chr> <dbl> <dbl> <chr>
#> 1 Botswana 2019 8090 Africa
#> 2 Canada 2019 51600 Americas
#> 3 Chile 2019 15100 Americas
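One way to see the grouping for yourself is to parse the expression without evaluating it. The sketch below uses only base R's quote(), so neither the data frames nor dplyr need to be loaded; the variable names outer_op, lhs and fixed_op are just illustrative.

```r
# Parse (without evaluating) the condition from the question.
expr <- quote(country %in% Overall_top5 %>% select(country))

# The outermost call is the pipe, so its left-hand side is the whole %in% test.
outer_op <- as.character(expr[[1]])
lhs <- deparse(expr[[2]])
print(outer_op)  # "%>%"
print(lhs)       # "country %in% Overall_top5"

# With parentheses, %in% becomes the outermost call instead.
fixed <- quote(country %in% (Overall_top5 %>% pull(country)))
fixed_op <- as.character(fixed[[1]])
print(fixed_op)  # "%in%"
```

Because the pipe sits at the top of the parse tree in the first case, select() receives the logical vector produced by %in%, not the data frame.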

Parenthesis problem. Try:
gap_longer %>%
filter(year == 2019, country %in% Overall_top5$country) %>%
select(country)
or, if you want a vector of country names, not a data frame:
gap_longer %>%
filter(year == 2019, country %in% Overall_top5$country) %>%
pull(country)

Related

retain only rows and columns that match with a string vector

I have a large DF with certain columns that have a vector of character values as below. The number of columns varies from dataset to dataset as well as the number of character vectors it holds also varies.
ID Country1 Country2 Country3
1 1 Argentina, Japan,USA,Poland, Argentina,USA Pakistan
2 2 Colombia, Mexico,Uruguay,Dutch Mexico,Uruguay Afganisthan
3 3 Argentina, Japan,USA,NA Japan Khazagistan
4 4 Colombia, Mexico,Uruguay,Dutch Colombia, Dutch North Korea
5 5 India, China China Iran
I would like to match them against another string vector, shown below
vals_to_find <- c("Argentina","USA","Mexico")
If a row/column contains any of these strings, I would like to keep that row and column, remove duplicates, and finally drop the values that do not match.
the desired output is as follows
ID Countries.found
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
data
dput(df)
structure(list(ID = 1:5, Country1 = c("Argentina, Japan,USA,Poland,",
"Colombia, Mexico,Uruguay,Dutch", "Argentina, Japan,USA,NA",
"Colombia, Mexico,Uruguay,Dutch", "India, China"), Country2 = c("Argentina,USA",
"Mexico,Uruguay", "Japan", "Colombia, Dutch", "China"), Country3 = c("Pakistan",
"Afganisthan", "Khazagistan", "North Korea", "Iran")), class = "data.frame", row.names = c(NA,
-5L))
dput(df_out)
structure(list(ID = 1:4, Countries.found = c("Argentina, USA",
"Mexico", "Argentina, USA", "Mexico")), class = "data.frame", row.names = c(NA,
-4L))
If the file is instead read with one value per column, rather than each column holding several comma-separated values, I was able to do it as below
dput(df_out)
structure(list(ID = 1:5, X1 = c("Argentina", "Colombia", "Argentina",
"Colombia", "India"), X2 = c("Japan", "Mexico", "Japan", "Mexico",
"China"), X3 = c("USA", "Uruguay", "USA", "Uruguay", NA), X4 = c("Poland",
"Dutch", NA, "Dutch", NA), X5 = c("Argentina", "Mexico", "Japan",
"Colombia", "China"), X6 = c("USA", "Uruguay", NA, "Dutch", NA
), X7 = c("Pakistan", "Afganisthan", "Khazagistan", "North Korea",
"Iran")), class = "data.frame", row.names = c(NA, -5L))
df_out %>%
dplyr::select(
where(~ !all(is.na(.x)))
) %>%
dplyr::select(c(1, where(~ any(.x %in% vals_to_find)))) %>%
dplyr::mutate(dplyr::across(
tidyselect::starts_with("X"),
~ vals_to_find[match(., vals_to_find)]
)) %>%
tidyr::unite("countries_found", tidyselect::starts_with("X"),
sep = " | ", remove = TRUE, na.rm = TRUE
)
Output
ID countries_found
1 1 Argentina | USA | Argentina | USA
2 2 Mexico | Mexico
3 3 Argentina | USA
4 4 Mexico
Unite the "Country" columns, separate the values into rows to get one long column, keep the distinct values per ID, filter to those that are in vals_to_find, and summarise each ID's matches into Countries.found with toString.
library(tidyr)
library(dplyr)
df %>%
unite("Country", starts_with("Country"), sep = ",") %>%
separate_rows(Country) %>%
distinct(ID, Country) %>%
filter(Country %in% vals_to_find) %>%
group_by(ID) %>%
summarise(Countries.found = toString(Country))
output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
We may use
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(starts_with("Country"),
~ str_extract_all(.x, str_c(vals_to_find, collapse = "|")))) %>%
pivot_longer(cols = -ID, names_to = NULL,
values_to = 'Countries.found') %>%
unnest(Countries.found) %>%
distinct %>%
group_by(ID) %>%
summarise(Countries.found = toString(Countries.found))
-output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
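The same split, deduplicate, filter and collapse idea can also be sketched in base R, which may help when the tidyverse is not available. This is only an illustrative alternative to the answers above, not their method; it assumes the values are separated by commas with optional surrounding spaces, and the helper names (country_cols, found, keep) are made up here.

```r
df <- data.frame(
  ID = 1:5,
  Country1 = c("Argentina, Japan,USA,Poland,", "Colombia, Mexico,Uruguay,Dutch",
               "Argentina, Japan,USA,NA", "Colombia, Mexico,Uruguay,Dutch",
               "India, China"),
  Country2 = c("Argentina,USA", "Mexico,Uruguay", "Japan", "Colombia, Dutch", "China"),
  Country3 = c("Pakistan", "Afganisthan", "Khazagistan", "North Korea", "Iran"),
  stringsAsFactors = FALSE
)
vals_to_find <- c("Argentina", "USA", "Mexico")

# All columns whose name starts with "Country", however many there are.
country_cols <- grep("^Country", names(df), value = TRUE)

# For each row: split every cell on commas, trim spaces, and keep only
# the values of interest (intersect() also removes duplicates).
found <- vapply(seq_len(nrow(df)), function(i) {
  cells <- unlist(df[i, country_cols], use.names = FALSE)
  parts <- trimws(unlist(strsplit(cells, ",", fixed = TRUE)))
  toString(intersect(vals_to_find, parts))
}, character(1))

# Drop rows where nothing matched.
keep <- found != ""
df_out <- data.frame(ID = df$ID[keep], Countries.found = found[keep],
                     stringsAsFactors = FALSE)
print(df_out)
```

The matches come out in the order of vals_to_find, because intersect() follows the order of its first argument.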

Is there an R function to add a common 'word' to all row under a particular column

I have this dataset below
Country Sales
France 12000
Germany 2400
Italy 1000
Belgium 500
Please can you help with code to add a common 'word' to every value in the Country column? I have tried my best. Here is the intended output:
Country Sales
France - Europe 12000
Germany - Europe 2400
Italy - Europe 1000
Belgium - Europe 500
Thanks for your help.
country_data <-
data.frame(
country = c(
"France",
"Germany",
"Italy",
"Belgium"
),
sales = c(
12000,
2400,
1000,
500
)
)
country_data_2 <-
country_data |>
dplyr::mutate(
continent = dplyr::case_when(
country %in% c("France", "Germany", "Italy", "Belgium") ~ "Europe",
country %in% c("Egypt", "South Africa", "Morocco") ~ "Africa",
country %in% c("Canada", "Mexico", "United States") ~ "North America"
# ...
)
) |>
dplyr::transmute(
country = paste(
country,
continent,
sep = " - "
),
sales = sales
)
country_data_2
#> country sales
#> 1 France - Europe 12000
#> 2 Germany - Europe 2400
#> 3 Italy - Europe 1000
#> 4 Belgium - Europe 500
Created on 2022-11-08 with reprex v2.0.2
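If every row really should get the same word, as in the question's sample, a shorter base R sketch is to paste the word onto the column directly. The literal "Europe" is taken from the desired output; the data frame below just rebuilds the question's sample.

```r
country_data <- data.frame(
  Country = c("France", "Germany", "Italy", "Belgium"),
  Sales = c(12000, 2400, 1000, 500),
  stringsAsFactors = FALSE
)

# paste() is vectorised, so the same suffix is added to every row.
country_data$Country <- paste(country_data$Country, "Europe", sep = " - ")
print(country_data)
```

The case_when approach above is the better fit once different countries need different continents.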

R: Average years in time series per group

Dear Community,
I am working with R and looking for trends in time series data of bilateral exports over a duration of 20 years. As the data is fluctuating a lot between the years (and in addition is not 100% reliable), I would prefer to use four-years-average data (instead of looking at every single year separately) in order to analyze how the main export partners have changed over time.
I have the following dataset, called GrossExp3, covering the bilateral exports (in 1000 USD) of 15 reporter countries for all years between (1998 – 2019) to all available partner countries.
It covers the following four variables:
Year, ReporterName (= exporter) , PartnerName (= export destination), 'TradeValue in 1000 USD' ( = export value to the destination)
The PartnerName column also includes an entry, called “All”, which is the total sum of all exports for each year by reporter.
Here is the summary of my data
> summary(GrossExp3)
Year ReporterName PartnerName TradeValue in 1000 USD
Min. :1998 Length:35961 Length:35961 Min. : 0
1st Qu.:2004 Class :character Class :character 1st Qu.: 39
Median :2009 Mode :character Mode :character Median : 597
Mean :2009 Mean : 134370
3rd Qu.:2014 3rd Qu.: 10090
Max. :2018 Max. :47471515
My goal is to return a table that shows, for each exporter, each export destination's share of total exports (in percent) for a given period. Instead of every single year, I want to have the average data for the following periods: 2000-2003, 2004-2007, 2008-2011, 2012-2015, 2016-2019.
What I tried
My current code (created with the support of this amazing community) is the following. At the moment it shows the data for each year separately, but I need the average data in the headers.
# load packages
library(data.table)
library(dplyr)
library(tidyr)
library(stringr)
library(plyr)
library(visdat)
# set working directory
setwd("C:/R/R_09.2020/Other Indicators/Bilateral Trade Shift of Partners")
# load data
# create a file path SITC 3
path1 <- file.path("SITC Rev 3_Data from 1998.csv")
# load csv data table, call "SITC3"
SITC3 <- fread(path1, drop = c(1,9,11,13))
# prepare data (SITC3) for analysis
# Filter for GROSS EXPORTS SITC3 (Gross exports = Exports that include intermediate products)
GrossExp3 <- SITC3 %>%
filter(TradeFlowName == "Gross Exp.", PartnerISO3 != "All", Year != 2019) %>% # filter for gross exports, remove "All", remove 2019
select(Year, ReporterName, PartnerName, `TradeValue in 1000 USD`) %>%
arrange(ReporterName, desc(Year))
# compare with old subset
summary(GrossExp3)
summary(SITC3)
# calculate percentage of total
GrossExp3Main <- GrossExp3 %>%
group_by(Year, ReporterName) %>%
add_tally(wt = `TradeValue in 1000 USD`, name = "TotalValue") %>%
mutate(Percentage = 100 * (`TradeValue in 1000 USD` / TotalValue)) %>%
arrange(ReporterName, desc(Year), desc(Percentage))
head(GrossExp3Main, n = 20)
# print tables in separate sheets to get an overview about hierarchy of export partners and development over time
SpreadExpMain <- GrossExp3Main %>%
select(Year, ReporterName, PartnerName, Percentage) %>%
spread(key = Year, value = Percentage) %>%
arrange(ReporterName, desc(`2018`))
View(SpreadExpMain) # shows whole table
Here is the head of my data
> head(GrossExp3Main, n = 20)
# A tibble: 20 x 6
# Groups: Year, ReporterName [7]
Year ReporterName PartnerName `TradeValue in 100~ TotalValue Percentage
<int> <chr> <chr> <dbl> <dbl> <dbl>
1 2018 Angola China 24517058. 42096736. 58.2
2 2018 Angola India 3768940. 42096736. 8.95
3 2017 Angola China 19487067. 34904881. 55.8
4 2017 Angola India 2890061. 34904881. 8.28
5 2016 Angola China 13923092. 28057500. 49.6
6 2016 Angola India 1948845. 28057500. 6.95
7 2016 Angola United States 1525650. 28057500. 5.44
8 2015 Angola China 14320566. 33924937. 42.2
9 2015 Angola India 2676340. 33924937. 7.89
10 2015 Angola Spain 2245976. 33924937. 6.62
11 2014 Angola China 27527111. 58672369. 46.9
12 2014 Angola India 4507416. 58672369. 7.68
13 2014 Angola Spain 3726455. 58672369. 6.35
14 2013 Angola China 31947235. 67712527. 47.2
15 2013 Angola India 6764233. 67712527. 9.99
16 2013 Angola United States 5018391. 67712527. 7.41
17 2013 Angola Other Asia, ~ 4007020. 67712527. 5.92
18 2012 Angola China 33710030. 70863076. 47.6
19 2012 Angola India 6932061. 70863076. 9.78
20 2012 Angola United States 6594526. 70863076. 9.31
I am not sure whether the results I have so far are correct.
In addition, I have the following questions:
Do you have any recommendations on how to print nice-looking tables in R?
How can I round the percentages to one decimal place?
I have been stuck on these issues all week, so I would be very grateful for any recommendations on how to solve them.
Wishing you a nice weekend and all the best,
Melike
** EDIT**
here is some sample data
dput(head(GrossExp3Main, n = 20))
structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L), ReporterName = c("Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola"), PartnerName = c("China",
"India", "United States", "Spain", "South Africa", "Portugal",
"United Arab Emirates", "France", "Thailand", "Canada", "Indonesia",
"Singapore", "Italy", "Israel", "United Kingdom", "Unspecified",
"Namibia", "Uruguay", "Congo, Rep.", "Japan"), `TradeValue in 1000 USD` = c(24517058.342,
3768940.47, 1470132.736, 1250554.873, 1161852.097, 1074137.369,
884725.078, 734551.345, 649626.328, 647164.297, 575477.283, 513982.584,
468914.918, 452453.482, 425616.975, 423008.886, 327921.516, 320586.229,
299119.102, 264671.779), TotalValue = c(42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31, 42096736.31, 42096736.31,
42096736.31, 42096736.31, 42096736.31), Percentage = c(58.2398078593471,
8.9530467213552, 3.49227247731025, 2.97066942147468, 2.75995765667944,
2.55159298119945, 2.10164767046284, 1.74491281127062, 1.54317504144777,
1.53732653342598, 1.3670353890672, 1.22095589599877, 1.11389850877492,
1.07479467925527, 1.01104506502775, 1.00484959899258, 0.778971352043039,
0.761546516668669, 0.710551762961598, 0.62872279943737)), row.names = c(NA,
-20L), groups = structure(list(Year = 2018L, ReporterName = "Angola",
.rows = structure(list(1:20), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1L, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
To do what you want, you need an additional variable that groups the years together. I used cut to create it.
library(dplyr)
# Define the breaks and labels for each year group.
# Each break is the first year of a group; with right = FALSE in cut(),
# the intervals are closed on the left, e.g. [2000, 2004).
year_group_break <- c(2000, 2004, 2008, 2012, 2016, 2020)
year_group_labels <- c("2000-2003", "2004-2007", "2008-2011", "2012-2015", "2016-2019")
data %>%
# create the year group variable
mutate(year_group = cut(Year, breaks = year_group_break,
labels = year_group_labels,
include.lowest = TRUE, right = FALSE)) %>%
# calculate the total value for each Reporter + Partner in each year group
group_by(year_group, ReporterName, PartnerName) %>%
summarize(`TradeValue in 1000 USD` = sum(`TradeValue in 1000 USD`),
.groups = "drop") %>%
# calculate the percentage value for Partner of each Reporter/Year group
group_by(year_group, ReporterName) %>%
mutate(Percentage = `TradeValue in 1000 USD` / sum(`TradeValue in 1000 USD`)) %>%
ungroup()
Sample output
year_group ReporterName PartnerName `TradeValue in 1000 USD` Percentage
<fct> <chr> <chr> <dbl> <dbl>
1 2016-2019 Angola Canada 647164. 0.0161
2 2016-2019 Angola China 24517058. 0.609
3 2016-2019 Angola Congo, Rep. 299119. 0.00744
4 2016-2019 Angola France 734551. 0.0183
5 2016-2019 Angola India 3768940. 0.0937
6 2016-2019 Angola Indonesia 575477. 0.0143
7 2016-2019 Angola Israel 452453. 0.0112
8 2016-2019 Angola Italy 468915. 0.0117
9 2016-2019 Angola Japan 264672. 0.00658
10 2016-2019 Angola Namibia 327922. 0.00815
11 2016-2019 Angola Portugal 1074137. 0.0267
12 2016-2019 Angola Singapore 513983. 0.0128
13 2016-2019 Angola South Africa 1161852. 0.0289
14 2016-2019 Angola Spain 1250555. 0.0311
15 2016-2019 Angola Thailand 649626. 0.0161
16 2016-2019 Angola United Arab Emirates 884725. 0.0220
17 2016-2019 Angola United Kingdom 425617. 0.0106
18 2016-2019 Angola United States 1470133. 0.0365
19 2016-2019 Angola Unspecified 423009. 0.0105
20 2016-2019 Angola Uruguay 320586. 0.00797
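To check how cut() with right = FALSE assigns individual years, and to address the rounding question, a small base R sketch may help. The years vector is made up for illustration; the breaks and labels are the ones from the answer above.

```r
year_group_break <- c(2000, 2004, 2008, 2012, 2016, 2020)
year_group_labels <- c("2000-2003", "2004-2007", "2008-2011", "2012-2015", "2016-2019")

# Illustrative years, including one before the first break.
years <- c(1999, 2000, 2003, 2004, 2015, 2016, 2019)

groups <- cut(years, breaks = year_group_break,
              labels = year_group_labels, right = FALSE)
group_names <- as.character(groups)
print(data.frame(year = years, year_group = group_names))

# Years outside the breaks (here 1999) become NA, so they would need
# to be filtered out or given their own break.

# Rounding a percentage to one decimal place.
pct <- round(58.2398078593471, 1)
print(pct)  # 58.2
```

Since the data in the question starts in 1998, the years 1998 and 1999 would end up in an NA group unless they are removed first or an extra break is added.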

Hot encoding for a set of columns in R

I am trying to one-hot encode a subset of df columns in R.
One-hot encoding converts a categorical variable into a form that ML algorithms can use for prediction: each distinct string in a column becomes its own binary column.
Suppose we have a df that looks like this:
mes work_location birth_place
01/01/2000 China Chile
01/02/2000 Mexico Japan
01/03/2000 China Chile
01/04/2000 China Argentina
01/05/2000 USA Poland
01/06/2000 Mexico Poland
01/07/2000 USA Finland
01/08/2000 USA Finland
01/09/2000 Japan Norway
01/10/2000 Japan Kenia
01/11/2000 Japan Mali
01/12/2000 India Mali
Here's the code to hot encode :
## function to hot-encode ##
columna_dummy <- function(df, columna) {
df %>%
mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
mutate(valor = 1) %>%
spread(key = columna, value = valor, fill = 0)
}
## selecting columns ##
columnas <- c("work_location", "birth_place")
## applying loop to repeat columna_dummy function for each df column ##
for(i in 1:length(columnas)){
new_dataset <- columna_dummy(df, i)
}
Console output:
Error: Problem with `mutate()` input `mes`.
x object '1' not found
i Input `mes` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Column mes is a date-class column. It is not included in the columnas vector,
yet it still triggers the error above.
The expected output should look something like this, with one new column for each string in the selected columns
(I could not add every single column, but work_location_China is an example of
how the columns should look):
mes work_location birth_place work_location_China
01/01/2000 China Chile 1
01/02/2000 Mexico Japan 0
01/03/2000 China Chile 1
01/04/2000 China Argentina 1
01/05/2000 USA Poland 0
01/06/2000 Mexico Poland 0
01/07/2000 USA Finland 0
01/08/2000 USA Finland 0
01/09/2000 Japan Norway 0
01/10/2000 Japan Kenia 0
01/11/2000 Japan Mali 0
01/12/2000 India Mali 0
Is there any other way to apply this loop?
As we are passing strings, an option is to select the column (select can take both quoted/unquoted), create a column of 1s ('valor') and a row number column ('rn'), then do the reshaping from 'long' to 'wide' (pivot_wider)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
columna_dummy <- function(df, columna) {
df %>%
select(all_of(columna)) %>%
mutate(valor = 1, rn = row_number()) %>%
pivot_wider(names_from = all_of(columna),
values_from = valor, values_fill = 0) %>%
select(-rn)
}
-testing
For more than one column, an option is to loop over the column names of interest with map, apply the function and bind them with _dfc and bind with the original dataset (bind_cols)
out <- imap_dfc(setNames(c("work_location", "birth_place"),
c("work_location", "birth_place")) , ~ {
nm1 <- as.character(.y)
columna_dummy(df = df, columna = .x) %>%
rename_all(~ str_c(nm1, ., sep="_"))
}) %>%
bind_cols(df, .)
-output
head(out, 2)
# mes work_location birth_place work_location_China work_location_Mexico work_location_USA work_location_Japan
#1 01/01/2000 China Chile 1 0 0 0
#2 01/02/2000 Mexico Japan 0 1 0 0
# work_location_India birth_place_Chile birth_place_Japan birth_place_Argentina birth_place_Poland birth_place_Finland
#1 0 1 0 0 0 0
#2 0 0 1 0 0 0
# birth_place_Norway birth_place_Kenia birth_place_Mali
#1 0 0 0
#2 0 0 0
data
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
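For comparison, here is a minimal base R sketch of the same encoding, assuming a small subset of the sample data. The helper one_hot is hypothetical (it is not from the answer above). It also shows the loop fix: the question's loop passed the index i to the function, whereas the function expects a column name, so the sketch iterates over the names themselves.

```r
df <- data.frame(
  mes = c("01/01/2000", "01/02/2000", "01/03/2000", "01/04/2000"),
  work_location = c("China", "Mexico", "China", "China"),
  birth_place = c("Chile", "Japan", "Chile", "Argentina"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 column per distinct value of `column`.
one_hot <- function(df, column) {
  values <- unique(df[[column]])
  # outer() compares every row with every distinct value; + 0L turns
  # the logical matrix into integers.
  encoded <- outer(df[[column]], values, "==") + 0L
  colnames(encoded) <- paste(column, values, sep = "_")
  as.data.frame(encoded)
}

columnas <- c("work_location", "birth_place")

# Iterate over the column names (not their positions) and bind the
# encoded columns to the original data.
encoded_parts <- lapply(columnas, function(col) one_hot(df, col))
out <- cbind(df, do.call(cbind, encoded_parts))
print(out)
```

Collecting the pieces in a list and binding them once also avoids overwriting the result on every pass, which the original for loop did with new_dataset.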
By using purrr library I solved the issue:
## data ##
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
## function to hot-encode ##
columna_dummy <- function(df, columna) {
df %>%
mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
mutate(valor = 1) %>%
spread(key = columna, value = valor, fill = 0)
}
## vector of columns ##
columnas <- c("work_location", "birth_place")
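## hot_encoded_dataset ##
## dplyr and tidyr are needed by columna_dummy (mutate_at, spread) and inner_join ##
library(dplyr)
library(tidyr)
library(purrr)
hot_encoded_dataset <- purrr::map(columnas, columna_dummy, df = df) %>%
reduce(inner_join)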

r data.table how to do a lookup from a different data.table

The code below does what I want for a simple table: the mapping in the statement with on works perfectly. But I also have a case where a row holds several countries, which may map to several regions, and storing the result in the region column is more challenging.
library(data.table)
testDT <- data.table(country = c("Algeria", "Egypt", "United States", "Brazil"))
testDTcomplicated <- data.table(country = c("Algeria, Ghana, Sri Lanka", "Egypt", "United States, Argentina", "Brazil"))
regionLookup <- data.table(countrylookup = c("Algeria", "Argentina", "Egypt", "United States", "Brazil", "Ghana", "Sri Lanka"), regionVal = c("Africa", "South America", "Africa", "North America", "South America", "Africa", "Asia"))
testDT[regionLookup, region := regionVal, on = c(country = "countrylookup")]
> testDT
country region
1: Algeria Africa
2: Egypt Africa
3: United States North America
4: Brazil South America
I'd like to have testDTcomplicated look like the following
> testDTcomplicated
country region
1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
2: Egypt Africa
3: United States, Argentina North America, South America
4: Brazil South America
You could split the data on comma and get each country in a separate row, join the data with regionLookup and collapse them again in one value in a comma-separated string.
library(data.table)
testDTcomplicated[, row := seq_len(.N)]
new <- splitstackshape::cSplit(testDTcomplicated, 'country', ',',
direction = 'long')[regionLookup, region := regionVal,
on = c(country = "countrylookup")]
new <- new[, lapply(.SD, toString), row][,row:=NULL]
new
# country region
#1: Algeria, Ghana, Sri Lanka Africa, Africa, Asia
#2: Egypt Africa
#3: United States, Argentina North America, South America
#4: Brazil South America
Same logic in dplyr can be implemented as :
library(dplyr)
testDTcomplicated %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(country, sep = ", ") %>%
left_join(regionLookup, by = c("country" = "countrylookup")) %>%
group_by(row) %>%
summarise(across(everything(), toString))
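The split, look up and collapse steps can also be sketched in base R with a named vector as the lookup table. This is an illustrative alternative under the assumption that countries are separated by ", " exactly; the names complicated, region_lookup and split_countries are made up here, and plain data frames stand in for the data.tables.

```r
complicated <- data.frame(
  country = c("Algeria, Ghana, Sri Lanka", "Egypt",
              "United States, Argentina", "Brazil"),
  stringsAsFactors = FALSE
)

# Named vector: names are countries, values are regions.
region_lookup <- c(
  Algeria = "Africa", Argentina = "South America", Egypt = "Africa",
  `United States` = "North America", Brazil = "South America",
  Ghana = "Africa", `Sri Lanka` = "Asia"
)

# Split each cell into its countries, look each one up by name,
# and collapse the regions back into one comma-separated string.
split_countries <- strsplit(complicated$country, ", ", fixed = TRUE)
complicated$region <- vapply(
  split_countries,
  function(x) toString(region_lookup[x]),
  character(1)
)
print(complicated)
```

A country that is missing from the lookup would show up as the string "NA" in the collapsed result, so it may be worth checking for unmatched names first.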
