Trying to find values within excel cell based on given pairs in R df - r

I am using this excel sheet that I have currently read into R: https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx
dput(head(remittance, 5))
The output is:
structure(list(`Remittance-receiving country (across) - Remittance-sending country (down)` = c("Australia",
"Brazil", "Canada"), Brazil = c("27.868809286999106", "0", "31.284184411144214"
), Canada = c("46.827693406219382", "1.5806325278762619", "0"
), `Czech Republic` = c("104.79905129342241", "3.0488843262423089",
"176.79676736179096"), Finland = c("26.823089572300752", "1.3451674211686246",
"37.781150857376964"), France = c("424.37048861305249", "123.9763417712491",
"1296.7352242506483"), Germany = c("556.4140279523856", "66.518143815367239",
"809.9621650533453"), Hungary = c("200.08597014449356", "11.953328254521287",
"436.0811601171776"), Indonesia = c("172.0021287331823", "1.3701340430259537",
"33.545925908780198"), Italy = c("733.51652291459231", "116.74264895322995",
"1072.1119887588022"), `Korea, Rep.` = c("259.97044386689589",
"20.467939414361016", "326.94157937864327"), Netherlands = c("133.48932759488602",
"4.7378343766684532", "181.28828076733771"), Philippines = c("1002.3593555086774",
"1.5863355979877207", "2369.5223195675494"), Poland = c("109.73486651698796",
"5.8313637459523129", "341.10408952685464"), `Russian Federation` = c("19.082541158574934",
"1.0136604494838692", "58.760989426089431"), `Saudi Arabia` = c("13.578431465294949",
"0.32506772760873404", "15.511213677040857"), Sweden = c("91.887827513176489",
"5.1132733094740352", "65.860232580192786"), Thailand = c("383.08245004577498",
"2.7410805494977684", "79.370683058792849"), `United Kingdom` = c("1084.0742194994727",
"4.2050614573174592", "568.62605950140266"), `United States` = c("188.06242727403128",
"49.814372612310521", "661.98049661387927"), WORLD = c("5578.0296723604206",
"422.37127035334271", "8563.264510816849")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I currently have a dataframe of two columns "Source" and "Destination" where each row is a pair of countries which I created by doing:
countries = c("Australia","Brazil", "Canada", "Czech Republic", "Germany", "Finland", "United Kingdom", "Italy", "Poland", "Russian Federation", "Sweden", "United States", "Philippines", "France", "Netherlands", "Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs = t(combn(countries, 2))
I would like to use each pair to extract its corresponding value from the Excel sheet above. (In the Excel sheet, "Source" is the first column, with countries listed down, and "Destination" is the first row, with countries listed across.)
For example a sample of the df that I have looks as follows (it currently contains 190 pairs):
pairs = data.frame(Source = c("Australia", "Australia", "Australia"), Destination = c("Brazil", "Canada", "Czech Republic"))
Where the first pair in my df is (Australia, Brazil) which corresponds to a value of 27.868809286999106 from the excel sheet that I reproduced above. Is there a built-in R function that would match the pairs from my df to extract its corresponding value? Thanks

Perhaps what you need is tidyr::pivot_longer?
library(dplyr)
library(tidyr)
colnames(remittance)[1] <- 'source'
remittance %>% pivot_longer(-source, names_to = 'destination')
#----
# A tibble: 60 x 3
source destination value
<chr> <chr> <chr>
1 Australia Brazil 27.868809286999106
2 Australia Canada 46.827693406219382
3 Australia Czech Republic 104.79905129342241
4 Australia Finland 26.823089572300752
Note remittance is the dataframe in the OP dput.
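To get from the long table to the values for specific pairs, one option is a join. The sketch below shows one possible way, assuming a `pairs` data frame shaped like the one in the question. The three-row `remittance` table is a rounded stand-in for the dput output, not the real file.

```r
library(dplyr)
library(tidyr)

# Stand-in for the question's table (values rounded from the dput output)
remittance <- tibble(
  source = c("Australia", "Brazil", "Canada"),
  Brazil = c("27.87", "0", "31.28"),
  Canada = c("46.83", "1.58", "0"),
  `Czech Republic` = c("104.80", "3.05", "176.80")
)
pairs <- data.frame(
  Source = c("Australia", "Australia", "Australia"),
  Destination = c("Brazil", "Canada", "Czech Republic"),
  stringsAsFactors = FALSE
)

# One row per (source, destination) combination, with numeric values
long <- remittance %>%
  pivot_longer(-source, names_to = "destination", values_to = "value") %>%
  mutate(value = as.numeric(value))

# Keep only the combinations listed in pairs
result <- inner_join(pairs, long,
                     by = c("Source" = "source", "Destination" = "destination"))
result
```

Pairs with no matching cell are dropped by inner_join(); left_join() would keep them with NA instead.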

Probably you are interested in keeping the flexibility of your nice combn approach.
To loop over your pairs object (it is actually a matrix, not a data frame), you can use apply with MARGIN=1 to work row-wise. In the FUN= argument we create one-row data frames, with source taken from column 1 of pairs and destination from column 2. We get the value (called dist here, whatever it actually measures) by subsetting remittance at the corresponding row and column (for brevity I shortened the name to rem).
Since we will get a list of single-line data frames, we want to rbind, and because we have multiple objects we need do.call.
res <- do.call(rbind,
apply(pairs, MARGIN=1, FUN=function(x)
data.frame(source=x[1], destination=x[2],
dist=as.integer(rem[rem[, 1] == x[1], rem[1, ] == x[2]])))
)
Since the .xlsx has zeros where there should actually be NAs, we should declare them as such in the result.
res[res == 0] <- NA
Result
head(res, 25)
# source destination dist
# 1 Australia Brazil 721
# 2 Australia Canada 24721
# 3 Australia Czech Republic 1074
# 4 Australia Germany 13938
# 5 Australia Finland 1121
# 6 Australia United Kingdom 135000
# 7 Australia Italy 19350
# 8 Australia Poland 974
# 9 Australia Russian Federation 543
# 10 Australia Sweden 3988
# 11 Australia United States 93179
# 12 Australia Philippines 4118
# 13 Australia France 8475
# 14 Australia Netherlands 10697
# 15 Australia Hungary 997
# 16 Australia Saudi Arabia NA
# 17 Australia Thailand 11298
# 18 Australia Korea, Rep. 5381
# 19 Australia Indonesia 11094
# 20 Brazil Canada 26647
# 21 Brazil Czech Republic 742
# 22 Brazil Germany 44000
# 23 Brazil Finland 1378
# 24 Brazil United Kingdom 55772
# 25 Brazil Italy 104779
Data:
u <- "https://www.knomad.org/sites/default/files/2018-04/bilateralmigrationmatrix20170_Apr2018.xlsx"
rem <- openxlsx::read.xlsx(u)
countries <- c("Australia", "Brazil", "Canada", "Czech Republic", "Germany",
"Finland", "United Kingdom", "Italy", "Poland", "Russian Federation",
"Sweden", "United States", "Philippines", "France", "Netherlands",
"Hungary", "Saudi Arabia", "Thailand", "Korea, Rep.", "Indonesia")
pairs <- t(combn(countries, 2))
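A vectorised alternative to the apply() loop is to index a matrix with a two-column character matrix, which picks one cell per row of the index. This is only a sketch: it assumes the table has source countries in its first column and destination countries as column names, as in the question's dput output, and the small `remittance` table below is a rounded stand-in rather than the real file.

```r
# Stand-in for the question's table (values rounded from the dput output)
remittance <- data.frame(
  source = c("Australia", "Brazil", "Canada"),
  Brazil = c("27.87", "0", "31.28"),
  Canada = c("46.83", "1.58", "0"),
  `Czech Republic` = c("104.80", "3.05", "176.80"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
pairs <- data.frame(
  Source = c("Australia", "Australia", "Australia"),
  Destination = c("Brazil", "Canada", "Czech Republic"),
  stringsAsFactors = FALSE
)

# Numeric matrix with sources as row names and destinations as column names
m <- as.matrix(remittance[, -1])
storage.mode(m) <- "double"
rownames(m) <- remittance$source

# Each row of the index matrix is a (row name, column name) pair
pairs$value <- m[cbind(pairs$Source, pairs$Destination)]
pairs
```

Every name in the index must exist in the matrix's dimnames, otherwise R reports a subscript out of bounds error, so pairs involving countries missing from the sheet need to be filtered out first.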

Related

Web-scraping table with merged row entries in R

I'm trying to scrape data-tables from a website
https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/
The issue is that the first column entry is merged across multiple rows while the second column has discrete entries:
The data table which is being scraped looks something like:
Here, different entries in the second column have been merged into a single entry using \n
I want to split the merged data into separate rows and need some help with that.
The code for webscraping is
library(rvest)
#Spotify's list of new artists to look out for
upcoming_artists <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists)
upcoming_artists <- html_table(upcoming_artists)
The erroneous data frame looks something like:
list(structure(list(X1 = c("United States", "United Kingdom",
"Brazil", "Mexico", "Argentina", "Colombia", "Panama", "Spain",
"Australia", "France", "UAE & Lebanon", "South Africa", "Philippines",
"Indonesia", "Taiwan", "Austria", "Germany", "Netherlands", "Japan\n*RADAR locally titled Early Noise",
"India"), X2 = c("Alaina Castillo", "Young T + Bugsey", "Agnes Nunes",
"Silvana Estrada", "Romeo El Santo", "Ela Minus", "Boza", "DORA \nAleesha\nMaría José Llergo\nGuitarricadelafuente\nParanoid 1966",
"merci, mercy", "Lous and the Yakuza \nYuzmv\nPhilippine\nHervé",
"Hollaphonic x Xriss", "Elaine", "SB19\nAugust Wahh", "Mahen\nMonica Karina",
"張若凡RuoFan", "AVEC \nMy Ugly Clementine", "badmómzjay",
"RIMON \nJeangu Macrooy", "Fujii Kaze\nVaundy\nRina Sawayama",
"Mali\nWhen Chai Met Toast\nTaba Chake")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame")))
Use separate_rows from package tidyr to separate a given column into rows.
I have changed the scraping a bit.
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
})
#Spotify's list of new artists to look out for
upcoming_artists_link <- "https://newsroom.spotify.com/2020-03-09/36-new-artists-around-the-world-that-are-on-spotifys-radar/"
upcoming_artists <- read_html(upcoming_artists_link)
upcoming_artists %>%
html_elements("tbody") %>%
html_table() %>%
`[[`(1) %>%
tidyr::separate_rows(X2, sep = "\n")
#> # A tibble: 35 × 2
#> X1 X2
#> <chr> <chr>
#> 1 United States Alaina Castillo
#> 2 United Kingdom Young T + Bugsey
#> 3 Brazil Agnes Nunes
#> 4 Mexico Silvana Estrada
#> 5 Argentina Romeo El Santo
#> 6 Colombia Ela Minus
#> 7 Panama Boza
#> 8 Spain DORA 
#> 9 Spain Aleesha
#> 10 Spain María José Llergo
#> # … with 25 more rows
Created on 2022-10-03 with reprex v2.0.2
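For readers who want to see what the row separation does, here is a rough base R equivalent without tidyr. It is a sketch on a three-row excerpt of the scraped table (the `artists` data frame is an illustrative stand-in), splitting X2 on the newline and repeating X1 once per piece.

```r
# Three rows taken from the scraped table in the question
artists <- data.frame(
  X1 = c("Brazil", "Philippines", "Indonesia"),
  X2 = c("Agnes Nunes", "SB19\nAugust Wahh", "Mahen\nMonica Karina"),
  stringsAsFactors = FALSE
)

# Split each X2 entry on the newline character
parts <- strsplit(artists$X2, "\n", fixed = TRUE)

# Repeat each X1 value once per piece and flatten the pieces
long <- data.frame(
  X1 = rep(artists$X1, lengths(parts)),
  X2 = trimws(unlist(parts)),  # trimws() drops stray spaces such as in "DORA "
  stringsAsFactors = FALSE
)
long
```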

Changing spelling for multiple words at a time in R/replacing many words at once

I have a dataset (survey) and a column of birth_country, where people have written their country of birth. An example of it:
1 america
2 usa
3 american
4 us of a
5 united states
6 england
7 english
8 great britain
9 uk
10 united kingdom
how I would like it to look:
1 america
2 america
3 america
4 america
5 america
6 uk
7 uk
8 uk
9 uk
10 uk
I have tried using str_replace to manually insert the different spellings, to replace them with 'america' but when I look at my dataset, nothing has changed
e.g.
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")
thank you in advance
Come up with some patterns that only match for each country and basically loop over what you are already doing (you can change the replacement below with your favorite function)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
america = 'amer|us|states',
uk = 'eng|brit|king|uk',
'another country' = 'ano|an co',
chaz = 'chaz|chop'
)
f <- function(x, list) {
for (ii in seq_along(list)) {
x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
}
x
}
## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk" "uk" "uk" "uk" "uk"
within(survey, {
clean <- f(birth_country, l)
})
# birth_country clean
# 1 america america
# 2 usa america
# 3 american america
# 4 us of a america
# 5 united states america
# 6 england uk
# 7 english uk
# 8 great britain uk
# 9 uk uk
# 10 united kingdom uk
Note that 1) if you don't give a pattern that matches, nothing will change, but 2) if you give a pattern that matches both countries (e.g., "united"), the first in the list will be used (unless the replacement itself is also matched)
Looks like the problem is in how you specified your regular expression. Try this (updated based on @Gabriella's comment, and another tidyverse approach, similar to @MarBIo):
library(tidyverse)
survey <- survey %>%
mutate(birth_country = if_else(
str_detect(birth_country,
"(united state)|(united statea)|(united states of america)"), #If your regular expression matches any in birth_country
"america", #Change it to "america"
birth_country #Otherwise, keep as is.
) #end of if_else
) #end of mutate
Other people are suggesting you come up with a more complex regular expression, which you can certainly do as well. Consecutive "or" (i.e. "|") alternatives in your regular expression do work, though.
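To see why the original str_replace call changed nothing: the double quotes inside the pattern string are treated as literal characters, so the pattern only matches text that itself contains quotes. A small base R sketch, using an illustrative vector `x` rather than the survey data:

```r
x <- c("united states", "united states of america", "uk")

# Pattern as written in the question: the inner quotes are part of the regex
bad <- ' "united state"|"united statea"|"united states of america"'
any(grepl(bad, x))

# Same alternatives without the stray quotes and leading space
good <- "united state|united statea|united states of america"

# Replace the whole value when any alternative matches
x[grepl(good, x)] <- "america"
x
```

Replacing the whole value, as here and in the if_else() version above, avoids leftovers such as "americas" that a partial substitution of "united state" inside "united states" would produce.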
If you are open to using tidyverse's mutate, you can do:
library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
americas <- c("america", "usa", "american", "us of a", "united states")
englands <- c("england", "english", "great britain")
survey %>%
mutate(birth_country = ifelse(birth_country %in% americas, 'america', 'UK'))
#> birth_country
#> 1 america
#> 2 america
#> 3 america
#> 4 america
#> 5 america
#> 6 UK
#> 7 UK
#> 8 UK
#> 9 UK
#> 10 UK
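Note that the ifelse() call above labels everything that is not in `americas` as 'UK', so it only works while there are exactly two groups. One way to extend the idea to any number of countries is a named lookup vector. The sketch below is an assumption about how that could look (the `groups` list and the extra "france" entry are made up for illustration); unknown spellings are left unchanged.

```r
survey <- data.frame(
  birth_country = c("america", "usa", "us of a", "england", "great britain", "france"),
  stringsAsFactors = FALSE
)

# Replacement value -> accepted spellings (illustrative lists)
groups <- list(
  america = c("america", "usa", "american", "us of a", "united states"),
  uk      = c("england", "english", "great britain", "uk", "united kingdom")
)

# Named vector: names are the spellings, values are the replacements
lookup <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

cleaned <- unname(lookup[survey$birth_country])
# Spellings not in the lookup come back as NA, so keep the original there
missing <- is.na(cleaned)
cleaned[missing] <- survey$birth_country[missing]
survey$birth_country <- cleaned
survey
```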

Return a vector of values associated with a vector of character strings?

I have a data frame of survey respondents from various countries. I would like to create a new vector with the average wage in each respondent's country, next to the respondent.
I have a data set of countries and wages - sample below:
countries <- c("Australia", "Austria", "Belgium", "Canada", "Chile", "Czech")
wages <- c(61620, 48306, 49419, 50033, 18645, 15374)
data_set <- data.frame(countries, wages)
countries wages
1 Australia 61620
2 Austria 48306
3 Belgium 49419
4 Canada 50033
5 Chile 18645
6 Czech 15374
In my data frame there is a vector of the nationalities of the respondents:
c("Martha", "Shelagh","Ronald", "Stefan", "Dimitri", "Jack", "Johan", "Arnold", "Gilles")
c("Canada", "Australia", "Canada", "Czech", "Czech", "Australia", "Czech", "Austraia", "Belgium")
I would like to create a new vector, which returns the appropriate average wage for each country.
It should return something like:
names country av_wage
1 Martha Canada 50033
2 Shelagh Australia 61620
3 Ronald Canada 50033
4 Stefan Czech 15374
5 Dimitri Czech 15374
6 Jack Australia 61620
7 Johan Czech 15374
8 Arnold Austria 48306
9 Gilles Belgium 49419
Thank you for your help :)
Start with the first aggregated data data_set.
countries <- c("Australia", "Austria", "Belgium", "Canada", "Chile", "Czech")
wages <- c(61620, 48306, 49419, 50033, 18645, 15374)
data_set <- data.frame(countries, wages)
Then put the names and countries in a second data frame, giving the country column the same name (countries) as in data_set so that merge() can use it as the key.
names <- c("Martha", "Shelagh","Ronald", "Stefan", "Dimitri", "Jack", "Johan", "Arnold", "Gilles")
countries <- c("Canada", "Australia", "Canada", "Czech", "Czech", "Australia", "Czech", "Austraia", "Belgium")
new <- data.frame(names, countries)
Now just merge the two dataframes.
merge(data_set, new)
# countries wages names
#1 Australia 61620 Shelagh
#2 Australia 61620 Jack
#3 Belgium 49419 Gilles
#4 Canada 50033 Ronald
#5 Canada 50033 Martha
#6 Czech 15374 Johan
#7 Czech 15374 Stefan
#8 Czech 15374 Dimitri
To reorder the columns,
mrg <- merge(data_set, new)[c(3, 1, 2)]
mrg
# names countries wages
#1 Shelagh Australia 61620
#2 Jack Australia 61620
#3 Gilles Belgium 49419
#4 Ronald Canada 50033
#5 Martha Canada 50033
#6 Johan Czech 15374
#7 Stefan Czech 15374
#8 Dimitri Czech 15374
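The merged result has eight rows rather than nine: merge() drops rows without a match by default, and Arnold's country is spelled "Austraia" in the input. It also reorders the rows. If the original order and all respondents should be kept, a lookup with match() is one option. A minimal sketch using the same data, with the new column named av_wage as in the question's desired output:

```r
data_set <- data.frame(
  countries = c("Australia", "Austria", "Belgium", "Canada", "Chile", "Czech"),
  wages = c(61620, 48306, 49419, 50033, 18645, 15374),
  stringsAsFactors = FALSE
)
new <- data.frame(
  names = c("Martha", "Shelagh", "Ronald", "Stefan", "Dimitri",
            "Jack", "Johan", "Arnold", "Gilles"),
  countries = c("Canada", "Australia", "Canada", "Czech", "Czech",
                "Australia", "Czech", "Austraia", "Belgium"),
  stringsAsFactors = FALSE
)

# match() returns the position of each respondent's country in data_set,
# or NA when the country is not found (here the misspelled "Austraia")
new$av_wage <- data_set$wages[match(new$countries, data_set$countries)]
new
```

merge(new, data_set, all.x = TRUE) would also keep the unmatched row, but it still sorts the result by the key unless sort = FALSE is given.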

How to apply multiple if statements in R?

I have a data frame (df) that lists the countries associated with every site
Site Country
Site1 USA
Site2 Vietnam
Site3 Spain
Site4 Germany
Site5 China
I want to attach a column, where for each country I associate its corresponding continent. I wrote a simple if loop to do this:
df$Continent <- NA
if(df$Country == "USA" |df$Country == "Canada" |df$Country == "Mexico")
{df$Continent <- "North America"}
if(df$Country == "Spain" |df$Country == "France" |df$Country == "Germany")
{df$Continent <- "Europe"}
## .. etc
summary(df)
However, each time I run it, I find that it assigns North America to all the countries. I understand that this may sound trivial, but does it make a difference if I use if statements everywhere rather than else or if else? Any suggestions for correcting this?
Build a lookup table and merge() it with the data.
For example:
lookup <- data.frame(Country = c("USA", "Canada", "Mexico",
"Spain", "France", "Germany",
"Vietnam", "China"),
Continent = rep(c("North America", "Europe", "Asia"),
times = c(3,3,2)))
Using your snippet of data as data frame df, we can add Continent via merge() (a join in database terminology):
> merge(df, lookup, sort = FALSE, all.x = TRUE)
Country Site Continent
1 USA Site1 North America
2 Vietnam Site2 Asia
3 Spain Site3 Europe
4 Germany Site4 Europe
5 China Site5 Asia
If you're working with a factor you can also do some nonsense with levels, or levels<- to be exact:
`levels<-`(dat$Country, list(
`North America` = c("USA","Canada","Mexico"),
`Europe` = c("Spain","France","Germany"),
`Asia` = c("Vietnam","China")
))
#[1] North America Asia Europe Europe Asia
#Levels: North America Europe Asia
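The same recoding can be written as an ordinary assignment to levels(), which may read more easily than calling `levels<-` as a function. A minimal sketch, assuming the country column is a factor (the `country` vector below is a stand-in for the question's column):

```r
country <- factor(c("USA", "Vietnam", "Spain", "Germany", "China"))

continent <- country
# Names are the new levels; values are the old levels merged into each
levels(continent) <- list(
  `North America` = c("USA", "Canada", "Mexico"),
  Europe          = c("Spain", "France", "Germany"),
  Asia            = c("Vietnam", "China")
)
continent
```

Old levels that appear in none of the list entries become NA, so the list has to cover every country in the data.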
I like ifelse() for things like this. You could use it with the %in% operator like this:
df$Continent <- ifelse(df$Country %in% c("USA", "Canada", "Mexico"),
"North America", df$Continent)
df$Continent <- ifelse(df$Country %in% c("Spain", "France", "Germany"),
"Europe", df$Continent)
df
Site Country Continent
1 Site1 USA North America
2 Site2 Vietnam <NA>
3 Site3 Spain Europe
4 Site4 Germany Europe
5 Site5 China <NA>
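As for why the original attempt failed: if() is not vectorised. It expects a single TRUE or FALSE, so older versions of R used only the first row's comparison (with a warning) and then assigned the continent to the whole column, while R 4.2.0 and later stop with an error for a condition of length greater than one. A sketch of the same idea with logical indexing instead of ifelse(), using the question's data plus an Asia group added for illustration:

```r
df <- data.frame(
  Site = paste0("Site", 1:5),
  Country = c("USA", "Vietnam", "Spain", "Germany", "China"),
  stringsAsFactors = FALSE
)

df$Continent <- NA_character_
# %in% gives one TRUE/FALSE per row, so only the matching rows are assigned
df$Continent[df$Country %in% c("USA", "Canada", "Mexico")] <- "North America"
df$Continent[df$Country %in% c("Spain", "France", "Germany")] <- "Europe"
df$Continent[df$Country %in% c("Vietnam", "China")] <- "Asia"
df
```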

Aggregate factors in Variable in R

I have this data.frame with a variable V21 in which many countries are recorded, I want to make it smaller by just specifying the continent rather then all those countries. For example 'Cuba', 'Peru', 'Argentina' rather than being separate levels of V21, I want them to become level 'South America'. Here's the code I tried to use:
recode(WaveOne.test$V21, "levels("Cuba","Colombia","Costa Rica","Argentina","Chile","Ecuador","Peru","Venezuela")= 'South America'")
levels(V21)
Can you suggest what is wrong with my code or maybe a different method?
I am a complete newbie in R and its syntax.
Thank you!
========UPDATE=========
SA_countries <- c("Cuba", "Mexico", "Argentina","Jamaica", "Haiti","West Indies", "Chile", "Ecuador", "Venezuela", "Other South America", "El Salvador", "Guatemala", "Nicaragua", "Dominican Republic", "Panama", "Costa Rica", "Peru")
Asia_countries <- c("Philippines", "Vietnam", "Laos", "Cambodia", "Hmong", "Other Asia", "China", "Hong Kong", "Taiwan", "Japan", "Korea", "India", "Pakistan")
Europe_Canada <- c("Europe/Canada")
MiddleEast_Africa <- c("Middle East/Africa")
continents <- list(`South America`= SA_countries, `Asia` = Asia_countries, `Europe_Canada` = Europe_Canada, `Middle East & Africa` = MiddleEast_Africa)
levels(WaveOne.test$V21) <- c(levels(WaveOne.test$V21), names(continents))
for(i in seq_along(continents)) WaveOne.test$V21[WaveOne.test$V21 %in% continents[[i]]] <- names(continents)[i]
levels(WaveOne.test$V21)
My output however is:
levels(WaveOne.test$V21)
[1] "Cuba" "Mexico" "Nicaragua" "Colombia" "Dominican Republic" "El Salvador" "Guatemala"
[8] "Honduras" "Costa Rica" "Panama" "Argentina" "Chile" "Ecuador" "Peru"
[15] "Venezuela" "Other South America" "Haiti" "Jamaica" "West Indies" "Philippines" "Vietnam"
[22] "Laos" "Cambodia" "Hmong" "Other Asia" "China" "Hong Kong" "Taiwan"
[29] "Japan" "Korea" "India" "Pakistan" "Middle East/Africa" "Europe/Canada" "South America"
[36] "Asia" "Europe_Canada" "Middle East & Africa"
You can create a list with all of your countries and continents then reassign the values accordingly:
continents <- list(`South America`=SA_countries,
`North America` = NA_countries,
Europe=Euro_countries)
levels(df$V21) <- c(levels(df$V21), names(continents)) #necessary to add new levels
for(i in seq_along(continents)) {
df$V21[df$V21 %in% continents[[i]]] <- names(continents)[i]}
Reproducible Example
set.seed(123)
SA_countries <- c("Cuba","Colombia","Costa Rica","Argentina","Chile","Ecuador","Peru","Venezuela")
NA_countries <- c("Mexico", "USA", "Canada")
Euro_countries <- c("Germany", "France")
df <- data.frame(V21=sample(c(NA_countries,SA_countries, Euro_countries),20,TRUE), stringsAsFactors = TRUE)
df
# V21
# 1 Cuba
# 2 Venezuela
# 3 Costa Rica
# 4 Germany
# 5 France
# 6 Mexico
# 7 Argentina
# 8 Germany
# 9 Chile
# 10 Costa Rica
# 11 France
# 12 Costa Rica
# 13 Ecuador
# 14 Chile
# 15 USA
# 16 Germany
# 17 Cuba
# 18 Mexico
# 19 Colombia
# 20 France
continents <- list(`South America`=SA_countries, `North America` = NA_countries, Europe=Euro_countries)
levels(df$V21) <- c(levels(df$V21), names(continents))
for(i in seq_along(continents)) df$V21[df$V21 %in% continents[[i]]] <- names(continents)[i]
df
# V21
# 1 South America
# 2 South America
# 3 South America
# 4 Europe
# 5 Europe
# 6 North America
# 7 South America
# 8 Europe
# 9 South America
# 10 South America
# 11 Europe
# 12 South America
# 13 South America
# 14 South America
# 15 North America
# 16 Europe
# 17 South America
# 18 North America
# 19 South America
# 20 Europe
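The values are recoded at this point, but levels() will still list the old country names, as in the output shown in the question's update, because a factor keeps unused levels until they are dropped. A short sketch of the cleanup step on a small made-up factor (`v21` and its values are illustrative, not the survey data):

```r
v21 <- factor(c("Cuba", "Peru", "Germany", "Mexico", "Cuba"))
continents <- list(
  `South America` = c("Cuba", "Peru"),
  `North America` = "Mexico",
  Europe          = "Germany"
)

# Add the continent names as levels, then reassign the values
levels(v21) <- c(levels(v21), names(continents))
for (i in seq_along(continents)) {
  v21[v21 %in% continents[[i]]] <- names(continents)[i]
}

levels(v21)             # the country names are still listed as levels
v21 <- droplevels(v21)  # remove levels that no longer occur in the data
levels(v21)
```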
