Join 2 dataframes together if two columns match - r

I have 2 dataframes:
CountryPoints
From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1
and another dataframe with neighbouring/bordering countries:
From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy
I would like to add another column in CountryPoints called Neighbour (Y/N), depending on whether the country pair is found in the neighbouring/bordering-countries dataframe. Is this somehow possible? It is a kind of join, but the result should be a boolean column.
The result should be:
From.country To.Country points Neighbour
Belgium Finland 4 Y
Belgium Germany 5 Y
Malta Italy 12 Y
Malta UK 1 N
The question linked below shows how you can merge, but it doesn't show how to add that extra boolean column.
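(For reference, a minimal merge-based sketch of that missing step, assuming df1 and df2 as defined in the sample data at the end of this thread: give df2 a temporary marker column, left-join on both key columns, and translate matched/unmatched into Y/N. Note that merge() may reorder the rows.)
df2$match <- "Y"  # marker: every df2 row is a neighbour pair
m <- merge(df1, df2, by = c("From.country", "To.Country"), all.x = TRUE)
m$Neighbour <- ifelse(is.na(m$match), "N", "Y")
m$match <- NULL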

Two alternative approaches:
1) with base R:
# match each From/To pair as a single key so that both countries must occur
# in the same row of df2 (matching the two columns separately would also
# flag pairs whose countries each appear in df2, but never together)
idx <- match(paste(df1$From.country, df1$To.Country),
             paste(df2$From.country, df2$To.Country),
             nomatch = 0) > 0
df1$Neighbour <- c('N','Y')[1 + idx]
2) with data.table:
library(data.table)
setDT(df1)
setDT(df2)
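# start with 'N' everywhere, then flip to 'Y' on the rows matched by the join with df2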
df1[, Neighbour := 'N'][df2, on = .(From.country, To.Country), Neighbour := 'Y'][]
which both give (data.table output shown):
From.country To.Country points Neighbour
1: Belgium Finland 4 Y
2: Belgium Germany 5 Y
3: Malta Italy 12 Y
4: Malta UK 1 N

Borrowing the idea from this post:
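# the trick: stack df2's pairs on top of df1's, so duplicated() flags every df1
# pair already present in df2 (a pair repeated within df1 itself is flagged too)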
df1$Neighbour <- duplicated(rbind(df2[, 1:2], df1[, 1:2]))[ -seq_len(nrow(df2)) ]
df1
# From.country To.Country points Neighbour
# 1 Belgium Finland 4 TRUE
# 2 Belgium Germany 5 TRUE
# 3 Malta Italy 12 TRUE
# 4 Malta UK 1 FALSE
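If the literal Y/N labels from the question are wanted instead of TRUE/FALSE, one extra line converts the column:
df1$Neighbour <- ifelse(df1$Neighbour, "Y", "N")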

What about something like this? Note that sortpaste sorts each pair before pasting, so the comparison ignores direction (Belgium/Finland would also match a Finland/Belgium row); drop the sort if the direction matters.
sortpaste <- function(x) paste0(sort(x), collapse = "_")
df1$Neighbour <- apply(df1[, 1:2], 1, sortpaste) %in% apply(df2[, 1:2], 1, sortpaste)
# From.country To.Country points Neighbour
#1 Belgium Finland 4 TRUE
#2 Belgium Germany 5 TRUE
#3 Malta Italy 12 TRUE
#4 Malta UK 1 FALSE
Sample data
df1 <- read.table(text =
"From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1", header = T)
df2 <- read.table(text =
"From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy", header = T)

Related

r - How to track Changes in Rows of dataframe with characters?

In addition to my last question, I am now looking for a way to track changes within a data frame of characters.
Suppose I have the following dataframe df:
df <- data.frame(ID = c(123100, 123200, 123300, 123400, 123500),
                 "2014" = c("Germany", "Germany", "Germany", "Italy", "Austria"),
                 "2015" = c("Germany", "Germany", "Germany", "Italy", "Austria"),
                 "2016" = c("Italy", "Germany", "Germany", "Italy", "Germany"),
                 "2017" = c("Italy", "Germany", "Germany", "Italy", "Germany"),
                 "2018" = c("Italy", "Austria", "Germany", "Italy", "Germany"))
Now I want to find out for which ID the data changed in which year. For example, in 2016 ID 123100 changed from Germany to Italy. I would like to add new columns for change (1 = change, 0 or NA = no change), year of change, old value and new value. The real dataset consists of thousands of different values instead of the three countries, which is a challenge for me: I need a solution that does not require listing the possible values beforehand.
In the end it should look like this:
df_final <- data.frame(ID = c(123100, 123200, 123300, 123400, 123500),
                       "2014" = c("Germany", "Germany", "Germany", "Italy", "Austria"),
                       "2015" = c("Germany", "Germany", "Germany", "Italy", "Austria"),
                       "2016" = c("Italy", "Germany", "Germany", "Italy", "Germany"),
                       "2017" = c("Italy", "Germany", "Germany", "Italy", "Germany"),
                       "2018" = c("Italy", "Austria", "Germany", "Italy", "Germany"),
                       "change" = c(1, 1, 0, 0, 1),
                       "year" = c(2016, 2018, 0, 0, 2016),
                       "before" = c("Germany", "Germany", 0, 0, "Austria"),
                       "after" = c("Italy", "Austria", 0, 0, "Germany"))
I couldn't find any satisfying solution on here, so I hope you can help me.
Try this
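# note: data.frame() turns the year names into syntactic names, hence X2014 ... X2018 below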
df |>
  rowwise() |>
  mutate(change = case_when(all(c_across(X2015:X2018) == X2014) ~ 0,
                            TRUE ~ 1),
         year = colnames(df)[-1][which(c_across(X2014) != c_across(X2014:X2018))[1]]) |>
  ungroup() |>
  mutate(before = ifelse(change == 1, X2014, NA),
         after  = ifelse(change == 1, X2018, NA))
output
# A tibble: 5 × 10
ID X2014 X2015 X2016 X2017 X2018 change year before after
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 123100 Germany Germany Italy Italy Italy 1 X2016 Germany Italy
2 123200 Germany Germany Germany Germany Austria 1 X2018 Germany Austria
3 123300 Germany Germany Germany Germany Germany 0 NA NA NA
4 123400 Italy Italy Italy Italy Italy 0 NA NA NA
5 123500 Austria Austria Germany Germany Germany 1 X2016 Austria Germany
Not elegant, but you can use rle to count the run lengths and values in a vector. I used plyr::ldply to run rle on each row.
library(plyr)
output <- ldply(seq_len(nrow(df)), function(x) {
  columns <- c("X2014", "X2015", "X2016", "X2017", "X2018")
  values <- unname(unlist(df[x, columns]))      # coerce the row to a plain vector for rle
  rle_output <- rle(values)
  if (length(rle_output$lengths) == 1) return(data.frame(change = 0))
  change <- 1
  year   <- columns[rle_output$lengths[1] + 1]  # first column after the initial run
  before <- rle_output$values[1]
  after  <- rle_output$values[2]
  data.frame(change, year, before, after)
})
cbind(df, output)
      ID   X2014   X2015   X2016   X2017   X2018 change  year  before   after
1 123100 Germany Germany   Italy   Italy   Italy      1 X2016 Germany   Italy
2 123200 Germany Germany Germany Germany Austria      1 X2018 Germany Austria
3 123300 Germany Germany Germany Germany Germany      0  <NA>    <NA>    <NA>
4 123400   Italy   Italy   Italy   Italy   Italy      0  <NA>    <NA>    <NA>
5 123500 Austria Austria Germany Germany Germany      1 X2016 Austria Germany

Using apply function to calculate the mean of a column

After splitting a data frame into multiple data frames by country, I wanted to calculate the mean of the centralization column in each country's data frame. I used tapply, which worked, and then tried sapply(), but the weird thing is that every country's mean comes out equal to the first country's mean. I cannot figure out why. I was asked to use sapply as an exercise, so I would like to know how I can improve my code. Any pointer would be appreciated (it might be a dumb mistake).
INPUT/my code:
library(dplyr)  # for %>%
strikes.df <- read.csv("http://www.stat.cmu.edu/~pfreeman/strikes.csv")
strikes.by.country <- split(strikes.df, strikes.df$country)
my.fun <- function(x = strikes.by.country) {
  l <- length(strikes.by.country)
  for (i in 1:l) {
    return(strikes.by.country[[i]]$centralization %>% mean)
  }
}
sapply(strikes.by.country, my.fun)
# using tapply()
tapply(strikes.df[, "centralization"], INDEX = strikes.df[, "country"], FUN = mean)
OUTPUT
# sapply(strikes.by.country, my.fun): every country repeats the first mean
  Australia     Austria     Belgium      Canada     Denmark
   0.374644    0.374644    0.374644    0.374644    0.374644
    Finland      France     Germany     Ireland       Italy
   0.374644    0.374644    0.374644    0.374644    0.374644
      Japan Netherlands New.Zealand      Norway      Sweden
   0.374644    0.374644    0.374644    0.374644    0.374644
Switzerland          UK         USA
   0.374644    0.374644    0.374644
# tapply(): the correct per-country means
Australia Austria Belgium Canada Denmark
0.374644022 0.997670495 0.749485177 0.002244134 0.499958552
Finland France Germany Ireland Italy
0.750374065 0.002729909 0.249968231 0.499711882 0.250699502
Japan Netherlands New.Zealand Norway Sweden
0.124675342 0.749602699 0.375940378 0.875341821 0.875253817
Switzerland UK USA
0.499990005 0.375946785 0.002390639
I was instructed to use sapply after using split; that's why the only thing that occurred to me was a for loop.
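(An aside on why the sapply attempt repeats the first mean: my.fun never uses its argument; the for loop runs over the global strikes.by.country, and the return() inside the loop exits on the first iteration, so every call computes the first country's mean. A minimal fix that keeps the split/sapply structure:)
my.fun <- function(x) mean(x$centralization)  # operate on the piece sapply hands in
sapply(strikes.by.country, my.fun)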
Better use sapply on the unique country names. Actually there's no need to split anything.
sapply(unique(strikes.df$country), function(x)
mean(strikes.df[strikes.df$country == x, "centralization"]))
# Australia Austria Belgium Canada Denmark Finland France
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909
# Germany Ireland Italy Japan Netherlands New.Zealand Norway
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821
# Sweden Switzerland UK USA
# 0.875253817 0.499990005 0.375946785 0.002390639
But if you depend on using split as well, you may do:
sapply(split(strikes.df$centralization, strikes.df$country), mean)
# Australia Austria Belgium Canada Denmark Finland France
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909
# Germany Ireland Italy Japan Netherlands New.Zealand Norway
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821
# Sweden Switzerland UK USA
# 0.875253817 0.499990005 0.375946785 0.002390639
Or write it in two lines:
s <- split(strikes.df$centralization, strikes.df$country)
sapply(s, mean)
Edit
If splitting the whole data frame is required, do
s <- split(strikes.df, strikes.df$country)
sapply(s, function(x) mean(x[, "centralization"]))
or
foo <- function(x) mean(x[, "centralization"])
sapply(s, foo)
Using the gapminder::gapminder dataset as example data, this can be achieved like so. The example code computes mean life expectancy (lifeExp) by continent:
# sapply: simplifies. returns a vector
sapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> Africa Americas Asia Europe Oceania
#> 48.86533 64.65874 60.06490 71.90369 74.32621
# lapply: returns a list
lapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> $Africa
#> [1] 48.86533
#>
#> $Americas
#> [1] 64.65874
#>
#> $Asia
#> [1] 60.0649
#>
#> $Europe
#> [1] 71.90369
#>
#> $Oceania
#> [1] 74.32621

How to use for loop to extract rows with similar elements across 2 dataframes?

I have 2 dataframes. One is a Free Trade Agreement dataset that contains many columns; the columns c1 to c91 denote the different countries that are part of a particular Free Trade Agreement, as shown below:
FTA data (excerpt):
No Base_treaty entry_type c1 c2 c3
1 1 treaty Afghanistan India NA
2 2 treaty Algeria Egypt Ghana
3 3 treaty Algeria Angola Benin
4 4 treaty Egypt Jordan Morocco
5 5 treaty Albania Bulgaria NA
6 6 treaty Albania Croatia NA
The other data frame contains trade data between two particular countries, i and j.
Trade data:
inventor_ctry_i authority_ctry_j
1 Albania Bulgaria
2 Albania Croatia
3 Algeria Angola
4 Algeria Belgium
5 Algeria France
6 Andorra Turkey
7 Andorra United States
8 Anguilla Germany
9 Anguilla Switzerland
10 Anguilla United States
Desired output:
No Base_treaty entry_type matched ctry1 matched ctry2
3 3 treaty Algeria Angola
5 5 treaty Albania Bulgaria
6 6 treaty Albania Croatia
I want to find rows of the FTA data where countries i and j from one row of the trade data both show up somewhere among c1 to c91. If both are present in a particular row, extract the two countries from that row of FTA, keeping the No, Base_treaty and entry_type columns intact.
What I have done so far:
FTA_final: FTA Data, unique_pairs: Trade Data
specialnames <- setdiff(names(FTA_final),
                        c("number", "base_treaty", "entry_type"))  # getting rid of irrelevant columns
table <- data.frame()  # create an empty dataframe
for (i in nrow(FTA_final)) {
  for (j in seq_along(specialnames)) {
    for (p in nrow(unique_pairs)) {
      if (FTA_final[i, j] %in% unique_pairs[p, ]) {
        table <- rbind(table, FTA_final[i, c(1:3, j)])
      }
    }
  }
}
Nothing happens when I run this code, and I am not sure why. Any help would be greatly appreciated.
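(Two likely reasons the loop appears to do nothing: for (i in nrow(FTA_final)) iterates over the single value nrow(FTA_final) rather than 1:nrow(FTA_final), and the same applies to the p loop; also, j indexes positions in specialnames but is used as a column number of the full FTA_final, so its first values hit No, Base_treaty and entry_type rather than c1..c91. A sketch of the corrected skeleton, keeping the question's variable names and assuming the column names shown in the excerpt:)
for (i in seq_len(nrow(FTA_final))) {          # every row index, not just the last
  for (j in seq_along(specialnames)) {
    for (p in seq_len(nrow(unique_pairs))) {
      if (FTA_final[i, specialnames[j]] %in% unique_pairs[p, ]) {
        table <- rbind(table, FTA_final[i, c("No", "Base_treaty", "entry_type", specialnames[j])])
      }
    }
  }
}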
One way to do this would be to paste the values of Trade_data row-wise to get the combinations of countries that trade together. We can then create the pairwise combinations of countries in each row of FTA_data and check whether any of them matches all_countries.
cols <- paste0('c', 1:3)
all_countries <- do.call(paste, Trade_data)
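# all_countries now holds the trade pairs pasted row by row:
# "Albania Bulgaria", "Albania Croatia", "Algeria Angola", ...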
data <- apply(FTA_data[cols], 1, function(x) {
  x <- na.omit(x)
  if (length(x) <= 1) return(NULL)
  temp <- combn(x, 2)
  inds <- combn(x, 2, paste, collapse = " ") %in% all_countries
  if (any(inds)) temp[, inds]
})
new_data <- FTA_data[!sapply(data, is.null), ]
new_data[cols] <- NULL
final_data <- cbind(new_data, do.call(rbind, data))
final_data
# No Base_treaty entry_type 1 2
#3 3 3 treaty Algeria Angola
#5 5 5 treaty Albania Bulgaria
#6 6 6 treaty Albania Croatia
Here is another way (reusing cols and all_countries from the answer above):
library(dplyr)
library(tidyr)
output <- FTA_data[rowSums(sapply(all_countries, function(x)
  apply(FTA_data[cols], 1, function(y)
    grepl(x, paste(y, collapse = " "))))) > 0, ]
output %>%
pivot_longer(cols = starts_with('c'),
values_drop_na = TRUE) %>%
filter(value %in% Trade_data$inventor_ctry_i |
value %in% Trade_data$authority_ctry_j) %>%
group_by(No, Base_treaty, entry_type) %>%
mutate(name = paste0('c', row_number())) %>%
pivot_wider()
Thank you to @Ronak Shah for the suggestions.
As suggested by @Ronak Shah, I was able to get the relevant rows that had countries i and j in them:
cols <- paste0('c', 1:3)
all_countries <- do.call(paste, Trade_data)
output <- FTA_data[rowSums(sapply(all_countries, function(x)
  apply(FTA_data[cols], 1, function(y)
    grepl(x, paste(y, collapse = " "))))) > 0, ]
After which, I did this:
do.call(rbind, combn(grep("^c\\d+$", names(output)), 2, function(x)
  cbind(output[1:3], setNames(output[x], paste0("c", 1:2))), simplify = FALSE))
This gets me all possible pair combinations across the "c" columns, while retaining columns 1:3, i.e. No, Base_treaty and entry_type.
After this, a simple left join with the trade data gives me the desired i and j pairs and the output:
No Base_treaty entry_type matched ctry1 matched ctry2
3 3 treaty Algeria Angola
5 5 treaty Albania Bulgaria
6 6 treaty Albania Croatia

From monadic to dyadic data in R

For the sake of simplicity, let's say I have a dataset at the country-year level that lists organizations that received aid from a government, how much money it was, and the type of project. The data frame has "space" for 10 organizations each year, but not every government subsidizes that many organizations each year, so there are a lot of blank spaces. Moreover, they do not follow any order: an organization can be in the first spot one year and be coded in the second spot the next year. The data looks like this:
State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2 Org3 Aid3 Proj3 Org4 Aid4 Proj4 ...
Italy 2000 A 1000 Arts B 500 Arts C 300 Social
Italy 2001 B 700 Social A 1000 Envir
Italy 2002 A 1000 Arts C 300 Envir
UK 2000
UK 2001 Z 2000 Social
UK 2002 Z 2000 Social
...
I'm trying to transform this into dyadic data, which would look like this:
State Org Year Aid Proj
Italy A 2000 1000 Arts
Italy A 2001 1000 Envir
Italy A 2002 1000 Arts
Italy B 2000 500 Arts
Italy B 2001 700 Social
Italy C 2000 300 Social
Italy C 2002 300 Envir
UK Z 2001 2000 Social
...
I'm using R, and the best way I could find was building a pre-defined set of possible dyads (using something like expand.grid(unique(State), unique(Org))) and then looping through the data, finding the corresponding column and filling the data frame. But I don't think this is the most effective method, so I was wondering whether there is a better way. I thought about dplyr or reshape but can't find a solution.
I know this is a recurring question, but couldn't really find an answer. The most similar question is this one, but it's not exactly the same.
Thanks a lot in advance.
Since you did not use dput, I will try and make some data that resemble yours:
dat = data.frame(State = rep(c("Italy", "UK"), 3),
Year = rep(c(2014, 2015, 2016), 2),
Org1 = letters[1:6],
Aid1 = sample(800:1000, 6),
Proj1 = rep(c("A", "B"), 3),
Org2 = letters[7:12],
Aid2 = sample(600:700, 6),
Proj2 = rep(c("C", "D"), 3),
stringsAsFactors = FALSE)
dat
# State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2
# 1 Italy 2014 a 910 A g 658 C
# 2 UK 2015 b 926 B h 681 D
# 3 Italy 2016 c 834 A i 625 C
# 4 UK 2014 d 858 B j 620 D
# 5 Italy 2015 e 831 A k 650 C
# 6 UK 2016 f 821 B l 687 D
Next I gather the data, use extract to make two new columns, and then spread it all again:
library(tidyr)
library(dplyr)
dat %>%
  gather(key, value, -c(State, Year)) %>%
  extract(key, into = c("key", "num"), "([A-Za-z]+)([0-9]+)") %>%
  spread(key, value) %>%
  select(-num)
# State Year Aid Org Proj
# 1 Italy 2014 910 a A
# 2 Italy 2014 658 g C
# 3 Italy 2015 831 e A
# 4 Italy 2015 650 k C
# 5 Italy 2016 834 c A
# 6 Italy 2016 625 i C
# 7 UK 2014 858 d B
# 8 UK 2014 620 j D
# 9 UK 2015 926 b B
# 10 UK 2015 681 h D
# 11 UK 2016 821 f B
# 12 UK 2016 687 l D
Is this the desired output?
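(A side note: with tidyr 1.0 or later, the same gather/extract/spread pipeline can be collapsed into a single pivot_longer call; a sketch using the same dat as above:)
library(tidyr)
library(dplyr)
dat %>%
  pivot_longer(-c(State, Year),
               names_to = c(".value", "num"),          # Org/Aid/Proj become columns again
               names_pattern = "([A-Za-z]+)([0-9]+)") %>%
  select(-num)
On the real data, adding values_drop_na = TRUE inside pivot_longer drops the empty organization slots.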

R equal sampling takes too long

I want to sample rows from different years given some constraints.
Say my dataset looks like this:
library(data.table)
dataset = data.table(ID=sample(1:21), Vintage=c(1989:1998, 1989:1998, 1992), Region.Focus=c("Europe", "US", "Asia"))
> dataset
ID Vintage Region.Focus
1: 7 1989 Europe
2: 10 1990 US
3: 20 1991 Asia
4: 18 1992 Europe
5: 4 1993 US
6: 17 1994 Asia
7: 13 1995 Europe
8: 9 1996 US
9: 12 1997 Asia
10: 3 1998 Europe
11: 11 1989 US
12: 14 1990 Asia
13: 8 1991 Europe
14: 16 1992 US
15: 19 1993 Asia
16: 1 1994 Europe
17: 5 1995 US
18: 15 1996 Asia
19: 6 1997 Europe
20: 21 1998 US
21: 2 1992 Asia
ID Vintage Region.Focus
I want 1,000 draws of sample size 2 and of sample size 4 (separate from each other), spread across two consecutive years. E.g. for 1,000 draws of sample size 2, one draw could be the first and the second row. I also have the constraint that each sample must consist of rows with the same region focus. My solution is the code below, but it is way too slow.
library(dplyr)  # for %>% and mutate()
for (i in c(2, 4)) {
  simulate <- function(i) {
    repeat {
      # draw a random starting row; its vintage plus the next one form the window
      start <- dataset[sample(nrow(dataset), 1, replace = TRUE), ]
      t <- start$Vintage:(start$Vintage + 1)
      # constraints: same region focus, vintage inside the two-year window
      matches <- which(dataset$Vintage %in% t & dataset$Region.Focus == start$Region.Focus)
      DT <- dataset[matches, ]
      x <- DT[, .SD[sample(.N, min(.N, i / length(t)))], by = Vintage]
      if (nrow(x) == i) {
        x <- as.data.frame(x) %>%
          mutate(EqualWeight = 1 / i,
                 RandomWeight = prop.table(runif(i)))
        return(x)
      }
      # otherwise the draw is rejected and the repeat starts over
    }
  }
  # now replicate the expression 1000 times
  r <- replicate(1000, simulate(i), simplify = FALSE)
  r <- rbindlist(r, idcol = "draw")
  f <- as.data.frame(r)
  write.csv(f, file = paste("Performance.fof.5", i, "csv", sep = "."))
  fof <- paste("fof.5", i, sep = ".")
  assign(fof, f)
}
This code is very slow. My intuition is that the approach keeps looping because of the constraint: many candidate draws fail the size check and get rejected. I have 5,800 rows.
Is there a way to avoid the repeat, which results in a lot of looping? Perhaps there is another way of expressing the line DT[, .SD[sample(.N, min(.N, i/length(t)))], by = Vintage] to get rid of the repeat expression? Thank you in advance for any input!
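(A sketch of one way to avoid the rejection loop entirely, not a drop-in replacement: each accepted draw takes i/2 rows from each of two consecutive vintages within one region, so the per-group counts can be checked up front and a starting window sampled only from those that are guaranteed to succeed.)
library(data.table)
i <- 2                                                 # sample size for this pass
counts <- dataset[, .N, by = .(Region.Focus, Vintage)]
setkey(counts, Region.Focus, Vintage)
# rows available in the following vintage of the same region (NA if none)
counts[, N.next := counts[.(Region.Focus, Vintage + 1), N]]
# windows where both vintages can supply i/2 rows: every draw from here succeeds
valid_starts <- counts[N >= i / 2 & N.next >= i / 2]
Sampling then becomes two direct steps per draw with no rejection: pick one row of valid_starts, then sample i/2 IDs from each of its two vintages.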
