How to use for loop to extract rows with similar elements across 2 dataframes? - r

I have 2 dataframes, one is a Free Trade Agreement dataset that contains many columns, the columns c1 to c91 denote different countries part of a particular Free Trade Agreement, as shown below:
FTA data e.g.
No Base_treaty entry_type c1 c2 c3
1 1 treaty Afghanistan India NA
2 2 treaty Algeria Egypt Ghana
3 3 treaty Algeria Angola Benin
4 4 treaty Egypt Jordan Morocco
5 5 treaty Albania Bulgaria NA
6 6 treaty Albania Croatia NA
The other data frame contains trade data between two particular countries, i and j.
Trade Data
inventor_ctry_i authority_ctry_j
1 Albania Bulgaria
2 Albania Croatia
3 Algeria Angola
4 Algeria Belgium
5 Algeria France
6 Andorra Turkey
7 Andorra United States
8 Anguilla Germany
9 Anguilla Switzerland
10 Anguilla United States
Desired output:
No Base_treaty entry_type matched ctry1 matched ctry2
3 3 treaty Algeria Angola
5 5 treaty Albania Bulgaria
6 6 treaty Albania Croatia
I want to be able to find countries i and j in trade data that show up in the same row somewhere in between c1 to c91 of the FTA data. If both are present in a particular row, extract the 2 countries from the row in FTA, keeping no, base treaty and entry type column intact.
What I have done so far:
FTA_final: FTA Data, unique_pairs: Trade Data
specialnames <- setdiff(names(FTA_final), c("number", "base_treaty",
"entry_type")) # getting rid of irrelevant columns
table <- data.frame() # create empty dataframe
for(i in nrow(FTA_final)){
for(j in seq_along(specialnames)){
for(p in nrow(unique_pairs)){
if (FTA_final[i,j] %in% unique_pairs[p,])
{table <- rbind(table,FTA_final[i,c(1:3, j)])}
}
}
} # for loop
Nothing happens when I run this code, and I'm not sure why. Any help would be greatly appreciated.

One way to do this is to paste the values of Trade_data row-wise to get the pairs of countries that trade together (all_countries). We can then build every pairwise combination of countries in each row of FTA_data and check whether any combination matches an entry in all_countries.
cols <- paste0('c', 1:3)
all_countries <- do.call(paste, Trade_data)
data <- apply(FTA_data[cols], 1, function(x) {
x <- na.omit(x)
if(length(x) <= 1) return(NULL)
temp <- combn(x, 2)
inds <- combn(x, 2, paste, collapse = " ") %in% all_countries
if(any(inds)) temp[, inds]
})
new_data <- FTA_data[!sapply(data, is.null), ]
new_data[cols] <- NULL
final_data <- cbind(new_data, do.call(rbind, data))
final_data
# No Base_treaty entry_type 1 2
#3 3 3 treaty Algeria Angola
#5 5 5 treaty Albania Bulgaria
#6 6 6 treaty Albania Croatia
Here is another way:
library(dplyr)
library(tidyr)
output<- FTA_data[rowSums(sapply(all_countries, function(x)
apply(FTA_data[cols], 1, function(y)
grepl(x, paste(y, collapse = " "))))) > 0, ]
output %>%
pivot_longer(cols = starts_with('c'),
values_drop_na = TRUE) %>%
filter(value %in% Trade_data$inventor_ctry_i |
value %in% Trade_data$authority_ctry_j) %>%
group_by(No, Base_treaty, entry_type) %>%
mutate(name = paste0('c', row_number())) %>%
pivot_wider()
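As for why the loop in the question appears to do nothing: for(i in nrow(FTA_final)) visits only the single value nrow(FTA_final) rather than every row, and FTA_final[i, j] with j from seq_along(specialnames) indexes columns 1, 2, 3, ... instead of the country columns. A minimal loop-based sketch, assuming the sample data from the question (recreated here) and using illustrative column names matched_ctry1 and matched_ctry2:

```r
FTA_data <- data.frame(
  No = 1:6, Base_treaty = 1:6, entry_type = "treaty",
  c1 = c("Afghanistan", "Algeria", "Algeria", "Egypt", "Albania", "Albania"),
  c2 = c("India", "Egypt", "Angola", "Jordan", "Bulgaria", "Croatia"),
  c3 = c(NA, "Ghana", "Benin", "Morocco", NA, NA),
  stringsAsFactors = FALSE
)
Trade_data <- data.frame(
  inventor_ctry_i = c("Albania", "Albania", "Algeria", "Algeria", "Algeria",
                      "Andorra", "Andorra", "Anguilla", "Anguilla", "Anguilla"),
  authority_ctry_j = c("Bulgaria", "Croatia", "Angola", "Belgium", "France",
                       "Turkey", "United States", "Germany", "Switzerland",
                       "United States"),
  stringsAsFactors = FALSE
)

cols <- grep("^c[0-9]+$", names(FTA_data), value = TRUE)  # the country columns
matches <- list()
for (i in seq_len(nrow(FTA_data))) {            # seq_len() visits every row
  members <- na.omit(unlist(FTA_data[i, cols])) # countries in this agreement
  for (p in seq_len(nrow(Trade_data))) {
    pair <- unlist(Trade_data[p, ])
    if (all(pair %in% members)) {               # both i and j are in the same row
      matches[[length(matches) + 1]] <- data.frame(
        FTA_data[i, 1:3],
        matched_ctry1 = pair[[1]], matched_ctry2 = pair[[2]],
        stringsAsFactors = FALSE
      )
    }
  }
}
result <- do.call(rbind, matches)
result
```

Collecting the matches in a list and binding once at the end avoids growing a data frame with rbind inside the loop. With 91 country columns and many trade pairs, the vectorised answers above are likely to be faster.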

Thank you to @Ronak Shah for your suggestions.
As suggested by @Ronak Shah, I was able to get the relevant rows that had countries i and j in them:
cols <- paste0('c', 1:3)
all_countries <- do.call(paste, Trade_data)
output<- FTA_data[rowSums(sapply(all_countries, function(x)
apply(FTA_data[cols], 1, function(y)
grepl(x, paste(y, collapse = " "))))) > 0, ]
After which, I did this:
do.call(rbind, combn(grep("^c\\d+$", names(output)), 2, function(x)
cbind(output[1:3], setNames(output[x], paste0("c", 1:2))), simplify=F))
This gives me all possible combinations across the "c" columns while retaining columns 1:3, i.e. No, Base_treaty and entry_type.
After this, a simple left join with the trade data gave me the desired i and j pairs and the output:
No Base_treaty entry_type matched ctry1 matched ctry2
3 3 treaty Algeria Angola
5 5 treaty Albania Bulgaria
6 6 treaty Albania Croatia

Related

How to subtract each Country's value by year

I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
I don't have a clear path to accomplish what I want, but the result would look like:
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
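To illustrate why the ordering matters, here is a minimal sketch on a deliberately shuffled toy data frame. The name toy is made up for this example; the ranks are the Afghanistan and Albania values from the output in this answer. Sorting by Country and Year before the grouped diff gives the year-over-year change regardless of the original row order:

```r
library(dplyr)

# Toy data in scrambled order (toy is a hypothetical name for this sketch)
toy <- data.frame(
  Country = c("Albania", "Afghanistan", "Albania",
              "Afghanistan", "Albania", "Afghanistan"),
  Year = c("2017", "2015", "2015", "2016", "2016", "2017"),
  Happiness.Rank = c(109L, 153L, 95L, 154L, 109L, 141L),
  stringsAsFactors = FALSE
)

toy_diff <- toy %>%
  arrange(Country, Year) %>%          # put each country's years in order first
  group_by(Country) %>%
  mutate(Rank.Diff = c(NA, diff(Happiness.Rank))) %>%
  ungroup()

toy_diff
```

Without the arrange() call, diff() would subtract whatever rows happen to be adjacent within each country.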
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment with the names data2015, data2016 and data2017, we can add a Year column with the respective year and keep the columns listed in the keepcols vector. Then arrange the data by Country and Year, group by Country, keep only those countries that are present in all 3 years, and subtract the previous row's values using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The value in the first row for each Country will always be NA; in every other row, the previous row's value is subtracted from the current one.
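As a quick check of the two alternatives mentioned in the code comment, the sketch below applies both to a single country's ranks (the Afghanistan values from the earlier output; the variable names are made up for this example). Both should yield NA for the first year and the year-over-year change afterwards:

```r
library(dplyr)

ranks <- c(153L, 154L, 141L)   # Afghanistan, 2015 to 2017

by_lag  <- ranks - lag(ranks)  # dplyr::lag shifts the vector down by one
by_diff <- c(NA, diff(ranks))  # base R diff, padded with a leading NA

by_lag
by_diff
```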

Lagging a variable by adding up the previous 5 years?

I am working with data that look like this:
Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000
I want to create a lagged variable by adding up the previous five years in the data
So that the observation for 2000 looks like this:
Country Year Aid Lagged5
Angola 2000 302210000 1953200000
Which was derived by adding the Aid observations from 1995 to 1999 together:
416420000 + 459310000 + 354660000 + 335270000 + 387540000 = 1953200000
Also, I will need to group by country as well.
Thank You!
You could do:
library(dplyr)
df %>%
group_by(Country) %>%
mutate(Lagged5 = sapply(Year, function(x) sum(Aid[between(Year, x - 5, x - 1)])))
Output:
# A tibble: 6 x 4
# Groups: Country [1]
Country Year Aid Lagged5
<chr> <int> <int> <int>
1 Angola 1995 416420000 0
2 Angola 1996 459310000 416420000
3 Angola 1997 354660000 875730000
4 Angola 1998 335270000 1230390000
5 Angola 1999 387540000 1565660000
6 Angola 2000 302210000 1953200000
Using the input DF shown reproducibly in the Note at the end, define a roll function that sums the prior 5 rows, and use ave to run it for each Country. The width argument list(-seq(5)) to rollapplyr means: use offsets -1, -2, -3, -4, -5 when summing, i.e. the values in the prior 5 rows.
The question did not discuss what to do with the initial rows in each country, so we put in NA values; if you want partial sums, add the partial = TRUE argument to rollapplyr. You can also change fill = NA to some other value if you wish, so it is quite flexible.
library(zoo)
roll <- function(x) rollapplyr(x, list(-seq(5)), sum, fill = NA)
transform(DF, Lag5 = ave(Aid, Country, FUN = roll))
Note
The input was assumed to be the following. We added a second country.
Lines <- "Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE,
colClasses = c("character", "integer", "numeric"))
DF <- rbind(DF, transform(DF, Country = "Belize"))
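For comparison, the same "sum of the prior 5 rows" idea can be sketched in base R without zoo. This is only an illustration: lag_sum is a made-up helper name, and it assumes the rows within each country are sorted by year with no gaps, since it works on row positions rather than on the Year values.

```r
# Rebuild the two-country input from the Note
DF <- data.frame(
  Country = rep(c("Angola", "Belize"), each = 6),
  Year = rep(1995:2000, times = 2),
  Aid = rep(c(416420000, 459310000, 354660000,
              335270000, 387540000, 302210000), times = 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum of the k values before each position, NA if fewer than k exist
lag_sum <- function(x, k = 5) {
  vapply(seq_along(x), function(i) {
    if (i <= k) NA_real_ else sum(x[(i - k):(i - 1)])
  }, numeric(1))
}

# ave() applies lag_sum separately within each Country
DF$Lag5 <- ave(DF$Aid, DF$Country, FUN = lag_sum)
DF
```

The dplyr answer above selects by Year value instead, so it also copes with missing years, and it returns partial sums rather than NA for the early rows.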

Join 2 dataframes together if two columns match

I have 2 dataframes:
CountryPoints
From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1
and another dataframe with neighbouring/bordering countries:
From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy
I would like to add another column in CountryPoints called neighbour (Y/N) depending if the key value pair is found in the neighbour/bordering countries dataframe. Is this somehow possible - so it is a kind of a join but the result should be a boolean column.
The result should be:
From.country To.Country points Neighbour
Belgium Finland 4 Y
Belgium Germany 5 Y
Malta Italy 12 Y
Malta UK 1 N
In the question below it shows how you can merge but it doesn't show how you can add that extra boolean column
Two alternative approaches:
1) with base R:
idx <- match(df1$From.country, df2$From.country, nomatch = 0) &
match(df1$To.Country, df2$To.Country, nomatch = 0)
df1$Neighbour <- c('N','Y')[1 + idx]
2) with data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[, Neighbour := 'N'][df2, on = .(From.country, To.Country), Neighbour := 'Y'][]
which both give (data.table-output shown):
From.country To.Country points Neighbour
1: Belgium Finland 4 Y
2: Belgium Germany 5 Y
3: Malta Italy 12 Y
4: Malta UK 1 N
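One caveat about the base R version: it matches the two columns independently, so a row can be flagged when both countries appear somewhere in df2 but not together in the same row. The sketch below adds a hypothetical fifth row (Belgium to Italy, with an invented points value) to show the difference, and compares it with a check on pasted keys, which only matches whole pairs. The separator "|" is an assumption that no country name contains that character.

```r
df1 <- data.frame(
  From.country = c("Belgium", "Belgium", "Malta", "Malta", "Belgium"),
  To.Country   = c("Finland", "Germany", "Italy", "UK", "Italy"),
  points       = c(4, 5, 12, 1, 7),   # last row is hypothetical, not from the question
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  From.country = c("Belgium", "Belgium", "Malta"),
  To.Country   = c("Finland", "Germany", "Italy"),
  stringsAsFactors = FALSE
)

# Column-by-column check: TRUE for Belgium/Italy although that pair is not in df2
independent <- df1$From.country %in% df2$From.country &
  df1$To.Country %in% df2$To.Country

# Pairwise check: build one key per row and compare whole keys
key1 <- paste(df1$From.country, df1$To.Country, sep = "|")
key2 <- paste(df2$From.country, df2$To.Country, sep = "|")
df1$Neighbour <- ifelse(key1 %in% key2, "Y", "N")
df1
```

The data.table join on both columns is already pairwise, so it is not affected.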
Borrowing the idea from this post:
df1$Neighbour <- duplicated(rbind(df2[, 1:2], df1[, 1:2]))[ -seq_len(nrow(df2)) ]
df1
# From.country To.Country points Neighbour
# 1 Belgium Finland 4 TRUE
# 2 Belgium Germany 5 TRUE
# 3 Malta Italy 12 TRUE
# 4 Malta UK 1 FALSE
What about something like this?
sortpaste <- function(x) paste0(sort(x), collapse = "_")
df1$Neighbour <- apply(df1[, 1:2], 1, sortpaste) %in% apply(df2[, 1:2], 1, sortpaste)
# From.country To.Country points Neighbour
#1 Belgium Finland 4 TRUE
#2 Belgium Germany 5 TRUE
#3 Malta Italy 12 TRUE
#4 Malta UK 1 FALSE
Sample data
df1 <- read.table(text =
"From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1", header = T)
df2 <- read.table(text =
"From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy", header = T)

Pass a string argument to a function as dataframe column name in dplyr

I am trying to pass a string variable to a function, to be used as the column name after some data alteration.
Here is the function:
cleandata <- function(df,name){
df <- df %>%
gather(key = 'Year',value = name,X1960:X2015)
df <- df %>%
select(-c(X,Indicator.Name,Indicator.Code))
df$Year <- substr(df$Year,start = 2,stop = 5)
df$Year <- as.factor(df$Year)
return(df)
}
I want to pass a string variable to 'name', and have it as the column name.
The current output of the function is:
> cleandata(lifeexp,'LifeExp')
Source: local data frame [13,888 x 4]
Country.Name Country.Code Year name
(fctr) (fctr) (fctr) (dbl)
1 Aruba ABW 1960 65.56937
2 Andorra AND 1960 NA
3 Afghanistan AFG 1960 32.32851
4 Angola AGO 1960 32.98483
5 Albania ALB 1960 62.25437
6 Arab World ARB 1960 46.84706
7 United Arab Emirates ARE 1960 52.24322
8 Argentina ARG 1960 65.21554
9 Armenia ARM 1960 65.86346
10 American Samoa ASM 1960 NA
.. ... ... ... ...
>
The last column should be 'LifeExp', not name. What am I missing?
Thanks in advance,
Rahul
You want to use gather_ here. See vignette('nse') for an explanation why.
year_cols <- names(df)[grepl('^X\\d{4}$', names(df))]
df %>% gather_('Year', name, year_cols)
The issue is that gather takes an unquoted name for its key and value columns, so you can't pass in a variable holding the name. It will interpret whatever you put there as the literal, unquoted name you want for the value column. This is consistent with the principle that the tidyr functions without underscores are meant for interactive use, while those with underscores are meant for programmatic use.
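In current tidyr releases gather_ is deprecated, and pivot_longer accepts the new column names as plain strings, so a string variable can be passed directly. A minimal sketch, assuming a recent tidyr; the function name to_long and the toy data frame wide are made up for this example, and the 1961 values are invented:

```r
library(tidyr)

# Toy wide data: the 1960 values are from the question, the 1961 values are invented
wide <- data.frame(
  Country.Name = c("Aruba", "Angola"),
  X1960 = c(65.56937, 32.98483),
  X1961 = c(66.0, 33.4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `name` is a string used as the value column's name
to_long <- function(df, name) {
  long <- pivot_longer(df,
                       cols = matches("^X[0-9]{4}$"),  # the year columns
                       names_to = "Year",
                       values_to = name)
  long$Year <- factor(substr(long$Year, 2, 5))         # "X1960" -> "1960"
  long
}

res <- to_long(wide, "LifeExp")
names(res)
```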

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 world cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score-football$away_score).
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing this way. First compute the score differences and the winners. When result is w, home is the winner, so you do not have to look at the scores at all to find the winner. Once you have added the score difference and the winner, you can subset the data to the rows where scorediff equals max(scorediff).
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option without using ifelse for creating the "winner" column. This is based on row/column indexes. The numeric column index is created by matching the result column with its unique elements (match(football$result,..), and the row index is just 1:nrow(football). Subset the "football" dataset with columns 'home', 'away' and cbind it with an additional column 'draw' with NAs so that the 'd' elements in "result" change to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
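The row/column indexing trick may be easier to follow on a small example. The sketch below uses a toy games data frame (a made-up name, with four rows typed in by hand rather than read from the linked file): each row's result is turned into a column number, and a two-column matrix of (row, column) positions then picks one cell per row.

```r
# Toy data, entered by hand for illustration
games <- data.frame(
  home   = c("Cameroon", "Spain", "Germany", "Iran"),
  away   = c("Croatia", "Netherlands", "Portugal", "Nigeria"),
  result = c("l", "l", "w", "d"),
  stringsAsFactors = FALSE
)

# Columns 1, 2, 3 correspond to results "w", "l", "d"
lookup  <- cbind(games[c("home", "away")], draw = NA)
col_idx <- match(games$result, c("w", "l", "d"))

# One (row, column) pair per game: picks home for "w", away for "l", NA for "d"
games$winner <- lookup[cbind(seq_len(nrow(games)), col_idx)]
games
```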
