Lagging a variable by adding up the previous 5 years in R?

I am working with data that look like this:
Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000
I want to create a lagged variable by adding up the previous five years in the data,
so that the observation for 2000 looks like this:
Country Year Aid Lagged5
Angola 2000 302210000 1953200000
Which was derived by adding the Aid observations from 1995 to 1999 together:
416420000 + 459310000 + 354660000 + 335270000 + 387540000 = 1953200000
I will also need to group by country.
Thank You!

You could do:
library(dplyr)
df %>%
  group_by(Country) %>%
  mutate(Lagged5 = sapply(Year, function(x) sum(Aid[between(Year, x - 5, x - 1)])))
Output:
# A tibble: 6 x 4
# Groups: Country [1]
Country Year Aid Lagged5
<chr> <int> <int> <int>
1 Angola 1995 416420000 0
2 Angola 1996 459310000 416420000
3 Angola 1997 354660000 875730000
4 Angola 1998 335270000 1230390000
5 Angola 1999 387540000 1565660000
6 Angola 2000 302210000 1953200000

Using the input DF shown reproducibly in the Note at the end, define a roll function which sums the prior 5 rows, and use ave to run it for each Country. The width argument list(-seq(5)) to rollapplyr means use offsets -1, -2, -3, -4, -5 in the sum, i.e. the values in the prior 5 rows.
The question did not discuss what to do with the initial rows in each country, so we fill them with NA values, but if you want partial sums add the partial = TRUE argument to rollapplyr. You can also change fill = NA to some other value if you wish, so it is quite flexible.
library(zoo)
roll <- function(x) rollapplyr(x, list(-seq(5)), sum, fill = NA)
transform(DF, Lag5 = ave(Aid, Country, FUN = roll))
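For example, a minimal sketch of the partial-sums variant mentioned above (same DF as in the Note, with only the partial = TRUE argument added):
# partial = TRUE sums however many prior rows are available,
# giving partial sums for the first rows of each country
roll_partial <- function(x) rollapplyr(x, list(-seq(5)), sum, fill = NA, partial = TRUE)
transform(DF, Lag5 = ave(Aid, Country, FUN = roll_partial))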
Note
The input was assumed to be the following. We added a second country.
Lines <- "Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE,
                 colClasses = c("character", "integer", "numeric"))
DF <- rbind(DF, transform(DF, Country = "Belize"))

Related

ggplot doesn't arrange my graph as expected [duplicate]

This question already has answers here:
ggplot2: sorting a plot
How to force specific order of the variables on the X axis?
Good morning,
I'm trying to use ggplot with a data frame, but I've run into an issue: ggplot does not take the arrange() call on my data frame into account.
Here is my code:
data()
pop <- population[population$year == 1995, ]
pop <- pop[1:10, ]
pop %>%
  ggplot(aes(x = country, y = population)) +
  geom_point()
pop <- pop %>%
  arrange(population)
pop %>%
  ggplot(aes(x = country, y = population)) +
  geom_point()
I would like my graph to be ordered by population: the country with the lowest population first, then the second lowest, and so on. But ggplot does not order my graph as expected.
I have this data frame:
country year population
<chr> <int> <int>
1 Anguilla 1995 9807
2 American Samoa 1995 52874
3 Andorra 1995 63854
4 Antigua and Barbuda 1995 68349
5 Armenia 1995 3223173
6 Albania 1995 3357858
7 Angola 1995 12104952
8 Afghanistan 1995 17586073
9 Algeria 1995 29315463
10 Argentina 1995 34833168
But my graph is ordered alphabetically:
Do you have any idea how to order it by population?
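The usual fix is to reorder the factor levels of country by population rather than just arranging the rows: arrange() changes the row order, but ggplot orders a discrete axis by factor levels. A minimal sketch (assuming the same pop data frame as above):
library(ggplot2)
# reorder() sets the factor levels of country according to population,
# so the x axis runs from the smallest to the largest population
ggplot(pop, aes(x = reorder(country, population), y = population)) +
  geom_point() +
  labs(x = "country")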

How to use for loop to extract rows with similar elements across 2 dataframes?

I have two data frames. One is a Free Trade Agreement dataset that contains many columns; the columns c1 to c91 denote the different countries that are part of a particular Free Trade Agreement, as shown below:
FTA data (excerpt):
No Base_treaty entry_type c1 c2 c3
1 1 treaty Afghanistan India NA
2 2 treaty Algeria Egypt Ghana
3 3 treaty Algeria Angola Benin
4 4 treaty Egypt Jordan Morocco
5 5 treaty Albania Bulgaria NA
6 6 treaty Albania Croatia NA
The other data frame contains trade data between two particular countries, i and j (Trade Data):
inventor_ctry_i authority_ctry_j
1 Albania Bulgaria
2 Albania Croatia
3 Algeria Angola
4 Algeria Belgium
5 Algeria France
6 Andorra Turkey
7 Andorra United States
8 Anguilla Germany
9 Anguilla Switzerland
10 Anguilla United States
Desired output:
No Base_treaty entry_type matched ctry1 matched ctry2
3 3 treaty Algeria Angola
5 5 treaty Albania Bulgaria
6 6 treaty Albania Croatia
I want to find countries i and j in the trade data that show up in the same row somewhere between c1 and c91 of the FTA data. If both are present in a particular row, extract the two countries from that row of the FTA data, keeping the No, Base_treaty and entry_type columns intact.
What I have done so far:
FTA_final: FTA Data, unique_pairs: Trade Data
specialnames <- setdiff(names(FTA_final), c("number", "base_treaty",
                        "entry_type"))  # getting rid of irrelevant columns
table <- data.frame()  # create empty data frame
for (i in nrow(FTA_final)) {
  for (j in seq_along(specialnames)) {
    for (p in nrow(unique_pairs)) {
      if (FTA_final[i, j] %in% unique_pairs[p, ]) {
        table <- rbind(table, FTA_final[i, c(1:3, j)])
      }
    }
  }
}  # for loop
Nothing happens when I run this code, and I'm not sure why. Any help would be greatly appreciated.
One way to do this would be to row-wise paste the values of Trade_data to get the pairs of countries that trade together. We can then create combinations of countries in FTA_data and check whether any of those combinations match all_countries.
cols <- paste0('c', 1:3)
all_countries <- do.call(paste, Trade_data)

data <- apply(FTA_data[cols], 1, function(x) {
  x <- na.omit(x)
  if (length(x) <= 1) return(NULL)
  temp <- combn(x, 2)
  inds <- combn(x, 2, paste, collapse = " ") %in% all_countries
  if (any(inds)) temp[, inds]
})

new_data <- FTA_data[!sapply(data, is.null), ]
new_data[cols] <- NULL
final_data <- cbind(new_data, do.call(rbind, data))
final_data
# No Base_treaty entry_type 1 2
#3 3 3 treaty Algeria Angola
#5 5 5 treaty Albania Bulgaria
#6 6 6 treaty Albania Croatia
Here is another way:
library(dplyr)
library(tidyr)
output <- FTA_data[rowSums(sapply(all_countries, function(x)
  apply(FTA_data[cols], 1, function(y)
    grepl(x, paste(y, collapse = " "))))) > 0, ]

output %>%
  pivot_longer(cols = starts_with('c'),
               values_drop_na = TRUE) %>%
  filter(value %in% Trade_data$inventor_ctry_i |
           value %in% Trade_data$authority_ctry_j) %>%
  group_by(No, Base_treaty, entry_type) %>%
  mutate(name = paste0('c', row_number())) %>%
  pivot_wider()
Thank you to @Ronak Shah for your suggestions.
As suggested by @Ronak Shah, I was able to get the relevant rows that had countries i and j in them:
cols <- paste0('c', 1:3)
all_countries <- do.call(paste, Trade_data)
output <- FTA_data[rowSums(sapply(all_countries, function(x)
  apply(FTA_data[cols], 1, function(y)
    grepl(x, paste(y, collapse = " "))))) > 0, ]
After that, I did this:
do.call(rbind, combn(grep("^c\\d+$", names(output)), 2, function(x)
  cbind(output[1:3], setNames(output[x], paste0("c", 1:2))), simplify = FALSE))
This gets me all possible pairs of countries across the "c" columns, while retaining columns 1:3, i.e. No, Base_treaty and entry_type.
After this, a simple left join with the trade data gives me the desired i and j pairs and the output:
No Base_treaty entry_type matched ctry1 matched ctry2
3 3 treaty Algeria Angola
5 5 treaty Albania Bulgaria
6 6 treaty Albania Croatia
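For reference, that join step might look roughly like this (a sketch; pairs is a hypothetical name for the result of the do.call(rbind, combn(...)) step above, and an inner join is used so that only matched rows remain):
library(dplyr)
# keep only the country pairs that also appear in the trade data
# (assumes the pair order matches inventor_ctry_i / authority_ctry_j)
matched <- inner_join(pairs, Trade_data,
                      by = c("c1" = "inventor_ctry_i", "c2" = "authority_ctry_j"))
matched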

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = "Residual". The value of the residual will be the UN value where sector == "Total" minus the sum of the UN column over the sectors c("1", "2", "3", "4", "5"), for a given year AND country.
This is what it should look like:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
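For AT 1990, for example, the residual would be 7.869005 - (1.407555 + 1.037137 + 4.769618 + 2.455139 + 2.238618) = 7.869005 - 11.908067 = -4.039062.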
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating the residual first and then stacking it with the other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector != 'Total'), sum),
                    aggregate(UN ~ country + year, data = subset(df, sector == 'Total'), sum),
                    by = c("country", "year")),
              {
                UN <- UN.y - UN.x
                sector <- 'Residual'
              })

# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector != 'Total'),
                  agg[c("country", "year", "sector", "UN")],
                  subset(df, sector == 'Total'))

# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)), ])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I would recommend is taking advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can use dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. There are also tons of great tutorials, cheat sheets, and documentation on all the tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bind it back onto your original data frame.
Step 1: Calculating all residual values
We want to calculate the sum of UN values, grouped by country and year. We can achieve this with:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN, na.rm = T))
Step 2: Add a sector column to res_UN with the value 'Residual'
Step 1 yields a data frame containing country, year, and UN; we now need to add a sector column with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns, so they can be bound back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be done in a couple of lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN, na.rm = T))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

ddply and adding columns

I have a data frame with columns year|country|growth_rate. I wanted to find the country with the highest growth rate in every year, which I did with:
ddply(data, .(year), summarise, highest=max(growth_rate))
and I got a data frame with two columns: year and highest.
I would like to add a third column here, showing the country that had that max growth_rate, but I can't figure out how to do this.
R> data = data.frame(year = rep(1990:1993, 2), growth_rate = runif(8), country = rep(c("US", "FR"), each = 4))
R> data
year growth_rate country
1 1990 0.82785327 US
2 1991 0.86724498 US
3 1992 0.84813164 US
4 1993 0.35884355 US
5 1990 0.92792399 FR
6 1991 0.08659153 FR
7 1992 0.26732516 FR
8 1993 0.37819132 FR
R> ddply(data, .(year), summarize, highest = max(growth_rate), country = country[which.max(growth_rate)])
year highest country
1 1990 0.9279240 FR
2 1991 0.8672450 US
3 1992 0.8481316 US
4 1993 0.3781913 FR
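For reference, the same logic as a dplyr one-liner (a sketch, assuming the same data frame as above):
library(dplyr)
# for each year, keep the maximum growth_rate and the country it belongs to
data %>%
  group_by(year) %>%
  summarise(highest = max(growth_rate),
            country = country[which.max(growth_rate)])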

How to get column mean for specific rows only?

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:
period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991
This is the structure of my data:
country year score
Algeria 1980 -1.1201501
Algeria 1981 -1.0526943
Algeria 1982 -1.0561565
Algeria 1983 -1.1274560
Algeria 1984 -1.1353926
Algeria 1985 -1.1734330
Algeria 1986 -1.1327666
Algeria 1987 -1.1263586
Algeria 1988 -0.8529455
Algeria 1989 -0.2930265
Algeria 1990 -0.1564207
Algeria 1991 -0.1526328
Algeria 1992 -0.9757842
Algeria 1993 -0.9714060
Algeria 1994 -1.1422258
Algeria 1995 -0.3675797
...
The calculated mean values should be added to the df in an additional column ("mean"), i.e. the same mean value for all years of period 1, another for all years of period 2, etc.
This is what it should look like:
country year score mean
Algeria 1980 -1.1201501 -1.089
Algeria 1981 -1.0526943 -1.089
Algeria 1982 -1.0561565 -1.089
Algeria 1983 -1.1274560 -1.089
Algeria 1984 -1.1353926 -0.839
Algeria 1985 -1.1734330 -0.839
Algeria 1986 -1.1327666 -0.839
Algeria 1987 -1.1263586 -0.839
Algeria 1988 -0.8529455 -0.839
Algeria 1989 -0.2930265 -0.839
Algeria 1990 -0.1564207 -0.839
...
Every approach I tried quickly became super complicated, and I have to calculate the mean scores for different periods of time for over 90 countries ...
Many many thanks for your help!
datfrm$mean <- with(datfrm, ave(score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN = mean))
The title question is a bit different from the real question and would be answered using logical indexing. If one wanted only the mean for a particular subset, say year >= 1984 & year <= 1990, it would be done via:
mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
Since findInterval requires year to be sorted (as it is in your example), I'd be tempted to use cut in case it isn't sorted [proved wrong, thanks @DWin]. For completeness, the data.table equivalent (scales for large data) is:
require(data.table)
DT = as.data.table(DF) # or just start with a data.table in the first place
DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf))]
or, since findInterval is likely faster, as DWin used:
DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
If the rows are ordered by year, I think the easiest way to accomplish this would be:
m80_83 <- mean(dataframe[1:4,3]) #Finds the mean of the values of column 3 for rows 1 through 4
m84_90 <- mean(dataframe[5:10,3])
#etc.
If the rows are not ordered by year, I would use tapply like this:
list.of.means <- c(tapply(dataframe$score, cut(dataframe$year, c(0, 1983.5, 1990.5, 3000)), mean))
Here, tapply takes three parameters:
First, the data you want to work with (in this case, dataframe$score).
Second, a grouping that cuts the data up into groups. In this case, cut() splits the rows into three groups based on the dataframe$year values: Group 1 includes all rows with dataframe$year values from 0 to 1983.5, Group 2 all rows with values from 1983.5 to 1990.5, and Group 3 all rows with values from 1990.5 to 3000.
Third, a function that is applied to each group. This function will apply to the data you selected as your first parameter.
So list.of.means should hold the 3 mean values you are looking for.
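For completeness, a dplyr sketch of the same idea (assuming the data frame is called datfrm, as in the first answer):
library(dplyr)
# cut() assigns each year to one of the three periods, and mutate()
# adds the per-period mean to every row; the grouping is kept as a 'period' column
datfrm <- datfrm %>%
  group_by(period = cut(year, c(-Inf, 1983, 1990, Inf),
                        labels = c("<=1983", "1984-1990", ">=1991"))) %>%
  mutate(mean = mean(score)) %>%
  ungroup()
If the means should also be computed separately for each of the 90+ countries, add country to the group_by() call.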
