Create time event based dummy variable in R - leads & lags

I am looking for a way to create a set of dummy variables that indicate a time event in a panel. Specifically, I want dummy variables marking each of the 20 years before and the 20 years after an event, e.g. to estimate the effect of a war on trade over 20 years. I want to code these dummies for each partner in the dyads. Is there an elegant way to program these event dummies? I would appreciate your help :)
iso_o iso_d year mid_o mid_d
ABW AFG 1980 0 1
ABW AFG 1981 0 1
ABW AFG 1982 0 1
ABW AFG 1983 0 2
ABW AFG 1984 0 1
ABW AFG 1985 0 1
ABW AFG 1986 0 1
ABW AFG 1987 0 1
ABW AFG 1988 0 0
ABW AFG 1989 0 1
This is what I want to end up with:
iso_o iso_d year mid_o mid_d mid_o_t-20 mid_o_t-19 mid_o_t-18 .... mid_d_t-20
ABW AFG 1980 0 1 0 0 0
ABW AFG 1981 0 1 0 0 0
ABW AFG 1982 0 1 0 0 0
ABW AFG 1983 0 2 0 0 0
ABW AFG 1984 0 1 0 0 0
ABW AFG 1985 0 1 0 0 0

I'm assuming here da.f (short for data.frame with no collision with known functions) follows approximately your structure as you did not include it in the question.
library(zoo)
#da.f is randomly generated in this example
da.f = data.frame(mid_o = sample(seq(0,4), 50, replace = TRUE), mid_d = sample(seq(0,4), 50, replace = TRUE))
#our result consists of 20 lags backward and forward in time
res = lag(as.zoo(da.f), -20:20, na.pad = TRUE)
On May 10th 2018, @thistleknot pointed out to me (thanks!) that dplyr masks the lag generic from stats. So make sure dplyr is not attached, or call stats::lag explicitly; otherwise my code won't run.
I think I found the culprit: github.com/tidyverse/dplyr/issues/1586
answer: This is a natural consequence of having lots of R packages.
Just be explicit and use stats::lag or dplyr::lag
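To make the masking point concrete, here is a minimal sketch on a made-up five-point series (the values are an assumption for illustration only). It calls stats::lag explicitly, so it should behave the same whether or not dplyr is attached. Note the sign convention in zoo: a negative k gives the value from the previous period, a positive k gives the value from the following period.

```r
library(zoo)

# Toy series indexed 1..5; the values are invented for illustration
z <- zoo(c(10, 20, 30, 40, 50))

# Explicit namespace avoids dplyr::lag, which does not dispatch to zoo's method
prev_val <- stats::lag(z, -1, na.pad = TRUE)  # value from the previous period
next_val <- stats::lag(z,  1, na.pad = TRUE)  # value from the following period
```

With na.pad = TRUE the result keeps all five time points and fills the edges with NA.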

Hello there, and thank you for your help!
I found a solution to the problem. First, I had to convert the data.frame to a data.table. Second, I found a way to create multiple columns in data.table by combining sprintf and shift. That way I could create 20 lags and 20 leads in only 4 lines of code.
df[, sprintf("mid_o_lag_%0d", 1:20) := shift(mid_o, c(1:20), type = 'lag')]
df[, sprintf("mid_d_lag_%0d", 1:20) := shift(mid_d, c(1:20), type = 'lag')]
df[, sprintf("mid_o_lead_%0d", 1:20) := shift(mid_o, c(1:20), type = 'lead')]
df[, sprintf("mid_d_lead_%0d", 1:20) := shift(mid_d, c(1:20), type = 'lead')]
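One caveat with the four lines above: they shift over the whole table, so in a panel the first years of one dyad would pick up values from the last years of the previous dyad. A minimal sketch of a per-dyad version follows, assuming the data are sorted by year within each dyad. The toy table, its values, and the short window k <- 1:2 are assumptions for illustration; use 1:20 for the full window.

```r
library(data.table)

# Toy panel with two dyads; the mid_o values are invented for illustration
dt <- data.table(
  iso_o = "ABW",
  iso_d = rep(c("AFG", "AGO"), each = 4),
  year  = rep(1980:1983, times = 2),
  mid_o = c(0, 1, 0, 0, 1, 0, 0, 0)
)
setorder(dt, iso_o, iso_d, year)  # shift() works on row order, so sort first

k <- 1:2  # short window for the example
lag_cols  <- sprintf("mid_o_lag_%d", k)
lead_cols <- sprintf("mid_o_lead_%d", k)

# by = .(iso_o, iso_d) keeps lags and leads from crossing dyad boundaries
dt[, (lag_cols)  := shift(mid_o, k, type = "lag"),  by = .(iso_o, iso_d)]
dt[, (lead_cols) := shift(mid_o, k, type = "lead"), by = .(iso_o, iso_d)]
```

The first row of each dyad gets NA in the lag columns instead of a value borrowed from the preceding dyad.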

Related

How to assign unique country ID number in panel data frame in R

In my dataset, I want to create unique country id numbers. Any help?
library(dplyr)
library(pwt10)
The data frame looks like this:
country isocode year currency gdp inflation ...
Aruba ABW 1950 N/A N/A N/A
Aruba ABW 1950 N/A N/A N/A
Aruba ABW 1950 N/A N/A N/A
Aruba ABW 1950 N/A N/A N/A
...
Argentina ARG 1950 Peso 130 60 ...
I want to create another column holding a country ID (id_num), whose values are numbered in ascending order (1, 2, 3, ...), so that it looks like the following:
country isocode year currency gdp inflation ID
Aruba ABW 1950 N/A N/A N/A 1
Aruba ABW 1950 N/A N/A N/A 1
Aruba ABW 1950 N/A N/A N/A 1
Aruba ABW 1950 N/A N/A N/A 1
...
Argentina ARG 1950 Peso 130 60 ... 5
I was wondering how to create the unique country ID column. Any help?
If I understood your task correctly, you want to build a second group identifier (the first being isocode) by numbering the groups sequentially. One way to achieve this is the cur_group_id() function from dplyr. Here is a toy example that you should be able to adapt to your data.frame:
library(dplyr)
# dummy data
df <- data.frame(col1 = c("a", "a", "b", "b", "b", "c") ,
col2 = 1:6)
df %>%
# arrange the data in growing order for column you want to build sequential group ID from/for
dplyr::arrange(col1) %>%
# build the groupings
dplyr::group_by(col1) %>%
# add new column: sequential group id
dplyr::mutate(ID = dplyr::cur_group_id()) %>%
# always ungroup to prevent unwanted behaviour down stream
dplyr::ungroup()
# A tibble: 6 x 3
col1 col2 ID
<chr> <int> <int>
1 a 1 1
2 a 2 1
3 b 3 2
4 b 4 2
5 b 5 2
6 c 6 3
Do you mean that "ARM" should return 1, "AUS" should return 2, and so on?
If so, you can try this answer with match.
library(dplyr)
result <- pwt10.0 %>%
filter(isocode %in% comparison_states) %>%
distinct(isocode) %>%
mutate(id_num = match(comparison_states, isocode))
result
# isocode id_num
#ARM-1950 ARM 1
#AUS-1950 AUS 2
#CAN-1950 CAN 3
#CHN-1950 CHN 4
#GBR-1950 GBR 5
#ITA-1950 ITA 6
#JPN-1950 JPN 7
#LUX-1950 LUX 8
#NOR-1950 NOR 9
#NZL-1950 NZL 10
#SGP-1950 SGP 11
#SWE-1950 SWE 12
#THA-1950 THA 13
#TWN-1950 TWN 14
#USA-1950 USA 15
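Since neither pwt10.0 nor comparison_states is shown here, the following is a self-contained base R sketch of the same match idea on a toy data frame. The data frame toy and the contents of comparison_states are hypothetical stand-ins. Looking up each row's isocode in the reference vector gives every row of a country the same ID, and the order of the reference vector defines the numbering.

```r
# Toy stand-in for the real data; values are invented for illustration
toy <- data.frame(
  isocode = c("AUS", "AUS", "ARM", "USA", "ARM"),
  year    = c(1950, 1951, 1950, 1950, 1951)
)

# Hypothetical reference vector; its order defines the IDs
comparison_states <- c("ARM", "AUS", "USA")

# For each row, the position of its isocode in the reference vector
toy$id_num <- match(toy$isocode, comparison_states)
```

Any isocode that is missing from comparison_states would get NA rather than an ID.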

Revaluing many observations with a for loop in R

I have a data set where I am looking at longitudinal data for countries.
master.set <- data.frame(
Country = c(rep("Afghanistan", 3), rep("Albania", 3)),
Country.ID = c(rep("Afghanistan", 3), rep("Albania", 3)),
Year = c(2015, 2016, 2017, 2015, 2016, 2017),
Happiness.Score = c(3.575, 3.360, 3.794, 4.959, 4.655, 4.644),
GDP.PPP = c(1766.593, 1757.023, 1758.466, 10971.044, 11356.717, 11803.282),
GINI = NA,
Status = 2,
stringsAsFactors = F
)
> head(master.set)
Country Country.ID Year Happiness.Score GDP.PPP GINI Status
1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
4 Albania Albania 2015 4.959 10971.044 NA 2
5 Albania Albania 2016 4.655 11356.717 NA 2
6 Albania Albania 2017 4.644 11803.282 NA 2
I created that Country.ID variable with the intent of turning them into numerical values 1:159.
I am hoping to avoid doing something like this to replace the value at each individual observation:
master.set$Country.ID[master.set$Country.ID == "Afghanistan"] <- 1
As I implied, there are 159 countries listed in the data set. Because it's longitudinal, there are 460 observations.
Is there any way to use a for loop to save me a lot of time? Here is what I attempted. I made a couple of lists and attempted to use an ifelse command to tell R to label each country the next number.
Here is what I have:
#List of country names
N.Countries <- length(unique(master.set$Country))
Country <- unique(master.set$Country)
Country.ID <- unique(master.set$Country.ID)
CountryList <- unique(master.set$Country)
#For Loop to make Country ID numerically match Country
for (i in 1:460){
for (j in N.Countries){
master.set[[Country.ID[i]]] <- ifelse(master.set[[Country[i]]] == CountryList[j], j, master.set$Country)
}
}
I received this error:
Error in `[[<-.data.frame`(`*tmp*`, Country.ID[i], value = logical(0)) :
replacement has 0 rows, data has 460
Does anyone know how I can accomplish this task? Or will I be stuck using the ifelse command 159 times?
Thanks!
Maybe something like
master.set$Country.ID <- as.numeric(as.factor(master.set$Country.ID))
Or alternatively, using dplyr
library(tidyverse)
master.set <- master.set %>% mutate(Country.ID = as.numeric(as.factor(Country.ID)))
Or this, which creates a new variable Country.ID2 based on a key-value pair between Country and 1:length(unique(Country)).
library(tidyverse)
master.set <- left_join(master.set,
data.frame( Country = unique(master.set$Country),
Country.ID2 = 1:length(unique(master.set$Country))))
master.set
#> Country Country.ID Year Happiness.Score GDP.PPP GINI Status
#> 1 Afghanistan Afghanistan 2015 3.575 1766.593 NA 2
#> 2 Afghanistan Afghanistan 2016 3.360 1757.023 NA 2
#> 3 Afghanistan Afghanistan 2017 3.794 1758.466 NA 2
#> 4 Albania Albania 2015 4.959 10971.044 NA 2
#> 5 Albania Albania 2016 4.655 11356.717 NA 2
#> 6 Albania Albania 2017 4.644 11803.282 NA 2
#> Country.ID2
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 2
#> 6 2
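The approaches above differ in how the numbers are assigned, which matters if the data are not sorted alphabetically. As a small base R sketch on made-up country names: converting through a factor numbers countries by sorted level order, while matching against unique() numbers them in order of first appearance.

```r
# Invented, deliberately unsorted example
country <- c("Norway", "Norway", "Albania", "Chile", "Albania")

# Factor levels are sorted, so IDs follow alphabetical order
id_alpha <- as.integer(factor(country))

# unique() keeps order of first appearance, so IDs follow the data order
id_first <- match(country, unique(country))
```

Both give one stable ID per country; which one is preferable depends on whether the ID should be tied to alphabetical order or to the order of the rows.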
library(dplyr)
df<-data.frame("Country"=c("Afghanistan","Afghanistan","Afghanistan","Albania","Albania","Albania"),
"Year"=c(2015,2016,2017,2015,2016,2017),
"Happiness.Score"=c(3.575,3.360,3.794,4.959,4.655,4.644),
"GDP.PPP"=c(1766.593,1757.023,1758.466,10971.044,11356.717,11803.282),
"GINI"=NA,
"Status"=rep(2,6))
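# group_indices_() is deprecated in current dplyr; cur_group_id() yields the same group index
df1 <- df %>% arrange(Country) %>% group_by(Country) %>% mutate(Country_id = cur_group_id()) %>% ungroup()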
View(df1)

Construct a variable that conditionally takes a certain value until another condition is met

I have a panel dataset with data on conflicts for which I want to identify the post-conflict years.
So I constructed a variable myself that codes a transition from conflict to peace with "3". Whenever the values for a new country begin, I coded that same variable with NA.
What I want to do now is create a new binary variable that marks post-conflict years with 1, and conflict years and countries that never had a conflict with 0. For that, I would have to assign a 1 to every year from a 3 in the transition variable onward, until there is an NA in the same column. As follows:
Country Year transition post-conflict
Afghanistan 1994 0 0
Afghanistan 1995 0 0
Afghanistan 1996 3 1
Afghanistan 1997 2 1
Afghanistan 1998 2 1
Albania 1994 NA 0
Albania 1994 2 0
How could I go about this?
You probably shouldn't use NA like that. It prevents functions like which, sum, and cumsum from working as you may want them to. You likely don't need to mark the first row of a new country anyway, since most R functions you would use for your analysis can group by Country without needing a special marker showing where each group starts.
Below I change NA to something different, and make transition a factor. Then you can use cumsum to create your new column.
library(data.table)
setDT(df) # assuming your data is called df
# fix transition column
df[is.na(transition), transition := 90]
df[, transition := as.factor(transition)]
# create post_conflict column
df[, post_conflict := cumsum(transition == 3), by = Country]
# Country Year transition post_conflict
# 1: Afghanistan 1994 0 0
# 2: Afghanistan 1995 0 0
# 3: Afghanistan 1996 3 1
# 4: Afghanistan 1997 2 1
# 5: Afghanistan 1998 2 1
# 6: Albania 1994 90 0
# 7: Albania 1994 2 0
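One thing to watch with the cumsum above: if a country has more than one transition coded 3, the counter keeps rising past 1, while the question asks for a binary variable. A minimal sketch of a binary variant follows, on toy data with a second 3 added for Afghanistan (an assumption for illustration). It also uses %in%, which treats NA as "no match", so the NA markers can stay as they are.

```r
library(data.table)

# Toy data; the second 3 for Afghanistan is invented to show the difference
dt <- data.table(
  Country    = c(rep("Afghanistan", 5), rep("Albania", 2)),
  Year       = c(1994:1998, 1994:1995),
  transition = c(0, 0, 3, 2, 3, NA, 2)
)

# 1 from the first transition onward within each country, 0 before it
dt[, post_conflict := as.integer(cumsum(transition %in% 3) > 0), by = Country]
```

Grouping by Country restarts the flag for every country, so no explicit start-of-country marker is needed.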

Cumulative sums for the previous row

I'm trying to get cumulative sums over the previous rows/years. Running cumsum(data$fond) gives me running totals that include the current row, which is not what I want. I would like my data to look like the following:
year fond cumsum
1 1950 0 0
2 1951 1 0
3 1952 3 1
4 1953 0 4
5 1954 0 4
Any help would be appreciated.
data$cumsum <- c(0, cumsum(data$fond)[-nrow(data)])
With data.table, we can use the shift function, which defaults to type = "lag":
library(data.table)
setDT(df1)[, Cumsum := cumsum(shift(fond, fill= 0))]
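A third way to see it, as a base R sketch on the numbers from the question: the sum over all previous rows is the ordinary running total minus the current row's own value. The column name cumsum_prev is made up for this example.

```r
# Data from the question
data <- data.frame(year = 1950:1954, fond = c(0, 1, 3, 0, 0))

# Running total including the current row, minus the current row itself
data$cumsum_prev <- cumsum(data$fond) - data$fond
```

This only holds when fond has no missing values, since an NA would propagate through cumsum.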

How to reshape this complicated data frame?

Here are the first 4 rows of my data:
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each indicator code, with each row corresponding to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this in Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is the dput of the data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
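If the zeros from xtabs are a problem, one possible base R alternative (not part of the answer above) is tapply, which leaves missing values as NA. The sketch below rebuilds the four rows shown in the question with the cleaned-up column names; it assumes there is at most one row per country and indicator, so taking the first value of each cell loses nothing.

```r
# The four rows from the question, with the cleaned-up column names
mydata <- data.frame(
  CountryName   = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  IndicatorCode = c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS",
                    "NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"),
  X2010         = c(5.675740, 11.9, 9.437322, NA)
)

# One cell per country/indicator pair; NA stays NA instead of becoming 0
wide <- tapply(mydata$X2010,
               list(mydata$CountryName, mydata$IndicatorCode),
               function(v) v[1])

# Turn the matrix into a data.frame with the country as a regular column
wide_df <- data.frame(CountryName = rownames(wide), wide, row.names = NULL)
```

Rows come out in alphabetical order of country, because tapply converts the grouping vectors to factors.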
