How to create unbalanced panel data based on start date and end date

My data looks something like this:
leader startday enddate
P 28/12/2000 15/12/2004
C 11/11/1966 19/10/1969
H 21/10/1993 1/07/1994
And I would like to obtain the following data:
leader year
P 2000
P 2001
P 2002
P 2003
P 2004
C 1966
C 1967
C 1968
C 1969
H 1993
H 1994

Since you didn't specify a language, here is some pseudo-code to help you with it.
get_year(string date){
return int(date.substr(date.lastIndexOf('/')+1));
}
foreach (table as row) {
for ( int year = get_year(row.startday); year <= get_year(row.enddate); year++ ) {
print( row.leader + " " + year );
}
}
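The pseudo-code above can be sketched in R as follows. This is a minimal illustration, not the answerer's own code: the data frame name `leaders` and the helper `get_year` are assumptions made for the example, and the year is taken as everything after the last slash, exactly as in the pseudo-code.

```r
# Example data copied from the question (object name `leaders` is assumed).
leaders <- data.frame(
  leader   = c("P", "C", "H"),
  startday = c("28/12/2000", "11/11/1966", "21/10/1993"),
  enddate  = c("15/12/2004", "19/10/1969", "1/07/1994"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the text after the last "/" and
# convert it to an integer year.
get_year <- function(date) as.integer(sub(".*/", "", date))

# One integer sequence of years per leader, from start year to end year.
years <- Map(seq, get_year(leaders$startday), get_year(leaders$enddate))

# Repeat each leader once per year covered and stack the years.
panel <- data.frame(
  leader = rep(leaders$leader, lengths(years)),
  year   = unlist(years)
)
print(panel)
```

Because each leader contributes a different number of rows, the result is an unbalanced panel with one row per leader and year.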

Related

Nested for loop to create a column, using two data sets in R

I want to create a new variable (named "treatment") in a dataset using two different datasets. My original datasets are two big datasets with other variables; however, for simplicity, let's say I have the following datasets:
#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8),
birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
lower_cutoff = c(1928, 1975, 1907, 1934),
upper_cutoff = c(1948, 1995, 1927, 1948),
cohort = c(1938, 1985, 1917, 1942))
Dataset a is an individual-level dataset with birth-year information, and dataset b is country-level data with reform information. Based on dataset b, I want to create a treatment column in dataset a. Treatment is 1 if birth_year is between cohort and upper_cutoff, 0 if it is between lower_cutoff and cohort, and NA otherwise.
After creating an empty treatment column, I used the following code below:
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] >= b$cohort[j] &
a$birth_year[i]<= b$upper_cutoff[j], "1",
ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] < b$cohort[j] &
a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
}
}
As well as:
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
a[i, "treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] >= b[j,"cohort"] &
a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] < b[j,"cohort"]&
a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
}
}
Both codes run, but they only return NAs. The following is the result I want to get:
treatment <- c(NA, 0, 1, NA, 0, 1, NA, 0)
Any ideas about what is wrong? Or any other suggestions? Thanks in advance!
A join avoids the nested loop, which overwrites treatment on every pass through b. Try this approach with dplyr:
library(dplyr)
left_join(a, b, by = "country_code") %>%
  mutate(treatment = case_when(
    birth_year >= cohort & birth_year <= upper_cutoff ~ 1,
    birth_year < cohort & birth_year >= lower_cutoff ~ 0
  ))
Output:
country_code birth_year lower_cutoff upper_cutoff cohort treatment
1 2 1920 1928 1948 1938 NA
2 2 1930 1928 1948 1938 0
3 2 1940 1928 1948 1938 1
4 10 1970 1975 1995 1985 NA
5 10 1970 1907 1927 1917 NA
6 10 1980 1975 1995 1985 0
7 10 1980 1907 1927 1917 NA
8 10 1990 1975 1995 1985 1
9 10 1990 1907 1927 1917 NA
10 10 2000 1975 1995 1985 NA
11 10 2000 1907 1927 1917 NA
12 8 1910 NA NA NA NA
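The join output above has one row per matching reform, so individuals from country 10 appear twice. If one value per individual is wanted, a base R sketch along the following lines could collapse the result. The `id` column and the helper `first_non_na` are assumptions introduced for this example, and taking the first non-missing value is only one possible rule when several reforms apply.

```r
# Data copied from the question.
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year   = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort       = c(1938, 1985, 1917, 1942))

# Assumed helper column so each individual can be identified after the join.
a$id <- seq_len(nrow(a))

# Left join: every individual is kept, once per reform in their country.
m <- merge(a, b, by = "country_code", all.x = TRUE)
m$treatment <- with(m, ifelse(birth_year >= cohort & birth_year <= upper_cutoff, 1,
                       ifelse(birth_year >= lower_cutoff & birth_year < cohort, 0, NA)))

# Hypothetical helper: first non-missing value, or NA if there is none.
first_non_na <- function(x) if (all(is.na(x))) NA_real_ else x[!is.na(x)][1]

# tapply() returns one value per id, in increasing id order.
a$treatment <- as.vector(tapply(m$treatment, m$id, first_non_na))
print(a[, c("country_code", "birth_year", "treatment")])
```

In this sketch the last individual (country 8) gets NA, because that country has no row in b.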
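Try this for loop:
for (i in seq_len(nrow(a))) {
  # rows of b that belong to the same country as row i of a
  x <- which(a$country_code[i] == b$country_code)
  a$treatment[i] <- NA
  for (j in x) {
    if (a$birth_year[i] >= b$cohort[j] & a$birth_year[i] <= b$upper_cutoff[j]) {
      # cohort itself counts as treated, as in the question
      a$treatment[i] <- 1
    } else if (a$birth_year[i] >= b$lower_cutoff[j] & a$birth_year[i] < b$cohort[j]) {
      a$treatment[i] <- 0
    }
  }
}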
Output
country_code birth_year treatment
1 2 1920 NA
2 2 1930 0
3 2 1940 1
4 10 1970 NA
5 10 1980 0
6 10 1990 1
7 10 2000 NA
8 8 1910 NA
I found the mistake in my code: the inner loop kept overwriting the value I had already assigned, so I needed a break. I'm still open to other answers.
for (i in 1:nrow(a)) {
  for (j in 1:nrow(b)) {
    if (!is.na(a$treatment[i])) break  # stop once a value has been assigned
    a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] &
                             a$birth_year[i] >= b$cohort[j] &
                             a$birth_year[i] <= b$upper_cutoff[j], "1",
                      ifelse(a$country_code[i] == b$country_code[j] &
                             a$birth_year[i] < b$cohort[j] &
                             a$birth_year[i] >= b$lower_cutoff[j], "0", NA))
  }
}

How can I sum the values of one column based on the categories of another column, multiple times, in R?

I guess my question is a little strange, so let me try to explain it. I need to solve a simple equation for a longitudinal database (29 consecutive years) on food availability and international commerce: (importations - exportations) / (production + importations - exportations) * 100 (the FAO equation for the food dependence coefficient). The big problem is that my database has the food products and their values of interest (production, importation and exportation) disaggregated, so I need a way to apply that equation to the sum of those values for every year and get the coefficient I need for each year.
My data frame looks like this:
element product year value (metric tons)
Production Wheat 1990 16
Importation Wheat 1990 2
Exportation Wheat 1990 1
Production Apples 1990 80
Importation Apples 1990 0
Exportation Apples 1990 72
Production Wheat 1991 12
Importation Wheat 1991 20
Exportation Wheat 1991 0
I guess the solution is pretty simple, but I'm not good enough at R to solve this problem by myself. Any help is very welcome.
Thanks!
require(data.table)
# dummy table. Use setDT(df) if yours isn't a data table already
df <- data.table(element = (rep(c('p', 'i', 'e'), 3))
, product = (rep(c('w', 'a', 'w'), each=3))
, year = rep(c(1990, 1991), c(6,3))
, value = c(16,2,1,80,0,72,12,20,0)
); df
element product year value
1: p w 1990 16
2: i w 1990 2
3: e w 1990 1
4: p a 1990 80
5: i a 1990 0
6: e a 1990 72
7: p w 1991 12
8: i w 1991 20
9: e w 1991 0
# long to wide
df_1 <- dcast(df
, product + year ~ element
, value.var = 'value'
); df_1
# apply calculation
df_1[, food_depend_coef := (i-e) / (p+i-e)*100][]
product year e i p food_depend_coef
1: a 1990 72 0 80 -900.000000
2: w 1990 1 2 16 5.882353
3: w 1991 0 20 12 62.500000
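The result above gives one coefficient per product and year. If the coefficient is wanted once per year over the sum of all products, as the question describes, a base R sketch could look like the following. The object names (`df`, `totals`, `result`) are assumptions for this example, and the data are the nine rows shown in the question.

```r
# The nine rows shown in the question, with the original labels.
df <- data.frame(
  element = rep(c("Production", "Importation", "Exportation"), 3),
  product = rep(c("Wheat", "Apples", "Wheat"), each = 3),
  year    = rep(c(1990, 1991), c(6, 3)),
  value   = c(16, 2, 1, 80, 0, 72, 12, 20, 0)
)

# Sum over products: one row per year, one column per element.
totals <- tapply(df$value, list(df$year, df$element), sum)

production <- totals[, "Production"]
imports    <- totals[, "Importation"]
exports    <- totals[, "Exportation"]

# Food dependence coefficient from the question, applied to yearly totals.
result <- data.frame(
  year             = as.integer(rownames(totals)),
  food_depend_coef = unname((imports - exports) /
                            (production + imports - exports) * 100)
)
print(result)
```

This assumes every year has at least one row for each of the three elements; a missing combination would show up as NA in `totals`.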

r merge data with different year

I would like to merge two data sets on different years.
My data look like the tables below, with more than 1,000 firms over a 20-year span.
I want to merge them so I can examine the impact of firm A's ratio at year t on firm A's count at year t+1.
Data A
firm year ratio
A 1990 0.2
A 1991 0.3
...
B 1990 0.1
Data B
firm tyear count
A 1990 2
A 1991 6
...
B 1990 4
Expected Output
firm year ratio count
A 1990 0.2 6
Any suggestion for code to merge data?
Thank you
This should get you started. Just make sure you apply the right lag/lead transformation to the table.
library(data.table)
dt.a.years <- data.table(Year =seq(from = 1990, to = 2010, by = 1L))
dt.b.years <- data.table(Year =seq(from = 1990, to = 2010, by = 1L))
dt.merged <- merge( x = dt.a.years
, y = dt.b.years[, .(Year, lag.Year = shift(Year, n = 1, fill = NA))]
, by.x = "Year"
, by.y = "lag.Year")
>dt.merged
Year Year.y
1: 1990 1991
2: 1991 1992
3: 1992 1993
4: 1993 1994
5: 1994 1995
6: 1995 1996
7: 1996 1997
8: 1997 1998
9: 1998 1999
How about this:
A$tyear <- A$year + 1
AB <- merge(A, B, by = c('firm', 'tyear'), all = FALSE)
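As a quick check of this idea, here is a small sketch that uses only the rows actually shown in the question (the rows elided with "..." are not filled in). Shifting `year` forward by one in A lines each ratio up with the following year's count in B.

```r
# Only the rows shown in the question; elided rows are left out.
A <- data.frame(firm  = c("A", "A", "B"),
                year  = c(1990, 1991, 1990),
                ratio = c(0.2, 0.3, 0.1))
B <- data.frame(firm  = c("A", "A", "B"),
                tyear = c(1990, 1991, 1990),
                count = c(2, 6, 4))

# Year whose count should be paired with this row's ratio.
A$tyear <- A$year + 1

# Inner join: keep only firm/year pairs that have a count one year later.
AB <- merge(A, B, by = c("firm", "tyear"))
print(AB)
```

With these rows only firm A in 1990 has a count for the following year, so the result is the single row from the expected output (ratio 0.2, count 6).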

R equal sampling takes too long

I want to sample rows from different years given some constraints.
Say my dataset looks like this:
library(data.table)
dataset = data.table(ID=sample(1:21), Vintage=c(1989:1998, 1989:1998, 1992), Region.Focus=c("Europe", "US", "Asia"))
> dataset
ID Vintage Region.Focus
1: 7 1989 Europe
2: 10 1990 US
3: 20 1991 Asia
4: 18 1992 Europe
5: 4 1993 US
6: 17 1994 Asia
7: 13 1995 Europe
8: 9 1996 US
9: 12 1997 Asia
10: 3 1998 Europe
11: 11 1989 US
12: 14 1990 Asia
13: 8 1991 Europe
14: 16 1992 US
15: 19 1993 Asia
16: 1 1994 Europe
17: 5 1995 US
18: 15 1996 Asia
19: 6 1997 Europe
20: 21 1998 US
21: 2 1992 Asia
ID Vintage Region.Focus
I want to take 1,000 draws of sample size 2 and 1,000 draws of sample size 4 (separately from each other), each spread across two consecutive years. E.g. for 1,000 draws of sample size 2, one draw could be the first and the second row. I also have a constraint that the sample must consist of rows with the same region focus. My solution is the code below, but it is way too slow.
for(i in c(2,4)) {
simulate <- function(i) {
repeat{
start <- dataset[sample(nrow(dataset), 1, replace=TRUE),]
t <- start$Vintage:(start$Vintage + 1)
matches <- which(dataset$Vintage %in% t & dataset$Region.Focus == start$Region.Focus) #constraints
DT <- dataset[matches,]
DT <- as.data.table(DT)
x <- DT[,.SD[sample(.N,min(.N,i/length(t)))],by = Vintage]
if(nrow(x) ==i) {
x <- as.data.frame(x)
x <- x %>% mutate(EqualWeight = 1 / i) %>% mutate(RandomWeight = prop.table(runif(i)))
x <- ungroup(x)
return(x)
} else {
x <- 0
}
}
}
#now replicate the expression 1000 times
r <- replicate(1000, simulate(i), simplify=FALSE)
r <- rbindlist(r, idcol="draw")
f <- as.data.frame(r)
write.csv(f, file=paste("Performance.fof.5", i, "csv", sep="."))
fof <- paste("fof.5", i, sep = ".")
assign(fof, f)
}
This code is very slow. My guess is that the approach needs many funds to find a valid sample and keeps looping because of the constraint. I have 5,800 rows.
Is there a way to avoid the repeat loop, which results in a lot of looping? Perhaps there is another way of expressing the line DT[,.SD[sample(.N,min(.N,i/length(t)))],by = Vintage] to get rid of the repeat expression? Thank you in advance for any input!
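One possible way to avoid the repeat loop is to work out in advance which rows can start a valid draw and to sample the start row only from those. The following is a base R sketch under that assumption, not a tested replacement for the code above: the function `draw_once` is hypothetical, the IDs are fixed instead of shuffled, and a sample of size n is read as n/2 rows from each of two consecutive vintages with the same region focus.

```r
# Toy data shaped like the question's (IDs fixed here for reproducibility).
dataset <- data.frame(
  ID           = 1:21,
  Vintage      = c(1989:1998, 1989:1998, 1992),
  Region.Focus = rep(c("Europe", "US", "Asia"), length.out = 21)
)

# Hypothetical helper: one draw of `size` rows, size/2 from each of two
# consecutive vintages, all with the same region focus.
draw_once <- function(data, size) {
  per_year <- size / 2
  key      <- paste(data$Region.Focus, data$Vintage)
  key_next <- paste(data$Region.Focus, data$Vintage + 1)
  counts   <- table(key)
  n_this   <- as.integer(counts)[match(key, names(counts))]
  n_next   <- as.integer(counts)[match(key_next, names(counts))]
  # Rows that can start a valid draw: enough rows this year and next year.
  ok <- which(n_this >= per_year & !is.na(n_next) & n_next >= per_year)
  if (length(ok) == 0) stop("no valid start row for this sample size")
  start <- data[ok[sample.int(length(ok), 1)], ]
  # Pick per_year rows of the start region in vintage v, without replacement.
  pick <- function(v) {
    idx <- which(data$Vintage == v & data$Region.Focus == start$Region.Focus)
    idx[sample.int(length(idx), per_year)]
  }
  x <- data[c(pick(start$Vintage), pick(start$Vintage + 1)), ]
  x$EqualWeight  <- 1 / size
  x$RandomWeight <- prop.table(runif(size))
  x
}

set.seed(42)
draws <- replicate(1000, draw_once(dataset, 2), simplify = FALSE)
print(draws[[1]])
```

Because the start row is sampled uniformly from the rows that would have been accepted anyway, this should give the same kind of draws as the rejection loop while doing a fixed amount of work per draw. It also fails fast when no valid sample exists; in the toy data that is the case for sample size 4, where the original loop would never terminate.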

R: Where a value in two data frames is the same, apply a set of condition on one to determine its classification

I have two sets of data. I wish to apply a classification (low, mid.lo, mid.up, high) on the first set (income by year) based on conditions contained in the other (year, and three breakpoints). Below are samples from those data sets; the real sets are much larger and are not of the same length.
income
Country Year GNI.caput
Argentina 2000 7470
Argentina 2001 7000
Argentina 2002 4050
Argentina 2003 3670
Argentina 2004 3810
Denmark 2000 32660
Denmark 2001 31440
Denmark 2002 30870
Denmark 2003 34850
Denmark 2004 42760
Kenya 2000 420
Kenya 2001 400
Kenya 2002 390
Kenya 2003 410
Kenya 2004 460
Philippines 2000 1230
Philippines 2001 1230
Philippines 2002 1190
Philippines 2003 1270
Philippines 2004 1400
breaks
Year Break.1 Break.2 Break.3
2004 825 3225 10065
2003 765 3035 9385
2002 735 2935 9075
2001 745 2975 9205
2000 755 2995 9265
I have tried the following sets of loops, but neither completes, generating several errors each.
Attempt 1
for(i in seq_along(gni.data)){
while(gni.data$Year == break.pts$Year) {
if(gni.data$GNI.caput <= break.pts$Break.1) {
gni.data$Indicator <- "Low"
} else if(gni.data$GNI.caput <= break.pts$Break.2) {
gni.data$Indicator <- "Mid.Low"
} else if(gni.data$GNI.caput <= break.pts$Break.3) {
gni.data$Indicator <- "Mid.Up"
} else if(gni.data$GNI.caput > break.pts$Break.3) {
gni.data$Indicator <- "High"
} else gni.data$Indicator <- "NA"
}
}
Warning messages:
1: In gni.data$Year == break.pts$Year :
longer object length is not a multiple of shorter object length
2: In while (gni.data$Year == break.pts$Year) { :
the condition has length > 1 and only the first element will be used
...
Attempt 2
for(i in seq_along(gni.data)){
while(gni.data$Year == break.pts$Year) {
ifelse(gni.data$GNI.caput <= break.pts$Break.1, gni.data$Indicator <- "Low",
ifelse(gni.data$GNI.caput <= break.pts$Break.2, gni.data$Indicator <- "Mid.Lo",
ifelse(gni.data$GNI.caput <= break.pts$Break.3, gni.data$Indicator <- "Mid.Up",
ifelse(gni.data$GNI.caput > break.pts$Break.3, gni.data$Indicator <- "High",
gni.data$Indicator <- "NA"))))
}
}
Warning messages same as for attempt 1.
Where am I going wrong? Thanks!
You could do this by merging the two data frames and then using nested ifelse() calls inside with() to make the new variable, like this:
# Toy data to test
df <- data.frame(country=rep(c("A", "B", "C"), each=3), year=rep(seq(2000,2002), 3), gdp = rnorm(9, 5000, 1000), stringsAsFactors=FALSE)
cuts <- data.frame(year = seq(2000,2002), break.1=c(4000,4500,4000), break.2=c(5000,5500,5000), break.3=c(6000,6500,6000))
# Merge first; merge() reorders the rows, so classify the merged frame rather than df
newdf <- merge(df, cuts, all.x=TRUE)
# Create new variable on the merged data
newdf$class <- with(newdf,
  ifelse(gdp < break.1, "lo", ifelse(gdp >= break.1 & gdp < break.2, "mid.lo",
  ifelse(gdp >= break.2 & gdp < break.3, "mid.hi", ifelse(gdp >= break.3, "hi", NA)))))
# Result
> newdf
year country gdp break.1 break.2 break.3 class
1 2000 A 5510.243 4000 5000 6000 mid.hi
2 2000 C 6404.494 4000 5000 6000 hi
3 2000 B 6125.383 4000 5000 6000 hi
4 2001 A 4899.577 4500 5500 6500 mid.lo
5 2001 B 4678.249 4500 5500 6500 mid.lo
6 2001 C 6026.577 4500 5500 6500 mid.hi
7 2002 B 6350.749 4000 5000 6000 hi
8 2002 A 7225.358 4000 5000 6000 hi
9 2002 C 5469.354 4000 5000 6000 mid.hi
You could also use dplyr and its piping operator to merge, recode, sort, and cut the superfluous columns all in one go:
library(dplyr)
df <- left_join(df, cuts) %>%
mutate(class = ifelse(gdp < break.1, "lo", ifelse(gdp >= break.1 & gdp < break.2, "mid.lo",
ifelse(gdp >= break.2 & gdp < break.3, "mid.hi", ifelse(gdp >= break.3, "hi", NA))))) %>%
arrange(country, year) %>%
select(-break.1, -break.2, -break.3)
# Result
>df
country year gdp class
1 A 2000 5510.243 mid.hi
2 A 2001 4899.577 mid.lo
3 A 2002 7225.358 hi
4 B 2000 6125.383 hi
5 B 2001 4678.249 mid.lo
6 B 2002 6350.749 hi
7 C 2000 6404.494 hi
8 C 2001 6026.577 mid.hi
9 C 2002 5469.354 mid.hi
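Applied to the data shown in the question, the same merge-then-classify idea can be sketched in base R as below. The object names (`income`, `breaks`, `classes`, `merged`) are assumptions for this example. Instead of nested ifelse() calls, it counts how many breakpoints each income exceeds and uses that count to index a vector of labels, with the same boundaries as the question's first attempt (values equal to a break fall in the lower class).

```r
# Data copied from the question.
income <- data.frame(
  Country   = rep(c("Argentina", "Denmark", "Kenya", "Philippines"), each = 5),
  Year      = rep(2000:2004, times = 4),
  GNI.caput = c(7470, 7000, 4050, 3670, 3810,
                32660, 31440, 30870, 34850, 42760,
                420, 400, 390, 410, 460,
                1230, 1230, 1190, 1270, 1400)
)
breaks <- data.frame(
  Year    = 2004:2000,
  Break.1 = c(825, 765, 735, 745, 755),
  Break.2 = c(3225, 3035, 2935, 2975, 2995),
  Break.3 = c(10065, 9385, 9075, 9205, 9265)
)

# Attach each year's breakpoints to every income row of that year.
merged <- merge(income, breaks, by = "Year", all.x = TRUE)

# Number of breakpoints exceeded (0 to 3) picks the label; a year with no
# breakpoints gives NA.
classes <- c("Low", "Mid.Low", "Mid.Up", "High")
merged$Indicator <- with(merged,
  classes[1 + (GNI.caput > Break.1) + (GNI.caput > Break.2) + (GNI.caput > Break.3)])

# Restore a readable order and drop the breakpoint columns.
merged <- merged[order(merged$Country, merged$Year),
                 c("Country", "Year", "GNI.caput", "Indicator")]
print(merged)
```

Because the comparison is done on whole columns after the merge, no explicit loop over rows or years is needed.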
