Convert Panel Data to Long in R

My current data covers missiles between 1920 and 2018. The goal is to measure a nation's ability to deploy missiles of different kinds for each year from 1920 to 2018. The problem is that the data has multiple observations per nation, and often several per year. This creates issues because, for instance, if a nation adopted an imported Air-to-Air missile in 1970 and then developed a domestically produced missile in 1980 that is both Air-to-Air and Air-to-Ground, that change needs to be reflected. The goal is to have a unique row/observation for each year for every nation. Note that a capability is assumed to persist: if a nation can produce Air-to-Air missiles in 1970, it can do so through 2018.
Current:
YearAcquired CountryCode CountryName   Domestic AirtoAir
2014         670         Saudi Arabia  0        1
2017         670         Saudi Arabia  1        1
2016         2           United States 1        1
Desired:
YearAcquired CountryCode CountryName   Domestic AirtoAir
2014         670         Saudi Arabia  0        1
2015         670         Saudi Arabia  0        1
2016         670         Saudi Arabia  0        1
2017         670         Saudi Arabia  1        1
2018         670         Saudi Arabia  1        1
2016         2           United States 0        1
2017         2           United States 0        1
2018         2           United States 0        1
Note: There are many entries, so I would like to generate rows from 1920 to 2018 for every country, even if they are straight zeroes. That is not strictly necessary, but it would be a great bonus!

You can do this in several steps:
1. Create the combination of all years and countries (a CROSS JOIN in SQL).
2. LEFT JOIN these combinations with the available data.
3. Use a function like zoo::na.locf() to replace NA values with the last known ones per country.
The first step is common:
df <- read.table(text = 'YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 "Saudi Arabia" 0 1
2017 670 "Saudi Arabia" 1 1
2016 2 "United States" 1 1', header = TRUE, stringsAsFactors = FALSE)
combinations <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
                      unique(df[, 2:3]), by = NULL)
For steps 2 and 3, here is a solution using dplyr:
library(dplyr)
library(zoo)
df <- left_join(combinations, df) %>%
  group_by(CountryCode) %>%
  mutate(Domestic = na.locf(Domestic, na.rm = FALSE),
         AirtoAir = na.locf(AirtoAir, na.rm = FALSE))
And one solution using data.table:
library(data.table)
library(zoo)
setDT(df)
setDT(combinations)
df <- df[combinations, on = c("YearAcquired", "CountryCode", "CountryName")]
df <- df[, na.locf(.SD, na.rm = FALSE), by = "CountryCode"]

You could create a new dataframe using the country names and codes available and perform a left join with your existing data. This would give you 1920 to 2018 for each country and code, leaving NAs where you don't have data available, which you can then fill the way you want your data structured (see the sketch after the code).
library(dplyr)

# df is your initial dataframe
countries <- unique(df[, c("CountryCode", "CountryName")])
new_df <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
                countries, by = NULL)  # every year for every country/code pair
new_df <- left_join(new_df, df)
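To fill the remaining NAs in the spirit of the question (a sketch, assuming the indicator columns are Domestic and AirtoAir as shown above), you could carry the last known values forward within each country and set anything still missing to 0:

library(dplyr)
library(zoo)

new_df <- new_df %>%
  group_by(CountryCode) %>%
  arrange(YearAcquired, .by_group = TRUE) %>%
  mutate(across(c(Domestic, AirtoAir), ~ na.locf(.x, na.rm = FALSE)),    # carry forward
         across(c(Domestic, AirtoAir), ~ ifelse(is.na(.x), 0, .x))) %>%  # zero out leading NAs
  ungroup()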

Using tidyverse (dplyr and tidyr)...
If you only need to fill in internal years per country...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
  group_by(countrycode) %>%
  complete(YearAcquired = full_seq(YearAcquired, 1), countrycode, CountryName) %>%
  arrange(countrycode, YearAcquired) %>%
  fill(Domestic, AirtoAir)
#> # A tibble: 5 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2016 2 United States 1 1
#> 2 2014 670 Saudi Arabia 0 1
#> 3 2015 670 Saudi Arabia 0 1
#> 4 2016 670 Saudi Arabia 0 1
#> 5 2017 670 Saudi Arabia 1 1
If you want to expand each country to all years found in the dataset...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
  complete(YearAcquired = full_seq(YearAcquired, 1),
           nesting(countrycode, CountryName)) %>%
  group_by(countrycode) %>%
  arrange(countrycode, YearAcquired) %>%
  fill(Domestic, AirtoAir) %>%
  mutate_at(vars(Domestic, AirtoAir), ~ if_else(is.na(.), 0L, .))
#> # A tibble: 8 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2014 2 United States 0 0
#> 2 2015 2 United States 0 0
#> 3 2016 2 United States 1 1
#> 4 2017 2 United States 1 1
#> 5 2014 670 Saudi Arabia 0 1
#> 6 2015 670 Saudi Arabia 0 1
#> 7 2016 670 Saudi Arabia 0 1
#> 8 2017 670 Saudi Arabia 1 1
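If you want the full 1920 to 2018 span for every country, as asked in the question, a sketch along the same lines swaps full_seq() for an explicit year range and zero-fills as before:

library(dplyr)
library(tidyr)

df %>%
  complete(YearAcquired = 1920:2018,
           nesting(countrycode, CountryName)) %>%
  group_by(countrycode) %>%
  arrange(countrycode, YearAcquired) %>%
  fill(Domestic, AirtoAir) %>%
  mutate(across(c(Domestic, AirtoAir), ~ if_else(is.na(.x), 0L, .x))) %>%
  ungroup()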


dplyr arrange is not working while order is fine

I am trying to obtain the 10 largest investors in a country, but I get confusing results using arrange in dplyr versus order in base R.
head(fdi_partner)
gives the following result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
  arrange("Number of projects") %>%
  head()
gives almost the same result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code works fine in base R:
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric)

head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n = 11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's wrong with my code?
dplyr (and tidyverse packages more generally) expects unquoted variable names. If you pass a quoted string, arrange() sorts by that constant string, so the row order is left unchanged. If your variable name has a space in it, you must wrap it in backticks:
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
  arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
  arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
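Note that to reproduce the base-R order(..., decreasing = TRUE) ordering, you would also wrap the backticked name in desc(), e.g. in the pipeline from the question:

fdi_partner %>%
  rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
  mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
  arrange(desc(`Number of projects`)) %>%
  head()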

Combining rows and generating category counts

I want to first combine rows that share an attribute into one (for example, one row for each City/Year), and then find the counts of each category type for those rows.
For example, with this as the original data:
City Year Type of Death
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide
I want to be able to produce something like this:
City Year n_Total n_Homicides n_Suicides
NYC 1995 1 1 0
NYC 1996 2 1 1
LA 1995 3 1 2
I've tried something like the below, but it only gives me the n_Total and doesn't take into account the splits for n_Homicides and n_Suicides:
library(dplyr)

total_deaths <- data %>%
  group_by(city, year) %>%
  summarize(n_Total = n())
You may do this
library(tidyverse, warn.conflicts = F)
df <- read.table(header = T, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')
df %>%
  pivot_wider(names_from = TypeofDeath, values_fn = length, values_from = TypeofDeath,
              values_fill = 0, names_prefix = 'n_') %>%
  mutate(n_total = rowSums(select(cur_data(), starts_with('n_'))))
#> # A tibble: 3 x 5
#> City Year n_Homicide n_Suicide n_total
#> <chr> <int> <int> <int> <dbl>
#> 1 NYC 1995 1 0 1
#> 2 NYC 1996 1 1 2
#> 3 LA 1995 1 2 3
Created on 2021-07-05 by the reprex package (v2.0.0)
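In more recent dplyr versions (1.1.0+), cur_data() is deprecated; the same row total can be computed with across() instead (a minor variant, same output):

df %>%
  pivot_wider(names_from = TypeofDeath, values_fn = length, values_from = TypeofDeath,
              values_fill = 0, names_prefix = 'n_') %>%
  mutate(n_total = rowSums(across(starts_with('n_'))))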
If you don't have too many types of death, then something simple (albeit a little "manual") like this might have some appeal.
library(dplyr, warn.conflicts = FALSE)
df <- read.table(header = TRUE, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')
df %>%
  group_by(City, Year) %>%
  summarize(n_Total = n(),
            n_Suicide = sum(TypeofDeath == "Suicide"),
            n_Homicide = sum(TypeofDeath == "Homicide"))
#> `summarise()` has grouped output by 'City'. You can override using the `.groups` argument.
#> # A tibble: 3 x 5
#> # Groups: City [2]
#> City Year n_Total n_Suicide n_Homicide
#> <chr> <int> <int> <int> <int>
#> 1 LA 1995 3 2 1
#> 2 NYC 1995 1 0 1
#> 3 NYC 1996 2 1 1
Created on 2021-07-05 by the reprex package (v2.0.0)
You can first dummify your factor variable using the fastDummies package, then summarise(). This is a more general and versatile approach that can be used seamlessly with any number of unique types of death.
If you only have two types of death and will settle for a simpler (though more "manual") approach, you can use the other suggestions with summarise(x = ..., y = ..., total = n()).
library(dplyr)
library(fastDummies)

df %>%
  fastDummies::dummy_cols('TypeofDeath', remove_selected_columns = TRUE) %>%
  group_by(City, Year) %>%
  summarise(across(contains('Type'), sum),
            total_deaths = n())
# A tibble: 3 x 5
# Groups: City [2]
City Year TypeofDeath_Homicide TypeofDeath_Suicide total_deaths
<chr> <int> <int> <int> <int>
1 LA 1995 1 2 3
2 NYC 1995 1 0 1
3 NYC 1996 1 1 2

dplyr group operations adding na

Here are my data:
places <- c("London", "London", "London", "Paris", "Paris", "Rennes")
years <- c(2019, 2019, 2020, 2019, 2019, 2020)
dataset <- data.frame(years, places)
The result:
years places
1 2019 London
2 2019 London
3 2020 London
4 2019 Paris
5 2019 Paris
6 2020 Rennes
I am counting by place and year:
dataset2 <- dataset %>%
  count(places, years)
places years n
1 London 2019 2
2 London 2020 1
3 Paris 2019 2
4 Rennes 2020 1
I want my table to show the two years for each city even if there are no values.
places years n
1 London 2019 2
2 London 2020 1
3 Paris 2019 2
4 Paris 2020 NA # or better 0
5 Rennes 2019 NA # or better 0
6 Rennes 2020 1
You could use complete from tidyr to fill in the missing combinations:
library(dplyr)
library(tidyr)
dataset %>% count(places, years) %>% complete(places, years, fill = list(n = 0))
If you convert years to factor you can specify .drop = FALSE.
dataset %>% mutate(years = factor(years)) %>% count(places, years, .drop = FALSE)
# places years n
# <fct> <fct> <int>
#1 London 2019 2
#2 London 2020 1
#3 Paris 2019 2
#4 Paris 2020 0
#5 Rennes 2019 0
#6 Rennes 2020 1
We can use CJ from data.table
library(data.table)
setDT(dataset)[, .N, .(years, places)][CJ(years, places, unique = TRUE), on = .(years, places)]
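After the join, combinations that are absent from the data have N = NA; if you prefer zeroes, as in the desired output, one way (a small, assumed extension of the chain above) is:

library(data.table)

setDT(dataset)[, .N, .(years, places)][
  CJ(years, places, unique = TRUE), on = .(years, places)][
  is.na(N), N := 0][]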

Create count per item by year/decade

I have data in a data.table that is as follows:
> x <- df[sample(nrow(df), 10), ]
> x
    Importer       Exporter        Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and Exporter. So, in the above example the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried the aggregate and data.table methods as suggested by the post here, but both of them seem to just give me counts of the number of Importers/Exporters per year (or decade, which I am more interested in).
> x$Decade<-year(x$Date)-year(x$Date)%%10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Considering that aggregate uses the formula interface, I tried adding another criterion, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary according to the decade and the importer/exporter? It does not matter if the summaries for importer and exporter are in different tables.
We can do this using data.table methods: create the 'Decade' column by reference with :=, melt the data from 'wide' to 'long' format by specifying the measure columns, then reshape it back to 'wide' with dcast, using length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
      Decade + Country ~ variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
I think with() will work with aggregate() in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
                      year  = as.numeric(format(my.data$my.Date, format = "%Y")),
                      month = as.numeric(format(my.data$my.Date, format = "%m")),
                      day   = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1
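For comparison, the same counts can also be sketched with dplyr and tidyr (an assumed equivalent, not from the original answers; it uses lubridate::year() and assumes the Date column in x parses with as.Date()): pivot both role columns into long form, count, then pivot back to wide.

library(dplyr)
library(tidyr)
library(lubridate)

x %>%
  mutate(Date = as.Date(Date),
         Decade = year(Date) - year(Date) %% 10) %>%
  pivot_longer(c(Importer, Exporter), names_to = "Role", values_to = "Country") %>%
  count(Decade, Country, Role) %>%
  pivot_wider(names_from = Role, values_from = n, values_fill = 0)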

R: Find top, mid and bottom values to create a category column in dplyr

I would like to create a 'Category' column in the below dataset based on the sales and year.
set.seed(30)
df <- data.frame(
  Year = rep(2010:2015, each = 6),
  Country = rep(c('India', 'China', 'Japan', 'USA', 'Germany', 'Russia'), 6),
  Sales = round(runif(18, 100, 900))
)
head(df)
Year Country Sales
1 2010 India 661
2 2010 China 888
3 2010 Japan 285
4 2010 USA 272
5 2010 Germany 332
6 2010 Russia 660
Categories are:
Top 2 countries with highest sales in each year: Category - 1
Bottom 2 countries with lowest sales in each year: Category - 3
Remaining countries by year: Category - 2
Expected dataset might look like:
Year Country Sales Category
1 2010 India 661 1
2 2010 China 888 1
3 2010 Japan 285 3
4 2010 USA 272 3
5 2010 Germany 332 2
6 2010 Russia 660 2
You don't need much here: group_by year, arrange from greatest to least sales, and then add a Category column with mutate that assigns 1 to the first two rows, 3 to the last two, and 2 to everything in between:
df %>%
  group_by(Year) %>%
  arrange(desc(Sales)) %>%
  mutate(Category = c(1, 1, rep(2, n() - 4), 3, 3))
# Source: local data frame [36 x 4]
# Groups: Year [6]
#
# Year Country Sales Category
# (int) (fctr) (dbl) (dbl)
# 1 2010 China 491 1
# 2 2010 USA 436 1
# 3 2010 Japan 391 2
# 4 2010 Germany 341 2
# 5 2010 Russia 218 3
# 6 2010 India 179 3
# 7 2011 Japan 873 1
# 8 2011 India 819 1
# 9 2011 Russia 418 2
# 10 2011 China 279 2
# .. ... ... ... ...
It will fail with fewer than four countries, but that doesn't sound like an issue from the question.
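If you do need something that copes with small groups, one option (a rank-based sketch, not from the original answers) is to rank sales within each year and assign categories with case_when():

library(dplyr)

df %>%
  group_by(Year) %>%
  mutate(Category = case_when(
    min_rank(desc(Sales)) <= 2 ~ 1,  # top 2 sales in the year
    min_rank(Sales) <= 2       ~ 3,  # bottom 2 sales in the year
    TRUE                       ~ 2   # everything in between
  )) %>%
  ungroup()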
We can use cut to create a 'Category' column after grouping by "Year".
library(dplyr)
df %>%
  group_by(Year) %>%
  mutate(Category = as.numeric(cut(-Sales,
                                   breaks = c(-Inf, quantile(-Sales, prob = c(0, .5, 1))))))
Or using data.table
library(data.table)
setDT(df)[order(-Sales), Category := if (.N > 4) rep(1:3, c(2, .N - 4, 2))
          else rep(seq(.N), each = ceiling(.N / 3)), by = Year]
This should also work when there are fewer than 4 elements in a "Year", e.g. if we remove the first five observations in 2010:
df1 <- df[-(1:5), ]
setDT(df1)[order(-Sales), Category := if (.N > 4) rep(1:3, c(2, .N - 4, 2))
           else rep(seq(.N), each = ceiling(.N / 3)), by = Year]
head(df1)
# Year Country Sales Category
#1: 2010 Russia 218 1
#2: 2011 India 819 1
#3: 2011 China 279 2
#4: 2011 Japan 873 1
#5: 2011 USA 213 3
#6: 2011 Germany 152 3
