Aggregating data according to month - r

I have a panel dataset containing data on civil war, with indices "side_a_id" and "year_month". Each observation is an individual 'event' of armed conflict, and variables include details on the actors involved, a unique ID for each individual event, and then for each event, the number of side_a deaths, side_b deaths, and civilian deaths.
Screenshot of sample dataset
I would like to aggregate the data on each separate variable on deaths ('deaths_a', 'deaths_b' and 'deaths_civilians') according to which year_month they fall in. Taking an example from my dataset: instead of having 3 separate rows for the interaction between Government of Haiti and Military Faction (dyad_id = 14), I would like one row that contains all the deaths of each party for a specific month. I have tried using the aggregate() function, which seems to work until I try to re-merge it with my full dataset.
df <- aggregate(cbind(deaths_a, deaths_b, deaths_civilians) ~ side_a_id + year_month,
                panel_data, sum)
rebel <- full_join(panel_data, df, by = c("side_a_id", "year_month"))
Can anyone suggest a solution?
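In case it helps, here is a minimal dplyr sketch of the two usual routes, assuming the column names used in the aggregate() call above. You can either collapse the data to one row per side_a_id/year_month, or keep the event-level rows and attach the monthly totals with mutate(), which avoids the re-merge entirely.
library(dplyr)
# Option 1: collapse to one row per side_a_id / year_month
monthly <- panel_data %>%
  group_by(side_a_id, year_month) %>%
  summarise(deaths_a = sum(deaths_a, na.rm = TRUE),
            deaths_b = sum(deaths_b, na.rm = TRUE),
            deaths_civilians = sum(deaths_civilians, na.rm = TRUE),
            .groups = "drop")
# Option 2: keep every event row and add the monthly totals as new columns
panel_data <- panel_data %>%
  group_by(side_a_id, year_month) %>%
  mutate(monthly_deaths_a = sum(deaths_a, na.rm = TRUE),
         monthly_deaths_b = sum(deaths_b, na.rm = TRUE),
         monthly_deaths_civilians = sum(deaths_civilians, na.rm = TRUE)) %>%
  ungroup()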

Related

I need an alternative to the index(,match,match) function in r

I have two datasets, data_issuer and ESG_data.
data_issuer contains bond pricing observations; the identity of the bond issuer is given by the EquityISIN variable. The same EquityISIN occurs multiple times in the dataset for two reasons: (1) for each bond there are multiple pricing observations, and (2) each issuer can issue multiple bonds.
The data_issuer dataset also contains a variable Year, which shows the year of the bond observation.
The first column of the ESG_data dataset, ISIN CODE, contains EquityISINs of publicly listed companies. The rest of the variables are years from 2000 until 2021. For clarity: the column names are 2000, 2001, 2002, ..., 2021.
If I want to extract the ESG score of a specific company from the ESG_data dataset I look at the EquityISIN and the year, the intersection of the row and column holds the ESG score I am looking for.
I want to write a code which allows me to add a new variable to the data_issuer dataset called ESGscore. The ESG score has to be extracted from the ESG_data dataset using the EquityISIN value and the Year value.
I tried the following code but it returns an error:
merged_data <- merge(
  data_issuer,
  ESG_data,
  by.x = c("EquityISIN", "Year"),
  by.y = c("ISIN CODE", "2000":"2021")
)
# Error in merge.data.frame(data_issuer, ESG_data, by.x = c("EquityISIN", :
# 'by.x' and 'by.y' specify different numbers of columns
I also tried with the help of ChatGPT, but unfortunately that did not help.
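The merge() error arises because by.x names two key columns while by.y names twenty-three, and both must list the same number of columns. A common workaround is to reshape ESG_data into long format first, so the year becomes a single column you can join on. A minimal sketch, assuming the year columns are literally named 2000 through 2021 and that Year in data_issuer is numeric:
library(dplyr)
library(tidyr)
# Reshape ESG_data from wide (one column per year) to long format,
# so the year becomes an ordinary key column
esg_long <- ESG_data %>%
  pivot_longer(cols = `2000`:`2021`,
               names_to = "Year",
               values_to = "ESGscore") %>%
  mutate(Year = as.integer(Year))  # match the type of Year in data_issuer
# Now both keys exist on both sides, so a two-column join works
merged_data <- data_issuer %>%
  left_join(esg_long, by = c("EquityISIN" = "ISIN CODE", "Year" = "Year"))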

How can I create a code/loop to automate the creation of a variable?

I am writing my thesis and I am struggling with some data preparation.
I have a dataset with prices, distance, and many other variables for several US airline routes. I need to identify the threat of entry on each route for a specific carrier (Southwest), and to do that I need to create, for each row of the dataset, a dummy that takes the value 1 if Southwest is flying from the row's takeoff airport at that point in time.
My idea was an algorithm that checks the year and the takeoff airport ID of each row (both variables are in the dataset) and then, based on those values, filters the whole dataset by year <= the row's year, origin_airport == the row's origin_airport, and carrier == Southwest. If the filter produces any output, it means that Southwest is by that time already flying from that airport, so the dummy should take the value 1, otherwise 0. This should be automated for each row in the dataset.
Any idea how to put this into R code? Or is there an easier way to address this issue?
This is the link to the dataset on dropbox:
https://www.dropbox.com/s/n09rp2vcyqfx02r/DB1B_completeDB1B_complete.csv?dl=0
The short answer is to use a self join.
Looking at your data set, I don't see IATA airport codes, but rather 6-digit origin and destination IDs (which do not seem to conform to anything in DB1A/DB1B??). Also, it's not clear (to me) what exactly the granularity of your data is, so I'm making some assumptions.
library(data.table)
setwd('<directory with your csv file>')
data <- fread('DB1B_completeDB1B_complete.csv')
wn <- data[carrier=='WN']
data[, flag:=0]
data[wn, flag:=1, on=.(ap_id, year, quarter, date)]
So, this just extracts the WN records and then joins them back to the original table on ap_id (which seems to define the route??), year, quarter, and date. This assumes the granularity is one row per carrier/route/year/quarter/date combination.
Before you do that, though, you need to do some serious data cleaning. For instance, while it looks like ORIGIN_AIRPORT_CD and DEST_AIRPORT_CD are parsed out of ap_id, there are about 1200 records where these are NA.
##
# missingness
#
data[, .(col = names(data), na.count=sapply(.SD, \(x) sum(is.na(x))))]
Also, my assumption that there is one row per carrier/route/year/quarter/date does not always seem to hold. This is an especially serious problem with the WN rows.
##
# duplicates??
#
data[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
wn[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
Finally, in attempting to quantify the impact of WN entry to a market, you probably should at least consider grouping nearby airports. For instance JFK/LGA/EWR are frequently considered "NYC", and SFO/OAK/SJC are frequently considered "Bay Area" (these are just examples). This means, for instance, that if WN started flying from LGA to a destination of interest it might also influence OA prices from JFK and EWR to that same destination.
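If you do go down that road, a simple way is a small lookup table joined onto the data. The ids and metro labels below are purely hypothetical placeholders, not values taken from your file:
# hypothetical mapping of airport ids to metro areas (placeholder values)
metro <- data.table(
  airport_cd = c(10001L, 10002L, 10003L),
  metro_area = c("NYC", "NYC", "Bay Area")
)
# attach a metro label to both ends of each route
data[metro, origin_metro := i.metro_area, on = .(ORIGIN_AIRPORT_CD = airport_cd)]
data[metro, dest_metro := i.metro_area, on = .(DEST_AIRPORT_CD = airport_cd)]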

Merge rows with the same ID but with overlapping variables

I have data in r that has over 6000 observations and 96 variables.
The data relates to groups of individuals and their activities etc. If a group returned, the Group ID number was recorded again and a new observation was made. I need to merge the rows by ID so that the number of individuals takes the highest number recorded, but the activities etc. are a combination of both observations.
The data contains the number of individuals, activities, impacts, time of arrival, etc. The issue is that some of the observations were split across 2 lines, so there may be activities recorded for the same group on another line. The Group ID for both observations is the same, but one may have the number of individuals recorded and some activity records or impacts, while the second may be incomplete and only have the Group ID and then Impacts (which are additional to those in the 1st record). The number of individuals in the group never changes, so I need some way to combine the rows so that activities are additive, the number of visitors takes the highest value, time of arrival is the earliest recorded, and time of departure is the later of the 2 observations.
Does anyone know how to merge observations based on Group ID but vary the merging protocol based on the variable?
I'm not sure if this actually is what you want, but to combine rows of a data frame based on multiple conditions you can use the dplyr package and its summarise() function. I generated some data to use directly in R; you would have to modify the code according to your needs.
# generate data
ID <- rep(1:20, 2)
visitors <- sample(1:50, 40, replace = TRUE)
impact <- sample(rep(c("a", "b", "c", "d", "e"), 8))
arrival <- sample(rep(8:15, 5))
departure <- sample(rep(16:23, 5))
df <- data.frame(ID, visitors, impact, arrival, departure)
df$impact <- as.character(df$impact)
# summarise rows with identical ID
library(dplyr)
df_summary <- df %>%
  group_by(ID) %>%
  summarise(visitors = max(visitors), arrival = min(arrival),
            departure = max(departure), impact = paste0(impact, collapse = ", "))
Hope this helps!

How to aggregate count data into a specific geographic location

I have a dataset called 'model_data', in which the unit of observation is a geographic cell (gid) taken from the UCDP PRIO-GRID data. This is simply a standardised spatial grid structure that allows for finely-grained analysis at a very local level. I am researching the effect of power balance between actors in civil wars on their use of violence against civilians i.e. if actors perform well (operationalised as inflicting a majority of the battle deaths in any one gid) will they target more or less civilians in the same gid. To this end, I have merged my dataset using an inner_join (by gid) with a dataset containing all individual incidents of armed violence (UCDP Georeferenced Events Dataset).
When I merge, the resulting dataset consists of duplicate gid observations for each individual incident of violence from the GED dataset. I need to find a way of aggregating all civilians deaths, all side_a deaths, and all side_b deaths in each specific gid, so that each observation in the dataset is a unique gid with all data on various types of deaths from that gid.
model_data <- inner_join(grid, ged, by = c("year", "gid" = "priogrid_gid", "xcoord" = "longitude", "ycoord" = "latitude"))
As you can see from the first column, there are multiple observations with the same gid. I would like to aggregate all the data from the observations with the same gid into one observation.
I've done a lot of research on the best way to do this, but have been unsuccessful so far. From what I gather, the aggregate() function from the "sp" package would be my best bet, but I cannot work out how to use it in the way I need! Thank you for any help that may come my way.
How about this?
library(dplyr)
model_data %>%
  select(-id) %>%
  distinct()
This assumes that just using the "gid" without the "id" will get you where you want to go.
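If the goal is to sum the death counts within each gid rather than just drop duplicate rows, a grouped summary is another route. A minimal sketch, assuming the GED columns are named deaths_civilians, deaths_a and deaths_b:
library(dplyr)
model_data_agg <- model_data %>%
  group_by(gid) %>%
  summarise(deaths_civilians = sum(deaths_civilians, na.rm = TRUE),
            deaths_a = sum(deaths_a, na.rm = TRUE),
            deaths_b = sum(deaths_b, na.rm = TRUE),
            .groups = "drop")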

R - Merge rows in a dataframe to fill NAs given a number of identifiers

Let's say I have a dataframe that includes 5 years of data showing the number of homicides in the 50 biggest cities of all 50 states in America. Also in the dataframe is the population of that city and the number of guns owned. However, in each row there is only one of population, homicides or guns (see df in example below):
df1 = data.frame(state=1:50, city=rep(1:50, each=50), year=rep(1:5, each=2500), population=sample(1000:200000,12500), homicides=NA, guns=NA)
df2 = data.frame(state=1:50, city=rep(1:50, each=50), year=rep(1:5, each=2500), population=NA, homicides=sample(1:200,12500,replace=T), guns=NA)
df3 = data.frame(state=1:50, city=rep(1:50, each=50), year=rep(1:5, each=2500), population=NA, homicides=NA, guns=round((df1$population/sample(2:20,12500,replace=T))))
df = rbind(df1, df2, df3)
This resulting dataframe is 25,000 rows longer than it needs to be since each row representing a unique combination of state, city and year could include population, homicide and guns data, rather than just one. In other words, it could look like this:
df.ideal = data.frame(state=1:50, city=rep(1:50, each=50), year=rep(1:5, each=2500), population=sample(1000:200000,12500), homicides=sample(1:200,12500,replace=T), guns=round((df1$population/sample(2:20,12500,replace=T))))
Starting with df, how can I merge the population, guns and homicides rows to create just one row for each state, city, year combination, therefore resulting in df.ideal?
Sadly the solution has to work for unbalanced dataframes as well - in an ideal world it would be great if a warning was presented when a value replaced anything but an NA.
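For what it's worth, here is a minimal dplyr sketch. It assumes each state/city/year group carries at most one non-NA value per column (as in the balanced example above); the second block flags groups where that assumption fails, which is roughly the warning described.
library(dplyr)
# collapse to one row per state/city/year, taking the first non-NA value
# in each of population, homicides and guns (NA if the group has none)
df.merged <- df %>%
  group_by(state, city, year) %>%
  summarise(across(c(population, homicides, guns), ~ .x[!is.na(.x)][1]),
            .groups = "drop")
# flag groups with more than one distinct non-NA value, i.e. cases where
# a value would replace something other than an NA
conflicts <- df %>%
  group_by(state, city, year) %>%
  summarise(across(c(population, homicides, guns),
                   ~ n_distinct(.x, na.rm = TRUE) > 1),
            .groups = "drop") %>%
  filter(population | homicides | guns)
if (nrow(conflicts) > 0) warning("Some state/city/year groups contain conflicting non-NA values.")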
