I have a dataframe whose structure looks like this
I would like to reshuffle the way the data is presented by creating a new data frame that summarises the data above, so that it looks like this:
Therefore, for each European country, I will be creating 4 variables, each a sum of the capital expenditure variable under different conditions. Let's take the first one as an example:
This is the sum of total capital expenditure directed to Austria (so Destination country = 'Austria') from EU countries (Source country continent = EU).
Can someone indicate the code to create a new df with this structure and create the variable explained above?
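To make the first variable concrete, something along these lines is the kind of summary I'm after (the column names destination_country, source_continent, and capex are just placeholders for whatever my real columns are called):
library(dplyr)
# sketch only -- column names are placeholders for the real ones
capex_from_eu <- df %>%
  filter(source_continent == "EU") %>%
  group_by(destination_country) %>%
  summarise(capex_from_eu = sum(capex, na.rm = TRUE), .groups = "drop")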
Thanks a lot!
I have data that mismatches state abbreviations with zipcodes.
For example, my data assigns CA to an Illinois zipcode, 61820.
Hence, I want to match state and zipcode properly.
I was thinking about this approach:
df$state[60001 <= df$zipcode <= 62999] <-"IL"
But obviously, this is the wrong approach.
Does anyone know how to replace the values?
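I suspect I need the chained comparison split into two conditions joined with '&', and then repeated for every state, something like this (assuming zipcode is numeric; all ranges other than the IL example are just illustrative placeholders):
# the Illinois example from above, with a valid comparison
df$state[df$zipcode >= 60001 & df$zipcode <= 62999] <- "IL"
# to scale to every state, loop over a small lookup table of ranges
# (illustrative values only)
ranges <- data.frame(state = c("IL", "CA"),
                     lo    = c(60001, 90001),
                     hi    = c(62999, 96162))
for (i in seq_len(nrow(ranges))) {
  df$state[df$zipcode >= ranges$lo[i] & df$zipcode <= ranges$hi[i]] <- ranges$state[i]
}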
I am writing my thesis and I am struggling with some data preparation.
I have a dataset with prices, distance, and many other variables for several US airline routes. I need to identify the threat of entry on each route for a specific carrier (Southwest), and to do that I need to create, for each row of the dataset, a dummy that takes the value 1 if Southwest is flying from the takeoff airport of that row at that point in time.
My idea was an algorithm that checks the year and the takeoff airport ID (both variables in the dataset) and then, based on those values, filters the whole dataset by year <= the row's year, origin_airport == the row's origin_airport, and carrier == Southwest. If the filter produces any output, it means that Southwest is already flying from that airport by that time. Hence, if the filtering produces an output, the dummy should take the value 1, otherwise 0. This should be automated for each row in the dataset.
Any idea how to put this into R code? Or is there an easier way to address this issue?
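To be concrete, this is roughly the per-row logic I have in mind, written naively (the column names carrier, origin_airport, and year are my guesses at the real ones, and I suspect this approach is slow on a big file):
df$southwest_threat <- sapply(seq_len(nrow(df)), function(i) {
  # is there any Southwest record from the same origin airport,
  # in the same year or earlier?
  hit <- df$carrier == "WN" &
         df$origin_airport == df$origin_airport[i] &
         df$year <= df$year[i]
  as.integer(any(hit))
})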
This is the link to the dataset on dropbox:
https://www.dropbox.com/s/n09rp2vcyqfx02r/DB1B_completeDB1B_complete.csv?dl=0
The short answer is to use a self join.
Looking at your data set, I don't see IATA airport codes, but rather 6-digit origin and destination IDs (which do not seem to conform to anything in DB1A/DB1B?). Also, it's not clear (to me) what exactly the granularity of your data is, so I'm making some assumptions.
library(data.table)
setwd('<directory with your csv file>')
data <- fread('DB1B_completeDB1B_complete.csv')
# pull out the Southwest (WN) records
wn <- data[carrier=='WN']
# initialise the flag, then set it to 1 for every row whose
# ap_id/year/quarter/date combination also appears among the WN rows
data[, flag:=0]
data[wn, flag:=1, on=.(ap_id, year, quarter, date)]
So, this just extracts the WN records and then joins them back to the original table on ap_id (which defines the route?), year, quarter, and date. This assumes granularity is at the carrier/route/year/quarter/date level (i.e. one row per combination).
Before you do that, though, you need to do some serious data cleaning. For instance, while it looks like ORIGIN_AIRPORT_CD and DEST_AIRPORT_CD are parsed out of ap_id, there are about 1200 records where these are NA.
##
# missingness
#
data[, .(col = names(data), na.count=sapply(.SD, \(x) sum(is.na(x))))]
Also, my assumption that there is one row per carrier/route/year/quarter/date does not always seem to hold. This is an especially serious problem with the WN rows.
##
# duplicates??
#
data[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
wn[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
Finally, in attempting to quantify the impact of WN entry to a market, you probably should at least consider grouping nearby airports. For instance JFK/LGA/EWR are frequently considered "NYC", and SFO/OAK/SJC are frequently considered "Bay Area" (these are just examples). This means, for instance, that if WN started flying from LGA to a destination of interest it might also influence OA prices from JFK and EWR to that same destination.
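One simple way to do that grouping is a small lookup table joined back onto the data. The airport codes, metro labels, and the origin_code column below are just illustrative; your data would first need an actual airport-code column to join on:
##
# illustrative only -- map individual airports to a metro-area group
#
metro <- data.table(
  airport = c('JFK','LGA','EWR','SFO','OAK','SJC'),
  metro   = c('NYC','NYC','NYC','Bay Area','Bay Area','Bay Area')
)
data[metro, origin_metro := i.metro, on = .(origin_code = airport)]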
I'm working with NAICS data for all the counties in the US; there are 435,581 rows of data. Each county (county names are in columns A and B) has a series of businesses with associated codes, which are in column C (column D is a description of the business), and column E is the number of their employees. Each business has its own row, so you can imagine each county has tens of rows associated with it. I was wondering if there is a way to rearrange the data so that each county has only one row, but multiple columns with the business codes as their titles and the number of employees as the values.
I have added pictures so that you can see what I mean.
Before
What I'm looking for
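From what I've read, something like tidyr::pivot_wider might be what I need, but I'm not sure (the column names state, county, naics_code, and employees are just guesses at my real headers):
library(dplyr)
library(tidyr)
wide <- df %>%
  select(state, county, naics_code, employees) %>%
  pivot_wider(names_from  = naics_code,   # one column per business code
              values_from = employees,    # cell values = employee counts
              values_fn   = sum,          # collapse duplicate codes within a county
              values_fill = 0)            # counties missing a code get 0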
I have a data frame in R that examines the ELO rating of college football teams over the course of several decades.
Data Layout
Each row is a specific game; the team listed under the Team.A column is the winning team, while the team under Team.B is the losing team. The ELO scores under Elo.A and Elo.B are the scores for Team.A and Team.B in that game, respectively.
I want to create a time series that, for instance, looks at all of the ELO scores in Elo.A and Elo.B for Minnesota. Is there a way in R to pull the date and the scores from both of those columns for that one school?
How about:
df[df$Team.A=="Minnesota" | df$Team.B=="Minnesota", ]
And you can select specific columns using c(...) after the ','.
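For example, to pull just the date and both ELO columns for Minnesota (assuming the date column is called Date):
df[df$Team.A == "Minnesota" | df$Team.B == "Minnesota",
   c("Date", "Team.A", "Elo.A", "Team.B", "Elo.B")]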