Trouble with summing specific rows and columns - r

I have a problem and was wondering if there is code that would let me solve it faster than doing it manually.
For my example, I have 100 different funds, each holding numerous securities. Within each fund I have the Name of each security, the Date (which identifies the quarter), the State where the security was issued, and the Weighting of each security in the total fund. The Name is not important; only the State where it was issued matters.
I was wondering if there is a way to add up the Weighting in each fund for the specific States I want, for each quarter. So, say, from Fund1 I need the sum of the Weighting just for the states SC and AZ in 16-1Q: the sum would be (.18 + .001). I do not need to include the weighting for KS because I am not interested in that state. I would only be interested in SC and AZ for every FundId; in my real problem, though, I am interested in ~30 states. I would then do the same task for Fund1 for 16-2Q and so on until 17-4Q. My end goal is to find the sum of the portfolio weighting of the states I'm interested in, for every fund, and see how it changes over time. I can do this manually fund by fund, but is there a way to automatically sum up the Weighting for each FundId based on the States I want and for each Date (16-1Q, 16-2Q, etc.)?
In the end I would like a table such as the one in the attached screenshot, where (.XX) is the sum of the portfolio weight.
[Example of Data: link to a screenshot of the sample data frame]

The Example of Data link you sent has a much better data format than the "(.XX) is the sum of portfolio weight" example; only in Excel would you prefer that other kind of format.
So, using the example data frame, do this operation:
library(dplyr)
# group by fund, then sum the weightings within each group
example_data <- example_data %>%
  group_by(Fund_Id) %>%
  summarize(sum = sum(Weighting))
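That sums everything in each fund, though; the question also needs the state filter and the per-quarter split. A sketch of that fuller version, assuming the columns are named Fund_Id, Date, State, and Weighting as in the example above:

# states of interest (the real list would have ~30 entries)
states_wanted <- c("SC", "AZ")

example_data %>%
  filter(State %in% states_wanted) %>%   # keep only the states you care about
  group_by(Fund_Id, Date) %>%            # one group per fund per quarter
  summarize(total_weight = sum(Weighting), .groups = "drop")

This returns one row per fund per quarter with the summed weighting, which is the quantity to track over time.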

We can use aggregate in base R
aggregate(Weighting ~ Fund_id, example_data, sum)
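The same filtered, per-quarter version works in base R as well; a sketch under the same column-name assumptions (Fund_id spelled as in the line above):

states_wanted <- c("SC", "AZ")
aggregate(Weighting ~ Fund_id + Date,
          data = subset(example_data, State %in% states_wanted),
          FUN = sum)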

Related

How to set up a time series for this project (r)?

I am a cross country runner on a high school team, and I am using my limited knowledge of R and linear algebra to create a ranking index for xc teams.
I get my data from milesplit.com, but I am unsure if I am formatting this data properly. So far I have created matrices for each race, with odd columns holding runner scores and even columns holding times, and each team has a team_score and a team_time column. I want to analyze the growth of teams as a time series, but I have two questions about this:
(1): Can I combine all of these "race matrices" into a time series? Can I assign all the data in a race matrix a certain date, then make one big time series including all 25 race matrices I made? (A sketch of this is below.)
(2): Am I closing myself off to insights by not including name and grade for each runner (as I only record time and score)? If so, how can I write a matrix that contains all this information?
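For (1), one common approach is to convert each matrix to a data frame, tag it with its race date, and stack them into one long table. A minimal sketch, where race_matrices (a list of the 25 per-race matrices) and race_dates (one date per race, in the same order) are hypothetical names:

library(dplyr)

# race_dates <- as.Date(c("2023-09-02", "2023-09-09", ...))  # one per race
races_long <- bind_rows(
  lapply(seq_along(race_matrices), function(i) {
    df <- as.data.frame(race_matrices[[i]])
    df$date <- race_dates[i]   # tag every row with the race date
    df
  })
)
# races_long is one long data frame ordered by date, effectively a
# time series you can group, plot, and model per team.

For (2), extra columns such as name and grade cost nothing in this long format, so recording them keeps those insights available later.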

How can I create a code/loop to automate the creation of a variable?

I am writing my thesis and I am struggling with some data preparation.
I have a dataset with prices, distance, and many other variables for several US airline routes. I need to identify the threat of entry on each route for a specific carrier (Southwest), and to do that I need to create, for each row of the dataset, a dummy that takes the value 1 if Southwest is flying from the row's takeoff airport at that point in time.
The way I thought of approaching this is an algorithm that checks the year and the takeoff airport_ID (all variables in the dataset) and then, based on those values, filters the whole dataset by year <= the row's year, origin_airport == the row's origin_airport, and carrier == Southwest. If the filter produces any output, it means that Southwest was by that time already flying from that airport. Hence, if the filtering produces an output, the dummy should take the value 1, otherwise 0. This should be automated for each row in the dataset.
Any idea how to put this into RStudio code? Or is there an easier way to address this issue?
This is the link to the dataset on dropbox:
https://www.dropbox.com/s/n09rp2vcyqfx02r/DB1B_completeDB1B_complete.csv?dl=0
The short answer is to use a self join.
Looking at your data set, I don't see IATA airport codes, but rather 6-digit origin and destination ids (which do not seem to conform to anything in DB1A/DB1B??). Also, it's not clear (to me) what exactly the granularity of your data is, so I am making some assumptions.
library(data.table)
setwd('<directory with your csv file>')
data <- fread('DB1B_completeDB1B_complete.csv')
wn <- data[carrier == 'WN']                              # Southwest records only
data[, flag := 0]                                        # default: no WN match
data[wn, flag := 1, on = .(ap_id, year, quarter, date)]  # flag rows WN also flies
So this just extracts the WN records and then joins them back to the original table on ap_id (defines route??), year, quarter, and date. This assumes the granularity is the carrier/route/year/quarter/date level (i.e. one row per combination).
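Note that the exact match on year only flags rows in the same period as a WN record. If you want the "already flying by that point in time" condition from the question (year <= the row's year), one hedged variant is to compute the first year WN appears at each origin and flag everything from then on; a sketch, with origin_airport_cd assumed as the origin column:

wn_first <- data[carrier == 'WN',
                 .(first_wn_year = min(year)),
                 by = origin_airport_cd]
data[wn_first, on = 'origin_airport_cd',
     threat := as.integer(year >= first_wn_year)]  # 1 from WN's entry year onward
data[is.na(threat), threat := 0L]                  # origins WN never served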
Before you do that, though, you need to do some serious data cleaning. For instance, while it looks like ORIGIN_AIRPORT_CD and DEST_AIRPORT_CD are parsed out of ap_id, there are about 1200 records where these are NA.
##
# missingness
#
data[, .(col = names(data), na.count=sapply(.SD, \(x) sum(is.na(x))))]
Also, my assumption that there is one row per carrier/route/year/quarter/date does not always hold. This is an especially serious problem with the WN rows.
##
# duplicates??
#
data[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
wn[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
Finally, in attempting to quantify the impact of WN entry to a market, you probably should at least consider grouping nearby airports. For instance JFK/LGA/EWR are frequently considered "NYC", and SFO/OAK/SJC are frequently considered "Bay Area" (these are just examples). This means, for instance, that if WN started flying from LGA to a destination of interest it might also influence OA prices from JFK and EWR to that same destination.
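If you do map the numeric ids back to recognizable airport codes, that grouping can be a small lookup table joined onto the data. A purely illustrative sketch (the codes and metro labels here are examples, not from the dataset):

metro_map <- data.table(
  airport = c('JFK', 'LGA', 'EWR', 'SFO', 'OAK', 'SJC'),
  metro   = c('NYC', 'NYC', 'NYC', 'BayArea', 'BayArea', 'BayArea')
)
data[metro_map, on = .(origin_airport_cd = airport), metro := i.metro]
data[is.na(metro), metro := as.character(origin_airport_cd)]  # unmapped airports stand alone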

Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate; I am very new to R and programming in general (<1 month experience). I was recently given the opportunity to do data analysis on a project I wish to write up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a data frame, with columns for patient ID ('Cow ID'), location of sample ('QTR', either LH, LF, RH, or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
On any given day, each of the four anatomic locations on each patient was sampled and tested. I want to find the average 'SCC' of each of the locations for each of the patients, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example, but here the answer might be easy enough that it isn't necessary.
You can use the same code to do what you want. If we look at the aggregate documentation (?aggregate), we find that the second argument, by, is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore running:
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
returns the "double grouped" means.
In your case that means adding the "second layer" to the list you pass as value for the by parameter.
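Applied to the original data frame, that would look something like the sketch below (column names taken from the question; naming the list elements gives readable result columns instead of Group.1 and Group.2):

aggregate(cida_ams_scc_csv$SCC,
          by = list(Cow_ID = cida_ams_scc_csv$'Cow ID',
                    QTR = cida_ams_scc_csv$QTR),
          FUN = mean)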
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)
# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR = gl(n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40),
  Date = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC = runif(40),
  Cow_ID = 1:5
)
# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
which returns one grouped mean per Cow_ID and QTR combination.

How to aggregate count data into a specific geographic location

I have a dataset called 'model_data', in which the unit of observation is a geographic cell (gid) taken from the UCDP PRIO-GRID data. This is simply a standardised spatial grid structure that allows for finely-grained analysis at a very local level. I am researching the effect of the power balance between actors in civil wars on their use of violence against civilians, i.e. if actors perform well (operationalised as inflicting a majority of the battle deaths in any one gid), will they target more or fewer civilians in the same gid. To this end, I have merged my dataset using an inner_join (by gid) with a dataset containing all individual incidents of armed violence (the UCDP Georeferenced Events Dataset).
When I merge, the resulting dataset consists of duplicate gid observations for each individual incident of violence from the GED dataset. I need to find a way of aggregating all civilians deaths, all side_a deaths, and all side_b deaths in each specific gid, so that each observation in the dataset is a unique gid with all data on various types of deaths from that gid.
model_data <- inner_join(grid, ged, by = c("year", "gid" = "priogrid_gid", "xcoord" = "longitude", "ycoord" = "latitude"))
As you can see from the first column, there are multiple observations with the same gid. I would like to aggregate all the data from the observations with the same gid into one observation.
I've done a lot of research on the best way to do this, but have been unsuccessful so far. From what I gather, the aggregate() function from the "sp" package would be my best bet, but I cannot work out how to use it in the way I need! Thank you for any help that may come my way.
How about this?
library(dplyr)
model_data %>%
  select(-id) %>%
  distinct()
Assuming just using the "gid" without the "id" will get you where you want to go.
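If you need per-gid totals rather than de-duplicated rows, a hedged sketch of the aggregation itself (deaths_civilians, deaths_a, and deaths_b are assumed column names based on the question, not taken from the GED codebook):

model_data %>%
  group_by(gid) %>%
  summarise(civilian_deaths = sum(deaths_civilians),
            side_a_deaths = sum(deaths_a),
            side_b_deaths = sum(deaths_b),
            .groups = "drop")

This collapses the duplicated gid rows into one observation per gid with the summed death counts.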

Summarizing Data across age groups in R

I have data for customer purchases across different products. I calculated the amount_spent by multiplying Item Numbers by the respective Price.
I used the cut function to segregate people into different age bins. Now how can I find the aggregate amount spent by the different age groups, i.e. the contribution of each age group in terms of dollars spent?
Please let me know if you need any more info.
I am really sorry that I can't paste the data here due to remote desktop constraints. I am actually concerned with the result I got after the summarize function.
library(dplyr)
customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent))  # summarise_each()/funs() are deprecated
Though I am not sure if you want the contribution to the whole pie or just the sum in each age group.
If your data is of class data.table, you could go with
customer_transaction[,sum(amount_spent),by=age_gr]
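If it's the contribution to the whole pie you're after, a small sketch building on the dplyr version above (same assumed column names):

library(dplyr)
customer_transaction %>%
  group_by(age_gr) %>%
  summarise(amount_spent = sum(amount_spent)) %>%
  mutate(share = amount_spent / sum(amount_spent))  # each group's fraction of total dollars

Multiply share by 100 if you want percentages.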
