I am working with a large dataset (10 million+ cases) where each case represents a shop's monthly transactions for a given product (there are 17 products). As such, each shop is potentially represented across 204 cases (12 months * 17 products; note that not all shops sell all 17 products throughout the year).
I need to restructure the data so that there is one case per shop-product combination. This would result in each shop being represented by at most 17 cases.
Ideally, I would like to create the mean value of the transactions over the 12 months.
To be more specific, the dataset currently has 5 variables:
Shop Location — a unique 6-digit sequence
Month — 2013_MM (data is only from 2013)
Number of Units Sold
Total Profit (£)
Product Type — 17 different product types (this is a string variable)
I am working in R. It would be ideal to save this restructured dataset into a data frame.
I'm thinking an if/for loop could work, but I'm unsure how to implement it.
Any suggestions or ideas are greatly appreciated. If you need further information, please just ask!
Kind regards,
R
There really wasn't much here to work with, but this is what my interpretation leads to: you're looking to summarise your data set, grouped by shop_location and product_type.
# install.packages('dplyr')
library(dplyr)

your_data_set <- xxx  # replace xxx with your data frame

your_data_set %>%
  group_by(shop_location, product_type) %>%    # one group per shop-product pair
  summarise(profit = sum(total_profit),        # total profit over the months
            count = n(),                       # number of monthly records
            avg_profit = profit / count)       # mean profit per monthly record
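If you want the mean of both units and profit over the months directly, mean() does the same job in one step. A minimal sketch, assuming the units and profit columns are named units_sold and total_profit (adjust to your actual names):

library(dplyr)

your_data_set %>%
  group_by(shop_location, product_type) %>%
  summarise(avg_units  = mean(units_sold,   na.rm = TRUE),  # column names assumed
            avg_profit = mean(total_profit, na.rm = TRUE),
            .groups = "drop")

The result is a regular data frame (tibble) with at most 17 rows per shop, which matches the restructuring you describe.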
I have a question regarding the filtering of a loan dataset for my upcoming thesis.
My dataset consists of loan data reported quarterly over 5 years. The columns of interest are the 'Loan Identifier' and the 'Cut-Off-Date'. I just want to observe, for every subsequent quarter (cut-off date), the loans (via Loan Identifier) that already existed at the first reporting date (first quarter).
For example, if the first cut-off date has loans with the identifiers c("1001","1002","1003") and the second cut-off date, one quarter later, has loans with the identifiers c("1002","1003","1004"), R should keep only the identifiers that existed in the first quarter, c("1002","1003"), so that loans that are new during the analysis period are completely ignored.
Is it also possible to do all of that in one file, or should I extract the data for each cut-off date into a new table?
Thanks and best regards!
I am thinking about storing the loan identifiers from the first quarter as a vector. After that, I would split the loan dataset by cut-off date and merge the vector with each table via left_join, so that every loan that does not match the vector is disregarded.
As I have multiple loan pools with 15 pool cut-off dates, this seems very impractical to me. Perhaps there is a smarter, more efficient solution.
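A sketch of the filter-by-first-quarter approach in dplyr, keeping everything in one data frame; the names loans, loan_id, cutoff_date, and pool are assumptions, and min(cutoff_date) is taken as the first reporting date:

library(dplyr)

# within each pool, keep only loans whose identifier already appears
# at the earliest cut-off date; drop group_by(pool) if there is one pool
loans_filtered <- loans %>%
  group_by(pool) %>%
  filter(loan_id %in% loan_id[cutoff_date == min(cutoff_date)]) %>%
  ungroup()

Because the filter runs per pool, the 15 pool cut-off dates never need to be split into separate tables.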
I guess this is pretty basic, but I'm struggling to find a way to do it and can't find an answer online either. I'm trying to create a data frame with future dates, where the dates are duplicated for each combination of two other variables,
so I should have
Dates | Channel | Product
Channel can take 4 values and Product 7 values, and I need to create dates for the 45 days after the last day in my current df. Therefore I have 28 combinations per day, and my new df should be 1260 rows (45 * 7 * 4),
as in the sample below.
I know about this function
Dates <- seq(max(train$Date), by = "day", length.out = 45)
However, this will create a vector without duplicating the dates for each combination. Is there any way I can adapt this?
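A minimal sketch using base R's expand.grid(); it assumes train also has Channel and Product columns, and starts at max(train$Date) + 1 so all 45 dates fall after the last day in the current df:

# 45 future dates, starting the day after the last date in train
future_dates <- seq(max(train$Date) + 1, by = "day", length.out = 45)

# one row per date/channel/product combination
new_df <- expand.grid(Date    = future_dates,
                      Channel = unique(train$Channel),
                      Product = unique(train$Product))

nrow(new_df)  # 45 * 4 * 7 = 1260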
I'm working on some Covid-19 related questions in R Studio.
I have a data frame containing the columns date, cases (newly infected people on this date), deaths on this date, country, population, and indicator 14, which is the number of cases per 100,000 residents over the last 14 days, including the current date.
Now I want to create a new indicator, which is looking at the cases per 100,000 over the last 7 days.
The way to calculate it for day i would of course be: 7-day indicator_i = (sum from k = i-6 to i of cases_k / population) * 100,000
So I wanted to code a function incidence <- function(cases, population) {} that performs this formula on the data, but I'm struggling:
How can I always address the last 7 days?
I know that I can, e.g., compute a sum from 0 to 5 with the following: i <- 0:5; sum(i^2), but how do I define the range k = i-6 to i in this case?
Do I have to use a loop inside the function?
Thank you!
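One way to avoid a loop inside the function is a right-aligned rolling sum. A sketch using zoo::rollsumr(), assuming the data frame is called df, the columns are named as described, and rows are sorted by date within each country:

library(dplyr)
library(zoo)

df <- df %>%
  group_by(country) %>%
  arrange(date, .by_group = TRUE) %>%
  # sum the current day and the 6 preceding days; the first 6 rows per
  # country get NA because a full 7-day window is not yet available
  mutate(indicator_7 = rollsumr(cases, k = 7, fill = NA) / population * 100000) %>%
  ungroup()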
I have a problem and was wondering if there is code that would let me solve it faster than doing it manually.
So for my example, I have 100 different funds, each containing numerous securities. Within each fund, I have the Name of each security, the Date showing the given quarter, the State where the security was issued, and the Weighting of each security within the total fund. The Name is not important; only the State where it was issued is.
I was wondering if there is a way to add up the Weighting from each fund for specific States in each quarter. So, say, from Fund1 I need the sum of the Weighting just for the states SC and AZ in 16-1Q; the sum would be (.18 + .001). I do not need to include the weighting for KS because I am not interested in that state. I would only be interested in the states SC and AZ for every FundId, though in my real problem I am interested in ~30 states. I would then do the same for Fund1 for 16-2Q, and so on until 17-4Q.
My end goal is to find the sum of portfolio weightings for the states I'm interested in and see how they change over time. I can do this manually fund by fund, but is there a way to automatically sum up the Weighting for each FundId based on the States I want and for each Date (16-1Q, 16-2Q, etc.)?
In the end I would like a table such as:
(.XX) is the sum of portfolio weight
Example of Data
The Example of Data link you sent has a much better data format than the "(.XX) is the sum of portfolio weight" example... only in Excel would you prefer that other kind of format.
So, using the Example data frame, do this operation:
library(dplyr)

states_of_interest <- c("SC", "AZ")  # extend to your ~30 states

state_sums <- example_data %>%
  filter(State %in% states_of_interest) %>%  # keep only the states you want
  group_by(Fund_Id, Date) %>%                # one sum per fund per quarter
  summarize(sum_weight = sum(Weighting))
We can use aggregate in base R
aggregate(Weighting ~ Fund_Id + Date, subset(example_data, State %in% c("SC", "AZ")), sum)
I am working on a problem for a statistics class that uses baseball team data such as attendance, wins/losses, and other team stats. The problem statement calls for variables to be created for winning teams (those with 81 or more wins), losing teams (those with fewer than 81 wins), and attendance figures in three categories: less than 2 million, between 2 and 3 million, and more than 3 million.
The raw data is keyed by team name, with one team per row and then the stats in each column.
I then need to create a table with counts of the number of teams along those dimensions, like:
Winning Season   Low Attendance   Med. Attendance   High Attendance
Yes              3                12                3
No               2                10                2
We can use whatever tool we'd like to complete it and I am attempting to use R and RStudio to create the table in order to gain knowledge about stats and R at the same time. However, I can't figure out how to make it happen or what function(s) to use to create a table with those aggregate numbers.
I have looked at data.table and dplyr, among others, but I cannot seem to figure out how to get the counts grouped along those dimensions. If it were SQL, I would be able to write
select count(*) from table where attend < 2000000 and wins < 81
and then programmatically create the table. I can't figure out how to do the same in R.
Thank you for any help.
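A minimal sketch in base R with cut() and table(); the data frame name teams and the column names wins and attendance are assumptions:

# classify each team along both dimensions
teams$winning_season <- ifelse(teams$wins >= 81, "Yes", "No")
teams$attendance_cat <- cut(teams$attendance,
                            breaks = c(-Inf, 2e6, 3e6, Inf),
                            labels = c("Low Attendance", "Med. Attendance", "High Attendance"))

# rows: winning season; columns: attendance category
table(teams$winning_season, teams$attendance_cat)

table() plays the role of the programmatic loop over your SQL counts: it tallies every combination of the two categories at once.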