I guess this is pretty basic, but I'm struggling to find a way to do it and can't find an answer online either. I'm trying to create a data frame of future dates, where each date is repeated for every combination of two other variables,
so I should have
Dates | Channel | Product
Channel can take 4 values and Product 7 values, and I need to create dates for the 45 days after the last day in my current df. So I have 28 combinations per day, and my new df should have 1260 rows (45 * 7 * 4)
as the sample below
I know about this function
Dates =seq(max(train$Date), by="day", length.out=45)
However, this creates a plain vector that doesn't repeat the dates for each combination. Is there any way I can adapt this?
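One way to get every date repeated for each Channel/Product combination is base R's expand.grid(). The sketch below is only an illustration: the train data, the channel names and the product names are made-up placeholders, and it assumes train$Date is of class Date. Note that it starts the sequence at max(train$Date) + 1, so the first generated day falls after the last observed day.

```r
# Hypothetical stand-in for the existing data (assumption: Date column of class Date)
train <- data.frame(Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")))

# Placeholder values; the real channel and product names are not given in the question
channels <- paste0("channel_", 1:4)
products <- paste0("product_", 1:7)

# 45 consecutive days starting the day after the last observed date
future_dates <- seq(max(train$Date) + 1, by = "day", length.out = 45)

# One row per date x channel x product combination
future <- expand.grid(Dates = future_dates,
                      Channel = channels,
                      Product = products,
                      stringsAsFactors = FALSE)

# Sort so that the 28 combinations of each day sit together
future <- future[order(future$Dates, future$Channel, future$Product), ]
rownames(future) <- NULL

nrow(future)
```

Each of the 45 dates then appears 28 times, once per combination, which gives the 1260 rows described above.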
I'm working on some Covid-19 related questions in R Studio.
I have a data frame containing the columns of the date, cases (newly infected people on this date), deaths on this date, country, population, and indicator 14, which is the Number of cases per 100,000 residents over the last 14 days including the current date.
Now I want to create a new indicator, which is looking at the cases per 100,000 over the last 7 days.
The way to calculate it would of course be: 7 days indicator = (sum from k= i-6 to i of cases_k/population) * 100,000
So I wanted to code a function incidence <- function(cases, population) {} performing the formula on the data but I'm struggling:
How can I always address the last 7 days?
I know that I can, e.g., compute a sum from 0 to 5 with the following: i <- 0:5; sum(i^2), but how do I define the range from k = i-6 to i in this case?
Do I have to use a loop inside the function?
Thank you!
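A loop is not strictly needed for a trailing 7-day sum. As a minimal sketch, stats::filter() with seven weights of 1 and sides = 1 sums the current value and the six before it. The example assumes the rows belong to a single country and are sorted by date; the window argument and the sample numbers are made up for illustration.

```r
incidence <- function(cases, population, window = 7) {
  # Trailing rolling sum: element i is cases[i - 6] + ... + cases[i].
  # The first (window - 1) entries are NA because fewer than 7 days are available.
  rolling_sum <- as.numeric(stats::filter(cases, rep(1, window), sides = 1))
  rolling_sum / population * 100000
}

# Made-up daily case counts for one country with 1,000,000 residents
cases <- c(10, 20, 30, 40, 50, 60, 70, 80)
result <- incidence(cases, population = 1e6)
result
```

With several countries in one data frame, the function would have to be applied per country (for example with ave() or a grouped mutate), otherwise the window runs across country boundaries.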
Here's a portion of a dataset that I have of daily closes for certain stocks within a common period of time in .xlsx format:
What I need is an R script that would produce something like this:
So I need a row for each stock for every day in the time period, with the corresponding price in the third column, like above. Of course, I have more than 100 stocks over a period of 4 years, which makes more than 100 rows for each day for 4 years. For example, a hundred rows for the day 5.01.2015, and so forth.
I'm still very new to R so help is very much appreciated.
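The sample tables are not shown here, so the following is only a sketch under an assumed layout: a wide data frame with the date in the first column and one price column per stock. The column names Date, Stock and Price and the tickers AAA and BBB are hypothetical. The .xlsx file would first have to be read into a data frame (for example with the readxl package); that step is left out.

```r
# Assumed wide layout: first column is the date, remaining columns are stocks
wide <- data.frame(
  Date = as.Date(c("2015-01-05", "2015-01-06")),
  AAA  = c(10, 11),   # hypothetical ticker
  BBB  = c(20, 21)    # hypothetical ticker
)

n_stocks <- ncol(wide) - 1

# Stack the stock columns on top of each other: one row per stock per day
long <- data.frame(
  Date  = rep(wide$Date, times = n_stocks),
  Stock = rep(names(wide)[-1], each = nrow(wide)),
  Price = unlist(wide[-1], use.names = FALSE)
)

# Sort by day so all stocks of one day sit together
long <- long[order(long$Date, long$Stock), ]
rownames(long) <- NULL
long
```

The same reshaping is often done with tidyr::pivot_longer(); the base R version is used here only to avoid an extra dependency.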
I've just started learning R. As for now, I have prices PRC in a dataframe test together with the date and several other variables.
My goal is to calculate the following within the same dataframe so I can maintain the connection to the date.
1. Overlapping three-day log returns
2. One-day log returns
Through other posts I came up with the following code for the three day lag returns and the one-day lag returns respectively, but I am still unsure on how to incorporate it into my dataframe:
test$logR3 <- diff(log(test$PRC), lag=3)
This code currently doesn't work due to the difference in number of rows. How do I take this into account? Can I somehow put zeros or NAs in order to fill the missing rows?
Thank you in advance.
maybe something like:
days=c()
for(i in seq(3,nrow(test),3)){ #loop through it in steps of 3
  one_day_ago_diff=log(test$PRC[i])-log(test$PRC[i-1]) #difference between today and yesterday
  #difference between today and three days ago (NA when i is 3, because row i-3 does not exist)
  three_days_ago_diff=if(i>3) log(test$PRC[i])-log(test$PRC[i-3]) else NA
  days=c(days,c(three_days_ago_diff,NA,one_day_ago_diff)) # fills empty vector with diff from 3 days ago- followed by NA to skip 2 days ago and then one day ago
}
if(length(days)<nrow(test)){days=c(days, rep(NA,nrow(test)-length(days)))} #pad with NA so the lengths match
test$lags=days #add column to test
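As an alternative to looping, diff() can be used directly if the result is padded with leading NAs, since diff(x, lag = k) returns k fewer values than x. This is a minimal sketch with made-up prices; the column names logR1 and logR3 are just illustrative.

```r
# Made-up prices standing in for the real test data frame
test <- data.frame(PRC = c(100, 101, 103, 102, 105))

# One-day log returns: the first row has no previous day, so it gets NA
test$logR1 <- c(NA, diff(log(test$PRC)))

# Overlapping three-day log returns: the first three rows get NA
test$logR3 <- c(rep(NA, 3), diff(log(test$PRC), lag = 3))

test
```

Each return stays on the row of the later date, so the link to the date column is kept.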
I have a timestamp in one data frame that I am trying to match to the closest timestamp in a second dataframe, for the purpose of extracting data from the second dataframe.
Earlier I found that I can use data.table's rolling join with the "nearest" option:
library(data.table) # v1.9.6+
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 5 7 7 8
However, this results in a list that is 2 rows longer than the data file.
Is this because the time stamps of 2 observations were equally close to 2 time stamps (one earlier, one later)?
Is there an option for the case where the time of an observation falls exactly in the middle of 2 time stamps? Is there a way to choose one of the 2 time stamps? Or can you find out which observations have two possible time stamps?
If I merge the list with the data, the last 2 values in the list are unused, so I suspect this shifts the values against my data.
Thank You!
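One possible cause of the extra rows (an assumption, not something confirmed by the data above) is that some timestamps occur more than once in reference, so a single row of data matches several reference rows. The sketch below uses made-up tables to show how the number of matches per row of data can be counted with by = .EACHI, and how mult = "first" keeps a single match per row. It does not address how exact ties between two equally close timestamps are resolved.

```r
library(data.table)

# Made-up reference table with one duplicated timestamp (10:00 appears twice)
reference <- data.table(
  datetime = as.POSIXct(c("2020-01-01 10:00:00",
                          "2020-01-01 10:00:00",
                          "2020-01-01 12:00:00"), tz = "UTC"),
  refvalue = c(5, 6, 8)
)

# Made-up observations to be matched
data <- data.table(
  datetime = as.POSIXct(c("2020-01-01 10:00:00",
                          "2020-01-01 11:50:00"), tz = "UTC")
)

# How many reference rows does each row of data match?
match_counts <- reference[data, .N, by = .EACHI, roll = "nearest", on = "datetime"]

# Keep only the first match per row of data
one_per_row <- reference[data, refvalue, roll = "nearest", on = "datetime", mult = "first"]

match_counts
one_per_row
```

Rows of data with N greater than 1 in match_counts are the ones that produce more than one value in the joined result.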
I am working with a large dataset (10 million+ cases) where each case represents a shop's monthly transactions for a given product (there are 17 products). As such, each shop is potentially represented across 204 cases (12 months * 17 product sales; note, not all stores sell all 17 products throughout the year).
I need to restructure the data so that there is one case for each product transaction. This would result in each shop being represented by only 17 cases.
Ideally, I would like to create the mean value of the transactions over the 12 months.
To be more specific, the dataset currently has 5 variables:
Shop Location - a unique 6-digit sequence
Month - 2013_MM (data is only from 2013)
Number of Units Sold
Total Profit (£)
Product Type - 17 different product types (this is a string variable)
I am working in R. It would be ideal to save this restructured dataset into a data frame.
I'm thinking an if/for loop could work, but I'm unsure how to get this to work.
Any suggestions or ideas are greatly appreciated. If you need further information, please just ask!
Kind regards,
R
There really wasn't much here to work with, but this is what my interpretation leads to... You're looking to summarise your data set, grouped by shop_location and product_type
# install.packages('dplyr')
library(dplyr)
your_data_set <- xxx
your_data_set %>%
  group_by(shop_location, product_type) %>%
  summarise(profit = sum(total_profit),
            count = n(),
            avg_profit = profit/count)