Iterative partial sum on rows with the same dates in R

I would like to do some computation on several rows in a table.
I created an example below:
library(dplyr)
set.seed(123)
year_week <- c(200045:200053, 200145:200152, 200245:200252)
input <- as.vector(sample(1:10,25,TRUE))
partial_sum <- c(20,12,13,18,12,13,4,15,9,13,10,20,11,9,9,5,13,13,8,13,11,15,14,7,14)
df <- data.frame(year_week, input, partial_sum)
The columns input and year_week are given. The latter represents dates, but in my case the values are numeric, with the first four digits as the year and the last two as the working week of that year.
What I need is to iterate over each week in each year, sum the values from the same week in the other years, and save the result in a column called partial_sum here. The current row's value is excluded from the sum.
Week 53 in the leap year 2000 gets the same treatment, but since I have only one leap year here, its value 3 doesn't change.
Any idea how to do this?
Thank you

I would expect something like this to work, though as pointed out in the comments, your example isn't exactly reproducible.
library(dplyr)
df %>%
  mutate(week = substr(year_week, 5, 6)) %>%
  group_by(week) %>%
  # group total per week (this still includes the current row's own value)
  mutate(result = sum(input))

Perhaps this helps: group by 'week', taken as a substring of 'year_week', then subtract 'input' from the group's sum of 'input'.
library(dplyr)
df %>%
  group_by(week = substring(year_week, 5)) %>%
  mutate(partial_sum2 = sum(input) - input)
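For anyone who wants to check the "group total minus own value" idea without dplyr, here is a minimal base R sketch. The toy data frame is made up for illustration (it is not the asker's data), and taking the week as year_week %% 100 assumes the codes are always six-digit numbers as described in the question.

```r
# Hypothetical toy data: weeks 45 and 46 in three years, plus a lone week 53
df <- data.frame(
  year_week = c(200045, 200046, 200053, 200145, 200146, 200245, 200246),
  input     = c(3, 8, 5, 9, 10, 1, 6)
)

# The week is the last two digits of the numeric code
df$week <- df$year_week %% 100

# ave() returns the per-week total aligned with the original rows;
# subtracting the row's own value leaves the sum over the other years
df$partial_sum <- ave(df$input, df$week, FUN = sum) - df$input

print(df)
```

Note that a week occurring in only one year (week 53 here) ends up with 0 under this rule. If such a row should keep its own value instead, as the question seems to describe, that case would need separate handling.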

Related

How can I compute the difference in consecutive rows for multiple columns?

I have a df consisting of daily returns for various maturities. The first column consists of dates and the next 12 are maturities. I want to create a new df that calculates the difference in consecutive daily rates. Not sure where to start.

With multiple columns, diff can be applied to the matrix of values
rbind(0, diff(as.matrix(df[-1])))
Or we can use dplyr
library(dplyr)
df %>%
  mutate_at(vars(-Date), ~ . - lag(.))
Reproducible example
diff(as.matrix(head(mtcars)))
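To make the diff approach concrete on data shaped like the question's, here is a small sketch. The data frame rates and its columns m1 and m2 are invented stand-ins for the dates plus maturity columns; only base R is used.

```r
# Hypothetical daily rates: a Date column followed by maturity columns
rates <- data.frame(
  Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03")),
  m1   = c(1.0, 1.5, 1.25),
  m2   = c(2.0, 2.0, 3.0)
)

# diff() on a matrix works column by column and returns one row fewer,
# so the first date is dropped to keep the rows aligned
changes <- data.frame(
  Date = rates$Date[-1],
  diff(as.matrix(rates[-1]))
)

print(changes)
```

Dropping the first date is one design choice; padding with a row of 0 or NA, as rbind(0, ...) does above, keeps the original number of rows instead.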

In the future, please try to provide a reprex rather than just a picture!
Here is one way to get what you're looking for:
library(dplyr)
df <-
  data.frame(
    dates = c("2019-01-01", "2019-01-02", "2019-01-03"),
    original_numbers = c(1, 2, 3)
  )
df2 <- df %>%
  mutate(
    difference = original_numbers - lag(original_numbers)
  )

How to count all the similar dates in a separate column

I'm trying to collapse all rows with similar Date/Time values into one row plus a count, so that I end up with two columns: one for the Date/Time and one for the count.
I used this call to round my observations into 15-minute periods:
dat$by15 <- cut(dat$Date_Time, breaks = "15 min")
I then tried this call, but it seems to "jump" to a previous dataset and gives me the wrong observations for some reason:
dat <- aggregate(dat, by = list(dat$by15), length )
Thank you guys!

I'm not sure I understood the question, but if you are trying to group by date and count the observations for each date, it's really simple:
library(dplyr)
grouped_dates <- dat %>%
  group_by(Date_Time) %>%
  summarise(Count = n())
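If the goal is one row per 15-minute period, a base R sketch along these lines might be closer to what was asked. The timestamps are made up, and flooring the epoch seconds to multiples of 900 is my own substitute for cut(), chosen so that the bins start on the quarter hour.

```r
# Hypothetical timestamps standing in for dat$Date_Time
dat <- data.frame(
  Date_Time = as.POSIXct(
    c("2020-01-01 10:01:00", "2020-01-01 10:07:00", "2020-01-01 10:16:00",
      "2020-01-01 10:44:00", "2020-01-01 10:46:00"),
    tz = "UTC"
  )
)

# Floor each timestamp to the start of its 15-minute (900 second) period
dat$by15 <- as.POSIXct(
  floor(as.numeric(dat$Date_Time) / 900) * 900,
  origin = "1970-01-01", tz = "UTC"
)

# Aggregate a single column and store the result under a new name,
# so the original data is not overwritten
counts <- aggregate(dat["Date_Time"], by = list(by15 = dat$by15), FUN = length)
names(counts)[2] <- "Count"

print(counts)
```

Assigning the result to counts rather than back to dat may also explain the "jumping" described in the question: once dat has been overwritten by the aggregate, rerunning the earlier lines operates on the summary instead of the raw observations.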

How to group by month with summing or counting in R?

I am using the code below to group by month and then sum or count. However, the SLARespond column seems to be summed over the whole data set, not for each month.
Is there any way to fix this?
Also, instead of the sum function, can I count the rows where SLAIncident$IsSlaRespondByViolated == 1?
I appreciate any help!
SLAIncident <- SLAIncident %>%
mutate(month = format(SLAIncident$CreatedDateLocal, "%m"), year = format(SLAIncident$CreatedDateLocal, "%Y")) %>%
group_by(year, month) %>%
summarise(SLARespond = sum(SLAIncident$IsSlaRespondByViolated))

If you could provide a small bit of the dataset to illustrate your example, that would be great. The sum covers the whole data set because SLAIncident$IsSlaRespondByViolated inside summarise() refers to the full column and ignores the grouping; use the bare column name instead. I would also first make sure that your months/years are characters or factors so that dplyr can group on them. An ifelse function wrapped in a sum should fit your criteria for the second part of the question. I am using your code here to convert the dates into month and year, but I recommend lubridate.
SLAIncident <- SLAIncident %>%
  mutate(month = as.character(format(CreatedDateLocal, "%m")),
         year = as.character(format(CreatedDateLocal, "%Y"))) %>%
  group_by(year, month) %>%
  summarise(SLARespond = sum(IsSlaRespondByViolated),
            sla_1 = sum(ifelse(IsSlaRespondByViolated == 1, 1, 0)))
Also, as hinted at in the comments, these column names are really long and could use some tidying.
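As a quick way to see the per-month count in isolation, here is a base R sketch on invented data. The column names follow the question, but the values are made up, and combining year and month into one "%Y-%m" key is just a convenience for this example.

```r
# Hypothetical incidents: two in January, three in February
SLAIncident <- data.frame(
  CreatedDateLocal = as.Date(c("2020-01-05", "2020-01-20", "2020-02-03",
                               "2020-02-10", "2020-02-28")),
  IsSlaRespondByViolated = c(1, 0, 1, 1, 0)
)

# One grouping key per calendar month
SLAIncident$month <- format(SLAIncident$CreatedDateLocal, "%Y-%m")

# Count the rows flagged with 1 within each month
monthly <- aggregate(
  IsSlaRespondByViolated ~ month,
  data = SLAIncident,
  FUN = function(v) sum(v == 1)
)

print(monthly)
```

Because the function receives only the values of one month at a time, the count cannot leak across groups the way the SLAIncident$ reference did.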

Data frame subset according to matching values in R

I have a data.frame with information on racing performance on horses. I have a variable Competition.year that has a "Total" row and then a row for each year the horse competed. I also have a variable Competition.age that describes the age the horses were in each specific year they competed.
I am trying to create a subsetted df based on their best racing times and the age they were when they achieved it. In the "Total" row, the racing time included is their best one. So, I need to figure out how to tell R that, when the race time in Total row is equal to whenever it is they actually achieved that time, include the age they were then in the new data frame. I am super new to R so I have no idea where to even begin doing this, I've tried some stuff I've seen on other questions but I can't get it right. Any help would be much appreciated!
My df looks like this:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
The desired df should have 223 rows (since that is the total number of horses I have), with the columns Name, Competition.year == "Total", Time.record.auto.start, Time.record.volt.start and Competition.age

Firstly, I had to change your sample data to make sure all 5 variables only had 14 observations each. I did this by removing the final NA in the Competition.age variable. I also had to swap around the 94.3 and 98.3 values in the Time.record.volt.start variable so that the values lined up with what was expected in the Total column for the horse with Name equal to Pelson Poika.
Here is the corrected data:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
And here is a simple dplyr solution, which I think does what you want.
library(dplyr)
# Best times per horse, taken from the "Total" rows
df1 <- travdata %>%
  group_by(Name) %>%
  filter(Competition.year == "Total") %>%
  select(Name, Time.record.auto.start, Time.record.volt.start)
# Per-year rows, which carry the age
df2 <- travdata %>% filter(Competition.year != "Total")
# Keep the yearly rows whose times equal the horse's best times
df3 <-
  inner_join(
    df1,
    df2,
    by = c("Name", "Time.record.auto.start", "Time.record.volt.start")
  )
The dataframe df3 should return what you were after.
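The same matching idea can be sketched in base R with merge(). The data frame races below is a simplified, invented version with a single time column (Time.record) and no missing times, so it only illustrates the join itself, not every quirk of the real data.

```r
# Hypothetical simplified data: one "Total" row per horse plus yearly rows
races <- data.frame(
  Name             = c("A", "A", "A", "B", "B", "B"),
  Competition.year = c("Total", "2004", "2005", "Total", "2003", "2004"),
  Time.record      = c(92.5, 92.5, 98.4, 94.3, 99.1, 94.3),
  Competition.age  = c(NA, 6, 7, NA, 4, 5)
)

# Best time per horse comes from the "Total" rows
totals <- races[races$Competition.year == "Total", c("Name", "Time.record")]

# Yearly rows carry the age at which each time was run
yearly <- races[races$Competition.year != "Total", ]

# Keep the yearly row whose time equals the horse's best time
best <- merge(totals, yearly, by = c("Name", "Time.record"))

print(best)
```

Joining on decimal times relies on the values being stored identically in both rows, which should hold when the Total row was copied from the yearly row but is worth checking on the real data.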

subset my df provided that each ID has >10 obs a month

I am trying to clean my stocks' df and I need to get rid of the ones that have fewer than 10 observations per month.
Already checked these 2 threads:
subsetting-based-on-observations-in-a-month
and ddply-for-sum-by-group-in-r
But I'm a noob and I cannot figure it out yet.
In short: Please, help me out eliminating IDs (Stocks) whose observations per month are <10 (for any month if possible). They are Id'd via the permanent number from CRSP (permno).
Here is the df: Lessthan10days.csv
Thank you so much,
Leo

We could create a column 'MonthYr' from the 'date' column after converting it to 'Date' class. Get the number of observations ('n') per group ('permno', 'MonthYr') and use that to remove the IDs ('permno') that have at least one 'n' less than 10.
library(dplyr)
res <- df1 %>%
  mutate(MonthYr = format(as.Date(date, format = '%m/%d/%Y'), '%Y-%m')) %>%
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%
  group_by(permno) %>%
  filter(all(n >= 10))
all(res$n >= 10)
#[1] TRUE
tbl <- table(res$permno, res$MonthYr)
all(tbl[tbl != 0] >= 10)
#[1] TRUE
Or using a similar approach with data.table
library(data.table)
setDT(df1)[, N := .N, by = list(permno, MonthYr = format(as.Date(date,
    format = '%m/%d/%Y'), '%Y-%m'))][, if (all(N >= 10)) .SD, by = permno]
data
df1 <- read.csv('Lessthan10days.csv', header=TRUE, stringsAsFactors=FALSE)

I'd just like to add that the following commands work only partially:
library(dplyr)
res <- df1 %>%
  mutate(MonthYr = format(as.Date(date, format = '%m/%d/%Y'), '%Y-%m')) %>%
  group_by(permno, MonthYr) %>%
  mutate(n = n()) %>%
  group_by(permno) %>%
  filter(all(n >= 10))
all(res$n >= 10)
#[1] TRUE
tbl <- table(res$permno, res$MonthYr)
all(tbl[tbl != 0] >= 10)
#[1] TRUE
They do not clean the sample perfectly. I believe that some NA values are counted as observations, so they might 'escape' the subsetting/cleaning.
Therefore I did it manually to be sure. One suggestion would be to use just:
tbl <- table(res$permno, res$MonthYr)
write.csv(tbl, "tbl.csv")
And then look through the spreadsheet yourself to clean out obs < 10 (for each year/stock).
On top of that, you can filter out the NA values for Price and erase the 5-10 stocks (IDs) that have a couple of months with < 10 observations.
Hope this helps a bit. Thanks again for your help!
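One way to avoid the manual spreadsheet step might be to drop the NA prices before counting. The sketch below uses base R on invented data; the column name price and the threshold min_obs = 3 (standing in for 10) are assumptions made for the example, since the real file is not shown here.

```r
# Hypothetical data: id 2 has three rows in the month but one missing price
df1 <- data.frame(
  permno = c(1, 1, 1, 2, 2, 2),
  date   = c("01/02/2015", "01/05/2015", "01/06/2015",
             "01/02/2015", "01/05/2015", "01/06/2015"),
  price  = c(10, 11, 12, 20, NA, 22)
)
min_obs <- 3  # stands in for the 10 observations required in the question

# Remove rows without a price so they are not counted as observations
valid <- df1[!is.na(df1$price), ]
valid$MonthYr <- format(as.Date(valid$date, format = "%m/%d/%Y"), "%Y-%m")

# Number of valid observations per id and month, aligned with the rows
n <- ave(seq_len(nrow(valid)), valid$permno, valid$MonthYr, FUN = length)

# Drop every id that falls short in at least one month
bad_ids <- unique(valid$permno[n < min_obs])
res <- valid[!valid$permno %in% bad_ids, ]

print(res)
```

A month in which an id has no valid rows at all does not appear in valid, so this sketch would not flag it; whether that matters depends on how complete the calendar is for each stock.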
