R - How to sum a column based on date range? [duplicate] - r

This question already has an answer here:
R // Sum by based on date range
(1 answer)
Closed 7 years ago.
Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 on previous 3 days of 7/27/2015 (not including 7/27).
Q2: Sum of Var1 on previous 3 days of 7/25/2015 (This is not last row), basically I choose anyday as reference day, and then calculate rolling sum.

As suggested in one of the comments in the link referenced by #SeƱorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d=seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by='days'),
x=sample(20, size=45, replace=T))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
head(zoo::rollsum(df$x, k=k), n=-1))
df
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The 0, cumsum(...) is to pre-populate the first two rows that are ignored (rollsum(x, k) returns a vector of length length(x)-k+1). The head(..., n=-1) discards the last element, because you said that the nth entry should sum the previous 3 and not its own row.

Related

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm) a count of things in that size bin, a site and year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into each the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but I did not work for me, not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", ids="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2))
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

Dates are not keeping specified format in R data frame

Simply put, I'm grabbing dates for events that meet certain conditions in df1 and putting them in a new data frame (df2). The formatting of dates in df2 should be the same formatting in df1 ("2000-09-12", or %Y-%m-%d). However, the dates in df2 read "11212", "11213", etc.
to generate data:
"Date"<-c("2000-09-08", "2000-09-11","2000-09-12","2000-09-13","2000-09-14","2000-09-15","2000-09-18","2000-09-19","2000-09-20","2000-09-21", "2000-09-22","2000-09-25")
"Event"<-c("A","N","O","O","O","O","N","N","N","N","N","A")
df1<-data.frame(Date,Event)
df1
Date Event
1 2000-09-08 A
2 2000-09-11 N
3 2000-09-12 O
4 2000-09-13 O
5 2000-09-14 O
6 2000-09-15 O
7 2000-09-18 N
8 2000-09-19 N
9 2000-09-20 N
10 2000-09-21 N
11 2000-09-22 N
12 2000-09-25 A
here is the code:
"df2"<-data.frame()
"tmp"<-data.frame(1,2)
i<-c(1:4)
for (x in i)
{
date1<- df1$Date[df1$Event=="O"][x]
date2<- df1$Date[df1$Event=="A" & df1$Date => date1] [1]
as.numeric(difftime(date2, date1))->tmp[1,2]
as.Date(as.character(df1$Date[df1$Event=="O"][x]), "%Y-%m-%d")->tmp[1,1] ##the culprit
rbind(df2, tmp)->df2
}
Loop output looks like this:
X1 X2
1 11212 13
2 11213 12
3 11214 11
4 11215 10
I want it to look like this:
X1 X2
1 "2000-09-12" 13
2 "2000-09-13" 12
3 "2000-09-14" 11
4 "2000-09-14" 10
If I understand correctly, the OP wants to find for each "O" event the difference in days to the next following "A" event.
This can be solved using a rolling join. We extract the "O" events and the "A" events into two separate data.tables and join both on date.
This will avoid all the hassle with the data format and works also if df1 is not already ordered by Date.
library(data.table)
setDT(df1)[Event == "A"][df1[Event == "O"],
on = "Date", roll = -Inf, .(Date, x.Date - i.Date)]
Date V2
1: 2000-09-12 13 days
2: 2000-09-13 12 days
3: 2000-09-14 11 days
4: 2000-09-15 10 days
Note that roll = -Inf rolls backwards (next observation carried backward (NOCB)) because the date of the next "A" event is required.
Data
Date <- as.Date(c("2000-09-08", "2000-09-11","2000-09-12","2000-09-13","2000-09-14","2000-09-15",
"2000-09-18","2000-09-19","2000-09-20","2000-09-21", "2000-09-22","2000-09-25"))
Event <- c("A","N","O","O","O","O","N","N","N","N","N","A")
df1 <- data.frame(Date,Event)

Calculate rolling average of simulated data series with data.table

I am simulating a price time series, where the time horizon basically is that each month consists of 20 working days and 12 months are one year. I now would like to calculate the rolling average of this price, always based on the first day of the month.
I do have a working solution, but would like to know if there's a more elegant or faster one.
dt.oil.price
Period Month Day.Month Oil.Price Oil.Supply Risk.Free.Interest
1: 1 1 1 39.4560000 NA 0.08642857
2: 2 1 2 3.7889460 NA 0.08642857
3: 3 1 3 51.0748751 NA 0.08642857
4: 4 1 4 60.6282853 NA 0.08642857
5: 5 1 5 35.7267224 NA 0.08642857
6: 6 1 6 26.1868977 NA 0.08642857
7: 7 1 7 32.6488136 NA 0.08642857
8: 8 1 8 42.6397549 NA 0.08642857
9: 9 1 9 18.8969991 NA 0.08642857
...
20: 20 1 20 8.8036135 NA 0.08642857
21: 21 2 1 2.5559526 NA 0.08642857
22: 22 2 2 24.3996401 NA 0.08642857
...
40: 40 2 20 41.2988566 NA 0.08642857
41: 41 3 1 20.8012327 NA 0.08642857
42: 42 3 2 70.5297726 NA 0.08642857
Just to give you an idea on the structure of the data. To create the above data structure with 60 periods:
set.seed(1);
dt.oil.price <- as.data.table(cbind( Period = 1:60,
Month = as.integer(rep(1:(60/20), each = 20))[1:60],
Oil.Price=rnorm(3*20,mean = 50, sd = 10)))
dt.oil.price[,"Day.Month" := rank(Period),by="Month"]
With the following code I can then select all first days of a month and calculate the mean of the oil price for these days:
dt.oil.price[ Day.Month == 1, mean(Oil.Price)]
In the next step I use another helper column "Num.Months" to rank the number of months accordingly, by
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)]
and with this I can then select only the last two months for the average calculation, by subsetting this
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)][Num.Months <= 2, Oil.Price]
A code snippet, which allows to calculate the mean without using an explicit helper column for the last three months:
dt.oil.price[Day.Month == 1 & Period <= 60, {Num.Months = rank(-Period); list("Period" = Period, "Month" = Month, "Oil.Price" = Oil.Price, "Num.Months" = Num.Months)}][Num.Months <=12, mean(Oil.Price)]
I hope my steps are all clear and it becomes also clear, what I would like to achieve. It is also possible to calculate the moving average dynamically by defining for example a period and then calculate the moving average for the last 12 months preceding that period. This can be achieved, by sub-setting the data.table only to periods smaller than the defined period and then calculating "Num.Months" for this data.table subset.

How to run a loop in R to find a unique combination of numbers within a range of 7?

I have a dataset which looks something like this:-
Key Days
A 1
A 2
A 3
A 8
A 9
A 36
A 37
B 14
B 15
B 44
B 45
I would like to split the individual keys based on the days in groups of 7. For e.g.:-
Key Days
A 1
A 2
A 3
Key Days
A 8
A 9
Key Days
A 36
A 37
Key Days
B 14
B 15
Key Days
B 44
B 45
I could use ifelse and specify buckets of 1-7, 7-14 etc until 63-70 (max possible value of days). However the issue lies with the days column. There are lots of cases wherein there is an overlap in days - Take days 14-15 as an example which would fall into 2 brackets if split using the ifelse logic (7-14 & 15-21).
The ideal method of splitting this would be to identify a day and add 7 to it and check how many rows of data are actually falling under that category. I think we need to use loops for this. I could do it in excel but i have 20000 rows of data for 2000 keys hence i'm using R. I would need a loop which checks each key value and for each key it further checks the value of days and buckets them in group of 7 by checking the first day value of each range.
We create a grouping variable by applying %/% on the 'Day' column and then split the dataset into a list based on that 'grp'.
grp <- df$Day %/%7
split(df, factor(grp, levels = unique(grp)))
#$`0`
# Key Days
#1 A 1
#2 A 2
#3 A 3
#$`1`
# Key Days
#4 A 8
#5 A 9
#$`5`
# Key Days
#6 A 36
#7 A 37
#$`2`
# Key Days
#8 B 14
#9 B 15
#$`6`
# Key Days
#10 B 44
#11 B 45
Update
If we need to split by 'Key' also
lst <- split(df, list(factor(grp, levels = unique(grp)), df$Key), drop=TRUE)

Turning one row into multiple rows in r [duplicate]

This question already has answers here:
Combine Multiple Columns Into Tidy Data [duplicate]
(3 answers)
Closed 5 years ago.
In R, I have data where each person has multiple session dates, and the scores on some tests, but this is all in one row. I would like to change it so I have multiple rows with the persons info, but only one of the session dates and corresponding test scores, and do this for every person. Also, each person may have completed different number of sessions.
Ex:
ID Name Session1Date Score Score Session2Date Score Score
23 sjfd 20150904 2 3 20150908 5 7
28 addf 20150905 3 4 20150910 6 8
To:
ID Name SessionDate Score Score
23 sjfd 20150904 2 3
23 sjfd 20150908 5 7
28 addf 20150905 3 4
28 addf 20150910 6 8
You can use melt from the devel version of data.table ie. v1.9.5. It can take multiple 'measure' columns as a list. Instructions to install are here
library(data.table)#v1.9.5+
melt(setDT(df1), measure = patterns("Date$", "Score(\\.2)*$", "Score\\.[13]"))
# ID Name variable value1 value2 value3
#1: 23 sjfd 1 20150904 2 3
#2: 28 addf 1 20150905 3 4
#3: 23 sjfd 2 20150908 5 7
#4: 28 addf 2 20150910 6 8
Or using reshape from base R, we can specify the direction as 'long' and varying as a list of column index
res <- reshape(df1, idvar=c('ID', 'Name'), varying=list(c(3,6), c(4,7),
c(5,8)), direction='long')
res
# ID Name time Session1Date Score Score.1
#23.sjfd.1 23 sjfd 1 20150904 2 3
#28.addf.1 28 addf 1 20150905 3 4
#23.sjfd.2 23 sjfd 2 20150908 5 7
#28.addf.2 28 addf 2 20150910 6 8
If needed, the rownames can be changed
row.names(res) <- NULL
Update
If the columns follow a specific order i.e. 3rd grouped with 6th, 4th with 7th, 5th with 8th, we can create a matrix of column index and then split to get the list for the varying argument in reshape.
m1 <- matrix(3:8,ncol=2)
lst <- split(m1, row(m1))
reshape(df1, idvar=c('ID', 'Name'), varying=lst, direction='long')
If your data frame name is data
Use this
data1 <- data[1:5]
data2 <- data[c(1,2,6,7,8)]
newdata <- rbind(data1,data2)
This works for the example you've given. You might have to change column names appropriately in data1 and data2 for a proper rbind

Resources