Suppose I have the following data frame. How can I create a new "avg" column that holds the average of the last 2 dates ("date") for each group?
The idea is to apply this to a dataset with hundreds of thousands of rows, so performance is important. The function should accept a variable number of months (for example 2 or 3 months) and be able to switch between simple and medium average.
Thanks in advance.
table1<-data.frame(group=c(1,1,1,1,2,2,2,2),date=c(201903,201902,201901,201812,201903,201902,201901,201812),price=c(10,30,50,20,2,10,9,20))
group date price
1 1 201903 10
2 1 201902 30
3 1 201901 50
4 1 201812 20
5 2 201903 2
6 2 201902 10
7 2 201901 9
8 2 201812 20
result<-data.frame(group=c(1,1,1,1,2,2,2,2),date=c(201903,201902,201901,201812,201903,201902,201901,201812),price=c(10,30,50,20,2,10,9,20), avg = c(20, 40, 35, NA, 6, 9.5, 14.5, NA))
group date price avg
1 1 201903 10 20.0
2 1 201902 30 40.0
3 1 201901 50 35.0
4 1 201812 20 NA
5 2 201903 2 6.0
6 2 201902 10 9.5
7 2 201901 9 14.5
8 2 201812 20 NA
Sort the data.frame first so that date is ascending within each group:
table1 <- table1[order(table1$group, table1$date), ]
Create a moving average function with an argument for the number of months (other function options are available from: Calculating moving average). stats::filter is called explicitly so that a loaded dplyr cannot mask it with dplyr::filter:
mov_avg <- function(y, months = 2){as.numeric(stats::filter(y, rep(1 / months, months), sides = 1))}
Use the classic do.call-lapply-split combo with this mov_avg function
table1$avg_2months <- do.call(c, lapply(split(x=table1$price, f=table1$group), mov_avg, months=2))
table1$avg_3months <- do.call(c, lapply(split(x=table1$price, f=table1$group), mov_avg, months=3))
table1
group date price avg_2months avg_3months
4 1 201812 20 NA NA
3 1 201901 50 35.0 NA
2 1 201902 30 40.0 33.33333
1 1 201903 10 20.0 30.00000
8 2 201812 20 NA NA
7 2 201901 9 14.5 NA
6 2 201902 10 9.5 13.00000
5 2 201903 2 6.0 7.00000
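The question also asks for a way to switch between averaging schemes. Below is a minimal sketch of one way to do that, assuming a weighted average is an acceptable second scheme. The names `roll_avg` and `weights` are hypothetical and not part of the answer above, and the data must already be sorted by date (oldest first) within each group.

```r
# Hypothetical helper: trailing moving average over `months` rows.
# `weights` is an assumed extension; equal weights give the simple average.
roll_avg <- function(y, months = 2, weights = rep(1, months)) {
  stopifnot(length(weights) == months)
  # stats::filter() raises an error when the window is longer than the series
  if (length(y) < months) return(rep(NA_real_, length(y)))
  w <- weights / sum(weights)
  # With sides = 1, w[1] multiplies the current row, w[2] the previous one, ...
  as.numeric(stats::filter(y, w, sides = 1))
}

price    <- c(20, 50, 30, 10)                               # group 1, oldest first
simple   <- roll_avg(price, months = 2)                     # NA 35 40 20
weighted <- roll_avg(price, months = 2, weights = c(2, 1))  # current month counts double

# ave() applies the function per group and keeps the original row order
df <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2),
                 price = c(20, 50, 30, 10, 20, 9, 10, 2))
df$avg <- ave(df$price, df$group, FUN = function(p) roll_avg(p, months = 2))
```

Using `ave()` instead of the do.call-lapply-split combo avoids relying on the rows being ordered by group, since `ave()` writes each result back to its original position.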
If your date column is sorted (newest first within each group, as in the example), then here's a way to do it using data.table:
library(data.table)
setDT(table1)[, next_price := dplyr::lead(price), by = group][, total_price := price + next_price][, avg := total_price / 2][, c("total_price", "next_price") := NULL]
table1
group date price avg
1: 1 201903 10 20.0
2: 1 201902 30 40.0
3: 1 201901 50 35.0
4: 1 201812 20 NA
5: 2 201903 2 6.0
6: 2 201902 10 9.5
7: 2 201901 9 14.5
8: 2 201812 20 NA
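The answer above hard-codes a window of 2. A possible generalisation to any window size, sketched with `data.table::frollmean` and assuming a data.table version recent enough to provide it; `n_months` is an illustrative variable name.

```r
library(data.table)

dt <- data.table(
  group = c(1, 1, 1, 1, 2, 2, 2, 2),
  date  = c(201903, 201902, 201901, 201812, 201903, 201902, 201901, 201812),
  price = c(10, 30, 50, 20, 2, 10, 9, 20)
)

n_months <- 2  # window size (illustrative name)

# Rows are newest first, so align = "left" averages each row with the
# n_months - 1 rows that follow it, i.e. the preceding months.
dt[, avg := frollmean(price, n_months, align = "left"), by = group]
dt
```

If the data were sorted oldest first instead, the default `align = "right"` would be the matching choice.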
I have data:
Date         Price
"2021-01-01" 1
"2021-01-02" NA
"2021-01-03" NA
"2021-01-04" NA
"2021-01-05" NA
"2021-01-06" 6
"2021-01-07" NA
"2021-01-08" NA
"2021-01-09" 3
And I would like to replace missing values with means, so that the end result would look like this:
Date         Price
"2021-01-01" 1
"2021-01-02" 2
"2021-01-03" 3
"2021-01-04" 4
"2021-01-05" 5
"2021-01-06" 6
"2021-01-07" 5
"2021-01-08" 4
"2021-01-09" 3
You can use zoo::na.approx:
library(zoo)
na.approx(dat$Price)
# [1] 1 2 3 4 5 6 5 4 3
One way would be to use na_interpolation from imputeTS library:
imputeTS::na_interpolation(c(1, NA, NA, 4))
# 1 2 3 4
imputeTS::na_interpolation(c(6, NA, NA, 3))
# 6 5 4 3
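Both answers above perform linear interpolation between the known values. If adding a package is not an option, the same result can be sketched in base R with `approx()`; the variable names below are illustrative.

```r
price <- c(1, NA, NA, NA, NA, 6, NA, NA, 3)
ok <- !is.na(price)

# Interpolate linearly between the known points and evaluate at every position
filled <- approx(x = which(ok), y = price[ok], xout = seq_along(price))$y
filled
# [1] 1 2 3 4 5 6 5 4 3
```

With the default `rule = 1`, leading or trailing NA values (outside the range of known points) stay NA.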
I assume you have multiple price columns and want to create a new column named Price that holds their row-wise mean, ignoring NA values.
library(dplyr)
Date <- c("2021-01-01","2021-01-02","2021-01-03","2021-01-04","2021-01-05",
"2021-01-06", "2021-01-07", "2021-01-08","2021-01-09", "2021-01-08","2021-01-09")
your.price.col1 <- c(floor(runif(9,0,100)),NA,NA)
your.price.col2 <- c(floor(runif(9,0,100)),33,44)
df <- data.frame(Date, your.price.col1,your.price.col2)
# select the price columns to include in the mean; [2:3] picks col1 and col2
df %>%
mutate(Price = rowMeans(df[2:3], na.rm = TRUE))
Date your.price.col1 your.price.col2 Price
1 2021-01-01 96 55 75.5
2 2021-01-02 22 43 32.5
3 2021-01-03 68 62 65.0
4 2021-01-04 18 51 34.5
5 2021-01-05 27 6 16.5
6 2021-01-06 26 30 28.0
7 2021-01-07 32 22 27.0
8 2021-01-08 53 95 74.0
9 2021-01-09 74 78 76.0
10 2021-01-08 NA 33 33.0
11 2021-01-09 NA 44 44.0
The challenge is a data.frame with one group variable (id) and two date variables (start and stop). The date intervals are irregular and I'm trying to calculate the uninterrupted interval in days starting from the first start date per group.
Example data:
data <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
"2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
"2010-12-08", "2011-03-09")),
stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
"2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
"2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would be a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap from row 6 to 7 and to take this point as the maximum interval (34 days). The interval 2018-10-01 to 2018-10-01 would be counted as 1.
My usual lubridate approaches don't work with this example (interval %within% lag(interval)).
Any idea?
library(magrittr)
library(data.table)
setDT(data)
first_int <- function(start, stop){
ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
list(start = min(start[ind]),
stop = max(stop[ind]))
}
newdata <-
data[, first_int(start, stop), by = id] %>%
.[, duration := stop - start + 1]
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days
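The function above compares each start only with the stop of the row directly before it. Below is a minimal base R sketch of a variant that compares against the latest stop seen so far, which may matter when an interval is nested inside an earlier, longer one. `first_run_days` is a hypothetical name, and rows are assumed to be sorted by start within each id.

```r
first_run_days <- function(start, stop) {
  s <- as.numeric(stop)
  # Latest stop among all earlier rows (-Inf for the first row)
  prev_max_stop <- c(-Inf, head(cummax(s), -1))
  gap <- as.numeric(start) - prev_max_stop > 0
  gap[1] <- FALSE             # the first row never counts as a gap
  keep <- cumsum(gap) == 0    # rows before the first gap
  as.numeric(max(stop[keep]) - min(start[keep])) + 1
}

d <- data.frame(
  id    = c(2, 2, 3, 3, 3, 3, 3),
  start = as.Date(c("2016-12-07", "2016-12-12", "2015-04-10", "2015-04-12",
                    "2015-04-14", "2015-05-15", "2015-07-14")),
  stop  = as.Date(c("2016-12-12", "2016-12-13", "2015-04-13", "2015-04-22",
                    "2015-05-13", "2015-07-13", "2015-07-15"))
)

res <- sapply(split(d, d$id), function(g) first_run_days(g$start, g$stop))
res
#  2  3
#  7 34
```

As in the answer above, a start that falls one day after the previous stop is treated as a gap; change `> 0` to `> 1` if adjacent days should count as uninterrupted.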
I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here is an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr and I have also been able to create a column with 0-1, where 0 indicates that assets has a valid entry and 1 highlights missing values of NA.
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
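To see why the `cumsum(is.na(assets)) == 0` filter works, it may help to run it on a single group. The short sketch below uses the values of product 5 from the example; the running count of NAs stays at 0 only until the first NA appears.

```r
assets <- c(23, NA, 14, NA)           # product 5 from the example

cumsum(is.na(assets))                 # 0 1 1 2: running count of NAs so far
keep <- cumsum(is.na(assets)) == 0    # TRUE FALSE FALSE FALSE
assets[keep]                          # 23
```

Because the count never decreases, every row from the first NA onwards is dropped, including later non-missing values such as 14.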
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
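# seq_len() returns an empty index when the first value of a group is NA,
# whereas seq(0) would give c(1, 0) and keep the first row
setDT(mydf)[, if(any(is.na(assets))) .SD[seq_len(which(is.na(assets))[1]-1)]
else .SD, by = productreference]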
You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference and for every subset we look for the first occurrence of assets==NA, and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2
I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers from these two previous posts but have not been able to address my specific problem in R; multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' after grouping by 'event_type'. Here, I am using data.table approach by converting the 'data.frame' to 'data.table' (setDT(df)), grouped by 'event_type', we get the diff of 'date'.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or as @Frank mentioned in the comments, we can also use shift (available from v1.9.5 onwards) to get the lag (by default, type='lag') of 'date' and subtract it from 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
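If preserving the original row order matters, one base R possibility is `ave()`, which applies a function per group and writes the results back in place. This is a sketch using the question's data; the dates are converted to numbers first so that the result is a plain numeric column of day counts.

```r
df <- data.frame(
  date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001",
                   "17/03/2001", "23/04/2001", "26/05/2001", "01/06/2001",
                   "30/06/2001", "02/07/2001", "15/07/2001", "21/12/2001"),
                 "%d/%m/%Y"),
  event_type = c(0, 4, 1, 2, 4, 1, 0, 2, 3, 3, 4, 3)
)

# Per event type: NA for the first occurrence, then days since the previous one
df$days_since_last_event <- ave(as.numeric(df$date), df$event_type,
                                FUN = function(d) c(NA, diff(d)))
df
```

Like the other answers, this assumes the rows are already sorted by date.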
Above, @akrun has posted the data.table approach; the parallel dplyr approach is straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days
Suppose I have the following data
set.seed(123)
Company <- c(rep("Company 1",5),rep("Company 2",10))
Dates <- c(seq(as.Date("2014-09-01"), as.Date("2015-01-01"), by="months"),
seq(as.Date("2011-09-01"), as.Date("2012-06-01"), by="months"))
X.1 <- sample(c(0,0,5,10,15,20,25,30,40,50),size=15,replace=TRUE)
X.2 <- sample(c(0,0,5,10,15,20,25,30,40,50),size=15,replace=TRUE)
df <- data.frame(Dates,Company,X.1,X.2)
Dates Company X.1 X.2
1 2014-09-01 Company 1 50 0
2 2014-10-01 Company 1 50 5
3 2014-11-01 Company 1 25 15
4 2014-12-01 Company 1 30 5
5 2015-01-01 Company 1 0 40
6 2011-09-01 Company 2 15 0
7 2011-10-01 Company 2 30 15
8 2011-11-01 Company 2 5 30
9 2011-12-01 Company 2 10 0
10 2012-01-01 Company 2 5 20
11 2012-02-01 Company 2 0 5
12 2012-03-01 Company 2 15 0
13 2012-04-01 Company 2 15 30
14 2012-05-01 Company 2 10 40
15 2012-06-01 Company 2 0 10
What I am trying to do is find monthly growth rates for variables X.1 and X.2
within each company and bind those columns to the right side of the dataframe. The problem here is that the date ranges for each Company are different, which is why I am having trouble with this. Also, since I have 0s in the data Inf and NAs are okay.
Thanks for your help.
# I computed the growth using logs: growth = log(X(t)/X(t-1)). If you prefer (X(t)-X(t-1))/X(t-1), use the second method below. In both cases the first period of each company is NA.
# Assumption: the data are equally spaced for each company. With logs you get Inf if the previous period's value is 0 and -Inf if the current period's value is 0 (NaN if both are 0). Without logs, a current value of 0 gives a growth of -1 and a previous value of 0 gives Inf (NaN if both are 0).
library(dplyr)
df%>%
group_by(Company)%>%
mutate(gx_1=log(X.1/lag(X.1,1)),gx_2=log(X.2/lag(X.2,1))
)
Source: local data frame [15 x 6]
Groups: Company
Dates Company X.1 X.2 gx_1 gx_2
1 2014-09-01 Company 1 5 40 NA NA
2 2014-10-01 Company 1 30 5 1.7917595 -2.0794415
3 2014-11-01 Company 1 15 0 -0.6931472 -Inf
4 2014-12-01 Company 1 40 10 0.9808293 Inf
5 2015-01-01 Company 1 50 50 0.2231436 1.6094379
6 2011-09-01 Company 2 0 40 NA NA
7 2011-10-01 Company 2 20 25 Inf -0.4700036
8 2011-11-01 Company 2 40 25 0.6931472 0.0000000
9 2011-12-01 Company 2 20 50 -0.6931472 0.6931472
10 2012-01-01 Company 2 15 25 -0.2876821 -0.6931472
11 2012-02-01 Company 2 50 30 1.2039728 0.1823216
12 2012-03-01 Company 2 15 20 -1.2039728 -0.4054651
13 2012-04-01 Company 2 25 20 0.5108256 0.0000000
14 2012-05-01 Company 2 20 5 -0.2231436 -1.3862944
15 2012-06-01 Company 2 0 0 -Inf -Inf
# Without using log, i.e. (X(t)-X(t-1))/X(t-1):
df%>%
group_by(Company)%>%
mutate(gx_1=((X.1-lag(X.1,1))/lag(X.1,1)),gx_2=((X.2-lag(X.2,1))/lag(X.2,1)))
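Since the data contain zeros, it can be useful to check how each formula behaves around them before applying it per company. The sketch below uses two hypothetical helpers, `growth_simple` and `growth_log`, on a made-up vector; they mirror the two `mutate` calls above but are not part of the original answer.

```r
# Hypothetical helpers mirroring the two formulas above
growth_simple <- function(x) c(NA, diff(x) / head(x, -1))
growth_log    <- function(x) c(NA, diff(log(x)))

x <- c(5, 30, 15, 0, 0, 10)   # made-up series containing zeros

gs <- growth_simple(x)   # NA  5.0 -0.5 -1.0  NaN  Inf
gl <- growth_log(x)      # NA  log(6)  log(0.5)  -Inf  NaN  Inf

# Per-company application in base R, keeping the original row order
d <- data.frame(Company = c("A", "A", "A", "B", "B"),
                X.1     = c(10, 20, 10, 0, 5))
d$gx_1 <- ave(d$X.1, d$Company, FUN = growth_simple)   # NA 1 -0.5 NA Inf
```

A drop to zero shows up as -1 in the simple formula and as -Inf in the log formula, while a zero followed by another zero gives NaN in both.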