I have a data frame in R in the following format:
(The dates are in the wrong format but I can change them fairly easily).
How can I perform operations on the data frame by date? For example, I want to find the average price for 5/18/2012, then for 5/19/2012, and similarly for 5/20/2012. Thanks in advance.
EDIT: One idea I did have was to use the identical(x,y) function to compare two dates, however since it is a very large data frame (about 300,000 rows) I'd prefer a more efficient way :)
You can group by date and take the mean, something like this:
library(dplyr)
data %>% group_by(RecordDate) %>% summarise(av = mean(Price))
You can use aggregate.
x <- Sys.time()
y <- seq(from = x, to = x + 5 * 3600*24, by = "day")
xy <- data.frame(date = rep(y, each = 5),
value = rnorm(length(y) * 5))  # one random value per row instead of relying on recycling
aggregate(value ~ date, data = xy, FUN = mean)
date value
1 2017-01-28 10:07:29 0.2921081
2 2017-01-29 10:07:29 0.9039815
3 2017-01-30 10:07:29 0.5616696
4 2017-01-31 10:07:29 0.9297463
5 2017-02-01 10:07:29 0.5149972
6 2017-02-02 10:07:29 0.4353255
> aggregate(value ~ date, data = xy, FUN = length)
date value
1 2017-01-28 10:07:29 5
2 2017-01-29 10:07:29 5
3 2017-01-30 10:07:29 5
4 2017-01-31 10:07:29 5
5 2017-02-01 10:07:29 5
6 2017-02-02 10:07:29 5
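If the dates are stored as text such as 5/18/2012, possibly with a time attached, grouping on the raw column gives one group per distinct timestamp (as in the output above) rather than one per calendar day. A minimal base R sketch, assuming the column names RecordDate and Price from the dplyr answer and made-up prices, converts the text to Date first and then aggregates:

```r
# Made-up sample in the question's month/day/year style (values are illustrative).
data <- data.frame(
  RecordDate = c("5/18/2012 10:15", "5/18/2012 16:40", "5/19/2012 09:05",
                 "5/20/2012 11:30", "5/20/2012 12:00"),
  Price = c(10, 20, 30, 40, 60)
)

# as.Date() parses only what the format asks for, so the time part is dropped.
data$Day <- as.Date(data$RecordDate, format = "%m/%d/%Y")

# One row per calendar day with the mean price for that day.
daily <- aggregate(Price ~ Day, data = data, FUN = mean)
print(daily)
```

Grouping on a Date column this way is a single vectorized pass, so there should be no need to compare dates pairwise with identical().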
I'm trying to remove redundant data rows from a gigantic dataset. For the same individual, at the same location, on the same day, I want to keep just one detection for every 10 minutes time range. So for example if individual 1 stays around station 6 for 20 minutes, instead of 200 or so detections I just want to keep 2, one for every 10 minutes he's there. Here is an example dataset:
datetime<-c("2020-12-30 23:03:24","2020-12-30 23:04:25","2020-12-28 23:06:20", "2020-12-26 12:02:10","2020-12-26 12:07:26","2020-12-26 12:10:07", "2018-05-11 05:02:05","2018-05-11 05:03:07", "2018-05-11 05:13:25", "2018-05-11 05:14:27")
dt<-as.POSIXct(datetime, format="%Y-%m-%d %H:%M:%S")
i<-c('ind1','ind1','ind1', 'ind2', 'ind2', 'ind2', 'ind1', 'ind1', 'ind1', 'ind4')
l<-c('station1', 'station1', 'station1','station2','station2','station3','station1','station1','station2','station6')
stack<-data.frame(dt, i,l)
The expected resulting dataframe:
dt i l
1 2020-12-30 23:03:24 ind1 station1
2 2020-12-30 23:04:25 ind1 station1
3 2020-12-28 23:06:20 ind1 station1
4 2020-12-26 12:02:10 ind2 station2
5 2020-12-26 12:07:26 ind2 station2
6 2020-12-26 12:10:07 ind2 station3
7 2018-05-11 05:02:05 ind1 station1
8 2018-05-11 05:03:07 ind1 station1
9 2018-05-11 05:13:25 ind1 station2
10 2018-05-11 05:14:27 ind4 station6
Here is what I have tried to code so far:
#Separate date and time
stack <- tidyr::separate(stack, dt, c("date", "time"), sep = " ")
#Merge columns location (l), individual (i) and date (date)
data_set_merged <- stack%>%
unite("Merged_sample", c("i", "l", "date"), remove=FALSE)
#Order dataset chronologically
data_set_merged %>% arrange(ymd(data_set_merged$date))
#Count number of minutes between every group of detection
data_set_merged$time<-as.POSIXct(as.character(data_set_merged$time), format="%H:%M:%S")
value <-diff(data_set_merged$time)
#Add NA value at the end, since no difference between last value and nothing
Adding_NA_value <- append(value , "NA")
New_data_frame_with_column<- data_set_merged %>%
dplyr::mutate (Time_intervall_seconds = Adding_NA_value)
#Group_by sample, site, year and day and select the observations at least 10 min (= 600 seconds or more) apart
Final_data_frame <- New_data_frame_with_column %>%
group_by(Merged_sample)%>%
filter (Time_intervall_seconds>= 600)
This code deletes more than just the redundant data: if two rows are less than 600 seconds apart, it deletes both instead of just one, so I'm losing information. I also don't know whether it calculates the time difference within each group only. I don't want to delete detections across individuals, or across locations for the same individual; I only want this for the same location + individual + date.
I tried ordering it chronologically to get a time interval that makes sense, but then the interval is not calculated by group anymore.
I'm still quite new at R and I don't know where to go from here. Any help would be sososo welcome. Thanks!
This will block the data into 10 minute intervals
library(slider)
block(stack, stack$dt, period = "minute", every = 10)
Odd as it sounds, the way to do this seems to be to block the intervals, then convert the result back to a data frame that retains the blocking in a column. Then you can simply group by (i, l, bucket) and sample one row from each group. You probably want to set the origin option in the block call so you know where the cuts land; I just let them fall where they may.
f <- function(x) {t <- block(x, x$dt, period = "minute", every = 10) %>% tibble()}
fr <- f(stack) %>% mutate(bucket=row_number()) %>%
unnest()
fr %>% group_by(i, l, bucket) %>% group_modify( ~ sample_n(.x, 1))
# A tibble: 7 x 4
# Groups: i, l, bucket [7]
i l bucket dt
<chr> <chr> <int> <dttm>
1 ind1 station1 1 2018-05-11 05:03:07
2 ind1 station1 5 2020-12-28 23:06:20
3 ind1 station1 6 2020-12-30 23:04:25
4 ind1 station2 2 2018-05-11 05:13:25
5 ind2 station2 3 2020-12-26 12:07:26
6 ind2 station3 4 2020-12-26 12:10:07
7 ind4 station6 2 2018-05-11 05:14:27
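For comparison, the same fixed 10-minute bucketing can be sketched in base R without slider. This is only a sketch under assumptions: the timestamps are parsed as UTC (the question does not state a time zone), buckets are aligned to the clock (23:00, 23:10, ...) rather than to each individual's first detection, and the earliest detection in each bucket is kept instead of a random one. The names bucket and thinned are illustrative.

```r
# Example data from the question (with the missing quote in `l` restored).
datetime <- c("2020-12-30 23:03:24", "2020-12-30 23:04:25", "2020-12-28 23:06:20",
              "2020-12-26 12:02:10", "2020-12-26 12:07:26", "2020-12-26 12:10:07",
              "2018-05-11 05:02:05", "2018-05-11 05:03:07", "2018-05-11 05:13:25",
              "2018-05-11 05:14:27")
stack <- data.frame(
  dt = as.POSIXct(datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),  # tz is an assumption
  i = c("ind1", "ind1", "ind1", "ind2", "ind2", "ind2", "ind1", "ind1", "ind1", "ind4"),
  l = c("station1", "station1", "station1", "station2", "station2", "station3",
        "station1", "station1", "station2", "station6")
)

# Sort chronologically so the first row seen in each bucket is the earliest one.
stack <- stack[order(stack$dt), ]

# Integer-divide seconds since the epoch by 600 to get a 10-minute bucket id.
stack$bucket <- as.numeric(stack$dt) %/% 600

# Keep the first detection per individual, location, and bucket.
thinned <- stack[!duplicated(stack[c("i", "l", "bucket")]), ]
print(thinned)
```

Because the bucket id already encodes the date, there is no need to group on the date separately.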
I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series observing the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one per month, 48 in total) represent the time to maturity of each loan as of that month.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
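If only whole calendar months matter, one alternative to dividing days by 30.5 is to count month boundaries directly. A small base R sketch follows; months_between is a hypothetical helper, not part of lubridate or of the answer above. It ignores the day of the month, so its results can differ by one from the table above.

```r
# Hypothetical helper: number of calendar-month boundaries between two dates.
# The day of the month is ignored on purpose.
months_between <- function(from, to) {
  from <- as.POSIXlt(as.Date(from))
  to <- as.POSIXlt(as.Date(to))
  12 * (to$year - from$year) + (to$mon - from$mon)
}

# The 48 observation months used in the answer above.
months <- seq(as.Date("2013-02-01"), by = "month", length.out = 48)

# Months remaining for loan A_333 (maturity 2015-02-13); NA once it has matured.
remaining <- months_between(months, "2015-02-13")
remaining[remaining < 0] <- NA
print(head(remaining))
```

Which convention is right (calendar months, 30.5-day months, or something that accounts for the day of the month) depends on how time to maturity is defined for the loans, which the question does not say.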
I am trying to use dplyr in R to calculate rolling stats (mean, sd, etc) based on a dynamic window based on dates and for specific models. For instance, within groupings of items, I would like to calculate the rolling mean for all data 10 days prior. The dates on the data are not sequential and not complete so I can't use a fixed window.
One way to do this is use rollapply referencing the window width as shown below. However, I'm having trouble calculating the dynamic width. I'd prefer a method that omits the intermediate step of calculating the window and simply calculate based on the date_lookback. Here's a toy example.
I've used for loops to do this, but they are very slow.
library(dplyr)
library(zoo)
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE) %>%
mutate(.,
cut_date = date - date_lookback, #calcs based on sample since this date
dyn_win = c(1,1,2,3,3,1,2,3,4,5), ##!! need to calculate this vector??
roll_mean = rollapply(data, align = "right", width = dyn_win, mean),
roll_sd = rollapply(data, align = "right", width = dyn_win, sd))
These are the roll_mean and roll_sd results I'm looking for:
> df
label date data cut_date dyn_win roll_mean roll_sd
1 a 2017-01-02 790 2016-12-23 1 790.0000 NA
2 a 2017-01-20 493 2017-01-10 1 493.0000 NA
3 a 2017-01-21 718 2017-01-11 2 605.5000 159.0990
4 a 2017-01-30 483 2017-01-20 3 564.6667 132.8847
5 a 2017-01-31 825 2017-01-21 3 675.3333 174.9467
6 b 2017-01-05 186 2016-12-26 1 186.0000 NA
7 b 2017-01-08 599 2016-12-29 2 392.5000 292.0351
8 b 2017-01-09 408 2016-12-30 3 397.6667 206.6938
9 b 2017-01-10 108 2016-12-31 4 325.2500 222.3921
10 b 2017-01-11 666 2017-01-01 5 393.4000 245.5928
Thanks in advance.
You could try explicitly referencing your dataset inside the dplyr call:
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE)
df %>%
group_by(date,label) %>%
mutate(.,
roll_mean = mean(ifelse(df$date >= date-date_lookback & df$date <= date & df$label == label,
df$data,NA),na.rm=TRUE),
roll_sd = sd(ifelse(df$date >= date-date_lookback & df$date <= date & df$label == label,
df$data,NA),na.rm=TRUE))
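The same date-window idea can also be sketched in base R, which makes the window test explicit. This assumes the toy data from the question; rolling_by_date is a hypothetical helper name. It compares every row with every other row in its group, so it is quadratic in group size and may be slow for large groups.

```r
date_lookback <- 10  # days to look back, as in the question

df <- data.frame(
  label = c(rep("a", 5), rep("b", 5)),
  date = as.Date(c("2017-01-02", "2017-01-20", "2017-01-21", "2017-01-30", "2017-01-31",
                   "2017-01-05", "2017-01-08", "2017-01-09", "2017-01-10", "2017-01-11")),
  data = c(790, 493, 718, 483, 825, 186, 599, 408, 108, 666),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply FUN to the values whose date lies in
# [date - lookback, date] for each row of one group.
rolling_by_date <- function(dates, values, lookback, FUN) {
  vapply(seq_along(dates), function(k) {
    in_window <- dates >= dates[k] - lookback & dates <= dates[k]
    FUN(values[in_window])
  }, numeric(1))
}

# Fill the result columns one label at a time so windows never cross groups.
df$roll_mean <- NA_real_
df$roll_sd <- NA_real_
for (lab in unique(df$label)) {
  rows <- df$label == lab
  df$roll_mean[rows] <- rolling_by_date(df$date[rows], df$data[rows], date_lookback, mean)
  df$roll_sd[rows] <- rolling_by_date(df$date[rows], df$data[rows], date_lookback, sd)
}
print(df)
```

No dyn_win vector is needed, since each window is derived from the dates themselves.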
I have a df like the following, covering 30 years up to 2015. I want to cut every month into three parts (days 1-10, 11-20, and 21-31) and average the values in each part (ten or fewer data points). Thus, each month yields three values. How can I do it?
1993-01-29 28.92189
1993-02-01 29.12760
1993-02-02 29.18927
1993-02-03 29.49786
1993-02-04 29.62128
1993-02-05 29.60068
1993-02-08 29.60068
1993-02-09 29.39498
------
------
2015-08-18 209.92999
2015-08-19 208.28000
2015-08-20 204.01000
2015-08-21 197.63001
2015-08-24 189.55000
2015-08-25 187.23000
2015-08-26 194.67999
2015-08-27 199.16000
2015-08-28 199.24000
The tryCatch calls are there to work around the start-date problem: the first month of the data is incomplete. I will provide more info when I have time.
library(xts)
dates<-seq(as.Date("1993-01-29"),as.Date("2015-08-25"),"days")
sample<-rnorm(length(dates))
tmpxts<-split.xts(xts(x = sample,order.by = dates),f = "months")
mxts<-lapply(tmpxts,function(x) {
tmp<-data.frame(val=tryCatch(c(mean(x[1:10]),mean(x[11:20]),mean(x[21:length(x)])),
error=function(e) matrix(mean(x),1)))
row.names(tmp)<-tryCatch(index(x[c(1,11,21)]),error=function(e) index(x[1]))
tmp
})
do.call(rbind,mxts)
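This is a base solution that builds the breaks from an increasing sequence that cycles through years, months, and cut points at the 1st, 11th, and 21st of each month. The call below passes right=TRUE, which makes each interval closed on the right, so a date that falls exactly on a break (the 1st, 11th, or 21st) is counted in the preceding interval; in the output below, the 1993-01-21 value is the mean of the 1993-01-29 and 1993-02-01 observations. Pass right=FALSE (the cut.Date default) instead if days 1-10, 11-20, and 21-31 should stay together as the question asks. The call as originally run: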
tapply(dat$V2, cut.Date(dat$V1,
breaks=as.Date(
apply( expand.grid( c(1,11,21), 1:12, 1993:2015), 1,
function( x) paste(rev(x), collapse="-")) ), right=TRUE), FUN=mean)
1993-01-01 1993-01-11 1993-01-21 1993-02-01 1993-02-11 1993-02-21 1993-03-01
NA NA 29.02475 29.48412 NA NA NA
snipped many empty intervals
And the bottom of results included:
2015-07-21 2015-08-01 2015-08-11 2015-08-21 2015-09-01 2015-09-11 2015-09-21
NA NA 204.96250 193.97200 NA NA NA
2015-10-01 2015-10-11 2015-10-21 2015-11-01 2015-11-11 2015-11-21 2015-12-01
NA NA NA NA NA NA NA
2015-12-11
NA
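An alternative sketch that avoids building breaks for every year and month (and so avoids the empty intervals) is to derive the group from the day of the month. It assumes the same dat layout as above, with dates in V1 and values in V2; the rows are a few taken from the question, and the names part and key are illustrative.

```r
# A few rows copied from the question's data.
dat <- data.frame(
  V1 = as.Date(c("1993-01-29", "1993-02-01", "1993-02-02", "1993-02-09",
                 "2015-08-20", "2015-08-21", "2015-08-24")),
  V2 = c(28.92189, 29.12760, 29.18927, 29.39498, 204.01000, 197.63001, 189.55000)
)

# Day of month -> part of month: 1 = days 1-10, 2 = days 11-20, 3 = days 21-31.
day <- as.integer(format(dat$V1, "%d"))
part <- findInterval(day, c(1, 11, 21))

# Group label such as "1993-02_1", then average within each label.
key <- paste(format(dat$V1, "%Y-%m"), part, sep = "_")
means <- tapply(dat$V2, key, mean)
print(means)
```

Only month parts that actually contain observations appear in the result, and the labels sort chronologically because the year and month come first.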
The code below cuts each month separately into thirds, based on the number of days in each month.
library(dplyr)
library(lubridate)
library(ggplot2)
# Fake data
df = data.frame(date=seq.Date(as.Date("2013-01-01"),
as.Date("2013-03-31"), by="day"))
set.seed(394)
df$value = rnorm(nrow(df), sqrt(1:nrow(df)), 2)
# Cut months into thirds
df = df %>%
# Create a new column to group by Year-Month
mutate(yr_mon = paste0(year(date) , "_", month(date, label=TRUE, abbr=TRUE))) %>%
group_by(yr_mon) %>%
# Cut each month into thirds
mutate(cutMonth = cut(day(date),
breaks=c(0, round(1/3*n()), round(2/3*n()), n()),
labels=c("1st third","2nd third","3rd third")),
# Add yr_mon to cutMonth so that we have a unique group label for
# each third of each month
cutMonth = paste0(yr_mon, "\n", cutMonth)) %>%
ungroup() %>%
# Turn cutMonth into a factor with correct date ordering
mutate(cutMonth = factor(cutMonth, levels=unique(cutMonth)))
And here is the result:
# Show number of observations in each group
as.data.frame(table(df$cutMonth))
Var1 Freq
1 2013_Jan\n1st third 10
2 2013_Jan\n2nd third 11
3 2013_Jan\n3rd third 10
4 2013_Feb\n1st third 9
5 2013_Feb\n2nd third 10
6 2013_Feb\n3rd third 9
7 2013_Mar\n1st third 10
8 2013_Mar\n2nd third 11
9 2013_Mar\n3rd third 10
# Plot means by group (just to visualize the result of the date grouping operations)
ggplot(df, aes(cutMonth, value)) +
stat_summary(fun=mean, geom='point', size=4, colour="red") +
coord_cartesian(ylim=c(-0.2,10.2)) +
theme(axis.text.x = element_text(size=14))
I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.
I have the following code
for (i in 1:nrow(leases))
{
occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
occupied = occupied[occupied < dateOfInt]
matching = match(occupied,allDays$Date)
allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
Date of interest is just the current date and is used simply to ensure that it doesn't count lease dates in the future.
Using seq is almost surely inefficient: imagine you had a lease in your data that's 10000 years long. seq will take forever and return roughly 10000*365 days, most of which don't matter to us. We then have to use match, which also makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.
Data
set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
leases<-data.frame(StartDate=sample(days$AllDays,5000L,replace=TRUE))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[,lease_count:=
sapply(AllDays,function(x)
leases[StartDate<=x&EndDate>=x,.N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
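One possible fully vectorized formulation, offered as a sketch rather than a benchmarked solution: the number of leases in effect on a day equals the number that have started on or before that day minus the number that ended before it. Both counts can be read off sorted vectors with base R's findInterval. The helper name count_active is hypothetical, the sketch assumes every lease has EndDate >= StartDate, and the tiny two-lease example is made up for illustration.

```r
# Hypothetical helper: leases in effect on each day, with both endpoints inclusive.
count_active <- function(days, start, end) {
  days <- as.numeric(days)
  # Number of leases with start <= day.
  started <- findInterval(days, sort(as.numeric(start)))
  # Number of leases with end < day (left.open = TRUE makes the comparison strict).
  ended <- findInterval(days, sort(as.numeric(end)), left.open = TRUE)
  started - ended
}

# Small made-up example: two overlapping leases in January 2012.
days <- seq(as.Date("2012-01-01"), as.Date("2012-02-01"), by = "day")
start <- as.Date(c("2012-01-01", "2012-01-08"))
end <- as.Date(c("2012-01-10", "2012-01-21"))

lease_count <- count_active(days, start, end)
print(data.frame(AllDays = days, lease_count = lease_count))
```

The cost is dominated by sorting the start and end dates once, so it does not grow with the length of the leases the way expanding each lease with seq does.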
This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).
Based on @MichaelChirico's data.
setkey(days[, AllDays1:=AllDays,], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
# user system elapsed
# 0.114 0.018 0.136
# @MichaelChirico's approach
# user system elapsed
# 0.909 0.000 0.907
Here is a brief explanation of how it works by @Arun, which got me started with data.table.
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates,Lease$id)) %>% # create combinations of dates and ranges
select(dates=Var1, id=Var2) %>%
inner_join(Lease, by="id") %>% # join information
rowwise %>%
do(data.frame(dates=.$dates,
flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>% # create ranges and check if the date is in there
ungroup %>%
group_by(dates) %>%
summarise(N=sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
Try the lubridate package. Create an interval for each lease. Then count the lease intervals which each date falls in.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")  # new_interval() was replaced by interval() in later lubridate releases
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1