Create Time series observations,timestamps and filling up the values - r

I have a cross section data as following:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series that records the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan_data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one per month, 48 in total) represent the time to maturity of each loan as of that month.
Would really appreciate your help. Thanks

Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
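Since the comment in the code above notes that dividing days by 30.5 is only approximate, here is a minimal base R sketch of one alternative: counting calendar-month boundaries instead of days. The helper name months_between is hypothetical, and it assumes that the day of the month should be ignored. Under that convention the result can differ by one from the rounded values shown above.

```r
# Hypothetical helper: number of calendar-month boundaries between
# `from` and `to`, ignoring the day of the month.
months_between <- function(from, to) {
  from <- as.POSIXlt(from)
  to <- as.POSIXlt(to)
  (to$year - from$year) * 12 + (to$mon - from$mon)
}

snapshot <- as.Date("2013-02-01")
maturity <- as.Date(c("2017-01-03", "2013-01-08", "2015-02-13"))

remaining <- months_between(snapshot, maturity)
# Loans that have already matured get NA, as in the answer above
remaining[remaining < 0] <- NA
remaining
```

Which convention is right depends on how "months to maturity" is defined for your loans, so treat this as a starting point rather than a drop-in replacement.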

Related

How to subtract a column of date values by sys.Date() using mutate - tidyverse/dplyr? R

I have this dataframe I am working with.
data <- data.frame(id = c(123,124,125,126,127,128,129,130),
date = c("10/7/2021","10/6/2021","9/13/2021","10/18/2021","8/12/2021","9/6/2021","10/29/2021","9/6/2021"))
My goal is create a new column that tells me how many days have passed since that recorded date for each row. I'm trying to use this code but I keep getting NA days in my new column.
data %>%
select(id,date) %>%
mutate("days_since" = as.Date(Sys.Date()) - as.Date(date,format="%Y-%m-%d"))
id date days_since
1 123 10/7/2021 NA days
2 124 10/6/2021 NA days
3 125 9/13/2021 NA days
4 126 10/18/2021 NA days
5 127 8/12/2021 NA days
6 128 9/6/2021 NA days
7 129 10/29/2021 NA days
8 130 9/6/2021 NA days
What am I doing wrong? Thank you for any feedback.
We can use the lubridate package. It makes type conversion and operations with dates much easier.
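In your code, the as.Date(date, format="%Y-%m-%d") step was the problem: the format string describes year-month-day with dashes, but your strings are month/day/year with slashes, so every value parses to NA. mdy() handles that layout directly.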
library(dplyr)
library(lubridate)
data %>% mutate("days_since" = Sys.Date() - mdy(date))
id date days_since
1 123 10/7/2021 28
2 124 10/6/2021 29
3 125 9/13/2021 22
4 126 10/18/2021 17
5 127 8/12/2021 23
6 128 9/6/2021 29
7 129 10/29/2021 6
8 130 9/6/2021 29
Thanks, @Karthik S, for the simplification.
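For comparison, the same fix can be sketched in base R by giving as.Date a format string that matches the month/day/year layout. This is a small illustration on two of the dates from the question; the fixed reference date stands in for Sys.Date() and is an assumption made only so the result is reproducible.

```r
dates <- c("10/7/2021", "9/13/2021")

# The format from the question does not match the strings, so this yields NA
wrong <- as.Date(dates, format = "%Y-%m-%d")

# A format that matches month/day/year parses correctly
parsed <- as.Date(dates, format = "%m/%d/%Y")

# Assumed reference date, used here in place of Sys.Date()
ref <- as.Date("2021-11-04")
days_since <- as.numeric(ref - parsed)
days_since
```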
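This is also easily done in base R with a simple "-", which gives back the difference in days. Note that the dates in this example are already in ISO format (yyyy-mm-dd), so as.Date() parses them without a format argument: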
data <- data.frame(id = c(123,124,125,126,127,128,129,130),
date = c("2021-10-10","2021-10-06","2021-09-13","2021-10-18","2021-08-12","2021-09-06","2021-10-29","2021-09-06"))
data$date <- as.Date(data$date)
data$sys_date <- Sys.Date()
data$sysDate_to_date <- data$sys_date -data$date

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start month to the end month (determined with floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can occur), you can end each sequence one month earlier. Subtracting 1 from the floored completed date gives the last day of the previous month, so the monthly sequence stops before the completed month. Wrapping this in pmax ensures that if the start and end months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
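The answer above counts every month an event touches. If "pending" instead means open at a given snapshot date, one possible reading of the expected result in the question, a minimal base R sketch could look like the following. The column names match the question; the three events and the four snapshot dates are a small hand-picked subset, and the rule "started on or before the snapshot and not yet completed" is an assumption.

```r
events <- data.frame(
  start_date     = as.Date(c("2011-01-14", "2011-01-21", "2011-03-22")),
  completed_date = as.Date(c("2013-01-07", "2011-03-03", "2013-07-20"))
)

# First day of each month to take a snapshot on
snapshots <- seq(as.Date("2011-01-01"), as.Date("2011-04-01"), by = "month")

# For each snapshot, count events already started and not yet completed
pending <- vapply(seq_along(snapshots), function(i) {
  sum(events$start_date <= snapshots[i] &
        events$completed_date > snapshots[i])
}, integer(1))

data.frame(month = snapshots, count = pending)
```

Whether the comparisons should be strict or inclusive depends on how you treat events that start or finish exactly on a snapshot date.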

Finding each time of daily max variable in climate data

I have a large dataset spanning many years with several variables, but the ones I am interested in are wind speed and dateTime. I want to find the time of the max wind speed for every day in the data set. I have hourly data in POSIXct format, with WS as a numeric with occasional NAs. Below is a short data set that should illustrate my point; the dateTime sequence didn't come out as hourly data, but it is enough for a sample.
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,length(dateTime),rep=TRUE)
WD <- sample(0:390,length(dateTime),rep=TRUE)
Temp <- sample(0:40,length(dateTime),rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I have previously tried creating a new column with just a posix date (minus time) to allow for day isolation, however all the things I have tried have only returned a shortened data frame with date and WS (aggregate, splitting, xts). Aggregate was only one that didn't do this, however, it gave me 23:00:00 as a constant time which isn't correct.
I have looked at How to calculate daily means, medians, from weather variables data collected hourly in R?, https://stats.stackexchange.com/questions/7268/how-to-aggregate-by-minute-data-for-a-week-into-hourly-means and others but none have answered this question, or the solutions have not returned an ideal result.
I need to compare the results of this analysis with another data frame, so hence the reason I need the actual time when the max wind speed occurred for each day in the dataset. I have a feeling there is a simple solution, however, this has me frustrated.
A dplyr solution may be:
library(dplyr)
df %>%
mutate(date = as.Date(dateTime)) %>%
left_join(
df %>%
mutate(date = as.Date(dateTime)) %>%
group_by(date) %>%
summarise(max_ws = max(WS, na.rm = TRUE)) %>%
ungroup(),
by = "date"
) %>%
select(-date)
# dateTime WS WD Temp max_ws
# 1 2011-01-01 00:00:00 NA 313 2 15
# 2 2011-01-01 00:24:00 7 376 1 15
# 3 2011-01-01 00:48:00 3 28 28 15
# 4 2011-01-01 01:12:00 15 262 24 15
# 5 2011-01-01 01:36:00 1 149 34 15
# 6 2011-01-01 02:00:00 4 319 33 15
# 7 2011-01-01 02:24:00 15 280 22 15
# 8 2011-01-01 02:48:00 NA 110 23 15
# 9 2011-01-01 03:12:00 12 93 15 15
# 10 2011-01-01 03:36:00 3 5 0 15
Dee asked for: "I want to find the time of the max wind speed for every day in the data set." Other answers have calculated the max(WS) for every day, but not the hour at which it occurred.
So I propose the following solution with dplyr:
library(dplyr)
library(lubridate) # for hour()
set.seed(12345)
dateTime <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
as.POSIXct("2011-01-29 23:00:00", tz = "GMT"),
by = 60*24)
WS <- sample(0:20,1738,rep=TRUE)
WD <- sample(0:390,1738,rep=TRUE)
Temp <- sample(0:40,1738,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
df %>%
group_by(Date = as.Date(dateTime)) %>%
mutate(Hour = hour(dateTime),
Hour_with_max_ws = Hour[which.max(WS)])
Note that if several hours share the same maximum wind speed (in the example above: 15), which.max returns only the first of them, although a wind speed of 15 was reached on that date at hours 0, 3, 4, 21 and 22. So you might need more specific logic.
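If all tied timestamps are wanted rather than just the first, one option is to keep every row whose wind speed equals the daily maximum. The following is a minimal sketch on a tiny hand-made data set; the object names obs and daily_max are hypothetical, and the column names follow the question.

```r
library(dplyr)

obs <- data.frame(
  dateTime = as.POSIXct(c("2011-01-01 00:00:00", "2011-01-01 06:00:00",
                          "2011-01-01 12:00:00", "2011-01-02 03:00:00",
                          "2011-01-02 09:00:00"), tz = "GMT"),
  WS = c(5, 15, 15, NA, 7)
)

daily_max <- obs %>%
  group_by(date = as.Date(dateTime)) %>%
  # Keep all rows that reach the daily maximum, dropping missing values
  filter(!is.na(WS), WS == max(WS, na.rm = TRUE)) %>%
  ungroup()

daily_max
```

Here the first day keeps two rows because 15 occurs twice, while the second day keeps the single non-missing observation.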
For the sake of completeness (and because I like the concise code) here is a "one-liner" using data.table:
library(data.table)
setDT(df)[, max.ws := max(WS, na.rm = TRUE), by = as.IDate(dateTime)][]
dateTime WS WD Temp max.ws
1: 2011-01-01 00:00:00 NA 293 22 15
2: 2011-01-01 00:24:00 15 55 14 15
3: 2011-01-01 00:48:00 NA 186 24 15
4: 2011-01-01 01:12:00 4 300 22 15
5: 2011-01-01 01:36:00 0 120 36 15
---
1734: 2011-01-29 21:12:00 12 249 5 15
1735: 2011-01-29 21:36:00 9 282 21 15
1736: 2011-01-29 22:00:00 12 238 6 15
1737: 2011-01-29 22:24:00 10 127 21 15
1738: 2011-01-29 22:48:00 13 297 0 15

Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
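The same idea, the share of tweets per day that match a pattern, can also be sketched in base R with grepl and tapply. The four tweets below are made up for illustration (two are adapted from the question), and the column names when and what are assumptions based on the question's data.

```r
tweets <- data.frame(
  when = as.Date(c("2016-02-20", "2016-02-20", "2016-03-01", "2016-03-01")),
  what = c("Great goal by #lfc", "Can't wait for the weekend",
           "Generic boring tweet", "#lfc again")
)

# TRUE where the tweet contains the hashtag (fixed = TRUE: no regex)
has_tag <- grepl("#lfc", tweets$what, fixed = TRUE)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the normalised frequency per day
share <- tapply(has_tag, format(tweets$when), mean)
share
```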
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
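A short base R illustration of these underlying values, and of why proper dates avoid the "2.13 sorts before 2.7" problem described in the question (the specific dates are chosen only for the example):

```r
d <- as.Date("1970-01-02")
p <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")

as.numeric(d)  # days since 1970-01-01
as.numeric(p)  # seconds since 1970-01-01 00:00:00 UTC

# Dates compare in calendar order
feb07 <- as.Date("2016-02-07")
feb13 <- as.Date("2016-02-13")
feb07 < feb13

# The month.day decimals from the question do not
2.7 < 2.13
```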

How to cut yearly time-based data into 36 parts with R?

I have a df like the following, covering the years up to 2015. I want to cut every month into three parts (days 1-10, 11-20, and 21-31) and average the ten (or fewer) values in each part, so that each month has three values. How can I do it?
1993-01-29 28.92189
1993-02-01 29.12760
1993-02-02 29.18927
1993-02-03 29.49786
1993-02-04 29.62128
1993-02-05 29.60068
1993-02-08 29.60068
1993-02-09 29.39498
------
------
2015-08-18 209.92999
2015-08-19 208.28000
2015-08-20 204.01000
2015-08-21 197.63001
2015-08-24 189.55000
2015-08-25 187.23000
2015-08-26 194.67999
2015-08-27 199.16000
2015-08-28 199.24000
The tryCatch is there to work around the problem with the start of the data (the first month is incomplete). I will add more detail when I have time.
library(xts)
dates<-seq(as.Date("1993-01-29"),as.Date("2015-08-25"),"days")
sample<-rnorm(length(dates))
tmpxts<-split.xts(xts(x = sample,order.by = dates),f = "months")
mxts<-lapply(tmpxts,function(x) {
tmp<-data.frame(val=tryCatch(c(mean(x[1:10]),mean(x[11:20]),mean(x[21:length(x)])),
error=function(e) matrix(mean(x),1)))
row.names(tmp)<-tryCatch(index(x[c(1,11,21)]),error=function(e) index(x[1]))
tmp
})
do.call(rbind,mxts)
This is a base solution that builds cuts from an increasing sequence that cycles through years, months and your cut points at the 1st, 11th and 21st of each month. Note that cut.Date defaults to right=FALSE (intervals closed on the left), whereas the code below passes right=TRUE, which puts each break date (the 1st, 11th and 21st) into the lower interval; for example, the 1993-01-21 bin in the output below includes 1993-02-01. Drop right=TRUE if you want the bins to be exactly days 1-10, 11-20 and 21 to month end:
tapply(dat$V2, cut.Date(dat$V1,
breaks=as.Date(
apply( expand.grid( c(1,11,21), 1:12, 1993:2015), 1,
function( x) paste(rev(x), collapse="-")) ), right=TRUE), FUN=mean)
1993-01-01 1993-01-11 1993-01-21 1993-02-01 1993-02-11 1993-02-21 1993-03-01
NA NA 29.02475 29.48412 NA NA NA
snipped many empty intervals
And the bottom of results included:
2015-07-21 2015-08-01 2015-08-11 2015-08-21 2015-09-01 2015-09-11 2015-09-21
NA NA 204.96250 193.97200 NA NA NA
2015-10-01 2015-10-11 2015-10-21 2015-11-01 2015-11-11 2015-11-21 2015-12-01
NA NA NA NA NA NA NA
2015-12-11
NA
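A different way to get the 1-10, 11-20 and 21-end bins, sketched here under the assumption that binning on the day of the month is acceptable, is to cut the day number and combine it with the year-month. The six dates and values below are made up for illustration. This avoids generating break dates and produces no empty intervals.

```r
dates <- as.Date(c("1993-01-29", "1993-02-01", "1993-02-02",
                   "1993-02-10", "1993-02-11", "1993-02-21"))
value <- c(10, 20, 30, 40, 50, 60)

# Day of month as an integer
day <- as.integer(format(dates, "%d"))

# cut() on numbers is right-closed by default: (0,10], (10,20], (20,31]
part <- cut(day, breaks = c(0, 10, 20, 31),
            labels = c("01-10", "11-20", "21-end"))

# One group per year-month and part of the month
key <- paste(format(dates, "%Y-%m"), part)
means <- tapply(value, key, mean)
means
```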
The code below cuts each month separately into thirds, based on the number of days in each month.
library(dplyr)
library(lubridate)
library(ggplot2)
# Fake data
df = data.frame(date=seq.Date(as.Date("2013-01-01"),
as.Date("2013-03-31"), by="day"))
set.seed(394)
df$value = rnorm(nrow(df), sqrt(1:nrow(df)), 2)
# Cut months into thirds
df = df %>%
# Create a new column to group by Year-Month
mutate(yr_mon = paste0(year(date) , "_", month(date, label=TRUE, abbr=TRUE))) %>%
group_by(yr_mon) %>%
# Cut each month into thirds
mutate(cutMonth = cut(day(date),
breaks=c(0, round(1/3*n()), round(2/3*n()), n()),
labels=c("1st third","2nd third","3rd third")),
# Add yr_mon to cutMonth so that we have a unique group label for
# each third of each month
cutMonth = paste0(yr_mon, "\n", cutMonth)) %>%
ungroup() %>%
# Turn cutMonth into a factor with correct date ordering
mutate(cutMonth = factor(cutMonth, levels=unique(cutMonth)))
And here is the result:
# Show number of observations in each group
as.data.frame(table(df$cutMonth))
Var1 Freq
1 2013_Jan\n1st third 10
2 2013_Jan\n2nd third 11
3 2013_Jan\n3rd third 10
4 2013_Feb\n1st third 9
5 2013_Feb\n2nd third 10
6 2013_Feb\n3rd third 9
7 2013_Mar\n1st third 10
8 2013_Mar\n2nd third 11
9 2013_Mar\n3rd third 10
# Plot means by group (just to visualize the result of the date grouping operations)
ggplot(df, aes(cutMonth, value)) +
stat_summary(fun=mean, geom='point', size=4, colour="red") + # use fun.y in ggplot2 < 3.3.0
coord_cartesian(ylim=c(-0.2,10.2)) +
theme(axis.text.x = element_text(size=14))
