I have a df like the following with 30 years until 2015. I want to cut every month into three data like 1-10, 11-20, and 21-31 and average all ten (less then ten) data. Thus, each month has three data. How can I do it?
1993-01-29 28.92189
1993-02-01 29.12760
1993-02-02 29.18927
1993-02-03 29.49786
1993-02-04 29.62128
1993-02-05 29.60068
1993-02-08 29.60068
1993-02-09 29.39498
------
------
2015-08-18 209.92999
2015-08-19 208.28000
2015-08-20 204.01000
2015-08-21 197.63001
2015-08-24 189.55000
2015-08-25 187.23000
2015-08-26 194.67999
2015-08-27 199.16000
2015-08-28 199.24000
tryCatch is for eliminate data start date problem. I will provide more info when i have time.
library(xts)
dates<-seq(as.Date("1993-01-29"),as.Date("2015-08-25"),"days")
sample<-rnorm(length(dates))
tmpxts<-split.xts(xts(x = sample,order.by = dates),f = "months")
mxts<-lapply(tmpxts,function(x) {
tmp<-data.frame(val=tryCatch(c(mean(x[1:10]),mean(x[11:20]),mean(x[21:length(x)])),
error=function(e) matrix(mean(x),1)))
row.names(tmp)<-tryCatch(index(x[c(1,11,21)]),error=function(e) index(x[1]))
tmp
})
do.call(rbind,mxts)
This is a base solution that builds cuts from an increasing sequence the cycles through years, months and your cuts at 1st, 11th and 21th of the month, The default for the base cut functions are to include the breaks as the "right-side" of intervals, but your specification required cuts at 1,11,and 21 (to leave 10, and 20 in the lower intervals) so I used right=TRUE:
tapply(dat$V2, cut.Date(dat$V1,
breaks=as.Date(
apply( expand.grid( c(1,11,21), 1:12, 1993:2015), 1,
function( x) paste(rev(x), collapse="-")) ), right=TRUE), FUN=mean)
1993-01-01 1993-01-11 1993-01-21 1993-02-01 1993-02-11 1993-02-21 1993-03-01
NA NA 29.02475 29.48412 NA NA NA
snipped many empty intervals
And the bottom of results included:
2015-07-21 2015-08-01 2015-08-11 2015-08-21 2015-09-01 2015-09-11 2015-09-21
NA NA 204.96250 193.97200 NA NA NA
2015-10-01 2015-10-11 2015-10-21 2015-11-01 2015-11-11 2015-11-21 2015-12-01
NA NA NA NA NA NA NA
2015-12-11
NA
The code below cuts each month separately into thirds, based on the number of days in each month.
library(dplyr)
library(lubridate)
library(ggplot2)
# Fake data
df = data.frame(date=seq.Date(as.Date("2013-01-01"),
as.Date("2013-03-31"), by="day"))
set.seed(394)
df$value = rnorm(nrow(df), sqrt(1:nrow(df)), 2)
# Cut months into thirds
df = df %>%
# Create a new column to group by Year-Month
mutate(yr_mon = paste0(year(date) , "_", month(date, label=TRUE, abbr=TRUE))) %>%
group_by(yr_mon) %>%
# Cut each month into thirds
mutate(cutMonth = cut(day(date),
breaks=c(0, round(1/3*n()), round(2/3*n()), n()),
labels=c("1st third","2nd third","3rd third")),
# Add yr_mon to cutMonth so that we have a unique group label for
# each third of each month
cutMonth = paste0(yr_mon, "\n", cutMonth)) %>%
ungroup() %>%
# Turn cutMonth into a factor with correct date ordering
mutate(cutMonth = factor(cutMonth, levels=unique(cutMonth)))
And here is the result:
# Show number of observations in each group
as.data.frame(table(df$cutMonth))
Var1 Freq
1 2013_Jan\n1st third 10
2 2013_Jan\n2nd third 11
3 2013_Jan\n3rd third 10
4 2013_Feb\n1st third 9
5 2013_Feb\n2nd third 10
6 2013_Feb\n3rd third 9
7 2013_Mar\n1st third 10
8 2013_Mar\n2nd third 11
9 2013_Mar\n3rd third 10
# Plot means by group (just to visualize the result of the date grouping operations)
ggplot(df, aes(cutMonth, value)) +
stat_summary(fun.y=mean, geom='point', size=4, colour="red") +
coord_cartesian(ylim=c(-0.2,10.2)) +
theme(axis.text.x = element_text(size=14))
Related
I need to calculate the rolling 14 day average for a large data set. The data set is private, although I can share a small snippet.
The data set comes from an instrument in the field which does not operate every day. For instance, a snippet of the data frame would look like so:
Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000
I have the following plot.
library(zoo)
library(tidyverse)
ggplot(data=df_days, aes(x=Date, y=Value)) +
geom_line(color="black", lwd=0.5) +
geom_point(lwd=0.5) +
geom_line(y=rollmean(df_days$Value, 14, na.pad=TRUE), color="red", lwd=0.8)
I realised that I'm actually taking the 14 point average, i.e the average of 14 data points. Is there a way to take the 14 day average, based upon the dates themselves?
1) Using the input from the question shown reproducibly in the Note at the end we calculate the number of points to use at each date, w, and then use rollapplyr with that.
library(zoo)
within(DF, {
w <- seq_along(Date) - findInterval(Date - 14, Date)
mean14 <- rollapplyr(Value, w, mean)
})
giving the following where mean14 is the mean and w is the number of points used to calculate that mean. This is calculated in such a way that if there were no missing dates then it would give the same result as rollapplyr(DF$Value, 14, mean, partial = TRUE) but if there are missing dates then it uses fewer based on the number of dates in a 14 day window. (Note that using different numbers of points for each mean can affect the variance.)
Date Value mean14 w
1 2022-01-28 196.0000 196.00000 1
2 2022-01-31 104.0000 150.00000 2
3 2022-02-01 0.0000 100.00000 3
4 2022-02-02 98.0000 99.50000 4
5 2022-02-03 0.0000 79.60000 5
6 2022-02-07 139.9231 89.65385 6
7 2022-02-08 114.5000 93.20330 7
8 2022-02-09 121.6429 96.75824 8
9 2022-02-10 96.5000 96.72955 9
10 2022-02-11 151.6364 91.80026 9
11 2022-02-14 85.8750 89.78637 9
12 2022-02-15 98.9000 100.77526 9
13 2022-02-18 209.4000 127.29716 8
14 2022-02-21 172.1818 131.32951 8
15 2022-02-22 0.0000 117.01700 8
16 2022-02-23 0.0000 101.81165 8
17 2022-02-28 264.0000 124.08030 6
18 2022-03-01 131.7500 129.55530 6
19 2022-03-03 119.3333 128.09502 7
20 2022-03-04 88.8000 110.86645 7
21 2022-03-07 152.1667 108.00714 7
22 2022-03-08 24.5000 111.50714 7
2) Another approach is to add the missing dates, fill in Value in those missing dates with NA and then use rollapplyr.
m <- merge(DF, data.frame(Date = seq(min(DF$Date), max(DF$Date), 1)), all = TRUE)
na.omit(transform(m,
mean14 = rollapplyr(Value, 14, mean, na.rm = TRUE, partial = TRUE)))
3) A variation of the above is to use zoo objects. Note that fortify.zoo(zz) can be used to create a data frame from a zoo object.
library(zoo)
z <- read.zoo(DF)
# 1
tt <- time(z)
w <- seq_along(tt) - findInterval(tt - 14, tt)
zz <- rollapplyr(z, w, mean)
# 2
m <- merge(z, zoo(, seq(start(z), end(z), 1)))
zz <- na.omit(rollapply(m, 14, mean, na.rm = TRUE))
Note
Lines <- "Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000"
DF <- read.csv(text = Lines)
DF$Date <- as.Date(DF$Date)
There may be more elegant solutions, but you can fill in the missing dates with NA:
df$Date <- as.Date(df$Date)
library(dplyr)
library(tidyr)
df %>% complete(Date = seq(min(Date),max(Date),1), fill = list(Value = NA))
Output:
# A tibble: 40 × 2
# Date Value
# <date> <dbl>
# 1 2022-01-28 196
# 2 2022-01-29 NA
# 3 2022-01-30 NA
# 4 2022-01-31 104
# 5 2022-02-01 0
# 6 2022-02-02 98
# ...
I have a cross section data as following:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series observing the time to maturity(in months) for each of the three loans for a period of 48 months. How can I achieve that? The final output should look like following:
>loan data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here new columns (for 48 months) represents the time to maturity for each loan from that respective months.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
I have a dataset over some departments (dep. number), and in which timeframe a certain survey was made in that department. It looks like this
dep type inDate outDate
1 14 AA 2015-01-16 2015-04-25
2 10 AB 2014-05-01 2017-01-01
3 14 BA 2013-01-04 2015-04-06
4 11 CA 2016-09-10 2017-12-01
5 10 DD 2013-01-01 2013-12-01
...
Also i have a startYear = 2013
and an endYear = 2017
for when the surveys started and ended globally.
I want a plot for each of the departments. These plots should show how many surveys were active in the period between the startDate and endDate. So for department 14, the plot should look like this
Can someone just point me in the right direction, i don't even know where to start?
df = read.table(text = "
dep type inDate outDate
1 14 AA 2015-01-16 2015-04-25
2 10 AB 2014-05-01 2017-01-01
3 14 BA 2013-01-04 2015-04-06
4 11 CA 2016-09-10 2017-12-01
5 10 DD 2013-01-01 2013-12-01
", header=T, stringsAsFactors=F)
library(tidyverse)
library(lubridate)
df %>%
mutate_at(vars(inDate, outDate), ymd) %>% # update date columns to date format (if needed)
mutate(dep = factor(dep)) %>% # update dep to factor (if it is not)
group_by(dep, id = row_number()) %>% # for every row
nest() %>% # nest data
mutate(dates = map(data, ~seq(.x$inDate, .x$outDate, "1 day"))) %>% # create a sequence of dates
unnest(dates) %>% # add that sequence of dates as column
count(dep, dates) %>% # count live projects each day
complete(dep, dates, fill = list(n = 0L)) %>% # add zeros to days that surveys weren't live
ggplot(aes(dates, n, group=dep, col=dep))+ # plot
geom_line()+ # add line
facet_wrap(~dep) # one plot for each department
You can remove +facet_wrap(~dep) if you want all departments in the same plot.
I am trying to get a count of active clients per month, using data that has a start and end date to each client's episode. The code I am using I can't work out how to count per month, rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which current counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, start.month, end.month, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
# Test data
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).