I have a data.frame with the last 12 months of values for 3 observations. There is a date variable corresponding to month.m0 (the most recent month), and the remaining values go backward in time, subtracting one month each time:
date <- c("2017-01-01", "2016-12-01", "2016-10-01")
month.m0 <- c(1, 2, 3)
month.m1 <- c(4, 5, 6)
month.m2 <- c(7, 8, 9)
month.m3 <- c(10, 11, 12)
month.m4 <- c(13, 14, 15)
month.m5 <- c(16, 17, 18)
month.m6 <- c(19, 20, 21)
month.m7 <- c(22, 23, 24)
month.m8 <- c(25, 26, 27)
month.m9 <- c(28, 29, 30)
month.m10 <- c(31, 32, 33)
month.m11 <- c(34, 35, 36)
df <- data.frame(date, month.m0, month.m1, month.m2, month.m3, month.m4, month.m5, month.m6, month.m7, month.m8, month.m9, month.m10, month.m11)
The input will be:
date month.m0 month.m1 month.m2 month.m3 month.m4 month.m5 month.m6 month.m7 month.m8 month.m9 month.m10 month.m11
1 2017-01-01 1 4 7 10 13 16 19 22 25 28 31 34
2 2016-12-01 2 5 8 11 14 17 20 23 26 29 32 35
3 2016-10-01 3 6 9 12 15 18 21 24 27 30 33 36
The problem here is that I don't know the real month of each value, because the numbering is ordinal and depends on the date variable.
For the first row, the initial value (month.m0) corresponds to January, because the date is in January (the day and the year don't matter). For the second row, the date indicates that month.m0 corresponds to December, and for the third row it corresponds to October. Then month.m1 is the value for (month(date) - months(1)), month.m2 is the value for (month(date) - months(2)), and so on, going back in time from the initial value.
EDITED OUTPUT:
I was trying to assign each value to the real month, so the output would be:
date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 2017-01-01 1 34 31 28 25 22 19 16 13 10 7 4
2 2016-12-01 35 32 29 26 23 20 17 14 11 8 5 2
3 2016-10-01 30 27 24 21 18 15 12 9 6 3 36 33
It's easy to assign the first month for each observation, but then it complicates when going backwards in time.
Assuming that df is the dataframe you provided...
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
gather(month_num,value,-date) %>% # reshape dataset
mutate(month_num = as.numeric(gsub("month.m","",month_num)), # keep only the number (as your step)
date = ymd(date), # transform date to date object
month_actual = month(date), # keep the number of the actual month (baseline)
month_now = month_actual + month_num, # create the current month (baseline + step)
month_now_upd = ifelse(month_now > 12, month_now-12, month_now), # update month number (for numbers > 12)
month_now_upd_name = month(month_now_upd, label=T)) %>% # get name of the month
select(date, month_now_upd_name, value) %>% # keep useful columns
spread(month_now_upd_name, value) %>% # reshape again
arrange(desc(date)) # start from recent month
# date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 2017-01-01 1 4 7 10 13 16 19 22 25 28 31 34
# 2 2016-12-01 5 8 11 14 17 20 23 26 29 32 35 2
# 3 2016-10-01 12 15 18 21 24 27 30 33 36 3 6 9
Note that I created various (helpful) variables that you won't need in the end, but they will help you understand the process when you run the chained commands step by step.
You can make the above code shorter by combining some commands within mutate if you want.
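The pipeline above steps forward from the baseline month (month_actual + month_num), while the edited question describes values that go backward in time. As a minimal sketch of the backward mapping, assuming month.mK belongs to the calendar month K months before the date, the wrap-around can be handled with modular arithmetic in base R. The names start_month, cal, out and result are illustrative only.

```r
# Rebuild the example data: column k of the matrix holds month.m(k-1)
vals <- matrix(1:36, nrow = 3)
colnames(vals) <- paste0("month.m", 0:11)
df <- data.frame(date = c("2017-01-01", "2016-12-01", "2016-10-01"), vals)

# Calendar month (1-12) of month.m0 for each row
start_month <- as.integer(format(as.Date(df$date), "%m"))

# Calendar month for every row/offset pair, stepping backward and wrapping
# from January to December: ((m - k - 1) mod 12) + 1
cal <- outer(start_month, 0:11, function(m, k) (m - k - 1) %% 12 + 1)

# Place each value in the column of its calendar month
out <- matrix(NA_real_, nrow(df), 12, dimnames = list(NULL, month.abb))
for (i in seq_len(nrow(df))) {
  out[i, cal[i, ]] <- unlist(df[i, -1])
}
result <- data.frame(date = df$date, out)
result
```

Because every row has exactly 12 values and 12 distinct target months, no cell of out should stay NA; a remaining NA would point to a mistake in the offset logic.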
Your explanation is not very clear to me, so my output is not exactly yours. But this is how I would do it:
library(dplyr)
library(tidyr)
library(stringr) # needed for str_replace() below
df %>%
# First create a new variable containing the month as a numeric between 1-12
mutate(month = strftime(date, "%m")) %>%
# Make the data tidy: a new column col contains
# month.m0, month.m1, month.m2, ... and a column val contains
# the values
gather(col, val, -date, -month) %>%
# remove "month.m" so the col column has numeric values
mutate_at("col", str_replace, pattern = "month.m", replacement = "") %>%
mutate_at(c("month", "col"), as.numeric) %>%
# Compute the difference between the month column and the col column
mutate(col = abs((col - month + 1) %% 12)) %>%
# Sort the dataframe according to the new col column
arrange(month, col) %>%
# Add month.m to the col column so we redefine the names of the columns
mutate(col = paste0("month.m", col), month = NULL) %>%
# Untidy the data frame
spread(col, val)
I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start month to the end month (determined with floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can happen), you can end each sequence one month earlier. Subtracting 1 from the floored completion date moves it back one day, into the previous month, so the monthly sequence stops there. Wrapping this in pmax means that if the start and end months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
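For readers who want to see the expansion step in isolation, here is a rough base-R sketch of the same idea on two made-up events. The data and the helper first_of_month are assumptions for illustration, not part of the answer's code.

```r
# Two made-up events (hypothetical data, not taken from the question)
events <- data.frame(
  start_date     = as.Date(c("2011-01-14", "2011-02-21")),
  completed_date = as.Date(c("2011-03-03", "2011-02-25"))
)

# First day of the month, a base-R stand-in for floor_date(x, "month")
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

from <- first_of_month(events$start_date)
to   <- first_of_month(events$completed_date)

# One monthly sequence per event, then flatten to a single Date vector
per_event <- lapply(seq_len(nrow(events)),
                    function(i) seq(from[i], to[i], by = "month"))
all_months <- do.call(c, per_event)

# Number of events that touch each month
counts <- table(format(all_months, "%Y-%m"))
counts
```

The first event spans January to March and the second only February, so February is the one month counted twice.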
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
I have a data frame with a datetime column. I want to know the number of rows by hour of the day. However, I care only about the rows between 8 AM and 10 PM.
The lubridate package requires us to filter hours of the day using the 24-hour convention.
library(tidyverse)
library(lubridate)
### Fake Data with Date-time ----
x <- seq.POSIXt(as.POSIXct('1999-01-01'), as.POSIXct('1999-02-01'), length.out=1000)
df <- data.frame(myDateTime = x)
### Get all rows between 8 AM and 10 PM (inclusive)
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= 8, myHour <= 22) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour) ## number of rows
Is there a way for me to use 10:00 PM rather than the integer 22?
You can use the ymd_hm and hour functions to do 12-hour to 24-hour conversions.
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")), ## hour() ignores year, month, date
myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour)
A more elegant solution.
## custom function to convert 12 hour time to 24 hour time
hourOfDay_12to24 <- function(time12hrFmt){
out <- paste("2000-01-01", time12hrFmt)
out <- hour(ymd_hm(out))
out
}
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hourOfDay_12to24("8:00 AM"),
myHour <= hourOfDay_12to24("10:00 PM")) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour)
You can also use base R to do this
#Extract the hour
df$hour_day <- as.numeric(format(df$myDateTime, "%H"))
#Subset data between 08:00 AM and 10:00 PM
start_hour <- as.integer(format(as.POSIXct("08:00 AM", format = "%I:%M %p"), "%H"))
end_hour <- as.integer(format(as.POSIXct("10:00 PM", format = "%I:%M %p"), "%H"))
new_df <- df[df$hour_day >= start_hour & df$hour_day <= end_hour, ]
#Count the frequency
stack(table(new_df$hour_day))
# values ind
#1 42 8
#2 42 9
#3 41 10
#4 42 11
#5 42 12
#6 41 13
#7 42 14
#8 41 15
#9 42 16
#10 42 17
#11 41 18
#12 42 19
#13 42 20
#14 41 21
#15 42 22
This gives the same output as the tidyverse/lubridate approach
library(tidyverse)
library(lubridate)
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")),
myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>%
count(myHour)
This is my first post, so I apologize if I am not specific enough.
I have a sequence of months and a data frame with approximately 100 rows, each with a unique identifier. Each identifier is associated with a start up date. I am trying to calculate the number of months since start up for each of these unique identifiers at each month in the sequence. I have tried unsuccessfully to write a for loop to accomplish this.
Example Below:
# Build Example Data Frame #
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
# Create Sequence of Months, Format to match Data Frame, Reverse for the For Loop #
base.date <- as.Date(c("2015-11-1"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m")
Months.2 <- rev(Months.1)
# Create For Loop #
require(zoo)
for(i in seq_along(Months.2))
{
for(j in 1:length(summary(as.factor(df_example$ID), maxsum = 100000)))
{
Active.Months <- 12 * as.numeric((as.yearmon(Months.2 - i) - as.yearmon(df_example$StartUp)))
}
}
The idea behind the for loop was that for every record in the Months.2 sequence, there would be a calculation of the number of months to that record (month date) from the Start Up month for each of the unique identifiers. However, this has been kicking back the error:
Error in Months.2 - i : non-numeric argument to binary operator
I am not sure what the solution is, or if I am using the for loop properly for this.
Thanks in advance for any help with solving this problem!
Edit: This is what I am hoping my expected outcome would be (this is just a sample as there are more months in the sequence):
ID Start Up Month 2015-11 2015-12 2016-01 2016-02 2016-03
1 A 2013-10 25 26 27 28 29
2 B 2013-10 25 26 27 28 29
3 C 2014-04 19 20 21 22 23
4 D 2015-06 5 6 7 8 9
5 E 2014-01 22 23 24 25 26
One way to do it is to first convert the dates with as.yearmon from the zoo package. Then we simply iterate over the months and take the difference from the start-up dates in df_example, approximating a month as 30 days:
library(zoo)
df_example$StartUp <- as.Date(as.yearmon(df_example$StartUp))
Months.2 <- as.Date(as.yearmon(Months.2))
df <- as.data.frame(sapply(Months.2, function(i)
round(abs(difftime(df_example$StartUp, i, units = 'days')/30))))
names(df) <- Months.2
cbind(df_example, df)
# ID StartUp 2016-07 2016-06 2016-05 2016-04 2016-03 2016-02 2016-01 2015-12 2015-11
#1 A 2013-10 33 32 31 30 29 28 27 26 25
#2 B 2013-10 33 32 31 30 29 28 27 26 25
#3 C 2014-04 27 26 25 24 23 22 21 20 19
#4 D 2015-06 13 12 11 10 9 8 7 6 5
#5 E 2014-01 30 29 28 27 26 25 24 23 22
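Dividing a day difference by 30 and rounding is an approximation. As a hedged alternative sketch, the month count can be computed exactly by turning each "YYYY-MM" string into a running month index (year * 12 + month) and subtracting. The helper month_index and the shortened month list are illustrative assumptions.

```r
# Start-up months from the question, keyed by ID
startup <- c(A = "2013-10", B = "2013-10", C = "2014-04",
             D = "2015-06", E = "2014-01")
# A shortened month sequence for illustration
months <- c("2015-11", "2015-12", "2016-01")

# Convert "YYYY-MM" to a running month number: year * 12 + month
month_index <- function(ym) {
  parts <- strsplit(ym, "-", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 12L + as.integer(p[2]),
         integer(1))
}

# Rows are IDs, columns are months; each cell is months since start-up
elapsed <- outer(month_index(startup), month_index(months),
                 function(s, m) m - s)
dimnames(elapsed) <- list(names(startup), months)
elapsed
```

Integer arithmetic avoids the rounding step entirely, and outer removes the need for an explicit loop over months.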
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
y_example <- paste(y_example,"-01",sep = "")
# paste on the "-01" so that the date functions used later can parse the value.
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
base.date <- as.Date(c("2015-11-01"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m-%d")
Months.2 <- rev(Months.1)
monnb <- function(d) { lt <- as.POSIXlt(as.Date(d, origin="1900-01-01")); lt$year*12 + lt$mon }
mondf <- function(d1, d2) {monnb(d2) - monnb(d1)}
NumofMonths <- abs(mondf(df_example[,2],Sys.Date()))
n = max(NumofMonths)
# sequence along the number of months and get the month count.
monthcount <- (t(sapply(NumofMonths, function(x) pmax(seq((x-n+1),x, +1), 0) )))
monthcount <- data.frame(monthcount[,-(1:24)])
names(monthcount) <- Months.1
finalDataFrame <- cbind.data.frame(df_example,monthcount)
Here is your final data frame which is the desired output you indicated:
ID StartUp 2015-11-01 2015-12-01 2016-01-01 2016-02-01 2016-03-01 2016-04-01 2016-05-01 2016-06-01 2016-07-01
1 A 2013-10-01 25 26 27 28 29 30 31 32 33
2 B 2013-10-01 25 26 27 28 29 30 31 32 33
3 C 2014-04-01 19 20 21 22 23 24 25 26 27
4 D 2015-06-01 5 6 7 8 9 10 11 12 13
5 E 2014-01-01 22 23 24 25 26 27 28 29 30
The overall idea is that we calculate the number of months and use the sequence function to create a counter of the number of months until we get the current month.
I have a "date" vector, that contains dates in mm/dd/yyyy format:
head(Entered_Date,5)
[1] 1/5/1998 1/5/1998 1/5/1998 1/5/1998 1/5/1998
I am trying to plot a frequency variable against the date, but I want to group the dates that it is by month or year. As it is now, there is a frequency per day, but I want to plot the frequency by month or year. So instead of having a frequency of 1 for 1/5/1998, 1 for 1/7/1998, and 3 for 1/8/1998, I would like to display it as 5 for 1/1998. It is a relatively large data set, with dates from 1998 to present, and I would like to find some automated way to accomplish this.
> dput(head(Entered_Date))
structure(c(260L, 260L, 260L, 260L, 260L, 260L), .Label = c("1/1/1998",
"1/1/1999", "1/1/2001", "1/1/2002", "1/10/2000", "1/10/2001",
"1/10/2002", "1/10/2003", "1/10/2005", "1/10/2006", "1/10/2007",
"1/10/2008", "1/10/2011", "1/10/2012", "1/10/2013", "1/11/1999",
"1/11/2000", "1/11/2001", "1/11/2002", "1/11/2005", "1/11/2006",
"1/11/2008", "1/11/2010", "1/11/2011", "1/11/2012", "1/11/2013",
"1/12/1998", "1/12/1999", "1/12/2001", "1/12/2004", "1/12/2005", ...
The floor_date() function from the lubridate package does this nicely.
data %>%
group_by(month = lubridate::floor_date(date, "month")) %>%
summarize(summary_variable = sum(value))
Thanks to Roman Cheplyaka
https://ro-che.info/articles/2017-02-22-group_by_month_r
See more on how to use the function: https://lubridate.tidyverse.org/reference/round_date.html
Here is an example using dplyr. You simply use the corresponding date format string for month %m or year %Y in the format statement.
set.seed(123)
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
head(df)
date value
1 1998-01-01 2
2 1998-01-02 4
3 1998-01-03 3
4 1998-01-04 5
5 1998-01-05 5
6 1998-01-06 1
library(dplyr)
df %>%
mutate(month = format(date, "%m"), year = format(date, "%Y")) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Source: local data frame [25 x 3]
Groups: month [?]
month year total
(chr) (chr) (int)
1 01 1998 105
2 01 1999 91
3 01 2000 3
4 02 1998 74
5 02 1999 77
6 03 1998 96
7 03 1999 86
8 04 1998 91
9 04 1999 95
10 05 1998 93
.. ... ... ...
Just to add to @cdeterman's answer, you can use lubridate along with dplyr to make this even easier:
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
library(dplyr)
library(lubridate)
df %>%
mutate(month = month(date), year = year(date)) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Maybe you can just add a column to your data like this (the format string follows the mm/dd/yyyy layout of Entered_Date):
Year <- format(as.Date(Entered_Date, "%m/%d/%Y"), "%Y")
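As a small sketch of the same idea on made-up dates in the question's mm/dd/yyyy layout, format() can produce either a year-month key or a year key, and table() then gives the frequency per period. The sample values and the names parsed, by_month and by_year are assumptions for illustration.

```r
# Made-up sample in the question's mm/dd/yyyy layout, stored as a factor
Entered_Date <- factor(c("1/5/1998", "1/7/1998", "1/8/1998",
                         "1/8/1998", "1/8/1998", "2/1/1998"))

# Convert via character so the factor labels, not the codes, are parsed
parsed <- as.Date(as.character(Entered_Date), format = "%m/%d/%Y")

# Frequency per month and per year
by_month <- table(format(parsed, "%Y-%m"))
by_year  <- table(format(parsed, "%Y"))
by_month
by_year
```

The five January entries collapse into a single count for 1998-01, which is the grouping the question asks for.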
You don't need dplyr. Look at ?as.POSIXlt
df$date<-as.POSIXlt(df$date)
mon<-df$date$mon # months are 0-based (0 = January)
yr<-df$date$year # years since 1900
monyr<-as.factor(paste(mon,yr,sep="/"))
df$date<-monyr
You don't need to use ggplot2, but it's nice for this kind of thing.
library(ggplot2)
p <- ggplot(df, aes(factor(date)))
p + geom_bar()
If you want to see the actual numbers
aggregate(. ~ date,data = df,FUN=length )
df2<-aggregate(. ~ date,data = df,FUN=length )
df2
date value
1 0/98 31
2 0/99 31
3 1/98 28
4 1/99 28
5 10/98 30
6 10/99 30
7 11/97 1
8 11/98 31
9 11/99 31
10 2/98 31
11 2/99 31
12 3/98 30
13 3/99 30
14 4/98 31
15 4/99 31
16 5/98 30
17 5/99 30
18 6/98 31
19 6/99 31
20 7/98 31
21 7/99 31
22 8/98 30
23 8/99 30
24 9/98 31
25 9/99 31
There is a super easy way using the cut() function:
list = as.Date(c("1998-5-2", "1993-4-16", "1998-5-10"))
cut(list, breaks = "month")
and you will get this:
[1] 1998-05-01 1993-04-01 1998-05-01
62 Levels: 1993-04-01 1993-05-01 1993-06-01 1993-07-01 1993-08-01 ... 1998-05-01
Another solution is slider::slide_period:
library(slider)
library(dplyr)
monthly_summary <- function(data) summarise(data, date = format(max(date), "%Y-%m"), value = sum(value))
slide_period_dfr(df, df$date, "month", monthly_summary)
date value
1 1998-01 92
2 1998-02 82
3 1998-03 113
4 1998-04 94
5 1998-05 92
6 1998-06 74
7 1998-07 89
8 1998-08 92
9 1998-09 91
10 1998-10 100
...
There is also group_by(month_yr = cut(date, breaks = "1 month")), where cut() comes from base R, so the grouping itself does not need lubridate or other date packages.
I have this data frame:
Source: local data frame [446,604 x 2]
date pressure
1 2014_01_01_0:01 991
2 2014_01_01_0:02 991
3 2014_01_01_0:03 991
4 2014_01_01_0:04 991
5 2014_01_01_0:05 991
6 2014_01_01_0:06 991
7 2014_01_01_0:07 991
8 2014_01_01_0:08 991
9 2014_01_01_0:09 991
10 2014_01_01_0:10 991
.. ... ...
I want to separate the date column using separate() from tidyr
library(tidyr)
separate(df, date, into = c("year", "month", "day", "time"), sep="_")
But it does not work. I managed to do it using substr() and mutate():
library(dplyr)
df %>%
mutate(
year = substr(date, 1, 4),
month = substr(date, 6, 7),
day = substr(date, 9, 10),
time = substr(date, 12, 15))
Update:
It does not work because I have malformed rows. I was able to diagnose using my initial substr() method and I found out that I had weird entries in the dataframe:
df %>%
select(date) %>%
mutate(
year = substr(date, 1, 4),
month = substr(date, 6, 7),
day = substr(date, 9, 10),
time = substr(date, 12, 15)) %>%
group_by(year) %>%
summarise(n=n())
And this is what I get:
Source: local data frame [33 x 2]
year n
1 2014 446293
2 4164 9
3 4165 10
4 4166 10
5 4167 10
6 4168 10
7 4169 10
8 4170 10
9 4171 10
10 4172 10
11 4173 10
12 4174 10
13 4175 10
14 4176 10
15 4177 10
16 4178 10
17 4179 10
18 4180 10
19 4181 10
20 4182 10
21 4183 10
22 4184 10
23 4185 10
24 4186 10
25 4187 10
26 4188 10
27 4189 10
28 4190 10
29 4191 10
30 4192 10
31 4193 11
32 4194 10
33 4195 1
Would there be a more efficient way to diagnose the structure of the elements of a column and find the malformed lines before doing separate() ?
The steps would be:
Try to separate() first (no extra)
Notice there are malformed rows (errors in console)
Use separate() with extra = "drop"
Use group_by() and summarise() to explore the data and determine which rows to filter out
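One possible way to find malformed rows before calling separate() is to test each value against the expected layout with a regular expression, or to count how many pieces each value splits into. This is only a sketch: the sample values, including the malformed one, are made up, and the pattern assumes the year_month_day_hour:minute layout shown in the question.

```r
# Hypothetical sample: three well-formed values and one made-up malformed entry
dates <- c("2014_01_01_0:01", "2014_01_01_0:02",
           "41640.0006944", "2014_01_01_23:59")

# Expected layout: yyyy_mm_dd_h:mm or yyyy_mm_dd_hh:mm
pattern <- "^[0-9]{4}_[0-9]{2}_[0-9]{2}_[0-9]{1,2}:[0-9]{2}$"
well_formed <- grepl(pattern, dates)

# Positions and values of the rows that do not match
bad_rows <- which(!well_formed)
dates[bad_rows]

# How many pieces each value would give when split on "_"
piece_counts <- table(lengths(strsplit(dates, "_", fixed = TRUE)))
piece_counts
```

Any value that does not split into exactly four pieces would trigger a warning in separate(), so the table gives a quick overview of how many rows are affected.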