I have dates in the format mm/yyyy in column 1, and then results in column 2.
month Result
01/2018 96.13636
02/2018 96.40000
3/2018 94.00000
04/2018 97.92857
05/2018 95.75000
11/2017 98.66667
12/2017 97.78947
How can I order by month so that it starts from the first month (11/2017) and ends at the last (05/2018)?
I have tried a few 'orders', but none seems to order by year and then by month.
In tidyverse (w/ lubridate added):
library(tidyverse)
library(lubridate)
dfYrMon <-
df1 %>%
mutate(date = parse_date_time(month, "my"),
year = year(date),
month = month(date)
) %>%
arrange(year, month) %>%
select(date, year, month, result)
With data:
df1 <- tibble(month = c("01/2018", "02/2018", "03/2018", "04/2018", "05/2018", "11/2017", "12/2017"),
result = c(96.13636, 96.4, 94, 97.92857, 95.75, 98.66667, 97.78947))
Will get you this 'dataframe':
# A tibble: 7 x 4
date year month result
<dttm> <dbl> <dbl> <dbl>
1 2017-11-01 2017 11 98.66667
2 2017-12-01 2017 12 97.78947
3 2018-01-01 2018 1 96.13636
4 2018-02-01 2018 2 96.40000
5 2018-03-01 2018 3 94.00000
6 2018-04-01 2018 4 97.92857
7 2018-05-01 2018 5 95.75000
Making your data values atomic (year in its own column, month in its own column) generally improves the ease of manipulation.
Or if you want to use base R date manipulations instead of lubridate's:
library(tidyverse)
dfYrMon_base <-
df1 %>%
mutate(date = as.Date(paste("01/", month, sep = ""), "%d/%m/%Y"),
year = format(date, "%Y"),
month = format(date, "%m")
) %>%
arrange(year, month) %>%
select(date, year, month, result)
dfYrMon_base
Note the datatypes created.
# A tibble: 7 x 4
date year month result
<date> <chr> <chr> <dbl>
1 2017-11-01 2017 11 98.66667
2 2017-12-01 2017 12 97.78947
3 2018-01-01 2018 01 96.13636
4 2018-02-01 2018 02 96.40000
5 2018-03-01 2018 03 94.00000
6 2018-04-01 2018 04 97.92857
7 2018-05-01 2018 05 95.75000
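The character types matter for sorting. A minimal sketch, using a hypothetical vector of month strings, of why character months only sort correctly when they are zero-padded (as format(date, "%m") produces):

```r
# Hypothetical month values stored as character, without zero-padding.
months_chr <- c("1", "2", "10", "11")

# Character sorting compares strings, so "10" and "11" land before "2".
chr_sorted <- sort(months_chr)
chr_sorted
# [1] "1"  "10" "11" "2"

# Converting to integer restores numeric order.
int_sorted <- sort(as.integer(months_chr))
int_sorted
# [1]  1  2 10 11

# Zero-padded strings also sort in calendar order.
padded_sorted <- sort(sprintf("%02d", as.integer(months_chr)))
padded_sorted
# [1] "01" "02" "10" "11"
```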
We can convert the month column to zoo's yearmon class and order by it (using Result to break ties):
library(zoo)
out <- df1[order(as.yearmon(df1$month, "%m/%Y"), df1$Result),]
row.names(out) <- NULL
out
# month Result
#1 11/2017 98.66667
#2 12/2017 97.78947
#3 01/2018 96.13636
#4 02/2018 96.40000
#5 03/2018 94.00000
#6 04/2018 97.92857
#7 05/2018 95.75000
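If zoo is not available, the same ordering can be sketched in base R. This is an alternative, not the code above: it assumes every month string is in mm/yyyy form and prepends a dummy day so that as.Date() can parse it. The helper name first_of_month is only for illustration.

```r
# Same sample data as in this answer, rebuilt here so the block is self-contained.
df1 <- data.frame(
  month  = c("01/2018", "02/2018", "03/2018", "04/2018", "05/2018", "11/2017", "12/2017"),
  Result = c(96.13636, 96.4, 94, 97.92857, 95.75, 98.66667, 97.78947),
  stringsAsFactors = FALSE
)

# Prepend "01/" so each value becomes a full date on the first of its month.
first_of_month <- as.Date(paste0("01/", df1$month), format = "%d/%m/%Y")

# Order the rows chronologically and reset the row names.
out <- df1[order(first_of_month), ]
row.names(out) <- NULL
out
```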
data
df1 <- structure(list(month = c("01/2018", "02/2018", "03/2018", "04/2018",
"05/2018", "11/2017", "12/2017"), Result = c(96.13636, 96.4,
94, 97.92857, 95.75, 98.66667, 97.78947)), .Names = c("month",
"Result"), class = "data.frame",
row.names = c("1", "2", "3",
"4", "5", "6", "7"))
I have data as given in the input section (dput below) and need to convert it to the output, with all values of the two rows in one long column. I tried using transpose, but cells were getting trimmed.
I don't want to hardcode, since in future I might have data in 3 or 4 rows in a similar way.
P.S. I also tried pivot_longer, but it didn't help.
structure(list(Header = c("Sat 12/3 \n358a-947a\n1017a-229p HRS 10.02",
"Sat 12/10 \n559a-1106a\n1134a-227p HRS 8.00"), X = c("Sun 12/4 ",
"Sun 12/11 "), X.1 = c("Mon 12/5 \n548a-1121a\n1149a-618p\n650p-845p HRS 13.95",
"Mon 12/12 \n500a-1121a\n1151a-547p\n616p-830p HRS 14.53"),
X.2 = c("Tue 12/6 \n359a-1120a\n1150a-400p HRS 11.53",
"Tue 12/13 \n548a-1120a\n1148a-449p HRS 10.54"), X.3 = c("Wed 12/7 \n548a-1119a\n1149a-515p HRS 10.95",
"Wed 12/14 \n429a-1120a\n1150a-432p HRS 11.56"), X.4 = c("Thu 12/8 \n549a-1120a\n1149a-447p HRS 10.48",
"Thu 12/15 \n429a-1121a\n1152a-431p HRS 11.52"), X.5 = c("Fri 12/9 \n548a-1120a\n1148a-218p HRS 8.03",
"Fri 12/16 \n430a-1120a\n1150a-432p HRS 11.55")), class = "data.frame", row.names = c(NA,
-2L))
My try (with a little help)
pivot_longer(df, cols = c(1:7)) %>%
select(value) %>%
mutate(value=str_replace(value,"HRS","")) %>%
separate(.,value,into=c("day","entry1","entry2","entry3"),sep="\n") %>%
separate(.,entry1,into=c("time_in1","time_out1"),sep="-") %>%
separate(.,entry2,into=c("time_in2","time_out2"),sep="-") %>%
separate(.,time_out2,into=c("time_out2","duration1"),remove = FALSE,sep=" ",extra = "merge") %>%
separate(.,entry3,into=c("time_in3","time_out3"),sep="-") %>%
separate(.,time_out3,into=c("time_out3","duration2"),remove = FALSE,sep=" ") %>%
mutate(duration=coalesce(duration1,duration2)) %>%
select(day, duration, time_in1,time_out1,time_in2,time_out2,time_in3,time_out3) %>%
separate(.,day,into=c("date","day"),extra="merge") %>%
mutate(day=mdy(paste0(day,"2021")),
duration=str_trim(duration))
Approach
The key was tidyr::separate_rows(), which not only separates the cell by "\n" but also splits the components into rows rather than columns.
Here, it is much better to split into rows than into columns. Suppose that most cells have 2 or 3 entries separated by "\n"; but there is a "rogue" cell, with an unusually large number (say 9) of entries, generated by someone who repeatedly clocked in and out throughout the day.
While splitting into columns would create arbitrarily many time_in* | time_out* columns, which remain empty (NA) in all rows except the "rogue"
date day duration time_in1 time_out1 time_in2 time_out2 time_in3 time_out3 time_in4 time_out4 time_in5 time_out5 time_in6 time_out6 time_in7 time_out7 time_in8 time_out8 time_in9 time_out9
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# ... ... ... ... ... ... ... ... ... NA NA NA NA NA NA NA NA NA NA NA NA
splitting into rows will maintain a tame (and stable) columnar structure
date day duration time_in time_out
<date> <chr> <dbl> <chr> <chr>
# ... ... ... ... ...
# ... ... ... ... ...
# ... ... ... ... ...
without any "extraneous" columns (or rows).
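The row-wise idea can be illustrated without tidyr. The following is a minimal base R sketch on hypothetical toy data (the cells data frame and its column names are made up for illustration); it is a stand-in for what separate_rows() does, not its implementation:

```r
# Toy data: one cell per day; entries inside a cell are separated by "\n".
cells <- data.frame(
  day   = c("Sat", "Mon"),
  times = c("358a-947a\n1017a-229p", "548a-1121a\n1149a-618p\n650p-845p"),
  stringsAsFactors = FALSE
)

# Split each cell into its entries (a list with one character vector per row).
entries <- strsplit(cells$times, "\n", fixed = TRUE)

# Repeat each day once per entry, so the result grows in rows, not columns.
long <- data.frame(
  day  = rep(cells$day, lengths(entries)),
  time = unlist(entries),
  stringsAsFactors = FALSE
)
long
```

However many entries a single cell holds, long keeps the same two columns.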
Solution
Given your sample data df
df <- structure(list(Header = c("Sat 12/3 \n358a-947a\n1017a-229p HRS 10.02", "Sat 12/10 \n559a-1106a\n1134a-227p HRS 8.00"),
X = c("Sun 12/4 ", "Sun 12/11 "),
X.1 = c("Mon 12/5 \n548a-1121a\n1149a-618p\n650p-845p HRS 13.95", "Mon 12/12 \n500a-1121a\n1151a-547p\n616p-830p HRS 14.53"),
X.2 = c("Tue 12/6 \n359a-1120a\n1150a-400p HRS 11.53", "Tue 12/13 \n548a-1120a\n1148a-449p HRS 10.54"),
X.3 = c("Wed 12/7 \n548a-1119a\n1149a-515p HRS 10.95", "Wed 12/14 \n429a-1120a\n1150a-432p HRS 11.56"),
X.4 = c("Thu 12/8 \n549a-1120a\n1149a-447p HRS 10.48", "Thu 12/15 \n429a-1121a\n1152a-431p HRS 11.52"),
X.5 = c("Fri 12/9 \n548a-1120a\n1148a-218p HRS 8.03", "Fri 12/16 \n430a-1120a\n1150a-432p HRS 11.55")),
class = "data.frame", row.names = c(NA, -2L))
the following workflow
library(tidyverse)
library(stringr)
# ...
# Code to generate 'df'.
# ...
year_observed <- 2016
results <- df %>%
mutate(id = row_number()) %>%
pivot_longer(!id, names_to = "column") %>%
separate(value, into = c("date", "entries"), sep = "\n", fill = "right", extra = "merge", remove = TRUE) %>%
separate(entries, into = c("times", "duration"), sep = "HRS", fill = "right", extra = "warn", remove = TRUE) %>%
mutate(across(date:duration, trimws),
date = as.Date(paste(str_extract(date, "\\d{1,2}/\\d{1,2}$"), year_observed, sep = "/"), format = "%m/%d/%Y"),
duration = as.numeric(duration),
duration = if_else(is.na(duration), 0, duration),
day = format(date, format = "%a")) %>%
separate_rows(times, sep = "\n") %>%
separate(times, into = c("time_in", "time_out"), sep = "-", fill = "warn", extra = "warn", remove = TRUE) %>%
# ...Further Transformations... %>%
select(id, date, day, duration, time_in, time_out)
# View results.
results
should yield results like
# A tibble: 28 x 6
id date day duration time_in time_out
<int> <date> <chr> <dbl> <chr> <chr>
1 1 2016-12-03 Sat 10.0 358a 947a
2 1 2016-12-03 Sat 10.0 1017a 229p
3 1 2016-12-04 Sun 0 NA NA
4 1 2016-12-05 Mon 14.0 548a 1121a
5 1 2016-12-05 Mon 14.0 1149a 618p
6 1 2016-12-05 Mon 14.0 650p 845p
7 1 2016-12-06 Tue 11.5 359a 1120a
8 1 2016-12-06 Tue 11.5 1150a 400p
9 1 2016-12-07 Wed 11.0 548a 1119a
10 1 2016-12-07 Wed 11.0 1149a 515p
# ... with 18 more rows
where id identifies (by row number) the original record in df.
To pivot into your newly specified output, simply execute this code, or append it to the existing workflow:
wide_results <- results %>%
group_by(id, date) %>% mutate(entry = row_number()) %>% ungroup() %>%
pivot_wider(id_cols = c(date, day, duration), names_from = entry, names_glue = "{.value}_{entry}", values_from = c(time_in, time_out)) %>%
# Select so as to alternate between 'time_in_*' and 'time_out_*'.
select(order(as.numeric(str_extract(colnames(.), "\\d+$")), str_extract(colnames(.), "^time_(in|out)"), na.last = FALSE))
# View results.
wide_results
You should obtain wide_results like:
# A tibble: 14 x 9
date day duration time_in_1 time_out_1 time_in_2 time_out_2 time_in_3 time_out_3
<date> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2016-12-03 Sat 10.0 358a 947a 1017a 229p NA NA
2 2016-12-04 Sun 0 NA NA NA NA NA NA
3 2016-12-05 Mon 14.0 548a 1121a 1149a 618p 650p 845p
4 2016-12-06 Tue 11.5 359a 1120a 1150a 400p NA NA
5 2016-12-07 Wed 11.0 548a 1119a 1149a 515p NA NA
6 2016-12-08 Thu 10.5 549a 1120a 1149a 447p NA NA
7 2016-12-09 Fri 8.03 548a 1120a 1148a 218p NA NA
8 2016-12-10 Sat 8 559a 1106a 1134a 227p NA NA
9 2016-12-11 Sun 0 NA NA NA NA NA NA
10 2016-12-12 Mon 14.5 500a 1121a 1151a 547p 616p 830p
11 2016-12-13 Tue 10.5 548a 1120a 1148a 449p NA NA
12 2016-12-14 Wed 11.6 429a 1120a 1150a 432p NA NA
13 2016-12-15 Thu 11.5 429a 1121a 1152a 431p NA NA
14 2016-12-16 Fri 11.6 430a 1120a 1150a 432p NA NA
Note
You must supply the year_observed (here 2016) to correctly contextualize the dates written in m/d format. Otherwise, they will calibrate to the year 2021, which will skew the days of the week.
Warning
These dates (12/3, etc.) are in December, and close to the end of the calendar year. If any of these entries "cross over" (from 2016) into the next year (e.g. 1/1/2017), they will be incorrectly calibrated to the former year (e.g. 1/1/2016), and thus have an incorrect date and weekday.
However, if your dates do cross over, that's a good indication that the full date (12/3/2016) should have been notated in the original cells, in which case
results <- df %>%
# ... %>%
mutate(
# ...
date = as.Date(str_extract(date, "(\\d{1,2}/){2,2}\\d{4,4}$"), format = "%m/%d/%Y")
# ...
) # ... %>%
would have sufficed to properly parse the dates.
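If the cells cannot be changed, one possible workaround for the crossover problem described in the warning is to bump the year whenever the month number drops. This is only a sketch and not part of the workflow above: it assumes the m/d strings are already in chronological order and span less than a year between consecutive entries, and the names md and start_year are hypothetical.

```r
# Hypothetical m/d strings in chronological order, crossing a year boundary.
md <- c("12/30", "12/31", "1/1", "1/2")
start_year <- 2016

# Month number of each entry.
mon <- as.integer(sub("/.*$", "", md))

# A drop in the month number (12 -> 1) marks the start of a new year.
rolled <- c(FALSE, diff(mon) < 0)
year <- start_year + cumsum(rolled)

# Attach the inferred year and parse as full dates.
dates <- as.Date(paste(md, year, sep = "/"), format = "%m/%d/%Y")
dates
```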
I have a column in my large data set called Date. How do I extract both the year and month from it? I would like to create a column Month where the month goes from 1-12 and year where the year goes from the first year in my data set to the last year in my data set.
Thanks.
> typeof(data$Date)
[1] "character"
> head(data$Date)
[1] "2/06/2020 11:23" "12/06/2020 7:56" "12/06/2020 7:56" "29/06/2020 16:54" "3/06/2020 15:09" "25/06/2020 17:11"
dplyr and lubridate -
library(dplyr)
library(lubridate)
data <- data %>%
mutate(Date = dmy_hm(Date),
month = month(Date),
year = year(Date))
# Date month year
#1 2020-06-02 11:23:00 6 2020
#2 2020-06-12 07:56:00 6 2020
#3 2020-06-12 07:56:00 6 2020
#4 2020-06-29 16:54:00 6 2020
#5 2020-06-03 15:09:00 6 2020
#6 2020-06-25 17:11:00 6 2020
Base R -
data$Date <- as.POSIXct(data$Date, tz = 'UTC', format = '%d/%m/%Y %H:%M')
data <- transform(data, Month = format(Date, '%m'), Year = format(Date, '%Y'))
data
data <- structure(list(Date = c("2/06/2020 11:23", "12/06/2020 7:56",
"12/06/2020 7:56", "29/06/2020 16:54", "3/06/2020 15:09", "25/06/2020 17:11"
)), class = "data.frame", row.names = c(NA, -6L))
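Note that format() returns character strings ("06", "2020"). If numeric columns are wanted, as the question suggests (months 1-12), one base R option is to read the components from a POSIXlt object. A minimal sketch on a few of the sample values; the variable names are only for illustration:

```r
# A few of the sample date-times from the question.
dates_chr <- c("2/06/2020 11:23", "12/06/2020 7:56", "29/06/2020 16:54")

# POSIXlt stores the date components as separate integer fields.
lt <- as.POSIXlt(dates_chr, tz = "UTC", format = "%d/%m/%Y %H:%M")

# $mon counts from 0 (January) and $year counts from 1900.
month <- lt$mon + 1L
year <- lt$year + 1900L

data.frame(Date = dates_chr, Month = month, Year = year)
```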
I am trying to convert a column in my dataset that contains week numbers into weekly Dates. I was trying to use the lubridate package but could not find a solution. The dataset looks like the one below:
df <- tibble(week = c("202009", "202010", "202011","202012", "202013", "202014"),
Revenue = c(4543, 6764, 2324, 5674, 2232, 2323))
So I would like to create a Date column in a weekly format, e.g. (2020-03-07, 2020-03-14).
Would anyone know how to convert these week numbers into weekly dates?
Maybe there is a more automated way, but try something like this. I think this gets the right days; I checked against a 2020 calendar. If something is off, it's a matter of adjusting the (week - 1) * 7 - 1 component to return what you want.
This just grabs the first day of the year, adds x weeks worth of days, and then uses ceiling_date() to find the next Sunday.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
separate(week, c("year", "week"), sep = 4, convert = TRUE) %>%
mutate(date = ceiling_date(ymd(paste(year, "01", "01", sep = "-")) +
(week - 1) * 7 - 1, "week", week_start = 7))
# # A tibble: 6 x 4
# year week Revenue date
# <int> <int> <dbl> <date>
# 1 2020 9 4543 2020-03-01
# 2 2020 10 6764 2020-03-08
# 3 2020 11 2324 2020-03-15
# 4 2020 12 5674 2020-03-22
# 5 2020 13 2232 2020-03-29
# 6 2020 14 2323 2020-04-05
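The same arithmetic can be sketched in base R, which makes each step explicit. This is an illustration under the same assumption as above (a yyyyww string, rounded up to the following Sunday), not a replacement for ceiling_date(); the variable names are hypothetical.

```r
week_code <- c("202009", "202010", "202011", "202012", "202013", "202014")

# Split the code into its year and week parts.
year <- as.integer(substr(week_code, 1, 4))
week <- as.integer(substr(week_code, 5, 6))

# First day of the year plus (week - 1) weeks of days, minus one day.
base_day <- as.Date(paste0(year, "-01-01")) + (week - 1L) * 7L - 1L

# as.POSIXlt()$wday is 0 for Sunday, so this moves every date forward to the
# next Sunday (a date already on a Sunday moves a full week).
week_date <- base_day + (7L - as.POSIXlt(base_day)$wday)
week_date
```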
I want to create a dataframe from a given start and end date:
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2020-06-23")
For each row in this dataframe, I should have the start day and end day of the month, so the expected output is:
start end month year
2020-05-17 2020-05-31 May 2020
2020-06-01 2020-06-23 June 2020
I have tried to create a sequence, but I'm stuck on what to do next:
day_seq <- seq(start_date, end_date, 1)
Please, a base R or tidyverse solution will be greatly appreciated.
1) yearmon Using start_date and end_date from the question, create a yearmon sequence; each of the desired columns is then a simple one-line computation. The stringsAsFactors argument can be omitted from R 4.0 onwards, as FALSE is the default there.
library(zoo)
ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
data.frame(start = pmax(start_date, as.Date(ym)),
end = pmin(end_date, as.Date(ym, frac = 1)),
month = month.name[cycle(ym)],
year = as.integer(ym),
stringsAsFactors = FALSE)
giving:
start end month year
1 2020-05-17 2020-05-31 May 2020
2 2020-06-01 2020-06-23 June 2020
2) Base R This follows similar logic and gives the same answer. We first define a function month1 which, given a Date vector x, returns a Date vector of the same length holding the first day of each element's month.
month1 <- function(x) as.Date(cut(x, "month"))
months <- seq(month1(start_date), month1(end_date), "month")
data.frame(start = pmax(start_date, months),
end = pmin(end_date, month1(months + 31) - 1),
month = format(months, "%B"),
year = as.numeric(format(months, "%Y")),
stringsAsFactors = FALSE)
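To see why month1(months + 31) - 1 yields the last day of a month: adding 31 days to the first of any month always lands in the following month, month1() snaps that back to the first of that following month, and subtracting one day steps back to the month end. A small sketch with two hypothetical dates:

```r
month1 <- function(x) as.Date(cut(x, "month"))

# Two example dates, one in a leap-year February.
x <- as.Date(c("2020-02-10", "2020-05-17"))

# First day of each date's month.
first_day <- month1(x)

# First of the month + 31 days falls in the next month; snap to its first
# day and step back one day to get the last day of the original month.
last_day <- month1(first_day + 31) - 1

data.frame(x, first_day, last_day)
```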
It has been a while since I used the tidyverse, but here is my attempt.
sample data
Different sample data, to tackle some problems where the year changes:
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2021-06-23")
code
library( tidyverse )
library( lubridate )
#create a sequence of days from start to end
tibble( date = seq( start_date, end_date, by = "1 day" ) ) %>%
mutate( month = lubridate::month( date ),
year = lubridate::year( date ),
end = as.Date( paste( year, month, lubridate::days_in_month(date), sep = "-" ) ) ) %>%
#the end of the last group is now always larger than the maximum date... repair!
mutate( end = if_else( end > max(date), max(date), end ) ) %>%
group_by( year, month ) %>%
summarise( start = min( date ),
end = max( end ) ) %>%
select( start, end, month, year )
output
# # A tibble: 14 x 4
# # Groups: year [2]
# start end month year
# <date> <date> <dbl> <dbl>
# 1 2020-05-17 2020-05-31 5 2020
# 2 2020-06-01 2020-06-30 6 2020
# 3 2020-07-01 2020-07-31 7 2020
# 4 2020-08-01 2020-08-31 8 2020
# 5 2020-09-01 2020-09-30 9 2020
# 6 2020-10-01 2020-10-31 10 2020
# 7 2020-11-01 2020-11-30 11 2020
# 8 2020-12-01 2020-12-31 12 2020
# 9 2021-01-01 2021-01-31 1 2021
# 10 2021-02-01 2021-02-28 2 2021
# 11 2021-03-01 2021-03-31 3 2021
# 12 2021-04-01 2021-04-30 4 2021
# 13 2021-05-01 2021-05-31 5 2021
# 14 2021-06-01 2021-06-23 6 2021
For the specific period in your question, you may use:
library(lubridate)
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2020-06-23")
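start <- c(start_date, floor_date(end_date, unit = 'months'))
# ceiling_date() returns the first day of the next month, so step back one day
end <- c(ceiling_date(start_date, unit = 'months') - 1, end_date)
# abbr = FALSE gives full month names ("June" rather than "Jun")
month <- c(as.character(month(start[1], label = TRUE, abbr = FALSE)),
as.character(month(start[2], label = TRUE, abbr = FALSE)))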
year <- c(year(start[1]), year(start[2]))
data.frame(start, end, month, year, stringsAsFactors = FALSE)
Here is one approach using intervals with lubridate. You would create a full interval between the 2 dates of interest, and then intersect with monthly ranges for each month (first to last day each month).
library(tidyverse)
library(lubridate)
start_date <- as.Date("2020-05-17")
end_date <- as.Date("2021-08-23")
full_int <- interval(start_date, end_date)
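# Start the sequence on the first of the month so that no month is skipped
# when start_date falls later in its month than end_date does
month_seq <- seq(floor_date(start_date, "month"), end_date, by = "month")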
month_int = interval(floor_date(month_seq, "month"), ceiling_date(month_seq, "month") - days(1))
data.frame(interval = intersect(full_int, month_int)) %>%
mutate(start = int_start(interval),
end = int_end(interval),
month = month.abb[month(start)],
year = year(start)) %>%
select(-interval)
Output
start end month year
1 2020-05-17 2020-05-31 May 2020
2 2020-06-01 2020-06-30 Jun 2020
3 2020-07-01 2020-07-31 Jul 2020
4 2020-08-01 2020-08-31 Aug 2020
5 2020-09-01 2020-09-30 Sep 2020
6 2020-10-01 2020-10-31 Oct 2020
7 2020-11-01 2020-11-30 Nov 2020
8 2020-12-01 2020-12-31 Dec 2020
9 2021-01-01 2021-01-31 Jan 2021
10 2021-02-01 2021-02-28 Feb 2021
11 2021-03-01 2021-03-31 Mar 2021
12 2021-04-01 2021-04-30 Apr 2021
13 2021-05-01 2021-05-31 May 2021
14 2021-06-01 2021-06-30 Jun 2021
15 2021-07-01 2021-07-31 Jul 2021
16 2021-08-01 2021-08-23 Aug 2021
I've got a data set with reservation data in the format below:
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to have the monthly total_paid amount divided by days and assigned to each month correctly for budget reasons.
While there's no issue for casa1, casa2 and casa3 have days reserved in both months and the totals get skewed because of this issue.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
mutate(id = seq_along(property), # add a few helper variables
day_paid = total_paid / as.numeric(check_out - check_in),
date = check_in) %>%
group_by(id) %>%
complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>% # get date for each day of stay (except last)
ungroup() %>% # make one row per day of stay
mutate(month = cut(date, breaks = "month")) %>% # determine month of date
fill(property, check_in, check_out, total_paid, day_paid) %>%
group_by(id, month) %>%
summarise(property = unique(property),
check_in = unique(check_in),
check_out = unique(check_out),
total_paid = unique(total_paid),
paid_month = sum(day_paid)) # summarise per month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable but please ask if there is something I should explain. Convention is that people don't pay the last day of a stay, so I took that into account.
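The per-night allocation can also be sketched in base R for a single reservation, which may help when checking the numbers. This is an illustration of the same idea (one row per night, excluding the check-out day), not the pipeline above; it uses the casa2 values from the question.

```r
# One reservation from the question (casa2).
check_in <- as.Date("2018-01-30")
check_out <- as.Date("2018-02-03")
total_paid <- 110

# One date per night stayed; the check-out day itself is not paid for.
nights <- seq(check_in, check_out - 1, by = "day")

# Spread the total evenly over the nights.
per_night <- total_paid / length(nights)

# Sum the nightly amounts within each calendar month.
paid_month <- tapply(rep(per_night, length(nights)), format(nights, "%Y-%m"), sum)
paid_month
```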