I have a data set covering 10 years. I want to subset the data by fiscal year using a date variable. The date variable is stored as character. For instance, I want to select the data for fiscal year 2020, which runs from 01-10-2019 to 30-09-2020. How can I do this in R?
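Here is an example using the zoo package. Note that the offset used here (- 4/12 + 1) rolls the fiscal year over in May; for a fiscal year that starts in October, as in the question, the offset would be + 3/12 instead: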
df1 <- structure(list(dateA = structure(c(14974, 18628, 14882, 16800,
17835, 16832, 16556, 15949, 16801), class = "Date")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
library(zoo)
df1 %>%
  mutate(fiscal_year = as.integer(as.yearmon(dateA) - 4/12 + 1))
Output:
  dateA      fiscal_year
  <date>           <int>
1 2010-12-31        2011
2 2021-01-01        2021
3 2010-09-30        2011
4 2015-12-31        2016
5 2018-10-31        2019
6 2016-02-01        2016
7 2015-05-01        2016
8 2013-09-01        2014
9 2016-01-01        2016
As @r2evans said, you should post a minimal reprex.
However, with the little information you posted, this worked example may fit:
date_vect <- c('01-10-2019','30-07-2020','15-07-2019','03-03-2020')
date_vect[substr(date_vect, 7, 10) == "2020"]
This assumes you have a vector of dates stored as strings in dd-mm-yyyy form. It picks all the strings whose last four characters equal "2020" (the year you're interested in). Note that this matches the calendar year, not the fiscal year: "01-10-2019" is not selected even though it falls in fiscal year 2020.
P.S.: It's good practice to store dates in a proper date class rather than as strings. This unlocks other features, such as ordering, in R's date libraries.
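For the fiscal-year selection asked about in the question, a minimal base R sketch could look like the following. It assumes the character dates are in dd-mm-yyyy form and that a fiscal year is named after the calendar year in which it ends (so fiscal year 2020 runs from 2019-10-01 to 2020-09-30). The vector date_chr and the helper fiscal_year() are made up for illustration.

```r
# Hypothetical character dates in dd-mm-yyyy form (not the asker's data)
date_chr <- c("01-10-2019", "30-07-2020", "15-07-2019",
              "03-03-2020", "30-09-2020", "01-10-2020")

# Parse the strings into Date objects so they can be compared properly
dates <- as.Date(date_chr, format = "%d-%m-%Y")

# Illustrative helper: fiscal year starting in `start_month`,
# named after the calendar year in which it ends
fiscal_year <- function(x, start_month = 10) {
  yr  <- as.integer(format(x, "%Y"))
  mon <- as.integer(format(x, "%m"))
  yr + as.integer(mon >= start_month)
}

fy <- fiscal_year(dates)

# Keep only the entries that fall in fiscal year 2020
fy2020 <- date_chr[fy == 2020]
fy2020
```

With a data frame, the same logical vector (fy == 2020) can be used to index rows instead of a character vector.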
Related
I would like to use difftime() to compute the difference between two date/time variables, both of class POSIXct. But sometimes one (or both) of the values is missing (NA), as shown below.
Start time            Antibiotic time
2016-06-28 08:36:00   NA
2019-10-30 10:43:00   2019-10-30 10:11:56
NA                    NA
I want: start time - antibiotic time
Like:
Antibiotica$ABS <- difftime(Antibiotica$StartTime, Antibiotica$AntibioticTime, units=c("mins"), na.rm=TRUE)
But now, I get an error. I think it is because of the wrong use of na.rm=TRUE.
How do I handle this the right way?
As Roland points out in the comments, it's not clear that you should remove NA values. If there is a start time but the antibiotic time is NA, then the time difference should also be NA. If both times are NA, then again the time difference should be NA.
If you were to remove all the NA values in the resulting difftime, then you will only get results for those rows with complete data, but then these will no longer match up to your Antibiotica data frame. In your little example data frame for example, you would only get a single non-NA result. How would you store that in a column?
From your example, your code should work like this:
Antibiotica$ABS <- difftime(Antibiotica$StartTime, Antibiotica$AntibioticTime)
Antibiotica
#> StartTime AntibioticTime ABS
#> 1 2016-06-28 08:36:00 <NA> NA mins
#> 2 2019-10-30 10:43:00 2019-10-30 10:11:56 31.06667 mins
#> 3 <NA> <NA> NA mins
If you're not getting this result, you might need to make sure that your columns are in an actual date-time format (e.g. ensure class(Antibiotica$StartTime) is not "character").
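If the columns turn out to be stored as character, a conversion along these lines may be needed first. This is only a sketch: the vectors start_chr and abx_chr are made up to mirror the example, and the format string and time zone are assumptions about how the timestamps were recorded.

```r
# Hypothetical character timestamps, including missing values
start_chr <- c("2016-06-28 08:36:00", "2019-10-30 10:43:00", NA)
abx_chr   <- c(NA, "2019-10-30 10:11:56", NA)

# Convert to POSIXct; the format string and time zone are assumptions
fmt   <- "%Y-%m-%d %H:%M:%S"
start <- as.POSIXct(start_chr, format = fmt, tz = "UTC")
abx   <- as.POSIXct(abx_chr,   format = fmt, tz = "UTC")

# difftime() has no na.rm argument; a missing input simply yields NA
delay <- difftime(start, abx, units = "mins")
delay
```

Fixing units = "mins" keeps the column in minutes regardless of how large or small the differences are.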
Once you have done the calculation, if you only want the complete cases, you can do
Antibiotica[complete.cases(Antibiotica),]
#> StartTime AntibioticTime
#> 2 2019-10-30 10:43:00 2019-10-30 10:11:56
Data used
Antibiotica <- structure(list(StartTime = structure(c(1467102960, 1572432180, NA),
class = c("POSIXct", "POSIXt"), tzone = ""),
AntibioticTime = structure(c(NA, 1572430316, NA),
class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA, -3L),
class = "data.frame")
Antibiotica
#> StartTime AntibioticTime
#> 1 2016-06-28 08:36:00 <NA>
#> 2 2019-10-30 10:43:00 2019-10-30 10:11:56
#> 3 <NA> <NA>
Created on 2022-01-31 by the reprex package (v2.0.1)
I am trying to convert a column in my dataset that contains week numbers into weekly Dates. I was trying to use the lubridate package but could not find a solution. The dataset looks like the one below:
df <- tibble(week = c("202009", "202010", "202011","202012", "202013", "202014"),
Revenue = c(4543, 6764, 2324, 5674, 2232, 2323))
So I would like to create a Date column in a weekly format, e.g. (2020-03-07, 2020-03-14).
Would anyone know how to convert these week numbers into weekly dates?
Maybe there is a more automated way, but try something like this. I think this gets the right days; I checked against a 2020 calendar and counted. If something is off, it's a matter of adjusting the (week - 1) * 7 - 1 component to return what you want.
This just grabs the first day of the year, adds x weeks worth of days, and then uses ceiling_date() to find the next Sunday.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  separate(week, c("year", "week"), sep = 4, convert = TRUE) %>%
  mutate(date = ceiling_date(ymd(paste(year, "01", "01", sep = "-")) +
                               (week - 1) * 7 - 1, "week", week_start = 7))
# # A tibble: 6 x 4
# year week Revenue date
# <int> <int> <dbl> <date>
# 1 2020 9 4543 2020-03-01
# 2 2020 10 6764 2020-03-08
# 3 2020 11 2324 2020-03-15
# 4 2020 12 5674 2020-03-22
# 5 2020 13 2232 2020-03-29
# 6 2020 14 2323 2020-04-05
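The same arithmetic can be sketched in base R, which may make it easier to see what the (week - 1) * 7 - 1 component does. The variable names are illustrative. Unlike ceiling_date(), this version leaves a date that already falls on a Sunday unchanged, which is a choice made for this sketch and does not affect the six weeks in the example.

```r
# Week codes in "YYYYWW" form, as in the question
week_code <- c("202009", "202010", "202011", "202012", "202013", "202014")

year <- as.integer(substr(week_code, 1, 4))
week <- as.integer(substr(week_code, 5, 6))

# Same starting point as above: 1 January plus whole weeks, minus one day
base_date <- as.Date(paste0(year, "-01-01")) + (week - 1) * 7 - 1

# Days to add to reach a Sunday (wday: 0 = Sunday, ..., 6 = Saturday)
wday <- as.POSIXlt(base_date)$wday
week_date <- base_date + (7 - wday) %% 7
week_date
```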
I have a very large dataset, which is why I would like to find a simpler way to handle this:
I would like to identify or subset those observations where the event date is later than the fiscal year date. An additional condition would be that, out of the observations identified in the previous sentence, I would only want those event dates that lie between 31st May and the fiscal year. If it is not possible to apply such a condition, maybe one could apply that the event date lies between 31st May and 1st Jan?
For example, we have the following
fiscal year date   event date
2010-04-30         2010-05-03
2016-03-31         2016-04-28
2020-01-31         2020-02-10
2019-08-30         2019-06-03
2009-07-31         2009-10-10
2003-03-31         2003-02-18
2012-06-30         2012-03-10
From the data above, only the first three observations would be kept when applying the conditional code. Any help is super appreciated, thank you! :)
Using tidyverse:
library(tidyverse)
d %>%
  mutate(MayDate = as.Date(paste0(lubridate::year(fiscal_year_date), "-05-31"))) %>%
  filter(event_date > fiscal_year_date & event_date <= MayDate)
# fiscal_year_date event_date MayDate
# 1 2010-04-30 2010-05-03 2010-05-31
# 2 2016-03-31 2016-04-28 2016-05-31
# 3 2020-01-31 2020-02-10 2020-05-31
data
d <- structure(list(fiscal_year_date = structure(c(14729, 16891, 18292,
18138, 14456, 12142, 15521), class = "Date"), event_date = structure(c(14732,
16919, 18302, 18050, 14527, 12101, 15409), class = "Date")),
class = "data.frame", row.names = c(NA, -7L))
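For a very large dataset, the same condition can also be written as a plain logical vector in base R, without the pipeline. This is a sketch under the same reading of the question (event after the fiscal year date, but no later than 31 May of that year); the names may_date, keep and kept are illustrative.

```r
# Same dates as the example data, written out as strings for readability
d <- data.frame(
  fiscal_year_date = as.Date(c("2010-04-30", "2016-03-31", "2020-01-31", "2019-08-30",
                               "2009-07-31", "2003-03-31", "2012-06-30")),
  event_date       = as.Date(c("2010-05-03", "2016-04-28", "2020-02-10", "2019-06-03",
                               "2009-10-10", "2003-02-18", "2012-03-10"))
)

# 31 May of the calendar year of each fiscal year date
may_date <- as.Date(paste0(format(d$fiscal_year_date, "%Y"), "-05-31"))

# Event after the fiscal year date, but no later than 31 May
keep <- d$event_date > d$fiscal_year_date & d$event_date <= may_date
kept <- d[keep, ]
kept
```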
I have sampling dates in a data frame, as shown in this sample data frame:
Date Temp
2016-06-11 5
2017-08-19 12
2018-01-21 13
2019-04-28 7
The date column is currently in numeric format. I want to convert the numeric month (e.g. 06) into its full name (e.g. June) but am having trouble with the conversion.
I did check the converting dates to names question but was confused by the select DATENAME.
You may simply use months(). Example:
d <- transform(d, date.m = months(date))
d
# date x date.m
# 1 2020-10-01 -1.1390886 October
# 2 2020-11-01 -0.6872151 November
# 3 2020-12-01 1.0632769 December
# 4 2021-01-01 1.7351265 January
Note: If your date column is not of class "Date", you also need to wrap it in as.Date():
d <- transform(d, date.m = months(as.Date(date)))
Data:
d <- structure(list(date = structure(c(18536, 18567, 18597, 18628), class = "Date"),
x = c(-1.13908860117162, -0.687215137639502, 1.06327693201579,
1.73512650928455)), class = "data.frame", row.names = c(NA,
-4L))
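As an alternative sketch, the month number can be extracted with format() and looked up in the built-in month.name vector, which always returns English names regardless of locale. The dates below are the ones from the question; the variable names are illustrative.

```r
# Dates from the question's sample data
dates <- as.Date(c("2016-06-11", "2017-08-19", "2018-01-21", "2019-04-28"))

# Numeric month (1-12) extracted from each date
month_num <- as.integer(format(dates, "%m"))

# month.name is a built-in character vector of English month names
month_full <- month.name[month_num]
month_full

# format(dates, "%B") also returns full month names,
# but in the language of the current locale
```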
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>%
  # To summarise by month instead: group_by(month = month(time, label=TRUE))
  group_by(day = date(time)) %>%
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
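The same normalisation can be sketched in base R with grepl() and tapply(), here on three made-up tweets modelled on the question's columns. The data frame tweets and the hashtag being matched are illustrative assumptions.

```r
# Three hypothetical tweets modelled on the question's columns
tweets <- data.frame(
  day   = c(20L, 20L, 1L),
  month = c(2L, 2L, 3L),
  year  = c(2016L, 2016L, 2016L),
  what  = c("Great goal by #lfc", "Can't wait for the weekend", "Generic boring tweet")
)

# Build a proper Date from the separate day/month/year columns
tweets$date <- as.Date(sprintf("%d-%02d-%02d", tweets$year, tweets$month, tweets$day))

# TRUE for tweets containing the hashtag (fixed = TRUE: match literally, not as a regex)
hits <- grepl("#lfc", tweets$what, fixed = TRUE)

# Share of matching tweets per month: the mean of a logical vector is a proportion
share_by_month <- tapply(hits, format(tweets$date, "%Y-%m"), mean)
share_by_month
```

Grouping by format(tweets$date, "%Y-%m-%d") instead would give the share per day.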
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var="Month"),
       aes(Month, value, colour=variable, group=variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
  labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
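A small sketch of why this matters for the '2.13' versus '2.7' problem described in the question: compared as decimals the two values sort the wrong way round, while Date objects compare by their underlying day counts. The variable names are illustrative.

```r
# 13 February and 7 February 2016 as proper dates
feb13 <- as.Date("2016-02-13")
feb07 <- as.Date("2016-02-07")

# As decimals, "2.13" sorts before "2.7", which is the wrong order
decimal_order_wrong <- 2.13 < 2.7

# As dates, 7 February correctly comes before 13 February
date_order_right <- feb07 < feb13

# Underlying numeric values: days since 1970-01-01 for Date,
# seconds since 1970-01-01 00:00:00 UTC for POSIXct
days_apart   <- as.numeric(feb13) - as.numeric(feb07)
one_day      <- as.numeric(as.Date("1970-01-02"))
sixty_second <- as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))
```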