Is there an R function to split data by month? - r

I have data in the following format:
content, date
Hello, 2019-05-11T23:59:02+00:00
Amazing, 2019-01-08T20:22:02+00:00
Come on, 2018-11-15T10:52:45+00:00
We won, 2018-08-25T16:33:23+00:00
This is only a sample; the full data set has over 1 million rows, with dates between August 2018 and May 2019. I would like to split my data into 10 different data frames, each one representing a specific month (i.e. 1 = August 2018, 2 = September 2018, ..., 10 = May 2019).
I tried a dplyr group-by approach and also a loop, but without success. I also tried code from other posts, to no avail.
Any help is much appreciated. I am new to Stack Overflow, so apologies if I did not adhere to the code of conduct.
Thank you in advance!

The lubridate package has functions that will meet your needs. The key is to convert the strings to Dates (or POSIXct date-times).
library(tidyverse)
library(lubridate)
df <- data.frame(content=c('H','A'),
date=c('2019-05-11T23:59:02+00:00', '2019-01-08T20:22:02+00:00'))
# plain assignment is used because %<>% needs library(magrittr), which tidyverse does not attach
df <- df %>%
mutate(date=ymd_hms(date)) %>%
mutate(monthGroup=floor_date(date, unit='month'))
You can either manually filter for each month using that information or put it in a loop/apply to make the computer do it.
# ymd(..., tz='UTC') returns a POSIXct, the same class as monthGroup
df %>%
filter(monthGroup==ymd('2019-05-01', tz='UTC'))
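To get one data frame per month in a single step, rather than filtering month by month, the month label can be passed to split(). The following is a minimal sketch in base R, assuming the column names from the question (content, date); the names month_group and by_month are made up for illustration.

```r
# Sample rows copied from the question.
df <- data.frame(
  content = c("Hello", "Amazing", "Come on", "We won"),
  date = c("2019-05-11T23:59:02+00:00", "2019-01-08T20:22:02+00:00",
           "2018-11-15T10:52:45+00:00", "2018-08-25T16:33:23+00:00"),
  stringsAsFactors = FALSE
)

# Parse the timestamps as UTC. The "+00:00" suffix is cut off first,
# which is only safe because every offset in the sample is zero.
df$date <- as.POSIXct(substr(df$date, 1, 19),
                      format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Label each row with its year and month, e.g. "2018-08".
df$month_group <- format(df$date, "%Y-%m")

# split() returns a named list with one data frame per label,
# ordered by the sorted labels.
by_month <- split(df, df$month_group)

names(by_month)
by_month[["2018-08"]]
```

Keeping the pieces in a named list is usually easier to work with than ten separate variables, since lapply() can then run the same code on every month.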
Another way without using floor_date()
df <- data.frame(content=c('H','A'),
date=c('2019-05-11T23:59:02+00:00', '2019-01-08T20:22:02+00:00'))
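Get all April 2019 dates, that is, dates on or after 01 April 2019 and before 01 May 2019.
# tz='UTC' makes ymd() return POSIXct, so both sides of each comparison have the same class
df %>%
mutate(date=ymd_hms(date)) %>%
filter(date<ymd('2019-05-01', tz='UTC') &
date>=ymd('2019-04-01', tz='UTC'))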

Related

Filter Data by Seasonal Ranges Over Several Years Based on Month and Day Column in R Studio

I am trying to filter a large dataset to contain results between a range of days and months over several years to evaluate seasonal objectives. My season is defined as 15 March through 15 September. I can't figure out how to filter the days so that the day limits apply only to March and September and not to the other months within the range. My data frame is very large and contains proprietary information, but I think the most important information is that the dates are described by three columns: SampleDate (date formatted as %y%m%d), day (numeric), and month (numeric).
I have tried filtering using multiple conditions like so:
S1 <- S1 %>%
filter((S1$month >= 3 & S1$day >=15) , (S1$month<=9 & S1$day<=15 ))
I also attempted to set ranges using between for every year that I have data with no luck:
S1 %>% filter(between(SampleDate, as.Date("2010-03-15"), as.Date("2010-09-15") &
as.Date("2011-03-15"), as.Date("2011-09-15")&
as.Date("2012-03-15"), as.Date("2012-09-15")&
as.Date("2013-03-15"), as.Date("2013-09-15")&
as.Date("2014-03-15"), as.Date("2014-09-15")&
as.Date("2015-03-15"), as.Date("2015-09-15")&
as.Date("2016-03-15"), as.Date("2016-09-15")&
as.Date("2017-03-15"), as.Date("2017-09-15")&
as.Date("2018-03-15"), as.Date("2018-09-15")))
I am pretty new to R and can't find any solution online. I know there must be a somewhat simple way to do this! Any help is greatly appreciated!
Maybe something like this:
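library(data.table)
# setDT() converts df to a data.table by reference
setDT(df)
# convert a date like '2020-01-01' into '01-01' (base substr() avoids needing stringr for str_sub())
df[, month_day := substr(as.character(date), 6, 10)]
# zero-padded 'mm-dd' strings sort in calendar order, so a string comparison works
df[month_day >= '03-15' & month_day <= '09-15']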
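The first attempt in the question fails because month >= 3 & day >= 15 applies the day limit to every month, so a date such as 1 June is dropped. One way around this, sketched here in base R, is to fold month and day into a single number and compare that. The data is made up; only the SampleDate column name comes from the question, and mmdd and in_season are illustrative names.

```r
# Hypothetical sample dates spanning two years.
S1 <- data.frame(
  SampleDate = as.Date(c("2010-03-14", "2010-03-15", "2010-06-01",
                         "2010-09-15", "2010-09-16", "2011-04-20"))
)

# Encode month and day as one integer: 15 March -> 315, 15 September -> 915.
mmdd <- as.integer(format(S1$SampleDate, "%m%d"))

# Keep rows from 15 March through 15 September (inclusive) in any year.
in_season <- S1[mmdd >= 315 & mmdd <= 915, , drop = FALSE]
in_season
```

Because the year is ignored, the same two bounds cover every season in the data without listing each year separately.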

How can I convert a characters into dates in RStudio?

I'm still new to R. I want to create a simple bar chart of the number of burglaries per month in my city. I found that the column 'Occurence_Date' is a character; I want it to be a date or time type, or something simpler, so that I can create a visualization. The x-axis should show the months January to June 2019, and the y-axis the number of burglaries per month. Can anyone help me get started on this, please? Thanks!
This is my data frame
The lubridate package is very helpful for working with dates and times in R.
# install.packages("lubridate") ## only run once
library(lubridate)
df$Occurence_Date <- ymd(df$Occurence_Date) # converts text in year month day format, ignores time
Generally it's better to put example data in your question so people can work with it and show an example.
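Once the column is a Date, the monthly counts for the bar chart could be built roughly as follows. This is a sketch in base R with made-up dates, since the question's data was not posted; occ, month_levels and counts are illustrative names.

```r
# Hypothetical occurrence dates standing in for df$Occurence_Date.
occ <- as.Date(c("2019-01-03", "2019-01-20", "2019-02-11",
                 "2019-04-05", "2019-06-30"))

# Fix the month labels for January to June 2019 so that months with
# no occurrences still appear with a count of zero.
month_levels <- format(seq(as.Date("2019-01-01"), as.Date("2019-06-01"),
                           by = "month"), "%Y-%m")

# Count occurrences per month.
counts <- table(factor(format(occ, "%Y-%m"), levels = month_levels))

# Months on the x-axis, number of burglaries on the y-axis.
barplot(counts, xlab = "Month", ylab = "Burglaries")
```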

Importing Time Series In r

I am trying to implement time series analysis on a data set which has two attributes (Year & Sales). The years are 2016, 2017 and 2018, and for each there are average sales values for all 12 months. My data looks like this:
JAN FEB MAR APR MAY JUNE
2016 4457. 4,105 4,276 4712. 5,116 4,512
2017 4,222 5,432 4,816 5,018 4,497 4,603
2018 4,355 4,972 4,868 4,665 4,735 4,926
This is just part of my data set, to give an idea of what it looks like. The months run from JAN to DEC. First, how do I import this data set into R? I obviously cannot import it as it is, because R names the columns X1, X2, etc., and these become too many variables. Second, R treats this data set as a "data.frame". How can I convert it into just "ts"? I have tried
data.ts<- as.ts(myData)
but it converts it into
"mts" "ts" "matrix"
and moreover, it shows my frequency as 1 while it should be 12. Please help me. I am stuck at the very start.
First, restructure your data into long format, which can be done with the gather function from tidyr.
library(tidyr)
myData <- myData %>% tidyr::gather(timeperiod, sales, JAN:DEC)
Then, once the rows are sorted chronologically (by year, then by month), you can create the time series with ts(), which takes start and frequency arguments (as.ts() does not):
sales_ts <- ts(myData$sales, start=c(2016,1), frequency=12)
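The ordering matters because ts() reads its input as one continuous sequence. A base R sketch of the whole conversion, assuming one row per year and one column per month, might look like this. The sales figures are made up, and the Year column name and the month column names are assumptions, not taken from the question's file.

```r
# Made-up sales values: one row per year, one column per month.
wide <- data.frame(
  Year = 2016:2017,
  matrix(1:24, nrow = 2, byrow = TRUE,
         dimnames = list(NULL, toupper(month.abb)))
)

# Drop the Year column and transpose so each year becomes a column;
# as.vector() then reads column by column, which gives
# Jan 2016, Feb 2016, ..., Dec 2016, Jan 2017, ..., Dec 2017.
values <- as.vector(t(as.matrix(wide[, -1])))

# Monthly series starting in January 2016.
sales_ts <- ts(values, start = c(2016, 1), frequency = 12)

class(sales_ts)
frequency(sales_ts)
```

Passing a plain vector rather than the whole data frame is what keeps the result a univariate "ts" instead of an "mts" matrix.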

How to select rows from a dataset between two dates?

I have a fairly large dataset (35 variables and 65 000 rows) and I would like to split it in three based on specific dates. I have information about animals before and after a surgery. I'm currently using the dplyr package. Below is what my dataset looks like. I only give an example, because running dput on my dataset produces something very large and unreadable. As in the example, I have several dates at which measurements were taken for an individual. The information about the individual is completed by the surgery date, which is unique for each individual. As in the example, measurements were taken over several years.
Name Date Measurement Surgery_date
Pierre 2016-03-15 5.12 2017-03-21
Pierre 2017-03-16 4.16 2017-03-21
Pierre 2017-08-09 5.08 2017-03-21
Paul 2016-07-03 5.47 2017-03-25
Paul 2016-09-30 4.98 2017-03-25
Paul 2017-04-12 4.51 2017-03-25
So far I've been careful to store both the measurement dates and the surgery dates in a date format, using the lubridate package. Then I tried to sort my data using the dplyr package. I tried filter and select, but neither gave the expected result.
data1$Date <- parse_date_time(data1$Date, "d/m/y")
data1$Date <- ymd(data1$Date)
data1$Surgery_date <- parse_date_time(data1$Surgery_date, "d/m/y")
data1$Surgery_date <- ymd(data1$Surgery_date)
before_surgery <- data1
before_surgery <- dplyr::as_tibble(before_surgery)
before_surgery <- before_surgery %>%
filter(Date > Surgery_date)
before_surgery <- before_surgery %>%
select(Date < Surgery_date)
Either way, no row is deleted. When I try (by the same means) to obtain the dates after surgery, no row is selected at all.
I have checked my file to be sure there are dates both before and after the surgery date (if not, this result would have been normal), and I can confirm both kinds of dates are in the dataset.
I have only shown the example for the dates before surgery, assuming the dates after surgery follow the same pattern.
Thank you in advance to those who take the time to read this. I'm sorry if the question is quite similar to other ones, but I have not been able to figure out a solution on my own...
EDIT: To be more specific, the ultimate goal is to have three separate datasets. The first one would cover all measurements taken before the surgery, the second the day of the surgery itself + 5 days (but I'll try to handle this one later on), and the third one would cover measurements taken after the surgery.
The solution to what you are asking is straightforward, because you can in fact filter on dates and compare dates in multiple columns. Please try the code below and confirm for yourself that this works as you would expect. If this approach does not work on your own dataset, please share more about your data and processing because there is probably an error in your code. (One error I already saw: you can't use select(Date < Surgery_date). You need to use filter).
This is how I would approach your problem. As you can see, the code is very straightforward.
library(dplyr)
library(lubridate)
df <- data.frame(
Name = c(rep('Pierre', 3), rep('Paul', 3)),
Date = c('2016-03-15', '2017-03-26', '2017-08-09', '2016-07-03', '2016-09-30', '2017-04-12'),
Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
Surgery_date = c(rep('2017-03-21', 3), rep('2017-03-25', 3))
) %>%
mutate(Surgery_date = ymd(Surgery_date),
Date = ymd(Date))
# measurements taken before the surgery
df %>%
filter(Date < Surgery_date)
# measurements taken on the day of the surgery or within the 5 days after it
df %>%
filter(Date >= Surgery_date & Date <= (Surgery_date + days(5)))
# measurements taken after the surgery (this also includes the 5-day window above)
df %>%
filter(Date > Surgery_date)
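If three non-overlapping datasets are wanted, one option is to label each row with its period and split once. This is only a sketch in base R using the example data from the answer above; the offset and period columns and the window definition (surgery day through 5 days after, inclusive) are assumptions about what the question intends.

```r
# Example data from the answer above, with dates parsed by as.Date().
df <- data.frame(
  Name = c(rep("Pierre", 3), rep("Paul", 3)),
  Date = as.Date(c("2016-03-15", "2017-03-26", "2017-08-09",
                   "2016-07-03", "2016-09-30", "2017-04-12")),
  Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
  Surgery_date = as.Date(c(rep("2017-03-21", 3), rep("2017-03-25", 3)))
)

# Whole days between each measurement and the surgery (negative = before).
df$offset <- as.numeric(difftime(df$Date, df$Surgery_date, units = "days"))

# cut() uses right-closed intervals by default:
# (-Inf, -1] = before, (-1, 5] = day 0 to day 5, (5, Inf) = after.
df$period <- cut(df$offset, breaks = c(-Inf, -1, 5, Inf),
                 labels = c("before", "surgery_window", "after"))

# One data frame per period, in a named list.
parts <- split(df, df$period)
sapply(parts, nrow)
```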

Date objects in decimals spanning multiple leap years

I would like to figure out a way to convert a day into a decimal where January 1 is 0 and December 31 is 1. No times here, just days. I looked at a few solutions like here and here, but neither of those solutions seems to fit my problem. I also had hopes for the date_decimal function in lubridate. I have figured out a solution which involves converting the Date into a day number, merging in a data frame that accounts for leap years, and then dividing the day number by the total number of days in the year.
library(lubridate)
library(dplyr)
df <- data.frame(Date=seq(as.Date("2003/2/10"), as.Date("2007/2/10"), "years"),
var=seq(1,5, by=1))
lubridate function attempt:
date_decimal(df$Date)
Leap year dataframe
maxdaydf<-data.frame(Year=seq(2003,2007,by=1), maxdays=c(365,366,365,365,365))
A dplyr pipe to generate the daydecimal:
df %>%
mutate(Year=year(Date), daynum=yday(Date)) %>%
full_join(maxdaydf, by=c("Year")) %>%
mutate(daydecimal=daynum/maxdays)
But as I said this is clunky and involves a 2nd dataframe which is never ideal. Any suggestions on how I can convert some Dates into decimals?
Instead of date_decimal() you could use decimal_date()
decimal_date(df$Date)
[1] 2003.110 2004.109 2005.110 2006.110 2007.110
Or you can use :
yday(df$Date)/yday(ISOdate(year(df$Date), 12,31))
[1] 0.1123288 0.1120219 0.1123288 0.1123288 0.1123288
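Note that dividing the day of year by the number of days in the year maps January 1 to 1/365 rather than 0. If January 1 must be exactly 0 and December 31 exactly 1, both the numerator and the denominator can be shifted by one. A base R sketch with made-up dates follows; the variable names are illustrative.

```r
# Hypothetical dates covering a normal year (2003) and a leap year (2004).
dates <- as.Date(c("2003-01-01", "2003-12-31", "2004-12-31", "2004-07-02"))

year <- as.integer(format(dates, "%Y"))
doy  <- as.integer(format(dates, "%j"))   # day of year, 1 to 365 or 366

# Gregorian leap-year rule.
is_leap <- (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
days_in_year <- ifelse(is_leap, 366, 365)

# Shift by one so that the first day maps to 0 and the last day to 1.
day_fraction <- (doy - 1) / (days_in_year - 1)
day_fraction
```

This avoids the second lookup data frame, because the leap-year adjustment is computed directly from each date's year.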
