Format Dates in R to show only the year - r

I have a date column in R with class "factor". I want to extract the year from it. Please advise.
Example data:
S No. Customer Month amount
1 A 01-01-2020 1500
2 B 23-02-2020 2000
3 C 15-03-2020 2500

library(lubridate) # provides year()
data$Month <- as.character(data$Month)
data$Month <- as.Date(data$Month, "%d-%m-%Y")
data$Year <- year(data$Month)

Try this:
df <- read.table(text="S 'No. Customer' Month amount
1 A 01-01-2020 1500
2 B 23-02-2020 2000
3 C 15-03-2020 2500", header = TRUE)
df$year <- format(as.Date(df$Month, "%d-%m-%Y"), "%Y")
#> S No..Customer Month amount year
#> 1 1 A 01-01-2020 1500 2020
#> 2 2 B 23-02-2020 2000 2020
#> 3 3 C 15-03-2020 2500 2020
Created on 2020-04-10 by the reprex package (v0.3.0)
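Note that format() returns the year as a character string. If a numeric year is needed, a minimal base R sketch is shown here; the data frame below is a hypothetical rebuild of the example data with Month stored as a factor, as described in the question:

```r
# Hypothetical rebuild of the question's data, with Month stored as a factor
data <- data.frame(
  Customer = c("A", "B", "C"),
  Month    = factor(c("01-01-2020", "23-02-2020", "15-03-2020")),
  amount   = c(1500, 2000, 2500)
)

# Go through character explicitly so the factor labels (not the codes) are parsed
dates <- as.Date(as.character(data$Month), format = "%d-%m-%Y")

# format() gives "2020" as text; as.integer() turns it into a number
data$Year <- as.integer(format(dates, "%Y"))
data
```

Keeping the parsed dates in a separate vector leaves the original Month column untouched.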


Calculate number of negative values between two dates

I have a data frame of SPEI values. I want to calculate two statistics (explained below) at two intervals:
Every 20 years, i.e. 2021-2040, 2041-2060, 2061-2080, 2081-2100.
Each year, i.e. 2021, 2022, 2023, etc. until 2100.
The first column contains the date (month-year) and the second the SPEI value.
The statistics are:
Drought frequency: Number of times SPEI < 0 in the specified period (20 years and 1 year respectively)
Drought Duration: Equal to the number of months between its start (included) and end month (not included) of the specified period. I am assuming a drought event starts when SPEI < 0.
I was wondering if there's a way to do that in R? It seems like an easy problem, but I don't know how to do it. Please help me out. Excel is taking too long. Thanks.
> head(test, 20)
Date spei-3
1 2021-01-01 NA
2 2021-02-01 NA
3 2021-03-01 -0.52133737
4 2021-04-01 -0.60047887
5 2021-05-01 0.56838399
6 2021-06-01 0.02285012
7 2021-07-01 0.26288462
8 2021-08-01 -0.14314685
9 2021-09-01 -0.73132256
10 2021-10-01 -1.23389220
11 2021-11-01 -1.15874943
12 2021-12-01 0.27954143
13 2022-01-01 1.14606657
14 2022-02-01 0.66872986
15 2022-03-01 -1.13758050
16 2022-04-01 -0.27861017
17 2022-05-01 0.99992395
18 2022-06-01 0.61024314
19 2022-07-01 -0.47450485
20 2022-08-01 -1.06682997
I would very much like to add some code, but I don't know where to start.
test <- readxl::read_excel("E:/drought.xlsx") # assuming the file is read with the readxl package
#Extract year and month and add it as a column
test$Year = format(test$Date,"%Y")
test$Month = format(test$Date,"%B")
I don't know how to go from here. I found that cumsum can help, but how do I select one year and then apply cumsum on it. I am not withholding code on purpose. I just don't know where or how to begin.
There are a couple of questions in the OP's post, so I will go through them step by step. You'll need dplyr and lubridate for this workflow.
First, we create some fake data to use:
library(dplyr)
library(lubridate)
#create example data
dd<- data.frame(Date = seq.Date(as.Date("2021-01-01"), as.Date("2100-12-01"), by = "month"),
spei = rnorm(960,0,2))
With the grouping and drought columns added in the next step, the data will look like this, similar to what you have above:
> head(dd)
Date spei year year_20 drought
1 2021-01-01 -6.85689789 2021 2021_2040 1
2 2021-02-01 -0.09292459 2021 2021_2040 1
3 2021-03-01 0.13715922 2021 2021_2040 0
4 2021-04-01 2.26805601 2021 2021_2040 0
5 2021-05-01 -0.47325008 2021 2021_2040 1
6 2021-06-01 0.37034138 2021 2021_2040 0
Then we can use lubridate and cut to create our yearly and 20-year variables to group by later and create a column drought signifying if spei was negative.
#create a column to group on by year and by 20-year
dd <- dd %>%
mutate(year = year(Date),
year_20 = cut(year, breaks = c(2020,2040,2060,2080, 2100), include.lowest = T,
labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))) %>%
#column signifying if that month was a drought
mutate(drought = ifelse(spei<0,1,0))
Once we have that, we just use the group_by function to get frequency (or number of months with a drought) by year or 20-year period
#by year
dd %>%
group_by(year) %>%
summarise(year_freq = sum(drought))
# A tibble: 80 x 2
year year_freq
<dbl> <dbl>
1 2021 6
2 2022 4
3 2023 7
4 2024 6
5 2025 6
6 2026 7
#by 20-year group
dd %>%
group_by(year_20) %>%
summarise(year20_freq = sum(drought))
# A tibble: 4 x 2
year_20 year20_freq
<fct> <dbl>
1 2021_2040 125
2 2041_2060 121
3 2061_2080 121
4 2081_2100 132
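The fake data has no missing values, but the real series in the question starts with NA. A minimal base R sketch of the same frequency counts that skips missing months is given below; it uses the first 20 rows from the question (rounded), and the names test, year_freq and year20_freq are only illustrative:

```r
# First 20 months from the question (values rounded); the first two are missing
test <- data.frame(
  Date = seq(as.Date("2021-01-01"), by = "month", length.out = 20),
  spei = c(NA, NA, -0.5213, -0.6005, 0.5684, 0.0229, 0.2629, -0.1431,
           -0.7313, -1.2339, -1.1587, 0.2795, 1.1461, 0.6687, -1.1376,
           -0.2786, 0.9999, 0.6102, -0.4745, -1.0668)
)

# Grouping variables: calendar year and 20-year period
test$year <- as.integer(format(test$Date, "%Y"))
test$year_20 <- cut(test$year,
                    breaks = c(2020, 2040, 2060, 2080, 2100),
                    labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))

# Number of months with SPEI < 0 per group; na.rm = TRUE ignores missing months
year_freq   <- tapply(test$spei < 0, test$year, sum, na.rm = TRUE)
year20_freq <- tapply(test$spei < 0, test$year_20, sum, na.rm = TRUE)
year_freq
year20_freq
```

Periods with no rows in this 20-month sample come back as NA from tapply; with the full series through 2100 every period would be populated.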
Calculating drought duration is a bit more complicated. It involves:
1. identifying the first month of each drought
2. calculating the length of each drought
3. combining the information from 1 and 2
We can use lag from dplyr (not the base stats::lag, which behaves differently) to identify when a month changed from "no drought" to "drought". In this case we want an index of where the value in row i is different from that in row i-1
# find index of where values change.
change.ind <- dd$drought != dplyr::lag(dd$drought)
#use index to find drought start
drought.start <- dd[change.ind & dd$drought == 1,]
This results in a subset of the initial dataset, but only with the rows with the first month of a drought. Then we can use rle to calculate the length of the drought. rle will calculate the length of every run of numbers, so we will have to subset to only those runs where the value==1 (drought)
#calculate drought lengths
drought.lengths <- rle(dd$drought)
# we only want droughts (values = 1)
drought.lengths <- drought.lengths$lengths[drought.lengths$values==1]
Now we can combine these two pieces of information. If the series starts in a drought month, the first row of drought.start is all NA because there is no value at i-1 to compare against. It can be dropped, unless you want to include that data.
drought.dur <- cbind(drought.start, drought_length = drought.lengths)
Date spei year year_20 drought drought_length
5 2021-05-01 -0.47325008 2021 2021_2040 1 1
9 2021-09-01 -2.04564549 2021 2021_2040 1 1
11 2021-11-01 -1.04293866 2021 2021_2040 1 2
14 2022-02-01 -0.83759671 2022 2021_2040 1 1
17 2022-05-01 -0.07784316 2022 2021_2040 1 1
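As an alternative to pairing lag with rle, both the start month and the length of each drought can be read off a single rle() result, which sidesteps the NA in the first row. This is only a sketch under the same "drought means SPEI < 0" assumption; it uses the first 20 rows from the question (rounded), and drought_events is a hypothetical name:

```r
# First 20 months from the question (values rounded); the first two are missing
test <- data.frame(
  Date = seq(as.Date("2021-01-01"), by = "month", length.out = 20),
  spei = c(NA, NA, -0.5213, -0.6005, 0.5684, 0.0229, 0.2629, -0.1431,
           -0.7313, -1.2339, -1.1587, 0.2795, 1.1461, 0.6687, -1.1376,
           -0.2786, 0.9999, 0.6102, -0.4745, -1.0668)
)

# 1 = drought month, 0 = no drought, NA = SPEI missing
drought <- as.integer(test$spei < 0)

# Run-length encoding: one entry per run of identical values
runs <- rle(drought)
run_end   <- cumsum(runs$lengths)          # last row of each run
run_start <- run_end - runs$lengths + 1    # first row of each run

# which() drops runs whose value is NA, keeping only drought runs
is_drought <- which(runs$values == 1)

drought_events <- data.frame(
  start  = test$Date[run_start[is_drought]],  # first month of each drought
  months = runs$lengths[is_drought]           # duration in months
)
drought_events
```

Adding a year or year_20 column to drought_events (from the start date) would then allow the durations to be summarised per period.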

R: Calculating year to date sum

I would like to calculate the sum of sales from the beginning of the year to the newest date.
My data:
ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300
My YTD should be 200+300 = 500.
This will sum all values for the current calendar year:
sum(df$Sales[format(df$Date, "%Y") == format(Sys.Date(), "%Y")])
You might need to make sure your df$Date variable is of class Date.
I assume your Date field is character and the last four digits represent the year.
Then you can filter where it equals current year with below:
df<-read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300",header=T)
[1] 500
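The filtering step itself is not shown above. One way it might look, assuming the year is taken from characters 4 to 7 of the Date string, is sketched here; current_year is a hypothetical variable, fixed to "2017" so that the result is reproducible rather than depending on Sys.Date():

```r
df <- read.table(text = "ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300", header = TRUE, stringsAsFactors = FALSE)

# Year part of the "mm-YYYY" string (characters 4 to 7)
yr <- substr(df$Date, 4, 7)

# Hypothetical stand-in for format(Sys.Date(), "%Y"), fixed for reproducibility
current_year <- "2017"

# Year-to-date total: sum of Sales in rows belonging to the current year
ytd <- sum(df$Sales[yr == current_year])
ytd
```

Replacing current_year with format(Sys.Date(), "%Y") gives the running-year behaviour described in the answer.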
You could use dplyr to summarise by year. lubridate is also useful to group_by year:
df1<-read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300",header=TRUE, stringsAsFactors=FALSE)
df1$Date <- as.yearmon(df1$Date,format="%m-%Y") # as.yearmon() comes from the zoo package
Year Sales
<dbl> <int>
1 2016 200
2 2017 500
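The summarising call that produced this table is not shown. A minimal sketch of a dplyr pipeline that yields the same totals follows; as an assumption, it takes the year straight from the "mm-YYYY" string instead of going through as.yearmon, so only dplyr is required:

```r
library(dplyr)

df1 <- read.table(text = "ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300", header = TRUE, stringsAsFactors = FALSE)

yearly <- df1 %>%
  # year as a number, taken from characters 4 to 7 of "mm-YYYY"
  mutate(Year = as.integer(substr(Date, 4, 7))) %>%
  group_by(Year) %>%
  # one row per year with the total sales
  summarise(Sales = sum(Sales))
yearly
```

Filtering yearly to the latest Year then gives the year-to-date figure.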

Weekends in a Month in R

I am trying to prepare an xreg series for my ARIMA model, and I will use the number of weekends in each month for it. I can get results for a single year, but when the period is longer than a year, as it usually is, I couldn't find a way. Here is what I have done so far.
dates <- seq(from=as.Date("2001-01-01"), to=as.Date("2010-12-31"), by = "day")
wd <- weekdays(dates)
aylar <- months(dates[which(wd == "Sunday" | wd == "Saturday")])
What I want is to count the weekends of every month, grouped by year as well as by month, so that the result has the same length as my original forecast series.
Here is my solution:
library(dplyr) # for %>%, group_by() and summarize()
month <- months(dates[chron::is.weekend(dates)])
day <- dates[chron::is.weekend(dates)]
# create data.frame
df <- data.frame(date = day, month = month, year = chron::years(day))
df %>% group_by(year, month) %>% summarize(weekends = floor(n()/2))
# year month weekends
# <dbl> <fctr> <dbl>
#1 2001 April 4
#2 2001 August 4
#3 2001 Dezember 5
#4 2001 Februar 4
#5 2001 Januar 4
#6 2001 Juli 4
#7 2001 Juni 4
#8 2001 Mai 4
#9 2001 März 4
#10 2001 November 4
## ... with 110 more rows
I hope this is a starting point for your work.
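Because weekdays() and months() return locale-dependent names (the output above is in German and sorted alphabetically), a locale-independent variant may be easier to line up with a monthly series. The following base R sketch counts weekend days (Saturdays plus Sundays) per year-month rather than whole weekends; the names xreg and weekend_days are only illustrative:

```r
# Daily sequence covering the same span as in the question
dates <- seq(from = as.Date("2001-01-01"), to = as.Date("2010-12-31"), by = "day")

# POSIXlt weekday numbers do not depend on the locale: 0 = Sunday, 6 = Saturday
wday <- as.POSIXlt(dates)$wday
is_weekend <- wday %in% c(0, 6)

# Count weekend days per year-month; "%Y-%m" keys sort chronologically
ym <- format(dates, "%Y-%m")
weekend_days <- tapply(is_weekend, ym, sum)

# One value per month, in time order, ready to be used as an xreg column
xreg <- data.frame(year_month   = names(weekend_days),
                   weekend_days = as.integer(weekend_days))
head(xreg, 3)
```

The result has one row per month (120 for these ten years), so it can be aligned directly with a monthly forecast series.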

How to do Group By Rollup in R? (Like SQL)

I have a dataset and I want to perform something like Group By Rollup like we have in SQL for aggregate values.
Below is a reproducible example. I know aggregate works really well, as explained here, but it is not a satisfactory fit for my case.
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
Now what I want to do is aggregate (sum by year-month-region) and add the new aggregate rows to the existing data frame,
e.g. there should be two additional rows like below, with the new region name 'USA' for the aggregated rows
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way (below) but I am very sure that there exists an optimum solution for this OR a better workaround than mine
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
Thanks for the support!
melt/dcast in the reshape2 package can do subtotalling. After running dcast we replace "(all)" in the month column with the month using na.locf from the zoo package:
library(reshape2) # melt(), dcast()
library(zoo) # na.locf()
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
In the recent development version of data.table (1.10.5) you can use a new feature called "grouping sets" to produce subtotals:
library(data.table)
setDT(df) # groupingsets() expects a data.table, so convert df by reference
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can replace NA with USA using res[is.na(region), region := "USA"]. Without the is.na(region) filter, every row's region would be overwritten.
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))
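The one-liner above returns only the region-level sums, not the 'USA' total rows. One way to get both levels, sketched here with dplyr (an assumption, since the answer uses plyr; the .groups argument needs dplyr 1.0.0 or later), is to compute each grouping level separately and stack them, which mirrors what ROLLUP does in SQL. The names by_region, totals and rollup_df are hypothetical:

```r
library(dplyr)

df <- data.frame(
  year   = c("2016", "2016", "2016", "2016", "2017", "2017", "2017", "2017"),
  month  = c("1", "1", "1", "1", "2", "2", "2", "2"),
  region = c("east", "west", "east", "west", "east", "west", "east", "west"),
  sales  = c(100, 200, 300, 400, 200, 400, 600, 800),
  stringsAsFactors = FALSE
)

# Finest level: one row per year-month-region
by_region <- df %>%
  group_by(year, month, region) %>%
  summarise(sales = sum(sales), .groups = "drop")

# Rolled-up level: one row per year-month, labelled as region "USA"
totals <- df %>%
  group_by(year, month) %>%
  summarise(sales = sum(sales), .groups = "drop") %>%
  mutate(region = "USA")

# Stack both levels; sorting on region == "USA" puts the total row last in each group
rollup_df <- bind_rows(by_region, totals) %>%
  arrange(year, month, region == "USA", region)
rollup_df
```

Further rollup levels (for example a total per year) could be added as extra summaries in the same bind_rows call.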

Sales of same date last year

I have a DF <-(ID,Year,Sales) and I want to create a fourth column LastYearSales which holds the Sales of the same date last year. For example, if I have:
ID Year Sales
1 01/01/2015 50000
2 01/01/2014 20000
I want output like this:
ID Year Sales LastYearSales
1 01/01/2015 50000 20000
2 01/01/2014 20000
If your Year column is not of class Date, convert it to Date first with
df$Year <- as.Date(df$Year, "%d/%m/%Y")
You can then try with lubridate package
df$LastYearSales <- c(df[df$Year %in% (as.Date(ymd(df$Year) - years(1))), ]$Sales, NA)
# ID Year Sales LastYearSales
# 1 1 2015-01-01 50000 20000
# 2 2 2014-01-01 20000 NA
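The one-liner above appends a single NA and so relies on the rows being in a particular order. A sketch that instead looks up each row's previous-year date with match(), using base R only, could look like this; this_key and prev_key are hypothetical helper names:

```r
df <- data.frame(
  ID    = 1:2,
  Year  = as.Date(c("01/01/2015", "01/01/2014"), format = "%d/%m/%Y"),
  Sales = c(50000, 20000)
)

# Text key for each row's own date, and for the same day one year earlier
this_key <- format(df$Year, "%Y-%m-%d")
prev_key <- paste(as.integer(format(df$Year, "%Y")) - 1,
                  format(df$Year, "%m-%d"), sep = "-")

# match() gives the row holding last year's date, or NA when there is none
df$LastYearSales <- df$Sales[match(prev_key, this_key)]
df
```

If several IDs share the same dates, the ID would have to be part of both keys; that case is not covered by this sketch.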
