Change Duration from (Years and Month) to (Month) in R

I have a data frame like the one below. I need to add a "Duration" column beside the "Years and Month" column, converting each "Years and Month" value to a duration in months.
For instance, 2Y3M should become 27M.
I have searched for a solution without success. How do I do that? Thanks in advance.
Years and Month  Percentage Change
2Y3M             13%
3Y4M             23%

Here are a few approaches. See Note at the end for the input x.
1) Convert to yearmon class which stores its input internally as year+(month-1)/12. We can get the internal number by converting it to numeric, then multiply by 12 and add back the 1.
library(zoo)
ym <- as.yearmon(x, "%YY%mM")
12 * as.numeric(ym) + 1
## [1] 27 40
This could be written as a one-liner like this:
12 * as.numeric(as.yearmon(x, "%YY%mM")) + 1
1a) Using ym from above this would also work where as.integer extracts the year and cycle gets the month:
12 * as.integer(ym) + cycle(ym)
## [1] 27 40
2) A base solution is to read x into a two-column data frame by splitting on "Y" (with "M" treated as a comment character so that it is dropped), convert that to a matrix, and matrix-multiply it by c(12, 1) to get the result.
d <- read.table(text = x, sep = "Y", comment.char = "M")
c(as.matrix(d) %*% c(12, 1))
## [1] 27 40
This could also be written as a one-liner:
c(as.matrix(read.table(text = x, sep = "Y", comment.char = "M")) %*% c(12, 1))
Note
The input x in reproducible form is
x <- c("2Y3M", "3Y4M")
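To end up with the column the question asks for, the result of approach 2 can be pasted back into the data frame. This is only a sketch: the data frame df below is a hypothetical reconstruction of the table in the question, and the "M" suffix and the column order are assumptions about the desired output.

```r
# Hypothetical reconstruction of the question's data frame
df <- data.frame(`Years and Month` = c("2Y3M", "3Y4M"),
                 `Percentage Change` = c("13%", "23%"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Approach 2: split on "Y", drop the trailing "M", then 12 * years + months
d <- read.table(text = df$`Years and Month`, sep = "Y", comment.char = "M")
df$Duration <- paste0(c(as.matrix(d) %*% c(12, 1)), "M")

# Put Duration right beside "Years and Month"
df <- df[c("Years and Month", "Duration", "Percentage Change")]
df$Duration
## [1] "27M" "40M"
```

If a numeric duration is preferred, drop the paste0() call and keep the plain number of months.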

Assuming your data frame is called df and the column is called ym, you can use strcapture to extract the year and month values.
result <- transform(strcapture('(\\d+)Y(\\d+)M', df$ym,
proto = list(year = integer(), month = integer())),
yearmonth = (year * 12) + month)
result
# year month yearmonth
#1 2 3 27
#2 3 4 40
To assign the result back to the same column:
df$ym <- transform(strcapture('(\\d+)Y(\\d+)M', df$ym,
proto = list(year = integer(), month = integer())),
yearmonth = (year * 12) + month)$yearmonth
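The pattern above expects both a year part and a month part. If some values might contain only one of them (for example "5Y" or "7M", which do not appear in the question and are purely an assumption here), a small helper along these lines could treat the missing part as zero. The name to_months is hypothetical.

```r
# Hypothetical helper: convert strings like "2Y3M", "5Y" or "7M" to months.
to_months <- function(x) {
  has_y <- grepl("Y", x, fixed = TRUE)
  has_m <- grepl("M", x, fixed = TRUE)
  # Text before "Y" is the year count; a missing part counts as 0
  years  <- ifelse(has_y, sub("Y.*$", "", x), "0")
  # Text after "Y" (if any) with the trailing "M" removed is the month count
  months <- ifelse(has_m, sub("M$", "", sub("^.*Y", "", x)), "0")
  12 * as.integer(years) + as.integer(months)
}

to_months(c("2Y3M", "3Y4M", "5Y", "7M"))
## [1] 27 40 60  7
```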

Assuming your "Years and Month" column is a character type, I would extract the years and months separately, then figure out how many months it is.
library(tidyverse)
your_df <- tibble(`Years and Month` = c("2Y3M", "3Y4M"))
your_df %>%
  mutate(years = str_extract(`Years and Month`, "^\\d+(?=Y)"),
         months = str_extract(`Years and Month`, "(?<=Y)\\d+")) %>%
  mutate(total_months = as.numeric(years)*12 + as.numeric(months))

Related

How to sort column with month order and calculate differences between column R data.table

Goal:
Sort the columns of a pivot table in month order (1~12) so that figures for the same month in different years can be compared.
Desired data shape:
# 1 - sort as every Jan ~ Dec inside every year
setcolorder(d_c,
c("2016-1","2017-1","2018-1".....))
# 2 - finally add column to calculate the differences
d_c[,"dif":=format(`lastyear_samemonth_column`-`neweryear_samemonthcolumn`,big.mark = ",")]
Data :
library(data.table)
library(tibble)
set.seed(566684)
n = 100
d <-as.data.table(tibble(month = sample(1:12, n, replace = TRUE),
year = sample(2016:2018,n, replace = TRUE),
`year-month` = paste(year, month, sep = '-'),
value = rnorm(n),
c1 = sample(LETTERS,n,replace = TRUE)))
d_c <- dcast(d,c1 ~ `year-month`, value.var = "value" ,fun.aggregate = sum)
Problem:
The columns are grouped by "year-month" in ascending order, but I don't know how to sort them in month order and assign dynamic column names to get the comparison result.
To order the columns by month you can use:
library(data.table)
cols <- c(1, order(as.numeric(sub('.*-', '', names(d_c)[-1]))) + 1)
d_c[, ..cols]
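The second half of the goal, adding a difference column per month, could then be sketched as follows. The small table, the two years compared (2016 and 2017) and the "dif-" column names are all assumptions made for illustration; they are not taken from the question's random data.

```r
library(data.table)

# Small hypothetical wide table in the same shape as d_c
d_c <- data.table(c1 = c("A", "B"),
                  `2016-1` = c(1, 2), `2016-2` = c(3, 4),
                  `2017-1` = c(10, 20), `2017-2` = c(30, 40))

# Reorder so that the same month of each year sits side by side
cols <- c(1, order(as.numeric(sub(".*-", "", names(d_c)[-1]))) + 1)
d_c <- d_c[, ..cols]

# For each month, add a column holding (newer year - older year)
for (m in 1:2) {
  old <- paste0("2016-", m)
  new <- paste0("2017-", m)
  set(d_c, j = paste0("dif-", m), value = d_c[[new]] - d_c[[old]])
}
d_c
```

Building the column names with paste0() keeps the loop independent of how many months are present; formatting with format(..., big.mark = ",") can be applied afterwards if a character result is acceptable.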

Find the longest, Non-NA common sequence between two time series in R

Say I have two, different-length time series. Both have columns time and value. Both of them have NA values at random positions. For example:
# Generate first series
series1 <- data.frame(
time = seq.POSIXt(
from = as.POSIXct("2020-01-01", origin = "1970-01-01"),
length.out = 100,
by = "1 day"
),
value = runif(100, min = 0, max = 100)
)
# Generate second series, which starts and ends and different times
series2 <- data.frame(
time = seq.POSIXt(
from = as.POSIXct("2019-12-01", origin = "1970-01-01"),
length.out = 80,
by = "1 day"
),
value = runif(80, min = 0, max = 100)
)
# Remove some values at random
random_idx1 <- sample(seq_len(nrow(series1)), 20)
random_idx2 <- sample(seq_len(nrow(series2)), 20)
series1$value[random_idx1] <- NA
series2$value[random_idx2] <- NA
Great. If I were to determine the largest non-NA sequence for each series, I could use stats::na.contiguous(). However, the longest sequence for one series is not the same for the other.
Now the question is: how can I determine the longest overlapping Non-NA sequence of values between the two series? That is, what is the longest sequence of values that are time-matched between the two time series AND are not NA values?
In the question series2 ends in 2019 whereas series1 starts in 2020 so there is no run of non-NA values in common so let us use a different example given in the Note at the end.
1) Using only base R we could do this:
na.contiguous(merge(DF1, DF2, by = 1))
2) or we could convert to zoo and do the same thing. Use fortify.zoo(z) to convert back or just leave it as zoo. If you want separate zoo objects use z$z1 and z$z2. Note that time(z) is the times in the result. It would also be possible to use ts class if the times are regularly spaced: as.ts(z).
library(zoo)
z1 <- read.zoo(DF1)
z2 <- read.zoo(DF2)
z <- na.contiguous(cbind(z1, z2))
z
## z1 z2
## 3 3 12
## 4 4 13
## 5 5 14
## attr(,"na.action")
## [1] 1 2 6 7
## attr(,"class")
## [1] omit
Note
DF1 <- data.frame(1:6, c(1, 2, 3, 4, 5, NA))
DF2 <- data.frame(2:7, c(NA, 12, 13, 14, 15, 16))
We do a full_join by 'time', apply run-length encoding (rle) to the logical vector that marks rows where both 'value.x' and 'value.y' are non-NA, extract the lengths of the runs whose 'values' are TRUE, and take the max
library(dplyr)
full_join(series1, series2, by = 'time') %>%
summarise(len1 = with(rle(!is.na(value.x) &
!is.na(value.y)), max(lengths[values])))
# len1
#1 5
It returns the length of the longest run of non-NA values common to both 'value' columns of 'series1' and 'series2'
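If the rows of that longest run are needed rather than just its length, the same rle idea can be extended in base R. This is a sketch on the small example from the Note above, with the columns renamed to time and value (an assumption for readability); it also assumes at least one row has both values present.

```r
s1 <- data.frame(time = 1:6, value = c(1, 2, 3, 4, 5, NA))
s2 <- data.frame(time = 2:7, value = c(NA, 12, 13, 14, 15, 16))

# Full join on time; merge() sorts by the key and names the columns value.x / value.y
m <- merge(s1, s2, by = "time", all = TRUE)

# TRUE where both series have a value at that time
ok <- !is.na(m$value.x) & !is.na(m$value.y)
r <- rle(ok)

# Locate the longest TRUE run and turn it back into row positions
ends <- cumsum(r$lengths)
best <- which(r$values)[which.max(r$lengths[r$values])]
rows <- seq(ends[best] - r$lengths[best] + 1, ends[best])
m[rows, ]
```

Here the selected rows are the ones with time 3, 4 and 5, matching the zoo output shown earlier.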

How can I use the pad function (from the padr package) for multiple data frames?

I have 24 files (one for each hour of the day, HR_NBR = Hour Number) and I have to pad the dates in each of the files.
AS-IS data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
03/07/2016 1 10
TO-BE data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
02/07/2016 NA NA
03/07/2016 1 10
I can use the pad function for each file, like this:
chil_bev1_1 = pad (chil_bev1_1, interval= "day") # Hour1
chil_bev1_2 = pad (chil_bev1_2, interval= "day") # Hour2
and so on.
And it works. But I want to use a loop or lapply.
I tried several variations of these two snippets, but none of them worked:
df1 = data.frame (chil_bev1_1)
df2 = data.frame (chil_bev1_2)
dflist = c("df1","df2")
CODE1:
x = function(df) {df %>% pad}
allpad = lapply(dflist,x)
CODE2:
x = function(df) {pad (df)}
allpad = lapply(dflist,x)
The error is
"x must be a data frame".
I'm new to R. Any help would be greatly appreciated.
Thank you.
I managed to figure it out. Here's the answer (using complete() from tidyr rather than pad):
library(dplyr)  # for %>%
library(tidyr)  # for complete()
hour_list = list(chil_bev1_1, chil_bev1_2)
chil_bev1n = lapply(hour_list, function(x) {x %>% complete(CLNDR_DT = seq.Date(min(CLNDR_DT), max(CLNDR_DT), by = "day"), fill = list(QTY = 0))})
Notes:
The fill = list(QTY = 0) argument replaces the NAs in QTY with 0s.
CLNDR_DT is the name of the column that contains dates (it needs to be of class Date for seq.Date to work).
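For reference, the "x must be a data frame" error in the question most likely comes from dflist being a character vector of names, so lapply hands each function call a string rather than a data frame. A minimal base R sketch of the difference, using two small hypothetical data frames:

```r
# Two small stand-in data frames (hypothetical data)
df1 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-03")), QTY = c(6, 10))
df2 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-02", "2016-07-04")), QTY = 1:3)

# A character vector only holds the names, not the objects
dflist_names <- c("df1", "df2")
is_df_names <- sapply(dflist_names, is.data.frame)   # FALSE for both

# A list holds the data frames themselves
dflist <- list(df1 = df1, df2 = df2)
is_df_list <- sapply(dflist, is.data.frame)          # TRUE for both

# Any function expecting a data frame now works element by element
rows <- lapply(dflist, nrow)
```

With the list in place, something like lapply(dflist, pad, interval = "day") should pass each data frame to pad; mget(dflist_names) is another way to turn the names into a list of objects.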

how to calculate date difference in R when it involves BC and AD

I have a data frame like this:
df = data.frame(dt = c('0101-01-01','0023-10-20'), comment = c('BC','AD'))
The first dt is actually year -101 according to comment.
How can I make R recognise that the first date is BC and get the time difference between these two dates?
We convert to numeric after changing to yearmon class (which gives year + (month - 1)/12), change the sign to negative for the rows having 'BC' in 'comment', and take the difference
library(zoo)
v2 <- as.numeric(as.yearmon(df$dt))
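Optionally, to make the fractional year more precise (based on the day of the year rather than the month), v2 can be replaced with the following; the result shown at the end is from the yearmon version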
v2 <- lubridate::year(df$dt) +
(strptime(df$dt, format = "%Y-%m-%d")$yday + 1)/365
i1 <- df$comment == "BC"
v2[i1] <- -1* v2[i1]
diff(v2)
#[1] 124.75
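The arithmetic behind the 124.75 can be checked without any packages. The sketch below splits the date strings by hand, which is an illustrative substitute for as.yearmon rather than the answer's own method, and it assumes the dt strings are always in YYYY-MM-DD form.

```r
df <- data.frame(dt = c("0101-01-01", "0023-10-20"),
                 comment = c("BC", "AD"),
                 stringsAsFactors = FALSE)

# Split "YYYY-MM-DD" into a character matrix with columns year, month, day
parts <- do.call(rbind, strsplit(df$dt, "-", fixed = TRUE))
year  <- as.numeric(parts[, 1])
month <- as.numeric(parts[, 2])

# Same convention as yearmon: year + (month - 1)/12
v <- year + (month - 1) / 12

# Flip the sign for BC rows, then take the difference
bc <- df$comment == "BC"
v[bc] <- -v[bc]
diff(v)
## [1] 124.75
```

That is 23 + 9/12 = 23.75 for the AD date minus -101 for the BC date.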

Aggregating daily content

I've been attempting to aggregate (somewhat erratic) daily data. I'm actually working with csv data, but if I recreate it, it would look something like this:
library(zoo)
dates <- c("20100505", "20100505", "20100506", "20100507")
val1 <- c("10", "11", "1", "6")
val2 <- c("5", "31", "2", "7")
x <- data.frame(dates = dates, val1=val1, val2=val2)
z <- read.zoo(x, format = "%Y%m%d")
Now I'd like to aggregate this on a daily basis (notice that sometimes there is more than one data point for a day, and sometimes there aren't any).
I've tried lots and lots of variations, but I can't seem to aggregate; for instance, this fails:
aggregate(z, as.Date(time(z)), sum)
# Error in Summary.factor(2:3, na.rm = FALSE) : sum not meaningful for factors
There seems to be a lot of content regarding aggregate, and I've tried a number of versions but can't seem to sum this on a daily level. I'd also like to run cummax and cumulative averages in addition to the daily summing.
Any help would be greatly appreciated.
Update
The code I am actually using is as follows:
z <- read.zoo(file = "data.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE, blank.lines.skip = T, na.strings="NA", format = "%Y%m%d");
It seems my (unintentional) quotation of the numbers above is similar to what is happening in practice, because when I do:
aggregate(z, index(z), sum)
#Error in Summary.factor(25L, na.rm = FALSE) : sum not meaningful for factors
There are a number of columns (100 or so); how can I specify them to be as.numeric automatically? (stringsAsFactors = FALSE doesn't appear to work.)
Alternatively, you can aggregate before using zoo (val1 and val2 need to be numeric though).
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
y <- aggregate(x[,2:3],by=list(x[,1]),FUN=sum)
and then feed y into zoo.
That way you avoid the warning :)
You started on the right path but made a couple of mistakes.
First, zoo only consumes matrices, not data.frames. Second, those need numeric inputs:
> z <- zoo(as.matrix(data.frame(val1=c(10,11,1,6), val2=c(5,31,2,7))),
+ order.by=as.Date(c("20100505","20100505","20100506","20100507"),
+ "%Y%m%d"))
Warning message:
In zoo(as.matrix(data.frame(val1 = c(10, 11, 1, 6), val2 = c(5, :
some methods for "zoo" objects do not work if the index entries in
'order.by' are not unique
This gets us a warning which is standard in zoo: it does not like identical time indices.
Always a good idea to show the data structure, maybe via str() as well, maybe run summary() on it:
> z
val1 val2
2010-05-05 10 5
2010-05-05 11 31
2010-05-06 1 2
2010-05-07 6 7
And then, once we have it, aggregation is easy:
> aggregate(z, index(z), sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
>
val1 and val2 are character strings. In R versions before 4.0.0, data.frame() converts them to factors by default (stringsAsFactors = TRUE). Summing factors doesn't make sense. You probably intended:
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
z <- read.zoo(x, format = "%Y%m%d")
aggregate(z, as.Date(time(z)), sum)
which yields:
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
Convert the character columns to numeric and then use read.zoo making use of its aggregate argument:
> x[-1] <- lapply(x[-1], function(x) as.numeric(as.character(x)))
> read.zoo(x, format = "%Y%m%d", aggregate = sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
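The question also asks about converting many columns at once and about cumulative statistics after the daily sum. A base R sketch of both steps, assuming every column other than dates should be numeric and using hypothetical names for the new columns:

```r
x <- data.frame(dates = c("20100505", "20100505", "20100506", "20100507"),
                val1 = c("10", "11", "1", "6"),
                val2 = c("5", "31", "2", "7"),
                stringsAsFactors = FALSE)

# Convert every non-date column to numeric, however many there are;
# as.character() first so factor columns are handled too
x[-1] <- lapply(x[-1], function(col) as.numeric(as.character(col)))

# Daily sums
daily <- aggregate(x[-1], by = list(date = as.Date(x$dates, "%Y%m%d")), FUN = sum)

# Running maximum and running mean of the daily sums (shown for val1)
daily$val1_cummax  <- cummax(daily$val1)
daily$val1_cummean <- cumsum(daily$val1) / seq_along(daily$val1)
daily
```

The same two lines can be applied to any other column, or wrapped in lapply over daily[-1] to cover all of them.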
