How do I convert my date values into years in R?

Another day, another new complexity faced.
Below are the columns and rows that I have as input:
ID Age
123 23 Years 1 Month 2 Days
125 28 Years 9 Month 14 Days
126 28 years
127 34 YEAR
128 35 Years 8 Month 21 Days
129 38 Years 5 Month 25 Days
130 32.8
I need them as yearly calculated in new columns like:
ID Age Age_new
123 23 Years 1 Month 2 Days 23.1
125 28 Years 9 Month 14 Days 28.9
126 28 years 28
127 34 YEAR 34
128 35 Years 8 Month 21 Days 35.8
129 38 Years 5 Month 25 Days 38.5
130 32.8 32.8
I have tried the stringr package, but I only get the first number in the string,
which doesn't give a result like the above.

Here's a gross approximation:
func <- function(x, ptn) {
  out <- gsub(paste0(".*?\\b([0-9.]+)\\s*", ptn, ".*"), "\\1", x, ignore.case = TRUE)
  ifelse(out == x, NA, out)
}
library(dplyr)
dat %>%
  mutate(
    data.frame(
      lapply(c(yr = "year", mon = "month", day = "day"),
             function(ptn) as.numeric(func(Age, ptn)))
    ),
    yr = if_else(is.na(yr), suppressWarnings(as.numeric(Age)), yr),
    across(c(yr, mon, day), ~ coalesce(., 0)),
    New_Age = yr + mon/12 + day/365
  )
# ID Age yr mon day New_Age
# 1 123 23 Years 1 Month 2 Days 23.0 1 2 23.08881
# 2 125 28 Years 9 Month 14 Days 28.0 9 14 28.78836
# 3 126 28 years 28.0 0 0 28.00000
# 4 127 34 YEAR 34.0 0 0 34.00000
# 5 128 35 Years 8 Month 21 Days 35.0 8 21 35.72420
# 6 129 38 Years 5 Month 25 Days 38.0 5 25 38.48516
# 7 130 32.8 32.8 0 0 32.80000
(I offer no warranty on true accuracy.)
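For instance, applied to single values, func() (as defined above) behaves like this:
func("35 Years 8 Month 21 Days", "month")  # "8"
func("32.8", "year")                       # NA, later filled in from as.numeric(Age)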
Data
dat <- structure(list(ID = c(123L, 125L, 126L, 127L, 128L, 129L, 130L), Age = c("23 Years 1 Month 2 Days", "28 Years 9 Month 14 Days", "28 years", "34 YEAR", "35 Years 8 Month 21 Days", "38 Years 5 Month 25 Days", "32.8")), class = "data.frame", row.names = c(NA, -7L))

This is my approach. I always try to avoid regex since it's too scary for me. If your data is separated exactly like in your example, I think my code will work. I completely understand this is not the most efficient way, but hey, it works.
library(dplyr)
library(tidyr)       # separate(), replace_na()
library(lubridate)   # years(), months(), days(), as.duration()

dat %>%
  mutate(space_counter = stringr::str_count(Age, " ")) %>%
  separate(Age, into = paste0("tmp_col_", 1:(max(.$space_counter) + 1)),
           sep = " ", fill = "right") %>%
  select(ID, tmp_col_1, tmp_col_3, tmp_col_5) %>%
  setNames(c("ID", "year", "month", "day")) %>%
  # convert the extracted pieces to integers (note: this truncates "32.8" to 32),
  # then treat missing pieces as 0
  mutate(across(c(year, month, day), ~ replace_na(as.integer(.x), 0L))) %>%
  mutate(asdur = as.duration(years(year) + months(month) + days(day))) %>%
  mutate(age_new = as.numeric(asdur) / 3.154e+7)   # 3.154e+7 seconds per year
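As a possible refinement (not part of the original answer), lubridate's time_length() converts a duration straight to fractional years, which avoids the hard-coded 3.154e+7 seconds-per-year constant:
library(lubridate)
# time_length() applies lubridate's own average-year length internally
time_length(as.duration(years(23) + months(1) + days(2)), unit = "year")
# roughly 23.1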

Related

How to sum conditional functions to grouped rows in R

So I have the following data frame:
customerid payment_month payment_date bill_month charges
1          January       22           January    30
1          February      15           February   21
1          March         2            March      33
1          May           4            April      43
1          May           4            May        23
1          June          13           June       32
2          January       12           January    45
2          February      15           February   56
2          March         2            March      67
2          April         4            April      65
2          May           4            May        54
2          June          13           June       68
3          January       25           January    45
3          February      26           February   56
3          March         30           March      67
3          April         1            April      65
3          June          1            May        54
3          June          1            June       68
(The real data contains many more IDs.) I want to calculate payment efficiency using the following formula:
efficiency = (amount paid not late / total bill amount)*100
"Not late" means paying no later than the 21st day of the bill's month (paying January's bill on the 22nd of January is considered late).
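For example, using the table above, customer 1 paid the February, March, May and June bills on time (21 + 33 + 23 + 32 = 109) out of total charges of 182, so the efficiency is 109 / 182 * 100 ≈ 59.90.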
I want to calculate the efficiency of each customer with the expected output of
customerid effectivity
1          59.90
2          100
3          37.46
I have tried using the following code to calculate it for one id, and it works, but I want to apply it to every id and summarise the result into one column (effectivity) with one row per ID. I have tried using group_by, aggregate and ifelse but nothing works. What should I do?
df1 <- filter(df, (payment_month != bill_month & id == 1) | (payment_month == bill_month & payment_date > 21 & id == 1))
df2 <- filter(df, id == 1001)
x <- sum(df1$charges)  # charges paid late
y <- sum(df2$charges)  # total charges
100 - (x / y) * 100
An option using dplyr
library(dplyr)
df %>%
  group_by(customerid) %>%
  summarise(
    effectivity = sum(charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
    .groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
df %>%
  group_by(customerid) %>%
  mutate(totalperid = sum(charges)) %>%
  mutate(pay_month_number = match(payment_month, month.name),
         bill_month_number = match(bill_month, month.name)) %>%
  mutate(notlate = pay_month_number == bill_month_number & payment_date <= 21) %>%
  summarise(efficiency = sum(charges[notlate]) / first(totalperid) * 100)
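For comparison, here is a base R sketch of the same calculation (my own, assuming df matches the table above with payment_date stored as a number):
# Flag on-time payments, then sum per customer with aggregate()
df$on_time <- df$payment_month == df$bill_month & df$payment_date <= 21
eff <- aggregate(cbind(paid_on_time = charges * on_time, total = charges) ~ customerid,
                 data = df, FUN = sum)
eff$effectivity <- with(eff, paid_on_time / total * 100)
eff[, c("customerid", "effectivity")]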

How to group by in base R

I would like to express the following SQL query using base R (without any particular package):
select month, day, count(*) as count, avg(dep_delay) as avg_delay
from flights
group by month, day
having count > 1000
It selects the mean departure delay and the number of flights per day on busy days (days with more than 1000 flights). The dataset is nycflights13 containing information of flights departed from NYC in 2013.
Notice I can easily write this in dplyr as:
flights %>%
group_by(month, day) %>%
summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
filter(count > 1000)
Since I was reminded earlier about the elegance of by (tip of the hat to @Parfait), here is a solution using by:
res <- by(flights, list(flights$month, flights$day), function(x)
  if (nrow(x) > 1000) {
    c(month = unique(x$month),
      day = unique(x$day),
      count = nrow(x),
      avg_delay = mean(x$dep_delay, na.rm = TRUE))
  })
# Bind into a matrix and order by month, day
df <- do.call(rbind, res)
df <- df[order(df[, 1], df[, 2]), ]
# month day count avg_delay
#[1,] 7 8 1004 37.296646
#[2,] 7 9 1001 30.711499
#[3,] 7 10 1004 52.860702
#[4,] 7 11 1006 23.609392
#[5,] 7 12 1002 25.096154
#[6,] 7 17 1001 13.670707
#[7,] 7 18 1003 20.626789
#[8,] 7 25 1003 19.674134
#[9,] 7 31 1001 6.280843
#[10,] 8 7 1001 8.680402
#[11,] 8 8 1001 43.349947
#[12,] 8 12 1001 8.308157
#[13,] 11 27 1014 16.697651
#[14,] 12 2 1004 9.021978
As commented, you can use a combination of subset and aggregate. I changed the order of day & month to receive the same ordering as your dplyr approach, and used na.action = NULL to count rows including NAs.
library(nycflights13)
#> Warning: package 'nycflights13' was built under R version 3.4.4
subset(aggregate(dep_delay ~ day + month, flights,
                 function(x) cbind(count = length(x), avg_delay = mean(x, na.rm = TRUE)),
                 na.action = NULL),
       dep_delay[, 1] > 1000)
#> day month dep_delay.1 dep_delay.2
#> 189 8 7 1004.000000 37.296646
#> 190 9 7 1001.000000 30.711499
#> 191 10 7 1004.000000 52.860702
#> 192 11 7 1006.000000 23.609392
#> 193 12 7 1002.000000 25.096154
#> 198 17 7 1001.000000 13.670707
#> 199 18 7 1003.000000 20.626789
#> 206 25 7 1003.000000 19.674134
#> 212 31 7 1001.000000 6.280843
#> 219 7 8 1001.000000 8.680402
#> 220 8 8 1001.000000 43.349947
#> 224 12 8 1001.000000 8.308157
#> 331 27 11 1014.000000 16.697651
#> 336 2 12 1004.000000 9.021978
Created on 2018-04-05 by the reprex package (v0.2.0).
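If plain numeric columns are preferred over the matrix column that aggregate() creates here, one possible cleanup (my own, with column names of my choosing) is:
# Save the subset/aggregate result, then split the matrix column into plain columns
res <- subset(aggregate(dep_delay ~ day + month, flights,
                        function(x) cbind(count = length(x), avg_delay = mean(x, na.rm = TRUE)),
                        na.action = NULL),
              dep_delay[, 1] > 1000)
data.frame(res[c("month", "day")],
           count = res$dep_delay[, 1],
           avg_delay = res$dep_delay[, 2])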
Not a particularly elegant solution, but this will do what you want using base R:
flights_split <- split(flights, f = list(flights$month, flights$day))
result <- lapply(flights_split, function(x) {
  if (nrow(x) > 1000) {
    data.frame(month = unique(x$month), day = unique(x$day),
               avg_delay = mean(x$dep_delay, na.rm = TRUE), count = nrow(x))
  } else {
    NULL
  }
})
do.call(rbind, result)
# month day avg_delay count
# 12.2 12 2 9.021978 1004
# 8.7 8 7 8.680402 1001
# 7.8 7 8 37.296646 1004
# 8.8 8 8 43.349947 1001
# 7.9 7 9 30.711499 1001
# 7.10 7 10 52.860702 1004
# 7.11 7 11 23.609392 1006
# 7.12 7 12 25.096154 1002
# 8.12 8 12 8.308157 1001
# 7.17 7 17 13.670707 1001
# 7.18 7 18 20.626789 1003
# 7.25 7 25 19.674134 1003
# 11.27 11 27 16.697651 1014
# 7.31 7 31 6.280843 1001
Here is my solution:
grp <- expand.grid(mth = unique(flights$month), d = unique(flights$day))
out <- mapply(function(mth, d) {
  sub_data <- subset(flights, month == mth & day == d)
  df <- data.frame(
    month = mth,
    day = d,
    count = nrow(sub_data),
    avg_delay = mean(sub_data$dep_delay, na.rm = TRUE)
  )
  df[df$count > 1000, ]               # keep the row only on busy days
}, grp$mth, grp$d, SIMPLIFY = FALSE)  # keep a list so do.call(rbind, ...) works
res <- do.call(rbind, out)
This is a lot slower than the dplyr solution.

Calculating the difference between two two-digit years

Is there any easy way in R to calculate the difference between two columns of two-digit years (just years, no months/days as it's unnecessary here) in order to produce a column of ages?
I'm fairly new to this and have been playing with 'if' statements and algebra without success.
The data looks like this, but larger:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"))
You could use strptime() with the format %y:
dat <- data.frame(year1 = c("98", "99", "00", "01", "02"),
                  year2 = c("03", "04", "05", "06", "07"),
                  stringsAsFactors = FALSE) # You might want to use this as a default!
dat$year1 <- strptime(dat$year1, format = "%y")
dat$year2 <- strptime(dat$year2, format = "%y")
as.vector(difftime(dat$year2,
                   dat$year1,
                   units = "days")) / 365.242
4.999311 5.002163 4.999425 4.999425 4.999425
Format to a date, format back to a number, take the difference:
do.call(`-`, lapply(dat[1:2], function(x)
as.numeric(format(as.Date(x, format="%y"), "%Y"))))
#[1] -5 -5 -5 -5 -5
This may hit cases where it doesn't work if you have old dates in the early 1900's. As per ?strptime:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
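A quick sketch of that century rule (the expected output follows the documented 00-68 / 69-99 split):
format(as.Date(c("68", "69"), format = "%y"), "%Y")
# expected: "2068" "1969"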
df$age <- ifelse(df$year2 < df$year1, df$year2 - df$year1 + 100, df$year2 -df$year1)
should work under the assumption year2 is some kind of current year and year1 is the year of birth and there are no people born before 1918.
Example:
df <- data.frame(year1 = sample(18:99, 1000, replace = T),
year2 = sample(1:99, 1000, replace = T))
> head(df)
year1 year2
1 27 88
2 41 55
3 90 36
4 81 93
5 56 60
6 27 61
df$age <- ifelse(df$year2 < df$year1, df$year2 - df$year1 + 100, df$year2 -df$year1)
> head(df)
year1 year2 age
1 73 88 15
2 50 17 67
3 47 41 94
4 54 43 89
5 36 82 46
6 62 85 23
With your data example:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"))
dat$age <- ifelse(as.numeric(as.character(dat$year2)) < as.numeric(as.character(dat$year1)),
as.numeric(as.character(dat$year2)) - as.numeric(as.character(dat$year1)) + 100,
as.numeric(as.character(dat$year2)) - as.numeric(as.character(dat$year1)))
> dat
year1 year2 age
1 98 03 5
2 99 04 5
3 00 05 5
4 01 06 5
5 02 07 5
One method is to use as.Date with a dplyr chain:
dat %>%
  mutate(year1 = as.Date(year1, format = "%y"),
         year2 = as.Date(year2, format = "%y")) %>%
  mutate(age = year2 - year1)
which returns:
year1 year2 age
1 1998-10-26 2003-10-26 1826 days
2 1999-10-26 2004-10-26 1827 days
3 2000-10-26 2005-10-26 1826 days
4 2001-10-26 2006-10-26 1826 days
5 2002-10-26 2007-10-26 1826 days
P.S. This assumes a default day and month for both columns, but since the same default value is used for both, it does not affect the difference calculation.
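If whole years are wanted rather than a day count, one possible follow-up (my own addition, not part of the answer) is to divide the day difference by the average year length:
library(dplyr)
dat %>%
  mutate(year1 = as.Date(year1, format = "%y"),
         year2 = as.Date(year2, format = "%y"),
         age_years = round(as.numeric(year2 - year1) / 365.25))  # days -> whole years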

Percentile for multiple groups of values in R

I'm using R to do my data analysis.
I'm looking for code to achieve the output mentioned below.
I need a single piece of code to do this, as I have over 500 groups and 24 months in my actual data. The sample below has only 2 groups and 2 months.
This is a sample of my data.
Date Group Value
1-Jan-16 A 10
2-Jan-16 A 12
3-Jan-16 A 17
4-Jan-16 A 20
5-Jan-16 A 12
5-Jan-16 B 56
1-Jan-16 B 78
15-Jan-16 B 97
20-Jan-16 B 77
21-Jan-16 B 86
2-Feb-16 A 91
2-Feb-16 A 44
3-Feb-16 A 93
4-Feb-16 A 87
5-Feb-16 A 52
5-Feb-16 B 68
1-Feb-16 B 45
15-Feb-16 B 100
20-Feb-16 B 81
21-Feb-16 B 74
And this is the output I'm looking for.
Month Year Group Minimum Value 5th Percentile 10th Percentile 50th Percentile 90th Percentile Max Value
Jan 2016 A
Jan 2016 B
Feb 2016 A
Feb 2016 B
Considering dft as your input, you can try:
library(dplyr)
library(lubridate)  # for month() and year()
dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%
  mutate(mon = month(Date),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  mutate(minimum = min(Value),
         maximum = max(Value),
         q95 = quantile(Value, 0.95)) %>%
  select(minimum, maximum, q95) %>%
  unique()
which gives:
mon yr Group minimum maximum q95
<int> <int> <chr> <int> <int> <dbl>
1 1 2016 A 10 20 19.4
2 1 2016 B 56 97 94.8
3 2 2016 A 44 93 92.6
4 2 2016 B 45 100 96.2
and add more variables as per your need.
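To get all of the columns from the desired output (minimum, 5th, 10th, 50th and 90th percentiles, maximum) in one step, a possible extension of the same idea uses summarise(); the column names here are my own choice:
library(dplyr)
library(lubridate)
dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y"),
         mon = month(Date, label = TRUE),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  summarise(min_value = min(Value),
            p05 = quantile(Value, 0.05),
            p10 = quantile(Value, 0.10),
            p50 = quantile(Value, 0.50),
            p90 = quantile(Value, 0.90),
            max_value = max(Value),
            .groups = "drop")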

How to group in R with partial match and assign a column with the aggregated value?

Below is the data frame I have:
Quarter Revenue
1 2014-Q1 10
2 2014-Q2 20
3 2014-Q3 30
4 2014-Q4 40
5 2015-Q1 50
6 2015-Q2 60
7 2015-Q3 70
8 2015-Q4 80
I want to find the mean of the quarters containing Q1, Q2, Q3 and Q4 separately (e.g. for rows containing Q1 there are two revenue values, 10 and 50, whose mean is 30) and insert a column holding that mean. The output should look like the one described below:
Quarter Revenue Aggregate
1 2014-Q1 10 30
2 2014-Q2 20 40
3 2014-Q3 30 50
4 2014-Q4 40 60
5 2015-Q1 50 30
6 2015-Q2 60 40
7 2015-Q3 70 50
8 2015-Q4 80 60
Could you all please let me know whether there are ways to do this both without using the popular packages and with them?
Thanks!
We can separate "Quarter" into "Year" and "Quart", group by "Quart", and get the mean of "Revenue":
library(dplyr)
library(tidyr)
separate(df1, Quarter, into = c("Year", "Quart"), remove = FALSE) %>%
  group_by(Quart) %>%
  mutate(Aggregate = mean(Revenue)) %>%
  ungroup() %>%
  select(-Quart, -Year)
# Quarter Revenue Aggregate
# <chr> <int> <dbl>
#1 2014-Q1 10 30
#2 2014-Q2 20 40
#3 2014-Q3 30 50
#4 2014-Q4 40 60
#5 2015-Q1 50 30
#6 2015-Q2 60 40
#7 2015-Q3 70 50
#8 2015-Q4 80 60
Or we can do this compactly with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by the substring of 'Quarter' (with the year and "-" removed), assign (:=) the mean of 'Revenue' to create the 'Aggregate' column.
library(data.table)
setDT(df1)[, Aggregate := mean(Revenue) ,.(sub(".*-", "", Quarter))]
One possible solution using functions from the base package.
qtr <- c("Q1", "Q2", "Q3", "Q4")
avg <- numeric(nrow(df1))
for (n in 1:length(qtr)) {
  ind <- grep(qtr[n], df1$Quarter)
  avg[ind] <- mean(df1$Revenue[ind])  # assign the group mean to every matching row
}
df1 <- transform(df1, Aggregate = avg)
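For an even more compact base R alternative (my own sketch, not part of the answer above), ave() computes the group mean directly, grouping on the quarter suffix:
# ave() defaults to FUN = mean; group on the quarter suffix extracted with sub()
df1$Aggregate <- ave(df1$Revenue, sub(".*-", "", df1$Quarter))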
Apparently, using functions from other packages (e.g., dplyr) makes the code less verbose.
