Calculating the difference between two two-digit years - r

Is there any easy way in R to calculate the difference between two columns of two-digit years (just years, no months/days as it's unnecessary here) in order to produce a column of ages?
I'm fairly new to this and have been playing with 'if' statements and algebra without success.
The data looks like this, but larger:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"))

You could use strptime() with the format %y:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"),
stringsAsFactors = FALSE) # You might want to use this as a default!
dat$year1 <- strptime(dat$year1, format = "%y")
dat$year2 <- strptime(dat$year2, format = "%y")
as.vector(difftime(dat$year2,
dat$year1,
units = "days"))/365.242
4.999311 5.002163 4.999425 4.999425 4.999425
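If whole years are all that is needed, the division by 365.242 can be skipped. A minimal sketch, assuming the same two-digit strings as above: strptime() returns a POSIXlt object whose year field counts years since 1900, so the fields can be subtracted directly. The "-01-01" suffix and the names y1, y2 and age are my additions, used only to pin the month and day.

```r
# Parse two-digit years with a fixed month and day so the result
# does not depend on the date the code is run.
y1 <- strptime(paste0(c("98", "99", "00", "01", "02"), "-01-01"),
               format = "%y-%m-%d", tz = "UTC")
y2 <- strptime(paste0(c("03", "04", "05", "06", "07"), "-01-01"),
               format = "%y-%m-%d", tz = "UTC")

# POSIXlt stores the year as "years since 1900" in the $year field
age <- y2$year - y1$year
age
```

This gives integer year differences rather than fractional ones.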

Format to a date, format back to a number, take the difference:
do.call(`-`, lapply(dat[1:2], function(x)
as.numeric(format(as.Date(x, format="%y"), "%Y"))))
#[1] -5 -5 -5 -5 -5
This may hit cases where it doesn't work if you have old dates in the early 1900's. As per ?strptime:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
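To see where the century cut-off falls, here is a small sketch of the rule quoted above. The vector yy and the fixed "-01-01" suffix are only for illustration.

```r
# Two-digit years on either side of the 68/69 pivot
yy <- c("00", "68", "69", "99")

# Pin month and day so the parse does not depend on today's date
full <- as.numeric(format(as.Date(paste0(yy, "-01-01"), format = "%y-%m-%d"), "%Y"))
full
```

Values up to 68 land in the 2000s and values from 69 onwards land in the 1900s, so a birth year such as "45" would be read as 2045.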

df$age <- ifelse(df$year2 < df$year1, df$year2 - df$year1 + 100, df$year2 -df$year1)
should work under the assumption year2 is some kind of current year and year1 is the year of birth and there are no people born before 1918.
Example:
df <- data.frame(year1 = sample(18:99, 1000, replace = T),
year2 = sample(1:99, 1000, replace = T))
> head(df)
year1 year2
1 27 88
2 41 55
3 90 36
4 81 93
5 56 60
6 27 61
df$age <- ifelse(df$year2 < df$year1, df$year2 - df$year1 + 100, df$year2 -df$year1)
> head(df)
year1 year2 age
1 73 88 15
2 50 17 67
3 47 41 94
4 54 43 89
5 36 82 46
6 62 85 23
With your data example:
dat <- data.frame(year1=c("98","99","00","01","02"),
year2=c("03","04","05","06","07"))
dat$age <- ifelse(as.numeric(as.character(dat$year2)) < as.numeric(as.character(dat$year1)),
as.numeric(as.character(dat$year2)) - as.numeric(as.character(dat$year1)) + 100,
as.numeric(as.character(dat$year2)) - as.numeric(as.character(dat$year1)))
> dat
year1 year2 age
1 98 03 5
2 99 04 5
3 00 05 5
4 01 06 5
5 02 07 5
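The same wrap-around can also be written with modulo arithmetic instead of ifelse(). This is a sketch under the same assumption as above (year2 is never more than 99 years after year1); y1 and y2 are just helper names.

```r
dat <- data.frame(year1 = c("98", "99", "00", "01", "02"),
                  year2 = c("03", "04", "05", "06", "07"),
                  stringsAsFactors = FALSE)

y1 <- as.numeric(dat$year1)
y2 <- as.numeric(dat$year2)

# %% returns a non-negative remainder in R, so 3 - 98 = -95 becomes 5
dat$age <- (y2 - y1) %% 100
dat$age
```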

One method is to use as.Date with a dplyr chain:
dat %>%
mutate(year1 = as.Date(year1, format = "%y"),
year2 = as.Date(year2, format = "%y")) %>%
mutate(age = year2 - year1)
which returns:
year1 year2 age
1 1998-10-26 2003-10-26 1826 days
2 1999-10-26 2004-10-26 1827 days
3 2000-10-26 2005-10-26 1826 days
4 2001-10-26 2006-10-26 1826 days
5 2002-10-26 2007-10-26 1826 days
P.S. as.Date() fills in the current day and month for both columns. Because the same values are used for both, this does not affect the difference in years; the day counts differ only by leap days (hence 1827 in row 2).

Related

How do i convert my date values into year in r

Another day, another complex problem.
Below are the columns and rows that I have as input:
ID Age
123 23 Years 1 Month 2 Days
125 28 Years 9 Month 14 Days
126 28 years
127 34 YEAR
128 35 Years 8 Month 21 Days
129 38 Years 5 Month 25 Days
130 32.8
I need them as yearly calculated in new columns like:
ID Age Age_new
123 23 Years 1 Month 2 Days 23.1
125 28 Years 9 Month 14 Days 28.9
126 28 years 28
127 34 YEAR 34
128 35 Years 8 Month 21 Days 35.8
129 38 Years 5 Month 25 Days 38.5
130 32.8 32.8
I have tried the stringr package, but I only get the first character string,
which does not give the result shown above.
Here's a gross approximation:
func <- function(x, ptn) {
out <- gsub(paste0(".*?\\b([0-9.]+)\\s*", ptn, ".*"), "\\1", x, ignore.case = TRUE)
ifelse(out == x, NA, out)
}
library(dplyr)
dat %>%
mutate(
data.frame(
lapply(c(yr = "year", mon = "month", day = "day"),
function(ptn) as.numeric(func(Age, ptn)))
),
yr = if_else(is.na(yr), suppressWarnings(as.numeric(Age)), yr),
across(c(yr, mon, day), ~ coalesce(., 0)), New_Age = yr + mon/12 + day/365
)
# ID Age yr mon day New_Age
# 1 123 23 Years 1 Month 2 Days 23.0 1 2 23.08881
# 2 125 28 Years 9 Month 14 Days 28.0 9 14 28.78836
# 3 126 28 years 28.0 0 0 28.00000
# 4 127 34 YEAR 34.0 0 0 34.00000
# 5 128 35 Years 8 Month 21 Days 35.0 8 21 35.72420
# 6 129 38 Years 5 Month 25 Days 38.0 5 25 38.48516
# 7 130 32.8 32.8 0 0 32.80000
(I offer no warranty on true accuracy.)
Data
dat <- structure(list(ID = c(123L, 125L, 126L, 127L, 128L, 129L, 130L), Age = c("23 Years 1 Month 2 Days", "28 Years 9 Month 14 Days", "28 years", "34 YEAR", "35 Years 8 Month 21 Days", "38 Years 5 Month 25 Days", "32.8")), class = "data.frame", row.names = c(NA, -7L))
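For comparison, a base-R-only sketch of the same idea. The helper name grab() and the short age vector are made up for illustration, and the month and day conversions (12 months, 365 days per year) are the same rough approximation as above.

```r
age <- c("23 Years 1 Month 2 Days", "28 years", "34 YEAR", "32.8")

# Hypothetical helper: pull the number that precedes a unit, or 0 if absent
grab <- function(x, unit) {
  m <- regmatches(x, regexec(paste0("([0-9.]+)\\s*", unit), x, ignore.case = TRUE))
  vapply(m, function(v) if (length(v) == 2) as.numeric(v[2]) else 0, numeric(1))
}

yr  <- grab(age, "year")
mon <- grab(age, "month")
day <- grab(age, "day")

# Entries that are already plain numbers (e.g. "32.8") are used as the year value
plain <- suppressWarnings(as.numeric(age))
yr <- ifelse(is.na(plain), yr, plain)

age_new <- yr + mon / 12 + day / 365
age_new
```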
This is my approach. I try to avoid regex since I find it intimidating. If your data is separated by spaces exactly as in your example, this should work. It is not the most efficient way, but it works.
library(dplyr)
library(tidyr)
library(lubridate)
dat %>%
mutate(space_counter = stringr::str_count(Age," ")) %>%
tidyr::separate(Age,into = paste0("tmp_col_",1:(max(.$space_counter)+1)),sep = " ",fill = "right") %>%
select(ID, tmp_col_1,tmp_col_3,tmp_col_5) %>%
setNames(c("ID","year","month","day")) %>%
mutate(across(c(year, month, day), as.integer)) %>% # note: a plain "32.8" is truncated to 32
mutate(across(c(year, month, day), ~replace_na(.x, 0L))) %>%
mutate(asdur = as.duration(years(year) + months(month) + days(day))) %>%
mutate(age_new = as.numeric(asdur)/3.154e+7) # roughly the number of seconds in a year
output:

how to sum conditional functions to grouped rows in R

I have the following data frame:
customerid payment_month payment_date bill_month charges
1          January       22           January    30
1          February      15           February   21
1          March         2            March      33
1          May           4            April      43
1          May           4            May        23
1          June          13           June       32
2          January       12           January    45
2          February      15           February   56
2          March         2            March      67
2          April         4            April      65
2          May           4            May        54
2          June          13           June       68
3          January       25           January    45
3          February      26           February   56
3          March         30           March      67
3          April         1            April      65
3          June          1            May        54
3          June          1            June       68
(The real data contains many more IDs.) I want to calculate payment efficiency using the following formula:
efficiency = (amount paid not late / total bill amount) * 100
"Not late" means paying no later than the 21st day of the bill's month (paying January's bill on the 22nd of January counts as late).
I want to calculate the efficiency of each customer with the expected output of
customerid effectivity
1          59.90
2          100
3          37.46
I have tried the following code to calculate it for one ID, and it works. But I want to apply it to every customer ID and summarize the result into one column (effectivity) with one row per ID. I have tried using group_by, aggregate and ifelse, but nothing works. What should I do?
df1 <- filter(df, (payment_month != bill_month & customerid == 1) | (payment_month == bill_month & payment_date > 21 & customerid == 1))
df2 <- filter(df, customerid == 1)
x <- sum(df1$charges) # charges paid late
y <- sum(df2$charges) # total charges
100 - (x/y)*100
An option using dplyr
library(dplyr)
df %>%
group_by(customerid) %>%
summarise(
effectivity = sum(
charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
.groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
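The same calculation can be sketched without dplyr. The data frame below re-types the example from the question; on_time, paid_on_time, total and eff are names I made up for the illustration.

```r
months6 <- c("January", "February", "March", "April", "May", "June")
df <- data.frame(
  customerid    = rep(1:3, each = 6),
  payment_month = c("January", "February", "March", "May", "May", "June",
                    months6,
                    "January", "February", "March", "April", "June", "June"),
  payment_date  = c(22, 15, 2, 4, 4, 13,
                    12, 15, 2, 4, 4, 13,
                    25, 26, 30, 1, 1, 1),
  bill_month    = rep(months6, 3),
  charges       = c(30, 21, 33, 43, 23, 32,
                    45, 56, 67, 65, 54, 68,
                    45, 56, 67, 65, 54, 68)
)

# A payment counts as on time if it is made in the bill's month, by the 21st
on_time <- df$payment_month == df$bill_month & df$payment_date <= 21

# Multiplying by the logical vector zeroes out the late charges
paid_on_time <- tapply(df$charges * on_time, df$customerid, sum)
total        <- tapply(df$charges, df$customerid, sum)
eff <- paid_on_time / total * 100
round(eff, 2)
```

For customer 1 this is 109 / 182 * 100, which matches the expected 59.9 up to rounding.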
# Alternative: compare month numbers, so that paying before the bill's month also counts as not late
df %>%
group_by(customerid) %>%
mutate(pay_month_number = match(payment_month, month.name),
bill_month_number = match(bill_month, month.name)) %>%
mutate(nolate = pay_month_number < bill_month_number |
(pay_month_number == bill_month_number & payment_date <= 21)) %>%
summarise(efficiency = sum(charges[nolate]) / sum(charges) * 100)

Performing operation among levels of grouped variable in R/dplyr

I want to perform a calculation among the levels of a grouping variable and fit this into a dplyr/tidyverse style workflow. I know this is confusing wording, but I hope the example below helps to clarify.
Below, I want to find the difference between levels "A" and "B" for each year that I have data. One solution is to cast the data from long to wide format and use mutate() to find the difference between A and B, creating a new column with the results.
Ultimately, I'm working with a much larger dataset in which for each of N species, and for every year of sampling, I want to find the response ratio of some measured variable. Being able to keep the calculation in a long-format workflow would greatly help with later uses of the data.
library(tidyverse)
library(reshape)
set.seed(34)
test = data.frame(Year = rep(seq(2011,2020),2),
Letter = rep(c('A','B'),each = 10),
Response = sample(100,20))
test.results = test %>%
cast(Year ~ Letter, value = 'Response') %>%
mutate(diff = A - B)
#test.results
Year A B diff
2011 93 48 45
2012 33 44 -11
2013 9 80 -71
2014 10 61 -51
2015 50 67 -17
2016 8 43 -35
2017 86 20 66
2018 54 99 -45
2019 29 100 -71
2020 11 46 -35
Is there some solution where I could group by Year and then use a function like summarize() to calculate between the levels of the variable "Letter"?
test %>%
group_by(Year) %>%
summarise(...) # something here to perform a calculation between levels A and B of the variable "Letter"
You can subset the Response values for "A" and "B" and then take the difference.
library(dplyr)
test %>%
group_by(Year) %>%
summarise(diff = Response[Letter == 'A'] - Response[Letter == 'B'])
# Year diff
# <int> <int>
# 1 2011 45
# 2 2012 -11
# 3 2013 -71
# 4 2014 -51
# 5 2015 -17
# 6 2016 -35
# 7 2017 66
# 8 2018 -45
# 9 2019 -71
#10 2020 -35
In this example, there are only two levels, so we can also arrange the data so that "B" comes before "A" within each year and use diff(), which then returns A - B:
test %>%
arrange(Year, desc(Letter)) %>%
group_by(Year) %>%
summarise(diff = diff(Response))
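If the long format has to be kept for later steps, the same subsetting works inside mutate(), which adds the result to every row instead of collapsing each group. A sketch on the first three years of the example output; the ratio column is my assumption about what the "response ratio" in the question might look like.

```r
library(dplyr)

# First three years of the example, typed in by hand
test <- data.frame(Year = rep(2011:2013, 2),
                   Letter = rep(c("A", "B"), each = 3),
                   Response = c(93, 33, 9, 48, 44, 80))

res <- test %>%
  group_by(Year) %>%
  mutate(diff  = Response[Letter == "A"] - Response[Letter == "B"],
         ratio = Response[Letter == "A"] / Response[Letter == "B"]) %>%
  ungroup()
res
```

This relies on each Year having exactly one "A" row and one "B" row.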

How to group in R with partial match and assign a column with the aggregated value?

Below is the data frame I have:
Quarter Revenue
1 2014-Q1 10
2 2014-Q2 20
3 2014-Q3 30
4 2014-Q4 40
5 2015-Q1 50
6 2015-Q2 60
7 2015-Q3 70
8 2015-Q4 80
I want to find the mean of the quarters containing Q1, Q2, Q3 and Q4 separately (e.g. for text containing Q1, I have two values for revenue, 10 and 50, the mean of which is 30) and insert a column holding the mean. The output should look like this:
Quarter Revenue Aggregate
1 2014-Q1 10 30
2 2014-Q2 20 40
3 2014-Q3 30 50
4 2014-Q4 40 60
5 2015-Q1 50 30
6 2015-Q2 60 40
7 2015-Q3 70 50
8 2015-Q4 80 60
Could you please let me know how to do this, both with and without the popular packages?
Thanks!
We can separate the "Quarter" into "Year", "Quart", group by "Quart", and get the mean of "Revenue"
library(dplyr)
library(tidyr)
separate(df1, Quarter, into = c("Year", "Quart"), remove = FALSE) %>%
group_by(Quart) %>%
mutate(Aggregate = mean(Revenue)) %>%
ungroup() %>%
select(-Quart, -Year)
# Quarter Revenue Aggregate
# <chr> <int> <dbl>
#1 2014-Q1 10 30
#2 2014-Q2 20 40
#3 2014-Q3 30 50
#4 2014-Q4 40 60
#5 2015-Q1 50 30
#6 2015-Q2 60 40
#7 2015-Q3 70 50
#8 2015-Q4 80 60
Or we can do this compactly with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)); grouped by the substring of 'Quarter' (with the year and "-" removed), we assign (:=) the mean of 'Revenue' to create 'Aggregate'.
library(data.table)
setDT(df1)[, Aggregate := mean(Revenue) ,.(sub(".*-", "", Quarter))]
One possible solution using functions from the base package.
qtr <- c("Q1", "Q2", "Q3", "Q4")
avg <- numeric()
for (n in seq_along(qtr)) {
ind <- grep(qtr[n], df1$Quarter)
avg[length(avg) + 1] <- mean(df1$Revenue[ind])
}
df1 <- transform(df1, Aggregate = avg)
Apparently, using functions from other packages (e.g., dplyr) makes the code less verbose.
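Base R can also do this in one line with ave(), which returns group means aligned to the original rows. A sketch that rebuilds the example data by hand:

```r
df1 <- data.frame(Quarter = paste0(rep(2014:2015, each = 4), "-Q", 1:4),
                  Revenue = seq(10, 80, by = 10))

# Group by the part after the "-" (Q1..Q4); ave() uses mean() by default
df1$Aggregate <- ave(df1$Revenue, sub(".*-", "", df1$Quarter))
df1
```

Unlike the loop above, this does not depend on the rows being in Q1 to Q4 order.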

Manipulating data in R from columns to rows

I have data that is currently organized as follows:
X.1 State MN X.2 WI X.3
NA Price Pounds Price Pounds
Year NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65
I would like to convert it to something more like this:
Year State Price Pounds
1980 MN 56 23
1999 MN 41 63
1980 WI 56 96
1999 WI 56 65
Any suggestions for some R-code to manipulate this data correctly?
Thanks!
This requires some manipulation to get it into a format that you can reshape.
df <- read.table(h=T, t=" X.1 State MN X.2 WI X.3
NA NA Price Pounds Price Pounds
Year NA NA NA NA NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65")
df <- df[-2]
# Auto-process names; you should look at intermediate step results to see
# what's going on. This would probably be better addressed with something
# like `na.locf` from `zoo` but this is all in base. Note you can do something
# a fair bit simpler if you know you have the same number of items for each
# state, but this should be robust to different numbers.
df.names <- names(df)
df.names <- ifelse(grepl("X.[0-9]+", df.names), NA, df.names)
df.names[[1]] <- "Year"
df.names.valid <- Filter(Negate(is.na), df.names)
df.names[is.na(df.names)] <- df.names.valid[cumsum(!is.na(df.names))[is.na(df.names)]]
names(df) <- df.names
# rename again by adding Price/Pounds
names(df)[-1] <- paste(
vapply(2:5, function(x) as.character(df[1, x]), ""), # need to do this because we're pulling across different factor columns
names(df)[-1],
sep="."
)
df <- df[-(1:2),] # Don't need rows 1:2 anymore
df
Produces:
Year Price.MN Pounds.MN Price.WI Pounds.WI
3 1980 56 23 56 96
4 1999 41 63 56 65
Then:
using base reshape:
reshape(df, direction="long", varying=2:5)
Which gets you basically where you want to be:
Year time Price Pounds id
1.MN 1980 MN 56 23 1
2.MN 1999 MN 41 63 2
1.WI 1980 WI 56 96 1
2.WI 1999 WI 56 65 2
Clearly you'll want to rename some columns, etc., but that's straightforward. The key point with reshape is that the column names matter so we constructed them in a way that reshape can use.
using reshape2::melt/cast:
library(reshape2)
df.mlt <- melt(df, id.vars="Year")
df.mlt <- transform(df.mlt,
metric=sub("\\..*", "", variable),
state=sub(".*\\.", "", variable)
)
dcast(df.mlt[-2], Year + state ~ metric)
produces:
Year state Pounds Price
1 1980 MN 23 56
2 1980 WI 96 56
3 1999 MN 63 41
4 1999 WI 65 56
BE VERY CAREFUL, it is likely that Price and Pounds are factors because the column used to have both character and numeric values. You will need to convert to numeric with as.numeric(as.character(df$Price)).
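With a recent tidyr, the last step can also be sketched with pivot_longer(). This assumes the cleaned data frame from above (columns named Price.MN, Pounds.MN, and so on), re-typed here as numbers; the name long is arbitrary.

```r
library(tidyr)

# The cleaned-up wide data from above, typed in directly as numeric columns
df <- data.frame(Year = c(1980, 1999),
                 Price.MN = c(56, 41), Pounds.MN = c(23, 63),
                 Price.WI = c(56, 56), Pounds.WI = c(96, 65))

# ".value" tells pivot_longer that the part before the "." names the output column,
# and the part after it goes into a new State column
long <- pivot_longer(df, cols = -Year,
                     names_to = c(".value", "State"),
                     names_sep = "\\.")
long
```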
Well that was a nice challenge. It's a lot of strsplits and greps, and it may not generalize to your entire data set. Or maybe it will, you never know.
> txt <- "X.1 State MN X.2 WI X.3
NA Price Pounds Price Pounds
Year NA
1980 NA 56 23 56 96
1999 NA 41 63 56 65"
>
> x <- textConnection(txt)
> y <- gsub("((X[.][0-9]{1})|NA)|\\s+", " ", readLines(x))
> z <- unlist(strsplit(y, "^\\s+"))
> a <- z[nzchar(z)]
> b <- unlist(strsplit(a, "\\s+"))
> nums <- as.numeric(grep("[0-9]", b[nchar(b) == 2], value = TRUE))
> Price = rev(nums[c(TRUE, FALSE)])
> pounds <- nums[-which(nums %in% Price)]
> data.frame(Year = rep(b[grepl("[0-9]{4}", b)], 2),
State = unlist(lapply(b[grepl("[A-Z]{2}", b)], rep, 2)),
Price = Price,
Pounds = c(pounds[1], rev(pounds[2:3]), pounds[4]))
## Year State Price Pounds
## 1 1980 MN 56 23
## 2 1999 MN 41 63
## 3 1980 WI 56 96
## 4 1999 WI 56 65

Resources