Replacing week/month times as strings with time in years? (R)

I found a pet adoption dataset that includes the age of a pet when adopted. However, the age variable contains strings like "3 months" or "4 years" or "3 weeks" all in the same column. The dataset is otherwise tidy. How can I convert these variables into year values?
I've tried something like this:
for(i in i:nrow(Pet_Train$AgeuponOutcome)){
if(grepl(i, "month") == TRUE)
Pet_Train$Age_in_Years[i] == "0"
}
But I have little experience with loops, if statements, or this grepl function I just looked up. I do have experience with tidy functions like mutate() and filter(), but I'm not sure how to apply them with this many possible combinations.
There are 27,000 rows, so I'd rather not go through this by hand.
Edit:
I figured out how to use the grepl function to replace instances containing "month" with "less than a year." But is there a way to take the exact number of months and convert them into the year as a decimal?

The first two use only base R and the third uses dplyr and tidyr.
1) Use read.table to split the input column into the numeric and units parts and then multiply the numeric part by the fraction of a year that the units part represents.
PT <- data.frame(Age = c("3 months", "4 years", "3 weeks")) # input
transform(cbind(PT, read.table(text = as.character(PT$Age))),
Years = V1 * (7 / 365.25 * (V2 == "weeks") + 1/12 * (V2 == "months") + (V2 == "years")))
giving:
       Age V1     V2      Years
1 3 months  3 months 0.25000000
2  4 years  4  years 4.00000000
3  3 weeks  3  weeks 0.05749487
2) Alternately the last line could be written in terms of switch:
transform(cbind(PT, read.table(text = as.character(PT$Age), as.is = TRUE)),
Years = V1 * sapply(V2, switch, weeks = 7 / 365.25, months = 1 / 12, years = 1))
3) This uses dplyr and tidyr:
library(dplyr)
library(tidyr)
PT %>%
separate(Age, c("No", "Units")) %>%
mutate(No = as.numeric(No),
Years = No * case_when(Units == "weeks" ~ 7 / 365.25,
Units == "months" ~ 1 / 12,
Units == "years" ~ 1))
giving:
  No  Units      Years
1  3 months 0.25000000
2  4  years 4.00000000
3  3  weeks 0.05749487
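The unit-to-fraction idea used in all three variants can also be written with a named lookup vector. This is only a sketch in base R: the names `per_year`, `number` and `unit` are mine, and the stripping of a trailing "s" assumes the real column may also contain singular units such as "1 year" (the question only shows plurals).

```r
PT <- data.frame(Age = c("3 months", "4 years", "3 weeks"),
                 stringsAsFactors = FALSE)

# Split "3 months" into its number and its unit
parts  <- strsplit(PT$Age, " ", fixed = TRUE)
number <- as.numeric(vapply(parts, `[`, character(1), 1))
unit   <- sub("s$", "", vapply(parts, `[`, character(1), 2))  # "months" -> "month"

# Fraction of a year represented by one of each unit
per_year <- c(week = 7 / 365.25, month = 1 / 12, year = 1)

PT$Years <- number * unname(per_year[unit])
PT
```

A unit that is not in `per_year` produces NA rather than an error, which makes unexpected values in a 27,000-row column easy to spot with `is.na()`.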

lubridate-based solution:
library(tidyverse)
library(lubridate)
dat <- tibble(age_text = c("3 months", "4 years", "3 weeks"))  # data_frame() is deprecated in favour of tibble()
dat %>% mutate(age_in_years = duration(age_text) / dyears(1))

The answer of David Rubinger uses the lubridate package to coerce character strings to objects of class Duration.
The as.duration() function seems to recognize a variety of strings, e.g.,
age_text <- c("3 months", "4 years", "3 weeks", "52 weeks", "365 days 6 hours")
lubridate::as.duration(age_text)
[1] "7889400s (~13.04 weeks)" "126230400s (~4 years)" "1814400s (~3 weeks)"
[4] "31449600s (~52 weeks)" "31557600s (~1 years)"
However, the OP has requested to convert the strings into year values rather than seconds.
This can be achieved by using the as.numeric() function which takes a units parameter to specify the desired conversion:
as.numeric(lubridate::as.duration(age_text), units = "years")
[1] 0.25000000 4.00000000 0.05749487 0.99657769 1.00000000
Other units can be chosen as well:
as.numeric(lubridate::as.duration(age_text), units = "months")
[1] 3.0000000 48.0000000 0.6899384 11.9589322 12.0000000
as.numeric(lubridate::as.duration(age_text), units = "weeks")
[1] 13.04464 208.71429 3.00000 52.00000 52.17857
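The numbers printed above are consistent with fixed-length units: a year of 365.25 days and a month of one twelfth of that. Here is a quick base-R check of the arithmetic. The variable names are mine, and the 365.25-day year is inferred from the output shown in this answer, not quoted from the lubridate documentation.

```r
secs_per_day   <- 24 * 60 * 60             # 86400
secs_per_week  <- 7 * secs_per_day
secs_per_year  <- 365.25 * secs_per_day    # 31557600
secs_per_month <- secs_per_year / 12       # 2629800

# Same five inputs as age_text above, converted by hand
secs <- c(3 * secs_per_month,              # "3 months"
          4 * secs_per_year,               # "4 years"
          3 * secs_per_week,               # "3 weeks"
          52 * secs_per_week,              # "52 weeks"
          365 * secs_per_day + 6 * 3600)   # "365 days 6 hours"

secs                   # seconds, as in the as.duration() printout
secs / secs_per_year   # years, as in as.numeric(..., units = "years")
```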

Just to expand on the comment I left, you could use ifelse. First though, here's a reproducible example of your data (always very useful for you to provide this when asking a question):
df <- data.frame("Duration" = c("3 months", "4 years", "3 weeks"))
You can then split out the units and values from this using string split:
df$Value <- as.numeric(vapply(strsplit(as.character(df$Duration), split = " "), `[`, 1, FUN.VALUE=character(1)))
df$Units <- vapply(strsplit(as.character(df$Duration), split = " "), `[`, 2, FUN.VALUE=character(1))
Finally, use nested ifelse calls, which tell R what to do if a value matches a condition and what to do if it does not. Here: if the unit is weeks, divide the value by about 52.18 (365.25/7, the number of weeks per year); if the unit is months, divide by 12; otherwise leave the value as it is.
df$Years <- ifelse(df[,'Units']=="weeks", df[,'Value']/(365.25/7), ifelse(df[,'Units']=="months", df[,'Value']/12, df[,'Value']))
And the successful output:
> df
  Duration Value  Units      Years
1 3 months     3 months 0.25000000
2  4 years     4  years 4.00000000
3  3 weeks     3  weeks 0.05749487
Note: It would be more appropriate to do this with "days" as your unit of time, which could be done if you had dates for the first and second event (birth and adoption dates of the animal). This is because years and months are variable length units - December is longer than February, 2016 was longer than 2015 and 2017.
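To illustrate the note, here is a minimal sketch of the date-based approach. The dates and the names `birth` and `outcome` are made up for the example; the dataset in the question is not known to contain such columns.

```r
# Hypothetical birth and adoption dates
birth   <- as.Date(c("2016-01-15", "2015-06-01"))
outcome <- as.Date(c("2016-04-15", "2017-06-01"))

# Exact age in days, independent of month or year length
age_days <- as.numeric(difftime(outcome, birth, units = "days"))
age_days

# If a year value is still wanted, pick a convention explicitly
age_years <- age_days / 365.25
age_years
```

Both intervals span February 2016, so the leap day is counted automatically in `age_days`.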

Related

split a column into numeric and non-numeric components

I need to split one column into 2, where the resulting columns contain the numeric or character portions of the original column.
df <- data.frame(myCol = c("24 hours", "36days", "1month", "2 months +"))
myCol
24 hours
36days
1month
2 months +
result should be:
alpha numeric
hours 24
days 36
month 1
months + 2
Note the inconsistent formatting of the original dataframe (sometimes with spaces, sometimes without).
tidy or base solutions are fine
Thanks
One solution could be:
library(tidyverse)
df %>%
separate(myCol,
into = c("numeric", "alpha"),
sep = "(?=[a-z +]+)(?<=[0-9])"
)
Which returns:
numeric alpha
1 24 hours
2 36 days
3 1 month
4 2 months +
You could do:
library(stringr)
df$numeric <- str_extract(df$myCol, "[0-9]+")
df$alpha <- str_remove(df$myCol, df$numeric)
Or with base functions
df$numeric <- regmatches(df$myCol, regexpr("[0-9]+", df$myCol))
df$alpha <- gsub("[0-9]+", "", df$myCol)
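Two details of the stringr and base versions are worth knowing: removing only the digits leaves the separating space in front of values like " hours", and regmatches() returns only the elements that matched, so the assignment fails if some row has no digits at all. A base-R sketch that trims the text part and keeps NA for rows without a leading number (the helper name `has_num` is mine):

```r
df <- data.frame(myCol = c("24 hours", "36days", "1month", "2 months +"),
                 stringsAsFactors = FALSE)

# Rows that start with a number (optionally preceded by whitespace)
has_num <- grepl("^[[:space:]]*[0-9]+", df$myCol)

# Numeric part: NA where no leading number was found
df$numeric <- NA_real_
df$numeric[has_num] <- as.numeric(
  sub("^[[:space:]]*([0-9]+).*$", "\\1", df$myCol[has_num])
)

# Text part: drop the leading number, then trim surrounding whitespace
df$alpha <- trimws(sub("^[[:space:]]*[0-9]+", "", df$myCol))
df
```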

How to convert words to a date?

I have a df of dates that are in this format: 4 days ago,
6 weeks ago, 8 months ago, 1 year ago.
I want to write a statement that checks first to see if it's month, week, year. Then it extracts the number. After that I do the appropriate calculation by subtracting from Sys.Date(). I've tried a couple different ways and can't get it to work.
Any chance you can help me with one so I can figure out the rest?
Thanks in advance.
Does this crude function help you? It should work even for strings like "3 years, 2 months ago". Returns NA if month, year or day do not appear in the string with a number in front.
library("stringr")
# Small helper function to convert NAs to zero and convert to numeric
na_to_zero <- function(x) {
x[is.na(x)] <- "0"
return(as.numeric(x))
}
get_date_before_today <- function(d) {
today <- Sys.Date()
days <- na_to_zero(str_extract(d, "(?i)[0-9]*(?= day\\D)"))
months <- na_to_zero(str_extract(d, "(?i)[0-9]*(?= month\\D)"))
years <- na_to_zero(str_extract(d, "(?i)[0-9]*(?= year\\D)"))
days_ago <- days + 365.25/12*months + 365.25*years
date_before_today <- today - days_ago
# If no matches were made, zeros are substituted for all, and hence days_ago is 0
date_before_today[days_ago == 0] <- NA
return(date_before_today)
}
Testing:
d <- c("4 months ago asds", "2 years ago", "1 day ago", "5 years, 3 months", "never")
get_date_before_today(d)
#[1] "2018-05-15" "2016-09-13" "2018-09-13" "2013-06-14" NA
Note that it does not give you exact dates per se. But one can argue that, for example, "1 month ago" is ambiguous: what does 1 month ago mean exactly if today is the 31st of October?
The "weeks" case can be added trivially.
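As a sketch of what adding the "weeks" case could look like, here is a base-R variant of the same idea. It is not the function above: the names `extract_count` and `date_before` are hypothetical, and `today` is made an argument only so that the example gives a reproducible result.

```r
# Number standing directly in front of a unit word, or 0 if the unit is absent
extract_count <- function(d, unit) {
  hits <- regmatches(d, regexec(paste0("([0-9]+) ", unit), d, ignore.case = TRUE))
  vapply(hits, function(h) if (length(h) == 2) as.numeric(h[2]) else 0, numeric(1))
}

date_before <- function(d, today = Sys.Date()) {
  days_ago <- extract_count(d, "day") +
    7 * extract_count(d, "week") +            # the added "weeks" case
    365.25 / 12 * extract_count(d, "month") +
    365.25 * extract_count(d, "year")
  out <- today - days_ago
  out[days_ago == 0] <- NA                    # nothing recognised in the string
  out
}

res <- date_before(c("6 weeks ago", "4 days ago", "never"),
                   today = as.Date("2018-09-14"))
res
```

With the reference date fixed at 2018-09-14, the first two strings should come out as 2018-08-03 and 2018-09-10, and "never" as NA.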
We can patch together a few tidyverse functions to make quick work of this, mostly using lubridate for the date shifting, stringr for the string parsing, and purrr for the mapping. For example:
mm <- stringr::str_match(x, "(\\d+) (day|week|month|year)s? ago")
shifter <- list(day=days, week=weeks, month=months, year=years)
shifts <- map2(mm[,3], as.numeric(mm[,2]), ~case_when(.x=="day"~days(.y),
.x=="week"~weeks(.y),
.x=="month"~months(.y),
.x=="year"~years(.y)))
map_dbl(shifts, ~today()-.x) %>% as_date
# [1] "2018-09-10" "2018-08-03" "2018-01-14" "2017-09-14"
# where today() returns [1] "2018-09-14"

How to convert char string like "70 min" , "1 hr 30 min" and N/A to a numeric type in a dataframe [duplicate]

This question already has answers here:
Calculate character string "days, hours, minutes, seconds" to numeric total days [duplicate]
(4 answers)
Closed 5 years ago.
I have a dataframe with a column that holds runtime values such as "70 min", "1 hr 30 Min", and N/A. I want to convert these values to numeric minutes: 70 min should become 70 and 1 hr 30 min should become 90. I also want to keep N/A as it is.
a<- c("70 min", "1 hr 30 Min")
typeof(a)
a <- as.numeric(a)
When I tried as.numeric, it converted everything to NA, and some experiments with the lubridate package also disappointed me. Any expert advice, please?
The duplicate link did not look particularly appetizing to me, so I will offer the following regex-based solution. Assuming your non-standard timestamps are in a fixed and known format, we can use a regex to extract the various portions. Under the assumption that you only have hour and minute information, you can try:
a <- c("70 min", "1 hr 30 Min", "Blah")
hrs <- as.numeric(gsub(".*?(\\d+) [Hh]rs?.*", "\\1", a))
hrs[is.na(hrs)] <- 0
min <- as.numeric(gsub(".*?(\\d+) [Mm]in.*", "\\1", a))
min[is.na(min)] <- 0
total <- hrs*60 + min
Output:
> min
[1] 70 30 0
> hrs
[1] 0 1 0
> total
[1] 70 90 0
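The question also asks to keep N/A as it is, whereas the code above maps strings without any hour or minute information to 0. One way to return NA for those instead, as a rough sketch (the function names `runtime_minutes` and `grab` are hypothetical, and the input is assumed to be a character vector in which missing values appear as text such as "N/A"):

```r
runtime_minutes <- function(a) {
  # Number in front of the unit, or NA when the unit does not occur
  grab <- function(pattern) {
    hits <- regmatches(a, regexec(pattern, a, ignore.case = TRUE))
    vapply(hits, function(h) if (length(h) == 2) as.numeric(h[2]) else NA_real_,
           numeric(1))
  }
  hrs  <- grab("([0-9]+) *hrs?")
  mins <- grab("([0-9]+) *min")

  # Treat a missing part as 0 unless both parts are missing
  total <- ifelse(is.na(hrs), 0, hrs) * 60 + ifelse(is.na(mins), 0, mins)
  total[is.na(hrs) & is.na(mins)] <- NA
  total
}

runtime <- runtime_minutes(c("70 min", "1 hr 30 Min", "N/A"))
runtime
```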

Convert age entered as 'X Weeks, Y Days, Z hours' in R

I have an age variable containing observations that follow this (inconsistent) format:
3 weeks, 2 days, 4 hours
4 weeks, 6 days, 12 hours
3 days, 18 hours
4 days, 3 hours
7 hours
8 hours
I need to convert each observation to hours using R.
I have used strsplit(vector, ',') to split the variable at each comma.
I am running into trouble because splitting each observation at the ',' yields anywhere from 1 to 3 entries for each observation. I do not know how to properly index these entries so that I end up with one row for each observation.
I am guessing that once I am able to store these values in sensible rows, I can extract the numeric data from each column in a row and convert accordingly, then sum the entire row.
I am also open to any different methods of approaching this problem.
After you split your data you can parse the resulting list for the keywords defining the times like 'hours', 'weeks', 'days' and create a dataframe containing the relevant value (or 0 if there is no value for a certain keyword). You can achieve that with something like this:
library(dplyr)
vector = c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours", "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
split_vector = strsplit(vector, ",", fixed = TRUE)
parse_string = function(i){
x = split_vector[[i]]
tibble(ID = i) %>%  # tibble() replaces the deprecated data_frame()
mutate(hours = ifelse(any(grepl("hours", x)), as.numeric(gsub("\\D", "", x[grepl("hours", x)])), 0),
days = ifelse(any(grepl("days", x)), as.numeric(gsub("\\D", "", x[grepl("days", x)])), 0),
weeks = ifelse(any(grepl("weeks", x)), as.numeric(gsub("\\D", "", x[grepl("weeks", x)])), 0))
}
all_parsed = lapply(seq_along(split_vector), parse_string)
all_parsed = bind_rows(all_parsed) %>%  # bind_rows() replaces the removed rbind_all()
mutate(final_hours = hours + days * 24 + weeks * 7 * 24)
Hadleyverse comes to the rescue again:
library(lubridate)
library(stringr)
dat <- readLines(textConnection(" 3 weeks, 2 days, 4 hours
4 week, 6 days, 12 hours
3 days, 18 hours
4 day, 3 hours
7 hours
8 hour"))
sapply(str_split(str_trim(dat), ",[ ]*"), function(x) {
sum(sapply(x, function(y) {
bits <- str_split(str_trim(y), "[ ]+")[[1]]
duration(as.numeric(bits[1]), bits[2])
})) / 3600
})
## [1] 556 828 90 99 7 8
I whacked the data a bit to show it's also somewhat flexible in how it parses things. I really don't think the second str_trim is absolutely necessary, but I didn't have cycles to verify.
The exposition is that it trims the original vector then splits it into components (which makes a list of vectors). That list is then iterated over and the individual vector elements are further trimmed and split into # and unit duration. That's passed to lubridate and the value is returned and automatically converted to numeric seconds by the call to sum and we then make it into hours.
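For comparison, the same split-then-sum logic can be sketched in base R without lubridate. This also addresses the indexing problem from the question: each observation stays one list element, however many parts it has. The names `hours_per` and `to_hours` are mine, and the sketch assumes every part looks like "<number> <unit>" with the unit being week(s), day(s) or hour(s).

```r
# Hours represented by one of each unit
hours_per <- c(week = 7 * 24, day = 24, hour = 1)

to_hours <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    parts <- trimws(parts)                               # "2 days"
    n     <- as.numeric(sub(" .*$", "", parts))          # 2
    unit  <- sub("s$", "", sub("^[0-9]+ +", "", parts))  # "day"
    sum(n * hours_per[unit])                             # one total per observation
  }, numeric(1))
}

age <- c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours",
         "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
hours <- to_hours(age)
hours
```

This should reproduce the totals shown above (556, 828, 90, 99, 7, 8). An unrecognised unit yields NA for that observation.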

How to extract quarters from dates?

Why doesn't the cut at "3 months" produce labels as expected?
# create time series data
everyday <- seq(from = as.Date('2014-1-1'), to = as.Date('2014-12-31'), by = 'day')
# create a factor based on the quarter of the year an observation is in:
qtrs <- cut(everyday, "3 months", labels = paste0('Q', 1:4))
## Error in cut.default(unclass(x), unclass(breaks), labels = labels,
## right = right, :
## lengths of 'breaks' and 'labels' differ
The cut is every 3 months, so that would create 4 Quarters and I'd expect to need 4 labels, but the error message suggests that the length of breaks and labels differs.
qtrs <- cut(everyday, "3 months", labels = paste0('Q', 1:5))
table(qtrs)
## qtrs
## Q1 Q2 Q3 Q4 Q5
## 90 91 92 92 0
The fifth label Q5 seems to be needed and yet appears with a zero count.
The example is taken from "Data Manipulation with R" by Phil Spector,
http://www.springer.com/statistics/computational+statistics/book/978-0-387-74730-9
This does not answer your original question, but is a way to achieve (what I assume is) the same result, without cut. You may use the quarters function to extract the 'quarter' from a Date object:
table(quarters(everyday))
# Q1 Q2 Q3 Q4
# 90 91 92 92
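If a factor with fixed levels Q1 to Q4 is needed (as the cut() call was meant to produce), one possible sketch derives the quarter from the month number instead. The use of as.POSIXlt() here is my choice, not something from the book or the answer above.

```r
everyday <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-12-31"), by = "day")

# $mon is the zero-based month (0 = January), so integer division by 3 gives 0..3
qtr_index <- as.POSIXlt(everyday)$mon %/% 3 + 1

qtrs <- factor(paste0("Q", qtr_index), levels = paste0("Q", 1:4))
table(qtrs)
```

The counts should match the quarters() result: 90, 91, 92 and 92 days.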
