I have a column of data that has mixed units. I'm trying to use ifelse() to standardize the minute values to hours, which is the other unit.
Starting with data like:
test_df <- data.frame(
median_playtime = c("2.5 hours", "9 minutes", "20 hours")
)
I'm trying this:
test_df$median_playtime_hours <- ifelse(
#if the data has hours in it, then...
test = length(grep("hours", as.character(test_df$median_playtime) ,value=FALSE)) == 1
#text removal if it contains hours
,as.numeric(gsub(pattern = " hours", replacement = "", x = as.character(test_df$median_playtime)))
#otherwise, remove minutes text and divide by 60
,as.numeric(gsub(pattern = " minutes", replacement = "", x = test_df$median_playtime)) / 60
)
Each conditional line works ok but produces NAs for the mismatch cases, so the end result is NAs across the board. Is there a way to either ignore the NAs or merge the two conditions so the NAs aren't the only value returned?
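There's an issue with your test: length(grep(...)) == 1 returns a single FALSE for the whole column (grep matches two of the three elements), so ifelse() returns only one value. If you use grepl instead, the test is evaluated per element and you get your expected result: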
test_df$median_playtime_hours <- ifelse(
#if the data has hours in it, then...
test = grepl("hours", as.character(test_df$median_playtime)),
#text removal if it contains hours
as.numeric(gsub(pattern = " hours", replacement = "", x = as.character(test_df$median_playtime))),
#otherwise, remove minutes text and divide by 60
as.numeric(gsub(pattern = " minutes", replacement = "", x = test_df$median_playtime)) / 60
)
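To see why the original test collapses everything to NA, it can help to compare the two tests side by side. This is a minimal sketch on the question's sample values; the names playtime, scalar_test and vector_test are made up for illustration.

```r
playtime <- c("2.5 hours", "9 minutes", "20 hours")

# grep() returns the matching positions (1 and 3), so the length is 2 and
# the comparison yields a single FALSE for the whole column.
scalar_test <- length(grep("hours", playtime)) == 1

# grepl() returns one TRUE/FALSE per element, which is the shape ifelse() needs.
vector_test <- grepl("hours", playtime)

# ifelse() returns a result as long as its test, so a length-one test
# gives a length-one result.
length(ifelse(scalar_test, 1, 2))
length(ifelse(vector_test, 1, 2))
```

With the length-one test, ifelse() only ever looks at the first element of the "minutes" branch, which is NA for "2.5 hours", and the assignment recycles that NA down the column.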
If you separate numbers from units, a lookup table works nicely:
library(tidyverse)
test_df <- tibble(
median_playtime = c("2.5 hours", "9 minutes", "20 hours")
)
test_df %>%
separate(median_playtime, c('time', 'units'), sep = '\\s', convert = TRUE) %>%
mutate(seconds = time * c('minutes' = 60, 'hours' = 60*60)[units])
#> # A tibble: 3 x 3
#> time units seconds
#> <dbl> <chr> <dbl>
#> 1 2.5 hours 9000
#> 2 9 minutes 540
#> 3 20 hours 72000
If you want to keep it all in base,
test_df <- data.frame(
median_playtime = c("2.5 hours", "9 minutes", "20 hours"),
stringsAsFactors = FALSE
)
test_df$seconds <- sapply(strsplit(test_df$median_playtime, "\\s"), function(x){
as.numeric(x[1]) * c(minutes = 60, hours = 60*60)[x[2]]
})
test_df
#> median_playtime seconds
#> 1 2.5 hours 9000
#> 2 9 minutes 540
#> 3 20 hours 72000
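The lookup-table answers above return seconds, while the question asked for hours. A minimal base R sketch of the same idea with an hours-based lookup might look like this; to_hours and per_unit are hypothetical names, and the code assumes every value is a number, whitespace, and either "minutes" or "hours".

```r
# Hypothetical helper: convert strings like "9 minutes" to hours.
to_hours <- function(x) {
  parts <- strsplit(x, "\\s+")
  value <- as.numeric(vapply(parts, `[`, character(1), 1))
  unit  <- vapply(parts, `[`, character(1), 2)
  # Lookup table: how many hours one unit is worth.
  per_unit <- c(minutes = 1 / 60, hours = 1)
  unname(value * per_unit[unit])
}

hours <- to_hours(c("2.5 hours", "9 minutes", "20 hours"))
hours
```

An unrecognised unit would index the lookup with a missing name and produce NA rather than an error, which may or may not be what you want.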
Related
I have a variable named duration.video in the following format hh:mm:ss that I would like to recode into a categorical variable ('Less than 5 minutes', 'between 5 and 30 min', etc.)
Here is my line of code:
video$Duration.video<-as.factor(car::recode(
video$Duration.video,
"00:00:01:00:04:59='Less than 5 minutes';00:05:00:00:30:00='Between 5 and 30 minutes';00:30:01:01:59:59='More than 30 minutes and less than 2h';02:00:00:08:00:00='2h and more'"
))
The code does not work because all the values of the variable are put in one category ('Between 5 and 30 minutes').
I think it's because my variable is in character format but I can't convert it to numeric. And also maybe the format with ":" can be a problem for the recoding in R.
I tried to convert to data.table::ITime but the result remains the same.
This is a tidy solution. You can get this done with base R but this may be easier.
library(lubridate)
library(dplyr)
df <- data.frame(
duration_string = c("00:00:03","00:00:06","00:12:00","00:31:00","01:12:01")
)
df <- df %>%
mutate(
duration = as.duration(hms(duration_string)),
cat_duration = case_when(
duration < dseconds(5) ~ "less than 5 secs",
duration >= dseconds(5) & duration < dminutes(30) ~ "between 5 secs and 30 mins",
duration >= dminutes(30) & duration < dhours(1) ~ "between 30 mins and 1 hour",
duration >= dhours(1) ~ "more than 1 hour",
) ,
cat_duration = factor(cat_duration,levels = c("less than 5 secs",
"between 5 secs and 30 mins",
"between 30 mins and 1 hour",
"more than 1 hour"
))
)
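We can use factor and compare the strings directly, which works because the zero-padded hh:mm:ss format sorts in time order. This only uses base R: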
labs <- c('Less than 5 minutes',
'Between 5 and 30 minutes',
'More than 30 minutes and less than 2h',
'2h and more')
transform(df, factor = {
hms <- substr(duration_string, 1, 8)
factor((hms >= "00:05:00") + (hms > "00:30:00") + (hms >= "02:00:00"), 0:3, labs)
})
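If you would rather not rely on string ordering, a rough alternative is to convert hh:mm:ss to seconds and bin with cut(). This is only a sketch: the sample values and the names secs and duration_cat are made up, and putting exactly 00:30:00 into the third category is an assumption about how the boundaries should fall.

```r
duration_string <- c("00:00:03", "00:04:59", "00:05:00", "00:31:00", "02:00:00")

# Split on ":" and turn each row of (hours, minutes, seconds) into total seconds.
parts <- do.call(rbind, lapply(strsplit(duration_string, ":", fixed = TRUE), as.numeric))
secs  <- as.vector(parts %*% c(3600, 60, 1))

labs <- c('Less than 5 minutes',
          'Between 5 and 30 minutes',
          'More than 30 minutes and less than 2h',
          '2h and more')

# right = FALSE makes each interval closed on the left: [lower, upper).
duration_cat <- cut(secs, breaks = c(-Inf, 5 * 60, 30 * 60, 2 * 3600, Inf),
                    labels = labs, right = FALSE)
duration_cat
```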
I have a variable time_col which contains the word 'minutes' if it is in minutes (e.g. 20 minutes), and only contains a number if it is in hours (e.g. 2). I want to remove the word 'minutes' and convert the value into hours when the observation is in minutes.
df <- raw_df %>%
mutate(time_col = ifelse(str_detect(time_col, "minutes"), time_col/60, time_col))
However, this gives an error:
'Error: Problem with `mutate()` input `time_col`. x non-numeric
argument to binary operator.'
I don't have this issue when I use ifelse(str_detect(time_col, "minutes"), 1, 0) so I think this is because the str_detect replaces time_col before going over to the ifelse condition.
How do I fix this issue?
I've created a dummy dataframe to demonstrate.
Since your time_col is character, you'll need to first get rid of the string " minutes" (note the space before "minutes"), change it to numeric, then divide it by 60.
Input
library(tidyverse)
df <- data.frame(Dummy = letters[1:3],
time_col = c("2", "20 minutes", "30 minutes"))
df
Dummy time_col
1 a 2
2 b 20 minutes
3 c 30 minutes
Code and output
df %>% mutate(time_col = ifelse(
str_detect(time_col, "minutes"),
as.numeric(gsub(" minutes", "", time_col)) / 60,
time_col
))
Dummy time_col
1 a 2
2 b 0.333333333333333
3 c 0.5
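One thing to watch in the output above: the "no" branch of ifelse() is still character, so the whole result is coerced to character (which is why 0.333333333333333 is printed in full). If a numeric column is wanted, one option is to strip the text first and choose the divisor per row. A minimal base R sketch, where is_minutes, value and hours are illustrative names:

```r
time_col <- c("2", "20 minutes", "30 minutes")

# Flag the rows that are in minutes before touching the text.
is_minutes <- grepl("minutes", time_col, fixed = TRUE)

# Remove the unit so every element can be parsed as a number.
value <- as.numeric(sub(" minutes", "", time_col, fixed = TRUE))

# Divide by 60 only where the original value was in minutes.
hours <- value / ifelse(is_minutes, 60, 1)
hours
```

The same three steps can be placed inside mutate() if you prefer to stay in a dplyr pipeline.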
I have a lubridate period column in my table as the following shows.
workerID worked_hours
02 08H30M00S
02 08H00M00S
03 08H00M00S
03 05H40M00S
What I want to achieve is to sum the number of hours worked by workerID. I also want the result in HH:MM:SS format, and if the hours exceed 24 I don't want a day component; the hours should simply accumulate past 24.
I have tried working with
df %>%
group_by(workerID) %>%
summarise(sum(worked_hours))
but this returns a 0.
You can use the lubridate package, which makes dealing with times a bit easier. In your case, we parse the strings with hms() (hours, minutes, seconds), convert them to durations, group by worker ID, and take the sum. Then, to display the total as hours, minutes and seconds, we convert the summed seconds back to a period, i.e.
library(tidyverse)
library(lubridate)
df %>%
mutate(new = as.duration(hms(worked_hours))) %>%
group_by(workerID) %>%
summarise(sum_times = sum(new)) %>%
mutate(sum_times = seconds_to_period(sum_times))
which gives,
# A tibble: 2 x 2
workerID sum_times
<int> <S4: Period>
1 2 16H 30M 0S
2 3 13H 40M 0S
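Note that a period built from more than 24 hours' worth of seconds is displayed with a day component, which the question wanted to avoid. One way around that is to format the total seconds by hand. A minimal sketch, assuming you already have the per-worker total in seconds (for example from as.numeric() on the summed duration); format_hms is a hypothetical helper, not a lubridate function.

```r
# Hypothetical helper: format a number of seconds as HH:MM:SS,
# letting the hours run past 24 instead of rolling over into days.
format_hms <- function(total_seconds) {
  s <- as.integer(round(total_seconds))
  sprintf("%02d:%02d:%02d", s %/% 3600L, (s %% 3600L) %/% 60L, s %% 60L)
}

format_hms(c(94500, 49200))
```

Here 94500 seconds is 26 hours 15 minutes, so the first value comes out as 26:15:00 rather than one day and two hours.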
There's also a base R solution. I've added a row to exceed minutes and hours.
workerID worked_hours
1 2 08H30M00S
2 2 08H00M00S
3 3 08H00M00S
4 3 05H40M00S
5 2 09H45M00S
We could split worked_hours at the non-digit characters, then aggregate it by worker's ID. After that, we carry whole hours out of the minutes column. Finally we collapse the time with :.
p <- cbind(p[1], do.call(rbind, lapply(strsplit(p$worked_hours, "\\D"), as.numeric)))
p <- aggregate(. ~ workerID, p, sum)
p$`1` <- p$`1` + floor(p$`2` / 60)
p$`2` <- p$`2` %% 60
p[-1] <- lapply(p[-1], function(x) sprintf("%02d", x)) # to always have two digits
cbind(p[1], worked_hours=apply(p[-1], 1, function(x) paste(x, collapse=":")))
# workerID worked_hours
# 1 2 26:15:00
# 2 3 13:40:00
Data
p <- structure(list(workerID = c("2", "2", "3", "3", "2"), worked_hours = c("08H30M00S",
"08H00M00S", "08H00M00S", "05H40M00S", "09H45M00S")), row.names = c(NA,
-5L), class = "data.frame")
I found a pet adoption dataset that includes the age of a pet when adopted. However, the age variable contains strings like "3 months" or "4 years" or "3 weeks" all in the same column. The dataset is otherwise tidy. How can I convert these variables into year values?
I've tried something like this:
for(i in i:nrow(Pet_Train$AgeuponOutcome)){
if(grepl(i, "month") == TRUE)
Pet_Train$Age_in_Years[i] == "0"
}
But I have little experience with loops/if statements/this "grepl" function I just looked up. I do have experience with tidy functions like mutate() and filter() but I'm not sure how to apply those with this many possible argument combinations.
There are 27,000 instances, so I'd rather not go through this by hand.
Edit:
I figured out how to use the grepl function to replace instances containing "month" with "less than a year." But is there a way to take the exact number of months and convert them into the year as a decimal?
The first two use only base R and the third uses dplyr and tidyr.
1) Use read.table to split the input column into the numeric and units parts and then multiply the numeric part by the fraction of a year that the units part represents.
PT <- data.frame(Age = c("3 months", "4 years", "3 weeks")) # input
transform(cbind(PT, read.table(text = as.character(PT$Age))),
Years = V1 * (7 / 365.25 * (V2 == "weeks") + 1/12 * (V2 == "months") + (V2 == "years")))
giving:
Age V1 V2 Years
1 3 months 3 months 0.25000000
2 4 years 4 years 4.00000000
3 3 weeks 3 weeks 0.05749487
2) Alternately the last line could be written in terms of switch:
transform(cbind(PT, read.table(text = as.character(PT$Age), as.is = TRUE)),
Years = V1 * sapply(V2, switch, weeks = 7 / 365.25, months = 1 / 12, years = 1))
3) This uses dplyr and tidyr:
library(dplyr)
library(tidyr)
PT %>%
separate(Age, c("No", "Units")) %>%
mutate(No = as.numeric(No),
Years = No * case_when(Units == "weeks" ~ 7 / 365.25,
Units == "months" ~ 1 / 12,
Units == "years" ~ 1))
giving:
No Units Years
1 3 months 0.25000000
2 4 years 4.00000000
3 3 weeks 0.05749487
lubridate-based solution:
library(tidyverse)
library(lubridate)
dat <- tibble(age_text = c("3 months", "4 years", "3 weeks"))
dat %>% mutate(age_in_years = duration(age_text) / dyears(1))
The answer of David Rubinger uses the lubridate package to coerce character strings to objects of class Duration.
The as.duration() function seems to recognize a variety of strings, e.g.,
age_text <- c("3 months", "4 years", "3 weeks", "52 weeks", "365 days 6 hours")
lubridate::as.duration(age_text)
[1] "7889400s (~13.04 weeks)" "126230400s (~4 years)" "1814400s (~3 weeks)"
[4] "31449600s (~52 weeks)" "31557600s (~1 years)"
However, the OP has requested to convert the strings into year values rather than seconds.
This can be achieved by using the as.numeric() function which takes a units parameter to specify the desired conversion:
as.numeric(lubridate::as.duration(age_text), units = "years")
[1] 0.25000000 4.00000000 0.05749487 0.99657769 1.00000000
Other units can be chosen as well:
as.numeric(lubridate::as.duration(age_text), units = "months")
[1] 3.0000000 48.0000000 0.6899384 11.9589322 12.0000000
as.numeric(lubridate::as.duration(age_text), units = "weeks")
[1] 13.04464 208.71429 3.00000 52.00000 52.17857
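The printed values above are consistent with a year of 365.25 days and a month of one twelfth of that year. The arithmetic can be checked in base R without lubridate; this is only a sanity check, with the constants inferred from the output shown rather than taken from the package source.

```r
secs_per_day   <- 24 * 60 * 60            # 86400
secs_per_year  <- 365.25 * secs_per_day   # 31557600, matches "365 days 6 hours"
secs_per_month <- secs_per_year / 12      # 2629800

# "3 months" expressed in seconds
three_months_secs <- 3 * secs_per_month

# "3 weeks" expressed in years
three_weeks_years <- 3 * 7 * secs_per_day / secs_per_year

three_months_secs
three_weeks_years
```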
Just to expand on the comment I left, you could use ifelse. First though, here's a reproducible example of your data (always very useful for you to provide this when asking a question):
df <- data.frame("Duration" = c("3 months", "4 years", "3 weeks"))
You can then split out the units and values from this using string split:
df$Value <- as.numeric(vapply(strsplit(as.character(df$Duration), split = " "), `[`, 1, FUN.VALUE=character(1)))
df$Units <- vapply(strsplit(as.character(df$Duration), split = " "), `[`, 2, FUN.VALUE=character(1))
Finally, use nested ifelse calls, which tell R what to do if the data in a column matches a condition and what to do if not. Here, if the units are weeks, the value is divided by 365.25/7 (about 52.18, the number of weeks per year); if the units are months, it is divided by 12; otherwise the value is already in years and is left as is.
df$Years <- ifelse(df[,'Units']=="weeks", df[,'Value']/(365.25/7), ifelse(df[,'Units']=="months", df[,'Value']/12, df[,'Value']))
And the successful output:
> df
Duration Value Units Years
1 3 months 3 months 0.25000000
2 4 years 4 years 4.00000000
3 3 weeks 3 weeks 0.05749487
Note: It would be more appropriate to do this with "days" as your unit of time, which could be done if you had dates for the first and second event (birth and adoption dates of the animal). This is because years and months are variable length units - December is longer than February, 2016 was longer than 2015 and 2017.
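As a rough illustration of that note, if both dates were available the age in days could be computed directly in base R. The two dates below are invented for the example; the dataset in the question only has the age strings.

```r
# Hypothetical birth and adoption dates for one animal.
birth    <- as.Date("2016-01-15")
adoption <- as.Date("2016-04-15")

# difftime() counts actual calendar days, so the 29-day February of 2016
# is handled without any assumption about month length.
age_days <- as.numeric(difftime(adoption, birth, units = "days"))
age_days
```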
I have an age variable containing observations that follow this (inconsistent) format:
3 weeks, 2 days, 4 hours
4 weeks, 6 days, 12 hours
3 days, 18 hours
4 days, 3 hours
7 hours
8 hours
I need to convert each observation to hours using R.
I have used strsplit(vector, ',') to split the variable at each comma.
I am running into trouble because splitting each observation at the ',' yields anywhere from 1 to 3 entries for each observation. I do not know how to properly index these entries so that I end up with one row for each observation.
I am guessing that once I am able to store these values in sensible rows, I can extract the numeric data from each column in a row and convert accordingly, then sum the entire row.
I am also open to any different methods of approaching this problem.
After you split your data you can parse the resulting list for the keywords defining the times like 'hours', 'weeks', 'days' and create a dataframe containing the relevant value (or 0 if there is no value for a certain keyword). You can achieve that with something like this:
library(dplyr)
vector = c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours", "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
split_vector = strsplit(vector, ",", fixed = TRUE)
parse_string = function(i){
x = split_vector[[i]]
tibble(ID = i) %>%
mutate(hours = ifelse(any(grepl("hours", x)), as.numeric(gsub("\\D", "", x[grepl("hours", x)])), 0),
days = ifelse(any(grepl("days", x)), as.numeric(gsub("\\D", "", x[grepl("days", x)])), 0),
weeks = ifelse(any(grepl("weeks", x)), as.numeric(gsub("\\D", "", x[grepl("weeks", x)])), 0))
}
all_parsed = lapply(seq_along(split_vector), parse_string)
all_parsed = bind_rows(all_parsed) %>%
mutate(final_hours = hours + days * 24 + weeks * 7 * 24)
Hadleyverse comes to the rescue again:
library(lubridate)
library(stringr)
dat <- readLines(textConnection(" 3 weeks, 2 days, 4 hours
4 week, 6 days, 12 hours
3 days, 18 hours
4 day, 3 hours
7 hours
8 hour"))
sapply(str_split(str_trim(dat), ",[ ]*"), function(x) {
sum(sapply(x, function(y) {
bits <- str_split(str_trim(y), "[ ]+")[[1]]
duration(as.numeric(bits[1]), bits[2])
})) / 3600
})
## [1] 556 828 90 99 7 8
I whacked the data a bit to show it's also somewhat flexible in how it parses things. I really don't think the second str_trim is absolutely necessary but didn't have cycles to verify.
The exposition is that it trims the original vector then splits it into components (which makes a list of vectors). That list is then iterated over and the individual vector elements are further trimmed and split into a number and a unit. Those are passed to lubridate's duration(), the resulting values are summed as numeric seconds, and we then divide by 3600 to make it into hours.
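The same trim, split and sum steps can also be sketched in base R with a lookup of hours per unit. This is a minimal sketch, not the answer's method: to_total_hours and hours_per_unit are made-up names, and it assumes the units are spelled exactly "weeks", "days" or "hours" (the singular forms in the whacked data would need extra entries in the lookup).

```r
age <- c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours",
         "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")

# Lookup table: how many hours one of each unit is worth.
hours_per_unit <- c(weeks = 7 * 24, days = 24, hours = 1)

# Hypothetical helper: total hours for a single observation.
to_total_hours <- function(x) {
  pieces <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])  # e.g. "3 weeks"
  fields <- strsplit(pieces, "\\s+")                     # number and unit
  value  <- as.numeric(vapply(fields, `[`, character(1), 1))
  unit   <- vapply(fields, `[`, character(1), 2)
  sum(value * hours_per_unit[unit])
}

# One total per observation, regardless of how many components it has.
total_hours <- vapply(age, to_total_hours, numeric(1), USE.NAMES = FALSE)
total_hours
```

Because each observation is reduced to a single number inside the helper, the varying number of comma-separated pieces never has to be stored in rows or columns.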