Using the same variable within str_detect and mutate - r

I have a variable time_col which contains the word 'minutes' if the value is in minutes (e.g. 20 minutes), and only a number if it is in hours (e.g. 2). I want to remove the word 'minutes' and convert the value to hours when the observation is in minutes.
df <- raw_df %>%
  mutate(time_col = ifelse(str_detect(time_col, "minutes"), time_col / 60, time_col))
However, this gives an error:
Error: Problem with `mutate()` input `time_col`. x non-numeric argument to binary operator.
I don't have this issue when I use ifelse(str_detect(time_col, "minutes"), 1, 0), so I think the str_detect() is replacing time_col before it gets to the ifelse() condition.
How do I fix this issue?

I've created a dummy dataframe to demonstrate.
Since your time_col is character, you'll need to first get rid of the string " minutes" (note the space before "minutes"), change it to numeric, then divide it by 60.
Input
library(tidyverse)
df <- data.frame(Dummy = letters[1:3],
                 time_col = c("2", "20 minutes", "30 minutes"))
df
Dummy time_col
1 a 2
2 b 20 minutes
3 c 30 minutes
Code and output
df %>% mutate(time_col = ifelse(
  str_detect(time_col, "minutes"),
  as.numeric(gsub(" minutes", "", time_col)) / 60,
  time_col
))
Dummy time_col
1 a 2
2 b 0.333333333333333
3 c 0.5
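Note that the resulting time_col is still character, because ifelse() returns the untouched character values from the no branch. If you want a numeric column throughout, one option (a sketch using readr::parse_number(), which is attached with the tidyverse) is to parse the numbers first and divide afterwards:

df %>%
  mutate(is_minutes = str_detect(time_col, "minutes"),   # flag rows given in minutes
         time_col = parse_number(time_col),              # "20 minutes" -> 20, "2" -> 2
         time_col = ifelse(is_minutes, time_col / 60, time_col)) %>%
  select(-is_minutes)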

Related

adjust "width" argument in rollapply() function in r for discontinuous dates

I have a dataset of daily remotely sensed data. In short, it's reflectance (values between 0 and 1) for the last 20 years. Because it's remotely sensed data, some dates do not have a value because of clouds or some other obstruction.
I want to use rollapply() in R's zoo package to detect in the time series when the values remain at 1.0 for a certain amount of time (let's say 2 weeks) or at 0 for that same amount of time.
I have code to do this, but the width argument in the rollapply() function (the 2-week threshold mentioned in the previous paragraph) looks at data points rather than time. So it looks at 14 data values rather than 14 days, which may span over a month due to the missing data values from cloud cover etc.
Here's an example:
library(dplyr)
library(lubridate)
library(zoo)

test_data <- data.frame(date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
                        value = c(0, 1, 1, 1, 0))
test_data$date <- ymd(test_data$date)

select_first_1_value <- test_data %>%
  mutate(value = rollapply(value, width = 3, min, align = "left", fill = NA, na.rm = TRUE)) %>%
  filter(value == 1) %>%
  filter(row_number() == 1) %>%
  ungroup()
With width = 3 this works: it finds that 2000-01-02 is the first date where value = 1 holds for at least 3 values. However, if I change the width to 14 it no longer works, because it only sees 5 values in this instance. Even if I wrote out an additional 10 values equal to 1 (for a total of 15), it would still be incorrect, because value = 0 at 2000-01-18 and the function counts data points, not dates.
But when we look at the dates, there are missing dates between 2000-01-03 and 2000-01-17. If both are a value = 1, then I want to extract 2000-01-02 as the first instance where the time series remains at 1 for at least 14 consecutive days. Here, I'm assuming that the values are 1 for the missing days.
Any help is greatly appreciated. Thank you.
There really are two problems here:
1. How to roll by date rather than by number of points.
2. How to find the first stretch of 14 days of 1's, assuming that missing dates are 1.
Note that (2) is not readily solved by (1) because the start of the first series of ones may not be any of the listed dates! For example, suppose we change the first date to Dec 1, 1999 giving test_data2 below. Then the start of the first period of 14 ones is Dec 2, 1999. It is not any of the dates in the test_data2 series.
test_data2 <- data.frame(
  date = c("1999-12-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
  value = c(0, 1, 1, 1, 0))
1) What we need to do is not roll by date but rather expand the series to fill in the missing dates, giving zz, and then use rollapply. Below we do that by creating a zoo series (which also converts the dates to Date class) and then converting it to ts class. Because ts class can only represent regularly spaced series, that conversion fills in the missing dates with NA values. We fill those in with 1 and then convert back to a zoo series with a Date index.
library(zoo)

z <- read.zoo(test_data2)   # zoo series; the first column becomes a Date index
zz <- z |> as.ts() |> na.fill(1) |> as.zoo() |> aggregate(as.Date)   # expand to daily, fill gaps with 1
r <- rollapply(zz, 14, min, na.rm = TRUE, partial = TRUE, align = "left")   # partial = TRUE keeps shortened windows at the end
time(r)[which(r == 1)[1]]
## [1] "1999-12-02"
2) Another way to solve this, not involving rollapply at all, is to use rle. Using zz from above:

ok <- with(rle(coredata(zz)), rep(lengths >= 14 & values == 1, lengths))   # mark runs of 14+ ones
time(zz)[which(ok)[1]]
## [1] "1999-12-02"
3) Another way without rollapply is to extract the rows with value 0 and keep only those whose date differs by more than 14 days from the next 0-value row. Then take the first such row and use the date one day after it. This assumes there is at least one 0 row before the first run of 14+ ones. Below we have returned to the test_data from the question, although this would also have worked with test_data2.
library(dplyr)

test_data %>%
  mutate(date = as.Date(date)) %>%
  filter(value == 0) %>%
  mutate(diff = as.numeric(lead(date) - date)) %>%
  filter(diff > 14) %>%
  head(1) %>%
  mutate(date = date + 1)
## date value diff
## 1 2000-01-02 0 17
rollapply over dates rather than points
4) The question also discussed using rollapply over dates rather than points, which we address here. As noted above, this does not actually solve the problem of finding the first stretch of 14+ ones; instead it finds the first date in the series that starts a stretch of at least 14 ones. In general, we first calculate a width vector using findInterval and then use rollapply in the usual way, but with those widths rather than a scalar width. This only involves one extra line of code, to calculate the widths w.
# using test_data from question
tt <- as.Date(test_data$date)
w <- findInterval(tt + 13, tt, rightmost.closed = TRUE) - seq_along(tt) + 1
r <- rollapply(test_data$value, w, min, fill = NA, na.rm = TRUE, align = "left")
tt[which(r == 1)[1]]
## [1] "2000-01-02"
There are further examples in ?rollapply showing how to roll by time rather than number of points.
sqldf
5) A completely different way of approaching the problem of finding the first stretch of 14+ ones starting at a date in the series is an SQL self join. It joins the first instance of test, aliased a, to a second instance b, associating with each row of a all rows of b within the indicated date range, and takes the minimum of b's values over those rows to create a new column min14. (group by a.rowid uses SQLite's implicit rowid for the data frame.) The having clause then keeps only the rows for which min14 is 1, and of those the limit clause keeps the first. We extract the date at the end.
library(sqldf)
test <- transform(test_data, date = as.Date(date))
sqldf("select a.*, min(b.value) min14
from test a
left join test b on b.date between a.date and a.date + 13
group by a.rowid
having min14 = 1
limit 1")$date
## [1] "2000-01-02"
You may look into the runner package, where you can pass k as days/weeks etc. See this example, which sums value over the last 3 days:
library(dplyr)
library(runner)
test_data %>%
  mutate(date = as.Date(date),
         sum_val = runner(value, k = "3 days", idx = date, f = sum))
# date value sum_val
#1 2000-01-01 0 0
#2 2000-01-02 1 1
#3 2000-01-03 1 2
#4 2000-01-17 1 1
#5 2000-01-18 0 1
Notice that row 4 has sum_val 1 (and not 3) because only one value occurred in the last 3 days.
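Applying the same idea to the original 14-day problem needs a forward-looking window. The sketch below assumes runner's lag argument accepts a negative day offset (shifting the window forward so it spans [date, date + 13 days]); verify against your runner version before relying on it:

library(dplyr)
library(runner)

test_data %>%
  mutate(date = as.Date(date),
         # min over the 14 days starting at each date; missing days are simply absent
         min14 = runner(value, k = "14 days", lag = "-13 days", idx = date, f = min)) %>%
  filter(min14 == 1) %>%
  slice(1)
# expected to return the 2000-01-02 row, matching approach (4) above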

split a column into numeric and non-numeric components

I need to split one column into 2, where the resulting columns contain the numeric or character portions of the original column.
df <- data.frame(myCol = c("24 hours", "36days", "1month", "2 months +"))
myCol
24 hours
36days
1month
2 months +
result should be:
alpha numeric
hours 24
days 36
month 1
months + 2
Note the inconsistent formatting of the original dataframe (sometimes with spaces, sometimes without).
tidy or base solutions are fine
Thanks
One solution could be:
library(tidyverse)
df %>%
  separate(myCol,
           into = c("numeric", "alpha"),
           sep = "(?=[a-z +]+)(?<=[0-9])")
Which returns:
numeric alpha
1 24 hours
2 36 days
3 1 month
4 2 months +
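If you want the numeric part back as an actual number rather than a character column, a variant (a sketch using tidyr::extract with capture groups and convert = TRUE) is:

df %>%
  extract(myCol, into = c("numeric", "alpha"),
          regex = "([0-9]+)\\s*(.*)", convert = TRUE)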
You could do:
library(stringr)
df$numeric <- str_extract(df$myCol, "[0-9]+")
df$alpha <- str_remove(df$myCol, df$numeric)
Or with base functions
df$numeric <- regmatches(df$myCol, regexpr("[0-9]+", df$myCol))
df$alpha <- gsub("[0-9]+", "", df$myCol)
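Note that in both of these versions the alpha column keeps the leading space from values like "24 hours"; trimws() cleans that up:

df$alpha <- trimws(df$alpha)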

Using ifelse() with R and text removal: how to handle NA values?

I have a column of data that has mixed units. I'm trying to use ifelse() to standardize the minute values to hours, which is the other unit.
Starting with data like:
test_df <- data.frame(
median_playtime = c("2.5 hours", "9 minutes", "20 hours")
)
I'm trying this:
test_df$median_playtime_hours <- ifelse(
  # if the data has hours in it, then...
  test = length(grep("hours", as.character(test_df$median_playtime), value = FALSE)) == 1,
  # text removal if it contains hours
  as.numeric(gsub(pattern = " hours", replacement = "", x = as.character(test_df$median_playtime))),
  # otherwise, remove minutes text and divide by 60
  as.numeric(gsub(pattern = " minutes", replacement = "", x = test_df$median_playtime)) / 60
)
Each conditional line works ok but produces NAs for the mismatch cases, so the end result is NAs across the board. Is there a way to either ignore the NAs or merge the two conditions so the NAs aren't the only value returned?
There's an issue with your test: it only returns a single value of FALSE rather than one TRUE/FALSE per row. If you instead use grepl to test, you get your expected result:
test_df$median_playtime_hours <- ifelse(
  # if the data has hours in it, then...
  test = grepl("hours", as.character(test_df$median_playtime)),
  # text removal if it contains hours
  as.numeric(gsub(pattern = " hours", replacement = "", x = as.character(test_df$median_playtime))),
  # otherwise, remove minutes text and divide by 60
  as.numeric(gsub(pattern = " minutes", replacement = "", x = test_df$median_playtime)) / 60
)
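The difference in a nutshell: grep() returns the indices of the matches as one (usually shorter) vector, while grepl() returns a logical for every element, which is what the vectorized ifelse() needs:

x <- c("2.5 hours", "9 minutes", "20 hours")
grep("hours", x)    # [1] 1 3
grepl("hours", x)   # [1]  TRUE FALSE  TRUE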
If you separate numbers from units, a lookup table works nicely:
library(tidyverse)
test_df <- tibble(
  median_playtime = c("2.5 hours", "9 minutes", "20 hours")
)

test_df %>%
  separate(median_playtime, c('time', 'units'), sep = '\\s', convert = TRUE) %>%
  mutate(seconds = time * c('minutes' = 60, 'hours' = 60*60)[units])
#> # A tibble: 3 x 3
#> time units seconds
#> <dbl> <chr> <dbl>
#> 1 2.5 hours 9000
#> 2 9 minutes 540
#> 3 20 hours 72000
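Since the question asked to standardize to hours rather than seconds, the same lookup-table idea works with per-unit factors expressed in hours (a sketch):

test_df %>%
  separate(median_playtime, c('time', 'units'), sep = '\\s', convert = TRUE) %>%
  mutate(hours = time * c('minutes' = 1/60, 'hours' = 1)[units])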
If you want to keep it all in base,
test_df <- data.frame(
  median_playtime = c("2.5 hours", "9 minutes", "20 hours"),
  stringsAsFactors = FALSE
)
test_df$seconds <- sapply(strsplit(test_df$median_playtime, "\\s"), function(x){
  as.numeric(x[1]) * c(minutes = 60, hours = 60*60)[x[2]]
})
test_df
#> median_playtime seconds
#> 1 2.5 hours 9000
#> 2 9 minutes 540
#> 3 20 hours 72000

Summing up periods

I have a lubridate period column in my table as the following shows.
workerID worked_hours
02 08H30M00S
02 08H00M00S
03 08H00M00S
03 05H40M00S
What I want to achieve is to sum the number of hours worked by workerID. I also want the result in HH:MM:SS format; even if the hours exceed 24, I don't want a day component, and instead want the hours to accumulate past 24.
I have tried working with
df %>%
  group_by(workerID) %>%
  summarise(sum(worked_hours))
but this returns a 0.
You can use the lubridate package, which makes dealing with times a bit easier. In your case, we need to convert to the hms (hours minutes seconds) class first, turn that into a duration so it can be summed, then group by worker ID and take the sum. However, in order to get it in the format HH:MM:SS, we need to convert back to a period, i.e.
library(tidyverse)
library(lubridate)
df %>%
  mutate(new = as.duration(hms(worked_hours))) %>%
  group_by(workerID) %>%
  summarise(sum_times = sum(new)) %>%
  mutate(sum_times = seconds_to_period(sum_times))
which gives,
# A tibble: 2 x 2
workerID sum_times
<int> <S4: Period>
1 2 16H 30M 0S
2 3 13H 40M 0S
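seconds_to_period() prints as 16H 30M 0S rather than 16:30:00. If a strict HH:MM:SS string is required, with hours allowed to run past 24, one option (a sketch) is to format the summed seconds with sprintf() instead:

df %>%
  mutate(secs = as.numeric(as.duration(hms(worked_hours)))) %>%
  group_by(workerID) %>%
  summarise(secs = sum(secs)) %>%
  mutate(worked_hours = sprintf("%02d:%02d:%02d",
                                secs %/% 3600, (secs %% 3600) %/% 60, secs %% 60))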
There's also a base R solution. I've added a row so that the minutes carry over and the summed hours exceed 24.
workerID worked_hours
1 2 08H30M00S
2 2 08H00M00S
3 3 08H00M00S
4 3 05H40M00S
5 2 09H45M00S
We can split worked_hours at the non-digit characters, then aggregate by workerID. After that, we move the full hours contained in the minutes column over to the hours column and keep the remainder as minutes. Finally we collapse the time with ":".
p <- cbind(p[1], do.call(rbind, lapply(strsplit(p$worked_hours, "\\D"), as.numeric)))
p <- aggregate(. ~ workerID, p, sum)
p$`1` <- p$`1` + floor(p$`2` / 60)
p$`2` <- p$`2` %% 60
p[-1] <- lapply(p[-1], function(x) sprintf("%02d", x)) # to always have two digits
cbind(p[1], worked_hours=apply(p[-1], 1, function(x) paste(x, collapse=":")))
# workerID worked_hours
# 1 2 26:15:00
# 2 3 13:40:00
Data
p <- structure(list(workerID = c("2", "2", "3", "3", "2"),
                    worked_hours = c("08H30M00S", "08H00M00S", "08H00M00S",
                                     "05H40M00S", "09H45M00S")),
               row.names = c(NA, -5L), class = "data.frame")

Convert age entered as 'X Weeks, Y Days, Z hours' in R

I have an age variable containing observations that follow this (inconsistent) format:
3 weeks, 2 days, 4 hours
4 weeks, 6 days, 12 hours
3 days, 18 hours
4 days, 3 hours
7 hours
8 hours
I need to convert each observation to hours using R.
I have used strsplit(vector, ',') to split the variable at each comma.
I am running into trouble because splitting each observation at the ',' yields anywhere from 1 to 3 entries per observation. I do not know how to properly index these entries so that I end up with one row for each observation.
I am guessing that once I am able to store these values in sensible rows, I can extract the numeric data from each column in a row and convert accordingly, then sum the entire row.
I am also open to any different methods of approaching this problem.
After you split your data, you can parse the resulting list for the keywords defining the times, like 'hours', 'weeks' and 'days', and create a dataframe containing the relevant value (or 0 if there is no value for a certain keyword). You can achieve that with something like this:
library(dplyr)

vector = c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours", "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
split_vector = strsplit(vector, ",", fixed = TRUE)

parse_string = function(i){
  x = split_vector[[i]]
  tibble(ID = i) %>%   # data_frame() is deprecated in current dplyr; tibble() replaces it
    mutate(hours = ifelse(any(grepl("hours", x)), as.numeric(gsub("\\D", "", x[grepl("hours", x)])), 0),
           days = ifelse(any(grepl("days", x)), as.numeric(gsub("\\D", "", x[grepl("days", x)])), 0),
           weeks = ifelse(any(grepl("weeks", x)), as.numeric(gsub("\\D", "", x[grepl("weeks", x)])), 0))
}

all_parsed = lapply(seq_along(split_vector), parse_string)
all_parsed = bind_rows(all_parsed) %>%   # rbind_all() is defunct; bind_rows() replaces it
  mutate(final_hours = hours + days * 24 + weeks * 7 * 24)
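For the example vector this yields final_hours of 556, 828, 90, 99, 7 and 8 (e.g. 3 * 168 + 2 * 24 + 4 = 556 for the first entry), which matches the output of the lubridate answer below:

all_parsed$final_hours
## [1] 556 828  90  99   7   8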
Hadleyverse comes to the rescue again:
library(lubridate)
library(stringr)
dat <- readLines(textConnection(" 3 weeks, 2 days, 4 hours
4 week, 6 days, 12 hours
3 days, 18 hours
4 day, 3 hours
7 hours
8 hour"))
sapply(str_split(str_trim(dat), ",[ ]*"), function(x) {
  sum(sapply(x, function(y) {
    bits <- str_split(str_trim(y), "[ ]+")[[1]]
    duration(as.numeric(bits[1]), bits[2])   # lubridate accepts "week", "weeks", "day", ...
  })) / 3600                                 # sum() coerces durations to seconds; convert to hours
})
## [1] 556 828 90 99 7 8
I whacked the data a bit to show it's also somewhat flexible in how it parses things. I really don't think the second str_trim is absolutely necessary, but I didn't have cycles to verify.
The exposition: the code trims the original vector, then splits it into components (which makes a list of vectors). That list is then iterated over, and the individual vector elements are further trimmed and split into a number and a unit. Each pair is passed to lubridate's duration(); the values are automatically converted to numeric seconds by the call to sum(), and we then convert to hours.
