Splitting Time Values from String having H, M, S - R Programming - r

Let S1 be vector of different values of time
s1 = c("PT1H57M3S", "PT1H3M46S","PT1H33S","PT1H2M", "PT18S","PT18M9S", "PT1H39M22S")
I want to separate values of Hours, Minutes and Seconds
for eg. PT1H57M3S should go into columns of
H M S
1 57 3
I have put only few types of different string values. otherwise it forms part of column of a dataframe.
Please suggest how to do in R Programming

We could split at the boundary between a letter and number, then convert it to a data.frame and use rbindlist from data.table
library(data.table)
rbindlist(
lapply(strsplit(s1, "(?<=[A-Z])(?=[0-9])|(?<=[0-9])(?=[A-Z])", perl = TRUE),
function(x) {
x1 <- x[-1];val <- x1[seq(1, length(x1), by = 2)]
nm <- x1[seq(2, length(x1), by = 2)]
setNames(as.data.frame.list(val), nm)}),
fill = TRUE)
# H M S
#1: 1 57 3
#2: 1 3 46
#3: 1 NA 33
#4: 1 2 NA
#5: NA NA 18
#6: NA 18 9
#7: 1 39 22
We could also do this with tidyverse
library(tidyverse)
library(stringi)
out <- map2_df(stri_extract_all_regex(s1, "\\d+"),
stri_extract_all_regex(s1, "[HMS]"), ~ .x %>%
as.integer %>%
as.list %>%
set_names(.y) )
out
#A tibble: 7 x 3
# H M S
# <int> <int> <int>
#1 1 57 3
#2 1 3 46
#3 1 NA 33
#4 1 2 NA
#5 NA NA 18
#6 NA 18 9
#7 1 39 22
If we need to replace the NA with 0
out[is.na(out)] <- 0
Or if we need to do this by converting to time class,
library(lubridate)
v1 <- parse_date_time(sub("^PT", "", s1),
order = rlang::syms(tolower(unique(gsub("[^HMS]+", "", s1)))))
tibble(Hour = hour(v1), Minute = minute(v1), Seconds = seconds(v1))
# A tibble: 7 x 3
# Hour Minute Seconds
# <int> <int> <dbl>
#1 1 57 3
#2 1 3 46
#3 1 0 33
#4 1 0 2
#5 0 0 18
#6 18 0 9
#7 1 39 22
Here, we picked up the formats programmatically from the input string
Or we can do only with base R
v1 <- do.call(pmax, c(lapply(paste0("PT", gsub("(.)", "%\\1\\1",
unique(gsub("[^HMS]+", "", s1)))), strptime, x = s1), list(na.rm= TRUE)))
data.frame(hour = v1$hour, minute = v1$min, sec = v1$sec)
# hour minute sec
#1 1 57 3
#2 1 3 46
#3 1 0 33
#4 1 2 0
#5 0 0 18
#6 0 18 9
#7 1 39 22

Here's a base R solution:
df <- data.frame(H = s1, M = s1, S = s1, stringsAsFactors = FALSE)
df$H <- regmatches(df$H, regexec("\\d{1,2}(?=H)", df$H, perl = TRUE))
df$M <- regmatches(df$M, regexec("\\d{1,2}(?=M)", df$M, perl = TRUE))
df$S <- regmatches(df$S, regexec("\\d{1,2}(?=S)", df$S, perl = TRUE))
df[] <- lapply(df, as.integer) # Convert columns to integer data type
# Output
H M S
1 1 57 3
2 1 3 46
3 1 NA 33
4 1 2 NA
5 NA NA 18
6 NA 18 9
7 1 39 22

Rather than split these into different variables, a more robust solution is to parse the times to some kind of time class, e.g. hms or chron (or even just difftime or POSIXct). Presently hms is a good choice, as it is well-supported by tibble if you're using the tidyverse.
All that said, the hard part isn't really converting, it's parsing to one of the above in the first place. A one-shot way to do so is lubridate::parse_date_time, which parses to POSIXct, but will guess among supplied formats until one works, which saves a lot of control flow code.
s1 <- c("PT1H57M3S", "PT1H3M46S","PT1H33S","PT1H2M", "PT18S","PT18M9S", "PT1H39M22S")
hms::as.hms(
lubridate::parse_date_time(
s1,
# token orders to try, in order
orders = c('PT%HH%MM%SS', 'PT%HH%SS', 'PT%MM%SS', 'PT%SS'),
exact = TRUE, # take orders as literal strptime-style formats
truncated = 2), # allow 0-2 missing tokens on end of orders
tz = 'UTC') # parse_date_time returns POSIXct in UTC time zone
#> 01:57:03
#> 01:03:46
#> 01:00:33
#> 01:02:00
#> 00:00:18
#> 00:18:09
#> 01:39:22

You can use base r:
a=sub("PT(\\d+H)?(\\d+M)?(\\d+S)?","\\1,\\2,\\3",s1)
read.csv(h=F,text=gsub("[HMS]","",a),col.names = c("H","M","S"))
H M S
1 1 57 3
2 1 3 46
3 1 NA 33
4 1 2 NA
5 NA NA 18
6 NA 18 9
7 1 39 22

Related

R - how to select elements from sublists of a list by their name

I have a list of lists that looks like this:
list(list("A[1]" = data.frame(W = 1:5),
"A[2]" = data.frame(X = 6:10),
B = data.frame(Y = 11:15),
C = data.frame(Z = 16:20)),
list("A[1]" = data.frame(W = 21:25),
"A[2]" = data.frame(X = 26:30),
B = data.frame(Y = 31:35),
C = data.frame(Z = 36:40)),
list("A[1]" = data.frame(W = 41:45),
"A[2]" = data.frame(X = 46:50),
B = data.frame(Y = 51:55),
C = data.frame(Z = 56:60))) -> dflist
I need my output to also be a list of list with length 3 so that each sublist retains elements whose names start with A[ while dropping other elements.
Based on some previous questions, I am trying to use this:
dflist %>%
map(keep, names(.) %in% "A[")
but that gives the following error:
Error in probe(.x, .p, ...) : length(.p) == length(.x) is not TRUE
Trying to select a single element, for example just A[1] like this:
dflist %>%
map(keep, names(.) %in% "A[1]")
also doesn't work. How can I achieve the desired output?
I think you want:
purrr::map(dflist, ~.[stringr::str_starts(names(.), "A\\[")])
What this does is:
For each sublist (purrr::map)
Select all elements of that sublist (.[], where . is the sublist)
Whose names start with A[ (stringr::str_starts(names(.), "A\\["))
You got the top level map correct, since you want to modify the sublists. However, map(keep, names(.) %in% "A[") has some issues:
names(.) %in% "A[" should be a function or a formula (starting with ~
purrr::keep applies the filtering function to each element of the sublist, namely to the data frames directly. It never "sees" the names of each data frame. Actually I don't think you can use keep for this problem at all
Anyway this produces:
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50
If we want to use keep, use
library(dplyr)
library(purrr)
library(stringr)
map(dflist, ~ keep(.x, str_detect(names(.x), fixed("A["))))
Here a base R solution:
lapply(dflist, function(x) x[grep("A\\[",names(x))] )
[[1]]
[[1]]$`A[1]`
W
1 1
2 2
3 3
4 4
5 5
[[1]]$`A[2]`
X
1 6
2 7
3 8
4 9
5 10
[[2]]
[[2]]$`A[1]`
W
1 21
2 22
3 23
4 24
5 25
[[2]]$`A[2]`
X
1 26
2 27
3 28
4 29
5 30
[[3]]
[[3]]$`A[1]`
W
1 41
2 42
3 43
4 44
5 45
[[3]]$`A[2]`
X
1 46
2 47
3 48
4 49
5 50

R - replace all values smaller than a specific value in a column with the nearest bigger value

I have a data frame like this one:
df <- data.frame(c(1,2,3,4,5,6,7), c(0,23,55,0,1,40,21))
names(df) <- c("a", "b")
a b
1 0
2 23
3 55
4 0
5 1
6 40
7 21
Now I want to replace all values smaller than 22 in column b with the nearest bigger value. Of course it is possible to use loops, but since I have quite big datasets this is way too slow.
The solution should look somewhat like this:
a b
1 23
2 23
3 55
4 55
5 40
6 40
7 40
Here is a tidyverse possibility (but note #phiver's comment on replacement ambiguities)
library(tidyverse);
df %>%
mutate(b = ifelse(b < 22, NA, b)) %>%
fill(b) %>%
fill(b, .direction = "up");
# a b
#1 1 23
#2 2 23
#3 3 55
#4 4 55
#5 5 55
#6 6 40
#7 7 40
Explanation: Replace values b < 22 with NA and then use fill to fill NAs with previous/following non-NA entries.
Sample data
df <- data.frame(a = c(1,2,3,4,5,6,7), b = c(0,23,55,0,1,40,21))
You can use zoo::rollapply :
library(zoo)
df$b <- rollapply(df$b,3,function(x)
if (x[2] < 22) min(x[x>22]) else x[2],
partial =T)
# df
# a b
# 1 1 23
# 2 2 23
# 3 3 55
# 4 4 55
# 5 5 40
# 6 6 40
# 7 7 40
In base R you could do this for the same output:
transform(df, b = sapply(seq_along(b),function(i)
if (b[i] < 22) {
bi <- c(b,Inf)[seq(i-1,i+1)]
min(bi[bi>=22])
} else b[i]))

R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help and some other posts, I have tried using the duplicated function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
Not sure why I get the NA rows at the end (wasn't happening with a similar table I was testing), but works only partially on the values.
I have also tried using the data.table package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please, forgive if cross-posting. I tried some of the suggestions but none worked for eliminating only those that are consecutive.
I would appreciate any help.
Thanks
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [.
EDIT for fun here's a microbenchmark of the answers given so far with rle being mine, consec being what I think is the most fundamentally direct answer, given by #James, and would be the answer I would "accept", and dp being the dplyr answer given by #Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle performs better than I thought it would.
You just need to check in there is no duplicate following a number, i.e x[i+1] != x[i] and note the last value will always be present.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
A cheap solution with dplyr that I could think of:
Method:
library(dplyr)
df %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data has the same x value at the bottom
New Input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Use same method:
df2 %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")] # edited to add closing )
This way you compare each value of x with its immediately following value and throw out where they match. Make sure to set fill to something that is not in x in order for correct handling of the last value.

R ifelse condition with hourly data: frequency of continuously NA

With the help of sebastian-c, I figured out my problem with daily data. Please see: R ifelse condition: frequency of continuously NA
And now I have a data set with hourly data:
set.seed(1234)
day <- c(rep(1:2, each=24))
hr <- c(rep(0:23, 2))
v <- c(rep(NA, 48))
A <- data.frame(cbind(day, hr, v))
A$v <- sample(c(NA, rnorm(100)), nrow(A), prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
What I need to do is: If there are more(>=) 4 continuously missing day-hours(7AM-7PM) or >= 3 continuously missing night-hours(7PM-7AM), I will delete the entire day from the data frame, otherwise just run linear interpolation. Thus, the second day should be entirely deleted from the data frame since there are 4 continuously NA during day-time (7AM-10AM). The result is preferably remain data frame. Please help, thank you!
If I modify the NA_run function from the question you linked to take a variable named v instead of value and return the boolean rather than the data.frame:
NA_run <- function(x, maxlen){
runs <- rle(is.na(x$v))
any(runs$lengths[runs$values] >= maxlen)
}
I can then write a wrapper function to call it twice for daytime and nighttime:
dropfun <- function(x) {
dt <- x$hr > 7 & x$hr < 19
daytime <- NA_run(x[dt,], 4)
nighttime <- NA_run(x[!dt,], 3)
any(daytime, nighttime)
}
Which gives me a data.frame of days to drop.
> ddply(A, .(day), dropfun)
day V1
1 1 TRUE
2 2 FALSE
>
We can alter the dropfun to return the dataframe instead though:
dropfun <- function(x) {
dt <- x$hr > 7 & x$hr < 19
daytime <- NA_run(x[dt,], 4)
nighttime <- NA_run(x[!dt,], 3)
if(any(daytime, nighttime)) NULL else x
}
> ddply(A, .(day), dropfun)
day hr v
1 2 0 NA
2 2 1 NA
3 2 2 2.54899107
4 2 3 NA
5 2 4 -0.03476039
6 2 5 NA
7 2 6 0.65658846
8 2 7 0.95949406
9 2 8 NA
10 2 9 1.08444118
11 2 10 0.95949406
12 2 11 NA
13 2 12 -1.80603126
14 2 13 NA
15 2 14 NA
16 2 15 0.97291675
17 2 16 NA
18 2 17 NA
19 2 18 NA
20 2 19 -0.29429386
21 2 20 0.87820363
22 2 21 NA
23 2 22 0.56305582
24 2 23 -0.11028549
>

Remove rows based on factor-levels

I have a data.frame df in format "long".
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows where an unique element of df$time is not present for each of the levels of df$site ?
In this case I want to remove df[3,], because for df$time the timestamp 33 is only present for site A and not for site B and site C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way :
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, when final result should be empty (as Ben pointed out in comments), the step marked * above could be :
tt = tt[V1==length(unique(DT$site))]
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, time != rle(df$time)$value[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution that I put above. Rather than look for a run-length of "1", will delete rows that do not match certain conditions of the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(14,14,15,15,16,16,16),
value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
>= length(levels(df$site)))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in including sites where at least two sites had the timestamp, we could change the line >= length(levels(df$site)) (in this example, 3) to >= length(levels(df$site))-1 (obviously, 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.

Resources