R: Formatting YouTube video duration into proper time (seconds)

I have a vector (a column of data) that contains YouTube playback durations as character strings in R.
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
How do I get rid of the PT prefix and get the overall duration in seconds?
The resulting vector should be c(3608, 595, 837, 65, 1812, 4865, 408, 31, 120)
Example: PT1H21M5S expressed in seconds = 4865.
(calculated as 1H = 1*3600, 21M = 21*60, 5S = 5*1)

I wrote a little apply loop with regex commands, deleting everything but the seconds, minutes, or hours and then converting everything into seconds.
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S")
x2 <- sapply(x, function(i){
t <- as.numeric(gsub("^(.*)M|^(.*)H|S$", "", i))
if(grepl("M", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)H|M(.*)$", "",i)) * 60
if(grepl("H", i)) t <- t + as.numeric(gsub("^(.*)PT|H(.*)$", "",i)) * 3600
t
})
x2
   PT1H8S   PT9M55S  PT13M57S    PT1M5S  PT30M12S PT1H21M5S   PT6M48S
     3608       595       837        65      1812      4865       408
EDIT: Per request
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
x2 <- sapply(x, function(i){
t <- 0
if(grepl("S", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)M|^(.*)H|S$", "", i))
if(grepl("M", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)H|M(.*)$", "",i)) * 60
if(grepl("H", i)) t <- t + as.numeric(gsub("^(.*)PT|H(.*)$", "",i)) * 3600
t
})
x2
   PT1H8S   PT9M55S  PT13M57S    PT1M5S  PT30M12S PT1H21M5S   PT6M48S     PT31S      PT2M
     3608       595       837        65      1812      4865       408        31       120
This should cover all the cases. If there are more, the trick is to alter the regex. ^ anchors the beginning of the string and $ anchors the end. (.*) matches everything. So ^(.*)H means everything from the beginning up to and including H. We replace this with nothing.
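As a variation on the same regex idea, here is a minimal base R sketch that pulls out each unit with a lookahead instead of deleting everything around it. The helper names unit_value and duration_seconds are made up for illustration, and the sketch assumes durations of the form PT#H#M#S only (no day component such as P1DT...).

```r
# Hypothetical helper: the number directly in front of a unit letter,
# or 0 when that unit does not occur in the string.
unit_value <- function(x, unit) {
  m <- regexpr(paste0("[0-9]+(?=", unit, ")"), x, perl = TRUE)
  out <- numeric(length(x))                    # starts as all zeros
  out[m > 0] <- as.numeric(regmatches(x, m))   # fill only where a match exists
  out
}

# Hypothetical wrapper: combine hours, minutes and seconds into seconds.
duration_seconds <- function(x) {
  unit_value(x, "H") * 3600 + unit_value(x, "M") * 60 + unit_value(x, "S")
}

x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S",
       "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
duration_seconds(x)
```

Because a missing unit simply contributes 0, no separate branch is needed for strings like PT31S or PT2M.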

Here's a dplyr and stringr solution:
library(dplyr)
library(stringr)
library(tidyr)  # for replace_na()
df %>%
  # extract hours, minutes, and seconds and convert to numeric:
  mutate(
    h = as.numeric(str_extract(x, "(?<=PT)\\d+(?=H)")),
    m = as.numeric(str_extract(x, "(?<=PT|H)\\d+(?=M)")),
    s = as.numeric(str_extract(x, "(?<=PT|H|M)\\d+(?=S)"))
  ) %>%
  # replace NA with 0 in the extracted numeric columns only:
  mutate(
    across(c(h, m, s), ~ replace_na(.x, 0))
  ) %>%
  # calculate time in seconds:
  mutate(sec = h*3600+m*60+s)
x h m s sec
1 PT1H8S 1 0 8 3608
2 PT9M55S 0 9 55 595
3 PT13M57S 0 13 57 837
4 PT1M5S 0 1 5 65
5 PT30M12S 0 30 12 1812
6 PT1H21M5S 1 21 5 4865
7 PT6M48S 0 6 48 408
8 PT31S 0 0 31 31
9 PT2M 0 2 0 120
Data:
df <- data.frame(x = c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M"))

You can use the lubridate package:
library(lubridate)
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S")
x2 <- as.numeric(duration(x))
x2
[1] 3608 595 837 65 1812 4865 408

Related

Replacing NA with mean using loop in R

I have to solve this problem using loop in R (I am aware that you can do it much more easily without loops, but it is for school...).
So I have a vector with NAs like this:
trades<-sample(1:500,150,T)
trades<-trades[order(trades)]
trades[sample(10:140,25)]<-NA
and I have to create a for loop that replaces each NA with the mean of the 2 numbers before it and the 2 numbers after it.
I am able to do this with a loop like this:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T) {
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]), na.rm = T)
}
}
But there is another part to the homework. If there is NA within the 2 previous or 2 following numbers, then you have to replace the NA with mean from 4 previous numbers and 4 following numbers (I presume with removing the NAs). But I just am not able to crack it... I have the best results with this loop:
for (i in 1:length(trades)) {
if (is.na(trades[i])==T && is.na(trades[c(i-1:2)]==T || is.na(trades[c(i+1:2)]==T))) {
trades[i] <- mean(c(trades[c(i-1:4)], trades[c(i+1:4)]), na.rm = T)
}else if (is.na(trades[i])==T){
trades[i] <- mean(c(trades[c(i-1:2)], trades[c(i+1:2)]))
}
}
But it still misses some NAs.
Thank you for your help in advance.
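We can use na.approx from zoo, which fills each NA by linear interpolation between its nearest non-NA neighbours (not exactly the 2- or 4-neighbour mean, but no loop is needed):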
library(zoo)
na.approx(trades)
Here is another solution using a loop. I did shortcut some code by using lead and lag from dplyr. First we use 2 recursive functions to calculate the lead and lag sums. Then we use conditional statements to determine whether any of those neighbours are missing. Lastly, we fill the missing data using either the output of the recursive functions or the mean of the previous and following 4 values (with NA removed). I would note that this is not the way that I would go about this issue, but I tried it out with a loop as requested.
library(dplyr)
r.lag <- function(x, n){
if (n == 1) return(lag(x = x, n = 1))
else return( lag(x = x, n = n) + r.lag(x = x, n = n-1))
}
r.lead <- function(x, n){
if (n == 1) return(lead(x = x, n = 1))
else return( lead(x = x, n = n) + r.lead(x = x, n = n-1))
}
lead.vec <- r.lead(trades, 2)
lag.vec <- r.lag(trades, 2)
output <- vector(length = length(trades))
for(i in 1:length(trades)){
if(!is.na(trades[[i]])){
output[[i]] <- trades[[i]]
}
else if(is.na(trades[[i]]) & !is.na(lead.vec[[i]]) & !is.na(lag.vec[[i]])){
output[[i]] <- (lead.vec[[i]] + lag.vec[[i]])/4
}
else
output[[i]] <- mean(
c(trades[[i-4]], trades[[i-3]], trades[[i-2]], trades[[i-1]],
trades[[i+4]], trades[[i+3]], trades[[i+2]], trades[[i+1]]),
na.rm = T
)
}
tibble(
original = trades,
filled = output
)
#> # A tibble: 150 x 2
#> original filled
#> <int> <dbl>
#> 1 7 7
#> 2 7 7
#> 3 12 12
#> 4 18 18
#> 5 30 30
#> 6 31 31
#> 7 36 36
#> 8 NA 40
#> 9 43 43
#> 10 50 50
#> # … with 140 more rows
So it seems that posting to StackOverflow helped me solve the problem.
trades<-sample(1:500,25,T)
trades<-trades[order(trades)]
trades[sample(1:25,5)]<-NA
which gives us:
[1] NA 20 24 30 NA 77 188 217 238 252 264 273 296 NA 326 346 362 368 NA NA 432 451 465 465 490
and if you run this loop:
for (i in 1:length(trades)) {
if (is.na(trades[i])== T) {
test1 <- c(trades[c(i+1:2)])
if (any(is.na(test1))==T) {
test2 <- c(trades[abs(c(i-1:4))], trades[c(i+1:4)])
trades[i] <- round(mean(test2, na.rm = T),0)
}else {
test3 <- c(trades[abs(c(i-1:2))], trades[c(i+1:2)])
trades[i] <- round(mean(test3, na.rm = T),0)
}
}
}
it changes the NAs to this:
[1] 22 20 24 30 80 77 188 217 238 252 264 273 296 310 326 346 362 368 387 410 432 451 465 465 490
So it works pretty much as expected.
Thank you for all your help.
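The rule from the question (use 2 neighbours on each side, widen to 4 on each side if any of those is NA) can also be sketched as a small function that checks both sides and clips the window at the ends of the vector. The name fill_na_window is made up for illustration, and the sketch assumes means are taken from the original values rather than from values filled earlier in the loop.

```r
# Hypothetical helper implementing the 2-then-4 neighbour rule.
fill_na_window <- function(v) {
  n <- length(v)
  out <- v
  for (i in seq_along(v)) {
    if (!is.na(v[i])) next
    # indices of up to k neighbours on each side, clipped to 1..n
    neighbours <- function(k) {
      idx <- c(i - seq_len(k), i + seq_len(k))
      idx[idx >= 1 & idx <= n]
    }
    near <- v[neighbours(2)]
    # widen the window only when one of the close neighbours is missing
    vals <- if (anyNA(near)) v[neighbours(4)] else near
    out[i] <- mean(vals, na.rm = TRUE)
  }
  out
}

v <- c(10, 20, NA, 30, 40, 50, NA, NA, 60, 70, 80, 90)
fill_na_window(v)
```

Reading from v and writing to out keeps the result independent of the order in which the NAs are visited; filling in place, as in the loops above, lets earlier fills influence later ones.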

If() statement in R

I am not very experienced in if statements and loops in R.
Probably you can help me to solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but at the same time I have to add +1 only to the max values, and only for as long as sum(df$fz) is lower than 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from @Oksana's question and comments, we can probably do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind those subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134
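For comparison, the same idea can be sketched in base R without joins or sorting the data frame: work out how many units are missing from 450 and add 1 to that many rows, taking the rows with the largest z first. This assumes, based on the expected output in the question, that "max values" refers to the largest z; target, shortfall and top_rows are illustrative names.

```r
df <- data.frame(
  ID_PP = c(3, 6, 22, 30, 1234456),
  z     = c(12325, 21698, 21725, 8378, 18979),
  fz    = c(134, 67, 70, 88, 88)
)

target <- 450
# how many +1 steps are still missing (0 if the sum already reaches the target)
shortfall <- max(target - sum(df$fz), 0)
# at most one +1 per row in this sketch
n_bump <- min(shortfall, nrow(df))
# row positions of the n_bump largest z values
top_rows <- order(df$z, decreasing = TRUE)[seq_len(n_bump)]

df$new_value <- df$fz
df$new_value[top_rows] <- df$new_value[top_rows] + 1
df
```

The original row order is preserved, so new_value lines up with the 134 68 71 88 89 requested in the question.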

Divide a vector by different values based on the result of the division

I have a Df like this:
x y z
<dbl> <dbl> <dbl>
1 408001.9 343 0
2 407919.2 343 0
3 407839.6 343 0
4 407761.2 343 0
5 407681.7 343 0
6 407599.0 343 0
7 407511.0 343 0
8 407420.5 343 0
9 407331.0 343 0
10 407242.0 343 0
11 407152.7 343 0
12 407062.5 343 0
13 406970.7 343 0
14 406876.6 342 0
15 406777.1 342 0
16 406671.0 342 0
17 406560.9 342 0
18 406449.4 342 0
19 406339.0 342 0
20 406232.5 342 0
... ... ... ...
with x decreasing.
And a vector like
vec=(a1, a2, a3, a4, a5, a6, ...)
with a1< a2< a3< a4...
Now I want to divide df$x by vec[1], which will give the same result (rounded) as df$y.
But when the value in df$z drops by one to 342, I want to divide the values in df$x by vec[2] from then on to get the new df$z values.
From here the result will differ from df$y, since for df$y the divisor is always vec[1] and never changes.
Every time the value I get for df$z drops by one, the next values for df$z shall be calculated with the corresponding vec[i], where i is the number of drops so far plus 1.
In the end I want a vector df$z whose values are df$x / vec[i], where vec[i] depends on what the last value of df$z was.
Reproducible example:
test <- data.frame(x = sort((seq(500, 600, 2)), decreasing = T)
)
vec <- seq(10, 10.9, 0.03)
for(i in 1:31){
test[i+1] <- round(test$x/vec[i])
}
This will give you a df with one column for every value of vec that test$x was divided by.
In the end, my vector shall contain the values of col2 until the value in col2 drops from 60 to 59. Afterwards I want the values from col3 until the value in col3 drops from 59 to 58. Then I want the values from col4, and so on.
How can I achieve this with any data (like mine above, which is not linearly distributed like this example)?
I tried some for and while loops, but none worked. I didn't even get close to what I want.
I think my problem is that I don't know how to make the condition dependent on a value (the value of df$z at point i) that I want to calculate in the same operation. I want to calculate the value of df$z[i] with the value of vec[t] that has been used so far. But if the value of df$z drops by one at a certain observation i, the value of vec[t+1] shall be used for the division from then on.
Thanks for your help.
I hope I've understood what you are asking. This might be it...
test <- data.frame(x = sort((seq(500, 600, 2)), decreasing = T))
vec <- seq(10, 10.9, 0.03)
#this function determines the index of `vec` to use
xcol<-function(v){
x<-rep(NA,length(v))
x[1] <- 1
for(i in 2:length(v)){
x[i] <- x[i-1]
if(round(v[i]/vec[x[i]])<round(v[i-1]/vec[x[i]])){
x[i] <- x[i]+1
}
}
return(x)
}
test$xcol <- xcol(test$x)
test$z <- round(test$x/vec[test$xcol])
test
x xcol z
1 600 1 60
2 598 1 60
3 596 1 60
4 594 2 59
5 592 2 59
6 590 2 59
7 588 2 59
8 586 3 58
9 584 3 58
10 582 3 58
11 580 3 58
12 578 4 57
...
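The same stepping logic can also be written as one self-contained function that takes vec as an argument and returns z directly. This is only a sketch: the name divide_with_steps is made up, and the guard that stops advancing once vec is exhausted is an added assumption, not part of the answer above.

```r
# Hypothetical helper: divide x by vec[k], moving to vec[k + 1]
# whenever the rounded result drops below the previous one.
divide_with_steps <- function(x, vec) {
  z <- numeric(length(x))
  k <- 1
  z[1] <- round(x[1] / vec[k])
  for (i in seq_along(x)[-1]) {
    candidate <- round(x[i] / vec[k])
    # a drop relative to the previous z switches to the next divisor
    if (candidate < z[i - 1] && k < length(vec)) {
      k <- k + 1
      candidate <- round(x[i] / vec[k])
    }
    z[i] <- candidate
  }
  z
}

x <- sort(seq(500, 600, 2), decreasing = TRUE)
vec <- seq(10, 10.9, 0.03)
z <- divide_with_steps(x, vec)
head(z, 12)
```

Passing vec explicitly avoids relying on a global variable, which makes the function easier to reuse on other data.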

R strip split a column in dataframe

I have a 'data' frame, with multiple columns, one of them being 'Runtime' which has data in two formats:
Runtime
1 h 10 min
67 min
1 h 0 min
86 min
97 min
I want to convert all of them into Minutes. Have tried 'strsplit' and 'strip_split_fixed'. Can anyone show me a way to achieve my goal, split or any other method?
Thank you in advance !
I think I saw this kind of solution somewhere. Don't hit me.
df = data.frame(Runtime = c('1 h 10 min', '67 min', '1 h 0 min', '86 min', '97 min'))
df$exp <- gsub("h", "* 60 +", df$Runtime)
df$exp <- gsub("min", "* 1", df$exp)
sapply(df$exp, FUN = function(x) eval(parse(text = x)))
1 * 60 + 10 * 1 67 * 1 1 * 60 + 0 * 1 86 * 1 97 * 1
70 67 60 86 97
You can get it in one call using gsubfn and regex:
library(gsubfn)
gsubfn("^(?:(\\d+)\\s*h)?\\s*(\\d+)\\s*min.*$",
~ sum(as.numeric(x) * 60, as.numeric(y), as.numeric(z), na.rm=TRUE), x)
#[1] "70" "67" "60" "86" "97"
Here's an example of how you can do it:
# setting up your data.frame of interest
df = data.frame(Runtime = c('1 h 10 min', '67 min', '1 h 0 min', '86 min', '97 min'))
df$Runtime = gsub(' min', '', df$Runtime) # remove the min labels
hrs = grepl('h', x = df$Runtime) # which values are in an "x h y min" format?
runtime_sub = sapply(strsplit(df[hrs, 'Runtime'], ' h '), function(i) sum(as.numeric(i) * c(60, 1))) # convert the "x h y min" entries into numeric values in minutes
df$Runtime = as.numeric(df$Runtime) # convert the vector to numeric (this is expected to return a coercion warning; ignore it)
df[hrs, 'Runtime'] = runtime_sub # add the converted values
This results in:
Runtime
1 70
2 67
3 60
4 86
5 97
1) Read df[[1]] and if the third column is NA then the first column gives the minutes; otherwise, 60 times the first column plus the third column gives the minutes:
with(read.table(text = as.character(df[[1]]), fill = TRUE),
ifelse(is.na(V3), V1, 60*V1 + V3))
## [1] 70 67 60 86 97
2) A variation is to paste "0 h" at the beginning of each component that does not have an h giving hm and read that computing 60 times the first column plus the third column.
hm <- paste(ifelse(grepl("h", df[[1]]), "", "0 h"), df[[1]])
with(read.table(text = hm), 60 * V1 + V3)
## [1] 70 67 60 86 97
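Another base R option, sketched here with made-up helper names (extract_number, runtime_minutes), is to extract the number in front of "h" and the number in front of "min" separately and treat a missing unit as 0. It assumes every entry looks like "# h # min" or "# min", as in the question.

```r
# Hypothetical helper: first number matched by a lookahead pattern, or 0 if none.
extract_number <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- numeric(length(x))                    # zeros where nothing matches
  out[m > 0] <- as.numeric(regmatches(x, m))
  out
}

# Hypothetical wrapper: total runtime in minutes.
runtime_minutes <- function(x) {
  hours <- extract_number(x, "[0-9]+(?=\\s*h)")
  mins  <- extract_number(x, "[0-9]+(?=\\s*min)")
  hours * 60 + mins
}

df <- data.frame(Runtime = c("1 h 10 min", "67 min", "1 h 0 min", "86 min", "97 min"))
df$Minutes <- runtime_minutes(df$Runtime)
df
```

Unlike the eval(parse()) approach, this never evaluates text as code, and unlike the as.numeric() approach it produces no coercion warning.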

Finding local maxima and minima in R

I'm trying to create a function to find a "maxima" and "minima". I have the following data:
y
157
144
80
106
124
46
207
188
190
208
143
170
162
178
155
163
162
149
135
160
149
147
133
146
126
120
151
74
122
145
160
155
173
126
172
93
I have tried this function to find "maxima"
localMaxima <- function(x) {
# Use -Inf instead if x is numeric (non-integer)
y <- diff(c(-.Machine$integer.max, x)) > 0L
rle(y)$lengths
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
maks <- localMaxima(x)
And this function to find "minima"
localMinima <- function(x) {
# Use -Inf instead if x is numeric (non-integer)
y <- diff(c(.Machine$integer.max, x)) > 0L
rle(y)$lengths
y <- cumsum(rle(y)$lengths)
y <- y[seq.int(1L, length(y), 2L)]
if (x[[1]] == x[[2]]) {
y <- y[-1]
}
y
}
mins <- localMinima(x)
And the result is not 100% right
maks = 1 5 7 10 12 14 16 20 24 27 31 33 35
mins = 3 6 8 11 13 15 19 23 26 28 32 34 36
The result should be
maks = 5 7 10 12 14 16 20 24 27 31 33 35
mins = 3 6 8 11 13 15 19 23 26 28 32 34
Finding local maxima and minima in R comes close, but doesn't quite fit.
How can I fix this?
Thank you very much
You could define two functions like the below which produce the vectors you need:
library(data.table)
# shift() lags or leads a vector by the amount given as the second argument;
# the default is to lag.
# The rationale behind the code below is that each local minimum's adjacent
# values will be greater than itself. The opposite is true for a local
# maximum. I think this is what you are trying to achieve, and one way to do
# it is the following code:
maximums <- function(x) which(x - shift(x, 1) > 0 & x - shift(x, 1, type='lead') > 0)
minimums <- function(x) which(x - shift(x, 1) < 0 & x - shift(x, 1, type='lead') < 0)
Output:
> maximums(y)
[1] 5 7 10 12 14 16 20 24 27 31 33 35
> minimums(y)
[1] 3 6 8 11 13 15 19 23 26 28 32 34
This is a function I wrote a while back (and it's more general than you need). It finds peaks in sequential data x, where I define a peak as a local maximum with m points on either side of it having lower values (so a bigger m leads to more stringent criteria for peak finding):
find_peaks <- function (x, m = 3){
shape <- diff(sign(diff(x, na.pad = FALSE)))
pks <- sapply(which(shape < 0), FUN = function(i){
z <- i - m + 1
z <- ifelse(z > 0, z, 1)
w <- i + m + 1
w <- ifelse(w < length(x), w, length(x))
if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])) return(i + 1) else return(numeric(0))
})
pks <- unlist(pks)
pks
}
so for your case m = 1:
find_peaks(x, m = 1)
#[1] 5 7 10 12 14 16 20 24 27 31 33 35
and for the minima:
find_peaks(-x, m = 1)
#[1] 3 6 8 11 13 15 19 23 26 28 32 34
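For the strict neighbour comparison used in this question, a dependency-free sketch in base R is to look at where the sign of the first difference flips. It assumes there are no ties between adjacent values (true for the data in the question); plateaus would need extra handling.

```r
y <- c(157, 144, 80, 106, 124, 46, 207, 188, 190, 208, 143, 170,
       162, 178, 155, 163, 162, 149, 135, 160, 149, 147, 133, 146,
       126, 120, 151, 74, 122, 145, 160, 155, 173, 126, 172, 93)

# sign(diff(y)) is +1 on a rise and -1 on a fall;
# differencing it again marks the positions where the direction changes
turns <- diff(sign(diff(y)))

maxima <- which(turns < 0) + 1   # rise followed by fall
minima <- which(turns > 0) + 1   # fall followed by rise

maxima
minima
```

The first and last points are never reported, since they have only one neighbour, which is why 1 and 36 drop out of the results.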
