Separate hour and minutes in R

I have a column for time, but it hasn't been separated by ":" or anything. It looks like this:
person time
1 356
1 931
1 2017
1 2103
2 256
2 1031
2 1517
2 2206
How do I separate them?

There are different ways of approaching the issue. Which method you choose depends on your desired output.
For example, you could use stringr::str_split to split time into a list column of hours and minutes using a positive look-ahead:
library(tidyverse)
df %>% mutate(time = str_split(time, "(?=\\d{2}$)"))
# person time
#1 1 3, 56
#2 1 9, 31
#3 1 20, 17
#4 1 2, 13
#5 2 2, 56
#6 2 10, 31
#7 2 15, 17
#8 2 2, 26
Or we can use tidyr::separate to create two new columns, hours and minutes:
df %>% separate(time, c("hours", "minutes"), sep = "(?=\\d{2}$)")
# person hours minutes
#1 1 3 56
#2 1 9 31
#3 1 20 17
#4 1 2 13
#5 2 2 56
#6 2 10 31
#7 2 15 17
#8 2 2 26
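Note that separate returns character columns here; if integer hours and minutes are wanted, separate's convert argument does the coercion (a small addition, not in the original answer):
df %>% separate(time, c("hours", "minutes"), sep = "(?=\\d{2}$)", convert = TRUE)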
In response to your comment, you could use stringr::str_replace:
df %>% mutate(time = str_replace(time, "(?=\\d{2}$)", ":"))
# person time
#1 1 3:56
#2 1 9:31
#3 1 20:17
#4 1 2:13
#5 2 2:56
#6 2 10:31
#7 2 15:17
#8 2 2:26
And the same in base R using sub
transform(df, time = sub("(?=\\d{2}$)", ":", time, perl = TRUE))
giving the same result.
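A hedged aside, not part of the original answers: if real time-of-day values are wanted rather than delimited strings, zero-padding with sprintf and parsing with strptime also works (a sketch, using the df defined under "Sample data" below):
# sprintf pads 356 -> "0356"; strptime parses it as hours and minutes
df$time_hm <- format(strptime(sprintf("%04d", df$time), "%H%M"), "%H:%M")
df$time_hm
#[1] "03:56" "09:31" "20:17" "02:13" "02:56" "10:31" "15:17" "02:26"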
Sample data
df <- read.table(text = "
person time
1 356
1 931
1 2017
1 213
2 256
2 1031
2 1517
2 226", header = T)

We can use strptime with sprintf in base R
df[c("hour", "min")] <- unclass(strptime(sprintf("%04d00", df$time),
"%H%M%S"))[c('hour', 'min')]
df
# person time hour min
#1 1 356 3 56
#2 1 931 9 31
#3 1 2017 20 17
#4 1 213 2 13
#5 2 256 2 56
#6 2 1031 10 31
#7 2 1517 15 17
#8 2 226 2 26
Or, if we only need to insert a delimiter:
tmp <- sub('(\\d{2})$', ':\\1', df$time)
tmp
#[1] "3:56" "9:31" "20:17" "2:13" "2:56" "10:31" "15:17" "2:26"
and then it can be separated into two columns with read.table:
read.table(text = tmp, sep=":", header = FALSE, col.names = c('hour', 'min'))
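To attach these as columns instead, a small sketch combining the pieces above (df2 is a hypothetical name):
df2 <- cbind(df, read.table(text = tmp, sep = ":", col.names = c('hour', 'min')))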
data
df <- structure(list(person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), time = c(356L,
931L, 2017L, 213L, 256L, 1031L, 1517L, 226L)),
class = "data.frame", row.names = c(NA,
-8L))

Another possibility:
res <- strsplit(gsub("(\\d+(?=\\d{2,}))(\\d{1,})",
                     "\\1:\\2", df$time, perl = TRUE), ":")
df$Minutes <- sapply(res, "[[", 2)
df$Hr <- sapply(res, "[[", 1)
df
df
Result:
person time Minutes Hr
1 1 356 56 3
2 1 931 31 9
3 1 2017 17 20
4 1 2103 03 21
5 2 256 56 2
6 2 1031 31 10
7 2 1517 17 15
8 2 2206 06 22
Data:
df <-structure(list(person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), time = c(356L,
931L, 2017L, 2103L, 256L, 1031L, 1517L, 2206L)), row.names = c(NA,
-8L), class = "data.frame")

If you want to show time in HH:MM format, we can use sprintf with sub to insert a colon (:) in between:
sub("(\\d{2})(\\d{2})", "\\1:\\2",sprintf("%04d", df$time))
#[1] "03:56" "09:31" "20:17" "21:03" "02:56" "10:31" "15:17" "22:06"

Related

Binned physiological time series data in R: calculate duration spent in each bin

I have a dataset containing changes in mean arterial blood pressure (MAP) over time from multiple participants. Here is an example dataframe:
df=structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), Time = structure(1:14, .Label = c("11:02:00",
"11:03:00", "11:04:00", "11:05:00", "11:06:00", "11:07:00", "11:08:00",
"13:30:00", "13:31:00", "13:32:00", "13:33:00", "13:34:00", "13:35:00",
"13:36:00"), class = "factor"), MAP = c(90.27999878, 84.25, 74.81999969,
80.87000275, 99.38999939, 81.51000214, 71.51000214, 90.08999634,
88.75, 84.72000122, 83.86000061, 94.18000031, 98.54000092, 51
)), class = "data.frame", row.names = c(NA, -14L))
I have binned the data into groups: e.g. MAP 40-60, 60-80, 80-100 and added a unique flag (1, 2 or 3) in an additional column map_bin. This is my code so far:
library(dplyr)
#Mean Arterial Pressure
#Bin 1=40-60; Bin 2=60-80; Bin 3=80-100
map_bin=c("1","2","3")
output <- as_tibble(df) %>%
  mutate(map_bin = case_when(
    MAP >= 40 & MAP < 60 ~ map_bin[1],
    MAP >= 60 & MAP < 80 ~ map_bin[2],
    MAP >= 80 & MAP < 100 ~ map_bin[3]
  ))
For each ID I wish to calculate, in an additional column, the total time MAP is in each bin. I expect the following output:
ID  Time      MAP          map_bin  map_bin_dur
1   11:02:00  90.27999878  3        5
1   11:03:00  84.25        3        5
1   11:04:00  74.81999969  2        2
1   11:05:00  80.87000275  3        5
1   11:06:00  99.38999939  3        5
1   11:07:00  81.51000214  3        5
1   11:08:00  71.51000214  2        2
2   13:30:00  90.08999634  3        6
2   13:31:00  88.75        3        6
2   13:32:00  84.72000122  3        6
2   13:33:00  83.86000061  3        6
2   13:34:00  94.18000031  3        6
2   13:35:00  98.54000092  3        6
2   13:36:00  51           1        1
Here, map_bin_dur is the time in minutes that each individual's MAP resided in each bin; e.g., ID 1 had a MAP in Bin 3 for 5 minutes in total.
If your Time column always comes in 1-minute steps, you can use add_count:
library(dplyr)
output <- output %>% add_count(ID, map_bin, name = 'map_bin_dur')
output
# ID Time MAP map_bin map_bin_dur
# <int> <fct> <dbl> <chr> <int>
# 1 1 11:02:00 90.3 3 5
# 2 1 11:03:00 84.2 3 5
# 3 1 11:04:00 74.8 2 2
# 4 1 11:05:00 80.9 3 5
# 5 1 11:06:00 99.4 3 5
# 6 1 11:07:00 81.5 3 5
# 7 1 11:08:00 71.5 2 2
# 8 2 13:30:00 90.1 3 6
# 9 2 13:31:00 88.8 3 6
#10 2 13:32:00 84.7 3 6
#11 2 13:33:00 83.9 3 6
#12 2 13:34:00 94.2 3 6
#13 2 13:35:00 98.5 3 6
#14 2 13:36:00 51 1 1
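If one summary row per (ID, map_bin) pair is enough, rather than the duration repeated on every row, count() gives it directly; a small sketch with the same 1-minute assumption:
output %>% count(ID, map_bin, name = 'map_bin_dur')
# ID map_bin map_bin_dur
#  1       2           2
#  1       3           5
#  2       1           1
#  2       3           6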

Select rows with all longitudinal measurements

I have a longitudinal dataset with ID, Wave (Wave1-4), and Score. Here's sample data with the same structure. The length of the original data is around 2000, with 500 participants total, put in long form.
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
5 1002 1 30
6 1002 3 30
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
11 1004 1 22
12 1005 1 20
13 1005 2 18
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
I would like to select the 'ID's with all four measurements of 'Score' available. In other words, I want to select rows of the participants with 'Score' available for all 4 waves.
I've been trying to select rows with 'ID's that have data in all 'Wave's. My attempt so far has been based on this idea: if a participant has all four measurements, the ID will appear in the data four times.
That's why I tried to count the occurrences of each ID,
table(data$id) == 4
and although it showed me the number of each ID appearing in the data, I cannot select the corresponding rows.
all.data <- subset(data, subset=table(data$id) == 4)
This fails because the logical index has one entry per ID rather than one per row of the long-form data: "Length of logical index must be 1 or 2637, not 828". I would need long-form data for further analysis, so I wish not to change it.
You can try:
df[as.logical(with(df, ave(Wave, ID, FUN = function(x) length(x) == 4))), ]
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
Or, if you want to keep your basic idea, a slight modification of jay.sf's code:
df[df$ID %in% names(which(table(df$ID) == 4)), ]
I like your table() approach.
> table(d$ID) == 4
1001 1002 1003 1004 1005 1006
TRUE FALSE TRUE FALSE FALSE TRUE
The interesting IDs are in the names() though. So to get your code to work you could extract the IDs like so
subs <- names(which(table(d$ID) == 4))
and get your desired subset using %in%.
all.data <- subset(d, subset=d$ID %in% subs)
Result
> all.data
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
(BTW: Always check with ?<name> that you are not using an existing function name as an object name; this will save you a lot of trouble. In your case, type ?data in a fresh session before loading the object.)
Data
> dput(d)
structure(list(ID = c(1001L, 1001L, 1001L, 1001L, 1002L, 1002L,
1003L, 1003L, 1003L, 1003L, 1004L, 1005L, 1005L, 1006L, 1006L,
1006L, 1006L), Wave = c(1L, 2L, 3L, 4L, 1L, 3L, 1L, 2L, 3L, 4L,
1L, 1L, 2L, 1L, 2L, 3L, 4L), Score = c(28L, 27L, 28L, 26L, 30L,
30L, 30L, 30L, 29L, 28L, 22L, 20L, 18L, 22L, 23L, 25L, 19L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17"))
Instead of feeding table(data$ID) to subset, try
ID %in% names(table(data$ID)[table(data$ID)==4])
as table gives you the number of occurrences for each ID (a named vector).
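For completeness, the same selection as a dplyr sketch (assuming the df from the first answer), filtering on group size:
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 4) %>% ungroup()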
This is a quick data.table answer.
library(data.table)
dt <- structure(list(ID = c(1001, 1001, 1001, 1001, 1002, 1002, 1003,
1003, 1003, 1003, 1004, 1005, 1005, 1006, 1006, 1006, 1006),
Wave = c(1, 2, 3, 4, 1, 3, 1, 2, 3, 4, 1, 1, 2, 1, 2, 3,
4), Score = c(28, 27, 28, 26, 30, 30, 30, 30, 29, 28, 22,
20, 18, 22, 23, 25, 19)), row.names = c(NA, -17L), class = c("data.table",
"data.frame"))
dt[ , .(Score, N = uniqueN(.SD)) , by = list(ID), .SDcols = c("Wave")][N == 4,]
    ID Score N
1: 1001 28 4
2: 1001 27 4
3: 1001 28 4
4: 1001 26 4
5: 1003 30 4
6: 1003 30 4
7: 1003 29 4
8: 1003 28 4
9: 1006 22 4
10: 1006 23 4
11: 1006 25 4
12: 1006 19 4
For the sake of completeness, here are two data.table solutions. Both identify those IDs for which Wave has values 1 to 4. One approach uses subsetting, the other one is joining.
Subsetting
library(data.table)
setDT(df)[ID %in% df[ , which(uniqueN(Wave) == 4L), by = ID]$ID]
ID Wave Score
1: 1001 1 28
2: 1001 2 27
3: 1001 3 28
4: 1001 4 26
5: 1003 1 30
6: 1003 2 30
7: 1003 3 29
8: 1003 4 28
9: 1006 1 22
10: 1006 2 23
11: 1006 3 25
12: 1006 4 19
Joining
library(data.table)
setDT(df)[df[, .N, .(ID, Wave)][, .N, ID][N == 4L, .(ID)], on = "ID"]
which returns the same result.
Data
library(data.table)
fread("
rn ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
5 1002 1 30
6 1002 3 30
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
11 1004 1 22
12 1005 1 20
13 1005 2 18
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19", drop = 1L)

How to reset a Variable Value to 0 for starting point? [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 4 years ago.
I have odometer readings with the following sample data for different cars. I intend to reset the odometer value for each car so that I can effectively measure the distance traveled.
Sample data
ID ODometer
1 2132
1 2133
1 2134
1 2135
1 2136
1 2137
2 1123
2 1124
2 1125
Expected:
Expected Output
ID Odometer
1 1
1 2
1 3
1 4
1 5
1 6
2 1
2 2
2 3
We can use row_number() after grouping by 'ID'
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(Odometer = row_number())
# A tibble: 9 x 3
# Groups: ID [2]
# ID ODometer Odometer
# <int> <int> <int>
#1 1 2132 1
#2 1 2133 2
#3 1 2134 3
#4 1 2135 4
#5 1 2136 5
#6 1 2137 6
#7 2 1123 1
#8 2 1124 2
#9 2 1125 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
ODometer = c(2132L,
2133L, 2134L, 2135L, 2136L, 2137L, 1123L, 1124L, 1125L)),
class = "data.frame", row.names = c(NA, -9L))

Moving average and moving slope in R

I am looking to separately calculate a 7-day moving average and 7-day moving slope of 'oldvar'.
My sincere apologies that I didn't add the details below in my original post. These are repeated observations for each id which can go from a minimum of 3 observations per id to 100 observations per id. The start day can be different for different IDs, and to make things complicated, the days are not equally spaced, so some IDs have missing days.
Here is the data structure. Please note that 'average' is the variable that I am trying to create as a moving 7-day average for each ID:
id day outcome average
1 1 15 100 NA
2 1 16 110 NA
3 1 17 190 NA
4 1 18 130 NA
5 1 19 140 NA
6 1 20 150 NA
7 1 21 160 140
8 1 22 100 140
9 1 23 180 150
10 1 24 120 140
12 2 16 90 NA
13 2 17 110 NA
14 2 18 120 NA
12 2 20 130 NA
15 3 16 110 NA
16 3 18 200 NA
17 3 19 180 NA
18 3 21 170 NA
19 3 22 180 168
20 3 24 210 188
21 3 25 160 180
22 3 27 200 184
Also, I would appreciate advice on how to calculate a moving 7-day slope from the same data.
Thank you and again many apologies for being unclear the first time around.
The real challenge is to create a data.frame after filling in the missing rows. One solution could be the zoo library; its rollapply function provides a way to assign NA values to the initial rows.
Using data from OP as is, the solution could be:
library(zoo)
library(dplyr)
# Data from OP
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
day = c(15L,16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 16L, 17L, 18L, 20L,
16L, 18L, 19L, 21L, 22L, 24L, 25L, 27L),
outcome = c(100L, 110L,190L, 130L, 140L, 150L, 160L, 100L, 180L, 120L, 90L, 110L, 120L,
130L, 110L, 200L, 180L, 170L, 180L, 210L, 160L, 200L)),
.Names = c("id", "day", "outcome"), row.names = c(NA, -22L), class = "data.frame")
# Make a complete grid with no missing days for each id
df_complete <- merge(
expand.grid(id=unique(df$id), day=min(df$day):max(df$day)),
df, all=TRUE)
# Valid range of day for each ID group
df_id_wise_range <- df %>% group_by(id) %>%
summarise(min_day = min(day), max_day = max(day)) %>% as.data.frame()
# id min_day max_day
# 1 1 15 24
# 2 2 16 20
# 3 3 16 27
# Join original df and df_complete and then use df_id_wise_range to
# filter it for valid range of day for each group
df_final <- df_complete %>%
  left_join(df, by = c("id", "day")) %>%
  select(-outcome.y) %>%
  inner_join(df_id_wise_range, by = "id") %>%
  filter(day >= min_day & day <= max_day) %>%
  mutate(outcome = outcome.x) %>%
  select(id, day, outcome) %>%
  as.data.frame()
# Now apply mean to get average
df_average <- df_final %>%
  group_by(id) %>%
  mutate(average = rollapply(outcome, 7, mean, na.rm = TRUE, by = 1,
                             fill = NA, align = "right", partial = 7)) %>%
  as.data.frame()
df_average
# The result
# id day outcome average
#1 1 15 100 NA
#2 1 16 110 NA
#3 1 17 190 NA
#4 1 18 130 NA
#5 1 19 140 NA
#6 1 20 150 NA
#7 1 21 160 140.0
#8 1 22 100 140.0
#9 1 23 180 150.0
#10 1 24 120 140.0
#11 2 16 90 NA
#12 2 17 110 NA
#13 2 18 120 NA
#....
#....
#19 3 19 180 NA
#20 3 20 NA NA
#21 3 21 170 NA
#22 3 22 180 168.0
#23 3 23 NA 182.5
#24 3 24 210 188.0
#25 3 25 160 180.0
#26 3 26 NA 180.0
#27 3 27 200 184.0
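As an aside, the expand.grid/merge gap-filling above can also be written more compactly with tidyr::complete and full_seq; a sketch under the assumption that the same df is in scope (df_final2 is a hypothetical name):
library(dplyr)
library(tidyr)
# fill in each id's missing days within its own min-max day range
df_final2 <- df %>%
  group_by(id) %>%
  complete(day = full_seq(day, 1)) %>%
  ungroup() %>%
  as.data.frame()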
The steps to calculate the moving slope are:
First, create a function that returns the slope.
Then, use that function as part of rollapplyr. (The example below uses a toy data frame z2 with columns a and b; its mean_a column is a 7-point rolling mean of a.)
#Function to calculate slope
slop_e <- function(z) coef(lm(b ~ a, as.data.frame(z)))[[2]]
#Apply function
z2$slope <- rollapplyr(zoo(z2), 7, slop_e , by.column = FALSE, fill = NA, align = "right")
z2
a b mean_a slope
1 1 21 NA NA
2 2 22 NA NA
3 3 23 NA NA
4 4 24 NA NA
5 5 25 NA NA
6 6 26 NA NA
7 7 27 4 1
8 8 28 5 1
9 9 29 6 1
10 10 30 7 1
11 11 31 8 1
12 12 32 9 1
13 13 33 10 1
14 14 34 11 1
15 15 35 12 1
16 16 36 13 1
17 17 37 14 1
18 18 38 15 1
19 19 39 16 1
20 20 40 17 1
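To connect this back to the question's data, here is a hedged sketch (assuming the df_final built above; roll_slope and df_slope are hypothetical names) that applies the same rollapplyr pattern per id, regressing outcome on day over 7-day windows:
library(zoo)
library(dplyr)

# slope of outcome ~ day over one window; needs at least 2 complete rows
roll_slope <- function(m) {
  d <- as.data.frame(m)
  if (sum(complete.cases(d)) < 2) return(NA_real_)
  coef(lm(outcome ~ day, data = d))[[2]]
}

df_slope <- df_final %>%
  group_by(id) %>%
  filter(n() >= 7) %>%  # rollapply needs at least one full 7-row window
  mutate(slope = rollapplyr(cbind(day, outcome), 7, roll_slope,
                            by.column = FALSE, fill = NA)) %>%
  ungroup()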

lapply alternative to for loop to append to data frame

I have a data frame:
df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L,
45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")
> head(df)
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
And I want to calculate pos[i+1] - pos[i] within the same chromosome (chrom).
By using a for loop over each chrom level, and another over each row I get the expected results:
for (c in levels(df$chrom)){
df_chrom<-filter(df, chrom == c)
df_chrom<-arrange(df_chrom, df_chrom$pos)
for (i in 1:nrow(df_chrom)){
dist<-(df_chrom$pos[i+1] - df_chrom$pos[i])
logdist<-log10(dist)
cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
}
}
However, I want to save this to a data frame, and I think that lapply or apply is the right way to go about this. I can't work out how to make the pos[i+1] - pos[i] calculation though (seeing as lapply works on each row/column).
Any pointers would be appreciated.
Here's the output from my solution:
chrom index pos dist log10dist
1 1 10 124 2.093422
1 2 134 66 1.819544
1 3 200 200 2.30103
1 4 400 200 2.30103
1 5 600 400 2.60206
1 6 1000 NA NA
2 1 20 13 1.113943
2 2 33 NA NA
3 1 40 5 0.69897
3 2 45 NA NA
4 1 50 5 0.69897
4 2 55 45 1.653213
4 3 100 23 1.361728
4 4 123 NA NA
We could do this using a grouped difference. Convert the 'data.frame' to 'data.table' (setDT(df)); then, grouped by 'chrom', order by 'pos', take the difference of consecutive 'pos' values (diff), and also the log10 of the difference:
library(data.table)
setDT(df)[order(pos), {
  v1 <- diff(pos)
  .(index = seq_len(.N), pos = pos,
    dist = c(v1, NA), logdiff = c(log10(v1), NA))
}, by = chrom]
# chrom index pos dist logdiff
# 1: 1 1 10 124 2.093422
# 2: 1 2 134 66 1.819544
# 3: 1 3 200 200 2.301030
# 4: 1 4 400 200 2.301030
# 5: 1 5 600 400 2.602060
# 6: 1 6 1000 NA NA
# 7: 2 1 20 13 1.113943
# 8: 2 2 33 NA NA
# 9: 3 1 40 5 0.698970
#10: 3 2 45 NA NA
#11: 4 1 50 5 0.698970
#12: 4 2 55 45 1.653213
#13: 4 3 100 23 1.361728
#14: 4 4 123 NA NA
Upon running the OP's code, the printed output is
#1 1 10 124 2.093422
#1 2 134 66 1.819544
#1 3 200 200 2.30103
#1 4 400 200 2.30103
#1 5 600 400 2.60206
#1 6 1000 NA NA
#2 1 20 13 1.113943
#2 2 33 NA NA
#3 1 40 5 0.69897
#3 2 45 NA NA
#4 1 50 5 0.69897
#4 2 55 45 1.653213
#4 3 100 23 1.361728
#4 4 123 NA NA
We split df by df$chrom (Note that we reorder both df and df$chrom before splitting). Then we go through each of the subgroups (the subgroups are called a in this example) using lapply. On the pos column of each subgroup, we calculate difference (diff) of consecutive elements and take log10. Since diff decreases the number of elements by 1, we add a NA to the end. Finally, we rbind all the subgroups together using do.call.
do.call(rbind, lapply(split(df[order(df$chrom, df$pos),], df$chrom[order(df$chrom, df$pos)]),
function(a) data.frame(a, dist = c(log10(diff(a$pos)), NA))))
# chrom pos dist
#1.1 1 10 2.093422
#1.3 1 134 1.819544
#1.2 1 200 2.301030
#1.4 1 400 2.301030
#1.5 1 600 2.602060
#1.6 1 1000 NA
#2.7 2 20 1.113943
#2.8 2 33 NA
#3.9 3 40 0.698970
#3.10 3 45 NA
#4.11 4 50 0.698970
#4.12 4 55 1.653213
#4.13 4 100 1.361728
#4.14 4 123 NA
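For comparison, the same computation as a dplyr sketch (not among the original answers), using lead() so the NA falls on the last row of each group, as in the OP's printed output:
library(dplyr)
df %>%
  arrange(chrom, pos) %>%
  group_by(chrom) %>%
  mutate(dist = lead(pos) - pos,   # pos[i+1] - pos[i] within chrom
         log10dist = log10(dist)) %>%
  ungroup()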
