Analyzing data in order of column and then row in R

I have a dataset of data logged at 5-minute intervals that also includes data at 1-minute intervals, denoted by _1 to _5 in the header.
Each row represents a 5-minute interval.
datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
20190710 09:00:00 21 13 14 26 29 32
20190710 09:05:00 21 28 28 29 38 12
20190710 09:10:00 20 8 15 29 30 19
20190711 11:12:00 18 6 9 18 51 49
20190711 11:17:00 17 49 48 48 30 10
The actual dataset has an additional 25 columns of data logged at 5 minute intervals and consists of approximately 25000 rows.
I'm looking for an efficient way of analyzing the speed for each day.
For example, to plot the speed for a given day I would take speed_1 to speed_5 from the earliest entry on that day, say 09:00:00, then speed_1 to speed_5 from the next time, 09:05:00, and so on for the whole day.
Currently I have created an additional dataframe for the speed that fills in the times to give:
datetime speed
20190710 09:00:00 13
20190710 09:01:00 14
20190710 09:02:00 26
20190710 09:03:00 29
20190710 09:04:00 32
This results in a second df of 125000 entries. I was wondering if there is a more memory-efficient way of analyzing the original dataset, as the datasets may grow considerably in the future.
Edit: Reproducible code added
structure(list(time = structure(1:3, .Label = c("20190710 09-00-00", "20190710 09-05-00", "20190710 09-10-00"), class = "factor"), temp = c(21, 21, 20), speed_1 = c(13, 28, 8), speed_2 = c(14, 28, 15), speed_3 = c(26, 29, 29), speed_4 = c(29, 38, 30), speed_5 = c(32, 12, 19)), .Names = c("time", "temp", "speed_1", "speed_2", "speed_3", "speed_4", "speed_5"), row.names = c(NA, -3L), class = "data.frame")

Here is a dplyr version:
library(tidyverse)
library(lubridate)
df <- read.table(text='datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
"20190710 09:00:00" 21 13 14 26 29 32
"20190710 09:05:00" 21 28 28 29 38 12
"20190710 09:10:00" 20 8 15 29 30 19
"20190711 11:12:00" 18 6 9 18 51 49
"20190711 11:17:00" 17 49 48 48 30 10',header=T)
# we take our dataframe
df %>%
# ...then we put all the speed columns in one column
pivot_longer(starts_with("speed_")
, names_to = "minute"
, values_to = "speed") %>%
# ...then we...
mutate(datetime = ymd_hms(datetime) #...turn the "datetime" column actually into a datetime format
, minute = gsub("speed_", "", minute) %>% as.numeric() # ...remove "speed_" from the former column names (which are now in column "minute")
, datetime = datetime + minutes(minute - 1)) # ...and add the minute to our datetime...
...to get this:
# A tibble: 25 x 4
datetime temp minute speed
<dttm> <int> <dbl> <int>
1 2019-07-10 09:00:00 21 1 13
2 2019-07-10 09:01:00 21 2 14
3 2019-07-10 09:02:00 21 3 26
4 2019-07-10 09:03:00 21 4 29
5 2019-07-10 09:04:00 21 5 32
6 2019-07-10 09:05:00 21 1 28
7 2019-07-10 09:06:00 21 2 28
8 2019-07-10 09:07:00 21 3 29
9 2019-07-10 09:08:00 21 4 38
10 2019-07-10 09:09:00 21 5 12
# ... with 15 more rows
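Once the data is in this long, one-row-per-minute shape, the per-day plot the question asks about follows naturally. A sketch (the name df_long and the facet layout are my own choices, not from the question):

```r
library(tidyverse)
library(lubridate)

# Save the long-format result of the pipeline above
df_long <- df %>%
  pivot_longer(starts_with("speed_"), names_to = "minute", values_to = "speed") %>%
  mutate(datetime = ymd_hms(datetime),
         minute   = as.numeric(gsub("speed_", "", minute)),
         datetime = datetime + minutes(minute - 1))

# One panel per calendar day, speed over time of day
df_long %>%
  mutate(day = as.Date(datetime)) %>%
  ggplot(aes(datetime, speed)) +
  geom_line() +
  facet_wrap(~ day, scales = "free_x")
```

Note that ymd_hms() assumes the datetime strings use colons, as in the question's first table; the dput in the edit uses hyphens and would need format = "%Y%m%d %H-%M-%S" instead.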

Some example data and expected output would help a lot. I gave it a shot anyway. You can do the following if you simply want a list of all the speeds for every date.
dataset <- read.table(text='datetime temp speed_1 speed_2 speed_3 speed_4 speed_5
"20190710 09:00:00" 21 13 14 26 29 32
"20190710 09:05:00" 21 28 28 29 38 12
"20190710 09:10:00" 20 8 15 29 30 19
"20190711 11:12:00" 18 6 9 18 51 49
"20190711 11:17:00" 17 49 48 48 30 10',header=T)
dataset$datetime <- as.POSIXlt(dataset$datetime,format="%Y%m%d %H:%M:%OS")
lapply(split(dataset,as.Date(dataset$datetime)), function(x) c(t(x[,3:ncol(x)])) )
output:
$`2019-07-10`
[1] 13 14 26 29 32 28 28 29 38 12 8 15 29 30 19
$`2019-07-11`
[1] 6 9 18 51 49 49 48 48 30 10
Edit: Updated answer so that the speeds are in the correct order.

Here is something raw using data.table:
library(data.table)
setDT(df)
df[, time := as.POSIXct(time, format="%Y%m%d %H-%M-%OS")]
out <-
df[, !"temp"
][, melt(.SD, id.vars = "time")
][, time := time + (rleid(variable)-1)*60, time
][order(time), !"variable"]
out
# time value
# 1: 2019-07-10 09:00:00 13
# 2: 2019-07-10 09:01:00 14
# 3: 2019-07-10 09:02:00 26
# 4: 2019-07-10 09:03:00 29
# 5: 2019-07-10 09:04:00 32
# 6: 2019-07-10 09:05:00 28
# 7: 2019-07-10 09:06:00 28
# 8: 2019-07-10 09:07:00 29
# 9: 2019-07-10 09:08:00 38
# 10: 2019-07-10 09:09:00 12
# 11: 2019-07-10 09:10:00 8
# 12: 2019-07-10 09:11:00 15
# 13: 2019-07-10 09:12:00 29
# 14: 2019-07-10 09:13:00 30
# 15: 2019-07-10 09:14:00 19
Data:
df <- data.frame(
time = factor(c("20190710 09-00-00", "20190710 09-05-00", "20190710 09-10-00")),
temp = c(21, 21, 20),
speed_1 = c(13, 28, 8),
speed_2 = c(14, 28, 15),
speed_3 = c(26, 29, 29),
speed_4 = c(29, 38, 30),
speed_5 = c(32, 12, 19)
)

Related

Splitting a dateTime vector if time is greater than x between vector components

I have the following data:
df <- data.frame(index = 1:85,
                 times = c(seq(as.POSIXct("2020-10-03 21:31:00 UTC"),
                               as.POSIXct("2020-10-03 22:25:00 UTC"),
                               "min"),
                           seq(as.POSIXct("2020-11-03 10:10:00 UTC"),
                               as.POSIXct("2020-11-03 10:39:00 UTC"),
                               "min")
                 ))
if we look at row 55 and 56 there is a clear divide in times:
> df[55:56, ]
index times
55 55 2020-10-03 22:25:00
56 56 2020-11-03 10:10:00
I would like to add a third, categorical column called split, based on these gaps,
e.g. df$split[55] = "A" and df$split[56] = "B".
The logic would be:
if the time gap between rows is greater than 5 minutes, start a new category for the subsequent rows, until the next instance where the gap is greater than 5 minutes.
thanks
You could use
library(dplyr)
df %>%
mutate(cat = 1 + cumsum(c(0, diff(times)) > 5))
which returns
index times cat
1 1 2020-10-03 21:31:00 1
2 2 2020-10-03 21:32:00 1
3 3 2020-10-03 21:33:00 1
4 4 2020-10-03 21:34:00 1
5 5 2020-10-03 21:35:00 1
6 6 2020-10-03 21:36:00 1
7 7 2020-10-03 21:37:00 1
8 8 2020-10-03 21:38:00 1
...
53 53 2020-10-03 22:23:00 1
54 54 2020-10-03 22:24:00 1
55 55 2020-10-03 22:25:00 1
56 56 2020-11-03 10:10:00 2
57 57 2020-11-03 10:11:00 2
58 58 2020-11-03 10:12:00 2
59 59 2020-11-03 10:13:00 2
If you need letters or something else, you could for example use
df %>%
mutate(cat = LETTERS[1 + cumsum(c(0, diff(times)) > 5)])
to convert the categories 1 and 2 into A and B.
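One caveat: diff() on a POSIXct vector picks its units (secs, mins, days, ...) automatically from the data, so the literal 5 in diff(times) > 5 only means "5 minutes" when the gaps happen to be reported in minutes. Forcing the units makes the threshold explicit; a base R sketch of the same grouping:

```r
# Gap to the previous row, explicitly in minutes (0 for the first row)
gap_mins <- c(0, as.numeric(diff(df$times), units = "mins"))

# New category whenever the gap exceeds 5 minutes
df$cat <- 1 + cumsum(gap_mins > 5)
```

If letter labels are needed, LETTERS only covers 26 groups; the numeric labels have no such limit.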

Dataframe to tidy format in R

I've this dataframe
x <- data.frame("date" = c("03-01-2005","04-01-2005","05-01-2005","06-01-2005"),
"pricemax.0" = c(50,20,25,56),
"pricemax.200" = c(25,67,89,30),
"pricemax.1000" = c(45,60,40,30),
"pricemax.1400" = c(60,57,32,44),
"pricemin.0" = c(22,15,23,43),
"pricemin.200" = c(21,40,59,21),
"pricemin.1000" = c(32,12,20,24),
"pricemin.1400" = c(30,20,14,20))
The numbers after the dot represent hours, e.g. pricemax.200 would be 02:00. I need to gather the date and time information in one column of class POSIXct, with the other two columns being pricemax and pricemin.
So, what I want is a single datetime column alongside the pricemax and pricemin values.
And what I've done so far:
tidy_x <- x %>%
pivot_longer(
cols = contains("pricemax"),
names_to = c(NA,"hour"),
names_sep = "\\.",
values_to = "pricemax"
) %>%
pivot_longer(
cols = contains("pricemin"),
names_to = c(NA,"hour_2"),
names_sep = "\\.",
values_to = "pricemin"
)
I'm not sure how I can combine the date and time columns and keep the variables pricemin and pricemax organized.
Using dplyr and tidyr, you can do :
library(dplyr)
library(tidyr)
x %>%
pivot_longer(cols = -date,
names_to = c('.value', 'time'),
names_sep = '\\.') %>%
mutate(time = sprintf('%04d', as.integer(time))) %>%
unite(datetime, date, time, sep = " ") %>%
mutate(datetime = lubridate::dmy_hm(datetime))
# A tibble: 16 x 3
# datetime pricemax pricemin
# <dttm> <dbl> <dbl>
# 1 2005-01-03 00:00:00 50 22
# 2 2005-01-03 02:00:00 25 21
# 3 2005-01-03 10:00:00 45 32
# 4 2005-01-03 14:00:00 60 30
# 5 2005-01-04 00:00:00 20 15
# 6 2005-01-04 02:00:00 67 40
# 7 2005-01-04 10:00:00 60 12
# 8 2005-01-04 14:00:00 57 20
# 9 2005-01-05 00:00:00 25 23
#10 2005-01-05 02:00:00 89 59
#11 2005-01-05 10:00:00 40 20
#12 2005-01-05 14:00:00 32 14
#13 2005-01-06 00:00:00 56 43
#14 2005-01-06 02:00:00 30 21
#15 2005-01-06 10:00:00 30 24
#16 2005-01-06 14:00:00 44 20
Get the data in long format, with max and min in different columns and the hour information in a separate column. We make the hour information consistent (4 digits) using sprintf, combine date and hour into one column, and convert it into a datetime value.
Maybe you can try reshape like below to make a long data frame
y <- transform(
reshape(x, direction = "long", varying = -1),
date = strptime(paste(date, time / 100), "%d-%m-%Y %H")
)[c("date", "pricemax", "pricemin")]
y <- `row.names<-`(y[order(y$date),],NULL)
which gives
> y
date pricemax pricemin
1 2005-01-03 00:00:00 50 22
2 2005-01-03 02:00:00 25 21
3 2005-01-03 10:00:00 45 32
4 2005-01-03 14:00:00 60 30
5 2005-01-04 00:00:00 20 15
6 2005-01-04 02:00:00 67 40
7 2005-01-04 10:00:00 60 12
8 2005-01-04 14:00:00 57 20
9 2005-01-05 00:00:00 25 23
10 2005-01-05 02:00:00 89 59
11 2005-01-05 10:00:00 40 20
12 2005-01-05 14:00:00 32 14
13 2005-01-06 00:00:00 56 43
14 2005-01-06 02:00:00 30 21
15 2005-01-06 10:00:00 30 24
16 2005-01-06 14:00:00 44 20
Here is a data.table approach:
setDT(x)
DT <- melt.data.table(x, id.vars = "date")
DT[, c("var", "time") := tstrsplit(variable , ".", fixed=TRUE)
][, datetime := as.POSIXct(paste(date, as.integer(time) / 100), format = "%d-%m-%Y %H")
][, setdiff(names(DT), c("datetime", "var", "value")) := NULL]
DT <- dcast.data.table(DT, datetime ~ var, value.var = "value")
> DT
datetime pricemax pricemin
1: 2005-01-03 00:00:00 50 22
2: 2005-01-03 02:00:00 25 21
3: 2005-01-03 10:00:00 45 32
4: 2005-01-03 14:00:00 60 30
5: 2005-01-04 00:00:00 20 15
6: 2005-01-04 02:00:00 67 40
7: 2005-01-04 10:00:00 60 12
8: 2005-01-04 14:00:00 57 20
9: 2005-01-05 00:00:00 25 23
10: 2005-01-05 02:00:00 89 59
11: 2005-01-05 10:00:00 40 20
12: 2005-01-05 14:00:00 32 14
13: 2005-01-06 00:00:00 56 43
14: 2005-01-06 02:00:00 30 21
15: 2005-01-06 10:00:00 30 24
16: 2005-01-06 14:00:00 44 20

R: the best way to locate the index of last observation of unique values of a column

I have the following data. It is always in ascending order. I want to locate the last observation of each unique value, i.e. the last 0, 1, 2, 3, 4, ... In the example below 1 doesn't exist, so we skip it, find the last 2 and return its index.
I want a vector of the indices of the last observations of all the unique values.
How can I do that? Thanks.
structure(c(0, 0, 0, 0, 2, 2, 3, 3, 13, 14, 14, 14, 14, 24, 34,
35, 37, 38, 38, 40, 42, 42, 43, 43, 44, 54, 54, 54, 64), index = structure(c(1167667200,
1167753600, 1167840000, 1167926400, 1168012800, 1168099200, 1168185600,
1168272000, 1168358400, 1168444800, 1168531200, 1168617600, 1168704000,
1168790400, 1168876800, 1168963200, 1169049600, 1169136000, 1169222400,
1169308800, 1169395200, 1169481600, 1169568000, 1169654400, 1169740800,
1169827200, 1169913600, 1.17e+09, 1170086400), tzone = "", tclass = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"), .Dim = c(29L, 1L), .Dimnames = list(
NULL, "testing"))
You can try:
which(rev(!duplicated(rev(df$testing))))
#> [1] 4 6 8 9 13 14 15 16 17 19 20 22 24 25 28 29
You can use the rle function to determine the run lengths of each value, then index into the appropriate row by means of cumsum:
indices <- cumsum(rle(as.vector(a))$lengths)
a[indices]
testing
2007-01-04 16:00:00 0
2007-01-06 16:00:00 2
2007-01-08 16:00:00 3
2007-01-09 16:00:00 13
2007-01-13 16:00:00 14
2007-01-14 16:00:00 24
2007-01-15 16:00:00 34
2007-01-16 16:00:00 35
2007-01-17 16:00:00 37
2007-01-19 16:00:00 38
2007-01-20 16:00:00 40
2007-01-22 16:00:00 42
2007-01-24 16:00:00 43
2007-01-25 16:00:00 44
2007-01-28 16:00:00 54
2007-01-29 16:00:00 64
library(zoo)
df <- as.data.frame(df)
cumsum(rle(df$testing)$lengths)
# [1] 4 6 8 9 13 14 15 16 17 19 20 22 24 25 28 29
1) If x is the input xts object then this gives the indices of the last occurrence of each element.
findInterval(unique(x), x)
## [1] 4 6 8 9 13 14 15 16 17 19 20 22 24 25 28 29
2) This alternative gives a named vector as the result:
cumsum(table(x))
## 0 2 3 13 14 24 34 35 37 38 40 42 43 44 54 64
## 4 6 8 9 13 14 15 16 17 19 20 22 24 25 28 29
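For completeness, duplicated() also takes fromLast = TRUE, which expresses "last occurrence of each value" directly and avoids the double rev(). A sketch on an abbreviated version of the data (v stands in for the values of the xts object, i.e. coredata(x)):

```r
v <- c(0, 0, 0, 0, 2, 2, 3, 3, 13)      # first nine values from the question's data
which(!duplicated(v, fromLast = TRUE))  # index of the last occurrence of each value
# [1] 4 6 8 9
```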

insert new rows to the time series data, with date added automatically

I have a time-series data frame looks like:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say,
320123.5
323521.7
How can I add the corresponding date to each row? The dates should just continue sequentially from the last row.
Is there any package that can do this automatically, so that the only thing I have to do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
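Since the question asks for something automatic, the two steps above can be wrapped in a small helper. append_points is a hypothetical name, not a function from any package, and it assumes daily spacing and the column names used above:

```r
# Hypothetical helper: append new values, extending the daily date sequence
append_points <- function(df, new.x, date_col = "date", x_col = "x") {
  last_date <- max(df[[date_col]])
  new_rows <- data.frame(seq(last_date + 1, by = "days", length.out = length(new.x)),
                         new.x)
  names(new_rows) <- c(date_col, x_col)
  rbind(df, new_rows)
}

df <- append_points(df, c(32, 33))   # adds rows dated 2015-02-01 and 2015-02-02
```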

How to identify the records that belong to a certain time interval when I know the start and end records of that interval? (R)

So, here is my problem. I have a dataset of locations of radiotagged hummingbirds I’ve been following as part of my thesis. As you might imagine, they fly fast so there were intervals when I lost track of where they were until I eventually found them again.
Now I am trying to identify the segments where the bird was followed continuously (i.e., the intervals between “Lost” periods).
ID Type TimeStart TimeEnd Limiter Starter Ender
1 Observed 6:45:00 6:45:00 NO Start End
2 Lost 6:45:00 5:31:00 YES NO NO
3 Observed 5:31:00 5:31:00 NO Start NO
4 Observed 9:48:00 9:48:00 NO NO NO
5 Observed 10:02:00 10:02:00 NO NO NO
6 Observed 10:18:00 10:18:00 NO NO NO
7 Observed 11:00:00 11:00:00 NO NO NO
8 Observed 13:15:00 13:15:00 NO NO NO
9 Observed 13:34:00 13:34:00 NO NO NO
10 Observed 13:43:00 13:43:00 NO NO NO
11 Observed 13:52:00 13:52:00 NO NO NO
12 Observed 14:25:00 14:25:00 NO NO NO
13 Observed 14:46:00 14:46:00 NO NO End
14 Lost 14:46:00 10:47:00 YES NO NO
15 Observed 10:47:00 10:47:00 NO Start NO
16 Observed 10:57:00 11:00:00 NO NO NO
17 Observed 11:10:00 11:10:00 NO NO NO
18 Observed 11:19:00 11:27:55 NO NO NO
19 Observed 11:28:05 11:32:00 NO NO NO
20 Observed 11:45:00 12:09:00 NO NO NO
21 Observed 11:51:00 11:51:00 NO NO NO
22 Observed 12:11:00 12:11:00 NO NO NO
23 Observed 13:15:00 13:15:00 NO NO End
24 Lost 13:15:00 7:53:00 YES NO NO
25 Observed 7:53:00 7:53:00 NO Start NO
26 Observed 8:48:00 8:48:00 NO NO NO
27 Observed 9:25:00 9:25:00 NO NO NO
28 Observed 9:26:00 9:26:00 NO NO NO
29 Observed 9:32:00 9:33:25 NO NO NO
30 Observed 9:33:35 9:33:35 NO NO NO
31 Observed 9:42:00 9:42:00 NO NO NO
32 Observed 9:44:00 9:44:00 NO NO NO
33 Observed 9:48:00 9:48:00 NO NO NO
34 Observed 9:48:30 9:48:30 NO NO NO
35 Observed 9:51:00 9:51:00 NO NO NO
36 Observed 9:54:00 9:54:00 NO NO NO
37 Observed 9:55:00 9:55:00 NO NO NO
38 Observed 9:57:00 10:01:00 NO NO NO
39 Observed 10:02:00 10:02:00 NO NO NO
40 Observed 10:04:00 10:04:00 NO NO NO
41 Observed 10:06:00 10:06:00 NO NO NO
42 Observed 10:20:00 10:33:00 NO NO NO
43 Observed 10:34:00 10:34:00 NO NO NO
44 Observed 10:39:00 10:39:00 NO NO End
Note: When there is a “Start” and an “End” in the same row it’s because the non-lost period consists only of that record.
I was able to identify the records that start or end these “non-lost” periods (under the columns “Starter” and “Ender”), but now I want to be able to identify those periods by giving them unique identifiers (period A,B,C or 1,2,3, etc).
Ideally, the identifier would be the ID of the start point for that period (i.e., ID[Starter == "Start"]).
I'm looking for something like this:
ID Type TimeStart TimeEnd Limiter Starter Ender Period
1 Observed 6:45:00 6:45:00 NO Start End 1
2 Lost 6:45:00 5:31:00 YES NO NO Lost
3 Observed 5:31:00 5:31:00 NO Start NO 3
4 Observed 9:48:00 9:48:00 NO NO NO 3
5 Observed 10:02:00 10:02:00 NO NO NO 3
6 Observed 10:18:00 10:18:00 NO NO NO 3
7 Observed 11:00:00 11:00:00 NO NO NO 3
8 Observed 13:15:00 13:15:00 NO NO NO 3
9 Observed 13:34:00 13:34:00 NO NO NO 3
10 Observed 13:43:00 13:43:00 NO NO NO 3
11 Observed 13:52:00 13:52:00 NO NO NO 3
12 Observed 14:25:00 14:25:00 NO NO NO 3
13 Observed 14:46:00 14:46:00 NO NO End 3
14 Lost 14:46:00 10:47:00 YES NO NO Lost
15 Observed 10:47:00 10:47:00 NO Start NO 15
16 Observed 10:57:00 11:00:00 NO NO NO 15
17 Observed 11:10:00 11:10:00 NO NO NO 15
18 Observed 11:19:00 11:27:55 NO NO NO 15
19 Observed 11:28:05 11:32:00 NO NO NO 15
20 Observed 11:45:00 12:09:00 NO NO NO 15
21 Observed 11:51:00 11:51:00 NO NO NO 15
22 Observed 12:11:00 12:11:00 NO NO NO 15
23 Observed 13:15:00 13:15:00 NO NO End 15
24 Lost 13:15:00 7:53:00 YES NO NO Lost
Would this be too hard to do in R?
Thanks!
> d <- data.frame(Limiter = rep("NO", 44), Starter = rep("NO", 44), Ender = rep("NO", 44), stringsAsFactors = FALSE)
> d$Starter[c(1, 3, 15, 25)] <- "Start"
> d$Ender[c(1, 13, 23, 44)] <- "End"
> d$Limiter[c(2, 14, 24)] <- "Yes"
> d$Period <- ifelse(d$Limiter == "Yes", "Lost", which(d$Starter == "Start")[cumsum(d$Starter == "Start")])
> d
Limiter Starter Ender Period
1 NO Start End 1
2 Yes NO NO Lost
3 NO Start NO 3
4 NO NO NO 3
5 NO NO NO 3
6 NO NO NO 3
7 NO NO NO 3
8 NO NO NO 3
9 NO NO NO 3
10 NO NO NO 3
11 NO NO NO 3
12 NO NO NO 3
13 NO NO End 3
14 Yes NO NO Lost
15 NO Start NO 15
16 NO NO NO 15
17 NO NO NO 15
18 NO NO NO 15
19 NO NO NO 15
20 NO NO NO 15
21 NO NO NO 15
22 NO NO NO 15
23 NO NO End 15
24 Yes NO NO Lost
25 NO Start NO 25
26 NO NO NO 25
27 NO NO NO 25
28 NO NO NO 25
29 NO NO NO 25
30 NO NO NO 25
31 NO NO NO 25
32 NO NO NO 25
33 NO NO NO 25
34 NO NO NO 25
35 NO NO NO 25
36 NO NO NO 25
37 NO NO NO 25
38 NO NO NO 25
39 NO NO NO 25
40 NO NO NO 25
41 NO NO NO 25
42 NO NO NO 25
43 NO NO NO 25
44 NO NO End 25
