create new variable from date data - r

My data frame currently looks like this:
dput(head(t.zoo))
structure(c(85.92, 85.85, 85.83, 85.83, 85.85, 85.87, 1300, 1300,
1299.75, 1299.75, 1299.75, 1300), .Dim = c(6L, 2L), .Dimnames = list(
NULL, c("cl", "es")), index = structure(list(sec = c(0.400000095367432,
0.900000095367432, 1.40000009536743, 1.90000009536743, 2.40000009536743,
2.90000009536743), min = c(30L, 30L, 30L, 30L, 30L, 30L), hour = c(10L,
10L, 10L, 10L, 10L, 10L), mday = c(6L, 6L, 6L, 6L, 6L, 6L), mon = c(5L,
5L, 5L, 5L, 5L, 5L), year = c(112L, 112L, 112L, 112L, 112L, 112L
), wday = c(3L, 3L, 3L, 3L, 3L, 3L), yday = c(157L, 157L, 157L,
157L, 157L, 157L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = c("", "EST", "EDT"
)), class = "zoo")
I have two questions. First, I would like to add a variable name for the first column. Second, I want to create a categorical variable that indicates 2012-06-06 (the full data set covers 3 separate days).
What should I do with the date data?

I'm not familiar with the zoo class, so the following code is not pretty, but it seems to work.
yourdata<-as.matrix(yourdata)
justdate <- substr(rownames(yourdata), 1, 10)
justtime <- substr(rownames(yourdata), 11, 19)
row.names(yourdata) <- NULL
yourdata<-as.data.frame(yourdata)
yourdata[,"justdate"]<-justdate
yourdata[,"justtime"]<-justtime
yourdata[yourdata$justdate=="2012-06-06","newvariable"]<-1
> yourdata
cl es justdate justtime newvariable
1 85.92 1300.00 2012-06-06 10:30:00 1
2 85.85 1300.00 2012-06-06 10:30:00 1
3 85.83 1299.75 2012-06-06 10:30:01 1
4 85.83 1299.75 2012-06-06 10:30:01 1
5 85.85 1299.75 2012-06-06 10:30:02 1
6 85.87 1300.00 2012-06-06 10:30:02 1

zoo objects work a little differently from data.frames.
The "first column" (as you referred to it) is actually not a column, but the index of your object. Try index(t.zoo) and see what it returns. This index really should have unique values; in your case, there are duplicated values, which might affect your calculations.
Conversion to a data.frame can be done like the following. I've added separate "Date" and "Time" variables based on the index from t.zoo.
require(zoo) # Load the `zoo` package if you haven't already done so
t.df = data.frame(Date = format(index(t.zoo), "%Y-%m-%d"),
Time = format(index(t.zoo), "%H:%M:%S"),
data.frame(t.zoo))
t.df
# Date Time cl es
# 1 2012-06-06 10:30:00 85.92 1300.00
# 2 2012-06-06 10:30:00 85.85 1300.00
# 3 2012-06-06 10:30:01 85.83 1299.75
# 4 2012-06-06 10:30:01 85.83 1299.75
# 5 2012-06-06 10:30:02 85.85 1299.75
# 6 2012-06-06 10:30:02 85.87 1300.00
Converting back to a zoo object (keeping the new "Date" and "Time" columns, or any other columns that you have added) can be done like:
zoo(t.df, order.by=index(t.zoo))
Note, however, that this will give you a warning because you don't have unique "order.by" values.
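The second part of the question (a categorical variable that flags one particular day) can be sketched in base R once the timestamps are available as a vector. This is only a minimal sketch: the vector idx below is a hypothetical stand-in for index(t.zoo), and the names day and is_june6 are made up for illustration.

```r
# Hypothetical timestamps covering three days (stand-in for index(t.zoo))
idx <- as.POSIXct(c("2012-06-06 10:30:00", "2012-06-06 10:30:01",
                    "2012-06-07 09:00:00", "2012-06-08 09:00:00"),
                  tz = "UTC")

# One factor level per calendar day
day <- factor(format(idx, "%Y-%m-%d"))

# 1 for rows that fall on 2012-06-06, 0 otherwise
is_june6 <- as.integer(day == "2012-06-06")

levels(day)
is_june6
```

Either vector can then be added as a column to the data frame built from the zoo object.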

Related

Calculate date diff using same date field column

I want to find the total sum of running minutes of a battery per month and year. For this I have the following condition:
If Battery.voltage < 50 then "Yes", otherwise "No".
Note: For calculating the total sum of minutes, we can use the time stamp column, which contains day, month, year, hour and minutes.
This is my data:
# Time.stamp Battery.voltage Condition
# 1 01/04/2016 00:00 51 No
# 2 01/04/2016 00:01 52 No
# 3 01/04/2016 00:02 45 Yes
# 4 01/04/2016 00:03 48 Yes
# 5 01/04/2016 00:04 49 Yes
# 6 01/04/2016 00:05 55 No
# 7 01/04/2016 00:06 54 No
# ...
structure(list(
Time.stamp = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 10L, 11L, 12L, 12L, 13L),
.Label = c("01/04/2016 00:00", "01/04/2016 00:01", "01/04/2016 00:02", "01/04/2016 00:03",
"01/04/2016 00:04", "01/04/2016 00:05", "01/04/2016 00:06", "01/04/2016 00:07",
"01/04/2016 00:08", "01/04/2016 00:09", "01/04/2016 00:11", "01/04/2016 00:12",
"01/04/2016 00:13"), class = "factor"),
Battery.voltage = c(51L, 52L, 45L, 48L, 49L, 55L, 54L, 52L, 51L, 49L, 48L, 47L, 45L, 50L, 51L),
Condition = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
.Label = c("No", "Yes"), class = "factor")),
.Names = c("Time.stamp", "Battery.voltage", "Condition"),
class = "data.frame", row.names = c(NA, -15L))
My expected output is something like this:
Month year Sum of mins running in battery
Jan 2016 350min
Feb 2016 450min
etc.
Unfortunately, your sample data is not very representative of your problem statement, as it only includes data for one day. It would have been beneficial to provide some code that generates random data for sufficient entries (i.e. dates).
That aside, you could adapt the following solution (here I assume your timestamp format is "DD/MM/YYYY"):
library(dplyr)

df %>%
  mutate(
    Time.stamp = as.POSIXct(Time.stamp, format = "%d/%m/%Y %H:%M"),
    byday = format(Time.stamp, "%d/%m/%Y"),
    # month plus year, so the same month in different years stays separate
    bymonth = format(Time.stamp, "%m/%Y"),
    byyear = format(Time.stamp, "%Y")) %>%
  group_by(byday) %>%
  summarise(sum.running.in.mins = sum(Condition == "Yes"))
## A tibble: 1 x 2
# byday sum.running.in.mins
# <chr> <int>
#1 01/04/2016 7
Here we create the columns byday, bymonth and byyear, by which you can group entries and calculate the total running time per group. Each row counts as one minute, so the sum of rows with Condition == "Yes" is the running time in minutes. In the above example, I calculate the total running time by day; to get the total running time per month, you would replace group_by(byday) with group_by(bymonth).
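For readers without dplyr, the per-month grouping can also be sketched with base R's aggregate. This swaps in a different technique from the answer's, and it rests on two assumptions: the small data frame below is made up (it spans two months so that the grouping is visible), and every row stands for exactly one minute of readings.

```r
# Hypothetical readings spanning two months, one row per minute
df <- data.frame(
  Time.stamp = c("01/04/2016 00:00", "01/04/2016 00:01", "01/04/2016 00:02",
                 "15/05/2016 10:00", "15/05/2016 10:01"),
  Battery.voltage = c(51L, 45L, 48L, 49L, 55L),
  stringsAsFactors = FALSE
)

df$Time.stamp <- as.POSIXct(df$Time.stamp, format = "%d/%m/%Y %H:%M", tz = "UTC")

# TRUE when the voltage is below the threshold from the question
df$on_battery <- df$Battery.voltage < 50

# Year-month key used for grouping
df$month <- format(df$Time.stamp, "%Y-%m")

# Summing a logical vector counts the TRUE rows, i.e. the minutes
monthly <- aggregate(on_battery ~ month, data = df, FUN = sum)
names(monthly)[2] <- "minutes_on_battery"
monthly
```

If the real data contains duplicated timestamps, as the sample in the question does, they would need to be removed first for the row count to equal minutes.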

Set the start point for time intervals in R

I have different sets of data with the following format
Time Value1 Value2 ....
11/04/2015 15:12:22 1 2 ....
11/04/2015 15:13:46 1 2 ....
And I want to group them in intervals of 15 minutes. I can do this with the following code
data$time = cut(data$time, breaks = "15 min")
data.grouped <- aggregate(data[,c(-1)], by = list(time = data$time), median)
The problem is that the time field in the output has the following values
12/04/2015 16:12
12/04/2015 16:27
12/04/2015 16:42
12/04/2015 16:57
And I want the times to be :00 :15 :30 or :45. Is there any way of forcing the intervals to be like this or a different approach to merge the data that allows it?
A sample data from dput:
structure(list(time = structure(list(sec = c(49, 5, 21, 37, 54,
10, 38), min = c(12L, 13L, 13L, 13L, 13L, 14L, 22L), hour = c(15L,
15L, 15L, 15L, 15L, 15L, 16L), mday = c(11L, 11L, 11L, 11L, 11L,
11L, 12L), mon = c(3L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(116L,
116L, 116L, 116L, 116L, 116L, 116L), wday = c(1L, 1L, 1L, 1L,
1L, 1L, 2L), yday = c(101L, 101L, 101L, 101L, 101L, 101L, 102L
), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST", "CEST",
"CEST", "CEST", "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_)), .Names = c("sec", "min", "hour", "mday", "mon",
"year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt",
"POSIXt")), value1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("time",
"value1"), row.names = c(NA, -7L), class = "data.frame")
Starting with your dput, calling it df, first we'll convert the time column from POSIXlt to POSIXct, then we will floor it to the 15-minute mark at or below each timestamp (use lubridate::round_date instead of floor_date if you want the nearest 15-minute mark in either direction):
df$time = as.POSIXct(df$time)
df$time15 = lubridate::floor_date(df$time, unit = "15 min")
df
# time value1 time15
# 1 2016-04-11 15:12:49 0 2016-04-11 15:00:00
# 2 2016-04-11 15:13:05 0 2016-04-11 15:00:00
# 3 2016-04-11 15:13:21 0 2016-04-11 15:00:00
# 4 2016-04-11 15:13:37 0 2016-04-11 15:00:00
# 5 2016-04-11 15:13:54 0 2016-04-11 15:00:00
# 6 2016-04-11 15:14:10 0 2016-04-11 15:00:00
# 7 2016-04-12 16:22:38 0 2016-04-12 16:15:00
You can then aggregate using the time15 column as the grouper.
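The flooring and the aggregation step can be sketched together without lubridate by rounding the underlying epoch seconds down to a multiple of 900. This is a sketch under stated assumptions: the timestamps and values are made up, and the time zone is UTC (in a zone whose UTC offset is not a multiple of 15 minutes the marks would not land on :00, :15, :30 and :45).

```r
# Hypothetical timestamps and values
t <- as.POSIXct(c("2016-04-11 15:12:49", "2016-04-11 15:13:05",
                  "2016-04-12 16:22:38"), tz = "UTC")
value1 <- c(1, 3, 10)

# Round the epoch seconds down to a multiple of 15 minutes
secs <- 15 * 60
time15 <- as.POSIXct(floor(as.numeric(t) / secs) * secs,
                     origin = "1970-01-01", tz = "UTC")

# Median per 15-minute bucket, using time15 as the grouper
grouped <- aggregate(value1, by = list(time15 = time15), FUN = median)
grouped
```

The first two rows fall into the 15:00 bucket and the third into 16:15, so grouped has two rows.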
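I provide an example you can replicate with your data frame. First, I create a dummy time series (ts) of POSIXct values at 5-minute intervals, and then group them into 15-minute intervals using dplyr. Note that cut starts its breaks at the first timestamp, so the intervals line up with :00, :15, :30 and :45 here because the series starts exactly on the hour.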
ts <- seq.POSIXt(as.POSIXct("2017-01-01", tz = "UTC"),
as.POSIXct("2017-02-01", tz = "UTC"),
by = "5 min")
ts <- as.data.frame(ts)
library(dplyr)
ts %>%
group_by(interval = cut(ts, breaks = "15 min")) %>%
summarise(count= n())
Output
# A tibble: 2,977 x 2
interval count
<fct> <int>
1 2017-01-01 00:00:00 3
2 2017-01-01 00:15:00 3
3 2017-01-01 00:30:00 3
4 2017-01-01 00:45:00 3
5 2017-01-01 01:00:00 3
6 2017-01-01 01:15:00 3
7 2017-01-01 01:30:00 3
8 2017-01-01 01:45:00 3
9 2017-01-01 02:00:00 3
10 2017-01-01 02:15:00 3
# ... with 2,967 more rows

Time aggregation in R

I have a dataset of game sessions (id, count of sessions, average session length in seconds, and date of session for each id).
Here is a sample of mydat:
mydat=read.csv("C:/Users/Admin/desktop/rty.csv", sep=";",dec=",")
structure(list(udid = c(74385162L, 79599601L, 79599601L, 91475825L,
91475825L, 91492531L, 92137561L, 96308016L, 96308016L, 96308016L,
96308016L, 96308016L, 96495076L, 97135620L, 97135620L, 97135620L,
97135620L, 97135620L, 97135620L, 97135620L, 97135620L, 97135620L,
97135620L, 97165942L), count = c(1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L, 9L, 426L,
78L, 884L, 785L, 785L, 22L, 302L, 738L, 280L, 2782L, 5L, 2284L,
144L, 234L, 231L, 539L, 450L), date = structure(c(13L, 3L, 3L,
1L, 1L, 14L, 2L, 11L, 11L, 11L, 12L, 12L, 9L, 7L, 4L, 4L, 5L,
6L, 8L, 8L, 8L, 8L, 8L, 10L), .Label = c("11.10.16", "12.12.16",
"15.11.16", "15.12.16", "16.12.16", "17.12.16", "18.10.16", "18.12.16",
"21.10.16", "26.10.16", "28.11.16", "29.11.16", "31.10.16", "8.10.16"
), class = "factor")), .Names = c("udid", "count", "avg_duration",
"date"), class = "data.frame", row.names = c(NA, -24L))
I need to calculate the time difference between the first date a player appeared and the last date he was seen.
For example, for udid 97135620 the first time he played was 18.10.2016 and the last time he was seen was 18.12.2016, which means the difference between the first and last day = 60.9 days,
while udid 74385162 started on 31.10.2016 and did not play afterwards (i.e. he played only once), which means the difference between the first date and the last date = 0.
udid 79599601 has two sessions on one day (i.e. the user played twice that day), so the difference = 1.
In the output I expect this format, with only the last date and the difference between the last day and the first day.
udid count avg_duration date datediff
74385162 1 39 31.10.2016 0
79599601 1 568 15.11.2016 1
91475825 1 5 11.10.2016 1
91492531 1 79 08.10.2016 0
92137561 1 9 12.12.2016 0
96308016 1 785 29.11.2016 1
96495076 1 22 21.10.2016 0
97135620 1 539 18.12.2016 61
97165942 1 450 26.10.2016 0
How can I do that?
This function calculates the difference between first and last session, and only returns the date of the last session:
get_datediff <- function (x) {
dates <- as.Date(as.character(x$date), "%d.%m.%y")
x <- x[order(dates), ]
if (length(x$date)==1) {
x$datediff <- 0
} else {
x$datediff <- max(1, diff(range(dates)))
}
x[nrow(x), ]
}
This can then be applied to data for each user, making use of dplyr and magrittr packages:
group_by(mydat, udid) %>% do(get_datediff(.))
# A tibble: 9 x 5
# Groups: udid [9]
udid count avg_duration date datediff
<int> <int> <int> <fctr> <dbl>
1 74385162 1 39 31.10.16 0
2 79599601 1 568 15.11.16 1
3 91475825 3 6 11.10.16 1
4 91492531 1 79 8.10.16 0
5 92137561 1 9 12.12.16 0
6 96308016 1 785 29.11.16 1
7 96495076 1 22 21.10.16 0
8 97135620 1 539 18.12.16 61
9 97165942 1 450 26.10.16 0
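The core of the calculation, the span between each user's first and last session date, can be sketched in base R with split and sapply. The data below is made up for illustration, and the sketch returns the plain calendar difference: it does not apply the answer's rule of reporting at least 1 when a user has several sessions.

```r
# Hypothetical sessions: user 1 played once, user 2 twice on one day,
# user 3 on two dates two months apart
sessions <- data.frame(
  udid = c(1L, 2L, 2L, 3L, 3L),
  date = c("31.10.16", "15.11.16", "15.11.16", "18.10.16", "18.12.16"),
  stringsAsFactors = FALSE
)
sessions$date <- as.Date(sessions$date, format = "%d.%m.%y")

# Days between first and last session, per user
span <- sapply(split(sessions$date, sessions$udid),
               function(d) as.numeric(diff(range(d))))
span
```

The result is a named numeric vector with one element per udid.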
The way you describe how your metrics are calculated is confusing, but following what you wrote as closely as possible, I ended up with the following:
dplyr solution:
timeData%>%
mutate(dateFormat = as.Date(date, format = "%d.%m.%y"))%>%
group_by(udid)%>%
arrange(udid,dateFormat)%>%
summarise(dateBetween = difftime(last(dateFormat), first(dateFormat), units = "days"), mean(avg_duration))%>%
left_join((timeData%>%
mutate(dateFormat = as.Date(date, format = "%d.%m.%y"))%>%
select(udid, count,dateFormat)%>%
group_by(udid)%>%
slice(which.min(dateFormat))))
Result:
# A tibble: 9 x 5
udid dateBetween `mean(avg_duration)` count dateFormat
<int> <time> <dbl> <int> <date>
1 74385162 0 days 39.0 1 2016-10-31
2 79599601 0 days 892.0 1 2016-11-15
3 91475825 0 days 5.5 1 2016-10-11
4 91492531 0 days 79.0 1 2016-10-08
5 92137561 0 days 9.0 1 2016-12-12
6 96308016 1 days 591.6 1 2016-11-29
7 96495076 0 days 22.0 1 2016-10-21
8 97135620 61 days 753.9 1 2016-12-18
9 97165942 0 days 450.0 1 2016-10-26

r How to check if values exist in a previous period (rolling)

Here is my dataset:
structure(list(Date = structure(c(14609, 14609, 14609, 14609, 14699, 14699, 14699, 14699, 14790, 14790, 14790, 14790), class = "Date"),
ID = structure(c(5L, 4L, 6L, 10L, 9L, 3L, 10L, 8L, 7L, 1L,
10L, 2L), .Label = c("B00NYQ2", "B03J9L7", "B05DZD1", "B06HC42",
"B09V3X7", "B09YCC8", "X6114659", "X6478816", "X6556701",
"X6812555"), class = "factor"), Name = structure(c(10L, 4L,
9L, 8L, 7L, 3L, 8L, 6L, 2L, 5L, 8L, 1L), .Label = c("AIRA",
"BOUS", "CSCS", "EVF", "GTB", "JER", "MGB", "MPR", "NVB",
"TTNP"), class = "factor"), Score = c(55.075, 54.5, 53.325,
52.175, 70.275, 69.825, 60.15, 60.025, 56.175, 52.65, 52.175,
52.125), Score.rank = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L)), .Names = c("Date", "ID", "Name", "Score", "Score.rank"), row.names = c(1L, 2L, 3L, 4L, 71L, 72L, 73L, 74L, 156L, 157L, 158L, 159L), class = "data.frame")
I'm trying to find which IDs come in and go out when we move into a new period.
What I mean by that is: I want to check whether the ID was present in the previous period, denoted by "Date".
If it existed in the previous period (date), it should not return anything.
If it did not exist in the previous period, it should return "IN".
I also want to show that if it does not exist in the next period, it should return "OUT".
I.e. the number of this period's OUTs should equal the number of next period's INs.
My expected data frame is supposed to look like this:
Date        ID        Name  Score   Score.rank  THIS PERIOD  NEXT PERIOD
31/12/2009  B09V3X7   TTNP  55.075  1                        OUT
31/12/2009  B06HC42   EVF   54.5    2                        OUT
31/12/2009  B09YCC8   NVB   53.325  3                        OUT
31/12/2009  X6812555  MPR   52.175  4
31/3/2010   X6556701  MGB   70.275  1           IN
31/3/2010   B05DZD1   CSCS  69.825  2           IN           OUT
31/3/2010   X6812555  MPR   60.15   3
31/3/2010   X6478816  JER   60.025  4           IN           OUT
30/6/2010   X6114659  BOUS  56.175  1           IN
30/6/2010   B00NYQ2   GTB   52.65   2           IN
30/6/2010   X6812555  MPR   52.175  3
30/6/2010   B03J9L7   AIRA  52.125  4           IN
Can somebody point me in the right direction as to how to do this?
Thanks in advance
Your description and example don't match, unfortunately.
Going by your description, it seems you want to tag entry and exit conditions for the IDs,
which can be achieved as:
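library(dplyr)

# dft is the data frame from the question
dft %>%
  group_by(ID) %>%
  # NA_character_ keeps both branches of if_else the same type
  mutate(This_period = if_else(Date == min(Date), "IN", NA_character_)) %>%
  mutate(Next_period = if_else(Date == max(Date), "OUT", NA_character_))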
and returns:
#Source: local data frame [12 x 7]
#Groups: ID [10]
#
# Date ID Name Score Score.rank This_period Next_period
# <date> <fctr> <fctr> <dbl> <int> <chr> <chr>
#1 2009-12-31 B09V3X7 TTNP 55.075 1 IN OUT
#2 2009-12-31 B06HC42 EVF 54.500 2 IN OUT
#3 2009-12-31 B09YCC8 NVB 53.325 3 IN OUT
#4 2009-12-31 X6812555 MPR 52.175 4 IN <NA>
#5 2010-03-31 X6556701 MGB 70.275 1 IN OUT
#6 2010-03-31 B05DZD1 CSCS 69.825 2 IN OUT
#7 2010-03-31 X6812555 MPR 60.150 3 <NA> <NA>
#8 2010-03-31 X6478816 JER 60.025 4 IN OUT
#9 2010-06-30 X6114659 BOUS 56.175 1 IN OUT
#10 2010-06-30 B00NYQ2 GTB 52.650 2 IN OUT
#11 2010-06-30 X6812555 MPR 52.175 3 <NA> OUT
#12 2010-06-30 B03J9L7 AIRA 52.125 4 IN OUT
However, your example suggests you want to exclude min(Date) from the This_period check and max(Date) from the Next_period check. Is that so? If yes, is Score.rank somehow related to Date?
Please clarify.
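If the answer to that question is yes, one possible reading of the description is a row-by-row comparison against the immediately preceding and following period. The following base R sketch implements that reading under stated assumptions: the data frame d is made up, periods are ranked by their distinct Date values, and the first period gets no "IN" and the last period no "OUT".

```r
# Hypothetical data: three periods, IDs entering and leaving
d <- data.frame(
  Date = as.Date(c("2009-12-31", "2009-12-31", "2010-03-31",
                   "2010-03-31", "2010-06-30", "2010-06-30")),
  ID = c("A", "B", "B", "C", "B", "D"),
  stringsAsFactors = FALSE
)

periods <- sort(unique(d$Date))
p <- match(d$Date, periods)       # period number of each row
key <- paste(p, d$ID)             # "period ID" pairs present in the data

in_prev <- paste(p - 1, d$ID) %in% key   # same ID seen one period earlier
in_next <- paste(p + 1, d$ID) %in% key   # same ID seen one period later

d$This_period <- ifelse(!in_prev & p > 1, "IN", NA)
d$Next_period <- ifelse(!in_next & p < length(periods), "OUT", NA)
d
```

Here ID "C" is tagged both "IN" and "OUT" because it appears only in the middle period, while "B" is never tagged because it is present throughout.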

Record variable value when condition true with dynamic name

I have a 9x2 data frame DATS with prices and POSIXct datetime stamps sampled every 15 minutes, and a list of dates FOMCDATES with the dates of recent FOMC events. I split the POSIXct datetime stamps into separate Date and Time columns. I then add a column FOMCBinary to DATS containing a 1 whenever the date in DATS is contained in FOMCDATES AND the time is 14:30 (EDIT: FOMC is at 14:00, I used 14:30 by mistake; the example is still valid).
I would like to record the Close before the event takes place in a separate variable. The name of the variable should be based on the date of the event. In the case at hand, the result should be: PreEvent-2016-01-27 = 1122.7. Please take into account that this would actually be run on a large sample with dozens of dates, and that the time can be other than 14:30 (e.g. if looking at NFP rather than FOMC).
DATS <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0,0, 0, 0, 0, 0), min = c(30L, 15L, 0L, 45L, 30L, 15L, 0L, 45L,30L), hour = c(15L, 15L, 15L, 14L, 14L, 14L, 14L, 13L, 13L),mday = c(27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L), mon = c(0L,0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(116L, 116L, 116L,116L, 116L, 116L, 116L, 116L, 116L), wday = c(3L, 3L, 3L,3L, 3L, 3L, 3L, 3L, 3L), yday = c(26L, 26L, 26L, 26L, 26L,26L, 26L, 26L, 26L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), zone = c("EST", "EST", "EST", "EST", "EST", "EST","EST", "EST", "EST"), gmtoff = c(NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour","mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Close = c(1127.2, 1127.5,1126.9, 1128.3, 1125.4, 1122.7, 1122.8, 1117.3, 1116)), .Names = c("DateTime","Close"), row.names = 2131:2139, class = "data.frame")
FOMCDATES <- structure(c(16785, 16827, 16876), class = "Date")
DATS$Time <- strftime(DATS$DateTime, format="%H:%M:%S")
DATS$Date <- as.Date(DATS$DateTime)
DATS$FOMCBinary <- ifelse( DATS$Time == "14:30:00" & DATS$Date %in% FOMCDATES, 1, 0)
#Output for FOMCDATES:
[1] 2015-12-16 2016-01-27 2016-03-16
#Output for DATS after calculations performed:
DateTime Close Time Date FOMCBinary
2131 2016-01-27 15:30:00 1127.2 15:30:00 2016-01-27 0
2132 2016-01-27 15:15:00 1127.5 15:15:00 2016-01-27 0
2133 2016-01-27 15:00:00 1126.9 15:00:00 2016-01-27 0
2134 2016-01-27 14:45:00 1128.3 14:45:00 2016-01-27 0
2135 2016-01-27 14:30:00 1125.4 14:30:00 2016-01-27 1
2136 2016-01-27 14:15:00 1122.7 14:15:00 2016-01-27 0
2137 2016-01-27 14:00:00 1122.8 14:00:00 2016-01-27 0
2138 2016-01-27 13:45:00 1117.3 13:45:00 2016-01-27 0
2139 2016-01-27 13:30:00 1116.0 13:30:00 2016-01-27 0
My attempt results in a vector rather than a single value, and the variable name is not dynamic.
#My failed attempt
#Define rowShift function
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r]) }
PreEventLevel <- ifelse(DATS$FOMCBinary > 0, rowShift(DATS$Close, +1), 0)
How could this be achieved?
Thank you very much!
Creating variables in the global environment with dynamic names is not good practice. I would rather use a list as a container for your values, e.g.:
# get the indexes where FOMCBinary > 0
oneIdxs <- which(DATS$FOMCBinary > 0)
# get the close values using indexes on the shifted vector and put the values in a list
PreEventLevel <- as.list(rowShift(DATS$Close,1)[oneIdxs])
# set the dates as names of the element in the list
names(PreEventLevel) <- DATS$Date[oneIdxs]
> PreEventLevel
$`2016-01-27`
[1] 1122.7
# now you can access the values using:
# PreEventLevel[["2016-01-27"]]
# or
# PreEventLevel$`2016-01-27`
Note that you can also simply create a vector with names instead of a list (just remove as.list), and PreEventLevel will be:
> PreEventLevel
2016-01-27
1122.7
# you can access the values using PreEventLevel["2016-01-27"]
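For the larger case mentioned in the question (many event dates, and an event time other than 14:30), the named-vector idea can be sketched as follows. Everything here is an assumption for illustration: the prices data frame, the event_dates vector and the event_time string are made up, and the rows are sorted in ascending time order so that the row before the event is simply the previous index.

```r
# Hypothetical 15-minute prices around two event dates
prices <- data.frame(
  DateTime = as.POSIXct(c("2016-01-27 14:30:00", "2016-01-27 14:15:00",
                          "2016-01-27 14:00:00", "2016-03-16 14:30:00",
                          "2016-03-16 14:15:00"), tz = "UTC"),
  Close = c(1125.4, 1122.7, 1122.8, 2010.0, 2005.5)
)
event_dates <- as.Date(c("2016-01-27", "2016-03-16"))
event_time <- "14:30:00"   # change this for events at another time

# Sort ascending so that row i - 1 is the observation before row i
prices <- prices[order(prices$DateTime), ]

is_event <- format(prices$DateTime, "%H:%M:%S") == event_time &
  as.Date(prices$DateTime) %in% event_dates
idx <- which(is_event)
idx <- idx[idx > 1]        # an event in the first row has no prior close

# Named vector: event date -> close of the preceding observation
pre_event <- setNames(prices$Close[idx - 1],
                      format(prices$DateTime[idx], "%Y-%m-%d"))
pre_event
```

Note that the preceding row is taken as is, so with gaps in the data it could belong to an earlier day; a check on the time difference would be needed to rule that out.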
