Calculate date diff using same date field column - r

I want to find the total sum of running minutes of a battery per month and year. For this I have the following condition:
If Battery.voltage < 50 then "Yes", otherwise "No".
Note: For calculating the total sum of minutes, we can use the time stamp column, which contains day, month, year, hour and minute.
This is my data:
# Time.stamp Battery.voltage Condition
# 1 01/04/2016 00:00 51 No
# 2 01/04/2016 00:01 52 No
# 3 01/04/2016 00:02 45 Yes
# 4 01/04/2016 00:03 48 Yes
# 5 01/04/2016 00:04 49 Yes
# 6 01/04/2016 00:05 55 No
# 7 01/04/2016 00:06 54 No
# ...
structure(list(
Time.stamp = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 10L, 11L, 12L, 12L, 13L),
.Label = c("01/04/2016 00:00", "01/04/2016 00:01", "01/04/2016 00:02", "01/04/2016 00:03",
"01/04/2016 00:04", "01/04/2016 00:05", "01/04/2016 00:06", "01/04/2016 00:07",
"01/04/2016 00:08", "01/04/2016 00:09", "01/04/2016 00:11", "01/04/2016 00:12",
"01/04/2016 00:13"), class = "factor"),
Battery.voltage = c(51L, 52L, 45L, 48L, 49L, 55L, 54L, 52L, 51L, 49L, 48L, 47L, 45L, 50L, 51L),
Condition = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
.Label = c("No", "Yes"), class = "factor")),
.Names = c("Time.stamp", "Battery.voltage", "Condition"),
class = "data.frame", row.names = c(NA, -15L))
My expected output is something like this:
Month year Sum of mins running in battery
Jan 2016 350min
Feb 2016 450min
etc.

Unfortunately, your sample data is not very representative of your problem statement, as it only includes data for one day. It would have been beneficial to provide some code that generates random data for sufficient entries (i.e. dates).
That aside, you could adapt the following solution (here I assume your timestamp format is "DD/MM/YYYY"):
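library(dplyr)

df %>%
  mutate(
    Time.stamp = as.POSIXct(Time.stamp, format = "%d/%m/%Y %H:%M"),
    byday = format(Time.stamp, "%d/%m/%Y"),
    bymonth = format(Time.stamp, "%m/%Y"),  # month and year, e.g. "04/2016"
    byyear = format(Time.stamp, "%Y")) %>%
  group_by(byday) %>%
  # counts the rows flagged "Yes"; this equals minutes only if there is one row per minute
  summarise(sum.running.in.mins = sum(Condition == "Yes"))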
## A tibble: 1 x 2
# byday sum.running.in.mins
# <chr> <int>
#1 01/04/2016 7
Here we create the columns byday, bymonth and byyear, by which you can group the entries and calculate the total running time per group. In the example above, I calculate the total running time per day; to get the total running time per month, replace group_by(byday) with group_by(bymonth).
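For the month-and-year table the question asks for, the same idea can be sketched in base R. This is only an illustration under stated assumptions: the five-row data frame is made up, each row is taken to stand for one minute of readings, and the column names running, month, year and running_mins are hypothetical.

```r
# Hypothetical sample: three readings in April 2016 and two in May 2016
df <- data.frame(
  Time.stamp = c("01/04/2016 00:00", "01/04/2016 00:01", "02/04/2016 00:02",
                 "01/05/2016 00:00", "01/05/2016 00:01"),
  Battery.voltage = c(51L, 45L, 48L, 49L, 55L),
  stringsAsFactors = FALSE
)

# Parse the DD/MM/YYYY HH:MM timestamps
ts <- as.POSIXct(df$Time.stamp, format = "%d/%m/%Y %H:%M", tz = "UTC")

# The battery is "running" when the voltage is below 50
df$running <- df$Battery.voltage < 50
df$month <- as.integer(format(ts, "%m"))
df$year <- as.integer(format(ts, "%Y"))

# Sum the running rows per month and year (assumes one row per minute)
per_month <- aggregate(running ~ month + year, data = df, FUN = sum)
names(per_month)[3] <- "running_mins"
per_month
```

Grouping on both month and year keeps April 2016 separate from April of any other year, which grouping on the month alone would not.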

Related

R: using a for loop to create a new data table containing min and max variables given multiple column combinations

I am currently working with a data set in R that contains four variables for a large set of individuals: pid, month, window, and agedays. I'm trying to create a loop that will output the min and max agedays of each group of combinations between month and window into a new data table that I can export as a csv.
Here's an example of the data:
pid agedays month window
1 22 2 1
2 35 3 2
3 33 3 2
4 55 3 2
1 66 2 1
2 55 4 2
3 80 4 2
4 90 4 2
I'd like for the new data table to contain the min and max agedays of each group within each combination of window and month as well as the count of each group within each combination. The range for month is 2-24 and the range for window is 0-2.
The data table should look something like this:
month window min max N
2 1 22 66 1
3 2 33 55 3
etc....
where N is the number of unique individuals (pids) within each group
After grouping by 'month' and 'window', get the min and max of 'agedays' and the number of distinct (n_distinct) elements of 'pid':
library(dplyr)
df1 %>%
group_by(month, window) %>%
summarise(min = min(agedays), max = max(agedays), N = n_distinct(pid))
# A tibble: 3 x 5
# Groups: month [3]
# month window min max N
# <int> <int> <int> <int> <int>
#1 2 1 22 66 1
#2 3 2 33 55 3
#3 4 2 55 90 3
We can also do this with data.table
library(data.table)
setDT(df1)[, .(min = min(agedays), max = max(agedays),
N = uniqueN(pid)), by = .(month, window)]
Or using split from base R
do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
function(x) cbind(month = x$month[1], window = x$window[1], min = min(x$agedays), max = max(x$agedays),
N = length(unique(x$pid)))))
data
df1 <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
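Using data.table, we can calculate the min and max of agedays along with the number of rows for each combination of month and window. Note that .N counts rows rather than distinct pids, so month 2 / window 1 shows N = 2 below; use uniqueN(pid) instead if N should be the number of unique individuals.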
library(data.table)
setDT(df) #Convert to data.table if it is not already
df[, .(min_age = min(agedays, na.rm = TRUE),
max_age = max(agedays, na.rm = TRUE), N = .N), .(month, window)]
# month window min_age max_age N
#1: 2 1 22 66 2
#2: 3 2 33 55 3
#3: 4 2 55 90 3
data
df <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -8L))
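The difference between counting rows and counting distinct individuals can be seen side by side in a small base R sketch. It reuses the question's sample values; the column names n_rows and n_pids and the temporary CSV path are assumptions made for illustration.

```r
df <- data.frame(
  pid     = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  agedays = c(22L, 35L, 33L, 55L, 66L, 55L, 80L, 90L),
  month   = c(2L, 3L, 3L, 3L, 2L, 4L, 4L, 4L),
  window  = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)
)

# One data frame per observed month/window combination
groups <- split(df, list(df$month, df$window), drop = TRUE)

# Summarise each group: range of agedays, row count, and distinct pid count
res <- do.call(rbind, lapply(groups, function(g) {
  data.frame(month  = g$month[1],
             window = g$window[1],
             min    = min(g$agedays),
             max    = max(g$agedays),
             n_rows = nrow(g),
             n_pids = length(unique(g$pid)))
}))
rownames(res) <- NULL
res

# Export to CSV (written to a temporary file here; substitute your own path)
write.csv(res, tempfile(fileext = ".csv"), row.names = FALSE)
```

For month 2 / window 1 the two counts differ, because pid 1 appears twice in that group.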

Grouping column events to create start and end dates of events

I'm new to programming and R, and I'm a bit stuck. I have the following data table.
Date |ONIstatus
01/10/1993 |Average
01/11/1993 |Average
01/12/1993 |Average
01/01/1994 |Average
01/02/1994 |High
01/03/1994 |High
01/04/1994 |High
01/05/1994 |High
01/06/1994 |Low
01/07/1994 |Low
01/08/1994 |Average
01/09/1994 |Average
01/10/1994 |Average
01/11/1994 |Average
01/12/1994 |High
01/01/1995 |High
01/02/1995 |Low
01/03/1995 |Low
01/04/1995 |Low
01/05/1995 |Low
I want to extract start and end dates based on sequences of events in the 'ONIstatus' column. So, start date would be at the first set of 'ONIstatus entries' and end date would be when the next sequence starts - So, for example the first few sets of results desired output would be
Start Date | End Date | ONIstatus
01/10/1993 | 01/02/1994 | Average
01/02/1994 | 01/06/1994 | High
01/06/1994 | 01/08/1994 | Low
01/08/1994 | 01/12/1994 | Average
01/12/1994 | 01/02/1995 | High
and so on... I want to loop over the entire data set which has several 100 entries.
I've been trying to do this with Dplyr and rle, but not having much luck
s <- rle(as.character(df$ONIstatus))
df_final <- data.frame(ONIstatus = s$values, length = s$lengths)
#end index
df_final$end <- cumsum(df_final$length)
df_final$desired_end <- df_final$end +1
#start index
df_final$start <- df_final$end - df_final$length + 1
#start_date & end_date calculation based on start & end index
df_final$start_date <- df$Date[df_final$start]
df_final$end_date <- df$Date[df_final$desired_end]
#final output
df_final <- na.omit(df_final[,c('ONIstatus','start_date','end_date')])
df_final
Output is:
ONIstatus start_date end_date
1 Average 01/10/1993 01/02/1994
2 High 01/02/1994 01/06/1994
3 Low 01/06/1994 01/08/1994
4 Average 01/08/1994 01/12/1994
5 High 01/12/1994 01/02/1995
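How the start and end indices fall out of rle() may be easier to see on a tiny vector. The status vector below is made up for illustration; only rle() and cumsum() from base R are used.

```r
# Hypothetical status sequence with four runs
status <- c("Average", "Average", "High", "High", "High", "Low", "Average")

runs <- rle(status)
runs$values   # one entry per run of equal values
runs$lengths  # how long each run is

# Last row of each run, then first row of each run
end_idx   <- cumsum(runs$lengths)
start_idx <- end_idx - runs$lengths + 1

data.frame(status = runs$values, start = start_idx, end = end_idx)
```

Adding 1 to end_idx gives the first row of the following run, which is why the last run ends up with an NA end date and is dropped by na.omit().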
#sample data
> dput(df)
structure(list(Date = structure(c(15L, 17L, 19L, 1L, 3L, 5L,
7L, 9L, 11L, 12L, 13L, 14L, 16L, 18L, 20L, 2L, 4L, 6L, 8L, 10L
), .Label = c("01/01/1994", "01/01/1995", "01/02/1994", "01/02/1995",
"01/03/1994", "01/03/1995", "01/04/1994", "01/04/1995", "01/05/1994",
"01/05/1995", "01/06/1994", "01/07/1994", "01/08/1994", "01/09/1994",
"01/10/1993", "01/10/1994", "01/11/1993", "01/11/1994", "01/12/1993",
"01/12/1994"), class = "factor"), ONIstatus = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("Average", "High", "Low"), class = "factor")), .Names = c("Date",
"ONIstatus"), class = "data.frame", row.names = c(NA, -20L))
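We can use tidyverse. The rows are grouped by run of consecutive equal ONIstatus values rather than by ONIstatus alone, so that a status that recurs later starts a new event (this assumes the rows are already in date order):
library(dplyr)
library(lubridate)
df1 %>%
  mutate(Date = dmy(Date),
         ONIstatus = as.character(ONIstatus),
         # run id: increases by one whenever the status changes
         run = cumsum(ONIstatus != lag(ONIstatus, default = first(ONIstatus)))) %>%
  group_by(run, ONIstatus) %>%
  summarise(StartDate = min(Date), .groups = "drop") %>%
  mutate(EndDate = lead(StartDate)) %>%  # an event ends where the next one starts
  na.omit() %>%                          # the last event has no end date yet
  mutate(across(c(StartDate, EndDate), ~ format(.x, "%d/%m/%Y"))) %>%
  select(StartDate, EndDate, ONIstatus)
# A tibble: 5 x 3
#  StartDate  EndDate    ONIstatus
#  <chr>      <chr>      <chr>
#1 01/10/1993 01/02/1994 Average
#2 01/02/1994 01/06/1994 High
#3 01/06/1994 01/08/1994 Low
#4 01/08/1994 01/12/1994 Average
#5 01/12/1994 01/02/1995 High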

Delete rows if column values are equal

I want to delete the rows whose values in the columns (YEAR, POL, CTY, ID, AMOUNT) are equal to those of another row. Please see the output table below.
Table:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 36 3500 RANGE1 L0015N20W23
2017 30408 11 36 3500 RANGE1 L00210N20W24
2017 30408 11 36 3500 RANGE1 L00310N20W25
2017 30409 11 36 3500 RANGE1 L0015N20W23
2017 30409 11 35 3500 RANGE2 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
Output:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 35 3500 RANGE1 L0015N20W23
You can try this:
no_duplicate_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")
new_df <- df[!duplicated(df[, no_duplicate_cols]), ]
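The data frame new_df will hold one row per unique combination of these columns: duplicated() flags only the second and later occurrences, so the first row of each combination is kept.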
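If every row that has a duplicate should go, including the first occurrence, duplicated() can be applied from both ends. The sketch below rebuilds the key columns of the question's table; the names key, in_dup_group, first_only and no_dups are made up for illustration.

```r
df <- data.frame(
  YEAR   = 2017L,
  POL    = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L, 30409L),
  CTY    = 11L,
  ID     = c(36L, 36L, 36L, 36L, 35L, 35L, 35L),
  AMOUNT = 3500L,
  RAN    = c("RANGE1", "RANGE1", "RANGE1", "RANGE1", "RANGE2", "RANGE3", "RANGE3"),
  stringsAsFactors = FALSE
)

key <- df[, c("YEAR", "POL", "CTY", "ID", "AMOUNT")]

# TRUE for every member of a group that occurs more than once:
# duplicated() marks later repeats, fromLast = TRUE marks earlier ones
in_dup_group <- duplicated(key) | duplicated(key, fromLast = TRUE)

first_only <- df[!duplicated(key), ]  # keeps the first row of each combination
no_dups    <- df[!in_dup_group, ]     # keeps only combinations that occur once

nrow(first_only)
no_dups
```

On this data only the POL 30409 / ID 36 row survives in no_dups, since it is the one combination that occurs exactly once.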
If I understood the question correctly then I think you can try this
library(dplyr)
df %>%
group_by(YEAR, POL, CTY, ID, AMOUNT) %>%
filter(n() == 1)
Output (it seems that the output provided in the original question has a bit of a typo!):
# A tibble: 1 x 7
# Groups: YEAR, POL, CTY, ID, AMOUNT [1]
YEAR POL CTY ID AMOUNT RAN LEGAL
1 2017 30409 11 36 3500 RANGE1 L0015N20W23
#sample data
> dput(df)
structure(list(YEAR = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L), POL = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L,
30409L), CTY = c(11L, 11L, 11L, 11L, 11L, 11L, 11L), ID = c(36L,
36L, 36L, 36L, 35L, 35L, 35L), AMOUNT = c(3500L, 3500L, 3500L,
3500L, 3500L, 3500L, 3500L), RAN = structure(c(1L, 1L, 1L, 1L,
2L, 3L, 3L), .Label = c("RANGE1", "RANGE2", "RANGE3"), class = "factor"),
LEGAL = structure(c(1L, 2L, 3L, 1L, 4L, 4L, 4L), .Label = c("L0015N20W23",
"L00210N20W24", "L00310N20W25", "NANANA"), class = "factor")), .Names = c("YEAR",
"POL", "CTY", "ID", "AMOUNT", "RAN", "LEGAL"), class = "data.frame", row.names = c(NA,
-7L))

r How to check if values exist in a previous period (rolling)

Here is my dataset:
structure(list(Date = structure(c(14609, 14609, 14609, 14609, 14699, 14699, 14699, 14699, 14790, 14790, 14790, 14790), class = "Date"),
ID = structure(c(5L, 4L, 6L, 10L, 9L, 3L, 10L, 8L, 7L, 1L,
10L, 2L), .Label = c("B00NYQ2", "B03J9L7", "B05DZD1", "B06HC42",
"B09V3X7", "B09YCC8", "X6114659", "X6478816", "X6556701",
"X6812555"), class = "factor"), Name = structure(c(10L, 4L,
9L, 8L, 7L, 3L, 8L, 6L, 2L, 5L, 8L, 1L), .Label = c("AIRA",
"BOUS", "CSCS", "EVF", "GTB", "JER", "MGB", "MPR", "NVB",
"TTNP"), class = "factor"), Score = c(55.075, 54.5, 53.325,
52.175, 70.275, 69.825, 60.15, 60.025, 56.175, 52.65, 52.175,
52.125), Score.rank = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L)), .Names = c("Date", "ID", "Name", "Score", "Score.rank"), row.names = c(1L, 2L, 3L, 4L, 71L, 72L, 73L, 74L, 156L, 157L, 158L, 159L), class = "data.frame")
I'm trying to find which IDs come in and out when we go into a new period.
What I mean by that is: I want to check whether the ID was present in the previous period, denoted by "Date".
If it existed in the previous period (date), it should not return anything.
If it did not exist in the previous period, it should return "IN".
I also want to show that if it does not exist in the next period, it should return "OUT".
I.e. the number of this period's OUTs should be equal to the number of the next period's INs.
my expected dataframe is supposed to look like this
Date ID Name Score Score.rank THIS PERIOD NEXT PERIOD
31/12/2009 B09V3X7 TTNP 55.075 1 OUT
31/12/2009 B06HC42 EVF 54.5 2 OUT
31/12/2009 B09YCC8 NVB 53.325 3 OUT
31/12/2009 X6812555 MPR 52.175 4
31/3/2010 X6556701 MGB 70.275 1 IN
31/3/2010 B05DZD1 CSCS 69.825 2 IN OUT
31/3/2010 X6812555 MPR 60.15 3
31/3/2010 X6478816 JER 60.025 4 IN OUT
30/6/2010 X6114659 BOUS 56.175 1 IN
30/6/2010 B00NYQ2 GTB 52.65 2 IN
30/6/2010 X6812555 MPR 52.175 3
30/6/2010 B03J9L7 AIRA 52.125 4 IN
Can somebody point me in the right direction as to how to do this?
Thanks in advance
Your description and example don't match, unfortunately.
Going by your description, it seems you want to tag the entry and exit of each ID, which can be achieved as:
library(dplyr)

dft %>%
  group_by(ID) %>%
  # first date on which the ID appears; NA_character_ keeps both branches the same type
  dplyr::mutate(This_period = if_else(Date == min(Date), "IN", NA_character_)) %>%
  # last date on which the ID appears
  dplyr::mutate(Next_period = if_else(Date == max(Date), "OUT", NA_character_))
and returns:
#Source: local data frame [12 x 7]
#Groups: ID [10]
#
# Date ID Name Score Score.rank This_period Next_period
# <date> <fctr> <fctr> <dbl> <int> <chr> <chr>
#1 2009-12-31 B09V3X7 TTNP 55.075 1 IN OUT
#2 2009-12-31 B06HC42 EVF 54.500 2 IN OUT
#3 2009-12-31 B09YCC8 NVB 53.325 3 IN OUT
#4 2009-12-31 X6812555 MPR 52.175 4 IN <NA>
#5 2010-03-31 X6556701 MGB 70.275 1 IN OUT
#6 2010-03-31 B05DZD1 CSCS 69.825 2 IN OUT
#7 2010-03-31 X6812555 MPR 60.150 3 <NA> <NA>
#8 2010-03-31 X6478816 JER 60.025 4 IN OUT
#9 2010-06-30 X6114659 BOUS 56.175 1 IN OUT
#10 2010-06-30 B00NYQ2 GTB 52.650 2 IN OUT
#11 2010-06-30 X6812555 MPR 52.175 3 <NA> OUT
#12 2010-06-30 B03J9L7 AIRA 52.125 4 IN OUT
However, your example suggests you want to exclude min(Date) from the This_period check and max(Date) from the Next_period check. Is that so? If yes, is Score.rank somehow related to Date?
Please clarify.
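If the question's expected table is taken at face value (no "IN" in the first period, no "OUT" in the last, and membership compared only with the immediately adjacent period), one possible reading can be sketched in base R. The six-row data frame and its IDs are made up, and this is an interpretation of the question, not a confirmed requirement.

```r
# Hypothetical data: ID "M" is present in every period, the others in one each
df <- data.frame(
  Date = as.Date(c("2009-12-31", "2009-12-31", "2010-03-31",
                   "2010-03-31", "2010-06-30", "2010-06-30")),
  ID = c("A", "M", "B", "M", "C", "M"),
  stringsAsFactors = FALSE
)

periods <- sort(unique(df$Date))
p <- match(df$Date, periods)          # period number of each row
key <- paste(p, df$ID)                # "period ID" pairs that exist

in_prev <- paste(p - 1, df$ID) %in% key   # same ID in the previous period?
in_next <- paste(p + 1, df$ID) %in% key   # same ID in the next period?

# No "IN" for the first period and no "OUT" for the last period
df$This_period <- ifelse(!in_prev & p > 1, "IN", NA_character_)
df$Next_period <- ifelse(!in_next & p < length(periods), "OUT", NA_character_)
df
```

Unlike the min(Date)/max(Date) approach, this compares adjacent periods, so an ID that leaves and later returns would be tagged "OUT" and then "IN" again.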

create new variable from date data

Now my data frame is like below
dput(head(t.zoo))
structure(c(85.92, 85.85, 85.83, 85.83, 85.85, 85.87, 1300, 1300,
1299.75, 1299.75, 1299.75, 1300), .Dim = c(6L, 2L), .Dimnames = list(
NULL, c("cl", "es")), index = structure(list(sec = c(0.400000095367432,
0.900000095367432, 1.40000009536743, 1.90000009536743, 2.40000009536743,
2.90000009536743), min = c(30L, 30L, 30L, 30L, 30L, 30L), hour = c(10L,
10L, 10L, 10L, 10L, 10L), mday = c(6L, 6L, 6L, 6L, 6L, 6L), mon = c(5L,
5L, 5L, 5L, 5L, 5L), year = c(112L, 112L, 112L, 112L, 112L, 112L
), wday = c(3L, 3L, 3L, 3L, 3L, 3L), yday = c(157L, 157L, 157L,
157L, 157L, 157L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = c("", "EST", "EDT"
)), class = "zoo")
I have two questions. First, I would like to add a variable name for the first column. Second, I want to create a categorical variable that indicates 2010-06-06 (since there are 3 separate days).
What should I do with the date data?
I'm not familiar with the zoo class, so the following code is not pretty, but it seems to work.
yourdata<-as.matrix(yourdata)
justdate <- substr(rownames(yourdata), 1, 10)
justtime <- substr(rownames(yourdata), 12, 19) # start at 12 to skip the space after the date
row.names(yourdata) <- NULL
yourdata<-as.data.frame(yourdata)
yourdata[,"justdate"]<-justdate
yourdata[,"justtime"]<-justtime
yourdata[yourdata$justdate=="2012-06-06","newvariable"]<-1
> yourdata
cl es justdate justtime newvariable
1 85.92 1300.00 2012-06-06 10:30:00 1
2 85.85 1300.00 2012-06-06 10:30:00 1
3 85.83 1299.75 2012-06-06 10:30:01 1
4 85.83 1299.75 2012-06-06 10:30:01 1
5 85.85 1299.75 2012-06-06 10:30:02 1
6 85.87 1300.00 2012-06-06 10:30:02 1
zoo objects are a little bit different to work with from data.frames.
The "first column" (as you referred to it) is actually not a column, but the index of your object. Try index(t.zoo) and see what it returns. This index really should have unique values; in your case, there are duplicated values, which might affect your calculations.
Conversion to a data.frame can be done like the following. I've added separate "Date" and "Time" variables based on the index from t.zoo.
require(zoo) # Load the `zoo` package if you haven't already done so
t.df = data.frame(Date = format(index(t.zoo), "%Y-%m-%d"),
Time = format(index(t.zoo), "%H:%M:%S"),
data.frame(t.zoo))
t.df
# Date Time cl es
# 1 2012-06-06 10:30:00 85.92 1300.00
# 2 2012-06-06 10:30:00 85.85 1300.00
# 3 2012-06-06 10:30:01 85.83 1299.75
# 4 2012-06-06 10:30:01 85.83 1299.75
# 5 2012-06-06 10:30:02 85.85 1299.75
# 6 2012-06-06 10:30:02 85.87 1300.00
Converting back to a zoo object (keeping the new "Date" and "Time" columns, or any other columns that you have added) can be done like:
zoo(t.df, order.by=index(t.zoo))
Note, however, that this will give you a warning because you don't have unique "order.by" values.
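The second part of the question (a categorical variable marking one particular day) and the duplicate-index caveat can both be illustrated without zoo. The sketch below uses a made-up POSIXct vector in place of index(t.zoo); the names idx, day, is_target and day_factor are hypothetical.

```r
# Hypothetical timestamps standing in for index(t.zoo);
# the first two are identical on purpose
idx <- as.POSIXct(c("2012-06-06 10:30:00", "2012-06-06 10:30:00",
                    "2012-06-06 10:30:01", "2012-06-07 09:00:00"),
                  tz = "UTC")

# Calendar day of each observation
day <- format(idx, "%Y-%m-%d")

# 0/1 indicator for one specific day
is_target <- as.integer(day == "2012-06-06")

# Or a categorical variable with one level per day
day_factor <- factor(day)

# Position of the first repeated timestamp (0 would mean all are unique)
anyDuplicated(idx)

data.frame(idx, day, is_target, day_factor)
```

A non-zero result from anyDuplicated() is the situation that triggers the order.by warning mentioned above.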
