Grouping column events to create start and end dates of events - r

I'm new to programming and R, and I'm a bit stuck. I have the following data table.
Date |ONIstatus
01/10/1993 |Average
01/11/1993 |Average
01/12/1993 |Average
01/01/1994 |Average
01/02/1994 |High
01/03/1994 |High
01/04/1994 |High
01/05/1994 |High
01/06/1994 |Low
01/07/1994 |Low
01/08/1994 |Average
01/09/1994 |Average
01/10/1994 |Average
01/11/1994 |Average
01/12/1994 |High
01/01/1995 |High
01/02/1995 |Low
01/03/1995 |Low
01/04/1995 |Low
01/05/1995 |Low
I want to extract start and end dates based on runs of consecutive identical values in the 'ONIstatus' column. The start date is the first date of a run, and the end date is the date on which the next run starts. For example, the desired output for the first few runs would be
Start Date | End Date | ONIstatus
01/10/1993 | 01/02/1994 | Average
01/02/1994 | 01/06/1994 | High
01/06/1994 | 01/08/1994 | Low
01/08/1994 | 01/12/1994 | Average
01/12/1994 | 01/02/1995 | High
and so on... I want to loop over the entire data set, which has several hundred entries.
I've been trying to do this with dplyr and rle, but I'm not having much luck.

s <- rle(as.character(df$ONIstatus))
df_final <- data.frame(ONIstatus = s$values, length = s$lengths)
#end index
df_final$end <- cumsum(df_final$length)
df_final$desired_end <- df_final$end +1
#start index
df_final$start <- df_final$end - df_final$length + 1
#start_date & end_date calculation based on start & end index
df_final$start_date <- df$Date[df_final$start]
df_final$end_date <- df$Date[df_final$desired_end]
#final output
df_final <- na.omit(df_final[,c('ONIstatus','start_date','end_date')])
df_final
Output is:
ONIstatus start_date end_date
1 Average 01/10/1993 01/02/1994
2 High 01/02/1994 01/06/1994
3 Low 01/06/1994 01/08/1994
4 Average 01/08/1994 01/12/1994
5 High 01/12/1994 01/02/1995
#sample data
> dput(df)
structure(list(Date = structure(c(15L, 17L, 19L, 1L, 3L, 5L,
7L, 9L, 11L, 12L, 13L, 14L, 16L, 18L, 20L, 2L, 4L, 6L, 8L, 10L
), .Label = c("01/01/1994", "01/01/1995", "01/02/1994", "01/02/1995",
"01/03/1994", "01/03/1995", "01/04/1994", "01/04/1995", "01/05/1994",
"01/05/1995", "01/06/1994", "01/07/1994", "01/08/1994", "01/09/1994",
"01/10/1993", "01/10/1994", "01/11/1993", "01/11/1994", "01/12/1993",
"01/12/1994"), class = "factor"), ONIstatus = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("Average", "High", "Low"), class = "factor")), .Names = c("Date",
"ONIstatus"), class = "data.frame", row.names = c(NA, -20L))
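To see what the index arithmetic in the rle answer above is doing, here is a minimal base R sketch on a short made-up vector. The `status` values are illustrative only and are not the question's data.

```r
# Hypothetical toy input: three runs of lengths 2, 3 and 1
status <- c("Average", "Average", "High", "High", "High", "Low")

runs <- rle(status)                       # values: Average, High, Low

# Last row of each run, first row of each run, and first row of the next run
end_idx    <- cumsum(runs$lengths)
start_idx  <- end_idx - runs$lengths + 1
next_start <- end_idx + 1

# Indexing past the end of the vector gives NA, which is why the answer
# drops the final run with na.omit()
data.frame(status = runs$values, start_idx, end_idx,
           next_status = status[next_start])
```

The last run has no following run, so its `next_start` points one past the end of the data and the lookup returns NA.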

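We can use tidyverse. Grouping by ONIstatus alone would merge separate runs of the same status, so we first build a run id with rle and group by that
library(dplyr)
library(lubridate)
df %>%
  mutate(Date = dmy(as.character(Date)),
         # run id: increases by one each time ONIstatus changes
         run = with(rle(as.character(ONIstatus)),
                    rep(seq_along(lengths), lengths))) %>%
  group_by(run, ONIstatus) %>%
  summarise(StartDate = min(Date), .groups = "drop") %>%
  # a run ends on the date the next run starts
  mutate(EndDate = lead(StartDate)) %>%
  # the last run has no following run, so it is dropped
  na.omit() %>%
  mutate(across(c(StartDate, EndDate), ~ format(.x, "%d/%m/%Y"))) %>%
  select(StartDate, EndDate, ONIstatus)
# A tibble: 5 x 3
# StartDate EndDate ONIstatus
# <chr> <chr> <fct>
#1 01/10/1993 01/02/1994 Average
#2 01/02/1994 01/06/1994 High
#3 01/06/1994 01/08/1994 Low
#4 01/08/1994 01/12/1994 Average
#5 01/12/1994 01/02/1995 High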

Related

R: using a for loop to create a new data table containing min and max variables given multiple column combinations

I am currently working with a data set in R that contains four variables for a large set of individuals: pid, month, window, and agedays. I'm trying to create a loop that will output the min and max agedays of each group of combinations between month and window into a new data table that I can export as a csv.
Here's an example of the data:
pid agedays month window
1 22 2 1
2 35 3 2
3 33 3 2
4 55 3 2
1 66 2 1
2 55 4 2
3 80 4 2
4 90 4 2
I'd like for the new data table to contain the min and max agedays of each group within each combination of window and month as well as the count of each group within each combination. The range for month is 2-24 and the range for window is 0-2.
The data table should look something like this:
month window min max N
2 1 22 66 1
3 2 33 55 3
etc....
where N is the number of unique individuals (pids) within each group
After grouping by 'month', 'window', get the min, max of 'agedays' and the number of distinct (n_distinct) elements of 'pid'
library(dplyr)
df1 %>%
group_by(month, window) %>%
summarise(min = min(agedays), max = max(agedays), N = n_distinct(pid))
# A tibble: 3 x 5
# Groups: month [3]
# month window min max N
# <int> <int> <int> <int> <int>
#1 2 1 22 66 1
#2 3 2 33 55 3
#3 4 2 55 90 3
We can also do this with data.table
library(data.table)
setDT(df1)[, .(min = min(agedays), max = max(agedays),
N = uniqueN(pid)), by = .(month, window)]
Or using split from base R
do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
function(x) cbind(month = x$month[1], window = x$window[1], min = min(x$agedays), max = max(x$agedays),
N = length(unique(x$pid)))))
data
df1 <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
Using data.table, we can calculate min, max of agedays along with number of rows for each combination of month and window.
library(data.table)
setDT(df) #Convert to data.table if it is not already
df[, .(min_age = min(agedays, na.rm = TRUE),
max_age = max(agedays, na.rm = TRUE), N = .N), .(month, window)]
# month window min_age max_age N
#1: 2 1 22 66 2
#2: 3 2 33 55 3
#3: 4 2 55 90 3
data
df <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -8L))
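The two answers above differ in what N means: n_distinct(pid) and uniqueN(pid) count unique individuals, while .N counts rows. A small base R sketch of that difference, using the question's sample values and `aggregate` (a swapped-in base R function, not what either answer uses):

```r
# Sample data copied from the question
df1 <- data.frame(
  pid     = c(1, 2, 3, 4, 1, 2, 3, 4),
  agedays = c(22, 35, 33, 55, 66, 55, 80, 90),
  month   = c(2, 3, 3, 3, 2, 4, 4, 4),
  window  = c(1, 2, 2, 2, 1, 2, 2, 2)
)

# Number of rows per (month, window) group
n_rows <- aggregate(pid ~ month + window, data = df1, FUN = length)

# Number of distinct pids per (month, window) group
n_pids <- aggregate(pid ~ month + window, data = df1,
                    FUN = function(p) length(unique(p)))

n_rows
n_pids
```

For month 2, window 1 the same pid appears twice, so the row count is 2 while the distinct count is 1, which is the value the question's expected output shows.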

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations for time and gender. For each id, I want to keep the rows in R up to and including the first row at which the cumulative sum of time reaches a value greater than or equal to 25. Notice that for id 2 all observations will be included, and for id 3 only the first observation is included. The expected results would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is using lag of cumsum as:
library(dplyr)
df %>% group_by(id,gender) %>%
filter(lag(cumsum(time), default = 0) < 25 )
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table: (Updated based on feedback from @Renu)
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE when the cumsum of 'time' is greater than or equal to 25.
Then you can filter for rows where the cumsum of this vector is less than or equal to 1, i.e. keep entries up to and including the first TRUE for each 'id'.
df %>%
group_by(id) %>%
filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
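To make the two filters above concrete, here is a minimal base R sketch on the `time` values of id 1 from the question. It compares them with a plain `cumsum(time) < 25` test, which would drop the row that first reaches the threshold.

```r
# time values for id 1, taken from the question
time <- c(21, 3, 4, 9)

running <- cumsum(time)                     # 21 24 28 37

# Lagged running total: what had accumulated before each row
keep_lag <- c(0, head(running, -1)) < 25

# Double cumsum: count how many rows have reached the threshold so far
keep_flag <- cumsum(running >= 25) <= 1

# Naive test on the running total itself
keep_naive <- running < 25

rbind(keep_lag, keep_flag, keep_naive)
```

Both `keep_lag` and `keep_flag` keep the first three rows, while `keep_naive` keeps only the first two, so it misses the row that brings the total to 28.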
You can also try this dplyr construction:
library(dplyr)
dt <- group_by(df, id) %>%
  # running total of time within each id
  mutate(sum_time = cumsum(time)) %>%
  # keep rows up to and including the first one where the running total reaches 25
  filter(lag(sum_time, default = 0) < 25) %>%
  # drop the helper column from the result
  select(-sum_time)

Calculate date diff using same date field column

I want to find the total sum of running minutes of a battery per month and year. For this I have the following condition:
If Battery.voltage < 50 then "Yes", otherwise "No".
Note: For calculating the total sum of mins, we can use the time stamp column, which contains day, month, year, hour and mins.
This is my data:
# Time.stamp Battery.voltage Condition
# 1 01/04/2016 00:00 51 No
# 2 01/04/2016 00:01 52 No
# 3 01/04/2016 00:02 45 Yes
# 4 01/04/2016 00:03 48 Yes
# 5 01/04/2016 00:04 49 Yes
# 6 01/04/2016 00:05 55 No
# 7 01/04/2016 00:06 54 No
# ...
structure(list(
Time.stamp = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 10L, 11L, 12L, 12L, 13L),
.Label = c("01/04/2016 00:00", "01/04/2016 00:01", "01/04/2016 00:02", "01/04/2016 00:03",
"01/04/2016 00:04", "01/04/2016 00:05", "01/04/2016 00:06", "01/04/2016 00:07",
"01/04/2016 00:08", "01/04/2016 00:09", "01/04/2016 00:11", "01/04/2016 00:12",
"01/04/2016 00:13"), class = "factor"),
Battery.voltage = c(51L, 52L, 45L, 48L, 49L, 55L, 54L, 52L, 51L, 49L, 48L, 47L, 45L, 50L, 51L),
Condition = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
.Label = c("No", "Yes"), class = "factor")),
.Names = c("Time.stamp", "Battery.voltage", "Condition"),
class = "data.frame", row.names = c(NA, -15L))
My expected output is something like this:
Month year Sum of mins running in battery
Jan 2016 350min
Feb 2016 450min
etc.
Unfortunately, your sample data is not very representative of your problem statement, as it only includes data for one day. It would have been beneficial to provide some code that generates random data for sufficient entries (i.e. dates).
That aside, you could adapt the following solution (here I assume your timestamp format is "DD/MM/YYYY HH:MM"):
library(dplyr)
df %>%
mutate(
Time.stamp = as.POSIXct(Time.stamp, format = "%d/%m/%Y %H:%M"),
byday = format(Time.stamp, "%d/%m/%Y"),
bymonth = format(Time.stamp, "%m/%Y"),
byyear = format(Time.stamp, "%Y")) %>%
group_by(byday) %>%
summarise(sum.running.in.mins = sum(Condition == "Yes"))
## A tibble: 1 x 2
# byday sum.running.in.mins
# <chr> <int>
#1 01/04/2016 7
Here we create columns byday, bymonth and byyear according to which you can group entries and calculate the sum of total running time per group. This counts rows, so it assumes one row per minute. In the example above, I calculate the total running time by day; to get the total running time per month, you would replace group_by(byday) with group_by(bymonth).
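As a rough illustration of grouping by month and year, here is a base R sketch on made-up timestamps that span two months. The `ts` and `voltage` values are hypothetical, and the sketch assumes one row per minute, so counting "Yes" rows gives minutes.

```r
# Hypothetical readings, one per minute, crossing a month boundary
ts <- c("31/03/2016 23:58", "31/03/2016 23:59",
        "01/04/2016 00:00", "01/04/2016 00:01", "01/04/2016 00:02")
voltage <- c(48, 51, 45, 49, 55)

# Same rule as in the question: running on battery when voltage < 50
on_battery <- voltage < 50

# Parse the timestamps and build a year-month key such as "2016-03"
parsed    <- as.POSIXct(ts, format = "%d/%m/%Y %H:%M", tz = "UTC")
month_key <- format(parsed, "%Y-%m")

# Minutes on battery per month (count of TRUE rows in each month)
minutes <- tapply(on_battery, month_key, sum)
minutes
```

Using a key that contains both year and month keeps, for example, April 2016 separate from April 2017.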

How to calculate percentage of missing data in a time series in R dplyr

In the following sample data and script,
how can I calculate the % of missing data between start date strtdt and end date enddt for each ID? What I want is to add the missing days with NA between strtdt and enddt separately for each ID, then calculate the % of NA.
I tried the following using dplyr, but with no luck. Any suggestion will be highly appreciated.
Note: I can achieve the same by calculating individually for each ID; however, that is not practical because I have more than 10000 IDs.
The ultimate goal is to get the % of NA between start date and end date for each ID. If dates are missing completely, then I have to add the missing dates with NA values.
library(dplyr)
df<-structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L
), .Label = c("xx", "xyz", "yy", "zz"), class = "factor"), Date = structure(c(8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 19L, 20L, 21L, 22L, 23L), .Label = c("1989-09-12",
"1989-09-13", "1989-09-14", "1989-09-19", "1989-09-23", "1990-01-12",
"1990-01-13", "1996-09-12", "1996-09-13", "1996-09-16", "1996-09-17",
"1996-09-18", "1996-09-19", "2000-09-12", "2000-09-13", "2000-11-10",
"2000-11-11", "2000-11-12", "2001-09-07", "2001-09-08", "2001-09-09",
"2001-09-10", "2001-09-11"), class = "factor"), val = c(3, 5,
9, 3, 5, 6, 8, 7, 9, 5, 3, 2, 8, 8, 5, 3, 2, 1, 5, 7, NA, NA,
NA, NA)), .Names = c("ID", "Date", "val"), row.names = c(NA,
-24L), class = "data.frame")
df$Date<-as.Date(df$Date,format="%Y-%m-%d")
df
df_mis<-df %>%
group_by(ID)%>%
dplyr::mutate(strtdt=min(Date),
enddt=max(Date))
df_mis
df_mis2<-df_mis %>%
group_by(ID) %>%
dplyr::do( data.frame(., Date1= seq(.$strtdt,.$enddt, by = '1 day')))
df_mis2
I assume from the sequence generation in the question's code, that the expected observations are one per day between the first observed date and last observed date per ID. Here's a clunky piece by piece calculation to count the % missing data.
1. Make a data frame of all expected dates for each ID
library(dplyr)
# df as in the question, but coerce Date column
df$Date <- as.Date(df$Date)
# Data frame with date ranges per id
ranges_df <- df %>%
  group_by(ID) %>%
  summarize(dmin = min(Date), dmax = max(Date))
# Data frame with IDs and date for every day expected.
alldays <- ranges_df %>%
  group_by(ID) %>%
  do(., data.frame(
    Date = seq(.$dmin, .$dmax, by = '1 day')
  )
  )
2. JOIN the expected dates table with the observed dates table.
imputed_df <- left_join(alldays, df)
3. Count NAs
imputed_df %>%
group_by(ID) %>%
summarize(total=n(),
missing=sum(is.na(val)),
percent_missing=missing/total*100
)
result:
# A tibble: 4 x 4
ID total missing percent_missing
<fctr> <int> <int> <dbl>
1 xx 8 2 25.00000
2 xyz 4 4 100.00000
3 yy 62 57 91.93548
4 zz 4380 4371 99.794
Assuming that NAs in the original data should be counted as missing data, count only the dates that have a non-NA val as observed.
Calculate the number of days between the min and max of dates (inclusive) as an intermediate variable.
Then, calculate the number of missing days as number of days minus number of observed dates, and convert that to a percentage.
df %>%
  group_by(ID) %>%
  mutate(numdays = as.numeric(max(Date) - min(Date)) + 1,
         pctmissing = 100 * (numdays - n_distinct(Date[!is.na(val)])) / numdays)
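As a quick check of the arithmetic, here is a base R sketch for a single ID, using the observed dates of ID "xx" from the question. It builds the expected daily sequence and measures the share of expected days with no observation.

```r
# Observed dates for ID "xx" in the question's data
observed <- as.Date(c("1996-09-12", "1996-09-13", "1996-09-16",
                      "1996-09-17", "1996-09-18", "1996-09-19"))

# Every day expected between the first and last observation (inclusive)
expected <- seq(min(observed), max(observed), by = "1 day")

# Share of expected days that were never observed, as a percentage
pct_missing <- 100 * mean(!(expected %in% observed))
pct_missing
```

Eight days are expected and two of them (14 and 15 September) are absent, which matches the 25% reported for "xx" in the result table above.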

Add rows when values in columns are equal in df

For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df with 'animal' duplicates all added together. For example multiple rows with the same animal in columns 1 and 2 will be put together. So for example the dataframe above would read:
cat cat 16
dog dog 7
cat dog 3 etc. etc... (those with different animals would be left as they are). Importantly the sum of 'number' in both dataframes would be the same.
My real df is >400K observations, so anything that anyone could recommend could cope with a large dataset would be great!
Thanks in advance.
One option would be to use data.table. Convert the "data.frame" to a "data.table" with setDT(). For rows where "animal.1" is equal to "animal.2", replace "number" with the sum of "number" after grouping by the two columns, and finally get the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table. We group by "animal.1", "animal.2", then replace the "number" with sum only when "animal.1" is equal to "animal.2", and get the unique rows
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()
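For comparison, a base R sketch with `aggregate` (a swapped-in approach, not the one used in the answers above). Note the assumption: it sums every repeated (animal.1, animal.2) pair, not only pairs where both animals match. On the question's sample data the result is the same, because no mixed pair occurs more than once.

```r
# Sample data from the question, written out as character columns
df <- data.frame(
  animal.1 = c("cat", "cat", "dog", "dog", "dog",
               "rabbit", "rabbit", "horse", "cat", "cat"),
  animal.2 = c("cat", "dog", "dog", "dog", "rabbit",
               "rabbit", "cat", "cat", "hamster", "cat"),
  number   = c(5, 3, 2, 5, 1, 4, 6, 7, 1, 11)
)

# One row per distinct pair, with number summed within the pair
totals <- aggregate(number ~ animal.1 + animal.2, data = df, FUN = sum)
totals

# The grand total is preserved, as the question requires
sum(totals$number) == sum(df$number)
```

If mixed pairs that repeat must stay as separate rows, the conditional replacement shown in the answers above is the closer fit.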
