Episode extraction in R

I am totally lost on this. I am working with time-use data and would like to extract episodes based on time steps. Basically, individuals are asked to take 3 measurements at the same time, denoted by the 3 variables a_1, b_1 and c_1. In some cases they reported the same measurement again at a_2, b_2 and c_2. I would like to extract the length of runs of identical measurements across time steps (the minimum length is 2), the frequency of their occurrence, as well as the start and end times.
For example, a "same measurement" means that the same value is reported for a, b and c at a given time step; matches are determined per time step, not per id. A possible output would list, for each such episode, the measurement values, the length, the interval of time steps, and how often it occurs.
Sample data:
df <- structure(list(a_1 = c(100, 100, NA), a_2 = c(101, 101, NA),
a_3 = c(100, 100, NA), a_4 = c(1234, 1234, NA), b_1 = c(4567,
100, NA), b_2 = c(101, 101, NA), b_3 = c(100, 100, NA), b_4 = c(1234,
1234, NA), c_1 = c(3456, 100, NA), c_2 = c(101, 101, NA),
c_3 = c(100, 100, NA), c_4 = c(1234, 1234, NA)), spec = structure(list(
cols = list(a_1 = structure(list(), class = c("collector_double",
"collector")), a_2 = structure(list(), class = c("collector_double",
"collector")), a_3 = structure(list(), class = c("collector_double",
"collector")), a_4 = structure(list(), class = c("collector_double",
"collector")), b_1 = structure(list(), class = c("collector_double",
"collector")), b_2 = structure(list(), class = c("collector_double",
"collector")), b_3 = structure(list(), class = c("collector_double",
"collector")), b_4 = structure(list(), class = c("collector_double",
"collector")), c_1 = structure(list(), class = c("collector_double",
"collector")), c_2 = structure(list(), class = c("collector_double",
"collector")), c_3 = structure(list(), class = c("collector_double",
"collector")), c_4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))

Using dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number()) %>%
  gather(key, value, -id) %>%
  separate(key, into = c("Msrm", "TS"), sep = "_") %>%
  group_by(id, TS) %>%
  filter(n_distinct(value) == 1 & !is.na(value)) %>%
  select(-Msrm) %>%
  distinct() %>%
  arrange(id) %>%
  pivot_wider(id_cols = id, names_from = TS, values_from = value) %>%
  ungroup() %>%
  select(all_of(sort(setdiff(names(.), "id")))) %>%   # sort the TS columns, drop id
  mutate(Length = rowSums(!is.na(.)),
         Interval = apply(., 1, function(x) paste(which(!is.na(x)), collapse = ", "))) %>%
  unite("Measurement", 1:4, sep = ",", na.rm = TRUE) %>%
  group_by(across(everything())) %>%
  summarize(Occurrence = n())
which results in:
  Measurement      Length Interval   Occurrence
  <chr>             <dbl> <chr>           <int>
1 100,101,100,1234      4 1, 2, 3, 4          1
2 101,100,1234          3 2, 3, 4             1
Edit:
First things first: we need an id. Since we are going to reshape the data a lot, we would otherwise lose sight of which row each value came from.
Next we gather. This is the same idea as pivot_longer and melt (from reshape2 or data.table): it creates a tidy table with a single column of values and all of the old column names collected in a new key column.
Now that we have the column names in a column we can easily manipulate them, so we separate them into Msrm (a, b, c) and TS (1, 2, 3, 4).
Since we want to check, for each id, whether all measurements are the same within a single TS, we group by id and TS. With these groups we can filter for n_distinct(value) == 1 to keep only the time steps where all values are exactly the same, ignoring NA.
At this point there is no further need for the Msrm column, so we remove it, and we also remove duplicates with distinct(). Lastly we re-arrange the table by id.
Next we pivot_wider to create a value column for each unique TS (1, 2, 3, 4), a bit like the original table. We also remove the id column, since it is of no further use at this point, and sort the TS columns to make sure they are in the right order (1, 2, 3, 4).
At this point the table looks like this:
# A tibble: 2 x 4
`1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl>
1 NA 101 100 1234
2 100 101 100 1234
From here on we calculate the results. First we take rowSums of a boolean matrix (!is.na(.)); this essentially counts the number of columns in a row that are not NA.
Then we use apply with which to find the column positions of these non-NA values, wrapped in paste(..., collapse = ", ") to print them all in a single cell.
After that we use unite to concatenate all the value columns into one while simultaneously removing the original columns. Right now it is assumed that you have 4 time steps (1:4); you could fairly easily make this dynamic if you so desire (see the sketch below).
Lastly we group by all columns and use n() to see whether any row is identical to another, counting the occurrences when they are.
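For instance, a minimal sketch of that dynamic variant (my own illustration, not part of the original answer), assuming wide holds the intermediate table shown above (one column per time step, id already dropped):
wide %>%
  mutate(Length = rowSums(!is.na(.)),
         Interval = apply(., 1, function(x) paste(which(!is.na(x)), collapse = ", "))) %>%
  # all_of(names(wide)) picks up however many TS columns exist, instead of hard-coding 1:4
  unite("Measurement", all_of(names(wide)), sep = ",", na.rm = TRUE) %>%
  group_by(across(everything())) %>%
  summarize(Occurrence = n(), .groups = "drop")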

Related

Loop in tidyverse

I am learning the tidyverse and I am using a time-series dataset from which I selected the columns that start with sec. Basically, I would like to identify the values in those columns that equal 123, keep them, and replace the rest with 0. But I don't know how to loop over sec1:sec4. Also, how can I sum() per column?
df1 <- df %>%
  select(starts_with("sec")) %>%
  select(ifelse("sec1:sec4" == 123, 1, 0))
Sample data:
df <- structure(list(sec1 = c(1, 123, 1), sec2 = c(123, 1, 1), sec3 = c(123,
0, 0), sec4 = c(1, 123, 1)), spec = structure(list(cols = list(
sec1 = structure(list(), class = c("collector_double", "collector"
)), sec2 = structure(list(), class = c("collector_double",
"collector")), sec3 = structure(list(), class = c("collector_double",
"collector")), sec4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
I think you could use mutate() and across() to accomplish this. Below, we mutate across each column starting with sec, keep all values that are 123, and replace all others with 0.
df1 <- df %>%
  select(starts_with("sec")) %>%
  mutate(across(starts_with("sec"), .fns = function(x) ifelse(x == 123, x, 0)))
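The question also asks how to sum per column; a short sketch (my addition, assuming df1 is the recoded data frame created above):
library(dplyr)
# column totals of the recoded values (every value that is not 123 became 0);
# with the sample data each sec column sums to 123
df1 %>%
  summarise(across(starts_with("sec"), sum))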

Standard deviation based on group id from a data frame

This relates to one of my previous questions. My end goal is to rank items based on the serial variable, which is derived from a standard deviation value for the start and end of the day. To summarise briefly, I would like to calculate the standard deviation of both Start and End and then mark a group with a 1 if both standard deviations are less than 0.5. What is the best way to do this in R?
The rule (in Excel form) that I would like to implement in R:
=IF(AND(STDEV.S(D2,D3,D4)<0.5,STDEV.P(E2, E3, E4)<0.5),1,0)
Sample data:
df<-structure(list(serial = c(11011209, 11011209, 11011209, 11011209,
11011209, 11011210, 11011210, 11011210, 11011210), pnum = c(1,
1, 1, 2, 2, 2, 2, 2, 2), Day = c("Tue", "Wed", "Thur", "Wed",
"Thur", "Mo", "Tue", "Wed", "Thur"), Start = c(7, 7, 7, 8, 8,
9.75, 6.5, 6.5, 6.5), End = c(14.5, 14.5, 14.5, 15.75, 15.75,
17.75, 14.75, 14.75, 8.75)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(serial = structure(list(), class = c("collector_double",
"collector")), pnum = structure(list(), class = c("collector_double",
"collector")), Day = structure(list(), class = c("collector_character",
"collector")), Start = structure(list(), class = c("collector_double",
"collector")), End = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
After grouping by 'serial' and 'pnum', create the 'Pattern' column by checking whether the sd of the 'Start' and 'End' columns is less than 0.5, connecting the two conditions into a single expression with &:
library(dplyr)
df %>%
  group_by(serial, pnum) %>%
  mutate(Pattern = +(sd(Start) < 0.5 & sd(End) < 0.5)) %>%
  ungroup
Or instead of specifying each column separately, use if_all
df %>%
  group_by(serial, pnum) %>%
  mutate(Pattern = +(if_all(c(Start, End), ~ sd(.) < 0.5))) %>%
  ungroup
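As a side note (my addition, not part of the original answers): the unary + simply coerces the logical result to 0/1, and the Excel rule uses STDEV.P (population sd) for End while R's sd() is the sample sd (STDEV.S). A more explicit sketch covering both points:
library(dplyr)
# population standard deviation, matching Excel's STDEV.P -- only needed if
# that distinction actually matters for your rule
sd_p <- function(x) sqrt(mean((x - mean(x))^2))
df %>%
  group_by(serial, pnum) %>%
  mutate(Pattern = as.integer(sd(Start) < 0.5 & sd_p(End) < 0.5)) %>%
  ungroup()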

Return two or more max equal values

My intent is to return the time-of-day column(s) where the most measurements were taken. As an example, I found the highest numbers here to be in 16:10 and 16:20. I am using the which.max(col) function, but it returns only one occurrence, not all of them. What is the best way to return two (or more) equal maximum values?
Update: if I sum up per column with colSums(), how can I also return the max values (e.g. 16:10 and 16:20)?
Desired output: 16:10 16:20
Sample data:
df <- structure(list(Duration = c(10, 20, 30, 40, 50, 60), `16:00` = c(1,
0, 0, 0, 1, 0), `16:10` = c(0, 0, 0, 2, 1, 0), `16:20` = c(0,
0, 0, 2, 1, 0)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L), spec = structure(list(cols = list(
Duration = structure(list(), class = c("collector_double",
"collector")), `16:00` = structure(list(), class = c("collector_double",
"collector")), `16:10` = structure(list(), class = c("collector_double",
"collector")), `16:20` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Compare the max value with the values in the dataframe and return the column names which have at least one occurrence of the max value.
tmp <- df[-1]
names(tmp)[colSums(tmp == max(tmp)) > 0]
#[1] "16:10" "16:20"
For the update this should work -
tmp <- colSums(df[-1])
names(tmp)[tmp == max(tmp)]
#[1] "16:10" "16:20"
Does this work:
colnames(df[-1])[apply(df[-1], 2, function(x) any(x == max(df[-1])))]
[1] "16:10" "16:20"
This can also be done with a combination of base R and the tidyverse:
library(dplyr)
df[-1] %>%
  select(where(~ sum(.x) == max(colSums(df[-1])))) %>%
  names()
[1] "16:10" "16:20"
We can use base R:
v1 <- sapply(df[-1], max)
names(which(v1 == max(v1)))
#[1] "16:10" "16:20"
Or with the tidyverse:
library(dplyr)
library(tidyr)
df %>%
  summarise(across(-Duration, max)) %>%
  pivot_longer(everything()) %>%
  filter(value == max(value)) %>%
  pull(name)
[1] "16:10" "16:20"

Construct a loop based on multiple conditions in a column R

I have a df attached, and I would like to create a loop that applies a specific sequence based on conditions in column "x9". I would like to be able to set the sequence myself so that I can try different sequences on this data frame; I will explain more below.
I have a df of losses and wins for an algorithm. On the first instance of a win I want to take the value in "x9" and divide it by the sequence value. I want to keep iterating through the sequence values until a loss is achieved. Once a loss is achieved, the sequence restarts.
Risk Control is the column I am attempting to create; it takes values from "x9" and divides them by the sequence value. I want to have the ability to alter the sequence values.
In short, I need assistance with:
Constructing a sequence to apply to my df; I would like to be able to alter this to try different sequences;
Taking values in "x9" and creating a new column that applies the sequence values that were set, i.e. dividing the value in "x9" by the sequence number;
Constructing a loop to iterate through the entire df so this is applied to all of the values.
I would appreciate any help / insight anyone can provide.
df1 <- structure(list(x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), x2 = c("2016.01.04 01:05",
"2016.01.04 01:12", "2016.01.04 01:13", "2016.01.04 01:17", "2016.01.04 01:20",
"2016.01.04 01:23", "2016.01.04 01:25", "2016.01.04 01:30", "2016.01.04 01:31",
"2016.01.04 01:59"), x3 = c("buy", "close", "buy", "close", "buy",
"close", "buy", "t/p", "buy", "close"), x4 = c(1, 1, 2, 2, 3,
3, 4, 4, 5, 5), x5 = c(8.46, 8.46, 8.6, 8.6, 8.69, 8.69, 8.83,
8.83, 9, 9), x6 = c(1.58873, 1.58955, 1.5887, 1.58924, 1.58862,
1.58946, 1.58802, 1.58902, 1.58822, 1.58899), x7 = c(1.57873,
1.57873, 1.5787, 1.5787, 1.57862, 1.57862, 1.57802, 1.57802,
1.57822, 1.57822), x8 = c(1.58973, 1.58973, 1.5897, 1.5897, 1.58962,
1.58962, 1.58902, 1.58902, 1.58922, 1.58922), x9 = c("$0.00",
"$478.69", "$0.00", "$320.45", "$0.00", "$503.70", "$0.00", "$609.30",
"$0.00", "$478.19"), x10 = c("$30,000.00", "$30,478.69", "$30,478.69",
"$30,799.14", "$30,799.14", "$31,302.84", "$31,302.84", "$31,912.14",
"$31,912.14", "$32,390.33"), `Risk Control` = c(NA, "$478.69",
NA, "$320.45", NA, "$251.85", NA, "$304.65", NA, "$159.40"),
Sequence = c(NA, 1, NA, 1, NA, 2, NA, 2, NA, 3)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(
cols = list(x1 = structure(list(), class = c("collector_double",
"collector")), x2 = structure(list(), class = c("collector_character",
"collector")), x3 = structure(list(), class = c("collector_character",
"collector")), x4 = structure(list(), class = c("collector_double",
"collector")), x5 = structure(list(), class = c("collector_double",
"collector")), x6 = structure(list(), class = c("collector_double",
"collector")), x7 = structure(list(), class = c("collector_double",
"collector")), x8 = structure(list(), class = c("collector_double",
"collector")), x9 = structure(list(), class = c("collector_character",
"collector")), x10 = structure(list(), class = c("collector_character",
"collector")), `Risk Control` = structure(list(), class = c("collector_character",
"collector")), ...12 = structure(list(), class = c("collector_logical",
"collector")), Sequence = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"))
Maybe there are better ways but I believe the following function does what the question asks for. It takes two arguments, a vector x to be processed and a sequence Seq. The return value is the risk control described in the question.
constructRisk <- function(x, Seq) {
  stopifnot(length(x) > 0)
  stopifnot(length(Seq) > 0)
  n <- length(x)
  m <- length(Seq)
  y <- numeric(n)
  iSeq <- 1L
  for (i in seq_len(n)) {
    # divide the current value by the current sequence element
    y[i] <- x[i] / Seq[iSeq]
    # a negative result (a loss) restarts the sequence at its first element
    if (!is.na(y[i])) {
      if (y[i] < 0) iSeq <- 0L
    }
    # move on to the next sequence element (after a reset this is position 1),
    # wrapping around at the end of the sequence
    iSeq <- iSeq + 1L
    if (iSeq > m) iSeq <- 1L
  }
  y
}
Note that since the posted data has column x9 with dollar signs and is, therefore, of class "character", the test below is on a numeric version of it, X9. And the same goes for the risk control column, as posted.
X9 <- as.numeric(sub("\\$", "", df1$x9))
RskCntr <- as.numeric(sub("\\$", "", df1$`Risk Control`))
RC <- constructRisk(X9, df1$Sequence)
all.equal(RskCntr, RC)
#[1] "Mean relative difference: 2.091175e-05"
all.equal(RskCntr, round(RC, 2))
#[1] TRUE
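As an aside, readr::parse_number() can do these conversions in one step and also copes with thousands separators such as those in x10; a small sketch (my suggestion, not part of the original answer):
library(readr)
# numeric versions of the dollar-formatted character columns
X9      <- parse_number(df1$x9)               # "$478.69" -> 478.69
RskCntr <- parse_number(df1$`Risk Control`)   # NA entries stay NA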

Sum months from January to latest available month for current year, and previous year

I am trying to find a way to compute a year-to-date sum for the current year and the previous year.
For example, for the current year I would sum from January to the latest available month, and then the same months for the previous year, January to July in this case.
If the latest available month is July 2020, then I want January to July 2019 summed and January to July 2020 summed.
I have cut the script back to the bare bones and added the dput():
library(tidyverse)
#Get the data for table 1
data1 <- read_csv("test-table.csv")
data1
dput(data1)
data1 <- data1 %>%
  select(DATE, TC, VALUE)
dput(data1)
structure(list(DATE = structure(c(18444, 18444, 18444), class = "Date"),
TC = c("Canada", "Canada", "Canada"), VALUE = c(141772, 113414,
100351)), row.names = c(NA, -3L), spec = structure(list(cols = list(
REF_DATE = structure(list(), class = c("collector_character",
"collector")), GEO = structure(list(), class = c("collector_character",
"collector")), DGUID = structure(list(), class = c("collector_character",
"collector")), `Traveller characteristics` = structure(list(), class = c("collector_character",
"collector")), UOM = structure(list(), class = c("collector_character",
"collector")), UOM_ID = structure(list(), class = c("collector_double",
"collector")), SCALAR_FACTOR = structure(list(), class = c("collector_character",
"collector")), SCALAR_ID = structure(list(), class = c("collector_double",
"collector")), VECTOR = structure(list(), class = c("collector_character",
"collector")), COORDINATE = structure(list(), class = c("collector_double",
"collector")), VALUE = structure(list(), class = c("collector_double",
"collector")), STATUS = structure(list(), class = c("collector_logical",
"collector")), SYMBOL = structure(list(), class = c("collector_logical",
"collector")), TERMINATED = structure(list(), class = c("collector_logical",
"collector")), DECIMALS = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Once the data is tidy, it's a simple matter of filtering out the months you're not interested in, grouping by year and tc, and summarizing to get the total of the values for each tc.
To get a dynamic month list, take the unique month values from months_df, which is filtered to contain only the latest year (max). We can then use those months as the filter for the final summary df.
library(tidyverse)
# tidy up the data (the answer assumes the pasted data arrived as a single
# tab-separated column named `DATE TC VALUE`)
tidy_data <- data1 %>%
  separate(`DATE TC VALUE`, into = c("date", "tc", "value"), sep = "\t") %>%
  separate(date, into = c("year", "month"), sep = "-") %>%
  mutate_at(.vars = c("year", "month", "value"), .funs = as.integer)
# filter for the latest year
months_df <- tidy_data %>%
  filter(year == max(tidy_data$year))
# use months_df to feed the filter
summary <- tidy_data %>%
  filter(month %in% unique(months_df$month)) %>%
  group_by(year, tc) %>%
  summarize(total = sum(value, na.rm = TRUE))
year tc total
<int> <chr> <int>
1 2019 TC-1 18271577
2 2019 TC-2 14094089
3 2019 TC-3 9415440
4 2020 TC-1 4340588
5 2020 TC-2 3431912
6 2020 TC-3 2551697
Edit: Updated to have a dynamic month list based on the latest year. I'm sure there is a more elegant way, but this should work.
d <- your_data   # your data: a data frame with a Date column `date` plus tc and value columns
# requires the tidyverse (dplyr, tidyr, purrr) plus lubridate and plyr
d %>%
  group_by(date, tc) %>%
  dplyr::summarise(value = sum(value)) %>%
  mutate(month = month(date, label = TRUE),
         year = year(date)) %>%
  ungroup %>%
  select(month, year, tc, value) %>%
  split(.$tc) %>%
  map(~ spread(.x, year, value, fill = 0) %>%
        mutate(across(where(is.numeric), cumsum))) %>%
  plyr::ldply(rbind) %>%
  select(month, tc, `2018`, `2019`, `2020`) %>%
  filter(month == month(Sys.Date(), label = TRUE))
month tc 2018 2019 2020
1 Oct TC-1 NA 28451012 4340588
2 Oct TC-2 7518039 21795738 3431912
3 Oct TC-3 4930193 14477761 2551697
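If the data is already in the DATE / TC / VALUE form shown in the dput (rather than a single pasted column), a compact year-to-date comparison could look like this sketch (my own addition, assuming lubridate for the date helpers):
library(dplyr)
library(lubridate)
latest_month <- month(max(data1$DATE))   # latest month available in the data
data1 %>%
  filter(month(DATE) <= latest_month) %>%   # keep January .. latest month, for every year
  group_by(year = year(DATE), TC) %>%
  summarise(total = sum(VALUE, na.rm = TRUE), .groups = "drop")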

Resources