Standard deviation based on group id from a data frame - r

This relates to one of y previous question. My end goal is to rank items based on the serial variable, which is derived from a standard deviation value for the start and end of the day. To simply summarise, I would like to calculate both of them (start and end day)  and then I would like to mark it with a 1 if the standard deviations are less than 0.5. What is the best way to do this in R?
Rule that i would like to implement in R:
=IF(AND(STDEV.S(D2,D3,D4)<0.5,STDEV.P(E2, E3, E4)<0.5),1,0)
Sample data structure:
Sample output:
Sample data
df<-structure(list(serial = c(11011209, 11011209, 11011209, 11011209,
11011209, 11011210, 11011210, 11011210, 11011210), pnum = c(1,
1, 1, 2, 2, 2, 2, 2, 2), Day = c("Tue", "Wed", "Thur", "Wed",
"Thur", "Mo", "Tue", "Wed", "Thur"), Start = c(7, 7, 7, 8, 8,
9.75, 6.5, 6.5, 6.5), End = c(14.5, 14.5, 14.5, 15.75, 15.75,
17.75, 14.75, 14.75, 8.75)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(serial = structure(list(), class = c("collector_double",
"collector")), pnum = structure(list(), class = c("collector_double",
"collector")), Day = structure(list(), class = c("collector_character",
"collector")), Start = structure(list(), class = c("collector_double",
"collector")), End = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))

After grouping by 'serial', 'num', create the 'Pattern', by checking the sd of 'Start', 'End' columns are less than 0.5 and connect the multiple expressions to a single one with &
library(dplyr)
df %>%
group_by(serial, pnum) %>%
mutate(Pattern = +(sd(Start) < 0.5 & sd(End) < 0.5)) %>%
ungroup
Or instead of specifying each column separately, use if_all
df %>%
group_by(serial, pnum) %>%
mutate(Pattern = +(if_all(c(Start, End), ~ sd(.) < 0.5))) %>%
ungroup

Related

Loop in tidyverse

I am learning tidyverse() and I am using a time-series dataset, and I selected columns that start with sec. What I would like basically to identify those values from columns that equal 123, keep these and have the rest replace with 0. But I don't know how to loop from sec1:sec4. Also how can I sum() per columns?
df1<-df %>%
select(starts_with("sec")) %>%
select(ifelse("sec1:sec4"==123, 1, 0))
Sample data:
structure(list(sec1 = c(1, 123, 1), sec2 = c(123, 1, 1), sec3 = c(123,
0, 0), sec4 = c(1, 123, 1)), spec = structure(list(cols = list(
sec1 = structure(list(), class = c("collector_double", "collector"
)), sec2 = structure(list(), class = c("collector_double",
"collector")), sec3 = structure(list(), class = c("collector_double",
"collector")), sec4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
I think you would have to use mutate and across to accomplish this. below you will mutate across each column starting with sec and then keep all values that are 123 and replace all others with 0.
df1<-df %>%
select(starts_with("sec")) %>%
mutate(across(starts_with("sec"),.fns = function(x){ifelse(x == 123,x,0)}))

Episode extraction

I totally lost on this. I am using time-use data I would like to extract episodes based on time steps. So basically, individuals are asked to take 3 measurements at the same time denoted by 3 variables a_1, b_1 and c_1. In some cases, they reported the same measurement at a_2, b_2 and c_2. I would like to extract the length (the minimum length is 2) of the same measurements based on time-steps the frequency of the occurrence as well the start and end times.
For example, below the highlighted red and blue cells denote same measurement based on time-steps and not per id.
Possible output:
Sample data:
structure(list(a_1 = c(100, 100, NA), a_2 = c(101, 101, NA),
a_3 = c(100, 100, NA), a_4 = c(1234, 1234, NA), b_1 = c(4567,
100, NA), b_2 = c(101, 101, NA), b_3 = c(100, 100, NA), b_4 = c(1234,
1234, NA), c_1 = c(3456, 100, NA), c_2 = c(101, 101, NA),
c_3 = c(100, 100, NA), c_4 = c(1234, 1234, NA)), spec = structure(list(
cols = list(a_1 = structure(list(), class = c("collector_double",
"collector")), a_2 = structure(list(), class = c("collector_double",
"collector")), a_3 = structure(list(), class = c("collector_double",
"collector")), a_4 = structure(list(), class = c("collector_double",
"collector")), b_1 = structure(list(), class = c("collector_double",
"collector")), b_2 = structure(list(), class = c("collector_double",
"collector")), b_3 = structure(list(), class = c("collector_double",
"collector")), b_4 = structure(list(), class = c("collector_double",
"collector")), c_1 = structure(list(), class = c("collector_double",
"collector")), c_2 = structure(list(), class = c("collector_double",
"collector")), c_3 = structure(list(), class = c("collector_double",
"collector")), c_4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
Using dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
mutate(id = row_number()) %>%
gather(key, value, -id) %>%
separate(key, into = c("Msrm", "TS"), sep = "_") %>%
group_by(id, TS) %>%
filter(n_distinct(value)==1 & !is.na(value)) %>%
select(-Msrm) %>%
distinct() %>%
arrange(id) %>%
pivot_wider(id, TS) %>%
ungroup() %>%
select(sort(current_vars()), -id) %>%
mutate(Length = rowSums(!is.na(.)),
Interval = apply(., 1, function(x) paste(which(!is.na(x)), collapse = ", "))) %>%
unite("Measurement", 1:4, sep = ",", na.rm = T) %>%
group_by(across(everything())) %>%
summarize(Occurance = n())
which results to:
Measurement Length Interval Occurance
<chr> <dbl> <chr> <int>
1 101,100,1234 3 1, 2, 3 1
2 101,100,1234,100 4 1, 2, 3, 4 1
Edit:
First things first we will need an id. Since we are going to mutate our data a lot you will lose sight of your ordering.
Next, we gather. This is the same as pivot_longer and melt (from reshape2 or data.table). It creates a tidy table with only 1 column of values and all of the old column names together in a new id column.
Now that we have the columnnames in a column we can easily manipulate them, so we separate them into Msrm (a, b, c) and TS (1, 2, 3, 4).
Since we want to see if for each id if all measurements are the same across a single TS we will group by id and TS. With these groups we can filter for n_distinct==1 to only keep timesteps where all values are the exact same, ignoring NA.
At this point there is no need anymore for the Msrm column so we remove it and also we remove the duplicates with the distinct function. Lastly we re-arrange the table by id.
Next we do pivot_wider to create a value column for each unique TS (1, 2, 3, 4), a bit like our original table. We also remove the id column since there is no use for it anymore at this point and sort the TS to make sure they are in the right order (1, 2, 3, 4).
At this point the table looks like this:
# A tibble: 2 x 4
`1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl>
1 NA 101 100 1234
2 100 101 100 1234
From here on we calculate the results. First we do the rowSum of a boolean (!is.na()), this essentially counts the number of columns in a row that are not equal to NA.
Then we use an apply to find the column names (which) attached to these !is.na() values. We wrap it in paste(..., collapse = ", ") to print all the column names into a single cell.
After that we can use unite to concatenate all the value columns into one while simultaneously removing the original columns. Right now it assumed that you have 4 timesteps (1:4), you could pretty easily make it dynamic if you so desire.
Lastly we group by all columns and use n() to see if any row is identical to another and count the occurrences in case they are.

Making a boxplot based on continuous values

Using 2 classifications, I want to create a boxplot to illustrate the variation in starting and ending times. How I could do this with ggplot?
Data structure:
Desired output:
Sample data:
structure(list(day = c("Mo", "Tue", "Wed", "Thur", "Fri", "Mo",
"Tue", "Wed", "Thur", "Fri"), start_time1 = c(9.75, 6.5, 6.5,
6.5, 6.5, 8.5, 8.5, 8.5, 8.5, 8.75), end_time1 = c(14.75, 14.75,
8.75, 8.75, 14.75, 17.75, 17.25, 17.25, 16.5, 17.5), Pattern = c(0,
0, 0, 0, 0, 1, 1, 1, 1, 1)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(day = structure(list(), class = c("collector_character",
"collector")), start_time1 = structure(list(), class = c("collector_double",
"collector")), end_time1 = structure(list(), class = c("collector_double",
"collector")), Pattern = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Is not very clear what you want to do, so I'm trying to guess:
Personally, I would put the day on the x-axis, the times on the y-axis, using the colors to differentiate Patterns and facet to differentiate the starting or ending times
library(ggplot2)
m <- melt(dff,id.vars=c("day","Pattern"))
m$Pattern <- as.factor(m$Pattern)
ggplot(m,aes(x=day,y=value,fill=Pattern))+
geom_boxplot()+
facet_wrap(~variable)+
labs(y="times")
the output depends on the length and variability of the data. With the sample that you provided the output is not so informative. Trying to inject some randomness in the data the plot becomes more useful:

Regression in R on panel data

I would like to run a linear regression on panel data. Below my code so far but, I don't understand why is not returning the fit and rsq. Any suggestion?
Sample code:
for(i in names(df))
{
if(is.numeric(df[3,i])) ##if row 3 is numeric, the entire column is
{
fit <- lm(df[3,i] ~ Gender, data=df) #does a regression for each column in my csv file against my independent variable 'etch'
rsq <- summary(fit)$r.squared
}
}
Data structure
Sample data:
df<-structure(list(id = c(1, 1, 2, 2, 2), id1 = c(1, 2, 1, 2, 3),
a1 = c(5, 8, 7, 6, 3), a2 = c(1, 4, 3, 10, 5), a3 = c(2,
34, 3, 12, 6), a4 = c(9, 2, 3, 12, 7), a5 = c(0, 0, 0, 7,
8), a6 = c(7, 7, 0, 0, 9), a7 = c(5, 8, 7, 6, 0), a8 = c(1,
4, 3, 10, 3), a9 = c(2, 34, 3, 12, 3), a10 = c(9, 2, 3, 12,
3), Gender = c(1, 2, 1, 1, 2)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(id = structure(list(), class = c("collector_double",
"collector")), id1 = structure(list(), class = c("collector_double",
"collector")), a1 = structure(list(), class = c("collector_double",
"collector")), a2 = structure(list(), class = c("collector_double",
"collector")), a3 = structure(list(), class = c("collector_double",
"collector")), a4 = structure(list(), class = c("collector_double",
"collector")), a5 = structure(list(), class = c("collector_double",
"collector")), a6 = structure(list(), class = c("collector_double",
"collector")), a7 = structure(list(), class = c("collector_double",
"collector")), a8 = structure(list(), class = c("collector_double",
"collector")), a9 = structure(list(), class = c("collector_double",
"collector")), a10 = structure(list(), class = c("collector_double",
"collector")), Gender = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
To fit a linear regression on each column you can use lapply. We use reformulate to create a formula object from the column name and use it in lapply. R-squared value can be extracted from summary of each model.
cols <- grep('a\\d+', names(df), value = TRUE)
cols
#[1] "a1" "a2" "a3" "a4" "a5" "a6" "a7" "a8" "a9" "a10"
lapply(cols, function(x) {
lm(reformulate('Gender', x), df)
}) -> fit
r.squared <- sapply(fit, function(x) summary(x)$r.squared)

Construct a loop based on multiple conditions in a column R

I have a df attached and I would like to create a loop that would apply a specific sequence based on conditions in column "x9". I would like to be able to set the sequence myself so I can try different sequences for this data frame, I will explain more below.
I have a df of losses and wins for an algorithm. On the first instance of a win I want to take the value in "x9" and divide it by the sequence value. I want to keep iterating through the sequence values until a loss is achieved. Once a loss is achieved the sequence will restart.
Risk control is the column I am attempting to create, it takes values from "x9" and divides them by the sequence value. I want to have the ability to alter the sequence values.
In short I need assistance in:
Constructing a sequence to apply to my df, would like to be able to alter this to try different sequences;
Take values in "x9" and create a new column that would apply the sequence values set. The sequence is taking the value in "x9" and dividing it by the sequence number;
Construct a loop to iterate through the entire df to apply this over all of the values.
I would appreciate any help / insight anyone can provide.
structure(list(x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), x2 = c("2016.01.04 01:05",
"2016.01.04 01:12", "2016.01.04 01:13", "2016.01.04 01:17", "2016.01.04 01:20",
"2016.01.04 01:23", "2016.01.04 01:25", "2016.01.04 01:30", "2016.01.04 01:31",
"2016.01.04 01:59"), x3 = c("buy", "close", "buy", "close", "buy",
"close", "buy", "t/p", "buy", "close"), x4 = c(1, 1, 2, 2, 3,
3, 4, 4, 5, 5), x5 = c(8.46, 8.46, 8.6, 8.6, 8.69, 8.69, 8.83,
8.83, 9, 9), x6 = c(1.58873, 1.58955, 1.5887, 1.58924, 1.58862,
1.58946, 1.58802, 1.58902, 1.58822, 1.58899), x7 = c(1.57873,
1.57873, 1.5787, 1.5787, 1.57862, 1.57862, 1.57802, 1.57802,
1.57822, 1.57822), x8 = c(1.58973, 1.58973, 1.5897, 1.5897, 1.58962,
1.58962, 1.58902, 1.58902, 1.58922, 1.58922), x9 = c("$0.00",
"$478.69", "$0.00", "$320.45", "$0.00", "$503.70", "$0.00", "$609.30",
"$0.00", "$478.19"), x10 = c("$30,000.00", "$30,478.69", "$30,478.69",
"$30,799.14", "$30,799.14", "$31,302.84", "$31,302.84", "$31,912.14",
"$31,912.14", "$32,390.33"), `Risk Control` = c(NA, "$478.69",
NA, "$320.45", NA, "$251.85", NA, "$304.65", NA, "$159.40"),
Sequence = c(NA, 1, NA, 1, NA, 2, NA, 2, NA, 3)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(
cols = list(x1 = structure(list(), class = c("collector_double",
"collector")), x2 = structure(list(), class = c("collector_character",
"collector")), x3 = structure(list(), class = c("collector_character",
"collector")), x4 = structure(list(), class = c("collector_double",
"collector")), x5 = structure(list(), class = c("collector_double",
"collector")), x6 = structure(list(), class = c("collector_double",
"collector")), x7 = structure(list(), class = c("collector_double",
"collector")), x8 = structure(list(), class = c("collector_double",
"collector")), x9 = structure(list(), class = c("collector_character",
"collector")), x10 = structure(list(), class = c("collector_character",
"collector")), `Risk Control` = structure(list(), class = c("collector_character",
"collector")), ...12 = structure(list(), class = c("collector_logical",
"collector")), Sequence = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"))
Maybe there are better ways but I believe the following function does what the question asks for. It takes two arguments, a vector x to be processed and a sequence Seq. The return value is the risk control described in the question.
constructRisk <- function(x, Seq){
stopifnot(length(x) > 0)
stopifnot(length(Seq) > 0)
n <- length(x)
m <- length(Seq)
y <- numeric(n)
iSeq <- 1L
for(i in seq_len(n)){
y[i] <- x[i]/Seq[iSeq]
if(!is.na(y[i])){
if(y[i] < 0) iSeq <- 0L
}
iSeq <- iSeq + 1L
if(iSeq > m) iSeq <- 1L
}
y
}
Note that since the posted data has column x9 with dollar signs and is, therefore, of class "character", the test below is on a numeric version of it, X9. And the same goes for the risk control column, as posted.
X9 <- as.numeric(sub("\\$", "", df1$x9))
RskCntr <- as.numeric(sub("\\$", "", df1$`Risk Control`))
RC <- constructRisk(X9, df1$Sequence)
all.equal(RskCntr, RC)
#[1] "Mean relative difference: 2.091175e-05"
all.equal(RskCntr, round(RC, 2))
#[1] TRUE

Resources