Calculate difference between two values in grouped sequences - r

This is a follow-up question for this post:
Loop through dataframe in R and measure time difference between two values
I already got excellent help with the following code to calculate the time difference in minutes between a certain Stimuli and the next Response:
library(dplyr)

df$Date <- as.POSIXct(strptime(df$Date, "%d.%m.%Y %H:%M"))
df %>%
  arrange(User, Date) %>%
  mutate(difftime = difftime(lead(Date), Date, units = "mins")) %>%
  group_by(User) %>%
  filter((StimuliA == 1 | StimuliB == 1) & lead(Responses) == 1)
Dataset:
structure(list(User = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L), Date = c("24.11.2015 20:39",
"25.11.2015 11:42", "11.01.2016 22:46", "26.11.2015 22:42", "04.03.2016 05:45",
"24.11.2015 13:13", "25.11.2015 13:59", "27.11.2015 12:18", "28.05.2016 06:49",
"06.07.2016 09:46", "03.12.2015 09:32", "07.12.2015 08:18", "08.12.2015 19:40",
"08.12.2015 19:40", "22.12.2015 08:50", "22.12.2015 08:52", "22.12.2015 08:52",
"22.12.2015 20:46"), StimuliA = c(1L, 0L, 0L, 1L, 1L, 1L, 0L,
1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), StimuliB = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L), Responses = c(0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, 1L, 1L, 1L, 1L)), .Names = c("User", "Date", "StimuliA",
"StimuliB", "Responses"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -18L), spec = structure(list(cols = structure(list(
User = structure(list(), class = c("collector_integer", "collector"
)), Date = structure(list(), class = c("collector_character",
"collector")), StimuliA = structure(list(), class = c("collector_integer",
"collector")), StimuliB = structure(list(), class = c("collector_integer",
"collector")), Responses = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("User", "Date", "StimuliA", "StimuliB",
"Responses")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
Goal/Question: The lead() function helped to determine the time difference between a Stimuli == 1 (A or B) and the next response, sorted by date/time (Responses == 1). How would I change that code to find the time difference between Stimuli A or B and the LAST response in the sequence, that is, the last response before the next Stimuli occurs?
Desired output:
User  Date              StimuliA  StimuliB  Responses  time diff  Seq_ID
1     24.11.2015 20:39  1         0         0                     1_1_0
1     25.11.2015 11:42  0         0         1                     1_1_0
1     11.01.2016 22:46  0         0         1          69247      1_1_0
2     26.11.2015 22:42  1         0         0                     2_1_0
2     04.03.2016 05:45  0         1         0                     2_1_1
3     24.11.2015 13:13  1         0         0                     3_1_0
3     25.11.2015 13:59  0         0         1          1486       3_1_0
3     27.11.2015 12:18  1         0         0                     3_2_0
3     28.05.2016 06:49  0         0         1                     3_2_0
3     06.07.2016 09:46  0         0         1          319528     3_2_0
4     03.12.2015 09:32  1         0         0                     4_1_0
4     07.12.2015 08:18  1         0         0                     4_2_0
4     08.12.2015 19:40  0         0         1          2122       4_1_0
4     08.12.2015 19:40  0         1         0                     4_2_1
4     22.12.2015 08:50  0         0         1          19510      4_2_1
5     22.12.2015 08:52  0         0         1                     5_0_0
5     22.12.2015 08:52  0         0         1                     5_0_0
5     22.12.2015 20:46  0         0         1                     5_0_0
For Stimuli A this meant the values c(69247, 319528, 2122) and B c(1486, 19510).

Try this.
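library(dplyr)

# df$Date <- as.POSIXct(strptime(df$Date, "%d.%m.%Y %H:%M"))
df %>%
  arrange(User, Date) %>%
  group_by(User) %>%
  mutate(
    # date of the most recent StimuliA == 1 row for this user
    # (rows before the first stimulus get the first stimulus date)
    last.date = Date[which(StimuliA == 1L)[c(1, 1:sum(StimuliA == 1L))][cumsum(StimuliA == 1L) + 1]]
  ) %>%
  mutate(
    # minutes since that stimulus, on response rows only; explicit units keep
    # the values comparable across users (ifelse drops the difftime units)
    timesince = ifelse(Responses == 1L, as.numeric(difftime(Date, last.date, units = "mins")), NA)
  )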
This works by first creating a column that records the date of the last stimulus, and then using ifelse to get the difference between the current date and that stimulus date on response rows. Note that only StimuliA is considered here. You can then filter to extract only the LAST response.
There is a cleaner way to do the "last.date" operation with zoo::na.locf, but I didn't want to assume you were ok with another package dependency.
EDIT To identify the sequence (if I correctly understand what you mean by "sequence"), continue the chain with
%>% mutate(sequence = cumsum(StimuliA))
to identify sequences defined as observations following a positive Stimuli. To filter out the last response of a sequence, continue the chain with
%>% group_by(User, sequence) %>%
filter(timesince == max(timesince, na.rm = TRUE))
to group by sequence (and user) and then extract the maximum time difference associated with each sequence (which will correspond to the last positive response of a sequence).
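As a rough sketch of the same idea in one pass (not the code above, and with a few assumptions): sequences are started by either StimuliA or StimuliB, the timestamps are parsed as UTC to avoid daylight-saving effects, and the sample rows are copied from users 1 and 3 of the question's data. The names sequence, stimulus_time, last_response and minutes are made up for illustration.

```r
library(dplyr)

# Rows for users 1 and 3 from the question's data set
df <- data.frame(
  User = c(1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L),
  Date = c("24.11.2015 20:39", "25.11.2015 11:42", "11.01.2016 22:46",
           "24.11.2015 13:13", "25.11.2015 13:59", "27.11.2015 12:18",
           "28.05.2016 06:49", "06.07.2016 09:46"),
  StimuliA = c(1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L),
  StimuliB = 0L,
  Responses = c(0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L)
)
# Assumption: treat the timestamps as UTC
df$Date <- as.POSIXct(df$Date, format = "%d.%m.%Y %H:%M", tz = "UTC")

result <- df %>%
  arrange(User, Date) %>%
  group_by(User) %>%
  # a new sequence starts at every stimulus row (A or B)
  mutate(sequence = cumsum(StimuliA == 1L | StimuliB == 1L)) %>%
  group_by(User, sequence) %>%
  # drop rows before the first stimulus and sequences without any response
  filter(sequence > 0, any(Responses == 1L)) %>%
  summarise(
    stimulus_time = first(Date),                 # the stimulus row opens the sequence
    last_response = max(Date[Responses == 1L]),  # latest response in the sequence
    minutes = as.numeric(difftime(last_response, stimulus_time, units = "mins")),
    .groups = "drop"
  )

result$minutes
```

For these rows the minutes column should line up with the values 69247, 1486 and 319528 from the desired output.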

Related

Filtering columns where all rows match a specific value in R

I have a dataframe like so:
Apple Orange Strawberry
    0      1          1
    0      1          1
    0      1          0
    0      1          0
    0      1          0
    0      1          1
    0      1          1
I want to filter the dataframe such that I get the column names where all the rows are a specific value say 0.
In this case I would get Apple
I tried doing
df[rowSums(df<1)==0, ]
but I'm just getting an empty dataframe with all the column names. Is there something else I can try?
In base R you could do:
names(df)[colSums(df) == 0]
or even
names(Filter(all, data.frame(df == 0)))
[1] "Apple"
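One caveat worth sketching: colSums(df) == 0 only identifies all-zero columns when no column contains negative values, and it does not generalize to a target other than 0. A minimal alternative, assuming a hypothetical target variable and a made-up Mixed column added purely to show the difference, tests every element directly:

```r
df <- data.frame(
  Apple  = c(0, 0, 0),
  Orange = c(1, 1, 1),
  Mixed  = c(-1, 0, 1)  # hypothetical column: sums to 0 but is not all zero
)

target <- 0  # the value every row of a column must equal

# TRUE for columns whose entries all equal the target
all_target <- vapply(df, function(col) all(col == target), logical(1))
matching <- names(df)[all_target]
matching
```

Here matching contains only "Apple", whereas the colSums test would also pick up Mixed.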
Here is another option:
library(tidyverse)
df %>%
select(where( ~ sum(.) == 0)) %>%
names()
Output
[1] "Apple"
Data
df <- structure(list(Apple = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), Orange = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), Strawberry = c(1L, 1L, 0L, 0L, 0L, 1L,
1L)), class = "data.frame", row.names = c(NA, -7L))

How to do multiple arithmetic operations according to conditions between two datasets in R

I have several datasets.
The first one
lid=structure(list(x1 = 619490L, x2 = 10L, x3 = 0L, x4 = 6089230L,
x5 = 0L, x6 = -10L), class = "data.frame", row.names = c(NA,
-1L))
second dataset
lidar=structure(list(A = c(638238.76, 638238.76, 638239.29, 638235.39,
638233.86, 638233.86, 638235.55, 638231.97, 638231.91, 638228.41
), B = c(6078001.09, 6078001.09, 6078001.15, 6078001.15, 6078001.07,
6078001.07, 6078001.02, 6078001.08, 6078001.09, 6078001.01),
C = c(186.64, 186.59, 199.28, 189.37, 186.67, 186.67, 198.04,
200.03, 199.73, 192.14), gpstime = c(319805734.664265, 319805734.664265,
319805734.67875, 319805734.678768, 319805734.678777, 319805734.678777,
319805734.687338, 319805734.701928, 319805734.701928, 319805734.701945
), Intensity = c(13L, 99L, 5L, 2L, 20L, 189L, 2L, 11L, 90L,
1L), ReturnNumber = c(2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
3L), NumberOfReturns = c(2L, 1L, 3L, 2L, 1L, 1L, 3L, 1L,
1L, 4L), ScanDirectionFlag = c(1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L), EdgeOfFlightline = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Classification = c(1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
How can I subtract a value from the lid dataset from each row of the lidar dataset using the formula
(lidar$A-lid$x1)/lid$x3
and then
(lidar$B-lid$x4)/lid$x6
For the first row the result would be
(lidar$A-lid$x1)/lid$x3 = 1874,876, but everything after the comma is discarded, giving 1874
(lidar$B-lid$x4)/lid$x6 = 1122
In addition, for the column lidar$C of the lidar dataset,
subtract the smallest value from the largest value. In this case lidar$c11-lidar$c1=5,5
So the desired output will be
A          B           C       Intensity  ReturnNumber  NumberOfReturns  row   col   subs(lidar$Cmax-lidar$Cmin)
638238.76  6078001.09  186.64  13         2             2                1874  1122  5,5
638238.76  6078001.09  186.59  99         1             1                1874  1122  5,5
638239.29  6078001.15  199.28  5          1             3                1874  1122  5,5
638235.39  6078001.15  189.37  2          2             2                1874  1122  5,5
The result of the subtraction (lidar$Cmax-lidar$Cmin) is the same for all rows.
row and col are the results of this arithmetic:
(lidar$A-lid$x1)/lid$x3 (row)
then
(lidar$B-lid$x4)/lid$x6 (col)
Before truncation the row and col values differ from row to row, but once the part after the comma is removed they look the same.
How can I get the desired output with these arithmetic operations?
Any help is appreciated. Thank you.
If I understand your purpose correctly, the main question is how to remove the part after comma, which is a decimal separator in your examples.
If that's true, one way of doing that is to split the number into two parts, one which comes before the comma and another one which comes after it, and then extract only the first part. In R you can do this by strsplit(). However, this function requires the input to be characters, not numerics. So, you need to coerce the numbers into characters, do the splitting, coerce the result back to numbers, and then extract its first element.
Here is an example of a function to implement the steps:
remove_after_comma <- function(num_with_comma) {
  # keep only the part before the decimal separator of a single value
  myfun <- function(num_with_comma) {
    num_with_comma |>
      as.character() |>
      strsplit("[,|.]") |>
      unlist() |>
      as.numeric() |>
      getElement(1)
  }
  vapply(num_with_comma, myfun, FUN.VALUE = numeric(1))
}
Notes:
[,|.] is used to anticipate other systems that use . instead of , as the decimal separator.
vapply is used so that the function can be applied to a whole numeric vector, such as a numeric column.
Check:
remove_after_comma(c(a = '1,5', b = '12,74'))
# a b
# 1 12
(4:10)/3
#[1] 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
remove_after_comma ((4:10)/3)
#[1] 1 1 2 2 2 3 3
Assuming that lid$x3 = 10L in your example:
(lidar$A-lid$x1)/lid$x3
#[1] 1874.876 1874.876 1874.929 1874.539 1874.386 1874.386 1874.555 1874.197
#[9] 1874.191 1873.841
remove_after_comma((lidar$A-lid$x1)/lid$x3)
#[1] 1874 1874 1874 1874 1874 1874 1874 1874 1874 1873
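When the values are already numeric, base R's trunc() may be a simpler way to discard the fractional part than going through strings. The sketch below uses a few A and B values from the question's lidar data and, like the answer, assumes lid$x3 = 10L (the posted lid data has x3 = 0L, which would give Inf). The names row_idx and col_idx are illustrative.

```r
# A few values from the question's lidar data (rows 1, 3 and 10)
A <- c(638238.76, 638239.29, 638228.41)
B <- c(6078001.09, 6078001.15, 6078001.01)

# Assumption: x3 = 10L, as in the answer; the other values come from the question
lid <- data.frame(x1 = 619490L, x3 = 10L, x4 = 6089230L, x6 = -10L)

# trunc() drops everything after the decimal separator
row_idx <- trunc((A - lid$x1) / lid$x3)
col_idx <- trunc((B - lid$x4) / lid$x6)

row_idx
col_idx
```

Note that trunc() cuts toward zero while floor() always rounds down, so the two differ for negative results (trunc(-1.5) is -1, floor(-1.5) is -2), and round() would turn 1874.876 into 1875 rather than 1874.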
I'm not sure if this is what you mean:
# trunc() drops the fractional part; round() would turn 1874.876 into 1875
lidar$row <- trunc((lidar$A-lid$x1)/lid$x3)
lidar$col <- trunc((lidar$B-lid$x4)/lid$x6)
lidar$cdif <- max(lidar$C)-min(lidar$C)

How to change the reference level for Risk Ratio in logistic regression in R?

  Survey.ID Quit Boss Subord Subord2 Subord3 Subord4
1         1    0    0      0       0       1       0
2         2    1    0      0       1       0       0
3         3    0    0      0       0       0       0
4         4    0    0      0       0       1       0
5         5    0    0      0       0       1       0
6         6    1    0      0       0       1       0
I have a df above. Each of the variables is a binary variable that categorizes if someone is a boss, or a certain level of subordinate. I am trying to see what is most predictive of someone quitting the past month. I am using the logistic regression
model <- glm(Quit ~ Subord, family=binomial, data = df)
summary(model)
exp(cbind(RR = coef(model), confint.default(model)))
I would like to find the Relative Risk (RR) for each group of employees: Boss, Subord, Subord2, Subord3, Subord4. However, I would like the reference group to be Subord4. I believe the reference is currently set to Boss. How do I fix this?
I think this might help
Libraries
library(tidyverse)
Sample data
df <-
structure(list(id = 1:6, Quit = c(0L, 1L, 0L, 0L, 0L, 1L), Boss = c(0L,
0L, 0L, 0L, 0L, 0L), Subord = c(0L, 0L, 0L, 0L, 0L, 0L), Subord2 = c(0L,
1L, 0L, 0L, 0L, 0L), Subord3 = c(1L, 0L, 0L, 1L, 1L, 1L), Subord4 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Code
df %>%
  # Create a single column of variables: Boss Subord Subord2 Subord3 Subord4
  pivot_longer(cols = -c(id, Quit)) %>%
  # Keep only those with value = 1
  filter(value == 1) %>%
  # Make "Subord4" the baseline, as the first level of the factor
  mutate(name = fct_relevel(as.factor(name), "Subord4")) %>%
  glm(data = ., formula = Quit ~ name, family = binomial)
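If the role is already stored as a single factor column, base R's relevel() is another way to pick the baseline. This is only a sketch on made-up data: the toy data frame and its role column are hypothetical, not the poster's survey.

```r
# Hypothetical toy data: one factor column instead of several 0/1 columns
toy <- data.frame(
  Quit = c(0, 1, 0, 1, 1, 0, 0, 1),
  role = factor(c("Boss", "Boss", "Subord", "Subord",
                  "Subord4", "Subord4", "Boss", "Subord4"))
)

# Move "Subord4" to the first level so it becomes the reference group
toy$role <- relevel(toy$role, ref = "Subord4")

model <- glm(Quit ~ role, family = binomial, data = toy)

# Exponentiated coefficients, each relative to Subord4
exp(coef(model))
```

Keep in mind that exponentiated coefficients from a binomial glm with the default logit link are odds ratios rather than risk ratios.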

Identify eventlogs in a vector before and after a certain date in r

I have the following data set, which I would like to group and summarize for each sequence. Each sequence should be split into the events that occurred in the first 7 days after the first date, with the later events combined into a separate group. Basically, my biggest challenge is to find the first date in the sequence, add 7 days, and mark all the dates in this sequence that fall into this window.
structure(list(`Sequence ID` = c("1_0_0", "1_0_0", "1_0_0", "1_0_0",
"1_0_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_2_0",
"1_2_1", "1_2_1", "1_2_1", "1_2_1", "1_2_2"), Date = c("02.12.2015 20:16",
"03.12.2015 20:17", "02.12.2015 20:44", "03.12.2015 09:32", "03.12.2015 09:33",
"07.12.2015 08:18", "08.12.2015 19:40", "08.12.2015 19:43", "22.12.2015 18:22",
"22.12.2015 18:23", "23.12.2015 14:18", "05.01.2016 11:35", "05.01.2016 13:21",
"05.01.2016 13:22", "05.01.2016 13:22", "04.08.2016 08:25"),
StimuliA = c(0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L), StimuliB = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L), Response = c(1L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L
)), .Names = c("Sequence ID", "Date", "StimuliA", "StimuliB",
"Response"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-16L), spec = structure(list(cols = structure(list(`Sequence ID` = structure(list(), class = c("collector_character",
"collector")), Date = structure(list(), class = c("collector_character",
"collector")), StimuliA = structure(list(), class = c("collector_integer",
"collector")), StimuliB = structure(list(), class = c("collector_integer",
"collector")), Response = structure(list(), class = c("collector_integer",
"collector")), X6 = structure(list(), class = c("collector_skip",
"collector")), X7 = structure(list(), class = c("collector_skip",
"collector")), X8 = structure(list(), class = c("collector_skip",
"collector")), X9 = structure(list(), class = c("collector_skip",
"collector")), X10 = structure(list(), class = c("collector_skip",
"collector"))), .Names = c("Sequence ID", "Date", "StimuliA",
"StimuliB", "Response", "X6", "X7", "X8", "X9", "X10")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
This could be a possible output, where Group 0 summarizes all the values in the first 7 days and Group 1 the values which occurred later.
Sequence ID  Group  Date              StimuliA  StimuliB  Response
1_0_0        0      02.12.2015 20:16  0         0         5
1_0_0        1      09.12.2015 20:16  0         0         0
1_1_0        0      07.12.2015 08:18  1         0         2
1_1_0        1      14.12.2015 08:18  0         0         2
1_2_0        0      23.12.2015 14:18  1         0         0
1_2_0        1      30.12.2015 14:18  0         0         0
1_2_1        0      05.01.2016 11:35  0         1         3
1_2_1        1      12.01.2016 11:35  0         0         0
1_2_2        0      04.08.2016 08:25  0         1         0
1_2_2        1      11.08.2016 08:25  0         0         0
I would try to achieve this with the following code, but I need some input on how to identify the values before and after 7 days.
library(dplyr)
# change the date into POSIXct format
df$Date <- as.POSIXct(strptime(df$Date, "%d.%m.%Y %H:%M"))
# arrange the data frame by sequence and date
df <- arrange(df, `Sequence ID`, Date)
# identify the values before and after 7 days
# aggregate all the event log rows by sequence ID and group
df <- aggregate(. ~ `Sequence ID` + Group, data = df, sum)
The following data.table code returns aggregated values which are grouped by sequence and the period of the first seven days within each sequence (or beyond) as requested:
library(data.table)
# copy and coerce to data.table
data.table(DF)[
  # make syntactically valid column names
  , setnames(.SD, make.names(names(.SD)))][
    # transform character date-time to date
    , Date := as.Date(lubridate::dmy_hm(Date))][
      # create Group variable for the first 7 days and beyond within each sequence
      , Initial.Period := Date %between% (min(Date) + c(0L, 6L)), by = Sequence.ID][
        # aggregate by sequence and date range
        , .(Min.Date = min(Date), Response = sum(Response)), by = .(Sequence.ID, Initial.Period)]
   Sequence.ID Initial.Period   Min.Date Response
1:       1_0_0           TRUE 2015-12-02        5
2:       1_1_0           TRUE 2015-12-07        2
3:       1_1_0          FALSE 2015-12-22        2
4:       1_2_0           TRUE 2015-12-23        0
5:       1_2_1           TRUE 2016-01-05        3
6:       1_2_2           TRUE 2016-08-04        0
Note that the result differs from the possible output shown in the question due to ambiguities or inconsistencies in the provided sample data:
The sample data contain date-times, but the OP uses the terms date and days throughout the specification. Therefore, the code uses Date rather than POSIXct.
I've deliberately chosen Initial.Period as a more descriptive column name to indicate the first 7 days and to avoid the generic and ambiguous name Group.
The columns StimuliA and StimuliB have been omitted from the aggregation because they aren't consistent with the sequences and the OP hasn't specified how to handle this case.
Min.Date refers to the minimum date in the data for each sequence and period, not to the calculated beginning of the periods.
The result shows only aggregated values where data is available in the data set. The possible output has more rows because it includes all possible combinations of sequences and periods, with missing values filled in with zeros.
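If the window should instead be measured in exact date-times (first event plus 7 * 24 hours, as the possible output in the question suggests), a base R sketch could look like this. It uses two of the sequences from the question's data, parses the timestamps as UTC (an assumption), and the names events, seq_id, first_seen and totals are illustrative.

```r
# Two sequences taken from the question's data
events <- data.frame(
  seq_id = c("1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_2_1", "1_2_1"),
  Date = c("07.12.2015 08:18", "08.12.2015 19:40", "08.12.2015 19:43",
           "22.12.2015 18:22", "22.12.2015 18:23",
           "05.01.2016 11:35", "05.01.2016 13:21"),
  Response = c(0L, 1L, 1L, 1L, 1L, 0L, 1L)
)
# Assumption: treat the timestamps as UTC
events$Date <- as.POSIXct(events$Date, format = "%d.%m.%Y %H:%M", tz = "UTC")

# First timestamp of each sequence (in seconds), repeated on every row
first_seen <- ave(as.numeric(events$Date), events$seq_id, FUN = min)

# Group 0: within 7 * 24 hours of the first event; Group 1: later
events$Group <- as.integer(as.numeric(events$Date) - first_seen >= 7 * 24 * 3600)

# Sum the responses per sequence and group
totals <- aggregate(Response ~ seq_id + Group, data = events, FUN = sum)
totals
```

As with the data.table result, only combinations that actually occur in the data show up in totals.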

Forecasting for multiple products

I want to forecast values for multiple products (in this case product_id 1 and 2 but I actually have a few thousand products) at the same time.
   product_id    Date Revenue Value
1           1 1/10/12      in     0
2           1 1/13/12      in     1
3           1 2/14/16      in     0
4           1  3/5/16     out     0
5           1  1/5/17     out     0
6           1 3/15/17     out     0
7           2 11/1/11      in     1
8           2 3/14/15      in     2
9           2 1/15/16      in     3
10          2 3/15/17     out     0
11          2 4/11/17     out     0
12          2 5/16/17     out     0
If this were only one product, I would fill in the missing dates with:
dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")
allDates <- seq.Date(
  min(dat$Date),
  max(dat$Date),
  "day")
allValues <- merge(
  x = data.frame(Date = allDates),
  y = dat,
  all.x = TRUE)
Make the data a time series:
time <- ts(dat$Value, start = c(2011, 11), frequency = 52)
Forecast using a hybrid model:
library(forecastHybrid)
hm1 <- hybridModel(y = time, weights = "insample.errors")
plot(forecast(hm1))
Is there a way that I can do this for both product ids? Or is there a cleaner method without filling in the blank dates?
dat <-structure(list(product_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
Date = c("1/10/12", "1/13/12", "2/14/16", "3/5/16", "1/5/17", "3/15/17", "1/1/11", "3/14/15", "1/15/16", "3/15/17", "4/11/17", "5/16/17"),
Revenue = c("in", "in", "in", "out", "out", "out", "in", "in", "in", "out", "out", "out"),
Value = c(0L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 0L, 0L, 0L)
),
.Names = c("product_id", "Date", "Revenue", "Value"),
class = "data.frame", row.names = c(NA, -12L))
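One possible way to repeat the date-filling step per product is to split the data by product_id and apply the same merge to each piece. The sketch below is only an outline of that step on made-up rows; fill_dates and by_product are hypothetical names, and the forecasting call itself is left out.

```r
# Hypothetical small example with two products and already-parsed dates
dat <- data.frame(
  product_id = c(1L, 1L, 2L, 2L),
  Date = as.Date(c("2012-01-10", "2012-01-13", "2011-11-01", "2011-11-03")),
  Value = c(0L, 1L, 1L, 2L)
)

# Expand one product's rows to a full daily sequence; missing days get NA
fill_dates <- function(d) {
  all_dates <- data.frame(Date = seq(min(d$Date), max(d$Date), by = "day"))
  merge(all_dates, d[, c("Date", "Value")], all.x = TRUE)
}

# One filled data frame per product, named by product_id
by_product <- lapply(split(dat, dat$product_id), fill_dates)
by_product[["1"]]
```

Each element of by_product could then be turned into a time series and passed to the forecasting step with another lapply(), so the same code would run for a few thousand products.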

Resources