Sum of lag functions - r

Within one person's data from a behavioral task, I am trying to sum the clock time at which a target appears (data$onset) and the reaction time of the response (data$Latency) to find the clock time at which the response was entered. For later data-processing reasons, these calculated values have to be placed in the data$onset column two rows down from where the target appeared on screen. In the example below:
Item      onset  Latency
Prime      9.97  0
Target    10.70  0.45
Mask      11.02  0
Response     NA  0
Onset is how many seconds into the task the stimulus appeared, and Latency is the reaction time to the target. Latency for non-targets is always 0, as subjects do not respond to them. For the NA under onset, I need that value to be the sum of the target's onset and the reaction time to the target (10.70 + 0.45). Here is the code I have tried:
data$onset=if_else(is.na(data$onset), sum(lag(data$onset, n = 2)+lag(data$Latency, n = 2)), data$onset)
If any clarification is needed please let me know.

Since you used if_else, I'm adding a dplyr solution:
library(dplyr)

data %>%
  mutate(onset = ifelse(is.na(onset), lag(onset, n = 2) + lag(Latency, n = 2), onset))
Output:
Item onset Latency
<fct> <dbl> <dbl>
1 Prime 9.97 0
2 Target 10.7 0.45
3 Mask 11.0 0
4 Response 11.1 0
Also note that, if you want to stick to your own syntax:
data$onset <- if_else(is.na(data$onset), lag(data$onset, n = 2) + lag(data$Latency, n = 2), data$onset)
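For reference, here is a minimal reproducible sketch of the fix, with the example data frame reconstructed from the question:

library(dplyr)

# Reconstruct the example data from the question
data <- data.frame(
  Item    = c("Prime", "Target", "Mask", "Response"),
  onset   = c(9.97, 10.70, 11.02, NA),
  Latency = c(0, 0.45, 0, 0)
)

# Fill the NA onset with the target's onset plus the reaction time, both
# taken from two rows earlier. The original attempt failed because sum()
# collapsed the lagged vectors to a single value instead of keeping them
# element-wise.
data <- data %>%
  mutate(onset = ifelse(is.na(onset), lag(onset, n = 2) + lag(Latency, n = 2), onset))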

Need formula on how to calculate the work completion days on average

A client gives 100 tasks to an employee.
The employee completes 50 tasks in 1 day,
20 tasks in 2 days,
15 tasks in 3 days,
4 tasks in 4 days,
5 tasks in 6 days,
and 6 tasks in 10 days.
Now I want to know, on average, how many days the employee takes to complete 1 task.
I need a formula for this query.
Assuming tasks are not completed in parallel (i.e. the day counts are mutually exclusive with respect to working on tasks), the average is 0.26 days per task:
=SUM(B2:B7)/SUM(A2:A7)
This is where the solution could end; however, below are a number of checks and alternative approaches that confirm the result of the formula above.
checks
check 1
The same value can be derived using the 'weighted average' calculation:
=SUM((B2:B7/A2:A7)*A2:A7)/SUM(A2:A7)
check 2
Intuitively, if each task takes ~0.26 days to complete, and there are 100 tasks, then the total duration should be ~26 days: summing column B gives just that:
=SUM(B2:B7)
check 3
If still unconvinced, you can calculate the average days per task for each category/type (i.e. for those that take 1,2,3,.., 10 days to complete):
=B2:B7/A2:A7
Then expand these out using SEQUENCE or another method:
=SEQUENCE(1,A2,G2,0)
Again, this yields 0.26, which should confirm the correctness of the simple, direct ratio.
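For anyone who prefers to verify this in R rather than Excel, the same ratio is a one-liner (a sketch using the task and day counts from the question):

# Tasks completed and the days each batch took, from the question
tasks <- c(50, 20, 15, 4, 5, 6)
days  <- c(1, 2, 3, 4, 6, 10)

# Average days per task: total days / total tasks
sum(days) / sum(tasks)  # 0.26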
Ta

Sampling not completely at random, with boundary conditions

I have summary level data that tells me how often a group of patients actually went to the doctor until a certain cut-off date. I do not have individual data, I only know that some e.g. went 5 times, and some only once.
I also know that some were already patients at the beginning of the observation interval and would be expected to come more often, whereas others were new patients who entered later. If they only joined a month before the cut-off date, they would be expected to come less often than someone who was in the group from the beginning.
Of course, the patients are not well behaved, so they sometimes miss a visit, or they come more often than expected. I am setting some boundary conditions to define the expectation about minimum and maximum number of doctor visits relative to the month they started appearing at the doctor.
Now, I want to distribute the actual summary level data to individuals, i.e. create a data frame that tells me during which month each individual started appearing at the doctor, and how many times they came for check-up until the cut-off date.
I am assuming this can be done with some type of random sampling, but the result needs to fit both the summary level information I have about the actual subjects as well as the boundary conditions telling how often a subject would be expected to come to the doctor relative to their joining time.
Here is some code that generates the target data frame that contains the month when the observation period starts, the respective number of doctor's visits that is expected (including boundary for minimum and maximum visits), and the associated percentage of subjects who start coming to the doctor during this month:
library(tidyverse)

months <- c("Nov", "Dec", "Jan", "Feb", "Mar", "Apr")
target.visits <- c(6, 5, 4, 3, 2, 1)
percent <- c(0.8, 0.1, 0.05, 0.03, 0.01, 0.01)

df.target <- data.frame(month = months, target.visits = target.visits,
                        percent = percent) %>%
  mutate(max.visits = c(7, 6, 5, 4, 3, 2),
         min.visits = c(5, 4, 3, 2, 1, 1))
This is the data frame:
month target.visits percent max.visits min.visits
Nov 6 0.80 7 5
Dec 5 0.10 6 4
Jan 4 0.05 5 3
Feb 3 0.03 4 2
Mar 2 0.01 3 1
Apr 1 0.01 2 1
In addition, I can create the data frame that shows the actual subject n with the actual number of visits:
subj.n <- 1000
actual.visits <- c(7, 6, 5, 4, 3, 2, 1)
actual.subject.perc <- c(0.05, 0.6, 0.2, 0.06, 0.035, 0.035, 0.02)

df.observed <- data.frame(actual.visits = actual.visits,
                          actual.subj.perc = actual.subject.perc,
                          actual.subj.n = subj.n * actual.subject.perc)
Here is the data frame with the actual observations:
actual.visits actual.subj.perc actual.subj.n
7 0.050 50
6 0.600 600
5 0.200 200
4 0.060 60
3 0.035 35
2 0.035 35
1 0.020 20
Unfortunately I do not have any idea how to bring these together. I just know that if I have, e.g., 60 subjects who come to the doctor 4 times during their observation period, I would like to randomly assign a starting month to each of them. However, based on the boundary conditions min.visits and max.visits, I know that it can only be a month from Dec to Feb.
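To illustrate with the 4-visit case, here is a sketch of the kind of constrained random assignment I have in mind (using the two data frames above):

# Months whose boundary conditions allow exactly 4 visits (Dec - Feb)
eligible <- df.target %>%
  filter(min.visits <= 4, max.visits >= 4)

# Randomly assign one of the eligible starting months to each of the
# 60 subjects observed with 4 visits, weighted by the month percentages
n4 <- df.observed$actual.subj.n[df.observed$actual.visits == 4]
assigned <- sample(eligible$month, size = n4, replace = TRUE,
                   prob = eligible$percent / sum(eligible$percent))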
Any thoughts are much appreciated.

K means cluster analysis result using R

I tried a k-means cluster analysis on a data set. The data set for customers includes the order number (the number of times a customer has placed an order with the company; can be any number), order day (the day of the week the most recent order was placed; 0 to 6), and order hour (the hour of the day the most recent order was placed; 0 to 23) for loyal customers. I scaled the values and used:
# K-means cluster analysis (mydata already scaled)
fit <- kmeans(mydata, 3)  # 3-cluster solution
# Get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)
However, I am getting a few negative values as well. What I have found online says this means the differences within a group are greater than the differences with other groups, but I cannot understand how to interpret the output.
Can you please give an example of how to interpret?
Group.1 order_number order_dow order_hour_of_day
1 1 -0.4434400796 0.80263819338 -0.04766613741
2 2 1.6759259419 0.09051366962 0.07815242904
3 3 -0.3936748015 -1.00553744774 0.01377787416
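One way to read this output: because the variables were scaled before clustering, each cluster mean is in standard-deviation units relative to the overall mean, so a negative value means that cluster averages below the overall mean on that variable (e.g. cluster 1 is 0.44 SD below the average order_number and 0.80 SD above the average order_dow). A minimal sketch on made-up data shows the same kind of result:

# With scaled inputs, cluster means come out in standard-deviation units
set.seed(42)
mydata <- as.data.frame(scale(matrix(rnorm(300), ncol = 3)))
fit <- kmeans(mydata, 3)
aggregate(mydata, by = list(fit$cluster), FUN = mean)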

Aggregating data by two variables using R and dplyr

I am using NFL play-by-play data from the 2013 season and I am looking to measure catch success rate by Wide Receivers. Essentially, I have four variables of interest: Targeted Receiver, Pass Distance, Target and Reception. I would like to obtain a data set broken down by Targeted Receiver and Pass Distance, with Targets and Receptions summarized (just a simple count) for each of the two Targeted Receiver and Pass Distance combinations (i.e. Receiver 1 Short, Receiver 1 Long).
Thank you for your help,
CLR
First, load dplyr and keep only the columns of the table df that are relevant (Targeted Receiver, Pass Distance, Target, and Reception).
library(dplyr)
df <- select(df, `Targeted Receiver`, `Pass Distance`, `Target`, `Reception`)
Then, remove the rows where there is no receiver (e.g. a running play).
df <- df[!is.na(df$`Targeted Receiver`), ]
After that, use group_by from dplyr so that your data are grouped at the Target Receiver and Pass Distance level.
grouped <- group_by(df, `Targeted Receiver`, `Pass Distance`)
Finally, use the summarise function to create the count of Target and the sum of Reception.
per_rec <- summarise(grouped, Target = n(), Reception = sum(Reception))
The data will look like this:
Targeted Receiver Pass Distance Target Reception
(chr) (chr) (int) (dbl)
1 A.J. Green Deep 50 21
2 A.J. Green Short 128 77
3 A.J. Jenkins Deep 6 2
4 A.J. Jenkins Short 11 6
5 Aaron Dobson Deep 23 6
6 Aaron Dobson Short 49 31
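The same steps can also be chained into a single dplyr pipeline, which reads a bit more idiomatically:

library(dplyr)

per_rec <- df %>%
  select(`Targeted Receiver`, `Pass Distance`, Target, Reception) %>%
  filter(!is.na(`Targeted Receiver`)) %>%
  group_by(`Targeted Receiver`, `Pass Distance`) %>%
  summarise(Target = n(), Reception = sum(Reception))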

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today? Should it be the tenure up until today, or should it be NA?
First, I shall say I disagree with the previous answer. For a subscription still active today, the tenure should not be entered as the time up until today, nor as NA. What do we know exactly about those subscriptions? We know they have lasted up until today: although we don't know their exact lifetimes, we do know they are longer than the tenure observed so far.
This is a situation known as right-censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R Surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, type="interval2")
(With type="interval2" no event code is supplied; the censoring type is inferred from the two times, with NA standing in for infinity.)
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of the computational details can be found at: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
If a missing end date means that the subscription is still active, then you need to take the time until the current date as the censoring date.
NA won't work with the survival object; I think those cases will be omitted. That is not what you want, because these cases contain important information about survival.
SQL code to get the time until the event (for use in the SELECT part of the query):
DATEDIFF(m, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW:
I would use the difference in days for my analysis; it does not make sense to round the time off to months.
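In R, the same idea might look like this (a sketch, reconstructing the subscriptions data from the question and censoring still-active subscriptions at the current date):

library(survival)

# Reconstruct the example data from the question
subscriptions <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)

# 1 = cancelled (event observed), 0 = still active (right-censored)
subscriptions$status <- ifelse(is.na(subscriptions$end_date), 0, 1)

# Use the current date as the censoring date for active subscriptions
end <- subscriptions$end_date
end[is.na(end)] <- Sys.Date()
subscriptions$tenure_in_days <- as.numeric(end - subscriptions$start_date)

# Fit and plot the survival curve, in days rather than months
fit <- survfit(Surv(tenure_in_days, status) ~ 1, data = subscriptions)
plot(fit)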
You need to know the date the data was collected. The tenure_in_months for id 2 should then be that date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct: the status of 0 for id 2 indicates it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
