Subset dataframe based on non-sequential dates - r

I have data that looks like this:
df <- data.frame(
  datecol = as.Date(c("2010-04-03","2010-04-04","2010-04-05","2010-04-06","2010-04-07",
                      "2010-04-03","2010-04-04","2010-04-05","2010-04-06","2010-04-07",
                      "2010-05-06","2010-05-07","2010-05-09","2010-06-06","2010-06-07")),
  x = c(1,1,1,0,1,1,1,0,0,0,1,0,0,0,1),
  type = c(rep("A",5), rep("B",5), rep("C",5))
)
> df
datecol x type
1 2010-04-03 1 A
2 2010-04-04 1 A
3 2010-04-05 1 A
4 2010-04-06 0 A
5 2010-04-07 1 A
6 2010-04-03 1 B
7 2010-04-04 1 B
8 2010-04-05 0 B
9 2010-04-06 0 B
10 2010-04-07 0 B
11 2010-05-06 1 C
12 2010-05-07 0 C
13 2010-05-09 0 C
14 2010-06-06 0 C
15 2010-06-07 1 C
I need to subset this data frame by type, keeping only the types which have 2 or more different dates (counting rows where x == 1) and where those dates are more than 1 day apart. In the example above, type A has 4 different dates, and type C has 2 different dates that are more than 1 day apart, so I want to save these two as a new data frame. Type B has 2 different dates, but they are only 1 day apart, so I don't want to keep it.
I was thinking of doing it in a loop: count how many unique dates are within each type, keep everything that has more than 2 different dates, then look at the ones which have only 2 different dates, calculate the distance between them, and keep only the ones where the distance is more than 1. But it seems there should be a more efficient way. Any ideas?

One solution with data.table:
#make sure datecol is Date
df$datecol <- as.Date(df$datecol)
library(data.table)
#x needs to be 1 and the date difference more than a day per type
#then in the second [] we select the TRUEs
setDT(df)[x == 1, diff(datecol) > 1, by = type][V1 == TRUE, type]
#[1] A C
#Levels: A B C
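The expression above returns just the qualifying type labels. Since the question asks for a new data frame, a hedged follow-up (wrapping the diff test in any() so each type yields a single TRUE/FALSE, then filtering the original rows) could look like this:
library(data.table)
# one TRUE/FALSE per type: is any consecutive pair of x == 1 dates more than 1 day apart?
keep <- setDT(df)[x == 1, any(diff(datecol) > 1), by = type][V1 == TRUE, type]
# keep all rows of the qualifying types as a new data frame
new_df <- df[type %in% keep]
new_df then contains the rows for types A and C.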

Related

Filling NA values in a column based on its previous records and another column (with interval) in R

I want to fill the action column based on its records and the time column. NA in the action column should be filled based on the previous action record and the time interval. Let's say we set the time interval to 10, which means that if the action is A and the time is 1, all NAs in action should be filled with A until time==11 (1+10).
Please note that if the action or ID changes, this process should reset. For example, B appears at time==11, so I want to fill the following NAs with B until time==21; but C appears at time==16, so from there we continue filling NAs with C until time==26.
df<-read.table(text="
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6",header=T,stringsAsFactors = F)
Desired Result:
id action time
1 A 1
1 A 4
1 A 9
1 B 11
1 B 12
1 C 16
1 C 19
1 NA 30
1 A 31
1 A 32
2 NA 1
2 A 2
2 A 6
We can extract the non-NA rows to use as a reference for filling in values, then iterate through the data set and conditionally replace values based on whether they meet the requirements of id and the time interval.
library(dplyr)
# Use row numbers as an index (unique id)
df$idx <- 1:nrow(df)
# Find the non-NA rows to use as a reference for imputation
idx <- df %>%
  group_by(id) %>%
  na.omit()
The temporary data set idx is used as the reference and the column idx is our unique identifier. Let's first look at the logic for finding and filling in the missing values without worrying about the time interval, so that it's easier to read and understand:
# Ignoring the 'interval' limitation, we'd fill them in like this:
for (r in 1:nrow(df)) {
  if (is.na(df$action[r])) {
    df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
  }
}
If you're running this example code, make sure you re-create df and idx before proceeding, since df is modified by that last code block.
The time interval requires us to do a logical test on the value of time and also another test to avoid trying to conduct the time comparison on NA values:
# Accounting for the max interval:
interval <- 10
for (r in 1:nrow(df)) {
  if (is.na(df$action[r])) {
    if (!is.na(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]))) {
      if (dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]) + interval >= df$time[r])
        df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
    }
  }
}
df
This gives us:
id action time idx
1 1 A 1 1
2 1 A 4 2
3 1 A 9 3
4 1 B 11 4
5 1 B 12 5
6 1 C 16 6
7 1 C 19 7
8 1 <NA> 30 8
9 1 A 31 9
10 1 A 32 10
11 2 <NA> 1 11
12 2 A 2 12
13 2 A 6 13
which matches your desired output.
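For comparison, a loop-free sketch (not part of the original answer; it assumes the dplyr and tidyr packages, the same interval of 10, and a freshly created df) carries the last observed action and its time forward within each id, then blanks out anything that falls outside the interval:
library(dplyr)
library(tidyr)
interval <- 10
df %>%
  group_by(id) %>%
  # record the time at which each non-NA action was observed
  mutate(last_time = replace(time, is.na(action), NA)) %>%
  # carry the last action and its time forward within each id
  fill(action, last_time) %>%
  # drop carried-forward values that fall outside the allowed interval
  mutate(action = ifelse(!is.na(last_time) & time - last_time <= interval, action, NA)) %>%
  select(-last_time) %>%
  ungroup()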

How can I test if a value in one row is unique compared with all previous rows by group, and count the number of different values

I am trying to test whether the value in one row is unique compared with all previous rows within a group.
For example, for ID=1, I want to compare the drug of the current row to all previous rows (that is, to rows whose DATE is earlier than the current row) under ID=1. In row 2, drug A is the same as in row 1, so EXIST_BEFORE is coded as 1; in row 4, C is unique compared with the previous rows (A, A, B), so it is coded as 0.
A second question: how can I count the number of different drugs prior to the current date? For example, for ID=1, prev_drug for row 4 is 2, because there are two drugs (A, B) different from drug C prior to the DATE of row 4.
ID DATE DRUG EXIST_BEFORE prev_drug
1 2001-01-01 A NA 0
1 2001-02-01 A 1 0
1 2001-03-15 B 0 1
1 2001-04-20 C 0 2
1 2001-05-29 A 1 2
1 2001-05-02 B 1 2
2 2001-03-02 A NA 0
2 2001-03-23 C 0 1
2 2001-04-04 D 0 2
2 2001-05-05 B 0 3
I only know how to compare with the one row above using lag(), but have no idea how to compare with all earlier dates for each ID.
For this, try using dplyr. Basically we can just group on ID and DRUG. For that grouped combination, find the first DATE occurrence using min(). Then, if the date is after that first occurrence, it is a repeat.
library(dplyr)
mydata %>%
  group_by(ID, DRUG) %>%
  mutate(FIRST_OCCURANCE = min(DATE),
         EXIST_BEFORE = DATE > FIRST_OCCURANCE)
ID DATE DRUG EXIST_BEFORE FIRST_OCCURANCE
<int> <date> <chr> <lgl> <date>
1 1 2001-01-01 A FALSE 2001-01-01
2 1 2001-02-01 A TRUE 2001-01-01
3 1 2001-03-15 B FALSE 2001-03-15
4 1 2001-04-20 C FALSE 2001-04-20
5 1 2001-05-29 A TRUE 2001-01-01
6 1 2001-05-02 B TRUE 2001-03-15
7 2 2001-03-02 A FALSE 2001-03-02
8 2 2001-03-23 C FALSE 2001-03-23
9 2 2001-04-04 D FALSE 2001-04-04
10 2 2001-05-05 B FALSE 2001-05-05
I broke it into two variables to show what is going on, but you can also reduce the mutate() line simply to:
mutate(EXIST_BEFORE = DATE > min(DATE))
Alternatively, the rowid() function from the data.table package can be used:
library(data.table)
setDT(DT)[order(DATE), EXIST_BEFORE := pmin(1L, rowid(ID, DRUG) - 1L)]
DT
ID DATE DRUG EXIST_BEFORE
1: 1 2001-01-01 A 0
2: 1 2001-02-01 A 1
3: 1 2001-03-15 B 0
4: 1 2001-04-20 C 0
5: 1 2001-05-29 A 1
6: 1 2001-05-02 B 1
7: 2 2001-03-02 A 0
8: 2 2001-03-23 C 0
9: 2 2001-04-04 D 0
10: 2 2001-05-05 B 0
rowid(ID, DRUG) - 1L counts the occurrences of each ID/DRUG combination (an implied grouping), starting at 0. pmin() is used to cap values at 1. order(DATE) ensures that rows are processed in date order.
Or, as suggested in Sotos' comment:
setDT(DT)[order(DATE), EXIST_BEFORE := as.integer(duplicated(DRUG)), by = ID][]
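The answers above cover EXIST_BEFORE; for the second part of the question (prev_drug, the number of distinct earlier drugs that differ from the current one), a data.table sketch along the same lines (just a sketch, assuming the data is in DT as above) could be:
library(data.table)
setDT(DT)[order(DATE), prev_drug := sapply(seq_len(.N), function(i)
  # distinct drugs seen on this ID's earlier (by DATE) rows that differ from the current drug
  uniqueN(setdiff(DRUG[seq_len(i - 1)], DRUG[i]))), by = ID]
DT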

How do I identify the first zero in a group of ordered columns?

I'm trying to format a dataset for use in some survival analysis models. Each row is a school, and the time-varying columns are the total number of students enrolled in the school that year. Say the data frame looks like this (there are time-invariant columns as well).
Name total.89 total.90 total.91 total.92
a 8 6 4 0
b 1 2 4 9
c 7 9 0 0
d 2 0 0 0
I'd like to create a new column indicating when the school "died," i.e., the first column in which a zero appears. Ultimately I'd like this column to be "years since 1989", and I can rename the columns accordingly.
A more general version of the question, for a series of time ordered columns, how do I identify the first column in which a given value occurs?
Here's a base R approach to get a column with the first zero (x = 0) or NA if there isn't one:
data$died <- apply(data[, -1], 1, match, x = 0)
data
# Name total.89 total.90 total.91 total.92 died
# 1 a 8 6 4 0 4
# 2 b 1 2 4 9 NA
# 3 c 7 9 0 0 3
# 4 d 2 0 0 0 2
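Since match() gives the position among the total.* columns and the first of those columns is 1989, converting died into the "years since 1989" value mentioned in the question is just an offset of 1 (a hedged follow-up, assuming the year columns stay in ascending order):
# total.89 is position 1, so position - 1 = years since 1989; NA (never died) stays NA
data$years_since_1989 <- data$died - 1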
Here is an option using max.col with rowSums. !df1[-1] flags the zero cells as TRUE, max.col(..., "first") picks the first flagged column in each row, and NA^!rowSums(!df1[-1]) multiplies in an NA for rows that contain no zero at all (NA^1 is NA, NA^0 is 1):
df1$died <- max.col(!df1[-1], "first") * NA^!rowSums(!df1[-1])
df1$died
#[1] 4 NA 3 2

How to conditionally compute difference in column values between rows in R?

I have the following data set (this is just a sample; the actual data set has many more rows):
User Time Flag TimeDifference Expected o/p (Seconds)
A 11:39:30 1
A 11:37:53 1
A 20:44:19 1
A 22:58:42 2 Calculate time difference? 8063
A 23:01:54 1 Calculate time difference? 192
B 23:03:00 1
B 23:03:33 1
B 23:03:53 1
B 15:00:42 3 Calculate time difference 28991
B 19:35:31 2 Calculate time difference 16489
B 19:35:34 1 Calculate time difference 3
C 10:19:06 1
C 10:59:50 1
C 10:59:50 1
C 12:16:36 1
C 12:16:36 1
For each user, I need to calculate the time difference (in seconds) between rows whenever there is a 'Flag change' and store it in a new column called 'TimeDifference'.
That is, whenever the flag changes from 1 to 2, or 2 to 3, or 2 to 1, or 3 to 1, I need to compute the difference in the Time column between the current row and the preceding row at the point where the flag change is encountered.
I have time in hh:mm:ss format.
Is there a for-loop I can apply here?
Help appreciated.
One way to do that is to turn your time variable into a POSIXlt time object and calculate the time difference (for all rows) against a shifted copy of the time variable. Then use your flag variable to set the differences you don't want to NA. The important part is that you need to diff the flag variable so you know when the flag has changed.
I'm laying out all the steps here, but there's probably a quicker way to do it:
# Create the data
flag <- c(1,1,1,2,1,1,1,1,3,2,1,1,1,1,1,1)
time <- c('11:39:30','11:37:53','20:44:19','22:58:42','23:01:54',
          '23:03:00','23:03:33','23:03:53','15:00:42','19:35:31',
          '19:35:34','10:19:06','10:59:50','10:59:50','12:16:36',
          '12:16:36')
# Shift the time by one row (previous row's time)
time_shift <- c(NA, head(time, -1))
# Turn into POSIXlt objects
time <- strptime(time, format = '%H:%M:%S')
time_shift <- strptime(time_shift, format = '%H:%M:%S')
data <- data.frame(time, time_shift, flag)
# Calculate diffs
data$time_diff <- as.numeric(abs(difftime(data$time, data$time_shift, units=('secs'))))
data$flag_diff <- c(NA,abs(diff(data$flag)))
# Set non 'flag change' diffs to NA
data$time_diff[data$flag_diff == 0] <- NA
You'll probably want to remove the useless columns and convert time back into your original representation, which you can do with:
data$time <- format(data$time, "%H:%M:%S")
data <- data[c('time', 'flag', 'time_diff')]
That will result in a dataframe that looks like this:
time flag time_diff
1 11:39:30 1 NA
2 11:37:53 1 NA
3 20:44:19 1 NA
4 22:58:42 2 8063
5 23:01:54 1 192
6 23:03:00 1 NA
7 23:03:33 1 NA
8 23:03:53 1 NA
9 15:00:42 3 28991
10 19:35:31 2 16489
11 19:35:34 1 3
12 10:19:06 1 NA
13 10:59:50 1 NA
14 10:59:50 1 NA
15 12:16:36 1 NA
16 12:16:36 1 NA
Some preprocessing may be required earlier:
df$Time <- strptime(x = df$Time, format = "%H:%M:%S")
df$Time <- strftime(x = df$Time, format = "%H:%M:%S")
df$Time <- as.POSIXct(df$Time)
sol <- function(d) {
  Time_difference <- numeric(nrow(d))
  # indices where a change in Flag is detected
  ind <- which(diff(d$Flag) != 0) + 1
  # calculate differences in time where a change in Flag was detected
  Time_difference[ind] <- abs(difftime(time1 = d$Time[ind], time2 = d$Time[ind - 1],
                                       units = "secs"))
  d$Time_Difference <- Time_difference
  return(d)
}
Now we use the plyr package and its ddply function, which follows the split-apply-combine principle: it takes a data frame, splits it by a variable ("User" in this case), applies a function (sol in this case) to each subset, and then recombines the pieces into a single data frame.
library(plyr)
ddply(.data = df, .variables = "User", .fun = sol)
# User Time Flag Time_Difference
#1 A 11:39:30 1 0
#2 A 11:37:53 1 0
#3 A 20:44:19 1 0
#4 A 22:58:42 2 8063
#5 A 23:01:54 1 192
#6 B 23:03:00 1 0
#7 B 23:03:33 1 0
#8 B 23:03:53 1 0
#9 B 15:00:42 3 28991
#10 B 19:35:31 2 16489
#11 B 19:35:34 1 3
#12 C 10:19:06 1 0
#13 C 10:59:50 1 0
#14 C 10:59:50 1 0
#15 C 12:16:36 1 0
#16 C 12:16:36 1 0
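For completeness, a grouped alternative could avoid both the manual shift and the split-apply-combine step. The sketch below is not from the original answers; it assumes the dplyr package and that df$Time has already been converted to POSIXct as above, and it keeps the lagged difference only where the flag changes within each User:
library(dplyr)
df %>%
  group_by(User) %>%
  mutate(TimeDifference = ifelse(Flag != lag(Flag),   # NA on each user's first row
                                 abs(as.numeric(difftime(Time, lag(Time), units = "secs"))),
                                 NA)) %>%
  ungroup()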

perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, after a group_by and summarize I have a data frame that I want to do some further manipulation on, by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
                   group=as.factor(rep(c("a","b","c"),2)),
                   sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I wrote the fraction in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get the total for each run in df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
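That indexing works because the integer codes of the factor df$run happen to line up with the row order of total. A slightly more defensive variant of the same base-R lookup (just a sketch) matches on the run value itself instead:
# look up each row's total by matching on run rather than relying on factor codes
df$percent <- df$sum / total$total[match(df$run, total$run)]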
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (the %<>% compound-assignment pipe requires the magrittr package):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766
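The merge-then-mutate idea above can also be written as a single dplyr pipeline; a minimal sketch, assuming dplyr's left_join:
library(dplyr)
df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total) %>%  # 'sum' and 'total' refer to the columns here
  select(-total)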
