Filling NA values in a column based on its previous records and another column (with interval) in R

I want to fill the action column based on its own records and the time column. NA values in action should be filled from the previous action record, subject to a time interval. Let's say we set the time interval to 10: if action is A and time is 1, every NA in action should be filled with A until time == 11 (1 + 10).
Please note that if action or id changes, this process should be reset. For example, in row 4 we have B with time == 11, so I want to fill the following NAs with B until time == 21. But we have C at time == 16, so from there we fill NAs with C until time == 26.
df<-read.table(text="
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6",header=T,stringsAsFactors = F)
Desired Result:
id action time
1 A 1
1 A 4
1 A 9
1 B 11
1 B 12
1 C 16
1 C 19
1 NA 30
1 A 31
1 A 32
2 NA 1
2 A 2
2 A 6

We can extract the non-NA rows to use as a reference for filling in values, then iterate through the data set and conditionally replace values based on if they meet the requirements of id and the time interval.
library(dplyr)
# Use row numbers as an index (unique Id)
df$idx <- 1:nrow(df)
# Find the rows with a non-NA action to use as a reference for imputation
idx <- df %>%
filter(!is.na(action))
The temporary data set idx is used as the reference and the column idx is our unique identifier. Let's first look at the logic for finding and filling in the missing values without worrying about the time interval, so that it's easier to read and understand:
# Ignoring the 'interval' limitation, we'd fill them in like this:
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
If you're running this example code, make sure you re-create df and idx before proceeding, since df is modified by the previous code block.
The time interval requires us to do a logical test on the value of time and also another test to avoid trying to conduct the time comparison on NA values:
# Accounting for the max interval:
interval <- 10
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
if(!is.na(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]))){
if(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]) + interval >= df$time[r])
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
}
df
This gives us:
id action time idx
1 1 A 1 1
2 1 A 4 2
3 1 A 9 3
4 1 B 11 4
5 1 B 12 5
6 1 C 16 6
7 1 C 19 7
8 1 <NA> 30 8
9 1 A 31 9
10 1 A 32 10
11 2 <NA> 1 11
12 2 A 2 12
13 2 A 6 13
which matches your desired output.
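For larger data, the same rule can be written without a row-by-row loop. The following is only a sketch under the assumptions of the question (rows already ordered by time within each id); the helper name fill_within is hypothetical and not part of the answer above. It finds, for each row, the position of the most recent non-NA action in the same id and fills only when that row is at most interval time units earlier.

```r
df <- read.table(text = "
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6", header = TRUE, stringsAsFactors = FALSE)

interval <- 10

# Hypothetical helper: fill NA actions inside one id group.
fill_within <- function(d, interval) {
  known <- !is.na(d$action)
  # Position of the latest non-NA row at or before each row (0 if none yet)
  last_pos <- cummax(ifelse(known, seq_len(nrow(d)), 0L))
  has_prev <- last_pos > 0
  ok <- has_prev
  # Fill only when the reference row is within the allowed interval
  ok[has_prev] <- d$time[has_prev] - d$time[last_pos[has_prev]] <= interval
  d$action[ok] <- d$action[last_pos[ok]]
  d
}

filled <- do.call(rbind, lapply(split(df, df$id), fill_within, interval = interval))
rownames(filled) <- NULL
filled
```

Splitting by id first means a value can never leak from one id into the next, which is the reset the question asks for.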

Related

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
"Days" = c(1,2,3,4,5,6,7,8,1,2,3,4), "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then, I need to move to the second to the last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Num Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.60
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value that is greater or equal to 2 that is the closest to 2) and return the day of that value, which is 4, as well. Then, move on to 50 and do the following:
Day Num Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and return the day of that value, which is 3, as well. Then, move on to 40 and do the same thing over again, move on to 34, then 30, then 23, then 12, the last value (or Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach for that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2)
I would like to store the results of these divisions but only the most recent result that is greater than or equal to 2. I would also like to return the day that result was achieved. Basically, I am trying to calculate doubling time for each day. I would also need this to be grouped by the Group. Normally, I would use dplyr for this but I am not sure how to link up a loop with dplyr to take advantage of group_by. Also, I could be overlooking lapply or some variation thereof. My expected dataframe with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
obs_n =c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division but I am lost as to how to move on to the next last value within each group. Right now, this is ignoring the group, though I obviously have not told it to group as I am unclear as to how to do this outside of dplyr.
for(i in 1:nrow(data))
data$test[i] <- ifelse(!is.na(data$Num), last(data$Num)/data$Num[i] , NA)
I also get the following error when I run it:
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x){
if(x>=2){
return(x)
} else {
return(FALSE)
}
}
for (i in 1:nrow(data)){
data$test[i]<- division(data$test[i])
}
Now, this approach works but only if i need to run this once on the last observation and only if I apply it to 1 group. I have 209 groups and many days that I would need to run this over. I am not sure how to put together the first for loop with the division function and I also am totally lost as to how to do this by group and move to the next last values. Any suggestions would be appreciated.
You can modify your division function to handle a vector and return a data frame with two columns, divres and ind. The latter is the row index that will be used to calculate obs_n, as shown below:
division <- function(x){
lenx <- length(x)
y <- vector(mode="numeric", length = lenx)
z <- vector(mode="numeric", length = lenx)
for (i in lenx:1){
y[i] <- ifelse(length(which(x[i]/x[1:i]>=2))==0,NA,x[i]/x[1:i] [max(which(x[i]/x[1:i]>=2))])
z[i] <- ifelse(is.na(y[i]),NA,max(which(x[i]/x[1:i]>=2)))
}
df <- data.frame(divres = y, ind = z)
return(df)
}
Check the output of division function created above using data$Num as input
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with the data frame data, then use pipes and mutate from dplyr to look up the obs_n value in Days using ind, and select the appropriate columns to generate the desired data frame data2:
data2 <- cbind.data.frame(data, division(data$Num)) %>% mutate(obs_n = Days[ind]) %>% select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
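Note that division(data$Num) is called on the whole column, so it does not know about Group; the result above is correct here because no value in group 2 is at least double a value in group 1. With 209 groups that may not hold. A minimal group-aware sketch in base R follows; the helper name doubling_one_group is hypothetical, and it assumes rows are ordered by Days within each Group.

```r
data <- data.frame(Group = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                   Days  = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4),
                   Num   = c(10, 12, 23, 30, 34, 40, 50, 60, 2, 4, 8, 12))

# Hypothetical helper: for each row i, find the latest earlier row j (j < i)
# in the same group with Num[i] / Num[j] >= 2.
doubling_one_group <- function(d) {
  d$divres <- NA_real_
  d$obs_n <- NA_real_
  for (i in seq_len(nrow(d))[-1]) {
    j <- which(d$Num[i] / d$Num[seq_len(i - 1)] >= 2)
    if (length(j) > 0) {
      j <- max(j)                       # most recent qualifying row
      d$divres[i] <- d$Num[i] / d$Num[j]
      d$obs_n[i] <- d$Days[j]           # day of that row
    }
  }
  d
}

res <- do.call(rbind, lapply(split(data, data$Group), doubling_one_group))
rownames(res) <- NULL
res
```

Because each group is processed separately, the row index j can be used directly against that group's Days.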
You can create a function with a for loop to get the desired day as given below. Then use that to get the divres in a dplyr mutation.
obs_n <- function(x, days) {
lst <- list()
for(i in length(x):1){
obs <- days[which(rev(x[i]/x[(i-1):1]) >= 2)]
if(length(obs)==0)
lst[[i]] <- NA
else
lst[[i]] <- max(obs)
}
unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
group_by(Group) %>%
mutate(obs_n=obs_n(Num, Days), divres=Num/Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA NA  1  2  2  3  4
Compare with rank (default method="average"). Note that NAs are included at the end by default.
rank(x)
[1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0
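One caveat worth checking on your own data: dense_rank(obs_n) equals the row position only when every earlier row shows up as some obs_n value. If a row is skipped, the rank and the row position drift apart. The sketch below uses made-up numbers (not from the question) and base R's match as a direct lookup of obs_n in Days; the dense ranks are computed with match and sort so the example runs without dplyr.

```r
# Made-up single group: day 2 is never the latest "doubling" reference
Days  <- c(1, 2, 3, 4)
Num   <- c(10, 100, 150, 400)
obs_n <- c(NA, 1, 1, 3)   # as produced by the obs_n() logic above

# Dense ranks of obs_n; for this input the same values dplyr::dense_rank(obs_n) gives
dense <- match(obs_n, sort(unique(obs_n)))

by_rank  <- Num / Num[dense]                # row 4 divides by Num[2] = 100
by_match <- Num / Num[match(obs_n, Days)]   # row 4 divides by Num[3] = 150

by_rank
by_match
```

Inside the grouped mutate, the corresponding change would be divres = Num / Num[match(obs_n, Days)], assuming Days is unique within each group.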

How to conditionally compute difference in column values between rows in R?

I have the following data set
(this is just sample, actual data set runs into rows)
An image of the data set is also attached in the snapshot.
User Time Flag TimeDifference Expected o/p (Seconds)
A 11:39:30 1
A 11:37:53 1
A 20:44:19 1
A 22:58:42 2 Calculate time difference? 8063
A 23:01:54 1 Calculate time difference? 192
B 23:03:00 1
B 23:03:33 1
B 23:03:53 1
B 15:00:42 3 Calculate time difference 28991
B 19:35:31 2 Calculate time difference 16489
B 19:35:34 1 Calculate time difference 3
C 10:19:06 1
C 10:59:50 1
C 10:59:50 1
C 12:16:36 1
C 12:16:36 1
I need to calculate for each user
the time difference (in seconds) between rows whenever there is a 'Flag change', and store it in a new column called 'Time difference',
i.e. whenever the flag changes from 1 to 2, or 2 to 3, or 2 to 1, or 3 to 1, I need to compute the difference in the time column between the current row and the preceding row where the flag change is encountered.
I have time in hh:mm:ss format.
Is there any for-loop function I can apply here?
Help appreciated.
One way to do that is to turn your time variable into a POSIXlt time object and calculate the time difference (for all rows) against a shifted time variable. Then use your flag variable to set the ones you don't want to NA. The important part is that you need to diff the flag variable so you know when your flag has changed.
I'm laying out all the steps here, but there's probably a quicker way to do it:
# Create the data
flag <- c(1,1,1,2,1,1,1,1,3,2,1,1,1,1,1,1)
time <- c('11:39:30','11:37:53','20:44:19','22:58:42','23:01:54',
'23:03:00','23:03:33','23:03:53','15:00:42','19:35:31',
'19:35:34','10:19:06','10:59:50','10:59:50','12:16:36',
'12:16:36')
# Shift the time down by one row (drop the last element, pad the front with NA)
time_shift <- c(NA, head(time, -1))
# Turn into POSIXlt objects
time <- strptime(time, format='%H:%M:%S')
time_shift <- strptime(time_shift, format='%H:%M:%S')
data <- data.frame(time, time_shift, flag)
# Calculate diffs
data$time_diff <- as.numeric(abs(difftime(data$time, data$time_shift, units=('secs'))))
data$flag_diff <- c(NA,abs(diff(data$flag)))
# Set non 'flag change' diffs to NA
data$time_diff[data$flag_diff == 0] <- NA
You'll probably want to remove the useless columns and convert time back into your original representation, which you can do with:
data$time <- format(data$time, "%H:%M:%S")
data <- data[c('time', 'flag', 'time_diff')]
That will result in a dataframe that looks like this:
time flag time_diff
1 11:39:30 1 NA
2 11:37:53 1 NA
3 20:44:19 1 NA
4 22:58:42 2 8063
5 23:01:54 1 192
6 23:03:00 1 NA
7 23:03:33 1 NA
8 23:03:53 1 NA
9 15:00:42 3 28991
10 19:35:31 2 16489
11 19:35:34 1 3
12 10:19:06 1 NA
13 10:59:50 1 NA
14 10:59:50 1 NA
15 12:16:36 1 NA
16 12:16:36 1 NA
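The steps above treat the data as one sequence, so a flag change across two different users would also be counted. A minimal per-user sketch in base R is shown below; it assumes the column names User, Time and Flag from the question, and to_secs is a hypothetical helper that converts hh:mm:ss strings to seconds since midnight.

```r
df <- data.frame(
  User = rep(c("A", "B", "C"), times = c(5, 6, 5)),
  Time = c("11:39:30", "11:37:53", "20:44:19", "22:58:42", "23:01:54",
           "23:03:00", "23:03:33", "23:03:53", "15:00:42", "19:35:31",
           "19:35:34", "10:19:06", "10:59:50", "10:59:50", "12:16:36",
           "12:16:36"),
  Flag = c(1, 1, 1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "hh:mm:ss" -> seconds since midnight
to_secs <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Previous value within the same user (NA for each user's first row)
lag_one <- function(x) c(NA, head(x, -1))

secs      <- to_secs(df$Time)
prev_secs <- ave(secs, df$User, FUN = lag_one)
prev_flag <- ave(df$Flag, df$User, FUN = lag_one)

changed <- !is.na(prev_flag) & df$Flag != prev_flag
df$TimeDifference <- ifelse(changed, abs(secs - prev_secs), NA)
df
```

The absolute value mirrors the expected output in the question, where a later row can carry an earlier clock time.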
Some preprocessing may be required earlier:
df$Time<-strptime(x = df$Time,format = "%H:%M:%S")
df$Time<-strftime(x = df$Time,format = "%H:%M:%S")
# as.POSIXct needs an explicit format to parse a time-only string
df$Time<-as.POSIXct(df$Time,format = "%H:%M:%S")
sol<-function(d){
Time_difference<-numeric(nrow(d))
ind<-which(diff(d$Flag)!=0)+1
#calculate differences in time where change in Flag was detected
Time_difference[ind]<-abs(difftime(time1 = d$Time[ind],time2 =
d$Time[(ind-1)], units = "secs"))
d$Time_Difference<-Time_difference
return(d)
}
Now use the ddply function from the plyr package, which follows the split-apply-combine principle. It takes a data frame (df), splits it by a variable ("User" in this case), applies a function (sol in this case) to each subset, and then recombines the results into a single data frame.
library(plyr)
ddply(.data = df,.variables = "User",.fun = sol)
# User Time Flag Time_Difference
#1 A 11:39:30 1 0
#2 A 11:37:53 1 0
#3 A 20:44:19 1 0
#4 A 22:58:42 2 8063
#5 A 23:01:54 1 192
#6 B 23:03:00 1 0
#7 B 23:03:33 1 0
#8 B 23:03:53 1 0
#9 B 15:00:42 3 28991
#10 B 19:35:31 2 16489
#11 B 19:35:34 1 3
#12 C 10:19:06 1 0
#13 C 10:59:50 1 0
#14 C 10:59:50 1 0
#15 C 12:16:36 1 0
#16 C 12:16:36 1 0

Subset specific row and last row from data frame

I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID regardless of whether the score goes above 5 or below -5 during the event( this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
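The two conditions can also be combined into one logical mask, which avoids the duplicate rows and keeps the original order. This is a sketch in base R, assuming the rows of each ID are contiguous and ordered by Time as in the question; duplicated with fromLast = TRUE marks every row except the last one of each ID.

```r
Data <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time  = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

is_last      <- !duplicated(Data$ID, fromLast = TRUE)  # last row of each ID
out_of_range <- Data$Score > 5 | Data$Score < -5       # score outside [-5, 5]

result <- Data[out_of_range | is_last, ]
result
```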
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

Delete following observations when goal has been reached

Given the dataframe:
df = data.frame(
ID = c(1,1,1,1,2,3,3),
Start = c(0,8,150,200,6,7,60),
Stop = c(5,60,170,210,NA,45,80))
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
3 1 150 170 1
4 1 200 210 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
For each ID, I would like to keep all rows until Start[i+1] - Stop[i] >= 28, and then delete the following observations of that ID
In this example, the output should be
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
I ended up having to set the NAs to a value that is easy to identify later, and using the following code:
df$Stop[is.na(df$Stop)] = 10000
df$diff <- df$Start-c(0,df$Stop[1:length(df$Stop)-1])
space <- with(df, unique(ID[diff<28]))
df2 <- subset(df, (ID %in% space & diff < 28) | !ID %in% space)
Using data.table...
library(data.table)
setDT(df)
df[,{
w = which( shift(Start,type="lead") - Stop >= 28 )
if (length(w)) .SD[seq(w[1])] else .SD
}, by=ID]
# ID Start Stop
# 1: 1 0 5
# 2: 1 8 60
# 3: 2 6 NA
# 4: 3 7 45
# 5: 3 60 80
.SD is the Subset of Data associated with each by=ID group.
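The same idea can be sketched in base R without data.table. The helper name keep_until_gap is hypothetical; it assumes rows are ordered by Start within each ID, and it relies on which() ignoring the NA comparison produced by the missing Stop.

```r
df <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 3, 3),
  Start = c(0, 8, 150, 200, 6, 7, 60),
  Stop  = c(5, 60, 170, 210, NA, 45, 80)
)

# Hypothetical helper: keep rows of one ID up to and including the first row i
# for which Start[i + 1] - Stop[i] >= gap
keep_until_gap <- function(d, gap = 28) {
  n <- nrow(d)
  if (n < 2) return(d)
  w <- which(d$Start[-1] - d$Stop[-n] >= gap)
  if (length(w) > 0) d[seq_len(w[1]), , drop = FALSE] else d
}

res <- do.call(rbind, lapply(split(df, df$ID), keep_until_gap))
rownames(res) <- NULL
res
```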
Create a diff column.
df$diff<-df$Start-c(0,df$Stop[1:length(df$Stop)-1])
Subset on the basis of this column
df[df$diff<28,]
PS: I have converted 'NA' to 0. You would have to handle that anyway.
p <- which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28)
df <- df[p,]
This assumes you want to keep the entries where the next entry's Start is higher than the given entry's Stop by 28 or more.
The result is:
> p
[1] 2 3
> df[p,]
ID Start Stop
2 1 8 60
3 1 150 170
Here p = 2 means that Start in row 3 (i + 1 = 3) is higher than Stop in row 2 (i = 2) by 90.
Or, if by until you mean the reverse condition, then
df <- df[which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] < 28),]
Inclusion of NA in your data frame got me thinking. You have to be very careful how you word your condition. If you want to keep all the cases where difference between next start and stop is less than 28, then the above statement will do.
However, if you want to keep all cases EXCEPT when difference is 28 or more, then you should
p <- which((df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28))
rp <- which((!is.element(1:nrow(df),p)))
df <- df[rp,]
As it will include the unknown difference.
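The difference between the two wordings comes down to how which() treats NA. A small sketch with the differences from this data set (Start of the next row minus Stop of the current row) may make that concrete; the vector name d is just for illustration.

```r
# Start[i + 1] - Stop[i] for the seven-row example; the NA comes from the missing Stop
d <- c(3, 90, 30, -204, NA, 15)

less_than_28 <- which(d < 28)                              # NA is dropped
not_28_plus  <- setdiff(seq_along(d), which(d >= 28))      # NA is kept

less_than_28
not_28_plus
```

The first form silently discards the row with the unknown difference, while the second keeps it.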

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
1 2
1 2
1 3
1 4
1 10
2 9
2 9
2 12
2 13
And my goal is to find the smallest value for each ID subset, and I want the number to be in the first row of the ID group while leaving the other rows blank, such that:
ID Value Start
1 2 2
1 2
1 3
1 4
1 10
2 9 9
2 9
2 12
2 13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
ID Value INDEX
1 1 2 2
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 1 10 NA
6 2 9 9
7 2 9 NA
8 2 12 NA
9 2 13 NA
Update:
How does it work? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value defined by the values of ID. For the example, it returns a vector of five times 2 and four times 9. Since all values except the first in each subset should be NA, the function "is.na<-" sets to NA all values at the logical index defined by c(FALSE, !diff(ID)). This index is TRUE wherever an ID is identical to the preceding one.
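The one-liner can be unpacked into separate steps, which may be easier to follow. This is a sketch that writes the result to a column named Start, as in the question, and assumes the rows of each ID are contiguous.

```r
A <- data.frame(ID    = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                Value = c(2, 2, 3, 4, 10, 9, 9, 12, 13))

# Step 1: the group minimum, repeated on every row of the group
group_min <- ave(A$Value, A$ID, FUN = min)

# Step 2: TRUE wherever the ID repeats the ID of the preceding row
repeat_id <- c(FALSE, diff(A$ID) == 0)

# Step 3: keep the minimum on the first row of each ID, NA elsewhere
A$Start <- group_min
A$Start[repeat_id] <- NA
A
```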
You're almost there. We just need to make a custom function instead of seq_along and to split value by ID (not ID by ID).
first_min <- function(x){
nas <- rep(NA, length(x))
nas[which.min(x)] <- min(x, na.rm=TRUE)
nas
}
This function makes a vector of NAs and places the minimum of Value at the position where that minimum first occurs. That is the first row of each ID here, because Value is sorted within each ID.
transform(dat, INDEX=ave(Value, ID, FUN=first_min))
## ID Value INDEX
## 1 1 2 2
## 2 1 2 NA
## 3 1 3 NA
## 4 1 4 NA
## 5 1 10 NA
## 6 2 9 9
## 7 2 9 NA
## 8 2 12 NA
## 9 2 13 NA
You can achieve this with a tapply one-liner
df$Start<-as.vector(unlist(tapply(df$Value,df$ID,FUN = function(x){ return (c(min(x),rep("",length(x)-1)))})))
I keep going back to this question and the above answers helped me greatly.
There is a basic solution for beginners too:
A$Start<-NA
A[!duplicated(A$ID),]$Start<-A[!duplicated(A$ID),]$Value
Thanks.
