How to find the mean of n consecutive days in each group in R

I have a dataframe that contains id (with duplicates), date (with duplicates) and value. The values are recorded on consecutive days. I want to group the dataframe by id and by blocks of n consecutive days, then find the mean of value in each block, returning NA if the last block contains fewer than n days.
id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2
. . .
. . .
. . .
20 2012-2-6 10
Desired output with n = 3 consecutive days:
id date value group_n_consecutive_days mean_n_consecutive_days
1 2016-10-5 2 1 2
1 2016-10-6 3 1 2
1 2016-10-7 1 1 2
1 2016-10-8 2 2 NA
1 2016-10-9 5 2 NA
2 2013-10-6 2 1 4
.
.
.
.
20 2012-2-6 10 6 25

The data in the question is sorted by date and consecutive within id, so we assume that this holds in general. We also take "duplicate dates" to mean that different id values can share a date, while within an id the dates are unique and consecutive. Using the data shown reproducibly in Note 2 at the end, group by id and compute the group numbers using gl. Then, grouping by id and group_no, take the mean of each group of 3 rows, or NA for smaller groups.
library(dplyr)
DF %>%
group_by(id) %>%
mutate(group_no = c(gl(n(), 3, n()))) %>%
group_by(group_no, .add = TRUE) %>% # .add replaces the deprecated add argument (dplyr >= 1.0.0)
mutate(mean = if (n() == 3) mean(value) else NA_real_) %>%
ungroup()
giving:
# A tibble: 6 x 5
id date value group_no mean
<int> <date> <int> <int> <dbl>
1 1 2016-10-05 2 1 2
2 1 2016-10-06 3 1 2
3 1 2016-10-07 1 1 2
4 1 2016-10-08 2 2 NA
5 1 2016-10-09 5 2 NA
6 2 2013-10-06 2 1 NA
Note 1
An alternative to gl(...) could be cumsum(rep(1:3, length = n()) == 1), and an alternative to if (n() == 3) mean(value) else NA could be mean(head(c(value, NA, NA), 3)).
Note 2
The input data in reproducible form was assumed to be:
Lines <- "id date value
1 2016-10-5 2
1 2016-10-6 3
1 2016-10-7 1
1 2016-10-8 2
1 2016-10-9 5
2 2013-10-6 2"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)
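The answer above hard-codes n = 3. As a rough sketch of how the same idea could be parameterised, the base R helper below takes n as an argument. The name mean_n_days is hypothetical (it is not from the answer), and the sketch keeps the answer's assumption that rows are sorted by date and consecutive within id.

```r
# Hypothetical helper, not part of the original answer.
# Assumes rows are sorted by date within id and that dates are consecutive.
mean_n_days <- function(DF, n = 3) {
  # block number within each id: 1, 1, 1, 2, 2, 2, ... for n = 3
  DF$group_no <- ave(seq_along(DF$id), DF$id,
                     FUN = function(i) (seq_along(i) - 1) %/% n + 1)
  # one key per (id, block) pair
  key <- paste(DF$id, DF$group_no)
  # mean of each complete block, NA for a block with fewer than n rows
  DF$mean <- ave(DF$value, key,
                 FUN = function(v) if (length(v) == n) mean(v) else NA)
  DF
}

DF <- data.frame(
  id    = c(1, 1, 1, 1, 1, 2),
  date  = as.Date(c("2016-10-05", "2016-10-06", "2016-10-07",
                    "2016-10-08", "2016-10-09", "2013-10-06")),
  value = c(2, 3, 1, 2, 5, 2)
)
res <- mean_n_days(DF, n = 3)
res
```

On the six sample rows this gives a mean of 2 for the first block of id 1 and NA for the two incomplete blocks, matching the dplyr output shown above.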

Related

How to use a for loop to changed consecutive values in R?

How can I run a loop over multiple columns changing consecutive values to true values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you want to keep the for loop, you can use the following solution. Your desired output is the difference between consecutive rows of the original data set, so first copy the relevant columns into DF before the loop. The loop then reads from the untouched copy DF and writes into df. Without the copy, each iteration would subtract a row that has already been overwritten, and the final output would be incorrect:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed, the changed row is used in calculating row i+1 rather than the original row i. To fix that, run the loop in reverse order, that is, use nrow(df):2 in the for statement. Alternatively, try one of these, which do not use any loops and also have the advantage of not overwriting the input, something which makes the code easier to debug.
1) Base R Use ave to perform Diff by group where Diff uses diff to actually perform the differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
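The reverse-order fix described in that answer (using nrow(df):2) can be sketched as follows. This is the question's own loop with only the iteration order changed. The data frame is rebuilt inline so the snippet runs on its own, and it assumes df has at least two rows.

```r
df <- data.frame(Time = c(1, 3, 7, 8), Value = c(6, 10, 18, 20),
                 Bin = 1:4, Subject_ID = c(1, 1, 1, 1))

# Walk from the last row up to row 2, so that row - 1 has not been
# modified yet at the moment it is subtracted.
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]
  }
}
df
```

Because each row is updated before the row above it, every subtraction still sees the original values.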
Here is a base R one-liner with diff in a lapply loop. It does not group by Subject_ID, which is fine for the single-subject example.
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
# Time Value Bin Subject_ID
#1 1 6 1 1
#2 2 4 2 1
#3 4 8 3 1
#4 1 2 4 1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1

Insert missing rows in time series data

I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA, 5, NA, 1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1))
Note that B represents a dataframe with multiple variables and the NAs in Expected are rows of NAs in the dataframe. Also, the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If you also need Signal to be NA in the inserted rows, you can use
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1
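For comparison, the same result can be sketched in base R without tidyr. This is an alternative technique, not the answer's method: it builds every (day, Signal) combination with expand.grid and left-joins the data onto it with merge. The helper column name day is an assumption made for illustration, and a new day is assumed to start whenever Signal drops, as in the grouping rule above.

```r
A <- data.frame(Signal = c(1, 2, 3, 5, 1, 2, 4, 5, 6), var1 = 1)

# A new day starts whenever Signal decreases (same rule as the cumsum grouping)
A$day <- cumsum(c(TRUE, diff(A$Signal) < 0))

# Every (day, Signal) combination that should exist
full <- expand.grid(Signal = 1:6, day = unique(A$day))

# Left join: combinations missing from A get NA in var1
out <- merge(full, A, by = c("day", "Signal"), all.x = TRUE)
out <- out[order(out$day, out$Signal), c("Signal", "var1")]
rownames(out) <- NULL
out
```

The result has 12 rows, with var1 set to NA for Signal 4 and 6 on the first day and Signal 3 on the second day.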

delete a row where the order is wrong within a group

I have a data set with about 1000 groups. Each group is ordered from 1 to 100 (it can stop at any number up to 100).
While looking through the data, I found that some groups had bad orders, i.e., the order would go up to 100 and then suddenly a 24 would show up.
How can I delete all of these erroneous rows?
As you can see from the picture above (before -> after), I would like to find all rows that don't follow the order within the group and delete them.
Any help would be great!
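lag returns the previous value within each group, so order - lag(order) is the difference between the current value and the previous one. The filter keeps only rows with a non-negative difference, i.e. rows where the current value is at least as large as the previous value. The condition order == min(order) is needed because lag gives NA for the first row of each group, which would otherwise be dropped. I keep the helper column diff so you can check the result, but you can drop it with %>% select(-diff).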
library(dplyr)
df1 %>% group_by(gruop) %>% mutate(diff = order-lag(order)) %>%
filter(diff >= 0 | order==min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time, we can use ave and keep only those rows whose difference from the previous row, by group, is 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9
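The approaches above compare each row only with its immediate predecessor. If several out-of-order rows follow each other (for example 10, 2, 3), the second one passes that check because 3 is larger than 2. A possible variant, offered here as a sketch rather than as any of the answers' methods, compares each row with the running maximum of its group using cummax. The sample data is made up for illustration.

```r
# Made-up data: group 1 ends with two stray rows (2, 3), group 2 has one (3)
df <- data.frame(group = rep(1:2, each = 6),
                 order = c(1, 3, 5, 10, 2, 3,
                           1, 4, 4, 8, 3, 9))

# Keep a row only if it is at least as large as every earlier value in its group.
# ave() returns 1/0 here because the logical result is stored in a numeric vector.
keep <- ave(df$order, df$group, FUN = function(x) x >= cummax(x)) == 1
res <- df[keep, ]
res
```

Whether a row such as the 3 after 10, 2 should be dropped depends on what counts as an error in the real data, so this is only one reasonable reading of the question.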

How to subset by time range?

I would like to create a subset of the following example data frame. The condition is to select, for each id, the rows whose time value falls between the minimum time for that id and, let's say, one hour later.
id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693
And the output should be like this:
id time
1 1468696537
1 1468696637
2 1471902849
2 1471902850
3 1471446673
3 1471446693
Please, help me how to do that?
We can do the following:
library(magrittr)
library(dplyr)
df %>%
group_by(id) %>%
filter(time <= min(time) + 3600)
# id time
# <int> <int>
#1 1 1468696537
#2 1 1468696637
#3 2 1471902849
#4 2 1471902850
#5 3 1471446673
#6 3 1471446693
Explanation: Group entries by id, then keep the entries whose time is at most one hour (3600 seconds) after the minimum time of their group.
Sample data
df <- read.table(text =
" id time
1 1468696537
1 1468696637
1 1482007490
2 1471902849
2 1471902850
2 1483361074
3 1474207754
3 1474207744
3 1471446673
3 1471446693 ", header = T)
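The same filter can also be sketched in base R with ave, which may be handy when dplyr is not available. The variable names window and first_time are illustrative, and the time column is assumed to hold Unix epoch seconds, as the 3600 in the answer suggests.

```r
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  time = c(1468696537, 1468696637, 1482007490,
           1471902849, 1471902850, 1483361074,
           1474207754, 1474207744, 1471446673, 1471446693)
)

window <- 3600                                # one hour, in seconds
first_time <- ave(df$time, df$id, FUN = min)  # minimum time per id, repeated per row
res <- df[df$time <= first_time + window, ]
res
```

Changing window adjusts the length of the range without touching the rest of the code.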

Finding individual values from cumulative mean in R Dataframe

I am having trouble working out how to recover individual values from a running mean in an R dataframe.
I have an R dataframe:
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
Where the mean is the mean for the x measurements for the specific ID in the dataframe.
To find the individual values at each x value rather than the mean, I was thinking that I needed to apply a recursive function on the dataframe and group by the ID. How could I do this in a dataframe while grouping by one of the values when any apply function wouldn't have access to the previous entry in the dataframe?
When completed and appended to the dataframe, I am hoping it to look like this:
x ID Mean IndivValues
1 1 1 1
1 2 5 5
2 1 3 5
2 2 6 7
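It's much easier to go via running totals: multiply each mean by its x to get the total so far, then take the difference between consecutive totals within each ID to get the individual observation, as below: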
Example data.frame:
df <- read.table(text='
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
', header=T)
Solution:
library(dplyr); library(magrittr)
df %>%
group_by(ID) %>%
mutate(
total = Mean * x,
ind_value = total - lag(total, default = 0))
## A tibble: 4 x 5
## Groups: ID [2]
# x ID Mean total ind_value
# <int> <int> <int> <int> <int>
#1 1 1 1 1 1
#2 1 2 5 5 5
#3 2 1 3 6 5
#4 2 2 6 12 7
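The reasoning behind this: if m_k is the mean of the first k values, the running total is k * m_k, so the k-th value is k * m_k - (k - 1) * m_(k-1). A small base R round trip can be used to check this. The vector vals is made up for illustration, and the sketch assumes x counts observations 1, 2, 3, ... without gaps.

```r
vals <- c(1, 5, 3, 9)                          # made-up individual values
k <- seq_along(vals)

run_mean  <- cumsum(vals) / k                  # running mean: 1, 3, 3, 4.5
totals    <- run_mean * k                      # back to running totals
recovered <- diff(c(0, totals))                # difference of consecutive totals

all.equal(recovered, vals)
```

For ID 1 in the question, the means 1 and 3 give totals 1 and 6, and therefore individual values 1 and 5, as in the expected output.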
