Rolling lowest mean and stdev of n consecutive minutes of each time period - r

New R user here looking for guidance. I am working with a 15-minute data set and looking to parse out the following by one variable (buildings in my case) for each day of the year:
(1) lowest mean of "value" for n consecutive rows (preferably 2 or 3 hours worth)
(2) standard deviation of the same period
Sample df:
variable year month day hr min date value
building_a 2018 6 2 0 0 6/2/2018 19
building_a 2018 6 2 0 15 6/2/2018 19
building_a 2018 6 2 0 30 6/2/2018 19
building_a 2018 6 2 0 45 6/2/2018 17
building_a 2018 6 2 1 0 6/2/2018 17
building_a 2018 6 2 1 15 6/2/2018 15
building_a 2018 6 2 1 30 6/2/2018 15
building_a 2018 6 2 1 45 6/2/2018 14
building_a 2018 6 2 2 0 6/2/2018 14
building_a 2018 6 2 2 15 6/2/2018 13
building_a 2018 6 2 2 30 6/2/2018 13
building_a 2018 6 2 2 45 6/2/2018 13
building_a 2018 6 2 3 0 6/2/2018 12
building_a 2018 6 2 3 15 6/2/2018 14
building_a 2018 6 2 3 30 6/2/2018 13
building_a 2018 6 2 3 45 6/2/2018 13
building_b 2018 6 2 0 0 6/2/2018 37
building_b 2018 6 2 0 15 6/2/2018 41
building_b 2018 6 2 0 30 6/2/2018 38
building_b 2018 6 2 0 45 6/2/2018 39
building_b 2018 6 2 1 0 6/2/2018 37
building_b 2018 6 2 1 15 6/2/2018 36
building_b 2018 6 2 1 30 6/2/2018 34
building_b 2018 6 2 1 45 6/2/2018 34
building_b 2018 6 2 2 0 6/2/2018 35
building_b 2018 6 2 2 15 6/2/2018 35
building_b 2018 6 2 2 30 6/2/2018 29
building_b 2018 6 2 2 45 6/2/2018 32
building_b 2018 6 2 3 0 6/2/2018 30
building_b 2018 6 2 3 15 6/2/2018 33
building_b 2018 6 2 3 30 6/2/2018 30
building_b 2018 6 2 3 45 6/2/2018 32
I've been able to perform this for one-hour segments using the following approach, but cannot figure out how to adapt this to a larger window (e.g., lowest 135 minute mean instead of 60 min).
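library(dplyr)

# Mean and sd of value per building, date and hour
# (the hour column in the sample data is named hr)
tmp <- aggregate(value ~ variable + date + hr, df,
                 function(x) c(mean = mean(x), sd = sd(x)))
tmp2 <- do.call("data.frame", tmp)
tmp2$value.mean <- as.numeric(tmp2$value.mean)
tmp2$value.sd <- as.numeric(tmp2$value.sd)
# Keep the hour with the lowest mean for each building and date
tmp2_flat <- tmp2 %>%
  group_by(variable, date) %>%
  filter(value.mean == min(value.mean)) %>%
  arrange(variable, date, value.sd) %>%
  slice(1)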
Thank you in advance for any advice

I played around a little and this is what I came up with:
UPDATE: My previous answer wasn't very practical. There was no feedback, but I'm changing it nevertheless.
library(zoo)
library(dplyr)
df %>%
  group_by(variable, date) %>%
  mutate(minimum = rollapply(value, width = 4, FUN = mean, fill = NA, align = "right"),
         sd = rollapply(value, width = 4, FUN = sd, fill = NA, align = "right")) %>%
  slice(which.min(minimum))
# A tibble: 2 x 10
# Groups: variable, date [2]
variable year month day hr min date value minimum sd
<fct> <int> <int> <int> <int> <int> <fct> <int> <dbl> <dbl>
1 building_a 2018 6 2 3 0 6/2/2018 12 12.8 0.5
2 building_b 2018 6 2 3 15 6/2/2018 33 31 1.83
The idea remains the same, however. In the rollapply() function, one can specify the number of consecutive rows via the width argument. Here 4 means 4 * 15 minutes = 1 hour, but it can be any number of quarter hours.
With align = "right", it calculates a moving average of value at each row over that row and the width - 1 rows before it.
That should do it, I hope.
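For the 135-minute window asked about in the question, the window spans 9 rows (9 * 15 minutes), so width = 9 would be passed to rollapply(). The same calculation can be cross-checked in base R. The following is only a minimal sketch, assuming the rows are sorted by time with no gaps; lowest_window() is a hypothetical helper name (not part of zoo or dplyr), and the values are building_a's readings from the question.

```r
# building_a's 16 quarter-hour readings from the sample data
value <- c(19, 19, 19, 17, 17, 15, 15, 14, 14, 13, 13, 13, 12, 14, 13, 13)

# Hypothetical helper: lowest rolling mean over `width` consecutive rows,
# plus the standard deviation of that same window.
lowest_window <- function(x, width) {
  stopifnot(length(x) >= width)
  # embed() returns one row per window; values are reversed within a row,
  # which does not affect the mean or the sd.
  windows <- embed(x, width)
  means <- rowMeans(windows)
  best <- which.min(means)
  list(
    end_row = best + width - 1,  # index of the last row in the window
    mean    = means[best],
    sd      = sd(windows[best, ])
  )
}

res_135 <- lowest_window(value, width = 9)  # 9 * 15 minutes = 135 minutes
res_60  <- lowest_window(value, width = 4)  # 4 * 15 minutes = 60 minutes
```

On the full data, the helper would be applied once per building and date, for example inside a grouped summary.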

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
The values in columns xxx and yyy correspond to that year.
I would like to expand this dataframe (or create a new one) so that it retains the same column names but repeats each value 12 times (once for each month of that year), with the year itself also repeated 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the Kronecker product with a vector of 12 ones and then convert back to a data.frame. The as.data.frame can be omitted if a matrix result is ok.
as.data.frame(as.matrix(df) %x% rep(1, 12))
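Note that the Kronecker product returns a matrix without column names, so the resulting data frame gets the default names V1, V2, V3. If the original names should be kept, as the question asks, one alternative sketch (a different technique from the one above) is to repeat the row indices instead:

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat each row index 12 times; column names and types are preserved
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]
rownames(df2) <- NULL  # reset row names to 1..36
```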

How to sort data in descending order based on every second value in R?

I am using dplyr for most of my data wrangling in R. Yet, I am having a hard time achieving this particular effect, and I can't seem to find the answer by googling either.
Assume I have data like this, and I want to sort the person-grouped data based on the cash value from the year 2021. Below I show the outcome I wish to achieve. I guess I am just lacking imagination on this one. If I only had the 2021 values I could simply use ... %>% arrange(desc(cash)), but I am not sure how to proceed from here.
year person cash
0 2020 personone 29
1 2021 personone 40
2 2020 persontwo 17
3 2021 persontwo 13
4 2020 personthree 62
5 2021 personthree 55
And what I want to achieve is to sort this data in descending order based on values from the year 2021. So that the data should look like:
year person cash
0 2020 personthree 62
1 2021 personthree 55
2 2020 personone 29
3 2021 personone 40
4 2020 persontwo 17
5 2021 persontwo 13
One approach using a join:
df %>%
  filter(year == 2021) %>%
  # group_by(person) %>% slice(2) %>% ungroup() %>% # each person's yr2
  arrange(-cash) %>%
  select(-cash, -year) %>%
  left_join(df)
Output:
person year cash
1 personthree 2020 62
2 personthree 2021 55
3 personone 2020 29
4 personone 2021 40
5 persontwo 2020 17
6 persontwo 2021 13
Another option:
library(dplyr)
dat %>%
  group_by(person) %>%
  mutate(maxcash = max(cash)) %>%
  arrange(desc(maxcash)) %>%
  ungroup()
# # A tibble: 6 x 4
# year person cash maxcash
# <int> <chr> <int> <int>
# 1 2020 personthree 62 62
# 2 2021 personthree 55 62
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 17
# 6 2021 persontwo 13 17
Or a one-liner, using base R as a helper:
dat %>%
  arrange(-ave(cash, person, FUN = max))
# year person cash
# 4 2020 personthree 62
# 5 2021 personthree 55
# 0 2020 personone 29
# 1 2021 personone 40
# 2 2020 persontwo 17
# 3 2021 persontwo 13
Edit:
If instead of max you mean "always 2021's data", then:
dat %>%
  group_by(person) %>%
  mutate(cash2021 = cash[year == 2021]) %>%
  arrange(desc(cash2021)) %>%
  ungroup()
# # A tibble: 6 x 4
# year person cash cash2021
# <int> <chr> <int> <int>
# 1 2020 personthree 62 55
# 2 2021 personthree 55 55
# 3 2020 personone 29 40
# 4 2021 personone 40 40
# 5 2020 persontwo 17 13
# 6 2021 persontwo 13 13
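The grouped mutate above assumes every person has exactly one 2021 row. A base R sketch of the same ordering, under that same assumption, builds a named lookup vector of 2021 values and orders by it. The names cash2021 and key are chosen here for illustration only.

```r
dat <- data.frame(
  year   = rep(c(2020L, 2021L), times = 3),
  person = rep(c("personone", "persontwo", "personthree"), each = 2),
  cash   = c(29L, 40L, 17L, 13L, 62L, 55L)
)

# Named vector holding each person's 2021 cash value
cash2021 <- subset(dat, year == 2021)
key <- setNames(cash2021$cash, as.character(cash2021$person))

# Sort persons by descending 2021 value; keep years ascending within a person
sorted <- dat[order(-key[as.character(dat$person)], dat$year), ]
rownames(sorted) <- NULL
sorted
```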

A running sum for daily data that resets when month turns

I have a two-column table (tibble), made up of a date object and a numeric variable. There is at most one entry per day, but not every day has an entry (i.e., date is a natural primary key). I am attempting to compute a running sum of the numeric column along the dates, with the running sum resetting when the month turns (the data is sorted by ascending date). I have replicated the result I want below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package "runner" is possibly suited to this, but I don't really understand how to instruct it. I know I could use a join operation plus a group_by using dplyr to do this, but the data set is very large and doing so would be wildly inefficient. I could also manually iterate through the list with a loop, but that seems inelegant. The last option I can think of is selecting out a unique vector of yearmon objects, cutting the original list into many shorter lists and running a plain cumsum on each, but that also feels suboptimal. I am sure this is not the first time someone has had to do this, and given how many tools there are in the tidyverse, I think I just need help finding the right one. The reason I am looking for a tool instead of using one of the methods described above (which would take less time than writing this post) is that this code needs to be very readable by an audience that is less comfortable with code.
We can also use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
          ][, monthly.running.sum := cumsum(score), by = format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract the month and year values from the date, group_by those values and then perform the cumulative sum as follows:
library(lubridate)
library(dplyr)
df %>%
  mutate(Month = month(mdy(Date)),
         Year = year(mdy(Date))) %>%
  group_by(Month, Year) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
An alternative is to use the floor_date function to convert each date to the first day of its month and then calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>%
  mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
  group_by(Floor) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative :
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"),FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(df, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to "lump sum" the days (I'm not sure whether "lump sum" is the correct term) and create a new column "date" that holds the running day count across the 3 years of data, in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result was wrong and the code is too long as well. It doesn't count February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data, syear = 2011, smonth = 1, sday = 1) {
  year <- data[1]
  month <- data[2]
  day <- data[3]
  cmonth <- c(0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  date <- (year - syear) * 365 + sum(cmonth[1:month]) + day
  for (yr in c(syear:year)) {
    if (yr == year) {
      if (yr %% 4 == 0 && month > 2) { date <- date + 1 }
    } else {
      if (yr %% 4 == 0) { date <- date + 1 }
    }
  }
  return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated; look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
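As a quick check that Date arithmetic handles the February problem from the question, the following sketch compares a leap year with a non-leap year and recomputes one value from the table above. The variable names are chosen here for illustration.

```r
# Days from 28 Feb to 1 Mar: 2 in a leap year (29 Feb exists), otherwise 1
leap     <- as.integer(as.Date("2012-03-01") - as.Date("2012-02-28"))
non_leap <- as.integer(as.Date("2013-03-01") - as.Date("2013-02-28"))

# Running day number with 2011-01-01 as day 1, as in days_since_2010 above
day_no <- as.integer(as.Date("2012-02-24") - as.Date("2010-12-31"))
```

Here day_no is 365 days for 2011 plus day 55 of 2012, with no manual month-length table involved.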

Increasing the value of elements in a vector by a given sequence/statement

I want to create a sample dataset for an analysis:
set.seed(123)
d <- data.frame(ID=rep(1:10, each = 8),
AGE=rep(sample(20:40, size=10),each=8),
YEAR=rep(2011:2014, 10, each = 2),
HAND=rep(c("LEFT","RIGHT"), 40),
OUTCOME=(rnorm(80)) )
> d[1:8,]
ID AGE YEAR HAND OUTCOME
1 1 26 2011 LEFT 1.7150650
2 1 26 2011 RIGHT 0.4609162
3 1 26 2012 LEFT -1.2650612
4 1 26 2012 RIGHT -0.6868529
5 1 26 2013 LEFT -0.4456620
6 1 26 2013 RIGHT 1.2240818
7 1 26 2014 LEFT 0.3598138
8 1 26 2014 RIGHT 0.4007715
ID would be subject, AGE is the age of the subject, YEAR is the year the measurement was taken, HAND is left or right hand, OUTCOME is some measure of outcome.
Now I realized that the AGE for each subject should ideally increase by one year for each YEAR the subject has been measured, i.e.:
26,26,27,27,28,28,29,29
I came up with this solution:
age <- unique(d$AGE)
AGE2=c()
for(i in 1:10){
a <- rep(age[i]+0:3, each=2)
AGE2 <- c(AGE2,a)
}
d$AGE2 <- AGE2
d[1:8,]
> d[1:8,]
ID AGE YEAR HAND OUTCOME AGE2
1 1 26 2011 LEFT 1.7150650 26
2 1 26 2011 RIGHT 0.4609162 26
3 1 26 2012 LEFT -1.2650612 27
4 1 26 2012 RIGHT -0.6868529 27
5 1 26 2013 LEFT -0.4456620 28
6 1 26 2013 RIGHT 1.2240818 28
7 1 26 2014 LEFT 0.3598138 29
8 1 26 2014 RIGHT 0.4007715 29
Question: I was wondering if there is a more efficient way to do this? For example would it be possible to add the "corrected" age right away in the data.frame() function above?
We can do this with duplicated
library(dplyr)
res <- d %>%
group_by(ID) %>%
mutate(AGE2 = AGE + cumsum(!duplicated(YEAR))-1)
head(res)
# ID AGE YEAR HAND OUTCOME AGE2
# <int> <int> <int> <fctr> <dbl> <dbl>
#1 1 26 2011 LEFT 1.7150650 26
#2 1 26 2011 RIGHT 0.4609162 26
#3 1 26 2012 LEFT -1.2650612 27
#4 1 26 2012 RIGHT -0.6868529 27
#5 1 26 2013 LEFT -0.4456620 28
#6 1 26 2013 RIGHT 1.2240818 28
Using dplyr you can simply group by ID and HAND:
d %>% group_by(ID, HAND) %>% mutate(AGE2 = AGE + (0:(length(AGE)-1)))
Source: local data frame [80 x 6]
Groups: ID, HAND [20]
ID AGE YEAR HAND OUTCOME AGE2
<int> <int> <int> <fctr> <dbl> <int>
1 1 26 2011 LEFT 1.7150650 26
2 1 26 2011 RIGHT 0.4609162 26
3 1 26 2012 LEFT -1.2650612 27
4 1 26 2012 RIGHT -0.6868529 27
5 1 26 2013 LEFT -0.4456620 28
6 1 26 2013 RIGHT 1.2240818 28
7 1 26 2014 LEFT 0.3598138 29
8 1 26 2014 RIGHT 0.4007715 29
9 2 35 2011 LEFT 0.1106827 35
10 2 35 2011 RIGHT -0.5558411 35
# ... with 70 more rows
With data.table, you could use rep and a little algebra.
library(data.table)
setDT(d)
d[, AGE2 := AGE + rep(0L:((.N-1)/2L), each=2), by=ID][]
ID AGE YEAR HAND OUTCOME AGE2
1: 1 26 2011 LEFT 1.715064987 26
2: 1 26 2011 RIGHT 0.460916206 26
3: 1 26 2012 LEFT -1.265061235 27
4: 1 26 2012 RIGHT -0.686852852 27
5: 1 26 2013 LEFT -0.445661970 28
6: 1 26 2013 RIGHT 1.224081797 28
7: 1 26 2014 LEFT 0.359813827 29
8: 1 26 2014 RIGHT 0.400771451 29
9: 2 35 2011 LEFT 0.110682716 35
10: 2 35 2011 RIGHT -0.555841135 35
11: 2 35 2012 LEFT 1.786913137 36
12: 2 35 2012 RIGHT 0.497850478 36
...
Here, AGE2 is constructed by adding AGE to rep(0L:((.N-1)/2L), each=2), which counts from 0 up to (.N - 1) / 2, where .N is the number of observations for the ID, and repeats each count twice. The by argument applies this separately to each ID.
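On the second part of the question, whether the corrected age can be built directly in the data.frame() call: a minimal sketch, assuming the same layout as the question (10 subjects, 4 years, 2 hands per year). base_age is a name introduced here for illustration; the sampled ages depend on the R version's random number settings, so no output is shown.

```r
set.seed(123)
base_age <- sample(20:40, size = 10)  # one starting age per subject

d <- data.frame(
  ID      = rep(1:10, each = 8),
  # starting age plus 0, 0, 1, 1, 2, 2, 3, 3 within each subject
  AGE     = rep(base_age, each = 8) + rep(rep(0:3, each = 2), times = 10),
  YEAR    = rep(2011:2014, 10, each = 2),
  HAND    = rep(c("LEFT", "RIGHT"), 40),
  OUTCOME = rnorm(80)
)

d[1:8, ]
```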
