I want to calculate the day-to-day difference in temperature within each month, for instance:
attach(airquality)
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
My output is like:
Month Day temp_diff
5 1 5
5 2 2
5 3 -12
The calculation should stop at the last day of each month, e.g. for May 31 it should not calculate temp_diff by subtracting the temp of May 31 from the temp of June 1.
Before calculating, I need to order the data by month and day so the differences are calculated correctly.
I was thinking of using by(airquality[,1:4],Month,function) but cannot figure out how to write this function. Any help?
Assuming that the dataset is ordered by Month and day
Using dplyr
library(dplyr)
airquality %>%
group_by(Month) %>%
arrange(Month, Day) %>% #if not ordered
mutate(temp_diff=c(diff(Temp),NA)) %>%
select(Month, Day, temp_diff)%>%
head()
# Month Day temp_diff
#1 5 1 5
#2 5 2 2
#3 5 3 -12
#4 5 4 -6
#5 5 5 10
#6 5 6 -1
Or using base R
airquality$temp_diff <- with(airquality, ave(Temp, Month,
FUN=function(x) c(diff(x), NA)))
Or using data.table
library(data.table)
DT <- setDT(airquality)[, temp_diff:=c(diff(Temp),NA), by=Month]
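One way to check any of the approaches above is to confirm that the last day of each month gets NA rather than a difference that reaches into the next month. The following is a minimal base R sketch on a sorted copy of airquality; the names aq and last_day are only illustrative and do not come from the answer.

```r
# Work on a sorted copy so the built-in dataset is left untouched
aq <- airquality[order(airquality$Month, airquality$Day), ]

# Forward difference within each month: next day's Temp minus today's,
# padded with NA so the last day of a month has no value
aq$temp_diff <- ave(aq$Temp, aq$Month, FUN = function(x) c(diff(x), NA))

# Flag the last row of each month and inspect it
last_day <- !duplicated(aq$Month, fromLast = TRUE)
aq[last_day, c("Month", "Day", "temp_diff")]
```

Every row flagged by last_day should show NA in temp_diff, which is the "stop at the last day of each month" behaviour asked for in the question.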
I am monitoring an animal population. I have their individual IDs as numbers, the date they were encountered on, and the number of individuals encountered on that day. I want to sum up the total number of different individuals encountered as the days go by, so I need it to recognize same IDs and only add new individuals to the total encountered.
This is my dataset, the last column being my desired outcome:
Month Day ID N. individuals that day Total encountered
5 13 44 3 3
5 13 58 3 3
5 13 57 3 3
5 14 58 1 3
5 15 44 2 4
5 15 06 2 4
Edit - updated to working, but inelegant, solution. The process here was to use padr to create a row for every ID in every date, with a 1 once it appears. Then we can count how many IDs have appeared as of each date, and add that to the original with a join.
library(tidyverse); library(lubridate)
# First, make a date column for easier sorting etc.
df1 <- df %>%
mutate(date = ymd(paste(2019, Month, Day))) %>%
select(date, ID) %>%
mutate(appearance = 1) # For counting later; if missing = NA in padded version
df2 <- df1 %>%
padr::pad(group = "ID", start_val = min(df1$date), end_val = max(df1$date)) %>%
fill(appearance) %>%
count(date, Month = month(date), Day = day(date),
wt = appearance, name = "Total_encountered_calc")
df %>%
left_join(df2)
Output
Month Day ID N_individuals_that_day Total_encountered date Total_encountered_calc
1 5 13 44 3 3 2019-05-13 3
2 5 13 58 3 3 2019-05-13 3
3 5 13 57 3 3 2019-05-13 3
4 5 14 58 1 3 2019-05-14 3
5 5 15 44 2 4 2019-05-15 4
6 5 15 6 2 4 2019-05-15 4
An option
library(tidyverse)
df %>%
add_count(Month, Day) %>%
mutate(n1 = duplicated(ID)) %>%
group_by(Month, Day) %>%
mutate(n1 = c(sum(!n1), rep(0, n()-1))) %>% # new IDs that day, placed on the day's first row
ungroup %>%
mutate(n1 = cumsum(n1))
# A tibble: 6 x 5
# Month Day ID n n1
# <int> <int> <int> <int> <dbl>
#1 5 13 44 3 3
#2 5 13 58 3 3
#3 5 13 57 3 3
#4 5 14 58 1 3
#5 5 15 44 2 4
#6 5 15 6 2 4
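For comparison, the same running total can be sketched in base R: flag the first appearance of each ID, take a cumulative sum, and then give every row of a day that day's final count. The data frame below is retyped from the question; the names first_seen, running, day_key and total are illustrative, and the rows are assumed to already be in date order.

```r
# Data retyped from the question (ID 06 is stored as the number 6)
df <- data.frame(
  Month = c(5, 5, 5, 5, 5, 5),
  Day   = c(13, 13, 13, 14, 15, 15),
  ID    = c(44, 58, 57, 58, 44, 6)
)

# TRUE the first time an ID is seen, FALSE on every later sighting
df$first_seen <- !duplicated(df$ID)

# Running count of distinct IDs, row by row
running <- cumsum(df$first_seen)

# One key per calendar day, so every row of a day gets that day's final count
day_key <- paste(df$Month, df$Day)
df$total <- ave(running, day_key, FUN = max)
df$total
```

Because the count is driven by duplicated(ID), an individual seen again on a later day never increases the total.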
Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days by group as well as all the following/subsequent group rows. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.
For a base R solution, aggregate days by group using a function that keeps the elements whose index is greater than or equal to that of the maximum, and then reshape the result as a long data.frame:
df0 = aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)])
data.frame(group=rep(df0$group, lengths(df0$days)),
days=unlist(df0$days, use.names=FALSE))
leading to (the values differ from the question's because df is generated with runif and no seed)
group days
1 1 84
2 1 31
3 1 65
4 1 23
5 2 94
6 2 69
7 2 45
You can use which.max to find out the index of the maximum of the days and then use slice from dplyr to select all the rows after that, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
The data.table syntax is similar; .N plays the role of n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
We can use a faster option with data.table where we find the row index (.I) and then subset the rows based on that.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
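Since the question builds df with runif and no seed, the outputs above cannot be reproduced exactly. As a reproducible check, here is a minimal base R sketch that retypes the values printed in the question and builds a logical mask with ave; the names keep and df2 are illustrative.

```r
# Data retyped from the question so the result is reproducible
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  days  = c(54, 61, 31, 52, 21, 22, 18, 50, 46, 35)
)

# Within each group, mark rows at or after the position of the maximum.
# ave() returns the marks as 0/1 numbers, so compare with 1 to get a logical mask.
keep <- ave(df$days, df$group,
            FUN = function(x) seq_along(x) >= which.max(x)) == 1

df2 <- df[keep, ]
df2
```

Subsetting with a mask keeps the original row names (2, 3, 4, 5, 8, 9, 10), which matches the desired df2 shown in the question. Note that which.max returns the first maximum, so ties keep everything from the first occurrence onward.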
I have a longitudinal follow-up of blood pressure recordings.
The value at a certain point is less predictive than is the moving average (rolling mean), which is why I'd like to calculate it. The data looks like
test <- read.table(header=TRUE, text = "
ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT
1 20 2000 NA 3
1 21 2001 129 2
1 22 2002 145 3
1 22 2002 130 2
2 23 2003 NA NA
2 30 2010 150 2
2 31 2011 110 3
4 50 2005 140 3
4 50 2005 130 3
4 50 2005 NA 3
4 51 2006 312 2
5 27 2010 140 4
5 28 2011 170 4
5 29 2012 160 NA
7 40 2007 120 NA
")
I'd like to calculate a new variable, called BLOOD_PRESSURE_UPDATED. This variable should be the moving average for BLOOD_PRESSURE and have the following characteristics:
A moving average is the current value plus the previous value divided by two.
For the first observation, the BLOOD_PRESSURE_UPDATED is just the current BLOOD_PRESSURE. If that is missing, BLOOD_PRESSURE_UPDATED should be the overall mean.
Missing values should be filled in with nearest previous value.
I've tried the following:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = rollmean(x=BLOOD_PRESSURE, 2)) %>%
ungroup()
I have also tried rollapply and rollmeanr without success.
How about this?
library(dplyr)
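test2<-arrange(test,ID,YEAR_VISIT) %>%
group_by(ID) %>% # lag within each subject only
mutate(lag1=lag(BLOOD_PRESSURE), # previous reading
movave=(BLOOD_PRESSURE+lag1)/2) %>% # current plus previous, divided by two
ungroup()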
Another solution uses the 'rollapply' function in the zoo package (which I prefer):
library(dplyr)
library(zoo)
test2<-arrange(test,ID,YEAR_VISIT) %>%
mutate(ma2=rollapply(BLOOD_PRESSURE,2,mean,align='right',fill=NA))
slider is a newer alternative that plays nicely with the tidyverse.
Something like this would do the trick:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = slider::slide_dbl(BLOOD_PRESSURE, mean, .before = 1, .after = 0)) %>%
ungroup()
If you are not committed to dplyr, this should work:
get.mav <- function(bp,n=2){
require(zoo)
if(is.na(bp[1])) bp[1] <- mean(bp,na.rm=TRUE)
bp <- na.locf(bp,na.rm=FALSE)
if(length(bp)<n) return(bp)
c(bp[1:(n-1)],rollapply(bp,width=n,mean,align="right"))
}
test <- with(test,test[order(ID,YEAR_VISIT),])
test$BLOOD_PRESSURE_UPDATED <-
unlist(aggregate(BLOOD_PRESSURE~ID,test,get.mav,na.action=NULL,n=2)$BLOOD_PRESSURE)
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1 1 20 2000 NA 3 134.6667
# 2 1 21 2001 129 2 131.8333
# 3 1 22 2002 145 3 137.0000
# 4 1 22 2002 130 2 137.5000
# 5 2 23 2003 NA NA 130.0000
# 6 2 30 2010 150 2 140.0000
# 7 2 31 2011 110 3 130.0000
# ...
This works for window widths greater than 2 as well.
And here's a data.table solution, which is likely to be much faster if your dataset is large.
library(data.table)
setDT(test) # converts test to a data.table in place
setkey(test,ID,YEAR_VISIT)
test[,BLOOD_PRESSURE_UPDATED:=as.numeric(get.mav(BLOOD_PRESSURE,2)),by=ID]
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1: 1 20 2000 NA 3 134.6667
# 2: 1 21 2001 129 2 131.8333
# 3: 1 22 2002 145 3 137.0000
# 4: 1 22 2002 130 2 137.5000
# 5: 2 23 2003 NA NA 130.0000
# 6: 2 30 2010 150 2 140.0000
# 7: 2 31 2011 110 3 130.0000
# ...
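To see where the first rows of that output come from, the three rules can be applied by hand to the readings of ID 1. This is only a base R sketch without zoo, and it assumes, as get.mav does, that "overall mean" means the mean of that subject's own non-missing readings; the names bp and ma are illustrative.

```r
# BLOOD_PRESSURE for ID 1, in visit order (from the question's data)
bp <- c(NA, 129, 145, 130)

# Rule: a missing first value is replaced by the mean of the subject's readings
if (is.na(bp[1])) bp[1] <- mean(bp, na.rm = TRUE)

# Rule: remaining gaps take the nearest previous value (none left in this example)
for (i in seq_along(bp)[-1]) if (is.na(bp[i])) bp[i] <- bp[i - 1]

# Rule: the first value stays as is, later values average current and previous
ma <- c(bp[1], (bp[-1] + bp[-length(bp)]) / 2)
round(ma, 4)
```

The result agrees with the BLOOD_PRESSURE_UPDATED values printed for ID 1 in both outputs above.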
Try this:
library(dplyr)
library(zoo)
test2<-arrange(test,ID,YEAR_VISIT) %>% group_by(ID)%>%
mutate(ma2=rollapply(BLOOD_PRESSURE,2,mean,align='right',fill=NA))