I have a longitudinal follow-up of blood pressure recordings.
The value at a single time point is less predictive than the moving average (rolling mean), which is why I'd like to calculate it. The data looks like this:
test <- read.table(header=TRUE, text = "
ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT
1 20 2000 NA 3
1 21 2001 129 2
1 22 2002 145 3
1 22 2002 130 2
2 23 2003 NA NA
2 30 2010 150 2
2 31 2011 110 3
4 50 2005 140 3
4 50 2005 130 3
4 50 2005 NA 3
4 51 2006 312 2
5 27 2010 140 4
5 28 2011 170 4
5 29 2012 160 NA
7 40 2007 120 NA
")
I'd like to calculate a new variable, called BLOOD_PRESSURE_UPDATED. This variable should be the moving average for BLOOD_PRESSURE and have the following characteristics:
The moving average is the current value plus the previous value, divided by two.
For the first observation, BLOOD_PRESSURE_UPDATED is just the current BLOOD_PRESSURE. If that is missing, BLOOD_PRESSURE_UPDATED should be the overall mean.
Missing values should be filled in with the nearest previous value.
I've tried the following:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = rollmean(x=BLOOD_PRESSURE, 2)) %>%
ungroup()
I have also tried rollapply and rollmeanr without success.
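(The rollmean() attempt fails because rollmean(x, 2) returns a vector one element shorter than its input, which mutate() rejects. A length-preserving variant of the same call, which still propagates the NAs, would be something like:)
library(dplyr)
library(zoo)
test2 <- test %>%
  group_by(ID) %>%
  arrange(ID, YEAR_VISIT) %>%
  # rollmeanr() is rollmean() with align = "right"; fill = NA keeps the output
  # the same length as the input, but missing readings still produce NA
  mutate(BLOOD_PRESSURE_UPDATED = rollmeanr(BLOOD_PRESSURE, 2, fill = NA)) %>%
  ungroup()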
How about this?
library(dplyr)
test2<-arrange(test,ID,YEAR_VISIT) %>%
mutate(lag1=lag(BLOOD_PRESSURE),
lag2=lag(BLOOD_PRESSURE,2),
movave=(lag1+lag2)/2)
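Note that, as written, this averages the two previous readings rather than the current and previous one, and the lags run across ID boundaries. A sketch of the grouped variant that matches the question's definition (the first visit of each subject comes out NA, since it has no previous reading):
library(dplyr)
test2 <- test %>%
  arrange(ID, YEAR_VISIT) %>%
  group_by(ID) %>%
  # (current + previous) / 2, computed within each subject
  mutate(movave = (BLOOD_PRESSURE + lag(BLOOD_PRESSURE)) / 2) %>%
  ungroup()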
Another solution, which I like more, uses the rollapply() function from the zoo package:
library(dplyr)
library(zoo)
test2<-arrange(test,ID,YEAR_VISIT) %>%
mutate(ma2=rollapply(BLOOD_PRESSURE,2,mean,align='right',fill=NA))
slider is a newer alternative that plays nicely with the tidyverse.
Something like this would do the trick:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = slider::slide_dbl(BLOOD_PRESSURE, mean, .before = 1, .after = 0)) %>%
ungroup()
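As written, slide_dbl() hands each raw window to mean(), so any NA in a window yields NA. If you just want missing readings dropped from each window (this is not the carry-forward / overall-mean rule from the question), extra arguments are forwarded to the function, so something like this should work:
test2 <- test %>%
  group_by(ID) %>%
  arrange(ID, YEAR_VISIT) %>%
  mutate(BLOOD_PRESSURE_UPDATED = slider::slide_dbl(
    BLOOD_PRESSURE, mean, na.rm = TRUE,   # na.rm is passed on to mean()
    .before = 1, .after = 0)) %>%
  ungroup()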
If you are not committed to dplyr, this should work:
get.mav <- function(bp, n = 2){
  require(zoo)
  # if the first value is missing, replace it with the mean of the non-missing values
  if (is.na(bp[1])) bp[1] <- mean(bp, na.rm = TRUE)
  # carry the nearest previous value forward over any remaining NAs
  bp <- na.locf(bp, na.rm = FALSE)
  if (length(bp) < n) return(bp)
  # keep the first n-1 values as-is, then take a right-aligned rolling mean of width n
  c(bp[1:(n-1)], rollapply(bp, width = n, mean, align = "right"))
}
test <- with(test,test[order(ID,YEAR_VISIT),])
test$BLOOD_PRESSURE_UPDATED <-
unlist(aggregate(BLOOD_PRESSURE~ID,test,get.mav,na.action=NULL,n=2)$BLOOD_PRESSURE)
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1 1 20 2000 NA 3 134.6667
# 2 1 21 2001 129 2 131.8333
# 3 1 22 2002 145 3 137.0000
# 4 1 22 2002 130 2 137.5000
# 5 2 23 2003 NA NA 130.0000
# 6 2 30 2010 150 2 140.0000
# 7 2 31 2011 110 3 130.0000
# ...
This works for moving averages > 2 as well.
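The same helper should also drop straight into a grouped dplyr pipeline, if you prefer that interface (a sketch, assuming get.mav() is defined as above):
library(dplyr)
test2 <- test %>%
  arrange(ID, YEAR_VISIT) %>%
  group_by(ID) %>%
  mutate(BLOOD_PRESSURE_UPDATED = as.numeric(get.mav(BLOOD_PRESSURE, 2))) %>%
  ungroup()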
And here's a data.table solution, which is likely to be much faster if your dataset is large.
library(data.table)
setDT(test) # converts test to a data.table in place
setkey(test,ID,YEAR_VISIT)
test[,BLOOD_PRESSURE_UPDATED:=as.numeric(get.mav(BLOOD_PRESSURE,2)),by=ID]
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1: 1 20 2000 NA 3 134.6667
# 2: 1 21 2001 129 2 131.8333
# 3: 1 22 2002 145 3 137.0000
# 4: 1 22 2002 130 2 137.5000
# 5: 2 23 2003 NA NA 130.0000
# 6: 2 30 2010 150 2 140.0000
# 7: 2 31 2011 110 3 130.0000
# ...
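If you only need the plain two-point rolling mean and not the NA handling above, newer data.table versions also ship frollmean(). Continuing from the setDT()/setkey() calls above, a sketch:
library(data.table)   # frollmean() needs data.table >= 1.12.0
# right-aligned rolling mean of width 2 within each ID;
# NA readings propagate and the first row of each ID stays NA
test[, ma2 := frollmean(BLOOD_PRESSURE, 2, align = "right"), by = ID]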
Try this:
library(dplyr)
library(zoo)
test2 <- arrange(test, ID, YEAR_VISIT) %>%
  group_by(ID) %>%   # the data has no `subject` column, so group by ID
  mutate(ma2 = rollapply(BLOOD_PRESSURE, 2, mean, align = 'right', fill = NA))
Related
I want to keep only the 2 largest values in a column of a df for each unique value of an ID column. For example, I have this df:
df <- data.frame('ID' = c(1,1,1,2,2,3,4,4,4,5),
'YEAR' = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
'WAGES' = c(100,98,60,120,80,300,50,40,30,500));
I want to drop the 3rd and 9th rows; equivalently, keep the two largest values in the WAGES column for each ID. The real df has roughly 300,000 rows.
You can use dplyr's top_n:
library(dplyr)
df %>%
group_by(ID) %>%
top_n(n = 2, wt = WAGES)
## A tibble: 8 x 3
## Groups: ID [5]
# ID YEAR WAGES
# <dbl> <dbl> <dbl>
#1 1 2002 100
#2 1 2002 98
#3 2 2002 120
#4 2 2003 80
#5 3 2005 300
#6 4 2010 50
#7 4 2011 40
#8 5 2008 500
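top_n() has since been superseded; if you are on dplyr 1.0.0 or later, slice_max() does the same thing and returns the rows sorted by the ranking variable within each group:
library(dplyr)
df %>%
  group_by(ID) %>%
  slice_max(WAGES, n = 2) %>%
  ungroup()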
If I understood your question correctly, using base R:
# keep, for each ID, the rows with the two largest WAGES
keep <- ave(-df$WAGES, df$ID, FUN = rank) <= 2
df[keep, ]
#    ID YEAR WAGES
# 1   1 2002   100
# 2   1 2002    98
# 4   2 2002   120
# 5   2 2003    80
# 6   3 2005   300
# 7   4 2010    50
# 8   4 2011    40
# 10  5 2008   500
Note: rank() is applied to -WAGES within each ID, so ranks 1 and 2 mark the two largest wages per ID; rows 3 and 9 are dropped, as requested.
So let's take the following data:
set.seed(123)
A <- 1:10
age<- sample(20:50,10)
height <- sample(100:210,10)
df1 <- data.frame(A, age, height)
B <- c(1,1,1,2,2,3,3,5,5,5,5,8,8,9,10,10)
injury <- sample(letters[1:5],16, replace=T)
df2 <- data.frame(B, injury)
Now, we can merge the data using the following code:
df3 <- merge(df1, df2, by.x = "A", by.y = "B", all=T)
head(df3)
# A age height injury
# 1 1 28 206 e
# 2 1 28 206 d
# 3 1 28 206 d
# 4 2 43 149 e
# 5 2 43 149 d
# 6 3 31 173 d
But what I want in the new data frame is the injuries spread out as separate columns, one column per injury occurrence, so that the desired output has one row per A.
In this simple example we know that the maximum number of injuries per unique df2$B is 4, so we need 4 new columns.
But my real data has an unknown maximum, so code is needed to work out the correct number of columns, something like
length(unique(df2$injury[df2$B]))
but that is not correct syntax either; the output should equal 4.
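For reference, the count itself (the largest number of rows for any one value of B) can be computed directly, for example:
max(table(df2$B))   # 4 in this example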
I don't know where the letters are coming from in your sample output, because there are none in the variables in your sample input, but you can try something like:
library(splitstackshape)
dcast.data.table(getanID(df3, c("A", "age")), A + age + height ~ .id, value.var = "injury")
## A age height 1 2 3 4
## 1: 1 28 206 4 3 3 NA
## 2: 2 43 149 4 3 NA NA
## 3: 3 31 173 3 3 NA NA
## 4: 4 44 161 NA NA NA NA
## 5: 5 45 111 3 2 1 4
## 6: 6 21 195 NA NA NA NA
## 7: 7 33 125 NA NA NA NA
## 8: 8 41 104 4 3 NA NA
## 9: 9 32 133 4 NA NA NA
## 10: 10 30 197 1 2 NA NA
This adds a secondary ID based on the first two columns and then spreads it to a wide format.
If you want to accomplish this using the tidyr package, I found it necessary to create an index variable:
df3 %>%
group_by(A) %>%
mutate(ind = row_number()) %>%
spread(ind, injury)
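spread() has since been superseded in tidyr; on tidyr 1.0.0 or later the equivalent with pivot_wider() would be something like:
library(dplyr)
library(tidyr)
df3 %>%
  group_by(A) %>%
  mutate(ind = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = ind, values_from = injury)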
I'd like to go from this:
years
-------
1994
2001
.
.
To this:
int dates
------
8793 # 1994-01-28
8824 # 1994-02-28
8852 # 1994-03-28
8883 # 1994-04-28
8913 # 1994-05-28
8944 # 1994-06-28
8974 # 1994-07-28
9005 # 1994-08-28
9036 # 1994-09-28
9066 # 1994-10-28
9097 # 1994-11-28
9127 # 1994-12-28
11350 # 2001-01-28
11381 # 2001-02-28
11409 # 2001-03-28
11440 # 2001-04-28
11470 # 2001-05-28
11501 # 2001-06-28
11531 # 2001-07-28
11562 # 2001-08-28
11593 # 2001-09-28
11623 # 2001-10-28
11654 # 2001-11-28
11684 # 2001-12-28
.
.
i.e. expanding each year into 12 dates (the 28th of each month of that year), stored as integer days since 1970-01-01.
What is the most efficient way of doing this?
My attempt is painfully slow!
require(data.table)
# Sample data
dt <- data.table(year = c(1994, 2001))
# Create results table
data <- data.table(dates=integer())
for (i in 1:12) {
temp <- dt
temp$dates <- as.integer(as.Date(paste(temp$year, "-", sprintf( "%02d",i),"-28", sep="")))
temp <- subset(temp, select=dates)
data <- rbind(temp,data)
}
# Sort
data <- data[with(data, order(dates)),]
Here's a one-liner:
as.integer(as.Date(apply(expand.grid(1:12,c(1994,2001)), 1,
function(x)paste(x[2], x[1], 28,sep="-"))))
[1] 8793 8824 8852 8883 8913 8944 8974 9005 9036 9066 9097 9127 11350 11381 11409 11440 11470 11501
[19] 11531 11562 11593 11623 11654 11684
And here is the step-by-step explanation:
expand.grid(1:12, c(1994,2001))
Var1 Var2
1 1 1994
2 2 1994
3 3 1994
4 4 1994
5 5 1994
6 6 1994
7 7 1994
8 8 1994
9 9 1994
10 10 1994
11 11 1994
12 12 1994
13 1 2001
14 2 2001
15 3 2001
16 4 2001
17 5 2001
18 6 2001
19 7 2001
20 8 2001
21 9 2001
22 10 2001
23 11 2001
24 12 2001
To that you apply the paste() function to every row, then convert the result to a Date object, which you finally convert to an integer (days since 1970-01-01 by default).
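The row-wise apply() can also be avoided entirely, since paste() and as.Date() are vectorised. A sketch of the same idea, assuming the years sit in a vector called years:
years <- c(1994, 2001)
g <- expand.grid(month = 1:12, year = years)
# builds all "YYYY-MM-28" strings at once, then converts to Date and to integer days
as.integer(as.Date(paste(g$year, sprintf("%02d", g$month), "28", sep = "-")))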
Try this. The inputs and output are all data tables:
# input data
Y <- data.table(year = c(1994, 2001))
M <- data.table(month = 1:12)
as.data.table( merge.data.frame( M, Y ))[,
list(`int dates` = as.numeric(as.Date(ISOdate(year, month, 28))))
]
If you are going to move this data back to Excel, add 25569 to the numbers in Excel to get your dates back: R counts days from 1970-01-01, while Excel's date serial number for 1970-01-01 is 25569, so adding that offset converts R's day counts into Excel date serials.
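For example, for 1994-01-28:
as.integer(as.Date("1994-01-28"))           # 8793  -> days since 1970-01-01 in R
as.integer(as.Date("1994-01-28")) + 25569   # 34362 -> Excel's serial number for 1994-01-28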