I'd like to go from this:
years
-------
1994
2001
.
.
To this:
int dates
------
8793 # 1994-01-28
8824 # 1994-02-28
8852 # 1994-03-28
8883 # 1994-04-28
8913 # 1994-05-28
8944 # 1994-06-28
8974 # 1994-07-28
9005 # 1994-08-28
9036 # 1994-09-28
9066 # 1994-10-28
9097 # 1994-11-28
9127 # 1994-12-28
11350 # 2001-01-28
11381 # 2001-02-28
11409 # 2001-03-28
11440 # 2001-04-28
11470 # 2001-05-28
11501 # 2001-06-28
11531 # 2001-07-28
11562 # 2001-08-28
11593 # 2001-09-28
11623 # 2001-10-28
11654 # 2001-11-28
11684 # 2001-12-28
.
.
i.e. expanding each year into 12 dates (the 28th of each month of that year), stored as integer days since 1970-01-01.
What is the most efficient way of doing this?
My attempt is painfully slow!
require(data.table)
# Sample data
dt <- data.table(year = c(1994, 2001))
# Create results table
data <- data.table(dates=integer())
for (i in 1:12) {
  temp <- dt
  temp$dates <- as.integer(as.Date(paste(temp$year, "-", sprintf("%02d", i), "-28", sep = "")))
  temp <- subset(temp, select = dates)
  data <- rbind(temp, data)
}
# Sort
data <- data[with(data, order(dates)),]
Here's a one-liner:
as.integer(as.Date(apply(expand.grid(1:12, c(1994, 2001)), 1,
                         function(x) paste(x[2], x[1], 28, sep = "-"))))
[1] 8793 8824 8852 8883 8913 8944 8974 9005 9036 9066 9097 9127 11350 11381 11409 11440 11470 11501
[19] 11531 11562 11593 11623 11654 11684
And here is the step-by-step explanation:
expand.grid(1:12, c(1994,2001))
Var1 Var2
1 1 1994
2 2 1994
3 3 1994
4 4 1994
5 5 1994
6 6 1994
7 7 1994
8 8 1994
9 9 1994
10 10 1994
11 11 1994
12 12 1994
13 1 2001
14 2 2001
15 3 2001
16 4 2001
17 5 2001
18 6 2001
19 7 2001
20 8 2001
21 9 2001
22 10 2001
23 11 2001
24 12 2001
To that you apply the paste() function on every row. The result is then converted to a Date object, which in turn is converted to an integer (by default, days since 1970-01-01).
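For a big table of years, a fully vectorized variant avoids the row-wise apply() entirely. This is just a sketch, assuming the years live in a data.table column called year as in the question:
library(data.table)
dt <- data.table(year = c(1994, 2001))
# Cross every year with months 1:12 (CJ sorts the result), build all the date
# strings in one vectorized sprintf() call, then convert to Date and finally
# to integer days since 1970-01-01.
grid <- CJ(year = as.integer(dt$year), month = 1:12)
res  <- data.table(dates = as.integer(as.Date(sprintf("%d-%02d-28", grid$year, grid$month))))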
Try this. The inputs and output are all data tables:
# input data
Y <- data.table(year = c(1994, 2001))
M <- data.table(month = 1:12)
as.data.table(merge.data.frame(M, Y))[,
  list(`int dates` = as.integer(as.Date(ISOdate(year, month, 28))))  # as.integer to match the desired integer dates
]
If you are going to move this data back to Excel, add 25569 to the numbers in Excel and you get your dates. Excel's date serial numbers use a different origin than R's 1970-01-01, and 25569 is the offset I use to bring the dates back into Excel's format.
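As a quick sanity check on where 25569 comes from: R counts days from 1970-01-01, while Excel's default 1900 date system effectively counts from 1899-12-30, and 25569 is the gap between the two origins.
# Difference between R's date origin and Excel's effective origin, in days
as.integer(as.Date("1970-01-01") - as.Date("1899-12-30"))
# [1] 25569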
I have a longitudinal follow-up of blood pressure recordings.
The value at a certain point is less predictive than the moving average (rolling mean), which is why I'd like to calculate it. The data look like:
test <- read.table(header=TRUE, text = "
ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT
1 20 2000 NA 3
1 21 2001 129 2
1 22 2002 145 3
1 22 2002 130 2
2 23 2003 NA NA
2 30 2010 150 2
2 31 2011 110 3
4 50 2005 140 3
4 50 2005 130 3
4 50 2005 NA 3
4 51 2006 312 2
5 27 2010 140 4
5 28 2011 170 4
5 29 2012 160 NA
7 40 2007 120 NA
")
I'd like to calculate a new variable, called BLOOD_PRESSURE_UPDATED. This variable should be the moving average for BLOOD_PRESSURE and have the following characteristics:
A moving average is the current value plus the previous value, divided by two.
For the first observation, BLOOD_PRESSURE_UPDATED is just the current BLOOD_PRESSURE. If that is missing, BLOOD_PRESSURE_UPDATED should be the overall mean.
Missing values should be filled in with the nearest previous value.
I've tried the following:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = rollmean(x=BLOOD_PRESSURE, 2)) %>%
ungroup()
I have also tried rollapply and rollmeanr without success.
How about this?
library(dplyr)
test2 <- arrange(test, ID, YEAR_VISIT) %>%
  mutate(lag1 = lag(BLOOD_PRESSURE),
         lag2 = lag(BLOOD_PRESSURE, 2),
         movave = (lag1 + lag2) / 2)
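Note that the question asks for per-ID averages of the current and the previous value; a grouped variant of the same idea (just a sketch, not part of the original answer) would be:
library(dplyr)
test2 <- test %>%
  arrange(ID, YEAR_VISIT) %>%
  group_by(ID) %>%                                                  # keep lags within each subject
  mutate(movave = (BLOOD_PRESSURE + lag(BLOOD_PRESSURE)) / 2) %>%   # current + previous, halved
  ungroup()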
Another solution uses the rollapply() function from the zoo package (which I prefer):
library(dplyr)
library(zoo)
test2 <- arrange(test, ID, YEAR_VISIT) %>%
  mutate(ma2 = rollapply(BLOOD_PRESSURE, 2, mean, align = 'right', fill = NA))
slider is a newer alternative that plays nicely with the tidyverse.
Something like this would do the trick:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = slider::slide_dbl(BLOOD_PRESSURE, mean, .before = 1, .after = 0)) %>%
ungroup()
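If the NAs in BLOOD_PRESSURE should not propagate into the rolling mean, one possible tweak (my assumption, not part of the original answer) is to forward na.rm = TRUE to mean(), since slide_dbl() passes extra arguments through to the function it applies:
library(dplyr)
test2 <- test %>%
  group_by(ID) %>%
  arrange(ID, YEAR_VISIT) %>%
  mutate(BLOOD_PRESSURE_UPDATED = slider::slide_dbl(
    BLOOD_PRESSURE, mean, na.rm = TRUE,   # na.rm is forwarded to mean()
    .before = 1, .after = 0
  )) %>%
  ungroup()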
If you are not committed to dplyr, this should work:
get.mav <- function(bp, n = 2) {
  require(zoo)
  # If the first value is missing, use the mean of the non-missing values
  if (is.na(bp[1])) bp[1] <- mean(bp, na.rm = TRUE)
  # Fill remaining NAs with the nearest previous value
  bp <- na.locf(bp, na.rm = FALSE)
  if (length(bp) < n) return(bp)
  # Keep the first n-1 values as-is, then take a right-aligned rolling mean of width n
  c(bp[1:(n - 1)], rollapply(bp, width = n, mean, align = "right"))
}
test <- with(test,test[order(ID,YEAR_VISIT),])
test$BLOOD_PRESSURE_UPDATED <-
unlist(aggregate(BLOOD_PRESSURE~ID,test,get.mav,na.action=NULL,n=2)$BLOOD_PRESSURE)
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1 1 20 2000 NA 3 134.6667
# 2 1 21 2001 129 2 131.8333
# 3 1 22 2002 145 3 137.0000
# 4 1 22 2002 130 2 137.5000
# 5 2 23 2003 NA NA 130.0000
# 6 2 30 2010 150 2 140.0000
# 7 2 31 2011 110 3 130.0000
# ...
This works for moving averages > 2 as well.
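For example, a three-observation window per ID could be computed like this (a sketch reusing get.mav from above):
# Same aggregate() call as above, but with a window of 3 observations
test$BLOOD_PRESSURE_MA3 <-
  unlist(aggregate(BLOOD_PRESSURE ~ ID, test, get.mav, na.action = NULL, n = 3)$BLOOD_PRESSURE)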
And here's a data.table solution, which is likely to be much faster if your dataset is large.
library(data.table)
setDT(test) # converts test to a data.table in place
setkey(test,ID,YEAR_VISIT)
test[, BLOOD_PRESSURE_UPDATED := as.numeric(get.mav(BLOOD_PRESSURE, 2)), by = ID]
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1: 1 20 2000 NA 3 134.6667
# 2: 1 21 2001 129 2 131.8333
# 3: 1 22 2002 145 3 137.0000
# 4: 1 22 2002 130 2 137.5000
# 5: 2 23 2003 NA NA 130.0000
# 6: 2 30 2010 150 2 140.0000
# 7: 2 31 2011 110 3 130.0000
# ...
Try this:
library(dplyr)
library(zoo)
test2 <- arrange(test, ID, YEAR_VISIT) %>%
  group_by(ID) %>%   # the data has no "subject" column; group by ID
  mutate(ma2 = rollapply(BLOOD_PRESSURE, 2, mean, align = 'right', fill = NA))
I am working with a large dataset of patent data. Each row is an individual patent, and columns contain information including application year and number of citations in the patent.
> head(p)
allcites appyear asscode assgnum cat cat_ocl cclass country ddate gday gmonth
1 6 1974 2 1 6 6 2/161.4 US 6 1
2 0 1974 2 1 6 6 5/11 US 6 1
3 20 1975 2 1 6 6 5/430 US 6 1
4 4 1974 1 NA 5 <NA> 114/354 6 1
5 1 1975 1 NA 6 6 12/142S 6 1
6 3 1972 2 1 6 6 15/53.4 US 6 1
gyear hjtwt icl icl_class icl_maingroup iclnum nclaims nclass nclass_ocl
1 1976 1 A41D 1900 A41D 19 1 4 2 2
2 1976 1 A47D 701 A47D 7 1 3 5 5
3 1976 1 A47D 702 A47D 7 1 24 5 5
4 1976 1 B63B 708 B63B 7 1 7 114 9
5 1976 1 A43D 900 A43D 9 1 9 12 12
6 1976 1 B60S 304 B60S 3 1 12 15 15
patent pdpass state status subcat subcat_ocl subclass subclass1 subclass1_ocl
1 3930271 10030271 IL 63 63 161.4 161.4 161
2 3930272 10156902 PA 65 65 11.0 11 11
3 3930273 10112031 MO 65 65 430.0 430 331
4 3930274 NA CA 55 NA 354.0 354 2
5 3930275 NA NJ 63 63 NA 142S 142
6 3930276 10030276 IL 69 69 53.4 53.4 53
subclass_ocl term_extension uspto_assignee gdate
1 161 0 251415 1976-01-06
2 11 0 246000 1976-01-06
3 331 0 10490 1976-01-06
4 2 0 0 1976-01-06
5 142 0 0 1976-01-06
6 53 0 243840 1976-01-06
I am attempting to create a new data frame that contains the mean number of citations (allcites) per application year (appyear), separated by category (cat), for patents from 1970 to 2006 (the data go all the way back to 1901). I did this successfully, but my solution feels ad hoc and does not take advantage of R's capabilities. Here is my solution:
#citations by category
citescat <- data.frame("chem"=integer(37),
"comp"=integer(37),
"drugs"=integer(37),
"ee"=integer(37),
"mech"=integer(37),
"other"=integer(37),
"year"=1970:2006
)
for (i in 1:37) {
  for (j in 1:6) {
    citescat[i, j] <- mean(p$allcites[p$appyear == (i + 1969) & p$cat == j], na.rm = TRUE)
  }
}
I am wondering if there is a simple way to do this without the nested for loops, which would also make it easy to apply small tweaks. It is hard to pin down exactly what I am looking for beyond that, but my code just looks ugly to me and I suspect there are better ways to do this in R.
Joran is right: here's a plyr solution. Without your dataset in a usable form it's hard to show you exactly, but here it is on a simplified dataset:
library(plyr)
library(reshape2)  # for dcast()
p <- data.frame(allcites = sample(1:20, 20), appyear = 1974:1975, pcat = rep(1:4, each = 5))
#First calculate the means of each group
cites <- ddply(p, .(appyear, pcat), summarise, meancites = mean(allcites, na.rm = T))
#This gives us the data in long form
# appyear pcat meancites
# 1 1974 1 14.666667
# 2 1974 2 9.500000
# 3 1974 3 10.000000
# 4 1974 4 10.500000
# 5 1975 1 16.000000
# 6 1975 2 4.000000
# 7 1975 3 12.000000
# 8 1975 4 9.333333
#Now use dcast to get it in wide form (which I think your for loop was doing):
citescat <- dcast(cites, appyear ~ pcat)
# appyear 1 2 3 4
# 1 1974 14.66667 9.5 10 10.500000
# 2 1975 16.00000 4.0 12 9.333333
Hopefully you can see how to adapt that to your specific data.
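On the original data, that adaptation might look roughly like this. It's only a sketch: it assumes the column names allcites, appyear and cat shown in the question, and restricts to application years 1970-2006 as described.
library(plyr)
library(reshape2)  # for dcast()
# Mean citations per application year and category, 1970-2006 only
p_sub <- subset(p, appyear >= 1970 & appyear <= 2006)
cites <- ddply(p_sub, .(appyear, cat), summarise,
               meancites = mean(allcites, na.rm = TRUE))
# One row per year, one column per category
citescat <- dcast(cites, appyear ~ cat, value.var = "meancites")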
I have a data frame with counts of geese at several different sites. The aim was to make monthly counts of geese in each of the 8 months from September to April at each site, in consecutive winter periods. A winter period is defined as the 8 months from September through April.
If the method had been carried out as planned, this is what the data would look like:
library(lubridate)  # for dmy()
df <- data.frame(site=c(rep('site 1', 16), rep('site 2', 16), rep('site 3', 16)),
date=dmy(rep(c('01/09/2007', '02/10/2007', '02/11/2007',
'02/12/2007', '02/01/2008', '02/02/2008', '02/03/2008',
'02/04/2008', '01/09/2008', '02/10/2008', '02/11/2008',
'02/12/2008', '02/01/2009', '02/02/2009', '02/03/2009',
'02/04/2009'),3)),
count=sample(1:100, 48))
It's ended up in a situation where some sites have all 8 counts in some September-April periods, but not in others. In addition, some sites never achieved 8 counts in a September-April period. These toy data look like my actual data:
df <- df[-c(11:16, 36:48),]
I need to remove rows from the data frame which do not form part of 8 consecutive counts in a September-April period. Using the toy data, this is the data frame I need:
df <- df[-c(9:10, 27:29), ]
I've tried various commands using ddply() from the plyr package, but without success. Is there a solution to this problem?
One way I can think of is to subtract four months from your dates, so that you can then group by year. To do that month arithmetic, I suggest you use the mondate package. See here for an excellent answer on what problem you'd face when you subtract months and how you can overcome it.
require(plyr)     # for ddply()
require(mondate)
df$grp <- mondate(df$date) - 4
df$year <- year(df$grp)
df$month <- month(df$date)
ddply(df, .(site, year), function(x) {
  if (all(c(1:4, 9:12) %in% x$month)) {
    return(x)
  } else {
    return(NULL)
  }
})
# site date count grp year month
# 1 site 1 2007-09-01 87 2007-05-02 2007 9
# 2 site 1 2007-10-02 44 2007-06-02 2007 10
# 3 site 1 2007-11-02 50 2007-07-03 2007 11
# 4 site 1 2007-12-02 65 2007-08-02 2007 12
# 5 site 1 2008-01-02 12 2007-09-02 2007 1
# 6 site 1 2008-02-02 2 2007-10-03 2007 2
# 7 site 1 2008-03-02 100 2007-11-02 2007 3
# 8 site 1 2008-04-02 29 2007-12-03 2007 4
# 9 site 2 2007-09-01 3 2007-05-02 2007 9
# 10 site 2 2007-10-02 22 2007-06-02 2007 10
# 11 site 2 2007-11-02 56 2007-07-03 2007 11
# 12 site 2 2007-12-02 5 2007-08-02 2007 12
# 13 site 2 2008-01-02 40 2007-09-02 2007 1
# 14 site 2 2008-02-02 15 2007-10-03 2007 2
# 15 site 2 2008-03-02 10 2007-11-02 2007 3
# 16 site 2 2008-04-02 20 2007-12-03 2007 4
# 17 site 2 2008-09-01 93 2008-05-02 2008 9
# 18 site 2 2008-10-02 13 2008-06-02 2008 10
# 19 site 2 2008-11-02 58 2008-07-03 2008 11
# 20 site 2 2008-12-02 64 2008-08-02 2008 12
# 21 site 2 2009-01-02 92 2008-09-02 2008 1
# 22 site 2 2009-02-02 69 2008-10-03 2008 2
# 23 site 2 2009-03-02 89 2008-11-02 2008 3
# 24 site 2 2009-04-02 27 2008-12-03 2008 4
An alternative solution using data.table:
require(data.table)
require(mondate)
dt <- data.table(df)
dt[, `:=`(year = year(mondate(date) - 4), month = month(date))]
dt.out <- dt[, .SD[rep(all(c(1:4, 9:12) %in% month), .N)],
             by = list(site, year)][, c("year", "month") := NULL]
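As a quick check (my own sketch, not part of the original answer), each site/winter combination that survives the filter should contribute exactly eight rows:
# Count retained rows per site; with the toy data above this should show
# 8 rows for site 1 (one complete winter) and 16 for site 2 (two winters)
dt.out[, .N, by = site]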