efficient date comparison in data table - r

I have a data frame (actually a data table) that looks like
id hire.date survey.year
1 15-04-2003 2003
2 16-07-2001 2001
3 06-06-1980 2002
4 17-08-1981 2001
I need to check whether hire.date is earlier than, say, 31 March of survey.year, so I would end up with something like
id hire.date survey.year emp31mar
1 15-04-2003 2003 FALSE
2 16-07-2001 2001 FALSE
3 06-06-1980 2002 TRUE
4 17-08-1981 2001 TRUE
I could always create an object holding March 31st of survey.year and then make the appropriate comparison like so
mar31 = as.Date(paste0("31-03-", as.character(myData$survey.year)), "%d-%m-%Y")
myData$emp31mar = as.Date(myData$hire.date, "%d-%m-%Y") < mar31
but creating the object mar31 is consuming too much time because myData is large-ish (think tens of millions of rows).
I wonder if there is a more efficient way of doing this -- a way that doesn't involve creating an object such as mar31?

You could try the data.table methods for creating the column.
library(data.table)
setDT(df1)[, emp31mar:= as.Date(hire.date, '%d-%m-%Y') <
paste(survey.year, '03-31', sep="-")][]
# id hire.date survey.year emp31mar
#1: 1 15-04-2003 2003 FALSE
#2: 2 16-07-2001 2001 FALSE
#3: 3 06-06-1980 2002 TRUE
#4: 4 17-08-1981 2001 TRUE
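If date parsing itself turns out to be the bottleneck, one possible alternative is to skip Date objects and compare integer yyyymmdd keys built from the string. This is only a sketch: `before_mar31` is a hypothetical helper name, and it assumes hire.date is always a well-formed "dd-mm-yyyy" string.

```r
# Hypothetical helper (not from the original answer): TRUE when the hire date
# falls strictly before 31 March of the survey year.
before_mar31 <- function(hire.date, survey.year) {
  # hire.date is assumed to be a "dd-mm-yyyy" string, as in the question
  hire.date <- as.character(hire.date)
  hire.key <- as.integer(substr(hire.date, 7, 10)) * 10000L +  # yyyy
    as.integer(substr(hire.date, 4, 5)) * 100L +               # mm
    as.integer(substr(hire.date, 1, 2))                        # dd
  # 31 March of the survey year in the same yyyymmdd encoding
  hire.key < as.integer(survey.year) * 10000L + 331L
}

hire.date   <- c("15-04-2003", "16-07-2001", "06-06-1980", "17-08-1981")
survey.year <- c(2003, 2001, 2002, 2001)
emp31mar <- before_mar31(hire.date, survey.year)
emp31mar
```

With a data.table this could be applied as setDT(df1)[, emp31mar := before_mar31(hire.date, survey.year)]. Whether it actually beats the as.Date version on tens of millions of rows is worth benchmarking on your own data.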

Related

Change name of column after uniqueN function

I am already happy with the results, but want to further tidy up my data by giving the right name to the respective column.
The problem to solve is to give the number of distinct authors for each year of publication between 2000 and 2010. Here is my code and my result:
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000, uniqueN(Book_Author), by = "Year_Of_Publication"][order(Year_Of_Publication)]
Year_Of_Publication V1
1: 2000 12057
2: 2001 11818
3: 2002 11942
4: 2003 9913
5: 2004 4536
6: 2005 38
7: 2006 3
8: 2008 1
9: 2010 2
The numbers in the result are right, but I want to change the column name V1 to something like "Num_Of_Dif_Auth". I tried the setnames function, but as I don't want to change the underlying dataset it didn't help.
You can use:
library(data.table)
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000,
.(Num_Of_Dif_Auth = uniqueN(Book_Author)),
by = Year_Of_Publication][order(Year_Of_Publication)]
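A small self-contained sketch of the same idea, using a made-up stand-in for books_dt (the real data is not shown in the question). It also illustrates that setnames can be used after all, as long as it is applied to the aggregated result rather than to books_dt, because the aggregation returns a new data.table.

```r
library(data.table)

# Made-up stand-in for books_dt (assumption: same column names as the question)
books_dt <- data.table(
  Year_Of_Publication = c(1999, 2000, 2000, 2000, 2001),
  Book_Author         = c("A", "A", "B", "B", "C")
)

# Option 1: name the column inside j; keyby also sorts the result by year
res1 <- books_dt[Year_Of_Publication %between% c(2000, 2010),
                 .(Num_Of_Dif_Auth = uniqueN(Book_Author)),
                 keyby = Year_Of_Publication]

# Option 2: aggregate first, then rename V1 on the result only
res2 <- books_dt[Year_Of_Publication %between% c(2000, 2010),
                 uniqueN(Book_Author),
                 by = Year_Of_Publication]
setnames(res2, "V1", "Num_Of_Dif_Auth")

names(books_dt)  # the source table keeps its original column names
```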

New column from non-standard date factor in R

I have a dataframe with an oddly formatted dates column. I'd like to create a column just showing the year from the original date column and I am having trouble coming up with a way to do this because the current date column is being treated as a factor. Any advice on how to do this efficiently would be appreciated.
Example
starting with:
org <- c("a","b","c","d")
country <- c("1","2","3","4")
date <- c("01-09-14","01-10-07","11-31-99","10-31-12")
toy <- data.frame(cbind(org,country,date))
toy
org country date
1 a 1 01-09-14
2 b 2 01-10-07
3 c 3 11-31-99
4 d 4 10-31-12
str(toy$date)
Factor w/ 4 levels "01-09-14","01-10-07",..: 1 2 4 3
Desired result:
org country Year
1 a 1 2014
2 b 2 2007
3 c 3 1999
4 d 4 2012
This should work:
transform(toy,Year=format(strptime(date,"%m-%d-%y"),"%Y"))
This produces
## org country date Year
## 1 a 1 01-09-14 2014
## 2 b 2 01-10-07 2007
## 3 c 3 11-31-99 <NA>
## 4 d 4 10-31-12 2012
I initially thought that the NA value was because the %y format indicator wasn't smart enough to handle previous-century dates, but ?strptime says:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
implying that it should be able to handle it.
The problem is actually that 31 November doesn't exist ...
(You can drop the date column at your leisure ...)
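If the year is still wanted for rows whose day/month combination is invalid, one option is to read the two-digit year straight from the string instead of parsing a full date. The sketch below applies the same 00-68 / 69-99 century rule quoted from ?strptime by hand; the variable names are illustrative only.

```r
# Same dates as in the question, stored as a factor
date <- factor(c("01-09-14", "01-10-07", "11-31-99", "10-31-12"))

# Convert the factor to character before any string handling
date_chr <- as.character(date)

# Flag the entries that are not real calendar dates (31 November here)
invalid <- is.na(as.Date(date_chr, format = "%m-%d-%y"))

# Take the two digits after the last "-" and apply the POSIX century rule:
# 00-68 -> 20xx, 69-99 -> 19xx
yy   <- as.integer(sub(".*-", "", date_chr))
Year <- ifelse(yy <= 68, 2000L + yy, 1900L + yy)

data.frame(date = date_chr, invalid = invalid, Year = Year)
```

This keeps the year for every row while still flagging the rows that need a closer look.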

Aggregate using a certain value

I'm trying to use the aggregate function in R to get the total Emissions for each year, but only for rows where fips is equal to 24510. The following code gives me the right result, but it also returns the Emissions summed across all the other fips values. What am I missing here?
This is the function I'm using:
sum <- aggregate(NEI$Emissions, list(Year = NEI$year, NEI$fips == 24510), sum);
This is the output:
Year Group.2 x
1 1999 FALSE 7329692.557
2 2002 FALSE 5633326.582
3 2005 FALSE 5451611.723
4 2008 FALSE 3462343.556
5 1999 TRUE 3274.180
6 2002 TRUE 2453.916
7 2005 TRUE 3091.354
8 2008 TRUE 1862.282
This is the output that I would like:
Year x
1 1999 3274.180
2 2002 2453.916
3 2005 3091.354
4 2008 1862.282
Should I be using subset separately or can this be done with aggregate alone?
Using this sample
set.seed(15)
NEI <- data.frame(year=2000:2004, fips=rep(c(24510,57399), each=5), Emissions=rnorm(10))
you could use the command
mysum <- aggregate(Emissions~year, subset(NEI, fips == 24510), sum);
to get
year Emissions
1 2000 0.2588229
2 2001 1.8311207
3 2002 -0.3396186
4 2003 0.8971982
5 2004 0.4880163
(also, don't save a value to a variable named sum -- that will conflict with the base function sum())
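On the question of whether this can be done with aggregate alone: the formula method of aggregate accepts a subset argument, so a separate call to subset() is not strictly required. A minimal sketch on the same sample data, comparing both routes:

```r
set.seed(15)
NEI <- data.frame(year = 2000:2004,
                  fips = rep(c(24510, 57399), each = 5),
                  Emissions = rnorm(10))

# Filter through the subset argument of the formula method
by_subset_arg <- aggregate(Emissions ~ year, data = NEI, FUN = sum,
                           subset = fips == 24510)

# Filter with subset() before aggregating, as in the answer above
by_subset_fun <- aggregate(Emissions ~ year,
                           data = subset(NEI, fips == 24510), FUN = sum)

# Both routes should give the same table
all.equal(by_subset_arg, by_subset_fun)
```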

Subsetting a data.table using another data.table

I have the dt and dt1 data.tables.
dt<-data.table(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
dt1<-data.table(id=rep(2, 5), year=c(2005:2009), performance=(1000:1004))
dt
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
4: 4 2005 0
5: 4 2006 1
dt1
id year performance
1: 2 2005 1000
2: 2 2006 1001
3: 2 2007 1002
4: 2 2008 1003
5: 2 2009 1004
I would like to subset the former using the combination of its first and second column that also appear in dt1. As a result of this, I would like to create a new object without overwriting dt. This is what I'd like to obtain.
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
I tried to do this using the following code:
dt.sub<-dt[dt[,c(1:2)] %in% dt1[,c(1:2)],]
but it didn't work. As a result, I got back a data table identical to dt. I think there are at least two mistakes in my code. The first is that I am probably subsetting the data.table by column using a wrong method. The second, and pretty evident, is that %in% applies to vectors and not to multiple-column objects. Nevertheless, I am unable to find a more efficient way to do it...
Thank you in advance for your help!
setkeyv(dt,c('id','year'))
setkeyv(dt1,c('id','year'))
dt[dt1,nomatch=0]
Output -
> dt[dt1,nomatch=0]
id year event performance
1: 2 2005 1 1000
2: 2 2006 0 1001
3: 2 2007 0 1002
Use merge:
merge(dt,dt1, by=c("year","id"))
year id event performance
1: 2005 2 1 1000
2: 2006 2 0 1001
3: 2007 2 0 1002
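Both answers above also bring in the performance column, whereas the question asks for only the columns of dt. One way to get that, sketched here under the assumption that a reasonably recent data.table with the on= argument is available, is to join on just the distinct (id, year) pairs of dt1. This also avoids setting keys, so dt is left untouched.

```r
library(data.table)

dt  <- data.table(id = c(rep(2, 3), rep(4, 2)),
                  year = c(2005:2007, 2005:2006),
                  event = c(1, 0, 0, 0, 1))
dt1 <- data.table(id = rep(2, 5), year = 2005:2009, performance = 1000:1004)

# Keep only the distinct join columns of dt1 so no extra columns are carried over
pairs <- unique(dt1[, .(id, year)])

# Inner join: rows of dt whose (id, year) pair also occurs in dt1
dt.sub <- dt[pairs, on = .(id, year), nomatch = 0L]
dt.sub
```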

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are subsampled), so I can't just take a normal average; I would like to bootstrap samples.
The bootstrap should draw 5 random weights at a given age, compute their mean, repeat this 1000 times, and then average the means. Values should be drawn with replacement. This should be done for each age in every AreaCode for every year. Grouping factors: Year, AreaCode, Age.
So here's an example of what my data could look like.
df <- data.frame( Year= rep(c(2000:2008),2), AreaCode = c("39G4", "38G5","40G5"), Age = c(0:8), IndWgt = c(rnorm(18, mean=5, sd=3)))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations; in reality I have 85 different levels. The time series stretches over 1991-2013, and the ages over 0-15. IndWgt contains the weight. My whole data frame has 185726 rows.
Also, not every age exists for every location and every year. I don't know if this would be a problem; I just want to make sure the script isn't based on references to certain row numbers. There are some NA values in the weight column, but I could just remove them beforehand.
I was thinking that I should maybe use replicate together with apply or a plyr function. I've tried to understand the boot function, but I don't really know whether I should write my arguments under statistic, and in that case how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr? From the question, I think you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just keep the bootstrap branch of the if/else for every age.
require(plyr)
#cod<-read.csv("cod.csv",header=T) #I loaded your data from csv
bootstrap <- function(Age, IndWgt) {
  # Age is constant within each Year/AreaCode/Age group, so test its first value
  if (Age[1] > 2) {
    mean(IndWgt)  # old fish: plain mean
  } else {
    # young fish: mean of 1000 bootstrap means, each from 5 draws with replacement
    mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE))))
  }
}
ddply(cod,.(Year,AreaCode,Age),summarize,boot_mean=bootstrap(Age,IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod,.(Year,AreaCode,Age),
summarize,
boot_mean=mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test it properly. You should get your first step using the following code. If you wrap this into replicate, you should get your end result that you can average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = function(x) {
  # one bootstrap replicate: mean of 5 values drawn with replacement
  get.em <- sample(x, size = 5, replace = TRUE)
  mean(get.em)
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
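One caveat with both answers: when a group contains a single weight w of at least 1, sample(w, 5, replace = TRUE) draws from 1:w instead of returning w itself. A sketch that sidesteps this by sampling indices, and that wraps the replicate step directly, could look like the following. `boot_mean` and its parameters are hypothetical names, and the data is regenerated with an arbitrary seed rather than taken from the question.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(Year = rep(2000:2008, 2),
                 AreaCode = c("39G4", "38G5", "40G5"),
                 Age = 0:8,
                 IndWgt = rnorm(18, mean = 5, sd = 3))

# Hypothetical helper: average of n_rep bootstrap means of n_draw values each
boot_mean <- function(x, n_draw = 5, n_rep = 1000) {
  x <- x[!is.na(x)]                      # drop missing weights
  if (length(x) == 0) return(NA_real_)   # nothing to sample from
  # Sample indices rather than values, so a single observation is handled safely
  mean(replicate(n_rep,
                 mean(x[sample.int(length(x), n_draw, replace = TRUE)])))
}

# Only Year/AreaCode/Age combinations present in the data produce a row
result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = boot_mean)
result
```

Because aggregate only returns groups that occur in the data, missing year/age/location combinations simply do not appear in the result.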
