data.table count of rows indexing by category - r

I'm stuck on an easy one, but didn't find a solution in either the data.table manual or around here.
dt<-data.table(account=c("treu65","treu65","treg23","treg23","treg23"),year=c("2012","2013","2013","2013","2012"))
I need to add a column with a count of rows by account and year. The problem is that I need to create two separate columns. One will contain the count for 2012, the other for 2013.
Like so:
account year count2012 count2013
1: treu65 2012 1 1
2: treu65 2013 1 1
3: treg23 2013 1 2
4: treg23 2013 1 2
5: treg23 2012 1 2
Normally I would aggregate, but in this case I need the same structure as above.
I got as far as:
dt[year==2012,count2012:=.N,.(account)]
dt[year==2013,count2013:=.N,.(account)]
But I have NAs now:
account year count2012 count2013
1: treu65 2012 1 NA
2: treu65 2013 NA 1
3: treg23 2013 NA 2
4: treg23 2013 NA 2
5: treg23 2012 1 NA
And I should get:
account year count2012 count2013
1: treu65 2012 1 1
2: treu65 2013 1 1
3: treg23 2013 1 2
4: treg23 2013 1 2
5: treg23 2012 1 2
Thank you.

You can move the filter from the i position (where it only lets you modify the matching rows, leaving the rest NA) into j and use sum() to count the rows:
dt[, `:=`(count2012 = sum(year == 2012), count2013 = sum(year == 2013)), .(account)][]
# account year count2012 count2013
#1: treu65 2012 1 1
#2: treu65 2013 1 1
#3: treg23 2013 1 2
#4: treg23 2013 1 2
#5: treg23 2012 1 2
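If there were more than two years, hard-coding one sum() per year gets tedious. A hedged sketch of a more general approach (not part of the answer above): count rows per account/year, reshape wide with dcast, then join the counts back by account.
counts <- dcast(dt[, .N, by = .(account, year)], account ~ year, value.var = "N", fill = 0)
setnames(counts, c("2012", "2013"), c("count2012", "count2013"))
dt[counts, on = "account", `:=`(count2012 = i.count2012, count2013 = i.count2013)][]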

data.table: is it possible to merge .SD and return a new 'sub data table' by group?

I have a data table organized by id and year, with a frequency (freq) value for every year where the frequency is at least 1. The start and end year may differ for every id.
Example:
> dt <- data.table(id=c('A','A','A','A','B','B','B','B'),year=c(2010,2012,2013,2015,2006,2007,2010,2011),freq=c(2,1,4,3,1,3,5,7))
> dt
id year freq
1: A 2010 2
2: A 2012 1
3: A 2013 4
4: A 2015 3
5: B 2006 1
6: B 2007 3
7: B 2010 5
8: B 2011 7
I would like to make each time series by id complete, i.e. add rows with freq=0 for any missing year. So the result for the example above should look like this:
id year freq
A 2010 2
A 2011 0
A 2012 1
A 2013 4
A 2014 0
A 2015 3
B 2006 1
B 2007 3
B 2008 0
B 2009 0
B 2010 5
B 2011 7
I'm starting with data.table and I'm interested to see if this is doable. With plyr or dplyr I would have used a merge operation with a complete column of years for every sub dataframe by id. Is there an equivalent to this solution with data.table?
We can't use a simple CJ-based approach here because the year range to fill differs by id. An alternative is to build the per-id year sequence and join the original data onto it:
library(data.table)
## build the complete per-id year sequence, join the original data onto it,
## then replace the NA freq values introduced for the missing years
dt[ dt[, .(year = do.call(seq, as.list(range(year)))), by = .(id)],
    on = .(id, year)
][is.na(freq), freq := 0][]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7
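If you are on data.table >= 1.12.4, the NA-replacement step can also be done with setnafill() instead of the is.na() subset. A minimal sketch of the same join, not part of the original answer:
res <- dt[dt[, .(year = seq(min(year), max(year))), by = id], on = .(id, year)]
setnafill(res, type = "const", fill = 0, cols = "freq")  ## replace the NA freq values by reference
res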
Another solution, maybe more explicit than @r2evans's. First make a table of the complete series:
years <- dt[, list(year= seq(min(year), max(year))), by= id]
years
id year
1: A 2010
2: A 2011
3: A 2012
4: A 2013
5: A 2014
6: A 2015
7: B 2006
8: B 2007
9: B 2008
10: B 2009
11: B 2010
12: B 2011
then merge and replace NAs:
full <- merge(dt, years, all.y= TRUE)
full[, freq := ifelse(is.na(freq), 0, freq)]
full
id year freq
1: A 2010 2
2: A 2011 0
3: A 2012 1
4: A 2013 4
5: A 2014 0
6: A 2015 3
7: B 2006 1
8: B 2007 3
9: B 2008 0
10: B 2009 0
11: B 2010 5
12: B 2011 7
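As a side note, ifelse() rebuilds the whole column; assigning only to the NA rows does the same thing by reference and is the more idiomatic data.table form. A one-line sketch:
full[is.na(freq), freq := 0]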
Here is another data.table way to solve your problem:
dt[, .SD[.(min(year):max(year)), on = "year"], by = id][is.na(freq), freq := 0][]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7

Condition on grouped observation in a panel dataframe

I have a dataframe that looks like this
id year changetype
1 2010 1
1 2012 2
2 2014 2
2 2014 2
3 2012 1
3 2012 2
3 2014 2
3 2014 1
I want to get something like this
id year changetype
1 2010 1
1 2012 2
2 2014 2
2 2014 2
In other words, I want to remove all observations associated with id 3 because, in the same year (2012), id 3 has both changetype = 1 and changetype = 2.
How can I impose a condition on a variable for observations grouped by id and year?
Many thanks to everyone helping me.
You can use the data.table package to achieve this:
library(data.table)
setDT(dt)
dt[, count := uniqueN(changetype), by = .(id, year)]   # distinct changetypes per id-year
dt[, keep := max(count), by = id][keep == 1, .(id, year, changetype)]
id year changetype
1: 1 2010 1
2: 1 2012 2
3: 2 2014 2
4: 2 2014 2
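An alternative sketch (not part of the answer above) that skips the helper columns: find the ids whose (id, year) groups contain more than one distinct changetype, then drop those ids.
bad_ids <- dt[, .(n = uniqueN(changetype)), by = .(id, year)][n > 1, unique(id)]
dt[!id %in% bad_ids]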

R: Creating id by ordered column in data table not working correctly

I am trying to create a unique ID column based on a sorted column in a data table. I have reproduced a simple example here but I am not getting the ID in the correct order.
t <- data.table(YEAR = c(2007, 2009, 2011, 2001, 1994, 2005))
t[, id := order(YEAR)]
It is returning the following:
YEAR id
1: 2007 5
2: 2009 4
3: 2011 6
4: 2001 1
5: 1994 2
6: 2005 3
But I was expecting:
YEAR id
1: 2007 4
2: 2009 5
3: 2011 6
4: 2001 2
5: 1994 1
6: 2005 3
I've made this mistake before. You want rank():
t[, id := rank(YEAR)]
# YEAR id
# 1: 2007 4
# 2: 2009 5
# 3: 2011 6
# 4: 2001 2
# 5: 1994 1
# 6: 2005 3
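data.table also ships frank(), a faster drop-in for base rank(). Assuming YEAR has no ties (as in this example), the following sketch gives the same ids:
t[, id := frank(YEAR)]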

Drop subgroup of obs in dataframe if first observation of group is NA

In R I have a dataframe df of this form:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
NA 5 2011 04 1234759
5 5 2011 05 1234759
5 5 2011 06 1234759
2 2 2001 11 1234760
NA NA 2001 11 1234760
Some of the a's and b's are NAs. I wish to subset the dataframe by id, have each subset ordered by year and month, and then drop the whole subset/id if the first observation in order of time has NA in either a or b.
For the example above, the intended result is:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
2 2 2001 11 1234760
NA NA 2001 11 1234760
I did it the non-vectorized way, which took forever to run, as follows:
df_summary <- as.data.frame(table(df$id), stringsAsFactors = FALSE)
df <- df[order(df$id, df$year, df$month), ]
remove <- ""
j <- 1
l <- 0
for (i in 1:nrow(df_summary)) {
  m <- df_summary$Freq[i]            # number of rows for this id
  if (is.na(df$a[j]) | is.na(df$b[j])) {
    l <- l + 1
    remove[l] <- df_summary$Var1[i]  # id whose first row has an NA
  }
  j <- j + m
}
df <- df[!(df$id %in% remove), ]
What is a faster, vectorized way to achieve the same result?
What I tried, also to double-check my code:
dt <- setDT(df)
remove_vectorized <- dt[,list(remove_first_na=(is.na(a[1]) | is.na(b[1]))),by=id]
which suggests removing ALL observations, which is patently wrong.
Here are a few possible data.table approaches.
First, fixing your attempt:
library(data.table)
setDT(df)[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
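One caveat, not shown with the question's (already sorted) sample data: these approaches take a[1L]/b[1L] in whatever row order df currently has, so the data should be sorted by time within id first. A sketch, assuming year and month are the time columns as in the question:
setDT(df)
setorder(df, id, year, month)  # order chronologically within each id
df[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]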
Or we can generalize this (probably at the expense of some speed):
setDT(df)[, if(Reduce(`&`, !is.na(.SD[1L, .(a, b)]))) .SD, by = id]
## OR maybe `setDT(df)[, if(Reduce(`&`, !sapply(.SD[1L, .(a, b)], is.na))) .SD, by = id]`
## in order to avoid the matrix conversion
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Another way is to combine the unique and na.omit methods:
indx <- na.omit(unique(setDT(df), by = "id"), cols = c("a", "b"))
Then, a simple subset will do
df[id %in% indx$id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or maybe a binary join?
df[indx[, .(id)], on = "id"]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or
indx <- na.omit(unique(setDT(df, key = "id"), by = "id"), cols = c("a", "b"))
df[.(indx$id)]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
(The last two are mainly for illustration)
For more info regarding data.table, please visit the Getting Started page on GitHub.

Create a new variable for the epidemiological week

I have a data frame with a week column and a year column (87 weeks in total). I need to create a new column (weekseq) with a number that identifies each week sequentially from first to last. I don't know how to do this. Can someone help me?
Example:
id week month year yearweek weekseq
1 1 1 2014 2014/1
1 1 1 2013 2013/1
1 2 1 2014 2014/2
1 2 1 2013 2013/2
1 3 1 2014 2014/3
1 3 1 2013 2013/3
1 4 1 2014 2014/4
1 4 1 2013 2013/4
1 5 1 2014 2014/5
1 5 1 2013 2013/5
1 6 2 2014 2014/6
1 6 2 2013 2013/6
1 7 2 2014 2014/7
1 7 2 2013 2013/7
1 8 2 2014 2014/8
1 8 2 2013 2013/8
1 9 2 2014 2014/9
1 9 2 2013 2013/9
1 10 3 2014 2014/10
1 10 3 2013 2013/10
1 11 3 2014 2014/11
1 11 3 2013 2013/11
1 12 3 2014 2014/12
1 12 3 2013 2013/12
This solution requires the 'dplyr' and 'plyr' packages:
library(plyr)
library(dplyr)
# Coerce into a tbl_df
datatbl <- tbl_df(data)
# Arrange by year, then month, then week
datatbl <- arrange(datatbl, year, month, week)
# Create a new column that numbers the arranged rows sequentially within each id
seqtbl <- ddply(datatbl, .(id), transform, weekseq = seq_along(id))
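Since the rest of this page is data.table-centric, here is a hedged data.table sketch of the same idea (assuming the data frame is called data, as in the answer above, and that weeks should be numbered within each id):
library(data.table)
setDT(data)
setorder(data, id, year, month, week)   # sort chronologically within each id
data[, weekseq := seq_len(.N), by = id] # number the weeks 1..N per id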
