Aggregating data.table with sum, length and grep - r

Let's make a data.table:
dt <- data.table(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
x.1 x.2 x.3 vessel Year
1: 1 1 2 a 2012
2: 2 2 3 a 2013
3: 3 3 4 a 2014
4: 4 4 5 a 2015
5: 5 5 6 b 2012
6: 6 6 7 b 2013
7: 7 7 8 b 2014
8: 8 8 9 b 2015
I can aggregate it, using the functions sum and length, to get the sum of all the x columns in each year and the number of unique vessels each year, like this:
dt[,
   list(
     x.1 = sum(x.1),
     x.2 = sum(x.2),
     x.3 = sum(x.3),
     vessels = length(unique(vessel))),
   by = list(Year = Year)]
Year x.1 x.2 x.3 vessels
1: 2012 6 6 8 2
2: 2013 8 8 10 2
3: 2014 10 10 12 2
4: 2015 12 12 14 2
This is what I want, but in my real data I have a lot of columns, so I would like to use grep or %like%, but I cannot get it to work. I was thinking of something along these lines:
dt[, grep("x", colnames(dt)), with = FALSE]
But how to merge that with the aggregate?

You can use lapply to apply a function to all columns (.SD) or to a selection of columns (chosen with .SDcols):
dt[, lapply(.SD, sum), by=Year, .SDcols=c("x.1","x.2")]
The following should also work, selecting all columns whose names start with "x":
dt[, c(lapply(.SD, sum), vessel=uniqueN(vessel)),
by=Year,
.SDcols=grepl("^x", names(dt))
]
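As a sketch for newer versions of data.table (assuming >= 1.12.0, where patterns() is accepted in .SDcols), the regex selection can also be written without grepl:

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# patterns() selects the .SD columns by a regex on their names
res <- dt[, c(lapply(.SD, sum), vessels = uniqueN(vessel)),
          by = Year, .SDcols = patterns("^x")]
res
```

This gives the same four-row result as the grepl version, one row per Year with the column sums and the vessel count.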

If you have many columns to aggregate, it might be worthwhile to consider reshaping your data from wide to long format using melt() and aggregating using dcast():
molten <- melt(dt, id.vars = c("Year", "vessel"))
molten
# Year vessel variable value
# 1: 2012 a x.1 1
# 2: 2013 a x.1 2
# 3: 2014 a x.1 3
# 4: 2015 a x.1 4
# 5: 2012 b x.1 5
# ...
#19: 2014 a x.3 4
#20: 2015 a x.3 5
#21: 2012 b x.3 6
#22: 2013 b x.3 7
#23: 2014 b x.3 8
#24: 2015 b x.3 9
# Year vessel variable value
dcast(molten, Year ~ variable, sum)
# Year x.1 x.2 x.3
#1: 2012 6 6 8
#2: 2013 8 8 10
#3: 2014 10 10 12
#4: 2015 12 12 14
Now, the number of vessels per year
dt[, .(vessels = uniqueN(vessel)), Year]
# Year vessels
#1: 2012 2
#2: 2013 2
#3: 2014 2
#4: 2015 2
finally needs to be appended using a join:
dcast(molten, Year ~ variable, sum)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
# Year x.1 x.2 x.3 vessels
#1: 2012 6 6 8 2
#2: 2013 8 8 10 2
#3: 2014 10 10 12 2
#4: 2015 12 12 14 2
Tips
The measure.vars parameter of melt() lets you define/select/restrict the relevant measure columns.
The subset parameter of dcast() lets you select specific measure variables or exclude others.
You can use more than one aggregation function in dcast(), which allows for fancy things like:
dcast(molten, Year ~ variable, list(mean, sum, max),
      subset = .(variable == "x.2")
)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
# Year value_mean_x.2 value_sum_x.2 value_max_x.2 vessels
#1: 2012 3 6 5 2
#2: 2013 4 8 6 2
#3: 2014 5 10 7 2
#4: 2015 6 12 8 2

If you really need this to be efficient, one option is a single long data.table chain:
dt[, .SD
][, .N, .(vessel, Year)
][, .N, .(Year)
][, copy(dt)[.SD, vessels := i.N, on='Year']
][, vessel := NULL
][, melt(.SD, id.vars=c('Year', 'vessels'))
][, .(value=sum(value)), .(Year, vessels, variable)
][, dcast(.SD, ... ~ variable, value.var='value')
][, setcolorder(.SD, c(setdiff(colnames(.SD), 'vessels'), 'vessels'))
][order(Year)
]
Year x.1 x.2 x.3 vessels
1: 2012 6 6 8 2
2: 2013 8 8 10 2
3: 2014 10 10 12 2
4: 2015 12 12 14 2

I don't fully understand your question, but what you want to do with grep could be solved with something like this:
dt <- data.frame(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
dt[unlist(lapply(colnames(dt),function(v){grepl("x",v)}))]
Then you can do what you want on the filtered data.
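As a minimal sketch of that idea in base R (assuming sum is the aggregation wanted, as in the question), the grepl selection can be fed straight into aggregate():

```r
dt <- data.frame(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# keep only the columns whose names contain "x"
x_cols <- dt[grepl("x", colnames(dt))]

# aggregate the filtered columns by Year
agg <- aggregate(x_cols, by = list(Year = dt$Year), FUN = sum)
agg
```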

Related

data.table aggregation by one column using the maximum value of another column - R

I've got a data.table DT that I would like to aggregate by one column (year) using the maximum value of another column (month). Here's a sample of my data.table.
> DT <- data.table(month = c("2016-01", "2016-02", "2016-03", "2017-01", "2017-02", "2017-03")
, col1 = c(3,5,2,8,4,9)
, year = c(2016, 2016,2016, 2017,2017,2017))
> DT
month col1 year
1: 2016-01 3 2016
2: 2016-02 5 2016
3: 2016-03 2 2016
4: 2017-01 8 2017
5: 2017-02 4 2017
6: 2017-03 9 2017
The desired output
> ## desired output
> DT
month col1 year desired_output
1: 2016-01 3 2016 2
2: 2016-02 5 2016 2
3: 2016-03 2 2016 2
4: 2017-01 8 2017 9
5: 2017-02 4 2017 9
6: 2017-03 9 2017 9
Aggregating by the column year, the desired output should be the value of col1 for the latest month. But somehow the following code doesn't work, it gives me a warning and returns NAs. What am I doing wrong?
> ## wrong output
> DT[, output := col1[which.max(month)], by = .(year)]
Warning messages:
1: In which.max(month) : NAs introduced by coercion
2: In which.max(month) : NAs introduced by coercion
> DT
month col1 year output
1: 2016-01 3 2016 NA
2: 2016-02 5 2016 NA
3: 2016-03 2 2016 NA
4: 2017-01 8 2017 NA
5: 2017-02 4 2017 NA
6: 2017-03 9 2017 NA
We get the index of the max value in 'month' by converting it to the yearmon class from zoo, and use that to get the corresponding value from 'col1' when creating the 'desired_output' column, grouped by 'year':
library(zoo)
library(data.table)
DT[, desired_output := col1[which.max(as.yearmon(month))], .(year)]
DT
# month col1 year desired_output
#1: 2016-01 3 2016 2
#2: 2016-02 5 2016 2
#3: 2016-03 2 2016 2
#4: 2017-01 8 2017 9
#5: 2017-02 4 2017 9
#6: 2017-03 9 2017 9
Or extract the month and get the index of its max value:
DT[, desired_output := col1[which.max(month(as.IDate(paste0(month, "-01"))))], .(year)]

How to combine winter months of two consecutive years

I have count data for several species spanning several years. I want to look at the abundance dynamics of each species over the winter season only, for each year. The problem is that the winter season spans two years: November and December, plus January of the next year. I want to combine the abundance of each species over the winter months spanning two consecutive years and do some analysis. For example, I first want to subset Nov-Dec of 2005 and Jan of 2006 and analyse that, then subset Nov-Dec of 2006 and Jan of 2007 and repeat the same analysis, and so on. How can I do it in R?
Here is an example of the data
date species year month day abundance temp
9/3/2005 A 2005 9 3 3 19
9/15/2005 B 2005 9 15 30 16
10/4/2005 A 2005 10 4 24 12
11/6/2005 A 2005 11 6 32 14
12/8/2005 A 2005 12 8 15 13
1/3/2005 A 2006 1 3 64 19
1/4/2006 B 2006 1 4 2 13
2/10/2006 A 2006 2 10 56 12
2/8/2006 A 2006 1 3 34 19
3/9/2006 A 2006 1 3 64 19
I would convert your date column to a Date class (with lubridate) and remove the year, month and day columns, as they are redundant.
Then make a new column with the seasonal year (defined as the year, unless the month is January, in which case it is the previous year). A further column, made with case_when, defines each row's season.
library(dplyr)
library(lubridate)
# converts to date format
df$date <- mdy(df$date)
# add in columns
df <- mutate(df,
  season_year = ifelse(month(date) == 1, year(date) - 1, year(date)),
  season = case_when(
    month(date) %in% c(2, 3, 4)   ~ "Spring",
    month(date) %in% c(5, 6, 7)   ~ "Summer",
    month(date) %in% c(8, 9, 10)  ~ "Autumn",
    month(date) %in% c(11, 12, 1) ~ "Winter",
    TRUE ~ NA_character_
  ))
# date species abundance temp season_year season
# 1 2005-09-03 A 3 19 2005 Autumn
# 2 2005-09-15 B 30 16 2005 Autumn
# 3 2005-10-04 A 24 12 2005 Autumn
# 4 2005-11-06 A 32 14 2005 Winter
# 5 2005-12-08 A 15 13 2005 Winter
# 6 2005-01-03 A 64 19 2004 Winter
# 7 2006-01-04 B 2 13 2005 Winter
# 8 2006-02-10 A 56 12 2006 Spring
# 9 2006-02-08 A 34 19 2006 Spring
# 10 2006-03-09 A 64 19 2006 Spring
Then you can group_by() and/or filter() your data for further analysis:
df %>%
group_by(season_year) %>%
filter(season == "Winter") %>%
summarise(count = sum(abundance))
# # A tibble: 2 x 2
# season_year count
# <dbl> <int>
# 1 2004 64
# 2 2005 49
data.table solution:
First create a lookup table with from/to dates and the season-year, then perform an overlap join using foverlaps.
library( data.table )
sample data
dt <- fread("date species year month day abundance temp
9/3/2005 A 2005 9 3 3 19
9/15/2005 B 2005 9 15 30 16
10/4/2005 A 2005 10 4 24 12
11/6/2005 A 2005 11 6 32 14
12/8/2005 A 2005 12 8 15 13
1/3/2005 A 2006 1 3 64 19
1/4/2006 B 2006 1 4 2 13
2/10/2006 A 2006 2 10 56 12
2/8/2006 A 2006 1 3 34 19
3/9/2006 A 2006 1 3 64 19", header = TRUE)
Create a lookup table
In here, you define the names, start and end of the seasons. Adjust to your own needs. Since you want to analyse the seasons individually, I advise keeping unique season names (here: based on the start year of the season).
dt.season <- data.table( from = seq( as.Date("1999-02-01"), length.out = 100, by = "3 month"),
to = seq( as.Date("1999-05-01"), length.out = 100, by = "3 month") - 1 )
dt.season[, season := paste0( c( "spring", "summer", "autumn", "winter" ), "-", year( from ) )]
setkey( dt.season, from, to )
head(dt.season,6)
# from to season
# 1: 1999-02-01 1999-04-30 spring-1999
# 2: 1999-05-01 1999-07-31 summer-1999
# 3: 1999-08-01 1999-10-31 autumn-1999
# 4: 1999-11-01 2000-01-31 winter-1999
# 5: 2000-02-01 2000-04-30 spring-2000
# 6: 2000-05-01 2000-07-31 summer-2000
And perform the join:
#set dt$date as dates
dt[, date := as.Date(date, format = "%m/%d/%Y")]
#create dummy variables to join on
dt[, `:=`( from = date, to = date)]
#create an overlap join, and clean the dummies used for the join
foverlaps( dt, dt.season)[, `:=`(from = NULL, to = NULL, i.from = NULL, i.to = NULL)][]
# season date species year month day abundance temp
# 1: autumn-2005 2005-09-03 A 2005 9 3 3 19
# 2: autumn-2005 2005-09-15 B 2005 9 15 30 16
# 3: autumn-2005 2005-10-04 A 2005 10 4 24 12
# 4: winter-2005 2005-11-06 A 2005 11 6 32 14
# 5: winter-2005 2005-12-08 A 2005 12 8 15 13
# 6: winter-2004 2005-01-03 A 2006 1 3 64 19
# 7: winter-2005 2006-01-04 B 2006 1 4 2 13
# 8: spring-2006 2006-02-10 A 2006 2 10 56 12
# 9: spring-2006 2006-02-08 A 2006 1 3 34 19
# 10: spring-2006 2006-03-09 A 2006 1 3 64 19
You can now easily group/sum/analyse by season
I'd think the easiest way would be to consider that the 2006 winter consists of Nov-Dec 2006 and Jan 2007. You could add a column: winterid <- ifelse(data$month %in% c(11,12), data$year, ifelse(data$month == 1, data$year-1, "notwinter")).
You are now able to subset on the successive winter seasons. Adapt according to your notation.
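A minimal sketch of that suggestion, using a few rows modelled on the question's data (the winterid and notwinter names follow the comment above):

```r
# small sample modelled on the question's data
data <- data.frame(year      = c(2005, 2005, 2005, 2006, 2006),
                   month     = c(10, 11, 12, 1, 2),
                   abundance = c(24, 32, 15, 64, 56))

# label Nov/Dec with their own year, January with the previous year,
# and everything else as "notwinter"
data$winterid <- ifelse(data$month %in% c(11, 12), data$year,
                        ifelse(data$month == 1, data$year - 1, "notwinter"))

# abundance for the 2005-2006 winter (Nov-Dec 2005 plus Jan 2006)
winter05 <- subset(data, winterid == "2005")
sum(winter05$abundance)   # 32 + 15 + 64 = 111
```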

Combine data in many row into a columnn

I have a data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 into one column, 2012 into the next column, and so on:
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack, if the number of rows per year is the same:
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  group_by(year) %>%
  mutate(ID = 1:n()) %>%
  spread(year, Male) %>%
  select(-ID)
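As a side note, spread() has since been superseded in tidyr; a sketch assuming tidyr >= 1.0.0, where pivot_wider() plays the same role:

```r
library(dplyr)
library(tidyr)

df <- data.frame(year = rep(2011:2013, each = 3),
                 Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3))

# a within-group row counter serves as the id column for pivoting
wide <- df %>%
  group_by(year) %>%
  mutate(ID = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = year, values_from = Male) %>%
  select(-ID)
wide
```

Note the resulting column names start with a digit, so they need backticks (or `[[ ]]`) to access.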
1. If every year has the same number of observations, you could split the data and cbind it using base R:
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2. If every year does not have the same number of observations, you could use rbind.fill.matrix from plyr:
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3. Yet another way is to use dcast to convert the data from long to wide format:
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA

How to add means to an existing column in R

I am manipulating a dataset but I can't make things right.
Here's an example for this, where df is the name of data frame.
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame: df1 <- aggregate(value ~ year, df, mean, na.rm = TRUE)
And made this data frame df1:
year ID value
2013 avg 13.3
2014 avg 23.3
2015 avg 20
But I want to add each mean by year into each row of df.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20
Here is an option with data.table, where we convert the data.frame to a data.table (setDT(df)), then, grouped by 'year', get the mean of 'value' and set 'ID' to 'avg'; rbindlist then rbinds both datasets, and we order by 'year':
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or, using the OP's method, rbind both datasets and then order:
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]

Drop subgroup of obs in dataframe if first observation of group is na

In R I have a dataframe df of this form:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
NA 5 2011 04 1234759
5 5 2011 05 1234759
5 5 2011 06 1234759
2 2 2001 11 1234760
NA NA 2001 11 1234760
Some of the a's and b's are NAs. I wish to subset the data frame by id, order each subset by year and month, and then drop the whole subset/id if the first observation in time of either a or b is NA.
For the example above, the intended result is:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
2 2 2001 11 1234760
NA NA 2001 11 1234760
I did it the non-vectorized way, which took forever to run, as follows:
df_summary <- as.data.frame(table(df$id),stringsAsFactors=FALSE)
df <- df[order(df$id,df$year,df$month),]
remove <- ""
j <- 1
l <- 0
for(i in 1:nrow(df_summary)){
  m <- df_summary$Freq[i]
  if( is.na(df$a[j]) | is.na(df$b[j]) ) {
    l <- l + 1
    remove[l] <- df_summary$Var1[i]
  }
  j <- j + m
}
df <- df[!(df$id %in% remove),]
What is a faster, vectorized way, to achieve the same result?
What I tried, also to double-check my code:
dt <- setDT(df)
remove_vectorized <- dt[,list(remove_first_na=(is.na(a[1]) | is.na(b[1]))),by=id]
which suggests I should remove ALL observations, which is patently wrong.
Here are a few possible data.table approaches.
First, fixing your attempt:
library(data.table)
setDT(df)[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or we can generalize this (probably at the expense of speed):
setDT(df)[, if(Reduce(`&`, !is.na(.SD[1L, .(a, b)]))) .SD, by = id]
## OR maybe `setDT(df)[, if(Reduce(`&`, !sapply(.SD[1L, .(a, b)], is.na))) .SD , by = id]`
## in order to avoid matrix conversions
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Another way is to combine unique and na.omit methods
indx <- na.omit(unique(setDT(df), by = "id"), by = c("a", "b"))
Then, a simple subset will do
df[id %in% indx$id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or maybe a binary join?
df[indx[, .(id)], on = "id"]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or
indx <- na.omit(unique(setDT(df, key = "id")), by = c("a", "b"))
df[.(indx$id)]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
(The last two are mainly for illustration)
For more info regarding data.table, please visit Getting Started on GH
