Counting rows in data.table, grouping by multiple columns, including "empty" groups

I have a data.table that looks like the following:
ID Date Team MonthFactor
1 2512 2015-04-24 Purple 2015-04
2 2512 2015-04-25 Purple 2015-04
3 2512 2015-04-26 Purple 2015-04
4 2512 2015-04-27 Purple 2015-04
I would like to get the number of rows grouped by both Team and MonthFactor, including groups that have no rows in a given month. For example, if the Purple team had no entries in May but the Yellow team did, the summarized table would look like:
Team MonthFactor N
1 Purple 2015-04 10
2 Purple 2015-05 0
3 Yellow 2015-04 5
4 Yellow 2015-05 7
Doing this would be trivial if I didn't need the "empty" groups, but I can't wrap my head around how to specify the groups that need to be evaluated when there might not be rows that contain a given monthFactor.

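You can achieve that by joining the counts to a cross join (CJ) of all Team and MonthFactor values. The CJ columns are named explicitly here, because older data.table versions named them V1 and V2 while newer ones take the input names:
dat[, .N, .(Team, MonthFactor)
][CJ(Team = Team, MonthFactor = MonthFactor, unique = TRUE), on = .(Team, MonthFactor)
][is.na(N), N := 0L][]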
this gives:
Team MonthFactor N
1: Purple 2015-04 2
2: Purple 2015-05 0
3: Yellow 2015-04 5
4: Yellow 2015-05 3
The advantage of this method is that it is easier to include other variables as well. Supposing that ID is just a numeric value, consider this example:
dat[, .(.N, sID = sum(ID)), .(Team, MonthFactor)
][CJ(Team = Team, MonthFactor = MonthFactor, unique = TRUE), on = .(Team, MonthFactor)
][is.na(N), `:=` (N = 0L, sID = 0L)][]
which gives:
Team MonthFactor N sID
1: Purple 2015-04 2 5024
2: Purple 2015-05 0 0
3: Yellow 2015-04 5 12560
4: Yellow 2015-05 3 7536
Used data:
dat <- structure(list(ID = c(2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L),
Date = structure(c(1L, 2L, 1L, 2L, 3L, 4L, 4L, 2L, 3L, 4L), .Label = c("2015-04-24", "2015-04-25", "2015-04-26", "2015-04-27"), class = "factor"),
Team = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Purple", "Yellow"), class = "factor"),
MonthFactor = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("2015-04", "2015-05"), class = "factor")),
.Names = c("ID", "Date", "Team", "MonthFactor"), class = c("data.table", "data.frame"), row.names = c(NA, -10L))
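A variation on the cross-join idea, sketched here on a made-up stand-in table (`dat_demo` and `combos` are hypothetical names, not the asker's data): join the raw rows to the cross join directly and count matches per combination with `by = .EACHI`. In that mode `.N` is 0 for a combination with no matching rows, so no NA replacement step is needed.

```r
library(data.table)

# Hypothetical stand-in data: Purple has no rows in 2015-05.
dat_demo <- data.table(
  Team        = c("Purple", "Purple", "Yellow", "Yellow", "Yellow"),
  MonthFactor = c("2015-04", "2015-04", "2015-04", "2015-05", "2015-05")
)

# All Team x MonthFactor combinations; columns are named explicitly.
combos <- CJ(Team = dat_demo$Team, MonthFactor = dat_demo$MonthFactor, unique = TRUE)

# For each row of `combos`, count the matching rows of `dat_demo`.
counts <- dat_demo[combos, on = .(Team, MonthFactor), .N, by = .EACHI]
print(counts)
```

This only covers plain counts; extra summaries such as a sum would still need their own handling for the empty combinations.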

Perhaps this could work, since table() also counts the combinations that have no rows:
data.table(table(dt$Team,dt$MonthFactor))
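A minimal base R sketch of the same table() idea, using made-up stand-in vectors (`teams` and `months` are hypothetical). Naming the arguments to table() carries the column names through, and `responseName` sets the name of the count column.

```r
# Hypothetical stand-in data; factors keep every level, even unobserved ones.
teams  <- factor(c("Purple", "Purple", "Yellow", "Yellow", "Yellow"))
months <- factor(c("2015-04", "2015-04", "2015-04", "2015-05", "2015-05"))

# table() tabulates every combination of levels, so empty cells appear as 0.
tab <- table(Team = teams, MonthFactor = months)

# Long format with one row per combination; responseName names the count column.
counts_tbl <- as.data.frame(tab, responseName = "N")
print(counts_tbl)
```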

Related

R - how to avoid repeating filter & row bind

Because I am working on a very large dataset, I need to slice my dataset by groups in order to pursue my computations.
I have a person-period (melt) dataset that looks like this
group id var time
1 A 1 a 1
2 A 1 b 2
3 A 1 a 3
4 A 2 b 1
5 A 2 b 2
6 A 2 b 3
7 B 1 a 1
8 B 1 a 2
9 B 1 a 3
10 B 2 c 1
11 B 2 c 2
12 B 2 c 3
I need to do this simple transformation
library(reshape2)
library(dplyr)
dt %>% dcast(group + id ~ time, value.var = 'var')
In order to get
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c
So far, so good.
However, because my database is too big, I need to do this separately for each group, such as
a = dt %>% filter(group == 'A') %>% dcast(group + id ~ time, value.var ='var')
b = dt %>% filter(group == 'B') %>% dcast(group + id ~ time, value.var = 'var')
bind_rows(a,b)
My problem is that I would like to avoid doing this by hand, that is, having to store each group separately (a = ..., b = ..., c = ..., and so on).
Is there a way to have a single pipe that splits the data by group, computes the transformation, and puts the results back together in one data frame?
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), time = structure(c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor")), .Names = c("group", "id",
"var", "time"), row.names = c(NA, -12L), class = "data.frame")

The purrr package can be useful for working with lists. First split the dataset by group, then use map_df to dcast each element of the list and return everything in a single data.frame.
library(purrr)
dt %>%
split(.$group) %>%
map_df(~dcast(.x, group + id ~ time, value.var = "var"))
group id 1 2 3
1 A 1 a b a
2 A 2 b b b
3 B 1 a a a
4 B 2 c c c

lapply is your friend here (note that the column is named group, in lower case):
do.call(rbind, lapply(unique(dt$group), function(grp, dt){
dt %>% filter(group == grp) %>% dcast(group + id ~ time, value.var = "var")
}, dt = dt))
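For comparison, the same split, transform, and recombine pattern can be sketched without any packages. This is only an illustration: `dt_demo` rebuilds the question's data with character columns, `widen_one` is a hypothetical helper, and base `reshape()` names the wide columns var.1, var.2, var.3 rather than 1, 2, 3.

```r
# Same structure as the question's dt, rebuilt here with plain character columns.
dt_demo <- data.frame(
  group = rep(c("A", "B"), each = 6),
  id    = rep(rep(c("1", "2"), each = 3), times = 2),
  var   = c("a", "b", "a", "b", "b", "b", "a", "a", "a", "c", "c", "c"),
  time  = rep(c("1", "2", "3"), times = 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: widen one group's rows so each time point is a column.
widen_one <- function(d) {
  reshape(d, idvar = c("group", "id"), timevar = "time", direction = "wide")
}

# Split by group, widen each piece, then stack the pieces again.
pieces <- lapply(split(dt_demo, dt_demo$group), widen_one)
wide   <- do.call(rbind, pieces)
rownames(wide) <- NULL
print(wide)
```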

Linear combinations of rows with matching row attributes in data.table

I would like to subtract corresponding rows by month in a data.table.
Here is the example table:
monthly_date sector_order Retail Sales Trend Sales
1: 2014-12-01 1 42123.87 42279.64
2: 2015-11-01 1 44181.69 43620.22
3: 2015-12-01 1 43207.97 43605.21
4: 2014-12-01 30 14972.60 15025.74
5: 2015-11-01 30 15969.98 15685.36
6: 2015-12-01 30 15478.42 15675.09
Is there an elegant way to get a 3-row table in which the rows with
sector_order==30 are subtracted from the rows with sector_order==1?
I can obviously brute force it with two data frames. Is there a more general data.table way?

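Here is an option (it assumes that, within each month, the sector_order == 1 row comes first and the sector_order == 30 row comes last, as in the example data):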
library(data.table)
data[, .(RetailSales = RetailSales[1L] - RetailSales[.N],
TrendSales = TrendSales[1L] - TrendSales[.N]), by = monthly_date]
# monthly_date RetailSales TrendSales
#1: 2014-12-01 27151.27 27253.90
#2: 2015-11-01 28211.71 27934.86
#3: 2015-12-01 27729.55 27930.12
or, as @MichaelChirico suggested, a more elegant solution:
data[order(-sector_order),.(RetailSales = diff(RetailSales),
TrendSales = diff(TrendSales)), by = monthly_date]
Or, as @Frank suggested:
data[order(-sector_order),
.SD[2]-.SD[1]
# lapply(.SD, diff) # also works here
, by=monthly_date, .SDcols=c("RetailSales","TrendSales")]
data
data = setDT(structure(list(monthly_date = structure(c(1L, 2L, 3L, 1L, 2L,
3L), .Label = c("2014-12-01", "2015-11-01", "2015-12-01"), class = "factor"),
sector_order = c(1L, 1L, 1L, 30L, 30L, 30L), RetailSales = c(42123.87,
44181.69, 43207.97, 14972.6, 15969.98, 15478.42), TrendSales = c(42279.64,
43620.22, 43605.21, 15025.74, 15685.36, 15675.09), grp = c(1L,
2L, 3L, 1L, 2L, 3L)), .Names = c("monthly_date", "sector_order",
"RetailSales", "TrendSales", "grp"), class = "data.frame", row.names = c(NA,
-6L)))
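The options above depend on the row order within each month. A sketch that instead selects the two rows by their sector_order value is shown below; `sales` is a hypothetical rebuild of the example table, and the approach assumes exactly one row per sector_order value in each month.

```r
library(data.table)

# The question's example values, rebuilt as a small data.table.
sales <- data.table(
  monthly_date = rep(c("2014-12-01", "2015-11-01", "2015-12-01"), times = 2),
  sector_order = rep(c(1L, 30L), each = 3),
  RetailSales  = c(42123.87, 44181.69, 43207.97, 14972.60, 15969.98, 15478.42),
  TrendSales   = c(42279.64, 43620.22, 43605.21, 15025.74, 15685.36, 15675.09)
)

# Within each month, pick the two rows by sector_order value, not by position.
diffs <- sales[,
  lapply(.SD, function(x) x[sector_order == 1L] - x[sector_order == 30L]),
  by = monthly_date,
  .SDcols = c("RetailSales", "TrendSales")
]
print(diffs)
```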

Add rows when values in columns are equal in df

For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df with 'animal' duplicates all added together. For example multiple rows with the same animal in columns 1 and 2 will be put together. So for example the dataframe above would read:
cat cat 16
dog dog 7
cat dog 3 etc. etc... (those with different animals would be left as they are). Importantly the sum of 'number' in both dataframes would be the same.
My real df is >400K observations, so anything that anyone could recommend could cope with a large dataset would be great!
Thanks in advance.

One option would be to use data.table. Convert the data.frame to a data.table with setDT(); then, for the rows where animal.1 equals animal.2, replace number with the sum of number within each (animal.1, animal.2) group; finally, keep the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table. We group by "animal.1", "animal.2", then replace the "number" with sum only when "animal.1" is equal to "animal.2", and get the unique rows
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()
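If it is acceptable to also merge repeated rows whose two animals differ (an assumption that goes slightly beyond the question, although the sample data contains no such repeats), a plain grouped sum in base R is enough. `animals` below is a hypothetical rebuild of the sample data with character columns.

```r
# The question's sample data, rebuilt with character columns.
animals <- data.frame(
  animal.1 = c("cat", "cat", "dog", "dog", "dog", "rabbit", "rabbit", "horse", "cat", "cat"),
  animal.2 = c("cat", "dog", "dog", "dog", "rabbit", "rabbit", "cat", "cat", "hamster", "cat"),
  number   = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L),
  stringsAsFactors = FALSE
)

# Sum `number` within every (animal.1, animal.2) pair.
totals <- aggregate(number ~ animal.1 + animal.2, data = animals, FUN = sum)
print(totals)
```

The grand total of number is preserved, which is the property the question asks for.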

'Complex' aggregation function in dcast from reshape2

I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!

I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
dcast(data = long, Day + Genotype ~ .,
value.var = "variable", function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790

Another solution without additional packages:
aggregate(list(Summary=long$variable),by=list(Day=long$Day,Genotype=long$Genotype),function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790
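The n-th root of a product is the geometric mean, which can also be computed on the log scale. The sketch below assumes all values are strictly positive; `long_demo` and `geo_mean` are hypothetical names. Working with logs avoids overflow in prod() when a group has many large values.

```r
# The question's data, rebuilt directly.
long_demo <- data.frame(
  Day      = rep(c("1", "2"), each = 6),
  Genotype = rep(rep(c("A", "B"), each = 3), times = 2),
  variable = c(1496, 1704, 1738, 1553, 1834, 1421,
               1208, 1845, 1325, 1264, 1920, 1735)
)

# Hypothetical helper: geometric mean computed on the log scale.
geo_mean <- function(x) exp(mean(log(x)))

geo <- aggregate(variable ~ Day + Genotype, data = long_demo, FUN = geo_mean)
names(geo)[3] <- "summary"
print(geo)
```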

Calculating percent of row total with plyr

I am currently using cast on a melted table to calculate the total of each value at the combination of ID variables ID1 (row names) and ID2 (column headers), along with grand totals for each row using margins="grand_col".
c <- cast(m, ID1 ~ ID2, sum, margins="grand_col")
ID1 ID2a ID2b ID2c ID2d ID2e (all)
1 ID1a 6459695 885473 648019 453613 1777308 10224108
2 ID1b 7263529 1411355 587785 612730 2458672 12334071
3 ID1c 7740364 1253524 682977 886897 3559283 14123045
So far, so R-like.
Then I divide each cell by its row total to get a percentage of the total.
c[,2:6]<-c[,2:6] / c[,7]
This looks kludgy. Is there something I should be doing in cast or maybe in plyr to handle the percent of margin calculation in the first command?
Thanks,
Matt

Assuming your source table looks something like this:
dfm <- structure(list(ID1 = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ID1a", "ID1b", "ID1c"
), class = "factor"), ID2 = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("ID2a",
"ID2b", "ID2c", "ID2d", "ID2e"), class = "factor"), value = c(6459695L,
7263529L, 7740364L, 885473L, 1411355L, 1253524L, 648019L, 587785L,
682977L, 453613L, 612730L, 886897L, 1777308L, 2458672L, 3559283L
)), .Names = c("ID1", "ID2", "value"), row.names = c(NA,
-15L), class = "data.frame")
> head(dfm)
ID1 ID2 value
1 ID1a ID2a 6459695
2 ID1b ID2a 7263529
3 ID1c ID2a 7740364
4 ID1a ID2b 885473
5 ID1b ID2b 1411355
6 ID1c ID2b 1253524
Using ddply first to calculate the percentages, and cast to present the data in the required format
library(reshape)
library(plyr)
df1 <- ddply(dfm, .(ID1), summarise, ID2 = ID2, pct = value / sum(value))
dfc <- cast(df1, ID1 ~ ID2)
dfc
ID1 ID2a ID2b ID2c ID2d ID2e
1 ID1a 0.6318101 0.08660638 0.06338147 0.04436700 0.1738350
2 ID1b 0.5888996 0.11442735 0.04765539 0.04967784 0.1993399
3 ID1c 0.5480662 0.08875735 0.04835905 0.06279786 0.2520195
Compared to your example, this is missing the row totals, which need to be added separately.
Not sure though, whether this solution is more elegant than the one you currently have.

Here is a one-liner using tapply and prop.table. It does not rely on any auxiliary packages:
prop.table(tapply(dfm$value, dfm[1:2], sum), 1)
giving:
ID2
ID1 ID2a ID2b ID2c ID2d ID2e
ID1a 0.6318101 0.08660638 0.06338147 0.04436700 0.1738350
ID1b 0.5888996 0.11442735 0.04765539 0.04967784 0.1993399
ID1c 0.5480662 0.08875735 0.04835905 0.06279786 0.2520195
or this which is even shorter:
prop.table( xtabs(value ~., dfm), 1 )
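If the row totals are wanted next to the percentages, one possible sketch is to keep the cross-tabulated sums in a variable and bind the totals on as an extra column. `dfm_demo`, `tab`, and `result` are hypothetical names; the values are those of the assumed source table above.

```r
# The assumed source table from the answer above, rebuilt directly.
dfm_demo <- data.frame(
  ID1   = rep(c("ID1a", "ID1b", "ID1c"), times = 5),
  ID2   = rep(c("ID2a", "ID2b", "ID2c", "ID2d", "ID2e"), each = 3),
  value = c(6459695, 7263529, 7740364, 885473, 1411355, 1253524,
            648019, 587785, 682977, 453613, 612730, 886897,
            1777308, 2458672, 3559283)
)

# Cross-tabulate the sums, then put row shares and row totals side by side.
tab    <- xtabs(value ~ ID1 + ID2, data = dfm_demo)
result <- cbind(prop.table(tab, 1), total = rowSums(tab))
print(result)
```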
