Linear combinations of rows with matching row attributes in data.table - r

I would like to subtract corresponding rows by months in a datatable
Here is the example table
monthly_date sector_order Retail Sales Trend Sales
1: 2014-12-01 1 42123.87 42279.64
2: 2015-11-01 1 44181.69 43620.22
3: 2015-12-01 1 43207.97 43605.21
4: 2014-12-01 30 14972.60 15025.74
5: 2015-11-01 30 15969.98 15685.36
6: 2015-12-01 30 15478.42 15675.09
Is there an elegant way to give me a 3 row table with the rows with
sector_order==30 subtracted from the rows with sector_order==1
I can obviously brute force it with two data frames. Is there a more general data.table way?

Here is an option
library(data.table)
data[, .(RetailSales = RetailSales[1L] - RetailSales[.N],
TrendSales = TrendSales[1L] - TrendSales[.N]), by = monthly_date]
# monthly_date RetailSales TrendSales
#1: 2014-12-01 27151.27 27253.90
#2: 2015-11-01 28211.71 27934.86
#3: 2015-12-01 27729.55 27930.12
or as #MichaelChirico suggested a more elegant solution
data[order(-sector_order),.(RetailSales = diff(RetailSales),
TrendSales = diff(TrendSales)), by = monthly_date]
Or as #Frank suggested
data[order(-sector_order),
.SD[2]-.SD[1]
# lapply(.SD, diff) # also works here
, by=monthly_date, .SDcols=c("RetailSales","TrendSales")]
data
data = setDT(structure(list(monthly_date = structure(c(1L, 2L, 3L, 1L, 2L,
3L), .Label = c("2014-12-01", "2015-11-01", "2015-12-01"), class = "factor"),
sector_order = c(1L, 1L, 1L, 30L, 30L, 30L), RetailSales = c(42123.87,
44181.69, 43207.97, 14972.6, 15969.98, 15478.42), TrendSales = c(42279.64,
43620.22, 43605.21, 15025.74, 15685.36, 15675.09), grp = c(1L,
2L, 3L, 1L, 2L, 3L)), .Names = c("monthly_date", "sector_order",
"RetailSales", "TrendSales", "grp"), class = "data.frame", row.names = c(NA,
-6L)))

Related

R: using a for loop to create a new data table containing min and max variables given multiple column combinations

I am currently working with a data set in R that contains four variables for a large set of individuals: pid, month, window, and agedays. I'm trying to create a loop that will output the min and max agedays of each group of combinations between month and window into a new data table that I can export as a csv.
Here's an example of the data:
pid agedays month window
1 22 2 1
2 35 3 2
3 33 3 2
4 55 3 2
1 66 2 1
2 55 4 2
3 80 4 2
4 90 4 2
I'd like for the new data table to contain the min and max agedays of each group within each combination of window and month as well as the count of each group within each combination. The range for month is 2-24 and the range for window is 0-2.
The data table should look something like this:
month window min max N
2 1 22 66 1
3 2 33 55 3
etc....
where N is the number of unique individuals (pids) within each group
After grouping by 'month', 'window', get the min, max of 'agedays' and the number of distinct (n_distinct) elements of 'pid'
library(dplyr)
df1 %>%
group_by(month, window) %>%
summarise(min = min(agedays), max = max(agedays), N = n_distinct(pid))
# A tibble: 3 x 5
# Groups: month [3]
# month window min max N
# <int> <int> <int> <int> <int>
#1 2 1 22 66 1
#2 3 2 33 55 3
#3 4 2 55 90 3
We can also do this with data.table
library(data.table)
setDT(df1)[, .(min = min(agedays), max = max(agedays),
N = uniqueN(pid)), by = .(month, window)]
Or using split from base R
do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
function(x) cbind(month = x$month[1], window = x$window[1], min = min(x$agedays), max = max(x$agedays),
N = length(unique(x$pid)))))
data
df1 <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
Using data.table, we can calculate min, max of agedays along with number of rows for each combination of month and window.
library(data.table)
setDT(df) #Convert to data.table if it is not already
df[, .(min_age = min(agedays, na.rm = TRUE),
max_age = max(agedays, na.rm = TRUE), N = .N), .(month, window)]
# month window min_age max_age N
#1: 2 1 22 66 2
#2: 3 2 33 55 3
#3: 4 2 55 90 3
data
df <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -8L))

Conditional data manipulation using data.table in R

I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,
52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",
"time", "value"), class = "data.frame", row.names = c(NA, -9L
))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to only add these missing time values IF the value for group appears in testx. In this example, I thus only want to add missing time values for groups matching the values for group in the file testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, why I prefer using data.table.
You don't need to refer testy when you are within testy[] and are using group by, directly using group as a variable gives correct result, you need an extra else statement to return rows where group is not within testx if you want to keep all records in testy:
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37

Counting rows in data.table, grouping by multiple columns, including "empty" groups

I have a data.table that looks like the following:
ID Date Team MonthFactor
1 2512 2015-04-24 Purple 2015-04
2 2512 2015-04-25 Purple 2015-04
3 2512 2015-04-26 Purple 2015-04
4 2512 2015-04-27 Purple 2015-04
I would like to get the number of rows grouped by both Team and MonthFactor, including when there are no rows from a given month, IE if purple team had no entries in the month of May but yellow did, the summarized table would look like:
Team MonthFactor N
1 Purple 2015-04 10
2 Purple 2015-05 0
3 Yellow 2015-04 5
4 Yellow 2015-05 7
Doing this would be trivial if I didn't need the "empty" groups, but I can't wrap my head around how to specify the groups that need to be evaluated when there might not be rows that contain a given monthFactor.
You can achieve that by using a cross-join:
dat[, .N, .(Team, MonthFactor)
][CJ(Team, MonthFactor, unique = TRUE), on = c(Team = "V1", MonthFactor = "V2")
][is.na(N), N := 0][]
this gives:
Team MonthFactor N
1: Purple 2015-04 2
2: Purple 2015-05 0
3: Yellow 2015-04 5
4: Yellow 2015-05 3
The advantage of this method is that it is easier to include other variables as well. Supposing that ID is just a numeric value, consider this example:
dat[, .(.N, sID = sum(ID)), .(Team, MonthFactor)
][CJ(Team, MonthFactor, unique = TRUE), on = c(Team = "V1", MonthFactor = "V2")
][is.na(N), `:=` (N = 0, sID = 0)][]
which gives:
Team MonthFactor N sID
1: Purple 2015-04 2 5024
2: Purple 2015-05 0 0
3: Yellow 2015-04 5 12560
4: Yellow 2015-05 3 7536
Used data:
dat <- structure(list(ID = c(2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L, 2512L),
Date = structure(c(1L, 2L, 1L, 2L, 3L, 4L, 4L, 2L, 3L, 4L), .Label = c("2015-04-24", "2015-04-25", "2015-04-26", "2015-04-27"), class = "factor"),
Team = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Purple", "Yellow"), class = "factor"),
MonthFactor = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("2015-04", "2015-05"), class = "factor")),
.Names = c("ID", "Date", "Team", "MonthFactor"), class = c("data.table", "data.frame"), row.names = c(NA, -10L))
Perhaps this could work
data.table(table(dt$Team,dt$MonthFactor))

Find out the item first time shows in a data set

I have a data set ProductTable, I want to return the date of all the ProductsFamily has been ordered first time and the very last time. Examples:
ProductTable
OrderPostingYear OrderPostingMonth OrderPostingDate ProductsFamily Sales QTY
2008 1 20 R1 5234 1
2008 1 12 R2 223 2
2009 1 30 R3 34 1
2008 2 1 R1 1634 3
2010 4 23 R3 224 1
2009 3 20 R1 5234 1
2010 7 12 R2 223 2
Result as followings
OrderTime
ProductsFamily OrderStart OrderEnd SumSales
R1 2008/1/20 2009/3/20 12102
R2 2008/1/12 2010/7/12 446
R3 2009/1/30 2010/4/23 258
I have no idea how to do it. Any suggestions?
ProductTable <- structure(list(OrderPostingYear = c(2008L, 2008L, 2009L, 2008L,
2010L, 2009L, 2010L), OrderPostingMonth = c(1L, 1L, 1L, 2L, 4L,
3L, 7L), OrderPostingDate = c(20L, 12L, 30L, 1L, 23L, 20L, 12L
), ProductsFamily = structure(c(1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("R1",
"R2", "R3"), class = "factor"), Sales = c(5234L, 223L, 34L, 1634L,
224L, 5234L, 223L), QTY = c(1L, 2L, 1L, 3L, 1L, 1L, 2L)), .Names = c("OrderPostingYear",
"OrderPostingMonth", "OrderPostingDate", "ProductsFamily", "Sales",
"QTY"), class = "data.frame", row.names = c(NA, -7L))
We can also use dplyr/tidyr to do this. We arrange the columns, concatenate the 'Year:Date' columns with unite, group by 'ProductsFamily', get the first, last of 'Date' column and sum of 'Sales' within summarise.
library(dplyr)
library(tidyr)
ProductTable %>%
arrange(ProductsFamily, OrderPostingYear, OrderPostingMonth, OrderPostingDate) %>%
unite(Date,OrderPostingYear:OrderPostingDate, sep='/') %>%
group_by(ProductsFamily) %>%
summarise(OrderStart=first(Date), OrderEnd=last(Date), SumSales=sum(Sales))
# Source: local data frame [3 x 4]
# ProductsFamily OrderStart OrderEnd SumSales
# (fctr) (chr) (chr) (int)
# 1 R1 2008/1/20 2009/3/20 12102
# 2 R2 2008/1/12 2010/7/12 446
# 3 R3 2009/1/30 2010/4/23 258
You can first set up the date in a new column, and then aggregate your data using data.table package (you take the first and last date by ID, as well as the sum of sales):
library(data.table)
# First build up the date
ProductTable$date = with(ProductTable,
as.Date(paste(OrderPostingYear,
OrderPostingMonth,
OrderPostingDate, sep = "." ),
format = "%Y.%m.%d"))
# In a second step, aggregate your data
setDT(ProductTable)[,list(OrderStart = sort(date)[1],
OrderEnd = sort(date)[.N],
SumSales = sum(Sales))
,ProductsFamily]
# ProductsFamily OrderStart OrderEnd SumSales
#1: R1 2008-01-20 2009-03-20 12102
#2: R2 2008-01-12 2010-07-12 446
#3: R3 2009-01-30 2010-04-23 258

Replacing loop in dplyr R

So I am trying to program function with dplyr withou loop and here is something I do not know how to do
Say we have tv stations (x,y,z) and months (2,3). If I group by this say we get
this output also with summarised numeric value
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for evaluated Brand.
Then I will have many Brands I need to filter to get only those which get value >=0.8*value of evaluated brand & <=1.2*value of evaluated brand
So for example from this down I would only want to filter first two, and this should be done for all months&TV combinations
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000
As #akrun said, you need to use a combination of merging and subsetting. Here's a base R solution.
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))

Resources