R: pivoting & subtotals in data.table?

Pivoting and subtotals are common auxiliary steps in spreadsheets and SQL.
Assume a data.table with the fields date, myCategory and revenue. Suppose you want each row's revenue as a proportion of all revenue and as a proportion of that day's revenue, computed within different subgroups, along these lines:
b[,{
#First auxiliary variable of all revenue
totalRev = sum(revenue) #SUBGROUP OF ALL REV
#Second auxiliary variable of revenue by date, syntax wrong! How to do this?
{totalRev_date=sum(revenue), by=list(date)} #DIFFERENT SUBGROUP, by DATE's rev
#Within the subgroup by date and myCategory, we will use 1st&2nd auxiliary vars
.SD[,.(Revenue_prop_of_TOT=revenue/totalRev,
,Revenue_prop_of_DAY=revenue/totalRev_date) ,by=list(myCategory,date)]
},]
where we need to compute two auxiliary sums: the revenue of a specific day and the revenue over the whole history.
The end result should look like this:
date myCategory Revenue_prop_of_TOT Revenue_prop_of_DAY
2019-01-01 Cat1 0.002 0.2
...
where the auxiliary variables are only helper values and do not appear in the result.
How can you pivot and compute subtotals within R data.table?

Another option using data.table::cube:
cb <- cube(DT, sum(value), by=c("date","category"), id=TRUE)
cb[grouping==0L, .(date, category,
PropByDate = V1 / cb[grouping==1L][.SD, on="date", x.V1],
PropByCategory = V1 / cb[grouping==2L][.SD, on="category", x.V1],
PropByTotal = V1 / cb[grouping==3L, V1]
)]
output:
date category PropByDate PropByCategory PropByTotal
1: 1 1 0.3333333 0.2500000 0.1
2: 1 2 0.6666667 0.3333333 0.2
3: 2 1 0.4285714 0.7500000 0.3
4: 2 2 0.5714286 0.6666667 0.4
data:
DT <- data.table(date=c(1, 1, 2, 2), category=c(1, 2, 1, 2), value=1:4)
# date category value
#1: 1 1 1
#2: 1 2 2
#3: 2 1 3
#4: 2 2 4

Hopefully I'm understanding correctly what you intend but please let me know in the comments if you need a different output.
library(data.table)
b = data.table(date = rep(seq.Date(Sys.Date()-99, Sys.Date(), "days"), each=2),
myCategory = c("a", "b"),
revenue = rnorm(200, 200)) # 100 days x 2 categories = 200 rows
# global total, just create a constant
totalRev = b[, sum(revenue)]
# Total revenue at myCategory and date level / total Revenue
b[, Revenue_prop_of_TOT:=sum(revenue)/totalRev, by=.(myCategory, date)]
# you can calculate totalRev_date independently
b[, totalRev_date:=sum(revenue), by=date]
# If these are all the columns you have you don't need the sum(revenue) and by calls
b[, Revenue_prop_of_DAY:=sum(revenue)/totalRev_date, by=.(myCategory, date)]
Finally I would wrap it in a function.
revenue_total <- function(b){
totalRev = b[, sum(revenue)]
b[, Revenue_prop_of_TOT:=sum(revenue)/totalRev, by=.(myCategory, date)]
b[, totalRev_date:=sum(revenue), by=date]
b[, Revenue_prop_of_DAY:=sum(revenue)/totalRev_date, by=.(myCategory, date)]
b
}
b = revenue_total(b)
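The same two proportions can also be computed on an aggregated copy, so that the helper totals never end up as columns of b. This is a minimal sketch on made-up numbers (the four revenue values and the name res are assumptions for illustration, not part of the question):

```r
library(data.table)

# Hypothetical toy data: 2 days x 2 categories
b <- data.table(date       = as.Date("2019-01-01") + rep(0:1, each = 2),
                myCategory = c("a", "b", "a", "b"),
                revenue    = c(10, 30, 20, 40))

# One row per (date, myCategory) with its summed revenue
res <- b[, .(rev = sum(revenue)), by = .(date, myCategory)]

# Share of the grand total (no grouping: sum(rev) is the overall total)
res[, Revenue_prop_of_TOT := rev / sum(rev)]

# Share of the day's total (grouped by date: sum(rev) is that day's total)
res[, Revenue_prop_of_DAY := rev / sum(rev), by = date]

# Drop the helper column
res[, rev := NULL]
print(res)
```

Within each date the Revenue_prop_of_DAY values sum to 1, and Revenue_prop_of_TOT sums to 1 over the whole table.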

Options for pivoting and subtotals in R:
cube: see the answer above
groupingsets: suggested by marbel in the comments
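For the groupingsets option, a minimal sketch on the same toy DT as in the cube answer might look like the following. The column name total and the choice of sets are assumptions for illustration; unlike cube, only the listed grouping sets are computed.

```r
library(data.table)

DT <- data.table(date = c(1, 1, 2, 2), category = c(1, 2, 1, 2), value = 1:4)

# Compute three levels in one call:
#   c("date", "category") -> detail rows   (grouping == 0)
#   "date"                -> per-date sums (grouping == 1)
#   character()           -> grand total   (grouping == 3)
gs <- groupingsets(DT,
                   j    = list(total = sum(value)),
                   by   = c("date", "category"),
                   sets = list(c("date", "category"), "date", character()),
                   id   = TRUE)
print(gs)

# Proportion of each detail row within its date and within the grand total
detail <- gs[grouping == 0L]
detail[, PropByDate  := total / gs[grouping == 1L][detail, on = "date", x.total]]
detail[, PropByTotal := total / gs[grouping == 3L, total]]
print(detail)
```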

Related

Removing negative values and one positive value from R dataframe

I have a dataframe where one column is the amount spent. In the amount spent column there are the values for amount spent and also negative values for any returns. For example.
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
I want to remove each negative value together with one of its positive counterparts. The idea is to keep only fully completed spend amounts so I can look at total spend.
Right now I am thinking something like this - where I have the data frame sorted by spend
if spend < 0 {
take absolute value of spend
if diff between abs(spend) and spend+1 = 0 then both are NA}
I would like to have something like
df[df$spend < 0] <- NA
where I can also set one positive counterpart to NA as well. Any suggestions?
There should be a simpler solution to this, but here is one way. I also created my own example, since the one shared did not have enough data points to test.
#Original vector
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
#Count the frequency of negative numbers, keeping all the unique numbers
vals <- table(factor(abs(x[x < 0]), levels = sort(unique(abs(x))))) # sorted so the levels line up with table(abs(x))
#Count the frequency of absolute value of original vector
vals1 <- table(abs(x))
#Subtract the frequencies between two vectors
new_val <- vals1 - (vals * 2 )
#Recreate the new vector
as.integer(rep(names(new_val), new_val))
#[1] 1 2 3
If you add a rowid column you can do this with data.table anti-joins.
Here's an example which takes ID into account, not deleting "positive counterparts" unless they belong to the same ID.
First, create more interesting sample data:
df <- fread('
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
123 A -19.50
123 A 19.50
123 A -99.50
124 A -94.50
124 A 99.50
124 A 94.50
124 A 94.50
')
Now remove all the negative values with positive counterparts, and remove those counterparts
negs <- df[Spent < 0][, Spent := -Spent][, rid := rowid(ID, Spent)]
pos <- df[Spent > 0][, rid := rowid(ID, Spent)]
pos[!negs, on = .(ID, Spent, rid), -'rid']
# ID Store Spent
# 1: 123 A 18.5
# 2: 124 A 99.5
# 3: 124 A 94.5
And as applied to Ronak's x vector example
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
negs <- data.table(x = -x[x<0])[, rid := rowid(x)]
pos <- data.table(x = x[x>0])[, rid := rowid(x)]
pos[!negs, on = names(pos), -'rid']
# x
# 1: 2
# 2: 3
# 3: 1
I used the following code.
library(dplyr)
store <- rep(LETTERS[1:3], 3)
id <- c(1:4, 1:3, 1:2)
expense <- runif(9, -10, 10)
tibble(store, id, expense) %>%
group_by(store) %>%
summarise(net_expenditure = sum(expense))
to get this output:
# A tibble: 3 x 2
store net_expenditure
<chr> <dbl>
1 A 13.3
2 B 8.17
3 C 16.6
Alternatively, if you wanted the net expenditure per store-id pairing, then you could use this code:
tibble(store, id, expense) %>%
group_by(store, id) %>%
summarise(net_expenditure = sum(expense))
I've approached your question from a slightly different perspective. I'm not sure that my code answers your question, but it might help.

R - Dplyr - How to mutate rows

I found that dplyr is fast and simple for aggregating and summarising data, but I can't work out how to solve the following problem with it.
Given these data frames:
df_2017 <- data.frame(expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
company = This is a categorical variable with companies from 1 to 195
product = This is a categorical variable with home appliance products from 1 to 65. For example, 1 could be irons, 2 televisions, etc.
acc_concept = This is a categorical variable with accounting concepts from 1 to 39. For example, 1 would be "Sales", 2 "Total Expenses", 3 "Returns", 4 "Assets", etc.
value = This is a numeric variable, with USD from 1 to 1,000,000
period = Categorical variable. Always 2017
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but not every company has every product - acc_concept combination. That is why the code includes the line "df_2017 <- df_2017[sample(1:(195*65*39),450000),]", and why the output can contain NA (see below).
And where the columns in ratio_df are:
concept = which acc_concept corresponds to the numerator, which one to the denominator, and which is the name of the ratio
ratio1 = acc_concept and name for ratio1
ratio2 = acc_concept and name for ratio2
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio "acc_concepts" and "name" from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show what I want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before I want to do this for each product within each company
The output data.frame could be something like this (ratios aren't the true ones because I haven't done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 3 Sales over Assets 1.30
1 7 Sales over Assets NaN
1 1 Sales over Expenses A + B Inf
1 2 Sales over Expenses A + B 2.32
1 3 Sales over Expenses A + B NA
2
3
and so on...
NaN when ratio is 0 / 0
Inf when ratio is number / 0
NA when there is no data for certain company and product.
I hope I have made myself clear...
Is there any way to solve this row problem with dplyr? Should I cast the df_2017?
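One possible approach, sketched here under assumptions rather than as a definitive answer: group by company and product, and within each group sum the values whose acc_concept belongs to the numerator or the denominator. The helper name calc_ratio and the small df_small table are made up for illustration; ratio_df is the one from the question. Company/product combinations with no rows at all simply do not appear in the result (they would need a join against the full grid to show up as NA).

```r
library(dplyr)

ratio_df <- data.frame(
  concept = c("numerator", "numerator", "numerator", "denom", "denom", "denom", "name"),
  ratio1  = c("1", "", "", "4", "", "", "Sales over Assets"),
  ratio2  = c("1", "", "", "5", "6", "", "Sales over Expenses A + B"),
  stringsAsFactors = FALSE)

# Hypothetical small stand-in for df_2017 (same column layout)
df_small <- data.frame(
  company     = c("1", "1", "1", "1", "2", "2"),
  product     = c("1", "1", "1", "2", "1", "1"),
  acc_concept = c("1", "4", "5", "1", "1", "4"),
  value       = c(100, 50, 20, 80, 30, 0),
  stringsAsFactors = FALSE)

# Hypothetical helper: evaluate one ratio column of ratio_df per company/product
calc_ratio <- function(df, ratio_df, ratio_col) {
  spec       <- ratio_df[[ratio_col]]
  num        <- setdiff(spec[ratio_df$concept == "numerator"], "")
  den        <- setdiff(spec[ratio_df$concept == "denom"], "")
  ratio_name <- spec[ratio_df$concept == "name"]
  df %>%
    group_by(company, product) %>%
    summarise(desc_ratio  = ratio_name,
              ratio_value = sum(value[acc_concept %in% num]) /
                            sum(value[acc_concept %in% den]),
              .groups = "drop")
}

# Stack the results for both ratios
res <- bind_rows(lapply(c("ratio1", "ratio2"),
                        function(col) calc_ratio(df_small, ratio_df, col)))
print(res)
```

Because sum() of an empty selection is 0, a missing denominator yields Inf and a missing numerator and denominator yields NaN, matching the conventions listed in the question.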

R: Using different DFs to get third DF with specific info from first 2

I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is to grab each article ID in df2 and identify in which publication and year it falls under, using the information contained in df1.
Here's a minimally reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate an article's ID in each year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID I have in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for i in 1:nrow(df2$art_num){
article_number <- df2$art_num[i]
if (article_number %in% df1$cumsum){ # note: cumsum should be an interval before doing this?
# get article number, year, publication in new df
# also calculate article ID in each year/publication
}
}
Thanks in advance for any help! I'm still lost with writing loops in R...
#######################
EDITED EXAMPLE as per Frank's suggestion
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2000 20
This example shows my ideal output in df3, which I built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This differs from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by looking at the article ID, locating the year and outlet based on df1$cumsum, and then subtracting the df1$cumsum value for the previous year from the art_id value. So for this specific article, I calculated df3$art_number[1] = df3$art_id[1] - df1$cumsum[2]
I need to do this calculation for every article in my data base so I don't do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now, we can do a rolling update join, "rolling" each art_id forward to the next cumsum (the end of the year/outlet block it falls in) and computing each desired column:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.outlet,
x.year,
i.art_id - x.art_cstart + 1L
)]]
art_id outlet year art_num
1: 66 1 2002 19
2: 120 2 2000 35
3: 77 1 2002 30
4: 156 2 2001 35
5: 24 1 2001 20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join j evaluates to a list of columns, .(...) shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
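To see the join syntax in isolation, here is a minimal sketch of a rolling join with roll=-Inf on made-up tables (the names blocks, lookup, end, label and id are hypothetical). Each lookup value that has no exact match is rolled to the next larger end value:

```r
library(data.table)

# Hypothetical lookup table: each row closes a block at 'end'
blocks <- data.table(end = c(10L, 20L, 30L), label = c("a", "b", "c"))

# Values to look up
lookup <- data.table(id = c(5L, 10L, 11L, 30L))

# For each id, find the first block whose 'end' is >= id.
# x.* refers to columns of blocks (the x table), i.* to columns of lookup.
res <- blocks[lookup, on = .(end = id), roll = -Inf,
              .(id = i.id, label = x.label)]
print(res)
```

Here id 5 and id 10 both land in block "a", id 11 in block "b" and id 30 in block "c".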
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
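For comparison, the same lookup can be sketched in base R with findInterval. This is an alternative technique, not the join shown above; the article_total values are copied from the table printed in this answer, and the names idx, prev_end and df3 are chosen for illustration.

```r
# df1 rebuilt from the table printed above
df1 <- data.frame(year          = rep(2000:2002, 2),
                  outlet        = rep(1:2, each = 3),
                  article_total = c(4, 43, 38, 36, 39, 8))
df1$cumsum <- cumsum(df1$article_total)
df2 <- data.frame(art_id = c(66, 120, 77, 156, 24))

# With left.open = TRUE, findInterval returns i such that
# cumsum[i] < art_id <= cumsum[i + 1], so i + 1 is the row of the matching block
idx <- findInterval(df2$art_id, df1$cumsum, left.open = TRUE) + 1L

# Cumulative count at the end of the previous block (0 before the first block)
prev_end <- c(0, df1$cumsum)[idx]

df3 <- data.frame(art_id  = df2$art_id,
                  outlet  = df1$outlet[idx],
                  year    = df1$year[idx],
                  art_num = df2$art_id - prev_end)
print(df3)
```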
I think this is the code you need:
df3 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df3) <- c("articleNumber", "year", "publication")
# lower bound of each year/outlet block: the previous cumsum (0 for the first block)
lower <- c(0, head(df1$cumsum, -1))
# seq_along() is used because nrow() of a vector is NULL
for (i in seq_along(df2$art_num)) {
for (j in seq_along(df1$cumsum)) {
# article i belongs to block j if lower[j] < art_num <= cumsum[j]
if ((df2$art_num[i] > lower[j]) && (df2$art_num[i] <= df1$cumsum[j])) {
# get article number, year, publication in new df
df3[i, 1] <- df2$art_num[i]
df3[i, 2] <- df1$year[j]
df3[i, 3] <- df1$outlet[j]
# also calculate article ID in each year/publication ISN'T THIS
# art_num?
}
}
}

Comparing Groups in data.table Columns

I have a dataset that I need to both split by one variable (Day) and then compare between groups of another variable (Group), performing per-group statistics (e.g. mean) and also tests.
Here's an example of what I devised:
require(data.table)
data = data.table(Day = rep(1:10, each = 10),
Group = rep(1:2, times = 50),
V = rnorm(100))
data[, .(g1_mean = mean(.SD[Group == 1]$V),
g2_mean = mean(.SD[Group == 2]$V),
p.value = t.test(V ~ Group, .SD, alternative = "two.sided")$p.value),
by = list(Day)]
Which produces:
Day g1_mean g2_mean p.value
1: 1 0.883406048 0.67177271 0.6674138
2: 2 0.007544956 -0.55609722 0.3948459
3: 3 0.409248637 0.28717183 0.8753213
4: 4 -0.540075365 0.23181458 0.1785854
5: 5 -0.632543900 -1.09965990 0.6457325
6: 6 -0.083221671 -0.96286343 0.2011136
7: 7 -0.044674252 -0.27666473 0.7079499
8: 8 0.260795244 -0.15159164 0.4663712
9: 9 -0.134164758 0.01136245 0.7992453
10: 10 0.496144329 0.76168408 0.1821123
I'm hoping that there's a less roundabout manner of arriving at this result.
A possible compact alternative which can also apply more functions to each group:
DTnew <- dcast(DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day],
Day + pval ~ paste0("g",Group), fun = list(mean,sd), value.var = "V")
which gives:
> DTnew
Day pval V_mean_g1 V_mean_g2 V_sd_g1 V_sd_g2
1: 1 0.4763594 -0.11630634 0.178240714 0.7462975 0.4516087
2: 2 0.5715001 -0.29689807 0.082970631 1.3614177 0.2745783
3: 3 0.2295251 -0.48792449 -0.031328749 0.3723247 0.6703694
4: 4 0.5565573 0.33982242 0.080169698 0.5635136 0.7560959
5: 5 0.5498684 -0.07554433 0.308661427 0.9343230 1.0100788
6: 6 0.4814518 0.57694034 0.885968245 0.6457926 0.6773873
7: 7 0.8053066 0.29845913 0.116217727 0.9541060 1.2782210
8: 8 0.3549573 0.14827289 -0.319017581 0.5328734 0.9036501
9: 9 0.7290625 -0.21589411 -0.005785092 0.9639758 0.8859461
10: 10 0.9899833 0.84034529 0.850429982 0.6645952 1.5809149
A decomposition of the code:
First, a pval variable is added to the dataset with DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day]
Because DT is updated in place and by reference by the previous step, the dcast function can be applied to that directly.
In the casting formula, you specify the variables that need to stay in the current form on the LHS and the variable that needs to be spread over columns on the RHS.
With the fun argument you can specify which aggregation function has to be used on the value.var (here V). If multiple aggregation functions are needed, you can specify them in a list (e.g. list(mean,sd)). This can be any type of function, so custom-made functions can be used as well.
If you want to remove the V_ from the column names, you can do:
names(DTnew) <- gsub("V_","",names(DTnew))
NOTE: I renamed the data.table to DT as it is often not wise to name your dataset after a function (check ?data)
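The remark about custom-made aggregation functions can be illustrated with a small sketch. The standard-error helper se and the reduced three-day dataset are assumptions made up for this example:

```r
library(data.table)

set.seed(1)
DT <- data.table(Day   = rep(1:3, each = 10),
                 Group = rep(1:2, times = 15),
                 V     = rnorm(30))

# Hypothetical custom aggregation function: standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

# One row per Day, one column per group, each cell holding se(V)
DTse <- dcast(DT, Day ~ paste0("g", Group), fun.aggregate = se, value.var = "V")
print(DTse)
```

With a single aggregation function the result columns are named after the RHS values only (g1, g2), without a function-name infix.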
While not a one-liner, you might consider doing your two processes separate and then merging the results. This prevents you from having to hardcode the group-names.
First, we calculate the means:
my_means <- dcast(data[,mean(V), by = .(Day, Group)],
Day~ paste0("Mean_Group", Group),value.var="V1")
Or in the less convoluted way @Akrun mentioned in the comments, with some added formatting:
my_means <- dcast(Day~paste0("Mean_Group", Group), data=data,
fun.agg=mean, value.var="V")
Then the t-tests:
t_tests <- data[,.(p_value=t.test(V~Group)$p.value), by = Day]
And then merge:
output <- merge(my_means, t_tests)

Summarize a data.table with unreliable data

I have a data.table of events recording, say, user ID, country of residence, and event.
E.g.,
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
country=c(rep(1,4),rep(2,6)),
event=1:10, key="user")
As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled - it does not matter to me here).
So when I try to summarize the data:
dt[, country[.N] , by=user]
user V1
1: 3 2
2: 4 2
I get the wrong country for user 3.
Ideally, I would like to get the most common country for a user and the
percentage of time he spent there:
user country support
1: 3 1 0.8
2: 4 2 1.0
How do I do that?
The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table and not data.frame after all).
Another way:
Edited. table(.) was the culprit. Changed it to complete data.table syntax.
dt.out<- dt[, .N, by=list(user,country)][, list(country[which.max(N)],
max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), c("country", "support"))
# user country support
# 1: 3 1 0.8
# 2: 4 2 1.0
Using plyr's count function:
dt[, count(country), by = user][order(-freq),
list(country = x[1],
support = freq[1]/sum(freq)),
by = user]
# user country support
#1: 4 2 1.0
#2: 3 1 0.8
The idea is to count the countries per user, order by descending frequency and then take the values you need.
A smarter answer thanks to @mnel, that doesn't use extra functions:
dt[, list(freq = .N),
by = list(user, country)][order(-freq),
list(country = country[1],
support = freq[1]/sum(freq)),
by = user]
