I have a client loan database and I want to do a ddply summarise per LoanRefId:
LoanRefId Tran_Type TransactionAmount
103 11 LoanIssue 1000.0000
104 11 InitiationFee 171.0000
105 11 Interest 59.6729
106 11 AdministrationFee 64.9332
107 11 RaisedClientInstallment 1295.5757
108 11 ClientInstallment 1295.4700
109 11 PaidUp 0.0000
110 11 Adjustment 0.1361
111 11 PaidUp 0.0000
112 12 LoanIssue 3000.0000
113 12 InitiationFee 399.0000
114 12 Interest 94.9858
115 12 AdministrationFee 38.6975
116 12 RaisedClientInstallment 3532.6350
117 12 ClientInstallment 3532.6100
118 12 PaidUp 0.0000
119 12 Adjustment 0.0733
120 12 PaidUp 0.0000
However, I only want to sum certain rows per LoanRefId; specifically, only the rows where Tran_Type == "ClientInstallment".
The only way I can think of (which doesn't seem to work) is:
> ddply(test, c("LoanRefId"), summarise, cash_in = sum(test[test$Tran_Type == "ClientInstallment","TransactionAmount"]))
LoanRefId cash_in
1 11 4828.08
2 12 4828.08
This is not summing per LoanRefId; it is simply summing all amounts where Tran_Type == "ClientInstallment", which is wrong.
Is there a better way to do this logical sum?
Someone may add a plyr answer, but nowadays base R, dplyr, or data.table are more widely used; plyr has since been superseded by its successor, dplyr. It is worth taking the time to learn the newer implementations, as they are more efficient and packed with features.
base R
aggregate(TransactionAmount ~ LoanRefId, df[df$Tran_Type == "ClientInstallment",], sum)
# LoanRefId TransactionAmount
#1 11 1295.47
#2 12 3532.61
dplyr
library(dplyr)
df %>%
  group_by(LoanRefId) %>%
  filter(Tran_Type == "ClientInstallment") %>%
  summarise(TransactionAmount = sum(TransactionAmount))
#Source: local data frame [2 x 2]
#
# LoanRefId TransactionAmount
# (int) (dbl)
#1 11 1295.47
#2 12 3532.61
data.table
setDT(df)[Tran_Type == "ClientInstallment", sum(TransactionAmount), by=LoanRefId]
# LoanRefId V1
#1: 11 1295.47
#2: 12 3532.61
Notice how clean data.table syntax is :). Great tool to learn.
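If you prefer the summed column to carry the cash_in name used in the question instead of the default V1, a small tweak (wrapping the expression in .()) does it:
setDT(df)[Tran_Type == "ClientInstallment", .(cash_in = sum(TransactionAmount)), by = LoanRefId]
#   LoanRefId cash_in
#1:        11 1295.47
#2:        12 3532.61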
Another base R option is tapply
with(subset(df1, Tran_Type=='ClientInstallment'),
     tapply(TransactionAmount, LoanRefId, FUN=sum))
# 11 12
#1295.47 3532.61
Or if we need plyr (going back to the past)
library(plyr)
ddply(df1, .(LoanRefId), summarise,
      TransactionAmount = sum(TransactionAmount[Tran_Type=='ClientInstallment']))
# LoanRefId TransactionAmount
#1 11 1295.47
#2 12 3532.61
Here's one more possibility, just for completeness:
with(df1[df1$Tran_Type=="ClientInstallment",], by(TransactionAmount, LoanRefId, sum))
#LoanRefId: 11
#[1] 1295.47
#------------------------------------------------------------
#LoanRefId: 12
#[1] 3532.61
I honestly feel data.table is a lifesaver. After converting with setDT(test):
test[Tran_Type == "ClientInstallment",
     sum(TransactionAmount), by = LoanRefId]
Related
What is the most efficient way to determine the maximum positive difference between the value (X) in each row and the subsequent values of the same variable (X) within the same group (Y), using data.table in R?
Example:
set.seed(1)
library(data.table)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
                 Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you suggest a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
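To see why this works, here is a tiny illustration of the same idea on a single vector (the Y == 10 values from the head(dt, 21) output above): the per-row maximum of the remaining values is just a reversed cumulative maximum, so all the differences come from one vectorised pass instead of .N nested comparisons per group.
x <- c(126, 137, 157, 191, 120, 190, 195, 166, 163, 106)
rev(cummax(rev(x))) - x
# [1] 69 58 38  4 75  5  0  0  0  0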
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
  as_data_frame() %>%
  mutate(orig_row = row_number()) %>%
  group_by(Y) %>%
  arrange(-orig_row) %>%
  mutate(max_diff2 = cummax(X) - X) %>%
  ungroup() %>%
  arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @DavidArenburg suggests in the comments, this can be done lightning-fast in data.table with one elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
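As a quick sanity check after running that line on dt (which still holds max_diff from the original approach), the two columns agree:
all.equal(dt$max_diff, dt$max_diff2)
#[1] TRUE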
In other words, I want to group on a column and then perform calculations using only some of the rows per group.
The data set I have is:
LoanRefId Tran_Type TransactionAmount
103 11 LoanIssue 1000.0000
104 11 InitiationFee 171.0000
105 11 Interest 59.6729
106 11 AdministrationFee 64.9332
107 11 RaisedClientInstallment 1295.5757
108 11 ClientInstallment 1295.4700
109 11 PaidUp 0.0000
110 11 Adjustment 0.1361
111 11 PaidUp 0.0000
112 12 LoanIssue 3000.0000
113 12 InitiationFee 399.0000
114 12 Interest 94.9858
115 12 AdministrationFee 38.6975
116 12 RaisedClientInstallment 3532.6350
117 12 ClientInstallment 3532.6100
118 12 PaidUp 0.0000
119 12 Adjustment 0.0733
120 12 PaidUp 0.0000
I would like to repeat the following calculation for each group:
ClientInstallment - LoanIssue.
So, group 1 will be for LoanRefId number 11. The calculation will take the ClientInstallment of 1295.47 and subtract the LoanIssue of 1000 to give me a new column, call it "Income", with value 295.47.
Is this possible using data.table or dplyr or any other clever tricks?
Alternatively, I could create two data summaries, one for ClientInstallment and one for LoanIssue, and then subtract them, but the truth is I need to do much more than just subtract two numbers, so I would need a data summary for each calculation, which is just plain unclever imho.
Any help is appreciated.
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'LoanRefId', we take the 'TransactionAmount' for the 'ClientInstallment' and 'LoanIssue' 'Tran_Type's and subtract one from the other.
library(data.table)
setDT(df1)[, list(Income = TransactionAmount[Tran_Type=='ClientInstallment'] -
                           TransactionAmount[Tran_Type=='LoanIssue']), by = LoanRefId]
# LoanRefId Income
#1: 11 295.47
#2: 12 532.61
We can also use dplyr with a similar approach
df1 %>%
  group_by(LoanRefId) %>%
  summarise(Income = TransactionAmount[Tran_Type=='ClientInstallment'] -
                     TransactionAmount[Tran_Type=='LoanIssue'])
Update
If a 'LoanRefId' doesn't have a 'ClientInstallment' or a 'LoanIssue' row, we can guard the calculation with an if/else condition:
setDT(df1)[, list(Income = if(any(Tran_Type=='ClientInstallment') &
                              any(Tran_Type=='LoanIssue'))
                    TransactionAmount[Tran_Type=='ClientInstallment'] -
                    TransactionAmount[Tran_Type=='LoanIssue'] else 0), by = LoanRefId]
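Since the OP mentions needing to do much more than subtract two numbers, another option (a sketch, assuming each quantity of interest corresponds to a single Tran_Type per loan) is to reshape to one column per Tran_Type first; after that, any per-loan arithmetic is a single expression:
library(data.table)
# one column per Tran_Type; duplicates such as the two PaidUp rows per loan are summed
wide <- dcast(setDT(df1), LoanRefId ~ Tran_Type,
              value.var = "TransactionAmount", fun.aggregate = sum)
wide[, Income := ClientInstallment - LoanIssue]
wide[, .(LoanRefId, Income)]
#   LoanRefId Income
#1:        11 295.47
#2:        12 532.61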
Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
library(plyr)
data <- ddply(df, c("site"), summarise,
              N = length(values),
              mean = mean(values),
              sd = sd(values),
              se = sd / sqrt(N),
              sum = sum(values)
)
data
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
  melt(variable.name="site") %>%
  group_by(site) %>%
  summarise_each(funs(n(), mean, sd, se=sd(.)/sqrt(n()), sum), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The analogous function in the tidyr package is gather (or, in more recent versions, pivot_longer); a sketch of that approach follows.
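For completeness, here is a rough tidyr/dplyr equivalent (a sketch, assuming DF is the unstacked data frame from the question); pivot_longer does the stacking and the grouped summarise reproduces the same statistics:
library(tidyr)
library(dplyr)
DF %>%
  pivot_longer(everything(), names_to = "site", values_to = "values") %>%
  group_by(site) %>%
  summarise(N = n(),
            mean = mean(values),
            sd = sd(values),
            se = sd / sqrt(N),
            sum = sum(values))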
I'm trying to bring in the standard deviation for each unique factor grouping in my data. I've researched techniques using the data.table package and the plyr package and haven't had any luck. Here is a basic example of what I'm trying to accomplish.
Group Hours
120 45
120 60
120 54
121 33
121 55
121 40
I'm trying to turn the above into:
Group Hours SD
120 45 7.343
120 60 7.343
120 54 7.343
121 33 9.833
121 55 9.833
121 40 9.833
Base R solution (assuming your data is called df)
transform(df, SD = ave(Hours, Group, FUN = sd))
data.table solution
library(data.table)
setDT(df)[, SD := sd(Hours), by = Group]
dplyr solution
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(SD = sd(Hours))
And here's a plyr solution (my first ever) as you asked for it
library(plyr)
ddply(df, .(Group), mutate, SD = sd(Hours))
(It is better to avoid having both plyr and dplyr loaded at the same time)
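If you do want to run the plyr version while dplyr is attached, a safe option (a minimal sketch) is to call the functions without loading plyr, so the two packages don't mask each other:
plyr::ddply(df, "Group", plyr::mutate, SD = sd(Hours))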
Thank you, David, for your detailed response! I used data.table to get what I was looking for. Here is a snippet of my final script, based on David's answer.
PayrollHoursSD <- as.data.table(PayrollHours2)[, SD := sd(TOTAL.HOURS), by = COMBO]
head(PayrollHoursSD)
# COMBO PAY.END.DATE TOTAL.HOURS SD
# 1: 1-2 10-06 42561.78 4297.287
# 2: 1-2 10-13 42177.88 4297.287
# 3: 1-2 10-20 44691.23 4297.287
# 4: 1-2 10-27 42709.28 4297.287
# 5: 1-2 11-03 44876.25 4297.287
# 6: 1-2 11-10 40582.44 4297.287
I'm looking at some ecological (diet) data and trying to work out how to group by Predator. I would like to be able to extract the data so that I can look at the weights of each individual prey item for each species for each predator, i.e. work out the mean weight of each species eaten by, e.g., Predator 117. I've put a sample of my data below.
Predator PreySpecies PreyWeight
1 114 10 4.2035496
2 114 10 1.6307026
3 115 1 407.7279775
4 115 1 255.5430495
5 117 10 4.2503708
6 117 10 3.6268814
7 117 10 6.4342073
8 117 10 1.8590861
9 117 10 2.3181421
10 117 10 0.9749844
11 117 10 0.7424772
12 117 15 4.2803743
13 118 1 126.8559155
14 118 1 276.0256158
15 118 1 123.0529734
16 118 1 427.1129793
17 118 3 237.0437606
18 120 1 345.1957190
19 121 1 160.6688815
You can use the aggregate function as follows:
aggregate(formula = PreyWeight ~ Predator + PreySpecies, data = diet, FUN = mean)
# Predator PreySpecies PreyWeight
# 1 115 1 331.635514
# 2 118 1 238.261871
# 3 120 1 345.195719
# 4 121 1 160.668881
# 5 118 3 237.043761
# 6 114 10 2.917126
# 7 117 10 2.886593
# 8 117 15 4.280374
There are a few different ways of getting what you want:
The aggregate function. Probably what you are after.
aggregate(PreyWeight ~ Predator + PreySpecies, data=dd, FUN=mean)
tapply: Very useful, but it only divides the variable by a single factor, hence we need to create a joint factor with the paste command:
tapply(dd$PreyWeight, paste(dd$Predator, dd$PreySpecies), mean)
ddply: Part of the plyr package. Very useful. Worth learning.
require(plyr)
ddply(dd, .(Predator, PreySpecies), summarise, mean(PreyWeight))
dcast: The output is in more of a table format. Part of the reshape2 package.
require(reshape2)
dcast(dd, Predator ~ PreySpecies, mean, fill=0, value.var="PreyWeight")
Or, if you just want the overall mean prey weight for a single predator (e.g. 117) across all prey species, simple subsetting works:
mean(data$PreyWeight[data$Predator == 117])