R: perform calculations on groups for a subset of rows

In other words, I want to group on a column and then perform calculations using only some of the rows per group.
The data set I have is:
LoanRefId Tran_Type TransactionAmount
103 11 LoanIssue 1000.0000
104 11 InitiationFee 171.0000
105 11 Interest 59.6729
106 11 AdministrationFee 64.9332
107 11 RaisedClientInstallment 1295.5757
108 11 ClientInstallment 1295.4700
109 11 PaidUp 0.0000
110 11 Adjustment 0.1361
111 11 PaidUp 0.0000
112 12 LoanIssue 3000.0000
113 12 InitiationFee 399.0000
114 12 Interest 94.9858
115 12 AdministrationFee 38.6975
116 12 RaisedClientInstallment 3532.6350
117 12 ClientInstallment 3532.6100
118 12 PaidUp 0.0000
119 12 Adjustment 0.0733
120 12 PaidUp 0.0000
I would like to repeat the following calculation for each group:
ClientInstallment - LoanIssue.
So, group 1 will be for LoanRefId number 11. The calculation will take the ClientInstallment of 1295.47 and subtract the LoanIssue of 1000 to give me a new column, call it "Income", with value 295.47.
Is this possible using data.table or dplyr, or any other clever tricks?
Alternatively, I could create two data summaries, one for ClientInstallment and one for LoanIssue, and then subtract them, but the truth is I need to do much more than just subtract two numbers, so I would need a data summary for each calculation, which is just plain unclever IMHO.
Any help is appreciated.

We can use data.table. After converting the 'data.frame' to a 'data.table' (setDT(df1)), we group by 'LoanRefId', extract the 'TransactionAmount' values where 'Tran_Type' is 'ClientInstallment' and 'LoanIssue', and subtract them.
library(data.table)
setDT(df1)[, list(Income = TransactionAmount[Tran_Type == 'ClientInstallment'] -
                           TransactionAmount[Tran_Type == 'LoanIssue']),
           by = LoanRefId]
# LoanRefId Income
#1: 11 295.47
#2: 12 532.61
We can also use dplyr with a similar approach:
library(dplyr)
df1 %>%
  group_by(LoanRefId) %>%
  summarise(Income = TransactionAmount[Tran_Type == 'ClientInstallment'] -
                     TransactionAmount[Tran_Type == 'LoanIssue'])
Update
If we don't have a 'ClientInstallment' or a 'LoanIssue' for a 'LoanRefId', we can use an if/else condition:
setDT(df1)[, list(Income = if (any(Tran_Type == 'ClientInstallment') &
                               any(Tran_Type == 'LoanIssue'))
                    TransactionAmount[Tran_Type == 'ClientInstallment'] -
                    TransactionAmount[Tran_Type == 'LoanIssue']
                  else 0),
           by = LoanRefId]
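The same guard carries over to dplyr. A minimal sketch (my addition, mirroring the data.table logic above), returning 0 for any group that lacks either transaction type:
library(dplyr)
df1 %>%
  group_by(LoanRefId) %>%
  summarise(Income = if (any(Tran_Type == 'ClientInstallment') &
                         any(Tran_Type == 'LoanIssue'))
                       TransactionAmount[Tran_Type == 'ClientInstallment'] -
                       TransactionAmount[Tran_Type == 'LoanIssue']
                     else 0)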

Related

R ddply summarise sum only selected/specific/logical rows

I have a client loan database and I want to do a ddply summarise per LoanRefId:
LoanRefId Tran_Type TransactionAmount
103 11 LoanIssue 1000.0000
104 11 InitiationFee 171.0000
105 11 Interest 59.6729
106 11 AdministrationFee 64.9332
107 11 RaisedClientInstallment 1295.5757
108 11 ClientInstallment 1295.4700
109 11 PaidUp 0.0000
110 11 Adjustment 0.1361
111 11 PaidUp 0.0000
112 12 LoanIssue 3000.0000
113 12 InitiationFee 399.0000
114 12 Interest 94.9858
115 12 AdministrationFee 38.6975
116 12 RaisedClientInstallment 3532.6350
117 12 ClientInstallment 3532.6100
118 12 PaidUp 0.0000
119 12 Adjustment 0.0733
120 12 PaidUp 0.0000
However, I only want to sum certain rows per LoanRefId; specifically, only the rows where Tran_Type == "ClientInstallment".
The only way I can think of (which doesn't seem to work) is:
> ddply(test, c("LoanRefId"), summarise, cash_in = sum(test[test$Tran_Type == "ClientInstallment","TransactionAmount"]))
LoanRefId cash_in
1 11 4828.08
2 12 4828.08
This is not summing per LoanRefId; it is simply summing all amounts where Tran_Type == "ClientInstallment", which is wrong.
Is there a better way to do this logical sum?
Someone may add a plyr answer, but nowadays base R, dplyr, or data.table are more widely used; plyr has effectively been superseded by dplyr. It is worth taking the time to learn the newer implementations, as they are more efficient and packed with features.
base R
aggregate(TransactionAmount ~ LoanRefId, df[df$Tran_Type == "ClientInstallment",], sum)
# LoanRefId TransactionAmount
#1 11 1295.47
#2 12 3532.61
dplyr
library(dplyr)
df %>%
  group_by(LoanRefId) %>%
  filter(Tran_Type == "ClientInstallment") %>%
  summarise(TransactionAmount = sum(TransactionAmount))
#Source: local data frame [2 x 2]
#
# LoanRefId TransactionAmount
# (int) (dbl)
#1 11 1295.47
#2 12 3532.61
data.table
setDT(df)[Tran_Type == "ClientInstallment", sum(TransactionAmount), by=LoanRefId]
# LoanRefId V1
#1: 11 1295.47
#2: 12 3532.61
Notice how clean the data.table syntax is :). A great tool to learn.
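The summed column comes back as V1 by default; wrapping the expression in .() lets you name it (a small variation, not in the original answer):
setDT(df)[Tran_Type == "ClientInstallment",
          .(cash_in = sum(TransactionAmount)), by = LoanRefId]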
Another base R option is tapply
with(subset(df1, Tran_Type == 'ClientInstallment'),
     tapply(TransactionAmount, LoanRefId, FUN = sum))
# 11 12
#1295.47 3532.61
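If a data frame is preferred over the named vector that tapply returns, a simple conversion works (my addition; res names the tapply result above):
res <- with(subset(df1, Tran_Type == 'ClientInstallment'),
            tapply(TransactionAmount, LoanRefId, FUN = sum))
data.frame(LoanRefId = names(res), TransactionAmount = as.vector(res))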
Or if we need plyr (going back to the past)
library(plyr)
ddply(df1, .(LoanRefId), summarise,
TransactionAmount = sum(TransactionAmount[Tran_Type=='ClientInstallment']))
# LoanRefId TransactionAmount
#1 11 1295.47
#2 12 3532.61
Here's one more possibility, just for completeness:
with(df1[df1$Tran_Type == "ClientInstallment", ],
     by(TransactionAmount, LoanRefId, sum))
#LoanRefId: 11
#[1] 1295.47
#------------------------------------------------------------
#LoanRefId: 12
#[1] 3532.61
I honestly feel data.table is a life saver. Assuming test has been converted with setDT(test):
test[Tran_Type == "ClientInstallment",
     sum(TransactionAmount), by = LoanRefId]

Reordering columns of a dataframe on the basis of column mean

I have a data frame which I want to reorder by decreasing column mean:
SNR SignalIntensity ID
1 1.0035798 6.817374 109
2 11.9438978 11.545993 110
4 3.2894878 9.780420 112
5 4.0170266 9.871984 113
6 1.6310523 9.078186 114
7 1.6405415 8.228931 116
8 1.6625413 8.043536 117
9 0.8489116 6.179346 118
10 7.5312260 10.558180 119
11 7.2832911 10.474533 120
12 0.5732577 4.157294 121
14 0.8149754 6.045174 124
I use the following code
means <- colMeans(df) ## to get mean
df <- df[,order(means)] ## to reorder
to get the column means and the ordering, but I get the columns in increasing order of mean, the opposite of what I want. What should I do to reorder by decreasing column mean?
Expected output:
ID SignalIntensity SNR
1 109 6.817374 1.0035798
2 110 11.545993 11.9438978
4 112 9.780420 3.2894878
5 113 9.871984 4.0170266
6 114 9.078186 1.6310523
7 116 8.228931 1.6405415
8 117 8.043536 1.6625413
9 118 6.179346 0.8489116
10 119 10.558180 7.5312260
11 120 10.474533 7.2832911
12 121 4.157294 0.5732577
14 124 6.045174 0.8149754
The default setting in order is decreasing=FALSE; we can change that to TRUE:
df[order(means, decreasing=TRUE)]
Or get the order of the negative values of 'means':
df[order(-means)]
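Putting both pieces together, a minimal end-to-end sketch (a single-bracket index on a data frame selects columns, so this reorders columns, not rows):
means <- colMeans(df)                     # per-column means
df <- df[order(means, decreasing = TRUE)] # columns with the highest mean first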

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame with 20 columns. I need to filter / remove noise from one column. After filtering with the convolve function I get a new vector of values, and the filtering process drops values relative to the original column. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values, but I can't bind the filtered column to the original table because the number of rows differs. Let me illustrate using the 'age' column in the 'Orange' data set in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
The convolve filter used:
smooth <- function(x, D, delta) {
  # exponential kernel of width 2*D + 1
  z <- exp(-abs(-D:D / delta))
  # normalised weighted moving average; type = 'filter' already trims
  # D values from each end
  r <- convolve(x, z, type = 'filter') /
       convolve(rep(1, length(x)), z, type = 'filter')
  # trim D further values from each end
  r <- head(tail(r, -D), -D)
  r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The age column has 35 values, while age2 has only 15, because the filter removed the first and last ten values. The original data set has two more columns that I also need to work with, so I only want the 15 rows of each column that correspond to the 15 rows of age2. How can I apply the filter so that I get a truncated data set with all columns and only the filtered rows?
You would need to figure out how the variables line up. If you can add NAs to age2 and then do Orange$age2 <- age2 followed by na.omit(Orange), you should have what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
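For completeness, the NA-padding route mentioned at the start of this answer would look something like this (a sketch, assuming ten values were dropped from each end, as in this example):
Orange$age2 <- c(rep(NA, 10), age2, rep(NA, 10)) # pad back to the original length
df <- na.omit(Orange)                            # keep rows where age2 survived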
Edit: If you know the first and last x observations will be removed, then the following works (here x is 10, matching the example above):
x <- 10
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table in which the p-values are added up. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row in different list elements, so I can't just do a simple sum.
The brute-force way to do this is to: 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end and $number columns from the first table, and 3) pull the correct p-values from all the tables using a which() statement. Is there a cleverer way of doing this, or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1. Convert your list to a data.frame.
2. Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
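If the data is large, a data.table version of the same aggregation might look like this (a sketch, assuming X is the list of matrices as above):
library(data.table)
XDT <- rbindlist(lapply(X, as.data.table)) # stack the list into one data.table
XDT[, .(sump = sum(p_value)), by = .(cluster_size, start, end, number)]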

Grouping ecological data in R

I'm looking at some ecological (diet) data and trying to work out how to group by Predator. I would like to be able to extract the data so that I can look at the weights of each individual prey item, per species, per predator, i.e. work out the mean weight of each species eaten by, e.g., Predator 117. I've put a sample of my data below.
Predator PreySpecies PreyWeight
1 114 10 4.2035496
2 114 10 1.6307026
3 115 1 407.7279775
4 115 1 255.5430495
5 117 10 4.2503708
6 117 10 3.6268814
7 117 10 6.4342073
8 117 10 1.8590861
9 117 10 2.3181421
10 117 10 0.9749844
11 117 10 0.7424772
12 117 15 4.2803743
13 118 1 126.8559155
14 118 1 276.0256158
15 118 1 123.0529734
16 118 1 427.1129793
17 118 3 237.0437606
18 120 1 345.1957190
19 121 1 160.6688815
You can use the aggregate function as follows:
aggregate(formula = PreyWeight ~ Predator + PreySpecies, data = diet, FUN = mean)
# Predator PreySpecies PreyWeight
# 1 115 1 331.635514
# 2 118 1 238.261871
# 3 120 1 345.195719
# 4 121 1 160.668881
# 5 118 3 237.043761
# 6 114 10 2.917126
# 7 117 10 2.886593
# 8 117 15 4.280374
There are a few different ways of getting what you want:
The aggregate function. Probably what you are after.
aggregate(PreyWeight ~ Predator + PreySpecies, data=dd, FUN=mean)
tapply: Very useful, but it only divides the variable by a single factor, so we need to create a joint factor with the paste command:
tapply(dd$PreyWeight, paste(dd$Predator, dd$PreySpecies), mean)
ddply: Part of the plyr package. Very useful. Worth learning.
require(plyr)
ddply(dd, .(Predator, PreySpecies), summarise, mean(PreyWeight))
dcast: The output is in more of a table format. Part of the reshape2 package.
require(reshape2)
dcast(dd, Predator ~ PreySpecies, value.var = "PreyWeight", fun.aggregate = mean, fill = 0)
And for a single predator, a plain base R subset works:
mean(data$PreyWeight[data$Predator == 117])
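For comparison, a dplyr version of the same grouping (a sketch, assuming the data frame is named dd as above):
library(dplyr)
dd %>%
  group_by(Predator, PreySpecies) %>%
  summarise(MeanPreyWeight = mean(PreyWeight))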
