Adding Standard Deviation for Each Unique Factor Grouping in R

I'm trying to bring in the standard deviation for each unique factor grouping in my data. I've researched techniques using the data.table package and the plyr package and haven't had any luck. Here is a basic example of what I'm trying to accomplish.
Group Hours
120 45
120 60
120 54
121 33
121 55
121 40
I'm trying to turn the above into:
Group Hours SD
120 45 7.343
120 60 7.343
120 54 7.343
121 33 9.833
121 55 9.833
121 40 9.833

Base solution (assuming your data is called df)
transform(df, SD = ave(Hours, Group, FUN = sd))
data.table solution
library(data.table)
setDT(df)[, SD := sd(Hours), by = Group]
dplyr solution
library(dplyr)
df %>%
group_by(Group) %>%
mutate(SD = sd(Hours))
And here's a plyr solution (my first ever), since you asked for it:
library(plyr)
ddply(df, .(Group), mutate, SD = sd(Hours))
(It is better to avoid having both plyr and dplyr loaded at the same time)
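For completeness, here is a small reproducible check (a sketch; it simply rebuilds the example data above as df and confirms the base and dplyr approaches produce the same SD column):
# Rebuild the example data from the question
df <- data.frame(Group = rep(c(120, 121), each = 3),
                 Hours = c(45, 60, 54, 33, 55, 40))
# Base R: ave() repeats the per-group SD on every row of the group
base_res <- transform(df, SD = ave(Hours, Group, FUN = sd))
# dplyr: mutate() keeps all rows and adds the group-wise SD
library(dplyr)
dplyr_res <- df %>% group_by(Group) %>% mutate(SD = sd(Hours)) %>% ungroup()
all.equal(base_res$SD, dplyr_res$SD)  # should be TRUE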

Thank you, David, for your detailed response! I used the data.table approach to get what I was looking for. Here is a snippet of my final script, based on David's answer.
PayrollHoursSD <- as.data.table(PayrollHours2)[, SD := sd(TOTAL.HOURS), by = COMBO]
head(PayrollHoursSD)
# COMBO PAY.END.DATE TOTAL.HOURS SD
# 1: 1-2 10-06 42561.78 4297.287
# 2: 1-2 10-13 42177.88 4297.287
# 3: 1-2 10-20 44691.23 4297.287
# 4: 1-2 10-27 42709.28 4297.287
# 5: 1-2 11-03 44876.25 4297.287
# 6: 1-2 11-10 40582.44 4297.287

Related

Subtracting one column from another column of the next row in R

I want to calculate c = b - a (within the same row), and d = b minus a from the next row (e.g. for row 2, d = b2 - a3). How can I do that?
name   a  b   c=b-a d=b2-a3…
peter  80 100 20    30
dancy  70 90  20    20
tiger  70 85  15    20
pop    85 101 16    29
rock   72 111 39
Thank you so much!
Presuming your data is in a data frame, to recreate your columns using a tidyverse approach you could do:
library(tidyverse)
yourdata <- yourdata %>%
mutate(c = b - a,
d = b - lead(a))
To do the opposite you can use lag; to increase the number of steps in either lag or lead, use lag(column_name, n = number_of_steps).
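For illustration, here is a minimal sketch of lead and lag with an explicit step count (the toy columns a and b mirror the example above):
library(dplyr)
toy <- data.frame(a = c(80, 70, 70, 85, 72),
                  b = c(100, 90, 85, 101, 111))
toy %>%
  mutate(d_lead  = b - lead(a),           # a taken from the next row
         d_lag   = b - lag(a),            # a taken from the previous row
         d_lead2 = b - lead(a, n = 2))    # look two rows ahead instead of one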
Here is an option with data.table
library(data.table)
setDT(df1)[, c := b - a][, d := b - shift(a, type = 'lead')]

How to efficiently determine the maximum difference between each row's value and subsequent values of the same variable within a group, using data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within group (Y) in data.table in R?
Example:
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you advise a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
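To see the idea on a toy vector (a sketch, separate from the timed code below): walking from the end, the running maximum of the values seen so far is exactly the largest remaining value for each position, so a single subtraction per row is enough.
x <- c(126, 137, 157, 191, 120, 190, 195, 166, 163, 106)
# Original idea: compare each value against every subsequent value
slow <- vapply(seq_along(x), function(i) max(x[i:length(x)] - x[i]), numeric(1))
# Faster idea: reverse, take the cumulative max, subtract once per element
fast <- rev(cummax(rev(x))) - x
all.equal(slow, fast)  # TRUE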
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_data_frame() %>%
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @David Arenburg suggests in the comments, this can be done lightning fast in data.table with one elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
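A quick sanity check (assuming dt still carries the max_diff column from the original vapply code) confirms the one-liner reproduces the same result:
all.equal(dt$max_diff, as.numeric(dt$max_diff2))  # should be TRUE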

R ddply summarise sum only selected/specific/logical rows

I have a client loan database and I want to do a ddply summarise per LoanRefId:
LoanRefId Tran_Type TransactionAmount
103 11 LoanIssue 1000.0000
104 11 InitiationFee 171.0000
105 11 Interest 59.6729
106 11 AdministrationFee 64.9332
107 11 RaisedClientInstallment 1295.5757
108 11 ClientInstallment 1295.4700
109 11 PaidUp 0.0000
110 11 Adjustment 0.1361
111 11 PaidUp 0.0000
112 12 LoanIssue 3000.0000
113 12 InitiationFee 399.0000
114 12 Interest 94.9858
115 12 AdministrationFee 38.6975
116 12 RaisedClientInstallment 3532.6350
117 12 ClientInstallment 3532.6100
118 12 PaidUp 0.0000
119 12 Adjustment 0.0733
120 12 PaidUp 0.0000
However, I only want to sum certain rows per LoanRefId; specifically, only the rows where Tran_Type == "ClientInstallment".
The only way I can think of (which doesn't seem to work) is:
> ddply(test, c("LoanRefId"), summarise, cash_in = sum(test[test$Tran_Type == "ClientInstallment","TransactionAmount"]))
LoanRefId cash_in
1 11 4828.08
2 12 4828.08
This is not summing per LoanRefId; it is simply summing all amounts where Tran_Type == "ClientInstallment", which is wrong.
Is there a better way to do this logical sum?
Someone may add a plyr answer, but nowadays base R, dplyr, and data.table are more widely used; plyr has effectively been superseded by dplyr. It is worth taking the time to learn the newer implementations, as they are more efficient and packed with features.
base R
aggregate(TransactionAmount ~ LoanRefId, df[df$Tran_Type == "ClientInstallment",], sum)
# LoanRefId TransactionAmount
#1 11 1295.47
#2 12 3532.61
dplyr
library(dplyr)
df %>%
group_by(LoanRefId) %>%
filter(Tran_Type == "ClientInstallment") %>%
summarise(TransactionAmount = sum(TransactionAmount))
#Source: local data frame [2 x 2]
#
# LoanRefId TransactionAmount
# (int) (dbl)
#1 11 1295.47
#2 12 3532.61
data.table
setDT(df)[Tran_Type == "ClientInstallment", sum(TransactionAmount), by=LoanRefId]
# LoanRefId V1
#1: 11 1295.47
#2: 12 3532.61
Notice how clean data.table syntax is :). Great tool to learn.
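If you want a named column instead of the default V1 (the OP used cash_in), a small variation of the same call does it; .(name = expr) in j names the aggregated column:
setDT(df)[Tran_Type == "ClientInstallment",
          .(cash_in = sum(TransactionAmount)),
          by = LoanRefId]
#    LoanRefId cash_in
# 1:        11 1295.47
# 2:        12 3532.61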
Another base R option is tapply
with(subset(df1, Tran_Type=='ClientInstallment'),
tapply(TransactionAmount, LoanRefId, FUN=sum))
# 11 12
#1295.47 3532.61
Or if we need plyr (going back to the past)
library(plyr)
ddply(df1, .(LoanRefId), summarise,
TransactionAmount = sum(TransactionAmount[Tran_Type=='ClientInstallment']))
# LoanRefId TransactionAmount
#1 11 1295.47
#2 12 3532.61
Here's one more possibility, just for completeness:
with(df1[df1$Tran_Type=="ClientInstallment",], by(TransactionAmount, LoanRefId, sum))
#LoanRefId: 11
#[1] 1295.47
#------------------------------------------------------------
#LoanRefId: 12
#[1] 3532.61
I honestly feel data.table is a life saver.
test[Tran_Type == "ClientInstallment",
sum(TransactionAmount), by=LoanRefId]
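One caveat: that line assumes test is already a data.table; on a plain data.frame the filtering in i will not work this way. A minimal sketch:
library(data.table)
setDT(test)  # convert the data.frame to a data.table by reference
test[Tran_Type == "ClientInstallment", sum(TransactionAmount), by = LoanRefId]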

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data.
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The analogous function in the tidyr package is gather (more recently, pivot_longer).
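If you want to avoid reshaping altogether, one option is to loop over the site columns directly (a base R sketch; the statistic names are illustrative):
# Compute the same summary for each site column without stacking
stats_per_site <- sapply(DF, function(x) {
  n <- length(x)
  c(N = n, mean = mean(x), sd = sd(x), se = sd(x) / sqrt(n), sum = sum(x))
})
t(stats_per_site)  # one row per site, one column per statistic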

Finding max value for a column in dataframe for each day in R [duplicate]

This question already has answers here:
Finding maximum value of one column (by group) and inserting value into another data frame in R
(3 answers)
Closed 8 years ago.
I need to find the max value of the count column in the data frame, grouped by day. Here is the sample data:
Date count
7/28/2014 00:30:31 95
7/28/2014 01:30:57 62
7/28/2014 15:42:42 112
7/28/2014 15:42:42 150
7/31/2014 17:12:22 12
7/31/2014 04:45:47 97
8/2/2014 21:12:06 85
8/2/2014 23:05:09 96
8/2/2014 18:17:42 48
8/2/2014 19:53:02 89
8/2/2014 14:18:38 201
My requirement is to find the max value of count per day. How can this be done in R?
Sorry, I forgot to mention: the Date column holds timestamp-formatted values.
Assuming that your data is in a data.frame called bar, you can use by():
> with(bar,by(count,Date,max))
Date: 7/28/2014
[1] 150
-------------------------
Date: 7/31/2014
[1] 97
-------------------------
Date: 8/2/2014
[1] 201
There are many ways to do this, another option with base R is to use aggregate (assuming your data is called dat):
aggregate(count ~ Date, data = dat, max)
# Date count
#1 7/28/2014 150
#2 7/31/2014 97
#3 8/2/2014 201
Using the package dplyr in case you have a large data set and need better speed:
library(dplyr)
dat %>% group_by(Date) %>% summarize(maxCount = max(count))
#Source: local data frame [3 x 2]
#
# Date maxCount
#1 7/28/2014 150
#2 7/31/2014 97
#3 8/2/2014 201
Using data.table for bigger datasets
library(data.table)
setDT(dat)[, list(maxCount=max(count)), by=Date]
# Date maxCount
#1: 7/28/2014 150
#2: 7/31/2014 97
#3: 8/2/2014 201
Benchmarks for slightly bigger datasets
set.seed(455)
dat1 <- data.frame(group=sample(1:5000, 1e7, replace=TRUE), count=sample(200, 1e7, replace=TRUE))
f1<- function() dat1 %>% group_by(group) %>% summarize(maxCount = max(count))
f2 <- function() setDT(dat1)[, list(maxCount=max(count)), by=group]
library(microbenchmark)
microbenchmark(f1(),f2(), unit="relative")
# expr min lq median uq max neval
# f1() 1.914458 2.049166 2.221317 2.256047 2.888778 100
# f2() 1.000000 1.000000 1.000000 1.000000 1.000000 100
tapply has easy syntax and gives a clear tabular output:
with(ddf, tapply(count, Date, max))
7/28/2014 7/31/2014 8/2/2014
150 97 201
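One caveat for all of the approaches above: since the Date column actually holds timestamps (per the OP's comment), you may want to derive a day-only column first and group on that (a minimal sketch; the Day name is just illustrative):
# Strip the time-of-day part before grouping; format matches "7/28/2014 00:30:31"
dat$Day <- as.Date(dat$Date, format = "%m/%d/%Y")
aggregate(count ~ Day, data = dat, max)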
