I work for an insurance company and I am trying to improve something that I built. I have about 150 data frames that look like this:
library(data.table)
dt_Premium<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
I take the base premium and then I multiply by factors, and then add a fixed expense at the very end. My code is currently something like:
dt_Final_Premium<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
What I hate about this:
-The 2:4 stuff (I would like to be able to use a named range)
-The typing is monstrous considering all of the tables and policies I actually have
-It is very confusing for anybody except me (the author) to understand and edit/adjust the code
-I would like to be able to have each rating step as part of a list, and then just iterate over that list (or a similar process).
-Ideally I would be able to get the values at each step. For example:
step2_answer<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4])
There just has to be a way where I can take a dataframe/datatable and then just multiply or add to the next dataframe/datatable in the series. Thanks for taking a look at this!
How about something like this using dplyr?
Here I am using the same calculation you mentioned, but spelled out column by column with dplyr's mutate(), which makes the step-by-step logic clear and easy for anyone to understand.
library(data.table)
library(dplyr)
dt_Premium <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
dt_Final_Premium <- cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
new_dt_final_premium <-
dt_Premium %>%
# Joining all tables together
left_join(dt_Discount_Factors, by = "Policy") %>%
left_join(dt_Territory_Factors, by = "Policy") %>%
left_join(dt_Fixed_Expense, by = "Policy") %>%
# Calculating the final premiums
mutate(
Base_Premium_Fire =
Base_Premium_Fire * Discount_Factor_Fire * Territory_Factor_Fire + Fixed_Expense_Fire,
Base_Premium_Water =
Base_Premium_Water * Discount_Factor_Water * Territory_Factor_Water + Fixed_Expense_Water,
Base_Premium_Theft =
Base_Premium_Theft * Discount_Factor_Theft * Territory_Factor_Theft + Fixed_Expense_Theft) %>%
select(Policy, Base_Premium_Fire, Base_Premium_Water, Base_Premium_Theft)
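If you also want the values at each intermediate step, as the question asks (e.g. a step2_answer after applying only the discount factors), one option is to keep them as extra columns in the same pipeline. A minimal sketch; the Step2_* column names are just illustrative:
step2_answer <- dt_Premium %>%
  left_join(dt_Discount_Factors, by = "Policy") %>%
  mutate(
    # step 2: base premium after the discount factor only
    Step2_Fire  = Base_Premium_Fire  * Discount_Factor_Fire,
    Step2_Water = Base_Premium_Water * Discount_Factor_Water,
    Step2_Theft = Base_Premium_Theft * Discount_Factor_Theft) %>%
  select(Policy, starts_with("Step2"))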
Since your columns follow a clean naming scheme, some pivoting can do the trick:
library(tidyverse) #to be run after library(data.table)
dt_Premium %>%
left_join(dt_Discount_Factors, by="Policy") %>%
left_join(dt_Territory_Factors, by="Policy") %>%
left_join(dt_Fixed_Expense, by="Policy") %>%
pivot_longer(cols=-Policy)%>%
separate(name, into=c("name", "object"), sep="_.*_") %>%
pivot_wider() %>%
mutate(total=Base*Discount*Territory+Fixed) %>% #or calculate the value for a specific step
select(Policy, object, total) %>%
pivot_wider(names_from = "object", values_from = "total")
After joining all the tables, you can pivot to a long format, turning columns into rows. There you can separate the name into the real name (Base, Discount, Fixed, ...) and the object (Fire, Water, ...), then return to the wide format. The tricky part is getting a good regular expression, as your names use the underscore twice; mine can certainly be improved, but it does the job for now.
After this, you can calculate whatever you want, select only the result, and pivot to wide one last time. If you want to keep all the intermediate results, you can tweak this last pivot with prefixes, as sketched below.
Pivoting takes a bit of gymnastics, but it proves very effective once you get used to it.
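For example, to keep every intermediate value rather than just the total, a sketch of that last pivot (the step2/step3 names are illustrative; with several values_from columns, pivot_wider() prefixes each output column with the value name):
dt_Premium %>%
  left_join(dt_Discount_Factors, by = "Policy") %>%
  left_join(dt_Territory_Factors, by = "Policy") %>%
  left_join(dt_Fixed_Expense, by = "Policy") %>%
  pivot_longer(cols = -Policy) %>%
  separate(name, into = c("name", "object"), sep = "_.*_") %>%
  pivot_wider() %>%
  mutate(step2 = Base * Discount,             # after discount factors
         step3 = Base * Discount * Territory, # after territory factors
         total = step3 + Fixed) %>%           # final premium
  select(Policy, object, step2, step3, total) %>%
  pivot_wider(names_from = "object", values_from = c(step2, step3, total))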
As you have a lot of tables, if you can get them as a list, you can also use purrr::reduce to join them all at once and simplify the first lines of code:
list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense) %>%
reduce(left_join, by='Policy') %>%
pivot_longer(cols=-Policy)%>%
separate(name, into=c("name", "object"), sep="_.*_") %>%
pivot_wider() %>%
mutate(total=Base*Discount*Territory+Fixed) %>% #or calculate the value for a specific step
select(Policy, object, total) %>%
pivot_wider(names_from = "object", values_from = "total")
Another option is to reorganize the data by converting it to a long format, merging, and then performing the calculations:
DT <- Reduce(merge, lapply(dtList, function(d) {
vn <- sub('_([^_]*)$', '', names(d)[2L]) #see reference [1]
melt(d, id.vars="Policy", value.name=vn)[,
variable := gsub("(.*)_(.*)_(.*)", "\\3", variable)]
}))
DT
DT[, disc_prem := Base_Premium * Discount_Factor][,
disc_prem_loc := disc_prem * Territory_Factor][,
Final_Premium := disc_prem_loc + Fixed_Expense]
output:
Policy variable Base_Premium Discount_Factor Territory_Factor Fixed_Expense disc_prem disc_prem_loc Final_Premium
1: Pol123 Fire 45 0.90 1.90 5 40.50 76.9500 81.9500
2: Pol123 Theft 3 1.00 1.00 9 3.00 3.0000 12.0000
3: Pol123 Water 20 0.80 1.03 7 16.00 16.4800 23.4800
4: Pol333 Fire 55 0.95 1.20 5 52.25 62.7000 67.7000
5: Pol333 Theft 5 1.00 1.50 9 5.00 7.5000 16.5000
6: Pol333 Water 21 0.85 1.30 7 17.85 23.2050 30.2050
7: Pol555 Fire 105 0.99 0.91 5 103.95 94.5945 99.5945
8: Pol555 Theft 6 1.00 1.00 9 6.00 6.0000 15.0000
9: Pol555 Water 24 0.90 1.25 7 21.60 27.0000 34.0000
10: Pol999 Fire 92 0.97 1.03 5 89.24 91.9172 96.9172
11: Pol999 Theft 7 1.00 0.50 9 7.00 3.5000 12.5000
12: Pol999 Water 29 0.96 1.01 7 27.84 28.1184 35.1184
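If the final premiums are needed back in the original wide layout afterwards, data.table::dcast can reshape this long result; a minimal sketch:
dcast(DT, Policy ~ variable, value.var = "Final_Premium")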
data:
dtList <- list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense)
Reference:
[1] regex-return-all-before-the-second-occurrence
I am guessing that reading some of the data.table vignettes would help you tighten up the syntax and make it more terse. Some of us think terse = 'more readable' in numeric programming; others think that represents some level of insanity:
vignette(package="data.table")
Understanding Map, Reduce, mget and other functional notation in R and data.table may help. Here are some things I have done from a data.table mindset:
Dropping cols syntax might be more terse using 'i' to drop a vector of cols:
dt[is.na(dt)] <- 0 # replace NA with 0
drop_col_list <- c('dropcol1','dropcol2','dropcol3') # drop col list
# dt <- dt[!drop_col_list,sapply(dt,as.numeric)] # make selected dt cols numeric type
dt[!drop_col_list,SumCol := Reduce(`+`, dt)] # adds Sum col with 'functional programming' iteration
The lapply(.SD, func) format is very powerful:
fsum <- function(x) {sum(x,na.rm=TRUE)}
dt[, lapply(.SD, fsum), .SDcols = c("col1","col2","col3","col4")]
# or
dt[!drop_col_list,lapply(.SD,fsum)]
This shows applying the internal data.table 'set' function (':=') and mget to create cols derived from operations with functional programming on two data.tables. The data.table(s) may need to have the same nrow():
nm1 <- names(dt1)[1:4]
nm2 <- names(dt2)[1:4]
dt[, SumCol := Reduce(`+`, Map(`*`, mget(nm1), mget(nm2)))]
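Applied to the tables in the question (a sketch; it assumes the tables share the same Policy order, as they do in the example), the same Map pattern computes the discounted premiums in one assignment:
nm_base <- names(dt_Premium)[2:4]
nm_disc <- names(dt_Discount_Factors)[2:4]
dt_step2 <- copy(dt_Premium)   # copy so dt_Premium is not modified by reference
dt_step2[, (nm_base) := Map(`*`, dt_Premium[, ..nm_base], dt_Discount_Factors[, ..nm_disc])]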
The loop below isn't really data.table'esque programming but outputs a data.table. Probably this isn't as fast as more data.table-like syntax:
seqXpi <- function(x) {x * pi}
seqXexp <- function(x) {x * exp(1)}
l <- NULL
for (x in seq(1, 10, 1)) l <- as.data.table(rbind(l, cbind(seq = x, seqXpi = seqXpi(x), seqXexp = seqXexp(x))))
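A more data.table-flavoured version of that loop, as a sketch, builds the rows in a list and binds them once with rbindlist:
dt_seq <- rbindlist(lapply(seq(1, 10, 1), function(x)
  data.table(seq = x, seqXpi = seqXpi(x), seqXexp = seqXexp(x))))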
Normally, I would handle this kind of issue by running a loop (probably still not the best solution), but I am working with an extremely large dataset (7.8 million observations) and I have been trying to program it more efficiently. Here is a very small subset of my dataset:
df = data.frame(STATE = c("PA", "PA", "MD","MD", "MO", "MO"),
DIVISION = c("Middle_Atlantic", "Middle_Atlantic","South_Atlantic","South_Atlantic","West_North_Central","West_North_Central"),
Middle_Atlantic_NSA = c(117.77, 119.43, 119.43, 120.72, 119.11, 117.77),
Middle_Atlantic_SA = c(118.45, 119.65, 119.65, 120.73, 119, 118.45),
South_Atlantic_NSA = c(134.45, 135.2, 135.2, 136.69, 134.07, 134.45),
South_Atlantic_SA = c(134.25, 134.83, 134.83, 135.97, 133.86, 134.25),
West_North_Central_NSA=c(152.24, 153.61, 153.61, 155.19, 151.08, 152.24),
West_North_Central_SA=c(152.77, 153.19, 153.19, 154.44, 151.63, 152.77),
DIV_HPI_NSA = c(117.77, 119.43, 135.2, 136.69, 151.08, 152.24),
DIV_HPI_SA = c(118.45, 119.65, 134.83, 135.97, 151.63, 152.77))
I have included my desired output for variables "DIV_HPI_NSA" and "DIV_HPI_SA". What I am attempting to accomplish is to look up the value in "DIVISION" (e.g. "Middle_Atlantic"), attach the suffix "_NSA" to it, and return the corresponding value of that variable (in this case "Middle_Atlantic_NSA") to the new variable "DIV_HPI_NSA". I am doing the same thing for the "DIV_HPI_SA" variable. Currently, I am trying to use either the get() function or the eval(parse(text = "text_here")) method to evaluate the strings as column names and produce the correct values, but they are not working as desired. Ideally I would prefer a dplyr solution, as it has been processing relatively quickly compared to loops. I am not sure why this is not working in dplyr, and would like to understand why and how I could execute it successfully.
Here is my current code:
comb.df = df %>%
mutate(DIV_HPI_NSA = get(paste0(DIVISION,"_NSA")),
DIV_HPI_SA = eval(parse(text = (paste0(DIVISION,"_SA")))))
This is how I would do it through a loop - which produces the correct result but it takes a ridiculous amount of time:
for(i in 1:dim(comb.df)[1]){
comb.df$DIV_HPI_NSA[i] = comb.df[i, paste0(comb.df$DIVISION[i],"_NSA")]
comb.df$DIV_HPI_SA[i] = comb.df[i, paste0(comb.df$DIVISION[i],"_SA")]
}
My current output (i.e. DIV_HPI_NSA) keeps returning the column that corresponds to the first element of the "DIVISION" column. For example, the dplyr method for "DIV_HPI_NSA" returns only values from the "Middle_Atlantic_NSA" column, as that is the first element in "DIVISION". The eval() approach has the same issue and does not generate the correct row-wise output.
Is there a better/faster method than dplyr, and/or how can I fix my dplyr code for it to work properly?
Please let me know if you may need additional information.
Thanks in advance!
The answer may depend on the number of values DIVISION can take.
Here is a little benchmark with only "_NSA", but obviously you can do the same with "_SA" later.
#your base function in a for loop
x1 = function(db){
for(i in 1:dim(db)[1]){
db$DIV_HPI_NSA[i] = db[i, paste0(db$DIVISION[i],"_NSA")]
db$DIV_HPI_SA[i] = db[i, paste0(db$DIVISION[i],"_SA")]
}
db}
#the very same function using 'apply', which is supposed to be much faster than base loop
x2= function(db){
db %>% apply(1, function(x){
x["DIV_HPI_NSA2"] = x[paste0(x["DIVISION"],"_NSA")]
x["DIV_HPI_SA2"] = x[paste0(x["DIVISION"],"_SA")]
x
}) %>% t %>% as.data.frame
}
#if DIVISION has few values, you can use 'dplyr::case_when' this way
x3= function(db){
db %>% mutate(output2 = case_when(
DIVISION=="Middle_Atlantic" ~ Middle_Atlantic_NSA,
DIVISION=="South_Atlantic" ~ South_Atlantic_NSA,
DIVISION=="West_North_Central" ~ West_North_Central_NSA
))
}
#but if DIVISION can take a lot of values, you may have to rlang the function a bit
x4= function(db){
db = db %>% mutate(output2 = -999) #start with dummy value
xx=data.frame(A=db$DIVISION, B=paste0(db$DIVISION,"_NSA"), stringsAsFactors = F) %>%
unique %>%
split(seq(nrow(.))) #turns xx into a list of its rows
for(i in xx){
db = db %>% mutate(output2 = case_when(DIVISION==i$A ~ !!sym(i$B), T~output2))
}
db
}
#here are some replicates of your dataset to increase the number of lines
df60 = df[rep(seq_len(nrow(df)), 10),]
df600 = df[rep(seq_len(nrow(df)), 100),]
df6k = df[rep(seq_len(nrow(df)), 1000),]
df60k = df[rep(seq_len(nrow(df)), 10000),]
df600k = df[rep(seq_len(nrow(df)), 100000),]
#the benchmark of every function with every dataset
library(microbenchmark)
(mbm=microbenchmark(
base = x1(df),
base60 = df60 %>% x1,
base600 = df600 %>% x1,
base6k = df6k %>% x1,
apply = x2(df),
apply60 = df60 %>% x2,
apply600 = df600 %>% x2,
apply6k = df6k %>% x2,
dplyr = x3(df),
dplyr60 = x3(df60),
dplyr600 = x3(df600),
dplyr6k = x3(df6k),
dplyr60k = x3(df60k),
dplyr600k = x3(df600k),
dplyrcw = x4(df),
dplyrcw60 = x4(df60),
dplyrcw600 = x4(df600),
dplyrcw6k = x4(df6k),
dplyrcw60k = x4(df60k),
dplyrcw600k = x4(df600k),
times=6
))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# base 515.283 599.3395 664.6767 683.396 739.3735 795.351 3 a
# base60 5125.835 5209.1620 5515.3047 5292.489 5710.0395 6127.590 3 a
# base600 53225.746 53300.1395 66678.0210 53374.533 73404.1585 93433.784 3 b
# base6k 587666.127 618005.9505 629841.8157 648345.774 650929.6600 653513.546 3 d
# apply 1220.559 1272.8895 1342.4810 1325.220 1403.4420 1481.664 3 a
# apply60 2265.710 2384.9575 2497.3980 2504.205 2613.2420 2722.279 3 a
# apply600 10852.649 11579.6225 12047.9227 12306.596 12645.5595 12984.523 3 a
# apply6k 114463.342 125155.8980 137072.6593 135848.454 148377.3180 160906.182 3 c
# dplyr 1298.964 1352.9355 1433.0417 1406.907 1500.0805 1593.254 3 a
# dplyr60 1604.559 1647.0435 1713.2313 1689.528 1767.5675 1845.607 3 a
# dplyr600 1357.676 1456.6845 1556.4223 1555.693 1655.7955 1755.898 3 a
# dplyr6k 1954.644 1970.1425 2025.0260 1985.641 2060.2170 2134.793 3 a
# dplyr60k 6366.085 6584.1590 6809.2833 6802.233 7030.8825 7259.532 3 a
# dplyr600k 46893.576 53406.6235 58086.0983 59919.671 63682.3595 67445.048 3 b
# dplyrcw 5824.182 5834.0285 5999.5897 5843.875 6087.2935 6330.712 3 a
# dplyrcw60 5591.885 5683.0535 6032.4097 5774.222 6252.6720 6731.122 3 a
# dplyrcw600 5664.820 5811.2360 5900.6413 5957.652 6018.5520 6079.452 3 a
# dplyrcw6k 6390.883 6522.7120 9003.2733 6654.541 10309.4685 13964.396 3 a
# dplyrcw60k 14379.395 14936.6140 15179.6070 15493.833 15579.7130 15665.593 3 a
# dplyrcw600k 85238.503 86607.3005 92601.6017 87976.098 96283.1510 104590.204 3 b
Conclusion
For a 6k-line dataset:
apply (~137 ms) is roughly 4-5x faster than base (~630 ms)
vanilla dplyr is far faster still (~2 ms)
rlang-ed dplyr is a bit slower than vanilla dplyr (~9 ms)
Times for base and apply grow roughly linearly at about 0.1 ms per line, so 8M lines should take on the order of 800 seconds with the base loop.
dplyr times do not grow strictly linearly in this benchmark, so I cannot say exactly how it will scale to your full dataset, but it is clearly the fastest option here.
My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds from ggplot2.
I know I could do it like this:
ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))
But while it is reasonable for 3 columns, it is unacceptable for 17 of them.
When researching this, I found the colwise function, but the best I came up with is this:
cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])
Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):
ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))
or:
ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))
To sum up:
I don't want to have to type all 17 columns explicitly, like the first example does with x, y, and z.
Ideally, I would like to do it with a single call to ddply, without resorting to cbind (or similar functions), as in the second example.
For reference, the result I expect is 5 rows and 5 columns:
cut x y z price
1 Fair 10057.50 9954.07 6412.26 4358.758
2 Good 28645.08 28703.75 17855.42 3928.864
3 Very Good 69359.09 69713.45 43009.52 3981.760
4 Premium 82385.88 81985.82 50297.49 4584.258
5 Ideal 118691.07 118963.24 73304.61 3457.542
I would like to suggest data.table solutions for this. You can easily predefine the columns you want to operate on, either by position or by name, and then reuse the same code no matter how many columns there are.
Predefine column names
Sums <- 7:9
Means <- "price"
Run the code
library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
lapply(.SD[, Means, with = FALSE], mean))
, by = cut]
# cut x y z price
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
For your specific example, this could be simplified to just
data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
# cut x y z pe
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
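The same call also works with the column sets predefined by name rather than by position, for example:
Sums <- c("x", "y", "z")
Means <- "price"
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
                         lapply(.SD[, Means, with = FALSE], mean))
                     , by = cut]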
Another solution using dplyr. First you apply both aggregate functions to every variable you want aggregated; from the resulting variables you then select only the desired function/variable combinations.
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise_each(funs(sum, mean), x:z, price) %>%
select(cut, matches("[xyz]_sum"), price_mean)
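summarise_each() and funs() have since been superseded; in current dplyr (1.0 or later) the same idea can be written with across(), roughly:
diamonds %>%
  group_by(cut) %>%
  summarise(across(x:z, sum), price = mean(price))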
Yet another approach (in my opinion easier to read) for your particular case (since mean = sum / n):
nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1
or the more generic,
do.call(merge,
lapply(c(colwise(sum, 7:9), colwise(mean, 6)),
function(cw) ddply(diamonds, ~cut, cw)))
Just to throw in another solution:
library(plyr)
library(ggplot2)
trans <- list(mean = 8:10, sum = 7)
makeList <- function(inL, mdat = diamonds, by = ~cut) {
colN <- names(mdat)
args <- unlist(llply(names(inL), function(n) {
llply(inL[[n]], function(x) {
ret <- list(call(n, as.symbol(colN[[x]])))
names(ret) <- paste(n, colN[[x]], sep = ".")
ret
})
}))
args$.data <- as.symbol(deparse(substitute(mdat)))
args$.variables <- by
args$.fun <- as.symbol("summarise")
args
}
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price
# 1 Fair 6.246894 6.182652 3.982770 7017600
# 2 Good 5.838785 5.850744 3.639507 19275009
# 3 Very Good 5.740696 5.770026 3.559801 48107623
# 4 Premium 5.973887 5.944879 3.647124 63221498
# 5 Ideal 5.507451 5.520080 3.401448 74513487
The idea is that the function makeList creates an argument list for ddply. In this way you can quite easily add terms to the list (as function.name = column.indices) and ddply will work as expected:
trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price sd.y sd.z
# 1 Fair 6.246894 6.182652 3.982770 7017600 0.9563804 0.6516384
# 2 Good 5.838785 5.850744 3.639507 19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801 48107623 1.1029236 0.7302281
# 4 Premium 5.973887 5.944879 3.647124 63221498 1.2597511 0.7311610
# 5 Ideal 5.507451 5.520080 3.401448 74513487 1.0744953 0.6576481
This uses dplyr, but I believe it accomplishes the specified aim completely, in reasonably easy-to-read syntax:
diamonds %>%
group_by(cut) %>%
select(x:z) %>%
summarize_each(funs(sum)) %>%
merge(diamonds %>%
group_by(cut) %>%
summarize(price = mean(price))
,by = "cut")
The only "trick" is that there is a piped expression inside of the merge that handles the calculation of the mean price separately from the calculation of sums.
I benchmarked this solution against the solution provided by @David Arenburg (using data.table) and @thothal (using plyr, as requested by the question) with 5000 replications. Here data.table came out slower than plyr and dplyr, and dplyr was faster than plyr. One imagines that the benchmark results could change as a function of the number of columns, the number of levels in the grouping factor, and the particular functions applied. For example, MarkusN submitted an answer after I did my initial benchmarks that is substantially faster than the previously submitted answers for the sample data. He accomplishes this by calculating many summary statistics that aren't desired and then throwing them away... surely there must be a point at which the costs of that approach outweigh the advantages.
test replications elapsed relative user.self sys.self user.child sys.child
2 dataTable 5000 119.686 2.008 119.611 0.127 0 0
1 dplyr 5000 59.614 1.000 59.676 0.004 0 0
3 plyr 5000 68.505 1.149 68.493 0.064 0 0
? MarkusN 5000 23.172 ????? 23.926 0 0 0
Certainly speed is not the only consideration. In particular, dplyr and plyr are picky about the order in which they are loaded (plyr before dplyr) and have several functions that mask each other.
Not 100% what you are looking for but it might give you another idea on how to do it. Using data.table you can do something like this:
diamonds2[, .(c = sum(c), p = sum(p), ce = sum(ce), pe = mean(pe)), by = cut]
To shorten the code (what you tried to do with colwise), you probably have to write some functions to achieve exactly what you want.
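For instance, such a helper could take the grouping column and the two column sets as arguments; a sketch along the lines of the data.table answer above (agg_cols and its arguments are illustrative names):
agg_cols <- function(d, grp, sum_cols, mean_cols) {
  as.data.table(d)[, c(lapply(.SD[, sum_cols, with = FALSE], sum),
                       lapply(.SD[, mean_cols, with = FALSE], mean)),
                   by = grp]
}
agg_cols(diamonds, "cut", c("x", "y", "z"), "price")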
For completeness, here's a solution based on dplyr and answers posted by Veerendra Gadekar in another question and here by MarkusN.
In this particular case, it's possible to first apply sum to some of the columns and then mean to all columns of interest:
diamonds %>%
group_by(cut) %>%
mutate_each('sum', 8:10) %>%
summarise_each('mean', 8:10, price)
This is possible, because mean won't change the calculated sums of columns 8:10 and will calculate the required mean of prices. But if we wanted to calculate standard deviation of prices instead of mean, this approach wouldn't work as columns 8:10 would all be 0.
A more general approach could be:
diamonds %>%
group_by(cut) %>%
mutate_each('sum', 8:10) %>%
mutate_each('mean', price) %>%
summarise_each('first', 8:10, price)
One may not be pleased by summarise_each repeating column specifications that were named earlier, but this seems like an elegant solution nonetheless.
It has the advantage over MarkusN's solution that it doesn't require matching newly created columns and doesn't change their names.
The solution by Veerendra Gadekar should end with select(cut, 8:10, price) %>% arrange(cut) in order to produce the expected results (the subset of columns, with rows sorted by the grouping key). The suggestion by Hong Ooi is similar to the first one here, but assumes there are no other columns.
Finally, it seems to be more legible and easy to understand than a data.table solution, like the one proposed by David Arenburg.
I've got a large dataset (millions of records) with this structure:
id | ident1 | ident2
1 A000001 B000001
2 A000001 B000002
................
99 A000001 B000099
.........
337 A000002 B000037
338 A000002 B000043
In other words, for each [ident1], I have a high number of entries in [ident2]. I'd like to be able to select only 20 of these entries (or all of them, if there are fewer than 20).
Order is not important: so if a given ident1 has 100 matching [ident2], I'd like either the first 20 entries, or 20 random ones, it doesn't matter.
Thanks in advance, p.
Try
library(dplyr)
df %>%
group_by(ident1) %>%
slice(1:20)
Or using data.table
library(data.table)
setDT(df)[, head(.SD,20), by=ident1]
If you need a sample
setDT(df)[df[, .I[sample(.N,20, replace=FALSE)], by=ident1]$V1]
If some of the groups have less than 20 rows to sample
setDT(df)[, if(.N < 20) .SD else .SD[sample(.N, 20, replace=FALSE)], by=ident1]
As in @akrun's answer I use dplyr, but in my case the observations are selected randomly.
library(dplyr)
df %>%
group_by(ident1) %>%
sample_n(20)
or:
library(dplyr)
df %>%
group_by(ident1) %>%
sample_frac(.2) # randomly select 20% of the rows from each group
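Note that sample_n(20) will error for any group with fewer than 20 rows (when replace = FALSE); one way around that, as a sketch, is to sample inside slice():
df %>%
  group_by(ident1) %>%
  slice(sample(n(), min(n(), 20)))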
Using plyr:
random selection of observations:
ddply(df, .(ident1), function(x, howmany) {
x[sample(seq_len(nrow(x)), howmany), ]
}, howmany = 20)
selecting the first 20 obs:
ddply(df, .(ident1), head, 20)
A base R option to get the first 20 rows per ident1, though not as efficient as data.table or dplyr, would be:
df[ave(seq_along(df$ident1), df$ident1, FUN = seq_along) <= 20, ]
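A random variant of the same base R trick (a sketch) assigns each row a random within-group rank and keeps ranks up to 20, so smaller groups are kept whole:
set.seed(1)  # for reproducibility
df[ave(seq_along(df$ident1), df$ident1, FUN = function(i) sample(seq_along(i))) <= 20, ]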
I have a dataframe similar to the one generated below. Some individuals have more than one observation for a particular variable and each variable has an associated standard error (SE) for the estimate. I would like to create a new dataframe that contains only a single row for each individual. For individuals with more than one observation, such as Kim or Bob, I need to calculate a precision weighted average based on the standard errors of the estimates along with a variance for the newly calculated weighted mean. For example, for Bob, for var1, this means that I would want his var1 value in the new dataframe to be:
weighted.mean(c(example$var1[2], example$var1[10]),
c(1/example$SE1[2], 1/example$SE1[10]))
and for Bob's new SE1, which would be the variance of the weighted mean, to be:
1/sum(1/example$SE1[2] + 1/example$SE1[10])
I have tried using the aggregate function and am able to calculate the arithmetic mean of the values, but the simple function I wrote does not use the standard errors nor can it deal with the NAs.
aggregate(example[,1:4], by = list(example[,5]), mean)
Would appreciate any help in developing some code to work through this problem. Here is the example dataset.
set.seed(1562)
example=data.frame(rnorm(10,8,2))
colnames(example)[1]=("var1")
example$SE1=rnorm(10,2,1)
example$var2=rnorm(10,8,2)
example$SE2=rnorm(10,2,1)
example$id = c("Kim","Bob","Joe","Sam","Kim","Kim","Joe","Sara","Jeff","Bob")
example$SE1[5]=NA
example$var1[5]=NA
example$SE2[10]=NA
example$var2[10]=NA
example
var1 SE1 var2 SE2 id
1 9.777769 2.451406 6.363250 2.2739566 Kim
2 8.753078 2.174308 6.219770 1.4978380 Bob
3 7.977356 2.107739 6.835998 2.1647437 Joe
4 11.113048 2.713242 11.091650 1.7018666 Sam
5 NA NA 11.769884 -0.1310218 Kim
6 5.271308 1.831475 6.818854 3.0294338 Kim
7 7.770062 2.094850 6.387607 0.2272348 Joe
8 9.837612 1.956486 8.517445 3.5126378 Sara
9 4.637518 2.516896 7.173460 2.0292454 Jeff
10 9.004425 1.592312 NA NA Bob
I like the plyr package for these sorts of problems. It should be functionally equivalent to aggregate, but I think it is nice and convenient to use. There are lots of examples and a great ~20 page intro to plyr on the website. For this problem, since the data starts as a data.frame and you want another data.frame on the other end, we use ddply()
library(plyr)
#f1()
ddply(example, "id", summarize,
newMean = weighted.mean(x=var1, 1/SE1, na.rm = TRUE),
newSE = 1/sum(1/SE1, na.rm = TRUE)
)
Which returns:
id newMean newSE
1 Bob 8.8982 0.91917
2 Jeff 4.6375 2.51690
3 Joe 7.8734 1.05064
4 Kim 7.1984 1.04829
5 Sam 11.1130 2.71324
6 Sara 9.8376 1.95649
Also check out ?summarize and ?transform for some other good background. You can also pass an anonymous function to the plyr functions if necessary for more complicated tasks.
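For example, an anonymous function makes it easy to handle var2/SE2 in the same pass; a sketch, with illustrative result column names:
ddply(example, "id", function(d) data.frame(
  newMean1 = weighted.mean(d$var1, 1/d$SE1, na.rm = TRUE),
  newSE1   = 1/sum(1/d$SE1, na.rm = TRUE),
  newMean2 = weighted.mean(d$var2, 1/d$SE2, na.rm = TRUE),
  newSE2   = 1/sum(1/d$SE2, na.rm = TRUE)))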
Or use data.table package which can prove faster for some tasks:
library(data.table)
dt <- data.table(example, key="id")
#f2()
dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
newSE = 1/sum(1/SE1, na.rm = TRUE)),
by = "id"]
A quick benchmark:
library(rbenchmark)
# f1 = plyr, f2 = data.table (the two calls above, wrapped as functions)
f1 <- function() ddply(example, "id", summarize,
                       newMean = weighted.mean(x = var1, 1/SE1, na.rm = TRUE),
                       newSE = 1/sum(1/SE1, na.rm = TRUE))
f2 <- function() dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
                           newSE = 1/sum(1/SE1, na.rm = TRUE)), by = "id"]
benchmark(f1(), f2(),
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))
test elapsed relative
2 f2() 3.580 1.0000
1 f1() 6.398 1.7872
So data.table() is ~ 1.8x faster for this dataset on my simple laptop.