I work for an insurance company and I am trying to improve something that I built. I have about 150 data frames that look like this:
library(data.table)
dt_Premium <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
                         Base_Premium_Fire  = c(45, 55, 105, 92),
                         Base_Premium_Water = c(20, 21, 24, 29),
                         Base_Premium_Theft = c(3, 5, 6, 7))

dt_Discount_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
                                  Discount_Factor_Fire  = c(.9, .95, .99, .97),
                                  Discount_Factor_Water = c(.8, .85, .9, .96),
                                  Discount_Factor_Theft = c(1, 1, 1, 1))

dt_Territory_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
                                   Territory_Factor_Fire  = c(1.9, 1.2, .91, 1.03),
                                   Territory_Factor_Water = c(1.03, 1.3, 1.25, 1.01),
                                   Territory_Factor_Theft = c(1, 1.5, 1, .5))

dt_Fixed_Expense <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
                               Fixed_Expense_Fire  = c(5, 5, 5, 5),
                               Fixed_Expense_Water = c(7, 7, 7, 7),
                               Fixed_Expense_Theft = c(9, 9, 9, 9))
I take the base premium, multiply it by the factors, and then add a fixed expense at the very end. My code currently looks something like this:
dt_Final_Premium <- cbind(dt_Premium[, 1],
                          dt_Premium[, 2:4] *
                            dt_Discount_Factors[, 2:4] *
                            dt_Territory_Factors[, 2:4] +
                            dt_Fixed_Expense[, 2:4])
What I hate about this:
-The 2:4 stuff (I would like to be able to use a named range)
-The typing is monstrous considering all of the tables and policies I actually have
-It is very confusing for anybody except me (the author) to understand and edit/adjust the code
-I would like to be able to have each rating step as part of a list, and then just iterate over that list (or a similar process).
-Ideally I would be able to get the values at each step. For example :
step2_answer <- cbind(dt_Premium[, 1],
                      dt_Premium[, 2:4] * dt_Discount_Factors[, 2:4])
There just has to be a way where I can take a dataframe/datatable and then just multiply or add it to the next dataframe/datatable in the series. Thanks for taking a look at this!
How about something like this using dplyr?!
Here I am using the same calculation that you mention, but row-wise with dplyr's mutate function, which makes each step explicit and easy for anyone to follow.
library(dplyr)

# dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense and
# dt_Final_Premium are as defined in the question above.
new_dt_final_premium <-
dt_Premium %>%
# Joining all tables together
left_join(dt_Discount_Factors, by = "Policy") %>%
left_join(dt_Territory_Factors, by = "Policy") %>%
left_join(dt_Fixed_Expense, by = "Policy") %>%
# Calculating the required premium for each peril
mutate(
Base_Premium_Fire =
Base_Premium_Fire * Discount_Factor_Fire * Territory_Factor_Fire + Fixed_Expense_Fire,
Base_Premium_Water =
Base_Premium_Water * Discount_Factor_Water * Territory_Factor_Water + Fixed_Expense_Water,
Base_Premium_Theft =
Base_Premium_Theft * Discount_Factor_Theft * Territory_Factor_Theft + Fixed_Expense_Theft) %>%
select(Policy, Base_Premium_Fire, Base_Premium_Water, Base_Premium_Theft)
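If you also want the values after each rating step (your step2_answer example), the same pattern extends naturally. A sketch for the Fire peril only, where the Step* column names are made up for illustration (Water and Theft follow the same pattern):

dt_Premium %>%
  left_join(dt_Discount_Factors, by = "Policy") %>%
  left_join(dt_Territory_Factors, by = "Policy") %>%
  left_join(dt_Fixed_Expense, by = "Policy") %>%
  mutate(
    Step1_Fire = Base_Premium_Fire * Discount_Factor_Fire, # after discount
    Step2_Fire = Step1_Fire * Territory_Factor_Fire,       # after territory
    Final_Fire = Step2_Fire + Fixed_Expense_Fire) %>%      # after fixed expense
  select(Policy, Step1_Fire, Step2_Fire, Final_Fire)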
Since your columns are named consistently, some pivoting may do the work:
library(tidyverse) #to be run after library(data.table)
dt_Premium %>%
  left_join(dt_Discount_Factors, by = "Policy") %>%
  left_join(dt_Territory_Factors, by = "Policy") %>%
  left_join(dt_Fixed_Expense, by = "Policy") %>%
  pivot_longer(cols = -Policy) %>%
  separate(name, into = c("name", "object"), sep = "_.*_") %>%
  pivot_wider() %>%
  mutate(total = Base * Discount * Territory + Fixed) %>% # or calculate the value for a specific step
  select(Policy, object, total) %>%
  pivot_wider(names_from = "object", values_from = "total")
After joining all the columns, you can pivot to a long format, turning columns into rows. There you can separate the name into the real name (Base, Discount, Fixed, ...) and the object (Fire, Water, ...) and return to the wide format. The tricky part is getting a good regular expression, as your names use the underscore twice. Mine can be vastly improved but will do the work for now.
After this, you can calculate whatever you want, select only the result, and pivot to wide one last time. If you want to keep all the intermediate results, you may tweak this last pivot with prefixes.
Pivoting takes a bit of gymnastics, but it has proven to be very effective once you get used to it.
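To make the regular expression concrete, here is a quick check (outside the pipeline) of what separate() does to one of the column names:

library(tidyr)
# The greedy "_.*_" consumes everything between the first and the last
# underscore, keeping only the outer components.
separate(data.frame(x = "Discount_Factor_Fire"), x,
         into = c("name", "object"), sep = "_.*_")
#       name object
# 1 Discount   Fire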
As you have a lot of tables, if you can get them as a list, you can also use purrr::reduce to join them all at once and simplify the first lines of code:
list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense) %>%
  reduce(left_join, by = 'Policy') %>%
  pivot_longer(cols = -Policy) %>%
  separate(name, into = c("name", "object"), sep = "_.*_") %>%
  pivot_wider() %>%
  mutate(total = Base * Discount * Territory + Fixed) %>% # or calculate the value for a specific step
  select(Policy, object, total) %>%
  pivot_wider(names_from = "object", values_from = "total")
Another option is to reorganize the data by converting into a long format, merge and then perform the calculations:
DT <- Reduce(merge, lapply(dtList, function(d) {
vn <- sub('_([^_]*)$', '', names(d)[2L]) #see reference [1]
melt(d, id.vars="Policy", value.name=vn)[,
variable := gsub("(.*)_(.*)_(.*)", "\\3", variable)]
}))
DT
DT[, disc_prem := Base_Premium * Discount_Factor][,
disc_prem_loc := disc_prem * Territory_Factor][,
Final_Premium := disc_prem_loc + Fixed_Expense]
output:
Policy variable Base_Premium Discount_Factor Territory_Factor Fixed_Expense disc_prem disc_prem_loc Final_Premium
1: Pol123 Fire 45 0.90 1.90 5 40.50 76.9500 81.9500
2: Pol123 Theft 3 1.00 1.00 9 3.00 3.0000 12.0000
3: Pol123 Water 20 0.80 1.03 7 16.00 16.4800 23.4800
4: Pol333 Fire 55 0.95 1.20 5 52.25 62.7000 67.7000
5: Pol333 Theft 5 1.00 1.50 9 5.00 7.5000 16.5000
6: Pol333 Water 21 0.85 1.30 7 17.85 23.2050 30.2050
7: Pol555 Fire 105 0.99 0.91 5 103.95 94.5945 99.5945
8: Pol555 Theft 6 1.00 1.00 9 6.00 6.0000 15.0000
9: Pol555 Water 24 0.90 1.25 7 21.60 27.0000 34.0000
10: Pol999 Fire 92 0.97 1.03 5 89.24 91.9172 96.9172
11: Pol999 Theft 7 1.00 0.50 9 7.00 3.5000 12.5000
12: Pol999 Water 29 0.96 1.01 7 27.84 28.1184 35.1184
data:
dtList <- list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense)
Reference:
[1] regex-return-all-before-the-second-occurrence
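And a quick check of what the two regular expressions above do to one of the column names:

sub('_([^_]*)$', '', "Discount_Factor_Fire")          # "Discount_Factor" - drops the last underscore chunk
gsub("(.*)_(.*)_(.*)", "\\3", "Discount_Factor_Fire") # "Fire" - keeps only the last chunk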
I am guessing that reading some of the data.table vignettes would help you tighten up the syntax and make it more terse. Some of us think terse = 'more readable' in numeric programming; others think that represents some level of insanity:
vignette(package="data.table")
Understanding Map, Reduce, mget and other functional notation in R and data.table may help. Here are some things I have done from a data.table mindset:
Dropping cols can be more terse by excluding a vector of cols with .SDcols:
dt[is.na(dt)] <- 0 # replace NA with 0
drop_col_list <- c('dropcol1','dropcol2','dropcol3') # cols to leave out
# num_cols <- setdiff(names(dt), drop_col_list)
# dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols] # make the kept cols numeric type
dt[, SumCol := Reduce(`+`, .SD), .SDcols = !drop_col_list] # adds Sum col with 'functional programming' iteration
The lapply(.SD, func) format is very powerful:
fsum <- function(x) {sum(x, na.rm = TRUE)}
dt[, lapply(.SD, fsum), .SDcols = c("col1","col2","col3","col4")]
# or
dt[, lapply(.SD, fsum), .SDcols = !drop_col_list]
This shows combining data.table's assignment by reference (':=') with mget to create cols derived from operations on two data.tables with functional programming. The data.tables may need to have the same nrow():
nm1 <- names(dt1)[1:4]
nm2 <- names(dt2)[1:4]
dt1[, SumCol := Reduce(`+`, Map(`*`, mget(nm1), dt2[, mget(nm2)]))]
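A minimal self-contained sketch of that idiom; dt1, dt2 and their column names are invented for illustration:

library(data.table)
dt1 <- data.table(x1 = c(1, 2, 3), x2 = c(4, 5, 6))
dt2 <- data.table(y1 = c(10, 10, 10), y2 = c(100, 100, 100))
nm1 <- names(dt1)
nm2 <- names(dt2)
# Multiply the column pairs (x1*y1, x2*y2) and sum the products row-wise
dt1[, SumCol := Reduce(`+`, Map(`*`, mget(nm1), dt2[, mget(nm2)]))]
dt1
#    x1 x2 SumCol
# 1:  1  4    410
# 2:  2  5    520
# 3:  3  6    630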
The loop below isn't really data.table-esque programming, but it outputs a data.table. It probably isn't as fast as more idiomatic data.table syntax:
seqXpi <- function(x) {x * pi}
seqXexp <- function(x) {x * exp(1)}
l <- NULL
for (x in seq(1, 10, 1)) l <- as.data.table(rbind(l, cbind(seq = x, seqXpi = seqXpi(x), seqXexp = seqXexp(x))))
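For comparison, a sketch of the same table built without a loop, reusing the vectorised helper functions above (dt_seq is just an illustrative name):

dt_seq <- data.table(seq = seq(1, 10, 1))
dt_seq[, `:=`(seqXpi = seqXpi(seq), seqXexp = seqXexp(seq))][]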
I would like to process some GPS data rows pairwise.
For now I am doing it in a normal for-loop, but I'm sure there is a better and faster way.
library(sp) # provides spDists

n <- 100
testdata <- as.data.frame(cbind(runif(n, 1, 10), runif(n, 0, 360), runif(n, 14, 16), runif(n, 46, 49)))
colnames(testdata) <- c("speed", "heading", "long", "lat")
head(testdata)
diffmatrix <- as.data.frame(matrix(ncol = 3, nrow = dim(testdata)[1] - 1))
colnames(diffmatrix) <- c("distance","heading_diff","speed_diff")
for (i in 1:(dim(testdata)[1] - 1)) {
diffmatrix[i,1] <- spDists(as.matrix(testdata[i:(i+1),c('long','lat')]),
longlat = T, segments = T)*1000
diffmatrix[i,2] <- testdata[i+1,]$heading - testdata[i,]$heading
diffmatrix[i,3] <- testdata[i+1,]$speed - testdata[i,]$speed
}
head(diffmatrix)
How would I do that with an apply function?
Or is it even possible to do that calculation in parallel?
Thank you very much!
I'm not sure what you want to do with the end condition, but with dplyr you can do all of this without using a for loop.
library(dplyr)

testdata %>%
  mutate(heading_diff = c(diff(heading), 0),
         speed_diff = c(diff(speed), 0),
         longdiff = c(diff(long), 0),
         latdiff = c(diff(lat), 0)) %>%
  rowwise() %>%
  mutate(spdist = spDists(cbind(c(long, long + longdiff), c(lat, lat + latdiff)),
                          longlat = T, segments = T) * 1000) %>%
  select(heading_diff, speed_diff, distance = spdist)
# heading_diff speed_diff distance
# <dbl> <dbl> <dbl>
# 1 15.9 0.107 326496
# 2 -345 -4.64 55184
# 3 124 -1.16 25256
# 4 85.6 5.24 221885
# 5 53.1 -2.23 17599
# 6 -184 2.33 225746
I will explain each part below:
The pipe operator %>% is essentially a chain that sends the results from one operation into the next. So we start with your test data and send it to the mutate function.
Use mutate to create four new columns holding the difference measurements from one row to the next, adding 0 in the last row because there is no measurement following the last data point (you could use NA instead).
Next, once you have the differences, use rowwise so you can apply the spDists function to each row.
Last, we create another column with mutate that uses the four difference columns created earlier.
To get only the three columns you were concerned with, I used a select statement at the end. You can leave this out if you want the entire data frame.
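On the apply/parallel part of the question: for this particular task you may not need either, because diff() is already vectorised and spDists() applied to the whole coordinate matrix with segments = TRUE returns all consecutive segment distances at once. A sketch, assuming library(sp) is loaded:

# One vectorised pass instead of a loop or rowwise()
diffmatrix <- data.frame(
  distance     = spDists(as.matrix(testdata[, c("long", "lat")]),
                         longlat = TRUE, segments = TRUE) * 1000,
  heading_diff = diff(testdata$heading),
  speed_diff   = diff(testdata$speed))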
A selected answer to a question here:
creating a factor variable with dplyr?
did not impress Hadley, and the follow-up answer does not generalise well for some of the problems I've come across. I'm wondering if the community can do something better with a simpler example:
### DATA ###
A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])
dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")
names(dummies) <- header
### ONE WAY ###
dummies$Religion <- factor(ifelse(dummies$Christian==1, "Christian",
ifelse(dummies$Muslim==1, "Muslim",
ifelse(dummies$Athiest==1, "Athiest", NA))))
This solution mimics the result provided to the OP in the link above. Is there a simpler function to collapse the dummy variables into one factor variable, like, say, the egen group function in Stata? A simple one-liner would be great.
Using Akrun's solution and system.time() (thank you):
set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
C = c(rep(0,1e6), 1 - A[1000001:2000000])
dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")
names(dummies) <- header
attach(dummies)
#Alistaire
system.time({
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
# user system elapsed
# 56.08 0.00 56.08
system.time({
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
})
# user system elapsed
# 0.22 0.04 0.27
#Curt F.
system.time({
dummies %>%
gather(religion, is_valid) %>%
filter(is_valid == T) %>%
select(-is_valid)
})
# user system elapsed
# 0.33 0.03 0.36
#Akrun
system.time({
names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
# user system elapsed
# 0.13 0.06 0.21
system.time({
names(dummies)[max.col(dummies, "first")]
})
# user system elapsed
# 0.04 0.07 0.11
I find that Akrun's solution works out to be the fastest method and provides two one-liners. However, many thanks to the others for their unique approaches to the problem and the generous supply of coding methods that I would like to learn more about, especially the use of %*%, names(.), is_valid and the qdapTools package.
A quick way with dplyr would be
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
What Hadley's really complaining about in that answer is the nested ifelse structure, though. He's built case_when to replace it:
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
We can use
dummies$Religion <- names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
Or with max.col
dummies$Religion <- names(dummies)[max.col(dummies, "first")]
If there are rows that have only 0 elements, then
dummies$Religion <- names(dummies)[max.col(dummies, "first")*NA^(!rowSums(dummies))]
NOTE: All of the above solutions can be wrapped with factor(), but it is better to keep the result as character.
NOTE 2: Both solutions are base-R-only one-liners and are very fast compared to any package solution (as the benchmarks below show).
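To see why the matrix multiplication works: each row of the one-hot matrix contains a single 1, so its dot product with the column positions seq_along(dummies) (here 1:3) returns exactly the position of that 1. A tiny worked check:

m <- matrix(c(1, 0, 0,
              0, 1, 0,
              0, 0, 1), nrow = 3, byrow = TRUE)
drop(m %*% 1:3) # 1 2 3 - the column index of the 1 in each row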
Benchmarks
set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
C = c(rep(0,1e6), 1 - A[1000001:2000000])
dummies <- data.frame(A,B,C)
colnames(dummies) <- c("Christian", "Muslim", "Athiest")
system.time({
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
# user system elapsed
# 49.13 0.06 49.55
system.time({
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
})
#Error in mutate_impl(.data, dots) : object 'Christian' not found
#Timing stopped at: 0 0 0
system.time({
names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
# user system elapsed
# 0.11 0.01 0.13
system.time({
names(dummies)[max.col(dummies, "first")]
})
# user system elapsed
# 0.07 0.02 0.08
One way to do this is to combine tidyr and dplyr. This may not give the fastest performance (I haven't checked), but to me at least it gives the easiest-to-understand code.
Start with the dummies data frame from the OP:
A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])
dummies <- as.data.frame(cbind(A, B, C))
header <- c("Christian", "Muslim", "Atheist")
names(dummies) <- header
Then the gather() function from tidyr does the heavy lifting, and filter() and select() from dplyr do the cleanup.
require(tidyr)
require(dplyr)
dummies %>%
gather(religion, is_valid) %>%
filter(is_valid == T) %>%
select(-is_valid)
The nice thing about this version is that it doesn't make any assumptions about the one-hotness of the initial data frame. If some row in the initial frame is both an atheist and a Christian, your output will have two rows for it.
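A quick illustration of that point, with a hypothetical one-row frame that is 1 in two columns:

multi <- data.frame(Christian = 1, Muslim = 1, Atheist = 0)
multi %>%
  gather(religion, is_valid) %>%
  filter(is_valid == T) %>%
  select(-is_valid)
# two rows: Christian and Muslim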
If the main intent of the OP is to create the Religion column, this can be done directly in one call:
Religion <- sample(c("Christian", "Muslim", "Atheist"), 200, replace = TRUE,
prob = c(60, 20, 20))
The parameter prob can be used to specify the probability weights. Just to check:
table(Religion)
#Religion
# Atheist Christian Muslim
# 37 115 48
However, if the dummies data.frame would be required for some reason, it could be created from the Religion vector with the following code:
mat <- sapply(unique(Religion), function(x) as.integer(Religion == x))
dummies <- cbind(as.data.frame(mat), Religion)
This will result in:
head(dummies)
# Muslim Christian Atheist Religion
#1 1 0 0 Muslim
#2 1 0 0 Muslim
#3 0 1 0 Christian
#4 1 0 0 Muslim
#5 0 1 0 Christian
#6 0 0 1 Atheist
Note that the result may look different for different runs of sample() as we haven't used set.seed() before calling sample().
From this answer I learned about the mtabulate() function from package qdapTools which can replace the sapply() construct by a one-liner:
dummies <- cbind(qdapTools::mtabulate(Religion), Religion)
I have a problem in R, which I can't seem to solve.
I have the following dataframe:
[Image 1: screenshot of the original data frame]
I would like to:
Find the unique combinations of the columns 'Species' and 'Effects'
Report the concentration belonging to this unique combination
If this unique combination is present more than one time, calculate the mean concentration
And would like to get the following dataframe:
[Image 2: screenshot of the desired result]
I have tried next script to get the unique combinations:
UniqueCombinations <- Data[!duplicated(Data[,1:2]),]
but don't know how to proceed from there.
Thanks in advance for your answers!
Tina
Create some example data:
dat <- data.frame(Species = rep.int(LETTERS[1:4], c(4, 1, 3, 2)),
Effect = c(rep("Reproduction", 3), "Growth", "Growth",
"Reproduction", "Mortality", "Mortality",
"Growth", "Growth"),
Concentration = rnorm(10))
You can use the function aggregate:
aggregate(Concentration ~ Species + Effect, dat, mean)
Try the following (thanks to Brandon Bertelsen for the nice comment):
Creating your data:
foo = data.frame(Species=c(rep("A",4),"B",rep("C",3),"D","D"),
Effect=c(rep("Reproduction",3), rep("Growth",2),
"Reproduction", rep("Mortality",2), rep("Growth",2)),
Concentration=c(1.2,1.4,1.3,1.5,1.6,1.2,1.1,1,1.3,1.4))
Using the great plyr package for a bit of magic :)
library(plyr)
ddply(foo, .(Species,Effect), function(x) mean(x[,"Concentration"]))
And this is a slightly more complicated but cleaner version (thanks again to Brandon Bertelsen):
ddply(foo, .(Species,Effect), summarize, mean=mean(Concentration))
Just for fun before I call it a night.... Assuming your data.frame is called "dat", here are two more options:
A data.table solution.
library(data.table)
datDT <- data.table(dat, key="Species,Effect")
datDT[, list(Concentration = mean(Concentration)), by = key(datDT)]
# Species Effect Concentration
# 1: A Growth 1.50
# 2: A Reproduction 1.30
# 3: B Growth 1.60
# 4: C Mortality 1.05
# 5: C Reproduction 1.20
# 6: D Growth 1.35
An sqldf solution.
library(sqldf)
sqldf("select Species, Effect,
avg(Concentration) `Concentration`
from dat
group by Species, Effect")
# Species Effect Concentration
# 1 A Growth 1.50
# 2 A Reproduction 1.30
# 3 B Growth 1.60
# 4 C Mortality 1.05
# 5 C Reproduction 1.20
# 6 D Growth 1.35
I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. The way I am computing this is like this:
library(plyr)    # for idata.frame/ddply
library(ggplot2)

mycounts <- ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot <- ggplot(mycounts, aes(x = Value, y = ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in the data frame
mycounts <- mycounts[sample(n, round(n / 10)), ] # keep a random 10% of the rows
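If you want the sample to preserve each Type group from your ddply() step (so a small group doesn't vanish), a sketch that samples within groups instead (mycounts_small is just an illustrative name):

library(plyr)
# Keep roughly every 10th row within each Type
mycounts_small <- ddply(mycounts, .(Type), function(d)
  d[sample(nrow(d), ceiling(nrow(d) / 10)), ])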
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
value_colname <- deparse(substitute(.value))
ddply(df, ..., .fun = function(rdf) {
xs <- rdf[[value_colname]]
qxs <- sort(unique(.quantizer(xs)))
data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
})
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)