Simplest creation of factor variable from dummies - r

A selected answer to a question here:
creating a factor variable with dplyr?
Did not impress Hadley and the follow-up answer does not generalise well for some of the problems I've come across. I'm wondering if the community can do something better with a simpler example:
### DATA ###
A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])
dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")
names(dummies) <- header
### ONE WAY ###
dummies$Religion <- factor(ifelse(dummies$Christian==1, "Christian",
ifelse(dummies$Muslim==1, "Muslim",
ifelse(dummies$Athiest==1, "Athiest", NA))))
Solution mimics the result provided to the OP in the link above. Is there a simpler function to collapse the dummy variables to one factor variable, like say the egen group function in STATA?? Simple one liner would be great.
Using Akrun's solution and system time (thank you):
set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
C = c(rep(0,1e6), 1 - A[1000001:2000000])
dummies <- as.data.frame(cbind(A,B,C))
header <- c("Christian", "Muslim", "Athiest")
names(dummies) <- header
attach(dummies)
#Alistaire
system.time({
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
# user system elapsed
# 56.08 0.00 56.08
system.time({
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
})
# user system elapsed
# 0.22 0.04 0.27
#Curt F.
system.time({
dummies %>%
gather(religion, is_valid) %>%
filter(is_valid == T) %>%
select(-is_valid)
})
# user system elapsed
# 0.33 0.03 0.36
#Akrun
system.time({
names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
# user system elapsed
# 0.13 0.06 0.21
system.time({
names(dummies)[max.col(dummies, "first")]
})
# user system elapsed
# 0.04 0.07 0.11
I find that Akrun's solution works out to be the fastest method and provided 2 one-liners. However, many thanks to the others for their unique approaches to the problem and generous supply of coding methods that I would like to learn more about, especially the use of %%, names(.), is_valid and the qdapTools package.

A quick way with dplyr would be
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
What Hadley's really complaining about in that answer is nested ifelse structure, though. He's built case_when to replace it:
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))

We can use
dummies$Religion <- names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
Or with max.col
dummies$Religion <- names(dummies)[max.col(dummies, "first")]
If there are rows that have only 0 elements, then
dummies$Religion <- names(dummies)[max.col(dummies, "first")*NA^(!rowSums(dummies))]
NOTE: In all of the above solution, it can be wrapped with factor. But, it is better to keep it as character
NOTE2: Both the solutions are base R only one-line solutions and are very fast compared to any package solution (proof is showed in the benchmarks below)
Benchmarks
set.seed(24)
A = round(runif(2e6,0,1),0)
B = c(1 - A[1:1e6],rep(0,1e6))
C = c(rep(0,1e6), 1 - A[1000001:2000000])
dummies <- data.frame(A,B,C)
colnames(dummies) <- c("Christian", "Muslim", "Athiest")
system.time({
dummies %>% rowwise() %>%
transmute(religion = names(.)[as.logical(c(Christian, Muslim, Athiest))])
})
# user system elapsed
# 49.13 0.06 49.55
system.time({
dummies %>% transmute(religion = case_when(
as.logical(Christian) ~ 'Christian',
as.logical(Muslim) ~ 'Muslim',
as.logical(Athiest) ~ 'Atheist'))
})
#Error in mutate_impl(.data, dots) : object 'Christian' not found
#Timing stopped at: 0 0 0
system.time({
names(dummies)[as.matrix(dummies)%*% seq_along(dummies)]
})
# user system elapsed
# 0.11 0.01 0.13
system.time({
names(dummies)[max.col(dummies, "first")]
})
# user system elapsed
# 0.07 0.02 0.08

One way to do this is to combine tidyr and dplyr. This may not give the fastest performance (I haven't checked), but to me at least it gives the easiest-to-understand code.
Start with the dummies data frame from the OP:
A = round(runif(200,0,1),0)
B = c(1 - A[1:100],rep(0,100))
C = c(rep(0,100), 1 - A[101:200])
dummies <- as.data.frame(cbind(A, B, C))
header <- c("Christian", "Muslim", "Atheist")
names(dummies) <- header
Then the gather() function from tidyr does the heavy lifting, and filter() and select() from dplyr do the cleanup.
require(tidyr)
require(dplyr)
dummies %>%
gather(religion, is_valid) %>%
filter(is_valid == T) %>%
select(-is_valid)
The nice thing about this version is that it doesn't make any assumptions about the one-hotness of the initial dataframe. If some row in the initial frame is both an atheist and a Christian, your output will have two rows.

If the main intent of the OP is to create the Religion column this can be done directly in one call:
Religion <- sample(c("Christian", "Muslim", "Atheist"), 200, replace = TRUE,
prob = c(60, 20, 20))
The parameter prob can be used to specify the probability weights. Just to check:
table(Religion)
#Religion
# Atheist Christian Muslim
# 37 115 48
However, if the dummies data.frame would be required for some reason, it could be created from the Religion vector with the following code:
mat <- sapply(unique(Religion), function(x) as.integer(Religion == x))
dummies <- cbind(as.data.frame(mat), Religion)
This will result in:
head(dummies)
# Muslim Christian Atheist Religion
#1 1 0 0 Muslim
#2 1 0 0 Muslim
#3 0 1 0 Christian
#4 1 0 0 Muslim
#5 0 1 0 Christian
#6 0 0 1 Atheist
Note that the result may look different for different runs of sample() as we haven't used set.seed() before calling sample().
From this answer I learned about the mtabulate() function from package qdapTools which can replace the sapply() construct by a one-liner:
dummies <- cbind(qdapTools::mtabulate(Religion), Religion)

Related

How can I iterate a function over specific columns of a series of dataframes where I can set the order?

I work for an insurance company and I am trying to improve something that I built. I have about 150 data frames that look like this:
library(data.table)
dt_Premium<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
I take the base premium and then I multiply by factors, and then add a fixed expense at the very end. My code is currently something like:
dt_Final_Premium<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
What I hate about this:
-The 2:4 stuff (I would like to be able to use a named range)
-The typing is monstrous considering all of the tables and policies I actually have
-It is very confusing for anybody except me (the author) to understand and edit/adjust the code
-I would like to be able to have each rating step as part of a list, and then just iterate over that list (or a similar process).
-Ideally I would be able to get the values at each step. For example :
step2_answer<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4])
There just has to be a way were I can take a dataframe/datatable and then just multiply or add to the next dataframe/datatable in the series. Thanks for taking a look at this?
How about something like this using dplyr?!
Here I am using the same calculation that you have mentioned but row wise using mutate function of dplyr which makes it clear to see the step by step and for anyone to understand the calculation easily.
library(data.table)
library(dplyr)
dt_Premium <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
dt_Final_Premium <- cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
new_dt_final_premium <-
dt_Premium %>%
# Joining all tables together
left_join(dt_Discount_Factors, by = "Policy") %>%
left_join(dt_Territory_Factors, by = "Policy") %>%
left_join(dt_Fixed_Expense, by = "Policy") %>%
# Calculating required calculation
mutate(
Base_Premium_Fire =
Base_Premium_Fire * Discount_Factor_Fire * Territory_Factor_Fire + Fixed_Expense_Fire,
Base_Premium_Water =
Base_Premium_Water * Discount_Factor_Water * Territory_Factor_Water + Fixed_Expense_Water,
Base_Premium_Theft =
Base_Premium_Theft * Discount_Factor_Theft * Territory_Factor_Theft + Fixed_Expense_Theft) %>%
select(Policy, Base_Premium_Fire, Base_Premium_Water, Base_Premium_Theft)
Since your columns have a clean naming, some pivoting may do the work:
library(tidyverse) #to be run after library(data.table)
dt_Premium %>%
left_join(dt_Discount_Factors, by="Policy") %>%
left_join(dt_Territory_Factors, by="Policy") %>%
left_join(dt_Fixed_Expense, by="Policy") %>%
pivot_longer(cols=-Policy)%>%
separate(name, into=c("name", "object"), sep="_.*_") %>%
pivot_wider() %>%
mutate(total=Base*Discount*Territory+Fixed) %>% #or calculate the value for a specific step
select(Policy, object, total) %>%
pivot_wider(names_from = "object", values_from = "total")
After joining all the columns, you can pivot to a long format and turn columns to rows. There, you can separate the name into the real name (Base, Discount, Fixed...) and the object (Fire, Water, ...) and return to the wide format. The tricky part is to get a good regular expression, as your names use the underscore twice. Mine can be vastly improved but will do the work for now.
After this, you can calculate whatever you want, select only the result and pivot to wide one last time. If you want to get all the results, you may tweak this last pivot with prefixes.
Pivoting is quite a gymnastics, but it has proven to be very effective once you get used to it.
As you have a lot of tables, if you can get them as a list, you can also use purrr::reduce to join them all at once and simplify the first lines of code:
list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense) %>%
reduce(left_join, by='Policy') %>%
pivot_longer(cols=-Policy)%>%
separate(name, into=c("name", "object"), sep="_.*_") %>%
pivot_wider() %>%
mutate(total=Base*Discount*Territory+Fixed) %>% #of calculate the value for a specific step
select(Policy, object, total) %>%
pivot_wider(names_from = "object", values_from = "total")
Another option is to reorganize the data by converting into a long format, merge and then perform the calculations:
DT <- Reduce(merge, lapply(dtList, function(d) {
vn <- sub('_([^_]*)$', '', names(d)[2L]) #see reference [1]
melt(d, id.vars="Policy", value.name=vn)[,
variable := gsub("(.*)_(.*)_(.*)", "\\3", variable)]
}))
DT
DT[, disc_prem := Base_Premium * Discount_Factor][,
disc_prem_loc := disc_prem * Territory_Factor][,
Final_Premium := disc_prem_loc + Fixed_Expense]
output:
Policy variable Base_Premium Discount_Factor Territory_Factor Fixed_Expense disc_prem disc_prem_loc Final_Premium
1: Pol123 Fire 45 0.90 1.90 5 40.50 76.9500 81.9500
2: Pol123 Theft 3 1.00 1.00 9 3.00 3.0000 12.0000
3: Pol123 Water 20 0.80 1.03 7 16.00 16.4800 23.4800
4: Pol333 Fire 55 0.95 1.20 5 52.25 62.7000 67.7000
5: Pol333 Theft 5 1.00 1.50 9 5.00 7.5000 16.5000
6: Pol333 Water 21 0.85 1.30 7 17.85 23.2050 30.2050
7: Pol555 Fire 105 0.99 0.91 5 103.95 94.5945 99.5945
8: Pol555 Theft 6 1.00 1.00 9 6.00 6.0000 15.0000
9: Pol555 Water 24 0.90 1.25 7 21.60 27.0000 34.0000
10: Pol999 Fire 92 0.97 1.03 5 89.24 91.9172 96.9172
11: Pol999 Theft 7 1.00 0.50 9 7.00 3.5000 12.5000
12: Pol999 Water 29 0.96 1.01 7 27.84 28.1184 35.1184
data:
dtLs <- list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense)
Reference:
regex-return-all-before-the-second-occurrence
I am guessing reading some of rdata.table vignettes would help you tighten up syntax and make it more terse. Some of us think terse = 'more readable' in numeric programming. Others think that represents some level of insanity:
vignette(package="data.table")
Understanding Map, Reduce, mget and other functional notation in R and rdata.table may help. Here are some things I have done from a data.table mindset:
Dropping cols syntax might be more terse using 'i' to drop a vector of cols:
dt[is.na(dt)] <- 0 # replace NA with 0
drop_col_list <- c('dropcol1','dropcol2','dropcol3') # drop col list
# dt <- dt[!drop_col_list,sapply(dt,as.numeric)] # make selected dt cols numeric type
dt[!drop_col_list,SumCol := Reduce(`+`, dt)] # adds Sum col with 'functional programming' iteration
The lapply(.SD, func) format is very powerful:
fsum <- function(x) {sum(x,na.rm=TRUE)}
dt[,lapply(.SD,fsum),by=,.SDcols=c("col1","col2","col3","col4")]
# or
dt[!drop_col_list,lapply(.SD,fsum)]
This shows applying the internal data.table 'set' function (':=') and mget to create cols derived from operations with functional programming on two data.tables. The data.table(s) may need to have the same nrow():
nm1 <- names(dt1)[1:4]
nm2 <- names(dt2)[1:4]
dt[, SumCol := Reduce(`+`, Map(`*`, mget(nm1), mget(nm2)))]
The loop below isn't really rdata.table'esq' programming but outputs a data.table. Probably this isn't as fast as more data.table like syntax:
seqXpi <- function(x) {x * pi}
seqXexp <- function(x) {x * exp(1)}
l <- {};
for(x in seq(1,10,1)) l <- as.data.table(rbind(l,cbind(seq=x,seqXpi=seqXpi(x),seqXexp=seqXexp(x))))

Taking the mean of 10000 replications of random sampling for each row

I did a replication 10000 times where I took a random sample from a list of ID's and then paired them with another list of IDs. After that I added a colomn that gives the relatedness of pair to each other. Then I took thee mean of the relatedness for each set of random sampling. So I end up with 10000 values which represent the mean of the relatedness for each set of random sampling. However, I want to instead take the mean of the relatedness of first row for all the 10000 sets of random sampling.
An example of what I want:
Lets say I have 10000 sets of 3 random pairings.
Set 1
female_ID male_ID relatedness
0 12-34 23-65 0.034
1 44-62 56-24 0.56
2 76-11 34-22 0.044
Set 2
female_ID male_ID relatedness
0 98-54 53-12 0.022
1 22-43 13-99 0.065
2 09-22 65-22 0.12
etc...
I want the mean of the rows for relatedness of each set, so I want a list of 3 values: 0.028 (mean of 0.034 and 0.022), 0.3125 (mean of 0.56 and 0.065), 0.082 (mean of 0.044 and 0.12), except it would be the mean across 10000 sets, and not just 2.
Here's my code so far:
mean_rel <- replicate(10000, {
random_mal <- sample(list_of_males, 78, replace=TRUE)
random_pair <- cbind(list_of_females, random_mal)
random_pair <- data.frame(random_pair)
random_pair$pair <- with(random_pair, paste(list_of_females, random_mal, sep = " "))
typeA <- genome$rel[match(random_pair$pair, genome_year$pair1)]
typeB <- genome$rel[match(random_pair$pair, genome_year$pair2)]
random_pair$relatedness <- ifelse(is.na(typeA), typeB, typeA)
random_pair <- na.omit(random_pair)
mean_random_pair_relatedness <- mean(random_pair$relatedness)
mean_random_pair_relatedness
})
If you add simplify = FALSE to your replicate after the between the closing } and ), then mean_rel will be output as a list.
mean_rel <- replicate(10000, {
random_mal <- sample(list_of_males, 78, replace=TRUE)
random_pair <- cbind(list_of_females, random_mal)
random_pair <- data.frame(random_pair)
random_pair$pair <- with(random_pair, paste(list_of_females, random_mal, sep = " "))
typeA <- genome$rel[match(random_pair$pair, genome_year$pair1)]
typeB <- genome$rel[match(random_pair$pair, genome_year$pair2)]
random_pair$relatedness <- ifelse(is.na(typeA), typeB, typeA)
random_pair <- na.omit(random_pair)
mean_random_pair_relatedness <- mean(random_pair$relatedness)
mean_random_pair_relatedness
}, simplify = FALSE)
From there, you can use purrr to add two classification columns and then can use dplyr for the rest. Here is how I did it:
library(tidyverse)
mean_rel <- purrr::map2(.x = mean_rel, .y = seq_along(mean_rel),
function(x, y){
x %>%
mutate(set = paste0("set_", y)) %>%
# do this so the same row of each set can be
# compared
rownames_to_column(var = "row_number")
})
mean_rel_comb <- mean_rel %>%
do.call(rbind, .) %>%
as.tibble() %>%
mutate(relatedness = as.numeric(as.character(relatedness))) %>%
group_by(row_number) %>%
summarize(mean = mean(relatedness))
Using your two datasets combined as a list gave me this:
# A tibble: 3 x 2
row_number mean
<chr> <dbl>
1 1 0.0280
2 2 0.3125
3 3 0.0820

Aggregate sum and mean in R with ddply

My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds from ggplot2.
I know I could do it like this:
ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))
But while it is reasonable for 3 columns, it is unacceptable for 17 of them.
When researching this, I found the colwise function, but the best I came up with is this:
cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])
Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):
ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))
or:
ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))
To sum up:
I don't want to have to type all 17 columns explicitly, like the first example does with x, y, and z.
Ideally, I would like to do it with a single call to ddply, without resorting to cbind (or similar functions), as in the second example.
For reference, the result I expect is 5 rows and 5 columns:
cut x y z price
1 Fair 10057.50 9954.07 6412.26 4358.758
2 Good 28645.08 28703.75 17855.42 3928.864
3 Very Good 69359.09 69713.45 43009.52 3981.760
4 Premium 82385.88 81985.82 50297.49 4584.258
5 Ideal 118691.07 118963.24 73304.61 3457.542
I would like to suggest data.table solutions for this. You can easily predefine the columns you want operate either by position or by names and then reuse the same code no matter how many column you want to operate on.
Predifine column names
Sums <- 7:9
Means <- "price"
Run the code
library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
lapply(.SD[, Means, with = FALSE], mean))
, by = cut]
# cut x y z price
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
For your specific example, this could simplified to just
data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
# cut x y z pe
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
Antoher solution using dplyr. First you apply both aggregate functions on every variable you want to be aggregated. Of the resulting variables you select only the desired function/variable combination.
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise_each(funs(sum, mean), x:z, price) %>%
select(cut, matches("[xyz]_sum"), price_mean)
Yet another approach (in my opinion easier to read) for your particular case (mean = sum/n!)
nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1
or the more generic,
do.call(merge,
lapply(c(colwise(sum, 7:9), colwise(mean, 6)),
function(cw) ddply(diamonds, ~cut, cw)))
Just to throw in another solution:
library(plyr)
library(ggplot2)
trans <- list(mean = 8:10, sum = 7)
makeList <- function(inL, mdat = diamonds, by = ~cut) {
colN <- names(mdat)
args <- unlist(llply(names(inL), function(n) {
llply(inL[[n]], function(x) {
ret <- list(call(n, as.symbol(colN[[x]])))
names(ret) <- paste(n, colN[[x]], sep = ".")
ret
})
}))
args$.data <- as.symbol(deparse(substitute(mdat)))
args$.variables <- by
args$.fun <- as.symbol("summarise")
args
}
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price
# 1 Fair 6.246894 6.182652 3.982770 7017600
# 2 Good 5.838785 5.850744 3.639507 19275009
# 3 Very Good 5.740696 5.770026 3.559801 48107623
# 4 Premium 5.973887 5.944879 3.647124 63221498
# 5 Ideal 5.507451 5.520080 3.401448 74513487
The idea is that the function makeList creates an argument list for ddply. In this way you can quite easily add terms to the list (as function.name = column.indices) and ddply will work as expected:
trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price sd.y sd.z
# 1 Fair 6.246894 6.182652 3.982770 7017600 0.9563804 0.6516384
# 2 Good 5.838785 5.850744 3.639507 19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801 48107623 1.1029236 0.7302281
# 4 Premium 5.973887 5.944879 3.647124 63221498 1.2597511 0.7311610
# 5 Ideal 5.507451 5.520080 3.401448 74513487 1.0744953 0.6576481
It uses dplyr, but I believe this will accomplish the specified aim completely in reasonably easy to read syntax:
diamonds %>%
group_by(cut) %>%
select(x:z) %>%
summarize_each(funs(sum)) %>%
merge(diamonds %>%
group_by(cut) %>%
summarize(price = mean(price))
,by = "cut")
The only "trick" is that there is a piped expression inside of the merge that handles the calculation of the mean price separately from the calculation of sums.
I benchmarked this solution against the solution provided by #David Arenburg (using data.table) and #thothal (using plyr as requested by the question) with 5000 replications. Here data.table came out slower than plyr and dplyr. dplyr was faster than plyr. One imagines that the benchmark results could change as a function of the number of columns, number of levels in the grouping factor, and particular functions applied. For example, MarkusN submitted an answer after I did my initial benchmarks that is substantially faster than the previously submitted answers for the sample data. He accomplishes this by calculating many summary statistics that aren't desired and then throwing them away... surely there must be a point at which the costs of that approach outweigh the advantages.
test replications elapsed relative user.self sys.self user.child sys.child
2 dataTable 5000 119.686 2.008 119.611 0.127 0 0
1 dplyr 5000 59.614 1.000 59.676 0.004 0 0
3 plyr 5000 68.505 1.149 68.493 0.064 0 0
? MarkusN 5000 23.172 ????? 23.926 0 0 0
Certainly speed is not the only consideration. In particular, dplyr and plyr are picky about the order in which they are loaded (plyr before dplyr) and have several functions that mask each other.
Not 100% what you are looking for but it might give you another idea on how to do it. Using data.table you can do something like this:
diamonds2[, .(c = sum(c), p = sum(p), ce = sum(ce), pe = mean(pe)), by = cut]
To shorten the code (what you tried to do with colwise), you probably have to write some functions to achieve exactly what you want.
For completeness, here's a solution based on dplyr and answers posted by Veerendra Gadekar in another question and here by MarkusN.
In this particular case, it's possible to first apply sum to some of the columns and then mean to all columns of interest:
diamonds %>%
group_by(cut) %>%
mutate_each('sum', 8:10) %>%
summarise_each('mean', 8:10, price)
This is possible, because mean won't change the calculated sums of columns 8:10 and will calculate the required mean of prices. But if we wanted to calculate standard deviation of prices instead of mean, this approach wouldn't work as columns 8:10 would all be 0.
A more general approach could be:
diamonds %>%
group_by(cut) %>%
mutate_each('sum', 8:10) %>%
mutate_each('mean', price) %>%
summarise_each('first', 8:10, price)
One may not be pleased by summarise_each repeating column specifications that were named earlier, but this seems like an elegant solution nonetheless.
It has the advantage over MarkusN's solution that it doesn't require matching newly created columns and doesn't change their names.
Solution by Veerendra Gadekar should end with select(cut, 8:10, price) %>% arrange(cut) in order to produce expected results (subset of columns, plus rows sorted by grouping key). Suggestion of Hong Ooi is similar to the first one here, but assumes there are no other columns.
Finally, it seems to be more legible and easy to understand than a data.table solution, like the one proposed by David Arenburg.

Align dates in R date.table for linear regression

I am having a data.table with returns on n dates for m securities. I would like to do a multiple linear regression in the form of lm(ReturnSec1 ~ ReturnSec2 + ReturnSec3 + ... + ReturnSecM). The problem that I am having is that there might be dates missing for some of the securities and the regression should be on aligned dates. Here is what I have come up with so far:
#The data set
set.seed(1)
dtData <- data.table(SecId = rep(c(1,2,3), each= 4), Date = as.Date(c(1,2,3,5,1,2,4,5,1,2,4,5)), Return = round(rnorm(12),2))
#My solution so far
dtDataAligned <- merge(dtData[SecId == 1,list(Date, Return)], dtData[SecId == 2, list(Date, Return)], by='Date', all=TRUE)
dtDataAligned <- merge(dtDataAligned, dtData[SecId == 3,list(Date, Return)], by='Date', all=TRUE)
setnames(dtDataAligned, c('Date', 'Sec1', 'Sec2', 'Sec3'))
dtDataAligned[is.na(dtDataAligned)] <- 0
#This is what i want to do
fit <- lm(dtDataAligned[, Sec1] ~ dtDataAligned[, Sec2] + dtDataAligned[, Sec3])
Is there a better (more elegant, possibly faster) way of doing this without having to loop and merge the data.table to perform a regression on the values with aligned dates?
Here is a data.table solution using dcast.data.table, which takes data in the long format (your input) and converts it to the wide format required for the lm call.
lm(`1` ~ ., dcast.data.table(dtData, Date ~ SecId, fill=0))
Here is the output of the dcast call:
Date 1 2 3
1: 2014-01-02 -0.63 0.33 0.58
2: 2014-01-03 0.18 -0.82 -0.31
3: 2014-01-04 -0.84 0.00 0.00
4: 2014-01-05 0.00 0.49 1.51
5: 2014-01-06 1.60 0.74 0.39
I stole the lm piece from #G.Grothendieck. Note that if you have more than three columns in your real data you will need to specify the value.var parameter for dcast.data.table.
If the question is how to reproduce the output from the code shown in the question in a more compact fashion then try this:
library(zoo)
z <- read.zoo(dtData, split = 1, index = 2)
z0 <- na.fill(z, fill = 0)
lm(`1` ~., z0)
ADDED
Regarding the comment about elegance we could create a magrittr package pipeline out of the above like this:
library(magrittr)
dtData %>%
read.zoo(split = 1, index = 2) %>%
na.fill(fill = 0) %>%
lm(formula = `1` ~.)

Elegant way to report missing values in a data.frame

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:
for (Var in names(airquality)) {
missing <- sum(is.na(airquality[,Var]))
if (missing > 0) {
print(c(Var,missing))
}
}
Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.
Just use sapply
> sapply(airquality, function(x) sum(is.na(x)))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
You could also use apply or colSums on the matrix created by is.na()
> apply(is.na(airquality),2,sum)
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
> colSums(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
My new favourite for (not too wide) data are methods from excellent naniar package. Not only you get frequencies but also patterns of missingness:
library(naniar)
library(UpSetR)
riskfactors %>%
as_shadow_upset() %>%
upset()
It's often useful to see where the missings are in relation to non missing which can be achieved by plotting scatter plot with missings:
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
Or for categorical variables:
gg_miss_fct(x = riskfactors, fct = marital)
These examples are from package vignette that lists other interesting visualizations.
We can use map_df with purrr.
library(mice)
library(purrr)
# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <int> <int> <int> <int>
# 1 37 7 0 0 0 0
summary(airquality)
already gives you this information
The VIM packages also offers some nice missing data plot for data.frame
library("VIM")
aggr(airquality)
Another graphical alternative - plot_missing function from excellent DataExplorer package:
Docs also points out to the fact that you can save this results for additional analysis with missing_data <- plot_missing(data).
More succinct-: sum(is.na(x[1]))
That is
x[1] Look at the first column
is.na() true if it's NA
sum() TRUE is 1, FALSE is 0
Another function that would help you look at missing data would be df_status from funModeling library
library(funModeling)
iris.2 is the iris dataset with some added NAs.You can replace this with your dataset.
df_status(iris.2)
This will give you the number and percentage of NAs in each column.
For one more graphical solution, visdat package offers vis_miss.
library(visdat)
vis_miss(airquality)
Very similar to Amelia output with a small difference of giving %s on missings out of the box.
I think the Amelia library does a nice job in handling missing data also includes a map for visualizing the missing rows.
install.packages("Amelia")
library(Amelia)
missmap(airquality)
You can also run the following code will return the logic values of na
row.has.na <- apply(training, 1, function(x){any(is.na(x))})
Another graphical and interactive way is to use is.na10 function from heatmaply library:
library(heatmaply)
heatmaply(is.na10(airquality), grid_gap = 1,
showticklabels = c(T,F),
k_col =3, k_row = 3,
margins = c(55, 30),
colors = c("grey80", "grey20"))
Probably won't work well with large datasets..
A dplyr solution to get the count could be:
summarise_all(df, ~sum(is.na(.)))
Or to get a percentage:
summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:
is_missing <- function(x){
missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}
# sample ugly data
df <- data.frame(a = c(NA, '1', ' ', 'missing'),
b = c(0, 2, NaN, 4),
c = c('NA', 'b', '-9', 'null'),
d = 1:4,
e = c(1, Inf, -Inf, 0))
# counts:
> summarise_all(df, ~sum(is_missing(.)))
a b c d e
1 3 1 3 0 2
# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
a b c d e
1 0.75 0.25 0.75 0 0.5
If you want to do it for particular column, then you can also use this
length(which(is.na(airquality[1])==T))
ExPanDaR’s package function prepare_missing_values_graph can be used to explore panel data:
For piping you could write:
# Counts
df %>% is.na() %>% colSums()
# % of missing rounded to 2 decimals
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2))

Resources