Deciles by Grouped Variable in R

I want to find out deciles for each grouped variable. I am specifically looking for methods using dplyr and lapply. I'd appreciate if you can help me out.
Here's what I tried. I don't know how to pull deciles directly other than by calling dplyr::ntile() (which didn't work for me).
Attempt 1
Here's what I tried using describe() from Hmisc package:
set.seed(10)
IData <- data.frame(let = sample( x = LETTERS, size = 10000, replace=TRUE), numbers = sample(x = c(1:20000),size = 10000))
Output<-IData %>% data.table::as.data.table(.) %>% split(.,by=c("let"),drop = TRUE,sorted = TRUE) %>% purrr::map(~describe(.$numbers))
This certainly helps but there are two problems with above code:
a) The output (even the list format) is not something I am looking for.
b) I don't really know how to extract 5%, 10%...from the list above.
The bottom line is that I am stuck.
Attempt 2
I tried replacing describe with ntile, but the following code gave me output that didn't make sense to me, because the result doesn't have 10 columns. Upon running Output[[1]], I see a vector of ~400 numbers instead of 10.
Output<-IData %>% data.table::as.data.table(.) %>% split(.,by=c("let"),drop = TRUE,sorted = TRUE) %>% purrr::map(~dplyr::ntile(.$numbers,10))
Attempt 3 = Expected Output
Finally, I tried going the old school (i.e. copy-paste) to get the expected output:
Output <- IData %>%
  dplyr::group_by(let) %>%
  dplyr::summarise(QQuantile1 = quantile(`numbers`, 0.10),
                   QQuantile2 = quantile(`numbers`, 0.20),
                   QQuantile3 = quantile(`numbers`, 0.30),
                   QQuantile4 = quantile(`numbers`, 0.40),
                   QQuantile5 = quantile(`numbers`, 0.50),
                   QQuantile6 = quantile(`numbers`, 0.60),
                   QQuantile7 = quantile(`numbers`, 0.70),
                   QQuantile8 = quantile(`numbers`, 0.80),
                   QQuantile9 = quantile(`numbers`, 0.90),
                   # 1.00, not .100: .100 is 0.1 and would repeat the first decile
                   QQuantile10 = quantile(`numbers`, 1.00))
Question: Can someone please help me to generate above output by using these three (not one, but preferably all the methods for learning)
1) lapply
2) dplyr
3) data.table
I looked at several threads on SO, but they all talk about a specific quantile and not all of them. E.g. Find top deciles from dataframe by group thread.

To assemble my comments into an answer, base is shockingly simple:
aggregate(numbers ~ let, IData, quantile, seq(0.1, 1, 0.1))
## let numbers.10% numbers.20% numbers.30% numbers.40% numbers.50% numbers.60% numbers.70% numbers.80% ...
## 1 A 1749.8 3847.8 5562.6 7475.2 9926.0 11758.6 13230.6 15788.8
## 2 B 2393.5 4483.6 6359.1 7708.0 9773.0 11842.8 13468.9 16266.4
## 3 C 2041.5 3682.0 5677.5 7504.0 9226.0 11470.0 13628.5 15379.0
## 4 D 1890.7 4086.8 5661.9 7526.6 9714.0 11438.8 13969.2 15967.2
## 5 E 2083.6 4107.0 6179.8 7910.8 10095.0 11692.6 13668.0 15570.2
## 6 F 1936.6 4220.2 6197.0 8791.8 10382.0 12266.4 14589.2 16407.0
## 7 G 3059.4 4884.2 6519.6 8530.0 10481.0 12469.0 14401.6 16127.8
## 8 H 2186.5 4081.0 5801.5 7206.0 9256.5 11453.0 13692.0 15471.0
## 9 I 1534.1 3793.2 5822.2 7621.4 9417.5 11737.0 14191.2 15722.4
## 10 J 1967.2 4286.6 5829.6 7664.6 10606.0 12217.4 14422.2 16628.0
## ...
with the caveat that numbers is actually a single matrix column holding all ten quantiles, which may need to be unpacked into separate columns for further use.
dplyr works if you use list columns or do and reshape:
library(tidyverse)
IData %>% group_by(let) %>%
summarise(quant_prob = list(paste0('quant', seq(.1, 1, .1))),
quant_value = list(quantile(numbers, seq(.1, 1, .1)))) %>%
unnest() %>%
spread(quant_prob, quant_value)
## # A tibble: 26 × 11
## let quant0.1 quant0.2 quant0.3 quant0.4 quant0.5 quant0.6 quant0.7 quant0.8 quant0.9 quant1
## * <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 1749.8 3847.8 5562.6 7475.2 9926.0 11758.6 13230.6 15788.8 17763.0 19958
## 2 B 2393.5 4483.6 6359.1 7708.0 9773.0 11842.8 13468.9 16266.4 17877.4 19929
## 3 C 2041.5 3682.0 5677.5 7504.0 9226.0 11470.0 13628.5 15379.0 17265.0 19876
## 4 D 1890.7 4086.8 5661.9 7526.6 9714.0 11438.8 13969.2 15967.2 17961.0 19989
## 5 E 2083.6 4107.0 6179.8 7910.8 10095.0 11692.6 13668.0 15570.2 18011.4 19887
## 6 F 1936.6 4220.2 6197.0 8791.8 10382.0 12266.4 14589.2 16407.0 18345.0 19997
## 7 G 3059.4 4884.2 6519.6 8530.0 10481.0 12469.0 14401.6 16127.8 18219.2 19922
## 8 H 2186.5 4081.0 5801.5 7206.0 9256.5 11453.0 13692.0 15471.0 17331.0 19996
## 9 I 1534.1 3793.2 5822.2 7621.4 9417.5 11737.0 14191.2 15722.4 17706.6 19965
## 10 J 1967.2 4286.6 5829.6 7664.6 10606.0 12217.4 14422.2 16628.0 18091.2 19901
## # ... with 16 more rows
Another interesting option is purrrlyr::by_slice, which lets you collect the results to columns:
IData %>% group_by(let) %>%
by_slice(~quantile(.x$numbers, seq(0.1, 1, 0.1)), .collate = "cols")
## # A tibble: 26 × 11
## let .out1 .out2 .out3 .out4 .out5 .out6 .out7 .out8 .out9 .out10
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 1749.8 3847.8 5562.6 7475.2 9926.0 11758.6 13230.6 15788.8 17763.0 19958
## 2 B 2393.5 4483.6 6359.1 7708.0 9773.0 11842.8 13468.9 16266.4 17877.4 19929
## 3 C 2041.5 3682.0 5677.5 7504.0 9226.0 11470.0 13628.5 15379.0 17265.0 19876
## 4 D 1890.7 4086.8 5661.9 7526.6 9714.0 11438.8 13969.2 15967.2 17961.0 19989
## 5 E 2083.6 4107.0 6179.8 7910.8 10095.0 11692.6 13668.0 15570.2 18011.4 19887
## 6 F 1936.6 4220.2 6197.0 8791.8 10382.0 12266.4 14589.2 16407.0 18345.0 19997
## 7 G 3059.4 4884.2 6519.6 8530.0 10481.0 12469.0 14401.6 16127.8 18219.2 19922
## 8 H 2186.5 4081.0 5801.5 7206.0 9256.5 11453.0 13692.0 15471.0 17331.0 19996
## 9 I 1534.1 3793.2 5822.2 7621.4 9417.5 11737.0 14191.2 15722.4 17706.6 19965
## 10 J 1967.2 4286.6 5829.6 7664.6 10606.0 12217.4 14422.2 16628.0 18091.2 19901
## # ... with 16 more rows
though the column names are a little lousy.

We can do this in a compact way with data.table. Convert the 'data.frame' to a 'data.table' (setDT(IData)), group by 'let', get the quantile of 'numbers', and convert the result to a list (as.list):
library(data.table)
setDT(IData)[, as.list(quantile(numbers, seq(.1, 1, .1))), by = let]
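The question also asked for an lapply route, which none of the answers above spell out. A minimal base R sketch, assuming the same IData as in the question: split the numbers by letter, apply quantile to each piece, and bind the results. The names probs, by_letter, deciles, decile_matrix and decile_table are illustrative and not from the original posts.

```r
# Rebuild the example data from the question
set.seed(10)
IData <- data.frame(
  let = sample(x = LETTERS, size = 10000, replace = TRUE),
  numbers = sample(x = 1:20000, size = 10000)
)

probs <- seq(0.1, 1, 0.1)

# One numeric vector per letter
by_letter <- split(IData$numbers, IData$let)

# One named vector of ten deciles per letter
deciles <- lapply(by_letter, quantile, probs = probs)

# Stack into a matrix (one row per letter), then attach the letter as a column
decile_matrix <- do.call(rbind, deciles)
decile_table <- data.frame(
  let = rownames(decile_matrix),
  decile_matrix,
  row.names = NULL,
  check.names = FALSE  # keep the "10%", "20%", ... column names
)
head(decile_table)
```

Keeping the intermediate list (deciles) around is handy if you want to inspect a single group, e.g. deciles[["A"]].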

Related

Calculate sum of the total phylogenetic branch length from species birth-death table

BACKGROUND
This question is a bit complex so I will first introduce the background. To generate an example of species birth-death table (the L table) I would suggest to use dd_sim() function from DDD package.
library(DDD)
library(tidyverse)
library(picante)
result <- dd_sim(c(0.2, 0.1, 20), 10)
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L
L
[,1] [,2] [,3] [,4]
[1,] 10.0000000 0 -1 -1.0000000
[2,] 10.0000000 -1 2 2.7058965
[3,] 8.5908774 2 3 6.6301616
[4,] 8.4786474 3 4 3.3866813
[5,] 8.4455262 -1 -5 -1.0000000
[6,] 8.3431071 4 6 3.5624756
[7,] 5.3784683 2 7 0.6975934
[8,] 3.8950593 6 8 -1.0000000
[9,] 1.5032100 -5 -9 -1.0000000
[10,] 0.8393589 7 10 -1.0000000
[11,] 0.6118985 -5 -11 -1.0000000
The L table has 4 columns:
the first column is the time at which a species is born in Mya
the second column is the label of the parent of the species; positive and negative values indicate whether the species belongs to the left or right crown lineage
the third column is the label of the daughter species itself; positive and negative values indicate whether the species belongs to the left or right crown lineage
the fourth column is the time of extinction of the species; if the
fourth element equals -1, then the species is still extant.
WHAT I DID
With this L table, I now have an extant community. I want to calculate its phylogenetic diversity (also called Faith's index).
Using DDD and picante functions I can do it:
# convert L table into community data matrix
comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
Faith = picante::pd(comm,phy)
PROBLEM
Although I achieved my goal, the procedure seems to be redundant and time-consuming. I have to convert my original L table back and forth because I have to use existing functions.
By definition Faith's index is basically the sum of the total phylogenetic branch length of the community, so my question is:
Is it possible to calculate Faith's index directly from the L table?
Thank you in advance!
You can simply use the phy$edge.length component of the phylo object generated by DDD::L2phylo:
## Measuring the sum of the branch lengths from `phy`
sum_br_length <- sum(phy$edge.length)
sum_br_length == Faith$PD
# [1] TRUE
## Measuring the sum of the branch length from `L`
sum_br_length <- sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length)
sum_br_length == Faith$PD
# [1] TRUE
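The == checks above return TRUE for this tree, but exact equality between floating-point sums is fragile in general. A tolerance-based comparison with all.equal() is the more robust habit. The snippet below illustrates the difference with plain numbers rather than the phylogeny, so the values are purely illustrative.

```r
x <- 0.1 + 0.2

# Exact comparison of doubles: sensitive to rounding in the last bits
exact_match <- x == 0.3

# Comparison within a small numerical tolerance
close_match <- isTRUE(all.equal(x, 0.3))

print(c(exact = exact_match, close = close_match))
```

Applied to the code above, the check would read isTRUE(all.equal(sum_br_length, Faith$PD)).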
And some micro-benchmarking for fun:
library(microbenchmark)
## Function 1
fun1 <- function(L) {
comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
Faith = picante::pd(comm,phy)
return(Faith$PD)
}
## Function 2
fun2 <- function(L) {
phy <- DDD::L2phylo(L, dropextinct = T)
return(sum(phy$edge.length))
}
## Function 3
fun3 <- function(L) {
return(sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length))
}
## Do all of them give the same results
fun1(L) == Faith$PD
# [1] TRUE
fun2(L) == Faith$PD
# [1] TRUE
fun3(L) == Faith$PD
# [1] TRUE
## Which function is fastest?
microbenchmark(fun1(L), fun2(L), fun3(L))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1(L) 6.486462 6.900641 8.273386 7.445334 8.667535 16.888429 100
# fun2(L) 1.627854 1.683204 2.215531 1.771219 2.229408 9.522366 100
# fun3(L) 1.630635 1.663181 2.229206 1.859733 2.448196 7.573001 100
I examined picante::sample2matrix to see what it does internally. The tapply call and the line that follows it appear to be the only necessary pieces.
library(DDD)
library(tidyverse)
library(picante)
#> Loading required package: ape
#> Loading required package: vegan
#> Loading required package: permute
#> Loading required package: lattice
#> This is vegan 2.5-6
#> Loading required package: nlme
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
set.seed(100)
result <- dd_sim(c(0.2, 0.1, 20), 10)
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L
# convert L table into community data matrix
comm_original = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# Instead of using dplyr, we'll do some base R operations
# on L. The code doesn't look as nice, but it should be
# significantly faster.
pa <- ifelse(L[, 4] == -1, 1, 0)
plot <- rep(0, length(pa))
id <- paste0("t", abs(L[,3]))
comm_new <- tapply(pa, list(plot, id), sum)
comm_new[is.na(comm_new)] <- 0
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
picante::pd(comm_original,phy)
#> PD SR
#> 0 29.82483 6
picante::pd(comm_new, phy)
#> PD SR
#> 0 29.82483 6
Created on 2019-11-17 by the reprex package (v0.3.0)
Edit: original() is the way you originally constructed the comm, new() is the way given above. It looks like you can expect a 2x speedup if you swap this in. I know that's not a huge gain depending on the size of the workload, but better than nothing.
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 original() 9.76ms 10.24ms 96.1 552KB 2.04 47 1 489ms <df[,1125] [1 x 1,1~ <df[,3] [107 x~ <bch:t~ <tibble [48 x~
2 new() 4.57ms 4.84ms 201. 464KB 2.07 97 1 483ms <dbl[,1125] [1 x 1,~ <df[,3] [63 x ~ <bch:t~ <tibble [98 x~

Using custom function to apply across multiple groups and subsets

I am having trouble trying to apply a custom function to multiple groups within a data frame and mutate it to the original data. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The challenging issue is that the function needs the mean of two different groups of values (positive and negative controls) and then uses that mean value in each calculation.
In other words, the experimental value is subtracted from the mean of the negative control, and the result is divided by the difference between the mean of the negative control and the mean of the positive control.
Each observation, including the + and - controls, should have a calculated percent inhibition. As a double check, for each experiment (grouping) the mean of the percent inhibition of the - controls should be around 0 and that of the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
uninhibited <- as.vector(uninhibited)
inhibited <- as.vector(inhibited)
unknown <- as.vector(unknown)
mu_u <- mean(uninhibited, na.rm = TRUE)
mu_i <- mean(inhibited, na.rm = TRUE)
percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
return(percent_inhibition)
}
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data to one target, box, and replicate (below), but have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
filter(target == "A", box == "1", replicate == 1)
uninhib <-
subset$value[subset$sample == "uninhib"]
inhib <-
subset$value[subset$sample == "inhib"]
pct <- subset %>%
mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by and do, and nest functions, but my knowledge is lacking in how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss on how.
I have tried:
inhibition <- data %>%
group_by(target, box, replicate) %>%
mutate(pct = (percent_inhibition(.$value[.$sample == "uninhib"], .$value[.$sample == "inhib"], .$value)))
But I get an error that the columns are not the right length, because of the group_by function.
library(tidyr)
library(purrr)
library(dplyr)
data %>%
group_by(target, box, replicate) %>%
mutate(pct = {
x <- split(value, sample)
percent_inhibition(x$uninhib, x$inhib, value)
})
#> # A tibble: 10,000 x 6
#> # Groups: target, box, replicate [27]
#> target box replicate sample value pct
#> <chr> <chr> <int> <chr> <dbl> <dbl>
#> 1 A 1 3 inhib -0.836 1941.
#> 2 C 1 1 uninhib -0.221 -281.
#> 3 B 3 2 inhib -2.10 1547.
#> 4 C 1 1 uninhib -1.67 -3081.
#> 5 C 1 3 inhib -1.10 -1017.
#> 6 A 2 1 inhib -1.67 906.
#> 7 B 3 1 uninhib -0.0495 -57.3
#> 8 C 3 2 inhib 1.56 5469.
#> 9 B 3 2 uninhib -0.405 321.
#> 10 B 1 2 inhib 0.786 -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
group_by(target, box, replicate) %>%
mutate(pct = percent_inhibition(value[sample == "uninhib"],
value[sample == "inhib"], value))
With data as:
n <- 10000L
set.seed(123) ; data <-
tibble(
target = sample(LETTERS[1:3], n, replace = TRUE),
box = sample(as.character(1:3), n, replace = TRUE),
replicate = sample(1:3, n, replace = TRUE),
sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
value = rnorm(n)
)
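The double check described in the question (controls averaging 0 and 100 within each group) can be sketched on a small deterministic data set. This is only an illustration: the toy tibble, its values, the extra "test" sample type and the name check are assumptions, and .groups requires dplyr 1.0.0 or later.

```r
library(dplyr)

percent_inhibition <- function(uninhibited, inhibited, unknown) {
  mu_u <- mean(uninhibited, na.rm = TRUE)
  mu_i <- mean(inhibited, na.rm = TRUE)
  (mu_u - unknown) / (mu_u - mu_i) * 100
}

# Hypothetical toy data: two targets, each with two controls of each kind
toy <- tibble(
  target = rep(c("A", "B"), each = 6),
  sample = rep(c("uninhib", "inhib", "test"), times = 4),
  value  = c(10, 2, 7, 12, 4, 5,    # target A
             20, 0, 11, 22, 2, 16)  # target B
)

check <- toy %>%
  group_by(target) %>%
  # inside mutate(), bare column names refer to the current group only
  mutate(pct = percent_inhibition(value[sample == "uninhib"],
                                  value[sample == "inhib"],
                                  value)) %>%
  group_by(target, sample) %>%
  summarise(mean_pct = mean(pct), .groups = "drop")

check
```

Because the bare column names are evaluated per group, the control means are recomputed for each target, which is what the `.$value` version in the question could not do.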

Creating new columns with mutate

I have a solution to my problem, but it is far from optimal and does not scale to a large df. Let me explain.
I have a big data frame and I need to create new columns by subtracting one existing column from another. Let me show you using a simple df.
A<-rnorm(10)
B<-rnorm(10)
C<-rnorm(10)
D<-rnorm(10)
E<-rnorm(10)
F<-rnorm(10)
df1<-data_frame(A,B,C,D,E,F)
# A tibble: 10 x 6
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.8750025 0.4685855 2.4435767 1.6999761 -1.3848386 -0.58992249
2 0.2551404 1.8555876 0.8365116 -1.6151186 -1.7754623 0.04423463
3 0.7740396 -1.0756147 0.6830024 -2.3879337 -1.3165875 -1.36646493
4 0.2059932 0.9322016 1.2483196 -0.1787840 0.3546773 -0.12874831
5 -0.4561725 -0.1464692 -0.7112905 0.2791592 0.5835127 0.16493237
6 1.2401795 -1.1422917 -0.6189480 -1.4975416 0.5653565 -1.32575021
7 -1.6173618 0.2283430 0.6154920 0.6082847 0.0273447 0.16771783
8 0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234 0.27972188
9 -0.4841361 0.1651265 0.0296500 0.4324903 -0.3895971 -2.90426195
10 -2.7106357 0.5496335 0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract two columns at a similar distance : D-A, E-B, F-C while (ii) giving the new column a name based on the name of the initial variables' names.
I did in that way and it works:
df2<-df1 %>%
transmute (!!paste0("diff","D","A") := D-A,
!!paste0("diff","E","B") := E-B,
!!paste0("diff","F","C") := F-C)
# A tibble: 10 x 3
diffDA diffEB diffFC
<dbl> <dbl> <dbl>
1 4.5749785 -1.8534241 -3.0334991
2 -1.8702591 -3.6310500 -0.7922769
3 -3.1619734 -0.2409728 -2.0494674
4 -0.3847772 -0.5775242 -1.3770679
5 0.7353317 0.7299819 0.8762229
6 -2.7377211 1.7076482 -0.7068022
7 2.2256465 -0.2009983 -0.4477741
8 -0.6155016 -1.8635734 0.8067342
9 0.9166264 -0.5547236 -2.9339120
10 2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like to find a way to make the code simpler. I tried many things (such as mutate_all, mutate_at or add_columns) but nothing worked.
OK, here's a method that will work for the full width of your data set.
df1 <- tibble(A = rnorm(10),
B = rnorm(10),
C = rnorm(10),
D = rnorm(10),
E = rnorm(10),
F = rnorm(10),
G = rnorm(10),
H = rnorm(10),
I = rnorm(10))
ct <- seq_len(ncol(df1) - 3)  # stop three columns short so that i + 3 stays in range
diff_tbl <- tibble(testcol = rnorm(10))
for (i in ct) {
new_tbl <- tibble(col = df1[[i+3]] - df1[[i]])
names(new_tbl)[1] <- paste('diff',colnames(df1[i+3]),colnames(df1[i]),sep='')
diff_tbl <- bind_cols(diff_tbl,new_tbl)
}
diff_tbl <- diff_tbl %>%
select(-testcol)
df1 <- bind_cols(df1,diff_tbl)
Basically, what you are doing is creating a second dummy tibble to compute the differences, iterating over the possible differences (i.e. gaps of three columns) then assembling them into a single tibble, then binding those columns to the original tibble. As you can see, I extended df1 by three extra columns and the whole thing worked like a charm.
It's probable that there's a more elegant way to do this, but this method definitely works. There's one slightly awkward thing in that I had to create the diff_tbl with a dummy column and then remove it before the final bind_cols() call, but it's not a major thing, I think.
You could divide the data frame in two parts and do
inds <- ncol(df1)/2
df1[paste0("diff", names(df1[(inds + 1):ncol(df1)]), names(df1[1:inds]))] <-
df1[(inds + 1):ncol(df1)] - df1[1:inds]
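The same column-block subtraction can be wrapped in a small helper that takes the gap between paired columns as an argument, so it also covers data frames that are not split exactly in half. This is a sketch under assumptions: the function name add_offset_diffs and its offset argument are hypothetical, and all columns are assumed to be numeric.

```r
# Hypothetical helper: for every column i, subtract it from column i + offset
add_offset_diffs <- function(df, offset) {
  stopifnot(offset >= 1, offset < ncol(df))
  left  <- seq_len(ncol(df) - offset)  # columns being subtracted
  right <- left + offset               # columns they are subtracted from
  diffs <- df[right] - df[left]        # element-wise, whole blocks at once
  names(diffs) <- paste0("diff", names(df)[right], names(df)[left])
  cbind(df, diffs)
}

set.seed(1)
df1 <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5),
                  D = rnorm(5), E = rnorm(5), F = rnorm(5))

out <- add_offset_diffs(df1, offset = 3)
names(out)
```

With six columns and offset = 3 this reproduces diffDA, diffEB and diffFC; with nine columns it would also add the G-D, H-E and I-F differences.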
Note that column names with dashes in them are improper and not recommended.
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
# D-A E-B F-C
# 1 0.12459065 0.05855622 0.6134559
# 2 -2.65583389 0.26425762 0.8344115
# 3 -1.48761765 -3.13999402 1.3008065
# 4 -4.37469763 1.37551178 1.3405191
# 5 1.01657135 -0.90690359 1.5848562
# 6 -0.34050959 -0.57687686 -0.3794937
# 7 0.85233808 0.57911293 -0.8896393
# 8 0.01931559 0.91385740 3.2685647
# 9 -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146 0.05927664 0.3965072

Programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(dplyr)  # for %>%, mutate, group_by, lag, arrange
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
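For this particular case only one value column needs to be spread, so a tidyr-only route back to the wide layout is also possible. The following is a sketch, assuming tidyr 1.0.0 or later (for pivot_longer() and pivot_wider()) and the df from the question. The names less_than and wide are illustrative.

```r
library(dplyr)
library(tidyr)

df <- tibble(group = c("A", "B"),
             no.highschool = c(20, 10),
             high.school = c(70, 40),
             college = c(10, 40),
             graduate = c(0, 10))

less_than <- df %>%
  # long format; rows keep the original column order within each group
  pivot_longer(-group, names_to = "education", values_to = "n") %>%
  group_by(group) %>%
  mutate(lessthan = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  ungroup() %>%
  select(-n) %>%
  # back to wide, prefixing each education level with "lessthan_"
  pivot_wider(names_from = education, values_from = lessthan,
              names_prefix = "lessthan_")

wide <- left_join(df, less_than, by = "group")
wide
```

This relies on the education columns already being in ascending order in df; if they are not, reorder them (or convert education to an ordered factor and arrange) before taking the cumulative sum.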

Dplyr: how to loop over specific columns whose names are in a list?

I have a dataframe that looks like this
set.seed(10)
sample <- data_frame(group = c('A','B','C','C',NA,'D'),
var_hello = rnorm(6),
var_how = rnorm(6),
var_are = rnorm(6),
var_you = rnorm(6),
var_buddy = rnorm(6))
# A tibble: 6 × 6
group var_hello var_how var_are var_you var_buddy
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.01874617 -1.2080762 -0.23823356 0.9255213 -1.2651980
2 B -0.18425254 -0.3636760 0.98744470 0.4829785 -0.3736616
3 C -1.37133055 -1.6266727 0.74139013 -0.5963106 -0.6875554
4 C -0.59916772 -0.2564784 0.08934727 -2.1852868 -0.8721588
5 <NA> 0.29454513 1.1017795 -0.95494386 -0.6748659 -0.1017610
6 D 0.38979430 0.7557815 -0.19515038 -2.1190612 -0.2537805
In my original dataset, there are many, many var_something variables.
I would like to group_by('group') and compute the mean of a subset of these var_something variables, but even this subset can be large. So I don't want to resort to manually typing a mutate for every variable.
In the example, I am interested in variables in the following list ['var_hello', 'var_are']
I don't know how to code that up efficiently in dplyr. In Pandas, one could simply write
for var in ['var_hello', 'var_are']:
    sample['computation_' + var] = sample.groupby('group')[var].transform('mean')
Note how I can automatically create the new column names (of the form computation_var_hello) . What is the best way to achieve that in dplyr?
Many thanks!
You can do this simply by using group_by and summarize_each. You then specify which variables you want to summarize, then replace the prefix in the names using setNames.
sample %>%
group_by(group) %>%
summarize_each(funs(mean), var_hello, var_are) %>%
setNames(gsub("var_","computation_var_",colnames(.)))
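summarize_each() and funs() have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later, the same idea can be sketched with across(), which takes the column names as a character vector and builds the new names through its .names argument. The names sample_df, vars and result are illustrative.

```r
library(dplyr)

set.seed(10)
sample_df <- tibble(group = c("A", "B", "C", "C", NA, "D"),
                    var_hello = rnorm(6),
                    var_how = rnorm(6),
                    var_are = rnorm(6),
                    var_you = rnorm(6),
                    var_buddy = rnorm(6))

# Columns to summarise, held in a plain character vector
vars <- c("var_hello", "var_are")

result <- sample_df %>%
  group_by(group) %>%
  summarise(across(all_of(vars), mean, .names = "computation_{.col}"),
            .groups = "drop")

result
```

Swapping summarise() for mutate() keeps every row and appends the group means as new columns, which is closer to what the pandas loop in the question does.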
