cocor package won't read my variable as numeric (R)

I am performing correlations with the following data:
datacor
A tibble: 213 x 3
Prop_coord Prop_assoc PPT
<dbl> <dbl> <dbl>
1 0.474 0.211 92
2 0.343 0.343 85
3 0.385 0.308 83
4 0.714 0 92
5 0.432 0.273 73
6 0.481 0.148 92
7 0.455 0.273 96
8 0.605 0.184 88
9 0.412 0.235 98
10 0.5 0.318 94
# … with 203 more rows
cor.test() works fine, but when I try to compare the correlations with cocor() I get this error:
> cocor(~ Prop_coord+PPT | Prop_assoc+PPT, datacor)
Error in cocor(~Prop_coord + PPT | Prop_assoc + PPT, datacor) :
The variable 'PPT' must be numeric
What should I do?

Just to keep a record here of the fix someone else helped me with: the problem is that cocor does not seem to work with tibbles. When I read my data in as a data.frame instead, it worked perfectly.
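For completeness, a minimal sketch of that fix (converting the existing tibble is the only change; the cocor() call stays the same):
datacor <- as.data.frame(datacor)
cocor(~ Prop_coord + PPT | Prop_assoc + PPT, datacor)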


Abline command is not showing a regression line?

I'm new to R programming and I'm trying to plot a regression line for this data set, but it doesn't seem to be working.
I followed exactly what my professor did, and I've also tried swapping the abline command for abline(lm(batters$EBH ~ batters$TB)), with similar results.
Here is my code for it:
batters<-read.table(header=TRUE, text="
X AVG EBH TB OPS K.to.BB.Ratio
1 LeMahieu 0.327 61 312 0.893 1.95
2 Urshela 0.314 55 236 0.889 3.48
3 Torres 0.278 64 292 0.871 2.64
4 Judge 0.272 46 204 0.921 2.21
5 Sanchez 0.232 47 208 0.841 3.13
6 Wong 0.285 40 202 0.784 1.76
7 Molina 0.270 34 167 0.711 2.52
8 Goldschmidt 0.260 60 284 0.821 2.13
9 Ozuna 0.243 53 230 0.804 1.84
10 DeJong 0.233 62 259 0.762 2.39
11 Altuve 0.298 61 275 0.903 1.98
12 Bregman 0.296 80 328 1.015 0.69
13 Springer 0.292 62 283 0.974 1.69
14 Reddick 0.275 36 205 0.728 1.83
15 Chirinos 0.238 40 162 0.791 2.45
16 Bellinger 0.305 84 351 1.035 1.14
17 Turner 0.290 51 244 0.881 1.72
18 Seager 0.272 64 236 0.817 2.23
19 Taylor 0.262 45 169 0.794 3.11
20 Muncy 0.251 58 251 0.889 1.65
21 Meadows 0.291 69 296 0.922 2.43
22 Garcia 0.282 47 227 0.796 4.03
23 Pham 0.273 56 255 0.818 1.52
24 Choi 0.261 41 188 0.822 1.69
25 Adames 0.254 46 222 0.735 3.32
26 Yelich 0.329 76 328 1.101 1.48
27 Braun 0.285 55 232 0.849 3.09
28 Moustakas 0.254 66 270 0.845 1.85
29 Grandal 0.246 56 240 0.848 1.28
30 Arcia 0.223 32 173 0.633 2.53")
plot(batters$EBH,batters$TB,main="Attribute Pairing 5",xlab="EBH",ylab="TB")
lm(formula = batters$EBH~batters$TB)
#Call:
#lm(formula = batters$EBH ~ batters$TB)
#Coefficients:
#(Intercept) batters$TB
# -4.1275 0.2416
lin_model_1<-lm(formula = batters$EBH~batters$TB)
summary(lin_model_1)
abline(-4.12752, 0.24162)
I apologize for the messy coding, this is for a class.
Your formula is backwards in the lm() call: the dependent variable goes on the left side of the "~".
In your plot the y-axis (dependent) variable is TB, but in your regression model it is used as the independent variable. For the fitted line to match the plot, swap EBH and TB in the formula.
plot(batters$EBH,batters$TB,main="Attribute Pairing 5",xlab="EBH",ylab="TB")
model <-lm(formula = batters$TB ~batters$EBH)
model
# Call:
# lm(formula = batters$TB ~ batters$EBH)
# Coefficients:
# (Intercept)  batters$EBH
#      46.510        3.603
abline(model)
# or
abline(46.51, 3.60)
Also, if you pass the model object to abline(), as above, you avoid having to type out the intercept and slope by hand.
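For example, a slightly tidier version of the same fit using the data argument (a sketch; the coefficients are identical to those above):
model <- lm(TB ~ EBH, data = batters)
plot(batters$EBH, batters$TB, main = "Attribute Pairing 5", xlab = "EBH", ylab = "TB")
abline(model)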

"Error: Must subset rows with a valid subscript vector" in preProcess() when using knnImpute

I'm using Kaggle's Pokemon data to practice KNN imputation via preProcess(), but I get the following error at the predict() step. I am wondering whether I'm using the wrong data format or whether some columns have an inappropriate class. Below is my code.
library(dplyr)
library(ggplot2)
library(tidyr)
library(reshape2)
library(caret)
library(skimr)
library(psych)
library(e1071)
library(data.table)
pokemon <- read.csv("https://www.dropbox.com/s/znbta9u9tub2ox9/pokemon.csv?dl=1")
pokemon = tbl_df(pokemon)
# select relevant features
df <- select(pokemon, hp, weight_kg, height_m, sp_attack, sp_defense, capture_rate)
pre_process_missing_data <- preProcess(df, method="knnImpute")
classify_legendary <- predict(pre_process_missing_data, newdata = df)
and I received this error message
Error: Must subset rows with a valid subscript vector.
x Subscript `nn$nn.idx` must be a simple vector, not a matrix.
Run `rlang::last_error()` to see where the error occurred.
The input for preProcess needs to be a data.frame. This works:
pre_process_missing_data <- preProcess(as.data.frame(df), method="knnImpute")
classify_legendary <- predict(pre_process_missing_data, newdata = df)
classify_legendary
> classify_legendary
# A tibble: 801 x 6
hp weight_kg height_m sp_attack sp_defense capture_rate
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 -0.902 -0.498 -0.429 -0.195 -0.212 45
2 -0.337 -0.442 -0.152 0.269 0.325 45
3 0.415 0.353 0.774 1.57 1.76 45
4 -1.13 -0.484 -0.522 -0.349 -0.748 45
5 -0.412 -0.388 -0.0591 0.269 -0.212 45
6 0.340 0.266 0.496 2.71 1.58 45
7 -0.939 -0.479 -0.615 -0.659 -0.247 45
8 -0.375 -0.356 -0.152 -0.195 0.325 45
9 0.378 0.221 0.404 1.97 1.58 45
10 -0.902 -0.535 -0.800 -1.59 -1.82 255
# ... with 791 more rows
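As an aside, since read.csv() already returns a plain data.frame, another option is simply to skip the tbl_df() conversion so that no coercion is needed later. A minimal sketch (column names taken from the select() call above):
pokemon <- read.csv("https://www.dropbox.com/s/znbta9u9tub2ox9/pokemon.csv?dl=1")
df <- pokemon[, c("hp", "weight_kg", "height_m", "sp_attack", "sp_defense", "capture_rate")]
pre_process_missing_data <- preProcess(df, method = "knnImpute")
classify_legendary <- predict(pre_process_missing_data, newdata = df)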

Calculating the average of multiple entries with identical names

Not sure if the title makes sense, but I am new to R and, to say the least, I am confused. As you can see in the data below, I have multiple entries with the same name; for example, the combination of time ON and sample 1 appears 3 times. I want to figure out how to calculate the average of the OD at time ON and sample 1, and to do the same for all repeats in the data frame. How do I go about this?
Thanks in advance! Hope my question makes sense.
> freednaod
time sample OD
1 ON 1 0.248
2 ON 1 0.245
3 ON 1 0.224
4 ON 2 0.262
5 ON 2 0.260
6 ON 2 0.255
7 ON 3 0.245
8 ON 3 0.249
9 ON 3 0.244
10 0 1 0.010
11 0 1 0.013
12 0 1 0.012
13 0 2 0.014
14 0 2 0.013
15 0 2 0.015
16 0 3 0.013
17 0 3 0.013
18 0 3 0.014
19 30 1 0.018
20 30 1 0.020
21 30 1 0.019
22 30 2 0.017
23 30 2 0.019
24 30 2 0.021
25 30 3 0.021
26 30 3 0.020
27 30 3 0.024
28 60 1 0.023
29 60 1 0.024
30 60 1 0.023
31 60 2 0.031
32 60 2 0.031
33 60 2 0.033
34 60 3 0.025
35 60 3 0.028
36 60 3 0.024
37 90 1 0.052
38 90 1 0.048
39 90 1 0.049
40 90 2 0.076
41 90 2 0.078
42 90 2 0.081
43 90 3 0.073
44 90 3 0.068
45 90 3 0.067
46 120 1 0.124
47 120 1 0.128
48 120 1 0.134
49 120 2 0.202
50 120 2 0.202
51 120 2 0.186
52 120 3 0.192
53 120 3 0.182
54 120 3 0.183
55 150 1 0.229
56 150 1 0.215
57 150 1 0.220
58 150 2 0.197
59 150 2 0.216
60 150 2 0.200
61 150 3 0.207
62 150 3 0.211
63 150 3 0.209
By converting the 'time' column to a factor with levels given by its unique values, the output is ordered the same way as in the original dataset:
aggregate(OD ~ sample + time,
          transform(freednaod, time = factor(time, levels = unique(time))),
          mean)[c(2, 1, 3)]
Or using dplyr
library(dplyr)
freednaod %>%
  group_by(time = factor(time, levels = unique(time)), sample) %>%
  summarise(OD = mean(OD))
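As a quick sanity check against the data shown above, the first group (time ON, sample 1) should come out as:
mean(c(0.248, 0.245, 0.224))
# [1] 0.239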

How to truncate Kaplan Meier curves when number at risk is < 10

For a publication in a peer-reviewed scientific journal (http://www.redjournal.org), we would like to prepare Kaplan-Meier plots. The journal has the following specific guidelines for these plots:
"If your figures include curves generated from analyses using the Kaplan-Meier method or the cumulative incidence method, the following are now requirements for the presentation of these curves:
1. That the number of patients at risk is indicated;
2. That censoring marks are included;
3. That curves be truncated when there are fewer than 10 patients at risk; and
4. An estimate of the confidence interval should be included either in the figure itself or the text."
Here, I illustrate my problem with the veteran dataset (https://github.com/tidyverse/reprex is great!).
We can address requirements 1, 2 and 4 easily with the survminer package:
library(survival)
library(survminer)
#> Warning: package 'survminer' was built under R version 3.4.3
#> Loading required package: ggplot2
#> Loading required package: ggpubr
#> Warning: package 'ggpubr' was built under R version 3.4.3
#> Loading required package: magrittr
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
ggsurvplot(fit.obj,
           conf.int = T,
           risk.table = "absolute",
           tables.theme = theme_cleantable())
I have, however, a problem with requirement 3 (truncate curves when there are fewer than 10 patients at risk). I see that all the required information is available in the survfit object:
library(survival)
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
summary(fit.obj)
#> Call: survfit(formula = Surv(time, status) ~ celltype, data = veteran)
#>
#> celltype=squamous
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 35 2 0.943 0.0392 0.8690 1.000
#> 8 33 1 0.914 0.0473 0.8261 1.000
#> 10 32 1 0.886 0.0538 0.7863 0.998
#> 11 31 1 0.857 0.0591 0.7487 0.981
#> 15 30 1 0.829 0.0637 0.7127 0.963
#> 25 29 1 0.800 0.0676 0.6779 0.944
#> 30 27 1 0.770 0.0713 0.6426 0.924
#> 33 26 1 0.741 0.0745 0.6083 0.902
#> 42 25 1 0.711 0.0772 0.5749 0.880
#> 44 24 1 0.681 0.0794 0.5423 0.856
#> 72 23 1 0.652 0.0813 0.5105 0.832
#> 82 22 1 0.622 0.0828 0.4793 0.808
#> 110 19 1 0.589 0.0847 0.4448 0.781
#> 111 18 1 0.557 0.0861 0.4112 0.754
#> 112 17 1 0.524 0.0870 0.3784 0.726
#> 118 16 1 0.491 0.0875 0.3464 0.697
#> 126 15 1 0.458 0.0876 0.3152 0.667
#> 144 14 1 0.426 0.0873 0.2849 0.636
#> 201 13 1 0.393 0.0865 0.2553 0.605
#> 228 12 1 0.360 0.0852 0.2265 0.573
#> 242 10 1 0.324 0.0840 0.1951 0.539
#> 283 9 1 0.288 0.0820 0.1650 0.503
#> 314 8 1 0.252 0.0793 0.1362 0.467
#> 357 7 1 0.216 0.0757 0.1088 0.429
#> 389 6 1 0.180 0.0711 0.0831 0.391
#> 411 5 1 0.144 0.0654 0.0592 0.351
#> 467 4 1 0.108 0.0581 0.0377 0.310
#> 587 3 1 0.072 0.0487 0.0192 0.271
#> 991 2 1 0.036 0.0352 0.0053 0.245
#> 999 1 1 0.000 NaN NA NA
#>
#> celltype=smallcell
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 2 48 1 0.9792 0.0206 0.93958 1.000
#> 4 47 1 0.9583 0.0288 0.90344 1.000
#> 7 46 2 0.9167 0.0399 0.84172 0.998
#> 8 44 1 0.8958 0.0441 0.81345 0.987
#> 10 43 1 0.8750 0.0477 0.78627 0.974
#> 13 42 2 0.8333 0.0538 0.73430 0.946
#> 16 40 1 0.8125 0.0563 0.70926 0.931
#> 18 39 2 0.7708 0.0607 0.66065 0.899
#> 20 37 2 0.7292 0.0641 0.61369 0.866
#> 21 35 2 0.6875 0.0669 0.56812 0.832
#> 22 33 1 0.6667 0.0680 0.54580 0.814
#> 24 32 1 0.6458 0.0690 0.52377 0.796
#> 25 31 2 0.6042 0.0706 0.48052 0.760
#> 27 29 1 0.5833 0.0712 0.45928 0.741
#> 29 28 1 0.5625 0.0716 0.43830 0.722
#> 30 27 1 0.5417 0.0719 0.41756 0.703
#> 31 26 1 0.5208 0.0721 0.39706 0.683
#> 51 25 2 0.4792 0.0721 0.35678 0.644
#> 52 23 1 0.4583 0.0719 0.33699 0.623
#> 54 22 2 0.4167 0.0712 0.29814 0.582
#> 56 20 1 0.3958 0.0706 0.27908 0.561
#> 59 19 1 0.3750 0.0699 0.26027 0.540
#> 61 18 1 0.3542 0.0690 0.24171 0.519
#> 63 17 1 0.3333 0.0680 0.22342 0.497
#> 80 16 1 0.3125 0.0669 0.20541 0.475
#> 87 15 1 0.2917 0.0656 0.18768 0.453
#> 95 14 1 0.2708 0.0641 0.17026 0.431
#> 99 12 2 0.2257 0.0609 0.13302 0.383
#> 117 9 1 0.2006 0.0591 0.11267 0.357
#> 122 8 1 0.1755 0.0567 0.09316 0.331
#> 139 6 1 0.1463 0.0543 0.07066 0.303
#> 151 5 1 0.1170 0.0507 0.05005 0.274
#> 153 4 1 0.0878 0.0457 0.03163 0.244
#> 287 3 1 0.0585 0.0387 0.01600 0.214
#> 384 2 1 0.0293 0.0283 0.00438 0.195
#> 392 1 1 0.0000 NaN NA NA
#>
#> celltype=adeno
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 3 27 1 0.9630 0.0363 0.89430 1.000
#> 7 26 1 0.9259 0.0504 0.83223 1.000
#> 8 25 2 0.8519 0.0684 0.72786 0.997
#> 12 23 1 0.8148 0.0748 0.68071 0.975
#> 18 22 1 0.7778 0.0800 0.63576 0.952
#> 19 21 1 0.7407 0.0843 0.59259 0.926
#> 24 20 1 0.7037 0.0879 0.55093 0.899
#> 31 19 1 0.6667 0.0907 0.51059 0.870
#> 35 18 1 0.6296 0.0929 0.47146 0.841
#> 36 17 1 0.5926 0.0946 0.43344 0.810
#> 45 16 1 0.5556 0.0956 0.39647 0.778
#> 48 15 1 0.5185 0.0962 0.36050 0.746
#> 51 14 1 0.4815 0.0962 0.32552 0.712
#> 52 13 1 0.4444 0.0956 0.29152 0.678
#> 73 12 1 0.4074 0.0946 0.25850 0.642
#> 80 11 1 0.3704 0.0929 0.22649 0.606
#> 84 9 1 0.3292 0.0913 0.19121 0.567
#> 90 8 1 0.2881 0.0887 0.15759 0.527
#> 92 7 1 0.2469 0.0850 0.12575 0.485
#> 95 6 1 0.2058 0.0802 0.09587 0.442
#> 117 5 1 0.1646 0.0740 0.06824 0.397
#> 132 4 1 0.1235 0.0659 0.04335 0.352
#> 140 3 1 0.0823 0.0553 0.02204 0.307
#> 162 2 1 0.0412 0.0401 0.00608 0.279
#> 186 1 1 0.0000 NaN NA NA
#>
#> celltype=large
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 12 27 1 0.9630 0.0363 0.89430 1.000
#> 15 26 1 0.9259 0.0504 0.83223 1.000
#> 19 25 1 0.8889 0.0605 0.77791 1.000
#> 43 24 1 0.8519 0.0684 0.72786 0.997
#> 49 23 1 0.8148 0.0748 0.68071 0.975
#> 52 22 1 0.7778 0.0800 0.63576 0.952
#> 53 21 1 0.7407 0.0843 0.59259 0.926
#> 100 20 1 0.7037 0.0879 0.55093 0.899
#> 103 19 1 0.6667 0.0907 0.51059 0.870
#> 105 18 1 0.6296 0.0929 0.47146 0.841
#> 111 17 1 0.5926 0.0946 0.43344 0.810
#> 133 16 1 0.5556 0.0956 0.39647 0.778
#> 143 15 1 0.5185 0.0962 0.36050 0.746
#> 156 14 1 0.4815 0.0962 0.32552 0.712
#> 162 13 1 0.4444 0.0956 0.29152 0.678
#> 164 12 1 0.4074 0.0946 0.25850 0.642
#> 177 11 1 0.3704 0.0929 0.22649 0.606
#> 200 9 1 0.3292 0.0913 0.19121 0.567
#> 216 8 1 0.2881 0.0887 0.15759 0.527
#> 231 7 1 0.2469 0.0850 0.12575 0.485
#> 250 6 1 0.2058 0.0802 0.09587 0.442
#> 260 5 1 0.1646 0.0740 0.06824 0.397
#> 278 4 1 0.1235 0.0659 0.04335 0.352
#> 340 3 1 0.0823 0.0553 0.02204 0.307
#> 378 2 1 0.0412 0.0401 0.00608 0.279
#> 553 1 1 0.0000 NaN NA NA
But I have no idea how I can manipulate this list. I would very much appreciate any advice on how to filter out all lines with n.risk < 10 from fit.obj.
I can't quite seem to get this all the way there, but I see that you can pass a data.frame rather than a fit object to the plotting function, so you can do that and clip the values. For example:
ss <- subset(surv_summary(fit.obj), n.risk>=10)
ggsurvplot(ss, conf.int = T)
But it seems that in this mode it does not automatically print the risk table. There is a function to draw just the table:
ggrisktable(fit.obj, tables.theme = theme_cleantable())
So I guess you could just combine the two (see the sketch below). Maybe I'm missing an easier way to draw the table in the same plot when starting from a data.frame.
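Something along these lines might work as that combination (a rough, untested sketch; survminer's return types vary between versions, so the two objects may need converting to grobs first):
library(gridExtra)
ss <- subset(surv_summary(fit.obj), n.risk >= 10)
p_curves <- ggsurvplot(ss, conf.int = T)                             # truncated curves
p_table  <- ggrisktable(fit.obj, tables.theme = theme_cleantable())  # full risk table
grid.arrange(p_curves, p_table, ncol = 1, heights = c(3, 1))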
As a slight variation on the above answers, if you want to truncate each group individually when fewer than 10 patients are at risk in that group, I found the following to work without plotting the figure and table separately:
library(survival)
library(survminer)
# truncate each line when fewer than 10 at risk
atrisk <- 10
# KM fit
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
# subset each stratum separately
maxcutofftime = 0 # for plotting
strata <- rep(names(fit.obj$strata), fit.obj$strata)
for (i in names(fit.obj$strata)){
cutofftime <- min(fit.obj$time[fit.obj$n.risk < atrisk & strata == i])
maxcutofftime = max(maxcutofftime, cutofftime)
cutoffs <- which(fit.obj$n.risk < atrisk & strata == i)
fit.obj$lower[cutoffs] <- NA
fit.obj$upper[cutoffs] <- NA
fit.obj$surv[cutoffs] <- NA
}
# plot
ggsurvplot(fit.obj, data = veteran, risk.table = TRUE, conf.int = T, pval = F,
tables.theme = theme_cleantable(), xlim = c(0,maxcutofftime), break.x.by = 90)
Edited to add: note that if we had used pval = T above, it would give the p-value for the truncated data, not the full data. That doesn't make much of a difference in this example, as both are p < 0.0001, but be careful :)
I'm following up on MrFlick's great answer.
I'd interpret requirement 3 to mean that there should be at least 10 at risk in total, i.e. not per group. So we have to create an ungrouped Kaplan-Meier fit first and determine the time cutoff from there.
Subset the surv_summary object with respect to this cutoff.
Plot the KM curve and the risk table separately. Crucially, the function survminer::ggrisktable() (a minimal front end for ggsurvtable()) accepts the options xlim and break.time.by. However, the function can currently only extend the upper time limit, not reduce it; I assume this is a bug. I created the function ggsurvtable_mod() to change this.
Turn the ggplot objects into grobs and use gridExtra::grid.arrange() to put both plots together. There is probably a more elegant way to do this based on the widths and heights options.
Admittedly, this is a bit of a hack and needs tweaking to get the correct alignment between survival plot and risk table.
library(survival)
library(survminer)
# ungrouped KM estimate to determine cutoff
fit1_ss <- surv_summary(survfit(Surv(time, status) ~ 1, data=veteran))
# time cutoff with fewer than 10 at risk
cutoff <- min(fit1_ss$time[fit1_ss$n.risk < 10])
# KM fit and subset to cutoff
fit.obj <- survfit(Surv(time, status) ~ celltype, data = veteran)
fit_ss <- subset(surv_summary(fit.obj), time < cutoff)
# KM survival plot and risk table as separate plots
p1 <- ggsurvplot(fit_ss, conf.int=TRUE)
# note options xlim and break.time.by
p2 <- ggsurvtable_mod(fit.obj,
survtable="risk.table",
tables.theme=theme_cleantable(),
xlim=c(0, cutoff),
break.time.by=100)
# turn ggplot objects into grobs and arrange them (needs tweaking)
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
lom <- rbind(c(NA, rep(1, 14)),
c(NA, rep(1, 14)),
c(rep(2, 15)))
gridExtra::grid.arrange(grobs=list(g1, g2), layout_matrix=lom)

"summarise_at" and "mutate_if" for descriptive statistics for character variables

I would like to use summarise_at and mutate_at on multiple character variables at the same time. I have looked at many examples that use integer variables, but I just can't figure it out for character variables. Directly below is the code I use to produce descriptive statistics for a character (or factor) variable.
library(tidyverse)
# First block of code
starwars %>%
  group_by(gender) %>%
  summarise(n = n()) %>%
  mutate(totalN = cumsum(n)) %>%
  mutate(percent = round(n / sum(n), 3)) %>%
  mutate(cumpercent = round(cumsum(freq = n / sum(n)), 3))
This produces:
A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
I would like to produce the same thing, but for multiple character (or factor) variables at once. In this case, let's use the variables gender and eye_color. This is what I have tried:
starwars %>%
summarise_at(vars(gender, eyecolor) (n = n()) %>%
mutate_at(vars(gender, eyecolor) (totalN = (cumsum(n))) %>%
mutate_at(vars(gender", "eyecolor) (percent = round((n / sum(n)), 3)) %>%
mutate_at(vars(gender, eyecolor) (cumpercent = round(cumsum(freq = n / sum(n)),3))))))
I get the following error:
Error in eval(expr, envir, enclos) : attempt to apply non-function
I understand that there are built-in functions called using funs, but I don't want to use them. I have tried playing with the code in many different ways to get it to work, but have come up short.
What I would like to produce, is something like this:
A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000
Perhaps a loop would be better? Right now I have many lines of code to generate the descriptive statistics for each character variable because I have to run the first block of code (noted above) for each variable. It would be great if I could just list the variables I would like to use and run each through the first block of code.
Based on your expected output, mutate_at is not what you want, since it operates on the selected columns. What you actually want is to group_by gender and eye_color separately. This is a good place to wrap your summary code in a function:
library(tidyverse)
library(rlang)
summary_func = function(group_by_var){
  group_by_quo = enquo(group_by_var)
  starwars %>%
    group_by(!!group_by_quo) %>%
    summarise(n = n()) %>%
    mutate(totalN = (cumsum(n)),
           percent = round((n / sum(n)), 3),
           cumpercent = round(cumsum(freq = n / sum(n)), 3))
}
Result:
> summary_func(gender)
# A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
> summary_func(eye_color)
# A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000
The idea is to turn your summary code into a function so that you can apply the same code to different group_by variables. enquo() from rlang takes the code supplied to group_by_var and bundles it with the environment where it was called into a quosure. You can then use !! to unquote group_by_quo in the group_by step. This enables non-standard evaluation (i.e. typing summary_func(gender) instead of summary_func("gender")).
If you don't want to call summary_func separately for every variable you want to group by, you can wrap the dplyr code in map from purrr and unquote each quosure captured from the ... arguments. Notice the change from enquo to quos, which converts each argument in ... to a list of quosures:
summary_func = function(...){
  group_by_quo = quos(...)
  map(group_by_quo, ~{
    starwars %>%
      group_by(!!.x) %>%
      summarise(n = n()) %>%
      mutate(totalN = (cumsum(n)),
             percent = round((n / sum(n)), 3),
             cumpercent = round(cumsum(freq = n / sum(n)), 3))
  })
}
You can now do this:
summary_func(gender, eye_color)
or with a vector of character variable names to group_by:
group_vars = c("gender", "eye_color")
summary_func(!!!syms(group_vars))
Result:
[[1]]
# A tibble: 5 x 5
gender n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 female 19 19 0.218 0.218
2 hermaphrodite 1 20 0.011 0.230
3 male 62 82 0.713 0.943
4 none 2 84 0.023 0.966
5 <NA> 3 87 0.034 1.000
[[2]]
# A tibble: 15 x 5
eye_color n totalN percent cumpercent
<chr> <int> <int> <dbl> <dbl>
1 black 10 10 0.115 0.115
2 blue 19 29 0.218 0.333
3 blue-gray 1 30 0.011 0.345
4 brown 21 51 0.241 0.586
5 dark 1 52 0.011 0.598
6 gold 1 53 0.011 0.609
7 green, yellow 1 54 0.011 0.621
8 hazel 3 57 0.034 0.655
9 orange 8 65 0.092 0.747
10 pink 1 66 0.011 0.759
11 red 5 71 0.057 0.816
12 red, blue 1 72 0.011 0.828
13 unknown 3 75 0.034 0.862
14 white 1 76 0.011 0.874
15 yellow 11 87 0.126 1.000
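As a side note, newer rlang (>= 0.4.0) lets you write the enquo()/!! pair more compactly with the embrace operator {{ }}. A minimal sketch of the same helper (summary_func_embrace is just an illustrative name):
summary_func_embrace <- function(group_by_var) {
  starwars %>%
    group_by({{ group_by_var }}) %>%
    summarise(n = n()) %>%
    mutate(totalN = cumsum(n),
           percent = round(n / sum(n), 3),
           cumpercent = round(cumsum(n / sum(n)), 3))
}
summary_func_embrace(gender)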
