I'm an R newbie so my apologizes if this is a simple question.
I use a lot excel to create "dual entries" tables. It's likely the name 'dual table' is not the most accurate but I wouldn't know how to describe it otherwise.
I basically start from big tables and then create a new one where I average the data grouping by two columns and then I display it as a matrix.
I will share with you a perfectly functional R example I coded myself.
My question is: is there an easier / better way to do it?
This is my working code:
require(dplyr)
df <- mtcars
output_var <- 'disp'
rows_var <- 'cyl'
col_var <- 'am'
output_name <- paste0("Avg. ",output_var)
one_way_table <- df %>%
group_by(eval(parse(text=rows_var)), eval(parse(text=col_var)) ) %>%
summarise(output=mean( eval(parse(text=output_var)) ))
one_way_table <- data.frame(one_way_table, check.rows = F, check.names = F, stringsAsFactors = F)
colnames(one_way_table) <- c(rows_var, col_var, output_name)
unique_row_items <- unique(one_way_table[,rows_var])
unique_col_items <- unique(one_way_table[,col_var])
x_rows <- rep(unique_row_items, length(unique_col_items))
y_cols <- rep(unique_col_items, length(unique_row_items))
new_df <- data.frame(x = x_rows, y = y_cols, check.rows = F, check.names = F, stringsAsFactors = F)
colnames(new_df) <- c(rows_var, col_var)
new_df <- base::merge(new_df, one_way_table, by = c(rows_var, col_var), all.x=T)
m <- matrix(new_df[, output_name], ncol= length(unique(new_df[,col_var])) )
df_matrix <- data.frame(m, check.rows = F, check.names = F, stringsAsFactors = F)
Perhaps there's a way more efficient way to do it.
Notice how, since this will be coded inside a function, I had to use variable names do define what columns I want to use for the analysis.
Thanks
A possible solution for your issue can come from tidyverse. Here an example reshaping your data and aggregating with mean:
library(tidyverse)
#Data
df <- mtcars
#Code
df %>% pivot_longer(cols = -c(cyl,am)) %>% filter(name=='disp') %>%
group_by(cyl,am) %>% summarise(Mean=mean(value)) %>%
pivot_wider(names_from = am,values_from=Mean)
Output:
# A tibble: 3 x 3
# Groups: cyl [3]
cyl `0` `1`
<dbl> <dbl> <dbl>
1 4 136. 93.6
2 6 205. 155
3 8 358. 326
Which is close to df_matrix the final output of your code.
If we need to pivot, this can be done in a more simple way. We select the columns of interest and use pivot_wider with values_fn specifying as mean to be applied on the columns selected on values_from
library(dplyr)
library(tidyr)
mtcars %>%
select(cyl, am, disp) %>%
pivot_wider(names_from = am, values_from = disp, values_fn = mean)
# A tibble: 3 x 3
# cyl `1` `0`
# <dbl> <dbl> <dbl>
#1 6 155 205.
#2 4 93.6 136.
#3 8 326 358.
Related
This is the edited version of the question.
I need help to convert my wide data to long format data using the pivot_longer() function in R. The main problem is wanting to create long data with a variable nested in another variable.
For example, if I have wide data like this, where
variable fu1 and fu2 are variables for the follow-up (in days). There are two follow-up events (fu1 and fu2)
variables cpass and is are the results of two tests at each follow up
IDno <- c(1,2)
Sex <- c("M","F")
fu1 <- c(13,15)
fu2 <- c(20,18)
cpass1 <- c(27, 85)
cpass2 <- c(33, 90)
is1 <- c(201, 400)
is2 <- c(220, 430)
mydata <- data.frame(IDno, Sex,
fu1, cpass1, is1,
fu2, cpass2, is2)
mydata
which looks like this
And now, I want to convert it to long format data, and it should look like this:
I have tried the codes below, but they do not produce the data frame in the format that I want:
#renaming variables
mydata_wide <- mydata %>%
rename(fu1_day = fu1,
cp_one = cpass1,
is_one = is1,
fu2_day = fu2,
cp_two = cpass2,
is_two = is2)
#pivoting
mydata_wide %>%
pivot_longer(
cols = c(fu1_day, fu2_day),
names_to = c("fu", ".value"),
values_to = "day",
names_sep = "_") %>%
pivot_longer(
cols = c("cp_one", "is_one", "cp_two", "is_two"),
names_to = c("test", ".value"),
values_to = "value",
names_sep = "_")
The data frame, unfortunately, looks like this:
I have looked at some tutorials but have not found the best solution for this problem. Any help is very much appreciated.
library(tidyverse)
mydata %>% # the "nested" pivoting must be done within two calls
pivot_longer(cols=c(fu1,fu2),names_to = 'fu', values_to = 'day') %>%
pivot_longer(cols=c(starts_with('cpass'), starts_with('is')),
names_to = 'test', values_to = 'value') %>%
# with this filter check not mixing the tests and the follow-ups
filter(str_extract(fu,"\\d") == str_extract(test,"\\d")) %>%
mutate(test = gsub("\\d","",test)) # remove numbers in strings
Output:
# A tibble: 8 × 6
IDno Sex fu day test value
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 M fu1 13 cpass 27
2 1 M fu1 13 is 201
3 1 M fu2 20 cpass 33
4 1 M fu2 20 is 220
5 2 F fu1 15 cpass 85
6 2 F fu1 15 is 400
7 2 F fu2 18 cpass 90
8 2 F fu2 18 is 430
I'm not sure if your example is your real expected output, the first dataset and the output example that you describe do not show the same information.
I took inspiration from almost similar post from How to reshape Panel / Longitudinal survey data from wide to long format using pivot_longer and from the solution provided by RobertoT and put together these codes:
STEP 1: Generate wide data for simulation
IDno <- c(1,2)
Sex <- c("M","F")
fu1_day <- c(13,15)
fu2_day <- c(20,18)
fu1_cpass <- c(27, 85)
fu2_cpass <- c(33, 90)
fu1_is <- c(201, 400)
fu2_is <- c(220, 430)
mydata_wide <- data.frame(IDno, Sex,
fu1_day, fu1_cpass, fu1_is,
fu2_day, fu2_cpass, fu2_is)
mydata_wide
STEP 1: CONVERT TO LONG DATA (out1)
out1 <- mydata_wide %>%
select(IDno, contains("day")) %>%
pivot_longer(cols = c(fu1_day, fu2_day),
names_to = c('fu', '.value'),
names_sep="_")
out1
STEP 2: CREATE ANOTHER LONG DATA AND JOIN WITH out1
mydata_wide %>%
select(-contains('day')) %>%
pivot_longer(cols = -c(IDno, Sex),
names_to = c('fu', 'test'),
names_sep="_") %>%
left_join(out1)
The result looks like this
Following on the renaming request #67453183 I want to do the same for formats using the dictionary, because it won't bring together columns of distinct types.
I have a series of data sets and a dictionary to bring these together. But I'm struggling to figure out how to automate this. > Suppose this data and dictionary (actual one is much longer, thus I want to automate):
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("factor", "numeric")
)
I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.... I want to automate 'coalesce all columns with duplicate names'.
And to bring these together, the types need to be the same too. I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.
#ronakShah in the previous query proposed
pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
flatten_chr() -> val
mtcars_all <- list(mtcarsA,mtcarsB) %>%
map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))
Which works great in the previous example but not if the formats vary. Here it throws error:
Error: Can't combine ..1$cyl_true<double> and..2$cyl_true <factor<51fac>>.
This response to #56773354 offers a related solution if one has a complete list of types, but not for a type list by column name, as I have.
Desired output:
mtcars_all
# A tibble: 4 x 3
mpg_true cyl_true disp
<factor> <numeric> <dbl>
1 21 6 160
2 21 6 160
3 22.8 4 108
4 21.4 6 258
Something simpler:
library(magrittr) # %<>% is cool
library(dplyr)
# The renaming is easy:
renameA <- dic$nameA
renameB <- dic$nameB
names(renameA) <- dic$true_name
names(renameB) <- dic$true_name
mtcarsA %<>% rename(all_of(renameA))
mtcarsB %<>% rename(all_of(renameB))
# Formatting is a little harder:
formats <- dic$true_format
names(formats) <- dic$true_name
lapply(names(formats), function (x) {
# there's no nice programmatic way to do this, I think
coercer <- switch(formats[[x]],
factor = as.factor,
numeric = as.numeric,
warning("Unrecognized format")
)
mtcarsA[[x]] <<- coercer(mtcarsA[[x]])
mtcarsB[[x]] <<- coercer(mtcarsB[[x]])
})
mtcars_all <- bind_rows(mtcarsA, mtcarsB)
In the background you should be aware of how base R treated concatenating factors before 4.1.0, and how this'll change. Here it probably doesn't matter because bind_rows will use the vctrs package.
I took another approach than Ronak's to read the dictionary. It is more verbose but I find it a bit more readable. A benchmark would be interesting to see which one is faster ;-)
Unfortunately, it seems that you cannot blindly cast a variable to a factor so I switched to character instead. In practice, it should behave exactly like a factor and you can call as_factor() on the end object if this is very important to you. Another possibility would be to store a casting function name (such as as_factor()) in the dictionary, retrieve it using get() and use it instead of as().
library(tidyverse)
mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl"),
true_format = c("numeric", "character") #instead of factor
)
dic2 = dic %>%
pivot_longer(-c(true_name, true_format), names_to=NULL)
read_dic = function(key, dict=dic2){
x = dict[dict$value==key,][["true_name"]]
if(length(x)!=1) x=key
x
}
rename_from_dic = function(df, dict=dic2){
rename_with(df, ~{
map_chr(.x, ~read_dic(.x, dict))
})
}
cast_from_dic = function(df, dict=dic){
mutate(df, across(everything(), ~{
cl=dict[dict$true_name==cur_column(),][["true_format"]]
if(length(cl)!=1) cl=class(.x)
as(.x, cl, strict=FALSE)
}))
}
list(mtcarsA,mtcarsB) %>%
map(rename_from_dic) %>%
map_df(cast_from_dic)
#> # A tibble: 4 x 3
#> mpg_true cyl_true disp
#> <dbl> <chr> <dbl>
#> 1 21 6 160
#> 2 21 6 160
#> 3 22.8 4 108
#> 4 21.4 6 258
Created on 2021-05-09 by the reprex package (v2.0.0)
I'm trying to create a simple code that I can reuse over and over (with minimal adjustments) to be able to print a table of summary statistics.
A reproducible example creates a table with M and SD for the variable V1 broken down by group:
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
library(dplyr)
descriptives <- data %>% group_by(Group) %>%
summarize(
Mean = mean(V2)
, SD = sd(V2)
)
descriptives
I'd like to modify this function so that it will compute M and SD for all variables in my dataset.
I'd like to be able to replace the call to V1 with something like vars which is just a list of all the variables in my dataset; in this example, V1 and V2. But usually I have like 100 variables.
The reason I'd like it to work this way is so that I can do something very easy like:
vars <- names(data[3:4])
and very quickly select the columns for which I want summary statistics.
A few things for my wishlist:
M and SD for a given variable should be next to eachother and I'd like to add a column above each pair with the variable name.
I'd like the end product to look something like
I'd like to use dplyr, but I'm open to other options.
I'd also like to learn how I could switch the rows and columns of the table so that the variables are on separate rows and each group has a column (or two columns, one for M and one for SD). Like this:
Close, but no cigar:
The newish summarise(across()) kind of helps:
dplyr::group_by(df, Group) %>%
dplyr::summarise(dplyr::across(.cols = c(V1, V2), .fns = c(mean, sd)))
But I don't know how to scale it without making multiple table and using rbind() to stack them.
I really like the format of table1() (vignette), but from what I can tell I can only stratify the column M/SDs by another variable. I really wish I could just add additional grouping variables on.
There is a limitation in the ordering, but if we use select, then can reorder on the substring on the column names
library(dplyr)
library(stringr)
data %>%
group_by(Group) %>%
summarise_at(vars(vars), list(Mean = mean, SD = sd)) %>%
select(Group, order(str_remove(names(.)[-1], "_.*")) + 1)
# A tibble: 2 x 5
# Group V1_Mean V1_SD V2_Mean V2_SD
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 0.165 0.915 0.146 1.16
#2 2 0.308 1.31 -0.00711 0.854
I had a similar question here, and got some really useful and simple answers using tidyverse. In the end a really robust approach was made, which I wrapped in a function and use regularly.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
}
It takes three arguments, data, variables and the grouping_var - all of which are rather self explanatory.
Here is a test using mtcars with a 2 level and 3 level grouping var.
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "am"
)
# A tibble: 2 x 3
variable `0` `1`
<chr> <chr> <chr>
1 mpg 17.15(±3.83) 24.39(±6.17)
2 hp 160.26(±53.91) 126.85(±84.06)
baseline_table(
data = mtcars,
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4` `6` `8`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
It works out of the box, and are applicable to all data, below I used iris,
baseline_table(
data = iris,
variables = c("Sepal.Length", "Sepal.Width"),
grouping_var = "Species"
)
# A tibble: 2 x 4
variable setosa versicolor virginica
<chr> <chr> <chr> <chr>
1 Sepal.Length 5.01(±0.35) 5.94(±0.52) 6.59(±0.64)
2 Sepal.Width 3.43(±0.38) 2.77(±0.31) 2.97(±0.32)
Of course; some grouping variables are not directly suited for this. Namely cyl but it does serve as a good example though. but you can recode your grouping variables accordingly,
baseline_table(
data = mtcars %>% mutate(cyl = paste(cyl, "Cylinders", sep = " ")),
variables = c("mpg", "hp"),
grouping_var = "cyl"
)
# A tibble: 2 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 mpg 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
2 hp 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
You can also modify the function to include descriptive strings, about the values,
baseline_table <- function(data, variables, grouping_var) {
# Generate the table;
tmpTable <- data %>%
group_by(!!sym(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% pivot_longer(
cols = -grouping_var,
names_to = "variable"
) %>% pivot_wider(
names_from = grouping_var
)
# Generate Descriptives dynamically
tmpDesc <- tmpTable[1,] %>% mutate(
across(.fns = ~ paste("Mean (±SD)"))
) %>% mutate(
variable = ""
)
bind_rows(
tmpDesc,
tmpTable
)
}
Granted, this extension is a bit awkward - but it is nonetheless still robust. The output is,
# A tibble: 3 x 4
variable `4 Cylinders` `6 Cylinders` `8 Cylinders`
<chr> <chr> <chr> <chr>
1 "" Mean (±SD) Mean (±SD) Mean (±SD)
2 "mpg" 26.66(±4.51) 19.74(±1.45) 15.1(±2.56)
3 "hp" 82.64(±20.93) 122.29(±24.26) 209.21(±50.98)
Update: Ive rewritten the function for added flexibility as noted in the comments.
library(tidyverse)
baseline_table <- function(data, variables, grouping_var) {
data %>%
group_by(!!!syms(grouping_var)) %>%
summarise(
across(
all_of(variables),
~ paste0(mean(.) %>% round(2), "(±", sd(.) %>% round(2), ")")
)
) %>% unite(
"grouping",
all_of(grouping_var)
) %>% pivot_longer(
cols = -"grouping",
names_to = "variables"
) %>% pivot_wider(
names_from = "grouping"
)
}
It works in the same way, and outputs the same, unless there is more than one grouping_var,
baseline_table(
mtcars,
variables = c("hp", "mpg"),
grouping_var = c("am", "cyl")
)
# A tibble: 2 x 7
variables `0_4` `0_6` `0_8` `1_4` `1_6` `1_8`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 hp 84.67(±19.66) 115.25(±9.18) 194.17(±33.36) 81.88(±22.66) 131.67(±37.53) 299.5(±50.2)
2 mpg 22.9(±1.45) 19.12(±1.63) 15.05(±2.77) 28.08(±4.48) 20.57(±0.75) 15.4(±0.57)
In the updated function I used unite with a default seperator. Clearly, you can modify this to suit your needs such that the colnames says, for example, 4 Cylinder (Automatic) 6 Cylinder (Automatic) etc.
Slight variation of your original code, you could use across() more simply/flexibly if you specify you don't want the ID (or the already-grouped Group) column, but rather everything else:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}"))
# A tibble: 2 x 5
Group V1_Mean V1_SD V2_Mean V2_SD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0167 0.979 0.145 1.02
2 2 0.119 1.11 -0.277 1.05
EDIT:
If you want to create your (first) goal exactly, you can use the gt package to make an html table with column spanners:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = list(Mean = mean, SD = sd), .names = "{.col}_{.fn}")) %>%
gt::gt() %>%
gt::tab_spanner_delim("_") %>%
gt::fmt_number(-Group, decimals = 2)
As to your other question, you could alternately do something like this to get the combined & transposed variation:
data %>%
group_by(Group) %>%
summarize(across(-ID, .fns = ~paste0(
sprintf("%.2f", mean(.x)),
sprintf(" (%.2f)", sd(.x))))) %>%
t() %>%
as.data.frame()
V1 V2
Group 1 2
V1 -0.02 (0.98) 0.12 (1.11)
V2 0.15 (1.02) -0.28 (1.05)
Outside dplyr, you could use the tables package which allows to create summary statistics out of a table formula:
library(tables)
vars <- c("V1","V2")
vars <- paste(vars, collapse="+")
table <- as.formula(paste("(group = factor(Group)) ~ (", vars ,")*(mean+sd)"))
table
# (group = factor(Group)) ~ (V1 + V2) * (mean + sd)
tables::tabular(table, data = data)
# V1 V2
# group mean sd mean sd
# 1 -0.15759 0.9771 0.1405 1.0697
# 2 0.05084 0.9039 -0.1470 0.9949
One way to make a nice summary table is to use a package called gtsummary (note I am a co-author on this package just as an FYI). Below I am just formatting the data a little bit in data2 and dropping the ID variable. Then it is a two line call to gtsummary to summarize your data. The by statement is what stratifies the table, and in the statistics input I am simply telling to show the mean and sd, by default gtsummary will show median q1-q3. This table can be rendered in all markdown options (word, pdf, html).
library(dplyr)
library(gtsummary)
data <- as.data.frame(cbind(1:100, sample(1:2), rnorm(100), rnorm(100)))
names(data) <- c("ID", "Group", "V1", "V2")
data2 <- data %>%
mutate(Group = ifelse(Group == 1, "Group Var1","Group Var2")) %>%
select(-ID)
tbl_summary(data2, by = Group,
statistic = all_continuous()~ "{mean} ({sd})")
If you want more than one strata but do not want to use tbl_strata you can combine two variables into one column and use that in the by statement. You can unite() as many variables as you want (although maybe not reccomended)
trial %>%
tidyr::unite(col = "trt_grade", trt, grade, sep = ", ") %>%
select(age, marker,stage,trt_grade) %>%
tbl_summary(by = c(trt_grade))
A data.table option
dcast(
setDT(data)[,
c(
.(Meas = c("M", "Sd")),
lapply(.SD, function(x) c(mean(x), sd(x)))
),
Group,
.SDcols = patterns("V\\d")
], Group ~ Meas,
value.var = c("V1", "V2")
)
gives
Group V1_M V1_Sd V2_M V2_Sd
1: 1 -0.2392583 1.097343 -0.08048455 0.7851212
2: 2 0.1059716 1.011769 -0.23356373 0.9927975
You can also use base R:
# using do.call to make the result a data.frame
do.call(
data.frame
# here you aggregate for all the functions you need
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x))))
)
This leads to something like this:
Group V1.mn V1.sd V2.mn V2.sd
1 1 0.1239868 1.008214 0.07215481 1.026059
2 2 -0.2324611 1.048230 0.11348897 1.071467
If you want a fancier table, kableExtra could really help. Note, the %>% should be imported also in kableExtra, but in case, from R 4.1 you can use |> instead of it:
library(kableExtra)
# data manipulation as above, note the [,-1] to remove the Group column
do.call(
data.frame
,(aggregate(. ~ Group, data = data[,-1], FUN = function(x) c(mn = mean(x), sd = sd(x)))))[,-1] %>%
# here you define as a kable, and give the names you want to columns
kbl(col.names = rep(c('mean','sd'),2) ) %>%
# some formatting
kable_paper() %>%
# adding the first header
add_header_above(c( "Group 1" = 2, "Group 2" = 2)) %>%
# another header if you need it
add_header_above(c( "Big group" = 4))
And you can find much more to make great tables.
In case, you can also try something like this:
do.call(data.frame,
aggregate(. ~ Group, data = data[,-1], FUN = function(x) paste0(round(mean(x),2),' (', round(sd(x),2),')'))
) %>%
kbl() %>%
kable_paper()
That leads to:
The table contain My Name, My Age , My Height. I would like to store in a list to get the summary of the height instead of using test = My Height
list<- c("`My Name`" , "`My Age`" , "`My Height`" )
table%>%
group_by(`My Name` ,`My Age`,`My Height` ) %>%
summarize(test = mean(list[3], na.rm = TRUE))
Try summarize_at :
library(dplyr)
lst<- c("`My Name`" , "`My Age`" , "`My Height`" )
table%>%
group_by(`My Name` ,`My Age`,`My Height` ) %>%
summarize_at(vars(lst[3]), mean, na.rm = TRUE)
Or using non-standard evaluation :
table %>%
group_by(`My Name` ,`My Age`,`My Height` ) %>%
summarise(test = mean(!!sym(lst[2]), na.rm = TRUE))
Using reproducble example form mtcars
lst <- c('cyl', 'hp', 'am')
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(list[2]), mean, na.rm =TRUE)
# A tibble: 3 x 2
# cyl test
# <dbl> <dbl>
#1 4 82.6
#2 6 122.
#3 8 209.
Recently I stumbled uppon a strange behaviour of dplyr and I would be happy if somebody would provide some insights.
Assuming I have a data of which com columns contain some numerical values. In an easy scenario I would like to compute rowSums. Although there are many ways to do it, here are two examples:
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
# works
dplyr::select(df, - ids) %>% {rowSums(.)}
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(df, - ids) %>% {rowSums(.)})
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = dplyr::select(., - ids) %>% {rowSums(.)})
# workaround:
tmp <- dplyr::select(df, - ids) %>% {rowSums(.)}
df %>%
dplyr::mutate(blubb = tmp)
# works
rowSums(dplyr::select(df, - ids))
# does not work
# Error: invalid argument to unary operator
df %>%
dplyr::mutate(blubb = rowSums(dplyr::select(df, - ids)))
# workaround
tmp <- rowSums(dplyr::select(df, - ids))
df %>%
dplyr::mutate(blubb = tmp)
First, I don't really understand what is causing the error and second I would like to know how to actually achieve a tidy computation of some (viable) columns in a tidy way.
edit
The question mutate and rowSums exclude columns , although related, focuses on using rowSums for computation. Here I'm eager to understand why the upper examples do not work. It is not so much about how to solve (see the workarounds) but to understand what happens when the naive approach is applied.
The examples do not work because you are nesting select in mutate and using bare variable names. In this case, select is trying to do something like
> -df$ids
Error in -df$ids : invalid argument to unary operator
which fails because you can't negate a character string (i.e. -"i1" or -"i2" makes no sense). Either of the formulations below works:
df %>% mutate(blubb = rowSums(select_(., "X1", "X2")))
df %>% mutate(blubb = rowSums(select(., -3)))
or
df %>% mutate(blubb = rowSums(select_(., "-ids")))
as suggested by #Haboryme.
select_ is deprecated. You can use:
library(dplyr)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>%
mutate(blubb = rowSums(select(., .dots = c("X1", "X2"))))
# Or more generally:
desired_columns <- c("X1", "X2")
df %>%
mutate(blubb = rowSums(select(., .dots = all_of(desired_columns))))
select can now accept bare column names so no need to use .dots or select_ which has been deprecated.
Here are few of the approaches that can work now.
library(dplyr)
#sum all the columns except `id`.
df %>% mutate(blubb = rowSums(select(., -ids), na.rm = TRUE))
#sum X1 and X2 columns
df %>% mutate(blubb = rowSums(select(., X1, X2), na.rm = TRUE))
#sum all the columns that start with 'X'
df %>% mutate(blubb = rowSums(select(., starts_with('X')), na.rm = TRUE))
#sum all the numeric columns
df %>% mutate(blubb = rowSums(select(., where(is.numeric))))
Adding to this old thread because I searched on this question then realized I was asking the wrong question. Also, I detect some yearning in this and related questions for the proper pipe steps way to do this.
The answers here are somewhat non-intuitive because they are trying to use the dplyr vernacular with non-"tidy" data. IF you want to do it the dplyr way, make the data tidy first, using gather(), and then use summarise()
library(tidyverse)
df <- data.frame(matrix(rnorm(20), 10, 2),
ids = paste("i", 1:20, sep = ""),
stringsAsFactors = FALSE)
df %>% gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
#> # A tibble: 20 x 2
#> ids rowsum
#> <chr> <dbl>
#> 1 i1 0.942
#> 2 i10 -0.330
#> 3 i11 0.942
#> 4 i12 -0.721
#> 5 i13 2.50
#> 6 i14 -0.611
#> 7 i15 -0.799
#> 8 i16 1.84
#> 9 i17 -0.629
#> 10 i18 -1.39
#> 11 i19 1.44
#> 12 i2 -0.721
#> 13 i20 -0.330
#> 14 i3 2.50
#> 15 i4 -0.611
#> 16 i5 -0.799
#> 17 i6 1.84
#> 18 i7 -0.629
#> 19 i8 -1.39
#> 20 i9 1.44
If you care about the order of the ids when they are not sortable using arrange(), make that column a factor first.
df %>%
mutate(ids=as_factor(ids)) %>%
gather(key=Xn,value="value",-ids) %>%
group_by(ids) %>%
summarise(rowsum=sum(value))
Why do you want to use the pipe operator? Just write an expression such as:
rowSums(df[,sapply(df, is.numeric)])
i.e. calculate the rowsums on all the numeric columns, with the advantage of not needing to specify ids.
If you want to save your results as a column within data, you can use data.table syntax like this:
dt <- as.data.table(df)
dt[, x3 := rowSums(.SD, na.rm=T), .SDcols = which(sapply(dt, is.numeric))]