How to use dplyr::mutate_all for rounding selected columns - r

I'm using the following package version
# devtools::install_github("hadley/dplyr")
> packageVersion("dplyr")
[1] ‘0.5.0.9001’
With the following tibble:
library(dplyr)
df <- structure(list(gene_symbol = structure(1:6, .Label = c("0610005C13Rik",
"0610007P14Rik", "0610009B22Rik", "0610009L18Rik", "0610009O20Rik",
"0610010B08Rik"), class = "factor"), fold_change = c(1.54037,
1.10976, 0.785, 0.79852, 0.91615, 0.87931), pvalue = c(0.5312,
0.00033, 0, 0.00011, 0.00387, 0.01455), ctr.mean_exp = c(0.00583,
59.67286, 83.2847, 6.88321, 14.67696, 1.10363), tre.mean_exp = c(0.00899,
66.22232, 65.37819, 5.49638, 13.4463, 0.97043), ctr.cv = c(5.49291,
0.20263, 0.17445, 0.46288, 0.2543, 0.39564), tre.cv = c(6.06505,
0.28827, 0.33958, 0.53295, 0.26679, 0.52364)), .Names = c("gene_symbol",
"fold_change", "pvalue", "ctr.mean_exp", "tre.mean_exp", "ctr.cv",
"tre.cv"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
That looks like this:
> df
# A tibble: 6 × 7
gene_symbol fold_change pvalue ctr.mean_exp tre.mean_exp ctr.cv tre.cv
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0610005C13Rik 1.54037 0.53120 0.00583 0.00899 5.49291 6.06505
2 0610007P14Rik 1.10976 0.00033 59.67286 66.22232 0.20263 0.28827
3 0610009B22Rik 0.78500 0.00000 83.28470 65.37819 0.17445 0.33958
4 0610009L18Rik 0.79852 0.00011 6.88321 5.49638 0.46288 0.53295
5 0610009O20Rik 0.91615 0.00387 14.67696 13.44630 0.25430 0.26679
6 0610010B08Rik 0.87931 0.01455 1.10363 0.97043 0.39564 0.52364
I'd like to round the floats (2nd column onward) to 3 digits. What's the way to do it with dplyr::mutate_all()?
I tried this:
cols <- names(df)[2:7]
# df <- df %>% mutate_each_(funs(round(.,3)), cols)
# Warning message:
#'mutate_each_' is deprecated.
# Use 'mutate_all' instead.
# See help("Deprecated")
df <- df %>% mutate_all(funs(round(.,3)), cols)
But get the following error:
Error in mutate_impl(.data, dots) :
3 arguments passed to 'round'which requires 1 or 2 arguments

While the new across() function is slightly more verbose than the previous mutate_if variant, the dplyr 1.0.0 updates make the tidyverse language and code more consistent and versatile.
This is how to round specified columns:
df %>% mutate(across(2:7, round, 3)) # columns 2-7 by position
df %>% mutate(across(all_of(cols), round, 3)) # columns named in the character vector cols
This is how to round all numeric columns to 3 decimal places:
df %>% mutate(across(where(is.numeric), round, 3))
This is how to round all columns, but it won't work in this case because gene_symbol is not numeric:
df %>% mutate(across(everything(), round, 3))
In place of where(is.numeric), you can pass other column specifications to across(), such as -1 or -gene_symbol to exclude the first column. See help(tidyselect) for even more options.
Update for dplyr 1.0.0
The across() function replaces the _if/_all/_at/_each variants of dplyr verbs. https://dplyr.tidyverse.org/dev/articles/colwise.html#how-do-you-convert-existing-code
Since some columns are not numeric, you could use mutate_if, with the added benefit of rounding a column if and only if it is numeric:
df %>% mutate_if(is.numeric, round, 3)
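A note for newer installations: as of dplyr 1.1.0, passing extra arguments such as the 3 through across()'s ... is deprecated in favour of an explicit function. The original error most likely arose because mutate_all() forwards its extra arguments to the function, so cols ended up as a third argument to round(). The sketch below is a minimal illustration under those assumptions; the two-row tibble is a cut-down stand-in for the question's data.

```r
library(dplyr)

# Cut-down stand-in for the question's tibble (first two rows, three columns)
df <- tibble(
  gene_symbol = factor(c("0610005C13Rik", "0610007P14Rik")),
  fold_change = c(1.54037, 1.10976),
  pvalue      = c(0.5312, 0.00033)
)

# Round every numeric column; the factor column is left untouched.
# The digits are supplied inside the function instead of through `...`.
rounded <- df %>%
  mutate(across(where(is.numeric), function(x) round(x, digits = 3)))

rounded
```

The same pattern works with all_of(cols) or 2:7 in place of where(is.numeric).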

packageVersion("dplyr")
[1] '0.7.6'
Try
df %>% mutate_at(2:7, funs(round(., 3)))
It works!!

Related

Assigning new field values based on ifelse logic with lag/lead function in R

Have seen several posts on this, but can't seem to get it to work for my specific use case.
I'm trying to assign a new field value based on ifelse logic. My input dataset looks like:
If the value for X is missing, I am trying to replace it with the previous value of X, only when the value of unique_id is the same as the previous value of unique_id. I would like the output dataset to look like this:
The code I've written (I'm a total beginner) doesn't throw an error, but the data doesn't change:
within(data3, data3$Output <- ifelse(data3$unique_id == lag(data3$unique_id) & is.na(data3$Output), data3$Output == lag(data3$Output), data3$Output == data3$Output))
I do change missing data values ("-") in the input dataset to official NA missing values in a previous step... hopefully allowing me to use the is.na function.
data.table option where you replace the NA with the non-NA value per group:
df <- data.frame(unique_id = c("m", "m"),
X = c(73500, NA),
MoM = c("4%", "0%"))
library(data.table)
setDT(df)
df[, X := X[!is.na(X)][1L], by = unique_id]
df
#> unique_id X MoM
#> 1: m 73500 4%
#> 2: m 73500 0%
Created on 2022-07-09 by the reprex package (v2.0.1)
In addition to the provided solutions: One of these:
fill()
suggested by @jared_marot in the comments
library(dplyr)
library(tidyr)
df %>%
fill(X)
first()
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(X = first(X))
lag()
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(X = lag(X, default = X[1]))
base R
df[2,2] <- df[1,2]
You could group the IDs, then use fill to copy down the values replacing NAs by group. See the reproducible example below.
(If you have NAs which could appear before or after the value, then you could add .direction = "downup" to the fill() call.)
library(tidyverse)
# Sample data
df <- tribble(
~unique_id, ~x, ~mom,
"m", 73500, 4,
"m", NA, 0,
"z", 4000, 5,
"z", NA, 0,
)
df2 <- df |>
group_by(unique_id) |>
fill(x, .direction = "downup") |>
ungroup()
#> # A tibble: 4 × 3
#> unique_id x mom
#> <chr> <dbl> <dbl>
#> 1 m 73500 4
#> 2 m 73500 0
#> 3 z 4000 5
#> 4 z 4000 0
Created on 2022-07-09 by the reprex package (v2.0.1)
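For readers who want to stay close to the ifelse/lag logic of the original attempt, a minimal sketch follows. It assumes the column is called X as in the question, and the four-row data3 is made-up sample data. The original code compared values with == where an assignment was intended, which is probably why nothing changed; grouping by unique_id replaces the explicit ID comparison.

```r
library(dplyr)

# Made-up sample data in the shape described in the question
data3 <- tibble(
  unique_id = c("m", "m", "z", "z"),
  X         = c(73500, NA, 4000, NA)
)

# Within each unique_id, replace a missing X with the previous row's X.
# lag() only looks one row back, so runs of several NAs are better
# handled with tidyr::fill().
filled <- data3 %>%
  group_by(unique_id) %>%
  mutate(X = if_else(is.na(X), lag(X), X)) %>%
  ungroup()

filled
```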

Find the average of 3 minimum prices of numeric column

How can I find the average of the 3 minimum prices of a numeric column (Country_1)? Imagine that I have thousands of values.
d<-structure(list(Subarea = c("SA_1", "SA_2", "SA_3", "SA_4", "SA_5",
"SA_6", "SA_7", "SA_8", "SA_10", "SA_9"), Country_1 = c(101.37519256645,
105.268942332558, 100.49933368058, 104.531597221684, NA, 83.4404308144341,
86.2833044714836, 81.808967345926, 79.6786979951661, 77.6863475527052
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))
Sort your vector by ascending value, take the 3 first values, and compute the mean.
mean(head(sort(d$Country_1), 3))
# [1] 79.72467
Use sapply or dplyr::across if you want to do that to multiple columns:
sapply(d[, your_columns], \(x) mean(head(sort(x), 3)))
# or
library(dplyr)
d %>%
mutate(across(all_of(your_columns), ~ mean(head(sort(.x), 3))))
If you only care about the 3 smallest values and the amount of data is large, using sort() with partial = 1:3 is more efficient.
mean(sort(sample(d$Country_1), partial = 1:3)[1:3])
An option with slice_min
library(dplyr)
d %>%
slice_min(n = 3, order_by = Country_1) %>%
summarise(Mean = mean(Country_1))
# A tibble: 1 × 1
Mean
<dbl>
1 79.7
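To get one value per column instead of a repeated value per row, the same idea can be wrapped in a small function and used with summarise(). This is only a sketch: mean_of_smallest is a hypothetical helper name, and the data below (including the Country_2 column) is made up for illustration.

```r
library(dplyr)

# Hypothetical helper: mean of the n smallest non-missing values.
# sort() drops NA by default, so missing prices are ignored.
mean_of_smallest <- function(x, n = 3) {
  mean(head(sort(x), n))
}

# Made-up data; Country_2 is not part of the original question
d <- tibble(
  Subarea   = c("SA_1", "SA_2", "SA_3", "SA_4", "SA_5"),
  Country_1 = c(101.4, 105.3, NA, 83.4, 86.3),
  Country_2 = c(10, 20, 30, 40, 50)
)

# One row, one value per numeric column
result <- d %>%
  summarise(across(where(is.numeric), mean_of_smallest))

result
```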

pivot_wider() arguments imply differing number of rows Error

I have the following data.table object called x:
month.option som.month
all.year 56.6%
diff -0.9%
and when I perform the following operation :
x %>% pivot_wider(names_from = month.option, values_from = som.month) %>%
select(diff, everything()) %>%
set_names(c("Dif vs MA", "SOM YTD", "SOM AA"))
I get the following error: Error in data.frame(row = row_id, col = col_id) : arguments imply differing number of rows: 0, 2. However I don't understand the reason since x is a 2x2 data.table. If anyone knows a possible issue that I am not seeing I will appreciate the correction.
As a side note, all the columns are of type character, if that is any useful info
If we want to use pivot_wider, we could do this without creating a new column by specifying the values_fn as I
library(dplyr)
library(tidyr)
x %>%
pivot_wider(names_from = month.option, values_from = som.month, values_fn = I)
# A tibble: 1 x 2
# all.year diff
# <I<chr>> <I<chr>>
#1 56.6% -0.9%
Or it can be also a function to get the first element
x %>%
pivot_wider(names_from = month.option,
values_from = som.month, values_fn = first)
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
However, this kind of problem can be easily tackled with transpose from data.table
data.table::transpose(x, make.names = 'month.option')
# all.year diff
#1 56.6% -0.9%
Or use deframe with as_tibble_row which would be more direct
library(tibble)
deframe(x) %>%
as_tibble_row
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
Or another option is to convert the first column to rownames, do the transpose with t and convert to tibble (or data.frame)
x %>%
column_to_rownames('month.option') %>%
t %>%
as_tibble
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
data
x <- structure(list(month.option = c("all.year", "diff"), som.month = c("56.6%",
"-0.9%")), class = "data.frame", row.names = c(NA, -2L))
Try this tidyverse solution with same pivot_wider(). You are having issues because the function can not identify the rows properly. Creating an id is the solution:
#Code
df %>% mutate(id=1) %>%
pivot_wider(names_from = month.option,values_from=som.month) %>%
select(-1)
Output:
# A tibble: 1 x 2
all.year diff
<chr> <chr>
1 56.6% -0.9%
Some data used:
#Data
df <- structure(list(month.option = c("all.year", "diff"), som.month = c("56.6%",
"-0.9%")), class = "data.frame", row.names = c(NA, -2L))
If you have data.table we can also use dcast :
library(data.table)
dcast(x, rowid(month.option)~month.option, value.var = 'som.month')
# month.option all.year diff
#1: 1 56.6% -0.9%

Merge 3 columns based on unique values?

I am trying to merge 3 columns into a single one. The column values are separated by ";", and the new column needs to split the values from all 3 columns and keep only the unique ones. I know how to merge the columns, but I am struggling to split each row's values across the 3 columns, find the unique values, and put them in another column.
Here is the dummy data
n = c(2, 3, 5,10)
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn")
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA")
t = c("aa;bb;cc", "bb;dd", "kk","NA")
df = data.frame(n, s, b,t)
> df
n s b t
1 2 aa;bb;cc aa;bb;cc aa;bb;cc
2 3 bb;dd;aa bb;dd;cc bb;dd
3 5 NA zz;bb;yy kk
4 10 xx;nn NA NA
The expected output is
> df
n finalcol
1 2 aa;bb;cc
2 3 bb;dd;aa;cc
3 5 zz;bb;yy;kk
4 10 xx;nn
What I have so far performs a simple merge:
dff = df %>% unite(finalcol, c(s,b,t), sep = ";", remove = TRUE)
Since you mentioned unite, I want to show a solution using separate, the complement of unite.
This solution keeps it within the tidyverse, which makes it easy to understand what's going on step-by-step. @d.b's answer in the comments works perfectly, is compact, and probably runs faster, but has a steeper learning curve. With a piped tidyverse solution, you can run each line and see what's going on.
This solution first separates the terms, then converts the data from wide to long data format with gather, so that we can do operations such as check for and handle NAs and "NA"s, drop_na, and then distinct, to get unique values only (per group with the same "id" i.e. items from the same original line). Then, it uses summarise and paste to go back to the original format, but could also use spread then unite. (Note that na.rm=TRUE is an upcoming feature of unite https://github.com/tidyverse/tidyr/issues/203)
Sources: I used these handy dplyr and tidyr reference sheets:
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf and I also worked out the solution based on the comments, questions, and answers here: How do I remove NAs with the tidyr::unite function?
# Load packages and data
library(tidyverse)
df = data.frame(n = c(2, 3, 5,10),
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn"),
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA"),
t = c("aa;bb;cc", "bb;dd", "kk", NA))
# Solution
dff <- df %>%
separate(col = "s", into = c("s1", "s2", "s3")) %>%
separate(col = "b", into = c("b1", "b2", "b3")) %>%
separate(col = "t", into = c("t1", "t2", "t3")) %>% # Solution here could be enhanced to take in n columns and put them into however many columns as needed, using map or apply.
rowid_to_column('id') %>%
gather(key, value, -(id:n)) %>%
mutate_at(vars(value), na_if, "NA") %>%
drop_na(value) %>%
group_by(id) %>%
distinct(value, .keep_all = TRUE) %>%
summarise(n = first(n), finalcol = paste(value, collapse = ';')) %>%
ungroup() %>%
select(-id)
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3,
#> 4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [2,
#> 3].
dff
#> # A tibble: 4 x 2
#> n finalcol
#> <dbl> <chr>
#> 1 2 aa;bb;cc
#> 2 3 bb;dd;aa;cc
#> 3 5 zz;bb;yy;kk
#> 4 10 xx;nn
Created on 2019-03-26 by the reprex package (v0.2.1)
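For comparison, a compact base R sketch of the same split, de-duplicate, and re-join logic follows. It is not necessarily the approach from the comments mentioned above; merge_unique is a hypothetical helper name, and both real NA values and the literal string "NA" are treated as missing.

```r
df <- data.frame(
  n = c(2, 3, 5, 10),
  s = c("aa;bb;cc", "bb;dd;aa", "NA", "xx;nn"),
  b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy", "NA"),
  t = c("aa;bb;cc", "bb;dd", "kk", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string on sep, drop missing markers,
# keep the first occurrence of each value, and paste back together.
merge_unique <- function(..., sep = ";") {
  pieces <- unlist(strsplit(c(...), sep, fixed = TRUE))
  pieces <- pieces[!is.na(pieces) & pieces != "NA"]
  paste(unique(pieces), collapse = sep)
}

# Apply the helper row by row across the three columns
df$finalcol <- mapply(merge_unique, df$s, df$b, df$t, USE.NAMES = FALSE)
result <- df[, c("n", "finalcol")]

result
```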

Extract class of each field in data.frame; summarize classes in new data.frame

I have a number of very similar .csv's that I want to check through programmatically to determine if their column types are the same.
Say I've imported a .csv as a data.frame and I want to check the column classes:
library(tidyverse)
test <- structure(list(Date = "6/15/2018", Time = structure(44255, class = c("hms",
"difftime"), units = "secs")), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"))
test
## A tibble: 1 x 2
# Date Time
# <chr> <time>
#1 6/15/2018 12:17
Checking the class of each column, I can see that the Time column has two classes:
map(test, class)
# $`Date`
# [1] "character"
# $Time
# [1] "hms" "difftime"
What I want is a data.frame that ideally would show:
Date Time
character hms, difftime
So that I can easily compare among different csvs.
I thought map_dfr or map_dfc might work but they return errors.
I also tried the following, but I haven't used summarize_all before and I can't get it to work:
test %>% data.frame() %>%
summarize_all(funs(paste0(collapse = ", ")))
You are very close. What you are missing is that funs() asks you to mark where the column vector goes in the function call(s) with a dot (.). So it would be:
test %>%
summarize_all(funs(paste0(class(.), collapse = ", ")))
However, funs() is soft-deprecated and throws a warning as of dplyr 0.8.0. Instead, you can use the formula notation like this:
library(tidyverse)
test <- structure(list(Date = "6/15/2018", Time = structure(44255, class = c("hms", "difftime"), units = "secs")), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
test %>%
summarise_all(~ class(.) %>% str_c(collapse = ", "))
#> # A tibble: 1 x 2
#> Date Time
#> <chr> <chr>
#> 1 character hms, difftime
If you want to try a purrr-style syntax, here's one way to get it in long format with imap_dfr in one line. The function returns a one-row tibble for each column, and _dfr binds them into a data frame. (You could also have used gather to reshape the wide-format version.)
test %>%
imap_dfr(~ tibble(colname = .y, classes = class(.x) %>% str_c(collapse = ", ")))
#> # A tibble: 2 x 2
#> colname classes
#> <chr> <chr>
#> 1 Date character
#> 2 Time hms, difftime
Created on 2019-02-26 by the reprex package (v0.2.1)
You can use
lapply(test, function(x) paste0(class(x), collapse = ', ')) %>% data.frame()
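Since the stated goal is to compare several files, the one-row summary can be wrapped in a function and the results compared with identical(). This is a minimal base R sketch: class_summary is a hypothetical helper name, and the two small data frames stand in for imported .csv files.

```r
# Hypothetical helper: one-row data.frame of collapsed class names per column
class_summary <- function(df) {
  as.data.frame(
    lapply(df, function(x) paste(class(x), collapse = ", ")),
    stringsAsFactors = FALSE
  )
}

# Stand-ins for two imported files with the same column names
a <- data.frame(Date = "6/15/2018", Value = 1.5, stringsAsFactors = FALSE)
b <- data.frame(Date = as.Date("2018-06-15"), Value = 2L)

summary_a <- class_summary(a)
summary_b <- class_summary(b)

# TRUE only when every column has the same class string
same_types <- identical(summary_a, summary_b)

# Names of the columns whose classes differ
mismatched <- names(summary_a)[unlist(summary_a) != unlist(summary_b)]
```

Here same_types is FALSE and mismatched lists both columns, because the second data frame stores a Date and an integer where the first has a character and a numeric.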

Resources