R script not rounding as intended when running in Terminal [duplicate] - r

I am having trouble getting the desired number of decimal places from summarise. Here is a simple example:
test2 <- data.frame(c("a","a","b","b"), c(245,246,247,248))
library(dplyr)
colnames(test2) <- c("V1","V2")
group_by(test2,V1) %>% summarise(mean(V2))
The dataframe is:
V1 V2
1 a 245
2 a 246
3 b 247
4 b 248
The output is:
V1 `mean(V2)`
<fctr> <dbl>
1 a 246
2 b 248
I would like it to give me the means including the decimal place (i.e. 245.5 and 247.5)

Because you are using dplyr tools, the result is a tibble, and a tibble prints numbers with 3 significant digits by default (see the option pillar.sigfig). Only the printed display is shortened; the stored values are unchanged. Significant digits are not the same as the number of digits after the decimal point. To get the latter, simply convert the result to a data.frame with as.data.frame.
Note that tibble's concept of significant digits is somewhat complicated: it does not set how many digits are shown after the decimal point, but the minimum number of digits needed to represent the number to a given accuracy (I think 99.9%, see discussion here).
This means the number of digits printed depends on the "size" of your number:
library(tibble)
packageVersion("tibble")
#> [1] '2.1.3'
packageVersion("pillar")
#> [1] '1.4.2'
tab <- tibble(x = c(0.1234, 1.1234, 10.1234, 100.1234, 1000.1234))
options(pillar.sigfig=3)
tab
#> # A tibble: 5 x 1
#> x
#> <dbl>
#> 1 0.123
#> 2 1.12
#> 3 10.1
#> 4 100.
#> 5 1000.
options(pillar.sigfig=4)
tab
#> # A tibble: 5 x 1
#> x
#> <dbl>
#> 1 0.1234
#> 2 1.123
#> 3 10.12
#> 4 100.1
#> 5 1000.
as.data.frame(tab)
#> x
#> 1 0.1234
#> 2 1.1234
#> 3 10.1234
#> 4 100.1234
#> 5 1000.1234
Created on 2019-08-21 by the reprex package (v0.3.0)
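To make the point above concrete for the data in the question, here is a minimal sketch. It assumes a named summary column, mean_V2, which is a name chosen here for illustration and not part of the original code. It shows that the means are stored with full precision and that only the tibble display shortens them.

```r
library(dplyr)

test2 <- data.frame(V1 = c("a", "a", "b", "b"), V2 = c(245, 246, 247, 248))

# mean_V2 is a hypothetical column name used only in this sketch
res <- test2 %>%
  group_by(V1) %>%
  summarise(mean_V2 = mean(V2))

# Pulling the column out of the tibble bypasses tibble's print method,
# so the stored values are shown without pillar's significant-digit rule
res$mean_V2

# Converting to a data.frame switches to base R's print method
as.data.frame(res)
```

If you prefer to keep the tibble, raising options(pillar.sigfig = 4), as in the example above, should be enough for values of this size.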

This is one solution:
test2 <- data.frame(c("a", "a", "b", "b"), c(245, 246, 247, 248))
library(dplyr)
colnames(test2) <- c("V1", "V2")
group_by(test2, V1) %>%
dplyr::summarise(mean(V2)) %>%
dplyr::mutate_if(is.numeric, format, nsmall = 1)
#> # A tibble: 2 x 2
#> V1 `mean(V2)`
#> <fct> <chr>
#> 1 a 245.5
#> 2 b 247.5
Created on 2018-01-20 by the reprex package (v0.1.1.9000).
EDIT:
If you want to keep it as numeric:
test2 <- data.frame(c("a", "a", "b", "b"), c(245, 246, 247, 248))
library(dplyr)
colnames(test2) <- c("V1", "V2")
group_by(test2, V1) %>%
dplyr::summarise(mean(V2)) %>%
as.data.frame(.) %>%
dplyr::mutate_if(is.numeric, round, 1)
Gives
V1 mean(V2)
1 a 245.5
2 b 247.5
And with another example (from @Matifou):
tab <- tibble(x = c(0.1234, 1.1234, 10.1234, 100.1234, 1000.1234))
tab %>%
as.data.frame(.) %>%
dplyr::mutate_if(is.numeric, round, 2)
Gives:
x
1 0.12
2 1.12
3 10.12
4 100.12
5 1000.12

I think the simplest solution is the following:
test2 <- data.frame(c("a","a","b","b"), c(245,246,247,248))
library(dplyr)
colnames(test2) <- c("V1","V2")
group_by(test2,V1) %>% summarise(`mean(V2)` = sprintf("%0.1f",mean(V2)))
# A tibble: 2 x 2
V1 `mean(V2)`
<fct> <chr>
1 a 245.5
2 b 247.5

Related

Continuing a sequence into NAs using dplyr

I am trying to figure out a dplyr specific way of continuing a sequence of numbers when there are NAs in that column.
For example I have this dataframe:
library(tibble)
dat <- tribble(
~x, ~group,
1, "A",
2, "A",
NA_real_, "A",
NA_real_, "A",
1, "B",
NA_real_, "B",
3, "B"
)
dat
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 NA A
#> 4 NA A
#> 5 1 B
#> 6 NA B
#> 7 3 B
I would like this one:
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 3 A
#> 4 4 A
#> 5 1 B
#> 6 2 B
#> 7 3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups: group [2]
#> x group n new_seq
#> <dbl> <chr> <int> <int>
#> 1 1 A 4 1
#> 2 2 A 4 2
#> 3 NA A 4 3
#> 4 NA A 4 4
#> 5 1 B 3 1
#> 6 NA B 3 2
#> 7 3 B 3 3
It's easier if you do it in one go. Your approach is not 'wrong'; it is just that seq_len needs a single integer and you are giving it a vector (the column n), so seq_len falls back to the first value and warns.
dat %>%
group_by(group) %>%
mutate(x = seq_len(n()))
Note that row_number might be even easier here:
dat %>%
group_by(group) %>%
mutate(x = row_number())
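The reason for the warning can be seen without dplyr. The following base R sketch uses a stand-in vector n, assumed here to mimic the column that mutate(n = n()) creates for group A:

```r
# Inside a group of 4 rows, the column n is the vector c(4, 4, 4, 4),
# not a single number
n <- c(4L, 4L, 4L, 4L)

# seq_len() wants one length, so pass it one value explicitly
seq_len(n[1])

# seq_along() builds the same sequence from the vector's own length
seq_along(n)
```

Both calls return 1, 2, 3, 4, which is why seq_len(n()) and row_number() work directly inside a grouped mutate.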
We could use rowid directly if the intention is to create a sequence and the group size is just an intermediate column:
library(data.table)
library(dplyr)
dat %>%
mutate(new_seq = rowid(group))
The issue with using the column n after it is created is that it is no longer a single value but a vector with one entry per row, as shown in @Maël's post. If we need to do that, wrap it in first, because seq_len is not vectorized. The extra column is not needed here in any case.
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(first(n)))
A base R option using ave (it works in a similar way to group_by in dplyr):
> transform(dat, x = ave(x, group, FUN = seq_along))
x group
1 1 A
2 2 A
3 3 A
4 4 A
5 1 B
6 2 B
7 3 B

R Regex capture to remove/keep columns with repeats in their column names

This is an example dataframe
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
I have column names in my dataframe in R in this format
\S+_T\d+|\S+_T\d+
The syntax is something like (Name)_ (T)(Number) | (Name)_ (T)(Number)
Step 1) I want to select columns which contain the same (T)(Number) on both sides of the "|"
I did this with some manual labor :
means_t0 <- means2 %>% select(matches("\\S+_T0\\|\\S+_T0")) %>% rownames_to_column("id_cp_interaction")
means_t1 <- means2 %>% select(matches("\\S+_T1\\|\\S+_T1")) %>% rownames_to_column("id_cp_interaction")
means_t5 <- means2 %>% select(matches("\\S+_T5\\|\\S+_T5")) %>% rownames_to_column("id_cp_interaction")
means3 <- full_join(means_t0, means_t1) %>% full_join(means_t5)
This gives me what I want and it was easy to do because I only had 3 types - T0, T1 and T5. What do I do if I had a huge number?
Step 2) From the output of Step1, I want to do a negation of the last question i.e. select only those columns with Names which are not the same
For example B_T0|B_T0 should be removed but B_T0|Fibro_T0 should be retained
Is there a way to regex capture the part in front of the pipe (|) and match it to the part after the pipe (|)?
Thank you
If you have that much information in your column names, I like to transform the data into the long format and then separate the info from the column name into several columns. Then it's easy to filter by these columns:
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
means2 <- cbind(data.frame(id_cp_interaction = 1:5), means2)
library(tidyr)
library(dplyr)
library(stringr)
res <- means2 %>%
pivot_longer(
cols = -id_cp_interaction,
names_to = "names",
values_to = "values"
) %>%
mutate(
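celltype_1 = str_extract(names, "^[^_]*"),
# the pipe must be escaped, otherwise (?=|) is an empty lookahead that always matches
timepoint_1 = str_extract(names, "[0-9]+(?=\\|)"),
celltype_2 = str_extract(names, "(?<=\\|)(.*?)(?=_)"),
# [0-9]+ also handles timepoints with more than one digit
timepoint_2 = str_extract(names, "[0-9]+$")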
)
head(res, n = 7)
#> # A tibble: 7 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|B… 1.68 B 0 B 0
#> 2 1 B_T0|B… 19.3 B 0 B 1
#> 3 1 B_T0|F… 10.6 B 0 Fibro 0
#> 4 1 B_T5|E… 12.5 B 5 Endo 5
#> 5 1 Macro_… 2.84 Macro 1 Fibro 1
#> 6 2 B_T0|B… 2.17 B 0 B 0
#> 7 2 B_T0|B… 10.1 B 0 B 1
# only keep interactions of different cell types
res %>%
filter(celltype_1 != celltype_2) %>%
head()
#> # A tibble: 6 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|F… 10.6 B 0 Fibro 0
#> 2 1 B_T5|E… 12.5 B 5 Endo 5
#> 3 1 Macro_… 2.84 Macro 1 Fibro 1
#> 4 2 B_T0|F… 1.47 B 0 Fibro 0
#> 5 2 B_T5|E… 11.3 B 5 Endo 5
#> 6 2 Macro_… 13.0 Macro 1 Fibro 1
Created on 2022-09-19 by the reprex package (v1.0.0)
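The question also asks whether the part before the pipe can be captured and matched against the part after it. A regex backreference can do that directly on the column names. The following is a minimal base R sketch, not taken from the answer above; the helper names same_time, same_name, step1 and step2 are chosen here for illustration, and it assumes every name follows the Name_T<number>|Name_T<number> pattern.

```r
nms <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")

# \\1 refers back to whatever the first capture group matched.
# Same timepoint number on both sides of the pipe:
same_time <- grepl("^[^|]+_T(\\d+)\\|[^|]+_T\\1$", nms, perl = TRUE)

# Same name on both sides of the pipe:
same_name <- grepl("^([^|]+)_T\\d+\\|\\1_T\\d+$", nms, perl = TRUE)

# Step 1: columns whose two timepoints agree
step1 <- nms[same_time]

# Step 2: of those, keep only columns whose two names differ
step2 <- nms[same_time & !same_name]
step2
```

The resulting names can then be used to subset the data frame, for example means2[, step2], without writing one pattern per timepoint.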

Match one dataframe based on a range in another dataframe in R tidyverse

I have two large datasets that I want to match with each other:
library(tidyverse)
df1 <- tibble(position=c(10,11,200,250,300))
df1
#> # A tibble: 5 × 1
#> position
#> <dbl>
#> 1 10
#> 2 11
#> 3 200
#> 4 250
#> 5 300
df2 <- tibble(start=c(1,10,200,251),
end=c(20,100,250,350),
name=c("geneA","geneB","geneC","geneD"))
df2
#> # A tibble: 4 × 3
#> start end name
#> <dbl> <dbl> <chr>
#> 1 1 20 geneA
#> 2 10 100 geneB
#> 3 200 250 geneC
#> 4 251 350 geneD
Created on 2022-03-03 by the reprex package (v2.0.1)
df1 contains positions, and I want to find, based on the ranges (start to end) in df2, which genes cover each position.
I want my data to look like this
position start end name
<dbl> <dbl> <dbl> <chr>
1 10 1 20 geneA
2 10 10 100 geneB
3 11 1 20 geneA
4 11 10 100 geneB
5 200 200 250 geneC
6 250 200 250 geneC
7 300 251 350 geneD
One way to solve this could be through crossing and filtering
df1 %>%
crossing(df2) %>%
filter(position >= start & position <= end)
However, my dataset is far too large, and I cannot afford crossing or expanding. Any other ideas?
1) SQL engines can perform such operations without crossing. (It may be possible to speed it up even more if you add indexes.)
library(sqldf)
sqldf("select *
from df1 a
join df2 b on a.position between b.start and b.end")
2) data.table can also do some SQL-like operations. (Be careful, because the first variable in each comparison must be from the first data table and the second from the second. They can't be reordered, so, for example, the first comparison could not be written as position >= start even though it is mathematically the same.) Again, adding indexes may improve the speed.
library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)[, c("start2", "end2") := .(start, end)]
dt2[dt1, on = .(start <= position, end >= position)]
crossing is a wrapper around expand_grid that does additional work, e.g. de-duplicating and sorting its inputs. You can use expand_grid directly:
library(tidyverse)
df1 <- tibble(position = c(10, 11, 200, 250, 300))
df1
#> # A tibble: 5 × 1
#> position
#> <dbl>
#> 1 10
#> 2 11
#> 3 200
#> 4 250
#> 5 300
df2 <- tibble(
start = c(1, 10, 200, 251),
end = c(20, 100, 250, 350),
name = c("geneA", "geneB", "geneC", "geneD")
)
expand_grid(df1, df2) %>%
filter(position >= start & position <= end)
#> # A tibble: 7 × 4
#> position start end name
#> <dbl> <dbl> <dbl> <chr>
#> 1 10 1 20 geneA
#> 2 10 10 100 geneB
#> 3 11 1 20 geneA
#> 4 11 10 100 geneB
#> 5 200 200 250 geneC
#> 6 250 200 250 geneC
#> 7 300 251 350 geneD
Created on 2022-03-03 by the reprex package (v2.0.0)
Here is a dplyr way (sort of).
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 <- tibble(position = c(10, 11, 200, 250, 300))
df2 <- tibble(
start = c(1, 10, 200, 251),
end = c(20, 100, 250, 350),
name = c("geneA", "geneB", "geneC", "geneD")
)
vbetween <- function(data, col, data2, start, end){
f <- function(x, l, r) l <= x & x <= r
col <- enquo(col)
start <- enquo(start)
end <- enquo(end)
x <- data %>% pull(!!col)
l <- data2 %>% pull(!!start)
r <- data2 %>% pull(!!end)
yes <- lapply(x, f, l = l, r = r)
lapply(yes, \(i) data2[i, ])
}
df1 %>% vbetween(position, df2, start, end) %>% bind_rows()
#> # A tibble: 7 x 3
#> start end name
#> <dbl> <dbl> <chr>
#> 1 1 20 geneA
#> 2 10 100 geneB
#> 3 1 20 geneA
#> 4 10 100 geneB
#> 5 200 250 geneC
#> 6 200 250 geneC
#> 7 251 350 geneD
Created on 2022-03-03 by the reprex package (v2.0.1)
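Another option, not mentioned in the answers above, is a non-equi join in dplyr itself. This sketch assumes dplyr 1.1.0 or later, where join_by() and its between() helper were introduced; it expresses the range condition as a join and should avoid building the full cross product first. The name res is chosen here for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

df1 <- tibble(position = c(10, 11, 200, 250, 300))
df2 <- tibble(
  start = c(1, 10, 200, 251),
  end = c(20, 100, 250, 350),
  name = c("geneA", "geneB", "geneC", "geneD")
)

# between(position, start, end) keeps pairs with start <= position <= end
# (both bounds inclusive by default)
res <- inner_join(df1, df2, by = join_by(between(position, start, end)))
res
```

Unlike the vbetween helper above, the result keeps the position column next to the matching start, end and name.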

Replace infinite values in an R data frame [why doesn't `is.infinite()` behave like `is.na()`]

library(tidyverse)
df <- tibble(col1 = c("A", "B", "C"),
col2 = c(NA, Inf, 5))
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A NA
#> 2 B Inf
#> 3 C 5
I can use the base R is.na() function to easily replace NAs with 0s, shown below:
df %>% replace(is.na(.), 0)
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A 0
#> 2 B Inf
#> 3 C 5
If I try to duplicate this logic with is.infinite() things break:
df %>% replace(is.infinite(.), 1)
#> Error in is.infinite(.) : default method not implemented for type 'list'
Looking at this older answer about Inf and R data frames I can hack together the solution shown below. This takes my original data frame and turns all NA into 0 and all Inf into 1. Why doesn't is.infinite() behave like is.na() and what is (perhaps) a better way to do what I want?
df %>%
replace(is.na(.), 0) %>%
mutate_if(is.numeric, list(~na_if(abs(.), Inf))) %>% # line 3
replace(is.na(.), 1)
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A 0
#> 2 B 1
#> 3 C 5
According to ?is.infinite, is.infinite expects its input x to be an atomic vector:
x: object to be tested: the default methods handle atomic vectors.
whereas, according to ?is.na, is.na can take a vector, matrix, or data.frame as input:
an R object to be tested: the default method for is.na and anyNA handle atomic vectors, lists, pairlists, and NULL
Checking the available methods shows the same thing:
methods('is.na')
#[1] is.na.data.frame is.na.data.table* is.na.numeric_version is.na.POSIXlt is.na.raster* is.na.vctrs_vctr*
methods('is.infinite') # only for vectors
#[1] is.infinite.vctrs_vctr*
We can modify the replace in the code to
library(dplyr)
library(tidyr) # replace_na() comes from tidyr
df %>%
mutate_if(is.numeric, ~ replace_na(., 0) %>%
replace(., is.infinite(.), 1))
# A tibble: 3 x 2
# col1 col2
# <chr> <dbl>
#1 A 0
#2 B 1
#3 C 5
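Since is.infinite only handles atomic vectors, the same idea can be written in base R by applying it one column at a time. This is a minimal sketch under that assumption, using a plain data.frame in place of the tibble; the helper name num is chosen here for illustration.

```r
df <- data.frame(col1 = c("A", "B", "C"),
                 col2 = c(NA, Inf, 5))

# Work out which columns are numeric, then fix each one as an atomic vector
num <- vapply(df, is.numeric, logical(1))
df[num] <- lapply(df[num], function(x) {
  x[is.na(x)] <- 0        # NA becomes 0
  x[is.infinite(x)] <- 1  # Inf and -Inf become 1
  x
})
df
```

Non-numeric columns such as col1 are left untouched, because only the columns flagged in num are rewritten.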
