Break matrix into averaged time bins using R

I need to convert a matrix x like this:
head(x)
Age d18O d13C
1 0.000 3.28 0.880
2 0.000 3.58 0.150
3 0.002 3.16 0.960
4 0.002 2.91 3.228
5 0.004 3.33 0.880
6 0.004 3.16 3.328
tail(x)
Age d18O d13C
14883 66.3037 1.00 2.03
14884 66.3159 1.02 1.70
14885 66.3800 0.62 2.01
14886 67.0073 1.30 1.23
14887 67.2391 1.31 1.30
14888 67.5173 1.36 1.35
into a matrix of time bins 0.5 wide, with the count and mean value of each of the variables per bin, such as:
Age count(x$d18O) mean(x$d18O)
1 0 500 4.1003
2 0.5 522 4.079464
3 1 412 4.032743
4 1.5 366 3.810601
5 2 498 3.749257
6 2.5 608 3.649063
. . . .
. . . .
Age is given in millions of years.

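This should do the trick:
library(dplyr)
# Bins are 0.5 wide; the last break is rounded up so max(Age) is not left out
breaks <- seq(0, ceiling(max(x$Age) / 0.5) * 0.5, by = 0.5)
x %>%
  mutate(age_bucket = cut(Age, breaks, include.lowest = TRUE)) %>%
  group_by(age_bucket) %>%
  summarise(n = n(),
            mean_d18O = mean(d18O))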

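Try this:
# include.lowest = TRUE keeps the rows with Age == 0 in the first bin
sdf <- split(x, cut(x$Age, seq(0, max(x$Age) * 1.01, by = 0.5), include.lowest = TRUE))
do.call(rbind, lapply(sdf, function(sx) c(length(sx$d18O), mean(sx$d18O))))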
you will get something similar to:
(23,23.5] 0 NaN
(23.5,24] 4 2.9500345
(24,24.5] 1 6.9320712
(24.5,25] 2 3.0219788
(25,25.5] 2 3.7149871
(25.5,26] 1 1.9051732
(26,26.5] 2 3.1865066
(26.5,27] 1 3.9982569
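For readers who want the layout from the question (one row per bin, labelled by the bin's lower edge), the same idea can be sketched in base R. This is only a minimal sketch on made-up toy data, not the asker's data set; the names width, breaks, bin, and binned are illustrative.

```r
# Toy data in the same shape as the question's x (values are illustrative)
x <- data.frame(
  Age  = c(0, 0.002, 0.3, 0.5, 0.75, 1.2, 1.4),
  d18O = c(3.28, 3.16, 3.33, 3.00, 3.20, 2.90, 3.10)
)

width  <- 0.5
# Breaks from 0 up to the first multiple of `width` at or above max(Age)
breaks <- seq(0, ceiling(max(x$Age) / width) * width, by = width)

# Left-closed bins [0, 0.5), [0.5, 1), ... labelled by their lower edge
bin <- cut(x$Age, breaks, right = FALSE, include.lowest = TRUE,
           labels = head(breaks, -1))

binned <- data.frame(
  Age       = head(breaks, -1),
  n         = as.vector(table(bin)),                   # 0 for empty bins
  mean_d18O = as.vector(tapply(x$d18O, bin, mean))     # NA for empty bins
)
binned
```

Using table() for the counts and tapply() for the means keeps empty bins in the result, which makes gaps in the record visible instead of silently dropping them.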

Related

how to apply round() to odd or even rows only in R

Assume my original data frame is:
a b d e
1 1 1 2 1
2 20 30 40 30
3 1 2 6 2
4 40 50 40 50
5 5 5 3 5
6 60 60 60 60
I want to add a percentage row below each row.
a b d e
1 1.00 1.00 2.00 1.00
2 0.79 0.66 1.57 0.66
3 20.00 30.00 40.00 30.00
4 13.51 20.27 27.03 20.27
5 1.00 2.00 6.00 2.00
6 0.66 1.57 3.97 1.57
7 40.00 50.00 40.00 50.00
8 27.03 33.78 27.03 33.78
9 5.00 5.00 3.00 5.00
10 3.94 3.31 2.36 3.31
11 60.00 60.00 60.00 60.00
12 40.54 40.54 40.54 40.54
But as you can see, my odd rows get .00, which I do not want.
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df <- df %>% slice(rep(1:n(), each=2))
df[seq_len(nrow(df)) %% 2 ==0, ] <- round(100*df[seq_len(nrow(df)) %% 2 ==0,
]/colSums(df[seq_len(nrow(df)) %% 2 ==0, ]),2)
How can I keep my odd rows without decimals?
The problem is that each column of a data frame can only hold one type of data. If any value in a column has decimals, then the whole column must be of type double, and it is printed with a common number of decimal places. The only way to change how your data frame appears is via its print method.
Fortunately, you can easily turn your data frame into a tibble. This is a type of data frame, but prints in such a way that the integers don't have decimal points afterwards.
df
#> a b d e
#> 1 1.00 1.00 2.00 1.00
#> 2 0.79 0.66 1.57 0.66
#> 3 20.00 30.00 40.00 30.00
#> 4 13.51 20.27 27.03 20.27
#> 5 1.00 2.00 6.00 2.00
#> 6 0.66 1.57 3.97 1.57
#> 7 40.00 50.00 40.00 50.00
#> 8 27.03 33.78 27.03 33.78
#> 9 5.00 5.00 3.00 5.00
#> 10 3.94 3.31 2.36 3.31
#> 11 60.00 60.00 60.00 60.00
#> 12 40.54 40.54 40.54 40.54
dplyr::as_tibble(df)
#> # A tibble: 12 x 4
#> a b d e
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1
#> 2 0.79 0.66 1.57 0.66
#> 3 20 30 40 30
#> 4 13.5 20.3 27.0 20.3
#> 5 1 2 6 2
#> 6 0.66 1.57 3.97 1.57
#> 7 40 50 40 50
#> 8 27.0 33.8 27.0 33.8
#> 9 5 5 3 5
#> 10 3.94 3.31 2.36 3.31
#> 11 60 60 60 60
#> 12 40.5 40.5 40.5 40.5
Created on 2022-04-26 by the reprex package (v2.0.1)
Allan Cameron is right that a tibble prints better and does what you want. To offer another solution, though: if you're trying to print something that you might send to a text file (rather than just look at on the screen), you could format the values as character strings as follows:
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df %>%
  mutate(obs = row_number(),
         across(-obs, ~.x/sum(.x)),
         type = "pct") %>%
  bind_rows(df %>% mutate(obs = row_number(),
                          type = "raw")) %>%
  mutate(type = factor(type, levels=c("raw", "pct"))) %>%
  arrange(obs, type) %>%
  mutate(across(a:e, ~case_when(
    type == "raw" ~ sprintf("%.0f", .x),
    TRUE ~ sprintf("%.2f%%", .x*100)))) %>%
  select(-c(obs, type))
#> a b d e
#> 1 1 1 2 1
#> 2 0.79% 0.68% 1.32% 0.68%
#> 3 20 30 40 30
#> 4 15.75% 20.27% 26.49% 20.27%
#> 5 1 2 6 2
#> 6 0.79% 1.35% 3.97% 1.35%
#> 7 40 50 40 50
#> 8 31.50% 33.78% 26.49% 33.78%
#> 9 5 5 3 5
#> 10 3.94% 3.38% 1.99% 3.38%
#> 11 60 60 60 60
#> 12 47.24% 40.54% 39.74% 40.54%
Created on 2022-04-26 by the reprex package (v2.0.1)
Also note that I think the percentages you calculated are wrong. With your data, the percentage rows of column a sum to:
sum(df$a[c(2,4,6,8,10,12)])
#> [1] 86.47
With my percentages, which differ from yours, the same sum is 100 (once the strings are turned back into numbers).
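A likely reason for the mismatch (an observation about how R recycles vectors, not something stated in the question) is that dividing a data frame by a plain vector recycles the vector down each column rather than across columns. So df / colSums(df) does not divide every value in column a by the sum of a. A minimal base R sketch, using sweep() as a swapped-in fix and the names bad and pct for illustration:

```r
df <- data.frame(a = c(1, 20, 1, 40, 5, 60),
                 b = c(1, 30, 2, 50, 5, 60),
                 d = c(2, 40, 6, 40, 3, 60),
                 e = c(1, 30, 2, 50, 5, 60))

# Recycled down the columns: the first value of b ends up divided by sum(d)
bad <- df / colSums(df)
all.equal(bad$b[1], 1 / sum(df$d))

# sweep() divides every column by its own total
pct <- round(100 * sweep(as.matrix(df), 2, colSums(df), "/"), 2)
colSums(pct)  # each column total is close to 100, up to rounding
```

The first row of pct (0.79, 0.68, 1.32, 0.68) agrees with the percentages in the answer above rather than with those in the question.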

How to output twice in R pipe?

library(psych)
library(mokken)
library(magrittr)  # provides %>% and %T>%
bfi[1:3] %>%
  na.omit() %>%
  mokken::check.monotonicity() %T>%
  summary %>%
  {.$Hi[.$Hi<0]}
A1
-0.3873723
The script above works well. I get the final output, but I also want to review the output of summary().
How can I make the pipe print the summary as well?
If we want the summary as well, place both results in a list:
library(psych)
library(mokken)
library(magrittr)
out <- bfi[1:3] %>%
  na.omit() %>%
  mokken::check.monotonicity() %>%
  {list(summary(.), .$Hi[.$Hi < 0])}
out
#[[1]]
# ItemH #ac #vi #vi/#ac maxvi sum sum/#ac zmax #zsig crit
#A1 -0.39 75 54 0.72 0.52 9.79 0.1305 16.75 51 550
#A2 0.06 50 8 0.16 0.14 0.63 0.0126 4.76 7 128
#A3 0.09 30 6 0.20 0.12 0.45 0.0149 4.63 6 134
#[[2]]
# A1
#-0.3873723
You can use %T>% with print() to show the result of summary() without returning it.
bfi[1:3] %>%
  na.omit() %>%
  mokken::check.monotonicity() %T>%
  {print(summary(.))} %>%
  {.$Hi[.$Hi<0]}
# ItemH #ac #vi #vi/#ac maxvi sum sum/#ac zmax #zsig crit
# A1 -0.39 75 54 0.72 0.52 9.79 0.1305 16.75 51 550
# A2 0.06 50 8 0.16 0.14 0.63 0.0126 4.76 7 128
# A3 0.09 30 6 0.20 0.12 0.45 0.0149 4.63 6 134
#
# A1
# -0.3873723
If you assign the pipeline to a variable, the result of summary() is printed but not stored.
out <- ...
out
# A1
# -0.3873723
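The difference between the two answers is easier to try out on a built-in data set. The following is a minimal sketch that assumes only magrittr and mtcars; neither the data nor the names shown and kept come from the question. The tee operator %T>% runs its right-hand side for the side effect and passes the left-hand value on, whereas the list version stores both results.

```r
library(magrittr)

# Tee: the summary is printed as a side effect; only the mean is returned
shown <- mtcars$mpg %T>%
  {print(summary(.))} %>%
  mean()

# List: both the summary and the mean are stored for later use
kept <- mtcars$mpg %>%
  {list(summary = summary(.), mean = mean(.))}
```

Which one fits depends on whether the summary is only needed on screen (tee) or later in the script (list).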

How to keep other columns when using aggregate in R?

I have a data frame, p4p5, that contains the following columns:
SampleID, expr, Gene, Period, Consequence, isPTV
I've used the aggregate function here to find the median expression per Gene:
p4p5_med <- aggregate(expr ~ Gene, p4p5, median)
However, this results in a dataframe with the columns "expr" and "Gene" only. How can I still retain all the original columns when applying the aggregate function?
UPDATE:
Input (p4p5):
SampleID expr Gene Period Consequence isPTV
HSB430 -1.23 ENSG000098 4 upstream_gene_variant 0
HSB321 -0.02 ENSG000098 5 stop_gained 1
HSB296 3.12 ENSG000027 4 upstream_gene_variant 0
HSB201 1.22 ENSG000027 4 intron_variant 0
HSB220 0.13 ENSG000013 6 intron_variant 0
Expected output:
SampleID expr Gene Period Consequence isPTV Median
HSB430 -1.23 ENSG000098 4 upstream_gene_variant 0 -0.625
HSB321 -0.02 ENSG000098 5 stop_gained 1 -0.625
HSB296 3.12 ENSG000027 4 upstream_gene_variant 0 2.17
HSB201 1.22 ENSG000027 4 intron_variant 0 2.17
HSB220 0.13 ENSG000013 6 intron_variant 0 0.13
I'd use dplyr for this:
library(dplyr)
p4p5 %>%
group_by(Gene) %>%
mutate(Median = median(expr, na.rm = TRUE)) %>%
ungroup()
SampleID expr Gene Period Consequence isPTV Median
<chr> <dbl> <chr> <int> <chr> <int> <dbl>
1 HSB430 -1.23 ENSG000098 4 upstream_gene_variant 0 -0.625
2 HSB321 -0.02 ENSG000098 5 stop_gained 1 -0.625
3 HSB296 3.12 ENSG000027 4 upstream_gene_variant 0 2.17
4 HSB201 1.22 ENSG000027 4 intron_variant 0 2.17
5 HSB220 0.13 ENSG000013 6 intron_variant 0 0.13
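If an extra package is not wanted, base R's ave() gives the same per-group column while keeping every row. A minimal sketch, with the data frame rebuilt by hand from the question's sample input:

```r
p4p5 <- data.frame(
  SampleID    = c("HSB430", "HSB321", "HSB296", "HSB201", "HSB220"),
  expr        = c(-1.23, -0.02, 3.12, 1.22, 0.13),
  Gene        = c("ENSG000098", "ENSG000098", "ENSG000027",
                  "ENSG000027", "ENSG000013"),
  Period      = c(4L, 5L, 4L, 4L, 6L),
  Consequence = c("upstream_gene_variant", "stop_gained",
                  "upstream_gene_variant", "intron_variant", "intron_variant"),
  isPTV       = c(0L, 1L, 0L, 0L, 0L)
)

# ave() applies FUN within each Gene and returns a vector as long as the input
p4p5$Median <- ave(p4p5$expr, p4p5$Gene, FUN = median)
p4p5
```

Unlike aggregate(), which collapses to one row per group, ave() returns one value per original row, so nothing has to be merged back afterwards.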

R group values in column based on intervals and average the results for each interval

I have two tables
table 1:
Dates_only <- data.frame(ID=c('1118','1118','1118','1118','1118',
'1118','1118','1118','1119','1119',
'1119','1119','1119','1119','1119',
'1119','13PP','13PP','13PP','13PP',
'13PP','13PP','13PP','13PP'),
Quart_y=c('2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2'),
Quart=c(0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00))
and table 2:
Values <- data.frame(ID=c('1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP'),
Day=c(0,0,0,0.14,0.13,0.13,0.2,0.23,0.24,0.27,0.28,
0.32,0.32,0.32,0.44,0.47,0.49,0.49,0.59,0.64,
0.61,0.72,0.71,0.73,0.95,0.86,0.78,1.1,0.93,1.15),
Value=c(7.6,6.2,6.8,7.1,6.2,5.9,6.8,5.8,4.6,6.5,5.4,
4.2,6.3,4.8,4,6,4.3,3.8,5.9,4,3.6,5.6,3.8,
3.4,5.4,3.2,3,5,2.9,2.9))
What I am trying to do is to find a way to change the values in Values$Day according to Dates_only$Quart.
Specifically, Dates_only$Quart represents quantified quarters (2017Q3 = 0.25, 2017Q4 = 0.50, ..., 2018Q4 = 1.50, etc.), while Values$Day represents quantified days.
I want to classify Values$Day by quarter instead, for example:
for 0 <= Values$Day <= 0.25, Values$Day becomes 0.25; for 0.25 < Values$Day <= 0.50, Values$Day becomes 0.50; and so on.
I have tried the method below, but it comes up with an error message:
unique_quarters <- unique(Dates_only$Quart)
unique_quarters <- append(unique_quarters, 0, after=0)
df3 <- transform(Dates_only,
Transf_Day=Values$Quart[findInterval(Values$Day, unique_quarters)])
I guess the issue is that findInterval(Values$Day, unique_quarters) returns
1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 5 4 5
While Values$Quart has values
0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
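Try this:
library(tidyverse)
breaks <- seq(0, 3, 0.25)
as_tibble(Values) %>%
  # Int shows the interval; Int2 labels it by its upper bound (the quarter)
  mutate(Int = cut(Day, breaks, include.lowest = TRUE),
         Int2 = cut(Day, breaks, labels = breaks[-1], include.lowest = TRUE))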
# A tibble: 30 x 5
ID Day Value Int Int2
<fctr> <dbl> <dbl> <fctr> <fctr>
1 1118 0.00 7.6 [0,0.25] 0.25
2 1119 0.00 6.2 [0,0.25] 0.25
3 13PP 0.00 6.8 [0,0.25] 0.25
4 1118 0.14 7.1 [0,0.25] 0.25
5 1119 0.13 6.2 [0,0.25] 0.25
6 13PP 0.13 5.9 [0,0.25] 0.25
7 1118 0.20 6.8 [0,0.25] 0.25
8 1119 0.23 5.8 [0,0.25] 0.25
9 13PP 0.24 4.6 [0,0.25] 0.25
10 1118 0.27 6.5 (0.25,0.5] 0.5
# ... with 20 more rows
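The question's title also asks for an average per interval, which the snippet above stops short of. One way to finish the job is sketched below in base R, on a six-row subset of the question's Values table; the names vals, breaks, and avg are illustrative.

```r
# Six rows taken from the question's Values table
vals <- data.frame(
  ID    = c("1118", "1118", "1118", "1119", "1119", "1119"),
  Day   = c(0, 0.14, 0.27, 0, 0.13, 0.28),
  Value = c(7.6, 7.1, 6.5, 6.2, 6.2, 5.4)
)

breaks <- seq(0, 3, by = 0.25)

# Label each interval by its upper bound, then convert the factor to numbers
vals$Quart <- as.numeric(as.character(
  cut(vals$Day, breaks, labels = breaks[-1], include.lowest = TRUE)
))

# Mean Value per ID and quarter
avg <- aggregate(Value ~ ID + Quart, data = vals, FUN = mean)
avg
```

Because Quart is numeric again after the conversion, avg can be merged with Dates_only on ID and Quart if the quarter labels (Quart_y) are needed.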

tidyr - spread multiple columns

I'm preparing data for a network meta-analysis and I am having difficulty tidying the columns.
If I have this initial dataset:
Study Trt y sd n
1 1 -1.22 3.70 54
1 3 -1.53 4.28 95
2 1 -0.30 4.40 76
2 2 -2.60 4.30 71
2 4 -1.2 4.3 81
How can I finish with this other one?
Study Treatment1 y1 sd1 n1 Treatment2 y2 sd2 n2 Treatment3 y3 sd3 n3
1 1 1 -1.22 3.70 54 3 -1.53 4.28 95 NA NA NA NA
2 2 1 -0.30 4.40 76 2 -2.60 4.30 71 4 -1.2 4.3 81
I'm really stuck in this step, and I'd really appreciate some help...
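We can gather to 'long' format, then unite multiple columns into a single one and spread it to wide. Leaving out fill keeps NA for studies with fewer treatments, as in the expected output:
library(tidyverse)
gather(df1, Var, Val, Trt:n) %>%
  group_by(Study, Var) %>%
  # idx numbers the treatments within each study
  mutate(idx = row_number()) %>%
  unite(VarT, Var, idx, sep="") %>%
  spread(VarT, Val)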
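As an alternative that needs no packages, base R's reshape() produces the requested column order directly. This is a minimal sketch: arm is a helper column introduced here to number the treatments within each study, and the output columns are named Trt1, Trt2, ... rather than Treatment1, Treatment2, ...

```r
df1 <- data.frame(
  Study = c(1, 1, 2, 2, 2),
  Trt   = c(1, 3, 1, 2, 4),
  y     = c(-1.22, -1.53, -0.30, -2.60, -1.2),
  sd    = c(3.70, 4.28, 4.40, 4.30, 4.3),
  n     = c(54, 95, 76, 71, 81)
)

# Number the arms within each study: 1, 2, 1, 2, 3
df1$arm <- ave(df1$Study, df1$Study, FUN = seq_along)

# One row per study; studies with fewer arms get NA in the unused columns
wide <- reshape(df1, idvar = "Study", timevar = "arm",
                direction = "wide", sep = "")
names(wide)
```

The columns come out grouped by arm (Trt1, y1, sd1, n1, then Trt2, ...), which matches the layout asked for in the question.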
