How to keep other columns when using aggregate in R?

I have a data frame, p4p5, that contains the following columns:
SampleID, expr, Gene, Period, Consequence, isPTV
I've used the aggregate function here to find the median expression per Gene:
p4p5_med <- aggregate(expr ~ Gene, p4p5, median)
However, this results in a dataframe with the columns "expr" and "Gene" only. How can I still retain all the original columns when applying the aggregate function?
UPDATE:
Input (p4p5):
SampleID expr Gene Period Consequence isPTV
HSB430 -1.23 ENSG000098 4 upstream_gene_variant 0
HSB321 -0.02 ENSG000098 5 stop_gained 1
HSB296 3.12 ENSG000027 4 upstream_gene_variant 0
HSB201 1.22 ENSG000027 4 intron_variant 0
HSB220 0.13 ENSG000013 6 intron_variant 0
Expected output:
SampleID expr Gene Period Consequence isPTV Median
HSB430 -1.23 ENSG000098 4 upstream_gene_variant 0 -0.625
HSB321 -0.02 ENSG000098 5 stop_gained 1 -0.625
HSB296 3.12 ENSG000027 4 upstream_gene_variant 0 2.17
HSB201 1.22 ENSG000027 4 intron_variant 0 2.17
HSB220 0.13 ENSG000013 6 intron_variant 0 0.13

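I'd use dplyr for this. Grouping by Gene and calling mutate() rather than summarise() adds the per-gene median as a new column while keeping every original row and column:
library(dplyr)
p4p5 %>%
  group_by(Gene) %>%
  mutate(Median = median(expr, na.rm = TRUE)) %>%
  ungroup()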
SampleID expr Gene Period Consequence isPTV Median
<chr> <dbl> <chr> <int> <chr> <int> <dbl>
1 HSB430 -1.23 ENSG000098 4 upstream_gene_variant 0 -0.625
2 HSB321 -0.02 ENSG000098 5 stop_gained 1 -0.625
3 HSB296 3.12 ENSG000027 4 upstream_gene_variant 0 2.17
4 HSB201 1.22 ENSG000027 4 intron_variant 0 2.17
5 HSB220 0.13 ENSG000013 6 intron_variant 0 0.13
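If you would rather stay in base R, a minimal sketch of the same idea uses ave(), which returns one value per input row instead of one per group, so no column is dropped. The data frame below is rebuilt by hand from the sample rows in the question; the integer types for Period and isPTV are an assumption.

```r
# Sample rows copied from the question
p4p5 <- data.frame(
  SampleID = c("HSB430", "HSB321", "HSB296", "HSB201", "HSB220"),
  expr = c(-1.23, -0.02, 3.12, 1.22, 0.13),
  Gene = c("ENSG000098", "ENSG000098", "ENSG000027", "ENSG000027", "ENSG000013"),
  Period = c(4L, 5L, 4L, 4L, 6L),
  Consequence = c("upstream_gene_variant", "stop_gained",
                  "upstream_gene_variant", "intron_variant", "intron_variant"),
  isPTV = c(0L, 1L, 0L, 0L, 0L)
)

# ave() applies FUN within each Gene and repeats the result on every row of that group
p4p5$Median <- ave(p4p5$expr, p4p5$Gene, FUN = median)
p4p5
```

Unlike aggregate(), this keeps the original row order and all six input columns.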

Related

Average a multiple number of rows for every column, multiple times

Here I have a snippet of my dataset. The rows indicate different days of the year.
The Substations represent individuals, there are over 500 individuals.
The 10 minute time periods run all the way through 24 hours.
I need to find an average value for each 10 minute interval for each individual in this dataset. This should result in a single row for each individual substation, with the respective average value for each time interval.
I have tried:
meanbygroup <- stationgroup %>%
group_by(Substation) %>%
summarise(means = colMeans(tenminintervals[sapply(tenminintervals, is.numeric)]))
But this averages the entire column and I am left with the same average values for each individual substation.
So for each individual substation, I need an average for each individual time interval.
Please help!
Try summarize(across()), which applies the function to every non-grouping column within each Substation:
df %>%
  group_by(Substation) %>%
  summarize(across(everything(), ~mean(.x, na.rm = TRUE)))
Output:
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A -0.233 0.110 -0.106
2 B 0.203 -0.0997 -0.128
3 C -0.0733 0.196 -0.0205
4 D 0.0905 -0.0449 -0.0529
5 E 0.401 0.152 -0.0957
6 F 0.0368 0.120 -0.0787
7 G 0.0323 -0.0792 -0.278
8 H 0.132 -0.0766 0.157
9 I -0.0693 0.0578 0.0732
10 J 0.0776 -0.176 -0.0192
# … with 16 more rows
Input:
set.seed(123)
df <- bind_cols(
  tibble(Substation = sample(LETTERS, size = 1000, replace = TRUE)),
  as_tibble(setNames(lapply(1:3, function(x) rnorm(1000)), c("00:00", "00:10", "00:20")))
) %>% arrange(Substation)
# A tibble: 1,000 × 4
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A 0.121 -1.94 0.137
2 A -0.322 1.05 0.416
3 A -0.158 -1.40 0.192
4 A -1.85 1.69 -0.0922
5 A -1.16 -0.455 0.754
6 A 1.95 1.06 0.732
7 A -0.132 0.655 -1.84
8 A 1.08 -0.329 -0.130
9 A -1.21 2.82 -0.0571
10 A -1.04 0.237 -0.328
# … with 990 more rows
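For comparison, here is a minimal base R sketch of the same per-substation averaging with aggregate(). The tiny `stations` data frame and its values are made up for illustration; check.names = FALSE is only there to keep the time-style column names intact.

```r
# Hypothetical toy data: two substations, two days each, two time intervals
stations <- data.frame(
  Substation = c("A", "A", "B", "B"),
  `00:00` = c(1, 3, 10, 20),
  `00:10` = c(2, 4, 30, 50),
  check.names = FALSE
)

# Average every interval column separately within each Substation
means <- aggregate(stations[-1],
                   by = list(Substation = stations$Substation),
                   FUN = mean, na.rm = TRUE)
means
```

The result has one row per substation and one column per interval, which is the shape the question asks for.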

how to apply round() to odd or even rows only in R

assume my original dataframe is :
a b d e
1 1 1 2 1
2 20 30 40 30
3 1 2 6 2
4 40 50 40 50
5 5 5 3 5
6 60 60 60 60
I want to add a percentage row below each row.
a b d e
1 1.00 1.00 2.00 1.00
2 0.79 0.66 1.57 0.66
3 20.00 30.00 40.00 30.00
4 13.51 20.27 27.03 20.27
5 1.00 2.00 6.00 2.00
6 0.66 1.57 3.97 1.57
7 40.00 50.00 40.00 50.00
8 27.03 33.78 27.03 33.78
9 5.00 5.00 3.00 5.00
10 3.94 3.31 2.36 3.31
11 60.00 60.00 60.00 60.00
12 40.54 40.54 40.54 40.54
but as you see, my odd rows get .00 which I do not want.
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df <- df %>% slice(rep(1:n(), each=2))
df[seq_len(nrow(df)) %% 2 ==0, ] <- round(100*df[seq_len(nrow(df)) %% 2 ==0,
]/colSums(df[seq_len(nrow(df)) %% 2 ==0, ]),2)
how can I keep my odd rows without decimals?
The problem is that columns in data frames can only hold one type of data. If some of the columns in your data frame have decimals, then the whole column must be of type double. The only way to change how your data frame appears is via its print method.
Fortunately, you can easily turn your data frame into a tibble. This is a type of data frame, but prints in such a way that the integers don't have decimal points afterwards.
df
#> a b d e
#> 1 1.00 1.00 2.00 1.00
#> 2 0.79 0.66 1.57 0.66
#> 3 20.00 30.00 40.00 30.00
#> 4 13.51 20.27 27.03 20.27
#> 5 1.00 2.00 6.00 2.00
#> 6 0.66 1.57 3.97 1.57
#> 7 40.00 50.00 40.00 50.00
#> 8 27.03 33.78 27.03 33.78
#> 9 5.00 5.00 3.00 5.00
#> 10 3.94 3.31 2.36 3.31
#> 11 60.00 60.00 60.00 60.00
#> 12 40.54 40.54 40.54 40.54
dplyr::tibble(df)
#> # A tibble: 12 x 4
#> a b d e
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1
#> 2 0.79 0.66 1.57 0.66
#> 3 20 30 40 30
#> 4 13.5 20.3 27.0 20.3
#> 5 1 2 6 2
#> 6 0.66 1.57 3.97 1.57
#> 7 40 50 40 50
#> 8 27.0 33.8 27.0 33.8
#> 9 5 5 3 5
#> 10 3.94 3.31 2.36 3.31
#> 11 60 60 60 60
#> 12 40.5 40.5 40.5 40.5
Created on 2022-04-26 by the reprex package (v2.0.1)
Allan Cameron is right that a tibble prints better and does what you want. To offer another solution, though: if you want output that you might send to a text file (rather than just look at on the screen), you could format the values as character strings, as follows:
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df %>%
  mutate(obs = row_number(),
         across(-obs, ~.x/sum(.x)),
         type = "pct") %>%
  bind_rows(df %>% mutate(obs = row_number(),
                          type = "raw")) %>%
  mutate(type = factor(type, levels=c("raw", "pct"))) %>%
  arrange(obs, type) %>%
  mutate(across(a:e, ~case_when(
    type == "raw" ~ sprintf("%.0f", .x),
    TRUE ~ sprintf("%.2f%%", .x*100)))) %>%
  select(-c(obs, type))
#> a b d e
#> 1 1 1 2 1
#> 2 0.79% 0.68% 1.32% 0.68%
#> 3 20 30 40 30
#> 4 15.75% 20.27% 26.49% 20.27%
#> 5 1 2 6 2
#> 6 0.79% 1.35% 3.97% 1.35%
#> 7 40 50 40 50
#> 8 31.50% 33.78% 26.49% 33.78%
#> 9 5 5 3 5
#> 10 3.94% 3.38% 1.99% 3.38%
#> 11 60 60 60 60
#> 12 47.24% 40.54% 39.74% 40.54%
Created on 2022-04-26 by the reprex package (v2.0.1)
Also note that I think the percentages you calculated are wrong. With your data, the percentage rows of column a do not add up to 100:
sum(df$a[c(2,4,6,8,10,12)])
#> [1] 86.47
With my percentages, which differ from yours, I get 100 (if we turn them back into numbers from strings).
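A likely reason for the discrepancy, shown as a small sketch: dividing a data frame by a plain vector recycles that vector down each column rather than across columns, so most cells in the question's code end up divided by another column's total. One way around it is prop.table() on a matrix with margin = 2, which divides each column by its own sum. The names `wrong` and `pct` are just for illustration.

```r
df <- data.frame(a = c(1, 20, 1, 40, 5, 60),
                 b = c(1, 30, 2, 50, 5, 60),
                 d = c(2, 40, 6, 40, 3, 60),
                 e = c(1, 30, 2, 50, 5, 60))

# colSums(df) is recycled down the rows: a[2] is divided by sum(b), not sum(a)
wrong <- 100 * df / colSums(df)

# Each column is divided by its own total
pct <- 100 * prop.table(as.matrix(df), margin = 2)
colSums(pct)
```

Here `wrong$a[2]` is 100 * 20 / 148 (about 13.51, as in the question), while `pct[2, "a"]` is 100 * 20 / 127 (about 15.75, as in the output above).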

Summing up Certain Sequences of a Dataframe in R

I have several data frames of daily rates of different regions by age-groups:
Date 0-14 Rate 15-29 Rate 30-44 Rate 45-64 Rate 65-79 Rate 80+ Rate
2020-23-12 0 33.54 45.68 88.88 96.13 41.28
2020-24-12 0 25.14 35.28 66.14 90.28 38.41
It begins on Wednesday (2020-23-12) and I have data from then on up to date.
I want to obtain weekly row sums of rates from each Wednesday to Tuesday.
There should be a neat way to combine the aggregate, seq and rowsum functions to do this in a few lines; otherwise I'll end up with something far too long.
I created some minimal data: three weeks of daily rows with two arbitrary numeric columns and no missing values. You can use the tidyverse to sum across the columns of each row, create a group per week, and then sum the row sums within each week:
# Minimal Data
MWE <- data.frame(date = c(outer(as.Date("12/23/20", "%m/%d/%y"), 0:20, `+`)),
column1 = runif(21,0,1),
column2 = runif(21,0,1))
library(tidyverse)
MWE %>%
  # Calculate Row Sum Everywhere
  mutate(sum = rowSums(across(where(is.numeric)))) %>%
  # Create Week Groups
  group_by(week = ceiling(row_number()/7)) %>%
  # Sum Over All RowSums per Group
  summarise(rowSums_by_week = sum(sum))
# Groups: week [3]
date column1 column2 sum week
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-12-23 0.449 0.759 1.21 1
2 2020-12-24 0.423 0.0956 0.519 1
3 2020-12-25 0.974 0.592 1.57 1
4 2020-12-26 0.798 0.250 1.05 1
5 2020-12-27 0.870 0.487 1.36 1
6 2020-12-28 0.952 0.345 1.30 1
7 2020-12-29 0.349 0.817 1.17 1
8 2020-12-30 0.227 0.727 0.954 2
9 2020-12-31 0.292 0.209 0.501 2
10 2021-01-01 0.678 0.276 0.954 2
# ... with 11 more rows
# A tibble: 3 x 2
week rowSums_by_week
<dbl> <dbl>
1 1 8.16
2 2 6.02
3 3 6.82
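Grouping by row position assumes there is exactly one row per day with no gaps. A hedged alternative sketch derives the week from the date itself, counting 7-day blocks from the first date (a Wednesday here), so a missing day does not shift later weeks. The `rates` data frame and its column names are invented for illustration.

```r
# Hypothetical toy data: 21 consecutive days starting Wednesday 2020-12-23
rates <- data.frame(
  date = as.Date("2020-12-23") + 0:20,
  rate_a = 1:21,
  rate_b = rep(1, 21)
)

# Start date of the Wednesday-to-Tuesday week each row belongs to
days_since_start <- as.integer(rates$date - min(rates$date))
rates$week_start <- min(rates$date) + 7 * (days_since_start %/% 7)

# Row sum across the rate columns, then total per week
rates$total <- rowSums(rates[c("rate_a", "rate_b")])
weekly <- aggregate(total ~ week_start, data = rates, FUN = sum)
weekly
```

This relies on the first date in the data being the day on which each week should start.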

tidyr - spread multiple columns

I'm preparing data for a network meta-analysis and I am having difficulty tidying the columns.
If I have this initial dataset:
Study Trt y sd n
1 1 -1.22 3.70 54
1 3 -1.53 4.28 95
2 1 -0.30 4.40 76
2 2 -2.60 4.30 71
2 4 -1.2 4.3 81
How can I finish with this other one?
Study Treatment1 y1 sd1 n1 Treatment2 y2 sd2 n2 Treatment3 y3 sd3 n3
1 1 1 -1.22 3.70 54 3 -1.53 4.28 95 NA NA NA NA
2 3 1 -0.30 4.40 76 2 -2.60 4.30 71 4 -1.2 4.3 81
I'm really stuck in this step, and I'd really appreciate some help...
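We can gather the data to 'long' format, number the rows within each Study and variable, unite the variable name with that number into a single column, and then spread it back to 'wide' format:
library(tidyverse)
gather(df1, Var, Val, Trt:n) %>%
  group_by(Study, Var) %>%
  mutate(n = row_number()) %>%
  unite(VarT, Var, n, sep="") %>%
  # fill is left at its default so missing arms show as NA, as in the expected output
  spread(VarT, Val)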
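gather() and spread() still work but are superseded in current tidyr. A minimal sketch of the same reshaping with pivot_wider() follows; `df1` is rebuilt from the question's data, and the helper column `arm` is a name chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Data copied from the question
df1 <- data.frame(
  Study = c(1, 1, 2, 2, 2),
  Trt = c(1, 3, 1, 2, 4),
  y = c(-1.22, -1.53, -0.30, -2.60, -1.2),
  sd = c(3.70, 4.28, 4.40, 4.30, 4.3),
  n = c(54, 95, 76, 71, 81)
)

wide <- df1 %>%
  group_by(Study) %>%
  mutate(arm = row_number()) %>%   # 1, 2, 3 within each study
  ungroup() %>%
  # One set of Trt/y/sd/n columns per arm; names become Trt1, y1, sd1, n1, ...
  pivot_wider(names_from = arm,
              values_from = c(Trt, y, sd, n),
              names_sep = "")
wide
```

Studies with fewer arms get NA in the unused columns. The columns come out grouped by variable (Trt1, Trt2, Trt3, y1, ...) rather than by arm, so reorder them afterwards if the exact layout matters.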

Break matrix into averaged time bins using R

I need to convert a matrix x like this:
head(x)
Age d18O d13C
1 0.000 3.28 0.880
2 0.000 3.58 0.150
3 0.002 3.16 0.960
4 0.002 2.91 3.228
5 0.004 3.33 0.880
6 0.004 3.16 3.328
tail(x)
Age d18O d13C
14883 66.3037 1.00 2.03
14884 66.3159 1.02 1.70
14885 66.3800 0.62 2.01
14886 67.0073 1.30 1.23
14887 67.2391 1.31 1.30
14888 67.5173 1.36 1.35
into a matrix of time bins of width 0.5, with the count and mean value of each of the variables per bin, such as:
Age count(x$d18O) mean(x$d18O)
1 0 500 4.1003
2 0.5 522 4.079464
3 1 412 4.032743
4 1.5 366 3.810601
5 2 498 3.749257
6 2.5 608 3.649063
. . . .
. . . .
Age is given in millions of years.
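This should do the trick:
library(dplyr)
as.data.frame(x) %>%
  # 0.5-wide bins; the breaks run past max(Age) so no row falls outside the last bin
  mutate(age_bucket = cut(Age, seq(floor(min(Age)), ceiling(max(Age)), by = 0.5), include.lowest = TRUE)) %>%
  group_by(age_bucket) %>%
  summarise(n = n(),
            mean_d18O = mean(d18O))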
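Try this:
# include.lowest = TRUE keeps the rows with Age == 0 in the first bin
sdf <- split(x, cut(x$Age, seq(0, max(x$Age) * 1.01, by = 0.5), include.lowest = TRUE))
do.call(rbind, lapply(sdf, function(sx) c(length(sx$d18O), mean(sx$d18O))))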
you will get something similar to:
(23,23.5] 0 NaN
(23.5,24] 4 2.9500345
(24,24.5] 1 6.9320712
(24.5,25] 2 3.0219788
(25,25.5] 2 3.7149871
(25.5,26] 1 1.9051732
(26,26.5] 2 3.1865066
(26.5,27] 1 3.9982569
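If you want the bins labelled by their lower edge (0, 0.5, 1, ...) as in the expected output, one possible sketch computes the edge with floor() and then aggregates. The small `x` below is invented for illustration, and note that the bins here are closed on the left, [0, 0.5), [0.5, 1), and so on, which differs from the default intervals of cut().

```r
# Hypothetical toy data in the same shape as the question's x
x <- data.frame(Age = c(0, 0.002, 0.4, 0.5, 0.9, 1.2),
                d18O = c(3, 4, 5, 1, 3, 7))

# Lower edge of the 0.5-wide bin each row falls into
x$bin <- floor(x$Age / 0.5) * 0.5

# Count and mean of d18O per bin; do.call() flattens the matrix column that aggregate() returns
binned <- aggregate(d18O ~ bin, data = x,
                    FUN = function(v) c(count = length(v), mean = mean(v)))
binned <- do.call(data.frame, binned)
binned
```

Bins that contain no rows are simply absent from the result rather than listed with a count of 0.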
