Calculate the mean of some columns using dplyr::mutate - r

I want to calculate the mean of some columns using dplyr::mutate.
library(dplyr)
test <- data.frame(replicate(12, sample(1:12, 12, rep = T))) %>%
`colnames<-`(seq(1:12) %>% paste("BL", ., sep = ""))
The columns I want to include to calculate the mean are ONLY BL1 to BL9, so I do
test_again <- test %>%
rowwise() %>%
mutate(ave = mean(c(seq(1:9) %>% paste("BL", ., sep = ""))))
This does not work. I noticed that if I list the columns one by one, it works:
test_againAndAgain <- test %>%
rowwise() %>%
mutate(ave = mean(c(BL1, BL2, BL3, BL4, BL5, BL6, BL7, BL8, BL9)))
I suspect it's because I am passing strings instead of columns.
Can somebody explain this behavior? What will be the best solution for this?

You can use rowMeans with select(., BL1:BL9). Here select(., BL1:BL9) selects the columns from BL1 to BL9, and rowMeans calculates the average of each row. You can't pass a character vector to mutate and have it treated as columns: the strings are used as is, so mean() receives the names "BL1" to "BL9" rather than the data and returns NA with a warning:
test %>% mutate(ave = rowMeans(select(., BL1:BL9)))
# BL1 BL2 BL3 BL4 BL5 BL6 BL7 BL8 BL9 BL10 BL11 BL12 ave
#1 5 11 1 1 12 5 10 12 6 11 12 9 7.000000
#2 1 10 5 11 7 6 5 9 9 1 8 4 7.000000
#3 8 10 1 2 7 12 5 9 5 3 3 11 6.555556
#4 5 2 5 4 9 5 5 3 5 2 8 1 4.777778
#5 9 1 1 10 3 5 1 9 9 6 3 12 5.333333
#6 9 7 9 6 3 2 5 4 9 5 1 2 6.000000
#7 3 3 1 9 7 8 7 9 9 11 12 9 6.222222
#8 12 9 3 3 9 11 4 2 5 12 12 12 6.444444
#9 1 7 7 12 6 6 5 3 10 12 5 10 6.333333
#10 12 7 7 1 2 8 5 8 11 9 1 5 6.777778
#11 9 1 5 8 12 6 6 11 3 12 3 9 6.777778
#12 5 6 1 11 10 12 6 7 8 7 8 2 7.333333
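To make the point about strings versus columns concrete, here is a minimal sketch (not part of the original answer). It assumes dplyr 1.0.0 or later so that all_of() is available; the name cols and the seed are chosen only for illustration.

```r
library(dplyr)

set.seed(1)  # seed added only so this sketch is reproducible
test <- as.data.frame(replicate(12, sample(1:12, 12, replace = TRUE)))
colnames(test) <- paste0("BL", 1:12)

cols <- paste0("BL", 1:9)  # a plain character vector, not columns

# mean() of strings is NA (R also emits a warning), which is why the
# rowwise() attempt with pasted names fails
na_result <- suppressWarnings(mean(cols))

# all_of() tells select() to treat the strings as column names
res <- test %>% mutate(ave = rowMeans(select(., all_of(cols))))
```

If the column names are only known as strings at run time, wrapping them in all_of() is one way to keep the rowMeans approach.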

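Another option is pmap_dbl from purrr with mean (pmap_dbl returns a numeric vector, whereas plain pmap would make ave a list column):
library(tidyverse)
test %>%
  mutate(ave = select(., BL1:BL9) %>%
    pmap_dbl(~ mean(c(...))))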
# BL1 BL2 BL3 BL4 BL5 BL6 BL7 BL8 BL9 BL10 BL11 BL12 ave
#1 5 12 8 5 3 11 7 1 8 1 11 12 6.666667
#2 11 5 5 5 2 10 12 6 6 2 7 5 6.888889
#3 8 11 9 6 10 5 8 8 2 3 6 12 7.444444
#4 2 7 7 12 3 1 1 10 7 4 11 12 5.555556
#5 8 4 12 12 9 12 9 3 5 1 10 12 8.222222
#6 11 11 11 12 3 12 5 8 12 8 2 7 9.444444
#7 2 6 11 5 8 5 5 8 8 4 11 12 6.444444
#8 10 3 9 9 8 12 9 11 8 1 12 11 8.777778
#9 12 3 7 2 3 10 11 9 1 8 9 12 6.444444
#10 1 7 12 9 8 2 11 11 7 2 2 5 7.555556
#11 9 12 2 9 2 6 10 5 10 6 7 4 7.222222
#12 11 6 9 1 4 4 8 8 2 9 3 8 5.888889
NOTE: The values differ between the two outputs because no set.seed was used.

Related

Convert dataframe from vertical to horizontal

I already checked many questions and I don't seem to find the suitable answer.
I have this df
df = data.frame(x = 1:10,y=11:20)
the output
x y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
I just wish the output to be:
1 2 3 4 5 6 7 8 9 10
x 1 2 3 4 5 6 7 8 9 10
y 11 12 13 14 15 16 17 18 19 20
thanks
Try t() like below
> data.frame(t(df), check.names = FALSE)
1 2 3 4 5 6 7 8 9 10
x 1 2 3 4 5 6 7 8 9 10
y 11 12 13 14 15 16 17 18 19 20
A transpose should do it
setNames(data.frame(t(df)), df[,"x"])
1 2 3 4 5 6 7 8 9 10
x 1 2 3 4 5 6 7 8 9 10
y 11 12 13 14 15 16 17 18 19 20
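One caveat, shown as a small sketch: t() first converts the data frame to a matrix, so it works cleanly here because both columns are numeric. The data frame mixed below is made up for illustration and is not from the question.

```r
df <- data.frame(x = 1:10, y = 11:20)

# All-numeric input: the transpose stays numeric
wide <- data.frame(t(df), check.names = FALSE)

# Hypothetical mixed-type input: as.matrix() has to pick one type,
# so every value becomes a character string after t()
mixed <- data.frame(x = 1:3, label = c("a", "b", "c"))
t_mixed <- t(mixed)
```

With mixed column types, a reshaping function may be a better fit than t().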

How to order a dataframe by a column with hyphen in dplyr

I have a dataframe like below.
I want to always keep the dataframe in this order, but when I try to reorder it by id using dplyr::arrange(), the order changes in a way that I don't want. Is there any solution for this?
library(dplyr)
set.seed(10)
df <- data.frame(id = paste(2022,1:20, sep = "-"), weight = round(rnorm(20, 5, 1)))
df
id weight
1 2022-1 5
2 2022-2 5
3 2022-3 4
4 2022-4 4
5 2022-5 5
6 2022-6 5
7 2022-7 4
8 2022-8 5
9 2022-9 3
10 2022-10 5
11 2022-11 6
12 2022-12 6
13 2022-13 5
14 2022-14 6
15 2022-15 6
16 2022-16 5
17 2022-17 4
18 2022-18 5
19 2022-19 6
20 2022-20 5
wrong_order_df <- df %>% arrange(weight) %>% arrange(id)
wrong_order_df
id weight
1 2022-1 5
2 2022-10 5
3 2022-11 6
4 2022-12 6
5 2022-13 5
6 2022-14 6
7 2022-15 6
8 2022-16 5
9 2022-17 4
10 2022-18 5
11 2022-19 6
12 2022-2 5
13 2022-20 5
14 2022-3 4
15 2022-4 4
16 2022-5 5
17 2022-6 5
18 2022-7 4
19 2022-8 5
20 2022-9 3
The idea that I came up with is to add a new column just for working on this issue. But I believe there is a more elegant way.
correct_order_df <- wrong_order_df %>% mutate(id_order = as.numeric(str_extract(id, '\\b\\w+$'))) %>% arrange(id_order)
correct_order_df
id weight id_order
1 2022-1 5 1
2 2022-2 5 2
3 2022-3 4 3
4 2022-4 4 4
5 2022-5 5 5
6 2022-6 5 6
7 2022-7 4 7
8 2022-8 5 8
9 2022-9 3 9
10 2022-10 5 10
11 2022-11 6 11
12 2022-12 6 12
13 2022-13 5 13
14 2022-14 6 14
15 2022-15 6 15
16 2022-16 5 16
17 2022-17 4 17
18 2022-18 5 18
19 2022-19 6 19
20 2022-20 5 20
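You can put the expression directly inside arrange(), which avoids the helper column. Not sure how much more elegant we need than one line. Note that str_extract comes from stringr, so load it first. Let me know if this works:
library(stringr)
arrange(df, as.numeric(str_extract(id, "(?<=-)\\d+")))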
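The underlying issue is that id is a character column, so it is sorted as text, character by character, instead of by the number after the hyphen. A base R sketch of the same idea, assuming every id has the form year-number; the names shuffled and suffix are only for illustration:

```r
set.seed(10)
df <- data.frame(id = paste(2022, 1:20, sep = "-"),
                 weight = round(rnorm(20, 5, 1)))

# Shuffle the rows so there is something to restore
shuffled <- df[sample(nrow(df)), ]

# Drop everything up to the last hyphen and sort on the number that remains
suffix <- as.numeric(sub(".*-", "", shuffled$id))
sorted <- shuffled[order(suffix), ]
```

If the ids could span several years, ordering by the year part first and the suffix second would be needed; that case is not covered here.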

Limit Number of Items Displayed in Legend - GGplot R

I have a large taxonomic dataset that I need to plot as a stacked bar chart. Sample Data:
ID X A B C D E F G
1 5 9 6 7 4 8 10 6
2 6 3 9 10 3 10 4 8
3 6 6 5 8 8 8 8 1
4 9 3 2 8 4 1 5 8
5 6 6 2 8 3 7 4 10
6 0 7 8 9 1 4 9 10
7 3 2 6 8 8 1 8 7
8 4 7 10 2 9 7 9 8
9 5 7 9 10 8 2 2 1
10 0 4 6 8 9 10 7 1
11 8 9 2 2 6 5 1 7
12 8 6 0 9 7 9 8 1
13 2 8 4 4 4 2 6 7
14 4 6 6 4 9 9 3 5
15 8 1 0 6 5 8 1 1
16 6 6 9 3 9 2 1 1
17 2 4 0 2 4 8 10 9
18 5 9 8 9 4 9 3 9
19 0 2 1 6 6 9 6 2
20 3 3 7 10 4 5 6 8
21 2 6 6 9 8 10 9 4
22 7 7 1 6 8 3 7 1
23 1 9 4 5 8 9 7 7
24 0 8 5 9 1 8 9 1
25 2 1 0 1 1 2 10 7
26 10 4 1 8 2 5 9 0
27 2 7 10 10 2 3 8 6
28 6 4 2 6 7 3 1 0
29 8 1 3 4 1 10 3 6
30 1 6 5 4 7 9 7 10
31 4 4 3 2 2 9 0 4
32 9 6 6 1 6 1 5 2
The plotting part is no problem, using ggplot as below:
l5 <- read.xlsx(paste(taxawmeta,taxawmeta_files[2], sep = ""), sheetIndex = 1)
l5_long <- l5 %>% gather(taxa,value,-c(X.FinalSampleID,TimePoint_Luna))
ggplot(l5_long, aes(fill = taxa, y = value, x = X.FinalSampleID)) +
geom_bar(position='stack', stat='identity') +
theme_minimal() +
labs(x='Sample', y='Relative Abundance', title='Family Level Relative Abundance') +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position="none")
Where I'm running into an issue is that the actual dataset has almost 200 variables, meaning the legend is completely out of control. I know I can just hide the legend with:
theme(legend.position="none")
... but what I'd like to do is keep say the top 10 entries as those are the ones of most interest. Is there any simple method to limit the number of items that are displayed in the legend? Anything I've found so far seems very convoluted and not directly applicable to this problem.
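One possible approach, offered as a sketch and not as a tested answer for the real dataset: compute the most abundant taxa first, then pass them as breaks to the fill scale, so all bars are still drawn but only those taxa are listed in the legend. The small data frame below is made up and stands in for l5_long; it keeps the top 2 where the question wants the top 10.

```r
# Hypothetical long-format data standing in for l5_long
l5_long <- data.frame(
  sample = rep(c("s1", "s2"), each = 4),
  taxa   = rep(c("A", "B", "C", "D"), times = 2),
  value  = c(5, 1, 3, 1, 4, 2, 3, 1)
)

# Total abundance per taxon, then the names of the largest ones
totals   <- sapply(split(l5_long$value, l5_long$taxa), sum)
top_taxa <- names(sort(totals, decreasing = TRUE))[1:2]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # breaks limits which fill levels appear in the legend
  p <- ggplot(l5_long, aes(x = sample, y = value, fill = taxa)) +
    geom_col() +
    scale_fill_discrete(breaks = top_taxa)
}
```

Ranking by total abundance is an assumption; any other definition of "top" would only change how top_taxa is computed.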

How may I rowSum over a subset of variables by name with expression like a:b [duplicate]

dd <- data.frame(a=1:10,b=2:11,c=3:12)
dd %>% mutate( Total=rowSums(.[1:2]))
a b c Total
1 1 2 3 3
2 2 3 4 5
3 3 4 5 7
4 4 5 6 9
5 5 6 7 11
6 6 7 8 13
7 7 8 9 15
8 8 9 10 17
9 9 10 11 19
10 10 11 12 21
Is there a way to select the variables by name, like a:b? I have hundreds of variables, and their positions may change in another version of the dataset, so the safe way is to select them by a name range like a:b.
In dplyr 1.0.0 and later, you can use rowwise and c_across:
dd %>% rowwise() %>% mutate(Total = sum(c_across(a:b))) %>% ungroup()
a b c Total
<int> <int> <int> <int>
1 1 2 3 3
2 2 3 4 5
3 3 4 5 7
4 4 5 6 9
5 5 6 7 11
6 6 7 8 13
7 7 8 9 15
8 8 9 10 17
9 9 10 11 19
10 10 11 12 21
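For wide data, a vectorised variant may be faster than going row by row. A sketch, assuming the magrittr pipe so that `.` refers to dd; na.rm = TRUE is an optional extra and not part of the answer above:

```r
library(dplyr)

dd <- data.frame(a = 1:10, b = 2:11, c = 3:12)

# select(., a:b) picks the columns by name range, wherever they sit,
# and rowSums() adds them up across each row in one call
res <- dd %>% mutate(Total = rowSums(select(., a:b), na.rm = TRUE))
```

Note that rowSums() returns doubles, so Total is numeric here and not integer as in the c_across output.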

Specify unique levels when creating multiple factors

I have a dataframe which I am trying to turn into factors. I want each row to represent a factor, with the levels ordered in the order that the values appear. My code is falling short of this last task:
> x
V11 V12 V13 V21 V22 V23 V31 V32 V33 V41 V42 V43
r1 1 2 3 4 5 6 7 8 9 10 11 12
r2 1 2 3 4 5 6 10 11 12 7 8 9
r3 1 2 3 7 8 9 10 11 12 4 5 6
r4 4 5 6 7 8 9 10 11 12 1 2 3
>
> x %>%
+ t %>%
+ as_data_frame %>%
+ mutate_all(factor) %>%
+ lapply(., unlist)
$r1
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
$r2
[1] 1 2 3 4 5 6 10 11 12 7 8 9
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
$r3
[1] 1 2 3 7 8 9 10 11 12 4 5 6
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
$r4
[1] 4 5 6 7 8 9 10 11 12 1 2 3
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
Is there any way to specify that the levels should match the order of the values in each row of the initial dataframe (it is transposed by the first piped command)? Right now each factor has the same order of levels, which is incorrect.
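You need to specify the levels = argument inside factor(). By default factor() sorts the unique values to build the levels; unique() keeps them in order of first appearance:
lapply(data.frame(t(x)), function(v) factor(v, levels = unique(v)))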
#$r1
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
#Levels: 1 2 3 4 5 6 7 8 9 10 11 12
#$r2
# [1] 1 2 3 4 5 6 10 11 12 7 8 9
#Levels: 1 2 3 4 5 6 10 11 12 7 8 9
#$r3
# [1] 1 2 3 7 8 9 10 11 12 4 5 6
#Levels: 1 2 3 7 8 9 10 11 12 4 5 6
#$r4
# [1] 4 5 6 7 8 9 10 11 12 1 2 3
#Levels: 4 5 6 7 8 9 10 11 12 1 2 3
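The difference is easy to see on a single vector. A minimal sketch with made-up values; the names vals, f_default and f_order are only for illustration:

```r
vals <- c(10, 11, 12, 7, 8, 9)

# Default: levels are the sorted unique values
f_default <- factor(vals)

# levels = unique(vals): levels follow the order of first appearance
f_order <- factor(vals, levels = unique(vals))

levels(f_default)
levels(f_order)
```

The values stored in both factors are the same; only the order of the levels differs.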

Resources