Specify unique levels when creating multiple factors - r

I have a dataframe which I am trying to turn into factors. I want each row to represent a factor, with the levels ordered in the order that the values appear. My code is falling short of this last task:
> x
V11 V12 V13 V21 V22 V23 V31 V32 V33 V41 V42 V43
r1 1 2 3 4 5 6 7 8 9 10 11 12
r2 1 2 3 4 5 6 10 11 12 7 8 9
r3 1 2 3 7 8 9 10 11 12 4 5 6
r4 4 5 6 7 8 9 10 11 12 1 2 3
>
> x %>%
+ t %>%
+ as_data_frame %>%
+ mutate_all(factor) %>%
+ lapply(., unlist)
$r1
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
$r2
[1] 1 2 3 4 5 6 10 11 12 7 8 9
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
$r3
[1] 1 2 3 7 8 9 10 11 12 4 5 6
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
$r4
[1] 4 5 6 7 8 9 10 11 12 1 2 3
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
Is there any way to specify that the levels should match the other of each column in the initial dataframe (it was transformed as the first piped command); right now each factor has the same order of levels which is incorrect.

You need to specify the levels = argument inside factor():
lapply(data.frame(t(df)), function(x) factor(x, levels = unique(x)))
#$r1
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
#Levels: 1 2 3 4 5 6 7 8 9 10 11 12
#$r2
# [1] 1 2 3 4 5 6 10 11 12 7 8 9
#Levels: 1 2 3 4 5 6 10 11 12 7 8 9
#$r3
# [1] 1 2 3 7 8 9 10 11 12 4 5 6
#Levels: 1 2 3 7 8 9 10 11 12 4 5 6
#$r4
# [1] 4 5 6 7 8 9 10 11 12 1 2 3
#Levels: 4 5 6 7 8 9 10 11 12 1 2 3

Related

Limit Number of Items Displayed in Legend - GGplot R

I have a large taxonomic dataset that I need to plot as a stacked bar chart. Sample Data:
ID X A B C D E F G
1 5 9 6 7 4 8 10 6
2 6 3 9 10 3 10 4 8
3 6 6 5 8 8 8 8 1
4 9 3 2 8 4 1 5 8
5 6 6 2 8 3 7 4 10
6 0 7 8 9 1 4 9 10
7 3 2 6 8 8 1 8 7
8 4 7 10 2 9 7 9 8
9 5 7 9 10 8 2 2 1
10 0 4 6 8 9 10 7 1
11 8 9 2 2 6 5 1 7
12 8 6 0 9 7 9 8 1
13 2 8 4 4 4 2 6 7
14 4 6 6 4 9 9 3 5
15 8 1 0 6 5 8 1 1
16 6 6 9 3 9 2 1 1
17 2 4 0 2 4 8 10 9
18 5 9 8 9 4 9 3 9
19 0 2 1 6 6 9 6 2
20 3 3 7 10 4 5 6 8
21 2 6 6 9 8 10 9 4
22 7 7 1 6 8 3 7 1
23 1 9 4 5 8 9 7 7
24 0 8 5 9 1 8 9 1
25 2 1 0 1 1 2 10 7
26 10 4 1 8 2 5 9 0
27 2 7 10 10 2 3 8 6
28 6 4 2 6 7 3 1 0
29 8 1 3 4 1 10 3 6
30 1 6 5 4 7 9 7 10
31 4 4 3 2 2 9 0 4
32 9 6 6 1 6 1 5 2
The plotting part is no problem, using gggplot as below:-
l5 <- read.xlsx(paste(taxawmeta,taxawmeta_files[2], sep = ""), sheetIndex = 1)
l5_long <- l5 %>% gather(taxa,value,-c(X.FinalSampleID,TimePoint_Luna))
ggplot(l5_long, aes(fill=taxa, y = value, x = X.FinalSampleID, )) +
geom_bar(position='stack', stat='identity') +
theme_minimal() +
labs(x='Sample', y='Relative Abundance', title='Family Level Relative Abundance') +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position="none")
Where I'm running into an issue is the actual dataset has almost 200 variables. Meaning the legend is completely out of control. I know I can just hide the legend with:-
theme(.position="none")
... but what I'd like to do is keep say the top 10 entries as those are the ones of most interest. Is there any simple method to limit the number of items that are displayed in the legend? Anything I've found so far seems very convoluted and not directly applicable to this problem.

Create numbers based on different probability in R

I am trying to simulate a matrix of data set i*j, with i=2 ; j = 200, which represent subject and trial separately, and create random number between 0-10 based on trials with different probability. For first subject (i=1), the first 100 trials (j = 1-100) there is 70% probability to be number 1-5 and 30% probability to be number 6-10, and the probability reverse in trial 101 to 200. For second subject (i=2), the first 100 trials (j = 1-100) there is 60% probability to be number 1-5 and 40% probability to be number 6-10, and the probability reverse in trial 101 to 200.
I gave an example of 2 subjects because I need to do this with multiple i but not only 1 i.
Can I work this out with sample?
I guess what you are after is Stratified Sampling.
With base R, you can implement stratified sampling via sample, but you may need to define a user function like f as below
f <- function(N, p) {
c(
sapply(
list(p, rev(p)),
function(v) {
sapply(
sample(c(TRUE, FALSE), N, replace = TRUE, prob = v),
function(x) ifelse(x, sample(1:5, 1), sample(6:10, 1))
)
}
)
)
}
When you use it, you first define a probability list probs for each trial, e.g.,
probs <- list(c(0.7, 0.3), c(0.6, 0.4))
and then run
> lapply(probs, f, N = j)
[[1]]
[1] 2 1 2 5 3 6 9 2 2 2 3 2 3 7 4 5 3 7 1 4 10 2 3 6 8
[26] 7 8 3 1 2 5 1 4 4 4 2 1 5 5 4 1 6 4 2 9 10 5 1 1 5
[51] 4 4 3 4 8 4 10 3 2 1 3 4 7 4 2 10 1 4 3 3 5 2 7 6 5
[76] 3 10 4 2 2 5 1 2 3 2 3 3 2 9 10 10 10 10 3 1 4 3 1 1 5
[101] 8 6 5 9 1 6 1 9 10 4 5 4 6 5 8 2 4 10 6 3 8 5 10 8 8
[126] 8 9 3 8 6 5 7 10 9 6 8 9 5 6 8 4 6 6 7 4 4 8 10 10 6
[151] 9 10 9 7 8 7 3 7 4 6 10 8 10 8 5 6 10 8 9 6 6 1 9 4 8
[176] 1 5 10 7 10 8 7 6 6 5 4 7 7 8 8 1 10 8 5 8 9 4 5 6 7
[[2]]
[1] 7 9 4 9 5 3 3 9 4 5 6 10 4 5 2 3 2 5 4 5 3 8 5 2 1
[26] 6 5 3 9 3 9 9 9 8 7 3 4 5 7 3 5 3 5 7 5 3 4 2 6 4
[51] 7 6 2 7 4 4 10 4 10 2 8 10 3 2 8 1 8 10 8 4 3 2 9 8 4
[76] 4 10 1 3 10 6 8 6 3 5 2 3 3 9 4 7 5 1 1 1 3 10 5 2 7
[101] 2 10 2 6 8 10 10 7 3 7 3 3 7 1 10 3 4 1 1 8 2 5 2 4 7
[126] 2 7 7 4 9 10 7 1 4 4 9 7 9 9 9 8 4 1 10 6 10 4 4 8 9
[151] 7 8 3 2 9 1 9 7 6 9 1 6 3 9 7 8 5 9 3 8 9 6 5 1 2
[176] 5 10 2 7 8 7 8 8 8 8 8 5 1 1 7 6 3 3 4 2 3 2 3 1 3

Calculate the mean of some columns using dplyr::mutate

I want to calculate the mean of some columns using dplyr::mutate.
library(dplyr)
test <- data.frame(replicate(12, sample(1:12, 12, rep = T))) %>%
`colnames<-`(seq(1:12) %>% paste("BL", ., sep = ""))
The columns I want to include to calculate the mean are ONLY BL1 to BL9, so I do
test_again <- test %>%
rowwise() %>%
mutate(ave = mean(c(seq(1:9) %>% paste("BL", ., sep = ""))))
This would not work. I notice if I put the column one by one, it works
test_againAndAgain <- test %>%
rowwise() %>%
mutate(ave = mean(c(BL1, BL2, BL3, BL4, BL5, BL6, BL7, BL8, BL9)))
I suspected it's because I give the strings instead of "columns".
Can somebody explain this behavior? What will be the best solution for this?
You can use rowMeans with select(., BL1:BL9); Here select(., BL1:BL9) select columns from BL1 to BL9 and rowMeans calculate the row average; You can't directly use a character vector in mutate as columns, which will be treated as is instead of columns:
test %>% mutate(ave = rowMeans(select(., BL1:BL9)))
# BL1 BL2 BL3 BL4 BL5 BL6 BL7 BL8 BL9 BL10 BL11 BL12 ave
#1 5 11 1 1 12 5 10 12 6 11 12 9 7.000000
#2 1 10 5 11 7 6 5 9 9 1 8 4 7.000000
#3 8 10 1 2 7 12 5 9 5 3 3 11 6.555556
#4 5 2 5 4 9 5 5 3 5 2 8 1 4.777778
#5 9 1 1 10 3 5 1 9 9 6 3 12 5.333333
#6 9 7 9 6 3 2 5 4 9 5 1 2 6.000000
#7 3 3 1 9 7 8 7 9 9 11 12 9 6.222222
#8 12 9 3 3 9 11 4 2 5 12 12 12 6.444444
#9 1 7 7 12 6 6 5 3 10 12 5 10 6.333333
#10 12 7 7 1 2 8 5 8 11 9 1 5 6.777778
#11 9 1 5 8 12 6 6 11 3 12 3 9 6.777778
#12 5 6 1 11 10 12 6 7 8 7 8 2 7.333333
Or another option is pmap with mean
library(tidyverse)
test %>%
mutate(ave = select(., BL1:BL9) %>%
pmap(~ mean(c(...))))
# BL1 BL2 BL3 BL4 BL5 BL6 BL7 BL8 BL9 BL10 BL11 BL12 ave
#1 5 12 8 5 3 11 7 1 8 1 11 12 6.666667
#2 11 5 5 5 2 10 12 6 6 2 7 5 6.888889
#3 8 11 9 6 10 5 8 8 2 3 6 12 7.444444
#4 2 7 7 12 3 1 1 10 7 4 11 12 5.555556
#5 8 4 12 12 9 12 9 3 5 1 10 12 8.222222
#6 11 11 11 12 3 12 5 8 12 8 2 7 9.444444
#7 2 6 11 5 8 5 5 8 8 4 11 12 6.444444
#8 10 3 9 9 8 12 9 11 8 1 12 11 8.777778
#9 12 3 7 2 3 10 11 9 1 8 9 12 6.444444
#10 1 7 12 9 8 2 11 11 7 2 2 5 7.555556
#11 9 12 2 9 2 6 10 5 10 6 7 4 7.222222
#12 11 6 9 1 4 4 8 8 2 9 3 8 5.888889
NOTE: The values are difference as there was no set.seed

Create a new variable based on existing variable

My current dataset look like this
Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6
I want to create a new variable called "V2" based on the variables "Order" and "V1". For every 8 items in the "Order" variable, I want to assign a value of "0" in "V2" if the varialbe "Order" has observation equals to 1; otherwise, "V2" takes the value of previous item in "V1".
This is the dataset that I want
Order V1 V2
1 7 0
2 5 7
3 8 5
4 5 8
5 8 5
6 3 8
7 4 3
8 2 4
1 8 0
2 6 8
3 3 6
4 4 3
5 5 4
6 7 5
7 3 7
8 6 3
Since my actual dataset is very large, I'm trying to use for loop with if statement to generate "V2". But my code keeps failing. I appreciate if anyone can help me on this, and I'm open to other statements. Thank you!
(Up front: I am assuming that the order of Order is perfectly controlled.)
You need simply ifelse and lag:
df <- read.table(text="Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6 ", header=T)
df$V2 <- ifelse(df$Order==1, 0, lag(df$V1))
df
# Order V1 V2
# 1 1 7 0
# 2 2 5 7
# 3 3 8 5
# 4 4 5 8
# 5 5 8 5
# 6 6 3 8
# 7 7 4 3
# 8 8 2 4
# 9 1 8 0
# 10 2 6 8
# 11 3 3 6
# 12 4 4 3
# 13 5 5 4
# 14 6 7 5
# 15 7 3 7
# 16 8 6 3
with(dat,{V2<-c(0,head(V1,-1));V2[Order==1]<-0;dat$V2<-V2;dat})
Order V1 V2
1 1 7 0
2 2 5 7
3 3 8 5
4 4 5 8
5 5 8 5
6 6 3 8
7 7 4 3
8 8 2 4
9 1 8 0
10 2 6 8
11 3 3 6
12 4 4 3
13 5 5 4
14 6 7 5
15 7 3 7
16 8 6 3

Sequentially reorganize a vector in R

I have a numeric element z as below:
> sort(z)
[1] 1 5 5 5 6 6 7 7 7 7 7 9 9
I would like to sequentially reorganize this element so to have
> z
[1] 1 2 2 2 3 3 4 4 4 4 4 5 5
I guess converting z to a factor and use it as an index should be the way.
You answered it yourself really:
as.integer(factor(sort(z)))
I know this has been accepted already but I decided to look inside factor() to see how it's done there. It more or less comes down to this:
x <- sort(z)
match(x, unique(x))
Which is an extra line I suppose but it should be faster if that matters.
This should do the trick
z = sort(sample(1:10, 100, replace = TRUE))
cumsum(diff(z)) + 1
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[26] 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6
[51] 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
Note that diff omits the first element of the series. So to compensate:
c(1, cumsum(diff(z)) + 1)
Alternative using rle:
z = sort(sample(1:10, 100, replace = TRUE))
rle_result = rle(sort(z))
rep(rle_result$values, rle_result$lengths)
> rep(rle_result$values, rle_result$lengths)
[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[26] 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6
[51] 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
rep(seq_along(rle(x)$l), rle(x)$l)

Resources