Sequentially reorganize a vector in R

I have a numeric vector z as below:
> sort(z)
[1] 1 5 5 5 6 6 7 7 7 7 7 9 9
I would like to sequentially reorganize this vector so as to have
> z
[1] 1 2 2 2 3 3 4 4 4 4 4 5 5
I guess converting z to a factor and using it as an index should be the way.

You answered it yourself really:
as.integer(factor(sort(z)))
I know this has been accepted already but I decided to look inside factor() to see how it's done there. It more or less comes down to this:
x <- sort(z)
match(x, unique(x))
Which is an extra line I suppose but it should be faster if that matters.

This should do the trick
z = sort(sample(1:10, 100, replace = TRUE))
cumsum(diff(z)) + 1
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[26] 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6
[51] 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
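Note that diff omits the first element of the series, so a leading 1 has to be prepended to compensate. Also note that cumsum(diff(z)) only works here because consecutive distinct values differ by exactly 1. For values with larger gaps (such as 1 5 5 5 6 ... in the question), count the changes instead:
c(1, cumsum(diff(z) != 0) + 1)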

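Alternative using rle (note that this reproduces the sorted values themselves, which match the desired result only when the values are already consecutive integers starting at 1):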
z = sort(sample(1:10, 100, replace = TRUE))
rle_result = rle(sort(z))
rep(rle_result$values, rle_result$lengths)
> rep(rle_result$values, rle_result$lengths)
[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[26] 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6
[51] 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10

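To get sequential labels for arbitrary sorted values x with rle, repeat the run index rather than the run value:
rep(seq_along(rle(x)$lengths), rle(x)$lengths)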
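For comparison, here is a minimal sketch that runs the approaches from the answers above on the sorted values printed in the question and checks that they agree. The names r_factor, r_match, r_diff and r_rle are made up for this illustration.

```r
# Sorted values as printed in the question
x <- c(1, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 9, 9)

# Four ways to label each distinct value 1, 2, 3, ... in sorted order
r_factor <- as.integer(factor(x))                       # factor codes
r_match  <- match(x, unique(x))                         # position among distinct values
r_diff   <- c(1, cumsum(diff(x) != 0) + 1)              # count the value changes
runs     <- rle(x)
r_rle    <- rep(seq_along(runs$lengths), runs$lengths)  # repeat the run index

print(r_factor)
```

All four rely on x being sorted, except the factor version, which sorts the levels itself.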

Related

Limit Number of Items Displayed in Legend - GGplot R

I have a large taxonomic dataset that I need to plot as a stacked bar chart. Sample Data:
ID X A B C D E F G
1 5 9 6 7 4 8 10 6
2 6 3 9 10 3 10 4 8
3 6 6 5 8 8 8 8 1
4 9 3 2 8 4 1 5 8
5 6 6 2 8 3 7 4 10
6 0 7 8 9 1 4 9 10
7 3 2 6 8 8 1 8 7
8 4 7 10 2 9 7 9 8
9 5 7 9 10 8 2 2 1
10 0 4 6 8 9 10 7 1
11 8 9 2 2 6 5 1 7
12 8 6 0 9 7 9 8 1
13 2 8 4 4 4 2 6 7
14 4 6 6 4 9 9 3 5
15 8 1 0 6 5 8 1 1
16 6 6 9 3 9 2 1 1
17 2 4 0 2 4 8 10 9
18 5 9 8 9 4 9 3 9
19 0 2 1 6 6 9 6 2
20 3 3 7 10 4 5 6 8
21 2 6 6 9 8 10 9 4
22 7 7 1 6 8 3 7 1
23 1 9 4 5 8 9 7 7
24 0 8 5 9 1 8 9 1
25 2 1 0 1 1 2 10 7
26 10 4 1 8 2 5 9 0
27 2 7 10 10 2 3 8 6
28 6 4 2 6 7 3 1 0
29 8 1 3 4 1 10 3 6
30 1 6 5 4 7 9 7 10
31 4 4 3 2 2 9 0 4
32 9 6 6 1 6 1 5 2
The plotting part is no problem, using ggplot as below:-
l5 <- read.xlsx(paste(taxawmeta,taxawmeta_files[2], sep = ""), sheetIndex = 1)
l5_long <- l5 %>% gather(taxa,value,-c(X.FinalSampleID,TimePoint_Luna))
ggplot(l5_long, aes(fill=taxa, y = value, x = X.FinalSampleID)) +
geom_bar(position='stack', stat='identity') +
theme_minimal() +
labs(x='Sample', y='Relative Abundance', title='Family Level Relative Abundance') +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position="none")
Where I'm running into an issue is the actual dataset has almost 200 variables. Meaning the legend is completely out of control. I know I can just hide the legend with:-
theme(legend.position="none")
... but what I'd like to do is keep say the top 10 entries as those are the ones of most interest. Is there any simple method to limit the number of items that are displayed in the legend? Anything I've found so far seems very convoluted and not directly applicable to this problem.
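One possible approach, sketched under assumptions: discrete ggplot2 scales accept a breaks argument, and only the levels listed there get a legend entry, while every level is still drawn. The toy data, the name n_keep, and ranking taxa by their total value are all assumptions made for this illustration; "top" may mean something else for the real dataset.

```r
# Toy long-format data standing in for l5_long (hypothetical values)
l5_long <- data.frame(
  sample = rep(c("s1", "s2", "s3"), times = 4),
  taxa   = rep(c("A", "B", "C", "D"), each = 3),
  value  = c(5, 6, 6,  9, 3, 6,  1, 2, 1,  7, 10, 8)
)

# Rank taxa by total value and keep the names of the top n_keep
n_keep   <- 2  # would be 10 for the real data
totals   <- tapply(l5_long$value, l5_long$taxa, sum)
top_taxa <- names(totals)[order(totals, decreasing = TRUE)][seq_len(n_keep)]

# All taxa are still stacked in the bars; `breaks` only restricts the legend
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(l5_long, aes(x = sample, y = value, fill = taxa)) +
    geom_bar(position = "stack", stat = "identity") +
    scale_fill_discrete(breaks = top_taxa) +
    theme_minimal()
  # print(p) draws the plot
}
```

With a manual palette the same breaks argument can be passed to scale_fill_manual instead.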

Create numbers based on different probability in R

I am trying to simulate an i*j data matrix, with i = 2 and j = 200 representing subjects and trials respectively, filled with random numbers between 1 and 10 whose probabilities depend on the trial. For the first subject (i = 1), in the first 100 trials (j = 1-100) there is a 70% probability of a number in 1-5 and a 30% probability of a number in 6-10, and the probabilities reverse in trials 101 to 200. For the second subject (i = 2), in the first 100 trials (j = 1-100) there is a 60% probability of a number in 1-5 and a 40% probability of a number in 6-10, and the probabilities again reverse in trials 101 to 200.
I give an example with 2 subjects because I need this to work for multiple subjects, not just one.
Can I work this out with sample?
I guess what you are after is Stratified Sampling.
With base R, you can implement stratified sampling via sample, but you may need to define a user function like f as below
f <- function(N, p) {
c(
sapply(
list(p, rev(p)),
function(v) {
sapply(
sample(c(TRUE, FALSE), N, replace = TRUE, prob = v),
function(x) ifelse(x, sample(1:5, 1), sample(6:10, 1))
)
}
)
)
}
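When you use it, you first define a probability list probs with one entry per subject, e.g.,
probs <- list(c(0.7, 0.3), c(0.6, 0.4))
and then run the following, where j is the number of trials per block (100 here), since f returns 2 * N values per subject: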
> lapply(probs, f, N = j)
[[1]]
[1] 2 1 2 5 3 6 9 2 2 2 3 2 3 7 4 5 3 7 1 4 10 2 3 6 8
[26] 7 8 3 1 2 5 1 4 4 4 2 1 5 5 4 1 6 4 2 9 10 5 1 1 5
[51] 4 4 3 4 8 4 10 3 2 1 3 4 7 4 2 10 1 4 3 3 5 2 7 6 5
[76] 3 10 4 2 2 5 1 2 3 2 3 3 2 9 10 10 10 10 3 1 4 3 1 1 5
[101] 8 6 5 9 1 6 1 9 10 4 5 4 6 5 8 2 4 10 6 3 8 5 10 8 8
[126] 8 9 3 8 6 5 7 10 9 6 8 9 5 6 8 4 6 6 7 4 4 8 10 10 6
[151] 9 10 9 7 8 7 3 7 4 6 10 8 10 8 5 6 10 8 9 6 6 1 9 4 8
[176] 1 5 10 7 10 8 7 6 6 5 4 7 7 8 8 1 10 8 5 8 9 4 5 6 7
[[2]]
[1] 7 9 4 9 5 3 3 9 4 5 6 10 4 5 2 3 2 5 4 5 3 8 5 2 1
[26] 6 5 3 9 3 9 9 9 8 7 3 4 5 7 3 5 3 5 7 5 3 4 2 6 4
[51] 7 6 2 7 4 4 10 4 10 2 8 10 3 2 8 1 8 10 8 4 3 2 9 8 4
[76] 4 10 1 3 10 6 8 6 3 5 2 3 3 9 4 7 5 1 1 1 3 10 5 2 7
[101] 2 10 2 6 8 10 10 7 3 7 3 3 7 1 10 3 4 1 1 8 2 5 2 4 7
[126] 2 7 7 4 9 10 7 1 4 4 9 7 9 9 9 8 4 1 10 6 10 4 4 8 9
[151] 7 8 3 2 9 1 9 7 6 9 1 6 3 9 7 8 5 9 3 8 9 6 5 1 2
[176] 5 10 2 7 8 7 8 8 8 8 8 5 1 1 7 6 3 3 4 2 3 2 3 1 3
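A vectorized alternative, as a rough sketch: sample accepts one weight per candidate value through prob, so each block can be drawn in a single call. This assumes, like the answer above, that the numbers within 1-5 and within 6-10 are equally likely. The function name simulate_subject and the argument n_per_block are made up for this illustration.

```r
# Draw 2 * n_per_block values in 1:10 for one subject.
# p = c(P(1-5), P(6-10)) for the first block; reversed for the second block.
simulate_subject <- function(n_per_block, p) {
  w_first  <- rep(p / 5, each = 5)       # per-number weights, first block
  w_second <- rep(rev(p) / 5, each = 5)  # per-number weights, second block
  c(sample(1:10, n_per_block, replace = TRUE, prob = w_first),
    sample(1:10, n_per_block, replace = TRUE, prob = w_second))
}

set.seed(1)  # only to make the sketch reproducible
probs <- list(c(0.7, 0.3), c(0.6, 0.4))

# One row per subject, one column per trial
sim <- t(sapply(probs, simulate_subject, n_per_block = 100))
dim(sim)
```

Adding more subjects only means adding more entries to probs.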

Create a new variable based on existing variable

My current dataset looks like this
Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6
I want to create a new variable called "V2" based on the variables "Order" and "V1". Within every block of 8 items in "Order", "V2" should be 0 where "Order" equals 1; otherwise, "V2" takes the value of the previous item in "V1".
This is the dataset that I want
Order V1 V2
1 7 0
2 5 7
3 8 5
4 5 8
5 8 5
6 3 8
7 4 3
8 2 4
1 8 0
2 6 8
3 3 6
4 4 3
5 5 4
6 7 5
7 3 7
8 6 3
Since my actual dataset is very large, I'm trying to use a for loop with an if statement to generate "V2", but my code keeps failing. I would appreciate any help with this, and I'm open to other approaches. Thank you!
(Up front: I am assuming that the order of Order is perfectly controlled.)
You simply need ifelse and dplyr's lag (base R's stats::lag does not shift a plain vector, so dplyr has to be installed):
df <- read.table(text="Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6 ", header=T)
df$V2 <- ifelse(df$Order==1, 0, dplyr::lag(df$V1))  # dplyr::lag shifts the values down one row
df
# Order V1 V2
# 1 1 7 0
# 2 2 5 7
# 3 3 8 5
# 4 4 5 8
# 5 5 8 5
# 6 6 3 8
# 7 7 4 3
# 8 8 2 4
# 9 1 8 0
# 10 2 6 8
# 11 3 3 6
# 12 4 4 3
# 13 5 5 4
# 14 6 7 5
# 15 7 3 7
# 16 8 6 3
with(dat,{V2<-c(0,head(V1,-1));V2[Order==1]<-0;dat$V2<-V2;dat})
Order V1 V2
1 1 7 0
2 2 5 7
3 3 8 5
4 4 5 8
5 5 8 5
6 6 3 8
7 7 4 3
8 8 2 4
9 1 8 0
10 2 6 8
11 3 3 6
12 4 4 3
13 5 5 4
14 6 7 5
15 7 3 7
16 8 6 3
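For reference, a base R sketch of the same idea spelled out step by step, without any packages. The intermediate name prev_v1 is made up for this illustration, and like the answers above it assumes the rows are already in the right order.

```r
dat <- data.frame(
  Order = rep(1:8, times = 2),
  V1    = c(7, 5, 8, 5, 8, 3, 4, 2, 8, 6, 3, 4, 5, 7, 3, 6)
)

# Shift V1 down by one row: drop the last value and pad the front with NA
prev_v1 <- c(NA, head(dat$V1, -1))

# The first row of each block has no previous item, so it gets 0
dat$V2 <- ifelse(dat$Order == 1, 0, prev_v1)
dat
```

Because everything is vectorized, no for loop is needed even for a large dataset.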

Create a vector using rep() and seq()

How to create a vector sequence of:
2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
I tried to use:
2:8+rep(0:6,each=6)
but the result is:
2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 .... 12 13 14
Please help. Thanks.
This should accomplish what you're looking for:
x = 2
VecSeq = c(x:8)
while (x < 7) {
x = x + 1
calc = c(x:8)
VecSeq = c(VecSeq, calc)
}
VecSeq # Your desired vector
You could do this:
library(purrr)
unlist(map(2:7, ~.x:8))
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
And a little function in base R:
funky_vec <- function(from,to){unlist(sapply(from:(to-1),`:`,to))}
funky_vec(2,8)
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
This is made really easy with sequence (since R 4.0.0):
sequence(7:2, 2:7)
# [1] 2 3 4 5 6 7 8 3 4 5 6 7 8 4 5 6 7 8 5 6 7 8 6 7 8 7 8
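On R versions before 4.0.0, where sequence has no from argument, a similar result can be sketched with the one-argument form of sequence plus rep, which is close to what the question title asks for. The names lens and vec are made up for this illustration.

```r
# Block lengths 7, 6, ..., 2
lens <- 7:2

# sequence(lens) gives 1:7, 1:6, ..., 1:2 back to back;
# adding the block number (1 for the first block, 2 for the second, ...)
# shifts the blocks to 2:8, 3:8, ..., 7:8
vec <- sequence(lens) + rep(seq_along(lens), times = lens)
vec
```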

Converting multiple histogram frequency count into an array in R

For each row in the matrix "result" shown below
A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6
I would like to plot a histogram for each row with 3 bins as shown below:
samp<-result[1,]
hist(samp, breaks = 3, col="lightblue", border="pink")
Now what is needed is to convert the histogram frequency counts into an array as follows
Say I have 4 bins, where the first bin has count 5, the second has count 2, the third is empty and the fourth has count 3. I want the counts of all of these bins, computed from the data in result (for every row), as a vector:
row1 5 2 0 3
For hundreds of rows I would like to do it in an automated way and hence posted this question.
In the end the matrix should look like
bin 2-4 bin 4-6 bin 6-8 bin 8-10
row 1 5 2 0 3
row 2
row 3
row 4
row 5
row 6
row 7
row 8
row 9
DF <- read.table(text="A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6", header=TRUE)
m <- as.matrix(DF)
apply(m,1,function(x) hist(x,breaks = 3)$counts)
# $`1`
# [1] 5 2 0 3
#
# $`2`
# [1] 5 0 2 3
#
# $`3`
# [1] 6 3 1
#
# $`4`
# [1] 1 6 2 1
#
# $`5`
# [1] 3 3 4
#
# $`6`
# [1] 3 4 2 1
#
# $`7`
# [1] 2 5 3
#
# $`8`
# [1] 6 3 1
#
# $`9`
# [1] 4 4 0 2
Note that according to the documentation the number of breaks is only a suggestion. If you want to have the same number of breaks in all rows, you should do the binning outside of hist:
breaks <- 1:5*2
t(apply(m,1,function(x) table(cut(x,breaks,include.lowest = TRUE))))
# [2,4] (4,6] (6,8] (8,10]
# 1 5 2 0 3
# 2 1 4 5 0
# 3 4 2 3 1
# 4 1 6 2 1
# 5 3 3 4 0
# 6 0 3 6 1
# 7 2 5 3 0
# 8 2 4 3 1
# 9 0 4 6 0
You could access the counts vector which is returned by hist (see ?hist for details):
counts <- hist(samp, breaks = 3, col="lightblue", border="pink")$counts
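If only the counts are needed, a possible variation (a sketch, not the only way) is to pass an explicit vector of break points to hist together with plot = FALSE, so every row uses the same bins and no plot is drawn. Only the first two rows of the matrix are used here to keep it short, and the column labels are made up for this illustration.

```r
# First two rows of the matrix from the question
m <- matrix(c(4, 6, 3, 5, 9, 9, 9, 3, 4, 4,
              5, 7, 5, 5, 8, 8, 8, 7, 4, 5),
            nrow = 2, byrow = TRUE)

# Fixed bin edges shared by all rows: [2,4], (4,6], (6,8], (8,10]
breaks <- seq(2, 10, by = 2)

# One row of counts per input row; plot = FALSE suppresses drawing
counts <- t(apply(m, 1, function(x) hist(x, breaks = breaks, plot = FALSE)$counts))
colnames(counts) <- c("2-4", "4-6", "6-8", "8-10")
counts
```

Since a vector of breaks is taken literally rather than as a suggestion, every row ends up with the same number of bins.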
