Limit Number of Items Displayed in Legend - ggplot2 (R)

I have a large taxonomic dataset that I need to plot as a stacked bar chart. Sample Data:
ID X A B C D E F G
1 5 9 6 7 4 8 10 6
2 6 3 9 10 3 10 4 8
3 6 6 5 8 8 8 8 1
4 9 3 2 8 4 1 5 8
5 6 6 2 8 3 7 4 10
6 0 7 8 9 1 4 9 10
7 3 2 6 8 8 1 8 7
8 4 7 10 2 9 7 9 8
9 5 7 9 10 8 2 2 1
10 0 4 6 8 9 10 7 1
11 8 9 2 2 6 5 1 7
12 8 6 0 9 7 9 8 1
13 2 8 4 4 4 2 6 7
14 4 6 6 4 9 9 3 5
15 8 1 0 6 5 8 1 1
16 6 6 9 3 9 2 1 1
17 2 4 0 2 4 8 10 9
18 5 9 8 9 4 9 3 9
19 0 2 1 6 6 9 6 2
20 3 3 7 10 4 5 6 8
21 2 6 6 9 8 10 9 4
22 7 7 1 6 8 3 7 1
23 1 9 4 5 8 9 7 7
24 0 8 5 9 1 8 9 1
25 2 1 0 1 1 2 10 7
26 10 4 1 8 2 5 9 0
27 2 7 10 10 2 3 8 6
28 6 4 2 6 7 3 1 0
29 8 1 3 4 1 10 3 6
30 1 6 5 4 7 9 7 10
31 4 4 3 2 2 9 0 4
32 9 6 6 1 6 1 5 2
The plotting part is no problem, using ggplot as below:
library(xlsx)     # read.xlsx(); the sheetIndex argument suggests the xlsx package
library(tidyr)    # gather() and %>%
library(ggplot2)
# Read the first sheet, then reshape to long format (one row per sample/taxon pair)
l5 <- read.xlsx(paste(taxawmeta, taxawmeta_files[2], sep = ""), sheetIndex = 1)
l5_long <- l5 %>% gather(taxa, value, -c(X.FinalSampleID, TimePoint_Luna))
ggplot(l5_long, aes(fill = taxa, y = value, x = X.FinalSampleID)) +
  geom_bar(position = 'stack', stat = 'identity') +
  theme_minimal() +
  labs(x = 'Sample', y = 'Relative Abundance', title = 'Family Level Relative Abundance') +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "none")
Where I'm running into an issue is that the actual dataset has almost 200 variables, so the legend is completely out of control. I know I can just hide the legend with:
theme(legend.position="none")
... but what I'd like to do is keep, say, the top 10 entries, as those are the ones of most interest. Is there any simple method to limit the number of items that are displayed in the legend? Anything I've found so far seems very convoluted and not directly applicable to this problem.
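One possible approach, sketched here on made-up data rather than the real spreadsheet, is to pass the names of the taxa of interest to the breaks argument of the fill scale. The toy data frame, the "taxon" names and the choice of ranking by total value are all assumptions for illustration; with the real data the same idea would apply to l5_long.

```r
library(ggplot2)

# Hypothetical long-format data standing in for l5_long: 8 samples x 30 taxa
set.seed(1)
toy <- expand.grid(sample = factor(1:8), taxa = paste0("taxon", 1:30))
toy$value <- runif(nrow(toy))

# Names of the 10 taxa with the largest total value across all samples
totals <- tapply(toy$value, toy$taxa, sum)
top10 <- names(sort(totals, decreasing = TRUE))[1:10]

# Every taxon is still drawn in the bars; `breaks` only restricts
# which keys appear in the legend
p <- ggplot(toy, aes(x = sample, y = value, fill = taxa)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_discrete(breaks = top10)
```

Printing p draws the stacked bars with a ten-entry legend. Ranking by total value is only one way to define "top 10"; any character vector of taxon names can be supplied to breaks.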

Related

Create numbers based on different probability in R

I am trying to simulate an i*j matrix of data, with i = 2 and j = 200, where i indexes subjects and j indexes trials, and to fill it with random numbers in the range 0-10 whose probabilities depend on the trial. For the first subject (i = 1), in the first 100 trials (j = 1-100) there is a 70% probability of drawing a number from 1-5 and a 30% probability of drawing a number from 6-10, and the probabilities are reversed in trials 101 to 200. For the second subject (i = 2), in the first 100 trials (j = 1-100) there is a 60% probability of drawing a number from 1-5 and a 40% probability of drawing a number from 6-10, and the probabilities are reversed in trials 101 to 200.
I gave an example with 2 subjects because I need to do this for multiple values of i, not just one.
Can I work this out with sample?
I guess what you are after is Stratified Sampling.
With base R, you can implement stratified sampling via sample, but you may need to define a user function like f as below
f <- function(N, p) {
  c(
    sapply(
      list(p, rev(p)),
      function(v) {
        sapply(
          sample(c(TRUE, FALSE), N, replace = TRUE, prob = v),
          function(x) ifelse(x, sample(1:5, 1), sample(6:10, 1))
        )
      }
    )
  )
}
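When you use it, you first define a list of probabilities probs, one entry per subject, e.g.,
probs <- list(c(0.7, 0.3), c(0.6, 0.4))
and then run it with N set to the number of trials in each half (100 here, so each subject gets 200 values):
> lapply(probs, f, N = 100)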
[[1]]
[1] 2 1 2 5 3 6 9 2 2 2 3 2 3 7 4 5 3 7 1 4 10 2 3 6 8
[26] 7 8 3 1 2 5 1 4 4 4 2 1 5 5 4 1 6 4 2 9 10 5 1 1 5
[51] 4 4 3 4 8 4 10 3 2 1 3 4 7 4 2 10 1 4 3 3 5 2 7 6 5
[76] 3 10 4 2 2 5 1 2 3 2 3 3 2 9 10 10 10 10 3 1 4 3 1 1 5
[101] 8 6 5 9 1 6 1 9 10 4 5 4 6 5 8 2 4 10 6 3 8 5 10 8 8
[126] 8 9 3 8 6 5 7 10 9 6 8 9 5 6 8 4 6 6 7 4 4 8 10 10 6
[151] 9 10 9 7 8 7 3 7 4 6 10 8 10 8 5 6 10 8 9 6 6 1 9 4 8
[176] 1 5 10 7 10 8 7 6 6 5 4 7 7 8 8 1 10 8 5 8 9 4 5 6 7
[[2]]
[1] 7 9 4 9 5 3 3 9 4 5 6 10 4 5 2 3 2 5 4 5 3 8 5 2 1
[26] 6 5 3 9 3 9 9 9 8 7 3 4 5 7 3 5 3 5 7 5 3 4 2 6 4
[51] 7 6 2 7 4 4 10 4 10 2 8 10 3 2 8 1 8 10 8 4 3 2 9 8 4
[76] 4 10 1 3 10 6 8 6 3 5 2 3 3 9 4 7 5 1 1 1 3 10 5 2 7
[101] 2 10 2 6 8 10 10 7 3 7 3 3 7 1 10 3 4 1 1 8 2 5 2 4 7
[126] 2 7 7 4 9 10 7 1 4 4 9 7 9 9 9 8 4 1 10 6 10 4 4 8 9
[151] 7 8 3 2 9 1 9 7 6 9 1 6 3 9 7 8 5 9 3 8 9 6 5 1 2
[176] 5 10 2 7 8 7 8 8 8 8 8 5 1 1 7 6 3 3 4 2 3 2 3 1 3
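The same idea can be written without the nested sapply calls. The following is only a sketch: the helper names draw_block and simulate_subject are made up for illustration, and it assumes the values are meant to come from 1-5 and 6-10 as described in the question.

```r
# Hypothetical helper: n draws from 1:10, where each draw falls in 1:5
# with probability p_low and in 6:10 otherwise
draw_block <- function(n, p_low) {
  low <- runif(n) < p_low
  ifelse(low,
         sample(1:5, n, replace = TRUE),
         sample(6:10, n, replace = TRUE))
}

# One subject: n_half trials with p_low, then n_half trials with the
# probabilities reversed
simulate_subject <- function(p_low, n_half = 100) {
  c(draw_block(n_half, p_low), draw_block(n_half, 1 - p_low))
}

set.seed(42)
sims <- lapply(c(0.7, 0.6), simulate_subject)

# Rough check: share of low values in each half for the first subject
mean(sims[[1]][1:100] <= 5)
mean(sims[[1]][101:200] <= 5)
```

With only 100 trials per half, the observed shares will scatter around 0.7 and 0.3 rather than match them exactly. Adding more subjects only requires more entries in the probability vector passed to lapply.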

Change units of time dimension in NetCDF file from months to months since

I currently have multiple NetCDF files with 4 dimensions, (latitude, longitude, time, and depth). Each represents a single year of monthly data. The unit of time is "month", 1-12, and therefore quite useless if I want to merge these files across years to give me a single NetCDF file with a time dimension of size months*years.
The time dimension attributes for a single file:
time Size:12 *** is unlimited ***
long_name: time
units: month
I used ncrcat from NCO to merge:
ncrcat soda3.3.1*sst.nc -O soda3.3.1_1980_2015_sst.nc
This works except that when merged, time values read
#in R
soda.info$var$temp$dim[[3]]$vals
[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1
[26] 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2
[51] 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3
[76] 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4
[101] 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5
[126] 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6
[151] 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7
[176] 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8
[201] 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9
[226] 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10
[251] 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11
[276] 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
[301] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1
[326] 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2
[351] 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3
[376] 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4
[401] 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5
[426] 6 7 8 9 10 11 12
...which obviously isn't much help if I want to keep track of time.
In the past I've only used NetCDF files with a "months since..." unit. Is there a way to change these rather groundless 'month' units to 'months since...'?
Would it suffice to enumerate the months sequentially?
ncap2 -s 'time=array(0,1,$time)' soda3.3.1_1980_2015_sst.nc out.nc
You can also add a "months since ..." unit to time as described in the comment by Chelmy and/or in the NCO manual. I leave that as an exercise for you, gentle reader.
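To see what a sequential time axis buys you on the R side, here is a small sketch that maps month offsets 0, 1, 2, ... to calendar dates. The origin 1980-01-01 is an assumption taken from the "1980_2015" part of the merged file name, and 432 is the number of time steps shown in the output above; neither is read from the file.

```r
n_months <- 432                    # number of merged time steps shown above
time_vals <- seq(0, n_months - 1)  # what time=array(0,1,$time) would produce
origin <- as.Date("1980-01-01")    # assumed start date, taken from the file name

# Calendar date for each offset, i.e. "months since 1980-01-01"
dates <- seq(origin, by = "month", length.out = n_months)

# The original 1..12 cycle can still be recovered from the offsets
month_of_year <- time_vals %% 12 + 1

range(dates)
```

If the assumed origin is right, the last date falls in December 2015, which is a quick sanity check that no yearly file was skipped during the merge.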

Create a new variable based on existing variable

My current dataset looks like this
Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
8 6
I want to create a new variable called "V2" based on the variables "Order" and "V1". Within every block of 8 items in "Order", I want "V2" to be 0 where "Order" equals 1; otherwise, "V2" takes the value of the previous item in "V1".
This is the dataset that I want
Order V1 V2
1 7 0
2 5 7
3 8 5
4 5 8
5 8 5
6 3 8
7 4 3
8 2 4
1 8 0
2 6 8
3 3 6
4 4 3
5 5 4
6 7 5
7 3 7
8 6 3
Since my actual dataset is very large, I'm trying to use a for loop with an if statement to generate "V2", but my code keeps failing. I'd appreciate it if anyone could help me with this, and I'm open to other approaches. Thank you!
(Up front: I am assuming that the order of Order is perfectly controlled.)
You simply need ifelse and lag (the dplyr version; stats::lag in base R does not shift the values of a plain vector):
df <- read.table(text="Order V1
1 7
2 5
3 8
4 5
5 8
6 3
7 4
8 2
1 8
2 6
3 3
4 4
5 5
6 7
7 3
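8 6 ", header=TRUE)
# dplyr::lag() shifts V1 down by one position; the leading NA is replaced by 0 because Order is 1 there
df$V2 <- ifelse(df$Order==1, 0, dplyr::lag(df$V1))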
df
# Order V1 V2
# 1 1 7 0
# 2 2 5 7
# 3 3 8 5
# 4 4 5 8
# 5 5 8 5
# 6 6 3 8
# 7 7 4 3
# 8 8 2 4
# 9 1 8 0
# 10 2 6 8
# 11 3 3 6
# 12 4 4 3
# 13 5 5 4
# 14 6 7 5
# 15 7 3 7
# 16 8 6 3
A base R alternative (here the data frame is named dat):
with(dat, {V2 <- c(0, head(V1, -1)); V2[Order == 1] <- 0; dat$V2 <- V2; dat})
Order V1 V2
1 1 7 0
2 2 5 7
3 3 8 5
4 4 5 8
5 5 8 5
6 6 3 8
7 7 4 3
8 8 2 4
9 1 8 0
10 2 6 8
11 3 3 6
12 4 4 3
13 5 5 4
14 6 7 5
15 7 3 7
16 8 6 3
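Both answers shift V1 across the whole column and then overwrite the rows where Order is 1. A variant that shifts within each block instead can be sketched in base R with cumsum and ave. This is only an illustration on the sample data; it assumes every block starts with Order equal to 1, and the variable name block is made up.

```r
df <- data.frame(
  Order = rep(1:8, 2),
  V1    = c(7, 5, 8, 5, 8, 3, 4, 2, 8, 6, 3, 4, 5, 7, 3, 6)
)

# A new block starts wherever Order is 1, so the running count of
# those rows serves as a block id
block <- cumsum(df$Order == 1)

# Within each block, shift V1 down by one position and start with 0
df$V2 <- ave(df$V1, block, FUN = function(v) c(0, head(v, -1)))

df
```

Because the shift happens per block, a value from the end of one block never leaks into the start of the next, even before the Order == 1 rows are considered.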

How to create multiple rank plot in R

I have a dataset of 10 variables and 150 observations.
ebi anago maguro ika uni sake tamago toro tekka.maki kappa.maki
7 4 5 1 0 2 8 3 9 6
1 4 5 7 2 0 8 6 9 3
7 2 5 4 8 1 0 3 9 6
4 7 5 1 2 0 3 8 6 9
4 5 7 2 0 3 8 1 6 9
4 5 7 2 0 3 1 8 6 9
5 7 4 1 0 2 3 8 9 6
5 4 1 6 7 2 0 8 3 9
5 7 2 3 8 4 9 0 6 1
1 7 2 0 8 3 5 4 6 9
4 7 5 1 8 2 3 9 6 0
7 5 0 4 2 3 8 6 1 9
4 7 0 5 2 1 8 3 6 9
4 5 7 0 3 1 2 6 8 9
7 4 0 2 5 3 1 8 9 6
7 5 4 0 2 3 8 1 6 9
2 7 0 8 6 3 1 9 5 4
7 2 5 4 3 0 8 1 6 9
7 5 0 2 1 6 8 9 3 4
7 4 5 0 3 1 2 8 6 9
Each variable holds the rank an agent gave to one sushi type, and I'd like to create multiple plots in the same image, like the one in the photo.
Any help?
Something like this, maybe?
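library(ggplot2)
library(reshape2)
# Use object names that do not mask the functions rank() and melt()
ranks <- read.csv('rank.csv')
# Long format: one row per (sushi type, rank) pair
ranks_long <- melt(ranks, id.vars = NULL)
# Order the x axis from rank 9 down to rank 0
ranks_long$value <- factor(ranks_long$value, levels = 9:0)
# One bar chart of rank counts per sushi type; 10 panels fit a 2 x 5 grid
ggplot(ranks_long, aes(value)) + geom_bar() + facet_wrap(~variable, nrow = 2, ncol = 5)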
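What geom_bar draws in each panel is simply the number of times each rank was given to that sushi type. As a rough illustration in base R only, the counts can be tabulated directly; the data frame below copies the first five rows of three columns from the sample above, and the object names are made up.

```r
# First five observations of three of the ten columns from the sample data
ranks <- data.frame(
  ebi    = c(7, 1, 7, 4, 4),
  anago  = c(4, 4, 2, 7, 5),
  maguro = c(5, 5, 5, 5, 7)
)

# stack() gives the long format: `values` holds the rank, `ind` the sushi type
long <- stack(ranks)

# Rows are sushi types, columns are ranks ordered from 9 down to 0
counts <- table(sushi = long$ind,
                rank  = factor(long$values, levels = 9:0))
counts
```

Each row of counts corresponds to the bar heights of one facet in the ggplot version.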

Sequentially reorganize a vector in R

I have a numeric element z as below:
> sort(z)
[1] 1 5 5 5 6 6 7 7 7 7 7 9 9
I would like to renumber its values sequentially, so as to have
> z
[1] 1 2 2 2 3 3 4 4 4 4 4 5 5
I guess converting z to a factor and using it as an index should be the way.
You answered it yourself really:
as.integer(factor(sort(z)))
I know this has been accepted already but I decided to look inside factor() to see how it's done there. It more or less comes down to this:
x <- sort(z)
match(x, unique(x))
Which is an extra line I suppose but it should be faster if that matters.
This should do the trick
z = sort(sample(1:10, 100, replace = TRUE))
cumsum(diff(z)) + 1
[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[26] 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6
[51] 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
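Note that diff omits the first element of the series, and that cumsum(diff(z)) only works when consecutive distinct values differ by exactly 1 (true for the sample above, but not for the z in the question, which jumps from 1 to 5). To compensate for both:
c(1, cumsum(diff(z) != 0) + 1)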
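Alternative using rle (as written, this just reconstructs the sorted z, which matches the desired index only because every value from 1 to 10 occurs in the sample):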
z = sort(sample(1:10, 100, replace = TRUE))
rle_result = rle(sort(z))
rep(rle_result$values, rle_result$lengths)
> rep(rle_result$values, rle_result$lengths)
[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[26] 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6
[51] 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8
[76] 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10
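A more general rle version that numbers the runs, so gaps between values do not matter (x is the sorted vector):
rep(seq_along(rle(x)$lengths), rle(x)$lengths)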
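To compare the suggestions on the vector from the question, here is a short sketch. The unsorted z below is made up so that sort(z) reproduces the values shown in the question; the variable names are for illustration only.

```r
# Hypothetical unsorted input whose sorted form is 1 5 5 5 6 6 7 7 7 7 7 9 9
z <- c(7, 5, 9, 1, 7, 6, 5, 7, 9, 6, 7, 5, 7)
x <- sort(z)

by_factor <- as.integer(factor(x))             # factor codes
by_match  <- match(x, unique(x))               # position among the unique values
by_diff   <- c(1, cumsum(diff(x) != 0) + 1)    # count the changes between neighbours
runs      <- rle(x)
by_rle    <- rep(seq_along(runs$lengths), runs$lengths)  # number the runs

by_factor
```

All four vectors are equal for sorted input. The factor and match versions also work on unsorted input, whereas the diff and rle versions rely on equal values being adjacent.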
