I have the following data frame:
df <- data.frame(V1 = c(1, 2), V2 = c(10, 20), V3=c("9,1", "13,3,4"))
> df
V1 V2 V3
1 1 10 9,1
2 2 20 13,3,4
Now I want to create a new column 'V4' that takes the value after the first ',' in V3, divides it by the value in V2, and multiplies the result by 100.
In my example this would be:
(1 divided by 10) * 100 = 10
(3 divided by 20) * 100 = 15
So the output would look like this:
df_new
V1 V2 V3 V4
1 1 10 9,1 10
2 2 20 13,3,4 15
How can this be achieved?
We can use a regex to extract the number after the first comma, divide it by V2, and multiply by 100:
transform(df, V4 = as.integer(sub("\\d+,(\\d+).*", "\\1", V3))/V2 * 100)
# V1 V2 V3 V4
#1 1 10 9,1 10
#2 2 20 13,3,4 15
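If the regex feels opaque, an alternative sketch (not from the original answer) splits V3 on the comma with strsplit() and takes the second piece:
# split V3 on "," and keep the value directly after the first comma
second <- sapply(strsplit(as.character(df$V3), ","), `[`, 2)
transform(df, V4 = as.numeric(second)/V2 * 100)
# V1 V2 V3 V4
#1 1 10 9,1 10
#2 2 20 13,3,4 15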
Let's say I have the following dataframe df containing weights:
df <- as.data.frame(t(matrix(seq(1,9), nrow = 3, ncol = 3)))
> df
V1 V2 V3
1 1 2 3
2 4 5 6
3 7 8 9
I would like to produce a new data frame df_2 with normalised weights (each row must sum to 1) as below:
> df_2
V1 V2 V3
1 0.1666667 0.3333333 0.5
2 0.2666667 0.3333333 0.4
3 0.2916667 0.3333333 0.375
Note that the way I normalise a vector w is the following:
w_normalised <- w/sum(w)
We can divide each row by its row-wise sum:
df/rowSums(df)
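As a quick check (not part of the original answer), the rows of the result now sum to 1; and sweep() gives the column-wise variant if that is ever needed:
df_2 <- df/rowSums(df)
rowSums(df_2)                    # each element equals 1
sweep(df, 2, colSums(df), "/")   # variant: make each column sum to 1 instead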
I have a list of data.tables that I need to cbind; however, I only need the last X columns.
My data is structured as follows:
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
> DT.list
[[1]]
x y v1 v2
1: 1 a 1 3
2: 1 a 2 4
[[2]]
x y v3
1: 1 a 5
2: 1 a 6
[[3]]
x y v4 v5 v6
1: 1 a 7 9 11
2: 1 a 8 10 12
Columns x and y are the same for each of the data.tables, but the number of columns differs. The output should not include duplicate x and y columns. It should look as follows:
x y v1 v2 v3 v4 v5 v6
1: 1 a 1 3 5 7 9 11
2: 1 a 2 4 6 8 10 12
I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way to avoid creating the duplicates in the first place? Also, efficiency is important, since the lists can be long and the data.tables large.
Thanks.
Here's another way:
Reduce(
  function(x, y) {
    newcols <- setdiff(names(y), names(x))    # columns of y not yet present in the accumulator
    x[, (newcols)] <- y[, ..newcols]          # append them
    x
  },
  DT.list,
  init = copy(DT.list[[1]][, c("x","y")])     # start from a copy of the shared key columns
)
# x y v1 v2 v3 v4 v5 v6
# 1: 1 a 1 3 5 7 9 11
# 2: 1 a 2 4 6 8 10 12
This avoids modifying the list (as @bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.
Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():
do.call(cbind,c(DT.list[1],lapply(DT.list[2:length(DT.list)],`[`,j=-c(1,2))));
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:
for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
## x y v1 v2
## 1: 1 a 1 3
## 2: 1 a 2 4
##
## [[2]]
## v3
## 1: 5
## 2: 6
##
## [[3]]
## v4 v5 v6
## 1: 7 9 11
## 2: 8 10 12
##
do.call(cbind,DT.list);
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
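If you prefer not to modify DT.list in place, a small variant (my own sketch, assuming the data.table package is loaded) deep-copies each element first and drops x and y by reference on the copies:
DT.copy <- lapply(DT.list, copy);   ## copy() so the originals stay untouched
for (i in seq_along(DT.copy)[-1]) DT.copy[[i]][, c("x","y") := NULL];
do.call(cbind, DT.copy);
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12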
Another option would be to use the [ indexing function inside lapply() on the list of data.tables and exclude the "unwanted" columns (in your case x and y). This way, duplicate columns are not created in the first place.
# your given test data
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
A) using a character vector to indicate which columns to exclude
# cbind a list of subsetted data.tables
exclude.col <- c("x","y")
myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
B) same as above but using the character vector directly in lapply
myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
C) same as above, but all in one line
do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
# way too many brackets...but I think it works
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
I am trying to generate 0 and 1 for absence and presence. My data is line segments, and I have to plot 0 or 1 at an interval of 0.1 for points that lie within a segment and points outside the segment.
V1 V2 V3 V4 V5 V6 V7
3 17 26.0 26.0 0 12-Jun-84 1 0
4 17 48.0 48.0 1 12-Jun-84 3 0
5 17 56.7 56.7 0 12-Jun-84 1 0
143 17 16.3 16.3 0 19-Jun-84 1 8
144 17 17.7 17.7 0 19-Jun-84 1 8
145 17 22.0 22.0 0 19-Jun-84 1 8
V2 and V3 are the start and end points, and V4 is the separation between them.
I have tried:
tran17 <- seq(0, 80, by=0.1)
tran17.date1 <- rep(0, length(tran17))
##
sub1 <- which(tran17 >= c$V2[i] & tran17 <= c$V3[i])
tran17.date1[sub1] <- 1
Thank you.
Ignoring your data example and focusing on your question, I think this solves the problem. Also, if V1 is a grouping factor, you can apply PAmatrix per group (e.g. with split()/lapply() or tapply()); a sketch is given after the plotting code below.
# test data
set.seed(1104)
dat = data.frame(V1 = 17, V2 = runif(200, 10, 60))
dat$V3 = dat$V2 + runif(200, 0, 20)
dat$V4 = dat$V3 - dat$V2
V1 V2 V3 V4
1 17 37.25826 45.54194 8.2836734
2 17 17.44098 22.86841 5.4274331
3 17 49.78488 55.51627 5.7313965
4 17 51.66640 52.54813 0.8817293
5 17 21.84276 39.38477 17.5420079
6 17 53.39457 54.51613 1.1215530
# functions to solve the problem
isInside = function(limits, tran) as.numeric(tran >= limits[1] & tran <= limits[2])  # 1 if a point of tran lies in [start, end]
PAmatrix = function(data, tran) t(apply(data, 1, isInside, tran = tran))             # one row per segment, one column per point of tran
# calculate the PA matrix
tran17 = seq(0, 80, by=0.1)
PA17 = PAmatrix(data=dat[,c("V2","V3")], tran=tran17)
# plot the results
image(seq(nrow(dat)), tran17, PA17, col=c("blue", "red"))
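As noted above, a minimal sketch of the per-group idea (assuming V1 identifies the groups; not part of the original code):
# one presence/absence matrix per value of V1 (here dat has a single V1, so the list has one element)
PA.by.group <- lapply(split(dat[, c("V2", "V3")], dat$V1), PAmatrix, tran = tran17)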
tran17 <- seq(0, 80, by = 0.1)
tran17.date1 <- rep(0, length(tran17))
dm <- which(c$V5 == "31-Jul-84")   # rows for the chosen date; `c` is the data frame from the question
for (i in dm) {
  print(i)
  sub1 <- which(tran17 >= c$V2[i] & tran17 <= c$V3[i])
  tran17.date1[sub1] <- 1
}
plot(tran17, tran17.date1)
I am a new user of R. I have this kind of data. How can I separate the different types of variables (e.g. binary or counts, as opposed to others that are continuous)?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0.17 0 0 12 22 2 1 1 240.215 65.049 1.478 114
0.15 1 0 13 22 2 1 1 247.133 66.315 1.474 120
0.16 0 0 12 22 2 0 1 233.329 58.163 1.353 110
0.07 0 0 12 20 2 0 1 219.660 56.162 1.370 114
0.11 0 0 12 26 2 0 2 289.294 70.844 1.389 134
Thanks in advance!
You can use the function typeof to determine the storage mode of an object.
An example data frame:
dat <- data.frame(a = 1:2,
                  b = c(0.5, -1.3),
                  c = c("a", "b"),
                  d = c(TRUE, FALSE), stringsAsFactors = FALSE)
With lapply you can apply the function to all columns:
lapply(dat, typeof)
The result:
$a
[1] "integer"
$b
[1] "double"
$c
[1] "character"
$d
[1] "logical"
If you want to select, for example, all character columns, you can use:
dat[sapply(dat, typeof) == "character"] # possibility 1
dat[sapply(dat, is.character)] # possibility 2
# both commands will return the same result
c
1 a
2 b
PS: You should also have a look at the functions mode and storage.mode.
In addition to typeof, str and summary are other possibilities. These can also be applied directly to the data frame, i.e. no lapply or looping is required.
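For example, applied to the small data frame from above, str() prints roughly this:
str(dat)
'data.frame':   2 obs. of  4 variables:
 $ a: int  1 2
 $ b: num  0.5 -1.3
 $ c: chr  "a" "b"
 $ d: logi  TRUE FALSE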