Plotting different columns on the same file using boxes - plot

I have a file that looks like
$cat myfile.dat
1 8 32 19230 1.186 3.985
1 8 64 9620 0.600 7.877
1 8 128 4810 0.312 15.136
1 8 256 2410 0.226 20.927
1 8 512 1210 0.172 27.708
1 8 1024 610 0.135 35.582
1 8 2048 310 0.121 40.172
1 8 4096 160 0.117 43.141
1 8 8192 80 0.112 44.770
.....
2 8 16384 300 0.692 6.816
2 8 32768 150 0.686 6.877
2 8 65536 80 0.853 5.904
2 10 320 7830 1.041 4.575
2 10 640 3920 0.919 5.189
2 10 1280 1960 0.828 5.757
2 10 2560 980 0.773 6.167
2 10 5120 490 0.746 6.391
2 10 10240 250 0.748 6.507
2 10 20480 130 0.770 6.567
....
3 18 8192 10 1.311 12.759
3 20 32 650 1.631 3.978
3 20 64 330 0.838 7.863
3 20 128 170 0.483 14.046
3 20 256 90 0.508 14.160
3 20 512 50 0.559 14.283
3 20 1024 30 0.665 14.405
3 20 2048 20 0.865 14.782
3 20 4096 10 0.856 14.932
3 20 8192 10 1.704 14.998
As you can see, there are many ways of plotting this information depending on the column we want as x axis. One of the ways I would like to plot the information is the 6th against the 1st column
p "myfile.dat" u 1:6
My main question is whether there is a way to plot those bars as solid boxes, since we are only interested in the peak value achieved and not in the frequency or density of the dots.

Gnuplot has the smooth option, which can be used e.g. as smooth frequency to sum all y-values for the same x-value. Unfortunately there is no smooth maximum, which is what you would need here, but one can emulate it with a bit of trickery in the using statement.
reset
xval = -1000
max(x, y) = (x > y ? x : y)
maxval = 0
colnum = 6
set boxwidth 0.2
plot 'myfile.dat' using (val = column(colnum), $1):\
(maxval_prev = (xval == $1 ? maxval : 0), \
maxval = (xval == $1 ? max(maxval, val) : val),\
xval = $1, \
(maxval > maxval_prev ? maxval-maxval_prev : 0)\
) \
smooth frequency lw 3 with boxes t 'maximum values'
Every using entry can consist of different assignments, which are separated by a comma.
If a new x value appears, the variables are initialized. This works, because the data is made monotonic in x by smooth frequency.
If the current value is bigger than the stored maximum value, the difference between the stored maximum value and the current value is added. Potentially, this could result in numerical errors due to repeated adding and subtracting, but judging from your sample data and given the resolution of the plot, this shouldn't be a problem.
The result for your data is:
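To see why summing the increments gives the maximum, here is a minimal Python sketch of the same bookkeeping, outside gnuplot. The function names are made up for illustration, the rows are a few (column 1, column 6) pairs taken from the sample data, and the sketch assumes that rows with the same x value are contiguous, as they are in the sample file.

```python
from collections import defaultdict

def running_max_increments(rows):
    """Yield (x, increment), mirroring the assignments in the using statement."""
    xval = None
    maxval = 0.0
    for x, val in rows:
        # A new x value resets the stored maximum.
        prev = maxval if x == xval else 0.0
        maxval = max(maxval, val) if x == xval else val
        xval = x
        # Only a rise above the previous maximum contributes.
        yield x, (maxval - prev if maxval > prev else 0.0)

def smooth_frequency(pairs):
    """Sum all y values that share the same x, like 'smooth frequency'."""
    totals = defaultdict(float)
    for x, y in pairs:
        totals[x] += y
    return dict(sorted(totals.items()))

# (column 1, column 6) pairs picked from the sample data
rows = [(1, 3.985), (1, 7.877), (1, 44.770),
        (2, 6.816), (2, 6.877), (2, 5.904),
        (3, 12.759), (3, 14.998)]

peaks = smooth_frequency(running_max_increments(rows))
print(peaks)
```

Each group's increments telescope to its largest value, so the summed result is the per-group maximum up to floating-point rounding.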

You can search for the maximum and plot only that, but this is probably easier, even if it draws lots of boxes one over another:
plot "myfile.dat" using 1:6:(.1) with boxes fillstyle solid


Clustering biological sequences based on numeric values

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers which represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list contains over 10^5 sequences (hence the need for computational efficiency).
I then convert these sequences into numeric vectors by the following:
key <- HDMD::AAMetric.Atchley
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p = 13
output <-
do.call(cbind, lapply(1:p, function(i)
m1[seq(i, nrow(m1), by = p), ]))
I want to cluster output (which now holds 65-dimensional vectors) in an efficient way.
I was originally using Mini-batch kmeans, but I noticed the results were very inconsistent when I repeated. I need a consistent clustering approach.
I also was concerned about the curse of dimensionality, considering at 65 dimensions, Euclidean distance doesn't work.
Many high dimensional clustering algorithms I saw assume that outliers and noise exists in the data, but as these are biological sequences converted to numeric values, there is no noise or outlier.
In addition to this, feature selection will not work, as each of the properties of each amino acid and each amino acid are relevant in the biological context.
How would you recommend clustering these vectors?
I think self organizing maps can be of help here - at least the implementation is quite fast so you will know soon enough if it is helpful or not:
using the data from the op along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
library(SOMbrero)
you define the number of clusters in advance via the grid dimension (here 5 x 5 = 25)
fit <- trainSOM(x.data=output , dimension = c(5, 5), nb.save = 10, maxit = 2000,
scaling="none", radius.type = "gaussian")
the nb.save is used as intermediate steps for further exploration how the training developed during the iterations:
plot(fit, what ="energy")
seems like more iterations is in order
check the frequency of clusters:
table(fit$clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
predict clusters based on new data:
predict(fit, output[1:20,])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
check which variables were important for clustering:
summary(fit)
#part of output
Summary
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
ANOVA :
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
find optimal number of clusters:
plot(superClass(fit))
fit1 <- superClass(fit, k = 4)
summary(fit1)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
Clustering
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
ANOVA
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
Much more in this vignette

How to make a 3D Mesh in RGL with shade3d or wire3d using tmesh3d in R

I have some data which I have collected.
It consists of Vertices and then Triangles which I have made using a meshing software.
I am able to use R with
trimesh(triangles, vertices)
to make a nice mesh plot.
But can't figure out how to use RGL to make an interactive plot that I can view, and I can't work out how to colour the faces of the mesh based on a different value in the data frame.
here are the vertices in a data frame. x, y, z are the coordinates of the nodes/points (nn)
'data.frame': 23796 obs. of 7 variables:
$ nn : int 0 1 2 3 4 5 6 7 8 9 ...
$ x : num 39.5 70.8 49 83.5 -16 ...
$ y : num 28.2 -2.97 -25.67 -9.1 -39.75 ...
$ z: num 160 158 109 121 188 ...
$ uni: num 3.87 6.64 5.02 4.48 1.91 ...
$ bi : num 0.749 0.784 1.045 0.935 0.733 ...
nn x y z uni bi
0 39.527 28.202 160.219 3.86942 0.74871
1 70.804 -2.966 157.578 6.64361 0.78373
2 48.982 -25.674 109.022 5.02491 1.0451
3 83.514 -9.096 120.988 4.47977 0.9348
4 -16.04 -39.749 188.467 1.90873 0.73286
5 74.526 -3.096 174.347 8.4263 0.70594
6 54.93 -56.347 151.496 7.53334 2.17128
7 56.936 -20.131 186.177 7.16118 1.44875
8 -14.627 -47.1 162.185 2.13939 0.70887
9 38.207 -59.201 147.993 5.83457 4.32971
10 50.645 -32.04 110.418 5.3741 1.14543
The triangles for the vertices are
'data.frame': 47602 obs. of 7 variables:
$ X : int 3435 3161 18424 13600 1564 21598 21283 1171 51 9331 ...
$ Y : int 19658 17204 17467 19721 10099 19018 11341 2723 15729 5851 ...
$ Z : int 2764 9466 16955 2669 10091 21205 18399 20833 15865 9106 ...
X Y Z
3435 19658 2764
3161 17204 9466
18424 17467 16955
13600 19721 2669
1564 10099 10091
21598 19018 21205
21283 11341 18399
1171 2723 20833
51 15729 15865
9331 5851 9106
310 3513 9121
5651 11928 15468
8594 2295 6852
22725 22636 11114
I need to make this into a mesh as I can in trimesh, but with RGL, and I need to colour the faces of the mesh based on a scale of uni, where <0.5 is red, 0.5-1.5 is orange and >1.5 is green.
It looks something like this in trimesh, but how do I do it in RGL for R, with colouring based on the value of uni in the first data table?
Here is an example, starting with two dataframes.
> library(rgl)
> vertices
x y z
1 1 -1 1
2 1 -1 -1
3 1 1 -1
4 1 1 1
5 -1 -1 1
6 -1 -1 -1
7 -1 1 -1
8 -1 1 1
> triangles
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
1 5 5 1 1 2 2 6 6 8 8 1 1
2 1 4 2 3 6 7 5 8 4 3 6 5
3 4 8 3 4 7 3 8 7 3 7 2 6
You need matrices to deal with tmesh3d. A row of 1's must be added to the table of vertices.
verts <- rbind(t(as.matrix(vertices)),1)
trgls <- as.matrix(triangles)
tmesh <- tmesh3d(verts, trgls)
Now you can plot the mesh:
wire3d(tmesh)
About colors, you have to associate one color to each triangle:
tmesh$material <- list(color=rainbow(ncol(trgls)))
wire3d(tmesh)
> shade3d(tmesh)
UPDATE 2019-03-09
The newest version of rgl (0.100.18) allows different interpretation of the material colors.
You can assign a color to each face:
vertices <- as.matrix(vertices)
triangles <- as.matrix(triangles)
mesh1 <- tmesh3d(
vertices = t(vertices),
indices = triangles,
homogeneous = FALSE,
material = list(color = rainbow(ncol(triangles)))
)
shade3d(mesh1, meshColor = "faces")
or assign a color to each vertex:
mesh2 <- tmesh3d(
vertices = t(vertices),
indices = triangles,
homogeneous = FALSE,
material = list(color = rainbow(nrow(vertices)))
)
shade3d(mesh2, meshColor = "vertices")
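The question also asked for colours driven by uni rather than by rainbow(). As a rough sketch of that mapping, the Python below assigns one colour name per triangle using the thresholds from the question. Two things here are assumptions, not something the answer states: the value of a face is taken as the mean of its three vertex values, and the triangle indices are treated as 0-based because the nn column starts at 0 (R and rgl index from 1, so the indices would need +1 there). The uni values and triangles are toy inputs.

```python
def uni_color(value):
    """Map a uni value to the colour scale from the question."""
    if value < 0.5:
        return "red"
    if value <= 1.5:
        return "orange"
    return "green"

def face_colors(uni, triangles, zero_based=True):
    """Return one colour per triangle, from the mean uni of its vertices."""
    offset = 0 if zero_based else 1
    colors = []
    for tri in triangles:
        # Assumption: a face is coloured by the mean of its vertex values.
        mean_uni = sum(uni[i - offset] for i in tri) / 3.0
        colors.append(uni_color(mean_uni))
    return colors

# Toy inputs for illustration only
uni = [0.1, 0.3, 0.2, 1.0, 2.5, 3.0]
triangles = [(0, 1, 2), (2, 3, 4), (3, 4, 5)]

print(face_colors(uni, triangles))
```

A vector built this way, one entry per triangle, is what would replace rainbow(ncol(triangles)) in the material list when plotting with meshColor = "faces".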

gnuplot: plot multiple lines from single file and make title at end from column and key title manually

I'm quite new with gnuplot and so maybe my question has an obvious answer. Please excuse if this is too noobish.
I have the following data
20 500 1.0
30 500 0.95
40 500 0.85
50 500 0.7
60 500 0.5

20 1000 1.1
30 1000 1.05
40 1000 0.95
50 1000 0.8
60 1000 0.6

20 1500 1.2
30 1500 1.15
40 1500 1.05
50 1500 0.9
60 1500 0.7

20 2000 1.26
30 2000 1.22
40 2000 1.13
50 2000 0.99
60 2000 0.79

20 2500 1.33
30 2500 1.29
40 2500 1.21
50 2500 1.06
60 2500 0.88
Plotting this as a surface worked fine. Now I would like to plot this as 5 separate lines (using 1:3) and have the 2nd column as 'title at end' for each of the lines.
I tried
plot "demo.dat" using 1:3:2 with lines title columnhead(2) at end
but this will only label the last line (which is bogus) with 500 and ignore all the others. Also it sets 500 as title in the key box (which I would like to set to another string). Is that possible or do I have to split the blocks into several files as suggested in How to plot single/multiple lines depending on values in a column with GNUPlot ?
You may try in two steps:
plot 'demo.dat' u 1:3 notitle with lines, \
'demo.dat' u 1:3:2 every ::4::4 notitle with labels
The first entry plots the data.
The second entry adds a label at the last point of each block (each block has 5 points, numbered 0 to 4). To label the point before the last one instead, replace ::4::4 by ::3::3.
Another rather general solution. No need for different data format or for splitting the data into several files.
Of course, it would be easier if the data was split into subblocks by two empty lines, however, the OP's data is separated only by single empty lines. How to handle this without modifying the data outside gnuplot?
For the following solution you don't have to know in advance how many subblocks your data has and how many datapoints there are in one subblock.
Furthermore:
different colors for the subblocks
value of column 2 as label at the end
some other text can be placed in the legend/key
Edit:
each subblock can have different number of datapoints
works with gnuplot>=4.6.0
Data: SO32603146.dat
10 500 1.05
20 500 1.0
30 500 0.95
40 500 0.85
50 500 0.7
60 500 0.5

20 1000 1.1
30 1000 1.05
40 1000 0.95
50 1000 0.8

20 1500 1.2
30 1500 1.15
40 1500 1.05
50 1500 0.9
60 1500 0.7
70 1500 0.51
80 1500 0.30

20 2000 1.26
30 2000 1.22
40 2000 1.13
50 2000 0.99
60 2000 0.79
70 2000 0.61

20 2500 1.33
30 2500 1.29
40 2500 1.21
50 2500 1.06
60 2500 0.88
Script: (works with gnuplot>=4.6.0, March 2012)
### plotting some subblock data with label at the end
reset
FILE = "SO32603146.dat"
set rmargin 7
set key at graph 1.0,0.95 noautotitle
addPosLabel(colX,colY,colL) = (x0=x1,x1=column(colX), y0=y1,y1=column(colY), \
L0=L1,L1=strcol(colL), b0=b1,b1=column(-1),b0!=b1 ? \
myLabels = myLabels.sprintf(" %g %g %s",x0,y0,L0) : 0, column(colY))
x1 = y1 = b1 = NaN
L1 = myLabels = ''
PosX(i) = real(word(myLabels,int(i*3+3)))
PosY(i) = real(word(myLabels,int(i*3+4)))
Label(i) = word(myLabels,int(i*3+5))
plot FILE u 1:(addPosLabel(1,3,2)):-1 w lp pt 7 lc var, \
myLabels=myLabels.sprintf(" %g %g %s",x1,y1,L1), \
'+' u (x0=int($0*3+5),PosX($0)):(PosY($0)):(word(myLabels,x0)) \
every ::::(words(myLabels)-2)/3-1 w labels left offset 1,0 ti "Some other text\n for the legend"
### end of script
Result:
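For readers who find the string bookkeeping in the script hard to follow, the idea behind it can be sketched in a few lines of Python: walk through the rows, and whenever a block ends (an empty line, or the end of the data), remember the last point of that block together with its column 2 value. The function name is made up for illustration and the data is a shortened excerpt of the file above.

```python
def last_points(lines):
    """Return (x, y, label) for the last row of each blank-line-separated block."""
    labels = []
    last = None
    for line in lines:
        fields = line.split()
        if not fields:
            # An empty line closes the current block.
            if last is not None:
                labels.append(last)
                last = None
            continue
        # Column 1 is x, column 2 the label, column 3 is y.
        last = (float(fields[0]), float(fields[2]), fields[1])
    if last is not None:
        # The final block is not followed by an empty line.
        labels.append(last)
    return labels

data = """\
20 500 1.0
60 500 0.5

20 1000 1.1
50 1000 0.8
"""

print(last_points(data.splitlines()))
```

The extra append after the loop plays the same role as the myLabels assignment between the two plot elements in the gnuplot script: the last block has no following block to trigger it.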

how to discretize R data.frame column in a given width?

Say, I have a data.frame() like this
>head(Acquisition)
original_date first_payment_date LTV DTI FICO
1 01/2007 03/2007 56 37 734
2 02/2007 04/2007 80 11 762
3 12/2006 02/2007 80 28 656
4 12/2006 03/2007 70 50 700
I want to discretize the Acquisition$LTV and Acquisition$DTI by the step size 0.05 and Acquisition$FICO by the step size 10.
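I have found the answer: the cut function is enough. Given a single number as its second argument, cut splits the range into that many intervals, so the range divided by the step size gives the interval count: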
dis.LTV=cut(Acquisition$LTV,(max(Acquisition$LTV)-min(Acquisition$LTV))/0.05)
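If the bins must be exactly one step wide, the arithmetic is simple enough to do by hand. Here is a minimal Python sketch, assuming the bins start at the smallest value; the function name is made up, and the input is the FICO column from the four rows shown in the question.

```python
import math

def discretize(values, width):
    """Replace each value by the lower edge of its fixed-width bin."""
    lo = min(values)  # assumption: bins start at the smallest value
    return [lo + width * math.floor((v - lo) / width) for v in values]

fico = [734, 762, 656, 700]
print(discretize(fico, 10))
```

With a fractional width such as 0.05, values that sit exactly on a bin edge can land in the neighbouring bin because of floating-point rounding, so it may be worth scaling to integers first.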

R: Using Log Rank Test (survdiff)

OK, so I have a dataframe that looks like this:
head(exprs, 21)
sample expr ID X_OS
1 BIX high TCGA_DM_A28E_01 26
2 BIX high TCGA_AY_6197_01 88
3 BIX high TCGA_HB_KH8H_01 553
4 BIX low TCGA_K4_6303_01 256
5 BIX low TCGA_F4_6703_01 491
6 BIX low TCGA_Y7_PIK2_01 177
7 BIX low TCGA_A6_5657_01 732
8 HEF high TCGA_DM_A28E_01 26
9 HEF high TCGA_AY_6197_01 88
10 HEF high TCGA_F4_6703_01 491
11 HEF high TCGA_HB_KH8H_01 553
12 HEF low TCGA_K4_6303_01 256
13 HEF low TCGA_Y7_PIK2_01 177
14 HEF low TCGA_A6_5657_01 732
15 TUR high TCGA_DM_A28E_01 26
16 TUR high TCGA_F4_6703_01 491
17 TUR high TCGA_Y7_PIK2_01 177
18 TUR low TCGA_K4_6303_01 256
19 TUR low TCGA_AY_6197_01 88
20 TUR low TCGA_HB_KH8H_01 553
21 TUR low TCGA_A6_5657_01 732
Simply, for each sample, there are 7 patients, each with a survival time (X_OS) and expression level high or low (expr). In the code below, I wish to take the first sample and run it through the survdiff function, with the outputs going to dfx. However, I'm new to survival analysis and I'm not sure how to use the parameters of the survdiff function. I wish to compare high and low expression groups for each sample. How can I edit the function expfun to yield the survdiff output I need? In addition, ideally I'd love to get the pvalues out of it, but I can work on that in a later step. Thank you!
expfun = function(x) {
survdiff(Surv(x$X_OS, x$expr))
}
dfx <- pblapply(split(exprs[c("expr", "X_OS")], exprs$sample), expfun)
Try this. I added a proper Surv() call because you only had times and no status argument and I made it into a formula (with the predictor on the other side of the tilde) because Surv function expects status as its second argument and survdiff expects a formula as its first argument. That means you need to use the regular R regression calling convention where column names are used as the formula tokens and the dataframe is given to the data argument. If you had a censoring variable, it would be put in as the second Surv argument rather than the 1's that I have in there now.
expfun = function(x) {
survdiff( Surv( X_OS, rep(1,nrow(x)) ) ~ expr, data=x)
}
dfx <- lapply(split(exprs[c("expr", "X_OS")], exprs$sample), expfun)
This is the result from print.survdiff:
> dfx
$BIX
Call:
survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
N Observed Expected (O-E)^2/E (O-E)^2/V
expr=high 3 3 2.05 0.446 0.708
expr=low 4 4 4.95 0.184 0.708
Chisq= 0.7 on 1 degrees of freedom, p= 0.4
$HEF
Call:
survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
N Observed Expected (O-E)^2/E (O-E)^2/V
expr=high 4 4 3.14 0.237 0.51
expr=low 3 3 3.86 0.192 0.51
Chisq= 0.5 on 1 degrees of freedom, p= 0.475
$TUR
Call:
survdiff(formula = Surv(X_OS, rep(1, nrow(x))) ~ expr, data = x)
N Observed Expected (O-E)^2/E (O-E)^2/V
expr=high 3 3 1.75 0.902 1.41
expr=low 4 4 5.25 0.300 1.41
Chisq= 1.4 on 1 degrees of freedom, p= 0.235
Note that you can see the code to produce the print output with:
getAnywhere(print.survdiff)
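On the p-values mentioned in the question: the printed p is the upper tail of a chi-square distribution evaluated at the test statistic, which in R is usually obtained by passing the chisq component of the survdiff result to pchisq. With two groups there is one degree of freedom, and in that case the tail has a closed form, erfc(sqrt(chisq / 2)). The Python sketch below only checks that arithmetic against the rounded (O-E)^2/V values printed above; the function name is made up for illustration.

```python
import math

def chisq_pvalue_1df(chisq):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chisq / 2.0))

# Rounded statistics from the printed output above
stats = {"BIX": 0.708, "HEF": 0.51, "TUR": 1.41}

for sample, chisq in stats.items():
    print(sample, round(chisq_pvalue_1df(chisq), 3))
```

The closed form only holds for one degree of freedom, so a comparison of more than two expression groups would need the general chi-square tail instead.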
