I'm trying to evaluate some data for my thesis. I can use R to build a boxplot and run the statistical test just fine, and I can do the compact letter display manually... but this time I simply have too much data to do it that way. I'm plotting the distance travelled by different species against each other. I found some manuals online telling me to use the cldList() function, like so:
PT = Data$res
PT

library(rcompanion)

cldList(P.adj ~ Comparison,
        Data = PT,
        threshold = 0.05)
But it seems the resulting table isn't right:
[1] Group      Letter     MonoLetter
<0 rows> (or 0-length row.names)
Obviously I need the data grouped by species, but I thought I had already specified this when running the Kruskal-Wallis test.
I'm fairly inexperienced with R and programming in general, so I have no idea where the error is here. I'd appreciate any help.
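For reference, this is roughly the workflow the manuals describe (a minimal sketch; FSA::dunnTest and the built-in InsectSprays data just stand in for my own post-hoc test and species data):

library(FSA)         # dunnTest() for the pairwise post-hoc comparisons
library(rcompanion)  # cldList()

# Kruskal-Wallis, then Dunn's test with adjusted p-values
kruskal.test(count ~ spray, data = InsectSprays)
DT <- dunnTest(count ~ spray, data = InsectSprays, method = "bh")

PT <- DT$res   # data frame with Comparison, Z, P.unadj and P.adj columns
head(PT)

cldList(P.adj ~ Comparison,
        data      = PT,
        threshold = 0.05)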
I'm trying to plot the Levenshtein distance scores between two lists of sequences (amino acid sequences) using something other than a heatmap. This is the code I used to generate a heatmap, as an example:
library(utils)     # adist()
library(pheatmap)

dist_scores <- adist(LV,    # first list of sequences
                     CD4,   # second list of sequences
                     counts = TRUE)
colors <- c("tomato", "khaki1", "darkseagreen2", "mediumseagreen", "gray30")
breaks <- c(0, 1, 2, 3, 4, 5)
pheatmap(dist_scores, breaks = breaks, color = colors, cluster_rows = TRUE, cluster_cols = TRUE)
and here is the heatmap from the example:
https://i.stack.imgur.com/4ay55.png
I want to have a more intuitive way of showing the data.
I'm thinking of plotting the data as nodes (representing the different sequences) and edges (representing the distances, where the length of the edge increases as the score increases), and also color-coding the nodes by whether they come from "LV" or "CD4". Is there a way to do this in R?
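Roughly, this is the kind of thing I have in mind (an untested sketch; igraph and the toy sequences below are just placeholders, not something I have working):

library(igraph)

LV  <- c("MKVL", "MKIL", "MQVL")    # toy stand-ins for my real sequence lists
CD4 <- c("MKVA", "TQVL")

dist_scores <- adist(LV, CD4)       # Levenshtein distances, LV in rows, CD4 in columns

# one edge per LV/CD4 pair, weighted by the distance
edges <- expand.grid(from = paste0("LV_", seq_along(LV)),
                     to   = paste0("CD4_", seq_along(CD4)))
edges$weight <- as.vector(dist_scores)   # column-major order matches expand.grid()

g <- graph_from_data_frame(edges, directed = FALSE)
V(g)$color <- ifelse(grepl("^LV", V(g)$name), "tomato", "mediumseagreen")

# a force-directed layout pulls similar (small-distance) pairs closer together
plot(g, layout = layout_with_fr(g, weights = 1 / (1 + edges$weight)),
     vertex.label.cex = 0.8)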
My coding skills are subpar at best so I would be really grateful for any help.
Thanks :)
IIRC from my previous experience doing bioinformatics decades ago, there are already good graphical representations for showing similarity between DNA sequences. One option I recall is the sequence dot plot, and R has at least two packages for that: seqinr (https://cran.r-project.org/web/packages/seqinr/index.html) and dotplot (https://github.com/evolvedmicrobe/dotplot). Another option that is not an R package but a web tool is YASS (https://bioinfo.cristal.univ-lille.fr/yass/index.php).
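For instance, a quick dot plot with seqinr might look roughly like this (a sketch with made-up amino acid strings; see ?dotPlot for the window options):

library(seqinr)

s1 <- s2c("MKVLINGKTLKGEITVE")   # s2c() splits a string into a vector of single characters
s2 <- s2c("MKVLLNGKTLRGEITIE")

dotPlot(s1, s2, wsize = 3, nmatch = 2)   # 3-residue sliding window, at least 2 matches per window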
For some alternative metrics and representations, see: https://ieeexplore.ieee.org/document/9097943 ("Levenshtein Distance, Sequence Comparison and Biological Database Search"), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4880953/ ("Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images"), and https://pdf.sciencedirectassets.com/271876/1-s2.0-S0888613X07X01403/1-s2.0-S0888613X07000382/main.pdf ("Distance measures for biological sequences: Some recent approaches")
first post :)
I've been transitioning my R code from sp to sf/stars, and one thing I'm still trying to grasp is how to account for the area of each grid cell.
Here's an example code to explain what I mean.
library(stars)
library(tidyverse)
# Reading in an example tif file, from stars() vignette
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x = read_stars(tif)
x
# Get areas for each grid of the x object. Returns stars object with "area" in units of [m^2]
x_area <- st_area(x)
x_area
I tried loosely adapting code from this vignette (https://github.com/r-spatial/stars/blob/master/vignettes/stars5.Rmd) to divide each value in x by its grid-cell area, but it's not working as expected (perhaps because my objects are stars and not sf?).
x$test1 = x$L7_ETMs.tif / x_area # Some computationally intensive calculation seems to happen, but doesn't produce the results I expect?
x$test1 = x$L7_ETMs.tif / x_area$area # Throws error, "non-conformable arrays"
What does seem to work is the following.
x %>%
  mutate(test1 = L7_ETMs.tif / units::set_units(as.numeric(x_area$area), m^2))
Here are the concerns I have with this code.

1. I worry that by turning x_area$area (a matrix of areas, varying with lat/lon) into a numeric vector, I may mess up the matching between each grid cell and its area. I did some rough testing to see whether the areas line up the way I expect, but I can't escape the worry that this could lead to errors that are difficult to catch.

2. It just doesn't seem clean that I start with x_area in the correct units, only to strip the units and then set them again during the computation.

Can someone suggest a "cleaner" implementation for what I'm trying to do, i.e. multiplying or dividing grid values by the cell area while maintaining units throughout? Or convince me that the code I have is fine?
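For what it's worth, the rough test I did was along these lines, picking an arbitrary cell and comparing the hand-computed value against the mutate() result (just a sketch; the cell indices are arbitrary):

res <- x %>%
  mutate(test1 = L7_ETMs.tif / units::set_units(as.numeric(x_area$area), m^2))

res$test1[100, 200, 1]                               # band 1, one cell, from the pipeline
x$L7_ETMs.tif[100, 200, 1] / x_area$area[100, 200]   # the same cell computed by hand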
Thanks!
I do not know how to improve the stars code, but you can compare the results you get with this
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
library(terra)
r <- rast(tif)
a <- cellSize(r, sum = FALSE)   # cell areas, in m^2
x <- r / a                      # value per unit area, cell by cell
With planar data you could do the following, when it is safe to assume there is no distortion (generally it is not safe, but it can be):
y <- r / prod(res(r))
I came across PCA and noticed the different values returned by different functions in R. The intention of this question is to disambiguate the output of each; I didn't find a satisfactory answer as to why these functions return different values. The functions compared are stats::princomp(), stats::prcomp(), psych::principal(), and FactoMineR::PCA(). The data set was scaled and centered for the sake of comparison, and all functions were set to return 4 components; only the first two PCs are shown here for brevity.
Below is the code for an MWE to set up the case. Please feel free to report any other R function that you think would be helpful to compare, so the outputs are all in one place.
library(psych)        # principal()
library(FactoMineR)   # PCA()

princompPCA  <- princomp(USArrests, cor = TRUE)
prcompPCA    <- prcomp(USArrests, scale. = TRUE)
principalPCA <- principal(USArrests, nfactors = 4, scores = TRUE, rotate = "none", scale = TRUE)
fmrPCA       <- PCA(USArrests, ncp = 4, graph = FALSE)   # variables are scaled by default
# now combine the first two PCs from each function into one data frame
dfComp <- cbind.data.frame(princompPCA$scores[, 1:2], prcompPCA$x[, 1:2],
                           principalPCA$scores[, 1:2], fmrPCA$ind$coord[, 1:2])
names(dfComp) <- c("princompDim1", "princompDim2", "prcompDim1", "prcompDim2",
                   "principalDim1", "principalDim2", "fmrDim1", "fmrDim2")
head(dfComp)
Output:
princompDim1 princompDim2 prcompDim1 prcompDim2 principalDim1 principalDim2 fmrDim1 fmrDim2
Alabama -0.9855659 1.1333924 -0.9756604 1.1220012 0.61951483 -1.1277874 0.9855659 -1.1333924
Alaska -1.9501378 1.0732133 -1.9305379 1.0624269 1.22583308 -1.0679059 1.9501378 -1.0732133
Arizona -1.7631635 -0.7459568 -1.7454429 -0.7384595 1.10830334 0.7422678 1.7631635 0.7459568
Arkansas 0.1414203 1.1197968 0.1399989 1.1085423 -0.08889509 -1.1142591 -0.1414203 -1.1197968
California -2.5239801 -1.5429340 -2.4986128 -1.5274267 1.58654347 1.5353037 2.5239801 1.5429340
Colorado -1.5145629 -0.9875551 -1.4993407 -0.9776297 0.95203595 0.9826713 1.5145629 0.9875551
I noticed that the output of stats::princomp() is exactly the same as that of FactoMineR::PCA() except for the inverted signs. Any idea why the signs are mirrored? Both of these are also close to stats::prcomp(), but that small difference may just be a floating-point issue, a minor matter. psych::principal(), however, is noticeably different from the others. Could that be due to rotation differences between the functions? Any explanation for these differences would be much appreciated.
The outcome of a PCA is a set of vectors along axes. The numbers with the sign inverted are simply vectors pointing in the other direction along the same axis, so the results you get are the same.
Other differences could be due to a different way of calculating the principal components, i.e. using eigenvectors of a correlation matrix versus singular value decomposition. But I'm just speculating here.
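A quick way to check both points with the objects from the question (just a sketch; the sign of each component is arbitrary, so the minus sign below may not be needed on every setup, and the divisor detail comes from ?princomp, which uses divisor N rather than N - 1):

n <- nrow(USArrests)

# FactoMineR's individual coordinates equal princomp's scores with the sign flipped
all.equal(princompPCA$scores[, 1:2], -fmrPCA$ind$coord[, 1:2],
          check.attributes = FALSE, tolerance = 1e-6)

# princomp() standardises with divisor n, prcomp() with n - 1, hence the small scale difference
all.equal(princompPCA$scores[, 1:2] * sqrt((n - 1) / n), prcompPCA$x[, 1:2],
          check.attributes = FALSE, tolerance = 1e-6)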
I was looking for the same info and found this link helpful:
https://groups.google.com/forum/#!topic/factominer-users/BRN8jRm-_EM
FactoMineR outputs PCA coordinates, not loadings, which confused me for a while...
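For what it's worth, you can see that with the objects from the earlier code (a sketch; it assumes FactoMineR's variable coordinates are the loadings scaled by the square root of each eigenvalue, signs aside):

sweep(fmrPCA$var$coord, 2, sqrt(fmrPCA$eig[1:4, 1]), "/")   # rescale coordinates back to unit-length loadings
unclass(princompPCA$loadings)[, 1:4]                        # princomp's loadings, identical up to column signs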
Hi, I am using the partitioning around medoids algorithm for clustering, via the pam function in the cluster package. I have 4 attributes in the dataset that I clustered, and they seem to give me around 6 clusters. I want to generate a plot of these clusters across those 4 attributes, like this "Centroid plot" example: http://www.flickr.com/photos/52099123@N06/7036003411/in/photostream/lightbox/
But the only way I can draw the clustering result is either a dendrogram or

plot(data, col = result$clustering)

which seems to generate a plot similar to this "pam results" image: http://www.flickr.com/photos/52099123@N06/7036003777/in/photostream
Although the first image is a centroid plot, I am wondering whether there are any tools available in R to do the same with a medoid plot. Note that it also prints the size of each cluster in the plot. It would be great to know if there are any packages/solutions in R that facilitate this, or if not, what would be a good starting point for achieving plots similar to the one in Image 1.
Thanks
Hi all, I was trying to work out the problem the way Joran suggested, but I think I did not understand it correctly and have not done it the right way. Anyway, this is what I have done so far. The following is what the file I tried to cluster looks like:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
The following is the PAM clustering output:
GRMZM2G181227       1
GRMZM2G146885       2
GRMZM2G139463       2
GRMZM2G015295       2
GRMZM2G111909       2
GRMZM2G078097       3
GRMZM2G450498       3
GRMZM2G413652       2
GRMZM2G090087       2
AC217811.3_FG003    2
Using the above two files I generated a third file that looks like this, with the cluster information added as a cluster type (K1, K2, etc.):
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I certainly don't think this is the file that Joran would have wanted me to create, but I could not think of anything else, so I ran lattice on the above file using the following code.
library(lattice)

clusres <- read.table("clusinput.txt", header = TRUE, sep = "\t")

jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
     pointsize = 12, quality = 100, bg = "white", res = 100)
parallel(~clusres[2:5] | Cluster_type, clusres, horizontal.axis = FALSE)  # called parallelplot() in newer lattice versions
dev.off()
and I get a picture like this
Since I want one single line as the representative of the whole cluster at the four different points, this output is wrong. Moreover, I have played with lattice but cannot figure out how to make it accept the RPKM values as the x coordinate; it always seems to plot many lines against a maximum or minimum value on the y coordinate, which I don't understand.
It would be great if anybody could help me out. Sorry if my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
1. Add a column of cluster labels (K1, K2, etc.) to your original data set, based on your clustering algorithm's output.
2. Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
3. Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot2 or lattice, both of which provide excellent support for creating the same plot split across a single grouping vector (i.e. the clusters in your case). A rough sketch follows below.
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
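That said, an untested sketch of those steps might look like the following (column names taken from your table above; the mean is an arbitrary choice of summary statistic, and reshape2/ggplot2 are just one possible toolset):

library(reshape2)   # melt()
library(ggplot2)

# long format: one row per gene, cluster, and measurement stage
long <- melt(clusres, id.vars = c("geneID", "Cluster_type"),
             variable.name = "stage", value.name = "RPKM")

# one summary value (here the mean) per cluster and stage
summ <- aggregate(RPKM ~ Cluster_type + stage, data = long, FUN = mean)

# one line per cluster, one panel per cluster
ggplot(summ, aes(x = stage, y = RPKM, group = Cluster_type)) +
  geom_line() +
  facet_wrap(~ Cluster_type)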
How about using clusplot from the cluster package with partitioning around medoids? Here is a simple example (from the examples in the help page):
require(cluster)

# generate 25 objects, divided into 2 clusters
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))

clusplot(pam(x, 2))   # pam() does your partitioning
I have some experience with R as a statistics platform, but am inexperienced with image-based maths. I have a series of photographs (tiff format, px/µm is known) containing holes and irregular curves. I'd like to measure the shortest distance between a hole and the closest curve for that particular hole, and I'd like to do this for each hole in a photograph. The holes are not regular either, so maybe I'd need to tell the program what the holes and the curves are (ImageJ has point and segmented-line functions).
Any ideas how to do this? Which package should I use in R? Would you recommend another program for this kind of task?
EDIT: Doing this is now possible using the sclero package. The package is currently available on GitHub and the procedure is described in detail in the tutorial. Just to illustrate, here is an example from the tutorial:
library(devtools)
install_github("MikkoVihtakari/sclero", dependencies = TRUE)
library(sclero)
path <- file.path(system.file("extdata", package = "sclero"), "shellspots.zip")
dat <- read.ijdata(path, scale = 0.7812, unit = "um")
shell <- convert.ijdata(dat)
aligned <- spot.dist(shell)
plot(aligned)
It is also possible to add sample spot sizes using the functions provided by the sclero package. Please see Section 2.5 in the tutorial.
There's an edge-detection tool written for ImageJ that might help you first find the holes and the lines, and clarify them. You can find it at
http://imagejdocu.tudor.lu/doku.php?id=plugin:filter:edge_detection:start
Playing around with the settings for the thresholding and the hysteresis can help to get the lines and holes detected. It's difficult to tell whether this has much chance of working without seeing your actual photographs, but a colleague of mine had good results using this tool on FRAP images. I programmed an ImageJ tool that can calculate recoveries in FRAP analysis based on those images. You might get some ideas for your own problem by looking at the code (see: http://imagejdocu.tudor.lu/doku.php?id=plugin:analysis:frap_normalization:start).
The only way I know of to work with images in R is EBImage, which is part of the Bioconductor system. The package Rimage is orphaned, so it is no longer maintained.
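Just to get started, reading and thresholding an image with EBImage looks roughly like this (an untested sketch; "photo.tif" is a placeholder for one of your files and 0.5 an arbitrary cut-off):

# install.packages("BiocManager"); BiocManager::install("EBImage")
library(EBImage)

img <- readImage("photo.tif")
display(img)

bw <- img > 0.5     # crude fixed threshold to separate dark curves/holes from the background
display(bw)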
To find the shortest distance: once you have the coordinates of the lines and holes, you can go for the shotgun approach: calculate the distances between all line points and hole points, and then take the minimum. An illustration of that in R:
x  <- -100:100
x2 <- seq(-70, -50, length.out = length(x)/4)

# a straight line, and a circular "hole" centred at (-60, 200) with radius 10
a.line <- list(x = x,
               y = 4*x + 5)
a.hole <- list(x = c(x2, rev(x2)),
               y = c(200 + sqrt(100 - (x2 + 60)^2),
                     rev(200 - sqrt(100 - (x2 + 60)^2))))
plot(a.line,type='l')
lines(a.hole,col='red')
# brute force: pairwise distances between every line point and every hole point
calc.distance <- function(line, hole){
  mline <- matrix(unlist(line), ncol = 2)
  mhole <- matrix(unlist(hole), ncol = 2)
  id1 <- rep(1:nrow(mline), nrow(mhole))
  id2 <- rep(1:nrow(mhole), each = nrow(mline))
  min(
    sqrt(
      (mline[id1, 1] - mhole[id2, 1])^2 +
      (mline[id1, 2] - mhole[id2, 2])^2
    )
  )
}
Then:
> calc.distance(a.line,a.hole)
[1] 95.51649
You can check this mathematically by deriving the exact distance from the equations of the circle and the line. This approach is fast enough as long as you don't have millions of points describing thousands of lines and holes.
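In this toy example the exact value is easy to compute: the hole is a circle of radius 10 centred at (-60, 200) and the line is 4x - y + 5 = 0, so the true minimum is the point-to-line distance minus the radius:

abs(4*(-60) - 200 + 5) / sqrt(4^2 + 1) - 10
# approximately 95.503; the sampled result above (95.51649) is slightly higher only because
# the line and the hole are described by discrete points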