How can I plot consensus sequences as a binary heatmap in R - r

I have many amino acid sequences in FASTA format on which I ran a multiple sequence alignment. I want to plot the result as a binary heatmap: for example, red where a residue changed and yellow where it did not.
I came across msaplot from the ggtree package, and I also checked the ggmsa package, but so far I have not got what I wanted.
So basically I wanted to:
change the multiple sequence alignment to a binary matrix (if the amino acid differs from the reference sequence, plot x; if not, plot y)
plot as a heatmap
A multiple sequence alignment is something like this for those who don't know.
I know that I should provide a data example, but I am not sure how to create an example of a multiple sequence alignment.
If you install ggmsa, you can get example data and plot it in R using:
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
ggmsa(protein_sequences, start = 265, end = 300)

We read in the alignment:
library(Biostrings)
library(ggmsa)
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
aln = readAAMultipleAlignment(protein_sequences)
ggmsa(protein_sequences, start = 265, end = 300)
Set the reference as the first sequence (a Rattus sequence here); you can also use the consensus from consensusString():
aln = unmasked(aln)
names(aln)[1]
[1] "PH4H_Rattus_norvegicus"
ref = aln[1]
Here we iterate through the sequences and build a binary vector marking where each one matches the reference:
bm = sapply(1:length(aln),function(i){
as.numeric(as.matrix(aln[i])==as.matrix(ref))
})
bm = t(bm)
rownames(bm) = names(aln)
The plot above shows the sequences in reverse order, so to visualize the same thing we reverse the rows and subset columns 265 to 300:
library(pheatmap)
pheatmap(bm[nrow(bm):1,265:300],cluster_rows=FALSE,cluster_cols=FALSE)
The last row is Rattus, the reference, and anything identical to it is red. As you can see in the alignment above, the last 4 sequences are identical.
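The same idea can be sketched without Bioconductor on a toy alignment. This is only an illustration: the three sequences and the names aln_toy, chars, ref_mat and bm_toy are made up, and the consensus here is a simple per-column majority vote, not the output of consensusString().

```r
# Toy alignment: made-up sequences of equal length, not taken from sample.fasta
aln_toy <- c(ref = "MKV-LA", s1 = "MKV-LA", s2 = "MRVALA")

# One row per sequence, one column per alignment position
chars <- do.call(rbind, strsplit(aln_toy, ""))

# Per-column majority residue as a simple stand-in for a consensus reference
consensus <- apply(chars, 2, function(col) names(which.max(table(col))))

# 1 where a residue matches the reference, 0 where it differs
ref_mat <- matrix(consensus, nrow = nrow(chars), ncol = ncol(chars), byrow = TRUE)
bm_toy <- (chars == ref_mat) * 1

print(paste(consensus, collapse = ""))
print(bm_toy)
```

To get the colours from the question, a matrix like this could be passed to pheatmap with color = c("red", "yellow"), so that 0 (changed) is drawn red and 1 (unchanged) yellow.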

Related

Problem with computing venn.diagram with two data sets of character values

As in the title, when I try to draw a Venn diagram with the VennDiagram package, the resulting plot looks like this: Resulting graph.
My input is two tables read from txt files with read.delim (I also tried read.table) and put into a list for venn.diagram. The datasets are 1325 and 675 rows long, with short peptide sequences as character values (e.g. REVDPDGRRTL), so I don't understand the resulting graph.
Here's what should work in theory:
library("VennDiagram")
#reading files
hid <- read.table("data/file1.txt", sep = "\t")
lid <- read.table("data/file2.txt", sep = "\t")
#creating a list
vid <- list(High = hid, Low = lid)
#graph
venn.diagram(vid, fill = c("#EFC000FF", "#0073C2FF"), filename = "#venn.png")
I also tried transforming the sets to vectors/lists and plotting those, but the problem stays the same.
It must lie on the dataset/list side, because the graph is correct when I pass example values like this:
venn.diagram(list(Low = c("REVDPDGRRTL", "IYEDEDVKEA", "GVYDGREHTV"), High = c("IYEDEDVKEA", "GVYDGREHTV")),
fill = c("#EFC000FF", "#0073C2FF"), filename = "#venn.png")
I'm sure it's some rookie mistake but I can't think of a solution.
Any help is highly appreciated,
Thank you
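A likely cause, assuming each file holds a single column of peptides: read.table() returns a data frame, so the list passed to venn.diagram() contains data frames instead of character vectors. A minimal sketch with made-up data frames standing in for the two files (V1 is the column name read.table() assigns when there is no header):

```r
# Stand-ins for the two files; read.table() without a header names the column V1
hid <- data.frame(V1 = c("REVDPDGRRTL", "IYEDEDVKEA", "GVYDGREHTV"),
                  stringsAsFactors = FALSE)
lid <- data.frame(V1 = c("IYEDEDVKEA", "GVYDGREHTV"),
                  stringsAsFactors = FALSE)

# Pull the column out of each data frame so the list holds plain character vectors
vid <- list(High = as.character(hid$V1), Low = as.character(lid$V1))

str(vid)
print(length(intersect(vid$High, vid$Low)))  # number of shared peptides

# vid can then be passed to venn.diagram() as in the question
```

If the real files have a header or several columns, the column to extract will differ; str(hid) shows what was actually read.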

Create Venn Diagram from two DF

I'm trying to create a Venn diagram of two data frames, but I only get incorrect results. An example of the data sets, which share the same structure:
Chemical      ChemID
Oxidopamine   D016627
Melatonin     D016627
I've only received incorrect results from the following:
VennDiagram::venn.diagram(
x = list(Lewy, Park),
category.names = c("ChemID, ChemID"),
filename ="venndiagramm.png",
output=TRUE)
Ideally, I would like to export an image showing the number of overlapping chemicals between the two sets.
Welcome to SO! As far as I can guess from your description (two data frames, Lewy and Park, each with a ChemID column), try the following:
VennDiagram::venn.diagram(
x = list(Lewy$ChemID, Park$ChemID), # expects vectors, not dataframes
# category.names = c("ChemID, ChemID"), # only sets the display labels; if used, pass one string per set
filename ="venndiagramm.png",
output=TRUE)
You may increase the chance of a useful answer by providing a minimal working data sample via dput(). Of course you can use simulated data. Try to explain exactly what did not work.
See also ?venn.diagram
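To check the numbers a two-set diagram should show, the region counts can be computed directly with base R set operations. The data frames and IDs below are invented placeholders with the same column names as in the question:

```r
# Made-up data frames with the same structure as in the question (placeholder IDs)
Lewy <- data.frame(Chemical = c("chem_a", "chem_b", "chem_c"),
                   ChemID = c("ID_A", "ID_B", "ID_C"),
                   stringsAsFactors = FALSE)
Park <- data.frame(Chemical = c("chem_b", "chem_c", "chem_d", "chem_e"),
                   ChemID = c("ID_B", "ID_C", "ID_D", "ID_E"),
                   stringsAsFactors = FALSE)

# Work on unique ID vectors, which is also what venn.diagram() should receive
lewy_ids <- unique(Lewy$ChemID)
park_ids <- unique(Park$ChemID)

# Expected counts for the three regions of a two-set Venn diagram
counts <- c(only_Lewy = length(setdiff(lewy_ids, park_ids)),
            both      = length(intersect(lewy_ids, park_ids)),
            only_Park = length(setdiff(park_ids, lewy_ids)))
print(counts)
```

If the exported image shows different numbers, the input list is probably not what was intended.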

Venn diagram in R completely blank

I am trying to create a Venn diagram of the differentially expressed genes shared across 3 data sets. I created a list that contains the differentially expressed genes, then called venn.diagram() with the following arguments: x (my list of gene names in the three data sets), filename, category.names and output. However, the Venn diagram turns out completely blank, with no category names and no numbers inside the intersections.
My code looks like this:
venn.diagram(up, filename = 'venn_up.png', category.names = c('up_PC3', 'up_LAPC4', 'up_22Rv1'), output = TRUE)
Has anyone faced a similar problem? Thanks all!
Without a reproducible dataset it is hard to tell, so I created one:
genes <- paste("gene",1:1000,sep="")
x <- list(
up_PC3 = sample(genes,300),
up_LAPC4 = sample(genes,525),
up_22Rv1 = sample(genes,440)
)
You can use the following code to run a Venn diagram:
library(VennDiagram)
venn.diagram(x, filename = "venn_up.png", category.names = c('up_PC3', 'up_LAPC4', 'up_22Rv1'))
Then check your working directory for the output file venn_up.png.

How to group repeating sequences of numbers using R

The simplest description of what I am trying to do: I have a column in a data.frame like 1,2,3,...,n, 1,2,3,...,n,... and I want to group the first 1...n as 1, the second 1...n as 2, and so on.
The full context is: I am using the R spcosa package to do equal-area stratification and composite sampling on parcels of land. I start with a shapefile from a GIS that contains a number of polygons (land parcels). The end result I want is a GIS file with each stratum and sample location labelled by land parcel, stratum and sample id. So far I can do all of this except one bit: identifying the stratum that each sample belongs to and including it in the sample label. The sample label needs to look like "parcel#-strata#-composite#" (where # is the number). In practice I don't need this actual label, only the separate attributes in the GIS file.
The basic workflow is as follows.
For each individual polygon using spcosa::stratify I divide it into a number of equal area strata like
strata.CSEA <- stratify(poly[i,], nStrata = n, nTry = 1, equalArea = TRUE, nGridCells = x)
Note that spcosa::stratify generates a CompactStratificationEqualArea object. I coerce this to a SpatialPixelsDataFrame, then use rasterToPolygons to be able to output it as a GIS file.
I then generate the sample locations as follows:
samples.SPRC <- spsample(strata.CSEA, n = n, type = "composite")
spcosa::spsample creates a SamplingPatternRandomComposite object. I coerce this to a SpatialPointsDataFrame
samples.SPDF <- as(samples.SPRC, "SpatialPointsDataFrame")
and add two columns to the @data slot
samples.SPDF@data$Strata <- "this is the bit I can't do yet"
samples.SPDF@data$CEA <- poly[i,]$name
I can then write samples.SPDF as a GIS file (i.e. with writeOGR) with all the wanted attributes.
As above, the part I can't sort out is how the sample ids relate to the stratum ids. The sample points are a vector like 1,2,3...n, 1,2,3...n,.... How do I extract which sample goes with which stratum? As the actual stratum numbers are arbitrary, I can just group (as per my simple question above), but ideally I would like to use the numbering of the actual strata so everything lines up.
To give any contributors access to a hands on example I copy below the code from the spcosa documentation slightly modified to generate the correct objects.
# Note: the example below requires the 'rgdal'-package You may consider the 'maptools'-package as an alternative
if (require(rgdal)) {
# read a vector representation of the `Farmsum' field
shpFarmsum <- readOGR(
dsn = system.file("maps", package = "spcosa"),
layer = "farmsum"
)
# stratify `Farmsum' into 50 strata
# NB: increase argument 'nTry' to get better results
set.seed(314)
myStratification <- stratify(shpFarmsum, nStrata = 50, nTry = 1, equalArea = TRUE)
# sample two sampling units per stratum
mySamplingPattern <- spsample(myStratification, n = 2, type = "composite")
# plot the resulting sampling pattern on
# top of the stratification
plot(myStratification, mySamplingPattern)
}
Maybe the order() function can help you:
n <- 10
dat <- data.frame(col1 = rep(1:n, 2), col2 = rnorm(2*n))
head(dat)
dat[order(dat$col1), ]
I did not understand where the "ID" (1,2,3...n) is to be found, so let's assume you have a SpatialPolygonsDataFrame called shpFarmsum with an attribute column "ID". You can access this column via shpFarmsum$ID. If you want to create an individual subset for each ID, this is one way to go:
for (i in unique(shpFarmsum$ID)) {
tempSubset <- shpFarmsum[shpFarmsum$ID == i,]
writeOGR(tempSubset, ".", paste0("subset_", i), driver = "ESRI Shapefile")
}
I added the line writeOGR(... so that all subsets are written to your working directory. However, you can change this line or add further analysis inside the for loop.
How it works
unique(shpFarmsum$ID) extracts all occurring IDs (comparable to your 1,2,3...n).
In each iteration of the for loop, the next of these IDs is used to create a subset of the whole SpatialPolygonsDataFrame, which you can use for further analysis.
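For the simple form of the question (a column that runs 1..n, 1..n, ...), a group id can be derived with cumsum(). This is a base R sketch on a made-up vector; it assumes the rows are still in the order in which the samples were generated, and it does not establish that the resulting group number equals spcosa's own stratum id.

```r
# A column that restarts at 1: here n = 3, repeated three times
x <- rep(1:3, times = 3)

# Start a new group each time the sequence restarts at 1
group <- cumsum(x == 1)

# Alternative that does not rely on the restart value being 1:
# a new group begins whenever the value drops
group2 <- cumsum(c(TRUE, diff(x) < 0))

print(data.frame(x, group, group2))
```

Both versions give 1,1,1,2,2,2,3,3,3 for this input; the second one also handles runs of unequal length as long as each run is increasing.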

Creating SpatialLinesDataFrame from SpatialLines object and basic df

Using leaflet, I'm trying to plot some lines and set their color based on a 'speed' variable. My data start at an encoded polyline level (i.e. a series of lat/long points, encoded as an alphanumeric string) with a single speed value for each EPL.
I'm able to decode the polylines to get series of lat/long points (thanks to Max, here), and I'm able to create segments from those series of points and format them as a SpatialLines object (thanks to Kyle Walker, here).
My problem: I can plot the lines properly using leaflet, but I can't join the SpatialLines object to the base data to create a SpatialLinesDataFrame, and so I can't code the line color based on the speed var. I suspect the issue is that the IDs I'm assigning SL segments aren't matching to those present in the base df.
The objects I've tried to join, with SpatialLinesDataFrame():
"sl_object", a SpatialLines object with ~140 observations, one for each segment; I'm using Kyle's code, linked above, with one key change - instead of creating an arbitrary iterative ID value for each segment, I'm pulling the associated ID from my base data. (Or at least I'm trying to.) So, I've replaced:
id <- paste0("line", as.character(p))
with
lguy <- data.frame(paths[[p]][1])
id <- unique(lguy[,1])
"speed_object", a df with ~140 observations of a single speed var and row.names set to the same id var that I thought I created in the SL object above. (The number of observations will never exceed but may be smaller than the number of segments in the SL object.)
My joining code:
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object)
And the result:
row.names of data and Lines IDs do not match
Thanks, all. I'm posting this in part because I've seen some similar questions - including some referring specifically to changing the ID output of Kyle's great tool - and haven't been able to find a good answer.
EDIT: Including data samples.
From sl_obj, a single segment:
print(sl_obj)
Slot "ID":
[1] "4763655"
[[151]]
An object of class "Lines"
Slot "Lines":
[[1]]
An object of class "Line"
Slot "coords":
lon lat
1955 -74.05228 40.60397
1956 -74.05021 40.60465
1957 -74.04182 40.60737
1958 -74.03997 40.60795
1959 -74.03919 40.60821
And the corresponding record from speed_obj:
row.names speed
... ...
4763657 44.74
4763655 34.8 # this one matches the ID above
4616250 57.79
... ...
To get rid of this error message, either make the row.names of data and Lines IDs match by preparing sl_object and/or speed_object, or, in case you are certain that they should be matched in the order they appear, use
splndf <- SpatialLinesDataFrame(sl = sl_object, data = speed_object, match.ID = FALSE)
This is documented in ?SpatialLinesDataFrame.
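Before choosing between the two options, it can help to see exactly which IDs fail to match. The sketch below uses plain vectors so it runs without sp; line_ids stands in for the IDs stored in the SpatialLines object, and the fourth ID is invented to create a mismatch.

```r
# Hypothetical IDs standing in for sapply(sl_object@lines, function(l) l@ID)
line_ids <- c("4763655", "4763657", "4616250", "4700001")

# Speed table keyed by row.names, as described in the question
speed_object <- data.frame(speed = c(44.74, 34.8, 57.79),
                           row.names = c("4763657", "4763655", "4616250"))

# Lines that have no row in the speed table
missing_speed <- setdiff(line_ids, row.names(speed_object))

# Speed rows that refer to no line
orphan_speed <- setdiff(row.names(speed_object), line_ids)

print(missing_speed)
print(orphan_speed)
```

Both results need to be empty, and the two objects the same length, before matching by ID can succeed.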
All right, I figured it out. The error came from my speed_obj not being the same length as my sl_obj, as mentioned here. ("data = object of class data.frame; the number of rows in data should equal the number of Lines elements in sl")
Resolution: used a quick loop to pull out all of the unique lines IDs, then performed a left join against that list of uniques to create an exhaustive speed_obj (with NAs, which seem to be OK).
ids <- data.frame()
for (i in 1:length(sl_obj)) {
id <- data.frame(sl_obj@lines[[i]]@ID)
ids <- rbind(ids, id)
}
colnames(ids)[1] <- "linkId"
speed_full <- join(ids, speed_obj)  # presumably plyr::join(), a left join by default
speed_full_short <- data.frame(speed_full[,c(-1)])  # drop the linkId column of the joined table
row.names(speed_full_short) <- speed_full$linkId
splndf <- SpatialLinesDataFrame(sl_obj, data = speed_full_short, match.ID = TRUE)
Works fine now!
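The same padding step can also be sketched in base R with match(), without the loop or the join. The names line_ids and speed_full are illustrative, the column layout of speed_obj (linkId, speed) is an assumption, and the fourth ID is invented to show the NA case.

```r
# Hypothetical line IDs, in the order they appear in the SpatialLines object
line_ids <- c("4763655", "4763657", "4616250", "4700001")

# Speed table with fewer rows than there are lines (assumed columns)
speed_obj <- data.frame(linkId = c("4763657", "4763655", "4616250"),
                        speed = c(44.74, 34.8, 57.79),
                        stringsAsFactors = FALSE)

# One row per line, in line order; lines without a speed get NA
speed_full <- data.frame(
  speed = speed_obj$speed[match(line_ids, speed_obj$linkId)],
  row.names = line_ids
)

print(speed_full)
```

A data frame built this way has one row per line and row names equal to the line IDs, which is the shape SpatialLinesDataFrame() expects for its data argument.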
I may have deciphered the issue.
When I pull in my spatial lines data and check the class, it reads as "SpatialLinesDataFrame" even though I know it's a simple linear shapefile. I'm using readOGR to bring the data in, and I believe this is where the conversion occurs. With that in mind, the speed assignment is relatively easy:
sl_object$speed <- speed_object[ match( sl_object$ID , row.names( speed_object ) ) , "speed" ]
This should do the trick, as I'm willing to bet your class(sl_object) is "SpatialLinesDataFrame".
EDIT: I had received the same error as the OP, which drove me to check class().
I am under the impression that the error you got arises because you were trying to coerce a data frame into a data frame, and R wasn't a fan of that.
