Plot a nucleotide chain in R

I am interested in plotting this sample figure in R. Sample figure was generated in Illustrator.
Essentially, my data is structured as such:
> dput(data)
structure(list(FirstPos = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("pos1",
"pos2"), class = "factor"), SecondPos = structure(c(1L, 1L, 1L,
2L, 2L, 2L), .Label = c("pos2", "pos3"), class = "factor"), FirstPosseq = structure(c(1L,
1L, 1L, 2L, 3L, 3L), .Label = c("A", "C", "T"), class = "factor"),
SecondPosseq = structure(c(2L, 4L, 1L, 1L, 3L, 4L), .Label = c("A",
"C", "G", "T"), class = "factor"), Count = c(10L, 100L, 1L,
100L, 100L, 100L)), .Names = c("FirstPos", "SecondPos", "FirstPosseq",
"SecondPosseq", "Count"), class = "data.frame", row.names = c(NA,
-6L))
This is a list of position pairs (an original position and its partner position). For each row, the "Count" column indicates how likely the two nucleotides are to co-occur. I want a way to display that probability along with the position order (on the x-axis). In the example, I varied the line thickness based on "Count".
Looking through the ggplot2 library, I couldn't find figures like this and was hoping to get your advice on potential packages/ways I could use.
Thank you!

One possible solution is to use the igraph package. Below is a basic example of how to get started with your data set.
# Assign your data to variable 'dat'.
dat = structure(list(FirstPos = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("pos1", "pos2"), class = "factor"),
SecondPos = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("pos2", "pos3"), class = "factor"),
FirstPosseq = structure(c(1L, 1L, 1L, 2L, 3L, 3L),
.Label = c("A", "C", "T"), class = "factor"),
SecondPosseq = structure(c(2L, 4L, 1L, 1L, 3L, 4L),
.Label = c("A", "C", "G", "T"), class = "factor"),
Count = c(10L, 100L, 1L, 100L, 100L, 100L)),
.Names = c("FirstPos", "SecondPos", "FirstPosseq",
"SecondPosseq", "Count"), class = "data.frame",
row.names = c(NA, -6L))
library(igraph)
# Create unique names/ids for each vertex in the graph.
dat$node1 = paste(dat$FirstPos, dat$FirstPosseq, sep="_")
dat$node2 = paste(dat$SecondPos, dat$SecondPosseq, sep="_")
# Use the last two columns of the data (node1, node2) as an edge-list matrix and create the graph.
g = graph_from_edgelist(as.matrix(dat[, c(6, 7)]))
# Add edge weights to graph.
E(g)$weight = dat$Count
# Plot using 'layout_as_tree' to control layout.
plot(g, layout=layout_as_tree(g, root=1), edge.width=log10(E(g)$weight + 1) * 5,
vertex.size=30, vertex.color="white", edge.color="black",
edge.arrow.mode=0L, vertex.label.color="black")
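To see what the plot call above is working with, here is a minimal base R sketch that rebuilds the edge list and the log-scaled widths without igraph. The data frame reproduces the values from the question as plain character columns; the `edges` name is just an illustrative choice.

```r
# Same values as the question's data, written out as plain columns.
dat <- data.frame(
  FirstPos     = c("pos1", "pos1", "pos1", "pos2", "pos2", "pos2"),
  SecondPos    = c("pos2", "pos2", "pos2", "pos3", "pos3", "pos3"),
  FirstPosseq  = c("A", "A", "A", "C", "T", "T"),
  SecondPosseq = c("C", "T", "A", "A", "G", "T"),
  Count        = c(10L, 100L, 1L, 100L, 100L, 100L)
)

# One row per edge: unique vertex names plus the width used for plotting.
edges <- data.frame(
  from  = paste(dat$FirstPos, dat$FirstPosseq, sep = "_"),
  to    = paste(dat$SecondPos, dat$SecondPosseq, sep = "_"),
  width = log10(dat$Count + 1) * 5
)
print(edges)
```

Because Count spans 1 to 100, the log scale keeps the thinnest edge visible while stopping the thickest ones from dominating the plot.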

Related

cut.default error in heatmap generation R

I want to generate a heatmap from a data frame with 8 columns and 6 rows. The last row of the data frame holds the information used to annotate the columns. The structure of the data frame is as follows:
heatmap_try <-structure(list(BGC0000041 = structure(c(1L, 2L, 1L, 1L, 1L, 3L
), .Label = c("0", "0.447458977", "a"), class = "factor"), BGC0000128 = structure(c(1L,
1L, 1L, 3L, 2L, 4L), .Label = c("0", "1.785875195", "4.093659107",
"a"), class = "factor"), BGC0000287 = structure(c(1L, 1L, 1L,
3L, 2L, 4L), .Label = c("0", "1.785875195", "4.456229186", "b"
), class = "factor"), BGC0000294 = structure(c(3L, 1L, 2L, 4L,
1L, 5L), .Label = c("0", "2.035046947", "3.230553742", "3.286304185",
"b"), class = "factor"), BGC0000295 = structure(c(1L, 1L, 1L,
2L, 1L, 3L), .Label = c("0", "2.286304185", "c"), class = "factor"),
BGC0000308 = structure(c(4L, 2L, 3L, 5L, 1L, 6L), .Label = c("6.277728291",
"6.313707588", "6.607936616", "6.622871165", "6.64385619",
"c"), class = "factor"), BGC0000323 = structure(c(1L, 2L,
1L, 1L, 1L, 3L), .Label = c("0", "0.447458977", "c"), class = "factor"),
BGC0000328 = structure(c(1L, 2L, 1L, 1L, 1L, 3L), .Label = c("0",
"0.447458977", "c"), class = "factor")), class = "data.frame", row.names = c("Gut",
"Oral", "Anterior_nares", "Retroauricular_crease", "Vagina",
"AL"))
My code for heatmap generation is as follows (I am using pheatmap library):
library(pheatmap)
heatmap_data1 <- heatmap_try[ c(1:5), c(1:8) ]
anotation_data <- as.data.frame(t(heatmap_try[6, ]))
row.names(anotation_data) <- colnames(heatmap_data1)
pheatmap(heatmap_data1, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
However, I am getting the following error:
Error in cut.default(x, breaks = breaks, include.lowest = T) :
'x' must be numeric
What am I doing wrong?
Thanks!

This is because the columns of heatmap_data1 are factors, but pheatmap needs them to be numeric. One way to convert them is:
heatmap_data1_num <- as.data.frame(lapply(heatmap_data1,
function(x) as.numeric(as.character(x))))
# then as before
pheatmap(heatmap_data1_num, annotation_col = anotation_data, color = colorRampPalette(c("white","blue"))(n=100),cellwidth = 40,cellheight = 6,fontsize_row = 5,cluster_rows = F,cluster_cols = F)
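The reason for going through as.character() first can be seen in a small sketch. The factor `f` below is a made-up example using values of the same kind as in the question.

```r
# A factor holding numbers stored as text, like the heatmap columns.
f <- factor(c("0", "0.447458977", "0", "4.093659107"))

# Direct conversion returns the internal level codes, not the values.
codes <- as.numeric(f)

# Converting to character first recovers the actual numbers.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

Calling as.numeric() directly on a factor would silently produce the level codes, so the heatmap would be drawn from the wrong numbers.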

geom_bar & multiple variables

I am having trouble getting my plots to work. I have multiple categorical variables, and I want to color by one and facet by another. However, R keeps adding together the "values" (I used melt) for the same variable instead. It works when I only have one variable.
Here is my plot with one variable
Here is my plot with two variables, you can see the adding that is happening
simple dataframe
Here is my code:
library(reshape2)
library(ggplot2)
test2 <- structure(list(SampleID = c(12.19, 12.22, 13.1, 12.19, 12.22,
13.1, 12.19, 12.22, 13.1, 12.19, 12.22, 13.1), patient = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), type = structure(c(1L,
1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L), .Label = c("L",
"T"), class = "factor"), timepoint = structure(c(1L, 2L, 2L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "D", class = "factor"), variable = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("A",
"B", "C", "D", "E", "F", "G", "H", "I"), class = "factor"),
value = c(2L, 5L, 6L, 25L, 18L, 12L, 6L, 10L, 15L, 21L, 23L,
33L)), .Names = c("SampleID", "patient", "type", "timepoint",
"Group", "variable", "value"), row.names = c(NA, 12L), class = "data.frame")
ggplot(test2, aes(test2$variable, test2$value, fill=test2$timepoint)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_manual(values=c("rosybrown1", "steelblue2", "gray")) +
labs(x="Category", y="Count", title = paste0("Sample ", as.character(unique(test2$patient)) , " - " , as.character(unique(test2$Group)))) +
facet_wrap(~test2$type) +
theme(text = element_text(size=15),
axis.text.x = element_text(angle = 90, hjust = 1, vjust=.5, size = 7))

If I understand correctly, you just need to pass the scales argument to facet_wrap, like so:
facet_wrap(~type, scales = "free_x")
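As a quick sanity check (a sketch only, assuming the "adding" means that several rows end up in the same bar), one can count how many rows fall into each type, timepoint and variable combination. The data frame below reproduces the relevant columns from the question; `rows_per_bar` is an illustrative name.

```r
# Relevant columns of test2 from the question.
test2 <- data.frame(
  type      = rep(c("L", "L", "T"), times = 4),
  timepoint = rep(c("1", "2", "2"), times = 4),
  variable  = rep(c("A", "B", "C", "D"), each = 3),
  value     = c(2L, 5L, 6L, 25L, 18L, 12L, 6L, 10L, 15L, 21L, 23L, 33L)
)

# Number of rows that would be drawn in each bar of each facet.
rows_per_bar <- aggregate(value ~ type + timepoint + variable,
                          data = test2, FUN = length)
one_row_each <- all(rows_per_bar$value == 1)
print(one_row_each)
```

If every combination has exactly one row, the data itself gives no reason for bars to be summed, and the problem more likely lies in how the columns are referenced in the plot call. The ggplot2 documentation advises using bare column names in aes() and the facet formula (for example `aes(variable, value, fill = timepoint)` and `~type`) rather than `test2$...`.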

Select observations in R based on maximum number listed in a column

I hope I've done this correctly! I have two data frames:
teachers = structure(list(Teacher = c(123L, 123L, 123L, 123L, 124L),
tStudents = c(3L, 3L, 4L, 3L, 4L), Term = c(1801L, 1802L, 1801L, 1803L, 1802L),
Course = structure(c(5L, 6L, 7L, 6L, 8L), .Label = c("ENGG",
"ENGG2", "LITT", "LITT2", "MATH", "MATH2", "PHYS", "SCIE"
), class = "factor")), .Names = c("Teacher", "tStudents", "Term", "Course"), row.names = c(NA, 5L), class = "data.frame")
enrols = structure(list(UniqueStudent = structure(c(3L, 2L, 1L, 5L, 4L),
.Label = c("1801-ENGG-N1-abcd1#abc.edu.au", "1801-MATH-C1-abcd1#abc.edu.au","1801-PHYS-L1-abcd1#abc.edu.au", "1802-MATH2-G1-abcd1#abc.edu.au", "1802-SCIE-K2-abcd1#abc.edu.au"), class = "factor"), Term = c(1801L,1801L, 1801L, 1802L, 1802L), Student.Email.Addresses = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "abcd1#abc.edu.au", class = "factor"), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "s12344", class = "factor"),
Gender.Description = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "M", class = "factor"),
Age = c(12L, 12L, 12L, 12L, 12L), Program.Short.Description = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "LSC1", class = "factor"), Term.CC.CN = structure(c(3L,
2L, 1L, 5L, 4L), .Label = c("1801-ENGG-N1", "1801-MATH-C1",
"1801-PHYS-L1", "1802-MATH2-G1", "1802-SCIE-K2"), class = "factor"),
Course.Code = structure(c(4L, 2L, 1L, 5L, 3L), .Label = c("ENGG",
"MATH", "MATH2", "PHYS", "SCIE"), class = "factor"), Class.Number = structure(c(4L,
1L, 5L, 3L, 2L), .Label = c("C1", "G1", "K2", "L1", "N1"), class = "factor"),
Teacher = c(123L, 123L, 125L, 124L, 123L)), .Names = c("UniqueStudent", "Term", "Student.Email.Addresses", "ID", "Gender.Description", "Age", "Program.Short.Description", "Term.CC.CN", "Course.Code", "Class.Number", "Teacher"), row.names = c(NA, 5L), class = "data.frame")
teachers$tStudents lists the maximum number of students allowed to be allocated to a teacher per Term and Course. I've also pre-merged the Course enrolments in the "enrols" data to list the Teachers for each course.
So, what I need to do is create class lists from the enrols data using the teachers data, grouped by c("Teacher", "Term", "Course"), but each class list can contain at most the number of students listed in teachers$tStudents. Ideally, I'd also like to select a representative distribution of students so that the new class lists include both genders, different ages and different Program.Short.Description values.
I've tried merging in different ways in dplyr and can create full lists with all students but haven't been able to use the teachers$tStudents column to limit the number of observations to select. Is this possible?
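One possible approach, sketched in base R with small made-up data frames (the IDs, courses and limits below are hypothetical and only mirror the column names in the question): merge the limit onto the enrolments, split by teacher, term and course, and keep at most tStudents rows from each group.

```r
# Toy stand-ins for the real data; values are invented for illustration.
teachers <- data.frame(
  Teacher   = c(123L, 124L),
  Term      = c(1801L, 1802L),
  Course    = c("MATH", "SCIE"),
  tStudents = c(2L, 1L)
)
enrols <- data.frame(
  ID          = paste0("s", 1:5),
  Teacher     = c(123L, 123L, 123L, 124L, 124L),
  Term        = c(1801L, 1801L, 1801L, 1802L, 1802L),
  Course.Code = c("MATH", "MATH", "MATH", "SCIE", "SCIE")
)

# Attach each teacher's limit to the matching enrolment rows.
merged <- merge(enrols, teachers,
                by.x = c("Teacher", "Term", "Course.Code"),
                by.y = c("Teacher", "Term", "Course"))

# Split into one group per teacher/term/course and cap each group.
groups <- split(merged,
                list(merged$Teacher, merged$Term, merged$Course.Code),
                drop = TRUE)
class_lists <- do.call(rbind, lapply(groups, function(g) {
  head(g, g$tStudents[1])
}))
rownames(class_lists) <- NULL
print(class_lists)
```

Here head() simply takes the first rows of each group. To move toward a representative selection, the rows of each group could be shuffled or sampled by gender, age and program before capping, but that part is left open in this sketch.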

Special characters in a column: mess in the table

I have a problem with special characters in a column of a table.
Here is an example of the data:
structure(list(shipType = structure(c(1L, 3L, 1L, 2L, 4L), .Label = c("CARGO",
"FISHING", "TOWING_LONG_WIDE", "UNKNOWN"), class = "factor"),
shipCargo = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "UNDEFINED", class = "factor"),
destination = structure(c(3L, 1L, 2L, 4L, 5L), .Label = c("\\KORSOR ;.,NA,.\\",
"LEHTMA", "RIGA", "TALLIN", "VYBORG"), class = "factor"),
eta = structure(c(1L, 2L, 5L, 3L, 4L), .Label = c("01/01 00:00 UTC",
"01/01 09:00 UTC", "24/12 16:00 UTC", "26/12 07:00 UTC",
"30/12 16:00 UTC"), class = "factor"), imo = structure(c(3L,
5L, 1L, 4L, 2L), .Label = c("7101891", "7406318", "9066045",
"9158185", "Russia"), class = "factor"), callsign = structure(c(5L,
1L, 2L, 3L, 4L), .Label = c("12", "UALB", "UBYK8", "UFPC",
"UICC"), class = "factor"), country = structure(c(2L, 1L,
2L, 2L, 2L), .Label = c("2014-12-29", "Russia"), class = "factor"),
month = c(12L, 1L, 12L, 12L, 12L), date = structure(c(2L,
1L, 2L, 2L, 2L), .Label = c("", "2014-12-29"), class = "factor"),
week = c(1L, NA, 1L, 1L, 1L), X = c(NA, NA, NA, NA, NA)), .Names = c("shipType",
"shipCargo", "destination", "eta", "imo", "callsign", "country",
"month", "date", "week", "X"), class = "data.frame", row.names = c(NA,
-5L))
As you can see on the second row, there is a problem in the column "destination" when reading the file with the following code
data <- read.table(file, header=T, fill=T, sep=",")
I have tried different things, such as: exporting with quotes and without headers
data <- read.table(file, sep=",", fill=T, head=F, quote="")
and then removing the first line (the actual headers that are in the table) and adding the headers back:
data <- data[-1,]
colnames(data)<-c( "shipType", "shipCargo","destination","eta","imo","callsign", "country","month","date","week")
It looks better, but there are a lot of special characters, and editing them by hand would be time-consuming and error-prone (I have a lot of tables).
Is there a way to keep the columns from getting messed up when importing the file?
Thank you!
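A first diagnostic step, sketched here with a small hypothetical file (three columns only, and a simplified destination value with embedded commas), is to count the fields on each line with count.fields() from the utils package. With fill=T, read.table pads or wraps such rows silently, so the affected lines are easy to miss.

```r
# Hypothetical miniature of the file; the second data row has a
# destination containing commas, so it splits into too many fields.
lines <- c(
  "shipType,destination,eta",
  "CARGO,RIGA,01/01 00:00 UTC",
  "TOWING_LONG_WIDE,KORSOR ;.,NA,.,01/01 09:00 UTC",
  "FISHING,TALLIN,24/12 16:00 UTC"
)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# Number of comma-separated fields on each line, quotes disabled.
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

# Lines whose field count differs from the header line.
bad_rows <- which(n_fields != n_fields[1])
print(n_fields)
print(bad_rows)
```

Once the offending lines are known, they can be repaired or excluded before calling read.table. If the export can be changed, quoting the text fields at the source avoids the problem entirely.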

Bin data by (x,y) and summarize

These are the first 10 lines of a huge file I have (note that there is only one user in these 10 lines, but I've got thousands of users):
dput(testd)
structure(list(user = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), otime = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L
), .Label = c("2010-10-12T19:56:49Z", "2010-10-13T03:57:23Z",
"2010-10-13T16:41:35Z", "2010-10-13T20:05:43Z", "2010-10-13T23:31:51Z",
"2010-10-14T00:21:47Z", "2010-10-14T18:25:51Z", "2010-10-16T03:48:54Z",
"2010-10-16T06:02:04Z", "2010-10-17T01:48:53Z"), class = "factor"),
lat = c(39.747652, 39.891383, 39.891077, 39.750469, 39.752713,
39.752508, 39.7513, 39.758974, 39.827022, 39.749934),
long = c(-104.99251, -105.070814, -105.068532, -104.999073,
-104.996337, -104.996637, -105.000121, -105.010853,
-105.143191, -105.000017),
locid = structure(c(5L, 4L, 9L, 6L, 1L, 2L, 8L, 3L, 10L, 7L),
.Label = c("2ef143e12038c870038df53e0478cefc",
"424eb3dd143292f9e013efa00486c907", "6f5b96170b7744af3c7577fa35ed0b8f",
"7a0f88982aa015062b95e3b4843f9ca2", "88c46bf20db295831bd2d1718ad7e6f5",
"9848afcc62e500a01cf6fbf24b797732f8963683", "b3d356765cc8a4aa7ac5cd18caafd393",
"d268093afe06bd7d37d91c4d436e0c40d217b20a", "dd7cd3d264c2d063832db506fba8bf79",
"f6f52a75fd80e27e3770cd3a87054f27"), class = "factor"),
dnt = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L),
.Label = c("2010-10-12 19:56:49",
"2010-10-13 03:57:23", "2010-10-13 16:41:35", "2010-10-13 20:05:43",
"2010-10-13 23:31:51", "2010-10-14 00:21:47", "2010-10-14 18:25:51",
"2010-10-16 03:48:54", "2010-10-16 06:02:04", "2010-10-17 01:48:53"
), class = "factor"),
x = c(-11674.6344476781, -11683.3414552141,
-11683.0877083915, -11675.3642199817, -11675.0599906624,
-11675.0933491404, -11675.4807522648, -11676.6740962175,
-11691.3894104198, -11675.4691879924),
y = c(4419.73724843345, 4435.719406435, 4435.68538078744,
4420.05048454181, 4420.3000059572, 4420.27721099723,
4420.14288752585, 4420.99619739292, 4428.56278976123,
4419.99099525605),
cellx = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L),
.Label = c("[-11682,-11672)", "[-11692,-11682)"
), class = "factor"),
celly = structure(c(1L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("[4419,4429)", "[4429,4439)"
), class = "factor"),
cellxy = structure(c(1L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 2L, 1L), .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"
), class = "factor")), .Names = c("user", "otime", "lat",
"long", "locid", "dnt", "x", "y", "cellx", "celly", "cellxy"), class = "data.frame", row.names = c(NA,
-10L))
A bit of explanation of the data to make it easier to understand: x and y are transformations of the lat and long coordinates. I have discretised the x,y locations into bins using cut. I want to get the most visited bin per user, so I use ddply as follows:
cells = ddply(testd, .(user, cellxy), summarise, count = length(cellxy))
Obtaining:
dput(cells)
structure(list(user = c(0, 0, 0), cellxy = structure(1:3, .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"), class = "factor"),
count = c(7L, 1L, 2L)), .Names = c("user", "cellxy", "count"
), row.names = c(NA, -3L), class = "data.frame")
Now what I want to do is calculate the average x,y from the first dataset for the most visited bin per user as obtained from the previous calculation. I have no idea how to do this efficiently and given that my dataset is really big I would appreciate some guidance. Thanks!

Here is a two-stage approach. First, modify your original ddply() call so that, for each combination of user and cellxy, it also calculates the mean x and y values.
library(plyr)
cells = ddply(testd, .(user, cellxy), summarise,
cellcount=length(cellxy), meanx=mean(x), meany=mean(y))
cells
  user                     cellxy cellcount     meanx    meany
1    0 [-11682,-11672)[4419,4429)         7 -11675.40 4420.214
2    0 [-11692,-11682)[4419,4429)         1 -11691.39 4428.563
3    0 [-11692,-11682)[4429,4439)         2 -11683.21 4435.702
Then use another call to ddply() to keep, for each user, the cellxy with the highest cellcount.
cells2 = ddply(cells,.(user),subset,cellcount==max(cellcount))
cells2
  user                     cellxy cellcount    meanx    meany
1    0 [-11682,-11672)[4419,4429)         7 -11675.4 4420.214
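For comparison, the same two stages can be sketched in base R without plyr. The toy data below is made up (two users, short cell labels) and only mirrors the column names used above.

```r
# Toy data: user 0 visits cell "a" twice and "b" once; user 1 visits "b" twice.
testd <- data.frame(
  user   = c(0L, 0L, 0L, 1L, 1L),
  cellxy = c("a", "a", "b", "b", "b"),
  x      = c(1, 3, 10, 20, 22),
  y      = c(2, 4, 10, 30, 32)
)

# Stage 1: mean x and y plus the visit count for each user/cell pair.
cells <- aggregate(cbind(x, y) ~ user + cellxy, data = testd, FUN = mean)
names(cells)[names(cells) %in% c("x", "y")] <- c("meanx", "meany")
cells$cellcount <- aggregate(x ~ user + cellxy, data = testd, FUN = length)$x

# Stage 2: keep the row with the highest count for each user.
top <- do.call(rbind, lapply(split(cells, cells$user), function(d) {
  d[which.max(d$cellcount), ]
}))
rownames(top) <- NULL
print(top)
```

Note that which.max() keeps only the first cell when counts are tied, whereas the subset() call above keeps every tied cell.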

Since your data set is large, you might want to consider data.table, which will not only be very fast but will also make the data munging a bit easier.
Converting to a data table is straightforward:
library(data.table)
# 'key' (not 'by') is the data.table() argument that sets the sort key.
DT <- data.table(testd, key="user")
Then determining the most visited cell for each user takes just one line:
# Determine the most visited cell, by user; which.max() returns a single cell even when counts are tied.
DT[, "MostVisited" := {counts <- table(cellxy); names(counts)[which.max(counts)]}, by=user]
I'm not sure exactly how you want to calculate the average x, y for the MostVisited cell, but that should also be fairly straightforward with data.table.
## Perhaps something like this: average x and y over the rows that fall in each user's most visited cell
DT[cellxy == MostVisited, list(AvgX = mean(x), AvgY = mean(y)), by=user]
