Creating a data set with paired data and converting it into a matrix - r

So, I'm using R to try to run a phylogenetic PCA on a dataset of mine with the phyl.pca function from the phytools package. However, I'm having issues organising my data in a way that the function will accept. And that's not all: from a bit of experimenting I know there are more issues further down the line, which I will get into...
Getting straight to the issue, here's the data frame (with dummy data) that I'm using:
>all
Taxa Tibia Feather
1 Microraptor 138 101
2 Microraptor 139 114
3 Microraptor 145 141
4 Anchiornis 160 81
5 Anchiornis 146 NA
6 Archaeopteryx 134 82
7 Archaeopteryx 136 71
8 Archaeopteryx 132 NA
9 Archaeopteryx 142 NA
10 Scansoriopterygidae 120 85
11 Scansoriopterygidae 116 NA
12 Scansoriopterygidae 123 NA
13 Sapeornis 108 NA
14 Sapeornis 112 86
15 Sapeornis 118 NA
16 Sapeornis 103 NA
17 Confuciusornis 96 NA
18 Confuciusornis 107 30
19 Confuciusornis 148 33
20 Confuciusornis 128 61
The taxa are arranged into a tree (called "tree") with Microraptor being the most basal and then progressing in order through to Confuciusornis:
>summary(tree)
Phylogenetic tree: tree
Number of tips: 6
Number of nodes: 5
Branch lengths:
mean: 1
variance: 0
distribution summary:
Min. 1st Qu. Median 3rd Qu. Max.
1 1 1 1 1
No root edge.
Tip labels: Confuciusornis
Sapeornis
Scansoriopterygidae
Archaeopteryx
Anchiornis
Microraptor
No node labels.
And the function:
>phyl.pca(tree, all, method="BM", mode="corr")
And this is the error that is coming up:
Error in phyl.pca(tree, all, method = "BM", mode = "corr") :
number of rows in Y cannot be greater than number of taxa in your tree
Y being the "all" data frame: I have 6 taxa in my tree (matching the 6 taxa in the data frame), but there are 20 rows in my data frame. So I used this function:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
And got this:
Group.1 Tibia Feather
1 Anchiornis 153 81
2 Archaeopteryx 136 77
3 Confuciusornis 120 41
4 Microraptor 141 119
5 Sapeornis 110 86
6 Scansoriopterygidae 120 85
It's a bit odd that the order of the taxa has changed... Is this ok?
In any case, I converted it into a matrix:
> all_agg_matrix <- as.matrix(all_agg)
> all_agg_matrix
Group.1 Tibia Feather
[1,] "Anchiornis" "153" "81"
[2,] "Archaeopteryx" "136" "77"
[3,] "Confuciusornis" "120" "41"
[4,] "Microraptor" "141" "119"
[5,] "Sapeornis" "110" "86"
[6,] "Scansoriopterygidae" "120" "85"
And then used the phyl.pca function:
> phyl.pca(tree, all_agg_matrix, method = "BM", mode = "corr")
[1] "Y has no names. function will assume that the row order of Y matches tree$tip.label"
Error in invC %*% X : requires numeric/complex matrix/vector arguments
So, now the order that the function is considering taxa in is all wrong (but I can fix that relatively easily). The issue is that phyl.pca doesn't seem to believe that my matrix is actually a matrix. Any ideas why?

I think you may have bigger problems. Most phylogenetic methods, I suspect including phyl.pca, assume that traits are fixed at the species level (i.e., they don't account for within-species variation). Thus, if you want to use phyl.pca, you probably need to collapse your data to a single value per species, e.g. via
dd_agg <- aggregate(dd[,-1],by=list(dd$Taxa),mean,na.rm=TRUE)
Extract the numeric columns and label the rows properly so that phyl.pca can match them up with the tips correctly:
dd_mat <- dd_agg[,-1]
rownames(dd_mat) <- dd_agg[,1]
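As an aside, this is probably also why your as.matrix() call upstream led to the invC %*% X error: calling as.matrix() on a data frame that still contains a character column coerces every entry to character, so phyl.pca no longer receives numeric data. A toy one-liner (not your data) shows the coercion:
mode(as.matrix(data.frame(Taxa = "Microraptor", Tibia = 138)))
# [1] "character"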
Using these aggregated data, I can make up a tree (since you didn't give us one) and run phyl.pca ...
library(phytools)
tt <- rcoal(nrow(dd_agg),tip.label=dd_agg[,1])
phyl.pca(tt,dd_mat)
If you do need to do an analysis that takes within-species variation into account, you might need to ask somewhere more specialized, e.g. the r-sig-phylo@r-project.org mailing list ...

The answer posted by Ben Bolker works: the data (called "all") is collapsed to a single value per species and the rows are labelled with the taxon names before running the function. Like so:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
> all_mat <- all_agg[,-1]
> rownames(all_mat) <- all_agg[,1]
> phyl.pca(tree,all_mat, method= "lambda", mode = "corr")
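Note that phyl.pca matches rows to tips by name (that is what the earlier "Y has no names" message was hinting at), so the reordering done by aggregate() is harmless once the rownames are set. If you want to make the ordering explicit anyway, a one-line reorder (assuming every tip label has a matching row) is:
all_mat <- all_mat[tree$tip.label, ]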
Thanks to everyone who contributed an answer and especially Ben! :)

Related

R - [DESeq2] - How to use TMM normalized counts (from edgeR) as input for DESeq2?

I have several RNAseq samples from different experimental conditions. After sequencing and alignment to the reference genome, I merged the raw counts to get a dataframe that looks like this:
> df_merge
T0 DJ21 DJ24 DJ29 DJ32 Rec2 Rec6 Rec9
G10 421 200 350 288 284 198 314 165
G1000 17208 10608 11720 11421 10142 10768 10331 6121
G10000 37 16 19 21 28 12 9 4
G10002 45 13 44 27 12 35 74 14
G10003 136 79 162 429 184 112 192 162
G10004 54 162 73 169 102 300 429 180
G10006 1 0 1 0 0 0 0 0
G10007 3 4 7 2 1 1 1 0
G1001 9030 8366 10608 13604 9808 10654 11663 7985
... ... ... ... ... ... ... ... ...
I use edgeR to perform TMM normalization, which is the method I want and which is not available in DESeq2. For that I use the following script:
## Normalisation by the TMM method (Trimmed Mean of M-values)
library(edgeR)
dge <- DGEList(df_merge) # DGEList object created from the count data
dge2 <- calcNormFactors(dge, method = "TMM") # TMM normalization computes the norm factors
I then obtain the following normalization factors:
> dge2$samples
group lib.size norm.factors
T0 1 129884277 1.1108130
DJ21 1 110429304 0.9453988
DJ24 1 126410256 1.0297216
DJ29 1 123008035 1.0553169
DJ32 1 118968544 0.9927826
Rec2 1 119000510 0.9465131
Rec6 1 114775318 1.0053686
Rec9 1 90693946 0.9275454
I normalize the raw counts with the normalization factors:
# Normalized pseudo-counts are obtained with the cpm() function and reshaped into a long data frame:
library(reshape2)
pseudo_TMM <- log2(cpm(dge2) + 1)
df_TMM <- melt(pseudo_TMM) # row and column names become the first two columns
names(df_TMM)[1:2] <- c("id", "sample")
df_TMM$method <- rep("TMM", nrow(df_TMM))
And I get TMM normalized counts, in a new dataframe:
> pseudo_TMM
T0 DJ21 DJ24 DJ29 DJ32 Rec2 Rec6 Rec9
G10 1.970115581 1.54384913 1.88316953 1.68642670 1.76745996 1.46356074 1.89575666 1.56628879
G1000 6.910138402 6.68101996 6.50839579 6.47542172 6.44077248 6.59395683 6.50032388 6.20481983
G10000 0.329354263 0.20571418 0.19656414 0.21632677 0.30692404 0.14605339 0.10835095 0.06701850
G10002 0.391657436 0.16931112 0.42010652 0.27261134 0.13960084 0.39037793 0.71483462 0.22209164
G10003 0.958011321 0.81287356 1.16642722 2.10593537 1.35494357 0.99592405 1.41354030 1.54881003
G10004 0.458675608 1.35147467 0.64230087 1.20281148 0.89809414 1.87320592 2.23810756 1.65064058
G10006 0.009964976 0.00000000 0.01104103 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
G10007 0.029690785 0.05424318 0.07556948 0.02205789 0.01216343 0.01275200 0.01244875 0.00000000
G1001 5.990679797 6.34224022 6.36623615 6.72515956 6.39302663 6.57876150 6.67346174 6.58377191
... ... ... ... ... ... ... ... ...
And this is where it gets complicated. Usually I do my DGE analysis in DESeq2 with the DESeqDataSetFromHTSeqCount() and DESeq() functions, where DESeq() itself runs an RLE normalization. Now I would like to use DESeq2 directly to do the DGE analysis on my already-normalized data. I saw that a DESeqDataSet object can be created from a matrix with the DESeqDataSetFromMatrix() function.
If someone has already succeeded in using DESeq2 with data from TMM normalization, I would appreciate some advice.
I remembered seeing something about how the norm factors must be converted to the appropriate size factors for DESeq2, and I found this thread on Bioconductor:
https://support.bioconductor.org/p/p133964/
It was suggested to read the following in order to get a better understanding of the conversion necessary:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0157022
Essentially in the supplementary info, they give the following code snippet for the conversion:
tmm <- calcNormFactors(geneCount, method="TMM")
N <- colSums(geneCount) #vector of library size
tmm.counts <- N*tmm/exp(mean(log(N*tmm)))
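To feed this into DESeq2, the converted values are supplied as size factors on a DESeqDataSet built from the raw counts (not the log2 cpm pseudo-counts). A minimal sketch, assuming df_merge holds your raw counts; the condition grouping below is hypothetical and should be replaced with your real design:
library(edgeR)
library(DESeq2)
counts <- as.matrix(df_merge) # DESeq2 expects raw integer counts
tmm <- calcNormFactors(counts, method = "TMM") # vector of TMM normalization factors
N <- colSums(counts) # vector of library sizes
sf <- N * tmm / exp(mean(log(N * tmm))) # TMM factors -> DESeq2-style size factors
coldata <- data.frame(condition = factor(c("T0", "DJ", "DJ", "DJ", "DJ", "Rec", "Rec", "Rec")),
                      row.names = colnames(counts)) # hypothetical grouping of the 8 samples
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)
sizeFactors(dds) <- sf # override DESeq2's own RLE estimation
dds <- DESeq(dds) # DESeq() keeps the size factors already set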
Cheers

subset of dataframe in R using a predefined list of names

I have a list of gene names called "COMBO_mk_plt_genes_labels" and a dataframe of marker genes called "Marker_genes_41POS_12_libraries_test_1" containing genes and fold changes.
I want to extract, from the data frame, the names of the genes that appear in COMBO_mk_plt_genes_labels.
I know that the which() function in R gets the positions of the genes (see my example below), but how do I extract the names and not only the positions?
print(head(Marker_genes_41POS_12_libraries_test_1))
p_val avg_logFC pct.1 pct.2 p_val_adj
HBD 6.610971e-108 3.3357135 0.930 0.080 2.419682e-103
GP1BB 1.332211e-91 2.5397301 0.825 0.047 4.876024e-87
CMTM5 1.938091e-63 2.0580724 0.605 0.005 7.093606e-59
SH3BGRL3 1.067771e-60 1.3750032 0.975 0.592 3.908149e-56
PF4 1.899932e-60 3.0111590 0.371 0.000 6.953941e-56
FTH1 4.242081e-58 0.8947325 0.996 0.905 1.552644e-53
COMBO_mk_plt_genes=read.csv(file = "combined_Mk_Plt_genes_list.csv", row.names = 1)
COMBO_mk_plt_genes_labels=COMBO_mk_plt_genes[,1]
print(head(COMBO_mk_plt_genes_labels))
[1] "CMTM5" "GP9" "CLEC1B" "LTBP1" "C12orf39" "CAMK1"
PLT_genes_in_dataframe= which(rownames(Marker_genes_41POS_12_libraries_test_1) %in% COMBO_mk_plt_genes_labels)
print(PLT_genes_in_dataframe)
[1] 2 3 5 8 11 12 13 20 22 23 24 27 32 38 39 42
[17] 48 60 61 66 68 75 77 92 93 108 112 145 158 175 188 196
[33] 203 214 236 253 261 307 308 1004 1017
I want the names of the elements not the positions. Any advice is appreciated.
You can use the base intersect():
intersect(rownames(Marker_genes_41POS_12_libraries_test_1), COMBO_mk_plt_genes_labels)
intersect() outputs the items that match between the two sequences of items.
Run ?intersect() or ?base::intersect() for more information.
Alternative solution: Getting element names with your which() approach
You can still use which() to get the element names. Since which() gives the positions at which rownames(Marker_genes_41POS_12_libraries_test_1) matches COMBO_mk_plt_genes_labels, you can use those positions to index back into the rownames:
rownames(Marker_genes_41POS_12_libraries_test_1)[which(rownames(Marker_genes_41POS_12_libraries_test_1) %in% COMBO_mk_plt_genes_labels)]
# or in short
rownames(Marker_genes_41POS_12_libraries_test_1)[PLT_genes_in_dataframe]
intersect(), though, is a simpler approach.
However, there is one difference to be aware of: duplicated items. If the rownames (call them x) contain duplicates that match items in the second vector y, intersect(x, y) will not return any duplicates, whereas x[which(x %in% y)] returns every matching element of x, duplicates included. Switch x and y (y[which(y %in% x)]) to get duplicated y names instead. This is useful for things like tallying the number of times there was a match.
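A small illustration of the difference, using toy vectors rather than your data:
x <- c("GP9", "CMTM5", "GP9") # rownames with a duplicate (hypothetical)
y <- c("GP9", "CLEC1B")
intersect(x, y) # "GP9" -- duplicates dropped
x[which(x %in% y)] # "GP9" "GP9" -- duplicates kept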

R - Data Frame is a list of columns?

Question
Is a data frame in R a list (a list being, in my understanding, a sequence of objects) of columns?
What is the design decision in R to have made a data frame a column-oriented (not row-oriented) structure?
Any reference to related design document or article of data structure design would be appreciated.
I am just used to the row-as-a-unit/record view and would like to know why it is column oriented. Or, if I have misunderstood something, kindly point it out.
Background
I had thought a data frame was a sequence of rows, each a record such as (Ozone, Solar.R, Wind, Temp, Month, Day).
> c ## data frame created from read.csv()
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
> typeof(c)
[1] "list"
However, when lapply() is applied to c to show each list element, each element turns out to be a column.
> lapply(c, function(arg){ return(arg) })
$Ozone
[1] 41 36 12 18 23 19
$Solar.R
[1] 190 118 149 313 299 99
$Wind
[1] 7.4 8.0 12.6 11.5 8.6 13.8
$Temp
[1] 67 72 74 62 65 59
$Month
[1] 5 5 5 5 5 5
$Day
[1] 1 2 3 4 7 8
Whereas what I had expected was
[1] 41 190 7.4 67 5 1
[1] 36 118 8.0 72 5 2
…
1) Is a data frame in R a list of columns?
Yes.
df <- data.frame(a=c("the", "quick"), b=c("brown", "fox"), c=1:2)
is.list(df) # -> TRUE
attr(df, "names") # -> [1] "a" "b" "c"
df[[1]][2] # -> "quick"
2) What is the design decision in R to have made a data frame a column-oriented (not row-oriented) structure?
A data.frame is a list of column vectors.
is.atomic(df[[1]]) # -> TRUE
mode(df[[1]]) # -> [1] "character"
mode(df[[3]]) # -> [1] "numeric"
Vectors can only store one kind of object, so a column-oriented data.frame can keep each column as a compact, homogeneous vector. A "row-oriented" data.frame would instead have to represent every row as a heterogeneous list, wrapping each value in its own object. Now imagine what a simple column access like
df[[1]][20000]
would cost in such a list-based data frame: instead of one constant-time lookup into a single vector, it would have to reach into the 20000th row list and extract its first field, and whole-column operations (the common case in statistics) would pay that indirection for every row.
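A toy comparison of the two layouts makes the cost concrete (a sketch; exact timings will vary by machine):
n <- 1e6
col_based <- data.frame(a = seq_len(n), b = runif(n)) # columns are homogeneous vectors
row_based <- lapply(seq_len(n), function(i) list(a = i, b = runif(1))) # one list per row
system.time(mean(col_based$a)) # one pass over a contiguous vector: effectively instant
system.time(mean(vapply(row_based, function(r) r$a, numeric(1)))) # must visit every row object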
3) Any reference to related design document or article of data structure design would be appreciated.
http://adv-r.had.co.nz/Data-structures.html#data-frames

Retrieving adjacency values in a nng igraph object in R

edited to improve the quality of the question as a result of the (wholly appropriate) spanking received by Spacedman!
I have a k-nearest neighbors object (an igraph) which I created as such, by using the file I have uploaded here:
I performed the following operations on the data, in order to create an adjacency matrix of distances between observations:
W <- read.csv("/path/sim_matrix.csv")
W <- W[, -c(1,3)]
W <- scale(W)
sim_matrix <- dist(W, method = "euclidean", upper=TRUE)
sim_matrix <- as.matrix(sim_matrix)
mygraph <- nng(sim_matrix, k=10)
This gives me a nice list of vertices and their ten closest neighbors; a small sample follows:
1 -> 25 26 28 30 32 144 146 151 177 183 2 -> 4 8 32 33 145 146 154 156 186 199
3 -> 1 25 28 51 54 106 144 151 177 234 4 -> 7 8 89 95 97 158 160 170 186 204
5 -> 9 11 17 19 21 112 119 138 145 158 6 -> 10 12 14 18 20 22 147 148 157 194
7 -> 4 13 123 132 135 142 160 170 173 174 8 -> 4 7 89 90 95 97 158 160 186 204
So far so good.
What I'm struggling with, however, is how to get access to the values of the weights between the vertices, so that I can do meaningful calculations on them. It shouldn't be so hard; this is a common thing to want from graphs, no?
Looking at the documentation, I tried:
degree(mygraph)
which gives me the sum of the weights for each node. But I don't want the sum, I want the raw data, so I can do my own calculations.
I tried
get.data.frame(mygraph,"E")[1:10,]
but this has none of the distances between nodes:
from to
1 1 25
2 1 26
3 1 28
4 1 30
5 1 32
6 1 144
7 1 146
8 1 151
9 1 177
10 1 183
I have attempted to get weight values that I can work with out of the graph object, but no luck.
If anyone has any ideas on how to go about approaching this, I'd be grateful. Thanks.
It's not clear from your question whether you are starting with a dataset, or with a distance matrix, e.g. nng(x=mydata,...) or nng(dx=mydistancematrix,...), so here are solutions with both.
library(cccd)
df <- mtcars[,c("mpg","hp")] # extract from mtcars dataset
# knn using dataset only
g <- nng(x=as.matrix(df),k=5) # for each car, 5 other most similar mpg and hp
V(g)$name <- rownames(df) # meaningful names for the vertices
dm <- as.matrix(dist(df)) # full distance matrix
E(g)$weight <- apply(get.edges(g,1:ecount(g)),1,function(x)dm[x[1],x[2]])
# knn using distance matrix (assumes you have dm already)
h <- nng(dx=dm,k=5)
V(h)$name <- rownames(df)
E(h)$weight <- apply(get.edges(h,1:ecount(h)),1,function(x)dm[x[1],x[2]])
# same result either way
identical(get.data.frame(g),get.data.frame(h))
# [1] TRUE
So these approaches identify the distances from each vertex to its five nearest neighbors and set the edge weight attribute to those values. Interestingly, plot(g) works fine, but plot(h) fails. I think this might be a bug in the plot method for cccd.
If all you want to know is the distances from each vertex to the nearest neighbors, the code below does not require package cccd.
knn <- t(apply(dm,1,function(x)sort(x)[2:6]))
rownames(knn) <- rownames(df)
Here, the matrix knn has a row for each vertex and columns giving the distance from that vertex to its 5 nearest neighbors. It does not tell you which neighbors those are, though; see the sketch below.
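If you also want the identities of those neighbors, a companion matrix can be built the same way with order() in place of sort() (a sketch, reusing dm and df from above):
knn_id <- t(apply(dm, 1, function(x) order(x)[2:6])) # indices of the 5 nearest neighbors
rownames(knn_id) <- rownames(df)
knn_names <- matrix(rownames(df)[knn_id], nrow = nrow(knn_id),
                    dimnames = dimnames(knn_id)) # map indices back to car names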
Okay, I've found an nng function in the cccd package. Is that it? If so, then mygraph is just an igraph object and you can just do E(mygraph)$whatever to get the names of the edge attributes.
Following one of the cccd examples to create G1 here, you can get a data frame of all the edges and attributes thus:
get.data.frame(G1,"E")[1:10,]
You can get/set individual edge attributes with E(g)$whatever:
> E(G1)$weight=1:250
> E(G1)$whatever=runif(250)
> get.data.frame(G1,"E")[1:10,]
from to weight whatever
1 1 3 1 0.11861240
2 1 7 2 0.06935047
3 1 22 3 0.32040316
4 1 29 4 0.86991432
5 1 31 5 0.47728632
Is that what you are after? Any igraph package tutorial will tell you more!

How can I get column data to be added based on a group designation using R?

The data set I'm working with is similar to the one below (although the example is on a much smaller scale; my real data is tens of thousands of rows), and I haven't been able to figure out how to get R to add up column data based on the group number. Essentially, I want to get the number of greens, blues, and reds added up for group 81 and group 66 separately, and then use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.
One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
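Since you mentioned percentages, the within-group colour percentages can be computed straight from that aggregated result (a sketch; agg is just the aggregate() output above):
agg <- aggregate(. ~ Group, data = x, FUN = sum)
cbind(Group = agg$Group, round(100 * agg[, c("Green", "Blue", "Red")] / agg$Total, 1))
# Group Green Blue Red
# 1 66 43.0 44.9 12.1
# 2 81 53.0 22.4 24.6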
The other answers here are perfect examples of how to address this type of problem. Two more options exist, in reshape and plyr:
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that @Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
