Subset of a data frame in R using a predefined list of names

I have a list of gene names called "COMBO_mk_plt_genes_labels" and a data frame of marker genes called "Marker_genes_41POS_12_libraries_test_1" containing genes (as row names) and fold changes.
I want to extract the names in COMBO_mk_plt_genes_labels that also appear in the data frame.
I know that the which() function in R gives the positions of the matching genes (see my example below). How do I extract the names and not only the positions?
print(head(Marker_genes_41POS_12_libraries_test_1))
                 p_val avg_logFC pct.1 pct.2     p_val_adj
HBD      6.610971e-108 3.3357135 0.930 0.080 2.419682e-103
GP1BB     1.332211e-91 2.5397301 0.825 0.047  4.876024e-87
CMTM5     1.938091e-63 2.0580724 0.605 0.005  7.093606e-59
SH3BGRL3  1.067771e-60 1.3750032 0.975 0.592  3.908149e-56
PF4       1.899932e-60 3.0111590 0.371 0.000  6.953941e-56
FTH1      4.242081e-58 0.8947325 0.996 0.905  1.552644e-53
COMBO_mk_plt_genes=read.csv(file = "combined_Mk_Plt_genes_list.csv", row.names = ,1)
COMBO_mk_plt_genes_labels=COMBO_mk_plt_genes[,1]
print(head(COMBO_mk_plt_genes_labels))
[1] "CMTM5" "GP9" "CLEC1B" "LTBP1" "C12orf39" "CAMK1"
PLT_genes_in_dataframe= which(rownames(Marker_genes_41POS_12_libraries_test_1) %in% COMBO_mk_plt_genes_labels)
print(PLT_genes_in_dataframe)
[1] 2 3 5 8 11 12 13 20 22 23 24 27 32 38 39 42
[17] 48 60 61 66 68 75 77 92 93 108 112 145 158 175 188 196
[33] 203 214 236 253 261 307 308 1004 1017
I want the names of the elements not the positions. Any advice is appreciated.

You can use the base intersect():
intersect(rownames(Marker_genes_41POS_12_libraries_test_1), COMBO_mk_plt_genes_labels)
intersect() outputs the items that match between the two sequences of items.
Run ?intersect or ?base::intersect for more information.
Alternative solution: Getting element names with your which() approach
You can still use which() to get the element names. Your which() call returns the index numbers at which rownames(Marker_genes_41POS_12_libraries_test_1) matches an entry of COMBO_mk_plt_genes_labels, so you can use those index numbers to subset rownames(Marker_genes_41POS_12_libraries_test_1) and get the names that matched.
rownames(Marker_genes_41POS_12_libraries_test_1)[which(rownames(Marker_genes_41POS_12_libraries_test_1) %in% COMBO_mk_plt_genes_labels)]
# or in short
rownames(Marker_genes_41POS_12_libraries_test_1)[PLT_genes_in_dataframe]
intersect(), though, is a simpler approach.
However, there is one difference to be aware of, and it concerns duplicated items. If rownames(...) (let's call it x) contains duplicates that match items in the second vector y, intersect(x, y) returns each matching name only once. In contrast, x[which(x %in% y)] (i.e., the which() approach) returns every element of x that matches something in y, duplicates included. Swap x and y and you can get duplicated y names, too, using y[which(y %in% x)]. This is useful for something like tallying the number of times there was a match.
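As a minimal sketch of the two approaches, here is a toy example. The names markers, labels, x and y and all the values are made up for illustration; they are not the asker's data.

```r
# Hypothetical stand-in for the marker data frame (gene names as row names)
markers <- data.frame(
  avg_logFC = c(3.34, 2.54, 2.06, 1.38),
  row.names = c("HBD", "GP1BB", "CMTM5", "SH3BGRL3")
)
# Hypothetical stand-in for the vector of gene labels
labels <- c("CMTM5", "GP9", "GP1BB")

# Names present in both, in the order they appear in the row names
hits <- intersect(rownames(markers), labels)
hits                                   # "GP1BB" "CMTM5"

# Use the names to subset the data frame itself
marker_subset <- markers[hits, , drop = FALSE]

# Duplicates: intersect() drops them, %in% indexing keeps them
x <- c("HBD", "GP1BB", "CMTM5", "GP1BB", "PF4")   # GP1BB appears twice
y <- c("CMTM5", "GP9", "GP1BB")
unique_matches <- intersect(x, y)      # "GP1BB" "CMTM5"
all_matches    <- x[x %in% y]          # "GP1BB" "CMTM5" "GP1BB"
match_counts   <- table(all_matches)   # tally of matches per name
```

Note that which() is optional here: a logical vector such as x %in% y can be used directly as an index.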


lapply with different indices in list of lists

I'm trying to get the output of a certain column ($new_age = numeric values) within lists of lists.
Data is "my_groups", which consists of 28 lists. Those lists have lists themselves of irregular size:
92 105 96 86 91 94 73 100 87 89 88 90 112 82 95 83 94 106
91 101 86 81 89 68 89 87 109 73 (len_df)
The 1st list has 92 lists, the 2nd 105 etc. ... until the 28th list with 73 lists.
First, I want my function to iterate through the 28 years of data and second, within these years I want to iterate through len_df, since $new_age is in the nested lists.
What I tried is this:
test <- lapply(seq(1:28), function(i) sapply(seq(1:len_df), function(j) (my_groups[[i]][[j]]$new_age) ) )
However, the index is out of bounds and I'm not sure how to combine two different indices for the nested lists. Unlist is not ideal, since I have to treat the data as separate groups and sorted for each year.
Expected output: $new_age (numeric values) for each of the 28 years e.g. 1st = 92 values, 2nd = 105 values etc.
Any idea how to make this work? Thank you!
Here are a few different approaches:
1) whole object approach Assuming that the input is L shown reproducibly in the Note at the end and that what is wanted is a list of length(L) numeric vectors, i.e. list(1:2, 3:5), consisting of the new_age values:
lapply(L, sapply, `[[`, "new_age")
giving:
[[1]]
[1] 1 2
[[2]]
[1] 3 4 5
2) indices If you want to do it using indices, as in the code shown in question, then using seq_along:
ix <- seq_along(L)
lapply(ix, function(i) sapply(seq_along(L[[i]]), function(j) L[[i]][[j]]$new_age))
3) unlist To use unlist, form an appropriate grouping variable using rep and split the values into separate vectors by it. This assumes that the new_age values are the only leaves, which may or may not be the case in your data but is the case in the reproducible example in the Note at the end.
split(unname(unlist(L)), rep(seq_along(L), lengths(L)))
Note
L <- list(list(list(new_age = 1), list(new_age = 2)),
list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))
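A likely reason the original attempt goes out of bounds is that len_df is a vector of 28 lengths, so 1:len_df cannot give a different range for each year. Iterating over each year's own elements avoids that. The sketch below is a variant of approach 2 using vapply, under the assumption that every nested element holds exactly one numeric new_age value, as in the toy L above. The name by_year is made up for illustration.

```r
# Toy input from the Note: 2 "years" holding 2 and 3 nested lists
L <- list(list(list(new_age = 1), list(new_age = 2)),
          list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))

# For each year, pull new_age out of every nested list.
# vapply() checks that each result is a single numeric value.
by_year <- lapply(L, function(year) {
  vapply(year, function(g) g$new_age, numeric(1))
})

# One numeric vector per year, with lengths matching the nested list sizes
lengths(by_year)   # 2 3
```

If new_age can hold more than one value per nested element, vapply() will stop with an error, and sapply() or lapply() would be the more forgiving choice.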

What can I do to find and remove semi-duplicate rows in a matrix?

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The values are very similar to the rows specified above, and in the context of the whole data, they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi duplicates is the same across all matrices.
I know of agrep but the function operates on vectors as far as I understand.
You will need to set a threshold, but you can just compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
    row col
8     8   5
5     5   8
16   16  11
11   11  16
48   48  20
20   20  48
168 168  71
91   91  73
73   73  91
71   71 168
This finds the "close" points that you created and a few others that got generated at random.
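The code above only finds the close pairs. One possible way to also remove them is sketched below on a small hand-made matrix. It assumes that you want to keep the first row of each close pair and drop the later one, and it reuses the 0.025 threshold. The names close_pairs, drop_rows, keep and x_clean are made up for illustration.

```r
# Small toy matrix: row 3 is close to row 1, row 5 is close to row 2
x <- rbind(c(0, 0),
           c(1, 1),
           c(0.001, 0.002),
           c(5, 5),
           c(1.003, 0.999))

DM <- as.matrix(dist(x))

# Look only at the upper triangle so each pair is reported once (row < col)
DM[lower.tri(DM, diag = TRUE)] <- Inf
close_pairs <- which(DM < 0.025, arr.ind = TRUE)

# Drop the later member of each close pair
drop_rows <- unique(close_pairs[, "col"])

# Logical indexing stays correct even when nothing needs dropping
keep <- !(seq_len(nrow(x)) %in% drop_rows)
x_clean <- x[keep, , drop = FALSE]
```

Since the semi-duplicates sit at the same positions in every matrix of the array, the same keep vector could presumably be reused across all of them.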

Comparing variables with values in another dataframe and replace them with another value

I have a data frame with:
Height <- c(169,176,173,172,176,158,168,162,178)
and another with reference heights and weights.
heights_f <- c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6)
weights_f <- c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84)
weightfactor_f <- data.frame(heights_f, weights_f)
I now need to match the values of the heights from the first data.frame with the height reference in the second one that's the most fitting and to give me the correspondent reference weight.
I haven't yet had any success, as I haven't been able to find anything about matching values that are not exactly the same.
If I understand your goal, instead of taking the nearest value, consider interpolating through the approx function. For instance:
approx(weightfactor_f$heights_f,weightfactor_f$weights_f,xout=Height)$y
#[1] 60.23846 66.44400 63.85385 62.95600 66.44400 50.36000 59.35385 53.96923
#[9] 68.28400
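One thing to keep in mind with approx is what happens outside the reference range: by default it returns NA there. The sketch below shows this and the rule = 2 option, which uses the value at the nearest end of the table instead. The query heights 140 and 200 are made up for illustration.

```r
heights_f <- c(144.8, 147.3, 149.9, 152.4, 154.9, 157.5, 160, 162.6, 165.1,
               167.6, 170.2, 172.7, 175.3, 177.8, 180.3, 182.9, 185.4, 188,
               190.5, 193, 195.6)
weights_f <- c(38.6, 40.9, 43.1, 45.4, 47.7, 49.9, 52.2, 54.5, 56.8, 59,
               61.3, 63.6, 65.8, 68.1, 70.4, 72.6, 74.9, 77.2, 79.5, 81.7, 84)

# Hypothetical query heights: below the table, inside it, above it
query <- c(140, 169, 200)

# Default (rule = 1): NA outside the range of heights_f
default_fit <- approx(heights_f, weights_f, xout = query)$y

# rule = 2: out-of-range heights get the weight at the closest end of the table
clamped_fit <- approx(heights_f, weights_f, xout = query, rule = 2)$y
```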
You could do:
Height<- c(169,176,173,172,176,158,168,162,178)
heights_f<- as.numeric(c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6))
weights_f<- as.numeric(c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84))
df = data.frame(Height=Height, match_weight=
sapply(Height, function(x) {weights_f[which.min(abs(heights_f-x))]}))
i.e. for each entry in Height, find the corresponding element in the heights_f vector by doing which.min(abs(heights_f-x)) and fetch the corresponding entry from the weights_f vector.
Output:
Height match_weight
1 169 61.3
2 176 65.8
3 173 63.6
4 172 63.6
5 176 65.8
6 158 49.9
7 168 59.0
8 162 54.5
9 178 68.1
library(dplyr)
Slightly different structure to reproducible example:
Height <- data.frame(height = as.numeric(c(169,176,173,172,176,158,168,162,178)))
The rest is the same:
heights_f<- as.numeric(c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6))
weights_f<- as.numeric(c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84))
weightfactor_f<- data.frame(heights_f,weights_f)
Then, round to the nearest whole number:
weightfactor_f$heights_f <- round(weightfactor_f$heights_f, 0)
Then just:
left_join(Height, weightfactor_f, by = c("height" = "heights_f"))
Output (heights with no exact match among the rounded reference heights come back as NA):
height weights_f
1 169 NA
2 176 NA
3 173 63.6
4 172 NA
5 176 NA
6 158 49.9
7 168 59.0
8 162 NA
9 178 68.1
# preallocate the result, then pick the weight at the nearest reference height
z <- numeric(length(Height))
for (i in seq_along(Height)) {
  z[i] <- weightfactor_f$weights_f[which.min(abs(Height[i] - weightfactor_f$heights_f))]
}
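The nearest-height lookup can also be written without an explicit loop. The sketch below is one way to do it, assuming Height is a plain numeric vector as in the question. The names diffs, nearest and match_weight are made up for illustration.

```r
Height <- c(169, 176, 173, 172, 176, 158, 168, 162, 178)
heights_f <- c(144.8, 147.3, 149.9, 152.4, 154.9, 157.5, 160, 162.6, 165.1,
               167.6, 170.2, 172.7, 175.3, 177.8, 180.3, 182.9, 185.4, 188,
               190.5, 193, 195.6)
weights_f <- c(38.6, 40.9, 43.1, 45.4, 47.7, 49.9, 52.2, 54.5, 56.8, 59,
               61.3, 63.6, 65.8, 68.1, 70.4, 72.6, 74.9, 77.2, 79.5, 81.7, 84)

# Matrix of absolute differences: one row per Height, one column per reference
diffs <- abs(outer(Height, heights_f, "-"))

# Column index of the closest reference height for each row
nearest <- apply(diffs, 1, which.min)

# Reference weight at that index
match_weight <- weights_f[nearest]
match_weight   # 61.3 65.8 63.6 63.6 65.8 49.9 59.0 54.5 68.1
```

The outer() call builds a length(Height) by length(heights_f) matrix, which is fine at this size but may be wasteful for very long vectors.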

Creating a data set with paired data and converting it into a matrix

So, I'm using R to try and do a phylogenetic PCA on a dataset that I have using the phyl.pca function from the phytools package. However, I'm having issues organising my data in a way that the function will accept! And that's not all: I did a bit of experimenting and I know that there are more issues further down the line, which I will get into...
Getting straight to the issue, here's the data frame (with dummy data) that I'm using:
>all
Taxa Tibia Feather
1 Microraptor 138 101
2 Microraptor 139 114
3 Microraptor 145 141
4 Anchiornis 160 81
5 Anchiornis 14 NA
6 Archaeopteryx 134 82
7 Archaeopteryx 136 71
8 Archaeopteryx 132 NA
9 Archaeopteryx 14 NA
10 Scansoriopterygidae 120 85
11 Scansoriopterygidae 116 NA
12 Scansoriopterygidae 123 NA
13 Sapeornis 108 NA
14 Sapeornis 112 86
15 Sapeornis 118 NA
16 Sapeornis 103 NA
17 Confuciusornis 96 NA
18 Confuciusornis 107 30
19 Confuciusornis 148 33
20 Confuciusornis 128 61
The taxa are arranged into a tree (called "tree") with Microraptor being the most basal and then progressing in order through to Confuciusornis:
>summary(tree)
Phylogenetic tree: tree
Number of tips: 6
Number of nodes: 5
Branch lengths:
mean: 1
variance: 0
distribution summary:
Min. 1st Qu. Median 3rd Qu. Max.
1 1 1 1 1
No root edge.
Tip labels: Confuciusornis
Sapeornis
Scansoriopterygidae
Archaeopteryx
Anchiornis
Microraptor
No node labels.
And the function:
>phyl.pca(tree, all, method="BM", mode="corr")
And this is the error that is coming up:
Error in phyl.pca(tree, all, method = "BM", mode = "corr") :
number of rows in Y cannot be greater than number of taxa in your tree
Y being the "all" data frame. So I have 6 taxa in my tree (matching the 6 taxa in the data frame) but there are 20 rows in my data frame. So I used this function:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
And got this:
Group.1 Tibia Feather
1 Anchiornis 153 81
2 Archaeopteryx 136 77
3 Confuciusornis 120 41
4 Microraptor 141 119
5 Sapeornis 110 86
6 Scansoriopterygidae 120 85
It's a bit odd that the order of the taxa has changed... Is this ok?
In any case, I converted it into a matrix:
> all_agg_matrix <- as.matrix(all_agg)
> all_agg_matrix
Group.1 Tibia Feather
[1,] "Anchiornis" "153" "81"
[2,] "Archaeopteryx" "136" "77"
[3,] "Confuciusornis" "120" "41"
[4,] "Microraptor" "141" "119"
[5,] "Sapeornis" "110" "86"
[6,] "Scansoriopterygidae" "120" "85"
And then used the phyl.pca function:
> phyl.pca(tree, all_agg_matrix, method = "BM", mode = "corr")
[1] "Y has no names. function will assume that the row order of Y matches tree$tip.label"
Error in invC %*% X : requires numeric/complex matrix/vector arguments
So, now the order that the function is considering taxa in is all wrong (but I can fix that relatively easily). The issue is that phyl.pca doesn't seem to believe that my matrix is actually a matrix. Any ideas why?
I think you may have bigger problems. Most phylogenetic methods, I suspect including phyl.pca, assume that traits are fixed at the species level (i.e., they don't account for within-species variation). Thus, if you want to use phyl.pca, you probably need to collapse your data to a single value per species, e.g. via
dd_agg <- aggregate(dd[,-1],by=list(dd$Taxa),mean,na.rm=TRUE)
Extract the numeric columns and label the rows properly so that phyl.pca can match them up with the tips correctly:
dd_mat <- dd_agg[,-1]
rownames(dd_mat) <- dd_agg[,1]
Using these aggregated data, I can make up a tree (since you didn't give us one) and run phyl.pca ...
library(phytools)
tt <- rcoal(nrow(dd_agg),tip.label=dd_agg[,1])
phyl.pca(tt,dd_mat)
If you do need to do an analysis that takes within-species variation into account you might need to ask somewhere more specialized, e.g. the r-sig-phylo@r-project.org mailing list ...
The answer posted by Ben Bolker works: the data (called "all") is collapsed into a single value per species before creating the matrix and running the function, like so:
> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)
> all_mat <- all_agg[,-1]
> rownames(all_mat) <- all_agg[,1]
> phyl.pca(tree,all_mat, method= "lambda", mode = "corr")
Thanks to everyone who contributed an answer and especially Ben! :)
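For anyone wondering about the original "requires numeric/complex matrix" error: calling as.matrix() on a data frame that still contains the taxon names produces a character matrix, which is why the numbers showed up in quotes. The base-R sketch below illustrates this on made-up data, and also shows how row names let you put the rows in any tip order. The names toy, tip_order and the values are hypothetical, and no phytools call is made.

```r
# Hypothetical toy data with repeated taxa and a missing value
toy <- data.frame(Taxa    = c("A", "A", "B"),
                  Tibia   = c(1, 3, 5),
                  Feather = c(2, NA, 6))

# One mean per taxon; aggregate() returns the groups in sorted order
toy_agg <- aggregate(toy[, -1], by = list(toy$Taxa), mean, na.rm = TRUE)

# Converting everything, names included, gives a character matrix
bad <- as.matrix(toy_agg)

# Dropping the name column first keeps the matrix numeric;
# the taxon names go into the row names instead
good <- as.matrix(toy_agg[, -1])
rownames(good) <- as.character(toy_agg[, 1])

# Rows can then be reordered by name, e.g. to follow a tree's tip labels
tip_order <- c("B", "A")   # stand-in for tree$tip.label
good_ordered <- good[tip_order, , drop = FALSE]
```

This also explains the changed taxon order after aggregate(): the groups are sorted, which does no harm once the rows carry names that can be matched to the tips.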

R: how to detect a pattern in a matrix by row

I have a big matrix with 4 columns, containing normalized values (by column, mean ~ 0 and standard deviation = 1).
I would like to see if there is a pattern in the matrix and, if so, cluster the rows by pattern. By pattern I mean the ordering of the values within a given row. For example,
for row N,
if value in column 1 < column 2 < column 3 < column 4, then it is, let's say, pattern 1.
Basically there are 4^4 = 256 possible patterns (in theory).
Is there a way to do this in R?
Thanks in advance
Rad
Yes. (Although the number of distinct permutations is only 24 = 4*3*2. After one value is chosen, there are only three possible second values, and after the second is specified there are only two more orderings left.) The order function applied to each row should give the desired 1,2,3, 4 permutations:
mtx <- matrix(rnorm(10000), ncol=4)
res <- apply(mtx, 1, function(x) paste( order(x), collapse=".") )
table(res)
res
1.2.3.4 1.2.4.3 1.3.2.4 1.3.4.2 1.4.2.3 1.4.3.2
98 112 95 120 114 118
2.1.3.4 2.1.4.3 2.3.1.4 2.3.4.1 2.4.1.3 2.4.3.1
101 114 105 102 104 122
3.1.2.4 3.1.4.2 3.2.1.4 3.2.4.1 3.4.1.2 3.4.2.1
105 82 107 90 97 86
4.1.2.3 4.1.3.2 4.2.1.3 4.2.3.1 4.3.1.2 4.3.2.1
99 93 100 108 118 110
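To go from the pattern labels to actual groups of rows, one option is to split the row indices by label. The sketch below uses a small hand-made matrix so the result is easy to check by eye. The name groups and the values are made up for illustration.

```r
# Small toy matrix: rows 1 and 3 are both strictly increasing
mtx <- rbind(c( 0.1,  0.2, 0.3, 0.4),
             c( 0.4,  0.3, 0.2, 0.1),
             c(-1.0,  0.0, 1.0, 2.0),
             c( 0.5, -0.5, 1.5, 1.0))

# order(x) lists the column numbers from smallest to largest value
res <- apply(mtx, 1, function(x) paste(order(x), collapse = "."))
res   # "1.2.3.4" "4.3.2.1" "1.2.3.4" "2.1.4.3"

# Row indices grouped by pattern
groups <- split(seq_len(nrow(mtx)), res)
groups[["1.2.3.4"]]   # 1 3
```

A label such as "2.1.4.3" reads as: column 2 holds the smallest value, then column 1, then column 4, and column 3 holds the largest.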
