lapply with different indices in list of lists - r

I'm trying to get the values of a certain element ($new_age, numeric values) from within lists of lists.
The data, "my_groups", consists of 28 lists. Each of those lists contains further lists of irregular size:
len_df <- c(92, 105, 96, 86, 91, 94, 73, 100, 87, 89, 88, 90, 112, 82, 95, 83, 94, 106, 91, 101, 86, 81, 89, 68, 89, 87, 109, 73)
The 1st list has 92 lists, the 2nd 105 etc. ... until the 28th list with 73 lists.
First, I want my function to iterate through the 28 years of data and second, within these years I want to iterate through len_df, since $new_age is in the nested lists.
What I tried is this:
test <- lapply(seq(1:28), function(i) sapply(seq(1:len_df), function(j) (my_groups[[i]][[j]]$new_age) ) )
However, the index is out of bounds, and I'm not sure how to combine two different indices for the nested lists. unlist() is not ideal, since I have to keep the data as separate groups, sorted by year.
Expected output: $new_age (numeric values) for each of the 28 years e.g. 1st = 92 values, 2nd = 105 values etc.
Any idea how to make this work? Thank you!

Here are a few different approaches:
1) whole object approach. Assuming that the input is L, shown reproducibly in the Note at the end, and that what is wanted is a list of length(L) numeric vectors consisting of the new_age values, i.e. list(1:2, 3:5):
lapply(L, sapply, `[[`, "new_age")
giving:
[[1]]
[1] 1 2
[[2]]
[1] 3 4 5
2) indices. If you want to do it using indices, as in the code shown in the question, then use seq_along:
ix <- seq_along(L)
lapply(ix, function(i) sapply(seq_along(L[[i]]), function(j) L[[i]][[j]]$new_age))
3) unlist. To use unlist, form an appropriate grouping variable with rep and split the flattened vector into separate vectors by it. This assumes that the new_age values are the only leaves, which may or may not be the case in your data, but is the case in the reproducible example in the Note at the end.
split(unname(unlist(L)), rep(seq_along(L), lengths(L)))
Note
L <- list(list(list(new_age = 1), list(new_age = 2)),
list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))
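As a quick sanity check, all three approaches give the same values on the L from the Note; a minimal sketch:

```r
# Reproducible input from the Note above
L <- list(list(list(new_age = 1), list(new_age = 2)),
          list(list(new_age = 3), list(new_age = 4), list(new_age = 5)))

r1 <- lapply(L, sapply, `[[`, "new_age")                              # whole object
r2 <- lapply(seq_along(L), function(i)
        sapply(seq_along(L[[i]]), function(j) L[[i]][[j]]$new_age))   # indices
r3 <- split(unname(unlist(L)), rep(seq_along(L), lengths(L)))         # unlist/split

identical(r1, r2)          # TRUE
identical(r1, unname(r3))  # TRUE (r3 carries list names "1", "2" from split)
```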

Related

R: Get index of first instance in vector greater than variable but for whole column

I am trying to create a new variable in a data.table. It is intended to take a variable in the data.table and for each observation compare that variable to a vector and return the index of the first observation in the vector that is greater than the variable in the data.table.
Example
ComparatorVector <- c(seq(1000, 200000, 1000))
Variable <- runif(10, min = 1000, max = 200000)
For each observation in Variable I'd like to know the index of the first observation in ComparatorVector that is larger than the observation of Variable.
I've played around with min(which()), but couldn't get it to step through ComparatorVector. I also looked at the match() function, but couldn't get it to return anything but the index of an exact match.
An option is findInterval:
findInterval(Variable, ComparatorVector) + 1
#[1] 190 152 99 107 38 148 114 95 53 73
Or with sapply
sapply(Variable, function(x) which(ComparatorVector > x)[1])
#[1] 190 152 99 107 38 148 114 95 53 73
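Note that Variable is drawn without a seed, so the values above will differ between runs. A reproducible check that the two approaches agree (a small sketch with a fixed seed, not the question's actual data):

```r
set.seed(42)
ComparatorVector <- seq(1000, 200000, 1000)
Variable <- runif(5, min = 1000, max = 200000)

# findInterval counts how many sorted breakpoints are <= x, so +1 is the
# index of the first breakpoint strictly greater than x
idx1 <- findInterval(Variable, ComparatorVector) + 1
idx2 <- sapply(Variable, function(x) which(ComparatorVector > x)[1])
all(idx1 == idx2)  # TRUE
```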

subset of dataframe in R using a predefined list of names

I have a list of gene names called "COMBO_mk_plt_genes_labels" and a dataframe of marker genes called "Marker_genes_41POS_12_libraries_test_1" containing genes and fold changes.
I want to extract the gene names themselves, not just their positions. I know that the which() function in R gets the positions of the matching genes; see my example below. How do I extract the names rather than the positions?
print(head(Marker_genes_41POS_12_libraries_test_1))
p_val avg_logFC pct.1 pct.2 p_val_adj
HBD 6.610971e-108 3.3357135 0.930 0.080 2.419682e-103
GP1BB 1.332211e-91 2.5397301 0.825 0.047 4.876024e-87
CMTM5 1.938091e-63 2.0580724 0.605 0.005 7.093606e-59
SH3BGRL3 1.067771e-60 1.3750032 0.975 0.592 3.908149e-56
PF4 1.899932e-60 3.0111590 0.371 0.000 6.953941e-56
FTH1 4.242081e-58 0.8947325 0.996 0.905 1.552644e-53
COMBO_mk_plt_genes = read.csv(file = "combined_Mk_Plt_genes_list.csv", row.names = 1)
COMBO_mk_plt_genes_labels=COMBO_mk_plt_genes[,1]
print(head(COMBO_mk_plt_genes_labels))
[1] "CMTM5" "GP9" "CLEC1B" "LTBP1" "C12orf39" "CAMK1"
PLT_genes_in_dataframe= which(rownames(Marker_genes_41POS_12_libraries_test_1) %in% COMBO_mk_plt_genes_labels)
print(PLT_genes_in_dataframe)
[1] 2 3 5 8 11 12 13 20 22 23 24 27 32 38 39 42
[17] 48 60 61 66 68 75 77 92 93 108 112 145 158 175 188 196
[33] 203 214 236 253 261 307 308 1004 1017
I want the names of the elements not the positions. Any advice is appreciated.
You can use the base intersect():
intersect(rownames(Marker_genes_41POS_12_libraries_test_1), COMBO_mk_plt_genes_labels)
intersect() returns the elements common to both vectors.
Run ?intersect or ?base::intersect for more information.
Alternative solution: Getting element names with your which() approach
You can still use which() to get the element names. Since which() gives the positions at which rownames(Marker_genes_41POS_12_libraries_test_1) match COMBO_mk_plt_genes_labels, you can use those positions to index back into the rownames and recover the names that matched.
rownames(Marker_genes_41POS_12_libraries_test_1)[which(rownames(Marker_genes_41POS_12_libraries_test_1) %in% COMBO_mk_plt_genes_labels)]
# or in short
rownames(Marker_genes_41POS_12_libraries_test_1)[PLT_genes_in_dataframe]
intersect(), though, is a simpler approach.
However, there is one difference to be aware of: duplicated items. If rownames(...) (call it x) has duplicates that match items in the second vector y, intersect(x, y) will not return any duplicates. In contrast, x[which(x %in% y)] returns every occurrence of x for which the match with y is TRUE. Switch x and y (i.e., y[which(y %in% x)]) to get duplicated y names instead. This is useful for things like tallying the number of times a match occurred.
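A small, self-contained illustration of that difference, using made-up gene names rather than the question's data:

```r
x <- c("HBD", "GP1BB", "GP1BB", "PF4")  # hypothetical rownames with a duplicate
y <- c("GP1BB", "PF4", "CMTM5")

intersect(x, y)     # "GP1BB" "PF4"          -- duplicates collapsed
x[which(x %in% y)]  # "GP1BB" "GP1BB" "PF4"  -- duplicates kept
table(x[x %in% y])  # tally of matches per gene: GP1BB 2, PF4 1
```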

What can I do to find and remove semi-duplicate rows in a matrix?

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The values are very similar to the rows specified above, and in the context of the whole data, they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi duplicates is the same across all matrices.
I know of agrep but the function operates on vectors as far as I understand.
You will need to set a threshold, but you can just compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
row col
8 8 5
5 5 8
16 16 11
11 11 16
48 48 20
20 20 48
168 168 71
91 91 73
73 73 91
71 71 168
This finds the "close" points that you created and a few others that got generated at random.
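To actually remove them, one sketch is to drop the larger index of each close pair. This assumes each semi-duplicate forms a simple pair (chains of three or more mutually close rows would need more care), and keeping the smaller index is a choice, not the only option:

```r
set.seed(123)
x <- matrix(rnorm(410), 205, 2)
x[8, ]  <- c(0.13152348, -0.05235148)
x[16, ] <- c(1.21846582, 1.695452178)

DM <- as.matrix(dist(x))
diag(DM) <- Inf                    # Inf is a safe "ignore self" value for any threshold
close <- which(DM < 0.025, arr.ind = TRUE)
drop  <- unique(pmax(close[, 1], close[, 2]))  # larger index of each close pair
x_clean <- x[-drop, ]
nrow(x_clean)                      # 200: the 5 semi-duplicate rows are gone
```

Since the positions are the same across all matrices in your array, the same drop vector can be reused for every slice.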

R: How to divide a data frame by column values?

Suppose I have a data frame with 3 columns and 10 rows as follows.
# V1 V2 V3
# 10 24 92
# 13 73 100
# 25 91 120
# 32 62 95
# 15 43 110
# 28 54 84
# 30 56 71
# 20 82 80
# 23 19 30
# 12 64 89
I want to create sub-dataframes that divide the original by the values of V1.
For example,
the first data frame will have the rows with values of V1 from 10-14,
the second will have the rows with values of V1 from 15-19,
the third from 20-24, etc.
What would be the simplest way to do this?
So if this is your data
dd<-data.frame(
V1=c(10,13,25,32,15,38,30,20,23,13),
V2=c(24,73,91,62,43,54,56,82,19,64),
V3=c(92,100,120,95,110,84,71,80,30,89)
)
then the easiest way to split is with the split() command, and since you want to split into ranges, you can use the cut() command to create those ranges. A simple split can be done with
ss <- split(dd, cut(dd$V1, breaks = seq(10, 40, by = 5) - 1)); ss
split returns a list in which each item is the subsetted data.frame. To get the data.frame with V1 values 10-14, use ss[[1]]; for 15-19, use ss[[2]], and so on. (Make sure the breaks cover the full range of V1: values outside the breaks become NA and are silently dropped, which is why the upper break here is 39 rather than 34.)
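Since the bins here are a fixed width of 5, integer division is an alternative sketch that avoids choosing breaks entirely and cannot drop out-of-range values:

```r
dd <- data.frame(
  V1 = c(10, 13, 25, 32, 15, 38, 30, 20, 23, 13),
  V2 = c(24, 73, 91, 62, 43, 54, 56, 82, 19, 64),
  V3 = c(92, 100, 120, 95, 110, 84, 71, 80, 30, 89)
)
# V1 %/% 5 maps 10-14 -> 2, 15-19 -> 3, 20-24 -> 4, ...
ss <- split(dd, dd$V1 %/% 5)
names(ss)  # "2" "3" "4" "5" "6" "7"
```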

R : how to Detect Pattern in Matrix By Row

I have a big matrix with 4 columns, containing normalized values (by column, mean ~ 0 and standard deviation = 1)
I would like to see whether there is a pattern in the matrix, and if so, cluster the rows by pattern. By "pattern" I mean the ordering of the values within a given row, for example:
for row N, if the value in column 1 < column 2 < column 3 < column 4, then it is, say, pattern 1.
Basically there are 4^4 = 256 possible patterns (in theory).
Is there a way in R to do this?
Thanks in advance
Rad
Yes. (Although the number of distinct permutations is only 24 = 4*3*2: after the first value is chosen, there are only three possible second values, and after the second is specified, only two orderings remain.) The order function applied to each row gives the desired permutations of 1, 2, 3, 4:
mtx <- matrix(rnorm(10000), ncol=4)
res <- apply(mtx, 1, function(x) paste( order(x), collapse=".") )
table(res)
res
1.2.3.4 1.2.4.3 1.3.2.4 1.3.4.2 1.4.2.3 1.4.3.2
98 112 95 120 114 118
2.1.3.4 2.1.4.3 2.3.1.4 2.3.4.1 2.4.1.3 2.4.3.1
101 114 105 102 104 122
3.1.2.4 3.1.4.2 3.2.1.4 3.2.4.1 3.4.1.2 3.4.2.1
105 82 107 90 97 86
4.1.2.3 4.1.3.2 4.2.1.3 4.2.3.1 4.3.1.2 4.3.2.1
99 93 100 108 118 110
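To then cluster the rows by their pattern, the labels in res can feed straight into split(); a minimal sketch:

```r
set.seed(1)
mtx <- matrix(rnorm(40), ncol = 4)  # small example for readability
res <- apply(mtx, 1, function(x) paste(order(x), collapse = "."))

groups <- split(seq_len(nrow(mtx)), res)          # row indices, grouped by pattern
rows_by_pattern <- split.data.frame(mtx, res)     # the rows themselves, as matrices
# split.data.frame is the documented way to split a matrix by row; each list
# element of rows_by_pattern is the sub-matrix of rows sharing one ordering
```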
