R: unlist a list to integers

[revised version]
I have a large character vector in R of length 57241 that contains gene symbols, e.g.
gene <- c("AL627309.1","SMIM1","DFFB") # assume this of size 57241
I have another table in which one column, table$genes, holds a combination of genes in each row, e.g.
head(table$genes)
[1] ,OR4F5,AL627309.1,OR4F29,OR4F16,AL669831.1,
[2] ,TP73,CCDC27,SMIM1,LRRC47,CEP104,DFFB
..
This table has about 1400 rows. For each gene I want to find the index of the row in the table in which it is located.
To do that I used
ind <- sapply(gene, grep, table$genes, fixed=TRUE, USE.NAMES=FALSE)
The variable "ind" returned is a large list of size 57241 which looks like this
head(ind)
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 1
[[4]]
[1] 1
[[5]]
[1] 1
[[6]]
[1] 1
I know for a fact that each gene occurs only once in that table, so the numbers I am interested in are the single values shown on each line above, i.e. 1. How can I convert this into an integer vector? When I unlist() it, I somehow get a vector of length ~500000, whereas I should be getting the same length as the list. I have tried many functions and combinations, but nothing seems to work. Any ideas?
Thanks

I'm not able to reproduce that behavior with either a list or a dataframe:
> gene <- c("AL627309.1","SMIM1","DFFB")
>
> table <- list(genes =c(",OR4F5,AL627309.1,OR4F29,OR4F16,AL669831.1,",
",TP73,CCDC27,SMIM1,LRRC47,CEP104,DFFB"))
> (ind <- sapply(gene, grep, table$genes, fixed=TRUE,USE.NAMES=FALSE))
[1] 1 2 2
I thought for a bit that you should be using match, but after further consideration it seems there must be something different about your data structure. Try posting dput(head(table$genes)) and dput(gene) to make your problem reproducible. You should also stop using the word "list" to refer to the items in table$genes; it confuses regular users of R, who think you are talking about an R "list". You can see which of the items in your ind "list" have length greater than one with:
which(sapply(ind, length) > 1)
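One possible cause, offered here only as an assumption since the real data was never posted, is that grep() with fixed=TRUE matches substrings, so a short symbol can also hit rows that contain a longer symbol. The sketch below uses a hypothetical third row and the hypothetical names genes_col, tokens and row_of to show how that inflates unlist(), and how an exact-token lookup with strsplit() and match() returns one row index per gene.

```r
# Hypothetical data: the third row is invented to trigger a substring hit.
gene <- c("AL627309.1", "SMIM1", "DFFB", "TP73")
genes_col <- c(",OR4F5,AL627309.1,OR4F29,",
               ",TP73,CCDC27,SMIM1,DFFB",
               ",TP73-AS1,WRAP73,")

# Substring search: "TP73" is found in rows 2 and 3.
ind <- lapply(gene, grep, genes_col, fixed = TRUE)
hits_per_gene <- lengths(ind)          # number of matching rows per gene
multi <- which(hits_per_gene != 1)     # genes whose result is not a single row

# Exact-token lookup: split each row on commas, remember which row each
# token came from, then match whole symbols only.
tokens <- strsplit(genes_col, ",", fixed = TRUE)
row_of <- rep(seq_along(tokens), lengths(tokens))
ind_exact <- row_of[match(gene, unlist(tokens))]

print(hits_per_gene)
print(ind_exact)
```

Here unlist(ind) has five elements for four genes, while ind_exact has exactly one integer per gene (NA for a gene that appears in no row).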

Related

Repeating patterns in a vector in R

Suppose a vector is produced from a vector of unknown length with unique elements by repeating it an unknown number of times:
small_v <- c("as","d2","GI","Worm")
big_v <- rep(small_v, 3)
How can I then determine how long the original vector was and how many times it was repeated?
So in this example the original length was 4 and it repeats 3 times.
Realistically in my case the vectors will be fairly small and will be repeated only a few times.
1) Assuming that there is at least one unique element in small_v (which is the case in the question since it assumes all elements in small_v are unique):
min(table(big_v))
## [1] 3
or using pipes
big_v |> table() |> min()
## [1] 3
Here is a more difficult test but it still works because small_v2[2] is unique in small_v2 even though the other elements of small_v2 are not unique.
# test data
small_v2 <- c(small_v, small_v[-2])
big_v2 <- rep(small_v2, 3)
min(table(big_v2))
## [1] 3
2) If we knew that the first element of small_v were unique (which is the case in the question since it assumes all elements in small_v are unique) then this would work:
sum(big_v[1] == big_v)
## [1] 3
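The question also asks for the original length. As a minimal sketch building on the repeat count above, the length follows by dividing the total length by the number of repeats; the names n_rep, orig_len and recovered are introduced here for illustration only.

```r
small_v <- c("as", "d2", "GI", "Worm")
big_v <- rep(small_v, 3)

# Repeat count: the rarest value occurs once per repetition.
n_rep <- min(table(big_v))

# Original length: total length divided by the number of repetitions.
orig_len <- length(big_v) / n_rep

# The first orig_len elements reproduce the original vector.
recovered <- big_v[seq_len(orig_len)]

print(n_rep)
print(orig_len)
print(recovered)
```

Checking identical(rep(recovered, n_rep), big_v) is a cheap way to confirm that the vector really is a whole number of repetitions.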
1) If the elements are all repeating and no other values are there, then use
length(big_v)/length(unique(big_v))
[1] 3
2) Or use
library(data.table)
max(rowid(big_v))
[1] 3
Alternatively, we could use rle() together with with() to count the repeats:
with(rle(sort(big_v)), max(lengths))
[1] 3
Created on 2023-02-04 with reprex v2.0.2

Split a list of elements into two unique lists (and get all combinations) in R

I have a list of elements (my real list has 11 elements, this is just an example):
x <- c(1, 2, 3)
and want to split them into two lists (using all entries) but I want to get all possible combinations of that list to be returned e.g.:
(1,2)(3) & (1)(2,3) & (2)(1,3)
Does anyone know an efficient way to do this for a more complex list?
Thanks in advance for your help!
List with 3 elements:
vec <- 1:3
Note that for each element we have two possibilities: it is either in 1st split or in 2nd split. So we define a matrix of all possible splits (in rows) using expand.grid which produces all possible combinations:
groups <- as.matrix(expand.grid(rep(list(1:2), length(vec))))
However, this treats scenarios where the groups are flipped as different splits. It also includes the scenarios where all the observations are in the same group (there are only 2 of them).
To remove these, we need to drop the rows of the groups matrix that contain only one group (2 such rows), as well as all the rows that split the vector in the same way with only the group labels switched.
One-group entries are on top and bottom so removing them is easy:
groups <- groups[-c(1, nrow(groups)),]
Duplicated entries are a bit trickier, but note that we can get rid of them by removing all the rows where the first element is in group 2. In effect this imposes the requirement that the first element is always assigned to group 1.
groups <- groups[groups[,1]==1,]
Then the job is to split the vector using each of the rows in the groups matrix. For that we use Map to call the split() function on vec and each row of the groups matrix:
splits <- Map(split, list(vec), split(groups, row(groups)))
> splits
[[1]]
[[1]]$`1`
[1] 1 3
[[1]]$`2`
[1] 2
[[2]]
[[2]]$`1`
[1] 1 2
[[2]]$`2`
[1] 3
[[3]]
[[3]]$`1`
[1] 1
[[3]]$`2`
[1] 2 3
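The steps above can be collected into one function. This is a minimal sketch, with two_way_splits as a hypothetical name; it uses lapply over row indices instead of Map, and drop = FALSE so that short inputs still leave a matrix. For n elements it returns 2^(n-1) - 1 splits, which is 1023 for the 11-element case mentioned in the question.

```r
# Hypothetical helper: all ways to split vec into two non-empty groups,
# ignoring which group is labelled 1 and which is labelled 2.
two_way_splits <- function(vec) {
  n <- length(vec)
  # One row per assignment of every element to group 1 or group 2.
  groups <- as.matrix(expand.grid(rep(list(1:2), n)))
  # Drop the all-1 (first) and all-2 (last) rows: one group would be empty.
  groups <- groups[-c(1, nrow(groups)), , drop = FALSE]
  # Pin the first element to group 1 to remove label-swapped duplicates.
  groups <- groups[groups[, 1] == 1, , drop = FALSE]
  lapply(seq_len(nrow(groups)), function(i) split(vec, groups[i, ]))
}

splits3 <- two_way_splits(1:3)
n_splits_11 <- length(two_way_splits(1:11))

print(length(splits3))
print(n_splits_11)
```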

R is returning an ID that is not in the data frame?

I have a data frame with several variables that represent ID numbers (the data frames in the workspace are all originally tables from a normalized database). I was surprised to see that I am sometimes able to reference an ID's description before I use the merge to map the description in, but only if I use the $ notation. For example: I set up data frame q to include the variable "LocationID". Then I do the following...
Example for 1 & 2:
> colnames(q)
[1] "LocationID" "PlanID" "Rate"
> sort(unique(q[,'Location'])) #This fails. duh
Error in `[.data.frame`(q, , "Location") : undefined columns selected
> sort(unique(q$Location)) #This works. what?
[1] 1 2 3
Questions
1. Why does the second one work? Maybe that's looking a gift horse in the mouth.
2. Why doesn't the first one work if the second one does?
3. For the above example, q is constructed from another data frame with more variables. This fails for the larger data frame. Why does it fail?
Example for 3:
> dim(y)
[1] 207171 86
> q<-y[,cbind('LocationID','PlanID','Rate')]
> dim(q)
[1] 207171 3
> unique(y$Location)
NULL
> unique(q$Location)
[1] 1 2 3
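No answer is included here. A likely explanation, stated as an assumption because the real columns of y are unknown, is partial matching: the $ operator on a data frame accepts a unique prefix of a column name, whereas [ requires the exact name. If the larger data frame has more than one column starting with "Location", the prefix is ambiguous and $ returns NULL. The sketch below uses made-up data and a hypothetical second column, LocationName, to reproduce all three outcomes.

```r
# Made-up stand-in for q: only one column starts with "Location".
q <- data.frame(LocationID = c(1, 2, 3),
                PlanID     = c(10, 20, 30),
                Rate       = c(0.1, 0.2, 0.3))

# Made-up stand-in for y: two columns share the prefix (LocationName is hypothetical).
y <- data.frame(LocationID   = c(1, 2, 3),
                LocationName = c("a", "b", "c"),
                PlanID       = c(10, 20, 30))

partial_q <- q$Location                     # unique prefix: returns LocationID
partial_y <- y$Location                     # ambiguous prefix: returns NULL
exact_q   <- q[["Location", exact = TRUE]]  # exact matching: returns NULL

# `[` never matches partially, so this is an error rather than a silent match.
bracket_q <- tryCatch(q[, "Location"], error = function(e) "error")

print(partial_q)
print(partial_y)
print(bracket_q)
```

Setting options(warnPartialMatchDollar = TRUE) makes R warn whenever $ falls back to a partial match, which helps catch this early.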

R - Why does rep() seemingly change behaviour of lists

When I started to preinitialize lists of lists in R that are to be filled afterwards, I wondered about the behaviour of list objects when used as the value in rep(). When I try the following...
listOfLists <- rep(list(1, 2), 4)
... listOfLists is a single list:
1 2 1 2 1 2 1 2
However, I would assume it to be a list of lists which finally contain the values 1 and 2 each:
1 2
1 2
1 2
1 2
To get the desired result I have to surround the value entries with c() additionally:
listOfLists <- rep(list(c(1, 2)), 4)
I wonder why this is the case in R. Shouldn't list create a fully functional list as it normally does instead of doing something similar to c()? Why does grouping the values with c() actually solve the problem here?
Thank you for your thoughts!
Conclusion:
Both Ben Bolker's and Peyton's posts give the final answer. It was the behaviour of neither the list() nor the c() function. Instead, rep() seems to combine the entries of lists and vectors into one. Surrounding the values with another container makes rep() actually "ignore" the first container but repeat the second.
What you got with rep(list(c(1, 2)), 4) is not a list of lists; it's a list of numeric vectors. If you really want a list of lists, try
replicate(4,list(1,2),simplify=FALSE)
or
rep(list(list(1, 2)), 4)
You can understand a little bit better why this works as it does by performing exegesis on the first line of ?rep:
‘rep’ replicates the values in ‘x’.
In other words, it promises to replicate the contents of x, but not necessarily to replicate x itself. (This is why the second suggestion, kindly contributed by @flodel, works -- it makes x into a list whose contents are a list -- and why the vector-based rep() works -- the contents of the list are a vector.)
It does create a fully functional list. The difference is that in your first example, you create a list with two elements, whereas in the second example, you create a list with one element--a vector.
When you combine lists (e.g., with rep), you're essentially creating a new list with all the elements of the previous lists. In the first example, then, you'll have eight elements, and in the second example, you'll have four.
Another way to see this:
> length(list(1, 2))
[1] 2
> c(list(1, 2), list(1, 2), list(1, 2))
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 1
[[4]]
[1] 2
[[5]]
[1] 1
[[6]]
[1] 2
> length(list(1:2))
[1] 1
> c(list(1:2), list(1:2), list(1:2))
[[1]]
[1] 1 2
[[2]]
[1] 1 2
[[3]]
[1] 1 2
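To tie the answers together, here is a short sketch comparing the variants discussed above by length and element type. The variable names flat, vecs, nested and nested2 are chosen here for illustration.

```r
flat    <- rep(list(1, 2), 4)                         # 8 numeric elements
vecs    <- rep(list(c(1, 2)), 4)                      # 4 numeric vectors
nested  <- rep(list(list(1, 2)), 4)                   # 4 lists of 2 elements
nested2 <- replicate(4, list(1, 2), simplify = FALSE) # same structure as nested

print(length(flat))
print(length(vecs))
print(length(nested))

# Only the nested variants hold lists as their elements.
print(is.list(vecs[[1]]))
print(is.list(nested[[1]]))
```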

Transposing a large dataframe / matrix in R

Am encountering a strange issue transposing a large dataset. I want to get a list of non-linear flight routes (i.e. sub-lists of vectors with 30 vertices each) into a dataframe (with 32 columns for vertices). The list coerces into a data.frame no problem, but then fails when (1) transposing with t(x) and (2) converting to matrix.
To illustrate:
> class(gc)
[1] "list"
> length(gc)
[1] 58278
> gc[[1]][1:30]
[1] 147.2200 147.1606 147.1012 147.0418 146.9824 146.9231 146.8638
[8] 146.8046 146.7454 146.6862 146.6270 146.5679 146.5088 146.4498
[15] 146.3908 146.3318 146.2728 146.2139 146.1550 146.0961 146.0373
[22] 145.9785 145.9197 145.8610 145.8022 145.7435 145.6849 145.6262
[29] 145.5676 145.5090
> gc2 <- data.frame(gc)
> nrow(gc2)
[1] 32
> length(gc2)
[1] 116556
> gc2[1:5,1:5]
lon lat lon.1 lat.1 lon.2
1 147.2200 -9.443383 -80.37861 43.46083 -87.90484
2 147.1606 -9.335072 -80.23135 43.52385 -87.53193
3 147.1012 -9.226751 -80.08379 43.58667 -87.15751
4 147.0418 -9.118420 -79.93591 43.64931 -86.78161
5 146.9824 -9.010080 -79.78773 43.71175 -86.40421
> gc3 <- t(gc2)
> nrow(gc3)
[1] 116556
> length(gc3)
[1] 3729792
> gc3 <- as.matrix(gc2)
> nrow(gc3)
[1] 32
> length(gc3)
[1] 3729792
The 3729792 figure is 116556*32.
Grateful for any assistance!
"3729792 figure is 116556*32"
That is correct. length() for a matrix tells you the number of elements the matrix holds (which you have verified). length() for a data.frame tells you the number of columns it has.
If you want to compare apples to apples in your data.frame vs. matrix comparison, use nrow() and ncol()
I'm guessing a little at your data structure, but you've hinted that it's a list of numeric vectors.
n_routes <- 5
gc <- replicate(n_routes, runif(30), simplify = FALSE)
names(gc) <- letters[seq_len(n_routes)]
You can convert this list to a data frame with as.data.frame(gc), but note that data frames aren't meant to be transposed (it doesn't make sense if columns have different types).
This means that you need to convert to a data frame and then to a matrix before transposing.
gc2 <- t(as.matrix(as.data.frame(gc)))
Since all your columns are numeric, you may want to leave it as a matrix. Alternatively, use as.data.frame again to make it a data frame.
as.data.frame(gc2)
As others have pointed out, length has different meanings for matrices and data frames. The definition for data frames – the number of columns – is unintuitive, and a legacy of S compatibility. Use ncol instead, since it gives the same answer, but with more readable code.
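The difference between length(), nrow() and ncol() can be checked on a small simulated example. This is a sketch under the same guess about the data structure (a named list of equal-length numeric vectors); the name routes is used instead of gc to avoid masking base R's gc() function, and do.call(rbind, ...) is shown as an alternative that skips the data frame step, not as something taken from the answers above.

```r
set.seed(1)
n_routes <- 5
routes <- replicate(n_routes, runif(30), simplify = FALSE)
names(routes) <- letters[seq_len(n_routes)]

df <- as.data.frame(routes)  # 30 rows, one column per route
m  <- t(as.matrix(df))       # one row per route, 30 columns

# length() counts columns for a data frame but elements for a matrix.
print(length(df))
print(length(m))
print(dim(m))

# Alternative: bind the vectors row-wise directly into a matrix.
m2 <- do.call(rbind, routes)
print(dim(m2))
```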
