Transposing a large dataframe / matrix in R

I'm encountering a strange issue when transposing a large dataset. I want to get a list of non-linear flight routes (i.e. sub-lists of vectors with 30 vertices each) into a dataframe (with 32 columns for vertices). The list coerces into a data.frame no problem, but then fails when (1) transposing with t(x) and (2) converting to a matrix.
To illustrate:
> class(gc)
[1] "list"
> length(gc)
[1] 58278
> gc[[1]][1:30]
[1] 147.2200 147.1606 147.1012 147.0418 146.9824 146.9231 146.8638
[8] 146.8046 146.7454 146.6862 146.6270 146.5679 146.5088 146.4498
[15] 146.3908 146.3318 146.2728 146.2139 146.1550 146.0961 146.0373
[22] 145.9785 145.9197 145.8610 145.8022 145.7435 145.6849 145.6262
[29] 145.5676 145.5090
> gc2 <- data.frame(gc)
> nrow(gc2)
[1] 32
> length(gc2)
[1] 116556
> gc2[1:5,1:5]
lon lat lon.1 lat.1 lon.2
1 147.2200 -9.443383 -80.37861 43.46083 -87.90484
2 147.1606 -9.335072 -80.23135 43.52385 -87.53193
3 147.1012 -9.226751 -80.08379 43.58667 -87.15751
4 147.0418 -9.118420 -79.93591 43.64931 -86.78161
5 146.9824 -9.010080 -79.78773 43.71175 -86.40421
> gc3 <- t(gc2)
> nrow(gc3)
[1] 116556
> length(gc3)
[1] 3729792
> gc3 <- as.matrix(gc2)
> nrow(gc3)
[1] 32
> length(gc3)
[1] 3729792
The 3729792 figure is 116556*32.
Grateful for any assistance!

3729792 figure is 116556*32
That is correct. length() for a matrix tells you the number of elements the matrix holds (which you have verified). length() for a data.frame tells you the number of columns it has.
If you want to compare apples to apples in your data.frame vs. matrix comparison, use nrow() and ncol().
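For example, a quick illustration with toy objects (df and m below are throwaway examples, not your data):
df <- data.frame(a = 1:3, b = 4:6)
m <- as.matrix(df)
length(df)  # 2 -- for a data frame, the number of columns
length(m)   # 6 -- for a matrix, the number of elements
ncol(df)    # 2
ncol(m)     # 2 -- nrow()/ncol() report dimensions consistently for both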

I'm guessing a little at your data structure, but you've hinted that it's a list of numeric vectors.
n_routes <- 5
gc <- replicate(n_routes, runif(30), simplify = FALSE)
names(gc) <- letters[seq_len(n_routes)]
You can convert this list to a data frame with as.data.frame(gc), but note that data frames aren't meant to be transposed (it doesn't make sense if columns have different types).
This means that you need to convert to a data frame and then to a matrix before transposing.
gc2 <- t(as.matrix(as.data.frame(gc)))
Since all your columns are numeric, you may want to leave it as a matrix. Alternatively, use as.data.frame again to make it a data frame.
as.data.frame(gc2)
As others have pointed out, length has different meanings for matrices and data frames. The definition for data frames – the number of columns – is unintuitive, and a legacy of S compatibility. Use ncol instead, since it gives the same answer, but with more readable code.

Related

Repeating patterns in a vector in R

Suppose a vector is produced from a vector of unknown length with unique elements by repeating it an unknown number of times:
small_v <- c("as","d2","GI","Worm")
big_v <- rep(small_v, 3)
How can I determine how long the original vector was and how many times it was repeated?
So in this example the original length was 4 and it repeats 3 times.
Realistically in my case the vectors will be fairly small and will be repeated only a few times.
1) Assuming that there is at least one unique element in small_v (which is the case in the question since it assumes all elements in small_v are unique):
min(table(big_v))
## [1] 3
or using pipes
big_v |> table() |> min()
## [1] 3
Here is a more difficult test but it still works because small_v2[2] is unique in small_v2 even though the other elements of small_v2 are not unique.
# test data
small_v2 <- c(small_v, small_v[-2])
big_v2 <- rep(small_v2, 3)
min(table(big_v2))
## [1] 3
2) If we knew that the first element of small_v were unique (which is the case in the question since it assumes all elements in small_v are unique) then this would work:
sum(big_v[1] == big_v)
## [1] 3
1) If the elements all repeat the same number of times and no other values are present, then use
length(big_v)/length(unique(big_v))
[1] 3
2) Or use
library(data.table)
max(rowid(big_v))
[1] 3
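To see why the max works: rowid() numbers each occurrence of a value within its group, in order of appearance (shown here for the example big_v above, with data.table loaded):
rowid(big_v)
[1] 1 1 1 1 2 2 2 2 3 3 3 3
so the largest within-group index is the repeat count.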
Alternatively, we could use rle() together with with() to count the repeats:
with(rle(sort(big_v)), max(lengths))
[1] 3
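To see why this works: rle() on the sorted vector collapses identical neighbours into runs, and because every element of small_v is unique each run length equals the repeat count:
rle(sort(big_v))$lengths
[1] 3 3 3 3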

Split a list of elements into two unique lists (and get all combinations) in R

I have a list of elements (my real list has 11 elements, this is just an example):
x <- c(1, 2, 3)
and want to split it into two lists (using all entries), but I want all possible combinations of such splits to be returned, e.g.:
(1,2)(3) & (1)(2,3) & (2)(1,3)
Does anyone know an efficient way to do this for a more complex list?
Thanks in advance for your help!
List with 3 elements:
vec <- 1:3
Note that for each element we have two possibilities: it is either in the 1st split or in the 2nd split. So we define a matrix of all possible splits (one per row) using expand.grid(), which produces all possible combinations:
groups <- as.matrix(expand.grid(rep(list(1:2), length(vec))))
However, this will treat scenarios where the groups are flipped as different splits. It will also include scenarios where all the observations are in the same group (there are only 2 of those).
If you want to remove them, we need to drop the rows of the groups matrix that put everything in one group (the 2 rows just mentioned) and all the rows that split the vector the same way with only the group labels switched.
The one-group entries are the first and last rows, so removing them is easy:
groups <- groups[-c(1, nrow(groups)),]
Duplicated entries are a bit trickier. But note that we can get rid of them by removing all the rows where the first column is 2. In effect this requires that the first element is always assigned to group 1.
groups <- groups[groups[,1]==1,]
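For the length-3 example, only three rows of groups survive these two filtering steps, one per distinct split (the three values in each row are the group assignments of elements 1, 2 and 3):
groups
# 1 2 1  ->  split {1, 3} | {2}
# 1 1 2  ->  split {1, 2} | {3}
# 1 2 2  ->  split {1} | {2, 3}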
Then the job is to split our vector using each of the rows in the groups matrix. For that we use Map() to call split() on vec with each row of the groups matrix:
splits <- Map(split, list(vec), split(groups, row(groups)))
> splits
[[1]]
[[1]]$`1`
[1] 1 3
[[1]]$`2`
[1] 2
[[2]]
[[2]]$`1`
[1] 1 2
[[2]]$`2`
[1] 3
[[3]]
[[3]]$`1`
[1] 1
[[3]]$`2`
[1] 2 3
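For the real list of 11 elements the same steps apply. With the first element pinned to group 1 there are 2^10 - 1 = 1023 two-group splits, which you can sanity-check by rerunning the filtering above on a longer vector:
vec <- 1:11
groups <- as.matrix(expand.grid(rep(list(1:2), length(vec))))
groups <- groups[-c(1, nrow(groups)), ]
groups <- groups[groups[, 1] == 1, ]
nrow(groups)
# [1] 1023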

R: Compare vectors of differing lengths

I'm actually having trouble phrasing my question, so if anyone has feedback on that, I'd love to hear it.
I'm working in R and have a vector and a data frame, of different lengths:
xp.data <- c(400,500,600,700)
XPTable <- data.frame("Level"=1:10,"XP"=c(10,50,100,200,400,600,700,800,900,1000))
What I'm hoping to obtain is a new vector:
> lv.data
[1] 5 5 6 7
The goal is to do so without using a loop, as the xp.data vector can be any length, and the XPTable data frame can also be of varying lengths.
If I was doing this without a vector for xp.data, I'd just use:
max(XPTable$Level[XPTable$XP < xp.data])
However, this only works if xp.data has a length of 1.
lv.data <- findInterval(xp.data, XPTable$XP)
print(lv.data)
# [1] 5 5 6 7
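findInterval(x, vec) returns, for each element of x, how many of the breakpoints in vec (which must be sorted in non-decreasing order) are less than or equal to it, which here is exactly the level reached. If you prefer something closer to your original expression, a loop-free sapply() version works too (note the <=, which matches findInterval's behaviour and the expected output above):
sapply(xp.data, function(xp) max(XPTable$Level[XPTable$XP <= xp]))
# [1] 5 5 6 7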

Using R to count patterns in columns

I have a matrix in R containing 1000 columns and 4 rows. Each cell in the matrix contains an integer between 1-4. I want to know two things:
1) What is the number of columns that contain a "1", "2", "3", and "4" in any order? Ideally, I would like the code to not require that I input each possible combination of 1,2,3,4 to perform its count.
2) What is the number of columns that contain 3 of the possible integers, but not all 4?
Solution 1
The most obvious approach is to run apply() over the columns and test for the required tabulation of the column vector using tabulate(). This requires first building a factor() out of the column vector to normalize its storage representation to an integer vector starting at 1. And since you don't care about order, we must run sort() before comparing it against the expected tabulation.
For the "4 of 4" problem the expected tabulation will be four 1s, while for the "3 of 4" problem the expected tabulation will be two 1s and one 2.
## generate data
set.seed(1L); NR <- 4L; NC <- 1e3L; m <- matrix(sample(1:4,NR*NC,T),NR);
sum(apply(m,2L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x))))));
## [1] 107
sum(apply(m,2L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x))))));
## [1] 545
Solution 2
v <- c(1L,2L,4L,8L);
sum(colSums(matrix(v[m],nrow(m)))==15L);
## [1] 107
v <- c(1L,3L,9L,27L);
s3 <- c(14L,32L,38L,16L,34L,22L,58L,46L,64L,42L,48L,66L);
sum(colSums(matrix(v[m],nrow(m)))%in%s3);
## [1] 545
Here's a slightly weird solution.
I was looking into how to use colSums() or colMeans() to try to find a quick test for columns that have 4 of 4 or 3 of 4 of the possible cell values. The problem is, there are multiple combinations of the 4 values that sum to the same total. For example, 1+2+3+4 == 10, but 1+1+4+4 == 10 as well, so just getting a column sum of 10 is not enough.
I realized that one possible solution would be to change the set of values that we're summing, such that our target combinations would sum to unambiguous values. We can achieve this by spreading out the original set from 1:4 to something more diffuse. Furthermore, the original set of values of 1:4 is perfect for indexing a precomputed vector of values, so this seemed like a particularly logical approach for your problem.
I wasn't sure what degree of diffusion would be required to make unique the sums of the target combinations. Some ad hoc testing seemed to indicate that multiplication by a fixed multiplier would not be sufficient to disambiguate the sums, so I moved up to exponentiation. I wrote the following code to facilitate the testing of different bases to identify the minimal bases necessary for this disambiguation.
tryBaseForTabulation <- function(N,tab,base) {
    ## make destination value set, exponentiating from 0 to N-1
    x <- base^(seq_len(N)-1L);
    ## make a matrix of unique combinations of the original set
    g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
    ## get the indexes of combinations that match the required tabulation
    good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
    ## get the sums of good and bad combinations
    hs <- rowSums(g[good,,drop=F]);
    ns <- rowSums(g[-good,,drop=F]);
    ## return the number of ambiguous sums; we need to get zero!
    sum(hs%in%ns);
}; ## end tryBaseForTabulation()
The function takes the size of the set (4 for us), the required tabulation (as returned by tabulate()) in sorted order (as revealed earlier, this is four 1s for the "4 of 4" problem, two 1s and one 2 for the "3 of 4" problem), and the test base. This is the result for a base of 2 for the "4 of 4" problem:
tryBaseForTabulation(4L,rep(1L,4L),2L);
## [1] 0
So we get the result we need right away; a base of 2 is sufficient for the "4 of 4" problem. But for the "3 of 4" problem, it takes one more attempt:
tryBaseForTabulation(4L,c(1L,1L,2L),2L);
## [1] 7
tryBaseForTabulation(4L,c(1L,1L,2L),3L);
## [1] 0
So we need a base of 3 for the "3 of 4" problem.
Note that, although we are using exponentiation as the tool to diffuse the set, we don't actually need to perform any exponentiation at solution run-time, because we can simply index a precomputed vector of powers to transform the value space. Unfortunately, indexing a vector with a matrix returns a flat vector result, losing the matrix structure. But we can easily rebuild the matrix structure with a call to matrix(), thus we don't lose very much with this idiosyncrasy.
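As a minimal illustration of that flattening and rebuild, reusing m and the base-2 value vector v from Solution 2:
v <- c(1L,2L,4L,8L);
str(v[m]);                  ## a plain integer vector of length 4000; the matrix structure is gone
str(matrix(v[m],nrow(m)));  ## matrix() restores the 4 x 1000 layout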
The last step is to derive the destination value space and the set of sums that satisfy the problem condition. The value spaces are easy; we can just compute the power sequence as done within tryBaseForTabulation():
2L^(1:4-1L);
## [1] 1 2 4 8
3L^(1:4-1L);
## [1] 1 3 9 27
The set of sums was computed as hs in the tryBaseForTabulation() function. Hence we can write a new similar function for these:
getBaseSums <- function(N,tab,base) {
    ## make destination value set, exponentiating from 0 to N-1
    x <- base^(seq_len(N)-1L);
    ## make a matrix of unique combinations of the original set
    g <- unique(t(apply(expand.grid(x,x,x,x),1L,sort)));
    ## get the indexes of combinations that match the required tabulation
    good <- which(apply(g,1L,function(x) identical(tab,sort(tabulate(factor(x))))));
    ## return the sums of good combinations
    rowSums(g[good,,drop=F]);
}; ## end getBaseSums()
Giving:
getBaseSums(4L,rep(1L,4L),2L);
## [1] 15
getBaseSums(4L,c(1L,1L,2L),3L);
## [1] 14 32 38 16 34 22 58 46 64 42 48 66
Now that the solution is complete, I realize that the cost of the vector index operation, rebuilding the matrix, and the %in% operation for the second problem may render it inferior to other potential solutions. But in any case, it's one possible solution, and I thought it was an interesting idea to explore.
Solution 3
Another possible solution is to precompute an N-dimensional lookup table that stores which combinations match the problem condition and which don't. The input matrix can then be used directly as an index matrix into the lookup table (well, almost directly; we'll need a single t() call, since its combinations are laid across columns instead of rows).
For a large set of values, or for long vectors, this could easily become impractical. For example, if we had 8 possible cell values with 8 rows then we would need a lookup table of size 8^8 == 16777216. But fortunately for the sizing given in the question we only need 4^4 == 256, which is completely manageable.
To facilitate the creation of the lookup table, I wrote the following function, which stands for "N-dimensional combinations":
NDcomb <- function(N,f) {
    x <- seq_len(N);
    g <- do.call(expand.grid,rep(list(x),N));
    array(apply(g,1L,f),rep(N,N));
}; ## end NDcomb()
Once the lookup table is computed, the solution is easy:
v <- NDcomb(4L,function(x) identical(rep(1L,4L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 107
v <- NDcomb(4L,function(x) identical(c(1L,1L,2L),sort(tabulate(factor(x)))));
sum(v[t(m)]);
## [1] 545
We can use colSums(). Loop over 1:4, convert the matrix to a logical matrix for each value, get the colSums, check whether they are not equal to 0, and sum.
sapply(1:4, function(i) sum(colSums(m1==i)!=0))
#[1] 6 6 9 5
If we need the number of columns that contain 3 and not have 4
sum(colSums(m1!=4)!=0 & colSums(m1==3)!=0)
#[1] 9
data
set.seed(24)
m1 <- matrix(sample(1:4, 40, replace=TRUE), nrow=4)

R unlist a list to integers

[revised version]
I have a large character vector in R of size 57241 that contains gene symbols, e.g.
gene <- c("AL627309.1","SMIM1","DFFB") # assume this of size 57241
I have another table in which one column, table$genes, has some combination of genes in each row, e.g.
head(table$genes)
[1] ,OR4F5,AL627309.1,OR4F29,OR4F16,AL669831.1,
[2] ,TP73,CCDC27,SMIM1,LRRC47,CEP104,DFFB
..
This table has about 1400 rows. For each gene I want to find the index of the row in the table in which it is located.
To do that I used
ind <- sapply(gene, grep, table$genes, fixed=TRUE, USE.NAMES=FALSE)
The returned variable ind is a large list of size 57241 which looks like this:
head(ind)
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 1
[[4]]
[1] 1
[[5]]
[1] 1
[[6]]
[1] 1
I know for a fact each gene exists only once in that table. So the numbers I am interested in are the single values in each list element above, i.e. 1. How can I convert this into an integer vector? When I unlist() this I somehow get a vector of length ~500000, whereas I should be getting the same length as the list. I have tried many functions and combinations but nothing seems to work. Any ideas?
Thanks
I'm not able to reproduce that behavior with either a list or a dataframe:
> gene <- c("AL627309.1","SMIM1","DFFB")
>
> table <- list(genes =c(",OR4F5,AL627309.1,OR4F29,OR4F16,AL669831.1,",
",TP73,CCDC27,SMIM1,LRRC47,CEP104,DFFB"))
> (ind <- sapply(gene, grep, table$genes, fixed=TRUE,USE.NAMES=FALSE))
[1] 1 2 2
I thought for a bit that you should be using match, but after further consideration it seemed as though there must be something different about your data structure. Try posting dput(head(table$genes)) and dput(gene) to make your problem reproducible. You should also stop using the word "list" to refer to the items in table$genes. It confuses regular users of R, who think you are talking about an R "list". You can see which of the items in your ind "list" have a vector of length greater than one with:
which(sapply(ind, length) > 1)
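Once that check comes back empty (every gene matched exactly one row), a defensive way to collapse ind into an integer vector is vapply(), which throws an error if any element does not have length exactly 1, immediately flagging the multi-match problem behind your ~500000-length unlist() result:
# errors unless every element of ind is a single integer
ind_vec <- vapply(ind, identity, integer(1L))
length(ind_vec)  # now equal to length(gene)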
