Object size increases hugely when transposing a data frame - r

I have a data frame of ca. 50,000 RNA transcripts in rows, with 10,000 different samples in columns. The size of the data frame is 4.9GB.
I then have to transpose the data in order to subset it properly later:
df <- data.frame(t(df))
After the transpose, the object size has ballooned to 70GB. Why is this happening? Should transposing data really change the file size that much?
str() of the first 20 columns:
str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...

First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think it could be of general interest to dissect the issue.
The increase in object size is most likely because the class of the object changes when it is transposed, and objects of different classes have different sizes.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the structure of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has become a character matrix. How did this happen? Well, check the help text for the transpose method for data frames, ?t.data.frame:
A data frame is first coerced to a matrix: see as.matrix.
OK, see ?as.matrix:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
The transposed object is clearly larger. If we increase the number of rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
# re-create d and d_t with the code above before measuring
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.
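As a minimal sketch of that last point, assuming the annotation columns (called Name and Description here, as in the question's str() output) can be set aside, only the numeric part is converted to a matrix and transposed, so nothing is coerced to character. The toy dimensions and the names m and m_t are illustrative only.

```r
# Toy data resembling the question: two annotation columns plus numeric samples
set.seed(1)
d <- data.frame(Name = paste0("gene", 1:5),
                Description = letters[1:5],
                matrix(runif(5 * 4), nrow = 5),
                stringsAsFactors = FALSE)

# Keep only the numeric columns as a matrix and carry the IDs as row names
m <- as.matrix(d[, -(1:2)])
rownames(m) <- d$Name

# Transposing a numeric matrix keeps the storage type
m_t <- t(m)
typeof(m_t)   # "double"
dim(m_t)      # samples in rows, transcripts in columns
```

The annotation columns stay in the original data frame and can be matched back to the matrix through the transcript IDs in colnames(m_t).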
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham

Related

why does apply() produce a list sometimes, and a vector others?

I have this piece of code:
p.data=samp_data[,c('t_het_f','t_ane_f','t_loh_f')]
str(p.data)
head(p.data)
colnames(p.data)
head(apply(p.data,1,which.max))
which for one set of data produces this result:
'data.frame': 449 obs. of 3 variables:
$ t_het_f: num 0.663 0.688 0.746 0.429 0.484 ...
$ t_ane_f: num 0.291 0.3 0.247 0.398 0.261 ...
$ t_loh_f: num 0.04601 0.01236 0.00657 0.17376 0.2546 ...
t_het_f t_ane_f t_loh_f
1 0.6629108 0.2910798 0.046009390
...
6 0.7019118 0.2589706 0.039117647
[1] "t_het_f" "t_ane_f" "t_loh_f"
[1] 1 1 1 1 1 1
But for another set of data produces:
'data.frame': 587 obs. of 3 variables:
$ t_het_f: num 0.505 0.566 0.205 0.367 0.59 ...
$ t_ane_f: num 0.491 0.182 0.745 0.42 0.251 ...
$ t_loh_f: num 0.00427 0.25193 0.05003 0.21227 0.15891 ...
t_het_f t_ane_f t_loh_f
1 0.5048134 0.4909143 0.004272287
...
6 0.8159115 0.1829711 0.001117381
[1] "t_het_f" "t_ane_f" "t_loh_f"
[[1]]
t_het_f
1
[[2]]
t_het_f
1
Why would what looks to me like the same data structure (p.data) produce a vector in one case, and a list in another?
The return value of apply depends on the length of the output of each call, as mentioned in ?apply:
If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1. If n equals 1, apply returns a vector if MARGIN has length 1 and an array of dimension dim(X)[MARGIN] otherwise. If n is 0, the result has length 0 but not necessarily the ‘correct’ dimension.
If the calls to FUN return vectors of different lengths, apply returns a list of length prod(dim(X)[MARGIN]) with dim set to MARGIN if this has length greater than one.
Since the same function (which.max) was applied in both cases, it was not obvious that it might be returning different length values for the two datasets. The difference was being caused by the presence of 'NA' in the second dataset, but not in the first.
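A small sketch of how this can be reproduced and worked around, using a made-up matrix rather than the asker's data. which.max ignores missing values, so the length-0 result appears for a row that has no non-missing values at all; the vapply wrapper is only one possible way to force a plain vector.

```r
# Made-up data: the second row contains only NA
m <- rbind(c(0.6, 0.3, 0.1),
           c(NA,  NA,  NA),
           c(0.2, 0.7, 0.1))

res <- apply(m, 1, which.max)
is.list(res)    # TRUE, because the results have different lengths
lengths(res)    # the all-NA row yields a result of length 0

# One way to always get an integer vector: map empty results to NA
safe <- vapply(seq_len(nrow(m)), function(i) {
  w <- which.max(m[i, ])
  if (length(w) == 0) NA_integer_ else w
}, integer(1))
safe
```

Checking lengths(res) is a quick way to locate the rows that caused apply to fall back to a list.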

Subsetting columns in different positions and with different names in a large list of lists with purrr

I have a large list of lists. There are 46 lists in "output". Each list is a tibble with differing number of rows and columns. My immediate goal is to subset a specific column from each list.
This is str(output) of the first two lists to give you an idea of the data.
> str(output)
List of 46
$ Brain :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6108 obs. of 8 variables:
..$ p_val : chr [1:6108] "0" "1.60383253411205E-274" "0" "0" ...
..$ avg_diff : num [1:6108] 1.71 1.7 1.68 1.6 1.58 ...
..$ pct.1 : num [1:6108] 0.998 0.808 0.879 0.885 0.923 0.905 0.951 0.957 0.619 0.985 ...
..$ pct.2 : num [1:6108] 0.677 0.227 0.273 0.323 0.36 0.384 0.401 0.444 0.152 0.539 ...
..$ cluster : num [1:6108] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:6108] "Plp1" "Mal" "Ermn" "Stmn4" ...
..$ X__1 : logi [1:6108] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:6108] "Myelinating oligodendrocyte" NA NA NA ...
$ Bladder :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4656 obs. of 8 variables:
..$ p_val : num [1:4656] 0.00 1.17e-233 2.85e-276 0.00 0.00 ...
..$ avg_diff : num [1:4656] 2.41 2.23 2.04 2.01 1.98 ...
..$ pct.1 : num [1:4656] 0.833 0.612 0.855 0.987 1 0.951 0.711 0.544 0.683 0.516 ...
..$ pct.2 : num [1:4656] 0.074 0.048 0.191 0.373 0.906 0.217 0.105 0.044 0.177 0.106 ...
..$ cluster : num [1:4656] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:4656] "Dpt" "Gas1" "Cxcl12" "Lum" ...
..$ X__1 : logi [1:4656] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:4656] "Stromal cell_Dpt high" NA NA NA ...
Since I have a large number of lists that make up the list, I have been trying to create an iterative code to perform tasks. This hasn't been successful.
I can achieve this manually, or list by list, but I haven't been successful in finding an iterative way of doing this.
x <- data.frame(output$Brain, stringsAsFactors = FALSE)
tmp.list <- x$Cell.Type
tmp.output <- purrr::discard(tmp.list, is.na)
x <- subset(x, Cell.Type %in% tmp.output)
This gives me the output that I want, which are the rows in the column "Cell.Type" with non-NA values.
I got as far as the code below to get the 8th column of each list, which is the "Cell.Type" column.
lapply(output, "[", , 8)
But here I found that the naming and positioning of the "Cell.Type" column in each list is not consistent. This means I cannot use the lapply function to subset the 8th columns, as some lists have this on for example the 9th column.
I tried the code below, but it does not work and gets an error.
lapply(output, "[", , c('Cell.Type', 'celltyppe'))
#Error: Column `celltyppe` not found
#Call `rlang::last_error()` to see a backtrace
Essentially, from my "output" list, I want to subset either columns "Cell.Type" or "celltyppe" from each of the 46 lists to create a new list with 46 lists of just a single column of values. Then I want to drop all rows with NA.
I would like to perform this using some sort of loop.
At the moment I have not had much success. lapply seems to be able to extract columns from lists iteratively, but I am having difficulty subsetting named columns.
Once I can do this, I then want to create a loop that can subset only rows without NA.
FINAL CODE
This is the final code I have used to create exactly what I had hoped for. The first line of the code specifies the loop to go through each list of the large list. The second line of code selects columns of each list that contains "ell" in its name (Cell type, Cell Type, or celltyppe). The last removes any rows with "na".
purrr::map(output, ~ .x %>%
dplyr::select(matches("ell")) %>%
na.omit)
We can use an anonymous function:
lapply(output, function(x) na.omit(x[grep("(?i)Cell\\.?(?i)Typp?e", names(x))]))
#[[1]]
# Cell.Type
#1 1
#2 2
#3 3
#4 4
#5 5
#[[2]]
# celltyppe
#1 7
#2 8
#3 9
#4 10
#5 11
Also with purrr
library(tidyverse)
map(output, ~ .x %>%
select(matches("(?i)Cell\\.?(?i)Typp?e")) %>%
na.omit)
data
output <- list(data.frame(Cell.Type = 1:5, col1 = 6:10, col2 = 11:15),
data.frame(coln = 1:5, celltyppe = 7:11))
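If the possible column names are known in advance, a regular expression is not strictly needed. The following base R sketch assumes the two spellings from the question ("Cell.Type" and "celltyppe") and uses a small made-up list with some NA values; the names candidates and res are hypothetical.

```r
# Made-up stand-in for the list of tibbles, with NAs in the cell type column
output <- list(
  Brain   = data.frame(Cell.Type = c("a", NA, "b"), col1 = 1:3),
  Bladder = data.frame(coln = 1:3, celltyppe = c(NA, "c", NA))
)

# Assumed spellings of the cell type column
candidates <- c("Cell.Type", "celltyppe")

res <- lapply(output, function(x) {
  # first candidate name that is present in this data frame
  col <- intersect(candidates, names(x))[1]
  # single-bracket indexing keeps a one-column data frame; na.omit drops NA rows
  na.omit(x[col])
})
res
```

A data frame that contains none of the candidate names would give col = NA and fail on indexing, so real code would need a check for that case.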

Extract multiple objects from list in R

I have some output from the vegan function specaccum. It is a list of 8 objects of varying lengths;
> str(SPECIES)
List of 8
$ call : language specaccum(comm = PRETEND.DATA, method = "rarefaction")
$ method : chr "rarefaction"
$ sites : num [1:5] 1 2 3 4 5
$ richness : num [1:5] 20.9 34.5 42.8 47.4 50
$ sd : num [1:5] 1.51 2.02 1.87 1.35 0
$ perm : NULL
$ individuals: num [1:5] 25 50 75 100 125
$ freq : num [1:50] 1 2 3 2 4 3 3 3 4 2 ...
- attr(*, "class")= chr "specaccum"
I want to extract three of the lists ('richness', 'sd' and 'individuals') and convert them to columns in a data frame. I have developed a workaround;
SPECIES.rich <- data.frame(SPECIES[["richness"]])
SPECIES.sd <- data.frame(SPECIES[["sd"]])
SPECIES.individuals <- data.frame(SPECIES[["individuals"]])
SPECIES.df <- cbind(SPECIES.rich, SPECIES.sd, SPECIES.individuals)
But this seems clumsy and protracted. I wonder if anyone could suggest a neater solution? (Should I be looking at something with lapply??) Thanks!
Example data to generate the specaccum output;
set.seed(100)
PRETEND.DATA <- matrix(sample(0:1, 250, replace = TRUE), 5, 50)
library(vegan)
SPECIES <- specaccum(PRETEND.DATA, method = "rarefaction")
We can concatenate the names in a vector and extract it
SPECIES.df <- data.frame(SPECIES[c("richness", "sd", "individuals")])
Another alternative, similar to akrun's, is:
ctoc1 = as.data.frame(cbind(SPECIES$richness, SPECIES$sd, SPECIES$individuals))
Please note that in both cases (my answer and akrun's) you will get an error if the lengths of the columns do not match.
e.g.: SPECIES.df <- data.frame(SPECIES[c( "sd", "freq")])
Error in data.frame(richness = c(20.5549865665613, 33.5688503093388, 41.4708434700877, :
arguments imply differing number of rows:7, 47
If so, pad the shorter element with the length<- replacement function first:
length(SPECIES$sd) <- 47 # this will add NAs to increase the column length.
SPECIES.df <- data.frame(SPECIES[c("sd", "freq")])
SPECIES.df # data frame with 2 columns and 47 rows.
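To avoid running into that error unexpectedly, the lengths can be checked before building the data frame. The sketch below uses a plain list as a stand-in for the specaccum object (values copied from the str() output in the question, freq filled with placeholder values), so vegan is not needed to run it; the name keep is hypothetical.

```r
# Plain-list stand-in for the specaccum output shown in the question
SPECIES <- list(method      = "rarefaction",
                sites       = c(1, 2, 3, 4, 5),
                richness    = c(20.9, 34.5, 42.8, 47.4, 50),
                sd          = c(1.51, 2.02, 1.87, 1.35, 0),
                individuals = c(25, 50, 75, 100, 125),
                freq        = rep(1, 50))   # placeholder values

keep <- c("richness", "sd", "individuals")

# All selected elements must have the same length to form columns
stopifnot(length(unique(lengths(SPECIES[keep]))) == 1)

SPECIES.df <- data.frame(SPECIES[keep])
SPECIES.df
```

Because SPECIES[keep] is itself a named list, data.frame() takes the column names from the element names.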

R Summary based on column name length

I have the following problem:
I have a matrix with 80 columns whose names have either 10/11, 21/22, 31/32 or 42/43 characters. The names are totally different, but the length always fits into one of the four groups. Now I would like to add four columns where I get the sum of all the values of the columns corresponding to one group. Here is a little example of what I mean:
a<-rnorm(1:100)
b<-rnorm(1:100)
cc<-rnorm(1:100)
dd<-rnorm(1:100)
eee<-rnorm(1:100)
fff<-rnorm(1:100)
g<-data.frame(a,b,cc,dd,eee,fff)
g$group1<-"sum of all columns of with headers of length 1 (in this case a+b)"
g$group2<-"sum of all columns of with headers of length 2 (in this case cc+dd)"
g$group3<-"sum of all columns of with headers of length 3 (in this case eee+fff)"
I was able to transfer the matrix to a dataframe using melt() and carrying out the operation using stringr::str_length(). However, I could not transform this back to a matrix which I really need as final output. The columns are not in order and ordering would not help me much, since the number of columns depends on the outcome of the previous calculation and it would be too tedious to define dataframe ranges every time again.
Hope you can help.
You want this:
tmp <- nchar(names(g))
chargroups <- split(1:dim(g)[2], tmp)
# `chargroups` is a list of groups of columns with same number of letters in name
sapply(chargroups, function(x) {
if(length(x)>1) # rowSums can only accept 2+-dimensional object
rowSums(g[,x])
else
g[,x]
})
# `x` is, for each number of letters, a vector of column indices of `g`
The key part of this is that nchar is going to determine how long the column names are. The rest is pretty straightforward.
EDIT: In your actual code, though, you should deal with the ranges of name lengths by doing something like the following after you define tmp but before the sapply statement:
tmp[tmp==10] <- 11
tmp[tmp==21] <- 22
tmp[tmp==31] <- 32
tmp[tmp==42] <- 43
Another approach
set.seed(123)
a <- rnorm(1:100)
b <- rnorm(1:100)
cc <- rnorm(1:100)
dd <- rnorm(1:100)
eee <- rnorm(1:100)
fff <- rnorm(1:100)
g <- data.frame(a,b,cc,dd,eee,fff)
for ( i in 1:3 )
eval(parse(text = sprintf("g$group%s <- rowSums(g[nchar(names(g)) == %s])", i, i)))
str(g)
## 'data.frame': 100 obs. of 9 variables:
## $ a : num -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
## $ b : num -0.71 0.257 -0.247 -0.348 -0.952 ...
## $ cc : num 2.199 1.312 -0.265 0.543 -0.414 ...
## $ dd : num -0.715 -0.753 -0.939 -1.053 -0.437 ...
## $ eee : num -0.0736 -1.1687 -0.6347 -0.0288 0.6707 ...
## $ fff : num -0.602 -0.994 1.027 0.751 -1.509 ...
## $ group1: num -1.2709 0.0267 1.312 -0.277 -0.8223 ...
## $ group2: num 1.484 0.56 -1.204 -0.509 -0.851 ...
## $ group3: num -0.675 -2.162 0.392 0.722 -0.838 ...
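The same result can probably be reached without eval(parse()). The following is a sketch, not either answerer's code: it groups the column indices by name length with split() and binds the row sums back on; the names len and sums are hypothetical.

```r
set.seed(123)
g <- data.frame(a = rnorm(100), b = rnorm(100),
                cc = rnorm(100), dd = rnorm(100),
                eee = rnorm(100), fff = rnorm(100))

# Length of each column name, used as the grouping key
len <- nchar(names(g))

# For each name length, sum the matching columns row-wise.
# g[idx] stays a data frame even for a single column, so rowSums always works.
sums <- sapply(split(seq_along(g), len), function(idx) rowSums(g[idx]))
colnames(sums) <- paste0("group", colnames(sums))

g <- cbind(g, sums)
str(g)
```

If a matrix is needed as the final output, as.matrix(g) can be applied at the end, since all columns are numeric.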

Getting column name which holds a max value within a row of a matrix holding a separate max value within an array

For instance given:
dim1 <- c("P","PO","C","T")
dim2 <- c("LL","RR","R","Y")
dim3 <- c("Jerry1", "Jerry2", "Jerry3")
Q <- array(1:48, c(4, 4, 3), dimnames = list(dim1, dim2, dim3))
I want to reference within this array, the matrix that has the max dim3 value at the (3rd row, 4th column) location.
Upon identifying that matrix, I want to return the column name which has the maximum value within the matrix's (3rd Row, 1st Column) to (3rd Row, 3rd Column) range.
So what I'd hope to happen is that Jerry3 gets referenced because the number 47 is stored in its 3rd row, 4th column, and then within Jerry3, I would want the maximum number in row 3 to get referenced which would be 43, and ultimately, what I need returned (the only value I need) is then the column name which would be "R".
That's what I need to know how to do: obtain that "R" and assign it to a variable, i.e. "column_ref", such that column_ref <- "R".
This should do it - if I understand correctly:
Q <- array(1:48, c(4,4,3), dimnames=list(
c("P","PO","C","T"), c("LL","RR","R","Y"), c("Jerry1", "Jerry2", "Jerry3")))
column_ref <- names(which.max(Q[3,1:3, which.max(Q[3,4,])]))[1] # "R"
Some explanation:
which.max(Q[3,4,]) # return the index of the "Jerry3" slice (3)
which.max(Q[3,1:3, 3]) # returns the index of the "R" column (3)
...and then names returns the name of the index ("R").
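The same lookup can be written in two steps with dimension names instead of numeric positions, which may be easier to read. This is only a restatement of the one-liner above; the intermediate variable slice is hypothetical.

```r
Q <- array(1:48, c(4, 4, 3), dimnames = list(
  c("P", "PO", "C", "T"),
  c("LL", "RR", "R", "Y"),
  c("Jerry1", "Jerry2", "Jerry3")))

# Step 1: which slice has the largest value at row "C", column "Y"?
slice <- names(which.max(Q["C", "Y", ]))
slice        # "Jerry3"

# Step 2: within that slice, which of the first three columns is largest in row "C"?
column_ref <- names(which.max(Q["C", c("LL", "RR", "R"), slice]))
column_ref   # "R"
```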
Here's a simple way to solve:
mxCol=function(df, colIni, colFim){ #201609
if(missing(colIni)) colIni=1
if(missing(colFim)) colFim=ncol(df)
if(colIni>=colFim) { print('colIni>=ColFim'); return(NULL)}
dfm=cbind(mxC=apply(df[colIni:colFim], 1, function(x) colnames(df)[which.max(x)+(colIni-1)])
,df)
dfm=cbind(mxVal=as.numeric(apply(dfm,1,function(x) x[x[1]]))
,dfm)
return(dfm)
}
This post helped me to solve a data.frame general problem.
I have repeated measures for groups, G1 and G2.
> str(df)
'data.frame': 6 obs. of 15 variables:
$ G1 : num 0 0 2 2 8 8
$ G2 : logi FALSE TRUE FALSE TRUE FALSE TRUE
$ e.10.100 : num 26.41 -11.71 27.78 3.17 26.07 ...
$ e.10.250 : num 27.27 -12.79 29.16 3.19 26.91 ...
$ e.20.100 : num 29.96 -12.19 26.19 3.44 27.32 ...
$ e.20.100d: num 26.42 -13.16 28.26 4.18 25.43 ...
$ e.20.200 : num 24.244 -18.364 29.047 0.553 25.851 ...
$ e.20.50 : num 26.55 -13.28 29.65 4.34 27.26 ...
$ e.20.500 : num 27.94 -13.92 27.59 2.47 25.54 ...
$ e.20.500d: num 24.4 -15.63 26.78 4.86 25.39 ...
$ e.30.100d: num 26.543 -15.698 31.849 0.572 29.484 ...
$ e.30.250 : num 26.776 -16.532 28.961 0.813 25.407 ...
$ e.50.100 : num 25.995 -14.249 28.697 0.803 27.852 ...
$ e.50.100d: num 26.1 -12.7 27.1 2.5 27.4 ...
$ e.50.500 : num 28.78 -9.39 25.77 2.73 23.73 ..
I need to know which measure (column) has the best (max) result, and I need to disregard the grouping columns.
I ended up with this function
apply(df[colIni:colFim], 1, function(x) colnames(df)[which.max(x)+(colIni-1)])
#colIni: first column to consider; colFim: last column to consider
After having column name, another tiny function to get the max value
apply(dfm,1,function(x) x[x[1]])
And the function to solve similar problems, that return the column and the max value
mxCol=function(df, colIni, colFim){ #201609
if(missing(colIni)) colIni=1
if(missing(colFim)) colFim=ncol(df)
if(colIni>=colFim) { print('colIni>=ColFim'); return(NULL)}
dfm=cbind(mxCol=apply(df[colIni:colFim], 1, function(x) colnames(df)[which.max(x)+(colIni-1)])
,df)
dfm=cbind(mxVal=as.numeric(apply(dfm,1,function(x) x[x[1]]))
,dfm)
return(dfm)
}
In this case,
> mxCol(df,3)[1:11]
mxVal mxCol G1 G2 e.10.100 e.10.250 e.20.100 e.20.100d e.20.200 e.20.50 e.20.500
1 29.958 e.20.100 0 FALSE 26.408 27.268 29.958 26.418 24.244 26.553 27.942
2 -9.395 e.50.500 0 TRUE -11.708 -12.789 -12.189 -13.162 -18.364 -13.284 -13.923
3 31.849 e.30.100d 2 FALSE 27.782 29.158 26.190 28.257 29.047 29.650 27.586
4 4.862 e.20.500d 2 TRUE 3.175 3.190 3.439 4.182 0.553 4.337 2.467
5 29.484 e.30.100d 8 FALSE 26.069 26.909 27.319 25.430 25.851 27.262 25.535
6 -9.962 e.30.250 8 TRUE -11.362 -12.432 -15.960 -11.760 -12.832 -12.771 -12.810
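For comparison, here is a sketch of a different technique, base R's max.col(), which returns the index of the largest value in each row of a matrix. It is not the mxCol function above, and the small data frame uses made-up values that only imitate the layout shown.

```r
# Made-up data with two grouping columns and three measure columns
df <- data.frame(G1 = c(0, 0, 2),
                 G2 = c(FALSE, TRUE, FALSE),
                 e.10.100 = c(26.4, -11.7, 27.8),
                 e.10.250 = c(27.3, -12.8, 29.2),
                 e.20.100 = c(30.0, -12.2, 26.2))

# Work on the measure columns only
m <- as.matrix(df[3:5])

# Column index of the row-wise maximum; ties resolved in favour of the first
idx <- max.col(m, ties.method = "first")

res <- data.frame(mxVal = m[cbind(seq_len(nrow(m)), idx)],  # value at (row, idx)
                  mxCol = colnames(m)[idx],
                  df)
res
```

Indexing the matrix with a two-column matrix of (row, column) pairs picks out one value per row, which avoids the character conversion that apply() performs on mixed-type rows.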
