I have a data frame for a raw data set where the variable names are extremely long. I would like to display the structure of the data frame using the str function, and impose a character limit on the displayed variable names, so that it is easier to read.
Here is a reproducible example of the kind of thing I am talking about.
#Data frame with long names
set.seed(1);
DATA <- data.frame(ID = 1:50,
Value = rnorm(50),
This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name = runif(50));
#Show structure of DATA
str(DATA);
> str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name: num 0.655 0.353 0.27 0.993 0.633 ...
I would like to use the str function but impose an upper limit on the number of characters to display in the variable names, so that I get output that is something like the one below. I have read the documentation, but I have not been able to identify if there is an option to do this. (There seem to be options to impose upper limits on the lengths of strings in the data, but I cannot see an option to impose a limit on the length of the variable name.)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has... : num 0.655 0.353 0.27 0.993 0.633 ...
Question: Is there a simple way to get the structure of the data frame, but imposing a limitation on the length of the variable names (to get output something like the above)?
As far as I can see you're right, there doesn't seem to be a built in means to control this. You also can't do it after the fact because str() doesn't return anything. So the easiest option seems to be renaming beforehand. Relying on setNames(), you could create a simple function to accomplish this:
short_str <- function(data, n = 20, ...) {
name_vec <- names(data)
str(setNames(data, ifelse(
nchar(name_vec) > n, paste0(substring(name_vec, 1, n - 4), "... "), name_vec
)), ...)
}
short_str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has... : num 0.655 0.353 0.27 0.993 0.633 ...
I have a data frame of ca. 50,000 RNA transcripts in rows, with 10,000 different samples in columns. The size of the data frame is 4.9GB.
I then have to transpose the data in order to subset it properly later:
df <- data.frame(t(df))
After the transpose, the object size has ballooned to 70GB. Why is this happening? Should transposing data really change the file size that much?
str() of the first 20 columns:
str(df[1:20])
Classes 'tbl_df', 'tbl' and 'data.frame': 56202 obs. of 20 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
$ Description : chr "DDX11L1" "WASH7P" "MIR1302-11" "FAM138A" ...
$ GTEX-1117F-0226-SM-5GZZ7: num 0.1082 21.4 0.1602 0.0505 0 ...
$ GTEX-111CU-1826-SM-5GZYN: num 0.1158 11.03 0.0643 0 0 ...
$ GTEX-111FC-0226-SM-5N9B8: num 0.021 16.75 0.0467 0.0295 0 ...
$ GTEX-111VG-2326-SM-5N9BK: num 0.0233 8.172 0 0.0326 0 ...
$ GTEX-111YS-2426-SM-5GZZQ: num 0 7.658 0.0586 0 0 ...
$ GTEX-1122O-2026-SM-5NQ91: num 0.0464 9.372 0 0 0 ...
$ GTEX-1128S-2126-SM-5H12U: num 0.0308 10.08 0.1367 0.0861 0.1108 ...
$ GTEX-113IC-0226-SM-5HL5C: num 0.0936 13.56 0.2079 0.131 0.0562 ...
$ GTEX-117YX-2226-SM-5EGJJ: num 0.121 9.889 0.0537 0.0677 0 ...
$ GTEX-11DXW-0326-SM-5H11W: num 0.0286 9.121 0.0635 0 0 ...
$ GTEX-11DXX-2326-SM-5Q5A2: num 0 6.698 0.0508 0.032 0 ...
$ GTEX-11DZ1-0226-SM-5A5KF: num 0.0237 9.835 0 0.0664 0 ...
$ GTEX-11EI6-0226-SM-5EQ64: num 0.0802 13.1 0 0 0 ...
$ GTEX-11EM3-2326-SM-5H12B: num 0.0223 8.904 0.0496 0.0625 0.0402 ...
$ GTEX-11EMC-2826-SM-5PNY6: num 0.0189 16.59 0 0.0265 0.034 ...
$ GTEX-11EQ8-0226-SM-5EQ5G: num 0.0931 15.1 0.0689 0.0869 0 ...
$ GTEX-11EQ9-2526-SM-5HL66: num 0.0777 9.838 0 0 0 ...
First, you write that:
I then have to transpose this dataset in order to subset it properly later,
To be honest, I doubt you have to. Thus, this may be an XY-problem. That said, I think could be of general interest to dissect the issue.
The increase in object size is most likely due to that the class of the object before and after transposing has changed, together with the fact that objects of different class have different size.
I will try to illustrate this with some examples. We begin with the change of class.
Create a toy data frame with a structure resembling yours, a few character columns and several numeric columns:
# set number of rows and columns
nr <- 5
nc <- 5
set.seed(1)
d <- data.frame(x = sample(letters, nr, replace = TRUE),
y = sample(letters, nr, replace = TRUE),
matrix(runif(nr * nc), nrow = nr),
stringsAsFactors = FALSE)
Transpose it:
d_t <- t(d)
Check the structure of the original data and its transposed sibling:
str(d)
# 'data.frame': 5 obs. of 7 variables:
# $ x : chr "g" "j" "o" "x" ...
# $ y : chr "x" "y" "r" "q" ...
# $ X1: num 0.206 0.177 0.687 0.384 0.77
# $ X2: num 0.498 0.718 0.992 0.38 0.777
# $ X3: num 0.935 0.212 0.652 0.126 0.267
# $ X4: num 0.3861 0.0134 0.3824 0.8697 0.3403
# $ X5: num 0.482 0.6 0.494 0.186 0.827
str(d_t)
# chr [1:7, 1:5] "g" "x" "0.2059746" "0.4976992" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:7] "x" "y" "X1" "X2" ...
# ..$ : NULL
The data frame has became a character matrix. How did this happen? Well, check the help text for the transpose method for data frames: ?t.data.frame:
A data frame is first coerced to a matrix: see as.matrix.
OK, see ?as.matrix:
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column [...]
Whereas a data frame is a list where each column can be of different class, a matrix is just a vector with dimensions, which can hold only one class. Thus, because you have at least one character column, i.e. a non-(numeric/logical/complex) column, your data frame is coerced to a character matrix as a result of transpose. Then you coerce the matrix to data frame, where all columns are character (or factor, depending on your stringsAsFactors setting) - check str(data.frame(d_t)).
In the second step, the size of different objects is compared. Start with the data frame and its transpose, as created above:
# original data frame
object.size(d)
# 2360 bytes
# transposed df - a character matrix
object.size(d_t)
# 3280 bytes
The transposed object is clearly larger. If we increase the number rows and the number of numeric columns to mimic your data better, the relative difference is even larger:
nr <- 56202
nc <- 20
object.size(d)
# 9897712 bytes
object.size(d_t)
# 78299656 bytes
Because the number of elements is the same in the original and transposed data, the (memory) size of each individual element must differ. Let's check the size of integer, numeric, and character vectors of the same length. First, vectors with one-digit values and a corresponding vector of one-character elements:
onedigit_int <- sample(1:9, 1e4, replace = TRUE)
onedigit_num <- as.numeric(onedigit_int)
onedigit_char <- as.character(onedigit_int)
object.size(onedigit_int)
# 40048 bytes
object.size(onedigit_num)
# 80048 bytes
object.size(onedigit_char)
# 80552 bytes
For the single digits/characters, integer vectors occupy 4 bytes per element, and numeric and character vectors 8 bytes per element. The single-character vector does not require more memory than the numeric vector. Does this mean that we can reject the idea that the increase in total size is explained by the coercion of a large number of numeric variables to character? Well, we need to check what happens with vectors with multi-digits (which you seem to have) and their corresponding vectors of multi-character strings:
multidigit_int <- sample(1:1e6, 1e4, replace = TRUE)
multidigit_num <- as.numeric(multidigit_int)
multidigit_char <- as.character(multidigit_int)
object.size(multidigit_int)
# 40048 bytes
object.size(multidigit_num)
# 80048 bytes
object.size(multidigit_char)
# 637360 bytes
The integer vector still occupies 4 bytes for each element, the numeric vector still occupies 8 bytes for each element. However, the size per element in the character vector is larger for larger strings.
Thus, the transpose coerced your data frame to a character matrix, and the size of each character element is larger than its corresponding numeric element.
Transposing a data frame with columns of different class is very rarely sensible. And if all columns are of same class, then we may just as well use a matrix from the start.
Read more about how much memory is used to store different objects in Advanced R by Hadley Wickham
I have looked through various Overflow pages with similar questions (some linked) but haven't found anything that seems to help with this complicated task.
I have a series of data frames in my workspace and I would like to loop the same function (rollmean or some version of that) over all of them, then save the results to new data frames.
I have written a couple of lines of to generate a list of all data frames and a for loop that should iterate an apply statement over each data frame; however, I'm having problems trying to accomplish everything I'm hoping to achieve (my code and some sample data are included below):
1) I would like to restrict the rollmean function to all columns, except the 1st (or first several), so that the column(s) 'info' does not get averaged. I would also like to add this column(s) back to the output data frame.
2) I want to save the output as a new data frame (with a unique name). I do not care if it is saved to the workspace or exported as an xlsx, as I already have batch import codes written.
3) Ideally, I would like the resultant data frame to be the same number of observations as the input, where as rollmean shrinks your data. I also do not want these to become NA, so I don't want to use fill = NA This could be accomplished by writing a new function, passing type = "partial" in rollmean (though that still shrinks my data by 1 in my hands), or by starting the roll mean on the nth+2 term and binding the non averaged nth and nth+1 terms to the resulting data frame. Any way is fine.
(see picture for detail, it illustrates what the later would look like)
My code only accomplishes parts of these things and I cannot get the for loop to work together but can get parts to work if I run them on single data frames.
Any input is greatly appreciated because I'm out of ideas.
#reproducible data frames
a = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
b = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
c = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
colnames(a) = c("info", 1:20)
colnames(b) = c("info", 1:20)
colnames(c) = c("info", 1:20)
#identify all dataframes for looping rollmean
dflist = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)]
#for loop to create rolling average and save as new dataframe
for (j in 1:length(dflist)){
list = as.list(ls()[sapply(mget(ls(), .GlobalEnv), is.data.frame)])
new.names = as.character(unique(list))
smoothed = as.data.frame(
apply(
X = names(list), MARGIN = 1, FUN = rollmean, k = 3, align = 'right'))
assign(new.names[i], smoothed)
}
I also tried a nested apply approach but couldn't get it to call the rollmean/rollapply function similar to issue here so I went back to for loops but if someone can make this work with nested applies, I'm down!
Picture is ideal output: Top is single input dataframe with colored boxes demonstrating a rolling average across all columns, to be iterated over each column; bottom is ideal output with colors reflecting the location of output for each colored window above
To approach this, think about one column, then one frame (which is just a list of columns), then a list of frames.
(My data used is at the bottom of the answer.)
One Column
If you don't like the reduction of zoo::rollmean, then write your own:
myrollmean <- function(x, k, ..., type=c("normal","rollin","keep"), na.rm=FALSE) {
type <- match.arg(type)
out <- zoo::rollmean(x, k, ...)
aug <- c()
if (type == "rollin") {
# effectively:
# c(mean(x[1]), mean(x[1:2]), ..., mean(x[1:j]))
# for the j=k-1 elements that precede the first from rollmean,
# when it'll become something like:
# c(mean(x[3:5]), mean(x[4:6]), ...)
aug <- sapply(seq_len(k-1), function(i) mean(x[seq_len(i)], na.rm=na.rm))
} else if (type == "keep") {
aug <- x[seq_len(k-1)]
}
out <- c(aug, out)
out
}
myrollmean(1:8, k=3) # "normal", default behavior
# [1] 2 3 4 5 6 7
myrollmean(1:8, k=3, type="rollin")
# [1] 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0
myrollmean(1:8, k=3, type="keep")
# [1] 1 2 2 3 4 5 6 7
I caution that this implementation is a bit naïve at best, and needs to be fixed. Make sure that you understand what it is doing when you pick other than "normal" (which will not work for you, I'm just defaulting to the normal zoo::rollmean behavior). This function could easily be applied to other zoo::roll* functions.
On one column of the data:
rbind(
dflist[[1]][,2], # for comparison
myrollmean(dflist[[1]][,2], k=3, type="keep")
)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1.865352 0.4047481 0.1466527 1.7307097 0.08952618 0.6668976 1.0743669 1.511629 1.314276 0.1565303
# [2,] 1.865352 0.4047481 0.8055844 0.7607035 0.65562952 0.8290445 0.6102636 1.084298 1.300091 0.9941452
One "frame"
Simple use of lapply, omitting the first column:
str(dflist[[1]][1:4, 1:3])
# 'data.frame': 4 obs. of 3 variables:
# $ info: num 1 2 3 4
# $ 1 : num 1.865 0.405 0.147 1.731
# $ 2 : num 0.745 1.243 0.674 1.59
dflist[[1]][-1] <- lapply(dflist[[1]][-1], myrollmean, k=3, type="keep")
str(dflist[[1]][1:4, 1:3])
# 'data.frame': 4 obs. of 3 variables:
# $ info: num 1 2 3 4
# $ 1 : num 1.865 0.405 0.806 0.761
# $ 2 : num 0.745 1.243 0.887 1.169
(For validation, column $ 1 matches the second row in the "one column" example above.)
List of "frames"
(I reset the data to what it was before I modified it above ... see the "data" code at the bottom of the answer.)
We nest the previous technique into another lapply:
dflist2 <- lapply(dflist, function(ldf) {
ldf[-1] <- lapply(ldf[-1], myrollmean, k=3, type="keep")
ldf
})
str(lapply(dflist2, function(a) a[1:4, 1:3]))
# List of 3
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 1.865 0.405 0.806 0.761
# ..$ 2 : num [1:4] 0.745 1.243 0.887 1.169
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 0.271 3.611 2.36 3.095
# ..$ 2 : num [1:4] 0.127 0.722 0.346 0.73
# $ :'data.frame': 4 obs. of 3 variables:
# ..$ info: num [1:4] 1 2 3 4
# ..$ 1 : num [1:4] 1.278 0.346 1.202 0.822
# ..$ 2 : num [1:4] 0.341 1.296 1.244 1.528
(Again, for simple validation, see that the first frame's $ 1 row shows the same rolled means as the second row of the "one column" example, above.)
PS:
if you need to skip more than just the first column, then inside the outer lapply, use instead ldf[-(1:n)] <- lapply(ldf[-(1:n)], myrollmean, k=3, type="keep") to skip the first n columns
to use a window function other than zoo::rollmean, you'll want to change the special-cases of myrollmean, though it should be straight-forward enough given this example
I use a concocted str(...) to shorten the output for display here. You should verify all of your data that it is doing what you expect for the whole of each frame.
Reproducible Data
set.seed(2)
a = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
b = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
c = as.data.frame(cbind(info = 1:10, matrix(rexp(200), 10)))
colnames(a) = c("info", 1:20)
colnames(b) = c("info", 1:20)
colnames(c) = c("info", 1:20)
dflist <- list(a,b,c)
str(lapply(dflist, function(a) a[1:3, 1:4]))
# List of 3
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 1.865 0.405 0.147
# ..$ 2 : num [1:3] 0.745 1.243 0.674
# ..$ 3 : num [1:3] 0.356 0.689 0.833
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 0.271 3.611 3.198
# ..$ 2 : num [1:3] 0.127 0.722 0.188
# ..$ 3 : num [1:3] 1.99 2.74 4.78
# $ :'data.frame': 3 obs. of 4 variables:
# ..$ info: num [1:3] 1 2 3
# ..$ 1 : num [1:3] 1.278 0.346 1.981
# ..$ 2 : num [1:3] 0.341 1.296 2.094
# ..$ 3 : num [1:3] 1.1159 3.05877 0.00506
Below dfnames is the names of the data frames in env, the global environment -- we have named it env in case you want to later change where they are located. Note that ls has a pattern= argument and if the data frame names have a distinct pattern then dfnames <- ls(pattern=whatever) could be used instead where whatever is a suitable regular expression.
Now define make_new which calls rollapplyr with a new mean function mean3 which returns the last value of its input if the input vector has a length less than 3 and mean otherwise. Then loop over the names using rollappyr with FUN=mean3 and partial=TRUE.
library(zoo)
env <- .GlobalEnv
dfnames <- Filter(function(x) is.data.frame(get(x, env)), ls(env))
# make_new - first version
mean3 <- function(x, k = 3) if (length(x) < k) tail(x, 1) else mean(x)
make_new <- function(df) replace(df, -1, rollapplyr(df[-1], 3, mean3, partial = TRUE))
for(nm in dfnames) env[[paste(nm, "new", sep = "_")]] <- make_new(get(nm, env))
Alternative version of make_new
An alternative to the first version of make_new shown above is the following second version. In the second version instead of defining mean3 we use just plain mean but specify a vector of widths w in rollapplyr such that w equals c(1, 1, 3, 3, ..., 3). Thus it takes the mean of just the last element for the first two input components and the mean of the 3 last elements for the rest. Note that now that we specify the widths explicitly we no longer need to specify partial= .
# make_new -- second version
make_new <- function(df) {
w <- replace(rep(3, nrow(df)), 1:2, 1)
replace(df, -1, rollapplyr(df[-1], w, mean))
}
Note
Normally when writing R and manpulating a set of objects one stores the objects in a list rather than leaving them loose in the global environment. We could create such a list L like this and then use lapply to create a second list L2 containing the new versions. Either version of make_new would work here.
L <- mget(dfnames, env)
L2 <- lapply(L, make_new)
I have a large file that looks like this
region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8
When I read it in using fread, suddenly 82365593523656436 is not found anymore
correlations <- data.frame(fread('all_to_all_correlations.txt'))
> "82365593523656436" %in% correlations$region
[1] FALSE
I can find a slightly different number
> "82365593523656432" %in% correlations$region
[1] TRUE
but this number is not in the actual file
grep 82365593523656432 all_to_all_correlations.txt
gives no results, while
grep 82365593523656436 all_to_all_correlations.txt
does.
When I try to read in the small sample file I showed above instead of the full file I get
Warning message:
In fread("test.txt") :
Some columns have been read as type 'integer64' but package bit64 isn't loaded.
Those columns will display as strange looking floating point data.
There is no need to reload the data.
Just require(bit64) toobtain the integer64 print method and print the data again.
and the data looks like
region type coeff p.value distance count
1 3.758823e-303 A -0.94940 0.050 -16479472 8
2 3.758823e-303 B 0.47303 0.526 57815363 8
3 3.758823e-303 C -0.89380 0.106 42848210 8
So I think during reading 82365593523656436 was changed into 82365593523656432. How can I prevent this from happening?
IDs (and that's apparently what the first column is) should usually be read as characters:
correlations <- setDF(fread('region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8',
colClasses = c(region = "character")))
str(correlations)
#'data.frame': 3 obs. of 6 variables:
# $ region : chr "82365593523656436" "82365593523656436" "82365593523656436"
# $ type : chr "A" "B" "C"
# $ coeff : num -0.949 0.473 -0.894
# $ p-value : num 0.05 0.526 0.106
# $ distance: num -16479473 57815363 42848211
# $ count : int 8 8 8
I have some output from the vegan function specaccum. It is a list of 8 objects of varying lengths;
> str(SPECIES)
List of 8
$ call : language specaccum(comm = PRETEND.DATA, method = "rarefaction")
$ method : chr "rarefaction"
$ sites : num [1:5] 1 2 3 4 5
$ richness : num [1:5] 20.9 34.5 42.8 47.4 50
$ sd : num [1:5] 1.51 2.02 1.87 1.35 0
$ perm : NULL
$ individuals: num [1:5] 25 50 75 100 125
$ freq : num [1:50] 1 2 3 2 4 3 3 3 4 2 ...
- attr(*, "class")= chr "specaccum"
I want to extract three of the lists ('richness', 'sd' and 'individuals') and convert them to columns in a data frame. I have developed a workaround;
SPECIES.rich <- data.frame(SPECIES[["richness"]])
SPECIES.sd <- data.frame(SPECIES[["sd"]])
SPECIES.individuals <- data.frame(SPECIES[["individuals"]])
SPECIES.df <- cbind(SPECIES.rich, SPECIES.sd, SPECIES.individuals)
But this seems clumsy and protracted. I wonder if anyone could suggest a neater solution? (Should I be looking at something with lapply??) Thanks!
Example data to generate the specaccum output;
Set.Seed(100)
PRETEND.DATA <- matrix(sample(0:1, 250, replace = TRUE), 5, 50)
library(vegan)
SPECIES <- specaccum(PRETEND.DATA, method = "rarefaction")
We can concatenate the names in a vector and extract it
SPECIES.df <- data.frame(SPECIES[c("richness", "sd", "individuals")])
Another alternative, similar to akrun, is:
ctoc1 = as.data.frame(cbind(SPECIES$richness, SPECIES$sd, SPECIES$individuals))
Please note that in both cases (my answer and akrun) you will get an error if the lengths of the columns do not match.
e.g.: SPECIES.df <- data.frame(SPECIES[c( "sd", "freq")])
Error in data.frame(richness = c(20.5549865665613, 33.5688503093388, 41.4708434700877, :
arguments imply differing number of rows:7, 47
If so, remember to use length() function :
length(SPECIES$sd) <- 47 # this will add NAs to increase the column length.
SPECIES.df <- data.frame(SPECIES[c("sd", "freq")])
SPECIES.df # dataframe with 2 columns and 7 rows.