GRanges as column in base::data.frame


I would like to store a GenomicRanges::GRanges object from Bioconductor as a single column in a base R data.frame. I want a base R data.frame because I'd like to write some ggplot2 functions that work exclusively with data.frames under the hood. However, none of my attempts have worked. Basically, this is what I want to do:
library(GenomicRanges)
x <- GRanges(c("chr1:100-200", "chr1:200-300"))
df <- data.frame(x = x, y = 1:2)
But the column is automatically expanded, whereas I would like to keep it as a valid GRanges object in a single column:
> df
x.seqnames x.start x.end x.width x.strand y
1 chr1 100 200 101 * 1
2 chr1 200 300 101 * 2
When I work with the S4Vectors::DataFrame, it works as I want, except I'd like a base R data.frame to do the same thing:
> S4Vectors::DataFrame(x = x, y = 1:2)
DataFrame with 2 rows and 2 columns
x y
<GRanges> <integer>
1 chr1:100-200 1
2 chr1:200-300 2
I also tried the following without success:
> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
y x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2 <NA>
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
df[["x"]] <- I(x)
Error in rep(value, length.out = nrows) :
attempt to replicate an object of type 'S4'
I had some minor success with implementing an S3 variant of the GRanges class using vctrs::new_rcrd, but that seems to be a very roundabout way to get a single column representing a genomic range.

I found a very simple way to convert a GRanges object to a data.frame so that you can operate on it easily.
The annoGR2DF function in the Repitools package can do so.
> library(GenomicRanges)
> library(Repitools)
>
> x <- GRanges(c("chr1:100-200", "chr1:200-300"))
>
> df <- annoGR2DF(x)
> df
chr start end width
1 chr1 100 200 101
2 chr1 200 300 101
> class(df)
[1] "data.frame"

A practical, if not pretty, solution is to use the accessor functions of GenomicRanges and convert each result to a plain vector, i.e. numeric or character. I used magrittr here, but you can also do it without it.
library(GenomicRanges)
library(magrittr)
x <- GRanges(c("chr1:100-200", "chr1:200-300"))
df <- data.frame(y = 1:2)
df$chr <- seqnames(x) %>% as.character
df$start <- start(x) %>% as.numeric
df$end <- end(x) %>% as.numeric
df$strand <- strand(x) %>% as.character
df$width <- width(x) %>% as.numeric
df

Since posting this question, I figured out that the crux of my problem is that the format method for S4 objects does not play nicely with data.frames; having GRanges as a column isn't necessarily a problem. (Constructing the data.frame with data.frame() still is, though.)
Consider this bit of the original question:
> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
y x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
If we write a simple format method for GRanges, the data.frame prints without that warning:
library(GenomicRanges)
format.GRanges <- function(x, ...) {showAsCell(x)}
df <- data.frame(y = 1:3)
df$x <- GRanges(c("chr1:100-200", "chr1:200-300", "chr2:100-200"))
> df
y x
1 1 chr1:100-200
2 2 chr1:200-300
3 3 chr2:100-200
It seems to subset just fine too:
> df[c(1,3),]
y x
1 1 chr1:100-200
3 3 chr2:100-200
As a bonus, this seems to work for other S4 classes too, for example:
library(S4Vectors)
format.Rle <- function(x, ...) {showAsCell(x)}
x <- Rle(1:5, 5:1)
df <- data.frame(y = 1:15)
df$x <- x

Related

selecting columns from a set of names with dplyr

I'm attempting to make subsets of a large data frame based on whether the column names are in an externally defined set. So I'm starting with something like:
> x <- c(1,2,3)
> y <- c("a","b","c")
> z <- c(4,5,6)
>
> df <- data.frame(x=x,y=y,z=z)
> df
x y z
1 1 a 4
2 2 b 5
3 3 c 6
chosen_columns <- c(x,y)
And I'm attempting to use this much to end up with:
x y
1 1 a
2 2 b
3 3 c
It seems like using select() from dplyr should be able to handle this perfectly, but I'm not sure what the arguments would be to get that. Something like:
df_chosen <- df %>%
select(is.element(___,chosen_columns))
I'm just not sure what would go in the ___ there.
Thank you!

c(x, y) is not a vector of two columns: it combines your objects x and y into a single character vector, c("1", "2", "3", "a", "b", "c").
You may want to create a vector of column names and then pass it directly to select():
library(dplyr)
chosen_columns <- c("x", "y")
df |> select(all_of(chosen_columns))
(Thank you, Gregor Thomas, for the advice to wrap column names in all_of()).
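
If the external set may contain names that are not in the data frame, any_of() is a possible alternative to all_of(). The sketch below assumes a hypothetical extra name "w" that is not a column of df, and also shows a base R equivalent for comparison.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"), z = c(4, 5, 6))

# "w" is a hypothetical name that does not exist in df.
chosen_columns <- c("x", "y", "w")

# any_of() silently skips names that are not present;
# all_of(chosen_columns) would raise an error here because "w" is missing.
df_chosen <- df |> select(any_of(chosen_columns))

# Base R equivalent without dplyr: keep only the names that actually exist.
df_base <- df[, intersect(chosen_columns, names(df)), drop = FALSE]
```

Which helper to use depends on whether a missing column should be treated as an error (all_of) or ignored (any_of).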

in r convert a column of a data frame to a vector without "unlist" and with a dynamic name

I would like to convert a data frame column to a vector, where the name is dynamic.
All I know is that I want the first column.
Now, I can do this with unlist, but it is about two orders of magnitude slower than accessing by name:
df = tibble::tibble(x = 3, y = 4) # data_frame() is deprecated in favour of tibble()
microbenchmark::microbenchmark({df$x}) #less than 1 microsecond
microbenchmark::microbenchmark({unlist(df[,1])}) #about 15 microseconds!
Is there a more efficient way than unlist if I don't know the name of the column in advance?

The reason is that df[,1] is still a tibble with one column.
str(df[,1])
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 1 variable:
# $ x: num 3
We need df[[1]] to extract the column as a vector. The original code does the work in two steps: [,1] and then unlist.
Also, if we profile the code, we can see that the unlist step takes more memory and time:
library(profvis)
df <- tibble(x = 1:1e7, y = 1:1e7)
profvis({
df1 <- df[,1]
unlist(df1)
})
profvis({
df1 <- df %>%
select(x)
unlist(df1)
})
and check with
profvis({
df %>%
pull(x)
})
or
profvis(df$x)
NOTE: These run so fast that they complete before profvis can sample them, which results in an error.

It's worth also noting that data.frame and tibble differ in how they preserve the dimensions of the object. If we were to define a data frame and subset on a single column, it would return a vector:
df <- data.frame(x = 3, y = 4)
df[,1]
#[1] 3
While a tibble does not simplify by default:
df <- tibble(x = 3, y = 4)
df[,1]
# A tibble: 1 x 1
# x
# <dbl>
# 1 3
If we want a tibble to simplify, we can either use [[ subsetting to extract a single column, which does simplify into a vector, or we can specify drop = TRUE:
df <- tibble(x = 3, y = 4)
df[,1, drop = TRUE]
# [1] 3
df[[1]]
# [1] 3
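
For the dynamic-name case in the question, a minimal sketch of [[ with a name held in a variable is shown below. The variable names col, by_name and by_position are illustrative only.

```r
df <- data.frame(x = 3, y = 4)

# The column name is only known at run time, e.g. "whatever the first column is".
col <- names(df)[1]

# [[ accepts either a name or a position and returns the bare vector,
# for both data.frame and tibble.
by_name <- df[[col]]
by_position <- df[[1]]
```

Because [[ never returns a one-column data frame, no unlist() step is needed afterwards.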

How to coerce a character column to a list column

I am trying to bind data frames rows. I generate some data frame with list columns after aggregation but some are character. I can't find a way to bind them. I tried converting the character column using as.list() but that didn't work.
library(dplyr)
df1 <- data.frame(a = c(1,2,3),stringsAsFactors = F)
df1$b <- list(c("1","2"),"4",c("5","6"))
> df1
a b
1 1 1, 2
2 2 4
3 3 5, 6
df2 <- data.frame(a=c(4,5),b=c("9","12"),stringsAsFactors = F)
> df2
a b
1 4 9
2 5 12
dplyr::bind_rows(df2,df1)
Error in bind_rows_(x, .id) :
Column `b` can't be converted from character to list

I don't know the dplyr library well, but using base R's rbind() below seems to be working:
df1 <- data.frame(a = c(1,2,3),stringsAsFactors = F)
df1$b <- list(c("1","2"),"4",c("5","6"))
df2 <- data.frame(a=c(4,5),b=c("9","12"),stringsAsFactors = F)
result <- rbind(df1, df2)
class(result$a)
[1] "numeric"
class(result$b)
[1] "list"
If you wanted to get this working with bind_rows(), start by looking at the error message. It looks like dplyr doesn't like that one data frame has character data while the other has list data. You could try converting the character column to list and then call bind_rows, e.g.
df2$b <- as.list(df2$b)
dplyr::bind_rows(df2,df1)
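
Put together with the data from the question, the conversion might look like the sketch below. It assumes the result of as.list() is assigned back to the column; the name combined is illustrative.

```r
library(dplyr)

df1 <- data.frame(a = c(1, 2, 3))
df1$b <- list(c("1", "2"), "4", c("5", "6"))

df2 <- data.frame(a = c(4, 5), b = c("9", "12"))

# as.list() turns each string into its own list element, so b becomes
# a list column with one length-1 character vector per row.
df2$b <- as.list(df2$b)

# Both b columns are now lists, so the rows can be bound.
combined <- bind_rows(df2, df1)
```

Note that calling as.list(df2$b) without assigning the result back leaves df2 unchanged, which may be one reason an earlier attempt appeared not to work.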

How to subset a large data frame through FOR loops and print the desired result?

I have a data frame that looks something like this:
x y
1 a
1 b
1 c
1 NA
1 NA
2 d
2 e
2 NA
2 NA
My desired output is a data frame that shows the number of complete cases of Y (that is, the non-NA values) for each X. So, supposing Y has 2500 complete observations for X = 1 and 557 observations for X = 2, I should get this simple data frame:
x y(c.cases)
1 2500
2 557
Currently my function works for a single X, but when I pass a range of X values (e.g. 30:25) I get the sum over all the specified Xs instead of the number of complete observations for each X. This is an outline of my function:
complete <- function(){
files <- file.list()
dat<- c() #Creates an empty vector
Y <- c() #Empty vector that will list down the Ys
result <- c()
for(i in c(X)){
dat <- rbind(dat, read.csv(files[i]))
}
dat_subset_Y <- dat[which(dat[, 'X'] %in% x), ]
Y <- c(Y, sum(complete.cases(dat)))
result <- cbind(X, Y)
print(result)
}
There are no errors or warning messages, only wrong results when X is a range.

We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and then, grouped by 'x', take the sum of the non-NA elements (sum(!is.na(y))).
library(data.table)
setDT(df1)[, list(y=sum(!is.na(y))), by = x]
Another option is table:
with(df1, table(x, !is.na(y)))

There is no need for that loop.
library(dplyr)
df %>%
filter(complete.cases(.))%>%
group_by(x) %>%
summarise(sumy=length(y))
Or
df %>%
group_by(x) %>%
summarise(sumy=sum(!is.na(y)))
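
The same per-group count can also be sketched in base R with tapply(), using the small example data from the question. The names counts, result and complete are illustrative.

```r
df <- data.frame(
  x = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  y = c("a", "b", "c", NA, NA, "d", "e", NA, NA)
)

# !is.na(y) is TRUE for complete values; summing it per group of x counts them.
counts <- tapply(!is.na(df$y), df$x, sum)

# Reshape the named result into the two-column layout asked for.
result <- data.frame(
  x = as.numeric(names(counts)),
  complete = as.vector(counts)
)
```

The key point in all variants is that the count is computed within each group of x rather than once over the combined data, which is what the loop in the question ends up doing.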

Creating empty data frame with stringsAsFactors = FALSE

I am trying to create an empty data frame where the data will be strings and with stringsAsFactors set to FALSE. It seems that when I do that, though, it does not remember the value of stringsAsFactors.
It works if I create a blank row, like this:
> df <- data.frame(a="", b="", stringsAsFactors=FALSE)
> new.row <- c("a", "z")
> df <- rbind(df, new.row)
> df
a b
1
2 a z
> df[2,1] <- "q"
> df
a b
1
2 q z
But, I want an empty data frame. When I do that, though, it treats the strings that I later add as factors:
> df2 <- data.frame(a=character(), b=character(), stringsAsFactors=FALSE)
> df2 <- rbind(df2, new.row)
> df2
X.a. X.z.
1 a z
> df2[2,1] <- "q"
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "q") :
invalid factor level, NA generated
How can I create the empty data frame without string factors?

rbind.data.frame first drops all zero-row and zero-column data.frames, and then coerces the remaining arguments into data.frames. This internal coercion uses the default value for stringsAsFactors. (See the help for rbind, under data frame methods.)
You can set this value by setting
options(stringsAsFactors=FALSE)
# now it works as you wish
str(rbind(df2,new.row))
# 'data.frame': 1 obs. of 2 variables:
# $ X.a.: chr "a"
# $ X.z.: chr "z"
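
An alternative that avoids changing a global option is to pass the new row as a one-row data.frame with matching column names, so that rbind has no bare vector to coerce. This is a minimal sketch under that assumption; note also that since R 4.0.0 the default for stringsAsFactors is FALSE.

```r
df2 <- data.frame(a = character(), b = character(), stringsAsFactors = FALSE)

# Build the new row as a data.frame so column names and types are explicit.
new.row <- data.frame(a = "a", b = "z", stringsAsFactors = FALSE)

df2 <- rbind(df2, new.row)

# Columns are character, so assigning a new string does not hit factor levels.
df2[2, 1] <- "q"
```

This also keeps the column names a and b instead of the generated X.a. and X.z. seen above.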

I have been searching for an answer to this same problem and couldn't find anything, so I wrote my own function:
row.add <- function(x,newRow)
{
cn <- colnames(x)
x <- data.frame(lapply(x,as.character),stringsAsFactors = FALSE)
x <- rbind(x,newRow)
colnames(x) <- cn
return(x)
}
df <- data.frame("a"=character(),"b"=character())
df <- row.add(df,c("A","Z"))
df <- row.add(df,c("B","X"))
Hopefully someone searching for a similar answer will find this useful.
