I am trying to create an empty data frame where the data will be strings and with stringsAsFactors set to FALSE. It seems that when I do that, though, it does not remember the value of stringsAsFactors.
It works if I create a blank row, like this:
> df <- data.frame(a="", b="", stringsAsFactors=FALSE)
> new.row <- c("a", "z")
> df <- rbind(df, new.row)
> df
a b
1
2 a z
> df[2,1] <- "q"
> df
a b
1
2 q z
But, I want an empty data frame. When I do that, though, it treats the strings that I later add as factors:
> df2 <- data.frame(a=character(), b=character(), stringsAsFactors=FALSE)
> df2 <- rbind(df2, new.row)
> df2
X.a. X.z.
1 a z
> df2[2,1] <- "q"
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "q") :
invalid factor level, NA generated
How can I create the empty data frame without string factors?
rbind.data.frame first drops all zero-row and zero-column data frames, and then coerces the remaining arguments into data frames. This internal coercion uses the default value of stringsAsFactors (see the help for rbind, under "Data frame methods").
You can change this default by setting
options(stringsAsFactors=FALSE)
# now it works as you wish
str(rbind(df2,new.row))
# 'data.frame': 1 obs. of 2 variables:
# $ X.a.: chr "a"
# $ X.z.: chr "z"
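If changing a global option is undesirable, one alternative is to pass the new row as a one-row data frame with matching column names, so that no bare vector has to be coerced implicitly. A minimal sketch (the name new.df is only illustrative):

```r
# Empty data frame with two character columns
df2 <- data.frame(a = character(), b = character(), stringsAsFactors = FALSE)

# Build the new row as a data frame, so names and types are explicit
new.df <- data.frame(a = "a", b = "z", stringsAsFactors = FALSE)
df2 <- rbind(df2, new.df)

# Columns are still character, so a later assignment does not hit factor levels
df2[2, 1] <- "q"
str(df2)
```

This keeps the original column names a and b rather than the generated X.a. and X.z.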
I have been searching for an answer to this same problem and couldn't find anything, so I wrote my own function:
row.add <- function(x,newRow)
{
cn <- colnames(x)
x <- data.frame(lapply(x,as.character),stringsAsFactors = FALSE)
x <- rbind(x,newRow)
colnames(x) <- cn
return(x)
}
df <- data.frame("a"=character(),"b"=character())
df <- row.add(df,c("A","Z"))
df <- row.add(df,c("B","X"))
Hopefully someone searching for a similar answer will find this useful.
I would like to store a GenomicRanges::GRanges object from Bioconductor as a single column in a base R data.frame. The reason I'd like to have it in a base R data.frame is because I'd like to write some ggplot2 functions that exclusively work with data.frames under the hood. However, any attempts I made don't seem to be fruitful. Basically this is what I want to do:
library(GenomicRanges)
x <- GRanges(c("chr1:100-200", "chr1:200-300"))
df <- data.frame(x = x, y = 1:2)
But the column is automatically expanded, whereas I like to keep it as a valid GRanges object in a single column:
> df
x.seqnames x.start x.end x.width x.strand y
1 chr1 100 200 101 * 1
2 chr1 200 300 101 * 2
When I work with the S4Vectors::DataFrame, it works as I want, except I'd like a base R data.frame to do the same thing:
> S4Vectors::DataFrame(x = x, y = 1:2)
DataFrame with 2 rows and 2 columns
x y
<GRanges> <integer>
1 chr1:100-200 1
2 chr1:200-300 2
I also tried the following without success:
> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
y x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2 <NA>
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
df[["x"]] <- I(x)
Error in rep(value, length.out = nrows) :
attempt to replicate an object of type 'S4'
I had some minor success with implementing an S3 variant of the GRanges class using vctrs::new_rcrd, but that seems to be a very roundabout way to get a single column representing a genomic range.
I found a very simple way to convert a GRanges object to a data.frame so that you can operate on it easily.
The annoGR2DF function in the Repitools package can do so.
> library(GenomicRanges)
> library(Repitools)
>
> x <- GRanges(c("chr1:100-200", "chr1:200-300"))
>
> df <- annoGR2DF(x)
> df
chr start end width
1 chr1 100 200 101
2 chr1 200 300 101
> class(df)
[1] "data.frame"
A not pretty but practical solution is to use the accessor functions of GenomicRanges, then convert to the relevant data vector, i.e. numeric or character. I added magrittr, but you can also do it without it.
library(GenomicRanges)
library(magrittr)
x <- GRanges(c("chr1:100-200", "chr1:200-300"))
df <- data.frame(y = 1:2)
df$chr <- seqnames(x) %>% as.character
df$start <- start(x) %>% as.numeric
df$end <- end(x) %>% as.numeric
df$strand <- strand(x) %>% as.character
df$width <- width(x) %>% as.numeric
df
So since posting this question, I figured out that the crux of my problem seemed to be that just the format method of S4 objects is not playing nicely with the data.frames, and having GRanges as columns isn't necessarily a problem. (The construction of the data.frame still is though).
Consider this bit of the original question:
> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
y x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2
Warning message: In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, : corrupt data frame: columns will be truncated or padded with NAs
If we write a simple format method for GRanges, it will not throw an error:
library(GenomicRanges)
format.GRanges <- function(x, ...) {showAsCell(x)}
df <- data.frame(y = 1:3)
df$x <- GRanges(c("chr1:100-200", "chr1:200-300", "chr2:100-200"))
> df
y x
1 1 chr1:100-200
2 2 chr1:200-300
3 3 chr2:100-200
It seems to subset just fine too:
> df[c(1,3),]
y x
1 1 chr1:100-200
3 3 chr2:100-200
As a bonus, this seems to work for other S4 classes too, for example:
library(S4Vectors)
format.Rle <- function(x, ...) {showAsCell(x)}
x <- Rle(1:5, 5:1)
df <- data.frame(y = 1:15)
df$x <- x
I have a list of data frames that are pulled in from an Excel file. Some of the columns in the data frames are named 'NA', contain no data, and are useless; therefore, I would like to drop them. The list contains 9 data frames and most have columns with 'NA' as their title.
My attempts so far have returned an error or a warning, including:
all_list <- all_list[!is.na(colnames(all_list))]
Warning message:
In is.na(colnames(all_list)) :
is.na() applied to non-(list or vector) of type 'NULL'
The above did not serve its intended purpose, as the NA columns are still in each data frame.
all_list <- lapply(all_list, function(x){
colnames(x) <- x[!is.na(colnames(x))]
return(x)
})
This seems closer to the intended output, but reformats the data frame columns to be filled with NA's instead.
Here is a sample of my data showcasing the aforementioned NA's:
str(all_list)
List of 8
$ Retail :'data.frame': 305 obs. of 25 variables:
$ NA : chr [1:305] NA "Variable" "Variable" "Variable" ...
$ TIMEPERIOD : chr [1:305] NA "41640" "41671" "41699" ...
Edit: In case it wasn't clear, these blank columns filled with NA are the result of formatting within Excel for the sake of spacing; however, they serve no purpose for analysis within R.
You are pretty close to a solution. A slight change in the function used with lapply will give you the expected result.
lapply traverses each data frame, and your function needs to subset the columns whose names are not equal to "NA".
all_list <- lapply(all_list, function(x){
x[,colnames(x) != "NA"]
})
# Verify changed data all_list
all_list[[1]]
# col1 col2
# 1 g x
# 2 j z
# 3 n p
# 4 u o
# 5 e b
Data:
set.seed(1)
df1 <- data.frame(sample(letters, 5), sample(letters, 5), 1:5,
stringsAsFactors = FALSE)
names(df1) <- c("col1","col2","NA")
df2 <- data.frame(sample(letters, 5), sample(letters, 5), 11:15,
stringsAsFactors = FALSE)
names(df2) <- c("col1","col2","NA")
df3 <- data.frame(sample(letters, 5), sample(letters, 5), rep(NA, 5),
stringsAsFactors = FALSE)
names(df3) <- c("col1","col2","NA")
df4 <- data.frame(sample(letters, 5), sample(letters, 5), rep(NA, 5),
stringsAsFactors = FALSE)
names(df4) <- c("col1","col2","NA")
all_list <- list(df1,df2,df3,df4)
#check data
all_list[[1]]
# col1 col2 NA
#1 g x 1
#2 j z 2
#3 n p 3
#4 u o 4
#5 e b 5
# all_list[[2]], all_list[[3]] and all_list[[4]] contain similar values
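The subsetting idea above can be made slightly more defensive. A minimal sketch, assuming the unwanted columns may be named either with the string "NA" or with a real missing name; the helper drop_na_named is hypothetical, and drop = FALSE is added so the result stays a data frame even if only one column remains:

```r
# Hypothetical helper: drop columns whose name is missing or the string "NA"
drop_na_named <- function(x) {
  nms <- names(x)
  keep <- !is.na(nms) & nms != "NA"
  x[, keep, drop = FALSE]
}

# Small made-up list of data frames with an "NA" column
demo_list <- list(
  setNames(data.frame(c("g", "j"), c(NA, NA), stringsAsFactors = FALSE),
           c("col1", "NA")),
  setNames(data.frame(c("n", "p"), c("u", "o"), c(NA, NA),
                      stringsAsFactors = FALSE),
           c("col1", "col2", "NA"))
)

cleaned <- lapply(demo_list, drop_na_named)
lapply(cleaned, names)
```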
I have multiple .csv files that are in the same format, same column names etc.
I want to do some operations on the columns and then return the result after each iteration of the for loop. Here is some repeatable code:
df1 <- data.frame(x= (0:9), y= (10:19))
df2 <- data.frame(x= (20:29), y=(30:39))
listy <- list(df1, df2)
avg <- 0
filenames<- c("df1", "df2")
filenumbers<-seq(listy)
b <- 0
for(filenumber in filenumbers){ b <- b+1
allDM <- as.data.frame(filenames[filenumber],
header=TRUE)
allDM <- data.frame(
pred= filenames[filenumber]$x,
actual= filenames[filenumber]$y
)
allDM$pa <- allDM$pred-allDM$actual
avg <- mean(allDM$pa)
return(avg)
}
It is not happy using the $ function here.
Error is: Error in filenames[filenumber]$x :
$ operator is invalid for atomic vectors
Cheers,
filenames[filenumber]
is simply an (atomic) character object, i.e.
[1] "df1"
or
[1] "df2"
thus it wouldn't make sense to use $ on it.
You can fix this by using get():
for(filenumber in filenumbers){
b <- b + 1
allDM <- as.data.frame(filenames[filenumber],
header=TRUE)
tmp <- get(filenames[filenumber])
allDM <- data.frame(
pred = tmp$x,
actual = tmp$y
)
allDM$pa <- allDM$pred-allDM$actual
avg <- mean(allDM$pa)
}
Note that I also took out return(avg) because this is not a function so you can't use return(), but you have no need to anyway. avg still gets created.
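Since the data frames are already stored in listy, another option is to skip the name lookup and apply the calculation to each list element directly. This also keeps one result per data frame instead of overwriting avg on every pass. A minimal sketch under that assumption (the function name mean_diff is hypothetical):

```r
df1 <- data.frame(x = 0:9, y = 10:19)
df2 <- data.frame(x = 20:29, y = 30:39)
listy <- list(df1 = df1, df2 = df2)

# Mean of (pred - actual) for one data frame
mean_diff <- function(d) mean(d$x - d$y)

# One value per data frame, named after the list elements
avgs <- sapply(listy, mean_diff)
avgs
```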
I've some questions related to the behaviour/properties of the different classes.
When trying to create a data frame with a column of class character it creates a data frame with a factor.
df1 <- data.frame(var1= character())
str(df1)
Which is the same as
df2 <- data.frame(var1= factor())
str(df2)
Why isn't the class in the first case chr?
When trying to add a time variable combined with, for instance, a character variable, an error occurs.
This works:
df3 <- data.frame( var1=as.POSIXct(0,origin="2012-12-31"))
str(df3)
This doesn't:
df4 <- data.frame(var1= character(0),var2=as.POSIXct(0,origin="2012-12-31"))
str(df4)
But these do:
df4.1 <- data.frame(var1= character(1),var2=as.POSIXct(0,origin="2012-12-31"))
str(df4.1)
df4.2 <- data.frame(var1= factor(0),var2=as.POSIXct(0,origin="2012-12-31"))
str(df4.2)
It seems that the behaviour is related to the absence of a level or format (which are present with factor or date classes) with character, numeric and integer classes.
For your first question, stringsAsFactors = TRUE is the default when creating a data.frame in R versions before 4.0.0 (from 4.0.0 on, the default is FALSE). Changing it gets the result you expect.
> df1a <- data.frame(var1= character())
> str(df1a)
'data.frame': 0 obs. of 1 variable:
$ var1: Factor w/ 0 levels:
> df1b <- data.frame(var1= character(), stringsAsFactors=FALSE)
> str(df1b)
'data.frame': 0 obs. of 1 variable:
$ var1: chr
For your second one, character(0) and factor(0) are different things. character() is the same as character(0), but factor() is not the same as factor(0).
Try this:
> a <- character()
> b <- character(0)
> A <- factor()
> B <- factor(0)
> sapply(list(a=a, b=b, A=A, B=B), length)
a b A B
0 0 0 1
Specifically, from ?character, usage is in the form of:
character(length = 0) ## Just the one argument
while from ?factor, usage is in the form of:
factor(x = character(), levels, labels = levels,
exclude = NA, ordered = is.ordered(x))
where the first item is the values that you are going to use to create your factors.
Read help(data.frame).
df1: This behaviour is controlled by the stringsAsFactors parameter.
df4: var1 is of length 0, var2 of length 1. All columns in a data.frame must have the same length. Normally, the shorter vector would be recycled, but that's not possible with a vector of length 0.
df4.2: factor(0) does not return a factor variable of length 0, but a factor with value 0. So both columns are of equal length.
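The length rule described above also suggests how to build the zero-row data frame that df4 was aiming for: give every column length 0. A minimal sketch, assuming the time column should be POSIXct (the name df5 is illustrative and not from the question):

```r
# Every column has length 0, so nothing needs to be recycled
df5 <- data.frame(
  var1 = character(0),
  var2 = as.POSIXct(numeric(0), origin = "2012-12-31"),
  stringsAsFactors = FALSE
)
str(df5)

# factor(0) is a length-one factor holding the value 0, not an empty factor
length(factor(0))   # 1
length(factor())    # 0
```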
If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
X.A. X.B. X.C.
1 A B C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
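Another way to sidestep the naming issue is to collect the rows first and build the data frame once, setting the column names explicitly at the end. A minimal sketch, with made-up column names:

```r
# Collect rows as plain character vectors
rows <- list(c("A", "B", "C"), c("P", "D", "Q"))

# Stack them into a character matrix, then convert once
m <- do.call(rbind, rows)
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Name the columns explicitly instead of relying on generated names
names(df) <- c("first", "second", "third")
df
```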
Set the argument stringsAsFactors to FALSE, which stores the values as characters:
df=data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors=FALSE)
rbind(df,c('Horse','Dog','Cat'))
first second third
1 A B C
2 Horse Dog Cat
sapply(df,class)
first second third
"character" "character" "character"
Later, if you want to use factors, you could convert it like this:
df2 = as.data.frame(unclass(df), stringsAsFactors=TRUE)
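A column-wise conversion is another option when only the character columns should become factors. A minimal sketch (df_fac is an illustrative name):

```r
df <- data.frame(first = "A", second = "B", third = "C", stringsAsFactors = FALSE)
df <- rbind(df, c("Horse", "Dog", "Cat"))

# Convert only the character columns, leaving the original df untouched
df_fac <- df
is_chr <- vapply(df_fac, is.character, logical(1))
df_fac[is_chr] <- lapply(df_fac[is_chr], factor)

sapply(df_fac, class)
```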