Caveat: novice. I have several data.tables with millions of rows each; the variables are mostly dates and factors. I was using rbindlist() to combine them. Yesterday, after breaking up the tables into smaller pieces vertically (instead of the current horizontal splicing), I was trying to understand rbind better (especially with fill = TRUE). I also tried bind_rows() and then tried to verify the results, but identical() returned FALSE.
library(data.table)
library(dplyr)
DT1 <- data.table(a=1, b=2)
DT2 <- data.table(a=4, b=3)
DT_bindrows <- bind_rows(DT1,DT2)
DT_rbind <- rbind(DT1,DT2)
identical(DT_bindrows,DT_rbind)
# [1] FALSE
Visually inspecting the results from bind_rows() and rbind() suggests they are indeed identical. I read this and this (from where I adapted the example). My question: (a) what am I missing, and (b) if the number, names, and order of my columns are the same, should I be concerned that identical() returns FALSE?
identical() also compares attributes, and these differ between the two objects. all.equal() has an option, check.attributes, that skips the attribute comparison:
all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE
Comparing the str() output of the two objects makes the difference clear: only the rbind() result carries the .internal.selfref attribute.
str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute
After setting that attribute to NULL, identical() returns TRUE:
attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE
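The same effect can be reproduced without data.table. The following is a minimal sketch in base R, where the attribute name "note" is made up purely for illustration: two vectors with equal values stop being identical() as soon as one of them carries an extra attribute, while all.equal() with check.attributes = FALSE still reports them as equal.

```r
# Two vectors with the same values and names
x <- c(a = 1, b = 2)
y <- x

# Attach an extra attribute to y only ("note" is a hypothetical name)
attr(y, "note") <- "extra"

same_with_attr <- identical(x, y)                                # FALSE
equal_values   <- isTRUE(all.equal(x, y, check.attributes = FALSE))  # TRUE

# Removing the attribute makes the objects identical again
attr(y, "note") <- NULL
same_without_attr <- identical(x, y)                             # TRUE
```

For question (b), this suggests that a FALSE from identical() caused only by an attribute such as .internal.selfref does not by itself mean the data differ.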
Here is something I do not understand about data.table.
If I select a row and try to set all of its values to NA, the resulting one-row data.table is coerced to logical.
#Here is a sample table
DT <- data.table(a=rep(1L,3),b=rep(1.1,3),d=rep('aa',3))
DT
# a b d
# 1: 1 1.1 aa
# 2: 1 1.1 aa
# 3: 1 1.1 aa
#Here I extract a line, all the column types are kept... good
str(DT[1])
# Classes ‘data.table’ and 'data.frame': 1 obs. of 3 variables:
# $ a: int 1
# $ b: num 1.1
# $ d: chr "aa"
# - attr(*, ".internal.selfref")=<externalptr>
#Now here I want to set them all to `NA`...they all become logicals => WHY IS THAT ?
str(DT[1][,colnames(DT) := NA])
# Classes ‘data.table’ and 'data.frame': 1 obs. of 3 variables:
# $ a: logi NA
# $ b: logi NA
# $ d: logi NA
# - attr(*, ".internal.selfref")=<externalptr>
EDIT: I think it is a bug as
str(DT[1][ , a := NA])
# Classes ‘data.table’ and 'data.frame': 1 obs. of 3 variables:
# $ a: logi NA
# $ b: num 1.1
# $ d: chr "aa"
# - attr(*, ".internal.selfref")=<externalptr>
str(DT[1:2][ , a := NA])
# Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
# $ a: int NA NA
# $ b: num 1.1 1.1
# $ d: chr "aa" "aa"
# - attr(*, ".internal.selfref")=<externalptr>
To provide an answer, from ?":=" :
Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.
The motivation for all this is large tables (say 10GB in RAM), of course. Not 1 or 2 row tables.
To put it more simply: if length(RHS) == nrow(DT) then the RHS (and whatever its type) is plonked into that column slot. Even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place, but the RHS is coerced and recycled to replace the (subset of) items in that column.
If I need to change a column's type in a large table I write:
DT[, col := as.numeric(col)]
Here as.numeric allocates a new vector, coerces "col" into that new memory, and the result is then plonked into the column slot. It's as efficient as it can be. The reason this is a plonk is that length(RHS) == nrow(DT).
If you want to overwrite a column with a different type containing some default value:
DT[, col := rep(21.5, nrow(DT))] # i.e., deliberately harder
If "col" was type integer before, then it'll change to type numeric containing 21.5 for every row. Otherwise just DT[, col := 21.5] would result in a warning about 21.5 being coerced to 21 (unless DT is only 1 row!)
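The two cases described above can be put side by side. This is a small sketch, assuming a three-row table with an integer column named col (the table and column names are illustrative only): a full-length RHS replaces the column and its type, while a length-one RHS is coerced to the existing column type.

```r
library(data.table)

# Full-length RHS: the new vector is plonked in, so the type changes
DT <- data.table(col = 1:3)
DT[, col := as.numeric(col)]
class(DT$col)   # "numeric"

# Length-1 RHS on a multi-row table: the integer column is kept
# and NA is coerced to integer, as in the DT[1:2] output shown earlier
DT2 <- data.table(col = 1:3)
DT2[, col := NA]
class(DT2$col)  # "integer"
```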
I am new to R. I have created two vectors 'a' and 'b' in R. Both of them contain the same elements, but in a different order. The details of the two vectors are as follows.
> str(a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 369 obs. of 1 variable:
$ SKD_DOCUMENT_NO: chr "A0000514011" "A0000514012" "A0000514013" "A0000514014" ...
> str(b)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 369 obs. of 1 variable:
$ SKD_DOCUMENT_NO: chr "A0000648001" "A0000648050" "A0000648049" "A0000648048" ...
But when I check whether an element is in the vectors, I get confusing answers from R.
>'A0000648050' %in% a #[1] FALSE
>"A0000648050" %in% a #[1] FALSE
But when I use other methods to check whether the element is in 'a', I get the following results:
> any(a == "A0000648050") #[1] TRUE
> which(a == "A0000648050") #[1] 115
> grep("A0000648050", a) #[1] 1
Q1. What I don't understand is why %in% is failing.
Q2. What is the easiest way to find out whether all elements of vector 'a' are present in vector 'b'? (All elements of 'a' are indeed present in 'b', but I would like R to confirm it.) Why do the following two lines give different results?
> a %in% b #[1] FALSE
> setequal(a,b) # TRUE
%in%
From ?'%in%' :
%in% is currently defined as "%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
Factors, raw vectors and lists are converted to character vectors, and then x and table are coerced to a common type (the later of the two types in R's ordering, logical < integer < numeric < complex < character) before matching. If incomparables has positive length it is coerced to the common type.
In your case a is a tibble, which is a data.frame, which is a list, so it is converted to character before the comparison takes place. Each column becomes a single deparsed string:
a <- tibble(SKD_DOCUMENT_NO =c("A0000514011","A0000514012","A0000514013","A0000514014"))
as.character(a)
# [1] "c(\"A0000514011\", \"A0000514012\", \"A0000514013\", \"A0000514014\")"
This, though it's not intuitive, will return TRUE:
"c(\"A0000514011\", \"A0000514012\", \"A0000514013\", \"A0000514014\")" %in% a
any
from ?any, on the ... argument:
Other objects of zero length are ignored, and the rest are coerced to
logical ignoring any class.
a == "A0000514012"
# SKD_DOCUMENT_NO
# [1,] FALSE
# [2,] TRUE
# [3,] FALSE
# [4,] FALSE
The following happens when we coerce it to logical:
as.logical(a == "A0000514012")
# [1] FALSE TRUE FALSE FALSE
so the output you get with any(a == "A0000514012") makes sense.
The same exercise can be done with which or grep
solution
The solution is to use either:
"A0000514012" %in% a$SKD_DOCUMENT_NO # to look into a precise column
or
"A0000514012" %in% unlist(a) # to look into all columns, equivalent to your solution with `any`
or
sapply(a,`%in%`,x = "A0000514012") # to look into individual columns separately
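The difference between matching against the whole object and matching against its column can be checked on a small example. This is a sketch that uses plain data frames as stand-ins for the tibbles (an assumption made to keep it free of package dependencies) and shortened, made-up values. It also shows one way to approach Q2 by comparing the columns directly.

```r
# Stand-ins for the two one-column tibbles (values are illustrative)
a <- data.frame(SKD_DOCUMENT_NO = c("A1", "A2", "A3"), stringsAsFactors = FALSE)
b <- data.frame(SKD_DOCUMENT_NO = c("A3", "A1", "A2"), stringsAsFactors = FALSE)

# Matching against the data frame compares with one deparsed string per column
in_frame  <- "A2" %in% a                   # FALSE
# Matching against the column compares with the actual elements
in_column <- "A2" %in% a$SKD_DOCUMENT_NO   # TRUE

# Q2: are all elements of a's column present in b's column?
all_present <- all(a$SKD_DOCUMENT_NO %in% b$SKD_DOCUMENT_NO)   # TRUE
same_set    <- setequal(a$SKD_DOCUMENT_NO, b$SKD_DOCUMENT_NO)  # TRUE
```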
I can't figure this out.
library(dplyr)
dat <- data.frame(a = 1:5,b = rep(TRUE,5))
# this doesn't work
dat %>% all(.$b) # tricky
# this doesn't work
dat %>% all(b) #
# this does
dat %>% .$b %>% all
I find it confusing that all(.$b) doesn't work. That doesn't seem intuitive to me at all.
Well, the %>% operator is borrowed from the magrittr package which defines the following rules:
By default the left-hand side (LHS) will be piped in as the first argument of the function appearing on the right-hand side (RHS).
When the LHS is needed at a position other than the first, one can use the dot,'.', as placeholder.
You can see that the whole data frame is still being passed in as the first parameter with this example
f<-function(...) str(list(...))
dat %>% f(.$b)
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 1 2 3 4 5
# ..$ b: logi [1:5] TRUE TRUE TRUE TRUE TRUE
# $ : logi [1:5] TRUE TRUE TRUE TRUE TRUE
So the function receives two arguments: the data.frame and the vector. magrittr only stops inserting the LHS as the first argument when the dot appears as a direct argument of the RHS call. Here the dot is nested inside the expression .$b, so the whole object is still passed along as the first argument.
It just so happens that the magrittr package has a different operator for use in cases like this. You can use %$%.
library(magrittr)
dat %$% all(b)
# [1] TRUE
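Two further ways of getting only the column into all() are sketched below, assuming the same dat as in the question. Wrapping the RHS in braces tells magrittr not to insert the LHS as the first argument, and with() evaluates the expression inside the data frame.

```r
library(magrittr)

dat <- data.frame(a = 1:5, b = rep(TRUE, 5))

# Braces suppress first-argument insertion, so only .$b reaches all()
res_braces <- dat %>% { all(.$b) }   # TRUE

# with() receives dat as its first argument and evaluates all(b) inside it
res_with <- dat %>% with(all(b))     # TRUE
```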
Say I have a list below
> str(lll)
List of 2
$ :List of 3
..$ Name : chr "Sghokbt"
..$ Title: NULL
..$ Value: int 7
$ :List of 3
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
How can I convert this list to a data frame as below?
> df
Name Title Value
1 Sghokbt <NA> 7
2 Sgnglio Mr 5
as.data.frame doesn't work, I suspect due to the NULL in the first list element. EDIT: I have also tried do.call(rbind, list) as suggested in another question, but the result is a matrix of lists, not a data frame.
To reproduce the list:
list(structure(list(Name = "Sghokbt", Title = NULL, Value = 7L), .Names = c("Name",
"Title", "Value")), structure(list(Name = "Sgnglio", Title = "Mr",
Value = 5), .Names = c("Name", "Title", "Value")))
I think I've found a solution myself.
My approach is to first convert all the sub-lists into dataframes, so I have a list of dataframes instead of list of lists. These dataframes will just drop the NULL variables.
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
The resultant list of dataframes:
> str(ldf)
List of 2
$ :'data.frame': 1 obs. of 2 variables:
..$ Name : chr "Sghokbt"
..$ Value: int 7
$ :'data.frame': 1 obs. of 3 variables:
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
From here I get a little help from plyr.
require(plyr)
df <- ldply(ldf)
The result has the columns out of order, but I'm happy enough with it.
> str(df)
'data.frame': 2 obs. of 3 variables:
$ Name : chr "Sghokbt" "Sgnglio"
$ Value: num 7 5
$ Title: chr NA "Mr"
I won't accept this as an answer yet for now in case there is a better solution.
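A variation on this idea that needs only base R is sketched below. Instead of dropping the NULL entries, it replaces them with NA so that every one-row data frame has the same columns and rbind can stack them in the original column order. It assumes that all sub-lists share the same names, and the helper variable names are illustrative.

```r
lll <- list(
  list(Name = "Sghokbt", Title = NULL, Value = 7L),
  list(Name = "Sgnglio", Title = "Mr", Value = 5)
)

rows <- lapply(lll, function(x) {
  # Replace NULL entries with NA so the column survives the conversion
  is_null <- vapply(x, is.null, logical(1))
  x[is_null] <- list(NA)
  as.data.frame(x, stringsAsFactors = FALSE)
})

# Stack the one-row data frames; columns are matched by name
df <- do.call(rbind, rows)
```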
Tidyverse solution
Here's a solution with the tidyverse which might be more readable or at least more intuitive to read for those who are familiar with dplyr and purrr.
library(dplyr)
library(purrr)
library(tibble)
lll %>%
# apply to the whole list, and then convert into a tibble
map_df(~
# convert every list element to a char vector
as.character(.x) %>%
# convert the char vector to a tibble row
as_tibble_row(.name_repair = "unique")) %>%
# convert all "NULL" entries to NA, column by column
# (since dplyr 1.1.0, na_if() no longer accepts a whole data frame with a scalar)
mutate(across(everything(), ~ na_if(.x, "NULL"))) %>%
# set tibble names assuming all list entries contain the same names
set_names(lll[[1]] %>% names())
There are several tricks to note:
map_df cannot merge bare character vectors into a data frame, so each one is first converted into a one-row tibble with as_tibble_row(). In theory you could name the vectors instead, but the result of as.character() has no names attribute, so that would need an extra conversion into a named vector.
as_tibble_row() needs a .name_repair argument so that it accepts the unnamed vectors and map_df can then merge the resulting rows.
I'm truly grateful for the dplyr::na_if() function, and you should be too!
lll[[1]] %>% names() is just one way to get the names of the first list entry. It assumes the other list entries have the same names in the same order, which you should check beforehand.
Details:
when you use na_if(), you so elegantly replace this code by Ricky (which is totally fine but hard to remember):
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
data.frame(do.call(rbind, lll))
Name Title Value
1 Sghokbt NULL 7
2 Sgnglio Mr 5
do.call is useful because it takes the arguments as a list. Here it calls rbind, which combines the sub-lists observation by observation, and data.frame structures the output as needed. The weakness is that, because data frames also accept lists, the columns of the new object remain lists, which makes calculations on the elements difficult. Below is another option, which is also potentially problematic.
By removing the NULL value first:
null.remove <- function(lst) {
lapply(lst, function(x) {x <- paste(x, ""); x})
}
newlist <- lapply(lll, null.remove)
asvec <- unlist(newlist)
col.length <- length(newlist[[1]])
data.frame(rbind(asvec[1:col.length],
asvec[(col.length+1):length(asvec)]))
Name Title Value
1 Sghokbt 7
2 Sgnglio Mr 5
'data.frame': 2 obs. of 3 variables:
$ Name : Factor w/ 2 levels "Sghokbt ","Sgnglio ": 1 2
$ Title: Factor w/ 2 levels " ","Mr ": 1 2
$ Value: Factor w/ 2 levels "5 ","7 ": 2 1
This method coerces a value onto the NULL elements in the list by pasting a space onto the existing object. Next unlist allows the list elements to be treated as a named vector. col.length takes note of how many variables there are for use in the new data frame. The last function call creates the data frame by using the col.length value to split the vector.
This is still an intermediate result. Before regular data frame operations can be done, the extra space will have to be trimmed off of the factors. The digits must also be coerced to the class numeric.
I can continue the process when I have another chance to update.
A very unexpected behavior of data.frame in R arises from its storing character columns as factors. This causes many problems if it is not taken into account. For example, consider the following code:
foo=data.frame(name=c("c","a"),value=1:2)
# name value
# 1 c 1
# 2 a 2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
What would you expect from running bar[foo$name,]? It should return the rows of bar named by foo$name, that is, rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
The reason is that foo$name is not a character vector but a factor, and its underlying integer codes are used for indexing.
foo$name
# [1] c a
# Levels: a c
To have the expected behavior, I manually convert it to character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that we may easily forget to do this and end up with hidden bugs in our code. Is there a better solution?
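To make the mechanism explicit, here is a short sketch that builds the factor directly, so it behaves the same regardless of the stringsAsFactors default. The levels are sorted alphabetically, so "c" gets code 2 and "a" gets code 1, and those codes are what the matrix is indexed by.

```r
f <- factor(c("c", "a"))
codes <- as.integer(f)   # 2 1, because the levels are "a", "c"

bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

by_factor <- rownames(bar[f, ])                # "b" "a" (rows 2 and 1)
by_name   <- rownames(bar[as.character(f), ])  # "c" "a" (rows matched by name)
```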
This is a feature and R is working as documented. This can be dealt with generally in a few ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour so, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by @JoshuaUlrich in comments) a third option is to wrap character variables in I(....). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem, as the object is still a character vector underneath (is.character() returns TRUE), so it should work as before.
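As a rough illustration of the first and third options, the sketch below rebuilds foo from the question in both ways and indexes the matrix by name. The object names foo_chr and foo_asis are made up for the example.

```r
bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

# Option 1: ask data.frame() to keep strings as character
foo_chr <- data.frame(name = c("c", "a"), value = 1:2,
                      stringsAsFactors = FALSE)

# Option 3: protect the column with I(); it gains the class "AsIs"
foo_asis <- data.frame(name = I(c("c", "a")), value = 1:2)

rows_chr  <- rownames(bar[foo_chr$name, ])   # "c" "a"
rows_asis <- rownames(bar[foo_asis$name, ])  # "c" "a"
```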
You can check what the default for stringsAsFactors is on the currently running R process via:
> default.stringsAsFactors()
[1] TRUE
The issue is slightly wider than data.frame() in scope as this also affects read.table(). In that function, as well as the two options above, you can also tell R what all the classes of the variables are via argument colClasses and R will respect that, e.g.
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"
In the example data below, author and title are automatically converted to factor (unless you add the argument stringsAsFactors = FALSE when you are creating the data). What if we forgot to change the default setting and don't want to set the options globally?
Some code I found somewhere (most likely SO) uses sapply() to identify factors and convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
author = c("author1", "author2", "author3"),
customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)],
as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading in the dataset with the stringsAsFactors = FALSE argument, but have never tested.