Recode '[.data.table' - r

I have a user-defined object whose class attribute has three elements, i.e.
> class(data)
[1] "cumulative" "data.table" "data.frame"
I want to redefine [ so that, when I call it on my object, it uses the data.table method ([.data.table) but returns an object of my user-defined class. How do I do this?
I tried creating a function as follows, and a few other variations, but I can't get it to work:
'[.cumulative' <- function(x,i,j,...) {
y <- NextMethod(.Generic)(x,i.j)
class(y) <- .Class
}

This bug has been fixed in the current development version 1.9.3. From NEWS:
If another class inherits from data.table; e.g. class(DT) == c("UserClass", "data.table", "data.frame") then DT[...] now retains UserClass in the result. Thanks to Daniel Krizian for reporting, #5296 (git #64). Test added.
require(data.table) ## 1.9.2
dt = data.table(x=1:5, y=6:10)
setattr(dt, 'class', c("foo", "data.table", "data.frame"))
class(dt)
# [1] "foo" "data.table" "data.frame"
## bug...
class(dt[, .N, by=x])
# [1] "data.table" "data.frame"
# -------------------------------
require(data.table) ## 1.9.3
dt = data.table(x=1:5, y=6:10)
setattr(dt, 'class', c("foo", "data.table", "data.frame"))
class(dt)
# [1] "foo" "data.table" "data.frame"
## bug fixed
class(dt[, .N, by=x])
# [1] "foo" "data.table" "data.frame"
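For versions without the fix, the general S3 pattern the question is reaching for is to call NextMethod() and then restore the class before returning. The following is a minimal base-R sketch on a plain numeric vector. The class name cumulative is borrowed from the question and the data are made up. It does not exercise [.data.table, whose non-standard evaluation of i and j may need extra care.

```r
# Hypothetical S3 class wrapping a plain numeric vector (illustration only).
x <- structure(c(1, 3, 6, 10), class = "cumulative")

# Subset method: delegate to the next method, then restore the class.
`[.cumulative` <- function(x, i, ...) {
  y <- NextMethod()          # the built-in `[` does the actual subsetting
  class(y) <- oldClass(x)    # put the original class vector back
  y                          # return the result explicitly
}

sub <- x[2:3]
inherits(sub, "cumulative")  # TRUE
```

Two things differ from the attempt in the question: NextMethod() is called without re-passing the arguments, and the function returns y rather than the value of the class assignment.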

Related

Looping through vector of dates in R drops class info

Here is my example.
my_df <- data.frame(col_1 = c(1,2),
col_2 = c(as.Date('2018-11-11'), as.Date('2016-01-01')))
dates_list <- my_df$col_2
for(el in dates_list){
print(el)
}
It generates:
17846
16801
How do I print dates instead? I can do it with an explicit index, but would like a simpler solution.
1) Use as.list:
for(el in as.list(dates_list)) {
print(el)
}
giving:
[1] "2018-11-11"
[1] "2016-01-01"
2) Or, not quite as nice, iterate over the indexes:
for(i in seq_along(dates_list)) {
print(dates_list[i])
}
You can also use the as_date function from lubridate to convert each number back to a date:
library(lubridate)
for(i in dates_list){
print(as_date(i))
}
[1] "2018-11-11"
[1] "2016-01-01"
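The problem arises because dates_list <- my_df$col_2 extracts the column as a bare Date vector, and for drops the class attribute when it iterates over a vector:
dates_list <- my_df$col_2
class(dates_list)
[1] "Date"
So another solution is to keep the column inside a data frame, in which case the loop iterates over whole columns rather than single elements: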
dates_list <- my_df["col_2"]
class(dates_list)
[1] "data.frame"
for(el in dates_list){
print(el)
}
[1] "2018-11-11" "2016-01-01"
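A base-R alternative, sketched here with the same two dates, is to accept that the loop variable is a plain number (days since 1970-01-01) and convert it back inside the loop. The variable names seen and d are made up for illustration.

```r
dates <- as.Date(c("2018-11-11", "2016-01-01"))

seen <- character(0)
for (el in dates) {
  # el is a bare number here; the Date class was dropped by `for`
  d <- as.Date(el, origin = "1970-01-01")  # rebuild the Date
  seen <- c(seen, format(d))               # collect it as text
}
seen
```

This avoids the lubridate dependency, at the cost of spelling out the origin explicitly.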

`write.dbf` fails with an object of class `tbl_df`

I do a lot of my work with .dbf files, and also with dplyr. There's a bug in write.dbf() that prevents writing a tbl_df object to a .dbf file.
Unfortunately, the error message is poorly written and it's therefore difficult to figure out exactly what is happening.
Here's an MWE:
library(dplyr)
library(foreign)
d <- data_frame( x = 1:4, y = rnorm(4) )
write.dbf(d, "test.dbf")
Error in write.dbf(d, "test.dbf") : unknown column type in data frame
The solution here is to force the class of d to a bare data.frame:
class(d)
[1] "tbl_df" "tbl" "data.frame"
df <- as.data.frame(d)
class(df)
[1] "data.frame"
write.dbf(df, "test.dbf") # works
I've filed a bug report with the maintainers of foreign, but hopefully this post can save someone else some pain.
I'm not sure it's fair to assert a bug in foreign. Consider this:
library(dplyr)
df <- data.frame(x=1:10, y=11:20)
class(df)
# [1] "data.frame"
mode(df$x) # as expected
# [1] "numeric"
mode(df[,"x"]) # as expected
# [1] "numeric"
dp <- data_frame(x=1:10, y=11:20)
class(dp)
# [1] "tbl_df" "tbl" "data.frame"
mode(dp$x)
# [1] "numeric" # as expected
mode(dp[,"x"])
# [1] "list" # WTF?!
There are many, many functions in R that use, e.g., mode(my.data.frame[,"mycolumn"]) to test the mode of a column in a dataframe, but with a tbl_df object, the mode returned is "list".
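The same difference can be reproduced in base R alone, which may help show what is going on: a tbl_df keeps [ from simplifying a single column to a vector, much like drop = FALSE on an ordinary data frame. This is a small sketch under that reading, with made-up variable names.

```r
df <- data.frame(x = 1:10, y = 11:20)

m_drop   <- mode(df[, "x"])                # column simplified to a vector
m_nodrop <- mode(df[, "x", drop = FALSE])  # still a one-column data frame
m_double <- mode(df[["x"]])                # [[ extracts the column itself

c(m_drop, m_nodrop, m_double)
# [1] "numeric" "list"    "numeric"
```

Code that needs the column itself can use [[ or $, which return the vector for both kinds of object.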

Why doesn't setDT have any effect in this case?

Consider the following code
library(data.table) # 1.9.2
x <- data.frame(letters[1:2])
setDT(x)
class(x)
## [1] "data.table" "data.frame"
which is the expected result. Now if I run
x <- letters[1:2]
setDT(data.frame(x))
class(x)
## [1] "character"
The class of x remained unchanged for some reason.
One possibility is that setDT changes only the classes of objects in the global environment, so I've tried
x <- data.frame(letters[1:2])
ftest <- function(x) setDT(x)
ftest(x)
class(x)
##[1] "data.table" "data.frame"
It seems that setDT doesn't care much about the environment of an object when changing its class.
So what's causing the above behaviour? Is it just a bug, or is there some logic behind it?
setDT changes the data.frame and returns it invisibly. Since you don't save this data.frame, it is lost. All you need to do is somehow save the data.frame, so that the data.table is also saved. E.g.
setDT(y <- data.frame(x))
class(y)
## [1] "data.table" "data.frame"
or
z <- setDT(data.frame(x))
class(z)
## [1] "data.table" "data.frame"

How to avoid implicit character conversion when using apply on dataframe

When using apply on a data.frame, the arguments are (implicitly) converted to character. An example:
df <- data.frame(v=1:10, t=1:10)
df <- transform(df, t2 = as.POSIXlt(t, origin = "2013-08-13"))
class(df$t2[1])
## [1] "POSIXct" "POSIXt" (correct)
but:
apply(df, 1, function(y) class(y["t2"]))
## [1] "character" "character" "character" "character" "character" "character"
## [7] "character" "character" "character" "character"
Is there any way to avoid this conversion? Or do I always have to convert back through as.POSIXlt(y["t2"])?
edit
My df has 2 timestamps (say, t2 and t3) and some other fields (say, v1, v2). For each row with a given t2, I want to find the k (e.g. 3) rows with t3 closest to, but lower than, t2 (and the same v1), and return a statistic over v2 from these rows (e.g. an average). I wrote a function f(t2, v1, df) and just wanted to apply it to all rows using apply(df, 1, function(y) f(y["t2"], y["v1"], df)). Is there any better way to do such things in R?
Let's wrap up multiple comments into an explanation.
Using apply converts a data.frame to a matrix. This means that every column is coerced to the most general type needed to hold all of them, which in this case is character.
You're supplying 1 to apply's MARGIN argument. This applies the function by row and makes things even worse, as you're now really mixing classes together. In this scenario you're using apply, which is designed for matrices and data.frames, on a vector. This is not the right tool for the job.
In this case I'd use lapply or sapply, as rmk points out, to grab the classes of the single t2 column, as seen below:
Code:
df <- data.frame(v=1:10, t=1:10)
df <- transform(df, t2 = as.POSIXlt(t, origin = "2013-08-13"))
sapply(df[, "t2"], class)
lapply(df[, "t2"], class)
## [[1]]
## [1] "POSIXct" "POSIXt"
##
## [[2]]
## [1] "POSIXct" "POSIXt"
##
## [[3]]
## [1] "POSIXct" "POSIXt"
##
## .
## .
## .
##
## [[9]]
## [1] "POSIXct" "POSIXt"
##
## [[10]]
## [1] "POSIXct" "POSIXt"
In general you choose the member of the apply family that fits the job. Often I personally use lapply or a for loop to act on specific columns, or subset the columns I want using indexing ([, ]) and then proceed with apply. The answer to this problem really boils down to determining what you want to accomplish, asking whether apply is the most appropriate tool, and proceeding from there.
May I offer this blog post as an excellent tutorial on what the different apply family of functions do.
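To make the coercion concrete, and as one possible way to process rows without it, here is a small base-R sketch. The three-row data frame and the tz = "UTC" setting are assumptions made for illustration. The row loop uses vapply over row indices so each value keeps its class.

```r
# Small illustrative data frame (values and time zone are made up).
df <- data.frame(v = 1:3)
df$t2 <- as.POSIXct(1:3, origin = "2013-08-13", tz = "UTC")

# apply() first calls as.matrix(); with a date-time column the
# whole matrix becomes character.
m <- as.matrix(df)
typeof(m)                       # "character"

# Iterating over row indices instead leaves each column's class intact.
first_class <- vapply(
  seq_len(nrow(df)),
  function(i) class(df$t2[i])[1],
  character(1)
)
first_class
```

A function like the f(t2, v1, df) described in the question could be called the same way, passing df$t2[i] and df$v1[i] for each row index i.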
Try:
sapply(df, class)
$v
[1] "integer"
$t
[1] "integer"
$t2
[1] "POSIXct" "POSIXt"

as.matrix not preserving the data mode of an empty data.frame

I found something odd today and wanted to ask whether there is a logical reason for what I am seeing, or whether you think this is a bug that should be reported to the R-devel team:
df <- data.frame(a = 1L:10L)
class(df$a)
# [1] "integer"
m <- as.matrix(df)
class(m[, "a"])
# [1] "integer"
No surprise so far: as.matrix preserves the data mode, here "integer". However, with an empty (no rows) data.frame:
df <- data.frame(a = integer(0))
class(df$a)
# [1] "integer"
m <- as.matrix(df)
class(m[, "a"])
# [1] "logical"
Any idea why the mode changes from "integer" to "logical" here? I am using R version 2.13.1.
Thank you.
This is because of this one line in as.matrix.data.frame:
if (any(dm == 0L)) return(array(NA, dim = dm, dimnames = dn))
Basically, if any dimensions are zero, you get an array "full" of NA. I say "full" because there aren't really any observations because one of the dimensions is zero.
The class is logical because that's the class of NA. There are special NA values for other classes, but they're not really necessary here. For example:
> class(NA)
[1] "logical"
> class(NA_integer_)
[1] "integer"
> class(NA_real_)
[1] "numeric"
> class(NA_complex_)
[1] "complex"
> class(NA_character_)
[1] "character"
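If the integer type matters downstream, one possible workaround is to set the storage mode of the empty matrix afterwards. The sketch below builds the empty array directly with array(NA, ...), mirroring the line quoted above rather than calling as.matrix, so it does not depend on the R version in use.

```r
# Mimic what the quoted line returns for a 0-row, 1-column data frame.
dm <- c(0L, 1L)
m <- array(NA, dim = dm, dimnames = list(NULL, "a"))
before <- typeof(m)             # "logical", because NA is logical

# Coerce the (empty) matrix to integer while keeping dim and dimnames.
storage.mode(m) <- "integer"
after <- typeof(m)              # "integer"
```

Because the matrix has no elements, the coercion changes only the type, not any data.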
