I am using S3 methods in the following way.
First, I find all the tasks common to every class and put that code only once, before the call to UseMethod(). Then I program the rest in each class's method.
The problem arises when I modify an argument, apparently because of how the arguments are passed to the dispatched method. Other tasks, such as checking arguments or defining sub-functions, work well in this scheme.
In the next example, I modify an argument:
secure_filter <- function(table, col, value){
  if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
  message("converting to numeric")
  print(str(table))
  UseMethod("secure_filter", table)
}
secure_filter.data.table <- function(table, col, value){
  return(table[col == value, ])
}
secure_filter.data.frame <- function(table, col, value){
  return(table[table$col == !!value, ])
}
and the result is wrong:
df <- data.frame(a=c("a", "b", "c"), column = c("1", "2", "3"))
dt <- as.data.table(df)
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
Empty data.table (0 rows and 2 cols): a,column
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
[1] a column
<0 rows> (or 0-length row.names)
Am I using S3 correctly? How do I avoid repeating code between S3 classes?
Is there an example in a well-known R function?
I am using this approach to re-program tidyverse scripts as data.table scripts.
Use NextMethod() instead of UseMethod().
secure_filter <- function(table, col, value){
  if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
  message("converting to numeric")
  print(str(table))
  NextMethod("secure_filter")
  #UseMethod("secure_filter", table)
}
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
a column
1: a 1
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
a column
1 a 1
But I don't know whether this answer is well behaved, because sloop neither shows which method was dispatched nor can find the generic:
> sloop::s3_dispatch(secure_filter(dt, "column", 1))
secure_filter.data.table
secure_filter.data.frame
secure_filter.default
> sloop::s3_get_method(secure_filter)
Error: Could not find generic
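For comparison, the more conventional way to share code between S3 methods keeps the generic as a bare UseMethod() call and moves the common work into an internal helper that each method calls first. This is only a sketch; secure_filter2 and .coerce_col are hypothetical names, not part of the original code:

```r
# Shared pre-processing lives in a plain helper, not in the generic body.
.coerce_col <- function(table, col) {
  if (!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
  table
}

# The generic does nothing but dispatch.
secure_filter2 <- function(table, col, value) UseMethod("secure_filter2")

secure_filter2.data.frame <- function(table, col, value) {
  table <- .coerce_col(table, col)   # each method opts in explicitly
  table[table[[col]] == value, , drop = FALSE]
}
```

Each method pays one extra line to call the helper, but dispatch stays transparent to tools such as sloop::s3_dispatch().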
I have a dataframe in R that I import from Excel and a dataframe that I create with a script. These dataframes contain the same columns, but since one is imported from Excel, the classes of its columns are not identical to those of the dataframe created with the script.
The dataframes contain 500+ columns, so doing it individually would take a lot of time. Is there any way to change the classes of all columns of the Excel-imported dataframe to the classes of the corresponding columns in the script-created dataframe?
Many thanks!
df1 <- data.frame(a=1,b="2")
df2 <- data.frame(a=1L,b=2,d=3)
nms <- intersect(names(df1), names(df2))
df2[nms] <- Map(function(ref, tgt) { class(tgt) <- class(ref); tgt; }, df1[nms], df2[nms])
str(df2)
# 'data.frame': 1 obs. of 3 variables:
# $ a: int 1
# $ b: chr "2"
# $ d: num 3
Granted, $a remains integer instead of being cast to numeric; if that's not a concern, then that may suffice. If not, then this more-verbose and more-flexible option might be preferred:
cls <- sapply(df1[nms], function(z) class(z)[1])
df2[nms] <- Map(function(tgt, cl) {
  if (cl == "numeric") {
    tgt <- as.numeric(tgt)
  } else if (cl == "integer") {
    tgt <- as.integer(tgt)
  } else if (cl == "character") {
    tgt <- as.character(tgt)
  }
  tgt
}, df2[nms], cls)
str(df2)
# 'data.frame': 1 obs. of 3 variables:
# $ a: num 1
# $ b: chr "2"
# $ d: num 3
The rationale behind sapply(.., class(z)[1]) is that some classes have length greater than 1 (e.g., tbl_df, POSIXct), which would spoil that process.
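If the classes involved all have a standard as.&lt;class&gt; coercer, the if/else chain can be collapsed by looking the coercer up with match.fun(). This is a sketch under that assumption, not part of the original answer:

```r
df1 <- data.frame(a = 1, b = "2", stringsAsFactors = FALSE)
df2 <- data.frame(a = 1L, b = 2, d = 3)
nms <- intersect(names(df1), names(df2))
cls <- sapply(df1[nms], function(z) class(z)[1])
# Assumes every reference class has a matching as.* function
# (as.numeric, as.integer, as.character, ...).
df2[nms] <- Map(function(tgt, cl) match.fun(paste0("as.", cl))(tgt),
                df2[nms], cls)
str(df2)
# 'data.frame': 1 obs. of 3 variables:
#  $ a: num 1
#  $ b: chr "2"
#  $ d: num 3
```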
This is my first question here; sorry for possible mistakes.
I have a list "tt" of data frames obtained after streaming in a JSON file.
Some of the data frames are empty, and some have a predefined structure. Here is an example:
> str(tt)
List of 2
$ :'data.frame': 0 obs. of 0 variables
$ :'data.frame': 2 obs. of 2 variables:
..$ key : chr [1:2] "issue_id" "letter_id"
..$ value: chr [1:2] "43" "223663"
> tt
[[1]]
data frame with 0 columns and 0 rows
[[2]]
key value
1 issue_id 43
2 letter_id 223663
I would like to get a column (e.g. named "t") with the issue_id values out of the "tt" structure, so that
t[1] = NA (or NULL)
t[2] = 43
I can do it by accessing the data frames as list elements, like this:
> tt[[1]][1,2]
NULL
> tt[[2]][1,2]
[1] "43"
How can I do this in a "vectorized" way? I tried different things without success, like:
> t <- tt[[]][1,2]
Error in tt[[]] : invalid subscript type 'symbol'
> t <- tt[][1,2]
Error in tt[][1, 2] : incorrect number of dimensions
> t <- tt[[]][1][2]
Error in tt[[]] : invalid subscript type 'symbol'
> t <- tt[][1][2]
> t
[[1]]
NULL
It should be something very simple, I guess.
We can use lapply to loop over the list. As there are NULL elements or data frames with zero rows, we skip those and extract the 'value' from the other elements.
lapply(tt, function(x) if(!(is.null(x) | !nrow(x))) with(x, value[key == "issue_id"]))
As @MikeRSpencer mentioned in the comments, if we need to extract just the first 'value':
sapply(tt, function(x) if(!(is.null(x) | !nrow(x))) x$value[1])
and it would return a vector.
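A variant of the same idea (my sketch, not from the answer) uses vapply to guarantee an atomic character result, with NA for the empty elements:

```r
tt <- list(
  data.frame(),                                 # empty element
  data.frame(key = c("issue_id", "letter_id"),
             value = c("43", "223663"),
             stringsAsFactors = FALSE)
)
# vapply enforces exactly one character value per element; empty -> NA.
t_vec <- vapply(tt, function(x) {
  if (is.null(x) || !nrow(x)) NA_character_
  else x$value[x$key == "issue_id"][1]
}, character(1))
t_vec
# [1] NA   "43"
```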
The dataset below has the characteristics of my large dataset. I am managing it in data.table; some columns are loaded as character even though they are numbers, and I want to convert them to numeric. The column names to convert are known.
dt = data.table(A = LETTERS[1:10], B = letters[1:10],
                C = as.character(runif(10)), D = as.character(runif(10))) # simplified version
strTmp = c('C','D') # names of columns to be converted to numeric
# columns converted to numeric, returned as a 10 x 2 data.table
dt.out1 <- dt[, lapply(.SD, as.numeric, na.rm = T), .SDcols = strTmp]
I am able to convert those two columns to numeric with the code above; however, I want to update dt instead. I tried using :=, but it didn't work. I need help here!
dt.out2 <- dt[, strTmp:=lapply(.SD, as.numeric, na.rm = T), .SDcols = strTmp] # returned a 10 x 6 data.table (2 columns extra)
I even tried the code below (written as for a data.frame, which is not my ideal solution even if it works, as I am worried the column order might change in some cases), but it still doesn't work. Can someone let me know why?
dt[,strTmp,with=F] <- dt[,lapply(.SD, as.numeric, na.rm = T), .SDcols = strTmp]
Thanks in advance!
You don't need to assign the whole data.table if you assign by reference with := (i.e., you don't need dt.out2 <-).
You need to wrap the LHS of := in parentheses to make sure it is evaluated (and not used as the name).
Like this:
dt[, (strTmp) := lapply(.SD, as.numeric), .SDcols = strTmp]
str(dt)
#Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ A: chr "A" "B" "C" "D" ...
# $ B: chr "a" "b" "c" "d" ...
# $ C: num 0.30204 0.00269 0.46774 0.08641 0.02011 ...
# $ D: num 0.151 0.0216 0.5689 0.3536 0.26 ...
# - attr(*, ".internal.selfref")=<externalptr>
While Roland's answer is more idiomatic, you can also consider set within a loop for something as direct as this. An approach might look like:
strTmp = c('C','D')
ind <- match(strTmp, names(dt))
for (i in seq_along(ind)) {
  set(dt, NULL, ind[i], as.numeric(dt[[ind[i]]]))
}
str(dt)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ A: chr "A" "B" "C" "D" ...
# $ B: chr "a" "b" "c" "d" ...
# $ C: num 0.308 0.564 0.255 0.828 0.128 ...
# $ D: num 0.635 0.0485 0.6281 0.4793 0.7 ...
# - attr(*, ".internal.selfref")=<externalptr>
From the help page at ?set, this would avoid some of the [.data.table overhead if that ever becomes a problem for you.
A very unexpected behavior of R's otherwise useful data.frame arises from its conversion of character columns to factors. This causes many problems if it is not taken into account. For example, consider the following code:
foo = data.frame(name = c("c","a"), value = 1:2)
#   name value
# 1    c     1
# 2    a     2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
Then what do you expect bar[foo$name,] to return? It should normally return the rows of bar named according to foo$name, that is, rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
The reason is this: foo$name is not a character vector but a factor, whose underlying values are integer codes.
foo$name
# [1] c a
# Levels: a c
To get the expected behavior, I manually convert it to a character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that we may easily forget to do this and end up with hidden bugs in our code. Is there any better solution?
This is a feature and R is working as documented. This can be dealt with generally in a few ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour so, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by #JoshuaUlrich in comments) a third option is to wrap character variables in I(....). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem as the object inherits (in this case) the class "character" so should work as before.
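A minimal sketch of that third option (my example, not from the original answer):

```r
# I() marks the column "AsIs", so data.frame() leaves it alone
# instead of converting it to a factor.
foo <- data.frame(name = I(c("c", "a")), value = 1:2)
class(foo$name)
# [1] "AsIs"
is.character(foo$name)   # the underlying type is still character
# [1] TRUE
```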
You can check what the default for stringsAsFactors is on the currently running R process via:
> default.stringsAsFactors()
[1] TRUE
The issue is slightly wider in scope than data.frame(), as it also affects read.table(). In that function, besides the two options above, you can also tell R what the classes of the variables are via the argument colClasses, and R will respect that, e.g.:
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"
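A related base-R helper worth knowing (my addition, not part of the original answer) is utils::type.convert(), which guesses a suitable class for each column; with as.is = TRUE it keeps character columns as character rather than factor:

```r
tmp <- data.frame(Var1 = c("1", "2", "3"), Var2 = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
# Convert every column in place; numbers become numeric/integer,
# non-numeric strings stay character because of as.is = TRUE.
tmp[] <- lapply(tmp, type.convert, as.is = TRUE)
str(tmp)
# 'data.frame': 3 obs. of  2 variables:
#  $ Var1: int  1 2 3
#  $ Var2: chr  "A" "B" "C"
```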
In the example data below, author and title are automatically converted to factor (unless you add the argument stringsAsFactors = FALSE when you are creating the data). What if we forgot to change the default setting and don't want to set the options globally?
Some code I found somewhere (most likely SO) uses sapply() to identify factors and convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
                 author = c("author1", "author2", "author3"),
                 customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)], as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading in the dataset with the stringsAsFactors = FALSE argument, but have never tested.
I would appreciate insight into why this happens and how I might do this more elegantly.
When I use sapply, I would like it to return a 3x2 matrix, but it returns a 2x3 matrix. Why is this? And why is it difficult to attach this to another data frame?
a <- data.frame(id=c('a','b','c'), var1 = c(1,2,3), var2 = c(3,2,1))
out <- sapply(a$id, function(x) out = a[x, c('var1', 'var2')])
#out is 3x2, but I would like it to be 2x3
#I then want to append t(out) (out as a 2x3 matrix) to b, a 1x3 dataframe
b <- data.frame(var3=c(0,0,0))
When I try to attach these,
b[,c('col2','col3')] <- t(out)
The error that I get is:
Warning message:
In `[<-.data.frame`(`*tmp*`, , c("col2", "col3"), value = list(1, :
provided 6 variables to replace 2 variables
although the following appears to give the desired result:
rownames(out) <- c('col1', 'col2')
b <- cbind(b, t(out))
I cannot operate on the variables:
b$var1/b$var2
returns
Error in b$var1/b$var2 : non-numeric argument to binary operator
Thanks!
To expand on DWin's answer: it would help to look at the structure of your out object. It explains why b$var1/b$var2 doesn't do what you expect.
> out <- sapply(a$id, function(x) out = a[x, c('var1', 'var2')])
> str(out) # this isn't a data.frame or a matrix...
List of 6
$ : num 1
$ : num 3
$ : num 2
$ : num 2
$ : num 3
$ : num 1
- attr(*, "dim")= int [1:2] 2 3
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "var1" "var2"
..$ : NULL
The apply family of functions is designed to work on vectors and arrays, so you need to take care when using them with data.frames (which are usually lists of vectors). You can use the fact that data.frames are lists to your advantage with lapply.
> out <- lapply(a$id, function(x) a[x, c('var1', 'var2')]) # list of data.frames
> out <- do.call(rbind, out) # data.frame
> b <- cbind(b,out)
> str(b)
'data.frame':   3 obs. of  3 variables:
 $ var3: num  0 0 0
 $ var1: num  1 2 3
 $ var2: num  3 2 1
> b$var1/b$var2
[1] 0.3333333 1.0000000 3.0000000
First, a bit about R internals. If you look at the code for sapply, you will find the answer to your question. sapply checks whether the list elements all have equal length and, if so, first unlist()s them and then passes that series of values as the data argument to array(). Since array() (like matrix()) arranges its values in column-major order by default, that is what you get: the lists get turned on their side. If you don't like it, you can define a new function tsapply that returns the transposed values:
> tsapply <- function(...) t(sapply(...))
> out <- tsapply(a$id, function(x) out = a[x, c('var1', 'var2')])
> out
var1 var2
[1,] 1 3
[2,] 2 2
[3,] 3 1
... a 3 x 2 matrix.
Have a look at ddply from the plyr package:
a <- data.frame(id=c('a','b','c'), var1 = c(1,2,3), var2 = c(3,2,1))
library(plyr)
ddply(a, "id", function(x){
  out <- cbind(O1 = rnorm(nrow(x), x$var1), O2 = runif(nrow(x)))
  out
})