data.table - Extract all the text features - r

As part of a function, I am trying to isolate all features that are either character or factor. My data set is a data.table.
text_features <- c(names(data_set[sapply(data_set, is.character)]), names(data_set[sapply(data_set, is.factor)]))
When I run the function I am getting an exception message that says:
Error in [.data.table(data_set, sapply(data_set, is.character)) :
i evaluates to a logical vector length 87 but there are 12992 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
I understand this error is thrown by a recent version of data.table. How should I change my code so that it works the same way but avoids this error?
Note:
packageVersion("data.table")
[1] ‘1.10.4.3’
Thanks

The error that you are getting is because a comma is missing where you subset your data.table inside names(): a bare logical vector in [ selects rows, but you want a subset of the columns, not rows:
data_set[sapply(data_set, is.character)] # subsetting rows
data_set[,sapply(data_set, is.character), with = FALSE] # subsetting columns
All that said, I think a much cleaner solution would be:
text_cols <- names(data_set)[sapply(data_set, class) %in% c("character","factor")]
data_set[, ..text_cols] # subset data
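One caveat: sapply(data_set, class) can return a list rather than a character vector if any column carries more than one class (e.g. POSIXct), in which case the %in% test is unreliable. A minimal sketch of a variant that avoids this:
is_text <- vapply(data_set, function(x) is.character(x) || is.factor(x), logical(1L)) # TRUE for character/factor columns
text_cols <- names(data_set)[is_text]
data_set[, ..text_cols]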

Related

using setNames in ifelse statement in R

I noticed that if I call setNames() inside ifelse(), the returned object does not preserve the names from setNames().
x <- 1:10
#no names kept
ifelse(x <5, setNames(x+1,letters[1:4]), setNames(x^3, letters[5:10]))
#names kept
setNames(ifelse(x <5, x+1,x^3), letters[1:10])
After looking at the code I realize that the second way is more concise, but I would still be interested to know why the names are not preserved when setNames() is called inside ifelse(). The ifelse() documentation warns:
The mode of the result may depend on the value of test (see the examples), and the class attribute (see oldClass) of the result is taken from test and may be inappropriate for the values selected from yes and no.
Is the stripping of the names related to this warning?
It's not really specific to setNames. ifelse simply doesn't preserve names from the yes/no arguments. It would get confusing if your yes and no values had different names, so it just doesn't bother. However, according to the Value section of the help page:
A vector of the same length and attributes (including dimensions and "class") as test
Since names are stored as attributes, names are only preserved from the test parameter. Observe these simple examples:
ifelse(TRUE, c(a=1), c(x=4))
# [1] 1
ifelse(c(g=TRUE), c(a=1), c(x=4))
# g
# 1
So in your example you need to move the names to the test condition:
ifelse(setNames(x <5,letters[1:10]), x+1, x^3)
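With x <- 1:10 as above, this returns the expected named vector:
#    a    b    c    d    e    f    g    h    i    j
#    2    3    4    5  125  216  343  512  729 1000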

Paste0, subset Error: 'subset' must be logical

I would like to use paste0 to create a long string containing the conditions for the subset function.
I tried the following:
#rm(list=ls())
set.seed(1)
id<-1:20
ids<-sample(id, 3)
d <- subset(id, noquote(paste0("id==",ids,collapse="|")))
I get the error:
Error in subset.default(id, noquote(paste0("id==", ids, collapse = "|"))) :
'subset' must be logical
I tried the same without noquote. Interestingly, when I run
noquote(paste0("id==",ids,collapse="|"))
I get [1] id==4|id==7|id==1. When I then paste this by hand into the subset call
d2<-subset(id,id==4|id==7|id==1)
Everything runs fine. But why does subset(id, noquote(paste0("id==",ids,collapse="|"))) not work, although it seems to be the same? Thanks a lot for your help!
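For what it's worth: paste0() builds a length-one character string, and noquote() only changes how that string prints; it does not turn it into code. subset() therefore still receives a character vector rather than the logical vector it requires, and the expression is never evaluated. Two sketches of ways to get the intended result (the %in% form is the idiomatic one):
subset(id, id %in% ids) # no string building needed
subset(id, eval(parse(text = paste0("id==", ids, collapse = "|")))) # parses the built string; generally discouraged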

Are data tables with more than 2^31 rows supported in R with the data table package yet?

I am trying to do a cross join (from the original question here), and I have 500GB of RAM. The problem is that the final data.table has more than 2^31 rows, so I get this error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
Is there a way to override this? When I add by=.EACHI, I get the error:
'by' or 'keyby' is supplied but not j
I know this question is not in ideal reproducible format (my apologies!), but I am not sure that is strictly necessary for an answer. Maybe I am just missing something or data.table is limited in this way?
I am aware only of this question from 2013, which seems to suggest data.table could not do this back then.
This is the code that causes the error:
pfill = q[, k := t + 1][q2[, k := tprm], on = .(k), nomatch = 0L, allow.cartesian = TRUE][, k := NULL]
As data.table still seems to be limited to 2^31 rows, you could use arrow combined with dplyr as a workaround to overcome this limit:
library(arrow)
library(dplyr)
# Create 3 feather files of 2^30 rows each (3 * 2^30 rows total)
dir.create("test", showWarnings = FALSE) # directory assumed by the paths below
dt <- data.frame(val = rep(1.0, 2^30))
write_feather(dt, "test/data1.feather")
write_feather(dt, "test/data2.feather")
write_feather(dt, "test/data3.feather")
# Read the 3 files as one combined dataset
dset <- open_dataset('test',format = 'feather')
# Get number of rows
(nrows <- dset %>% summarize(n=n()) %>% collect() %>% pull)
#integer64
#[1] 3221225472
# Check that we're above 2^31 rows
nrows / 2^31
#[1] 1.5
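A possible follow-up query, as a sketch assuming the dset built above: arrow evaluates dplyr verbs lazily against the on-disk dataset, so filters and aggregations run before anything is pulled into R:
dset %>%
  filter(val > 0) %>%
  summarize(total = sum(val)) %>%
  collect()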
If you just want to know Yes or No: I believe we cannot have a data.table object with more than 2^31 rows if we stay within data.table itself. However, if you step outside data.table, the answer by @Waldi is a fabulous workaround for this issue.
The explanation below is just an example to somewhat "prove" the infeasibility, which may provide you with some hints, hopefully.
Let's think about it the other way around. Assuming we have a data.table dt with more than 2^31 rows, what will happen when indexing the rows? Note that rows are indexed with integers, which means we would need integers larger than 2^31 in your case. Unfortunately, if you type ?.Machine in the console, you will see that
The algorithm is based on Cody's (1988) subroutine MACHAR. As all
current implementations of R use 32-bit integers and use IEC 60559
floating-point (double precision) arithmetic, the "integer" and
"double" related values are the same for almost all R builds.
and
integer.max the largest integer which can be represented. Always 2^31 - 1 = 2147483647.
If the assumption were true, we would run into indexing issues, i.e., row indices that cannot be represented as integers. Thus the assumption does not hold.
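We can check this limit directly in the console:
> .Machine$integer.max
[1] 2147483647
> .Machine$integer.max == 2^31 - 1
[1] TRUE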
A Simple Test
Given a long vector v of length 2^31 (which is larger than 2^31-1), let's see what will happen if we use it to initialize a data.table:
> v <- seq_len(2^31)
> d <- data.table(v)
Error in vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_integer_, USE.NAMES = use.names) :
values must be type 'integer',
but FUN(X[[1]]) result is type 'double'
As we can see, there is no issue when creating a vector of length 2^31, but we run into trouble when initializing the data.table d. When we look into the source code of data.table, we see several places that use length, which returns an integer only when the vector is no longer than 2^31-1:
The default method for length currently returns a non-negative integer
of length 1, except for vectors of more than 2^31 - 1 elements, when
it returns a double.
and we can check that
> class(length(v))
[1] "numeric"
which means the output is not an integer, as required when calling data.table.

data.table gives the error: "logical error. i is not a data.table, but 'on' argument is provided"

Given the following code:
library(data.table)
dt <- data.table(V1=round(runif(9,100,999),2), V2=rep(1:3,3),
V3=round(runif(9,10,99),2), V4=rep(letters[1:3],3))
setindex(dt,V4)
F1 <- dt[V2==2 & V3>=3, max(V1)]
F2 <- dt[V2==2 & V3>=3, max(V1), on = "V4"]
I am 100% sure that class(dt) is "data.table" "data.frame".
It runs well with F1, but F2 raises the error
logical error. i is not a data.table, but 'on' argument is provided.
Why? How do I solve it?
What I am trying to do is not subsetting (or grouping) but improving calculation efficiency with the "on" argument, which I was told is the keyword for secondary indexing.
Many thanks.
I know where I made a mistake: I was simply using it in the wrong way.
i must be a data.table when the "on" argument is given.
My original purpose is to search for the target efficiently.
condition: V3>=3, and V2==2
target: max(V1)
I can't pass i a condition, but I can make i a data.table, as follows.
F2 <- dt[V3>=3][V2==2,max(V1), on = "V4"]
it runs perfectly!
Thanks guys.
You're not using the right subset syntax. You must have some objects in the workspace named V2 or V3; data.table thinks you're merging on them. The i argument is the first argument to [.data.table. Replace V2==2 & V3>=3 with (V2==2 & V3>=3) to refer to the column variables. See here about the subtleties of scoping with i= as a subset in [.data.table. The last on should probably be by (although you would still have had an error because of the subset syntax).
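For completeness, a sketch (using the dt above) of what on= is actually meant for - a keyed lookup where i supplies values to match rather than a logical condition:
dt["a", max(V1), on = "V4"] # max(V1) among rows where V4 == "a"
dt[.("a"), max(V1), on = "V4"] # the same lookup with i as a data.table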

is.na() behaves differently than is.numeric() - where's the consistency?

Let's create the data frame:
df <- data.frame(VarA = c(1, NA, 5), VarB = c(NA, 2, 7))
VarA VarB
1 1 NA
2 NA 2
3 5 7
If I run a simple NA query it shows me the locations of each NA.
is.na(df)
VarA VarB
[1,] FALSE TRUE
[2,] TRUE FALSE
[3,] FALSE FALSE
Why doesn't is.numeric return the same type of data frame? It only outputs a single "FALSE".
is.numeric(df)
[1] FALSE
Is there a good explanation of data types, classes, etc. somewhere? I read about these things often but don't have a solid feel for them. I don't get the difference between a matrix and data frame, or num vs dbl. It's easy to conflate these things.
I did the Cyclismo "basic data types" tutorial but would like to dig a little deeper.
First - documentation
Let's turn to the documentation. From ?is.na:
The generic function is.na indicates which elements are missing.
So is.na is made to tell you which individual elements within an object are missing.
From ?is.numeric:
is.numeric is a more general test of an object being interpretable as numbers.
So is.numeric tells you whether an object is numeric (not whether individual elements within the object are numeric).
These are behaving exactly as documented: is.na(df) tells you which elements of the data frame are missing; is.numeric(df) tells you that df is not numeric (in fact, it is a data.frame).
Is it inconsistent?
I can see how this seems inconsistent. There are just a few is.* functions that work element-wise; is.na, is.finite, and is.nan are the only ones I can think of. All the other is.* functions work on the whole object. These functions are essentially stand-ins for equality testing with == when equality testing wouldn't work (more on this below). But once you understand the data structures a little more, they don't seem inconsistent, because they really wouldn't make sense the other way.
is.numeric makes sense the way it is
It would not make sense for is.numeric to be applied element-wise. A vector is either numeric or not in its entirety - whether or not it has missing values. If you wanted to apply the is.numeric function to each column of your data frame, you could do
sapply(df, is.numeric)
This will tell you that both columns are numeric. You could make an argument that the default behavior when is.numeric() is given a data frame should be to apply it to every column, but it's possible someone wants to make sure that something is a numeric vector, not a data.frame (or anything else), and having, say, a one-column data.frame answer TRUE to is.numeric() could cause confusion and errors.
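For the example data frame above, sapply(df, is.numeric) returns:
# VarA VarB
# TRUE TRUE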
is.na makes sense the way it is
Conversely, it wouldn't make sense for is.na to not be applied element-wise. NA is a stand-in for a single value, not a complicated object like a data.frame. It wouldn't really make sense to have a "missing" data frame - you could have a missing value but there's nothing to tell you that it's a data frame. However a data.frame (or a vector, or a matrix...) can contain missing values, and is.na will tell you exactly where they are.
This is pretty much identical to how equality (or other comparisons) work. You could also check for 1s in your data frame with df == 1, or for values less than 5 with df < 5. is.na() is the recommended way to check for missing values - anything == NA returns NA, so df == NA doesn't work for that. is.na(df) is the right way to do this.
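We can see why: every comparison with NA yields NA, so the result carries no information about missingness:
df == NA
#   VarA VarB
# 1   NA   NA
# 2   NA   NA
# 3   NA   NA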
To accomplish this, is.na actually has many methods. You can see them with methods("is.na"). In my current R session, I see
methods("is.na")
[1] is.na,abIndex-method is.na,denseMatrix-method is.na,indMatrix-method
[4] is.na,nsparseMatrix-method is.na,nsparseVector-method is.na,sparseMatrix-method
[7] is.na,sparseVector-method is.na.coxph.penalty* is.na.data.frame
[10] is.na.data.table* is.na.integer64* is.na.numeric_version
[13] is.na.POSIXlt is.na.raster* is.na.ratetable*
[16] is.na.Surv*
This shows me that all these different types of objects support a is.na() call to nicely tell me where missing values are inside of them. And if I call it on another object class, then is.na.default will try to handle it.
Secondary questions
I don't get the difference between a matrix and data frame, or num vs dbl. It's easy to conflate these things.
num vs dbl is not relevant to R. I'm shocked that anything directed at R beginners would mention doubles - it shouldn't. If you look at the help at ?double, it includes:
It is identical to numeric.
... as.double is a generic function. It is identical to as.numeric.
For R purposes, forget the term double and just use numeric.
I don't get the difference between a matrix and data frame
Both are rectangular - rows and columns. A matrix can only have one data type/class inside it - the whole matrix is numeric, or character, or integer, etc., with no mixing. A data.frame can have a different class for each of its columns: the first column can be numeric, the second character, the third factor, and so on.
Matrices are simpler and more efficient, very suitable for linear algebra operations. Data frames are much more common because it is common to have data of mixed types.
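A quick sketch of the difference:
m <- cbind(1:3, letters[1:3]) # a matrix coerces everything to one type (here character)
class(m[, 1])
# [1] "character"
d <- data.frame(n = 1:3, l = letters[1:3]) # a data.frame keeps a class per column
sapply(d, class)
#           n           l
#   "integer" "character"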
Primarily because the test in is.numeric() applies to the whole object (so returns a single value that says whether the entire object is numeric), while is.na() applies to individual elements of the object.
The next, subtler question (which you haven't asked yet but might ask next) is: why doesn't is.numeric() return TRUE, since all the elements of the data frame are numeric? It's because data frames are internally represented as lists, and could contain elements of different types (is.numeric(as.matrix(df)) does return TRUE).
str(df)
'data.frame': 3 obs. of 2 variables:
$ VarA: num 1 NA 5
$ VarB: num NA 2 7
The thing to consider is this: is.na tests each value that appears in a vector, whereas is.numeric checks the class of the object itself. It's apples-to-oranges in a sense. Think of it like this:
Is this object Not Available (NA)? Since it exists, check each value contained in the tested vectors. Is this object a number? Nope, it's a data.frame.
