data.table subset rows when there is a lubridate interval object column - r

I am getting an error message with a data.table that I do not understand. I have a main data.table that I subset by an ID variable. Once I have this second data.table, I want to subset it again by a vector of row indexes. Unfortunately, I cannot share my data and have not been able to reproduce the error with another data set. Sorry I cannot provide more detail than this. Can anyone tell what is going on from this limited info?
> class(auth)
[1] "data.table" "data.frame"
> x <- auth[ID == auth$ID[1]]
> x[, authInterval := interval(x$AUTH_DT, x$AUTH_END_DT)]
>
> # Find sequential auth intervals that overlap
> overlap <- sapply(1:(nrow(x) - 1), function(y) {
+ int_overlaps(x$authInterval[y], x$authInterval[y + 1])
+ })
>
> x[, overlap := c(NA, overlap)]
>
> # which two rows have overlap
> whichOverlap <- lapply(which(x$overlap), function(y) {c(y - 1, y)})
> whichOverlap
[[1]]
[1] 1 2
>
> x[unlist(whichOverlap)]
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
In addition: Warning messages:
1: In unclass(e1) + unclass(e2) :
longer object length is not a multiple of shorter object length
2: In cbind(ID = c("XXXXXX1", "XXXXXX2"), COMP_CD = c("280", "280"), :
number of rows of result is not a multiple of vector length (arg 1)
dim and dimnames output
> dim(x)
[1] 3 6
> dimnames(x)
[[1]]
NULL
[[2]]
[1] "ID" "COMP_CD" "AUTH_DT" "AUTH_END_DT" "authInterval" "overlap"
Based on the traceback, it seems the subsetting breaks things in print.data.table:
> traceback()
3: `rownames<-`(`*tmp*`, value = paste0(format(rn, right = TRUE,
scientific = FALSE), ":"))
2: print.data.table(x)
1: (function (x, ...)
UseMethod("print"))(x)
List of 1
$ :Classes ‘data.table’ and 'data.frame': 2 obs. of 6 variables:
..$ ID : num [1:2] XXXX1 XXXX2
..$ COMP_CD : chr [1:2] XXX XXX
..$ AUTH_DT : POSIXct[1:2], format: xxx xxx
..$ AUTH_END_DT : POSIXct[1:2], format: xxx xxx
..$ authInterval:Formal class 'Interval' [package "lubridate"] with 3 slots
.. .. ..@ .Data: num [1:2] 7776000 7776000
.. .. ..@ start: POSIXct[1:3], format: "xxx" "xxx" "xxx"
.. .. ..@ tzone: chr "UTC"
..$ overlap : logi [1:2] NA TRUE
..- attr(*, ".internal.selfref")=<externalptr>

As it turns out, and as this GitHub issue states, this is a bug in how data.table handles columns that are S4 objects. There is also a workaround given here: convert the S4 column to a list, so that each element is stored separately. In my case the following fixes the issue. Note that because the S4 column is now a list, I had to switch from [ to [[ when extracting elements.
x[, authInterval := interval(x$AUTH_DT, x$AUTH_END_DT)]
x[, authInterval := as.list(authInterval)]
# Find sequential auth intervals that overlap
overlap <- sapply(1:(nrow(x) - 1), function(y) {
int_overlaps(x$authInterval[[y]], x$authInterval[[y + 1]])
})
x[, overlap := c(NA, overlap)]
# which two rows have overlap
whichOverlap <- lapply(which(x$overlap), function(y) {c(y - 1, y)})
whichOverlap
x[unlist(whichOverlap)]
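If the interval column is only needed for the overlap test, another option is to skip the S4 column entirely and compare the start and end columns directly. The following is a minimal sketch under stated assumptions, not the approach from the GitHub issue: it assumes every AUTH_DT is no later than its AUTH_END_DT, and the dates are made up for illustration.

```r
library(data.table)

# Made-up data with the same column names as in the question
x <- data.table(
  ID          = "A",
  AUTH_DT     = as.POSIXct(c("2020-01-01", "2020-03-01", "2020-08-01"), tz = "UTC"),
  AUTH_END_DT = as.POSIXct(c("2020-03-31", "2020-05-30", "2020-10-30"), tz = "UTC")
)

# A row overlaps the previous row when each one starts before the other ends.
# shift() lags a column by one row, so the first row compares against NA.
x[, overlap := AUTH_DT <= shift(AUTH_END_DT) & shift(AUTH_DT) <= AUTH_END_DT]

# Rows involved in an overlap: each flagged row plus the row before it
idx <- which(x$overlap)
overlapping <- x[sort(unique(c(idx - 1L, idx)))]
```

Because no S4 object is stored in the table, subsetting and printing go through the ordinary atomic-column code paths.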

Related

How to display a list of lists in a nice way

I have a list of lists, such as below.
Each list (e.g. list1, list2, list3) has two attributes: Variable and Time
list1 <- list(c("Color", "Price"), "Quarter")
list2 <- list(c("Price"), "Month")
list3 <- list(c("Color"), "Month")
total <- list(list1, list2, list3)
when we print total, we'll see:
[[1]]
[[1]][[1]]
[1] "Color" "Price"
[[1]][[2]]
[1] "Quarter"
[[2]]
[[2]][[1]]
[1] "Price"
[[2]][[2]]
[1] "Month"
[[3]]
[[3]][[1]]
[1] "Color"
[[3]][[2]]
[1] "Month"
How can I turn it into a data frame such as this one?
EDIT: I am able to accomplish it using this code. Any better suggestion is appreciated!
num <- length(total)
max <- 0
for(i in 1:num) {
if(length(total[[i]][1]) > max) {
max <- length(total[[i]])
}
}
for(i in 1:num) {
length(total[[i]][[1]]) <- max
for(j in 1:max) {
if(is.null(total[[i]][[1]][[j]])) {
total[[i]][[1]][[j]] <- " "
}
}
}
df <- data.frame(matrix(unlist(total), nrow=num, byrow=T))
This isn't just a nested-list problem, it's a nested problem. If I'm interpreting things correctly, the fact that Color and Price are in one list and Quarter is in another is meaningful. So really, you should be looking at how to turn the first element of each list into a data.frame, repeat for all other elements, then join the results. (This is where @divibisan's and @camille's suggestions come into play ... reduce the problem, use the duplicates' code, then combine.)
(The fact that I believe you will never have more than two elems in each list is not strictly a factor. Below is a general way of handling 1-or-more, not just "always 2".)
Your data:
str(total)
# List of 3
# $ :List of 2
# ..$ : chr [1:2] "Color" "Price"
# ..$ : chr "Quarter"
# $ :List of 2
# ..$ : chr "Price"
# ..$ : chr "Month"
# $ :List of 2
# ..$ : chr "Color"
# ..$ : chr "Month"
What we need to do is break this down by element-of-each-list. (I'm assuming that there will be symmetry here.) Let's start by just working on the first elem of each:
total1 <- lapply(total, `[[`, 1)
str(total1)
# List of 3
# $ : chr [1:2] "Color" "Price"
# $ : chr "Price"
# $ : chr "Color"
In order to use the suggestions from the dupes, we need to know how much to pad them. That is, they need to be the same length.
( maxlen <- max(sapply(total1, function(l) length(unlist(l)))) )
# [1] 2
Now we pad them:
total1 <- lapply(total1, function(l) { length(l) <- maxlen; l; })
str(total1)
# List of 3
# $ : chr [1:2] "Color" "Price"
# $ : chr [1:2] "Price" NA
# $ : chr [1:2] "Color" NA
(You can start to see the structure break out here.) The dupes suggested cbinding them, but you want to rbind them:
do.call(rbind, total1)
# [,1] [,2]
# [1,] "Color" "Price"
# [2,] "Price" NA
# [3,] "Color" NA
Now this is a matrix, not a data.frame, but it's a start. Let's work with naming at the end. Let's write a function to do what we just did, and then we'll use it on each level of total.
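The definition of func is not shown in the text, although it is used further down. A plausible reconstruction, assuming it simply bundles the pad-and-rbind steps worked through above, might look like this (the body is an assumption, not the answerer's original code):

```r
# Hypothetical reconstruction of `func`: pad every element to a common
# length with NA, then stack the padded vectors as rows of a matrix.
func <- function(lst) {
  maxlen <- max(sapply(lst, function(l) length(unlist(l))))
  padded <- lapply(lst, function(l) { length(l) <- maxlen; l })
  do.call(rbind, padded)
}

# Check against the first elements of the example data
total <- list(list(c("Color", "Price"), "Quarter"),
              list("Price", "Month"),
              list("Color", "Month"))
res <- func(lapply(total, `[[`, 1))
```

Here res should match the 3-by-2 matrix obtained from do.call(rbind, total1) above.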
In order to do this, though, we need to modify total, so that the new first element has all first elements, new second has all seconds, etc.
newtotal <- lapply(seq_len(max(sapply(total, length))), function(i) lapply(total, `[[`, i))
str(newtotal)
# List of 2
# $ :List of 3
# ..$ : chr [1:2] "Color" "Price"
# ..$ : chr "Price"
# ..$ : chr "Color"
# $ :List of 3
# ..$ : chr "Quarter"
# ..$ : chr "Month"
# ..$ : chr "Month"
m <- do.call(cbind, lapply(newtotal, func))
m
# [,1] [,2] [,3]
# [1,] "Color" "Price" "Quarter"
# [2,] "Price" NA "Month"
# [3,] "Color" NA "Month"
So this last point is pretty much what you need, though as a matrix. From here, it's easy enough to name things:
m <- do.call(cbind, lapply(newtotal, func))
colnames(m) <- c(paste0("Var", seq_len(ncol(m)-1L)), "Time")
df <- as.data.frame(m)
df$List <- paste0('List', seq_len(nrow(df)))
df
# Var1 Var2 Time List
# 1 Color Price Quarter List1
# 2 Price <NA> Month List2
# 3 Color <NA> Month List3

Conditionally replacing elements in a list with a list in R

I am trying to do this
a <- list(1,2,3)
a[a == 2] <- list(1,2,3)
This gives me: "number of items to replace is not a multiple of replacement length". Generally speaking, I iteratively want to replace elements in a list of integers based on a condition with other lists of various lengths that depend on the integer in the original list.
The question did not state what result is desired but this works without warning or error replacing the second element of a with the indicated list.
a <- list(1, 2, 3)
a[a == 2] <- list(list(1,2,3))
giving:
> str(a)
List of 3
$ : num 1
$ :List of 3
..$ : num 1
..$ : num 2
..$ : num 3
$ : num 3
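If the intent was instead to splice the replacement into the list, so that the result stays flat, one possible sketch is to map every element to a list and then flatten one level. This rests on a guess about the desired result, since the question does not state it.

```r
a <- list(1, 2, 3)

# Map each element to a list: the replacement for matches, a length-1 list otherwise
expanded <- lapply(a, function(el) if (el == 2) list(1, 2, 3) else list(el))

# Flatten exactly one level so the replacement elements sit at the top level
a_flat <- unlist(expanded, recursive = FALSE)
str(a_flat)
```

The replacement inside the if branch could depend on el, which would cover replacements of varying length.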

casting to a data.frame in order to sort columns fails with unimplemented type list

Why does the final cast to a data.frame appear not to work? When I try to sort it I get: Error in order(temp[, 1], decreasing = T) : unimplemented type 'list' in 'orderVector1'
data<-lapply(1:5,function(i){
lapply(1:5,function(j){
list(i=i,j=j)
})
})
temp<-as.data.frame(data)
temp<-matrix(temp,ncol=2,byrow=T)
head(temp,20)
temp<-data.frame(temp)
class(temp) #####IS A DATA.FRAME
temp<-temp[order(temp[,1],decreasing=T),]
The columns in the OP's dataset are each a list of length 25. We can convert it to a normal data.frame with atomic vector columns.
temp1 <- data.frame(lapply(temp, unlist))
and then do the order
temp1[order(temp1[,1], decreasing = TRUE),]
It is easier to check the structure of the dataset with str
str(temp, list.len = 3)
#'data.frame': 25 obs. of 2 variables:
# $ X1:List of 25
# ..$ : int 1
# ..$ : int 1
# ..$ : int 1
# .. [list output truncated]
# $ X2:List of 25
# ..$ : int 1
# ..$ : int 2
# ..$ : int 3
# .. [list output truncated]
Also, we can directly get a data.frame with expand.grid
expand.grid(rep(list(1:5), 2))
Or using CJ from data.table
library(data.table)
CJ(1:5, 1:5)
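When the nested list really is the starting point, a further option is to flatten it one level and bind the pieces row-wise, which avoids the list columns altogether. A minimal sketch in base R, where the names pairs and sorted are chosen here for illustration:

```r
# The nested list from the question
data <- lapply(1:5, function(i) lapply(1:5, function(j) list(i = i, j = j)))

# Flatten one level: a list of 25 elements, each of the form list(i = , j = )
pairs <- unlist(data, recursive = FALSE)

# Turn each pair into a one-row data.frame and stack them
temp1 <- do.call(rbind, lapply(pairs, as.data.frame))

# The columns are now plain integer vectors, so order() works
sorted <- temp1[order(temp1$i, decreasing = TRUE), ]
head(sorted)
```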

Accessing particular cells within dataframes organized into list in R in a "vectorized" way

This is my first question here, sorry for possible mistakes.
I have a list "tt" of data frames that I got after streaming in a JSON file.
Some of the data frames are empty; others have a predefined structure. Here is an example:
> str(tt)
List of 2
$ :'data.frame': 0 obs. of 0 variables
$ :'data.frame': 2 obs. of 2 variables:
..$ key : chr [1:2] "issue_id" "letter_id"
..$ value: chr [1:2] "43" "223663"
> tt
[[1]]
data frame with 0 columns and 0 rows
[[2]]
key value
1 issue_id 43
2 letter_id 223663
I would like to get a column (e.g. named "t") with issue_id's out of "tt" structure, so that
t[1] = NA (or NULL)
t[2] = 43
I can do it accessing dataframes as a list elements like this
> tt[[1]][1,2]
NULL
> tt[[2]][1,2]
[1] "43"
How can I do this in a "vectorized" way? tried different things with no success like
> t <- tt[[]][1,2]
Error in tt[[]] : invalid subscript type 'symbol'
> t <- tt[][1,2]
Error in tt[][1, 2] : incorrect number of dimensions
> t <- tt[[]][1][2]
Error in tt[[]] : invalid subscript type 'symbol'
> t <- tt[][1][2]
> t
[[1]]
NULL
It should be something very simple I guess
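We can use lapply to loop over the list. Elements that are NULL or have zero rows are skipped (they yield NULL), and the 'value' for "issue_id" is extracted from the rest. Note the short-circuit ||: with the vectorized |, a NULL element would make the condition zero-length and if() would fail.
lapply(tt, function(x) if (!(is.null(x) || !nrow(x))) with(x, value[key == "issue_id"]))
As @MikeRSpencer mentioned in the comments, if we only need the first 'value':
sapply(tt, function(x) if (!(is.null(x) || !nrow(x))) x$value[1])
This returns a vector when every element yields a value; if any element is skipped (and so yields NULL), sapply returns a list instead.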
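To always get a plain character vector with NA for the empty data frames, as the question asks, one option is vapply with an explicit NA fallback. This is a sketch; the helper name get_issue_id and the result name t_ids are made up for illustration.

```r
# Example data mirroring the structure shown in the question
tt <- list(
  data.frame(),
  data.frame(key = c("issue_id", "letter_id"),
             value = c("43", "223663"),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: return the issue_id value, or NA when it is absent
get_issue_id <- function(x) {
  if (is.null(x) || nrow(x) == 0L) return(NA_character_)
  hit <- x$value[x$key == "issue_id"]
  if (length(hit) == 0L) NA_character_ else hit[1]
}

# vapply enforces one character value per element, so the result is a vector
t_ids <- vapply(tt, get_issue_id, character(1))
t_ids
```

Unlike sapply, vapply fails loudly if an element does not yield exactly one string, instead of silently returning a list.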

write a list as element in data frame in r

I would like to make an empty data frame with 2 columns, named "ngrams" and "pred".
df <- data.frame(nGrams=character(), pred = character(), stringsAsFactors=FALSE)
I need each element in column "pred" to be a vector of words, but if I initialize 'pred = list()' the data frame won't add that column in.
I tried
> pred
[1] "a" "the" "not" "that" "to" "an"
df[nrow(df)+1, ] <- c("is", pred)
Error in matrix(value, n, p) :
(converted from warning) data length [7] is not a sub-multiple or multiple of the number of columns [2]
df[nrow(df)+1, ] <- c("the", list(pred))
Error in `[<-.data.frame`(`*tmp*`, nrow(df) + 1, , value = list("the", :
(converted from warning) replacement element 2 has 6 rows to replace 1 rows
Can anyone show me what is the right way of doing it? Thanks in advance.
EDIT
I got the solution using data.table
dt <- data.table(nGrams = my_ngrams, pred = list_pred)
where list_pred is a list of lists. But it's still good to know the right way for data frame.
You could try
d1 <- data.frame(nGrams='the', pred=I(list(pred)))
str(d1)
#'data.frame': 1 obs. of 2 variables:
#$ nGrams: Factor w/ 1 level "the": 1
#$ pred :List of 1
# ..$ : chr "a" "the" "not" "that" ...
#..- attr(*, "class")= chr "AsIs"
Or using the empty dataframe df
df[nrow(df)+1,] <- list('is', list(pred))
str(df)
#'data.frame': 1 obs. of 2 variables:
#$ nGrams: chr "is"
#$ pred :List of 1
# ..$ : chr "a" "the" "not" "that" ...
where,
pred <- c('a', 'the', 'not', 'that', 'to', 'an')
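For more than one row, it can be simpler to build the atomic columns first and then assign the list column in one step. A minimal sketch, where the second prediction vector is made up for illustration:

```r
pred_is  <- c("a", "the", "not", "that", "to", "an")
pred_the <- c("cat", "dog")   # made-up values for a second row

# Start with the atomic column only
df <- data.frame(nGrams = c("is", "the"), stringsAsFactors = FALSE)

# Assigning a list with one element per row creates a list column
df$pred <- list(pred_is, pred_the)

str(df)
df$pred[[2]]
```

Each cell of pred then holds a whole character vector, retrievable with [[.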
