I would like to make an empty data frame with 2 columns, named "nGrams" and "pred".
df <- data.frame(nGrams=character(), pred = character(), stringsAsFactors=FALSE)
I need each element in column "pred" to be a vector of words, but if I initialize 'pred = list()' the data frame won't add that column in.
I tried
> pred
[1] "a" "the" "not" "that" "to" "an"
df[nrow(df)+1, ] <- c("is", pred)
Error in matrix(value, n, p) :
(converted from warning) data length [7] is not a sub-multiple or multiple of the number of columns [2]
df[nrow(df)+1, ] <- c("the", list(pred))
Error in `[<-.data.frame`(`*tmp*`, nrow(df) + 1, , value = list("the", :
(converted from warning) replacement element 2 has 6 rows to replace 1 rows
Can anyone show me what is the right way of doing it? Thanks in advance.
EDIT
I got the solution using data.table
dt <- data.table(nGrams = my_ngrams, pred = list_pred)
where list_pred is a list of lists. But it would still be good to know the right way to do this with a data frame.
You could try
d1 <- data.frame(nGrams='the', pred=I(list(pred)))
str(d1)
#'data.frame': 1 obs. of 2 variables:
#$ nGrams: Factor w/ 1 level "the": 1
#$ pred :List of 1
# ..$ : chr "a" "the" "not" "that" ...
#..- attr(*, "class")= chr "AsIs"
Or using the empty dataframe df
df[nrow(df)+1,] <- list('is', list(pred))
str(df)
#'data.frame': 1 obs. of 2 variables:
#$ nGrams: chr "is"
#$ pred :List of 1
# ..$ : chr "a" "the" "not" "that" ...
where,
pred <- c('a', 'the', 'not', 'that', 'to', 'an')
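If all the n-grams and their predictions are known up front, another option is to attach the list column in one step instead of growing the data frame row by row. This is only a minimal sketch with made-up example values; the names ngrams and preds are not from the question.

```r
# Made-up example values: one n-gram per row, one word vector per n-gram.
ngrams <- c("is", "the")
preds  <- list(c("a", "the"), c("not", "that", "to"))

# Build the ordinary column first, then attach the list column.
# The list must have exactly one element per row.
df <- data.frame(nGrams = ngrams, stringsAsFactors = FALSE)
df$pred <- preds

# Each cell of 'pred' is now a character vector.
df$pred[[2]]
lengths(df$pred)
```

Each element of df$pred can have a different length, which is what a column of word vectors needs.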
Related
I am getting an error message with a data.table that I do not understand. I have a main data.table that I have subset based on an ID variable. Once I have this second data.table, I want to subset it again by a vector of row indexes. Unfortunately, I cannot share my data and have not been able to reproduce the error with another data set. Sorry I cannot provide more detail than this. Can anyone tell what is going on with this limited info?
> class(auth)
[1] "data.table" "data.frame"
> x <- auth[ID == auth$ID[1]]
> x[, authInterval := interval(x$AUTH_DT, x$AUTH_END_DT)]
>
> # Find sequential auth intervals that overlap
> overlap <- sapply(1:(nrow(x) - 1), function(y) {
+ int_overlaps(x$authInterval[y], x$authInterval[y + 1])
+ })
>
> x[, overlap := c(NA, overlap)]
>
> # which two rows have overlap
> whichOverlap <- lapply(which(x$overlap), function(y) {c(y - 1, y)})
> whichOverlap
[[1]]
[1] 1 2
>
> x[unlist(whichOverlap)]
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
In addition: Warning messages:
1: In unclass(e1) + unclass(e2) :
longer object length is not a multiple of shorter object length
2: In cbind(ID = c("XXXXXX1", "XXXXXX2"), COMP_CD = c("280", "280"), :
number of rows of result is not a multiple of vector length (arg 1)
dim and dimnames output
> dim(x)
[1] 3 6
> dimnames(x)
[[1]]
NULL
[[2]]
[1] "ID" "COMP_CD" "AUTH_DT" "AUTH_END_DT" "authInterval" "overlap"
Based on the traceback, it seems like the subsetting breaks something that print.data.table relies on
> traceback()
3: `rownames<-`(`*tmp*`, value = paste0(format(rn, right = TRUE,
scientific = FALSE), ":"))
2: print.data.table(x)
1: (function (x, ...)
UseMethod("print"))(x)
List of 1
$ :Classes ‘data.table’ and 'data.frame': 2 obs. of 6 variables:
..$ ID : num [1:2] XXXX1 XXXX2
..$ COMP_CD : chr [1:2] XXX XXX
..$ AUTH_DT : POSIXct[1:2], format: xxx xxx
..$ AUTH_END_DT : POSIXct[1:2], format: xxx xxx
..$ authInterval:Formal class 'Interval' [package "lubridate"] with 3 slots
.. .. ..# .Data: num [1:2] 7776000 7776000
.. .. ..# start: POSIXct[1:3], format: "xxx" "xxx" "xxx"
.. .. ..# tzone: chr "UTC"
..$ overlap : logi [1:2] NA TRUE
..- attr(*, ".internal.selfref")=<externalptr>
So as it turns out, as this GitHub issue states, this is a bug in how data.table handles columns that are S4 objects. The issue also gives a workaround: turn the S4 column into a list, so that each element is its own S4 object. In my case the following fixes the issue. Notice that since the S4 column is now a list, I had to change from using [ to [[.
x[, authInterval := interval(x$AUTH_DT, x$AUTH_END_DT)]
x[, authInterval := as.list(authInterval)]
# Find sequential auth intervals that overlap
overlap <- sapply(1:(nrow(x) - 1), function(y) {
int_overlaps(x$authInterval[[y]], x$authInterval[[y + 1]])
})
x[, overlap := c(NA, overlap)]
# which two rows have overlap
whichOverlap <- lapply(which(x$overlap), function(y) {c(y - 1, y)})
whichOverlap
x[unlist(whichOverlap)]
I am using S3 methods in the following way.
First, I look for all the tasks common to the classes I have programmed and put that code only once, before "UseMethod". Then I program the rest for each class.
The problem arises when I modify an argument before dispatch: it seems the change is not passed on to the method. Other tasks, such as checking arguments or defining sub-functions, work well in this scheme.
In the next example, I modify an argument:
secure_filter <- function(table, col, value){
if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
message("converting to numeric")
print(str(table))
UseMethod("secure_filter", table)
}
secure_filter.data.table <- function(
table, col, value
){
return(table[ col == value,])
}
secure_filter.data.frame <- function(
table, col, value
){
return(table[table$col == !!value,])
}
and the result is wrong
df <- data.frame(a=c("a", "b", "c"), column = c("1", "2", "3"))
dt <- as.data.table(df)
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
Empty data.table (0 rows and 2 cols): a,column
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
[1] a column
<0 rows> (or 0-length row.names)
Am I using S3 well? How do I save repeated code between S3 classes?
Any example in a well known R function?
I am using this approach to re-program tidyverse scripts to data.table scripts.
Use NextMethod instead of UseMethod.
secure_filter <- function(table, col, value){
if(!is.numeric(table[[col]])) table[[col]] <- as.numeric(table[[col]])
message("converting to numeric")
print(str(table))
NextMethod("secure_filter")
#UseMethod("secure_filter", table)
}
> secure_filter(dt, "column", 1)
converting to numeric
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
NULL
a column
1: a 1
> secure_filter(df, "column", 1)
converting to numeric
'data.frame': 3 obs. of 2 variables:
$ a : chr "a" "b" "c"
$ column: num 1 2 3
NULL
a column
1 a 1
But I don't know whether this answer is well behaved, because sloop neither shows the dispatched method nor finds the generic.
> sloop::s3_dispatch(secure_filter(dt, "column", 1))
secure_filter.data.table
secure_filter.data.frame
secure_filter.default
> sloop::s3_get_method(secure_filter)
Error: Could not find generic
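Another way to avoid repeating code, without touching arguments before dispatch, is to keep the generic as a bare UseMethod call and move the shared step into a helper that each method calls. As far as I understand, UseMethod dispatches with the arguments as they came in to the generic, so changes made inside the generic do not reach the method. The sketch below is only an illustration: prepare_numeric is a hypothetical helper name, only the data.frame method is shown, and the filter is written with [[col]] so that the column name held in col is actually used.

```r
# Hypothetical helper holding the code shared by all methods.
prepare_numeric <- function(table, col) {
  if (!is.numeric(table[[col]])) {
    message("converting to numeric")
    table[[col]] <- as.numeric(table[[col]])
  }
  table
}

# The generic only dispatches; it does not modify its arguments.
secure_filter <- function(table, col, value) {
  UseMethod("secure_filter")
}

# Each method calls the shared helper first, then does its class-specific work.
secure_filter.data.frame <- function(table, col, value) {
  table <- prepare_numeric(table, col)
  table[table[[col]] == value, , drop = FALSE]
}

df <- data.frame(a = c("a", "b", "c"), column = c("1", "2", "3"),
                 stringsAsFactors = FALSE)
res <- secure_filter(df, "column", 1)
res
```

A data.table method could call the same helper and then filter with data.table syntax. With the generic written this way, tools that look for a UseMethod call should be able to recognise it as a generic.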
I have a list of lists, such as below.
Each list (e.g. list1, list2, list3) has two attributes: Variable and Time
list1 <- list(c("Color", "Price"), "Quarter")
list2 <- list(c("Price"), "Month")
list3 <- list(c("Color"), "Month")
total <- list(list1, list2, list3)
when we print total, we'll see:
[[1]]
[[1]][[1]]
[1] "Color" "Price"
[[1]][[2]]
[1] "Quarter"
[[2]]
[[2]][[1]]
[1] "Price"
[[2]][[2]]
[1] "Month"
[[3]]
[[3]][[1]]
[1] "Color"
[[3]][[2]]
[1] "Month"
How can I turn it into a data frame such as this one?
EDIT: I am able to accomplish it using this code. Any better suggestion is appreciated!
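num <- length(total)
# length of the longest "Variable" vector across all lists
maxlen <- 0
for (i in seq_len(num)) {
  if (length(total[[i]][[1]]) > maxlen) {
    maxlen <- length(total[[i]][[1]])
  }
}
for (i in seq_len(num)) {
  # extending the length pads the vector with NA
  length(total[[i]][[1]]) <- maxlen
  for (j in seq_len(maxlen)) {
    if (is.na(total[[i]][[1]][[j]])) {
      total[[i]][[1]][[j]] <- " "
    }
  }
}
df <- data.frame(matrix(unlist(total), nrow = num, byrow = TRUE))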
This isn't just a nested-list problem, it's a nested problem. If I'm interpreting things correctly, the fact that Color and Price are in one list and Quarter is in another is meaningful. So really, you should be looking at how to turn the first element of each list into a data.frame, repeat for all other elements, then join the results. (This is where @divibisan's and @camille's suggestions come into play: reduce the problem, use the duplicates' code, then combine.)
(The fact that I believe you will never have more than two elements in each list is not strictly a factor. Below is a general way of handling 1-or-more, not just "always 2".)
Your data:
str(total)
# List of 3
# $ :List of 2
# ..$ : chr [1:2] "Color" "Price"
# ..$ : chr "Quarter"
# $ :List of 2
# ..$ : chr "Price"
# ..$ : chr "Month"
# $ :List of 2
# ..$ : chr "Color"
# ..$ : chr "Month"
What we need to do is break this down by element-of-each-list. (I'm assuming that there will be symmetry here.) Let's start by just working on the first elem of each:
total1 <- lapply(total, `[[`, 1)
str(total1)
# List of 3
# $ : chr [1:2] "Color" "Price"
# $ : chr "Price"
# $ : chr "Color"
In order to use the suggestions from the dupes, we need to know how much to pad them. That is, they need to be the same length.
( maxlen <- max(sapply(total1, function(l) length(unlist(l)))) )
# [1] 2
Now we pad them:
total1 <- lapply(total1, function(l) { length(l) <- maxlen; l; })
str(total1)
# List of 3
# $ : chr [1:2] "Color" "Price"
# $ : chr [1:2] "Price" NA
# $ : chr [1:2] "Color" NA
(You can start to see the structure break out here.) The dupes suggested cbinding them, but you want to rbind them:
do.call(rbind, total1)
# [,1] [,2]
# [1,] "Color" "Price"
# [2,] "Price" NA
# [3,] "Color" NA
Now this is a matrix, not a data.frame, but it's a start. Let's work with naming at the end. Let's write a function to do what we just did, and then we'll use it on each level of total.
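The definition of func is not shown here. A minimal sketch that matches the padding and rbind steps just described could look like this (the body is an assumption reconstructed from those steps, not necessarily the exact original):

```r
# Pad every vector in a list to the length of the longest one,
# then stack the padded vectors as the rows of a matrix.
func <- function(x) {
  maxlen <- max(sapply(x, function(l) length(unlist(l))))
  padded <- lapply(x, function(l) { length(l) <- maxlen; l })
  do.call(rbind, padded)
}

# Same shape as total1 above.
total1 <- list(c("Color", "Price"), "Price", "Color")
m1 <- func(total1)
m1
```

Applied to a list whose elements all have length 1, the same function returns a one-column matrix, so the results can be combined with cbind.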
In order to do this, though, we need to modify total, so that the new first element has all first elements, new second has all seconds, etc.
newtotal <- lapply(seq_len(max(sapply(total, length))), function(i) lapply(total, `[[`, i))
str(newtotal)
# List of 2
# $ :List of 3
# ..$ : chr [1:2] "Color" "Price"
# ..$ : chr "Price"
# ..$ : chr "Color"
# $ :List of 3
# ..$ : chr "Quarter"
# ..$ : chr "Month"
# ..$ : chr "Month"
m <- do.call(cbind, lapply(newtotal, func))
m
# [,1] [,2] [,3]
# [1,] "Color" "Price" "Quarter"
# [2,] "Price" NA "Month"
# [3,] "Color" NA "Month"
So this last point is pretty much what you need, though as a matrix. From here, it's easy enough to name things:
m <- do.call(cbind, lapply(newtotal, func))
colnames(m) <- c(paste0("Var", seq_len(ncol(m)-1L)), "Time")
df <- as.data.frame(m)
df$List <- paste0('List', seq_len(nrow(df)))
df
# Var1 Var2 Time List
# 1 Color Price Quarter List1
# 2 Price <NA> Month List2
# 3 Color <NA> Month List3
I have a 2 column data frame (DF) of which one column contains vectors and the other is characters.
Orig. Matched
AbcD c("ab.d","Acbd","AA.D","")
jKdf c("JJf.","K.dF","JkD.","")
My aim is to strip all the punctuation marks (commas and periods) as well as make everything lowercase. This is easy enough for the character column, but the vector column is more challenging.
Some lower case methods I tried using are
lapply(DF, tolower). This causes the data frame to convert to a matrix. In doing so I lose the column of vectors structure.
In regards to the punctuation, I tried
gsub("\\.", "", DF) and
gsub("\\,", "", DF) to remove the periods and commas respectively.
This causes the data frame to convert to a character list.
I guess my questions are as follows:
Is there another way to remove punctuation and convert to lowercase that preserves the data frame structure?
If not, how might I convert the above outputs back into the original format, that being a column of vectors?
I'm sure there are other ways to get this done but here's an example that works pretty well:
DF = data.frame(a = c("JJf.","K.dF","JkD.",""), b = c("ab.d","Acbd","AA.D",""))
DF2 = as.data.frame(lapply(X = DF, FUN = tolower))
DF2$a = gsub(pattern = "\\.",replacement = "", x = DF2$a)
Data frames are just special cases of lists where all the elements have the same length, so coercion back and forth isn't usually a problem.
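To make the list/data frame relationship concrete, here is a small sketch with a cut-down version of the example data. The name as_list is only illustrative. It shows that lapply on a data frame gives back a plain list, and that assigning that list into DF[] keeps the data frame.

```r
DF <- data.frame(a = c("JJf.", "K.dF"), b = c("ab.d", "Acbd"),
                 stringsAsFactors = FALSE)

# lapply walks over the columns and returns a plain list, one element per column.
as_list <- lapply(DF, tolower)
class(as_list)

# Assigning into DF[] replaces the column contents but keeps the data.frame.
DF[] <- as_list
class(DF)
```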
From your description, it sounds like you have some data that looks like:
mydf <- data.frame(Orig = c("AbcD", "jKdf"),
Matched = I(list(c("ab.d","Ac,bd","AA.D",""),
c("JJf.","K.dF","JkD.",""))))
mydf
# Orig Matched
# 1 AbcD ab.d, Ac....
# 2 jKdf JJf., K.....
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : Factor w/ 2 levels "AbcD","jKdf": 1 2
# $ Matched:List of 2
# ..$ : chr "ab.d" "Ac,bd" "AA.D" ""
# ..$ : chr "JJf." "K.dF" "JkD." ""
# ..- attr(*, "class")= chr "AsIs"
Usually, if you want to replace data while maintaining the same structure, you replace with [], like this:
mydf[] <- lapply(mydf, function(x) {
if (is.list(x)) {
lapply(x, function(y) {
tolower(gsub("[.,]", "", y))
})
} else {
tolower(gsub("[.,]", "", x))
}
})
Here's the result:
mydf
# Orig Matched
# 1 abcd abd, acbd, aad,
# 2 jkdf jjf, kdf, jkd,
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : chr "abcd" "jkdf"
# $ Matched:List of 2
# ..$ : chr "abd" "acbd" "aad" ""
# ..$ : chr "jjf" "kdf" "jkd" ""
This is my first question here, sorry for possible mistakes.
I have got a list of data frames, "tt", after streaming in a JSON file.
Some of the data frames are empty, some have a predefined structure. Here is an example:
> str(tt)
List of 2
$ :'data.frame': 0 obs. of 0 variables
$ :'data.frame': 2 obs. of 2 variables:
..$ key : chr [1:2] "issue_id" "letter_id"
..$ value: chr [1:2] "43" "223663"
> tt
[[1]]
data frame with 0 columns and 0 rows
[[2]]
key value
1 issue_id 43
2 letter_id 223663
I would like to get a column (e.g. named "t") with issue_id's out of "tt" structure, so that
t[1] = NA (or NULL)
t[2] = 43
I can do it accessing dataframes as a list elements like this
> tt[[1]][1,2]
NULL
> tt[[2]][1,2]
[1] "43"
How can I do this in a "vectorized" way? I tried different things with no success, like
> t <- tt[[]][1,2]
Error in tt[[]] : invalid subscript type 'symbol'
> t <- tt[][1,2]
Error in tt[][1, 2] : incorrect number of dimensions
> t <- tt[[]][1][2]
Error in tt[[]] : invalid subscript type 'symbol'
> t <- tt[][1][2]
> t
[[1]]
NULL
It should be something very simple I guess
We can use lapply to loop over the list. Some elements may be NULL or have zero rows, so we skip those and extract the 'value' from the other elements.
lapply(tt, function(x) if(!(is.null(x) || !nrow(x))) with(x, value[key=="issue_id"]))
As @MikeRSpencer mentioned in the comments, if we need to extract the first 'value'
sapply(tt, function(x) if(!(is.null(x) || !nrow(x))) x$value[1])
This returns a vector only when every element yields a value; for the skipped elements the function returns NULL, and sapply then gives back a list.
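If a plain vector with NA for the empty data frames is wanted, as in the question, one option is vapply with an explicit NA fallback. A minimal sketch, rebuilding tt by hand from the str() output in the question (the name issue_ids is only illustrative):

```r
# Rebuild the example list: one empty data frame, one key/value data frame.
tt <- list(
  data.frame(),
  data.frame(key = c("issue_id", "letter_id"),
             value = c("43", "223663"),
             stringsAsFactors = FALSE)
)

# vapply guarantees a character vector with one entry per list element.
issue_ids <- vapply(tt, function(x) {
  if (is.null(x) || nrow(x) == 0) return(NA_character_)
  hit <- x$value[x$key == "issue_id"]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

issue_ids
```

Because vapply checks that every result is a single character value, a missing issue_id shows up as NA instead of silently changing the shape of the result.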