From my simple data.table, created for example like this:
dt1 <- fread("
col1 col2 col3
AAA ab cd
BBB ef gh
BBB ij kl
CCC mn nm")
I am making a new table, for example, like this:
dt1[,
.(col3, new=.N),
by=col1]
> col1 col3 new
>1: AAA cd 1
>2: BBB gh 2
>3: BBB kl 2
>4: CCC  nm   1
This works fine when I indicate column names explicitly. But when I have them in variables and try to use with=F, I get an error:
colBy <- 'col1'
colShow <- 'col3'
dt1[,
.(colShow, 'new'=.N),
by=colBy,
with=F]
# Error in `[.data.table`(dt1, , .(colShow, new = .N), by = colBy, with = F) : object 'ansvals' not found
I could not find any information about this error so far.
The reason you are getting this error message is that with=FALSE tells data.table to treat j as it would be treated in a data.frame. It therefore expects a vector of column names, not an expression such as new=.N to be evaluated in j.
From the documentation of ?data.table about with:
By default with=TRUE and j is evaluated within the frame of x; column
names can be used as variables. When with=FALSE j is a character
vector of column names or a numeric vector of column positions to
select, and the value returned is always a data.table.
When you use with=FALSE, you have to select the column names in j without the .() wrapper, like this: dt1[, (colShow), with=FALSE]. Other options are dt1[, c(colShow), with=FALSE] or dt1[, colShow, with=FALSE]. The same result can be obtained with dt1[, .(col3)].
To sum up: with = FALSE is used to select columns the data.frame way, so you should select them that way.
Also, by using by = colBy you are telling data.table to evaluate j, which contradicts with = FALSE.
From the documentation of ?data.table about j:
A single column name, single expression of column names, list() of
expressions of column names, an expression or function call that
evaluates to list (including data.frame and data.table which are
lists, too), or (when with=FALSE) a vector of names or positions to
select.
j is evaluated within the frame of the data.table; i.e., it
sees column names as if they are variables. Use j=list(...) to return
multiple columns and/or expressions of columns. A single column or
single expression returns that type, usually a vector. See the
examples.
See also points 1.d and 1.g of the introduction vignette of data.table.
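A quick sketch of selection variants that do work when the column name sits in a variable (the ..prefix shorthand assumes a reasonably recent data.table, >= 1.10.2):

```r
library(data.table)

dt1 <- fread("
col1 col2 col3
AAA ab cd
BBB ef gh
BBB ij kl
CCC mn nm")

colShow <- 'col3'

# All three return a one-column data.table containing col3:
s1 <- dt1[, colShow, with = FALSE]
s2 <- dt1[, (colShow), with = FALSE]
s3 <- dt1[, ..colShow]  # ..prefix shorthand for "look up colShow outside dt1"
```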
ansvals is a name used in data.table internals. You can see where it appears by searching for it in the data.table source on GitHub (ctrl+f on Windows, cmd+f on macOS).
The error object 'ansvals' not found looks like a bug to me. It should either be a helpful message or just work. I've filed issue #1440 linking back to this question, thank you.
Jaap is completely correct. Following on from his answer, you can use get() in j like this:
dt1
# col1 col2 col3
#1: AAA ab cd
#2: BBB ef gh
#3: BBB ij kl
#4: CCC mn nm
colBy
#[1] "col1"
colShow
#[1] "col3"
dt1[,.(get(colShow),.N),by=colBy]
# col1 V1 N
#1: AAA cd 1
#2: BBB gh 2
#3: BBB kl 2
#4: CCC nm 1
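Note the result column comes back as V1/N. You can name the columns yourself in j, and if you're on data.table >= 1.14.2 you can also use the env argument, which substitutes variable names into the query before it is evaluated (the env example is a sketch gated on that version):

```r
library(data.table)

dt1 <- fread("
col1 col2 col3
AAA ab cd
BBB ef gh
BBB ij kl
CCC mn nm")

colBy   <- 'col1'
colShow <- 'col3'

# Name the j columns explicitly so you don't end up with V1/N:
res1 <- dt1[, .(col3 = get(colShow), new = .N), by = colBy]

# Alternative on data.table >= 1.14.2: the env argument substitutes
# `show` and `grp` with the column names held in the variables.
if (utils::packageVersion("data.table") >= "1.14.2") {
  res2 <- dt1[, .(show, new = .N), by = grp,
              env = list(show = colShow, grp = colBy)]
}
```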
I need to remove from a data.table any row in which the list column a contains an NA anywhere in its nested vector:
library(data.table)
a = list(as.numeric(c(NA,NA)), 2,as.numeric(c(3, NA)), c(4,5) )
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.
You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
This alternative also works, but you will get warnings
dt[sapply(dt$a, all)]
Better approach (thanks to r2evans, see comments)
dt[!sapply(a,anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
A third option that you might prefer: move the functionality to a separate helper function that ingests a list of vectors (nl) and returns a logical vector of length length(nl), then apply that function as below. In this example I explicitly call unlist() on the result of lapply() rather than letting sapply() do that for me, but I could also have used sapply().
f <- \(nl) unlist(lapply(nl,\(l) !any(is.na(l))))
dt[f(a)]
An alternative to *apply()
dt[, .SD[!anyNA(a, TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14
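If you want a type-safe variant of the sapply() approaches above, vapply() pins the return value to one logical per list element and errors early if the callback ever returns anything else:

```r
library(data.table)

a <- list(as.numeric(c(NA, NA)), 2, as.numeric(c(3, NA)), c(4, 5))
b <- 11:14
dt <- data.table(a, b)

# vapply(..., logical(1)) guarantees a logical scalar per element of a:
res <- dt[!vapply(a, anyNA, logical(1))]
```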
This is my first question; please let me know in the comments if more info or background is needed.
Many answers here and elsewhere deal with calling lapply inside a data.table. I want to do the opposite, which on paper should be easy, lapply(list.of.dfs, function(x) ...), but I can't get it to work with data.table operations.
I have a list that contains several data.frames with the same columns but differing numbers of rows. This comes from the output of several simulation scenarios, so they must be treated separately and not rbinded together.
#sample list of data.frames
scenarios <- replicate(5, data.frame(a=sample(letters[1:4],10,T),
b=sample(1:2,10,T),
x=sample(1:10, 10),
y =runif(10)), simplify = FALSE)
I want to add a column to every element that is the sum of x/y by a and b.
From the data.table documentation in the examples section the process to do this for one data.frame is the following (search: add new column by reference by group in the doc page):
test <- as.data.table(scenarios[[1]]) #must specify data.table class
test[, newcol := sum(x/y), by = .(a , b)][]
I want to use lapply to do the same thing to every element in the scenarios list and return the list.
My most recent attempt:
lapply(scenarios, function(i) {as.data.table(i[, z := sum(x/y), by=.(a,b)]); i})
but I keep getting the error unused argument (by = .(a, b))
After poring over the results of this and other sites, I have been unable to solve this problem, which I'm fairly sure means there is something I don't understand about calling anonymous functions and/or using data.table. Is this one of those cases where you use [ as the function? Or possibly my as.data.table is out of place.
This answer was a step in the right direction (I think); it covers the use of function(x) {... ; x} to apply an anonymous function and return x.
Thanks!
You can use setDT here instead.
scenarios <- lapply(scenarios, function(i) setDT(i)[, z := sum(x/y), by=.(a,b)])
scenarios[[1]]
a b x y z
1: c 2 2 0.87002174 2.298793
2: b 2 10 0.19720775 78.611837
3: b 2 8 0.47041670 78.611837
4: b 2 4 0.36705023 78.611837
5: a 1 5 0.78922686 12.774035
6: a 1 6 0.93186209 12.774035
7: b 1 3 0.83118438 3.609307
8: c 1 1 0.08248658 30.047494
9: c 1 7 0.89382050 30.047494
10: c 1 9 0.89172831 30.047494
Using as.data.table, the syntax would be
scenarios <- lapply(scenarios, function(i) {i <- as.data.table(i); i[, z := sum(x/y),
by=.(a,b)]})
but this wouldn't be recommended as it will create an additional copy, which is avoided by setDT.
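To convince yourself the grouped update did what you expect, here is a quick check (the data is randomly generated, so the values differ from the output above):

```r
library(data.table)

set.seed(7)
scenarios <- replicate(5, data.frame(a = sample(letters[1:4], 10, TRUE),
                                     b = sample(1:2, 10, TRUE),
                                     x = sample(1:10, 10),
                                     y = runif(10)), simplify = FALSE)

scenarios <- lapply(scenarios, function(i) setDT(i)[, z := sum(x/y), by = .(a, b)])

# Every (a, b) group should now carry a single z value: the group's sum of x/y.
chk <- scenarios[[1]][, uniqueN(z), by = .(a, b)]
```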
I am trying to use data.table where my j function could and will return a different number of columns on each call. I would like it to behave like rbind.fill in that it fills any missing columns with NA.
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
In this case 'result' may end up with two columns, A and B. 'A' and 'B' were returned as part of the first call to 'fetch' and only 'B' as part of the second. I would like the example code to return this result.
id A B
1 1 a b
2 2 <NA> b
Unfortunately, when run, I get this error:
Error in `[.data.table`(data, , fetch(.BY), by = id) :
  j doesn't evaluate to the same number of columns for each group
I can do this with plyr as follows, but in my real-world use case plyr is running out of memory. Each call to fetch runs quickly; the crash happens when plyr tries to merge all of the data back together. I am trying to see if data.table might solve this problem for me.
result <- ddply(data, "id", fetch)
Any thoughts appreciated.
DWin's approach is good. Or you could return a list column instead, where each cell is itself a vector. That's generally a better way of handling variable-length vectors.
DT = data.table(A=rep(1:3,1:3),B=1:6)
DT
A B
1: 1 1
2: 2 2
3: 2 3
4: 3 4
5: 3 5
6: 3 6
ans = DT[, list(list(B)), by=A]
ans
A V1
1: 1 1
2: 2 2,3 # V1 is a list column. These aren't strings, the
3: 3 4,5,6 # vectors just display with commas
ans$V1[3]
[[1]]
[1] 4 5 6
ans$V1[[3]]
[1] 4 5 6
ans[,sapply(V1,length)]
[1] 1 2 3
So in your example you could use this as follows:
library(plyr)
rbind.fill(data[, list(list(fetch(.BY))), by = id]$V1)
# A B
#1 a b
#2 <NA> b
Or just make the returned list conformant:
allcols = c("A","B")
fetch <- function(by) {
if(by == 1)
list(A=c("a"), B=c("b"))[allcols]
else
list(B=c("b"))[allcols]
}
Here are two approaches. The first roughly follows your strategy:
data[,list(A=if(.BY==1) 'a' else NA_character_,B='b'), by=id]
And the second does things in two steps:
DT <- copy(data)[,`:=`(A=NA_character_,B='b')][id==1,A:='a']
Using a by just to check for a single value seems wasteful (maybe computationally, but also in terms of clarity); of course, it could be that your application isn't really like that.
Try
data.table(A=NA, B=c("b"))
@NickAllen: I'm not sure from the comments whether you understood my suggestion. (I was posting from a mobile phone that limited my cut-paste capabilities, and I suspect my wife was telling me to stop texting to SO or she would divorce me.) What I meant was this:
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(A=NA, B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
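For completeness, a runnable sketch of this approach: I unwrap .BY with by[[1]] (since .BY is a one-element list) and type the NA as NA_character_ so both branches return the same column types:

```r
library(data.table)

fetch <- function(by) {
  if (by[[1]] == 1)
    data.table(A = "a", B = "b")
  else
    data.table(A = NA_character_, B = "b")  # NA typed to match the other branch
}

data <- data.table(id = c(1, 2))
# j now returns the same two columns for every group, so this works:
result <- data[, fetch(.BY), by = id]
```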
The following code fails
library(data.table)
library(lubridate)

a = data.table(date=seq(ymd('2001-6-30'),ymd('2003-6-30'),by='weeks'))
a = a[,list(date=date,a=rnorm(105),b=rnorm(105))]
b = seq(ymd('2001-6-30'),ymd('2001-07-28'),by='weeks')
a[date %in% b]
with the message
Empty data.table (0 rows) of 3 cols: date,a,b
Can anyone help identify what I'm doing wrong? It should find the data.
Nothing to do with lubridate.
Your issue is scoping. You have a column b in your data.table. data.table looks first inside the data.table and then up along the search path; it cannot tell that you want the b from the parent frame.
So, rename your vector in the parent (global) environment
B <- b
a[date %in% B]
date a b
1: 2001-06-30 -1.89904968 0.9230171
2: 2001-07-07 0.08599561 -0.0440927
3: 2001-07-14 -0.28606686 0.4649957
4: 2001-07-21 0.39191680 0.2907855
5: 2001-07-28 0.18732463 -0.1743267
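If renaming isn't convenient, you can also take the lookup out of the data.table call entirely, so that b resolves in the calling environment rather than to the column. A sketch (base as.Date stands in for lubridate's ymd(), and the random values will differ from yours):

```r
library(data.table)

set.seed(42)
a <- data.table(date = seq(as.Date('2001-06-30'), as.Date('2003-06-30'), by = 'week'))
a <- a[, list(date = date, a = rnorm(.N), b = rnorm(.N))]
b <- seq(as.Date('2001-06-30'), as.Date('2001-07-28'), by = 'week')

# Evaluating the condition outside [.data.table means `b` is the vector,
# not the column, so the five matching weeks are found:
sel <- a$date %in% b
res <- a[sel]
```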
I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns 1 and 2. Column 1 (col1) contains characters a to z and col2 has values associated with each character (from a to z)
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1 so: aa ab ac ad etc; where the first letter is in col1 and the second letter is in col2
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1
The first step I wanted to do was to transfer values from DF1$col2 to DF2$col3 (values in DF2$col3 should correspond to the characters in DF2$col1), but that's where I'm stuck. I currently have:
for(j in 1:length(DF2$col1))
{
## this part is to use the characters in DF2$col1 as an input
## to yield the output for DF2$col3--
input=c(DF2$col1)[j]
## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
g=DF1[(DF1$col2==input),"pred"]
## This is so that the values will fill in DF2$col3--
DF2$col3=g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this as @Aaron suggested in his comment above, but if you insist on writing your own loop, then the problem is in your last line: you assign g to the whole col3 column. You should use the j index there too, like:
for(j in 1:length(DF2$col1))
{
DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work out, then please also post some sample data so we can help in more detail (I do not know, but have a guess, what "pred" could be).
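The merge approach mentioned at the start could look like this (a sketch with letters and 1:26 standing in for the asker's actual values):

```r
# Stand-in data in the shape the question describes:
DF1 <- data.frame(col1 = letters, col2 = 1:26, stringsAsFactors = FALSE)
DF2 <- expand.grid(col1 = letters, col2 = letters, stringsAsFactors = FALSE)

# Left-join DF1's values onto DF2 by the shared letter column. The clashing
# col2 from DF1 gets the ".DF1" suffix, which we then rename to col3:
DF2m <- merge(DF2, DF1, by = "col1", all.x = TRUE, suffixes = c("", ".DF1"))
names(DF2m)[names(DF2m) == "col2.DF1"] <- "col3"
```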
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17