Rowwise iteration on a data.table, again

There are at least a couple of Q&As similar to this one, but I can't seem to get the hang of it. Here's a reproducible example; DT holds the data. I want food(n) = food(n-1) * xRatio.food(n):
DT <- fread("year c_Crust xRatio.c_Crust
X2005 0.01504110 NA
X2010 NA 0.9883415
X2015 NA 1.0685221
X2020 NA 1.0664189
X2025 NA 1.0348418
X2030 NA 1.0370386
X2035 NA 1.0333771
X2040 NA 1.0165511
X2045 NA 1.0010563
X2050 NA 1.0056368")
The code that gets closest to the formula is
DT[,res := food[1] * cumprod(xRatio.food[-1])]
but the res values are shifted up a row, and the first value is recycled to the last row with a warning. I want the first value of res to be NA.

I'd rename/reshape...
myDT = melt(DT, id = "year", meas=list(2,3),
variable.name = "food",
value.name = c("value", "xRatio"))[, food := "c_Crust"][]
# or for this example with only one food...
myDT = DT[, .(year, food = "c_Crust", xRatio = xRatio.c_Crust, value = c_Crust)]
... then do the computation per food group with the data in long form:
myDT[, v := replace(first(value)*cumprod(replace(xRatio, 1, 1)), 1, NA), by=food]
# or more readably, to me anyways
library(magrittr)
myDT[, v := first(value)*cumprod(xRatio %>% replace(1, 1)) %>% replace(1, NA), by=food]
Alternatively, there's myDT[, v := c(NA, first(value)*cumprod(xRatio[-1])), by=food], extending the OP's code, though I prefer operating on full-length vectors with replace rather than building vectors with c, since the latter can run into weird edge cases (for example, if there is only one row, will it do the right thing?).
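For this example, with only the one food column, the same replace trick can also be applied directly in wide form (a minimal sketch using the column names from the question, not part of the original answer):
DT[, res := replace(c_Crust[1] * cumprod(replace(xRatio.c_Crust, 1, 1)), 1, NA)]
# res is NA in the first row, then c_Crust[1] multiplied by the cumulative
# product of the later xRatio.c_Crust values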


Replacing "NA" (NA string) with NA inplace data.table

I have this dummy dataset:
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"))
where I am trying to replace "NA" with a standard NA, in place, using data.table. I tried:
for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA])
for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA_character_])
for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))
However, even with these I still get:
abc$a
"NA" "bc" "x"
What am I missing?
EDIT: I tried @Frank's answer in this question, which makes use of type.convert(). (Thanks Frank; I didn't know of such an obscure albeit useful function.) The documentation of type.convert() mentions: "This is principally a helper function for read.table.", so I wanted to test it thoroughly. The function comes with a small side effect when a column is filled entirely with "NA" (the string): in that case type.convert() converts the column to logical. For that case abc will be:
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
EDIT 2: To summarize, the code present in the original question:
for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))
works fine, but only in the current latest version of data.table (1.11.4 or later). So if one is facing this problem, it's better to update data.table and use this code rather than type.convert().
I'd do...
chcols = names(abc)[sapply(abc, is.character)]
abc[, (chcols) := lapply(.SD, type.convert, as.is=TRUE), .SDcols=chcols]
which yields
> str(abc)
Classes ‘data.table’ and 'data.frame': 3 obs. of 3 variables:
$ a: chr NA "bc" "x"
$ b: num 1 2 3
$ c: chr "n" NA NA
- attr(*, ".internal.selfref")=<externalptr>
Your DT[, i :=] code did not work because it creates a column literally named "i"; and your set code does work already, as @AdamSampson pointed out. (Note: the OP upgraded from data.table 1.10.4-3 to 1.11.4 before this was the case on their computer.)
so I wanted to test it thoroughly. The function comes with a small side effect when a column is filled entirely with "NA" (the string): in that case type.convert() converts the column to logical.
Oh right. Your original approach is safer against this problem:
# op's new example
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
# op's original code
for(i in names(abc))
set(abc, which(abc[[i]] == "NA"), i, NA)
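As a rough check (expected rather than copied output), the all-"NA" column d should stay character instead of being coerced to logical:
str(abc)
# $ a: chr  NA "bc" "x"
# $ b: num  1 2 3
# $ c: chr  "n" NA NA
# $ d: chr  NA NA NA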
Side note: NA has type logical; and usually data.table would warn when assigning values of an incongruent type to a column, but I guess they wrote in an exception for NAs:
DT = data.table(x = 1:2)
DT[1, x := NA]
# no problem, even though x is int and NA is logi
DT = data.table(x = 1:2)
DT[1, x := TRUE]
# Warning message:
# In `[.data.table`(DT, 1, `:=`(x, TRUE)) :
# Coerced 'logical' RHS to 'integer' to match the column's type. Either change the target column ['x'] to 'logical' first (by creating a new 'logical' vector length 2 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
I really liked Frank's response, but want to add to it because it assumes you're only performing the change for character vectors. I'm also going to try to include some info on "why" it works.
To replace all NA you could do something like:
chcols = names(abc)
abc[,(chcols) := lapply(.SD, function(x) ifelse(x == "NA",NA,x)),.SDcols = chcols]
Let's break down what we are doing here.
We are looking at every row in abc (because there is nothing before the first comma).
After the next comma comes the j expression (the columns). Let's break that down.
We are putting the results into all of the columns listed in chcols. The (chcols) tells the data.table method to evaluate the vector of names held in the chcols object. If you left off the parentheses and used chcols it would try to store the results in a column called chcols instead of using the column names you want.
.SD is returning a data.table with the results of every column listed in .SDcols (in my case it is returning all columns...). But we want to evaluate a single column at a time. So we use lapply to apply a function to every column in .SD one at a time.
You can use any function that will return the correct values. Frank used type.convert. I'm using an anonymous function that evaluates an ifelse statement. I used ifelse because it evaluates and returns an entire vector/column.
You already know how to use a := to replace values in place.
After the next comma you either put the by information or additional options. We will add additional options in the form of .SDcols.
We need to put .SDcols = chcols to tell data.table which columns to include in .SD. My code is evaluating all columns, so if you left off .SDcols my code would still work. But it's a bad habit to leave this argument off, because you can lose time in the future if you make a change to only evaluate certain columns. Frank's example only evaluated columns of the character class, for instance. The short sketch below makes these points concrete.
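A minimal run of the same pattern on the example table (the printed result is what I would expect, not copied from a session):
library(data.table)
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"))
chcols = names(abc)
abc[, (chcols) := lapply(.SD, function(x) ifelse(x == "NA", NA, x)), .SDcols = chcols]
abc
#       a b    c
# 1: <NA> 1    n
# 2:   bc 2 <NA>
# 3:    x 3 <NA>
# Without the parentheses, abc[, chcols := ...] would create one new column
# literally named "chcols" instead of assigning into a, b and c.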
Here are two other approaches:
Subsetting
library(data.table)
abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
for (col in names(abcd)) abcd[get(col) == "NA", (col) := NA]
abcd[]
a b c d
1: <NA> 1 n <NA>
2: bc 2 <NA> <NA>
3: x 3 <NA> <NA>
Update while joining
Here, data.table is rather strict concerning variable type.
abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
for (col in names(abcd))
if (is.character(abcd[[col]]))
abcd[.("NA", NA_character_), on = paste0(col, "==V1"), (col) := V2][]
abcd
a b c d
1: <NA> 1 n <NA>
2: bc 2 <NA> <NA>
3: x 3 <NA> <NA>
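To spell out what the update join does for a single character column (my own illustrative expansion, using col = "a"):
# .("NA", NA_character_) builds a one-row lookup table with columns V1 and V2
lookup <- data.table(V1 = "NA", V2 = NA_character_)
# join abcd to lookup on a == V1: rows where a is the string "NA" match,
# and for those rows column a is overwritten with V2, a character NA,
# so the column keeps its character type
abcd[lookup, on = "a==V1", a := V2]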

.N not working properly within data.table

I came across a surprising result with data.table. Here is a really simple example:
library(data.table)
df <- data.table(x = 1:10)
df[,x[x>3][.N]]
[1] NA
This syntax gives NA, but this works:
df[,x[x>3][1]]
[1] 4
and of course so does this:
df[,x[.N]]
[1] 10
I know that in this simple example case you can do
df[x>3,x[.N]]
but I wanted to use the df[,x[x>3][.N]] syntax while using lapply on .SD to avoid a loop on the i selection, so something like
df2 <- data.table(x = rep(1:10,2), y = rep(2:11,2),ID = rep(c("A","B"),each = 10))
cols = c("x","y")
df2[,lapply(.SD,function(x){x[x>3][.N]}),.SDcols = cols, by = ID]
But this fails, the same as in my simple example. Is it because .N is not implemented in this case, or am I doing something wrong?
My actual workaround:
Reduce(merge,lapply(cols,function(col){df2[col>3,setNames(list( get(col)[.N]),col),by = ID]}))
ID x y
1: A 10 11
2: B 10 11
but I am not fully happy with it; I find it less readable. Does anyone have an explanation and a better workaround?
Thank you!
df[,x[x>3]] has 7 elements, but .N is 10 (the number of rows in the whole table), so x[x>3][.N] subsets the 7-element vector out of range and returns NA.
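A quick base-R illustration of the out-of-range indexing:
x <- 1:10
length(x[x > 3])  # 7
x[x > 3][10]      # NA: index 10 is past the end of a length-7 vector
x[x > 3][7]       # 10: the actual last element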
So you can access the last element of the vector in lapply using:
df2[, lapply(.SD, function(x) tail(x[x>3], 1) ), .SDcols = c('x','y'), by = ID]
Or, more idiomatically for data.table, we can use
df2[, lapply(.SD, function(x) last(x[x>3]) ), .SDcols = c('x','y'), by = ID]
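Either way, for the example df2 the result should match the desired output (expected result, not copied from a session):
#    ID  x  y
# 1:  A 10 11
# 2:  B 10 11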

Create column names based on "by" argument the data.table way

Say I have the following data.table
dt <- data.table(var = c("a", "b"), val = c(1, 2))
Now I want to add two new columns to dt, named a and b, with the respective values (1, 2). I can do this with a loop, but I want to do it the data.table way.
The result would be a data.table like this:
dt.res <- data.table(var = c("a", "b"), val = c(1, 2), #old vars
a = c(1, NA), b = c(NA, 2)) # newly created vars
So far I came up with something like this
dt[, c(xx) := val, by = var]
where xx would be a data.table-command similar to .N which addresses the value of the by-group.
Thanks for the help!
Appendix: The for-loop way
The non-data.table way, with a for loop instead of a by argument, would look something like this:
for (varname in dt$var){
dt[var == varname, c(varname) := val]
}
Based on the example shown, we can use dcast from data.table to convert the long format to wide, and join with the original dataset on the 'val' column.
library(data.table)#v1.9.6+
dt[dcast(dt, val~var, value.var='val'), on='val']
# var val a b
#1: a 1 1 NA
#2: b 2 NA 2
Or, as @CathG mentioned in the comments, for previous versions either merge or set the key column and then join (a sketch of the latter follows the merge example).
merge(dt, dcast.data.table(dt, val~var, value.var='val'))
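A rough sketch of the keyed-join variant mentioned in that comment (my own illustration, not code from the answer):
wide <- dcast.data.table(dt, val ~ var, value.var = "val")
setkey(dt, val)
setkey(wide, val)
dt[wide]
# expected:
#    var val  a  b
# 1:   a   1  1 NA
# 2:   b   2 NA  2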

data.table - group by all except one column

Can I group by all columns except one using data.table? I have a lot of columns, so I'd rather avoid writing out all the colnames.
The reason being I'd like to collapse duplicates in a table, where I know one column has no relevance.
library(data.table)
DT <- structure(list(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A",
"B", "C")), .Names = c("N", "val", "collapse"), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
> DT
N val collapse
1: 1 50 A
2: 2 60 B
3: 2 60 C
That is, given DT, is there something like DT[, print(.SD), by = !collapse] which gives:
> DT[, print(.SD), .(N, val)]
collapse
1: A
collapse
1: B
2: C
without actually having to specify .(N, val)? I realise I can do this by copying and pasting the column names, but I thought there might be some elegant way to do this too.
To group by all columns except one, you can use:
by = setdiff(names(DT), "collapse")
Explanation: setdiff takes the general form of setdiff(x, y), which returns all values of x that are not in y. In this case that means all column names are returned except the collapse column.
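For instance, using it to actually collapse the duplicates from the question (the separator here is my own choice):
DT[, .(collapse = paste(collapse, collapse = ",")), by = setdiff(names(DT), "collapse")]
#    N val collapse
# 1: 1  50        A
# 2: 2  60      B,C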
Two alternatives (here dt1 is your data.table and 'colB' the column to exclude):
# with '%in%'
names(dt1)[!names(dt1) %in% 'colB']
# with 'is.element'
names(dt1)[!is.element(names(dt1), 'colB')]

data.table lag operator throwing error

Hi, I am trying to create a data.table with lagged variables by group id. Certain ids have only one row in the data.table; in that case the shift operator for lag gives an error, but the lead operator works fine. Here is an example:
dt = data.table(id = 1, week = as.Date('2014-11-11'), sales = 1)
lead = 2
lag = 2
lagSalesNames = paste('lag_sales_', 1:lag, sep = '')
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lag'), by = list(id)]
This gives me the following error
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead
(much quicker), or cbind or merge afterwards.
But if I try the same thing with lead instead, it works fine
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lead'), by = list(id)]
It also seems to work fine if the data.table has more than one row; e.g., you can try the following with two rows, which works fine:
dt = data.table(id = 1, week = as.Date(c('2014-11-11', '2014-11-11')), sales = 1:2)
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lag'), by = list(id)]
I am using data.table version 1.9.5 on a linux machine with R version 3.1.0. Any help would be much appreciated.
Thanks,
Ashin
Thanks for the report. This is now fixed (issue #1014) with commit #1722 in data.table v1.9.5.
Now works as intended:
dt
# id week sales lag_sales_1 lag_sales_2
# 1: 1 2014-11-11 1 NA NA
