I have this dummy dataset:
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"))
where I am trying to replace the string "NA" with a standard NA, in place, using data.table. I tried:
for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA])
for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA_character_])
for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))
However, with all of these I still get:
abc$a
"NA" "bc" "x"
What am I missing?
EDIT: I tried @Frank's answer to this question, which makes use of type.convert(). (Thanks, Frank; I didn't know about this obscure albeit useful function.) The documentation of type.convert() mentions: "This is principally a helper function for read.table." So I wanted to test it thoroughly. The function comes with a small side effect when a column consists entirely of "NA" (the string): in that case type.convert() converts the column to logical. For such a case abc would be:
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
EDIT2: To summarize, the code present in the original question:
for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))
works fine, but only in a recent version of data.table (1.11.4 or later). So if you are facing this problem, it is better to update data.table and use this code rather than type.convert().
I'd do...
chcols = names(abc)[sapply(abc, is.character)]
abc[, (chcols) := lapply(.SD, type.convert, as.is=TRUE), .SDcols=chcols]
which yields
> str(abc)
Classes ‘data.table’ and 'data.frame': 3 obs. of 3 variables:
$ a: chr NA "bc" "x"
$ b: num 1 2 3
$ c: chr "n" NA NA
- attr(*, ".internal.selfref")=<externalptr>
Your DT[, i :=] code did not work because it creates a column literally named "i"; your set code, on the other hand, already works, as @AdamSampson pointed out. (Note: the OP upgraded from data.table 1.10.4-3 to 1.11.4 before this was the case on their machine.)
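To see the pitfall concretely, here's a minimal sketch with a throwaway table; the parentheses in (i) := are what force evaluation of the loop variable instead of creating a literal column "i":

```r
library(data.table)

dt <- data.table(a = c("NA", "x"))
i <- "a"

dt[a == "NA", i := NA]    # creates a new column literally named "i"
names(dt)                 # "a" "i"  -- column a is untouched

dt[a == "NA", (i) := NA]  # parentheses: use the contents of i, i.e. column "a"
dt$a                      # NA "x"
```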
so I wanted to test this thoroughly. This function comes with a small side effect when you have a column filled entirely with "NA" (the string). In such a case type.convert() converts the column to logical.
Oh right. Your original approach is safer against this problem:
# op's new example
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
# op's original code
for(i in names(abc))
set(abc, which(abc[[i]] == "NA"), i, NA)
Side note: NA has type logical; data.table usually warns when assigning values of an incongruent type to a column, but I guess they wrote in an exception for NAs:
DT = data.table(x = 1:2)
DT[1, x := NA]
# no problem, even though x is int and NA is logi
DT = data.table(x = 1:2)
DT[1, x := TRUE]
# Warning message:
# In `[.data.table`(DT, 1, `:=`(x, TRUE)) :
# Coerced 'logical' RHS to 'integer' to match the column's type. Either change the target column ['x'] to 'logical' first (by creating a new 'logical' vector length 2 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
I really liked Frank's response, but want to add to it because it assumes you're only performing the change for character vectors. I'm also going to try to include some info on "why" it works.
To replace all of the "NA" strings you could do something like:
chcols = names(abc)
abc[,(chcols) := lapply(.SD, function(x) ifelse(x == "NA",NA,x)),.SDcols = chcols]
Let's break down what we are doing here.
We are looking at every row in abc (because there is nothing before the first comma).
After the first comma comes the j expression, i.e. the columns. Let's break that down.
We are putting the results into all of the columns listed in chcols. The (chcols) tells the data.table method to evaluate the vector of names held in the chcols object. If you left off the parentheses and used chcols it would try to store the results in a column called chcols instead of using the column names you want.
.SD is returning a data.table with the results of every column listed in .SDcols (in my case it is returning all columns...). But we want to evaluate a single column at a time. So we use lapply to apply a function to every column in .SD one at a time.
You can use any function that will return the correct values. Frank used type.convert. I'm using an anonymous function that evaluates an ifelse statement. I used ifelse because it evaluates and returns an entire vector/column.
You already know how to use a := to replace values in place.
After the next comma you either put the by information or additional options. We will add additional options in the form of .SDcols.
We need .SDcols = chcols to tell data.table which columns to include in .SD. My code evaluates all columns, so if you left off .SDcols it would still work. But it's a bad habit to leave this argument off, because you can lose time in the future if you later change the code to evaluate only certain columns. Frank's example, for instance, only evaluated columns of class character.
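Putting those pieces together on the example data (a sketch; note that b == "NA" is simply FALSE for the numeric column, so ifelse() passes it through unchanged):

```r
library(data.table)

abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"))
chcols <- names(abc)

# replace every "NA" string with a real NA, column by column
abc[, (chcols) := lapply(.SD, function(x) ifelse(x == "NA", NA, x)), .SDcols = chcols]

abc$a  # NA "bc" "x"
```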
Here are two other approaches:
Subsetting
library(data.table)
abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
for (col in names(abcd)) abcd[get(col) == "NA", (col) := NA]
abcd[]
a b c d
1: <NA> 1 n <NA>
2: bc 2 <NA> <NA>
3: x 3 <NA> <NA>
Update while joining
Here, data.table is rather strict concerning variable type.
abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))
for (col in names(abcd))
if (is.character(abcd[[col]]))
abcd[.("NA", NA_character_), on = paste0(col, "==V1"), (col) := V2][]
abcd
a b c d
1: <NA> 1 n <NA>
2: bc 2 <NA> <NA>
3: x 3 <NA> <NA>
Related
I have a large list of data.tables (~10m rows in total) that contain a number of string variants of NA, such as "N/A" or "Unknown". I would like to replace these observations with a missing value, for all of the columns in all of my data.tables.
A simplified example of the data is set out below:
library(data.table)
dt1 <- data.table(v1 = 1:4, v2 = c("yes", "no", "unknown", NA))
dt2 <- data.table(v1 = c("1", "2", "not applicable", "4"), v2 = c("yes", "yes", "n/a", "no"))
master_list <- list(dt1 = dt1, dt2 = dt2)
The following solution works, but it is taking a prohibitively long time (~30 minutes with the full data) so I am trying to find a more efficient solution:
unknowns <- c("n/a", "not applicable", "unknown")
na_edit <- function(x){ifelse(x %in% unknowns, NA, x)}
master_list <- lapply(master_list, function(dt) {
dt[, lapply(.SD, na_edit)]
})
> master_list$dt1
v1 v2
1: 1 yes
2: 2 no
3: 3 <NA>
4: 4 <NA>
I have tried something resembling the following, removing the need for the ifelse, but I have not been able to make this work across multiple columns.
lapply(master_list, function(dt) {
dt[v2 %in% unknowns, v2 := NA]
})
I feel an answer may lie in the responses in this thread. Would anyone be able to help me apply similar, or other methods, to the problem above? Many thanks in advance.
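Not a benchmarked answer, but one direction worth trying is the set()/%chin% pattern used elsewhere in this thread: set() skips the overhead of [.data.table inside the loop, %chin% is data.table's fast character version of %in%, and because data.tables are updated by reference the list elements are modified in place. A sketch:

```r
library(data.table)

unknowns <- c("n/a", "not applicable", "unknown")
dt1 <- data.table(v1 = 1:4, v2 = c("yes", "no", "unknown", NA))
dt2 <- data.table(v1 = c("1", "2", "not applicable", "4"), v2 = c("yes", "yes", "n/a", "no"))
master_list <- list(dt1 = dt1, dt2 = dt2)

for (dt in master_list)                 # each element is modified by reference
  for (col in names(dt))
    if (is.character(dt[[col]]))
      set(dt, which(dt[[col]] %chin% unknowns), col, NA_character_)

master_list$dt2$v2  # "yes" "yes" NA "no"
```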
I am creating a summary data.table to be inserted into a knitr report using xtable. I would like to check each value in each column: if it is numeric (is.numeric() == TRUE), format the number and then convert it back to character; if it is not (is.numeric() == FALSE), return the value unchanged. The actual data.table may have many columns.
Here's what I have below, with the desired output at the bottom:
library(data.table)
library(magrittr)
dt <- data.table(A = c("apples",
"bananas",
1000000.999),
B = c("red",
5000000.999,
0.99))
dt
a <- dt[, lapply(.SD,
function(x) {
if (is.na(is.numeric(x))) {
prettyNum(as.numeric(x), digits = 0, big.mark = ",")
} else {
x
}
})]
a
b <- dt[, A := ifelse(is.na(is.numeric(A)),
format(as.numeric(A), digits = 0, big.mark = ","),
A)] %>%
.[, B := ifelse(is.na(is.numeric(B)),
format(as.numeric(B), digits = 0, big.mark = ","),
B)]
b
b
desired <- data.table(A = c("apples",
"bananas",
"1,000,000"),
B = c("red",
"5,000,000",
"1"))
desired
From my understanding lapply in the j argument of data.table syntax operates on the vector, so it can be used for functions like mean(), sum(), na.approx(), etc. and wouldn't necessarily work here. But I would like to loop over each column in the data.table without specifying each column name since there could be many columns and naming them would be cumbersome. It's kind of like I know the circle doesn't go in the square but I really want it to!
I tried the := ifelse() approach which I thought should work, but it seems to be returning the first element. On a different data.table where the column is entirely numeric, employing the same approach yields all NA.
Thanks for any help!
We can use set() together with number() from scales. Loop through the columns with a for loop, identify the indices of elements that consist entirely of digits or . ('i1'), use those as the i in set(), convert those elements to numeric, and apply number() to format each of them:
library(scales)
library(data.table)
for(j in seq_along(dt)) {
i1 <- grep("^[0-9.]+$", dt[[j]])
set(dt, i = i1, j = j, value = number(as.numeric(dt[[j]][i1]), big.mark = ","))
}
dt
# A B
#1: apples red
#2: bananas 5,000,001
#3: 1,000,001 1
There are at least a couple of Q&As similar to this one, but I can't seem to get the hang of it. Here's a reproducible example. DT holds the data; I want food(n) = food(n-1) * xRatio.food(n).
DT <- fread("year c_Crust xRatio.c_Crust
X2005 0.01504110 NA
X2010 NA 0.9883415
X2015 NA 1.0685221
X2020 NA 1.0664189
X2025 NA 1.0348418
X2030 NA 1.0370386
X2035 NA 1.0333771
X2040 NA 1.0165511
X2045 NA 1.0010563
X2050 NA 1.0056368")
The code that gets closest to the formula (writing food for the c_Crust columns) is
DT[,res := food[1] * cumprod(xRatio.food[-1])]
but the res values are shifted up one row, and the first value is recycled into the last row with a warning. I want the first value of res to be NA.
I'd rename/reshape...
myDT = melt(DT, id = "year", meas=list(2,3),
variable.name = "food",
value.name = c("value", "xRatio"))[, food := "c_Crust"][]
# or for this example with only one food...
myDT = DT[, .(year, food = "c_Crust", xRatio = xRatio.c_Crust, value = c_Crust)]
... then do the computation per food group with the data in long form:
myDT[, v := replace(first(value)*cumprod(replace(xRatio, 1, 1)), 1, NA), by=food]
# or more readably, to me anyways
library(magrittr)
myDT[, v := first(value)*cumprod(xRatio %>% replace(1, 1)) %>% replace(1, NA), by=food]
Alternately, there's myDT[, v := c(NA, first(value)*cumprod(xRatio[-1])), by=food], extending the OP's code, though I prefer just operating on full-length vectors with replace rather than trying to build vectors with c, since the latter can run into weird edge cases (like if there is only one row, will it do the right thing?).
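A quick check of that edge case (a sketch): with a single row, xRatio[-1] has length zero, so both variants happen to collapse to a single NA here, but the replace() form never constructs a vector whose length depends on c():

```r
library(data.table)

one <- data.table(food = "c_Crust", value = 5, xRatio = NA_real_)

a <- one[, .(v = replace(first(value) * cumprod(replace(xRatio, 1, 1)), 1, NA)), by = food]
b <- one[, .(v = c(NA, first(value) * cumprod(xRatio[-1]))), by = food]

a$v  # NA
b$v  # NA
```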
I am trying to recode a variable using data.table. I have googled for almost 2 hours but couldn't find an answer.
Assume I have a data.table as the following:
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
I want to recode V1 and V2. For V1, I want to recode 1s to 0 and 2s to 1.
For V2, I want to recode A to T, B to K, C to D.
If I use dplyr, it is simple.
library(dplyr)
DT %>%
mutate(V1 = recode(V1, `1` = 0L, `2` = 1L)) %>%
mutate(V2 = recode(V2, A = "T", B = "K", C = "D"))
But I have no idea how to do this in data.table
DT[V1==1, V1 := 0]
DT[V1==2, V1 := 1]
DT[V2=="A", V2 := "T"]
DT[V2=="B", V2 := "K"]
DT[V2=="C", V2 := "D"]
Above is the best code I can think of. But there must be a better and more efficient way to do this.
Edit
I changed how I want to recode V2 to make my example more general.
With data.table the recode can be solved with an update on join:
DT[.(V1 = 1:2, to = 0:1), on = "V1", V1 := i.to]
DT[.(V2 = LETTERS[1:3], to = c("T", "K", "D")), on = "V2", V2 := i.to]
which converts DT to
V1 V2 V4
1: 0 T 1
2: 0 K 2
3: 1 D 3
4: 0 T 4
5: 0 K 5
6: 1 D 6
7: 0 T 7
8: 0 K 8
9: 1 D 9
10: 0 T 10
11: 0 K 11
12: 1 D 12
Edit: @Frank suggested using i.to to be on the safe side.
Explanation
The expressions .(V1 = 1:2, to = 0:1) and .(V2 = LETTERS[1:3], to = c("T", "K", "D")), respectively, create lookup tables on the fly.
Alternatively, the lookup tables can be set up beforehand:
lut1 <- data.table(V1 = 1:2, to = 0:1)
lut2 <- data.table(V2 = LETTERS[1:3], to = c("T", "K", "D"))
lut1
V1 to
1: 1 0
2: 2 1
lut2
V2 to
1: A T
2: B K
3: C D
Then, the update joins become
DT[lut1, on = "V1", V1 := i.to]
DT[lut2, on = "V2", V2 := i.to]
Edit 2: Answers to How can I use this code dynamically?
mat asked "How can I use this code dynamically?"
So, here is a modified version where the name of column to update is provided as a character variable my_var_name but the lookup tables still are created on-the-fly:
my_var_name <- "V1"
DT[.(from = 1:2, to = 0:1), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
my_var_name <- "V2"
DT[.(from = LETTERS[1:3], to = c("T", "K", "D")), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
There are 3 points to note:
Instead of naming the first column of the lookup table dynamically, it gets a fixed name, from. This requires a join between differently named columns (a foreign-key join). The names of the columns to join on have to be specified via the on parameter.
The on parameter accepts character strings for foreign key joins of the form "V1==from". This string is created dynamically using paste0().
In the expression (my_var_name) := i.to, the parentheses around the variable my_var_name force data.table to use the contents of my_var_name as the name of the column to update.
Dynamic code using pre-defined lookup tables
Now, while the column to recode is specified dynamically by a variable, the lookup tables are still hard-coded in the statement, which means we have stopped halfway: we also need to select the appropriate lookup table dynamically.
This can be achieved by storing the lookup tables in a list where each list element is named according to the column of DT it is supposed to recode:
lut_list <- list(
V1 = data.table(from = 1:2, to = 0:1),
V2 = data.table(from = LETTERS[1:3], to = c("T", "K", "D"))
)
lut_list
$V1
from to
<int> <int>
1: 1 0
2: 2 1
$V2
from to
<char> <char>
1: A T
2: B K
3: C D
Now, we can pick the appropriate lookup table from the list dynamically as well:
my_var_name <- "V1"
DT[lut_list[[my_var_name]], on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
Going one step further, we can recode all relevant columns of DT in a loop:
for (v in intersect(names(lut_list), colnames(DT))) {
DT[lut_list[[v]], on = paste0(v, "==from"), (v) := i.to]
}
Note that DT is updated by reference, i.e., only the affected elements are replaced in place without copying the whole object. So the for loop is applied iteratively to the same data object. This is a special feature of data.table and will not work with data.frames or tibbles.
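A quick way to convince yourself the loop really updates in place (a sketch; address(), exported by data.table, returns the memory address of an object):

```r
library(data.table)

DT <- data.table(V1 = c(0L, 1L, 2L), V2 = LETTERS[1:3], V4 = 1:12)
lut_list <- list(
  V1 = data.table(from = 1:2, to = 0:1),
  V2 = data.table(from = LETTERS[1:3], to = c("T", "K", "D"))
)

before <- address(DT)                    # remember where DT lives
for (v in intersect(names(lut_list), colnames(DT))) {
  DT[lut_list[[v]], on = paste0(v, "==from"), (v) := i.to]
}
identical(address(DT), before)           # TRUE: same object, nothing was copied
```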
I think this might be what you're looking for. On the left hand side of := we name the variables we want to update and on the right hand side we have the expressions we want to update the corresponding variables with.
DT[, c("V1","V2") := .(as.numeric(V1==2), sapply(V2, function(x) {if(x=="A") "T"
else if (x=="B") "K"
else if (x=="C") "D" }))]
# V1 V2 V4
#1: 0 T 1
#2: 0 K 2
#3: 1 D 3
#4: 0 T 4
#5: 0 K 5
#6: 1 D 6
#7: 0 T 7
#8: 0 K 8
#9: 1 D 9
#10: 0 T 10
#11: 0 K 11
#12: 1 D 12
Alternatively, just use recode within data.table:
library(dplyr)
DT[, c("V1","V2") := .(as.numeric(V1==2), recode(V2, "A" = "T", "B" = "K", "C" = "D"))]
mapvalues() from plyr, in combination with data.table, works really well.
I use it on largish data (50 to 400 million rows). Although I haven't benchmarked it against the other possibilities, I find the clear syntax worth a lot, as it means fewer errors in complicated recode operations.
library(data.table)
library(plyr)
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
DT[, V1 := mapvalues(V1, from=c(1, 2), to=c(0, 1))]
DT[, V2 := mapvalues(V2, from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))]
For more complicated recode operations, I would always create a new variable first with NA, and use another data.table with from-to vectors as variables.
A feature that in some use cases is more of a bug: mapvalues() keeps those values from the old variable that aren't in the from argument.
This is a problem if you're sure that all the correct values are in the from vector, so that any value in the data.table that isn't in this vector should become NA instead.
DT <- data.table(V1=c(LETTERS[1:3], 'i dont want this value transfered'),
V4=1:12)
map_DT <- data.table(from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))
# NA variable to begin with is good practice because it is clearer to spot an error
DT[, V1_new := NA_character_]
DT[V1 %in% map_DT$from , V1_new := mapvalues(V1, from=map_DT$from, to=map_DT$to)][]
Note that plyr is retired, so the mapvalues() function is somewhat at risk of disappearing at some point in the future. The update-join method proposed above might be a better choice because of this, although I find mapvalues() just a tad clearer to read. It will probably take a lot of years before mapvalues() actually goes away, but it is something to keep in mind when deciding whether to use it.
Can I group by all columns except one using data.table? I have a lot of columns, so I'd rather avoid writing out all the colnames.
The reason being I'd like to collapse duplicates in a table, where I know one column has no relevance.
library(data.table)
DT <- structure(list(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A",
"B", "C")), .Names = c("N", "val", "collapse"), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
> DT
N val collapse
1: 1 50 A
2: 2 60 B
3: 2 60 C
That is, given DT, is there something like DT[, print(.SD), by = !collapse] which gives:
> DT[, print(.SD), .(N, val)]
collapse
1: A
collapse
1: B
2: C
without actually having to specify .(N, val)? I realise I can do this by copy and pasting the column names, but I thought there might be some elegant way to do this too.
To group by all columns except one, you can use:
by = setdiff(names(DT), "collapse")
Explanation: setdiff takes the general form setdiff(x, y), which returns all values of x that are not in y. In this case it means that all column names are returned except the collapse column.
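Applied to the example, a sketch of the full collapse (using toString() as a placeholder aggregation; substitute whatever combination you actually need):

```r
library(data.table)

DT <- data.table(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A", "B", "C"))

# group by every column except "collapse", joining the collapsed values
DT[, .(collapse = toString(collapse)), by = setdiff(names(DT), "collapse")]
#    N val collapse
# 1: 1  50        A
# 2: 2  60     B, C
```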
Two alternatives (here with a generic table dt1, excluding column colB):
# with '%in%'
names(dt1)[!names(dt1) %in% 'colB']
# with 'is.element'
names(dt1)[!is.element(names(dt1), 'colB')]