Subsetting data.table columns using [] - r

This is very strange. When I try to select columns on my data.table by doing
df1[, 30]
It just gives me 30, or whatever number I put in there. Not column 30.
Data here: https://github.com/pourque/country-data/blob/master/data/df1.csv
I've checked, and everything works properly when I just produce a test data.frame:
df2 <- data.frame(x = 1:3, y = 3:1, z = 7:9)
> df2[, 2]
[1] 3 2 1
Any ideas on what might be happening?

When working with data.table you need to use following to chose column by numbers:
df2[, 2]
df2[, .SD, .SDcols=2]
This will still return data.table, not a vector.
As always on list you can also use below to return a vector:
df2[[2]]

Related

How to obtain values from data table? [duplicate]

I have the following data.table (DT):
DT <- data.table(V1 = 1:3, V2 = 4:6, V3 = 7:9)
I would like to select a subset of the variables programmatically (dynamically), by using an object where the relevant variable names are stored. For example, I want to select the two columns "V1" and "V3" stored in a variable "keep"
keep <- c("V1", "V3")
If we were to select the "keep" columns from a data.frame, the following would work:
DT[keep]
Unfortunately, this is not working when this is a data.table. I thought the data.frame and data.table are identical with this kind of behavior, but apperently they aren't. Anybody able to advise on the correct syntax?
This is covered in FAQ 1.1, 1.2 and 2.17.
Some possibilities:
DT[, keep, with = FALSE]
DT[, c('V1', 'V3'), with = FALSE]
DT[, c(1, 3), with = FALSE]
DT[, list(V1, V3)]
The reason DF[c('V1','V3')] works as it does for a data.frame is covered in ?`[.data.frame`
Data frames can be indexed in several modes. When [ and [[ are used
with a single vector index (x[i] or x[[i]]), they index the data frame
as if it were a list. In this usage a drop argument is ignored, with a
warning.
From data.table 1.10.2, you may use the .. prefix when subsetting columns programmatically:
When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers [...] It is experimental.
Thus:
DT[ , ..keep]
# V1 V3
# 1: 1 7
# 2: 2 8
# 3: 3 9
Some more possibilities:
DT[, .SD, .SDcols = keep]
DT[, mget(keep)]

Replace NULL with NA in r data.table with lists

I have a dataset imported from a MongoDb database as a data.table, where some of the columns are formated as lists and contain some NULL values. The NULL values were causing me some issues when trying to fill a column in another data.table by reference to the first table, as the destination column was not in list format (and therefore can't have NULL values).
I found a solution below, which works fine for now, but my test dataset is only 6 records and I'm wondering if this would struggle when working with larger datasets or if there is a more efficient way to do this (in data.table)?
Here is some example data:
library(data.table)
dt <- data.table(id = c(1,2,3), age = list(12, NULL, 15), sex = list("F", "M", NULL))
And here is the solution I applied:
# Function to change NULL to NA in a data.table with lists:
null2na <- function(dtcol){
nowna = lapply(dtcol, function(x) ifelse(is.null(x), NA_real_, x))
return(nowna)
}
# Apply the function to the data.table to replace NULLs with NAs:
dt[, c(names(dt)) := lapply(.SD, null2na), .SDcols = names(dt)]
You can save one lapply call by using the lengths function.
library(data.table)
null2na <- function(dtcol){
dtcol[lengths(dtcol) == 0] <- NA
return(dtcol)
}
dt[, names(dt) := lapply(.SD, null2na)]
dt
# id age sex
#1: 1 12 F
#2: 2 NA M
#3: 3 15 NA
The age and sex column are still lists. If you want them as a simple vector return unlist(dtcol) from the function.
Here another way to solve your problem:
cols <- names(dt)[sapply(dt, is.list)] # get names of list columns
dt[, (cols) := lapply(.SD, function(x) replace(x, lengths(x)==0L, NA)), .SDcols=cols]
My toy example is too small to compare timings, but combining both solutions suggested by #B. Christian Kamgang and #Ronak Shah works well for me:
# Function to replace NULL with NA in lists:
null2na <- function(dtcol){
fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
return(fullcol)
# Apply function to dataset:
dt[, names(dt) := lapply(.SD, null2na)]
Two things I found advantageous with this approach (thanks to both respondants for suggesting):
Avoiding use of base r ifelse, dplyr::if_else and data.table::fifelse; base r ifelse converts all columns to a list unless you specify them before-hand, and the dplyr and data.table versions of ifelse, while they respect the original column classes don't work in this scenario because NA is interpreted as differing in type from the other values in the list.
The use of the function lengths(dtcol) == 0L targets specifically only the list elements that are null and doesn't do anything to the other columns or values. This means that it is not necessary to specify the subset of columns that are lists before-hand, as inherently it deals only with those.
I've gone with replace() rather than subsetting dtcol in the function as I think with larger datasets the former might be slightly faster (but have yet to test that).

.N not working properly within data.table

I came across a surprising result with data.table. Here is a really simple example :
library(data.table)
df <- data.table(x = 1:10)
df[,x[x>3][.N]]
[1] NA
This syntax gives NA, but this work:
df[,x[x>3][1]]
[1] 4
and of course this
df[,x[.N]]
[1] 10
I know that in this simple example case you can do
df[x>3,x[.N]]
but I wanted to use the df[,x[x>3][.N]] syntax while using lapply on .SD to avoid a loop on the i selection, so something like
df2 <- data.table(x = rep(1:10,2), y = rep(2:11,2),ID = rep(c("A","B"),each = 10))
cols = c("x","y")
df2[,lapply(.SD,function(x){x[x>3][.N]}),.SDcols = cols, by = ID]
But this fail, same as in my simple example. Is it because .N is not implemented in this case, or am I doing something wrong ?
My actual work around:
Reduce(merge,lapply(cols,function(col){df2[col>3,setNames(list( get(col)[.N]),col),by = ID]}))
ID x y
1: A 10 11
2: B 10 11
but I am not fully happy with it, I find it less readable. Has anyone an explanation and a better work around ?
Thank you !!
df[,x[x>3]] has 7 elements. .N is 10. You are trying to subset a vector out of range.
So you can access the last element of the vector in lapply using:
df2[, lapply(.SD, function(x) tail(x[x>3], 1) ), .SDcols = c('x','y'), by = ID]
Or more idiomatic for data.table we can use
df2[, lapply(.SD, function(x) last(x[x>3]) ), .SDcols = c('x','y'), by = ID]

Aggregating in R

I have a data frame with two columns. I want to add an additional two columns to the data set with counts based on aggregates.
df <- structure(list(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")
I want to simply count the number of Instances of SMS.Type and Reply.Date where there is no null. So in the toy example below, i will generate the 2 for SMS.Type and 1 for Reply.Date
I then want to add this to the data frame as total counts (Im aware they will duplicate out for the number of rows in the original dataset but thats ok)
I have been playing around with aggregate and count function but to no avail
mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
train,
function(x) length(unique(which(!is.na(x)))))
mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
testtrain,
function(x) length(which(!is.na(x))))
Can anyone help?
Thank you for your time
Using data.table you could do (I've added a real NA to your original data).
I'm also not sure if you really looking for length(unique()) or just length?
library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") :=
lapply(.SD, function(x) length(unique(na.omit(x)))),
.SDcols = cols,
by = ID]
# ID SMS.Type SMS.Date Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900 DF1 12/02/2015 19:51 NA 2 1
# 2: 1045937900 WCB14 13/02/2015 08:38 13/02/2015 09:52 2 1
In the devel version (v >= 1.9.5) you also could use uniqueN function
Explanation
This is a general solution which will work on any number of desired columns. All you need to do is to put the columns names into cols.
lapply(.SD, is calling a certain function over the columns specified in .SDcols = cols
paste0(cols, ".count") creates new column names while adding count to the column names specified in cols
:= performs assignment by reference, meaning, updates the newly created columns with the output of lapply(.SD, in place
by argument is specifying the aggregator columns
After converting your empty strings to NAs:
library(dplyr)
mutate(df, SMS.Type.count = sum(!is.na(SMS.Type)),
Reply.Date.count = sum(!is.na(Reply.Date)))

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have check.names argument)
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As #DavidArenburg suggests in comments above, you could use check.names=TRUE in data.table() or fread()
.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
Different approaches:
Indexing
my.data.table <- my.data.table[ ,-2]
Subsetting
my.data.table <- subset(my.data.table, select = -2)
Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
Optionnaly systematize deletion of variables which names meet some criterion (here, we'll get rid of all variables having a name ending with ".X" (X being a number, starting at 2 when using make.names)
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))

Resources