This question already has answers here:
Select column of data.table and return vector
(2 answers)
Closed 2 years ago.
After indexing DT column with a variable name, the data is returned as type data.table data.frame, and the column is not an accessible vector, I have to unlist it first. Am I doing everything as intended?
Consider this example:
require(data.table)
DT <- data.table(a=seq(1.001, 10.999, length=100), b=factor(c(rep('a', 55), rep('b', 45))))
col.name <- 'a'
diff(DT[, col.name]) #column name not found error
diff(DT[, col.name, with=FALSE]) #null data table
diff(DT[, col.name, with=FALSE][[1]]) #works
The second example is what question is about.
You have many options to retrieve single columns. In my opinion, the most readable option, is using .SD, though not the fastest. It's also often desired that single column data.tables are not converted to vectors.
require(data.table)
DT <- data.table(a=seq(1.001, 10.999, length=100), b=factor(c(rep('a', 55), rep('b', 45))))
DT[ , get(col.name) ] # vector
DT[[ col.name ]] # vecotr
DT[ , col.name, with = FALSE ] # data.table
DT[ , .SD, .SDcols = col.name ] # data.table
Two (straight-forward) ways to to get just the vector and not the data.table, given your inputs:
DT[, get(col.name)]
DT[[col.name]] # as per comment by ismirsehregal
And some equally convoluted as DT[, col.name, with=FALSE][[1]]
with(DT, eval(parse(text=col.name)))
DT[, ..col.name][[1]] # as per comment by ismirsehregal
Related
I have a dataset imported from a MongoDb database as a data.table, where some of the columns are formated as lists and contain some NULL values. The NULL values were causing me some issues when trying to fill a column in another data.table by reference to the first table, as the destination column was not in list format (and therefore can't have NULL values).
I found a solution below, which works fine for now, but my test dataset is only 6 records and I'm wondering if this would struggle when working with larger datasets or if there is a more efficient way to do this (in data.table)?
Here is some example data:
library(data.table)
dt <- data.table(id = c(1,2,3), age = list(12, NULL, 15), sex = list("F", "M", NULL))
And here is the solution I applied:
# Function to change NULL to NA in a data.table with lists:
null2na <- function(dtcol){
nowna = lapply(dtcol, function(x) ifelse(is.null(x), NA_real_, x))
return(nowna)
}
# Apply the function to the data.table to replace NULLs with NAs:
dt[, c(names(dt)) := lapply(.SD, null2na), .SDcols = names(dt)]
You can save one lapply call by using the lengths function.
library(data.table)
null2na <- function(dtcol){
dtcol[lengths(dtcol) == 0] <- NA
return(dtcol)
}
dt[, names(dt) := lapply(.SD, null2na)]
dt
# id age sex
#1: 1 12 F
#2: 2 NA M
#3: 3 15 NA
The age and sex column are still lists. If you want them as a simple vector return unlist(dtcol) from the function.
Here another way to solve your problem:
cols <- names(dt)[sapply(dt, is.list)] # get names of list columns
dt[, (cols) := lapply(.SD, function(x) replace(x, lengths(x)==0L, NA)), .SDcols=cols]
My toy example is too small to compare timings, but combining both solutions suggested by #B. Christian Kamgang and #Ronak Shah works well for me:
# Function to replace NULL with NA in lists:
null2na <- function(dtcol){
fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
return(fullcol)
# Apply function to dataset:
dt[, names(dt) := lapply(.SD, null2na)]
Two things I found advantageous with this approach (thanks to both respondants for suggesting):
Avoiding use of base r ifelse, dplyr::if_else and data.table::fifelse; base r ifelse converts all columns to a list unless you specify them before-hand, and the dplyr and data.table versions of ifelse, while they respect the original column classes don't work in this scenario because NA is interpreted as differing in type from the other values in the list.
The use of the function lengths(dtcol) == 0L targets specifically only the list elements that are null and doesn't do anything to the other columns or values. This means that it is not necessary to specify the subset of columns that are lists before-hand, as inherently it deals only with those.
I've gone with replace() rather than subsetting dtcol in the function as I think with larger datasets the former might be slightly faster (but have yet to test that).
I'm trying to use data.table rather data.frame(for a faster code). Despite the syntax difference between than, I'm having problems when I need to extract a specific character column and use it as character vector. When I call:
library(data.table)
DT <- fread("file.txt")
vect <- as.character(DT[, 1, with = FALSE])
class(vect)
###[1] "character"
head(vect)
It returns:
[1] "c(\"uc003hzj.4\", \"uc021ofx.1\", \"uc021olu.1\", \"uc021ome.1\", \"uc021oov.1\", \"uc021opl.1\", \"uc021osl.1\", \"uc021ovd.1\", \"uc021ovp.1\", \"uc021pdq.1\", \"uc021pdv.1\", \"uc021pdw.1\")
Any ideas of how to avoid these "\" in the output?
The as.character works on vectors and not on data.frame/data.table objects in the way the OP expected. So, if we need to get the first column as character class, subset with .SD[[1L]] and apply the as.character
DT[, as.character(.SD[[1L]])]
If there are multiple columns, we can specify the column index with .SDcols and loop over the .SD to convert to character and assign (:=) the output back to the particular columns.
DT[, (1:2) := lapply(.SD, as.character), .SDcols= 1:2]
data
DT <- data.table(Col1 = 1:5, Col2= 6:10, Col3= LETTERS[1:5])
I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
dat <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert to just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the datatable by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggeted by #MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On a smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
WARNING: The following explanation is not the data.table-way of doing things. The datatable is not updated by reference because a copy is made and stored in memory (as pointed out by #Frank), which increases memory usage. It is more an addition in order to explain the working of with = FALSE.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE, datatable will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]
I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically as well as a second input, I wasn't able to find an answer.
The idea is to index two or more series on a certain date by dividing all values by the value of the date eg:
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
X1 = cumsum(rnorm(10)),
X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right rownumber for the indexDate
rownum <- max((1:nrow(df))*(df$date==indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols],
2,
function(x, i){x / x[i]}, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for(nam in cols) {
div <- as.numeric(dt[rownum, nam, with = FALSE])
dt[ ,
nam := dt[,nam, with = FALSE] / div,
with=FALSE]
}
especially all the with = FALSE look not very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option would be to use set as this involves multiple columns. The advantage of using set is that it will avoid the overhead of [.data.table and makes it faster.
library(data.table)
for(j in cols){
set(dt, i=NULL, j=j, value= dt[[j]]/dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) :=lapply(.SD, function(x) x/x[rownum]), .SDcols=cols]
Following up on your code and the answer given by akrun, I would recommend you to use .SDcols to extract the numeric columns and lapply to loop through them. Here's how I would do it:
index <-as.Date("2000-01-05")
rownum<-max((dt$date==index)*(1:nrow(dt)))
dt[, lapply(.SD, function (i) i/i[rownum]), .SDcols = is.numeric]
Using .SDcols could be specially useful if you have a large number of numeric columns and you'd like to apply this division on all of them.
This question already has an answer here:
Apply a function to a subset of data.table columns, by column-indices instead of name
(1 answer)
Closed 8 years ago.
I want to be able to apply a function over a subset of columns, and return those columns that have been manipulated along with the rest of the data columns that weren't touched. Is there a way to do this with data.table. I wasn't able to figure out the syntax.
In this example I have NAs and want to overwrite them with something else for a few different columns. I need a way to also return other columns that weren't touched.
library(data.table)
# make data set
a <- sample(c(letters[1:5], NA), 50, replace=TRUE)
b <- sample(c(LETTERS[1:5], NA), 50, replace=TRUE)
c <- sample(runif(50))
x <- data.table(a,b,c)
# function to apply to a single column
overwriteNA <- function(vec, new="") ifelse(is.na(vec), new, vec)
# Only returns .SDcols but would like to also include rest of columns in data.table
x[, lapply(.SD, overwriteNA), .SDcols=c("a", "b")]
# Need something along these lines
x[, `:=` lapply(.SD, overwriteNA), .SDcols=c("a", "b")]
Try
x[, c("a", "b") := lapply(.SD, overwriteNA), .SDcols = c("a", "b")]
Edit:
Per OPs additional request.
myCols <- c("a", "b")
x[, (myCols) := lapply(.SD, overwriteNA), .SDcols = myCols]