I'm trying to use data.table rather than data.frame (for faster code). Besides the syntax differences between them, I'm having problems when I need to extract a specific character column and use it as a character vector. When I call:
library(data.table)
DT <- fread("file.txt")
vect <- as.character(DT[, 1, with = FALSE])
class(vect)
###[1] "character"
head(vect)
It returns:
[1] "c(\"uc003hzj.4\", \"uc021ofx.1\", \"uc021olu.1\", \"uc021ome.1\", \"uc021oov.1\", \"uc021opl.1\", \"uc021osl.1\", \"uc021ovd.1\", \"uc021ovp.1\", \"uc021pdq.1\", \"uc021pdv.1\", \"uc021pdw.1\")
Any ideas of how to avoid these "\" in the output?
as.character works on vectors, not on data.frame/data.table objects in the way the OP expected: a data.table is a list of columns, so each column is deparsed into a single string. So, if we need to get the first column as a character vector, extract it with .SD[[1L]] and apply as.character
DT[, as.character(.SD[[1L]])]
If there are multiple columns, we can specify the column index with .SDcols and loop over the .SD to convert to character and assign (:=) the output back to the particular columns.
DT[, (1:2) := lapply(.SD, as.character), .SDcols= 1:2]
data
DT <- data.table(Col1 = 1:5, Col2= 6:10, Col3= LETTERS[1:5])
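To see why the original call misbehaves, here is a minimal sketch using the example data above (the names deparsed, col3 and col1_chr are only illustrative). Calling as.character() on a one-column table deparses the whole column into one string, whereas [[ extracts the column itself.

```r
library(data.table)

DT <- data.table(Col1 = 1:5, Col2 = 6:10, Col3 = LETTERS[1:5])

# as.character() on a one-column data.table deparses the column into one string
deparsed <- as.character(DT[, 3, with = FALSE])
length(deparsed)   # a single element, not five

# [[ extracts the column itself as a plain vector
col3 <- DT[[3L]]

# the same idea from inside j, converting the integer first column
col1_chr <- DT[, as.character(.SD[[1L]])]
```

The escaped quotes in the question's output are just how R prints that single deparsed string.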
Related
I have a dataset imported from a MongoDb database as a data.table, where some of the columns are formatted as lists and contain some NULL values. The NULL values were causing me some issues when trying to fill a column in another data.table by reference to the first table, as the destination column was not in list format (and therefore can't hold NULL values).
I found a solution below, which works fine for now, but my test dataset is only 6 records and I'm wondering if this would struggle when working with larger datasets or if there is a more efficient way to do this (in data.table)?
Here is some example data:
library(data.table)
dt <- data.table(id = c(1,2,3), age = list(12, NULL, 15), sex = list("F", "M", NULL))
And here is the solution I applied:
# Function to change NULL to NA in a data.table with lists:
null2na <- function(dtcol){
  nowna = lapply(dtcol, function(x) ifelse(is.null(x), NA_real_, x))
  return(nowna)
}
# Apply the function to the data.table to replace NULLs with NAs:
dt[, c(names(dt)) := lapply(.SD, null2na), .SDcols = names(dt)]
You can save one lapply call by using the lengths function.
library(data.table)
null2na <- function(dtcol){
  dtcol[lengths(dtcol) == 0] <- NA
  return(dtcol)
}
dt[, names(dt) := lapply(.SD, null2na)]
dt
# id age sex
#1: 1 12 F
#2: 2 NA M
#3: 3 15 NA
The age and sex columns are still lists. If you want them as plain vectors, return unlist(dtcol) from the function.
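A minimal sketch of that unlist() variant, using the example data from the question. The helper name null2na_flat is hypothetical, and the sketch assumes every non-NULL element has length 1; with longer elements unlist() would no longer return one value per row.

```r
library(data.table)

dt <- data.table(id = c(1, 2, 3),
                 age = list(12, NULL, 15),
                 sex = list("F", "M", NULL))

# Hypothetical helper: replace NULL with NA, then flatten list columns.
# Assumes each non-NULL element has length 1.
null2na_flat <- function(dtcol) {
  dtcol[lengths(dtcol) == 0L] <- NA
  if (is.list(dtcol)) unlist(dtcol) else dtcol
}

cols <- names(dt)
dt[, (cols) := lapply(.SD, null2na_flat), .SDcols = cols]

sapply(dt, class)
```

After this, age should be a numeric vector and sex a character vector, each with NA where the NULL was.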
Here is another way to solve your problem:
cols <- names(dt)[sapply(dt, is.list)] # get names of list columns
dt[, (cols) := lapply(.SD, function(x) replace(x, lengths(x)==0L, NA)), .SDcols=cols]
My toy example is too small to compare timings, but combining both solutions suggested by @B. Christian Kamgang and @Ronak Shah works well for me:
# Function to replace NULL with NA in lists:
null2na <- function(dtcol){
  fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
  return(fullcol)
}
# Apply function to dataset:
dt[, names(dt) := lapply(.SD, null2na)]
Two things I found advantageous with this approach (thanks to both respondents for suggesting):
Avoiding base R ifelse, dplyr::if_else and data.table::fifelse: base R ifelse converts all columns to lists unless you specify them beforehand, and the dplyr and data.table versions, while they respect the original column classes, don't work in this scenario because NA is interpreted as differing in type from the other values in the list.
Using lengths(dtcol) == 0L targets only the list elements that are NULL and leaves all other columns and values untouched. This means it is not necessary to pick out the list columns beforehand, as it inherently deals only with those.
I've gone with replace() rather than subsetting dtcol in the function as I think with larger datasets the former might be slightly faster (but have yet to test that).
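One rough way to check this on a larger input is sketched below. The data is synthetic and the function names via_replace and via_subset are made up for illustration. Base R's replace() is itself a thin wrapper around the same subset assignment, so a large difference seems unlikely, but timings will vary by machine and R version.

```r
set.seed(1)
n <- 1e5

# Synthetic list column with 10% NULL elements (made-up data)
x <- as.list(runif(n))
x[sample(n, n / 10)] <- list(NULL)

via_replace <- function(col) replace(col, lengths(col) == 0L, NA)
via_subset  <- function(col) {
  col[lengths(col) == 0L] <- NA
  col
}

# Both approaches should give the same result
same <- identical(via_replace(x), via_subset(x))

# Rough timings over repeated runs
system.time(for (i in 1:20) via_replace(x))
system.time(for (i in 1:20) via_subset(x))
```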
I would like to do something more efficient than
dataframe$col <- as.character(dataframe$col)
since I have many numeric columns.
In base R, we can use either of the following. The first loops over all the columns with an if/else condition and converts only the numeric ones
dataframe[] <- lapply(dataframe, function(x) if(is.numeric(x))
as.character(x) else x)
Or create an index for numeric columns and loop only on those columns and assign
i1 <- sapply(dataframe, is.numeric)
dataframe[i1] <- lapply(dataframe[i1], as.character)
It may be more flexible in dplyr
library(dplyr)
dataframe <- dataframe %>%
mutate(across(where(is.numeric), as.character))
All said by master akrun! Here is a data.table alternative. Note it converts all columns to character class:
library(data.table)
data.table::setDT(df)
df[, (colnames(df)) := lapply(.SD, as.character), .SDcols = colnames(df)]
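If only the numeric columns should be converted, as in the question, the same data.table pattern can be restricted to them. This is a sketch on made-up data; num_cols is an illustrative name.

```r
library(data.table)

# Made-up example data with double, integer and character columns
df <- data.frame(a = c(1.5, 2.5), b = 1:2, c = c("x", "y"),
                 stringsAsFactors = FALSE)
setDT(df)

# Names of the numeric columns (is.numeric is TRUE for integer and double)
num_cols <- names(df)[sapply(df, is.numeric)]

# Convert only those columns, by reference
df[, (num_cols) := lapply(.SD, as.character), .SDcols = num_cols]

sapply(df, class)
```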
I am trying to select some columns from a data.table but getting unexpected results.
For the following, I want to select columns y and z and this works as expected
library(data.table)
dt <- data.table(x=1:4, y=5:8, z=9:6)
dt[, c("y", "z")]
When I try to do this using setdiff it returns nonsense
omit_var <- "x"
dt[, setdiff(c("x","y","z"), omit_var)]
Even though they are equivalent: all.equal(setdiff(c("x","y","z"), omit_var), c("y", "z"))
Why is this happening? I'm guessing it's a scoping issue, but can I avoid it while keeping the code similar?
(I realise I can do i <- setdiff(c("x","y","z"), omit_var); dt[,..i])
Following comments from @Roland: "If you call a function in j, data.table can't know that you want to subset. It assumes that you want the return value of the function.". So you can use either of
dt[, mget(setdiff(c("x","y","z"), omit_var))]
dt[, setdiff(c("x","y","z"), omit_var), with = FALSE]
@sindri_baldur gave another alternative using .SDcols
dt[, .SD, .SDcols = setdiff(c("x","y","z"), omit_var)]
And more options from @DavidArenburg
dt[, .SD, .SDcols = -omit_var]
If you want to keep all columns except the one in a character string you can use
dt[, !"x"]
Or you could use the .. operator
cols <- setdiff(names(dt), omit_var)
dt[, ..cols]
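Putting the variants side by side may help. In this sketch on the question's data (keep, as_value and the other result names are illustrative), the bare function call in j returns its value, while the other forms select columns.

```r
library(data.table)

dt <- data.table(x = 1:4, y = 5:8, z = 9:6)
omit_var <- "x"

# A function call in j is simply evaluated, so this returns a character vector
as_value <- dt[, setdiff(c("x", "y", "z"), omit_var)]

# These variants all ask for a column subset instead
keep <- setdiff(names(dt), omit_var)
by_with   <- dt[, keep, with = FALSE]
by_dotdot <- dt[, ..keep]
by_sdcols <- dt[, .SD, .SDcols = keep]
by_mget   <- dt[, mget(keep)]

all.equal(by_with, by_dotdot)
```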
I am using data.table and I want to filter a data.table within a function where I am passing the name of the column as a character vector.
For a simple reproducible example let's take the mtcars dataset of base R (converted to a data.table).
I can write using data.table syntax:
mtcars[am == 1, .N ]
But what if the name of the variable of interest -- i.e. am -- is stored as a character vector, i.e. "am"?
Your advice will be appreciated.
One option is to use get (?get search object by name):
mtcars[get('am') == 1, .N]
# [1] 13
Another option is to specify it in the .SDcols
mtcars[, sum(.SD==1) ,.SDcols = 'am']
#[1] 13
We can also include multiple variables
mtcars[, sum(Reduce(`&`, lapply(.SD, `==`, 1))), .SDcols = c('am', 'carb')]
#[1] 4
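Since the question asks about doing this inside a function, here is a minimal sketch of the get() approach wrapped in one. The function name count_equal and the copy named cars are hypothetical, and the sketch assumes the table has no columns called col or value, which would otherwise shadow the arguments.

```r
library(data.table)

# mtcars is a plain data.frame, so work on a data.table copy
cars <- as.data.table(mtcars)

# Hypothetical helper: count rows where the column named `col` equals `value`.
# Assumes the table has no columns literally named "col" or "value".
count_equal <- function(DT, col, value) {
  DT[get(col) == value, .N]
}

n_manual <- count_equal(cars, "am", 1)
n_manual
```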
In R I have a data.table df with an integer column X. I want to convert this column from integer to a character.
This is really easy with df[, X:=as.character(X)].
Now for the question, I have the name of the column (X) stored in a variable like this:
col_name <- 'X'.
How do I access the column (and convert it to a character column) while only knowing the variable?
I tried numerous things, all yielding nothing useful or a column of NAs. Which syntax will get me the result I want?
library(data.table)
DT <- as.data.table(iris)
col_name <- "Petal.Length"
Use ( to evaluate the LHS of := and use list subsetting to select the column:
DT[, (col_name) := as.character(DT[[col_name]])]
class(DT[[col_name]])
#[1] "character"
We can specify it in .SDcols and assign the columns to character
df[, (col_name) := as.character(.SD[[1L]]), .SDcols = col_name]
If there is more than one column, use lapply
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]
data
df <- data.table(X = as.integer(1:5), Y = LETTERS[1:5])
col_name <- "X"
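Another option is data.table's set(), which updates one column at a time by reference. The sketch below uses made-up data with an extra column Z, and col_names is a hypothetical vector of the columns to convert.

```r
library(data.table)

df <- data.table(X = 1:5, Y = LETTERS[1:5], Z = 6:10)
col_names <- c("X", "Z")   # hypothetical columns to convert

# set() replaces each named column by reference
for (nm in col_names) {
  set(df, j = nm, value = as.character(df[[nm]]))
}

sapply(df, class)
```

A loop over set() can be convenient when the column names are built programmatically, since nothing needs to be wrapped in parentheses or passed through .SDcols.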