I'm unsure of what to Google.
I have a column, let's call it x. Within this variable, each row is a list of strings. For example:
1: A,B,C,D,E
2: A,B,C,D,E
What is the name of the R function to select, process, etc. within each row? E.g. I may wish to extract only B from each row, or perhaps delete all C's.
Assuming that it is a data.table, we extract the character 'B' with str_extract
library(data.table)
library(stringr)
dt[, x:= str_extract(x, "B")]
and if we want to delete all 'C's, it can be done with gsub from base R or str_replace_all from stringr
dt[, x := gsub(",*C", "", x)]
data
dt <- data.table(x = c('A,B,C,D,E', 'A,C,D', 'B,C,C,D'))
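If exact element matching matters (for example, so that "C" does not also hit an element such as "CC"), one option is to split each row into a character vector and work on the pieces. This is only a sketch, assuming the elements are comma-separated with no surrounding spaces; the column names without_c and has_b are made up for illustration.

```r
library(data.table)

dt <- data.table(x = c('A,B,C,D,E', 'A,C,D', 'B,C,C,D'))

# Split every row into its elements (a list of character vectors)
parts <- strsplit(dt$x, ",", fixed = TRUE)

# Drop every element equal to "C" and glue the rest back together
dt[, without_c := vapply(parts,
                         function(p) paste(p[p != "C"], collapse = ","),
                         character(1))]

# Flag rows that contain the element "B"
dt[, has_b := vapply(parts, function(p) "B" %in% p, logical(1))]
```

Because the new values go into separate columns, the original x stays intact and both operations can be run on the same table.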
This should be easy, but Google and I are failing. Say I have this data:
library(data.table)
mydata <- data.table(a = c(1, NA),
b = c(NA, NA),
pointer = c(1,2))
and I want to get the rows where both a and b are NA. Of course I can do this manually like:
mydata[is.na(a) & is.na(b)]
but the issue arises deep in other code and I want to do this based on a character vector (or list, or whatever, this is flexible) of the column names such as:
myvector <- c("a","b")
again I can do this manually if I know how many elements the vector has:
mydata[is.na(get(myvector[1])) & is.na(get(myvector[2]))]
But I don't know how many elements myvector has in my application. How can I do this without specifying the number of entries in myvector? Essentially, I'm looking for something like with = F but for i in data.table. So I want to use myvector like this:
mydata[is.na(somefunction(myvector))]
I tried all kinds of paste0(myvector, collapse = " & ") combinations with get() or as.formula(), but it is getting me nowhere.
We can pass the vector of column names to .SDcols, loop over .SD (the Subset of Data.table) with lapply to get a list of logical vectors from is.na, and Reduce that list to a single logical vector with &, which is TRUE only where every listed column is NA. That vector is then used to subset the rows of the data.
library(data.table)
mydata[mydata[, Reduce(`&`, lapply(.SD, is.na)), .SDcols = myvector]]
-output
# a b pointer
#1: NA NA 2
Or use mget
mydata[mydata[, Reduce(`&`, lapply(mget(myvector), is.na))]]
Here is another solution assuming that myvector is a character vector:
library(data.table)
mydata[rowSums(!is.na(mydata[, ..myvector])) == 0]
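To see what the intermediate logical vector looks like, the Reduce step can be pulled out on its own. A small sketch using the data from the question; the names all_na, res_reduce and res_rowsums are just for illustration.

```r
library(data.table)

mydata <- data.table(a = c(1, NA),
                     b = c(NA, NA),
                     pointer = c(1, 2))
myvector <- c("a", "b")

# One logical per row: TRUE only if every column in myvector is NA
all_na <- mydata[, Reduce(`&`, lapply(.SD, is.na)), .SDcols = myvector]

# Use the logical vector in i to keep the matching rows
res_reduce <- mydata[all_na]

# The rowSums variant counts non-NA values per row instead
res_rowsums <- mydata[rowSums(!is.na(mydata[, ..myvector])) == 0]
```

Both variants work for any length of myvector, which is the point of the question.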
Is it possible to select a column of a data.table and get back a vector? In base R, the argument drop=TRUE would do the trick. For example,
library(data.table)
dat <- as.data.table(iris)
dat[,"Species"] # returns data.table
dat[,"Species", drop=TRUE] # same
iris[, "Species", drop=TRUE] # a factor, wanted result
Is there a way to do this with data.table?
EDIT: the dat[,Species] method is fine, however I need a method where I can pass the column name in a variable:
x <- "Species"
dat[,x, drop=TRUE]
With data.frame, the default is drop = TRUE. data.table does not use the drop argument at all, so dat[, "Species"] stays a data.table. According to ?data.table
drop - Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame.
In order to get the same behavior, we can use [[ to extract the column by passing a string
identical(dat[["Species"]], iris[, "Species"])
#[1] TRUE
Or
dat$Species
Using [[ or $ extracts the column as a vector and also bypasses the overhead of [.data.table.
See data.table FAQ #1.1. This has been a feature since 2006.
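For the case in the EDIT, where the column name is held in a variable, [[ accepts the variable directly. A minimal sketch using the x from the question; the name v is just for illustration.

```r
library(data.table)

dat <- as.data.table(iris)
x <- "Species"

# [[ takes a string, so a variable holding the column name works too
v <- dat[[x]]

# Same factor that the data.frame call with drop = TRUE returns
same <- identical(v, iris[, x, drop = TRUE])
```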
I am working through practice exercises based on problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution code uses [....., with=F][[1]]. I don't understand this part of the code and would appreciate an expert explanation.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I understand the first two lines, but not the with=F and the [[1]] that follows it.
What does with=F mean, and why is [[1]] used after it? Why is there a double bracket with 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead, you can use with = F (short for with = FALSE)
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
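As for the [[1]] part of the question: selecting a single cell with with = FALSE still returns a data.table (one row, one column), and [[1]] pulls the first column out of it as a plain vector. The sketch below illustrates this on a small made-up table; DT, row_idx and col_idx are illustrative names, not the exercise's data. It also shows which(is.na(...), arr.ind = TRUE) as one loop-free way to get the positions, which is a swapped-in alternative and not the exercise's solution.

```r
library(data.table)

# Made-up example table with two missing values
DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

row_idx <- 2
col_idx <- 1

# with = FALSE: col_idx is read as a column position held in a variable
cell <- DT[row_idx, col_idx, with = FALSE]   # still a 1 x 1 data.table

# [[1]] extracts the first (and only) column as a vector
value <- cell[[1]]

# Loop-free alternative: row and column positions of all NAs at once
pos <- which(is.na(DT), arr.ind = TRUE)
```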
I've been looking through tutorials and documentation, but have not figured out how to assign a vector of values for all columns to one existing row in a data.table.
I start with an empty data.table that already has the correct number of columns and rows:
dt <- data.table(matrix(nrow=10, ncol=5))
Now I calculate some values for one row outside of the data.table and place them in a vector vec, e.g.:
vec <- rnorm(5)
How could I assign the values of vec to, e.g., the first row of the data.table while achieving good performance (since I also want to fill the other rows step by step)?
First you need to get the correct column types, as the NA matrix you've created is logical. The column types won't be magically changed by assigning numerics to them.
dt[, names(dt) := lapply(.SD, as.numeric)]
Then you can change the first row's values with
dt[1, names(dt) := as.list(vec)]
That said, if you begin with a numeric matrix you wouldn't have to change the column types.
dt <- data.table(matrix(numeric(), 10, 5))
dt[1, names(dt) := as.list(vec)]
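Since the question mentions filling the remaining rows step by step, data.table's set() function may be worth a look: it assigns by reference like := but is meant for use inside loops. A rough sketch, assuming the table is created with NA_real_ so the columns are numeric from the start; the loop variable r and the use of set.seed() are only for illustration.

```r
library(data.table)

set.seed(1)  # only so the example is reproducible

# Numeric NA matrix, so no type conversion is needed later
dt <- data.table(matrix(NA_real_, nrow = 10, ncol = 5))

for (r in seq_len(nrow(dt))) {
  vec <- rnorm(5)  # values computed outside the table
  # Assign one value per column to row r, by reference
  set(dt, i = r, j = names(dt), value = as.list(vec))
}
```

Whether this is noticeably faster than dt[r, names(dt) := as.list(vec)] depends on the number of rows; it is worth timing on the real data.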
I have a data.table containing a name column, and I'm trying to extract a regular expression from this name. The most obvious way to do it in this case is with the := operator, as I'm assigning this extracted string as the actual name of the data. In doing so, I find that this doesn't actually apply the function in the way that I would expect. I'm not sure if it's intentional, and I was wondering if there's a reason it does what it does or if it's a bug.
library(data.table)
dt <- data.table(name = c('foo123', 'bar234'))
Searching for the desired expression in a simple character vector behaves as expected:
name <- dt[1, name]
pattern <- '(.*?)\\d+'
regmatches(name, regexec(pattern, name))
[[1]]
[1] "foo123" "foo"
I can easily subset this to get what I want
regmatches(name, regexec(pattern, name))[[1]][2]
[1] "foo"
However, I run into issues when I try to apply this to the entire data.table:
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2]]
dt
name name_final
1: foo123 foo
2: bar234 foo
I don't know how data.table works internally, but I would guess that the function was applied to the entire name column first, and then the result is coerced into a vector somehow and then assigned to the new name_final column. However, the behavior I would expect here would be on a row-by-row basis. I can emulate this behavior by adding a dummy id column:
dt[, id := seq_along(name)]
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2], by = list(id)]
dt
name name_final id
1: foo123 foo 1
2: bar234 bar 2
Is there a reason that this isn't the default behavior? If so, I would guess that it had to do with columns being atomic to the data.table rather than the rows, but I'd like to understand what's going on there.
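Pretty much nothing in R runs on a row-by-row basis. It's always better to work with columns of data at a time, so you can pretty much assume that the entire column vector of values will be passed in as a parameter to your function. In your call, regmatches() returns one list element per row, [[1]][2] picks the second value of the first element only ("foo"), and := recycles that single value down the whole column. Here's a way to extract the second element for each item in the regmatches list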
dt[, name_final := sapply(regmatches(name, regexec(pattern, name)), `[`, 2)]
Functions like sapply() or Vectorize() can "fake" a per-row type call for functions that aren't meant to be run on a vector/list of data at a time.
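The same idea can be written with vapply(), which additionally checks that every row yields exactly one string. A minimal sketch using the data and pattern from the question:

```r
library(data.table)

dt <- data.table(name = c('foo123', 'bar234'))
pattern <- '(.*?)\\d+'

# regmatches() returns a list with one character vector per row;
# `[` with index 2 picks the first capture group from each of them
dt[, name_final := vapply(regmatches(name, regexec(pattern, name)),
                          `[`, character(1), 2)]
```

This gives one value per row without the dummy id column and without by.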