Delete multiple columns by reference using reverse selection in data.table [duplicate] - r

This question already has an answer here:
How do I subset column variables in DF1 based on the important variables I got in DF2?
(1 answer)
Closed 5 years ago.
I want to delete, by reference, the columns that are not in a given list.
library("data.table")
df <- data.frame("ID"=1:10,"A"=1:10,"B"=1:10,"C"=1:10,"D"=1:10)
setDT(df,key="ID")
list_to_keep <- c("ID","A","B","C")
df[,!names(df)%in%list_to_keep,with=FALSE]
gives me a selection of the columns that I want to delete, but when I do:
df <- data.frame("ID"=1:10,"A"=1:10,"B"=1:10,"C"=1:10,"D"=1:10)
setDT(df,key="ID")
list_to_keep <- c("ID","A","B","C")
df[,!names(df)%in%list_to_keep:=NULL,with=FALSE]
I get the error: LHS of := isn't column names ('character') or positions ('integer' or 'numeric'). What is the correct way of doing this?

We can use setdiff to get the column names of the dataset that are not in list_to_keep and assign (:=) them to NULL
df[, setdiff(names(df), list_to_keep) := NULL]
As @rosscova mentioned, wrapping the logical vector in which gives the positions of the columns, which can then be assigned to NULL
df[, which(!names(df)%in%list_to_keep):=NULL]

LHS of := is "A character vector of column names (or numeric positions) or a variable that evaluates as such."
!names(df)%in%list_to_keep is a logical vector, so it is rejected.
So,
df[,names(df)[!names(df)%in%list_to_keep]:=NULL]
will work.
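To tie the answers above together, here is a minimal sketch of the setdiff approach as a self-contained script. The name cols_to_drop and the length check are illustrative additions, not part of the original answers; the check is only a precaution for the case where every column is already in list_to_keep.

```r
library(data.table)

# Same shape as the table in the question, built directly as a data.table
dt <- data.table(ID = 1:10, A = 1:10, B = 1:10, C = 1:10, D = 1:10, key = "ID")
list_to_keep <- c("ID", "A", "B", "C")

# Character vector of the columns that are NOT in list_to_keep
cols_to_drop <- setdiff(names(dt), list_to_keep)

# Parentheses make := treat cols_to_drop as a variable holding column names;
# assigning NULL removes those columns by reference (dt is not copied)
if (length(cols_to_drop) > 0L) {
  dt[, (cols_to_drop) := NULL]
}

print(names(dt))
```

The original attempt failed because a logical vector is neither column names nor positions; setdiff, which(...), and names(df)[...] all convert the selection into one of the two accepted forms.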

Related

R: filtering elements of large vector that appear in a smaller vector [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 3 years ago.
Suppose we have a numeric vector. Actually, suppose we have a dataframe consisting of a single column.
example = data.frame("column" = rnorm(10000, 10, 3))
We'll be treating it as a dataframe in order to use the filter function of the dplyr package.
Also, suppose we have another vector of smaller length. This particular vector is just for the sake of the example. It doesn't necessarily have to be a sequence.
numbers = 8:100
What I would like to do is to keep those values of the larger vector that are equal to any of the values of the smaller vector and discard those values that are not.
Fair enough. The filter function can do that. Except that I would have to write this:
filtered = dplyr::filter(example, column == numbers[1] | column == numbers[2] | ... | column == numbers[length(numbers)])
I would have to write the condition column == numbers[i] for each of the elements of the numbers vector.
Executing this code
filtered = dplyr::filter(example, column == numbers)
gives as output a dataframe called filtered that consists of a single column with no rows. There are no rows because, since all the rows of the example dataframe consist of scalars, none of those rows is equal to the whole numbers vector.
Is there a smarter method that doesn't require me to write that condition for each element of the numbers vector?
You can use the operator %in% to check if your values are "in" the vector.
Code:
new_data <- old_data %>%
  dplyr::filter(column %in% numbers)
Are you looking for:
filtered <- dplyr::filter(example, column %in% numbers)
An option with base R
subset(example, column %in% numbers)
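The %in% approach from the answers above can be checked on a small, deterministic example. This sketch uses base R only and a hand-picked column instead of rnorm, so the result is easy to verify by eye; the specific values are assumptions chosen for illustration.

```r
# Small stand-in for the data frame in the question
example <- data.frame(column = c(7, 8, 8.5, 42, 100, 101))
numbers <- 8:100

# %in% returns one TRUE/FALSE per element of column:
# TRUE when that value equals any element of numbers
keep <- example$column %in% numbers

# drop = FALSE keeps the result a data frame even with a single column
filtered <- example[keep, , drop = FALSE]
print(filtered)
```

Note that values drawn with rnorm are continuous, so they will almost never be exactly equal to an integer in 8:100; with the data from the question, an empty result from %in% would be expected rather than a sign of a bug.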

creating, directly, data.tables with column names from variables, and using variables for column names with := [duplicate]

This question already has answers here:
Select / assign to data.table when variable names are stored in a character vector
(6 answers)
Closed 3 years ago.
The only way I know so far is in two steps: creating the columns with dummy names and then using setnames(). I would like to do it in one step. There is probably some parameter or option for this, but I am not able to find it.
# the awkward way I have found so far
col_names <- c("one", "two","three")
dt <- data.table()
# add columns with dummy names...
setnames(dt, col_names )
Also interested in a way to be able to use a variable with :=, something like
colNameVar <- "dummy_values"
DT[ , colNameVar := 1:10]
To me, this question does not seem to be a duplicate of Select / assign to data.table when variable names are stored in a character vector.
Here I ask about creating a data.table; note the word "creating" in the title.
This is quite different from working with a data.table that already exists, which is the subject of the question indicated as the duplicate. For the latter there are known, clearly documented ways, and they do not work in the case I ask about here.
PS. Note the similar question indicated in a comment by @Ronak Shah: Create empty data frame with column names by assigning a string vector?
For the first question, I'm not absolutely sure, but you may want to try and see if fread is of any help creating an empty data.table with named columns.
As for the second question, try
DT[, c(nameOfCols) := 10]
Where nameOfCols is the vector with names of the columns you want to modify. See ?data.table
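One possible way to cover both parts of the question, sketched under assumptions: the column values are available as a list, and base R's setNames is used to attach the names before conversion. This is an alternative to the fread suggestion above, not a confirmation of it; the values list is made up for illustration.

```r
library(data.table)

col_names <- c("one", "two", "three")

# Hypothetical column contents, one list element per column
values <- list(1:3, letters[1:3], c(TRUE, FALSE, TRUE))

# Name the list from the variable, then convert it in a single expression
dt <- as.data.table(setNames(values, col_names))

# Column name held in a variable: wrap it in parentheses on the LHS of :=
colNameVar <- "dummy_values"
dt[, (colNameVar) := 3:1]

print(dt)
```

Without the parentheses, dt[, colNameVar := 3:1] would create a column literally called colNameVar.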

Dropping unused factor levels in data.table

I am trying to figure out the syntax for dropping unused factor levels in a data.table given a character vector of column names similar to what's done in this link. However in that example "y" is the actual column name of the data.table "x". I would like to pass instead a character vector holding the column names but I could not figure out the syntax.
We can use .SDcols to specify the columns of interest. It can take a vector of column names (of length 1 or more) or column indices. The .SD, i.e. the Subset of Data, then contains only the columns specified in .SDcols. As there is only a single column here, extract that column with [[, apply droplevels to the vector, and assign (:=) it back to the column of interest. Note the parens around the object identifier v1: they make data.table evaluate the object to get the value stored in it instead of creating a column named 'v1'.
x[, (v1) := droplevels(.SD[[1]]), .SDcols = v1]
Usually, the syntax would be
x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]
It can take one column or multiple columns. The only reason to extract ([[) is because we know it is a single column
Another option is get
x[, (v1) := droplevels(get(v1))]
where,
v1 <- "y"
@akrun's answer works well; I think this works too
x[, (v1):=droplevels(x[[v1]])]
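The multi-column form from the answer above can be tried on a small table. This is a sketch with made-up data: the columns y, z, and n and the vector name cols are assumptions for illustration.

```r
library(data.table)

x <- data.table(
  y = factor(c("a", "b", "c")),
  z = factor(c("u", "v", "w")),
  n = 1:3
)

# Subsetting rows leaves the now-unused levels "c" and "w" in place
x <- x[y != "c"]

# Character vector holding the names of the factor columns to clean up
cols <- c("y", "z")

# .SDcols restricts .SD to those columns; lapply applies droplevels to each,
# and (cols) := assigns the results back by reference
x[, (cols) := lapply(.SD, droplevels), .SDcols = cols]

print(levels(x$y))
print(levels(x$z))
```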

data.table subsetting in i with column number [duplicate]

This question already has answers here:
data.table in r : subset using column index
(2 answers)
Closed 5 years ago.
Is it possible to subset a data.table in i, referencing the column not by its name (e.g. by number/position)?
Example:
library(data.table)
dt <- data.table(A=1:18, Name=c('A','B','C'))
dt2 <- data.table(A=2:20, Username=c('A','B','C'))
#stuff happens and eventually I end up with either dt or dt2 copied to a final dt
#depending on which is there, I want to get only "A"s
final[Name=='A']
final[Username=='A']
But I want a way that I can subset both data.tables with the same call despite the different column names. One potential solution is to set the key for each dt as Name and Username, then subset like this: final['A'] but I am wondering if there is another way.
I can't change the column names because they are going into a table in a shiny app.
If this is based on position, then we extract the column with numeric column index using [[ and do the comparison to get the logical vector and subset the rows based on it
final[final[[2]]=="A"]

Convert any column in a data frame with numbers to a numeric value (is.numeric) [duplicate]

This question already has an answer here:
How do I test for numeric values in a dataframe of characters, and convert those to numeric?
(1 answer)
Closed 6 years ago.
I have a large data frame and it's a mixture of character, integer, and numeric columns. From what I've seen, some columns that I want to be numeric are not always numeric. I can manually convert each one but I wanted to see if there was a way to detect a column with numbers in it and then convert it to numeric.
You can use an lapply if you first create a vector of the column names with numbers in them. You will likely need to create this vector of column names manually, versus creating it based upon searching for columns using is.numeric, as your issue is that they are incorrectly classified. This is using library(data.table).
cols <- c("var1", "var2", "var3", "var4")
DT[, (cols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols=cols]
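If listing the columns by hand is impractical, one possible heuristic is to treat a column as numeric when every non-missing value survives as.numeric. This is a sketch under that assumption, not part of the answer above: the helper is_number_like and the sample data are hypothetical, and the rule may misclassify columns such as zero-padded IDs that only look numeric.

```r
library(data.table)

DT <- data.table(
  var1 = c("1", "2.5", "3"),           # numbers stored as character
  var2 = factor(c("10", "20", "30")),  # numbers stored as factor
  name = c("a", "b", "c")              # genuinely non-numeric
)

# Hypothetical helper: TRUE when every non-NA value converts cleanly to numeric
is_number_like <- function(x) {
  x <- as.character(x)
  all(is.na(x) | !is.na(suppressWarnings(as.numeric(x))))
}

# Names of the columns that pass the check
cols <- names(DT)[vapply(DT, is_number_like, logical(1))]

# as.character first, so factors convert by label rather than by internal code
DT[, (cols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = cols]

str(DT)
```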
