Dropping unused factor levels in data.table

I am trying to figure out the syntax for dropping unused factor levels in a data.table, given a character vector of column names, similar to what's done in this link. However, in that example "y" is the actual column name in the data.table "x". I would like to pass a character vector holding the column names instead, but I could not figure out the syntax.

We can use .SDcols to specify the columns of interest. It accepts a vector of column names (length one or more) or column indices. .SD (the Subset of Data.table) then contains only the columns named in .SDcols. As there is only a single column here, extract it with [[, apply droplevels to the vector, and assign (:=) the result back to the column of interest. Note the parentheses around the object identifier v1: they make data.table evaluate the object and use its value, instead of creating a column named 'v1'.
x[, (v1) := droplevels(.SD[[1]]), .SDcols = v1]
Usually, the syntax would be
x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]
It can take one column or multiple columns. The only reason to extract ([[) is because we know it is a single column
Another option is get
x[, (v1) := droplevels(get(v1))]
where,
v1 <- "y"
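To make the multi-column form concrete, here is a minimal sketch. The table, its column names (y, z, n) and the vector name cols are made up for illustration and are not from the question.

```r
library(data.table)

# Hypothetical example data: two factor columns and one integer column
x <- data.table(
  y = factor(c("a", "b", "c")),
  z = factor(c("u", "v", "w")),
  n = 1:3
)

# Subsetting rows keeps the original levels, so "c" and "w" become unused
x <- x[n < 3]

# Character vector holding the names of the columns to clean up
cols <- c("y", "z")

# Parentheses around cols make := use the names stored in the vector
x[, (cols) := lapply(.SD, droplevels), .SDcols = cols]

levels(x$y)
levels(x$z)
```

The same call works unchanged when cols holds a single name, which is why the lapply form is usually the safer default.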

@akrun's answer works well; I think this works too:
x[, (v1):=droplevels(x[[v1]])]

Related

Want to replace certain rows in one dataframe with rows from another based on matching timestamps (both dataframes have timestamps in the same tz)

I want to be able to take some values from one dataframe and have these inserted into another dataframe (both have the same amount of columns with the same titles)
I want the values in each row from dataframe 1 to replace those in dataframe 2 based on a matching timestamps.
For most of the rows/timestamps I want the original data to remain in dataframe 1 so this is only for a set of specific timestamps (those in dataframe 2)
Does dplyr solve this somehow?

It may be easier in data.table with a join. Get the column names of the first dataset except 'timestamp' ('nm1'; this assumes both datasets share the same column names), join on the 'timestamp' column, and assign the corresponding columns from the second dataset (prefixed with i.) wherever the 'timestamp' matches.
library(data.table)
nm1 <- setdiff(names(df1), "timestamp")
nm2 <- paste0("i.", nm1)
setDT(df1)[df2, (nm1) := mget(nm2), on = .(timestamp)]
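A small sketch of that update join on made-up data may help. The tables df1 and df2 below, with columns a and b, are assumptions for illustration; the question did not supply sample data.

```r
library(data.table)

# Hypothetical data: four hourly timestamps in df1
ts <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 3600 * (0:3)
df1 <- data.table(timestamp = ts, a = 1:4, b = c(10, 20, 30, 40))

# df2 holds replacement rows for the 2nd and 4th timestamps only
df2 <- data.table(timestamp = ts[c(2, 4)], a = c(200L, 400L), b = c(-2, -4))

nm1 <- setdiff(names(df1), "timestamp")  # columns to overwrite
nm2 <- paste0("i.", nm1)                 # same columns as seen from df2

# Rows of df1 whose timestamp matches df2 are updated in place;
# all other rows keep their original values
df1[df2, (nm1) := mget(nm2), on = .(timestamp)]

df1
```

Because := updates by reference, df1 itself is modified and no copy of the full table is made.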

Replacing a text in all columns with another text in datatable R

I found a solution for replacing a text in all columns of a data.frame with another text, but I could not use the same approach for a data.table. Below is what I tried. When I change data.frame to data.table, it doesn't give the correct data.
DF<- data.frame(lapply(DT, function(x) {gsub("abc", "xyz", x)}))
I need to find and replace all occurrences of abc with xyz in all columns of a data.table object.

If it is a data.table and we want to change all the column values, use the data.table methods. Based on the OP's code, we are selecting all the columns (so there is no need to specify .SDcols), looping through the Subset of Data.table with lapply, replacing 'abc' with 'xyz' using gsub (which handles multiple instances of 'abc' in a value), and updating the original columns by assigning (:=) the output back to them.
attrdata2[, names(attrdata2) := lapply(.SD, function(x) gsub("abc", "xyz", x))]
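One thing to watch: gsub returns character vectors, so applying it to every column also turns numeric columns into character. If that is not wanted, a possible variation is to restrict the replacement to character columns. The table dt and its columns below are invented for illustration.

```r
library(data.table)

# Hypothetical data: one integer column and two character columns
dt <- data.table(
  id = 1:3,
  s1 = c("abc", "xabcx", "none"),
  s2 = c("abcabc", "q", "abc")
)

# Pick only the character columns so that id keeps its integer type
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# fixed = TRUE treats "abc" as a literal string rather than a regex
dt[, (chr_cols) := lapply(.SD, function(x) gsub("abc", "xyz", x, fixed = TRUE)),
   .SDcols = chr_cols]

dt
```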

Delete multiple columns by reference using reverse selection in data.Table [duplicate]

I want to delete, by reference, the columns that are not in a list.
library("data.table")
df <- data.frame("ID"=1:10,"A"=1:10,"B"=1:10,"C"=1:10,"D"=1:10)
setDT(df,key="ID")
list_to_keep <- c("ID","A","B","C")
df[,!names(df)%in%list_to_keep,with=FALSE]
gives me a selection of the columns that I want to delete, but when I do:
df <- data.frame("ID"=1:10,"A"=1:10,"B"=1:10,"C"=1:10,"D"=1:10)
setDT(df,key="ID")
list_to_keep <- c("ID","A","B","C")
df[,!names(df)%in%list_to_keep:=NULL,with=FALSE]
I get LHS of := isn't a column names ('character' or positions ('integer' or 'numeric'). What is the correct way of doing this?

We can use setdiff to get the column names that are not in list_to_keep and assign (:=) them to NULL
df[, setdiff(names(df), list_to_keep) := NULL]
As @rosscova mentioned, which on the logical vector gives the positions of the columns, which can then be assigned to NULL
df[, which(!names(df)%in%list_to_keep):=NULL]

LHS of := is "A character vector of column names (or numeric positions) or a variable that evaluates as such."
!names(df)%in%list_to_keep is logical vector.
So,
df[,names(df)[!names(df)%in%list_to_keep]:=NULL]
will work.
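Putting the setdiff variant together as a runnable sketch (the table name dt and the helper vectors keep and drop are made up here; the guard against an empty drop vector is just a defensive assumption):

```r
library(data.table)

# Hypothetical table with one column (D) that should be removed
dt <- data.table(ID = 1:3, A = 1:3, B = 1:3, C = 1:3, D = 1:3)
setkey(dt, ID)

keep <- c("ID", "A", "B", "C")

# Column names present in dt but absent from keep
drop <- setdiff(names(dt), keep)

# Delete by reference; skipped when there is nothing to drop
if (length(drop) > 0L) dt[, (drop) := NULL]

names(dt)
```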

Use string to select column per row in dplyr (or base R)

I have a column filled with other column names. I want to get, for each row, the value in the column that row names.
# three columns with values and one "key" column
library(dplyr)
data = data.frame(
x = runif(10),
y = runif(10),
z = runif(10),
key = sample(c('x', 'y', 'z'), 10, replace=TRUE)
)
# now get the value named in 'key'
data = data %>% mutate(value = VALUE_AT_COLUMN(key))
I'm pretty sure the answer has something to do with the lazy eval version of mutate, but I can't for the life of me figure it out.
Any help would be appreciated.

We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)) and group by the sequence of rows; within each group, use .SD to subset the column specified by 'key'.
library(data.table)
setDT(data)[, .SD[, key[[1L]], with=FALSE] ,1:nrow(data)]
Another option is get, after converting 'key' to character class (as it is a factor), again grouping by the sequence of rows as in the previous case.
setDT(data)[, get(as.character(key)), 1:nrow(data)]
Here is one option with do
library(dplyr)
data %>%
group_by(rn = row_number()) %>%
do(data.frame(., value= .[[.$key]]))

Here's a Base R solution:
data$value = diag(as.matrix(data[,data$key]))

For a memory efficient and fast solution, you should update your original data.table by performing a join as follows:
data[.(key2 = unique(key)), val := get(key2), on=c(key="key2"), by=.EACHI][]
For each key2 the matching rows in data$key are calculated. Those rows are updated with the values from the column that is contained in key2. For example, key2="x" matches with rows 1,2,6,8,10. The corresponding values of data$x are data$x[c(1,2,6,8,10)]. by=.EACHI ensures the expression get(key2) is executed for each value of key2.
Since this operation is performed only on unique values it should be considerably faster than running it row-wise. And since the data.table is updated by reference, it should be quite memory efficient (and that contributes to speed as well).

It definitely feels like there should be a base R solution to this, but the best I could do was with tidyr: first transform the data to long form, then filter for just those observations that match the desired key.
library(tidyr)
data %>%
add_rownames("index") %>%
gather(var, value, -index, -key) %>%
filter(key == var)
A base R solution that almost works:
data[cbind(seq_along(data$key), data$key)]
For the data given, it does work, but because it uses a matrix, it has two serious problems. One is that the order of the factor levels matters, because the factor is simply coerced and columns are selected by factor level, not by column name. The other is that the resulting output is character, not numeric, because the conversion to a matrix picks the character type on account of the key column. The key problem is that there is no data.frame analog to the matrix-indexing behavior described in the R documentation:
When indexing arrays by '[' a single argument 'i' can be a matrix with as many columns as there are dimensions of 'x'; the result is then a vector with elements corresponding to the sets of indices in each row of 'i'.
Given these problems, I would probably go with the tidyr solution, since the fact that the columns are variably selectable means that they probably represent different observations for the same observable unit.
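For what it's worth, both problems can probably be sidestepped in base R by building the matrix from the numeric columns only and translating the key to column positions with match. This is a sketch under the assumption that the value columns are known up front; value_cols and m are names introduced here for illustration.

```r
set.seed(1)

# Same shape as the question's data, kept small
data <- data.frame(
  x = runif(5),
  y = runif(5),
  z = runif(5),
  key = sample(c("x", "y", "z"), 5, replace = TRUE),
  stringsAsFactors = FALSE
)

# Assumed to be known: the columns that 'key' can refer to
value_cols <- c("x", "y", "z")

# Numeric-only matrix, so nothing is coerced to character
m <- as.matrix(data[value_cols])

# match() maps each key to a column position by name,
# so factor level order no longer matters
col_idx <- match(as.character(data$key), value_cols)

# Two-column index matrix: (row, column) pairs, one per row of data
data$value <- m[cbind(seq_len(nrow(data)), col_idx)]

data
```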

How to pass a variable column name to the "by" command?

I use the data.table package in R to summarize data often. In this particular case, I'm just counting the number of occurrences in a dataset for given column groups. But I'm having trouble incorporating a loop to do this dynamically.
Normally, I'd summarize data like this.
data <- data.table(mpg)
data.temp1 <- data[, .N, by="manufacturer,class"]
data.temp2 <- data[, .N, by="manufacturer,trans"]
But now I want to loop through the columns of interest in my dataset and plot. Rather than repeating the code over and over, I want to put it in a for loop. Something like this:
columns <- c('class', 'trans')
for (i in 1:length(columns)) {
data.temp <- data[, .N, by=list(manufacturer,columns[i])]
#plot data
}
If I only wanted the column of interest, I could do this in the loop and it works:
data.temp <- data[, .N, by=get(columns[i])]
But if I want to put in a static column name, like manufacturer, it breaks. I can't seem to figure out how to mix a static column name along with a dynamic one. I've looked around but can't find an answer.
Would appreciate any thoughts!

You should be fine if you just quote "manufacturer":
data.temp <- data[, .N, by=c("manufacturer",columns[i])]
From the ?'[.data.table' help page, by= can be:
A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names.
This seems like the easiest way to give you what you need.
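As a rough sketch of the whole loop, using a small made-up table instead of mpg so that it runs without ggplot2 (the table dt and the list name counts are assumptions for illustration):

```r
library(data.table)

# Hypothetical stand-in for the mpg data
dt <- data.table(
  manufacturer = c("audi", "audi", "ford", "ford"),
  class        = c("compact", "compact", "suv", "pickup"),
  trans        = c("auto", "manual", "auto", "auto")
)

columns <- c("class", "trans")

# One summary table per dynamic column, each grouped together
# with the static "manufacturer" column
counts <- lapply(columns, function(col) {
  dt[, .N, by = c("manufacturer", col)]
})
names(counts) <- columns

counts$class
counts$trans
```

Using lapply instead of a for loop keeps every summary around for plotting afterwards; a plain for loop with the same by= expression would work just as well.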
