Replacing a text in all columns with another text in a data.table in R

I found a solution for replacing a text with another text in all columns of a data.frame, but I could not use the same approach for a data.table. Below is what I tried. When I change data.frame to data.table, it does not give the correct data.
DF<- data.frame(lapply(DT, function(x) {gsub("abc", "xyz", x)}))
I need to find and replace all occurrences of abc with xyz in all columns of a data.table object.

If it is a data.table and we want to change all the column values, then use the data.table methods. Based on the OP's code, we are selecting all the columns (so there is no need to specify .SDcols), looping through the Subset of Data.table (.SD) with lapply, replacing 'abc' with 'xyz' using gsub (which replaces every instance of 'abc' within each value, not just the first) and updating the original columns by assigning (:=) the output back to them. Note that gsub returns character vectors, so every column becomes character after this step.
DT[, names(DT) := lapply(.SD, function(x) gsub("abc", "xyz", x))]
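Because gsub coerces its input to character, it may be preferable to touch only the character columns. The following is a minimal sketch under that assumption; the table contents and the helper name chr_cols are made up for illustration, and fixed = TRUE is an optional choice that treats "abc" as a literal string rather than a regular expression.

```r
library(data.table)

# Hypothetical example data: one integer column and two character columns
DT <- data.table(
  id = 1:3,
  a  = c("abc1", "xabcabc", "none"),
  b  = c("abc", "def", "abc")
)

# Names of the character columns only, so numeric columns keep their type
chr_cols <- names(DT)[vapply(DT, is.character, logical(1))]

# Replace every occurrence of "abc" with "xyz" in those columns, by reference
DT[, (chr_cols) := lapply(.SD, function(x) gsub("abc", "xyz", x, fixed = TRUE)),
   .SDcols = chr_cols]

print(DT)
```

The parentheses around chr_cols make data.table use the names stored in that variable instead of creating a column literally called chr_cols.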

Related

Dropping unused factor levels in data.table

I am trying to figure out the syntax for dropping unused factor levels in a data.table given a character vector of column names, similar to what's done in this link. However, in that example "y" is the actual column name of the data.table "x". I would like to pass a character vector holding the column names instead, but I could not figure out the syntax.
We can use .SDcols to specify the columns of interest. It can take a vector of column names (of length 1 or more) or column indices. The .SD, i.e. the Subset of Data.table, then holds only the columns specified in .SDcols. As there is only a single column here, extract that column with [[, apply droplevels to the vector and assign (:=) it back to the column of interest. Note the parentheses around the object identifier v1. They make data.table evaluate the object and use the value in it instead of creating a column named 'v1'.
x[, (v1) := droplevels(.SD[[1]]), .SDcols = v1]
Usually, the syntax would be
x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]
It can take one column or multiple columns. The only reason to extract ([[) is because we know it is a single column
Another option is get
x[, (v1) := droplevels(get(v1))]
where,
v1 <- "y"
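As a sketch of the multi-column form, the example below drops unused levels from two factor columns named in a character vector. The table contents and the variable name cols are invented for illustration.

```r
library(data.table)

# Hypothetical data with two factor columns and one integer column
x <- data.table(
  y = factor(c("a", "b", "c")),
  z = factor(c("u", "v", "w")),
  n = 1:3
)

# Subsetting keeps the old levels, so "c" and "w" are now unused
x <- x[y != "c"]

# Column names held in a character vector
cols <- c("y", "z")

# Drop unused levels in each listed column and assign back by reference
x[, (cols) := lapply(.SD, droplevels), .SDcols = cols]

print(levels(x$y))
print(levels(x$z))
```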
@akrun's answer works well. I think this works too:
x[, (v1):=droplevels(x[[v1]])]

Directly paste two data table columns

I have a syntax question because I do not understand the behavior of data.table for my problem.
Similar to this question, I want to paste two columns directly together using a predefined character vector. I do not want to create a new column.
MWE:
dt <- data.table(L=1:5,A=letters[7:11],B=letters[12:16])
cols<-c("A", "B")
I can paste directly using the col names without brackets as from the other question
dt[,paste0(A,B)]
But I can't do it using with=F or .SD:
dt[,paste0(cols),with=F]
dt[,paste0(.SD),.SDcols=cols]
Why do I have to use a do.call?
dt[,do.call(paste0,.SD), .SDcols=cols]
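One way to see the difference, offered as a sketch rather than a definitive explanation: paste0 expects each vector as a separate argument, while .SD is a single list-like object. Passing a list as one argument makes paste0 convert each list element to a single string, whereas do.call spreads the list elements out as separate arguments. The variable names as_one_arg, spread and direct are made up for this illustration.

```r
library(data.table)

dt <- data.table(L = 1:5, A = letters[7:11], B = letters[12:16])
cols <- c("A", "B")

# A list passed as ONE argument: each element is deparsed into a single
# string, so the result has one entry per column, not one per row
as_one_arg <- paste0(list(dt$A, dt$B))

# do.call passes each column of .SD as its own argument,
# which is the same call as paste0(A, B)
spread <- dt[, do.call(paste0, .SD), .SDcols = cols]
direct <- dt[, paste0(A, B)]

print(length(as_one_arg))
print(spread)
```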

Selecting rows in data.table on the basis of a substring match to any of multiple columns

I have a data.table like this one, but with many more columns:
library(data.table)
the_dt = data.table(DetailCol1=c("Deets1","Deets2","Deets3","Deets4"), DetailCol2 = c("MoreDeets1","MoreDeets2","MoreDeets3","MoreDeets4"), DataCol1=c("ARP","AARPP","ABC","ABC"), DataCol2=c("ABC","ABC","ABC","ARPe"), DataCol3 = c("ABC", "ARP", "ABC","ABC"))
I want to retrieve DetailCol1 of only those rows that contain a match to the string 'ARP'.
This question was useful in pointing me to %like%, but I'm still not sure how to do this for multiple columns, especially if there are dozens of columns in which I would like to search.
For instance, this is how I could search within DataCol1:
the_dt[DataCol1 %like% 'ARP',DetailCol1]
But how would I conduct the same search in DataCols 1-100?
We can specify the columns to compare in .SDcols, loop through them with lapply, turn each one into a logical vector with %like%, combine those vectors with Reduce and `|` so that a row is TRUE when at least one column matches, and use the result to subset the elements of 'DetailCol1'.
the_dt[the_dt[, Reduce(`|`, lapply(.SD, `%like%`, "ARP")),
.SDcols= DataCol1:DataCol3], DetailCol1]
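When there are dozens of data columns, the names can be collected by pattern instead of being written out as a range. The sketch below assumes the columns to search all start with "DataCol"; it uses base grepl with fixed = TRUE in place of %like%, and the names data_cols, hit and res are illustrative.

```r
library(data.table)

the_dt <- data.table(
  DetailCol1 = c("Deets1", "Deets2", "Deets3", "Deets4"),
  DetailCol2 = c("MoreDeets1", "MoreDeets2", "MoreDeets3", "MoreDeets4"),
  DataCol1   = c("ARP", "AARPP", "ABC", "ABC"),
  DataCol2   = c("ABC", "ABC", "ABC", "ARPe"),
  DataCol3   = c("ABC", "ARP", "ABC", "ABC")
)

# Assumption: every column to search has a name beginning with "DataCol"
data_cols <- grep("^DataCol", names(the_dt), value = TRUE)

# One logical vector per column, OR-ed together row by row
hit <- the_dt[, Reduce(`|`, lapply(.SD, function(x) grepl("ARP", x, fixed = TRUE))),
              .SDcols = data_cols]

# Keep DetailCol1 for rows where any searched column matched
res <- the_dt[hit, DetailCol1]
print(res)
```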

How to pass a variable column name to the "by" command?

I use the data.table package in R to summarize data often. In this particular case, I'm just counting the number of occurrences in a dataset for given column groups. But I'm having trouble incorporating a loop to do this dynamically.
Normally, I'd summarize data like this.
library(data.table)
library(ggplot2) # provides the mpg dataset
data <- data.table(mpg)
data.temp1 <- data[, .N, by="manufacturer,class"]
data.temp2 <- data[, .N, by="manufacturer,trans"]
But now I want to loop through the columns of interest in my dataset and plot. Rather than repeating the code over and over, I want to put it in a for loop. Something like this:
columns <- c('class', 'trans')
for (i in 1:length(columns)) {
data.temp <- data[, .N, by=list(manufacturer,columns[i])]
#plot data
}
If I only wanted the column of interest, I could do this in the loop and it works:
data.temp <- data[, .N, by=get(columns[i])]
But if I want to put in a static column name, like manufacturer, it breaks. I can't seem to figure out how to mix a static column name along with a dynamic one. I've looked around but can't find an answer.
Would appreciate any thoughts!
You should be fine if you just quote "manufacturer":
data.temp <- data[, .N, by=c("manufacturer",columns[i])]
From the ?'[.data.table' help page, by= can be
A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names.
This seems like the easiest way to give you what you need.
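Put together, the loop could look like the sketch below. To keep it self-contained it uses a small made-up table instead of mpg, and it collects the summaries in a list (counts, a name chosen for illustration) rather than plotting them.

```r
library(data.table)

# Hypothetical stand-in for the mpg data
dt <- data.table(
  manufacturer = c("audi", "audi", "ford", "ford", "ford"),
  class        = c("compact", "compact", "suv", "pickup", "suv"),
  trans        = c("auto", "manual", "auto", "auto", "auto")
)

columns <- c("class", "trans")

# by= accepts a character vector, so the static name and the
# dynamic name can simply be combined with c()
counts <- lapply(columns, function(col) {
  dt[, .N, by = c("manufacturer", col)]
})
names(counts) <- columns

print(counts$class)
print(counts$trans)
```

Each element of counts is a data.table with the two grouping columns and N, ready to be passed to whatever plotting code follows.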

R update table column based on search string from another table

I am trying to update Cell B in a table based on the value of cell A in the same table. To filter the rows I want to update I am using grepl to compare cell A to a list of character strings from a list/table/vector or some other external source. For all rows where cell A matches the search criteria, I want to update cell B to say "xxxx". I need to do this for all rows in my table.
So far I have something like this where cat1 is a list of some sort that has strings to search for.
for (i in seq_along(cat1)){
data %<>% mutate(Cat = ifelse(grepl(cat1[i],ItemName),"xxx",Cat))
}
I am open to any better way of accomplishing this. I've tried for loops with dataframes and I'm open to a data.table solution.
Thank you.
To avoid the loop you can collapse the character vector with | and then use it as a single pattern in grepl, for example you can try:
cat1_collapsed <- paste(cat1, collapse = "|")
data %>% mutate(Cat = ifelse(grepl(cat1_collapsed, ItemName),"xxx", Cat))
Or the equivalent using data.table (or base R of course).
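Alternatively, use the following code, assuming that you have a data frame called "data" with columns "A" and "B" and that "cat1" is a vector of the desired strings, as described. Note that %in% matches whole values exactly, unlike the substring match that grepl performs.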
library(data.table)
setDT(data)
data[A %in% cat1,B:="XXXX"]
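For a substring match in data.table, the collapsed pattern can be used directly in the row filter. This is a minimal sketch with invented data; the column names ItemName and Cat follow the question, and any regex metacharacters in cat1 would be interpreted as such.

```r
library(data.table)

# Hypothetical data and search strings
data <- data.table(
  ItemName = c("red apple", "green pear", "banana split", "plain rice"),
  Cat      = c("a", "b", "c", "d")
)
cat1 <- c("apple", "banana")

# Collapse the search strings into one alternation pattern: "apple|banana"
pattern <- paste(cat1, collapse = "|")

# Update Cat by reference for rows whose ItemName matches any search string
data[grepl(pattern, ItemName), Cat := "xxx"]

print(data)
```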
