Directly paste two data table columns - r

I have a syntax question because I do not understand the behavior of data.table for my problem.
Similar to this question, I want to paste two columns directly together using a predefined character vector. I do not want to create a new column.
MWE:
dt <- data.table(L=1:5,A=letters[7:11],B=letters[12:16])
cols<-c("A", "B")
I can paste directly using the col names without brackets as from the other question
dt[,paste0(A,B)]
But I can't when using with=F or .SD
dt[,paste0(cols),with=F]
dt[,paste0(.SD),.SDcols=cols]
Why do I have to use a do.call?
dt[,do.call(paste0,.SD), .SDcols=cols]
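A minimal sketch of what seems to be going on, using the MWE above: .SD is a list of columns, so paste0(.SD) receives a single argument (the whole list), while do.call() spreads the list so that each column becomes its own argument. The names one_arg and row_wise are only for illustration.

```r
library(data.table)

dt <- data.table(L = 1:5, A = letters[7:11], B = letters[12:16])
cols <- c("A", "B")

# paste0(cols) is just c("A", "B"), so with = FALSE only selects those columns.
selected <- dt[, paste0(cols), with = FALSE]

# .SD is a list: paste0() gets ONE argument and turns each column
# into a single deparsed string instead of pasting row by row.
one_arg <- dt[, paste0(.SD), .SDcols = cols]

# do.call() unpacks the list, so this is the same call as paste0(A, B).
row_wise <- dt[, do.call(paste0, .SD), .SDcols = cols]
```

With more than two columns in cols the do.call() form still works unchanged, which is probably why it is the usual idiom here.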

Related

How do I apply a function to specific columns in a dataframe and replace the original columns?

I have got a large dataframe containing medical data (my.medical.data).
A number of columns contain dates (e.g. hospital admission date), the names of each of these columns end in "_date".
I would like to apply the lubridate::dmy() function to the columns that contain dates and overwrite my original dataframe with the output of this function.
It would be great to have a general solution that can be applied using any function, not just my dmy() example.
Essentially, I want to apply the following to all of my date columns:
my.medical.data$admission_date <- lubridate::dmy(my.medical.data$admission_date)
my.medical.data$operation_date <- lubridate::dmy(my.medical.data$operation_date)
etc.
I've tried this:
date.columns <- select(ICB, ends_with("_date"))
date.names <- names(date.columns)
date.columns <- transmute_at(my.medical.data, date.names, lubridate::dmy)
Now date.columns contains my date columns, in the "Date" format, rather than the original factors. Now I want to replace the date columns in my.medical.data with the new columns in the correct format.
my.medical.data.new <- full_join(x = my.medical.data, y = date.columns)
Now I get:
Error: cannot join a Date object with an object that is not a Date object
I'm a bit of an R novice, but I suspect that there is an easier way to do this (e.g. process the original dataframe directly), or maybe a correct way to join / merge the two dataframes.
As usual it's difficult to answer without an example dataset, but this should do the work:
library(dplyr)
my.medical.data <- my.medical.data %>%
mutate_at(vars(ends_with('_date')), lubridate::dmy)
This mutates in place each variable whose name ends with '_date', applying the function to it. It can also apply multiple functions. See ?mutate_at (which is also the help page for mutate_if).
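For readers on dplyr 1.0.0 or later, where the scoped verbs such as mutate_at are superseded by across(), here is a small sketch of the same idea. The toy data frame is made up to stand in for my.medical.data, with dates stored as character strings in day/month/year order.

```r
library(dplyr)

# Hypothetical toy data standing in for my.medical.data
my.medical.data <- data.frame(
  id = 1:2,
  admission_date = c("01/02/2020", "15/03/2020"),
  operation_date = c("03/02/2020", "16/03/2020"),
  stringsAsFactors = FALSE
)

# Overwrite every column whose name ends in "_date" with its parsed version
my.medical.data <- my.medical.data %>%
  mutate(across(ends_with("_date"), lubridate::dmy))
```

Any other function can be swapped in for lubridate::dmy, and columns that do not match are left untouched.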
Several ways to do that.
If you work with voluminous data, I think data.table is the best approach (will bring you flexibility, speed and memory efficiency)
data.table
You can use := (the update-by-reference operator) together with lapply to apply lubridate::dmy to all columns listed in .SDcols:
library(data.table)
setDT(my.medical.data)
# names of the columns ending in "_date"
cols_to_change <- grep("_date$", colnames(my.medical.data), value = TRUE)
my.medical.data[, (cols_to_change) := lapply(.SD, lubridate::dmy), .SDcols = cols_to_change]
base R
A standard lapply can also help. You could try something like this (I did not test it):
# for a plain data.frame (not one converted with setDT)
my.medical.data[cols_to_change] <- lapply(my.medical.data[cols_to_change], lubridate::dmy)
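Since the question asks for something general that works with any function, here is a rough base R sketch. The helper name apply_to_cols and the toy data are made up for illustration, and as.Date with an explicit format is used in place of lubridate::dmy only to keep the example free of dependencies.

```r
# Hypothetical helper: apply `fun` to every column whose name matches `pattern`
apply_to_cols <- function(df, pattern, fun, ...) {
  cols <- grep(pattern, names(df), value = TRUE)
  df[cols] <- lapply(df[cols], fun, ...)
  df
}

# Made-up data with day/month/year strings
toy <- data.frame(
  id = 1:2,
  admission_date = c("01/02/2020", "15/03/2020"),
  operation_date = c("03/02/2020", "16/03/2020"),
  stringsAsFactors = FALSE
)

# Extra arguments (here `format`) are passed on to `fun`
toy <- apply_to_cols(toy, "_date$", as.Date, format = "%d/%m/%Y")
```

Note that this returns a modified copy, so the result has to be assigned back, unlike the data.table version, which updates by reference.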

creating, directly, data.tables with column names from variables, and using variables for column names with := [duplicate]

This question already has answers here:
Select / assign to data.table when variable names are stored in a character vector
The only way I know so far takes two steps: creating the columns with dummy names and then using setnames(). I would like to do it in one step. There is probably some parameter or option for this, but I am not able to find it.
# the awkward way I have found so far
col_names <- c("one", "two","three")
dt <- data.table()
# add columns with dummy names...
setnames(dt, col_names )
Also interested in a way to be able to use a variable with :=, something like
colNameVar <- "dummy_values"
DT[ , colNameVar := 1:10]
This question does not seem to me to be a duplicate of Select / assign to data.table when variable names are stored in a character vector.
Here I ask about creating a data.table; note the word "creating" in the title.
This is quite different from the case where the data.table already exists, which is the subject of the question indicated as a duplicate. For the latter there are known, clearly documented ways, which do not work in the case I ask about here.
PS. Note the similar question indicated in a comment by @Ronak Shah: Create empty data frame with column names by assigning a string vector?
For the first question, I'm not absolutely sure, but you may want to try and see if fread is of any help creating an empty data.table with named columns.
As for the second question, try
DT[, c(nameOfCols) := 10]
Where nameOfCols is the vector with names of the columns you want to modify. See ?data.table
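To make both parts concrete, here is a small sketch with made-up values. One way to get the names from a variable at creation time is to build a named list with setNames() and convert it. For :=, a bare name on the left-hand side is taken literally (DT[, colNameVar := 1:10] creates a column called colNameVar), so the variable is wrapped in parentheses or c() to have it evaluated.

```r
library(data.table)

col_names <- c("one", "two", "three")

# Name the list from the variable first, then convert it in a single step
dt <- as.data.table(setNames(list(1:3, letters[1:3], c(0.1, 0.2, 0.3)), col_names))

# Parentheses make := evaluate colNameVar instead of using it as the name
colNameVar <- "dummy_values"
dt[, (colNameVar) := 3:1]

# set() takes the column name as a string directly
set(dt, j = "flag", value = TRUE)
```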

Replacing a text in all columns with another text in datatable R

I found a solution for a data.frame that replaces a text in all columns with another text, but I could not use the same approach for a data.table. Below is what I tried. When I change data.frame to data.table, it does not give the correct data.
DF<- data.frame(lapply(DT, function(x) {gsub("abc", "xyz", x)}))
I need to find and replace all occurrences of abc with xyz in all columns of a data.table object.
If it is a data.table and we want to change the values in all columns, use the data.table methods. Based on the OP's code, we select all the columns (so there is no need to specify .SDcols), loop through the Subset of Data.table (.SD) with lapply, replace every 'abc' with 'xyz' using gsub (which, unlike sub, handles multiple instances of 'abc' in one value), and update the original columns by assigning (:=) the output back to them:
attrdata2[, names(attrdata2) := lapply(.SD, function(x) gsub("abc", "xyz", x))]
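One caveat worth checking: gsub() always returns character, so running it over every column also turns numeric columns into character. Below is a minimal sketch, on made-up data, that restricts the replacement to the character columns. The table DT and the name chr_cols are assumptions for illustration.

```r
library(data.table)

# Made-up table: one integer column, two character columns
DT <- data.table(id = 1:3,
                 x = c("abc1", "zabc", "q"),
                 y = c("abcabc", "n", "abc"))

# Only touch character columns so `id` keeps its integer type
chr_cols <- names(DT)[vapply(DT, is.character, logical(1))]

# fixed = TRUE treats "abc" as a literal string rather than a regex
DT[, (chr_cols) := lapply(.SD, gsub, pattern = "abc", replacement = "xyz", fixed = TRUE),
   .SDcols = chr_cols]
```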

R: Drop columns from data.table, by reference, without having the name

This is almost a duplicate of this. I want to drop columns from a data table, but I want to do it efficiently. I have a list of names of columns that I want to keep. All the answers to the linked question imply doing something akin to
data.table.new <- data.table.old[, my.list]
which at some crucial point will give me a new object, while the old object is still in memory. However, my data.table.old is huge, and hence I prefer to do this via reference, as suggested here
set(data.table.old, j = 'a', value = NULL)
However, as I have a whitelist of columns and not a blacklist, I would need to iterate through all the column names, check whether each one is in my.list, and then apply set(). Is there a cleaner way of doing this?
I am not sure you can do by-reference operations on a data.frame without converting it to a data.table.
The code below should work if you are willing to use data.table.
library(data.table)
setDT(data.frame.old)
dropcols <- names(data.frame.old)[!names(data.frame.old) %in% my.list]
data.frame.old[, c(dropcols) := NULL]
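The same idea on a small made-up table, written with setdiff() to turn the whitelist into a blacklist. The address() comparison is only there as a rough check that the table itself was not copied; the values in DT and my.list are assumptions for illustration.

```r
library(data.table)

# Made-up table and whitelist
DT <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
my.list <- c("a", "c")

addr_before <- address(DT)

# Everything not on the whitelist gets deleted by reference
drop_cols <- setdiff(names(DT), my.list)
if (length(drop_cols) > 0) DT[, (drop_cols) := NULL]

# Should be TRUE if DT was modified in place rather than copied
same_object <- identical(addr_before, address(DT))
```

The length check guards against the case where the whitelist already covers every column and there is nothing to drop.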

R, create a new column in a data frame that applies a function of all the columns with similar names

I have a data frame in which the names of the columns are something like a,b,v1,v2,v3...v100.
I want to create a new column that applies a function to only the columns whose names include 'v'.
For example, given this data frame
df<-data.frame(a=rnorm(3),v1=rnorm(3),v2=rnorm(3),v3=rnorm(3))
I want to create a new column in which each element is the sum of the elements of v1, v2 and v3 that are in the same row.
grep on names to get the column positions, then use rowSums:
rowSums(df[,grep("v",names(df))])
To combine both @James's and @Anatoliy's answers,
apply(df[grepl('^v', names(df))], 1, sum)
I went ahead and anchored the v in the regular expression to the beginning of the string. Other examples haven't done that but it appears that you want all columns that begin with v not the larger set that may have a v in their name. If I am wrong you could just do
apply(df[grepl('v', names(df))], 1, sum)
You should avoid using subset() when programming, as stated in ?subset
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
Also, as I learned yesterday from Richie Cotton, when indexing it is better to use grepl than grep.
That should do:
df$sums<- rowSums(subset(df, select=grepl("v", names(df))))
For a more general approach:
apply(subset(df, select=grepl("v", names(df))), 1, sum)
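Pulling the answers above together, here is a small sketch with fixed, made-up numbers so the result is easy to check by hand. It uses standard `[` subsetting with grepl and an anchored pattern; the column other_v is included only to show what the anchor excludes.

```r
# Made-up data: other_v contains a "v" but does not start with one
df <- data.frame(a = c(10, 20, 30),
                 v1 = 1:3, v2 = 4:6, v3 = 7:9,
                 other_v = c(100, 100, 100))

# Logical index of the columns whose names start with "v"
v_cols <- grepl("^v", names(df))

# Row-wise sum over just those columns
df$sums <- rowSums(df[v_cols])

# General form for an arbitrary row-wise function
df$maxes <- apply(df[v_cols], 1, max)
```

With the unanchored pattern "v", other_v would be picked up as well and each sum would be larger by 100.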

Resources