R: Drop columns from data.table, by reference, without having the name - r

This is almost a duplicate of this. I want to drop columns from a data table, but I want to do it efficiently. I have a list of names of columns that I want to keep. All the answers to the linked question imply doing something akin to
data.table.new <- data.table.old[, my.list]
which at some crucial point will give me a new object, while the old object is still in memory. However, my data.table.old is huge, and hence I prefer to do this via reference, as suggested here
set(data.table.old, j = 'a', value = NULL)
However, as I have a whitelist of columns, and not a blacklist, I would need to iterate through all the column names, checks whether they are in my.list, and then apply set(). Is there any cleaner/other way to doing so?

Not sure if you can do by reference ops on data.frame without making it data.table.
Below code should works if you consider to use data.table.
library(data.table)
setDT(data.frame.old)
dropcols <- names(data.frame.old)[!names(data.frame.old) %in% my.list]
data.frame.old[, c(dropcols) := NULL]

Related

How do I apply a function to specific columns in a dataframe and replace the original columns?

I have got a large dataframe containing medical data (my.medical.data).
A number of columns contain dates (e.g. hospital admission date), the names of each of these columns end in "_date".
I would like to apply the lubridate::dmy() function to the columns that contain dates and overwrite my original dataframe with the output of this function.
It would be great to have a general solution that can be applied using any function, not just my dmy() example.
Essentially, I want to apply the following to all of my date columns:
my.medical.data$admission_date <- lubridate::dmy(my.medical.data$admission_date)
my.medical.data$operation_date <- lubridate::dmy(my.medical.data$operation_date)
etc.
I've tried this:
date.columns <- select(ICB, ends_with("_date"))
date.names <- names(date.columns)
date.columns <- transmute_at(my.medical.data, date.names, lubridate::dmy)
Now date.columns contains my date columns, in the "Date" format, rather than the original factors. Now I want to replace the date columns in my.medical.data with the new columns in the correct format.
my.medical.data.new <- full_join(x = my.medical.data, y = date.columns)
Now I get:
Error: cannot join a Date object with an object that is not a Date object
I'm a bit of an R novice, but I suspect that there is an easier way to do this (e.g. process the original dataframe directly), or maybe a correct way to join / merge the two dataframes.
As usual it's difficult to answer without an example dataset, but this should do the work:
library(dplyr)
my.medical.data <- my.medical.data %>%
mutate_at(vars(ends_with('_date')), lubridate::dmy)
This will mutate in place each variable that end with '_date', applying the function. It can also apply multiple functions. See ?mutate_at (which is also the help for mutate_if)
Several ways to do that.
If you work with voluminous data, I think data.table is the best approach (will bring you flexibility, speed and memory efficiency)
data.table
You can use the := (update by reference operator) together with lapplỳ to apply lubridate::ymd to all columns defined in .SDcols dimension
library(data.table)
setDT(my.medical.data)
cols_to_change <- endsWith("_date", colnames(my.medical.date))
my.medical.data[, c(cols_to_change) := lapply(.SD, lubridate::ymd), .SDcols = cols_to_change]
base R
A standard lapply can also help. You could try something like that (I did not test it)
my.medical.data[, cols_to_change] <- lapply(cols_to_change, function(d) lubridate::ymd(my.medical.data[,d]))

R: data.table : searching on multiple columns AND setting data type

Q1:
Is it possible for me to search on two different columns in a data table. I have a 2 million odd row data and I want to have the option to search on either of the two columns. One has names and other has integers.
Example:
x <- data.table(foo=letters,bar=1:length(letters))
x
want to do
x['c'] : searching on foo column
as well as
x[2] : searching on bar column
Q2:
Is it possible to change the default data types in a data table. I am reading in a matrix with both character and integer columns however everything is being read in as a character.
Thanks!
-Abhi
To answer your Q2 first, a data.table is a data.frame, both of which are internally a list. Each column of the data.table (or data.frame) can therefore be of a different class. But you can't do that with a matrix. You can use := to change the class (by reference - no unnecessary copy being made), for example, of "bar" here:
x[, bar := as.integer(as.character(bar))]
For Q1, if you want to use fast subset (using binary search) feature of data.table, then you've to set key, using the function setkey.
setkey(x, foo)
allows you to fast-subset on 'x' alone like: x['a'] (or x[J('a')]). Similarly setting a key on 'bar' allows you to fast-subset on that column.
If you set the key on both 'foo' and 'bar' then you can provide values for both like so:
setkey(x) # or alternatively setkey(x, foo, bar)
x[J('c', 3)]
However, this'll subset those where x == 'c' and y == 3. Currently, I don't think there is a way to do a | operation with fast-subset directly. You'll have to resort to a vector-scan approach in that case.
Hope this is what your question was about. Not sure.
Your matrix is already a character. Matrices hold only one data type. You can try X['c'] and X[J(2)]. You can change data types as X[,col := as.character(col)]

Subsetting data.table with a condition

How to sample a subsample of large data.table (data.table package)? Is there more elegant way to perform the following
DT<- data.table(cbind(site = rep(letters[1:2], 1000), value = runif(2000)))
DT[site=="a"][sample(1:nrow(DT[site=="a"]), 100)]
Guess there is a simple solution, but can't choose the right wording to search for.
UPDATE:
More generally, how can I access a row number in data.table's i argument without creating temporary column for row number?
One of the biggest benefits of using data.table is that you can set a key for your data.
Using the key and then .I (a built in vairable. see ?data.table for more info) you can use:
setkey(DT, site)
DT[DT["a", sample(.I, 100)]]
As for your second question "how can I access a row number in data.table's i argument"
# Just use the number directly:
DT[17]
Using which, you can find the row-numbers. Instead of sampling from 1:nrow(...) you can simply sample from all rows with the desired property. In your example, you can use the following:
DT[sample(which(site=="a"), 100)]

Conflicting/duplicate column names in J()?

I have two data.tables (dat and results) that share column names. On a side note, results holds summary statistics computed earlier on *sub*groups of dat. In other words, nrow(results) != nrow(dat) (but I don't think this is relevant for the question)
Now I want to incorporate these results back into dat (i.e. the original data.table) by adding a new column (i.e. NewColZ) to dat
This doesn't work as I expect:
dat[,list(colA,colB,NewColZ=results1[colX==colX & colY==colY,colZ])
,by=list(colX, colY)]
Why? because "colX" and "colY" are columns names in both data.tables (i.e. dat and results). What I want to say is, results1[take_from_self(colX)==take_from_parent(colX)]
Therefore the following works (observe I have only RENAMED the columns)
dat[,list(colA,colB,NewCol=results1[cx==colX & cy==colY,colZ,])
,by=list(colX, colY)]
Though I have a feeling that this can simply and easily be done by a join. But dat has many more columns than results
What you are trying to do is a join on colX and colY. You can use := to assign by reference. Joining is most straightforward when you have unique combinations (which I am assuming you do)
keys <- c('colX', 'colY')
setkeyv(dat, keys)
setkeyv(results, keys)
dat[results, newcolZ := colZ]
# perhap use `i.` if there is a colZ in dat
# dat[results, newcolZ := i.colZ]
I do concur with the comments that suggest reading the FAQ and introduction vignettes as well as going through the many examples in ?data.table.
Your issue was a scoping issue, but your primary issue was not being fully aware of the data.table idioms. The join approach is the idoimatically data.table approach.

Rename multiple dataframe columns, referenced by current names

I want to rename some random columns of a large data frame and I want to use the current column names, not the indexes. Column indexes might change if I add or remove columns to the data, so I figure using the existing column names is a more stable solution.
This is what I have now:
mydf = merge(df.1, df.2)
colnames(mydf)[which(colnames(mydf) == "MyName.1")] = "MyNewName"
Can I simplify this code, either the original merge() call or just the second line? "MyName.1" is actually the result of an xts merge of two different xts objects.
The trouble with changing column names of a data.frame is that, almost unbelievably, the entire data.frame is copied. Even when it's in .GlobalEnv and no other variable points to it.
The data.table package has a setnames() function which changes column names by reference without copying the whole dataset. data.table is different in that it doesn't copy-on-write, which can be very important for large datasets. (You did say your data set was large.). Simply provide the old and the new names:
require(data.table)
setnames(DT,"MyName.1", "MyNewName")
# or more explicit:
setnames(DT, old = "MyName.1", new = "MyNewName")
?setnames
names(mydf)[names(mydf) == "MyName.1"] = "MyNewName" # 13 characters shorter.
Although, you may want to replace a vector eventually. In that case, use %in% instead of == and set MyName.1 as a vector of equal length to MyNewName
plyr has a rename function for just this purpose:
library(plyr)
mydf <- rename(mydf, c("MyName.1" = "MyNewName"))
names(mydf) <- sub("MyName\\.1", "MyNewName", names(mydf))
This would generalize better to a multiple-name-change strategy if you put a stem as a pattern to be replaced using gsub instead of sub.
You can use the str_replace function of the stringr package:
names(mydf) <- str_replace(names(mydf), "MyName.1", "MyNewName")

Resources