Rename multiple dataframe columns, referenced by current names - r

I want to rename some random columns of a large data frame and I want to use the current column names, not the indexes. Column indexes might change if I add or remove columns to the data, so I figure using the existing column names is a more stable solution.
This is what I have now:
mydf = merge(df.1, df.2)
colnames(mydf)[which(colnames(mydf) == "MyName.1")] = "MyNewName"
Can I simplify this code, either the original merge() call or just the second line? "MyName.1" is actually the result of an xts merge of two different xts objects.

The trouble with changing column names of a data.frame is that, almost unbelievably, the entire data.frame is copied (at least in older versions of R; more recent versions make only a shallow copy). Even when it's in .GlobalEnv and no other variable points to it.
The data.table package has a setnames() function that changes column names by reference, without copying the whole dataset. data.table is different in that it doesn't copy-on-write, which can be very important for large datasets (you did say your data set was large). Simply provide the old and the new names:
require(data.table)
setnames(DT,"MyName.1", "MyNewName")
# or more explicit:
setnames(DT, old = "MyName.1", new = "MyNewName")
?setnames

names(mydf)[names(mydf) == "MyName.1"] = "MyNewName" # 13 characters shorter.
Although, you may want to replace a vector eventually. In that case, use %in% instead of == and set MyName.1 as a vector of equal length to MyNewName
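One caveat with the %in% approach: the new names are assigned in the order the columns appear in the data frame, not in the order of the old-name vector. A minimal sketch that pairs names explicitly with match() instead; the toy data frame and the names MyName.2, Other and MyOtherName are assumptions made up for illustration:

```r
# Toy data frame; all column names here are illustrative only.
mydf <- data.frame(MyName.2 = 1:2, Other = 3:4, MyName.1 = 5:6)

old <- c("MyName.1", "MyName.2")
new <- c("MyNewName", "MyOtherName")

# match() gives the position of each old name in names(mydf),
# so old[i] is always paired with new[i], whatever the column order.
idx <- match(old, names(mydf))
found <- !is.na(idx)                  # skip old names that are absent
names(mydf)[idx[found]] <- new[found]

names(mydf)
```

Because the pairing is positional within old and new, adding or removing unrelated columns does not change which column gets which name.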

plyr has a rename function for just this purpose:
library(plyr)
mydf <- rename(mydf, c("MyName.1" = "MyNewName"))

names(mydf) <- sub("MyName\\.1", "MyNewName", names(mydf))
This would generalize better to a multiple-name-change strategy if you put a stem as a pattern to be replaced using gsub instead of sub.
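A small sketch of the stem idea, assuming several columns share the prefix "MyName." (the toy data frame and the replacement prefix "MyNewName_" are made up for illustration). Note that sub() already works element by element over the names; gsub() only makes a difference when the pattern can occur more than once inside a single name.

```r
# Toy data frame; column names are illustrative only.
mydf <- data.frame(MyName.1 = 1, MyName.2 = 2, Other = 3)

# ^ anchors the stem at the start of the name and \\. matches a literal dot,
# so only columns that begin with "MyName." are renamed.
names(mydf) <- sub("^MyName\\.", "MyNewName_", names(mydf))

names(mydf)
```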

You can use the str_replace function of the stringr package:
library(stringr)
# fixed() matches the name literally; otherwise "." is a regex wildcard
names(mydf) <- str_replace(names(mydf), fixed("MyName.1"), "MyNewName")

Related

Function in R for several variables

I want to unclass several factor variables in R. I need this functionality for a lot of variables. At the moment I repeat the code for each variable which is not convenient:
unclass:
myd$ati_1 <-unclass(myd$ati_1)
myd$ati_2 <-unclass(myd$ati_2)
myd$ati_3 <-unclass(myd$ati_3)
myd$ati_4 <-unclass(myd$ati_4)
I've looked into the apply() function family but I do not even know if this is the correct approach. I also read about for loops but every example is only about simple integers, not when you need to loop over several variables.
Would be glad if someone could help me out.
You can use a loop:
block <- c("ati_1", "ati_2", "ati_3", "ati_4")
for (j in block) {myd[[j]] <- unclass(myd[[j]])}
# The double brackets let you select a column by a name stored in a variable (here j)
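Since the question mentions the apply family, the same loop can be sketched with lapply(). This is only an illustration under assumed data: the toy myd below and its values are made up, and only two of the ati_ columns are shown.

```r
# Toy data standing in for the asker's myd; values are illustrative only.
myd <- data.frame(
  ati_1 = factor(c("low", "high", "low")),
  ati_2 = factor(c("yes", "no", "yes")),
  other = c(1.5, 2.5, 3.5)
)

block <- c("ati_1", "ati_2")

# myd[block] is a list of the selected columns; lapply() unclasses each one
# and the result is assigned back to the same columns in a single step.
myd[block] <- lapply(myd[block], unclass)

str(myd)
```

Columns not listed in block are left untouched.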
Here are a few ways. We use CO2 which comes with R and has several factor columns. This unclasses those columns.
If you need some other criterion, then:
In the base R solution, set ix to the names, positions or a logical vector defining the columns to be transformed.
In the collapse solution, replace is.factor with a vector of names or positions, or a logical vector, denoting the columns to convert.
In the dplyr solution, replace where(...) with the same names, positions or logical vector.
Code follows. In all of these the input is not overwritten, so you still have the input available unchanged if you want to rerun from scratch; in general, overwriting objects is error prone.
# Base R
ix <- sapply(CO2, is.factor)
replace(CO2, ix, lapply(CO2[ix], unclass))
# collapse
library(collapse)
ftransformv(CO2, is.factor, unclass)
# dplyr
library(dplyr)
CO2 %>%
  mutate(across(where(is.factor), unclass))
Depending on what you want this might be sufficient or omit the as.data.frame if a matrix result is ok.
as.data.frame(data.matrix(CO2))

How do I apply a function to specific columns in a dataframe and replace the original columns?

I have got a large dataframe containing medical data (my.medical.data).
A number of columns contain dates (e.g. hospital admission date), the names of each of these columns end in "_date".
I would like to apply the lubridate::dmy() function to the columns that contain dates and overwrite my original dataframe with the output of this function.
It would be great to have a general solution that can be applied using any function, not just my dmy() example.
Essentially, I want to apply the following to all of my date columns:
my.medical.data$admission_date <- lubridate::dmy(my.medical.data$admission_date)
my.medical.data$operation_date <- lubridate::dmy(my.medical.data$operation_date)
etc.
I've tried this:
date.columns <- select(my.medical.data, ends_with("_date"))
date.names <- names(date.columns)
date.columns <- transmute_at(my.medical.data, date.names, lubridate::dmy)
Now date.columns contains my date columns, in the "Date" format, rather than the original factors. Now I want to replace the date columns in my.medical.data with the new columns in the correct format.
my.medical.data.new <- full_join(x = my.medical.data, y = date.columns)
Now I get:
Error: cannot join a Date object with an object that is not a Date object
I'm a bit of an R novice, but I suspect that there is an easier way to do this (e.g. process the original dataframe directly), or maybe a correct way to join / merge the two dataframes.
As usual it's difficult to answer without an example dataset, but this should do the work:
library(dplyr)
my.medical.data <- my.medical.data %>%
  mutate_at(vars(ends_with('_date')), lubridate::dmy)
This will mutate in place each variable whose name ends with '_date', applying the function. It can also apply multiple functions. See ?mutate_at (which is also the help page for mutate_if).
Several ways to do that.
If you work with voluminous data, I think data.table is the best approach (will bring you flexibility, speed and memory efficiency)
data.table
You can use := (the update-by-reference operator) together with lapply to apply lubridate::dmy to all the columns listed in .SDcols:
library(data.table)
setDT(my.medical.data)
# names of the columns ending in "_date"
cols_to_change <- grep("_date$", colnames(my.medical.data), value = TRUE)
my.medical.data[, (cols_to_change) := lapply(.SD, lubridate::dmy), .SDcols = cols_to_change]
base R
A standard lapply can also help on a plain data.frame (that is, without the setDT() call above). You could try something like this (I did not test it):
my.medical.data[cols_to_change] <- lapply(my.medical.data[cols_to_change], lubridate::dmy)
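The question also asks for a general solution that works with any function. A minimal base R sketch follows; apply_to_suffix and parse_dmy are hypothetical helper names, the toy data is made up, and as.Date() with an assumed "%d/%m/%Y" format stands in for lubridate::dmy so the example runs without extra packages.

```r
# Hypothetical helper: apply `fun` to every column whose name ends in `suffix`
# and return the modified data frame.
apply_to_suffix <- function(df, suffix, fun, ...) {
  cols <- names(df)[endsWith(names(df), suffix)]
  df[cols] <- lapply(df[cols], fun, ...)
  df
}

# Toy data; the values and the day/month/year format are assumptions.
my.medical.data <- data.frame(
  id = 1:2,
  admission_date = c("01/02/2020", "15/03/2020"),
  operation_date = c("03/02/2020", "16/03/2020"),
  stringsAsFactors = FALSE
)

# Stand-in for lubridate::dmy; as.character() also covers factor input.
parse_dmy <- function(x) as.Date(as.character(x), format = "%d/%m/%Y")

my.medical.data <- apply_to_suffix(my.medical.data, "_date", parse_dmy)
str(my.medical.data)
```

Passing lubridate::dmy as fun instead of parse_dmy should work the same way, since the helper only requires a function that takes a column and returns a vector of the same length.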

Error in 'colsplit' function?

I am trying to split a column of a dataframe into 2 columns using transform and colsplit from the reshape package. I don't get what I am doing wrong. Here's an example...
library(reshape)
df1 <- data.frame(col1=c("x-1","y-2","z-3"))
Now I am trying to split col1 into col1.a and col1.b at the delimiter '-'. The following is my code...
df1 <- transform(df1,col1 = colsplit(col1,split='-',names = c('a','b')))
Now in my RStudio when I do View(df1) I do get to see col1.a and col1.b split the way I want to.
But when I run...
df1$col1.a or head(df1$col1.a) I get NULL. Apparently I am not able to make any further operations on these split columns. What exactly is wrong with this?
colsplit returns a list. The easiest (and idiomatic) way to assign its elements to multiple columns in the data frame is to use [<-
e.g.
df1[c('col1.a','col1.b')] <- colsplit(df1$col1,'-',c('a','b'))
It will be much harder to do this within transform (see Assign multiple new variables on LHS in a single line in R).
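A likely reason for the NULL is that the whole split result ends up stored inside the single column col1, so no top-level column is actually called col1.a. The sketch below illustrates that reading in base R only: strsplit() stands in for colsplit(), and split_df, nested and flat are illustrative names, not part of the original code.

```r
df1 <- data.frame(col1 = c("x-1", "y-2", "z-3"), stringsAsFactors = FALSE)

# Base R stand-in for colsplit(): a two-column data frame named a and b.
parts <- do.call(rbind, strsplit(df1$col1, "-", fixed = TRUE))
split_df <- data.frame(a = parts[, 1], b = as.integer(parts[, 2]),
                       stringsAsFactors = FALSE)

# Storing the whole data frame in one column nests it inside col1.
nested <- df1
nested$col1 <- split_df
names(nested)           # the only top-level column is col1
is.null(nested$col1.a)  # no top-level column has this name
nested$col1$a           # the values live one level down

# Assigning to two names with [<- creates two ordinary columns.
flat <- df1
flat[c("col1.a", "col1.b")] <- split_df
names(flat)
```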

R: Drop columns from data.table, by reference, without having the name

This is almost a duplicate of this. I want to drop columns from a data table, but I want to do it efficiently. I have a list of names of columns that I want to keep. All the answers to the linked question imply doing something akin to
data.table.new <- data.table.old[, my.list]
which at some crucial point will give me a new object, while the old object is still in memory. However, my data.table.old is huge, and hence I prefer to do this via reference, as suggested here
set(data.table.old, j = 'a', value = NULL)
However, as I have a whitelist of columns, and not a blacklist, I would need to iterate through all the column names, check whether they are in my.list, and then apply set(). Is there any cleaner/other way to do so?
I am not sure whether you can do by-reference operations on a data.frame without converting it to a data.table.
The code below should work if you are willing to use data.table.
library(data.table)
setDT(data.frame.old)
dropcols <- names(data.frame.old)[!names(data.frame.old) %in% my.list]
data.frame.old[, c(dropcols) := NULL]

How to indicate row.names=1 using fread() in data.table?

I want to consider the first column in my .csv file as a sequence of rownames. Usually I used to do the following:
read.csv("example_file.csv", row.names=1)
But I want to do this with the fread() function in the data.table R package, as it runs very quickly.
X <- as.matrix(fread("bigmatrix.csv"),rownames=1)
Why not save the rownames in a column:
df <- data.frame(x=rnorm(1000))
df$row_name = row.names(df)
fwrite(df,file="example_file.csv")
Then you can load the saved CSV.
df <- fread(file="example_file.csv")
From a small search I've done, data.tables never use row names. Since a data.table inherits from data.frame, it still has the row names attribute, but it never uses it.
However, you can probably use this answer (similar post) and later make the rowname column into your actual rownames. Though, it might not be efficient.
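Another option is to convert to a data frame and take the row names from the first column:
a <- as.data.frame(fread(file="example_file.csv"))
row.names(a) <- a[[1]]  # the first column holds the row names
a[[1]] <- NULL          # drop it now that it is stored as row names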
