Custom function to rename all columns - r

I want to manipulate the names of all the columns in a dataframe with this function that I wrote:
clean_names <- function(df) {
names(df) <- tolower(names(df))
names(df) <- gsub('\\s', '\\_', names(df))
names(df) <- gsub('\\(|\\)|\\/|,|\\.', '\\_', names(df))
names(df) <- gsub('(\\_)\\_', '\\1', names(df))
names(df) <- gsub('\\_$', '', names(df))
}
That said, when actually called, it doesn't do anything (no error just nothing).
What's the problem here?
I suspect the problem is that I'm only assigning things and not returning anything. But in this case I don't want to return a value just change the column names.
The only parameter here is df and I'm calling the names() function multiple times. Shouldn't this work? Any help is appreciated!

Two things here:
R generally does not operate by side effect. You may pass a data.frame into a function, but the first time you change anything about it, the df inside the function is copied into a new object that goes away when the function finishes. The original frame is untouched. Some functions in R do operate by side effect, but most do not. So you cannot make changes inside the function and expect them to take effect outside it; you need to assign the result back to the frame, as in:
mydata <- clean_names(mydata)
When a function has no explicit return(.) call, R returns the value of the last evaluated expression. You will often see functions end with the desired object (df here) rather than a literal return(); return() is useful in some circumstances (such as exiting early) but usually not needed.
In your function the last expression is an assignment, and the value of an assignment is returned invisibly, which is why nothing is printed. You can see what is really being returned by capturing the return value in a new variable or, as a shortcut, by wrapping the call in parentheses: (clean_names(mydata)). The output from your function is a character vector.
Why? Because the last expression is a reassignment of names. The RHS of that assignment is producing a character vector, and that is passed to the `names<-` function on the LHS, and that value (the vector of strings) is then used as the return value of the function.
The resolution here is to add df (or return(df) if you must) to the end of your function, as in:
clean_names <- function(df) {
names(df) <- tolower(names(df))
names(df) <- gsub('\\s', '\\_', names(df))
names(df) <- gsub('\\(|\\)|\\/|,|\\.', '\\_', names(df))
names(df) <- gsub('(\\_)\\_', '\\1', names(df))
names(df) <- gsub('\\_$', '', names(df))
df
}
After making both changes (ending the function with df and assigning the result back to mydata), you should get the data frame with the cleaned column names.
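A minimal sketch of both points. The data frame mydata and its column names are made up for illustration, and the regular expressions are written without the redundant backslashes from the question (an assumption that they behave the same):

```r
# Hypothetical example data; the column names are invented for illustration.
mydata <- data.frame(a = 1:2, b = 3:4)
names(mydata) <- c("First Name", "Total (USD)")

# Original shape: the last expression is an assignment, so the function
# invisibly returns the right-hand side, a character vector.
clean_names_orig <- function(df) {
  names(df) <- tolower(names(df))
  names(df) <- gsub("\\s", "_", names(df))
  names(df) <- gsub("\\(|\\)|/|,|\\.", "_", names(df))
  names(df) <- gsub("(_)_", "\\1", names(df))
  names(df) <- gsub("_$", "", names(df))
}

# Fixed shape: same steps, but the function ends with df.
clean_names_fixed <- function(df) {
  names(df) <- tolower(names(df))
  names(df) <- gsub("\\s", "_", names(df))
  names(df) <- gsub("\\(|\\)|/|,|\\.", "_", names(df))
  names(df) <- gsub("(_)_", "\\1", names(df))
  names(df) <- gsub("_$", "", names(df))
  df
}

res_orig <- clean_names_orig(mydata)
class(res_orig)                 # "character": not a data frame
names_before <- names(mydata)   # still the original names; only a copy was changed

mydata <- clean_names_fixed(mydata)   # reassign to keep the result
names(mydata)
```

Without the reassignment on the second-to-last line, even the fixed function would leave mydata untouched.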

From the names documentation:
For names<-, the updated object. (Note that the value of names(x) <- value is that of the assignment, value, not the return value from the left-hand side.)
Therefore you should try:
clean_names <- function(df) {
names(df) <- tolower(names(df))
names(df) <- gsub('\\s', '\\_', names(df))
names(df) <- gsub('\\(|\\)|\\/|,|\\.', '\\_', names(df))
names(df) <- gsub('(\\_)\\_', '\\1', names(df))
names(df) <- gsub('\\_$', '', names(df))
return(df)
}
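The documented behaviour can be checked on a small named vector. This is only a sketch; the object names x, y, val_assign and val_call are invented for the example:

```r
x <- c(a = 1, b = 2)

# Assignment form: the value of the whole expression is the right-hand side.
val_assign <- (names(x) <- c("p", "q"))
val_assign            # a character vector holding the new names

# Function-call form: `names<-` returns the updated object itself.
y <- c(a = 1, b = 2)
val_call <- `names<-`(y, c("p", "q"))
val_call              # a numeric vector carrying the new names
```

This is why a function whose last line is names(df) <- ... hands back the new names rather than the data frame.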

Related

Convert column/variable to row names for multiple lists in R with sapply

I am trying to convert a column to row names for each list of an object using sapply.
tibble::column_to_rownames() is not an option because it "always returns a data frame".
For a single variable (class sf, data.frame), using the base function:
rownames(polygon_nyc_listings[[1]]) <- polygon_nyc_listings[[1]]$zipcode
It works perfectly without changing the class: sf, data.frame.
So I am trying to repeat for each list the function above without success:
test <- sapply(polygon_nyc_listings,
function(x){rownames(x) <- x$zipcode},
simplify = FALSE, USE.NAMES = TRUE)
I instead get an object with zipcode lists class character.
Does someone know how to proceed?
Because an R function returns the value of its last evaluated expression (or of an explicit return() call), your anonymous function function(x) { rownames(x) <- x$zipcode } returns the value of the assignment. That value is the right-hand side, x$zipcode (a character vector here), not the modified x.
This can be quickly fixed by calling x after row names change as commented, function(x) { rownames(x) <- x$zipcode; x } or function(x) { rownames(x) <- x$zipcode; return(x) }.
However, consider row.names<- (subtle difference with assignment operator embedded):
function(x) { `rownames<-`(x, value = x$zipcode) }
which per same docs:
row.names<- returns a data frame with the row names changed.
Altogether, removing redundant curly braces {...} and default USE.NAMES = TRUE:
test_list <- sapply(polygon_nyc_listings,
function(df) `row.names<-`(df, value = df$zipcode),
simplify = FALSE)
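A small sketch of the difference, using a hypothetical list of plain data frames in place of polygon_nyc_listings (the names listings, north and south are invented, and sf objects are not tested here):

```r
# Hypothetical stand-in for polygon_nyc_listings.
listings <- list(
  north = data.frame(zipcode = c("10001", "10002"), n = c(5L, 7L),
                     stringsAsFactors = FALSE),
  south = data.frame(zipcode = c("11201", "11215"), n = c(3L, 9L),
                     stringsAsFactors = FALSE)
)

# Assignment as the last expression: each element becomes x$zipcode.
broken <- sapply(listings, function(x) { rownames(x) <- x$zipcode },
                 simplify = FALSE)

# Replacement function called directly: each element stays a data frame.
fixed <- sapply(listings, function(x) `row.names<-`(x, value = x$zipcode),
                simplify = FALSE)

sapply(broken, class)    # character for every element
sapply(fixed, class)     # data.frame for every element
rownames(fixed$north)
```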

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
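To see which strings the pattern touches, it can be tried on a plain character vector first. The last element below is a hypothetical multi-digit case that is not in the question's data:

```r
x <- c("pmm_StartTimev4_E2_C19_1",   # "Start" plus a trailing _number
       "pmm_StartTimev4_E2_E2_C1",   # "Start" but the suffix is _C1, not _number
       "complete_E1_C12_1",          # trailing _number but no "Start"
       "pmm_StartTo_v4_E2_C19_12")   # hypothetical multi-digit suffix

res <- sub("(.*Start.*)_\\d+$", "\\1", x)
res
```

Only the first and last strings should lose their suffix; the other two are returned unchanged because the pattern does not match them.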
doit <- function(x){
x <- as.character(x)
if(grepl("Start",x)){
# remove only a trailing underscore-plus-number suffix (any number of digits)
x <- sub("_[0-9]+$","",x)
}
return(x)
}
apply(dfbefore,c(1,2),doit)
     a                          b
[1,] "pmm_StartTimev4_E2_C19"   "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12"          "pmm_StartTo_v4_E2_C19"
We can use sub to capture groups where the 'Start' substring is also present followed by an underscore and one or more numbers. In the replacement, use the backreference of the captured group. As there are multiple columns, use lapply to loop over the columns, apply the sub and assign the output back to the original data
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE

Changing column names using regular expressions

I have five columns named organoleptic.1., organoleptic.2., organoleptic.3. and so forth in a data frame called "df". I want to rename them to organoleptic1, organoleptic2, organoleptic3, etc. That is, I want to remove the two dots surrounding the number. I did it using the names function:
names(df)[names(df) == "organoleptic.1."] <- "organoleptic1"
names(df)[names(df) == "organoleptic.2."] <- "organoleptic2"
names(df)[names(df) == "organoleptic.3."] <- "organoleptic3"
names(df)[names(df) == "organoleptic.4."] <- "organoleptic4"
names(df)[names(df) == "organoleptic.5."] <- "organoleptic5"
However, I would like to do it just typing one line of code. Is it possible to do that using regular expressions or any other trick? Many thx!
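We can use the gsub function with fixed = TRUE, so that '.' is treated as a literal dot rather than a regex wildcard (sub would only remove the first dot in each name):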
colnames(df) <- gsub('.', '', colnames(df), fixed=TRUE)
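A short sketch of why fixed = TRUE matters here, plus a narrower alternative. The vector nms and the extra name "other.column" are made up for the example:

```r
nms <- paste0("organoleptic.", 1:5, ".")

# Without fixed = TRUE, '.' is a regex wildcard and every character is removed.
wild <- gsub(".", "", nms)

# With fixed = TRUE, only literal dots are removed.
literal <- gsub(".", "", nms, fixed = TRUE)

# Narrower alternative: strip only the dots around a trailing number,
# leaving dots in other column names alone.
narrow <- sub("\\.(\\d+)\\.$", "\\1", c(nms, "other.column"))

wild
literal
narrow
```

The narrower pattern may be preferable when the data frame has other columns whose dots should be kept.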

Subset variables in data frame based on column type

I need to subset data frame based on column type - for example from data frame with 100 columns I need to keep only those column with type factor or integer. I've written a short function to do this, but is there any simpler solution or some built-in function or package on CRAN?
My current solution to get variable names with requested types:
varlist <- function(df=NULL, vartypes=NULL) {
type_function <- c("is.factor","is.integer","is.numeric","is.character","is.double","is.logical")
names(type_function) <- c("factor","integer","numeric","character","double","logical")
names(df)[as.logical(sapply(lapply(names(df), function(y) sapply(type_function[names(type_function) %in% vartypes], function(x) do.call(x,list(df[[y]])))),sum))]
}
The function varlist works as follows:
For every requested type and for every column in data frame call "is.TYPE" function
Sum the tests for every variable (logical values are coerced to integer automatically)
Cast result to logical vector
subset names in data frame
And some data to test it:
df <- read.table(file="http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", sep=" ", header=FALSE, stringsAsFactors=TRUE)
names(df) <- c('ca_status','duration','credit_history','purpose','credit_amount','savings', 'present_employment_since','installment_rate_income','status_sex','other_debtors','present_residence_since','property','age','other_installment','housing','existing_credits', 'job','liable_maintenance_people','telephone','foreign_worker','gb')
df$gb <- ifelse(df$gb == 2, FALSE, TRUE)
df$property <- as.character(df$property)
varlist(df, c("integer","logical"))
I'm asking because my code looks really cryptic and hard to understand (even for me and I've finished the function 10 minutes ago).
Just do the following:
df[,sapply(df,is.factor) | sapply(df,is.integer)]
subset_colclasses <- function(DF, colclasses="numeric") {
DF[, sapply(DF, function(vec, test) any(class(vec) %in% test), test = colclasses), drop = FALSE]
}
str(subset_colclasses(df, c("factor", "integer")))
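The same idea can be sketched with vapply() or Filter() on a toy data frame. The data frame toy and its columns are hypothetical and stand in for the German credit data:

```r
# Hypothetical toy data with one column per type of interest.
toy <- data.frame(
  id    = 1:3,                        # integer
  score = c(0.5, 1.5, 2.5),           # double
  grp   = factor(c("a", "b", "a")),   # factor
  label = c("x", "y", "z"),           # character
  stringsAsFactors = FALSE
)

# One logical per column; vapply() guarantees a logical vector of that shape.
keep <- vapply(toy, function(col) is.factor(col) || is.integer(col), logical(1))
sub_toy <- toy[, keep, drop = FALSE]

# Filter() applies the predicate to each column and keeps the TRUE ones.
sub_toy2 <- Filter(function(col) is.factor(col) || is.integer(col), toy)

names(sub_toy)
names(sub_toy2)
```

Using drop = FALSE keeps the result a data frame even when only one column passes the test.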

Rename columns of a data frame by searching column name

I am writing a wrapper to ggplot to produce multiple graphs based on various datasets. As I am passing the column names to the function, I need to rename the column names so that ggplot can understand the reference.
However, I am struggling with renaming of the columns of a data frame
here's a data frame:
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
here are my column names for search:
col1_search <- "col1"
col2_search <- "col2"
col3_search <- "col3"
and here are column names to replace:
col1_replace <- "new_col1"
col2_replace <- "new_col2"
col3_replace <- "new_col3"
when I search for column names, R sorts the column indexes and disregards the search location.
for example, when I run the following code, I expected the new headers to be new_col1, new_col2, and new_col3, instead the new column names are: new_col3, new_col2, and new_col1
colnames(df)[names(df) %in% c(col3_search,col2_search,col1_search)] <- c(col3_replace,col2_replace,col1_replace)
Does anyone have a solution where I can search for column names and replace them in that order?
require(plyr)
df <- data.frame(col2=1:3,col1=3:5,col3=6:8)
df <- rename(df, c("col1"="new_col1", "col2"="new_col2", "col3"="new_col3"))
df
And you can be creative in making that second argument to rename so that it is not so manual.
> names(df)[grep("^col", names(df))] <-
paste("new", names(df)[grep("^col", names(df))], sep="_")
> names(df)
[1] "new_col1" "new_col2" "new_col3"
If you want to replace an ordered set of column names with an arbitrary character vector, then this should work:
names(df)[sapply(oldNames, grep, names(df) )] <- newNames
The sapply()-ed grep will give you the proper locations for the 'newNames' vector. I suppose you might want to make sure there are a complete set of matches if you were building this into a function.
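The %in% attempt in the question fails because it yields a logical index in column order, so the replacements are assigned left to right regardless of the order of the search names. A minimal sketch of the position-based approach with a completeness check follows; rename_by_lookup is a hypothetical helper, and it uses exact matching with match() instead of grep:

```r
# Hypothetical helper: rename columns by exact name, in the order given.
rename_by_lookup <- function(df, old, new) {
  stopifnot(length(old) == length(new))
  pos <- match(old, names(df))   # position of each old name in df
  if (anyNA(pos)) {
    stop("columns not found: ", paste(old[is.na(pos)], collapse = ", "))
  }
  names(df)[pos] <- new
  df
}

df <- data.frame(col1 = 1:3, col2 = 3:5, col3 = 6:8)
df <- rename_by_lookup(df,
                       old = c("col3", "col2", "col1"),
                       new = c("new_col3", "new_col2", "new_col1"))
names(df)
```

Exact matching avoids the case where a pattern such as "col1" would also hit a column named "col10".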
Hmm, this might be way too complicated, but it is the first thing that comes to mind:
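lookup <- data.frame(search = c(col3_search,col2_search,col1_search),
replace = c(col3_replace,col2_replace,col1_replace),
stringsAsFactors = FALSE)
# find each current column name in lookup$search and take its replacement
# (assumes every column of df appears in lookup$search)
colnames(df) <- lookup$replace[match(colnames(df), lookup$search)]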
I second @justin's aes_string suggestion, but for future renaming you can try:
require(stringr)
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
oldNames <- c("col1", "col2", "col3")
newNames <- c("new_col1", "new_col2", "new_col3")
names(df) <- str_replace(string=names(df), pattern=oldNames, replacement=newNames)
