Renaming Data Frame Column Names using Partial Matching - r

If I have the code at the bottom of the post, how can I replace the column names of df1 with the second column of df2 using partial matching of df2's first column? The output should look like df3. The entirety of my data frame is filled with many other names besides .length (i.e. CS1.1.width, CS2.12.height, etc), but the CS#.#. always remains in the name.
I then need to remove the ".length" from the colnames.
I have tried using pmatch below for the first part of the question, but the output is not correct.
names(df1) <- df2$new[pmatch(names(df1), df2$partial_atch)]
How would I go about this? Thanks.
old <- c("CS1.1.length", "CS1.7.length", "CS1.10.length", "CS1.12.length", "CS2.4.length", "CS2.6.length", "CS2.9.length", "CS2.11.length", "CS1.1.height")
df1 <- data.frame()
for (k in old) df1[[k]] <- as.character()
new <- c("Bob", "Alex", "Gary", "Taylor", "Tom", "John", "Pat", "Mary")
partial_match <- c("CS1.1", "CS1.7", "CS1.10", "CS1.12", "CS2.4", "CS2.6", "CS2.9", "CS2.11")
df2 <- data.frame(Partial_Match = partial_match, Name = new)
new1 <- c("Bob.length", "Alex.length", "Gary.length", "Taylor.length", "Tom.length", "John.length", "Pat.length", "Mary.length", "Bob.height")
df3 <- data.frame()
for (k in new) df3[[k]] <- as.character()
Edit: The number of columns in df1 is greater than the number of elements in partial_match, so added an additional column in df1 as example.

Here's an option with str_replace from the stringi package:
This works because you can use a vector of pattern = to replace with a matching replacement =.
We need to paste on the trailing . because this prevents CS1.1 replacing CS1.11 and CS1.10.
library(stringi)
stri_replace_all_regex(names(df1),
pattern = paste0(as.character(df2$Partial_Match),"\\."),
replacement = paste0(as.character(df2$Name),"\\."),
vectorize_all = FALSE)
#[1] "Bob.length" "Alex.length" "Gary.length" "Taylor.length" "Tom.length" "John.length" "Pat.length"
#[8] "Mary.length"

Related

R Remove columns from a data.frame with other df

Hello I have 2 dataframes:
df1 looks like:
and the df2 looks like:
I have noticied that the df1 has point symbol (.), while df2 has "-". It is weird because both of them, if I open with a text editor or excel, they have "-".
What I need is to drop all the columns of df1 that match with a value of df2.
I have used this:
DataGenSample = df1[,!(names(df1) %in% df2)]
#DataGenSample <- df1[ , !(colnames(df1) %in% df2)]
but there is no change.
All the Data can be foun here. Whith the code that I have used.
# Data (df1):
DataGen <- read.table("data_CNA.txt",sep="\t", header=TRUE, check.names = FALSE)
# Samples (df2):
DeleteSample <- read.table("MuestrasEliminar.txt",sep="\t", header=TRUE, check.names = FALSE)
#Delete columns:
#DataGenSample = DataGen[,!(names(DataGen) %in% DeleteSample)]
DataGenSample <- DataGen[ , !(colnames(DataGen) %in% DeleteSample)]
The issue is - vs ..
When you read in the data, your read command probably has an argument like check.names that changes the names to make them "standard" R names - which means no punctuation other than _ and .. If you set check.names = FALSE the original names will be kept, and your code should work just fine.
Ok, I found out that you need to convert your df in vector first:
vecDeleteSample <- DeleteSample$SAMPLE_ID
And then you can drop the mathing columns of your vector/list:
DataGenSample <- DataGen[,!(names(DataGen) %in% vecDeleteSample)]

add running counter for semi-consecutive strings in vector

I would like to add a number indicating the x^th occurrence of a word in a vector. (So this question is different from Make a column with duplicated values unique in a dataframe , because I have a simple vector and try to avoid the overhead of casting it to a data.frame).
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by transposing the vector to a dataframe and next using the grouping function. That feels like using a sledgehammer to crack nuts:
# add consecutive number for equal string
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove '1' for first occurrence
wordsVec <- paste0(df$words, df$seqN)
gsub("1", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a more clean solution, e.g. using the stringr package?
You can still utilize row_number() from dplyr but you don't need to convert to data frame, i.e.
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"

Initialize an empty tibble with column names and 0 rows

I have a vector of column names called tbl_colnames.
I would like to create a tibble with 0 rows and length(tbl_colnames) columns.
The best way I've found of doing this is...
tbl <- as_tibble(data.frame(matrix(nrow=0,ncol=length(tbl_colnames)))
and then I want to name the columns so...
colnames(tbl) <- tbl_colnames.
My question: Is there a more elegant way of doing this?
something like tbl <- tibble(colnames=tbl_colnames)
my_tibble <- tibble(
var_name_1 = numeric(),
var_name_2 = numeric(),
var_name_3 = numeric(),
var_name_4 = numeric(),
var_name_5 = numeric()
)
Haven't tried, but I guess it works too if instead of initiating numeric vectors of length 0 you do it with other classes (for example, character()).
This SO question explains how to do it with other R libraries.
According to this tidyverse issue, this won't be a feature for tribbles.
Since you want to combine a list of tibbles. You can just assign NULL to the variable and then bind_rows with other tibbles.
res = NULL
for(i in tibbleList)
res = bind_rows(res,i)
However, a much efficient way to do this is
bind_rows(tibbleList) # combine all tibbles in the list
For anyone still interested in an elegant way to create a 0-row tibble with column names given by a character vector tbl_colnames:
tbl_colnames %>% purrr::map_dfc(setNames, object = list(logical()))
or:
tbl_colnames %>% purrr::map_dfc(~tibble::tibble(!!.x := logical()))
or:
tbl_colnames %>% rlang::rep_named(list(logical())) %>% tibble::as_tibble()
This, of course, results in each column being of type logical.
The following command will create a tibble with 0 row and variables (columns) named with the contents of tbl_colnames
tbl <- tibble::tibble(!!!tbl_colnames, .rows = 0)
You could abuse readr::read_csv, which allow to read from string. You can control names and types, e.g.:
tbl_colnames <- c("one", "two", "three", "c4", "c5", "last")
read_csv("\n", col_names = tbl_colnames) # all character type
read_csv("\n", col_names = tbl_colnames, col_types = "lcniDT") # various types
I'm a bit late to the party, but for future readers:
as_tibble(matrix(nrow = 0, ncol = length(tbl_colnames)), .name_repair = ~ tbl_colnames)
.name_repair allows you to name you columns within the same function.

replace substring in subset of column names in R

I have a dataframe where I want to replace a subset of the columns with new names created by prepending an identifier to the old one. For example, to prepend columns 3:7 with the string, "TEST", I tried the following.
What am I missing here?
# Make a test df
df <- data.frame(replicate(10,sample(0:1,100,rep=TRUE)))
#Subsetting works fine
colnames(df[,3:7])
#sub works fine
sub("^", "TEST.", colnames(df[,3:7]))
#replacing the subset of column names with sub does not
colnames(df[,3:7]) <- sub("^", "TEST.", colnames(df[,3:7]))
colnames(df)
#Also doesn't work
colnames(df[,3:7]) <- paste("TEST.", colnames(df[,3:7]), sep ="")
colnames(df)
The column names should be a vector, with the indices outside of the parentheses:
colnames(df)[3:7] <- sub("^", "TEST.", colnames(df)[3:7])
You could also:
colnames(df)[3:7] <- paste0("TEST.", colnames(df)[3:7])

Rename columns of a data frame by searching column name

I am writing a wrapper to ggplot to produce multiple graphs based on various datasets. As I am passing the column names to the function, I need to rename the column names so that ggplot can understand the reference.
However, I am struggling with renaming of the columns of a data frame
here's a data frame:
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
here are my column names for search:
col1_search <- "col1"
col2_search <- "col2"
col3_search <- "col3"
and here are column names to replace:
col1_replace <- "new_col1"
col2_replace <- "new_col2"
col3_replace <- "new_col3"
when I search for column names, R sorts the column indexes and disregards the search location.
for example, when I run the following code, I expected the new headers to be new_col1, new_col2, and new_col3, instead the new column names are: new_col3, new_col2, and new_col1
colnames(df)[names(df) %in% c(col3_search,col2_search,col1_search)] <- c(col3_replace,col2_replace,col1_replace)
Does anyone have a solution where I can search for column names and replace them in that order?
require(plyr)
df <- data.frame(col2=1:3,col1=3:5,col3=6:8)
df <- rename(df, c("col1"="new_col1", "col2"="new_col2", "col3"="new_col3"))
df
And you can be creative in making that second argument to rename so that it is not so manual.
> names(df)[grep("^col", names(df))] <-
paste("new", names(df)[grep("^col", names(df))], sep="_")
> names(df)
[1] "new_col1" "new_col2" "new_col3"
If you want to replace an ordered set of column names with an arbitrary character vector, then this should work:
names(df)[sapply(oldNames, grep, names(df) )] <- newNames
The sapply()-ed grep will give you the proper locations for the 'newNames' vector. I suppose you might want to make sure there are a complete set of matches if you were building this into a function.
hmm, this might be way to complicated, but the first that come into my mind:
lookup <- data.frame(search = c(col3_search,col2_search,col1_search),
replace = c(col3_replace,col2_replace,col1_replace))
colnames(df) <- lookup$replace[match(lookup$search, colnames(df))]
I second #justin's aes_string suggestion. But for future renaming you can try.
require(stringr)
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
oldNames <- c("col1", "col2", "col3")
newNames <- c("new_col1", "new_col2", "new_col3")
names(df) <- str_replace(string=names(df), pattern=oldNames, replacement=newNames)

Resources