Change column name by looking up - r

I know that I can change a data.frame column name by:
colnames(df)[3] <- "newname"
But there might be cases where the column I want to change is not in the 3rd position. Is there a way to look up the column by name and change it? Like this...
colnames(df)[,"oldname"] <- "newname"
BTW, I have tried this code and I keep getting the error "incorrect number of subscripts on matrix".
Thanks.

colnames(df)[colnames(df)=="oldname"] <- "newname"
or just names
names(df)[names(df)=="oldname"] <- "newname"
There are various functions for renaming columns in packages as well.
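A minimal sketch of the lookup-by-name idea on a toy data frame (the data and column names here are made up for illustration). The attempt in the question fails because colnames(df) is a plain character vector, so it accepts only one subscript; a logical comparison gives you that single subscript.

```r
# Toy data frame; the column names are invented for this example.
df <- data.frame(a = 1:2, oldname = 3:4, b = 5:6)

# colnames(df) == "oldname" is a logical vector that is TRUE only at the
# position of the column to rename, wherever that column happens to sit.
colnames(df)[colnames(df) == "oldname"] <- "newname"

colnames(df)
# [1] "a"       "newname" "b"
```

If no column is called "oldname", the mask is all FALSE and nothing is changed, so the call is safe to run even when you are not sure the column exists.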

colnames(df)[colnames(df)=="oldname"] <- "newname"
or
names(df)[names(df)=="oldname"] <- "newname"
(since names and colnames are equivalent for a data frame)
or you might be looking for
library(reshape)
df <- rename(df,c(oldname="newname"))
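If you would rather not load a package, the same old-to-new mapping idea can be sketched in base R. The helper name rename_cols below is hypothetical (it is not part of reshape or any other package), and it assumes the mapping is a named character vector of the form c(old = "new").

```r
# Hypothetical helper: rename columns from a named vector c(old = "new").
rename_cols <- function(data, mapping) {
  idx <- match(names(mapping), names(data))  # position of each old name, NA if absent
  found <- !is.na(idx)
  names(data)[idx[found]] <- unname(mapping[found])
  data
}

df <- data.frame(region = 1, state = 2, county = 3)
df <- rename_cols(df, c(county = "district", nosuch = "ignored"))
names(df)
# [1] "region"   "state"    "district"
```

Old names that do not occur in the data frame are silently skipped here; whether that should be an error instead is a design choice.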

I was using package data.table today and when I tried to change a column name using my usual method a message appeared recommending this approach:
library(data.table)
df <- read.table(text= "
region state county
1 1 1
1 2 2
1 2 3
2 1 4
2 1 4
", header=TRUE, na.strings=NA)
df
setnames(df, "county", "district")
df

A somewhat more general approach that will replace all of the "old"s at the beginning of any current name with "new" in the same character location:
names(df) <- sub("^old", "new", names(df) )
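A small illustration of what the anchored pattern does and does not touch, using made-up column names:

```r
# Only names that START with "old" are changed; "c_old" is left alone
# because of the ^ anchor in the pattern.
df <- data.frame(old_a = 1, old_b = 2, c_old = 3)
names(df) <- sub("^old", "new", names(df))
names(df)
# [1] "new_a" "new_b" "c_old"
```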

Related

Replace a character string based on a separate list/dataframe R

I'm trying to do something that I thought would be pretty simple that has me stumped.
Say I have the following data frame:
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code))
> df
id code
1 bob_geldof blah
2 billy_bragg di
3 melvin_smith blink
And another like this:
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2))
> alternates
ID1 ID2
1 bob_geldof the_builder
2 melvin_smith kelvin
If the character string in df$id matches alternates$ID1, I'd like to replace it with alternates$ID2. If it doesn't match I'd like to just leave it as it is.
The final df should look like
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
This is obviously a silly example and my real dataset requires lots of replacements.
I've included the 'code' column to demonstrate that I'm working with a data frame and not just a character vector.
I’ve been using gsub to replace them individually but it's time consuming and the list keeps changing.
I looked into str_replace but it seems you can only specify one replacement value.
Any help would be much appreciated.
Cheers!
EDIT: Not all ids contain underscores, and I need to retain the bit that does match. E.g. bob_geldof becomes bob_the_builder.
EDIT 2(!): Thanks for your suggestions everyone. I've got round the problem by merging the data frames (so that there are NAs where there's no change to be made), and creating new IDs using an ifelse statement. It's a bit clunky but it works!
When creating the dataframes use stringsAsFactors = FALSE so as to not deal with factors. Then, if the rows are ordered, just apply:
df <- as.data.frame(cbind(id,code),stringsAsFactors = FALSE)
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = FALSE)
df$id[c(TRUE,FALSE)]=paste(gsub("(.*)(_.*)","\\1",df$id[c(TRUE,FALSE)]),
alternates$ID2,sep="_")
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
If they are unordered, we can use dplyr:
library(dplyr)
df%>%rowwise()%>%mutate(id=if_else(length(which(alternates$ID1==id))>0,
paste(gsub("(.*)(_.*)","\\1",id),
alternates$ID2[which(alternates$ID1==id)],sep="_"),
id))
# A tibble: 3 x 2
id code
<chr> <chr>
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
We are using the same logic as before. Here we check the df by row. If its id matches any of alternates$ID1 (checked by length()), we update it.
The following solution uses base R and is streamlined a bit. Step 1: merge the main "df" and the "alternates" data frame together, using a left join. Step 2: find the rows where the ID2 value is not missing (NA) and assign those values to "id". This keeps your original id where there is no match and replaces it with ID2 where a matching ID exists.
The solution:
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
With full original data frame definitions (using stringsAsFactors=F):
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code),stringsAsFactors = F)
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = F)
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
Results: (the full merge below, you can also do combined[,c("id","code")] for the streamlined results). Here, the non-matching "billy_bragg" is kept; and the others are replaced with the matched ID
> combined
id code ID2
1 billy_bragg di <NA>
2 the_builder blah the_builder
3 kelvin blink kelvin
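Note that the merge above reorders the rows and replaces the whole id, whereas the question asks for bob_the_builder. A possible alternative sketch with match(), which keeps the original row order and the part before the underscore. It assumes the part to keep is everything before the first underscore, which is my reading of the example rather than something the question states.

```r
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- data.frame(id, code, stringsAsFactors = FALSE)
alternates <- data.frame(ID1 = c("bob_geldof", "melvin_smith"),
                         ID2 = c("the_builder", "kelvin"),
                         stringsAsFactors = FALSE)

pos <- match(df$id, alternates$ID1)    # row of alternates for each id, NA if none
hit <- !is.na(pos)                     # which ids have a replacement
prefix <- sub("_.*$", "", df$id[hit])  # text before the first underscore
df$id[hit] <- paste(prefix, alternates$ID2[pos[hit]], sep = "_")

df$id
# [1] "bob_the_builder" "billy_bragg"     "melvin_kelvin"
```

Because match() returns one position per row of df, the lookup table can be in any order and can grow without changing this code.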

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't had any progress on it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
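Applied to the mtcars example from the question, plus a rough sketch of the positive/negative tagging. The two small term tables are invented for illustration, and cbind() is just one way to combine them.

```r
# Suffix every column of a copy of mtcars with "_sample".
cars <- mtcars
names(cars) <- paste0(names(cars), "_sample")
head(names(cars), 3)
# [1] "mpg_sample"  "cyl_sample"  "disp_sample"

# Made-up term tables: the same word ("hello") occurs in both.
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
combined <- cbind(pos, neg)
names(combined)
# [1] "positive_hello" "positive_great" "negative_hello" "negative_awful"
```

paste0() is vectorised over the names, so no sapply or lapply is needed.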
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package. It changes the names by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

How can I autocomplete column names in dplyr?

I know I can do names(df) to get the columns of a dataframe. But is there a more convenient way to rename using dplyr in Rstudio?
Earlier:
names(df)=c("anew","bnew","cnew")
Now?:
library(dplyr)
rename(df, anew = aold, bnew = bold, cnew = cold)
dplyr makes it more difficult as I have to know/type both the old and new column names.
I can see certain conversations around autocompletion of column names in dplyr toolchain. But I can't seem to make it work and I have the latest RStudio.
https://plus.google.com/+SharonMachlis/posts/FHknZcbAdLE
You can try something like this (you don't need to use dplyr to transform names automatically). Just replace the modify_names function with whatever transformation you want to apply to the names.
> modify_names <- function(any_string) {
+ return(paste0(any_string, "-new"))
+ }
>
> df <- data.frame(c(0, 1, 2), c(3, 4, 5))
> names(df) <- c("a", "b")
> df
a b
1 0 3
2 1 4
3 2 5
> names(df) <- modify_names(names(df))
> df
a-new b-new
1 0 3
2 1 4
3 2 5
There's nothing wrong with using names(*) <- new_value. dplyr isn't the be-all and end-all of data manipulation in R.
That said, if you want to include this in a dplyr pipeline, here's how to do it:
df %>% `names<-`(c("a_new", "b_new", "c_new"))
This works because (almost) everything in R is a function, and in particular assigning new names is really a call to the names<- function.
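To make that concrete, here is a small base-R sketch (no dplyr needed) showing that calling the replacement function directly gives the same result as setNames() from the stats package, which wraps the same assignment. The data frame is a toy example.

```r
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
new_names <- c("a_new", "b_new", "c_new")

# Calling the replacement function by its backquoted name...
via_replacement <- `names<-`(df, new_names)

# ...and the friendlier wrapper that performs the same assignment.
via_setnames <- setNames(df, new_names)

names(via_replacement)
# [1] "a_new" "b_new" "c_new"
names(via_setnames)
# [1] "a_new" "b_new" "c_new"
```

Either form returns the renamed data frame as a value, which is what makes it usable as a step in a pipeline.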
Recently I had the same question and found this RStudio article: https://support.rstudio.com/hc/en-us/articles/205273297-Code-Completion
Following the article, to autocomplete column names with dplyr in RStudio you have to use magrittr's %>% operator (a pipeline):
library(dplyr)
df %>% rename(anew = aold, bnew = bold, cnew = cold) # type the first 3 letters of the old name, then press Tab to select it
You can find the visual example in the article, and you can adjust the completion delay (to type less) and other completion options in: RStudio>Tools>Global options...>Code>Completion>Completion delay.

Set up column names in a new data frame based on variable

My goal is to be able to allocate column names to a data frame that I create based on a passed variable. For instance:
i='column1'
data.frame(i=1)
i
1 1
Above the column name is 'i' when I want it to be 'column1'. I know the following works but isn't as efficient as I'd like:
i='column1'
df<-data.frame(x=1)
setnames(df,i)
column1
1 1
It's good to learn how base R works this way:
i <- 'column1'
df <- `names<-`(data.frame(1), i)
df
# column1
#1 1
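The same thing reads a little more plainly with setNames(), which lives in the stats package and is attached by default. This is just a sketch of two equivalent spellings, one naming the finished data frame and one naming the list before it becomes a data frame.

```r
i <- "column1"

# Name the data frame after it is built...
df <- setNames(data.frame(1), i)

# ...or name the list first and let data.frame() pick the name up.
df2 <- data.frame(setNames(list(1), i))

names(df)
# [1] "column1"
names(df2)
# [1] "column1"
```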
Aside from the answers posted by other users, I think you may be stuck with the solution you've already presented. If you already have a data frame with the intended number of rows, you can add a new column using brackets:
df <- data.frame('column1'=1)
i <- 'column2'
df[[i]] <- 2
df
column1 column2
1 1 2
If the idea is to get rid of the setNames, you would probably never do this but
i <- 'column1'
data.frame(`attr<-`(list(1), "names", i))
# column1
# 1 1
You can see in data.frame, it has the code
x <- list(...)
vnames <- names(x)
so, you can mess with the name attribute.
Not exactly sure how you want it more efficient but you could add all the column names at once after your data frame has been assembled with colnames. Here's an example based on yours.
data.frame(Td)
a b
1 1 4
2 1 5
nam<-c("Test1","Test2")
colnames(Td)<-nam
data.frame(Td)
Test1 Test2
1 1 4
2 1 5
You could simply pass the name of your column variable and its values as arguments to a dataframe, without adding more lines:
df <- data.frame(column1=1)
df
# column1
#1 1

Average across some rows in R

I have not found a way to take an average across SOME columns in R when working with a data frame table. Basically, I want to take the average of the 3 controls (CTR_R1+CTR_R2+CTR_R3) and insert that value as another column right after CTR_R3 (see below). The same for the TRT.
Is there a way to take the average and insert it in a specific location?
GeneID|CTR_R1|CTR_R2|CTR_R3|CTR_AVG|TRT_R1| TRT_R2| TRT_R3|TRT_AVG|pValue
How about
df$CTR_AVG <- rowMeans(df[, c("CTR_R1", "CTR_R2", "CTR_R3")])
df$TRT_AVG <- rowMeans(df[, c("TRT_R1", "TRT_R2", "TRT_R3")])
This code should work for you, if your data.frame is named df:
df$CTR_AVG <- ( df$CTR_R1 + df$CTR_R2 + df$CTR_R3 ) / 3
That is assuming that the CTR_AVG column already exists, as shown in your question. If it does not, the code will put the column at the end of the data.frame. To move it to the right spot, you will need to select the columns in the correct order, like so:
df[ , c( 'GeneID', 'CTR_R1', 'CTR_R2', 'CTR_R3', 'CTR_AVG', 'TRT_R1', 'TRT_R2', 'TRT_R3', 'TRT_AVG', 'pValue' ) ]
The below code should work even if there are many CTR or TRT columns (i.e. 100s). But, I am guessing #beginneR's solution to be faster.
indx <- grep("^CTR", colnames(df1), value=TRUE)
indxT <- grep("^TRT", colnames(df1), value=TRUE)
df1[,c('CTR_Avg', 'TRT_Avg')] <- lapply(list(indx, indxT),
function(x) Reduce(`+`, df1[,x])/length(x))
or you can use rowMeans in the above step.
df2 <- df1[,c('GeneID', indx, 'CTR_Avg', indxT, 'TRT_Avg', 'pValue')]
head(df2,2)
# GeneID CTR_R1 CTR_R2 CTR_R3 CTR_Avg TRT_R1 TRT_R2 TRT_R3 TRT_Avg pValue
#1 1 6 2 10 6.000000 10 11 15 12 0.091
#2 2 5 12 8 8.333333 5 3 13 7 0.051
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(1:20,20*6, replace=TRUE), ncol=6))
colnames(df1) <- c("CTR_R1", "CTR_R2", "CTR_R3", "TRT_R1", "TRT_R2", "TRT_R3")
df1 <- cbind(GeneID=1:20, df1,
pValue=sample(seq(0.001, 0.10, by=0.01), 20, replace=TRUE))
make some dummy data
df=data.frame(CTR_R1=1:10,CTR_R2=1:10,CTR_R3=1:10,somethingelse=1:10)
get a new column
df$CTR_AVG=apply(df[c("CTR_R1","CTR_R2","CTR_R3")],1,mean)
Thanks so much for your replies. I am sorry I did not phrase my original question better. I meant to ask how to write one script to take the average and place that value in the right place. I do not have in my table the column that says "CTR_AVG", nor the column "TRT_AVG".
I was wondering if I could do it more 'elegantly' than what I did below (which works too).
Many thanks.
#
names (edgeR_table)
"GeneID" "CTR_R1" "CTR_R2" "CTR_R3" "TRT_R1" "TRT_R2" "TRT_R3" "logFC" "logCPM" "LR" "PValue" "FDR"
#
edgeR_table$CTR_AVG <- rowMeans(edgeR_table[,2:4])
edgeR_table$TRT_AVG <- rowMeans(edgeR_table[,5:7])
edgeR_table <- edgeR_table[, c(1,2,3,4,13,5,6,7,14,8,9,10,11,12)]
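One way to avoid the hard-coded positions is a small helper that finds the replicate columns by name and slots the average in right after the last of them. This is only a sketch: the helper name add_group_mean is hypothetical, the toy table stands in for edgeR_table, and it assumes replicate columns are named PREFIX_R1, PREFIX_R2, and so on.

```r
# Toy stand-in for the real table.
tab <- data.frame(GeneID = 1:3,
                  CTR_R1 = c(1, 2, 3), CTR_R2 = c(3, 4, 5), CTR_R3 = c(5, 6, 7),
                  TRT_R1 = c(2, 2, 2), TRT_R2 = c(4, 4, 4), TRT_R3 = c(6, 6, 6),
                  PValue = c(0.01, 0.2, 0.5))

# Hypothetical helper: average the PREFIX_R* columns and place the result
# immediately after the last of them.
add_group_mean <- function(data, prefix) {
  cols <- grep(paste0("^", prefix, "_R"), names(data))  # positions of replicates
  data[[paste0(prefix, "_AVG")]] <- rowMeans(data[, cols, drop = FALSE])
  n <- ncol(data)                                       # the new column is last
  data[, append(seq_len(n - 1), n, after = max(cols))]  # move it into place
}

tab <- add_group_mean(tab, "CTR")
tab <- add_group_mean(tab, "TRT")
names(tab)
#  [1] "GeneID"  "CTR_R1"  "CTR_R2"  "CTR_R3"  "CTR_AVG" "TRT_R1"  "TRT_R2"
#  [8] "TRT_R3"  "TRT_AVG" "PValue"
```

Because the positions are looked up on each call, the second call still works after the first one has shifted the TRT columns one place to the right.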

Resources