Create a character from column names (in R) [duplicate] - r

I have a matrix with thousands of columns whose names look like this:
Z41_5_tes_ACGTTCCATAGCCGTA
Z41_5_ACGTTCCAGAGCGGTA
Z53_5_ACGTTCCAGAGCCGTA
Z53_5_ACGTTCCAGATCTGTA
Z41_5_ACGTTGCATAGCGGTA
Z41_5_tes_ACGTTCGCTAGCCGTA
I would like to create a vector containing the beginning of each column name, as shown below:
Z41_5_tes
Z41_5
Z53_5
Z53_5
Z41_5
Z41_5_tes
I tried the following, but it does not capture Z41_5_tes:
names <- gsub("^([^_]*_[^_]*)_.*$", "\\1", colnames(x@data))
Z41_5
Z53_5

Remove the last underscore and everything after it:
sub('_[^_]*$', '', x)
#[1] "Z41_5_tes" "Z41_5" "Z53_5" "Z53_5" "Z41_5" "Z41_5_tes"
Or capture everything before the last underscore:
sub('(.*)_.*', '\\1', x)
#[1] "Z41_5_tes" "Z41_5" "Z53_5" "Z53_5" "Z41_5" "Z41_5_tes"
Data:
x <- c("Z41_5_tes_ACGTTCCATAGCCGTA", "Z41_5_ACGTTCCAGAGCGGTA",
"Z53_5_ACGTTCCAGAGCCGTA", "Z53_5_ACGTTCCAGATCTGTA",
"Z41_5_ACGTTGCATAGCGGTA", "Z41_5_tes_ACGTTCGCTAGCCGTA")
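As a minimal sketch of how either pattern could be applied to a matrix like the one in the question, the example below builds a small matrix and derives the prefixes from its column names. The matrix name `m` and its contents are assumptions made for illustration; only the column names come from the question.

```r
# Hypothetical 2 x 3 matrix standing in for the real data
m <- matrix(1:6, nrow = 2,
            dimnames = list(NULL, c("Z41_5_tes_ACGTTCCATAGCCGTA",
                                    "Z41_5_ACGTTCCAGAGCGGTA",
                                    "Z53_5_ACGTTCCAGAGCCGTA")))

# Drop the final underscore and everything after it
prefix <- sub("_[^_]*$", "", colnames(m))
prefix

# The greedy capture-group form should give the same vector
identical(prefix, sub("(.*)_.*", "\\1", colnames(m)))
```

A name that contains no underscore does not match either pattern, so sub returns it unchanged.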

Related

Rename column names by pattern [duplicate]

I want to rename my columns because the names are too long, for example:
chrX:99883666-99894988_TSPAN6_ENSG00000000003.10 to TSPAN6
chrX:99839798-99854882_TNMD_ENSG00000000005.5 to TNMD
chr20:49505584-49575092_DPM1_ENSG00000000419.8 to DPM1
How can I rename them, given that the parts I want to delete differ from column to column?
Using strsplit, we can split every name on the underscore and keep the second piece of each:
names(df) <- sapply(strsplit(names(df), "_"), `[`, 2)
If you only want to target a certain subset of names, then simply filter names(df) using that logic.
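The split-and-filter idea above might look like the following sketch. The data frame, the extra `sample_id` column, and the rule "names starting with chr" are assumptions chosen for illustration, not part of the question.

```r
# Hypothetical data frame: two long names plus one that should stay as is
df <- data.frame(1, 2, 3)
names(df) <- c("chrX:99883666-99894988_TSPAN6_ENSG00000000003.10",
               "chrX:99839798-99854882_TNMD_ENSG00000000005.5",
               "sample_id")

# Only touch names that start with "chr" (assumed filter rule)
target <- grepl("^chr", names(df))

# Split each targeted name on "_" and keep the second piece
names(df)[target] <- vapply(strsplit(names(df)[target], "_", fixed = TRUE),
                            function(parts) parts[2], character(1))
names(df)
```

Using vapply with character(1) rather than sapply is a design choice here: it fails loudly if a split ever yields something other than a single string.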
You can do that using regex. How about extracting the word between the first two underscores?
x <- c("chrX:99883666-99894988_TSPAN6_ENSG00000000003.10",
"chrX:99839798-99854882_TNMD_ENSG00000000005.5",
"chr20:49505584-49575092_DPM1_ENSG00000000419.8")
sub('.*?_(\\w+)_.*', '\\1', x)
#[1] "TSPAN6" "TNMD" "DPM1"
To rename the columns, use names(df) in place of x:
names(df) <- sub('.*?_(\\w+)_.*', '\\1', names(df))
Or, if you prefer dplyr:
library(dplyr)
df <- df %>% rename_with(~sub('.*?_(\\w+)_.*', '\\1', .))

How to set column names in R by repeating character? [duplicate]

Suppose I want to name the columns of a data frame L1, L2, ..., up to L200. How could I do this?
I tried colnames(df) <- c('L1':'L200'), but this does not work (returns error message NAs introduced by coercion), even though there are 200 columns.
Help on this appreciated!
We can use paste0:
colnames(df) <- paste0("L", 1:200)
or to make it more automatic
colnames(df) <- paste0("L", seq_along(df))
NOTE: The range operator (:) builds numeric sequences. It coerces its operands to numbers, and the string 'L1' cannot be coerced (it becomes NA), hence the failure. 1:200 gives the integers from 1 to 200, which paste0 then combines with the "L" prefix.
Here is another solution:
colnames(df) <- sprintf("L%d", 1:200)
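A short sketch of the two answers side by side, assuming a hypothetical 200-column data frame filled with zeros. It also shows the coercion that makes 'L1':'L200' fail.

```r
# Hypothetical data frame with 200 columns
df <- as.data.frame(matrix(0, nrow = 2, ncol = 200))

# ':' needs numeric endpoints; "L1" cannot be converted to a number
not_a_number <- suppressWarnings(as.numeric("L1"))
is.na(not_a_number)

# Both calls build the same character vector of names
by_paste   <- paste0("L", seq_len(ncol(df)))
by_sprintf <- sprintf("L%d", seq_len(ncol(df)))
identical(by_paste, by_sprintf)

colnames(df) <- by_paste
head(colnames(df), 3)
```

Deriving the count from ncol(df) instead of hard-coding 200 keeps the assignment valid if the number of columns changes.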

Split column names in R using a separator [duplicate]

I have a dataframe X with column names such as
1_abc,
2_fgy,
27_msl,
936_hhq,
3_hdv
I want to keep just the number as the column name (so instead of 1_abc, just 1). How do I remove the rest of the name while keeping the data intact?
All column names use an underscore as the separator between the numeric part and the character part. There are about 400 columns, so I want to do this without typing out specific column names.
You may use sub here for a base R option:
names(df) <- sub("^(\\d+).*$", "\\1", names(df))
Another option might be:
names(df) <- sub("_.*", "", names(df))
This would just strip off everything from the first underscore until the end of the column name.
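To make the difference between the two patterns concrete, here is a small sketch run on the names from the question. The name "12ab_x" is a hypothetical edge case added for illustration.

```r
nm <- c("1_abc", "2_fgy", "27_msl", "936_hhq", "3_hdv")

digits_only  <- sub("^(\\d+).*$", "\\1", nm)  # keep only the leading digits
before_under <- sub("_.*", "", nm)            # cut at the first underscore
identical(digits_only, before_under)

# The patterns diverge when letters precede the underscore (hypothetical name)
edge_digits <- sub("^(\\d+).*$", "\\1", "12ab_x")
edge_under  <- sub("_.*", "", "12ab_x")
c(edge_digits, edge_under)
```

Either way the resulting names are still character strings, so a column named 1 is reached with df[["1"]] or backticks rather than by treating 1 as a position.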

Changing a full last name to just the first letter of the name in R [duplicate]

I'm working in R. I have a dataset with people's first and last names. There is a column called "First" and another column called "Last".
I want to change "Bodie" to just "B" and do the same for all the observations in the "Last" column.
I'm newer to programming so I don't even know where to start. I have looked at some of the string packages in R and can't quite figure out what to do. Thanks for the help.
We can use substr to extract the first letter of the 'Last' column
df1$Last <- substr(df1$Last, 1, 1)
Or sub to remove all the characters other than the first
df1$Last <- sub("^(.).*", "\\1", df1$Last)
Or another option is to split the characters, select the first element
df1$Last <- sapply(strsplit(df1$Last, ""), `[`, 1)
Just a variation on @akrun's answer, which uses sub without a capture group:
df1$Last <- sub("(?<=.).*$", "", df1$Last, perl=TRUE)
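The four approaches above can be checked against each other with a quick sketch. The data frame below is hypothetical; only the surname "Bodie" appears in the question.

```r
# Hypothetical data; the column names "First" and "Last" follow the question
df1 <- data.frame(First = c("Ann", "Raj"),
                  Last  = c("Bodie", "Okafor"),
                  stringsAsFactors = FALSE)

by_substr    <- substr(df1$Last, 1, 1)
by_capture   <- sub("^(.).*", "\\1", df1$Last)
by_split     <- sapply(strsplit(df1$Last, ""), `[`, 1)
by_lookbehind <- sub("(?<=.).*$", "", df1$Last, perl = TRUE)

# All four should agree on this input
identical(by_substr, by_capture) &&
  identical(by_substr, by_split) &&
  identical(by_substr, by_lookbehind)

df1$Last <- by_substr
df1
```

stringsAsFactors = FALSE is set explicitly because strsplit expects character input, and older R versions would otherwise turn the names into factors.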

How do I extract elements from a dataframe by pattern? [duplicate]

I have a dataframe dat that has many variables like
"x_tp1_y"
"g_tp1_z"
"f_tp2_h"
I would like to extract elements that include "tp1".
I already tried this:
grep("tp1", dat)
grepl("tp1", dat)
dat["tp1",]
I just want R to give me elements with this pattern so I do not have to type in all variable names that are in the dataframe dat.
Like this:
command that extracts elements with pattern "tp1"
R returns parts of the dataframe that have pattern "tp1":
x_tp1_y g_tp1_z
1 2
0 3
And then I would like to create a new dataframe.
I know that I just can use
newdat <- data.frame( dat[[1]], dat[ c(1:30)])
but I have so many elements in my dataframe that this would take ages.
Thank you for your help!
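dat[, grep("tp1", colnames(dat)), drop = FALSE]
grep returns the index numbers of the column names of the data.frame (the vector colnames(dat)) that contain the pattern, and "[" subsets the data frame to those columns. drop = FALSE keeps the result a data frame even when only one column matches.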
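A runnable sketch of the column selection, using the two tp1 columns and values shown in the question. The values in f_tp2_h are made up for illustration.

```r
# Values for the tp1 columns follow the question; f_tp2_h is hypothetical
dat <- data.frame(x_tp1_y = c(1, 0),
                  g_tp1_z = c(2, 3),
                  f_tp2_h = c(5, 6))

idx  <- grep("tp1", colnames(dat))   # integer positions of matching names
keep <- grepl("tp1", colnames(dat))  # one TRUE/FALSE per column

# Either vector can be used to subset; drop = FALSE keeps a data frame
newdat <- dat[, idx, drop = FALSE]
newdat
```

The attempts in the question searched dat itself rather than colnames(dat), so the pattern was compared against the column contents instead of the column names.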
