Remove parts of column name after certain characters

I have a large data set with thousands of columns. The column names include various unwanted characters as follows:
col1_3x_xxx
col2_3y_xyz
col3_3z_zyx
I would like to remove everything from "_3" onwards in all column names, so that I am left with clean names:
col1
col2
col3
What is the most efficient way to do this for 5000+ columns?

Certainly late with this answer, but in case someone is still looking for a solution: for a single column at index col,
colnames(df1)[col] <- sub("_3.*", "", colnames(df1)[col])
And if you have multiple columns:
for (col in seq_len(ncol(df1))) {
  colnames(df1)[col] <- sub("_3.*", "", colnames(df1)[col])
}
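Because sub() is vectorized over its input, the same renaming can also be done without a loop. A minimal sketch, assuming a small toy data frame df1 (hypothetical, built only for illustration) with the column names from the question:

```r
# Toy data frame standing in for the real one (hypothetical data)
df1 <- data.frame(col1_3x_xxx = 1:2, col2_3y_xyz = 3:4, col3_3z_zyx = 5:6)

# sub() accepts the whole vector of names, so one call renames every column
colnames(df1) <- sub("_3.*", "", colnames(df1))

colnames(df1)
```

With thousands of columns, a single vectorized call like this avoids reassigning the names vector once per column.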

We can use sub (here the names are stored in the first column of df1):
sub("_3.*", "", df1[,1])
#[1] "col1" "col2" "col3"

We can try str_extract with the regular expression pattern "^[^_]+(?=_)":
stringr::str_extract(c("col1_3x_xxx", "col2_3y_xyz", "col3_3z_zyx"), "^[^_]+(?=_)")
[1] "col1" "col2" "col3"
where in the pattern:
The first ^ matches the beginning of the string. [^_]+ matches one
or more characters other than _ (inside the brackets, ^_ means any
character but _). (?=_) is a lookahead: it requires that an _ follows,
without including it in the match.
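Note that this pattern keeps the text before the first underscore, which is not quite the same as removing everything from "_3" onwards. A small base R sketch of the difference, assuming a made-up name "my_col_3x_xxx" that contains an earlier underscore:

```r
x <- c("col1_3x_xxx", "my_col_3x_xxx")  # second name is a made-up example

# Keep only the text before the first underscore (same idea as "^[^_]+")
before_first_underscore <- sub("_.*", "", x)

# Drop everything from the first "_3" onwards
before_underscore_3 <- sub("_3.*", "", x)

before_first_underscore  # "col1" "my"
before_underscore_3      # "col1" "my_col"
```

For the names in the question both approaches agree; they only diverge when a name has an underscore before the "_3" part.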

You can use
names(df) = gsub(pattern = "_3.*", replacement = "", x = names(df))

Related

Convert sign in column names if not at certain position in R [duplicate]

I have a character string of names which look like
"_6302_I-PAL_SPSY_000237_001"
I need to remove the first underscore, so that it becomes
"6302_I-PAL_SPSY_000237_001"
I am aware of gsub, but it removes all of the underscores. Thank you for any suggestions.
gsub can do this too: use the ^ symbol to anchor the match at the start of the string.
x <- "_6302_I-PAL_SPSY_000237_001"
x <- gsub("^_", "", x)
[1] "6302_I-PAL_SPSY_000237_001"
We can use sub with pattern as _ and replacement as blanks (""). This will remove the first occurrence of '_'.
sub("_", "", str1)
#[1] "6302_I-PAL_SPSY_000237_001"
NOTE: This removes the first occurrence of _ wherever it is; it is not restricted to the start of the string.
For example, suppose we have string
str2 <- "6302_I-PAL_SPSY_000237_001"
sub("_", "", str2)
#[1] "6302I-PAL_SPSY_000237_001"
As the example has _ at the beginning, another option is substring
substring(str1, 2)
#[1] "6302_I-PAL_SPSY_000237_001"
data
str1 <- "_6302_I-PAL_SPSY_000237_001"
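To make the difference between an unanchored and an anchored pattern concrete, here is a minimal sketch. The two input strings are shortened, made-up variants of the one in the question:

```r
strings <- c("_6302_I-PAL", "6302_I-PAL")  # shortened, made-up variants

# Unanchored: removes the first underscore wherever it occurs
unanchored <- sub("_", "", strings)

# Anchored with ^: removes an underscore only at the start of the string
anchored <- sub("^_", "", strings)

unanchored  # "6302_I-PAL" "6302I-PAL"
anchored    # "6302_I-PAL" "6302_I-PAL"
```

The anchored version is the safer choice when some strings may not have a leading underscore at all.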
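This can be done with base R's trimws() too (the whitespace argument requires R 3.6.0 or later; note that it strips all leading underscores, not just one)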
string1<-"_6302_I-PAL_SPSY_000237_001"
trimws(string1, which='left', whitespace = '_')
[1] "6302_I-PAL_SPSY_000237_001"
In case we have multiple words with leading underscores, we can include a word boundary (\\b) in the regex and use either gsub or stringr::str_remove_all:
string2<-paste(string1, string1)
string2
[1] "_6302_I-PAL_SPSY_000237_001 _6302_I-PAL_SPSY_000237_001"
library(stringr)
str_remove_all(string2, "\\b_")
[1] "6302_I-PAL_SPSY_000237_001 6302_I-PAL_SPSY_000237_001"

Remove some special characters from a string and convert to decimal a column of a data frame [duplicate]

I have a column in a dataframe as follows:
COL1
$54,345
$65,231
$76,234
How do I convert it into this:
COL1
54345
65231
76234
The way I tried it at first was:
df$COL1<-as.numeric(as.character(df$COL1))
That didn't work because it said NA's were introduced.
Then I tried it like this:
df$COL1<-as.numeric(gsub("\\$","",as.character(df$COL1)))
And the same thing happened.
Any ideas?
We could use parse_number from readr package which removes any non-numeric characters.
library(readr)
parse_number(df$COL1)
#[1] 54345 65231 76234
The gsub attempt didn't work because the column also contains , which is still non-numeric. When as.numeric converts such values, every element that is not a valid number becomes NA. So we need to remove both , and $ to make it work.
df1$COL1 <- as.numeric(gsub('[$,]', '', df1$COL1))
We put $ and , inside square brackets ([$,]) so that each is matched as a literal character (on its own, $ has a special meaning: it marks the end of the string), and replace them with ''.
Or we can escape (\\) the character ($) to match it and replace by ''.
df1$COL1 <- as.numeric(gsub('\\$|,', '', df1$COL1))
Another option uses the stringr library to remove '$' and ',' and then converts to numeric (dplyr is also needed for %>% and mutate):
library(dplyr); library(stringr)
df %>% mutate(COL1 = COL1 %>% str_remove_all("[$,]") %>% as.numeric())
Nested gsub calls can also handle negative amounts written in parentheses; transform lets us refer to COL1 directly, without the df$ prefix:
transform(df, COL1 = as.numeric(gsub("[$),]", "", gsub("^\\(", "-", COL1))))
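The cleaning steps above can be wrapped in a small helper. This is only a sketch: parse_currency is a hypothetical name, and the parenthesized negative value in the sample data is made up to exercise that branch.

```r
# Hypothetical helper: strips "$" and ",", and treats "(...)" as negative
parse_currency <- function(x) {
  x <- as.character(x)                  # also handles factor columns
  x <- sub("^\\((.*)\\)$", "-\\1", x)   # "($1,234)" -> "-$1,234"
  as.numeric(gsub("[$,]", "", x))       # drop "$" and ",", then convert
}

# Made-up sample data; the last value uses accounting-style parentheses
df <- data.frame(COL1 = c("$54,345", "$65,231", "($76,234)"))
df$COL1 <- parse_currency(df$COL1)
df$COL1
```

Converting with as.character() first matters when the column was read in as a factor, since as.numeric() on a factor returns the internal level codes rather than the displayed values.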

Regular expression to exclude a string pattern in R

I want to rename the columns of my table by removing the year label. Here are my column names:
"PROV_201601" "MNT_201602" "PROV_201612" .... and so on !
My objective is to remove the "2016" from the name of the column. I am only familiar with R but not with regular expressions.
Any help is appreciated !
Thank you.
We can use sub to match a _ (captured as a group) followed by four digits (\\d{4}), and replace the match with the backreference to the captured group (\\1), or simply with a literal _:
sub("(_)\\d{4}", "\\1", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
If it is specific to 2016 then
sub("2016", "", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
data
v1 <- c("PROV_201601", "MNT_201602", "PROV_201612")
First, use sub() to replace "2016" with "". Note that sub() replaces only the first occurrence in each string, which is enough here because each name contains "2016" once.
col1 <- c("PROV_201601", "MNT_201602", "PROV_201612")
col2 <- sub("2016", "", col1)
Now rename your columns of data frame dat using names():
names(dat) <- col2
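The two patterns give the same result on the names in the question, but they can differ when "2016" also appears elsewhere in a name. A minimal sketch, assuming a made-up third name "X2016_201612" and a slightly stricter pattern that is not from the answers above:

```r
v <- c("PROV_201601", "MNT_201602", "X2016_201612")  # last name is made up

# Plain "2016" removes the first occurrence, wherever it is
plain <- sub("2016", "", v)

# Anchoring on the underscore and the end of the string only strips a
# four-digit year that is followed by a two-digit month
anchored <- sub("_\\d{4}(\\d{2})$", "_\\1", v)

plain     # "PROV_01" "MNT_02" "X_201612"
anchored  # "PROV_01" "MNT_02" "X2016_12"
```

The anchored form also works for years other than 2016, at the cost of a slightly less readable pattern.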

How to delete lines with a special character?

I would like to delete the lines which contain the opening bracket "(" from my dataframe.
I tried the following:
df[!grepl("(", df$Name),]
But this does not track down the ( sign
You have to double-escape the ( with \\.
x <- c("asdf", "asdf", "df", "(as")
x[!grepl("\\(", x)]
# [1] "asdf" "asdf" "df"
Just apply this to your df like df[!grepl("\\(", df$Name), ]
You could also drop every element that contains any punctuation character by using a regex character class:
x[!grepl("[[:punct:]]", x)]
As pointed out by @CSquare in the comments, here is a great summary about special characters in R regex
Additional input from the comments:
@Sotos: Performance can be gained with pattern = '(' and fixed = TRUE, since the regex engine is bypassed.
x[!grepl('(', x, fixed = TRUE)]
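Applied to a data frame, the escaped-regex and fixed-string versions select the same rows. A minimal sketch, assuming a small made-up data frame with a Name column as in the question:

```r
# Hypothetical data: two of the four names contain "("
df <- data.frame(Name = c("asdf", "(as", "d(f)", "df"),
                 value = 1:4)

# Regex version: "(" is a metacharacter and must be escaped
keep_regex <- df[!grepl("\\(", df$Name), ]

# Literal version: fixed = TRUE turns off regex interpretation
keep_fixed <- df[!grepl("(", df$Name, fixed = TRUE), ]

identical(keep_regex, keep_fixed)  # TRUE
```

The fixed = TRUE form is usually the simpler choice when the thing being searched for is a plain character rather than a pattern.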

Transform suffix into prefix in column name

I would like to move the suffix of a column name to its beginning so that it becomes its prefix. I have many columns with changing names (except the suffix), so manually renaming is not an option.
Example:
set.seed(1)
dat <- data.frame(ID = 1:5,
speed.x.alpha = runif(5),
power.x.alpha = rpois(5, 1),
force.x.alpha = rexp(5),
speed.y.beta = runif(5),
power.y.beta = rpois(5, 1),
force.y.beta = rexp(5))
In the end, the data frame should have the following column names:
ID, alpha.speed.x, alpha.power.x, alpha.force.x, beta.speed.x, beta.power.x, force.power.x.
I strongly assume I need a gsub/sub expression which allows me to select the characters after the last dot, which I would then paste to the colnames, and eventually remove from the end. So far without success though...
A couple of gsubs and paste0 will do the trick:
gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x" "beta.speed.x"
[6] "beta.power.x" "beta.force.x"
The () in the regular expression capture the characters that match each subexpression. "\\." matches the literal "." and the "$" anchors the expression to the end of the string. The second argument pastes the captured subexpressions together in the new order. This result is fed to a second gsub, which replaces a trailing "y" with an "x" if one is found.
To rename the variables, use
names(dat) <- gsub("y$", "x", gsub("(^.*)\\.(.*a$)", paste0("\\2", ".", "\\1"), names(dat)))
Here is one option with sub. From the start (^) of the string, we match one or more characters that are not a . ([^.]+) and capture them as a group (inside the parentheses). This is followed by a dot (\\.); note that . is a metacharacter that matches any character, so it must be escaped (\\) or placed inside square brackets to be read literally. Then comes another set of characters that are not a dot (the second capture group), another dot, and the rest of the characters up to the end of the string (the third capture group). In the replacement, we change the order of the backreferences to get the expected output.
names(dat) <- sub("^([^.]+)\\.([^.]+)\\.(.*)", "\\3.\\1.\\2", names(dat))
names(dat)
#[1] "ID" "alpha.speed.x" "alpha.power.x" "alpha.force.x"
#[5] "beta.speed.y" "beta.power.y" "beta.force.y"
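If the number of dot-separated parts varies between names, a pattern that targets only the last dot may be more robust. This is a sketch of that variation, not taken from the answers above, using a shortened set of the names from the question:

```r
nms <- c("ID", "speed.x.alpha", "force.y.beta")

# (.*) is greedy, so it takes everything up to the LAST dot;
# ([^.]+)$ is the suffix after that dot
moved <- sub("^(.*)\\.([^.]+)$", "\\2.\\1", nms)

moved  # "ID" "alpha.speed.x" "beta.force.y"
```

Names without a dot, such as "ID", do not match the pattern and are returned unchanged.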
