Only keep part after the 2nd pattern in R [duplicate] - r

This question already has answers here:
How to delete everything after nth delimiter in R?
(2 answers)
Remove text after second colon
(3 answers)
Remove all characters after the 2nd occurrence of "-" in each element of a vector
(1 answer)
Closed 3 years ago.
How could I remove everything before the second pattern occurrence in a data frame using R?
I used:
for (i in 1:length(df1)){
df1[, i]<- gsub(".*_", "",df1[, i])
}
But I guess there is a better way to apply that to the whole data frame?
Here is an example of the values in the data frame:
name_000004_A_B_C
name_00003_C_D
and I would like to get
A_B_C
C_D
Thank you for your help.

x <- c("name_000004_A_B_C", "name_00003_C_D")
gsub("(name_[0-9]*_)(.*)", "\\2", x)
#[1] "A_B_C" "C_D"
More generalised:
gsub("([a-z0-9]*_[a-z0-9]*_)(.*)", "\\2", x)
#[1] "A_B_C" "C_D"
The substitution uses two capture groups: the first, (name_[0-9]*_), matches the prefix up to and including the second underscore, and the second, (.*), matches whatever comes after. The replacement "\\2" keeps only the second group. Hope this helps!
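To address the "whole data frame" part of the question, the same idea can be applied to every column without an explicit loop. This is a minimal sketch, assuming all columns are character vectors. The sample `df1` and the helper name `strip_prefix` are made up for illustration, and the pattern `^([^_]*_){2}` (two underscore-terminated fields anchored at the start) is one possible generalisation, not the answerer's exact pattern.

```r
# Hypothetical sample data; the original df1 is not shown in the question
df1 <- data.frame(
  a = c("name_000004_A_B_C", "name_00003_C_D"),
  b = c("id_17_X_Y", "id_2_Z"),
  stringsAsFactors = FALSE
)

# Drop everything up to and including the second underscore.
# The pattern is anchored with ^, so sub() (first match only) is enough.
strip_prefix <- function(col) sub("^([^_]*_){2}", "", col)

# Assigning into df1[] replaces every column but keeps the data.frame structure
df1[] <- lapply(df1, strip_prefix)
df1
```

Values with fewer than two underscores do not match the pattern and are left unchanged.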

Related

Restructuring column names in a df [duplicate]

This question already has answers here:
Getting and removing the first character of a string
(7 answers)
Remove prefix letter from column variables
(3 answers)
Closed 2 years ago.
I've got a data frame with column names that look like this:
X121.10.21 X131.90.23
I want to remove the X at the beginning of each string, remove the third number (the part after the second .), and then swap the first and second numbers, like this:
10.121 90.131
How can I do this? I would especially appreciate a way to do this with dplyr, if possible.
We can use sub: capture the first and second numbers as groups, then swap them in the replacement using backreferences to the captured groups.
names(df1) <- sub("X(\\d+)\\.(\\d+)\\..*", "\\2.\\1", names(df1))
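A small self-contained check of that substitution may help. This is only a sketch: the two-column `df1` and the helper name `swap_parts` are invented here, and the pattern is the one from the answer with `^` and `$` anchors added.

```r
# Hypothetical data frame carrying the column names from the question
df1 <- data.frame(1:2, 3:4)
names(df1) <- c("X121.10.21", "X131.90.23")

# Group 1 = first number, group 2 = second number;
# the third number is matched by .* and discarded
swap_parts <- function(nm) sub("^X(\\d+)\\.(\\d+)\\..*$", "\\2.\\1", nm)

names(df1) <- swap_parts(names(df1))
names(df1)
#[1] "10.121" "90.131"
```

If dplyr (1.0.0 or later) is available, `rename_with(df1, swap_parts)` should apply the same function to the column names inside a pipeline.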

Remove part of column name post the second "_" [duplicate]

This question already has answers here:
Exclude everything after the second occurrence of a certain string
(2 answers)
Closed 3 years ago.
I have a vector that holds the names of the columns:
group <- c("amount_bin_group", "fico_bin_group", "cltv_bin_group", "p_region_bin")
I want to remove the part after the second "_" from each element, i.e. I want it to be
group <- c("amount_bin", "fico_bin", "cltv_bin", "p_region")
I can split this into two vectors and try gsub or substr. However, it would be nice to do that in a vectorized way. Any thoughts?
I checked other posts regarding the same question, but none of them covers this case.
> sub("(.*)_.*$", "\\1", group)
[1] "amount_bin" "fico_bin" "cltv_bin" "p_region"
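Note that the pattern above cuts at the last underscore, which coincides with the second one only because every element has exactly two. A hedged alternative that cuts at the second underscore regardless of how many follow could look like this (the helper name `keep_two` is made up for illustration):

```r
group <- c("amount_bin_group", "fico_bin_group", "cltv_bin_group", "p_region_bin")

# Keep the first two underscore-separated fields, drop the rest.
# [^_]* cannot cross an underscore, so the capture stops before the second "_".
keep_two <- function(x) sub("^([^_]*_[^_]*)_.*$", "\\1", x)

keep_two(group)
#[1] "amount_bin" "fico_bin"   "cltv_bin"   "p_region"

# With three underscores the two approaches differ
keep_two("a_b_c_d")                  # "a_b"
sub("(.*)_.*$", "\\1", "a_b_c_d")    # "a_b_c"
```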

Difference between [A-Z] and LETTERS in grep [duplicate]

This question already has answers here:
Matching multiple patterns
(6 answers)
Closed 5 years ago.
I am trying to only keep rows whose id contains letters. And I find the following two ways give different results.
df[grep("[A-Z]",df$id),]
df[grep(LETTERS,df$id),]
It seems the second way will omit many rows that actually have letters.
Why?
If you want to grep for several patterns held in a vector, try this:
to_match <- paste(LETTERS, collapse = "|")
to_match
[1] "A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z"
and then
df[grep(to_match, df$id), ]
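Explanation:
grep() expects a single pattern. If it is given a character vector of length 2 or more, such as LETTERS, it uses only the first element (with a warning), so grep(LETTERS, df$id) only finds ids containing "A". Collapsing the letters into one string separated by the "or" operator "|" gives a single pattern that matches any of the characters in "to_match".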
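A small sketch of the difference, using made-up ids. It assumes, as the documentation for grep describes, that only the first element of a multi-element pattern is used. To avoid relying on the warning, the code passes `LETTERS[1]` explicitly to stand in for what `grep(LETTERS, ...)` effectively searches for.

```r
ids <- c("123", "A12", "45B", "7-8")   # hypothetical ids

# One pattern that matches any uppercase letter via alternation
to_match <- paste(LETTERS, collapse = "|")
by_alternation <- grepl(to_match, ids)

# The character class from the question, for comparison
by_class <- grepl("[A-Z]", ids)

# What a length-26 pattern vector reduces to: just "A"
first_only <- grepl(LETTERS[1], ids)

data.frame(ids, by_alternation, by_class, first_only)
```

In this sample, "45B" is found by the first two patterns but missed by `first_only`, which matches the symptom described in the question.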

Remove characters in string before specific symbol(including it) [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Use gsub remove all string before first white space in R
(4 answers)
Closed 5 years ago.
To start with: yes, similar questions already exist here, but the solutions don't work as they should, at least for me.
I'd like to remove everything (any combination of letters and digits) before the first semicolon, and remove that semicolon too.
So we have some strings:
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"
Let's use gsub() with . and *, which means any character occurring 0 or more times:
gsub(".*;", "", x)
gsub(".*;", "", y)
And as a result I get:
[1] "GEF2"
[1] "3DR"
But I'd like to have:
[1] "ABC;GEF2"
[1] "EER;3DR"
Why did it 'catch' the second occurrence of the semicolon instead of the first?
You could use
gsub("[^;]*;(.*)", "\\1", x)
# [1] "ABC;GEF2"
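As for the "why": `.*` is greedy, so it consumes as many characters as possible and the `;` in the pattern ends up matching the last semicolon. The sketch below contrasts that with two ways of stopping at the first one: a negated character class, as in the answer, and a lazy quantifier, which needs `perl = TRUE`. The variable names `greedy`, `negated` and `lazy` are illustrative only.

```r
x <- "1;ABC;GEF2"
y <- "X;EER;3DR"

# Greedy: .* runs to the last semicolon
greedy <- sub(".*;", "", x)                      # "GEF2"

# Negated class: [^;]* cannot cross a semicolon, so it stops at the first
negated <- sub("^[^;]*;", "", c(x, y))           # "ABC;GEF2" "EER;3DR"

# Lazy quantifier: .*? takes as few characters as possible
lazy <- sub(".*?;", "", c(x, y), perl = TRUE)    # "ABC;GEF2" "EER;3DR"
```

With the lazy pattern, `sub()` rather than `gsub()` matters: `gsub()` would repeat the replacement and strip the second field as well.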

Cutting value in vector by determine positions [duplicate]

This question already has answers here:
Trying to return a specified number of characters from a gene sequence in R
(3 answers)
Extracting the last n characters from a string in R
(15 answers)
Closed 5 years ago.
Is there a function in R with which I can cut a value in a vector?
For example, I have this vector:
40754831597
64278107602
64212163451
and from each value in the vector I want to cut the digits from position 3 to 6, for example, and get a new vector that looks like this:
7548
2781
2121
and so on
I don't really get why you would like to do this, but here you go:
# assuming it's a character vector
substring(vec,3,6)
# if it's numeric
substring(as.character(vec),3,6)
#output
#[1] "7548" "2781" "2121"
We can also use sub:
sub(".{2}(.{4}).*", "\\1", v1)
#[1] "7548" "2781" "2121"
data
v1 <- c(40754831597, 64278107602, 64212163451)
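One caveat for the numeric case: converting numbers to text can switch to scientific notation for some values, which would shift the character positions. A minimal sketch that guards against this, assuming the values are whole numbers stored as doubles (the names `vec` and `digits` are illustrative):

```r
vec <- c(40754831597, 64278107602, 64212163451)

# Force fixed notation so every digit is present in the string;
# trim = TRUE avoids padding when the numbers have different widths
digits <- format(vec, scientific = FALSE, trim = TRUE)

# Characters 3 to 6 of each element
substr(digits, 3, 6)
#[1] "7548" "2781" "2121"
```

The result is a character vector. Wrap it in `as.numeric()` if numbers are needed, keeping in mind that leading zeros would be lost.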
