Delete everything after second comma from string [duplicate] - r

This question already has answers here:
How to delete everything after nth delimiter in R?
(2 answers)
Closed 3 years ago.
I would like to remove everything after the second comma in a string, including the second comma itself. Here is an example:
x <- 'Day, Bobby, Jean, Gav'
gsub("(.*),.*", "\\1", x)
and it gives:
[1] "Day, Bobby, Jean"
while I want:
[1] "Day, Bobby"
regardless of the number of names that may exist in x.

Use
> x <- 'Day, Bobby, Jean, Gav'
> sub("^([^,]*,[^,]*),.*", "\\1", x)
[1] "Day, Bobby"
The ^([^,]*,[^,]*),.* pattern matches
^ - start of string
([^,]*,[^,]*) - Group 1: 0+ non-commas, a comma, and 0+ non-commas
,.* - a comma and the rest of the string.
The \1 in the replacement pattern will keep Group 1 value in the result.
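The same idea can be generalized to the n-th comma. The following is only a sketch: keep_before_nth_comma is a hypothetical helper name, not something from base R, and it assumes perl = TRUE so that the non-capturing group (?:...) is available.

```r
# Hypothetical helper: drop the n-th comma and everything after it.
keep_before_nth_comma <- function(x, n) {
  stopifnot(n >= 1)
  # (?:[^,]*,){n-1} consumes the first n-1 comma-terminated fields,
  # [^,]* consumes the n-th field, and ",.*" is the part that is dropped.
  pattern <- sprintf("^((?:[^,]*,){%d}[^,]*),.*", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

x <- "Day, Bobby, Jean, Gav"
keep_before_nth_comma(x, 2)
# [1] "Day, Bobby"
keep_before_nth_comma(x, 3)
# [1] "Day, Bobby, Jean"
keep_before_nth_comma("Day, Bobby", 2)  # fewer than n commas: returned unchanged
# [1] "Day, Bobby"
```

Because sub leaves the input untouched when the pattern does not match, strings with fewer than n commas come back as they were.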

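We can also split on the commas (and any spaces after them) with strsplit, keep the first two pieces with head, and rejoin them with toString
toString(head(strsplit(x, ",\\s*")[[1]], 2))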
#[1] "Day, Bobby"
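If x holds several strings, the split-and-rejoin approach can be applied element by element. This is a minimal sketch; keep_first_two is a made-up name and the input vector is invented for illustration.

```r
# Hypothetical helper: keep the first two comma-separated fields of each string.
keep_first_two <- function(x) {
  # Split on a comma plus any whitespace that follows it.
  pieces <- strsplit(x, ",\\s*")
  # toString() rejoins the kept pieces with ", ".
  vapply(pieces, function(p) toString(head(p, 2)), character(1))
}

keep_first_two(c("Day, Bobby, Jean, Gav", "Ann, Lee", "Solo"))
# [1] "Day, Bobby" "Ann, Lee"   "Solo"
```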

Related

How to get a date out of a string? [duplicate]

This question already has answers here:
R Regexp - extract number with 5 digits
(4 answers)
Closed 1 year ago.
I have a file named "test_result_20210930.xlsx". I would like to get "20210930" out into a new variable date. How should I do that? I think I can say pattern="[0-9]+". What if I have more numbers in the file name, and I only want the part that stands for the date (8 digits together)?
Any suggestion?
Use gsub with \\D+, which matches every run of non-digits, and specify an empty string ("") as the replacement so that only the digits remain
gsub("\\D+", "", str1)
[1] "20210930"
If the name also includes other digits and we want to return only the 8 digits
sub(".*_(\\d{8})_.*", "\\1", "test_result_20210930_01.xlsx")
[1] "20210930"
Or use str_extract
library(stringr)
str_extract("test_result_20210930_01.xlsx", "(?<=_)\\d{8}(?=_)")
[1] "20210930"
If we need to convert it to a date-time object automatically
library(parsedate)
parse_date(str1)
[1] "2021-09-30 UTC"
where str1 is
str1 <- "test_result_20210930.xlsx"
You can also use str_extract from the stringr package to obtain the desired result.
library(stringr)
str_extract("test_result_20210930.xlsx", "[0-9]{8}")
# [1] "20210930"
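For completeness, here is a base R sketch that needs no extra package. It assumes the date is the only run of exactly 8 digits in the name and that it is in year-month-day order; fname, date_chr and date_val are names made up for this example.

```r
fname <- "test_result_20210930_01.xlsx"

# The lookarounds reject 8 digits that are part of a longer run of digits.
m <- regexpr("(?<![0-9])[0-9]{8}(?![0-9])", fname, perl = TRUE)
date_chr <- regmatches(fname, m)

# Assumes the digits are ordered as year, month, day.
date_val <- as.Date(date_chr, format = "%Y%m%d")

date_chr
# [1] "20210930"
date_val
# [1] "2021-09-30"
```

The result is stored in date_val rather than date to avoid masking the base function date().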

Move characters from beginning of column name to end of column name (additional question) [duplicate]

This question already has an answer here:
regular expression match digit and characters
(1 answer)
Closed 2 years ago.
I need an additional solution to the previous question/answer
Move characters from beginning of column name to end of column name
I have a dataset where column names have two parts divided by _ e.g.
pal036a_lon
pal036a_lat
pal036a_elevation
I would like to convert the prefixes into suffixes so that it becomes:
lon_pal036a
lat_pal036a
elevation_pal036a
The answer to the previous question
names(df) <- sub("([a-z])_([a-z]+)", "\\2_\\1", names(df))
does not work for numbers within the prefixes.
Assuming your names have a single _, you could also use strsplit():
sapply(strsplit(names(df), '_'), function(x) paste(rev(x), collapse = '_'))
If you have more than one _, you could modify the above as suggested by jay.sf, which moves the last piece to the front:
sapply(strsplit(x, "_"), function(x) paste(c(x[length(x)], x[-length(x)]), collapse="_"))
You can include alphanumeric characters in the first group:
names(df) <- sub("([a-z0-9]+)_([a-z]+)", "\\2_\\1", names(df))
For example:
x <- c("pal036a_lon","pal036a_lat","pal036a_elevation")
sub("([a-z0-9]+)_([a-z]+)", "\\2_\\1",x)
#[1] "lon_pal036a" "lat_pal036a" "elevation_pal036a"
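If the prefix may contain characters other than lowercase letters and digits, a looser pattern that treats everything before the first _ as the prefix might be easier to maintain. A small sketch on a toy data frame (the df here is invented for the example):

```r
# Toy data frame with the column names from the question.
df <- data.frame(pal036a_lon = 1, pal036a_lat = 2, pal036a_elevation = 3)

# Group 1: everything before the first "_"; group 2: everything after it.
names(df) <- sub("^([^_]+)_(.+)$", "\\2_\\1", names(df))

names(df)
# [1] "lon_pal036a"       "lat_pal036a"       "elevation_pal036a"
```

Names without any _ are left unchanged, and names with several _ are split at the first one only.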

Match all elements with punctuation mark except asterisk in r [duplicate]

This question already has answers here:
in R, use gsub to remove all punctuation except period
(4 answers)
Closed 2 years ago.
I have a vector vec in which some elements contain a punctuation mark. I want to return all elements that contain a punctuation mark, except those that contain an asterisk.
vec <- c("a,","abc","ef","abc-","abc|","abc*01")
> vec[grepl("[^*][[:punct:]]", vec)]
[1] "a," "abc-" "abc|" "abc*01"
Why does it return "abc*01" if there is a negation [^*] for the asterisk?
Maybe you can try grep like below
grep("\\*",grep("[[:punct:]]",vec,value = TRUE), value = TRUE,invert = TRUE) # nested `grep`s for double filtering
or
grep("[^\\*[:^punct:]]",vec,perl = TRUE, value = TRUE) # but this will fail for case `abc*01|` (thanks for feedback from @Tim Biegeleisen)
which gives
[1] "a," "abc-" "abc|"
You could use grepl here:
vec <- c("a,","abc-","abc|","abc*01")
vec[grepl("^(?!.*\\*).*[[:punct:]].*$", vec, perl=TRUE)]
[1] "a," "abc-" "abc|"
The regex pattern used, ^(?!.*\\*).*[[:punct:]].*$, will only match elements that do not contain any asterisk while also containing at least one punctuation character:
^ from the start of the string
(?!.*\*) assert that no * occurs anywhere in the string
.* match any content
[[:punct:]] match any single punctuation character (but not *)
.* match any content
$ end of the string
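As for why the original pattern returns "abc*01": [^*][[:punct:]] asks for any non-asterisk character followed by a punctuation character, and "c*" satisfies that. The sketch below shows the matched text and then a plain two-condition filter as an alternative; has_punct and has_star are names made up for the example.

```r
vec <- c("a,", "abc", "ef", "abc-", "abc|", "abc*01")

# What the original pattern actually matched inside "abc*01":
regmatches("abc*01", regexpr("[^*][[:punct:]]", "abc*01"))
# [1] "c*"

# Two separate conditions: contains punctuation, and contains no literal "*".
has_punct <- grepl("[[:punct:]]", vec)
has_star <- grepl("*", vec, fixed = TRUE)
vec[has_punct & !has_star]
# [1] "a,"   "abc-" "abc|"
```

Keeping the two tests apart avoids the lookahead and also handles a case such as "abc*01|" correctly, because any element containing an asterisk is excluded.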

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence, and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my solution to return the leftmost one, use a lazy quantifier with the first .*: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
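If every 4-digit block enclosed with _ is wanted rather than only the leftmost or rightmost one, base R's gregexpr with lookarounds is one option. This is a sketch, not part of the answer above; it also shows that sub returns its input unchanged when nothing matches, which is worth checking for.

```r
some_string <- "1_2018_start_2007_3_end"

# All 4-digit runs that have "_" directly before and after them.
years <- regmatches(
  some_string,
  gregexpr("(?<=_)\\d{4}(?=_)", some_string, perl = TRUE)
)[[1]]
years
# [1] "2018" "2007"

# Caveat of the sub() approach: no match means the input comes back as is.
sub(".*_(\\d{4})_.*", "\\1", "no_year_here")
# [1] "no_year_here"
```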
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the pattern requires _ both before and after the 4 digits, then use regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
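One caveat with converting to numeric is that leading zeros are lost, so a piece such as "0420" would be counted as 3 characters. A variant that keeps the pieces as character strings might look like this (a sketch; parts is just an illustrative name):

```r
some_string <- "1_2_start_2007_3_end"

# Split on a literal "_" and keep the pieces as character strings.
parts <- strsplit(some_string, "_", fixed = TRUE)[[1]]

# Keep pieces that are exactly 4 characters long and contain digits only.
parts[nchar(parts) == 4 & !grepl("[^0-9]", parts)]
# [1] "2007"
```

This also avoids the coercion warning that as.numeric gives for the non-numeric pieces. Note that the result is a character string, not a number.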
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means:
a non-space character followed by any number of characters,
then _
followed by four digits (\d{4})
the four digits are your year, captured using ()
another _
a non-space character followed by any other characters
Since you mentioned you're new, please test this and all other answers at https://regex101.com/. It is a good place to learn regex, and it explains in depth what your regex is actually doing.
If you just care about (year) then below regex is enough:
_(\d{4})_
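To use that short pattern from R, the backslash has to be doubled inside the string, and the captured group has to be pulled out of the match. A minimal sketch with base R's regexec, which returns the full match followed by each capture group:

```r
some_string <- "1_2_start_2007_3_end"

# regexec() reports the whole match first, then the capture groups.
m <- regmatches(some_string, regexec("_(\\d{4})_", some_string))

m[[1]][1]  # whole match, underscores included
# [1] "_2007_"
m[[1]][2]  # first capture group: the year only
# [1] "2007"
```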

how to split a sequence in R into multiple sub parts [duplicate]

This question already has answers here:
Chopping a string into a vector of fixed width character elements
(13 answers)
Closed 6 years ago.
seq = "GAGTAGGAGGAG". How do I split this sequence into the sub-sequences "GAG", "TAG", "GAG", "GAG", i.e. how do I split the sequence into groups of three?
We can create a function called fixed_split that will split a character string into equal parts. The regular expression is a lookbehind that matches on n elements together:
fixed_split <- function(text, n) {
  strsplit(text, paste0("(?<=.{", n, "})"), perl = TRUE)
}
fixed_split("GAGTAGGAGGAG", 3)
[[1]]
[1] "GAG" "TAG" "GAG" "GAG"
Edit
In your comment you say that sequence = "ATGATGATG" does not work, but it splits as expected:
strsplit(sequence,"(?<=.{3})", perl=TRUE)
[[1]]
[1] "ATG" "ATG" "ATG"
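A regex-free alternative is to compute the start positions and cut with substring. This is only a sketch, and fixed_split_substring is a made-up name; unlike fixed_split above it returns a plain character vector for a single string rather than a list.

```r
# Hypothetical helper: cut one string into chunks of n characters.
fixed_split_substring <- function(text, n) {
  len <- nchar(text)
  if (len == 0) return(character(0))
  # Start position of every chunk: 1, 1 + n, 1 + 2n, ...
  starts <- seq(1, len, by = n)
  # The last chunk may be shorter than n, so cap the end at the string length.
  substring(text, starts, pmin(starts + n - 1, len))
}

fixed_split_substring("GAGTAGGAGGAG", 3)
# [1] "GAG" "TAG" "GAG" "GAG"
fixed_split_substring("GAGTA", 3)
# [1] "GAG" "TA"
```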
