extract string before dash in R - r

I have a column for name, the format is mixed with AAA and AAA-D. I want to extract the name before dash (if it has dash) or keep the non-dashed name.
the list is
Name
W1-D1
Empty
W2-D1
what I want to extract are
Name
W1
Empty
W2
I found several syntaxes like v1<-gsub("^(.*?)-.*", "\\1",v) but this does not work on my list, I got this “c(\"W1" in v1 . Did I use this syntax wrong?

You can as well use stringr
library(stringr)
v2<-str_extract(v, "[^-]+")

The following regex will do it.
sub("(^[^-]+)-.*", "\\1", Name)
#[1] "W1" "Empty" "W2"
Data.
Name <- scan(what = character(), text ="
W1-D1
Empty
W2-D1
")

Related

Add a character to a specific part of a string?

I have a list of file names as such:
"A/B/file.jpeg"
"A/C/file2.jpeg"
"B/C/file3.jpeg"
and a couple of variations of such.
My question is how would I be able to add a "new" or any characters into each of these file names after the second "/" such that the length of the string/name doesn't matter just that it is placed after the second "/"
Results would ideally be:
"A/B/newfile.jpeg"
"A/B/newfile2.jpeg" etc.
Thanks!
Another possible solution, based on stringr::str_replace:
library(stringr)
l <- c("A/B/file.jpeg", "A/B/file2.jpeg", "A/B/file3.jpeg")
str_replace(l, "\\/(?=file)", "\\/new")
#> [1] "A/B/newfile.jpeg" "A/B/newfile2.jpeg" "A/B/newfile3.jpeg"
Using gsub.
gsub('(file)', 'new\\1', x)
# [1] "A/B/newfile.jpeg" "A/C/newfile2.jpeg" "B/C/newfile3.jpeg"
Data:
x <- c("A/B/file.jpeg", "A/C/file2.jpeg", "B/C/file3.jpeg")

RegEx for replacing part of a string in R [duplicate]

This question already has answers here:
Escaped Periods In R Regular Expressions
(3 answers)
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
I am trying to do an exact pattern match using the gsub/sub and replace function. I am not getting the desired response. I am trying to remove the .x and .y from the names without affecting other names.
name = c("company", "deriv.x", "isConfirmed.y")
new.name = gsub(".x$|.y$", "", name)
new.name
[1] "compa" "deriv" "isConfirmed"
company has become compa.
I have also tried
remove = c(".x", ".y")
replace(name, name %in% remove, "")
[1] "company" "deriv.x" "isConfirmed.y"
I would like the outcome to be.
"company", "deriv", "isConfirmed"
How do I solve this problem?
Here we can have a simple expression that removes the undesired . and anything after that:
(.+?)(?:\..+)?
or for exact match:
(.+?)(?:\.x|\.y)?
R Test
Your code might look like something similar to:
gsub("(.+?)(?:\\..+)?", "\\1", "deriv.x")
or
gsub("(.+?)(?:\.x|\.y)?", "\\1", "deriv.x")
R Demo
RegEx Demo 1
RegEx Demo 2
Description
Here, we are having a capturing group (.+?), where our desired output is and a non-capturing group (?:\..+)? which swipes everything after the undesired ..
The dot matches any character except a newline ao .x$|.y$ would also match the ny in company
There is no need for any grouping structure to match a dot followed by x or y. You could match a dot and match either x or y using a character class:
\\.[xy]
Regex demo | R demo
And replace with an empty string:
name = c("company", "deriv.x", "isConfirmed.y")
new.name = gsub("\\.[xy]", "", name)
new.name
Result
[1] "company" "deriv" "isConfirmed"
In a regex, . represents "any character". In order to recognize literal . characters, you need to escape the character, like so:
name <- c("company", "deriv.x", "isConfirmed.y")
new.name <- gsub("\\.x$|\\.y$", "", name)
new.name
[1] "company" "deriv" "isConfirmed"
This explains why in your original example, "company" was being transformed to "compa" (deleting the "any character of 'n', followed by a 'y' and end of string").
Onyambu's comment would also work, since within the [ ] portion of a regex, . is interpreted literally.
gsub("[.](x|y)$", "", name)

Regular expressions, stringr - have the regex, can't get it to work in R

I have a data frame with a field called "full.path.name"
This contains things like
s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
01 GROUP is a pattern of variable size in the whole string.
I would like to add a new field onto the data frame called "short.path"
and it would contain things like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file using stringr, I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/
and got this
s:///*.[^/]*
as the regex to use
so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
What I thought I would get is a new field on my data frame containing
01 GROUP etc
I get NA
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
Gives me S
Where am I going wrong?
When I use: https://regexr.com/
I get
\d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT:
There are two solutions here.
The reason the solutions didn't work at first was because
sfiles$Full.path.name
was >255 in some cases.
What I did:
To make g_t_m's regex work
library(tidyverse)
#read the file
sfiles1<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
# add a field to calculate path length and filter out
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
# then use str_replace to take out the full path name and leave only the
top
# folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "
(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7
I think it was the full.path.name field that was causing problems.
To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "
(^.+?/[^/]+?)/.+$", "\\1")
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
See the regex demo
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
"s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx docx s:///01 GROUP
2 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf pdf s:///01 GROUP

Extract text with gsub

I am setting up an automated data analysis procedure and, more or less at the end of the procedure, I would like to extract automatically the name of the file that has been analysed. I have a data frame with a column containing names, with the following style:
Baseline/Cell_Line_2_KB_1813_B_Baseline
Dose 0001/Cell_Line_3_KB1720_1_0001
Dose 0010/Cell_Line_1_KB1810 mat_0010
I would like to extract just the characters in bold: "KB_1813_B", "KB1720_1" and "KB1810 mat" in a separate column.
I used gsub with the following command:
df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)
I could easily remove the first part of the problem, but I am stuck trying to remove the second part. Is there some command in gsub to remove everything, starting from the end of the name, until you encounter a special character ( "_" in my case)?
Thank you :)
We can use str_extract
library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
Or the same pattern can be captured as a group with sub
sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
data
df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010"), stringsAsFactors = FALSE)
The way to do this is using regex groups:
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010")
gsub("^.+Cell_Line_._(.+)_.+$", "\\1", x)
[1] "KB_1813_B" "KB1720_1" "KB1810 mat"

Using regular expression in string replacement

I have a broken csv file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that it contains some fields which include a comma but does not wrap those fields in double quotes. So I have to use a regular expression to find these fields, and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but it is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap the price with commas in double quotes. But I am not good at regular expressions and stuck.
vector <- read.Lines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read into R.
Appreciate any help!
lines <- readLines(textConnection(object="DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))
library(stringi)
library(tidyverse)
stri_split_regex(lines, ",", n=3, simplify=TRUE) %>%
as_data_frame() %>%
docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
from there you can readr::write_csv() or write.csv()
The extra facilities in the stringi or stringr packages do not seem needed. gsub seems perfectly suited for this. You just need understand about capture-groups with paired parentheses (brackets to Brits) and the use of the double-backslash_n convention for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector<- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))" , "\"\\1\"", readLines(textConnection(txt)) )
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
You are putting quotes around specific sequence of digits possibly repeated(commas digits) and possible period and 2 digits . There might be earlier SO questions about formatting as "currency".
Here are some solutions:
1) read.pattern This uses read.pattern in the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions, of the pattern are the fields. This will read in the file and process it all in one step so it is not necessary to use readLines first.
^(.*?), that begins the pattern will match everything from the start until the first comma. Then (.*?), will match to the next comma and finally (.*)$ will match everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy. We needed to specify perl=TRUE so that it uses perl regular expressions since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser which does not support *? . If you would rather have character columns instead of factor then add the as.is=TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
2) sub This uses very simple regular expression (just a single character match) and no packages. Using vector as defined in the question this replaces the first two commas with semicolons. Then it can be read in using sep = ";"
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
Add the line marked ## in (1) if you want numeric prices.

Resources