Extracting pattern substrings from a text file in R - r

I wish to extract all unique substrings of text from a text file using R that adhere to the form "matrixname[rowname,column number]". I have achieved only limited success with grep and str_extract_all (stringr), in the sense that they return the entire line rather than just the matching substring. Trying to remove the unwanted text with gsub has also been unsuccessful. Here is an example of the code I have been using.
#Read in file
txt<-read.table("Project_R_code.R")
#create new object to create lines that contain this pattern
txt2<-grep("param\\[.*1\\]",txt$V1, value=TRUE)
#remove all text that does not match the above pattern
gsub("[^param\\[.*1\\]]","", txt2,perl=TRUE)
The second line works (but again gives me the whole line, not just the matching substring). However, the gsub code for removing non-matching text keeps the lines and turns them into something like this:
[200] "[p.p]param[ama1]param[ama11]*[r1]param[ama1]...
and I have no idea why. I realise this method of paring down the line into something more manageable is more tedious but it's the only way I know how to get the patterns.
Ideally I would like R to spit out a list of all the (unique) substrings it finds in the text file that match my pattern, but I don't know the command. Any help on this is much appreciated.

If you'd like to extract individual components, try str_match:
test <- c("aaa[name1,1]", "bbb[name2,3]", "ccc[name3,3]")
stringr::str_match(test, "([a-zA-Z0-9_]+)[[]([a-zA-Z0-9_]+),.*?(\\d+)\\]")
## [,1] [,2] [,3] [,4]
## [1,] "aaa[name1,1]" "aaa" "name1" "1"
## [2,] "bbb[name2,3]" "bbb" "name2" "3"
## [3,] "ccc[name3,3]" "ccc" "name3" "3"
Otherwise, use str_extract.
Note that to match [ in ERE/TRE we use a set containing a single [ character, i.e. [[].
Moreover, if you have many matches in a single string, use str_match_all or str_extract_all.
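Back on the original question of getting just the unique substrings, here is a minimal base-R sketch (the sample lines and the exact pattern are assumptions; adjust them to your real file):

```r
# Sample lines standing in for readLines("Project_R_code.R")
txt <- c("y <- param[ama,1] + param[beta,2]",
         "z <- param[ama,1] * 2")

# gregexpr() locates every match in each line; regmatches() extracts them
pat  <- "[A-Za-z0-9_.]+\\[[A-Za-z0-9_.]+,[0-9]+\\]"
hits <- unique(unlist(regmatches(txt, gregexpr(pat, txt))))
hits
# [1] "param[ama,1]"  "param[beta,2]"
```

stringr::str_extract_all(txt, pat) followed by unique(unlist(...)) gives the same result.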

How do you split the following vector of strings based on a delimiter that occurs after a certain character pattern in R?

Here is an example of the vector:
strings<- (c("SPG_L_SPG_R", "SAS_SPG_R_SFG_L", "s_cere_R_SPG_L" ))
I need the split strings to be "SPG_L", "SPG_R","SAS_SPG_R", "SFG_L", "s_cere_R", "SPG_L"
I want to split the string at "_" that occurs after either an "_L" or an "_R"
I know there is a way of splitting strings like this using regex and then I want to use an apply function to apply the string splitting function to the entire vector. I have searched the forum for examples to help me do this, but I am still struggling. Any help is appreciated!
Using positive look-behind assertion, we can split at _ preceded by R or L
stringr::str_split(strings, '(?<=[RL])_', simplify = TRUE)
[,1] [,2]
[1,] "SPG_L" "SPG_R"
[2,] "SAS_SPG_R" "SFG_L"
[3,] "s_cere_R" "SPG_L"
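The same split also works in base R; a sketch with strsplit(), where perl = TRUE enables the look-behind:

```r
strings <- c("SPG_L_SPG_R", "SAS_SPG_R_SFG_L", "s_cere_R_SPG_L")
# split at "_" only when the preceding character is R or L
strsplit(strings, "(?<=[RL])_", perl = TRUE)
# [[1]]
# [1] "SPG_L" "SPG_R"
#
# [[2]]
# [1] "SAS_SPG_R" "SFG_L"
#
# [[3]]
# [1] "s_cere_R" "SPG_L"
```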

R - Replace Group 1 match in regex but not full match

Suppose I want to extract all letters between the letters a and c. So far I have been using the stringr package, which gives a clear view of the full matches and the groups. For example, the package gives the following.
library(stringr)
str_match_all("abc", "a([a-z])c")
# [[1]]
# [,1] [,2]
# [1,] "abc" "b"
Suppose I only want to replace the group, and not the full match---in this case the letter b. The following would, however, replace the full match.
str_replace_all("abc", "a([a-z])c", "z")
[1] "z"
# Desired result: "azc"
Would there be any good ways to replace only the capture group? Suppose I wanted to do multiple matches.
str_match_all("abcdef", "a([a-z])c|d([a-z])f")
# [[1]]
# [,1] [,2] [,3]
# [1,] "abc" "b" NA
# [2,] "def" NA "e"
str_replace_all("abcdef", "a([a-z])c|d([a-z])f", "z")
# [1] "zz"
# Desired result: "azcdzf"
Matching groups was easy enough, but I haven't found a solution when a replacement is desired.
It is not the way regex was designed. Capturing is a mechanism to get the parts of strings you need; when replacing, it is used to keep parts of matches, not to discard them.
Thus, a natural solution is to wrap what you need to keep with capturing groups.
In this case here, use
str_replace_all("abc", "(a)[a-z](c)", "\\1z\\2")
Or with lookarounds (if the lookbehind is a fixed/known width pattern):
str_replace_all("abc", "(?<=a)[a-z](?=c)", "z")
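The look-around version extends naturally to the multiple-alternation case from the question; a sketch using base R's gsub() (perl = TRUE is needed for the look-behind):

```r
# Each alternative keeps its surrounding letters out of the match,
# so only the middle letter gets replaced
gsub("(?<=a)[a-z](?=c)|(?<=d)[a-z](?=f)", "z", "abcdef", perl = TRUE)
# [1] "azcdzf"
```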
Usually when I want to replace a certain pattern of characters in a text/string I use the grep family of functions, that is, working with regular expressions.
You can use the sub function of the grep family to make replacements in strings.
Example:
sub("b","z","abc")
[1] "azc"
You may face more challenges working with replacements; for that, the grep family of functions offers a lot of functionality:
replacing everything except a and c:
sub("[^ac]+","z","abBbbbc")
[1] "azc"
replacing the first run of two consecutive b's (matching is case-sensitive, so the B is skipped):
sub("b{2}","z","abBbbbc")
[1] "abBzbc"
replacing all characters after the pattern:
sub("b.*","z","abc")
[1] "az"
the same as above, but requiring a non-c character at the end (no match in "abc", so the string is returned unchanged):
sub("b.*[^c]","z","abc")
[1] "abc"
And so on. You can search for "regular expressions in R using grep" on the internet and find many ways to work with regular expressions.
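One detail worth keeping in mind with this family: sub() replaces only the first match, while gsub() replaces every match, e.g.:

```r
sub("b", "z", "abcb")    # only the first b
# [1] "azcb"
gsub("b", "z", "abcb")   # every b
# [1] "azcz"
```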

Split string WITHOUT regex

I'm sure I used to know this, and I'm sure it is covered somewhere, but since I can't find any Google/SO hits for this title search there probably should be one.
I want to split a string without using regex, e.g.
str = "abcx*defx*ghi"
Of course we can use stringr::str_split or strsplit with argument 'x[*]', but how can we just suppress regex entirely?
The argument fixed=TRUE can be useful in this instance
strsplit(str, "x*", fixed=TRUE)[[1]]
#[1] "abc" "def" "ghi"
Since the question also mentions a stringr::str_split, a stringr way might be of help, too.
You may use str_split with fixed(<YOUR_DELIMITER_STRING_HERE>, ignore_case = FALSE) or coll(pattern, ignore_case = FALSE, locale = "en", ...). See the stringr docs:
fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.
coll Compare strings respecting standard collation rules
See the following R demo:
> str_split(str, fixed("x*"))
[[1]]
[1] "abc" "def" "ghi"
Collations are better illustrated with a letter that can have two representations:
> x <- c("Str1\u00e1Str2", "Str3a\u0301Str4")
> str_split(x, fixed("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3áStr4" ""
> str_split(x, coll("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3" "Str4"
A note about fixed():
fixed(x) only matches the exact sequence of bytes specified by x. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent.
...
coll(x) looks for a match to x using human-language collation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you’ll also need to supply a locale parameter.
Simply wrap the regex inside fixed() to stop it being treated as a regex inside stringr::str_split()
Example
Normally, stringr::str_split() will treat the pattern as a regular expression, meaning certain characters have special meanings, which can cause errors if those regular expressions are not valid, e.g.:
library(stringr)
str_split("abcdefg[[[klmnop", "[[[")
Error in stri_split_regex(string, pattern, n = n, simplify = simplify, :
Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)
But if we simply wrap the pattern we are splitting by inside fixed(), it treats it as a string literal rather than a regular expression:
str_split("abcdefg[[[klmnop", fixed("[[["))
[[1]]
[1] "abcdefg" "klmnop"
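For completeness, the base-R counterpart is strsplit() with fixed = TRUE, which likewise skips regex interpretation entirely:

```r
strsplit("abcdefg[[[klmnop", "[[[", fixed = TRUE)[[1]]
# [1] "abcdefg" "klmnop"
```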

Regular Expression: Exclude keywords from Email R

I have a text field that contains email addresses, for which I wrote the pattern below
library(stringr)
str_extract_all(Data, "([a-zA-Z0-9.-])+@([a-zA-Z0-9.-])")
It works perfectly and detects everything. However, I need to exclude emails from certain domains like gmail.com. For instance, I don't want emails that end with @gmail.com.
Using the not symbol (^), I should be able to achieve this; however, I have no idea why I get stuck after several attempts at adding ^gmail.com to my pattern.
Here are a couple obvious ways to go, starting from...
x = c("Espanta@gmail.com","Frank@notgmail.com","Jaap@gmail.com.com")
baddoms = c("gmail.com","yahoo.com")
filter first...
str_split_fixed(x[grep(paste0("@(",paste(baddoms,collapse="|"),")$"), x, invert=TRUE)], "@", 2)
# [,1] [,2]
# [1,] "Frank" "notgmail.com"
# [2,] "Jaap" "gmail.com.com"
... or filter afterwards ...
y = str_split_fixed(x, "@", 2)
y[!(y[,2] %in% baddoms),]
# [,1] [,2]
# [1,] "Frank" "notgmail.com"
# [2,] "Jaap" "gmail.com.com"
As far as code complexity and computational time go, the second approach is much better. One could argue that the first saves RAM, but I really doubt that would be a problem in practice.
The OP's idea of using ^gmail.com does not work because ^ has two uses in regex:
identifying the start of the string; and
negating characters inside a character class [^...].
To dodge entire strings, negative lookaheads and lookbehinds are handy, but I know of no way to (1) extract parts from a string and (2) filter results in a single step.

How to read a csv but separating only at first two comma separators?

I have a CSV file. I want to read the file in R but split each line only at the first two commas, i.e. if there is a line like this in the file,
1,1000,I, am done, with you
In R I want this to become a row of a dataframe with three columns, like this
> df <- data.frame("Id"="1","Count" ="1000", "Comment" = "I, am done, with you")
> df
Id Count Comment
1 1 1000 I, am done, with you
A regular expression will work.
For example, suppose str holds the rows you want to parse. Here, suppose your csv file looks like
1,1000,I, am done, with you
2,500, i don't know
If you want to read from file, just call readLines() to read all lines of the file as a character vector in R, just like str.
The technique is very simple. Here I use the {stringr} package to match the text and extract the information I need.
str <- c("1,1000,I, am done, with you", "2,500, i don't know")
library(stringr)
# match the strings by pattern integer,integer,anything
matches <- str_match(str,pattern="(\\d+),(\\d+),\\s*(.+)")
Here is a brief explanation of the pattern (\\d+),(\\d+),\\s*(.+). \\d represents a digit character, \\s represents a space character, and . represents anything. + means one or more, * means zero or more. () groups the patterns so that the function knows what we regard as a unit of information.
If you look at matches, it looks like
[,1] [,2] [,3] [,4]
[1,] "1,1000,I, am done, with you" "1" "1000" "I, am done, with you"
[2,] "2,500, i don't know" "2" "500" "i don't know"
Look, the str_match function successfully split the text by the pattern into a matrix. Then our work is only to transform the matrix into a data frame with the correct data types.
df <- data.frame(matches[,-1],stringsAsFactors=F)
colnames(df) <- c("Id","Count","Comment")
df <- transform(df,Id=as.integer(Id),Count=as.integer(Count))
df is our target:
Id Count Comment
1 1 1000 I, am done, with you
2 2 500 i don't know
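If you prefer to stay in base R, the same capture-group idea works with regexec() and regmatches() (a sketch using the sample rows from above):

```r
str <- c("1,1000,I, am done, with you", "2,500, i don't know")
# regexec() returns the full match plus each capture group
parts <- regmatches(str, regexec("(\\d+),(\\d+),\\s*(.+)", str))
df <- data.frame(Id      = as.integer(sapply(parts, `[`, 2)),
                 Count   = as.integer(sapply(parts, `[`, 3)),
                 Comment = sapply(parts, `[`, 4),
                 stringsAsFactors = FALSE)
df
#   Id Count              Comment
# 1  1  1000 I, am done, with you
# 2  2   500         i don't know
```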