Use substr with start and stop words, instead of integers - r

I want to extract information from downloaded html-Code. The html-Code is given as a string. The required information is stored inbetween specific html-expressions. For example, if I want to have every headline in the string, I have to search for "H1>" and "/H1>" and the text between these html expressions.
So far, I used substr(), but I had to calculate the position of "H1>" and "/H1>" first.
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])
The output is correct, but to calculate every single start and stopposition is a lot of work. Instead I search for a similar function like substr (), where you can use start and stop words instead of the position. For example like this:
function(htmlcode, startword = "H1>", stopword = "/H1>")

I'd agree that using a package built for html processing is probably the best way to handle the example you give. However, one potential way to sub-string a string based on character values would be to do the following.
Step 1: Define a simple function to return to position of a character in a string, in this example I am only using fixed character strings.
strpos_fixed=function(string,char){
a<-gregexpr(char,string,fixed=T)
b<-a[[1]][1:length(a[[1]])]
return(b)
}
Step 2: Define your new sub-string function using the strpos_fixed() function you just defined
char_substr<-function(string,start,stop){
x<-strpos_fixed(string,start)+nchar(start)
y<-strpos_fixed(string,stop)-1
z<-cbind(x,y)
apply(z,1,function(x){substr(string,x[1],x[2])})
}
Step 3: Test
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")

You have two options here. First, use a package that has been developed explicitly for the parsing of HTML structures, e.g., rvest. There are a number of tutorials online.
Second, for edge cases where you may need to extract from strings that are not necessarily well-formatted HTML you should use regular expressions. One of the simpler implementations for this comes from stringr::str_match:
# 1. the parenthesis define regex groups
# 2. ".*?" means any character, non-greedy
# 3. so together we are matching the expression <H1>some text or characters of any length</H1>
str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
This will yield a matrix where the columns are (in order) the fully matched string followed by each independent regex group we specified. You would just want to pull the second group in this case if you want whatever text is between the <H1> tags (3rd column).

Related

Extracting a specific identifier from a column containing excess information

I'm trying to use stringr/dplyr to extract a pathway name from a table cell containing excess information. All cells in this table follow the same general format. Some examples are:
(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.
(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.
3,4',5-trihydroxystilbene biosynthesis
From these examples, I want to extract "(R)-lactate from methylglyoxal", "(S)-dihydroorotate from bicarbonate", and "3,4',5-trihydroxystilbene biosynthesis" respectively. I'm struggling to figure out which combination of regular expressions to use in order to accomplish this. I've been trying to use the positive look behind assertion ?<=... along with str_extract to extract all information preceding the first ":", but I can't get it to work. Any help would be appreciated!
please try the following pattern:
(?<=^)(.+?)(:|$)
(?<=^) the first part is looking exclusively at the beginning of the sentence
(.+?)(:|$) the second part is looking for at least one character before first ":" or end of sentence
enter image description here
You don't need any lookarounds, you can match the values using:
^[^\r\n:]+
The pattern matches:
^ Start of string
[^\r\n:]+ Match 1+ chars other than newlines or :
Regex demo
library(stringr)
s <- c("(R)-lactate from methylglyoxal: step 1/2. {ECO:0000256|ARBA:ARBA00005008, ECO:0000256|RuleBase:RU361179}.",
"(S)-dihydroorotate from bicarbonate: step 3/3. {ECO:0000256|ARBA:ARBA00004880}.",
"3,4',5-trihydroxystilbene biosynthesis")
str_extract(s, "^[^\\r\\n:]+")
Output
[1] "(R)-lactate from methylglyoxal"
[2] "(S)-dihydroorotate from bicarbonate"
[3] "3,4',5-trihydroxystilbene biosynthesis"

How to extract a substring from main string starting from valid uuid using lua

I have a main string as below
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
From the main string i need to extract a substring starting from the uuid part
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
I tried
string.match("/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/", "/[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}/(.)/(.)/$"
But noluck.
if you want to obtain
"/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
from
"/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
or let's say 7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0, output and 9999.317528060546245771146821638997525068657 as this is what your pattern attempt suggests. Otherwise leave out the parenthesis in the following solution.
You can use a pattern like this:
local text = "/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(text:match("/([%x%-]+)/([^/]+)/([^/]+)"))
"/([^/]+)/" captures at least one non-slash-character between two slashs.
On your attempt:
You cannot give counts like {4} in a string pattern.
You have to escape - with % as it is a magic character.
(.) would only capture a single character.
Please read the Lua manual to find out what you did wrong and how to use string patterns properly.
Try also the code
s="/tmp/xjtscpdownload/7eb17cc6-b3c9-4ebd-945b-c0e0656a33f0/output/9999.317528060546245771146821638997525068657/"
print(s:match("/.-/.-(/.+)$"))
It skips the first two "fields" by using a non-greedy match.

How do I extract a section number and the text after it?

I have a question.
My text file contains lines such as:
1.1        Description.
This is the description.
1.1.1      Quality Assurance
Random sentence.
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1        Description
1.1.1      Quality Assurance
1.6.1    Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2<-grep("^[0-9.]+", txt1, value = TRUE)
file<-write(txt2, "text3.txt")
which results in:
1.1        Description.
1.1.1      Quality Assurance
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means, that if your regular expression matches anything in the line, the all line will be returned. You managed to build your regular expression to match numbers in the begining of the line. So all the lines which begin with numbers get selected.
It seems that your goal is not to select the all line, but to select only until there is a line break or a period.
So, you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It selects numbers with dots, followed by a space, followed by anything, and stops matching things when a . comes or the line ends. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting from the given lines only the matching portion, and not the all line where the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches:
txt1 <- readLines("text.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))

How to write a regex OR statement within strapply in R

I have been using strapplyc in R to select different portions of a string that match one particular set of criteria. These have worked successfully until I found a portion of the string where the required portion could be defined one of two ways.
Here is an example of the string which is liberally sprinkled with \t:
\t\t\tsome words here\t\t\tDefect: some more words here Action: more words
I can write the strapply statement to capture the text between Defect: and the start of Action:
strapplyc(record[i], "Defect:(.*?)Action")
This works and selects the chosen text between Defect: and Action. In some cases there is no action section to the string and I've used the following code to capture these cases.
strapplyc(record[i], "Defect:(.*?)$")
What I have been trying to do is capture the text that either ends with Action, or with the end of the string (using $).
This is the bit that keeps failing. It returns nothing for either option. Here is my failing code:
strapplyc(record[i], "Defect:(.*?)Action|$")
Any idea where I'm going wrong, or a better solution would be much appreciated.
If you are up for a more efficient solution, you could drop the .*? matching and unroll your pattern like:
Defect:((?:[^A]+|A(?!ction))*)
This matches Defect: followed by any amount of characters that are not an A or are an A and not followed by ction. This avoids the expanding that is needed for the lazy dot matching. It will work for both ways, as it does stop matching when it hits Action or the end of your string.
As suggested by Wiktor, you can also use
Defect:([^A]*(?:A(?!ction)[^A]*)*)
Which is a little bit faster when there are many As in the string.
You might want to consider to use A(?!ction:) or A(?!ction\s*:), to avoid false early matches.
The alternation operator | is the regex operator with the lowest precedence. That means the regex Defect:(.*?)Action|$ is actually a combination of Defect:(.*?)Action and $ - since an empty string is a valid match for $, your regex returns the empty string.
To solve that, you should combine the regexes Defect:(.*?)Action and Defect:(.*?)$ with an OR:
Defect:(.*?)Action|Defect:(.*?)$
Or you can enclose Action|$ in a group as Sebastian Proske said in the comments:
Defect:(.*?)(?:Action|$)

How to remove the words that are contain within some tags in R

Suppose A is a data frame and structure of A is as follows
Row no C1
1 <p>I'd like to check if an uploaded file is </p>
2 <p>Is there a way to</p>
3 <p>I am import matlab file and construct</p> <pre><code>Error in model.frame.default(formula = expert_data_frame$t_labels ~ .,</code></pre>
For the column C1 what I am doing is using the tm package I am turning the rows to corpus and then using the different function like removewhitespace, removesopwords. But how to remove the words withing a specific tags. In the above example I want to remove the words that are within the (code)--(/code) tags but unable to do so.
The correct answer is to use an HTML parser. That requires more explanation. You can also get this done in an incorrect way with the qdap package:
library(qdap)
genX(A$C1, "<code>", "</code>")
## [1] "<p>I'd like to check if an uploaded file is </p>"
## [2] "<p>Is there a way to</p>"
## [3] "<p>I am import matlab file and construct</p> <pre></pre>"
At a pinch, you could do:
A$C1 <- gsub('<code>.*?</code>', '', A$C1)
However, there are many caveats to parsing HTML with regular expressions.
For example, if I had the a string ' # this is a tag ', the last ' tag ' would not be stripped.
If I adjusted the regex to use .* instead of .*? to get around this, the string ' some code and some text and some more code ' would have everything stripped from it, even the (legitimate) text between the two code blocks.
What it boils down to is what you know about A$C1. Can you rely on it to not have more than one code block in one string (or more than one occurence of </code>)? Then use <code>.*</code>. Can you rely on the string '' never appearing within a code block? then use <code>.*?</code>.
If you really want to be sure, you can actually parse the XML with the XML package (can you rely on the contents of A$C1 to be well-formed HTML, i.e. no missing tags?).

Resources