Comparing strings between a result set and a correct set - string-matching

I'm working on an algorithm to extract keywords from a text. I have a test set of scientific abstracts with their tags (keywords). My question is: what is the best way to compare the correct tags with the tags my algorithm produces?
Should I compare them strictly, e.g.
if (correct_tag == result_tag)
...or do a similarity check? Sometimes I get something like the following.
For one document:
**correct_tag** = ["eigenvalues and eigenfunctions in quantum mechanics"]
**result_tag** = ["eigenvalues and eigenfunctions"]
For another document:
**correct_tag** = ["cardiovascular system"]
**result_tag** = ["cardiovascular physiology", "cardiovascular system"]
NOTE: These are in-text tags, meaning they are extracted from the text itself.
Any help is appreciated, thanks.
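One possible way to frame the comparison is to score each document with precision and recall, and to make "match" a tunable notion rather than a strict equality test. The sketch below is only an illustration, not an established evaluation method for this task: the function names (normalize, tokenJaccard, scoreTags) and the threshold parameter are hypothetical, and word-level Jaccard similarity is just one of several similarity measures that could be swapped in. A threshold of 1 reproduces the strict comparison; a lower threshold accepts partial matches such as the eigenvalues example above.

```javascript
// Lowercase, trim and collapse whitespace so that trivial differences do not count.
function normalize(tag) {
  return tag.toLowerCase().trim().replace(/\s+/g, " ");
}

// Word-level Jaccard similarity: shared words / all distinct words (assumed measure).
function tokenJaccard(a, b) {
  var setA = new Set(normalize(a).split(" ").filter(Boolean));
  var setB = new Set(normalize(b).split(" ").filter(Boolean));
  var shared = 0;
  setA.forEach(function (word) {
    if (setB.has(word)) shared++;
  });
  var union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// A tag counts as matched if some tag on the other side reaches the threshold.
function scoreTags(correctTags, resultTags, threshold) {
  var foundCorrect = correctTags.filter(function (c) {
    return resultTags.some(function (r) { return tokenJaccard(c, r) >= threshold; });
  });
  var goodResults = resultTags.filter(function (r) {
    return correctTags.some(function (c) { return tokenJaccard(c, r) >= threshold; });
  });
  return {
    precision: resultTags.length ? goodResults.length / resultTags.length : 0,
    recall: correctTags.length ? foundCorrect.length / correctTags.length : 0
  };
}

// Usage with the two documents from the question.
var doc1 = scoreTags(
  ["eigenvalues and eigenfunctions in quantum mechanics"],
  ["eigenvalues and eigenfunctions"],
  0.5
);
var doc2 = scoreTags(
  ["cardiovascular system"],
  ["cardiovascular physiology", "cardiovascular system"],
  1
);
console.log(doc1, doc2);
```

Reporting both the strict score (threshold 1) and a relaxed score side by side makes it visible how much of the error comes from near misses rather than from genuinely wrong tags.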

How to replace comma with a dot in GTM for JSON structured data?

I am new to structured data implementation and have no coding knowledge.
I have spent a week trying to resolve a warning about the price in Google's structured data testing tool.
My prices use a comma as the decimal separator, which Google does not accept.
The documentation at http://schema.org/price says: "Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator."
I have a CSS element #PdtPrixRef stored in a GTM variable named "Product-price". Its value uses a comma ("12,5"), and I can't find how to replace it in my structured data with the value "12.5". Can someone help me?
Here is my current script:
(image: my actual GTM script)
Should I add something to my script, or create a variable (Custom JS)?
I think it's something like
value.replace(",", ".")
But I don't know how to write the full function from beginning to end...
Yes, you can just create a Custom JavaScript Variable.
Here is the code:
function() {
  var price = {{Product-price}};
  return price.replace(",", ".");
}
Then use this variable in your JSON-LD script.
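If the source variable can be empty or is not always a string, a slightly more defensive version may be worth considering. The following is a minimal sketch written as a standalone function so it can be run outside GTM; the name normalizePrice is hypothetical, and inside a Custom JavaScript Variable the same body would sit in an anonymous function that reads {{Product-price}}. Note that replace with a string pattern only swaps the first comma, which is enough for a value like "12,5" but would not handle thousands separators.

```javascript
// Hypothetical helper: converts a price such as "12,5" to "12.5".
function normalizePrice(raw) {
  // Leave missing values untouched so the tag does not throw on pages without a price.
  if (raw === undefined || raw === null) {
    return undefined;
  }
  // String() guards against numeric input; trim() removes stray whitespace from the DOM text.
  return String(raw).trim().replace(",", ".");
}

// Usage outside GTM:
console.log(normalizePrice("12,5"));   // prints 12.5
console.log(normalizePrice(" 12,5 ")); // prints 12.5
```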

SimpleDom Search Via plaintext Text

I am using the "PHP Simple HTML DOM Parser" library and want to find elements based on their text value (plaintext).
For example, I need to find a span element by its value "Red".
<span class="color">Red</span>
I was expecting the code below to work, but it seems to replace the value instead of searching for it.
$brand = $html->find('span',0)->plaintext='Red';
I read the manual and also looked through the library code itself, but was not able to find a solution. Please advise if I am missing something, or if this is simply not possible with Simple HTML DOM Parser.
P.S.
Please note that I am aware of other approaches such as regex.
Using $html->find('span', 0) returns the nth span (zero-based); here n is 0, so you get the first span.
Using $html->find('span',0)->plaintext='Red'; is an assignment, so it sets the plaintext of that span to Red.
If you want to find the elements whose text is Red, you could omit the 0 to get all the spans and loop over them.
For example, using innertext instead of plaintext:
$spansWithRedText = [];
foreach ($html->find('span') as $element) {
    // Keep only the spans whose inner text is exactly "Red"
    if ($element->innertext === "Red") {
        $spansWithRedText[] = $element;
    }
}

Use substr with start and stop words, instead of integers

I want to extract information from downloaded HTML code. The HTML code is given as a string. The required information is stored between specific HTML expressions. For example, if I want every headline in the string, I have to search for "H1>" and "/H1>" and take the text between them.
So far I have used substr(), but I had to calculate the positions of "H1>" and "/H1>" first.
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])
The output is correct, but calculating every single start and stop position is a lot of work. Instead, I am looking for a function similar to substr() that accepts start and stop words instead of positions. For example, like this:
function(htmlcode, startword = "H1>", stopword = "/H1>")
I'd agree that using a package built for HTML processing is probably the best way to handle the example you give. However, one potential way to extract a substring based on delimiter strings is the following.
Step 1: Define a simple function that returns the positions of a character string within a string. In this example I am only using fixed character strings (no regular expressions).
strpos_fixed <- function(string, char) {
  # Positions of every fixed (non-regex) occurrence of char in string
  a <- gregexpr(char, string, fixed = TRUE)
  b <- a[[1]][1:length(a[[1]])]
  return(b)
}
Step 2: Define your new sub-string function using the strpos_fixed() function you just defined
char_substr <- function(string, start, stop) {
  x <- strpos_fixed(string, start) + nchar(start)  # first character after each start word
  y <- strpos_fixed(string, stop) - 1              # last character before each stop word
  z <- cbind(x, y)
  apply(z, 1, function(x) { substr(string, x[1], x[2]) })
}
Step 3: Test
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")
You have two options here. First, use a package that has been developed explicitly for parsing HTML structures, e.g., rvest. There are a number of tutorials online.
Second, for edge cases where you need to extract from strings that are not necessarily well-formed HTML, you can use regular expressions. One of the simpler implementations comes from stringr::str_match:
library(stringr)
# 1. the parentheses define regex groups
# 2. ".*?" means any characters, non-greedy
# 3. so together we are matching the expression <H1>some text or characters of any length</H1>
str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
This yields a matrix where the columns are (in order) the fully matched string followed by each regex group we specified. To get the text between the <H1> tags, pull the second group (the 3rd column). Note that str_match() returns only the first match in each string; str_match_all() returns every match.

R Using Regex to find a word after a pattern

I'm grabbing the following page and storing it in R with the following code:
gQuery <- getURL("https://www.google.com/#q=mcdimalds")
Within this, there's the following snippet of code:
Showing results for</span> <a class="spell" href="/search?rlz=1C1CHZL_enUS743US743&q=mcdonalds&spell=1&sa=X&ved=0ahUKEwj9koqPx_TTAhUKLSYKHRWfDlYQvwUIIygA"><b><i>mcdonalds</i></b></a>
Everything other than "Showing results for" and the italics tags enclosing the name I want to extract is subject to change from query to query.
What I want to do is use a regex to extract the mcdonalds that occurs here: <b><i>mcdonalds</i>, i.e. the second instance of mcdonalds in the string. However, I'm not sure how to write the regex to do so.
Any help accomplishing this would be greatly appreciated. As always, please let me know if any additional information should be added to clarify the question.

HTML scraping - R scrapR

I am trying to parse data encoded in HTML format. An example of the string I am trying to parse is:
Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />
I want to get the text before <img and the text in the alt= attribute.
Desired output:
Simplify the polynomial by combining like terms. 3x+12-11x+14
I tried scrapeR:
y1 = scrape (str1) # the above string is in str1 (as a vector)
I get the following error message:
Error in which(value == defs) :
argument "code" is missing, with no default
Has anyone worked with scrapeR? I am not sure what "code" refers to, as it is an option and is not described in the manual. I am just trying to see which default value is causing this.
Here's one way to extract that information:
str1<-"Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />"
library(scrapeR)
y<-scrape(object="str1")[[1]] # pass the object name as a string; just get the first result
pretext <- sapply(xpathSApply(y, "//img/preceding::text()"), xmlValue) # text before the <img>
alttext <- xpathSApply(y, "//img/@alt") # the alt attribute of the <img>
paste(pretext, alttext)
#[1] "Simplify the polynomial by combining like terms. 3x+12-11x+14"
scrape() returns an HTML/XML-like document that you can work with using functions like xpathSApply to find nodes and extract values.
