I am new to the Unified Medical Language System (UMLS). I would like to annotate text from reports about endoscopy examinations, so the terminology is specific to gastroenterology. Some of the text contains acronyms like TI, which in this context means terminal ileum. However, according to UMLS, TI also stands for a number of other, non-gastroenterological terms. I would like to build a gastroenterology-only lexicon from UMLS terms. Is there a way to do this?
I need to generate n random sentences in R from an array of keywords or tokens, preferably using dplyr.
Each sentence may have a different number of words.
The array of words is:
(Helsinki, Town, big, pollution, a, much, Not, is, nice)
and the sentences should be formed so that the grammatical order of the words makes sense; they could be something like the following:
big Helsinki,
Helsinki is a nice town,
nice Helsinki
I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act) and import it into R. I want to break it up into two parts: first, the table of contents (pic 1); second, the information in the act itself (pic 2). But I do not want the French part (je suis désolé). I have tried using tabulizer's extract_area(), but I don't want to have to select the area by hand 90 times (I'm going to do this for multiple pieces of legislation).
Obviously I don't have a minimal reproducible example coded out... But the pdf is downloadable here: https://laws-lois.justice.gc.ca/eng/acts/F-27/
Option 2 is to write something to pull it out via XML, but I'm a bit less used to working with XML files. Unless it's incredibly annoying to do using either pdftools or tabulizer, I'd prefer the answer using one of those libraries (mostly for learning purposes).
I've seen some similar questions on Stack Overflow, but they're all confusingly written or designed for tables, which this is not. I am not a quant/data-science researcher by training, so an explanation would be super helpful (but not required).
Here's an option that reads in the PDF text and detects the language. You're probably going to have to do a lot of text cleanup after reading in the PDF; I'm assuming you don't care about retaining formatting.
library(pdftools)
a = pdf_text('F-27.pdf')
#split text to get sentence chunks, mostly.
b = sapply(a,strsplit,'\r\n')
#do a bunch of other text cleanup, here's an example using the third list element. You can expand this to cover all of b with a loop or list function like sapply.
#Two spaces should hopefully retain most sentence-like fragments, you can get more sophisticated:
d = strsplit(b[[3]], '  ')[[1]]
library(cld3) #language tool to detect french and english
x = sapply(d,detect_language)
#Keep only English
x[x=='en']
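To cover every page rather than just the third list element, the same cleanup and language filter can be wrapped in a list function. A rough sketch (all_frags, langs and english_only are just illustrative names):
# flatten all pages, split on runs of two spaces, keep English fragments only
all_frags <- unlist(lapply(b, function(page) unlist(strsplit(page, '  '))))
langs <- sapply(all_frags, detect_language)
english_only <- all_frags[!is.na(langs) & langs == 'en']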
I have a problem in R: I have a list of Spanish communities, and inside each community there is a list of towns/municipalities.
For example, this is a list of municipalities inside the community of Catalonia.
https://en.wikipedia.org/wiki/Municipalities_of_Catalonia
So Catalonia is one community, and within it there is a list of towns/cities that I would like to group under / assign a new value, 'Catalonia'.
I have a list of all the municipalities/towns/cities in my dataset and I would like to group them into communities such as Andalusia, Catalonia, Basque Country, Madrid, etc.
Firstly, how can I go about grouping these rows into the list of communities?
For example, el prat de llobregat is a municipality within Catalonia, so I would like to assign it to the region of Catalonia. Getafe is a municipality of Madrid, so I would like to assign it the value Madrid. Alicante is a municipality of Valencia, so I would like to assign it the value Valencia. And so on.
That was my first question and if you are able to help with just that, I would be very thankful.
However, my dataset is not that clean. I did my best to remove Spanish accents and unnecessary code identifiers from the municipality names, but some small errors remain. For example, castellbisbal is a municipality of Catalonia, yet some entries contain very small spelling mistakes, e.g. one 'l' instead of two (castelbisbal).
These are small human errors; is there a way I can work around them?
I was thinking of building a vector of all correctly spelt names and then renaming the incorrectly spelt names based on a percentage of incorrectness; could this work? For instance, castellbisbal is 13 characters long and the error is 1 character, which is less than an 8% error rate. Can I rename values based on an error rate?
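To make the idea concrete, here is a rough sketch of what I had in mind, assuming the stringdist package and a made-up vector of correctly spelt names (I am not sure this is the right approach):
library(stringdist)
correct  <- c("castellbisbal", "getafe", "alicante")   # made-up canonical names
observed <- c("castelbisbal", "getafe", "allicante")   # names as they appear in my data
# edit distance to the closest canonical name, as a fraction of that name's length
d <- stringdistmatrix(observed, correct, method = "lv")
best <- apply(d, 1, which.min)
error_rate <- d[cbind(seq_along(observed), best)] / nchar(correct[best])
# rename only when the relative error is below 8%
cleaned <- ifelse(error_rate < 0.08, correct[best], observed)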
Do you have any suggestions on how I can proceed with the second part?
Any tips/suggestions would be great.
As for the spelling errors, have you tried the soundex algorithm? It was meant for that and at least two R packages implement it.
library(stringdist)
phonetic("barradas")
[1] "B632"
phonetic("baradas")
[1] "B632"
And the soundex codes for the same words are the same with the phonics package.
library(phonics)
soundex("barradas")
[1] "B632"
soundex("baradas")
[1] "B632"
All you would have to do is compare the soundex codes, not the words themselves. Note that soundex was designed for the English language, so it can only handle English characters, not accented ones. But you say you are already taking care of those, so it might work with the words you have to process.
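For instance, a minimal sketch of that comparison, assuming you keep a vector of correctly spelt municipality names (the names below are made up for illustration):
library(stringdist)
correct  <- c("castellbisbal", "getafe", "alicante")   # correctly spelt reference names
observed <- c("castelbisbal", "getafe", "alicante")    # names as they appear in the data
# compare soundex codes rather than the raw strings
idx <- match(phonetic(observed), phonetic(correct))
fixed <- ifelse(is.na(idx), observed, correct[idx])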
I'm working on a system to analyze texts in English: I use Stanford CoreNLP to split whole documents into sentences and sentences into tokens. I also use the MaxEnt tagger to get the tokens' POS tags.
Now, considering that I use this corpus to build a supervised classifier, it would be good if I could replace any word like 're, 's, havin, sayin', etc. with its standard form (are, is, having, saying). I've been searching for an English dictionary file, but I don't know how to use it. There are so many distinct cases to consider that I don't think it's an easy task to tackle: is there some similar work or whole project that I could use?
Ideas:
I) Use string edit distance on a subset of your text: try to match words that do not exist in the dictionary against existing dictionary words by edit distance (see the sketch after this list).
II) The key feature of many of your examples is that they are only one character away from the correct spelling. So, for the words that you fail to match with a dictionary entry, try adding every English character to the front or back and look up the resulting word in the dictionary. This is expensive at first, but if you keep track of those misspellings in a lookup table (re -> are), at some point you will have 99.99% of the common misspellings (or whatever you call them) in your lookup table with their correct spelling.
III) Train a word-level 2-gram or 3-gram language model on proper, clean English text (e.g. newspaper articles), then run it over your entire corpus and, for the words the language model treats as unknown (meaning it has not seen them during training), check the most probable word according to the model. Most probably the model's top-10 predictions will contain the correctly spelled word.
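Here is a rough sketch of ideas I and II, written in R for brevity; the dictionary and lookup table are tiny made-up stand-ins for the real resources you would use:
# tiny stand-in dictionary and lookup table; real ones would be much larger
dictionary <- c("are", "is", "having", "saying", "the")
lookup     <- c("re" = "are", "'s" = "is")   # known nonstandard -> standard forms
normalize <- function(word) {
  if (word %in% dictionary) return(word)                # already standard
  if (word %in% names(lookup)) return(lookup[[word]])   # known misspelling (idea II)
  d <- adist(word, dictionary)                          # otherwise: closest dictionary word by edit distance (idea I)
  dictionary[which.min(d)]
}
sapply(c("havin", "sayin", "re", "is"), normalize)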
I retrieved an XML file from a site using this code:
library(XML)
abstract <- xmlParse(file = 'http://ieeexplore.ieee.org/gateway/ipsSearch.jsp?querytext=%28systematic%20review%20OR%20systematic%20literature%20review%20AND%20text%20mining%20techniques%29&pys=2009&&hc=1000', isURL = T)
The returned XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<totalfound>40420</totalfound>
<totalsearched>3735435</totalsearched>
<document>
<rank>1</rank>
<title><![CDATA[Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics]]></title>
<authors><![CDATA[Ghose, A.; Ipeirotis, P.G.]]></authors>
<affiliations><![CDATA[Dept. of Inf., Oper., & Manage. Sci., New York Univ., New York, NY, USA]]></affiliations>
<controlledterms>
<term><![CDATA[Internet]]></term>
<term><![CDATA[data mining]]></term>
<term><![CDATA[electronic commerce]]></term>
<term><![CDATA[pattern classification]]></term>
</controlledterms>
<thesaurusterms>
<term><![CDATA[Communities]]></term>
<term><![CDATA[Economics]]></term>
<term><![CDATA[History]]></term>
<term><![CDATA[Marketing and sales]]></term>
<term><![CDATA[Measurement]]></term>
</thesaurusterms>
<pubtitle><![CDATA[Knowledge and Data Engineering, IEEE Transactions on]]></pubtitle>
<punumber><![CDATA[69]]></punumber>
<pubtype><![CDATA[Journals & Magazines]]></pubtype>
<publisher><![CDATA[IEEE]]></publisher>
<volume><![CDATA[23]]></volume>
<issue><![CDATA[10]]></issue>
<py><![CDATA[2011]]></py>
<spage><![CDATA[1498]]></spage>
<epage><![CDATA[1512]]></epage>
<abstract><![CDATA[With the rapid growth of the Internet, the ability of users to create and publish content has created active electronic communities that provide a wealth of product information. However, the high volume of reviews that are typically published for a single product makes harder for individuals as well as manufacturers to locate the best reviews and understand the true underlying quality of a product. In this paper, we reexamine the impact of reviews on economic outcomes like product sales and see how different factors affect social outcomes such as their perceived usefulness. Our approach explores multiple aspects of review text, such as subjectivity levels, various measures of readability and extent of spelling errors to identify important text-based features. In addition, we also examine multiple reviewer-level features such as average usefulness of past reviews and the self-disclosed identity measures of reviewers that are displayed next to a review. Our econometric analysis reveals that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. Reviews that have a mixture of objective, and highly subjective sentences are negatively associated with product sales, compared to reviews that tend to include only subjective or only objective information. However, such reviews are rated more informative (or helpful) by other users. By using Random Forest-based classifiers, we show that we can accurately predict the impact of reviews on sales and their perceived usefulness. We examine the relative importance of the three broad feature categories: “reviewer-related” features, “review subjectivity” features, and “review readability” features, and find that using any of the three feature sets results in a statistically equivalent performance as in the case of using all available features. This paper is the first study that integrates eco- - nometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their helpfulness and economic impact.]]></abstract>
<issn><![CDATA[1041-4347]]></issn>
<htmlFlag><![CDATA[1]]></htmlFlag>
<arnumber><![CDATA[5590249]]></arnumber>
<doi><![CDATA[10.1109/TKDE.2010.188]]></doi>
<publicationId><![CDATA[5590249]]></publicationId>
<partnum><![CDATA[5590249]]></partnum>
<mdurl><![CDATA[http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=5590249&contentType=Journals+%26+Magazines]]></mdurl>
<pdf><![CDATA[http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5590249]]></pdf>
</document>
I want to extract each title and match it with its authors. I used xpathSApply and getNodeSet on "//title" and "//authors":
getNodeSet(abstract, "//title")
getNodeSet(abstract, "//authors")
titlenodes <- xpathSApply(abstract, "//title")
Then I discovered that some documents have no title. So if I extract them separately, it will be impossible to match each title to its corresponding authors. I need a way to detect which documents have no title and, for those, pick only the authors, returning NA for the title.
Consider importing all of the XML content into a data frame off the parent node, document. That way you can see which rows have missing titles and/or authors.
xmldf <- xmlToDataFrame(nodes = getNodeSet(abstract, "//document"))
# subset data frame of only title and author (to see NAs)
titleauthorsdf <- xmldf[, c("title", "authors")]
# character vector of authors with no titles
notitleauthorslist <- c(xmldf$authors[is.na(xmldf$title)])
If all you want is a list of authors where there is no title, you can do it this way:
xpathSApply(abstract,"//document[not(title)]/authors", xmlValue)
# [1] "Armstrong, R.; Baillie, C.; Cumming-Potvin, W." "Stede, M."
# [3] "Government Documents" "Piotrowski, M."
# ...
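Alternatively, if you want the title and authors kept aligned per document (with NA where the title is missing), something along these lines should also work; the get_field helper is just for illustration:
docs <- getNodeSet(abstract, "//document")
# extract a child value from one document node, or NA if the child is absent
get_field <- function(node, path) {
  val <- xpathSApply(node, path, xmlValue)
  if (length(val) == 0) NA_character_ else val
}
titles  <- sapply(docs, get_field, "./title")
authors <- sapply(docs, get_field, "./authors")
titleauthors <- data.frame(title = titles, authors = authors, stringsAsFactors = FALSE)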