Arranging text lines in R

My data is in the format below. It's a text file and the class is "character". I have posted a few lines from the file; there are about 14000 lines.
"KEY: Aback"
"SYN: Backwards, rearwards, aft, abaft, astern, behind, back."
"ANT: Onwards, forwards, ahead, before, afront, beyond, afore."
"KEY: Abandon"
"SYN: Leave, forsake, desert, renounce, cease, relinquish,"
"discontinue, castoff, resign, retire, quit, forego, forswear,"
"depart_from, vacate, surrender, abjure, repudiate."
"ANT: Pursue, prosecute, undertake, seek, court, cherish, favor,"
"protect, claim, maintain, defend, advocate, retain, support, uphold,"
"occupy, haunt, hold, assert, vindicate, keep."
Lines 6 and 7 are the continuation of line 5, and lines 9 and 10 are the continuation of line 8. My struggle is how to bring lines 6 and 7 up into line 5 and, similarly, lines 9 and 10 up into line 8.
Any hints gratefully received.

First thing that comes to mind (your text is stored as x):
# prefix each line starter (identified by the pattern `CAPS:`) with a newline (\n)
strsplit(gsub("([A-Z]+:)", "\n\\1", paste(x, collapse = " ")),
split = "\n")[[1L]][-1L]
# [1] "KEY: Aback "
# [2] "SYN: Backwards, rearwards, aft, abaft, astern, behind, back. "
# [3] "ANT: Onwards, forwards, ahead, before, afront, beyond, afore. "
# [4] "KEY: Abandon "
# [5] "SYN: Leave, forsake, desert, renounce, cease, relinquish, discontinue, castoff, resign, retire, quit, forego, forswear, depart_from, vacate, surrender, abjure, repudiate. "
# [6] "ANT: Pursue, prosecute, undertake, seek, court, cherish, favor, protect, claim, maintain, defend, advocate, retain, support, uphold, occupy, haunt, hold, assert, vindicate, keep."
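Putting it together, here is a hedged end-to-end sketch of the same approach applied to the file itself (the file names are hypothetical):
x <- readLines("synonyms.txt")                        # ~14000 lines, read as a character vector
joined <- paste(x, collapse = " ")                    # collapse everything into one long string
joined <- gsub("([A-Z]+:)", "\n\\1", joined)          # newline before each KEY:/SYN:/ANT: marker
entries <- strsplit(joined, split = "\n")[[1L]][-1L]  # split back; drop the leading empty element
entries <- trimws(entries)                            # optional: drop trailing spaces
writeLines(entries, "synonyms_joined.txt")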

How to allow a space into a wildcard?

Let's say I have this sentence:
text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")
When I write this (kwic is a quanteda function):
kwic(text,phrase("great* cake*"))
I get
[text1, 7:8] want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very | great cakes | but I want to find
However, when I do
kwic(text,phrase("great*cake*"))
I get a kwic object with 0 rows, i.e. nothing.
I would like to know what exactly the * matches and, more importantly, how to "allow" a space to be taken into account in the wildcard.
To answer what the * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype. In short, * matches any number of any characters, including none. Note that this is very different from its use in a regular expression, where it means "match zero or more of the preceding character".
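As a side note (not from the original answer), base R's utils::glob2rx() converts a glob into the equivalent regular expression, which makes the semantics easy to test; remember, though, that kwic() applies the glob to each token, not to the whole string:
pat <- utils::glob2rx("great* cake*")   # translate the glob into a regular expression
grepl(pat, c("greatest cake", "great cakes", "great cup"))
# [1]  TRUE  TRUE FALSE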
The pattern argument in kwic() matches one pattern per token, after tokenizing the text. Even wrapped in the phrase() function, it still only considers sequences of matches against tokens. So you cannot match the whitespace (which defines the boundaries between tokens) unless you actually include it inside the tokens' values themselves.
How could you do that? Like this:
toksbi <- tokens(text, ngrams = 2, concatenator = " ")
toksbi
# tokens from 1 document.
# text1 :
# [1] "I want" "want to" "to find" "find both" "both the"
# [6] "the greatest" "greatest cake" "cake of" "of the" "the world"
# [11] "world but" "but also" "also some" "some very" "very great"
# [16] "great cakes" "cakes but" "but I" "I want" "want to"
# [21] "to find" "find this" "this last" "last part" "part :"
# [26] ": isn't" "isn't it"
kwic(toksbi, "great*cake*", window = 2)
# [text1, 7] both the the greatest | greatest cake | cake of of the
# [text1, 16] some very very great | great cakes | cakes but but I
But your original usage of kwic(text, phrase("great* cake*")) is the recommended approach.

R - Regex whitespace pattern matches excessively

I have what seems like a simple regular expression I want to match on and replace. I'm processing a bunch of free-form text, and respondents have a variety of ways of denoting a line break. One is at least 4 sequential whitespace characters; another is a heavy dot. However, in R (perl=FALSE) I get some very strange behavior. The regex \\s{4,}|• replaces the whole string with one <br>; if I change the repetition to exactly 4 (\\s{4}|•), it returns 19 <br>s. If I remove the |•, it works fine. If I explicitly spell out the four whitespace characters alongside the heavy dot, \\s\\s\\s\\s+|•, it also works fine.
What is it about repeating \\s or checking for a heavy dot • that causes such erratic behavior?
x = "Call Narrative <br>11/15/2017 19:53:00 J574511 <br> <br>"
replacement = "<br>"
orig_pattern = "\\s{4,}|•"
alt1 = "\\s\\s\\s\\s+|•"
alt2 = "\\s{4,}"
alt3 = "\\s{4,}|<p>"
alt4 = "\\s{4}|•"
gsub(orig_pattern,replacement,x)
#> [1] "<br>"
gsub(alt1,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt2,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt3,replacement,x)
#> [1] "Call Narrative <br>11/15/2017<br>19:53:00<br>J574511 <br> <br>"
gsub(alt4,replacement,x)
#> [1] "<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>"
UPDATE
It seems to be associated with the OS: the problem occurs on Amazon Linux 2, but the same code works fine on Windows.
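One direction to investigate (a hedged sketch, not part of the original post): the pattern's declared encoding may differ between the two systems, which can change how the default (TRE) regex engine handles the multi-byte heavy dot. Checking and normalising the encodings is a cheap first step; whether it resolves the issue is an open question:
Encoding(orig_pattern)                    # is the pattern flagged as "UTF-8" or "unknown"?
pat_utf8 <- enc2utf8("\\s{4,}|\u2022")    # \u2022 is the heavy dot
gsub(pat_utf8, replacement, enc2utf8(x))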

Python operators of inequality

I have some code, not written by me. I am learning Python and am at the very basics.
The code is below:
ten_things = "Apples Oranges Crows Telephone Light Sugar"
print("Wait there are not 10 things in that list. Let's fix that.")
stuff = ten_things.split(' ')
more_stuff = ["Day", "Night", "Song", "Frisbee",
              "Corn", "Banana", "Girl", "Boy"]
while len(stuff) != 10:
    next_one = more_stuff.pop()
    print("Adding: ", next_one)
    stuff.append(next_one)
    print(f"There are {len(stuff)} items now.")
print("There we go: ", stuff)
print("Let's do some things with stuff.")
print(stuff[1])
print(stuff[-1]) # whoa! fancy
print(stuff.pop())
print(' '.join(stuff)) # what? cool!
print('#'.join(stuff[3:5])) # super stellar!
My question is: we use while len(stuff) != 10, but it only runs until the length reaches 10. Shouldn't it skip 10 and keep going until it has popped the whole list? Why does it stop at 10? Lengths of 11, 12, 13 and 14 are also not equal to 10, so it should include them too.
Ideally it seems it should be while len(stuff) >= 10:, but we used !=, which means "not equal".
Can anyone help me?
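A small analogue in R (the language of the surrounding posts), offered only to illustrate while-loop semantics: the condition is re-tested before every pass, so the body never runs again once the length reaches the target, and the length can never move past it.
stuff <- c("a", "b")
while (length(stuff) != 4) {
  stuff <- c(stuff, "x")   # condition is re-checked before each pass
}
length(stuff)
# [1] 4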

Lower Case Certain Words R

I need to convert certain words to lower case. I am working with a list of movie titles, where prepositions and articles are normally lower case if they are not the first word in the title. If I have the vector:
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
What I need is this:
movies_updated = c('The Kings of Summer', 'The Words', 'Out of the Furnace', 'Me and Earl and the Dying Girl')
Is there an elegant way to do this without using a long series of gsub(), as in:
movies_updated = gsub(' In ', ' in ', movies)
movies_updated = gsub(' In', ' in', movies_updated)
movies_updated = gsub(' Of ', ' of ', movies_updated)
movies_updated = gsub(' Of', ' of', movies_updated)
movies_updated = gsub(' The ', ' the ', movies_updated)
movies_updated = gsub(' The', ' the', movies_updated)
And so on.
In effect, it appears that you are interested in converting your text to title case. This can easily be achieved using the stringi package, as shown below:
> stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings Of Summer" "The Words" "Out Of The Furnace"
An alternative approach involves making use of the toTitleCase function available in the tools package:
> tools::toTitleCase(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings of Summer" "The Words" "Out of the Furnace"
Though I like @Konrad's answer for its succinctness, I'll offer an alternative that is more literal and manual.
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
           'Me And Earl And The Dying Girl')
gr <- gregexpr("(?<!^)\\b(of|in|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
mat <- regmatches(movies, gr)
regmatches(movies, gr) <- lapply(mat, tolower)
movies
# [1] "The Kings of Summer" "The Words"
# [3] "Out of the Furnace" "Me And Earl And the Dying Girl"
The tricks of the regular expression:
(?<!^) ensures we don't match a word at the beginning of the string. Without this, the first The of movies 1 and 2 would be down-cased.
\\b sets up word boundaries, so that "in" in the middle of "Dying" will not match. This is slightly more robust than your use of spaces, since hyphens, commas, etc. are not spaces but do indicate the beginning/end of a word.
(of|in|the) matches any one of of, in, or the. More patterns can be added with separating pipes |.
Once identified, it's as simple as replacing them with down-cased versions.
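For reuse, the approach above can be wrapped into a small helper (a hedged sketch; the function name is my own, and "and" is added to the word list so the fourth title matches the desired output):
lower_small_words <- function(x, words = c("of", "in", "the", "and")) {
  pat <- sprintf("(?<!^)\\b(%s)\\b", paste(words, collapse = "|"))
  gr  <- gregexpr(pat, x, ignore.case = TRUE, perl = TRUE)
  regmatches(x, gr) <- lapply(regmatches(x, gr), tolower)
  x
}
lower_small_words(movies)
# [1] "The Kings of Summer"            "The Words"
# [3] "Out of the Furnace"             "Me and Earl and the Dying Girl"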
Another example of how to turn certain words to lower case with gsub (with a PCRE regex):
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
gsub("(?!^)\\b(Of|In|The)\\b", "\\L\\1", movies, perl=TRUE)
Details:
(?!^) - not at the start of the string (it does not matter if we use a lookahead or lookbehind here since the pattern inside is a zero-width assertion)
\\b - find leading word boundary
(Of|In|The) - capture Of or In or The into Group 1
\\b - assure there is a trailing word boundary.
The replacement contains the lowercasing operator \L that turns all the chars in the first backreference value (the text captured into Group 1) to lower case.
Note that this can turn out to be a more flexible approach than using tools::toTitleCase. The part of that function's code that keeps specific words in lower case is:
## These should be lower case except at the beginning (and after :)
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"
If you only need to apply lowercasing and do not care about the function's other logic, it might be enough to add these alternatives (without the ^ and $ anchors) to the regex at the top of this answer, as sketched below.
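A hedged sketch of that suggestion, reusing the \L replacement from above (the trimmed-down word list is only illustrative):
small <- "a|an|and|as|at|but|by|for|in|nor|of|on|or|per|the|to|via"
gsub(paste0("(?!^)\\b(", small, ")\\b"), "\\L\\1", movies, perl = TRUE, ignore.case = TRUE)
# [1] "The Kings of Summer"            "The Words"
# [3] "Out of the Furnace"             "Me and Earl and the Dying Girl"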

Replace rogue double-quotes in vector in R

I have a broken CSV file with long text fields containing both double quotes and commas. I've been able to clean it up to some extent and now have tab-separated fields as a vector of whole lines (each value is a line).
head(temp, 2)
[1] "\"org_order\"\t\"organizations.api_path\"\t\"permalink\"\t\"api_path\"\t\"web_path\"\t\"name\"\t\"also_known_as\"\t\"short_description\"\t\"description\"\t\"profile_image_url\"\t\"primary_role\"\t\"role_company\"\t\"role_investor\"\t\"role_group\"\t\"role_school\"\t\"founded_on\"\t\"founded_on_trust_code\"\t\"is_closed\"\t\"closed_on\"\t\"closed_on_trust_code\"\t\"num_employees_min\"\t\"num_employees_max\"\t\"stock_exchange\"\t\"stock_symbol\"\t\"total_funding_usd\"\t\"number_of_investments\"\t\"homepage_url\"\t\"created_at\"\t\"updated_at\""
[2] "1\t\"organizations/care1st-health-plan-arizona\"\t\"care1st-health-plan-arizona\"\t\"organizations/care1st-health-plan-arizona\"\t\"organization/care1st-health-plan-arizona\"\t\"Care1st Health Plan Arizona\"\t\"\"\t\"Care1st Health Plan Arizona provides high quality health care services.\"\t\"Care1st is a health plan providing support and services to meet the health care needs of eligible members enrolled in KidsCare, AHCCCS, and DDD.\"\t\"http://public.crunchbase.com/t_api_images/v1475743278/m2teurxnhkwacygzdn2m.png\"\t\"company\"\t\"\"\t\"\"\t\"\"\t\"\"\t\"2003-01-01\"\t\"4\"\t\"FALSE\"\t\"\"\t\"0\"\t\"251\"\t\"500\"\t\"\"\t\"\"\t\"0\"\t\"0\"\t\"\"\t\"1475743348\"\t\"1475899305\""
I then write temp as a file and read it back (which I've found much faster than textConnection). However, read.table("temp", sep = "\t", quote = "\"", encoding = "UTF-8", colClasses = "character") chokes on certain lines and gives me messages such as:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec
= dec, : line 66951 did not have 29 elements
I think this is due to rogue double quotes, as in the following line (the rogue quote can be found immediately after "TripAdvisor de la sant?").
temp[66951]
[1] "67654\t\"organizations/docotop\"\t\"docotop\"\t\"organizations/docotop\"\t\"organization/docotop\"\t\"DOCOTOP\"\t\"\"\t\"Le 'TripAdvisor de la sant?\" est arriv?. Docotop permet de trouver le meilleur professionnel de sant?gr?e ?la communaut?de patients\"\t\"\"\t\"http://public.crunchbase.com/t_api_images/v1455271104/ry9lhcfezcmemoifp92h.png\"\t\"company\"\t\"TRUE\"\t\"\"\t\"\"\t\"\"\t\"2015-11-17\"\t\"7\"\t\"\"\t\"\"\t\"0\"\t\"1\"\t\"10\"\t\"EURONEXT\"\t\"\"\t\"0\"\t\"0\"\t\"http://docotop.com/\"\t\"1455271299\"\t\"1473443321\""
I propose to replace rogue double quotes with single quotes, but I have to leave the expected quotes in place. Quotes are expected right before or after a separator (tab), as well as at the beginning (first line only) and the end of a line. I've written the following attempt at a regex with lookarounds for tabs and line start/end, but it doesn't work:
temp <- gsub("(?<![^\t])\"(?![\t$])", "'", temp, perl = T)
EDIT: I tried @akrun's solution, but get:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec
= dec, : line 181 did not have 29 elements
The line in question (which didn't cause an error before):
temp[181]
[1] "198\torganizations/playfusion\tplayfusion\torganizations/playfusion\torganization/playfusion\tPlayFusion\t\tPlayFusion is a developer of computer games.\tPlayFusion is pioneering the next generation of connected interactive entertainment. PlayFusion's proprietary technology platform fuses video games, robotics, toys, and trans-media entertainment. The company is currently working on its own original IP to trail-blaze its vision ahead of opening its platform to others. PlayFusion is an independent, employee-owned company with offices in Cambridge and Derby in the UK, Douglas in the Isle of Man, and New York and San Francisco in the USA.\thttp://public.crunchbase.com/t_api_images/v1475688372/xnhrd4t254pxj6yxegzt.png\tcompany\t\t\t\t\t2015-01-01\t4\tFALSE\t\t0\t11\t50\t\t\t0\t0\thttp://playfusion.com/#intro\t1475688521\t1475899292"
Your (?<![^\t])"(?![\t$]) regex matches a " that is not preceded with a char other than a tab (so, there must be a tab or start of string before the "), and that is not followed with a tab or $ symbol.
This is because ^ and $ lose their anchor meaning inside character classes.
Replace the character classes with alternation groups:
gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl=TRUE)
The (?<!\t|^) lookbehind requires that the " is not at the start of the string and is not preceded with a tab.
The (?!\t|$) lookahead requires that the " is not at the end of the string ($) and is not followed with a tab char.
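As a quick self-contained check of that pattern (the sample string below is illustrative, not taken from the original data), only the quote characters that sit away from tabs and line ends are rewritten:
s <- "1\t\"good \"rogue\" value\"\t\"ok\""
gsub("(?<!\t|^)\"(?!\t|$)", "'", s, perl = TRUE)
# [1] "1\t\"good 'rogue' value\"\t\"ok\""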
