Replacing first character in line in multi-line text column - r

I am trying to replace the "o " with "• " in this text:
• Direct the Department’s technical
• Perform supervisory and managerial responsibilities as leader of the program
o Set direction to ensure goals and objectives
o Select management and other key personnel
o Collaborate with executive colleagues to develop and execute corporate initiatives and
department strategy
o Oversee the preparation and execution of department’s Annual Financial Plan and budget
o Manage merit pay
• Perform other duties as assigned
Since these are at the beginning of the line I've tried
test<- sub(test, pattern = "o ", replacement = "• ") # does not work
test<- gsub(test, pattern = "^o ", replacement = "• ") # does not work
test<- gsub(test, pattern = "o ", replacement = "• ") # works but it also replaces to to t•
Why does "^o " not work since it only appears at the beginning of each the line

Is this is all in a single value? If so, use a lookbehind to find o following either line breaks or string start:
test2 <- gsub(test, pattern = "(?<=\n|\r|^)o ", replacement = "• ", perl = TRUE)
cat(test2)
• Direct the Department’s technical
• Perform supervisory and managerial responsibilities as leader of the program
• Set direction to ensure goals and objectives
• Select management and other key personnel
• Collaborate with executive colleagues to develop and execute corporate initiatives and department strategy
• Oversee the preparation and execution of department’s Annual Financial Plan and budget
• Manage merit pay
• Perform other duties as assigned
Alternatively, split into individual values per line, then use your original regex:
test3 <- gsub(unlist(strsplit(test, "\n|\r")), pattern = "^o ", replacement = "• ")
test3
[1] "• Direct the Department’s technical"
[2] ""
[3] "• Perform supervisory and managerial responsibilities as leader of the program"
[4] ""
[5] "• Set direction to ensure goals and objectives"
[6] ""
[7] "• Select management and other key personnel"
[8] ""
[9] "• Collaborate with executive colleagues to develop and execute corporate initiatives and department strategy"
[10] ""
[11] "• Oversee the preparation and execution of department’s Annual Financial Plan and budget"
[12] ""
[13] "• Manage merit pay"
[14] ""
[15] "• Perform other duties as assigned"

You do not need any lookbehind here, use ^ with (?m) flag:
test <- gsub(test, pattern = "(?m)^o ", replacement = "• ", perl=TRUE)
The (?m) redefines the behavior of the ^ anchor that means "start of a line" if you specify the m flag.
See the online R demo:
test <- "• Direct the Department’s technical\n\no Set direction to ensure goals and objectives\n\no Select management and other key personnel"
cat(gsub(test, pattern = "(?m)^o ", replacement = "• ", perl=TRUE))
Output:
• Direct the Department’s technical
• Set direction to ensure goals and objectives
• Select management and other key personnel

Related

R extract specific word after keyword

How do I extract a specific word after keyword in R.
I have the following input text which contains details about policy. I need to extract specific words value like FirstName , SurName , FatherName and dob.
input.txt
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The below code works for me but the problem is if line order changes everything get changed. Is there a way to extract keyword value irrespective of line order. ?
Current Code
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))
You can combine the text together as one string and extract the values based on pattern in the data. This approach will work irrespective of the line number in the data provided the pattern in the data is always valid for all the files.
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"

add text to atomic (character) vector in r

Good afternoon, I am not an expert in the topic of atomic vectors but I would like some ideas about it
I have the script for the movie "Coco" and I want to be able to get a row that is numbered in the form 1., 2., ... (130 scenes throughout the movie). I want to convert the line of each scene of the movie into a row that contains "Scene 1", "Scene 2", up to "Scene 130" and achieve it sequentially.
url <- "https://www.imsdb.com/scripts/Coco.html"
coco <- read_lines("coco2.txt") #after clean
class(coco)
typeof(coco)
" 48."
[782] " arms full of offerings."
[783] " Once the family clears, Miguel is nowhere to be seen."
[784] " INT. NEARBY CORRIDOR"
[785] " Miguel and Dante hide from the patrolman. But Dante wanders"
[786] " off to inspect a side room."
[787] " INT. DEPARTMENT OF CORRECTIONS"
[788] " Miguel catches up to Dante. He overhears an exchange in a"
[789] " nearby cubicle."
[797] " 49."
[798] " And amigos, they help their amigos."
[799] " worth your while."
[800] " workstation."
[801] " Miguel perks at the mention of de la Cruz."
[809] " Miguel follows him."
[810] " 50." # Its scene number
[811] " INT. HALLWAY"
s <- grep(coco, pattern = "[^Level].[0-9].$", value = TRUE)
My solution is wrong because it is not sequential
v <- gsub(s, pattern = "[^Level].[0-9].$", replacement = paste("Scene", sequence(1:130)))
[1] " Scene1"
[2] " Scene1"
[3] " Scene1"
[4] " Scene1"
[5] " Scene1"
[6] " Scene1"
I'm not clear on what [^Level] represents. However, if the numbers at the end of lines in the text represent the Scene numbers, then you can use ( ) to capture the numbers and substitute them in your replacement text as shown below:
v <- gsub(s, pattern = " ([0-9]{1,3})\\.$", replacement = "Scene \\1")

How to allow a space into a wildcard?

Let's say I have this sentence :
text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")
When I write this (kwicis a quantedafunction) :
kwic(text,phrase("great* cake*"))
I get
[text1, 7:8] want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very | great cakes | but I want to find
However, when I do
kwic(text,phrase("great*cake*"))
I get a kwicobject with 0 row, i.e. nothing
I would like to know what does the *replace exactly and, more important, how to "allow" a space to be taken into account in the wildcard ?
To answer what the * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype and also here. In short, * matches any number of any characters including none. Note that this is very different from its use in a regular expression, which means "match none or more of the preceding character".
The pattern argument in kwic() matches one pattern per token, after tokenizing the text. Even wrapped in the phrase() function, it still only considers sequences of matches to tokens. So you cannot match the whitespace (which defines the boundaries between tokens) unless you actually include these inside the token's value itself.
How could you do that? Like this:
toksbi <- tokens(text, ngrams = 2, concatenator = " ")
# tokens from 1 document.
# text1 :
# [1] "I want" "want to" "to find" "find both" "both the"
# [6] "the greatest" "greatest cake" "cake of" "of the" "the world"
# [11] "world but" "but also" "also some" "some very" "very great"
# [16] "great cakes" "cakes but" "but I" "I want" "want to"
# [21] "to find" "find this" "this last" "last part" "part :"
# [26] ": isn't" "isn't it"
kwic(toksbi, "great*cake*", window = 2)
# [text1, 7] both the the greatest | greatest cake | cake of of the
# [text1, 16] some very very great | great cakes | cakes but but I
But your original usage of kwic(text, phrase("great* cake*")) is the recommended approach.

How to read csv with double quotes from WoS?

I'm trying to read CSV files from the citation report of Web of Science. This is the structure of the file:
TI=clinical case of cognitive dysfunction syndrome AND CU=MEXICO
null
Timespan=All years. Indexes=SCI-EXPANDED, SSCI, A&HCI, ESCI.
"Title","Authors","Corporate Authors","Editors","Book Editors","Source Title","Publication Date","Publication Year","Volume","Issue","Part Number","Supplement","Special Issue","Beginning Page","Ending Page","Article Number","DOI","Conference Title","Conference Date","Total Citations","Average per Year","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016"
""Didy," a clinical case of cognitive dysfunction syndrome","Heiblum, Moises; Labastida, Rocio; Chavez Gris, Gilberto; Tejeda, Alberto","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","MAY-JUN 2007","2007","2","3","","","","68","72","","10.1016/j.jveb.2007.05.002","","","2","0.20","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","1","0","0","0","1","0","0","0"
""Didy," a clinical case of cognitive dysfunction syndrome (vol 2, pg 68, 2007)","Heiblum, A.; Labastida, R.; Gris, Chavez G.; Tejeda, A.; Edwards, Claudia","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","SEP-OCT 2007","2007","2","5","","","","183","183","","","","","0","0.00","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
I manage to import the it using fread, however, I still want to know which is the appropriate quote and why is assigning "Didy," as row names despite that the argument is NULL. This are the arguments that I'm using.
s_file <- read.csv(savedrecs.txt,
skip = 4,
header = TRUE,
row.names = NULL,
quote = '\"',
stringsAsFactors = FALSE)
What you have shown is not a valid csv file format. There are some double double quotes (i.e. "") without a comma. For example there is one at the beginning of the second line.
""Didy," a clinical case of cognitive dysfunction syndrome", etc.
So it thinks there is a null followed by Diddy, followed by " a clinical case of cognitive dysfunction syndrome" Fix up the file and you should be ok. E.g. the second line should start with
"","Didy","a clinical case of cognitive dysfunction syndrome"

Arranging text lines in R

My data is in this format. It's a text file and the class is "character". I have posted few lines from the file. There are about 14000 lines.
"KEY: Aback"
"SYN: Backwards, rearwards, aft, abaft, astern, behind, back."
"ANT: Onwards, forwards, ahead, before, afront, beyond, afore."
"KEY: Abandon"
"SYN: Leave, forsake, desert, renounce, cease, relinquish,"
"discontinue, castoff, resign, retire, quit, forego, forswear,"
"depart_from, vacate, surrender, abjure, repudiate."
"ANT: Pursue, prosecute, undertake, seek, court, cherish, favor,"
"protect, claim, maintain, defend, advocate, retain, support, uphold,"
"occupy, haunt, hold, assert, vindicate, keep."
Line 6 and 7 is the continuation of line 5. Line 9 and 10 is the continuation of line 8. My struggle is how can I bring up line 6 and 7 to line 5 and similarly line 9 and 10 to line 8.
Any hints gratefully received.
First thing that comes to mind (your text is stored as x):
#prefix each line starter (identifies as pattern: `CAPS:`) with a newline (\n)
strsplit(gsub("([A-Z]+:)", "\n\\1", paste(x, collapse = " ")),
split = "\n")[[1L]][-1L]
# [1] "KEY: Aback "
# [2] "SYN: Backwards, rearwards, aft, abaft, astern, behind, back. "
# [3] "ANT: Onwards, forwards, ahead, before, afront, beyond, afore. "
# [4] "KEY: Abandon "
# [5] "SYN: Leave, forsake, desert, renounce, cease, relinquish, discontinue, castoff, resign, retire, quit, forego, forswear, depart_from, vacate, surrender, abjure, repudiate. "
# [6] "ANT: Pursue, prosecute, undertake, seek, court, cherish, favor, protect, claim, maintain, defend, advocate, retain, support, uphold, occupy, haunt, hold, assert, vindicate, keep."

Resources