R Remove character followed by specific character in Street Address - r

I would like to remove everything after a certain character sequence, with a few exceptions:
In the 1st string I want to remove everything after 'St' (my interpretation is that St stands for street); in the 2nd string 'St' stands for saint, so I would like to keep the address as it is.
In the 3rd string I want to remove everything after 'Dr' (my interpretation is that Dr stands for drive); in the 4th string 'Dr' stands for doctor, so I would like to keep the address as it is.
Below is a sample input:
str <- c("852 union St End",
"852 St johns street",
"30 Sandpiper Dr 35",
"30 Dr Botero drive")
My expected output is
c("852 union St",
"852 St johns street",
"30 Sandpiper Dr",
"30 Dr Botero drive")
Below is the sample code I am using; however, it removes everything after every St / Dr:
Scrubdata <- mgsub(str,
c(" drive.*", " dr .*",
" street.*", " st .*"),
c(" drive", " dr",
" street", " st"), ignore.case = T)
Has anyone got an idea?
Thank you!

Here's a way that removes the word after 'St' or 'Dr' only when exactly one word follows it, i.e. when that word is the last one in the string:
sub('(?<=(St|Dr)) \\w+$', '', str, perl = TRUE)
#[1] "852 union St" "852 St johns street" "30 Sandpiper Dr" "30 Dr Botero drive"
Using str_remove :
stringr::str_remove(str, '(?<=(St|Dr)) \\w+$')
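A possible variation, offered as a sketch rather than a complete address-cleaning rule: the pattern above is case-sensitive, so an input such as "852 union st End" would be left alone. The version below assumes you want case-insensitive matching and a word boundary before the abbreviation. The `addresses` vector is hypothetical: its first four entries come from the question, the last two are invented to show edge cases.

```r
# Hypothetical sample: the first four strings come from the question,
# the last two are invented to show the edge cases.
addresses <- c("852 union St End",
               "852 St johns street",
               "30 Sandpiper Dr 35",
               "30 Dr Botero drive",
               "852 union st End",
               "12 Main St Apt 4")

# (?i) makes the match case-insensitive; \\b requires "st"/"dr" to start a word.
# Only a single trailing word directly after the abbreviation is removed.
cleaned <- sub("(?i)(?<=\\b(?:st|dr)) \\w+$", "", addresses, perl = TRUE)
print(cleaned)
```

The last string is left unchanged because two words follow 'St', which matches the "only one word following it" behaviour described above.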

Replacing first character in line in multi-line text column

I am trying to replace the "o " with "• " in this text:
• Direct the Department’s technical
• Perform supervisory and managerial responsibilities as leader of the program
o Set direction to ensure goals and objectives
o Select management and other key personnel
o Collaborate with executive colleagues to develop and execute corporate initiatives and
department strategy
o Oversee the preparation and execution of department’s Annual Financial Plan and budget
o Manage merit pay
• Perform other duties as assigned
Since these are at the beginning of a line, I've tried:
test<- sub(test, pattern = "o ", replacement = "• ") # does not work
test<- gsub(test, pattern = "^o ", replacement = "• ") # does not work
test<- gsub(test, pattern = "o ", replacement = "• ") # works, but it also changes "to " into "t• "
Why does "^o " not work, given that "o " only needs replacing at the beginning of each line?
Is this all in a single value? If so, use a lookbehind to find an o that follows either a line break or the start of the string:
test2 <- gsub(test, pattern = "(?<=\n|\r|^)o ", replacement = "• ", perl = TRUE)
cat(test2)
• Direct the Department’s technical
• Perform supervisory and managerial responsibilities as leader of the program
• Set direction to ensure goals and objectives
• Select management and other key personnel
• Collaborate with executive colleagues to develop and execute corporate initiatives and department strategy
• Oversee the preparation and execution of department’s Annual Financial Plan and budget
• Manage merit pay
• Perform other duties as assigned
Alternatively, split into individual values per line, then use your original regex:
test3 <- gsub(unlist(strsplit(test, "\n|\r")), pattern = "^o ", replacement = "• ")
test3
[1] "• Direct the Department’s technical"
[2] ""
[3] "• Perform supervisory and managerial responsibilities as leader of the program"
[4] ""
[5] "• Set direction to ensure goals and objectives"
[6] ""
[7] "• Select management and other key personnel"
[8] ""
[9] "• Collaborate with executive colleagues to develop and execute corporate initiatives and department strategy"
[10] ""
[11] "• Oversee the preparation and execution of department’s Annual Financial Plan and budget"
[12] ""
[13] "• Manage merit pay"
[14] ""
[15] "• Perform other duties as assigned"
You do not need any lookbehind here, use ^ with (?m) flag:
test <- gsub(test, pattern = "(?m)^o ", replacement = "• ", perl=TRUE)
The (?m) flag redefines the behavior of the ^ anchor: with the m flag set, ^ matches at the start of every line, not only at the start of the string.
See the online R demo:
test <- "• Direct the Department’s technical\n\no Set direction to ensure goals and objectives\n\no Select management and other key personnel"
cat(gsub(test, pattern = "(?m)^o ", replacement = "• ", perl=TRUE))
Output:
• Direct the Department’s technical
• Set direction to ensure goals and objectives
• Select management and other key personnel
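To make the difference concrete, here is a small sketch that runs the same pattern with and without (?m). The `notes` string is a made-up example, and "* " is used in place of the bullet character to keep the snippet ASCII-only.

```r
# Hypothetical three-line string; "* " stands in for the bullet character.
notes <- "o first item\no second item\ntwo to go"

# Without (?m), ^ matches only at the very start of the string.
start_only <- gsub("^o ", "* ", notes)

# With (?m) and perl = TRUE, ^ matches at the start of every line.
every_line <- gsub("(?m)^o ", "* ", notes, perl = TRUE)

cat(start_only, "\n\n", every_line, "\n", sep = "")
```

In both calls the "o " inside "two to go" is left alone, because it never sits at the start of a line.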

Understanding Createview in SQLITE3

I am trying to understand the CREATE VIEW command in SQLite, but I am getting a strange view.
I have a tab-separated file, table.csv:
ID NAME AGE ADDRESS SALARY
1 Paul 32 California 20000.0
2 Allen 25 Texas 15000.0
3 Teddy 23 Norway 20000.0
4 Mark 25 Rich-Mond 65000.0
5 David 27 Texas 85000.0
6 Kim 22 South-Hall 45000.0
7 James 24 Houston 10000.0
I run these commands:
sqlite3 sample.db
.mode csv
.import table.csv COMPANY
CREATE VIEW COMPANY_VIEW AS
SELECT "ID", "NAME", "AGE"
FROM COMPANY;
SELECT * FROM COMPANY_VIEW;
and get:
"1 ",NAME,"32 "
"2 ",NAME,"25 "
"3 ",NAME,"23 "
"4 ",NAME,"25 "
"5 ",NAME,"27 "
"6 ",NAME,"22 "
"7 ",NAME,"24 "
Why am I not getting the NAME column? What am I doing wrong? I am an absolute beginner in SQLite3.
I suggest issuing .schema COMPANY and/or SELECT * FROM COMPANY before the CREATE VIEW command, to check how the file was actually imported.
From the SQLite documentation:
Use the ".import" command to import CSV (comma separated value) data into an SQLite table.
The operative term here is comma separated.
Use the .separator command to change the separator to tab.

Extract first letter in each word but keeping specific punctuation

I have a vector of people's names, a couple of million elements long, from which I want to remove all characters except the first letter of each word (i.e. the initials) and some characters such as ';' and '-'. The name formats vary a lot; a small sample looks like this:
text <- c("Alwyn Howard Gentry", "a. h. gentry", "A H GENTRY", "A. H. G.",
"Carl von Martius", "Leitão Filho, H. F. ; Shepherd, G. J.",
"Dárdano de Andrade - Lima")
I was using the solution below, which gives the desired output, but it is too time-consuming:
unlist(lapply(strsplit(text, " ", fixed = TRUE),
function(x) paste0(substr(x, 1, 1), collapse="")))
"AHG" "ahg" "AHG" "AHG" "CvM" "LFHF;SGJ" "DdA-L"
So I tried to adapt an answer I found here based on regexp and gsub. I managed to get the initials, but not the initials and the punctuation characters at the same time:
gsub('\\b(\\pL)|.', '\\1', text, perl = TRUE)
"AHG" "ahg" "AHG" "AHG" "CvM" "LFHFSGJ" "DdAL"
I am really new to regexp. I tried to adapt the '\b(\pL)|.' part of the code to include the characters in the pattern, but I gave up after a couple of hours of trying.
Any ideas on which regular expression I should use with gsub() to get the same result as the one I got with strsplit() and lapply()?
Thanks a lot!
You can use
text <- c("Alwyn Howard Gentry", "a. h. gentry", "A H GENTRY", "A. H. G.", "Carl von Martius", "Leitão Filho, H. F. ; Shepherd, G. J.", "Dárdano de Andrade - Lima")
gsub("(*UCP)(\\b\\p{L}|[;-])(*SKIP)(*F)|.", "", text, perl=TRUE)
## Or, alternatively,
gsub("(*UCP)[^;-](?<!\\b\\p{L})", "", text, perl=TRUE)
See the R demo and a regex demo #1/regex demo #2.
Details:
(*UCP) - a PCRE verb that makes \b Unicode-aware
(\b\p{L}|[;-])(*SKIP)(*F) - any Unicode letter at the start of a word or a ; or -, and then the match is skipped, and the next match is searched for from the position where the failure occurred
| - or
. - any char but line break chars
[^;-](?<!\b\p{L}) - any char other than ; and -, provided it is not a Unicode letter at the start of a word (i.e. a letter preceded by either the start of the string or a non-word char).
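As a quick sanity check, the sketch below compares both patterns against the strsplit() approach from the question on the sample vector. The accented letters are written as \u escapes (ã and á) only so that the snippet does not depend on the file encoding; the variable names are illustrative.

```r
# Sample names from the question; "\u00e3" is ã and "\u00e1" is á.
name_samples <- c("Alwyn Howard Gentry", "a. h. gentry", "A H GENTRY", "A. H. G.",
                  "Carl von Martius",
                  "Leit\u00e3o Filho, H. F. ; Shepherd, G. J.",
                  "D\u00e1rdano de Andrade - Lima")

# Baseline from the question: first character of each space-separated token.
baseline <- unlist(lapply(strsplit(name_samples, " ", fixed = TRUE),
                          function(x) paste0(substr(x, 1, 1), collapse = "")))

# Pattern 1: keep word-initial letters and ; or - via (*SKIP)(*F), drop the rest.
skip_fail <- gsub("(*UCP)(\\b\\p{L}|[;-])(*SKIP)(*F)|.", "", name_samples, perl = TRUE)

# Pattern 2: drop any char that is neither ; nor - nor a word-initial letter.
lookbehind <- gsub("(*UCP)[^;-](?<!\\b\\p{L})", "", name_samples, perl = TRUE)

print(identical(skip_fail, baseline))
print(identical(lookbehind, baseline))
```

On this sample, both comparisons should report that the regex results match the baseline; on the full data it would be worth repeating the check on a larger subset before relying on either pattern.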

How do i get the twenty one like this <<twenty one >> and not like <<twenty>> <<one>>

stuff= c("my favoiet number is 23","zev is the best","i love 23,456", "twenty one", "10", "123,123,123" ,"dfghjklkjhgfghj",
"three is my numner" ,"this cost $1.23" , "roman numeral VI is awesome ")
WordNumber= "(one|two|three|four|five|six|seven|eight|nine|ten|
eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|
thirty|forty|fifty|sixty|seventy|eighty|ninety
hundred|thousand|million|billion|trillion)"
gsub(WordNumber,"<<\\1>>" , stuff)
You need to re-arrange your parentheses and add optional spaces:
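# paste0() keeps the line breaks of the source code out of the pattern;
# note the "|" after ninety, which was missing.
WordNumber <- paste0(
"((?:(?:one|two|three|four|five|six|seven|eight|nine|ten|",
"eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|",
"thirty|forty|fifty|sixty|seventy|eighty|ninety|",
"hundred|thousand|million|billion|trillion)\\s*)+)")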
gsub(WordNumber,"<<\\1>>" , stuff)
This yields
[1] "my favoiet number is 23" "zev is the best"
[3] "i love 23,456" "<<twenty one>>"
[5] "10" "123,123,123"
[7] "dfghjklkjhgfghj" "<<three >>is my numner"
[9] "this cost $1.23" "roman numeral VI is awesome "
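Note that the result above keeps the trailing space inside the brackets ("<<three >>"), and the pattern has no word boundaries, so it could also fire inside longer words. A possible refinement, as a sketch under the assumption that you want whole words only and no trailing whitespace in the capture: the names `number_words`, `alternatives`, `samples` and `tagged` are illustrative, and `samples` is a made-up test vector.

```r
# Number words from the question, built as a vector so no stray line breaks
# or missing "|" can slip into the pattern.
number_words <- c("one", "two", "three", "four", "five", "six", "seven", "eight",
                  "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                  "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
                  "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
                  "eighty", "ninety", "hundred", "thousand", "million",
                  "billion", "trillion")
alternatives <- paste(number_words, collapse = "|")

# One number word, optionally followed by more number words separated by
# whitespace; \\b on both sides restricts the match to whole words.
pattern <- paste0("\\b((?:", alternatives, ")(?:\\s+(?:", alternatives, "))*)\\b")

samples <- c("twenty one", "three is my numner", "someone is done", "sixteen candles")
tagged <- gsub(pattern, "<<\\1>>", samples, perl = TRUE)
print(tagged)
```

Here "someone is done" stays untouched because "one" is not a separate word in it, and the space after "three" ends up outside the brackets.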

add text to atomic (character) vector in r

Good afternoon. I am not an expert on atomic vectors, but I would like some ideas.
I have the script for the movie "Coco", in which each scene starts with a line that holds only its number in the form 1., 2., ... (130 scenes throughout the movie). I want to convert each of those lines into "Scene 1", "Scene 2", up to "Scene 130", in sequence.
url <- "https://www.imsdb.com/scripts/Coco.html"
library(readr) # provides read_lines()
coco <- read_lines("coco2.txt") # after cleaning
class(coco)
typeof(coco)
" 48."
[782] " arms full of offerings."
[783] " Once the family clears, Miguel is nowhere to be seen."
[784] " INT. NEARBY CORRIDOR"
[785] " Miguel and Dante hide from the patrolman. But Dante wanders"
[786] " off to inspect a side room."
[787] " INT. DEPARTMENT OF CORRECTIONS"
[788] " Miguel catches up to Dante. He overhears an exchange in a"
[789] " nearby cubicle."
[797] " 49."
[798] " And amigos, they help their amigos."
[799] " worth your while."
[800] " workstation."
[801] " Miguel perks at the mention of de la Cruz."
[809] " Miguel follows him."
[810] " 50." # Its scene number
[811] " INT. HALLWAY"
s <- grep(coco, pattern = "[^Level].[0-9].$", value = TRUE)
My solution is wrong because the result is not sequential:
v <- gsub(s, pattern = "[^Level].[0-9].$", replacement = paste("Scene", sequence(1:130)))
[1] " Scene1"
[2] " Scene1"
[3] " Scene1"
[4] " Scene1"
[5] " Scene1"
[6] " Scene1"
I'm not clear on what [^Level] represents. However, if the numbers at the end of lines in the text represent the Scene numbers, then you can use ( ) to capture the numbers and substitute them in your replacement text as shown below:
v <- gsub(s, pattern = " ([0-9]{1,3})\\.$", replacement = "Scene \\1")
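If the goal is to relabel the scene lines inside the full script rather than in the filtered vector s, one option is to anchor the pattern to the whole line, so that ordinary sentences ending in a number are not touched. This is only a sketch: `script_lines` is a hypothetical excerpt, and it assumes a scene line contains nothing but whitespace, up to three digits and a dot.

```r
# Hypothetical excerpt; real input would be the lines read from the script file.
script_lines <- c("      48.",
                  "  INT. NEARBY CORRIDOR",
                  "  They met in room 12.",
                  "      49.")

# ^\\s* ... \\s*$ forces the number to be the only content of the line;
# \\1 reuses the captured scene number in the replacement.
scene_labels <- sub("^\\s*([0-9]{1,3})\\.\\s*$", "Scene \\1", script_lines)
print(scene_labels)
```

The leading whitespace of the scene lines is dropped as part of the match, while the dialogue line ending in "12." is left as it is.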
