Replacing a pattern using UNIX commands - unix

I have some strings like:
Sample Input:
Also known as temple of the city,
xxx as Pune Banglore as kolkata Delhi India,
as Mumbai India or as Bombay India,
Calcutta,India is now know as Kolkata,India,
From the above I want to convert "as xxx xxxx xx," to "as xxx_xxxx_xx,", and the change should apply only after the last "as" on each line.
Sample output for above:
Also known as temple_of_the_city,
xxx as Pune Banglore as kolkata_Delhi_India,
as Mumbai India or as Bombay_India,
Calcutta,India is now know as Kolkata,India,
After the conversion there should be no space-separated string after the last "as" in a line.
Please let me know if anything is unclear.
Thanks

Paul is right that it's not really a simple task. This is a sed solution that I put together:
sed 's/\(.*as \)/\1\n/;h;y/ /_/;G;s/.*\n\(.*\)\n\(.*\)\n.*/\2\1/' file.txt
Demonstration on your data:
$ echo 'Also known as temple of the city,
> xxx as Pune Banglore as kolkata Delhi India,
> as Mumbai India or as Bombay India,
> Calcutta,India is now know as Kolkata,India,' | \
> sed 's/\(.*as \)/\1\n/;h;y/ /_/;G;s/.*\n\(.*\)\n\(.*\)\n.*/\2\1/'
Also known as temple_of_the_city,
xxx as Pune Banglore as kolkata_Delhi_India,
as Mumbai India or as Bombay_India,
Calcutta,India is now know as Kolkata,India,
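For anyone puzzling over that one-liner, here is a commented walk-through (GNU sed assumed, since `\n` in a replacement is a GNU extension; note that a line containing no "as " would come out underscored and duplicated, so this relies on every line having at least one "as "):

```shell
# The program, step by step:
#   s/\(.*as \)/\1\n/  greedy .* matches up to the LAST "as "; a newline
#                      is inserted after it, splitting prefix and tail
#   h                  copy the split line to the hold space
#   y/ /_/             replace every space in the pattern space with _
#   G                  append the untouched copy from the hold space; the
#                      pattern space now holds 4 newline-separated
#                      segments: prefix_, tail_, prefix, tail
#   s/.*\n\(.*\)\n\(.*\)\n.*/\2\1/
#                      keep the original prefix (\2) and the underscored
#                      tail (\1)
out=$(printf '%s\n' 'as Mumbai India or as Bombay India,' |
  sed 's/\(.*as \)/\1\n/;h;y/ /_/;G;s/.*\n\(.*\)\n\(.*\)\n.*/\2\1/')
echo "$out"
# prints: as Mumbai India or as Bombay_India,
```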

I'd be inclined to use Perl, the Swiss Army chainsaw, but sed is also an option. In either case you're looking at a substantial learning curve.
The replacement you've described is probably complex enough that you'd be better off writing a script rather than trying to do it as a one-liner.
If you're going to write a script and don't already know Perl, there's no reason you shouldn't pick your scripting language of choice (Python, Ruby, etc.) as long as it has some sort of text pattern matching syntax.
I don't know of a simple, shallow-learning-curve way to do a complex pattern match and replacement of this sort. Is this a one-time thing where you only need this single replacement? Or are you going to be doing similar complicated pattern replacements in the future? If you'll be doing this frequently, you really should invest the time in learning a scripting language, but I won't impose my Perl bias on you. Just pick any language that seems accessible.

Related

How to stop readtext function adding extra quotes in R?

I want to read some Word documents into R and extract the sentences that are contained within quotation marks. When I use the readtext function from the readtext package, it adds extra quotes around the whole string of each article. Is there a way to change this?
path <- "folder"
mydata <- readtext(paste0(path, "\\*.docx"))
mydata$text
quotes <- vector()
for (i in 1:2) {
  quotes[i] <- sub('.*?"([^"]+)"', "\\1", mydata$text[i])
}
Here's the content of both Word documents:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
And this is what my current output looks like:
[1] "We got him and he is healthy, said Houston Police Department (HPD) Major Offenders Commander Ron Borza."
[2] "A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK."
It looks like the issue is related to the different types of quotation marks, i.e. curly vs. straight. I can remove them as follows:
# remove curly punctuation
mydata$text <- gsub("[“”]", "\"", gsub("[‘’]", "'", mydata$text))
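The same normalization can be sanity-checked outside R; a minimal sketch with GNU sed (UTF-8 input assumed), matching the curly double quotes as literal sequences:

```shell
# Replace curly double quotes with straight ones, as the R gsub("[“”]", ...)
# call above does.
sample='“We got him and he is healthy,” said Ron Borza.'
printf '%s\n' "$sample" | sed 's/“/"/g; s/”/"/g'
# prints: "We got him and he is healthy," said Ron Borza.
```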

How do I parse a movie script for lines of dialogue that have consistent spacing with R?

'''
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
RIDER
Hey -- sorry.
'''
I'm scraping some scripts that I want to do some text analysis with. I want to pull only dialogue from the scripts and it looks like it has a certain amount of spacing.
So, for example, I want the line "Hey -- sorry.". I know that the indentation is 20 spaces and that it is consistent throughout the script. So how can I read in only that line and the rest that have the same spacing?
I'm thinking of using read.fwf to read fixed-width fields.
What do you think?
I'm scraping from urls like this:
https://imsdb.com/scripts/10-Things-I-Hate-About-You.html
library(tidytext)
library(tidyverse)
text <- c("PADUA HIGH SCHOOL - DAY
Welcome to Padua High School,, your typical urban-suburban
high school in Portland, Oregon. Smarties, Skids, Preppies,
Granolas. Loners, Lovers, the In and the Out Crowd rub sleep
out of their eyes and head for the main building.
PADUA HIGH PARKING LOT - DAY
KAT STRATFORD, eighteen, pretty -- but trying hard not to be
-- in a baggy granny dress and glasses, balances a cup of
coffee and a backpack as she climbs out of her battered,
baby blue '75 Dodge Dart.
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
She grabs his skateboard and uses it to SHOVE him against a
car, skateboard tip to his throat. He whimpers pitifully
and she lets him go. A path clears for her as she marches
through a pack of fearful students and SLAMS open the door,
entering school.
INT. GIRLS' ROOM - DAY
BIANCA STRATFORD, a beautiful sophomore, stands facing the
mirror, applying lipstick. Her less extraordinary, but
still cute friend, CHASTITY stands next to her.
BIANCA
Did you change your hair?
CHASTITY
No.
BIANCA
You might wanna think about it
Leave the girls' room and enter the hallway.
HALLWAY - DAY- CONTINUOUS
Bianca is immediately greeted by an admiring crowd, both
boys
and girls alike.
BOY
(adoring)
Hey, Bianca.
GIRL
Awesome shoes.
The greetings continue as Chastity remains wordless and
unaddressed by her side. Bianca smiles proudly,
acknowledging her fans.
GUIDANCE COUNSELOR'S OFFICE - DAY
CAMERON JAMES, a clean-cut, easy-going senior with an open,
farm-boy face, sits facing Miss Perky, an impossibly cheery
guidance counselor.")
names_stopwords <- c("^(rider|kat|chastity|bianca|boy|girl)")
text %>%
  as_tibble() %>%
  unnest_tokens(text, value, token = "lines") %>%
  filter(str_detect(text, "\\s{15,}")) %>%
  mutate(text = str_trim(text)) %>%
  filter(!str_detect(text, names_stopwords))
Output:
# A tibble: 9 x 1
text
<chr>
1 hey -- sorry.
2 leave it
3 i said, leave it!
4 did you change your hair?
5 no.
6 you might wanna think about it
7 (adoring)
8 hey, bianca.
9 awesome shoes.
You can include further character names in the names_stopwords vector.
You can try the following :
url <- 'https://imsdb.com/scripts/10-Things-I-Hate-About-You.html'
url %>%
  # Read the webpage line by line
  readLines() %>%
  # Remove '<b>' and '</b>' tags from each string
  gsub('<b>|</b>', '', .) %>%
  # Select only the lines that begin with at least 20 whitespace characters
  grep('^\\s{20,}', ., value = TRUE) %>%
  # Trim surrounding whitespace
  trimws() %>%
  # Remove all-caps strings (character names and scene headings)
  grep('^([A-Z]+\\s?)+$', ., value = TRUE, invert = TRUE)
#[1] "Hey -- sorry." "Leave it" "KAT (continuing)"
#[4] "I said, leave it!" "Did you change your hair?" "No."
#...
#...
I have tried to clean this up as much as possible, but it might require some more cleaning depending on what you actually want to extract.
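For what it's worth, the same filtering logic can be expressed in plain shell; a sketch on an inline stand-in for the script text (the real pipeline would start from the downloaded HTML):

```shell
# Keep lines indented by 20+ spaces, trim them, and drop all-caps
# character cues, mirroring the R pipeline above.
printf '%s\n' \
  '                        RIDER' \
  '                        Hey -- sorry.' \
  'He persists.' |
  grep -E '^ {20,}' |
  sed 's/^ *//; s/ *$//' |
  grep -Ev '^([A-Z]+ ?)+$'
# prints: Hey -- sorry.
```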

RegEx. How to remove blank space after a period before a punctuation character

I have a question on regex. Suppose I have this string:
"She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
I want to remove every blank space that appears after a period and before the character ”, and delete the ” character itself.
For example, this part of the sentence
She was like an eating machine. ”Trump, a man who wants to be president:
should become
She was like an eating machine.Trump, a man who wants to be president:
Thanks guys, regex is not easy to learn. I appreciate any help! Bye.
P.S. I'm using R, but I think that's irrelevant since regex works in every programming language.
UPDATE
I solved my problem and I'd like to share it; maybe it could help someone else. I have a dataset downloaded from Kaggle of Trump and Hillary tweets.
I had to do some cleaning before importing the data into KNIME (a university project).
I had solved all the encoding issues through gsub except this one. I finally managed to solve it by writing a CSV file in R with UTF-8 encoding, and then reading that file into KNIME with the same encoding.
If you need to match any number of whitespaces (1 or more) between a dot and the curly double quote, you may use
x <- "She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
gsub("\\.\\s+”", ".", x)
## => [1] "She gained about 55 pounds in...9 months. She was like an eating machine.Trump, a man who wants to be president: "
The \\. matches a dot, \\s+ matches 1 or more whitespace symbols and ” matches a ”.
See the regex demo and an R demo.
If there is only 1 regular space between the dot and the quote, you may use a fixed string replacement:
gsub(". ”", ".", x, fixed=TRUE)
See this R demo.
Maybe this could help (the same idea in JavaScript, though note it matches a straight quote rather than the curly ” in your string):
var str = 'She was like an eating machine. "Trump, a man who wants to be president. "New value';
str.replace(/\.\s"/g, ".");
http://regexr.com/ is a great tool for learning and testing regular expressions.
The only thing I'd add to Wiktor's answer is that it won't match "machine.”Trump". To match any number of spaces after a dot and before a quote, use the * quantifier:
x <- "She gained about 55 pounds in...9 months. She was like an eating machine. ”Trump, a man who wants to be president: "
gsub("\\.\\s*”", ".", x)
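The same substitution translates directly to sed (GNU sed, UTF-8 input assumed), with ` *` playing the role of `\\s*`:

```shell
# Remove any spaces between a period and the curly closing quote,
# and drop the quote itself.
x='She was like an eating machine. ”Trump, a man who wants to be president:'
printf '%s\n' "$x" | sed 's/\. *”/./'
# prints: She was like an eating machine.Trump, a man who wants to be president:
```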

How to determine the correct file encoding for use with read.fwf (or use a workaround to remove non-conforming characters)

I tried the approach in the following question and am still stuck.
How to detect the right encoding for read.csv?
The following code should be reproducible... Any ideas? I'd rather not use scan() or readLines() because I've been using this code successfully for assorted state-level ACS data in the past.
My other thought is to edit the text file prior to importing it. However, I store the files zipped and use a script to unzip them and then access the data. Having to edit the file outside of the R environment would really gum up that process. Thanks in advance!
Filename <- "g20095us.txt"
Url <- "http://www2.census.gov/acs2005_2009_5yr/summaryfile/2005-2009_ACSSF_By_State_By_Sequence_Table_Subset/UnitedStates/All_Geographies_Not_Tracts_Block_Groups/"
Widths <- c(6,2,3,2,7,1,1,1,2,2,3,5,5,6,1,5,4,5,1,3,5,5,5,3,5,1,1,5,3,5,5,5,2,3,
3,6,3,5,5,5,5,5,1,1,6,5,5,40,200,6,1,50)
Classes <- c(rep('character',4),'integer',rep('character',47))
Names <- c('fileid','stusab','sumlev','geocomp','logrecno','us','region','division',
'statece','state','county','cousub','place','tract','blkgrp','concit',
rep('blank',14),'ua',rep('blank',11),'ur',rep('blank',4),'geoid','name',rep('blank',3))
GeoHeader <- read.fwf(paste0(Url, Filename), widths = Widths,
                      colClasses = Classes, col.names = Names,
                      fill = TRUE, strip.white = TRUE)
Four lines from the file "g20095us.txt" are below. The second one ("Cañoncito") is causing the problems. The other files in the download are CSV, but this one is fixed-width and is necessary to identify the geographies of interest (the organization of the data is not very intuitive).
ACSSF US251000000964 2430 090 25100US2430090 Cameron Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000965 2430 092 25100US2430092 Cañoncito Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000966 2430 095 25100US2430095 Casamero Lake Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000967 2430 105 25100US2430105 Chi Chil Tah Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
First, we identify all the non-ASCII characters. I do this by converting each line to a raw vector and then looking for byte values over 127 (the last unambiguously encoded value in ASCII).
lines <- readLines("g20095us.txt")
non_ascii <- function(x) {
  any(charToRaw(x) > 127)
}
bad <- vapply(lines, non_ascii, logical(1), USE.NAMES = FALSE)
lines[bad]
We then need to figure out what the correct encoding is. This is challenging when we only have two cases, and it often involves some trial and error. In this case I googled for "encoding \xf1" and discovered Why doesn't this conversion to utf8 work?, which suggested that latin1 might be the correct encoding.
I tested that using iconv, which converts from one encoding to another (and you always want to convert to utf-8):
iconv(lines[bad], "latin1", "utf-8")
Finally, we reload the file with the correct encoding. Confusingly, the encoding argument to the read.* functions doesn't do this - you need to specify the encoding on the connection manually:
fixed <- readLines(file("g20095us.txt", encoding = "latin1"))
fixed[bad]
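The same diagnosis and fix can also be done from the shell before the file ever reaches R; a sketch using a tiny latin1 sample (the `\361` octal escape is ñ in latin1):

```shell
# Write one latin1-encoded line, locate non-ASCII bytes, then convert
# the whole file to UTF-8 with iconv.
printf 'Ca\361oncito Chapter\n' > sample_latin1.txt
LC_ALL=C grep -n '[^ -~]' sample_latin1.txt >/dev/null && echo 'non-ASCII found'
iconv -f latin1 -t utf-8 sample_latin1.txt
# prints: Cañoncito Chapter
```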

Dictionary Textfile UNIX

Can anyone tell me where the dictionary text file is located on UNIX systems? Or where I can get a good dictionary text file? I am currently using a text file from Sun, but it contains abbreviations that are not followed by a period (otherwise I could remove them). Could somebody point me in the right direction? I cannot seem to find anything helpful in the Mac developer dictionary tools either. I am looking for something that contains only English words: no abbreviations and no proper nouns. It is for a word game.
Try /usr/dict/words, /usr/share/dict/, or /var/lib/dict/.
Or google "linux dictionary text" or "linux words" and find:
http://www.linuxquestions.org/questions/linux-newbie-8/dictionary-file-in-linux-652559/
http://en.wikipedia.org/wiki/Words_(Unix)
etc.
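Once you have a words file, filtering out proper nouns and abbreviations is a one-liner; a sketch using an inline sample in place of /usr/share/dict/words (which is not guaranteed to be installed):

```shell
# Keep only all-lowercase alphabetic entries: this drops capitalized
# proper nouns and anything containing apostrophes or periods.
printf '%s\n' 'apple' 'Boston' "isn't" 'Dr.' 'zebra' |
  grep -E '^[a-z]+$'
# prints: apple, then zebra
```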
