unix telephone syntax with different ways of writing American phone numbers

I have a .txt file with names followed by their respective phone numbers, and I need to grab all the numbers that follow the ###-###-#### syntax, which I have accomplished with this code:
grep -E "([0-9]{3})-[0-9]{3}-[0-9]{4}" telephonefile_P2
but my problem is that there are instances of
(###)-###-####
(###) ### ####
### ### ####
###-####
This is the file:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Star Club 49 040–31 77 78 0
Lolita Spengler (816) 756 8657
Hoffman's Kleider 049 37 1836 027
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Hermann's Speilhaus 49 25 8377 1765
Hal Kubrick 44 1289 332934
Sister Sue 978 0672
Auggie Keller 49 089/594 393
JCCC 913-469-8500
This is my desired output:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
and I don't know how to account for these alternate forms...
obviously new to Unix, please be gentle!

$ awk '/(^|[[:space:]])(\(?[0-9]{3}\)?[- ])?[0-9]{3}[- ][0-9]{4}([[:space:]]|$)/' file
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
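For anyone who prefers Python, here is a rough equivalent of the awk pattern above, run against the sample lines from the question. It is only a sketch: the names PHONE, SAMPLE and us_numbers are made up for illustration, and the lines are pasted in rather than read from telephonefile_P2.

```python
import re

# Same idea as the awk pattern: an optional area code (with or without
# parentheses), then 3 digits, a dash or space, and 4 digits. The number
# must be delimited by whitespace or the start/end of the line.
PHONE = re.compile(
    r"(?:^|\s)(?:\(?[0-9]{3}\)?[- ])?[0-9]{3}[- ][0-9]{4}(?:\s|$)"
)

# Sample lines copied from the question (hypothetical stand-in for the file).
SAMPLE = [
    "Sam Spade (212)-756-1045",
    "Daffy Duck 312 450 2856",
    "Mom 354-2015",
    "Star Club 49 040–31 77 78 0",
    "Lolita Spengler (816) 756 8657",
    "Hoffman's Kleider 049 37 1836 027",
    "Dr. Harold Kranzler 765-986-9987",
    "Ralph Spoilsport's Motors 967 882 6534",
    "Hermann's Speilhaus 49 25 8377 1765",
    "Hal Kubrick 44 1289 332934",
    "Sister Sue 978 0672",
    "Auggie Keller 49 089/594 393",
    "JCCC 913-469-8500",
]


def us_numbers(lines):
    """Return the lines that contain an American-style phone number."""
    return [line for line in lines if PHONE.search(line)]


matches = us_numbers(SAMPLE)
for line in matches:
    print(line)
```

The whitespace/line-boundary checks on both sides are what keep longer digit runs such as 8377 1765 from being picked up as a 7-digit number.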

Related

How do I scrape elements at the same level from a website using bs4

I'm trying to scrape this list of books and authors from the following site:
https://www.theguardian.com/books/2019/sep/21/best-books-of-the-21st-century
I first make a soup object using:
soup = BeautifulSoup(r.content, features='lxml')
Then I inspect the specific element on chrome, and filter in on the specific part of the page by:
listicle = soup.find('div', class_='content__article-body from-content-api js-article__body')
Now, for the parts that are confusing:
1. The list has the index, the book title, and the author name all at the same level (h2). I can do a find('h2') to get to 'index' and then try to access the rest with next_sibling. Is there a better way?
2. Even if I figure out No. 1 above, I need to write a 'for-loop' to get to the rest of the entries in the listicle? I can't seem to figure out how to do that as the 'listicle' variable that I created only contains a list and it wouldn't necessarily list through each entry (book 1, book 2, etc.) but through each element in the list (book 1 index, book 1 author, etc.).
I am completely new to web-scraping. So apologies if this is a very dumb question.
One solution is to select all the <h2> elements and use the zip() function. For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.theguardian.com/books/2019/sep/21/best-books-of-the-21st-century'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# The page counts down from 100, so reverse the list to go in ascending order.
h2s = soup.select('.content__article-body h2')[::-1]

# After reversing, every group of three <h2> tags is: author, title, index.
for author, title, i in zip(h2s[::3], h2s[1::3], h2s[2::3]):
    print('{:<5} {:<60} {}'.format(i.text, title.text, author.text))
Prints:
1 Wolf Hall by Hilary Mantel (2009)
2 Gilead by Marilynne Robinson (2004)
3 Secondhand Time by Svetlana Alexievich (2013), translated by Bela Shayevich (2016)
4 Never Let Me Go by Kazuo Ishiguro (2005)
5 Austerlitz by WG Sebald (2001), translated by Anthea Bell (2001)
6 The Amber Spyglass by Philip Pullman (2000)
7 Between the World and Me by Ta-Nehisi Coates (2015)
8 Autumn by Ali Smith (2016)
9 Cloud Atlas David Mitchell (2004)
10 Half of a Yellow Sun by Chimamanda Ngozi Adichie (2006)
11 My Brilliant Friend by Elena Ferrante (2011), translated by Ann Goldstein (2012)
12 The Plot Against America by Philip Roth (2004)
13 Nickel and Dimed by Barbara Ehrenreich (2001)
14 Fingersmith by Sarah Waters (2002)
15 The Sixth Extinction by Elizabeth Kolbert (2014)
16 The Corrections by Jonathan Franzen (2001)
17 The Road by Cormac McCarthy (2006)
18 The Shock Doctrine by Naomi Klein (2007)
19 The Curious Incident of the Dog in the Night‑Time by Mark Haddon (2003)
20 Life After Life by Kate Atkinson (2013)
21 Sapiens by Yuval Noah Harari (2011), translated by Harari with John Purcell and Haim Watzman (2014)
22 Tenth of December by George Saunders (2013)
23 The Noonday Demon by Andrew Solomon (2001)
24 A Visit from The Goon Squad by Jennifer Egan (2011)
25 Normal People by Sally Rooney (2018)
26 Capital in the Twenty First Century by Thomas Piketty (2013), translated by Arthur Goldhammer (2014)
27 Hateship, Friendship, Courtship, Loveship, Marriage by Alice Munro (2001)
28 Rapture by Carol Ann Duffy (2005)
29 A Death in the Family by Karl Ove Knausgaard (2009), translated by Don Bartlett (2012)
30 The Underground Railroad by Colson Whitehead (2016)
31 The Argonauts by Maggie Nelson (2015)
32 The Emperor of All Maladies by Siddhartha Mukherjee (2010)
33 Fun Home by Alison Bechdel (2006)
34 Outline by Rachel Cusk (2014)
35 The Hare with Amber Eyes by Edmund de Waal (2010)
36 Experience by Martin Amis (2000)
37 The Green Road by Anne Enright (2015)
38 The Line of Beauty by Alan Hollinghurst (2004)
39 White Teeth by Zadie Smith (2000)
40 The Year of Magical Thinking by Joan Didion (2005)
41 Atonement by Ian McEwan (2001)
42 Moneyball by Michael Lewis (2010)
43 Citizen: An American Lyric by Claudia Rankine (2014)
44 Hope in the Dark by Rebecca Solnit (2004)
45 Levels of Life by Julian Barnes (2013)
46 Human Chain by Seamus Heaney (2010)
47 Persepolis by Marjane Satrapi (2000-2003), translated by Mattias Ripa (2003-2004)
48 Night Watch by Terry Pratchett (2002)
49 Why Be Happy When You Could Be Normal? by Jeanette Winterson (2011)
50 Oryx and Crake by Margaret Atwood (2003)
51 Brooklyn by Colm Tóibín (2009)
52 Small Island by Andrea Levy (2004)
53 True History of the Kelly Gang by Peter Carey (2000)
54 Women & Power by Mary Beard (2017)
55 The Omnivore’s Dilemma by Michael Pollan (2006)
56 Underland by Robert Macfarlane (2019)
57 The Amazing Adventures of Kavalier and Clay by Michael Chabon (2000)
58 Postwar by Tony Judt (2005)
59 The Beauty of the Husband by Anne Carson (2002)
60 Dart by Alice Oswald (2002)
61 This House of Grief by Helen Garner (2014)
62 Mother’s Milk by Edward St Aubyn (2006)
63 The Immortal Life of Henrietta Lacks by Rebecca Skloot (2010)
64 On Writing by Stephen King (2000)
65 Gone Girl by Gillian Flynn (2012)
66 Seven Brief Lessons on Physics by Carlo Rovelli (2014)
67 The Silence of the Girls by Pat Barker (2018)
68 The Constant Gardener by John le Carré (2001)
69 The Infatuations by Javier Marías (2011), translated by Margaret Jull Costa (2013)
70 Notes on a Scandal by Zoë Heller (2003)
71 Jimmy Corrigan: The Smartest Kid on Earth by Chris Ware (2000)
72 The Age of Surveillance Capitalism by Shoshana Zuboff (2019)
73 Nothing to Envy by Barbara Demick (2009)
74 Days Without End by Sebastian Barry (2016)
75 Drive Your Plow Over the Bones of the Dead by Olga Tokarczuk (2009), translated by Antonia Lloyd-Jones (2018)
76 Thinking, Fast and Slow by Daniel Kahneman (2011)
77 Signs Preceding the End of the World by Yuri Herrera (2009), translated by Lisa Dillman (2015)
78 The Fifth Season by NK Jemisin (2015)
79 The Spirit Level by Richard Wilkinson and Kate Pickett (2009)
80 Stories of Your Life and Others by Ted Chiang (2002)
81 Harvest by Jim Crace (2013)
82 Coraline by Neil Gaiman (2002)
83 Tell Me How It Ends by Valeria Luiselli (2016), translated by Luiselli with Lizzie Davis (2017)
84 The Cost of Living by Deborah Levy (2018)
85 The God Delusion by Richard Dawkins (2006)
86 Adults in the Room by Yanis Varoufakis (2017)
87 Priestdaddy by Patricia Lockwood (2017)
88 Noughts & Crosses by Malorie Blackman (2001)
89 Bad Blood by Lorna Sage (2000)
90 Visitation by Jenny Erpenbeck (2008), translated by Susan Bernofsky (2010)
91 Light by M John Harrison (2002)
92 The Siege by Helen Dunmore (2001)
93 Darkmans by Nicola Barker (2007)
94 The Tipping Point by Malcolm Gladwell (2000)
95 Chronicles: Volume One by Bob Dylan (2004)
96 A Little Life by Hanya Yanagihara (2015)
97 Harry Potter and the Goblet of Fire by JK Rowling (2000)
98 The Girl With the Dragon Tattoo by Stieg Larsson (2005), translated by Steven T Murray (2008)
99 Broken Glass by Alain Mabanckou (2005), translated by Helen Stevenson (2009)
100 I Feel Bad About My Neck by Nora Ephron (2006)
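To see why the three slices line up, here is a small offline sketch of the same grouping trick. The list h2_texts is a hand-written stand-in for the reversed <h2> texts (it is not fetched from the page), and group_triples is a made-up helper name.

```python
# Hand-written stand-in for the reversed <h2> texts: author, title, index, repeating.
h2_texts = [
    "by Hilary Mantel (2009)", "Wolf Hall", "1",
    "by Marilynne Robinson (2004)", "Gilead", "2",
]


def group_triples(items):
    """Regroup a flat author/title/index sequence into (index, title, author) tuples."""
    # items[0::3] are the authors, items[1::3] the titles, items[2::3] the indexes.
    return list(zip(items[2::3], items[1::3], items[0::3]))


books = group_triples(h2_texts)
for index, title, author in books:
    print(index, title, author)
```

Each slice takes every third element starting at a different offset, so zip() pairs up the pieces that belong to the same book without any next_sibling bookkeeping.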

R Extract names from text

I'm trying to extract a list of rugby players' names from a string. The string contains all of the information from a table: the headers (team names) as well as the name of the player in each position for each team. It also has the player rankings, but I don't care about those.
Important - a lot of player rankings are missing. I found a solution to this, but it doesn't handle missing rankings (for example, below, Rabah Slimani is the first player without a recorded ranking).
Note, the 1-15 numbers indicate positions, and there's always two names following each position (home player and away player).
Here's the sample string:
" Team Sheets # FRA France RPI IRE Ireland RPI 1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85 3 Rabah Slimani Tadhg Furlong 85 4 Arthur Iturria 82 Iain Henderson 84 5 Sebastien Vahaamahina 84 James Ryan 92 6 Wenceslas Lauret 82 Peter O'Mahony 93 7 Yacouba Camara 70 Josh van der Flier 64 8 Kevin Gourdon CJ Stander 91 9 Maxime Machenaud Conor Murray 87 10 Matthieu Jalibert Johnny Sexton 90 11 Virimi Vakatawa Jacob Stockdale 89 12 Henry Chavancy Bundee Aki 83 13 Rémi Lamerat Robbie Henshaw 78 14 Teddy Thomas Keith Earls 89 15 Geoffrey Palis Rob Kearney 80 Substitutes # FRA France RPI IRE Ireland RPI 16 Adrien Pelissie Sean Cronin 84 17 Dany Priso 70 Jack McGrath 70 18 Cedate Gomes Sa 71 John Ryan 86 19 Paul Gabrillagues 77 Devin Toner 90 20 Marco Tauleigne Dan Leavy 80 21 Antoine Dupont 92 Luke McGrath 22 Anthony Belleau 65 Joey Carbery 86 23 Benjamin Fall Fergus McFadden "
Note - it comes from here: https://www.rugbypass.com/live/six-nations/france-vs-ireland-at-stade-de-france-on-03022018/2018/info/
So basically what I want is just the list of names with the team names as the headers e.g.
France Ireland
Jefferson Poirot Cian Healy
Guilhem Guirado Rory Best
... ...
Any help would be much appreciated!
I tried this in an advanced text editor: find every occurrence of two consecutive numbers (a ranking followed by the next position number) and replace it with a new line. The regex is
\d+\s+\d+
Once you are done replacing, you will be left with two names on each line, separated by a number. Then use the regex below to replace that number with a single tab
\s+\d+\s+
Hope that helps
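The same two replacements can be tried outside an editor. Below is a minimal Python sketch of them on a shortened piece of the sample string. The function name split_names is made up, and the extra cleanup step (dropping a leading position number and a trailing ranking) is my own addition, not part of the answer above.

```python
import re


def split_names(text):
    """Apply the two regex replacements and return one list of names per row."""
    # Step 1: a ranking followed by the next position number becomes a line break.
    lines = re.sub(r"\d+\s+\d+", "\n", text).split("\n")
    rows = []
    for line in lines:
        # Extra cleanup (assumption, not in the answer): drop a leading
        # position number and a trailing ranking left over at the edges.
        line = re.sub(r"^\s*\d+\s+|\s+\d+\s*$", "", line.strip())
        # Step 2: the number left between the two names becomes a tab.
        rows.append(re.sub(r"\s+\d+\s+", "\t", line).split("\t"))
    return rows


sample = "1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85"
rows = split_names(sample)
for row in rows:
    print(row)

# A row where the home player's ranking is missing has no number between the names.
missing = split_names("3 Rabah Slimani Tadhg Furlong 85")
```

Note that a row with a missing ranking, such as the Rabah Slimani one, comes back as a single unsplit string, because there is no number between the two names for the second regex to replace.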

How to get number from holt-winters forecast in Rstudio

As the title says, is there any way to get an exact number from a Holt-Winters forecast? For example, say I have a time-series object like this:
Date Total
6/1/2014 150
7/1/2014 219
8/1/2014 214
9/1/2014 47
10/1/2014 311
11/1/2014 198
12/1/2014 169
1/1/2015 253
2/1/2015 167
3/1/2015 262
4/1/2015 290
5/1/2015 319
6/1/2015 405
7/1/2015 395
8/1/2015 391
9/1/2015 345
10/1/2015 401
11/1/2015 390
12/1/2015 417
1/1/2016 375
2/1/2016 397
3/1/2016 802
4/1/2016 466
After storing it in variable hp, I used Holt Winters to make a forecast:
hp.ts <- ts(hp$Total, frequency = 12, start = c(2014,4))
hp.ts.hw <- HoltWinters(hp.ts)
library(forecast)
hp.ts.hw.fc <- forecast.HoltWinters(hp.ts.hw, h = 5)
plot(hp.ts.hw.fc)
However, what I need to know is exactly what the Total is predicted to be in 2016/05. Is there any way to get the exact value?
By the way, I noticed that the blue (forecast) line is NOT connected to the black line. Is that normal, or should I fix my code?
Thank you for reading.
Since you have already loaded library(forecast), there is no need to take the detour through HoltWinters(). The code below answers your questions directly:
hp.ts.hw <- hw(hp.ts)
hp.ts.hw.fc <- forecast(hp.ts.hw, h = 5)
plot(hp.ts.hw.fc)
hp.ts.hw.fc
         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
Mar 2016       546.5311 448.2997 644.7624 396.2992 696.7630
Apr 2016       623.7030 525.4716 721.9344 473.4711 773.9349
May 2016       671.8989 573.6675 770.1303 521.6670 822.1309
Jun 2016       667.3722 569.1408 765.6036 517.1402 817.6041
Jul 2016       500.0710 401.8396 598.3024 349.8390 650.3030
I'm not sure I understood your question, but you can get the forecast values with:
hp.ts.hw.fc$mean
You can use the accuracy() function to measure how good your results are.

Shiny App R - Reactive variables not responding to change in Input variables

I'm making an app that will predict an NFL running back's number of rush attempts and rush yards AFTER a season of 1800+ rush yards. I use slider inputs for the # of rushing yards and attempts, which gets run through lm() and predict() and returns estimates for next year's attempts and rush yards (I know it's not a very good predictor at all, but this is just an exercise in making a Shiny app). Here's the data from my excel file and then the code.
Player Yr. Team Attempts Att.Next.Yr Yards Yards.Next.Yr YPC YPC.Next.Yr
1 Adrian Peterson 2012 MIN 348 279 2097 1266 6.0 4.5
2 Chris Johnson 2009 TEN 358 316 2006 1364 5.6 4.3
3 LaDainian Tomlinson 2006 SD 348 315 1815 1474 5.2 4.7
4 Shaun Alexander 2005 SEA 370 252 1880 896 5.1 3.6
5 Tiki Barber 2005 NYG 357 327 1860 1662 5.2 5.1
6 Jamal Lewis 2003 BAL 387 235 2066 1006 5.3 4.3
7 Ahman Green 2003 GB 355 259 1883 1163 5.3 4.5
8 Ricky Williams 2002 MIA 383 392 1853 1372 4.8 3.5
9 Terrell Davis 1998 DEN 392 67 2008 211 5.1 3.1
10 Jamal Anderson 1998 ATL 410 19 1846 59 4.5 3.1
11 Barry Sanders 1997 DET 335 343 2053 1491 6.1 4.3
12 Barry Sanders 1994 DET 331 314 1883 1500 5.7 4.8
13 Eric Dickerson 1986 RAM 404 60 1821 277 4.5 4.6
14 Eric Dickerson 1984 RAM 379 292 2105 1234 5.6 4.2
15 Eric Dickerson 1983 RAM 390 379 1808 2105 4.6 5.6
16 Earl Campbell 1980 HOU 373 361 1934 1376 5.2 3.8
17 Walter Payton 1977 CHI 339 333 1852 1395 5.5 4.2
18 O.J. Simpson 1975 BUF 329 290 1817 1503 5.5 5.2
19 O.J. Simpson 1973 BUF 332 270 2003 1125 6.0 4.2
20 Jim Brown 1963 CLE 291 280 1863 1446 6.4 5.2
Server.R
# server.R
library(UsingR)
library(xlsx)
rawdata <- read.xlsx("RushingYards.xlsx", sheetIndex=1)
data <- rawdata[c(2:21),]
rownames(data) <- NULL
# Att
set.seed(1)
fitAtt <- lm(Att.Next.Yr ~ Yards + Attempts, data)
# Yds
set.seed(1)
fitYds <- lm(Yards.Next.Yr ~ Yards + Attempts, data)
shinyServer(
function(input, output) {
output$newPlot <- renderPlot({
iYards <- input$Yards
iAttempts <- input$Attempts
test <- data.frame(iYards,iAttempts)
names(test) <- c("Yards", "Attempts")
predictAtt <- predict(fitAtt, test)
predictYds <- predict(fitYds, test)
qplot(data=data, x=Attempts, y=Yards) +
geom_point(aes(x=predictAtt, y=predictYds, color="Estimate"))
output$renderYds <- renderPrint({predictYds})
output$renderAtt <- renderPrint({predictAtt})
})
}
)
UI.R
# ui.R
shinyUI(pageWithSidebar(
headerPanel("Rushing Projections"),
sidebarPanel(
sliderInput('Yards', 'How many yards rushed for this season',
value=1700, min=1500, max=2500, step=25,),
sliderInput('Attempts', 'How many attempts this season',
value=350, min=250, max=450, step=5,),
submitButton('Submit')
),
mainPanel(
plotOutput('newPlot'),
h3('Predicted rushing yards next year: '),
verbatimTextOutput("renderYds"),
h3('Predict attempts next year: '),
verbatimTextOutput("renderAtt")
)
))
The problem I'm having is I can't seem to output BOTH the plot (next year's estimates plotted in red against historical performances for running backs > 1800 rush yards) and the text of next year's estimated rushing yards and attempts at the same time. I can get one or the other to show up depending on where I put those statements. If I put
output$renderYds <- renderPrint({predictYds})
output$renderAtt <- renderPrint({predictAtt})
outside of output$newPlot (but still inside function(input, output)), I can get the plot to show up, and the point for next year's estimates changes as the input is changed, but I get the error messages
object 'predictYds' not found and object 'predictAtt' not found for the text. If I put those two lines inside output$newPlot (as I have in the code above), then those two text numbers show up with the correct values but the plot doesn't generate.
Can anyone help with this please?
I changed the structure of Server.R and now it works.
shinyServer(function(input, output) {
  predictYds <- function(Y, A) {
    test <- data.frame(Y, A)
    names(test) <- c("Yards", "Attempts")
    predict(fitYds, test)
  }
  predictAtt <- function(Y, A) {
    test <- data.frame(Y, A)
    names(test) <- c("Yards", "Attempts")
    predict(fitAtt, test)
  }
  output$newPlot <- renderPlot({
    newYards <- predictYds(input$Yards, input$Attempts)
    newAttempts <- predictAtt(input$Yards, input$Attempts)
    qplot(data=data, x=Attempts, y=Yards) +
      geom_point(aes(x=newAttempts, y=newYards, color="Estimate"))
  })
  output$renderYds <- renderPrint({predictYds(input$Yards, input$Attempts)})
  output$renderAtt <- renderPrint({predictAtt(input$Yards, input$Attempts)})
})
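Basically, predictYds and predictAtt were rewritten as normal functions that are called inside the render functions with the input variables. In the original code they were local variables of the renderPlot block, so they were not visible outside it, and with the two renderPrint assignments placed last inside that block, the plot was no longer the value the block returned.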

Fuzzy string matching in r

I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching of one column ('movie title') as well as on the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep' but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good when you have typos in strings.
In this scenario, only fuzzy matching may not provide good results e.g., A movie title 'toy story' in one dataset can be matched to 'toy story 2' in the other which is not right. So I need to consider the release date to make sure the movies that are matched are unique.
I want to know if there is a way to achieve this task without using a loop. In the worst case, if I have to use a loop, how can I make it work efficiently and as fast as possible?
I have tried the following code but it has taken an awful amount of time to process.
for(i in 1:nrow(test))
for(j in 1:nrow(test1))
{
test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
test$title, NA)
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the match threshold from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting the release dates; a difference of zero would mean the same release date.
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85, dataset_1$title, NA)
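One way to cut down the number of comparisons is to only compare titles that share a release year. Here is a minimal Python sketch of that idea. It uses difflib.SequenceMatcher from the standard library as a stand-in similarity score (it is not Jaro-Winkler), and the function name match_titles, the 0.85 default and the tiny example lists are assumptions for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def match_titles(left, right, threshold=0.85):
    """Match each (title, year) in left to its most similar title in right.

    Only titles with the same release year are compared, which avoids the
    full all-pairs loop. Returns a list of (left_title, matched_title_or_None).
    """
    # Index the right-hand titles by release year (the "blocking" step).
    by_year = defaultdict(list)
    for title, year in right:
        by_year[year].append(title)

    results = []
    for title, year in left:
        best, best_score = None, threshold
        for candidate in by_year.get(year, []):
            # Stand-in similarity in [0, 1]; swap in Jaro-Winkler if available.
            score = SequenceMatcher(None, title, candidate).ratio()
            if score >= best_score:
                best, best_score = candidate, score
        results.append((title, best))
    return results


# Made-up example rows: (lower-cased title, release year).
left = [("toy story", 1995), ("sliding doors", 1998), ("mirage", 1995)]
right = [("toy story 2", 1999), ("toy story", 1995), ("sliding dors", 1998)]

matched = match_titles(left, right)
for pair in matched:
    print(pair)
```

Because 'toy story 2' has a different release year, it is never even considered as a match for 'toy story', which is the case the question was worried about.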
