R Extract names from text

I'm trying to extract a list of rugby players' names from a string. The string contains all of the information from a table: the headers (team names) and the name of the player in each position for each team. It also has the player rankings, but I don't care about those.
Important: a lot of player rankings are missing. I found a solution, but it doesn't handle missing rankings (in the sample below, Rabah Slimani is the first player without a recorded ranking).
Note that the numbers 1-15 indicate positions, and there are always two names following each position (home player and away player).
Here's the sample string:
" Team Sheets # FRA France RPI IRE Ireland RPI 1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85 3 Rabah Slimani Tadhg Furlong 85 4 Arthur Iturria 82 Iain Henderson 84 5 Sebastien Vahaamahina 84 James Ryan 92 6 Wenceslas Lauret 82 Peter O'Mahony 93 7 Yacouba Camara 70 Josh van der Flier 64 8 Kevin Gourdon CJ Stander 91 9 Maxime Machenaud Conor Murray 87 10 Matthieu Jalibert Johnny Sexton 90 11 Virimi Vakatawa Jacob Stockdale 89 12 Henry Chavancy Bundee Aki 83 13 Rémi Lamerat Robbie Henshaw 78 14 Teddy Thomas Keith Earls 89 15 Geoffrey Palis Rob Kearney 80 Substitutes # FRA France RPI IRE Ireland RPI 16 Adrien Pelissie Sean Cronin 84 17 Dany Priso 70 Jack McGrath 70 18 Cedate Gomes Sa 71 John Ryan 86 19 Paul Gabrillagues 77 Devin Toner 90 20 Marco Tauleigne Dan Leavy 80 21 Antoine Dupont 92 Luke McGrath 22 Anthony Belleau 65 Joey Carbery 86 23 Benjamin Fall Fergus McFadden "
Note - it comes from here: https://www.rugbypass.com/live/six-nations/france-vs-ireland-at-stade-de-france-on-03022018/2018/info/
So basically what I want is just the list of names with the team names as the headers e.g.
France Ireland
Jefferson Poirot Cian Healy
Guilhem Guirado Rory Best
... ...
Any help would be much appreciated!

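I tried this in an advanced text editor: find every occurrence of two consecutive numbers (a ranking followed by the next position number) and replace it with a newline. The regex is
\d+\s+\d+
Once you are done replacing, you are left with two names on each line, separated by a number. Then use the regex below to replace that number with a single tab:
\s+\d+\s+
Note that this relies on the rankings being present: where a ranking is missing there is no number to split on, so those rows still need fixing by hand.
Hope that helps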
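For anyone who would rather script the two replacements than run them in an editor, here is a minimal Python sketch of the same idea. The function name split_rows and the short sample are my own, and I added a step that trims the leading position number and the trailing ranking from each row. It only works on rows where both rankings are present, exactly like the editor approach.

```python
import re

# Short sample taken from the question, limited to rows where both rankings exist.
sample = "1 Jefferson Poirot 72 Cian Healy 82 2 Guilhem Guirado 78 Rory Best 85"

def split_rows(text):
    """Return (home, away) name pairs using the two regex steps from the answer."""
    # Step 1: a ranking followed by the next position number marks a row boundary.
    rows = re.sub(r"\d+\s+\d+", "\n", text).splitlines()
    pairs = []
    for row in rows:
        # Trim the leading position number and the trailing ranking, if any.
        row = re.sub(r"^\d+\s+|\s+\d+$", "", row.strip())
        # Step 2: the remaining number separates the home and away names.
        names = re.split(r"\s+\d+\s+", row)
        if len(names) == 2:
            pairs.append((names[0], names[1]))
    return pairs

pairs = split_rows(sample)
for home, away in pairs:
    print(home, away, sep="\t")
```

Rows with a missing ranking come out with the wrong number of pieces and are skipped by the len(names) == 2 check, so they are easy to spot and handle separately.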

Related

Questions about how to divide and find averages of a dataset

Let's say I have a dataset where I have a list of names and their ages
Tom 65
Sam 40
Sue 88
Kay 4
Jon 25
Lia 85
Ian 39
Joe 10
Bea 17
Jan 43
Jen 17
Ike 24
Jay 35
Cam 77
Jin 12
Ron 1
Ray 45
Leo 29
Ken 98
Mel 56
Amy 49
Joy 67
Ivy 3
Noe 14
Max 31
Jax 61
Lee 19
Ace 28
Ben 5
Guy 74
I'm trying to divide the dataset into ten equal-sized bins in descending order of age (e.g. the first bin will have Ken, Sue, and Lia, and the last bin will have Ben, Ivy, and Ron), and I want to find the average age for each bin (so the average age for the first bin would be 90.33). I was able to do this in MS Excel quite easily, but I'm not sure how to do it efficiently in R. Any suggestions?

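We can use ntile to create ten equal-sized groups in descending order of age and then summarise by taking the mean. (cut(v2, breaks = 10) would instead create ten equal-width age intervals, which do not each hold three people.) Here df1 is assumed to hold the names in v1 and the ages in v2.
library(dplyr)
df1 %>%
  group_by(grp = ntile(desc(v2), 10)) %>%
  summarise(v1 = list(v1), v2 = mean(v2))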
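If it helps to see the logic spelled out, here is a rough Python sketch of the same equal-sized binning: sort by age in descending order, cut the sorted list into chunks of equal length, and average each chunk. The function name bin_averages is made up for illustration, and the sketch assumes the number of rows divides evenly by the number of bins.

```python
from statistics import mean

ages = {
    "Tom": 65, "Sam": 40, "Sue": 88, "Kay": 4, "Jon": 25, "Lia": 85,
    "Ian": 39, "Joe": 10, "Bea": 17, "Jan": 43, "Jen": 17, "Ike": 24,
    "Jay": 35, "Cam": 77, "Jin": 12, "Ron": 1, "Ray": 45, "Leo": 29,
    "Ken": 98, "Mel": 56, "Amy": 49, "Joy": 67, "Ivy": 3, "Noe": 14,
    "Max": 31, "Jax": 61, "Lee": 19, "Ace": 28, "Ben": 5, "Guy": 74,
}

def bin_averages(data, n_bins):
    """Split data into n_bins equal-sized groups by descending value.

    Returns a list of (names, average) tuples, highest values first.
    Assumes len(data) is a multiple of n_bins.
    """
    ranked = sorted(data.items(), key=lambda kv: kv[1], reverse=True)
    size = len(ranked) // n_bins
    result = []
    for i in range(n_bins):
        chunk = ranked[i * size:(i + 1) * size]
        names = [name for name, _ in chunk]
        result.append((names, mean([age for _, age in chunk])))
    return result

bins = bin_averages(ages, 10)
for names, avg in bins:
    print(", ".join(names), round(avg, 2))
```

With the data exactly as listed, the first bin is Ken, Sue, Lia (average 90.33). The last bin comes out as Kay, Ivy, Ron rather than Ben, Ivy, Ron, because Kay (4) is younger than Ben (5).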

Scraping data from Australian Open stats

I'd like to scrape the stats from the official Australian Open website, specifically the data from the table, using the rvest library. However, when I use
read_html("https://ausopen.com/event-stats") %>% html_nodes("table")
it returns {xml_nodeset (0)}. How can I fix this? The website is a bit confusing because the data for every statistic is on one webpage.

There is a ton of information at https://prod-scores-api.ausopen.com/year/2021/stats which you can read with jsonlite::fromJSON. The difficult part is finding the relevant data that you need.
For example, to get the aces and player names you can do:
library(dplyr)
dat <- jsonlite::fromJSON('https://prod-scores-api.ausopen.com/year/2021/stats')
aces <- bind_rows(dat$statistics$rankings[[1]]$players)
dat$players %>%
  inner_join(aces, by = c('uuid' = 'player_id')) %>%
  select(full_name, value) %>%
  arrange(-value)
# full_name value
#1 Novak Djokovic 103
#2 Alexander Zverev 86
#3 Milos Raonic 82
#4 Daniil Medvedev 80
#5 Nick Kyrgios 69
#6 Alexander Bublik 66
#7 Reilly Opelka 61
#8 Jiri Vesely 58
#9 Andrey Rublev 57
#10 Lloyd Harris 55
#11 Aslan Karatsev 54
#12 Taylor Fritz 53
#...
#...
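The join step itself is not specific to R. Below is a small Python sketch of the same idea on in-memory records: match each ace count to a player by id, keep the name and value, and sort in descending order. The function name top_aces and the ids p1 to p3 are placeholders I made up; only the field names (uuid, player_id, full_name, value) and the three sample values come from the code and output above. It does not fetch the API, so how the real JSON is nested is not covered here.

```python
def top_aces(players, aces):
    """Inner-join ace counts to player names and sort by count, descending."""
    names = {p["uuid"]: p["full_name"] for p in players}
    rows = [
        (names[a["player_id"]], a["value"])
        for a in aces
        if a["player_id"] in names  # inner join: skip ids without a player record
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Placeholder ids; names and values copied from the output shown above.
players = [
    {"uuid": "p1", "full_name": "Alexander Zverev"},
    {"uuid": "p2", "full_name": "Novak Djokovic"},
    {"uuid": "p3", "full_name": "Milos Raonic"},
]
aces = [
    {"player_id": "p1", "value": 86},
    {"player_id": "p2", "value": 103},
    {"player_id": "p3", "value": 82},
]

ranking = top_aces(players, aces)
for name, value in ranking:
    print(name, value)
```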

How do I scrape elements at the same level from a website using bs4

I'm trying to scrape this list of books and authors from the following site:
https://www.theguardian.com/books/2019/sep/21/best-books-of-the-21st-century
I first make a soup object using:
soup = BeautifulSoup(r.content, features='lxml')
Then I inspect the specific element in Chrome and narrow down to the relevant part of the page with:
listicle = soup.find('div', class_='content__article-body from-content-api js-article__body')
Now, for the parts that are confusing:
1. The list has the index, the book title, and the author name all at the same level (h2). I can do a find('h2') to get to the index and then try to access the rest with next_sibling. Is there a better way?
2. Even if I figure out No. 1 above, I need to write a for loop to get to the rest of the entries in the listicle. I can't figure out how to do that, because the listicle variable I created would not iterate through each entry (book 1, book 2, etc.) but through each element (book 1 index, book 1 author, etc.).
I am completely new to web scraping, so apologies if this is a very dumb question.

One solution is to select all the <h2> elements and group them with the zip() function. For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.theguardian.com/books/2019/sep/21/best-books-of-the-21st-century'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# reverse the <h2> list so that the books come out in ascending order
h2s = soup.select('.content__article-body h2')[::-1]
# after reversing, every group of three <h2> tags is (author, title, index)
for author, title, i in zip(h2s[::3], h2s[1::3], h2s[2::3]):
    print('{:<5} {:<60} {}'.format(i.text, title.text, author.text))
Prints:
1 Wolf Hall by Hilary Mantel (2009)
2 Gilead by Marilynne Robinson (2004)
3 Secondhand Time by Svetlana Alexievich (2013), translated by Bela Shayevich (2016)
4 Never Let Me Go by Kazuo Ishiguro (2005)
5 Austerlitz by WG Sebald (2001), translated by Anthea Bell (2001)
6 The Amber Spyglass by Philip Pullman (2000)
7 Between the World and Me by Ta-Nehisi Coates (2015)
8 Autumn by Ali Smith (2016)
9 Cloud Atlas David Mitchell (2004)
10 Half of a Yellow Sun by Chimamanda Ngozi Adichie (2006)
11 My Brilliant Friend by Elena Ferrante (2011), translated by Ann Goldstein (2012)
12 The Plot Against America by Philip Roth (2004)
13 Nickel and Dimed by Barbara Ehrenreich (2001)
14 Fingersmith by Sarah Waters (2002)
15 The Sixth Extinction by Elizabeth Kolbert (2014)
16 The Corrections by Jonathan Franzen (2001)
17 The Road by Cormac McCarthy (2006)
18 The Shock Doctrine by Naomi Klein (2007)
19 The Curious Incident of the Dog in the Night‑Time by Mark Haddon (2003)
20 Life After Life by Kate Atkinson (2013)
21 Sapiens by Yuval Noah Harari (2011), translated by Harari with John Purcell and Haim Watzman (2014)
22 Tenth of December by George Saunders (2013)
23 The Noonday Demon by Andrew Solomon (2001)
24 A Visit from The Goon Squad by Jennifer Egan (2011)
25 Normal People by Sally Rooney (2018)
26 Capital in the Twenty First Century by Thomas Piketty (2013), translated by Arthur Goldhammer (2014)
27 Hateship, Friendship, Courtship, Loveship, Marriage by Alice Munro (2001)
28 Rapture by Carol Ann Duffy (2005)
29 A Death in the Family by Karl Ove Knausgaard (2009), translated by Don Bartlett (2012)
30 The Underground Railroad by Colson Whitehead (2016)
31 The Argonauts by Maggie Nelson (2015)
32 The Emperor of All Maladies by Siddhartha Mukherjee (2010)
33 Fun Home by Alison Bechdel (2006)
34 Outline by Rachel Cusk (2014)
35 The Hare with Amber Eyes by Edmund de Waal (2010)
36 Experience by Martin Amis (2000)
37 The Green Road by Anne Enright (2015)
38 The Line of Beauty by Alan Hollinghurst (2004)
39 White Teeth by Zadie Smith (2000)
40 The Year of Magical Thinking by Joan Didion (2005)
41 Atonement by Ian McEwan (2001)
42 Moneyball by Michael Lewis (2010)
43 Citizen: An American Lyric by Claudia Rankine (2014)
44 Hope in the Dark by Rebecca Solnit (2004)
45 Levels of Life by Julian Barnes (2013)
46 Human Chain by Seamus Heaney (2010)
47 Persepolis by Marjane Satrapi (2000-2003), translated by Mattias Ripa (2003-2004)
48 Night Watch by Terry Pratchett (2002)
49 Why Be Happy When You Could Be Normal? by Jeanette Winterson (2011)
50 Oryx and Crake by Margaret Atwood (2003)
51 Brooklyn by Colm Tóibín (2009)
52 Small Island by Andrea Levy (2004)
53 True History of the Kelly Gang by Peter Carey (2000)
54 Women & Power by Mary Beard (2017)
55 The Omnivore’s Dilemma by Michael Pollan (2006)
56 Underland by Robert Macfarlane (2019)
57 The Amazing Adventures of Kavalier and Clay by Michael Chabon (2000)
58 Postwar by Tony Judt (2005)
59 The Beauty of the Husband by Anne Carson (2002)
60 Dart by Alice Oswald (2002)
61 This House of Grief by Helen Garner (2014)
62 Mother’s Milk by Edward St Aubyn (2006)
63 The Immortal Life of Henrietta Lacks by Rebecca Skloot (2010)
64 On Writing by Stephen King (2000)
65 Gone Girl by Gillian Flynn (2012)
66 Seven Brief Lessons on Physics by Carlo Rovelli (2014)
67 The Silence of the Girls by Pat Barker (2018)
68 The Constant Gardener by John le Carré (2001)
69 The Infatuations by Javier Marías (2011), translated by Margaret Jull Costa (2013)
70 Notes on a Scandal by Zoë Heller (2003)
71 Jimmy Corrigan: The Smartest Kid on Earth by Chris Ware (2000)
72 The Age of Surveillance Capitalism by Shoshana Zuboff (2019)
73 Nothing to Envy by Barbara Demick (2009)
74 Days Without End by Sebastian Barry (2016)
75 Drive Your Plow Over the Bones of the Dead by Olga Tokarczuk (2009), translated by Antonia Lloyd-Jones (2018)
76 Thinking, Fast and Slow by Daniel Kahneman (2011)
77 Signs Preceding the End of the World by Yuri Herrera (2009), translated by Lisa Dillman (2015)
78 The Fifth Season by NK Jemisin (2015)
79 The Spirit Level by Richard Wilkinson and Kate Pickett (2009)
80 Stories of Your Life and Others by Ted Chiang (2002)
81 Harvest by Jim Crace (2013)
82 Coraline by Neil Gaiman (2002)
83 Tell Me How It Ends by Valeria Luiselli (2016), translated by Luiselli with Lizzie Davis (2017)
84 The Cost of Living by Deborah Levy (2018)
85 The God Delusion by Richard Dawkins (2006)
86 Adults in the Room by Yanis Varoufakis (2017)
87 Priestdaddy by Patricia Lockwood (2017)
88 Noughts & Crosses by Malorie Blackman (2001)
89 Bad Blood by Lorna Sage (2000)
90 Visitation by Jenny Erpenbeck (2008), translated by Susan Bernofsky (2010)
91 Light by M John Harrison (2002)
92 The Siege by Helen Dunmore (2001)
93 Darkmans by Nicola Barker (2007)
94 The Tipping Point by Malcolm Gladwell (2000)
95 Chronicles: Volume One by Bob Dylan (2004)
96 A Little Life by Hanya Yanagihara (2015)
97 Harry Potter and the Goblet of Fire by JK Rowling (2000)
98 The Girl With the Dragon Tattoo by Stieg Larsson (2005), translated by Steven T Murray (2008)
99 Broken Glass by Alain Mabanckou (2005), translated by Helen Stevenson (2009)
100 I Feel Bad About My Neck by Nora Ephron (2006)
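To see why the three slices line up, here is a tiny sketch on plain strings instead of tags, so it runs without requests or bs4. It assumes, as the code above implies, that the page lists each book as index, title, author, counting down towards 1; the list page_order holds only the last two entries.

```python
# Stand-in for the text of the <h2> tags, in page order (descending).
page_order = [
    "2", "Gilead", "by Marilynne Robinson (2004)",
    "1", "Wolf Hall", "by Hilary Mantel (2009)",
]

# Reversing flips both the order of the books and the order within each book,
# so each group of three becomes (author, title, index).
h2s = page_order[::-1]

# h2s[::3] takes items 0, 3, ... (authors), h2s[1::3] takes 1, 4, ... (titles),
# and h2s[2::3] takes 2, 5, ... (indexes); zip pairs them up per book.
books = list(zip(h2s[::3], h2s[1::3], h2s[2::3]))

for author, title, i in books:
    print("{:<5} {:<12} {}".format(i, title, author))
```

The same pattern works for any flat list where each record occupies a fixed number of consecutive items.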

R read.table vs read.csv

grades <- read.table("studentgrades.csv",header = TRUE,row.names="StudentID", sep = ",")
gradess <- read.csv("studentgrades.csv",header = TRUE,row.names="StudentID", sep = ",")
The result of read.table is:
grades
[1] First Last Math Science Social.Studies
<0 rows> (or 0-length row.names)
The result of read.csv is:
gradess
First Last Math Science Social.Studies
11 Bob Smith 90 80 67
12 Jane Weary 75 NA 80
10 Dan "Thornton" 65 75 70
40 Mary O'Leary 90 95 92
I just don't know why read.table does not give me the right result.

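The problem is the single quote (') in O'Leary in the Last column. By default, read.table treats both the double quote and the single quote as quoting characters (quote = "\"'"), whereas read.csv treats only the double quote as one (quote = "\""), so read.table reads the apostrophe as the start of a quoted string. You need to change the quote option in read.table to get the desired result.
If you use quote=NULL in read.table like below
grades <- read.table("studentgrades.csv",header = TRUE,sep=",",quote=NULL,row.names="StudentID")
then you get the desired result.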
> grades
First Last Math Science Social.Studies
11 Bob Smith 90 80 67
12 Jane Weary 75 NA 80
10 Dan "Thornton" 65 75 70
40 Mary O'Leary 90 95 92

unix telephone syntax with different ways of writing American phone numbers

OK, so I have a .txt file with names followed by their respective phone numbers, and I need to grab all the numbers that follow the ###-###-#### format, which I have accomplished with this code:
grep -E "([0-9]{3})-[0-9]{3}-[0-9]{4}" telephonefile_P2
but my problem is that there are also instances of
(###)-###-####
(###) ### ####
### ### ####
###-####
This is the file:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Star Club 49 040–31 77 78 0
Lolita Spengler (816) 756 8657
Hoffman's Kleider 049 37 1836 027
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Hermann's Speilhaus 49 25 8377 1765
Hal Kubrick 44 1289 332934
Sister Sue 978 0672
Auggie Keller 49 089/594 393
JCCC 913-469-8500
This is my desired output:
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
and I don't know how to account for these alternate forms...
obviously new to Unix, please be gentle!

$ awk '/(^|[[:space:]])(\(?[0-9]{3}\)?[- ])?[0-9]{3}[- ][0-9]{4}([[:space:]]|$)/' file
Sam Spade (212)-756-1045
Daffy Duck 312 450 2856
Mom 354-2015
Lolita Spengler (816) 756 8657
Dr. Harold Kranzler 765-986-9987
Ralph Spoilsport's Motors 967 882 6534
Sister Sue 978 0672
JCCC 913-469-8500
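For comparison, here is the same pattern as a Python sketch, in case the filtering needs to happen inside a script. The pattern is a direct translation of the awk regex: an optional three-digit area code (with or without parentheses) followed by a hyphen or space, then three digits, a hyphen or space, and four digits, with whitespace or a line boundary on both sides. The names PHONE and us_numbers are my own.

```python
import re

# Translation of the awk pattern; (?:...) groups are non-capturing.
PHONE = re.compile(
    r"(?:^|\s)"                    # start of line or whitespace
    r"(?:\(?[0-9]{3}\)?[- ])?"     # optional area code: 212- / (212)- / (212) / 212
    r"[0-9]{3}[- ][0-9]{4}"        # ###-#### or ### ####
    r"(?:\s|$)"                    # whitespace or end of line
)

def us_numbers(lines):
    """Keep only the lines that contain an American-style phone number."""
    return [line for line in lines if PHONE.search(line)]

lines = [
    "Sam Spade (212)-756-1045",
    "Daffy Duck 312 450 2856",
    "Mom 354-2015",
    "Star Club 49 040–31 77 78 0",
    "Lolita Spengler (816) 756 8657",
    "Hoffman's Kleider 049 37 1836 027",
    "Dr. Harold Kranzler 765-986-9987",
    "Ralph Spoilsport's Motors 967 882 6534",
    "Hermann's Speilhaus 49 25 8377 1765",
    "Hal Kubrick 44 1289 332934",
    "Sister Sue 978 0672",
    "Auggie Keller 49 089/594 393",
    "JCCC 913-469-8500",
]

matches = us_numbers(lines)
print("\n".join(matches))
```

The boundary checks at both ends are what keep out longer digit runs such as 8377 1765, where three digits followed by four could otherwise match in the middle of a number.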
