Repetition when scraping using rvest in R

I am trying to scrape text using rvest in R, and df1 is the output. For News 2, the text was split across 3 rows, which causes News 1 to be repeated in 2 extra rows. How can I join News 2 into one complete sentence?
> dput(df1)
structure(list(`News 1` = c("Nike faces social media storm in China over Xinjiang statement",
"Nike faces social media storm in China over Xinjiang statement",
"Nike faces social media storm in China over Xinjiang statement"
), `News 2` = c("Biden calls for assault weapon ban after", "Colorado",
"shooting")), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
My desired output has the whole sentence in a single row:
> df1
  News 1                                                          News 2
1 Nike faces social media storm in China over Xinjiang statement  Biden calls for assault weapon ban after Colorado shooting
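One possible way to get there, sketched in base R. The scraping code is not shown, so this only repairs the result, and it assumes that every row of df1 belongs to the same pair of headlines. The name out is made up for the illustration.

```r
# Rebuild the data from the question (check.names = FALSE keeps the spaces in the column names)
df1 <- data.frame(
  `News 1` = rep("Nike faces social media storm in China over Xinjiang statement", 3),
  `News 2` = c("Biden calls for assault weapon ban after", "Colorado", "shooting"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep the single distinct News 1 headline and glue the News 2 fragments together
out <- data.frame(
  `News 1` = unique(df1[["News 1"]]),
  `News 2` = paste(df1[["News 2"]], collapse = " "),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

print(out)
```

If the fragments come from nested HTML nodes, it may be cleaner to select the parent node in rvest so that the text arrives as one string in the first place, but that depends on the page structure.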

Related

using key word to label a new column in R

I need to add a new column, "Group", to my df data frame, based on keywords.
I tried using %in% but did not get the result I expected.
In this column, I want to label every row using some keywords
(from the keyword vectors below, or possibly from a separate keywords data frame).
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to create the new column "Group" from those keywords:
if a row matches keyword1, label it "Politics",
if it matches keyword2, label it "Food",
if it matches keyword3, label it "Sport",
if it matches keyword4, label it "Travel",
if it matches keyword5, label it "LGBT".
Can the matching also ignore case (ignore.case)?
Below is the expected output:
Title          Text        Group
Iran: How..    Iranian...  Politics
Deepak Nir..   For any...  Food
Phil Foden..   Stockpo...  Sport
The Danish..   Norwegi...  Travel
Slovakia L..   The men...  LGBT
Thanks to everyone who takes the time to help.

You could try this:
library(dplyr)
df %>%
rowwise %>%
mutate(
## add column with words found in title or text (splitting by non-word character):
words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
group = {
categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
## i indexes those items (=keyword vectors) of list 'categories'
## which share at least one word with column Title or Text (so that length > 0)
i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
## pick group name via index; join with ',' if more than one category applies
c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
}
)
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBT

Check this out. The basic idea is to turn each keyword* vector into a case-insensitive alternation pattern: (?i) makes the match case-insensitive, | joins the alternatives, and \\b before and after the group adds word boundaries, so that "caps" is matched but "capsize", for example, is not. Nested ifelse statements then assign the Group labels. Note that keyword5 is never tested here: any row that matches none of the first four patterns falls through to "LGBT".
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text),
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
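A variant of the same pattern idea, sketched in base R so that it runs without extra packages. It tests every keyword group, including the last one, and returns NA when nothing matches. The function name label_group, the named list keywords, and the short sample texts are made up for this illustration. The keyword vectors are copied from the question, with "McCafé" and "les" left out to keep the example short and ASCII-only.

```r
# Keyword groups from the question, stored as a named list (names become the labels)
keywords <- list(
  Politics = c("authorities", "Iranian", "Iraq", "control", "Riots"),
  Food     = c("McDonald's", "KFC", "fast food"),
  Sport    = c("caps", "trophies", "season", "seasons"),
  Travel   = c("travel", "landscape", "living", "spiritual"),
  LGBT     = c("LGBT", "lesbian", "rainbow", "Gay", "Bisexual", "Transgender")
)

# Return the first matching group label for each text, or NA if no group matches
label_group <- function(text, keywords) {
  # One logical column per group: does the text contain any keyword as a whole word?
  hits <- do.call(cbind, lapply(keywords, function(k) {
    pattern <- paste0("\\b(", paste(k, collapse = "|"), ")\\b")
    grepl(pattern, text, ignore.case = TRUE, perl = TRUE)
  }))
  apply(hits, 1, function(row) {
    if (any(row)) colnames(hits)[which(row)[1]] else NA_character_
  })
}

# Hypothetical short texts standing in for paste(Title, Text)
sample_text <- c(
  "Iranian authorities have been disrupting the internet",
  "before McDonald's and KFC came into the country",
  "scored two goals in 18 caps",
  "it literally translates to open-air living",
  "raised the rainbow flag over her office",
  "the boat may capsize"
)
groups <- label_group(sample_text, keywords)
print(groups)
```

With the question's data this could be called as df$Group <- label_group(paste(df$Title, df$Text), keywords). Keywords containing regex metacharacters would need escaping first, which this sketch does not do.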

Clean duplicate phone numbers in R dataframe column

We have a dataframe with a Phone column that contains phone numbers; however, the phone number is duplicated within the string in many of the rows:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242",
"256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020",
NA, "765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142",
"", "549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Title Phone
1 Head Coach (904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242
2 Athletic Trainer 256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849
7 Coach 516-302-1039
The correct output would remove phone number duplicates:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242",
"256-765-5020",
NA, "765.285.8142",
"", "549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Typically I would share our progress on this, but quite frankly we are lost as to how to even get started on this. Seems like a very difficult problem especially given (a) the \r\n\t\t\t\ that appear in the strings, (b) that there are NA and missing values and (c) not every row is duplicated, (d) different formats (some area codes, some with ., some with -, some with ()). Any recommendations on how to clean this column?

df$Phone = sub('\r.*', '', df$Phone)
Title Phone
1 Head Coach (904) 256-7242
2 Athletic Trainer 256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849
7 Coach 516-302-1039

We could replace each run of whitespace characters with a comma using gsub, split at that delimiter (,) and extract the first element:
df1$Phone <- sapply(strsplit(gsub("[\r\n\t]+", ",", df1$Phone), ","), \(x) x[1])
Output:
df1$Phone
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" NA "549-5849" "516-302-1039"
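Another option is trimws: set whitespace to a pattern that matches one or more of [\r\n\t] followed by any other characters (.*), so that everything from the first line break onward is trimmed. Unlike the strsplit version, this keeps the empty string as "" instead of turning it into NA: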
trimws(df1$Phone, whitespace = "[\r\n\t]+.*")
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" "" "549-5849" "516-302-1039"
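The answers above all keep whatever comes before the first line break. If a cell could hold two different numbers, a variant that splits on the whitespace run and keeps the distinct parts might be safer. This is only a sketch: the helper name dedupe_phone, the "; " separator, and the last sample value with two different numbers are assumptions added for illustration.

```r
# Split each cell on runs of \r, \n, \t and keep the distinct pieces
dedupe_phone <- function(x) {
  parts <- strsplit(x, "[\r\n\t]+")
  out <- vapply(parts, function(p) paste(unique(p), collapse = "; "), character(1))
  out[is.na(x)] <- NA_character_  # paste() would otherwise turn NA into the string "NA"
  out
}

phone <- c(
  "(904) 256-7242\r\n\t\t\t\r\n\t\t\t\t(904) 256-7242",  # duplicated number
  NA,                                                     # missing value
  "",                                                     # empty string
  "516-302-1039",                                         # already clean
  "549-5849\r\n\t\t549-5850"                              # hypothetical: two different numbers
)
cleaned <- dedupe_phone(phone)
print(cleaned)
```

NA stays NA and "" stays "", so the result lines up with the desired output in the question for the rows that only contain an exact duplicate.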

Duplicate row and string manipulation in R

I have a dataframe in R which has some rows as follows:
c("LouDobbs", "gen_jackkeane") || RT #LouDobbs: #AmericaFirst- #gen_jackkeane: The Taliban for 9 months have told their fighters to kill as many people as you can, to includ…
The above is an example with 2 columns (shown here with || as the separator): column 1 holds more than one username and column 2 holds the tweet text. I want each such row to be duplicated once per user (here, 2 rows), with a single user in column 1. This should happen for every row in the data frame where more than one user is listed against the tweet text.
structure(list(user = list("Dandhy_Laksono", c("LouDobbs", "gen_jackkeane"
), "DeepStateExpose", "AndruewJamess", "jrossman12", "BiLLRaY2019",
"DeepStateExpose", "Dandhy_Laksono", "DeepStateExpose", "DeepStateExpose"),
full_text = c("RT #Dandhy_Laksono: Sebagian pendukung Jokowi ini mengalami bagaimana fitnah \"komunis dan PKI\" digunakan selama pemilu.\n\nSekarang mereka me…",
"RT #LouDobbs: #AmericaFirst- #gen_jackkeane: The Taliban for 9 months have told their fighters to kill as many people as you can, to includ…",
"RT #DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…",
"RT #AndruewJamess: #BillOReilly #KamalaHarris is wrong. #realDonaldTrump has accomplished a lot. He set a record for incoherent toilet twe…",
"RT #jrossman12: #SaraCarterDC Pakistan won't allow that as you already know. Your husband and the other U.S. troops have been forced to fig…",
"RT #BiLLRaY2019: JOKOWI TIDAK MEMBUNUH KPK..!\nMarkibong…\"Selamat tinggal Taliban di dalam KPK. Kalian kalah lagi, kalah lagi..!\"\n\n#JumatBer…",
"RT #DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…",
"RT #Dandhy_Laksono: Sebagian pendukung Jokowi ini mengalami bagaimana fitnah \"komunis dan PKI\" digunakan selama pemilu.\n\nSekarang mereka me…",
"RT #DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…",
"RT #DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…"
)), row.names = c(NA, 10L), class = "data.frame")

We can use lengths to get the number of users in each element of the list column, then repeat each row's text that many times. This should be fast enough, as lengths is fast:
l1 <- lengths(df$user)
out <- data.frame(user = unlist(df$user), n = rep(l1, l1),
text = rep(df$full_text, l1))
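To see what the rep(x, l1) step does, here is the same approach on a tiny made-up data frame. The user names and tweet texts are placeholders, not data from the question.

```r
# Small stand-in for the question's data: a list column with 1, 2 and 1 users
df <- data.frame(full_text = c("tweet A", "tweet B", "tweet C"),
                 stringsAsFactors = FALSE)
df$user <- list("u1", c("u2", "u3"), "u4")

# Number of users per row
l1 <- lengths(df$user)

# unlist() flattens the users; rep(x, l1) repeats element i of x exactly l1[i] times
out <- data.frame(user = unlist(df$user),
                  n    = rep(l1, l1),
                  text = rep(df$full_text, l1),
                  stringsAsFactors = FALSE)
print(out)
```

The row for "tweet B" appears twice, once for each of its two users, and the n column records how many users the original row had.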

Extract date from texts in corpus R

I have a corpus object from which I want to extract dates so I can add them as a docvar.
The object looks like this
v1 <- c("(SE22-y -7 A go q ,, Document of The World Bank FOR OFFICIAL USE ONLY il I ( >I8.( )]i 1 t'f-l±E C 4'( | Report No. 9529-LSO l il .rt N ,- / . t ,!I . 1. 'i 1( T v f) (: AR.) STAFF APPRAISAL REPORT KINGDOM OF LESOTHO EDUCATION SECTOR DEVELOPMENT PROJECT JUNE 19, 1991 Population and Human Resources Division Southern Africa Department This document has a restricted distribution and may be used by reipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization.",
"Document of The World Bank Report No. 13611-PAK STAFF APPRAISAL REPORT PAKISTAN POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population and Human Resources Division Country Department I South Asia Region",
"I Toward an Environmental Strategy for Asia A Summary of a World Bank Discussion Paper Carter Brandon Ramesh Ramankutty The World Bank Washliington, D.C. (C 1993 The International Bank for Reconstruction and Development / THiE WORLD BANK 1818 H Street, N.W. Washington, D.C. 20433 All rights reserved Manufactured in the United States of America First printing November 1993",
"Report No. PID9188 Project Name East Timor-TP-Emergency School (#) Readiness Project Region East Asia and Pacific Region Sector Other Education Project ID TPPE70268 Borrower(s) EAST TIMOR Implementing Agency Address UNTAET (UN TRANSITIONAL ADMINISTRATION FOR EAST TIMOR) Contact Person: Cecilio Adorna, UNTAET, Dili, East Timor Fax: 61-8 89 422198 Environment Category C Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000 Projected Board Date June 20, 2000",
"Page 1 CONFORMED COPY CREDIT NUMBER 2447-CHA (Reform, Institutional Support and Preinvestment Project) between PEOPLE'S REPUBLIC OF CHINA and INTERNATIONAL DEVELOPMENT ASSOCIATION Dated December 30, 1992")
c1 <- corpus(v1)
The first thing I want to do is extract the first occurring date. Mostly it occurs as "Month Year" (December 1990) or "Month Day, Year" (JUNE 19, 1991), or with a typo (FREBRUARY 10, 1995), in which case the month could be discarded.
My code is a combination of "Extract date text from string" and "Extract Dates in any format from Text in R":
lapply(c1$documents$texts, function(x) anydate(str_extract_all(c1$documents$texts, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")))
and get the error:
Error in anytime_cpp(x = x, tz = tz, asUTC = asUTC, asDate = TRUE, useR = useR, : Unsupported Type
However, I do not know how to supply the date format. Furthermore, I don't really get how to write the correct regular expressions.
https://www.regular-expressions.info/dates.html & https://www.regular-expressions.info/rlanguage.html
other questions on this subject are:
Extract date from text
Need to extract date from a text file of strings in R
http://r.789695.n4.nabble.com/Regexp-extract-first-occurrence-of-date-in-string-td997254.html
Extract date from given string in r

str_extract_all(texts(c1)
, "(\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Nov(?:ember)?|Oct(?:ober)?|Dec(?:ember)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|(\\b(?:JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUN(?:E)?|JUL(?:Y)?|AUG(?:UST)?|SEP(?:TEMBER)?|NOV(?:EMBER)?|OCT(?:OBER)?|DEC(?:EMBER)?) (?:19[7-9]\\d|2\\d{3})(?=\\D|$))|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2},\\s+\\d{4})|(\\b(JAN(UARY)?|FEB(RUARY)?|MAR(CH)?|APR(IL)?|MAY|JUN(E)?|JUL(Y)?|AUG(UST)?|SEP(TEMBER)?|OCT(OBER)?|NOV(EMBER)?|DEC(EMBER)?)\\s+\\d{1,2},\\s+\\d{4})"
, simplify = TRUE)[,1]
This returns, for each document, the first occurrence of a date in either format, e.g. JUNE 19, 1991 or December 1990.
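A shorter pattern with the same intent can be sketched in base R by building the month alternation from the built-in month.name constant and matching case-insensitively. This is a simplification, not a drop-in replacement for the regex above: it only knows full month names (no abbreviations), does not restrict the year range, and the helper name first_date and the shortened sample texts are made up for the illustration.

```r
# Full month names joined into one alternation: January|February|...|December
months  <- paste(month.name, collapse = "|")
# Month, optional "day, ", then a four-digit year
pattern <- paste0("\\b(?:", months, ")\\s+(?:\\d{1,2},\\s+)?\\d{4}\\b")

# Return the first date-like match per string, or NA if there is none
first_date <- function(x) {
  m   <- regexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  hit <- as.integer(m) > 0          # regexpr() returns -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)      # regmatches() only returns the actual matches
  out
}

# Shortened excerpts of the texts in the question
texts_sample <- c(
  "STAFF APPRAISAL REPORT KINGDOM OF LESOTHO JUNE 19, 1991 Population and Human Resources",
  "United States of America First printing November 1993",
  "POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population",
  "Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000"
)
dates <- first_date(texts_sample)
print(dates)
```

The misspelled FREBRUARY is not matched, so that document yields NA. Recovering the year in such cases would need a separate fallback pattern.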

How to set new column based on multiple conditions in data.table?

I'm trying to collect catalogue information based on a text search: I search for a certain string in the column Text and put a matching description into a new column, C_Organization.
Here is the sample data:
# load packages:
pacman::p_load("data.table",
"stringr")
# make sample data:
DE <- data.table(c("John", "Sussan", "Bill"),
c("Text contains MIT", "some text with Stanford University", "He graduated from Yale"))
colnames(DE) <- c("Name", "Text")
> DE
Name Text
1: John Text contains MIT
2: Sussan some text with Stanford University
3: Bill He graduated from Yale
Search for a certain string and make a new data.table with the new column:
mit <- DE[str_detect(DE$Text, "MIT"), .(Name, C_Organization = "MIT")]
yale <- DE[str_detect(DE$Text, "Yale"), .(Name, C_Organization = "Yale")]
stanford <- DE[str_detect(DE$Text, "Stanford"), .(Name, C_Organization = "Stanford")]
# bind them together:
combine_table <- rbind(mit, yale, stanford)
combine_table
Name C_Organization
1: John MIT
2: Bill Yale
3: Sussan Stanford
This pick-and-combine approach works fine, but it seems a little tedious. Is it possible to do it in one step in data.table?
Edit
Because of my limited data analysis skills and the unclean data, I need to make the question clearer.
The real data is a little more complicated:
(1) There are cases where a person belongs to more than one organization, like Jack, UC Berkeley, Bell lab.
(2) The same person from the same organization appears for different years, like Steven, MIT, 2011, Steven, MIT, 2014.
I want to figure out:
(1) How many people are from each organization. If one person belongs to more than one organization, assign the organization that appears most often (i.e. by popularity). For example, for John, MIT, AMS, Bell lab: if MIT appears 30 times, AMS 12 times and Bell lab 26 times, then make MIT his organization.
(2) How many people there are for each year. This is not directly related to my original question, but I don't want to throw away these records because I need them for later calculations.

An alternative solution that accounts for several matches in one text, operates rowwise and binds the matches together:
uni <- c("MIT","Yale","Stanford")
DE[,idx:=.I][, c_org := paste(uni[str_detect(Text, uni)], collapse=","), idx]
this gives:
> DE
Name Text idx c_org
1: John Text contains MIT 1 MIT
2: Sussan some text with Stanford University 2 Stanford
3: Bill He graduated from Yale, MIT, Stanford. 3 MIT,Yale,Stanford
4: Bill some text 4
The advantage of operating rowwise is evident when you have identical names in Name. When you do:
DE[, uni[str_detect(Text, uni)], Name]
you do not get the correct result:
Name V1
1: John MIT
2: Sussan Stanford
3: Bill MIT
4: Bill Stanford
=> you don't know which Bill you have in the fourth row. Moreover, Yale isn't included for the 'first' Bill (i.e. row 3 of the original dataset).
Used data:
DE <- structure(list(Name = c("John", "Sussan", "Bill", "Bill"), Text = c("Text contains MIT", "some text with Stanford University", "He graduated from Yale, MIT, Stanford.", "some text")), .Names = c("Name", "Text"), row.names = c(NA, -4L), class = c("data.table", "data.frame"))
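For readers without data.table at hand, the same row-by-row matching can be sketched in base R. This is an equivalent of the idea and not the answer's own code: it uses a plain data.frame and fixed-string matching with grepl instead of str_detect.

```r
# Same data as above, as a plain data.frame
DE <- data.frame(
  Name = c("John", "Sussan", "Bill", "Bill"),
  Text = c("Text contains MIT", "some text with Stanford University",
           "He graduated from Yale, MIT, Stanford.", "some text"),
  stringsAsFactors = FALSE
)
uni <- c("MIT", "Yale", "Stanford")

# For each text, keep the organizations it mentions and join them with commas
DE$c_org <- vapply(DE$Text, function(txt) {
  found <- uni[vapply(uni, grepl, logical(1), x = txt, fixed = TRUE)]
  paste(found, collapse = ",")
}, character(1), USE.NAMES = FALSE)

print(DE)
```

Matches come out in the order of uni, not in the order they appear in the text, and a row without any match gets an empty string.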
