Splitting strings in between 3rd and 4th characters in R - r

I'm grabbing information from Wikipedia on Canadian Forward Sortation Areas (FSAs - the first 3 characters of Canadian postal codes) and the cities/areas they belong to. An example of this information is below:
library(rvest)
library(tidyverse)
URL <- paste0("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_", "K")
FSAs <- URL %>%
read_html() %>%
html_nodes(xpath = "//td") %>%
html_text()
head(FSAs)
[1] "K1AGovernment of CanadaOttawa and Gatineau offices (partly in QC)\n" "K2AOttawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
[3] "K4AOttawa(Fallingbrook)\n" "K6AHawkesbury\n"
[5] "K7ASmiths Falls\n" "K8APembrokeCentral and northern subdivisions\n"
The problem I'm facing is that I would like a data frame with the first 3 characters of each string in one column and the rest of the information in another. I thought there would be a solution involving a stringr function like str_split(), but that drops the matched pattern (the first 3 characters), which I of course don't want. In effect, I'm looking to split each string between its 3rd and 4th characters.
I've figured out this solution, with the last bit borrowed from this answer, but it's incredibly hackish. My question is, is there a better way of doing this?
FSAs %>%
enframe(name = NULL) %>%
separate(value, c(NA, "Location"), sep = "^...", remove = FALSE) %>%
separate(value, c("FSA", NA), sep = "(?<=\\G...)")
# A tibble: 195 x 2
FSA Location
<chr> <chr>
1 K1A "Government of CanadaOttawa and Gatineau offices (partly in QC)\n"
2 K2A "Ottawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
3 K4A "Ottawa(Fallingbrook)\n"
4 K6A "Hawkesbury\n"
5 K7A "Smiths Falls\n"
6 K8A "PembrokeCentral and northern subdivisions\n"
7 K9A "Cobourg\n"
8 K1B "Ottawa(Blackburn Hamlet / Pine View / Sheffield Glen)\n"
9 K2B "Ottawa(Britannia /Whitehaven / Bayshore / Pinecrest)\n"
10 K4B "Ottawa(Navan)\n"
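For what it's worth, tidyr::separate() also accepts a numeric sep, interpreted as a character position, so the split can be done in a single call with no regex. A minimal sketch producing the same two columns:
FSAs %>%
enframe(name = NULL) %>%
separate(value, into = c("FSA", "Location"), sep = 3)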

How to skip empty values with rbind.fill or replace them with another value?

first time asking on stack, so apologies for any mistakes in this question.
I am trying to scrape the suspension rates for all California high schools from https://dq.cde.ca.gov/dataquest/, the public data site for the California Department of Education.
In case my code isn't very clear, let me describe my scraping process. The data I'm interested in is on a different webpage for each school, with the only difference between the various URLs being the school CDS code. So, using another dataframe composed of school CDS codes, I substitute each school's CDS code into the URL and pull the data from the table on that school's webpage. If there isn't data for a school in a specific year, no table is pulled up and the scraper pulls in empty values.
The problem I am running into is that when the scraper pulls in empty values (when no data is found for a school in that year), I'm unable to continue binding the scraped data into my results dataframe.
I have two possible ways I think might solve the problem, but haven't been able to figure out the code for either of them.
First, is there a way to have my scraper skip those school CDS codes when no data is found (and html_text() therefore returns an empty value), or to replace those empty values with NAs?
Secondly, is there a way to use rbind.fill so that any empty values it finds are turned into NAs or some other marker for missing data?
Any help would be appreciated, thanks.
Code
#Initial Dataframe
library(rvest)
library(plyr) # for rbind.fill()
CDS.code = c("01611190130229", "12626870111922", "19643031935618")
school = c("Alameda High", "American Indian Academy", "Mayfair High")
source = data.frame(CDS.code, school)
df_schools = data.frame() # results dataframe to bind into
for (page_result in source$CDS.code) {
link = paste0("https://dq.cde.ca.gov/dataquest/dqCensus/DisSuspCount.aspx?year=2020-21&agglevel=School&cds=", page_result)
page = read_html(link)
school_id = page_result
#Columns for Data
hs_name = page %>%
html_nodes("tr:nth-child(2) a") %>%
html_text()
total_suspensions = page %>%
html_nodes("#ContentPlaceHolder1_grdTotals tr:nth-child(2) td:nth-child(3)") %>%
html_text()
df_schools = rbind.fill(df_schools, data.frame(
school_id,
total_suspensions,
stringsAsFactors = FALSE))
}
I expected the missing values to be populated with NAs; I've tried replacing empty values with NA and a few other values. I've also tried to make the web scraping portion skip when no value is found, but it broke each time.
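One minimal fix, keeping the original loop, is to coerce the zero-length scrape result to NA before binding. A sketch, assuming the loop above otherwise as written:
total_suspensions = page %>%
html_nodes("#ContentPlaceHolder1_grdTotals tr:nth-child(2) td:nth-child(3)") %>%
html_text()
# html_text() returns character(0) when the table is missing; turn that into NA
if (length(total_suspensions) == 0) total_suspensions <- NA_character_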
Here is how you can skip over empty tables with purrr::possibly(). If the function encounters an empty table, it will produce NA instead of the sum of the third column.
library(rvest)
library(httr2)
library(tidyverse)
library(magrittr)
Sample data
source = tibble(
CDS.code = c("01611190130229", "12626870111922", "19643031935618"),
school = c("Alameda High", "American Indian Academy", "Mayfair High")
)
# A tibble: 3 × 2
CDS.code school
<chr> <chr>
1 01611190130229 Alameda High
2 12626870111922 American Indian Academy
3 19643031935618 Mayfair High
Scraper function
get_susp <- function(cds_code) {
cat("SCraping CDS:", cds_code, "\n")
str_c(
"https://dq.cde.ca.gov/dataquest/dqCensus/DisSuspCount.aspx?year=2020-21&agglevel=School&cds=",
cds_code
) %>%
request() %>%
req_perform() %>%
resp_body_html() %>%
html_table() %>%
nth(8) %>% # Pluck the 8th table on the page
mutate(across(3, as.numeric)) %$% # Convert its 3rd column to numeric
sum(TotalSuspensions, na.rm = TRUE) # Sum of total suspensions
}
Create a new column with the sum of total suspensions for each high school:
source %>%
mutate(total_susp = map_dbl(
CDS.code, possibly(get_susp, otherwise = NA_integer_)
))
# A tibble: 3 × 3
CDS.code school total_susp
<chr> <chr> <dbl>
1 01611190130229 Alameda High 5
2 12626870111922 American Indian Academy NA
3 19643031935618 Mayfair High 0
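For context, possibly() simply catches any error thrown by get_susp() and returns the otherwise value instead, which appears to be how the empty-table case ends up as NA here. A rough base-R equivalent of the wrapped function (a sketch):
get_susp_safe <- function(cds_code) {
tryCatch(get_susp(cds_code), error = function(e) NA_real_)
}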

How to extract phrases with a word limit after a specific word?

I have the following text, and I want to extract 5 words after a specific word from a string vector:
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remains fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
I want to extract the next 5 words after the word Brazil or Argentina and then combine them into a single string.
I have used the following script, which gets the exact word but not the phrase after it:
pattern <- paste(my_teams, collapse = "|")
v <- unlist(str_extract_all(tolower(my_text), pattern))
paste(v, collapse=' ')
Any suggestions would be appreciated. Thanks!
You can use
library(stringr)
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remain fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
pattern <- paste0("(?i)\\b(?:", paste(my_teams, collapse = "|"), ")\\s+(\\S+(?:\\s+\\S+){4})")
res <- lapply(str_match_all(my_text, pattern), function (m) m[,2])
v <- unlist(res)
paste(v, collapse=' ')
# => [1] "from the top of the won the final within 90"
See the R demo and the regex demo. Note the use of str_match_all(), which keeps the captured texts. Also note that the pattern requires whitespace right after the team name, so occurrences followed by punctuation (like "Argentina," or "Brazil.") are skipped; that is why only two phrases are extracted here.
Details:
(?i) - case insensitive matching on
\b - a word boundary
(?:Brazil|Argentina) - one of the countries
\s+ - one or more whitespaces
(\S+(?:\s+\S+){4}) - Group 1: one or more non-whitespaces and then four repetitions of one or more whitespaces followed with one or more non-whitespaces.
Maybe not the best possible, but:
Split into a vector of words, remove non-word characters, lowercase (to match targets):
words <- strsplit(my_text, '\\s', perl = TRUE)[[1]] |>
gsub(pattern = "\\W", replacement = "", perl = TRUE) |>
tolower()
Find locations of targets, get strings, paste back together:
loc <- which(words %in% my_teams)
sapply(loc, \(i) words[(i+1):(i+5)], simplify = FALSE) |>
sapply(paste, collapse=" ")
[1] "have failed to dislodge brazil" "from the top of the"
[3] "won the final within 90" "in the last eight tournaments"
[5] "the 1998 finalists getting beyond"
Maybe you need one more paste(collapse = " ") at the end, if you want a single string?
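That final collapse would look like this (a sketch, continuing from the objects above):
sapply(loc, \(i) words[(i+1):(i+5)], simplify = FALSE) |>
sapply(paste, collapse = " ") |>
paste(collapse = " ")
# => [1] "have failed to dislodge brazil from the top of the won the final within 90 in the last eight tournaments the 1998 finalists getting beyond"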
Here is an alternative approach:
transform the vector to a tibble
use separate_rows to get one word per row
create a helper column x with the lowercased word
make groups starting at brazil or argentina
remove group == 0 (words before the first match)
keep words 2 to 6 in each group
finally, summarise:
my_teams <- tolower(c("Brazil", "Argentina"))
library(dplyr)
library(tidyr)
tibble(my_text = my_text) %>%
separate_rows(my_text, sep = " ") %>%
mutate(x = tolower(my_text)) %>%
group_by(group = cumsum(grepl(paste(my_teams, collapse = "|"), x))) %>%
filter(group > 0) %>%
slice(2:6) %>%
summarise(x = paste(my_text, collapse = " "))
group x
<int> <chr>
1 1 have failed to dislodge
2 2 from the top of the
3 3 won the final within 90
4 4 In the last eight tournaments
5 5 the 1998 finalists, getting beyond

R: find words from tweets in Lexicon, count them and save number in dataframe with tweets

I have a data set of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also created a self-made lexicon (formal_lexicon) of around 1 million words, all in a formal language style. Now, I want to write a small piece of code that counts, per tweet, how many of its words (if any) also appear in that lexicon.
tweets_data:
Content
1 "Blablabla"
2 "Hi my name is"
3 "Yes I need"
.
.
.
50176 "TEXT50176"
formal_lexicon:
X
1 "admittedly"
2 "Consequently"
3 "Furthermore"
.
.
.
1000000 "meanwhile"
The output should thus look like:
Content Lexicon
1 "TEXT1" 1
2 "TEXT2" 3
3 "TEXT3" 0
.
.
.
50176 "TEXT50176" 2
Should be a simple for loop like:
for(sentence in tweets_data$Content){
for(word in sentence){
if(word %in% formal_lexicon){
...
}
}
}
I don't think "word" works here (iterating over a length-1 string just gives back the whole sentence, not its words), and I'm not sure how to count the matches into a new column. Can anyone help?
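For reference, a minimal working version of that loop idea (a sketch, assuming tweets_data$Content and formal_lexicon$X as shown above); the key step is splitting each tweet into words first:
lexicon_words <- tolower(formal_lexicon$X)
tweets_data$Lexicon <- NA_integer_
for (i in seq_len(nrow(tweets_data))) {
# split the tweet on non-word characters and lowercase it
words <- tolower(unlist(strsplit(tweets_data$Content[i], "\\W+")))
# count how many of the tweet's words appear in the lexicon
tweets_data$Lexicon[i] <- sum(words %in% lexicon_words)
}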
structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
You can try something like this:
library(tidytext)
library(dplyr)
# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
tweets_data <- c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
# put in a data.frame your tweets
tweets_data_df <- data.frame(Content = tweets_data, id = 1:length(tweets_data))
tweets_data_df %>%
# get the word
unnest_tokens(txt, Content) %>%
# add a field that count if the word is in lexicon - keep the 0 -
mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
# grouping
group_by(id) %>%
# summarise
summarise(cnt = sum(pres)) %>%
# put back the texts
left_join(tweets_data_df) %>%
# reorder the columns
select(id, Content, cnt)
With result:
Joining, by = "id"
# A tibble: 6 x 3
id Content cnt
<int> <chr> <dbl>
1 1 "#barackobama Thank you for your incredible grace in leadership a~ 0
2 2 "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles Co~ 0
3 3 "2017 resolution: to embody authenticity!" 0
4 4 "Happy Holidays! Sending love and light to every corner of the ea~ 0
5 5 "Damn, it's hard to wrap presents when you're drunk. cc #santa" 0
6 6 "When my whole fam tryna have a peaceful holiday " 0
Hope this is useful for you:
library(magrittr)
library(dplyr)
library(tidytext)
# Data frame with tweets, including an ID
tweets <- data.frame(
id = 1:3,
text = c(
'Hello, this is the first tweet example to your answer',
'I hope that my response help you to do your task',
'If that is the case, please upvote and mark as the correct answer'
)
)
lexicon <- data.frame(
word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)
# Counting words in tweets that are present in your lexicon
in_lexicon <- tweets %>%
# To separate every word in your tweets into its own row
tidytext::unnest_tokens(output = 'words', input = text) %>%
# Determining if a word is in your lexicon
dplyr::mutate(
in_lexicon = words %in% lexicon$word
) %>%
dplyr::group_by(id) %>%
dplyr::summarise(words_in_lexicon = sum(in_lexicon))
# Binding count and the original data
dplyr::left_join(tweets, in_lexicon)
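With this sample data, each tweet contains exactly two lexicon words, so the joined result should show words_in_lexicon = 2 for every id, which is a quick way to sanity-check the join.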

Better and easier way to find who spoke the top 10 anger words in conversation text

I have a dataframe that contains the variables 'AgentID', 'Type', 'Date', and 'Text'; a subset is as follows:
structure(list(AgentID = c("AA0101", "AA0101", "AA0101", "AA0101",
"AA0101"), Type = c("PS", "PS", "PS", "PS", "PS"), Date = c("4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019"), Text = c("I am on social security XXXX and I understand it can not be garnished by Paypal credit because it's federally protected.I owe paypal {$3600.00} I would like them to cancel this please.",
"My XXXX account is being reported late 6 times for XXXX per each loan I was under the impression that I was paying one loan but it's split into three so one payment = 3 or one missed payment would be three missed on my credit,. \n\nMy account is being reported wrong by all credit bureaus because I was in forbearance at the time that these late payments have been reported Section 623 ( a ) ( 2 ) States : If at any time a person who regularly and in the ordinary course of business furnishes information to one or more CRAs determines that the information provided is not complete or accurate, the furnisher must promptly provide complete and accurate information to the CRA. In addition, the furnisher must notify all CRAs that received the information of any corrections, and must thereafter report only the complete and accurate information. \n\nIn this case, I was in forbearance during that tie and document attached proves this. By law, credit need to be reported as of this time with all information and documentation",
"A few weeks ago I started to care for my credit and trying to build it up since I have never used my credit in the past, while checking my I discover some derogatory remarks in my XXXX credit report stating the amount owed of {$1900.00} to XXXX from XX/XX/2015 and another one owed to XXXX for {$1700.00} I would like to address this immediately and either pay off this debt or get this negative remark remove from my report.",
"I disputed this XXXX account with all three credit bureaus, the reported that it was closed in XXXX, now its reflecting closed XXXX once I paid the {$120.00} which I dont believe I owed this amount since it was an fee for a company trying to take money out of my account without my permission, I was charged the fee and my account was closed. I have notified all 3 bureaus to have this removed but they keep saying its correct. One bureau is showing XXXX closed and the other on shows XXXX according to XXXX XXXX, XXXX shows a XXXX, this account has been on my report for seven years",
"On XX/XX/XXXX I went on XXXX XXXX and noticed my score had gone down, went to check out why and seen something from XXXX XXXX and enhanced recovery company ... I also seen that it had come from XXXX and XXXX dated XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX ... I didnt have neither one before, I called and it the rep said it had come from an address Im XXXX XXXX, Florida I have never lived in Florida ever ... .I have also never had XXXX XXXX nor XXXX XXXX ... I need this taken off because it if affecting my credit score ... This is obviously identify theft and fraud..I have never received bills from here which proves that is was not done by me, I havent received any notifications ... if it was not for me checking my score I wouldnt have known nothing of this" )), row.names = c(NA, 5L), class = "data.frame")
First, I found the top 10 anger words using the following:
library(tm)
library(tidytext)
library(tidyverse)
library(sentimentr)
library(wordcloud)
library(ggplot2)
CS <- function(txt){
MC <- Corpus(VectorSource(txt))
SW <- stopwords('english')
MC <- tm_map(MC, tolower)
MC<- tm_map(MC,removePunctuation)
MC <- tm_map(MC, removeNumbers)
MC <- tm_map(MC, removeWords, SW)
MC <- tm_map(MC, stripWhitespace)
myTDM <- as.matrix(TermDocumentMatrix(MC))
v <- sort(rowSums(myTDM), decreasing=TRUE)
FM <- data.frame(word = names(v), freq=v)
row.names(FM) <- NULL
FM <- FM %>%
mutate(word = tolower(word)) %>%
filter(str_count(word, "x") <= 1)
return(FM)
}
DF <- CS(df$Text)
# using nrc
nrc <- get_sentiments("nrc")
# create final dataset
DF_nrc = DF %>% inner_join(nrc)
And then I created a vector of the top 10 anger words as follows:
TAW <- DF_nrc %>%
filter(sentiment=="anger") %>%
group_by(word) %>%
summarize(freq = mean(freq)) %>%
arrange(desc(freq)) %>%
top_n(10) %>%
select(word)
Next, I want to find which agents spoke these words most frequently and rank them. But I am confused about how to do that. Should I search for the words one by one and group by agent, or is there a better way? The result I am looking for is something like this:
AgentID Words_Spoken Rank
A0001 theft, dispute, money 1
A0001 theft, fraud, 2
.......
If you are more of a dplyr/tidyverse person, you can take an approach using some dplyr verbs, after converting your text data to a tidy format.
First, let's set up some example data with several speakers, one of whom speaks no anger words. You can use unnest_tokens() to take care of most of your text cleaning steps with its defaults, such as splitting tokens, removing punctuation, etc. Then remove stopwords using anti_join(). I show using inner_join() to find the anger words as a separate step, but you could join these up into one big pipe if you like.
library(tidyverse)
library(tidytext)
my_df <- tibble(AgentID = c("AA0101", "AA0101", "AA0102", "AA0103"),
Text = c("I want to report a theft and there has been fraud.",
"I have taken great offense when there was theft and also poison. It is distressing.",
"I only experience soft, fluffy, happy feelings.",
"I have a dispute with the hateful scorpion, and also, I would like to report a fraud."))
my_df
#> # A tibble: 4 x 2
#> AgentID Text
#> <chr> <chr>
#> 1 AA0101 I want to report a theft and there has been fraud.
#> 2 AA0101 I have taken great offense when there was theft and also poison.…
#> 3 AA0102 I only experience soft, fluffy, happy feelings.
#> 4 AA0103 I have a dispute with the hateful scorpion, and also, I would li…
tidy_words <- my_df %>%
unnest_tokens(word, Text) %>%
anti_join(get_stopwords())
#> Joining, by = "word"
anger_words <- tidy_words %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment == "anger"))
#> Joining, by = "word"
anger_words
#> # A tibble: 10 x 3
#> AgentID word sentiment
#> <chr> <chr> <chr>
#> 1 AA0101 theft anger
#> 2 AA0101 fraud anger
#> 3 AA0101 offense anger
#> 4 AA0101 theft anger
#> 5 AA0101 poison anger
#> 6 AA0101 distressing anger
#> 7 AA0103 dispute anger
#> 8 AA0103 hateful anger
#> 9 AA0103 scorpion anger
#> 10 AA0103 fraud anger
Now you know which anger words each person used, and the next step is to count them up and rank people. The dplyr package has fantastic support for exactly this kind of work. First you want to group_by() the person identifier, then calculate a couple of summarized quantities:
the total number of words (so you can arrange by this)
a pasted-together string of the words used
Afterwards, arrange by the number of words and make a new column that gives you the rank.
anger_words %>%
group_by(AgentID) %>%
summarise(TotalWords = n(),
WordsSpoken = paste0(word, collapse = ", ")) %>%
arrange(-TotalWords) %>%
mutate(Rank = row_number())
#> # A tibble: 2 x 4
#> AgentID TotalWords WordsSpoken Rank
#> <chr> <int> <chr> <int>
#> 1 AA0101 6 theft, fraud, offense, theft, poison, distressi… 1
#> 2 AA0103 4 dispute, hateful, scorpion, fraud 2
Do notice that with this approach, you don't have a zero entry for the person who spoke no anger words; they get dropped at the inner_join(). If you want them in the final data set, you will likely need to join back up with an earlier dataset and use replace_na().
Created on 2019-09-11 by the reprex package (v0.3.0)
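For completeness, that join-back step might look like this (a sketch, reusing the objects from above; replace_na() comes from tidyr, which library(tidyverse) already loads):
anger_words %>%
count(AgentID, name = "TotalWords") %>%
right_join(distinct(my_df, AgentID), by = "AgentID") %>%
mutate(TotalWords = replace_na(TotalWords, 0))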
Not the most elegant solution, but here's how you could count the words based on the line number:
library(stringr)
# write a new data.frame retaining the AgentID and Date from the original table
new.data <- data.frame(Agent = df$AgentID, Date = df$Date)
# using a for-loop to go through every row of text in the df provided.
for(i in seq(nrow(new.data))){ # i represents the row number of the original df
# write a temporary object (e101) that:
## checks whether the text in row i (df[i, "Text"]) contains each TAW$word, using stringr::str_detect
## sapply loops str_detect over every TAW$word, returning a logical vector
## TAW$word[...] then returns the words that were detected
e101 <- TAW$word[sapply(TAW$word, function(x) str_detect(df[i, "Text"], x))]
# write the number of returned words in e101 as a corresponding value in new data.frame
new.data[i, "number_of_TAW"] <- length(e101)
# concatenate the returned words in e101 as a corresponding value in new data.frame
new.data[i, "Words_Spoken"] <- ifelse(length(e101)==0, "", paste(e101, collapse=","))
}
new.data
# Agent Date number_of_TAW Words_Spoken
# 1 AA0101 4/1/2019 0
# 2 AA0101 4/1/2019 0
# 3 AA0101 4/1/2019 2 derogatory,remove
# 4 AA0101 4/1/2019 3 fee,money,remove
# 5 AA0101 4/1/2019 1 theft
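One caveat with this loop: str_detect() as used here matches substrings, so a lexicon word like "fee" would also be found inside "coffee". A hedged tweak that anchors each word at word boundaries:
e101 <- TAW$word[sapply(TAW$word, function(x)
str_detect(df[i, "Text"], regex(paste0("\\b", x, "\\b"), ignore_case = TRUE)))]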

How do I extract part of the data from a column and paste it in another column using R?

I want to extract part of the data from one column and paste it into another column using R.
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate the phone number from the country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the tidyverse, whose tidyr package provides the separate() function.
Note that I use NA instead of NULL inside the mobile vector; c() silently drops NULL values, which is why the original mobile vector ends up too short.
Also, I use the option stringsAsFactors = FALSE when creating the dataframe to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
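For reference, a base-R sketch of the same extraction, assuming every phone number appears as a "+digits" token inside country:
has_num <- grepl("\\+\\d+", data$country)
data$mobile[has_num] <- regmatches(data$country, regexpr("\\+\\d+", data$country))
data$country <- trimws(sub("\\s*\\+\\d+", "", data$country))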
