Finding and replacing parts of person names in R

I have a dataframe with a column of politician names extracted from thousands of news articles. Each row is a specific article. I want to count which politicians are mentioned the most, but count each name only once per article (row).
The entity recognition algorithm returned these results. Now I have to convert the names into a standard form to be able to summarise and compare them.
Because there are at most 20 people I am interested in, I know the names and thought that, despite the effort, manually coding the patterns for each name might be the fastest way (I am happy to hear other ideas).
library(dplyr)
library(stringr)

# example data
persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels", "Ursula von,Trumps,Donald Trump,Leyen")
df <- data.frame(persons)

# pad the separators so the patterns below can anchor on a leading space
df <- df %>%
  mutate(persons = paste(" ", str_replace_all(persons, ",", " , "), sep = ""))

# example of extracting the names... and so on, you get the idea
str_replace_all(df$persons, c(" Trump(s)?" = "Donald Trump",
                              ", Trump(s)?" = ", Donald Trump",
                              "Donald Trumps" = "Donald Trump",
                              " Merkel(s)?" = "Angela Merkel"))
My desired output is to have just the full names in each row. In the end I would remove the duplicated names per row and then count the dataset as desired.
The data should look like this in the end:
persons <- c("Angela Merkel,Angela Merkel,Donald Trump,Ursula von der Leyen,Angela Merkel", "Ursula von der Leyen,Donald Trump,Donald Trump,Ursula von der Leyen")
I especially have a hard time with patterns for names that consist of more than two parts, like Ursula von der Leyen. What would be the best way to convert the names, and how would the replacement patterns look?
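One way to handle the multi-part names is a single named vector of patterns for str_replace_all(), with the longest variant first in each alternation so that, e.g., "Ursula von" wins over "Ursula". A minimal sketch against the example data; the patterns are illustrative, not exhaustive:
library(stringr)
# longest / most specific alternatives first
name_patterns <- c(
  "\\bUrsula von der Leyen\\b|\\bUrsula von\\b|\\bvon der Leyen\\b|\\bLeyen\\b|\\bUrsula\\b" = "Ursula von der Leyen",
  "\\b(Donald )?Trumps?\\b" = "Donald Trump",
  "\\b(Angela )?Merkels?\\b" = "Angela Merkel"
)
str_replace_all(persons, name_patterns)
# [1] "Angela Merkel,Angela Merkel,Donald Trump,Ursula von der Leyen,Angela Merkel"
# [2] "Ursula von der Leyen,Donald Trump,Donald Trump,Ursula von der Leyen"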
Edit
I have now written a function which makes sure there is only one instance of a name per row in my dataframe. Not really elegant code, but it's working.
clean_name <- function(x) {
  b <- unlist(strsplit(x, '[,]')) %>%
    str_squish()
  c <- b[!duplicated(b)]
  # split into full names (forename + surname) and bare surnames
  ganzer_name <- vector()
  nachname <- vector()
  for (person in c) {
    if (any(str_count(person, " ") == 0)) {
      nachname <- append(nachname, person)
    } else {
      ganzer_name <- append(ganzer_name, person)
    }
  }
  # check for genitive constructions with s, like "Angela Merkels"
  ganzer_name <- ganzer_name %>%
    str_sort() %>%
    str_replace("\\+", "")
  i <- 0
  aussortieren <- c("", "")
  while (i < (length(ganzer_name) - 1)) {
    i <- i + 1
    if (str_detect(ganzer_name[i + 1], paste0(ganzer_name[i], "*"))) {
      aussortieren <- append(aussortieren, ganzer_name[i + 1])
    }
  }
  ganzer_name <- ganzer_name[!ganzer_name %in% aussortieren]
  # check if a surname is already contained in one of the full names;
  # trimming the last character also covers the genitive s, like "Merkels"
  for (person in nachname) {
    if (!any(str_detect(ganzer_name,
                        paste0("\\Q", str_sub(person, end = nchar(person) - 1), "\\E")))) {
      ganzer_name <- append(ganzer_name, person)
    }
  }
  return(paste(ganzer_name, collapse = ","))
}
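Applied row-wise to the raw example strings (with dplyr and stringr loaded as above), the function deduplicates each row and drops surnames that are already covered by a full name:
sapply(persons, clean_name, USE.NAMES = FALSE)
# [1] "Angela Merkel,Trump,Ursula"    "Donald Trump,Ursula von,Leyen"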

Instead of turning the various combinations of names into a standard format, removing duplicates and then counting, here is a different approach.
We can use grepl for pattern matching and count in how many different news articles a politician occurs.
name <- c('Trump', 'Merkel')
sapply(name, function(x) sum(grepl(x, df$persons)))
#  Trump Merkel
#      2      1
Use ignore.case = TRUE in grepl if you want to make the comparison case insensitive.
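For example, a case-insensitive version of the count above:
sapply(name, function(x) sum(grepl(x, df$persons, ignore.case = TRUE)))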

Related

String match error "invalid regular expression, reason 'Out of memory'"

I have a table called df that is shaped like this (the actual table is 16,263 rows):
title                    date       brand
big farm house           2022-01-01 A
ranch modern             2022-01-01 A
town house               2022-01-01 C
Then I have a table called match_list like this (the actual list is 94,000 rows):
words_for_match
farm
town
clown
beach
city
pink
And I'm trying to filter the first table down to just the rows where the title contains a word from the words_for_match list. So I do this:
match_list <- match_list$words_for_match
match_list <- paste(match_list, collapse = "|")
match_list <- sprintf("\\b(%s)\\b", match_list)
df %>%
  filter(grepl(match_list, title))
But then I get the following error:
Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'
If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!
You could filter the titles sequentially: say you have 10 titles that match 'farm', then you do not need to evaluate those titles against the other words.
Here is a simple implementation:
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")
titles.to.keep <- c()
for (w in words_for_match) {
  w <- sprintf("\\b(%s)\\b", w)
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep
If you have a prior on the frequency of the words in match_list, it's better to start with the most frequent ones.
UPDATE
You can also mix this with your previous strategy to make it faster:
gr.size <- 20
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.words <- sapply(gr.words, function(words) {
  words <- paste(words, collapse = "|")
  sprintf("\\b(%s)\\b", words)
})
and then iterate over gr.words instead of words_for_match in the first code chunk, as sketched below.
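Combined with the loop above, the grouped version could look like this (same sequential filtering, just over batches of gr.size words at a time; the patterns are already boundary-wrapped by the sapply call):
titles.to.keep <- c()
for (w in gr.words) {
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep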

How do I grab the word in a character column after two consecutive word matches in R?

I have a data frame 'key_words' with vectors of pairs of words:
key_words <- data.frame(c1 = c('word1','word2'), c2 = c('word3','word4'), c3 = c('word5','word6'))
I would like to search for these pairs of key words in a character column 'text' in another data frame 'x', where each row can be a few sentences long. I want to grab the word following two consecutive matches of a column in the key_words data frame and insert that value into a table at the same index where the match was found. For example, if 'word1' and 'word2' are found one after the other in text[1], then I want to grab the word that comes after them in text[1] and insert it into table[1].
I have tried splitting each row of 'text' into a list, separating on single spaces so that each word has its own index in each row. I have the following idea, which seems very inefficient, and I'm running into problems where the character value temp_list[k] is of length 0.
x <- x %>% mutate(text = strsplit(text, " "))
for (i in 1:ncol(key_words)) {
  word1 <- key_words[i, 1]
  word2 <- key_words[i, 2]
  for (j in 1:length(x$text)) {
    temp_list <- as.list(unlist(x$text[[j]]))
    for (k in 1:length(temp_list)) {
      if (word1 == temp_list[k]) {
        if (word2 == temp_list[k + 1]) {
          table$word_found[j] <- temp_list[k + 2]
        }
      }
    }
  }
}
Is there a better way to do this, or can I search the text column for 'word1 word2' and grab the next word, which can be of any length? I'm new to R and coding in general, but I know I should avoid nested loops like this. Any help would be appreciated, thanks!!
I would suggest creating a small function like this one, which returns the word following the occurrence of the pair 'w1 w2':
get_word_after_pair <- function(text, w1, w2) {
  stringr::str_extract(text, paste0("(?<=\\b", w1, "\\s", w2, "\\b\\s)\\w*(?=\\b)"))
}
and then you can do this:
data.frame(
  lapply(key_words, function(x) get_word_after_pair(texttable$text, x[1], x[2]))
)
Input:
(key_words is a list of word pairs, texttable is a data frame with a column text)
key_words <- list(pair1 = c('has','important'), pair2 = c('sentence','has'), pair3 = c('third','sentence'))
texttable <- data.frame(text = c("this sentence has important words that we must find",
                                 "this second sentence has important words to find",
                                 "this is the third sentence and it also has important words within"))
Output:
  pair1     pair2 pair3
1 words important  <NA>
2 words important  <NA>
3 words      <NA>   and

R - Haven - SPSS (.sav): Iterate over all columns and replace name and label of columns

For a project I needed to merge an Excel and an SPSS file with R.
Not sure if this was my best idea. I got the merge done; however, in the process I had to use attributes(col)$label as the name to make it work.
My final merged data.frame thus has far too long column names, including special characters (e.g. :).
Here are the first few examples of what the colnames of the current merged df look like:
colnames(combined_retro)
[1] "Zeitpunkt zu dem das Interview begonnen hat (Europe/Berlin)"
[2] "Studiencode: [01]"
[3] "Format"
[4] "Geschlecht"
[5] "Alter (direkt): Ich bin ... Jahre"
[6] "Staatsangehörigkeit"
So I created another data.frame, Naming_Back, which has two columns: Name and Label.
Naming_Back
Name Label
1 CASE Interview-Nummer (fortlaufend)
2 SERIAL Seriennummer (sofern verwendet)
3 REF Referenz (sofern im Link angegeben)
4 QUESTNNR Fragebogen, der im Interview verwendet wurde
5 MODE Interview-Modus
So now I would like to iterate over the columns of my merged data.frame combined_retro
and check whether the current column name (e.g. "Zeitpunkt zu dem das Interview begonnen hat (Europe/Berlin)") is present in the Label column of the second data.frame (Naming_Back).
If it is, I would like to exchange the current column name for the one provided by the Name column.
My current approach is the following loop:
for (i in 1:ncol(retro)) {  # for-loop over columns
  new_name_buffer <- Naming_Back %>%
    filter(Label == colnames(retro[, i]))
  if (!is_empty(new_name_buffer$Name)) {
    colnames(retro[, i]) <- new_name_buffer$Name
    print(colnames(retro[, i]))
    print(new_name_buffer$Name)
  }
}
Examples for the print commands from the loop
[1] "Geschlecht"
[1] "SD02"
[1] "Staatsangehörigkeit"
[1] "SD04"
[1] "Staatsangehörigkeit: Anders"
[1] "SD04_04"
So obviously the problem is the line colnames(retro[ , i]) <- new_name_buffer$Name, as it does not change the column name. Does anyone have a quick idea how to fix it?
EDIT: Found a solution by creating a character vector and filling it stepwise with either the abbreviated name, when available, or the old name if not:
new_col_names <- c()
for (i in 1:ncol(retro)) {  # for-loop over columns
  new_name_buffer <- Naming_Back %>%
    filter(Label == colnames(retro[, i]))
  if (!is_empty(new_name_buffer$Name)) {
    new_col_names <- c(new_col_names, new_name_buffer$Name)
  } else {
    new_col_names <- c(new_col_names, colnames(retro[, i]))
  }
}
colnames(retro) <- new_col_names
EDIT 2: Just found an alternative solution for overwriting a column name while iterating over the columns with a for loop: you can just use names(dataframe)[index] and assign a new value with <- "newColName".
for (i in 1:ncol(retro)) {  # for-loop over columns
  new_name_buffer <- Naming_Back %>%
    filter(Label == colnames(retro[, i]))
  if (!is_empty(new_name_buffer$Name)) {
    names(retro)[i] <- new_name_buffer$Name
    print(colnames(retro[, i]))
    print(new_name_buffer$Name)
  }
}
The problem with the original code (as correctly identified in the comments to the OP by @IRTFM) was that the assignment of the new column name, colnames(retro[ , i]) <- new_name_buffer$Name, was not working, because colnames does not work on an atomic vector.
I found a work-around to overwrite a column name while iterating over the columns of a data.frame with a for loop: call names(dataframe)[index] and assign a new column name with <- "newColName". In my example the important line thus looks like this:
Assigning a new column name
names(retro)[i] <- new_name_buffer$Name
The complete solution with the for loop:
for (i in 1:ncol(retro)) {  # for-loop over columns
  # check if a row with the label is available in the Naming_Back dataframe
  new_name_buffer <- Naming_Back %>%
    filter(Label == colnames(retro[, i]))
  # when a Name matching the label is found, replace the old name
  if (!is_empty(new_name_buffer$Name)) {
    names(retro)[i] <- new_name_buffer$Name
  }
}
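As a loop-free alternative, the whole renaming can also be done in one shot with base R's match(); a sketch, assuming the labels in Naming_Back are unique character strings:
idx <- match(names(retro), Naming_Back$Label)
names(retro) <- ifelse(is.na(idx), names(retro), Naming_Back$Name[idx])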

Remove words per year in a corpus

I am working with a corpus of speeches spanning several years (aggregated to the person-year level). I want to remove words that occur fewer than 4 times in a year (not remove them from the whole corpus, but only for the year in which they do not meet the threshold).
I have tried the following:
DT$text <- ifelse(grepl("1998", DT$session), mgsub(DT$text, words_remove_1998, ""), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), str_remove_all(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), drop_element(DT$text, words_remove_1998), DT$text)
However, none seem to work. mgsub just substitutes the whole speech with "" for 1998, whilst the other options give error messages. The reason removeWords does not work is that my words_remove_1998 vector is too large. I have tried to split the word vector and loop over the chunks (see code below), but R does not appear to like this (it runs forever).
group <- 100
n <- length(words_remove_1998)
r <- rep(1:ceiling(n / group), each = group)[1:n]
d <- split(words_remove_1998, r)
for (i in 1:length(d)) {
  DT$text <- ifelse(grepl("1998", DT$session),
                    removeWords(DT$text, c(paste(d[[i]]))),
                    DT$text)
}
Any suggestions for how to solve this?
Thank you for your help!
Reproducible example:
text <- rbind(c("i like ice cream"), c("banana ice cream is my favourite"), c("ice cream is not my thing"))
name <- rbind(c("Arnold Ford"), c("Arnold Ford"), c("Leslie King"))
session <- rbind("1998", "1999", "1998")
DT <- cbind(name, session, text)
words_remove_1998 <- c("like", "ice", "cream")
newtext <- rbind(c("i"), c("banana ice cream is my favourite"), c("is not my thing"))
DT <- cbind(DT, newtext)
My real word vector that I want removed contains 30k elements.
I ended up not using any of the wrapper functions, as none of them could handle the size of the data. Instead I did it the old-fashioned and simple way: separate the text into several rows, count the occurrences of each word per session (year) and person, then remove the rows whose counts fall below a threshold (the same limit I used to identify the vector of words I wanted to remove). Lastly, I aggregate the data back to its initial level (person-year).
This only works because I am removing words according to a threshold. If I had a list of words to remove that could not be handled this way, I would have been in more trouble.
DT_separate <- separate_rows(DT, text)
df <- DT_separate %>%
  dplyr::group_by(session, text) %>%
  dplyr::mutate(count = dplyr::n())
df <- df[df$count > 5, ]
df <- aggregate(
  text ~ x,  # where x is a person-year id
  data = df,
  FUN = paste, collapse = ' '
)
names(df)[names(df) == 'text'] <- 'text2'
DT <- left_join(DT, df, by = "x")
DT$text <- DT$text2
DT <- DT[, !(colnames(DT) %in% c("text2"))]
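The same idea can also be written as a single pipe; a sketch, assuming DT is a data frame and the person-year id column is called x as above:
library(dplyr)
library(tidyr)

DT_clean <- DT %>%
  separate_rows(text) %>%
  group_by(session, text) %>%
  filter(n() > 5) %>%   # per-year threshold, as above
  group_by(x) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")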

Using str_detect (or some other function) and some way to loop through a list to essentially perform a vlookup

I have been searching for a way to do this, and some results on here seem similar, but nothing seems to work, nor can I find a method that will loop through a list like a vlookup in Excel. I apologize if I have missed it.
I am trying to add a new column to a data set with mutate. It should look at one column using str_replace (or some other function if necessary) and loop through another list, replacing what it finds with the corresponding value in another column. Essentially a vlookup in Excel. It cannot be done in Excel, however, because the file is simply too large.
I can do a simple str_replace one at a time, but there are 502 possible options to choose from, so writing the code for that would take a very long time. Here is what I have so far:
testVendor <- vendorData %>%
  mutate(Addr1 = toupper(Addr1)) %>%
  mutate(NewAdd = str_replace(Addr1, 'STREET', 'ST'))
However, rather than me specifying STREET and then ST, I want it to loop through a list of common postal abbreviations and return the standard abbreviation.
An example would be
addr1 <- c('123 MAIN STREET', '123 GARDEN ROAD', '123 CHARLESTON BOULEVARD')
state_abbrv <- c('FL', 'CA', 'NY')
vendor <- data.frame(addr1, state_abbrv)
usps_name <- c('STREET', 'LANE', 'BOULEVARD', 'ROAD', 'TURNPIKE')
usps_abbrv <- c('ST', 'LN', 'BLVD', 'RD', 'TPKE')
usps <- data.frame(usps_name, usps_abbrv)
The ideal output would be a new column on the vendor data frame and would look like this:
Any assistance with this is wonderful, and please allow me to expand on the question if it is unclear what I am looking for.
Thank you in advance.
I would use a for loop:
usps[] = lapply(usps, as.character)
vendor$new_addr1 = as.character(vendor$addr1)
for (i in 1:nrow(usps)) {
  vendor$new_addr1 = str_replace_all(
    vendor$new_addr1,
    pattern = usps$usps_name[i],
    replacement = usps$usps_abbrv[i])
}
vendor
#                      addr1 state_abbrv           new_addr1
# 1          123 MAIN STREET          FL         123 MAIN ST
# 2          123 GARDEN ROAD          CA       123 GARDEN RD
# 3 123 CHARLESTON BOULEVARD          NY 123 CHARLESTON BLVD
To be extra safe, I'd add regex word boundaries to your patterns, as below, so that only whole words are replaced. (I assume you want AIRPLANE RD to stay AIRPLANE RD, not become AIRPLN RD.)
for (i in 1:nrow(usps)) {
  vendor$new_addr1 = str_replace_all(
    vendor$new_addr1,
    pattern = paste0("\\b", usps$usps_name[i], "\\b"),
    replacement = usps$usps_abbrv[i])
}
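Since str_replace_all also accepts a named vector of patterns, the loop can be collapsed into a single call; a sketch using the same word-boundary patterns:
library(stringr)
abbrev_map <- setNames(usps$usps_abbrv, paste0("\\b", usps$usps_name, "\\b"))
vendor$new_addr1 <- str_replace_all(as.character(vendor$addr1), abbrev_map)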
This might be some of the most confusing R code I have ever written, but it kind of solves the problem:
library(tidyverse)

df_phrases <- tribble(~phrases,
                      "testing this street for pests",
                      "this street better be lit")

df_lookup <- tribble(~word, ~replacement,
                     "street", "st",
                     "pests", "rats",
                     "lit", "well iluminated")

lookup_function <- function(phrase, df_lookup) {
  wordss <- phrase %>%
    str_split(" ")
  table_to_join <- tibble(word = wordss) %>% unnest(cols = word)
  table_to_join %>%
    left_join(df_lookup, by = "word") %>%
    mutate(new_vector = if_else(is.na(replacement), word, replacement)) %>%
    pull(new_vector) %>%
    str_flatten(collapse = " ")
}

df_phrases %>%
  mutate(test = phrases %>% map_chr(lookup_function, df_lookup))
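With the example tribbles above, this should return something like:
# # A tibble: 2 x 2
#   phrases                       test
#   <chr>                         <chr>
# 1 testing this street for pests testing this st for rats
# 2 this street better be lit     this st better be well iluminated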
