Count repeated words in a column using R

I have a dataframe with column 'NAME' like this:
NAME
Cybermart co
Hot burgers hot sandwiches
Landmark co
I want to add a new column to this dataframe indicating whether any word is repeated within each value of the 'NAME' column.
So the new column would be like this:
REPEATED_WORD
No
Yes
No
Is there any way I can do this?

# x is the character vector to check, e.g. x <- df1$NAME
vapply(strsplit(tolower(x), "\\s+"), anyDuplicated, 1L) > 0L
#[1] FALSE TRUE FALSE
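A minimal reproducible sketch of the one-liner above. It assumes the data frame is called df1 (the name used in the other answers) and holds the three example rows from the question. anyDuplicated() returns the index of the first duplicated element, or 0 if there is none, so comparing against 0L gives a logical vector:

```r
# Example data from the question; the name df1 is an assumption.
df1 <- data.frame(
  NAME = c("Cybermart co", "Hot burgers hot sandwiches", "Landmark co"),
  stringsAsFactors = FALSE
)

# Lower-case first so "Hot" and "hot" count as the same word,
# then split each string on runs of whitespace.
words <- strsplit(tolower(df1$NAME), "\\s+")

# anyDuplicated() gives 0L when no element is repeated.
has_repeat <- vapply(words, anyDuplicated, 1L) > 0L
has_repeat
```

vapply() is used instead of sapply() here because it checks that every result is a single integer.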

We can split the 'NAME' column on whitespace (\\s+), loop over the resulting list, and check whether the number of unique elements differs from the length of each list element. This gives a logical vector that is TRUE where a word is repeated. Convert the logical vector to "Yes"/"No" if required:
df1$REPEATED_WORD <- c("No", "Yes")[sapply(strsplit(df1$NAME, '\\s+'),
    function(x) length(unique(tolower(x))) != length(x)) + 1L]
df1$REPEATED_WORD
#[1] "No" "Yes" "No"
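If we are using regex, we can capture a run of non-whitespace characters ((\\S+)) and use a lookahead to check whether the same text appears again later in the string. Note that this pattern requires whitespace after the repeated word, so it does not detect a repeat that is the last word of the string.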
library(stringi)
stri_detect(tolower(df1$NAME), regex="(\\S+)(?=.*\\s+\\1\\s+)")
#[1] FALSE TRUE FALSE
It is better to leave it as a logical vector instead of converting to "Yes"/"No". If that is needed, just add 1 to the logical vector and use it as an index (or use ifelse) to change the TRUE values to "Yes" and FALSE to "No" (as shown above).
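For reference, a small sketch of the two conversions mentioned here. The vector has_repeat is a hypothetical stand-in for the logical result of any of the approaches above:

```r
# Hypothetical logical vector, as produced by any of the approaches above.
has_repeat <- c(FALSE, TRUE, FALSE)

# Option 1: ifelse() maps TRUE/FALSE to labels directly.
label_ifelse <- ifelse(has_repeat, "Yes", "No")

# Option 2: index into c("No", "Yes"); FALSE + 1L is 1 and TRUE + 1L is 2.
label_index <- c("No", "Yes")[has_repeat + 1L]

label_ifelse
label_index
```

Both give the same labels; ifelse() is arguably easier to read, while the indexing form avoids a function call.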

I had a similar solution to @akrun's 2nd one (pure regex). I'm going to post it in case it's useful to future searchers:
NAME <-
    c('Cybermart co',
      'Hot burgers hot sandwiches',
      'Landmark co'
    )
grepl("(?i)\\b(\\w+)\\s+.*\\1\\b", NAME, perl=TRUE)
## [1] FALSE TRUE FALSE
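One thing to watch with the pattern above: the back-reference only has a word boundary after it, so a word that ends with the captured text also counts as a repeat. The sketch below compares it with a variant that has a boundary on both sides. The stricter pattern and the extra test string "Hot shot" are my own additions, not part of the original answer:

```r
NAME <- c("Cybermart co", "Hot burgers hot sandwiches", "Landmark co", "Hot shot")

# Pattern from the answer above: the second occurrence only needs a
# word boundary after it, so the "hot" inside "shot" counts as a repeat.
loose <- grepl("(?i)\\b(\\w+)\\s+.*\\1\\b", NAME, perl = TRUE)

# Variant (an assumption, not from the original answer) with a word
# boundary on both sides of the back-reference.
strict <- grepl("(?i)\\b(\\w+)\\b.*\\b\\1\\b", NAME, perl = TRUE)

loose
strict
```

Whether the stricter form is needed depends on the data; for the three rows in the question both patterns agree.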

Related

Substring match when filtering rows

I have strings in file1 that match part of the strings in file2. I want to filter out the strings from file2 that partly match those in file1. Please see my attempt below. I'm not sure how to define a substring match in this way.
file1:
V1
species1
species121
species14341
file2
V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1
my try:
file1[!file1$V1 %in% file2$V1]

You cannot use the %in% operator in this way in R. It tests whether each element of one vector is equal to some element of another vector; unlike Python's in, it does not match substrings. Compare:
"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE
You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).
grepl("species1", "genus1|species1|strain1") # TRUE
There's an additional complication here: grepl does not accept a vector of patterns. If pattern has more than one element, only the first is used:
grepl(file1$V1, "genus1|species1|strain1")
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
argument 'pattern' has length > 1 and only the first element will be used
The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:
grepl("species1", file2$V1)
[1] TRUE TRUE TRUE FALSE FALSE
We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:
library(dplyr)
file1 |>
    rowwise() |> # This makes sure you only pass one element at a time to `grepl`
    mutate(
        in_v2 = any(grepl(V1, file2$V1))
    ) |>
    filter(!in_v2)
# A tibble: 1 x 2
# Rowwise:
#   V1           in_v2
#   <chr>        <lgl>
# 1 species14341 FALSE
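If you would rather avoid the dplyr dependency, the same check can be sketched in base R. This is only an illustration under the assumption that file1 and file2 are data frames with a character column V1, as in the question. I've also set fixed = TRUE so each pattern is treated as a literal string rather than a regular expression:

```r
# Example data from the question.
file1 <- data.frame(V1 = c("species1", "species121", "species14341"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"),
                    stringsAsFactors = FALSE)

# For each pattern in file1, test it against every string in file2.
# vapply() passes one pattern at a time, which avoids the
# "only the first element will be used" warning.
in_file2 <- vapply(
  file1$V1,
  function(p) any(grepl(p, file2$V1, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

# Rows of file1 that match nothing in file2.
not_found <- file1$V1[!in_file2]
not_found
```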

One way to get what you want is to extract the species field with rm_between() from the qdapRegex package and compare it with %in%. You can run the following code:
# Load library
library(qdapRegex)
# Extract the part of each file2$V1 string we are interested in (the text between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = TRUE))
# Which of these elements are in file1$V1?
elem.are <- v %in% file1$V1
# Drop the elements flagged in elem.are
file2$V1[!elem.are]
In v we save the part of each file2$V1 string we are interested in (the text between | |).
Then we save in elem.are a logical vector marking which of those names appear in file1$V1.
Finally, we omit those elements using file2$V1[!elem.are]. A logical index is used rather than file2$V1[-which(...)] because negative indexing with an empty which() result would drop every element.
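The same idea can be sketched without an extra package by splitting on the literal "|" with strsplit(). This assumes every string in file2$V1 has the form genus|species|strain, so the species is always the second field:

```r
# Example data from the question.
file1 <- data.frame(V1 = c("species1", "species121", "species14341"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"),
                    stringsAsFactors = FALSE)

# Split on the literal "|" (fixed = TRUE, because "|" is a regex
# metacharacter) and take the second field of each string.
middle <- vapply(strsplit(file2$V1, "|", fixed = TRUE), `[`, character(1), 2)

# Keep the rows whose species field is not listed in file1.
keep <- !(middle %in% file1$V1)
remaining <- file2$V1[keep]
remaining
```

Because this compares whole fields, "species1" no longer matches "species1442", which a plain substring search would.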

How to find a subset of names in another column?

I have a list of file names that look like this:
files$name <-c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf", "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")
I have a much longer list of IDs, which is a column of a larger dataframe. Some of them correspond to the file names in files$name, but they use different punctuation. The column looks like this:
df$repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62", "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99","RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")
I want to subset the list in df$repec_id so that I have only the strings that correspond to file names in files$name but they have different punctuation. In other words, I want an output that looks like this:
ID_subset <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62")
Initially, I thought that removing all the special characters from both lists and then comparing them would work. So I did this:
library(stringr)
files$name <- str_replace_all(files$name, "\\.pdf", "")
files$name <- str_replace_all(files$name, "[[:punct:]]", "")
df$repec_id <- str_replace_all(df$repec_id, "[[:punct:]]", "")
subset <- df[trimws(df$repec_id) %in% trimws(files$name), ]
However, I need a way of preserving the original structure of the IDs in df$repec_id because I need to provide a list of IDs from df$repec_id that are/ are not in the subset. Does anyone have any suggestions? Thanks in advance for your help!

We can use the following, which returns TRUE for each ID in df$repec_id that has no matching file name:
!gsub('[^[:alnum:]]+', '', df$repec_id) %in% gsub('\\.pdf$|[^[:alnum:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE

You can remove all punctuation from repec_id and name and use %in% to find the strings that match.
gsub('[[:punct:]]', '', df$repec_id) %in%
    gsub('\\.pdf$|[[:punct:]]', '', files$name)
#[1] TRUE TRUE TRUE TRUE FALSE FALSE
If you negate this with !, you get the strings that do not match.
!gsub('[[:punct:]]', '', df$repec_id) %in%
    gsub('\\.pdf$|[[:punct:]]', '', files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE
The result has the same length as df$repec_id and leaves the original column untouched, so you can use it to subset rows from df.
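To make the "preserve the original IDs" part concrete, here is a minimal sketch using the data from the question. The helper normalise() and the result names are hypothetical; the point is that the cleaned strings are only used for the comparison and are never written back to df:

```r
# Example data from the question.
files <- data.frame(
  name = c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
           "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
           "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  repec_id = c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
               "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
               "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
               "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a trailing ".pdf", then every
# non-alphanumeric character.
normalise <- function(x) gsub("[^[:alnum:]]", "", sub("\\.pdf$", "", x))

# Compare the cleaned strings, but index the original column.
matched <- normalise(df$repec_id) %in% normalise(files$name)

ids_in_files     <- df$repec_id[matched]
ids_not_in_files <- df$repec_id[!matched]
ids_not_in_files
```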

Find if list in data.table contains word in other column

I have a data.table. One of the columns is a list of words. I want to see if any of those words appear in another column, which is a single word, for each row. I feel like this should be easy but I am not getting the result I expect.
The difficulty seems to be the fact that the column includes lists, and possibly also that it is inconsistent (i.e. not lists all of the same length, some NAs, some that are just single words)?
Example data
words_data <- data.table(
    word = c("Lots", "of", "words", "some", "are", "names",
             "like", "Tom", "and", "Connolly", "or", "Pete", "Dawson"),
    names = c(list(c("Tom", "Connolly")),
              list(c("Tom", "Connolly")),
              list(c("Tom", "Connolly")),
              NA,
              NA,
              NA,
              list(c("Tom", "Connolly")),
              list(c("Tom", "Connolly", "Pete", "Dawson")),
              list(c("Jenny", "Rogers")),
              NA,
              list(c("Pete", "Dawson")),
              "Dawson",
              NA)
)
Desired output
A data.table filtered to rows where the value in the word column can be found in names column.
Therefore the only one that would match in this particular dataset would be the 8th row, which has "Tom" as the word and c("Tom", "Connolly", "Pete", "Dawson") as the names.
Using %in%
This just returns one line, but I don't know why this line.
> words_data[word %in% names]
     word names
1: Dawson    NA
Using unlist()
This does identify the words that are names, which suggests that the entire names column is unlisted and every word is checked against all of it. That seems closer, but I only want each word checked against the names in its own row.
> words_data[word %in% unlist(names)]
       word                    names
1:      Tom Tom,Connolly,Pete,Dawson
2: Connolly                       NA
3:     Pete                   Dawson
4:   Dawson                       NA
Using sapply
I thought using sapply() might help with the row-wise issue but the output is the same as just doing word %in% names.
> words_data[word %in% sapply(names, unlist)]
     word names
1: Dawson    NA

This is essentially just a hidden loop, but it will work:
words_data[mapply(`%in%`, word, names)]
# word names
#1: Tom Tom,Connolly,Pete,Dawson
I thought it might scale terribly, but it is okay:
words_data <- words_data[rep(1:13,1e5),]
nrow(words_data)
#[1] 1300000
system.time(words_data[mapply(`%in%`, word, names)])
# user system elapsed
# 1.329 0.016 1.345
The issue with most of the attempts in the question is that they do not compare word and names element by element, i.e. row 1 of word against row 1 of names, row 2 against row 2, and so on. Map or mapply takes care of this by iterating over several vectors in parallel:
mapply(paste, 1:3, letters[1:3])
#[1] "1 a" "2 b" "3 c"
The reasons why the other results didn't work are varied. E.g.:
%in%
This compares each value of word in turn to see if it exists in names exactly
words_data$word %in% words_data$names
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[8] FALSE FALSE FALSE FALSE FALSE TRUE
"Dawson" in row 13 of word matches "Dawson" in row 12 of names. It won't match anything else that is a list containing "Dawson" along with other values though:
"Dawson" %in% list(list("Dawson","Tom"))
#[1] FALSE
unlist
"...basically suggests that the entire names column is unlisted and all of the words checked against"
Yep, that's it.
sapply + unlist
The sapply here didn't do anything to the names object, because the unlist is only run inside every list item anyway:
identical(words_data$names, sapply(words_data$names, unlist))
#[1] TRUE
Then you can reference the %in% logic above for a reason as to why it didn't work as intended.
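To see the difference between the whole-column and the row-wise comparison without data.table, here is a small base R sketch. The five rows are a cut-down, hypothetical version of the example data:

```r
# Cut-down version of the example: one word and one set of names per row.
word <- c("of", "Tom", "or", "Pete", "Dawson")
names_col <- list(c("Tom", "Connolly"),
                  c("Tom", "Connolly", "Pete", "Dawson"),
                  c("Pete", "Dawson"),
                  "Dawson",
                  NA)

# Whole-column comparison: each word is looked up among the list elements
# themselves, so only an element that is exactly that single word matches.
whole_column <- word %in% names_col

# Row-wise comparison: word[i] is checked against names_col[[i]] only.
row_wise <- mapply(function(w, n) w %in% n, word, names_col,
                   USE.NAMES = FALSE)

whole_column
row_wise
```

The whole-column version flags "Dawson" in the last row because of the single-word element in the row before it, while the row-wise version flags only "Tom".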

See which vector in a list is contained within a vector from another list (finding people's name matches)

I have one list of vectors of people's names, where each vector just has the first and last name and I have another list of vectors, where each vector has the first, middle, last names. I need to match the two lists to find people who are included in both lists. Because the names are not in order (some vectors have the first name as the first value, while others have the last name as the first value), I would like to match the two vectors by finding which vector in the second list (full name) contains all the values of a vector in the first list (first and last names only).
What I have done so far:
#reproducible example
first_last_names_list <- list(c("boy", "boy"),
c("bob", "orengo"),
c("kalonzo", "musyoka"),
c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
c("stephen", "kalonzo", "musyoka"),
c("james", "bob", "orengo"),
c("lisamula", "silverse", "anami"))
First, I tried to make a function that checks whether one vector is contained in another vector (heavily based on the code from here).
my_contain <- function(values, x) {
    tx <- table(x)
    tv <- table(values)
    z <- tv[names(tx)] - tx
    if (all(z >= 0 & !is.na(z))) {
        paste(x, collapse = " ")
    }
}
# values would be the longer vector (from full_names_list)
# and x would be the shorter vector (from first_last_names_list)
Then, I tried to put this function within sapply() so that I can work with lists and that's where I got stuck. I can get it to see whether one vector is contained within a list of vectors, but I'm not sure how to check all the vectors in one list and see if it is contained within any of the vectors from a second list.
# testing with the first vector from first_last_names_list.
# Need to make it run through all the vectors from first_last_names_list.
sapply(1:length(full_names_list),
       function(i) any(my_contain(full_names_list[[i]],
                                  first_last_names_list[[1]]) ==
                       paste(first_last_names_list[[1]], collapse = " ")))
#[1] TRUE FALSE FALSE FALSE
Lastly- although it might be too much to ask in one question- if anyone could give me any pointers on how to incorporate agrep() for fuzzy matching to account for typos in the names, that would be great! If not, that's okay too, since I want to get at least the matching part right first.

Since you are dealing with lists, it is easier to collapse each vector into a single string so that regular expressions can be used. Sort the words within each vector first so that both lists use the same order; after that they can be matched easily:
lst <- sapply(first_last_names_list, function(x) paste0(sort(x), collapse = " "))
lst1 <- gsub("\\s|$", ".*", lst)
lst2 <- sapply(full_names_list, function(x) paste(sort(x), collapse = " "))
(lst3 <- Vectorize(grep)(lst1, list(lst2), value = TRUE, ignore.case = TRUE))
               boy.*boy.*             bob.*orengo.*        kalonzo.*musyoka.*         anami.*lisamula.*
           "boy boy juma"        "bob james orengo" "kalonzo musyoka stephen" "anami lisamula silverse"
Now if you want to link first_last_names_list and full_names_list then:
setNames(full_names_list[match(lst3, lst2)], sapply(first_last_names_list[grep(paste0(names(lst3), collapse = "|"), lst1)], paste, collapse = " "))
$`boy boy`
[1] "boy" "juma" "boy"
$`bob orengo`
[1] "james" "bob" "orengo"
$`kalonzo musyoka`
[1] "stephen" "kalonzo" "musyoka"
$`anami lisamula`
[1] "lisamula" "silverse" "anami"
where the names come from first_last_names_list and the elements from full_names_list. It would be easier for you to work with character vectors rather than lists.

Edit: I've modified the solution to satisfy the constraint that a repeated name such as 'John John' should not match against 'John Smith'.
apply(sapply(first_last_names_list, unlist), 2, function(x) {
    any(sapply(full_names_list, function(y) sum(unlist(y) %in% x) >= length(x)))
})
This solution still uses %in% and the apply functions, but it now does a kind of reverse search: for every element in the first_last names it looks at how many words in each name within the full_names list are matched. If this number is greater than or equal to the number of words in the first_last names item under consideration (always 2 words in your examples, but the code will work for any number), it returns TRUE. This logical vector is then aggregated with any() to pass back a single vector showing whether each first_last name is matched to any full name.
So for example, 'John John' would not be matched to 'John Smith Random', as only 1 of the 3 words in 'John Smith Random' is matched. However, it would be matched to 'John Adam John', as 2 of the 3 words in 'John Adam John' are matched, and 2 is equal to the length of 'John John'. It would also match 'John John John John John', as 5 of the 5 words match, which is greater than 2.
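One caveat with counting matches this way: 'John Smith' would also match 'John John Adam', because two words of the full name are found in the short name even though 'Smith' never appears. A sketch of a stricter check that compares per-word counts, in the spirit of the asker's my_contain, is below. The helper name contains_all is hypothetical:

```r
first_last_names_list <- list(c("boy", "boy"),
                              c("bob", "orengo"),
                              c("kalonzo", "musyoka"),
                              c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"))

# Hypothetical helper: TRUE if `full` contains every word of `short`
# at least as many times as `short` does.
contains_all <- function(short, full) {
  need <- table(short)
  # Count only the needed words; words missing from `full` get a count of 0.
  have <- table(factor(full, levels = names(need)))
  all(as.vector(have) >= as.vector(need))
}

# For each short name, is it contained in any of the full names?
matched <- vapply(
  first_last_names_list,
  function(s) any(vapply(full_names_list,
                         function(f) contains_all(s, f),
                         logical(1))),
  logical(1)
)
matched
```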

Instead of my_contain, try
x %in% values
Maybe also unlist and work with data frames? Not sure if you considered it--might make things easier:
# unlist to vectors
fl <- unlist(first_last_names_list)
fn <- unlist(full_names_list)
# grab individual names and convert to dfs;
# assumptions: first_last_names_list only contains 2-element vectors,
#              full_names_list only contains 3-element vectors
first_last_df <- data.frame(first_fl = fl[c(TRUE, FALSE)], last_fl = fl[c(FALSE, TRUE)])
full_name_df <- data.frame(first_fn = fn[c(TRUE, FALSE, FALSE)], mid_fn = fn[c(FALSE, TRUE, FALSE)], last_fn = fn[c(FALSE, FALSE, TRUE)])

Or you could do this:
first_last_names_list <- list(c("boy", "boy"),
                              c("bob", "orengo"),
                              c("kalonzo", "musyoka"),
                              c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"),
                        c("musyoka", "jeremy", "kalonzo")) # added just to test
# create copies of full_names_list without middle name;
# one list with matching name order, one with inverted order
full_names_short <- lapply(full_names_list, function(x) {x[c(1, 3)]})
full_names_inv <- lapply(full_names_list, function(x) {x[c(3, 1)]})
# check if names in full_names_list match either
full_names_list[full_names_short %in% first_last_names_list | full_names_inv %in% first_last_names_list]
In this case %in% does exactly what you want it to do: it checks whether the complete name vector matches. Note that this only compares the first and third elements of each full name, so a first/last pair sitting in the last two positions (as in "stephen", "kalonzo", "musyoka") is not matched.
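On the fuzzy-matching part of the question, none of the answers above cover it, so here is a rough sketch rather than a tested recipe. It uses adist() from the utils package (the edit-distance function that agrep() is related to) and treats a short name as contained in a full name if every word has a counterpart within max_dist edits. The helper fuzzy_contains, the threshold of 1, and the misspelt example are all assumptions, and repeated words are not handled specially:

```r
# Hypothetical helper: TRUE if every word in `short` is within `max_dist`
# edits of some word in `full`.
fuzzy_contains <- function(short, full, max_dist = 1) {
  # Rows correspond to words in `short`, columns to words in `full`.
  d <- utils::adist(short, full)
  all(apply(d, 1, min) <= max_dist)
}

# "kalonso" is a deliberate typo for "kalonzo".
full_with_typo <- c("stephen", "kalonso", "musyoka")

with_typo_allowed <- fuzzy_contains(c("kalonzo", "musyoka"), full_with_typo)
exact_only        <- fuzzy_contains(c("kalonzo", "musyoka"), full_with_typo,
                                    max_dist = 0)
different_person  <- fuzzy_contains(c("bob", "orengo"), full_with_typo)

c(with_typo_allowed, exact_only, different_person)
```

A larger max_dist tolerates more typos but also raises the chance of matching different short names to each other, so the threshold would need tuning on real data.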

R: manipulating data.frames containing strings and booleans

I have a data.frame in R; it's called p. Each element in the data.frame is either TRUE or FALSE. My variable p has, say, m rows and n columns. Every row has exactly one TRUE element.
It also has column names, which are strings. What I would like to do is the following:
1. Wherever I see a TRUE in a row of p, replace it with the name of the corresponding column.
2. Collapse the data.frame, which now contains FALSEs and column names, to a single vector, which will have m elements.
I would like to do this in an R-thonic manner, so as to continue my enlightenment in R and contribute to a world without for-loops.
I can do step 1 using the following for loop:
for (i in seq(length(colnames(p)))) {
    p[p[,i]==TRUE,i]=colnames(p)[i]
}
but there's no beauty here and I have totally subscribed to this for-loops-in-R-are-probably-wrong mentality. Maybe wrong is too strong but they're certainly not great.
I don't really know how to do step 2. I kind of hoped that the sum of a string and FALSE would return the string but it doesn't. I kind of hoped I could use an OR operator of some kind but can't quite figure that out (Python responds to False or 'bob' with 'bob'). Hence, yet again, I appeal to you beautiful Rstats people for help!

Here's some sample data:
df <- data.frame(a=c(FALSE, TRUE, FALSE), b=c(TRUE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE))
You can use apply to do something like this:
names(df)[apply(df, 1, which)]
Or without apply by using which directly:
idx <- which(as.matrix(df), arr.ind=TRUE)
names(df)[idx[order(idx[,1]),"col"]]

Use apply to run which over each row, and use the resulting index to access the column names:
> df <- data.frame(a=c(TRUE,FALSE,FALSE),b=c(FALSE,FALSE,TRUE),
+                  c=c(FALSE,TRUE,FALSE))
> df
      a     b     c
1  TRUE FALSE FALSE
2 FALSE FALSE  TRUE
3 FALSE  TRUE FALSE
> colnames(df)[apply(df, 1, which)]
[1] "a" "c" "b"
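Another loop-free option, sketched here under the question's assumption that every row has exactly one TRUE, is max.col(), which returns the column index of the largest value in each row. The logical matrix is converted to numeric first, and a sanity check guards the one-TRUE-per-row assumption:

```r
df <- data.frame(a = c(FALSE, TRUE, FALSE),
                 b = c(TRUE, FALSE, FALSE),
                 c = c(FALSE, FALSE, TRUE))

# Logical -> numeric matrix (TRUE becomes 1, FALSE becomes 0).
m <- as.matrix(df) * 1

# Guard the assumption from the question: exactly one TRUE per row.
stopifnot(rowSums(m) == 1)

# Column index of the single 1 in each row, mapped to the column name.
result <- names(df)[max.col(m, ties.method = "first")]
result
```

ties.method = "first" only matters if the one-TRUE assumption is violated; with the default ("random") a row with several TRUEs would give a non-deterministic answer.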
