How to find a subset of names in another column? - r

I have a list of file names that look like this:
files$name <-c("RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf", "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf", "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf")
I also have a much longer list of IDs, stored as a column of a larger data frame. Some of these IDs correspond to the file names above, but with different punctuation. The column looks like this:
df$repec_id <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62", "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99","RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103")
I want to subset df$repec_id so that it keeps only the strings that correspond to file names in files$name, ignoring the differences in punctuation. In other words, I want an output that looks like this:
ID_subset <- c("RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84", "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20", "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62")
Initially, I thought that removing all the special characters from both lists and then comparing them would work. So I did this:
library(stringr)
files$name <- str_replace_all(files$name, "\\.pdf", "")
files$name <- str_replace_all(files$name, "[[:punct:]]", "")
df$repec_id <- str_replace_all(df$repec_id, "[[:punct:]]", "")
subset <- df[trimws(df$repec_id) %in% trimws(files$name), ]
However, I need a way of preserving the original structure of the IDs in df$repec_id, because I need to provide a list of the IDs from df$repec_id that are and are not in the subset. Does anyone have any suggestions? Thanks in advance for your help!

We can use
!gsub('[^[:alnum:]]+', '', df$repec_id) %in% gsub('\\.pdf$|[^[:alnum:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE

You can remove all punctuation from repec_id and name, then use %in% to find the strings that match.
gsub('[[:punct:]]', '', df$repec_id) %in%
gsub('\\.pdf$|[[:punct:]]', '',files$name)
#[1] TRUE TRUE TRUE TRUE FALSE FALSE
If you add the negation operator (!) to this, you get the strings that do not match.
!gsub('[[:punct:]]', '', df$repec_id) %in%
gsub('\\.pdf$|[[:punct:]]', '',files$name)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE
The result has the same length as df$repec_id, so you can use it directly to subset rows of df without modifying the original IDs.
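As a minimal sketch of how this answers the "are / are not in the subset" part of the question, the comparison can be done on normalized copies while the original IDs stay untouched. The helper name normalize_id and the result names ids_matched and ids_unmatched are hypothetical, introduced only for illustration; the data frames are rebuilt from the question's sample values.

```r
# Sample data from the question (only the columns that matter here)
files <- data.frame(name = c(
  "RePEc.aad.ejbejj.v.1.y.2010.i.0.p.84.pdf",
  "RePEc.aad.ejbejj.v.12.y.2017.i.2.p.1117.pdf",
  "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.17.20.pdf",
  "RePEc.aad.ejbejj.v.2.y.2011.i.0.p.60.62.pdf"
), stringsAsFactors = FALSE)

df <- data.frame(repec_id = c(
  "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:84",
  "RePEc:aad:ejbejj:v:12:y.2017:i:2:p:1117",
  "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:17-20",
  "RePEc:aad:ejbejj:v:2:y:2011:i:0:p:60-62",
  "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:99",
  "RePEc:aad.ejbejj:v:1:y:2010:i:0:p:103"
), stringsAsFactors = FALSE)

# Hypothetical helper: drop a trailing ".pdf", then drop all punctuation
normalize_id <- function(x) gsub("\\.pdf$|[[:punct:]]", "", x)

# Compare normalized copies; df$repec_id itself is never overwritten
in_files <- normalize_id(df$repec_id) %in% normalize_id(files$name)

ids_matched   <- df$repec_id[in_files]   # original IDs that have a file
ids_unmatched <- df$repec_id[!in_files]  # original IDs without a file
```

Because in_files is a plain logical vector, df[in_files, ] and df[!in_files, ] give the corresponding rows of the full data frame.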

Related

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check whether each of the strings below matches an element of vec:
"ADC|DFG" should return FALSE, because I need to match the exact pattern.
"ABC|ADC|1|5" should return TRUE, because it is a child of the first element of the vector.
I tried using grepl, but it also returns TRUE if I pass just ADC. Any help is appreciated.
grepl returns TRUE because the pipe character | is special in regex: a|b means "match a or b". All you need to do is escape it.
frtest<-c("ABC|ADC","ABC|ADC|1|2","ABC|ADG","ABC|ADG|2|5")
# making the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escaping all pipes
test <- gsub('\\|', '\\\\|', test)
# testing if any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)) )
# reassigning the names so they're readable
names(res) <- frtest
res
#>     ABC|ADC ABC|ADC|1|2     ABC|ADG ABC|ADG|2|5
#>        TRUE        TRUE        TRUE        TRUE
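To see the effect of the unescaped pipe on the question's data, here is a minimal sketch in base R comparing three calls. Using fixed = TRUE is an alternative to manual escaping, not part of the answer above; it makes grepl treat the whole pattern as literal text.

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

# Unescaped: the regex means "ADC" OR "DFG", so the first element matches
unescaped <- grepl("ADC|DFG", vec)

# Escaped pipe: the regex now looks for the literal text "ADC|DFG"
escaped <- grepl("ADC\\|DFG", vec)

# fixed = TRUE: the pattern is not interpreted as a regex at all
literal <- grepl("ADC|DFG", vec, fixed = TRUE)
```

Here unescaped is TRUE for the first element only, while escaped and literal are FALSE for both. A literal substring check would still accept "ADC|1", so on its own it does not enforce the parent/child rule from the question.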
For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
#     ADC|DFG ABC|ADC|1|5       ADC|1     ABC|ADC
#       FALSE        TRUE       FALSE        TRUE

If any value from the first column is matching (partial/ full) in an another column in R

I have two data frames, shown below. For every value in dat$v, I want to check whether it contains any of the values in dat1$v1, as a partial or full match, and get one logical result per row of dat.
dat <- data.frame(v=c('apple', 'le123', 'app', 'being', 'aple',"beiling"))
dat1 <- data.frame(v1=c('app','123', 'be'))
I have tried following two alternatives but without success
test <- mapply(grepl, pattern=dat1$v1, x=dat$v)
str_detect(as.character(dat$v),dat1)
the output I am getting is
TRUE TRUE FALSE FALSE FALSE TRUE
but the desired output I am looking for is
TRUE TRUE TRUE TRUE FALSE TRUE
How can I achieve this? Any help is appreciated.
We can paste the pattern column ('dat1$v1') into a single string by collapsing with "|", which matches if any of the patterns is found. In effect, it tests whether at least one of these patterns occurs in each element of the 'v' column of 'dat':
stringr::str_detect(as.character(dat$v),paste(as.character(dat1$v1), collapse="|"))
#[1] TRUE TRUE TRUE TRUE FALSE TRUE
Note: To match whole words only and avoid partial (substring) matches, wrap the pattern in word boundaries (\\b):
pat <- paste0("\\b(", paste(as.character(dat1$v1), collapse="|"), ")\\b")
stringr::str_detect(as.character(dat$v), pat)
That does not seem to be what the OP's data calls for, though, since the desired output relies on partial matches.
Update
If the pattern list is very long, we can loop over the patterns, get a list of logical vectors, and Reduce it to a single vector:
Reduce(`|`, lapply(as.character(dat1$v1), stringr::str_detect, string = as.character(dat$v)))
#[1] TRUE TRUE TRUE TRUE FALSE TRUE
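The same loop-and-Reduce idea can be sketched in base R. This variant assumes the patterns are meant as literal substrings, which the question does not state explicitly: with fixed = TRUE, a pattern containing regex metacharacters such as "." or "|" is matched as plain text. The names hits and any_hit are hypothetical.

```r
dat  <- data.frame(v = c("apple", "le123", "app", "being", "aple", "beiling"))
dat1 <- data.frame(v1 = c("app", "123", "be"))

# One logical vector per pattern, each the same length as dat$v.
# fixed = TRUE treats every pattern as literal text, not as a regex.
hits <- lapply(as.character(dat1$v1), grepl,
               x = as.character(dat$v), fixed = TRUE)

# Element-wise OR across all patterns
any_hit <- Reduce(`|`, hits)
any_hit
#[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
```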
Alternatively, you can use sqldf and express the partial match as a SQL join:
library(sqldf)
dat <- data.frame(v=c('apple', 'le123', 'app', 'being', 'aple','beiling'))
dat1 <- data.frame(v1=c('app','123', 'be'))
sqldf("SELECT dat.* FROM dat JOIN dat1 on dat.v like ('%' || dat1.v1 || '%')")
And the result would be:
        v
1   apple
2   le123
3     app
4   being
5 beiling

R: Checking if multiple elements of a vector appear in vector of strings

I'm trying to create a function that checks if all elements of a vector appear in a vector of strings. The test code is presented below:
test_values = c("Alice", "Bob")
test_list = c("Alice,Chris,Mark", "Alice,Bob,Chris", "Alice,Mark,Zach", "Alice,Bob,Mark", "Mark,Bob,Zach", "Alice,Chris,Bob", "Mark,Chris,Zach")
I would like the output for this to be FALSE TRUE FALSE TRUE FALSE TRUE FALSE.
I first thought I'd be able to switch the | to & in the command grepl(paste(test_values, collapse='|'), test_list) to get when Alice and Bob are both in the string instead of when either of them appear, but I was unable to get the correct answer.
I also would rather not use the command: grepl(test_values[1], test_list) & grepl(test_values[2], test_list) because the test_values vector will change dynamically (varying from length 0 to 3), so I'm looking for something to take that into account.
We can use Reduce with grepl
Reduce(`&`, lapply(test_values, grepl, test_list))
#[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
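Since the question says test_values can have length 0, note that Reduce over an empty list returns NULL. A minimal sketch that handles this case passes an initial value to Reduce. The function name all_present is hypothetical, and treating an empty test_values as "every string qualifies" is an assumption about the desired behaviour, not something the question specifies.

```r
test_values <- c("Alice", "Bob")
test_list <- c("Alice,Chris,Mark", "Alice,Bob,Chris", "Alice,Mark,Zach",
               "Alice,Bob,Mark", "Mark,Bob,Zach", "Alice,Chris,Bob",
               "Mark,Chris,Zach")

# Hypothetical helper: TRUE where every value occurs in the string.
# The third argument of Reduce is the initial value, so an empty
# `values` vector yields all TRUE instead of NULL (assumed semantics).
all_present <- function(values, strings) {
  Reduce(`&`,
         lapply(values, grepl, x = strings, fixed = TRUE),
         rep(TRUE, length(strings)))
}

all_present(test_values, test_list)
#[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
```

This is plain substring matching, so a value like "Bob" would also match "Bobby"; splitting the strings on "," first would be needed for exact name matching.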

Count repeated words in a column using R

I have a dataframe with column 'NAME' like this:
NAME
Cybermart co
Hot burgers hot sandwiches
Landmark co
I want to add a new column to this dataframe depending on:
whether there is any word that gets repeated in the 'name' column.
So the new column would be like this:
REPEATED_WORD
No
Yes
No
Is there any way I can do this?
vapply(strsplit(tolower(x), "\\s+"), anyDuplicated, 1L) > 0L
#[1] FALSE TRUE FALSE
We can split the 'NAME' column by whitespace (\\s+), loop over the resulting list, and check whether the number of unique elements differs from the length of each list element to get a logical vector. Then convert the logical vector to "Yes"/"No" (if required):
df1$REPEATED_WORD <- c("No", "Yes")[sapply(strsplit(df1$NAME, '\\s+'),
function(x) length(unique(tolower(x)))!=length(x)) + 1L]
df1$REPEATED_WORD
#[1] "No" "Yes" "No"
If we are using regex, we can capture non-white space elements ((\\S+)) and use regex lookarounds to check if there is any repeated word.
library(stringi)
stri_detect(tolower(df1$NAME), regex="(\\S+)(?=.*\\s+\\1\\s+)")
#[1] FALSE TRUE FALSE
It is better to leave the result as a logical vector instead of converting it to "Yes"/"No". If the conversion is needed, add 1 to the logical vector and use it to index c("No", "Yes") (as shown above), or use ifelse.
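A minimal sketch of the ifelse variant, keeping the logical vector around as an intermediate result. The data frame df1 is rebuilt here from the question's sample rows, and has_repeat is a hypothetical name.

```r
df1 <- data.frame(
  NAME = c("Cybermart co", "Hot burgers hot sandwiches", "Landmark co"),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when a word vector has no duplicates,
# otherwise the index of the first duplicated word
has_repeat <- vapply(
  strsplit(tolower(df1$NAME), "\\s+"),
  function(words) anyDuplicated(words) > 0L,
  logical(1)
)

# Convert to labels only at the end, if they are really needed
df1$REPEATED_WORD <- ifelse(has_repeat, "Yes", "No")
df1$REPEATED_WORD
#[1] "No"  "Yes" "No"
```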
I had a similar solution to @akrun's second one (pure regex). I'm posting it in case it's useful to future searchers:
NAME <-
c('Cybermart co',
'Hot burgers hot sandwiches',
'Landmark co'
)
grepl("(?i)\\b(\\w+)\\s+.*\\1\\b", NAME, perl=TRUE)
## [1] FALSE TRUE FALSE

R: manipulating data.frames containing strings and booleans

I have a data.frame in R called p. Each element in the data.frame is either TRUE or FALSE. p has, say, m rows and n columns, and every row contains exactly one TRUE element.
It also has column names, which are strings. What I would like to do is the following:
1. For every TRUE in p, replace it with the name of the corresponding column.
2. Collapse the data.frame, which now contains FALSEs and column names, into a single vector with m elements.
I would like to do this in an R-thonic manner, so as to continue my enlightenment in R and contribute to a world without for-loops.
I can do step 1 using the following for loop:
for (i in seq(length(colnames(p)))) {
p[p[,i]==TRUE,i]=colnames(p)[i]
}
but there's no beauty here, and I have totally subscribed to this for-loops-in-R-are-probably-wrong mentality. Maybe wrong is too strong, but they're certainly not great.
I don't really know how to do step 2. I kind of hoped that the sum of a string and FALSE would return the string but it doesn't. I kind of hoped I could use an OR operator of some kind but can't quite figure that out (Python responds to False or 'bob' with 'bob'). Hence, yet again, I appeal to you beautiful Rstats people for help!
Here's some sample data:
df <- data.frame(a=c(FALSE, TRUE, FALSE), b=c(TRUE, FALSE, FALSE), c=c(FALSE, FALSE, TRUE))
You can use apply to do something like this:
names(df)[apply(df, 1, which)]
Or without apply by using which directly:
idx <- which(as.matrix(df), arr.ind=TRUE)
names(df)[idx[order(idx[,1]),"col"]]
Use apply to find the index of the TRUE value in each row, then use those indices to look up the column names:
> df <- data.frame(a=c(TRUE,FALSE,FALSE),b=c(FALSE,FALSE,TRUE),
+ c=c(FALSE,TRUE,FALSE))
> df
      a     b     c
1  TRUE FALSE FALSE
2 FALSE FALSE  TRUE
3 FALSE  TRUE FALSE
> colnames(df)[apply(df, 1, which)]
[1] "a" "c" "b"
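Another loop-free option, not taken from the answers above, is base R's max.col, which returns the column index of the largest value in each row. The sketch below uses the question's sample data and relies on the stated assumption that every row has exactly one TRUE; the guard makes that assumption explicit, because a row with no TRUE would otherwise silently map to the first column. The names m and col_of_true are hypothetical.

```r
df <- data.frame(a = c(FALSE, TRUE, FALSE),
                 b = c(TRUE, FALSE, FALSE),
                 c = c(FALSE, FALSE, TRUE))

# Convert the logical data frame to a numeric 0/1 matrix
m <- as.matrix(df) * 1

# Guard: the approach is only meaningful with exactly one TRUE per row
stopifnot(rowSums(m) == 1)

# Column index of the single 1 in each row, mapped to the column name
col_of_true <- names(df)[max.col(m, ties.method = "first")]
col_of_true
#[1] "b" "a" "c"
```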
