R: how to find rows with a match in a list?

In R, if I have a data frame with a column for which each row entry contains a list, how can I search for those rows containing a match in that list?
For example, how can I return the indices for those rows containing "Algebra" (e.g. rows 1 and 3) in the following:
> df[1:3,]$classes
[[1]]
[1] "Algebra" "Calculus"
[[2]]
[1] "Geometry"
[[3]]
[1] "Geometry" "Quantum Mechanics" "Algebra"

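We can use sapply to loop over the list, use grepl to get a logical vector for each element, and wrap it in any so that each list element yields a single TRUE/FALSE. Note that grepl matches substrings, so 'Algebra' would also match an entry such as "Linear Algebra".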
sapply(df[1:3,]$classes, function(x) any(grepl('Algebra', x)))
#[1] TRUE FALSE TRUE
Or we can use %in% to return only a single TRUE/FALSE per list element.
sapply(df[1:3,]$classes, '%in%', x='Algebra')
#[1] TRUE FALSE TRUE
Another option is is.element
sapply(df[1:3,]$classes, is.element, el='Algebra')
#[1] TRUE FALSE TRUE
Or, as @Richard Scriven mentioned, == can be used:
sapply(df[1:3,]$classes, function(x) any(x == "Algebra"))
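Since the question asks for row indices, here is a minimal sketch of going from the logical vector to indices with which. The data frame is a hypothetical reconstruction of the one in the question (the id column is made up), and vapply is swapped in for sapply only to make the return type explicit.

```r
# Hypothetical data frame mirroring the question; 'id' is an invented column.
df <- data.frame(id = 1:3)
df$classes <- list(
  c("Algebra", "Calculus"),
  "Geometry",
  c("Geometry", "Quantum Mechanics", "Algebra")
)

# One TRUE/FALSE per row: is "Algebra" an exact element of that row's vector?
has_algebra <- vapply(df$classes, function(x) "Algebra" %in% x, logical(1))

# Row indices of the matches
idx <- which(has_algebra)
idx
# [1] 1 3

# The matching rows themselves
matched <- df[has_algebra, ]
```

%in% tests for exact equality, so unlike grepl it would not match "Linear Algebra".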

Related

Compare 2 strings in R

I have data as below:
vec <- c("ABC|ADC|1","ABC|ADG|2")
I need to check if below substring is present or not
"ADC|DFG", it should return false for this as I need to match exact pattern.
"ABC|ADC|1|5" should return True as this is a child element for the first element in vector.
I tried using grepl but it returns true if I just pass ADC as well, any help is appreciated.
grepl returns TRUE because the pipe character | is special in a regex: a|b means "match a or b". All you need to do is escape it.
frtest<-c("ABC|ADC","ABC|ADC|1|2","ABC|ADG","ABC|ADG|2|5")
# making the last number and its pipe optional
test <- gsub('(\\|\\d)$', '(\\1)?', frtest)
# escaping all pipes
test <- gsub('\\|', '\\\\|', test)
# testing if any of the strings is in vec
res <- sapply(test, function(x) any(grepl(x, vec)) )
# reassigning the names so they're readable
names(res) <-frtest
#>     ABC|ADC ABC|ADC|1|2     ABC|ADG ABC|ADG|2|5
#>        TRUE        TRUE        TRUE        TRUE
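As a small sketch of why the unescaped pipe misleads grepl, the lines below compare three calls on the vec from the question. Using fixed = TRUE is an alternative to manual escaping that is not part of the answer above; it makes grepl treat the whole pattern as a literal string.

```r
vec <- c("ABC|ADC|1", "ABC|ADG|2")

# Unescaped: "|" is alternation, so this asks for "ADC" OR "DFG"
as_regex <- grepl("ADC|DFG", vec)
as_regex
# [1]  TRUE FALSE

# Escaped pipe: the literal text "ADC|DFG" must be present
escaped <- grepl("ADC\\|DFG", vec)
escaped
# [1] FALSE FALSE

# fixed = TRUE: no regex interpretation at all, same result as escaping
literal <- grepl("ADC|DFG", vec, fixed = TRUE)
literal
# [1] FALSE FALSE
```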
For two vectors vec and test, this returns a vector which is TRUE if either the corresponding element of test is the start of one of the elements of vec, or one of the elements of vec is the start of the corresponding element of test.
vec <- c("ABC|ADC|1","ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")
colSums(sapply(test, startsWith, vec) | t(sapply(vec, startsWith, test))) > 0
# ADC|DFG ABC|ADC|1|5 ADC|1 ABC|ADC
# FALSE TRUE FALSE TRUE
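The same prefix test can also be written one test string at a time, which may be easier to read than the matrix version. This is only a sketch of the same idea with vapply; the name related is made up for illustration.

```r
vec  <- c("ABC|ADC|1", "ABC|ADG|2")
test <- c("ADC|DFG", "ABC|ADC|1|5", "ADC|1", "ABC|ADC")

# For each test string s: does s start with some element of vec,
# or does some element of vec start with s?
related <- vapply(
  test,
  function(s) any(startsWith(s, vec) | startsWith(vec, s)),
  logical(1)
)
related
#     ADC|DFG ABC|ADC|1|5       ADC|1     ABC|ADC
#       FALSE        TRUE       FALSE        TRUE
```

startsWith compares plain strings rather than regular expressions, so the pipes need no escaping here.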

Remove the numbers < 4 digits in list in a data frame in R

I have a data frame like this, and I need to remove the values with fewer than 4 digits from the item column:
department item
xyz009 c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
Output
department item
xyz009 676547,434567,567369,987654,54546676732
Thank you for your help
Maybe you can try nchar + subset
> subset(v, nchar(v) >= 4)
[1] "676547" "434567" "567369"
[4] "987654" "54546676732"
DATA
v <- c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
1. Create a minimal reproducible example
xyz009 <- c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
2. Suggested solution using base R:
The vector xyz009 is of type character
typeof(xyz009)
[1] "character"
In order to do maths with it (i.e. use >) we have to cast it to numeric using as.numeric
num_xyz <- as.numeric(xyz009)
Now we can use an index to 'filter' values where an expression evaluates to TRUE:
test_result <- num_xyz >= 1000
The vector test_result consists of booleans
test_result
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
We can use these booleans as an 'index' (R keeps only values where the index is TRUE):
num_xyz[test_result]
This returns:
[1] 676547 434567 567369 987654 54546676732
Using base R you can use unlist, and lapply:
xyz009<-c("1","676547","2","434567","3","567369","4","987654","6","54546676732")
unlist(lapply(xyz009,function(x) x[nchar(x)>3]))
The result is:
[1] "676547" "434567" "567369" "987654" "54546676732"
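The answers above work on a bare vector, while the question describes a data frame whose item column holds a vector per row. A possible sketch of applying the same nchar filter to such a list column follows; the second row ("abc123") and the item_str column are hypothetical additions for illustration.

```r
# Hypothetical two-row data frame; only the first row comes from the question.
df <- data.frame(department = c("xyz009", "abc123"), stringsAsFactors = FALSE)
df$item <- list(
  c("1", "676547", "2", "434567", "3", "567369", "4", "987654", "6", "54546676732"),
  c("12", "1234")
)

# Keep only the entries with at least 4 characters, row by row
df$item <- lapply(df$item, function(x) x[nchar(x) >= 4])

# Optional: collapse each row's vector into one comma-separated string
df$item_str <- vapply(df$item, paste, character(1), collapse = ",")
df$item_str
# [1] "676547,434567,567369,987654,54546676732" "1234"
```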

How to apply list of regex pattern on list

I have a list of strings and a list of patterns
like:
links <- c(
"http://www.google.com"
,"google.com"
,"www.google.com"
,"http://google.com"
,"http://google.com/"
,"www.google.com/#"
,"www.google.com/xpto"
,"http://google.com/xpto"
,"http://google.com/xpto&utml"
,"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA")
patterns <- c(".com$","/$")
What I want is to wipe out all links that match these patterns
and get this result:
"www.google.com/#"
"www.google.com/xpto"
"http://google.com/xpto"
"http://google.com/xpto&utml"
"www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
if i use
x<-lapply (patterns, grepl, links)
i get
[[1]]
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which gives me these 2 vectors
> links[!x[[2]]]
[1] "http://www.google.com" "google.com"
[3] "www.google.com" "http://google.com"
[5] "www.google.com/#" "www.google.com/xpto"
[7] "http://google.com/xpto" "http://google.com/xpto&utml"
[9] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
> links[!x[[1]]]
[1] "http://google.com/" "www.google.com/#"
[3] "www.google.com/xpto" "http://google.com/xpto"
[5] "http://google.com/xpto&utml" "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
In this case each result has only one pattern wiped out, but I wanted a single vector with all patterns wiped. How can I apply all the regexes to one result, or somehow merge the n boolean vectors so that TRUE wins whenever any of them is TRUE?
like:
b[1] <- c(TRUE,FALSE,FALSE,TRUE,FALSE)
b[2] <- c(FALSE,FALSE,TRUE,TRUE,FALSE)
b[3] <- c(FALSE,FALSE,FALSE,FALSE,FALSE)
res <- somefunction(b)
res
TRUE,FALSE,TRUE,TRUE,FALSE
In most cases the best solution will be to merge the regular expression patterns, and to apply a single pattern search, as shown in Thomas’ answer.
However, it is also trivial to merge logical vectors by combining them with logical operations. In your case, you want to compute the member-wise logical disjunction. Between two vectors, this can be computed as x | y. Between a list of multiple vectors, it can be computed using Reduce(`|`, logical_list).
In your case, this results in:
any_matching = Reduce(`|`, lapply(patterns, grepl, links))
result = links[! any_matching]
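To connect this with the b example from the question, here is a small sketch of the element-wise OR over a list of logical vectors. It assumes the three vectors are stored in a list named b, which is not quite how the question wrote them.

```r
# The three logical vectors from the question, stored as a list (assumed layout)
b <- list(
  c(TRUE,  FALSE, FALSE, TRUE,  FALSE),
  c(FALSE, FALSE, TRUE,  TRUE,  FALSE),
  c(FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Fold the list with `|`: position i is TRUE if any vector is TRUE at i
res <- Reduce(`|`, b)
res
# [1]  TRUE FALSE  TRUE  TRUE FALSE
```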
This should do what you want:
links[!sapply("(\\.com|/)$", grepl, links)]
Explanation:
You can use sapply so you get a vector and not a list
I'd use the pattern "(\\.com|/)$" (i.e. ends with .com OR /).
In the end I negate the resulting boolean vector using !.
You can try the base R code below, using grep
r <- grep(paste0(patterns,collapse = "|"),links,value = TRUE,invert = TRUE)
such that
> r
[1] "www.google.com/#"
[2] "www.google.com/xpto"
[3] "http://google.com/xpto"
[4] "http://google.com/xpto&utml"
[5] "www.google.com/gclid=102938120391820391+ajdakjsdsjkajasn_JAJSDSJA"
You can do this using stringr::str_subset() function.
library(stringr)
str_subset(links, pattern = ".com$|/$", negate = TRUE)
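One detail worth checking when merging the patterns: in a regex an unescaped dot matches any character, so ".com$" is looser than it looks. The sketch below uses a made-up hosts vector to show the difference; for the links in the question both forms happen to give the same result.

```r
# Hypothetical inputs chosen to expose the difference
hosts <- c("example.com", "telecom", "example.com/")

# Unescaped dot: "." matches any character, so "telecom" also ends in ".com"
loose <- grepl(".com$|/$", hosts)
loose
# [1] TRUE TRUE TRUE

# Escaped dot: only a literal ".com" at the end (or a trailing slash) matches
strict <- grepl("\\.com$|/$", hosts)
strict
# [1]  TRUE FALSE  TRUE
```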

Remove vector element with %in% returns character(0)

Got a quick question. I'm trying to remove a vector element with the code below. But I get character(0) returned instead of the rest of the vector elements.
What have I done wrong?
> str(ticker.names)
chr [1:10] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ASSA-B.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST"
> ticker.names[! 'AAK.ST' %in% ticker.names]
character(0)
If we need to remove the elements of ticker.names that are equal to 'AAK.ST', the vector must go on the left-hand side of %in%:
ticker.names[!ticker.names %in% 'AAK.ST']
Or use setdiff
setdiff(ticker.names, 'AAK.ST')
Consider the approach OP is using,
'AAK.ST' %in% ticker.names
#[1] TRUE
ticker.names['AAK.ST' %in% ticker.names]
#[1] "AAK.ST" "ABB.ST" "ALFA.ST"
By negating,
!'AAK.ST' %in% ticker.names
#[1] FALSE
ticker.names[!'AAK.ST' %in% ticker.names]
#character(0)
In the former case, the TRUE is recycled to the length of the 'ticker.names', so all the elements of the vector are returned, while in the latter, the FALSE gets recycled and no elements are returned.
data
ticker.names <- c('AAK.ST', 'ABB.ST', 'ALFA.ST')
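The difference between the two operand orders can be seen from the length of the result, as in this short sketch using the three-element data above. The variable names scalar_test, elementwise and kept are made up for illustration.

```r
ticker.names <- c("AAK.ST", "ABB.ST", "ALFA.ST")

# One value on the left: a single TRUE/FALSE, later recycled over the vector
scalar_test <- "AAK.ST" %in% ticker.names
length(scalar_test)
# [1] 1

# The vector on the left: one TRUE/FALSE per element
elementwise <- ticker.names %in% "AAK.ST"
elementwise
# [1]  TRUE FALSE FALSE

# Negating the element-wise result drops only the matching element
kept <- ticker.names[!elementwise]
kept
# [1] "ABB.ST"  "ALFA.ST"
```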

R match part of string against vector of strings

I have a comma-separated character string
A = "123,456,789"
and I am trying to get a logical vector indicating which elements of a character array appear among the items of that string.
B <- as.array(c("456", "135", "789", "111"))
I am looking for logical result of size 4 (length of B)
[1] TRUE FALSE TRUE FALSE
Fairly new to R so any help would be appreciated. Thanks in advance.
You can use a combination of sapply and grepl, which returns a logical if matched
sapply(B, grepl, x=A)
Since your comparison vector is comma-separated, you can use this as a non-looping method.
B %in% strsplit(A, ",")[[1]]
# [1] TRUE FALSE TRUE FALSE
And one other looping method would be to use Vectorize with grepl. This uses mapply internally.
Vectorize(grepl, USE.NAMES = FALSE)(B, A)
# [1] TRUE FALSE TRUE FALSE
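The grepl-based and %in%-based answers agree on the data in the question, but they can differ when one value is a substring of an item. The sketch below uses a hypothetical B2 containing "45" to show this; which behaviour is wanted depends on the use case.

```r
A  <- "123,456,789"
B2 <- c("456", "45", "789", "111")   # hypothetical: "45" is a substring of "456"

# Substring search: "45" is found inside "456"
via_grepl <- unname(vapply(B2, grepl, logical(1), x = A, fixed = TRUE))
via_grepl
# [1]  TRUE  TRUE  TRUE FALSE

# Exact match against the split items: "45" is not one of them
items  <- strsplit(A, ",", fixed = TRUE)[[1]]
via_in <- B2 %in% items
via_in
# [1]  TRUE FALSE  TRUE FALSE
```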
