Regex Matching Negative values - r

I'm trying to create some simple and easy to write content-clusters with multiple regexes.
Imagine a list of strings: c("a","b","ac")
The groups I need to define are "All: a's" and "All: b's". So the values "a" and "ac" are "A" and "b" is "B".
myDF$contentGroup <- sub(".*a.*", "A", myDF$stringList)
However, this produces a column "contentGroup" in my dataframe that contains the original value of "stringList" wherever no match occurred. So if I run the same line of code with "B", it overwrites the "A"s.
myDF$contentGroup <- sub(".*b.*", "B", myDF$stringList)
I just can't figure out how to do simple clustering in a single line of code, keeping it as simple as possible.

You can use grep to match 'a' and 'b', and replace as follows,
x <- c("a", "b", "ac")
x[grep('a', x, fixed = TRUE)] <- 'A'
x[grep('b', x, fixed = TRUE)] <- 'B'
x
#[1] "A" "B" "A"
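If the labels should be assigned in one pass while the original strings stay untouched, one possible sketch is to start from a vector of NA and fill it from a named vector of patterns. The names `stringList`, `patterns` and `contentGroup` below are illustrative assumptions, and when a string matches several patterns the last matching label wins.

```r
# Hypothetical input, mirroring the example from the question
stringList <- c("a", "b", "ac")

# Label -> pattern lookup; the names are the cluster labels (illustrative only)
patterns <- c(A = "a", B = "b")

# Start from NA so unmatched strings stay visible instead of being copied through
contentGroup <- rep(NA_character_, length(stringList))
for (label in names(patterns)) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  contentGroup[grepl(patterns[[label]], stringList, fixed = TRUE)] <- label
}
contentGroup
#[1] "A" "B" "A"
```

Because the result is written to a separate vector, the order of the patterns decides ties, and adding a new cluster only means adding one entry to `patterns`.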

Related

how to change a vector to corresponding names without using for loop

I have a vector c(1,3,4,2,5,4,3,1,6,3,1,4,2), and I want to map 1 to "a", 2 to "b", and so on.
So my final output should look like c(a,c,d,b...)
I know that I can use a for loop and if statements to do this, but is there a quicker way?
You may use the built-in constant letters.
vec <- c(1,3,4,2,5,4,3,1,6,3,1,4,2)
res <- letters[vec]
res
#[1] "a" "c" "d" "b" "e" "d" "c" "a" "f" "c" "a" "d" "b"
To replace with any other values you can construct a vector to subset.
value <- c('apples', 'banana', 'grapes', .....)
res <- value[vec]
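We may use match against the sorted unique values, so that the smallest value maps to "a", the next to "b", and so on (matching against unique(vec) alone would assign letters in order of first appearance instead)
letters[match(vec, sort(unique(vec)))]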
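When the codes are not consecutive integers starting at 1, positional indexing no longer lines up with the values. A minimal sketch of an alternative, assuming a named lookup vector (`codes` and `lookup` are hypothetical names, not from the question):

```r
# Hypothetical codes that are not 1, 2, 3, ...
codes <- c(10, 30, 20, 10)

# Names are the codes as strings, values are the replacements
lookup <- c("10" = "a", "20" = "b", "30" = "c")

# Index by name rather than by position; unname() drops the carried-over names
res <- unname(lookup[as.character(codes)])
res
#[1] "a" "c" "b" "a"
```

Codes that are missing from `lookup` come back as NA, which makes unmapped values easy to spot.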

Is it possible to remove variables with a certain pattern from a datatable or list?

For example if I have a list which contains: "a", "ab", "b", "c", "ad" as variables.
Is it possible to remove all variables which contain an "a", without writing every single variable down?
I think grep or grepl could help
> grep("a",v,value = TRUE, invert = TRUE)
[1] "b" "c"
or
> v[!grepl("a",v)]
[1] "b" "c"
Data
v <- c("a","ab","b","c","ad")
“variables” are conventionally called “names” in R.
So if you want to remove them from a list-like structure, you can manipulate its names, and then subset the list with the resulting vector of names.
x = x[grep('a', names(x), value = TRUE, invert = TRUE)]
Or, using grepl instead:
x = x[! grepl('a', names(x))]
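As a small worked sketch of the name-based approach, assuming a hypothetical list `x` whose element names match the example: the logical index from grepl also behaves sensibly when nothing matches, whereas negative indexing with grep silently drops everything in that case.

```r
# Hypothetical list with the names from the example
x <- list(a = 1, ab = 2, b = 3, c = 4, ad = 5)

# Keep only the elements whose name does not contain "a"
kept <- x[!grepl("a", names(x), fixed = TRUE)]
names(kept)
#[1] "b" "c"

# Pitfall: -grep() with no match yields integer(0), which selects nothing
dropped_all <- x[-grep("z", names(x), fixed = TRUE)]
length(dropped_all)
#[1] 0

# The logical version keeps all five elements when nothing matches
length(x[!grepl("z", names(x), fixed = TRUE)])
#[1] 5
```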
An option with str_subset
library(stringr)
str_subset(v, "a", negate = TRUE)
#[1] "b" "c"
data
v <- c("a","ab","b","c","ad")

Two Column R Dataframe to Named LIst [duplicate]

I am trying to convert a two-column dataframe to a named list. There are several solutions on StackOverflow where every value in the first column becomes the 'name', but I am looking to collapse the values in column 2 into common values in column 1.
For example, the list should look like the following:
# Create a Named list of keywords associated with each file.
fileKeywords <- list(fooBar.R = c("A","B","C"),
driver.R = c("A","F","G"))
Where I can retrieve all keywords for "fooBar.R" using:
# Get the keywords for a named file
fileKeywords[["fooBar.R"]]
My data frame looks like:
df <- read.table(header = TRUE, text = "
file keyWord
'fooBar.R' 'A'
'fooBar.R' 'B'
'fooBar.R' 'C'
'driver.R' 'A'
'driver.R' 'F'
'driver.R' 'G'
")
I'm sure there is a simple solution that I am missing.
You could use unstack:
as.list(unstack(rev(df)))
$driver.R
[1] "A" "F" "G"
$fooBar.R
[1] "A" "B" "C"
This is equivalent to as.list(unstack(df, keyWord~file))
To go from the named list to the two-column data frame, we can use stack in base R
stack(fileKeywords)[2:1]
For the opposite direction, from the data frame to the named list, we can do
with(df, tapply(keyWord, file, FUN = I))
Output:
#$driver.R
#[1] "A" "F" "G"
#$fooBar.R
#[1] "A" "B" "C"
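Another base R option, offered here as a sketch rather than as part of the answers above, is split, which groups one vector by the values of another and returns a plain named list. The data frame is rebuilt inline so the snippet is self-contained.

```r
# Same data as in the question, built without read.table
df <- data.frame(
  file = c("fooBar.R", "fooBar.R", "fooBar.R", "driver.R", "driver.R", "driver.R"),
  keyWord = c("A", "B", "C", "A", "F", "G"),
  stringsAsFactors = FALSE
)

# Group the keywords by file name; the result is a named list
fileKeywords <- split(df$keyWord, df$file)

# Retrieve the keywords for one file
fileKeywords[["fooBar.R"]]
#[1] "A" "B" "C"
```

Unlike unstack, split does not need the groups to have equal lengths to return a list.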

Why does this grep exclusion fail to work in R?

I am trying to exclude certain characters when using grep in R, but I cannot get the result that I expect.
Here is the code:
x <- c("a", "ab", "b", "abc")
grep("[^b]", x, value=T)
> [1] "a" "ab" "abc"
I want to grab anything in vector x that does not contain b. It should not return "ab" or "abc".
Ultimately I want to pick up any element that contains "a" but not "b".
This is the result that I would expect:
grep("a[^b]", x, value=T)
> [1] "a"
How can I do that?
Try this:
grep("^[^b]*a[^b]*$", x, value=TRUE)
# [1] "a"
It looks for the start of the string, then allows any number of characters that are not "b", then an "a", then any number of characters that are not "b" again and then the end of the string is reached.
We can use the invert argument of grep, which returns the values that do not match the pattern. Here it returns the values that do not contain "b".
grep("b", x, value = TRUE, invert = TRUE)
#[1] "a"
I got the result you are looking for using this regular expression in grep:
grep("^[^b]*$", x, value=TRUE)
[1] "a"
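Note that excluding "b" alone is not the same as requiring "a" without "b". A minimal sketch, with a hypothetical extra element "c" added to the vector to make the difference visible, combines two grepl calls instead of packing both conditions into one regex:

```r
# The question's vector plus one extra element ("c") for illustration
x <- c("a", "ab", "b", "abc", "c")

# Contains "a" AND does not contain "b"
has_a_no_b <- x[grepl("a", x, fixed = TRUE) & !grepl("b", x, fixed = TRUE)]
has_a_no_b
#[1] "a"

# Excluding "b" alone also keeps "c", which has no "a" at all
no_b <- grep("b", x, value = TRUE, invert = TRUE)
no_b
#[1] "a" "c"
```

Splitting the condition this way tends to be easier to read and extend than a single anchored pattern such as "^[^b]*a[^b]*$".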

Determine all characters present in a vector of strings

Say I have the following dataframe consisting of two vectors containing character strings:
df <- data.frame(
"ID"= c("1a", "1b", "1c", "1d"),
"Codes" = c("BX.MX|GX.WX", "MX.RX|BX.YX", "MX.OX|GX.GX", "MX.OX|YX.OX"),
stringsAsFactors = FALSE)
I'd like a simple way to determine which characters have been used in a given vector. In other words, the output of such a function would reveal:
find.characters(df$Codes) # hypothetical function
[1] "B" "G" "M" "W" "X" "R" "Y" "O" "|" "."
find.characters(df$ID) # hypothetical function
[1] "1" "a" "b" "c" "d"
You can create a custom function to do this. The idea is to split the strings into individual characters with strsplit(v1, ''), which returns a list. We can unlist it to make it a vector, then get the unique elements. At this point the result is not sorted. Based on the example shown, you may want to sort the letters and the other characters separately. So, we use grepl to index the uppercase letters, sort the two subsets separately, and concatenate them with c().
find.characters <- function(v1){
x1 <- unique(unlist(strsplit(v1, '')))
indx <- grepl('[A-Z]', x1)
c(sort(x1[indx]), sort(x1[!indx]))
}
find.characters(df$Codes)
#[1] "B" "G" "M" "O" "R" "W" "X" "Y" "|" "."
find.characters(df$ID)
#[1] "1" "a" "b" "c" "d"
NOTE: Generally, I would use grepl('[A-Za-z]', x1), but I didn't do that because the expected result for the 'ID' column is different.
find.characters<-function(x){
unique(c(strsplit(split="",x),recursive = T))
}
