Split character vector with vector of patterns in R

I'm trying to write a function that builds a matrix by splitting a character vector repeatedly using successive elements in a vector of patterns.
Let's call the function I'm trying to write str_split_vector(). Here's an example of the output I'm looking for:
char <- c("A & P | B & C # D",
"E & Q | F & G # H",
"I & R | J & K # L")
splits <- c(" \\| ", " & ", " # ")
str_split_vector(char, splits)
# [,1] [,2] [,3] [,4]
# [1,] "A & P" "B" "C" "D"
# [2,] "E & Q" "F" "G" "H"
# [3,] "I & R" "J" "K" "L"
The char vector is split by each pattern in turn, leaving "A & P" intact. (Although it might be easiest to manage that last bit with particular regex patterns.)
I've been able to accomplish this task only iteratively, with a pretty ad hoc loop:
for(ii in 1:length(splits)) {
if(ii == 1) {
char_mat <- matrix(char)
char_mat <- do.call(rbind, strsplit(char_mat[ , ii], splits[ii]))
} else {
char_mat <- cbind(char_mat[ , 1:ii - 1],
do.call(rbind,
strsplit(char_mat[ , ii], splits[ii])
)
)
}
}
That process looks inefficient to me, since I'm "growing" char_mat with the repeated cbind() calls. Even worse, I find it almost impossible to understand what's going on without actually running the code.
Is there a simpler way to write this, potentially ignoring the requirement that "A & P" not be split?

Maybe the following is what you want. No loops.
str_split_vector <- function(x, y){
s <- strsplit(x, paste(y, collapse = "|"))
do.call(rbind, s)
}
str_split_vector(char, splits)
# [,1] [,2] [,3] [,4] [,5]
#[1,] "A" "P" "B" "C" "D"
#[2,] "E" "Q" "F" "G" "H"
#[3,] "I" "R" "J" "K" "L"
An approach that uses grouping and won't perform any splitting on the first & is the following:
do.call(rbind, strsplit(gsub("(.*) \\| (.*) & (.*) # (.*)", "\\1_\\2_\\3_\\4", char), "_"))
It basically replaces the characters you wish to split on with an underscore and then splits on those underscores.
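If the split really does need to happen one pattern at a time, here is a minimal sketch (not taken from the answers above) that cuts each string at the first match of each pattern in turn and writes the pieces into a preallocated matrix, so nothing is grown with cbind(). It assumes every string contains every pattern, in the order given; the function body is only one possible way to fill in str_split_vector().

```r
str_split_vector <- function(x, patterns) {
  # One column per pattern, plus one for whatever is left at the end.
  out <- matrix(NA_character_, nrow = length(x), ncol = length(patterns) + 1L)
  rest <- x
  for (i in seq_along(patterns)) {
    m <- regexpr(patterns[i], rest)      # first match only
    start <- as.integer(m)
    len <- attr(m, "match.length")
    stopifnot(all(start > 0))            # assumption: every pattern is present
    out[, i] <- substr(rest, 1L, start - 1L)
    rest <- substr(rest, start + len, nchar(rest))
  }
  out[, ncol(out)] <- rest
  out
}

char <- c("A & P | B & C # D",
          "E & Q | F & G # H",
          "I & R | J & K # L")
splits <- c(" \\| ", " & ", " # ")
res <- str_split_vector(char, splits)
res
```

Because " \\| " is applied first, the " & " inside "A & P" is already in a finished column by the time the second pattern is used, so it is left alone.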

Related

How to remove non-alphabetic characters and columns from a csv file

I have a csv file that looks like this:
And in some portions the data in the columns is like this:
so as you can see, and because the "=" sign is present it wants to convert it into a formula, but what I need is the word in this case "rama...
I have extracted this term from a spam file and with R converted into a sparse matrix. So the question that I have is how can I get rid of the non-alphanumeric characters from this header in R, and then convert it again into a csv file?
Thanks
If you want a literal answer, you could try using gsub to replace any entry having one or more non alphanumeric characters:
df <- data.frame(v1=c(1,2,3), v2=c("#NAME?", "two", "#NAME?"),
stringsAsFactors=FALSE)
df <- data.frame(sapply(df, function(x) gsub(".*[^A-Za-z0-9].*", "", x)))
df
v1 v2
1 1
2 2 two
3 3
But the best/easiest thing to do here is probably to just fix your Excel formulas such that you catch these errors, and just display empty string, or some other sensible message. From what I can see, this is basically an Excel, not R, problem.
You can use gsub for that:
## A dummy matrix
example <- matrix(paste0("=", letters[1:9]),3,3)
# [,1] [,2] [,3]
#[1,] "=a" "=d" "=g"
#[2,] "=b" "=e" "=h"
#[3,] "=c" "=f" "=i"
You can remove the "=" by replacing it by "" in gsub
## Replacing the "=" by "" (nothing)
gsub("=", "", example)
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "b" "e" "h"
#[3,] "c" "f" "i"
Or only in the first row (or in the column names, etc.):
## Removing the "=" in the first row only
example[1, ] <- gsub("=", "", example[1, ])
example
# [,1] [,2] [,3]
#[1,] "a" "d" "g"
#[2,] "=b" "=e" "=h"
#[3,] "=c" "=f" "=i"
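Since the question asks about the header specifically, the same gsub idea can be applied to the column names before writing the file back out. This is only a sketch: the data frame and its column names are hypothetical stand-ins for the real term matrix, and the output goes to a temporary file.

```r
# Hypothetical data standing in for the term matrix; check.names = FALSE
# keeps the "=" in the header instead of letting R rename the column.
df <- data.frame("=rama" = c(1, 0), "term2" = c(0, 1), check.names = FALSE)

# Drop every non-alphanumeric character from the column names.
names(df) <- gsub("[^[:alnum:]]", "", names(df))

# Write the cleaned table back out as csv (a temporary file here).
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
names(df)
```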

Finding midpoint of string in R (mid character of a word)

I'd like to find the midpoint of any word after the following is done to the word:
>x = 'hello'
>y = strsplit(x, '')
>y
[[1]]
[1] "h" "e" "l" "l" "o"
>z = unlist(y)
>z
[1] "h" "e" "l" "l" "o"
Doing this then allows for :
> z[1]
[1] "h"
> z[4]
[1] "l"
The difference is that before z = unlist(y), indexing the original string x does not give individual characters; x[2], for example, returns NA:
> x = 'hello'
> strsplit(x, '')
[[1]]
[1] "h" "e" "l" "l" "o"
> x[1]
[1] "hello"
> x[2]
[1] NA
Anyways, what I want to do is find the mid point of words that are in this format so that the output would be something like:
"l"
in the case of the word "hello". Also, in this example we have a word with 5 letters allowing to easily designate a single character as the midpoint but for a word like "bake" I would like to designate both "a" and "k" together as the midpoint.
Try
f1 <- function(str1){
N <- nchar(str1)
if(!N%%2){
res <- substr(str1, N/2, (N/2)+1)
}
else{
N1 <- median(sequence(N))
res <- substr(str1, N1, N1)
}
res
}
f1('bake')
#[1] "ak"
f1('hello')
#[1] "l"
Another option. get_middle assumes the word has already been split into characters, as per your description:
get_middle <- function(x) {
mid <- (length(x) + 1) / 2
x[unique(c(ceiling(mid), floor(mid)))]
}
Then:
words <- c("bake", "hello")
lapply(strsplit(words, ""), get_middle)
Produces:
[[1]]
[1] "k" "a"
[[2]]
[1] "l"
You could try this:
midpoint <- function(word) {
# Split the word into a vector of letters
split <- strsplit(word, "")[[1]]
# Get the number of letters in the word
n <- nchar(word)
# Get the two middle letters for words of even length,
# otherwise get the single middle letter
if (n %% 2 == 0) {
c(split[n/2], split[n/2+1])
} else {
split[ceiling(n/2)]
}
}
In the case of a word of even length, the middle two characters are returned as a vector.
midpoint("hello")
#[1] "l"
midpoint("bake")
#[1] "a" "k"
How about:
mid<-function(str)substr(str,(nchar(str)+1)%/%2,(nchar(str)+2)%/%2)
Or slightly more legibly:
mid2<-function(str){
n1<-nchar(str)+1
substr(str,n1%/%2,(n1+1)%/%2)
}
> mid("bake")
[1] "ak"
> mid("hello")
[1] "l"
This has the advantage that it immediately vectorizes:
> mid(c("bake","hello"))
[1] "ak" "l"
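To see why the integer-division indices in mid() pick the right characters, the start and stop positions can be tabulated for a few word lengths. This is just an illustrative check; the names n and idx are made up for the example.

```r
# Start and stop positions used by mid() for words of length 1 to 6.
n <- 1:6
idx <- data.frame(n = n,
                  start = (n + 1) %/% 2,
                  stop  = (n + 2) %/% 2)
idx
```

For odd lengths start and stop coincide (a single middle character); for even lengths stop is one past start (the two middle characters).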
It is slower than @akrun's solution for long words, but my second version is faster; apparently counting characters can be costly for longer strings.
If you want the final product in a list, you can just strsplit the result:
mid3<-function(str)strsplit(mid2(str),"")
word = c("bake","hello")
print(nchar(word))
q = ifelse (nchar(word)%%2==0, substr(word,nchar(word)/2,nchar(word)/2+1),substr(word,nchar(word)/2+1,nchar(word)/2+1))
print(q)
[1] 4 5
[1] "ak" "l"

a sequence of str_match in a data.table

I have a string variable to parse into two parts. I figured I'd approach this using str_match from the stringr package, which returns a matrix with the original string in the first column and each extracted part in the other columns.
I found about a dozen regular expressions to extract these two parts. (The parts are a ladder and rung on a pay schedule, and it's very messy. I've verified that my regexes work by defining a function with a bunch of nested ifelse statements.)
library(stringr)
library(data.table)
my_strs <- c("A 01","G 00","A 2")
mydt <- data.table(strs = my_strs)
rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2 <- '(A) ([[:digit:]])'
I want to check the regexes in sequence and extract the parts using the first one that checks out. If I only had one regex, I could do this:
myfun <- function(x){
y <- str_match(x,rx1)
return(y)
}
mydt[,myfun(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA NA NA
(It took me a long time to even get that to work, trying all combinations of Vectorize and as.list on the function and *applying in the call.)
My best attempt at checking the regexes in sequence is this rather ugly kludge:
myfun2 <- function(x){
y <- str_match(x,rx1)
ifelse(!is.na(y[1]),"",(y <- str_match(x,rx2))[1])
return(y)
}
mydt[1:2,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
mydt[3,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 2" "A" "2"
mydt[1:3,myfun2(strs)]
# [,1] [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA NA NA
As you can see, it doesn't quite work yet.
Do you have any idea about a better way to approach this? I have about 3.5 m rows in my data set, but only about 2000 unique values for this string, so I'm not really worried about efficiency.
Try this using strapply from the gsubfn package. We define a function that accepts the matches and returns the first two non-empty ones. Then we use it with the regular expression paste(rx1, rx2, sep = "|") for each component of my_strs:
library(gsubfn)
# test data
# There was an addition to the question in the comments: it asked to be able to handle
# one regular expression which has only a single capture. Make sure it's at the end.
rx3 <- "^([[:digit:]]{2})$"
my_strs2 <- c(my_strs, "99")
# code
first2 <- function(...) { x <- c(..., NA); head(x[x != ""], 2) }
strapply(my_strs2, paste(rx1, rx2, rx3, sep = "|"), first2, simplify = TRUE)
The last line returns:
[,1] [,2] [,3] [,4]
[1,] "A " "G " "A" "99"
[2,] "01" "00" "2" NA
(If there are components of my_strs that do not match at all then a list will be returned in which those components are NULL. In that case you may prefer to drop the simplify = TRUE and always have it return a list.)
Note: strapplyc in the same package is much faster than strapply since the guts of it are written in tcl (a string processing language) whereas strapply is written in R. Thus you might want to break it up this way to leverage off of the faster routine:
L <- strapplyc(my_strs2, paste(rx1, rx2, rx3, sep = "|"))
sapply(L, first2)
For posterity, here is another solution I found today:
I made some minor alterations to the regexes and put them in a list:
rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2a <- "^(A) ([[:digit:]])$"
rx3a <- "^()([[:digit:]]{2})$"
rx_list <- list(rx1,rx2a,rx3a)
Then, for each row, find the first regex that matches and use it to extract the parts:
mydt[,{
i_rx <- min(which(unlist(sapply(rx_list,function(x)grepl(x,strs)))))
as.list(str_match(strs,rx_list[[i_rx]]))
},by=1:nrow(mydt)]
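Since the question mentions only about 2000 unique strings, another option is to try the regexes in order on the unique values and then map the result back to the full vector. The sketch below is an assumption-laden variant: it swaps in base R's regexec() and regmatches() for str_match, assumes every regex has exactly two capture groups, and match_first is a hypothetical helper name.

```r
match_first <- function(x, rx_list) {
  # Columns: full match, first capture, second capture.
  out <- matrix(NA_character_, nrow = length(x), ncol = 3L)
  for (rx in rx_list) {
    todo <- which(is.na(out[, 1L]))          # rows not matched yet
    if (length(todo) == 0L) break
    hits <- regmatches(x[todo], regexec(rx, x[todo]))
    ok <- lengths(hits) == 3L                # character(0) means no match
    if (any(ok)) out[todo[ok], ] <- do.call(rbind, hits[ok])
  }
  out
}

my_strs <- c("A 01", "G 00", "A 2", "A 01")
rx_list <- list('^([[:alpha:]] )([[:digit:]]{2})$',
                "^(A) ([[:digit:]])$")

# Work on the unique values only, then expand back to the original order.
u <- unique(my_strs)
parts <- match_first(u, rx_list)[match(my_strs, u), , drop = FALSE]
parts
```

Rows that match none of the regexes stay NA, which makes the remaining problem cases easy to find.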

Replace letters with ciphertext ones

I've been playing with the gsub2 function from the question "R: replace characters using gsub, how to create a function?" to create a ciphertext:
from<-c('s','l','k','u','m','i','x','j','o','p','n','q','b','v','w','z','f','y','t','g','h','a','e','d','c','r')
to<-c('z','e','b','r','a','s','c','d','f','g','h','i','j','k','l','m','n','o','p','q','t','u','v','w','x','y')
For example:
original text: the who's 1973
ciphertext: ptv ltn'm 1973
The problem is that gsub2 replaces some letters twice (o->f->n and s->z->m), which messes up my ciphertext and makes it almost impossible to decode. Could anyone point out the mistake I'm making? Thanks!
One way is to use a named vector as the encoding cipher. An easy way to create such a named vector is to use setNames:
cipher <- setNames(to, from)
cipher
s l k u m i x j o p n q b v w z f y t g h a e d c r
"z" "e" "b" "r" "a" "s" "c" "d" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "t" "u" "v" "w" "x" "y"
For the encoding function, you can use strsplit and paste:
encode <- function(x){
splitx <- strsplit(x, "")[[1]]
xx <- cipher[splitx]
xx[is.na(xx)] <- splitx[is.na(xx)]
paste(xx, collapse="")
}
encode("the who's 1973")
[1] "ptv ltf'z 1973"
You can also use chartr, as mentioned in a popular answer (12 upvotes) to the question you quoted:
cipher <- function(x)
chartr( "slkumixjopnqbvwzfytghaedcr", "zebrascdfghijklmnopqtuvwxy", x )
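Because chartr translates every character in a single pass, no letter is substituted twice, and decoding is simply the same call with the two alphabets swapped. A small sketch, where the variable and function names are made up for the example:

```r
plain_alphabet  <- "slkumixjopnqbvwzfytghaedcr"
cipher_alphabet <- "zebrascdfghijklmnopqtuvwxy"

# chartr(old, new, x) maps each character of `old` to the one at the
# same position in `new`; everything else is left untouched.
encode_text <- function(x) chartr(plain_alphabet, cipher_alphabet, x)
decode_text <- function(x) chartr(cipher_alphabet, plain_alphabet, x)

secret <- encode_text("the who's 1973")
secret
decode_text(secret)
```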

R - generate all combinations from 2 vectors given constraints

I would like to generate all combinations of two vectors, given two constraints: there can never be more than 3 characters from the first vector, and there must always be at least one character from the second vector. I would also like to vary the final number of characters in the combination.
For instance, here are two vectors:
vec1=c("A","B","C","D")
vec2=c("W","X","Y","Z")
Say I wanted 3 characters in the combination. Possible acceptable permutations would be: "A" "B" "X" or "A" "Y" "Z". An unacceptable permutation would be: "A" "B" "C" since there is not at least one character from vec2.
Now say I wanted 5 characters in the combination. Possible acceptable permutations would be: "A" "C" "Z" "Y" or "A" "Y" "Z" "X". An unacceptable permutation would be: "A" "C" "D" "B" "X" since there are >3 characters from vec1.
I suppose I could use expand.grid to generate all combinations and then somehow subset, but there must be an easier way. Thanks in advance!
I'm not sure whether this is easier, but you can skip the permutations that do not satisfy your conditions with this strategy:
1. Generate all combinations from vec1 that are acceptable.
2. Generate all combinations from vec2 that are acceptable.
3. Generate all combinations taking one solution from 1. plus one solution from 2. Here I'd filter on the third condition (the total number of characters) afterwards.
4. (If you're looking for combinations, you're done; otherwise:) produce all permutations of the letters within each result.
Now, let's have
vec1 <- LETTERS [1:4]
vec2 <- LETTERS [23:26]
## lists can eat up lots of memory, so use character vectors instead.
combine <- function (x, y)
combn (y, x, paste, collapse = "")
res1 <- unlist (lapply (0:3, combine, vec1))
res2 <- unlist (lapply (1:length (vec2), combine, vec2))
now we have:
> res1
[1] "" "A" "B" "C" "D" "AB" "AC" "AD" "BC" "BD" "CD" "ABC"
[13] "ABD" "ACD" "BCD"
> res2
[1] "W" "X" "Y" "Z" "WX" "WY" "WZ" "XY" "XZ" "YZ"
[11] "WXY" "WXZ" "WYZ" "XYZ" "WXYZ"
res3 <- outer (res1, res2, paste0)
res3 <- res3 [nchar (res3) == 5]
So here you are:
> res3
[1] "ABCWX" "ABDWX" "ACDWX" "BCDWX" "ABCWY" "ABDWY" "ACDWY" "BCDWY" "ABCWZ"
[10] "ABDWZ" "ACDWZ" "BCDWZ" "ABCXY" "ABDXY" "ACDXY" "BCDXY" "ABCXZ" "ABDXZ"
[19] "ACDXZ" "BCDXZ" "ABCYZ" "ABDYZ" "ACDYZ" "BCDYZ" "ABWXY" "ACWXY" "ADWXY"
[28] "BCWXY" "BDWXY" "CDWXY" "ABWXZ" "ACWXZ" "ADWXZ" "BCWXZ" "BDWXZ" "CDWXZ"
[37] "ABWYZ" "ACWYZ" "ADWYZ" "BCWYZ" "BDWYZ" "CDWYZ" "ABXYZ" "ACXYZ" "ADXYZ"
[46] "BCXYZ" "BDXYZ" "CDXYZ" "AWXYZ" "BWXYZ" "CWXYZ" "DWXYZ"
If you prefer the results split into single letters:
res <- matrix (unlist (strsplit (res3, "")), nrow = length (res3), byrow = TRUE)
> res
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "B" "C" "W" "X"
[2,] "A" "B" "D" "W" "X"
[3,] "A" "C" "D" "W" "X"
[4,] "B" "C" "D" "W" "X"
(snip)
[51,] "C" "W" "X" "Y" "Z"
[52,] "D" "W" "X" "Y" "Z"
Which are your combinations.
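To vary the final number of characters, the same idea can be wrapped in a function. The sketch below is an assumption-based variant rather than the code above: instead of building everything and filtering on nchar, it only pairs k1 letters from vec1 with n - k1 letters from vec2. The names constrained_combos, max1 and glue are hypothetical.

```r
constrained_combos <- function(vec1, vec2, n, max1 = 3L) {
  # All size-k combinations of v, each pasted into one string.
  glue <- function(v, k) {
    if (k == 0L) "" else combn(v, k, paste, collapse = "")
  }
  res <- character(0)
  # k1 letters from vec1 (at most max1), leaving at least one for vec2.
  for (k1 in 0:min(max1, length(vec1), n - 1L)) {
    k2 <- n - k1
    if (k2 > length(vec2)) next
    res <- c(res, as.vector(outer(glue(vec1, k1), glue(vec2, k2), paste0)))
  }
  res
}

vec1 <- LETTERS[1:4]
vec2 <- LETTERS[23:26]
combos5 <- constrained_combos(vec1, vec2, 5)
length(combos5)
```

With n = 5 this produces the same 52 strings as res3, though not necessarily in the same order.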
