Split a character string by the symbol "*" - r

> test = "23*45"
I'd like to split testby the symbol *
I tried...
> strsplit(test,'*')
and I got...
[[1]]
[1] "2" "3" "*" "4" "5"
What I aim to have is:
[[1]]
[1] "23" "45"

You need to escape the star...
test = "23*45"
strsplit( test , "\\*" )
#[[1]]
#[1] "23" "45"
The split is a regular expression and * means the preceeding item is matched zero or more times. You are splitting on nothing , i.e. splitting into individual characters, as noted in the Details section of strsplit(). \\* means *treat * as a literal *.
Alternatively use the fixed argument...
strsplit( test , "*" , fixed = TRUE )
#[[1]]
#[1] "23" "45"
Which gets R to treat the split pattern as literal and not a regular expression.

You might want to look at this package:
http://www.rexamine.com/resources/stringi/
To install this package simply run:
install.packages("stringi")
Example:
stri_split_fixed(test, "*")

Related

Trying to find this specific character "|" location in a string

I'm trying to find this specific character "|" location in a string.
for example: 8,75.2|6,0.376
the answer I expect is 7
I trying to use regexpr:
regexpr('|',"8,75.2|6,0.376")
but it didn't worked (although it did work to when I looked for the ",")
any ideas?
The '|' character is a special character in regular expression. You can search for a '|' by using the escape character '\' regexpr("\\|","8,75.2|6,0.376")
Another option is to use lapply:
> str <- '8,75.2|6,0.376'
> chars <- strsplit(str, '')
> chars
[[1]]
[1] "8" "," "7" "5" "." "2" "|" "6" "," "0" "." "3" "7" "6"
> loc <- lapply(chars, function(elem) which (elem == '|'))
> loc
[[1]]
[1] 7
See the lapply documentation
You can use the stringr package:
library(stringr)
str_locate("8,75.2|6,0.376",fixed('|'))
#or
str_locate("8,75.2|6,0.376",'\\|')
sample result:
start end
[1,] 7 7

Regular expressions, extract specific parts of pattern

I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.
My expressions look like the following two:
crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt
Only the parts replaced by a asterisk are "varying", the other stuff is constant (or irrelevant, as in the case of the last part (after "f*_"):
cr*_*_g_*_*_*_f*_
Is there a straightfoward way to get only the values of the asterisk-parts? E.g. in case of "r" or "gdp" I have to include underscores, otherwise I get the r at the beginning of the expression. Including the underscores gives "r" or "gdp", but I only want "r" or "gdp".
Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?
You can use sub with captures and then strsplit to get a list of the separated elements:
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b" "gdp" "100000" "16" "16" "tv"
#[[2]]
#[1] "t" "r" "25000" "20" "40" "lin"
Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
We can also use regmatches and regexec to extract these values like this:
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"
[3] "gdp" "100000"
[5] "16" "16"
[7] "tv"
[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t" "r"
[4] "25000" "20" "40"
[7] "lin"
Note that the first element in each vector is the full string, so to drop that, we can use lapply and "["
lapply(regmatches(str,
regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
"[", -1)
[[1]]
[1] "b" "gdp" "100000" "16" "16" "tv"
[[2]]
[1] "t" "r" "25000" "20" "40" "lin"

Getting all characters ahead of first appearance of special character in R

I want to get all characters that are ahead of the first "." if there is one. Otherwise, I want to get back the same character ("8" -> "8").
Example:
v<-c("7.7.4","8","12.6","11.5.2.1")
I want to get something like this:
[1] "7 "8" "12" "11"
My idea was to split each element at "." and then only take the first split. I found no solution that worked...
You can use sub
sub("\\..*", "", v)
#[1] "7" "8" "12" "11"
or a few stringi options:
library(stringi)
stri_replace_first_regex(v, "\\..*", "")
#[1] "7" "8" "12" "11"
# extract vs. replace
stri_extract_first_regex(v, "[^\\.]+")
#[1] "7" "8" "12" "11"
If you want to use a splitting approach, these will work:
unlist(strsplit(v, "\\..*"))
#[1] "7" "8" "12" "11"
# stringi option
unlist(stri_split_regex(v, "\\..*", omit_empty=TRUE))
#[1] "7" "8" "12" "11"
unlist(stri_split_fixed(v, ".", n=1, tokens_only=TRUE))
unlist(stri_split_regex(v, "[^\\w]", n=1, tokens_only=TRUE))
Other sub variations that use a capture group to target the leading characters specifically:
sub("(\\w+).+", "\\1", v) # \w matches [[:alnum:]_] (i.e. alphanumerics and underscores)
sub("([[:alnum:]]+).+", "\\1", v) # exclude underscores
# variations on a theme
sub("(\\w+)\\..*", "\\1", v)
sub("(\\d+)\\..*", "\\1", v) # narrower: \d for digits specifically
sub("(.+)\\..*", "\\1", v) # broader: "." matches any single character
# stringi variation just for fun:
stri_extract_first_regex(v, "\\w+")
scan() would actually work well for this. Since we want everything before the first ., we can use that as a comment character and scan() will remove everything after and including that character, for each element in v.
scan(text = v, comment.char = ".")
# [1] 7 8 12 11
The above returns a numeric vector, which might be where you are headed. If you need to stick with characters, add the what argument to denote we want a character vector returned.
scan(text = v, comment.char = ".", what = "")
# [1] "7" "8" "12" "11"
Data:
v <- c("7.7.4", "8", "12.6", "11.5.2.1")

How to grep for quotation symbol (")

I have this code:
> head(row.names(django_c1))
[1] "10" "16" "25" "26" "28" "48"
> row.names(django_c1) <- gsub("\"", "", row.names(django_c1))
> head(row.names(django_c1))
[1] "10" "16" "25" "26" "28" "48"
What I am trying to do is to delete all the quotation marks ("), however, it doesn't seem to work.
I have also tried:
row.names(django_c1) <- as.numeric(row.names(django_c1))
and:
row.names(django_c1) <- gsub(""", "", row.names(django_c1))
But none of these seem to work either. How can I delete the quotation marks?
You have this display because the result of rownames is a vector of character strings typeof()=="character", so R displays " around it to show that fact.
If you do head(django_c1) you won't see them.
Row and colum names are always character strings, if you want to access by arbitrary index, either use a list (but you will probably create issues since list[[2]]=0 automatically creates list[[1]]=NA) or use django_c1[str(custom_index),].
Precision : gsub('"', '', string) will perfectly remove " from your string if they are really part of your string, which means 'hey"' (displayed "hey\"") becomes 'hey' (displayed "hey"). The distinction must be clear between the content of the string and the way it is displayed.

How can I use grep with parameters in R?

Obviously I dont get the way grep works in R. If I use grep on my OS X terminal, I am able to use the parameter -o which makes grep only return the matching part. In R, I can't find how to do a corresponding thing. Reading the manual I thought values was the right approach, which is better inasmuch that it returns characters not indexes, but still returns the whole string.
# some string fasdjlk465öfsdj123
# R
test <- fasdjlk465öfsdj123
grep("[0-9]",test,value=TRUE) # returns "fasdjlk465öfsdj123"
# shell
grep -o '[0-9]' fasdjlk465öfsdj123
# returns 4 6 5 1 2 3
What's the parameter I am missing in R ?
EDIT: Joris Meys' suggestions comes really close to what I am trying to do. I get a vector as a result of readLines. And I'd like to check every element of the vector for numbers and return these numbers. I am really surprised there's no standard solution for that. I thought of using some regexp function that works on a string and returns the match like grep -o and then use lapply on that vector. grep.custom comes closest – i'll try to make that work for me.
Spacedman said it already. If you really want to simulate grep in the shell, you have to work on the characters itself, using strsplit() :
> chartest <- unlist(strsplit(test,""))
> chartest
[1] "f" "a" "s" "d" "j" "l" "k" "4" "6" "5" "ö" "f" "s" "d" "j" "1" "2" "3"
> grep("[0-9]",chartest,value=T)
[1] "4" "6" "5" "1" "2" "3"
EDIT :
As Nico said, if you want to do this for complete regular expressions, you need to use the gregexpr() and substr(). I'd make a custom function like this one :
grep.custom <- function(x,pattern){
strt <- gregexpr(pattern,x)[[1]]
lngth <- attributes(strt)$match.length
stp <- strt + lngth - 1
apply(cbind(strt,stp),1,function(i){substr(x,i[1],i[2])})
}
Then :
> grep.custom(test,"sd")
[1] "sd" "sd"
> grep.custom(test,"[0-9]")
[1] "4" "6" "5" "1" "2" "3"
> grep.custom(test,"[a-z]s[a-z]")
[1] "asd" "fsd"
EDIT2 :
for vectors, use the function Vectorize(), eg:
> X <- c("sq25dfgj","sqd265jfm","qs55d26fjm" )
> v.grep.custom <- Vectorize(grep.custom)
> v.grep.custom(X,"[0-9]+")
$sq25dfgj
[1] "25"
$sqd265jfm
[1] "265"
$qs55d26fjm
[1] "55" "26"
and if you want to call grep from the shell, see ?system
That's because 'grep' for R works on vectors - it will do the search on every element and return the element indices that match. It says 'which elements in this vector match this pattern?' For example, here we make a vector of 3 and then ask 'which elements in this vector have a single number in them?'
> test = c("fasdjlk465öfsdj123","nonumbers","123")
> grep("[0-9]",test)
[1] 1 3
Elements 1 and 3 - not 2, which is only characters.
You probably want gsub - substitute anything that doesn't match digits with nothing:
> gsub("[^0-9]","",test)
[1] "465123" "" "123"
All this dancing around with strings is the problem the stringr package was designed to solve.
library(stringr)
str_extract_all('fasdjlk465fsdj123', '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
# It is vectorized too
str_extract_all(rep('fasdjlk465fsdj123',3), '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
[[2]]
[1] "4" "6" "5" "1" "2" "3"
[[3]]
[1] "4" "6" "5" "1" "2" "3"
The motivation behind stringr is to unify string operations in R under two principles:
Use a sane and consistent naming scheme for functions (str_do_something).
Make it so that all the string operations that take one step in other programing languages, yet fifty steps in R, take only one step in R.
grep will only tell you whether the string matches or not.
For instance if you have:
values <- c("abcde", "12345", "abc123", "123abc")
Then
grep <- ("[0-9]", values)
[1] 2 3 4
This tells you that elements 2,3 and 4 of the array match the regexp. You can pass value=TRUE to return the strings rather then the indices.
If you want to check where the match is happening you can use regexpr instead
> regexpr("[0-9]", values)
[1] -1 1 4 1
attr(,"match.length")
[1] -1 1 1 1
which tells you where the first match is happening.
Even better, you can use gregexpr for multiple matches
> gregexpr("[0-9]", values)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1 2 3 4 5
attr(,"match.length")
[1] 1 1 1 1 1
[[3]]
[1] 4 5 6
attr(,"match.length")
[1] 1 1 1
[[4]]
[1] 1 2 3
attr(,"match.length")
[1] 1 1 1
No idea where you get the impression that
> test <- "fasdjlk465öfsdj123"
> grep("[0-9]",test)
[1] 1
returns "fasdjlk465öfsdj123"
If you want to return the matches, you need to break test into it's component parts, grep on those and then use the thing returned from grep to index test.
> test <- strsplit("fasdjlk465öfsdj123", "")[[1]]
> matched <- grep("[0-9]", test)
> test[matched]
[1] "4" "6" "5" "1" "2" "3"
Or just return the matched strings directly, depends what you want:
> grep("[0-9]", test, value = TRUE)
[1] "4" "6" "5" "1" "2" "3"
strapply in the gsubfn package can do such extraction:
> library(gsubfn)
> strapply(c("ab34de123", "55x65"), "\\d+", as.numeric, simplify = TRUE)
[,1] [,2]
[1,] 34 55
[2,] 123 65
Its based on the apply paradigm where the first argument is the object, the second is the modifier (margin for apply, regular expression for strapply) and the third argument is the function to apply on the matches.
str_extract_all(obj, re) in the stringr package is similar to strapply specialized to use c for the function, i.e. its the similar to strapply(obj, re, c) .
strapply supports the sets of regular expressions supported by R and also supports tcl regular expressions.
See the gsubfn home page at http://gsubfn.googlecode.com

Resources