Extract numerical suffixes from strings in R

I have this character vector:
variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5", "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
Desired output:
> suffixes(variables)
[1] 1 1 4 4 5 6 11 12 2
In other words, I need a function that will return a numeric vector showing the suffixes (each of which is 1 or 2 digits long). Note that I need something that can work with a much larger number of strings, which may or may not have numbers somewhere in the middle. The numerical suffixes range from 1 to 99.
Many thanks

Just use gsub:
> gsub(".*?([0-9]+)$", "\\1", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
Wrap it in as.numeric if you want the result as a number.
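As a minimal sketch of the function the question asks for, the extraction and the numeric conversion can be combined in one helper. The name suffixes is taken from the call in the question; the body is only one possible implementation and assumes the suffix is the trailing run of digits. It uses a greedy pattern that strips everything up to the last non-digit character, so digits in the middle of a string (as in "us2yy.l2") are ignored.

```r
# Hypothetical helper, named after the call shown in the question.
suffixes <- function(x) {
  # Remove everything up to and including the last non-digit character,
  # leaving only the trailing digits, then convert to numeric.
  as.numeric(sub(".*\\D", "", x))
}

variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5",
               "vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
suffixes(variables)
# [1]  1  1  4  4  5  6 11 12  2
```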

You could use the sub function.
> variables <- c("ret.SMB.l1", "ret.mkt.l1", "ret.mkt.l4", "vix.l4", "ret.mkt.l5" ,"vix.l6", "slope.l11", "slope.l12", "us2yy.l2")
> sub(".*\\D", "", variables)
[1] "1" "1" "4" "4" "5" "6" "11" "12" "2"
.*\\D matches all the characters from the start up to the last non-digit character. Replacing those matched characters with an empty string will give you the desired output.

Related

R: Using gsub to replace a digit matched by pattern (n) with (n-1) in character vector

I am trying to match the last digit in a character vector and replace it with the matched digit - 1. I believe gsub is what I need to use, but I cannot figure out what to use as the 'replace' argument. I can match the last number using:
gsub('[0-9]$', ???, chrvector)
But I am not sure how to replace the matched number with itself - 1.
Any help would be much appreciated.
Thank you.
We can do this easily with gsubfn
library(gsubfn)
gsubfn("([0-9]+)", ~as.numeric(x)-1, chrvector)
#[1] "str97" "v197exdf"
Or for the last digit
gsubfn("([0-9])([^0-9]*)$", ~paste0(as.numeric(x)-1, y), chrvector2)
#[1] "str97" "v197exdf" "v33chr138d"
data
chrvector <- c("str98", "v198exdf")
chrvector2 <- c("str98", "v198exdf", "v33chr139d")
Assuming the last digit is not zero,
chrvector <- as.character(1:5)
chrvector
#[1] "1" "2" "3" "4" "5"
chrvector <- paste(chrvector, collapse='') # convert to character string
chrvector <- paste0(substring(chrvector,1, nchar(chrvector)-1), as.integer(gsub('.*([0-9])$', '\\1', chrvector))-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "4" "4"
This works even if you have the last digit zero:
chrvector <- c(as.character(1:4), '0') # [1] "1" "2" "3" "4" "0"
chrvector <- paste(chrvector, collapse='')
chrvector <- as.character(as.integer(chrvector)-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "3" "9"
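For readers who would rather not depend on gsubfn, here is a hedged base R sketch of the "last digit" case using regexpr and the replacement form of regmatches. The function name decrement_last_digit is made up for illustration. Like the first approach above, it assumes the last digit is not zero (a trailing "0" would be replaced by "-1").

```r
# Hypothetical helper: decrement the last digit found in each string.
decrement_last_digit <- function(x) {
  # Match a single digit that is followed only by non-digits up to the end.
  m <- regexpr("[0-9](?=[^0-9]*$)", x, perl = TRUE)
  # Replace each matched digit with its value minus one.
  regmatches(x, m) <- as.character(as.integer(regmatches(x, m)) - 1L)
  x
}

decrement_last_digit(c("str98", "v198exdf", "v33chr139d"))
# [1] "str97"      "v197exdf"   "v33chr138d"
```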

Calling labels from the table function in R

I am currently using R to convert data from an experiment into a high quality dataset. One of the features of my code is to detect repetitions of the experiment and label them accordingly. I have written the following code for this:-
DAYREP<-function(a){
CAPS<-c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P",
"Q","R","S","T","U","V","W","X","Y","Z")
if (unique(table(a))==1 && length(unique(table(a)))==1){
return(a)
}
else{
for (i in a){
if (table(a)[[i]]>=2){
CAPS.sum<-CAPS[1:as.vector(table(a)[[i]])-1]
val<-c(i,paste0(i,CAPS.sum))
del<-a[!a %in% i]
vec<-append(del,val,after=i-1)
return(vec)
}
}
}
}
I have used the following vectors of day numbers for testing and they highlight every possible outcome known so far.
a<-c(1,2,3,4,5,6,7,8,9)
b<-c(1,2,3,4,5,6,7,8,8)
c<-c(1,2,3,3,4,5,6)
d<-c(1,1,1,1,1,1)
e<-c(1,2,2,3,4,5,6,6,7)
f<-c(2,7,8,10,11,11,14)
It produces the following output:-
> DAYREP(a)
[1] 1 2 3 4 5 6 7 8 9
> DAYREP(b)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "8A"
> DAYREP(c)
[1] "1" "2" "3" "3A" "4" "5" "6"
> DAYREP(d)
[1] "1" "1A" "1B" "1C" "1D" "1E"
> DAYREP(e)
[1] "1" "2" "2A" "3" "4" "5" "6" "6" "7"
> DAYREP(f)
Error in table(a)[[i]] : subscript out of bounds
The function works on all the tests but e and f. With e it only converts the first set of repeated values, and with f it returns an error message.
I am aware that the problem is being caused by the table(a)[[i]] element calling the frequency value from the table, however I am unsure as to whether or not there is a method to call the values being tabulated from the table. E.g.
> table(e)
e
1 2 3 4 5 6 7
1 2 1 1 1 2 1
The method I am using is calling the bottom line, however I wish to call the top line. Does anybody know of a solution to this?
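@cr1msonB1ade has kindly suggested the use of the make.unique function, which does what the above function does with a slight variation. Note that make.unique expects a character vector, so the numeric vector has to be converted first.
> make.unique(as.character(e))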
[1] "1" "2" "2.1" "3" "4" "5" "6" "6.1" "7"
Thank you!
As stated in my comment I think what you want is the builtin function make.unique, but there are also some issues with how you are using the table, so I would like to address those as well. When you want to access the values in a table via the name of the variable (i in your for loop), you want to index with single brackets [ not double brackets [[. The other issue is that table converts the values to factors and thus you would have to index with an as.character(i). I don't think this completely fixed your script, but it might get you close enough.
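A small sketch of the indexing point, using the vector f from the question: the tabulated values (the "top line" of the printed table) are stored as the names of the table, as strings, so a count has to be looked up by as.character(i) rather than by the number i itself. The variable names tab and repeated are only illustrative.

```r
f <- c(2, 7, 8, 10, 11, 11, 14)
tab <- table(f)

# The tabulated values are the names of the table, stored as strings.
names(tab)
# [1] "2"  "7"  "8"  "10" "11" "14"

# Look up the count for the value 11 by name, not by position.
as.vector(tab[as.character(11)])
# [1] 2

# Values that occur more than once, converted back to numbers.
repeated <- as.numeric(names(tab)[as.vector(tab) >= 2])
repeated
# [1] 11
```

Indexing with the number 11 directly would ask for the 11th entry of a table that has only six, which is why DAYREP(f) fails with "subscript out of bounds".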

Convert a vector of integers to a vector of strings

toString seems to convert a whole vector to a single string -
toString(c(1,2))
[1] "1, 2"
how does one map the string conversion over each element; i.e. for the above example, to obtain ("1", "2") ?
> as.character(c(1,2))
[1] "1" "2"
Is the output I get from the R-console.
Since the result of toString is a character vector with a single element, applying as.character to it will have no effect. You need to use scan:
> scan(text = toString(0:11), sep="," )
Read 12 items
[1] 0 1 2 3 4 5 6 7 8 9 10 11
Then you can use as.character if that is needed:
> res <- scan(text = toString(0:11), sep="," )
Read 12 items
> as.character(res)
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
I prefer paste0 since it's shorter and (from what I can tell) accomplishes the same thing as as.character:
> paste0(1:2)
[1] "1" "2"
> identical(paste0(1:2),as.character(1:2))
[1] TRUE

Iterating over characters of string R

Could somebody explain to me why this does not print all the numbers separately in R?
numberstring <- "0123456789"
for (number in numberstring) {
print(number)
}
Aren't strings just arrays of chars? What's the way to do it in R?
In R "0123456789" is a character vector of length 1.
If you want to iterate over the characters, you have to split the string into
a vector of single characters using strsplit.
numberstring <- "0123456789"
numberstring_split <- strsplit(numberstring, "")[[1]]
for (number in numberstring_split) {
print(number)
}
# [1] "0"
# [1] "1"
# [1] "2"
# [1] "3"
# [1] "4"
# [1] "5"
# [1] "6"
# [1] "7"
# [1] "8"
# [1] "9"
Just for fun, here are a few other ways to split a string at each character.
x <- "0123456789"
substring(x, 1:nchar(x), 1:nchar(x))
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
regmatches(x, gregexpr(".", x))[[1]]
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
scan(text = gsub("(.)", "\\1 ", x), what = character())
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
This is also possible with str_split from the stringr package (part of the tidyverse):
library(stringr)
numberstring <- "0123456789"
str_split(numberstring, boundary("character"))
# [[1]]
#  [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
Here's a naive approach for iterating a string using a for loop and substring. This isn't any better than existing answers for the common case, but it might be useful if you want to break out of the loop early instead of always traversing the entire string once up front, as str_split/scan/substring(x, 1:nchar(x), 1:nchar(x))/regmatches requires.
s <- "0123456789"
if (s != "") {
for (i in 1:nchar(s)) {
print(substring(s, i, i))
}
}
The if is needed to avoid looping backwards from 1 to 0, inclusive of both ends.
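As a sketch of the early-exit case mentioned above: seq_len(nchar(s)) produces an empty sequence for an empty string, so it can stand in for the explicit if guard. The function first_vowel is a made-up example and not part of any package.

```r
# Hypothetical example: position of the first vowel, stopping at the first hit.
first_vowel <- function(s) {
  # seq_len(0) is empty, so an empty string skips the loop entirely.
  for (i in seq_len(nchar(s))) {
    ch <- substring(s, i, i)
    if (ch %in% c("a", "e", "i", "o", "u")) {
      return(i)  # leave the loop without visiting the remaining characters
    }
  }
  NA_integer_  # no vowel found, or empty input
}

first_vowel("strsplit")
# [1] 7
first_vowel("")
# [1] NA
```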
Your question is not 100% clear as to the desired outcome (print each character individually from a string, or store each number in a way that the given print loop will result in each number being produced on its own line).
To store numberstring such that it prints using the loop you included:
numberstring<-c(0,1,2,3,4,5,6,7,8,9)
for(number in numberstring){print(number);}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9

How can I use grep with parameters in R?

Obviously I don't get the way grep works in R. If I use grep in my OS X terminal, I am able to use the parameter -o, which makes grep return only the matching part. In R, I can't find how to do a corresponding thing. Reading the manual, I thought value was the right approach, which is better inasmuch as it returns characters rather than indexes, but it still returns the whole string.
# some string fasdjlk465öfsdj123
# R
test <- "fasdjlk465öfsdj123"
grep("[0-9]",test,value=TRUE) # returns "fasdjlk465öfsdj123"
# shell
echo 'fasdjlk465öfsdj123' | grep -o '[0-9]'
# returns 4 6 5 1 2 3
What's the parameter I am missing in R ?
EDIT: Joris Meys' suggestion comes really close to what I am trying to do. I get a vector as a result of readLines, and I'd like to check every element of the vector for numbers and return these numbers. I am really surprised there's no standard solution for that. I thought of using some regexp function that works on a string and returns the match like grep -o, and then using lapply on that vector. grep.custom comes closest – I'll try to make that work for me.
Spacedman said it already. If you really want to simulate grep in the shell, you have to work on the characters themselves, using strsplit() :
> chartest <- unlist(strsplit(test,""))
> chartest
[1] "f" "a" "s" "d" "j" "l" "k" "4" "6" "5" "ö" "f" "s" "d" "j" "1" "2" "3"
> grep("[0-9]",chartest,value=T)
[1] "4" "6" "5" "1" "2" "3"
EDIT :
As Nico said, if you want to do this for complete regular expressions, you need to use gregexpr() and substr(). I'd make a custom function like this one :
grep.custom <- function(x,pattern){
strt <- gregexpr(pattern,x)[[1]]
lngth <- attributes(strt)$match.length
stp <- strt + lngth - 1
apply(cbind(strt,stp),1,function(i){substr(x,i[1],i[2])})
}
Then :
> grep.custom(test,"sd")
[1] "sd" "sd"
> grep.custom(test,"[0-9]")
[1] "4" "6" "5" "1" "2" "3"
> grep.custom(test,"[a-z]s[a-z]")
[1] "asd" "fsd"
EDIT2 :
for vectors, use the function Vectorize(), eg:
> X <- c("sq25dfgj","sqd265jfm","qs55d26fjm" )
> v.grep.custom <- Vectorize(grep.custom)
> v.grep.custom(X,"[0-9]+")
$sq25dfgj
[1] "25"
$sqd265jfm
[1] "265"
$qs55d26fjm
[1] "55" "26"
and if you want to call grep from the shell, see ?system
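Base R also offers regmatches(), which pairs with gregexpr() to return the matched substrings directly and already works on whole vectors. The sketch below is a hedged alternative to grep.custom for the readLines case; the vector lines is a made-up stand-in for the result of readLines.

```r
# Stand-in for the character vector returned by readLines().
lines <- c("fasdjlk465fsdj123", "nonumbers", "123")

# gregexpr() finds every match per element; regmatches() extracts the text.
matches <- regmatches(lines, gregexpr("[0-9]+", lines))

matches[[1]]
# [1] "465" "123"

# Elements without any match yield character(0).
lengths(matches)
# [1] 2 0 1
```

The result is a list with one character vector per input element, much like the output of the vectorized grep.custom above.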
That's because 'grep' for R works on vectors: it searches every element and returns the indices of the elements that match. It asks 'which elements in this vector match this pattern?' For example, here we make a vector of 3 elements and then ask 'which elements in this vector contain a digit?'
> test = c("fasdjlk465öfsdj123","nonumbers","123")
> grep("[0-9]",test)
[1] 1 3
Elements 1 and 3 - not 2, which is only characters.
You probably want gsub - substitute anything that doesn't match digits with nothing:
> gsub("[^0-9]","",test)
[1] "465123" "" "123"
All this dancing around with strings is the problem the stringr package was designed to solve.
library(stringr)
str_extract_all('fasdjlk465fsdj123', '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
# It is vectorized too
str_extract_all(rep('fasdjlk465fsdj123',3), '[0-9]')
[[1]]
[1] "4" "6" "5" "1" "2" "3"
[[2]]
[1] "4" "6" "5" "1" "2" "3"
[[3]]
[1] "4" "6" "5" "1" "2" "3"
The motivation behind stringr is to unify string operations in R under two principles:
Use a sane and consistent naming scheme for functions (str_do_something).
Make it so that all the string operations that take one step in other programming languages, yet fifty steps in R, take only one step in R.
grep will only tell you whether the string matches or not.
For instance if you have:
values <- c("abcde", "12345", "abc123", "123abc")
Then
> grep("[0-9]", values)
[1] 2 3 4
This tells you that elements 2, 3 and 4 of the vector match the regexp. You can pass value=TRUE to return the strings rather than the indices.
If you want to check where the match is happening you can use regexpr instead
> regexpr("[0-9]", values)
[1] -1 1 4 1
attr(,"match.length")
[1] -1 1 1 1
which tells you where the first match is happening.
Even better, you can use gregexpr for multiple matches
> gregexpr("[0-9]", values)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1 2 3 4 5
attr(,"match.length")
[1] 1 1 1 1 1
[[3]]
[1] 4 5 6
attr(,"match.length")
[1] 1 1 1
[[4]]
[1] 1 2 3
attr(,"match.length")
[1] 1 1 1
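The positions and lengths reported by regexpr can be turned into the matched text with regmatches(). One thing to watch: for regexpr results, regmatches() silently drops the elements that did not match. The following sketch keeps the result aligned with the input by filling in NA for non-matches; the variable names first and matched are only illustrative.

```r
values <- c("abcde", "12345", "abc123", "123abc")

# First run of digits in each element; -1 marks "no match".
m <- regexpr("[0-9]+", values)
matched <- m != -1L

# regmatches() returns only the matched elements, so fill them in by position.
first <- rep(NA_character_, length(values))
first[matched] <- regmatches(values, m)
first
# [1] NA      "12345" "123"   "123"
```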
No idea where you get the impression that
> test <- "fasdjlk465öfsdj123"
> grep("[0-9]",test)
[1] 1
returns "fasdjlk465öfsdj123"
If you want to return the matches, you need to break test into its component parts, grep on those and then use the thing returned from grep to index test.
> test <- strsplit("fasdjlk465öfsdj123", "")[[1]]
> matched <- grep("[0-9]", test)
> test[matched]
[1] "4" "6" "5" "1" "2" "3"
Or just return the matched strings directly, depends what you want:
> grep("[0-9]", test, value = TRUE)
[1] "4" "6" "5" "1" "2" "3"
strapply in the gsubfn package can do such extraction:
> library(gsubfn)
> strapply(c("ab34de123", "55x65"), "\\d+", as.numeric, simplify = TRUE)
[,1] [,2]
[1,] 34 55
[2,] 123 65
It's based on the apply paradigm, where the first argument is the object, the second is the modifier (margin for apply, regular expression for strapply) and the third argument is the function to apply on the matches.
str_extract_all(obj, re) in the stringr package is similar to strapply specialized to use c for the function, i.e. it is similar to strapply(obj, re, c) .
strapply supports the sets of regular expressions supported by R and also supports tcl regular expressions.
See the gsubfn home page at http://gsubfn.googlecode.com
