Checking if a vector starts with a number - r

I have a pretty straightforward question. Sorry if this has already been asked somewhere, but I could not find the answer.
I want to check whether gene names start with a number and, if they do, prepend 'aaaa_' to the gene name. I therefore used the following code:
geneName <- "2310067B10Rik"
if (is.numeric(substring(geneName, 1, 1))) {
geneName <<- paste("aaaa_", geneName, sep="")
}
What I want to get back is aaaa_2310067B10Rik. However, is.numeric returns FALSE, because substring gives "2" as a character string. I've also tried noquote(), but that didn't work, and wrapping the substring in as.numeric(), but then the if branch is also applied to genes that don't start with a number. Any suggestions? Thanks!
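A minimal sketch reproducing the behaviour described above (gene_names, first_char, converted and starts_with_digit are illustrative names, not from the original post). It suggests why the as.numeric() attempt fires for every gene: a failed conversion yields NA, which is still of numeric type.

```r
gene_names <- c("2310067B10Rik", "Z310067B10Rik")
first_char <- substring(gene_names, 1, 1)

# substring() always returns character, so this is FALSE whatever the content
is.numeric(first_char)

# as.numeric("Z") gives NA (with a warning), and that NA is still numeric
converted <- suppressWarnings(as.numeric(first_char))
is.numeric(converted)

# testing the text itself distinguishes the two cases
starts_with_digit <- grepl("^[0-9]", gene_names)
starts_with_digit
```

The last call tests the characters rather than the type, which is the idea the answers below build on.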

Here is a solution with regex (Learning Regular Expressions):
geneName <- c("2310067B10Rik", "Z310067B10Rik")
sub("^(\\d)", "aaaa_\\1", geneName)
or as a Perl-flavoured variant (thanks to @snoram):
sub("^(?=\\d)", "aaaa_", geneName, perl = TRUE)

Using the replace() function:
start_nr <- grep("^\\d", geneName)
replace(geneName, start_nr, paste0("aaaa_", geneName[start_nr]))
[1] "aaaa_2310067B10Rik" "foo" "aaaa_9bar"
Where:
geneName <- c("2310067B10Rik", "foo", "9bar")

geneName <- c("2310067B10Rik", "foo")
ifelse(substring(geneName, 1,1) %in% c(0:9), paste0("aaaa_", geneName), geneName)
[1] "aaaa_2310067B10Rik" "foo"
Or, based on the comment above, you could replace substring(geneName, 1,1) %in% c(0:9) with grepl("^\\d", geneName)

Using regex:
You can first check the first character of geneName and, if it is a digit, prepend the prefix as follows (otherwise the name is returned unchanged):
geneName <- "2310067B10Rik"
ifelse(grepl("^[0-9]$", substring(geneName, 1, 1)), paste("aaaa", geneName, sep = "_"), geneName)
Output:
[1] "aaaa_2310067B10Rik"

geneName=function(x){
if( grepl("^[0-9]",x) ){
as.character(glue::glue('aaaa_{x}'))
}else{x}
}
> geneName("2310067B10Rik")
[1] "aaaa_2310067B10Rik"
> geneName("sdsad")
[1] "sdsad"
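The function above handles one name at a time, since if() expects a single TRUE/FALSE. A possible vectorised sketch follows; prefix_if_digit and its prefix argument are hypothetical names chosen for illustration.

```r
# Prepend `prefix` to every element of x that starts with a digit.
prefix_if_digit <- function(x, prefix = "aaaa_") {
  starts_with_digit <- grepl("^[0-9]", x)
  x[starts_with_digit] <- paste0(prefix, x[starts_with_digit])
  x
}

prefix_if_digit(c("2310067B10Rik", "foo", "9bar"))
```

Only the matching elements are modified, so names that do not start with a digit pass through untouched.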

Related

R: using a switch function with a numeric type value? [duplicate]

I am new to R programming. I don't know whether we could use switch statements for numerical objects.
This is my code,
myfunction <- function() {
x <- 10
switch(x,
1={
print("one")
},
2={
print("two")
},
3={
print("three")
},
{
print("default") #Edited one..
}
)
}
I got this error,
test.R:4:18: unexpected '='
3: switch(x,
4: 1=
^
Please help me with this problem.
To take full advantage of switch's functionality (in particular its ability to handle arbitrary values with a final "default" expression) and to handle numbers other than 1,2,3,..., you'd be better off converting any input to a character string.
I would do something like this:
myfunction <- function(x) {
switch(as.character(x),
"1" = print("one"),
"2" = print("two"),
"3" = print("three"),
print("something other than 'one', 'two', or 'three'"))
}
myfunction(1)
# [1] "one"
myfunction(345)
# [1] "something other than 'one', 'two', or 'three'"
myfunction <- function(x) {
switch(x,
print("one"),
print("two"),
print("three"))}
myfunction(1)
## [1] "one"
Edit:
As mentioned in the comments, this method does not match the values being entered; it uses them as a positional index. Thus, it works in your case, but it won't work if the statements are reordered (see @Josh's answer for a better approach).
Either way, I don't think switch is the right function to use in this case, because it is mainly meant for switching between different alternatives, while in your case you are basically running the same function over and over. Thus, adding an extra statement for each alternative seems like too much work (if, for example, you wanted to display 20 different numbers, you would have to write 20 different statements).
Instead, you could try the english package, which will let you display as many numbers as you define in the ifelse statement:
library(english)
myfunction2 <- function(x) {
ifelse(x %in% 1:3,
as.character(as.english(x)),
"default")}
myfunction2(1)
## [1] "one"
myfunction2(4)
## [1] "default"
Alternatively, you could also avoid using switch (though not necessarily recommended) by using match
myfunction3 <- function(x) {
df <- data.frame(A = 1:3, B = c("one", "two", "three"), stringsAsFactors = FALSE)
ifelse(x %in% 1:3,
df$B[match(x, df$A)],
"default")}
myfunction3(1)
## [1] "one"
myfunction3(4)
## [1] "default"
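Another way to sketch the same lookup without a data frame, assuming a named character vector (number_words and number_to_word are illustrative names):

```r
# Names are the numbers as strings; values are the words to return.
number_words <- c("1" = "one", "2" = "two", "3" = "three")

number_to_word <- function(x) {
  out <- unname(number_words[as.character(x)])
  out[is.na(out)] <- "default"  # anything not in the table
  out
}

number_to_word(c(1, 4))
```

Indexing by name is vectorised, so the function accepts several numbers at once.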
I would suggest reading the ?switch help page. This seems fairly well described there. Unquoted argument names in R can never be numeric, i.e. c(1=5) is not allowed, nor is f(1=5, 2=5). If you really have 1, 2 or 3, then you want just
switch(x,
{print("one")},
{print("two")},
{print("three")}
)
(omit the names for numeric values)
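A short sketch contrasting the two forms of switch discussed in this thread (by_position and by_name are hypothetical helper names). With a numeric first argument, switch picks the alternative at that position and returns NULL when the index is out of range; with a character first argument it matches names, and a trailing unnamed alternative acts as the default.

```r
# Numeric selector: alternatives are chosen by position.
by_position <- function(x) {
  switch(x, "one", "two", "three")
}

# Character selector: alternatives are chosen by name, last one is the default.
by_name <- function(x) {
  switch(as.character(x),
         "1" = "one",
         "2" = "two",
         "3" = "three",
         "default")
}

by_position(2)
is.null(by_position(10))
by_name(10)
```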

R Switch first part and second part of string at punctuation

I have a data.frame with columns:
names(data) = c("newid","Player.WR","data_col.WR","Trend.WR","Player.QB","data_col.QB","Trend.QB","Player.RB","data_col.RB","Trend.RB","Player.TE","data_col.TE","Trend.TE" )
However, I need to flip the first and second portions of each name at the period so it looks like this:
names(data) = c("newid", "WR.Player", "WR.data_col", "WR.Trend", "QB.Player", "QB.data_col", "QB.Trend", "RB.Player", "RB.data_col", "RB.Trend", "TE.Player", "TE.data_col", "TE.Trend")
My initial thought was to try to do a strsplit and then somehow do an lapply statement to reorder, but I wasn't sure how to make the lapply work.
Thanks!
With a vector of names v, you could also try:
v <- c("newid","Player.WR","data_col.WR","Trend.WR",
"Player.QB","data_col.QB","Trend.QB","Player.RB",
"data_col.RB","Trend.RB","Player.TE","data_col.TE","Trend.TE")
gsub(
'(.*)\\.(.*)',
'\\2\\.\\1',
v
)
Output:
[1] "newid" "WR.Player" "WR.data_col" "WR.Trend" "QB.Player" "QB.data_col" "QB.Trend" "RB.Player"
[9] "RB.data_col" "RB.Trend" "TE.Player" "TE.data_col" "TE.Trend"
And to directly assign it to names:
names(data) <- gsub('(.*)\\.(.*)', '\\2\\.\\1', v)
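As a small end-to-end sketch, assuming a toy data frame with made-up values (only the column names matter here), the same pattern can be applied to names(data) directly. Because the pattern spans the whole name, sub() is enough in this sketch, and names without a period are left as they are.

```r
# Toy data frame; the values are placeholders for illustration only.
data <- data.frame(newid = 1:2,
                   Player.WR = c("a", "b"),
                   data_col.WR = c(10, 20))

# Swap the parts before and after the last period in each column name.
names(data) <- sub("^(.*)\\.(.*)$", "\\2.\\1", names(data))
names(data)
```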
I would suggest the following approach, using a function to swap the two parts together with lapply():
#Data
vec <- c("newid","Player.WR","data_col.WR","Trend.WR",
"Player.QB","data_col.QB","Trend.QB","Player.RB",
"data_col.RB","Trend.RB","Player.TE","data_col.TE","Trend.TE" )
#Split
L <- lapply(vec,strsplit,split='\\.')
#Format function
myfun <- function(x)
{
y <- x[[1]]
#if check
if(length(y)!=1)
{
z <- paste0(y[c(2,1)],collapse = '.')
} else
{
z <- y
}
return(z)
}
#Apply
L2 <- lapply(L,FUN = myfun)
#Bind
do.call(c,L2)
Output:
[1] "newid" "WR.Player" "WR.data_col" "WR.Trend" "QB.Player" "QB.data_col" "QB.Trend"
[8] "RB.Player" "RB.data_col" "RB.Trend" "TE.Player" "TE.data_col" "TE.Trend"
The last output can be saved in a new vector, e.g. vecnamesnew <- do.call(c,L2)
Arg0naut91's answer is quite concise, and I would recommend using Arg0naut91's approach. However, for the sake of providing a (somewhat) concise solution using strsplit and lapply with (perhaps) a bit more readability for those unfamiliar with gsub syntax, I submit the following:
old_names <- c("newid","Player.WR","data_col.WR","Trend.WR",
"Player.QB","data_col.QB","Trend.QB","Player.RB",
"data_col.RB","Trend.RB","Player.TE","data_col.TE","Trend.TE" )
# lapply() returns a list, so unlist() is needed to get a character vector back
newnames <- unlist(lapply(old_names, function(x) paste(rev(unlist(strsplit(x, split = "\\."), use.names = FALSE)), collapse = ".")))
print(newnames)
which yields
[1] "newid" "WR.Player" "WR.data_col" "WR.Trend" "QB.Player" "QB.data_col" "QB.Trend"
[8] "RB.Player" "RB.data_col" "RB.Trend" "TE.Player" "TE.data_col" "TE.Trend"
as output.

Replace variable name in string with variable value [R]

I have a character in R, say "\\frac{A}{B}". And I have values for A and B, say 5 and 10. Is there a way that I can replace the A and B with the 5 and 10?
I tried the following.
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
output <- do.call("substitute", list(parse(text=words)[[1]], numbers))
But I get an error on the \. Is there a way that I can do this? I am trying to create equations with the actual variable values.
You could use the stringi function stri_replace_all_fixed()
stringi::stri_replace_all_fixed(
words, names(numbers), numbers, vectorize_all = FALSE
)
# [1] "\\frac{5}{10}"
Try this:
sprintf(gsub('\\{\\w\\}','\\{%d}',words),5,10)
I'm more familiar with gsub than substitute. The following works:
words <- "\\frac{A}{B}"
numbers <- list(A=5, B=10)
arglist <- mapply(list, as.list(names(numbers)), numbers, SIMPLIFY = FALSE)
for (i in seq_along(arglist)) {
  arglist[[i]]["x"] <- words
  words <- do.call("gsub", arglist[[i]])
}
But of course this is unsafe because you're iterating over the substitutions. If, say, the first variable has value "B" and the second variable has name "B", you'll have problems. There's probably a cleaner way.
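One possible way around the ordering problem is to substitute every {NAME} token in a single pass, so that an inserted value is never scanned again. The sketch below uses base R's gregexpr() and regmatches<-(); the pattern and the choice to leave unknown names untouched are assumptions made for illustration, and the value "B" for A is picked deliberately to show the problematic case.

```r
words <- "\\frac{A}{B}"
numbers <- list(A = "B", B = 10)  # A's value collides with the name B

# Locate every {NAME} token once, up front.
token_pattern <- "\\{([^{}]+)\\}"
matches <- gregexpr(token_pattern, words)

# Replace each token by its value; unknown names are left unchanged.
regmatches(words, matches) <- lapply(regmatches(words, matches), function(tokens) {
  keys <- gsub("[{}]", "", tokens)
  known <- keys %in% names(numbers)
  values <- vapply(numbers[keys[known]], as.character, character(1))
  tokens[known] <- paste0("{", values, "}")
  tokens
})
words
```

Because all tokens are located before anything is replaced, the "B" inserted for A is not replaced again by 10.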

Speed up a loop in r, using character strings simplification

I have a data frame sp which contains several species names but as they come from different databases, they are written in different ways.
For example, one species can be called either Urtica dioica or Urtica dioica L.
To correct this, I use the following code, which extracts only the first two words from a row:
paste(strsplit(sp[i,"sp"]," ")[[1]][1],strsplit(sp[i,"sp"]," ")[[1]][2],sep=" ")
For now, this code is integrated in a for loop, which works but takes ages to finish:
for (i in seq_along(sp$sp)) {
sp[i,"sp2"] = paste(strsplit(sp[i,"sp"]," ")[[1]][1],
strsplit(sp[i,"sp"]," ")[[1]][2],
sep=" ")
}
Is there a way to improve this basic code using vectors or an apply function?
You could just use vectorized regular expression functions:
library(stringr)
x <- c("Urtica dioica", "Urtica dioica L.")
> str_extract(string = x,"\\w+ \\w+")
[1] "Urtica dioica" "Urtica dioica"
I happen to have found stringr convenient here, but with the right regex for your specific data you could do this just as well with base functions like gsub.
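A base R sketch of the same idea, assuming that words are separated by whitespace and that names with fewer than two words should be left unchanged (the column name sp2 follows the question; the pattern itself is an illustrative choice):

```r
sp <- data.frame(sp = c("Urtica dioica", "Urtica dioica L."),
                 stringsAsFactors = FALSE)

# Keep the first two whitespace-separated words and drop the rest.
# sub() is vectorised, so no loop over rows is needed.
sp$sp2 <- sub("^(\\S+\\s+\\S+).*$", "\\1", sp$sp)
sp$sp2
```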
You might want to check to see if there are more than 2 words in the string before doing each extraction:
if((sapply(gregexpr("\\W+", i), length) + 1) > 2){
...
}
There's a function for that.
Also from stringr, the word function
> choices <- c("Urtica dioica", "Urtica dioica L..")
> library(stringr)
> word(choices, 1:2)
# [1] "Urtica" "dioica"
> word(choices, rep(1:2, 2))
# [1] "Urtica" "dioica" "Urtica" "dioica"
These return individual words. To get, for each string, its first two words as a single string (start = 1, end = 2):
> word(choices, 1, 2)
# [1] "Urtica dioica" "Urtica dioica"
The final line gets the first two words from each string in the vector choices

grep at the beginning of the string with fixed =T in R?

How to grep with fixed=T, but only at the beginning of the string?
grep("a.", c("a.b", "cac", "sss", "ca.f"), fixed = T)
# 1 4
I would like to get only the first occurrence.
[Edit: the string to match is not known in advance, and can be anything. "a." is just for the sake of example]
Thanks.
[Edit: I sort of solved it now, but any other ideas are highly welcome. I will accept as an answer any alternative solution.
s <- "a."
res <- grep(s, c("a.b", "cac", "sss", "ca.f"), fixed = T, value = T)
res[substring(res, 1, nchar(s)) == s]
]
If you want to match an exact string (string 1) at the beginning of the string (string 2), then just subset your string 2 to be the same length as string 1 and use ==, should be fairly fast.
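In R 3.3.0 and later, base R also provides startsWith(), which does this fixed-prefix comparison directly without any regular expression. A minimal sketch using the vector from the question:

```r
s <- "a."
x <- c("a.b", "cac", "sss", "ca.f")

# Logical vector: TRUE where the element begins with the literal prefix s
has_prefix <- startsWith(x, s)

which(has_prefix)  # positions, like grep()
x[has_prefix]      # values, like grep(value = TRUE)
```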
Actually, Greg -and you- have mentioned the cleanest solution already. I would even drop the grep altogether:
> name <- "a#"
> string <- c("a#b", "cac", "sss", "ca#f")
> string[substring(string, 1, nchar(name)) == name]
[1] "a#b"
But if you really insist on grep, you can use Dwin's approach or the following mind-boggling solution:
specialgrep <- function(x,y,...){
grep(
paste("^",
gsub("([].^+?|[#\\-])","\\\\\\1",x)
,sep=""),
y,...)
}
> specialgrep(name,string,value=T)
[1] "a#b"
I may have forgotten to include some characters in the gsub. Be sure you keep the ] symbol first and the - last in the character set, otherwise you'll get errors. Or just forget about it and use your own solution. This one is just for fun's sake :-)
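An alternative sketch that avoids listing special characters by hand: with perl = TRUE, everything between \Q and \E is treated as literal text, so the prefix can be quoted wholesale and anchored with ^. The helper name grep_literal_prefix is hypothetical, and the approach assumes the prefix itself does not contain the sequence \E.

```r
# Match `prefix` literally, but only at the start of each element of x.
grep_literal_prefix <- function(prefix, x, ...) {
  grep(paste0("^\\Q", prefix, "\\E"), x, perl = TRUE, ...)
}

x <- c("a.b", "axb", "cac", "ca.f")
grep_literal_prefix("a.", x)
grep_literal_prefix("a.", x, value = TRUE)
```

Here "axb" is not matched, because the dot is taken literally rather than as "any character".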
Do you want to use fixed=T because of the . in the pattern? In that case you can just escape the . and this would work:
grep("^a\\.", c("a.b", "cac", "sss", "ca.f"))
If you only want the focus on the first two characters, then only present that much information to grep:
> grep("a.", substr(c("a.b", "cac", "sss", "ca.f"), 1,2) ,fixed=TRUE)
[1] 1
You could easily wrap it into a function:
> checktwo <- function (patt,vec) { grep(patt, substr(vec, 1,nchar(patt)) ,fixed=TRUE) }
> checktwo("a.", c("a.b", "cac", "sss", "ca.f") )
[1] 1
I think Dr. G had the key to the solution in his answer, but didn't explicitly call it out: "^" in the pattern specifies "at the beginning of the string". ("$" means at the end of the string)
So his "^a." pattern means "at the beginning of the string, look for an 'a' followed by one character of anything [the '.']".
Or you could just use "^a" as the pattern unless you don't want to match the one character string containing only "a".
Does that help?
Jeffrey
