Extracting numbers (in decimal and </> form) from strings in R

I have a dataset in which a column (the result variable) contains data in both numeric and character form [e.g. positive, negative, <0.1, 600, >1000, etc.].
I want to extract only the numeric data in this column (i.e. <0.1, 600, >1000), ideally without the use of any external packages.
I tried the following:
x<-gsub('\\D','', x)
But it removes the decimal point and the less-than/greater-than signs (e.g. 1.56 became 156, <1.0 became 10).
I then tried the following:
x<-as.numeric(gsub("(\\D)\\.","", x))
This time it keeps the decimal point but coerces other values such as <0.1 and >100 to NA instead.
So my question is: is there any way I can modify the call so that it keeps values containing '<' or '>' as they are, without replacement?
Meaning from
x = c("negative","positive","1.22","<1.0",">200")
I will be able to get back
x = c("","","1.22","<1.0",">200)
I would really appreciate it if someone could teach me how to resolve this issue. Thanks!

Do you need this?
> gsub("[^0-9.<>]", "", x)
[1] "" "" "1.22" "<1.0" ">200"

Does this work for you? Using grep we can find which items of the vector contain numbers, and with value=TRUE we get those items back directly. Another way is to use grepl to get a logical vector for the match. Also, in your case \\D would not work, as it matches all non-digit characters, including the dot and the less-than/greater-than signs.
grep('\\d+', x, value=TRUE)
would yield : [1] "1.22" "<1.0" ">200"
grepl('\\d+', x)
would yield: [1] FALSE FALSE TRUE TRUE TRUE
You may also try gsub using:
> gsub('[a-zA-Z]+', '', x)
[1] "" "" "1.22" "<1.0" ">200"

Using str_remove_all from stringr
library(stringr)
str_remove_all(x, "[A-Za-z]+")
#[1] "" "" "1.22" "<1.0" ">200"

What about something like this? Find the elements that do not match your conditions and set them to an empty string.
x[grep('[a-zA-Z]', x)] <- ""

extract substring in R

Suppose I have the string "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of strings that contains only the numbers with their brackets, e.g. [+229][+57].
Is there a convenient way in R to do this?
Using base R (with s being the string above), try:
> s <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
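For completeness (my addition, not part of the original answer), on the example string this gives:
gsub("[[:alpha:]]", "", x)
# [1] "[+229][+57][+229]"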
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust to zero-length inputs (always returning a character vector with length equal to that of x):
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)
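Putting the pieces together (a sketch of my own, not part of the original answer; the name bracketed is made up), the whole extraction can be wrapped in a small vectorised helper:
bracketed <- function(x) {
    m <- regmatches(x, gregexpr("[+[:digit:]]+", x))
    # one collapsed string of unique bracketed numbers per input element
    vapply(m, function(el) paste0("[", unique(el), "]", collapse = ""), character(1))
}
bracketed("S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR")
# [1] "[+229][+57]"
bracketed(character())
# character(0)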

replace characters after occurrence of a specific character in R

I have a character vector like this:
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
I am trying to get an output like this using base R.
NM020506, NM, NM00
i.e. ignore everything after "_".
I tried something like this, but clearly it is not correct.
a
[1] "NM020506_1" "NM_020519_1" "NM00_1030297.2"
> substr(a,1,unlist(gregexpr(pattern ='_',a))-1)
[1] "NM020506" "NM" "NM00_1030"
>
You can use the sub function, substituting everything from the first _ onward with the empty string.
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sub("_.*","",a)
[1] "NM020506" "NM" "NM00"
There is no need to use gregexpr, which finds all matches, when you only need the first one. You can use regexpr instead, which returns only the first match:
substr(a,1,regexpr(pattern ='_',a)-1)
[1] "NM020506" "NM" "NM00"
You can use strsplit as:
#data
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sapply(strsplit(a,"_"),function(x)x[1])
#[1] "NM020506" "NM" "NM00"

r grepl to distinguish between no and not

I am dealing with two strings like this below
x1 <- "Unknown, because not discussed"
x2 <- "Not at goal, no."
How do I use the grepl function to distinguish between these two strings?
When I use grepl("no", x1), it returns TRUE, which is not correct: it is picking up the "no" in "not" and "Unknown". How do I use a string-parsing function to detect strings that contain the word "no" explicitly? Any advice is much appreciated.
You can use the word boundary \\b to distinguish them. \\bno\\b will match no only when it is not preceded or followed by word characters:
grepl("\\bno\\b", x1)
# [1] FALSE
grepl("\\bno\\b", x2)
# [1] TRUE
I can think of a couple of options for matching "no" but not "not":
Using the \b "word boundary" pattern:
> x = c("Unknown, because not discussed", "Not at goal, no.")
> grepl("\\bno\\b", x)
[1] FALSE TRUE
Using [^t] to exclude "not":
> grepl("\\bno[^t]", x)
[1] FALSE TRUE
For matching the word "no" by itself, the word boundary option "\\bno\\b" is probably best.
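If case should not matter (say a sentence starting with "No"), grepl()'s ignore.case argument can be combined with the boundary pattern; a small illustration of my own:
grepl("\\bno\\b", c("No, not at goal.", "Not at goal."), ignore.case = TRUE)
# [1]  TRUE FALSE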

Removing Bracket in R and convert value to negative

I am facing this issue with some numeric columns in R. Some of the negative values in these columns are enclosed in brackets, and the column gets converted into a factor.
How do I remove the brackets in R and make the value negative? E.g. "(265)" to -265.
How can I use the gsub function in R to do this? If any other method is available, please suggest it.
Here is an alternative. The regex matches values that start and end with a round bracket and contain one or more numeric characters in between, returning the middle group (the numeric characters) with a minus sign in front. The whole lot is then cast to numeric:
as.numeric(gsub("^\\(([0-9]+)\\)$","-\\1",x))
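A quick check (my addition) on a few sample values:
x <- c("(265)", "123", "(40)")
as.numeric(gsub("^\\(([0-9]+)\\)$", "-\\1", x))
# [1] -265  123  -40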
Just in case there is something else going on with numbers:
convert.brackets <- function(x){
    if(grepl("\\(.*\\)", x)){
        paste0("-", gsub("\\(|\\)", "", x))
    } else {
        x
    }
}
x <- c("123", "(456)", "789")
sapply(x, convert.brackets, USE.NAMES = F)
[1] "123" "-456" "789"
Otherwise simply:
paste0("-", gsub("\\(|\\)", "", x))

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two digits. I would like every element to have at least 3 digits, like this (the issue is only in the BB tags, if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
I've looked into the sprintf and formatC functions but am still having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub calls can be enough: add one leading zero for two-digit numbers, two leading zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
This works, but may have edge cases:
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
                formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using the sprintf and gsub functions. This step extracts the numeric values and changes their format:
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
The next step is to paste the reformatted numbers back onto the alphabetic part:
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")
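A quick check on part of the sample vector (my own verification): the BB tags come out as requested, but note that a tag with two digit groups such as "EP121_1" gets its digits collapsed into "EP1211", so this approach fits best when each tag contains a single run of digits.
x <- c("Y188", "EP121_1", "BB2", "BB20", "BB044")
num <- sprintf("%03d", as.numeric(gsub("[^[:digit:]]", "", x)))
paste(gsub("[^[:alpha:]]", "", x), num, sep = "")
# [1] "Y188"   "EP1211" "BB002"  "BB020"  "BB044"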
