Removing brackets in R and converting the value to negative

I faced this issue with some numeric columns in R. Some negative values in these columns are wrapped in brackets, so the column is converted to a factor.
How can I remove the brackets in R and make the value negative, e.g. "(265)" to -265?
How can I use the gsub function in R to do this? If any other method is available, please suggest it.

Here is an alternative. Regex match is made on values that start and end with a round bracket, and contain one or more numeric characters between, returning the middle-group (numeric characters) with a minus-sign in front. The whole lot is then cast to numeric:
as.numeric(gsub("^\\(([0-9]+)\\)$","-\\1",x))
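Since the question says the column was read in as a factor, it may help to convert it to character before applying the regex, because as.numeric() on a factor returns the internal level codes rather than the printed values. A minimal sketch, assuming a hypothetical sample column x; the pattern here also allows a decimal point, which is an assumption about the data:

```r
# Hypothetical sample data: a factor column with bracketed negatives
x <- factor(c("123", "(265)", "(200)", "78.5"))

s <- as.character(x)                        # recover the printed values, not the level codes
s <- sub("^\\(([0-9.]+)\\)$", "-\\1", s)    # "(265)" -> "-265"; other values are untouched
result <- as.numeric(s)
result
```

Values that are neither plain numbers nor bracketed numbers would come back as NA with a warning, which can be a useful signal that the column contains something unexpected.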

Just in case there is something else going on with numbers:
convert.brackets <- function(x){
  if(grepl("\\(.*\\)", x)){
    paste0("-", gsub("\\(|\\)", "", x))
  } else {
    x
  }
}
x <- c("123", "(456)", "789")
sapply(x, convert.brackets, USE.NAMES = FALSE)
[1] "123" "-456" "789"
Otherwise, if every value is bracketed, simply:
paste0("-", gsub("\\(|\\)", "", x))
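The sapply() call above can also be written without an explicit loop, because grepl(), gsub() and ifelse() are all vectorised. A sketch under the same assumptions as the answer (the helper name convert_brackets_vec is made up for illustration):

```r
convert_brackets_vec <- function(x) {
  bracketed <- grepl("^\\(.*\\)$", x)    # TRUE where the whole value is wrapped in ( )
  ifelse(bracketed, paste0("-", gsub("[()]", "", x)), x)
}

x <- c("123", "(456)", "789")
converted <- convert_brackets_vec(x)
as.numeric(converted)
```

Unlike the unconditional paste0() one-liner, this only prefixes a minus sign where brackets were actually present.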

Related

Extracting numbers (in decimal and </> form) from strings in R

I have a dataset in which a column (the result variable) contains data in both numeric and character form [e.g. positive, negative, <0.1, 600, >1000 etc].
I want to extract only the numeric data in this column (i.e. <0.1, 600, >1000). Ideally without the use of any external packages.
I tried the following:
x<-gsub('\\D','', x)
But it removes the decimals or less than/more than sign (e.g. 1.56 became 156, <1.0 became 10)
I then tried the following:
x<-as.numeric(gsub("(\\D)\\.","", x))
This time round it keeps the decimal but coerced other values such as <0.1, >100 to become NAs instead.
So my question is: is there any way I can modify the function so that it keeps values containing '<' or '>' as they are, without replacement?
Meaning from
x = c("negative","positive","1.22","<1.0",">200")
I will be able to get back
x = c("","","1.22","<1.0",">200")
I would really appreciate it if someone could show me how to resolve this issue. Thanks!
Do you need this?
> gsub("[^0-9.<>]", "", x)
[1] "" "" "1.22" "<1.0" ">200"
Does this work for you? Using grep we can find which items of the vector contain numbers; value=TRUE then returns those items rather than their indices. Another way is to use grepl to get a logical vector for the match. Also, in your case \\D does not work because it matches every non-digit character, including the dot and the less-than and greater-than signs.
grep('\\d+', x, value=TRUE)
would yield : [1] "1.22" "<1.0" ">200"
grepl('\\d+', x)
would yield: [1] FALSE FALSE TRUE TRUE TRUE
You may also try gsub using:
> gsub('[a-zA-Z]+', '', x)
[1] "" "" "1.22" "<1.0" ">200"
Using str_remove_all from stringr:
library(stringr)
str_remove_all(x, "[A-Za-z]+")
#[1] "" "" "1.22" "<1.0" ">200"
What about something like this? Find the elements that do not match your conditions and set them to an empty string.
x[grep('[a-zA-Z]', x)] <- ""
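Note that the character-stripping answers keep whatever is left after removal, so a mixed value such as "abc1.5" would become "1.5". If only entries that are entirely numeric (with an optional leading < or >) should survive, one possible sketch validates the whole string instead. The pattern and the extra sample value "abc1.5" are assumptions, not taken from the question:

```r
x <- c("negative", "positive", "1.22", "<1.0", ">200", "abc1.5")

# Optional < or >, then digits, then an optional decimal part; anchored to the whole string
is_numeric_like <- grepl("^[<>]?[0-9]+(\\.[0-9]+)?$", x)
cleaned <- ifelse(is_numeric_like, x, "")
cleaned
```

This uses base R only, in line with the request to avoid external packages.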

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we define groups for collapsing, looking for character elements that come directly after other character elements so that they land in the same group. Finally we collapse all the values in the same group.
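To make the grouping step easier to follow, here is the cumsum() expression from clean() evaluated on its own for the example vector. The intermediate names prev_is_char and starts_group are introduced only for illustration:

```r
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
is_char <- grepl("[[:alpha:]]", myVec)

prev_is_char <- c(FALSE, head(is_char, -1))            # did the previous element contain letters?
starts_group <- !is_char | (is_char & !prev_is_char)   # numbers, and the first of a run of letters
grp <- cumsum(starts_group)
grp
# [1] 1 2 2 3 4 5 5 6 7 8
```

Elements 2 and 3 ('asd', 'gkd') share group 2, and elements 6 and 7 ('1.3zyz', 'fva') share group 5, which is why tapply() pastes them together.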
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))
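The last line relies on a lookbehind and a lookahead, so only delimiters that sit between two letters are removed. A small sketch of that step in isolation, using a shortened, already cleaned vector chosen for illustration:

```r
joined <- paste(c("1.2", "asd", "gkd", "232"), collapse = "|")
joined
# [1] "1.2|asd|gkd|232"

# Drop a "|" only when a letter is immediately before and after it
merged <- gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", joined, perl = TRUE)
merged
# [1] "1.2|asdgkd|232"

pieces <- unlist(strsplit(merged, split = "\\|"))
pieces
# [1] "1.2"    "asdgkd" "232"
```

One caveat of this approach is that it assumes no element of the vector contains the delimiter character itself.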

extract substring in R

Suppose I have a string such as "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of strings containing only the bracketed numbers, e.g. [+229][+57].
Is there a convenient way in R to do this?
Using base R, with the string stored in s, try
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"
We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)
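The extraction, de-duplication and vapply() steps can be wrapped into one helper. This is only a sketch: the function name unique_brackets is hypothetical, and it matches the brackets explicitly instead of re-adding them afterwards:

```r
unique_brackets <- function(x) {
  # One list element per input string, each holding all "[+digits]" matches
  matches <- regmatches(x, gregexpr("\\[\\+[0-9]+\\]", x))
  # character(1) makes vapply() return a character vector even for zero-length input
  vapply(matches, function(m) paste(unique(m), collapse = ""), character(1))
}

x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
res <- unique_brackets(x)
res
# [1] "[+229][+57]"

empty <- unique_brackets(character())
empty
# character(0)
```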

put left padded zeros inside string

I want to write a function that takes a character vector (including numbers) as input and left-pads the numbers in it with zeros. For example, this could be an input vector:
x<- c("abc124.kk", "77kk-tt", "r5mm")
x
[1] "abc124.kk" "77kk-tt" "r5mm"
Each string of the input vector contains only one number, but they are all in different positions (some at the end, some in the middle...).
I want the output to look like this:
"abc124.kk" "077kk-tt" "r005mm"
That means adding as many leading zeros to the number in each string as needed so that it has as many digits as the longest number.
But I want a function that does this for any vector of strings, not only my example (the x vector).
I have already started extracting the numbers and letters and padded the numbers the way I want them, but how can I put them back together in the right position?
my_function<- function(x){
letters<- str_extract_all(x,"[a-z]+")
numbers<- str_extract_all(x, "[0-9]+")
digit_width<-max(nchar(numbers))
numbers_correct<- str_pad(numbers, width=digit_width, pad="0")
}
And what if I have a vector that includes some strings without numbers? How can I exclude them and get them back without any changes?
For example, if the input were
y<- c("12ab", "cd", "ef345")
the numbers variable looks like this:
[[1]]
[1] "12"
[[2]]
character(0)
In this case I would want the output at the end to look like this:
"012ab" "cd" "ef345"
An option would be using gsubfn to capture the digits, convert it to numeric and then pass it to sprintf for formatting
library(gsubfn)
gsubfn("([0-9]+)", ~ sprintf("%03d", as.numeric(x)), x)
#[1] "abc124.kk" "077kk-tt" "r005mm"
x <- c("12ab", "cd", "ef345")
s = gsub("\\D", "", x)
n = nchar(s)
max_n = max(n)
sapply(seq_along(x), function(i){
if (n[i] < max_n) {
zeroes = paste(rep(0, max_n - n[i]), collapse = "")
gsub("\\d+", paste0(zeroes, s[i]), x[i])
} else {
x[i]
}
})
#[1] "012ab" "cd" "ef345"
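Since the question asks for a reusable function, here is one possible base R sketch that computes the padding width from the data and leaves strings without digits unchanged. The name pad_numbers is hypothetical, and the sketch assumes at most one run of digits per string, as stated in the question:

```r
pad_numbers <- function(x) {
  m <- regexpr("[0-9]+", x)          # position of the first digit run (-1 if none)
  digits <- regmatches(x, m)         # one entry per string that actually matched
  if (length(digits) == 0) return(x) # nothing to pad
  width <- max(nchar(digits))
  padded <- paste0(strrep("0", width - nchar(digits)), digits)
  regmatches(x, m) <- padded         # write the padded numbers back in place
  x
}

out_x <- pad_numbers(c("abc124.kk", "77kk-tt", "r5mm"))
out_x
# [1] "abc124.kk" "077kk-tt"  "r005mm"

out_y <- pad_numbers(c("12ab", "cd", "ef345"))
out_y
# [1] "012ab" "cd"    "ef345"
```

The replacement form regmatches(x, m) <- value only touches the matched positions, so the surrounding letters stay where they were.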

Extract first X Numbers from Text Field using Regex

I have strings that looks like this.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
I need to end up with:
"2134", "0983", and "8723"
Essentially, I need to extract the first four characters that are numbers from each element. Some begin with a letter (disallowing me from using a simple substring() function).
I guess technically, I could do something like:
x <- gsub("^P","",x)
x <- substr(x,1,4)
But I want to know how I would do this with regex!
You could use str_match from the stringr package:
library(stringr)
print(c(str_match(x, "\\d\\d\\d\\d")))
# [1] "2134" "0983" "8723"
You can do this with gsub too.
> sub('.?([0-9]{4}).*', '\\1', x)
[1] "2134" "0983" "8723"
I used sub instead of gsub to ensure I only got the first match. .? matches any single character and is optional (a plain . would fail to match the case without the leading P). The () define a group that I reference in the replacement as '\\1'. If there were multiple sets of () I could reference them too, e.g. with '\\2'. Inside the group (and you had that syntax correct) I want only digits, and exactly 4 of them. The final piece matches zero or more trailing characters of any type.
Your syntax was working, but you were replacing something with itself so you wind up with the same output.
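An extraction-based alternative in base R is to pull the match out directly with regexpr() and regmatches() instead of replacing everything around it. A short sketch; the no-match input "abc" is an added example to show how the two approaches differ:

```r
x <- c("P2134.asfsafasfs", "P0983.safdasfhdskjaf", "8723.safhakjlfds")

# regexpr() finds the first run of exactly four consecutive digits in each string
first_four <- regmatches(x, regexpr("[0-9]{4}", x))
first_four
# [1] "2134" "0983" "8723"

# sub() returns the input unchanged when the pattern does not match
no_match <- sub('.?([0-9]{4}).*', '\\1', "abc")
no_match
# [1] "abc"
```

Keep in mind that regmatches() drops elements with no match, so the result can be shorter than the input, whereas sub() always returns one value per input string.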
This will get you the first four digits of a string, regardless of where in the string they appear.
mapply(function(x, m) paste0(x[m], collapse=""),
strsplit(x, ""),
lapply(gregexpr("\\d", x), "[", 1:4))
Breaking it down into pieces, what's going on in the above call is as follows:
# this will get you a list of matches of digits, and their location in each x
matches <- gregexpr("\\d", x)
# this gets you each individual digit
matches <- lapply(matches, "[", 1:4)
# individual characters of x
splits <- strsplit(x, "")
# get the appropriate string
mapply(function(x, m) paste0(x[m], collapse=""), splits, matches)
Another group capturing approach that doesn't assume 4 numbers.
x <- c("P2134.asfsafasfs","P0983.safdasfhdskjaf","8723.safhakjlfds")
gsub("(^[^0-9]*)(\\d+)([^0-9].*)", "\\2", x)
## [1] "2134" "0983" "8723"
