removing numeric character from data - r

I want to remove the number 0 from all the names in my data.frame.
I have tried to do it myself, however, working with strings is a first time for me.
I have tried:
gsub('\0', '', df )
reproducible code:
df <- c("y2016.09", "y2010.05", "y2010.06", "y2010.07", "y2010.08",
"y2010.09")
expected output
y2016.9
y2010.5
y2010.6
y2010.7
y2010.8
y2010.9

We can specify the location of . (. is a metacharacter in regex - for any character, so it is escaped \\ to evaluate it literally) and 0 or more character of 0's is matched (0*), in the replacement, replace with . i.e. the one we removed by matching
sub("\\.0*", ".", df)
#[1] "y2016.9" "y2010.5" "y2010.6" "y2010.7" "y2010.8" "y2010.9"

Here is another regex solution using lookarounds, but not as simple as the one by #akrun
> gsub("(?<=\\.)0+","",df,perl = TRUE)
[1] "y2016.9" "y2010.5" "y2010.6" "y2010.7" "y2010.8" "y2010.9"

Related

Erase comma and apostrophe in character R

I want to remove the comma and the apostrophe but the point of the following character. After that pass to numeric
I have this:
characterExample <- "234'564,900.99"
I want 234564900.99
I try the following but I can't:
result <- gsub("[:punct:].","", characterExample)
Another option is to explicitly remove the characters you want to remove:
gsub("[',]", "", characterExample)
#[1] "234564900.99"
``
An option is to not match the digits or the . by using ^ within the square bracket
gsub("[^0-9.]+","", characterExample)
#[1] "234564900.99"
Or another option is to make use of SKIP/FAIL for the ., while matching the rest of the punct
gsub("(\\.)(*SKIP)(*F)|[[:punct:]]+", "", characterExample, perl = TRUE)
#[1] "234564900.99"
NOTE: Both solutions make sure that it matches any punct characters other than the . and replace with blank ("")
It can also use the pipe symbol like this:
#Code
gsub(",|'","", characterExample)
Output:
gsub(",|'","", characterExample)
[1] "234564900.99"

Remove some special characters from a string and convert to decimal a column of a data frame [duplicate]

I have a column in a dataframe as follows:
COL1
$54,345
$65,231
$76,234
How do I convert it into this:
COL1
54345
65231
76234
The way I tried it at first was:
df$COL1<-as.numeric(as.character(df$COL1))
That didn't work because it said NA's were introduced.
Then I tried it like this:
df$COL1<-as.numeric(gsub("\\$","",as.character(df$COL1)))
And the same this happened.
Any ideas?
We could use parse_number from readr package which removes any non-numeric characters.
library(readr)
parse_number(df$COL1)
#[1] 54345 65231 76234
The reason why the gsub didn't work was there was , in the column, which is still non-numeric. So when convert to 'numeric' with as.numeric, all the non-numeric elements are converted to NA. So, we need to remove both , and $ to make it work.
df1$COL1 <- as.numeric(gsub('[$,]', '', df1$COL1))
We match the $ and , inside the square brackets ([$,]) so that it will be considered as that character ($ left alone has special meaning i.e. it signifies the end of the string.) and replace it with ''.
Or we can escape (\\) the character ($) to match it and replace by ''.
df1$COL1 <- as.numeric(gsub('\\$|,', '', df1$COL1))
Another option using stringr library to remove '$' and ',' then convert as follows:
df %>% mutate(COL1 = COL1 %>% str_remove_all("\\$,") %>% as.numeric())
Nested gsub to handle negatives and transform to make it functional and to take advantage of NSE
transform(df, COL1 = as.numeric(gsub("[$),]", "", gsub("^\\(", "-", COL1))))

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column which has 75 rows of variables such as the col above. I am not quite sure how to use gsub or sub in order to get up until the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)
Following may help you here too.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where Input for data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: Following is only for explanation purposes.
sub(" ##using sub for global subtitution function of R here.
([^:]*) ##By mentioning () we are keeping the matched values from vector's element into 1st place of memory(which we could use later), which is till next colon comes it will match everything.
: ##Mentioning letter colon(:) here.
([^:]*) ##By mentioning () making 2nd place in memory for matched values in vector's values which is till next colon comes it will match everything.
.*" ##Mentioning .* to match everything else now after 2nd colon comes in value.
,"\\1:\\2" ##Now mentioning the values of memory holds with whom we want to substitute the element values \\1 means 1st memory place \\2 is second memory place's value.
,df$dat) ##Mentioning df$dat dataframe's dat value.
You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
See the regex demo
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
See the regex demo
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, it should be used with the PCRE regex engine, pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
See the R online demo.
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")

conditionally remove leading or trailing `.` character in R

I have a vector of names where some names have leading and trailing . characters, and some do not. Here is an example:
test <- c('.name.1.','name.2','.name.3.')
I would like to conditionally remove leading and trailing . characters in these names, to return
c('name.1','name.2','name.3')
Use regular expressions:
test <- c('.name.1.','name.2','.name.3.')
gsub('^\\.|\\.$', '', test)
# [1] "name.1" "name.2" "name.3"
The two backslashes, \\, in the regular expression escape the dot, ., which would actually mean any character. The caret, ^, marks the beginning of the string, the dollar, $, the end of the string. The pipe, |, is a logical "or". So in essence the regular expression matches a dot at the beginning of the string or a dot at the end of the string and replaces it with an empty string.
More information on regular expressions can be found here and information on gsub and related functions here.
A quick function using the substr function:
fun1 <- function(x) substr(x, 1 + (1 * as.numeric(substr(x,1,1)=='.')), nchar(x) - (1 * as.numeric(substr(x, nchar(x), nchar(x)) == '.')))
We use substr to check for a . in the first and last elements of the string, then we use substr again to extract certain parts of the text. For example, if there is a . in the first character, but not in the second, we'll extract: substr(text, 2, nchar(text)).
fun1(test)
[1] "name.1" "name.2" "name.3"
Just for fun, here is a method with substring and grepl.
substring(test, 1L + grepl("^\\.", test), nchar(test) - grepl("\\.$", test))
[1] "name.1" "name.2" "name.3"
This will work replacing substring with substr. The cool thing about these functions is that they take vectors for their second and third arguments. Here, we can use grepl to increment between 1L and 2L for the second argument and between the position of the final character and the penultimate character.
You can also use str_extract from stringr:
library(stringr)
str_extract(test, "\\w+\\.\\d")
or str_replace_all (stringr-equivalent to gsub):
str_replace_all(test, "[.](.+)[.]", "\\1")
# [1] "name.1" "name.2" "name.3"

Convert currency with commas into numeric

I have a column in a dataframe as follows:
COL1
$54,345
$65,231
$76,234
How do I convert it into this:
COL1
54345
65231
76234
The way I tried it at first was:
df$COL1<-as.numeric(as.character(df$COL1))
That didn't work because it said NA's were introduced.
Then I tried it like this:
df$COL1<-as.numeric(gsub("\\$","",as.character(df$COL1)))
And the same this happened.
Any ideas?
We could use parse_number from readr package which removes any non-numeric characters.
library(readr)
parse_number(df$COL1)
#[1] 54345 65231 76234
The reason why the gsub didn't work was there was , in the column, which is still non-numeric. So when convert to 'numeric' with as.numeric, all the non-numeric elements are converted to NA. So, we need to remove both , and $ to make it work.
df1$COL1 <- as.numeric(gsub('[$,]', '', df1$COL1))
We match the $ and , inside the square brackets ([$,]) so that it will be considered as that character ($ left alone has special meaning i.e. it signifies the end of the string.) and replace it with ''.
Or we can escape (\\) the character ($) to match it and replace by ''.
df1$COL1 <- as.numeric(gsub('\\$|,', '', df1$COL1))
Another option using stringr library to remove '$' and ',' then convert as follows:
df %>% mutate(COL1 = COL1 %>% str_remove_all("\\$,") %>% as.numeric())
Nested gsub to handle negatives and transform to make it functional and to take advantage of NSE
transform(df, COL1 = as.numeric(gsub("[$),]", "", gsub("^\\(", "-", COL1))))

Resources