Is there a way to convert a string into a numeric type?
For instance:
> as.numeric("1.560,65")
[1] NA
Warning message:
NAs introduced by coercion
I receive the above warning and the result is NA.
I need the thousands to be displayed and separated by a dot (e.g. 1.560), while the decimals are displayed and separated by a comma.
> as.numeric("1.560")
[1] 1.56
> as.numeric("1.560")>2
[1] FALSE
In the above example I want R to convert 1.560 into the number one thousand five hundred sixty, but it reads it as 1.56. That is lower than 2, so my computations are wrong.
Any help is much appreciated.
You have to use a regular expression to rewrite your strings into a format R understands, and then convert the result to numeric:
as.numeric(gsub(",", "\\.", gsub("\\.","", "1.560,65")))
[1] 1560.65
For numeric formatting, see formatC:
formatC(1560.65, format = "f", big.mark = ".", decimal.mark = ",")
[1] "1.560,6500"
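The call above prints four decimals because no digits argument was given. As a small sketch, assuming two decimals are wanted for display (the variable name `formatted` is only for illustration), the digits argument of formatC can be set explicitly:

```r
# Format a number with "." as thousands separator and "," as decimal mark,
# keeping exactly two decimals (digits = 2 is an assumption about the desired output)
formatted <- formatC(1560.65, format = "f", digits = 2,
                     big.mark = ".", decimal.mark = ",")
print(formatted)
```

Note that the result is a character string meant for display; keep the numeric value for any computation.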
String pattern matching and replacement can be done with the gsub function. Here is an example for your case:
str_numbers <- c("1.560,65", "134,2","123","0,32")
as.numeric(gsub(",", "\\.", gsub("\\.", "", str_numbers)))
The inner call runs first and replaces every . with an empty string. The outer call then replaces the , with a .
> (tmp <- gsub("\\.", "", str_numbers))
[1] "1560,65" "134,2" "123" "0,32"
> gsub(",", "\\.", tmp)
[1] "1560.65" "134.2" "123" "0.32"
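The two steps above can be wrapped in a small helper. This is a minimal sketch, not part of the original answers: the function name `parse_eu_number` is hypothetical, and it assumes every input uses "." only as a thousands separator and "," only as the decimal mark. It uses fixed = TRUE so that no regex escaping is needed.

```r
# Hypothetical helper: convert strings such as "1.560,65" to numeric.
# Assumes "." is always a thousands separator and "," the decimal mark.
parse_eu_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)   # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)   # turn the decimal comma into a dot
  as.numeric(x)
}

str_numbers <- c("1.560,65", "134,2", "123", "0,32")
parsed <- parse_eu_number(str_numbers)
print(parsed)
print(parsed[1] > 2)  # the comparison from the question now behaves as intended
```

Strings that already use a dot as the decimal mark (for example "1.56") would be misread by this helper, so it should only be applied to columns known to follow the European convention.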
Context
When importing columns with identical names from spreadsheet software, readxl renames the duplicates with the following syntax: "Col1","Col1" becomes "Col1","Col1...2". I would like to turn this into "Col1","Col1A" instead.
Here is a reproducible example:
Example
# Original string :
library(stringr)
string <- c("G01","G01...2","G02","G03","G04","G04...6","G05","G05...8")
# Desired result
result <- c("G01","G01A","G02","G03","G04","G04A","G05","G05A")
# this line successfully detects the wrongful entries :
str_detect(string,pattern = "[:alpha:][:digit:][:digit:]...[:digit:]")
# this line fails to address the issue correctly :
str_replace(string,"[:alpha:][:digit:][:digit:]...[:digit:]", "[:alpha:][:digit:][:digit:]A")
#output :
[1] "G01" "[:alpha:][:digit:][:digit:]A" "G02"
[4] "G03" "G04" "[:alpha:][:digit:][:digit:]A"
[7] "G05" "[:alpha:][:digit:][:digit:]A"
We could use str_remove to drop the substring that starts with one or more . followed by any other characters, and then use make.unique to disambiguate the duplicates by appending .1, .2, etc.
library(stringr)
make.unique(str_remove(string, "\\.+.*"))
If we need to append LETTERS instead, the limitation is that only 26 duplicates per name can be labelled.
Assuming there will not be more than 26 duplicates, you could do:
nm = sapply(strsplit(string, "\\.{3}"), function(x) x[1])
paste0(nm, ave(nm, nm, FUN = function(x) c("", LETTERS)[seq_along(x)]))
# [1] "G01" "G01A" "G02" "G03" "G04" "G04A" "G05" "G05A"
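The str_replace attempt in the question fails because the replacement argument is inserted literally; character classes such as [:alpha:] have no meaning there. As a hedged sketch in base R (the names `renamed` and `renamed_backref` are illustrative only), the suffix can be replaced directly, or the matched prefix can be reused through a backreference. This assumes each name occurs at most twice, so "A" is always the right suffix.

```r
# Column names as given in the question
string <- c("G01", "G01...2", "G02", "G03", "G04", "G04...6", "G05", "G05...8")

# Replace the literal "...<digits>" suffix with "A".
# \\.{3} matches three literal dots; [0-9]+$ matches the trailing index.
renamed <- sub("\\.{3}[0-9]+$", "A", string)
print(renamed)

# Same result with a capture group: \\1 in the replacement reuses the text
# matched by the first parenthesised group.
renamed_backref <- sub("^(.*?)\\.{3}[0-9]+$", "\\1A", string, perl = TRUE)
print(renamed_backref)
```

For names that may repeat more than twice, the LETTERS-based approach above is the safer choice.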
In the column text, how is it possible to remove all punctuation marks except the ?
data.frame(id = c(1), text = c("keep<>-??it--!##"))
expected output
data.frame(id = c(1), text = c("keep??it"))
A more general solution would be to use nested gsub calls that convert ? to a particular unusual string (like "foobar"), get rid of all punctuation, and then convert "foobar" back to ?:
gsub("foobar", "?", gsub("[[:punct:]]", "", gsub("\\?", "foobar", df$text)))
#> [1] "keep??it"
Using gsub you could do:
gsub("(\\?+)|[[:punct:]]","\\1",df$text)
[1] "keep??it"
gsub('[[:punct:] ]+',' ',data) removes all punctuation, which is not what you want.
But this does what you want:
library(stringr)
sapply(df, function(x) str_replace_all(x, "<|>|-|!|#|#",""))
id text
[1,] "1" "a"
[2,] "2" "keep??it"
Better IMO than other answers because no need for nesting, and lets you define whichever characters to sub.
Here's another solution using negative lookahead:
gsub("(?!\\?)[[:punct:]]", "", df$text, perl = TRUE)
[1] "keep??it"
The negative lookahead asserts that the next character is not a ? and then matches any punctuation.
Data:
df <- data.frame(id = c(1), text = c("keep<>-??it--!##"))
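Another way to express "everything except ?" is a negated character class that lists what to keep. The following is a sketch under the assumption that only letters, digits, whitespace and ? should survive; the variable names `text` and `cleaned` are illustrative.

```r
# Sample value from the question
text <- "keep<>-??it--!##"

# Remove every character that is not alphanumeric, whitespace, or "?".
# Inside the brackets, ^ negates the class and ? is a literal question mark.
cleaned <- gsub("[^[:alnum:][:space:]?]", "", text)
print(cleaned)
```

Unlike the [[:punct:]] approaches, this also removes any non-punctuation symbols that are not letters, digits or whitespace, so it is slightly stricter.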
I have a data frame with this kind of expression in column C:
GT_rs9628326:N_rs9628326
GT_rs1111:N_rs1111
GT_rs8374:N_rs8374
Using R, I want to remove everything between the first "T" and ":", as well as everything after the "N". I know this can be done with gsub. I would get:
GT:N
GT:N
GT:N
Maybe you can try
gsub("_\\w+","",s)
giving
[1] "GT:N" "GT:N" "GT:N"
Data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
Another option is to split the strings on :, remove the unnecessary text from each piece, and then collapse the pieces again with the same separator (I have used @ThomasIsCoding's data, thanks):
#Data
v1 <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
#Code
unlist(lapply(lapply(strsplit(v1,split = ':'),
function(x) sub("_[^_]+$", "", x)),
function(x) paste0(x,collapse = ':')))
Output:
[1] "GT:N" "GT:N" "GT:N"
Using str_remove from stringr
library(stringr)
str_remove_all(s, "_\\w+")
#[1] "GT:N" "GT:N" "GT:N"
data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
Remove the word characters that follow either "T" or "N", using @ThomasIsCoding's data.
gsub('(?<=T|N)\\w+', '', s, perl = TRUE)
#[1] "GT:N" "GT:N" "GT:N"
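For comparison, the same result can be obtained by capturing the two parts to keep and rebuilding the string from backreferences. This is a sketch that assumes every value has exactly the shape prefix_id:prefix_id; the name `kept` is illustrative.

```r
s <- c("GT_rs9628326:N_rs9628326", "GT_rs1111:N_rs1111", "GT_rs8374:N_rs8374")

# Group 1 captures the text before the first "_", group 2 the text between
# ":" and the next "_"; everything else is discarded.
kept <- sub("^([^_]+)_[^:]*:([^_]+)_.*$", "\\1:\\2", s)
print(kept)
```

Strings that do not match the assumed shape are returned unchanged by sub, which makes unexpected input easy to spot.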
I am trying to drop a euro character code from the start of each value in a column. The column was ingested as character by readr, but I need to convert it to integers.
data$price[1:3]
[1] "\u0080343,000.00" "\u0080185,000.00" "\u0080438,500.00"
So I need to get rid of \u0080 at the start (and the , and . but we'll deal with those later).
tried:
data$price <- sub("\u0080", "", data$price)
-- no change(!!!)
data$price <- substr(data$price, 7, 100)
-- invalid multibyte string, element 1 (???)
I'd like to get to:
343000, 185000, 438500
But not sure how to get there. Any wisdom would be much appreciated!
You can tell R to use the exact text rather than regular expressions by using the fixed = TRUE option.
price <- c("\u0080343,000.00", "\u0080185,000.00", "\u0080438,500.00")
sub("\u0080", "", price, fixed = TRUE)
[1] "343,000.00" "185,000.00" "438,500.00"
To remove the comma and convert to an integer, you can use gsub.
as.integer(gsub(",", "", sub("\u0080", "", price, fixed = TRUE)))
[1] 343000 185000 438500
You can do this:
gsub("[^ -~]+", "", price)
"343,000.00" "185,000.00" "438,500.00"
Explanation:
The Euro sign is a non-ASCII character. So to get rid of it in the values in price we define a character class of printable ASCII characters, [ -~] (space through tilde); by negating the class with the caret ^ we match everything outside that range, such as €. This pattern is matched in gsub and replaced by "", i.e., nothing.
To convert to integer, proceed as in @Adam's answer. To convert to numeric, you can do this:
as.numeric(gsub(",", "", gsub("[^ -~]+", "", price)))
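If the goal is only the integer amounts, a single whitelist pattern can do both clean-up steps at once. This is a sketch assuming the values contain only the currency character, digits, "," as thousands separator and "." as decimal mark; the names `digits_only` and `price_int` are illustrative.

```r
price <- c("\u0080343,000.00", "\u0080185,000.00", "\u0080438,500.00")

# Keep only digits and the decimal point; this drops the currency
# character and the thousands separators in one pass.
digits_only <- gsub("[^0-9.]", "", price)

# Convert via numeric so the ".00" part is parsed, then to integer.
price_int <- as.integer(as.numeric(digits_only))
print(price_int)
```

as.integer truncates any fractional part, so values with non-zero cents would need round() first if rounding is preferred.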
I have an R data frame in which one column is a factor that contains text. I want to extract the part of each string before the first space. I tried gsub( " .*$", "", data[, 3] ), where the third column (summary) is that field, but it is not working.
For example, my data is like "abcd efgh ijk" and I want "abcd".
I tried to convert my factor field as a character field using
data[, 3] <- sapply(data[, 3], as.character)
But it still fails to retrieve the string before the first space. Can you please help?
Sorry, I am not able to post the data here because it is client data.
Or split on the space and take the first piece:
x <- "abcd efgh ijk"
strsplit(x, " ")[[1]][1]
Try gsub( "\\s.*", "", data[, 3] ). \s is the regular expression for whitespace. You need the extra \ so that R's string parser passes a literal backslash on to the regex engine instead of treating \s as a string escape.
x<-"abcd efgh ijk"
gsub( "\\s.*", "", x )
[1] "abcd"
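Applied to a data frame column stored as a factor, the same pattern might look as follows. This is a sketch with made-up data: the data frame `dat`, its column `summary`, and the new column `first_word` are hypothetical stand-ins for the client data that could not be shared.

```r
# Hypothetical stand-in for the real data: a factor column holding text
dat <- data.frame(id = 1:2,
                  summary = factor(c("abcd efgh ijk", "single")))

# Convert the factor to character explicitly, then drop everything from the
# first whitespace character onward. sub is enough because only the first
# match needs to be replaced.
dat$first_word <- sub("\\s.*", "", as.character(dat$summary))
print(dat$first_word)
```

Values without any whitespace are returned unchanged, and the result is a character vector rather than a factor.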
Here is a useful cheat sheet of regular expressions:
https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/#downloads