Losing information when converting from character to numeric in R

I'm trying to convert characters like "9.230" to a numeric type.
First I removed the dots, because the conversion was returning "NA", and then I converted to numeric.
The problem is that when I convert to numeric I lose the trailing zero:
Example:
a<-9.230
as.numeric(gsub(".","",a,fixed=TRUE))
Returns: 923
Does anyone know how to avoid this?

You assigned the number 9.230, which is the same as 9.23. How is the system supposed to know that there was a trailing zero? If you want to transform a string, work with the string "9.230".

Look at the result of
a<-9.230
gsub(".","",a,fixed=TRUE)
#[1] "923"
The question is: why? Because fixed = TRUE was used in the call to gsub, the literal . is replaced by the second argument of gsub, which is "".
That is basically why as.numeric(gsub(".", "", a, fixed = TRUE)) results in 923.
There is another point: how was a <- 9.230 changed to character inside gsub? This is explained in the R documentation for gsub:
Arguments: x, text
a character vector where matches are sought, or an object
which can be coerced by as.character to a character vector. Long
vectors are supported.
Final question: how to avoid this behavior?
Don't use gsub. Use sprintf("%.3f", a).
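For illustration, a minimal sketch (hypothetical session) of why the zero disappears and how sprintf keeps it:
a <- 9.230
as.character(a)     # "9.23"  - the trailing zero is already gone before gsub sees it
sprintf("%.3f", a)  # "9.230" - formatting the number with three decimals keeps it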

Related

Remove empty strings with library(stringr) in R

I have a vector of strings named words, and I need to remove all empty strings using library(stringr). I tried str_remove_all(words, pattern = ""), but it showed me:
Error: Empty `pattern` not supported.
What should I do? Any help would be appreciated.
How about just using a subset in base R:
words <- words[words != ""]
If perhaps your "empty" words are not really empty, but actually contain one or more whitespace characters, then use grepl to remove them:
words <- words[!grepl("^\\s+$", words)]
If you really want to use stringr, then you probably want str_subset, which filters whole elements of a vector instead of removing the match from within each element. Here is a pattern that keeps only strings with at least one character:
str_subset(words, ".+")

How to use wildcard in gsub replacement

I have a column of strings, e.g.
strings <- c("SometextPO0001moretext", "SometextPO0008moretext")
The 'sometext' and 'moretext' portions are variable in length. I want to remove the PO000* portion of the strings, where * is a wildcard. I've tried
gsub("PO000*", "", strings)
and Googled quite a bit but surprisingly haven't found an answer to this seemingly simple question. Since the last character varies, I would like to be able to do the removal this way vs. hard-coding a large number of variants. Any help would be appreciated!
For a single wildcard, you need to use a .. The * that you used means the preceding character is repeated zero or more times, and here that preceding character was 0.
gsub("PO000.", "", strings) would remove both PO0001 and PO0008
I think it should be gsub("PO000\\d{1}", "", strings)
And the result is:
[1] "Sometextmoretext" "Sometextmoretext"

How to convert special characters into unicode in R?

When doing some text data cleaning in R, I sometimes find special characters. In order to get rid of them, I have to know their Unicode code points; for example, € is \u20AC. I would like to know if it is possible to "see" the code point with a function that takes the string containing the special character as input.
Referring to Cath's comment, iconv can do the job:
iconv("é", toRaw = TRUE)
Then you may want to unlist the result and paste it with \u00.
special_char <- "%"
Unicode::as.u_char(utf8ToInt(special_char))
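A rough sketch of both ideas applied to the euro sign (a hypothetical session; the sprintf call is just one possible way to assemble the \u escape, and the raw bytes assume a UTF-8 string):
x <- "\u20AC"                      # the euro sign
utf8ToInt(x)                       # 8364, i.e. hex 20AC
sprintf("\\u%04X", utf8ToInt(x))   # "\\u20AC"
iconv(x, toRaw = TRUE)             # raw UTF-8 bytes: e2 82 ac
Unicode::as.u_char(utf8ToInt(x))   # U+20AC (requires the Unicode package)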

Issue with a column containing special characters

I have a data frame in R that contains a character column with values as follows:
"\"121.29\""
"\"288.1\""
"\"120\""
"\"V132.3\""
"\"800\""
I am trying to get rid of the extra " and \ and retain clean values as below:
121.29
288.10
120.00
V132.30
800.00
I tried gsub("([\\])", "", x) and also the str_replace_all function, but so far no luck. I would much appreciate it if anybody could help me resolve this issue. Thanks in advance.
Try
gsub('\\"',"",x)
[1] "121.29" "288.1" "120" "V132.3" "800"
Since the fourth entry is not numeric and an atomic vector can only contain entries of the same mode, the entries are all characters in this case (the most flexible mode capable of storing the data). So there will still be quotes around each entry when the vector is printed.
Because \ is a special character, it needs to be escaped with another backslash, so the expression \\" is passed as the first parameter to gsub(). Moreover, as suggested by @rawr, one can use single quotes around the pattern so the double quote itself does not need to be escaped.
An alternative would be to use double quotes and escape them, too:
gsub("\\\"","",x)
which yields the same result.
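For reference, a small reproducible sketch with the values from the question, showing that both spellings build the same pattern:
x <- c("\"121.29\"", "\"288.1\"", "\"120\"", "\"V132.3\"", "\"800\"")
gsub('\\"', "", x)   # pattern written with single quotes
gsub("\\\"", "", x)  # same pattern written with double quotes and an extra escape
# [1] "121.29" "288.1"  "120"    "V132.3" "800"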
Hope this helps.

Write table row names problems in R

I'm using the write.table() method to write a matrix to a text file. My matrix has row and column names. I noticed that R messes up those names.
First of all, names that start with a digit are written with an X prefix. For example, 1005_at becomes X1005_at.
Second, characters such as - and / are substituted with a dot (.).
Why is this happening? Is there a way to avoid this crazy issue?
make.names is used to convert names to syntactically valid ones. Check out this small example:
> make.names(c(".1 - / q", "if", "0", "NA"))
[1] "X.1.....q" "if." "X0" "NA."
The documentation says:
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number.
<...>
The character "X" is prepended if necessary. All invalid characters
are translated to "."
