Adding leading zero once imported into R - r

I have a data frame which includes a Reference column. This is a 10 digit number, which could start with zeros.
When importing into R, the leading zeros disappear, which I would like to add back in.
I have tried using sprintf and formatC, but I have different problems with each.
DF=data.frame(Reference=c(102030405,2567894562,235648759), Data=c(10,20,30))
The outputs I get are the following:
> sprintf('%010d', DF$Reference)
[1] "0102030405" "        NA" "0235648759"
Warning message:
In sprintf("%010d", DF$Reference) : NAs introduced by coercion
> formatC(DF$Reference, width=10, flag="0")
[1] "001.02e+08" "02.568e+09" "02.356e+08"
The first output gives NA when the number already has 10 digits, and the second returns the result in scientific notation.
What I need is:
[1] 0102030405 2567894562 0235648759

library(stringi)
DF = data.frame(Reference = c(102030405,2567894562,235648759), Data = c(10,20,30))
DF$Reference = stri_pad_left(DF$Reference, 10, "0")
DF
# Reference Data
# 1 0102030405 10
# 2 2567894562 20
# 3 0235648759 30
Alternative solutions: Adding leading zeros using R.
When importing into R, the leading zeros disappear, which I would like
to add back in.
Reading the column(s) in as characters would avoid this problem outright. You could use readr::read_csv() with the col_types argument.
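As a minimal sketch of reading the column as text, the example below swaps in base R's read.csv() with colClasses instead of readr. The inline CSV text is a hypothetical stand-in for the real file. With readr, the equivalent would be along the lines of col_types = cols(Reference = col_character()).

```r
# Hypothetical CSV content standing in for the real file.
csv_text <- "Reference,Data
0102030405,10
2567894562,20
0235648759,30"

# colClasses forces Reference to be read as text, so the leading zeros survive.
DF <- read.csv(text = csv_text,
               colClasses = c(Reference = "character", Data = "numeric"))
DF$Reference
```

Because the values never pass through a numeric type, no padding step is needed afterwards.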

formatC
You can use
formatC(DF$Reference, digits = 0, width = 10, format ="f", flag="0")
# [1] "0102030405" "2567894562" "0235648759"
sprintf
The use of d in sprintf means that your values are integers (or they have to be converted with as.integer()). help(integer) explains that:
"the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly."
That is why as.integer(2567894562) returns NA.
Another work around would be to use a character format s in sprintf:
sprintf('%010s',DF$Reference)
# [1] " 102030405" "2567894562" " 235648759"
But this gives spaces instead of leading zeros. gsub() can add zeros back by replacing spaces with zeros:
gsub(" ","0",sprintf('%010s',DF$Reference))
# [1] "0102030405" "2567894562" "0235648759"
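To make the integer limit concrete, here is a small sketch that compares the values against .Machine$integer.max and then pads them with a floating-point format, which does not require integer coercion. The variable name padded is only illustrative.

```r
DF <- data.frame(Reference = c(102030405, 2567894562, 235648759))

# Largest value an R integer can hold.
.Machine$integer.max

# Which references are too large to be coerced to integer?
too_big <- DF$Reference > .Machine$integer.max

# "%010.0f" formats the double with no decimals, zero-padded to width 10.
padded <- sprintf("%010.0f", DF$Reference)
padded
```

This is the sprintf counterpart of the formatC(..., format = "f") call shown above.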

Related

String Manipulation in R data frames

I just learnt R and was trying to clean data for analysis with string manipulation, using the code given below on the Amount_USD column of a table. I could not find why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is the two unneeded backslashes in your matching string. This also leads to a counting error in your first str_sub(), where you should be getting the first 8 characters, not 10. Finally, you should be getting the substring from the next character after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
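As a small self-contained sketch of the same cleanup, the example below uses fixed = TRUE so that the prefix is matched literally and the backslashes do not need regex escaping. It assumes the prefix is always the literal text \xc2\xa0, and the names cleaned and amount are only illustrative.

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# fixed = TRUE treats the pattern as plain text, not as a regular expression.
cleaned <- sub("\\xc2\\xa0", "", vec, fixed = TRUE)

# Drop the thousands separators, then convert to numeric.
amount <- as.numeric(gsub(",", "", cleaned, fixed = TRUE))
amount
```

Splitting the work into two steps keeps the text version (cleaned) available if the formatted amounts are still needed for display.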

R: Incorrect encoding of narrow space in data frame and resulting .csv

I scraped data and received some character variables containing a narrow no break space (unicode U+202F). The resulting character variable shows up fine in R if it is in a vector. For example, the return of test shows up with a narrow space in the console:
test <- "variable1 variable2"
test
(html code here because the code environment does not show the narrow space)
However, if I add the vector to a list/data frame/tibble, it shows up as variable1<U+202F>variable2. If I save this data frame as a csv file with fileEncoding = "UTF-8" and open it with the corresponding encoding, <U+202F> still shows up in the observations. My workaround right now is to use gsub, but I am wondering what I am doing wrong.
The offender is format.default:
test <- "variable1\u202Fvariable2"
print(test)
[1] "variable1 variable2"
format(test)
#[1] "variable1<U+202F>variable2"
format gets called by format.data.frame which in turn is called by print.data.frame.
A solution might be to define a character method:
format.character <- function(x, ...) x
DF <- data.frame(x = 1:5) #beware of stringsAsFactors
DF$test <- test
DF #spaces are actually thin spaces in R console
# x test
#1 1 variable1 variable2
#2 2 variable1 variable2
#3 3 variable1 variable2
#4 4 variable1 variable2
#5 5 variable1 variable2
Obviously, such a simple method will break functions relying on other format arguments.
OTOH, why do you care how thin spaces are printed?
For anybody having the same problem: there is a package called textclean which replaces or removes non-ASCII characters with replace_non_ascii().
One method is to convert all non-ASCII characters to a space using gsub:
text <- "variable1\u202Fvariable2"
new_text <- gsub("[^\x20-\x7E]", " ", text)
Here I match the negation of all commonly used ASCII characters, ranging from hex code 20 (SPACE) to 7E (~). The disadvantage of this method is that you might unintentionally remove more than what you wish, but you can always add exclusions to the character class.
Output:
> format(text)
[1] "variable1<U+202F>variable2"
> format(new_text)
[1] "variable1 variable2"
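If only the narrow no-break space is a problem, a more targeted sketch is to replace just U+202F and leave every other non-ASCII character alone. The name cleaned is illustrative, and this assumes the strings are UTF-8 encoded.

```r
text <- "variable1\u202Fvariable2"

# Replace only the narrow no-break space (U+202F) with a regular space.
cleaned <- gsub("\u202F", " ", text, fixed = TRUE)
cleaned
```

This avoids the risk mentioned above of removing more characters than intended.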

How to keep leading zeros in double type in R column

I need to read a bunch of zip codes into R but they have to be in double type. I also need to keep the leading zeros for the ones that start with zero. I tried
for (i in 1:length(df$region)) {
  if (nchar(df$region[i]) == 4) {
    df$region[i] <- paste0("0", df$region[i])
  }
}
This converts the way I want to but it changes them all to character type and I can't read the region column into another function that requires numeric or double. If I convert to numeric or double it gets rid of the leading zeros again. Any ideas?
Why not store them as a numeric and just add the zeros when needed through formatC? For example,
tst <- 345
class(tst)
formatC(tst, width = 5, format = "d", flag = "0")
gives,
#[1] "numeric"
#[1] "00345"
For brevity, you could even write a wrapper:
zip <- function(z)formatC(z, width = 5, format = "d", flag = "0")
zip(tst)
#[1] "00345"
And this only adds leading zeroes when needed.
zip(12345)
#[1] "12345"
I would recommend keeping two columns, one in which the ZIP code appears as text, and the other as a double. You would have to first read in the ZIP codes as character data, then create the double column from that, e.g.
# given df$zip_code
df$zip_as_double <- as.double(df$zip_code)
Double variables don't normally maintain the number of leading zeroes, because those digits are not significant anyway. So I think storing your ZIP codes as character data is the only option here.
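A minimal sketch of the two-column idea follows, using a hypothetical data frame whose zip_code column was read in as character. It also shows that the text form can be rebuilt from the double column with formatC whenever it is needed.

```r
# Hypothetical data: ZIP codes read in as character.
df <- data.frame(zip_code = c("00345", "12345", "02139"),
                 stringsAsFactors = FALSE)

# Numeric copy for functions that require a double.
df$zip_as_double <- as.double(df$zip_code)

# The padded text can be regenerated from the numeric column on demand.
rebuilt <- formatC(df$zip_as_double, width = 5, format = "d", flag = "0")
identical(rebuilt, df$zip_code)
```

Either column can then be passed to whichever function needs that type.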

Make all elements of a character vector the same length

Consider a character vector
test <- c('ab12','cd3','ef','gh03')
I need all elements of test to contain 4 characters (nchar(test[i])==4). If the actual length of the element is less than 4, the remaining places should be filled with zeroes. So, the result should look like this
> 'ab12','cd30','ef00','gh03'
My question is similar to this one. Yet, I need to work with a character vector.
We can use base R functions to pad 0 at the end of each string so that all elements have the same number of characters. Calling format with width set to the max of nchar (number of characters) of the vector pads the shorter elements with trailing spaces (format left-justifies character vectors by default). Then, we can replace each space with '0' using gsub. The pattern in the gsub is \\s (any whitespace character) and the replacement is 0.
gsub("\\s", "0", format(test, width=max(nchar(test))))
#[1] "ab12" "cd30" "ef00" "gh03"
Or, if we are using a package solution, str_pad does this more easily, as it also has an argument to specify the pad character.
library(stringr)
str_pad(test, max(nchar(test)), side="right", pad="0")
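Another base R sketch, assuming R 3.3.0 or later for strrep(), builds the padding explicitly instead of replacing spaces. Unlike the gsub approach, it would not touch spaces that are already part of an element. The names width and padded are illustrative.

```r
test <- c("ab12", "cd3", "ef", "gh03")

# Target width is the longest element.
width <- max(nchar(test))

# Append as many zeros as each element is short of the target width.
padded <- paste0(test, strrep("0", width - nchar(test)))
padded
```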

R - as.numeric matrix

I am new to R and I am trying to convert a dataframe to a numeric matrix using the below code
expData <- read.table("GSM469176.txt",header = F)
expVec <- as.numeric(as.matrix(expData))
When I use as.matrix, without as.numeric, it returns some numbers (as below)
0.083531 0.083496 0.083464 0.083435 0.083406 0.083377 0.083348"
[9975] "-0.00285 -0.0028274 -0.0028046 -0.0027814 -0.0027574 -0.0027319 -0.0027042
but when I put in the as.numeric, they are all converted to "NA"
I apologize if someone has asked this question before but I can't find a post that solves my problem.
Thanks in advance
You have 2 issues. First, if you examine the structure of the data frame, you'll note that the first column is characters:
head(expData)[, 1:4]
V1 V2 V3 V4
1 YAL002W(cer) 6.1497e-02 6.2814e-02 6.4130e-02
2 YAL002W(par) 7.1352e-02 7.3262e-02 7.5171e-02
3 YAL003W(cer) 2.2428e-02 3.8252e-02 5.4078e-02
4 YAL003W(par) 2.6548e-02 3.6747e-02 4.6947e-02
5 YAL005C(cer) 2.4023e-05 2.3243e-05 2.2462e-05
6 YAL005C(par) 2.0252e-02 2.0346e-02 2.0440e-02
Therefore, trying to convert the complete data frame to numeric will not work as expected.
Second, you are running as.numeric() after as.matrix(), which is converting the matrix to a vector:
x <- as.numeric(as.matrix(expData))
# Warning message:
# NAs introduced by coercion
class(x)
[1] "numeric"
dim(x)
# NULL not a matrix
length(x)
# [1] 14261302
I suggest you try this:
rownames(expData) <- expData$V1
expData$V1 <- NULL
expData <- as.matrix(expData)
dim(expData)
# [1] 7502 1900
class(expData[, 1])
# [1] "numeric"
You get the NA's when R doesn't know how to convert something to a number.
Specifically, the quotation mark in your output tells me that you have one (or several) long strings of numbers. To see why this is bad, try: as.numeric("-0.00285 -0.0028274")
I don't know what your raw data is like, but as @alexwhan mentioned, the culprit is probably in your call to read.table.
To fix it, try explicitly setting the sep argument (i.e., next to where you have header).
I would suggest opening up the raw file in a simple text editor (TextEdit.app or Notepad, not Word) and seeing how the values are separated. My guess is
..., sep="\t"
should do the trick.
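Putting both answers together, here is a hedged sketch on a tiny tab-separated sample. The inline text and its values are hypothetical stand-ins for GSM469176.txt, and the assumption that the file is tab-separated is only the guess made above.

```r
# Hypothetical two-row sample in the same shape as the real file.
raw_text <- "YAL002W(cer)\t0.061497\t0.062814\nYAL002W(par)\t0.071352\t0.073262"

# Setting sep explicitly splits each line into separate columns.
expData <- read.table(text = raw_text, header = FALSE, sep = "\t")

# Move the identifier column into the row names, then convert.
rownames(expData) <- expData$V1
expData$V1 <- NULL
expMat <- as.matrix(expData)

is.numeric(expMat)
dim(expMat)
```

With only numeric columns left, as.matrix() returns a numeric matrix directly, so as.numeric() is not needed and the dimensions are preserved.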
