adding or retaining leading zeros without converting to character format - r

Is it possible to add or retain one or more leading zeros to a number without the result being converted to character? Every solution I have found for adding leading zeros returns a character string, including: paste, formatC, format, and sprintf.
For example, can x be 0123 or 00123, etc., instead of 123 and still be numeric?
x <- 0123
EDIT
It is not essential. I was just playing around with the following code and the last two lines gave the wrong answer. I just thought maybe if I could have leading zeros with numeric format obtaining the correct answer would be easier.
a7 = c(1,1,1,0); b7=c(0,1,1,1); # 4
a77 = '1110' ; b77='0111' ; # 4
a777 = 1110 ; b777=0111 ; # 4
length(b7[(b7 %in% intersect(a7,b7))])
# From: "R - count matches between characters of one string and another, no replacement"
keyword <- unlist(strsplit(a77, ''))
text <- unlist(strsplit(b77, ''))
sum(!is.na(pmatch(keyword, text)))
ab7 <- read.fwf(file = textConnection(as.character(rbind(a777, b777))), widths = c(1,1,1,1), colClasses = rep("character", 2))
length(ab7[2,][(ab7[2,] %in% intersect(ab7[1,],ab7[2,]))])

You are not thinking correctly about what a "number" is. Programming languages store an internal representation which retains full precision to the machine limit. You are apparently concerned with what gets printed to your screen or console. By definition, those number characters are string elements, which is to say, a couple bytes are processed by the ASCII decoder (or equivalent) to determine what to draw on the screen. What x "is," to draw happily on Presidential Testimony, depends on your definition of what "is" is.
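To make the distinction between a stored value and its printed form concrete, here is a minimal sketch. The four-digit width and the names x_chr, a_digits, b_digits and n_matches are assumptions chosen for illustration, not part of the question.

```r
# R parses 0123 as the number 123; the leading zero is never stored.
x <- 0123
identical(x, 123)

# Leading zeros exist only in a character rendering of the value.
x_chr <- sprintf("%04d", as.integer(x))

# Applied to the example in the question: pad both values to four digits
# (an assumed width), then compare them digit by digit.
a777 <- 1110
b777 <- 0111   # stored as 111
a_digits <- strsplit(sprintf("%04d", as.integer(a777)), "")[[1]]
b_digits <- strsplit(sprintf("%04d", as.integer(b777)), "")[[1]]
n_matches <- sum(!is.na(pmatch(a_digits, b_digits)))
```

The padding step is what restores the zero that was dropped when b777 was parsed; without a known width there is no way to recover it from the number alone.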

You could always create your own class of objects that has one slot for the value of the number (but if it is stored as numeric then what we see as 123 will actually be stored as a binary value, something like 01111011, though probably with more leading 0's) and another slot or attribute for either the number of leading 0's or the number of significant digits. Then you can write methods for what to do with the number (and what effect that will have on the leading 0's, significant digits, etc.).
The print method could then make sure to print it with the leading zeros while keeping the internal value as a number.
But this seems a bit overkill in most cases (though I know that some fields make a big deal about indicating number of significant digits so that leading 0's could be important). It may be simpler to use the conversion to character methods that you already know about, but just do the printing in a way that does not look obviously like a number, see the cat and print functions for the options.
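A minimal sketch of that idea using S3 is shown below. The class name zeropad, its constructor, and the width attribute are all hypothetical names introduced here for illustration; the answer does not prescribe a particular design.

```r
# Hypothetical constructor: keep the numeric value and remember a display width.
zeropad <- function(x, width) {
  structure(x, width = width, class = "zeropad")
}

# Formatting pads with zeros; as.numeric() drops the class and attributes first.
format.zeropad <- function(x, ...) {
  formatC(as.numeric(x), width = attr(x, "width"), flag = "0", format = "d")
}

# Printing shows the padded form, but the stored value is untouched.
print.zeropad <- function(x, ...) {
  print(format(x), quote = FALSE)
  invisible(x)
}

x <- zeropad(123, width = 5)
x_shown <- format(x)            # padded character form used for display
x_plus_one <- as.numeric(x) + 1 # arithmetic on the underlying number
```

Arithmetic methods (for example an Ops method that decides what width the result should have) would still need to be written, which is where most of the "overkill" comes from.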

Related

R: How to count gaps at the beginning of a sequence alignment?

I am analyzing an alignment of amino acid sequences using R and need a reproducible way to figure out where the start is for each sequence. My alignment can be read in as a data frame. Here is a sample of 3.
alignment <- data.frame("Strains" = c("Strain.1", "Strain.2", "Strain.3"),
"Sequence" = c("MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSL---------------------",
"MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSLPTDFAIS--------------",
"-----------------------NIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTVEGVVIQGTNNVDRWLATILIEPNVQATNRTYNLFGQQEILLIENTSQTQWKFVDVSKTTPTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNVT-TGY-YSTTNYDTVNMT-----------------------------------------------------"))
Each of the dashes represents a space. What I want to do is read through my data frame and count how many spaces are at the beginning of each sequence. So far I've tried using the str_count function. For example:
alignment$shift <- str_count(alignment$Sequence, "-")
but this fails me when I have gaps downstream in my sequence. Really I'm only interested in the gaps that occur at the beginning of the sequences.
I stumbled across a regex approach in a post that almost perfectly matches my problem (How to count the number of hyphens at the beginning of a string in javascript?), but that is in JavaScript and I'm not sure how to translate it to R.
My questions are:
1) Is it possible to have str_count stop looking for "-" characters once it reaches a non-"-" character?
2) Is there a way to use regex or a similar function in R that outputs the length of a character match at the beginning of a string?
You could do this...
alignment$Sequence <- as.character(alignment$Sequence) #in case they are factors (as above)
alignment$shift <- nchar(alignment$Sequence) - nchar(gsub("^-+", "", alignment$Sequence))
alignment$shift
[1] 0 0 23
It just counts the number of characters removed, by telling gsub to delete a run of one or more hyphens (-+) anchored at the start of the string (the ^). You could use str_replace instead of gsub.
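As an alternative sketch that answers question 2 directly, base R's regexpr reports the length of a match, so the count can be read off without computing a difference. The short strings in seqs are made-up stand-ins for the alignment, not data from the question.

```r
# Made-up sequences: no leading gaps, three leading gaps, all gaps.
seqs <- c("MAS-LI--", "---NIG-S", "-----")

# "^-*" matches zero or more hyphens at the start of the string, so every
# string matches and the match length is the number of leading gaps.
leading_gaps <- attr(regexpr("^-*", seqs), "match.length")
leading_gaps
```

Using * rather than + means sequences with no leading gap give 0 instead of a no-match marker of -1.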
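Maybe this might help? It returns the start and end positions of the run of hyphens, but only if the run begins at the start of the string. Note that the pattern also matches the first letter after the hyphens, so the number of leading gaps is end - 1.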
library(stringr)
str_locate_all(string = alignment$Sequence, pattern = "^-{1,}[A-Z]")
[[1]]
start end
[[2]]
start end
[[3]]
start end
[1,] 1 24

Data Frame containing hyphens using R

I have created a list (based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens (-) in them.
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {assign(paste("df1.",i,sep=""),df[df$Dimension==i,])}
This works fine, however when I come to aggregate this in order to get some summary statistics I can't reference the dataset as R stops reading after the hyphen (I assume that the hyphen is some special character)
If I use a different list without hyphens e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces for binary operations. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # @ ! & | ^ ( [ ' " in them. At ?Quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments. Use split to divide up your data frame into a list. And keep it in a list rather than use assign, it will be much easier to loop over or use lapply with that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
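Here is a small self-contained sketch of the split-and-lapply workflow. The toy data frame, its Value column, and the names by_dim and totals are invented for illustration; only the Dimension column and dim.list come from the question.

```r
# Toy data standing in for the real df (the Value column is an assumption).
df <- data.frame(
  Dimension = c("Age_CareContactDate-Gender", "Age_CareContactDate-Group",
                "Other", "Age_CareContactDate-Gender"),
  Value = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)
dim.list <- c("Age_CareContactDate-Gender", "Age_CareContactDate-Group")

# Subset first, then split into a named list of data frames.
by_dim <- split(df[df$Dimension %in% dim.list, ],
                df$Dimension[df$Dimension %in% dim.list])

# Hyphens are fine in list element names; access them with [[ ]].
gender_df <- by_dim[["Age_CareContactDate-Gender"]]

# Summaries over every subset at once, with no assign() or get() needed.
totals <- sapply(by_dim, function(d) sum(d$Value))
```

Because the hyphenated strings are only ever used as list names, never as variable names, the parsing problem with a-b does not arise.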

Long Numbers As A Character String

As part of my dataset, one of the columns is a series of 24-digit numbers.
Example:
bigonumber <- 429382748394831049284934
When I import it using either data.table::fread or read.csv, it shows up as numeric in exponential format (EG: 4.293827e+23).
options(digits=...) won't work since the number is longer than 22 digits.
When I do
as.character(bigonumber)
what I get is "4.29382748394831e+23"
Is there a way to get bigonumber converted to a character string and show all of the digits as characters? I don't need to do any math on it, but I do need to search against it and do dplyr joins on it.
I need to do this after import, since the column number varies from month to month.
(Yes, in the perfect world, my upstream data provider would use a hash instead of a long number and a static number of columns that stay the same every month, but I don't get to dictate that to them.)
You can specify colClasses on your fread or read.csv statement.
bignums
429382748394831049284934
429382748394831049284935
429382748394831049284936
429382748394831049284937
429382748394831049284938
429382748394831049284939
bignums <- read.csv("~/Desktop/bignums.txt", sep="", colClasses = 'character')
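Since the question mentions that the column position changes from month to month, one possible variation is to set the class by column name rather than for every column. The sketch below reads from an inline string instead of a file; the id column and the two-row sample are assumptions made up for the example.

```r
# Inline stand-in for the monthly file (the id column is hypothetical).
csv_text <- "id,bignums
1,429382748394831049284934
2,429382748394831049284935"

# A named colClasses vector applies only to the named column; other columns
# are still type-guessed as usual.
dat <- read.csv(text = csv_text, colClasses = c(bignums = "character"))

str(dat)
dat$bignums
```

This only works if the column header is stable; if only the position is known, the class has to be given positionally instead.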
You can suppress the scientific notation with
options(scipen=999)
If you define the number then
bigonumber <- 429382748394831049284934
you can convert it into a string:
big.o.string <- as.character(bigonumber)
Unfortunately, this does not work because R converts the number to a double, thereby losing precision:
#[1] "429382748394831019507712"
The last digits are not preserved, as pointed out by @SabDeM. Even setting
options(digits=22)
doesn't help, and in any case 22 is the largest number that is allowed; and in your case there are 24 digits. So it seems that you will have to read the data directly as character or factor. Great answers have been posted showing how this can be achieved.
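The loss of precision can be checked directly. This is a small sketch using base R only; the names typed and as_text are introduced here for illustration.

```r
typed <- "429382748394831049284934"   # the digits as written in the source data
bigonumber <- 429382748394831049284934

# Render the stored double with no decimals and no scientific notation.
as_text <- sprintf("%.0f", bigonumber)

# The stored value is no longer the number that was typed.
same_digits <- identical(as_text, typed)

# Doubles represent every integer exactly only up to 2^53 (about 9e15).
exact_limit <- 2^53
```

Any identifier longer than 15 or 16 digits is therefore at risk once it has been parsed as numeric, regardless of how it is printed afterwards.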
As a side note, there is a package called gmp that allows using arbitrarily large integer numbers. However, there is a catch: they have to be read as characters (again, in order to prevent R's internal conversion into double).
library(gmp)
bigonumber <- as.bigz("429382748394831049284934")
> bigonumber
Big Integer ('bigz') :
[1] 429382748394831049284934
> class(bigonumber)
[1] "bigz"
The advantage is that you can indeed treat these entries as numbers and perform calculations while preserving all the digits.
> bigonumber * 2
#Big Integer ('bigz') :
#[1] 858765496789662098569868
This package and my answer here may not solve your problem, because reading the numbers directly as characters is an easier way to achieve your goal, but I thought I would post this anyway as information for users who may need to use large integers with more than 22 digits.
Use digest::digest on bigonumber to generate an md5 hash of the number yourself?
bigonumber <- 429382748394831049284934
hash_big <- digest::digest(bigonumber)
hash_big
# "e47e7d8a9e1b7d74af6a492bf4f27193"
I saw this before I posted my answer, but don't see it here anymore.
Set options(scipen) to a big value so that there is no truncation:
options(scipen = 999)
bigonumber <- 429382748394831049284934
bigonumber
# [1] 429382748394831019507712
as.character(bigonumber)
# [1] "429382748394831019507712"
Use "scan" to read the file - the "what" parameter lets you define the input type of each column.
If you want numbers as numbers you can't print all values. The digits option allows a maximum of 22 digits; its range is from 1 to 22. It is used by the print.default method. You can set it with:
options( digits = 22 )
Even with this option, the numbers will change. I don't know exactly why that happens; most likely it is because the number you are about to print has more digits than the allowed amount, so R cannot display it exactly. I'll investigate.

When I type the name of a character vector in R, how can I print without unnecessary lines?

I am working with a character vector in which the size of each element ranges from 4 to 855 characters.
Let's assume the name of this vector is y. When I type "y", R prints out the whole vector, and the problem is that each element takes the same number of lines. Thus, if it takes 8 lines to print the element with 855 characters, then R also gives 8 lines to the elements with only 4 characters, which produces a lot of unnecessary lines.
I want to remove these unnecessary lines. For example, I want
[1] "a"
[2]"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
to be printed as
[1] "a"
[2]"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
when I type the name of this vector.
How can I change this setting?
Note that the behavior you describe with extra lines being added to the printed output is not universal. For instance, my R Console on Mac (3.1.2) does not print the extra lines you described, but I get the extra lines when running the exact same version of R on the terminal.
You can do this with cat, looping through the vector and constructing the output as you wish:
> for (i in seq_along(y)) cat(paste0("[", i, "] \"", y[i], "\"\n"))
[1] "aaaa"
[2] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[3] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
[4] "aa"
Here is a reproducible example:
y<-c("aaaa", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", "aa")
y
You will notice that the top 2 lines have gaps underneath them when returned in the console.
The help from ?format details what is going on:
"Justification for character vectors (and objects converted to character vectors by their methods) is done on display width (see nchar), taking double-width characters and the rendering of special characters (as escape sequences, including escaping backslash but not double quote: see print.default) into account. Thus the width is as displayed by print(quote = FALSE) and not as displayed by cat. Character strings are padded with blanks to the display width of the widest. (If na.encode = FALSE missing character strings are not included in the width computations and are not encoded.)"
I think the solution here is to use cat for the printing.
An easy way to see how each element is padded with blank characters is to do the following:
format(y, trim = TRUE)
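The padding can also be measured rather than eyeballed. In this sketch the three-element vector y is a shortened stand-in for the one above, and padded is a name introduced for illustration.

```r
# Shortened stand-in: 4, 60 and 2 characters.
y <- c("aaaa", strrep("a", 60), "aa")

# format() pads every element with blanks to the width of the widest one,
# which is what print() then wraps across several console lines.
padded <- format(y)
nchar(y)
nchar(padded)

# writeLines() emits each element as-is, so nothing is padded.
writeLines(paste0("[", seq_along(y), "] \"", y, "\""))
```

writeLines is used here as a vectorised equivalent of the cat loop shown earlier.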

Adding conditional leading or trailing zeros

I need help conditionally adding leading or trailing zeros.
I have a data frame with one column containing ICD-9 diagnoses. As a vector, the column looks like:
"33.27" "38.45" "9.25" "4.15" "38.45" "39.9" "84.1" "41.5" "50.3"
I need all the values to have a length of 5, including the period in the middle (not counting the quotes). If the value has one digit before the period, it needs to have a leading zero. If the value has one digit after the period, it needs to have a zero at the end. So the result should look like this:
"33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Here is the vector for R:
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3" )
This does it in one line
formatC(as.numeric(icd9),width=5,format='f',digits=2,flag='0')
ICD-9 codes have some formatting quirks which can lead to misinterpretation with simple string processing. The icd package on CRAN takes care of all the corner cases when doing ICD processing, and has been battle-tested over about six years of use by many R users.
This helper function, change, takes the maximum number of characters as an argument and left-pads with zeros. Note that it only adds leading zeros, so "39.9" becomes "039.9" rather than "39.90":
change<-function(x, n=max(nchar(x))) gsub(" ", "0", formatC(x, width=n))
icd92<-gsub(" ","",paste(change(icd9,5)))
You can also use sprintf after converting the vector into numeric.
sprintf("%05.2f", as.numeric(icd9))
[1] "33.27" "38.45" "09.25" "04.15" "38.45" "39.90" "84.10" "41.50" "50.30"
Notes
See the examples in ?sprintf to work out the proper format.
There is some risk of introducing errors due to numerical precision here, though it works well in the example.
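If the numeric round trip is a concern, the padding can be done on the strings themselves. The following is a sketch under the assumption that every code has at most two digits on each side of the period; pad_icd9 is a hypothetical helper, not a function from any package.

```r
icd9 <- c("33.27", "38.45", "9.25", "4.15", "38.45", "39.9", "84.1", "41.5", "50.3")

# Hypothetical helper: pad each side of the period to two characters
# without ever converting the code to a number.
pad_icd9 <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(p) {
    left  <- p[1]
    right <- if (length(p) > 1) p[2] else ""
    left  <- paste0(strrep("0", max(0, 2 - nchar(left))), left)    # leading zeros
    right <- paste0(right, strrep("0", max(0, 2 - nchar(right))))  # trailing zeros
    paste0(left, ".", right)
  }, character(1))
}

padded_codes <- pad_icd9(icd9)
padded_codes
```

This does not handle the wider set of ICD-9 formatting quirks mentioned above, so it is only suitable for data shaped like the example.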

Resources