Display numbers with commas in console - r

I am working with a bunch of large numbers. I know how to convert numbers to comma format from "Comma separator for numbers in R?". What I don't know how to do is display numbers in the console with commas without converting the class from numeric. I want to be able to see the commas so I can compare numbers when working, but I need to keep the numbers as numeric to make calculations. I know you can get rid of scientific notation from "How to disable scientific notation?", but I can't find an equivalent for a comma or dollar format.

You could create a new method for print(), for a custom class I will call "bignum":
print.bignum <- function(x, ...) {
  # format() returns text for display only; x itself stays numeric
  print(format(x, scientific = FALSE, big.mark = ",", trim = TRUE))
  invisible(x)
}
x <- c(1e6, 2e4, 5e8)
class(x) <- c(class(x), "bignum")
x
[1] "1,000,000" "20,000" "500,000,000"
x * 2
[1] "2,000,000" "40,000" "1,000,000,000"
y <- x + 1
y
[1] "1,000,001" "20,001" "500,000,001"
class(y) <- "numeric"
y
[1] 1000001 20001 500000001
For any numeric object x, if you add "bignum" to the class attribute via class(x) <- c(class(x), "bignum"), it will always print how you've described you want it to print, but should behave as a numeric otherwise, as shown above.
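The same idea can be stretched to the dollar format mentioned in the question. The following is only a sketch under assumptions: the class name "dollar" and the helpers as_dollar() and format_dollar() are made up for illustration and are not part of base R, and two decimal places are an arbitrary choice.

```r
# Hypothetical helpers: "dollar", as_dollar() and format_dollar() are
# illustrative names, not part of base R.
format_dollar <- function(x) {
  # formatC() returns character strings; the numeric input is left untouched
  paste0("$", formatC(unclass(x), format = "f", digits = 2, big.mark = ","))
}

as_dollar <- function(x) {
  stopifnot(is.numeric(x))
  class(x) <- c("dollar", class(x))
  x
}

print.dollar <- function(x, ...) {
  print(format_dollar(x), quote = FALSE)
  invisible(x)
}

prices <- as_dollar(c(1e6, 2e4, 5e8))
prices        # shown with a dollar sign and commas
sum(prices)   # arithmetic still works on the underlying numbers
```

Keeping the formatting in a separate function means the printed text can be checked without going through print().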

Create decimal values of character hex

With the following code I'm reading a text file and displaying one column on the screen.
externalData <- read.delim("testdata.txt", header = FALSE, sep = "#")
i = 1;
while(i < 11) {
appel <- as.character(externalData[i,3,1])
i = i + 1;
print(appel)
}
The output looks like this:
I'm trying to convert these values from hexadecimal to decimal.
I've tried the following:
strtoi(c(appel))
but this doesn't seem to work: it only removes the quotation marks from the first and the last value and sets everything in between to NA (probably because there are letters in them).
Here are 3 ways to convert hexadecimal(character) to decimal(numeric).
x <- c("158308", "bb1787", "853f91")
# 1.
strtoi(x, base = 16L)
# 2.
as.integer(as.hexmode(x))
# 3.
as.integer(paste0("0x", x))
# more general edition:
# as.integer(ifelse(!grepl("^0x", x), paste0("0x", x), x))
From ?strtoi
Convert strings to integers according to the given base using the C function strtol, or choose a suitable base following the C rules.
Arguments
x a character vector, or something coercible to this by as.character.
base an integer which is between 2 and 36 inclusive, or zero (default).
1. Create a reproducible minimal example
appel <- c("158308", "d8db89")
2. Solution using the base argument of strtoi:
strtoi(appel, base=16)
Returns:
[1] 1409800 14211977
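One caveat worth sketching: strtoi() returns integers, so (according to its documentation) values that cannot be represented in R's integer type come back as NA. A possible workaround, assuming doubles are precise enough for your values, relies on as.numeric() accepting strings with a "0x" prefix. hex_to_num() below is a hypothetical helper name, not a base R function.

```r
# hex_to_num() is a hypothetical helper, not a base R function.
hex_to_num <- function(x) {
  x <- sub("^0[xX]", "", as.character(x))   # tolerate an existing 0x prefix
  ok <- grepl("^[[:xdigit:]]+$", x)         # only pure hex strings are converted
  out <- rep(NA_real_, length(x))
  out[ok] <- as.numeric(paste0("0x", x[ok]))
  out
}

hex_to_num(c("158308", "d8db89", "ffffffff", "xyz"))
# values: 1409800, 14211977, 4294967295, NA
```

Strings that are not valid hexadecimal are returned as NA instead of triggering a coercion warning.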

Translation and mapping of emoticons encoded as UTF-8 code in text

I am working with text which includes emoticons. I need to be able to find these and replace them with tags which can be analysed. How to do this?
> main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
> grepl("\xf0", main$text[[4]])
[1] FALSE
I tried the above. Why did it not work? I also tried iconv into ASCII, then the byte encoding I got, could be searched with grepl.
> abc<-iconv(main$text[[4]], "UTF-8", "ASCII", "byte")
> abc
[1] "Spread d wrd<f0><9f><98><8e>"
> grepl("<f0>", abc)
[1] TRUE
I really do not understand what I did here and what happened. I also do not understand how the above conversion introduced \n characters into the text.
I also did not know how to encode these, once they were searchable. I found a list here, but it fell short (for example, "U+E00E" - <ee><80><8e> was not in the list). Is there a comprehensive list for such a mapping?
ADDENDUM
After a lot of trial and error, here is what I realised. There are two kinds of encodings for the emojis in the data. One is in the form of bytes, which is searchable by grepl("\x9f", ...., useBytes=T), like the main$text[[4]], and another (main$text[[6]]) which is searchable as the unicode character without useBytes=T, i.e. grepl("\ue00e",....). Even the way they are displayed in View() and when called on the console is different. I am absolutely confused as to what is going on here.
main$text[[4]]
[1] "Spread d wrd\xf0\u009f\u0098\u008e"
main[4,]
timestamp fromMe remoteResource remoteResourceDisplayName type
b 2014-08-30 02:58:58 FALSE 112233#s.whatsapp.net ABC text
text date
b Spread d wrd<f0><U+009F><U+0098><U+008E> 307114
main$text[[6]]
[1] ""
main[6,]
timestamp fromMe remoteResource remoteResourceDisplayName type text
b 2014-08-30 02:59:17 FALSE 12345#s.whatsapp.net XYZ text <U+E00E>
date
b 307114
grepl("\ue00e", main$text[[6]])
[1] TRUE
grepl("<U+E00E>", main$text[[6]])
[1] FALSE
grepl("\u009f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]])
[1] FALSE
grepl("\x9f", main$text[[4]], fixed=T)
[1] FALSE
grepl("\x9f", main$text[[4]], useBytes=T)
[1] TRUE
The maps I have are also different. The one for the bytes case works well. But the other one does not, since I am unable to create the "\ue00e" required to search. Here is a sample of the other map, corresponding to the Softbank <U+E238>.
emmm[11]
[1] "E238"
Searching for a single byte of a multi-byte UTF-8 encoded character only works if done with useBytes = TRUE. The fact that "\xf0" here is a part of a multi-byte character is obscured by the less than perfect Unicode support of R on Windows (used in the original example, I presume). How to match by bytes:
foo <- "\xf0\x9f\x98\x8e" # U+1F60E SMILING FACE WITH SUNGLASSES
Encoding(foo) <- "UTF-8"
grepl("\xf0", foo, useBytes = TRUE)
I don't see much use for matching one byte, though. Searching for the whole character would then be:
grepl(foo, paste0("Smiley: ", foo, " and more"), useBytes = TRUE)
Valid ASCII codes correspond to integers 0–127. The iconv() conversion to ASCII in the example replaces any invalid byte 0xYZ (corresponding to integers 128–255) with the literal text <yz> where y and z are hexadecimal digits. As far as I can see, it should not introduce any newlines ("\n").
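To make the byte-level view concrete, here is a small sketch (assuming a session that handles UTF-8 strings correctly) that inspects the bytes and the code point of the smiley used above, and repeats the iconv() conversion from the question.

```r
foo <- "\xf0\x9f\x98\x8e"   # the four UTF-8 bytes of U+1F60E
Encoding(foo) <- "UTF-8"

charToRaw(foo)                     # the raw bytes: f0 9f 98 8e
sprintf("U+%X", utf8ToInt(foo))    # the code point: "U+1F60E"

# Bytes above 127 are not valid ASCII, so iconv() with sub = "byte"
# writes each one as <xx>, as in the output shown in the question.
iconv(paste0("wrd", foo), from = "UTF-8", to = "ASCII", sub = "byte")
```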
Using the character list linked to in the question, here is some example code which performs one kind of "emoji tagging" to input strings, namely replacing the emoji with its (slightly formatted) name.
emoji_table <- read.csv2("https://github.com/today-is-a-good-day/Emoticons/raw/master/emDict.csv",
stringsAsFactors = FALSE)
emoji_names <- emoji_table[, 1]
text_bytes_to_raw <- function(x) {
loc <- gregexpr("\\x", x, fixed = TRUE)[[1]] + 2
as.raw(paste0("0x", substring(x, loc, loc + 1)))
}
emoji_raw <- lapply(emoji_table[, 3], text_bytes_to_raw)
emoji_utf8 <- vapply(emoji_raw, rawToChar, "")
Encoding(emoji_utf8) <- "UTF-8"
gsub_many <- function(x, patterns, replacements) {
stopifnot(length(patterns) == length(replacements))
x2 <- x
for (k in seq_along(patterns)) {
x2 <- gsub(patterns[k], replacements[k], x2, useBytes = TRUE)
}
x2
}
tag_emojis <- function(x, codes, names) {
gsub_many(x, codes, paste0("<", gsub("[[:space:]]+", "_", names), ">"))
}
each_tagged <- tag_emojis(emoji_utf8, emoji_utf8, emoji_names)
all_in_one <- tag_emojis(paste0(emoji_utf8, collapse = ""),
emoji_utf8, emoji_names)
stopifnot(identical(paste0(each_tagged, collapse = ""), all_in_one))
As to why U+E00E is not on that emoji list, I don't think it should be. This code point is in a Private Use Area, where character mappings are not standardized. For comprehensive Unicode character lists, you cannot find a better authority than the Unicode Consortium, e.g. Unicode Emoji. Additionally, see "Convert utf8 code point strings like <U+0161> to utf8".
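Whether a character falls in a Private Use Area can be checked from its code point. This is a sketch only: is_private_use() is a hypothetical helper, and the ranges are the three Private Use Areas of the Unicode Standard (U+E000 to U+F8FF, plus planes 15 and 16).

```r
# is_private_use() is a hypothetical helper, not a base R function.
is_private_use <- function(ch) {
  cp <- utf8ToInt(ch)   # code point(s) of a UTF-8 string
  (cp >= 0xE000   & cp <= 0xF8FF)  |
  (cp >= 0xF0000  & cp <= 0xFFFFD) |
  (cp >= 0x100000 & cp <= 0x10FFFD)
}

is_private_use("\ue00e")   # TRUE: U+E00E has no standardized meaning
is_private_use("A")        # FALSE
```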
Edit after addendum
When there is a string of exactly four hexadecimal digits representing a Unicode code point (let's say "E238"), the following code will convert the string to the corresponding UTF-8 representation, the occurrence of which can be checked with the grep() family of functions. This answers the question of how to "automatically" generate the character that can be manually created by typing "\uE238".
library(stringi)
hex4_to_utf8 <- function(x) {
stopifnot(grepl("^[[:xdigit:]]{4}$", x))
stringi::stri_enc_toutf8(stringi::stri_unescape_unicode(paste0("\\u", x)))
}
foo <- "E238"
foo_utf8 <- hex4_to_utf8(foo)
The value of the useBytes option should not matter in the following grep() call. In the previous code example, I used useBytes = TRUE as a precaution, as I'm not sure how well R on Windows handles Unicode code points U+10000 and larger (five or six digits). Clearly it cannot properly print such codepoints (as shown by the U+1F60E example), and input with the \U + 8 digits method is not possible.
The example in the question shows that R (on Windows) may print Unicode characters with the <U+E238> notation rather than as \ue238. The reason seems to be format(), also used in print.data.frame(). For example (R for Windows running on Wine):
> format("\ue238")
[1] "<U+E238>"
When tested in an 8-bit locale on Linux, the same notation is already used by the default print method. One must note that in this case, this is only a printed representation, which is different from how the character is originally stored.
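If adding a dependency on stringi is not desirable, a base R variant of the hex-to-character conversion can be sketched with strtoi() and intToUtf8(). hex_to_utf8_base() is a hypothetical name, and allowing four to six hex digits is my own assumption; how well code points above U+FFFF print depends on the platform, as discussed above.

```r
# hex_to_utf8_base() is a hypothetical name; strtoi() and intToUtf8()
# are base R functions.
hex_to_utf8_base <- function(x) {
  stopifnot(all(grepl("^[[:xdigit:]]{4,6}$", x)))
  intToUtf8(strtoi(x, base = 16L), multiple = TRUE)
}

softbank <- hex_to_utf8_base("E238")
# search for the generated character in a string that contains it
grepl(softbank, paste0("before ", "\ue238", " after"), fixed = TRUE)
```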

Finding the position of a character within a string

I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character, and returns the first position at which the character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since the first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:]).
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features. You can specify the maximum number of matches you want with the n argument, which defaults to 1. You can also ask for the matched character itself rather than its position with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "A" "B"
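For whole vectors, and for strings without any letters, a vectorized variant can be sketched with regexpr(), which returns the position of the first match for each element and -1 when there is none. anyalpha_vec() is a hypothetical name, and mapping "no match" to 0 follows my understanding of how SAS ANYALPHA behaves, which is an assumption here.

```r
# anyalpha_vec() is a hypothetical helper, not a base R function.
anyalpha_vec <- function(txt) {
  # as.integer() drops the match.length attribute that regexpr() attaches
  pos <- as.integer(regexpr("[[:alpha:]]", as.character(txt)))
  pos[!is.na(pos) & pos < 0L] <- 0L   # assumed SAS convention: 0 means "not found"
  pos
}

anyalpha_vec(c("123456789A", "123A56789BC", "12345"))
# [1] 10  4  0
```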

Extracting variables from a formula when there are subscripts

There are several posts related to obtaining a list of variables in a regression formula in R - the basic answer being to use all.vars. For example,
> all.vars(log(resp) ~ treat + factor(dose))
[1] "resp" "treat" "dose"
This is nice because it strips out all of the functions and operators (as well as repeats, not shown). However, this is problematic when the formula contains $ operators or subscripts, such as in
> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> all.vars(form)
[1] "cows" "weight" "bulls" "herd" "breed"
Here, the data frame names cows, bulls, and herd are identified as variables, and the names of the actual variables are decoupled or lost. Instead, what I really want is this result:
> mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
What is the most elegant way to do this? I have one proposal that I'll post as an answer, but maybe someone has a more elegant solution and will earn more votes!
One approach that works, though a bit tedious, is to replace the operators $, etc. with legal characters for variable names, turn the string back into a formula, apply all.vars, and un-mangle the results:
All.vars = function(expr, retain = c("\\$", "\\[\\[", "\\]\\]"), ...) {
# replace operators with unlikely patterns _Av1_, _Av2_, ...
repl = paste("_Av", seq_along(retain), "_", sep = "")
for (i in seq_along(retain))
expr = gsub(retain[i], repl[i], expr)
# piece things back together in the right order, and call all.vars
subs = switch(length(expr), 1, c(1,2), c(2,1,3))
vars = all.vars(as.formula(paste(expr[subs], collapse = "")), ...)
# reverse the mangling of names
retain = gsub("\\\\", "", retain) # un-escape the patterns
for (i in seq_along(retain))
vars = gsub(repl[i], retain[i], vars)
vars
}
Use the retain argument to specify the patterns that we wish to retain rather than treat as operators. The defaults are $, [[, and ]] (all duly escaped). Here are some results:
> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> All.vars(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
Change retain to also include ( and ):
> All.vars(form, retain = c("\\$", "\\(", "\\)", "\\[\\[", "\\]\\]"))
[1] "log(cows$weight)" "factor(bulls[[3]])" "herd$breed"
The dots are passed to all.vars, which is really the same as all.names but with different defaults. So we can also obtain the functions and operators not in retain:
> All.vars(form, functions = TRUE)
[1] "~" "log" "cows$weight" "*"
[5] "factor" "bulls[[3]]" "herd$breed"
This isn't sufficient for a general use case, but just for fun I thought I'd take a crack at it:
mystery.fcn = function(string) {
string = gsub(":", " ", string)
string = unlist(strsplit(gsub("\\b.*\\b\\(|\\(|\\)|[*~+-]", "", string), split=" "))
string = string[nchar(string) > 0]
return(string)
}
form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
form1 = ~x[[y]]
mystery.fcn(form1)
[1] "x[[y]]"
form2 = z$three ~ z$one + z$two - z$x_y
mystery.fcn(form2)
[1] "z$three" "z$one" "z$two" "z$x_y"
form3 = z$three ~ z$one:z$two
mystery.fcn(form3)
[1] "z$three" "z$one" "z$two"
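Both approaches above work on the text of the formula. An alternative, sketched here under assumptions, is to walk the formula's parse tree and report any call to $ or [[ as a single variable. vars_with_subscripts() is a hypothetical helper; it does not handle single-bracket calls with empty indices such as x[, 1].

```r
# vars_with_subscripts() is a hypothetical helper, not part of base R.
vars_with_subscripts <- function(expr, keep = c("$", "[[")) {
  walk <- function(e) {
    if (is.call(e)) {
      fn <- e[[1]]
      # a call to `$` or `[[` is reported whole, e.g. "cows$weight"
      if (is.name(fn) && as.character(fn) %in% keep) {
        return(paste(deparse(e), collapse = ""))
      }
      # otherwise descend into the arguments, skipping the function itself
      return(unlist(lapply(seq_along(e)[-1], function(i) walk(e[[i]]))))
    }
    if (is.name(e)) as.character(e) else NULL   # constants contribute nothing
  }
  unique(as.character(walk(expr)))
}

form <- log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
vars_with_subscripts(form)
# values: "cows$weight" "bulls[[3]]" "herd$breed"
```

Because the tree is never converted to text and back, operators inside variable names or unusual spacing cannot confuse it.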

Convert binary vector to decimal

I have a vector of a binary string:
a<-c(0,0,0,1,0,1)
I would like to convert this vector into decimal.
I tried using the compositions package and its unbinary() function; however, this solution, like most others that I have found on this site, requires a g-adic string as the input argument.
My question is how can I convert a vector rather than a string to decimal?
To illustrate the problem:
library(compositions)
unbinary("000101")
[1] 5
This gives the correct solution, but:
unbinary(a)
unbinary("a")
unbinary(toString(a))
produces NA.
You could try this function
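bitsToInt <- function(x) {
  # pad on the left with zeros up to a multiple of 32 bits (none if already a multiple),
  # then reverse because packBits() expects the least significant bit first
  pad <- (32 - length(x) %% 32) %% 32
  packBits(rev(c(rep(FALSE, pad), as.logical(x))), "integer")
}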
a <- c(0,0,0,1,0,1)
bitsToInt(a)
# [1] 5
Here we skip the character conversion; this uses only base functions.
It is likely that
unbinary(paste(a, collapse=""))
would have worked should you still want to use that function.
There is a one-liner solution:
Reduce(function(x,y) x*2+y, a)
Explanation:
Expanding the application of Reduce results in something like:
Reduce(function(x,y) x*2+y, c(0,1,0,1,0)) = (((0*2 + 1)*2 + 0)*2 + 1)*2 + 0 = 10
At each step, the value accumulated so far is doubled and the next bit is added to it.
Please also see the description of Reduce() function.
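To watch the accumulation happen on the vector from the question, Reduce() can be asked to keep every intermediate value. This is just an illustration of the one-liner above, using the accumulate argument of Reduce().

```r
a <- c(0, 0, 0, 1, 0, 1)
# accumulate = TRUE keeps every intermediate value instead of only the last
steps <- Reduce(function(x, y) x * 2 + y, a, accumulate = TRUE)
steps
# [1] 0 0 0 1 2 5
```

The last element is the decimal value; the earlier ones are the values of the leading bits read so far.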
If you'd like to stick to using compositions, just convert your vector to a string:
library(compositions)
a <- c(0,0,0,1,0,1)
achar <- paste(a,collapse="")
unbinary(achar)
[1] 5
This function will do the trick.
bintodec <- function(y) {
# find the decimal number corresponding to binary sequence 'y'
if (! (all(y %in% c(0,1)))) stop("not a binary sequence")
res <- sum(y*2^((length(y):1) - 1))
return(res)
}
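As a quick cross-check, two more base R routes should agree on the vector from the question. This is a sketch: strtoi() with base = 2 goes through a string and is limited to what fits in an R integer, while the powers-of-two sum mirrors the arithmetic used in bintodec().

```r
a <- c(0, 0, 0, 1, 0, 1)

# route 1: collapse to a string and let strtoi() parse it as base 2
via_strtoi <- strtoi(paste(a, collapse = ""), base = 2)

# route 2: weight each bit by its power of two, most significant bit first
via_powers <- sum(a * 2^(rev(seq_along(a)) - 1))

c(via_strtoi, via_powers)
# [1] 5 5
```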
