How to make format() yield numeric objects - r

Whenever I format a table as such:
example <- sample(LETTERS, replace = T)
format(table(example), scientific = T)
The numbers become characters. How can I tell format() my object is numeric without resorting to as.numeric()? I can't find any such parameters in the function's help page. It says that format() objects are usually numeric, so I guess I'm missing some basic command.
My real data looks like this:
> xtabs(...)
PRU/DF PSU/ILH PSU/JFA PSU/MCL PSU/SRM PSU/ULA
1.040771e+01 0.000000e+00 2.280347e-01 0.000000e+00 0.000000e+00 8.186240e+00
PSU/URA PSU/VGA PU/AC PU/AM PU/AP PU/BA
0.000000e+00 1.534169e+01 8.184747e+01 1.410106e+01 1.028717e+01 1.099289e+00
PU/GO PU/MA PU/MG PU/MT PU/PA PU/PI
0.000000e+00 4.369910e+01 5.350849e+00 0.000000e+00 4.706721e-01 0.000000e+00
I want to have the console print the numbers prettier so my co-workers don't have a heart attack. This is what I've come up with:
> format(xtabs(...), scientific = F, digits = 1)
PRU/DF PSU/ILH PSU/JFA PSU/MCL PSU/SRM PSU/ULA PSU/URA PSU/VGA
"10.4077" " 0.0000" " 0.2280" " 0.0000" " 0.0000" " 8.1862" " 0.0000" "15.3417"
PU/AC PU/AM PU/AP PU/BA PU/GO PU/MA PU/MG PU/MT
"81.8475" "14.1011" "10.2872" " 1.0993" " 0.0000" "43.6991" " 5.3508" " 0.0000"
PU/PA PU/PI PU/RO PU/RR PU/TO PRU/RJ PSU/CPS PSU/NRI
" 0.4707" " 0.0000" "40.6327" "10.3247" " 0.0000" "10.9644" " 0.0000" "55.4122"
I'd like to get rid of those quotes so the data looks better on the console.

The format function returns character vectors with the numbers "formatted" per your specifications. If you convert the result back to numbers, the formatting is lost. I think your problem is really the difference between creating the formatted output and printing it. When you create an object but don't do anything with it, the object is automatically printed using default arguments, and one of those defaults is to put quotes around character strings. If you don't want the quotes, call print yourself and tell it not to include them:
> example <- sample(LETTERS, replace = T)
> print(format(table(example), scientific = T), quote=FALSE)
example
B E F G H J K L Q S T U W X Z
1 1 1 2 1 2 3 1 1 1 3 1 1 5 2
If your main goal is to not use scientific notation then you should look at the zapsmall function which will turn extremely small values (often the culprit in switching to scientific notation) into zeros. Or do options(scipen=5) (or some other value than 5) which will reduce the likelihood of switching to scientific notation in subsequent printing.
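As a rough sketch of the difference between formatting and printing (the vector x and its names below are made up for illustration and are not the asker's data): format() produces strings, print(..., quote = FALSE) or noquote() only changes how they are displayed, and round() keeps the values numeric.

```r
# Hypothetical stand-in for the xtabs() result: a named numeric vector
x <- c(A = 1.040771e+01, B = 0, C = 2.280347e-01)

# format() always returns character strings (names are kept)
out <- format(round(x, 2), nsmall = 2)

# Printing without quotes changes only the display, not the type
print(out, quote = FALSE)
print(noquote(out))

# round() keeps the values numeric, at the cost of dropping trailing zeros
rounded <- round(x, 2)
print(rounded)
```

If the result is only meant for the console, the noquote() route is probably the least intrusive, since the underlying object is left untouched.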

format returns character vectors, so this is in general to be expected. Your problem, I think, results from the fact that, with scientific = TRUE, the "format" representation of an integer vector such as 1:3 is "1" "2" "3", whereas the representation of a numeric (real) vector c( 1, 2, 3 ) is "1e+00" "2e+00" "3e+00".
Consider the following:
format( 1:10 )
format( 1:10, scientific= TRUE )
format( as.numeric( 1:10 ), scientific= TRUE )
Therefore, try
format( as.numeric( table( example ) ), scientific= TRUE )
Unfortunately, you cannot omit the as.numeric, since table generates integer values, and you need real.
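A small check of the integer versus real behaviour described above (the variable names int_fmt and dbl_fmt are just for illustration):

```r
# Integer input: the scientific argument has no visible effect
int_fmt <- format(1:3, scientific = TRUE)
print(int_fmt)

# Double input: scientific notation is applied
dbl_fmt <- format(c(1, 2, 3), scientific = TRUE)
print(dbl_fmt)

# table() stores its counts as integers, which is why as.numeric() matters here
tab_type <- typeof(table(c("a", "b", "a")))
print(tab_type)
```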

From your comments on @January's answer, it appears you're just looking for c(table(example)).
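For illustration, here is what c() does to a table, assuming the same kind of example vector as in the question (set.seed is added only to make the sketch reproducible):

```r
set.seed(1)  # only so the illustration is reproducible
example <- sample(LETTERS, replace = TRUE)

# c() drops the "table" class and keeps the counts as a named integer vector
counts <- c(table(example))
print(class(counts))
print(counts)
```

The result prints like an ordinary named vector, with no quotes, because nothing has been converted to character.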

Here's a solution I've found:
View(format(xtabs(...), scientific = F, digits = 1))
The table will appear in another window instead of inside the console, like I originally wanted, but it solves my problem of quickly showing pretty data without resorting to long commands. More elegant solutions are welcome!

Related

pattern matching a formula in R

I want to do a pattern matching of variables in a formula. The ideal solution should be able to perform as below:
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456' and output should be variable_1, variable_2,variable_3, variable_4, variable_5.
Note: variable name can contain character, underscore (_), numbers only and operations are limited to +,-,*,/. formula may contain constants as well (like here it is 456). The output should contain only variables names and should ignore any numeric constants.
I have tried the code below. It only handles variable names made up of letters, and it does not work with the minus operation (-) either.
formula <- "variableX +variableY*VariableZ"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T)
gives the output below:
[,1]
[1,] "variableX"
[2,] ""
[3,] "variableY"
[4,] ""
[5,] "VariableZ"
which is correct, but when I include the minus operation (-), strapplyc gives wrong results:
formula <- "variableX -variableY"
strapplyc(gsub(" ", "", format(formula), fixed = T), "-?|[a-zA-Z_]+", simplify = T, ignore.case = T)
gives the output below:
[,1]
[1,] "variableX"
[2,] "-"
[3,] "variableY"
I would appreciate if anyone could help me on ideal solution.
You can use regular expressions for this:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5"
gsub("[\\+\\*\\-\\/]", ", ", formula)
Explanation of the regex:
[ and ] start and end a character class, i.e. the set of characters you want to match
\\+ escapes the + sign, which you want to replace with ", "
\\* escapes the * sign, which you want to replace with ", "
\\- escapes the - sign, which you want to replace with ", "
\\/ escapes the / sign, which you want to replace with ", "
Edit to reflect OP's updated request
Another way would be to extract just your variables. The code below works if your variable names follow the format lowercaseletters_number:
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5+34+brigadeiro_5"
paste(regmatches(formula, gregexpr("variable_[0-9]", formula))[[1]],
collapse = ", ")
You can also use the stringr package if you want the code to look a little cleaner:
library(stringr)
str_extract_all(formula, "[a-z]*_[0-9]*")
You could use strsplit() with some extras.
res <- trimws(el(strsplit(formula, "\\+|\\-|\\*|\\/")))
We then keep only the elements that yield NA when coerced with as.numeric(), since those are the ones that are not numeric constants.
res[is.na(suppressWarnings(as.numeric(res)))]
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
Data
formula <- 'variable_1+variable_2*variable_3-variable_4/variable_5 + 456'
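For what it's worth, here are two further base R sketches for the stated rules (names made of letters, digits and underscores; constants ignored). The regex is my own reading of those rules, and the second route swaps in R's parser instead of pattern matching, which only works when the string happens to be valid R syntax.

```r
formula <- "variable_1+variable_2*variable_3-variable_4/variable_5 + 456"

# Regex route: an identifier starts with a letter or underscore, followed by
# letters, digits or underscores, so a bare number such as 456 never matches
vars_regex <- regmatches(formula,
                         gregexpr("[A-Za-z_][A-Za-z0-9_]*", formula))[[1]]
print(vars_regex)

# Parser route: treat the string as an R expression and list its variables
vars_parsed <- all.vars(parse(text = formula))
print(vars_parsed)
```

The regex route would also pick up the "e5" in a constant written as 1e5, so the parser route is probably safer if constants can appear in that form.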

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two digits. I would like the entire list of tags to have at least 3 digits, like this (the issue is only in the BB tags, if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
I've looked into the sprintf and formatC functions but am still having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub calls can be enough: add one leading zero for two-digit numbers, and two leading zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
This works, but will have edge cases:
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using the sprintf and gsub functions. This step extracts the numeric values and changes their format.
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
The next step is to paste the reformatted numbers back onto the letters (note that this drops any other characters, so "EP121_1" becomes "EP1211"):
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")
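Since the question says only the BB tags need padding, a slightly more general sketch could wrap that in a function. The name pad_bb and its width argument are hypothetical, and the sketch assumes a BB tag is exactly "BB" followed by digits.

```r
# Hypothetical helper: zero-pad the numeric part of "BB<digits>" tags only
pad_bb <- function(tags, width = 3) {
  is_bb <- grepl("^BB[0-9]+$", tags)               # leave every other tag alone
  digits <- as.integer(sub("^BB", "", tags[is_bb]))
  fmt <- paste0("BB%0", width, "d")                # e.g. "BB%03d"
  tags[is_bb] <- sprintf(fmt, digits)
  tags
}

x <- c("Y188", "EP121_1", "BB2", "BB20", "BB044")
padded <- pad_bb(x)
print(padded)
```

Restricting the match to the BB pattern avoids touching tags such as EP121_1, which the digit-stripping approaches can mangle.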

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (both R and r) in rquote.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then apply grepl with ignore.case=TRUE to 'chars' (the strsplit output from the OP's code) up to that position (ind) to get a logical index of the r's, and use sum to count them.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use sub and gsub on the original string ('rquote'). The inner sub removes the characters from the first u until the end of the string (u.*$), and the outer gsub matches all characters except R and r ([^Rr]) and replaces them with ''. We can use nchar to get the count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the 'r' in the entire string, use gregexpr to get the positions of the matching characters in the original string ('rquote') and take the length:
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
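One thing the length-based counts above gloss over is the no-match case: gregexpr returns -1 when nothing matches, so its length is still 1. A small sketch that guards against that, and against a string with no 'u' at all, might look like this (the function name count_r_before_u is hypothetical):

```r
# Hypothetical helper: count r/R before the first "u", with guards
count_r_before_u <- function(s) {
  u_pos <- as.integer(regexpr("u", s, fixed = TRUE))   # -1 if there is no "u"
  head_part <- if (u_pos > 0) substr(s, 1, u_pos - 1) else s
  hits <- gregexpr("r", head_part, ignore.case = TRUE)[[1]]
  sum(hits > 0)                                        # -1 means "no match"
}

rquote <- "R's internals are irrefutably intriguing"
n_before_u <- count_r_before_u(rquote)
print(n_before_u)
```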

how do you format numbers in vector without having extra spaces and quotes around the numbers

I have a vector like this:
dput(yy)
c(97.1433841613379, 1102.1208262592, 32.5418522860492, 217.694780086999,
1306.31759309228, 202.18335752298, 22.8301149425287)
I need to keep only 2 decimal places, and I am doing this to get rid of the additional decimal places:
yy<-format(yy, digits=1)
When I do dput(yy), I get additional spaces in front of my values, like this:
dput(yy)
c(" 97.14", "1102.12", " 32.54", " 217.69", "1306.32", " 202.18",
" 22.83")
Is there an easy way to format the numbers without inserting extra space and quotes around the numbers?
You could use ?sprintf (it uses the same syntax as sprintf in C):
x <- c(97.1433841613379, 1102.1208262592, 32.5418522860492, 217.694780086999, 1306.31759309228, 202.18335752298, 22.8301149425287)
sprintf("%.2f", x)
# [1] "97.14" "1102.12" "32.54" "217.69" "1306.32" "202.18" "22.83"
EDIT:
Or are you looking for ?round?
round(x, digits=2)
# [1] 97.14 1102.12 32.54 217.69 1306.32 202.18 22.83
If you want to keep everything as numbers then use round(x, 2), however that will change a number like 1.5000002 to 1.5 rather than 1.50 that you could get with format or sprintf.
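To make that trade-off concrete, here is a small sketch using a few of the values from the question (the variable names are just for illustration):

```r
yy <- c(97.1433841613379, 1102.1208262592, 32.5418522860492)

# sprintf(): strings with exactly two decimals and no padding
as_text <- sprintf("%.2f", yy)
cat(as_text, "\n")                 # cat() shows them without quotes

# format() pads to a common width; trimws() removes the padding again
trimmed <- trimws(format(round(yy, 2), nsmall = 2))

# round(): stays numeric, but trailing zeros are not kept
as_num <- round(yy, 2)
print(round(1.5000002, 2))         # prints 1.5
print(sprintf("%.2f", 1.5000002))  # prints "1.50"
```

Either way, anything that guarantees a fixed number of decimals ends up as character; the quotes are only a matter of how the strings are printed.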

How to format a number with specified level of precision?

I would like to create a function that returns a vector of numbers with only n significant figures, but without trailing zeros and not in scientific notation
e.g, I would like
somenumbers <- c(0.000001234567, 1234567.89)
myformat(x = somenumbers, n = 3)
to return
[1] 0.00000123 1230000
I have been playing with format, formatC, and sprintf, but they don't seem to want to work on each number independently, and they return the numbers as character strings (in quotes).
This is the closest that I have gotten, for example:
> format(signif(somenumbers,4), scientific=FALSE)
[1] " 0.000001235" "1235000.000000000"
You can use the signif function to round to a given number of significant digits. If you don't want extra trailing 0's then don't "print" the results but do something else with them.
> somenumbers <- c(0.000001234567, 1234567.89)
> options(scipen=5)
> cat(signif(somenumbers,3),'\n')
0.00000123 1230000
>
sprintf seems to do it:
sprintf(c("%1.8f", "%1.0f"), signif(somenumbers, 3))
[1] "0.00000123" "1230000"
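The sprintf call above hard-codes one format per number. A sketch that works the number of decimals out from n instead could look like the following; the name format_signif is hypothetical, and NA or infinite values are not handled.

```r
# Hypothetical helper: fixed notation with n significant figures per element
format_signif <- function(x, n) {
  s <- signif(x, n)
  # Decimal places needed to show n significant figures of each value;
  # zeros get 0 decimals, large numbers are clamped at 0 decimals
  decimals <- ifelse(s == 0, 0, pmax(0, n - 1 - floor(log10(abs(s)))))
  sprintf(paste0("%.", decimals, "f"), s)
}

somenumbers <- c(0.000001234567, 1234567.89)
res <- format_signif(somenumbers, 3)
print(noquote(res))
```

The result is still a character vector; noquote() only hides the quotes when printing. Zeros that are themselves significant figures are kept.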
How about
myformat <- function(x, n) {
  # format each number on its own so the elements don't share a common width
  noquote(sapply(x, function(v) format(signif(v, n), scientific=FALSE)))
}
