I have a dataset that abbreviates numerical values in a column. For example, 12M means 12 million and 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?
I've thought about using gsub to convert M to 000,000 etc., but that does not take into account the decimals (1.5M would then become 1.5000000).
So you want to translate SI unit abbreviations ('K', 'M', ...) into exponents, and thus numeric powers of ten.
Given that all units are single letters and the exponents are uniformly spaced powers of 10**3, here's working code that handles 'Kilo' through 'Yotta', and any future exponents:
> 10 ** (3*as.integer(regexpr('T', 'KMGTPEY')))
[1] 1e+12
Then just multiply that power-of-ten by the decimal value you have.
Also, you probably want to detect and handle the no-match case for unknown letter prefixes, since regexpr returns -1 there; otherwise you'd get a nonsensical exponent of -1*3:
> unit_to_power <- function(u) {
    idx <- as.integer(regexpr(u, 'KMGTPEY'))  # -1 when the letter is unknown
    if (idx > 0) 10**(3*idx) else 1
  }
Now if you want to case-insensitively match both 'k' and 'K' to Kilo (as computer people often write, even though it's technically an abuse of SI), then you'll need to special-case, e.g. with an if-else ladder/expression (SI units are case-sensitive in general: 'M' means 'Mega' but 'm' strictly means 'milli', even if disk-drive users say otherwise; upper case is conventionally for positive exponents). So for a few prefixes, DanielV's case-specific code is better.
If you want negative SI prefixes too, use as.integer(regexpr(u, 'zafpnum#KMGTPEY') - 8), where # is just a throwaway character to keep uniform spacing; it shouldn't actually get matched. Again, if you need to handle non-power-of-10**3 units like 'deci' or 'centi', that will require special-casing, or the general dict-based approach WeNYoBen uses.
base::regexpr is only vectorized over its text argument, not the pattern, and its performance is poor on big inputs; if you want to vectorize and get higher performance, use stringr::str_locate.
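To illustrate one way to vectorize, here is a minimal base-R sketch using match() over the same prefix string rather than stringr; the unit_vals vector is made up for illustration:

```r
# Vectorized unit parsing: look up each suffix letter's position in the
# prefix string, then scale by the matching power of 10^3.
unit_vals <- c("1.5K", "12M", "3G", "42")            # hypothetical inputs
last  <- substr(unit_vals, nchar(unit_vals), nchar(unit_vals))
idx   <- match(last, strsplit("KMGTPEY", "")[[1]])   # NA when no unit suffix
mult  <- ifelse(is.na(idx), 1, 10^(3 * idx))
body_ <- ifelse(is.na(idx), unit_vals,
                substr(unit_vals, 1, nchar(unit_vals) - 1))
as.numeric(body_) * mult                             # 1500, 1.2e7, 3e9, 42
```

Unknown suffixes simply get a multiplier of 1, so plain numbers like "42" pass through unchanged.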
Give this a shot:
Text_Num <- function(x) {
  if (grepl("M", x, ignore.case = TRUE)) {
    as.numeric(gsub("M", "", x, ignore.case = TRUE)) * 1e6
  } else if (grepl("k", x, ignore.case = TRUE)) {
    as.numeric(gsub("k", "", x, ignore.case = TRUE)) * 1e3
  } else {
    as.numeric(x)
  }
}
In your case you can use gsubfn:
a <- c('12M', '1.2k')
dict <- list("k" = "e3", "M" = "e6")
as.numeric(gsubfn::gsubfn(paste(names(dict), collapse = "|"), dict, a))
[1] 1.2e+07 1.2e+03
Here is another answer, using eval(parse()).
Define the function:
res <- function(x) {
  result <- as.numeric(x)
  if (is.na(result)) {
    text <- gsub("k", "*1e3", x, ignore.case = TRUE)
    text <- gsub("m", "*1e6", text, ignore.case = TRUE)
    result <- eval(parse(text = text))
  }
  return(result)
}
Result
> res("5M")
[1] 5e+06
> res("4K")
[1] 4000
> res("100")
[1] 100
> res("4k")
[1] 4000
> res("1e3")
[1] 1000
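Tying this back to the original goal of sorting, any of the conversions above can feed order(); a minimal sketch (the vals vector here is made up for illustration):

```r
# Convert abbreviated values to numerics, then sort the originals by them.
vals <- c("12M", "1.2k", "500", "3.4M")         # hypothetical column values
to_num <- function(x) {
  x <- gsub("k", "e3", x, ignore.case = TRUE)   # 1.2k -> 1.2e3
  x <- gsub("m", "e6", x, ignore.case = TRUE)   # 12M  -> 12e6
  as.numeric(x)
}
vals[order(to_num(vals))]
# "500" "1.2k" "3.4M" "12M"
```

Since gsub and as.numeric are vectorized, to_num works on the whole column at once.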
I want to compare two character values in R and see which characters were added and deleted, to display it later similar to git diff --color-words=. (see screenshot below)
For example:
a <- "hello world"
b <- "helo world!"
diff <- FUN(a, b)
where diff would somehow show that an l was dropped and a ! was added.
The ultimate goal is to construct an html string like this hel<span class="deleted">l</span>o world<span class="added">!</span>.
I am aware of diffobj but so far I cannot get it to return the character differences, only the differences between elements.
Output of git diff --color-words=. (screenshot omitted)
Base R has a function adist that computes the generalized Levenshtein distance. With counts = TRUE and partial = FALSE, the attribute "trafos" is set to the sequence of matches, insertions and deletions needed to go from one string to the other. From the documentation, section Value, my emphasis:
If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively. Additionally, if partial = FALSE, the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements ‘M’, ‘I’, ‘D’ and ‘S’ indicating a match, insertion, deletion and substitution, respectively. If partial = TRUE, the offsets (positions of the first and last element) of the matched substrings are returned as the "offsets" attribute of the return value (with both offsets -1 in case of no match).
a <- "hello world"
b <- "helo world!"
attr(adist(a, b, counts = TRUE), "trafos")
#> [,1]
#> [1,] "MMDMMMMMMMMI"
Created on 2022-05-31 by the reprex package (v2.0.1)
There is a deletion in the 3rd character and an insertion at the end of string a.
Found a solution using diffobj::ses_dat() and splitting the data into its characters before.
get_html_diff <- function(a, b) {
aa <- strsplit(a, "")[[1]]
bb <- strsplit(b, "")[[1]]
s <- diffobj::ses_dat(aa, bb)
m <- cumsum(as.integer(s$op) != c(Inf, s$op[1:(length(s$op) - 1)]))
res <- paste(
sapply(split(seq_along(s$op), m), function(i) {
val <- paste(s$val[i], collapse = "")
if (s$op[i[[1]]] == "Insert")
val <- paste0("<span class=\"add\">", val, "</span>")
if (s$op[i[[1]]] == "Delete")
val <- paste0("<span class=\"del\">", val, "</span>")
val
}),
collapse = "")
res
}
get_html_diff("hello world", "helo World!")
#> [1] "hel<span class=\"del\">l</span>o <span class=\"del\">w</span><span class=\"add\">W</span>orld<span class=\"add\">!</span>"
Created on 2022-05-31 by the reprex package (v2.0.1)
We use diffobj to compare configuration files (in a more or less production environment), and it works just right. In your case, wouldn't diffobj::diffChr be what you want?
diffobj::diffChr("hello world", "helo world!", color.mode = 'rgb')
I have the below R code.
OBJECTIVE: I am trying to check whether each string in the kind object is a subsequence of the word object, by iterating over and comparing character positions of the two objects. If it is, the code prints POSITIVE, else NEGATIVE.
PROBLEM STATEMENT:
If each string in kind has few characters, as in c('abcde','crnas','onarous','ravus'), the response is fast. If the strings in kind are long (length around 10 ^ 5), as in c('cdcdc.....{1LCharacters}','fffw....{1LCharacters}','efefefef..{1LCharacters}'), it takes much more time to process. Is there a better way to write this so that the running time stays small?
Suggestions / Corrections are highly appreciated.
word <- "coronavirus"
total <- "3"
kind <- c('abcde', 'crnas', 'onarous', 'ravus')
invisible(lapply(kind, function(x) {
  if (nchar(x) > nchar(word)) {
    cat("NEGATIVE", sep = '\n')
    return(invisible(NULL))
  }
  index <- 1
  for (i in seq(from = 1, to = nchar(word), by = 1)) {
    if (substr(word, i, i) == substr(x, index, index)) {
      index <- index + 1
    }
  }
  if (index == nchar(x) + 1) {
    cat("POSITIVE", sep = '\n')
  } else {
    cat("NEGATIVE", sep = '\n')
  }
}))
Output:
NEGATIVE
POSITIVE
NEGATIVE
POSITIVE
You could also do:
vals <- attr(adist(kind, word, counts = TRUE), 'counts')[, , 3]
ifelse(vals > 0, 'NEGATIVE', 'POSITIVE')
[1] "NEGATIVE" "POSITIVE" "NEGATIVE" "POSITIVE"
Update
If you want to print the result vertically, you can try cat like below
cat(
paste0(c("NEGATIVE", "POSITIVE")[
1 +
sapply(
gsub("(?<=.)(?=.)", ".*", kind, perl = TRUE),
grepl,
x = word
)
], collapse = "\n"),
"\n"
)
which gives
NEGATIVE
POSITIVE
NEGATIVE
POSITIVE
I guess you can try gsub + grepl like below
c("NEGATIVE", "POSITIVE")[
1 +
sapply(
gsub("(?<=.)(?=.)", ".*", kind, perl = TRUE),
grepl,
x = word
)
]
which gives
[1] "NEGATIVE" "POSITIVE" "NEGATIVE" "POSITIVE"
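For the very long strings mentioned in the question, regex backtracking on patterns like c.*r.*n.*a.*s can get expensive; a linear two-pointer scan is a hedged alternative sketch (is_subseq is a made-up helper name, not part of any package):

```r
# Check whether x is a subsequence of word in a single left-to-right pass.
is_subseq <- function(x, word) {
  wc <- strsplit(word, "")[[1]]
  xc <- strsplit(x, "")[[1]]
  i <- 1
  for (ch in wc) {
    if (i <= length(xc) && ch == xc[i]) i <- i + 1
  }
  i > length(xc)   # TRUE when every character of x was matched in order
}

word <- "coronavirus"
kind <- c('abcde', 'crnas', 'onarous', 'ravus')
ifelse(vapply(kind, is_subseq, logical(1), word = word, USE.NAMES = FALSE),
       "POSITIVE", "NEGATIVE")
# [1] "NEGATIVE" "POSITIVE" "NEGATIVE" "POSITIVE"
```

Each comparison is O(nchar(word)), so the runtime grows linearly rather than with regex backtracking.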
For a library call I have to provide a separator, which must not occur in the text, because otherwise the library call gets confused.
Now I was wondering how I can adapt my code to ensure that the separator I use is guaranteed not to occur in the input text.
I am solving this issue with a while loop: I make a (hardcoded) assumption about the most unlikely string in the input, check if it is present, and if so, just enlarge the string. This works but feels very hackish, so I was wondering whether there is a more elegant version (e.g. an existing base R function, or a loop-free solution) which does the same for me? Ideally the found separator is also minimal in length.
I could simply hardcode a large enough set of potential separators and look for the first one not occurring in the text, but this may also break at some point if all of these separators happen to occur in my input.
The reasoning is that even if it will never happen (well, never say never), I am afraid that in some distant future there will be this one input string which requires thousands of while-loop iterations before finding an unused string.
input_string <- c("a/b", "a#b", "a//b", "a-b", "a,b", "a.b")
orig_sep <- sep <- "/" ## first guess as a separator
while(any(grepl(sep, input_string, fixed = TRUE))) {
sep <- paste0(sep, orig_sep)
}
print(sep)
# "///"
In case a single ASCII character can be found, you can use table.
tt <- table(factor(strsplit(paste(input_string, collapse = ""), "")[[1]],
                   rawToChar(as.raw(32:126), TRUE)))
names(tt)[tt == 0]
rawToChar(as.raw(32:126), TRUE) gives you all printable ASCII characters, which are used as the factor levels, and table counts all occurrences. Any character with a count of 0 can be used as a separator.
In case you need a two-character ASCII separator, you can try the following, which returns all possible delimiters:
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
Or for n-character separators:
orig_sep <- x <- rawToChar(as.raw(32:126), TRUE)
sep <- x[0]
repeat {
sep <- x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
if(length(sep) > 0) break;
x <- c(outer(x, orig_sep, paste0))
}
sep
To search for a one- or two-character ASCII separator with only a sapply loop, taking a separator of minimal length:
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(x, outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})][1]
#[1] " "
In case you want to know how many times a character needs to be repeated to work as a separator, as you do in the question, you can use gregexpr.
strrep("/", max(sapply(gregexpr("/*", input_string)
, function(x) max(attributes(x)$match.length)))+1)
#[1] "///"
strrep("/", max(c(0, sapply(gregexpr("/+", input_string)
, function(x) max(attributes(x)$match.length))))+1)
#[1] "///"
I made some benchmarks, and the sad news is that the regex solution only pays off if we have a lot of occurrences of the separator in the input string. I won't expect long repetitions of the separator, so from that perspective the while solution should be preferable, but it would be the first time in my R life that I actually had to rely on a while construct.
Code
library(microbenchmark)
sep <- "/"
make_input <- function(max_occ, vec_len = 1000) {
paste0("A", strrep(sep, sample(0:max_occ, vec_len, TRUE)))
}
set.seed(1)
no_occ <- make_input(0)
typ_occ <- make_input(1)
mid_occ <- make_input(10)
high_occ <- make_input(100)
while_fun <- function(in_str) {
my_sep <- sep
while(any(grepl(my_sep, in_str, fixed = TRUE))) {
my_sep <- paste0(my_sep, sep)
}
my_sep
}
greg_fun <- function(in_str) {
strrep(sep,
max(sapply(gregexpr(paste0(sep, "+"), in_str),
purrr::attr_getter("match.length")), 0) + 1)
}
microbenchmark(no_occ_w = while_fun(no_occ),
no_occ_r = greg_fun(no_occ),
typ_occ_w = while_fun(typ_occ),
typ_occ_r = greg_fun(typ_occ),
mid_occ_w = while_fun(mid_occ),
mid_occ_r = greg_fun(mid_occ),
high_occ_w = while_fun(high_occ),
high_occ_r = greg_fun(high_occ))
Results
Unit: microseconds
expr min lq mean median uq max neval cld
no_occ_w 12.3 13.30 15.947 14.60 16.55 51.1 100 a
no_occ_r 1074.8 1184.90 1981.637 1253.45 1546.20 7037.9 100 b
typ_occ_w 33.8 36.00 42.842 38.55 41.45 229.2 100 a
typ_occ_r 1090.4 1192.15 2090.526 1283.80 1547.10 8490.7 100 b
mid_occ_w 277.9 283.35 336.466 288.30 309.45 3452.2 100 a
mid_occ_r 1161.6 1269.50 2204.213 1368.45 1789.20 7664.7 100 b
high_occ_w 3736.4 3852.95 4082.844 3962.30 4097.60 6658.3 100 d
high_occ_r 1685.5 1776.15 2819.703 1868.10 4065.00 7960.9 100 c
I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character and returns the first position at which the character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:])
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features: you can specify the maximum number of matches you want in the n argument (it defaults to 1), and you can choose between the position and the value itself with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "A" "B"
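If only the first position is needed, regexpr alone replicates ANYALPHA's behavior (anyalpha1 is a made-up name; the fallback to 0 mirrors SAS, which returns 0 when no alphabetic character is found):

```r
# First position of an alphabetic character, 0 if there is none.
anyalpha1 <- function(s) {
  p <- as.integer(regexpr("[[:alpha:]]", s))  # -1 when no match
  ifelse(p > 0, p, 0L)
}
anyalpha1("123456789A")  # 10
anyalpha1("12345")       # 0
```

As a bonus, regexpr is vectorized over its text argument, so anyalpha1 also works on whole character vectors.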
I'm having trouble with simply writing a function that turns a string representation of a number into a decimal representation. To boil the issue down to essentials, consider the following function:
f <- function(x) {
y <- as.numeric(x)
return(y)
}
When I apply this function to the string "47.418" I get back 47.42, but what I want to get back is 47.418. It seems like the return value is being rounded for some reason.
Any suggestions would be appreciated
You have done something to your print options. I get no rounding:
> f <- function(x) { y <- as.numeric(x); return(y) }
> f(47.418)
[1] 47.418
?options
The default value for digits is 7:
> options("digits")
$digits
[1] 7
Further questions should be accompanied by dput() on the object in question.
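To see that only the display is affected, not the stored value, here is a minimal demonstration (digits = 4 is just an arbitrarily low setting):

```r
x <- as.numeric("47.418")
old <- options(digits = 4)   # temporarily lower the print precision
print(x)                     # displays 47.42
stopifnot(x == 47.418)       # the value itself is not rounded
options(old)                 # restore the previous setting
print(x)                     # displays 47.418 again
```

options("digits") only controls how many significant digits print shows; the full double-precision value is kept either way.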