Decode a string in hexadecimal representation - r

In R, what is an efficient way to convert a string encoded in hexadecimal, such as "40414243", to its equivalent characters, e.g. "@ABC"?
For instance, the equivalent of this code:
library(stringr)
FromHexString <- function (hex.string) {
result <- ""
length <- str_length(hex.string)
for (i in seq(1, length, by=2)) {
hex.value <- str_sub(hex.string, i, i + 1)
char.code <- strtoi(hex.value, 16)
char <- rawToChar(as.raw(char.code))
result <- paste(result, char, sep="")
char
}
result
}
Which produces:
> FromHexString("40414243")
[1] "@ABC"
While the above code works, it's not efficient at all, using a lot of string concatenations.
So the question is how to write an idiomatic, efficient R function that does this operation.
Edit: My sample works only for ASCII encoding, not for UTF-8 encoded byte arrays.

Test if that is more efficient (for longer strings):
string <- "40414243"
intToUtf8(
strtoi(
do.call(
paste0,
as.data.frame(
matrix(
strsplit(string, split = "")[[1]],
ncol=2,
byrow=TRUE),
stringsAsFactors=FALSE)),
base=16L)
)
#[1] "@ABC"
Otherwise you could look for a C/C++ implementation.

Modify your code so that it uses a lookup table. Your lookup table will have 256 values, one for each byte from 00 to FF. Put them in a vector and get their values from that vector.
Note: No other solution will beat this one if you need to do a lot of conversions.
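A minimal sketch of the lookup-table idea, assuming a table of 256 entries keyed by the two-character hex codes "00" through "FF". The names hex_lookup and from_hex_lookup are illustrative, not from the answer, and the function assumes a non-empty input of even length that contains only valid hex digits.

```r
# Build the table once: names are "00".."FF", values are the byte values 0..255.
hex_lookup <- setNames(0:255, sprintf("%02X", 0:255))

# Hypothetical helper: decode a hex string by table lookup.
from_hex_lookup <- function(hex_string) {
  # Start position of each two-character pair.
  starts <- seq(1, nchar(hex_string), by = 2)
  pairs <- toupper(substring(hex_string, starts, starts + 1))
  # Look every pair up by name, then turn the byte values into a string.
  rawToChar(as.raw(unname(hex_lookup[pairs])))
}

from_hex_lookup("40414243")
# [1] "@ABC"
```

The table is built once and reused, so each conversion costs only a split and a named lookup.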

If you don't want to use a lookup table (or just like code golfing :-) ), consider writing a vectorized version of something like:
bar <- unlist(strsplit(foo, '')) # separates input into individual characters
items <- sapply(1:(length(bar)/2), function(j) paste0(bar[(2*j-1):(2*j)], collapse=''))
followed by strtoi or whatever.
But even easier (I hope...) is
sapply(1:(nchar(foo)/2), function(j) substr(foo, (2*j-1), (2*j)))
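Putting the pieces together, here is one possible complete version: split into pairs with substring (which is vectorized over its start and stop arguments), convert with strtoi, and build the string from raw bytes. The name from_hex is illustrative. Marking the result as UTF-8 is an assumption about the input bytes, added because the question mentions UTF-8 encoded byte arrays.

```r
# Hypothetical helper: decode a hex string without an explicit loop.
from_hex <- function(hex_string) {
  # Start position of each two-character pair.
  starts <- seq(1, nchar(hex_string), by = 2)
  pairs <- substring(hex_string, starts, starts + 1)
  # Parse each pair as a base-16 integer and convert to raw bytes.
  bytes <- as.raw(strtoi(pairs, base = 16L))
  out <- rawToChar(bytes)
  # Assumption: the bytes are UTF-8 encoded text.
  Encoding(out) <- "UTF-8"
  out
}

from_hex("40414243")
# [1] "@ABC"
```

Like the original, this assumes a non-empty string of even length and no 00 bytes, since rawToChar rejects embedded nuls.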

Related

How to mask a string based on a pattern of string of same length

I have the following set of string:
core_string <- "AFFVQTCRE"
mask_string <- "*KKKKKKKK"
What I want to do is to mask core_string with mask_string.
Whenever the * coincide with character in core_string, we will keep that character,
otherwise replace it.
So the desired result is:
AKKKKKKKK
Other example
core_string <- "AFFVQTCRE"
mask_string <- "*KKKK*KKK"
# result AKKKKTKKK
The length of both strings is always the same.
How can I do that with R?
Here's a helper function that will do just that
apply_mask <- function(x, mask) {
unlist(Map(function(z, m) {
m[m=="*"] <- z[m=="*"]
paste(m, collapse="")
}, strsplit(x, ""), strsplit(mask, "")))
}
basically you just split up the string into characters and replace the characters that have a "*" then paste the strings back together.
I used the Map to make sure the function is still vectorized over the inputs. For example
core_string <- c("AFFVQTCRE", "ABCDEFGHI")
mask_string <- "*KKKK*KKK"
apply_mask(core_string, mask_string)
# [1] "AKKKKTKKK" "AKKKKFKKK"
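For a single pair of strings, the same split, replace and paste steps can be sketched with ifelse. The name apply_mask_ifelse is illustrative, and the length check is an added assumption based on the question's statement that both strings always have the same length.

```r
# Hypothetical single-string variant of the split/replace/paste idea.
apply_mask_ifelse <- function(core, mask) {
  core_chars <- strsplit(core, "")[[1]]
  mask_chars <- strsplit(mask, "")[[1]]
  stopifnot(length(core_chars) == length(mask_chars))
  # Keep the core character wherever the mask has "*", else take the mask character.
  paste(ifelse(mask_chars == "*", core_chars, mask_chars), collapse = "")
}

apply_mask_ifelse("AFFVQTCRE", "*KKKK*KKK")
# [1] "AKKKKTKKK"
```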
regmatches in replacement form <- can be handy here:
regmatches(core_string, gregexpr("K", mask_string)) <- "K"
core_string
#[1] "AKKKKKKKK"
If it's a 1:1 match of characters rather than a constant, then it has to be changed up a little:
ss <- strsplit(mask_string, "")[[1]]
regmatches(core_string, gregexpr("[^*]", mask_string)) <- list(ss[ss != "*"]) # one list element per input string

Modify the object without using return in R function

I am trying to reverse a string without using extra space in R. Below is the code for the same. My question is how to get the ReverseString function change the input without using extra space. I even tried using <<- without any luck.
ReverseString <- function(TestString){
TestString <- unlist(strsplit(TestString, ""))
Left <- 1
Right <- length(TestString)
while (Left < Right){
Temp <- TestString[Left]
TestString[Left] <- TestString[Right]
TestString[Right] <- Temp
Left <- Left + 1
Right <- Right - 1
}
return(paste(TestString, collapse = ""))
}
## Input
a = "StackOverFlow"
## OutPut
ReverseString(a)
"wolFrevOkcatS"
##
a
"StackOverFlow"
It is always better to take advantage of the vectorization in R (instead of for or while loops). So, in base-R, without any packages, it would be something like:
ReverseString <- function(x) {
#splitstring splits every character, and rev reverses the order
out <- rev(strsplit(x, split = '')[[1]])
#paste to paste them together
paste(out, collapse = '')
}
a <- "StackOverFlow"
ReverseString(a)
#[1] "wolFrevOkcatS"
According to your comment you want to reverse the string without calling any function that does the reversal, i.e. no rev and co. Both of the solutions below do this.
I think you are also trying to modify global a from within the function, which is why you tried <<-. I'm not sure why it didn't work for you, but you might have used it incorrectly.
You should know that using <<- alone does not mean that you are using less space. To really save space you would have to call or modify global a at each step in your function where you call or modify TestString. This would entail some combination of assign, do.call, eval and parse - not to mention all the pasting you would have to do to access elements of a by integer position. Your function would end up bulky, nearly unreadable, and very likely less efficient due to the numerous function calls, despite having saved a negligible amount of space by not storing a copy of a. If you're dead set on creating such an abomination, then take a look at the functions I just listed and figure out how to use them.
Your energy would be better spent by improving upon your string-reversing function in other ways. For example, you can shorten it quite a bit by using a numerical sequence such as 13:1 in sapply:
library(stringr) # str_split() comes from stringr
reverse_string <- function(string) {
vec <- str_split(string, "")[[1]]
paste(sapply(length(vec):1, function(i) vec[i]), collapse = "")
}
reverse_string("StackOverFlow")
#### OUTPUT ####
[1] "wolFrevOkcatS"
If your interviewers also have a problem with reverse sequences then here's another option that's closer to your original code, just a little cleaner. I also did my best to eliminate other areas where "extra space" was being used (indices stored in single vector, no more Temp):
reverse_string2 <- function(string){
vec <- str_split(string, "")[[1]]
i_vec <- c(1, length(vec))
while(i_vec[1] < i_vec[2]) {
vec[i_vec] <- vec[c(i_vec[2], i_vec[1])]
i_vec <- i_vec + c(1, -1)
}
paste(vec, collapse = "")
}
reverse_string2("StackOverFlow")
#### OUTPUT ####
[1] "wolFrevOkcatS"
It can be done easily with stringi
library(stringi)
a <- "StackOverFlow"
stri_reverse(a)
#[1] "wolFrevOkcatS"
I'm not sure I understood exactly the problem, but I think you're looking for a way to reverse the string object and automatically assign it to the original object without having to do a <- ReverseString(a) (assuming this is the reason why you tried using <<-). My solution to this is using deparse(substitute()) to read the original variable name inside the function and assign (using envir = .GlobalEnv) to assign your result over the original variable.
ReverseString <- function(TestString){
nm <- deparse(substitute(TestString))
TestString <- unlist(strsplit(TestString, ""))
Left <- 1
Right <- length(TestString)
while (Left < Right){
Temp <- TestString[Left]
TestString[Left] <- TestString[Right]
TestString[Right] <- Temp
Left <- Left + 1
Right <- Right - 1
}
assign(nm, paste(TestString, collapse = ""), envir = .GlobalEnv)
}
## Input
a = "StackOverFlow"
ReverseString(a)
a
#[1] "wolFrevOkcatS"
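As a possible variation on the assign approach, writing to parent.frame() instead of .GlobalEnv updates the variable in whichever environment the function was called from, so it also works when the call happens inside another function. This is a sketch, not the answer's method: the name reverse_in_place is made up, and rev is used here only to keep the example short.

```r
# Hypothetical variant: overwrite the caller's variable rather than a global one.
reverse_in_place <- function(x) {
  # Capture the name of the variable that was passed in.
  nm <- deparse(substitute(x))
  chars <- strsplit(x, "")[[1]]
  # Assign the reversed string back under the same name in the caller's environment.
  assign(nm, paste(rev(chars), collapse = ""), envir = parent.frame())
  invisible(NULL)
}

a <- "StackOverFlow"
reverse_in_place(a)
a
# [1] "wolFrevOkcatS"
```

This only makes sense when the argument is a plain variable name; passing a literal or an expression would create an oddly named variable in the caller.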

Resolving a formatter string

Suppose I have the following:
format.string <- "#AB#-#BC#/#DF#" #wanted to use $ but it is problematic
value.list <- c(AB="a", BC="bcd", DF="def")
I would like to apply the value.list to the format.string so that each named value is substituted. So in this example I should end up with the string: a-bcd/def
I tried to do it like the following:
resolved.string <- lapply(names(value.list),
function(x) {
sub(x = save.data.path.pattern,
pattern = paste0(c("#",x,"#"), collapse=""),
replacement = value.list[x]) })
But it doesn't seem to be working correctly. Where am I going wrong?
The glue package is designed for this. You can change the opening and closing delimiters using .open and .close, but they have to be different. Also note that value.list has to be either a list or a dataframe:
library(glue)
format.string <- "{AB}-{BC}/{DF}"
value.list <- list(AB="a", BC="bcd", DF="def")
glue_data(value.list, format.string)
# a-bcd/def
To answer your actual question, by using lapply over names(value.list) you, as your output shows, take each of the elements of value.list and perform the replacement. However, all this happens independently, i.e., the replacements aren't ultimately combined to a single result.
To make something very similar to your approach work, we can use Reduce, which does exactly this combining:
Reduce(function(x, y) sub(paste0(c("#", y, "#"), collapse = ""), value.list[y], x),
init = format.string, names(value.list))
# [1] "a-bcd/def"
If we call the anonymous function f, then the result is
f(f(f(format.string, "AB"), "BC"), "DF")
exactly as you intended, I believe.
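The fold that Reduce performs can be written out as an explicit loop, which may make the accumulation easier to see. This sketch reuses format.string and value.list from the question; the variable result and the use of fixed = TRUE (to treat the placeholder as literal text) are additions.

```r
format.string <- "#AB#-#BC#/#DF#"
value.list <- c(AB = "a", BC = "bcd", DF = "def")

# Start from the template and substitute one placeholder per iteration,
# feeding each result into the next step.
result <- format.string
for (key in names(value.list)) {
  result <- sub(paste0("#", key, "#"), value.list[[key]], result, fixed = TRUE)
}
result
# [1] "a-bcd/def"
```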
We can use gsubfn that can take a key/value pair as replacement to change the pattern with the 'value'
library(gsubfn)
gsub("#", "", gsubfn("[^#]+", as.list(value.list), format.string))
#[1] "a-bcd/def"
NOTE: 'value.list' is a named vector, not a list, hence the as.list() call.

Converting encoding of deparsed strings

I have the following vector:
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
I want to convert it to a list with the Unicode escape replaced by the corresponding UTF-8 character, like so:
goal <- list("Chamberlain", "Romañach", "<node>")
The deparsed string is causing problems. If the second string was instead:
wouldbenice <- "Roma\u00F1ach"
Then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).
I can get the second string to display correctly in UTF-8 with:
# displays "Romañach"
eval(parse(text = x[[2]]))
However, this goes poorly (throws errors) with x[[1]] and x[[3]]. How can I reliably parse the entire list into the appropriate encoding?
Use the stringi package.
Use stri_replace_all_regex to strip the embedded quotes and stri_unescape_unicode to unescape the Unicode symbols.
library(stringi)
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
removed_quotes <- stri_replace_all_regex(x, "\"", "")
unescaped <- stri_unescape_unicode(removed_quotes)
# [1] "Chamberlain" "Romañach" "<node>"
This satisfies the objective in base R, but seems less than ideal in other ways. Putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.
utf8me <- function(x){
i <- grepl('\\u', x) # Not a robust way to detect a unicode char?
x[i] <- eval(parse(text=x[i])) #
x
}
lapply(x, utf8me)
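One way the detection step might be made more selective in base R is to parse only the elements that look like deparsed string literals, i.e. that start and end with a double quote. This is a sketch under that assumption. The name unescape_deparsed is illustrative, and eval(parse()) should only be used on input you trust.

```r
# Hypothetical helper: parse only elements wrapped in double quotes.
unescape_deparsed <- function(x) {
  # Flag the elements that look like a deparsed string literal.
  is_quoted <- grepl('^".*"$', x)
  # Let the R parser interpret the escapes in those elements only.
  x[is_quoted] <- vapply(
    x[is_quoted],
    function(s) eval(parse(text = s)),
    character(1),
    USE.NAMES = FALSE
  )
  x
}

x <- c("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")
unescape_deparsed(x)
```

Elements without surrounding quotes are returned unchanged, so "Chamberlain" and "<node>" never reach the parser.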

How to remove common parts of strings in a character vector in R?

Assume a character vector like the following
file1_p1_analysed_samples.txt
file1_p1_raw_samples.txt
f2_file2_p1_analysed_samples.txt
f3_file3_p1_raw_samples.txt
Desired output:
file1_p1_analysed
file1_p1_raw
file2_p1_analysed
file3_p1_raw
I would like to compare the elements and remove parts of the string from start and end as much as possible but keep them unique.
The above one is just an example. The parts to be removed are not common to all elements. I need a general solution independent of the strings in the above example.
So far I have been able to chuck off parts that are common to all elements, provided the separator and the resulting split parts are of same length. Here is the function,
mf <- function(x,sep){
xsplit = strsplit(x,split = sep)
xdfm <- as.data.frame(do.call(rbind,xsplit))
res <- list()
for (i in 1:ncol(xdfm)){
if (!all(xdfm[,i] == xdfm[1,i])){
res[[length(res)+1]] <- as.character(xdfm[,i])
}
}
res <- as.data.frame(do.call(rbind,res))
res <- apply(res,2,function(x) paste(x,collapse="_"))
return(res)
}
Applying the above function:
a = c("a_samples.txt","b_samples.txt")
mf(a,"_")
#  V1  V2
# "a" "b"
b = c("apple.fruit.txt","orange.fruit.txt")
mf(b,sep = "\\.")
#      V1       V2
# "apple" "orange"
If the resulting split parts are not the same length, this doesn't work.
What about
files <- c("file1_p1_analysed_samples.txt", "file1_p1_raw_samples.txt", "f2_file2_p1_analysed_samples.txt", "f3_file3_p1_raw_samples.txt")
new_files <- gsub('_samples\\.txt', '', files)
new_files
... which yields
[1] "file1_p1_analysed" "file1_p1_raw" "f2_file2_p1_analysed" "f3_file3_p1_raw"
This removes the _samples.txt part from your strings.
Why not:
strings <- c("file1_p1_analysed_samples.txt",
"file1_p1_raw_samples.txt",
"f2_file2_p1_analysed_samples.txt",
"f3_file3_p1_raw_samples.txt")
sapply(strings, function(x) {
pattern <- ".*(file[0-9].*)_samples\\.txt"
gsub(x, pattern = pattern, replacement = "\\1")
})
Things that match between ( and ) can be called back as a group in the replacement with backwards referencing. You can do this with \\1. You can even specify multiple groups!
Seeing your comment on Jan's answer. Why not define your static bits and paste together a pattern and always surround them with parentheses? Then you can always call \\i in the replacement of gsub.
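A sketch of that suggestion: keep the static pieces in variables, paste them into one pattern with the part to keep wrapped in parentheses, and refer back to it with \\1. The variable names (prefix, keep, suffix) are illustrative, and the pieces shown fit only the example file names from the question.

```r
files <- c("file1_p1_analysed_samples.txt",
           "file1_p1_raw_samples.txt",
           "f2_file2_p1_analysed_samples.txt",
           "f3_file3_p1_raw_samples.txt")

prefix <- ".*"               # anything before the part to keep
keep   <- "file[0-9].*"      # the part to keep, captured as group 1
suffix <- "_samples\\.txt"   # static tail to drop

# Assemble the pattern and replace each full match with the captured group.
pattern <- paste0(prefix, "(", keep, ")", suffix)
sub(pattern, "\\1", files)
# [1] "file1_p1_analysed" "file1_p1_raw"      "file2_p1_analysed" "file3_p1_raw"
```

sub is vectorized over its input, so no sapply is needed here.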

Resources