Import raw bytes as raw bytes in R

I have imported a string into R from a database. The db column type is BYTEA (Postgres). In order for me to use it as intended, it should be of type raw. Instead, it is of type character. I want to convert it to raw in the following sense:
The string representation is
\x1f8b080000000000
If I use charToRaw, it is converted to the array
5c 78 31 66 38 62 30 38
Instead I need it to be the array
1f 8b 08 00 00 00 00 00
How do I achieve this?
Edit #1 Reply to Chris
library(RPostgreSQL)
conn <- dbConnect(dbDriver("PostgreSQL"), dbname = "somename",
host = "1.2.3.4", port = 5432,
user = "someuser", password = pw)
some_value <- dbGetQuery(conn, "select value from schema.key_value where key like '%somekey%' limit 1")
some_value$value
# [1] "\\x1f8b080000000000000

This works for converting a single character string of the type you've described to a vector of raws.
## The string I think you're talking about
dat <- "\\x1f8b080000000000"
cat(dat, "\n")
## \x1f8b080000000000
## A function to convert one string to an array of raw
f <- function(x) {
## Break into two-character segments
x <- strsplit(x, "(?<=.{2})", perl=TRUE)[[1]]
## Remove the first element, "\\x"
x <- x[-1]
## Complete the conversion
as.raw(as.hexmode(x))
}
## Check that it works
f(dat)
## [1] 1f 8b 08 00 00 00 00 00
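A regex-free variant of the same idea is sketched below. It assumes the input is a single string in Postgres's hex BYTEA text format (a leading \x followed by an even number of hex digits); hex_to_raw is a hypothetical helper name, not part of any package.

```r
## Hypothetical helper: convert one "\x..." hex string to a raw vector
hex_to_raw <- function(x) {
  ## Drop the leading "\x" marker (a literal backslash followed by x)
  hex <- sub("^\\\\x", "", x)
  ## Each byte is written as exactly two hex digits
  stopifnot(nchar(hex) %% 2 == 0)
  ## Start position of every two-character pair
  starts <- seq(1, nchar(hex), by = 2)
  ## Parse each pair as a base-16 integer, then convert to raw
  as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))
}

hex_to_raw("\\x1f8b080000000000")
## [1] 1f 8b 08 00 00 00 00 00
```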


Changing salt with no effect: Why does the hashed password not change for different salts?

To protect passwords we can add salt while hashing. Different salt texts lead to different results for the very same password. But I noticed that I get the same result for different salt texts with bcrypt. See this example:
library("bcrypt")
my_password <- "password1"
my_salt_1 <- paste0("$2a$10$", paste0(letters[1:22], collapse= ""), collapse= "")
my_salt_2 <- paste0("$2a$10$", paste0(letters[c(1:21, 21)], collapse= ""), collapse= "")
hashpw(my_password, my_salt_1) == hashpw(my_password, my_salt_2)
TRUE
In fact it is easy to create more salts that result in the same hashed password. For example, we get the same hashed password using the salt paste0("$2a$10$", paste0(letters[c(1:21, 26)], collapse= ""), collapse= ""). Why is this happening? Is this only because the salts are quite similar, or is something else going on here?
If you jump through a lot of the source code (I'm not 100% sure I traced it correctly, but it seems to match up), the main issue looks to be that the salt bytes are base64 encoded. The two strings you created actually have the exact same base64 decoded value (see Can two different BASE 64 encoded strings result into same string if decoded). Observe
s1 <- "abcdefghijklmnopqrstuv"
s2 <- "abcdefghijklmnopqrstuu"
openssl::base64_decode(s1)
# [1] 69 b7 1d 79 f8 21 8a 39 25 9a 7a 29 aa bb 2d
openssl::base64_decode(s2)
# [1] 69 b7 1d 79 f8 21 8a 39 25 9a 7a 29 aa bb 2d
Thus you are using the identical salt. If you want a random salt, the bcrypt::gensalt() function is a safer alternative:
(my_salt_1 <- gensalt(10))
# [1] "$2a$10$XBGMfrY0DIVHX3KZVwKmM."
(my_salt_2 <- gensalt(10))
# [1] "$2a$10$NM8t5AsKmHJHs0d/hIFlbe"
hashpw(my_password, my_salt_1) == hashpw(my_password, my_salt_2)
# [1] FALSE
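One way to see why only part of the last salt character matters: a bcrypt salt is 16 bytes (128 bits), while 22 base64 characters carry 132 bits, so the low 4 bits of the final character are discarded. The sketch below assumes bcrypt's own base64 alphabet, which orders its characters differently from the standard alphabet that openssl::base64_decode uses; sextet and top_two_bits are hypothetical helper names.

```r
# bcrypt's base64 alphabet (an assumption stated above; it is not the
# standard "A-Za-z0-9+/" ordering)
bcrypt_alphabet <- strsplit(
  "./ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789", ""
)[[1]]

# 6-bit value (0-63) that a salt character stands for
sextet <- function(ch) match(ch, bcrypt_alphabet) - 1L

# Only the top 2 of those 6 bits fit into the 128-bit salt
top_two_bits <- function(ch) bitwShiftR(sextet(ch), 4L)

top_two_bits(c("u", "v", "z"))  # all three give 3, so the salts coincide
top_two_bits("e")               # gives 2, so a salt ending in "e" differs
```

Under this assumption, any final character from the same group of 16 in the alphabet produces the same salt, which would explain why the salt ending in "z" behaves like the ones ending in "u" and "v".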

identical strings from different data files won't match in R

I was doing some file merging, and two files wouldn't merge, despite having a key column that matches (I actually generated one key column by copy-pasting from the other). It's the damndest thing, and I worry that I'm either going crazy or missing something fundamental. As an example (and I cannot figure out how to make it reproducible, as when I copy and paste these strings into new objects, they compare just fine), here's my current console:
> q
[1] "1931 80th Anniversary"
> z
[1] "1931 80th Anniversary"
> q == z
[1] FALSE
I str-ed both, just in case I missed something, and...
> str(q)
chr "1931 80th Anniversary"
> str(z)
chr "1931 80th Anniversary"
What could be going on here?
This was a great puzzler. The key to diagnosing the problem was charToRaw().
> charToRaw(q)
[1] 31 39 33 31 c2 a0 38 30 74 68 c2 a0 41 6e 6e 69 76 65
[19] 72 73 61 72 79
> charToRaw(z)
[1] 31 39 33 31 20 38 30 74 68 20 41 6e 6e 69 76 65 72 73
[19] 61 72 79
Oh! Different! It seems to lie in the encoding, which, given that these were both plain ole' CSVs I loaded from, I never would have guessed, but
> Encoding(q)
[1] "UTF-8"
> Encoding(z)
[1] "unknown"
In the end, I used iconv() on q to make it work
> iconv(q, from = 'UTF-8', to = 'ASCII//TRANSLIT') == z
[1] TRUE
This has been a weird journey, and I hope this helps someone else who is as baffled as I was - and they learn a few new functions along the way.
It looks like you have non-breaking spaces in your string, which isn't really an encoding issue. This happens to me all the time because alt + space inserts a non-breaking space on a Mac, and I use alt on my German keyboard for all sorts of special characters, too. My pinkies are my slowest fingers and they don't always release alt fast enough when I transition from some special character to a space. I discovered this problem writing bash scripts, where <command> | <command> is common and | is alt + 7.
I think stringr::str_replace_all(q, "\\s", " ") should fix your current issue. Alternatively, you can try targeting specific non-printables, e.g. in your situation stringr::str_replace_all(q, "\uA0", " "). To expose the offending characters you can use stringi::stri_escape_unicode(q), which would return "1931\\u00a080th\\u00a0Anniversary". You can then just copy and paste to get the same results as above: stringr::str_replace_all(q, "\u00a0", " ")
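The same diagnosis and fix can be sketched in base R, without stringr or stringi. The strings below are stand-ins that recreate the situation with an explicit \u00a0 escape; nbsp_positions and q_clean are illustrative names.

```r
# Recreate the two strings: q contains non-breaking spaces (U+00A0),
# z contains ordinary spaces
q <- "1931\u00a080th\u00a0Anniversary"
z <- "1931 80th Anniversary"

# Locate the offending characters by code point (0xA0 is the non-breaking space)
nbsp_positions <- which(utf8ToInt(q) == 0xA0L)
nbsp_positions
# [1]  5 10

# Replace every non-breaking space with an ordinary space
q_clean <- gsub("\u00a0", " ", q, fixed = TRUE)
q_clean == z
# [1] TRUE
```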

Convert fullwidth string into halfwidth string

How can I convert Ａb９８７６５４３２１０ into Ab9876543210? Is there a solution by regular expression?
test <- dput("Ａb９８７６５４３２１０")
Disclaimer: The following works on my machine, but since I can't replicate your full width string based purely on the example provided, this is a best guess based on my version of the problem (pasting the string into a text file, saving it with UTF-8 encoding, and loading it in with the encoding specified as UTF-8).
Step 1. Reading in the text (I added a half width version for comparison):
> test <- readLines("fullwidth.txt", encoding = "UTF-8")
> test
[1] "Ａb９８７６５４３２１０" "Ab9876543210"
Step 2. Verifying that the full & half width versions are not equal:
# using all.equal()
test1 <- test[1]
test2 <- test[2]
> all.equal(test1, test2)
[1] "1 string mismatch"
# compare raw bytes
> charToRaw(test1)
[1] ef bb bf ef bc a1 62 ef bc 99 ef bc 98 ef bc 97 ef bc 96 ef bc 95 ef
[24] bc 94 ef bc 93 ef bc 92 ef bc 91 ef bc 90
> charToRaw(test2)
[1] 41 62 39 38 37 36 35 34 33 32 31 30
For anyone interested, if you paste the raw byte version into a utf-8 decoder as hexadecimal input, you'll see that except for letter b (mapped from 62 in the 7th byte), the rest of the letters were formed by 3-byte sequences. In addition, the first 3-byte sequence maps to "ZERO WIDTH NO-BREAK SPACE character", so it's not visible when you print the string to console.
Step 3. Converting from full width to half width using the Nippon package:
library(Nippon)
test1.converted <- zen2han(test1)
> test1.converted
[1] "Ab9876543210"
# If you want to compare against the original test2 string, remove the zero
# width character in front
> all.equal(substring(test1.converted, 2), test2)
[1] TRUE
Here is a base R solution
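The fullwidth forms of the printable ASCII characters occupy U+FF01 to U+FF5E (the rest of the block, up to U+FFEF, holds other characters such as halfwidth katakana), and each maps to its ASCII counterpart by subtracting 0xFEE0 (65248).
x <- "Ａb９８７６５４３２１０"
iconv(x, to = "UTF-8") |>
utf8ToInt() |>
(\(.) ifelse(. >= 0xFF01 & . <= 0xFF5E, . - 65248, .))() |>
intToUtf8()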
[1] "Ab9876543210"
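For a whole character vector, the offset approach could be wrapped up roughly as follows. This is a sketch under a few assumptions: the input can be converted to UTF-8, only the fullwidth ASCII variants (U+FF01 to U+FF5E) should be changed, and any zero-width no-break space (U+FEFF, the byte order mark seen in the readLines() example) should be dropped. to_halfwidth is a hypothetical name, and missing values are not handled.

```r
# Hypothetical helper: fullwidth ASCII variants -> ordinary ASCII
to_halfwidth <- function(x) {
  convert_one <- function(s) {
    cp <- utf8ToInt(s)                       # string -> Unicode code points
    cp <- cp[cp != 0xFEFFL]                  # drop byte order marks
    is_fw <- cp >= 0xFF01L & cp <= 0xFF5EL   # fullwidth ASCII variants only
    cp[is_fw] <- cp[is_fw] - 0xFEE0L         # shift down to U+0021..U+007E
    intToUtf8(cp)                            # code points -> one string
  }
  vapply(enc2utf8(x), convert_one, character(1), USE.NAMES = FALSE)
}

to_halfwidth(c("\ufeff\uff21b\uff19\uff18", "Ab98"))
# [1] "Ab98" "Ab98"
```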

Issue reading downloaded CSV

I am trying to download a CSV from my database online and it appears to be working,
tempCSV <- postForm(myDB_URL, .params=myParameters)
but when I try to read the file,
> dat <- read.csv(textConnection(tempCSV))
I get this error:
Error in textConnection(tempCSV) : invalid 'text' argument
I've tried this too (don't laugh if I'm completely grasping at straws here, it is 2:19am on a Friday night)
> dat <- read.csv(tempCSV)
with this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file' must be a character string or connection
For reference, this is what the data actually looks like:
> tempCSV[1:20]
[1] 43 69 74 79 73 70 61 6e 20 43 6c 61 73 73 20 2d 20 46 69 6e
RStudio says it's raw[225758]
Here's what happens if I print the whole thing, if that helps at all:
> tempCSV
...
[ reached getOption("max.print") -- omitted 215758 entries ]
attr(,"Content-Type")
"application/x-comma-separated-values"

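The raw[225758] that RStudio reports suggests that postForm returned a raw vector, whereas textConnection() and read.csv() expect character data. A minimal sketch of one way around that, assuming the bytes are plain text in an encoding R can read directly, is to convert them with rawToChar() first. The tempCSV below is made-up stand-in data, not the real response.

```r
# Stand-in for the downloaded response: CSV text held as raw bytes
tempCSV <- charToRaw("name,count\nCityspan Class,3\n")

# Convert the raw vector to a single character string
csv_text <- rawToChar(tempCSV)

# read.csv() accepts the text directly through its 'text' argument
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
dat
```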
accessing dropbox files with spaces in path name using the api

This question is related to my previous question, reading raw data in R to be saved as .RData file using the dropbox api.
I am running into problems when my path includes characters that are not URL-safe.
The db.file.name in the previous question is just the path to the relevant file in Dropbox.
However, the path has a space in it along with exclamation marks. I have a feeling that these need to be converted to a relevant format so that the GET request can work, but I'm not sure what the conversion is.
So, continuing from my previous example:
require(httr)
require(RCurl)
db.file.name <- "!! TEST FOLDER/test.RData"
db.app <- oauth_app("db",key="xxxxx", secret="xxxxxxx")
db.sig <- sign_oauth1.0(db.app, token="xxxxxxx", token_secret="xxxxxx")
response <- GET(url=paste0("https://api-content.dropbox.com/1/files/dropbox/",curlEscape(db.file.name)),config=c(db.sig,add_headers(Accept="x-dropbox-metadata")))
The response is an error, and no file is downloaded...using the documentation page https://www.dropbox.com/developers/reference/api it suggests putting the URL into a UTF-8 encoding...which I'm not sure how to do/not sure it works.
Any help would be greatly appreciated.
I was close before...I just needed to re-insert the slashes using gsub in order for the GET request to work... so the result was
response <- GET(url=paste0("https://api-content.dropbox.com/1/files/dropbox/",gsub("%2F","/",curlEscape(db.file.name))),config=c(db.sig,add_headers(Accept="x-dropbox-metadata")))
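An alternative to escaping the whole path and then restoring the slashes is to escape each path segment on its own, so the separators are never touched. The sketch below uses utils::URLencode() from base R rather than curlEscape(); encode_path is a hypothetical helper name, and it assumes the path contains no existing percent-escapes.

```r
# Hypothetical helper: percent-encode each path segment, keep "/" as separator
encode_path <- function(path) {
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  # reserved = TRUE also escapes characters such as "!" and " "
  encoded <- vapply(segments, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  paste(encoded, collapse = "/")
}

encode_path("!! TEST FOLDER/test.RData")
# [1] "%21%21%20TEST%20FOLDER/test.RData"
```

The result can then be pasted onto the base URL in place of the gsub() call.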
A quick copy-paste from ?iconv:
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
charToRaw(xx <- iconv(x, "latin1", "UTF-8"))
[1] 66 61 c3 a7 69 6c 65
Encoding(x)
[1] "latin1"
Encoding(xx)
[1] "UTF-8"
does this answer your question?
