I am trying to analyze bill texts from LegiScan, but am running into problems decoding the text in the API response. It turns out LegiScan encodes the full text of all bills in base64 when they are pulled through its API, and I am having trouble decoding it.
The linked download is an example of the full-text portion of a JSON result that I pulled through the API. However, the usual methods do not seem to work on it.
What I have tried:
LegiScan does not seem to support R directly, so I used the LegiscanR package. I used LegiscanR's BillText function to get the correct JSON link, then used parseBillText to try to decode the text from the link into UTF-8. However, it throws a fromJSON error even with the correct API key and document ID in the link:
Error in fromJSON(content, handler, default.size, depth, allowComments, :
object 'Strict' not found
Using the base64decode (base64enc package) or base64Decode (RCurl package) function to convert the text from base64 to raw, then using rawToChar to convert that into characters.
My code:
text <- base64decode("https://www.dropbox.com/s/5ozd0a1zsb6y9pi/Legiscan_fulltext.txt?dl=0")
rawToChar(text)
Nul <- text == as.raw(00)
text[Nul] <- as.raw(20)
text2 <- rawToChar(text)
However, using rawToChar alone gives me an "embedded nul in string" error:
Error in rawToChar(test2) :
embedded nul in string: '%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<>>>\r\nendobj\r\n2 0 obj\r\n<>\r\nendobj\r\n3 0 obj\r\n<>/ExtGState<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>\r\nendobj\r\n4 0 obj\r\n<>\r\nstream\r\nx\x9c\xb5ZYs\xdb8\022~w\x95\xff\003*O\u0516M\021ཛJ\x95\xe3ę̵\x99\xb1\xa7f\xb7\x92y\xa0$\xca\xe2\x86"\025\036\xf6\xe6\xdfow\003\x94\bR0sh\x93*\x99G\xa3\001|\xdd\xfdu7\xa4\xf9U\xd5d\xebdٰ\xe7\xcf\xe7WM\x93,7銽\x9f\u07d5\xbb\xbf\xe6w\x9fw\xe9\xfc]r\x9f\025I\x93\x95\xc5\xfc\xb6]4\xf8\xe8\x874Y\xa5Ջ\027\xec\xe5\xabk\xf6\xf2\xee\xfcl~\xc3Yl\xc7\
Substituting spaces for these nuls lets rawToChar run, but the output is gibberish, or in some other encoding rather than the expected English text:
[1] "\x86\xdbi\xb3\xff\xf0\xc3\ak\xa2\x96\xe8\xc5\xca&\xfe\xcf\xf9\xa37tk\\xeco\xac\xbd\xa6/\xcbz\b\xacq\xa9\u07faYm{\033m\xc6\xd7e"
Any other ideas on what else to try? Thanks.
I have been dealing with the same problem in Python, where the following code worked:
import base64
raw = base64.b64decode(bill_text['doc'])
with open(output_file, "wb") as f:
    f.write(raw)
I think in your case you are trying to convert the document straight into text, but that is not so easy: the decoded bytes are a PDF. In Python I handled it by parsing the saved PDF file with functions from the PyPDF2 library.
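The decode-then-write-binary step can be sketched in a self-contained way; the payload below is a made-up stand-in for bill_text['doc'], which in real use comes straight out of the parsed JSON response:

```python
import base64

# Hypothetical stand-in for bill_text['doc'] from the LegiScan response;
# a real response carries a full base64-encoded PDF here.
doc_b64 = base64.b64encode(b"%PDF-1.5\n%fake minimal payload").decode("ascii")

raw = base64.b64decode(doc_b64)

# The decoded bytes are a binary PDF, not text -- check the magic header,
# then write with "wb" so no text decoding (and no nul trouble) happens.
assert raw.startswith(b"%PDF")

with open("bill.pdf", "wb") as fh:
    fh.write(raw)
```

Only after the file is saved does a PDF parser (such as PyPDF2) come into play to pull the text out.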
I've seen that since 4.0.0, R supports raw strings using the syntax r"(...)". Thus, I could do:
r"(C:\THIS\IS\MY\PATH\TO\FILE.CSV)"
#> [1] "C:\\THIS\\IS\\MY\\PATH\\TO\\FILE.CSV"
While this is great, I can't figure out how to make this work with a variable, or better yet with a function. See this comment which I believe is asking the same question.
This one can't even be evaluated:
construct_path <- function(my_path) {
r"my_path"
}
Error: malformed raw string literal at line 2
}
Error: unexpected '}' in "}"
Nor this attempt:
construct_path_2 <- function(my_path) {
paste0(r, my_path)
}
construct_path_2("(C:\THIS\IS\MY\PATH\TO\FILE.CSV)")
Error: '\T' is an unrecognized escape in character string starting ""(C:\T"
Desired output
# pseudo-code
my_path <- "C:\THIS\IS\MY\PATH\TO\FILE.CSV"
construct_path(my_path)
#> [1] "C:\\THIS\\IS\\MY\\PATH\\TO\\FILE.CSV"
EDIT
In light of #KU99's comment, I want to add context. I'm writing an R script to be run from the command line using Windows's CMD and Rscript. I want to let the user who executes my script provide an argument saying where the script's output should be written. Since Windows's CMD accepts paths in the format C:\THIS\IS\MY\PATH\TO, I want to be consistent with that format as the input to my script, and then convert that path input into a format that is easy to work with inside R. I thought the r"()" syntax could be a proper solution.
I think you're getting confused about what the raw-string literal syntax does. It just tells the parser not to interpret escape sequences in the characters that follow. For external inputs like typed text or files, none of this matters.
For example, if you run this code
path <- readline("> enter path: ")
You will get this prompt:
> enter path:
and if you type in your (unescaped) path:
> enter path: C:\Windows\Dir
You get no error, and your variable is stored appropriately:
path
#> [1] "C:\\Windows\\Dir"
This is not in any special format that R uses, it is plain text. The backslashes are printed in this way to avoid ambiguity but they are "really" just single backslashes, as you can see by doing
cat(path)
#> C:\Windows\Dir
The raw-string syntax is only useful for shortening what you need to type. There would be no point in trying to make it do anything else, and remember that it is a feature of the R parser: it is not a function, and there is no way to make R apply raw-string syntax dynamically in the way you are attempting. Even if you could, it would be a long way around for a shortcut.
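For comparison, Python has an analogous raw-string literal, and the same distinction holds there: the syntax changes only how a literal is typed in source code, never what is stored.

```python
# Both spellings produce the identical string; raw-string syntax is
# purely a convenience for the person typing the literal.
escaped = "C:\\Windows\\Dir"
raw = r"C:\Windows\Dir"
assert escaped == raw

# A path arriving from outside the program (argv, input(), a file) needs
# no escaping at all: its backslashes are already single, plain characters.
external = "C:\\THIS\\IS\\MY\\PATH"   # as if typed by a user at a prompt
assert external.count("\\") == 4
print(external)
# -> C:\THIS\IS\MY\PATH
```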
This might have been asked and solved before; I just can't find a straightforward answer.
I got the following:
text <- 'Testing to be translated'
Which I am trying to get into JSON format like:
[{"Text": "Testing to be translated"}]
I have tried using toJSON but I could not get that structure.
Additionally, I did some quick-fix:
paste0('[{"Text":"', text, '"}]')
That would work fine; however, some of my strings contain the " and ' characters, which would break this code.
Any input would be helpful.
More context: I am using a GET request to translate text from Azure server, could not use translateR so I am creating my own function.
To create an array, pass jsonlite::toJSON an unnamed list or vector. You should also set auto_unbox=TRUE so that scalars aren't treated as arrays.
text <- 'Testing to be translated'
jsonlite::toJSON(list(list(Text=text)), auto_unbox=TRUE)
# [{"Text":"Testing to be translated"}]
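The same structure in Python, for comparison: a real JSON serializer escapes embedded quotes for you, which is exactly where the paste0 approach breaks (json.dumps here plays the role of toJSON):

```python
import json

# A string containing both quote characters that broke the paste0 version
text = 'Testing "to be" translated, isn\'t it'

payload = json.dumps([{"Text": text}])
print(payload)

# Round-tripping proves the quotes were escaped correctly
assert json.loads(payload)[0]["Text"] == text
```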
I am using google API to translate a sentence. Once translated I use text to speech google API with the result of the translation.
Translation and text to speech work pretty well in general. However, I have a problem with the apostrophes. For example:
1) Translation result: I & # 3 9 ; m tired (note: I had to separate the characters with spaces because the preview renders them as "I'm tired")
2) Text-to-speech result says: "I and hash thirty nine m tired" (or something similar)
What kind of encoding do I need to use in the first step to get the output string right (i.e. "I'm tired")?
The program is in python. I include an extract here:
def tts_translated_text(self, input_text, input_language):
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    input_text = input_text.encode("utf-8")

    # Set the text input to be synthesized
    synthesis_input = texttospeech.types.SynthesisInput(text=input_text)
    voice = texttospeech.types.VoiceSelectionParams(
        language_code=input_language,
        ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.LINEAR16)

    response = client.synthesize_speech(synthesis_input, voice, audio_config)

    # The response's audio_content is binary.
    with open('output.wav', 'wb') as out:
        # Write the response to the output file.
        out.write(response.audio_content)
Thanks in advance,
Ester
I finally found what was wrong. Google Translate API returns the string with HTML encoding. And Google Text-To-Speech expects UTF-8 encoding.
I am forced to use Python 2.7, so I did the following:
translated_text = HTMLParser.HTMLParser().unescape(translated_text_html)
where translated_text_html is the string returned by the translation API invocation.
In Python 3 it would be:
translated_text = html.unescape(translated_text_html)
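A minimal sketch of that fix; the input string here is a made-up example of the entity-encoded text the Translate API returns:

```python
import html

# Hypothetical entity-encoded result from the Translate API
translated_text_html = "I&#39;m tired"

# Unescape the HTML entities before handing the text to Text-to-Speech
translated_text = html.unescape(translated_text_html)
print(translated_text)
# -> I'm tired

assert translated_text == "I'm tired"
```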
I am trying to convert from hex to base64, but the conversions I get from functions like base64Encode or base64_enc do not match the conversion I get from this site https://conv.darkbyte.ru/ or this site http://tomeko.net/online_tools/hex_to_base64.php?lang=en
library(RCurl)
library(jsonlite)
hex_number="9d0a5a7d6771dd7fa321a48a820f93627657df
3292548df1389533913a60328300a9cc80d982875a8d08bb7
602c59935cacae88ea635ed8d3cea9ef57b1884cc"
base64_enc(hex_number)
#"OWQwYTVhN2Q2NzcxZGQ3ZmEzMjFhNDhhODIwZjkzNjI3NjU3ZGYKMzI5M
#jU0OGRmMTM4OTUz\nMzkxM2E2MDMyODMwMGE5Y2M4MGQ5ODI4NzVhOGQwO
#GJiNwo2MDJjNTk5MzVjYWNhZTg4ZWE2\nMzVlZDhkM2NlYTllZjU3YjE4ODRjYw=="
base64Encode(hex_number)
#"OWQwYTVhN2Q2NzcxZGQ3ZmEzMjFhNDhhODIwZjkzNjI3NjU3ZGYKMzI5M
#jU0OGRmMTM4OTUzMzkxM2E2MDMyODMwMGE5Y2M4MGQ5ODI4NzVhOGQwOGJiNwo
#2MDJjNTk5MzVjYWNhZTg4ZWE2MzVlZDhkM2NlYTllZjU3YjE4ODRjYw=="
#desired result:
#nQpafWdx3X+jIaSKgg+TYnZX3zKSVI3xOJUzkTpgMoMAqcyA2YKHWo0Iu3YCxZk1ysrojqY17Y086p71exiEzA==
I have also tried to change the hex to text before converting it to base64, using the code on this page http://blog.entropic-data.com/2017/04/19/short-dealing-with-embedded-nul-in-string-manipulation-with-r/, but I didn't get the result I want.
Borrow some code from the wkb package (or just install and use it directly) to convert the hex string into a raw vector before passing it to one of the base 64 conversion routines:
hex_number <- "9d0a5a7d6771dd7fa321a48a820f93627657df3292548df1389533913a60328300a9cc80d982875a8d08bb7602c59935cacae88ea635ed8d3cea9ef57b1884cc"
I'm "source-ing" this but you should copy the code locally if you plan on using it as GH could go down or the code could change.
source_url("https://raw.githubusercontent.com/ianmcook/wkb/master/R/hex2raw.R",
sha1 = "4443c72fb3831e002359ad564f1f2a8ec5e45e0c")
openssl::base64_encode(hex2raw(hex_number))
## [1] "nQpafWdx3X+jIaSKgg+TYnZX3zKSVI3xOJUzkTpgMoMAqcyA2YKHWo0Iu3YCxZk1ysrojqY17Y086p71exiEzA=="
Or, if you are willing to have the wkb package as a dependency:
openssl::base64_encode(wkb::hex2raw(hex_number))
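The same hex-to-raw-to-base64 pipeline in Python, for comparison (bytes.fromhex playing the role of hex2raw); note the hex string must be joined without the embedded newlines from the original snippet:

```python
import base64

hex_number = ("9d0a5a7d6771dd7fa321a48a820f93627657df"
              "3292548df1389533913a60328300a9cc80d982875a8d08bb7"
              "602c59935cacae88ea635ed8d3cea9ef57b1884cc")

# Convert the hex digits to raw bytes first, then base64-encode the bytes.
# Base64-encoding the hex *string* itself is what produced the wrong output.
raw = bytes.fromhex(hex_number)
print(base64.b64encode(raw).decode("ascii"))
# -> nQpafWdx3X+jIaSKgg+TYnZX3zKSVI3xOJUzkTpgMoMAqcyA2YKHWo0Iu3YCxZk1ysrojqY17Y086p71exiEzA==
```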
I wanted to validate my website, for example with http://validator.w3.org, but I always get the following error:
Sorry, I am unable to validate this document because on line 11 it
contained one or more bytes that I cannot interpret as utf-8 (in other
words, the bytes found are not valid values in the specified Character
Encoding). Please check both the content of the file and the character
encoding indication. The error was: utf8 "\xFC" does not map to
Unicode
Does anybody know where I can locate/get rid of the error?
Open the CSS file with your favorite text editor.
There, switch the encoding to UTF-8.
Go to line 11 and look for strange-looking symbols.
Delete or replace them.
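A quick way to locate such a byte programmatically, sketched in Python; the fragment below is a made-up CSS snippet saved as Latin-1:

```python
# "\xFC" is "ü" in Latin-1/Windows-1252 but is not valid UTF-8 on its own.
data = b"color: gr\xfcn;"   # hypothetical CSS fragment saved as Latin-1

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("bad byte at offset", err.start)   # pinpoints the \xfc
    # -> bad byte at offset 9

# Re-saving the file from its real encoding as UTF-8 fixes the validator error:
fixed = data.decode("latin-1").encode("utf-8")
assert fixed.decode("utf-8") == "color: grün;"
```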