Import JSON file into R with unicode error

I am having issues importing a JSON file into R. Below is the code:
library(rjson)
json_file <- "file_path"
json_data <- fromJSON(file = json_file)
which returns:
Error in fromJSON(file = json_file) : unexpected character '<ef>'
I suspect this is due to unicode / the hashtag, as the JSON file contains hex codes, e.g. #000000. If I remove the # and replace it with %23, the file loads correctly.
How do I properly load the JSON file? I do not want to replace the "#" manually, as I have thousands of these files. One idea would be to use regex to replace # with %23, but I am not sure how to do that.
Does anyone have any suggestions?
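For what it's worth, '<ef>' is typically the first byte of a UTF-8 byte order mark (EF BB BF) at the start of the file, not the # characters, which are perfectly legal inside JSON strings. A minimal sketch that reads the text manually and strips a leading BOM before parsing (assuming the files are UTF-8):

library(rjson)
json_file <- "file_path"
# read the raw text ourselves, then drop a leading BOM (\ufeff) before parsing
txt <- paste(readLines(json_file, encoding = "UTF-8", warn = FALSE), collapse = "\n")
txt <- sub("^\ufeff", "", txt)
json_data <- fromJSON(txt)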

Related

Using readtext to extract text from XML

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. I'm interested in the text within the tag "regtext" in this and other similar XML files.
I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:
regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Start tag expected, '<' not found [4]
I've tried to search the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated. I'm too much of a novice, and the readtext documentation does not give examples of how to work with XML.
Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) :
The xml format does not fit for the extraction without xPath
Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I tried to specify an xPath expression on a single file, and this did not return any errors, but it didn't actually extract any text (even though there should be plenty of text within the "regtext" node):
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")
I end up with a dataframe with the correct doc_id, but no text.
From the error messages, the readtext function appears to be converting the xml file into a plain text document, which the XML package then rejects as invalid.
Also note that XML is case-sensitive, so "regtext" will not match the actual tag name "REGTEXT".
Here is a solution using the xml2 package. (I find this package provides a simpler interface and is easier to use.)
library(xml2)
url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)
# parse out the nodes within the "REGTEXT" sections
regtext <- xml_find_all(page, ".//REGTEXT")
# convert the REGTEXT nodes into a vector of strings
xml_text(regtext)
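To run the same parsing over the locally saved files instead of the URL, a sketch along these lines should work (the directory path is the one from the question):

library(xml2)
filenames <- list.files("/regdata/RegDataValidation", pattern = "\\.xml$", full.names = TRUE)
# parse each file and pull the text out of its REGTEXT sections
regtext_list <- lapply(filenames, function(f) {
  page <- read_xml(f)
  xml_text(xml_find_all(page, ".//REGTEXT"))
})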

Unable to read CSV file saved with encoding "UTF-8-SIG"

I have data that I crawled using Scrapy, which saves it as a CSV file with encoding utf-8-sig. The data contains many different special characters: Korean, Russian, Chinese, Spanish, ..., a star symbol (★), this 🎵, and this 🎄...
So Scrapy can save it, and I can view it in Notepad++ or an app like CSVFileView. But when I load it in R using mydata <- read.csv(<path_to_file>, fileEncoding="UTF-8-SIG", header=FALSE), I get this error:
Error in file(file, "rt", encoding = fileEncoding) :
unsupported conversion from 'UTF-8-SIG' to ''
If I don't specify the encoding, I can load the file, but the symbols become characters like ☠ and the first column header gets prefixed with ï..
Which encoding should I choose to include all characters?
As the input is already encoded as UTF-8, you should use the encoding argument to read the file as-is. Using fileEncoding will attempt to re-encode the file.
mydata <- read.csv(<path_to_file>, encoding="UTF-8", header=FALSE)
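Note that even with encoding = "UTF-8" the BOM bytes can survive as a stray \ufeff at the start of the very first field, so it may be worth stripping it by hand (a sketch; "data.csv" is a stand-in path):

mydata <- read.csv("data.csv", encoding = "UTF-8", header = FALSE)
# drop a leading byte order mark if it ended up in the first cell
mydata[1, 1] <- sub("^\ufeff", "", mydata[1, 1])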

R - cannot read CSV file with JSON string column using data.table and jsonlite because of double backslashes

I have trouble reading a CSV file I exported from a mysql database which contains a column with a JSON string. More concretely, I want to get access to all values in the JSON string. I created a simple example to visualize my problem:
This is my CSV file (test.csv):
"id","code","values"
1,"12b222a","{\"first\": 5, \"second\": 5}"
This is how I read it in R:
library(data.table)
library(jsonlite)
test_data <- fread("test.csv")
When I try
rd <- fromJSON(test_data[,"values"])
I receive the following error message:
Error: Argument 'txt' must be a JSON string, URL or file.
The problem is that when I run
test_data[,"values"]
I receive the following content which contains double backslashes as escape characters:
values
1: {\\"first\\": 5, \\"second\\": 5}
How can I avoid having two backslashes that cause the trouble with fromJSON?
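One way out (a sketch, not necessarily the only fix) is to extract the column as a plain character vector rather than a one-column data.table, strip the literal backslashes that the mysql export left in front of the inner quotes, and only then hand each string to fromJSON:

library(data.table)
library(jsonlite)
test_data <- fread("test.csv")
# each value holds one real backslash before every inner quote; drop them
json_strings <- gsub("\\", "", test_data$values, fixed = TRUE)
parsed <- lapply(json_strings, fromJSON)
parsed[[1]]$first  # should give 5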

Remove special characters in xml file R

I have a few thousand xml files that I would like to read into R. The problem is that some of these files have three special characters "ï»¿" at the beginning of the file, which stop xmlTreeParse from reading the xml file. The error that I get is the following...
Error: 1: Start tag expected, '<' not found
This is due to the first line in the xml file, which is the following...
ï»¿<?xml version="1.0" encoding="utf-8"?>
If I manually remove the characters using notepad, I have this at the beginning of the xml file and I am able to read it...
<?xml version="1.0" encoding="utf-8"?>
I'd like to be able to remove the characters automatically. The following is the code that I have written currently.
library(XML)
filenames <- list.files("...filepath...", pattern = "*.xml", full.names = TRUE)
files <- lapply(filenames, function(f) {
  xmlfile <- tryCatch(xmlTreeParse(file = f), error = function(e) print(f))
  xmltop <- xmlRoot(xmlfile)
  plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  name <- unname(plantcat$EntityNames)
  return(name)
})
I'm wondering how I can read the xml files in by removing the special characters in R. I have tried tryCatch, as you can see above, but I'm not sure how I can edit the xml file without actually reading it in first. Any help would be appreciated!
Edit: Using the following parsing code fixed the problem. I think when I opened the xml file in notepad it displayed nothing unusual, but in reality the file began with the string "ï»¿". It's possible that this was due to the encoding of the file, but I'm not sure of the specifics. Thank you @Prem.
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
The special chars at the beginning might come from a different encoding for the file, especially if your xml contains some special characters.
Try to specify the encoding. To identify which encoding is used, open the file in a hex editor and read the first bytes.
My hunch is that your special chars come from a BOM:
http://unicode.org/faq/utf_bom.html
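A quick way to check from R itself (f is one of the file paths): a UTF-8 BOM shows up as the three bytes ef bb bf.

# inspect the first three bytes of the file
readBin(f, what = "raw", n = 3)
# for a UTF-8 BOM this prints: ef bb bf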
In your code, use readLines to read the file, and then gsub can remove the junk value from the string:
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
Have you tried the gsub function? It is a very convenient function for character replacement (and deletion). This works for me:
gsub("ï»¿", "", string, fixed = TRUE)
on a string variable containing 'ï»¿<?xml version="1.0" encoding="utf-8"?>'.
EDIT: If you are on a GNU/Linux machine, I would also suggest the sed command-line tool. It is a very powerful tool that deals perfectly with this kind of task.

How to convert JSON data from R to not be in one line

I extracted data in R using the twitteR package and searchTwitter. I then convert it to a JSON file. This works, but when I view the JSON in Notepad++ it is one long string. Is there a way to make each tweet, with its specific information, appear separately?
library(twitteR)
library(rjson)
testreal <- searchTwitteR('startup', n = 100, lang = 'en')
testrealdf <- do.call("rbind", lapply(testreal, as.data.frame))
# convert the data frame to a JSON string before writing it out
exportJson <- toJSON(testrealdf)
write(exportJson, file = "testrealdf.json")
json_realdf <- fromJSON(file = "testrealdf.json")
My file in notepad looks like.....
Use jsonlite::toJSON(x=yourDataFrame, pretty=TRUE). From ?toJSON:
pretty: adds indentation whitespace to JSON output. Can be TRUE/FALSE or a number specifying the number of spaces to indent. See prettify.
This is just for cosmetics. The whitespace in JSON is not syntactically meaningful. For a serialization format with syntactically meaningful whitespace, check out R's yaml package, which can also read JSON.
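Applied to the code above, that would look something like this (a sketch; testrealdf is the data frame built from the tweets):

library(jsonlite)
# pretty = TRUE produces indented, human-readable JSON
exportJson <- toJSON(testrealdf, pretty = TRUE)
write(exportJson, file = "testrealdf.json")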
