Wrong encoding with fromJSON of jsonlite library - r

I have converted JSON data from a .json file to an R object using fromJSON() from the jsonlite library, like this:
library(jsonlite)
jsonR<-fromJSON(txt="data.json")
If I explore the string values of the jsonR object, I find some strange character sequences.
For example, if a string value in the original "data.json" was 😩, then R reads it as \xf0\u009f\u0098©. And when I write this value back to a file with cat(), it becomes <f0>.
Can anyone suggest how to keep the correct original encoding during the conversion?

There must be something wrong with your requested URL. If that's not the problem, try one of the following packages instead:
library(RCurl) or library(RJSONIO)
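If the file itself is valid UTF-8, one workaround (a sketch, not tested against your data) is to read the raw text with an explicit encoding before handing it to fromJSON(), so the parser never sees the locale's native encoding. The file written below is a hypothetical stand-in for "data.json":

```r
library(jsonlite)

# Hypothetical stand-in for "data.json": a one-line, pure-ASCII UTF-8 file
# that encodes the emoji as a JSON surrogate pair.
path <- tempfile(fileext = ".json")
writeLines('{"emotion":"\\uD83D\\uDE29"}', path)

# Read the file as UTF-8 explicitly, then parse the resulting string.
txt <- readLines(path, encoding = "UTF-8", warn = FALSE)
jsonR <- fromJSON(paste(txt, collapse = "\n"))
Encoding(jsonR$emotion)
```

On Windows it can also help to check Encoding() on the resulting strings and apply enc2utf8() before writing them back out with cat().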

Related

Error when parsing JSON file into R - how to fix?

Using the package rtweet, I have streamed some tweets and saved them in a JSON file.
When using the following: tweets_df <- parse_stream('file.json'), I get the following error during the process:
Does anyone have any idea how to fix this so that the JSON file can be read into R as a data frame?
Have you tried it this way? I don't personally use rtweet, but I work with JSON files.
#load library to read json
library(jsonlite)
json_data <- fromJSON("db.json")
It reads it as a nested list, but you can then convert it to a data frame using
df <- rlist::list.stack(json_data, fill = TRUE)
You might have to adapt it, for example by using a loop, if your JSON file contains several users.
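Note that files streamed by rtweet are usually newline-delimited JSON (one record per line), which plain fromJSON() chokes on; jsonlite::stream_in() is built for exactly that layout. A sketch with a hypothetical two-record file standing in for 'file.json':

```r
library(jsonlite)

# Hypothetical stand-in for the streamed file: two NDJSON records.
path <- tempfile(fileext = ".json")
writeLines(c('{"user":"a","text":"hi"}',
             '{"user":"b","text":"ho"}'), path)

# stream_in() reads one JSON object per line straight into a data frame.
tweets_df <- stream_in(file(path), verbose = FALSE)
nrow(tweets_df)
```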

Parsing error in jsonlite reading multiple validated json files in R

I'm new to JSON files in R. I got some JSON files for stocks, as you can see, scraped online. The weird thing is that each individual JSON file can be read perfectly using fromJSON() from the jsonlite library. I have also run individual files through jsonlint.com, and the site confirmed they are valid JSON.
I have tried validate() from jsonlite on single files, but there the files failed to pass validation.
However, when I use lapply to read all the files, R gives me an error:
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
No data available for the given
Here are my simple codes:
library(tidyr)
library(jsonlite)
ls <- list.files(pattern = "*.json")
Data.fromJson <- lapply(ls, fromJSON)
Sorry, but I think the problem might be with the data itself, so I don't want to just trim it or make a dummy, as that might mess things up. Instead I have uploaded a few JSON files to my Google Drive; here is the link:
https://drive.google.com/drive/folders/1zM4vj1TIseFKBSiNWe5yMY9BJPg-CIsv?usp=sharing
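One way to narrow this down (a sketch; the file names below are hypothetical) is to run jsonlite::validate() over every file and list the ones the parser rejects, e.g. because the API returned an error body such as "No data available for the given ..." instead of JSON:

```r
library(jsonlite)

# Hypothetical setup: one valid file and one broken "scraped" file.
dir <- tempfile(); dir.create(dir)
writeLines('{"price": 10}', file.path(dir, "good.json"))
writeLines('No data available for the given symbol', file.path(dir, "bad.json"))

files <- list.files(dir, pattern = "\\.json$", full.names = TRUE)

# Keep only files whose full text passes validate().
is_valid <- vapply(files, function(f) {
  isTRUE(validate(paste(readLines(f, warn = FALSE), collapse = "\n")))
}, logical(1))

basename(files[!is_valid])  # the culprits that break lapply()
```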

Problem with encoding of character strings when loading json files to RStudio under Windows 10

I am trying to extract Tweets from json files and save them as RData under Windows 10 and using RStudio version 1.2.5033 and streamR. However, Windows (and RStudio and streamR subsequently) assumes that the input is Windows-1252 although it is UTF-8 which leads to serious encoding issues.
To replicate the problem, please use this json file with two fake Tweets since I could not replicate the structure of the original json files within R. But this structure leads to issues with the only solution I found for the encoding issue (see below).
The code I used is the following:
df <- streamR::parseTweets("test.json")
The output I get with df$text is: '[1] "RT #bkabka:EikÃ¶ tÃ¤mÃ¤" "RT #bkabka:EspaÃ±a"'.
The output should be: '[1] "RT #bkabka:Eikö tämä" "RT #bkabka:España"'.
My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?
Since all this happens because the function wrongly assumes that the text is encoded with Windows-1252, one solution would be to go through the whole corpus and replace all of these wrongly interpreted special characters with the correct ones, for example using the table I found here. In my case, however, the corpus is very large, making this a suboptimal solution in the long run. Additionally, I would have no way to check whether it actually replaced all special characters correctly.
Some additional information:
Using rjson and the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the JSON files, since it only extracts the first line:
lt <- rjson::fromJSON(file="test.json")
I guess it cannot extract the subsequent lines because it does not recognise the line break, which is an actual line break and not \n or any other character combination. Unfortunately, I do not have the possibility to change the JSON files.
The json files were created by another person under macOS using streamR - if I am not mistaken.
The same problem appears using simple R instead of RStudio. The problem does not appear on macOS.
The problem is even more serious when using tweet2r, the only other package I am aware of that allows extracting Tweets from JSON files using R. tweet2r deletes specific special characters such as "¶", so the wrongly interpreted special characters can no longer be replaced with the correct ones.
Thanks to MrFlick (see his comment), here is a solution using jsonlite that results in a very similar data frame structure and reads the encoding correctly:
df <- jsonlite::stream_in(file("~/../Downloads/test.json"))
Just some further information for those used to the luxury of streamR handling tweets who might encounter a similar issue in the future: there are two main differences between the data frames created by parseTweets and stream_in:
parseTweets does not extract data for broken Tweets; stream_in does. Hence, the data frame has more rows when using stream_in but contains the same Tweets.
stream_in creates fewer variables, since some of the columns in its data frame are themselves data frames. This might lead to issues when using the data frame without further transformation; parseTweets does that transformation for you.
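The nested data-frame columns that stream_in leaves behind can be expanded with jsonlite::flatten(), which brings the structure closer to what parseTweets produces. A minimal sketch with made-up fields, not the real tweet schema:

```r
library(jsonlite)

# Hypothetical NDJSON with a nested "user" object per record.
path <- tempfile(fileext = ".json")
writeLines(c('{"text":"hi","user":{"name":"a"}}',
             '{"text":"ho","user":{"name":"b"}}'), path)

df <- stream_in(file(path), verbose = FALSE)
flat <- flatten(df)  # expands the nested user column into user.name etc.
names(flat)
```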

How to get public github JSON into R as a list of lists?

I would like to experiment with some soccer data made publicly available by Statsbomb on their github page. Here's a link to one of the JSONs from their github page:
https://raw.githubusercontent.com/statsbomb/open-data/master/data/events/7298.json
My question is, how can I get this into R? I have tried the following:
httr::content(httr::GET("https://raw.githubusercontent.com/statsbomb/open-data/master/data/events/7298.json"))
however, this simply returns a length-1 character vector with the whole JSON squeezed into the string. I would prefer to get it as a list of lists. How can I do this?
Thanks!
EDIT: here is Statsbomb's public github repo - if it helps at all!
If you want to turn the JSON file into an R object, you'll need to actually parse the data, not just download the file. The jsonlite package makes this easy:
url <- "https://raw.githubusercontent.com/statsbomb/open-data/master/data/events/7298.json"
mydata <- jsonlite::read_json(url)
And then mydata is now a big list with all the parsed values from the JSON object.
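For reference, read_json() keeps everything as nested lists; parse_json() does the same for a JSON string already in memory, which is handy for a quick check without a network call. The tiny event below is made up, not real Statsbomb data:

```r
library(jsonlite)

# A made-up event in the same spirit as the Statsbomb events file.
events <- parse_json('[{"id": 1, "type": {"name": "Pass"}}]')

# Each element of the list is itself a list, so values are reached
# with the usual $ / [[ ]] accessors.
events[[1]]$type$name
```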

How to read Unicode file in R

I have a Unicode (UTF-8) file with 'þ' as the column delimiter, and I'm trying to read it with R's CSV reader as follows:
data <- read.csv(file_name, sep = "þ", encoding = "UTF-8")
In my data frame everything ends up in a single column. Can someone tell me what I'm doing wrong here?
I think your script needs to be encoded as UTF-8 too if you're using non-ASCII characters.
Save your code in, for example, myfile.r and then try this:
source("myfile.r", encoding = "utf-8")
Hopefully your error will go away.
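Another possibility, since 'þ' is two bytes in UTF-8 and read.csv() works best with a single-byte separator: read the lines yourself and split on the delimiter. A sketch using a hypothetical two-line file in place of the real one:

```r
# Hypothetical stand-in for the delimited file: a header row plus one data row,
# with \u00fe ('þ') as the separator.
path <- tempfile(fileext = ".txt")
writeLines(c("a\u00feb\u00fec", "1\u00fe2\u00fe3"), path)

# Read as UTF-8, split each line on the delimiter, rebuild the data frame.
lines <- readLines(path, encoding = "UTF-8")
parts <- strsplit(lines, "\u00fe", fixed = TRUE)
df <- as.data.frame(do.call(rbind, parts[-1]), stringsAsFactors = FALSE)
names(df) <- parts[[1]]
df
```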
