I have the following "stacked JSON" file, example1.json, that I want to work with in R:
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
"Code":[{"event1":"A","result":"1"},…]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
"Code":[{"event1":"B","result":"1"},…]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
"Code":[{"event1":"B","result":"0"},…]}
The records are not comma-separated. The goal is to parse certain fields (or all of them) into an R data.frame or data.table:
Timestamp Usefulness
0 20140101 Yes
1 20140102 No
2 20140103 No
Normally, I would read a JSON file into R as follows:
library(jsonlite)
jsonfile = "example1.json"
foobar = fromJSON(jsonfile)
This however throws a parsing error:
Error: lexical error: invalid char in json text.
[{"event1":"A","result":"1"},…]} {"ID":"1A35B","Timestamp"
(right here) ------^
This is similar to the following question, but in R: multiple Json objects in one file extract by python
EDIT: This file format is called "newline-delimited JSON" (NDJSON).
The three dots ... invalidate your JSON, hence your lexical error.
You can use jsonlite::stream_in() to 'stream in' lines of JSON.
library(jsonlite)
jsonlite::stream_in(file("~/Desktop/examples1.json"))
# opening file input connection.
# Imported 3 records. Simplifying...
# closing file input connection.
# ID Timestamp Usefulness Code
# 1 12345 20140101 Yes A, 1
# 2 1A35B 20140102 No B, 1
# 3 AA356 20140103 No B, 0
Data
I've cleaned your example data to make it valid JSON and saved it to my desktop as ~/Desktop/examples1.json
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes","Code":[{"event1":"A","result":"1"}]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No","Code":[{"event1":"B","result":"1"}]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No","Code":[{"event1":"B","result":"0"}]}
Related
I hope the title makes sense. I will explain a bit here.
I am working with data that comes from a network performance monitoring tool running synthetic transactions (mimicking user activity by making timed and measurable transactions that allow for performance analysis and problem detection). Several of the output fields capture different values, like Header Read Times, TLS Times, etc., for multiple transactions in a single test. These fields have the data separated by commas. When the data is first retrieved from the API and converted from JSON to a tibble, these fields are correctly parsed as:
metrics.HeaderReadTimes
"120,186,191,184,186,182,190,186,192"
"232,310,282,289,354,292,292,293,306"
...
I have also verified that these fields are typed as character when they are imported from the API and stored in the tibble. I even checked this while debugging, just before write_csv() gets called.
However, when I write this data to CSV for storage and then read it back in later, the output of read_csv() shows these fields as if they had been re-typed as double:
metrics.HeaderReadTimes
"1.34202209222205e+26"
"4.17947405424481e+26"
...
I used mutate() to type these fields as.character() on read, but that doesn't seem to fix the issue; it just gives me a double that has been coerced into a character.
I'm beginning to think that the best solution is to change the delimiter in those fields before I call write_csv(), but I'm unsure how to do this in an efficient manner. It's probably something stupidly obvious, and I'm going to keep researching, but I figured it wouldn't hurt to ask...
CSV files do not store any information about column types, which is why you'd want to specify the column type in readr (or alternatively save the data as .Rdata or .RDS).
read_csv("filename.csv",
col_types = cols(metrics.HeaderReadTimes = col_character()))
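Or, as mentioned, skip the CSV round trip entirely and save an .RDS, which preserves column types exactly (a minimal sketch; the object and file names are hypothetical):
saveRDS(metrics_tbl, "metrics.rds")     # column types are stored with the data
metrics_tbl2 <- readRDS("metrics.rds")  # the timing fields come back as character, unchanged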
Here is an alternative that is a little more agnostic to column names:
The issue is that the locale is interpreting the commas as "grouping marks", i.e., thousands separators. We can change that with readr::locale.
Failing:
readr::read_csv(I('a,b\n"1,2",3'))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
# a b
# <dbl> <dbl>
# 1 12 3
Working as intended:
readr::read_csv(I('a,b\n"1,2",3'), locale = readr::locale(grouping_mark = ""))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# chr (1): a
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
# a b
# <chr> <dbl>
# 1 1,2 3
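Applied to the original problem, a hedged sketch (the file name here is hypothetical) is to disable grouping marks when reading the exported file back in, so the comma-separated timing strings stay character:
# read the exported CSV with no grouping mark, so "120,186,..." is not parsed as a number
metrics <- readr::read_csv("metrics.csv",
                           locale = readr::locale(grouping_mark = ""))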
I'm trying to prepare a dataset to use as training data for a deep neural network. It consists of 13 .txt files, each between 500 MB and 2 GB in size. However, when trying to run a "data_prepare.py" file, I get the ValueError in this post's title.
Reading answers from previous posts, I have loaded my data into R and checked both for NaN and infinite numbers, but the commands I used tell me there appears to be nothing wrong with my data. I have done the following:
I load my data as one single dataframe using the magrittr, data.table and purrr packages (there are about 300 million rows, all with 7 variables):
library(magrittr)
library(data.table)
library(purrr)
txt_fread <-
  list.files(pattern = "*.txt") %>%
  map_df(~ fread(.))
I have used sapply to check for finite and NaN values:
>any(sapply(txt_fread, is.finite))
[1] TRUE
> any(sapply(txt_fread, is.nan))
[1] FALSE
I have also tried loading each data frame into a Jupyter notebook and checking individually for those values using the following commands:
file1= pd.read_csv("File_name_xyz_intensity_rgb.txt", sep=" ", header=None)
np.any(np.isnan(file1))
False
np.all(np.isfinite(file1))
True
And when I use print(file1.info()), this is what I get as info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22525176 entries, 0 to 22525175
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 0 float64
1 1 float64
2 2 float64
3 3 int64
4 4 int64
5 5 int64
6 6 int64
dtypes: float64(3), int64(4)
memory usage: 1.2 GB
None
I know the file containing the code (data_prepare.py) works because it runs properly with a similar dataset. I therefore know it must be a problem with the new data I mention here, but I don't know what I have missed or done wrong while checking for NaNs and infinites. I have also tried reading and checking the .txt files individually, but that hasn't helped much either.
Any help is really appreciated!!
Btw: the R code with map_df came from a post by leerssej in How to import multiple .csv files at once?
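As a side note on the checks above: any(sapply(txt_fread, is.finite)) only says that at least one value is finite. A per-column count of non-finite values (a sketch, reusing the txt_fread object from above) is more telling:
# count NA/NaN/Inf per column; all zeros means every value is finite
sapply(txt_fread, function(x) sum(!is.finite(x)))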
I have a data frame which has one column of text with info that I need to extract. Here is one observation from that column; each question has three attributes associated with it: objectives, KeyResults and responsible.
[{"text":"Newideas.","translationKey":"new.question-4","id":4,"objectives":"Great","KeyResults":"Awesome","responsible":"myself"},{"text":"customer focus.","translationKey":"new.question-5","id":5,"objectives":"Goalset","KeyResults":"Amazing","responsible":"myself"}
-------------------------DESIRED OUTPUT -----------------------
Question# Objectives KeyResults responsible Question# Objectives KeyResults responsible
4         Great      Awesome    myself      5         Goalset    Amazing    myself
Your data is valid JSON (but you need a closing square bracket ] on it). You can read JSON into an R object using a JSON parser package (e.g. jsonlite).
Let's say your text is in the column text of data frame df; then this will transform that text into an R data frame:
library(jsonlite)
dat <- fromJSON(df$text)
dat
# text translationKey id objectives KeyResults responsible
# 1 Newideas. new.question-4 4 Great Awesome myself
# 2 customer focus. new.question-5 5 Goalset Amazing myself
You need to install jsonlite to make it work:
install.packages("jsonlite")
From an API I get a Base64-encoded dataset. I use RCurl::base64 to decode it, but it's serialized. How do I convert it to a data frame?
After decoding the return value, I get a long text string with semicolon-separated data and column names. It looks like this:
[1] "\"lfdn\";\"pseudonym\";\"external_lfdn\";\"tester\"\r\n\"50\";\"434444345\";\"0\";\"0\"\r\n\"91\";\"454444748\";\"0\";\"0\"\r\n\
You can see the structure with a simple cat(x):
"lfdn";"pseudonym";"external_lfdn";"tester"
"50";"434444345";"0";"0"
"91";"454444748";"0";"0"
"111";"444444141";"0";"0"
I've tried the obvious unserialize(x), but I get:
R> Error in unserialize(enc) :
R> character vectors are no longer accepted by unserialize()
Whatever I throw at it... I can write the object to disk and read it back in, but I'd prefer to avoid that.
Getting the data from that text string into a data frame with column names would be great!
This should do the trick:
read.table(text=j, header = TRUE, sep = ";")
# lfdn pseudonym external_lfdn tester
# 1 50 434444345 0 0
# 2 91 454444748 0 0
Note: I copied your string from above; it does not contain the last row with 111 in it.
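For completeness, a sketch of the full pipeline, assuming enc holds the Base64 string returned by the API:
library(RCurl)
# decode the Base64 payload into the ";"-separated text shown above
j <- base64Decode(enc)
# then parse that text directly, no temporary file needed
dat <- read.table(text = j, header = TRUE, sep = ";")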
I apologize in advance for the somewhat limited reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and do the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
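One way to dig further (a sketch, not a fix) is to pull readr's parsing-failure table and then re-read a few of the reported rows as plain text to see what write_csv actually put there:
# the same failure table readr warned about above
readr::problems(sa_all2a)
# re-read a few lines near a reported failure without type guessing
# (skip counts raw file lines including the header, so the offset is approximate)
readr::read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv',
                skip = 1535656, n_max = 3, col_names = FALSE,
                col_types = readr::cols(.default = readr::col_character()))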