I am struggling to parse a JSON in R which contains newlines both within character strings and between key/value pairs (and whole objects).
Here's the sort of format I mean:
{
"id": 123456,
"name": "Try to parse this",
"description": "Thought reading a JSON was easy? \r\n Try parsing a newline within a string."
}
{
"id": 987654,
"name": "Have another go",
"description": "Another two line description... \r\n With 2 lines."
}
Say that I have this JSON saved as example.json. I have tried various techniques suggested elsewhere on SO to overcome the parsing problems. None of the following works:
library(jsonlite)
foo <- readLines("example.json")
foo <- paste(readLines("example.json"), collapse = "")
bar <- fromJSON(foo)
bar <- jsonlite::stream_in(textConnection(foo))
bar <- purrr::map(foo, jsonlite::fromJSON)
bar <- ndjson::stream_in(textConnection(foo))
bar <- read_json(textConnection(foo), format = "jsonl")
I gather that this is really NDJSON format, but none of the specialised packages cope with it. Some answers suggest streaming the data in with either jsonlite or ndjson; others suggest mapping the parsing function across lines (or doing the equivalent in base R).
Everything raises one of the following errors:
Error: parse error: trailing garbage
Error: parse error: premature EOF
or else problems opening the text connection.
Does anyone have a solution?
Edit
Knowing that the JSON is wrongly formatted, we lose some of ndjson's efficiency, but I think we can fix it on the fly, assuming the records are clearly delimited: a close-brace (}) followed by nothing or some whitespace (including newlines), followed by an open-brace ({).
fn <- "~/StackOverflow/TomWagstaff.json"
wrongjson <- paste(readLines(fn), collapse = "")
if (grepl("\\}\\s*\\{", wrongjson))
wrongjson <- paste0("[", gsub("\\}\\s*\\{", "},{", wrongjson), "]")
json <- jsonlite::fromJSON(wrongjson, simplifyDataFrame = FALSE)
str(json)
# List of 2
# $ :List of 3
# ..$ id : int 123456
# ..$ name : chr "Try to parse this"
# ..$ description: chr "Thought reading a JSON was easy? \r\n Try parsing a newline within a string."
# $ :List of 3
# ..$ id : int 987654
# ..$ name : chr "Have another go"
# ..$ description: chr "Another two line description... \r\n With 2 lines."
From here, you can continue with
txtjson <- paste(sapply(json, jsonlite::toJSON, pretty = TRUE), collapse = "\n")
(Below is the original answer, hoping/assuming that the format was somehow legitimate.)
Assuming your data is actually like this:
{"id":123456,"name":"Try to parse this","description":"Thought reading a JSON was easy? \r\n Try parsing a newline within a string."}
{"id": 987654,"name":"Have another go","description":"Another two line description... \r\n With 2 lines."}
then it is, as you suspect, ndjson. From that you can do this:
fn <- "~/StackOverflow/TomWagstaff.json"
json <- jsonlite::stream_in(file(fn), simplifyDataFrame = FALSE)
# opening file input connection.
# Imported 2 records. Simplifying...
# closing file input connection.
str(json)
# List of 2
# $ :List of 3
# ..$ id : int 123456
# ..$ name : chr "Try to parse this"
# ..$ description: chr "Thought reading a JSON was easy? \r\n Try parsing a newline within a string."
# $ :List of 3
# ..$ id : int 987654
# ..$ name : chr "Have another go"
# ..$ description: chr "Another two line description... \r\n With 2 lines."
Notice I've not simplified to a data.frame. To get your literal block on the console, do
cat(sapply(json, jsonlite::toJSON, pretty = TRUE), sep = "\n")
# {
# "id": [123456],
# "name": ["Try to parse this"],
# "description": ["Thought reading a JSON was easy? \r\n Try parsing a newline within a string."]
# }
# {
# "id": [987654],
# "name": ["Have another go"],
# "description": ["Another two line description... \r\n With 2 lines."]
# }
If you want to dump it to a file in that way (though nothing in jsonlite or similar will be able to read it back, since it is neither legal ndjson nor legal JSON as a whole file), then you can
txtjson <- paste(sapply(json, jsonlite::toJSON, pretty = TRUE), collapse = "\n")
and then save that with writeLines or similar.
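For instance, a minimal sketch of that last step (the output filename is just a placeholder):
writeLines(txtjson, "TomWagstaff_pretty.json")  # human-readable, but no longer valid ndjson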
Related
I am reading several SAS files from a server and loading them all into a list in R. I removed one of the datasets because I didn't need it in the final analysis (dataset #31).
mylist <- list.files("path", pattern = ".sas7bdat")
mylist <- mylist[-31]
Then I used lapply to read all the datasets in the list (mylist) at the same time:
read.all <- lapply(mylist, read_sas)
The code works well. However, when I run View(read.all) to see the datasets, I can only see a number (e.g., 1, 2, etc.) instead of the names of the original datasets.
Does anyone know how I can keep the names of the datasets in the final list?
Also, can anyone tell me how I can work with this list in R?
Is it an object? Can I read one of the datasets in the list? And how can I join some of the datasets in the list?
Use basename and tools::file_path_sans_ext:
filenames <- head(list.files("~/StackOverflow", pattern = "^[^#].*\\.R", recursive = TRUE, full.names = TRUE))
filenames
# [1] "C:\\Users\\r2/StackOverflow/1000343/61469332.R" "C:\\Users\\r2/StackOverflow/10087004/61857346.R"
# [3] "C:\\Users\\r2/StackOverflow/10097832/60589834.R" "C:\\Users\\r2/StackOverflow/10214507/60837843.R"
# [5] "C:\\Users\\r2/StackOverflow/10215127/61720149.R" "C:\\Users\\r2/StackOverflow/10226369/60778116.R"
basename(filenames)
# [1] "61469332.R" "61857346.R" "60589834.R" "60837843.R" "61720149.R" "60778116.R"
tools::file_path_sans_ext(basename(filenames))
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
somedat <- setNames(lapply(filenames, readLines, n=2),
tools::file_path_sans_ext(basename(filenames)))
names(somedat)
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
str(somedat)
# List of 6
# $ 61469332: chr [1:2] "# https://stackoverflow.com/questions/61469332/determine-function-name-within-that-function/61469380" ""
# $ 61857346: chr [1:2] "# https://stackoverflow.com/questions/61857346/how-to-use-apply-family-instead-of-nested-for-loop-for-my-problem?noredirect=1" ""
# $ 60589834: chr [1:2] "# https://stackoverflow.com/questions/60589834/add-columns-to-data-frame-based-on-function-argument" ""
# $ 60837843: chr [1:2] "# https://stackoverflow.com/questions/60837843/how-to-remove-all-parentheses-from-a-vector-of-string-except-whe"| __truncated__ ""
# $ 61720149: chr [1:2] "# https://stackoverflow.com/questions/61720149/extracting-the-original-data-based-on-filtering-criteria" ""
# $ 60778116: chr [1:2] "# https://stackoverflow.com/questions/60778116/how-to-shift-data-by-a-factor-of-two-months-in-r" ""
Each "name" is the character representation of (in this case) the stackoverflow question number, with the ".R" removed. (And since I typically include the normal URL as the first line then an empty line in the files I use to test/play and answer SO questions, all of these files look similar at the top two lines.)
I feel this must be an easy issue, but my search fu is failing me, so your assistance is very welcome, and apologies if it is indeed answered elsewhere.
I'm working with JSON data from a REST API (specifically GitHub data for pull requests), which contains nested arrays (in this case, the comments on the PRs, which in turn nest other things, like the data for the comment author). I use jsonlite::fromJSON to parse this, and I get a data.frame with nested sets of lists and data.frames. Here's a cut-down example of a single row (PR):
jsn = '[
{
"pr":123,
"comment_total":2,
"comments":[
{
"user":{"name":"Me Myself","username":"me"},
"body":"comment 1"
},
{
"user":{"name":"Me Myself","username":"me"},
"body":"comment 2"
}
]
}
]'
This represents a single pull request with two comments on it. If I load this with jsonlite, I get 1 row as expected:
> df = jsonlite::fromJSON(jsn)
> str(df)
'data.frame': 1 obs. of 3 variables:
$ pr : int 123
$ comment_total: int 2
$ comments :List of 1
..$ :'data.frame': 2 obs. of 2 variables:
.. ..$ user:'data.frame': 2 obs. of 2 variables:
.. .. ..$ name : chr "Me Myself" "Me Myself"
.. .. ..$ username: chr "me" "me"
.. ..$ body: chr "comment 1" "comment 2"
I'd like to unwrap the first level of this comments column, so that I get one row per PR comment, but I'm struggling to do so. What I'm aiming for is something like:
pr comment_total comments.user comments.body
1 123 2 <data.frame> comment 1
2 123 2 <data.frame> comment 2
I thought tidyr::unnest() would deal with this, but it doesn't seem to like the nested data.frames:
> unnest(df)
Error in bind_rows_(x, .id) :
Argument 1 can't be a list containing data frames
I also looked at purrr::map_dfr, which outputs rows, but I can't seem to get that right either. I'm using it to access the data.frame directly, but it's still unhappy:
> map_dfr(df,.id="comments", `[[`,1)
Error in bind_rows_(x, .id) :
Argument 3 can't be a list containing data frames
I'm sure I'm missing something obvious, but I can't see it. Can someone enlighten me? Thanks!
EDIT: The code I'm using to get the data from GitHub looks like below - if there are better ways to query this, I'm interested.
library(httr)
base_url = 'https://api.github.com/repos/ansible/ansible'
# `pr` comes from a loop, e.g. pr = 38508
issue_url = paste0(base_url,'/issues/',pr,'/comments')
# api_user and api_key are my GitHub credentials
i_resp <- GET(issue_url, authenticate(api_user,api_key))
issue_comments = as.tibble(
jsonlite::fromJSON(
content(i_resp,as="text"),
flatten = TRUE
)
)
One way is to tell jsonlite not to be as helpful as it usually is and then do the unwinding yourself. (Much like XML processing, heavily nested output from custom API endpoints tends to need some domain/API knowledge to get the data into rectangles.)
library(dplyr) # for as_data_frame(), mutate(), select(), and the pipe
library(purrr) # for map_df() and map_chr()
jsonlite::fromJSON(
  txt = jsn,
  simplifyVector = FALSE,
  simplifyDataFrame = FALSE,
  flatten = FALSE
) %>%
  map_df(~{ # in case there is more than one PR
    dplyr::as_data_frame(.x) %>%
      mutate(
        body = map_chr(comments, ~.x$body),
        username = map_chr(comments, ~.x$user$username),
        name = map_chr(comments, ~.x$user$name)
      ) %>%
      select(-comments)
  })
## # A tibble: 2 x 5
##      pr comment_total body      username name
##   <int>         <int> <chr>     <chr>    <chr>
## 1   123             2 comment 1 me       Me Myself
## 2   123             2 comment 2 me       Me Myself
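As an aside: newer, vctrs-based releases of tidyr (1.0 and later) cope with nested data frames, so a shorter route may work there. A sketch, assuming a current tidyr rather than the version in the question:
library(tidyr)
df %>%
  unnest(comments) %>%           # one row per comment
  unpack(user, names_sep = ".")  # spreads the user data.frame into user.name / user.username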
I have a dataframe with the following variables:
doc_id text URL author date forum
When I run
samplecorpus <- Corpus(DataframeSource(sampledataframe))
the documentation says I should get a corpus with all of the extra variables added as document-level metadata.
https://rdrr.io/rforge/tm/man/DataframeSource.html
http://finzi.psych.upenn.edu/R/library/tm/html/DataframeSource.html
Instead, I get a corpus that has all of the right documents in the right order, but all of their metadata is blank. I need this metadata to filter the documents for future analysis.
Someone else asked a similar question, but it never got answered...
tm package DataframeSource() ignores my other columns as metadata (about the readTabular() replacement in newer tm versions)
Does anyone have any ideas on how to fix this?
Thanks!
The documentation for tm explains this if you dig down (see ??tm::DublinCore). From the docs:
A corpus has two types of metadata. Corpus metadata ("corpus") contains corpus specific metadata in form of tag-value pairs. Document level metadata ("indexed") contains document specific metadata but is stored in the corpus as a data frame. Document level metadata is typically used for semantic reasons (e.g., classifications of documents form an own entity due to some high-level information like the range of possible values) or for performance reasons (single access instead of extracting metadata of each document). The latter can be seen as a form of indexing, hence the name "indexed". Document metadata ("local") are tag-value pairs directly stored locally at the individual documents.
DataframeSource automatically assigns only the corpus metadata*. For example, see what the following prints:
library(tm)
data <- data.frame(doc_id = c(234345345, 1299),
text = c("The Prince and the Pauper",
"Little Women"),
author = c('Mark Twain', 'Louisa May Alcott'),
date = c(1881, 1868),
stringsAsFactors = FALSE)
samplecorpus <- Corpus(DataframeSource(data))
meta(samplecorpus)
# Or even
meta(samplecorpus[1], tag = 'author')
In order to assign metadata at the document level, you can work with meta to change tags. Bizarrely, this only works if you use VCorpus. So changing the above slightly, you can do:
samplecorpus <- VCorpus(DataframeSource(data))
# Can now set document metadata tags
meta(samplecorpus[[1]], tag = 'author') <- 'Mark Twain'
*EDIT:
Contemplating further (and responding to OP's comment), I agree that the documentation is not a completely accurate description of the package's observed behavior. The quoted documentation above refers to three levels (corpus, indexed document level, and local document level), which in my example appear to correspond to samplecorpus, samplecorpus[1], and samplecorpus[[1]], respectively. If this is correct, then the metadata is being assigned by DataframeSource at the promised level (if somewhat vaguely, as the docs never specify which document level). However, the docs also claim the indexed document-level metadata is stored as a data frame and the local as tag-value pairs, but both are stored as lists. Confusing. I can only conclude that this is either a bug in the package implementation or an error in the docs.
Barring contacting the package authors to clear this up (not a bad idea), I would propose the following workaround:
samplecorpus <- VCorpus(DataframeSource(data))
transfer_metadata <- function(x, i, tag){
return(meta(x[i], tag=tag)[[tag]])
}
tags <- colnames(data)
tags <- tags[! tags %in% c('doc_id', 'text')]
for (i in seq_along(samplecorpus)){
for (tag in tags){
meta(samplecorpus[[i]], tag=tag) <- transfer_metadata(samplecorpus, i=i, tag=tag)
}
}
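A quick check that the transfer worked (expected values based on the example data above):
meta(samplecorpus[[1]], tag = 'author')
# [1] "Mark Twain"
meta(samplecorpus[[2]], tag = 'author')
# [1] "Louisa May Alcott"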
You have to check whether everything is loaded correctly. I made an example docs data.frame so you can see how it works. I used the same column names you have and added one extra (tags). Based on this example you can check whether you have an issue somewhere.
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
text = c("This is a text.", "This another one."),
url = c("https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r",
"https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r"),
author = c("Emi", "Emi"),
date = as.Date(c("2018-09-20", "2018-09-21")),
forum = c("stackoverflow", "stackoverflow"),
tags = c("r", "tm"),
stringsAsFactors = T)
# use Corpus or VCorpus
my_corpus <- Corpus(DataframeSource(docs))
meta(my_corpus)
url author date
1 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-20
2 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-21
forum tags
1 stackoverflow r
2 stackoverflow tm
my_index <- meta(my_corpus, "tags") == "r"
inspect(my_corpus[my_index])
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 5
Content: documents: 1
doc_1
This is a text.
Now beware: there is a difference in how meta is treated. If you do str(my_corpus), you will see the following:
List of 2
$ doc_1:List of 2
..$ content: chr "This is a text."
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2018-09-21 08:55:44"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "doc_1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ doc_2:List of 2
......
The meta info you see here is from meta(my_corpus, type = "local"). The metadata loaded with DataframeSource is of type "indexed": meta(my_corpus, type = "indexed").
Page 5 of the tm vignette is important to read and experiment with, to see all the different options that meta and DublinCore offer.
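For example, to compare the metadata types directly:
meta(my_corpus, type = "indexed")   # the DataframeSource columns, stored as a data frame
meta(my_corpus, type = "corpus")    # corpus-level tag-value pairs
meta(my_corpus[[1]])                # per-document "local" tags: author, id, language, ...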
I'm scraping data from a large online database (GBIF), which requires three steps: (1) match a GBIF "key" identifier to a species name, (2) send a query to the database, getting a download key ("res") in return, and (3) download, import, and filter the data associated with that species. I've written a function for each of these (not including the actual code here, since it's unfortunately very long and requires login credentials):
get_gbif_key <- function(species) {}
get_gbif_res <- function(gbifkey) {}
get_gbif_dat <- function(gbifres) {}
I have a list of several hundred species to which I want to apply these three functions in order. I know they work individually, but I can't figure out how to feed them into each other (probably using purrr?) and reference the correct inputs from the nested outputs of the previous function.
So, for example:
> testlist <- c('Gadus morhua','Caretta caretta')
> testkey <- map(testlist, get_gbif_key)
> testkey
[[1]]
[1] 8084280
[[2]]
[1] 8894817
Here's where I'm stuck. I want to feed the keys in this list structure into the next function, but I don't know how to properly reference them using map or other functions. I can do it by manually creating a new list for the next function:
> testlist2 <- c('8084280','8894817')
> testres <- map(testlist2, get_gbif_res)
> testres
[[1]]
<<gbif download>>
Username: XXXX
E-mail: XXXX@gmail.com
Download key: 0001342-180412121330197
[[2]]
<<gbif download>>
Username: XXXX
E-mail: XXXX@gmail.com
Download key: 0001343-180412121330197
EDIT: the structure of this output may be posing a problem here. When I run listviewer::jsonedit(testres), it just looks like a normal nested list with entries 0 and 1 holding the two download keys. However, when I run str(testres), I get the following:
> str(testres)
List of 2
$ :Class 'occ_download' atomic [1:1] 0001342-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
$ :Class 'occ_download' atomic [1:1] 0001343-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
And, again, for the third one:
> testlist3 <- c('0001342-180412121330197','0001343-180412121330197')
> testdat <- map(testlist3, get_gbif_dat)
This successfully loads a list object with the desired data into R (it has two unnamed elements, 0 and 1, each of which is a list of the 28 requested variables for each species). Any advice for scripting this get_gbif_key %>% get_gbif_res %>% get_gbif_dat workflow in a way that unpacks the preceding list structures correctly?
Here's what you should try based on the evidence provided so far. Basically, the results suggest you should be able to succeed with nested map-ping:
library(purrr)
yourData <- map(
  unlist(                           # to make the same class as your single-function version
    map(
      map(testlist, get_gbif_key),  # returns gbif keys
      get_gbif_res)),               # returns gbif res objects
  get_gbif_dat)                     # returns the data items
The last item that you showed the structure for is just a list of atomic character vectors with some extra attributes, and your functions seem to handle that without difficulty, so mapping should succeed.
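Equivalently, written as a pipeline so each stage reads left to right (same assumptions about your three functions; the as.character() coercion is only needed if get_gbif_res expects a plain key string):
library(purrr)
testdat <- testlist %>%
  map(get_gbif_key) %>%                       # species name -> GBIF key
  map(~ get_gbif_res(as.character(.x))) %>%   # key -> download object
  map(get_gbif_dat)                           # download object -> data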
I have a problem when importing .csv file into R. With my code:
t <- read.csv("C:\\N0_07312014.CSV", na.string=c("","null","NaN","X"),
header=T, stringsAsFactors=FALSE,check.names=F)
R reports an error and does not do what I want:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I guess the problem is that my data is not well formatted. I only need the data in columns 1:32; all other columns should be deleted.
Data can be downloaded from:
https://drive.google.com/file/d/0B86_a8ltyoL3VXJYM3NVdmNPMUU/edit?usp=sharing
Thanks so much!
Open the .csv as a text file (for example, use TextEdit on a Mac) and check to see if columns are being separated with commas.
csv is "comma separated vectors". For some reason when Excel saves my csv's it uses semicolons instead.
When opening your csv use:
read.csv("file_name.csv",sep=";")
Semi colon is just an example but as someone else previously suggested don't assume that because your csv looks good in Excel that it's so.
That's one wonky CSV file: multiple headers tossed about (try pasting it into CSV Fingerprint to see what I mean).
Since I don't know the data, it's impossible to be sure the following produces accurate results for you, but it involves using readLines and other R functions to pre-process the text:
# use readLines to get the data
dat <- readLines("N0_07312014.CSV")
# I had to do this to avoid grep errors caused by the file's encoding
Sys.setlocale('LC_ALL','C')
# filter out the repeating, and wonky headers
dat_2 <- grep("Node Name,RTC_date", dat, invert=TRUE, value=TRUE)
# turn that vector into a text connection for read.csv
dat_3 <- read.csv(textConnection(paste0(dat_2, collapse="\n")),
header=FALSE, stringsAsFactors=FALSE)
str(dat_3)
## 'data.frame': 308 obs. of 37 variables:
## $ V1 : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ V2 : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ V3 : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ V4 : chr "" "" "" "" ...
## .. more
## $ V36: chr "" "" "" "" ...
## $ V37: chr "0" "0" "0" "0" ...
# grab the headers
headers <- strsplit(dat[1], ",")[[1]]
# how many of them are there?
length(headers)
## [1] 32
# limit it to the 32 columns you want (which matches the header count above)
dat_4 <- dat_3[,1:32]
# and add the headers
colnames(dat_4) <- headers
str(dat_4)
## 'data.frame': 308 obs. of 32 variables:
## $ Node Name : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ RTC_date : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ RTC_time : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ N1 Bat (VDC) : chr "" "" "" "" ...
## $ N1 Shinyei (ug/m3): chr "" "" "0.23" "null" ...
## $ N1 CC (ppb) : chr "" "" "null" "null" ...
## $ N1 Aeroq (ppm) : chr "" "" "null" "null" ...
## ... continues
If you only need the first 32 columns and you know how many columns there are in total, you can set the remaining columns' classes to "NULL" so read.csv drops them.
read.csv("C:\\N0_07312014.CSV", na.string=c("","null","NaN","X"),
header=T, stringsAsFactors=FALSE,
colClasses=c(rep("character",32),rep("NULL",10)))
If you do not want to code up each colClass but you like the guesses read.csv makes, then just save that csv and open it again.
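That save-and-reopen round trip could look something like this (a sketch; t stands for whatever cleaned data frame you end up with, and the filename is a placeholder):
write.csv(t, "N0_clean.csv", row.names = FALSE)
t2 <- read.csv("N0_clean.csv", stringsAsFactors = FALSE)  # read.csv now guesses column types afresh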
Alternatively, you can skip the header, name the columns yourself, and remove the misbehaving rows.
A <- read.csv("N0_07312014.CSV",
              header = FALSE, stringsAsFactors = FALSE,
              colClasses = c(rep("character", 32), rep("NULL", 5)),
              na.strings = c("", "null", "NaN", "X"))
Yournames <- as.character(A[1, ])
names(A) <- Yournames
yourdata <- unique(A)[-1, ]
The code above assumes you do not want any duplicate rows. Alternatively, you can remove just the rows whose first entry equals the first column name; a quick sketch of that follows.
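That alternative could be a one-liner on the objects above (a sketch):
# keep only rows whose first field is not a repeat of the first column name
yourdata <- A[A[[1]] != Yournames[1], ]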
Try read.table() instead of read.csv().
I was also facing the same issue; it's now solved. Just use header = FALSE:
mydata <- read.csv("data.csv", header = FALSE)
I had the same problem. When I opened my data in a text editor, the double (decimal) values were written with semicolons; you should replace those with periods.
I was having this error, caused by multiple rows of metadata at the top of the file. I was able to use read.csv by setting skip = to skip over those rows.
data <- read.csv('/blah.csv', skip = 3)
For me, the solution was using read.csv2() instead of read.csv().
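read.csv2() is read.csv() with European-locale defaults, sep = ";" and dec = ",", which matches how Excel writes CSVs in many locales; for example:
mydata <- read.csv2("file_name.csv", stringsAsFactors = FALSE)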
read.csv("file_name.csv", header=F)
Setting the HEADER to be FALSE will do the job perfectly for you...