I am trying to merge multiple JSON files into one data frame, and despite trying all the approaches I've found on SO, it keeps failing.
The files provide sensor data. The stages I've completed are:
1. Unzip the files - produces json files saved as '.txt' files
2. Remove the old zip files
3. Parse the '.txt' files to remove some bugs in the content - random 3
letter + comma combos at the start of some lines, e.g. 'prm,{...'
I've got code which will turn them into data frames individually:
stream <- stream_in(file("1.txt"))
flat <- flatten(stream)
df_it <- as.data.frame(flat)
But when I put it into a function:
df_loop <- function(x) {
stream <- stream_in(x)
flat <- flatten(stream)
df_it <- as.data.frame(flat)
df_it
}
And then try to run through it:
df_all <- sapply(file.list, df_loop)
I get:
Error: Argument 'con' must be a connection.
Then I've tried to merge the json files with rbind.fill and merge to no avail.
Not really sure where I'm going so terribly wrong so would appreciate any help.
You need a small change in your function. Change to -
stream <- stream_in(file(x))
Explanation
Start by analyzing your original implementation -
stream <- stream_in(file("1.txt"))
The "1.txt" here is the file path, which is passed as an input parameter to the file() function. A quick ?file will tell you that it is a
Function to create, open and close connections, i.e., “generalized
files”, such as possibly compressed files, URLs, pipes, etc.
Now if you do a ?stream_in() you will find that it is a
function that implements line-by-line processing of JSON data over a
connection, such as a socket, url, file or pipe
Keyword here being socket, url, file or pipe.
Your file.list is just a list of file paths, character strings to be specific. But in order for stream_in() to work, you need to pass in a connection object, which is the output of the file() function, which in turn takes the file path as a string input.
Chaining that together, you needed to do stream_in(file("/path/to/file.txt")).
Once you do that, your sapply iterates over each path, creates the connection and passes it as input to stream_in().
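Putting it together, here is a minimal sketch of the whole workflow (assuming file.list is a character vector of paths; lapply is used instead of sapply so each result stays a data frame, and plyr::rbind.fill, which you already mentioned, combines them):
library(jsonlite)
library(plyr)

df_loop <- function(x) {
  stream <- stream_in(file(x))   # wrap the path in file() to create a connection
  flat <- flatten(stream)
  as.data.frame(flat)
}

df_list <- lapply(file.list, df_loop)   # one data frame per file
df_all <- rbind.fill(df_list)           # stack them, filling missing columns with NA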
Hope that helps!
I am using purrr::walk to read multiple Excel files and it fails. I have 3 questions:
(1) I used the list.files function to get the list of Excel files in one folder, but the returned values also included the subfolders. I tried setting the recursive= and include.dirs= parameters, but it didn't work.
setwd(file_path)
files<-as_tibble(list.files(file_path,recursive=F,include.dirs=F)) %>%
filter(str_detect(value,".xlsx"))
files
(2) When I used the following piece of code, it can run without any error or warning message, but there is no returned data.
###read the excel data
file_read <- function(value1) {
print(value1)
file1<-read_excel(value1,sheet=1)
}
walk(files$value,file_read)
When I used the following, it worked. Not sure why.
test<-read_excel(files$value,sheet=1)
(3) In Q2, actually I want to create file1 to file6, suppose there are 6 excel files. How can I dynamically assign the dataset name?
list.files has a pattern argument where you can specify what kind of files you are looking for, which lets you skip the filter(str_detect(value,".xlsx")) step. Also, list.files only returns the files in the main directory (file_path) and not its subdirectories unless you specify recursive = TRUE.
library(readxl)
setwd(file_path)
files <- list.files(pattern = '\\.xlsx')
In the function you need to return the object.
file_read <- function(value1) {
data <- read_excel(value1,sheet=1)
return(data)
}
Now you can use map/lapply to read the files.
result <- purrr::map(files,file_read)
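As for question (3), rather than creating file1 to file6 as separate objects, it is usually easier to keep them in a named list; a sketch, assuming files is the vector of file names from above (the "some_file_name" element is only a hypothetical example):
result <- purrr::map(files, file_read)
names(result) <- tools::file_path_sans_ext(files)   # name each element after its file
result[["some_file_name"]]                          # access one file's data by name
# or, if the sheets share columns, combine everything into one data frame:
all_data <- dplyr::bind_rows(result, .id = "source_file")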
I have 500+ .json files that I am trying to get a specific element out of. I cannot figure out why I cannot read more than one at a time..
This works:
library(jsonlite)
files<-list.files('~/JSON')
file1<-fromJSON(readLines('~/JSON/file1.json'),flatten=TRUE)
result<-as.data.frame(source=file1$element$subdata$data)
However, regardless of using different JSON packages (e.g. RJSONIO), I cannot apply this to the entire contents of files. The error I continue to get is...
Attempt to run the same code as a function over all contents of the file list:
for (i in files) {
fromJSON(readLines(i),flatten = TRUE)
as.data.frame(i)$element$subdata$data}
My goal is to loop through all 500+ and extract the data and its contents. Specifically, if the file has the element 'subdata$data', I want to extract the list and put them all in a dataframe.
Note: files are being read as ASCII (Windows OS). This does not have a negative effect on single extractions, but for the loop I get 'invalid character bytes'.
Update 1/25/2019
Ran the following but returned errors...
files<-list.files('~/JSON')
out<-lapply(files,function (fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in file(i): object 'i' not found
Also updated the function, this time with UTF-8 errors...
files<-list.files('~/JSON')
out<-lapply(files,function (i,fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in parse_con(txt,bigint_as_char):
lexical error: invalid bytes in UTF8 string. (right here)------^
Latest Update
I think I found a solution to the crazy 'bytes' problem. When I run readLines on the .json file, I can then apply fromJSON(),
e.x.
json<-readLines('~/JSON')
jsonread<-fromJSON(json)
jsondf<-as.data.frame(jsonread$element$subdata$data)
#returns a dataframe with the correct information
Problem is, I cannot apply readLines to all the files within the JSON folder (PATH). If I can get help with that, I think I can run...
files<-list.files('~/JSON')
for (i in files){
a<-readLines(i)
o<-fromJSON(file(a),flatten=TRUE)
as.data.frame(i)$element$subdata}
Needed Steps
apply readLines to all 500 .json files in JSON folder
apply fromJSON to files from step.1
create a data.frame that returns entries if list (fromJSON) contains $element$subdata$data.
Thoughts?
Solution (Workaround?)
Unfortunately, fromJSON still runs into trouble with the .json files. My guess is that my GET method (httr) is unable to wait/delay and load the 'pretty print', and thus is grabbing the raw .json, which in turn gives odd characters and, as a result, the ubiquitous '------^' error. Nevertheless, I was able to put together a solution, please see below. I want to post it for future folks that may have the same problem with .json files not working nicely with any R JSON package.
#keeping the same 'files' variable as earlier
raw_data<-lapply(files,readLines)
dat<-do.call(rbind,raw_data)
dat2<-as.data.frame(dat,stringsAsFactors=FALSE)
#check to see json contents were read-in
dat2[1,1]
library(tidyr)
dat3<-separate_rows(dat2,sep='')
x<-unlist(raw_data)
x<-gsub('[[:punct:]]', ' ',x)
#Identify elements wanted in original .json and apply regex
y<-regmatches(x,regexec('.*SubElement2 *(.*?) *Text.*',x))
for loops never return anything, so you must save all valuable data yourself.
You call as.data.frame(i) which is creating a frame with exactly one element, the filename, probably not what you want to keep.
(Minor) Use fromJSON(file(i),...).
Since you want to capture these into one frame, I suggest something along the lines of:
out <- lapply(files, function(fn) {
o <- fromJSON(file(fn), flatten = TRUE)
as.data.frame(o)$element$subdata$data
})
allout <- do.call(rbind.data.frame, out)
### alternatives:
allout <- dplyr::bind_rows(out)
allout <- data.table::rbindlist(out)
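If some of the 500+ files do not contain $element$subdata$data, the corresponding entries in out will be NULL; a small sketch of dropping those before binding (assuming the out list from above):
out <- Filter(Negate(is.null), out)   # keep only files that actually had the element
allout <- dplyr::bind_rows(out)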
I'm working on a project for school that requires me to combine ~600 JSON files into one CSV file. I have minimal coding knowledge in R, and I keep getting errors that I can't resolve, probably due to my minimal knowledge. Here's the code I'm using:
filenames <- list.files(pattern="*.json")
myJson <- lapply(filenames, function(x) fromJSON(file=x))
This returns a list of the JSON contents of all my files (hooray), and it's where things break down. If I use:
myJson <- toJSON(myJson)
to try converting all my list of JSON data into one JSON, I get this error:
Error in toJSON(myJson) : unable to escape string. String is not utf8
If I use unlist(myJson), I lose all the columns and get a useless single column of all my data. Any assistance would be hugely appreciated! Thank you.
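One possible route, sketched under the assumption that jsonlite is the package in use and that each file parses to a structure as.data.frame can handle (deeply nested files may need extra flattening), is to convert each parsed file to a data frame and bind them, rather than converting the whole list back to JSON:
library(jsonlite)
library(dplyr)

filenames <- list.files(pattern = "*.json")
# parse each file and coerce it to a data frame
myData <- lapply(filenames, function(x) as.data.frame(fromJSON(x, flatten = TRUE)))
combined <- bind_rows(myData)                         # stack, filling missing columns with NA
write.csv(combined, "combined.csv", row.names = FALSE)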
I'm writing a loop script which involves reading a file from a workbook (using the package XLConnect). The challenge is that the file names contain characters (representing time) that I want to ignore.
For example, here are 3 paths to those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the time.
Is there a way to tell R to ignore some characters located in the file path? Here is how I was thinking of writing my script (the "???" is where I need help). I know a loop is probably not the most efficient way, but I'm open to suggestions, should you have any:
require(XLConnect)
path= "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
scheduleList = rep(list(matrix(1,1,1)),length(wd.seq))
for(i in 1:length(wd.seq)) {
wb = loadWorkbook(file= paste0(path,wd.seq[i],"???",".xlsx"))
scheduleList[[i]] = readWorksheet(wb,sheet='=SCHEDULE', header = TRUE)
}
Thanks for reading and suggestions, if any.
Mathieu
I don't know if this is helpful, but if you want to read all the files in a certain directory (which it seems to me is what you're after), you can read all the filenames into a list using the list.files() function, for example
fileList <- list.files("G://User//Documents//daily_data//")
And then load the xlsx files looping through the list with a for loop
for(i in fileList) {
loadWorkbook(file = i)
}
I haven't used the XLConnect functions before, so that exact code probably doesn't work, but the loop will iterate through all the files in that directory, so you can construct your loading call using the i variable for the filename (it won't be an absolute path, though, so you might need to use paste to add the first part of the file path).
I realize there might be other files in the directory that are not Excel files; you could use grepl to select only files containing "Op_Schedule_":
fileListClean <- fileList[grepl("Op_Schedule_",fileList)]
or perhaps only selecting .xlsx files in the directory:
fileListClean <- fileList[grepl(".xlsx",fileList)]
Edit to fit your reply:
Since you need to fit it to a sequence, you can do it as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to only pick files starting with those prefixes:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: The last part i got from this SO answer: grep using a character vector with multiple patterns
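Alternatively, just as a sketch assuming every file follows the Op_Schedule_YYYYMMDD_HHMMSS.xlsx naming shown in the question, you could let list.files do the matching with a regular expression so the time part is ignored entirely:
require(XLConnect)

dir <- "G://User//Documents//daily_data//"
# match Op_Schedule_ + 8-digit date + _ + 6-digit time + .xlsx
files <- list.files(dir, pattern = "^Op_Schedule_\\d{8}_\\d{6}\\.xlsx$", full.names = TRUE)

scheduleList <- vector("list", length(files))
for (i in seq_along(files)) {
  wb <- loadWorkbook(files[i])
  scheduleList[[i]] <- readWorksheet(wb, sheet = '=SCHEDULE', header = TRUE)   # sheet name kept from the original loop
}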
I have an Rscript that is reading in a constant stream of data in the form of a flat file. Another script picks up this flat file, does some parsing and processing, then saves the result as a data.frame in RDS format. It then sleeps, and repeats the process.
saveRDS(tmp.df, file="H:/Documents/tweet.df.rds") #saving the data.frame
On the second... nth iteration, I have the code only process the new lines added to the flat file since the previous iteration. However, in order to append the delta lines to the permanent data frame, I have to read it in, append, and then save it back out, overwriting the original.
df2 <- readRDS("H:/Documents/tweet.df.rds") #read in permanent
tmp.df2 <- rbind(df2, tmp.df) #append new to existing
saveRDS(tmp.df2, file="H:/Documents/tweet.df.rds") #save it
rm(df2) #housecleaning
rm(tmp.df2) #housecleaning
This approach is risky because whenever the RDS is open for reading/writing, another process wanting to touch that file has to wait. As the base file gets bigger, the risk increases.
Is there something like an appendRDS (I know literally there isn't) that can achieve what I want- iterative updating of a single data frame- saved to a file- that uses appending rather than complete replacement?
I think you can safeguard your process by using connections, opening and closing it before the next process takes over.
con <- file("tmp.rds")
open(con)
df <- readRDS(con)
df.new <- rbind(df,df)
saveRDS(df.new, con)
close(con)
Update:
You can test if a connection to the file is open and tell it to wait for a bit if you're having problems with concurrency.
while (isOpen(con)) { # untested but something of this nature should work
  Sys.sleep(2)
}
Is there anything wrong with using a series of numbered RDS files in a directory instead of a single RDS file? I don't think it is possible to append to a data frame in an RDS file without rewriting the entire file, since data frames are simply lists of columns, so presumably they are serialized one column at a time and only the last column ends near the end of the file.
If you want to stick with a single file but minimize the risk of reading inconsistent data from a RDS file, you can read it in, do the append operation, and then write it out to a temp file and rename the temp file to the original name once it is finished. Then at least your period of risk is not dependent on the size of the file. I'm not familiar with what kind of atomicity is guaranteed by various filesystems when renaming a file to an existing name, but it's probably better than the time taken by saveRDS.
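A minimal sketch of that write-to-temp-then-rename idea, reusing the tweet.df.rds path and tmp.df object from the question:
df2 <- readRDS("H:/Documents/tweet.df.rds")          # read the permanent copy
tmp.df2 <- rbind(df2, tmp.df)                        # append the new rows
tmp.path <- "H:/Documents/tweet.df.rds.tmp"
saveRDS(tmp.df2, file = tmp.path)                    # write the full result to a temp file
file.rename(tmp.path, "H:/Documents/tweet.df.rds")   # swap it in; only the rename is the risky window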