Zarr open() returns FSPathExistNotDir error

When I run zarr.open('result.zarr', mode='r') I get the following error:
FSPathExistNotDir: path exists but is not a directory: %r
According to the example in the Zarr documentation at https://zarr.readthedocs.io/en/stable/tutorial.html#persistent-arrays, zarr.open() should return a zarr.core.Array:
z2 = zarr.open('data/example.zarr', mode='r')
np.all(z1[:] == z2[:])
How come the zarr.open() function is looking for a directory in my case?

I see my confusion. In my case, example.zarr is the name of a file (it seems I named it wrongly), not a directory.
I was also confused because zarr.open() creates a Zarr array; it does not only open an existing one, as the function name implies.
From kaggle.com/kneroma/zarr-files-and-l5kit-data-for-dummies:
z1 = zarr.open('data/example.zarr', mode='w', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
z1
The array above will store its configuration metadata and all
compressed chunk data in a directory called ‘data/example.zarr’
relative to the current working directory.
The zarr.convenience.open() function provides a convenient way to
create a new persistent array or continue working with an existing
array.
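To make the failure mode concrete, here is a minimal sketch (reusing the tutorial's 'data/example.zarr' path): with mode='w', zarr.open() creates a directory store on disk, so reopening that same path succeeds; pointing zarr.open() at an ordinary file is what triggers FSPathExistNotDir.
import numpy as np
import zarr

# mode='w' creates the store: a directory named 'data/example.zarr'
z1 = zarr.open('data/example.zarr', mode='w', shape=(10000, 10000),
               chunks=(1000, 1000), dtype='i4')
z1[:] = 42

# reopening the same path works because it is a directory
z2 = zarr.open('data/example.zarr', mode='r')
print(np.all(z1[:] == z2[:]))  # True

# an ordinary file at the path reproduces the error above
open('result.zarr', 'w').close()      # a plain file, not a store
# zarr.open('result.zarr', mode='r')  # raises FSPathExistNotDir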

Related

Not able to access certain JSON properties in Autoloader

I have a JSON file that is loaded by two different Autoloaders.
One uses schema evolution and, apart from replacing spaces in the JSON property names, writes the JSON directly to a Delta table; I can see that all the values are there.
In the second one I am mapping to a defined schema and only use a subset of the properties, so I use a lot of withColumn calls and then a select to narrow down to my defined column list.
Autoloader definition:
df = (spark
    .readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('multiLine', 'true')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.includeExistingFiles', 'true')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .option('cloudFiles.inferColumnTypes', 'true')
    .option('pathGlobFilter', '*.json')
    .load(upload_path)
    .transform(lambda df: remove_spaces_from_columns(df))
    .withColumn(...
Writer:
df.writeStream.format('delta') \
    .queryName(al_stream_name) \
    .outputMode('append') \
    .option('checkpointLocation', checkpoint_path) \
    .option('mergeSchema', 'true') \
    .trigger(once=True) \
    .table(bronze_table)
The issue is that some of the source columns load fine and I get their values, while others are constantly null in the output table.
For example:
.withColumn('vl_rating', col('risk_severity.value')) # works
.withColumn('status', col('status.name')) # always null
...
.select(
    'rating',
    'status',
    ...
The JSON is quite simple: these are all string values and they are always populated. The same code works against another similar JSON file in another Autoloader without issue.
I have run out of ideas for fault-finding this. My imports are minimal, and outside of Autoloader the JSON loads fine, e.g.:
%python
import pyspark.sql.functions as psf
jsontest = spark.read.option('inferSchema','true').json('dbfs:....json')
df = jsontest.withColumn('status', psf.col('status.name')).select('status')
display(df)
This displays the values of the status.name property of the JSON file.
Any ideas would be greatly appreciated.
I have found, in general terms, what is causing this. Interesting cause!
I am scanning a whole directory of JSON files, and the schema evolves over time (as expected). But when I clear out the Autoloader schema and checkpoint directories and only scan the latest JSON file, it all works correctly.
So I surmise that something in schema evolution with the older JSON files puts Autoloader into a state where it will not pass certain properties through the stream to the writer.
If anyone has a recommendation on how to implement some data quality analysis in an Autoloader, I would be most appreciative if you would share.
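One way to surface this, as a minimal sketch (it assumes the default rescued-data column, _rescued_data, which rescue mode populates whenever a value does not fit the evolved schema; bronze_schema and upload_path are the variables from the reader above):
from pyspark.sql import functions as F

rescued = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .load(upload_path)
    # keep only the rows where something failed to parse into the schema
    .filter(F.col('_rescued_data').isNotNull())
    # tag each rescued row with the file it came from
    .select(F.input_file_name().alias('source_file'), '_rescued_data'))
Streaming rescued to a quarantine table with the same writeStream pattern as above gives a simple, queryable data-quality log.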

How to get the name of the originally run file in Julia

I'd like to create a log file that is named after the originally run Julia file; for example, for julia foo.jl I'd want foo.jl. How can I get this information from within a Julia session?
The global constant PROGRAM_FILE is set to the script name.
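For example (a minimal sketch; note that PROGRAM_FILE is an empty string in an interactive REPL session, so this only works when the code runs as a script):
# foo.jl -- run as `julia foo.jl`
println(basename(PROGRAM_FILE))                        # prints "foo.jl"
logfile = first(splitext(basename(PROGRAM_FILE))) * ".log"
println(logfile)                                       # prints "foo.log"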
This can also be done by inspecting the stack:
# first get the top of the stack
f = stacktrace()[1]
# then get the file's name as a string; note that the path is absolute
abs_filename = String(f.file)
println(abs_filename)
# to get only the filename, use
println(basename(abs_filename))

Return one folder above current directory in Julia

In Julia, I can get the current directory from
@__DIR__
For example, when I run the above in the "Current" folder, it gives me
"/Users/jtheath/Dropbox/Research/Projects/Coding/Current"
However, I want it to return one folder above the present folder; i.e.,
"/Users/jtheath/Dropbox/Research/Projects/Coding"
Is there an easy way to do this in a Julia script?
First, please note that @__DIR__ generally expands to the directory of the current source file (it does, however, return the current working directory if there are no source files involved, e.g. when run from the REPL). In order to reliably get the current working directory, you should rather use pwd().
Now to your real question: I think the easiest way to get the path to the parent directory would be to simply use dirname:
julia> dirname("/Users/jtheath/Dropbox/Research/Projects/Coding/Current")
"/Users/jtheath/Dropbox/Research/Projects/Coding"
Note that AFAIU this only uses string manipulation and does not care whether the paths involved actually exist in the filesystem (which is why the example above works on my system although I do not have the same filesystem structure as you). dirname is also relatively sensitive to the presence or absence of a trailing slash (which shouldn't be a problem if you feed it something that comes directly from pwd() or @__DIR__).
I sometimes also use something like this, in the hope that it might be more robust when I want to work with paths that actually exist in the filesystem:
julia> curdir = pwd()
"/home/francois"
julia> abspath(joinpath(curdir, ".."))
"/home/"

Standard ML / NJ: Loading in file of functions

I'm trying to write a bunch of functions in an SML file and then load them into the interpreter. I've been googling and came across this:
http://www.smlnj.org/doc/interact.html
Which has this section:
Loading ML source text from a file
The function use: string -> unit interprets its argument as a file name relative to sml's current directory and loads the text from that file as though it had been typed in. This should normally be executed at top level, but the loaded files can also contain calls of use to recursively load other files.
So I have a test.sml file in my current directory. I run sml, all good so far. Then I try use test.sml; and I get:
stdIn:1.6-1.14 Error: unbound structure: test in path test.sml
Not sure why this isn't working. Any ideas?
Thanks,
bclayman
As you mentioned, the function use has type string -> unit. This means it takes a string and returns unit. When you type use test.sml, you are not giving it a string: SML parses test.sml as a qualified identifier (the name sml inside a structure test), which is why the error complains about an unbound structure. You need to write use "test.sml" (notice the quotes).
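For example, in an sml session (a sketch; the declarations echoed back depend on what test.sml actually defines, and here it is assumed to define a function double):
- use "test.sml";
[opening test.sml]
val double = fn : int -> int
val it = () : unit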

How to create a new output file in R if a file with that name already exists?

I am trying to run an R script using Windows Task Scheduler, which runs it every two hours. The script gathers some tweets through the Twitter API and runs a sentiment analysis that produces two graphs and saves them in a directory. The problem is that when the script runs again, it replaces the already existing files of that name in the directory.
As an example, when I used the pdf("file") function, it ran fine the first time, as no file with that name already existed in the directory. The problem is that I want the R script to run every other hour, so I need a solution that creates a new file in the directory instead of replacing the existing one, just like what happens when a file is downloaded multiple times in Google Chrome.
I'd just time-stamp the file name (now() here comes from lubridate; base R's Sys.time() works the same way):
> filename = paste("output-",now(),sep="")
> filename
[1] "output-2014-08-21 16:02:45"
Use any of the standard date formatting functions to customise to taste - maybe you don't want spaces and colons in your file names:
> filename = paste("output-",format(Sys.time(), "%a-%b-%d-%H-%M-%S-%Y"),sep="")
> filename
[1] "output-Thu-Aug-21-16-03-30-2014"
If you want the behaviour of adding a number to the file name, then something like this:
serialNext = function(prefix){
  if(!file.exists(prefix)){return(prefix)}
  i = 1
  repeat {
    f = paste(prefix, i, sep=".")
    if(!file.exists(f)){return(f)}
    i = i + 1
  }
}
Usage. First, "foo" doesn't exist, so it returns "foo":
> serialNext("foo")
[1] "foo"
Write a file called "foo":
> cat("fnord",file="foo")
Now it returns "foo.1":
> serialNext("foo")
[1] "foo.1"
Create that, then it returns "foo.2" and so on...
> cat("fnord",file="foo.1")
> serialNext("foo")
[1] "foo.2"
This kind of thing can break if more than one process might be writing a new file, though: if both processes check at the same time, there is a window of opportunity in which neither sees "foo.2" and both think they can create it. The same thing can happen with timestamps if two processes try to write new files at the same time.
Both these issues can be resolved by generating a random UUID and pasting it onto the filename; otherwise you need something that is atomic at the operating-system level.
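A sketch of the UUID variant (it assumes the third-party uuid package is installed):
library(uuid)   # provides UUIDgenerate()
filename = paste0("output-", UUIDgenerate())
# collisions between two processes are vanishingly unlikely,
# so there is no race on the name itself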
But for a twice-hourly job I reckon a timestamp down to minutes is probably enough.
See ?files for the file manipulation functions. You can check whether a file exists with file.exists, and then either rename the existing file or pick a different name for the new one, as sketched below.
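A minimal sketch of that approach (the outfile name and the .bak suffix are just for illustration):
outfile = "sentiment.pdf"
if (file.exists(outfile)) {
  # move the previous run's output aside before writing the new one
  stamp = format(Sys.time(), "%Y%m%d-%H%M%S")
  file.rename(outfile, paste0(outfile, ".", stamp, ".bak"))
}
pdf(outfile)
# ... draw the graphs ...
dev.off()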
