I have a spreadsheet in which the cells in question show a date format of mm-dd-yyyy when I open it in Excel.
When I run this file through PHP Excel reader (xlsx file), it does not recognize that the cells contain dates.
I opened the file with the MS Open XML SDK, and the styles show the following numFmts:
numFmtId="102" formatCode="mm-dd-yyyy"
numFmtId="104" formatCode="mm-dd-yyyy"
numFmtId="106" formatCode="mm-dd-yyyy"
numFmtId="108" formatCode="mm-dd-yyyy"
numFmtId="110" formatCode="mm-dd-yyyy"
numFmtId="112" formatCode="mm-dd-yyyy"
numFmtId="114" formatCode="mm-dd-yyyy"
numFmtId="116" formatCode="mm-dd-yyyy"
numFmtId="118" formatCode="mm-dd-yyyy"
The values would only convert to dates after I added
self::$_builtInFormats[102] = 'mm-dd-yyyy';
self::$_builtInFormats[104] = 'mm-dd-yyyy';
self::$_builtInFormats[106] = 'mm-dd-yyyy';
self::$_builtInFormats[108] = 'mm-dd-yyyy';
self::$_builtInFormats[110] = 'mm-dd-yyyy';
self::$_builtInFormats[112] = 'mm-dd-yyyy';
self::$_builtInFormats[114] = 'mm-dd-yyyy';
self::$_builtInFormats[116] = 'mm-dd-yyyy';
self::$_builtInFormats[118] = 'mm-dd-yyyy';
to NumberFormat.php.
Is this supposed to be the case?
MS Excel uses format codes 0 to 163 for "built-in" formats, but there are a lot of unused entries in this set, and nothing is officially defined for format codes 102 to 118. The codes listed as built-in in the PHPExcel PHPExcel_Style_NumberFormat class are all the main built-ins, only omitting a few special localised formats for Chinese, Korean, Thai and Japanese.
However, this restriction doesn't prevent a lot of naughty homebrew xlsx writers from using "reserved" id values that aren't actually defined in the ISO specification.
Typically, these values should be defined in the /xl/styles.xml file in a block looking like:
<numFmts count="2">
<numFmt formatCode="mm-dd-yyyy" numFmtId="102"/>
<numFmt formatCode="mm-dd-yyyy" numFmtId="104"/>
...
</numFmts>
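If you want to check what a given workbook actually declares, you can read that block yourself. Below is a minimal Python sketch (the file name is a placeholder) that lists the declared numFmt entries, relying only on the fact that an xlsx file is a zip archive:

import zipfile
import xml.etree.ElementTree as ET

NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# An xlsx file is just a zip archive; the number formats live in xl/styles.xml
with zipfile.ZipFile("workbook.xlsx") as xlsx:  # "workbook.xlsx" is a placeholder
    styles = ET.fromstring(xlsx.read("xl/styles.xml"))

for fmt in styles.findall(".//s:numFmts/s:numFmt", NS):
    print(fmt.get("numFmtId"), fmt.get("formatCode"))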
While the latest production release of PHPExcel adheres to the ISO standard and ignores any values below 164 unless they are explicitly defined in the formal specification (i.e. it uses only values in the built-in list), a change has been made in the 1.8 branch to "tolerate" this misuse of the standard: the code will read user-defined number format values below 164 unless they override a value defined in the standard.
Related
I'm trying to move a parquet file, written out in R using the arrow library, to BigTable. I have validated the arrow package installation and made sure that the snappy codec is available using codec_is_available("snappy").
For some reason, in the third step of the workflow, I run into the following error:
Error message from worker: java.lang.RuntimeException:
org.apache.beam.sdk.util.UserCodeException:
org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 1 in file
ReadableFile{
metadata=Metadata{
resourceId=gs://mybucket/userdata_2.parquet,
sizeBytes=85550,
isReadSeekEfficient=true,
checksum=null,
lastModifiedMillis=0}, compression=UNCOMPRESSED}
It is unclear to me why it gives this error, but also why it says compression=UNCOMPRESSED. The file has been compressed with snappy.
I have tried to change the arrow version from 1.0 to 2.0, and have tried to change compression codecs, including uncompressed (even though the uncompressed format does not seem to be supported by Google Dataflow). The error stays the same.
Using a utility like parquet-tools gives no indication that there is anything wrong with the files I'm uploading.
Is there any special requirement for the parquet format in Google Dataflow that I'm missing here? I've iterated through the codecs available to me in the arrow package to no avail.
I was also seeing this error when trying to use my own pyarrow-generated parquets with the parquet_to_bigtable dataflow template.
The issue boiled down to schema mismatches. While the data in the parquet matched the expected format perfectly, and printing known-good and my own versions showed the exact same contents, parquets contain additional metadata that describes the schema, like so:
➜ ~ parq my_pyarrow_generated.parquet -s
# Schema
<pyarrow._parquet.ParquetSchema object at 0x12d7164c0>
required group field_id=-1 schema {
optional binary field_id=-1 key;
optional group field_id=-1 cells (List) {
repeated group field_id=-1 list {
optional group field_id=-1 item {
optional binary field_id=-1 family (String);
optional binary field_id=-1 qualifier;
optional double field_id=-1 timestamp;
optional binary field_id=-1 value;
}
}
}
}
I knew this schema probably wasn't precisely what they use themselves, so to get an understanding of how far off I was from what was needed, I used the inverse template bigtable_to_parquet to get a sample parquet file that has the correct metadata encoded within it:
➜ ~ parq dataflow_bigtable_to_parquet.parquet -s
# Schema
<pyarrow._parquet.ParquetSchema object at 0x1205c6a80>
required group field_id=-1 com.google.cloud.teleport.bigtable.BigtableRow {
required binary field_id=-1 key;
required group field_id=-1 cells (List) {
repeated group field_id=-1 array {
required binary field_id=-1 family (String);
required binary field_id=-1 qualifier;
required int64 field_id=-1 timestamp;
required binary field_id=-1 value;
}
}
}
As seen, the schemas are very close, but not exact.
With this, though, we can build a simple workaround. It's crude, but I'm still actively debugging this right now and this is what finally worked.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the reference parquet (from bigtable_to_parquet) to reuse its schema
bigtable_schema_parquet = pq.read_table(pa.BufferReader(bigtable_to_parquet_file_bytes))
keys = []
cells = []
# ... populate keys and cells from your own data ...
df = pd.DataFrame({'key': keys, 'cells': cells})
# Force the known-good schema instead of letting pyarrow infer one
table = pa.Table.from_pandas(df, schema=bigtable_schema_parquet.schema)
tl;dr: Use the bigtable_to_parquet dataflow template to get a sample parquet that has the schema that the parquet_to_bigtable input must use. Then load that schema in memory and pass it to from_pandas to override whatever schema it would have otherwise inferred.
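From there (with pyarrow.parquet imported as pq, as in the sketch above), the re-schemed table can simply be written back out and handed to the template; the file name here is just a placeholder:

pq.write_table(table, "my_bigtable_input.parquet")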
It has been confusing for me what the difference is between the version-valid-for number (offset 92) and the file change counter (offset 96) in the database file header.
The entries at offsets 92 and 96 were added in a later version of the SQLite library.
When an older version modifies the file, it will change the change counter (offset 24), but not adjust the version-valid-for number or the write library version number. So the library version number is no longer correct, because a different version last wrote to the file.
The version-valid-for number allows a new library to detect this case: if the change counter and the version-valid-for number do not match, then the write library version number is outdated, and must be ignored.
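For illustration, here is a minimal Python sketch (the database path is a placeholder) that reads the relevant 4-byte big-endian integers straight from the 100-byte file header:

import struct

# The first 100 bytes of a SQLite database file are the header.
with open("example.db", "rb") as f:  # "example.db" is a placeholder path
    header = f.read(100)

change_counter = struct.unpack(">I", header[24:28])[0]     # file change counter (offset 24)
version_valid_for = struct.unpack(">I", header[92:96])[0]  # version-valid-for number (offset 92)
write_version = struct.unpack(">I", header[96:100])[0]     # write library version number (offset 96)

# If change_counter != version_valid_for, the write library version is stale.
print(change_counter, version_valid_for, write_version)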
When you create a timestamp column in Spark and save it to parquet, you get a 12-byte integer column type (int96); I gather the data is split into 8 bytes for nanoseconds within the day and 4 bytes for the Julian day.
This does not conform to any parquet logical type. The schema in the parquet file does not, then, give an indication of the column being anything but an integer.
My question is, how does Spark know to load such a column as a timestamp as opposed to a big integer?
The semantics are determined based on the metadata. We'll need some imports:
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
example data:
val path = "/tmp/ts"
Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
.withColumn("ts", $"ts".cast("timestamp"))
.write.mode("overwrite").parquet(path)
and Hadoop configuration:
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
Now we can access Spark metadata:
ParquetFileReader
.readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
.get(0)
.getParquetMetadata
.getFileMetaData
.getKeyValueMetaData
.get("org.apache.spark.sql.parquet.row.metadata")
and the result is:
String = {"type":"struct","fields":[
{"name":"id","type":"integer","nullable":false,"metadata":{}},
{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
Equivalent information can be stored in the Metastore as well.
According to the official documentation this is used to achieve compatibility with Hive and Impala:
Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
and can be controlled using spark.sql.parquet.int96AsTimestamp property.
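For example, a minimal PySpark sketch setting the flag explicitly when building the session (the app name is arbitrary):

from pyspark.sql import SparkSession

# Interpret INT96 parquet values as timestamps (this is the default)
spark = (SparkSession.builder
         .appName("int96-demo")
         .config("spark.sql.parquet.int96AsTimestamp", "true")
         .getOrCreate())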
We have recently upgraded to OE RDBMS version 11.3 from 9.1D. While generating reports, I found that the value of a field comes out as 2'239,00 instead of 2,239.00. I checked the format; it is >,>>>,>>9.99.
What could be the reason behind this?
The admin installing the database didn't do their homework and selected the wrong default numeric and decimal separators.
However, no great harm done:
Set these startup parameters:
-numsep 44 -numdec 46
This is a simplified database startup example with the parameters added as above:
proserve /db/db -H dbserver -S dbservice -numsep 44 -numdec 46
When you install Progress you are prompted for the numeric format to use. That information is then written to a file called "startup.pf" which is located in the install directory (C:\Progress\OpenEdge by default on Windows...)
If you picked the wrong numeric format you can edit startup.pf with any text editor. It should look something like this:
#This is a placeholder startup.pf
#You may put any global startup parameters you desire
#in this file. They will be used by ALL Progress modules
#including the client, server, utilities, etc.
#
#The file dlc/prolang/locale.pf provides examples of
#settings for the different options that vary internationally.
#
#The directories under dlc/prolang contain examples of
#startup.pf settings appropriate to each region.
#For example, the file dlc/prolang/ger/german.pf shows
#settings that might be used in Germany.
#The file dlc/prolang/ger/geraus.pf gives example settings
#for German-speaking Austrians.
#
#Copy the file that seems appropriate for your region or language
#over this startup.pf. Edit the file to meet your needs.
#
# e.g. UNIX: cp /dlc/prolang/ger/geraus.pf /dlc/startup.pf
# e.g. DOS, WINDOWS: copy \dlc\prolang\ger\geraus.pf \dlc\startup.pf
#
# You may want to include these same settings in /dlc/ade.pf.
#
#If the directory for your region or language does not exist in
#dlc/prolang, please check that you have ordered AND installed the
#International component. The International component provides
#these directories and files.
#
-cpinternal ISO8859-1
-cpstream ISO8859-1
-cpcoll Basic
-cpcase Basic
-d mdy
-numsep 44
-numdec 46
Changes to the startup.pf file are GLOBAL -- they impact all sessions started on this machine. If you only want to change a single session then you can add the parameters to the command line (or the shortcut icon's properties), to a local .pf file, or to an ini file being used by that session.
You can also programmatically override the format in your code by using the SESSION system handle:
assign
session:numeric-decimal-point = "."
session:numeric-separator = ","
.
display 123456.999.
(You might want to consider saving the current values and restoring them if this is a temporary change.)
(You can also use the shorthand session:numeric-format = "american". or "european" for the two most common cases.)
I need to import the Geonames database (http://download.geonames.org/export/dump/) into SQLite (file is about a gigabyte in size, ±8,000,000 records, tab-delimited).
I'm using the built-in SQLite capabilities of Mac OS X, accessed through Terminal. All goes well until record 381174 (tested with an older file; the exact number varies slightly depending on the version of the Geonames database, as it is updated every few days), where the error "expected 19 columns of data but found 18" is displayed.
The exact line causing the problem is:
126704 Gora Kyumyurkey Gora Kyumyurkey Gora Kemyurkey,Gora
Kyamyar-Kup,Gora Kyumyurkey,Gora Këmyurkëy,Komur Qu",Komur
Qu',Komurkoy Dagi,Komūr Qū’,Komūr Qū”,Kummer Kid,Kömürköy Dağı,kumwr
qwʾ,كُمور
قوء 38.73335 48.24133 T MT AZ AZ 00 0 2471 Asia/Baku 2014-03-05
I've tested various countries separately, and the western countries all completely imported without a problem, causing me to believe the problem is somewhere in the exotic characters used in some entries. (I've put this line into a separate file and tested with several other database-programs, some did give an error, some imported without a problem).
How do I solve this error, or are there other ways to import the file?
Thanks for your help and let me know if you need more information.
Regarding the question title, a preliminary search resulted in
the GeoNames format description ("tab-delimited text in utf8 encoding")
https://download.geonames.org/export/dump/readme.txt
some libraries (untested):
Perl: https://github.com/mjradwin/geonames-sqlite (+ autocomplete demo JavaScript/PHP)
PHP: https://github.com/robotamer/geonames-to-sqlite
Python: https://github.com/commodo/geonames-dump-to-sqlite
GUI (mentioned by #charlest):
https://github.com/sqlitebrowser/sqlitebrowser/
The SQLite tools have import capability as well:
https://sqlite.org/cli.html#csv_import
It looks like a bi-directional text issue. "كُمور قوء" is expected to be at the end of the comma-separated alternate name list. However, on account of it being dextrosinistral (or RTL), it's displaying on the wrong side of the latitude and longitude values.
I don't have visibility of your import method, but it seems likely to me that that's why it thinks a column is missing.
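One quick way to check whether a given line really is malformed is to count the tab-separated fields directly, bypassing any display issues; a small Python sketch (the file name is a placeholder):

with open("allCountries.txt", encoding="utf-8") as f:  # placeholder file name
    for lineno, line in enumerate(f, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 19:
            print(lineno, len(fields))  # flag any line that does not have 19 fields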
I found the same problem using the script from the geonames forum here: http://forum.geonames.org/gforum/posts/list/32139.page
Despite adjusting the script to run on Mac OS X (Sierra 10.12.6), I was getting the same errors. But thanks to the script author, since it helped me get the SQLite database file created.
After a little while I decided to use DB Browser for SQLite (version 3.11.2) rather than continue with the script.
I had errors with this method as well and found that I had to set the "Quote character" setting in the import dialog to the blank state. Once that was done, the import from the FULL allCountries.txt file ran to completion, taking just under an hour on my MacBook Pro (an old one, but with an SSD).
Although I have not dug in deeper, I am assuming that the geonames text files must not be quote-parsed in any way. Each line simply needs to be handled as tab-delimited UTF-8 strings.
At the time of writing allCountries.txt is 1.5GB with 11,930,517 records. SQLite database file is just short of 3GB.
Hope that helps.
UPDATE 1:
Further investigation has revealed that it is indeed due to the embedded quotes in the geonames files; looking here: https://sqlite.org/quirks.html#dblquote shows that SQLite has problems with quotes. Hence you need to be able to switch off quote parsing in SQLite.
Despite the 3.11.2 version of DB Browser being based on SQLite 3.27.2, which does not have the required modifications to ignore the quotes, I can only assume it must be escaping the quotes when you set the "Quote character" to blank.
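If you would rather script the import and be certain that no quote parsing happens at all, something along these lines should work; a Python sketch with a placeholder database name and generic column names standing in for the 19 geonames columns:

import csv
import sqlite3

con = sqlite3.connect("geonames.db")  # placeholder database file
# 19 columns per the geonames readme; generic column names here for brevity
cols = ", ".join(f"c{i} TEXT" for i in range(19))
con.execute(f"CREATE TABLE IF NOT EXISTS geoname ({cols})")

with open("allCountries.txt", encoding="utf-8", newline="") as f:
    # Tab-delimited UTF-8, and crucially no quote handling at all
    rows = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    con.executemany(f"INSERT INTO geoname VALUES ({', '.join('?' * 19)})", rows)

con.commit()
con.close()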