I have some .ttl files with doubles and floats with . (point) as the decimal separator.
Is possible to change the decimal separator to a , (comma) when loading to OpenLink Virtuoso v07.20.3213?
Turtle relies on XML Schema Datatypes, in which the only valid decimal separator is the dot.
Subsequent (re)presentation of these values may vary based on locale (which may change the decimal separator to comma and/or add a thousands separator), but that seems like a different question...
(Note that v07.20.3213 is rather elderly, as of this writing; updating to current v7.20.3217 or later is recommended for all users, whether Open Source or Commercial Edition.)
(ObDisclaimer: I work for OpenLink Software, producer of Virtuoso.)
If a CSV file structure differs from the default CSV file settings, the loader will look for a configuration file of the same name as the CSV file with a .cfg filename extension. This file should contain parameters similar to those below, indicating the CSV file's structure:
[csv]
csv-delimiter=<delimiter char>
csv-quote=<quote char>
header=<zero based header offset>
offset=<zero based data offset>
Invisible "tab" and "space" delimiters should be specified by those names, without the quotation marks.
Other delimiter characters (comma, period, etc.) should simply be typed in.
"Smart" quotation marks which differ at start and end (including but not limited to « », ‹ ›, “ ”, and ‘ ’) are not currently supported.
Example
Consider loading a gzipped CSV file, csv-example.csv.gz, with the non-default CSV structure below:
'Southern North Island wood availability forecast for the period
2008-2040' 'Table 14: Wood availability and average clearfell age
for other species in Eastern Southern North Island' 'Year
ending' 'Recoverable volume' 'Average age' 'December' '(000 m3
i.b.)' '(years)' 2006 0 0 2007 0 0 2008 48 49 2009 45 46
...
In this example
the header is on the third line, #2 with a zero-base
the data starts from the fifth line, #4 with a zero-base
the delimiter is tab
the quote char is the single-quote, or apostrophe
Loading this file requires the creation of a configuration file, csv-example.cfg, containing the entries:
[csv]
csv-delimiter=tab
csv-quote='
header=2
offset=4
More Info..
Related
Take this CSV file:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300
It loads just fine in most statistical programs (R, SAS, etc.) but in Excel the third row is misinterpreted because it has two quotation marks. Escaping the last quote as \" will also not work in Excel. The only way I have found so far is to replace the one double quote with two double quotes:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
But that would render the file completely useless for all other programs (R, SAS, etc.)
Is there a way to format the CSV file where strings can begin or end with the same characters as that used to surround them, such that it would work in Excel as well as commonly used statistical software?
Your second representation is the normal way to generate a CSV file and so should be easy to work with in any software. See the RFC 4180 specifications. https://www.ietf.org/rfc/rfc4180.txt
So your second example represents this data:
Obs id name value
1 1 Blah 100
2 2 Has space 200
3 3 Ends with quotes" 300
4 4 "Surrounded with quotes" 300
If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words NOT as a standard CSV file) than it would look like:
id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300
But if you want to allow the values to contain the delimiter then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that you also need to also add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid making an ambiguous file. For example the quotes in the 4th observation in your first file look like they are optional quotes around a value instead of part of the value.
Many programs try to handle ambiguous situations. For example SAS does not allow values to contain embedded line breaks so you will always get four observations with your first example file.
But EXCEL allows the embedding of the end of line character(s) inside of quoted values. So in your original file the value of the second field in the third observations looks like what you would start to get if you added quotes around this value:
Ends with quotes",300
4,"Surrounded with quotes",300
So instead of 4 complete observations of three fields values in each there are only three observations and the last observation has only two field values.
This is caused by the fact that escape character for " in Excel is "": Escaping quotes and delimiters in CSV files with Excel
A quick and simple workaround that comes to mind in R is to first read the content of the csv with readLines, then replace the double (escaped) double quotes with just one double quotes, and then read.table:
read.table(
text = gsub(pattern = "\"\"", "\"", readLines("data.csv")),
sep = ",",
header = TRUE
)
I've got text like this in one of my sqlite table columns:
Mantas are found in temperate, subtropical and tropical waters. Both species are pelagic; M. birostris migrates across open oceans, singly or in groups, while M. alfredi tends to be resident and coastal. They are filter feeders and eat large quantities of zooplankton, which they swallow with their open mouths as they swim. Gestation lasts over a year, producing live pups.
Mantas may visit cleaning stations for the removal of parasites. Like whales, they breach, for unknown reasons.
The last two lines are broken from the previous by either a \r or \n. I want to be able to see the actual value of \r or \n in the shell output of of the column. Any ideas?
There doesn't seem to be any way to directly do this in the SQLite shell, but you can use .output <file> to output the result to a file, then use a text or hex editor to see what the line endings are.
I'm reading a csv file in R that includes a conversion ID column. The issue I'm running into is that my conversionID is being rounded as an exponential number. Below is snapshot of the CSV file (opened in Excel) that I'm reading into R. As you can see, the conversion ID is an exponential format, but the value is: 383305820480.
When I read the data into R, using the following lines, I got the following output. Which looks like it's rounding the string of conversion IDs.
x<-read.csv("./Test2.csv")
options("scipen"=100, "digits"=15)
x
When I export the file as CSV, using the code
write.csv(x,"./Test3.csv")
I get the following output. As you can see, I no longer have a unique identifier as it rounds the number.
I also tried reading the file as a factor, using the code, but I get the same output with numbers rounded. I need the Conversion.ID to be a unique identifier.
x<-read.csv("./Test2.csv", colClasses="character")
The only way I can get the Conversion ID column to stay as a unique identifier is to open the CSV file and write a ' in front of each conversion ID. That is not scalable because I have hundreds of files.
I can't replicate your experience.
(Update: OP reports that the problem is actually with Excel converting/rounding the data on import [!!!])
I created a file on disk with full precision (I don't know the least-significant digits of your data, you didn't show them except for the first element, but I put a non-zero value in the units place for illustration):
writeLines(c(
"Conversion ID",
" 383305820480",
" 39634500000002",
" 213905000000002",
"1016890000000002",
"1220910000000002"),
con="Test2.csv")
Read the file and print it with full precision (use check.names=FALSE for perfect "round trip" capability -- not something you want to do on a regular basis):
x <- read.csv("Test2.csv",check.names=FALSE)
options(scipen=100)
print(x,digits=20)
## Conversion ID
## 1 383305820480
## 2 39634500000002
## 3 213905000000002
## 4 1016890000000002
## 5 1220910000000002
Looks OK.
Now write output (use row.names=FALSE to avoid adding row names/allow a clean round-trip):
write.csv(x,"Test3.csv",row.names=FALSE,quote=FALSE)
The least-mediated way to examine a file on disk from within R is file.show():
file.show("Test3.csv")
## Conversion ID
## 383305820480
## 39634500000002
## 213905000000002
## 1016890000000002
## 1220910000000002
x3 <- read.csv("Test3.csv",check.names=FALSE)
all.equal(x,x3) ## TRUE
Use system tools to check that the files are the same (except for white space differences -- the original file was right-justified):
system("diff -w Test2.csv Test3.csv") ## no difference
If you have even longer ID strings you will need to read them as character to avoid loss of precision:
read.csv("Test2.csv",colClasses="character")
## Conversion.ID
## 1 383305820480
## 2 39634500000002
## 3 213905000000002
## 4 1016890000000002
## 5 1220910000000002
You could probably round-trip through Excel more safely (if you still think that's a good idea) by importing as character and exporting with quotation marks to protect the values.
I just figured out the issue. It looks like my version of Excel is converting the data, causing it to lose the digits. If I avoid opening the file in Excel after downloading it, it retains all the digits. I'm not sure if this is a known issue with newer version. I'm using Excel Office Professional Plus 2013.
I have a .csv that causes different problems with read.table() and fread().
There is an unknown character that causes read.table() to stop (reminiscent of read.csv stops reading at row 523924 even thouhg the file has 799992 rows). Excel, Notepad, and SAS System Viewer render it like a rightwards arrow (although if I use Excel's insert symbol to insert u2192 it appears different); emacs renders it ^Z.
fread() gets past the unknown character (bringing it in as \032) but there is another issue that prevents this from being the solution to my problem: the data set uses quotation marks as an abbreviation for inches, thus embedded (even mismatched) quotes.
Does anyone have any suggestions short of modifying the original .csv file, e.g., by globally replacing the strange arrow?
Thanks in advance!
In case of Paul's file, I was able to read the file (after some experimentation) using fread() with the cmd "unzip -cq" and quote = "" parameters without error or warnings. I suppose that this might work as well with Kristian's file.
On Windows, it might be necessary to install the Rtools beforehand.
library(data.table) # development version 1.14.1 used
download.file("https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip",
"extrfoia-master-dt2021-07-02.zip")
txt1 <- fread(cmd = "unzip -cq extrfoia-master-dt2021-07-02.zip", quote = "")
Caveat: This will download a file of 38 MBytes
According to the unzip man page, the -c option automatically performs ASCII-EBCDIC conversion.
The quote = "" was required because in at least one case a data field contained double quotes within the text.
I have also tried the -p option of unzip which extracts the data without conversion. Then, we can see that there is \032 embedded in the string.
txt2 <- fread(cmd = "unzip -p extrfoia-master-dt2021-07-02.zip", quote = "")
txt2[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOP\032OULOS
The \032 does not appear in the converted version
txt1[47096, 1:2]
CUST-ID LEGAL-NAME
1: 1253096 JOHN A. GIANNAKOPOULOS
We can search for all occurrences of \032 in all character fields by
melt(txt2, id.vars = "CUST-ID", measure.vars = txt[, names(.SD), .SDcols = is.character])[
value %flike% "\032"][order(`CUST-ID`)]
CUST-ID variable value
1: 1253096 LEGAL-NAME JOHN A. GIANNAKOP\032OULOS
2: 2050751 DBA-NAME colbert ball tax tele\032hone rd
3: 2082166 LEGAL-NAME JUAN DE J. MORALES C\032TALA
4: 2273606 LEGAL-NAME INTRINSIC DM\032 INC.
5: 2300016 MAIL-ADDR1 PO BOX \03209
6: 2346154 LEGAL-NAME JOEL I GONZ\032LEZ-APONTE CPA
7: 2384445 LEGAL-NAME NUMBERS CAF\032 PLLC
8: 2518214 MAIL-ADDR1 556 W 800 N N\03211
9: 2518214 BUSN-ADDR1 556 W 800 N N\03211
10: 13718109 DBA-NAME World Harvest Financial Grou\032
11: 13775763 LEGAL-NAME Fiscally Responsible Consulting LLC\032
12: 13775763 DBA-NAME Fiscally Responsible Consulting LLC\032
This may help to identify the records of the file to fix manually.
I hit this problem today, so its still there in R 4.0.5
The data I'm using is public, from the Internal Revenue service. Somehow the unrecognized characters become "^Z" in the database. So far as I can tell, "^Z" gets inadvertently created when people enter characters that are not recognized by original program that receives. The IRS distributes a CSV file from the database.
In the example file I'm dealing with, there are 13 rows (out of 360,000) that have the ^Z in various spots. Manually deleting them one-at-a-time lets R read.table get a little further. I found no encoding setting in R that made a difference on this problem.
I found 2 solutions.
Get rid of the "^Z" symbol with text tools before using read.csv.
Switch to Python. The pandas package function read_csv, with encoding as "utf-8" correctly obtains all rows. However, in the pandas.DataFrame that results, the unrecognized character is in the data, it looks like an empty square.
If you want an example to explore, here's the address: https://www.irs.gov/pub/irs-utl/extrfoia-master-dt2021-07-02.zip. The first "^Z" you find is line 47096.
I am trying to convert data from Act 2000 to a MySQL database. I have successfully imported the DBF files into individual MySQL tables. However I am having issues with the *.BLB file, which seems to be a non-standard memo file.
The DBF files, identifies themselves as dbase III Plus, No memo format. There is a single *.BLB which is a memo file for multiple DBFs to share BLOB data.
If you read this document: http://cicorp.com/act/sdk/ACT6-SDK-ChapterA.htm#_Toc483994053)
You can see that the REGARDING column is a 6 character one. The description is: This 6-byte field is supplied by the system and contains a reference to a field in the Binary Large Object (BLOB) Database.
Now upon opening the *.BLB I can see that the block size is 64 bytes. All the blocks of text are NULL padded out to that size.
Where I am stumbling is trying to convert the values stored in the REGARDING column to blocks location in the BLB file. My assumption is that 6 character field is an offset.
For example, one value for REGARDING is, (ignoring the square brackets): [ ",J$]
In my Googling, I found this: http://ulisse.elettra.trieste.it/services/doc/dbase/DBFstruct.htm#C1.5
It explains that in memo fields (in normal DBF files at least) the space value is ignore (i.e. it's padding out the column).
Therefore if I'm correct (again, square brackets) [",J$] should be the offset in my BLB file. Luckily I've still got access to the original ACT2000 software, so I can compare the full text in the program / MySQL and BLB file.
Using my example value, I know that the DB row with REGARDING value of [ ",J$] corresponds to a 1024 byte offset (or 16 blocks, assuming my guess of a 64 byte sized block).
I've tried reading some Python code for open source projects that read DBF files - but I'm in over my head.
I think what I need to do is unpack the characters to binary, but am not sure.
How can I find the 64-block based spot to read from based on what's found in the DBF files?
EDIT by Jerry Dodge
I've attempted to reverse-engineer the strings in this field to hexadecimal values, and then to an integer value using StrToInt64, but the result still does not match up with the blob file. I've also tried multiplying this integer value by 64 and not multiplying, but the result keeps winding up outside of the size of the blob file, not actually finding any data.
For example, a value of ___/BD (_ = space) translates to $2f4244 hexidecimal, which in turn translates to the integer value of 3097156, but does not correspond with any relevant portion of data in the blob file, even when multiplied or divided by 64.
According to the SDK you linked, the following happens as I understand:
There is a TYPE field (right behing REGARDING) that encodes what REGARDING is used for (see the second table of the linked chapter). So I'd assume that if type=6 (meeting not held) the REGARDING is either irrelevant or only contains a meeting ID reference from some other table. On that line of thought I would only expect REGARDING to be a BLB offset if type=101 (or possibly 100). I'd also not abandon the thought that in these relevant cases TYPE might be a concatenation of BLB file index and offset (because there is a mention that each file must not be longer than 30K chars and I really expect to be able to store much more data even in one table).