duckdb_read_csv struggling with auto-detecting column data types in R

I have some very large CSV files (~183 million rows by 8 columns) that I want to load into a database using R. I use duckdb for this and its built-in function duckdb_read_csv, which is supposed to auto-detect datatypes for each column. If I enter the following code:
con = dbConnect(duckdb::duckdb(), dbdir="testdata.duckdb", read_only = FALSE)
duckdb_read_csv(con, "d15072021", "mydata.csv",
                header = TRUE)
It produces this error:
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Could not convert string '2' to BOOL between line 12492801 and 12493825 in column 9. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0
I've looked at the rows in question and I can't find any irregularities in column 9. Unfortunately, I cannot post the dataset because it's confidential. But the entire column is filled with either FALSE or TRUE.
If I set the parameter nrow.check to something larger than 12493825, it doesn't produce the same error, but it takes very long and simply converts the column to VARCHAR instead of a logical. Setting nrow.check to -1 (meaning it checks every row for a pattern) crashes R and my PC completely.
The weird thing: this isn't consistent. Earlier I imported the dataset while keeping the default value for nrow.check at 500, and it read the file with no issue (though still converting column 9 to VARCHAR). I have to read a lot of files that follow the same pattern, so I need a reliable way of reading them. Does anyone know how duckdb_read_csv actually works and why I might get this error?
Note that reading the files into memory and then into a database isn't an option because I run out of memory instantly.

The sniffer works by sampling nrow.check rows to figure out each column's data type, so the result can differ between runs if you get unlucky. Increasing nrow.check reduces the chance of failure, mainly because the sniffer then looks at more rows.
If increasing the number of rows is not practical for performance reasons, you can instead define the schema of the CSV file explicitly, but then you must know the schema beforehand.
As an example of how you can define the schema and turn off the sniffer:
SELECT * FROM read_csv('test.csv', COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'), auto_detect='false')
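From R, a minimal sketch of the same idea is to wrap that SQL in dbExecute so the table is created directly inside the database file. The table and file names are taken from the question, the two placeholder columns (a, b) come from the example above, and the exact read_csv option syntax can vary between DuckDB versions:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "testdata.duckdb", read_only = FALSE)

# Create the table with an explicit schema instead of relying on the sniffer.
# Replace the placeholder columns (a, b) with the real layout of your file,
# e.g. declaring the TRUE/FALSE column as 'BOOLEAN'.
dbExecute(con, "
  CREATE TABLE d15072021 AS
  SELECT * FROM read_csv('mydata.csv',
                         COLUMNS = STRUCT_PACK(a := 'INTEGER', b := 'BOOLEAN'),
                         header = TRUE,
                         auto_detect = 'false')
")

dbDisconnect(con, shutdown = TRUE)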

Related

Why does read.csv2 work just fine, yet read.csv2.sql shows an error/warning?

I am trying to read a csv file in R using read.csv2.sql because I would like to use an SQL SELECT query to help me filter my data. But before I can even get to my SELECT query, I discovered that simply reading my csv file using read.csv2.sql already generates a warning message.
This is my code:
investment2 <- read.csv2.sql("investmentdata.csv")
This is the warning message:
Warning message:
In result_fetch(res@ptr, n = n) :
Column 'Capital.Investment': mixed type, first seen values of type real, coercing other values of type string
However, when I use the normal read.csv2 function, there is no error. In particular, the following code works fine with no warning messages:
investment <- read.csv2("investmentdata.csv")
Next, I tried to resolve this issue by casting the Capital.Investment column to be real as follows:
investment3 <- read.csv2.sql("investmentdata.csv", "SELECT *, CAST(Capital.Investment AS real) FROM file")
However, R now generates the following error:
Error: no such column: Capital.Investment
Thus, I have two questions. Firstly, why does using read.csv2.sql generate that warning message when read.csv2 works just fine? Secondly, why does R (or SQL) not recognise my Capital.Investment column when I try to cast it as real?
Perhaps it is also worth noting that I cannot simply ignore this warning from read.csv2.sql, because I discovered that, as a consequence of it, some of the NA rows in my Capital.Investment column have automatically been cast to 0, which I cannot allow: the NA rows must stay as NA. I do not seem to be having this problem with the other columns of my csv file, though.
As I am quite new to R, any help and explanations will be greatly appreciated :)
Edit
A truncated version of my csv file, reproduced as text, looks as follows. In particular, the name of the column in question is indeed Capital.Investment.
id;targetC;year;comp_id;homeC;Industry.Activity;Capital.Investment;Estimated;Jobs.Created;Estimated.1;Project.Type;geographic distance;SIC;listed;sales;assets;cap_structure;rnd;profit;rndintensity;polcon;homeC_gdp;targetC_gdp;homeC_gdppc;targetC_gdppc
1302;AUS;2008;FR338966385;FRA;Design, Development & Testing;33.1;Yes;36;Yes;New;15.26414042;3669;Unlisted;4333088.972;4037211.732;0;NA;-1339221.733;NA;0.489032525;2.92347E+12;1.05456E+12;45413.06571;49628.11513
1311;AUS;2008;US*190521496652;USA;Research & Development;8.4;Yes;30;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1313;AUS;2008;GB05817296;GBR;Business Services;9.7;Yes;10;Yes;New;15.31094496;7389;Unlisted;NA;87.64187374;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
1318;AUS;2008;US129687150L;USA;Business Services;1.3;Yes;225;Yes;New;15.24712914;7373;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1351;AUS;2008;GB*P0060071;GBR;Electricity;516;No;51;Yes;New;15.31094496;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9925;AUS;2008;GB00034121;GBR;Business Services;34.8;Yes;37;Yes;New;15.31094496;4412;Unlisted;NA;2079288.611;0.355157008;NA;94320.15469;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9932;AUS;2008;CA30060NC;CAN;Sales, Marketing & Support;3.2;Yes;11;Yes;New;14.88812529;1094;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.54913E+12;1.05456E+12;46596.33599;49628.11513
9935;AUS;2008;US940890210;USA;Manufacturing;771;Yes;266;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9938;AUS;2008;US770059951;USA;Technical Support Centre;9.1;Yes;104;Yes;Co-Locati;15.24712914;3661;Listed;34922000;53340000;0.120134983;4598000;7333000;0.086201723;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9946;AUS;2008;US010562944;USA;Extraction;535.8;Yes;198;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9955;AUS;2008;DE5030147191;DEU;Logistics, Distribution & Transportation;21.2;Yes;134;Yes;New;14.6718338;4311;Listed;93495971.01;346629334.8;0.036629492;0;2044745.934;0;0.489032525;3.75237E+12;1.05456E+12;45699.19832;49628.11513
9958;AUS;2008;US126012192L;USA;Business Services;9.7;Yes;10;Yes;New;15.24712914;8111;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9969;AUS;2008;US135409005;USA;Extraction;NA;No;538;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9977;AUS;2008;JP000000728JPN;JPN;ICT & Internet Infrastructure;128.6;Yes;77;Yes;New;7.0333688;3571;Listed;53255396.85;38181450.16;0.190244908;2584585.523;480589.4308;0.067692176;0.489032525;5.03791E+12;1.05456E+12;39339.29757;49628.11513
9984;AUS;2008;US841547578;USA;Sales, Marketing & Support;13.6;Yes;23;Yes;New;15.24712914;2095;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9993;AUS;2008;US258715604L;USA;Customer Contact Centre;1.8;No;40;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
This issue was resolved in chat and turned out to be one of two issues:
1. see my original answer below; this was causing an Error. When that is fixed, we see that ...
2. there is a warning informing us that a column (it happens to be the same column) looks numeric but has a non-numeric cell somewhere in the guts of the file.
The first is resolved below; the second is just a warning.
However, because the OP is asking to convert to numeric via SQL, the NA is converted to 0, which is not good. My recommendation is either to cast([Capital.Investment] as char) as [Capital.Investment] and use R's as.numeric to convert to numeric (preserving the NAs), or to just read.csv2(.) the file outright and then use sqldf(.) for SQL querying on the resulting table-like data.
Up front: add brackets or quotes around your column name.
Rationale: Capital.Investment is seen as a dot-delimited table-column or schema-table reference, or something similar that is not what you intend. I believe that, in SQL in general, field names with embedded dots need this escaping. If your data has an embedded space, realize that R does not like spaces in its field names, so by default it uses make.names when reading them in (which replaces spaces with dots).
Setup:
Save the following as "quux.csv". (I've named it csv even though it is ;-delimited, which is what read.csv2.sql expects, so it behaves the same.)
quux;Capital.Investment
1;100
2;200
(Or you can use Capital Investment; it's the same thing.)
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast(Capital.Investment as real) from file')
# Error: no such column: Capital.Investment
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast([Capital.Investment] as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast("Capital.Investment" as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200
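A minimal sketch of the NA-preserving recommendation from further above, assuming the file and column names used in the question: keep the column as text in SQL, then convert it in R so that "NA" cells become real NA values instead of 0.

library(sqldf)

inv <- read.csv2.sql(
  "investmentdata.csv",
  sql = 'select *, cast([Capital.Investment] as char) as CI from file'
)
# as.numeric() turns the "NA" strings into real NA values (with a harmless
# coercion warning) and leaves the numbers numeric.
inv$CI <- as.numeric(inv$CI)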

Is it possible to remove the binary values that prefix the output of teradata fexp?

I am trying to run teradata fexp with a simple sql script.
The select output column is a string expression and as such results in 2 extra length indicator bytes at the start of each row output.
I have searched online for solutions to the problem. I would like to avoid having to post-process if possible.
There is a thread suggesting the possibility of using an OUTMOD. I don't know what that is.
https://forums.teradata.com/forum/tools/fastexport-remove-binaryindicator-values-in-outmod
http://teradataforum.com/teradata/20100726_155313.htm
And yet another thread suggests casting to a fixed-width string type, but this would result in padding, which I'd like to avoid.
https://forums.teradata.com/forum/tools/fexp-data-doubt
The desired output is actually a delimited plain text file. Is there a way to do it?

Maximum Length of Value in R Data Frame, RODBC

I am trying to do a simple query of a DB2 database using the RODBC package in R (myQuery <- sqlQuery(channel, paste0("..."))). One of the columns is a VARCHAR of length 3000. The resulting data frame shows "NA" in that column when there should be text. Exporting it to csv also only shows "NA". A query in Access shows an odd character encoding (but only after clicking on the cell). Is there a maximum length of a value in an R data frame, or a maximum length of a field that can be pulled using RODBC? Or is it the encoding of the field that causes the "NA" to appear?
I did an end to end test on DB2 (LUW 9.7) and R (3.2.2 Windows) and it worked fine for me.
SQL code:
create table test (foo varchar(3000));
--actual insert is 3000 chars
insert into test values ('aaaaaa .... a');
--this select worked fine in my normal SQL client
select * from test
R code:
long = sqlQuery(connection, "select * from test");
#Displays the 3000 character value.
long;
My guess is that the problem is due to something other than simply the size of the field:
1. Character encoding issues. If you are seeing something funny in Access, perhaps the content of the field is not acceptable in the character encoding R is using, so it is being discarded. (I'm not familiar with character encoding in R in particular, but it is in general a thorny issue for software development.)
2. Overall size of the results. Maybe the problem is due to the overall length of a row rather than the length of a single field. Is the query also returning lots of other stuff? Have you tried a simple test of just this field (see the sketch after this list)?
3. Problem in another version. Maybe you are using a different version than I was, and there is indeed a problem with your version. If you think so, update your question with more information.
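As a simple test of just that field (the second guess above), a sketch like the following can show whether the full 3000 characters arrive in R at all. The DSN, table, and column names are placeholders:

library(RODBC)

channel <- odbcConnect("myDB2dsn")   # placeholder DSN
# Select only the wide VARCHAR column; FETCH FIRST is DB2 syntax for limiting rows.
res <- sqlQuery(channel,
                "select mycol from myschema.mytable fetch first 5 rows only",
                as.is = TRUE)
nchar(as.character(res[[1]]))        # should report lengths up to 3000
odbcClose(channel)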

skip and autostart in fread

I am using the following code to read a file with the data.table library:
fread(myfile, header=FALSE, sep=",", skip=100, colClasses=c("character","numeric","NULL","numeric"))
but I get the following error:
The supplied 'sep' was not found on line 80. To read the file as a single character column set sep='\n'.
It says it did not find sep on line 80; however, I set skip=100, so it should not pay attention to the first 100 lines.
UPDATE:
I tried skip=101 and it worked, but it skips the first line, where the data starts.
I am using version 1.9.2 of the data.table package and R version 3.0.2 (64-bit) on Windows 7.
We don't know the version number you're using, but I can make a guess in this case.
Try setting autostart=101.
Note the first paragraph of Details in ?fread:
Once the separator is found on line autostart, the number of columns is determined. Then the file is searched backwards from autostart until a row is found that doesn't have that number of columns. Thus, the first data row is found and any human readable banners are automatically skipped. This feature can be particularly useful for loading a set of files which may not all have consistently sized banners. Setting skip>0 overrides this feature by setting autostart=skip+1 and turning off the search upwards step.
The skip argument has:
If -1 (default) use the procedure described below starting on line autostart to find the first data row. skip>=0 means ignore autostart and take line skip+1 as the first data row (or column names according to header="auto"|TRUE|FALSE as usual). skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
And the autostart argument has:
Any line number within the region of machine readable delimited text, by default 30. If the file is shorter or this line is empty (e.g. short files with trailing blank lines) then the last non empty line (with a non empty line above that) is used. This line and the lines above it are used to auto detect sep, sep2 and the number of fields. It's extremely unlikely that autostart should ever need to be changed, we hope.
In your case perhaps the human readable header is much larger than 30 rows, which is why I guess setting autostart=101 might work. No need to use skip.
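For example, here is a sketch under the assumptions in the question (data.table 1.9.2, a human-readable banner longer than 30 lines); note that later data.table releases reworked skip and retired autostart, so this applies to the 1.9.x behaviour described here:

library(data.table)

# Point autostart anywhere inside the data region; fread then searches upwards
# itself to find the first data row (and header), so skip is not needed.
DT <- fread(myfile, header = FALSE, sep = ",",
            autostart = 101,
            colClasses = c("character", "numeric", "NULL", "numeric"))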
One motivation is for convenience when a file contains multiple tables. By setting autostart to any row inside the table that you want to pluck out of the file, it'll find the first data row and header row for you automatically, and then read just that table. You don't have to worry about getting the exact line number at the start of data like you do with skip. fread can only read one table currently. It could feasibly return a list of tables from a single file, but that's getting a bit complicated and nobody has asked for that.

In the PHPExcel library, when I assign a value to cell IW4, the value is not generated there

Steps: we are using the following code to set a cell value in PHPExcel:
$objPHPExcel->getActiveSheet()->setCellValue('A1', 'cell value here');
When I use it to set a value in cell IW4, the value does not get generated:
$objPHPExcel->getActiveSheet()->setCellValue('IW4', 'cell value here');
Please help me find the solution.
BIFF-format Excel files only allow 256 columns (up to column IV); OfficeOpenXML allows more.
If you set a value in a column beyond the limit, PHPExcel only knows it's invalid at the point where you save (when it knows whether you're saving as an Excel5 or Excel2007 file). Rather than trigger an exception at that point (which would be much more frustrating if it were a long-running script), it silently discards the invalid columns or rows.
This is similar to the behaviour of Excel itself when you open an xlsx file in an earlier version of Excel that doesn't support as many rows and columns.
