Why does read.csv2 work just fine, yet read.csv2.sql shows an error/warning? - r

I am trying to read a csv file in R using read.csv2.sql, since I would like to use a SELECT query from SQL to help me filter my data, but before I can even get to my SELECT query, I discovered that simply reading my csv file using read.csv2.sql already generates a warning message.
This is my code:
investment2 <- read.csv2.sql("investmentdata.csv")
This is the warning message:
Warning message:
In result_fetch(res#ptr, n = n) :
Column 'Capital.Investment': mixed type, first seen values of type real, coercing other values of type string
However, when I use the normal read.csv2 function, there is no error. In particular, the following code works fine with no warning messages:
investment <- read.csv2("investmentdata.csv")
Next, I tried to resolve this issue by casting the Capital.Investment column to be real as follows:
investment3 <- read.csv2.sql("investmentdata.csv", "SELECT *, CAST(Capital.Investment AS real) FROM file")
However, R now generates the following error:
Error: no such column: Capital.Investment
Thus, I have two questions. Firstly, why does using read.csv2.sql generate that warning message when read.csv2 works just fine? Secondly, why does R (or SQL) not recognise my Capital.Investment column when I try to cast it as real?
Perhaps it is also worth noting that I cannot simply ignore this warning that the read.csv2.sql function is showing, because I discovered that as a consequence of this warning, it has automatically casted some of the NA rows in my Capital.Investment column to 0, which I cannot allow - the NA rows must stay as NA. I do not seem to be having this problem with the other columns of my csv file though.
As I am quite new to R, any help and explanations will be greatly appreciated :)
Edit
The coded version of what my truncated csv file looks like is as follows. In particular, the name of the column-in-question is indeed Capital.Investment.
id;targetC;year;comp_id;homeC;Industry.Activity;Capital.Investment;Estimated;Jobs.Created;Estimated.1;Project.Type;geographic distance;SIC;listed;sales;assets;cap_structure;rnd;profit;rndintensity;polcon;homeC_gdp;targetC_gdp;homeC_gdppc;targetC_gdppc
1302;AUS;2008;FR338966385;FRA;Design, Development & Testing;33.1;Yes;36;Yes;New;15.26414042;3669;Unlisted;4333088.972;4037211.732;0;NA;-1339221.733;NA;0.489032525;2.92347E+12;1.05456E+12;45413.06571;49628.11513
1311;AUS;2008;US*190521496652;USA;Research & Development;8.4;Yes;30;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1313;AUS;2008;GB05817296;GBR;Business Services;9.7;Yes;10;Yes;New;15.31094496;7389;Unlisted;NA;87.64187374;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
1318;AUS;2008;US129687150L;USA;Business Services;1.3;Yes;225;Yes;New;15.24712914;7373;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1351;AUS;2008;GB*P0060071;GBR;Electricity;516;No;51;Yes;New;15.31094496;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9925;AUS;2008;GB00034121;GBR;Business Services;34.8;Yes;37;Yes;New;15.31094496;4412;Unlisted;NA;2079288.611;0.355157008;NA;94320.15469;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9932;AUS;2008;CA30060NC;CAN;Sales, Marketing & Support;3.2;Yes;11;Yes;New;14.88812529;1094;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.54913E+12;1.05456E+12;46596.33599;49628.11513
9935;AUS;2008;US940890210;USA;Manufacturing;771;Yes;266;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9938;AUS;2008;US770059951;USA;Technical Support Centre;9.1;Yes;104;Yes;Co-Locati;15.24712914;3661;Listed;34922000;53340000;0.120134983;4598000;7333000;0.086201723;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9946;AUS;2008;US010562944;USA;Extraction;535.8;Yes;198;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9955;AUS;2008;DE5030147191;DEU;Logistics, Distribution & Transportation;21.2;Yes;134;Yes;New;14.6718338;4311;Listed;93495971.01;346629334.8;0.036629492;0;2044745.934;0;0.489032525;3.75237E+12;1.05456E+12;45699.19832;49628.11513
9958;AUS;2008;US126012192L;USA;Business Services;9.7;Yes;10;Yes;New;15.24712914;8111;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9969;AUS;2008;US135409005;USA;Extraction;NA;No;538;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9977;AUS;2008;JP000000728JPN;JPN;ICT & Internet Infrastructure;128.6;Yes;77;Yes;New;7.0333688;3571;Listed;53255396.85;38181450.16;0.190244908;2584585.523;480589.4308;0.067692176;0.489032525;5.03791E+12;1.05456E+12;39339.29757;49628.11513
9984;AUS;2008;US841547578;USA;Sales, Marketing & Support;13.6;Yes;23;Yes;New;15.24712914;2095;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9993;AUS;2008;US258715604L;USA;Customer Contact Centre;1.8;No;40;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513

This issue was resolved in chat, to be one of two issues:
see my original answer below, this was causing an Error; when that is fixed, we see that ...
there is a warning, informing about the fact that a column (happens to be the same column) looks numeric but has a non-numeric cell somewhere within the guts of the file.
The first is resolved below, the second is just a warning.
However, because the OP is asking to convert to numeric via SQL, the NA is converted to 0, which is not good. My recommendation is to either cast([Capital.Investment] as char) as [Capital.Investment] and use R's as.numeric to convert to numeric (preserving the NA-nature), or to just read.csv2(.) the file outright and use sqldf(.) to use its SQL querying on table-like data.
Up front: add brackets or quotes around your column name.
Rationale: Capital.Investment is seen as a dot-delimited table-column or schema-table or something similarly not what you intend. I believe in general in SQL that field names with embedded dots need this escaping. If your data has an embedded space, realize that R does not like spaces in its field names, so it is by-default using make.names when reading it in (which replaces spaces with dots).
Setup:
Save the following as "quux.csv". (I've named it csv, but since I'm changing it to be ;-delimited, it behaves the same.)
quux;Capital.Investment
1;100
2;200
(Or you can use Capital Investment, it's the same thing.)
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast(Capital.Investment as real) from file')
# Error: no such column: Capital.Investment
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast([Capital.Investment] as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast("Capital.Investment" as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200

Related

Duckdb_read_csv struggling with with auto detecting column data types in R

I have some very large CSV files (~183mio. rows by 8 columns) that I want to load into a database using R. I use duckdb for this and it its built-in function duckdb_read_csv, which is supposed to auto-detect datatypes for each column. If I enter the following code:
con = dbConnect(duckdb::duckdb(), dbdir="testdata.duckdb", read_only = FALSE)
duckdb_read_csv(con, "d15072021","mydata.csv",
header = TRUE)
It produces this error:
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Could not convert string '2' to BOOL between line 12492801 and 12493825 in column 9. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0
I've looked at the rows in question and I can't find any irregularities in column 9. Unfortunately, I cannot post the dataset because it's confidential. But the entire column is filled with either FALSE or TRUE.
If I set the parameter nrow.check to something larger than 12493825 it doesn't produce the same error but takes very long and simply converts the column to VARCHAR instead of a logical. Setting nrow.check to -1 (meaning it checks every row for a pattern) crashes R and my PC completely.
The weird thing: This isn't consistent. Earlier I imported the dataset whilst keeping the default value for nrow.check at 500 and it read the file with no issue (though still converting column 9 to VARCHAR). I have to read a lot of files that are the same pattern so I need to have a reliable way of reading them. Anyone know how duckdb_read_csv actually works and why I might get this error?
Note that reading the files into memory and then into a database isn't an option because I run out of memory instantly.
the way the sniffer works is by sampling nrow.check rows to figure out the data type, so the result can differ from runs if you get unlucky, increasing it will reduce the chances of failing it, mainly because the sniffer looks at more rows.
If increasing the number of rows is not possible due to performance issues, you can of course first define the schema of the CSV file. But then you must know the schema beforehand.
As an example of how you can define the schema and turn off the sniffer:
select * from
SELECT * FROM read_csv('test.csv', COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'), auto_detect='false')

R read csv with comma in column

Update 2020-5-14
Working with a different but similar dataset from here, I found read_csv seems to work fine. I haven't tried it with the original data yet though.
Although the replies didn't help solve the problem because my question was not correct, Shan's reply fits the original question I posted the most, so I accepted his answer.
Update 2020-5-12
I think my original question is not correct. Like mentioned in the comment, the data was quoted. Although changing the separator made the 11582 row in R look the same as the 11583 row in excel, it doesn't mean it's "right". Maybe there is some incorrect line switch due to inappropriate encoding or something, and thus causing some of the columns to be displaced. If I open the data with notepad++, the instance at row 11583 in excel is at the 11596 row.
Original question
I am trying to read the listings.csv from this dataset in kaggle into R. I downloaded the file and wrote the coderead.csv('listing.csv'). The first column, the column id, is supposed to be numeric. However, it shows:
listing$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with MiCrosoft excel, I can see one of the value in the second column is Ole,Ole...:
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a seperator other than comma (,). First go in to Control Panel –> Region and Language -> Additional settings, you can change the "List Seperator". Most common one other than comma is pipe symbol (|). In R, when you read_csv, specify the seperator as '|'.
You could try this?
lsitings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",","", listings$name) - This will remove the comma in Col name
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
If you do need the information in that column, then other methods are needed (see other solutions).

Character values stored in DATAFRAME with Double Quotes while reading into R

I have a csv file with almost 4 millions records and 30 + columns.
The Columns are of varied type that includes Numeric, Alphanumeric, Date Column, character etc.
Attempt 1:
When I first read the file in R using read.csv Function then only 2 millions of the records were read.
This may have happened because of some special characters in the DATA.
Attempt 2:
I provided the argument quote = "" in read.csv Function and all the records were read succesfully.
However this brings up 2 issues:
a. all teh Columns were appended with 'x.' modifier:
egs.: x.date , x.name
b. all the Character Columns were loaded in dataframe, enclosed with double quotes ""
Can someone, please advise me that how to resolve these 2 issues and get the data loaded in R succesfully?
I work for a financial insititution and the data is highly sensitive, hence cannot paste the screenshot over here.
I also tried to create the scenario at my home but all my efforts were of little or of no avail.
The below screenshot is closest I have came to the exact scenario:
DATAFRAME SCREENSHOT: Not exact copy

Maximum Length of Value in R Data Frame, RODBC

I am trying to do a simple query of a DB2 database using the RODBC package in R (myQuery<-sqlQuery(channel,paste0("..."))) One of the columns is a Varchar of length 3000. The resulting data frame shows a "NA" in that column when there should be text. Exporting it to csv also only shows "NA". A query in Access shows an odd character encoding (only after clicking on the cell). Is there a maximum length of a value in a R data frame or a maximum length of a field that can be pulled using RODBC? Or is it the encoding of the field that causes the "NA" to appear?
I did an end to end test on DB2 (LUW 9.7) and R (3.2.2 Windows) and it worked fine for me.
SQL code:
create table test (foo varchar(3000));
--actual insert is 3000 chars
insert into test values ('aaaaaa .... a');
--this select worked fine in my normal SQL client
select * from test
R code:
long = sqlQuery(connection, "select * from test");
#Displays the 3000 character value.
long;
My guess is the problem is for some other reason than simply the size of the field:
Character encoding issues. If you are seeing something funny in Access, perhaps the content of the field is something not acceptable in the character encoding R is using, so it is being discarded. (I'm not familiar with character encoding in R in particular, but it is in general a thorny issue for software development).
Overall size of the results. Maybe the problem is due to the overall length of a row rather than the length of a single field. Is the query also returning lots of other stuff? Have you tried a simple test of just this field?
Problem in another version. Maybe you are using a different version than I was, and there is indeed a problem with your version. If you think so, update your question with more information.

Bracket-escaped table names with dplyr

I'm programmatically fetching a bunch of datasets, many of them having silly names that begin with numbers and have special characters like minus signs in them. Because none of the datasets are particularly large, and I wanted the benefit R making its best guess about data types, I'm (ab)using dplyr to dump these tables into SQLite.
I am using square brackets to escape the horrible table names, but this doesn't seem to work. For example:
data(iris)
foo.db <- src_sqlite("foo.sqlite3", create = TRUE)
copy_to(foo.db, df=iris, name="[14m3-n4m3]")
This results in the error message:
Error in sqliteSendQuery(conn, statement, bind.data) : error in statement: no such table: 14m3-n4m3
This works if I choose a sensible name. However, due to a variety of reasons, I'd really like to keep the cumbersome names. I am also able to create such a badly-named table directly from sqlite:
sqlite> create table [14m3-n4m3](foo,bar,baz);
sqlite> .tables
14m3-n4m3
Without cracking into things too deeply, this looks like dplyr is handling the square brackets in some way that I cannot figure out. My suspicion is that this is a bug, but I wanted to check here first to make sure I wasn't missing something.
EDIT: I forgot to mention the case where I just pass the janky name directly to dplyr. This errors out as follows:
library(dplyr)
data(iris)
foo.db <- src_sqlite("foo.sqlite3", create = TRUE)
copy_to(foo.db, df=iris, name="14M3-N4M3")
Error in sqliteSendQuery(conn, statement, bind.data) :
error in statement: unrecognized token: "14M3"
This is a bug in dplyr. It's still there in the current github master. As #hadley indicates, he has tried to escape things like table names in dplyr to prevent this issue. The current problem you're having arises from lack of escaping in two functions. Table creation works fine when providing the table name unescaped (and is done with dplyr::db_create_table). However, the insertion of data to the table is done using DBI::dbWriteTable which doesn't support odd table names. If the table name is provided to this function escaped, it fails to find it in the list of tables (the first error you report). If it is provided escaped, then the SQL to do the insertion is not synatactically valid.
The second issue comes when the table is updated. The code to get the field names, this time actually in dplyr, again fails to escape the table name because it uses paste0 rather than build_sql.
I've fixed both errors at a fork of dplyr. I've also put in a pull request to #hadley and made a note on the issue https://github.com/hadley/dplyr/issues/926. In the meantime, if you wanted to you could use devtools::install_github("NikNakk/dplyr", ref = "sqlite-escape") and then revert to the master version once it's been fixed.
Incidentally, the correct SQL-99 way to escape table names (and other identifiers) in SQL is with double quotes (see SQL standard to escape column names?). MS Access uses square brackets, while MySQL defaults to backticks. dplyr uses double quotes, per the standard.
Finally, the proposal from #RichardScriven wouldn't work universally. For example, select is a perfectly valid name in R, but is not a syntactically valid table name in SQL. The same would be true for other reserved words.

Resources